Re: 2.6.20.3 AMD64 oops in CFQ code
On Thursday March 22, dan.j.williams [at] intel.com wrote:
>
> Not a cfq failure, but I have been able to reproduce a different oops
> at array stop time while i/o's were pending. I have not dug into it
> enough to suggest a patch, but I wonder if it is somehow related to
> the cfq failure since it involves congestion and drives going away:
Thanks. I know about that one and have a patch about to be posted
which should fix it. But I don't completely understand it.
When a raid5 array shuts down, it clears mddev->private, but doesn't
clean q->backing_dev_info.congested_fn. So if someone tries to call
that congested_fn, it will try to dereference mddev->private and Oops.
Only by the time that raid5 is shutting down, no-one should have a
reference to the device any more, and no-one should be in a position
to call congested_fn !!
Maybe pdflush is just trying to sync the block device, even though
there is no dirty date .... dunno....
But I don't think it is related to the cfq problem as this one is only
a problem when the array is being stopped.
Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
As an additional data point, here's a libata problem I'm having trying to
rebuild the array.
I have six identical 400 GB drives (ST3400832AS), and one is giving
me hassles. I've run SMART short and long diagnostics, badblocks, and
Seagate's "seatools" diagnostic software, and none of these find problems.
It is the only one of the six with a non-zero reallocated sector count
(it's 26).
Anyway, the drive is partitioned into a 45G RAID-10 part and a 350G RAID-5
part. The RAID-10 part integrated successfully, but the RAID-5 got to
about 60% and then puked:
ata5.00: exception Emask 0x0 SAct 0x1ef SErr 0x0 action 0x2 frozen
ata5.00: cmd 61/c0:00:d2:d0:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 98304 out
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/40:08:92:d1:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 32768 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:10:d2:d1:b9/01:00:1c:00:00/40 tag 2 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:18:d2:d2:b9/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:28:d2:d3:b9/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:30:d2:d4:b9/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:38:d2:d5:b9/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: cmd 61/00:40:d2:d6:b9/01:00:1c:00:00/40 tag 8 cdb 0x0 data 131072 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata5.00: configured for UDMA/100
ata5: EH complete
SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
sde: Write Protect is off
SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5: soft resetting port
ata5: softreset failed (timeout)
ata5: softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: follow-up softreset failed, retrying in 5 secs
ata5: hard resetting port
ata5: softreset failed (timeout)
ata5: reset failed, giving up
ata5.00: disabled
ata5: EH complete
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 91795259
md: super_written gets error=-5, uptodate=0
raid10: Disk failure on sde3, disabling device.
Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481942994
raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
sd 4:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sde, sector 481944018
md: md5: recovery done.
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 3, wo:1, o:0, dev:sde3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID10 conf printout:
--- wd:5 rd:6
disk 0, wo:0, o:1, dev:sdb3
disk 1, wo:0, o:1, dev:sdc3
disk 2, wo:0, o:1, dev:sdd3
disk 4, wo:0, o:1, dev:sdf3
disk 5, wo:0, o:1, dev:sda3
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 4, o:0, dev:sde4
disk 5, o:1, dev:sdf4
RAID5 conf printout:
--- rd:6 wd:5
disk 0, o:1, dev:sda4
disk 1, o:1, dev:sdb4
disk 2, o:1, dev:sdc4
disk 3, o:1, dev:sdd4
disk 5, o:1, dev:sdf4
The first error address is just barely inside the RAID-10 part (which ends at
sector 91,795,410), while the second and third errors (at 481,942,994)
look like where the reconstruction was working.
Anyway, what's annoying is that I can't figure out how to bring the
drive back on line without resetting the box. It's in a hot-swap enclosure,
but power cycling the drive doesn't seem to help. I thought libata hotplug
was working? (SiI3132 card, using the sil24 driver.)
(H'm... after rebooting, reallocated sectors jumped from 26 to 39.
Something is up with that drive.)
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
[resending. my mail service was down for more than a week and this
message didn't get delivered.]
linux [at] horizon.com wrote:
> > Anyway, what's annoying is that I can't figure out how to bring the
> > drive back on line without resetting the box. It's in a hot-swap
enclosure,
> > but power cycling the drive doesn't seem to help. I thought libata
hotplug
> > was working? (SiI3132 card, using the sil24 driver.)
Yeah, it's working but failing resets are considered highly dangerous
(in that the controller status is unknown and may cause something
dangerous like screaming interrupts) and port is muted after that. The
plan is to handle this with polling hotplug such that libata tries to
revive the port if PHY status change is detected by polling. Patches
are available but they need other things to resolved to get integrated.
I think it'll happen before the summer.
Anyways, you can tell libata to retry the port by manually telling it to
rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
> > (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
> > Something is up with that drive.)
Yeap, seems like a broken drive to me.
Thanks.
--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
linux [at] horizon.com wrote:
>> Anyway, what's annoying is that I can't figure out how to bring the
>> drive back on line without resetting the box. It's in a hot-swap enclosure,
>> but power cycling the drive doesn't seem to help. I thought libata hotplug
>> was working? (SiI3132 card, using the sil24 driver.)
> Yeah, it's working but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and port is muted after that. The
> plan is to handle this with polling hotplug such that libata tries to
> revive the port if PHY status change is detected by polling. Patches
> are available but they need other things to resolved to get integrated.
> I think it'll happen before the summer.
> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
Ah, thank you! I have to admit, that is at least as mysterious as any
Microsoft registry tweak.
>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>> Something is up with that drive.)
> Yeap, seems like a broken drive to me.
Actually, after a few rounds, the reallocated sectors stabilized at 56
and all is working well again. It's like there was a major problem with
error handling.
The problem is that I don't know where the blame lies.
In case it helps, here is the syslog associated with the first failure:
07:17:56: ata5.00: exception Emask 0x0 SAct 0x7fffff SErr 0x0 action 0x2 frozen
07:20:15: ata5.00: cmd 61/30:00:a2:d8:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 24576 out
07:20:15: res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:08:9a:7c:b9/00:00:1c:00:00/40 tag 1 cdb 0x0 data 4096 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/08:10:ca:7c:b9/00:00:1c:00:00/40 tag 2 cdb 0x0 data 4096 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/20:18:b2:d6:b9/00:00:1c:00:00/40 tag 3 cdb 0x0 data 16384 in
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 61/08:20:fa:a2:b6/00:00:1c:00:00/40 tag 4 cdb 0x0 data 4096 out
07:20:15: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:15: ata5.00: cmd 60/40:28:1a:d7:b9/00:00:1c:00:00/40 tag 5 cdb 0x0 data 32768 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/00:30:d2:11:c0/01:00:24:00:00/40 tag 6 cdb 0x0 data 131072 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/28:38:2a:0b:b9/00:00:1c:00:00/40 tag 7 cdb 0x0 data 20480 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:40:82:7c:b9/00:00:1c:00:00/40 tag 8 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/58:48:02:a3:b6/00:00:1c:00:00/40 tag 9 cdb 0x0 data 45056 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:50:52:49:ba/00:00:1c:00:00/40 tag 10 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:58:6a:49:ba/00:00:1c:00:00/40 tag 11 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:60:7a:49:ba/00:00:1c:00:00/40 tag 12 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:68:8a:49:ba/00:00:1c:00:00/40 tag 13 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:70:a2:49:ba/00:00:1c:00:00/40 tag 14 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:78:ba:49:ba/00:00:1c:00:00/40 tag 15 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/18:80:0a:b0:b8/00:00:1c:00:00/40 tag 16 cdb 0x0 data 12288 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/10:88:2a:b0:b8/00:00:1c:00:00/40 tag 17 cdb 0x0 data 8192 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/08:90:52:0a:b9/00:00:1c:00:00/40 tag 18 cdb 0x0 data 4096 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:98:d2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 32768 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/40:a0:92:d7:b9/00:00:1c:00:00/40 tag 20 cdb 0x0 data 32768 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/50:a8:52:d8:b9/00:00:1c:00:00/40 tag 21 cdb 0x0 data 40960 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 61/78:b0:da:d7:b9/00:00:1c:00:00/40 tag 22 cdb 0x0 data 61440 out
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
07:20:16: ata5.00: configured for UDMA/100
07:20:16: ata5: EH complete
07:20:16: SCSI device sde: 781422768 512-byte hdwr sectors (400088 MB)
07:20:16: sde: Write Protect is off
07:20:16: sde: Mode Sense: 00 3a 00 00
07:20:16: SCSI device sde: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
07:20:16: ata5.00: exception Emask 0x0 SAct 0xb0000 SErr 0x0 action 0x2 frozen
07:20:16: ata5.00: cmd 60/00:80:d2:11:c0/01:00:24:00:00/40 tag 16 cdb 0x0 data 131072 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/40:88:1a:d7:b9/00:00:1c:00:00/40 tag 17 cdb 0x0 data 32768 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5.00: cmd 60/20:98:b2:d6:b9/00:00:1c:00:00/40 tag 19 cdb 0x0 data 16384 in
07:20:16: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
07:20:16: ata5: soft resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: follow-up softreset failed, retrying in 5 secs
07:20:16: ata5: hard resetting port
07:20:16: ata5: softreset failed (timeout)
07:20:16: ata5: reset failed, giving up
07:20:16: ata5.00: disabled
07:20:16: ata5: EH complete
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941170
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 481941274
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 616567250
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 779457538
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
07:20:16: sd 4:0:0:0: SCSI error: return code = 0x00040000
07:20:16: end_request: I/O error, dev sde, sector 91795259
07:20:16: md: super_written gets error=-5, uptodate=0
07:20:16: raid10: Disk failure on sde3, disabling device.
07:20:16: ^IOperation continuing on 5 devices
07:20:16: RAID5 conf printout:
07:20:16: --- rd:6 wd:5
07:20:16: disk 0, o:1, dev:sda4
07:20:16: disk 1, o:1, dev:sdb4
07:20:16: disk 2, o:1, dev:sdc4
07:20:16: disk 3, o:1, dev:sdd4
07:20:16: disk 4, o:0, dev:sde4
07:20:16: disk 5, o:1, dev:sdf4
07:20:16: RAID5 conf printout:
07:20:16: --- rd:6 wd:5
07:20:16: disk 0, o:1, dev:sda4
07:20:16: disk 1, o:1, dev:sdb4
07:20:16: disk 2, o:1, dev:sdc4
07:20:16: disk 3, o:1, dev:sdd4
07:20:16: disk 5, o:1, dev:sdf4
07:20:16: RAID10 conf printout:
07:20:16: --- wd:5 rd:6
07:20:16: disk 0, wo:0, o:1, dev:sdb3
07:20:16: disk 1, wo:0, o:1, dev:sdc3
07:20:16: disk 2, wo:0, o:1, dev:sdd3
07:20:16: disk 3, wo:1, o:0, dev:sde3
07:20:16: disk 4, wo:0, o:1, dev:sdf3
07:20:16: disk 5, wo:0, o:1, dev:sda3
07:20:16: RAID10 conf printout:
07:20:16: --- wd:5 rd:6
07:20:16: disk 0, wo:0, o:1, dev:sdb3
07:20:16: disk 1, wo:0, o:1, dev:sdc3
07:20:16: disk 2, wo:0, o:1, dev:sdd3
07:20:16: disk 4, wo:0, o:1, dev:sdf3
07:20:16: disk 5, wo:0, o:1, dev:sda3
And another one while recovering. I notice that the reset timeouts seem
to take much longer this time. Or could the above have had similar
timeouts, but syslog was blocked by the error and didn't receive and
timestamp the messages until recovery was complete?
14:49:55: RAID5 conf printout:
14:49:55: --- rd:6 wd:5
14:49:55: disk 0, o:1, dev:sda4
14:49:55: disk 1, o:1, dev:sdb4
14:49:55: disk 2, o:1, dev:sdc4
14:49:55: disk 3, o:1, dev:sdd4
14:49:55: disk 4, o:1, dev:sde4
14:49:55: disk 5, o:1, dev:sdf4
14:49:55: md: recovery of RAID array md5
14:49:55: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
14:49:55: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
14:49:55: md: using 128k window, over a total of 343831040 blocks.
14:56:13: ata5.00: exception Emask 0x0 SAct 0xff SErr 0x0 action 0x2 frozen
14:56:13: ata5.00: cmd 61/f8:00:da:d6:b9/00:00:1c:00:00/40 tag 0 cdb 0x0 data 126976 out
14:56:13: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/90:08:d2:2f:ba/00:00:1c:00:00/40 tag 1 cdb 0x0 data 73728 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/70:10:62:30:ba/01:00:1c:00:00/40 tag 2 cdb 0x0 data 188416 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:18:d2:31:ba/01:00:1c:00:00/40 tag 3 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:20:d2:32:ba/01:00:1c:00:00/40 tag 4 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:28:d2:33:ba/01:00:1c:00:00/40 tag 5 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:30:d2:34:ba/01:00:1c:00:00/40 tag 6 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5.00: cmd 61/00:38:d2:35:ba/01:00:1c:00:00/40 tag 7 cdb 0x0 data 131072 out
14:56:13: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
14:56:13: ata5: soft resetting port
14:56:43: ata5: softreset failed (timeout)
14:56:43: ata5: softreset failed, retrying in 5 secs
14:56:48: ata5: hard resetting port
14:57:20: ata5: softreset failed (timeout)
14:57:20: ata5: follow-up softreset failed, retrying in 5 secs
14:57:25: ata5: hard resetting port
14:57:58: ata5: softreset failed (timeout)
14:57:58: ata5: reset failed, giving up
14:57:58: ata5.00: disabled
14:57:58: ata5: EH complete
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965522
14:57:58: raid5: Disk failure on sde4, disabling device. Operation continuing on 5 devices
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965266
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481965010
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964754
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964498
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481964130
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481963986
14:57:58: sd 4:0:0:0: SCSI error: return code = 0x00040000
14:57:58: end_request: I/O error, dev sde, sector 481941210
14:57:58: md: md5: recovery done.
14:57:58: RAID5 conf printout:
14:57:58: --- rd:6 wd:5
14:57:58: disk 0, o:1, dev:sda4
14:57:58: disk 1, o:1, dev:sdb4
14:57:58: disk 2, o:1, dev:sdc4
14:57:58: disk 3, o:1, dev:sdd4
14:57:58: disk 4, o:0, dev:sde4
14:57:58: disk 5, o:1, dev:sdf4
14:57:58: RAID5 conf printout:
14:57:58: --- rd:6 wd:5
14:57:58: disk 0, o:1, dev:sda4
14:57:58: disk 1, o:1, dev:sdb4
14:57:58: disk 2, o:1, dev:sdc4
14:57:58: disk 3, o:1, dev:sdd4
14:57:58: disk 5, o:1, dev:sdf4
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
linux [at] horizon.com wrote:
> linux [at] horizon.com wrote:
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap enclosure,
>>> but power cycling the drive doesn't seem to help. I thought libata hotplug
>>> was working? (SiI3132 card, using the sil24 driver.)
>
>> Yeah, it's working but failing resets are considered highly dangerous
>> (in that the controller status is unknown and may cause something
>> dangerous like screaming interrupts) and port is muted after that. The
>> plan is to handle this with polling hotplug such that libata tries to
>> revive the port if PHY status change is detected by polling. Patches
>> are available but they need other things to resolved to get integrated.
>> I think it'll happen before the summer.
>
>> Anyways, you can tell libata to retry the port by manually telling it to
>> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
>
> Ah, thank you! I have to admit, that is at least as mysterious as any
> Microsoft registry tweak.
Polling hotplug should fix this. I thought I would be able to merge it
much earlier. I apparently was way too optimistic. :-(
>>> (H'm... after rebooting, reallocated sectors jumped from 26 to 39.
>>> Something is up with that drive.)
>
>> Yeap, seems like a broken drive to me.
>
> Actually, after a few rounds, the reallocated sectors stabilized at 56
> and all is working well again. It's like there was a major problem with
> error handling.
>
> The problem is that I don't know where the blame lies.
I'm pretty sure it's the firmware's fault. It's not supposed to go out
for lunch like that even when internal error occurs.
--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
Tejun Heo wrote:
> [resending. my mail service was down for more than a week and this
> message didn't get delivered.]
>
> linux [at] horizon.com wrote:
>
>>> Anyway, what's annoying is that I can't figure out how to bring the
>>> drive back on line without resetting the box. It's in a hot-swap
>>>
> enclosure,
>
>>> but power cycling the drive doesn't seem to help. I thought libata
>>>
> hotplug
>
>>> was working? (SiI3132 card, using the sil24 driver.)
>>>
>
> Yeah, it's working but failing resets are considered highly dangerous
> (in that the controller status is unknown and may cause something
> dangerous like screaming interrupts) and port is muted after that. The
> plan is to handle this with polling hotplug such that libata tries to
> revive the port if PHY status change is detected by polling. Patches
> are available but they need other things to resolved to get integrated.
> I think it'll happen before the summer.
>
> Anyways, you can tell libata to retry the port by manually telling it to
> rescan the port (echo - - - > /sys/class/scsi_host/hostX/scan).
>
I won't say that's voodoo, but if I ever did it I'd wipe down my
keyboard with holy water afterward. ;-)
Well, I did save the message in my tricks file, but it sounds like a
last ditch effort after something get very wrong.
--
bill davidsen <davidsen [at] tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
On 4/4/07, Bill Davidsen <davidsen [at] tmr.com> wrote:
> I won't say that's voodoo, but if I ever did it I'd wipe down my
> keyboard with holy water afterward. ;-)
>
> Well, I did save the message in my tricks file, but it sounds like a
> last ditch effort after something get very wrong.
Would it reallty be an impediment to development if the kernel
maintainers simply refuse to merge patches that add new sysfs entries
without corresponding documentation?
Lee
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20.3 AMD64 oops in CFQ code
Lee Revell wrote:
> On 4/4/07, Bill Davidsen <davidsen [at] tmr.com> wrote:
>> I won't say that's voodoo, but if I ever did it I'd wipe down my
>> keyboard with holy water afterward. ;-)
>>
>> Well, I did save the message in my tricks file, but it sounds like a
>> last ditch effort after something get very wrong.
Which actually is true. ATA ports failing to reset indicate something
is very wrong. Either the attached device or the controller is broken
and libata shuts down the port to protect the rest of the system from
it. The manual scan requests tell libata to give it one more shot and
polling hotplug can do that automatically. Anyways, this shouldn't
happen unless you have a broken piece of hardware.
> Would it reallty be an impediment to development if the kernel
> maintainers simply refuse to merge patches that add new sysfs entries
> without corresponding documentation?
SCSI host scan nodes have been there for a long time. I think it's
documented somewhere.
--
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html