
possible bus loading problem during resync
I'm working on 2 systems that are mainly for running vdr. I've had these
running, more or less, for a while with raid. But a couple of nights ago, as I
was quitting for the night, I noticed one computer's drive light
staying on. I had just made some changes to xine and didn't know if
something had crashed. I turned on the TV and found the video was freezing
for 10-20 secs every 10-20 secs. Logging in using putty and winscp, I found
it very sluggish to respond. Starting top, I found it was doing the
regular array check/resync. The process was using about 64% cpu and the cpu
was staying at idle speed (1000 MHz). These computers use Athlon64 X2
CPUs. A problem with the AM2 socket systems is that when the cpu is
throttled back, it also slows the bus. This has been found to be a
problem on boards with integrated graphics when using nvidia's vdpau for
hardware video decoding, because they use system ram. The fix is to set
the lower speed limit to 1800 MHz and/or change the up_threshold to ~50%.
However, I am using PCIe video cards and so up till now have not had a
problem.

I stopped vdr, but putty and winscp were still sluggish. This tells me
that it is loading the bus so much that both the video card and the
network are affected. It would also affect any tuner cards, interfering
with any recording that may be going on at the time. I changed the
up_threshold from the default 95% to 50%, which should kick the speed up
when it's syncing. But I'm not sure that will be enough. Could there be
some other setting that is wrong, raising the priority of the process?
It seems like it would be a problem for any system to have raid
maintenance bring it to its knees. The ETA to finish was 75 minutes.
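
For reference, the governor change described above could be scripted roughly as follows. This is only a sketch: it assumes the ondemand governor is active and exposes its tunables in the global sysfs directory shown (some kernels expose them per policy instead). The function name tune_ondemand and its optional sysfs-root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch only: make the CPU clock up sooner under load (run as root).
# Assumption: ondemand tunables live in the global directory below.
tune_ondemand() {
    threshold=$1          # percent busy before clocking up, e.g. 50
    min_khz=$2            # lowest allowed frequency in kHz, e.g. 1800000
    sysroot=${3:-/sys}    # hypothetical override, handy for dry runs

    echo "$threshold" > "$sysroot/devices/system/cpu/cpufreq/ondemand/up_threshold"

    # scaling_min_freq is set per CPU and is expressed in kHz.
    for f in "$sysroot"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_min_freq; do
        [ -e "$f" ] && echo "$min_khz" > "$f"
    done
    return 0
}

# Usage (as root): tune_ondemand 50 1800000
```

Either setting alone may be enough; applying both simply matches the "and/or" in the description above.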
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: possible bus loading problem during resync
On Tue, Mar 9, 2010 at 6:31 AM, Timothy D. Lenz <tlenz [at] vorgon.com> wrote:
> I'm working on 2 systems that are mainly for running vdr. I've had these
> running somewhat for awhile with raid. But a couple nights ago as I was
> quitting for the night, I noticed one of the computers drive light staying
> on. I had just made some changes to xine and didn't know if something had
> crashed. Turned on the TV and found the video was freezing for 10-20secs
> every 10-20secs. Logging in using putty and winscp I found it very sluggish
> to respond.Starting top I found it was doing the regular array check/resync.
Sorry about the incredibly brief answer: Not to dismiss other issues,
but that behavior seems like exactly what I've seen when a disk has
been failing.
-- Kristleifur
Re: possible bus loading problem during resync
Kristleifur Daðason wrote:
> Sorry about the incredibly brief answer: Not to dismiss other issues,
> but that behavior seems like exactly what I've seen when a disk has
> been failing.
>
If that is true, how does that happen? Is the driver hung? But anyway,
how can such things happen when there is more than one CPU core?

Try disabling NCQ with echo 1 > /sys/block/sdX/device/queue_depth for all
drives. After doing this, at most 1 request can be issued to a drive
until the drive has serviced that request.

After doing this, firstly I'd say the sluggishness should disappear, at
least on SSH when not touching the disks. Then you can look with
"iostat -x 1": the bad drive will probably have a service time (svctm)
or await much worse than the others.

Just guesses, correct me if I'm wrong.
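
A small loop saves repeating the echo for each drive. This is a sketch only: it assumes the disks show up as sd* under /sys/block, and the function name and the optional sysfs-root argument are hypothetical.

```shell
#!/bin/sh
# Sketch only: set the queue depth of every sd* disk (run as root).
# A depth of 1 means only one outstanding request per drive.
set_queue_depth() {
    depth=$1
    sysroot=${2:-/sys}    # hypothetical override, handy for dry runs
    for q in "$sysroot"/block/sd*/device/queue_depth; do
        [ -e "$q" ] || continue
        echo "$depth" > "$q"
        echo "$q -> $depth"
    done
}

# Usage (as root):
#   cat /sys/block/sd*/device/queue_depth   # note the old values first
#   set_queue_depth 1
#   iostat -x 1
```

Restore the previous depth with the same function once the test is done.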
Re: possible bus loading problem during resync
Asdo <asdo [at] shiftmail.org> writes:
> Kristleifur Daðason wrote:
>> Sorry about the incredibly brief answer: Not to dismiss other issues,
>> but that behavior seems like exactly what I've seen when a disk has
>> been failing.
>
> If that is true, how does that happen, the driver is hung? But anyway,
> how can such things happen when there is more than one CPU-core?
A drive produces an error, the whole controller hangs and resets all
ports, and all drives have to finish being reset before any IO can continue.
Happens easily enough.
> try disabling NCQ by echo 1 > /sys/block/sdX/device/queue_depth for
> all drives. After doing this, at most 1 request can be issued to one
> drive until the drive has serviced such request.
>
> After doing this, firstly I'd say the sluggishness should disappear,
> at least on SSH when not touching the disks. And then you can look
> with "iostat -x 1": probably the bad drive will have a service time
> (svctm) or await much worse than the others.
>
> Just guesses, correct me if I'm wrong
What I would start with is to check the resync/check speed of the raid and
the kernel messages. If it is running at high speed and there are no kernel
messages about IO errors, then it is probably just a case of the IO
subsystem being busy. I got similar sluggish behaviour when I increased
the stripe cache to 16384 for a reshape.

If there are no hardware problems on the disks causing this, then try
setting the max speed for the resync lower. That way the resync will
leave pauses where other IO and bus activity can happen. The raid should
slow down automatically if there is normal IO pending, but in my
experience that doesn't always work.
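
As a sketch of what lowering the max speed looks like in practice: the per-array limit is sync_speed_max under /sys/block/mdX/md/, in KB/sec, and /proc/mdstat shows the current speed. The function names, the optional sysfs-root argument and the 20000 figure are illustrative assumptions, not recommended values.

```shell
#!/bin/sh
# Sketch only: cap the resync/check speed of one md array (run as root).
# Writing "system" instead of a number reverts the array to the global
# limit in /proc/sys/dev/raid/speed_limit_max.
limit_sync_speed() {
    md=$1                 # e.g. md2
    kb_per_sec=$2         # e.g. 20000, an arbitrary example value
    sysroot=${3:-/sys}    # hypothetical override, handy for dry runs
    echo "$kb_per_sec" > "$sysroot/block/$md/md/sync_speed_max"
}

# Show progress and current speed of any running check or resync.
show_sync_status() {
    cat /proc/mdstat
}

# Usage (as root):
#   show_sync_status
#   limit_sync_speed md2 20000
```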
MfG
Goswin
Fwd: Re: possible bus loading problem during resync
-------- Original Message --------
Subject: Re: possible bus loading problem during resync
Date: Wed, 10 Mar 2010 23:23:23 -0700
From: Timothy D. Lenz <tlenz [at] vorgon.com>
To: Goswin von Brederlow <goswin-v-b [at] web.de>
On 3/10/2010 10:53 PM, Goswin von Brederlow wrote:
> What I would start with is check the resync/check speed of the raid and
> kernel messages. If it is running at high speed and there are no kernel
> messages about IO errors then it is probably just a case of the IO
> subsystem being busy. I got similar sluggish behaviour when I increased
> the stripe cache to 16384 for a reshape.
>
> If there are no hardware problems on the disks causing this then try
> setting the max speed for the resync lower. That way the resync will
> leave pauses where other IO and bus activity can happen. The raid should
> slow down automatically if there is normal IO pending but in my
> experience that doesn't always work.
>
> MfG
> Goswin
Found these 3 entries in /var/log/kern.log.1:
Mar 7 00:57:01 LLLx64-32 kernel: md: data-check of RAID array md0
Mar 7 00:57:01 LLLx64-32 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 7 00:57:01 LLLx64-32 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar 7 00:57:01 LLLx64-32 kernel: md: using 128k window, over a total of 24418688 blocks.
Mar 7 00:57:01 LLLx64-32 kernel: md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
Mar 7 00:57:01 LLLx64-32 kernel: md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
Mar 7 00:57:01 LLLx64-32 kernel: md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
---------------------------------------------------------------------
Mar 7 01:02:50 LLLx64-32 kernel: md: md0: data-check done.
Mar 7 01:02:50 LLLx64-32 kernel: md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
Mar 7 01:02:50 LLLx64-32 kernel: md: data-check of RAID array md1
Mar 7 01:02:50 LLLx64-32 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 7 01:02:50 LLLx64-32 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar 7 01:02:50 LLLx64-32 kernel: md: using 128k window, over a total of 4891712 blocks.
Mar 7 01:03:50 LLLx64-32 kernel: md: md1: data-check done.
Mar 7 01:03:50 LLLx64-32 kernel: md: data-check of RAID array md2
Mar 7 01:03:50 LLLx64-32 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 7 01:03:50 LLLx64-32 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Mar 7 01:03:50 LLLx64-32 kernel: md: using 128k window, over a total of 459073344 blocks.
---------------------------------------------------------------------
Mar 7 02:47:43 LLLx64-32 kernel: md: md2: data-check done.
kern.log.1 ended at Mar 7 06:25:03
There was no ref to "raid" or "md" in /var/log/kern.log
I don't see any raid logs
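
The start and end times in the log above allow a rough estimate of the check speed for each array. The sketch below assumes the "blocks" that md reports are 1 KB each and that both timestamps fall on the same day; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: average data-check rate from block count and log timestamps.

# Convert a two-digit HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    # Strip one leading zero so 08 and 09 are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# check_rate BLOCKS START END -> average KB/sec (integer division).
check_rate() {
    blocks=$1
    start=$(to_seconds "$2")
    end=$(to_seconds "$3")
    echo $(( blocks / (end - start) ))
}

check_rate 24418688  00:57:01 01:02:50   # md0
check_rate 4891712   01:02:50 01:03:50   # md1
check_rate 459073344 01:03:50 02:47:43   # md2
```

Under those assumptions this prints 69967, 81528 and 73652, i.e. roughly 70 to 80 MB/sec on average, well below the 200000 KB/sec ceiling in the log.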
Fwd: Re: possible bus loading problem during resync
This was meant to go to the list. I keep forgetting that this list uses
the sender instead of the list as the reply address.
-------- Original Message --------
Subject: Re: possible bus loading problem during resync
Date: Wed, 10 Mar 2010 17:04:07 -0700
From: Timothy D. Lenz <tlenz [at] vorgon.com>
To: Asdo <asdo [at] shiftmail.org>
On 3/9/2010 4:00 AM, Asdo wrote:
> try disabling NCQ by echo 1 > /sys/block/sdX/device/queue_depth for all
> drives. After doing this, at most 1 request can be issued to one drive
> until the drive has serviced such request.
>
> After doing this, firstly I'd say the sluggishness should disappear, at
> least on SSH when not touching the disks. And then you can look with
> "iostat -x 1": probably the bad drive will have a service time (svctm)
> or await much worse than the others.
>
> Just guesses, correct me if I'm wrong
First output is 5.12 for sda and 1.15 for sdb every time it's started,
then mostly 0 for both. When there are numbers, it changes back and forth
between them as to which is greater.

Device:  rrqm/s  wrqm/s   r/s   w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        6.90   30.46  2.09  1.90  1164.19  258.92    356.52      0.10  23.99   5.12   2.04
sdb        0.16   30.46  8.84  1.90  1165.65  258.92    132.67      0.02   2.25   1.51   1.62

Was this test supposed to be done while it was doing a sync? Because it
was the same whether I made the change to 1 or put them back to the default
value of 31.
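
Picking out the slowest drive from a long iostat run by eye is tedious, so a filter like the following may help. It is only a sketch: the function name is made up, and it finds the column by name from the "Device" header line because the column layout differs between sysstat versions.

```shell
#!/bin/sh
# Sketch only: print the device with the largest value in a named
# "iostat -x" column (default: await), reading iostat output on stdin.
worst_device() {
    col=${1:-await}
    awk -v want="$col" '
        # Header line: remember which field holds the wanted column.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == want) c = i; next }
        # Device lines: lower-case name, no trailing colon (skips avg-cpu).
        c && NF >= c && $1 ~ /^[a-z]/ && $1 !~ /:$/ && $c + 0 > max {
            max = $c + 0; dev = $1
        }
        END { if (dev != "") print dev, max }
    '
}

# Usage: iostat -x 1 30 | worst_device await
```

Fed the three iostat lines quoted above, it prints "sda 23.99". Numbers taken while the array is idle say little; the comparison is only meaningful during a check or resync.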
Re: possible bus loading problem during resync
>> If that is true, how does that happen, the driver is hung? But anyway,
>> how can such things happen when there is more than one CPU-core?
>>
>
> A drive produces an error, the whole controler hangs and resets all
> ports, all drives have to finish being reset before any IO can continue.
> Hapens easily enough.
>
Ok, but this is a multi-core CPU and he said Putty and WinSCP were hung.
Ok for WinSCP... but Putty?

Timothy, is Putty hung during an array check even with NCQ disabled?

What is the resync speed? If it is very high it could be CPU
starvation, but that's strange with only 2 drives. If it is very low I am
not sure.
Are the disks' write caches enabled?

I am not able to spot any problem from your iostat or dmesg.
Note: yes, the iostat -x 1 was supposed to be captured during resync.
Trigger a resync manually for the test. You can start it with
echo check > /sys/block/mdX/md/sync_action
and stop it with echo idle to the same file.
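
Put together, the test run could look something like this. A sketch only: the function name, the optional sysfs-root argument, the array name md0 and the 30-sample count are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: start a check, sample iostat while it runs, then stop it.
set_sync_action() {
    md=$1                 # e.g. md0
    action=$2             # "check" to start, "idle" to stop
    sysroot=${3:-/sys}    # hypothetical override, handy for dry runs
    echo "$action" > "$sysroot/block/$md/md/sync_action"
}

# Usage (as root):
#   set_sync_action md0 check
#   iostat -x 1 30 > iostat-during-check.txt   # 30 one-second samples
#   set_sync_action md0 idle
```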
Re: possible bus loading problem during resync
On Fri, Mar 12, 2010 at 11:00 AM, Asdo <asdo [at] shiftmail.org> wrote:
> I am not able to spot any problem from your iostat or dmesg.
> Note: yes the iostat -x 1 was supposed to be captured during resync.
> Trigger a resync manually for the test. You can start it with echo check and
> stop it with echo idle > /sys/block/mdX/md/sync_action
>
I find "gnome-disk-utility" A.K.A. "palimpsest" to be a very good
heuristic to tell whether any drives are giving me physical trouble.
If there is a high remapped-sector-count or if Palimpsest otherwise
thinks a drive is suspect, there is a good chance that any unexplained
slowdowns in the machine are due to that drive.
If nothing more, it's a cheap way to get more information.
(My way of getting Palimpsest to check out drives in a machine that
doesn't have the program available in its installed distro
repositories is to run Ubuntu 9.10 from a USB stick. Boot, and up
comes Palimpsest.)
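
On a machine without a desktop, the same remapped-sector figure can usually be read from the command line instead. The sketch below swaps in smartctl from smartmontools, which is not what I described above, and assumes the drive reports an attribute named Reallocated_Sector_Ct; the function name is made up.

```shell
#!/bin/sh
# Sketch only: extract the raw reallocated-sector count from
# "smartctl -A" output read on stdin (raw value is the 10th field).
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Usage (as root): smartctl -A /dev/sda | reallocated_sectors
```

A count above zero that keeps growing is the kind of sign Palimpsest flags as well.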
Hope this helps. Best of luck.
-- Kristleifur