Raid10 device hangs during resync and heavy I/O.

--BOKacYhQ+x31HxR3
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

I've been able to reproduce this across a number of machines with the same
hardware configuration. During a raid10 resync, it's possible to hang the
device so that any further I/O operations will also block. This can be
fairly simply done using dd.

Interestingly, this is not reproducible when using a non-partitioned device.
That is, creating the device with --auto=3Dyes and then directly using it
functions as expected. However, using --auto=3Dyes or --auto=3Dmdp and then
creating a partition across the device will cause the hang.

=46rom all appearances, this is not just slow I/O, days later the same tasks
are still blocked. The rest of the system continues to function normally,
including other raid devices.

Below I'm going to include the script I'm using to reproduce, the relevant
kernel tracebacks, and /proc/mdstat. Thanks in advance for any help
resolving this.


=3D=3D=3D md10-hang.sh =3D=3D=3D
#!/bin/bash

MDP=3Dfalse
# Pick two unused drives here.
MD_DRIVES=3D"sdc sdd"

if ${MDP}; then
MD_DEV=3D"md_d99"
else
MD_DEV=3D"md99"
fi

M=3D"/mnt/mdmount"
SIZE=3D8192

die () {
echo
echo "ERROR: $*"
echo
exit 1
}

mkraid() {
local d
local drives
local mdargs=3D"--auto=3Dyes"

${MDP} && mdargs=3D"--auto=3Dmdp"

mkdir -p ${M}
umount -f ${M} &>/dev/null
mdadm --stop /dev/md_d99 &>/dev/null
mdadm --stop /dev/md99 &>/dev/null

for d in ${MD_DRIVES}; do
sfdisk -uM /dev/${d} <<-EOF
,${SIZE},83
,,83
EOF
mdadm --zero-superblock /dev/${d}1 &>/dev/null
drives=3D"${drives} /dev/${d}1"
done
mdadm --create /dev/${MD_DEV} \
--run \
--force \
--level=3D10 \
--layout=3Df2 \
--raid-devices=3D2 \
${mdargs} ${drives} || die "mdadm --create failed"

if ${MDP}; then
printf ",,83\n" | sfdisk -uM /dev/${MD_DEV}
mkfs.ext2 -q /dev/${MD_DEV}p1
mount /dev/${MD_DEV}p1 ${M} || die "Mount failed"
else
printf ",,83\n" | sfdisk -uM /dev/${MD_DEV}
mkfs.ext2 -q /dev/${MD_DEV}p1
mount /dev/${MD_DEV}p1 ${M} || die "Mount failed"
fi

echo "Creating tmp file"
dd if=3D/dev/zero of=3D${M}/tmpfile bs=3D1M count=3D4000
}

mkraid
i=3D1
while [ "$(</sys/block/${MD_DEV}/md/sync_action)" !=3D "idle" ]; do
echo "Attempt ${i} to cause crash"
cat /proc/mdstat
dd if=3D${M}/tmpfile of=3D${M}/cpfile bs=3D1M
i=3D$((i++))
done


=3D=3D=3D kernel trace =3D=3D=3D
[ 9002.405247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables =
this message.
[ 9002.433361] ffff88025436fc30 0000000000000046 ffff88025436fc10 ffff8802=
54616800
[ 9002.460415] ffff88025d40dd70 ffff88025d40a3f0 0000000354616800 00000000=
000de600
[ 9002.487497] ffff88025436fc10 ffff8801570343c0 ffff880157034420 ffff8801=
57034448
[ 9002.514575] Call Trace:
[ 9002.526609] [<ffffffff81320efb>] raise_barrier+0x167/0x1a3
[ 9002.548139] [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9002.571218] [<ffffffff813238e1>] sync_request+0x57d/0x8a8
[ 9002.592430] [<ffffffff81320ca5>] ? raid10_unplug+0x24/0x28
[ 9002.613833] [<ffffffff8132ad63>] ? md_thread+0x0/0xe8
[ 9002.633938] [<ffffffff8132dab2>] md_do_sync+0x685/0xa9d
[ 9002.654556] [<ffffffff8132ad63>] ? md_thread+0x0/0xe8
[ 9002.674650] [<ffffffff8132ae31>] md_thread+0xce/0xe8
[ 9002.694435] [<ffffffff81034aa6>] ? spin_unlock_irqrestore+0x9/0xb
[ 9002.717583] [<ffffffff81056cc0>] kthread+0x69/0x71
[ 9002.736753] [<ffffffff810037e4>] kernel_thread_helper+0x4/0x10
[ 9002.759062] [<ffffffff81056c57>] ? kthread+0x0/0x71
[ 9002.778478] [<ffffffff810037e0>] ? kernel_thread_helper+0x0/0x10
[ 9002.801286] INFO: task flush-9:99:5896 blocked for more than 120 seconds.
[ 9002.826287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables =
this message.
[ 9002.854475] ffff88012fa8b870 0000000000000046 ffff88012fa8b850 ffff8802=
54616800
[ 9002.881589] ffff88025d40ebd0 ffff88025d40a3f0 000000036fb73300 00000000=
00000001
[ 9002.908691] ffff88012fa8b850 ffff8801570343c0 ffff880157034420 ffff8801=
57034448
[ 9002.935778] Call Trace:
[ 9002.947781] [<ffffffff81320d5b>] wait_barrier+0xa7/0xe0
[ 9002.968438] [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9002.991452] [<ffffffff8132163e>] make_request+0x121/0x507
[ 9003.012697] [<ffffffff8132d2aa>] md_make_request+0xc7/0x101
[ 9003.034515] [<ffffffff811dc817>] generic_make_request+0x1af/0x276
[ 9003.057953] [<ffffffff811dda3b>] submit_bio+0x9e/0xa7
[ 9003.078197] [<ffffffff810e950d>] submit_bh+0x11b/0x13f
[ 9003.098648] [<ffffffff810ebba9>] __block_write_full_page+0x20b/0x310
[ 9003.122755] [<ffffffff810ec383>] ? end_buffer_async_write+0x0/0x13a
[ 9003.146576] [<ffffffff810ef5b2>] ? blkdev_get_block+0x0/0x50
[ 9003.168612] [<ffffffff810ec383>] ? end_buffer_async_write+0x0/0x13a
[ 9003.192521] [<ffffffff810ef5b2>] ? blkdev_get_block+0x0/0x50
[ 9003.214684] [<ffffffff810ebd30>] block_write_full_page_endio+0x82/0x8e
[ 9003.239426] [<ffffffff810ebd4c>] block_write_full_page+0x10/0x12
[ 9003.262539] [<ffffffff810eea92>] blkdev_writepage+0x13/0x15
[ 9003.284297] [<ffffffff8109e005>] __writepage+0x12/0x2b
[ 9003.304718] [<ffffffff8109e46c>] write_cache_pages+0x1fa/0x306
[ 9003.327254] [<ffffffff8109dff3>] ? __writepage+0x0/0x2b
[ 9003.347976] [<ffffffff810e9f65>] ? mark_buffer_dirty+0x85/0x89
[ 9003.370488] [<ffffffff8109e597>] generic_writepages+0x1f/0x25
[ 9003.392661] [<ffffffff8109e5b9>] do_writepages+0x1c/0x25
[ 9003.413477] [<ffffffff810e43e0>] writeback_single_inode+0xb0/0x1c7
[ 9003.436962] [<ffffffff810e4b5a>] writeback_inodes_wb+0x2bf/0x35a
[ 9003.459949] [<ffffffff810e4d1a>] wb_writeback+0x125/0x1a1
[ 9003.481145] [<ffffffff810e4f66>] wb_do_writeback+0x138/0x14f
[ 9003.503124] [<ffffffff810ab3e7>] ? bdi_start_fn+0x0/0xca
[ 9003.524024] [<ffffffff810e4fa4>] bdi_writeback_task+0x27/0x92
[ 9003.546241] [<ffffffff810ab44c>] bdi_start_fn+0x65/0xca
[ 9003.566875] [<ffffffff81056cc0>] kthread+0x69/0x71
[ 9003.586179] [<ffffffff810037e4>] kernel_thread_helper+0x4/0x10
[ 9003.608532] [<ffffffff81056c57>] ? kthread+0x0/0x71
[ 9003.627949] [<ffffffff810037e0>] ? kernel_thread_helper+0x0/0x10
[ 9003.650791] INFO: task dd:5912 blocked for more than 120 seconds.
[ 9003.673621] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables =
this message.
[ 9003.701721] ffff88024c84d7b8 0000000000000082 ffff88024c84d798 ffff8802=
54616800
[ 9003.728793] ffff88025d6840b0 ffff88025f065640 0000000296364968 00000000=
00000000
[ 9003.755893] 000000014c84d798 ffff8801570343c0 ffff880157034420 ffff8801=
57034448
[ 9003.773323] Call Trace:
[ 9003.773326] [<ffffffff81320d5b>] wait_barrier+0xa7/0xe0
[ 9003.773328] [<ffffffff810383b6>] ? default_wake_function+0x0/0xf
[ 9003.773330] [<ffffffff8132163e>] make_request+0x121/0x507
[ 9003.773332] [<ffffffff810edbd7>] ? bio_split+0xca/0x183
[ 9003.773334] [<ffffffff813215d5>] make_request+0xb8/0x507
[ 9003.773337] [<ffffffff811d780d>] ? __elv_add_request+0xa1/0xaa
[ 9003.773339] [<ffffffff8132d2aa>] md_make_request+0xc7/0x101
[ 9003.773341] [<ffffffff811dc817>] generic_make_request+0x1af/0x276
[ 9003.773343] [<ffffffff810ed885>] ? bio_alloc_bioset+0x70/0xc0
[ 9003.773345] [<ffffffff811dda3b>] submit_bio+0x9e/0xa7
[ 9003.773347] [<ffffffff810f0d0b>] mpage_bio_submit+0x22/0x26
[ 9003.773349] [<ffffffff810f17df>] do_mpage_readpage+0x462/0x54e
[ 9003.773352] [<ffffffff8109fb21>] ? get_page+0x9/0xf
[ 9003.773354] [<ffffffff810a004d>] ? __lru_cache_add+0x40/0x58
[ 9003.773357] [<ffffffff8112c194>] ? ext2_get_block+0x0/0x78a
[ 9003.773359] [<ffffffff810f1a66>] mpage_readpages+0xc9/0x10f
[ 9003.773361] [<ffffffff8112c194>] ? ext2_get_block+0x0/0x78a
[ 9003.773363] [<ffffffff81001d89>] ? __switch_to+0x10e/0x1e1
[ 9003.773366] [<ffffffff8112b40c>] ext2_readpages+0x1a/0x1c
[ 9003.773368] [<ffffffff8109f4d0>] __do_page_cache_readahead+0xf6/0x191
[ 9003.773370] [<ffffffff8109f587>] ra_submit+0x1c/0x20
[ 9003.773372] [<ffffffff8109f7e3>] ondemand_readahead+0x17b/0x18e
[ 9003.773374] [<ffffffff8109f870>] page_cache_async_readahead+0x7a/0xa2
[ 9003.773379] [<ffffffff81098a59>] generic_file_aio_read+0x26e/0x55d
[ 9003.773382] [<ffffffff810cb32e>] do_sync_read+0xc2/0x106
[ 9003.773384] [<ffffffff810a009d>] ? lru_cache_add_lru+0x38/0x3d
[ 9003.773387] [<ffffffff8100338e>] ? apic_timer_interrupt+0xe/0x20
[ 9003.773389] [<ffffffff810cb980>] vfs_read+0xa4/0xde
[ 9003.773391] [<ffffffff810cbc02>] sys_read+0x47/0x6d
[ 9003.773393] [<ffffffff81002a42>] system_call_fastpath+0x16/0x1b


=3D=3D=3D /proc/mdstat =3D=3D=3D
Personalities : [raid1] [raid10]
md99 : active raid10 sdd1[1] sdc1[0]
8393856 blocks 64K chunks 2 far-copies [2/2] [UU]
[=3D>...................] resync =3D 5.4% (455360/8393856) finish=
=3D3938.0min speed=3D33K/sec

md1 : active raid10 sda2[0] sdb2[1]
976703488 blocks 512K chunks 2 far-copies [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
56128 blocks [2/2] [UU]

unused devices: <none>


--
Justin Bronder

--BOKacYhQ+x31HxR3
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxAqPoACgkQ4MrvBE1wQ8mOCACeNVfdTn3jRPIwRgYDNS1K uXd1
6eQAn06Z9mW0qAmHSEmwZNhxBgEa366J
=jQfw
-----END PGP SIGNATURE-----

--BOKacYhQ+x31HxR3--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Bronder [ Fr, 16 Juli 2010 20:46 ] [ ID #2044588 ]

Re: Raid10 device hangs during resync and heavy I/O.

--ibTvN161/egqYuK8
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 16/07/10 14:46 -0400, Justin Bronder wrote:
> I've been able to reproduce this across a number of machines with the same
> hardware configuration. During a raid10 resync, it's possible to hang the
> device so that any further I/O operations will also block. This can be
> fairly simply done using dd.
>
> Interestingly, this is not reproducible when using a non-partitioned devi=
ce.
> That is, creating the device with --auto=3Dyes and then directly using it
> functions as expected. However, using --auto=3Dyes or --auto=3Dmdp and t=
hen
> creating a partition across the device will cause the hang.
>
> From all appearances, this is not just slow I/O, days later the same tasks
> are still blocked. The rest of the system continues to function normally,
> including other raid devices.
>
> Below I'm going to include the script I'm using to reproduce, the relevant
> kernel tracebacks, and /proc/mdstat. Thanks in advance for any help
> resolving this.

My apologies, I should have proof read this first. I've reproduced using
2.6.28, 2.6.32 and 2.6.34.1.

--
Justin Bronder

--ibTvN161/egqYuK8
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxAqbIACgkQ4MrvBE1wQ8nrygCgre4994g+zYv4ZqMTWPPV N1q+
m6EAn12cHTIhmUbeh9wES8sNcb3MCRCr
=ZOX1
-----END PGP SIGNATURE-----

--ibTvN161/egqYuK8--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Bronder [ Fr, 16 Juli 2010 20:49 ] [ ID #2044589 ]

Re: Raid10 device hangs during resync and heavy I/O.

--7AUc2qLy4jB3hD7Z
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 16/07/10 14:46 -0400, Justin Bronder wrote:
> I've been able to reproduce this across a number of machines with the same
> hardware configuration. During a raid10 resync, it's possible to hang the
> device so that any further I/O operations will also block. This can be
> fairly simply done using dd.
>
> Interestingly, this is not reproducible when using a non-partitioned devi=
ce.
> That is, creating the device with --auto=3Dyes and then directly using it
> functions as expected. However, using --auto=3Dyes or --auto=3Dmdp and t=
hen
> creating a partition across the device will cause the hang.
>

I've done some more research that may potentially help. All of
the following was done with 2.6.34.1.

Still produces the hang:
- Using cp (may take a bit longer).
- Using jfs as the filesystem.
- Dropping RESYNC_DEPTH to 32
- Using the offset layout.

Does not produce the hang:
- Using the near layout.
- Using dd on the partition directly instead of on a
filesystem via something like:
dd if=3D/dev/${MD_DEV}p1 of=3D/dev/${MD_DEV}p1 seek=3D4001 bs=3D1M


As the barrier code is very similiar, I repeated a number of
these tests using raid1 instead of raid10. In every case, I was
unable to cause the system to hang. I focused on the barriers
due to the tracebacks in the previous email. For the heck of it,
I added some tracing (patch below) where the reason for the hang
is fairly obvious. Of course, how it happened isn't.

The last bit of the trace before the hang.

<idle>-0 [003] 188.987489: allow_barrier: swapper - w:2=
p:1 b:1
<idle>-0 [003] 188.987495: allow_barrier: swapper - w:2=
p:0 b:1
<...>-4910 [003] 188.987518: raise_barrier: out: md99_resync =
- w:2 p:0 b:1
<...>-4910 [003] 188.987733: raise_barrier: in: md99_resync =
- w:2 p:0 b:1
<...>-4909 [003] 188.988174: lower_barrier: md99_raid10 -=
w:2 p:0 b:0
<...>-4915 [002] 188.988188: wait_barrier: out: flush-9:99 - =
w:1 p:1 b:0
<...>-4996 [003] 188.988199: wait_barrier: out: dd - w:0 p:2 =
b:0
<...>-4915 [002] 188.988203: wait_barrier: in: flush-9:99 - =
w:0 p:2 b:0
<...>-4915 [002] 188.988204: wait_barrier: out: flush-9:99 - =
w:0 p:3 b:0
<...>-4915 [002] 188.988208: wait_barrier: in: flush-9:99 - =
w:0 p:3 b:0
<...>-4915 [002] 188.988209: wait_barrier: out: flush-9:99 - =
w:0 p:4 b:0
<...>-4996 [003] 188.988211: wait_barrier: in: dd - w:0 p:4 =
b:0
<...>-4996 [003] 188.988211: wait_barrier: out: dd - w:0 p:5 =
b:0
<...>-4996 [003] 188.988214: wait_barrier: in: dd - w:0 p:5 =
b:0
<...>-4996 [003] 188.988214: wait_barrier: out: dd - w:0 p:6 =
b:0
<...>-4915 [002] 188.988234: wait_barrier: in: flush-9:99 - =
w:0 p:6 b:0
<...>-4915 [002] 188.988235: wait_barrier: out: flush-9:99 - =
w:0 p:7 b:0
<...>-4915 [002] 188.988244: wait_barrier: in: flush-9:99 - =
w:0 p:7 b:0
<...>-4915 [002] 188.988244: wait_barrier: out: flush-9:99 - =
w:0 p:8 b:0
<...>-4915 [002] 188.988248: wait_barrier: in: flush-9:99 - =
w:0 p:8 b:0
<...>-4915 [002] 188.988249: wait_barrier: out: flush-9:99 - =
w:0 p:9 b:0
<...>-4996 [003] 188.988251: wait_barrier: in: dd - w:0 p:9 =
b:0
<...>-4996 [003] 188.988252: wait_barrier: out: dd - w:0 p:10=
b:0
<...>-4996 [003] 188.988261: wait_barrier: in: dd - w:0 p:10=
b:0
<...>-4996 [003] 188.988262: wait_barrier: out: dd - w:0 p:11=
b:0
<...>-4996 [003] 188.988264: wait_barrier: in: dd - w:0 p:11=
b:0
<...>-4996 [003] 188.988264: wait_barrier: out: dd - w:0 p:12=
b:0
<...>-4915 [002] 188.988272: wait_barrier: in: flush-9:99 - =
w:0 p:12 b:0
<...>-4915 [002] 188.988272: wait_barrier: out: flush-9:99 - =
w:0 p:13 b:0
<...>-4915 [002] 188.988281: wait_barrier: in: flush-9:99 - =
w:0 p:13 b:0
<...>-4915 [002] 188.988281: wait_barrier: out: flush-9:99 - =
w:0 p:14 b:0
<...>-4996 [003] 188.988282: wait_barrier: in: dd - w:0 p:14=
b:0
<...>-4996 [003] 188.988283: wait_barrier: out: dd - w:0 p:15=
b:0
<...>-4915 [002] 188.988285: wait_barrier: in: flush-9:99 - =
w:0 p:15 b:0
<...>-4915 [002] 188.988286: wait_barrier: out: flush-9:99 - =
w:0 p:16 b:0
<...>-4996 [003] 188.988290: wait_barrier: in: dd - w:0 p:16=
b:0
<...>-4996 [003] 188.988290: wait_barrier: out: dd - w:0 p:17=
b:0
<...>-4996 [003] 188.988292: wait_barrier: in: dd - w:0 p:17=
b:0
<...>-4996 [003] 188.988293: wait_barrier: out: dd - w:0 p:18=
b:0
<...>-4915 [002] 188.988309: wait_barrier: in: flush-9:99 - =
w:0 p:18 b:0
<...>-4915 [002] 188.988310: wait_barrier: out: flush-9:99 - =
w:0 p:19 b:0
<...>-4996 [003] 188.988310: wait_barrier: in: dd - w:0 p:19=
b:0
<...>-4996 [003] 188.988311: wait_barrier: out: dd - w:0 p:20=
b:0
<...>-4996 [003] 188.988317: wait_barrier: in: dd - w:0 p:20=
b:0
<...>-4996 [003] 188.988318: wait_barrier: out: dd - w:0 p:21=
b:0
<...>-4996 [003] 188.988321: wait_barrier: in: dd - w:0 p:21=
b:0
<...>-4996 [003] 188.988321: wait_barrier: out: dd - w:0 p:22=
b:0
<...>-4915 [002] 188.988323: allow_barrier: flush-9:99 - =
w:0 p:21 b:0
<...>-4996 [003] 188.988327: wait_barrier: in: dd - w:0 p:21=
b:0
<...>-4996 [003] 188.988328: wait_barrier: out: dd - w:0 p:22=
b:0
<...>-4915 [002] 188.988356: wait_barrier: in: flush-9:99 - =
w:0 p:22 b:0
<...>-4915 [002] 188.988356: wait_barrier: out: flush-9:99 - =
w:0 p:23 b:0
<...>-4915 [002] 188.988361: wait_barrier: in: flush-9:99 - =
w:0 p:23 b:0
<...>-4915 [002] 188.988361: wait_barrier: out: flush-9:99 - =
w:0 p:24 b:0
<...>-4910 [003] 188.988370: raise_barrier: mid: md99_resync =
- w:0 p:24 b:1
<...>-4915 [002] 188.988378: wait_barrier: in: flush-9:99 - =
w:0 p:24 b:1
<...>-4996 [003] 188.988638: wait_barrier: in: dd - w:1 p:24=
b:1
<idle>-0 [003] 188.988887: allow_barrier: swapper - w:2=
p:23 b:1
<idle>-0 [003] 188.988895: allow_barrier: swapper - w:2=
p:22 b:1
<idle>-0 [003] 188.988899: allow_barrier: swapper - w:2=
p:21 b:1
<idle>-0 [003] 188.988905: allow_barrier: swapper - w:2=
p:20 b:1
<idle>-0 [003] 188.988909: allow_barrier: swapper - w:2=
p:19 b:1
<idle>-0 [002] 188.996299: allow_barrier: swapper - w:2=
p:18 b:1
<idle>-0 [002] 188.996310: allow_barrier: swapper - w:2=
p:17 b:1
<idle>-0 [002] 188.996317: allow_barrier: swapper - w:2=
p:16 b:1
<idle>-0 [002] 188.996325: allow_barrier: swapper - w:2=
p:15 b:1
<idle>-0 [002] 188.996330: allow_barrier: swapper - w:2=
p:14 b:1
<idle>-0 [002] 188.996336: allow_barrier: swapper - w:2=
p:13 b:1
<idle>-0 [002] 188.996340: allow_barrier: swapper - w:2=
p:12 b:1
<idle>-0 [003] 189.004270: allow_barrier: swapper - w:2=
p:11 b:1
<idle>-0 [003] 189.004278: allow_barrier: swapper - w:2=
p:10 b:1
<idle>-0 [003] 189.004284: allow_barrier: swapper - w:2=
p:9 b:1
<idle>-0 [003] 189.004302: allow_barrier: swapper - w:2=
p:8 b:1
<idle>-0 [003] 189.004313: allow_barrier: swapper - w:2=
p:7 b:1
<idle>-0 [003] 189.004322: allow_barrier: swapper - w:2=
p:6 b:1
<idle>-0 [002] 189.004936: allow_barrier: swapper - w:2=
p:5 b:1
<idle>-0 [002] 189.004959: allow_barrier: swapper - w:2=
p:4 b:1
<idle>-0 [002] 189.004970: allow_barrier: swapper - w:2=
p:3 b:1
<idle>-0 [002] 189.004979: allow_barrier: swapper - w:2=
p:2 b:1
<idle>-0 [002] 189.004996: allow_barrier: swapper - w:2=
p:1 b:1
<idle>-0 [002] 189.005006: allow_barrier: swapper - w:2=
p:0 b:1
<...>-4910 [002] 189.005030: raise_barrier: out: md99_resync =
- w:2 p:0 b:1
<...>-4910 [002] 189.005251: raise_barrier: in: md99_resync =
- w:2 p:0 b:1
<...>-4909 [002] 189.005676: lower_barrier: md99_raid10 -=
w:2 p:0 b:0
<...>-4915 [003] 189.005687: wait_barrier: out: flush-9:99 - =
w:1 p:1 b:0
<...>-4996 [003] 189.005713: wait_barrier: out: dd - w:0 p:2 =
b:0
<...>-4910 [002] 189.005715: raise_barrier: mid: md99_resync =
- w:0 p:2 b:1
<...>-4996 [003] 189.005724: wait_barrier: in: dd - w:0 p:2 =
b:1
<idle>-0 [002] 189.012925: allow_barrier: swapper - w:1=
p:1 b:1
<idle>-0 [002] 189.013271: allow_barrier: swapper - w:1=
p:0 b:1
<...>-4910 [002] 189.013295: raise_barrier: out: md99_resync =
- w:1 p:0 b:1
<...>-4910 [002] 189.013414: raise_barrier: in: md99_resync =
- w:1 p:0 b:1
<...>-4909 [002] 189.021171: lower_barrier: md99_raid10 -=
w:1 p:0 b:0
<...>-4996 [003] 189.021179: wait_barrier: out: dd - w:0 p:1 =
b:0
<...>-4910 [002] 189.021182: raise_barrier: mid: md99_resync =
- w:0 p:1 b:1
<...>-4996 [003] 189.021184: wait_barrier: in: dd - w:0 p:1 =
b:1
<...>-4915 [003] 218.929283: wait_barrier: in: flush-9:99 - =
w:1 p:1 b:1


diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index ad945cc..35ad593 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
[at] [at] -22,6 +22,7 [at] [at]
#include <linux/delay.h>
#include <linux/blkdev.h>
#include <linux/seq_file.h>
+#include <linux/sched.h>
#include "md.h"
#include "raid10.h"
#include "bitmap.h"
[at] [at] -704,6 +705,9 [at] [at] static void raise_barrier(conf_t *conf, int force)
BUG_ON(force && !conf->barrier);
spin_lock_irq(&conf->resync_lock);

+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk("in: %s - w:%u p:%u b:%u\n", current->comm, conf->nr=
_waiting, conf->nr_pending, conf->barrier);
+
/* Wait until no block IO is waiting (unless 'force') */
wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
conf->resync_lock,
[at] [at] -712,12 +716,18 [at] [at] static void raise_barrier(conf_t *conf, int force)
/* block any new IO from starting */
conf->barrier++;

+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk("mid: %s - w:%u p:%u b:%u\n", current->comm, conf->nr=
_waiting, conf->nr_pending, conf->barrier);
+
/* No wait for all pending IO to complete */
wait_event_lock_irq(conf->wait_barrier,
!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
conf->resync_lock,
raid10_unplug(conf->mddev->queue));

+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk("out: %s - w:%u p:%u b:%u\n", current->comm, conf->nr=
_waiting, conf->nr_pending, conf->barrier);
+
spin_unlock_irq(&conf->resync_lock);
}

[at] [at] -726,6 +736,8 [at] [at] static void lower_barrier(conf_t *conf)
unsigned long flags;
spin_lock_irqsave(&conf->resync_lock, flags);
conf->barrier--;
+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk(" %s - w:%u p:%u b:%u\n", current->comm, conf->nr_=
waiting, conf->nr_pending, conf->barrier);
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
}
[at] [at] -733,7 +745,9 [at] [at] static void lower_barrier(conf_t *conf)
static void wait_barrier(conf_t *conf)
{
spin_lock_irq(&conf->resync_lock);
- if (conf->barrier) {
+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk("in: %s - w:%u p:%u b:%u\n", current->comm, conf->nr=
_waiting, conf->nr_pending, conf->barrier);
+ if (conf->barrier) {
conf->nr_waiting++;
wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
conf->resync_lock,
[at] [at] -741,6 +755,8 [at] [at] static void wait_barrier(conf_t *conf)
conf->nr_waiting--;
}
conf->nr_pending++;
+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk("out: %s - w:%u p:%u b:%u\n", current->comm, conf->nr=
_waiting, conf->nr_pending, conf->barrier);
spin_unlock_irq(&conf->resync_lock);
}

[at] [at] -749,6 +765,8 [at] [at] static void allow_barrier(conf_t *conf)
unsigned long flags;
spin_lock_irqsave(&conf->resync_lock, flags);
conf->nr_pending--;
+ if ( conf->mddev->md_minor =3D=3D 99 )
+ trace_printk(" %s - w:%u p:%u b:%u\n", current->comm, conf->nr_=
waiting, conf->nr_pending, conf->barrier);
spin_unlock_irqrestore(&conf->resync_lock, flags);
wake_up(&conf->wait_barrier);
}

--
Justin Bronder

--7AUc2qLy4jB3hD7Z
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxIkr0ACgkQ4MrvBE1wQ8lxSQCghQsDvtHrQWvoqtsitN2R sECL
iygAoJNFWQZCxzsAGCDRurdEhJ8/Pum1
=EWZv
-----END PGP SIGNATURE-----

--7AUc2qLy4jB3hD7Z--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Bronder [ Do, 22 Juli 2010 20:49 ] [ ID #2044927 ]

Re: Raid10 device hangs during resync and heavy I/O.

On Thu, 22 Jul 2010 14:49:33 -0400
Justin Bronder <jsbronder [at] gentoo.org> wrote:

> On 16/07/10 14:46 -0400, Justin Bronder wrote:
> > I've been able to reproduce this across a number of machines with the same
> > hardware configuration. During a raid10 resync, it's possible to hang the
> > device so that any further I/O operations will also block. This can be
> > fairly simply done using dd.
> >
> > Interestingly, this is not reproducible when using a non-partitioned device.
> > That is, creating the device with --auto=yes and then directly using it
> > functions as expected. However, using --auto=yes or --auto=mdp and then
> > creating a partition across the device will cause the hang.
> >
>
> I've done some more research that may potentially help. All of
> the following was done with 2.6.34.1.
>
> Still produces the hang:
> - Using cp (may take a bit longer).
> - Using jfs as the filesystem.
> - Dropping RESYNC_DEPTH to 32
> - Using the offset layout.
>
> Does not produce the hang:
> - Using the near layout.
> - Using dd on the partition directly instead of on a
> filesystem via something like:
> dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
>
>
> As the barrier code is very similiar, I repeated a number of
> these tests using raid1 instead of raid10. In every case, I was
> unable to cause the system to hang. I focused on the barriers
> due to the tracebacks in the previous email. For the heck of it,
> I added some tracing (patch below) where the reason for the hang
> is fairly obvious. Of course, how it happened isn't.
>
> The last bit of the trace before the hang.

Thanks for doing this!

See below...

>
> <idle>-0 [003] 188.987489: allow_barrier: swapper - w:2 p:1 b:1
> <idle>-0 [003] 188.987495: allow_barrier: swapper - w:2 p:0 b:1
> <...>-4910 [003] 188.987518: raise_barrier: out: md99_resync - w:2 p:0 b:1
> <...>-4910 [003] 188.987733: raise_barrier: in: md99_resync - w:2 p:0 b:1
> <...>-4909 [003] 188.988174: lower_barrier: md99_raid10 - w:2 p:0 b:0
> <...>-4915 [002] 188.988188: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
> <...>-4996 [003] 188.988199: wait_barrier: out: dd - w:0 p:2 b:0
> <...>-4915 [002] 188.988203: wait_barrier: in: flush-9:99 - w:0 p:2 b:0
> <...>-4915 [002] 188.988204: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
> <...>-4915 [002] 188.988208: wait_barrier: in: flush-9:99 - w:0 p:3 b:0
> <...>-4915 [002] 188.988209: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
> <...>-4996 [003] 188.988211: wait_barrier: in: dd - w:0 p:4 b:0
> <...>-4996 [003] 188.988211: wait_barrier: out: dd - w:0 p:5 b:0
> <...>-4996 [003] 188.988214: wait_barrier: in: dd - w:0 p:5 b:0
> <...>-4996 [003] 188.988214: wait_barrier: out: dd - w:0 p:6 b:0
> <...>-4915 [002] 188.988234: wait_barrier: in: flush-9:99 - w:0 p:6 b:0
> <...>-4915 [002] 188.988235: wait_barrier: out: flush-9:99 - w:0 p:7 b:0
> <...>-4915 [002] 188.988244: wait_barrier: in: flush-9:99 - w:0 p:7 b:0
> <...>-4915 [002] 188.988244: wait_barrier: out: flush-9:99 - w:0 p:8 b:0
> <...>-4915 [002] 188.988248: wait_barrier: in: flush-9:99 - w:0 p:8 b:0
> <...>-4915 [002] 188.988249: wait_barrier: out: flush-9:99 - w:0 p:9 b:0
> <...>-4996 [003] 188.988251: wait_barrier: in: dd - w:0 p:9 b:0
> <...>-4996 [003] 188.988252: wait_barrier: out: dd - w:0 p:10 b:0
> <...>-4996 [003] 188.988261: wait_barrier: in: dd - w:0 p:10 b:0
> <...>-4996 [003] 188.988262: wait_barrier: out: dd - w:0 p:11 b:0
> <...>-4996 [003] 188.988264: wait_barrier: in: dd - w:0 p:11 b:0
> <...>-4996 [003] 188.988264: wait_barrier: out: dd - w:0 p:12 b:0
> <...>-4915 [002] 188.988272: wait_barrier: in: flush-9:99 - w:0 p:12 b:0
> <...>-4915 [002] 188.988272: wait_barrier: out: flush-9:99 - w:0 p:13 b:0
> <...>-4915 [002] 188.988281: wait_barrier: in: flush-9:99 - w:0 p:13 b:0
> <...>-4915 [002] 188.988281: wait_barrier: out: flush-9:99 - w:0 p:14 b:0
> <...>-4996 [003] 188.988282: wait_barrier: in: dd - w:0 p:14 b:0
> <...>-4996 [003] 188.988283: wait_barrier: out: dd - w:0 p:15 b:0
> <...>-4915 [002] 188.988285: wait_barrier: in: flush-9:99 - w:0 p:15 b:0
> <...>-4915 [002] 188.988286: wait_barrier: out: flush-9:99 - w:0 p:16 b:0
> <...>-4996 [003] 188.988290: wait_barrier: in: dd - w:0 p:16 b:0
> <...>-4996 [003] 188.988290: wait_barrier: out: dd - w:0 p:17 b:0
> <...>-4996 [003] 188.988292: wait_barrier: in: dd - w:0 p:17 b:0
> <...>-4996 [003] 188.988293: wait_barrier: out: dd - w:0 p:18 b:0
> <...>-4915 [002] 188.988309: wait_barrier: in: flush-9:99 - w:0 p:18 b:0
> <...>-4915 [002] 188.988310: wait_barrier: out: flush-9:99 - w:0 p:19 b:0
> <...>-4996 [003] 188.988310: wait_barrier: in: dd - w:0 p:19 b:0
> <...>-4996 [003] 188.988311: wait_barrier: out: dd - w:0 p:20 b:0
> <...>-4996 [003] 188.988317: wait_barrier: in: dd - w:0 p:20 b:0
> <...>-4996 [003] 188.988318: wait_barrier: out: dd - w:0 p:21 b:0
> <...>-4996 [003] 188.988321: wait_barrier: in: dd - w:0 p:21 b:0
> <...>-4996 [003] 188.988321: wait_barrier: out: dd - w:0 p:22 b:0
> <...>-4915 [002] 188.988323: allow_barrier: flush-9:99 - w:0 p:21 b:0
> <...>-4996 [003] 188.988327: wait_barrier: in: dd - w:0 p:21 b:0
> <...>-4996 [003] 188.988328: wait_barrier: out: dd - w:0 p:22 b:0
> <...>-4915 [002] 188.988356: wait_barrier: in: flush-9:99 - w:0 p:22 b:0
> <...>-4915 [002] 188.988356: wait_barrier: out: flush-9:99 - w:0 p:23 b:0
> <...>-4915 [002] 188.988361: wait_barrier: in: flush-9:99 - w:0 p:23 b:0
> <...>-4915 [002] 188.988361: wait_barrier: out: flush-9:99 - w:0 p:24 b:0
> <...>-4910 [003] 188.988370: raise_barrier: mid: md99_resync - w:0 p:24 b:1
> <...>-4915 [002] 188.988378: wait_barrier: in: flush-9:99 - w:0 p:24 b:1
> <...>-4996 [003] 188.988638: wait_barrier: in: dd - w:1 p:24 b:1
> <idle>-0 [003] 188.988887: allow_barrier: swapper - w:2 p:23 b:1
> <idle>-0 [003] 188.988895: allow_barrier: swapper - w:2 p:22 b:1
> <idle>-0 [003] 188.988899: allow_barrier: swapper - w:2 p:21 b:1
> <idle>-0 [003] 188.988905: allow_barrier: swapper - w:2 p:20 b:1
> <idle>-0 [003] 188.988909: allow_barrier: swapper - w:2 p:19 b:1
> <idle>-0 [002] 188.996299: allow_barrier: swapper - w:2 p:18 b:1
> <idle>-0 [002] 188.996310: allow_barrier: swapper - w:2 p:17 b:1
> <idle>-0 [002] 188.996317: allow_barrier: swapper - w:2 p:16 b:1
> <idle>-0 [002] 188.996325: allow_barrier: swapper - w:2 p:15 b:1
> <idle>-0 [002] 188.996330: allow_barrier: swapper - w:2 p:14 b:1
> <idle>-0 [002] 188.996336: allow_barrier: swapper - w:2 p:13 b:1
> <idle>-0 [002] 188.996340: allow_barrier: swapper - w:2 p:12 b:1
> <idle>-0 [003] 189.004270: allow_barrier: swapper - w:2 p:11 b:1
> <idle>-0 [003] 189.004278: allow_barrier: swapper - w:2 p:10 b:1
> <idle>-0 [003] 189.004284: allow_barrier: swapper - w:2 p:9 b:1
> <idle>-0 [003] 189.004302: allow_barrier: swapper - w:2 p:8 b:1
> <idle>-0 [003] 189.004313: allow_barrier: swapper - w:2 p:7 b:1
> <idle>-0 [003] 189.004322: allow_barrier: swapper - w:2 p:6 b:1
> <idle>-0 [002] 189.004936: allow_barrier: swapper - w:2 p:5 b:1
> <idle>-0 [002] 189.004959: allow_barrier: swapper - w:2 p:4 b:1
> <idle>-0 [002] 189.004970: allow_barrier: swapper - w:2 p:3 b:1
> <idle>-0 [002] 189.004979: allow_barrier: swapper - w:2 p:2 b:1
> <idle>-0 [002] 189.004996: allow_barrier: swapper - w:2 p:1 b:1
> <idle>-0 [002] 189.005006: allow_barrier: swapper - w:2 p:0 b:1
> <...>-4910 [002] 189.005030: raise_barrier: out: md99_resync - w:2 p:0 b:1
> <...>-4910 [002] 189.005251: raise_barrier: in: md99_resync - w:2 p:0 b:1
> <...>-4909 [002] 189.005676: lower_barrier: md99_raid10 - w:2 p:0 b:0
> <...>-4915 [003] 189.005687: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
> <...>-4996 [003] 189.005713: wait_barrier: out: dd - w:0 p:2 b:0
> <...>-4910 [002] 189.005715: raise_barrier: mid: md99_resync - w:0 p:2 b:1
> <...>-4996 [003] 189.005724: wait_barrier: in: dd - w:0 p:2 b:1
> <idle>-0 [002] 189.012925: allow_barrier: swapper - w:1 p:1 b:1
> <idle>-0 [002] 189.013271: allow_barrier: swapper - w:1 p:0 b:1
> <...>-4910 [002] 189.013295: raise_barrier: out: md99_resync - w:1 p:0 b:1
> <...>-4910 [002] 189.013414: raise_barrier: in: md99_resync - w:1 p:0 b:1
> <...>-4909 [002] 189.021171: lower_barrier: md99_raid10 - w:1 p:0 b:0
> <...>-4996 [003] 189.021179: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-4910 [002] 189.021182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-4996 [003] 189.021184: wait_barrier: in: dd - w:0 p:1 b:1
> <...>-4915 [003] 218.929283: wait_barrier: in: flush-9:99 - w:1 p:1 b:1
>
>

So the 'dd' process successfully waited for the barrier to be gone at
189.021179, and thus set pending to '1'. It then submitted the IO request.
We should then see swapper (or possibly some other thread) calling
allow_barrier when the request completes. But we don't.
A request could possibly take many milliseconds to complete, but it shouldn't
take seconds and certainly not minutes.

It might be helpful if you could run this again, and in make_request(), after
the call to "wait_barrier()" print out:
bio->bi_sector, bio->bi_size, bio->bi_rw

I'm guessing that the last request that doesn't seem to complete will be
different from the other in some important way.

Thanks,
NeilBrown




--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Fr, 23 Juli 2010 05:19 ] [ ID #2044996 ]

Re: Raid10 device hangs during resync and heavy I/O.

--TB36FDmn/VVEgNH/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 23/07/10 13:19 +1000, Neil Brown wrote:
> On Thu, 22 Jul 2010 14:49:33 -0400
> Justin Bronder <jsbronder [at] gentoo.org> wrote:
>
> > On 16/07/10 14:46 -0400, Justin Bronder wrote:
> >
> > I've done some more research that may potentially help. All of
> > the following was done with 2.6.34.1.
> >
> > Still produces the hang:
> > - Using cp (may take a bit longer).
> > - Using jfs as the filesystem.
> > - Dropping RESYNC_DEPTH to 32
> > - Using the offset layout.
> >
> > Does not produce the hang:
> > - Using the near layout.
> > - Using dd on the partition directly instead of on a
> > filesystem via something like:
> > dd if=3D/dev/${MD_DEV}p1 of=3D/dev/${MD_DEV}p1 seek=3D4001 bs=3D1M
> >
> >
> > As the barrier code is very similiar, I repeated a number of
> > these tests using raid1 instead of raid10. In every case, I was
> > unable to cause the system to hang. I focused on the barriers
> > due to the tracebacks in the previous email. For the heck of it,
> > I added some tracing (patch below) where the reason for the hang
> > is fairly obvious. Of course, how it happened isn't.
> >
> > The last bit of the trace before the hang.
>
> Thanks for doing this!
>
> See below...

<previous trace cut>

>
>
> So the 'dd' process successfully waited for the barrier to be gone at
> 189.021179, and thus set pending to '1'. It then submitted the IO reques=
t.
> We should then see swapper (or possibly some other thread) calling
> allow_barrier when the request completes. But we don't.
> A request could possibly take many milliseconds to complete, but it shoul=
dn't
> take seconds and certainly not minutes.
>
> It might be helpful if you could run this again, and in make_request(), a=
fter
> the call to "wait_barrier()" print out:
> bio->bi_sector, bio->bi_size, bio->bi_rw
>
> I'm guessing that the last request that doesn't seem to complete will be
> different from the other in some important way.

Nothing stood out to me, but here's the tail end of a couple of different
traces.

<...>-5047 [002] 207.023784: wait_barrier: in: dd - w:0 p:11=
b:0
<...>-5047 [002] 207.023784: wait_barrier: out: dd - w:0 p:12=
b:0
<...>-5047 [002] 207.023785: make_request: dd - sector:747200=
1 sz:40960 rw:0
<...>-4958 [002] 207.023872: raise_barrier: mid: md99_resync =
- w:0 p:12 b:1
<...>-5047 [002] 207.024689: allow_barrier: dd - w:0 p:11=
b:1
<...>-5047 [002] 207.024695: allow_barrier: dd - w:0 p:10=
b:1
<...>-5047 [002] 207.024697: allow_barrier: dd - w:0 p:9 =
b:1
<...>-5047 [002] 207.024710: allow_barrier: dd - w:0 p:8 =
b:1
<...>-5047 [002] 207.024713: allow_barrier: dd - w:0 p:7 =
b:1
<...>-5047 [002] 207.026679: wait_barrier: in: dd - w:0 p:7 =
b:1
<idle>-0 [003] 207.043049: allow_barrier: swapper - w:1=
p:6 b:1
<idle>-0 [003] 207.043058: allow_barrier: swapper - w:1=
p:5 b:1
<idle>-0 [003] 207.043063: allow_barrier: swapper - w:1=
p:4 b:1
<idle>-0 [003] 207.043070: allow_barrier: swapper - w:1=
p:3 b:1
<idle>-0 [003] 207.043074: allow_barrier: swapper - w:1=
p:2 b:1
<idle>-0 [003] 207.043079: allow_barrier: swapper - w:1=
p:1 b:1
<idle>-0 [003] 207.043084: allow_barrier: swapper - w:1=
p:0 b:1
<...>-4958 [003] 207.043108: raise_barrier: out: md99_resync =
- w:1 p:0 b:1
<...>-4958 [003] 207.043150: raise_barrier: in: md99_resync =
- w:1 p:0 b:1
<...>-4957 [003] 207.051206: lower_barrier: md99_raid10 -=
w:1 p:0 b:0
<...>-5047 [002] 207.051215: wait_barrier: out: dd - w:0 p:1 =
b:0
<...>-5047 [002] 207.051216: make_request: dd - sector:747208=
1 sz:20480 rw:0
<...>-4958 [003] 207.051218: raise_barrier: mid: md99_resync =
- w:0 p:1 b:1
<...>-5047 [002] 207.051227: wait_barrier: in: dd - w:0 p:1 =
b:1
<idle>-0 [002] 207.058929: allow_barrier: swapper - w:1=
p:0 b:1
<...>-4958 [003] 207.058938: raise_barrier: out: md99_resync =
- w:1 p:0 b:1
<...>-4958 [003] 207.059044: raise_barrier: in: md99_resync =
- w:1 p:0 b:1
<...>-4957 [003] 207.067171: lower_barrier: md99_raid10 -=
w:1 p:0 b:0
<...>-5047 [002] 207.067179: wait_barrier: out: dd - w:0 p:1 =
b:0
<...>-5047 [002] 207.067180: make_request: dd - sector:747212=
1 sz:3584 rw:0
<...>-4958 [003] 207.067182: raise_barrier: mid: md99_resync =
- w:0 p:1 b:1
<...>-5047 [002] 207.067184: wait_barrier: in: dd - w:0 p:1 =
b:1



<idle>-0 [000] 463.231730: allow_barrier: swapper - w:2=
p:4 b:1
<idle>-0 [000] 463.231739: allow_barrier: swapper - w:2=
p:3 b:1
<idle>-0 [000] 463.231746: allow_barrier: swapper - w:2=
p:2 b:1
<idle>-0 [000] 463.231765: allow_barrier: swapper - w:2=
p:1 b:1
<idle>-0 [000] 463.231774: allow_barrier: swapper - w:2=
p:0 b:1
<...>-5004 [000] 463.231792: raise_barrier: out: md99_resync =
- w:2 p:0 b:1
<...>-5004 [000] 463.232005: raise_barrier: in: md99_resync =
- w:2 p:0 b:1
<...>-5003 [001] 463.232453: lower_barrier: md99_raid10 -=
w:2 p:0 b:0
<...>-5009 [000] 463.232463: wait_barrier: out: flush-9:99 - =
w:1 p:1 b:0
<...>-5009 [000] 463.232464: make_request: flush-9:99 - secto=
r:13931137 sz:61440 rw:1
<...>-5105 [001] 463.232466: wait_barrier: out: dd - w:0 p:2 =
b:0
<...>-5105 [001] 463.232467: make_request: dd - sector:720439=
3 sz:40960 rw:0
<...>-5009 [000] 463.232476: wait_barrier: in: flush-9:99 - =
w:0 p:2 b:0
<...>-5009 [000] 463.232477: wait_barrier: out: flush-9:99 - =
w:0 p:3 b:0
<...>-5009 [000] 463.232477: make_request: flush-9:99 - secto=
r:13931257 sz:3584 rw:1
<...>-5009 [000] 463.232481: wait_barrier: in: flush-9:99 - =
w:0 p:3 b:0
<...>-5009 [000] 463.232482: wait_barrier: out: flush-9:99 - =
w:0 p:4 b:0
<...>-5009 [000] 463.232483: make_request: flush-9:99 - secto=
r:13931264 sz:512 rw:1
<...>-5105 [001] 463.232492: wait_barrier: in: dd - w:0 p:4 =
b:0
<...>-5105 [001] 463.232493: wait_barrier: out: dd - w:0 p:5 =
b:0
<...>-5105 [001] 463.232494: make_request: dd - sector:720447=
3 sz:3584 rw:0
<...>-5004 [000] 463.232495: raise_barrier: mid: md99_resync =
- w:0 p:5 b:1
<...>-5105 [001] 463.232496: wait_barrier: in: dd - w:0 p:5 =
b:1
<...>-5009 [000] 463.232522: wait_barrier: in: flush-9:99 - =
w:1 p:5 b:1
<idle>-0 [000] 463.232726: allow_barrier: swapper - w:2=
p:4 b:1
<idle>-0 [001] 463.240520: allow_barrier: swapper - w:2=
p:3 b:1
<idle>-0 [000] 463.240946: allow_barrier: swapper - w:2=
p:2 b:1
<idle>-0 [000] 463.240955: allow_barrier: swapper - w:2=
p:1 b:1

Thanks,

--
Justin Bronder

--TB36FDmn/VVEgNH/
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxJuXUACgkQ4MrvBE1wQ8klZwCfW3T543rQ09IAQPX1buTm Crlp
nvcAnRpwFeWOeoFxTjZONOvSWeKA7I3R
=Fc37
-----END PGP SIGNATURE-----

--TB36FDmn/VVEgNH/--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Bronder [ Fr, 23 Juli 2010 17:47 ] [ ID #2045000 ]

Re: Raid10 device hangs during resync and heavy I/O.

On Fri, 23 Jul 2010 11:47:01 -0400
Justin Bronder <jsbronder [at] gentoo.org> wrote:

> On 23/07/10 13:19 +1000, Neil Brown wrote:
> > On Thu, 22 Jul 2010 14:49:33 -0400

> >
> > So the 'dd' process successfully waited for the barrier to be gone at
> > 189.021179, and thus set pending to '1'. It then submitted the IO request.
> > We should then see swapper (or possibly some other thread) calling
> > allow_barrier when the request completes. But we don't.
> > A request could possibly take many milliseconds to complete, but it shouldn't
> > take seconds and certainly not minutes.
> >
> > It might be helpful if you could run this again, and in make_request(), after
> > the call to "wait_barrier()" print out:
> > bio->bi_sector, bio->bi_size, bio->bi_rw
> >
> > I'm guessing that the last request that doesn't seem to complete will be
> > different from the other in some important way.
>
> Nothing stood out to me, but here's the tail end of a couple of different
> traces.

Thanks a lot! Something does stand out for me!....

> <...>-5047 [002] 207.051215: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-5047 [002] 207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
> <...>-4958 [003] 207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-5047 [002] 207.051227: wait_barrier: in: dd - w:0 p:1 b:1
> <idle>-0 [002] 207.058929: allow_barrier: swapper - w:1 p:0 b:1
> <...>-4958 [003] 207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1
> <...>-4958 [003] 207.059044: raise_barrier: in: md99_resync - w:1 p:0 b:1
> <...>-4957 [003] 207.067171: lower_barrier: md99_raid10 - w:1 p:0 b:0
> <...>-5047 [002] 207.067179: wait_barrier: out: dd - w:0 p:1 b:0
> <...>-5047 [002] 207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
> <...>-4958 [003] 207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
> <...>-5047 [002] 207.067184: wait_barrier: in: dd - w:0 p:1 b:1

The last successful IO is only 3584 bytes - 7 sectors. All the others are
much larger.
It is almost certain that the read needed to cross a chunk boundary, so some
goes to one device, some to the next. It was probably a 64K read or similar.
The first 5 pages all fit in one device, and so goes through as a 20K
read. The next page doesn't so it comes down to md/raid10 as a 1 page read.
raid10 splits it into a 7 sector read and a 1 sector read. We see the 7
sector read being initiated, but it doesn't complete for some reason and the
resync barrier gets in the way so the 1 sector read blocks in wait_barrier.

In the next trace....
>
>
>
> <idle>-0 [000] 463.231730: allow_barrier: swapper - w:2 p:4 b:1
> <idle>-0 [000] 463.231739: allow_barrier: swapper - w:2 p:3 b:1
> <idle>-0 [000] 463.231746: allow_barrier: swapper - w:2 p:2 b:1
> <idle>-0 [000] 463.231765: allow_barrier: swapper - w:2 p:1 b:1
> <idle>-0 [000] 463.231774: allow_barrier: swapper - w:2 p:0 b:1
> <...>-5004 [000] 463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1
> <...>-5004 [000] 463.232005: raise_barrier: in: md99_resync - w:2 p:0 b:1
> <...>-5003 [001] 463.232453: lower_barrier: md99_raid10 - w:2 p:0 b:0
> <...>-5009 [000] 463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
> <...>-5009 [000] 463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1
> <...>-5105 [001] 463.232466: wait_barrier: out: dd - w:0 p:2 b:0
> <...>-5105 [001] 463.232467: make_request: dd - sector:7204393 sz:40960 rw:0
> <...>-5009 [000] 463.232476: wait_barrier: in: flush-9:99 - w:0 p:2 b:0
> <...>-5009 [000] 463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
> <...>-5009 [000] 463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1
> <...>-5009 [000] 463.232481: wait_barrier: in: flush-9:99 - w:0 p:3 b:0
> <...>-5009 [000] 463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
> <...>-5009 [000] 463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1
> <...>-5105 [001] 463.232492: wait_barrier: in: dd - w:0 p:4 b:0
> <...>-5105 [001] 463.232493: wait_barrier: out: dd - w:0 p:5 b:0

We see a similar thing with a write being broken into a 15 page writes, then
a 7 sector write, then a 1 sector write - that all works.

> <...>-5105 [001] 463.232494: make_request: dd - sector:7204473 sz:3584 rw:0
> <...>-5004 [000] 463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1
> <...>-5105 [001] 463.232496: wait_barrier: in: dd - w:0 p:5 b:1
> <...>-5009 [000] 463.232522: wait_barrier: in: flush-9:99 - w:1 p:5 b:1
> <idle>-0 [000] 463.232726: allow_barrier: swapper - w:2 p:4 b:1
> <idle>-0 [001] 463.240520: allow_barrier: swapper - w:2 p:3 b:1
> <idle>-0 [000] 463.240946: allow_barrier: swapper - w:2 p:2 b:1
> <idle>-0 [000] 463.240955: allow_barrier: swapper - w:2 p:1 b:1
>

But again we see a 7 sector read following a larger read, and the 1 sector
read that should follow gets blocked.

So it is somehow related to the need to split one-page requests across
multiple devices, and it could be specific to read requests.

Ahhhh.... I see the problem. Because a 'generic_make_request' is already
active, the once called by raid10::make_request just queues the request until
the top level one completes. This results in a deadlock.

I'll have to ponder a bit to figure out the best way to fix this.

Thanks again for the report and the help tracking down the problem.

NeilBrown


> Thanks,
>

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Mo, 02 August 2010 04:29 ] [ ID #2045375 ]

Re: Raid10 device hangs during resync and heavy I/O.

On Mon, 2 Aug 2010 12:29:49 +1000
Neil Brown <neilb [at] suse.de> wrote:


> Ahhhh.... I see the problem. Because a 'generic_make_request' is already
> active, the once called by raid10::make_request just queues the request until
> the top level one completes. This results in a deadlock.
>
> I'll have to ponder a bit to figure out the best way to fix this.
>

So, one good strong cup of tea later I think I have a good solution.

Would you be able to test with this patch and confirm that you cannot
reproduce the hang?
Thanks.

NeilBrown

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 42e64e4..d1d6891 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
[at] [at] -825,11 +825,29 [at] [at] static int make_request(mddev_t *mddev, struct bio * bio)
*/
bp = bio_split(bio,
chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
+
+ /* Each of these 'make_request' calls will call 'wait_barrier'.
+ * If the first succeeds but the second blocks due to the resync
+ * thread raising the barrier, we will deadlock because the
+ * IO to the underlying device will be queued in generic_make_request
+ * and will never complete, so will never reduce nr_pending.
+ * So increment nr_waiting here so no new raise_barriers will
+ * succeed, and so the second wait_barrier cannot block.
+ */
+ spin_lock_irq(&conf->resync_lock);
+ conf->nr_waiting++;
+ spin_unlock_irq(&conf->resync_lock);
+
if (make_request(mddev, &bp->bio1))
generic_make_request(&bp->bio1);
if (make_request(mddev, &bp->bio2))
generic_make_request(&bp->bio2);

+ spin_lock_irq(&conf->resync_lock);
+ conf->nr_waiting--;
+ wake_up(&conf->wait_barrier);
+ spin_unlock_irq(&conf->resync_lock);
+
bio_pair_release(bp);
return 0;
bad_map:
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Mo, 02 August 2010 04:58 ] [ ID #2045376 ]

Re: Raid10 device hangs during resync and heavy I/O.

--ew6BAiZeqk4r7MaW
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 02/08/10 12:58 +1000, Neil Brown wrote:
> On Mon, 2 Aug 2010 12:29:49 +1000
> Neil Brown <neilb [at] suse.de> wrote:
>
>
> > Ahhhh.... I see the problem. Because a 'generic_make_request' is alrea=
dy
> > active, the once called by raid10::make_request just queues the request=
until
> > the top level one completes. This results in a deadlock.
> >
> > I'll have to ponder a bit to figure out the best way to fix this.
> >
>
> So, one good strong cup of tea later I think I have a good solution.
>
> Would you be able to test with this patch and confirm that you cannot
> reproduce the hang?

I've been running with this patch on 2.6.34.1 all day and have yet to cause
the hang. Given it took under 5 minutes earlier, feel free to add:

Tested-by: Justin Bronder <jsbronder [at] gentoo.org>

I really appreciate you taking care of this. Thanks.

> Thanks.
>
> NeilBrown
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 42e64e4..d1d6891 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> [at] [at] -825,11 +825,29 [at] [at] static int make_request(mddev_t *mddev, struct bio =
* bio)
> */
> bp =3D bio_split(bio,
> chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
> +
> + /* Each of these 'make_request' calls will call 'wait_barrier'.
> + * If the first succeeds but the second blocks due to the resync
> + * thread raising the barrier, we will deadlock because the
> + * IO to the underlying device will be queued in generic_make_request
> + * and will never complete, so will never reduce nr_pending.
> + * So increment nr_waiting here so no new raise_barriers will
> + * succeed, and so the second wait_barrier cannot block.
> + */
> + spin_lock_irq(&conf->resync_lock);
> + conf->nr_waiting++;
> + spin_unlock_irq(&conf->resync_lock);
> +
> if (make_request(mddev, &bp->bio1))
> generic_make_request(&bp->bio1);
> if (make_request(mddev, &bp->bio2))
> generic_make_request(&bp->bio2);
>
> + spin_lock_irq(&conf->resync_lock);
> + conf->nr_waiting--;
> + wake_up(&conf->wait_barrier);
> + spin_unlock_irq(&conf->resync_lock);
> +
> bio_pair_release(bp);
> return 0;
> bad_map:
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo [at] vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--
Justin Bronder

--ew6BAiZeqk4r7MaW
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.15 (GNU/Linux)

iEYEARECAAYFAkxXLKIACgkQ4MrvBE1wQ8lgtACeL2x41RNdKkdIwKnzS461 8NUM
laQAoIXz91ypb06IN0/T+rw4hXhf17Uo
=SFTA
-----END PGP SIGNATURE-----

--ew6BAiZeqk4r7MaW--
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Justin Bronder [ Mo, 02 August 2010 22:37 ] [ ID #2045386 ]

Re: Raid10 device hangs during resync and heavy I/O.

On Mon, 2 Aug 2010 16:37:54 -0400
Justin Bronder <jsbronder [at] gentoo.org> wrote:

> On 02/08/10 12:58 +1000, Neil Brown wrote:
> > On Mon, 2 Aug 2010 12:29:49 +1000
> > Neil Brown <neilb [at] suse.de> wrote:
> >
> >
> > > Ahhhh.... I see the problem. Because a 'generic_make_request' is already
> > > active, the once called by raid10::make_request just queues the request until
> > > the top level one completes. This results in a deadlock.
> > >
> > > I'll have to ponder a bit to figure out the best way to fix this.
> > >
> >
> > So, one good strong cup of tea later I think I have a good solution.
> >
> > Would you be able to test with this patch and confirm that you cannot
> > reproduce the hang?
>
> I've been running with this patch on 2.6.34.1 all day and have yet to cause
> the hang. Given it took under 5 minutes earlier, feel free to add:
>
> Tested-by: Justin Bronder <jsbronder [at] gentoo.org>
>
> I really appreciate you taking care of this. Thanks.

And thank you for testing. I've queued this up now and will send it to Linus
and -stable shortly.

NeilBrown


>
> > Thanks.
> >
> > NeilBrown
> >
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index 42e64e4..d1d6891 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > [at] [at] -825,11 +825,29 [at] [at] static int make_request(mddev_t *mddev, struct bio * bio)
> > */
> > bp = bio_split(bio,
> > chunk_sects - (bio->bi_sector & (chunk_sects - 1)) );
> > +
> > + /* Each of these 'make_request' calls will call 'wait_barrier'.
> > + * If the first succeeds but the second blocks due to the resync
> > + * thread raising the barrier, we will deadlock because the
> > + * IO to the underlying device will be queued in generic_make_request
> > + * and will never complete, so will never reduce nr_pending.
> > + * So increment nr_waiting here so no new raise_barriers will
> > + * succeed, and so the second wait_barrier cannot block.
> > + */
> > + spin_lock_irq(&conf->resync_lock);
> > + conf->nr_waiting++;
> > + spin_unlock_irq(&conf->resync_lock);
> > +
> > if (make_request(mddev, &bp->bio1))
> > generic_make_request(&bp->bio1);
> > if (make_request(mddev, &bp->bio2))
> > generic_make_request(&bp->bio2);
> >
> > + spin_lock_irq(&conf->resync_lock);
> > + conf->nr_waiting--;
> > + wake_up(&conf->wait_barrier);
> > + spin_unlock_irq(&conf->resync_lock);
> > +
> > bio_pair_release(bp);
> > return 0;
> > bad_map:
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo [at] vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Sa, 07 August 2010 13:22 ] [ ID #2045697 ]
Linux » gmane.linux.raid » Raid10 device hangs during resync and heavy I/O.

Vorheriges Thema: [PULL REQUEST] md updates for 2.6.36
Nächstes Thema: raid1: prevent adding a "too recent" device to a mirror?