md deadlock (was Re: 2.6.18-mm2)

On Thu, 2006-09-28 at 13:54 +0200, Michal Piotrowski wrote:
> Hi,
>
> On 28/09/06, Andrew Morton <akpm [at] osdl.org> wrote:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2. 6/2.6.18/2.6.18-mm2/
> >
> >
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 2.6.18-mm2 #1
> -------------------------------------------------------
> nash/1264 is trying to acquire lock:
> (&bdev_part_lock_key){--..}, at: [<c0310d4a>] mutex_lock+0x1c/0x1f
>
> but task is already holding lock:
> (&new->reconfig_mutex){--..}, at: [<c03108ff>]
> mutex_lock_interruptible+0x1c/0x1f
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (&new->reconfig_mutex){--..}:
> [<c01390b8>] add_lock_to_list+0x5c/0x7a
> [<c013b1dd>] __lock_acquire+0x9f3/0xaef
> [<c013b643>] lock_acquire+0x71/0x91
> [<c031068f>] __mutex_lock_interruptible_slowpath+0xd2/0x326
> [<c03108ff>] mutex_lock_interruptible+0x1c/0x1f
> [<c02ba4e3>] md_open+0x28/0x5d -> mddev->reconfig_mutex
> [<c0197853>] do_open+0x8b/0x377 -> bdev->bd_mutex (whole)
> [<c0197cd5>] blkdev_open+0x1d/0x46
> [<c0172f36>] __dentry_open+0x133/0x260
> [<c01730d1>] nameidata_to_filp+0x1c/0x2e
> [<c0173111>] do_filp_open+0x2e/0x35
> [<c0173170>] do_sys_open+0x58/0xde
> [<c0173222>] sys_open+0x16/0x18
> [<c0103297>] syscall_call+0x7/0xb
> [<ffffffff>] 0xffffffff
>
> -> #1 (&bdev->bd_mutex){--..}:
> [<c01390b8>] add_lock_to_list+0x5c/0x7a
> [<c013b1dd>] __lock_acquire+0x9f3/0xaef
> [<c013b643>] lock_acquire+0x71/0x91
> [<c0310b0f>] __mutex_lock_slowpath+0xd2/0x2f1
> [<c0310d4a>] mutex_lock+0x1c/0x1f
> [<c0197824>] do_open+0x5c/0x377
> [<c0197bab>] blkdev_get+0x6c/0x77
> [<c01978d0>] do_open+0x108/0x377
> [<c0197bab>] blkdev_get+0x6c/0x77
> [<c0197eb1>] open_by_devnum+0x30/0x3c
> [<c0147419>] swsusp_check+0x14/0xc5
> [<c0145865>] software_resume+0x7e/0x100
> [<c010049e>] init+0x121/0x29f
> [<c0103f23>] kernel_thread_helper+0x7/0x10
> [<c0109523>] save_stack_trace+0x17/0x30
> [<c0138fb0>] save_trace+0x4f/0xfb
> [<c01390b8>] add_lock_to_list+0x5c/0x7a
> [<c013b1dd>] __lock_acquire+0x9f3/0xaef
> [<c013b643>] lock_acquire+0x71/0x91
> [<c0310b0f>] __mutex_lock_slowpath+0xd2/0x2f1
> [<c0310d4a>] mutex_lock+0x1c/0x1f
> [<c0197824>] do_open+0x5c/0x377 -> bdev->bd_mutex (whole)
> [<c0197bab>] blkdev_get+0x6c/0x77
> [<c01978d0>] do_open+0x108/0x377 -> bdev->bd_mutex (partition)
> [<c0197bab>] blkdev_get+0x6c/0x77
> [<c0197eb1>] open_by_devnum+0x30/0x3c
> [<c0147419>] swsusp_check+0x14/0xc5
> [<c0145865>] software_resume+0x7e/0x100
> [<c010049e>] init+0x121/0x29f
> [<c0103f23>] kernel_thread_helper+0x7/0x10
> [<ffffffff>] 0xffffffff
>
> -> #0 (&bdev_part_lock_key){--..}:
> [<c013a7b6>] print_circular_bug_tail+0x30/0x64
> [<c013b114>] __lock_acquire+0x92a/0xaef
> [<c013b643>] lock_acquire+0x71/0x91
> [<c0310b0f>] __mutex_lock_slowpath+0xd2/0x2f1
> [<c0310d4a>] mutex_lock+0x1c/0x1f
> [<c0197323>] bd_claim_by_disk+0x5f/0x18e -> bdev->bd_mutex (partition)
> [<c02b44ec>] bind_rdev_to_array+0x1f0/0x20e
> [<c02b6453>] autostart_arrays+0x24b/0x322
> [<c02b9158>] md_ioctl+0x91/0x13f4
> [<c01ea5bc>] blkdev_driver_ioctl+0x49/0x5b
> [<c01ead23>] blkdev_ioctl+0x755/0x7a2
> [<c0196f9d>] block_ioctl+0x16/0x1b
> [<c01801d2>] do_ioctl+0x22/0x67
> [<c0180460>] vfs_ioctl+0x249/0x25c
> [<c01804ba>] sys_ioctl+0x47/0x75
> [<c0103297>] syscall_call+0x7/0xb
> [<ffffffff>] 0xffffffff
>
> other info that might help us debug this:
>
> 1 lock held by nash/1264:
> #0: (&new->reconfig_mutex){--..}, at: [<c03108ff>]
> mutex_lock_interruptible+0x1c/0x1f
> stack backtrace:
> [<c0104215>] dump_trace+0x64/0x1cd
> [<c0104390>] show_trace_log_lvl+0x12/0x25
> [<c01049e5>] show_trace+0xd/0x10
> [<c0104aad>] dump_stack+0x19/0x1b
> [<c013a7df>] print_circular_bug_tail+0x59/0x64
> [<c013b114>] __lock_acquire+0x92a/0xaef
> [<c013b643>] lock_acquire+0x71/0x91
> [<c0310b0f>] __mutex_lock_slowpath+0xd2/0x2f1
> [<c0310d4a>] mutex_lock+0x1c/0x1f
> [<c0197323>] bd_claim_by_disk+0x5f/0x18e -> bdev->bd_mutex (part)
> [<c02b44ec>] bind_rdev_to_array+0x1f0/0x20e
autorun_devices -> mddev->reconfig_mutex
> [<c02b6453>] autostart_arrays+0x24b/0x322
> [<c02b9158>] md_ioctl+0x91/0x13f4
> [<c01ea5bc>] blkdev_driver_ioctl+0x49/0x5b
> [<c01ead23>] blkdev_ioctl+0x755/0x7a2
> [<c0196f9d>] block_ioctl+0x16/0x1b
> [<c01801d2>] do_ioctl+0x22/0x67
> [<c0180460>] vfs_ioctl+0x249/0x25c
> [<c01804ba>] sys_ioctl+0x47/0x75
> [<c0103297>] syscall_call+0x7/0xb
> DWARF2 unwinder stuck at syscall_call+0x7/0xb
>
> Leftover inexact backtrace:

Looks like a real deadlock here. It seems to me #2 is the easiest to
break.

static int md_open(struct inode *inode, struct file *file)
{
/*
* Succeed if we can lock the mddev, which confirms that
* it isn't being stopped right now.
*/
mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
int err;

if ((err = mddev_lock(mddev)))
goto out;

err = 0;
mddev_get(mddev);
mddev_unlock(mddev);

check_disk_change(inode->i_bdev);
out:
return err;
}

mddev_get() is a simple atomic_inc(), and I fail to see how waiting for
the lock makes any difference.



-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra [ Fr, 29 September 2006 14:12 ] [ ID #1482751 ]

Re: md deadlock (was Re: 2.6.18-mm2)

On Friday September 29, a.p.zijlstra [at] chello.nl wrote:
> On Thu, 2006-09-28 at 13:54 +0200, Michal Piotrowski wrote:
>
> Looks like a real deadlock here. It seems to me #2 is the easiest to
> break.

I guess it could deadlock if you tried to add /dev/md0 as a component
of /dev/md0. I should probably check for that somewhere.
In other cases the array->member ordering ensures there is no
deadlock.

>
> static int md_open(struct inode *inode, struct file *file)
> {
> /*
> * Succeed if we can lock the mddev, which confirms that
> * it isn't being stopped right now.
> */
> mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
> int err;
>
> if ((err = mddev_lock(mddev)))
> goto out;
>
> err = 0;
> mddev_get(mddev);
> mddev_unlock(mddev);
>
> check_disk_change(inode->i_bdev);
> out:
> return err;
> }
>
> mddev_get() is a simple atomic_inc(), and I fail to see how waiting for
> the lock makes any difference.

Hmm... I"m pretty sure I do want some sort of locking there - to make
sure that the
if (atomic_read(&mddev->active)>2) {
test in do_md_stop actually means something. However it does seem
that the locking I have doesn't really guarantee anything much.

But I really think that this locking order should be allowed. md
should ensure that there are never any loops in the array->member
ordering, and somehow that needs to be communicated to lockdep.

One of the items on my todo list is to sort out the lifetime rules of
md devices (once accessed, they currently never disappear). Getting
this locking right should be part of that.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Fr, 29 September 2006 14:52 ] [ ID #1482752 ]

Re: md deadlock (was Re: 2.6.18-mm2)

On Fri, 2006-09-29 at 22:52 +1000, Neil Brown wrote:
> On Friday September 29, a.p.zijlstra [at] chello.nl wrote:
> > On Thu, 2006-09-28 at 13:54 +0200, Michal Piotrowski wrote:
> >
> > Looks like a real deadlock here. It seems to me #2 is the easiest to
> > break.
>
> I guess it could deadlock if you tried to add /dev/md0 as a component
> of /dev/md0. I should probably check for that somewhere.
> In other cases the array->member ordering ensures there is no
> deadlock.
>


1 2

open(/dev/md0)

open(/dev/md0)
- do_open() -> bdev->bd_mutex
ioctl(/dev/md0, hotadd)
- md_ioctl() -> mddev->reconfig_mutex
-- hot_add_disk()
--- bind_rdev_to_array()
---- bd_claim_by_disk()
----- bd_claim_by_kobject()
-- md_open()
--- mddev_lock()
---- mutex_lock(mddev->reconfig_mutex)
------ mutex_lock(bdev->bd_mutex)


looks like an AB-BA deadlock to me


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra [ Fr, 29 September 2006 16:03 ] [ ID #1482754 ]

Re: md deadlock (was Re: 2.6.18-mm2)

On Fri, 2006-09-29 at 16:03 +0200, Peter Zijlstra wrote:
> On Fri, 2006-09-29 at 22:52 +1000, Neil Brown wrote:
> > On Friday September 29, a.p.zijlstra [at] chello.nl wrote:
> > > On Thu, 2006-09-28 at 13:54 +0200, Michal Piotrowski wrote:
> > >
> > > Looks like a real deadlock here. It seems to me #2 is the easiest to
> > > break.
> >
> > I guess it could deadlock if you tried to add /dev/md0 as a component
> > of /dev/md0. I should probably check for that somewhere.
> > In other cases the array->member ordering ensures there is no
> > deadlock.
> >
>
>
> 1 2
>
> open(/dev/md0)
>
> open(/dev/md0)
> - do_open() -> bdev->bd_mutex
> ioctl(/dev/md0, hotadd)
> - md_ioctl() -> mddev->reconfig_mutex
> -- hot_add_disk()
> --- bind_rdev_to_array()
> ---- bd_claim_by_disk()
> ----- bd_claim_by_kobject()
> -- md_open()
> --- mddev_lock()
> ---- mutex_lock(mddev->reconfig_mutex)
> ------ mutex_lock(bdev->bd_mutex)
>

D'0h, 1:bdev->bd_mutex is ofcourse rdev->bd_mutex; the slave device's
mutex.

So mddev->bd_mutex wants to be another class all-together.

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra [ Mo, 02 Oktober 2006 15:47 ] [ ID #1484839 ]

Re: md deadlock (was Re: 2.6.18-mm2)

Hi,
would this be an appropriate fix do the warning lockdep gives about
possible deadlocks in md.

The warning is currently easily triggered with
mdadm -C /dev/md1 -l1 -n1 /dev/sdc missing

(assuming /dev/sdc is a device that you are happy to be scribbled on).

This will take ->reconfig_mutex on md1 while holding bd_mutex,
then will take bd_mutex on sdc while holding reconfig_mutex on md1

This superficial deadlock isn't a real problem because the bd_mutexes
are on different devices and there is an hierarchical relationship
which avoids the loop necessary for a deadlock.

-----------------------
Avoid lockdep warning in md.

md_open takes ->reconfig_mutex which causes lockdep to complain.
This (normally) doesn't have deadlock potential as the possible
conflict is with a reconfig_mutex in a different device.

I say "normally" because if a loop were created in the array->member
hierarchy a deadlock could happen. However that causes bigger
problems than a deadlock and should be fixed independently.

So we flag the lock in md_open as a nested lock. This requires
defining mutex_lock_interruptible_nested.

Signed-off-by: Neil Brown <neilb [at] suse.de>

### Diffstat output
./drivers/md/md.c | 2 +-
./include/linux/mutex.h | 3 ++-
./kernel/mutex.c | 8 ++++++++
3 files changed, 11 insertions(+), 2 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2006-10-09 14:25:11.000000000 +1000
+++ ./drivers/md/md.c 2006-10-10 12:28:35.000000000 +1000
[at] [at] -4422,7 +4422,7 [at] [at] static int md_open(struct inode *inode,
mddev_t *mddev = inode->i_bdev->bd_disk->private_data;
int err;

- if ((err = mddev_lock(mddev)))
+ if ((err = mutex_lock_interruptible_nested(&mddev->reconfig_mutex, 1)))
goto out;

err = 0;

diff .prev/include/linux/mutex.h ./include/linux/mutex.h
--- .prev/include/linux/mutex.h 2006-10-10 12:37:04.000000000 +1000
+++ ./include/linux/mutex.h 2006-10-10 12:40:20.000000000 +1000
[at] [at] -125,8 +125,9 [at] [at] extern int fastcall mutex_lock_interrupt

#ifdef CONFIG_DEBUG_LOCK_ALLOC
extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
+extern int mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass);
#else
-# define mutex_lock_nested(lock, subclass) mutex_lock(lock)
+# define mutex_lock_interruptible_nested(lock, subclass) mutex_interruptible_lock(lock)
#endif

/*

diff .prev/kernel/mutex.c ./kernel/mutex.c
--- .prev/kernel/mutex.c 2006-10-10 12:35:54.000000000 +1000
+++ ./kernel/mutex.c 2006-10-10 13:20:04.000000000 +1000
[at] [at] -206,6 +206,14 [at] [at] mutex_lock_nested(struct mutex *lock, un
}

EXPORT_SYMBOL_GPL(mutex_lock_nested);
+int __sched
+mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
+{
+ might_sleep();
+ return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass);
+}
+
+EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);
#endif

/*
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
NeilBrown [ Di, 10 Oktober 2006 05:53 ] [ ID #1494603 ]

Re: md deadlock (was Re: 2.6.18-mm2)

* Neil Brown <neilb [at] suse.de> wrote:

> --- .prev/include/linux/mutex.h 2006-10-10 12:37:04.000000000 +1000
> +++ ./include/linux/mutex.h 2006-10-10 12:40:20.000000000 +1000
> [at] [at] -125,8 +125,9 [at] [at] extern int fastcall mutex_lock_interrupt
>
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> extern void mutex_lock_nested(struct mutex *lock, unsigned int subclass);
> +extern int mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass);
> #else
> -# define mutex_lock_nested(lock, subclass) mutex_lock(lock)
> +# define mutex_lock_interruptible_nested(lock, subclass) mutex_interruptible_lock(lock)
> #endif

> EXPORT_SYMBOL_GPL(mutex_lock_nested);
> +int __sched
> +mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
> +{
> + might_sleep();
> + return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass);
> +}
> +
> +EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);

looks good to me. (small style nit: maybe insert a newline after the
first EXPORT_SYMBOL_GPL line)

Acked-by: Ingo Molnar <mingo [at] elte.hu>

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo [at] vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ingo Molnar [ Di, 10 Oktober 2006 08:42 ] [ ID #1494604 ]
Linux » gmane.linux.raid » md deadlock (was Re: 2.6.18-mm2)

Vorheriges Thema: [PATCH] md: Fix bug where new drives added to an md array sometimes don't sync properly.
Nächstes Thema: RAID10: near, far, offset -- which one?