removed disk && md-device

on 09.05.2007 14:17:08 by Bernd Schubert

Hi,

we are presently running into a hotplug/linux-raid problem.

Let's assume a hard disk fails entirely or a stupid human being pulls it out of
the system. Several partitions of that same hard disk are also part of
Linux software RAID arrays. Also, /dev is managed by udev.

Problem-1) When the disk fails, udev will remove it from /dev. Unfortunately
this makes it impossible to remove the disk or its partitions from
the /dev/mdX device, since mdadm tries to read the device file and will
abort if this file is not there.

Problem-2) Even though the kernel detected that the device no longer exists, it
didn't inform its md layer about this event. The md layer will only detect the
missing disk when a read or write attempt to one of its RAID partitions
fails. Unfortunately, if you are unlucky, it might never detect that, e.g.
for raid1 devices.

I think there should be several solutions to these problems.

1) Before udev removes a device file, it should run a pre-remove script, which
should check whether the device is listed in /proc/mdstat and, if it is listed
there, run mdadm to remove this device from the array.
Does udev presently support running pre-remove scripts?

2) As soon as the kernel detects a failed device, it should also inform the
md layer.

3) Does mdadm really need the device node?

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

Re: removed disk && md-device

on 09.05.2007 15:14:50 by Martin F Krafft

also sprach Bernd Schubert [2007.05.09.1417 +0200]:
> Problem-1) When the disk fails, udev will remove it from /dev. Unfortunately
> this will make it impossible to remove the disk or its partitions
> from /dev/mdX device, since mdadm tries to read the device file and will
> abort if this file is not there.

Please also see http://bugs.debian.org/416512. It would be nice if
you could keep 416512@bugs.debian.org on CC.

mdadm upstream knows of the problem. See the bug log.

--
martin; (greetings from the heart of the sun.)
\____ echo mailto: !#^."<*>"|tr "<*> mailto:" net@madduck

spamtraps: madduck.bogus@madduck.net

"i worked myself up from nothing to a state of extreme poverty."
-- groucho marx

_______________________________________________
Linux-hotplug-devel mailing list http://linux-hotplug.sourceforge.net
Linux-hotplug-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-hotplug-devel

Re: removed disk && md-device

on 09.05.2007 15:39:53 by Bernd Schubert

On Wednesday 09 May 2007 15:14:50 martin f krafft wrote:
> also sprach Bernd Schubert [2007.05.09.1417 +0200]:
> > Problem-1) When the disk fails, udev will remove it from /dev.
> > Unfortunately this will make it impossible to remove the disk or its
> > partitions from /dev/mdX device, since mdadm tries to read the device
> > fail and will abort if this file is not there.
>
> Please also see http://bugs.debian.org/416512. It would be nice if
> you could keep 416512@bugs.debian.org on CC.
>
> mdadm upstream knows of the problem. See the bug log.

Ah, so Goswin already wrote a bug report :) Goswin actually first ran into
this problem here while doing some internal tests, but today we have it on a
customer system.

Neil Brown [2007.04.02.0953 +0200]:
>Hmmm... this is somewhat awkward. You could argue that udev should be
>taught to remove the device from the array before removing the device
>from /dev. But I'm not convinced that you always want to 'fail' the
>device. It is possible in this case that the array is quiescent and
>you might like to shut it down without registering a device failure...

Hmm, the kernel advised hotplug to remove the device from /dev, but you
don't want to remove it from md? Do you have an example of that case?

>It is still possible to fail and remove the device by
>writing "faulty" and then "remove" to
> /sys/block/mdX/md/dev-YYY/state

>Maybe an mdadm command that will do that for a given device, or for
>all components of a given array if the 'dev' link is 'broken', or even
>for all devices for all array.

> mdadm --fail-unplugged --scan
>or
> mdadm --fail-unplugged /dev/md3

OK, so one could run this as a cron script. Neil, may I ask whether you have
already started to work on this? Since we have the problem on a customer system,
we should fix it ASAP, or at the latest within the next 2 or 3 weeks. If you
haven't started work on it yet, I will...

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: removed disk && md-device

on 09.05.2007 23:41:35 by Michael Tokarev

Bernd Schubert wrote:
> Hi,
>
> we are presently running into a hotplug/linux-raid problem.
>
> Lets assume a hard disk entirely fails or a stupid human being pulls it out of
> the system. Several partitions of the very same hardisk are also part of
> linux-software raid. Also, /dev is managed by udev.
>
> Problem-1) When the disk fails, udev will remove it from /dev. Unfortunately
> this will make it impossible to remove the disk or its partitions
> from /dev/mdX device, since mdadm tries to read the device fail and will
> abort if this file is not there.

What do you mean by "fails" here?

All the device information is still here, look at /sys/block/mdX/md/rdY/block .
Even if, say, sda (which was a part of md0) disappeared, there will still be
/sys/block/sda directory, because md subsystem keeps it open. Yes the device
node may be removed by udev (oh how i dislike udev!), but all the info is still
here. Also, all the info is in the array information available using ioctl.

mdadm can work it out from here, but it's a bit ugly.
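
What such a sysfs walk might look like, as a minimal and untested sketch: it lists every component of every md array with its state and the major:minor pair of the backing block device. It assumes the /sys/block/mdX/md/dev-YYY/ layout (a state file plus a block link per component); the function name and the SYSFS_ROOT override are made up for illustration and are not part of md or mdadm.

```shell
#!/bin/sh
# Print "array component state major:minor" for each md component.
# SYSFS_ROOT is a hypothetical override so the function can be pointed
# at a test tree instead of the real /sys.
list_md_components() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$root"/block/md*/md/dev-*; do
        [ -d "$dev" ] || continue            # glob matched nothing
        array=${dev#"$root"/block/}          # md0/md/dev-sdb1
        array=${array%%/*}                   # md0
        name=${dev##*/dev-}                  # sdb1
        state=unknown
        [ -r "$dev/state" ] && state=$(cat "$dev/state")
        if [ -e "$dev/block/dev" ]; then
            numbers=$(cat "$dev/block/dev")  # major:minor of the component
        else
            numbers=detached                 # block link is gone or dangling
        fi
        printf '%s %s %s %s\n' "$array" "$name" "$state" "$numbers"
    done
}
```

Calling list_md_components with no setup reads the real /sys; a component whose block link no longer resolves is reported as "detached".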

> Problem-2) Even though the kernel detected the device to not exist anymore, it
> didn't inform its md-layer about this event. The md-layer will first detect
> non-existent disk, if a read or write attempt to one of its raid-partitions
> fails. Unfortunately, if you are unluckily, it might never detect that, e.g.
> for raid1 devices.

This is backwards.

"If you're unlucky" should be the opposite -- "You're lucky". Well ok, it really
depends on other things. If the md layer does not detect the failed disk, it
means that the disk hasn't been needed so far (because any attempt to do I/O on it
will fail, and the disk will be kicked out of the array). And since there was no
need for that disk, no changes have been made to the array (because
on any change, all disks are written to). Which, in turn, means
one of two things:

a) disk will reappear (there are several failure modes, sometimes just a bus
rescan or powercycle will do the trick), and no one will even notice, and
everything will be ok.

b) disk is dead. And I think this is where you say "unlucky" - because for
some (unknown) amount of time, the array will be running in degraded
mode, instead of enabling/resyncing a hot spare etc.

Again: it depends on the failure scenario. What to do here is questionable,
because a) contradicts b). So far, I have hardly seen disks dying (well,
maybe 2 or 3 times), but I've seen disks "disappearing" randomly for no
apparent reason, and a bus reset or powercycle brings them back just fine.
So for me, this is "lucky" behaviour.. ;)

Also, with all the modern hotpluggable drives (USB, SATA, hotpluggable SCSI,
and esp. networked storage, where the network may add its own failure modes),
it's much easier to make a device disappear - by touching cables for
example - and this is case a).

> I think there should be several solutions to these problems.
>
> 1) Before udev removes a device file, it should run a pre-remove script, which
> should check if the device is listed in /proc/mdstat and if it is listed
> there, it should run mdadm to remove this device from the.
> Does udev presently support to run pre-remove scripts?
>
> 2.) As soon as the kernel detects a failed device, it should also inform the
> md layer.

See above: it depends.

> 3.) Does mdadm really need the device?

No, it doesn't. In order to fail or remove a component device from
an array, only the major:minor number is needed. Device nodes aren't needed
even to assemble an array, but only if doing it the dumb way - during
assembly, mdadm examines the devices and tries to add some intelligence
to the process, and for that, device nodes are really necessary. But
not for hot-removals.
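
To illustrate the point that only the major:minor pair matters: a device node can in principle be recreated from those two numbers with mknod and then handed to mdadm. The sketch below only builds the mknod command line and does not run it. The function name and the example node path are hypothetical, and whether a given mdadm version accepts such a node for --fail or --remove has not been verified here.

```shell
#!/bin/sh
# Build (but do not run) the mknod command that recreates a block device
# node from a "major:minor" string such as the one found in a sysfs 'dev' file.
# mknod_command is a hypothetical helper name, not an existing tool.
mknod_command() {
    node=$1
    devnum=$2
    case "$devnum" in
        *[!0-9:]* | *:*:* | :* | *:)
            echo "bad device number: $devnum" >&2
            return 1 ;;
        *:*) ;;                              # looks like major:minor
        *)
            echo "bad device number: $devnum" >&2
            return 1 ;;
    esac
    # ${devnum%%:*} is the major number, ${devnum##*:} the minor number.
    printf 'mknod %s b %s %s\n' "$node" "${devnum%%:*}" "${devnum##*:}"
}
```

For example, mknod_command /dev/.tmp-sdb1 8:17 prints the command for a node that refers to block device 8:17.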

/mjt

Re: removed disk && md-device

on 10.05.2007 09:12:54 by NeilBrown

On Wednesday May 9, bs@q-leap.de wrote:
>
> Neil Brown [2007.04.02.0953 +0200]:
> >Hmmm... this is somewhat awkward. You could argue that udev should be
> >taught to remove the device from the array before removing the device
> >from /dev. But I'm not convinced that you always want to 'fail' the
> >device. It is possible in this case that the array is quiescent and
> >you might like to shut it down without registering a device failure...
>
> Hmm, the the kernel advised hotplug to remove the device from /dev, but you
> don't want to remove it from md? Do you have an example for that case?

Until there is known to be an inconsistency among the devices in an
array, you don't want to record that there is.

Suppose I have two USB drives with a mounted but quiescent filesystem
on a raid1 across them.
I pull them both out, one after the other, to take them to my friend's
place.

I plug them both in and find that the array is degraded, because as
soon as I unplugged one, the other was told that it was now the only
one.
Not good. Best to wait for an IO request that actually returns an
error.

>
> >Maybe an mdadm command that will do that for a given device, or for
> >all components of a given array if the 'dev' link is 'broken', or even
> >for all devices for all array.
>
> > mdadm --fail-unplugged --scan
> >or
> > mdadm --fail-unplugged /dev/md3
>
> Ok, so one could run this as cron script. Neil, may I ask if you already
> started to work on this? Since we have the problem on a customer system, we
> should fix it ASAP, but at least within the next 2 or 3 weeks. If you didn't
> start work on it yet, I will do...

No, I haven't, but it is getting near the top of my list.
If you want a script that does this automatically for every array,
something like:

for a in /sys/block/md*/md/dev-*
do
    # The 'block' link no longer resolves once the underlying device is gone.
    if [ -f "$a/block/dev" ]
    then : still there
    else
        echo faulty > "$a/state"
        echo remove > "$a/state"
    fi
done

should do what you want. (I haven't tested it though).
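
A slightly more defensive variant of the loop above, again only a sketch and equally untested against a real array. It skips the case where the glob matches nothing and adds a dry-run mode so the effect can be inspected first. The function name and the SYSFS_ROOT and DRY_RUN variables are assumptions introduced here, not part of md or mdadm.

```shell
#!/bin/sh
# Mark every md component whose backing block device has vanished as
# faulty, then remove it from its array.
# SYSFS_ROOT (default /sys) and DRY_RUN are hypothetical knobs for testing.
fail_detached_components() {
    root="${SYSFS_ROOT:-/sys}"
    for a in "$root"/block/md*/md/dev-*; do
        [ -d "$a" ] || continue              # glob matched nothing
        [ -f "$a/block/dev" ] && continue    # backing device still present
        if [ -n "${DRY_RUN:-}" ]; then
            echo "would fail and remove $a"
        else
            # Only attempt the removal if marking the device faulty worked.
            echo faulty > "$a/state" && echo remove > "$a/state"
        fi
    done
}
```

Setting DRY_RUN=1 before calling fail_detached_components prints the affected components without writing to any state file.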

NeilBrown

Re: removed disk && md-device

on 10.05.2007 16:33:47 by David Greaves

Re: removed disk && md-device

on 10.05.2007 20:17:32 by Bernd Schubert

On Thursday 10 May 2007 09:12:54 Neil Brown wrote:
> On Wednesday May 9, bs@q-leap.de wrote:
> > Neil Brown [2007.04.02.0953 +0200]:
> > >Hmmm... this is somewhat awkward. You could argue that udev should be
> > >taught to remove the device from the array before removing the device
> > >from /dev. But I'm not convinced that you always want to 'fail' the
> > >device. It is possible in this case that the array is quiescent and
> > >you might like to shut it down without registering a device failure...
> >
> > Hmm, the the kernel advised hotplug to remove the device from /dev, but
> > you don't want to remove it from md? Do you have an example for that
> > case?
>
> Until there is known to be an inconsistency among the devices in an
> array, you don't want to record that there is.
>
> Suppose I have two USB drives with a mounted but quiescent filesystem
> on a raid1 across them.
> I pull them both out, one after the other, to take them to my friends
> place.
>
> I plug them both in and find that the array is degraded, because as
> soon as I unplugged on, the other was told that it was now the only
> one.
> Not good. Best to wait for an IO request that actually returns an
> errors.

OK, keeping the RAID working in this case would be a good idea, so whether it
should degrade or not would need to be configurable.
However, have you tested whether pulling and re-plugging the drive works? Actually
that's what our customer did. As long as md keeps the old device information,
the re-plugged device will get another device name (and of course also
another device number), so the md device will still keep the old device
information and will never automagically add the new device.
That's probably even a good idea: how should the md layer know whether it is really
the very same device, and even if it knew that, how should it know that
no data has been modified on it while it was unplugged?

>
> > >Maybe an mdadm command that will do that for a given device, or for
> > >all components of a given array if the 'dev' link is 'broken', or even
> > >for all devices for all array.
> > >
> > > mdadm --fail-unplugged --scan
> > >or
> > > mdadm --fail-unplugged /dev/md3
> >
> > Ok, so one could run this as cron script. Neil, may I ask if you already
> > started to work on this? Since we have the problem on a customer system,
> > we should fix it ASAP, but at least within the next 2 or 3 weeks. If you
> > didn't start work on it yet, I will do...
>
> No, I haven't, but it is getting near the top of my list.
> If you want a script that does this automatically for every array,
> something like:

I have never looked into the mdadm sources before, but I will try during the
weekend (without any promises).

>
> for a in /sys/block/md*/md/dev-*
> do
> if [ -f $a/block/dev ]
> then : still there
> else
> echo faulty > $a/state
> echo remove > $a/state
> fi
> done
>
> should do what you want. (I haven't tested it though).

Thanks a lot, we will test that here. Do you propose the same logic for mdadm?

Thanks,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

Re: removed disk && md-device

on 11.05.2007 03:36:53 by NeilBrown

On Thursday May 10, david@dgreaves.com wrote:
> Neil Brown wrote:
> > On Wednesday May 9, bs@q-leap.de wrote:
> >> Neil Brown [2007.04.02.0953 +0200]:
> >>> Hmmm... this is somewhat awkward. You could argue that udev should be
> >>> taught to remove the device from the array before removing the device
> >> >from /dev. But I'm not convinced that you always want to 'fail' the
> >>> device. It is possible in this case that the array is quiescent and
> >>> you might like to shut it down without registering a device failure...
> >> Hmm, the the kernel advised hotplug to remove the device from /dev, but you
> >> don't want to remove it from md? Do you have an example for that case?
> >
> > Until there is known to be an inconsistency among the devices in an
> > array, you don't want to record that there is.
> >
> > Suppose I have two USB drives with a mounted but quiescent filesystem
> > on a raid1 across them.
> > I pull them both out, one after the other, to take them to my friends
> > place.
> >
> > I plug them both in and find that the array is degraded, because as
> > soon as I unplugged on, the other was told that it was now the only
> > one.
> And, in truth, so it was.

So what was?
It is true that now one drive is "the only one plugged in", but is
that relevant?
Is it true that the one drive is "the only drive in the array"??
That depends on what you mean by "the array". If I am moving "the
array" to another computer, then the one drive still plugged into the
first computer is not "the only drive in the array" from my
perspective.

If there is a write request, and it can only be written to one drive
(because the other is unplugged), then it becomes appropriate to tell
the still-present drive that it is the only drive in the array.

>
> Who updated the event count though?

Sorry, not enough words. I don't know what you are asking.

>
> > Not good. Best to wait for an IO request that actually returns an
> > errors.
> Ah, now would that be a good time to update the event count?

Yes. Of course. It is an event (IO failed). That makes it a good
time to update the event count...... am I missing something here?

>
>
> Maybe you should allow drives to be removed even if they aren't faulty or spare?
> A write to a removed device would mark it faulty in the other devices without
> waiting for a timeout.

Maybe, but I'm not sure what the real gain would be.

>
> But joggling a usb stick (similar to your use case) would probably be OK since
> it would be hot-removed and then hot-added.

This still needs user-space interaction.
If the USB layer detects a removal and a re-insert, sdb may well come
back as something different (sdp?) - though I'm not completely familiar
with how USB storage works.

In any case, it should really be a user-space decision what happens
then. A hot re-add may well be appropriate, but I wouldn't want to
have the kernel make that decision.

NeilBrown


Re: removed disk && md-device

on 11.05.2007 08:16:56 by NeilBrown

On Thursday May 10, neilb@suse.de wrote:
>
> No, I haven't, but it is getting near the top of my list.

I have just committed a change to the mdadm .git so that

    mdadm /dev/md4 --fail detached

will fail any components of /dev/md4 that appear to be detached (open
returns -ENXIO), and

    mdadm /dev/md4 --remove detached

will remove any such devices (that are failed or spare).
So

    mdadm /dev/md4 --fail detached --remove detached

will get rid of any detached devices completely, as will

    mdadm /dev/md4 --fail detached --remove failed

though that will also remove any failed devices that don't happen to
be detached.
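
For the cron-job use case raised earlier in the thread, these options could be applied to every array in turn. A minimal sketch, assuming the usual /proc/mdstat line format ("md0 : active raid1 ..."); the function names and the MDADM override are hypothetical, and the mdadm options are the ones described in this message.

```shell
#!/bin/sh
# Print the names of all md arrays listed in an mdstat-format file
# (defaults to /proc/mdstat).
md_arrays() {
    sed -n 's/^\(md[^ ]*\) : .*/\1/p' "${1:-/proc/mdstat}"
}

# Fail and remove the detached components of every array.
# MDADM is a hypothetical override, e.g. MDADM=echo for a dry run.
cleanup_detached() {
    for md in $(md_arrays "$@"); do
        ${MDADM:-mdadm} "/dev/$md" --fail detached --remove detached
    done
}
```

Running cleanup_detached with MDADM=echo set prints the mdadm arguments for each array instead of executing anything, which is a cheap way to check the array list before putting the function into a cron job.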

NeilBrown

Re: removed disk && md-device

on 11.05.2007 10:51:40 by Michael Tokarev

Neil Brown wrote:
[]
>> But joggling a usb stick (similar to your use case) would probably be OK since
>> it would be hot-removed and then hot-added.
>
> This still needs user-space interaction.
> If the USB layer detects a removal and a re-insert, sdb may well come
> back a something different (sdp?) - though I'm not completely familiar
> with how USB storage works.

This is in fact an.. interesting issue.

Suppose I pulled the USB cable of sdb -- the WRONG one -- by mistake.
I noticed this immediately (since the LED on the disk went out),
and plugged the cable back in. There were no write requests to the
array during this time; there were no requests to it AT ALL, it was
completely idle.

But.

The unplug immediately triggers USB device removal. But md subsystem still
holds a reference to (now orphan) sdb. So upon plugging it back, since
sdb is busy, scsi subsystem (which handles USB disks) grabs first available
sdX device, let's say it'll be sdp.

So we have an orphaned sdb which is "in use" by the array, and a fresh new sdp,
which is unused but contains the orphaned array component.

And there's no way to hot-re-add sdp to the array (there's nothing that needs
doing to the array itself!) other than.. to powercycle the machine! Because on
hot-remove, the event count will be updated on the still-plugged-in device
(say, sda), and upon hot-add, md will start resyncing. Oh well...
(The only help from the md subsystem here is if it is using bitmaps,
but that's a different issue.)

> In any case, it should really be a user-space decision what happens
> then. A hot re-add may well be appropriate, but I wouldn't want to
> have the kernel make that decision.

The question now isn't about the decision anymore. The question is now
about whether that decision can be implemented at all.

/mjt

Re: removed disk && md-device

on 11.05.2007 10:52:01 by David Greaves

Sorry, rushed email - it wasn't clear. I think there is something important here
though.

Oh, it may be worth distinguishing between a drive identifier (/dev/sdb) and a
drive slot (md0, slot2).

Neil Brown wrote:
> On Thursday May 10, david@dgreaves.com wrote:
>> Neil Brown wrote:
>>> On Wednesday May 9, bs@q-leap.de wrote:
>>>> Neil Brown [2007.04.02.0953 +0200]:
>>>>> Hmmm... this is somewhat awkward. You could argue that udev should be
>>>>> taught to remove the device from the array before removing the device
>>>> >from /dev. But I'm not convinced that you always want to 'fail' the
>>>>> device. It is possible in this case that the array is quiescent and
>>>>> you might like to shut it down without registering a device failure...
>>>> Hmm, the the kernel advised hotplug to remove the device from /dev, but you
>>>> don't want to remove it from md? Do you have an example for that case?
>>> Until there is known to be an inconsistency among the devices in an
>>> array, you don't want to record that there is.
>>>
>>> Suppose I have two USB drives with a mounted but quiescent filesystem
>>> on a raid1 across them.
>>> I pull them both out, one after the other, to take them to my friends
>>> place.
>>>
>>> I plug them both in and find that the array is degraded, because as
>>> soon as I unplugged on, the other was told that it was now the only
>>> one.
>> And, in truth, so it was.
>
> So what was?
Sorry; so it was, as you said, "the only one".
Once you unplugged drive B, drive A was the only drive in the array. From the OS
perspective.

> It is true that now one drive is "the only one plugged in", but is
> that relevant?
Not immediately - which is why I don't think it's an error or an I/O and hence
doesn't deserve an I/O event count increment.
(which is what I meant by "Who updated the event count though?")

So md should distinguish between "removed" and "removed and out of sync".
(aside : what does 'failed' mean anyway? What does it give you that you don't
know better from event count?)

> Is it true that the one drive is "the only drive in the array"??
> That depends on what you mean by "the array". If I am moving "the
> array" to another computer, then the one drive still plugged into the
> first computer is not "the only drive in the array" from my
> perspective.
Yes, but I think that's only the same as saying they all have the same UUID -
'human you' doesn't (directly) care/know about the event count match status -
you just want a working array.

> If there is a write request, and it can only be written to one drive
> (because the other is unplugged), then it becomes appropriate to tell
> the still-present drive that it is the only drive in the array.
Ah, now here I think it's relevant to tell the other drive(s) that the unplugged
drive is not only removed - it's failed.
There's a minor error handling optimisation - md could know that a drive was
removed so not even bother writing to it, just mark it failed in the remaining
drives.

>> Who updated the event count though?
> Sorry, not enough words. I don't know what you are asking.
See below

>>> Not good. Best to wait for an IO request that actually returns an
>>> errors.
>> Ah, now would that be a good time to update the event count?
>
> Yes. Of course. It is an event (IO failed). That makes it a good
> time to update the event count...... am I missing something here?
I think so: a remove event shouldn't update the event count to other drives. A
failed write should (of course).

Well, not 'of course'.
If I do I/O to slot1 and slot2 then the event count goes up.
If slot3 is missing, fails etc etc then why do we tell slots 1+2?
Surely md would just do an event count comparison on assembly?

>> Maybe you should allow drives to be removed even if they aren't faulty or spare?
>> A write to a removed device would mark it faulty in the other devices without
>> waiting for a timeout.
>
> Maybe, but I'm not sure what the real gain would be.
See below.

>> But joggling a usb stick (similar to your use case) would probably be OK since
>> it would be hot-removed and then hot-added.
>
> This still needs user-space interaction.
> If the USB layer detects a removal and a re-insert, sdb may well come
> back a something different (sdp?) - though I'm not completely familiar
> with how USB storage works.
Yes, so, assuming my proposal, in the case where you hot remove sdb (not fail)
then hot add sdp (same drive slot different drive identifier, maybe different
usb controller) the on-disk superblock can reliably ensure that the array just
continues (also assuming quiescence)?

> In any case, it should really be a user-space decision what happens
> then. A hot re-add may well be appropriate, but I wouldn't want to
> have the kernel make that decision.

udev is userspace though - you could have a conservative no-add policy ruleset.

My proposal is simply to allow a hot-remove of a drive without marking it
faulty. This remove event would not update the event counts in other drives.
This allows transient (stupid human in the OP report) drive removal to be
properly communicated via udev to md. You don't end up in the situation of "the
drive formerly known as..."

Just out of interest.
Currently, if I unplug /dev/sdp (which is md0 slot3), wait, plug in a random
non-md usb drive which appears as /dev/sdp, what does md do? Just write to the
new /dev/sdp assuming it's the old one?

David


Re: removed disk && md-device

on 11.05.2007 11:22:17 by Bernd Schubert

On Friday 11 May 2007 10:51:40 Michael Tokarev wrote:
> Neil Brown wrote:
> []
>
> >> But joggling a usb stick (similar to your use case) would probably be OK
> >> since it would be hot-removed and then hot-added.
> >
> > This still needs user-space interaction.
> > If the USB layer detects a removal and a re-insert, sdb may well come
> > back a something different (sdp?) - though I'm not completely familiar
> > with how USB storage works.
>
> This is in fact an.. interesting issue.
>
> Suppose I pulled the USB cable of sdb -- the WRONG one -- by a mistake.
> I noticed this immediately (since the led on the disk stopped lighting),
> and plugged the cable back again. There was no write requests to the
> array during this time, there was no ANY requests to it at all, it was
> completely idle.
>
> But.
>
> The unplug immediately triggers USB device removal. But md subsystem still
> holds a reference to (now orphan) sdb. So upon plugging it back, since
> sdb is busy, scsi subsystem (which handles USB disks) grabs first available
> sdX device, let's say it'll be sdp.
>
> So we've orphan sdb which is "in use" by the array, and fresh new sdp,
> which is unused but contains the orphaned array component.
>
> And there's no way to hot-re-add sdp to the array (there's nothing to do
> to the array itself!) but.. to powercycle the machine! Because on
> hot-remove, event count will be updated on the still-plugged-in device
> (sda let it be), and upon hot-add, md will start resyncing. Oh well...
> (the only help from md subsystem here is in case if it is using bitmaps,
> but that's different issue.)

Yep, that's exactly what I'm talking about, and it's not limited to USB; it
happens with SATA as well.

Bernd

--
Bernd Schubert
Q-Leap Networks GmbH

Re: removed disk && md-device

on 11.05.2007 17:05:31 by David Greaves

[Repost - didn't seem to make it to the lists, sorry cc's]
Sorry, rushed email - it wasn't clear. I think there is something important here
though.

Oh, it may be worth distinguishing between a drive identifier (/dev/sdb) and a
drive slot (md0, slot2).

Neil Brown wrote:
> On Thursday May 10, david@dgreaves.com wrote:
>> Neil Brown wrote:
>>> On Wednesday May 9, bs@q-leap.de wrote:
>>>> Neil Brown [2007.04.02.0953 +0200]:
>>>>> Hmmm... this is somewhat awkward. You could argue that udev should be
>>>>> taught to remove the device from the array before removing the device
>>>> >from /dev. But I'm not convinced that you always want to 'fail' the
>>>>> device. It is possible in this case that the array is quiescent and
>>>>> you might like to shut it down without registering a device failure...
>>>> Hmm, the the kernel advised hotplug to remove the device from /dev, but you
>>>> don't want to remove it from md? Do you have an example for that case?
>>> Until there is known to be an inconsistency among the devices in an
>>> array, you don't want to record that there is.
>>>
>>> Suppose I have two USB drives with a mounted but quiescent filesystem
>>> on a raid1 across them.
>>> I pull them both out, one after the other, to take them to my friends
>>> place.
>>>
>>> I plug them both in and find that the array is degraded, because as
>>> soon as I unplugged on, the other was told that it was now the only
>>> one.
>> And, in truth, so it was.
>
> So what was?
Sorry; so it was, as you said, "the only one".
Once you unplugged drive B, drive A was the only drive in the array. From the OS
perspective.

> It is true that now one drive is "the only one plugged in", but is
> that relevant?
Not immediately - which is why I don't think it's an error or an I/O and hence
doesn't deserve an I/O event count increment.
(which is what I meant by "Who updated the event count though?")

So md should distinguish between "removed" and "removed and out of sync".
(aside : what does 'failed' mean anyway? What does it give you that you don't
know better from event count?)

> Is it true that the one drive is "the only drive in the array"??
> That depends on what you mean by "the array". If I am moving "the
> array" to another computer, then the one drive still plugged into the
> first computer is not "the only drive in the array" from my
> perspective.
Yes, but I think that's only the same as saying they all have the same UUID -
'human you' doesn't (directly) care/know about the event count match status -
you just want a working array.

> If there is a write request, and it can only be written to one drive
> (because the other is unplugged), then it becomes appropriate to tell
> the still-present drive that it is the only drive in the array.
Ah, now here I think it's relevant to tell the other drive(s) that the unplugged
drive is not only removed - it's failed.
There's a minor error handling optimisation - md could know that a drive was
removed so not even bother writing to it, just mark it failed in the remaining
drives.

>> Who updated the event count though?
> Sorry, not enough words. I don't know what you are asking.
See below

>>> Not good. Best to wait for an IO request that actually returns an
>>> error.
>> Ah, now would that be a good time to update the event count?
>
> Yes. Of course. It is an event (IO failed). That makes it a good
> time to update the event count...... am I missing something here?
I think so: a remove event shouldn't update the event count on the other
drives. A failed write should (of course).

Well, not 'of course'.
If I do I/O to slot1 and slot2 then the event count goes up.
If slot3 is missing, fails, etc., then why do we tell slots 1+2?
Surely md would just do an event count comparison on assembly?
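
As a rough user-space illustration of such a comparison (a sketch only, not
how md itself handles assembly): `mdadm --examine` prints an "Events" line
for each member's superblock, and a member whose count differs from the
others has missed updates. The helper name events_of and the device names in
the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `mdadm --examine` output on stdin and print
# the value of its first "Events :" line.
events_of() {
    sed -n 's/^[[:space:]]*Events[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Usage sketch (needs root and real member devices; names are examples):
#   a=$(mdadm --examine /dev/sdb1 | events_of)
#   b=$(mdadm --examine /dev/sdc1 | events_of)
#   if [ "$a" = "$b" ]; then echo "in sync"; else echo "counts differ"; fi
```

The comparison is a plain string match, so it only answers "same or not";
deciding which member is the stale one would need more care.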

>> Maybe you should allow drives to be removed even if they aren't faulty or spare?
>> A write to a removed device would mark it faulty in the other devices without
>> waiting for a timeout.
>
> Maybe, but I'm not sure what the real gain would be.
See below.

>> But joggling a usb stick (similar to your use case) would probably be OK since
>> it would be hot-removed and then hot-added.
>
> This still needs user-space interaction.
> If the USB layer detects a removal and a re-insert, sdb may well come
> back as something different (sdp?) - though I'm not completely familiar
> with how USB storage works.
Yes, so, assuming my proposal, in the case where you hot remove sdb (not fail)
then hot add sdp (same drive slot, different drive identifier, maybe a
different usb controller) the on-disk superblock can reliably ensure that the
array just continues (also assuming quiescence)?

> In any case, it should really be a user-space decision what happens
> then. A hot re-add may well be appropriate, but I wouldn't want to
> have the kernel make that decision.

udev is userspace though - you could have a conservative no-add policy ruleset.

My proposal is simply to allow a hot-remove of a drive without marking it
faulty. This remove event would not update the event counts in other drives.
This allows transient (stupid human in the OP report) drive removal to be
properly communicated via udev to md. You don't end up in the situation of "the
drive formerly known as..."

Just out of interest.
Currently, if I unplug /dev/sdp (which is md0 slot3), wait, plug in a random
non-md usb drive which appears as /dev/sdp, what does md do? Just write to the
new /dev/sdp assuming it's the old one?

David


Re: removed disk && md-device

am 11.05.2007 22:39:35 von Bill Davidsen

Bernd Schubert wrote:
> Yep, that's exactly what I'm talking about and it's not only limited to usb,
> but happens with sata as well.
>

And real SCSI hot plug drives if you pull the wrong one.

--
bill davidsen
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979


Re: removed disk && md-device

am 11.05.2007 22:47:39 von Bill Davidsen

Neil Brown wrote:
> On Thursday May 10, neilb@suse.de wrote:
>
>> No, I haven't, but it is getting near the top of my list.
>>
>
> I have just committed a change to the mdadm .git so that
> mdadm /dev/md4 --fail detached
>
> will fail any components of /dev/md4 that appear to be detached (open
> returns -ENXIO), and
> mdadm /dev/md4 --remove detached
> will remove any such devices (that are failed or spare).
> so
>
> mdadm /dev/md4 --fail detached --remove detached
>
> will get rid of any detached devices completely, as will
>
> mdadm /dev/md4 --fail detached --remove failed
>
> though that will also remove any failed devices that don't happen to
> be detached.

Good... but now we need a way to say "go look for any reattached parts"
of the array, somewhat as is done at boot in most cases. Look at all
unused devices and partitions for a superblock and UUID (or whatever).

If we had this, picture RAID1 on two drives and a USB stick. And someone
who finger checks typing the fail before pulling the USB for the night.
The doofus who trips over the USB cord or the power cord for the
USB drive, they are all clients :-( Anything which would help recover
from their learning experiences with removable media would be helpful.
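
Something along these lines could approximate the "go look for reattached
parts" step from user space. It is an untested sketch under assumptions: it
relies on the "UUID :" (or "Array UUID :") line that `mdadm --examine` and
`mdadm --detail` print, and on `mdadm --re-add`; the function names and the
/dev/md4 example are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the array UUID found in `mdadm --examine`
# or `mdadm --detail` output on stdin. Any trailing note after the UUID
# (such as "(local to host ...)") is dropped.
uuid_of() {
    sed -n 's/^[[:space:]]*\(Array \)\{0,1\}UUID[[:space:]]*:[[:space:]]*//p' |
        head -n 1 | cut -d' ' -f1
}

# Hypothetical helper: try to re-add every listed device whose superblock
# carries the same UUID as the array. Devices that are already active
# members are simply refused by mdadm.
readd_matching() {
    md=$1
    shift
    want=$(mdadm --detail "$md" | uuid_of)
    [ -n "$want" ] || return 1
    for dev in "$@"; do
        have=$(mdadm --examine "$dev" 2>/dev/null | uuid_of)
        if [ "$have" = "$want" ]; then
            mdadm "$md" --re-add "$dev"
        fi
    done
}

# Usage sketch (needs root):
#   readd_matching /dev/md4 $(awk 'NR > 2 { print "/dev/" $4 }' /proc/partitions)
```

Whether a re-add is the right policy is still a user-space decision; this
only shows that the superblock UUID is enough to find the candidates.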

--
bill davidsen
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979


Re: removed disk && md-device

am 15.05.2007 11:52:09 von Goswin von Brederlow

Bill Davidsen writes:

> Bernd Schubert wrote:
>> Yep, that's exactly what I'm talking about and it's not only limited
>> to usb, but happens with sata as well.
>>
>
> And real SCSI hot plug drives if you pull the wrong one.

The right thing to do would be to change the raid superblock before
the first write. That would allow pulling disks out and re adding them
as long as no change is written to the raid. Coupling the change in
meta data with the actual change in real data is clearly the best
approach. No false positives or negatives.

MfG
Goswin


Re: removed disk && md-device

am 16.05.2007 23:19:44 von Colin McCabe

Neil Brown wrote:
> On Wednesday May 9, bs@q-leap.de wrote:
>> Neil Brown [2007.04.02.0953 +0200]:
>>> Hmmm... this is somewhat awkward. You could argue that udev should be
>>> taught to remove the device from the array before removing the device
>>> from /dev. But I'm not convinced that you always want to 'fail' the
>>> device. It is possible in this case that the array is quiescent and
>>> you might like to shut it down without registering a device failure...
>> Hmm, the kernel advised hotplug to remove the device from /dev, but you
>> don't want to remove it from md? Do you have an example for that case?
>
> Until there is known to be an inconsistency among the devices in an
> array, you don't want to record that there is.

Keeping admins in the dark about hotplug is a misfeature.

If you look at /proc/mdstat and you see 4 devices, but actually the
janitor unplugged them all yesterday, you are just going to be more
confused when things eventually fail, not less. It's like a fuel gauge
that says "full," but actually there's only a few drops left in the tank.

> Suppose I have two USB drives with a mounted but quiescent filesystem
> on a raid1 across them.
> I pull them both out, one after the other, to take them to my friend's
> place.
>
> I plug them both in and find that the array is degraded, because as
> soon as I unplugged one, the other was told that it was now the only
> one.

Filesystems have mount / umount; RAID has mdadm --assemble / mdadm
--stop. If you start pulling disks without doing the necessary cleanup,
you should EXPECT the array to go into a degraded state.
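
For the two-USB-drive case quoted above, the cleanup could look roughly like
this. It is a sketch with made-up names: the mount point, the array device
and the run/detach_array/reattach_array helpers are illustrative only. By
default it just prints the commands; set DRY_RUN=0 to actually run them
(as root).

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Orderly teardown before the disks are pulled:
# unmount the filesystem, then stop the array.
detach_array() {
    mountpoint=$1
    md=$2
    run umount "$mountpoint" &&
        run mdadm --stop "$md"
}

# After plugging the disks back in: assemble, then mount again.
reattach_array() {
    md=$1
    mountpoint=$2
    run mdadm --assemble --scan &&
        run mount "$md" "$mountpoint"
}

# Usage sketch:
#   detach_array /mnt/usb /dev/md0
#   ... unplug, travel, plug in ...
#   reattach_array /dev/md0 /mnt/usb
```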

Colin

> Not good. Best to wait for an IO request that actually returns an
> error.
>
>>> Maybe an mdadm command that will do that for a given device, or for
>>> all components of a given array if the 'dev' link is 'broken', or even
>>> for all devices of all arrays.
>>> mdadm --fail-unplugged --scan
>>> or
>>> mdadm --fail-unplugged /dev/md3
>> Ok, so one could run this as cron script. Neil, may I ask if you already
>> started to work on this? Since we have the problem on a customer system, we
>> should fix it ASAP, but at least within the next 2 or 3 weeks. If you didn't
>> start work on it yet, I will do...
>
> No, I haven't, but it is getting near the top of my list.
> If you want a script that does this automatically for every array,
> something like:
>
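> for a in /sys/block/md*/md/dev-*
> do
>     # dev-*/block links to the member's block device; its 'dev' file
>     # is gone once the device has been unplugged.
>     if [ -f "$a/block/dev" ]
>     then : still there
>     else
>         echo faulty > "$a/state"
>         echo remove > "$a/state"
>     fi
> done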
>
> should do what you want. (I haven't tested it though).
>
> NeilBrown