Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 16.06.2008 10:23:59 by jmolina

Hello!

Well, I came to report and ask for assistance with a failed RAID5 array grow. I'm having a really crappy weekend -- my goldfish died and my RAID array crapped out.

Fortunately, I did a backup before I attempted this, but I am currently working to try and fix the problem rather than restore.

Yes I googled around before asking, and I've not yet found anything similar enough to my situation to be of help.

There does not appear to be anything wrong with the hardware of any disk. Kernel version was 2.6.23.11 -- I am aware of some nasty bug in -rc3, but I don't think this is the same issue. mdadm is v2.6.3.

I had three SATA disks and added two for a total of five, each 500GB in size. My setup involves three partitions on each disk. So, originally I was working with nine partitions (three disks, three partitions each), then added two disks, for a total of five partitions per RAID array, three RAID5 arrays, for a grand total of fifteen partitions.
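For context, growing an array like this is the usual add-then-grow sequence, roughly as below (device and array names are examples only):

  # add the two new members, then reshape from 3 to 5 devices
  mdadm --add /dev/md2 /dev/sdg5 /dev/sdh5
  mdadm --grow /dev/md2 --raid-devices=5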

I then stack LVM and ext3 on top of the three RAID5 arrays, but LVM and the file system are not relevant here, so let's ignore them.

During the grow process, the system slowly went unresponsive, and I was forced to reboot it after about 30 hours. At first (about 30 minutes after starting) I was not able to run any mdadm commands to check the status of the grow; then I was not able to log in with a new shell; then, after about 24 hours, I was able to use a previously opened shell to see that tons of cron jobs and other work had backed up. During all of this time the system was still acting as an IP router doing NAT. Finally, after about 30 hours, the dhcpd daemon stopped giving out leases, then traffic stopped altogether and I could not ping the host any longer (not a lease problem).

I should note that this is not a particularly highly loaded system. It's basically a home office do-it-all router, file server, mail server, sort of thing.

After the reboot, one of the three RAID5 arrays (the one being grown) won't assemble. My root is on this array, so I'm pretty much stuck in the initramfs shell, though I can mount the backup drive with all of my other binaries and files (which came in useful, since fdisk isn't on the initramfs).

I get the following, which I have typed out since I can't copy from the console:

(initramfs) mdadm --assemble /dev/md2
md: md2 stopped.
mdadm: Failed to restore critical section for reshape, sorry.

And that's it. --force and --run do not help.
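Roughly what I have tried so far looks like this (the member list is illustrative):

  (initramfs) mdadm --assemble --verbose /dev/md2
  (initramfs) mdadm --assemble --force --run /dev/md2 /dev/sd[d-h]5

Both attempts end with the same "Failed to restore critical section for reshape, sorry." message.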



Doing an mdadm --examine on the partitions shows that the Reshape Position was something like 520GB into the now-680GB array, so it was definitely well on its way before the system slowly went to hell.
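A sketch of that check, in case it helps anyone (one member shown; exact field labels vary by mdadm version):

  mdadm --examine /dev/sdd5 | grep -iE 'reshape|state|checksum'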

I was under the impression that a reboot would simply cause the reshape to continue once the system came back up, but apparently not. Something has farked it up badly.

Advice? I'll give just about anything a try, but I'll have to start creating new partitions, arrays, and the whole bit tomorrow and restore the data.


RE: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 16.06.2008 11:28:45 by jmolina

I should also add that when doing an --examine on each of the five devices, they show very similar output, the checksum is correct, and the state is clean. Adding a --verbose to the --assemble command didn't show anything surprising, as all five members were found.

This seems to be either a RAID metadata problem (though why would the state be clean and the checksum correct?) or an mdadm assembly failure, due to a bug or to corrective action for my particular failure simply not being implemented.

Finally, I did find a somewhat similar report here:

http://www.mail-archive.com/linux-raid@vger.kernel.org/msg09020.html

Other than the RAID6 issue, it looks very similar. Is this the same issue?




Re: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 16.06.2008 12:00:25 by Ken Drummond

> There does not appear to be anything wrong with the hardware of any disk.
> Kernel version was 2.6.23.11 -- I am aware of some nasty bug in -rc3, but
> I don't think this is the same issue. mdadm is v2.6.3.

I had exactly the same problem on the weekend, and after some investigation
the problem seems to be mdadm v2.6.3. There was an announcement on this
list for v2.4.4 which included fixes for restarting an interrupted grow. I
was using Ubuntu 8.04, which uses mdadm v2.6.3 and doesn't seem to have an
easy method to upgrade. However, my "fix" was to boot from a Gentoo
livecd 2008.0b2 and use mdadm v2.6.2 to reconstruct the array (it's still
going, adding a 1TiB disk to the previous 3-disk RAID5 array).

> Advice? I'll give just about anything a try, but I'll have to start
> creating new partitions, arrays, and the whole bit tomorrow and restore
> the data.

Try using a different version of mdadm. I think it's up to 2.6.7 now, but
for me the older v2.6.2 worked too. I did mount the array after it started
rebuilding on 2.6.2 and everything seems to be there.

--
Ken.


Re: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 16.06.2008 18:48:26 by Jesse Molina

Thanks. I'll give the updated mdadm binary a try. It certainly looks
plausible that this was a recently fixed mdadm bug.

For the record, I think you typoed this below. You meant to say v2.6.4,
rather than v2.4.4. My current version was v2.6.3. The current mdadm
version appears to be v2.6.4, and Debian currently has a -2 release.

My system is Debian unstable, just as an FYI. v2.6.4-1 was released back in
January 2008, so I guess I've not updated this package since then.

Here is the changelog for mdadm:

http://www.cse.unsw.edu.au/~neilb/source/mdadm/ChangeLog

Specifically:

"Fix restarting of a 'reshape' if it was stopped in the middle."

That sounds like my problem.

I will try this here in an hour or two and see what happens...



On 6/16/08 3:00 AM, "Ken Drummond" wrote:

> There was an announcement on this
> list for v2.4.4 which included fixes to restarting an interrupted grow.

--
# Jesse Molina
# The Translational Genomics Research Institute
# http://www.tgen.org
# Mail = jmolina@tgen.org
# Desk = 1.602.343.8459
# Cell = 1.602.323.7608



Re: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 16.06.2008 21:34:46 by Richard Scobie

Jesse Molina wrote:
> Thanks. I'll give the updated mdadm binary a try. It certainly looks
> plausible that this was a recently fixed mdadm bug.
>
> For the record, I think you typoed this below. You meant to say v2.6.4,
> rather than v2.4.4. My current version was v2.6.3. The current mdadm
> version appears to be v2.6.4, and Debian currently has a -2 release.

The latest version is 2.6.7

http://mirror.linux.org.au/linux/utils/raid/mdadm/

For some reason, Neil's http://www.cse.unsw.edu.au/~neilb/source/mdadm/
repository is not up to date.

Don't use 2.6.5 or 2.6.6 as they may cause segfaults.

Regards,

Richard

Re: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 17.06.2008 03:08:55 by Jesse Molina

Thanks for the help. I can confirm that I successfully recovered the array today.

Indeed, replacing the mdadm in the initramfs from the original v2.6.3 to
2.6.4 fixed the problem.

As noted by Richard Scobie, please avoid versions 2.6.5 and 2.6.6. Either
v2.6.4 or v2.6.7 will fix this issue. I fixed it with v2.6.4.



For historical purposes, and to help others, I was able to fix this as follows:

Since the mdadm binary was in my initramfs, and I was unable to get the
working system up to mount its root file system, I had to interrupt the
initramfs "init" script, replace mdadm with an updated version, and then
continue the process.

To do this, pass your Linux kernel an option such as "break=mount" or maybe
"break=top", to stop the init script just before it is about to mount the
root file system. Then, get your new mdadm file and replace the existing
one at /sbin/mdadm.
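For example, with GRUB (legacy) the edited kernel line might look roughly like this; the kernel path and root device here are only placeholders:

  kernel /boot/vmlinuz-2.6.23.11 root=/dev/mapper/vg0-root ro break=mount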

To get the actual mdadm binary, you will need to use a working system to
extract it from a .deb or .rpm, or otherwise download and compile it. In my
case (Debian), you can do an "ar xv <package>.deb" on the package, and then a
"tar -xzf" on the data file (data.tar.gz). I just retrieved the package from
http://packages.debian.org
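A rough sketch of the extraction on a working Debian box (the .deb filename is just an example):

  ar xv mdadm_2.6.4-2_i386.deb        # unpacks debian-binary, control.tar.gz and data.tar.gz
  tar -xzf data.tar.gz ./sbin/mdadm   # leaves the updated binary at ./sbin/mdadm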

Then, stick the new file on a CD/DVD disk, USB flash drive, or other media
and somehow get it onto your system while it's still at the (initramfs)
busybox prompt. I was able to mount from a CD, so "mount -t iso9660 -r
/dev/cdrom /temp-cdrom", after a "mkdir /temp-cdrom".
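In other words, from the (initramfs) prompt, something like:

  mkdir /temp-cdrom
  mount -t iso9660 -r /dev/cdrom /temp-cdrom
  cp /temp-cdrom/mdadm /sbin/mdadm    # assumes the new binary was put in the root of the CD
  chmod +x /sbin/mdadm
  umount /temp-cdrom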

After you have replaced the old mdadm file with the new one, unmount your
temporary media and then type "mdadm --assemble /dev/md0" (or whichever
array was flunking out on you). Then "vgchange -ay" if you are using LVM.

Finally, press Ctrl+D to exit the initramfs shell, which will cause the "init"
script to try and continue with the boot process from where you interrupted
it. Hopefully, the system will then continue as normal.

Note that you will eventually want to update the mdadm package properly and
rebuild your initramfs.



Thanks for the help Ken.

As for why my system died while it was doing the original grow, I have no
idea. I'll run it in single user and let it finish the job.




--
# Jesse Molina
# The Translational Genomics Research Institute
# http://www.tgen.org
# Mail = jmolina@tgen.org
# Desk = 1.602.343.8459
# Cell = 1.602.323.7608



RE: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 17.06.2008 09:03:22 by Jesse Molina

Hello again

I now have a new problem.

My system is now up, but the array that was causing the problem will not resume its reshape automatically. There has been no disk activity and no change in the state of the array after many hours.

How do I force the array to resync?



Here is the array in question. It's sitting with a flag of "resync=PENDING". How do I get it out of pending?

--

user@host-->cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]

md2 : active raid5 sdd5[0] sdg5[4] sdh5[3] sdf5[2] sde5[1]
325283840 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
resync=PENDING

--

user@host-->sudo mdadm --detail /dev/md2
/dev/md2:
Version : 00.91.03
Creation Time : Sun Nov 18 02:39:31 2007
Raid Level : raid5
Array Size : 325283840 (310.21 GiB 333.09 GB)
Used Dev Size : 162641920 (155.11 GiB 166.55 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Mon Jun 16 21:46:57 2008
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Delta Devices : 2, (3->5)

UUID : 05bcf06a:ce126226:d10fa4d9:5a1884ea (local to host sorrows)
Events : 0.92265

Number Major Minor RaidDevice State
0 8 53 0 active sync /dev/sdd5
1 8 69 1 active sync /dev/sde5
2 8 85 2 active sync /dev/sdf5
3 8 117 3 active sync /dev/sdh5
4 8 101 4 active sync /dev/sdg5

--

Some interesting lines from dmesg:

md: md2 stopped.
md: bind
md: bind
md: bind
md: bind
md: bind
md: md2: raid array is not clean -- starting background reconstruction
raid5: reshape will continue
raid5: device sdd5 operational as raid disk 0
raid5: device sdg5 operational as raid disk 4
raid5: device sdh5 operational as raid disk 3
raid5: device sdf5 operational as raid disk 2
raid5: device sde5 operational as raid disk 1
raid5: allocated 5252kB for md2
raid5: raid level 5 set md2 active with 5 out of 5 devices, algorithm 2
RAID5 conf printout:
--- rd:5 wd:5
disk 0, o:1, dev:sdd5
disk 1, o:1, dev:sde5
disk 2, o:1, dev:sdf5
disk 3, o:1, dev:sdh5
disk 4, o:1, dev:sdg5
....ok start reshape thread

--


Note that in this case, the Array Size is actually the old array size rather than what it should be with all five disks.

Whatever the correct course of action is here, it appears neither obvious nor well documented to me. I suspect that I'm a test case, since I've achieved an unusual state.




--
# Jesse Molina
# The Translational Genomics Research Institute
# http://www.tgen.org
# Mail = jmolina@tgen.org
# Desk = 1.602.343.8459
# Cell = 1.602.323.7608




resync=PENDING, interrupted RAID5 grow will not automatically reconstruct

on 18.06.2008 02:30:25 by Jesse Molina

I think I figured this out.

man md

Read the section regarding the sync_action file.

do as root: "echo idle > /sys/block/md2/md/sync_action"
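In other words, the sequence is roughly (md2 is my array; substitute your own):

  cat /sys/block/md2/md/sync_action          # check the current state first
  echo idle > /sys/block/md2/md/sync_action
  cat /proc/mdstat                           # the reshape resumes shortly afterwards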



After issuing the idle command, my array says:

user@host# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]

md2 : active raid5 sdd5[0] sdg5[4] sdh5[3] sdf5[2] sde5[1]
      325283840 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
      [=================>...]  reshape = 85.3% (138827968/162641920) finish=279.7min speed=1416K/sec

and

user@host# mdadm --detail /dev/md2
/dev/md2:
Version : 00.91.03
Creation Time : Sun Nov 18 02:39:31 2007
Raid Level : raid5
Array Size : 325283840 (310.21 GiB 333.09 GB)
Used Dev Size : 162641920 (155.11 GiB 166.55 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Tue Jun 17 17:25:49 2008
State : active, recovering
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

Reshape Status : 85% complete
Delta Devices : 2, (3->5)

UUID : 05bcf06a:ce126226:d10fa4d9:5a1884ea (local to host sorrows)
Events : 0.92399

Number Major Minor RaidDevice State
0 8 53 0 active sync /dev/sdd5
1 8 69 1 active sync /dev/sde5
2 8 85 2 active sync /dev/sdf5
3 8 117 3 active sync /dev/sdh5
4 8 101 4 active sync /dev/sdg5





--
# Jesse Molina
# The Translational Genomics Research Institute
# http://www.tgen.org
# Mail = jmolina@tgen.org
# Desk = 1.602.343.8459
# Cell = 1.602.323.7608



Re: resync=PENDING, interrupted RAID5 grow will not automatically reconstruct

on 19.06.2008 06:06:16 by NeilBrown

On Tuesday June 17, jmolina@tgen.org wrote:
>
> I think I figured this out.
>
> man md
>
> Read the section regarding the sync_action file.
>
> do as root: "echo idle > /sys/block/md2/md/sync_action"
>
> After issuing the idle command, my array says:
>
> user@host# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
>
> md2 : active raid5 sdd5[0] sdg5[4] sdh5[3] sdf5[2] sde5[1]
>       325283840 blocks super 0.91 level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
>       [=================>...]  reshape = 85.3% (138827968/162641920) finish=279.7min speed=1416K/sec

Well done :-)

Yes, you've found the bug that was fixed by commit
25156198235325805cd7295ed694509fd6e3a29e
which is in 2.6.25.
I'm glad you found the workaround.

NeilBrown

Re: Failed RAID5 array grow after reboot interruption; mdadm: Failed to restore critical section for reshape, sorry.

on 19.06.2008 06:25:34 by NeilBrown

On Monday June 16, jmolina@tgen.org wrote:
>
> During the grow process, this system slowly went unresponsive, and I
> was forced to reboot it after about 30 hours. At first I was not
> able to run any mdadm commands to see the status of the grow (about
> 30 minutes after starting), then I was not able to log in with a new
> shell, then after about 24 hours I was able to use a previously
> opened shell to see that tons of CRON jobs and other work had backed
> up, however during all of this time the system was still acting as
> an IP router doing NAT. Finally, after about 30 hours, the dhcpd
> daemon stopped giving out leases and then finally traffic stopped
> and I could not ping the host any longer (not a lease problem).

This is a bit of a worry. It sounds like the system was running out
of memory. It would seem to suggest that either the reshape process
was leaking memory, or that it was blocking writeout somehow so that
other memory wasn't getting freed.
However I cannot measure it doing either of these things.

If you can reproduce this, I'd love to see the content of
/proc/meminfo
/proc/slabinfo
/proc/slab_allocators

at 5-minute intervals. But I don't expect you'll want to try that
experiment :-)
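If anyone does want to try, a minimal sketch of capturing those snapshots (the log path is just an example):

  # run in the background while the reshape is going
  while true; do
      date >> /root/md-meminfo.log
      cat /proc/meminfo /proc/slabinfo /proc/slab_allocators >> /root/md-meminfo.log 2>/dev/null
      sleep 300
  done &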


NeilBrown