Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 04:26:20 by Melinda Taylor

Hello :)

We have a computer based at the South Pole which has a degraded raid 5
array across 4 disks. One of the 4 HDDs mechanically failed but we have
brought the majority of the system back online except for the raid5
array. I am pretty sure that data on the remaining 3 partitions that
made up the raid5 array is intact - just confused. The reason I know
this is that just before we took the system down, the raid5 array
(mounted as /home) was still readable and writable even though
/proc/mdstat said:

md2 : active raid5 hdd5[3] hdb5[2](F) hdc5[1] hda5[0](F)
      844809600 blocks level 5, 128k chunk, algorithm 2 [4/2] [_U_U]

When I tried to turn on the raid5 set /dev/md2 after replacing the
failed disk I saw the following errors:

Jul 8 12:35:28 planet kernel: [events: 0000003b]
Jul 8 12:35:28 planet kernel: [events: 00000000]
Jul 8 12:35:28 planet kernel: md: invalid raid superblock magic on hdc5
Jul 8 12:35:28 planet kernel: md: hdc5 has invalid sb, not importing!
Jul 8 12:35:28 planet kernel: md: could not import hdc5, trying to run
array nevertheless.
Jul 8 12:35:28 planet kernel: [events: 00000039]
Jul 8 12:35:28 planet kernel: [events: 0000003b]
Jul 8 12:35:28 planet kernel: md: autorun ...
Jul 8 12:35:28 planet kernel: md: considering hdd5 ...
Jul 8 12:35:28 planet kernel: md: adding hdd5 ...
Jul 8 12:35:28 planet kernel: md: adding hdb5 ...
Jul 8 12:35:28 planet kernel: md: adding hda5 ...
Jul 8 12:35:28 planet kernel: md: created md2
Jul 8 12:35:28 planet kernel: md: bind
Jul 8 12:35:28 planet kernel: md: bind
Jul 8 12:35:28 planet kernel: md: bind
Jul 8 12:35:28 planet kernel: md: running:
Jul 8 12:35:28 planet kernel: md: hdd5's event counter: 0000003b
Jul 8 12:35:28 planet kernel: md: hdb5's event counter: 00000039
Jul 8 12:35:28 planet kernel: md: hda5's event counter: 0000003b
Jul 8 12:35:28 planet kernel: md: superblock update time inconsistency
-- using the most recent one
Jul 8 12:35:28 planet kernel: md: freshest: hdd5
Jul 8 12:35:28 planet kernel: md: kicking non-fresh hdb5 from array!
Jul 8 12:35:28 planet kernel: md: unbind
Jul 8 12:35:28 planet kernel: md: export_rdev(hdb5)
Jul 8 12:35:28 planet kernel: md: device name has changed from hdc5 to
hda5 since last import!
Jul 8 12:35:28 planet kernel: md2: removing former faulty hda5!
Jul 8 12:35:28 planet kernel: md2: removing former faulty hdb5!
Jul 8 12:35:28 planet kernel: md: md2: raid array is not clean --
starting background reconstruction
Jul 8 12:35:28 planet kernel: md2: max total readahead window set to 1536k
Jul 8 12:35:28 planet kernel: md2: 3 data-disks, max readahead per
data-disk: 512k
Jul 8 12:35:28 planet kernel: raid5: device hdd5 operational as raid disk 3
Jul 8 12:35:28 planet kernel: raid5: device hda5 operational as raid disk 1
Jul 8 12:35:28 planet kernel: raid5: not enough operational devices
for md2 (2/4 failed)
Jul 8 12:35:28 planet kernel: RAID5 conf printout:
Jul 8 12:35:28 planet kernel: --- rd:4 wd:2 fd:2
Jul 8 12:35:28 planet kernel: disk 0, s:0, o:0, n:0 rd:0 us:1 dev:[dev
00:00]
Jul 8 12:35:28 planet kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hda5
Jul 8 12:35:28 planet kernel: disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev
00:00]
Jul 8 12:35:28 planet kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdd5
Jul 8 12:35:28 planet kernel: raid5: failed to run raid set md2
Jul 8 12:35:28 planet kernel: md: pers->run() failed ...
Jul 8 12:35:28 planet kernel: md :do_md_run() returned -22
Jul 8 12:35:28 planet kernel: md: md2 stopped.
Jul 8 12:35:28 planet kernel: md: unbind
Jul 8 12:35:28 planet kernel: md: export_rdev(hdd5)
Jul 8 12:35:28 planet kernel: md: unbind
Jul 8 12:35:28 planet kernel: md: export_rdev(hda5)
Jul 8 12:35:28 planet kernel: md: ... autorun DONE.

I was googling for solutions and found the mdadm package and installed
it and tried:

[root@planet mdadm-1.12.0]# mdadm --assemble --force /dev/md2
/dev/hdd5 /dev/hda5 /dev/hdb5 /dev/hdc5
mdadm: no RAID superblock on /dev/hdc5
mdadm: /dev/hdc5 has no superblock - assembly aborted

/dev/hdc is the new disk I have just installed to replace the failed
one (/dev/hda). I have partitioned it correctly, and in fact one
partition (/dev/hdc1) is now happily part of another raid1 set on the
system, so I know all is good with /dev/hdc.

My current /proc/mdstat file looks like this (i.e. missing the raid5 set):

Personalities : [raid1] [raid5]
read_ahead 1024 sectors
md1 : active raid1 hdc2[0] hda2[1]
10241344 blocks [2/2] [UU]

md0 : active raid1 hdc1[0] hda1[1]
104320 blocks [2/2] [UU]

unused devices:

Can anyone offer any suggestions as to how to get past the "/dev/hdc5
has no superblock" message?

The data has all been backed up, so as a last resort I will rebuild the
raid5 array from scratch, but it would be nice to just reassemble it with
the data intact, as I am sure /dev/hdd5, /dev/hdb5 and /dev/hda5 are
actually all OK.

Many Thanks,

Melinda




Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 05:38:48 by Daniel Pittman

On 8 Jul 2005, Melinda Taylor wrote:
> We have a computer based at the South Pole which has a degraded raid 5
> array across 4 disks. One of the 4 HDDs mechanically failed but we have
> brought the majority of the system back online except for the raid5
> array. I am pretty sure that data on the remaining 3 partitions that
> made up the raid5 array is intact - just confused. The reason I know
> this is that just before we took the system down, the raid5 array
> (mounted as /home) was still readable and writable even though
> /proc/mdstat said:

[...]

> [root@planet mdadm-1.12.0]# mdadm --assemble --force /dev/md2
> /dev/hdd5 /dev/hda5 /dev/hdb5 /dev/hdc5
> mdadm: no RAID superblock on /dev/hdc5
> mdadm: /dev/hdc5 has no superblock - assembly aborted
>
> /dev/hdc is the new disk I have just installed to replace the failed
> one (/dev/hda). I have partitioned it correctly and in fact one
> partition on /dev/hdc1 is now happily part of another raid1 set on the
> system so I know all is good with /dev/hdc

What you want to do is start the array as degraded, using *only* the
devices that were part of the disk set. Substitute 'missing' for the
last device if needed but, IIRC, you should be able to say just:

] mdadm --assemble --force /dev/md2 /dev/hd[abd]5

Don't forget to fsck the filesystem thoroughly at this point. :)
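
Something like the following, though this is only a sketch -- I'm assuming
the array comes up with three of its four members and that /home is ext2/ext3:

] cat /proc/mdstat        # md2 should now be running, degraded (3 of 4 disks)
] fsck -f /dev/md2        # substitute the fsck for whatever filesystem /home uses

If the forced assemble won't start the array with only three members listed,
adding --run (-R) to the assemble command tells mdadm to start it anyway.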

Once the array is up and running, you *then* add the new disk to it by
saying:

] mdadm -a /dev/md2 /dev/hdc5

The array will then recover and, hopefully, life is good.
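
You can keep an eye on the rebuild while it runs -- again just a sketch, the
exact output varies with your kernel and mdadm versions:

] cat /proc/mdstat          # shows resync progress and an ETA
] mdadm --detail /dev/md2   # shows per-device state while hdc5 is rebuilt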

Daniel
--
Anyone who goes to a psychiatrist ought to have his head examined.
-- Samuel Goldwyn


Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 10:43:46 by Molle Bestefich

> On 8 Jul 2005, Melinda Taylor wrote:
> > We have a computer based at the South Pole which has a degraded raid 5
> > array across 4 disks. One of the 4 HDDs mechanically failed but we have
> > brought the majority of the system back online except for the raid5
> > array. I am pretty sure that data on the remaining 3 partitions that
> > made up the raid5 array is intact - just confused. The reason I know
> > this is that just before we took the system down, the raid5 array
> > (mounted as /home) was still readable and writable even though
> > /proc/mdstat said:

On 7/8/05, Daniel Pittman wrote:
> What you want to do is start the array as degraded, using *only* the
> devices that were part of the disk set. Substitute 'missing' for the
> last device if needed but, IIRC, you should be able to say just:
>
> ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5
>
> Don't forget to fsck the filesystem thoroughly at this point. :)

At this point, before adding the new disk, I'd suggest making *very*
sure that the event counters match on the three existing disks.
Because if they don't, MD will add the new disk with an event counter
matching the freshest disk in the array. That will cause it to start
synchronizing onto one of the good disks instead of onto the newly
added disk.... Happened to me once, gah.
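
A quick way to compare them (a sketch only; I'm assuming the standard
0.90-format superblocks, which is what this array looks like it uses):

  mdadm --examine /dev/hd[abd]5 | grep -i events

All three Events values should be identical before the new disk goes in.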

> Once the array is up and running, you *then* add the new disk to it by
> saying:
>
> ] mdadm -a /dev/md2 /dev/hdc5
>
> The array will then recover and, hopefully, life is good.

Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 11:26:37 by Daniel Pittman

On 8 Jul 2005, Molle Bestefich wrote:
>> On 8 Jul 2005, Melinda Taylor wrote:
>>> We have a computer based at the South Pole which has a degraded raid 5
>>> array across 4 disks. One of the 4 HDDs mechanically failed but we have
>>> brought the majority of the system back online except for the raid5
>>> array. I am pretty sure that data on the remaining 3 partitions that
>>> made up the raid5 array is intact - just confused. The reason I know
>>> this is that just before we took the system down, the raid5 array
>>> (mounted as /home) was still readable and writable even though
>>> /proc/mdstat said:
>
> On 7/8/05, Daniel Pittman wrote:
>> What you want to do is start the array as degraded, using *only* the
>> devices that were part of the disk set. Substitute 'missing' for the
>> last device if needed but, IIRC, you should be able to say just:
>>
>> ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5
>>
>> Don't forget to fsck the filesystem thoroughly at this point. :)
>
> At this point, before adding the new disk, I'd suggest making *very*
> sure that the event counters match on the three existing disks.
> Because if they don't, MD will add the new disk with an event counter
> matching the freshest disk in the array. That will cause it to start
> synchronizing onto one of the good disks instead of onto the newly
> added disk.... Happened to me once, gah.

Ack! I didn't know that. If the event counters don't match up, what
can you do to correct the problem?

Daniel
--
The problem with defending the purity of the English language is that English
is about as pure as a cribhouse whore. We don't just borrow words; on
occasion, English has pursued other languages down alleyways to beat them
unconscious and rifle their pockets for new vocabulary.
-- James D. Nicoll


Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 13:24:00 by Neil Brown

On Friday July 8, daniel@rimspace.net wrote:
> On 8 Jul 2005, Molle Bestefich wrote:
> >> On 8 Jul 2005, Melinda Taylor wrote:
> >>> We have a computer based at the South Pole which has a degraded raid 5
> >>> array across 4 disks. One of the 4 HDDs mechanically failed but we have
> >>> brought the majority of the system back online except for the raid5
> >>> array. I am pretty sure that data on the remaining 3 partitions that
> >>> made up the raid5 array is intact - just confused. The reason I know
> >>> this is that just before we took the system down, the raid5 array
> >>> (mounted as /home) was still readable and writable even though
> >>> /proc/mdstat said:
> >
> > On 7/8/05, Daniel Pittman wrote:
> >> What you want to do is start the array as degraded, using *only* the
> >> devices that were part of the disk set. Substitute 'missing' for the
> >> last device if needed but, IIRC, you should be able to say just:
> >>
> >> ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5
> >>
> >> Don't forget to fsck the filesystem thoroughly at this point. :)
> >
> > At this point, before adding the new disk, I'd suggest making *very*
> > sure that the event counters match on the three existing disks.
> > Because if they don't, MD will add the new disk with an event counter
> > matching the freshest disk in the array. That will cause it to start
> > synchronizing onto one of the good disks instead of onto the newly
> > added disk.... Happened to me once, gah.
>
> Ack! I didn't know that. If the event counters don't match up, what
> can you do to correct the problem?

The "--assemble --force" should result in all the event counters of
the named drives being the same. Then it should be perfectly safe to
add the new drive.
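
If in doubt, it costs nothing to re-check the superblocks after the forced
assemble (just a sketch; the exact output depends on the superblock format):

  mdadm --examine /dev/hd[abd]5 | grep Events

The three counters should now agree, and the array should be running
degraded before the new drive is added.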

I cannot quite imagine a situation as described by Molle. If it was
at all reproducible I'd love to hear more details.

NeilBrown

Re: Degraded raid5 returns mdadm: /dev/hdc5 has no superblock - assembly aborted

On 08.07.2005 19:53:53 by Molle Bestefich

> On Friday July 8, daniel@rimspace.net wrote:
> > On 8 Jul 2005, Molle Bestefich wrote:
> > >> On 8 Jul 2005, Melinda Taylor wrote:
> > >>> We have a computer based at the South Pole which has a degraded raid 5
> > >>> array across 4 disks. One of the 4 HDDs mechanically failed but we have
> > >>> brought the majority of the system back online except for the raid5
> > >>> array. I am pretty sure that data on the remaining 3 partitions that
> > >>> made up the raid5 array is intact - just confused. The reason I know
> > >>> this is that just before we took the system down, the raid5 array
> > >>> (mounted as /home) was still readable and writable even though
> > >>> /proc/mdstat said:
> > >
> > > On 7/8/05, Daniel Pittman wrote:
> > >> What you want to do is start the array as degraded, using *only* the
> > >> devices that were part of the disk set. Substitute 'missing' for the
> > >> last device if needed but, IIRC, you should be able to say just:
> > >>
> > >> ] mdadm --assemble --force /dev/md2 /dev/hd[abd]5
> > >>
> > >> Don't forget to fsck the filesystem thoroughly at this point. :)
> > >
> > > At this point, before adding the new disk, I'd suggest making *very*
> > > sure that the event counters match on the three existing disks.
> > > Because if they don't, MD will add the new disk with an event counter
> > > matching the freshest disk in the array. That will cause it to start
> > > synchronizing onto one of the good disks instead of onto the newly
> > > added disk.... Happened to me once, gah.
> >
> > Ack! I didn't know that. If the event counters don't match up, what
> > can you do to correct the problem?

Daniel Pittman wrote:
> Ack! I didn't know that. If the event counters don't match up, what
> can you do to correct the problem?

In the 2.4 days, I think I used to plug cables in and out of the
disks, rebooting the system again and again until the counters were
aligned.

Neil Brown wrote:
> The "--assemble --force" should result in all the event counters of
> the named drives being the same. Then it should be perfectly safe to
> add the new drive.

Sounds like a better option!

> I cannot quite imagine a situation as described by Molle.

Fair enough, the situation just struck me as something I had seen
before, and it doesn't hurt to be sure..

> If it was at all reproducible I'd love to hear more details.

I'd rather not reproduce it :-).

It's happened a couple of times on a production system..
Once back when it was running 2.4 and an old version of MD, and once
while I was in the process of upgrading the box to 2.6 (so it might
have been while it was booted into 2.4.. not sure). The box used to
have two disks failing from time to time, one due to a semi-bad disk
and one due to a flaky SATA cable.

That's about all I can remember off the top of my head.