Debugging a strange array corruption

on 14.12.2010 09:10:07 by Brad Campbell

G'day all,

I have a 10 x 1TB drive RAID-6 here. It's been great for ages, but recently I've seen nasty random
corruption across the entire array that I cannot pin down.

The machine also has a number of RAID-1 arrays and a RAID-5, which are all behaving perfectly.

The machine has 16GB of RAM, so all my read tests are done with dd bs=1G count=20 to make sure I'm
actually hitting the disk somewhere.

The array is partitioned into three approximately equal partitions.

If I do something like -

for i in `seq 3` ; do dd if=/dev/md0p1 bs=1G count=20 | md5sum ; done

- I get three completely different checksums

The filesystems are unmounted and the array is idle.
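
A minimal sketch of the repeated-read test above, wrapped so that a mismatch is flagged automatically instead of being compared by eye. The function name repeat_read_check and its argument order are illustrative and not part of the original post; only the dd | md5sum pipeline is taken from it.

```shell
#!/bin/sh
# Sketch only: read the same region of TARGET several times and report
# whether every run produced the same MD5 checksum.
# Usage: repeat_read_check TARGET RUNS BLOCK_SIZE COUNT
# (repeat_read_check is a hypothetical helper name, not an existing tool.)
repeat_read_check() {
    target=$1; runs=$2; bs=$3; count=$4
    [ -r "$target" ] || { echo "cannot read $target" >&2; return 2; }
    first=""
    status=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Same pipeline as in the post; keep only the hash column of md5sum.
        sum=$(dd if="$target" bs="$bs" count="$count" 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "run $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            status=1
        fi
        i=$((i + 1))
    done
    if [ "$status" -eq 0 ]; then
        echo "CONSISTENT"
    else
        echo "MISMATCH"
    fi
    return "$status"
}
```

Calling repeat_read_check /dev/md0p1 3 1G 20 would repeat the test from the post. As noted there, the amount read has to exceed the 16GB of RAM, otherwise later runs may be served from the page cache rather than the disks.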

I've run the same test individually on all 10 disks in the array and they all appear to give
consistent data. Reading anything from the array gives me mostly correct data with intermittent garbage.

I've tried both a 2.6.36.[12] kernel, and I'm currently running 2.6.37-rc5-git3 with the same odd
results.

All the disks pass long SMART tests. They all checksum correctly from end to end with repeated
sequential runs.

No libata errors in the logs.

The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042 controllers and 2 are
on a SIL3132. This has occurred since I upgraded the mainboard (and kernel at the same time -
nothing like throwing more variables in the mix) and its effects were subtle enough that I missed
them until it had successfully rotated out all of my good backups with broken data. Lesson learned.

I'm stumped and I don't even know where to begin. I've never seen something like this happen without
a bad disk, controller or cable and they are easy to diagnose.

Regards,
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Debugging a strange array corruption

on 14.12.2010 10:22:46 by Roman Mamedov

On Tue, 14 Dec 2010 16:10:07 +0800
Brad Campbell wrote:

> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
> mainboard (and kernel at the same time - nothing like throwing more
> variables in the mix) and its effects were subtle enough that I missed them
> until it had successfully rotated out all of my good backups with broken
> data. Lesson learned.

I'd suggest that you try moving the two disks away from the SiI3132: change your
setup so that at most ONE port on that controller is used, or none at all.

Some time ago there was a report of data corruption with controllers using
that chip when both ports simultaneously read at full speed:
http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
Perhaps the problem is not in the chip itself, but in some variations of
schematics/components/soldering, because only two of five supposedly identical
boards the reporter bought were corrupting data in that way, one much
more often than the other.

--
With respect,
Roman


Re: Debugging a strange array corruption

on 14.12.2010 10:37:53 by Brad Campbell

On 14/12/10 17:22, Roman Mamedov wrote:
> On Tue, 14 Dec 2010 16:10:07 +0800
> Brad Campbell wrote:
>
>> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
>> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
>> mainboard (and kernel at the same time - nothing like throwing more
>> variables in the mix) and its effects were subtle enough that I missed them
>> until it had successfully rotated out all of my good backups with broken
>> data. Lesson learned.
>
> I'd suggest that you try moving two disks away from SiI3132, change your
> setup so that at most ONE port on that controller is used, or none at all.
>
> Some time ago there was a report of data corruption with controllers using
> that chip when both ports simultaneously read at full speed:
> http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
> Perhaps problem not in the chip itself, but in some variations of
> schematics/components/soldering, because only two of five supposedly identical
> boards the reporter bought were corrupting data in that way, one much
> more often than the other.
>

And in the prior incarnation I was only using 1 port on that controller so the problem would never
have manifested itself. Thanks, at least I have something to try.

Regards,

Re: Debugging a strange array corruption

on 14.12.2010 10:42:31 by Roman Mamedov

On Tue, 14 Dec 2010 17:37:53 +0800
Brad Campbell wrote:

> And in the prior incarnation I was only using 1 port on that controller so
> the problem would never have manifested itself. Thanks, at least I have
> something to try.

To add another idea to my previous reply, it should be pretty easy to test for
the presence of this particular problem by running your dd+md5sum test on just
the two physical disks which are plugged into that controller, starting the
tests simultaneously on both disks (in two terminal windows). See if you get
varying sums over several runs that way.
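
The simultaneous test described above can also be run from a single shell instead of two terminal windows. This is only a sketch: the function name concurrent_read_check, its arguments, and the temporary-directory bookkeeping are assumptions for illustration, while the dd | md5sum pipeline is the one from the original post.

```shell
#!/bin/sh
# Sketch only: read two devices at the same time, several times over, and
# count how many distinct checksums each one produces.
# Usage: concurrent_read_check DISK_A DISK_B RUNS BLOCK_SIZE COUNT
# (concurrent_read_check is a hypothetical helper name, not an existing tool.)
concurrent_read_check() {
    disk_a=$1; disk_b=$2; runs=$3; bs=$4; count=$5
    [ -r "$disk_a" ] && [ -r "$disk_b" ] || { echo "cannot read inputs" >&2; return 2; }
    out_dir=$(mktemp -d) || return 2
    i=1
    while [ "$i" -le "$runs" ]; do
        # Background both pipelines so the two reads overlap, then wait for both.
        dd if="$disk_a" bs="$bs" count="$count" 2>/dev/null | md5sum | cut -d' ' -f1 >>"$out_dir/a" &
        dd if="$disk_b" bs="$bs" count="$count" 2>/dev/null | md5sum | cut -d' ' -f1 >>"$out_dir/b" &
        wait
        i=$((i + 1))
    done
    status=0
    for side in a b; do
        # One distinct checksum means every run of that disk agreed.
        distinct=$(sort -u "$out_dir/$side" | wc -l | tr -d ' ')
        echo "disk $side: $distinct distinct checksum(s) in $runs runs"
        [ "$distinct" -eq 1 ] || status=1
    done
    rm -rf "$out_dir"
    return "$status"
}
```

A call might look like concurrent_read_check /dev/sdX /dev/sdY 5 1G 20, where /dev/sdX and /dev/sdY are placeholders for the two disks attached to the controller under suspicion. More than one distinct checksum for either disk, when the same disk reads consistently on its own, would point at the concurrent-read problem.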


--
With respect,
Roman


Re: Debugging a strange array corruption

on 14.12.2010 11:29:31 by Brad Campbell

On 14/12/10 17:42, Roman Mamedov wrote:
> On Tue, 14 Dec 2010 17:37:53 +0800
> Brad Campbell wrote:
>
>> And in the prior incarnation I was only using 1 port on that controller so
>> the problem would never have manifested itself. Thanks, at least I have
>> something to try.
>
> To add another idea to my previous reply, it should be pretty easy to test for
> the presence of this particular problem by running your dd+md5sum test on just
> the two physical disks which are plugged into that controller, starting the
> tests simultaneously on both disks (in two terminal windows). See if you get
> varying sums over several runs that way.

I just finished that test. Can I cry now?

Thanks Roman, I'd *never* have pegged it otherwise.

Anyone want a slightly second hand SIL 2 port controller?

Regards,

Re: Debugging a strange array corruption

on 14.12.2010 12:59:17 by darewi

On Tue, Dec 14, 2010 at 03:22, Roman Mamedov wrote:
> On Tue, 14 Dec 2010 16:10:07 +0800
> Brad Campbell wrote:
>
>> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
>> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
>> mainboard (and kernel at the same time - nothing like throwing more
>> variables in the mix) and its effects were subtle enough that I missed them
>> until it had successfully rotated out all of my good backups with broken
>> data. Lesson learned.
>
> I'd suggest that you try moving two disks away from SiI3132, change your
> setup so that at most ONE port on that controller is used, or none at all.
>
> Some time ago there was a report of data corruption with controllers using
> that chip when both ports simultaneously read at full speed:
> http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
> Perhaps problem not in the chip itself, but in some variations of
> schematics/components/soldering, because only two of five supposedly identical
> boards the reporter bought were corrupting data in that way, one much
> more often than the other.
>

Interesting... Any idea if the problem affects the SiI 3114 chipset as
well? I've been seeing some similar problems, but haven't had enough
time to dig into it to query the list yet.

--
david williams


Re: Debugging a strange array corruption

on 14.12.2010 13:07:20 by Roman Mamedov

On Tue, 14 Dec 2010 05:59:17 -0600
"David W." wrote:

> Interesting... Any idea if the problem affects the SiI 3114 chipset as
> well? I've been seeing some similar problems, but haven't had enough
> time to dig into it to query the list yet.

There are data corruption issues reported with 3114 as well, not sure how many
of those have been worked around, traced down to something else, or still
remain. See https://encrypted.google.com/search?q=3114+data+corruption

--
With respect,
Roman
