SSD - TRIM command

on 07.02.2011 21:07:33 by Roberto Spadim

Hi guys, could md send the TRIM command to an SSD, using the ext4
discard mount option? If I mix SSDs and HDs, could this TRIM be
rewritten for disks that are not TRIM compatible?

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: SSD - TRIM command

on 08.02.2011 18:37:19 by Maurice

On 2/7/2011 1:07 PM, Roberto Spadim wrote:
> Hi guys, could md send the TRIM command to an SSD, using the ext4
> discard mount option? If I mix SSDs and HDs, could this TRIM be
> rewritten for disks that are not TRIM compatible?
>
I have read that using md with SSDs is not a great idea.
From the Fedora 14 documentation:

"Take note as well that software RAID levels 1, 4, 5, and 6 are not
recommended for use on SSDs. During the initialization stage of these
RAID levels, some RAID management utilities (such as mdadm) write to
all of the blocks on the storage device to ensure that checksums
operate properly. This will cause the performance of the SSD to
degrade quickly."

https://docs.fedoraproject.org/en-US/Fedora/14/html/Storage_Administration_Guide/newmds-ssdtuning.html


--
Cheers,
Maurice Hilarius
eMail: /mhilarius@gmail.com/

Re: SSD - TRIM command

on 08.02.2011 19:31:24 by Roberto Spadim

Is this because of the resync running?
I don't think it's a problem...
Any device will die some day...
SSD is faster than HD, so why not use it?
I'm using an HP Smart Array P212 with 3.0 firmware, and it writes to
all blocks too.
Maybe just a command line option to start the array without a sync
could help...
I don't know if resync is write intensive or only writes the blocks
that differ; if it only writes the differences, it's not a problem
for SSDs...

Again...
I know that 'translating' the TRIM command for non-compatible devices
is a problem for the device layer, not the md layer, but can md send
the TRIM command to all mirrors/disks?

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

on 08.02.2011 21:50:17 by Roberto Spadim

=] now the right answer :)
Question: maybe in the future... could we make TRIM compatible with md?

Note:
As I understand it, TRIM is just a way for an SSD to mark sectors as
clean without writing zeros over the entire sector (an SSD
optimization).
If TRIM is translated at the device level for disks that don't support
it, could we send TRIM to all disks in the md device?
It could be just an option, something like
mdadm --assemble --allow-trim,
and md would pass on the TRIMs it receives from the filesystem.

2011/2/8 Scott E. Armitage :
> The problem as I understand it is that md treats the entire device (or
> partition) as "in use" -- even if the filesystem isn't using a particular
> set of blocks, those blocks must still be consistent across the array. The
> SSD TRIM command is used to tell the physical drive which blocks are no
> longer in use by the filesystem, so that it can optimize write operations.
> Running under md, all blocks would be "used", so there would be nothing to
> send with the TRIM command.
> -Scott

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

on 08.02.2011 22:18:28 by Maurice

On 2/8/2011 1:50 PM, Roberto Spadim wrote:
> =] now the right answer :)
> question: maybe in future... could we make trim compatible with md?
>
I hope that future is "real soon now".
MLC SSD is now starting to appear in the "Enterprise" space.
Companies like Pliant have released products for that.
Typical SAN RAID controllers have specific performance limits, which
can be saturated with a fairly small number of SSDs.
To get higher I/O rates we need a more powerful RAID engine.
A typical 48-core, 128 GB RAM box using AMD CPUs and 4 SAS HBAs
attached to JBOD disk enclosures can be a ridiculously powerful RAID
engine for a reasonable cost (at least reasonable compared to NetApp,
EMC, Hitachi SANs, etc.) with a large number of devices.

BUT: To use SSDs in the design we need mdadm to be more SSD friendly.


--
Cheers,
Maurice Hilarius
eMail: /mhilarius@gmail.com/

Re: SSD - TRIM command

on 08.02.2011 22:33:03 by Roberto Spadim

Yeah, we will make it :)
Maurice, I have been working on a new raid1 read balance; could you
help me benchmark it?
It's based on kernel 2.6.37; here is the code:
www.spadim.com.br/raid1/
There are raid1.new.c and raid1.new.h, plus raid1.old.c and
raid1.old.h: the old and new kernel source code.

From user space we can now use:
/sys/block/mdXXX/md/read_balance_mode
/sys/block/mdXXX/md/read_balance_stripe_shift
/sys/block/mdXXX/md/read_balance_config

read_balance_mode now has 4 modes:
near_head (default, working without problems, very good for HD-only
arrays; SSDs should use another mode)

round_robin (normal round robin, with a per-mirror counter so that it
can move to the next mirror after some number of reads; very good for
SSD-only arrays)

stripe (like raid0; with read_balance_stripe_shift we shift the
sector number with ">>" and then select the disk with % raid_disks;
very good for HD or SSD; a good shift value is >= 5, but not too
large, since that can make the formula use only the first disk)

time_based (based on head positioning time + read time + I/O queue
time, selecting the best disk to read from; works very well with SSD
and HD; the current implementation doesn't include I/O queue time,
but I will study it and put it to work too)
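
The stripe and round_robin modes described above could be sketched
roughly as follows. This is an illustrative Python model, not the code
from raid1.new.c: the names pick_stripe, RoundRobin and
reads_per_mirror are hypothetical, and only the ">> shift" followed by
"% raid_disks" rule is taken from the description.

```python
def pick_stripe(sector: int, shift: int, raid_disks: int) -> int:
    """Stripe mode: shift the sector number right, then take it modulo
    the number of mirrors to choose which disk serves the read."""
    return (sector >> shift) % raid_disks


class RoundRobin:
    """Round-robin mode with a per-mirror read counter (assumed behaviour):
    stay on one mirror for reads_per_mirror reads, then move to the next."""

    def __init__(self, raid_disks: int, reads_per_mirror: int = 1) -> None:
        self.raid_disks = raid_disks
        self.reads_per_mirror = reads_per_mirror
        self.current = 0   # mirror currently being read from
        self.count = 0     # reads already issued to that mirror

    def pick(self) -> int:
        # Advance to the next mirror once the current one has had its share.
        if self.count >= self.reads_per_mirror:
            self.current = (self.current + 1) % self.raid_disks
            self.count = 0
        self.count += 1
        return self.current
```

With a shift of 5 and two disks, sectors 0-31 map to disk 0 and
sectors 32-63 to disk 1. A very large shift turns every sector number
into 0, so every read lands on disk 0, which is the caveat mentioned
for the stripe mode.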

All configuration for round_robin and time_based is sent to the
kernel through read_balance_config.
Run cat /sys/block/mdXXX/md/read_balance_config and send the
parameters per disk.
The first line of the cat output is the parameter list; everything
after | is a read-only variable that you can read but not change.
Use echo "0 0 0 0 0 0 0 0 0 0" > read_balance_config to change values.

Thanks =]

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

on 09.02.2011 08:44:31 by Stan Hoeppner

maurice put forth on 2/8/2011 11:37 AM:
> On 2/7/2011 1:07 PM, Roberto Spadim wrote:
>> Hi guys, could md send the TRIM command to an SSD, using the ext4
>> discard mount option? If I mix SSDs and HDs, could this TRIM be
>> rewritten for disks that are not TRIM compatible?
>>
> I have read that using md with SSDs is not a great idea.
> From the Fedora 14 documentation:

Using any RAID level but pure striping with SSDs is a bad idea, for the exact
reason in that documentation: excessive writes.

SSD - Solid State Drive

Note the first two words. Solid state device = integrated circuit. ICs,
including those comprised of flash memory transistors, have totally different
failure modes than spinning rust disks, SRDs, or "plain old mechanical hard drives".

RAID'ing SSDs with any data duplicative RAID level, any mirroring or parity RAID
levels, _decreases_ the life of all SSDs in the array. This is the opposite
effect of what you want: reliability and lifespan.

People have a misconception that SSDs are like hard disks. The only thing they
have in common is that both store data and they can have a similar interface
(SATA). The similarities end there.

RAID is not a proper method of extending the life of SSD storage nor protecting
the data on SSD devices. If you want to pool all the capacity of multiple SSDs
into a single logical device, use RAID 0 or spanning, _not_ a mirror or parity
RAID level. If you want to protect the data, snap it to a single large SATA
drive, or a D2D backup array, and then to tape.

--
Stan

Re: SSD - TRIM command

on 09.02.2011 10:05:22 by edmudama

On Wed, Feb 9 at 1:44, Stan Hoeppner wrote:
>maurice put forth on 2/8/2011 11:37 AM:
>> On 2/7/2011 1:07 PM, Roberto Spadim wrote:
>>> Hi guys, could md send the TRIM command to an SSD, using the ext4
>>> discard mount option? If I mix SSDs and HDs, could this TRIM be
>>> rewritten for disks that are not TRIM compatible?
>>>
>> I have read that using md with SSDs is not a great idea.
>> From the Fedora 14 documentation:
>
>Using any RAID level but pure striping with SSDs is a bad idea, for the exact
>reason in that documentation: excessive writes.

If I mirror two SSDs, and write 1 unit of data to the mirror, each
element of the mirror should see 1 unit of write. How does this
perform excessive writes, compared to the same workload applied to a
single SSD?

I agree that in aggregate we've now done 2 units worth of writes,
however, in a mirror case, we're protecting against both whole-device
failure and single-sector failure modes, so hardly seems like a bad
idea in all applications.


--
Eric D. Mudama
edmudama@bounceswoosh.org


Re: SSD - TRIM command

on 09.02.2011 14:29:12 by David Brown

On 09/02/2011 08:44, Stan Hoeppner wrote:
> maurice put forth on 2/8/2011 11:37 AM:
>> On 2/7/2011 1:07 PM, Roberto Spadim wrote:
>>> Hi guys, could md send the TRIM command to an SSD, using the ext4
>>> discard mount option? If I mix SSDs and HDs, could this TRIM be
>>> rewritten for disks that are not TRIM compatible?
>>>
>> I have read that using md with SSDs is not a great idea.
>> From the Fedora 14 documentation:
>
> Using any RAID level but pure striping with SSDs is a bad idea, for
> the exact reason in that documentation: excessive writes.
>
> SSD - Solid State Drive
>
> Note the first two words. Solid state device = integrated circuit.
> ICs, including those comprised of flash memory transistors, have
> totally different failure modes than spinning rust disks, SRDs, or
> "plain old mechanical hard drives".
>
> RAID'ing SSDs with any data duplicative RAID level, any mirroring or
> parity RAID levels, _decreases_ the life of all SSDs in the array.
> This is the opposite effect of what you want: reliability and
> lifespan.
>
> People have a misconception that SSDs are like hard disks. The only
> thing they have in common is that both store data and they can have a
> similar interface (SATA). The similarities end there.
>
> RAID is not a proper method of extending the life of SSD storage nor
> protecting the data on SSD devices. If you want to pool all the
> capacity of multiple SSDs into a single logical device, use RAID 0 or
> spanning, _not_ a mirror or parity RAID level. If you want to
> protect the data, snap it to a single large SATA drive, or a D2D
> backup array, and then to tape.
>

First off, let me agree with you that backup is important no matter what
you use as your primary storage.

But beyond that, you've got a basic assumption wrong here.

Good quality, modern SSDs do not have write-endurance issues. It's a
thing of the past. Internally, of course, the flash /does/ have
endurance limits. But these are high (especially with SLC devices
rather than MLC devices), and the combination of ECC, wear-levelling and
redundant blocks means that you can write to these devices continuously
at high speed for /years/ before endurance issues become visible to the
host. An additional effect of the extensive ECC is that undetected read
errors are much less likely than with hard disks - when a failure /does/
occur, you know it has occurred.
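
As a rough back-of-the-envelope check of the "years" argument, drive
lifetime under continuous writing can be estimated as capacity times
rated program/erase cycles, divided by the host write rate times the
write amplification. The sketch below assumes ideal wear-levelling;
the function name and every number passed to it are illustrative
assumptions, not figures from this thread or from any datasheet.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_until_worn_out(capacity_gb: float,
                         pe_cycles: float,
                         write_mb_per_s: float,
                         write_amplification: float = 1.0) -> float:
    """Estimate how long a drive survives continuous writing.

    Assumes perfect wear-levelling, so every flash block reaches its
    rated program/erase count at the same time.
    """
    # Total host data (in MB) that can be written before the flash wears out.
    total_host_writes_mb = capacity_gb * 1000 * pe_cycles / write_amplification
    seconds = total_host_writes_mb / write_mb_per_s
    return seconds / SECONDS_PER_YEAR


# Hypothetical drive: 100 GB, 100,000 cycles, 100 MB/s written non-stop,
# write amplification of 2. All four values are assumptions.
estimate = years_until_worn_out(100, 100_000, 100, 2.0)
```

The result scales linearly with the cycle rating, so the SLC versus
MLC distinction made above matters a great deal for write-heavy
workloads.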

Many SSD models suffer from a certain amount of performance degradation
when they have been used for a while. Intel's devices were notorious
for this, though apparently they are better now. But that's a speed
issue, not a reliability or lifetime issue.

SSDs (again, I refer to good quality modern devices - earlier models had
more problems) are inherently more reliable than HDs, and have longer
expected lifetimes. This means that it is often fine to put your SSDs
in a RAID0 combination - you still have a greater reliability than you
would with a single HDD.

However, SSDs are not infallible - using redundant RAID with SSDs is a
perfectly valid setup. Obviously you will have a whole disk's worth of
extra writes when you set up the RAID, and redundant writes mean more
writes, but the SSDs will handle those writes perfectly well.


There is plenty of scope for md / SSD optimisation, however. Good TRIM
support is just one aspect. Other points include matching stripe sizes
to fit the geometry of the SSD, and taking advantage of the seek speeds
of SSD (this is particularly important if you are mirroring an SSD and
an HD).



Re: SSD - TRIM command

on 09.02.2011 15:39:44 by Roberto Spadim

Guys...
If my SSD fails, I buy another...
Let's get the software right; the hardware is another problem.
raid1 should work with floppy disks, hard disks, SSDs, nbd... that's
the point: make solutions for mixed hardware.
The question is simple: could we send the TRIM command to all mirrors
(for stripes, just to the disks that should receive it)? If a device
doesn't have TRIM, we should translate it into a similar command with
the same READ effect (no problem if it's not atomic).

On the subject of good reads: I sent an email to Maurice, and many
other emails to this raid list. There's a new read balance mode for
kernel 2.6.37; if you want to benchmark it, please test it:
www.spadim.com.br/raid1
For me it works very well with a mixed HD and SSD array. I need more
tests and benchmarks for Neil to accept it as a default feature of md.
The sysfs interface is still poor and should change in the future.
The time-based mode works, but some features (queue time estimation)
still have to be implemented.


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

on 09.02.2011 16:00:30 by launchpad

I reiterate my previous reply that under the current md architecture,
where the complete device is considered to be in use, sending TRIM
commands makes little sense. AFAICT, reading back a trimmed page is
not defined, since the whole idea is that the host doesn't care about
what is on that page any more.

The next time md comes around to corresponding trimmed pages on two
SSDs, their contents may differ, and all of a sudden our array is no
longer consistent.
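
The concern can be illustrated with a toy model of a two-way mirror.
This is only a sketch under a pessimistic assumption: the ToyDrive
class is hypothetical and returns arbitrary data for any trimmed
sector, which is one way of modelling "not defined". Real drives may
well return stable data.

```python
import random


class ToyDrive:
    """Toy drive whose trimmed sectors read back as arbitrary bytes."""

    def __init__(self, sectors: int, seed: int) -> None:
        self.data = [0] * sectors
        self.trimmed = set()
        self.rng = random.Random(seed)  # stands in for undefined contents

    def write(self, lba: int, value: int) -> None:
        self.data[lba] = value
        self.trimmed.discard(lba)       # a write makes the sector defined again

    def trim(self, lba: int) -> None:
        self.trimmed.add(lba)

    def read(self, lba: int) -> int:
        if lba in self.trimmed:
            return self.rng.randrange(256)  # undefined: anything may come back
        return self.data[lba]


def mismatched_sectors(a: ToyDrive, b: ToyDrive, sectors: int) -> list:
    """Compare both mirrors sector by sector, as a consistency check would."""
    return [lba for lba in range(sectors) if a.read(lba) != b.read(lba)]
```

In this model, writing the same data to both drives leaves the mirror
consistent, while trimming the same range on both makes the
comparison report mismatches in exactly that range, even though no
data the filesystem cares about has changed.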

--
Scott Armitage, B.A.Sc., M.A.Sc. candidate
Space Flight Laboratory
University of Toronto Institute for Aerospace Studies
4925 Dufferin Street, Toronto, Ontario, Canada, M3H 5T6

Re: SSD - TRIM command

on 09.02.2011 16:45:29 by Chris Worley

On Wed, Feb 9, 2011 at 2:05 AM, Eric D. Mudama
wrote:

> I agree that in aggregate we've now done 2 units worth of writes,
> however, in a mirror case, we're protecting against both whole-device
> failure and single-sector failure modes, so hardly seems like a bad
> idea in all applications.

Yes, just pass through the discards, and let us mirror. Sync'ing a
new drive w/o writing (only what needs to be written) is really
trivial (needs no extra saved metadata/LBA bitmaps or ability to query
the device for active sectors), if folks would give it some thought
and quit saying it shouldn't be done.

Re: SSD - TRIM command

on 09.02.2011 16:49:40 by David Brown

On 09/02/2011 15:39, Roberto Spadim wrote:
> Guys...
> If my SSD fails, I buy another...
> Let's get the software right; the hardware is another problem.
> raid1 should work with floppy disks, hard disks, SSDs, nbd... that's
> the point: make solutions for mixed hardware.
> The question is simple: could we send the TRIM command to all mirrors
> (for stripes, just to the disks that should receive it)? If a device
> doesn't have TRIM, we should translate it into a similar command with
> the same READ effect (no problem if it's not atomic).
>


I've been reading a little more about this. It seems that the days of
TRIM may well be numbered - the latest generation of high-end SSDs have
more powerful garbage collection algorithms, together with more spare
blocks, making TRIM pretty much redundant. This is, of course, the most
convenient solution for everyone (as long as it doesn't cost too much!).

The point of the TRIM command is to tell the SSD that a particular block
is no longer being used, so that the SSD can erase it in the background
- that way when you want to write more data, there are more free blocks
ready and waiting. But if you've got plenty of spare blocks, it's easy
to have them erased in advance and you don't need TRIM.



Re: SSD - TRIM command

am 09.02.2011 16:52:55 von Chris Worley

On Wed, Feb 9, 2011 at 8:00 AM, Scott E. Armitage wrote:

> AFAICT, reading back a trimmed page is
> not defined

... and so it should be assumed that reading a trimmed/nonexistent LBA
off two of the same vendor's SSDs would yield different results?

Re: SSD - TRIM command

am 09.02.2011 17:19:16 von edmudama

On Wed, Feb 9 at 10:00, Scott E. Armitage wrote:
>I reiterate my previous reply that under the current md architecture,
>where the complete device is considered to be in use, sending TRIM
>commands makes little sense. AFAICT, reading back a trimmed page is
>not defined, since the whole idea is that the host doesn't care about
>what is on that page any more.
>
>The next time md comes around to corresponding trimmed pages on two
>SSDs, their contents may differ, and all of a sudden our array is no
>longer consistent.

For SATA devices, ATA8-ACS2 addresses this through Deterministic Read
After Trim in the DATA SET MANAGEMENT command. Devices can be
indeterminate, determinate with a non-zero pattern (often all-ones) or
determinate all-zero for sectors read after being trimmed.
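
To make the three behaviours concrete for the mirroring question, here is a minimal Python sketch. The enum and function names are invented for illustration and are not part of ATA, libata, or md. It applies a deliberately conservative rule: mirrors are only treated as guaranteed-identical after a trim when every member reads back zeros, since two "deterministic" devices may still use different patterns.

```python
from enum import Enum


class ReadAfterTrim(Enum):
    """The three read-after-trim behaviours described above (names are illustrative)."""
    INDETERMINATE = "indeterminate"  # a read may return anything, and may change
    DETERMINISTIC = "deterministic"  # stable data, but the pattern is device-defined
    ZEROS = "zeros"                  # always reads back zeros


def mirrors_stay_consistent(modes):
    """Return True if trimming the same range on every mirror keeps reads identical.

    Conservative assumption of this sketch: only all-zero devices are trusted,
    because two DETERMINISTIC devices are not required to share a pattern.
    """
    return all(mode is ReadAfterTrim.ZEROS for mode in modes)
```

With two ZEROS members the function returns True; as soon as one member is INDETERMINATE or DETERMINISTIC it returns False.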

--eric

--
Eric D. Mudama
edmudama@bounceswoosh.org


Re: SSD - TRIM command

am 09.02.2011 17:28:35 von launchpad

Who sends this command? If md can assume that determinate mode is
always set, then RAID 1 at least would remain consistent. For RAID 5,
consistency of the parity information depends on the determinate
pattern used and the number of disks. If you used determinate
all-zero, then parity information would always be consistent, but this
is probably not preferable since every TRIM command would incur an
extra write for each bit in each page of the block.
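
The dependence on pattern and disk count can be checked with a few lines of Python. This is a sketch under stated assumptions: plain XOR parity, a whole stripe trimmed on every member (parity member included), and each member reading back the same single-byte pattern afterwards. The function names are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together, as RAID 5 does to build parity."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_consistent_after_trim(n_disks, pattern, block_size=8):
    """Trim one whole stripe of an n_disks RAID 5 and check its parity.

    Every member, including the one holding parity, is assumed to read back
    `pattern` (one byte value, 0..255) after the trim.
    """
    trimmed = bytes([pattern]) * block_size
    data_blocks = [trimmed] * (n_disks - 1)      # n_disks - 1 data chunks per stripe
    return xor_parity(data_blocks) == trimmed    # does the parity chunk still match?
```

With an all-zero pattern the stripe is consistent for any number of disks. With an all-ones pattern (0xFF) it is consistent on 4 disks (three data chunks XOR to 0xFF) but not on 3 disks (two data chunks XOR to 0x00).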

-S

--
Scott Armitage, B.A.Sc., M.A.Sc. candidate
Space Flight Laboratory
University of Toronto Institute for Aerospace Studies
4925 Dufferin Street, Toronto, Ontario, Canada, M3H 5T6

Re: SSD - TRIM command

am 09.02.2011 18:17:44 von edmudama

On Wed, Feb 9 at 11:28, Scott E. Armitage wrote:
>Who sends this command? If md can assume that determinate mode is
>always set, then RAID 1 at least would remain consistent. For RAID 5,
>consistency of the parity information depends on the determinate
>pattern used and the number of disks. If you used determinate
>all-zero, then parity information would always be consistent, but this
>is probably not preferable since every TRIM command would incur an
>extra write for each bit in each page of the block.

True, and there are several solutions. Maybe track space used via
some mechanism, such that when you trim you're only trimming the
entire stripe width so no parity is required for the trimmed regions.
Or, trust the drive's wear leveling and endurance rating, combined
with SMART data, to indicate when you need to replace the device
preemptively, ahead of eventual failure.
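
The "only trim the entire stripe width" idea can be sketched as a range calculation. This is an illustration only, not how md handles discards: the function name and the sector-based units are assumptions. It shrinks a discard request to the whole stripes it fully covers, so no partial-stripe parity update is needed.

```python
def full_stripe_range(start, length, chunk_sectors, data_disks):
    """Shrink a discard request to the whole stripes it covers.

    start, length: the request, in array sectors.
    Returns (aligned_start, aligned_length); aligned_length is 0 when the
    request does not cover any complete stripe.
    """
    stripe = chunk_sectors * data_disks           # data sectors per full stripe
    first = -(-start // stripe) * stripe          # round the start up to a stripe boundary
    last = (start + length) // stripe * stripe    # round the end down to a stripe boundary
    return first, max(0, last - first)
```

For example, with 128-sector chunks on three data disks (384-sector stripes), a request for sectors 100..1099 shrinks to (384, 384): only the second stripe is fully covered.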

It's not an unsolvable issue. If the RAID5 used distributed parity,
you could expect wear leveling to wear all the devices evenly, since
on average, the # of writes to all devices will be the same. Only a
RAID4 setup would see a lopsided amount of writes to a single device.
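
A small counting model shows the difference. It is a sketch with assumptions spelled out: every data chunk in every stripe is updated once by its own small write, each such write also rewrites that stripe's parity chunk, and the parity rotation order is illustrative rather than md's actual layout.

```python
def small_write_counts(n_disks, n_stripes, rotate):
    """Count chunk writes per member under one small write per data chunk.

    rotate=True  moves parity to a different member each stripe (RAID 5 style).
    rotate=False keeps parity on the last member (RAID 4 style).
    """
    counts = [0] * n_disks
    for stripe in range(n_stripes):
        parity_disk = (n_disks - 1 - stripe % n_disks) if rotate else n_disks - 1
        for disk in range(n_disks):
            if disk == parity_disk:
                counts[disk] += n_disks - 1   # parity rewritten once per data-chunk update
            else:
                counts[disk] += 1             # the data chunk itself
    return counts
```

For 4 disks and 8 stripes, the rotating layout gives 12 writes on every member, while the fixed-parity layout gives 8 writes on each data member and 24 on the parity member.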

--eric

--
Eric D. Mudama
edmudama@bounceswoosh.org


Re: SSD - TRIM command

am 09.02.2011 19:18:02 von Roberto Spadim

Who sends it?
ext4 sends TRIM commands to the device (disk/md raid/nbd).
Kernel swap sends these commands (when possible) to the device too.
For the internal raid5 parity disk this could be done by md; for data
disks this should be done by ext4.

The other question... about a resync that only writes what is different:
this is very good, since write and read speed can be different for an SSD
(an HD doesn't have this 'problem').
But I'm sure that writing just what differs is better than writing everything
(SSD life will be longer; HD maybe... I think it will be longer too).


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 19:24:26 von Piergiorgio Sartor

> ext4 send trim commands to device (disk/md raid/nbd)
> kernel swap send this commands (when possible) to device too
> for internal raid5 parity disk this could be done by md, for data
> disks this should be done by ext4

That's an interesting point.

On which basis should a parity "block" get a TRIM?

If you ask me, I think the complete TRIM story is, at
best, a temporary patch.

IMHO the wear levelling should be handled by the filesystem
and, with awareness of this, by the underlying device drivers.
The reason is that the FS knows better what's going on with the
blocks and what will happen.

bye,

pg


--

piergiorgio

Re: SSD - TRIM command

am 09.02.2011 19:30:15 von Roberto Spadim

nice =)
But note that a parity block is RAID information, not filesystem information.
For raid we could implement TRIM where possible (like swap),
and also take a TRIM that we receive from the filesystem and send it to all
disks (if it's a raid1 with mirrors, we should send it to all mirrors).
I don't know very well what TRIM does, but I think it's like a very big write
carried by only a few bits. For example:
set sector1='00000000000000000000000000000000000000000000000000'
could be replaced by:
trim sector1
It's faster for SATA communication, and it's good information for the
disk (it can put a single '0' at the start of the sector and know
that the whole sector is 0; if something tries to read it, the disk can answer
from internal memory without reading the media; if a write comes in, it should
first write 0000 to the bits and then do the write operation). But that's an
internal function of the hard disk/SSD, not a problem for md raid... md
raid just needs to know how to optimize and use it =]

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 19:38:15 von Piergiorgio Sartor

On Wed, Feb 09, 2011 at 04:30:15PM -0200, Roberto Spadim wrote:
> nice =)
> but check that parity block is a raid information, not a filesystem information
> for raid we could implement trim when possible (like swap)
> and implement a trim that we receive from filesystem, and send to all
> disks (if it's a raid1 with mirrors, we should sent to all mirrors)

To all disks also in the case of RAID-5?

What if the TRIM covers only a single SSD block
belonging to a single chunk of a stripe?
That is, a *single* SSD of the RAID-5.

Should md re-read the block and re-write (not TRIM)
the parity?

I think anything that has to do with checking &
repairing must be carefully considered...
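
For the single-chunk case, the parity arithmetic itself is simple. The sketch below assumes plain XOR parity and a device that reads back zeros after a trim; the function name is hypothetical and this is not md code. XOR parity satisfies new_parity = old_parity XOR old_data XOR new_data, and new_data is all zeros here, so the old data is simply XORed out. The cost is the point raised above: the old data must be read before the trim, and the parity must be rewritten rather than trimmed.

```python
def parity_after_chunk_trim(old_parity, old_data):
    """New parity for a stripe after one data chunk is discarded.

    Assumes the discarded chunk reads back as zeros, so the update
    reduces to XORing the chunk's old contents out of the old parity.
    """
    return bytes(p ^ d for p, d in zip(old_parity, old_data))


# Usage: a three-data-chunk stripe where the middle chunk is trimmed.
d0, d1, d2 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))
new_parity = parity_after_chunk_trim(parity, d1)
```

After the update, new_parity equals d0 XOR d2, which is what a check pass would compute once d1 reads as zeros.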

bye,

pg


--

piergiorgio

Re: SSD - TRIM command

am 09.02.2011 19:46:24 von Roberto Spadim

This is just a discussion, right? No implementation yet?

What I think:
If a device accepts TRIM, we can use TRIM.
If not, we must translate TRIM into something similar (maybe many WRITEs?),
so that a later READ from the disk returns the same data.
The translation could be done by the kernel (not md), maybe as an option in
libata, the nbd device...
The other option is to do it inside md, with an internal TRIM translate function.

Who sends TRIM?
Internal md information: md can generate it (if necessary, maybe it's
not...) for parity disks (not data disks).
Filesystem or another upper-layer program (a database with direct device
access): we could accept TRIM from the filesystem/database and send it to
the disks/mirrors, translating it when necessary (internal or kernel
translate function).
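
A toy model of that translation, in Python. Everything here is hypothetical: the Member class stands in for one array member and is not a kernel, libata, or md interface, and a trimmed sector is modelled as reading back zeros. The discard function uses TRIM when the member supports it and falls back to writing zeros otherwise, so both kinds of member return the same data on a later read.

```python
class Member:
    """Hypothetical stand-in for one array member (SSD or hard disk)."""

    def __init__(self, n_sectors, supports_trim, sector_size=512):
        self.sector_size = sector_size
        self.supports_trim = supports_trim
        self.sectors = [b"\xaa" * sector_size for _ in range(n_sectors)]
        self.writes = 0   # number of sector writes actually issued

    def trim(self, start, count):
        if not self.supports_trim:
            raise OSError("TRIM not supported")
        zero = bytes(self.sector_size)
        for lba in range(start, start + count):
            self.sectors[lba] = zero   # modelled as zero-after-trim

    def write(self, lba, data):
        self.sectors[lba] = data
        self.writes += 1


def discard(member, start, count):
    """Discard a range, translating to zero writes when TRIM is unavailable."""
    if member.supports_trim:
        member.trim(start, count)
    else:
        zero = bytes(member.sector_size)
        for lba in range(start, start + count):
            member.write(lba, zero)


# Usage: the same discard sent to an SSD-like and an HD-like member.
ssd = Member(8, supports_trim=True)
hdd = Member(8, supports_trim=False)
for m in (ssd, hdd):
    discard(m, 2, 3)
```

Both members end up with identical contents; the difference is that the member without TRIM paid three sector writes for it.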


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 19:52:01 von Roberto Spadim

The other question...
check and repair.
I don't know how resync is implemented today (I need to read the source code),
but reading, checking for differences, and writing only where a difference is
found is better than writing without checking.
Why better?
For SSD: it will have a longer life.
For HDD: I think it will have a longer life too (I THINK).
The problem: more operations.
Without check:
READ from source, WRITE to mirror
With check:
READ from source, READ from mirror, check diff, WRITE to mirror if diff
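
The two variants can be sketched as follows. This is a model of the idea only, not md's resync code: devices are plain Python lists of blocks, and the function name and the check flag are invented for illustration.

```python
def resync(source, mirror, check):
    """Copy source onto mirror block by block; return (reads, writes) issued.

    check=False: READ from source, always WRITE to mirror.
    check=True:  READ from source, READ from mirror, WRITE only if they differ.
    """
    reads = writes = 0
    for i, block in enumerate(source):
        reads += 1                  # read the source block
        if check:
            reads += 1              # extra read of the mirror block
            if mirror[i] == block:
                continue            # already identical, skip the write
        mirror[i] = block
        writes += 1
    return reads, writes


# Usage: a four-block mirror where only one block is stale.
source = [b"a", b"b", b"c", b"d"]
stale = [b"a", b"x", b"c", b"d"]
without_check = resync(source, list(stale), check=False)
target = list(stale)
with_check = resync(source, target, check=True)
```

Here the unchecked resync issues 4 reads and 4 writes, while the checked resync issues 8 reads and 1 write: more operations overall, but far fewer writes to the mirror.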

Maybe an mdadm option could set the md device to RESYNC WITH CHECK
or RESYNC WITHOUT CHECK.
It's a user choice, not an md choice, right? If the user wants a fast resync
they can pick with or without check, but we can give the user the
option... that's very nice (for the user). The default? I think
WITHOUT CHECK should be the default; without check is a setting
like the default chunk size...


>



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 20:13:59 von Piergiorgio Sartor

> it's just a discussion, right? no implementation yet, right?

Of course...

> what i think....
> if device accept TRIM, we can use TRIM.
> if not, we must translate TRIM to something similar (maybe many WRITES
> ?), and when we READ from disk we get the same information

TRIM is not about writing at all. TRIM tells the
device that the addressed block is no longer in use,
so it (the SSD) can do whatever it wants with it.

The only software layer with the same "knowledge"
is the filesystem; the other layers have no say
in block allocation.
Except for their own metadata, of course.

So, IMHO, a software TRIM can only live in the FS.

bye,

pg

--

piergiorgio

Re: SSD - TRIM command

am 09.02.2011 20:15:48 von Doug Dumitru

I work with SSD arrays all the time, so I have a couple of thoughts
about trim and md.

'trim' is still necessary. SandForce controllers are "better" at
this, but still need free space to do their work. I had a set of SF
drives drop to 22 MB/sec writes because they were full and scrambled.
It takes a lot of effort to get them that messed up, but it can still
happen. Trim brings them back.

The bottom line is that SSDs do block re-organization on the fly and
free space makes the re-org more efficient. More efficient means
faster and, just as importantly, less write amplification (and so
less wear).

Most SSDs (and I think the latest trim spec) are deterministic on
trim'd sectors. If you trim a sector, they read that sector as zeros.
This makes raid much "safer".

raid/0,1,10 should be fine to echo discard commands down to the
downstream drives in the bio request. It is then up to the physical
device driver to turn the discard bio request into an ATA (or SCSI)
trim. Most block devices don't seem to understand discard requests
yet, but this will get better over time.
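
As a rough illustration of passing a discard down to the member drives,
the sketch below splits one logical discard range on a striped
(raid/0-style) array into per-drive ranges. The function name
split_discard, the chunk size and the use of sectors as the unit are
assumptions made for the example; this is not md's actual code. A mirror
(raid/1) would simply repeat the same range on every member.

```python
from collections import defaultdict

def split_discard(start, length, n_drives, chunk):
    """Map a logical discard [start, start+length) on a striped array
    to per-drive (offset, length) pieces. All units are sectors."""
    pieces = defaultdict(list)
    pos, end = start, start + length
    while pos < end:
        chunk_no, within = divmod(pos, chunk)       # which chunk, offset inside it
        stripe, drive = divmod(chunk_no, n_drives)  # chunks rotate across drives
        run = min(chunk - within, end - pos)        # stay inside this chunk
        dev_off = stripe * chunk + within
        # Merge with the previous piece on this drive if it is contiguous.
        if pieces[drive] and sum(pieces[drive][-1]) == dev_off:
            off, ln = pieces[drive][-1]
            pieces[drive][-1] = (off, ln + run)
        else:
            pieces[drive].append((dev_off, run))
        pos += run
    return dict(pieces)
```

A discard covering whole stripes collapses to one contiguous range per
drive, which is the easy case; a range that starts or ends inside a chunk
yields short pieces on only some of the drives.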

raid/4,5,6 is a lot more complicated. With raid/4,5 with an even
number of drives, you can trim whole stripes safely. Pieces of
stripes get interesting because you have to treat a trim as a write of
zeros and re-calc parity. raid/6 will always have parity issues
regardless of how many drives there are. Even worse is that
raid/4,5,6 parity read/modify/write operations tend to chatter the FTL
(Flash Translation Layer) logic and make matters worse (often much
worse). If you are not streaming long linear writes, raid/4,5,6 in a
heavy write environment is probably a very bad idea for most SSDs.
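
The whole-stripe case can be checked with plain XOR arithmetic. The
sketch below assumes that a trimmed chunk reads back as one fixed byte
pattern (an assumption; drives do not all guarantee this) and tests
whether raid/5-style XOR parity over a fully trimmed stripe still
matches. The function names are made up for the example. It also shows
the partial-stripe case, where trimming one chunk is treated as a write
of zeros and the parity is recalculated.

```python
from functools import reduce

def xor_parity(chunks):
    """XOR parity over equally sized data chunks (raid/4,5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def trimmed_stripe_consistent(total_drives, pattern, chunk_len=8):
    """True if a fully trimmed stripe still has valid parity, assuming
    every trimmed chunk (data and parity) reads back as `pattern`."""
    trimmed = bytes([pattern]) * chunk_len
    data = [trimmed] * (total_drives - 1)   # one chunk per stripe holds parity
    return xor_parity(data) == trimmed

def parity_after_trim(old_parity, old_chunk):
    """Trimming one chunk to zeros is a write of zeros: the new parity is
    the old parity XOR the old data (XOR with the new zeros changes nothing)."""
    return bytes(p ^ d for p, d in zip(old_parity, old_chunk))
```

Under that assumption, an all-zero read-back keeps XOR parity valid for
any drive count, while a non-zero pattern only works when the number of
data chunks is odd, i.e. an even number of drives in total.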

Another issue with trim is how "async" it behaves. You can trim a lot
of data to a drive, but it is hard to tell when the drive actually is
ready afterwards. Some drives also choke on trim requests that come
at them too fast or requests that are too long. The behavior can be
quite random. So then comes the issue of how many "user knobs" to
supply to tune what trims where. Again, raid/0,1,10 are pretty easy.
Raid/4,5,6 really requires that you know the precise geometry and
control the IO. Way beyond what ext4 understands at this point.
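
One conservative way to avoid feeding a drive trims that are too long or
too fast is to cut a large range into bounded pieces and pause between
them. The sketch below is only an illustration: `issue` is a
hypothetical callback standing in for whatever really sends the discard,
and `max_sectors` and `pause_s` are example "user knobs", not values
taken from any spec or from md.

```python
import time

def paced_discard(start, length, issue, max_sectors=65535, pause_s=0.0):
    """Issue a discard of [start, start+length) as pieces of at most
    `max_sectors`, sleeping `pause_s` seconds between pieces.
    Returns the number of requests issued."""
    sent = 0
    pos, end = start, start + length
    while pos < end:
        run = min(max_sectors, end - pos)
        issue(pos, run)              # hypothetical: sends one discard request
        pos += run
        sent += 1
        if pos < end and pause_s > 0:
            time.sleep(pause_s)      # give the drive time to digest the request
    return sent
```

This does not solve the problem of knowing when the drive is really
ready again; it only bounds what is thrown at it.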

Trim can also be "faked" with some drives. Again, looking at the
SandForce based drives, these drives de-dupe internally so you can fake
write data and help the drives get free space. Do this by filling the
drive with zeros (i.e., dd if=/dev/zero of=big.file bs=1M), do a sync,
and then delete the big.file. This works through md, across SANs,
from XEN virtuals, or wherever. With SandForce drives, this is not as
effective as a trim, but better than nothing. Unfortunately, only
SandForce drives and Flash SuperCharger understand zeros this way. A
filesystem option that "zeros discarded sectors" would actually make
as much sense in some deployment settings as the discard option (not
sure, but ext# might already have this). NTFS has actually supported
this since XP as a security enhancement.
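
The zero-fill trick can be scripted as well as done with dd. A minimal
sketch follows; the function name zero_fill and the optional max_bytes
cap are assumptions added for the example. With max_bytes left at None
it keeps writing until the filesystem reports it is full, which is what
the dd command above does, so it will briefly starve anything else that
writes to that filesystem.

```python
import errno
import os

def zero_fill(path, max_bytes=None, block=1 << 20):
    """Write zeros to `path` until `max_bytes` is reached or the
    filesystem is full, fsync, then delete the file.
    Returns the number of bytes written."""
    zeros = bytes(block)
    written = 0
    try:
        # Unbuffered, so write() reports what really reached the kernel.
        with open(path, "wb", buffering=0) as f:
            try:
                while max_bytes is None or written < max_bytes:
                    want = block if max_bytes is None else min(block, max_bytes - written)
                    written += f.write(zeros[:want])
            except OSError as e:
                if e.errno != errno.ENOSPC:   # "no space left" is the expected stop
                    raise
            os.fsync(f.fileno())              # the "do a sync" step
    finally:
        if os.path.exists(path):
            os.remove(path)                   # the "delete the big.file" step
    return written
```

Whether the drive actually reclaims the space depends on it recognising
zero-filled data, as described above; on other drives this just costs a
full pass of writes.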

Doug Dumitru
EasyCo LLC

ps: My background with this has been the development of Flash
SuperCharger. I am not trying to run an advert here, but the care and
feeding of SSDs can be interesting. Flash SuperCharger breaks most of
these rules, but it does know the exact geometry of what it is driving
and plays excessive games to drive SSDs at their exact "sweet spot".
One of our licensees just sent me some benchmarks at > 500,000 4K
random writes/sec for a moderate sized array running raid/5.

pps: SSD failures are different from HDD failures. SSDs can and do fail
and need raid for many applications. If you need high write IOPS, it
pretty much has to be raid/1,10 (unless you run our Flash SuperCharger
layer).

ppps: I have seen SSDs silently return corrupted data. Disks do this
as well. A paper from 2 years ago quoted disk silent error rates as
high as 1 bad block every 73TB read. Very scary stuff, but probably
beyond the scope of what md can address.

Re: SSD - TRIM command

am 09.02.2011 20:16:41 von Roberto Spadim

yeah =)
a question...
if i send a TRIM to a sector
and then read from it,
what do i get?
0x00000000000000000000000000000000000 ?
if yes, we could translate TRIM to WRITE on devices without TRIM (hard disks)
just to get the same data back on READ

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 20:21:02 von Piergiorgio Sartor

> yeah =)
> a question...
> if i send a TRIM to a sector
> and then read from it,
> what do i get?
> 0x00000000000000000000000000000000000 ?
> if yes, we could translate TRIM to WRITE on devices without TRIM (hard disks)
> just to get the same data back on READ

It seems that reading back 0x0 is not guaranteed by the
standard. Return values seem to be quite undefined,
even if 0x0 *might* be common.

Second, why do you want to emulate the 0x0 thing?

I do not see the point of writing zeros to a device
which does not support TRIM. Just doing nothing seems
a better choice, even in a mixed environment.
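
As a toy illustration of "just do nothing" in a mixed mirror, the sketch
below forwards a discard only to members that support TRIM and skips the
others. The member dictionaries and the name forward_discard are made up
for the example; this is not how md represents its devices.

```python
def forward_discard(members, start, length):
    """Send a discard to every mirror member that supports it and
    silently skip the rest. Returns the names of members that got it."""
    sent = []
    for m in members:
        if m["supports_trim"]:
            m["trimmed"].append((start, length))  # stand-in for the real command
            sent.append(m["name"])
        # No TRIM support: do nothing, the old data simply stays in place.
    return sent
```

After this the members may read back differently in the discarded range,
so anything that compares mirrors would have to take that into account.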

bye,

pg

--

piergiorgio

Re: SSD - TRIM command

am 09.02.2011 20:22:19 von Roberto Spadim

i agree with the ppps
that's why ecc, checksums and parity are useful (raid5,6) (raid1 too, if you
read from all mirrors to check for differences and select the 'right disk')

2011/2/9 Doug Dumitru :
> I work with SSDs arrays all the time, so I have a couple of thoughts
> about trim and md.
>
> 'trim' is still necessary. =A0SandForce controllers are "better" at
> this, but still need free space to do their work. =A0I had a set of S=
=46
> drives drop to 22 MB/sec writes because they were full and scrambled.
> It takes a lot of effort to get them that messed up, but it can still
> happen. =A0Trim brings them back.
>
> The bottom line is that SSDs do block re-organization on the fly and
> free space makes the re-org more efficient. =A0More efficient means
> faster, and as importantly less wear amplification.
>
> Most SSDs (and I think the latest trim spec) are deterministic on
> trim'd sectors. =A0If you trim a sector, they read that sector as zer=
os.
> =A0This makes raid much "safer".
>
> raid/0,1,10 should be fine to echo discard commands down to the
> downstream drives in the bio request. =A0It is then up to the physica=
l
> device driver to turn the discard bio request into an ATA (or SCSI)
> trim. =A0Most block devices don't seem to understand discard requests
> yet, but this will get better over time.
>
> raid/4,5,6 is a lot more complicated. With raid/4,5 with an even
> number of drives, you can trim whole stripes safely. Pieces of
> stripes get interesting because you have to treat a trim as a write of
> zeros and re-calc parity. raid/6 will always have parity issues
> regardless of how many drives there are. Even worse is that
> raid/4,5,6 parity read/modify/write operations tend to chatter the FTL
> (Flash Translation Layer) logic and make matters worse (often much
> worse). If you are not streaming long linear writes, raid/4,5,6 in a
> heavy write environment is probably a very bad idea for most SSDs.
>
> Another issue with trim is how "async" it behaves. You can trim a lot
> of data to a drive, but it is hard to tell when the drive actually is
> ready afterwards. Some drives also choke on trim requests that come
> at them too fast or requests that are too long. The behavior can be
> quite random. So then comes the issue of how many "user knobs" to
> supply to tune what trims where. Again, raid/0,1,10 are pretty easy.
> Raid/4,5,6 really requires that you know the precise geometry and
> control the IO. Way beyond what ext4 understands at this point.
>
> Trim can also be "faked" with some drives. Again, looking at the
> SandForce based drives, these drives internally de-dupe so you can fake
> write data and help the drives get free space. Do this by filling the
> drive with zeros (ie, dd if=/dev/zero of=big.file bs=1M), do a sync,
> and then delete the big.file. This works through md, across SANs,
> from XEN virtuals, or wherever. With SandForce drives, this is not as
> effective as a trim, but better than nothing. Unfortunately, only
> SandForce drives and Flash SuperCharger understand zeros this way. A
> filesystem option that "zeros discarded sectors" would actually make
> as much sense in some deployment settings as the discard option (not
> sure, but ext# might already have this). NTFS has actually supported
> this since XP as a security enhancement.
>
> Doug Dumitru
> EasyCo LLC
>
> ps: My background with this has been the development of Flash
> SuperCharger. I am not trying to run an advert here, but the care and
> feeding of SSDs can be interesting. Flash SuperCharger breaks most of
> these rules, but it does know the exact geometry of what it is driving
> and plays excessive games to drive SSDs at their exact "sweet spot".
> One of our licensees just sent me some benchmarks at > 500,000 4K
> random writes/sec for a moderate sized array running raid/5.
>
> pps: Failures of SSDs are different than HDDs. SSDs can and do fail
> and need raid for many applications. If you need high write IOPS, it
> pretty much has to be raid/1,10 (unless you run our Flash SuperCharger
> layer).
>
> ppps: I have seen SSDs silently return corrupted data. Disks do this
> as well. A paper from 2 years ago quoted disk silent error rates as
> high as 1 bad block every 73TB read. Very scary stuff, but probably
> beyond the scope of what md can address.
>
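
The remark about raid/4,5 and an even number of drives can be checked with a
short sketch. It assumes (this is an assumption, not something stated in the
thread) that every drive in the array returns the same deterministic byte
pattern for trimmed sectors, and asks whether a fully trimmed stripe still
satisfies XOR parity. The function name and its parameters are made up for
illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_consistent_after_trim(total_drives: int, pattern: int,
                                 chunk_len: int = 8) -> bool:
    """Return True if a fully trimmed RAID-4/5 stripe still has valid parity.

    Assumption: after trim, every drive (data and parity alike) reads back
    `pattern` in every byte. total_drives counts data drives plus parity.
    """
    if total_drives < 3:
        raise ValueError("RAID-4/5 needs at least 3 drives")
    chunk = bytes([pattern]) * chunk_len
    data_chunks = [chunk] * (total_drives - 1)
    # Parity that *should* be on disk for these data chunks.
    expected_parity = reduce(xor_bytes, data_chunks)
    # Parity that the trimmed parity chunk actually reads back as.
    return expected_parity == chunk


for drives in (3, 4, 5, 6):
    print(drives, stripe_consistent_after_trim(drives, 0xFF))
```

With an all-zero pattern the stripe is consistent for any drive count. With a
non-zero pattern such as 0xFF it is only consistent when the total drive count
(data plus parity) is even, because XOR-ing an odd number of identical data
chunks gives the pattern back.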



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 09.02.2011 20:27:41 von Roberto Spadim

Just to make a READ return the same data with any mix of drives:
if the device has TRIM, use it;
if not, use a WRITE of 0x000000...
Afterwards, if we READ from /dev/md0, we get the same information
(0x000000) no matter whether it's an SSD or an HD, with or without the
TRIM function.
ext4 sends the TRIM command (but it's a user option and should be used
only with disks that support TRIM).
swap sends it too (it's not a user option: the kernel checks whether the
device can execute TRIM and, if not, doesn't send it. I don't know
exactly what it does, but we could use the same code to 'emulate' the
TRIM command, like swap does).

Why emulate? Because then we can use a mixed array (SSD/HD) and get more
performance from TRIM-enabled disks with ext4 (or any other filesystem
that uses md as a device).
The point is: add support for the TRIM command to MD devices.
I don't know whether it has it today (I think not).
If this support exists, how does it work? Could we mix TRIM-enabled and
non-TRIM devices in a raid array?

The first option is: don't use trim.
The second: use trim when possible, emulate trim when it isn't.
The third: only accept trim if all devices are trim enabled (this
should be a run-time option, since we can remove a mirror with trim
support and put in a mirror without trim support).
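
The three options above can be sketched as a small policy function. This is
only an illustration of the decision logic: the names TrimPolicy and plan_trim
are hypothetical and nothing here reflects actual md code.

```python
from enum import Enum


class TrimPolicy(Enum):
    IGNORE = "ignore"                  # option 1: never use trim
    EMULATE = "emulate"                # option 2: trim, or write zeros instead
    ALL_OR_NOTHING = "all_or_nothing"  # option 3: trim only if all support it


def plan_trim(policy: TrimPolicy, members: dict) -> dict:
    """Decide what to send to each array member for one trim request.

    members maps a device name to True/False (does it support TRIM?).
    Returns a map of device name to 'trim', 'write_zeros' or 'skip'.
    """
    if policy is TrimPolicy.IGNORE:
        return {name: "skip" for name in members}
    if policy is TrimPolicy.EMULATE:
        return {name: "trim" if ok else "write_zeros"
                for name, ok in members.items()}
    # ALL_OR_NOTHING: evaluated per request, so swapping a member at run
    # time changes the outcome without reconfiguring anything.
    action = "trim" if all(members.values()) else "skip"
    return {name: action for name in members}


mixed = {"ssd0": True, "hd0": False}
print(plan_trim(TrimPolicy.EMULATE, mixed))
print(plan_trim(TrimPolicy.ALL_OR_NOTHING, mixed))
```

Checking support on every request is one way to get the "run-time option"
behaviour mentioned for the third case.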

2011/2/9 Piergiorgio Sartor :
>> yeah =)
>> a question...
>> if i send a TRIM to a sector
>> if i read from it
>> what i have?
>> 0x00000000000000000000000000000000000 ?
>> if yes, we could translate TRIM to WRITE on devices without TRIM
>> (hard disks)
>> just to have the same READ information
>
> It seems the 0x0 is not a standard. Return values
> seem to be quite undefined, even if 0x0 *might*
> be common.
>
> Second, why do you want to emulate the 0x0 thing?
>
> I do not see the point of writing zero on a device
> which do not support TRIM. Just do nothing seems a
> better choice, even in mixed environment.
>
> bye,
>
> pg



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 19:20:46 von Phillip Susi

On 2/9/2011 10:49 AM, David Brown wrote:
> I've been reading a little more about this. It seems that the days of
> TRIM may well be numbered - the latest generation of high-end SSDs have
> more powerful garbage collection algorithms, together with more spare
> blocks, making TRIM pretty much redundant. This is, of course, the most
> convenient solution for everyone (as long as it doesn't cost too much!).
>
> The point of the TRIM command is to tell the SSD that a particular block
> is no longer being used, so that the SSD can erase it in the background
> - that way when you want to write more data, there are more free blocks
> ready and waiting. But if you've got plenty of spare blocks, it's easy
> to have them erased in advance and you don't need TRIM.

It is not just about having free blocks ready and waiting. When doing
wear leveling, you might find an erase block that has not been written
to in a long time, so you want to move that data to a more worn block,
and use the less worn block for more frequently written to sectors. If
you know that sectors are unused because they have been TRIMed, then you
don't have to waste time and wear copying the junk there to the new
flash block.

TRIM is also quite useful for thin provisioned storage, which seems to
be getting popular.

Re: SSD - TRIM command

am 21.02.2011 19:24:20 von Phillip Susi

On 2/9/2011 11:19 AM, Eric D. Mudama wrote:
> For SATA devices, ATA8-ACS2 addresses this through Deterministic Read
> After Trim in the DATA SET MANAGEMENT command. Devices can be
> indeterminate, determinate with a non-zero pattern (often all-ones) or
> determinate all-zero for sectors read after being trimmed.

IIRC, it was a word in the IDENTIFY response, not the DATA SET
MANAGEMENT command.

On 2/9/2011 11:28 AM, Scott E. Armitage wrote:
> Who sends this command? If md can assume that determinate mode is
> always set, then RAID 1 at least would remain consistent. For RAID 5,
> consistency of the parity information depends on the determinate
> pattern used and the number of disks. If you used determinate
> all-zero, then parity information would always be consistent, but this
> is probably not preferable since every TRIM command would incur an
> extra write for each bit in each page of the block.

The drive tells YOU how its trim behaves; you don't command it.

If the drive is deterministic and always returns zeros after TRIM, then
mdadm could pass the TRIM down and process it like a write of all zeros,
and recompute the parity. If it isn't deterministic, then I don't think
there's anything you can do to handle TRIM requests.
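
A minimal sketch of the "process it like a write of all zeros" idea for
RAID-5, assuming drives that deterministically return zeros after TRIM. With
XOR parity, replacing a chunk by zeros means the new parity is the old parity
XOR the old chunk contents. The helper names are hypothetical and this is
plain Python, not md code.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    if len(a) != len(b):
        raise ValueError("chunks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_after_trim(old_parity: bytes, old_chunk: bytes) -> bytes:
    """Parity of a stripe after one data chunk is trimmed to zeros.

    General read-modify-write rule: new_parity = old_parity ^ old ^ new.
    Here new is all zeros, so the last XOR drops out.
    """
    return xor_bytes(old_parity, old_chunk)


# Three data chunks and their parity.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
parity = xor_bytes(xor_bytes(d0, d1), d2)

# Trim d1: it now reads back as zeros, and parity is updated to match.
new_parity = parity_after_trim(parity, d1)
print(new_parity == xor_bytes(d0, d2))
```

Note that the old chunk still has to be read before it is trimmed, so a
partial-stripe trim costs a read plus a parity write under this scheme.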

Re: SSD - TRIM command

am 21.02.2011 19:25:03 von Roberto Spadim

TRIM is a new feature for many hard disks/SSDs.
It's mostly there to give the disk a longer life and to allow dynamic
bad-block reallocation (the filesystem must tell the disk what is empty).





--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 19:30:44 von Roberto Spadim

Just some ideas...

Hmm, thinking about TRIM in a raid1 array that mixes devices with and
without TRIM support...

When will a filesystem read a block that has been trimmed?
Since a filesystem writes first and reads afterwards, maybe never:
trimmed blocks are unused blocks (the filesystem knows where they are).
Maybe with a resync function that does read/compare/write-if-different
we will have problems when non-trimmed disks (with TRIM support) are
added to a raid1.
Maybe...

I think that sending trim to devices isn't a problem. It's a disk
optimization that must be done by the filesystem; raid1 should only
pass this command on to the disks. The problem is that if a disk doesn't
have trim, we must implement a trim-compatible command (or not...
the filesystem knows about free blocks).






--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 19:34:50 von Phillip Susi

On 2/21/2011 1:25 PM, Roberto Spadim wrote:
> TRIM is a new feature for many hard disk/ssd
> it's more to get a bigger life of disk, allow a dynamic badblock
> reallocation (filesystem must tell where is empty)

Ummm... thanks????

I know quite well what TRIM is, which is why I was discussing how mdadm
could support it.

Re: SSD - TRIM command

am 21.02.2011 19:48:46 von Roberto Spadim

Yeah, for raid1 just send trim to the device (if no layout is in use).
For stripes there must be a rewrite of the command, plus a check of
whether we can use trim.
For internal raid information we shouldn't use it.




--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 19:51:50 von mathias.buren

(please don't top post)

On 21 February 2011 18:25, Roberto Spadim wrote:
> TRIM is a new feature for many hard disk/ssd
> it's more to get a bigger life of disk, allow a dynamic badblock
> reallocation (filesystem must tell where is empty)

TRIM is not a new feature for HDDs as they don't have the problem that
SSDs have. Where did you hear this?

// Mathias

Re: SSD - TRIM command

am 21.02.2011 20:32:50 von Roberto Spadim

TRIM isn't a problem; it's a solution for optimizing dynamic allocation
and the lifetime of devices (SSD or hard disk).
I don't see any problem with implementing the trim command on hard disks
(not in Linux, but at the hard disk firmware level).

Hard disks have the same problem as SSDs, the allocation of bad blocks;
any hard disk could implement trim and use it to reallocate bad blocks...

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 20:38:13 von mathias.buren

On 21 February 2011 19:32, Roberto Spadim wrote:
> TRIM isn't a problem, it's a solution to optimize dynamic allocation,
> and life time of devices (SSD or harddisk)
> i don't see any problem to implement trim command on hard disks (not
> in linux, but at harddisk firmware level)
>
> hard disk have the same problem of ssd, allocation of badblocks, any
> harddisk could implement trim and use it to realloc badblocks...

I don't think you understand TRIM. It wouldn't work, and there is no
need for it, on an HDD. AFAIK an HDD does not have the same penalty as
an SSD does when it needs to write to a (previously) used area. An SSD
cannot do this without erasing the whole erase block, usually 512KB
in size (it varies between manufacturers). The data that's already
there first needs to be moved elsewhere, then the block is erased, and
then the old data is written back together with the new data.
AFAIK it works something like this anyway. The only benefit TRIM would
give you is potentially faster writes.
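
A toy model of the penalty described above, under the simplifying assumption
that the drive rewrites an erase block in place (real FTLs remap pages
instead, so treat this only as an illustration). It counts how many pages
must be preserved when one page of a block is overwritten, with and without
knowledge of trimmed pages. The function name and the 128-page block are
illustrative assumptions.

```python
def pages_to_copy(live_pages: set, trimmed: set, target: int) -> int:
    """Count pages that must be saved before the erase block is erased.

    live_pages: page indexes the drive believes hold data.
    trimmed:    page indexes the filesystem has declared unused.
    target:     the page being overwritten (its old content is not kept).
    """
    still_needed = live_pages - trimmed - {target}
    return len(still_needed)


PAGES_PER_BLOCK = 128  # assumed geometry, varies between drives
all_pages = set(range(PAGES_PER_BLOCK))

# Without TRIM the drive must assume every other page is still in use.
print(pages_to_copy(all_pages, set(), target=0))

# If the upper half of the block was trimmed, far fewer pages are copied.
print(pages_to_copy(all_pages, set(range(64, 128)), target=0))
```

Fewer pages to copy means less work per write and fewer extra flash writes,
which is where the "potentially faster writes" come from in this model.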

// M

Re: SSD - TRIM command

am 21.02.2011 20:39:44 von mathias.buren


Plus, support is needed from the kernel (done) and the filesystem (ext4
has it). The filesystem sees the MD device, not the actual SSDs behind
it, so it would probably be quite complicated to implement passthrough
of the trim command in this case.

// M

Re: SSD - TRIM command

am 21.02.2011 20:39:52 von Roberto Spadim

Sorry, I sent the email without one piece of information:
TRIM is an 'ATA Specification' command

http://en.wikipedia.org/wiki/TRIM_command

Any disk that speaks ATA commands could support TRIM: hard disk, SSD, or
any other type of physical allocation.





--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 20:43:08 von Roberto Spadim

Yeah, the idea of implementing TRIM in MD is to pass down to the devices
the TRIMs that MD receives from the filesystem level.
raid1 + no layout + all mirrors with TRIM support => I think it's easy
to implement... just send the command to the mirrors (SSD or HD, as long
as they support it).
For striped devices?! Maybe it could be supported, but it's more difficult.
For linear raid0 it could be easy too.




--=20
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" i=
n
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: SSD - TRIM command

am 21.02.2011 20:51:24 von Doug Dumitru

To be technically accurate, trim is a hint to a storage device that
has a "block translation layer" that can take advantage of knowing
that a block contains no meaningful data.

Flash needs trim only if flash has an FTL (Flash Translation Layer)
that is re-mapping blocks in such a manner as free blocks are helpful
in making this process more efficient. Older SSDs did not support
trim and had no real need for it. If you look at the FTL used with
simple Flash (think CF cards, SD cards, and USB sticks) trim does not
help them. Trim and wear leveling are un-related and don't really
impact each other.

On the linux side trim is "discard". This is actually a much better
abstraction as it does not imply SSDs.

Any type of block device that does dynamic block remapping will likely
be helped (at least somewhat) by discard. The only examples of this I
can think of off-hand are 1) my Flash SuperCharger code, and 2)
block-level de-dupe engines. I am sure other examples will be created
over time.

Hopefully, discard can be driven down the stack. I would personally
prefer the linux community declare that discard and zero writes are
identical. If an SSD supports trim and linux wants to translate a
discard into a trim at the device driver layer, and the SSD is
non-deterministic, then that SSD is broken. Then again, my attitude
about this is very arrogant and I think the trim spec was broken from
the beginning.
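
To make the "hint to a block translation layer" idea concrete, here is a minimal sketch in Python. It is not how any real SSD or the Linux block layer works; the class name `RemappingDevice` and all of its methods are hypothetical. It only assumes a device that writes out of place and needs a free physical block for every write, and it treats a discarded block as reading back zeros, which is the deterministic behaviour argued for above.

```python
class RemappingDevice:
    """Toy block device with a logical-to-physical translation layer.

    Hypothetical illustration only: every write goes to a fresh physical
    block, so the device needs free blocks to keep accepting writes.
    """

    def __init__(self, num_physical_blocks, block_size=512):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))  # unused physical blocks
        self.mapping = {}                             # logical -> physical
        self.storage = {}                             # physical -> bytes

    def write(self, lba, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        if not self.free:
            raise RuntimeError("no free physical blocks")
        new_phys = self.free.pop()
        old_phys = self.mapping.get(lba)
        self.storage[new_phys] = data
        self.mapping[lba] = new_phys
        if old_phys is not None:
            # The previous copy is stale now and can be recycled.
            del self.storage[old_phys]
            self.free.append(old_phys)

    def discard(self, lba):
        """Host hint: this logical block holds no meaningful data."""
        phys = self.mapping.pop(lba, None)
        if phys is not None:
            del self.storage[phys]
            self.free.append(phys)

    def read(self, lba):
        phys = self.mapping.get(lba)
        if phys is None:
            # Deterministic result for unmapped blocks: all zeros.
            return bytes(self.block_size)
        return self.storage[phys]
```

With two physical blocks and two logical blocks written, a rewrite fails for lack of a free block until one logical block is discarded. A device that does no remapping (a plain hard disk, in this model) has no free list to replenish, so the hint buys it nothing.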

--
Doug Dumitru
EasyCo LLC

Re: SSD - TRIM command

am 21.02.2011 20:57:05 von Roberto Spadim

>Then again, my attitude
> about this is very arrogant and I think the trim spec was broken from
> the beginning.


maybe... we could put all the harddisk firmware into linux code... why do
we need reallocation in harddisks? we need it when the filesystem doesn't do it
the question is: can we implement TRIM in the MD device?




--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 21.02.2011 21:45:24 von Phillip Susi

On 2/21/2011 2:39 PM, Mathias Burén wrote:
> Plus support is needed from the kernel (done) filesystem (ext4 has
> it). The filesystem seese the MD device, not the actual SSDs behind
> it, so it would probably be quite complicated to implement passthrough
> of the trim command in this case.

It has been mentioned at least twice now how to implement it. The
device-mapper driver already has implemented TRIM passthrough for its
linear, stripe, and mirror targets. The trick is handling it with
raid[56].
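
As a rough illustration of what passthrough means for a striped target, the Python sketch below splits one array-level discard range into per-disk ranges. It is not the device-mapper or md code; the function name `split_discard_raid0` and its parameters are made up for this example, and it assumes a plain RAID0 layout with a fixed chunk size in sectors. For a mirror the mapping is trivial (send the same range to every member), and parity levels are deliberately left out.

```python
def split_discard_raid0(start, length, chunk, ndisks):
    """Map an array-level sector range onto (disk, disk_start, length) pieces.

    Hypothetical sketch assuming a simple RAID0 layout: chunk i of the
    array lives on disk (i % ndisks) at stripe (i // ndisks).
    """
    pieces = []
    end = start + length
    pos = start
    while pos < end:
        chunk_index = pos // chunk
        offset_in_chunk = pos % chunk
        disk = chunk_index % ndisks
        stripe = chunk_index // ndisks
        # Never cross a chunk boundary within one piece.
        n = min(chunk - offset_in_chunk, end - pos)
        pieces.append((disk, stripe * chunk + offset_in_chunk, n))
        pos += n
    return pieces


# 16 sectors starting at sector 4, 8-sector chunks, 2 disks
print(split_discard_raid0(4, 16, 8, 2))
# -> [(0, 4, 4), (1, 0, 8), (0, 8, 4)]
```

The two pieces that land on disk 0 happen to be adjacent there, so a real implementation would probably merge them before issuing the discard.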

Re: SSD - TRIM command

am 21.02.2011 21:47:27 von Phillip Susi

On 2/21/2011 2:39 PM, Roberto Spadim wrote:
> sorry, but i sent email without a information:
> TRIM is a 'ATA Specification' command
>
> http://en.wikipedia.org/wiki/TRIM_command
>
> any disk with ATA command could suport TRIM, hard disk or ssd or
> anyother type of phisical allocation

Sure, but hard disks have no reason to, which is why they don't and
won't support it.

Re: SSD - TRIM command

am 21.02.2011 22:02:30 von mathias.buren

On 21 February 2011 20:47, Phillip Susi wrote:
> On 2/21/2011 2:39 PM, Roberto Spadim wrote:
>> sorry, but i sent email without a information:
>> TRIM is a 'ATA Specification' command
>>
>> http://en.wikipedia.org/wiki/TRIM_command
>>
>> any disk with ATA command could suport TRIM, hard disk or ssd or
>> anyother type of phisical allocation
>
> Sure, but hard disks have no reason to, which is why they don't and
> won't support it.
>

My point exactly.

// M

Re: SSD - TRIM command

am 21.02.2011 23:52:46 von Roberto Spadim

i don't think so. since it's an ATA command, any ATA-compatible device can
use it. it could be used by an HD with bad blocks and dynamic reallocation
without problems, so the harddisk wouldn't need a dedicated space for
bad blocks. for md we must know whether the devices support TRIM or not.

the next question: is md ATA compatible? no!? it's a linux device,
not an ATA device. what commands do linux devices allow? could md allow
TRIM?
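
On the "does this device support it" part: Linux reports discard limits per block device in sysfs, and a `discard_max_bytes` of 0 means the device does not accept discards. The Python sketch below reads that attribute. The helper name `discard_supported` and the `sysfs_root` parameter are only for illustration (the parameter exists so the function can be pointed at a test directory), and whether an md device reports a nonzero value depends on the kernel and RAID level in use.

```python
from pathlib import Path


def discard_supported(device, sysfs_root="/sys/block"):
    """Return True if the kernel reports a nonzero discard_max_bytes.

    `device` is a block device name such as "sda" or "md0". A missing or
    unreadable attribute is treated as "no discard support".
    """
    attr = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attr.read_text().strip()) > 0
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    # List every block device the kernel knows about with its discard status.
    for dev in sorted(p.name for p in Path("/sys/block").glob("*")):
        print(dev, "discard" if discard_supported(dev) else "no discard")
```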

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 00:41:00 von mathias.buren

On 21 February 2011 22:52, Roberto Spadim wrote:
> i don't think so, since it's ATA command, any ATA compatible can use
> it, it could be used for HD with badblocks and dynamic reallocation
> without problems, the harddisk don't need a dedicated space for
> badblock. for md software we must know if devices support or not TRIM.
>
> the next question, md is ATA compatible? no!?, it's a linux device,
> not a ATA device. what commands linux devices allow? could md allow
> TRIM?

Please don't top post.
http://www.splitbrain.org/blog/2011-02/15-top_posting_like_dont_i_why

Harddrives already have an allocated area with spare sectors, which
they use whenever they need to. You can find out how many sectors have
been reallocated by the HDD by looking at the SMART data, like so:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
[...]
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0

// M

Re: SSD - TRIM command

am 22.02.2011 00:42:20 von mathias.buren

I forgot to write that the trim command has nothing to do with bad
blocks or sectors, it's just a way of "resetting" blocks so that they can
be written to without having to erase them first. (IIRC)

There is no such issue with HDDs, so there is no benefit at all in
using the trim command with them.

// M

Re: SSD - TRIM command

am 22.02.2011 00:52:47 von Roberto Spadim

trim tells the harddisk that those blocks are not in use

blocks that are not in use can be used by the harddisk reallocation
algorithm, like spare sectors

hard disks can use the TRIM command to 'create' 'good' blocks, like spare sectors

Re: SSD - TRIM command

am 22.02.2011 01:25:35 von mathias.buren

On 21 February 2011 23:52, Roberto Spadim wrote:
> trim tell harddisk that those block are not in use
>
> not in use block can be used by harddisk reallocation algorithm, like
> spare sectors
>
> hard disks can use TRIM command to 'create' 'good' blocks like spare sectors
>

Do you mean online defragmentation...? If so, that's for the
filesystem to do. Or do you mean that it could be used to tell the HDD
that it has extra sectors it can use to reallocate bad sectors?...

// M

Re: SSD - TRIM command

am 22.02.2011 01:30:05 von Brendan Conoboy

On 02/21/2011 03:52 PM, Roberto Spadim wrote:
> trim tell harddisk that those block are not in use
>
> not in use block can be used by harddisk reallocation algorithm, like
> spare sectors
>
> hard disks can use TRIM command to 'create' 'good' blocks like spare sectors

I'm trying really hard to follow what this means but just can't grasp
what you're getting at. What scenario is there in which trim actually
does anything for you on an HD? I can't think of any situation where
this makes any sense for HDs with current firmware functionality. If a
sector is unused, but bad, you won't know until you write to it. If
it's bad and you write to it, the write gets reallocated to a good spare
sector. Are you proposing to notify the drive what sectors are unused
so it can check for and reallocate bad blocks before they're used again?
Something else?

--
Brendan Conoboy / Red Hat, Inc. / blc@redhat.com

Re: SSD - TRIM command

am 22.02.2011 01:32:17 von edmudama

On Mon, Feb 21 at 19:52, Roberto Spadim wrote:
>i don't think so, since it's ATA command, any ATA compatible can use
>it, it could be used for HD with badblocks and dynamic reallocation
>without problems, the harddisk don't need a dedicated space for
>badblock. for md software we must know if devices support or not TRIM.

It's been 15 or more years since hard drives exposed their bad blocks
to the host, I don't think it'd be a good idea to revisit that
decision.

--
Eric D. Mudama
edmudama@bounceswoosh.org

Re: SSD - TRIM command

am 22.02.2011 01:36:10 von edmudama

On Mon, Feb 21 at 20:52, Roberto Spadim wrote:
>trim tell harddisk that those block are not in use

yes

>not in use block can be used by harddisk reallocation algorithm, like
>spare sectors

no, because the host may immediately write to a trim'd sector

The spares in an HDD can never be accessed outside of special tools,
they're swap-in replacements for regions of the media that have
developed defects.

>hard disks can use TRIM command to 'create' 'good' blocks like spare sectors

this doesn't make sense to me


--
Eric D. Mudama
edmudama@bounceswoosh.org


Re: SSD - TRIM command

am 22.02.2011 02:46:34 von Roberto Spadim

if it makes sense on ssd, it makes sense on harddisk too. it's a block
device like ssd. the differences between ssd and harddisk? access time,
bytes(bits)/block, life time
bad blocks exist in both ssd and harddisk; ssd can realloc online, some
harddisks too


> no, because the host may immediately write to a trim'd sector
yes, the filesystem knows where an unused sector exists
if the device (harddisk/ssd) knows too and has a reallocation algorithm, it
can realloc without telling the filesystem to do it (that's why TRIM is
interesting)
since today's ssds use NAND (not NOR), the block size isn't 1 bit like a
harddisk head. trim for harddisk only makes sense for bad block
reallocation

--------------------------
getting back to the first question, can MD support trim? yes/no/not
now/some levels and layouts only?


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 02:52:22 von mathias.buren

On 22 February 2011 01:46, Roberto Spadim wrote:
> if it make sense on ssd, harddisk make sense too, it's a block device
> like ssd, the diference of ssd/harddisk? access time,
> bytes(bits)/block, life time
> bad block exist in ssd and harddisk, ssd can realloc online, some harddisks too
>
>> no, because the host may immediately write to a trim'd sector
> yes, filesystem know where exists a unused sector
> if device (harddisk/ssd) know and have a reallocation algorithm, it
> can realloc without telling filesystem to do it (that's why TRIM is
> interesting)
> since today ssd use NAND (not NOR) the block size isn't 1 bit like a
> harddisk head. trim for harddisk only make sense for badblock
> reallocation
> --------------------------
> getting back to the first question, can MD support trim? yes/no/not
> now/some levels and layouts only?
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial

This explains a bit why trim is good for SSDs and has nothing to do
with harddrives at all, since they use spinning platters and not
chips. http://www.anandtech.com/show/2738/10

// Mathias

Re: SSD - TRIM command

am 22.02.2011 02:55:24 von Roberto Spadim

it can be used for bad block reallocation if the harddisk has it
a harddisk is close to a NOR ssd with variable access time: if the head is
near the sector to be read/written, access time is small; if the sector is
far from the head, access time increases (normally <= 1 disk revolution if
the head control system is good; for 7200rpm, 1 revolution is about 8.33ms)
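
The 8.33ms figure can be checked with a couple of lines of Python. This is only the rotational part; seek time (moving the head between tracks) comes on top and is not modelled here. The "average" value assumes the wanted sector is, on average, half a turn away.

```python
rpm = 7200
revolution_ms = 60_000 / rpm          # one full turn of the platter
average_wait_ms = revolution_ms / 2   # wanted sector is half a turn away on average

print(f"{revolution_ms:.2f} ms per revolution, {average_wait_ms:.2f} ms average wait")
```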

--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 03:01:18 von edmudama

On Mon, Feb 21 at 22:55, Roberto Spadim wrote:
>it can be used for badblock reallocation if harddisk have it
>a harddisk is near to NOR ssd with variable accesstime, if head is
>near sector to be read/write accesstime is small, if sector is far
>from head, access time increase (normaly <=1 disk revolution if head
>control system is good, for 7200rpm 1revolution is near to 8.33ms)

Hard disks do not expose their defect information/remappings. They
present a defect-free logical region to the host.

Optimizing for a few hundred thousand remapped sectors across the LBA
range of ~6 billion LBAs on a 3TB drive isn't worth the effort or code
complexity in most cases.
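
For scale, the numbers above work out as follows in Python. The 512-byte sector size and the count of 300,000 remapped sectors are assumptions picked to match "~6 billion LBAs" and "a few hundred thousand"; they are not measurements from any drive.

```python
capacity_bytes = 3 * 10**12        # 3 TB, decimal as drive vendors count it
sector_bytes = 512                 # assumed logical sector size
total_lbas = capacity_bytes // sector_bytes

remapped = 300_000                 # hypothetical "few hundred thousand" remaps
fraction = remapped / total_lbas

print(f"{total_lbas:,} LBAs, {fraction:.4%} of them remapped")
```

Under these assumptions the remapped sectors are roughly 0.005% of the address range.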

I still don't see how TRIM helps a rotating drive.

--
Eric D. Mudama
edmudama@bounceswoosh.org


Re: SSD - TRIM command

am 22.02.2011 03:02:11 von Mikael Abrahamsson

On Mon, 21 Feb 2011, Roberto Spadim wrote:

> it can be used for badblock reallocation if harddisk have it a harddisk
> is near to NOR ssd with variable accesstime, if head is near sector to
> be read/write accesstime is small, if sector is far from head, access
> time increase (normaly <=1 disk revolution if head control system is
> good, for 7200rpm 1revolution is near to 8.33ms)

Could we please stop this discussion. If you think HDDs should have this
kind of bad sector reallocation scheme, please go to the HDD manufacturers
and lobby to them. It is not on-topic for linux-raid ml.

--
Mikael Abrahamsson email: swmike@swm.pp.se

RE: SSD - TRIM command

am 22.02.2011 03:22:35 von Guy Watkins

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Mikael Abrahamsson
} Sent: Monday, February 21, 2011 9:02 PM
} To: linux-raid@vger.kernel.org
} Subject: Re: SSD - TRIM command
}
} On Mon, 21 Feb 2011, Roberto Spadim wrote:
}
} > it can be used for badblock reallocation if harddisk have it a harddisk
} > is near to NOR ssd with variable accesstime, if head is near sector to
} > be read/write accesstime is small, if sector is far from head, access
} > time increase (normaly <=1 disk revolution if head control system is
} > good, for 7200rpm 1revolution is near to 8.33ms)
}
} Could we please stop this discussion. If you think HDDs should have this
} kind of bad sector reallocation scheme, please go to the HDD manufacturers
} and lobby to them. It is not on-topic for linux-raid ml.
}
} --
} Mikael Abrahamsson email: swmike@swm.pp.se

What about tape drives? :)


Re: SSD - TRIM command

am 22.02.2011 03:27:26 von Roberto Spadim

tape drive = harddisk with only one head; the head can't move, only
the tape (disk/platter or any other name you want)

could we get back and answer the main question?
--------------------------
getting back to the first question, can MD support trim? yes/no/not
now/some levels and layouts only?


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 03:38:07 von Phillip Susi

On 02/21/2011 08:55 PM, Roberto Spadim wrote:
> it can be used for badblock reallocation if harddisk have it
> a harddisk is near to NOR ssd with variable accesstime, if head is
> near sector to be read/write accesstime is small, if sector is far
> from head, access time increase (normaly<=1 disk revolution if head
> control system is good, for 7200rpm 1revolution is near to 8.33ms)

Bad blocks are only reallocated when you write to them. Since they are
bad, you can't read the previous contents anyway, so it does not matter
whether the OS cared about it before or not.

You seem to not understand the fundamental purpose of TRIM. Hard disks
only reallocate blocks when they go bad. SSDs move blocks around all
the time. That process can be optimized if the drive knows that the OS
does not care about certain blocks. Hard drives don't do this, so they
have no reason to support TRIM.
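
A small Python sketch of why that knowledge helps. Before an SSD can erase a flash erase block it has to copy out every page that still holds live data; pages the OS has trimmed can simply be dropped. The function name `gc_copy_cost` and the page numbering are invented for this example and do not describe any particular controller.

```python
def gc_copy_cost(erase_block_pages, trimmed):
    """Count pages that must be copied out before an erase block is erased.

    erase_block_pages: logical page numbers stored in the erase block
                       (None marks a slot that holds no data)
    trimmed:           set of logical pages the host has discarded
    """
    return sum(
        1
        for page in erase_block_pages
        if page is not None and page not in trimmed
    )


block = [10, 11, 12, None]
print(gc_copy_cost(block, trimmed=set()))       # 3 pages to move
print(gc_copy_cost(block, trimmed={10, 12}))    # 1 page to move
```

A hard disk overwrites sectors in place and never runs this kind of copy step, which is the point made above.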

Re: SSD - TRIM command

am 22.02.2011 04:29:09 von Roberto Spadim

getting off topic...
----
they do have a reason: reallocation
> Bad blocks are only reallocated when you write to them. Since they are bad,
> you can't read the previous contents anyway, so it does not matter whether
> the OS cared about it before or not.

when you write, if the block is bad, mark it as bad. how? internal disk
memory, spare blocks. it's a device level problem; if the device can't
correct it, the problem moves to the filesystem level.

what could the device level do? use a 'good block' (if one exists) => dynamic
reallocation
'good block' = a block not in use by the filesystem and not marked as bad,
which can be used by realloc

with trim, you can inform the device firmware which blocks are not in use
by the filesystem. if the harddisk has reallocation, it can use 'good blocks'
to store blocks that were reallocated after bad block errors.

why implement it? if you have 11111 filesystems mounted with bad blocks
at the same time, you will need >= 11111 iops to repair these errors at
filesystem level. if the device can correct them, you don't need to waste
cpu and memory in the filesystem

------
any layer between ATA and [platter, NAND flash, NOR flash] can be
implemented by the harddisk/ssd firmware
some layers that can be implemented: online reallocation, queueing,
online encrypt/decrypt, online compress/decompress and others. some
ssds have optimizations to get better life time and write/read
performance
how to 'tune' these algorithms? ATA commands, SCSI or any other
protocol that supports tuning

why trim? to inform the harddisk/ssd which blocks aren't in use

what could the harddisk/ssd do with trim information?
dynamic reallocation (bad blocks), or any other operation that needs
unused blocks (some algorithms use them to get better read/write
performance)
on devices with byte read/write level (NAND flash) we could write to
a trimmed block without reading the block and writing it again. NOR flash
and harddisks don't need this, they work with bits, not bytes/blocks
why send an error to the filesystem if it can be corrected at device level?
just send the error when it can't be corrected.


2011/2/21 Phillip Susi :
> On 02/21/2011 08:55 PM, Roberto Spadim wrote:
>>
>> it can be used for badblock reallocation if harddisk have it
>> a harddisk is near to NOR ssd with variable accesstime, if head is
>> near sector to be read/write accesstime is small, if sector is far
>> from head, access time increase (normaly<=3D1 disk revolution if hea=
d
>> control system is good, for 7200rpm 1revolution is near to 8.33ms)
>
> Bad blocks are only reallocated when you write to them. =A0Since they=
are bad,
> you can't read the previous contents anyway, so it does not matter wh=
ether
> the OS cared about it before or not.
>
> You seem to not understand the fundamental purpose of TRIM. =A0Hard d=
isks only
> reallocate blocks when they go bad. =A0SSDs move blocks around all th=
e time.
> =A0That process can be optimized if the drive knows that the OS does =
not care
> about certain blocks. =A0Hard drives don't do this, so they have no r=
eason to
> support TRIM.
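
[Editorial sketch] The 8.33 ms figure quoted above for a 7200 rpm drive is
the time for one full platter revolution. A small worked calculation,
assuming only the rpm value (the function names are made up for this note):

```python
def revolution_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def average_rotational_latency_ms(rpm):
    """On average the wanted sector is half a revolution away from the head."""
    return revolution_ms(rpm) / 2.0


print(round(revolution_ms(7200), 2))                  # 8.33
print(round(average_rotational_latency_ms(7200), 2))  # 4.17
```

This covers rotational delay only; seek time for moving the head between
tracks would come on top of it.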



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 04:42:03 von Roberto Spadim

off topic again...

continuing with the idea of optimizations, the last optimization we could
have is to implement a filesystem in the harddisk
it could implement all filesystem functions, not device functions; it
could have much more information about the data, not only block 'in
use'/'not in use'. it could understand: file starting at block x
ending at block y, with information w, access time z, etc. it could
be more intelligent than a raw device. in other words, it's a
fileserver...

why implement algorithms at device level? today harddisk processors
(fpga, arm processors, others) have a lot of cpu power not in use, why
not use it? that's why we send trim to the device; whether it's a harddisk,
ssd or any other pseudo/real device is no problem, we send the trim
command to optimize it

----------------
getting out of off topic,

please stop sending 'i think it's not a performance feature, it doesn't
need to be implemented at device level'. let's implement all functions
that the device level allows (ATA/SCSI specifications or any other)
and optimize when possible
checking neil's md roadmap, the badblock work will be very good for md
devices; it's a good optimization for raid1 since a mirror will only
fail when many blocks fail



can we implement TRIM at MD level? is it a good feature to implement?
will it be a lot of work to implement?
my opinion:
we can, on some raid levels
it's a good feature
it will be a lot of work to implement and test


any answer from raid developers?


--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 04:45:40 von NeilBrown

On Mon, 21 Feb 2011 23:27:26 -0300 Roberto Spadim
wrote:

> tape drive = harddisk with only one head, the head can't move, only
> the tape (disk/plate or any other name you want)
>
> could we get back and answer the main question?
> --------------------------
> getting back to the first question, can MD support trim? yes/no/not
> now/some levels and layouts only?
>

MD currently doesn't accept 'discard' requests.

RAID0 and LINEAR could be made to accept 'discard' if any
member device accepted 'discard'. Patches welcome.

Other levels need md to know not to try to resync/recover regions that
have been discarded. See "non-sync bitmap" section of the recent
md roadmap.

NeilBrown
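
[Editorial sketch] To illustrate why RAID0 and LINEAR are the easy cases: a
discard on the array is just an address translation onto member devices,
with no redundancy to keep consistent. The sketch below assumes equal-size
RAID0 members and a fixed chunk size, and it ignores whether each member
actually supports discard; the function names and tuple layout are made up
for this note and are not md's code.

```python
def split_discard_linear(member_sizes, start, length):
    """Map a discard at array offset `start` onto a linear (concatenated) array.

    Returns a list of (member_index, member_offset, count) tuples, in sectors.
    """
    pieces = []
    member_start = 0              # array offset where the current member begins
    end = start + length
    for index, size in enumerate(member_sizes):
        member_end = member_start + size
        lo = max(start, member_start)
        hi = min(end, member_end)
        if lo < hi:               # the discard overlaps this member
            pieces.append((index, lo - member_start, hi - lo))
        member_start = member_end
    return pieces


def split_discard_raid0(n_devices, chunk, start, length):
    """Map a discard onto a striped array with `chunk` sectors per chunk."""
    pieces = []
    pos = start
    end = start + length
    while pos < end:
        chunk_no, within = divmod(pos, chunk)
        count = min(chunk - within, end - pos)   # stay inside this chunk
        device = chunk_no % n_devices
        dev_offset = (chunk_no // n_devices) * chunk + within
        pieces.append((device, dev_offset, count))
        pos += count
    return pieces


print(split_discard_linear([100, 200, 50], 90, 30))
print(split_discard_raid0(2, 8, 4, 16))
```

A real implementation would presumably drop or skip the pieces that land on
members without discard support, which matches the "if any member device
accepted 'discard'" condition above.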

Re: SSD - TRIM command

am 22.02.2011 05:04:38 von Phillip Susi

On 02/21/2011 10:29 PM, Roberto Spadim wrote:
> what device level could do? use a 'good block' (if exists) => dynamic
> reallocation
> 'good block' = block not in use by filesystem, not marked as bad, can
> be used by realloc

No. It can only use blocks reserved for spares at manufacture time. It
can not use any old block that the fs is not using at the time, because
the fs may choose to use it in the future.
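
[Editorial sketch] The point about spares can be modelled as a remap table
that draws only from a fixed pool set aside in advance, never from sectors
the filesystem merely happens not to be using. This is a toy model; the
class and method names are hypothetical and do not describe any particular
drive's firmware.

```python
class RemapTable:
    """Toy bad-sector remapper backed only by a factory-reserved spare pool."""

    def __init__(self, spare_sectors):
        self.free_spares = list(spare_sectors)  # physical sectors reserved as spares
        self.remapped = {}                      # logical sector -> spare sector

    def resolve(self, lba):
        """Physical sector currently backing a logical sector."""
        return self.remapped.get(lba, lba)

    def write_failed(self, lba):
        """Remap a sector whose write failed; error out when spares run dry."""
        if not self.free_spares:
            raise RuntimeError("no spare sectors left")
        self.remapped[lba] = self.free_spares.pop(0)
        return self.remapped[lba]


table = RemapTable([1000, 1001])
table.write_failed(5)
print(table.resolve(5))   # 1000
print(table.resolve(6))   # 6
```

Because the pool is outside the address range the host can see, the
filesystem is free to start using any of its own blocks later without
colliding with a remapped sector.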

Re: SSD - TRIM command

am 22.02.2011 05:30:29 von Roberto Spadim

it can't because today's filesystems (except ext4 and swap) don't use the
trim command to tell the device which blocks aren't in use

2011/2/22 Phillip Susi :
> On 02/21/2011 10:29 PM, Roberto Spadim wrote:
>>
>> what device level could do? use a 'good block' (if exists) => dynamic
>> reallocation
>> 'good block' = block not in use by filesystem, not marked as bad, can
>> be used by realloc
>
> No. It can only use blocks reserved for spares at manufacture time. It can
> not use any old block that the fs is not using at the time, because the fs
> may choose to use it in the future.



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 05:37:38 von Roberto Spadim

thanks neil, i will try to read it and make some patches. my focus is ssd
optimization; at hardware level (hw raid) i didn't see any good
improvement
good = a good read balance (based on queue and disk read rate), trim support

----
good read balance = round robin or another time-based algorithm (can
be cpu intensive); i haven't found yet how to get the queue of linux bios
(mirrors)
trim support - nothing to report, it's a 'feature request' for the long
term (after badblock and other features)
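
[Editorial sketch] A read balance "based on queue and disk read rate" could
look roughly like the following: estimate how long each mirror would take to
finish the new read and pick the smallest estimate. The per-mirror fields
queued_bytes and read_rate are assumptions for this illustration; the thread
does not say how md would obtain them.

```python
def pick_mirror(mirrors, request_bytes):
    """Return the index of the mirror expected to finish the read soonest.

    mirrors -- list of dicts with 'queued_bytes' (outstanding I/O) and
               'read_rate' (bytes per second); both are assumed to be known
    """
    def estimated_seconds(m):
        return (m["queued_bytes"] + request_bytes) / m["read_rate"]

    return min(range(len(mirrors)), key=lambda i: estimated_seconds(mirrors[i]))


ssd = {"queued_bytes": 4_000_000, "read_rate": 200_000_000}
hd = {"queued_bytes": 0, "read_rate": 100_000_000}
print(pick_mirror([ssd, hd], 1_000_000))   # 1: the idle hd wins over the busy ssd
```

Plain round robin would ignore both fields and simply alternate; the
estimate above costs a little more cpu per request in exchange for favouring
the faster or less loaded device.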

-----

ps...
neil, what are you thinking about badblock and layout?
for example... will reading from a bad block be internally (md source
code) remapped to a good block? or will md just try to read/write to another
device?

in other words, will we have a 'dynamic' layout?


2011/2/22 NeilBrown :
> On Mon, 21 Feb 2011 23:27:26 -0300 Roberto Spadim
> wrote:
>
>> tape drive = harddisk with only one head, the head can't move, only
>> the tape (disk/plate or any other name you want)
>>
>> could we get back and answer the main question?
>> --------------------------
>> getting back to the first question, can MD support trim? yes/no/not
>> now/some levels and layouts only?
>>
>
> MD currently doesn't accept 'discard' requests.
>
> RAID0 and LINEAR could be made to accept 'discard' if any
> member device accepted 'discard'. Patches welcome.
>
> Other levels need md to know not to try to resync/recover regions that
> have been discarded. See "non-sync bitmap" section of the recent
> md roadmap.
>
> NeilBrown
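
[Editorial sketch] The quoted remark about the other RAID levels can be read
as: md needs to remember which regions were discarded so that resync and
recovery skip them. A toy bitmap along those lines is sketched below. The
class name, the region granularity and the rounding rules are assumptions
made for this note; the actual design is whatever the md roadmap specifies.

```python
class NonSyncBitmap:
    """One flag per fixed-size region; True means 'no need to sync this region'."""

    def __init__(self, array_sectors, region_sectors):
        self.region_sectors = region_sectors
        regions = -(-array_sectors // region_sectors)    # ceiling division
        self.non_sync = [False] * regions

    def discard(self, start, length):
        """Flag only the regions that the discard covers completely."""
        first = -(-start // self.region_sectors)          # round up
        last = (start + length) // self.region_sectors    # round down, exclusive
        for region in range(first, last):
            self.non_sync[region] = True

    def write(self, start, length):
        """Any write makes every touched region need syncing again."""
        first = start // self.region_sectors
        last = -(-(start + length) // self.region_sectors)
        for region in range(first, last):
            self.non_sync[region] = False

    def regions_to_resync(self):
        return [r for r, skip in enumerate(self.non_sync) if not skip]


bitmap = NonSyncBitmap(array_sectors=64, region_sectors=16)
bitmap.discard(8, 40)
print(bitmap.regions_to_resync())   # [0, 3]
bitmap.write(20, 4)
print(bitmap.regions_to_resync())   # [0, 1, 3]
```

Rounding the discard inward and the write outward keeps the sketch on the
safe side: a region is only skipped when nothing in it is still in use.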



--
Roberto Spadim
Spadim Technology / SPAEmpresarial

Re: SSD - TRIM command

am 22.02.2011 15:45:00 von Phillip Susi

On 2/21/2011 11:30 PM, Roberto Spadim wrote:
> it can't because today filesystem (exclude ext4 and swap) don't use
> trim command to tell device what block isn't in use

You aren't getting it. The fs can tell the drive all it wants: the
drive does not care. It has nothing useful it can do with that information.

Re: SSD - TRIM command

am 22.02.2011 18:15:52 von Roberto Spadim

i think it has. ssds have, why not hds? a hd can implement an intelligent
layer to speed up writes/reads without telling the fs

2011/2/22 Phillip Susi :
> On 2/21/2011 11:30 PM, Roberto Spadim wrote:
>> it can't because today filesystem (exclude ext4 and swap) don't use
>> trim command to tell device what block isn't in use
>
> You aren't getting it. The fs can tell the drive all it wants: the
> drive does not care. It has nothing useful it can do with that information.



--
Roberto Spadim
Spadim Technology / SPAEmpresarial