#82 — A way to override device sector size (better support 4k drives in 512b emulation)

State: Rejected
Version:
Area: Functionality
Issue type: Feature
Severity: Medium
Submitted by: Bryan Pendleton
Submitted on: Aug 12, 2010
Responsible: Seth Heeren
Target release:
Last modified on Sep 19, 2010 by Seth Heeren
With new Advanced Format (4k sector) drives becoming common, it will be important for ZFS to issue writes in units of the drive's native sector size to maintain performance. However, many of these drives lie about their native sector size and still claim 512b.

Reading around on the Solaris support lists, it seems some people have had success using the "gnop" command (a FreeBSD GEOM utility) to create a new device that enforces a larger effective sector size. No such tool appears to exist for Linux. Is there a reasonable way to add a sector/block size override to the zvol system, so that 4k drives can be (re)declared as such even though the hardware reports them as 512b sector drives?
Steps to reproduce:
Add a 4k drive to a pool, see iffy read performance, lousy write performance.
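
As a first check on what a drive claims, the sketch below reads the sector sizes the Linux kernel reports through sysfs. It assumes a kernel that exposes `queue/logical_block_size` and `queue/physical_block_size` under `/sys/block/<device>`; the function names and the `sysfs_root` parameter are hypothetical, added only so the code can be pointed at a test directory. A drive that misreports its geometry will show 512 for both values, so a matching pair does not prove the drive is a native 512b device.

```python
from pathlib import Path


def reported_sector_sizes(device, sysfs_root="/sys/block"):
    """Return (logical, physical) sector sizes in bytes, as reported by the kernel."""
    queue = Path(sysfs_root) / device / "queue"
    logical = int((queue / "logical_block_size").read_text().strip())
    physical = int((queue / "physical_block_size").read_text().strip())
    return logical, physical


def describe(logical, physical):
    """Summarize what the reported sizes imply (assumption: the drive is honest)."""
    if physical > logical:
        return "emulation reported: %d-byte logical on %d-byte physical sectors" % (
            logical, physical)
    return "no emulation reported (%d-byte sectors); a misreporting drive looks the same" % logical


# Example (device name is a placeholder):
#   print(describe(*reported_sector_sizes("sdb")))
```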
Added by Seth Heeren on Aug 13, 2010 03:11 AM
Responsible manager: (UNASSIGNED) -> sgheeren
1. Have you demonstrated said iffy response times?

2. Have a look at

    zfs get |& egrep 'block|record'

and the Solaris hints on optimizing ZFS for database servers (the Evil Tuning Guide[1] should be a good starting point).

[1] http://www.solarisinternals.com/[…]/ZFS_Evil_Tuning_Guide
Added by Bryan Pendleton on Aug 13, 2010 10:10 AM
Unfortunately, I have not been able to make an A/B comparison, because there is no "gnop" equivalent I can find in Linux. I do know that my write performance dropped by a factor of 5x-10x after adding a 1.5TB USB drive that I've seen insinuated to contain 4k sectors. A similar 500GB (unlikely to have been a 4k drive) USB drive is part of the original pool configuration, and shows significantly better pool performance. The 1.5TB drive had been performing much better as an LVM/ext[34] drive prior. I unfortunately also can't test under Solaris, because a 1.5TB USB drive with "512b" sectors can't be used with a 32-bit kernel.

There is *much* discussion of dealing with 4k sector drives and ZFS. A search for "zfs gnop 4k" will find a number of threads where OpenSolaris and FreeBSD users have found performance to be much more satisfactory when using the "gnop" tool to create a device that requires all I/O to take place in units of 4k, rather than just 512b.

Short summary of the problem: As the storage industry moves to 4k sectors, drive vendors have decided not, at least initially, to actually expose the 4k sectors to the OS. However, 512b emulation on a 4k sector requires that an isolated write smaller than 4k first be preceded by a read. So, scattered 512b writes (as might be expected from ZFS if it's assuming a native 512b sector size) will cause extra reads, so that the unchanged parts of the 4k sector can be written back along with the changed 512b. Each such read costs an extra rotation of the platter before the subsequent write. Generating aligned 4k writes lets the underlying firmware skip the read: the full 4k native sector is just written as normal.
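
To make the read-modify-write cost concrete, here is a minimal sketch that counts how many physical sectors a list of writes touches only partially. It assumes 4096-byte physical sectors and treats each write independently; real firmware may coalesce or cache writes, so this is an upper bound on the extra reads, not a model of any particular drive. The function name `count_rmw` is hypothetical.

```python
def count_rmw(writes, physical=4096):
    """Count physical sectors that are only partially covered by a write.

    writes: iterable of (offset, length) pairs in bytes.
    Each partially covered sector would need a read before it can be rewritten.
    """
    partial = 0
    for offset, length in writes:
        if length <= 0:
            continue
        end = offset + length
        first = offset // physical          # first physical sector touched
        last = (end - 1) // physical        # last physical sector touched
        head_partial = offset % physical != 0
        tail_partial = end % physical != 0
        if first == last:
            # The whole write sits inside one sector: partial unless it fills it.
            if head_partial or tail_partial:
                partial += 1
        else:
            # Only the first and last sectors can be partially covered.
            partial += int(head_partial) + int(tail_partial)
    return partial
```

With these assumptions, a single aligned 4096-byte write needs no extra read, while a 512-byte write at offset 512 needs one, and a 4096-byte write starting at offset 512 straddles two sectors and needs two.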
Added by Seth Heeren on Aug 13, 2010 10:37 AM
> performance dropped by a factor of 5x-10x after adding a 1.5TB USB drive that I've seen insinuated to contain 4k sectors. A similar 500GB (unlikely to have been a 4k drive) USB drive is part of the original pool configuration, and shows significantly better pool performance.

You could compare against the same pool after adding a different drive, one that you "haven't seen insinuated to contain 4k sectors". Most likely, the same performance drop will be visible. If not, then we have analysis material :)

> The 1.5TB drive had been performing much better as an LVM/ext[34] drive prior
Measured in? Kilos? Sighs per minute? Bitten nails? I'm assuming you'll have some throughput measure (like, in MB/s) but that is hardly relevant, because ZFS is slower than ext4 anyway, especially on zfs-fuse.

All in all, I think you need to think your situation through a bit more (I don't have much time, nor do I have an insinuated drive). Have you considered how the performance will suffer if you gnop-ed the pool to 4k (pretend we can...) and only one of the disks actually uses 4k sectors?

Have you, actually, considered creating a separate pool containing _just_ the big drive that you have seen insinuated ... etc?
Have you tried hooking up that drive to the motherboard, to rule out USB bandwidth or controller contention?
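
For comparisons like these, a repeatable number helps. The following is a rough sketch, not a proper benchmark: it times sequential writes to a file and reports MB/s. The function name, block size, and block count are arbitrary illustrative choices; it uses random data so that a compressing filesystem cannot shrink the writes, and it calls fsync so that the data is pushed out before the clock stops.

```python
import os
import time


def write_throughput_mb_s(path, block_size=4096, count=256):
    """Write `count` blocks of `block_size` bytes to `path` and return MB/s."""
    buf = os.urandom(block_size)  # incompressible data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
        os.fsync(fd)  # include the flush in the measured time
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    # Guard against a zero interval on very coarse clocks.
    return (block_size * count) / (1024 * 1024) / max(elapsed, 1e-9)
```

Running it with the same arguments against a file on each pool (or on the same drive under ext4 and under ZFS) gives figures that can at least be compared with each other; larger `count` values reduce the influence of caching.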

I get your summary of the 4k technology (Thanks). I'd like to point at a number of ZFS characteristics that probably go a long way to reduce probable performance impact here (as I'm sure countless others will have responded in these other sources I haven't read):
(a) ZFS has a very strong transaction grouping mechanism in its ZIO/ZIL layers. Writes are going to be bundled
(b) ZFS is effectively log-structured, meaning that any writes (including deletes and most rewrites) are going to be sequential appends most of the time (unless your pool is heavily fragmented, which can only occur with a load of many small files on an almost-filled pool; in such a case ZFS's block allocator is known to perform poorly anyway and disk seek times will dominate the disk wait times)
(c) ZFS has a number of tuning parameters (see the Evil Tuning Guide mentioned before) that will allow maximum adaptation
(d) you can always include an SSD as cache/log device so only the big, long-lasting writes will go to your 4k-sector disk anyway

Hope that helps
Added by Seth Heeren on Aug 14, 2010 03:16 PM
Look what the cat dragged in: just today, someone posted the following useful pointer on the user group. It was buried in a thread so I'll post it here for your convenience:

http://groups.google.com/[…]/e837dfcacbfa7623


On 08/14/2010 09:10 PM, devsk wrote:
> > For folks who are troubled by large IOs (to high latency devices like
> > HDDs, USB devices etc.) bringing their Linux system down to its knees,
> > use kernel 2.6.35 with two very small patches found at
> >
> > http://phoronix.com/[…]/showpost.php?p=142025&postcount=38
> >
> > It has fixed the issue to a large extent i.e. the system doesn't grind
> > to a halt anymore.
> >
> > You may also wanna follow the bug: https://bugzilla.kernel.org/show_bug.cgi?id=12309
> >
> > -devsk
> >

Well.... that came out of nowhere ?! Splendid info, thanks, but it might
have been appropriate to post a new thread for this. Well, look at me
nagging :) I'm adding this info to issue #82 where I'm kind-of giving
Bryan a hard time convincing me of performance problems... Your
description looks surprisingly similar, so let's see what he can do with
the info :)
Added by Seth Heeren on Sep 19, 2010 06:43 PM
Issue state: unconfirmed -> rejected
closing this due to
(a) WONTFIX (fix should be in upstream design)
(b) for the record, I recently (weeks ago) added two WD15EARS disks (4k sectored) to my OpenSolaris NAS and have not been able to find any difference in perf.

Note that I run 2 mirrored pools and have consciously chosen to mix the EARS drives with the existing EADS drives, so I now have

    Mirror(1.5TB EADS + 1.5TB EARS)
    Mirror(1.5TB EADS + 1.5TB EARS)
    
instead of the prior situation

    Mirror(1.5TB EADS + 1.5TB EADS)

I see no performance impact