
#68 — Scrub stalls with 0.6.9-7~1.gbp52ec03

State Resolved
Version:
Area Functionality
Issue type Bug
Severity Important
Submitted by Seth Heeren
Submitted on Jul 01, 2010
Responsible Seth Heeren
Target release: 0.7.0
Last modified on Sep 19, 2010 by Seth Heeren
Packages from https://launchpad.net/~bugs-sehe/+archive/zfs-fuse

Vanilla settings on 0.6.9-6 give okay scrub performance.
Once upgraded to 0.6.9-7~1.gbp52ec03 (vanilla settings), the system stalls (typing in a terminal is bursty).

Purging 0.6.9-7 and reinstalling 0.6.9-6 seems to restore performance during scrub.

see also http://groups.google.com/[…]/cc4c277b2c2903fa
Added by Seth Heeren on Jul 01, 2010 05:38 AM
Well we have two things to go at then, in order of rising plausibility:

(a) reinstate the "-a 1 -e 1" options via /etc/default/zfsrc in the
0.6.9-7 version (I would be flabbergasted if that made the difference)
(b) retest with the keep_cache hint reverted (issue #65)

Obviously, this will be a test-only version, as we cannot afford to
generally enable the keep_cache thing.

To be honest, I don't expect any relief from either analysis step at the
moment.

Download the test package from here
   http://gitweb.zfs-fuse.net/[…]/issue68
or here
   http://downloads.sehe.nl/zfs-fuse/issue68/
Added by Gavin Chappell on Jul 01, 2010 05:58 AM
This testing version removes the "burstiness" on the terminal, and seems to behave the same as -6 did from that point of view.

PS - I added myself as a watcher on this, which I assume is what you meant by the CC list and should get me emails with any further responses after this one...
Added by Seth Heeren on Jul 01, 2010 06:06 AM
Severity: Medium → Important
Ok, thanks for the test results.

you should be receiving this update then :)

Bad news all in all, because it means that the code from f138e5b66 somehow impacts performance during a scrub (I never expected a scrub to hit that code in the first place).

I'll accept the bug and find out whether someone knows a better way to fix the cache coherence issue that was patched with this change (read the commit message for f138e5b66 for more info).

Later
Added by Seth Heeren on Jul 01, 2010 06:08 AM
Issue state: unconfirmed → open
Added by Seth Heeren on Jul 01, 2010 07:02 PM
I cannot reproduce this at all.

Yes, scrub slows down the system considerably.
But I see no improvement when going back to 0.6.9-6 over 0.6.9-7~1*
Also, 0.6.9-7~2 has exactly the same performance characteristics.

I timed with a pool of 30Gb, 32000 files, 81%CAP; scrub reaches about 30Mb/s

root@lucid:~# time find /SEED/ | wc -l

This always clocks in at about 0'40 to 0'70s (with scrub running) versus 0'10s (without scrub running), or 3s for subsequent runs (regardless of scrub)

Interestingly
   root@lucid:~# echo 3 > /proc/sys/vm/drop_caches
has no effect on subsequent runs.
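
The cold-run versus warm-run comparison above can be repeated with a small loop. This is a minimal sketch, not the script used for the numbers in this ticket: `time_walk` is a hypothetical helper name, `/SEED` is the mount point from the example, and the nanosecond timestamps from `date +%s%N` assume GNU coreutils.

```shell
# Walk a directory tree several times and report how long each pass took.
# Usage: time_walk DIR [RUNS]
time_walk() {
    dir=$1
    runs=${2:-3}
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)                      # nanoseconds since the epoch (GNU date)
        count=$(find "$dir" | wc -l | tr -d ' ') # same workload as the find | wc -l above
        end=$(date +%s%N)
        echo "run $i: $count entries in $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

Running `time_walk /SEED 3` once while a scrub is active and once after it has finished gives the two sets of figures to compare; the first pass of each batch is the cold one.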

I have tweaked it a bit and some performance win can be had by using the following options in zfsrc:
fuse-mount-options=noatime,default_permissions
fuse-attr-timeout = 3600
fuse-entry-timeout = 3600
zfs-prefetch-disable
max-arc-size = 2048

I monitor scrub throughput by using 'sudo vmstat -S m 1 | tee vmstat' in a separate terminal. Setting 'zfs-prefetch-disable' did not adversely affect the scrub throughput on my box
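
To compare releases, the saved log can be reduced to a single figure. A minimal sketch, assuming the log is the plain `vmstat` output captured by `tee` above and that its header row names the block-in column `bi`; `avg_column` is a hypothetical helper, not part of zfs-fuse or procps.

```shell
# Average one named column of a saved vmstat log.
# Usage: avg_column COLUMN LOGFILE     e.g. avg_column bi vmstat
avg_column() {
    awk -v name="$1" '
        # Header row (vmstat repeats it periodically): remember the column index.
        { for (i = 1; i <= NF; i++) if ($i == name) { col = i; next } }
        # Data rows start with a number; banner lines such as "procs ---" do not.
        col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n; else exit 1 }
    ' "$2"
}
```

Note that the first sample `vmstat` prints is an average since boot, so for short captures it may be worth deleting that row from the log before averaging.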
Added by (anonymous) on Jul 02, 2010 01:16 AM
Ah, but it's not the scrub performance that I'm having problems with. I seem to recall seeing figures of around 25mb/s (3 * 500Gb USB2 disks in a RAIDZ1 configuration) which are close to yours with standard disks.

The problem here is how the rest of the system responds even to just simple things like text input via SSH when scrubbing with -7~1. This particular release does something which sends latency through the roof, which the releases on either side don't do, but all three perform similarly while scrubbing.

When I get chance later today I'll try and run the same tests that you did there to verify that scrub performance is the same with all three releases, and also if I can find some screen capture software I can try and demonstrate the differences in the revisions with regards to the latency. Alternatively if you have a public SSH key floating around somewhere, I can set up an account on the machine itself and you can see the difference for yourself.
Added by Seth Heeren on Jul 02, 2010 03:27 AM
I've replied off-list (check your junk mail if necessary). The mail is PGP signed, which should be ignorable if your email client does not support it.
I have attached a pubkey to the email

In response to the issue tracker, off-list:

> > Ah, but it's not the scrub performance that I'm having problems
> > with. I seem to recall seeing figures of around 25mb/s (3 * 500Gb
> > USB2 disks in a RAIDZ1 configuration) which are close to yours with
> > standard disks.
I wasn't saying you had scrub performance issues. It's just that
(a) I don't have them
(b) I listed some zfsrc tweaks
(c) I wanted to demonstrate that scrub performance itself is still
normal; It'd be unfair competition if I said I have no latency issues,
but scrub goes with 5Mb/s LOL
> >
> > The problem here is how the rest of the system responds even to just
> > simple things like text input via SSH when scrubbing with -7~1.
Do you think SSH has anything to do with it (?surprise?) Hmmm that
would be weird unless you are heftily CPU-bound? I tested locally in
local (X) terminals. Perhaps you can show the output of

    vmstat -S m 1


or, for very detailed core usage stats:

    mpstat -P ALL 1


> > This particular release does something which sends latency through
> > the roof, which the releases on either side don't do, but all three
> > perform similarly while scrubbing.
> >
> > When I get chance later today I'll try and run the same tests that
> > you did there to verify that scrub performance is the same with all
> > three releases,
I suggest trying the zfsrc tweaks I mentioned in the ticket response. Make
sure you tune the cache sizes to your system memory :)

> > and also if I can find some screen capture software I can try and
> > demonstrate the differences in the revisions with regards to the
> > latency.
I suggest script(1) and scriptreplay(1), like so:

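    # Name the capture after the issue plus a day/month and time stamp
    SCRIPT="$(date +'issue68_%d%m-%H:%M')"
    # Record the session; -t writes the timing data to stderr
    script -t "$SCRIPT.script" 2> "$SCRIPT.timing"

    # Inside the recorded shell: single quotes defer date to each prompt (bash)
    PS1='$(date +%H:%M:%S) '"$PS1"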

Mix in your screen(1) routine for max fun. Then, to replay:

    scriptreplay "$SCRIPT.timing" "$SCRIPT.script"
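
The timing file can also put a number on the latency without replaying anything. This is a sketch, assuming the classic two-column timing format written by `script -t` (the delay in seconds since the previous chunk of output, then its size in bytes); `count_stalls` and `slowest_gaps` are hypothetical helper names.

```shell
# Count output gaps longer than LIMIT seconds in a script(1) timing file.
# Usage: count_stalls TIMINGFILE [LIMIT]     (LIMIT defaults to 0.5)
count_stalls() {
    awk -v limit="${2:-0.5}" '$1 + 0 > limit + 0 { n++ } END { print n + 0 }' "$1"
}

# Show the N longest gaps, largest first.
# Usage: slowest_gaps TIMINGFILE [N]         (N defaults to 5)
slowest_gaps() {
    LC_ALL=C sort -rn -k1,1 "$1" | head -n "${2:-5}"
}
```

The gaps include idle time at the prompt, so the counts are only comparable between recordings of the same activity, for example the same command sequence run under each package version.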

On X I use wink or gtkrecordmydesktop but these will have to run with
such slow framerates that it hardly demonstrates the _real_ latency :)
> > Alternatively if you have a public SSH key floating around
> > somewhere, I can set up an account on the machine itself and you can
> > see the difference for yourself.
Now we're talking! Attaching a pub key for this purpose
Added by Seth Heeren on Sep 19, 2010 05:20 PM
closing due to
(a) inactivity
(b) a better fix for #65 has been found and applied to testing - due for 0.7.0 release

Please start your test engines :)

commit 7cb2c61cfe7505b7abe53dc935be250e095e00e6
Author: Seth Heeren <zfs-fuse@sehe.nl>
Date: Fri Aug 13 15:09:01 2010 +0200

    Reenabling the keep_cache flag on zfsfuse_opencreate
    
    Because it should not be necessary anymore since Emmanuels fix in
    68a7787261e632
    
    This effectively reverts
        16e046c031505795df72a24906775dbc0f2e03b4
        8e5e01349d376f11fa2a318e174ee56e689a4e34
        5363a6021c1fbaf97ab93ed95a05ae876644a2c9
        288ab55443945461f8f8fe02221b37aafa9557cf
    This will be in the next release, maint will continue to contain the
    'simpler' fix that disabled the fuse keep_cache unconditionally