#68 — Scrub stalls with 0.6.9-7~1.gbp52ec03
| State | Resolved |
|---|---|
| Version: | |
| Area | Functionality |
| Issue type | Bug |
| Severity | Important |
| Submitted by | Seth Heeren |
| Submitted on | Jul 01, 2010 |
| Responsible | Seth Heeren |
| Target release: | 0.7.0 |
Last modified on
Sep 19, 2010
by
Seth Heeren
Packages from https://launchpad.net/~bugs-sehe/+archive/zfs-fuse
Vanilla settings on 0.6.9-6 produce okay scrub performance.
Once upgraded to 0.6.9-7~1.gbp52ec03 (vanilla settings), the system stalls during a scrub (typing in a terminal is bursty).
Purging 0.6.9-7 and reinstalling 0.6.9-6 seems to restore performance during scrub.
See also http://groups.google.com/[…]/cc4c277b2c2903fa
Added by
Seth Heeren
on
Jul 01, 2010 05:38 AM
Well, we have two things to go at then, in order of rising plausibility:
(a) reinstate the "-a 1 -e 1" options via /etc/default/zfsrc in the
0.6.9-7 version (I would be flabbergasted if that made the difference)
(b) retest with the keep_cache hint reverted (issue #65)
Obviously, this will be a test-only version, as we cannot afford to
generally enable the keep_cache thing.
To be honest, I don't expect any relief from either analysis step at the
moment.
Download the test package from here
http://gitweb.zfs-fuse.net/[…]/issue68
or here
http://downloads.sehe.nl/zfs-fuse/issue68/
Added by
Gavin Chappell
on
Jul 01, 2010 05:58 AM
This testing version removes the "burstiness" on the terminal, and seems to behave the same as -6 did from that point of view.
PS - I added myself as a watcher on this, which I assume is what you meant by the CC list and should get me emails with any further responses after this one...
Added by
Seth Heeren
on
Jul 01, 2010 06:06 AM
Severity:
Medium → Important
Ok, thanks for the test results.
You should be receiving this update then :)
Bad news all in all, because it means that performance is somehow impacted while doing a scrub (which I never expected to hit that code from f138e5b66 in the first place).
I'll accept the bug and figure out whether someone knows how to better fix the cache coherence issue that was patched with this change (read the commit message for f138e5b66 for more info).
Later
Added by
Seth Heeren
on
Jul 01, 2010 06:08 AM
Issue state:
unconfirmed → open
Added by
Seth Heeren
on
Jul 01, 2010 07:02 PM
I cannot reproduce this at all.
Yes, scrub slows down the system considerably.
But I see no improvement when going back to 0.6.9-6 from 0.6.9-7~1*
Also, 0.6.9-7~2 has exactly the same performance characteristics.
I timed with a pool of 30 GB, 32000 files, 81% CAP; scrub reaches about 30 MB/s
root@lucid:~# time find /SEED/ | wc -l
This always clocks in at about 0'40 to 0'70s (with scrub running) versus 0'10s (without scrub running), or 3s for subsequent runs (regardless of scrub)
Interestingly
root@lucid:~# echo 3 > /proc/sys/vm/drop_caches
has no effect on subsequent runs.
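A minimal sketch for repeating the timing above in a consistent way, assuming a POSIX shell. `time_walk` is a hypothetical helper, not part of zfs-fuse; it only wraps the same `find | wc -l` walk and reports whole seconds.

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): time one full directory walk.
# Prints "<elapsed seconds> <entry count>" on a single line.
time_walk() {
    dir=$1
    start=$(date +%s)
    count=$(find "$dir" | wc -l)
    end=$(date +%s)
    # $((count)) normalises any padding that wc may add
    echo "$((end - start)) $((count))"
}
```

Running `time_walk /SEED` a few times, once with a scrub in progress and once without, should give numbers comparable to the ones quoted here. The second field is a sanity check that every run walked the same number of entries.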
I have tweaked it a bit and some performance win can be had by using the following options in zfsrc:
fuse-mount-options=noatime,default_permissions
fuse-attr-timeout = 3600
fuse-entry-timeout = 3600
zfs-prefetch-disable
max-arc-size = 2048
I monitor scrub throughput by using 'sudo vmstat -S m 1 | tee vmstat' in a separate terminal. Setting 'zfs-prefetch-disable' did not adversely affect the scrub throughput on my box
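To compare runs, the saved `vmstat` log can be reduced to an average. The following is a sketch only: `vmstat_avg_io` is a hypothetical helper, and it assumes the default procps column layout in which `bi` and `bo` (blocks in and out per second) are fields 9 and 10.

```shell
#!/bin/sh
# Hypothetical helper: average the bi/bo columns of a saved `vmstat 1` log.
# Assumption: default procps layout, bi = field 9, bo = field 10.
# Header lines are skipped because their first field is not numeric.
# The first data line is skipped too, since vmstat reports averages
# since boot there rather than a one-second sample.
vmstat_avg_io() {
    awk '$1 ~ /^[0-9]+$/ {
             if (seen++) { bi += $9; bo += $10; n++ }
         }
         END { if (n) printf "%d %d\n", bi / n, bo / n }' "$1"
}
```

For example, `vmstat_avg_io vmstat` prints the average `bi` and `bo` for the file written by the `tee` above.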
Added by
(anonymous)
on
Jul 02, 2010 01:16 AM
Ah, but it's not the scrub performance that I'm having problems with. I seem to recall seeing figures of around 25 MB/s (3 * 500 GB USB2 disks in a RAIDZ1 configuration), which are close to yours with standard disks.
The problem here is how the rest of the system responds even to just simple things like text input via SSH when scrubbing with -7~1. This particular release does something which sends latency through the roof, which the releases on either side don't do, but all three perform similarly while scrubbing.
When I get a chance later today I'll try to run the same tests that you did there to verify that scrub performance is the same with all three releases, and if I can find some screen capture software I can also try to demonstrate the differences between the revisions with regard to latency. Alternatively, if you have a public SSH key floating around somewhere, I can set up an account on the machine itself and you can see the difference for yourself.
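One rough way to put a number on this kind of interactive latency, independent of scrub throughput, is to sleep for a fixed short interval and record how late each wake-up is. This is a sketch under assumptions: `latency_probe` is a hypothetical helper, and it relies on GNU `date` (for `%N`) and a `sleep` that accepts fractional seconds.

```shell
#!/bin/sh
# Hypothetical probe: sleep 100 ms N times and print the worst overshoot in ms.
# Assumptions: GNU date (%N = nanoseconds) and fractional sleep support.
# The overshoot includes the cost of spawning date and sleep, so only
# compare values taken on the same machine.
latency_probe() {
    rounds=${1:-20}
    worst=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        t0=$(date +%s%N)
        sleep 0.1
        t1=$(date +%s%N)
        over=$(( (t1 - t0) / 1000000 - 100 ))
        if [ "$over" -gt "$worst" ]; then
            worst=$over
        fi
        i=$((i + 1))
    done
    echo "$worst"
}
```

Running `latency_probe 50` during a scrub on each of the three releases would give comparable worst-case figures, if the stall is visible to ordinary processes at all.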
Added by
Seth Heeren
on
Jul 02, 2010 03:27 AM
I've replied off-list (check your junk mail if necessary). The mail is PGP-signed, which should be ignorable if your email client does not support it.
I have attached a pubkey to the email
In response to the issue tracker, off-list:
> > Ah, but it's not the scrub performance that I'm having problems
> > with. I seem to recall seeing figures of around 25mb/s (3 * 500Gb
> > USB2 disks in a RAIDZ1 configuration) which are close to yours with
> > standard disks.
I wasn't saying you had scrub performance issues. It's just that
(a) I don't have them
(b) I listed some zfsrc tweaks
(c) I wanted to demonstrate that scrub performance itself is still
normal; It'd be unfair competition if I said I have no latency issues,
but scrub goes with 5Mb/s LOL
> >
> > The problem here is how the rest of the system responds even to just
> > simple things like text input via SSH when scrubbing with -7~1.
Do you think SSH has anything to do with it (?surprise?) Hmmm that
would be weird unless you are heftily CPU-bound? I tested locally in
local (X) terminals. Perhaps you can show the output of
vmstat -S m 1
or, for very detailed core usage stats:
mpstat -P ALL 1
> > This particular release does something which sends latency through
> > the roof, which the releases on either side don't do, but all three
> > perform similarly while scrubbing.
> >
> > When I get chance later today I'll try and run the same tests that
> > you did there to verify that scrub performance is the same with all
> > three releases,
I suggest applying the zfsrc tweaks I mentioned in the ticket response. Make sure
you tune the cache sizes to your system memory :)
> > and also if I can find some screen capture software I can try and
> > demonstrate the differences in the revisions with regards to the
> > latency.
I suggest script(1) and scriptreplay(1), like so:
SCRIPT="$(date +'issue68_%d%m-%H:%M')"
script -t "$SCRIPT.script" 2> "$SCRIPT.timing"
PS1="$(date +%H:%M:%S) $PS1"
Mix in your screen(1) routine for max fun. Then to replay:
scriptreplay "$SCRIPT.timing" "$SCRIPT.script"
On X I use wink or gtkrecordmydesktop but these will have to run with
such slow framerates that it hardly demonstrates the _real_ latency :)
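The timing file written by `script -t` can also be inspected directly, without replaying the session. A minimal sketch, assuming the classic two-column timing format (`<delay in seconds> <byte count>` per line); `longest_pause` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the longest gap (in seconds) between two
# chunks of terminal output recorded in a `script -t` timing file.
# Assumption: classic timing format, "<delay> <byte count>" per line.
longest_pause() {
    awk '$1 + 0 > max { max = $1 + 0 }
         END { printf "%.3f\n", max }' "$1"
}
```

For example, `longest_pause "$SCRIPT.timing"` after a recorded session gives one number per release to compare. The gaps also include time spent idle at the prompt, so the recorded session should consist of continuous typing or continuous output.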
> > Alternatively if you have a public SSH key floating around
> > somewhere, I can set up an account on the machine itself and you can
> > see the difference for yourself.
Now we're talking! Attaching a pub key for this purpose
Added by
Seth Heeren
on
Sep 19, 2010 05:20 PM
closing due to
(a) inactivity
(b) a better fix for #65 has been found and applied to testing - due for 0.7.0 release
Please start your test engines :)
commit 7cb2c61cfe7505b7abe53dc935be250e095e00e6
Author: Seth Heeren <zfs-fuse@sehe.nl>
Date: Fri Aug 13 15:09:01 2010 +0200
Reenabling the keep_cache flag on zfsfuse_opencreate
Because it should not be necessary anymore since Emmanuels fix in
68a7787261e632
This effectively reverts
16e046c031505795df72a24906775dbc0f2e03b4
8e5e01349d376f11fa2a318e174ee56e689a4e34
5363a6021c1fbaf97ab93ed95a05ae876644a2c9
288ab55443945461f8f8fe02221b37aafa9557cf
This will be in the next release, maint will continue to contain the
'simpler' fix that disabled the fuse keep_cache unconditionally

