#70 — Can't unmount due to a backgrounded zfs destroy
| Field | Value |
|---|---|
| State | Unconfirmed |
| Version | 0.7.0 |
| Area | Functionality |
| Issue type | Bug |
| Severity | Medium |
| Submitted by | Jan Ploski |
| Submitted on | Jul 04, 2010 |
| Responsible | Seth Heeren |
| Target release | 0.7.0 |
Last modified on
Sep 27, 2010
by
Jan Ploski
Once again reporting against snapshot sehe-288ab55443945461f8f8fe02221b37aafa9557cf.
I issued a zfs destroy command, which I then interrupted with CTRL-C after more than half an hour waiting (the dataset is 24 GB, the performance of destroy seems terrible; perhaps it's related to dedup=on on that dataset). Apparently zfs destroy is still running in the background, as zfs list displays the dataset slowly decreasing in size. Now I wanted to restart zfs-fuse in order to give it more memory (max-arc-size = 400 now). However, I can't do it in an orderly manner because the init script hangs at "Unmounting ZFS filesystems...".
I'm going to kill the zfs-fuse process (and maybe reboot if it doesn't help), but that shouldn't be necessary.
Added by
Jan Ploski
on
Jul 04, 2010 10:46 AM
I had to kill -9. After that the mounts were not removed nor could I remove them with umount -f. Starting the init script again (without reboot in between) hangs at "Mounting ZFS filesystems..." (the disk seems to be doing something). "zfs list" in another shell is blocked.
Added by
Jan Ploski
on
Jul 04, 2010 10:55 AM
After I rebooted, the init script still hangs at "Mounting ZFS filesystems...". So I hit CTRL-C to let booting continue. zfs-fuse is now running and keeping the disk very busy, I suppose completing my zfs destroy command from before the reboot. zfs/zpool commands just hang. I will wait some hours to see whether it comes back. Were it happening in production, it could turn into a real ZFS horror story (access to the entire pool denied because of a runaway zfs destroy on a single small dataset)...
Added by
Jan Ploski
on
Jul 04, 2010 11:52 AM
After the disk activity ceased (30 minutes?), the pool became available again. The destroyed dataset is gone and everything seems fine (except that the errors reported with zpool status -v cannot be cleared, as reported in ticket #69).
Added by
(anonymous)
on
Jul 04, 2010 05:09 PM
Jan, your problem is exactly the same as the one reported in #20 (recently confirmed to still be a problem with 0.6.9 maint). However, we have never been able to reproduce the issue. It seems you are now the second person to have this problem.
I suggest we
(a) try to establish the common denominator to find a root cause
(b) could I in some way gain access to the machine that exhibits this in order to investigate?
I will cross-post this to issue #20 for Maximilian Mehnert
(see also http://groups.google.com/[…]/)
Added by
(anonymous)
on
Jul 04, 2010 05:13 PM
PS. a few notes:
if this was on a 'recent' testing snapshot (288ab55, e.g.), look at db9f77e3c420d20f58c4340a0ddc22e0b4d8a9cd
There were possible races and deadlocks especially in the fuse_unmount_all code
Also, if you ever need to, use the following command 3 times to definitely get rid of 'hung/stale' zfs-fuse mounts:
poolname=tank; egrep "^(/dev/fuse|$poolname|kstat)" /etc/mtab | cut -d' ' -f1 | sort -r | xargs -tr umount -fl
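To preview what that one-liner would hand to umount before running it for real, a minimal dry-run sketch might look like this. The function name list_zfs_fuse_mounts and its mtab-path argument are made up for illustration; the matching pattern is the one from the command above, and like that command it assumes the first mtab field is something umount will accept.

```shell
#!/bin/sh
# list_zfs_fuse_mounts MTAB POOL   (hypothetical helper, not part of zfs-fuse)
# Print the first field of every mtab entry that starts with /dev/fuse,
# the pool name, or kstat. Reverse sorting puts child datasets before
# their parents, so children would be unmounted first.
list_zfs_fuse_mounts() {
    mtab=$1
    pool=$2
    grep -E "^(/dev/fuse|$pool|kstat)" "$mtab" | cut -d' ' -f1 | LC_ALL=C sort -r
}

# Dry run, nothing is unmounted:
#   list_zfs_fuse_mounts /etc/mtab tank
# Real run (as root), same effect as the one-liner:
#   list_zfs_fuse_mounts /etc/mtab tank | xargs -tr umount -fl
```

Note that the pattern is a prefix match, so with poolname=tank it would also pick up a pool called tank2; check the dry-run output if you have similarly named pools.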
Cheers
Added by
Jan Ploski
on
Jul 04, 2010 05:57 PM
I'm not sure whether this is really the same problem as #20... The interactions that lead to reaching the unresponsive state certainly look different.
After destroying the dataset in question, I recreated it without dedup and rsynced the files back from the same Windows Vista machine. Interestingly, after this operation I have a few more files with allegedly "permanent" errors:
# zpool status -v
  pool: green
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        green                       ONLINE       0     0     0
          disk/by-id/dm-name-green  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <0x21f>:<0x4e>
        <0x21f>:<0x53>
        <0x21f>:<0x5a>
        <0x21f>:<0x62>
        <0x21f>:<0x68>
        green/backup/vista:<0x39003>
        /green/backup/vista/DELL/drivers/R179345/Lang/HDMI/kor
        /green/backup/vista/Drivers/Security/R153770/CB_2K
        /green/backup/vista/Drivers/video/R154827/LANG/HDMI/ara
        /green/backup/vista/Drivers/video/R154827/LANG/HDMI/chs
The last 4 entries are new and all refer to directories (whose contents I can list all right in the ZFS file system).
I will try the destroy command again later today to see if it goes just as slow as before and whether I can get unmount/mount to hang again. I can't give you access to the machine, but I could recompile and run instrumented code if need be.
Added by
Seth Heeren
on
Jul 05, 2010 03:02 AM
Target release:
None → 0.7.0
Responsible manager:
(UNASSIGNED) → sgheeren
To me it looks more than similar:
I issued a _zfs destroy_[1] command, which I then _interrupted with CTRL-C_ [2] after more than half an hour waiting (...). Apparently _zfs destroy is still running in the background_ [3: bug #20], as zfs list displays the dataset slowly decreasing in size[4].
Now the epilogue is somewhat less similar, but it is also root-caused by the previous symptoms:
Now I wanted to restart zfs-fuse in order to give it more memory (max-arc-size = 400 now). However, I can't do it in an orderly manner because the init script hangs at "Unmounting ZFS filesystems...".
The hang at "Unmounting ZFS filesystems" may well be fixed in testing db9f77e3c420d20f58c43
Any chance I can get my hands on an existing environment with these problems?
Added by
(anonymous)
on
Jul 14, 2010 09:51 PM
Hi, I also run into a similar problem from time to time (on an old version, around 0.6.0).
I have a single zpool on a single ~120 GB device, with 15 file systems and
lots of snapshots.
Mounting takes a very long time.
Not necessarily after a zfs destroy: sometimes when I boot,
zfs spends about 30 minutes mounting (the disk is working constantly).
Fortunately, no data was lost.
I do not know exactly under what circumstances this occurs.
At least once it followed a power loss and a hard reboot.
But still, why would mounting the file systems take so long?
It also happened once when I was removing about 100 files of
200-800 MB each (about 40 GB in total). In that
situation zfs-fuse quickly crashed my computer
(due to lack of memory and heavy swapping). After that I rebooted
and waited 30 minutes while it was mounting.
It always scares me to death.
Apparently it reclaims free space very lazily during normal operation,
but it is not lazy about it when mounting/unmounting.
Added by
(anonymous)
on
Jul 14, 2010 09:54 PM
Ah. Update of previous response.
It also happened once when I was performing mass destroy of snapshots
(about 100 snapshots of the same filesystem, which had about 50 GB of data).
Added by
Seth Heeren
on
Jul 15, 2010 06:14 AM
First let me point out that the original problem mentioned was that operations seem to not be canceling properly. This is _not_ the same as saying that some operations take a long time.
You describe exactly the known 'slow' operations. The symptoms are aggravated when importing a pool that had canceled or partially completed operations. What these operations have in common is the need to place 'temp user holds'. This makes certain zfs housekeeping operations scale badly with many datasets (filesystems and/or snapshots), and more so under zfs-fuse.
Operations that certainly do this:
zfs send
zfs rollback
scrub/resilver
Other long-running jobs, where I'm not sure whether the temp user (snapshot) holds are to blame but which certainly incur some TX syncing overhead, are:
zfs destroy
deleting (big volumes of) (highly) dedup-ed data
It is easy to see that locking of the DDT becomes the performance bottleneck for the latter operation, and I think there is an upstream bug requesting that the DDT get its own vdev to reduce contention.
In my experience, when such a long-running operation is canceled (e.g. by hitting CTRL-C in userland, or if the daemon is killed), the performance hit when re-importing the pool after relaunch is disproportionate. Last week I personally waited 40 minutes for a simple zpool import to complete.
The problem is that the temp user holds (and possibly other transient locks) are going to have to be _atomically_ removed while the pool is _not yet actually imported_. This involves opening the pool (spa), releasing a single reference, syncing the metadata and closing the pool (spa). This is excruciatingly slow.
So the advice is
(a) use dedup ONLY when you know why you absolutely need it (having once enabled dedup will from then on incur some DDT overhead even once disabled; 'zpool history | grep dedup' to find out whether that is the case)
(b) be (very) patient when sitting out the large operations; canceling these makes matters (lots) worse, especially since no zfs operations can be done while importing a pool, not even on other pools
(c) (obviously?) avoid running out of memory.
- Check your initscripts for proper ulimit management
- check kernel vmaps setting
- reduce --max-arc-size (on my system I need to keep it below 2048 MB or zfs-fuse can sometimes fail to allocate memory; I haven't investigated why that is, but keep a sane threshold)
(d) inspect 'zpool history -il' to verify (after a lengthy import, e.g.) where the time was spent. E.g. you can clearly see
You might alleviate some of the pain by explicitly exporting the pool before restarting zfs-fuse after a canceled operation. This might do the required administration more efficiently, though I've never tried this.
As a brute-force method you could try creating pools of version <18, because that version introduced the snapshot holds in the first place. Of course you will get the old issues back (e.g. races when removing snapshots that are being accessed by another operation) and miss the new features.
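The 'zpool history | grep dedup' check from advice (a) can be wrapped so that it only fires when dedup was actually switched on, not when it was merely set to off. This is a minimal sketch under assumptions: the function name dedup_ever_enabled is made up, and it simply looks for a "dedup=" assignment in whatever text it is fed.

```shell
#!/bin/sh
# dedup_ever_enabled   (hypothetical helper)
# Reads `zpool history` output on stdin. Exit status is 0 if at least one
# line sets dedup to something other than "off", non-zero otherwise.
dedup_ever_enabled() {
    grep 'dedup=' | grep -v 'dedup=off' | grep -q .
}

# Usage (pool name "green" is the one from this thread):
#   zpool history green | dedup_ever_enabled \
#       && echo "dedup was enabled at some point; expect DDT overhead"
```

It does not tell you whether deduplicated blocks still exist in the pool, only that the property was turned on at some point in the recorded history.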
Added by
Seth Heeren
on
Sep 19, 2010 05:28 PM
Did any of the above information help? Is there anything else I can do?
Note that I'm publishing a new 'unstable' branch which contains a lot of upstream fixes, hopefully including the ones I was referring to in the above comments
Seth
Added by
Jan Ploski
on
Sep 25, 2010 05:51 AM
Here is another story with experiences similar to the original bug report.
I'm now 15 minutes into a resilver, having attached another 1 TB disk to a previously unmirrored (but partially deduped) pool. zpool status reports 169 hours to go :(
Adventurously, I wanted to check what happens if I stop zfs-fuse and restart it - whether resilver will continue from where it left and how fast the pool will come up.
# /etc/init.d/zfs-fuse stop
Unmounting ZFS filesystems...done. <--- this took minutes to complete
Stopping zfs-fuse: zfs-fuse failed! <--- hmm?
# zpool status
(hangs while the disks are doing something, but then)
internal error: Connection reset by peer
Aborted
The zfs-fuse daemon is gone now, though /var/run/zfs-fuse.pid is still around. However, mounts have been successfully removed (sometimes it leaves mounts when it dies, then you have to remove manually or it won't restart).
Anyway, I next did /etc/init.d/zfs-fuse start, which went quickly, if not smoothly. Now zpool status gives me:
# zpool status
  pool: green
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h1m, 0.02% done, 150h19m to go
config:

        NAME                           STATE     READ WRITE CKSUM
        green                          ONLINE       0     0     0
          mirror-0                     ONLINE       0     0     0
            disk/by-id/dm-name-green1  ONLINE       0     0     0
            disk/by-id/dm-name-green2  ONLINE       0     0     1  64.6M resilvered

errors: No known data errors
Observations
(1) the "resilvered" progress counter went back to zero, though maybe it means "resilvered in this session", no idea
(2) a checksum error appeared
(3) it tells me I have an "unrecoverable" error now (in a mirror? why can't it repair any error using the source disk)
(4) the 11 spurious "data errors", which I reported in another issue, got miraculously cleared
(5) I'm not sure what to do now: wait 160 hours until resilver completes? Do not shut down the computer while it is running? Am I then going to have to start over again afterwards because of the "unrecoverable" error? Should I cancel now? Am I going to lose data? (last question is rhetorical, as we know ZFS never loses data ;)
Added by
Seth Heeren
on
Sep 25, 2010 07:03 AM
Hi there
Interesting case. I tried this when I added 2 disks to my NAS (OpenSolaris). It too restarted at the 0% with resilver after reboot. So that matches with Osol b134 behaviour.
It might not be resilvering from the start, though, and I just realized you can find out by using zpool history -i (it will list the starting txg numbers for a resilver/scrub)
The checksum error is presumably due to the fact that the zfs-fuse shutdown wasn't graceful. There is no (known) way to nicely pause the resilver, and after a while zfs-fuse itself simply starts killing things, as does your init script (you describe that /var/run/zfs-fuse.pid is still around with the process gone, which almost guarantees that your init script issued a kill -9 after waiting for some time).
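The leftover PID file described here can be detected directly. The following is a minimal sketch, assuming the PID file contains nothing but the daemon's PID; the function name pidfile_is_stale is made up and is not part of the zfs-fuse init script.

```shell
#!/bin/sh
# pidfile_is_stale FILE   (hypothetical helper)
# Exit status 0 if FILE exists but names no running process.
# Run it as root: without permission to signal the process, kill -0
# fails even for a live process, which would be misread as stale.
pidfile_is_stale() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1        # no PID file, so nothing is stale
    pid=$(cat "$pidfile")
    case $pid in
        ''|*[!0-9]*) return 0 ;;         # empty or non-numeric content
    esac
    if kill -0 "$pid" 2>/dev/null; then  # signal 0 only tests for existence
        return 1
    fi
    return 0
}

# Usage (path as reported in this thread):
#   pidfile_is_stale /var/run/zfs-fuse.pid && echo "daemon died without cleanup"
```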
You should be very careful when interpreting zpool error messages/codes: they are very, very subtle. I have learned by now that an 'unrecoverable error' means: there was an error accessing a physical disk that the controller/disk could not recover from. Of course, data was not compromised. Note how the message says that the (IO) error was unrecoverable. It does not talk about data, let alone unrecoverable data, which is a whole other thing.
Ad the observations:
(1) the "resilvered" progress counter went back to zero, though maybe it means "resilvered in this session", no idea
zpool history -i
(2) a checksum error appeared
not wholly unexpected due to ungraceful shutdown. search upstream for clarification with ZFS-8000-9P
(3) it tells me I have an "unrecoverable" error now (in a mirror? why can't it repair any error using the source disk)
this is exactly the checksum error (see 2.); note how there are no data errors listed
(4) the 11 spurious "data errors", which I reported in another issue, got miraculously cleared
it may be as silly as this: there is only room for one error status per pool, and when a new error condition arrives, it must 'zpool clear' first? I know this is confusing since you have tried to get rid of these errors before... I'm thinking now: perhaps you tried deleting the corrupted files and then ran a zpool clear, but the files ('deleted' in the current datasets) still existed in snapshots/clones (big clue). If these snapshots have disappeared in the meantime, a new (implicit) 'zpool clear' will likely remove the listed file corruption.
How does this sound to you?
(5) I'm not sure what to do now: wait 160 hours until resilver completes? Do not shut down the computer while it is running? Am I then going to have to start over again afterwards because of the "unrecoverable" error? Should I cancel now? Am I going to lose data? (last question is rhetorical, as we know ZFS never loses data ;)
Just wait: my pool reported enormous amounts of time at first. In the end it was done in under 32 hours (I don't remember precisely), but this was on Solaris. You might want to check with iotop or zpool iostat that the disk I/O rate is healthy (say at least 20 to 30 MB/s on both disks, reads on one vs. writes on the other). If it is a lot less, you might think about getting around the performance bottleneck first.
$0.02
Added by
Seth Heeren
on
Sep 25, 2010 07:09 AM
PS. In my upcoming unstable branch (merged with Emmanuel's syncs from upstream) there are a number of commits that mention scrubbing state and resilvering state for vdevs and try to be smarter about operations on such vdevs.
I don't suggest you switch right now, but perhaps there were fixes upstream to (part) of the symptoms we are seeing in issue #70 and #20
Have a look at 4385d97cf for example: http://gitweb.zfs-fuse.net/?p=sehe;a=commitdiff;h=4385d97cf
My gitweb has the text search facility enabled too
Added by
Jan Ploski
on
Sep 25, 2010 07:32 AM
Thanks! I now understand the checksum error better. I also read Sun's documentation (but too late, after posting). The "unrecoverable" thing is scary/misleading. I guess someone who worded it that way should pay compensation for the grief caused ;)
As for the clearing of errors... No, it's not related to deleting snapshots/clones/files. I haven't changed these in any way, just issued a 'zpool attach'. These errors were oblivious to removing files - the error messages just changed from ones which included paths to ones which included hex codes only. I believe that afterwards, these metadata error messages "reattached" themselves to some new files (to be more exact, directories). I suppose because of reusing some internal data structures. Anyway, they are gone now, and my data is still intact, so I'm content.
I posted a mournful message to the newsgroup with a few more details regarding resilver performance. There seems to be something wrong: after 1.5 hours of run time it still has 157 hours to go. If no better clues become available, I will just keep an eye on how long it actually takes to complete. Thankfully, a resilver isn't done often. Maybe that's also why it's not optimized yet? According to the Sun developer, it is "slower" to obtain better data integrity, but what kind of safety is that if you have to wait a week before such a small system regains its desired level of redundancy?
Added by
Seth Heeren
on
Sep 25, 2010 09:02 AM
There is clearly something wrong when a resilver runs at 1 MB/s. This is not something I saw on Solaris.
For now, the most important question is whether you were able to confirm that the resilver is _not_ being completely restarted each time (zpool history -i).
It might be anywhere between fuse, zfs-fuse and your kernel. Did you try 2.6.35 (rc4+)? It appears to contain quite important fixes for certain system configurations (scan some threads if you will, though I'm pretty sure I've posted the suggestion with various bugs, possibly yours).
Also, try to run swapoff if you cannot upgrade kernels.
I don't know about "slower for better consistency". I can venture a guess that many many (_many_) snapshots x filesystems will possibly result in pathological cases?
If you ever get fed up, it would be most interesting to see the throughput of
zfs send -R yourpool@now | pv > /dev/null
According to the theory, send uses exactly the same traverse/read operations as a scrub/resilver (resilver being an internal send/receive with added 'live' semantics). Generally, the read speed reached with send should match the read speed while resilvering.
(* pv = pipe viewer, a progress visualizer for pipes; it is available on Debian and from Solaris CSW)
Added by
Jan Ploski
on
Sep 25, 2010 09:19 AM
It seems that it's starting from scratch each time. I'm posting the output of zpool history -i, which contains a record of my two restarts:
2010-09-25.12:22:43 zpool attach green /dev/disk/by-id/dm-name-green1 /dev/disk/by-id/dm-name-green2
2010-09-25.12:41:24 [internal pool scrub done txg:257798] complete=0
2010-09-25.12:41:24 [internal pool scrub txg:257798] func=1 mintxg=3 maxtxg=257702
2010-09-25.16:05:53 [internal pool scrub done txg:258213] complete=0
2010-09-25.16:05:53 [internal pool scrub txg:258213] func=1 mintxg=3 maxtxg=257702
I increased max-arc-size from 400 to 800 (just a lucky try). It didn't change anything.
# zpool iostat 5 8
             capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
green        359G   663G     30      5   611K  18.0K
green        359G   663G     12     27   305K  60.9K
green        359G   663G     42      0  1009K      0
green        359G   663G     34      0   718K      0
green        359G   663G     35      0   723K      0
green        359G   663G     32      0   732K      0
green        359G   663G     32      0   802K      0
green        359G   663G     31      0   478K      0
# iostat /dev/sdb1 /dev/sdc1 5 8
Linux 2.6.34 (remotejava) 09/25/10
avg-cpu: %user %nice %system %iowait %steal %idle
2.00 0.38 2.47 21.17 0.00 73.99
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 27.98 12.24 1168.29 188872 18020251
sdb1 29.44 3379.29 49.18 52123747 758523
avg-cpu: %user %nice %system %iowait %steal %idle
1.51 0.29 2.19 23.60 0.00 72.41
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 40.32 0.00 853.69 0 4277
sdb1 37.72 2229.74 120.36 11171 603
avg-cpu: %user %nice %system %iowait %steal %idle
1.84 0.48 3.01 23.27 0.00 71.40
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 34.40 0.00 1315.60 0 6578
sdb1 40.40 4949.40 1.60 24747 8
avg-cpu: %user %nice %system %iowait %steal %idle
1.67 0.34 2.89 23.70 0.00 71.40
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 30.62 0.00 1628.03 0 8189
sdb1 32.01 4183.70 0.00 21044 0
avg-cpu: %user %nice %system %iowait %steal %idle
1.76 0.29 2.64 24.29 0.00 71.02
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 31.66 0.00 1483.37 0 7402
sdb1 33.07 4393.19 0.00 21922 0
avg-cpu: %user %nice %system %iowait %steal %idle
1.80 0.34 3.11 24.24 0.00 70.52
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 30.74 0.00 1603.79 0 8035
sdb1 32.93 4420.96 0.00 22149 0
avg-cpu: %user %nice %system %iowait %steal %idle
1.70 0.44 2.97 24.55 0.00 70.34
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 31.27 0.00 1213.55 0 6092
sdb1 33.07 4112.95 0.00 20647 0
avg-cpu: %user %nice %system %iowait %steal %idle
1.84 0.19 2.76 24.92 0.00 70.29
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sdc1 29.54 0.00 1229.14 0 6158
sdb1 30.94 3996.21 0.00 20021 0
# uname -a
Linux remotejava 2.6.34 #10 SMP PREEMPT Wed Jul 21 18:23:11 CEST 2010 i686 GNU/Linux
Given that it's not reading much and it's not writing much and it's keeping one CPU core in 100% iowait... Does it mean it's seeking like crazy? Is tps 30 high?
I guess I will try with the newer kernel next as you suggest. I faintly recall a newer kernel was causing problems elsewhere, not with zfs-fuse, but that was some time ago.
By the way, the CKSUM error I had reported earlier disappeared after the second restart of zfs-fuse (which went in the same disruptive fashion as before). I didn't issue a 'zpool clear', so it's also puzzling.
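One rough way to approach the "is tps 30 high?" question is to turn a single iostat sample into an average request size. The following is only a sketch and not something that was run as part of this report: it assumes that iostat's "blocks" are 512-byte sectors, the function names are made up for illustration, and the sample values are copied from one of the sdb1 rows above.

```python
# Assumption: one iostat "block" is a 512-byte sector on this kernel.
SECTOR_BYTES = 512


def avg_request_kib(tps, blk_read_s, blk_wrtn_s=0.0):
    """Average size of one transfer in KiB, from transfers/s and blocks/s."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    bytes_per_second = (blk_read_s + blk_wrtn_s) * SECTOR_BYTES
    return bytes_per_second / tps / 1024


def throughput_mib_s(blk_s):
    """Convert blocks/s into MiB/s."""
    return blk_s * SECTOR_BYTES / (1024 * 1024)


# One sdb1 sample from the iostat output above (tps, Blk_read/s, Blk_wrtn/s).
sample_tps, sample_read, sample_write = 33.07, 4393.19, 0.0

request_kib = avg_request_kib(sample_tps, sample_read, sample_write)
read_mib_s = throughput_mib_s(sample_read)

print(f"average request: {request_kib:.1f} KiB")
print(f"read throughput: {read_mib_s:.2f} MiB/s")
```

For this sample the result is roughly 66 KiB per transfer at a little over 2 MiB/s. That would be consistent with many small, scattered reads, though on its own it does not prove that the disk is seek-bound.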
Added by
Jan Ploski
on
Sep 25, 2010 03:54 PM
I tried the suggested experiment with zfs send -R green/backup/pc2@20100925_013001 | pv > /dev/null
No problems at all there, it reports up to 70 MB/s transfer. This particular dataset had dedup=off.
Fragment of zpool iostat output:
green 359G 663G 559 0 69.5M 0
green 359G 663G 528 0 65.1M 0
green 359G 663G 577 0 71.8M 0
green 359G 663G 580 0 71.7M 0
green 359G 663G 560 0 69.2M 0
green 359G 663G 579 0 72.1M 0
green 359G 663G 570 0 71.0M 0
green 359G 663G 599 0 71.2M 0
green 359G 663G 549 0 68.3M 0
green 359G 663G 558 0 69.4M 0
green 359G 663G 579 0 72.1M 0
green 359G 663G 351 0 36.0M 0
green 359G 663G 198 0 19.8M 0
green 359G 663G 222 0 23.3M 0
green 359G 663G 187 0 20.2M 0
green 359G 663G 591 0 73.6M 0
green 359G 663G 443 0 54.0M 0
green 359G 663G 388 0 45.2M 0
green 359G 663G 479 0 59.7M 0
I also tried the same on a dataset which had dedup=on, with just as good performance.
It's only scrub/resilver that is so painfully slow.
Added by
Seth Heeren
on
Sep 25, 2010 05:51 PM
Ahhhrggg performance is clearly absent.
I'm posting a partial post here because it was left in drafts for most of the day and is now partially obsolete:
The dominance of IOWAIT quite frankly suggests hardware issues to me.
Approaches:
<strike>
1. though I'm sure there are ways to measure IRQ storms and queue depths and whatnot. I simply suggest eliminating factors (remove unused hardware, swap out controllers if possible). </strike>
2. look at the kernel patch mentioned on the list, and/or swapoff
<strike>Still interested in the send >/dev/null speed</strike>
Also, which revision/debug flags does your build use?
Added by
Jan Ploski
on
Sep 25, 2010 06:02 PM
Regarding the zfs send, it's "blazing fast" (in comparison), see above. Or do you insist on transferring from pool@now? (That doesn't work because I have multiple file systems in that pool.)
Regarding swapoff patch, I have no swap in this system.
Regarding hardware, the configuration is unchanged, except for one disk (same model as the prior one). I will try unplugging this new disk to see whether it helps (though I doubt it).
I have an update. After 1h 26m into the scrub process, it's at 8.54% and now "only" 15h 26m to go. I'm also seeing different "zpool iostat" outputs. Whereas in the beginning it seemed to keep steady within the 7-9 MB/s range, it is now varying more. Samples taken every 5 seconds, as in previous measurements:
green 359G 663G 124 20 14.8M 288K
green 359G 663G 179 0 20.8M 0
green 359G 663G 171 0 15.7M 0
green 359G 663G 198 0 12.1M 0
green 359G 663G 173 0 9.13M 0
green 359G 663G 199 0 10.6M 0
green 359G 663G 157 8 7.70M 60.0K
green 359G 663G 161 0 7.40M 819
green 359G 663G 190 0 8.94M 0
green 359G 663G 171 0 6.42M 0
green 359G 663G 215 0 11.0M 0
green 359G 663G 143 0 3.44M 0
green 359G 663G 172 0 3.70M 0
green 359G 663G 134 11 2.14M 61.4K
green 359G 663G 141 0 2.66M 0
green 359G 663G 149 0 1.78M 0
green 359G 663G 127 0 1.75M 0
green 359G 663G 134 0 1.51M 0
green 359G 663G 133 0 2.00M 0
green 359G 663G 114 13 1.91M 61.4K
green 359G 663G 132 0 1.94M 0
green 359G 663G 135 0 2.45M 0
green 359G 663G 145 0 2.23M 0
green 359G 663G 132 0 2.61M 0
green 359G 663G 155 0 2.96M 0
green 359G 663G 184 12 4.14M 60.3K
green 359G 663G 136 0 2.44M 819
green 359G 663G 242 0 2.99M 0
green 359G 663G 166 0 2.68M 0
I will leave both scrub and zpool iostat running overnight so I can analyze the data in the morning.
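The "time to go" figure quoted above is close to what a plain linear extrapolation from elapsed time and percent done would give. The sketch below only illustrates that arithmetic. Whether zpool status computes its estimate this way is an assumption, and the function name is hypothetical.

```python
def linear_eta_minutes(elapsed_min, fraction_done):
    """Remaining minutes, assuming the rate seen so far stays constant."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total_min = elapsed_min / fraction_done
    return total_min - elapsed_min


# Figures from the comment above: 1h 26m elapsed, 8.54% done.
elapsed = 1 * 60 + 26
remaining = linear_eta_minutes(elapsed, 0.0854)

hours, minutes = divmod(round(remaining), 60)
eta_text = f"{hours}h {minutes:02d}m"
print(eta_text)
```

This prints 15h 21m, within a few minutes of the reported 15h 26m. Such an estimate is only as good as the assumption of a constant rate, which the widely varying samples above call into question.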
Added by
Seth Heeren
on
Sep 26, 2010 05:56 AM
> Regarding the zfs send, it's "blazing fast" (in comparison), see above. Or do you insist on transferring from pool@now (that doesn't work because I have multiple file systems in that pool).
Well, insist... no :) But it would be relevant to run against the pool that matters. I mentioned 'zfs send -R' (the -R being operative here). It will include all filesystems, volumes and snapshots in the pool.
To be clear: I don't expect you to receive the stream, just see how fast it will read.
15 hours seems back in the reality zone :) Still awfully sluggish, of course.
Added by
Jan Ploski
on
Sep 26, 2010 06:55 AM
The announced scrub is now complete. It took only 8h 33m! You can have a look at an OpenOffice spreadsheet with zpool iostat output collected during the last 7 hours: http://www.plosquare.com/download/zpool-iostat-20100926.ods (I was not allowed to attach this file, perhaps because it is too big).
As you can see, after a long period of low bandwidth, there is a sudden burst. I now believe that my statement that I somehow managed to reduce the scrub run time from 40h to 6h in the past by altering configuration is false. More likely I was just observing this performance burst, which happened to coincide with my tweaks. As I learned from this example, looking at a short sample of zpool iostat output is unfortunately not a good way to assess or predict scrub performance. I also think that this data exonerates my hardware. More likely the cause is a performance bug and/or the pool's content.
Perhaps the planned resilver would be much quicker than estimated too (the burning question is how many hours to wait before conclusion). I think I'm going to test the evil dedup hypothesis next, that is, transfer all datasets to the other disk, but without dedup=on, then scrub that other disk and observe bandwidth usage.
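Since short samples turned out to be misleading, one option is to log zpool iostat for the whole run and average the read bandwidth over longer windows. The following is a minimal sketch of that idea, not the method used to build the spreadsheet. It assumes the seven-column row layout shown earlier in this thread and that the K/M/G/T suffixes are powers of 1024; all function names are hypothetical.

```python
# Assumption: zpool iostat size suffixes are powers of 1024.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Turn '611K', '69.5M', '819' or '0' into a number of bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def read_bandwidths(lines, pool):
    """Read-bandwidth column (bytes/s) for every data row of the given pool."""
    values = []
    for line in lines:
        fields = line.split()
        # Data rows: pool, alloc, free, read ops, write ops, read bw, write bw.
        if len(fields) == 7 and fields[0] == pool:
            values.append(parse_size(fields[5]))
    return values


def window_means(values, size):
    """Mean of each consecutive window of `size` samples (last may be shorter)."""
    means = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        means.append(sum(chunk) / len(chunk))
    return means


# A few rows copied from the outputs earlier in this thread.
log_lines = [
    "pool alloc free read write read write",
    "green 359G 663G 30 5 611K 18.0K",
    "green 359G 663G 12 27 305K 60.9K",
    "green 359G 663G 42 0 1009K 0",
    "green 359G 663G 559 0 69.5M 0",
]

bandwidths = read_bandwidths(log_lines, "green")
for mean in window_means(bandwidths, 2):
    print(f"{mean / 1024 ** 2:.2f} MiB/s")
```

With samples taken every 5 seconds, a window of 720 samples would correspond to one hour.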
Added by
Seth Heeren
on
Sep 26, 2010 07:09 AM
Note, however, that an initial mirror resilver is more like a scrub of a mirrored pool than a scrub of a single vdev... So if you want to really compare, it is best to receive the data into a mirrored pool with dedup=off.
Any time spent is appreciated, but don't forget that you are under no obligation to do so (and you stated yourself that you sometimes wonder whether you should have to go to all these lengths as a user - just saying...)
Added by
Jan Ploski
on
Sep 26, 2010 12:19 PM
The "zfs send -R green@now | pv > /dev/null" experiment is now finished. Almost 2x faster than the scrub, reasonable and mostly steady bandwidth usage:
http://www.plosquare.com/[…]/zpool-iostat-send-20100926.ods
Added by
Jan Ploski
on
Sep 27, 2010 03:55 AM
I repeated the scrub with unmounted and unshared file systems to test your theory about threading issues. It took even more time than before, over 10h. I can't interpret the measurements at all. This time it doesn't look like there is an initial preparation/seek phase after which the bandwidth goes up. Perhaps the profile looks so different from the first because it was the second scrub within a short time.
http://www.plosquare.com/[…]/zpool-iostat-scrub-unmounted-20100926.ods

