#27 — crash (and unable to restart) with dedup and reservations
| State | Resolved |
|---|---|
| Version: | |
| Area | Functionality |
| Issue type | Bug |
| Severity | Important |
| Submitted by | Spam Domain |
| Submitted on | Feb 10, 2010 |
| Responsible | Seth Heeren |
| Target release: | 0.6.9 |
Last modified on
May 22, 2010
by
Seth Heeren
If you have dedup on and a reservation set you can crash zfs-fuse and will be unable to start it up.
Steps to reproduce:
[root@b64-5 ~]# zpool create -O dedup=on mypool vg/zp1 vg/zp2
[root@b64-5 ~]# zfs create mypool/a
[root@b64-5 ~]# zfs create mypool/b
[root@b64-5 ~]# zfs set reservation=3G mypool/a
[root@b64-5 ~]# zpool list
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
mypool 3.97G 227K 3.97G 0% 1.00x ONLINE -
[root@b64-5 ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 3.00G 928M 24K /mypool
mypool/a 21K 3.91G 21K /mypool/a
mypool/b 21K 928M 21K /mypool/b
[root@b64-5 ~]# cp /mirrors/centos/5/isos/x86_64/CentOS-5.4-x86_64-bin-[12]of7.iso /mypool/a/
[root@b64-5 ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 3.00G 926M 24K /mypool
mypool/a 1.18G 2.72G 1.18G /mypool/a
mypool/b 21K 926M 21K /mypool/b
[root@b64-5 ~]# cp /mypool/a/* /mypool/b/
[root@b64-5 ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
mypool 4.15G 905M 24K /mypool
mypool/a 1.18G 2.70G 1.18G /mypool/a
mypool/b 1.15G 905M 1.15G /mypool/b
[root@b64-5 ~]# rm -f /mypool/a/*
At this point zfs-fuse will crash and will refuse to start up. Hope this makes it clear how to reproduce this.
Added by
Seth Heeren
on
Feb 10, 2010 04:52 PM
Responsible manager:
(UNASSIGNED) → sgheeren
have tried your steps, could not reproduce.
What version are you running?
Added by
Seth Heeren
on
Feb 10, 2010 04:59 PM
attached a shell transcript from my 'success'
Added by
Spam Domain
on
Feb 10, 2010 04:59 PM
I did a clone of http://rainemu.swishparty.co.uk/cgi-bin/gitweb.cgi?p=zfs today (2/10 @ 13:00 GMT)
The important thing here is to make sure that the deduped data will cause /mypool/b to exceed the space available to it once you remove the files in /mypool/a.
Before any data is copied my pool only has 928M available to it because of the reservation:
mypool/b 21K 928M 21K /mypool/b
After copying the data from /mypool/a to /mypool/b the space used by /mypool/b is > 928M
mypool/b 1.15G 905M 1.15G /mypool/b
This doesn't cause an issue initially because it is all duplicate data that was copied from /mypool/a. When the data in /mypool/a is removed, /mypool/b ends up using more than it should be allowed to, and that causes the crash (assumption).
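The space arithmetic described above can be sketched as follows. This is only a toy calculation using the rounded figures from the `zfs list` output in the report; the variable names are made up for illustration, and it does not reproduce the real accounting in dsl_dir.c.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Figures taken (rounded) from the `zfs list` output in the report.
reservation_a = 3 * GIB              # zfs set reservation=3G mypool/a
avail_outside_a = 928 * MIB          # AVAIL shown for mypool/b before any copy
usable = reservation_a + avail_outside_a

# After `cp /mypool/a/* /mypool/b/`, b refers to about 1.15G of data.
# While a still holds the same blocks, dedup hides the cost.
used_b = int(1.15 * GIB)

# After `rm -f /mypool/a/*`, a's reservation is still charged in full,
# and b is now the only owner of the copied blocks.
charged_after_rm = reservation_a + used_b
overcommit = charged_after_rm - usable

print(f"usable space     : {usable / MIB:8.1f} MiB")
print(f"charged after rm : {charged_after_rm / MIB:8.1f} MiB")
print(f"overcommitted by : {overcommit / MIB:8.1f} MiB")
```

With these figures the pool ends up overcommitted by roughly 250 MiB, which is the situation the reporter assumes triggers the crash.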
Added by
Seth Heeren
on
Feb 10, 2010 05:01 PM
Ok, now I get it. What is the upstream behaviour? My OSol nv132 box is currently in the middle of a large backup transfer so I'd rather not crash it at this time :)
Added by
Spam Domain
on
Feb 10, 2010 05:09 PM
Not sure what upstream behavior is and don't have a way to test it. I was just running through some reservation examples, trying to understand how reservations work with dedup, and ran into this case. Thought it might be useful for someone to look into and fix if possible.
Added by
Seth Heeren
on
Feb 10, 2010 05:29 PM
Issue state:
unconfirmed → open
Ok, with the pool capacity vs. reservation in mind, I have managed to break zfs-fuse, though in a different fashion (I kept getting clean E_NOSPC messages). Only when I tried to destroy the 'b' fs (which has no reservation) did the daemon crash.
Attached the script to replay (http://zfs-fuse.net/wiki/reproduced.zip)
## to prepare test pool (my now famous dorky.sh preconfigured for a simple 1G pool)
# ./dorky.sh
## to view actions:
# scriptreplay scriptreplay timing typescript
Warning: the dorky.sh script assumes an idle machine with *no critical zpools*!
Added by
Seth Heeren
on
Feb 10, 2010 05:37 PM
Severity:
Medium → Low
The culprit seems to be
lib/libzpool/dsl_dir.c:654: dsl_dir_space_available: Assertion `used <= spa_get_dspace(dd->dd_pool->dp_spa) (0x5a571400 <= 0x3f800000)` failed.
I did not see much of a problem when building with an explicit debug=0
Could you try
# cd src
# time (scons -j4 debug=0; sudo scons -j4 debug=0 install)
and retest? My assumption is that zfs-fuse is panicking _on demand_ here. When running ZFS in debug builds on ONNV (OpenSolaris) you'd expect it to panic the kernel on the same line _by design_
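For readers who do not think in hex, the two values printed by the assertion can be decoded with a few lines. The numbers are copied from the assertion message above; calling them `used` and `dspace` simply follows the expression in that message.

```python
MIB = 1024 ** 2

# Values copied verbatim from the failed assertion:
#   used <= spa_get_dspace(...)  (0x5a571400 <= 0x3f800000)
used = 0x5A571400
dspace = 0x3F800000

print(f"used   = {used:>10d} bytes = {used / MIB:.1f} MiB")
print(f"dspace = {dspace:>10d} bytes = {dspace / MIB:.1f} MiB")
print("assertion condition holds:", used <= dspace)
```

So the directory is charged with roughly 1445 MiB while the pool reports 1016 MiB of usable space, which fits the simple 1G test pool mentioned earlier.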
Added by
Spam Domain
on
Feb 10, 2010 05:50 PM
It may just be a timing issue. I know we have delayed delete and it doesn't crash until it attempts to remove the data. In your output, the zpool list after the removal of /a/ shows that the delete hasn't taken place yet, as dedup is still 1.99x.
I tried running the same sequence of commands and get the same output as you. I'm assuming that the directory structure is cached and that is why the find/etc. is sort of working. If you try another zfs command after the delete has apparently taken place, you will find that it has already crashed and the daemon is dead.
Added by
Seth Heeren
on
Feb 10, 2010 06:02 PM
The delete will have to walk the DDT (dedup tables) and decrease the block references. This is a time-consuming operation (and the cause of many _serious_ upstream bug reports: even after a reboot, pools of considerable size can be 'stuck' for hours or even days while committing the last pending transaction that involved deletions/destroys on dedup-enabled pools).
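The reference counting described above can be illustrated with a toy model. This is not the real on-disk DDT; the class `ToyDedupPool` and its methods are hypothetical names used only to show why deleting one of two deduplicated copies frees no blocks.

```python
from collections import Counter


class ToyDedupPool:
    """Toy dedup table: one reference count per unique block checksum."""

    def __init__(self):
        self.refcount = Counter()

    def write(self, checksums):
        # Writing a block that already exists only bumps its refcount.
        for checksum in checksums:
            self.refcount[checksum] += 1

    def free(self, checksums):
        # Deleting walks every block and drops one reference each;
        # the block is released only when nobody refers to it any more.
        for checksum in checksums:
            self.refcount[checksum] -= 1
            if self.refcount[checksum] == 0:
                del self.refcount[checksum]

    def allocated_blocks(self):
        return len(self.refcount)


pool = ToyDedupPool()
iso_blocks = [f"blk{i}" for i in range(1000)]

pool.write(iso_blocks)                 # cp ... /mypool/a/
pool.write(iso_blocks)                 # cp /mypool/a/* /mypool/b/
after_copy = pool.allocated_blocks()

pool.free(iso_blocks)                  # rm -f /mypool/a/*
after_delete = pool.allocated_blocks()

print(after_copy, after_delete)
```

The allocated block count is the same before and after the delete, because every block is still referenced by the copy in b. The walk over all blocks is also why the operation takes time on real pools.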
For our current issue, this means that it makes a lot of sense that the deletion takes time. The issue is not that it takes time for the daemon to finally/actually crash. The issue is that the daemon crashes at all. So, no I don't think we should call this a timing issue.
I think the /issue/ we are looking at is mainly that you are using a debug build, which is trip-wired to baulk at circumstances like this. Note that assert(...) calls abort(...). Look up the man page for ABORT(3) to see what that implies.
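What abort() implies for a process can be observed from the outside with a small experiment. This is a sketch that assumes a POSIX system; a throwaway Python child process stands in for the daemon here, and it is not the zfs-fuse code path itself.

```python
import signal
import subprocess
import sys

# The child calls abort() right away, much like a failed C assert() would.
child = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    stderr=subprocess.DEVNULL,
)

# On POSIX, a negative return code means "terminated by that signal".
killed_by_sigabrt = child.returncode == -signal.SIGABRT
print("child return code:", child.returncode)
print("killed by SIGABRT:", killed_by_sigabrt)
```

The child is terminated by SIGABRT on the spot, with no chance to shut down cleanly, which is consistent with the daemon simply being dead the next time a zfs command is issued.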
But we cannot be sure that the pool behaves properly again when using a non-debug build until you retest. I retested myself (keeping zpool.cache, so reloading the same pool as is). I was then able to clean up and get the pool into a working state with the reservation honoured (I think it said "rm b/* --> E_NOSPC", but "zfs destroy dorky1/b --> OK")
Let me know when you have info?
Added by
Spam Domain
on
Feb 10, 2010 06:10 PM
The timing comment was in regards to your getting a different crash than I did.
I'm running whatever the rpm builds. I'm not aware of it doing a debug build but it may be. I'll try doing a build like you suggest and see if it behaves correctly.
Added by
Seth Heeren
on
Feb 10, 2010 06:23 PM
Severity:
Low → Important
See http://groups.google.com/[…]/12c11fd87ac8e7c8
I think that calls for a discussion. My bias will be clear :)
Added by
Spam Domain
on
Feb 10, 2010 08:54 PM
I've rebuilt the rpm with debug=0 and it performs more like I would expect it to. There is no crash and it shows 0 bytes available. One weird side effect is that you are unable to delete any files. It errors out with a "No space left on device" error. However, if I change the reservation to give that filesystem a little room then I'm able to delete files and get it back within limits.
So in the end this isn't a bug with zfs but with the building of the rpm package. The default needs to change to debug=0 or the rpm needs to specify debug=0 during building. Here is the change I made to the rpm to get it to build correctly and not throw any errors. I'd also recommend updating the version/release of the spec as I assume that the tree with dedup and raidz3 isn't 0.6.0-0.0.###snapshot anymore.
--- zfs-fuse.spec.old 2010-02-10 08:15:39.000000000 -0700
+++ zfs-fuse.spec 2010-02-10 18:43:04.000000000 -0700
@@ -10,6 +10,7 @@
Source0: %{name}-%{version}.tar.bz2
BuildRoot: %{_tmppath}/%{name}-%{version}-root
BuildRequires: fuse-devel libaio-devel zlib-devel scons
+BuildRequires: openssl-devel libattr-devel
%description
ZFS (formerly the Zettabyte File System), is a filesystem invented by
@@ -61,7 +62,7 @@
%build
cd src
-scons
+scons debug=0
%install
[ "$RPM_BUILD_ROOT" != "/" ] && [ -d $RPM_BUILD_ROOT ] && rm -rf $RPM_BUILD_ROOT;
@@ -73,7 +74,7 @@
mkdir -p $RPM_BUILD_ROOT%{_mandir}/man8
install -m 644 doc/*.8.gz $RPM_BUILD_ROOT%{_mandir}/man8
cd src
-scons install install_dir=$RPM_BUILD_ROOT%_sbindir
+scons debug=0 install install_dir=$RPM_BUILD_ROOT%_sbindir man_dir=$RPM_BUILD_ROOT%_mandir/man8/
%clean
[ "$RPM_BUILD_ROOT" != "/" ] && [ -d $RPM_BUILD_ROOT ] && rm -rf $RPM_BUILD_ROOT;
Added by
Seth Heeren
on
Feb 10, 2010 09:02 PM
Target release:
None → 0.6.1
good news. this confirms the fix/workaround for now. I'm still inclined to see what happens on opensolaris (regarding the ENOSPC) but regardless.
note that the stable version of zfs-fuse supports up to and including pool version 16. That excludes the features you mention. The only release that is actually actively anticipated is a 0.6.1 with stability fixes towards 0.6.0 (see homepage).
I do agree that dedup and raidz3 are nice, but zfs-fuse will always have to lag behind a bit (by the way, not much more (if at all) than Sun Solaris 10 lags behind Open Solaris)
$0.02 -
Keep an eye on the discussion at the google list and thanks for the patch. I'll keep this one open in order to apply that after some discussion.
Added by
Seth Heeren
on
May 22, 2010 11:09 AM
Issue state:
open → resolved
Target release:
0.6.1 → 0.6.9
patch landed as b79138c9
Behaviour is now consistent with 'normal' capacity exhaustion (of the form where snapshots cannot be rolled back to, but can be destroyed, and individual files cannot be deleted (ENOSPC)).
This matches OSol behaviour.

repro.log
