#29 — ZFS-FUSE crashes (SIGABRT on file deletion, thread synchronization?)
| State | Rejected |
|---|---|
| Version | 0.6.0 |
| Area | Functionality |
| Issue type | Bug |
| Severity | Medium |
| Submitted by | (anonymous) |
| Submitted on | Feb 24, 2010 |
| Responsible | Seth Heeren |
| Target release | - |
| Last modified | Nov 10, 2010 by Seth Heeren |
ZFS-FUSE crashes from time to time for no apparent reason. It just disappears. I don't get a core file or a segfault message in dmesg, so it isn't dying with SIGSEGV. I can't record a backtrace, and I find nothing related in the logs. More than half of the time this happens, I'm downloading photos from my camera with Digikam. I assume that it is exactly the same issue I reported earlier on the mailing list, which happened under the same conditions: http://groups.google.com/gr[…]1b2%3F#doc_1589609c1f825b8c
I get this with both origin/critical and Emmanuel Anne's repository.
Here's a copy of the error and backtrace I got originally, assuming it's the same problem:
run-zfs-fuse: lib/libzpool/zfs_znode.c:576: zfs_znode_dmu_fini:
Assertion `zp->z_dbuf != ((void *)0)' failed.
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fc12dea6950 (LWP 12241)]
0x00007fc15392f205 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007fc15392f205 in raise () from /lib/libc.so.6
#1 0x00007fc153930723 in abort () from /lib/libc.so.6
#2 0x00007fc153928229 in __assert_fail () from /lib/libc.so.6
#3 0x00000000004777bc in zfs_znode_dmu_fini (zp=0x7fc0eb157540)
at lib/libzpool/zfs_znode.c:576
#4 0x000000000040a130 in zfs_rmnode (zp=0x7fc0eb157540)
at zfs-fuse/zfs_dir.c:623
#5 0x000000000041c13f in zfs_inactive (vp=0x7fc0dddd6e10,
cr=<value optimized out>, ct=<value optimized out>)
at zfs-fuse/zfs_vnops.c:3925
#6 0x0000000000420ac3 in zfsfuse_getattr_helper (req=0x89ac50,
ino=<value optimized out>, fi=<value optimized out>)
at zfs-fuse/zfs_operations.c:176
#7 0x000000000041cb4b in zfsfuse_listener_loop (arg=<value optimized out>)
at zfs-fuse/fuse_listener.c:267
#8 0x00007fc154496017 in start_thread () from /lib/libpthread.so.0
#9 0x00007fc1539cd48d in clone () from /lib/libc.so.6
#10 0x0000000000000000 in ?? ()
Added by Seth Heeren on Feb 24, 2010 01:39 PM
since the stack trace matches your (? anon) other issue report; we don't usually see asserts like this.
Emmanuel's latest repo should not throw asserts at all unless you explicitly build for debug mode. Caveats both ways.
Could you tell us a bit about your config:
1. os
2. fuse versions
3. pool version
4. pool properties
5. filesystem version and properties
6. feature usage (zfs list, zfs get -r all)
Hardware, perhaps? Is partitioning involved? Other filesystems or software running?
What kind of (concurrent) filesystem load do you have? What is your /etc/zfs/zfsrc?
Added by (anonymous) on Feb 24, 2010 01:58 PM
As I said, I repasted the backtrace from the old issue since I can't create a new backtrace right now (I stopped running zfs-fuse in gdb two or three months ago, since it had weird side effects such as hanging my whole system when zfs-fuse received a signal). I'm not sure that it's the same issue, and if it doesn't throw asserts anymore, it's possible that it's another issue, especially if the older one is fixed.
Give me some advice on how to create a backtrace. I've enabled core file creation, but as it isn't a segfault, I don't get one. I prefer to not run zfs-fuse in gdb, and when I did I had it running for months without issues. I have found no way to reproduce it directly. It just happens sometimes.
1. OS: Debian SID
2. Fuse 2.8.1
3. Pool version and properties:
pool size 2,61T -
pool capacity 51% -
pool altroot - default
pool health ONLINE -
pool version 16 local
pool bootfs - default
pool delegation on default
pool autoreplace off default
pool cachefile - default
pool failmode wait default
pool listsnapshots off default
pool autoexpand off default
pool dedupditto 0 default
pool dedupratio 1.00x -
pool free 1,25T -
pool allocated 1,36T -
4. Filesystem version and properties
pool/warehouse type filesystem -
pool/warehouse creation Mon May 5 17:30 2008 -
pool/warehouse used 754G -
pool/warehouse available 827G -
pool/warehouse referenced 730G -
pool/warehouse compressratio 1.00x -
pool/warehouse mounted yes -
pool/warehouse quota none default
pool/warehouse reservation none default
pool/warehouse recordsize 128K default
pool/warehouse mountpoint /dpt local
pool/warehouse sharenfs off default
pool/warehouse checksum on default
pool/warehouse compression off default
pool/warehouse atime on default
pool/warehouse devices on default
pool/warehouse exec on default
pool/warehouse setuid on default
pool/warehouse readonly off default
pool/warehouse zoned off default
pool/warehouse snapdir hidden default
pool/warehouse aclmode groupmask default
pool/warehouse aclinherit restricted default
pool/warehouse canmount on default
pool/warehouse shareiscsi off default
pool/warehouse xattr on default
pool/warehouse copies 1 default
pool/warehouse version 3 -
pool/warehouse utf8only off -
pool/warehouse normalization none -
pool/warehouse casesensitivity sensitive -
pool/warehouse vscan off default
pool/warehouse nbmand off default
pool/warehouse sharesmb off default
pool/warehouse refquota none default
pool/warehouse refreservation none default
pool/warehouse primarycache all default
pool/warehouse secondarycache all default
pool/warehouse usedbysnapshots 0 -
pool/warehouse usedbydataset 0 -
pool/warehouse usedbychildren 0 -
pool/warehouse usedbyrefreservation 0 -
pool/warehouse logbias latency default
pool/warehouse dedup off default
pool/warehouse mlslabel off -
5. Load: Normal desktop load: Firefox updating its history database, Amarok writing some stats to a database, Kopete writing some stats to a database, some downloads running, Digikam downloading photos (and updating a database). In more than half of the cases, Digikam was running and downloading photos, and after reboot either the database journal was corrupted, or its owner was changed to root, or I experienced bug #28 with Digikam's database journal in the backtrace.
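On the missing core file: an assertion failure ends in abort(), and the default action for SIGABRT is to terminate with a core dump, so a missing core may point at the limits the daemon actually runs with rather than at the signal. A minimal sketch for checking that, assuming a Linux /proc filesystem; the helper name soft_core_limit is made up for illustration:

```shell
# Print the soft "Max core file size" limit from a /proc/<pid>/limits-style file.
# A value of 0 means the kernel will not write a core file for that process.
soft_core_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Usage (assumes the daemon's process name is zfs-fuse):
#   soft_core_limit "/proc/$(pgrep -o -x zfs-fuse)/limits"
#   cat /proc/sys/kernel/core_pattern   # where a core file would be written
```

A daemon started from an init script inherits that script's limits, so running ulimit -c unlimited in an interactive shell does not change them.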
Added by Milko Krachounov on Mar 03, 2010 06:13 AM
I managed to reproduce the bug several times already. However, I'm completely unable to debug it with gdb. gdb just hangs for me.
How the problem is reproduced on my FS: I run a scan with clamav using klamav on my ZFS filesystems. About an hour or two into the scan, the process just exits with status 255. By the way, is this the "new way" it throws assertions? And why was the previous, more verbose one removed, then?
Here's a related log:
http://paste.pocoo.org/show/184607/
Added by Seth Heeren on Mar 03, 2010 08:15 AM
Responsible manager:
(UNASSIGNED) → sgheeren
Thanks for the update
First things first
> By the way, is this the "new way" it throws assertions?
Well, I'm guessing this is not actually a question. But: "no", that is not the new way assertions are thrown (why _do_ you ask, really?).
With that out of the way
> I run a scan with clamav using klamav on my ZFS filesystems. About hour or two in the scan, and the process just exits with status 255.
Thanks for a clear recipe to start reproducing this.
On the debugging issues:
(a) consider attaching gdb just-in-time, in case some kind of resource leak/build-up is involved
gdb --pid $(pgrep zfs-fuse)
(b) with a view to possible sources of resource exhaustion, consider using the /canonical/ (invented term there) -a 1 -e 1 instead of the large timeouts. If all you do is sequential FS scanning, you won't get any benefit from longer-term caching anyway. The /used/ cache is localized only and the rest should in fact be discarded as quickly as possible. I'm not sure this should reduce any resource usage, but it seems to be sensible for the situation anyway.
(c) Bear in mind that options from /etc/zfs/zfsrc will still take effect even if you specify other options on the command line
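Suggestion (a) can also be run unattended. A minimal sketch, assuming gdb and pgrep are installed and the daemon's process name is zfs-fuse; the function names and the log path are placeholders, not part of zfs-fuse:

```shell
# Print the gdb command line that attaches to PID, lets the process run, and
# dumps backtraces of all threads once it stops on a signal (e.g. SIGABRT).
gdb_wait_cmd() {
    printf 'gdb --batch --pid %s -ex continue -ex "thread apply all bt"\n' "$1"
}

# Attach to the oldest process with the given name and log what gdb prints.
# Assumption: run as root, or as the user that owns the daemon.
watch_daemon() {
    name=${1:-zfs-fuse}
    log=${2:-/tmp/zfs-fuse-gdb.log}
    pid=$(pgrep -o -x "$name") || { echo "no process named $name" >&2; return 1; }
    gdb --batch --pid "$pid" -ex continue -ex "thread apply all bt" > "$log" 2>&1
}
```

If the process exits on its own (for example with status 255) rather than aborting, gdb only reports the exit code and there is no stack left to print; catching that case would need a breakpoint on exit instead.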
Added by Seth Heeren on May 22, 2010 09:19 AM
Issue state:
unconfirmed → open
I decided to give it one more shot at reproducing this, using the info from this ticket and the group posts. I initially used the critical branch, built with debug=2 and run with --no-daemon under gdb.
Setup: an 85M pool on a 3x64M raidz (minimum size); data is faked by creating randomly named 1M files from /dev/urandom.
1. Fill the pool until it is full, then roll back to @now (80% full); rinse and repeat in a tight loop.
2. Run 5 parallel clamscan jobs on all the files (quad-core system).
Job [1.] runs for a long time without issue (except for heating up a CPU core).
Job [2.] runs for a long time without issue (except for heating up 4 cores).
ROUGH OUTPUT
=============
root@karmic:~# sudo umount -f /tmp/pool; ZFS_GDB=1 ./exabyt
keeping zfs daemon (being debugged)
======== building pool l named pool at /tmp/pool_blk =================================
zpool create -O mountpoint=/tmp/pool pool raidz /tmp/pool_blk/za1 /tmp/pool_blk/za2 /tmp/pool_blk/za3
pool: pool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
raidz1 ONLINE 0 0 0
/tmp/pool_blk/za1 ONLINE 0 0 0
/tmp/pool_blk/za2 ONLINE 0 0 0
/tmp/pool_blk/za3 ONLINE 0 0 0
errors: No known data errors
NAME USED AVAIL REFER MOUNTPOINT
pool 102K 86.5M 28.0K /tmp/pool
root@karmic:~# for a in $(seq 1 70); do dd if=/dev/urandom of=/tmp/pool/small$RANDOM.bin bs=1M count=1; done 2>/dev/null
root@karmic:~# zfs snapshot pool@now
root@karmic:~# (while true; do for a in $(seq 1 20); do dd if=/dev/urandom of=/tmp/pool/small$RANDOM.bin bs=1M count=1 2>&1; done; zfs rollback -r pool@now 2>&1; done) > /dev/null&
[1] 22376
root@karmic:~# while true; do for a in 1 2 3 4 5; do clamscan /tmp/pool & done; wait; done;
I let this draconian stress test run for 15+ minutes without seeing an issue. Then it hit the assert (error == 0) in zfsfuse_release. Since this was recently addressed for issue #43, I decided to retest with that version.
I'll post how long it runs without trouble from now on.
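The transcript above can be collected into a reusable script. A rough sketch, assuming a pool mounted at the given directory and an existing snapshot; the function names are invented here, and mktemp replaces $RANDOM so that file names cannot collide:

```shell
# Write COUNT files of 1 MiB random data with unique names into DIR.
# Stops early (non-zero status) when a write fails, e.g. because the pool is full.
make_random_files() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        f=$(mktemp "$dir/small.XXXXXX") || return 1
        dd if=/dev/urandom of="$f" bs=1048576 count=1 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Job 1: add files, then roll back to the snapshot, forever.
# Needs root and a real pool, e.g.: churn_loop /tmp/pool pool@now &
churn_loop() {
    dir=$1
    snap=$2
    while true; do
        make_random_files "$dir" 20
        zfs rollback -r "$snap"
    done
}

# Job 2: run NJOBS clamscan passes over DIR in parallel, wait, repeat.
# e.g.: scan_loop /tmp/pool 5
scan_loop() {
    dir=$1
    njobs=$2
    while true; do
        j=0
        while [ "$j" -lt "$njobs" ]; do
            clamscan "$dir" > /dev/null 2>&1 &
            j=$((j + 1))
        done
        wait
    done
}
```

Sourcing the file only defines the functions; the two loops run until interrupted, so they are meant to be started by hand against a throwaway pool.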
Added by Seth Heeren on May 22, 2010 10:50 AM
Issue state:
open → rejected
After letting that run for more than 2 hours, I decided to close this particular bug report for three reasons:
(a) lack of info: a relevant stack trace wasn't there ("assuming it is the same problem")
(b) not reproducible
(c) the only problem (re?)produced was removed in the fix for bug #43

