Personal tools
You are here: Home Issue tracker ZFS-FUSE crashes (while re-silvering) with spontaneous SIGABRT

#71 — ZFS-FUSE crashes (while re-silvering) with spontaneous SIGABRT

State Resolved
Version: 0.6.9
Area Functionality
Issue type Bug
Severity Medium
Submitted by Eric Astor
Submitted on Jul 09, 2010
Responsible Seth Heeren
Target release:
Return to tracker
Last modified on Sep 19, 2010 by Seth Heeren
The zfs-fuse daemon crashes spontaneously, while re-silvering a large two-disk mirror (one 1 TB disk, one 1.5 TB, with ~300 GB used). The source filesystem is originally created by OpenSolaris, upgraded to pool version 23 by ZFS-FUSE, and moved (via send/receive) to a brand-new created pool of one device, which I then attached the 1.5 TB drive to as a mirror. As near as I can tell, the problem happened not much more than half-way through the re-silvering, with no other process accessing the filesystems on the pool. One of the filesystems was being shared (but not accessed) via SMB and NFS. The problem may well be unrelated to the re-silvering, but it's the only factor I know of - I have no idea how to reproduce this bug, as even in this context, the problem takes several hours before it occurs.

I duplicated this bug a second time, after rebooting the system to put it in a clean state. I was later able to try this again, and get the same problem, with ZFS-FUSE running in gdb. The file with the full output is here, but I've included the snippet with the problem below. There's nothing of interest above this - just threads being created and dying. I did run a backtrace, which is included here:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0xa9d50b70 (LWP 2130)]
0x0012d422 in __kernel_vsyscall ()
(gdb) bt
#0 0x0012d422 in __kernel_vsyscall ()
#1 0x00322651 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0x00325a82 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x080d2c25 in ?? ()
#4 0x080d2e6a in ?? ()
#5 0x080d2ea7 in ?? ()
#6 0x080d2efc in ?? ()
#7 0x0804f6fb in ?? ()
#8 0x0804fb7a in ?? ()
#9 0x08066c15 in ?? ()
#10 0x08069b24 in ?? ()
#11 0x0806b430 in ?? ()
#12 0x00172ec1 in ?? () from /lib/libfuse.so.2
#13 0x001727ba in ?? () from /lib/libfuse.so.2
#14 0x00175976 in fuse_session_process () from /lib/libfuse.so.2
#15 0x08068888 in ?? ()
#16 0x0014c96e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#17 0x003c5a4e in clone () from /lib/tls/i686/cmov/libc.so.6
(gdb)

I'm currently re-running with strace on to see if I can spot anything else, and will update this bug if I see the problem a fourth time in this context.
Attached:
zfs-fuse_gdb-report.txt — Plain Text, 10Kb
Added by Eric Astor on Jul 09, 2010 04:38 AM
One addendum: the pool in question has no special options enabled. In particular, dedup was not enabled.
Added by (anonymous) on Jul 09, 2010 05:20 AM
Hi Eric thanks for reporting this issue.

It's going to be hard to reproduce without having your actual pool. I'm sorry that this is taking you a lot of time because the pool is rather large :) So let's try to minimize the pain by looking for specifics.

(a) what do we know?
From the stacktrace we can safely say that (although you state "no other process accessing the filesystems on the pool") there was a simultaneous access via the fuse layer (meaning normal VFS operation like stat, sync, open etc). This ought not to be a problem, of course, but as long as we are diagnosing, it is good to keep in mind that the failure happens during such an access. (This might be from a df, your logwatch, an automount service, or whatever)

What do we want to know?
(b) what version is this (git log -1? package url?)
(c) what is /etc/zfs/zfsrc?
(d) what is in syslog (grep -i zfs /var/log/syslog)?
(e) how do you trigger the resilver? resilver != scrub and a scrub might be canceled, which helps debugging.
(f) zfs get -s local,received all?
(g) what is 'zpool history -i'?
(h) can you scrub this on OSol? I hope your solaris box is recent enough to import this pool version :)
(i) what happens when you disable sharing explicitely? zfs unshare -a should do the trick without having to alter the actual properties
Things to try:

(j) use --fuse-mount-options debug
(k) build and run in foreground:
   cd src
   scons debug=2
   ./zfs-fuse -o debug -n |& tee zfs-fuse.log
(l) try to (first) avoid the resilver (by zfs [detach|offline] pool <incompletely_resilvered_vdev, or by importing degraded without the disk online); if it is a scrub, cancel by 'zpool scrub -s pool', and then:
(m) try to locate any issues in specific datasets by using send. Send uses essentially the same traversal code as a resilver

zfs send -Rv pool@now > /tmp/debug_stream

This will let you know (on console) where in the process it bugs. The info can further be used with zdb or zstreamdump in order to find more details

Added by Eric Astor on Jul 09, 2010 12:21 PM
a) I realized right before going to sleep last night that there was one other factor - in every case, I'd had a "watch zpool status" running, so I could easily monitor the status of the resilver. That's *probably* the simultaneous access. It's also somewhat possible that I'd accidentally triggered a stat, through a share.

b) The version is the package of 0.6.9 for 32-bit Ubuntu linked from the front page:
https://launchpad.net/[…]/zfs-fuse_0.6.9-6_i386.deb

c) zfsrc is the default for the package; I made no changes.

d) syslog contains some entries from zfs. The only messages are a mix of the following groups:

Jul 9 04:21:56 www zfs-fuse: initial max_map_count 65530
Jul 9 04:21:56 www zfs-fuse: ARC caching: maximum ARC size: 100 MiB
Jul 9 04:21:56 www zfs-fuse: ARC setup: min ARC size set to 16777216 bytes
Jul 9 04:21:56 www zfs-fuse: ARC setup: max ARC size set to 104857600 bytes

and

Jul 9 04:23:22 www zfs-fuse: kstat: fuse_mount error - trying to umount

The latter, of course, seem to appear once I was trying to recover from the problem.

e) The first time, I triggered the resilver by letting the system realize on its own time that the second drive was blank. All others, I've sped the process up a bit by initiating a scrub, which terminated and started the resilver.

f) No properties in zfs get -s local all, but some in zfs get -s received all; I've attached the result of the query.

g) The relevant piece of zpool history -i is here:

2010-07-08.14:16:48 zpool import newStorage storage
2010-07-08.14:18:19 [internal pool scrub txg:17252] func=1 mintxg=3 maxtxg=17252
2010-07-08.14:18:19 [internal vdev attach txg:17252] attach vdev=/dev/disk/by-id/usb-WD_Ext_HDD_1021_574D41565533303037313934-0:0-part1 to vdev=/dev/disk/by-id/ata-WDC_WD10EADS-00M2B0_WD-WCAV51390627-part1
2010-07-08.14:18:33 zpool attach storage /dev/disk/by-id/ata-WDC_WD10EADS-00M2B0_WD-WCAV51390627-part1 /dev/disk/by-id/usb-WD_Ext_HDD_1021_574D41565533303037313934-0:0-part1
2010-07-08.17:37:26 [internal pool scrub done txg:17529] complete=0
2010-07-08.17:37:26 [internal pool scrub txg:17529] func=1 mintxg=3 maxtxg=17252
2010-07-08.18:18:56 [internal pool scrub done txg:17590] complete=0
2010-07-08.18:18:56 [internal pool scrub txg:17590] func=1 mintxg=3 maxtxg=17252
2010-07-08.18:27:36 [internal pool scrub done txg:17598] complete=0
2010-07-08.18:27:36 [internal pool scrub txg:17598] func=1 mintxg=3 maxtxg=17252
2010-07-08.18:27:50 [internal pool scrub done txg:17600] complete=0
2010-07-08.18:27:50 [internal pool scrub txg:17600] func=1 mintxg=3 maxtxg=17252
2010-07-08.18:27:50 zpool scrub storage
2010-07-08.23:11:02 [internal pool scrub done txg:18105] complete=0
2010-07-08.23:11:02 [internal pool scrub txg:18105] func=1 mintxg=3 maxtxg=17252
2010-07-08.23:30:06 [internal pool scrub done txg:18147] complete=0
2010-07-08.23:30:06 [internal pool scrub txg:18147] func=1 mintxg=3 maxtxg=17252
2010-07-08.23:30:20 [internal pool scrub done txg:18149] complete=0
2010-07-08.23:30:20 [internal pool scrub txg:18149] func=1 mintxg=3 maxtxg=17252
2010-07-08.23:30:20 zpool scrub storage
2010-07-09.04:23:32 [internal pool scrub done txg:18617] complete=0
2010-07-09.04:23:32 [internal pool scrub txg:18617] func=1 mintxg=3 maxtxg=17252
2010-07-09.04:23:47 [internal pool scrub done txg:18619] complete=0
2010-07-09.04:23:47 [internal pool scrub txg:18619] func=1 mintxg=3 maxtxg=17252
2010-07-09.04:23:47 zpool scrub storage

h) Unfortunately, I was having issues with my OSol install, so blew it away and re-installed with Linux and ZFS-FUSE. I don't currently have an OSol box, and gave my system all the space available outside the ZFS pool, so it might take some time to try to get it to install. Not impossible, since I have a drive I could shove the boot partition on and space on the ZFS pool, but unpleasant.

i) Sharing's already disabled explicitly; I decided I'd fight with it as integrated later, and stick to just configuring the shares from the system for now. SMB shares were configured via Nautilus, NFS shares via /etc/exports.

As for your suggested things to try, I will definitely try a debug run next - probably go ahead and recompile from source, so I can flip the debug flag. However, the dataset's already been sent between drives at least 2 or 3 ways under ZFS-FUSE, so I doubt that'll turn anything new up.
Attached:
zfs_get_received.txt — Plain Text, 2Kb
Added by (anonymous) on Jul 12, 2010 01:27 PM
The re-silver finished with no problems, while the share was de-activated... it appears that it wasn't the problem, since shortly after, I activated the share - and shortly had another crash! No terribly useful data from that run, except for one interesting piece of output from strace; every other process received a SIGABRT shortly after one process issued these lines:

18:40:29.057874 read(12, "\252O\375\330\34\277\367\316", 8) = 8
18:40:29.058071 clock_gettime(CLOCK_MONOTONIC, {70249, 986146137}) = 0
18:40:29.058238 clock_gettime(CLOCK_MONOTONIC, {70249, 986311320}) = 0
18:40:29.058400 io_submit(13180928, 1, {...}) = 1
18:40:29.058755 futex(0x7f7b92cc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
18:40:29.071959 futex(0x7f7b92b0, FUTEX_WAKE_PRIVATE, 1) = 0
18:40:29.072092 sched_yield() = 0
18:40:29.072306 rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
18:40:29.072504 tgkill(26928, 26946, SIGABRT) = 0
18:40:29.072658 --- SIGABRT (Aborted) @ 0 (0) ---

Not sure if that's useful. I'll back up the filesystems to another drive, then recompile with the debug flag on and see if I can get the bug to reoccur with more tracing active.
Added by Eric Astor on Jul 12, 2010 02:47 PM
Got it pinned down now! I managed to replicate the issue quickly - the resilvering was NOT involved. My steps to duplicate, under ZFS-FUSE 0.6.9:

The filesystem in question includes a large music library, which is mounted via NFS on another system, which references the mounted filesystem for Rhythmbox's music library.

1) Start zfs-fuse, compiled with debug=2, under gdb, with options '-a 1 -e 1 -n'
2) Mount the filesystem in question (storage/shared, in my case)
3) Use exportfs to export the filesystem as an NFS share
4) Mount the NFS share on another computer
5) Start Rhythmbox, with its music library stored on the share
6) Wait until Rhythmbox has finished importing the full library
7) Start playing, then click Next a few times to skip between tracks

For me, this causes ZFS to crash, with the following output in GDB:

Program received signal SIGABRT,
Aborted.
[Switching to Thread 0xa854db70 (LWP 16436)]
0x0012d422 in __kernel_vsyscall ()
(gdb) bt
#0 0x0012d422 in __kernel_vsyscall ()
#1 0x00322651 in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0x00325a82 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0x081217df in zfs_fuid_table_load (os=0x85a5b540, fuid_obj=8204,
    idx_tree=0x8550a074, domain_tree=0x8550a088) at lib/libzpool/zfs_fuid.c:155
#4 0x081219b0 in zfs_fuid_init (zfsvfs=0x8550a040)
    at lib/libzpool/zfs_fuid.c:214
#5 0x081220a9 in zfs_fuid_find_by_idx (zfsvfs=0x8550a040, idx=1)
    at lib/libzpool/zfs_fuid.c:374
#6 0x081221f2 in zfs_fuid_map_id (zfsvfs=0x8550a040, fuid=4294967314,
    cr=0xa854d1d8, type=ZFS_GROUP) at lib/libzpool/zfs_fuid.c:408
#7 0x08122a5e in zfs_groupmember (zfsvfs=0x8550a040, id=4294967314,
    cr=0xa854d1d8) at lib/libzpool/zfs_fuid.c:738
#8 0x08052e2f in zfs_zaccess_aces_check (zp=0x85b5eb48,
    working_mode=0xa854d00c, anyaccess=B_FALSE, cr=0xa854d1d8)
    at zfs-fuse/zfs_acl.c:2348
#9 0x080531d6 in zfs_zaccess_common (zp=0x85b5eb48, v4_mode=1,
    working_mode=0xa854d00c, check_privs=0xa854cffc, skipaclchk=B_FALSE,
    cr=0xa854d1d8) at zfs-fuse/zfs_acl.c:2490
#10 0x08053644 in zfs_zaccess (zp=0x85b5eb48, mode=1, flags=0,
    skipaclchk=B_FALSE, cr=0xa854d1d8) at zfs-fuse/zfs_acl.c:2620
#11 0x080538ea in zfs_zaccess_rwx (zp=0x85b5eb48, mode=256, flags=0,
    cr=0xa854d1d8) at zfs-fuse/zfs_acl.c:2705
#12 0x080695f0 in zfs_access (vp=0xb7fe78c0, mode=256, flag=0, cr=0xa854d1d8,
    ct=0x0) at zfs-fuse/zfs_vnops.c:1035
#13 0x0816659d in fop_access (vp=0xb7fe78c0, mode=256, flags=0, cr=0xa854d1d8,
    ct=0x0) at lib/libsolkerncompat/vnode.c:1195
#14 0x080754cb in zfsfuse_opencreate (req=0x81fdb28, ino=12782, fi=0xa854d270,
    fflags=32768, createmode=0, name=0x0) at zfs-fuse/zfs_operations.c:859
#15 0x0807579b in zfsfuse_open_helper (req=0x81fdb28, ino=12782, fi=0xa854d270)
    at zfs-fuse/zfs_operations.c:933
#16 0x0017358f in ?? () from /lib/libfuse.so.2
#17 0x001727ba in ?? () from /lib/libfuse.so.2
#18 0x00175976 in fuse_session_process () from /lib/libfuse.so.2
#19 0x08071eb2 in zfsfuse_listener_loop (arg=0x0)
    at zfs-fuse/fuse_listener.c:278
#20 0x0014c96e in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#21 0x003c5a4e in clone () from /lib/tls/i686/cmov/libc.so.6

The abort call is issued on Line 155 of lib/libzpool/zfs_fuid.c, in the function 'zfs_fuid_table_load', by the macro 'ksid_lookupdomain'.

ksid_lookupdomain is defined in lib/libsolkerncompat/include/sys/sid.h, as:
#define ksid_lookupdomain(d) (abort(), NULL)

Is this intentional? It looks to me like this would always cause a silent crash...
Added by Seth Heeren on Sep 19, 2010 06:35 PM
Issue state: unconfirmedresolved
Responsible manager: (UNASSIGNED)sgheeren
patch will be released in 0.7.0

thanks