
#63 — filesystem disconnects

State: In progress
Version: 0.6.9
Area: Functionality
Issue type: Bug
Severity: Medium
Submitted by: (anonymous)
Submitted on: Jun 18, 2010
Responsible: Seth Heeren
Target release:
Last modified on Sep 19, 2010 by Seth Heeren
I have some problems making ZFS work.
In short, I get filesystem disconnects when there are several hundred filesystems and snapshots made.
The worst thing is that I don't get any error message in any logs, so can't really provide information to start with.
I'll try to summarize what I have done so far and what "results" I have.
 
The test system's main purpose would be to store a copy of a webserver's virtual directories (rsynced to zfs, each virtual dir a separate fs) and make daily snapshots.
I have tried debian first, with the 0.6.0 branch, then tried ubuntu 32 and 64 too, even tested on different hardware, making no difference.
Yesterday I tried with the 0.6.9, pretty much with the same results.
One of the servers has 2GB RAM, the other 8GB. It looks like the 8GB one gets further, but since these tests are time consuming*, I am not sure about this.
Yesterday I started the 8GB server with the ZFS disks in it, and the system took like 8500 seconds - 2 hours ~20 minutes - to boot with those.
 
[ 7.340723] EXT3 FS on md2, internal journal
[ 7.340800] EXT3-fs: mounted filesystem with ordered data mode.
[ 8492.412193] e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
[ 8492.468062] e1000e 0000:00:19.0: irq 29 for MSI/MSI-X
Full dmesg here:
http://www.maques.hu/zfs/dmesg
 
The ZFS is on 2*1TB mirrored disks (full disks sdc/sdd of dmesg) with like ~400GB data on it [max], pretty much the rsync of md2
http://www.maques.hu/zfs/df-output.txt
 
It looks like ZFS tries to mount the filesystems after boot, then dies at some point. As you can see in the df-output, it only got as far as the letter "s"; there are at least 50% more vhost dirs up to "z".
[When I created this current pool/fs-s earlier, the copy progress [rsync] was dying at certain points]
 
After booting, zfs doesn't seem to be running, and a zfs start does some work, then no mounts, only then I can see the following in syslog:
 
Jun 17 17:19:55 rsync zfs-fuse: initial max_map_count 65530
Jun 17 17:19:55 rsync zfs-fuse: ARC caching: maximum ARC size: 100 MiB
Jun 17 17:19:55 rsync zfs-fuse: ARC setup: min ARC size set to 16777216 bytes
Jun 17 17:19:55 rsync zfs-fuse: ARC setup: max ARC size set to 104857600 bytes
Jun 17 17:19:56 rsync zfs-fuse: kstat: fuse_mount error - trying to umount
 
Is there any chance, like some verbose/debug option, for me to test so I can get more details on what is causing the hangs/disconnects of the FS-s?
(I'm also not sure if a mount like this should take ~2,5 hrs too)
Without logs I'm not sure what to report as "bug", if that would be required...
 
Added by Seth Heeren on Jun 18, 2010 06:01 PM
Issue state: unconfirmed -> open
Responsible manager: (UNASSIGNED) -> sgheeren
Wow. First of all, thanks for trying so much to get this working. Where to start.

A. Ok, how did you come up with the pool? Do you have any reference system in which it behaved less erratically (surely, if it was always this bad, you wouldn't have let it grow this far?)? How did you first find out there were problems?

B. Do I know your system setup (I cannot see who you are, but I suspect I may have system info from prior contact.)

C. Can you compile from source, of course using 0.6.9, and specify

    scons debug=2
    zfs-fuse -n # this will run in foreground, spilling more info on stderr

D. ulimit is clearly in view here, so please make sure ulimit is unrestrictive on zfs-fuse launch

E. Currently, the IOCTLQUEUE_MAX_PENDING is 256 (a mount request is an IOCTL), but processing will simply block once the queue is full, it will not fail (tested)

F. MAX_FILESYSTEMS (fuse_listener.c) is 1000. If this is tripped, file systems would be 'quietly' not-mounted with only a "Warning: filesystem limit (%i) reached, unmounting..\n" printed on stderr. It should not normally crash.

G. Lastly, it might be feasible to run under gdb (when not using breakpoints things will be slower, but not unusable).
    
     bash$ sudo gdb --args ./zfs-fuse -n

This way should catch any asserts and fatal signals, showing e.g. when the max number of pthreads is exceeded (not unlikely). (A 'thread apply all bt' is the 'big gun' information source)

-----------------

Y. What is memory behaviour/footprint?

Z. Check the OOM killer (check e.g. http://zfs-fuse.sehe.nl/?p=[…]07c6f0024914cd3fed35#patch1) for some tricks
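
Regarding point D, a small helper can show which resource limits a process would actually inherit at launch. The following is only a minimal sketch: the function names (report_limit, report_launch_limits) are made up for illustration, and the choice of limits to print (stack size, open files, address space, process/thread count) is an assumption about what matters here, not something taken from the zfs-fuse sources.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft and hard value of one resource limit.
 * Returns 0 on success, -1 if getrlimit() fails. */
int report_limit(const char *name, int resource, FILE *out)
{
    struct rlimit rl;

    assert(name != NULL && out != NULL);
    if (getrlimit(resource, &rl) != 0)
        return -1;

    fprintf(out, "%-14s soft=", name);
    if (rl.rlim_cur == RLIM_INFINITY)
        fprintf(out, "unlimited");
    else
        fprintf(out, "%llu", (unsigned long long)rl.rlim_cur);

    fprintf(out, " hard=");
    if (rl.rlim_max == RLIM_INFINITY)
        fprintf(out, "unlimited\n");
    else
        fprintf(out, "%llu\n", (unsigned long long)rl.rlim_max);
    return 0;
}

/* Report the limits assumed (for this sketch) to be relevant to a
 * daemon that creates many threads and mounts. */
int report_launch_limits(FILE *out)
{
    int rc = 0;

    rc |= report_limit("RLIMIT_STACK", RLIMIT_STACK, out);
    rc |= report_limit("RLIMIT_NOFILE", RLIMIT_NOFILE, out);
    rc |= report_limit("RLIMIT_AS", RLIMIT_AS, out);
    rc |= report_limit("RLIMIT_NPROC", RLIMIT_NPROC, out); /* Linux/BSD, not POSIX */
    return rc == 0 ? 0 : -1;
}
```

Calling report_launch_limits(stderr) from a small main(), started from the same init script or shell that launches zfs-fuse, would print the limits that environment hands down to its children.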
Added by (anonymous) on Jun 18, 2010 07:49 PM
>> A. Ok, how did you come up with the pool? Do you have any reference system in which it behaved less erratically
>> (surely, if it was always this bad, you wouldn't have let it grow this far?)? How did you first find out there were problems?
Well, tried so many configurations that I somewhat lost track, and I guess the best would be to do the debug=2 and other stuff you said later on anyway.
OTOH, it was started off as a new pool with some basic:
 zpool create vol0 mirror /dev/sdc /dev/sdd
 zfs set atime=off vol0 #this might be unusual, but I guess it should make less trouble not more...
 zfs create vol0/ns1
and the rest of sub zfs-s were created by rsyncing script, first time maybe reaching to directories beginning with "g".
Only difference shown when disks were moved to ubuntu64, self compiled 0.6.0, then copy continued, not sure till when, but at least "s"...

>>B. Do I know your system setup (I cannot see who you are, but I suspect I may have system info from prior contact.)
attached dmesg previously, but nothing special, and also tried different hardware [intel mb/proc only, different ones though]

>>C. Can you compile from source, of course using 0.6.9, and specify
>> scons debug=2
>> zfs-fuse -n # this will run in foreground, spilling more info on stderr
will try next week, thanks for the tips

>>D. ulimit is clearly in view here, so please make sure ulimit is unrestrictive on zfs-fuse launch
I'm using a slightly modified [paths, etc] script from
http://rudd-o.com/[…]/starting-zfs-fuse-up-properly
which sets some ulimits, so I guess that should be covered

>>E. Currently, the IOCTLQUEUE_MAX_PENDING is 256 (a mount request is an IOCTL), but processing will simply block once the queue is full, it will not fail (tested)

>>F. MAX_FILESYSTEMS (fuse_listener.c) is 1000. If this is tripped, file systems would be 'quietly' not-mounted with only a "Warning: filesystem limit (%i) reached, unmounting..\n" printed on stderr. It should not normally crash.
Ok, didn't know that, but that would likely be reached [above "s" though].
Any reason for the limit? I mean 1000 is not a "round" number, 1024 is :-)...
Also, do snapshots count in that?

>>G. Lastly, it might be feasible to run under gdb (when not using breakpoints things will be slower, but not unusable).
>> bash$ sudo gdb --args ./zfs-fuse -n
will do, thanks.

>>This way should catch any asserts and fatal signals, showing e.g. when the max number of pthreads is exceeded
>>(not unlikely). (A 'thread apply all bt' is the 'big gun' information source)
I guess I can gather some "useful" information with those switches etc., so I likely can get back with more details.

>>Y. What is memory behaviour/footprint?
Didn't seem to use up memory even on the 2GB setup.

>>Z. Check the OOM killer (check e.g. http://zfs-fuse.sehe.nl/?p=[…]07c6f0024914cd3fed35#patch1) for some tricks
Will do that too, thanks for all the tips, have a nice weekend :-)
Added by Seth Heeren on Jun 19, 2010 05:43 AM
On A. (note the reference numbers to make it easy to respond to isolated items)

Your response is a bit shorthand for me. Specific questions:
> OTOH, it was started off as a new pool with some basic: zpool create vol0 mirror /dev/sdc /dev/sdd
QA1. On what kind of system was this?

> ... the rest of sub zfs-s were created by rsyncing script, first time maybe reaching to directories beginning with "g".

QA2. What does this mean? This was the first time you got an error? Did you stop it on purpose?

> Only difference shown when disks were moved to ubuntu64, self compiled 0.6.0, then copy continued, not sure till when, but at least "s"...

QA3. "Only difference shown" is unclear to me.

-----------------------------
On B.

> attached dmesg previously
QB1. I'm guessing this is Lenny? I don't know how to tell. If I may suggest something: I don't think it should matter and since you seem versatile enough I'll simply recommend:
  * "recent debian" (e.g. Debian Squeeze/Ubuntu Lucid) 64 bit
  * run --no-kstat-mount unless you need it
  * run with an explicit stack size like --stack-size=8
You can use the package from my ppa (https://launchpad.net/~bugs-sehe/+archive/zfs-fuse)

-------------------------------
On F.

> Any reason for the limit

Note that this limits the number of _fuse_ filesystem mounts. Of course, ZFS stands for _Z_ettabyte File System; there is no such limit in ZFS itself. As such, snapshots _do not_ count. Only _mounted_ filesystems count. This is an area where we could do some 'smart' testing (e.g. incrementally mounting part of the fs-tree instead of all at once).
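
To make the counting rule concrete, here is a minimal sketch of a mount table with a hard cap. It is not the code from fuse_listener.c: the struct, the enum and the function name are hypothetical, and only the value 1000 and the warning text are taken from this thread. The sketch assumes that a snapshot never occupies a FUSE mount slot, which is a simplification of "snapshots do not count".

```c
#include <assert.h>
#include <stdio.h>

#define MAX_FILESYSTEMS 1000 /* value quoted in this thread */

/* Hypothetical types for illustration only. */
enum dataset_kind { DATASET_FILESYSTEM, DATASET_SNAPSHOT };

struct mount_table {
    int mounted; /* number of FUSE mounts currently held */
};

/* Try to account for one dataset.
 * Returns 0 if it is accepted, -1 if the mount limit is reached. */
int mount_table_mount(struct mount_table *t, enum dataset_kind kind)
{
    assert(t != NULL);

    /* Assumption of this sketch: snapshots take no mount slot. */
    if (kind == DATASET_SNAPSHOT)
        return 0;

    if (t->mounted >= MAX_FILESYSTEMS) {
        /* The only trace is a warning on stderr, as described in point F. */
        fprintf(stderr,
                "Warning: filesystem limit (%i) reached, unmounting..\n",
                MAX_FILESYSTEMS);
        return -1;
    }

    t->mounted++;
    return 0;
}
```

With a zero-initialized table, the first MAX_FILESYSTEMS filesystem mounts are accepted, any number of snapshots pass without consuming a slot, and the next filesystem mount is refused with only the warning on stderr.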

On Y.

Ok.



Added by (anonymous) on Jun 22, 2010 06:27 AM
>>QA1. On what kind of system was this?
Well... creation was on a debian testing with 0.6.0 first, but I believe it was destroyed/recreated, likely on the ubuntu64 later on, also new disks were brought in for the test.
I understand that details would be "helpful", but my main issue here is the lack of any error/warning message from zfs-fuse.
So, in other words, let's go forward, I can destroy and recreate the pool on any recommended system.

>> QA2. What does this mean? This was the first time you got an error? Did you stop it on purpose?
First time I got the error (zfs hang without any message) was a plain copy [rsync] of [website] directories [from a to z], each to one sub fs.

>> QA3. "Only difference shown" is unclear to me.
Progress seemed to go further. [Instead of hanging at directories starting with "g", the process hung at "s"]
[note that I likely used raised ulimit here, which might have helped - but as said at QA1: I can start from 0 anytime]

>> QB1. I'm guessing this is Lenny? I don't know how to tell. If I may suggest something: I don't think it should matter and since you seem versatile enough I'll simply recommend:
>> * "recent debian" (e.g. Debian Squeeze/Ubuntu Lucid) 64 bit
Ok, I'm using debian testing [whatever its current name is now :-)], usually 32 bits, but 64 bit would be no problem either.
I had too many problems with "stables" in the last 10 years or so :-]

>> * run --no-kstat-mount unless you need it
>> * run with an explicit stack size like --stack-size=8
Erm, sorry for being noobish, but where should I put these switches to? zfs-fuse start?

>>F. Note that this limits the number of _fuse_ filesystem mounts.
Thanks. As I see, the 1000 is a default limit and can be overwritten to be -1 [unlimited],
but I just checked, it does not seem to be the problem here [yet].
Otoh, I can set it to some extremely low value [like 20 or so] just to see what happens then [error messages wise].

Now, let's go forward.
- C1: I tried the -n option and did as per "C" [0.6.9, scons debug=2, debian testing 32 bit so far*]
*-started on that, and due to the enormous time the mounts took, had no time to test on other systems, but can/will do
Running with -n "eventually" resulted in:
lib/libsolkerncompat/thread.c:48: zk_thread_create: Assertion 'pthread_create(&tid, &attr, (void *(*)(void *)) func, arg) == 0' failed.
[core dumped]

Other observations:
- C2: It took ~2.5 hrs of mounting until zfs disconnected/died. The first directories [sub-fs-es] were mounted fast, in ~1 sec or less, but as the process went on the mounts got slower and slower: ~15 seconds around the middle and ~30 seconds near the end, with no relation to the size of the fs-es. Is that normal? [The total number of mounts when the process died was around ~870 dirs/fs-s, not near 1000]
- C3: Memory usage seemed low, ~230MB out of the 8GB

Haven't tried 64 bit, but I pretty much believe the results would be similar.
Have no problem trying it in case it should help anything, also can destroy/recreate pool under ubuntu 64/0.6.9 and recreate the structure+recopy data.
Haven't done gdb stuff, can do, but should need some more advice on that

Thanks
Added by Seth Heeren on Jun 22, 2010 06:42 AM
Issue state: open -> in-progress
Target release: None -> 0.7.0
Ok we found it:

> lib/libsolkerncompat/thread.c:48: zk_thread_create: Assertion 'pthread_create(&tid, &attr, (void *(*)(void *)) func, arg) == 0' failed.

This is your problem: out of thread capacity for mounts. This is a scalability issue. I'll see what I can do to lessen the harm (I have run into this limitation during stress testing before).

64 bits might help (I don't know how pthreads are implemented with regards to kernel 'size') (from the docs "The maximum number of threads that may be created by a process is implementation dependent.")

If you use the packages from my ppa you will have a nice init script (/etc/init.d/zfs-fuse) and a /etc/zfs/zfsrc containing the option I mentioned.

(MAX_FILESYSTEMS is _not_ user-serviceable. You are welcome to start wearing a developer hat, of course)
Added by Seth Heeren on Jun 22, 2010 09:41 AM
The extra thread per mounted (!) fs is for the zil. So you might alleviate the issue by tuning:

(a) avoid mounting inactive filesystems
(b) enter optimizing settings into /etc/zfs/zfsrc, like e.g.

stack-size = 2
no-kstat-mount

(c) from the big evil tuning guide, set zil_disable to 1 (zfs-fuse does not support the tunables, except by editing the code, see src/lib/libzpool/zil.c:68). READ THE GUIDE first, and be sure you know what this does!

I tested it, memory consumption is down and with (c) thread usage doesn't increase anymore. There is still a performance issue with mounting/unmounting filesystems (seems to take exponential time). No news there yet.

You might want to grab my issue63 branch, because it contains the fix for a bad-pointer ref that _may_ cause do_mount to take a long time scanning mount options (?)
Added by Gabor Funk on Jun 22, 2010 10:25 AM
[managed to register :-]

Out of curiosity, I put the same pool under an Ubuntu 64 Karmic, and the step-by-step mount took only several seconds [no exponential delays].
I didn't experience the "thread" problem here around ~870.
However, since it went on up to 1000 without problems in this case, I got the
"Warning: filesystem limit (1000) reached, unmounting..." message.
Then the whole zfs pool pretty much disconnected/hung [issuing a df on another console hangs too].
So it seems I experienced hangs on both 32-bit [debian testing] and 64-bit [ubuntu karmic], but for different reasons...

>> (MAX_FILESYSTEMS is _not_ user-serviceable. You are welcome to start wearing a developer hat, of course)
Erm, can I simply change "#define MAX_FILESYSTEMS 1000" in fuse_listener.c without causing any problems?

-
Other suggestion: "INSTALL" names some -dev packages to be installed, but libssl-dev and libattr*-dev are not mentioned and imho should be listed too, as they are required to build:
libzpool/sha256.c includes openssl/sha.h
and
zfs-fuse/zfs_operations.c includes attr/xattr.h
[for those who use google :-)]
Added by Seth Heeren on Sep 19, 2010 05:16 PM
Target release: 0.7.0 -> None
Gabor,

Sorry, I managed to completely miss your response.

Yes, you can simply change "#define MAX_FILESYSTEMS 1000" in fuse_listener.c without causing any problems (provided there are enough resources).
But I'm sure you tried that.

Thanks for noting the stale doc files: see issue #7. I'll be kicking that one again in a new Bug Hug for 0.7.0
The site already starts to contain more valuable info