#43 — zfs-fuse daemon dies on attempt to write a file to a rolled back directory
| State | Resolved |
|---|---|
| Version | 0.6.0 |
| Area | Functionality |
| Issue type | Bug |
| Severity | Important |
| Submitted by | (anonymous) |
| Submitted on | May 21, 2010 |
| Responsible | Seth Heeren |
| Target release | 0.6.9 |
Last modified on May 27, 2010 by Seth Heeren
If you create a snapshot, open a text file in an editor, write the file, roll back to the snapshot without leaving the editor, and then attempt to write the file again, the zfs-fuse daemon dies. To recover, the PID file has to be deleted manually and the file systems remounted with zfs umount/zfs mount. I can reliably reproduce this using Debian's 0.6.0+critical20100301-1 (with fuse built into vanilla kernel 2.6.34, if that matters).
Steps to reproduce:
1. zfs snapshot tank@now
2. cd /tank
3. vim test.txt (it might be important that vim is configured to write its "swap file" in the current directory, which is the default!)
4. Enter some text, save the file with :w
5. In another shell: zfs rollback tank@now
6. Enter another line of text in the editor; vim now signals problems with the swap file
7. Attempt to write the file (can't be done); the zfs-fuse daemon is now gone
8. ls: .: Transport endpoint is not connected
Added by Seth Heeren on May 21, 2010 03:16 PM
Issue state: unconfirmed → open
Severity: Medium → Important
Responsible manager: (UNASSIGNED) → sgheeren
Thanks for reporting this. I've reproduced the issue as mentioned.
Note that doing the ls _in the original shell_ (by backgrounding vim) works OK.
The daemon crashes only when I
(a) quit vim to do ls,
(b) enter ':!ls' from vim,
(c) re-open the file (:e!), or
(d) re-open without swapfile (:se noswapfile|e!).
The inode number for the tank mount point remains at 1 the whole time (this is not a stale current-working-directory problem).
The initial swapfile usage _IS_ a deciding factor: using 'vim -n test.txt' from the get-go does not fail in (a), (b), (c==d).
It caught my attention that the mountpoint dir seems to lose the g+w mode on the rollback step (even without any file/dir access, let alone vim). I'm thinking permission cache?
I'll fiddle a bit with strace.
Next up is gdb.
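The two observations above (inode number unchanged, g+w bit lost) can be checked with plain stat(2). The following is a minimal sketch, not part of zfs-fuse; the names `dir_state`, `read_dir_state` and `print_dir_state` are hypothetical and exist only for this illustration.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper type: the two properties discussed above. */
struct dir_state {
    ino_t inode;        /* inode number of the directory */
    int group_writable; /* 1 if the g+w bit is set, 0 otherwise */
};

/* Fills *out from stat(2); returns 0 on success, -1 on failure (errno set). */
int read_dir_state(const char *path, struct dir_state *out)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    out->inode = st.st_ino;
    out->group_writable = (st.st_mode & S_IWGRP) != 0;
    return 0;
}

/* Prints the state so that runs before and after the rollback can be compared. */
void print_dir_state(const char *label, const char *path)
{
    struct dir_state s;

    if (read_dir_state(path, &s) != 0) {
        perror(path);
        return;
    }
    printf("%s: inode=%llu g+w=%s\n", label,
           (unsigned long long)s.inode, s.group_writable ? "yes" : "no");
}
```

Calling `print_dir_state("before", "/tank")` ahead of the rollback and `print_dir_state("after", "/tank")` afterwards would show whether only the mode bit changes while the inode number stays the same.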
Added by Seth Heeren on May 21, 2010 03:59 PM
Running 0.6.9_beta2
From the strace I gather that the daemon crashes on vim's unlink call on the swap file.
When unlinking from the :q! sequence, the error code is
unlink("/tmp/demo1/.test.txt.swp") = -1 EIO (Input/output error)
When unlinking by :se noswapfile the error code is
unlink("/tmp/demo1/.test.txt.swp") = -1 ECONNABORTED (Software caused connection abort)
Running the daemon with 'zfs-fuse --no-kstat-mount -a 0 -e 0' makes no difference.
====================
Running 0.6.0
Duplicates the issue.
====================
Running 0.5.0
Does _not_ duplicate the issue (works as advertised, but on pool version 13). Using pool version=13 on 0.6.9_beta2 still fails.
====================
OpenSolaris b134
vim reports "E72: Close error on swap file". No panic
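For watching the same error codes without strace, a small C helper can call unlink(2) directly and report errno. This is only a sketch: `unlink_errno` and `report_unlink` are hypothetical names, and whether a bare unlink outside vim reaches the same code path in the daemon is an untested assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: unlink `path` and return 0 on success, or the errno
 * value on failure (the value strace prints symbolically, e.g. EIO). */
int unlink_errno(const char *path)
{
    if (unlink(path) == 0)
        return 0;
    return errno;
}

/* Prints the outcome in a form loosely modelled on the strace lines above. */
void report_unlink(const char *path)
{
    int err = unlink_errno(path);

    if (err == 0)
        printf("unlink(\"%s\") = 0\n", path);
    else
        printf("unlink(\"%s\") = -1 (%s)\n", path, strerror(err));
}
```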
Added by Seth Heeren on May 21, 2010 04:33 PM
sudo gdb --args zfs-fuse/zfs-fuse --no-kstat-mount -a 0 -e 0 -n
(gdb) break zfsfuse_unlink_helper
(gdb) r
The failing unlink trips assert:
zfs-fuse/zfs_operations.c:625: zfsfuse_release: Assertion `error == 0` failed
I suppose we could lift the VERIFY. Replacing it with ASSERT would result in no panic unless the build has debug != 0.
For the moment I am trying to get to the bottom of why the assert doesn't fire on 0.5.0.
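The difference between the two kinds of check can be sketched as follows. The macros below are hypothetical stand-ins written for this illustration, not the real VERIFY/ASSERT definitions, and the failure hook counts failures instead of aborting so that the behaviour can be observed. `SKETCH_DEBUG` is likewise an assumed name for the debug switch.

```c
#include <stdio.h>

/* Hypothetical stand-ins, NOT the real Solaris/zfs-fuse macros. */
static int sketch_failures = 0;

/* Failure hook: reports and counts. A real VERIFY would abort() here. */
static int sketch_check_failed(const char *expr, const char *file, int line)
{
    fprintf(stderr, "%s:%d: Assertion `%s` failed\n", file, line, expr);
    sketch_failures++;
    return 0;
}

/* Checked in every build. */
#define SKETCH_VERIFY(e) \
    ((void)((e) || sketch_check_failed(#e, __FILE__, __LINE__)))

/* Checked only when SKETCH_DEBUG is defined; otherwise compiled out. */
#ifdef SKETCH_DEBUG
#define SKETCH_ASSERT(e) SKETCH_VERIFY(e)
#else
#define SKETCH_ASSERT(e) ((void)0)
#endif

/* Models a release path guarded by the always-on check. */
int sketch_release_with_verify(int error)
{
    SKETCH_VERIFY(error == 0);
    return sketch_failures; /* total failures recorded so far */
}

/* Models the same path guarded by the debug-only check. */
int sketch_release_with_assert(int error)
{
    SKETCH_ASSERT(error == 0);
    (void)error; /* silence unused-parameter warnings in non-debug builds */
    return sketch_failures;
}
```

Built without `SKETCH_DEBUG`, a non-zero error passes through `sketch_release_with_assert` silently but is caught by `sketch_release_with_verify`, which mirrors the trade-off described above.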
Added by Seth Heeren on May 21, 2010 06:03 PM
My current understanding is that zfs-fuse may run into problems because it doesn't support inode-level locking (which Solaris kernel supports via e.g. cleanlocks())
Therefore I would propose to remove the VERIFY for zfs-fuse.
I still don't fully understand why 0.5.0 doesn't hit this problem. The problem was introduced by revision 5218413, which constitutes a snapshot of
http://sites.google.com/[…]/zfs-fuse-2009.06.03.tar.bz2
Unfortunately this is a huge patch and rather underdocumented (http://article.gmane.org/gmane.linux.drivers.fuse.zfs/1375)
I can't get much of a handle on where to start looking for the 'triggering' change.
Added by Seth Heeren on May 21, 2010 06:28 PM
Well, after all it is simply that the zfs_close vnode operation started doing a ZFS_VERIFY_ZP(zp) to verify the pointer to the znode (which corresponds to the vnode, which corresponds to the inode).
I can relate that to Mercurial revision 9909 (9909:aa280f585a3e from onnv-gate).
It points to bug 6790232 which bears a remarkable resemblance to this very issue:
http://bugs.opensolaris.org[…]3c05736d5cc4?bug_id=6790232
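How a validity check in the close path could turn into the EIO seen in the strace output can be modelled roughly as below. This is a heavily simplified sketch under stated assumptions: `sketch_znode`, its `backing` field and `sketch_close` are hypothetical, and the real ZFS_VERIFY_ZP macro and znode structure are not reproduced here.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily simplified znode: one pointer stands in for
 * whatever backing state the real check inspects. */
struct sketch_znode {
    void *backing; /* NULL models a znode invalidated by the rollback */
};

/* Models a close operation that validates the znode first and reports
 * EIO for a stale one instead of operating on it. */
int sketch_close(const struct sketch_znode *zp)
{
    if (zp->backing == NULL)
        return EIO;
    return 0;
}
```

A caller that insists on a zero return from such a close, as the VERIFY in zfsfuse_release does, would then fail on exactly the stale case.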
Added by Seth Heeren on May 21, 2010 07:43 PM
Issue state: open → resolved
Target release: None → 0.6.9
After more digging I concluded my analysis.
Emmanuel's zfs_zget change in fab8bc75 gave me the idea to look at the unlinked property before handling the znode.
Alas, that turned out not to work (the z_unlinked field in the znode struct is still == 0). I assume it may (again) be because of the lack of inode-level ref accounting in zfs-fuse (as opposed to the Solaris kernel?).
I have locally committed a patch: if an error is received we no longer abort, but rather log a sensible warning to syslog:
May 22 02:31:40 karmic zfs-fuse: zfsfuse_release: stale inode (Input/output error)?
or
May 22 02:06:51 karmic zfs-fuse: zfsfuse_release: stale inode (No such file or directory)
It is committed in my own repo[1]. I just decided to push this to the official testing branch as well after considering:
- nothing could be more broken, as the VERIFY() previously guaranteed a daemon abort
- syslog cannot be flooded, since this condition is not expected to happen frequently (or we would have had reports)
[1] http://zfs-fuse.sehe.nl/?p=[…]4397c4acb3f6351a0c5cff4ac2f
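The idea of warning instead of aborting can be sketched as follows. This is not the committed patch: the function names are hypothetical, the message format is modelled on the first log line quoted above, and returning the error to the caller is an assumption made for the illustration.

```c
#include <errno.h>   /* error values such as EIO are errno codes */
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Hypothetical helper: builds the warning text for a failed release.
 * Returns the snprintf result, or 0 when error == 0 (nothing to log). */
int sketch_format_release_warning(char *buf, size_t len, int error)
{
    if (error == 0)
        return 0;
    return snprintf(buf, len, "zfsfuse_release: stale inode (%s)?",
                    strerror(error));
}

/* Hypothetical tail of a release handler: log a warning and carry on,
 * where the earlier code would have aborted the daemon. */
int sketch_release_result(int error)
{
    char msg[256];

    if (sketch_format_release_warning(msg, sizeof msg, error) > 0)
        syslog(LOG_WARNING, "%s", msg);
    return error; /* assumption: the error is handed back to the caller */
}
```

Keeping the formatting separate from the syslog call makes the message easy to check in isolation.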
Added by (anonymous) on May 22, 2010 07:14 AM
That was quick! I can confirm that the fix works for me.
Added by Seth Heeren on May 27, 2010 05:49 PM
Issue state: resolved → open
After receiving and analysing the history on the newly reported #45 I decided against removing the assert.
I have tried my best to recreate the fix from before the mentioned fab8bc75/5218413 (which largely either produce this error or fail to compile...).
I finally restored some order by basically reapplying
git diff be06509^^..be06509
It is available in the testing branch now.
Added by Seth Heeren on May 27, 2010 06:35 PM
Issue state: open → resolved
Arrrrg - it was past bedtime for me...
I tested that while having misplaced my git revert for the prior fix. So, it turns out, the VERIFY is still borking in this scenario.
So, the problems are unrelated after all (though we _might_ be able to recheck all invocations of zfs_zget and find some more that ought to be ok with unlinked vnodes). For now, I'm simply reclosing this one, since the fix for #45 doesn't change the status of this bug after all.

