#13 — tools hang on zfs ioctl due to bug in put_nvlist
| State | Resolved |
|---|---|
| Version: | 0.6.0 |
| Area | User interface |
| Issue type | Bug |
| Severity | Critical |
| Submitted by | Seth Heeren |
| Submitted on | Dec 23, 2009 |
| Responsible | Seth Heeren |
| Target release: |
Last modified on
Aug 13, 2010
by
Seth Heeren
The following (git blame zfs-fuse/zfs_ioctl.c|less +836)
b9e6e6f6 (Ricardo Correia 2006-09-15 20:00:59 +0100 836)
00000000 (Not Committed Yet 2009-12-23 13:24:27 +0100 837) zc->zc_nvlist_dst_size = size;
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 838) /* This commented allocation was probably some kind of optimization
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 839) since this zc is sent to the socket. Except that put_nvlist is sometimes
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 840) called recursively and in this case we get very fast an out of memory error
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 841) in this function. Simply commenting out the allocation fixes the problem */
b9e6e6f6 (Ricardo Correia 2006-09-15 20:00:59 +0100 842) return (error);
The log mentions (git log a0da) "More memory allocation debugging messages and the removal of an optimization that causes an out-of-memory error. Hopefully with this, we have fixed the bug in zfs regarding vanishing datasets"
This feature is _clearly not_ an optimization. It is precisely as in many POSIX extensions, WIN32 api: the API will return the required buffer size if the client passes a buffer that is too small. In this particular case I was adding a second pool and 'zpool status' was stuck in an infinite loop - continuously reallocating a buffer at the original size (4096) which was too small to contain the server info (git blame lib/libzfs/libzfs_config.c | less +129).
The function zcmd_expand_dst_nvlist is evidently useless if the member zc_nvlist_dst_size is not updated by zfs-fuse. This function is being used in the following locations:
Functions calling this function: zcmd_expand_dst_nvlist
File Function Line
0 libzfs_config.c namespace_reload 141 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
1 libzfs_config.c zpool_refresh_stats 256 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
2 libzfs_dataset.c get_stats_ioctl 334 if (zcmd_expand_dst_nvlist(hdl, zc) != 0) {
3 libzfs_dataset.c zfs_do_list_ioctl 2277 if (zcmd_expand_dst_nvlist(zhp->zfs_hdl, zc) != 0) {
4 libzfs_import.c refresh_config 384 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
5 libzfs_pool.c zpool_get_all_props 77 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
So this may be causing more trouble. I couldn't find any _evidently_ wrong calculations of the size that gets written here in a quick scan. If having the auto-buffer-expansion feature leads to out of memory quickly (like the code comment is suggesting) I'd say we should be looking at the cause (memory being leaked/miscalculation of required size?)
b9e6e6f6 (Ricardo Correia 2006-09-15 20:00:59 +0100 836)
00000000 (Not Committed Yet 2009-12-23 13:24:27 +0100 837) zc->zc_nvlist_dst_size = size;
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 838) /* This commented allocation was probably some kind of optimization
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 839) since this zc is sent to the socket. Except that put_nvlist is sometimes
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 840) called recursively and in this case we get very fast an out of memory error
a0da39b5 (Manuel Amador (Rudd-O) 2009-07-16 21:38:18 -0500 841) in this function. Simply commenting out the allocation fixes the problem */
b9e6e6f6 (Ricardo Correia 2006-09-15 20:00:59 +0100 842) return (error);
The log mentions (git log a0da) "More memory allocation debugging messages and the removal of an optimization that causes an out-of-memory error. Hopefully with this, we have fixed the bug in zfs regarding vanishing datasets"
This feature is _clearly not_ an optimization. It is precisely as in many POSIX extensions, WIN32 api: the API will return the required buffer size if the client passes a buffer that is too small. In this particular case I was adding a second pool and 'zpool status' was stuck in an infinite loop - continuously reallocating a buffer at the original size (4096) which was too small to contain the server info (git blame lib/libzfs/libzfs_config.c | less +129).
The function zcmd_expand_dst_nvlist is evidently useless if the member zc_nvlist_dst_size is not updated by zfs-fuse. This function is being used in the following locations:
Functions calling this function: zcmd_expand_dst_nvlist
File Function Line
0 libzfs_config.c namespace_reload 141 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
1 libzfs_config.c zpool_refresh_stats 256 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
2 libzfs_dataset.c get_stats_ioctl 334 if (zcmd_expand_dst_nvlist(hdl, zc) != 0) {
3 libzfs_dataset.c zfs_do_list_ioctl 2277 if (zcmd_expand_dst_nvlist(zhp->zfs_hdl, zc) != 0) {
4 libzfs_import.c refresh_config 384 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
5 libzfs_pool.c zpool_get_all_props 77 if (zcmd_expand_dst_nvlist(hdl, &zc) != 0) {
So this may be causing more trouble. I couldn't find any _evidently_ wrong calculations of the size that gets written here in a quick scan. If having the auto-buffer-expansion feature leads to out of memory quickly (like the code comment is suggesting) I'd say we should be looking at the cause (memory being leaked/miscalculation of required size?)
- Steps to reproduce:
- Run this with /tmp mounted on something that supports sparse files (!!!), e.g. tmpfs:
#!/bin/bash
for pool in 1 2 3 4 5 6 7
do
devdir=/tmp/ztest_$pool;
mkdir -pv $devdir;
cd $devdir;
for a in a-particular-long-named-device-backing-file-named-atrociously-raw{1,2,3,4,5,6,7,8,9,10}
do
dd of=$a bs=1024M seek=1024 count=0 2>/dev/null
done
zpool create pool-$pool raidz $devdir/a-*
done
zpool iostat
zpool status -v | wc -l # this command should not hang anymore with the fix applied
Added by
Seth Heeren
on
Dec 23, 2009 04:07 PM
Ok, Emmanuel would recognize this: his 2d4e24ca1ba92fa04ae6efd45a847713cfd1e3ee has a better workaround for this function. It is however dated after the 0.6.0 release. We might want to back port that, since the symptoms seem a bit grave?
sehe@karmic:~/custom/ZFS$ git tag --contains 2d4e24ca1ba92fa04ae6
sehe@karmic:~/custom/ZFS$ git tag --contains a0da39b591de50312d
0.6.0
sehe@karmic:~/custom/ZFS$ git tag --contains 2d4e24ca1ba92fa04ae6
sehe@karmic:~/custom/ZFS$ git tag --contains a0da39b591de50312d
0.6.0
Added by
Seth Heeren
on
Dec 23, 2009 05:29 PM
Now interestingly, I couldn't cherrypick that rev (2d4e24) because of a merge conflict. Now what? It turns out this is a very hacked line-of-code indeed. I'm sorry I didn't catch much discussion on this on the list (I know it has been discussed with the disappearing datasets issue). I must admit, I agree with the comment in rev 37af7fac0 at least. This should be fixed on the call site: it should always retry the IOCTL on ENOMEM and use the suggested buffer size. If not, it should simply make sure the buffer is large enough.
put_nvlist shouldn't be touched with a ten-foot pole because breakage will ensue (as was demonstrated often enough now).
============> can we backport / cherrypick 37af7fac0524a741b44734057e8f635e40155cb4 ?
commit 2d4e24ca1ba92fa04ae6efd45a847713cfd1e3ee
Author: Emmanuel Anne <emmanuel.anne@gmail.com>
Date: Sun Aug 30 00:35:10 2009 +0200
Address the raidz1 problem in put_nvlist
when a drive is unavailable in a raidz1 pool and you try to import the pool,
you get an infinite loop on put_nvlist. Apparently this function was really
intended to fail in this case. So this quick fix tries to address the problem
without taking back the old bug of the disappearing pools. It's a lucky hack
for now, it seems to work everywhere but it should be improved one day...
commit 37af7fac0524a741b44734057e8f635e40155cb4
Author: Emmanuel Anne <emmanuel.anne@gmail.com>
Date: Fri Nov 13 19:05:09 2009 +0100
Final fix for the disappearing datasets
the problem was only in the list_next ioctl command :
put_nvlist was putting the size to use for future requests on zfs_cmd_t but
this same zfs_cmd_t was used for the next child, so if the next fs had more
properties, you got a nice out of memory error.
This final fixes doesn't touch at all at put_nvlist, it just changes the zfs
list command to ignore the size change between children
put_nvlist shouldn't be touched with a ten-foot pole because breakage will ensue (as was demonstrated often enough now).
============> can we backport / cherrypick 37af7fac0524a741b44734057e8f635e40155cb4 ?
commit 2d4e24ca1ba92fa04ae6efd45a847713cfd1e3ee
Author: Emmanuel Anne <emmanuel.anne@gmail.com>
Date: Sun Aug 30 00:35:10 2009 +0200
Address the raidz1 problem in put_nvlist
when a drive is unavailable in a raidz1 pool and you try to import the pool,
you get an infinite loop on put_nvlist. Apparently this function was really
intended to fail in this case. So this quick fix tries to address the problem
without taking back the old bug of the disappearing pools. It's a lucky hack
for now, it seems to work everywhere but it should be improved one day...
commit 37af7fac0524a741b44734057e8f635e40155cb4
Author: Emmanuel Anne <emmanuel.anne@gmail.com>
Date: Fri Nov 13 19:05:09 2009 +0100
Final fix for the disappearing datasets
the problem was only in the list_next ioctl command :
put_nvlist was putting the size to use for future requests on zfs_cmd_t but
this same zfs_cmd_t was used for the next child, so if the next fs had more
properties, you got a nice out of memory error.
This final fixes doesn't touch at all at put_nvlist, it just changes the zfs
list command to ignore the size change between children
Added by
Seth Heeren
on
Jan 15, 2010 06:19 AM
Issue state:
unconfirmed → open
Severity:
Important → Critical
Ok, people, this thing is a major stopper for anything but hte simplest pool layouts (higher number of pools, vdevs or longer vdev names will result in put_nvlist running out of buffer room quickly --> hanging userland tools (not to mention server threads)).
At least 4 bugs with the symptom have been reported on the user group. 2 of them confirmed the problem went away when upgrading to Emmanuels repo/master.
Anyone wishing to check/confirm the solution: switch to a later version (notably including 37af7fac0524a741b44734057e8f635e40155cb4)
So eg. do
git remote add rainemu http://rainemu.swishparty.co.uk/git/zfs
git checkout -b test-0.7.0-pre-alpha 37af7fac0524a741b44734057e8f635e40155cb4)
rebuild, install, test. Beware: you'll be running an unstable version. Don't do anything/more than necesarry on valuable pools?
At least 4 bugs with the symptom have been reported on the user group. 2 of them confirmed the problem went away when upgrading to Emmanuels repo/master.
Anyone wishing to check/confirm the solution: switch to a later version (notably including 37af7fac0524a741b44734057e8f635e40155cb4)
So eg. do
git remote add rainemu http://rainemu.swishparty.co.uk/git/zfs
git checkout -b test-0.7.0-pre-alpha 37af7fac0524a741b44734057e8f635e40155cb4)
rebuild, install, test. Beware: you'll be running an unstable version. Don't do anything/more than necesarry on valuable pools?
Added by
Rudd-O
on
Jan 15, 2010 02:25 PM
I have merged Emmanuel's rainemu repository and pushed into master official, so you can continue using the official repo as long as you use the master branch rather than one of the stable tags.
Guys, advise -- can we merge the quoted commits that fix the issues into stable, to have a 0.6.1 release?
Guys, advise -- can we merge the quoted commits that fix the issues into stable, to have a 0.6.1 release?
Added by
Rudd-O
on
Jan 15, 2010 02:45 PM
Again,
The official repo does contain in the master branch the fix by Emmanuel (Final fix for the disappearing datasets). The commit ID is different, though: it begins with efdd33ca.
The official repo does contain in the master branch the fix by Emmanuel (Final fix for the disappearing datasets). The commit ID is different, though: it begins with efdd33ca.
Added by
Seth Heeren
on
Jan 19, 2010 04:05 AM
As jafo correctly pointed out on the list, the 37af7fac commits introduces a lot more change than might be warranted for stable deployments. It certainly does _not_ lend itself to be cherrypicked, like I suggested earlier. Sorry for my mistake[1]. I have reduced things to a minimum fix, only related to this particular problem (attached).
Can we test this/get it merged instead of rolling the master forward to include all the other upstream changes?
[1] this must have happened because in actual life I'm ahead of the 0.6.0 release myself, so I didn't give that commit the due attention because I have it anyway
Can we test this/get it merged instead of rolling the master forward to include all the other upstream changes?
[1] this must have happened because in actual life I'm ahead of the 0.6.0 release myself, so I didn't give that commit the due attention because I have it anyway
Added by
Seth Heeren
on
Jan 19, 2010 04:17 AM
Replaced steps to reproduce with something less intrusive than the original report. Anyone can test this on their setup?
Added by
Seth Heeren
on
Jan 25, 2010 06:21 PM
Target release:
0.6.0 → 0.6.1
fix delivered in the new branch
# git clone http://git.zfs-fuse.net/official
# cd official/
# git checkout origin/critical
No need to retest at the moment (this has been verified by at least 4 persons in the field)
This branch is slated for release as 0.6.1 in the near future
# git clone http://git.zfs-fuse.net/official
# cd official/
# git checkout origin/critical
No need to retest at the moment (this has been verified by at least 4 persons in the field)
This branch is slated for release as 0.6.1 in the near future
Added by
Seth Heeren
on
May 22, 2010 01:36 PM
Issue state:
open → resolved
this had been resolved in 0.6.1 (critical)
closing
closing
Added by
Daniel Smedegaard Buus
on
Aug 13, 2010 03:01 AM
Hi :)
I think I may just have become a victim of this bug (on Mint9/Lucid no less (using 0.6.0-1; why it isn't 0.6.1 in the repos I'm not sure)), as described here:
http://opensolaris.org/jive/thread.jspa?threadID=133027
I'm confused about the severity of this. My old fs (~8TB) disappeared after creating a new one in its zpool and doing a reboot, it seems.
Would I be able to get back my disappeared fs by deleting the newly created one? Am I fudged? What to do?
Thanks :)
I think I may just have become a victim of this bug (on Mint9/Lucid no less (using 0.6.0-1; why it isn't 0.6.1 in the repos I'm not sure)), as described here:
http://opensolaris.org/jive/thread.jspa?threadID=133027
I'm confused about the severity of this. My old fs (~8TB) disappeared after creating a new one in its zpool and doing a reboot, it seems.
Would I be able to get back my disappeared fs by deleting the newly created one? Am I fudged? What to do?
Thanks :)
Added by
Seth Heeren
on
Aug 13, 2010 03:20 AM
Responsible manager:
RuddO → sgheeren
First: I haven't read your linked thread yet.
This issue has since been fixed
This issue was a sever usability limiter but not an immediate threat to your data. Notably, if an fs was missing (not listed, actually), you could most of the time simply mount it explicitely if you knew the name:
zfs mount pool/i/know/this_exists
Once upgraded to higher version of zfs-fuse, you should see the fs(es) again. Another workaround would be to use alternative device nodes (or symlinks) that have substantially smaller names (much like using /dev/sda instead of e.g. /dev/disk/by-id/dm-uuid-LVM-0PFWnn02l7cW2t5rNQMKjW2e0174q1IY9MM10D01KuT1C83dUU72tf6YcNhfjv0L).
I don't know what exactly you did/has happened when you mention 'newly created'? In general it would _not_ be possible to create the fs by the same name, since zfs-fuse would still _know_ it's there, despite 'zfs list' not being able to list it. If you have committed the mistake of overwriting your entire pool by creating a new pool over it, you will be out of luck, I'm afraid.
Let me know if there are further questions/answers
This issue has since been fixed
This issue was a sever usability limiter but not an immediate threat to your data. Notably, if an fs was missing (not listed, actually), you could most of the time simply mount it explicitely if you knew the name:
zfs mount pool/i/know/this_exists
Once upgraded to higher version of zfs-fuse, you should see the fs(es) again. Another workaround would be to use alternative device nodes (or symlinks) that have substantially smaller names (much like using /dev/sda instead of e.g. /dev/disk/by-id/dm-uuid-LVM-0PFWnn02l7cW2t5rNQMKjW2e0174q1IY9MM10D01KuT1C83dUU72tf6YcNhfjv0L).
I don't know what exactly you did/has happened when you mention 'newly created'? In general it would _not_ be possible to create the fs by the same name, since zfs-fuse would still _know_ it's there, despite 'zfs list' not being able to list it. If you have committed the mistake of overwriting your entire pool by creating a new pool over it, you will be out of luck, I'm afraid.
Let me know if there are further questions/answers
Added by
Daniel Smedegaard Buus
on
Aug 13, 2010 03:33 AM
Seth, thank you so much for your swift response :)
Indeed the explicit mount command just mounted my "invisible" fs! Funny though, because I could have sworn I already tried that - except I might have tried mounting it under its mountpoint name (/home/daniel/Archive) rather than its pool name (titanic/archive).
The other fs I was talking about was a second fs in the same pool, called titanic/backups, which I wanted to use, well, for backups :) Am I right in the assumption that it's this new fs which is triggering the out of memory issue on boot? And that destroying this second fs again (at least so long as I'm using 0.6.0 (still don't understand why the Ubuntu package maintainers haven't updated it to 0.6.1 yet)) would allow me to have my original titanic/archive fs automounted on boot (with hopefully no out of memory log messages)?
I'm not ready to dare upgrading to 0.6.9 yet (though deduplication was one of the main attractions to me in the first place). I still need to back up a bunch of stuff to dare this feat ;)
Again, thanks a lot for your answer! My blood pressure just decreased significantly ;)
Daniel
Indeed the explicit mount command just mounted my "invisible" fs! Funny though, because I could have sworn I already tried that - except I might have tried mounting it under its mountpoint name (/home/daniel/Archive) rather than its pool name (titanic/archive).
The other fs I was talking about was a second fs in the same pool, called titanic/backups, which I wanted to use, well, for backups :) Am I right in the assumption that it's this new fs which is triggering the out of memory issue on boot? And that destroying this second fs again (at least so long as I'm using 0.6.0 (still don't understand why the Ubuntu package maintainers haven't updated it to 0.6.1 yet)) would allow me to have my original titanic/archive fs automounted on boot (with hopefully no out of memory log messages)?
I'm not ready to dare upgrading to 0.6.9 yet (though deduplication was one of the main attractions to me in the first place). I still need to back up a bunch of stuff to dare this feat ;)
Again, thanks a lot for your answer! My blood pressure just decreased significantly ;)
Daniel
Added by
Seth Heeren
on
Aug 13, 2010 03:46 AM
Good to hear your health risk was reduced :_)
Yes the extra fs could have triggered this: the out-of-mem is actually on the memory buffer that will hold the list of filesystems, so if it grows, it will be truncated (or as in the case of mount -a it will likely fail at all?)
I would say it is likely safer to upgrade to 0.6.9 than to stay on 0.6[<9] because of fixed bugs/issues from upstream.
I agree that I'd _first_ have full backups. In my world, full backups should exist from the get-go, so if you're not covered, that's always going to be prio #1 IMHO, regardless of upgrades.
Read up here for additional hints on upgrading:
http://zfs-fuse.net/documentation/upgrading
Also note that you are probably using some ppa/unofficial repository if you're still on 0.6.1 despite running Lucid, because a.f.a.i.k debian has updated to 0.6.9 maint and if you're interested, the homepage links to a ppa with 0.6.9 packages for ubuntu
Yes the extra fs could have triggered this: the out-of-mem is actually on the memory buffer that will hold the list of filesystems, so if it grows, it will be truncated (or as in the case of mount -a it will likely fail at all?)
I would say it is likely safer to upgrade to 0.6.9 than to stay on 0.6[<9] because of fixed bugs/issues from upstream.
I agree that I'd _first_ have full backups. In my world, full backups should exist from the get-go, so if you're not covered, that's always going to be prio #1 IMHO, regardless of upgrades.
Read up here for additional hints on upgrading:
http://zfs-fuse.net/documentation/upgrading
Also note that you are probably using some ppa/unofficial repository if you're still on 0.6.1 despite running Lucid, because a.f.a.i.k debian has updated to 0.6.9 maint and if you're interested, the homepage links to a ppa with 0.6.9 packages for ubuntu
Added by
Daniel Smedegaard Buus
on
Aug 13, 2010 05:19 AM
Yeah, I think I actually will upgrade to 0.6.9 this weekend. For the record, though, AFAICT, it's not that I'm on an unofficial repo, check it out:
http://packages.ubuntu.com/source/lucid/zfs-fuse
And I actually do have backups, sort of, because I recreated this array using backups of my former xfs-on-raid5 array by dumping everything onto external drives. I'm just not sure about which stuff is where now, so I'm treading carefully so as not to disturb the status quo. 8 TB takes a while to restore ;)
Again, thanks for the help! And have a nice one :)
http://packages.ubuntu.com/source/lucid/zfs-fuse
And I actually do have backups, sort of, because I recreated this array using backups of my former xfs-on-raid5 array by dumping everything onto external drives. I'm just not sure about which stuff is where now, so I'm treading carefully so as not to disturb the status quo. 8 TB takes a while to restore ;)
Again, thanks for the help! And have a nice one :)
Added by
Daniel Smedegaard Buus
on
Aug 13, 2010 07:33 AM
Okay, I realize this is not a discussion forum, so I apologize for all this chattery business, but I'm wondering if I may ask just one more question.
Being convinced that I should do the upgrade this weekend, I'm reading up on issues, e.g. the dedup faq that is linked to under the release notes for 0.6.9. I'm however confused as to how I can know which fixes are included with the zfs-fuse. The version semantics used for the Solaris stuff are the developer builds, like snv132 (i.e. that one must've made it into the latest osol dev iso which is 134 AFAIR), while zfs-fuse version of course are completely different. Only the "pool version" is reported, e.g. 23 for 0.6.9, but I have no clue if that would mean that the issues listed under the dedup faq apply to this version of zfs-fuse or not.
Hope I'm not being too clingy here :)
Being convinced that I should do the upgrade this weekend, I'm reading up on issues, e.g. the dedup faq that is linked to under the release notes for 0.6.9. I'm however confused as to how I can know which fixes are included with the zfs-fuse. The version semantics used for the Solaris stuff are the developer builds, like snv132 (i.e. that one must've made it into the latest osol dev iso which is 134 AFAIR), while zfs-fuse version of course are completely different. Only the "pool version" is reported, e.g. 23 for 0.6.9, but I have no clue if that would mean that the issues listed under the dedup faq apply to this version of zfs-fuse or not.
Hope I'm not being too clingy here :)
Added by
Seth Heeren
on
Aug 13, 2010 07:58 AM

patch.put_nvlist_hotfix
repro.sh
