#65 — Cache leads to data corruption after rollback
| State | Tested and confirmed closed |
|---|---|
| Version | 0.6.9 |
| Area | Functionality |
| Issue type | Bug |
| Severity | Important |
| Submitted by | Jan Ploski |
| Submitted on | Jun 20, 2010 |
| Responsible | Seth Heeren |
| Target release | 0.7.0 |
| Last modified | Sep 24, 2010 by Seth Heeren |
This is a ticket for the issue initially described here: http://groups.google.com/[…]/6fd2fbe12495464d
I'm experiencing this problem on the zfs-fuse testing branch, commit 3c64b738517f9a7c68e77fa7b2714c6278d4a9d2, compiled from source.
After a zfs rollback, the rolled back file system apparently may be left in an inconsistent state, from the viewpoint of an application (a MySQL database, in that case, complaining about corrupted tables). It is easy to reliably reproduce (at least for myself) using the procedure described in the original thread. The symptoms do NOT occur if caches are flushed using echo 3 > /proc/sys/vm/drop_caches after rollback.
I consider this a pretty serious issue, as it may lead to data loss because the confused application detects corruption and tries to "repair" the rolled back file system, possibly making matters worse.
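The cache-flush workaround mentioned above can be sketched as a small wrapper. This is only an illustration: `rollback_and_flush` and the `DROP_CACHES` override are hypothetical names, not part of zfs-fuse. Writing to `drop_caches` needs root and affects every filesystem on the machine, not just the rolled-back dataset.

```shell
#!/bin/sh
# Sketch: roll back, then ask the kernel to drop its caches.
# rollback_and_flush is a hypothetical helper name. DROP_CACHES is a
# hypothetical override for the procfs path (handy for a dry run).
rollback_and_flush() {
    drop_file=${DROP_CACHES:-/proc/sys/vm/drop_caches}

    # Pass all arguments straight through to zfs rollback.
    zfs rollback "$@" || return 1

    # Dirty pages cannot be dropped, so write them out first.
    sync

    # 3 = drop the page cache plus dentries and inodes.
    echo 3 > "$drop_file"
}

# Usage (as root):
#   rollback_and_flush -r ploski1@<snapshot-name>
```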
- Steps to reproduce:
- Original symptoms reported here http://groups.google.com/[…]/6fd2fbe12495464d
Reduced steps to reproduce for original versions (0.5.x-0.6.9?):
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
export first="$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
zfs list -tall
md5sum /tmp/ploski1/data
zfs rollback -r ploski1@$first
echo "$first" " /tmp/ploski1/data" | md5sum -c
Modified steps to reproduce on latest testing branch (add flock to rollback step):
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
export first="$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
zfs list -tall
md5sum /tmp/ploski1/data
{ flock -x 200 && zfs rollback -r ploski1@$first; } 200< /tmp/ploski1/data
echo "$first" " /tmp/ploski1/data" | md5sum -c
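Both variants of the steps above end with the same check: the file's current checksum must match the one recorded when the snapshot was taken. A minimal sketch of that check as a reusable function follows; `check_md5` is a hypothetical helper name, and the `--quiet` option assumes GNU coreutils `md5sum`.

```shell
#!/bin/sh
# check_md5 EXPECTED_HASH FILE   (hypothetical helper name)
# md5sum -c reads lines of the form "<hash>  <path>" (two spaces) and
# exits non-zero when the file's checksum does not match.
check_md5() {
    expected=$1
    file=$2
    printf '%s  %s\n' "$expected" "$file" | md5sum -c --quiet -
}

# Usage, with $first captured as in the steps above:
#   check_md5 "$first" /tmp/ploski1/data && echo "rollback is consistent"
```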
Added by
Seth Heeren
on
Jun 20, 2010 04:06 PM
Issue state:
unconfirmed → open
Severity:
Medium → Important
Responsible manager:
(UNASSIGNED) → sgheeren
Confirming this as a serious consistency bug.
Mitigated only by the fact that not all users are in the habit of rolling back to snapshots.
If they are, they should be able to trust the Linux fs caches to work OK, of course. I have a feeling it should be pretty easy to signal to fuse that a filesystem's caches should be (completely) dropped. This makes sense especially because fuse is widely used for virtual filesystems, where the actual FS content can never be cached because it is 'virtual' and volatile. I'll have to do some catch-up reading (fuse documentation hasn't been my favourite until now).
In the meantime, this goes straight to the top of the known issues.
I'll try to figure out any dependencies on daemon configuration (prefetch, L2ARC, disable-block-cache, disable-page-cache, fuse-attr-timeout, fuse-entry-timeout).
Added by
Seth Heeren
on
Jun 20, 2010 04:54 PM
Issue state:
open → in-progress
I've been able to fully reproduce this issue; the script is at the bottom.
Note that including either --disable-page-cache or --disable-block-cache removes the symptom (-a and -e have no further influence).
Updating the known issues with this workaround.
------------------------------------------------
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
export target="$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data
zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
zfs list -tall
md5sum /tmp/ploski1/data
zfs rollback -r ploski1@$target
zfs list -tall
echo "$target" " /tmp/ploski1/data" | md5sum -c # will fail unless --disable-*-cache given
echo 1 > /proc/sys/vm/drop_caches
echo "$target" " /tmp/ploski1/data" | md5sum -c
Added by
Seth Heeren
on
Jun 20, 2010 06:53 PM
My branch issue65 (http://gitweb.zfs-fuse.net/[…]/issue65) now contains a fix for this. (snapshots downloadable or clone from my repo)
I'd be interested in performance tests with this branch, because it has the net effect of --disable-page-cache. From the code affected, this option seems like a misnomer to begin with, and I doubt whether it actually makes that much of a difference. For reference, here is my commit message with the ins and outs of the analysis:
commit 288ab55443945461f8f8fe02221b37aafa9557cf
Author: Seth Heeren <sgheeren@hotmail.com>
Date: Mon Jun 21 01:44:31 2010 +0200
Disabling the use of keep_cache in zfsfuse_opencreate
This is in response to the serious issue #65
(http://zfs-fuse.net/issues/65)
This makes disable_page_cache a nilpotent option for now
The original work (99bcacea) seems to be discussed in a July 2008 thread
here
http://groups.google.com/[…]/3393f58b221db75d
It looks at this point that
(a) dsiabling page cache has no functional detriments (unlike e.g.
disable-block-cache which disables mmap() on fuse fs)
(b) there is no (simple) way to invalidate the cache explicitely on
'destructive inplace snapshot operations' (like rollback, promote and
possibly receive); functions present are:
fuse_invalidate (obsolete, doesn't do anything)
fuse_lowlevel_notify_inval_inode(...)
fuse_lowlevel_notify_inval_entry(...)
* These are pretty badly documented.
* Also, it will be hard to
call these with the right session/channel for a given fs when
serving the ZFS_IOC_ROLLBACK command, e.g. since there is no
knowledge about the corresponding fuse session in the handler.
We might introduce a thread-local global (like cur_fd) for the
purpose, but later
For now, let's simply performance test whether there is any measurable
difference. After all, block-cache seems to be the main performance
boon.
Viewed sceptically, 'keep_cache' in response to !disable_page_cache
might even be a mistake/logic error. I'm not sure whether keep_cache on
file zfsfuse_opencreate makes sense outside a read-only fs?
Added by
Jan Ploski
on
Jun 21, 2010 02:43 PM
Your fix works well for my test case. I will upload some bonnie figures collected before and after applying the fix (I did not notice any performance differences, but then, I didn't examine it very thoroughly).
Added by
Jan Ploski
on
Jun 21, 2010 02:44 PM
Bonnie figures before the fix (using git testing, commit 3c64b738517f9a7c68e77fa7b2714c6278d4a9d2)
Added by
Jan Ploski
on
Jun 21, 2010 02:46 PM
Bonnie figures after the fix (using snapshot http://gitweb.zfs-fuse.net/[…]/issue65)
Added by
Seth Heeren
on
Jun 21, 2010 04:20 PM
Thanks for taking this time. My own benchmarks (both scenarios tested 2x across 16G)
http://downloads.sehe.nl/zfs-fuse/issue65/benchmarks.html
confirm that there is little or no difference.
Not really surprisingly, as I don't think anything about this change affects the actual caching of data pages.
The keep_cache seems to be an optimizing hint to the fuse layer telling it that it is ok to cache the inode meta data (attrs) even on new file creation. It might be customary in linux VFS drivers to invalidate this cache (perhaps because the correct inodes need not be known at creation time... this is mostly conjecture, but it rings true enough). Now fuse, being all-user-space, of course can break this tradition, because the actual fs driver is itself in the calling code.
In any case, I can see a (completely marginal) performance improvement with -a >0 and -e >0 and a lot of new file creations.
The irony of the benchmarks is that one of the few discernible differences is that file creations seem to be a tiny bit faster with the patch.
I'm going to vote for inclusion in the test branch, dropping the --disable-page-cache setting because it is unsafe with this type of snapshot operation.
Added by
Seth Heeren
on
Jun 21, 2010 06:06 PM
Issue state:
in-progress → resolved
Target release:
None → 0.7.0
rolled into testing
testing converted to new upcoming release zfs-fuse.net
Added by
Jan Ploski
on
Sep 20, 2010 06:32 AM
I pulled the current 'testing' today and this issue occurred again! Once again I got corrupted MySQL tables in a scenario which worked flawlessly with my yesterday's (older) build of zfs-fuse (for which I unfortunately miss the .git directory, so can't report commit id). Can you please investigate?
Added by
Jan Ploski
on
Sep 20, 2010 07:10 AM
Ok, I see now what the single crucial difference is:
My "good" older version contains the following line in zfs_operations.c:
fi->keep_cache = 0;
But the "bad" version (that is, the current 'testing') contains:
fi->keep_cache = page_cache;
Can you change it? Or is there anything I must configure differently to get the same behavior from the current version?
Added by
Seth Heeren
on
Sep 20, 2010 07:11 AM
Issue state:
resolved → open
huh...
I'm sorry that this happens to you again (it must be a pain to fix). I'm however also glad that this is still the _testing_ branch - that makes me feel less guilty.
Now on the subject:
-------------------
Can you describe the scenario? Does it still involve rollback? If not, how does it operate?
Background:
-----------
testing is indeed different from maint. Maint still contains the workaround that you confirmed from the issue65 branch. Effectively, it ran with --disable-page-cache always.
However, there were unwelcome performance impacts (which I hadn't found with the benchmarks we did in this issue), so Emmanuel came up with another fix: 68a778726 (76391c8f1c in Emmanuel's repo).
commit 68a7787261e6324ef722b86560075f46dfea2629
Author: Emmanuel Anne <emmanuel.anne@gmail.com>
Date: Mon Aug 9 13:14:15 2010 +0200
Remount the fs after a rollback
This clears the page cache as it should.
After I tested that this works for me with the steps to reproduce from this issue, it was decided to reenable the page-cache facility and use the more explicit fix.
You can always go back to maint: the alternative fix described above has _not_ been backported to 0.6.9-maint, because the maint branch is supposed to maximize stability (ahem - q.e.d.)
If you can supply the information requested, I'll try to regression-test this issue once again.
Added by
Seth Heeren
on
Sep 20, 2010 07:15 AM
As I explain in my crossing post, this is not a mistake: keep_cache provides a big performance gain.
> Can you change it? Or is there anything I must configure differently to get the same behavior from the current version?
Of course. Simply adding disable-page-cache to /etc/zfs/zfsrc will do the trick. However, this issue still stands. It is not _ok_ to corrupt data on certain configs.
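Since the configuration now decides whether rollback is safe, a script that depends on rollback may want to check it first. The sketch below assumes the option appears in zfsrc as an uncommented line reading exactly `disable-page-cache`; that format is an assumption and should be checked against the zfsrc shipped with your build. `page_cache_disabled` is a hypothetical helper name.

```shell
#!/bin/sh
# page_cache_disabled [RCFILE]   (hypothetical helper name)
# Succeeds if RCFILE has an uncommented "disable-page-cache" line.
# Assumption: zfsrc lists options as bare long-option names, one per line.
page_cache_disabled() {
    rc=${1:-/etc/zfs/zfsrc}
    grep -Eq '^[[:space:]]*disable-page-cache[[:space:]]*$' "$rc" 2>/dev/null
}

# Usage:
#   page_cache_disabled || echo "warning: page cache on, rollback may be unsafe" >&2
```

This only inspects the file. A daemon started with the option on its command line, or one that has not been restarted since the file changed, will not be detected correctly.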
Added by
Seth Heeren
on
Sep 20, 2010 07:55 AM
Issue state:
open → in-progress
mmm. I just confirmed that it is a locking issue. The remount step fails if files are open. I have let myself be convinced there...
Also, this probably means that not all relevant subsystems are being shut down in your case. Think of the working directory for your snapshot scripts, e.g.
lsof +D /pool/mountpoint/
should give you more info. For the modified reproduction, see the updated steps to reproduce at the top of this issue.
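Until rollback itself refuses to run on a busy filesystem, the lsof check can be placed in front of it. This is a best-effort sketch: `guarded_rollback` is a hypothetical helper name, and it relies on `lsof +D` exiting with status 0 when it lists at least one open file under the directory. A file opened between the check and the rollback is not caught.

```shell
#!/bin/sh
# guarded_rollback MOUNTPOINT SNAPSHOT   (hypothetical helper name)
# Refuses to roll back while lsof reports open files under MOUNTPOINT,
# because the remount after rollback fails in that situation.
guarded_rollback() {
    mountpoint=$1
    snapshot=$2

    if lsof +D "$mountpoint" >/dev/null 2>&1; then
        echo "refusing rollback: open files under $mountpoint" >&2
        return 1
    fi

    zfs rollback -r "$snapshot"
}

# Usage:
#   guarded_rollback /tmp/ploski1 "ploski1@$first"
```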
Added by
Jan Ploski
on
Sep 20, 2010 07:57 AM
Ok, here's a small scenario how to reproduce it using MySQL:
1. Configure zfs-fuse WITHOUT the disable-page-cache option, otherwise you are NOT going to see the problem.
2. Create a new database 'zfstest'. It should live in a zfs dataset. It only contains one table:
CREATE TABLE `foo` (
`bar` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
3. Insert a single row, then stop MySQL and make a snapshot, e.g. zfs snapshot green/mysql/zfstest@snap
4. Start MySQL again.
5. Insert another row.
6. Now rollback to the previously created snapshot: zfs rollback green/mysql/zfstest@snap; mysqladmin refresh
7. Insert another row.
8. Now run "myisamchk *.MYI" in the database directory. You should see output like:
Checking MyISAM file: foo.MYI
Data records: 3 Deleted blocks: 0
myisamchk: warning: 1 client is using or hasn't closed the table properly
- check file-size
- check record delete-chain
- check key delete-chain
- check index reference
- check record links
myisamchk: error: Wrong bytesec: 0-0-0 at linkstart: 20
MyISAM-table 'foo.MYI' is corrupted
Fix it using switch "-r" or "-o"
The message "client is using or hasn't closed" is harmless and can safely be ignored. However, the "Wrong bytesec" and "is corrupted" messages are not.
The rollback followed by "mysqladmin refresh" is admittedly walking on thin ice. This trick is supposed to make MySQL forget its internal state concerning the rolled back database without a need to restart the whole server (and thus cause outage to other databases). In fact, when I tried without this trick, that is, stopped the MySQL server properly, rolled back, then restarted it again, I could no longer reproduce the table corruption in the above scenario. HOWEVER, the trick appears to work reliably (as in: "hundreds of repetitions without any corruptions") with disable-page-cache, so I still suspect that it might be a zfs-fuse rather than MySQL problem.
Added by
Seth Heeren
on
Sep 20, 2010 08:13 AM
yeah, your
{ rollback& mysqladmin refresh; }
step mirrors my
{ flock -x 200 && zfs rollback -r ploski1@$first; } 200< /tmp/ploski1/data
in my modified steps to reproduce. I'd offer that this is more than walking on thin ice.
Did you ever test this on plain ext4? Of course, you should mimic the rollback by dd from a (lvm2) snapshot on the block level, to be really fair. I wager that there is a good chance that it won't work with exactly the same symptoms.
When you use a file copy to roll back the data, you are explicitly telling the VFS layer to drop the caches, so it isn't a problem. Doing so with ZFS would equally not be a problem:
1. clone the target snapshot
2. mount the clone readonly
3. rsync from the mounted clone over your 'live' mount
4. cleanup all the later snapshots and the clone
Of course this is clumsy, will not perform etc. but it makes it very clear how zfs rollback is different and why it is not a zfs-fuse bug that it works differently.
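For illustration only, the first three steps and the clone cleanup could be scripted roughly as follows. `file_level_rollback` and its argument order are hypothetical, the use of `zfs clone -o` assumes the installed zfs version accepts properties at clone time, and destroying the later snapshots is deliberately left to the caller.

```shell
#!/bin/sh
# file_level_rollback SNAPSHOT CLONE CLONE_MNT LIVE_MNT   (hypothetical helper)
# Restores LIVE_MNT to the state of SNAPSHOT by copying files through the
# VFS instead of calling zfs rollback.
file_level_rollback() {
    snapshot=$1
    clone=$2
    clone_mnt=$3
    live_mnt=$4

    # Steps 1 and 2: clone the target snapshot and mount the clone read-only.
    zfs clone -o readonly=on -o mountpoint="$clone_mnt" "$snapshot" "$clone" || return 1

    # Step 3: copy over the live mount; --delete removes files created
    # after the snapshot. If this fails, the clone is left in place.
    rsync -a --delete "$clone_mnt/" "$live_mnt/" || return 1

    # Step 4 (partial): remove the clone. Later snapshots are not touched.
    zfs destroy "$clone"
}

# Usage (dataset names and paths are placeholders):
#   file_level_rollback pool/fs@snap pool/fs_restore /mnt/fs_restore /pool/fs
```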
Now, I agree with you that this particular trap is dangerous and we need to protect users at all cost. I'm going to see whether I can.
Added by
Jan Ploski
on
Sep 20, 2010 08:26 AM
I haven't tried using file-based methods because it has worked like a charm with the previous zfs-fuse... so far. It still seems to be ok with the disable-page-cache option.
Basically what I'm trying is to build a rig for automated testing of web applications, in which I have one MySQL server hosting multiple databases (clones of a "good known initial state" database), multiple client threads running against this MySQL server, each on its own database, and each client thread being able to quickly roll back to a previous database state independently of all other threads. Because the database's size may be significant, trivial approaches such as mysqldump/mysql import won't cut it. Having to restart the MySQL server is also a huge nuisance and may also make the entire setup infeasible due to the additional synchronization required and resulting lack of parallelism in the client threads.
Your rsync-based solution seems to be something in between these alternatives. I will examine it, especially if my current approach fails.
Added by
Seth Heeren
on
Sep 20, 2010 08:46 AM
I think you got it.
My rsync narrative was not intended as a 'solution'. I also expect rsync to refuse to overwrite files because they are in use (which is the same reason the remount doesn't happen).
In essence you describe you have a specific situation where the cost of having to restart services outweighs the cost of running --disable-page-cache. Then for your scenario, by all means go with disable page cache.
I will keep your type of scenario in mind. I am going to try to make rollback fail (EBUSY) if the filesystem is mounted && cannot be remounted. To allow for your situation, I will make sure that doesn't happen when disable-page-cache is in effect.
Cheers,
Seth
Added by
Jan Ploski
on
Sep 20, 2010 08:51 AM
That sounds good. I will also check to see if EBUSY is reported in my environment (without disable-page-cache) after you have committed the change.
Added by
Seth Heeren
on
Sep 20, 2010 09:13 AM
If the rsync 'story' sounds appealing, a far more direct way will be more robust:
(1) clone a snapshot
(2) mount it (mount man page has a MS_MOVE flag which atomically swaps it in)
This may or may not work depending on the type of locked files used and whether the files are reopened dutifully after 'mysqladmin refresh'
$0.02
Added by
Jan Ploski
on
Sep 20, 2010 10:43 AM
This is not directly related, but speaking of performance, when I run these series of tests (basically consisting of a few db updates, rollback, a few other db updates, rollback, and so on in a loop, with reads in between), I can see vmstat/iostat reporting constant IO activity. I can actually hear the disk working all the time. Given the nature of my application, I am not at all concerned about data loss, as all these modifications are temporary anyway. Is there some way to convince zfs-fuse to keep away from disk and to cache everything in memory (I could possibly use a ramdisk as the block device for the pool, if the db size permits, but are there any simpler solutions on zfs-fuse level)?
Added by
Seth Heeren
on
Sep 20, 2010 10:57 AM
Yes, look at zil_disable, cache devices.
Disabling zil (in zfs-fuse by modding the code) and putting cache on tmpfs will avoid the bulk of the IO. Of course data can be lost.
If you really want to avoid any disk access, there is no other way than putting the whole pool on tmpfs (backed by sparse files). I do this all the time. I hereby attach my 'playground' script that I use to generate multiple pools with various pool layouts based on sparse files in tmpfs on the fly. You should recognize the pool names from my analysis here.
The reason is that pool metadata is obviously replicated across all discs and _synced_ because it is so crucial to pool recoverability. So no matter how smart the caching, as soon as you do dataset manipulation (snapshot, rollback etc) you will hit the disk.
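The attached playground script is not reproduced here, but the core idea of a pool backed by a sparse file on tmpfs can be sketched in a few lines. `make_sparse_vdev`, the `/dev/shm` path and the pool name `playpool` are assumptions for illustration; `truncate -s` with a size suffix assumes GNU coreutils.

```shell
#!/bin/sh
# make_sparse_vdev PATH SIZE   (hypothetical helper name)
# Creates (or resizes) a sparse file: it reports SIZE bytes but occupies
# almost no memory until blocks are actually written.
make_sparse_vdev() {
    truncate -s "$2" "$1"
}

# Usage, assuming /dev/shm is a tmpfs mount and "playpool" is unused.
# zpool wants an absolute path for file-backed vdevs:
#   make_sparse_vdev /dev/shm/playpool.img 1G
#   zpool create playpool /dev/shm/playpool.img
```

Everything in such a pool is gone after a reboot or when the backing file is deleted, so it only suits throwaway data like the test databases discussed here.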
Added by
Jan Ploski
on
Sep 20, 2010 11:48 AM
This is very cool. I got a 100% increase in performance (cutting down time for a series of tests from 331 to 158s) by simply moving the dataset to tmpfs (hacking the zfs code seemed a bit too global for my taste and current needs). I didn't realize you can create a zfs pool over any file straight, not just a block device. Thanks!
Added by
Seth Heeren
on
Sep 20, 2010 12:16 PM
if speed is all-important, go for loop on tmpfs (losetup(1))
if you want speed + snapshots, go for ext2 on zvol on zfsonlinux (zfsonlinux.org); mandates a 64-bit OS for proper function (on 32bit, unstable mem subsystem + 2Gb limit on ZVol size)
also, look at qemu + cow (google)
Added by
Seth Heeren
on
Sep 20, 2010 05:22 PM
Ok, we are between a rock and a hard place here. In current testing, we have the remount call. [Note: code rewrite in ad2dd4d
see http://gitweb.zfs-fuse.net/[…]cial;a=commitdiff;h=ad2dd4d for comments]
SITUATION [A] THAT SUCCEEDS
===========================
in the steps to reproduce, rollback as follows:
bash$ { strace -e trace=mount zfs rollback -r ploski1@$first && echo ok || echo oops; }
mount("ploski1", "/tmp/ploski1", 0x43ba43, MS_REMOUNT, NULL) = 0
ok
md5 check is OK
SITUATION [B] THAT FAILS
========================
in the steps to reproduce, rollback as follows:
bash$ { flock -x 200 && strace -e trace=mount zfs rollback -r ploski1@$first && echo ok || echo oops; } 200< /tmp/ploski1/data
mount("ploski1", "/tmp/ploski1", 0x43ba43, MS_REMOUNT, NULL) = 0
ok
md5 check is FAILED (checksum did NOT match)
ANALYSIS
========
From the strace we can see that the remount call is made both times with equal success (returns 0). From prior analysis (rechecked but omitted for brevity) we also know that it actually has the desired effect in situation [A], because without the remount, invalid cache data would be returned at the checksum verification step.
HOWEVER, we can see from situation [B] that some file locks (such as the exclusive lock on a file in the dataset, emulated by flock) can completely thwart the effect of remount, while being undetectable. I had been praying (and expecting) that remount would return an error in this case. Alas.
CONCLUSION
==========
What we are stuck with is two options:
(1) globally use keep_cache = 0; we have learned from the initial fix to this issue (as still present on the maint branch) that this has undesirable performance effects (see also issue #68)
(2) programmatically apply the brute-force hack using 'echo 3 > /proc/sys/vm/drop_caches'; this step has the following drawbacks, IMO:
- might not be portable
- might conflict with SELinux, AppArmor or distros that like to run zfs-fuse as non-root
- purges _all_ kernel page caches; this might not be acceptable on all systems
- this will always be open to race conditions, as there is no telling what gets returned to a process that is actively reading a (larger) stream _across_ the rollback call. Perhaps this is not a big issue as it is very much expectable ('intuitive') behaviour.
I really don't think we have much of an option besides (2) at this point. Unless someone can come up with another way to detect the success of clearing the cache partially (such as the remount call).
The best we can do is magically skip the step in case the user is actively running --disable-page-cache anyway.
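As a rough illustration of option (2), the same effect can be approximated from outside zfs with a small wrapper. This is a sketch under assumptions, not the code that went into zfs-fuse: the function name and the ZFS and DROP_CACHES override variables are hypothetical and exist only so the wrapper can be exercised without a pool. The sync before the write is there because dirty pages are not freed by drop_caches.

```shell
#!/bin/sh
# Sketch only: roll back, then drop the kernel caches so stale pages,
# dentries and inodes from before the rollback cannot be served.
# ZFS and DROP_CACHES are hypothetical overrides for testing.
ZFS=${ZFS:-zfs}
DROP_CACHES=${DROP_CACHES:-/proc/sys/vm/drop_caches}

# rollback_and_drop_caches SNAPSHOT
#   returns 1 if the rollback fails, 2 if the cache drop fails
rollback_and_drop_caches() {
    snap=$1
    "$ZFS" rollback -r "$snap" || return 1
    # flush dirty data first; drop_caches only frees clean objects
    sync
    # 3 = page cache plus dentries and inodes; needs root
    if ! echo 3 > "$DROP_CACHES"; then
        echo "drop_caches failed" >&2
        return 2
    fi
}
```

Usage would be along the lines of `rollback_and_drop_caches "ploski1@$first"`. All of the drawbacks listed above still apply: the purge is system-wide, and nothing here protects a process that is reading across the rollback.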
PS - To the paranoid
====================
I did test with drop_caches in combination with flock -x which worked marvelously. I did not try to force the race condition :) Here is the output for your enjoyment:
root@maverick:~# md5sum /tmp/ploski1/data; zfs list -tall
2a52845027d9b586a21b926aeb3e1c5d /tmp/ploski1/data
NAME USED AVAIL REFER MOUNTPOINT
ploski1 40.2M 62.5G 20.0M /tmp/ploski1
ploski1@ced6eaa41e48a68f61e6ab04527494d5 20.0M - 20.0M -
ploski1@2a52845027d9b586a21b926aeb3e1c5d 0 - 20.0M -
root@maverick:~# { flock -x 200 && strace -e trace=mount zfs rollback -r ploski1@$first && echo success || echo failure; echo 3 > /proc/sys/vm/drop_caches; } 200< /tmp/ploski1/data
mount("ploski1", "/tmp/ploski1", 0x43ba43, MS_REMOUNT, NULL) = 0
success
root@maverick:~# md5sum /tmp/ploski1/data; zfs list -tall
ced6eaa41e48a68f61e6ab04527494d5 /tmp/ploski1/data
NAME USED AVAIL REFER MOUNTPOINT
ploski1 20.1M 62.5G 20.0M /tmp/ploski1
ploski1@ced6eaa41e48a68f61e6ab04527494d5 0 - 20.0M -
Added by
Seth Heeren
on
Sep 20, 2010 06:17 PM
Issue state:
in-progress → resolved
Code has been integrated into testing (b3fa925).
All is well (in terms of consistent data).
All error paths tested: an informational message is printed to stderr, e.g.:
bash$ zfs rollback -r ploski1@$first
drop_caches failed: permission denied
I opted _NOT_ to get smart about the disable-page-cache setting. I thought better of it because not just the page cache is involved but also inode and dentry cache. Disabling page-cache will probably still allow inconsistent (incoherent) cache data to be returned after rollback.
On that same thought, should we not issue a sync call before doing the rollback? This would ensure any writes don't go to cached locations (because any pending IO structures are not dropped by drop_caches). Hmmm. Perhaps the simplest we could do is force rollback to umount the dataset first, and remount it if it was mounted, really.
Jan could you retest? (testing branch)
My steps were
=============
./dorky.sh
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data; zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
export first="$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
dd if=/dev/urandom bs=1M count=20 of=/tmp/ploski1/data; zfs snapshot "ploski1@$(md5sum /tmp/ploski1/data|cut -d' ' -f1)"
zfs list -tall
md5sum /tmp/ploski1/data
{ flock -x 200 && zfs rollback -r ploski1@$first; } 200< /tmp/ploski1/data
zfs list -tall
echo "$first" " /tmp/ploski1/data" | md5sum -c
Final output
============
bash$ echo "$first" " /tmp/ploski1/data" | md5sum -c
/tmp/ploski1/data: OK
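Because each snapshot in these steps is named after the md5sum of the data file at snapshot time, the final check can be wrapped in a small helper that takes the snapshot name directly. This is a sketch; the function name is hypothetical and it assumes the part after '@' is exactly the md5 hex digest, as in the steps above.

```shell
#!/bin/sh
# Sketch only: compare a file's md5 against the digest encoded in a
# snapshot name of the form POOL@MD5HEX (naming convention from the steps above).
check_against_snapshot() {
    file=$1
    snap=$2
    expected=${snap#*@}                        # strip everything up to and including '@'
    actual=$(md5sum "$file" | cut -d' ' -f1)   # first field is the hex digest
    if [ "$actual" = "$expected" ]; then
        echo "$file: OK"
    else
        echo "$file: FAILED (got $actual, expected $expected)"
        return 1
    fi
}
```

After a rollback, `check_against_snapshot /tmp/ploski1/data "ploski1@$first"` reports OK only if the file content really matches the snapshot that was rolled back to.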
Added by
Jan Ploski
on
Sep 21, 2010 12:52 PM
I can confirm that the drop_caches solution works as expected (no inconsistency without disable-page-cache option, both in your test case and in my own ones). This is a rather heavy-handed approach, of course. I'm now evaluating the performance impact on my application, which at present includes browser, database, php web app, selenium rc (= java) and perl test driver scripts all running on the same machine. I expect the excessive cache cleaning to hurt, but we'll see...
Added by
Jan Ploski
on
Sep 21, 2010 01:51 PM
I have now measured the execution times of my real app:
- with old version using disable-page-cache: 2045s
- with new version using drop_caches (no disable-page-cache): 2478s
I ran this test against a pool backed by a real disk. I'm going to compare performance on tmpfs later and I suppose the difference will be less noticeable, but clearly if you can invent a better way to prevent the inconsistency other than the global drop_caches, it would be welcome. I haven't yet understood what exactly is cached where in this stack and why, so I'm afraid I can't contribute any good ideas.
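To put the two timings in relative terms, a tiny helper (hypothetical name, plain awk arithmetic) computes the percentage change in run time:

```shell
#!/bin/sh
# Sketch only: percentage change in run time from OLD to NEW seconds.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f%%\n", (new - old) / old * 100 }'
}

pct_change 2045 2478   # prints 21.2%
```

So the drop_caches build takes roughly a fifth longer than the disable-page-cache build for this workload on a disk-backed pool.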
Added by
Seth Heeren
on
Sep 21, 2010 02:22 PM
Thanks for reporting back.
In terms of minimizing the effect of rolling back snapshots I suppose that the best thing to do is move the 'other' files to tmpfs, _not_ the zfs filesystem (by rolling back you are making it actively impossible to cache that anyway). By keeping the 'other' bits in your test suite on tmpfs you are in effect 'fixing' them in 'cache' (RAM). No drop_caches can hurt them as long as they are in tmpfs.
That being said, I think that moving your zpool onto tmpfs will in the end normally produce bigger performance gains (simply because zfs-fuse can be slow). So in practice the net result would still be you'd best move the zpool there.
Another thought is that, although of course I hate the heavy-handedness of this approach, it is probably unimportant because
(a) not everyone is rolling back often (I'm guessing most people use snapshots for backup only, and backup restores are rare, especially integral snapshot put-backs; normally a single file restore is done by mounting a clone and destroying that)
(b) there is no known alternative that won't put data at risk
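For reference, the clone-based single-file restore mentioned in (a) could look roughly like the sketch below. The function name, the RUN dry-run switch and all dataset names and paths are hypothetical. It also assumes the clone gets mounted at the path you pass in, which with default inherited mountpoints would be a subdirectory of the parent dataset's mountpoint. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: restore one file from a snapshot via a temporary clone,
# without rolling back (and so without any cache flushing).
# RUN is a hypothetical dry-run switch: unset -> print, RUN= (empty) -> execute.
RUN=${RUN-echo}

# restore_file_via_clone SNAPSHOT CLONE CLONE_MOUNTPOINT RELATIVE_PATH DEST
restore_file_via_clone() {
    snap=$1; clone=$2; clone_mnt=$3; rel=$4; dest=$5
    $RUN zfs clone "$snap" "$clone" || return 1
    # copy the wanted file out of the clone, preserving attributes
    $RUN cp -a "$clone_mnt/$rel" "$dest"
    rc=$?
    # always remove the temporary clone again
    $RUN zfs destroy "$clone"
    return "$rc"
}
```

For example, `restore_file_via_clone "ploski1@$first" ploski1/restore /tmp/ploski1/restore data /tmp/data.restored` prints the clone, copy and destroy commands in order.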
Cheers,
Seth
Added by
Jan Ploski
on
Sep 21, 2010 03:25 PM
You are correct that data integrity is much more important to humanity at large :) than fast rollbacks. In that respect my scenario is rather "exotic". Although I dare say, having lightning fast snapshot/rollbacks could be considered an "enabling technology" and a "unique selling proposition" of sorts. Once it's there, more people are aware of it and know they can rely upon it (as in: it will work in the next release), they will come up with interesting new uses (or misuses ;)).
I might also look at the performance of LVM in the same scenario in the future (and possibly ZVOL, as you suggested; the 64bit requirement is a turn-off; VM-level snapshots as in qemu seem too coarse for my needs). I'm not really using any unique ZFS features in this application. But I do elsewhere, so it would be good to just stay with ZFS rather than switch back and forth between two (or more) competing implementations that basically accomplish the same thing, but differ in user interface and performance.
In any case I'm satisfied with the current solution to this issue. If I really wish, I could still use the older version or maintain a performance patch, but I guess I won't. I had a quiet hope that your last few statements (about sync/remount) might hint at an alternative to drop_caches. Maybe it should be moved to another issue with low severity (and priority). Or maybe just close and wait until someone else starts asking performance questions about it.
Added by
Seth Heeren
on
Sep 21, 2010 06:15 PM
Issue state:
resolved → closed
> I might also look at the performance of LVM in the same scenario in the future
LVM snapshots seem overrated to me. For one thing, last time I checked there was still no usable rollback function (and if it existed I reckon it would be slow).
Manually copying from a snapshot is slow (!) and fraught with the danger of running out of space (unless you match the origin's capacity, at which point it all boils down to space-inefficient COW).
Can you tell I'm not a fan of LVM snapshots? (Mind you, I have used nothing but lvm2 on all my machines for 5 years or so, just not snapshots anymore.)
As much as I'd love to keep this open and tinker with explicit remounts that work forcefully, I think I'll be realistic and just keep it at this for the moment.
Thanks again,
closing

bonnie_old.txt
dorky.sh
drop_caches.c
