#21 — zpool status CKSUM update broken?

State: Resolved
Version: 0.6.0
Area: Functionality
Issue type: Bug
Severity: Low
Submitted by: (anonymous)
Submitted on: Jan 22, 2010
Responsible: Seth Heeren
Target release:
Last modified on May 22, 2010 by Seth Heeren
Hi list,

I tried to test the scrubbing mechanism of a two-member raid today, and here's
something odd I don't understand:

I created a zfs-fuse test raid, filled it with 70GB of video material, scrubbed
the array and everything was fine as expected.

With one drive unplugged I booted into a live system. I used dd_rescue to
spread 20 errors all over the zfs partition. In fact I overwrote some 512-byte
blocks with /dev/zero. So this raid member became the defective one.
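
For readers who want to try this kind of damage safely, here is a minimal sketch using plain dd on a scratch file rather than dd_rescue on a real partition. The scratch file, its size and the block numbers are all made up for illustration; they are not the values used in this report.

```sh
#!/bin/sh
# Hypothetical stand-in for the damaged partition: a 128 KiB scratch file.
# Block numbers 10, 100 and 200 are arbitrary picks for illustration.
set -eu
LC_ALL=C; export LC_ALL

img=$(mktemp)
trap 'rm -f "$img"' EXIT

# Fill 256 blocks of 512 bytes with 0xFF so zeroed blocks stand out.
dd if=/dev/zero bs=512 count=256 2>/dev/null | tr '\000' '\377' > "$img"

# Zero one 512-byte block in place; conv=notrunc leaves the rest untouched.
zero_block() {
    dd if=/dev/zero of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}

for blk in 10 100 200; do
    zero_block "$img" "$blk"
done

# Count the bytes that are no longer 0xFF.
zeroed=$(tr -d '\377' < "$img" | wc -c)
size=$(wc -c < "$img")
echo "file size: $size bytes, zeroed bytes: $zeroed"
```

Pointing of= at a real partition instead of a scratch file is destructive, so this should only ever be done on a disposable test pool.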

I rebooted into the zfs system with both drives attached and ran a binary
compare of all video files in the zfs pool. The idea behind this was to make
sure as much data as possible gets read, including data from the defective
drive.

Meanwhile I checked zpool status to see whether reading from both drives (one
damaged, one intact) would cause the CKSUM value to rise. Nothing happened:
the binary compare finished with no CKSUM errors.

So I forced the scrubbing of both drives with zpool scrub <pool>. Now 2 CKSUM
errors were noticed right at the beginning of the scrub (less than 10% was
scrubbed at that time) with 384kb to resilver. For the remaining 90% no more
errors were detected.

There are two things I don't understand here. I hope someone can explain
whether this is a bug, or what I should expect zfs to do in this case:

a) Why doesn't zpool status show checksum errors immediately when it reads
from the faulty drive? I expected zpool status to update every checksum error,
reading error and writing error immediately.

b) Status after scrubbing: why are there only 2 CKSUM errors reported right
at the beginning of the scrub, when there are about 20 damaged blocks of 512
bytes each spread over the remaining 90%?

Thanks for any comments on that.
Steps to reproduce:
See Details
Added by (anonymous) on Jan 22, 2010 07:52 AM
I'd report this upstream first.

For my part, I'd first have to establish whether ZFS on Solaris behaves any differently.

EDIT: sgheeren (forgot to login first)
Added by Seth Heeren on Jan 22, 2010 03:17 PM
Responsible manager: (UNASSIGNED) → sgheeren
Questions:

(a) What is your method/tool for the 'binary compare', and what do you compare against? Many compare tools stop reading the files after a certain number of differences.

(b) Have you got any reason to believe that all of the blocks you clobbered are actually _in use_? If they're not, it makes sense that neither reading, scrubbing nor zfs send would find any problem.
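
To illustrate the concern in (a), here is a small sketch using cmp, which is only one example of a compare tool; the report does not say which tool was actually used. Plain cmp stops at the first differing byte, while cmp -l reads on and lists every difference. The two scratch files and their contents are made up for the demonstration.

```sh
#!/bin/sh
# Sketch: plain `cmp` stops at the first difference, `cmp -l` keeps going.
# The scratch files and their contents are hypothetical test data.
set -eu

a=$(mktemp)
b=$(mktemp)
trap 'rm -f "$a" "$b"' EXIT

printf 'AAAAAAAAAA' > "$a"
printf 'AxAAAxAAAx' > "$b"   # differs at bytes 2, 6 and 10

# Default mode: reports only the first differing byte, then exits.
# cmp exits non-zero when files differ, hence the `|| true` under set -e.
first=$(cmp "$a" "$b" || true)
echo "$first"

# -l mode: lists every differing byte, so both files are read to the end.
all=$(cmp -l "$a" "$b" || true)
echo "$all"

nfirst=$(printf '%s\n' "$first" | wc -l)
nall=$(printf '%s\n' "$all" | wc -l)
echo "lines reported: default=$nfirst, -l=$nall"
```

If the compare run stopped early in a similar way, large parts of the damaged drive may never have been read at all.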

I've devised a few scripts to test this today, and they've not given me any surprises yet. I'll post with instructions later.
Added by Seth Heeren on Jan 23, 2010 08:01 PM
Issue state: unconfirmed → postponed
Severity: Medium → Low
So here are the scripts promised.

The procedure is quite simple:

0. {have backups}
1. untar
2. edit corrupt.conf
3. ./do_prepare.sh
4. ./mk_corrupt.sh
5. sh ./do_corrupt.sh
6. simultaneous:
       watch -n5 -d './do_readback.sh && ./do_verify.sh | sort -n'
   and
       zpool scrub [poolname] && watch -d -n 5 zpool status -v [poolname]

I have yet to see strange things happen. I have, however, seen unrelated threading issues when monitoring via 'zfs list' (for free space). See my post at the user group.

I have had an occasion or two of not all blocks getting reverted. Then I played a bit with zdb to see whether I could somehow work out for a fact that the block in question was actually in use and should hence have been corrected. Unfortunately, I have to admit that I didn't find a way to prove that (before the collapse of the sun, anyway).

I will post some characteristics of the scripts on the user list.
Added by Seth Heeren on Jan 23, 2010 08:11 PM
Cannot add attachments. Will post at user list
Added by Seth Heeren on Jan 25, 2010 04:48 PM
Issue state: postponed → open
un-postpone (postpone apparently 'closes' it as far as UI is concerned?!)
Added by Seth Heeren on May 22, 2010 02:30 PM
Issue state: open → resolved
Closing due to lack of interest and the absence of a tangible problem.

Please take this upstream (zfs-discuss) if you really want to get this question answered.