#20 — resilvering continues after offlining device to be resilvered.

State Confirmed
Version:
Area Process
Issue type Bug
Severity Medium
Submitted by (anonymous)
Submitted on Dec 31, 2009
Responsible Seth Heeren
Target release: 0.6.9
Last modified on Jul 04, 2010
I have a raidz mirror with two devices. One of them had been offline for a while, and I started resilvering it with "zpool online <pool> <device>".
However, shortly after resilvering started, I changed my mind and ran "zpool offline" on that device.
"zpool status" showed the corresponding pool as degraded with one device offline. Nevertheless, the resilvering process continued, with heavy disk activity on the remaining disk, and "zpool status" still reported resilvering progress.
Running "zpool online" again for the offlined device resulted in what looks like a full, non-incremental resilver of the whole raidz mirror.

So far it looks as if no data was lost. However, the resilvering process with a phantom disk felt a bit chilling.
 
Version:
commit 6dc41374b3479afae2e536c20623d02a512f7b8a
Merge: 91398ce af73060
Author: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Date: Mon Dec 7 15:55:24 2009 -0800

    Merge remote branch 'emmanuel/master'

Steps to reproduce:
provided above
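
The sequence from the description can be sketched as a small script. This is only an illustration of the reported steps, not a verified reproducer: the pool name "tank", the device "/dev/sdb1" and the 30-second wait are hypothetical placeholders. By default the script only prints the commands; set RUN=1 to actually execute them.

```shell
#!/bin/sh
# Hypothetical names: replace with your own pool and device.
POOL="${POOL:-tank}"
DEV="${DEV:-/dev/sdb1}"

# Print each command; execute it only when RUN=1 is set.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

repro() {
    run zpool online "$POOL" "$DEV"    # bring the device back, resilvering starts
    run sleep 30                       # arbitrary wait so the resilver is under way
    run zpool offline "$POOL" "$DEV"   # take the device offline again
    run zpool status "$POOL"           # reported: pool degraded, resilver still shown
    run zpool online "$POOL" "$DEV"    # reported: looks like a full resilver
}

repro
```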
Added by Seth Heeren on Jan 25, 2010 06:16 PM
Target release: None → 0.6.1
please retest with

# git clone http://git.zfs-fuse.net/official
# cd official/
# git checkout origin/critical

if at all possible. You can also merge the fixes since 0.6.0 from that branch
Added by Seth Heeren on Jun 19, 2010 02:03 PM
Issue state: unconfirmed → open
Target release: 0.6.1 → 0.6.9
Responsible manager: (UNASSIGNED) → sgheeren
Mail dated Sat, 19 Jun 2010 10:21:51 -0700

I just noticed some odd behaviour that I did not expect.

I tried a zfs send | zfs receive on my machine.
Then I realized that I only wanted to send incremental data and canceled
the command with CTRL+C. So far so good.
I continued with the incremental data, which worked out fine, and tried to
export the target pool afterwards.

That did not work. Zpool claimed that the pool was still busy.
I checked to see that it was already unmounted. Then I noticed that
there was still I/O activity on the disk. According to iostat, it was
actually the full bandwidth that was possible for that disk.

The same happens with CTRL+C while doing a "zfs rollback". Activity
continues until the rollback is finished.

I noticed similar behaviour earlier while resilvering. If the disk
that is being resilvered is suddenly removed, heavy reading activity
continues on the source disk.

For the first case I would expect zfs to just cancel activity on both
the source and target pool. The same for cancelling a rollback.

If the target fails while resilvering, that should probably be detected...

I don't have solaris on any machine to test those cases there.
Does someone know what should really happen? Did someone else experience
these problems?

All the best,
Maximilian.


Added by Seth Heeren on Jun 19, 2010 02:04 PM
 On 06/19/2010 07:21 PM, Maximilian Mehnert wrote:
> there was still I/O activity on the disk. According to iostat, it was
> actually the full bandwidth
Read or write?

> For the first case I would expect zfs to just cancel activity on both
> the source and target pool.
Agreed

> The same for cancelling a rollback.
>
Not agreed. I would expect ZFS to complete the currently running atomic transaction, and then abort the rollback. This could take some time, especially under e.g. dedup.

> I don't have solaris on any machine to test those cases there.
>
I do.

> Does someone know what should really happen?
Sun does :)

> Did someone else experience
> these problems?
>
No.
Added by Seth Heeren on Jun 19, 2010 02:06 PM
Mail dated Sat, 19 Jun 2010 11:51:40 -0700 (PDT)

>> actually the full bandwidth
> Read or write?
I'm not sure anymore. But it was definitely the disk with the receiving
pool.
If needed I'll try to reproduce it.

> Can you state which versions this new report applies to?
commit 281dcc3aea76fa371a55af83869049bef159a9af
Author: Seth Heeren <sgheeren@hotmail.com>
Date: Thu Jun 3 21:06:11 2010 +0200

>> The same for cancelling a rollback.
> Not agreed. I would expect ZFS to complete the currently running atomic
> transaction, and then abort the rollback. This could take some time,
> especially under e.g. dedup.
Agreed ;-)

> Do you mean you
> _know_ it detects it, or you just hope it will; do you have
> any particular reason to mention it? If so, please share more details :)
OK, I'll try. I have two devices in a raidz pool, both via device mapper.
Let's say /dev/mapper/part1 and /dev/mapper/part2.
part1 is OK and part2 is resilvering. When part2 is unplugged, zpool
status still shows an ongoing resilvering process. Read activity on part1
continues (I guess until the end of the supposed resilvering process;
I never waited that long).
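
To check for this state without watching the output by hand, something like the following could be used. It is a rough sketch that rests on two assumptions about the "zpool status" text: that it contains the phrase "resilver in progress" while a resilver is running, and that each device line carries its state in the second column. The function name phantom_resilver is made up for this example.

```shell
#!/bin/sh
# Reads `zpool status` output on stdin. Returns 0 when the text mentions a
# resilver in progress AND at least one device is OFFLINE, UNAVAIL or
# REMOVED; returns 1 otherwise.
# Assumption: the status text contains the phrase "resilver in progress".
phantom_resilver() {
    awk '
        /resilver in progress/ { resilver = 1 }
        $2 == "OFFLINE" || $2 == "UNAVAIL" || $2 == "REMOVED" { gone = 1 }
        END { exit !(resilver && gone) }
    '
}
```

Usage, with a hypothetical pool named "tank": zpool status tank | phantom_resilver && echo "resilver still reported with a device missing"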
Added by Seth Heeren on Jun 21, 2010 04:44 AM
Yet another very loose thought:

/etc/profile
/etc/profile.d/*
~/.profile

/etc/bash.bashrc
~/.bashrc
/etc/environment

Check whether any of these files contains "trap 2 3" or something similar.
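
A quick way to run that check is sketched below. The reasoning, presumably, is that a signal set to "ignore" in a shell startup file (for example trap '' 2 3, which ignores SIGINT and SIGQUIT) is inherited by child processes, so CTRL+C would never reach the zfs command. The function name find_traps is made up for this example, and the pattern is a simple heuristic that may also match comments.

```shell
#!/bin/sh
# Print every line that appears to install a signal trap in the given files,
# prefixed with file name and line number. Missing files are skipped.
find_traps() {
    for f in "$@"; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        grep -nH -E '(^|[;&|[:space:]])trap[[:space:]]' "$f" || true
    done
    return 0
}

find_traps /etc/profile /etc/profile.d/* "$HOME/.profile" \
           /etc/bash.bashrc "$HOME/.bashrc" /etc/environment
```

No output means none of the listed files sets a trap.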
Added by (anonymous) on Jul 04, 2010 05:15 PM
Cross posting from issue #70: another user experiences abort not working

This will probably help us get leverage on the context in which this problem appears

------------ from issue #70:
Jan, your problem is exactly the same as the one reported in #20 (recently confirmed to still be a problem with 0.6.9 maint). However, we have never been able to reproduce the issue. It seems you are now the second person to have this problem.

I suggest we
(a) try to establish the common denominator to find a root cause, and
(b) see whether I could in some way gain access to the machine that exhibits this in order to investigate.

I will cross-post this to issue #20 for Maximilian Mehnert
(see also http://groups.google.com/[…]/)