Personal tools
You are here: Home Issue tracker Debug build (scons debug=2) dies right after startup

#25 — Debug build (scons debug=2) dies right after startup

State Resolved
Version:
Area Functionality
Issue type Bug
Severity Low
Submitted by Bjoern Kahl
Submitted on Feb 03, 2010
Responsible Seth Heeren
Target release:
Return to tracker
Last modified on Feb 13, 2010 by Seth Heeren
The zfs-fuse daemon build with debug=2 dies right after startup without useful error message.

-------------------------------
athlon:~# zfs-fuse -n -a 2 -e 2
hostname = athlon.local
hw_serial = 1463809007
ncpus = 1
physmem = 112988 pages (0.43 GB)
pagesize = 4096, pageshift: 12
pwd_buflen = 1024, grp_buflen = 1024

Aborted (core dumped)
-------------------------------

inspecting the core file with gdb shows an assertion failure in umem:

(gdb) back
#0 0xb7f21518 in raise () from /lib/tls/libpthread.so.0
#1 0x08150d8c in umem_do_abort () at lib/libumem/umem_fail.c:81
#2 0x08150e82 in umem_panic (format=Could not find the frame base for "umem_panic".
) at lib/libumem/umem_fail.c:148
#3 0x08150f10 in __umem_assert_failed (assertion=Could not find the frame base for "__umem_assert_failed".
) at lib/libumem/umem_fail.c:172
#4 0x0814df45 in vmem_populate (vmp=0xb7f3a494, vmflag=0) at lib/libumem/vmem.c:606
#5 0x0814f642 in vmem_add (vmp=0xb7f3a494, vaddr=0x1, size=2147483647, vmflag=0) at lib/libumem/vmem.c:1177
#6 0x081503e4 in vmem_create (name=0x8189d69 "taskq_id_arena", base=0x1, size=2147483647, quantum=1, afunc=0, ffunc=0,
    source=0x0, qcache_max=0, vmflag=0) at lib/libumem/vmem.c:1536
#7 0x08158d0f in taskq_init () at lib/libsolkerncompat/taskq.c:788
#8 0x0815308e in libsolkerncompat_init () at lib/libsolkerncompat/main.c:80
#9 0x0804c8e3 in do_init () at zfs-fuse/util.c:92
#10 0x0804beda in main (argc=6, argv=0xbff7baa4) at zfs-fuse/main.c:387

The problem is triggered in lib/libumem/vmem.c:1177 which reads:

(gdb) list
601 return (1);
602 }
603
604 (void) mutex_unlock(&vmp->vm_lock);
605
606 ASSERT(vmflag & VM_NOSLEEP); /* we do not allow sleep allocations */
607 lp = &vmem_nosleep_lock;
608
609 /*
610 * Cannot be just a mutex_lock(), since that has no effect if

and is called from lib/libsolkerncompat/taskq.c:788

(gdb) list
783 taskq_ent_cache = kmem_cache_create("taskq_ent_cache",
784 sizeof (taskq_ent_t), 0, taskq_ent_constructor,
785 taskq_ent_destructor, NULL, NULL, NULL, 0);
786 taskq_cache = kmem_cache_create("taskq_cache", sizeof (taskq_t),
787 0, taskq_constructor, taskq_destructor, NULL, NULL, NULL, 0);
788 taskq_id_arena = vmem_create("taskq_id_arena",
789 (void *)1, INT32_MAX, 1, NULL, NULL, NULL, 0,
790 VM_SLEEP | VMC_IDENTIFIER);
791
792 list_create(&taskq_cpupct_list, sizeof (taskq_cpupct_ent_t),


However, what really scares me are these lines in lib/libumem/misc.h

/*
 * We define our own assertion handling since libc's assert() calls malloc()
 */
#if !DEBUG
#define ASSERT(assertion) (void)0
#else
#define ASSERT(assertion) (void)((assertion) || \
     __umem_assert_failed(#assertion, __FILE__, __LINE__))
#endif

So if DEBUG is not defined, all these checks in libumem are NOT done. In other words, the memory system will happily serve requests, it is not designed to handle?!? Or do I miss an important point here?

I still have the core file, so if someone needs it, let me know.


System configuration:

Debian Sarge, with custom-compiled zfs_fuse.
uname -a : Linux athlon.local 2.6.18-4-486 #1 Wed May 9 22:23:40 UTC 2007 i686 GNU/Linux
-----------------------
cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm)
stepping : 1
cpu MHz : 1164.553
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow ts
bogomips : 2330.93
-----------------------
fuse: debian package 2.7.4, compiled & installed from source (fuse_2.7.4-1.1.dsc) + kernel module from fuse-source_2.7.1-2~bpo40+1_all.deb

zfs_fuse is compiled from source, as per instructions on this site. I used "git clone http://git.zfs-fuse.net/official" from 2010-01-31

Steps to reproduce:
compile latest checkout (here "" from 2010-01-31) with "scons debug=2"

start the zfs-daemon

it should immediately crash.
Added by Bjoern Kahl on Feb 03, 2010 01:04 PM
:( Typo:

After the backtrace, it should read:

The problem is triggered in lib/libumem/vmem.c:606 which reads:

and not lib/libumem/vmem.c:1177

Sorry for the noise.
Added by Seth Heeren on Feb 03, 2010 03:37 PM
Issue state: unconfirmedopen
Hi there

This exact symptom has been reported once before. I'd have to search the list to see who it was. We haven't been able to track the cause then. It might help if you could providedetails on the dev setup (gcc version, glibc version, all dependencies and fuse in particular).

If you can't manage, I'd have to reproduce it myself and that would take a bit longer.

On the assert thing: it is completely normal for asserts to be disabled in release mode (i.e. non-debug) because they can impact performance quite a lot. The thing about 'overriding' libc's ASSERT only means that they have their own implementation (for debug mode, just like libc's) - not that it would not work.

Seth
Added by Bjoern Kahl on Feb 11, 2010 02:08 PM
Hi, sorry for late reply.

> his exact symptom has been reported once before. I'd have to search the list to see who it was. We haven't been
> able to track the cause then. It might help if you could providedetails on the dev setup (gcc version, glibc
> version, all dependencies and fuse in particular).

I don't think that this depends on the build environment. Anyway, here is my config:
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
GNU C Library stable release version 2.3.6, by Roland McGrath et al. ; Compiled by GNU CC version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21).

zfs-fuse has been compiled as stated in my original report:
fuse: debian package 2.7.4, compiled & installed from source (fuse_2.7.4-1.1.dsc) + kernel module from fuse-source_2.7.1-2~bpo40+1_all.deb

> it is completely normal for asserts to be disabled in release mode (i.e. non-debug)

Yes, I know, I do this regularly.

My point is:

The code *must* crash, if the ASSERTS are in, because the caller in /libsolkerncompat/taskq.c:788 passes an *unsupported* memory requirement flag, namely VM_SLEEP, to the vmem_create(). The comment in /libumem/vmem.c:606 *explicitly* says: "ASSERT(vmflag & VM_NOSLEEP); /* we do not allow sleep allocations */". Now, in release builds the test is skipped, and vmem_create() may or may not do what taskq.c expects.

Added by Seth Heeren on Feb 11, 2010 03:06 PM
Responsible manager: (UNASSIGNED)sgheeren
> The code *must* crash, if the ASSERTS are in, because the caller in /libsolkerncompat/taskq.c:788
> passes an *unsupported* memory requirement flag, namely VM_SLEEP, to the vmem_create(). The comment
> in /libumem/vmem.c:606 *explicitly* says: "ASSERT(vmflag & VM_NOSLEEP); /* we do not allow sleep
> allocations */". Now, in release builds the test is skipped, and vmem_create() may or may
> not do what taskq.c expects.

Actually, the idea is that the code should _not_ crash in debug, nor in release mode. The way this is accomplished is to be anal about checking everything twice in debug mode, and trust your code quality in production/release mode. Run debug mode for diagnostics, but don't check for conditions that are known not to occur in release mode. It is a waste of time.

Now, we have a brand new thread on the list http://groups.google.com/[…]/12c11fd87ac8e7c8 talking about a middle ground here.
Added by Seth Heeren on Feb 11, 2010 03:29 PM
> I don't think that this depends on the build environment

Mmmm.... tempted to say it is a known fact. But you may be right. Are you running some kind of app-armor or SELinux type of security modes?
Added by Seth Heeren on Feb 11, 2010 04:50 PM
AH I finally reproduced it:

domU-12-31-38-00-38-D1:~/official/src# uname -a
Linux domU-12-31-38-00-38-D1 2.6.21.7-2.fc8xen-ec2-v1.0 #2 SMP Tue Sep 1 10:04:29 EDT 2009 i686 GNU/Linux
domU-12-31-38-00-38-D1:~/official/src# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 5.0.3 (lenny)
Release: 5.0.3
Codename: lenny
domU-12-31-38-00-38-D1:~/official/src# gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.3.2-1.1' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-cld --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu
Thread model: posix
gcc version 4.3.2 (Debian 4.3.2-1.1)

You can forget about selinux, since I tested with

# apt-get install selinux-utils
# setenforce Permissive

and there was no difference.

The thing _is_ however very timing-sensitive. As I happened to set some breakpoints (e.g. break vmem_add) and I manually continued ('c') on the gdb prompt each time it hit, it would simply succeed (?!? sic). Granted, if I simply held the enter key to keep repeating the continue command, it would still crash.

Now this is funny, IMHO because at the precise moment that SIGABRT is trapped, 'info threads' shows only 1 thread. Ps agrees:

UID PID PPID LWP C NLWP STIME TTY STAT TIME CMD
root 4952 4951 4952 0 1 21:13 pts/2 T 0:00 /usr/local/sbin/zfs-fuse --no-daemon

There must be something fishy with either duplicate instances of static data in libumem/umem.o (?) or the opposite, illegally shared statics? The fact that vm_flag==0 upon entry of vmem_add is in itself not a problem, because some 5-10 other requests by umem itself are handled ok with those flags.

So, I'd shift focus from the vm_flag to memory corruption/initialization race for this issue.
Added by Seth Heeren on Feb 11, 2010 04:59 PM
Severity: MediumImportant
After fiddling with the timings for 30-40 minutes I found the following two configurations to be the only ones that work on my test box

 780 void
 781 taskq_init(void)
 782 {
 786 sleep(1);
 783 taskq_ent_cache = kmem_cache_create("taskq_ent_cache",
 784 sizeof (taskq_ent_t), 0, taskq_ent_constructor,
 785 taskq_ent_destructor, NULL, NULL, NULL, 0);
 787 taskq_cache = kmem_cache_create("taskq_cache", sizeof (taskq_t),
 788 0, taskq_constructor, taskq_destructor, NULL, NULL, NULL, 0);
 789 sleep(1);
 790 taskq_id_arena = vmem_create("taskq_id_arena",
 791 (void *)1, INT32_MAX, 1, NULL, NULL, NULL, 0,
 792 VM_SLEEP | VMC_IDENTIFIER);
 793
 794 list_create(&taskq_cpupct_list, sizeof (taskq_cpupct_ent_t),
 795 offsetof(taskq_cpupct_ent_t, tp_link));
 796 }

or

 786 sleep(1);
 783 taskq_ent_cache = kmem_cache_create("taskq_ent_cache",
 784 sizeof (taskq_ent_t), 0, taskq_ent_constructor,
 785 taskq_ent_destructor, NULL, NULL, NULL, 0);
 787 taskq_cache = kmem_cache_create("taskq_cache", sizeof (taskq_t),
 788 0, taskq_constructor, taskq_destructor, NULL, NULL, NULL, 0);
 789 sleep(1);
 790 taskq_id_arena = vmem_create("taskq_id_arena",
 791 (void *)1, INT32_MAX, 1, NULL, NULL, NULL, 0,
 792 VM_SLEEP | VMC_IDENTIFIER);

I hope this helps others generate idea's as to the source of the race condition. In the mean time, one of these options could be suggested as a workaround if the problem is considered a stopping issue... I donot recommend this workaround; however one cannot rule out that the 'stable' version everyone else is using just 'happens' to survive the race by chance on their systems?

It certainly confuses me that I needed to try to get close to the tested build environment before being able to reproduce this issue in the first place.
Added by Seth Heeren on Feb 11, 2010 05:15 PM
Oh, in case someone else cares to reproduce this:

I reproduced/diagnosed this on an amazon EC2 small instance ($0.085 per hour)
AMI ami-dcf615b5 (alestic/debian-5.0-lenny-base-20091011.manifest.xml)

apt-get update
apt-get install -y git-core screen htop vim libaio-dev libattr1-dev libacl1-dev libz-dev libz-dev libfuse-dev libfuse2 scons libssl-dev gdb
git clone http://git.zfs-fuse.net/official "master@{2010-01-01}"
cd official/src
scons -j3 debug=2 install

# test/verify crash
zfs-fuse; sleep 4; pgrep zfs-fuse && zpool list

I could leave the machine on / launch another in case someone wants to help out. Reach me at http://groups.google.com/group/zfs-fuse
Added by Seth Heeren on Feb 13, 2010 04:52 PM
Issue state: openin-progress
Severity: ImportantLow
Emmanuel has a convincing argument that changing VM_SLEEP -> VM_NOSLEEP shouldn't matter at this point in the program's initialization.

He's pushed a commit bde20b8

You should probably retest with this patch to verify that it WorksForYouToo. I haven't verified this myself.
Added by Seth Heeren on Feb 13, 2010 05:17 PM
Issue state: in-progressresolved
Target release: None0.6.1
So I went and retested my own setup ($0.085 down the drain for less than 10 minutes :))

It works fine now.

I pushed (cherry-pick) the relevant commit to zfs-fuse.net/official master branches. So pulling the official master should get you the fix.

Closing