Fix: Use a filled signal mask to disable all signals
Changelog from David Pelton's original patch:
While using lttng-ust with an application that was calling fork()
with pending signals, I found that all signals were getting unmasked
shortly before the underlying call to fork(). After some
investigation, I found that the rcu_bp_before_fork() function was
unmasking all signals. Based on the comments for this function, it
should be masking all signals. Inspection of the rest of the code
in urcu-bp.c revealed the same pattern in two other functions.
This patch changes the code to use a filled signal mask to disable
all signals. The change to rcu_bp_before_fork() addressed the
problem I was seeing while using lttng-ust. The changes to the
other two functions appear to fix other instances of the same
problem.
Updates by Mathieu Desnoyers:
- Use SIG_BLOCK instead of SIG_SETMASK when setting a filled mask. This
has the same behavior in this case (since we're blocking all signals),
but is semantically neater: if we ever some signals from that mask,
we'd like to to a union with the signal mask already blocked by the
application.
- Also fix incorrect signal masking in compat_arch_x86.c.
Reported-by: David Pelton <dpelton@ciena.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
These currently translate into simple compiler barriers on all
architectures, but the and/or/add/sub/inc/dec uatomics do not provide
memory ordering guarantees (only uatomic_add_return, uatomic_sub_return,
uatomic_xchg, and uatomic_cmpxchg provides full memory barrier
guarantees before and after the atomic operations).
Passing an unsigned int to uatomic_sub does not honor sign extend to
long, as we should be allowed by assume.
Fix this by introducing caa_cast_long_keep_sign(), which casts either to
long or unsigned long depending on the signedness of the argument
received. It is used in uatomic_sub before applying the "-" operator,
since this operator needs to operate on the "long" type size (since sign
extension might not be performed if the argument received is unsigned).
Discourage use of pthread_atfork() for call_rcu handlers
Discourage use of glibc pthread_atfork() for call_rcu handlers due to
its inappropriate assumptions about single-threadedness while pthread
atfork handlers are executing. This results in hangs within the glibc
memory allocator.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix call_rcu fork handling by putting all call_rcu threads in a
quiescent state before fork (paused state), and unpausing them when the
parent returns from fork.
On the child, everything will run fine as long as we don't issue fork()
from a call_rcu callback.
Side-note: pthread_atfork is not appropriate when using with multithread
and malloc/free. The glibc malloc implementation sadly expects that all
malloc/free are executed from the context of a single thread while
pthread atfork handlers are running, which leads to interesting hang in
glibc.
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> > test code:
> > ./tests/test_urcu_lfs 100 10 10
> >
> > bug produce rate > 60%
> >
> > {{{
> > I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or
> +"./tests/test_urcu_lfs 100 100 10"
> > But I just test it about 5 times
> > }}}
> >
> > 4cores*1threads: Intel(R) Core(TM) i5 CPU 760
> > RCU_MB (no time to test for other rcu type)
> > test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
> >
> > I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
> >
> > Sorry, I tried, but I failed to find out the root cause currently.
>
> I think I managed to narrow down the issue:
>
> 1) the master branch does not reproduce it, but commit
> 768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of the
> time.
>
> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
> current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
> moving to wfcqueue.
>
> 3) the bug always arise, for me, at the end of the 10 seconds.
> However, it might be simply due to the fact that most of the memory
> get freed at the end of program execution.
>
> 4) I've been able to get a backtrace, and it looks like we have some
> call_rcu callback-invokation threads still working while
> call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
> is nicely waiting for the next thread to stop, and during that time,
> two callback-invokation threads are invoking callbacks (and one of
> them triggers the segfault).
>
> So I expect that commit
>
> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date: Tue Sep 25 10:50:49 2012 -0500
>
> call_rcu: use wfcqueue, eliminate false-sharing
>
> Eliminate false-sharing between call_rcu (enqueuer) and worker threads
> on the queue head and tail.
>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Could have managed to fix the issue, or change the timing enough that it
> does not reproduces. I'll continue investigating.
The bug was in call rcu. It is not required for master, because we fixed
it while moving to wfcqueue. We were erroneously writing to the head
field of the default call_rcu_data rather than tail.
The conditions to reproduce this bug:
1) setup per-cpu callback-invokation threads,
2) use call_rcu
3) call call_rcu_data_free() while there are still some pending
callbacks that have not yet been executed by the callback-invokation
threads,
4) we then get corruption due to the "default" callback invokation
that walks through a corrupted queue.
urcu call_rcu: Use RCU read-side protection for per-cpu call_rcu data
A concurrent get_cpu_call_rcu_data(), called by get_call_rcu_data(),
could dereference this pointer without holding any mutex. So this
situation would happen if we have a concurrent call_rcu() executing
while we do the create_all_cpu_call_rcu_data().
I think we would need to put a rcu_dereference() around
per_cpu_call_rcu_data read within get_cpu_call_rcu_data() too.
per_cpu_call_rcu_data should be done with rcu_set_pointer.
Also, a rcu read-side critical section would be required around any
usage of per_cpu_call_rcu_data, and the action of tearing down the
per-cpu data would require to wait for a quiescent state. So we would
basically require that the call_rcu users need to be registered as
RCU reader threads.
Lai Jiangshan [Thu, 29 Sep 2011 17:55:11 +0000 (13:55 -0400)]
urcu,call_rcu: avoid create call_rcu_data for child when unneed
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
urcu_defer: Use cancellation flag instead of pthread_cancel()
- Provides better control over cancellation point location.
- Set the futex to 0 before exiting the defer thread.
This patch combines and enhances patches from Lai Jiangshan:
urcu,defer_rcu: fix missing respond to a cancellation request
urcu,defer_rcu: Avoid thread exit unexpected
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paolo Bonzini [Thu, 22 Sep 2011 09:12:44 +0000 (05:12 -0400)]
cmm: provide lightweight smp_rmb/smp_wmb on PPC
lwsync orders loads in cacheable memory with respect to other loads,
and stores in cacheable memory with respect to other stores. Use it
to implement smp_rmb/smp_wmb.
The heavy-weight sync is still used for the "full" rmb/wmb operations,
as well as for smp_mb.
[ Edit by Mathieu Desnoyers: rephrased the comments around the memory
barriers. ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
We provide sequential consistency semantic over all architectures for
cmpxchg and add_return family of primitives, but the powerpc
implementation does not match that.
Change the isync after the atomic primitives to sync, and explain the
scheme.
Lai Jiangshan [Thu, 15 Sep 2011 15:24:03 +0000 (11:24 -0400)]
avoid leaking crdp for failed path
[ Comment: now that set_cpu_call_rcu_data() is not racy and detects
overwrites, we can effectively trust its return value and free the
crdp if already set. ]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
urcu-pointer: implement URCU_FORCE_CAST for C++ compatibility of urcu-pointer.h
We need to be careful with those, so we do not break aliasing. Our
use-case is to cast back and forth between the same type and a void *
(or void **) type when we pass pointers to C functions. As we cast back
to the same type when the pointer is returned from the function,
aliasing should still work.
Paolo Bonzini [Tue, 13 Sep 2011 17:49:28 +0000 (13:49 -0400)]
urcu-qsbr: use rcu_thread_offline/rcu_thread_online instead of inlining them
* Mathieu Desnoyers wrote:
> Just to let you know that I pushed two updates into urcu: one fixes a
> grace period hang caused by a missing wakeup in the synchronize_rcu
> QSBR code. This appears to hit us due to the more fine-grained wakeup
> code brought by Paolo. The wakeup was really missing from the
> synchronize_rcu code (so Paolo's code just triggered an existing
> problem). I thought it would be good to let you know the effect: grace
> periods are delayed forever. This problem never appeared in a release
> (I caught it before).
Good catch. Why not use rcu_thread_offline/online in synchronize_rcu,
instead of touching rcu_reader.ctr directly? I had this in my QEMU
branch but hadn't posted yet because it was meant as a cleanup only.
synchronize_rcu go into offline mode during grace period. It duplicates
the rcu_thread_online/offline code, and therefore adding the required
wake_up_gp() is required there too.
Paolo Bonzini [Sat, 10 Sep 2011 19:29:56 +0000 (12:29 -0700)]
cmm: do not generate code for smp_rmb/smp_wmb on x86_64
We can assume, on x86, that no accesses to write-combining memory occur,
and also that there are no non-temporal load/stores (people would
presumably write those with assembly or intrinsics and put appropriate
lfence/sfence manually). In this case rmb and wmb are no-ops on x86.
according to the updated x86 memory models:
Paul E. McKenney. Memory Ordering in Modern Microprocessors
www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf
x86 does not reorder loads vs loads, and stores vs stores when using
normal memory accesses, with the notable exceptions of Pentium Pro
(reorders reads) and WinChip (reorders writes). Therefore, it is safe
not to emit fence instructions for x86_64 cmm_smp_rmb()/cmm_smp_wmb(),
but we leave the memory fences in place for x86_32 for those two
special-cases.
Define cmm_smp_rmb and cmm_smp_wmb to be the "common" operations that
do not require fence instruction, while leaving cmm_rmb and cmm_wmb in
place for more sophisticated uses.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paolo Bonzini [Sat, 10 Sep 2011 19:26:19 +0000 (12:26 -0700)]
cmm: let per-arch files provide cmm_smp_* barriers
x86 instructions lfence and sfence are rarely needed, thus we want
the cmm_smp_rmb/cmm_smp_wmb macros to be simple compiler barriers.
So, let the per-arch files override the default definitions in
arch/generic.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Minor change (not visible to end-users): this commit introduces a
liburcu-common.so to hold wfqueue, wfstack and futex fall-back code.
This .so is used by all URCU flavors, as well as liburcu-cds.so.
I thought about it some more, and had discussions with various people,
and there are a few reasons to go for a scheme where rcu read lock
should be taken by the caller, and to pass call_rcu as a parameter to
the data structure init function:
A) The advantage, as Paul E. McKenney pointed out, is that one single .so
is enough to support all RCU flavors. Very convenient for external data
structure containers.
B) It clearly documents where rcu read-side locks are needed, so the user
keep control and in-depth understanding of their read-side locks.
C) When multiple API functions that require RCU read-side lock to be
held (sometimes even the same lock) throughout a sequence of API
calls, we have no choice but to let the caller hold the read-side
lock.
D) Due to support of multiple nesting of rcu read-side lock, any
"improvement" we could get by releasing the read-side lock in
retry loops would vanish in the cases where we are called within
nested C.S..
E) If a library uses synchronize_rcu, this should be clearly documented,
and even frowned upon, because this involves important limitations on
the design of the caller, and important performance hit. There are
usually ways to reach the same result through use of call_rcu, which
should really be used thoroughout these libraries.
F) It clearly documents when a data structure needs to use call_rcu
internally.
G) Some very early benchmark results show that there is indeed not
much performance gain to achieve by inlining call_rcu, even if it is
a version with a cache for the "call_rcu structure" lookup
(per-cpu/per-thread/global). So passing it as a parameter to
the data structure init function should be fine, even in cases
where it is called very often.
H) For use-cases where applications would like to use more than one
RCU flavor concurrently (which is now supported), leaving management
of RCU read-side C.S. to the reader allows the application to take
more than one RCU read-side lock across API calls. It also lets the
application specify its own call_rcu function that could handle more
than one RCU flavor.
So for all these reasons, I reverting back to the API we have in our
last release (0.6.4).
Paolo Bonzini [Wed, 17 Aug 2011 09:42:51 +0000 (05:42 -0400)]
urcu-qsbr: avoid useless futex wakeups and burning CPU for long grace periods
I noticed that urcu makes exactly _one_ attempt at using futexes
to avoid busy looping on synchronize_rcu. The attached patch instead
switches from busy waiting to futexes after RCU_QS_ACTIVE_ATTEMPTS.
To limit the amount of system calls, reading threads remember whether
they already had a quiescent state in this grace period; if so they were
already removed from the list, and can avoid signaling the futex.
Performance measured with rcutorture (nreaders: 10, nupdaters: 1,
duration: 10, median of nine runs):
RCU_QS_ACTIVE_ATTEMPTS == 100, no patch n_updates = 292
RCU_QS_ACTIVE_ATTEMPTS == 1, no patch n_updates = 290
RCU_QS_ACTIVE_ATTEMPTS == 100, with patch n_updates = 408
RCU_QS_ACTIVE_ATTEMPTS == 1, with patch n_updates = 404
(the first two cases are obviously the same; the only change is
when the futex is used, but over many calls there is no difference).
This patch matches the update to the Promela model.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paolo Bonzini [Wed, 17 Aug 2011 09:31:21 +0000 (05:31 -0400)]
test api cleanup: remove unused primitives
[ Mathieu: the rationale for this is that we can always add back that
code if every needed. Removing leftover GPLv2 test code is an
incentive to create the appropriate library-wide LGPL/MIT-style
abstractions.]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paolo Bonzini [Tue, 9 Aug 2011 12:37:14 +0000 (08:37 -0400)]
wfqueue: fix type-incorrect assignment
The "old_tail = q->tail, q->tail = node" assignment in wfqueue
is not type safe; q->tail is a pointer to pointer to node and the
correct value to assign is &node->next. While the arithmetic is
the same, it is better to be tidy.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Duncan Sands [Tue, 16 Aug 2011 11:10:01 +0000 (07:10 -0400)]
Fix choice of default flavour in urcu/map/urcu.h
Hi, I noticed in the 0.64 release (and git too) that if a flavour is not
specified explicitly then RCU_MB is chosen in urcu/map/urcu.h, while the
docs say and the Makefile expects RCU_MEMBARRIER. Note that the header
file urcu/static/urcu.h has similar logic but uses RCU_MEMBARRIER for
the default.