From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Thu, 11 Oct 2012 15:56:42 +0000 (-0400)
Subject: Fix: call_rcu list corruption on teardown
X-Git-Tag: v0.7.5~1
X-Git-Url: https://git.lttng.org./?a=commitdiff_plain;h=19cf3ae166a744ecb105bdd2cd1a1faf12bfd590;p=urcu.git

Fix: call_rcu list corruption on teardown

* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> > test code:
> > ./tests/test_urcu_lfs 100 10 10
> >
> > bug produce rate > 60%
> >
> > {{{
> > I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or
> +"./tests/test_urcu_lfs 100 100 10"
> > But I just test it about 5 times
> > }}}
> >
> > 4cores*1threads: Intel(R) Core(TM) i5 CPU         760
> > RCU_MB (no time to test for other rcu type)
> > test commit: 768fba83676f49eb73fd1d8ad452016a84c5ec2a
> >
> > I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
> >
> > Sorry, I tried, but I failed to find out the root cause currently.
>
> I think I managed to narrow down the issue:
>
> 1) the master branch does not reproduce it, but commit
>    768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of the
>    time.
>
> 2) the main change between 768fba83676f49eb73fd1d8ad452016a84c5ec2a and
>    current master (f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
>    moving to wfcqueue.
>
> 3) the bug always arise, for me, at the end of the 10 seconds.
>    However, it might be simply due to the fact that most of the memory
>    get freed at the end of program execution.
>
> 4) I've been able to get a backtrace, and it looks like we have some
>    call_rcu callback-invokation threads still working while
>    call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
>    is nicely waiting for the next thread to stop, and during that time,
>    two callback-invokation threads are invoking callbacks (and one of
>    them triggers the segfault).
>
> So I expect that commit
>
> commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date:   Tue Sep 25 10:50:49 2012 -0500
>
>     call_rcu: use wfcqueue, eliminate false-sharing
>
>     Eliminate false-sharing between call_rcu (enqueuer) and worker threads
>     on the queue head and tail.
>
>     Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Could have managed to fix the issue, or change the timing enough that it
> does not reproduces. I'll continue investigating.

The bug was in call rcu. It is not required for master, because we fixed
it while moving to wfcqueue.  We were erroneously writing to the head
field of the default call_rcu_data rather than tail.

The conditions to reproduce this bug:

1) setup per-cpu callback-invokation threads,
2) use call_rcu
3) call call_rcu_data_free() while there are still some pending
   callbacks that have not yet been executed by the callback-invokation
   threads,
4) we then get corruption due to the "default" callback invokation
   that walks through a corrupted queue.

This bug is fixed by commit 5161f31e09ce33dd79afad8d08a2372fbf1c4fbe in
the 0.7.x series. Commit 0b8ab7df078a6d8e1439b1db5849638892e1cc83 in the
0.7.x series explains the real motivation for moving call_rcu from
wfqueue to wfcqueue.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---

diff --git a/urcu-call-rcu-impl.h b/urcu-call-rcu-impl.h
index 13b24ff..b205229 100644
--- a/urcu-call-rcu-impl.h
+++ b/urcu-call-rcu-impl.h
@@ -647,8 +647,9 @@ void call_rcu_data_free(struct call_rcu_data *crdp)
 		/* Create default call rcu data if need be */
 		(void) get_default_call_rcu_data();
 		cbs_endprev = (struct cds_wfq_node **)
-			uatomic_xchg(&default_call_rcu_data, cbs_tail);
-		*cbs_endprev = cbs;
+			uatomic_xchg(&default_call_rcu_data->cbs.tail,
+					cbs_tail);
+		_CMM_STORE_SHARED(*cbs_endprev, cbs);
 		uatomic_add(&default_call_rcu_data->qlen,
 			    uatomic_read(&crdp->qlen));
 		wake_call_rcu_thread(default_call_rcu_data);