Mathieu Desnoyers [Wed, 5 Dec 2012 13:53:08 +0000 (08:53 -0500)]
wfstack: document first/next return values
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 5 Dec 2012 11:13:08 +0000 (06:13 -0500)]
test wfstack: enforce external mutex if needed by default
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 5 Dec 2012 11:12:42 +0000 (06:12 -0500)]
test wfcqueue: enforce external mutex if needed by default
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 26 Nov 2012 03:02:18 +0000 (22:02 -0500)]
urcu-mb/signal/membarrier: batch concurrent synchronize_rcu()
Here are benchmarks on batching of synchronize_rcu(), and it leads to
very interesting scalability improvement and speedups, e.g., on a
24-core AMD, with a write-heavy scenario (4 readers threads, 20 updater
threads, each updater using synchronize_rcu()):
* Serialized grace periods:
./test_urcu 4 20 20
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 20 wdelay 0
nr_reads
714598368 nr_writes
5032889 nr_ops
719631257
* Batched grace periods:
./test_urcu 4 20 20
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 20 wdelay 0
nr_reads
611848168 nr_writes
9877965 nr_ops
621726133
For a
9877965/
5032889 = 1.96 speedup for 20 updaters.
Of course, we can see that readers have slowed down, probably due to
increased update traffic, given there is no change to the read-side code
whatsoever.
Now let's see the penality of managing the stack for single-updater.
With 4 readers, single updater:
* Serialized grace periods :
./test_urcu 4 1 20
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
241959144 nr_writes
11146189 nr_ops
253105333
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
257131080 nr_writes
12310537 nr_ops
269441617
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
259973359 nr_writes
12203025 nr_ops
272176384
* Batched grace periods :
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
298926555 nr_writes
14018748 nr_ops
312945303
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
272411290 nr_writes
12832166 nr_ops
285243456
SUMMARY ./test_urcu testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
267511858 nr_writes
12822026 nr_ops
280333884
Serialized vs batched seems to similar, batched possibly even slightly
faster, but this is probably caused by NUMA affinity.
More benchmark results:
* Serialized synchronize_rcu() -- test_urcu (mb)
./test_urcu 4 1 20
SUMMARY ./test_urcu testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
222512859 nr_writes
10723654 nr_ops
233236513
./test_urcu 4 20 20
SUMMARY ./test_urcu testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
722096653 nr_writes
5012429 nr_ops
727109082
./test_urcu 12 12 20
SUMMARY ./test_urcu testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
1822868768 nr_writes
2300787 nr_ops
1825169555
./test_urcu 16 8 20
SUMMARY ./test_urcu testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
2355908375 nr_writes
1604850 nr_ops
2357513225
./test_urcu 20 4 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
3003457459 nr_writes
1074828 nr_ops
3004532287
./test_urcu 20 3 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
2956972543 nr_writes
1036556 nr_ops
2958009099
./test_urcu 20 2 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
2890178860 nr_writes
1030095 nr_ops
2891208955
./test_urcu 20 1 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
3017482290 nr_writes 783420 nr_ops
3018265710
* Batched synchronize_rcu() -- test_urcu (mb)
./test_urcu 4 1 20
SUMMARY ./test_urcu testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
271476751 nr_writes
12858885 nr_ops
284335636
./test_urcu 4 20 20
SUMMARY ./test_urcu testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
608488583 nr_writes
10080610 nr_ops
618569193
./test_urcu 12 12 20
SUMMARY ./test_urcu testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
1260044362 nr_writes
7957711 nr_ops
1268002073
./test_urcu 16 8 20
SUMMARY ./test_urcu testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
2048890674 nr_writes
5440985 nr_ops
2054331659
./test_urcu 20 4 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
2819267217 nr_writes
3093008 nr_ops
2822360225
./test_urcu 20 3 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
3067795320 nr_writes
2817760 nr_ops
3070613080
./test_urcu 20 2 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
3116770603 nr_writes
2404242 nr_ops
3119174845
./test_urcu 20 1 20
SUMMARY ./test_urcu testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
2238534130 nr_writes
3737588 nr_ops
2242271718
* Serialized synchronize_rcu() -- test_urcu_signal
./test_urcu_signal 4 1 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
16063309841 nr_writes 9217 nr_ops
16063319058
./test_urcu_signal 4 20 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
16065183739 nr_writes 9182 nr_ops
16065192921
./test_urcu_signal 12 12 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
48028512672 nr_writes 8890 nr_ops
48028521562
./test_urcu_signal 16 8 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
64001589198 nr_writes 8756 nr_ops
64001597954
./test_urcu_signal 20 4 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
79907434070 nr_writes 9068 nr_ops
79907443138
./test_urcu_signal 20 3 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
79987250839 nr_writes 8589 nr_ops
79987259428
./test_urcu_signal 20 2 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
79749947176 nr_writes 8596 nr_ops
79749955772
./test_urcu_signal 20 1 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
79751023090 nr_writes 8624 nr_ops
79751031714
* Batched synchronize_rcu() -- test_urcu_signal
./test_urcu_signal 4 1 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
15739087241 nr_writes 9218 nr_ops
15739096459
./test_urcu_signal 4 20 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
15662135806 nr_writes 94833 nr_ops
15662230639
./test_urcu_signal 12 12 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
46634363289 nr_writes 56903 nr_ops
46634420192
./test_urcu_signal 16 8 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
62263951759 nr_writes 39058 nr_ops
62263990817
./test_urcu_signal 20 4 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
77799768623 nr_writes 21065 nr_ops
77799789688
./test_urcu_signal 20 3 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
76408008440 nr_writes 17026 nr_ops
76408025466
./test_urcu_signal 20 2 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
77868927424 nr_writes 12630 nr_ops
77868940054
./test_urcu_signal 20 1 20
SUMMARY ./test_urcu_signal testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
77293186844 nr_writes 8680 nr_ops
77293195524
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 19 Nov 2012 23:16:53 +0000 (18:16 -0500)]
urcu-wait: move queue management code into urcu-wait.h
Note: urcu-wait.h is not yet exposed outside of userspace RCU.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 18 Nov 2012 20:16:43 +0000 (15:16 -0500)]
urcu-wait: move wait code into separate file
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 17:40:12 +0000 (12:40 -0500)]
urcu-qsbr: batch concurrent synchronize_rcu()
Here are benchmarks on batching of synchronize_rcu(), and it leads to
very interesting scalability improvement and speedups, e.g., on a
24-core AMD, with a write-heavy scenario (4 readers threads, 20 updater
threads, each updater using synchronize_rcu()):
* Serialized grace periods :
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 20 wdelay 0
nr_reads
20251412728 nr_writes
1826331 nr_ops
20253239059
* Batched grace periods :
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 20 wdelay 0
nr_reads
15141994746 nr_writes
9382515 nr_ops
15151377261
For a
9382515/
1826331 = 5.13 speedup for 20 updaters.
Of course, we can see that readers have slowed down, probably due to
increased update traffic, given there is no change to the read-side code
whatsoever.
Now let's see the penality of managing the stack for single-updater.
With 4 readers, single updater:
* Serialized grace periods :
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
19240784755 nr_writes
2130839 nr_ops
19242915594
* Batched grace periods :
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4
rdur 0 wdur 0 nr_writers 1 wdelay 0
nr_reads
19160162768 nr_writes
2253068 nr_ops
1916241583
2253068 vs
2137036 -> a couple of runs show that this difference lost in
the noise for single updater.
More benchmark results:
* Serialized synchronize_rcu() -- test_urcu_qsbr
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
18841016559 nr_writes
1857130 nr_ops
18842873689
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
20272811733 nr_writes
1837027 nr_ops
20274648760
./test_urcu_qsbr 12 12 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
60343516643 nr_writes
2353685 nr_ops
60345870328
./test_urcu_qsbr 16 8 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
78202711840 nr_writes
2326331 nr_ops
78205038171
./test_urcu_qsbr 20 4 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
94553396003 nr_writes
2238396 nr_ops
94555634399
./test_urcu_qsbr 20 3 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
95004708661 nr_writes
2165966 nr_ops
95006874627
./test_urcu_qsbr 20 2 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
95386506198 nr_writes
2194352 nr_ops
95388700550
./test_urcu_qsbr 20 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
84705972017 nr_writes
2609595 nr_ops
84708581612
* Batched synchronize_rcu() -- test_urcu_qsbr
./test_urcu_qsbr 4 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
19154850714 nr_writes
2238834 nr_ops
19157089548
./test_urcu_qsbr 4 20 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 4 rdur 0 wdur 0 nr_writers 20 wdelay 0 nr_reads
15114131760 nr_writes
9370255 nr_ops
15123502015
./test_urcu_qsbr 12 12 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 12 rdur 0 wdur 0 nr_writers 12 wdelay 0 nr_reads
45541854970 nr_writes
5786496 nr_ops
45547641466
./test_urcu_qsbr 16 8 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 16 rdur 0 wdur 0 nr_writers 8 wdelay 0 nr_reads
66217337547 nr_writes
4257427 nr_ops
66221594974
./test_urcu_qsbr 20 4 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 4 wdelay 0 nr_reads
95048642908 nr_writes
2416266 nr_ops
95051059174
./test_urcu_qsbr 20 3 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 3 wdelay 0 nr_reads
96679609928 nr_writes
2211168 nr_ops
96681821096
./test_urcu_qsbr 20 2 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 2 wdelay 0 nr_reads
92166219811 nr_writes
1968725 nr_ops
92168188536
./test_urcu_qsbr 20 1 20
SUMMARY ./test_urcu_qsbr testdur 20 nr_readers 20 rdur 0 wdur 0 nr_writers 1 wdelay 0 nr_reads
87986181951 nr_writes
3278737 nr_ops
87989460688
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 14:07:34 +0000 (09:07 -0500)]
tests: use standard malloc/free for synchronize_rcu()
Allows removing mutex from tests, which allow testing scalability of
concurrent synchronize_rcu() executions.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 03:33:34 +0000 (22:33 -0500)]
urcu-bp: move quiescent threads to separate list
Accelerate 2-phase grace period by not having to iterate twice on
threads not within RCU read-side critical section.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 03:32:28 +0000 (22:32 -0500)]
urcu-mb/signal/membarrier: move quiescent threads to separate list
Accelerate 2-phase grace period by not having to iterate twice on
threads not nested within a RCU read-side lock.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 03:31:28 +0000 (22:31 -0500)]
urcu-qsbr: move offline threads to separate list
Accelerate 2-phase grace period by not having to iterate on offline
threads twice.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 02:44:59 +0000 (21:44 -0500)]
urcu-bp: improve 2-phase wait scheme
In the single-bit, 2-phase grace period scheme, all we need to do is to
observe each reader going through a quiescent state while we are in the
grace period.
We therefore only need to perform one global counter update, surrounded
by 2 iterations on readers to observe change in their snapshot.
We can therefore remove the first counter update (prior to the first
iteration on readers): it was useless and was only slowing down the
grace period.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 02:44:20 +0000 (21:44 -0500)]
urcu-mb/signal/membarrier: improve 2-phase wait scheme
In the single-bit, 2-phase grace period scheme, all we need to do is to
observe each reader going through a quiescent state while we are in the
grace period.
We therefore only need to perform one global counter update, surrounded
by 2 iterations on readers to observe change in their snapshot.
We can therefore remove the first counter update (prior to the first
iteration on readers): it was useless and was only slowing down the
grace period.
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 12 Nov 2012 02:40:23 +0000 (21:40 -0500)]
urcu-qsbr: improve 2-phase wait scheme
In the single-bit, 2-phase grace period scheme, all we need to do is to
observe each reader going through a quiescent state while we are in the
grace period.
We therefore only need to perform one global counter update, surrounded
by 2 iterations on readers to observe change in their snapshot.
We can therefore remove the first counter update (prior to the first
iteration on readers): it was useless and was only slowing down the
grace period.
Suggested-by: Alan Stern <stern@rowland.harvard.edu>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 20 Nov 2012 02:45:04 +0000 (21:45 -0500)]
wfcqueue: implement mutex-free splice
A carefully crafted splice operation does not need to use an external
mutex to synchronize against other splice operations.
The trick is atomically exchange the head next pointer with
NULL. If the pointer we replaced was NULL, it means the queue was
possibly empty. If head next was not NULL, by setting head to NULL, we
ensure that concurrent splice operations are going to see an empty
queue, even if concurrent enqueue operations move tail further. This
means that as long as we are within splice, after setting head to NULL,
but before moving tail back to head, concurrent splice operations will
always see an empty queue, therefore acting as mutual exclusion.
If exchange returns a NULL head, we confirm that it was indeed empty by
checking if the tail pointer points to the head node, busy-waiting if
necessary.
Then the last step is to move the tail pointer to head. At that point,
enqueuers are going to start enqueuing at head again, and other splice
operations will be able to proceed.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 20 Nov 2012 03:36:05 +0000 (22:36 -0500)]
wfcqueue: document empty criterion
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 20 Nov 2012 10:28:42 +0000 (05:28 -0500)]
urcu-call-rcu: use wait-free splice return value
We can now use the splice return value to know if the source queue was
empty rather than testing for destination queue emptiness after the
splice operation.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 18 Nov 2012 15:35:35 +0000 (10:35 -0500)]
test wfcqueue: add tests for queue state return value
with e.g. ./test_urcu_wfcq 2 2 10 -w
we can confirm that we see as many "enqueue to empty queue" as we see
"splice from non-empty queue", which confirms that the queue state
returned by enqueue is indeed sampled atomically with enqueue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Mathieu Desnoyers [Sun, 18 Nov 2012 15:31:35 +0000 (10:31 -0500)]
wfcqueue: enqueue and splice return queue state
enqueue can return whether the queue was empty or not prior to enqueue.
splice can return this information about destination queue too, but
there are more cases to handle, because we don't touch the destination
queue if the source queue was empty, and in the nonblocking case, we
return that we would need to block on the source queue.
The destination queue state is sampled atomically with enqueue/splice to
destination operations.
Knowing this state is useful when "ownership" on a batch of queue items
can be assigned to those enqueuing the first items, e.g. to implement
wait/wakeup schemes.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Mathieu Desnoyers [Tue, 20 Nov 2012 04:22:50 +0000 (23:22 -0500)]
Fix: wfcqueue nonblocking dequeue
Failures were not handled in the nonblocking dequeue implementation.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paul E. McKenney [Fri, 16 Nov 2012 03:07:03 +0000 (22:07 -0500)]
wfcqueue: Fix lock and unlock functions
The current implementation of cds_wfcq_dequeue_lock() and
cds_wfcq_dequeue_unlock() entails mutually assured recursion.
Redirect to _cds_wfcq_dequeue_lock() and _cds_wfcq_dequeue_unlock(),
respectively.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Simon Marchi [Thu, 15 Nov 2012 14:29:42 +0000 (09:29 -0500)]
runtests: Make path of time binary configurable
I work on a platform that does not come with a time program. This patch
makes it possible to specify the path of the time binary or not use it
if none is available.
If the URCU_TEST_TIME_BIN environment variable exists and is executable,
it is used. Otherwise it tries with /usr/bin/time, the most common
location. If it is not there, the tests are ran without timing info.
[ Edit by Mathieu Desnoyers: use "." instead of "source" (no bash-ism),
edit commit about check for emptiness vs definition to match the code. ]
Signed-off-by: Simon Marchi <simon.marchi@polymtl.ca>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 11 Nov 2012 16:20:07 +0000 (11:20 -0500)]
urcu-qsbr: skip Q.S. reporting if already reported
We can skip both memory barriers and store reporting quiescent state if
we notice we already reported Q.S. for the current value of
"rcu_gp_ctr".
It covers the two implementations of QSBR:
* 64-bit architecture: we assume the counter never overflows, and
therefore only perform one increment followed by waiting for readers.
In this scenario, we don't care if the rcu_gp_ctr load is moved into
the prior read-side critical section, as long as the
URCU_TLS(rcu_reader).ctr store is ordered.
* 32-bit architecture: given the 32-bit counter could overflow,
we rely on a 2-phase approach, using a single bit: we flip
the rcu_gp_ctr bit, then wait to observe that all readers have
taken a copy of the new rcu_gp_ctr. We flip it again, and wait until
we observe that all readers have copied its new value. We are then
certain that each reader necessarily passed through a quiescent state
during the grace period (and that Q.S. was not located prior to our
grace period). This scheme works even if the rcu_gp_ctr load is moved
into the prior read-side critical section, as long as store to
URCU_TLS(rcu_reader).ctr is ordered with respect to other memory
accesses within that thread.
Suggested-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Mathieu Desnoyers [Fri, 9 Nov 2012 02:45:04 +0000 (21:45 -0500)]
Fix TLS detection: test with linker, add --disable-compiler-tls
NetBSD 5.1 and older, as well as Darwin, succeed to compile code
containing TLS, but cannot link it. Test with linker in addition to
compiler for TLS support.
Also add a --disable-compiler-tls configure option to allow users to
force using the pthread getspecific fall back.
Fixes #288
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 8 Nov 2012 20:28:59 +0000 (15:28 -0500)]
Cleanup: cast pthread_self() return value to unsigned long
pthread_t can map to other things that unsigned long (e.g. pointer).
Cast it to unsigned long for debug printing and for debug delay random
value purposes.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Christian Babeux [Thu, 8 Nov 2012 19:30:08 +0000 (14:30 -0500)]
Fallback mechanism not working on platform where TLS is unsupported
The CONFIG_RCU_TLS entry in config.h.in is defined by default to "TLS".
This has the unfortunate consequence of defining CONFIG_RCU_TLS on
platform where TLS is unsupported and effectively disabling the pthread
based fallback mechanism. This macro should be #undef by default and the
AX_TLS m4 macro will properly detect if TLS is supported.
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 7 Nov 2012 20:22:57 +0000 (15:22 -0500)]
Revert "Fix: cross-build: configure.ac should use --target, not --host"
This reverts commit
1eade46a854eb8211be9fd32e0cf6835576deb63.
No. --target is for building cross-compilers. --host was appropriate.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 7 Nov 2012 20:09:28 +0000 (15:09 -0500)]
Fix: cross-build: configure.ac should use --target, not --host
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 4 Nov 2012 18:04:40 +0000 (13:04 -0500)]
test_urcu_wfcq: add splice and nosync tests
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 4 Nov 2012 18:03:59 +0000 (13:03 -0500)]
test_urcu_wfs: cleanup
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 4 Nov 2012 18:03:32 +0000 (13:03 -0500)]
test_urcu_lfs: cleanup
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 22:34:40 +0000 (18:34 -0400)]
Fix static linking: add missing static for _defer_rcu
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 21:56:04 +0000 (17:56 -0400)]
tests: report error value for make check
exit 1 as soon as a test fails.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 21:50:24 +0000 (17:50 -0400)]
Add multiflavor test program
Add a multiflavor test program to catch symbol name clashes earlier next
time.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 21:49:39 +0000 (17:49 -0400)]
Fix static linking: fix symbol name namespaces
gp_futex, yield_active, rand_yield, has_sys_membarrier, rcu_defer_exit,
call_rcu_data_free, call_rcu_before_fork, call_rcu_after_fork_parent,
call_rcu_after_fork_child are exported by each urcu flavor.
In order to fix use-cases where multiple flavors are statically linked
into the same application, we need to move these symbols to local
namespaces.
Ensure that all symbols are prefixed by "rcu_".
Also add each of those symbols into urcu/map/*.h headers, so they get
mapped to their flavor-specific symbol name by the preprocessor.
This requires bumping our .so version from 1.0.0 to 2.0.0, because it
changes some symbol names.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 20:37:04 +0000 (16:37 -0400)]
Fix static linking: add missing static to thr_defer
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2012 20:33:01 +0000 (16:33 -0400)]
Fix static linking: add missing static
update_counter_and_wait and call_rcu_data_list are only used locally.
Add the static keyword to ensure their symbol are not exported. This
helps fixing static linking of many URCU flavors into the same program.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 15:40:37 +0000 (11:40 -0400)]
deprecation: fix build with gcc < 4.5
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 15:22:56 +0000 (11:22 -0400)]
wfstack.c: update copyright notice
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 15:02:27 +0000 (11:02 -0400)]
Update wfstack copyright notice
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 15:00:30 +0000 (11:00 -0400)]
Comment fix: update associated LGPL header name
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 12:53:56 +0000 (08:53 -0400)]
Update cds-api.txt following API deprecations
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 12:43:33 +0000 (08:43 -0400)]
Deprecate wfqueue
Replaced by "wfcqueue", which has a semantic that allows placing head
and tail on different cache lines, and does not allocate memory
internally. wfqueue users can easily migrate to wfcqueue.
We choose to deprecate wfqueue rather than reimplementing it on top of
wfcqueue to ensure we keep strong ABI compatibility for existing wfqueue
users.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 23 Oct 2012 12:36:42 +0000 (08:36 -0400)]
Deprecate rculfstack
Replaced by "lfstack", which has a less restrictive semantic, and covers
rculfstack completely.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 22 Oct 2012 12:55:22 +0000 (08:55 -0400)]
wfcqueue: introduce nonblocking API
Introduce nonblocking API in wfcqueue, allowing RT threads to try to
dequeue, splice, or iterate on spliced queues without blocking: the
caller needs to handle CDS_WFCQ_WOULDBLOCK return value (or nonzero
return value for splice).
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Paul McKenney <paulmck@linux.vnet.ibm.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Mathieu Desnoyers [Fri, 12 Oct 2012 13:51:41 +0000 (09:51 -0400)]
lfstack: test pop_all and pop
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 12 Oct 2012 13:30:15 +0000 (09:30 -0400)]
lfstack: implement empty, pop_all and iterators, document API
We are changing the ABI by adding a mutex into struct cds_lfs_stack.
This ABI has never been exposed in a release so far, so we can change
it.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 20:44:40 +0000 (16:44 -0400)]
lfstack: implement test
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 19:08:57 +0000 (15:08 -0400)]
lfstack: implement lock-free stack
This stack does not require to hold RCU read-side lock across push, and
allows multiple strategies to be used for pop.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sat, 13 Oct 2012 02:11:49 +0000 (22:11 -0400)]
wfstack: implement pop_all and iteration tests
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sat, 13 Oct 2012 01:47:05 +0000 (21:47 -0400)]
wfstack: implement cds_wfs_pop_all and iterators, document API
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 22 Oct 2012 22:17:24 +0000 (18:17 -0400)]
rculfhash test: fix trivial memleak and return node leak and errors
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 22 Oct 2012 21:37:38 +0000 (17:37 -0400)]
rculfhash: add missing extern
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 22 Oct 2012 21:34:31 +0000 (17:34 -0400)]
Cleanup: fix cppcheck errors
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 14 Oct 2012 15:59:31 +0000 (11:59 -0400)]
wfcqueue: remove ancient comment
This comment is a leftover from wfqueue and is now inappropriate in the
context of wfcqueue: the dequeue operation busy-waits if it sees a NULL
next pointer from a node that is not the tail node.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Sat, 13 Oct 2012 16:48:54 +0000 (12:48 -0400)]
test_urcu_lfq: remove rcu_defer_register_thread() from test_urcu_lfq
test_urcu_lfq has already switch to call_rcu(),
rcu_defer_register_thread() is unneeded.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Sat, 13 Oct 2012 16:46:45 +0000 (12:46 -0400)]
test_urcu_lfq: test for the proper pointer
We should use "if (qnode)" instead of "if (node)" in case of
the struct cds_lfq_node_rcu is not the first field of struct node.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Sat, 13 Oct 2012 16:45:33 +0000 (12:45 -0400)]
test_urcu_lfs: remove rcu_defer_register_thread() from test_urcu_lfs
test_urcu_lfs has already switch to call_rcu(),
rcu_defer_register_thread() is unneeded.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Sat, 13 Oct 2012 16:41:17 +0000 (12:41 -0400)]
test_urcu_lfs: test for the proper pointer
We should use "if (snode)" instead of "if (node)" in case of
the struct cds_lfs_node_rcu is not the first field of struct node.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 12 Oct 2012 14:33:20 +0000 (10:33 -0400)]
wfcqueue: clarify locking usage
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 12 Oct 2012 11:47:11 +0000 (07:47 -0400)]
Document APIs in README
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 12 Oct 2012 11:29:34 +0000 (07:29 -0400)]
Test cleanup: replace "l" parameter by "loops"
Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 20:41:16 +0000 (16:41 -0400)]
Add wfcqueue header to cds.h
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 16:44:10 +0000 (12:44 -0400)]
Fix: urcu-bp, urcu, urcu-qsbr should include wfcqueue
Those are still including wfqueue.h, but need to move to wfcqueue.h,
since this is now needed by call_rcu. It was still working, because call
rcu headers include wfcqueue.h, but they were doing so _after_ #undef
_LGPL_SOURCE was issued, which made wfcqueue.h depend on
liburcu-common.so to find the wfcqueue symbols. This was in turn adding
a transitive dependency that was not present before, and thus causing
build failure in cross-build environments, especially those on Debian
systems, due to special handling of transitive dependencies on Debian
autotools.
Reported-by: Simon Marchi <simon.marchi@polymtl.ca>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 16:28:23 +0000 (12:28 -0400)]
Fix: call_rcu list corruption on teardown (documentation)
This commit is a place-holder to document that commit
5161f31e09ce33dd79afad8d08a2372fbf1c4fbe fixed a list corruption bug in
call_rcu.
Introducing __cds_wfcq_splice_blocking() fixed a list corruption bug in
the 0.7.x series. The equivalent fix appeared in 0.6.8 for the
stable-0.6 branch.
Description of the bug:
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> * Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> > test code:
> > ./tests/test_urcu_lfs 100 10 10
> >
> > bug produce rate > 60%
> >
> > {{{
> > I didn't see any bug when "./tests/test_urcu_lfs 10 10 10" Or
> +"./tests/test_urcu_lfs 100 100 10"
> > But I just test it about 5 times
> > }}}
> >
> > 4cores*1threads: Intel(R) Core(TM) i5 CPU 760
> > RCU_MB (no time to test for other rcu type)
> > test commit:
768fba83676f49eb73fd1d8ad452016a84c5ec2a
> >
> > I didn't see any bug when "./tests/test_urcu_mb 10 100 10"
> >
> > Sorry, I tried, but I failed to find out the root cause currently.
>
> I think I managed to narrow down the issue:
>
> 1) the master branch does not reproduce it, but commit
>
768fba83676f49eb73fd1d8ad452016a84c5ec2a repdroduces it about 50% of the
> time.
>
> 2) the main change between
768fba83676f49eb73fd1d8ad452016a84c5ec2a and
> current master (
f94061a3df4c9eab9ac869a19e4228de54771fcb) is call_rcu
> moving to wfcqueue.
>
> 3) the bug always arise, for me, at the end of the 10 seconds.
> However, it might be simply due to the fact that most of the memory
> get freed at the end of program execution.
>
> 4) I've been able to get a backtrace, and it looks like we have some
> call_rcu callback-invocation threads still working while
> call_rcu_data_free() is invoked. In the backtrace, call_rcu_data_free()
> is nicely waiting for the next thread to stop, and during that time,
> two callback-invocation threads are invoking callbacks (and one of
> them triggers the segfault).
>
> So I expect that commit
>
> commit
5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
> Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Date: Tue Sep 25 10:50:49 2012 -0500
>
> call_rcu: use wfcqueue, eliminate false-sharing
>
> Eliminate false-sharing between call_rcu (enqueuer) and worker threads
> on the queue head and tail.
>
> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>
> Could have managed to fix the issue, or change the timing enough that it
> does not reproduces. I'll continue investigating.
The bug was in call rcu. It is not required for master, because we fixed
it while moving to wfcqueue. We were erroneously writing to the head
field of the default call_rcu_data rather than tail.
The conditions to reproduce this bug:
1) setup per-cpu callback-invocation threads,
2) use call_rcu
3) call call_rcu_data_free() while there are still some pending
callbacks that have not yet been executed by the callback-invocation
threads,
4) we then get corruption due to the "default" callback invocation
that walks through a corrupted queue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 15:41:48 +0000 (11:41 -0400)]
call_rcu: remove head field alignement, explain wfcqueue motivation
The following commit:
commit
5161f31e09ce33dd79afad8d08a2372fbf1c4fbe
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: Tue Sep 25 10:50:49 2012 -0500
call_rcu: use wfcqueue, eliminate false-sharing
Eliminate false-sharing between call_rcu (enqueuer) and worker threads
on the queue head and tail.
introduced a change in call_rcu: it moved from "wfqueue" to "wfcqueue".
Its changelog states that the goal is to eliminate false-sharing, but
the changelog rationale is wrong.
The actual primary goal is to use the "splice" operation (which is
similar to the "dequeue_all" operation proposed by Lai Jiangshan),
instead of open-coding this operation directly within the call_rcu
implementation. The objective stated by Lai was to make testing of this
code-path easier, and he was right: we ended up noticing a bug in the
original call_rcu implementation (in this open-coded splice operation)
that was really hard to trigger, which was fixed by the move to
wfcqueue.
About false-sharing: In the case of call_rcu callback invokation threads
vs call_rcu callers, we do not care about false-sharing because call_rcu
callback-invocation threads use batching ("splice") to get an entire
list of callbacks, which effectively empties the queue, and requires to
touch the tail anyway. Ensuring that head and tail are placed on
different cache lines would matter only if we would be using "dequeue"
in the callback-invocation thread, which is not the case: we grab the
whole queue, and then iterate from our local head to our local tail.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 11 Oct 2012 15:27:37 +0000 (11:27 -0400)]
wfcqueue: update credits in patch documentation
Give credits to those responsible for the design and implementation of
commit
8ad4ce587f001ae026d5560ac509c2e48986130b, "wfcqueue: implement
concurrency-efficient queue", which happened through rounds of email and
patch exchanges.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 8 Oct 2012 16:11:30 +0000 (12:11 -0400)]
wfcqueue documentation: hint at for_each iterators
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 8 Oct 2012 14:44:38 +0000 (10:44 -0400)]
Fix urcu-call-rcu-impl.h: false-sharing
> > struct call_rcu_data {
> > - struct cds_wfq_queue cbs;
> > + /*
> > + * Align the tail on cache line size to eliminate false-sharing
> > + * with head.
> > + */
> > + struct cds_wfcq_tail __attribute__((aligned(CAA_CACHE_LINE_SIZE))) cbs_tail;
> > + /* Alignment on cache line size will add padding here */
> > +
> > + struct cds_wfcq_head cbs_head;
>
>
> wrong here. In this code, cbs_tail and cbs_head are in the same cache line.
Reported-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 25 Sep 2012 15:50:49 +0000 (10:50 -0500)]
call_rcu: use wfcqueue, eliminate false-sharing
Eliminate false-sharing between call_rcu (enqueuer) and worker threads
on the queue head and tail.
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 23 Sep 2012 23:16:08 +0000 (19:16 -0400)]
wfcqueue test
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sun, 23 Sep 2012 23:14:59 +0000 (19:14 -0400)]
wfcqueue: implement concurrency-efficient queue
This new API simplify the wfqueue implementation, and brings a 2.3x to
2.6x performance boost due to the ability to eliminate false-sharing
between enqueue and dequeue.
This work is derived from the patch from Lai Jiangshan submitted as
"urcu: new wfqueue implementation"
(http://lists.lttng.org/pipermail/lttng-dev/2012-August/018379.html)
Its changelog:
> Some guys would be surprised by this fact:
> There are already TWO implementations of wfqueue in urcu.
>
> The first one is in urcu/static/wfqueue.h:
> 1) enqueue: exchange the tail and then update previous->next
> 2) dequeue: wait for first node's next pointer and them shift, a dummy node
> is introduced to avoid the queue->tail become NULL when shift.
>
> The second one shares some code with the first one, and the left code
> are spreading in urcu-call-rcu-impl.h:
> 1) enqueue: share with the first one
> 2) no dequeue operation: and no shift, so it don't need dummy node,
> Although the dummy node is queued when initialization, but it is removed
> after the first dequeue_all operation in call_rcu_thread().
> call_rcu_data_free() forgets to handle the dummy node if it is not removed.
> 3)dequeue_all: record the old head and tail, and queue->head become the special
> tail node.(atomic record the tail and change the tail).
>
> The second implementation's code are spreading, bad for review, and it is not
> tested by tests/test_urcu_wfq.
>
> So we need a better implementation avoid the dummy node dancing and can service
> both generic wfqueue APIs and dequeue_all API for call rcu.
>
> The new implementation:
> 1) enqueue: share with the first one/original implementation.
> 2) dequeue: shift when node count >= 2, cmpxchg when node count = 1.
> no dummy node, save memory.
> 3) dequeue_all: simply set queue->head.next to NULL, xchg the tail
> and return the old head.next.
>
> More implementation details are in the code.
> tests/test_urcu_wfq will be update in future for testing new APIs.
The patch proposed by Lai brings a very interesting simplification to
the single-node handling (which is kept here), and moves all queue
handling code away from call_rcu implementation, back into the wfqueue
code. This has the benefit to allow testing enhancements.
I modified it so the API does not expose implementation details to the
user (e.g. ___cds_wfq_node_sync_next). I added a "splice" operation and
a for loop iterator which should allow wfqueue users to use the list
very efficiently both from LGPL/GPL code and from non-LGPL-compatible
code.
I also changed the API so the queue head and tail are now two separate
structures: it allows the queue user to place these as they like, either
on different cache lines (to eliminate false-sharing), or close one to
another (on same cache-line) in case a queue is spliced onto the stack
and not concurrently accessed.
Benchmarks performed on Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
(dual-core, with hyperthreading)
Benchmark invoked:
for a in $(seq 1 10); do ./test_urcu_wfq 1 1 10 -a 0 -a 2; done
(using cpu number 0 and 2, which should correspond to two cores of my
Intel 2-core/hyperthread processor)
Before patch:
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
97274297 nr_dequeues
80745742 successful enqueues
97274297 successful dequeues
80745321 end_dequeues
16528976 nr_ops
178020039
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
92300568 nr_dequeues
75019529 successful enqueues
92300568 successful dequeues
74973237 end_dequeues
17327331 nr_ops
167320097
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
93516443 nr_dequeues
75846726 successful enqueues
93516443 successful dequeues
75826578 end_dequeues
17689865 nr_ops
169363169
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94160362 nr_dequeues
77967638 successful enqueues
94160362 successful dequeues
77967638 end_dequeues
16192724 nr_ops
172128000
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
97491956 nr_dequeues
81001191 successful enqueues
97491956 successful dequeues
81000247 end_dequeues
16491709 nr_ops
178493147
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94101298 nr_dequeues
75650510 successful enqueues
94101298 successful dequeues
75649318 end_dequeues
18451980 nr_ops
169751808
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94742803 nr_dequeues
75402105 successful enqueues
94742803 successful dequeues
75341859 end_dequeues
19400944 nr_ops
170144908
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
92198835 nr_dequeues
75037877 successful enqueues
92198835 successful dequeues
75027605 end_dequeues
17171230 nr_ops
167236712
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
94159560 nr_dequeues
77895972 successful enqueues
94159560 successful dequeues
77858442 end_dequeues
16301118 nr_ops
172055532
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
96059399 nr_dequeues
80115442 successful enqueues
96059399 successful dequeues
80066843 end_dequeues
15992556 nr_ops
176174841
After patch:
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221229322 nr_dequeues
210645491 successful enqueues
221229322 successful dequeues
210645088 end_dequeues
10584234 nr_ops
431874813
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
219803943 nr_dequeues
210377337 successful enqueues
219803943 successful dequeues
210368680 end_dequeues
9435263 nr_ops
430181280
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237006358 nr_dequeues
237035340 successful enqueues
237006358 successful dequeues
236997050 end_dequeues 9308 nr_ops
474041698
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
235822443 nr_dequeues
235815942 successful enqueues
235822443 successful dequeues
235814020 end_dequeues 8423 nr_ops
471638385
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
235825567 nr_dequeues
235811803 successful enqueues
235825567 successful dequeues
235810526 end_dequeues 15041 nr_ops
471637370
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221974953 nr_dequeues
210938190 successful enqueues
221974953 successful dequeues
210938190 end_dequeues
11036763 nr_ops
432913143
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237994492 nr_dequeues
237938119 successful enqueues
237994492 successful dequeues
237930648 end_dequeues 63844 nr_ops
475932611
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
220634365 nr_dequeues
210491382 successful enqueues
220634365 successful dequeues
210490995 end_dequeues
10143370 nr_ops
431125747
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
237388065 nr_dequeues
237401251 successful enqueues
237388065 successful dequeues
237380295 end_dequeues 7770 nr_ops
474789316
testdur 10 nr_enqueuers 1 wdelay 0 nr_dequeuers 1 rdur 0 nr_enqueues
221201436 nr_dequeues
210831162 successful enqueues
221201436 successful dequeues
210831162 end_dequeues
10370274 nr_ops
432032598
Summary: Both enqueue and dequeue speed increase: around 2.3x speedup
for enqueue, and around 2.6x for dequeue.
We can verify that:
successful enqueues - successful dequeues = end_dequeues
For all runs (ensures correctness: no lost node).
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Paul E. McKenney [Fri, 7 Sep 2012 01:15:53 +0000 (21:15 -0400)]
Ensure that read-side functions meet 10-line LGPL criterion
This commit ensures that all read-side functions meet the 10-line LGPL
criterion that permits them to be expanded directly into non-LGPL code,
without function-call instructions. It also documents this as the
intent.
[ paulmck: Spelling fixes called out by Josh Triplett and name
change called out by Mathieu Desnoyers (_rcu_read_lock_help() ->
_rcu_read_lock_update(). ]
[ Mathieu Desnoyers: _rcu_read_unlock_help renamed to
_rcu_read_unlock_update_and_wakeup, and spelling fix for
preced -> precede. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 6 Sep 2012 23:09:28 +0000 (19:09 -0400)]
tls-compat.h: document sigaltstack(2) limitation
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Thu, 6 Sep 2012 23:07:19 +0000 (19:07 -0400)]
urcu: add notice to URCU_TLS() for it is not strictly async-signal-safe
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 6 Sep 2012 13:58:36 +0000 (09:58 -0400)]
Document sigaltstack(2) limitation
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 3 Sep 2012 17:51:43 +0000 (13:51 -0400)]
Documentation: update LICENSE file
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 27 Aug 2012 12:09:26 +0000 (08:09 -0400)]
Update version to 0.7.4
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 21 Aug 2012 22:45:37 +0000 (18:45 -0400)]
rculfhash API documentation: document destroy RCU read-lock constraint
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 21 Aug 2012 15:01:50 +0000 (11:01 -0400)]
Fix: rculfhash should be offline while waiting for resize to complete
Causes hang on destroy with urcu QSBR if destroy is called within a rcu
registered thread.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 15 Aug 2012 15:36:05 +0000 (11:36 -0400)]
Add missing entry to gitignore
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Thu, 9 Aug 2012 14:24:38 +0000 (10:24 -0400)]
urcu: move busy-wait code and name it ___cds_wfq_node_sync_next()
This code which waits for a node's next pointer until it appears, will
be used many times, move it to a help function and name it
___cds_wfq_node_sync_next().
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Thu, 9 Aug 2012 14:19:14 +0000 (10:19 -0400)]
urcu: fix compat_futex_noasync()
This patch fix two critical problems in the compatibility fallback of
compact_futex_noasync():
1) compat_futex_cond is not bound to any @uaddr, it services all @uaddr,
if you wakeup only one thread(pthread_cond_signal), the @uaddr of
this waking thread and the @uaddr of the woken-up thread may be different.
The woken-up thread will very probably go to sleep again
because his own condition is not true.
*And* this waking thread(FUTEX_WAKE) wake up NOTHING.
2) If the caller want to wake up all waiting threads, he will use INT_MAX
for @val, and:
for (i = 0; i < INT_MAX; i++)
pthread_cond_signal(&compat_futex_cond);
becomes almost infinity loop.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lai Jiangshan [Thu, 9 Aug 2012 14:10:08 +0000 (10:10 -0400)]
urcu: add hint to DEFINE_URCU_TLS() for compound types
Just a hint.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 30 Jul 2012 03:45:40 +0000 (23:45 -0400)]
Fix: CAA_BUILD_BUG_ON should refer to CAA_BUILD_BUG_ON_ZERO
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Ralf Baechle [Tue, 10 Jul 2012 15:03:08 +0000 (11:03 -0400)]
Add MIPS support
[ Edit by Mathieu Desnoyers: add explanations about supported
MIPS architectures, extracted from conversation with Ralf Baechle:
* Supported architectures
Ralf Baechle (edited by Mathieu Desnoyers):
This code works on all MIPS architecture variants. The memory barrier
instruction, SYNC, was introduced for MIPS II. The original MIPS I
instruction set doesn't have SYNC nor SMP support in the processor
architecture itself so SMP required horrible kludges in the system
hardware. I think it's safe to say that Linux/MIPS will never support
any of these MIPS I SMP systems. In the unlikely case this happens
anyway, we have a (Linux) kernel emulation of the SYNC instruction.
Voila - full binary compatibility across all MIPS processors and the
oldest hardware pays the performance penalty.
* Choice of barrier for cmm_mb()/cmm_rmb()/cmm_wmb()
Ralf Baechle:
"RMI (aka Netlogic and now Broadcom) XLR processor cores can be
configured to permit LD-LD, LD-ST, ST-LD and ST-ST reordering; default
is only ST-ST reordering. To allow Linux to eventually enable full
reordering cmm_mb(), cmm_rmb() and cmm_wmb() all should perform SYNC
and a compiler barrier."
* No-op choice for cmm_read_barrier_depends():
Ralf Baechle:
"Technically there is nothing in the MIPS architecture spec that would
keep a MIPS implementation from reordering as freely as an Alpha or
even more liberally. In practice most do strong ordering. However
there is no MIPS implementation that makes full use of all the rope
provided. So in theory a paranoid implementation of
cmm_read_barrier_depends() for MIPS should perform a SYNC. In reality
it's not necessary and no sane MIPS core designer would implement
something that would design a core that need a non-empty
cmm_read_barrier_depends(). The reason why my patch had an empty one
is that I was using the Alpha code as a template."
Mathieu Desnoyers:
Moreover, the Linux kernel chooses a no-op for MIPS
read_barrier_depends() implementation, so any MIPS architecture that
would be as weak as Alpha would break the Linux kernel before breaking
the userspace RCU library.
* No need to put ".set noreorder" in cmm_mb() inline assembly:
Ralf Baechle:
"Certain instructions such as SYNC won't get reordered." ]
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
CC: Paul McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 9 Jul 2012 13:44:52 +0000 (09:44 -0400)]
Compatibility: remove bash-ismsm from test scripts
+= is not supported by all shells.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 22 Jun 2012 16:48:14 +0000 (12:48 -0400)]
Fix inappropriate lib behavior: don't call exit()
Use abort() (implemented through the new urcu_die()) instead of exit(-1)
for unrecoverable errors.
Fixes #152
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 14 Jun 2012 04:56:40 +0000 (00:56 -0400)]
Fix: re-enable compatibility with autoconf < 2.64
> I tried to build the latest urcu (git master e51500) on a Centos 6.2 box, and got:
>
> jscott@dxi0-62:~/src/userspace-rcu$ make -j4
> CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /users/jscott/src/userspace-rcu/config/missing --run aclocal-1.11 -I
> +config
> CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /users/jscott/src/userspace-rcu/config/missing --run autoconf
> cd . && /bin/sh /users/jscott/src/userspace-rcu/config/missing --run automake-1.11 --foreign
> configure:4010: error: possibly undefined macro: m4_ifnblank
> If this token and others are legitimate, please use m4_pattern_allow.
> See the Autoconf documentation.
> make: *** [configure] Error 1
> make: *** Waiting for unfinished jobs....
>
> Some digging showed that the macro m4_ifnblank requires autoconf 2.64. Centos 6.2 has autoconf 2.63. :(
>
> I just worked around it by reverting commit a767fd locally, then I can build fine.
Reported-by: John Steele Scott <toojays@toojays.net>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 12 Jun 2012 15:24:31 +0000 (11:24 -0400)]
Fix c99 compatibility: use __asm__ and __volatile__ in public headers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Mon, 11 Jun 2012 14:16:35 +0000 (10:16 -0400)]
Fix c99 compatibility: use __typeof__ instead of typeof in public headers
Reported-by: John Steele Scott <toojays@toojays.net>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 1 Jun 2012 21:12:43 +0000 (17:12 -0400)]
warning fix: tests urcutorture for NetBSD 5
> CC rcutorture_urcu-urcutorture.o
> In file included from urcutorture.c:9:
> api.h: In function '__smp_thread_id':
> api.h:160: warning: cast from pointer to integer of different size
> api.h:160: warning: cast from pointer to integer of different size
> api.h: In function 'wait_thread':
> api.h:210: warning: cast from pointer to integer of different size
> api.h:210: warning: cast from pointer to integer of different size
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 1 Jun 2012 17:45:44 +0000 (13:45 -0400)]
Update version to 0.7.3
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 1 Jun 2012 17:58:31 +0000 (13:58 -0400)]
Fix tests: make dist lib dependency
Some test programs were depending in SOURCES on the CDS library. Change
this for a LDADD, which makes "make dist" work after a make clean.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 1 Jun 2012 17:43:23 +0000 (13:43 -0400)]
Update README for OS supported, tests dependency
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 30 May 2012 13:55:39 +0000 (09:55 -0400)]
Add CodingStyle to tarball
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 30 May 2012 13:03:45 +0000 (09:03 -0400)]
Add coding style document
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Tue, 29 May 2012 02:10:54 +0000 (22:10 -0400)]
Test fix: test_perthreadlock uninitialized mutex
- Initialize the per thread mutexes. (fix)
- Remove unused count_reader/count_writer variables. (cleanup)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Hirohisa Yamaguchi [Sun, 27 May 2012 18:16:59 +0000 (14:16 -0400)]
tests: support FreeBSD short "time" args
time(1) in FreeBSD does not have long argument name: change --append to
-a and --output to -o
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Sat, 26 May 2012 15:00:16 +0000 (11:00 -0400)]
freebsd 8.2 fix: define MAP_ANONYMOUS for compatibility
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
This page took 0.052185 seconds and 4 git commands to generate.