Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using a Judy array (compressed nodes) for an RCU tree:
- favors cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than in, e.g., a
  non-RCU Judy array.

Choice: use intermediate nodes of 256 entries (the index at each level
can be represented on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit.
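
As an illustration of the choice above, here is a minimal sketch of how
a key could be cut into one 8-bit index per level. The function names
and the decision to consume the most significant byte at the root are
assumptions made for this example, not something this document
specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JA_BITS_PER_LEVEL	8	/* 256 entries per intermediate node */

/* Number of levels needed for a key of key_bits bits. */
static unsigned int ja_nr_levels(size_t key_bits)
{
	return (unsigned int) (key_bits / JA_BITS_PER_LEVEL);
}

/*
 * 8-bit index used at a given level. Level 0 is the root; in this
 * sketch (an assumption) it consumes the most significant byte.
 */
static uint8_t ja_level_index(uint64_t key, unsigned int level,
		unsigned int nr_levels)
{
	assert(nr_levels >= 1 && nr_levels <= 8 && level < nr_levels);
	return (uint8_t) (key >> (JA_BITS_PER_LEVEL * (nr_levels - 1 - level)));
}
```

For a 32-bit key 0x12345678, the four levels would use the indexes
0x12, 0x34, 0x56 and 0x78 in that order.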


* Node types (from least dense to most dense)


- Empty node: the pointer in the parent is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update the value and pointer arrays, one
  entry at a time (non-RCU...). For RCU, we might need to update the
  entire node anyway.
- Requires a sequential search through the whole value array to
  establish a lookup failure.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in the values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.

The two below are listed for completeness' sake, but because they
require two 32-byte cache lines for lookup, they are deemed
inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes
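
The size arithmetic and the sequential search described for Type A can
be sketched as follows. This is only an illustration: the function
names are hypothetical, and the node is passed as separate value and
pointer arrays rather than as one packed allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Size in bytes of a Type A node: 1 count byte, max_entries value
 * bytes, padding up to pointer alignment, then max_entries pointers.
 */
static size_t ja_linear_size(size_t max_entries, size_t ptr_size)
{
	size_t head = 1 + max_entries;

	assert(ptr_size == 4 || ptr_size == 8);
	head = (head + ptr_size - 1) / ptr_size * ptr_size;	/* pad */
	return head + max_entries * ptr_size;
}

/*
 * Sequential search: scan the value array, and touch the pointer array
 * only on a match. Returns NULL when the key is absent.
 */
static void *ja_linear_lookup(uint8_t nr_child, const uint8_t *values,
		void *const *ptrs, uint8_t key)
{
	uint8_t i;

	for (i = 0; i < nr_child; i++) {
		if (values[i] == key)
			return ptrs[i];
	}
	return NULL;
}
```

Evaluating ja_linear_size() for the entry counts listed above gives
back the listed node sizes, e.g. 28 entries with 8-byte pointers yield
256 bytes.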


- Type B: pools of values and pointers arrays

Pools of values and pointers arrays. Each pool's values array is 32
bytes in size (so it fits in an L1 cache line). Each pool begins with an
8-bit integer, which is the number of children in this pool, followed by
an array of 8-bit values, padding, and an array of pointers. Values and
pointer arrays are associated as in Type A.

The entries of a node are associated to their respective pool based
on their index position.

+ Allows lookup failure to use only one 32-byte cache line.
  Lookup success: 2 cache lines.

+ Allows in-place updates without reallocation, except when a pool is
  full. (this was not possible with bitmap-based nodes)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


* Choice of pool configuration distribution:

We have pools of either 2 or 4 linear arrays. Their total size is
between 256 bytes (32-bit 2 arrays) and 1024 bytes (64-bit 4 arrays).

Alignment on 256 bytes means that we can spare the 8 least significant
bits of the pointers. Given that the type selection already uses 3 bits,
5 bits are left.

Alignment on 512 bytes -> 9 bits spared, 6 bits left.

We can therefore encode which bit, or which two bits, are used as
distribution selection. We can use this technique to rebalance pools
if they become unbalanced (e.g. all children are within one of the two
pools).
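
To make the distribution selection concrete, the sketch below maps an
8-bit index to a pool number from one selected bit (2 pools) or from
two selected bits (4 pools). The function names and the way the two
bits are combined are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pool holding a given 8-bit index in a 2-pool node: the value of one
 * selected bit of the index (bit in 0..7).
 */
static unsigned int ja_pool_1d(uint8_t index, unsigned int bit)
{
	assert(bit < 8);
	return (index >> bit) & 1U;
}

/*
 * Pool holding a given index in a 4-pool node: two selected bits are
 * concatenated into a pool number in 0..3.
 */
static unsigned int ja_pool_2d(uint8_t index, unsigned int bit_hi,
		unsigned int bit_lo)
{
	assert(bit_hi < 8 && bit_lo < 8 && bit_hi != bit_lo);
	return (((index >> bit_hi) & 1U) << 1) | ((index >> bit_lo) & 1U);
}
```

If all children of a 2-pool node happen to share the same value for the
selected bit, picking another bit spreads them over both pools again
without changing the node type.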

We assume (without proof) that finding the exact sub-pool usage maximum
for any given distribution is NP-complete.

The figures below take unbalance ratios into account. They were tested
programmatically over millions of runs: randomly take N entries from
256, calculate the distribution for each bit (number of nodes for which
the bit is one/zero), calculate the difference in number of nodes for
each bit, and choose the minimum difference.

tot entries     unbalance       largest linear array (stat. approx.)
---------------------------------------------------------------------
41 entries:         9           20.5+4.5=25 (target ~50/2=25)
47 entries:         9           23.5+4.5=28 (target ~56/2=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback for when
the subclass does not fit, but it does not need to be efficient.
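
The per-bit measurement described above can be sketched as a small
helper. This is not the program that produced the table; the function
name and signature are hypothetical, and the random sampling loop is
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * For each of the 8 index bits, count how many entries have the bit set
 * versus cleared, and return the smallest |ones - zeros| found. If
 * best_bit is non-NULL, it receives the first bit reaching that minimum.
 */
static size_t ja_min_unbalance(const uint8_t *indexes, size_t nr,
		unsigned int *best_bit)
{
	size_t min_diff = nr + 1;	/* larger than any possible diff */
	unsigned int bit;

	assert(indexes != NULL || nr == 0);
	for (bit = 0; bit < 8; bit++) {
		size_t ones = 0, zeros, diff, i;

		for (i = 0; i < nr; i++)
			ones += (indexes[i] >> bit) & 1U;
		zeros = nr - ones;
		diff = ones > zeros ? ones - zeros : zeros - ones;
		if (diff < min_diff) {
			min_diff = diff;
			if (best_bit)
				*best_bit = bit;
		}
	}
	return min_diff;
}
```

With N entries and an unbalance of u, the fuller of the two pools holds
(N + u) / 2 entries, which is how the table arrives at 20.5+4.5=25 for
41 entries.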


For pools of size 4, we need to approximate the maximum unbalance we can
get for a choice of distributions grouped by pairs of bits.

tot entries     unbalance       largest linear array (stat. approx.)
---------------------------------------------------------------------
80 entries:        20           20+5=25 (target: ~100/4=25)
90 entries:        22           22.5+5.5=28 (target: ~112/4=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback for when
the subclass does not fit, but it does not need to be efficient.


* Population "does not fit" and distribution fallback

When adding a child to a distribution node, if the child does not fit,
we recalculate the best distribution. If it does not fit in that
distribution either, we need to expand the node type.

When removing a child, if the node child count is brought to the number
of entries expected to statistically fit in the lower order node, we try
to shrink. However, if we notice that the distribution does not actually
fit in that shrunk node, we abort the shrink operation. If shrink
fails, we keep a counter of insertion/removal operations on the node
before we allow the shrink to be attempted again.
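
The shrink back-off described above could be tracked with a small
per-node counter, as in the sketch below. All names are hypothetical,
and the back-off length of 16 operations is an arbitrary value picked
for the example, not a figure from this document.

```c
#include <assert.h>
#include <stdbool.h>

#define JA_SHRINK_BACKOFF	16	/* arbitrary: updates to wait after a failed shrink */

struct ja_shrink_state {
	unsigned int nr_child;	/* current number of children */
	unsigned int backoff;	/* updates left before shrink may be retried */
};

/* Called after each insertion/removal on the node. */
static void ja_note_update(struct ja_shrink_state *s)
{
	assert(s->backoff <= JA_SHRINK_BACKOFF);
	if (s->backoff)
		s->backoff--;
}

/* True if the removal path should attempt a shrink now. */
static bool ja_may_shrink(const struct ja_shrink_state *s,
		unsigned int shrink_threshold)
{
	return s->nr_child <= shrink_threshold && s->backoff == 0;
}

/* Called when the recalculated distribution did not fit the smaller node. */
static void ja_shrink_failed(struct ja_shrink_state *s)
{
	s->backoff = JA_SHRINK_BACKOFF;
}
```

The counter keeps a node that sits right at a threshold, but whose
distribution does not fit the smaller type, from recomputing the
distribution on every removal.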


- Type C: pigeon-hole array

Filled at 47.2%/48.8% or more (32-bit: 121 entries+, 64-bit: 125 entries+)
Array of children node pointers. Pointers NULL if no child at index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


* Analysis of the thresholds:

Analysis of the number of cache lines touched for each node, per node
type, depending on the number of children per node, as we increment the
number of children from 0 to 256. Through this, we choose the
number-of-children thresholds at which it is worthwhile to use a
different node type.

- ALWAYS 1 cache line hit for lookup failure (all cases)

- Type A: sequential search in value and pointer arrays
  - 1 cache line hit for lookup success

  - 2 cache line hit for lookup success

- Type C: pigeon-hole array
  - 1 cache line hit for lookup success

- Type A: sequential search in value and pointer arrays
  - 1 cache line hit for lookup success

  - 2 cache line hit for lookup success

- Type C: pigeon-hole array
  - 1 cache line hit for lookup success


* Analysis of node type encoding and node pointers:

Lookups are _always_ from the top of the tree going down. This
facilitates RCU replacement as we only keep track of pointers going
downward.

Type of node encoded in the parent's pointer. Need to reserve 2
least-significant bits.

enum {
	RCU_JA_LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 25 children, 8 to 128 bytes */
			/* 64-bit: 1 to 28 children, 16 to 256 bytes */
	RCU_JA_POOL = 1,	/* Type B */
			/* 32-bit: 26 to 100 children, 256 to 512 bytes */
			/* 64-bit: 29 to 112 children, 512 to 1024 bytes */
	RCU_JA_PIGEON = 2,	/* Type C */
			/* 32-bit: 101 to 256 children, 1024 bytes */
			/* 64-bit: 113 to 256 children, 2048 bytes */
	/* Leaf nodes are implicit from their height in the tree */
};

If the entire pointer is NULL, the child is empty.
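
Encoding the node type in the low bits of the parent's pointer can be
sketched as below. The enumerator names come from this document; the
enum tag, the macros and the helper functions are hypothetical, and the
sketch assumes nodes are aligned on at least 4 bytes so that the 2
reserved bits are always zero in a real address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JA_TYPE_BITS	2	/* low bits reserved for the type */
#define JA_TYPE_MASK	((1UL << JA_TYPE_BITS) - 1)

enum ja_type {			/* hypothetical enum tag */
	RCU_JA_LINEAR = 0,	/* Type A */
	RCU_JA_POOL = 1,	/* Type B */
	RCU_JA_PIGEON = 2,	/* Type C */
};

/* Combine an aligned node address and its type into one pointer-sized word. */
static uintptr_t ja_tag(void *node, enum ja_type type)
{
	uintptr_t v = (uintptr_t) node;

	assert((v & JA_TYPE_MASK) == 0);	/* low bits must be free */
	return v | (uintptr_t) type;
}

/* Extract the node type from a tagged word. */
static enum ja_type ja_tag_type(uintptr_t tagged)
{
	return (enum ja_type) (tagged & JA_TYPE_MASK);
}

/* Extract the node address from a tagged word (NULL for an empty child). */
static void *ja_tag_node(uintptr_t tagged)
{
	return (void *) (tagged & ~(uintptr_t) JA_TYPE_MASK);
}
```

Because the type travels with the pointer, a reader learns how to
interpret a child node from the same load that gives it the child's
address.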


* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It's certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of children nodes (for each node),
to keep track of when the node type thresholds are reached. It would be
important to add hysteresis so we don't switch between node types too
often when the same node is added and removed in a loop.

We acquire locks from child to parent, nested. We take all locks
required to perform a given update in the tree (but no more) to keep it
consistent with respect to number of children per node.

If the check (always done under the node lock) shows that the node is
being gc'd, we simply need to release the lock and look up the node
again.
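
The lock-then-check step can be sketched with a POSIX mutex as follows.
The node layout and function names are hypothetical; the re-lookup and
the deferred free after a grace period are left to the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct ja_node {		/* hypothetical node header */
	pthread_mutex_t lock;
	bool gc;		/* set under lock once the node is replaced */
	unsigned int nr_child;
};

/*
 * Lock a node for update. Returns true with the lock held if the node
 * is still live; returns false with the lock released if it was gc'd,
 * in which case the caller looks the node up again and retries.
 */
static bool ja_lock_live(struct ja_node *node)
{
	assert(node != NULL);
	pthread_mutex_lock(&node->lock);
	if (node->gc) {
		pthread_mutex_unlock(&node->lock);
		return false;
	}
	return true;
}

/* Flag a node as replaced. The caller must hold node->lock. */
static void ja_mark_gc(struct ja_node *node)
{
	node->gc = true;
}
```

An updater would typically loop: look the node up under the RCU read
lock, call ja_lock_live(), and start over from the lookup whenever it
returns false.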

RCU-lookup each level of the tree, until we reach the leaf node. If a
level is not populated, fail.
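
A simplified sketch of this descent is given below. To stay
self-contained it makes two loud assumptions: every intermediate node
is a pigeon-hole array, and plain loads stand in for what a real reader
would do with rcu_dereference() inside an rcu_read_lock() /
rcu_read_unlock() section. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JA_ENTRIES	256

struct ja_pigeon {		/* Type C node: one slot per index */
	void *child[JA_ENTRIES];
};

/*
 * Walk nr_levels levels from the root, consuming the key one byte at a
 * time, most significant byte first. Returns the leaf, or NULL as soon
 * as a level is not populated.
 */
static void *ja_lookup(struct ja_pigeon *root, uint64_t key,
		unsigned int nr_levels)
{
	void *node = root;
	unsigned int level;

	assert(nr_levels >= 1 && nr_levels <= 8);
	for (level = 0; level < nr_levels; level++) {
		uint8_t index;

		if (!node)
			return NULL;	/* level not populated: fail */
		index = (uint8_t) (key >> (8 * (nr_levels - 1 - level)));
		/* A real reader would use rcu_dereference() here. */
		node = ((struct ja_pigeon *) node)->child[index];
	}
	return node;
}
```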

RCU-lookup insert position. Find location in tree where nodes are
missing for this insertion. If leaf is already present, insert fails,
releasing the rcu read lock. The insert location consists of a parent
node to which we want to attach a new node.

RCU-lookup parent node. Take the parent lock. If the parent needs to be
reallocated to make room for this insertion, RCU-lookup parent-parent
node and take the parent-parent lock. For each lock taken, check if
node is being gc'd. If gc'd, release lock, re-RCU-lookup this node, and
retry.

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting.

- If parent node reallocation is needed:
  Reallocate the parent node, adding the new branch to it, and
  increment its node count.
  Set gc flag in old nodes.
  call_rcu free for all old nodes.
  Populate new parent node with rcu_assign_pointer.
- Else:
  Increment parent node count.
  Use rcu_assign_pointer to populate this new branch into the parent
  node.

Release parent and (if taken) parent-parent locks.
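
The "no reallocation" publication step can be sketched as follows. To
avoid depending on liburcu, a C11 release store is used here as a
stand-in for rcu_assign_pointer; that substitution, the pigeon-hole
parent layout and the function name are all assumptions of this
example. The caller is expected to hold the parent lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct ja_parent {		/* hypothetical pigeon-hole parent */
	_Atomic(void *) child[256];
	unsigned int nr_child;	/* protected by the parent lock */
};

/*
 * Attach a fully constructed branch under parent at index. Returns 0 on
 * success, -1 if the slot is already populated.
 */
static int ja_attach_branch(struct ja_parent *parent, uint8_t index,
		void *branch)
{
	assert(branch != NULL);
	if (atomic_load_explicit(&parent->child[index],
			memory_order_relaxed) != NULL)
		return -1;
	parent->nr_child++;
	/* Release store: readers that see the pointer see the whole branch. */
	atomic_store_explicit(&parent->child[index], branch,
			memory_order_release);
	return 0;
}
```

Building the branch privately and publishing it with a single store
means concurrent readers observe either the old empty slot or the
complete new branch, never a partially built one.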

RCU-lookup leaf to remove. If leaf is missing, fail and release rcu
read lock.

RCU-lookup parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup parent-parent and take the parent-parent lock. Do
so recursively until no node reallocation is needed, or until root is
reached.

For each lock taken, check if node is being gc'd. If gc'd, release lock,
re-RCU-lookup this node, and retry.

The branch (or portion of branch) consisting of taken locks necessarily
has a simple node removal or update as operation to do on its top node.

If the operation is a node removal, then, necessarily, the entire branch
under the node removal operation will simply disappear. No node
allocation is needed.

Else, if the operation is a child node reallocation, the child node will
necessarily do a node removal. So _its_ entire child branch will
disappear. So reallocate this child node without the removed branch
(remember to decrement its nr children count).

No reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set old nodes gc flag.
call_rcu free for all old nodes.

For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- pools of values and pointers arrays (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex