RCU Judy Array Design
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
March 8, 2012

Initial ideas based on the released Judy Shop Manual
(http://judy.sourceforge.net/). Judy was invented by Doug Baskins and
implemented by Hewlett-Packard.

Thresholds and RCU-specific analysis are introduced in this document.

Advantages of using Judy Array (compressed nodes) for RCU tree:
- no rebalancing
- no transplant
- RCU-friendly!
- favor cache-line alignment of structures

Disadvantage:
- updates that need to reallocate nodes are slower than, e.g. non-rcu
  red-black trees.

Choice: Using 256 entries intermediate nodes (index can be represented
on 8 bits): 4 levels on 32-bit, 8 levels on 64-bit


* Node types (from less dense node to most dense)


- empty node:

Parent pointer is NULL.


- Type A: sequential search in value and pointer arrays

+ Add/removal just needs to update value and pointer array, single-entry
  (non-RCU...). For RCU, we might need to update the entire node anyway.
- Requires sequential search through all value array for lookup fail
  test.

Filled at 3 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
64-bit: 1 byte + 3 bytes + 4 bytes pad + 3*8 = 32 bytes

-> up to this point on 64-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 6 entries max 32-bit, 7 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 6 bytes + 1 byte pad + 6*4bytes = 32 bytes
64-bit: 1 byte + 7 bytes + 7*8 = 64 bytes

-> up to this point on 32-bit, sequential lookup and pointer read fit in
a 32-byte cache line.
  - lookup fail&success: 1 cache-line.

Filled at 12 entries max 32-bit, 14 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 12 bytes + 3 bytes pad + 12*4bytes = 64 bytes
64-bit: 1 byte + 14 bytes + 1 byte pad + 14*8 = 128 bytes

Filled at 25 entries max 32-bit, 28 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

---> up to this point, on both 32-bit and 64-bit, the sequential lookup
in values array fits in a 32-byte cache line.
  - lookup failure: 1 cache line.
  - lookup success: 2 cache lines.
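
The lookup cost above can be illustrated with a minimal, single-threaded
sketch of a Type A node. The names ja_linear, JA_LINEAR_MAX,
ja_linear_lookup and ja_linear_add are hypothetical and not taken from
any implementation; the capacity of 28 assumes the largest 64-bit layout
listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity: 28 matches the largest 64-bit Type A node. */
#define JA_LINEAR_MAX 28

/* Child count, then the 8-bit values, then the associated pointers. */
struct ja_linear {
	uint8_t nr_child;
	uint8_t values[JA_LINEAR_MAX];
	void *ptrs[JA_LINEAR_MAX];	/* the compiler inserts the padding */
};

/*
 * Scan the values array. Only a match reads the pointer array, so a
 * miss touches nothing beyond the count and the values.
 */
static void *ja_linear_lookup(const struct ja_linear *node, uint8_t key)
{
	assert(node != NULL);
	for (unsigned int i = 0; i < node->nr_child; i++) {
		if (node->values[i] == key)
			return node->ptrs[i];
	}
	return NULL;
}

/*
 * Non-RCU, in-place append. The caller guarantees that key is not
 * already present. Returns 0 on success, -1 if the node is full.
 */
static int ja_linear_add(struct ja_linear *node, uint8_t key, void *ptr)
{
	assert(node != NULL);
	if (node->nr_child >= JA_LINEAR_MAX)
		return -1;
	node->values[node->nr_child] = key;
	node->ptrs[node->nr_child] = ptr;
	node->nr_child++;
	return 0;
}
```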

The two below are listed for completeness sake, but because they require
2 32-byte cache lines for lookup, these are deemed inappropriate.

Filled at 51 entries max 32-bit, 56 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 51 bytes + 51*4bytes = 256 bytes
64-bit: 1 byte + 56 bytes + 7 bytes pad + 56*8 = 512 bytes

Filled at 102 entries max 32-bit, 113 entries max 64-bit
8 bits indicating number of children
Array of 8-bit values followed by array of associated pointers.
32-bit: 1 byte + 102 bytes + 1 byte pad + 102*4bytes = 512 bytes
64-bit: 1 byte + 113 bytes + 6 bytes pad + 113*8 = 1024 bytes
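
The size arithmetic used for every Type A layout above can be checked
with a small sketch. It assumes the padding always rounds the count byte
plus the values array up to pointer alignment, which matches each line
listed; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes needed by a Type A node holding n entries when pointers are
 * ptr_size bytes wide: 1 count byte + n value bytes, padded up to
 * pointer alignment, followed by n pointers.
 */
static size_t ja_linear_bytes(size_t n, size_t ptr_size)
{
	size_t header = 1 + n;

	assert(ptr_size == 4 || ptr_size == 8);
	header = (header + ptr_size - 1) / ptr_size * ptr_size;
	return header + n * ptr_size;
}

/* Largest entry count whose node still fits in node_bytes. */
static size_t ja_linear_max_entries(size_t node_bytes, size_t ptr_size)
{
	size_t n = 0;

	while (ja_linear_bytes(n + 1, ptr_size) <= node_bytes)
		n++;
	return n;
}
```

For instance, ja_linear_max_entries(128, 4) gives the 25 entries of the
128-byte 32-bit node, and ja_linear_max_entries(256, 8) gives the 28
entries of the 256-byte 64-bit node.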


- Type B: pools of values and pointers arrays

Pools of values and pointers arrays. Each pool's values array is 32 bytes
in size (so it fits in an L1 cache line). Each pool begins with an 8-bit
integer, which is the number of children in this pool, followed by an
array of 8-bit values, padding, and an array of pointers. Values and
pointer arrays are associated as in Type A.

The entries of a node are associated to their respective pool based
on their index position.

+ Allows lookup failure to touch only one 32-byte cache line.
  Lookup success touches 2 cache lines.

+ Allows in-place updates without reallocation, except when a pool is
  full. (this was not possible with bitmap-based nodes)
- If one pool exhausts its space, we need to increase the node size.
  Therefore, for very dense populations, we will end up using the
  pigeon-hole node type sooner, thus consuming more space.

Pool configuration:

Per pool, filled at 25 entries (32-bit), 28 entries (64-bit)
32-bit: 1 byte + 25 bytes + 2 bytes pad + 25*4bytes = 128 bytes
64-bit: 1 byte + 28 bytes + 3 bytes pad + 28*8 = 256 bytes

Total up to 50 entries (32-bit), 56 entries (64-bit)
2 pools: 32-bit = 256 bytes
2 pools: 64-bit = 512 bytes

Total up to 100 entries (32-bit), 112 entries (64-bit)
4 pools: 32-bit = 512 bytes
4 pools: 64-bit = 1024 bytes


* Choice of pool configuration distribution:

We have pools of either 2 or 4 linear arrays. Their total size is
between 256 bytes (32-bit 2 arrays) and 1024 bytes (64-bit 4 arrays).

Alignment on 256 bytes means that we can spare the 8 least significant
bits of the pointers. Given that the type selection already uses 3 bits,
we have 5 bits left.

Alignment on 512 bytes -> 6 bits left.

We can therefore encode which bit, or which two bits, are used as
distribution selection. We can use this technique to reequilibrate pools
if they become unbalanced (e.g. all children are within one of the two
linear arrays).
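
A minimal sketch of the distribution selection described above, assuming
the selected bit (2 pools) or pair of bits (4 pools) of the 8-bit index
directly forms the pool number. The function names and the way the
selection is passed in are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 2 pools: one selected bit of the 8-bit index picks the pool. */
static unsigned int ja_pool2_index(uint8_t key, unsigned int bit)
{
	assert(bit < 8);
	return (key >> bit) & 1u;
}

/* 4 pools: two selected bits form a 2-bit pool number. */
static unsigned int ja_pool4_index(uint8_t key, unsigned int bit_hi,
				   unsigned int bit_lo)
{
	assert(bit_hi < 8 && bit_lo < 8 && bit_hi != bit_lo);
	return (((key >> bit_hi) & 1u) << 1) | ((key >> bit_lo) & 1u);
}
```

If all children have an index below 128, selecting bit 7 puts every one
of them in pool 0. Selecting another bit, e.g. bit 0, spreads the same
children over both pools.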

We assume (without proof) that finding the exact sub-pool usage maximum
for any given distribution is NP-complete.

We therefore rely on measured unbalance ratios. They were obtained
programmatically by randomly taking N entries from 256, calculating the
distribution for each bit (number of nodes for which the bit is
one/zero), calculating the difference in number of nodes for each bit,
and choosing the minimum difference, over millions of runs.

tot entries   unbalance       largest linear array (stat. approx.)
---------------------------------------------------------------------
48 entries:      2 (98%)              24+1=25 (target ~50/2=25)
54 entries:      2 (97%)              27+1=28 (target ~56/2=28)

Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback if the
subclass does not fit, but it does not need to be efficient.
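
The per-population measurement behind the 2-pool table can be sketched
as follows. This is only the deterministic core (one set of keys in, the
minimum difference out); the random sampling over millions of runs is
left out, and ja_min_unbalance is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * For each of the 8 index bits, count the keys with that bit set versus
 * clear, and return the smallest |ones - zeros| over all bits. The
 * lowest bit achieving it is stored in *best_bit.
 */
static size_t ja_min_unbalance(const uint8_t *keys, size_t n,
			       unsigned int *best_bit)
{
	size_t best = SIZE_MAX;

	assert(best_bit != NULL);
	for (unsigned int bit = 0; bit < 8; bit++) {
		size_t ones = 0, zeros, diff;

		for (size_t i = 0; i < n; i++)
			ones += (keys[i] >> bit) & 1u;
		zeros = n - ones;
		diff = ones > zeros ? ones - zeros : zeros - ones;
		if (diff < best) {
			best = diff;
			*best_bit = bit;
		}
	}
	return best;
}
```

With an unbalance of u over n entries, the larger of the two linear
arrays holds (n + u) / 2 entries, which is how 48 entries with an
unbalance of 2 give 24+1=25.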


For pool of size 4, we need to approximate what is the maximum unbalance
we can get for choice of distributions grouped by pairs of bits.

tot entries     unbalance     largest linear array (stat. approx.)
---------------------------------------------------------------------
92 entries:      8 (99%)             23+2=25  (target: ~100/4=25)
104 entries:     8 (99%)             26+2=28  (target: ~112/4=28)


Note: there exist rare worse cases where the unbalance is larger, but
they happen _very_ rarely. We still need to provide a fallback if the
subclass does not fit, but it does not need to be efficient.
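
For the 4-pool case, a sketch that tries every pair of bits and reports
the smallest achievable "largest linear array" directly, rather than the
unbalance figure of the table. This is an illustrative reading of the
measurement, not the program used to produce the numbers above; the
function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Try all 28 pairs of index bits. For each pair, distribute the keys
 * over 4 pools and record the most populated pool. Return the smallest
 * such maximum over all pairs.
 */
static size_t ja_best_pair_largest_pool(const uint8_t *keys, size_t n)
{
	size_t best = n;

	assert(keys != NULL || n == 0);
	for (unsigned int hi = 1; hi < 8; hi++) {
		for (unsigned int lo = 0; lo < hi; lo++) {
			size_t count[4] = { 0, 0, 0, 0 };
			size_t largest = 0;

			for (size_t i = 0; i < n; i++) {
				unsigned int p = (((keys[i] >> hi) & 1u) << 1)
					| ((keys[i] >> lo) & 1u);

				count[p]++;
			}
			for (unsigned int p = 0; p < 4; p++) {
				if (count[p] > largest)
					largest = count[p];
			}
			if (largest < best)
				best = largest;
		}
	}
	return best;
}
```

A population fits a 4-pool node when the returned value does not exceed
the per-pool capacity (25 on 32-bit, 28 on 64-bit).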


* Population "does not fit" and distribution fallback

When adding a child to a distribution node, if the child does not fit,
we recalculate the best distribution. If it does not fit in that
distribution either, we need to expand the node type.

When removing a child, if the node child count is brought to the number
of entries expected to statistically fit in the lower order node, we try
to shrink. However, if we notice that the distribution does not actually
fit in that shrunk node, we abort the shrink operation. If the shrink
fails, we keep a counter of insertion/removal operations on the node,
which must elapse before we allow the shrink to be attempted again.
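
The "does it fit" test for a 2-pool node can be sketched as below. As a
simplification, it returns the first bit whose distribution fits rather
than the best one, and it only covers the single-bit (2-pool) case. The
function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the lowest bit for which both pools hold at most `capacity`
 * keys, or -1 if no single-bit distribution fits. On -1, an insertion
 * has to expand the node type, and a shrink attempt has to be aborted.
 */
static int ja_find_fitting_bit(const uint8_t *keys, size_t n,
			       size_t capacity)
{
	assert(keys != NULL || n == 0);
	for (unsigned int bit = 0; bit < 8; bit++) {
		size_t ones = 0;

		for (size_t i = 0; i < n; i++)
			ones += (keys[i] >> bit) & 1u;
		if (ones <= capacity && n - ones <= capacity)
			return (int)bit;
	}
	return -1;
}
```

For example, 30 children with even indexes 0, 2, ..., 58 do not fit with
bit 0 (all 30 land in the same pool of capacity 25), but do fit with
bit 1, which splits them 15/15.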


- Type C: pigeon-hole array

Filled at 39.5%/44.1% or more (32-bit: 101 entries+, 64-bit: 113 entries+)
Array of children node pointers. Pointers NULL if no child at index.
32-bit: 4*256 = 1024 bytes
64-bit: 8*256 = 2048 bytes


* Analysis of the thresholds:

Analysis of number of cache-lines touched for each node, per-node-type,
depending on the number of children per node, as we increment the number
of children from 0 to 256. Through this, we choose number of children
thresholds at which it is worthwhile to use a different node type.

Per node:

- ALWAYS 1 cache line hit for lookup failure (all cases)

32-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 6 children

- 2 cache line hit for lookup success
- 64 bytes storage

up to 12 children

- 128 bytes storage

up to 25 children

- Type B: pool

- 256 bytes storage

up to 50 children

- 512 bytes storage
up to 100 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 1024 bytes storage

up to 256 children


64-bit

- Nonexistent

0 children

- Type A: sequential search in value and pointer arrays
- 1 cache line hit for lookup success
- 32 bytes storage

up to 3 children

- 2 cache line hit for lookup success
- 64 bytes storage

up to 7 children

- 128 bytes storage

up to 14 children

- 256 bytes storage

up to 28 children

- Type B: pool

- 512 bytes storage
up to 56 children

- 1024 bytes storage
up to 112 children

- Type C: pigeon-hole array
- 1 cache line hit for lookup success
- 2048 bytes storage

up to 256 children
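
The thresholds above can be summarized as two lookup tables. The sketch
below picks the smallest class able to hold a given number of children;
it ignores the statistical pool-fit issue discussed earlier, and the
type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum ja_kind { JA_NONE, JA_LINEAR, JA_POOL, JA_PIGEON };

struct ja_class {
	unsigned int max_children;
	enum ja_kind kind;
	size_t bytes;
};

/* Thresholds and storage sizes as listed for 32-bit. */
static const struct ja_class ja_classes_32[] = {
	{ 0, JA_NONE, 0 },
	{ 6, JA_LINEAR, 32 }, { 12, JA_LINEAR, 64 }, { 25, JA_LINEAR, 128 },
	{ 50, JA_POOL, 256 }, { 100, JA_POOL, 512 },
	{ 256, JA_PIGEON, 1024 },
};

/* Thresholds and storage sizes as listed for 64-bit. */
static const struct ja_class ja_classes_64[] = {
	{ 0, JA_NONE, 0 },
	{ 3, JA_LINEAR, 32 }, { 7, JA_LINEAR, 64 }, { 14, JA_LINEAR, 128 },
	{ 28, JA_LINEAR, 256 },
	{ 56, JA_POOL, 512 }, { 112, JA_POOL, 1024 },
	{ 256, JA_PIGEON, 2048 },
};

/* Smallest class whose threshold covers nr_children. */
static const struct ja_class *ja_pick_class(unsigned int nr_children,
					    int is_64bit)
{
	const struct ja_class *tbl = is_64bit ? ja_classes_64 : ja_classes_32;
	size_t len = is_64bit
		? sizeof(ja_classes_64) / sizeof(ja_classes_64[0])
		: sizeof(ja_classes_32) / sizeof(ja_classes_32[0]);

	assert(nr_children <= 256);
	for (size_t i = 0; i < len; i++) {
		if (nr_children <= tbl[i].max_children)
			return &tbl[i];
	}
	return NULL;	/* not reached: the last class holds 256 */
}
```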


* Analysis of node type encoding and node pointers:

Lookups are _always_ from the top of the tree going down. This
facilitates RCU replacement as we only keep track of pointers going
downward.

Type of node encoded in the parent's pointer. Need to reserve 2
least-significant bits.

Types of children:

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
			/* 32-bit: 1 to 25 children, 8 to 128 bytes */
			/* 64-bit: 1 to 28 children, 16 to 256 bytes */
	RCU_JA_POOL = 1,	/* Type B */
			/* 32-bit: 26 to 100 children, 256 to 512 bytes */
			/* 64-bit: 29 to 112 children, 512 to 1024 bytes */
	RCU_JA_PIGEON = 2,	/* Type C */
			/* 32-bit: 101 to 256 children, 1024 bytes */
			/* 64-bit: 113 to 256 children, 2048 bytes */
	/* Leaf nodes are implicit from their height in the tree */
};

If the entire pointer is NULL, the child is empty.
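
A minimal sketch of encoding the child type in the 2 least-significant
bits of the parent's pointer. The enum values are those listed above;
JA_TYPE_BITS, JA_TYPE_MASK and the three helpers are hypothetical names.
It assumes nodes are aligned on at least 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum child_type {
	RCU_JA_LINEAR = 0,	/* Type A */
	RCU_JA_POOL = 1,	/* Type B */
	RCU_JA_PIGEON = 2,	/* Type C */
};

#define JA_TYPE_BITS	2
#define JA_TYPE_MASK	((uintptr_t)((1u << JA_TYPE_BITS) - 1))

/* Combine a node address and its type into one pointer-sized word. */
static uintptr_t ja_tag(void *node, enum child_type type)
{
	uintptr_t p = (uintptr_t)node;

	/* The low bits must be free, i.e. the node must be aligned. */
	assert((p & JA_TYPE_MASK) == 0);
	return p | (uintptr_t)type;
}

/* Extract the type from a tagged word. */
static enum child_type ja_tagged_type(uintptr_t tagged)
{
	return (enum child_type)(tagged & JA_TYPE_MASK);
}

/* Extract the node address from a tagged word. */
static void *ja_tagged_node(uintptr_t tagged)
{
	return (void *)(tagged & ~JA_TYPE_MASK);
}
```

A tagged word equal to 0 stands for an empty child. It cannot be
confused with a Type A node (type value 0), since a real node never has
address 0.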


* Lookup and Update Algorithms

Let's propose a quite simple scheme that uses a mutex on nodes to manage
update concurrency. It's certainly not optimal in terms of concurrency
management within a node, but it has the advantage of being simple to
implement and understand.

We need to keep a count of the number of children nodes (for each node),
to keep track of when the node type thresholds are reached. It would be
important to put a hysteresis loop so we don't change between node
types too often for a loop on add/removal of the same node.
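
One possible form of such a hysteresis, as a sketch only: grow as soon
as the node capacity is exceeded, but shrink only once the count has
dropped a few entries below the capacity of the next smaller node type.
The margin parameter and both function names are hypothetical; this
document does not fix a value.

```c
#include <assert.h>

/* Grow when the node cannot hold one more child. */
static int ja_should_grow(unsigned int nr_children, unsigned int capacity)
{
	return nr_children > capacity;
}

/*
 * Shrink only when the count is `margin` entries below what the next
 * smaller node type can hold, so that alternating add/remove around the
 * threshold does not reallocate on every operation.
 */
static int ja_should_shrink(unsigned int nr_children,
			    unsigned int lower_capacity,
			    unsigned int margin)
{
	assert(margin <= lower_capacity);
	return nr_children <= lower_capacity - margin;
}
```

With a lower capacity of 25 and a margin of 2, adding the 26th child
grows the node, removing it again (back to 25) does not shrink it, and
the shrink is only attempted at 23 children.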

We acquire locks from child to parent, nested. We take all locks
required to perform a given update in the tree (but no more) to keep it
consistent with respect to number of children per node.

If, after taking a node lock, we find that the node is being gc'd (this
check is always done under the node lock), we simply release the lock
and look up the node again.
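
The lock-then-check step can be sketched with a POSIX mutex and a gc
flag per node. The structure layout and function names are hypothetical;
the caller is expected to redo its RCU lookup and retry when
ja_lock_live() returns false.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-node header: update lock plus gc flag. */
struct ja_node_hdr {
	pthread_mutex_t lock;
	bool gc;	/* set, under lock, once the node was replaced */
};

/*
 * Take the node lock and check the gc flag under it.
 * Returns true with the lock held if the node is still live.
 * Returns false with the lock released if the node is being gc'd.
 */
static bool ja_lock_live(struct ja_node_hdr *node)
{
	assert(node != NULL);
	pthread_mutex_lock(&node->lock);
	if (node->gc) {
		pthread_mutex_unlock(&node->lock);
		return false;
	}
	return true;
}

/* Mark a replaced node. Must be called with node->lock held. */
static void ja_mark_gc(struct ja_node_hdr *node)
{
	node->gc = true;
}

static void ja_unlock(struct ja_node_hdr *node)
{
	pthread_mutex_unlock(&node->lock);
}
```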


- Leaf lookup

rcu_read_lock()

RCU-lookup each level of the tree until we reach the leaf node. If a
level is not populated, fail.

rcu_read_unlock()
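
A single-threaded sketch of the top-down walk for a 32-bit key. To keep
it short it assumes every node is a pigeon-hole array and that the most
significant byte of the key indexes the root; both are assumptions of
this example. In a concurrent reader, the walk would sit between
rcu_read_lock() and rcu_read_unlock(), and each child pointer would be
loaded with rcu_dereference().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JA_LEVELS	4	/* 32-bit key, 8 bits per level */

static_assert(JA_LEVELS * 8 == 32, "one key byte per level");

/* Only pigeon-hole nodes, to keep the sketch short. */
struct ja_node {
	void *child[256];
};

/*
 * Walk down one byte of the key per level, most significant byte
 * first. Returns the leaf pointer, or NULL as soon as a level is not
 * populated.
 */
static void *ja_lookup(const struct ja_node *root, uint32_t key)
{
	const void *p = root;

	for (int level = JA_LEVELS - 1; level >= 0; level--) {
		const struct ja_node *node = p;

		if (!node)
			return NULL;
		p = node->child[(key >> (8 * level)) & 0xff];
	}
	return (void *)p;
}
```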


- Leaf insertion

A) Lookup

rcu_read_lock()
RCU-lookup insert position. Find location in tree where nodes are
missing for this insertion. If leaf is already present, insert fails,
releasing the rcu read lock.  The insert location consists of a parent
node to which we want to attach a new node.

B) Lock

RCU-lookup parent node. Take the parent lock. If the parent needs to be
reallocated to make room for this insertion, RCU-lookup parent-parent
node and take the parent-parent lock.  For each lock taken, check if
node is being gc'd. If gc'd, release lock, re-RCU-lookup this node, and
retry.

C) Create

Construct the whole branch from the new topmost intermediate node down
to the new leaf node we are inserting. 

D) Populate

  - If parent node reallocation is needed:
    Reallocate the parent node, adding the new branch to it, and
    increment its node count.
    Publish the new parent node into the parent-parent node with
    rcu_assign_pointer.
    Set the gc flag in old nodes.
    call_rcu free for all old nodes (only after the new node has been
    published, so that no new reader can still reach them).
  - Else:
    Increment parent node count.
    Use rcu_assign_pointer to populate this new branch into the parent
    node.

E) Locks

Release parent and (if taken) parent-parent locks.
rcu_read_unlock()
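
Steps A, C and D of the insertion can be sketched single-threaded as
below. The sketch assumes pigeon-hole nodes only (so no parent
reallocation is ever needed), a 32-bit key with the most significant
byte at the root, and hypothetical names throughout. Locking (steps B
and E) is left out; in a concurrent version the final store would be an
rcu_assign_pointer().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define JA_LEVELS	4	/* 32-bit key, 8 bits per level */

struct ja_node {
	unsigned int nr_child;
	void *child[256];
};

static unsigned int ja_byte(uint32_t key, int level)
{
	return (key >> (8 * level)) & 0xff;
}

/* Returns 0 on success, -1 if the leaf exists or allocation fails. */
static int ja_insert(struct ja_node *root, uint32_t key, void *leaf)
{
	struct ja_node *parent = root;
	int level = JA_LEVELS - 1;
	void *branch = leaf;

	assert(root != NULL && leaf != NULL);

	/* A) Lookup: descend while intermediate nodes exist. */
	while (level > 0 && parent->child[ja_byte(key, level)]) {
		parent = parent->child[ja_byte(key, level)];
		level--;
	}
	if (level == 0 && parent->child[ja_byte(key, 0)])
		return -1;	/* leaf already present */

	/* C) Create: build the missing branch bottom-up, still private. */
	for (int l = 0; l < level; l++) {
		struct ja_node *n = calloc(1, sizeof(*n));

		if (!n) {
			/* Unwind the private, never published nodes. */
			while (l-- > 0) {
				struct ja_node *old = branch;

				branch = old->child[ja_byte(key, l)];
				free(old);
			}
			return -1;
		}
		n->child[ja_byte(key, l)] = branch;
		n->nr_child = 1;
		branch = n;
	}

	/* D) Populate: one store publishes the whole new branch. */
	parent->nr_child++;
	parent->child[ja_byte(key, level)] = branch;
	return 0;
}
```

Building the branch privately and attaching it with a single store is
what lets readers see either the old tree or the complete new branch,
never a partially built one.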


- Leaf removal

A) Lookup

rcu_read_lock()
RCU-lookup leaf to remove. If leaf is missing, fail and release rcu
read lock.

B) Lock

RCU-lookup parent. Take the parent lock. If the parent needs to be
reallocated because it would be too large for the decremented number of
children, RCU-lookup parent-parent and take the parent-parent lock. Do
so recursively until no node reallocation is needed, or until root is
reached.

For each lock taken, check if node is being gc'd. If gc'd, release lock,
re-RCU-lookup this node, and retry.

C) Create

The branch (or portion of branch) consisting of taken locks necessarily
has a simple node removal or update as operation to do on its top node.

If the operation is a node removal, then, necessarily, the entire branch
under the node removal operation will simply disappear. No node
allocation is needed.

Else, if the operation is a child node reallocation, the child node will
necessarily do a node removal. So _its_ entire child branch will
disappear. So reallocate this child node without the removed branch
(remember to decrement its nr children count).

D) Populate

No reallocation case: simply set the appropriate child pointer in the
topmost locked node to NULL. Decrement its nr children count.

Reallocation case: set the child pointer in the topmost locked node to
the newly allocated node.
Set the gc flag in old nodes.
call_rcu free for all old nodes.

E) Locks

Release all locks.
rcu_read_unlock()
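
The "no reallocation" removal case can be sketched single-threaded as
below, under the same assumptions as a pigeon-hole-only tree with a
32-bit key (names are hypothetical, nodes other than the root are
assumed heap-allocated). The sketch finds the topmost node that remains
populated, clears one pointer in it, and frees the branch below. A
concurrent version would hold the locks described in step B, set the gc
flag, and defer the frees with call_rcu().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define JA_LEVELS	4	/* 32-bit key, 8 bits per level */

struct ja_node {
	unsigned int nr_child;
	void *child[256];
};

static unsigned int ja_byte(uint32_t key, int level)
{
	return (key >> (8 * level)) & 0xff;
}

/* Returns 0 on success, -1 if the leaf is missing. */
static int ja_remove(struct ja_node *root, uint32_t key)
{
	struct ja_node *path[JA_LEVELS];
	void *p = root;
	int top = 0;

	assert(root != NULL);

	/* A) Lookup: record the node visited at each level. */
	for (int level = JA_LEVELS - 1; level >= 0; level--) {
		if (!p)
			return -1;
		path[level] = p;
		p = path[level]->child[ja_byte(key, level)];
	}
	if (!p)
		return -1;	/* leaf missing */

	/*
	 * Climb while the node would become empty: its whole branch
	 * disappears with it. The root always stays.
	 */
	while (top < JA_LEVELS - 1 && path[top]->nr_child == 1)
		top++;

	/* D) Populate: a single store unlinks the branch. */
	path[top]->child[ja_byte(key, top)] = NULL;
	path[top]->nr_child--;

	/* Nodes below `top` are unreachable now (call_rcu in real code). */
	for (int level = 0; level < top; level++)
		free(path[level]);
	return 0;
}
```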


For the various types of nodes:

- sequential search (type A)
  - RCU replacement: mutex
  - Entry update: mutex

- pools of values and pointers arrays (type B)
  - RCU replacement: mutex
  - Entry update: mutex

- pigeon hole array (type C)
  - RCU replacement: mutex
  - Entry update: mutex