Some thoughts about userspace tracing

Mathieu Desnoyers, January 2006



* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
Mutexes are implemented with a short spin lock, followed by a yield : yet
another system call. In addition, we have no way to know on which CPU we are
running when in user mode, and we can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
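
As a minimal sketch of what such a reservation could look like (the struct
layout and function names here are hypothetical, not the actual LTTng ones), a
compare-and-swap loop on the write offset is enough : if a signal handler
interrupts the thread between the load and the CAS, the CAS fails and we
simply retry.

#include <stdatomic.h>
#include <stddef.h>

struct ltt_buf {
	_Atomic size_t write_offset;	/* next free byte in the buffer */
	size_t size;			/* total buffer size */
	char data[];			/* event data follows */
};

/* Reserve len bytes; returns the reserved offset, or -1 if the buffer
 * is full (the caller would then trigger a buffer switch). */
static ptrdiff_t reserve_slot(struct ltt_buf *buf, size_t len)
{
	size_t old = atomic_load(&buf->write_offset);
	size_t new;

	do {
		if (old + len > buf->size)
			return -1;	/* full : needs a buffer switch */
		new = old + len;
	} while (!atomic_compare_exchange_weak(&buf->write_offset, &old, new));
	return (ptrdiff_t)old;
}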

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That would
be somewhat of a hack, as I have never even seen such an interface anywhere
else. It may lead to problems related to exported types. The proper, but slow,
way to do it would be to have a system call that would return the tracing
status.

My suggestion is to go for a system call, but only call it :

- when the thread starts
- when receiving a SIGRTMIN+3 (multithread ?)

Note : save the thread ID (process ID) in the logging function and the update
handler. Compare it against the current ID to detect whether we are a forked
child thread, and start a brand new buffer list in that case.
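
A sketch of that fork check, assuming a hypothetical per-thread cached_pid
saved when tracing is set up : after a fork(), getpid() no longer matches the
saved value, so the child abandons the inherited buffers.

#include <sys/types.h>
#include <unistd.h>

static __thread pid_t cached_pid;	/* saved when tracing is set up */

static void trace_event(void /* event payload */)
{
	if (getpid() != cached_pid) {
		/* We are a forked child : abandon the inherited buffers
		 * and start a brand new buffer list (hypothetical). */
		cached_pid = getpid();
		/* start_new_buffer_list(); */
	}
	/* ... reserve space and write the event ... */
}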


Two possibilities :

- one system call per piece of information to get / one system call to get all
the information.
- one signal per piece of information to get / one signal for "update" tracing
info.

I would tend to adopt :

- One signal for "general tracing update"
One signal handler would clearly be enough; more would be unnecessary
overhead/pollution.
- One system call for all updates.
We will need multiple parameters though; a system call gives us up to 6
parameters.

syscall get_tracing_info

parameter 1 : trace buffer map address. (id)

parameter 2 : active ? (int)


Concurrency

We must have per-thread buffers. Then, no memory can be written to by two
threads at once. It removes the need for locks (ok, atomic reservation was
already doing that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. Allocation is done in the kernel with relay_open. User space mapping is
done when receiving the signal/starting the process and getting the number of
active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

We could have a syscall "get_next_buffer" that would basically mmap the next
unmapped buffer, or return NULL if all buffers are mapped.
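
The mapping loop would then be trivial; this is a sketch assuming hypothetical
wrappers get_next_buffer() (for the proposed syscall) and register_buffer()
(per-thread bookkeeping) :

#include <stddef.h>

extern void *get_next_buffer(void);	/* proposed syscall, hypothetical wrapper */
extern void register_buffer(void *buf);	/* hypothetical per-thread bookkeeping */

static void map_all_buffers(void)
{
	void *buf;

	/* NULL means every buffer is already mapped. */
	while ((buf = get_next_buffer()) != NULL)
		register_buffer(buf);
}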

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
  - the signal handler gets the new tracing status, which indicates that
    tracing is disabled for the specific trace. It writes this status in the
    tracing control structure of the process.
  - If tracing_level is 0, it's fine : there are no potential writers in the
    removed trace. It's up to us to buffer switch the removed trace and, after
    the control returns to us, set_tracing_info this page to NULL and delete
    this memory area.
  - Else (tracing_level > 0), flag the removed trace for later switch/delete.

  It then returns control to the process.

- If the tracing_level was > 0, there were one or more writers potentially
accessing this memory area. When the control comes back to the writer, at the
end of the write in a trace, if the trace is marked for switch/delete and the
tracing_level is 0 (after the decrement of the writer itself), then the
writer must buffer switch, and then delete the memory area.
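
A sketch of the writer side of this protocol, with hypothetical names
(tracing_level, flagged, do_buffer_switch, unmap_trace) : the last writer to
leave a trace flagged for switch/delete performs the deferred cleanup itself.

#include <stdatomic.h>
#include <stdbool.h>

struct trace {
	_Atomic int tracing_level;	/* nested writers : code + signal handlers */
	_Atomic bool flagged;		/* marked for later switch/delete */
};

static void trace_write(struct trace *t /* , event payload */)
{
	atomic_fetch_add(&t->tracing_level, 1);

	/* ... reserve space and write the event ... */

	/* fetch_sub returns the previous value : 1 means we just dropped
	 * the nesting level to 0 and no other writer remains. */
	if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
	    atomic_load(&t->flagged)) {
		/* do_buffer_switch(t);  -- hypothetical */
		/* unmap_trace(t);       -- hypothetical */
	}
}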


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 3 for the get tracing info : an integer containing the 32-bit mask.
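
A one-line sketch of how the mask could gate events, assuming each event
belongs to one of 32 hypothetical event classes :

static unsigned int filter_mask;	/* refreshed by the update handler */

static inline int event_enabled(unsigned int event_class)
{
	return filter_mask & (1U << event_class);
}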


Buffer switch

There could be a tracing_buffer_switch system call that would take the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of
the page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information before the thread crashes.

Memory protection

If a process corrupts its own mmapped buffers, the rest of the trace will still
be readable, since each process has its own memory space.

Two possibilities :

Either we create one channel per process, or we have per-cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another
process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may currently be
writing to it.

That leaves one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility of having mixed
32- and 64-bit processes on the same machine. Dealing with types will be
easier.
168 | |
169 | |
170 | Corrupted trace |
171 | |
172 | A corrupted tracefile will only affect one thread. The rest of the trace will |
173 | still be readable. |


Facilities

Upon process creation, or when receiving the trace info update signal and a new
trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all kept at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is duplicating the number of channels, which may become quite high. To follow
the general design of a high throughput channel and a low throughput channel
for vital information, I suggest having a separate channel for facilities, per
trace, per process.



API :

syscall 1 :

in :
buffer : NULL means get new traces
         non-NULL means get the information for the specified buffer
out :
buffer : returns the address of the trace buffer
active : is the trace active ?
filter : 32-bit filter mask

return : 0 on success, 1 on error.

int ltt_update(void **buffer, int *active, int *filter);

syscall 2 :

in :
buffer : Switch the specified buffer.
return : 0 on success, 1 on error.

int ltt_switch(void *buffer);


Signal :

SIGRTMIN+3
(sent to the specific thread, like a hardware fault or an expiring timer; see
p. 413 of Advanced Programming in the UNIX Environment)

The signal is sent on trace create/destroy, start/stop and filter change.

Each thread will update its own state only : it removes unnecessary
concurrency.
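
A sketch tying the signal to syscall 1, under the assumption that passing in a
NULL buffer repeatedly returns each new trace until none is left (the handler
and wrapper names are hypothetical) :

#include <signal.h>
#include <stddef.h>
#include <string.h>

extern int ltt_update(void **buffer, int *active, int *filter);

static void update_handler(int sig)
{
	void *buf = NULL;
	int active, filter;

	(void)sig;
	/* Refresh the tracing status for this thread only. */
	while (ltt_update(&buf, &active, &filter) == 0 && buf != NULL) {
		/* record buf, active and filter in the per-thread
		 * tracing control structure (hypothetical) */
		buf = NULL;	/* ask for the next new trace */
	}
}

static void install_update_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = update_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGRTMIN + 3, &sa, NULL);
}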


Notes :

It doesn't matter "when" the process receives the update signal after a trace
start : it will receive it in priority, before executing anything else when it
is scheduled in.



Major enhancement :

* Buffer pool *

The problem with the design, up to now, is that if a heavily threaded
application launches many short-lived threads, it will allocate memory for
each traced thread, consuming time, and it will create an incredibly high
number of files in the trace (one or more per thread).

(thanks to Matthew Khouzam)
The solution to this sits in the use of a buffer pool : we typically create a
buffer pool of a specified size (say, 10 buffers by default, alterable by the
user), each 8k in size (4k for the normal trace, 4k for the facility channel),
for a total of 80kB of memory. It has to be tweaked to the maximum number of
expected threads running at once, or it will have to grow dynamically (thus
impacting the trace).

A typical approach to dynamic growth is to double the number of allocated
buffers each time a threshold near the limit is reached.

Each channel would be found as :

trace_name/user/facilities_0
trace_name/user/cpu_0
trace_name/user/facilities_1
trace_name/user/cpu_1
...

When a thread asks to be traced, it gets a buffer from the free buffer pool. If
the number of available buffers falls under a threshold, the pool is marked for
expansion and the thread gets its buffer quickly. The expansion will be
executed a little bit later by a worker thread. If, however, the number of
available buffers is 0, an "emergency" reservation will be done, allocating
only one buffer. The goal of this is to impact the thread fork time as little
as possible.
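
A sketch of that acquisition path, with hypothetical names throughout (the
mutex is acceptable here because this runs at thread creation, not in the
tracing fast path) :

#include <pthread.h>
#include <stddef.h>

#define POOL_LOW_WATERMARK 2	/* hypothetical threshold */

struct buffer_pool {
	pthread_mutex_t lock;
	void **free_bufs;	/* stack of free buffers */
	int nr_free;
	int expanding;		/* expansion already requested ? */
};

extern void *alloc_one_buffer(void);	/* hypothetical emergency path */
extern void wake_pool_worker(struct buffer_pool *p);	/* hypothetical : grows the pool */

static void *pool_get_buffer(struct buffer_pool *p)
{
	void *buf;

	pthread_mutex_lock(&p->lock);
	if (p->nr_free == 0) {
		/* Emergency reservation : allocate a single buffer so the
		 * forking thread is delayed as little as possible. */
		pthread_mutex_unlock(&p->lock);
		return alloc_one_buffer();
	}
	buf = p->free_bufs[--p->nr_free];
	if (p->nr_free < POOL_LOW_WATERMARK && !p->expanding) {
		/* Let a worker thread grow the pool later; the current
		 * thread returns immediately with its buffer. */
		p->expanding = 1;
		wake_pool_worker(p);
	}
	pthread_mutex_unlock(&p->lock);
	return buf;
}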

When a thread releases a buffer (the thread terminates), a buffer switch is
performed, so the data can be flushed to disk and no other thread will mess
with it or render the buffer unreadable.

Upon trace creation, the pre-allocated pool is allocated. Upon trace
destruction, the threads are first informed of the trace destruction, any
pending worker thread (for pool allocation) is cancelled and then the pool is
released. Buffers used by threads at this moment but not mapped for reading
will simply be destroyed (as their refcount will fall to 0). It means that
between the "trace stop" and "trace destroy", there should be enough time to
let the lttd daemon open the newly created channels or they will be lost.

Upon buffer switch, the reader can read directly from the buffer. Note that
when the reader finishes reading a buffer, if the associated writer thread has
exited, it must fill the buffer with zeroes and put it back into the free pool.
In the case where the trace is destroyed, it must just decrement its refcount
(as it would do otherwise) and the buffer will be destroyed.

This pool will reduce the number of trace files created to the order of the
number of threads present in the system at a given time.

A worst-case scenario is 32768 processes traced at the same time, for a total
amount of 256MB of buffers. If a machine has that many threads, it probably has
enough memory to handle this.

In flight recorder mode, it would be interesting to use an LRU algorithm to
choose which buffer from the pool to take for a newly forked thread. A simple
queue would do.

SMP : per-cpu pools ? -> no, L1 and L2 caches are typically too small to be
impacted by whether a reused buffer is on a different or the same CPU.