Some thoughts about userspace tracing

Mathieu Desnoyers January 2006


* Goals

Fast and secure user space tracing.

Fast : 

- 5000ns for a system call is too long. Writing an event directly to memory
	takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
	Mutexes are implemented with a short spin lock followed by a yield, which
	is yet another system call. In addition, we have no way to know on which
	CPU we are running when in user mode. We can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
	process's trace. It should be limited to its own memory space.


* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
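
A minimal sketch of what atomic space reservation could look like, assuming C11
atomics and a single 4k buffer. The names trace_buf, trace_reserve and
trace_write are hypothetical, not an existing API. A writer claims its slot with
a compare-and-swap loop, so a signal handler that interrupts it simply gets the
next slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TRACE_BUF_SIZE 4096            /* assumption: one 4k page per buffer */
#define TRACE_NO_SPACE ((size_t)-1)    /* hypothetical "buffer full" marker */

static_assert(TRACE_BUF_SIZE > 0, "buffer must not be empty");

struct trace_buf {
	atomic_size_t reserved;            /* offset of the first free byte */
	unsigned char data[TRACE_BUF_SIZE];
};

/*
 * Reserve len bytes. Returns the offset of the reserved slot, or
 * TRACE_NO_SPACE when the event does not fit (the caller would then
 * do a buffer switch).
 */
static size_t trace_reserve(struct trace_buf *buf, size_t len)
{
	size_t old = atomic_load(&buf->reserved);

	do {
		/* reserved never exceeds TRACE_BUF_SIZE, so this cannot underflow */
		if (len > TRACE_BUF_SIZE - old)
			return TRACE_NO_SPACE;
		/* on failure, old is reloaded with the current value and we retry */
	} while (!atomic_compare_exchange_weak(&buf->reserved, &old, old + len));

	return old;
}

/* Copy one event into its own reserved slot. Returns 0, or -1 if full. */
static int trace_write(struct trace_buf *buf, const void *event, size_t len)
{
	size_t off = trace_reserve(buf, len);

	if (off == TRACE_NO_SPACE)
		return -1;
	memcpy(buf->data + off, event, len);
	return 0;
}
```

Since each writer only touches the bytes it reserved, no lock and no signal
masking is needed. The sketch leaves out event headers and the commit step that
would tell a reader when a slot is fully written.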

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That would
probably be seen as a hack, as I have never seen such an interface anywhere
else. It may lead to problems related to exported types. Or we take the proper,
but slow, route : a system call that returns the tracing status.

My suggestion is to go for a system call, but only call it :

- when the process starts
- when receiving a SIG_UPDTRACING

Two possibilities :

- one system call per piece of information, or one system call to get all of it.
- one signal per piece of information, or one signal for "update" tracing info.

I would tend to adopt :

- One signal for "general tracing update"
	One signal handler would clearly be enough, more would be unnecessary
	overhead/pollution.
- One system call for all updates.
	We will need to have multiple parameters though. We have up to 6 parameters.

syscall get_tracing_info

first parameter : active traces mask (32 bits : 32 traces).
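
As an illustration, here is how user space might decode the active traces mask
once get_tracing_info has returned it. The helper names are hypothetical, and
the bit numbering (bit N set means trace N is active) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRACES 32

static_assert(MAX_TRACES == 32, "the mask is a single 32-bit word");

/* Nonzero if trace_id is active in the mask (assumed: bit N = trace N). */
static int trace_is_active(uint32_t active_mask, unsigned int trace_id)
{
	return trace_id < MAX_TRACES && ((active_mask >> trace_id) & 1u);
}

/* Number of active traces, i.e. the number of buffers a thread needs. */
static unsigned int count_active_traces(uint32_t active_mask)
{
	unsigned int n = 0;

	while (active_mask) {
		active_mask &= active_mask - 1;   /* clear the lowest set bit */
		n++;
	}
	return n;
}
```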


Concurrency

We must have per thread buffers. Then, no memory can be written by two threads
at once. It removes the need for locks (ok, atomic reservation was already doing
that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. The only thing is that the buffers will only be allocated when receiving
the signal/starting the process and getting the number of active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

When adding a new buffer, we should call the set_tracing_info syscall and give
the new buffers array to the kernel. It's an array of 32 pointers to user pages.
They will be used by the kernel to get the last pages when the thread dies.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
	- the signal handler gets the new tracing status, which tells that tracing
		is disabled for the specific trace. It writes this status in the tracing
		control structure of the process.
	- If tracing_level is 0, well, it's fine : there are no potential writers in
		the removed trace. It's up to us to buffer switch the removed trace, and,
		after the control returns to us, set_tracing_info this page to NULL and
		delete this memory area.
	- Else (tracing_level > 0), flag the removed trace for later switch/delete.
	
	It then returns control to the process.

- If the tracing_level was > 0, there were one or more writers potentially
	accessing this memory area. When the control comes back to the writer, at the
	end of the write in a trace, if the trace is marked for switch/delete and the
	tracing_level is 0 (after the decrement of the writer itself), then the
	writer must buffer switch, set_tracing_info to NULL and then delete the
	memory area.
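
The two operations above can be sketched as follows. This is only an
illustration under stated assumptions : trace_ctl, writer_enter, writer_exit and
on_trace_disabled are hypothetical names, and release_trace is a stub standing
in for the buffer switch, the set_tracing_info to NULL and the deletion of the
memory area.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct trace_ctl {
	atomic_int  tracing_level;   /* nesting depth of writers in this trace */
	atomic_bool active;          /* tracing status, set by the signal handler */
	atomic_bool pending_delete;  /* removed while writers were still inside */
	void *page;                  /* current buffer page, NULL once released */
};

/* Stub : a real version would buffer switch, set_tracing_info to NULL,
 * then unmap the area. Here we only forget the page. */
static void release_trace(struct trace_ctl *t)
{
	t->page = NULL;
	atomic_store(&t->pending_delete, false);
}

/* End of a write : the last writer out does the deferred switch/delete. */
static void writer_exit(struct trace_ctl *t)
{
	int prev = atomic_fetch_sub(&t->tracing_level, 1);

	assert(prev > 0);            /* exit without a matching enter is a bug */
	if (prev == 1 && atomic_load(&t->pending_delete))
		release_trace(t);
}

/* Start of a write : returns false if the trace is no longer active. */
static bool writer_enter(struct trace_ctl *t)
{
	atomic_fetch_add(&t->tracing_level, 1);
	if (atomic_load(&t->active))
		return true;
	writer_exit(t);
	return false;
}

/* Called from the update signal handler when the trace was disabled. */
static void on_trace_disabled(struct trace_ctl *t)
{
	atomic_store(&t->active, false);
	if (atomic_load(&t->tracing_level) == 0)
		release_trace(t);                         /* no potential writer */
	else
		atomic_store(&t->pending_delete, true);   /* flag for later */
}
```

The level is incremented before the status is checked, so a handler that runs
between the two steps sees a non-zero tracing_level and defers the deletion
instead of pulling the page from under the writer.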


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

Second parameter of get_tracing_info : array of 32 ints (32 bits each).
Each integer is the filter mask for a trace. As there are up to 32 active
traces, we have 32 integers for filter.
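
A small sketch of how a writer could consult this information before logging an
event. The struct tracing_info layout and the event_passes helper are
hypothetical. In particular, the meaning of a filter bit is not defined above :
the sketch assumes that bit N of a filter mask enables event category N.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TRACES 32

static_assert(MAX_TRACES == 32, "one filter word per bit of the active mask");

/* Hypothetical per-thread copy of what get_tracing_info returned. */
struct tracing_info {
	uint32_t active_mask;          /* first parameter : active traces */
	uint32_t filter[MAX_TRACES];   /* second parameter : one mask per trace */
};

/* Nonzero if an event of the given category must be written to trace_id.
 * Assumption : bit N of the filter mask enables category N. */
static int event_passes(const struct tracing_info *ti,
			unsigned int trace_id, unsigned int category)
{
	if (trace_id >= MAX_TRACES || category >= 32)
		return 0;
	if (!((ti->active_mask >> trace_id) & 1u))
		return 0;                  /* trace not active */
	return (ti->filter[trace_id] >> category) & 1u;
}
```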


Buffer switch

There could be a tracing_buffer_switch system call that would take the page
start address as a parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of the
page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information before the thread crashes.

syscall set_tracing_info

parameter 1 : array of 32 user space pointers to current pages or NULL.


Memory protection

We want each process to be unable to make a trace unreadable, and each process
to have its own memory space.

Two possibilities :

Either we create one channel per process, or we have per cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may be currently
writing to it.

It leaves the one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility to have mixed
32-bit and 64-bit processes on the same machine. Dealing with types will be
easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when a
new trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all done at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is that it doubles the number of channels, which may become quite high. To
follow the general design of a high throughput channel and a low throughput
channel for vital information, I suggest having a separate channel for
facilities, per trace, per process.