Some thoughts about userspace tracing

Mathieu Desnoyers January 2006



* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
  takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
  Mutexes are implemented with a short spin lock, followed by a yield : yet
  another system call. In addition, we have no way to know on which CPU we are
  running when in user mode. We can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
  process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
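
As an illustration of the atomic reservation idea, here is a minimal sketch
using C11 atomics. The names trace_buf, trace_reserve and BUF_SIZE are
hypothetical and not part of the design above. It assumes atomic_size_t is
lock-free on the target, so the loop stays safe when a signal handler nests
inside it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 4096  /* assumption: one 4k page per buffer */

/* Hypothetical per-thread trace buffer. */
struct trace_buf {
    atomic_size_t offset;          /* next free byte in data[] */
    unsigned char data[BUF_SIZE];
};

/*
 * Reserve len bytes in the buffer without taking a lock.
 * Returns a pointer to the reserved slot, or NULL when the buffer is full
 * (the caller would then request a buffer switch).
 */
static void *trace_reserve(struct trace_buf *buf, size_t len)
{
    size_t old = atomic_load(&buf->offset);

    assert(buf != NULL);
    do {
        if (len > BUF_SIZE - old)
            return NULL;
        /* On failure, old is reloaded with the current offset and we retry. */
    } while (!atomic_compare_exchange_weak(&buf->offset, &old, old + len));

    return &buf->data[old];
}
```

If a signal handler reserves space between the load and the compare-exchange,
the exchange fails, the offset is reloaded and the interrupted writer simply
retries : each writer ends up with its own disjoint slot.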

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That would
be seen as somewhat of a hack, as I have never seen such an interface anywhere
else. It may lead to problems related to exported types. The proper, but slow,
way to do it would be to have a system call that would return the tracing
status.

My suggestion is to go for a system call, but only call it :

- when the thread starts
- when receiving a SIGRTMIN+3 (multithread ?)

Note : save the thread ID (process ID) in the logging function and the update
handler. Use it as a comparison to check if we are a forked child thread.
Start a brand new buffer list in that case.
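
The fork check in the note above could look like the following sketch. The
structure and function names are hypothetical, and getpid() is used as the
saved ID only for illustration (a per-thread ID would be needed for the
multithreaded case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical tracing state; nr_buffers stands in for the real buffer list. */
struct trace_state {
    pid_t owner;       /* ID saved when the buffer list was created */
    int   nr_buffers;
};

/* True when the caller is not the task that created the state (forked child). */
static bool trace_state_is_stale(const struct trace_state *st, pid_t self)
{
    return st->owner != self;
}

/* Called at the top of the logging function and of the update handler. */
static void trace_state_check(struct trace_state *st)
{
    pid_t self = getpid();

    assert(st != NULL);
    if (trace_state_is_stale(st, self)) {
        st->owner = self;
        st->nr_buffers = 0;  /* drop the inherited list, start a brand new one */
    }
}
```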


Two possibilities :

- one system call per piece of information, or one system call to get all
  the information.
- one signal per piece of information, or one signal for a general "update
  tracing info".

I would tend to adopt :

- One signal for "general tracing update"
  One signal handler would clearly be enough, more would be unnecessary
  overhead/pollution.
- One system call for all updates.
  We will need multiple parameters though. A system call can take up to 6
  parameters.

syscall get_tracing_info

parameter 1 : trace buffer map address. (id)

parameter 2 : active ? (int)


Concurrency

We must have per thread buffers. Then, no memory can be written by two threads
at once. It removes the need for locks (ok, atomic reservation was already doing
that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. Allocation is done in the kernel with relay_open. User space mapping is
done when receiving the signal/starting the process and getting the number of
active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

We could have a syscall "get_next_buffer" that would basically mmap the next
unmapped buffer, or return NULL if all buffers are mapped.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
  - the sig handler gets the new tracing status, which tells that tracing is
    disabled for the specific trace. It writes this status in the tracing
    control structure of the process.
  - If tracing_level is 0, well, it's fine : there are no potential writers in
    the removed trace. It's up to us to buffer switch the removed trace, and,
    after the control returns to us, set_tracing_info this page to NULL and
    delete this memory area.
  - Else (tracing_level > 0), flag the removed trace for later switch/delete.

  It then returns control to the process.

- If the tracing_level was > 0, there was one or more writers potentially
  accessing this memory area. When the control comes back to the writer, at the
  end of the write in a trace, if the trace is marked for switch/delete and the
  tracing_level is 0 (after the decrement of the writer itself), then the
  writer must buffer switch, and then delete the memory area.
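
The two-step removal can be sketched as follows. Only tracing_level comes from
the text; the other fields and all function names are hypothetical, and
trace_release() merely counts calls where the real code would buffer switch and
delete the memory area. The sketch assumes the update handler runs in the
thread that owns the buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct trace_ctl {
    atomic_int  tracing_level;   /* nesting depth inside tracing code */
    atomic_bool active;          /* cleared when the trace is removed */
    atomic_bool pending_delete;  /* removal deferred to the last writer */
    int         released;        /* counts releases, for illustration only */
};

static void trace_ctl_init(struct trace_ctl *t)
{
    atomic_init(&t->tracing_level, 0);
    atomic_init(&t->active, true);
    atomic_init(&t->pending_delete, false);
    t->released = 0;
}

/* Stand-in for "buffer switch, then delete the memory area". */
static void trace_release(struct trace_ctl *t)
{
    assert(atomic_load(&t->tracing_level) == 0);
    t->released++;
}

/* Update signal handler path : the kernel reported this trace as removed. */
static void trace_on_removed(struct trace_ctl *t)
{
    atomic_store(&t->active, false);
    if (atomic_load(&t->tracing_level) == 0)
        trace_release(t);                        /* no writer in flight */
    else
        atomic_store(&t->pending_delete, true);  /* last writer cleans up */
}

/* End of a write : the writer leaving level 0 performs any deferred removal. */
static void trace_write_end(struct trace_ctl *t)
{
    if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
        atomic_exchange(&t->pending_delete, false))
        trace_release(t);
}

/* Start of a write : returns false when the trace must not be written to. */
static bool trace_write_begin(struct trace_ctl *t)
{
    atomic_fetch_add(&t->tracing_level, 1);
    if (!atomic_load(&t->active)) {
        trace_write_end(t);
        return false;
    }
    return true;
}
```

The level is incremented before the active flag is tested, so a removal signal
arriving between the two steps sees tracing_level > 0 and defers the deletion
to the writer.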


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 3 for the get tracing info : an integer containing the 32-bit mask.
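
The text does not say what each bit of the mask means. As a sketch only,
assuming bit n enables a hypothetical event category n, the check done by the
logging function could be as cheap as :

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption : bit n of the mask enables event category n (0..31). */
static bool trace_filter_pass(uint32_t mask, unsigned category)
{
    assert(category < 32);
    return ((mask >> category) & 1u) != 0;
}
```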


Buffer switch

There could be a tracing_buffer_switch system call, that would give the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of the
page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information before the thread crashes.

Memory protection

If a process corrupts its own mmapped buffers, the rest of the trace will still
be readable, since each process has its own memory space.

Two possibilities :

Either we create one channel per process, or we have per cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may be currently
writing to it.

It leaves the one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility to have mixed
32-64 bits processes on the same machine. Dealing with types will be easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when a
new trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all done at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is to double the number of channels, which may become quite high. To follow
the general design of a high throughput channel and a low throughput channel for
vital information, I suggest having a separate channel for facilities, per
trace, per process.



API :

syscall 1 :

in :
buffer : NULL means get new traces
         non NULL means to get the information for the specified buffer
out :
buffer : returns the address of the trace buffer
active : is the trace active ?
filter : 32 bits filter mask

return : 0 on success, 1 on error.

int ltt_update(void **buffer, int *active, int *filter);

syscall 2 :

in :
buffer : Switch the specified buffer.
return : 0 on success, 1 on error.

int ltt_switch(void *buffer);
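
To show how a thread might use the first call to discover new traces, here is a
sketch built on a user space mock. mock_ltt_update() only imitates the proposed
ltt_update() prototype, since the system call is a proposal and does not exist.
The convention that the buffer stays NULL when no new trace is available is an
assumption borrowed from the "get_next_buffer" idea, and all other names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACES 8

struct trace_info {
    void *buffer;
    int   active;
    int   filter;
};

/* Mock kernel side : hands out two fake buffers, then reports no new trace. */
static char fake_pages[2][4096];
static int  fake_next;

static int mock_ltt_update(void **buffer, int *active, int *filter)
{
    if (buffer == NULL || active == NULL || filter == NULL)
        return 1;                               /* error */
    if (*buffer == NULL && fake_next < 2)
        *buffer = fake_pages[fake_next++];      /* map the next new trace */
    if (*buffer != NULL) {
        *active = 1;
        *filter = 0xff;
    }
    return 0;                                   /* success */
}

/* Ask for new traces until none is left; returns how many were found. */
static int discover_traces(struct trace_info *out, int max)
{
    int n = 0;

    assert(out != NULL);
    while (n < max) {
        void *buf = NULL;
        int active = 0, filter = 0;

        if (mock_ltt_update(&buf, &active, &filter) != 0 || buf == NULL)
            break;
        out[n++] = (struct trace_info){ buf, active, filter };
    }
    return n;
}
```

Such a loop would run when the thread starts and again from the update signal
handler.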


Signal :

SIGRTMIN+3
(like hardware fault and expiring timer : to the thread, see p. 413 of Advanced
Programming in the UNIX Environment)

Signal is sent on tracing create/destroy, start/stop and filter change.

Each thread updates itself only : this removes unnecessary concurrency.



Notes :

It doesn't matter "when" the process receives the update signal after a trace
start : it will receive it in priority, before executing anything else when it
is scheduled in.



Major enhancement :

* Buffer pool *

The problem with the design, up to now, is that if a heavily threaded
application launches many short-lived threads, it will allocate memory for
each traced thread, consuming time, and it will create an incredibly high
number of files in the trace (or per thread).

(thanks to Matthew Khouzam)
The solution to this sits in the use of a buffer pool : We typically create a
buffer pool of a specified size (say, 10 buffers by default, alterable by the
user), each 8k in size (4k for normal trace, 4k for facility channel), for a
total of 80kB of memory. It has to be tweaked to the maximum number of
expected threads running at once, or it will have to grow dynamically (thus
impacting on the trace).

A typical approach to dynamic growth is to double the number of allocated
buffers each time a threshold near the limit is reached.

Each channel would be found as :

trace_name/user/facilities_0
trace_name/user/cpu_0
trace_name/user/facilities_1
trace_name/user/cpu_1
...

When a thread asks to be traced, it gets a buffer from the free buffer pool. If
the number of available buffers falls under a threshold, the pool is marked for
expansion and the thread gets its buffer quickly. The expansion will be executed
a little bit later by a worker thread. If, however, the number of available
buffers is 0, then an "emergency" reservation will be done, allocating only one
buffer. The goal of this is to affect the thread fork time as little as
possible.
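
The allocation policy just described can be sketched in user space as follows.
The real pool would live in the kernel; calloc() stands in for the actual
allocator, the worker thread is reduced to an expand_pending flag, and every
name here is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define BUF_BYTES 8192  /* 4k normal trace + 4k facility channel */
#define POOL_MAX  64    /* arbitrary upper bound for this sketch */

struct buf_pool {
    void  *free_bufs[POOL_MAX];  /* stack of free buffers */
    size_t nr_free;
    size_t threshold;            /* below this, mark the pool for expansion */
    bool   expand_pending;       /* a worker thread would act on this later */
    size_t emergency_allocs;     /* times the pool was found empty */
};

static bool pool_init(struct buf_pool *p, size_t n, size_t threshold)
{
    assert(n <= POOL_MAX);
    p->nr_free = 0;
    p->threshold = threshold;
    p->expand_pending = false;
    p->emergency_allocs = 0;
    for (size_t i = 0; i < n; i++) {
        void *b = calloc(1, BUF_BYTES);
        if (b == NULL) {
            while (p->nr_free > 0)
                free(p->free_bufs[--p->nr_free]);
            return false;
        }
        p->free_bufs[p->nr_free++] = b;
    }
    return true;
}

/* Fast path for a new thread : never waits for the pool to grow. */
static void *pool_get(struct buf_pool *p)
{
    if (p->nr_free == 0) {
        p->emergency_allocs++;
        return calloc(1, BUF_BYTES);  /* emergency : exactly one buffer */
    }
    void *b = p->free_bufs[--p->nr_free];
    if (p->nr_free < p->threshold)
        p->expand_pending = true;     /* expansion is deferred */
    return b;
}

/* Return a buffer once its data has been read : zero it, then recycle it. */
static void pool_put(struct buf_pool *p, void *buf)
{
    memset(buf, 0, BUF_BYTES);
    if (p->nr_free < POOL_MAX)
        p->free_bufs[p->nr_free++] = buf;
    else
        free(buf);
}
```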

When a thread releases a buffer (the thread terminates), a buffer switch is
performed, so the data can be flushed to disk and no other thread will mess
with it or render the buffer unreadable.

Upon trace creation, the pre-allocated pool is allocated. Upon trace
destruction, the threads are first informed of the trace destruction, any
pending worker thread (for pool allocation) is cancelled and then the pool is
released. Buffers used by threads at this moment but not mapped for reading
will be simply destroyed (as their refcount will fall to 0). It means that
between the "trace stop" and "trace destroy", there should be enough time to let
the lttd daemon open the newly created channels or they will be lost.

Upon buffer switch, the reader can read directly from the buffer. Note that when
the reader finishes reading a buffer, if the associated thread writer has
exited, it must fill the buffer with zeroes and put it back into the free pool.
In the case where the trace is destroyed, it must just decrement its refcount
(as it would do otherwise) and the buffer will be destroyed.

This pool will reduce the number of trace files created to the order of the
number of threads present in the system at a given time.

A worst case scenario is 32768 processes traced at the same time, for a total
amount of 256MB of buffers. If a machine has so many threads, it probably has
enough memory to handle this.
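
The sizes quoted for the pool can be checked with a few lines of arithmetic,
taking 8k as 8192 bytes. The helper names are hypothetical, and the doubling
count simply applies the "double the number of buffers" growth policy to the
default pool of 10 buffers.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_BYTES (8u * 1024u)  /* 4k trace + 4k facility channel */

/* Memory used by a pool of nr_buffers buffers, in bytes. */
static size_t pool_bytes(size_t nr_buffers)
{
    return nr_buffers * BUF_BYTES;
}

/* Number of doublings needed to grow from start to at least target buffers. */
static unsigned doublings_needed(size_t start, size_t target)
{
    unsigned n = 0;

    assert(start > 0);
    while (start < target) {
        start *= 2;
        n++;
    }
    return n;
}
```

With these definitions, pool_bytes(10) is 80 * 1024 bytes, pool_bytes(32768) is
256 * 1024 * 1024 bytes, and doublings_needed(10, 32768) is 12.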

In flight recorder mode, it would be interesting to use an LRU algorithm to
choose which buffer from the pool we must take for a newly forked thread. A
simple queue would do it.

SMP : per cpu pools ? -> no, L1 and L2 caches are typically too small to be
impacted by the fact that a reused buffer is on a different or the same CPU.