Some thoughts about userspace tracing

Mathieu Desnoyers, January 2006



* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
Mutexes are implemented with a short spin lock, followed by a yield : yet
another system call. In addition, we have no way to know on which CPU we are
running when in user mode, and we can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) will remove the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
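
As a minimal sketch of what such a reservation could look like (the struct
layout and function names here are hypothetical, not the actual LTTng ones), a
compare-and-swap loop on the write offset is enough : if a signal handler
interrupts the thread between the load and the CAS, the CAS fails and we
simply retry.

#include <stdatomic.h>
#include <stddef.h>

struct ltt_buf {
	_Atomic size_t write_offset;	/* next free byte in the buffer */
	size_t size;			/* total buffer size */
	char data[];			/* event data follows */
};

/* Reserve len bytes; returns the reserved offset, or -1 if the buffer
 * is full (the caller would then trigger a buffer switch). */
static ptrdiff_t reserve_slot(struct ltt_buf *buf, size_t len)
{
	size_t old = atomic_load(&buf->write_offset);
	size_t new;

	do {
		if (old + len > buf->size)
			return -1;	/* full : needs a buffer switch */
		new = old + len;
	} while (!atomic_compare_exchange_weak(&buf->write_offset, &old, new));
	return (ptrdiff_t)old;
}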

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space. That would
be somewhat of a hack, as I have never even seen such an interface anywhere
else. It may lead to problems related to exported types. The proper, but slow,
way to do it would be to have a system call that would return the tracing
status.

My suggestion is to go for a system call, but only call it :

- when the thread starts
- when receiving a SIGRTMIN+3 (multithread ?)

Note : save the thread ID (process ID) in the logging function and the update
handler. Compare it against the current ID to detect whether we are a forked
child thread, and start a brand new buffer list in that case.
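
A sketch of that fork check, assuming a hypothetical per-thread cached_pid
saved when tracing is set up : after a fork(), getpid() no longer matches the
saved value, so the child abandons the inherited buffers.

#include <sys/types.h>
#include <unistd.h>

static __thread pid_t cached_pid;	/* saved when tracing is set up */

static void trace_event(void /* event payload */)
{
	if (getpid() != cached_pid) {
		/* We are a forked child : abandon the inherited buffers
		 * and start a brand new buffer list (hypothetical). */
		cached_pid = getpid();
		/* start_new_buffer_list(); */
	}
	/* ... reserve space and write the event ... */
}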


Two possibilities :

- one system call per piece of information to get / one system call to get all
the information.
- one signal per piece of information to get / one signal for "update" tracing
info.

I would tend to adopt :

- One signal for "general tracing update"
One signal handler would clearly be enough; more would be unnecessary
overhead/pollution.
- One system call for all updates.
We will need multiple parameters though; a system call gives us up to 6
parameters.

syscall get_tracing_info

parameter 1 : trace buffer map address. (id)

parameter 2 : active ? (int)


Concurrency

We must have per-thread buffers. Then, no memory can be written to by two
threads at once. It removes the need for locks (ok, atomic reservation was
already doing that) and removes false sharing.


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. Allocation is done in the kernel with relay_open. User space mapping is
done when receiving the signal/starting the process and getting the number of
active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

We could have a syscall "get_next_buffer" that would basically mmap the next
unmapped buffer, or return NULL if all buffers are mapped.
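
The mapping loop would then be trivial; this is a sketch assuming hypothetical
wrappers get_next_buffer() (for the proposed syscall) and register_buffer()
(per-thread bookkeeping) :

#include <stddef.h>

extern void *get_next_buffer(void);	/* proposed syscall, hypothetical wrapper */
extern void register_buffer(void *buf);	/* hypothetical per-thread bookkeeping */

static void map_all_buffers(void)
{
	void *buf;

	/* NULL means every buffer is already mapped. */
	while ((buf = get_next_buffer()) != NULL)
		register_buffer(buf);
}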

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
  - the signal handler gets the new tracing status, which indicates that
    tracing is disabled for the specific trace. It writes this status in the
    tracing control structure of the process.
  - If tracing_level is 0, it's fine : there are no potential writers in the
    removed trace. It's up to us to buffer switch the removed trace and, after
    the control returns to us, set_tracing_info this page to NULL and delete
    this memory area.
  - Else (tracing_level > 0), flag the removed trace for later switch/delete.

  It then returns control to the process.

- If the tracing_level was > 0, there were one or more writers potentially
accessing this memory area. When the control comes back to the writer, at the
end of the write in a trace, if the trace is marked for switch/delete and the
tracing_level is 0 (after the decrement of the writer itself), then the
writer must buffer switch, and then delete the memory area.
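
A sketch of the writer side of this protocol, with hypothetical names
(tracing_level, flagged, do_buffer_switch, unmap_trace) : the last writer to
leave a trace flagged for switch/delete performs the deferred cleanup itself.

#include <stdatomic.h>
#include <stdbool.h>

struct trace {
	_Atomic int tracing_level;	/* nested writers : code + signal handlers */
	_Atomic bool flagged;		/* marked for later switch/delete */
};

static void trace_write(struct trace *t /* , event payload */)
{
	atomic_fetch_add(&t->tracing_level, 1);

	/* ... reserve space and write the event ... */

	/* fetch_sub returns the previous value : 1 means we just dropped
	 * the nesting level to 0 and no other writer remains. */
	if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
	    atomic_load(&t->flagged)) {
		/* do_buffer_switch(t);  -- hypothetical */
		/* unmap_trace(t);       -- hypothetical */
	}
}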


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 3 for the get tracing info : an integer containing the 32-bit mask.
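
A one-line sketch of how the mask could gate events, assuming each event
belongs to one of 32 hypothetical event classes :

static unsigned int filter_mask;	/* refreshed by the update handler */

static inline int event_enabled(unsigned int event_class)
{
	return filter_mask & (1U << event_class);
}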


Buffer switch

There could be a tracing_buffer_switch system call that would take the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of
the page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information before the thread crashes.

Memory protection

If a process corrupts its own mmapped buffers, the rest of the trace will still
be readable, since each process has its own memory space.

Two possibilities :

Either we create one channel per process, or we have per-cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another
process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may currently be
writing to it.

That leaves one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility of having mixed
32- and 64-bit processes on the same machine. Dealing with types will be
easier.
168 | |
169 | |
170 | Corrupted trace |
171 | |
172 | A corrupted tracefile will only affect one thread. The rest of the trace will |
173 | still be readable. |


Facilities

Upon process creation, or when receiving the trace info update signal and a new
trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all kept at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is duplicating the number of channels, which may become quite high. To follow
the general design of a high throughput channel and a low throughput channel
for vital information, I suggest having a separate channel for facilities, per
trace, per process.



API :

syscall 1 :

in :
buffer : NULL means get new traces
         non-NULL means get the information for the specified buffer
out :
buffer : returns the address of the trace buffer
active : is the trace active ?
filter : 32-bit filter mask

return : 0 on success, 1 on error.

int ltt_update(void **buffer, int *active, int *filter);

syscall 2 :

in :
buffer : Switch the specified buffer.
return : 0 on success, 1 on error.

int ltt_switch(void *buffer);


Signal :

SIGRTMIN+3
(sent to the specific thread, like a hardware fault or an expiring timer; see
p. 413 of Advanced Programming in the UNIX Environment)

The signal is sent on trace create/destroy, start/stop and filter change.

Each thread will update its own state only : it removes unnecessary
concurrency.
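
A sketch tying the signal to syscall 1, under the assumption that passing in a
NULL buffer repeatedly returns each new trace until none is left (the handler
and wrapper names are hypothetical) :

#include <signal.h>
#include <stddef.h>
#include <string.h>

extern int ltt_update(void **buffer, int *active, int *filter);

static void update_handler(int sig)
{
	void *buf = NULL;
	int active, filter;

	(void)sig;
	/* Refresh the tracing status for this thread only. */
	while (ltt_update(&buf, &active, &filter) == 0 && buf != NULL) {
		/* record buf, active and filter in the per-thread
		 * tracing control structure (hypothetical) */
		buf = NULL;	/* ask for the next new trace */
	}
}

static void install_update_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = update_handler;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGRTMIN + 3, &sa, NULL);
}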


Notes :

It doesn't matter "when" the process receives the update signal after a trace
start : it will receive it in priority, before executing anything else when it
is scheduled in.



Major enhancement :

* Buffer pool *

The problem with the design, up to now, is that if a heavily threaded
application launches many short-lived threads, it will allocate memory for
each traced thread, consuming time, and it will create an incredibly high
number of files in the trace (one or more per thread).

(thanks to Matthew Khouzam)
The solution to this sits in the use of a buffer pool : we typically create a
buffer pool of a specified size (say, 10 buffers by default, alterable by the
user), each 8k in size (4k for the normal trace, 4k for the facility channel),
for a total of 80kB of memory. It has to be tweaked to the maximum number of
expected threads running at once, or it will have to grow dynamically (thus
impacting the trace).

A typical approach to dynamic growth is to double the number of allocated
buffers each time a threshold near the limit is reached.

Each channel would be found as :

trace_name/user/facilities_0
trace_name/user/cpu_0
trace_name/user/facilities_1
trace_name/user/cpu_1
...

When a thread asks to be traced, it gets a buffer from the free buffer pool. If
the number of available buffers falls under a threshold, the pool is marked for
expansion and the thread gets its buffer quickly. The expansion will be
executed a little bit later by a worker thread. If, however, the number of
available buffers is 0, an "emergency" reservation will be done, allocating
only one buffer. The goal of this is to impact the thread fork time as little
as possible.
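
A sketch of that acquisition path, with hypothetical names throughout (the
mutex is acceptable here because this runs at thread creation, not in the
tracing fast path) :

#include <pthread.h>
#include <stddef.h>

#define POOL_LOW_WATERMARK 2	/* hypothetical threshold */

struct buffer_pool {
	pthread_mutex_t lock;
	void **free_bufs;	/* stack of free buffers */
	int nr_free;
	int expanding;		/* expansion already requested ? */
};

extern void *alloc_one_buffer(void);	/* hypothetical emergency path */
extern void wake_pool_worker(struct buffer_pool *p);	/* hypothetical : grows the pool */

static void *pool_get_buffer(struct buffer_pool *p)
{
	void *buf;

	pthread_mutex_lock(&p->lock);
	if (p->nr_free == 0) {
		/* Emergency reservation : allocate a single buffer so the
		 * forking thread is delayed as little as possible. */
		pthread_mutex_unlock(&p->lock);
		return alloc_one_buffer();
	}
	buf = p->free_bufs[--p->nr_free];
	if (p->nr_free < POOL_LOW_WATERMARK && !p->expanding) {
		/* Let a worker thread grow the pool later; the current
		 * thread returns immediately with its buffer. */
		p->expanding = 1;
		wake_pool_worker(p);
	}
	pthread_mutex_unlock(&p->lock);
	return buf;
}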

When a thread releases a buffer (the thread terminates), a buffer switch is
performed, so the data can be flushed to disk and no other thread will mess
with it or render the buffer unreadable.

Upon trace creation, the pre-allocated pool is allocated. Upon trace
destruction, the threads are first informed of the trace destruction, any
pending worker thread (for pool allocation) is cancelled and then the pool is
released. Buffers used by threads at this moment but not mapped for reading
will simply be destroyed (as their refcount will fall to 0). It means that
between the "trace stop" and "trace destroy", there should be enough time to
let the lttd daemon open the newly created channels or they will be lost.

Upon buffer switch, the reader can read directly from the buffer. Note that
when the reader finishes reading a buffer, if the associated writer thread has
exited, it must fill the buffer with zeroes and put it back into the free pool.
In the case where the trace is destroyed, it must just decrement its refcount
(as it would do otherwise) and the buffer will be destroyed.

This pool will reduce the number of trace files created to the order of the
number of threads present in the system at a given time.

A worst-case scenario is 32768 processes traced at the same time, for a total
amount of 256MB of buffers. If a machine has that many threads, it probably has
enough memory to handle this.

In flight recorder mode, it would be interesting to use an LRU algorithm to
choose which buffer from the pool to take for a newly forked thread. A simple
queue would do.

SMP : per-cpu pools ? -> no, L1 and L2 caches are typically too small to be
impacted by whether a reused buffer is on a different or the same CPU.