Some thoughts about userspace tracing

Mathieu Desnoyers, January 2006



* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long to pay on every event. Writing an event
  directly to memory takes 220ns.
- Still, we can afford a system call for buffer switch, which occurs less often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
  Mutexes are implemented with a short spin lock, followed by a yield : yet
  another system call. In addition, we have no way to know on which CPU we are
  running when in user mode. We can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
  process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) removes the requirement for
locking. This is the fast and safe way to deal with concurrency coming from
signal handlers.
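
To make the idea concrete, here is a minimal sketch of atomic space reservation
using C11 atomics. The names (trace_buf, reserve, write_event) and the 4096-byte
buffer size are made up for illustration, and it assumes that atomic operations
on size_t are lock-free on the target, which is what makes them usable from a
signal handler.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE 4096            /* hypothetical: one page per buffer */

struct trace_buf {
    _Atomic size_t offset;       /* next free byte in data[] */
    char data[BUF_SIZE];
};

/*
 * Reserve len bytes with a compare-and-swap loop. Returns the start
 * offset of the reserved slot, or -1 when the buffer is full (the
 * caller would then do a buffer switch).
 */
static long reserve(struct trace_buf *b, size_t len)
{
    size_t old = atomic_load(&b->offset);

    do {
        if (len > BUF_SIZE - old)
            return -1;
        /* on failure, old is reloaded with the current offset */
    } while (!atomic_compare_exchange_weak(&b->offset, &old, old + len));

    return (long)old;
}

/* Write one event into its own reserved slot. */
static int write_event(struct trace_buf *b, const void *ev, size_t len)
{
    long slot = reserve(b, len);

    if (slot < 0)
        return -1;
    memcpy(b->data + slot, ev, len);
    return 0;
}
```

If a signal handler interrupts a writer between reserve() and memcpy(), the
handler reserves a different slot, so the two writes never overlap.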

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from kernel to user space, or we add a
system call that returns the tracing status. The exported page would be seen as
somewhat of a hack, as I have never seen such an interface anywhere else, and it
may lead to problems related to exported types. The system call is the proper,
but slow, way to do it.

My suggestion is to go for a system call, but only call it :

- when the process starts
- when receiving a SIG_UPDTRACING

Two possibilities :

- one system call per piece of information, or one system call to get all the
  information.
- one signal per piece of information, or one signal for "update" tracing info.

I would tend to adopt :

- One signal for "general tracing update"
  One signal handler would clearly be enough, more would be unnecessary
  overhead/pollution.
- One system call for all updates.
  We will need multiple parameters though. A system call gives us up to 6
  parameters.

syscall get_tracing_info

first parameter : active traces mask (32 bits : 32 traces).
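
The start/stop scheme above could look roughly like the sketch below.
Everything kernel-side is simulated : get_tracing_info does not exist, so it is
a plain function reading a variable, and SIG_UPDTRACING is mapped to SIGUSR1 as
a stand-in. The helper names (update_tracing, tracing_init, trace_is_active)
are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_UPDTRACING SIGUSR1       /* stand-in : no such signal exists */

static uint32_t kernel_active_mask;  /* simulates the kernel-side state */

/* Stand-in for the proposed syscall : returns the active traces mask. */
static int get_tracing_info(uint32_t *active_mask)
{
    *active_mask = kernel_active_mask;
    return 0;
}

/* Cached copy, read by the tracing fast path without any system call. */
static _Atomic uint32_t active_traces;

/* Signal handler : refresh the cached tracing status. */
static void update_tracing(int sig)
{
    uint32_t mask;

    (void)sig;
    if (get_tracing_info(&mask) == 0)
        atomic_store(&active_traces, mask);
}

/* Called once at process start : install the handler, fetch the status. */
static int tracing_init(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = update_tracing;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIG_UPDTRACING, &sa, NULL) != 0)
        return -1;
    update_tracing(0);
    return 0;
}

/* Fast path test : is trace number id (0..31) active ? */
static int trace_is_active(unsigned int id)
{
    return (atomic_load(&active_traces) >> id) & 1u;
}
```

The system call cost is only paid in tracing_init() and in the signal handler.
trace_is_active() is a single atomic load.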


Concurrency

We must have per thread buffers. Then, no memory can be written by two threads
at once. It removes the need for locks (ok, atomic reservation was already doing
that) and removes false sharing.
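
As a rough illustration, per thread buffers can be declared with C11
thread-local storage. The 64-byte cache line and the 4096-byte buffer are
assumptions, and thread_buf/current_buf are hypothetical names. The offset
stays atomic because a signal handler running in the same thread can still
interleave with a writer.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64     /* assumption : typical cache line size */
#define BUF_SIZE   4096   /* hypothetical buffer size */

struct thread_buf {
    /* Aligning the first member aligns (and pads) the whole structure,
     * so two threads' buffers never share a cache line. */
    alignas(CACHE_LINE) _Atomic size_t offset;
    char data[BUF_SIZE];
};

/* One instance per thread : no other thread ever writes to it. */
static _Thread_local struct thread_buf my_buf;

static struct thread_buf *current_buf(void)
{
    return &my_buf;
}
```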


Multiple traces

By having the number of active traces, we can allocate as many buffers as we
need. The only thing is that the buffers will only be allocated when receiving
the signal/starting the process and getting the number of active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

When adding a new buffer, we should call the set_tracing_info syscall and give
the new buffers array to the kernel. It's an array of 32 pointers to user pages.
They will be used by the kernel to get the last pages when the thread dies.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
  - The signal handler gets the new tracing status, which tells that tracing is
    disabled for the specific trace. It writes this status in the tracing
    control structure of the process.
  - If tracing_level is 0, well, it's fine : there are no potential writers in
    the removed trace. It's up to us to buffer switch the removed trace, and,
    after the control returns to us, set_tracing_info this page to NULL and
    delete this memory area.
  - Else (tracing_level > 0), flag the removed trace for later switch/delete.

  It then returns control to the process.

- If the tracing_level was > 0, there was one or more writers potentially
  accessing this memory area. When the control comes back to the writer, at the
  end of the write in a trace, if the trace is marked for switch/delete and the
  tracing_level is 0 (after the decrement of the writer itself), then the
  writer must buffer switch, set_tracing_info to NULL and then delete the
  memory area.
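
The removal protocol above can be sketched as follows. This is only an
illustration under assumptions : release_buffer() stands in for the buffer
switch, the set_tracing_info to NULL and the memory deletion, and the
enter/exit helpers and the "active" and "pending_delete" flags are names I made
up for the tracing control structure.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct trace {
    atomic_int  tracing_level;   /* how many writers are nested in this trace */
    atomic_bool active;          /* cleared once the trace is removed */
    atomic_bool pending_delete;  /* switch/delete left to the last writer */
    void *buf;                   /* current buffer page, NULL once released */
    int releases;                /* illustration only : counts releases */
};

/* Stand-in for : buffer switch, set_tracing_info(NULL), delete memory. */
static void release_buffer(struct trace *t)
{
    t->buf = NULL;
    t->releases++;
}

/* Signal handler side : the new status says this trace was removed. */
static void trace_removed(struct trace *t)
{
    atomic_store(&t->active, false);
    if (atomic_load(&t->tracing_level) == 0)
        release_buffer(t);                    /* no writer can be inside */
    else
        atomic_store(&t->pending_delete, true);
}

/* Writer side : called at the end of every write. */
static void trace_exit(struct trace *t)
{
    /* fetch_sub returns the old value : 1 means we were the last writer */
    if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
        atomic_exchange(&t->pending_delete, false))
        release_buffer(t);
}

/* Writer side : called before every write. Returns false if the trace
 * is no longer active, in which case nothing must be written. */
static bool trace_enter(struct trace *t)
{
    atomic_fetch_add(&t->tracing_level, 1);
    if (!atomic_load(&t->active)) {
        trace_exit(t);   /* also covers a removal that raced with us */
        return false;
    }
    return true;
}
```

A writer brackets each event with trace_enter()/trace_exit(). A nested writer
in a signal handler sees an old level of 2 on exit and leaves the cleanup to the
outer writer.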


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

second parameter of get_tracing_info : array of 32 ints (32 bits each).
Each integer is the filter mask for a trace. As there are up to 32 active
traces, we have 32 integers for filter.
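
A possible layout for the information returned by get_tracing_info, with a
check combining the active traces mask and the per trace filter. The meaning of
the filter bits is not decided here : the sketch assumes one bit per event
category, numbered 0 to 31, and the names (tracing_info, event_enabled) are
hypothetical.

```c
#include <stdint.h>

#define MAX_TRACES 32

struct tracing_info {
    uint32_t active_mask;            /* bit i set : trace i is active */
    uint32_t filter[MAX_TRACES];     /* filter[i] : filter mask of trace i */
};

/*
 * Should an event of the given category be written to trace id ?
 * Assumption : each filter bit enables one event category (0..31).
 */
static int event_enabled(const struct tracing_info *ti,
                         unsigned int id, unsigned int category)
{
    if (id >= MAX_TRACES || category >= 32)
        return 0;
    if (!((ti->active_mask >> id) & 1u))
        return 0;
    return (ti->filter[id] >> category) & 1u;
}
```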


Buffer switch

There could be a tracing_buffer_switch system call, that would take the page
start address as parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of the
page after the syscall).


Process dying

The kernel should be aware of the current pages used for tracing in each thread.
If a thread dies unexpectedly, we want the kernel to get the last bits of
information written before the thread crashed.

syscall set_tracing_info

parameter 1 : array of 32 user space pointers to current pages or NULL.


Memory protection

We want each process to be unable to make a trace unreadable, and each process
to have its own memory space.

Two possibilities :

Either we create one channel per process, or we have per cpu tracefiles for all
the processes, with the specification that data is written in a monotonically
increasing time order and that no process shares a 4k page with another process.

The problem with having only one tracefile per cpu is that we cannot safely
steal a process's buffer upon a schedule change because it may be currently
writing to it.

It leaves the one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility to have mixed
32-64 bits processes on the same machine. Dealing with types will be easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when a
new trace appears, the thread should write the facility information into it. It
must then have a list of registered facilities, all done at the thread level.

We must decide if we allow a facility channel for each thread. The advantage is
that we have a readable channel in flight recorder mode, while the disadvantage
is that it doubles the number of channels, which may become quite high. To follow
the general design of a high throughput channel and a low throughput channel for
vital information, I suggest having a separate channel for facilities, per
trace, per process.