Some thoughts about userspace tracing

Mathieu Desnoyers, January 2006


* Goals

Fast and secure user space tracing.

Fast :

- 5000ns for a system call is too long. Writing an event directly to memory
  takes 220ns.
- Still, we can afford a system call for a buffer switch, which occurs less
  often.
- No locking, no signal disabling. Disabling signals requires 2 system calls.
  Mutexes are implemented with a short spin lock followed by a yield : yet
  another system call. In addition, we have no way to know which CPU we are
  running on while in user mode, and we can be preempted anywhere.
- No contention.
- No interrupt disabling : it doesn't exist in user mode.

Secure :

- A process shouldn't be able to corrupt the system's trace or another
  process's trace. It should be limited to its own memory space.



* Solution

- Signal handler concurrency

Using atomic space reservation in the buffer(s) removes the need for locking.
This is the fast and safe way to deal with concurrency coming from signal
handlers.

- Start/stop tracing

Two possible solutions :

Either we export a read-only memory page from the kernel to user space. That
would be somewhat of a hack, as I have never seen such an interface anywhere
else, and it may lead to problems related to exported types. The proper, but
slow, way to do it would be to have a system call that returns the tracing
status.

My suggestion is to go for a system call, but only call it :

- when the thread starts
- when receiving a SIGRTMIN+3 (multithread ?)

Note : save the thread ID (process ID) in the logging function and the update
handler. Use it as a comparison to check if we are a forked child thread.
Start a brand new buffer list in that case.


Two possibilities :

- one system call per piece of information to get / one system call to get all
  the information.
- one signal per piece of information to get / one signal for "update" tracing
  info.

I would tend to adopt :

- One signal for "general tracing update"
  One signal handler would clearly be enough ; more would be unnecessary
  overhead/pollution.
- One system call for all updates.
  We will need multiple parameters though ; a system call can take up to 6
  parameters.

syscall get_tracing_info

parameter 1 : trace buffer map address. (id)

parameter 2 : active ? (int)


Concurrency

We must have per-thread buffers. Then no memory can be written by two threads
at once. This removes the need for locks (granted, atomic reservation was
already doing that) and removes false sharing.


Multiple traces

By knowing the number of active traces, we can allocate as many buffers as we
need. Allocation is done in the kernel with relay_open. User space mapping is
done when receiving the signal/starting the process and getting the number of
active traces.

It means that we must make sure to only update the data structures used by
tracing functions once the buffers are created.

We could have a syscall "get_next_buffer" that would basically mmap the next
unmapped buffer, or return NULL if all buffers are mapped.

If we remove a trace, the kernel should stop the tracing, and then get the last
buffer for this trace. What is important is to make sure no writers are still
trying to write in a memory region that gets deallocated.

For that, we will keep an atomic variable "tracing_level", which tells how many
times we are nested in tracing code (program code/signal handlers) for a
specific trace.

We could do that trace removal in two operations :

- Send an update tracing signal to the process
  - The signal handler gets the new tracing status, which tells that tracing
    is disabled for the specific trace. It writes this status in the tracing
    control structure of the process.
  - If tracing_level is 0, it's fine : there are no potential writers in the
    removed trace. It's up to us to buffer switch the removed trace and, after
    control returns to us, set_tracing_info this page to NULL and delete this
    memory area.
  - Else (tracing_level > 0), flag the removed trace for later switch/delete.

  It then returns control to the process.

- If the tracing_level was > 0, there were one or more writers potentially
  accessing this memory area. When control comes back to the writer, at the
  end of the write in a trace, if the trace is marked for switch/delete and
  tracing_level is 0 (after the writer's own decrement), then the writer must
  buffer switch, and then delete the memory area.


Filter

The update tracing info signal will make the thread get the new filter
information. Getting this information will also happen upon process creation.

parameter 3 for get_tracing_info : an integer containing the 32-bit mask.


Buffer switch

There could be a tracing_buffer_switch system call, which would take the page
start address as a parameter. The job of the kernel is to steal this page,
possibly replacing it with a zeroed page (we don't care about the content of
the page after the syscall).

Process dying

The kernel should be aware of the current pages used for tracing in each
thread. If a thread dies unexpectedly, we want the kernel to get the last bits
of information before the thread's memory goes away.

Memory protection

If a process corrupts its own mmapped buffers, the rest of the trace will
still be readable, and each process has its own memory space.

Two possibilities :

Either we create one channel per process, or we have per-CPU tracefiles for
all the processes, with the specification that data is written in
monotonically increasing time order and that no process shares a 4k page with
another process.

The problem with having only one tracefile per CPU is that we cannot safely
steal a process's buffer upon a schedule change because the process may be
currently writing to it.

That leaves one tracefile per thread as the only solution.

Another argument in favor of this solution is the possibility of having mixed
32/64-bit processes on the same machine. Dealing with types will be easier.


Corrupted trace

A corrupted tracefile will only affect one thread. The rest of the trace will
still be readable.


Facilities

Upon process creation or when receiving the signal of trace info update, when
a new trace appears, the thread should write the facility information into it.
It must then keep a list of registered facilities, all done at the thread
level.

We must decide whether we allow a facility channel for each thread. The
advantage is that we have a readable channel in flight recorder mode, while
the disadvantage is that it duplicates the number of channels, which may
become quite high. To follow the general design of a high-throughput channel
and a low-throughput channel for vital information, I suggest having a
separate channel for facilities, per trace, per process.


API :

syscall 1 :

in :
	buffer : NULL means get new traces
	         non-NULL means get the information for the specified buffer
out :
	buffer : returns the address of the trace buffer
	active : is the trace active ?
	filter : 32-bit filter mask

return : 0 on success, 1 on error.

int ltt_update(void **buffer, int *active, int *filter);

syscall 2 :

in :
	buffer : Switch the specified buffer.
return : 0 on success, 1 on error.

int ltt_switch(void *buffer);


Signal :

SIGRTMIN+3
(like a hardware fault or an expiring timer : delivered to the thread, see
p. 413 of Advanced Programming in the UNIX Environment)

The signal is sent on tracing create/destroy, start/stop and filter change.

Each thread will update only itself : this removes unnecessary concurrency.


Notes :

It doesn't matter "when" the process receives the update signal after a trace
start : it will receive it with priority, before executing anything else, when
it is scheduled in.