--- /dev/null
+
+Some thoughts about userspace tracing
+
+Mathieu Desnoyers January 2006
+
+
+
+* Goals
+
+Fast and secure user space tracing.
+
+Fast :
+
+- 5000ns for a system call is too long. Writing an event directly to memory
+ takes 220ns.
+- Still, we can afford a system call for buffer switch, which occurs less often.
+- No locking, no signal disabling. Disabling signals requires 2 system calls.
+ Mutexes are implemented with a short spin lock followed by a yield, which is
+ yet another system call. In addition, we have no way to know on which CPU we
+ are running when in user mode. We can be preempted anywhere.
+- No contention.
+- No interrupt disabling : it doesn't exist in user mode.
+
+Secure :
+
+- A process shouldn't be able to corrupt the system's trace or another
+ process's trace. It should be limited to its own memory space.
+
+
+
+* Solution
+
+- Signal handler concurrency
+
+Using atomic space reservation in the buffer(s) will remove the requirement for
+locking. This is the fast and safe way to deal with concurrency coming from
+signal handlers.
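The atomic reservation described above could be sketched as follows. This is
only an illustration under assumptions: the names trace_buf, trace_reserve and
trace_write, and the 4096-byte buffer size, are hypothetical and not part of
the proposal. It relies on C11 atomics being lock-free for size_t on the
target, which is what makes the loop usable from a signal handler.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TRACE_BUF_SIZE 4096  /* assumed: one 4k page per buffer */

struct trace_buf {
    atomic_size_t write_off;             /* next free byte in data[] */
    unsigned char data[TRACE_BUF_SIZE];
};

/*
 * Reserve len bytes in buf. Returns a pointer to the reserved slot, or NULL
 * when the event does not fit (the caller would then do a buffer switch).
 * A compare-and-swap loop is used so that a failed reservation never moves
 * write_off past the end of the buffer.
 */
static void *trace_reserve(struct trace_buf *buf, size_t len)
{
    size_t old = atomic_load_explicit(&buf->write_off, memory_order_relaxed);

    for (;;) {
        if (len > TRACE_BUF_SIZE - old)
            return NULL;
        if (atomic_compare_exchange_weak_explicit(&buf->write_off, &old,
                                                  old + len,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            return buf->data + old;
        /* The failed CAS reloaded old (a signal handler got in); retry. */
    }
}

/* Write one event: reserve first, then copy into the private slot. */
static int trace_write(struct trace_buf *buf, const void *ev, size_t len)
{
    void *slot = trace_reserve(buf, len);

    if (!slot)
        return -1;
    memcpy(slot, ev, len);
    return 0;
}
```

A signal handler that interrupts trace_write between the load and the
compare-and-swap simply reserves its own slot; the interrupted writer then
sees the compare-and-swap fail and retries with the updated offset, so the
two never share bytes.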
+
+- Start/stop tracing
+
+Two possible solutions :
+
+Either we export a read-only memory page from kernel to user space. That would
+probably be seen as a hack, as I have never seen such an interface anywhere
+else. It may lead to problems related to exported types. The proper, but slow,
+way to do it would be to have a system call that would return the tracing
+status.
+
+My suggestion is to go for a system call, but only call it :
+
+- when the process starts
+- when receiving a SIG_UPDTRACING
+
+Two possibilities :
+
+- one system call per piece of information, or one system call to get all of
+ it.
+- one signal per piece of information, or one signal for "update" tracing info.
+
+I would tend to adopt :
+
+- One signal for "general tracing update"
+ One signal handler would clearly be enough, more would be unnecessary
+ overhead/pollution.
+- One system call for all updates.
+ We will need to have multiple parameters though. We have up to 6 parameters.
+
+syscall get_tracing_info
+
+first parameter : active traces mask (32 bits : 32 traces).
+
+
+Concurrency
+
+We must have per thread buffers. Then, no memory can be written by two threads
+at once. It removes the need for locks (ok, atomic reservation was already doing
+that) and removes false sharing.
+
+
+Multiple traces
+
+By having the number of active traces, we can allocate as many buffers as we
+need. The only thing is that the buffers will only be allocated when receiving
+the signal/starting the process and getting the number of active traces.
+
+It means that we must make sure to only update the data structures used by
+tracing functions once the buffers are created.
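One way to respect that ordering is to allocate first and publish the buffer
pointer last, with a release store. The sketch below is an assumption about
how this could look: thread_trace_state and trace_buffers_update are
hypothetical names, and calloc stands in for whatever page-aligned,
signal-safe allocation (for example mmap) a real implementation would need,
since calloc must not be called from a signal handler.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_TRACES     32
#define TRACE_BUF_SIZE 4096  /* assumed: one 4k page per buffer */

struct thread_trace_state {
    _Atomic(void *) buf[MAX_TRACES];  /* NULL until the buffer is ready */
};

/*
 * For every trace set in new_mask that has no buffer yet, allocate one and
 * only then publish it. Writers that load buf[i] see either NULL or a fully
 * set up buffer. Returns the number of buffers created, or -1 on allocation
 * failure.
 */
static int trace_buffers_update(struct thread_trace_state *st,
                                uint32_t new_mask)
{
    int created = 0;

    for (unsigned i = 0; i < MAX_TRACES; i++) {
        if (!(new_mask & (UINT32_C(1) << i)))
            continue;
        if (atomic_load_explicit(&st->buf[i], memory_order_relaxed))
            continue;  /* this trace already has a buffer */

        void *page = calloc(1, TRACE_BUF_SIZE);  /* stand-in allocator */
        if (!page)
            return -1;

        /* Publish last: the data structure changes only once the buffer exists. */
        atomic_store_explicit(&st->buf[i], page, memory_order_release);
        created++;
    }
    return created;
}
```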
+
+When adding a new buffer, we should call the set_tracing_info syscall and give
+the new buffers array to the kernel. It's an array of 32 pointers to user pages.
+They will be used by the kernel to get the last pages when the thread dies.
+
+If we remove a trace, the kernel should stop the tracing, and then get the last
+buffer for this trace. What is important is to make sure no writers are still
+trying to write in a memory region that gets deallocated.
+
+For that, we will keep an atomic variable "tracing_level", which tells how many
+times we are nested in tracing code (program code/signal handlers) for a
+specific trace.
+
+We could do that trace removal in two operations :
+
+- Send an update tracing signal to the process
+ - the sig handler gets the new tracing status, which tells that tracing is
+ disabled for the specific trace. It writes this status in the tracing
+ control structure of the process.
+ - If tracing_level is 0, well, it's fine : there are no potential writers in
+ the removed trace. It's up to us to buffer switch the removed trace, and,
+ after the control returns to us, set_tracing_info this page to NULL and
+ delete this memory area.
+ - Else (tracing_level > 0), flag the removed trace for later switch/delete.
+
+ It then returns control to the process.
+
+- If the tracing_level was > 0, there was one or more writers potentially
+ accessing this memory area. When the control comes back to the writer, at the
+ end of the write in a trace, if the trace is marked for switch/delete and the
+ tracing_level is 0 (after the decrement of the writer itself), then the
+ writer must buffer switch, set_tracing_info to NULL and then delete the
+ memory area.
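
The two-step removal above can be sketched with a nesting counter and a
"pending" flag. Everything named here (trace_ctl, trace_enter, trace_exit,
trace_remove, trace_teardown) is hypothetical; trace_teardown only counts
calls and stands in for the buffer switch, the set_tracing_info to NULL and
the memory release. Incrementing the level before checking the active flag is
my own assumption, chosen so that a signal arriving in between is handled.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct trace_ctl {
    atomic_int  tracing_level;   /* nesting depth of writers on this trace */
    atomic_bool active;          /* cleared when the trace is removed */
    atomic_bool pending_delete;  /* removal deferred to the last writer */
    int         teardown_count;  /* illustration only: counts teardowns */
};

static void trace_ctl_init(struct trace_ctl *t)
{
    atomic_init(&t->tracing_level, 0);
    atomic_init(&t->active, true);
    atomic_init(&t->pending_delete, false);
    t->teardown_count = 0;
}

/* Stand-in for: buffer switch, set_tracing_info(NULL), free the area. */
static void trace_teardown(struct trace_ctl *t)
{
    t->teardown_count++;
}

/* End of a write: the last writer out performs a deferred removal. */
static void trace_exit(struct trace_ctl *t)
{
    if (atomic_fetch_sub(&t->tracing_level, 1) == 1 &&
        atomic_exchange(&t->pending_delete, false))
        trace_teardown(t);
}

/* Start of a write: returns false when the trace is no longer active. */
static bool trace_enter(struct trace_ctl *t)
{
    atomic_fetch_add(&t->tracing_level, 1);
    if (!atomic_load(&t->active)) {
        trace_exit(t);
        return false;
    }
    return true;
}

/* Called from the update-tracing signal handler when a trace is removed. */
static void trace_remove(struct trace_ctl *t)
{
    atomic_store(&t->active, false);
    if (atomic_load(&t->tracing_level) == 0)
        trace_teardown(t);                        /* no writers: do it now */
    else
        atomic_store(&t->pending_delete, true);   /* defer to trace_exit */
}
```

A writer would bracket each event with trace_enter and trace_exit. If the
signal lands in the middle of a write, trace_remove sees a non-zero level and
only flags the trace; the teardown then happens in the writer's trace_exit.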
+
+
+Filter
+
+The update tracing info signal will make the thread get the new filter
+information. Getting this information will also happen upon process creation.
+
+parameter 2 of get_tracing_info : array of 32 ints (32 bits each).
+Each integer is the filter mask for a trace. As there are up to 32 active
+traces, we have 32 integers for filter.
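
As an illustration, the two get_tracing_info parameters could be kept in one
per-process structure and tested before writing an event. The note does not
say what each filter bit selects; the sketch assumes one bit per event class,
and the names tracing_info and event_enabled are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRACES 32

struct tracing_info {
    uint32_t active_mask;         /* parameter 1: bit n set = trace n active */
    uint32_t filter[MAX_TRACES];  /* parameter 2: one filter mask per trace */
};

/*
 * Returns true when an event of the given class should be written to the
 * given trace. Assumption: bit k of filter[trace] enables event class k.
 */
static bool event_enabled(const struct tracing_info *ti,
                          unsigned trace, unsigned event_class)
{
    if (trace >= MAX_TRACES || event_class >= 32)
        return false;
    if (!(ti->active_mask & (UINT32_C(1) << trace)))
        return false;
    return (ti->filter[trace] >> event_class) & 1u;
}
```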
+
+
+Buffer switch
+
+There could be a tracing_buffer_switch system call that would take the page
+start address as a parameter. The job of the kernel is to steal this page,
+possibly replacing it with a zeroed page (we don't care about the content of the
+page after the syscall).
+
+Process dying
+
+The kernel should be aware of the current pages used for tracing in each thread.
+If a thread dies unexpectedly, we want the kernel to recover the last bits of
+information written before the thread crashed.
+
+syscall set_tracing_info
+
+parameter 1 : array of 32 user space pointers to current pages or NULL.
+
+
+Memory protection
+
+We want each process to be unable to make a trace unreadable, and each
+process to have its own memory space.
+
+Two possibilities :
+
+Either we create one channel per process, or we have per cpu tracefiles for all
+the processes, with the specification that data is written in a monotonically
+increasing time order and that no process shares a 4k page with another process.
+
+The problem with having only one tracefile per cpu is that we cannot safely
+steal a process's buffer upon a schedule change because it may be currently
+writing to it.
+
+This leaves one tracefile per thread as the only solution.
+
+Another argument in favor of this solution is the possibility to have mixed
+32-64 bits processes on the same machine. Dealing with types will be easier.
+
+
+Corrupted trace
+
+A corrupted tracefile will only affect one thread. The rest of the trace will
+still be readable.
+
+
+Facilities
+
+Upon process creation or when receiving the signal of trace info update, when a
+new trace appears, the thread should write the facility information into it. It
+must then have a list of registered facilities, all done at the thread level.
+
+We must decide if we allow a facility channel for each thread. The advantage is
+that we have a readable channel in flight recorder mode, while the disadvantage
+is that it doubles the number of channels, which may become quite high. To
+follow the general design of a high throughput channel and a low throughput
+channel for vital information, I suggest having a separate channel for
+facilities, per trace, per process.
+
+
+
+
+
+