5 Following the huge discussion thread about tracing/static vs dynamic
6 instrumentation/markers, a consensus seems to emerge about the need for a
7 marker system in the Linux kernel. The main issues this mechanism addresses are:
9 - Identify code important to runtime data collection/analysis tools in tree so
10 that it follows the code changes naturally.
11 - Be visually appealing to kernel developers.
12 - Have a very low impact on the system performance.
13 - Integrate in the standard kernel infrastructure : use C and loadable modules.
15 The time has come for some performance measurements of the Linux Kernel Markers,
21 Use timestamp counter to calculate the time spent, with interrupts disabled.
22 Machine : Pentium 4 3GHz, 1GB ram
23 Fully preemptible kernel
24 Linux Kernel Markers 0.19
27 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
29 This marker, with two elements (integer and pointer) have been chosen because it
30 is representative of high volume events. For instance, a trap entry event logs a
31 trap_id (long) and an address (pointer). The same applies to system calls, where
32 a system call entry event logs both the ID of the system call and the address of
36 * Execute an empty loop
40 time delta (cycles): 15026497
41 cycles per loop : 1.50
43 - i386 "optimized" : immediate value, test and predicted branch
44 (non connected marker)
46 time delta (cycles): 40031640
47 cycles per loop : 4.00
48 cycles per loop for marker : 4.00-1.50=2.50
50 - i386 "generic" : load, test and predicted branch
51 (non connected marker)
53 time delta (cycles): 26697878
54 cycles per loop : 2.67
55 cycles per loop for marker : 2.67-1.50=1.17
58 * Execute a loop of memcpy 4096 bytes
60 This test has been done to show the impact of markers on a system where the
61 memory is already used, which is more representative of a running kernel.
65 time delta (cycles): 12981555
66 cycles per loop : 1298.16
68 - i386 "optimized" : immediate value, test and predicted branch
69 (non connected marker)
71 time delta (cycles): 12982290
72 cycles per loop : 1298.23
73 cycles per loop for marker : 1298.23-1298.16=0.074
75 - i386 "generic" : load, test and predicted branch
76 (non connected marker)
78 time delta (cycles): 13002788
79 cycles per loop : 1300.28
80 cycles per loop for marker : 1300.28-1298.16=2.123
83 The following tests are done with the "optimized" markers only
85 - Execute a loop with marker enabled, with i386 "fastcall" register argument
86 setup, probe empty. With preemption disabling.
88 time delta (cycles): 4407608
89 cycles per loop : 44.08
90 cycles per loop to disable preemption and setup arguments in registers :
93 - Execute a loop with a marker enabled, with an empty probe. Var args argument
94 setup, probe empty. With preemption disabling.
96 time delta (cycles): 5210587
97 cycles per loop : 52.11
98 additional cycles per loop to setup var args : 52.11-44.08=8.03
100 - Execute a loop with a marker enabled, with an empty probe. Var args argument
101 setup, probe empty. No preemption disabling.
103 time delta (cycles): 3363450
104 cycles per loop : 33.63
105 cycles per loop to disable preemption : 44.08-33.63=10.45
107 - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
108 Data is copied by the probe. With preemption disabling.
110 time delta (cycles): 5299837
111 cycles per loop : 53.00
112 additional cycles per loop to get arguments in probe (from stack) on x86 :
115 - Execute a loop with marker enabled, with var args probe expecting arguments.
116 Data is copied by the probe. With preemption disabling.
118 time delta (cycles): 5574300
119 cycles per loop : 55.74
120 additional cycles per loop to get expected variable arguments on x86 :
123 - Execute a loop with marker enabled, with var args probe, format string
124 Data is copied by the probe. This is a 6 bytes string to decode.
126 time delta (cycles): 9622117
127 cycles per loop : 96.22
128 additional cycles per loop to dynamically parse arguments with a 6 bytes format
129 string : 96.22-55.74=40.48
131 - Execute a loop with marker enabled, with var args probe expecting arguments.
132 Data is copied by the probe. With preemption disabling. An empty "kprobe" is
133 connected to the probe.
135 time delta (cycles): 423397455
136 cycles per loop : 4233.97
137 additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
142 The disassembly of the following marked function will be shown :
144 static int my_open(struct inode *inode, struct file *file)
146 MARK(subsys_mark1, "%d %p", 1, NULL);
154 static int my_open(struct inode *inode, struct file *file)
157 1: 89 e5 mov %esp,%ebp
158 3: 83 ec 0c sub $0xc,%esp
159 MARK(subsys_mark1, "%d %p", 1, NULL);
160 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
161 8: 84 c0 test %al,%al
162 a: 75 07 jne 13 <my_open+0x13>
166 c: b8 ff ff ff ff mov $0xffffffff,%eax
169 13: b8 01 00 00 00 mov $0x1,%eax
170 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
171 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
173 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
175 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
176 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
177 3a: b8 01 00 00 00 mov $0x1,%eax
178 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
179 44: eb c6 jmp c <my_open+0xc>
184 static int my_open(struct inode *inode, struct file *file)
187 1: 89 e5 mov %esp,%ebp
188 3: 83 ec 0c sub $0xc,%esp
189 MARK(subsys_mark1, "%d %p", 1, NULL);
190 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
191 d: 84 c0 test %al,%al
192 f: 75 07 jne 18 <my_open+0x18>
196 11: b8 ff ff ff ff mov $0xffffffff,%eax
199 18: b8 01 00 00 00 mov $0x1,%eax
200 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
201 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
203 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
205 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
206 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
207 3f: b8 01 00 00 00 mov $0x1,%eax
208 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
209 49: eb c6 jmp 11 <my_open+0x11>
213 Here is the typical var arg probe that has been used in those tests. It saves
214 the values expectes as parameters in global variables. The DO_MARK1_FORMAT
215 define is used for probe registration to make sure that it will be connected
216 with a marker that has a matching format string. Note that this checking is
217 optional : the probe can register with a NULL format and afterward check itself
218 the format string received in parameter dynamically.
223 #define DO_MARK1_FORMAT "%d %p"
224 void do_mark1(const char *format, ...)
228 va_start(ap, format);
229 value = va_arg(ap, int);
230 ptr = va_arg(ap, void*);
236 Here is the disassembly of the probe :
238 #define DO_MARK1_FORMAT "%d %p"
239 void do_mark1(const char *format, ...)
242 1: 89 e5 mov %esp,%ebp
243 3: 83 ec 04 sub $0x4,%esp
246 va_start(ap, format);
247 value = va_arg(ap, int);
248 6: 8b 45 0c mov 0xc(%ebp),%eax
249 9: a3 00 00 00 00 mov %eax,0x0
250 ptr = va_arg(ap, void*);
251 e: 8b 45 10 mov 0x10(%ebp),%eax
252 11: a3 00 00 00 00 mov %eax,0x0
262 This is the size added by each marker to the memory image :
266 .text section : instructions
267 Adds 6 bytes in the "likely" path.
268 Adds 32 bytes in the "unlikely" path.
269 .data section : r/w data
271 .rodata.str1 : strings
272 Length of the marker name
273 .debug_str : strings (if loaded..)
274 Length of the marker name + 7 bytes (__mark_)
278 12 bytes (3 pointers)
282 .text section : instructions
283 Adds 11 bytes in the "likely" path.
284 Adds 32 bytes in the "unlikely" path.
285 .data section : r/w data
286 1 byte (the activation flag)
287 .rodata.str1 : strings
288 Length of the marker name
289 .debug_str : strings (if loaded..)
290 Length of the marker name + 7 bytes (__mark_)
294 12 bytes (3 pointers)
299 Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
300 Running a 2.6.17 vanilla kernel :
305 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
310 --> 0.98 % speedup with markers
312 Ping flood on loopback interface :
313 Running a 2.6.17 vanilla kernel :
314 136596 packets transmitted, 136596 packets received, 0% packet loss
315 round-trip min/avg/max = 0.0/0.0/0.1 ms
321 12601 packets transmitted/s
323 Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
324 108504 packets transmitted, 108504 packets received, 0% packet loss
325 round-trip min/avg/max = 0.0/0.0/0.1 ms
331 12596 packets transmitted/s
333 --> 0.03 % slowdown with markers
338 In an empty loop, the generic marker is faster than the optimized marker. This
339 may be due to better performances of the movzbl instruction over the movb on the
340 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
341 of the movzbl becomes greater because it uses the memory bandwidth.
343 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
344 as much as dynamically parsing the format string to get the variable arguments
347 There is almost no difference, on x86, between passing the arguments directly on
348 the stack and using a variable argument list when its layout is known
349 statically (0.89 cycles vs 2.74 cycles).
351 The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
352 (typcal use, high memory usage) used by the optimized marker by adding the
353 ability to insert a breakpoint at any location without any impact on the code
354 when inactive. This breakpoint based approach is very useful to instrument core
355 kernel code that has not been previously marked without need to recompile and
356 reboot. We can therefore compare the case "without markers" to the null impact
357 of an inactive int3 breakpoint.
359 However, the performance impact for using a kprobe is non negligible when
360 activated. Assuming that kprobes would have a mechanism to get the variables
361 from the caller's stack, it would perform the same task in at least 4178.23
362 cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
363 useful for the reason explained earlier, the high event rate paths in the kernel
364 would clearly benefit from a marker mechanism when the are probed.
366 Code size and memory footprints are smaller with the optimized version : 6
367 bytes of code in the likely path compared to 11 bytes. The memory footprint of
368 the optimized approach saves 4 bytes of data memory that would otherwise have to
371 On the macro-benchmark side, no significant difference in performance has been
372 found between the vanilla kernel and a kernel "marked" with the standard LTTng