bc82195a |
1 | |
2 | |
c7b68d19 |
3 | Hi, |
4 | |
5 | Following the huge discussion thread about tracing/static vs dynamic |
6 | instrumentation/markers, a consensus seems to emerge about the need for a |
7 | marker system in the Linux kernel. The main issues this mechanism addresses are: |
8 | |
9 | - Identify code important to runtime data collection/analysis tools in tree so |
10 | that it follows the code changes naturally. |
11 | - Be visually appealing to kernel developers. |
12 | - Have a very low impact on the system performance. |
13 | - Integrate in the standard kernel infrastructure : use C and loadable modules. |
14 | |
15 | The time has come for some performance measurements of the Linux Kernel Markers, |
16 | which follows. |
17 | |
18 | |
19 | * Micro-benchmarks |
bc82195a |
20 | |
21 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
b9c0c958 |
22 | Machine : Pentium 4 3GHz, 1GB ram |
bc82195a |
23 | Fully preemptible kernel |
bc82195a |
24 | Linux Kernel Markers 0.19 |
b9c0c958 |
25 | Kernel : Linux 2.6.17 |
26 | |
27 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
28 | |
29 | This marker, with two elements (integer and pointer) have been chosen because it |
30 | is representative of high volume events. For instance, a trap entry event logs a |
31 | trap_id (long) and an address (pointer). The same applies to system calls, where |
32 | a system call entry event logs both the ID of the system call and the address of |
33 | the caller. |
34 | |
bc82195a |
35 | |
36 | * Execute an empty loop |
b9c0c958 |
37 | |
38 | - Without marker |
bc82195a |
39 | NR_LOOPS : 10000000 |
40 | time delta (cycles): 15026497 |
41 | cycles per loop : 1.50 |
b9c0c958 |
42 | |
bc82195a |
43 | - i386 "optimized" : immediate value, test and predicted branch |
44 | (non connected marker) |
45 | NR_LOOPS : 10000000 |
46 | time delta (cycles): 40031640 |
47 | cycles per loop : 4.00 |
b9c0c958 |
48 | cycles per loop for marker : 4.00-1.50=2.50 |
49 | |
bc82195a |
50 | - i386 "generic" : load, test and predicted branch |
51 | (non connected marker) |
52 | NR_LOOPS : 10000000 |
53 | time delta (cycles): 26697878 |
54 | cycles per loop : 2.67 |
b9c0c958 |
55 | cycles per loop for marker : 2.67-1.50=1.17 |
56 | |
bc82195a |
57 | |
58 | * Execute a loop of memcpy 4096 bytes |
b9c0c958 |
59 | |
60 | This test has been done to show the impact of markers on a system where the |
61 | memory is already used, which is more representative of a running kernel. |
62 | |
bc82195a |
63 | - Without marker |
64 | NR_LOOPS : 10000 |
65 | time delta (cycles): 12981555 |
66 | cycles per loop : 1298.16 |
b9c0c958 |
67 | |
bc82195a |
68 | - i386 "optimized" : immediate value, test and predicted branch |
69 | (non connected marker) |
70 | NR_LOOPS : 10000 |
71 | time delta (cycles): 12982290 |
72 | cycles per loop : 1298.23 |
b9c0c958 |
73 | cycles per loop for marker : 1298.23-1298.16=0.074 |
74 | |
bc82195a |
75 | - i386 "generic" : load, test and predicted branch |
76 | (non connected marker) |
77 | NR_LOOPS : 10000 |
78 | time delta (cycles): 13002788 |
79 | cycles per loop : 1300.28 |
b9c0c958 |
80 | cycles per loop for marker : 1300.28-1298.16=2.123 |
bc82195a |
81 | |
82 | |
83 | The following tests are done with the "optimized" markers only |
84 | |
b9c0c958 |
85 | - Execute a loop with marker enabled, with i386 "fastcall" register argument |
86 | setup, probe empty. With preemption disabling. |
bc82195a |
87 | NR_LOOPS : 100000 |
88 | time delta (cycles): 4407608 |
89 | cycles per loop : 44.08 |
b9c0c958 |
90 | cycles per loop to disable preemption and setup arguments in registers : |
91 | 44.08-4.00=40.08 |
bc82195a |
92 | |
b9c0c958 |
93 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
94 | setup, probe empty. With preemption disabling. |
bc82195a |
95 | NR_LOOPS : 100000 |
96 | time delta (cycles): 5210587 |
97 | cycles per loop : 52.11 |
98 | additional cycles per loop to setup var args : 52.11-44.08=8.03 |
99 | |
b9c0c958 |
100 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
101 | setup, probe empty. No preemption disabling. |
102 | NR_LOOPS : 100000 |
103 | time delta (cycles): 3363450 |
104 | cycles per loop : 33.63 |
105 | cycles per loop to disable preemption : 44.08-33.63=10.45 |
106 | |
107 | - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected. |
108 | Data is copied by the probe. With preemption disabling. |
bc82195a |
109 | NR_LOOPS : 100000 |
110 | time delta (cycles): 5299837 |
111 | cycles per loop : 53.00 |
112 | additional cycles per loop to get arguments in probe (from stack) on x86 : |
b9c0c958 |
113 | 53.00-52.11=0.89 |
bc82195a |
114 | |
b9c0c958 |
115 | - Execute a loop with marker enabled, with var args probe expecting arguments. |
116 | Data is copied by the probe. With preemption disabling. |
bc82195a |
117 | NR_LOOPS : 100000 |
118 | time delta (cycles): 5574300 |
119 | cycles per loop : 55.74 |
120 | additional cycles per loop to get expected variable arguments on x86 : |
121 | 55.74-53.00=2.74 |
122 | |
b9c0c958 |
123 | - Execute a loop with marker enabled, with var args probe, format string |
124 | Data is copied by the probe. This is a 6 bytes string to decode. |
bc82195a |
125 | NR_LOOPS : 100000 |
126 | time delta (cycles): 9622117 |
127 | cycles per loop : 96.22 |
b9c0c958 |
128 | additional cycles per loop to dynamically parse arguments with a 6 bytes format |
c7b68d19 |
129 | string : 96.22-55.74=40.48 |
130 | |
131 | - Execute a loop with marker enabled, with var args probe expecting arguments. |
132 | Data is copied by the probe. With preemption disabling. An empty "kprobe" is |
133 | connected to the probe. |
134 | NR_LOOPS : 100000 |
135 | time delta (cycles): 423397455 |
136 | cycles per loop : 4233.97 |
137 | additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23 |
bc82195a |
138 | |
139 | |
140 | * Assembly code |
141 | |
b9c0c958 |
142 | The disassembly of the following marked function will be shown : |
143 | |
144 | static int my_open(struct inode *inode, struct file *file) |
145 | { |
146 | MARK(subsys_mark1, "%d %p", 1, NULL); |
147 | |
148 | return -EPERM; |
149 | } |
150 | |
bc82195a |
151 | |
152 | - Optimized |
153 | |
154 | static int my_open(struct inode *inode, struct file *file) |
155 | { |
156 | 0: 55 push %ebp |
157 | 1: 89 e5 mov %esp,%ebp |
158 | 3: 83 ec 0c sub $0xc,%esp |
159 | MARK(subsys_mark1, "%d %p", 1, NULL); |
b9c0c958 |
160 | 6: b0 00 mov $0x0,%al <-- immediate load 0 in al |
bc82195a |
161 | 8: 84 c0 test %al,%al |
162 | a: 75 07 jne 13 <my_open+0x13> |
163 | |
164 | return -EPERM; |
165 | } |
166 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
167 | 11: c9 leave |
168 | 12: c3 ret |
169 | 13: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
170 | 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable |
bc82195a |
171 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
172 | 24: 00 |
173 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
174 | 2c: 00 |
175 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
b9c0c958 |
176 | 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
bc82195a |
177 | 3a: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
178 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable |
bc82195a |
179 | 44: eb c6 jmp c <my_open+0xc> |
180 | |
181 | |
182 | - Generic |
183 | |
184 | static int my_open(struct inode *inode, struct file *file) |
185 | { |
186 | 0: 55 push %ebp |
187 | 1: 89 e5 mov %esp,%ebp |
188 | 3: 83 ec 0c sub $0xc,%esp |
189 | MARK(subsys_mark1, "%d %p", 1, NULL); |
b9c0c958 |
190 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte |
bc82195a |
191 | d: 84 c0 test %al,%al |
192 | f: 75 07 jne 18 <my_open+0x18> |
193 | |
194 | return -EPERM; |
195 | } |
196 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
197 | 16: c9 leave |
198 | 17: c3 ret |
199 | 18: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
200 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable |
bc82195a |
201 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
202 | 29: 00 |
203 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
204 | 31: 00 |
205 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
b9c0c958 |
206 | 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
bc82195a |
207 | 3f: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
208 | 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable |
bc82195a |
209 | 49: eb c6 jmp 11 <my_open+0x11> |
210 | |
b9c0c958 |
211 | |
212 | |
213 | Here is the typical var arg probe that has been used in those tests. It saves |
214 | the values expectes as parameters in global variables. The DO_MARK1_FORMAT |
215 | define is used for probe registration to make sure that it will be connected |
216 | with a marker that has a matching format string. Note that this checking is |
217 | optional : the probe can register with a NULL format and afterward check itself |
218 | the format string received in parameter dynamically. |
219 | |
220 | int value; |
221 | void *ptr; |
222 | |
223 | #define DO_MARK1_FORMAT "%d %p" |
224 | void do_mark1(const char *format, ...) |
225 | { |
226 | va_list ap; |
227 | |
228 | va_start(ap, format); |
229 | value = va_arg(ap, int); |
230 | ptr = va_arg(ap, void*); |
231 | |
232 | va_end(ap); |
233 | } |
234 | |
235 | |
236 | Here is the disassembly of the probe : |
237 | |
238 | #define DO_MARK1_FORMAT "%d %p" |
239 | void do_mark1(const char *format, ...) |
240 | { |
241 | 0: 55 push %ebp |
242 | 1: 89 e5 mov %esp,%ebp |
243 | 3: 83 ec 04 sub $0x4,%esp |
244 | va_list ap; |
245 | |
246 | va_start(ap, format); |
247 | value = va_arg(ap, int); |
248 | 6: 8b 45 0c mov 0xc(%ebp),%eax |
249 | 9: a3 00 00 00 00 mov %eax,0x0 |
250 | ptr = va_arg(ap, void*); |
251 | e: 8b 45 10 mov 0x10(%ebp),%eax |
252 | 11: a3 00 00 00 00 mov %eax,0x0 |
253 | |
254 | va_end(ap); |
255 | } |
256 | 16: c9 leave |
257 | 17: c3 ret |
258 | |
259 | |
bc82195a |
260 | * Size (x86) |
261 | |
b9c0c958 |
262 | This is the size added by each marker to the memory image : |
263 | |
bc82195a |
264 | - Optimized |
265 | |
b9c0c958 |
266 | .text section : instructions |
bc82195a |
267 | Adds 6 bytes in the "likely" path. |
268 | Adds 32 bytes in the "unlikely" path. |
b9c0c958 |
269 | .data section : r/w data |
270 | 0 byte |
271 | .rodata.str1 : strings |
272 | Length of the marker name |
273 | .debug_str : strings (if loaded..) |
274 | Length of the marker name + 7 bytes (__mark_) |
275 | .markers |
276 | 8 bytes (2 pointers) |
277 | .markers.c |
278 | 12 bytes (3 pointers) |
279 | |
280 | - Generic |
281 | |
282 | .text section : instructions |
bc82195a |
283 | Adds 11 bytes in the "likely" path. |
284 | Adds 32 bytes in the "unlikely" path. |
b9c0c958 |
285 | .data section : r/w data |
286 | 1 byte (the activation flag) |
287 | .rodata.str1 : strings |
288 | Length of the marker name |
289 | .debug_str : strings (if loaded..) |
290 | Length of the marker name + 7 bytes (__mark_) |
291 | .markers |
292 | 8 bytes (2 pointers) |
293 | .markers.c |
294 | 12 bytes (3 pointers) |
bc82195a |
295 | |
296 | |
c7b68d19 |
297 | * Macro-benchmarks |
bc82195a |
298 | |
299 | Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache. |
300 | Running a 2.6.17 vanilla kernel : |
301 | real 8m2.443s |
302 | user 7m35.124s |
303 | sys 0m34.950s |
304 | |
305 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
306 | real 8m1.635s |
307 | user 7m34.552s |
308 | sys 0m36.298s |
309 | |
c7b68d19 |
310 | --> 0.98 % speedup with markers |
311 | |
bc82195a |
312 | Ping flood on loopback interface : |
313 | Running a 2.6.17 vanilla kernel : |
314 | 136596 packets transmitted, 136596 packets received, 0% packet loss |
315 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
316 | |
317 | real 0m10.840s |
318 | user 0m0.360s |
319 | sys 0m10.485s |
320 | |
321 | 12601 packets transmitted/s |
322 | |
323 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
324 | 108504 packets transmitted, 108504 packets received, 0% packet loss |
325 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
326 | |
327 | real 0m8.614s |
328 | user 0m0.264s |
329 | sys 0m8.353s |
330 | |
331 | 12596 packets transmitted/s |
332 | |
c7b68d19 |
333 | --> 0.03 % slowdown with markers |
bc82195a |
334 | |
335 | |
336 | Conclusion |
337 | |
338 | In an empty loop, the generic marker is faster than the optimized marker. This |
339 | may be due to better performances of the movzbl instruction over the movb on the |
340 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
341 | of the movzbl becomes greater because it uses the memory bandwidth. |
342 | |
343 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
344 | as much as dynamically parsing the format string to get the variable arguments |
345 | (40.48 cycles). |
346 | |
347 | There is almost no difference, on x86, between passing the arguments directly on |
348 | the stack and using a variable argument list when its layout is known |
349 | statically (0.89 cycles vs 2.74 cycles). |
350 | |
b9c0c958 |
351 | The int3 approach for adding instrumentation dynamically saves the 0.074 cycle |
352 | (typcal use, high memory usage) used by the optimized marker by adding the |
353 | ability to insert a breakpoint at any location without any impact on the code |
354 | when inactive. This breakpoint based approach is very useful to instrument core |
355 | kernel code that has not been previously marked without need to recompile and |
356 | reboot. We can therefore compare the case "without markers" to the null impact |
c7b68d19 |
357 | of an inactive int3 breakpoint. |
358 | |
359 | However, the performance impact for using a kprobe is non negligible when |
360 | activated. Assuming that kprobes would have a mechanism to get the variables |
361 | from the caller's stack, it would perform the same task in at least 4178.23 |
362 | cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very |
363 | useful for the reason explained earlier, the high event rate paths in the kernel |
364 | would clearly benefit from a marker mechanism when the are probed. |
365 | |
366 | Code size and memory footprints are smaller with the optimized version : 6 |
367 | bytes of code in the likely path compared to 11 bytes. The memory footprint of |
368 | the optimized approach saves 4 bytes of data memory that would otherwise have to |
369 | stay in cache. |
370 | |
371 | On the macro-benchmark side, no significant difference in performance has been |
372 | found between the vanilla kernel and a kernel "marked" with the standard LTTng |
373 | instrumentation. |
bc82195a |
374 | |
375 | |