| 1 | |
| 2 | |
| 3 | Hi, |
| 4 | |
| 5 | Following the huge discussion thread about tracing/static vs dynamic |
| 6 | instrumentation/markers, a consensus seems to emerge about the need for a |
| 7 | marker system in the Linux kernel. The main issues this mechanism addresses are: |
| 8 | |
| 9 | - Identify code important to runtime data collection/analysis tools in tree so |
| 10 | that it follows the code changes naturally. |
| 11 | - Be visually appealing to kernel developers. |
| 12 | - Have a very low impact on the system performance. |
| 13 | - Integrate in the standard kernel infrastructure : use C and loadable modules. |
| 14 | |
| 15 | The time has come for some performance measurements of the Linux Kernel Markers, |
| 16 | which follows. |
| 17 | |
| 18 | |
| 19 | * Micro-benchmarks |
| 20 | |
| 21 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
| 22 | Machine : Pentium 4 3GHz, 1GB ram |
| 23 | Fully preemptible kernel |
| 24 | Linux Kernel Markers 0.19 |
| 25 | Kernel : Linux 2.6.17 |
| 26 | |
| 27 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
| 28 | |
| 29 | This marker, with two elements (integer and pointer) have been chosen because it |
| 30 | is representative of high volume events. For instance, a trap entry event logs a |
| 31 | trap_id (long) and an address (pointer). The same applies to system calls, where |
| 32 | a system call entry event logs both the ID of the system call and the address of |
| 33 | the caller. |
| 34 | |
| 35 | |
| 36 | * Execute an empty loop |
| 37 | |
| 38 | - Without marker |
| 39 | NR_LOOPS : 10000000 |
| 40 | time delta (cycles): 15026497 |
| 41 | cycles per loop : 1.50 |
| 42 | |
| 43 | - i386 "optimized" : immediate value, test and predicted branch |
| 44 | (non connected marker) |
| 45 | NR_LOOPS : 10000000 |
| 46 | time delta (cycles): 40031640 |
| 47 | cycles per loop : 4.00 |
| 48 | cycles per loop for marker : 4.00-1.50=2.50 |
| 49 | |
| 50 | - i386 "generic" : load, test and predicted branch |
| 51 | (non connected marker) |
| 52 | NR_LOOPS : 10000000 |
| 53 | time delta (cycles): 26697878 |
| 54 | cycles per loop : 2.67 |
| 55 | cycles per loop for marker : 2.67-1.50=1.17 |
| 56 | |
| 57 | |
| 58 | * Execute a loop of memcpy 4096 bytes |
| 59 | |
| 60 | This test has been done to show the impact of markers on a system where the |
| 61 | memory is already used, which is more representative of a running kernel. |
| 62 | |
| 63 | - Without marker |
| 64 | NR_LOOPS : 10000 |
| 65 | time delta (cycles): 12981555 |
| 66 | cycles per loop : 1298.16 |
| 67 | |
| 68 | - i386 "optimized" : immediate value, test and predicted branch |
| 69 | (non connected marker) |
| 70 | NR_LOOPS : 10000 |
| 71 | time delta (cycles): 12982290 |
| 72 | cycles per loop : 1298.23 |
| 73 | cycles per loop for marker : 1298.23-1298.16=0.074 |
| 74 | |
| 75 | - i386 "generic" : load, test and predicted branch |
| 76 | (non connected marker) |
| 77 | NR_LOOPS : 10000 |
| 78 | time delta (cycles): 13002788 |
| 79 | cycles per loop : 1300.28 |
| 80 | cycles per loop for marker : 1300.28-1298.16=2.123 |
| 81 | |
| 82 | |
| 83 | The following tests are done with the "optimized" markers only |
| 84 | |
| 85 | - Execute a loop with marker enabled, with i386 "fastcall" register argument |
| 86 | setup, probe empty. With preemption disabling. |
| 87 | NR_LOOPS : 100000 |
| 88 | time delta (cycles): 4407608 |
| 89 | cycles per loop : 44.08 |
| 90 | cycles per loop to disable preemption and setup arguments in registers : |
| 91 | 44.08-4.00=40.08 |
| 92 | |
| 93 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
| 94 | setup, probe empty. With preemption disabling. |
| 95 | NR_LOOPS : 100000 |
| 96 | time delta (cycles): 5210587 |
| 97 | cycles per loop : 52.11 |
| 98 | additional cycles per loop to setup var args : 52.11-44.08=8.03 |
| 99 | |
| 100 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
| 101 | setup, probe empty. No preemption disabling. |
| 102 | NR_LOOPS : 100000 |
| 103 | time delta (cycles): 3363450 |
| 104 | cycles per loop : 33.63 |
| 105 | cycles per loop to disable preemption : 44.08-33.63=10.45 |
| 106 | |
| 107 | - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected. |
| 108 | Data is copied by the probe. With preemption disabling. |
| 109 | NR_LOOPS : 100000 |
| 110 | time delta (cycles): 5299837 |
| 111 | cycles per loop : 53.00 |
| 112 | additional cycles per loop to get arguments in probe (from stack) on x86 : |
| 113 | 53.00-52.11=0.89 |
| 114 | |
| 115 | - Execute a loop with marker enabled, with var args probe expecting arguments. |
| 116 | Data is copied by the probe. With preemption disabling. |
| 117 | NR_LOOPS : 100000 |
| 118 | time delta (cycles): 5574300 |
| 119 | cycles per loop : 55.74 |
| 120 | additional cycles per loop to get expected variable arguments on x86 : |
| 121 | 55.74-53.00=2.74 |
| 122 | |
| 123 | - Execute a loop with marker enabled, with var args probe, format string |
| 124 | Data is copied by the probe. This is a 6 bytes string to decode. |
| 125 | NR_LOOPS : 100000 |
| 126 | time delta (cycles): 9622117 |
| 127 | cycles per loop : 96.22 |
| 128 | additional cycles per loop to dynamically parse arguments with a 6 bytes format |
| 129 | string : 96.22-55.74=40.48 |
| 130 | |
| 131 | - Execute a loop with marker enabled, with var args probe expecting arguments. |
| 132 | Data is copied by the probe. With preemption disabling. An empty "kprobe" is |
| 133 | connected to the probe. |
| 134 | NR_LOOPS : 100000 |
| 135 | time delta (cycles): 423397455 |
| 136 | cycles per loop : 4233.97 |
| 137 | additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23 |
| 138 | |
| 139 | |
| 140 | * Assembly code |
| 141 | |
| 142 | The disassembly of the following marked function will be shown : |
| 143 | |
| 144 | static int my_open(struct inode *inode, struct file *file) |
| 145 | { |
| 146 | MARK(subsys_mark1, "%d %p", 1, NULL); |
| 147 | |
| 148 | return -EPERM; |
| 149 | } |
| 150 | |
| 151 | |
| 152 | - Optimized |
| 153 | |
| 154 | static int my_open(struct inode *inode, struct file *file) |
| 155 | { |
| 156 | 0: 55 push %ebp |
| 157 | 1: 89 e5 mov %esp,%ebp |
| 158 | 3: 83 ec 0c sub $0xc,%esp |
| 159 | MARK(subsys_mark1, "%d %p", 1, NULL); |
| 160 | 6: b0 00 mov $0x0,%al <-- immediate load 0 in al |
| 161 | 8: 84 c0 test %al,%al |
| 162 | a: 75 07 jne 13 <my_open+0x13> |
| 163 | |
| 164 | return -EPERM; |
| 165 | } |
| 166 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
| 167 | 11: c9 leave |
| 168 | 12: c3 ret |
| 169 | 13: b8 01 00 00 00 mov $0x1,%eax |
| 170 | 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable |
| 171 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
| 172 | 24: 00 |
| 173 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
| 174 | 2c: 00 |
| 175 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
| 176 | 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
| 177 | 3a: b8 01 00 00 00 mov $0x1,%eax |
| 178 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable |
| 179 | 44: eb c6 jmp c <my_open+0xc> |
| 180 | |
| 181 | |
| 182 | - Generic |
| 183 | |
| 184 | static int my_open(struct inode *inode, struct file *file) |
| 185 | { |
| 186 | 0: 55 push %ebp |
| 187 | 1: 89 e5 mov %esp,%ebp |
| 188 | 3: 83 ec 0c sub $0xc,%esp |
| 189 | MARK(subsys_mark1, "%d %p", 1, NULL); |
| 190 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte |
| 191 | d: 84 c0 test %al,%al |
| 192 | f: 75 07 jne 18 <my_open+0x18> |
| 193 | |
| 194 | return -EPERM; |
| 195 | } |
| 196 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
| 197 | 16: c9 leave |
| 198 | 17: c3 ret |
| 199 | 18: b8 01 00 00 00 mov $0x1,%eax |
| 200 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable |
| 201 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
| 202 | 29: 00 |
| 203 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
| 204 | 31: 00 |
| 205 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
| 206 | 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
| 207 | 3f: b8 01 00 00 00 mov $0x1,%eax |
| 208 | 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable |
| 209 | 49: eb c6 jmp 11 <my_open+0x11> |
| 210 | |
| 211 | |
| 212 | |
| 213 | Here is the typical var arg probe that has been used in those tests. It saves |
| 214 | the values expectes as parameters in global variables. The DO_MARK1_FORMAT |
| 215 | define is used for probe registration to make sure that it will be connected |
| 216 | with a marker that has a matching format string. Note that this checking is |
| 217 | optional : the probe can register with a NULL format and afterward check itself |
| 218 | the format string received in parameter dynamically. |
| 219 | |
| 220 | int value; |
| 221 | void *ptr; |
| 222 | |
| 223 | #define DO_MARK1_FORMAT "%d %p" |
| 224 | void do_mark1(const char *format, ...) |
| 225 | { |
| 226 | va_list ap; |
| 227 | |
| 228 | va_start(ap, format); |
| 229 | value = va_arg(ap, int); |
| 230 | ptr = va_arg(ap, void*); |
| 231 | |
| 232 | va_end(ap); |
| 233 | } |
| 234 | |
| 235 | |
| 236 | Here is the disassembly of the probe : |
| 237 | |
| 238 | #define DO_MARK1_FORMAT "%d %p" |
| 239 | void do_mark1(const char *format, ...) |
| 240 | { |
| 241 | 0: 55 push %ebp |
| 242 | 1: 89 e5 mov %esp,%ebp |
| 243 | 3: 83 ec 04 sub $0x4,%esp |
| 244 | va_list ap; |
| 245 | |
| 246 | va_start(ap, format); |
| 247 | value = va_arg(ap, int); |
| 248 | 6: 8b 45 0c mov 0xc(%ebp),%eax |
| 249 | 9: a3 00 00 00 00 mov %eax,0x0 |
| 250 | ptr = va_arg(ap, void*); |
| 251 | e: 8b 45 10 mov 0x10(%ebp),%eax |
| 252 | 11: a3 00 00 00 00 mov %eax,0x0 |
| 253 | |
| 254 | va_end(ap); |
| 255 | } |
| 256 | 16: c9 leave |
| 257 | 17: c3 ret |
| 258 | |
| 259 | |
| 260 | * Size (x86) |
| 261 | |
| 262 | This is the size added by each marker to the memory image : |
| 263 | |
| 264 | - Optimized |
| 265 | |
| 266 | .text section : instructions |
| 267 | Adds 6 bytes in the "likely" path. |
| 268 | Adds 32 bytes in the "unlikely" path. |
| 269 | .data section : r/w data |
| 270 | 0 byte |
| 271 | .rodata.str1 : strings |
| 272 | Length of the marker name |
| 273 | .debug_str : strings (if loaded..) |
| 274 | Length of the marker name + 7 bytes (__mark_) |
| 275 | .markers |
| 276 | 8 bytes (2 pointers) |
| 277 | .markers.c |
| 278 | 12 bytes (3 pointers) |
| 279 | |
| 280 | - Generic |
| 281 | |
| 282 | .text section : instructions |
| 283 | Adds 11 bytes in the "likely" path. |
| 284 | Adds 32 bytes in the "unlikely" path. |
| 285 | .data section : r/w data |
| 286 | 1 byte (the activation flag) |
| 287 | .rodata.str1 : strings |
| 288 | Length of the marker name |
| 289 | .debug_str : strings (if loaded..) |
| 290 | Length of the marker name + 7 bytes (__mark_) |
| 291 | .markers |
| 292 | 8 bytes (2 pointers) |
| 293 | .markers.c |
| 294 | 12 bytes (3 pointers) |
| 295 | |
| 296 | |
| 297 | * Macro-benchmarks |
| 298 | |
| 299 | Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache. |
| 300 | Running a 2.6.17 vanilla kernel : |
| 301 | real 8m2.443s |
| 302 | user 7m35.124s |
| 303 | sys 0m34.950s |
| 304 | |
| 305 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
| 306 | real 8m1.635s |
| 307 | user 7m34.552s |
| 308 | sys 0m36.298s |
| 309 | |
| 310 | --> 0.98 % speedup with markers |
| 311 | |
| 312 | Ping flood on loopback interface : |
| 313 | Running a 2.6.17 vanilla kernel : |
| 314 | 136596 packets transmitted, 136596 packets received, 0% packet loss |
| 315 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
| 316 | |
| 317 | real 0m10.840s |
| 318 | user 0m0.360s |
| 319 | sys 0m10.485s |
| 320 | |
| 321 | 12601 packets transmitted/s |
| 322 | |
| 323 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
| 324 | 108504 packets transmitted, 108504 packets received, 0% packet loss |
| 325 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
| 326 | |
| 327 | real 0m8.614s |
| 328 | user 0m0.264s |
| 329 | sys 0m8.353s |
| 330 | |
| 331 | 12596 packets transmitted/s |
| 332 | |
| 333 | --> 0.03 % slowdown with markers |
| 334 | |
| 335 | |
| 336 | Conclusion |
| 337 | |
| 338 | In an empty loop, the generic marker is faster than the optimized marker. This |
| 339 | may be due to better performances of the movzbl instruction over the movb on the |
| 340 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
| 341 | of the movzbl becomes greater because it uses the memory bandwidth. |
| 342 | |
| 343 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
| 344 | as much as dynamically parsing the format string to get the variable arguments |
| 345 | (40.48 cycles). |
| 346 | |
| 347 | There is almost no difference, on x86, between passing the arguments directly on |
| 348 | the stack and using a variable argument list when its layout is known |
| 349 | statically (0.89 cycles vs 2.74 cycles). |
| 350 | |
| 351 | The int3 approach for adding instrumentation dynamically saves the 0.074 cycle |
| 352 | (typcal use, high memory usage) used by the optimized marker by adding the |
| 353 | ability to insert a breakpoint at any location without any impact on the code |
| 354 | when inactive. This breakpoint based approach is very useful to instrument core |
| 355 | kernel code that has not been previously marked without need to recompile and |
| 356 | reboot. We can therefore compare the case "without markers" to the null impact |
| 357 | of an inactive int3 breakpoint. |
| 358 | |
| 359 | However, the performance impact for using a kprobe is non negligible when |
| 360 | activated. Assuming that kprobes would have a mechanism to get the variables |
| 361 | from the caller's stack, it would perform the same task in at least 4178.23 |
| 362 | cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very |
| 363 | useful for the reason explained earlier, the high event rate paths in the kernel |
| 364 | would clearly benefit from a marker mechanism when the are probed. |
| 365 | |
| 366 | Code size and memory footprints are smaller with the optimized version : 6 |
| 367 | bytes of code in the likely path compared to 11 bytes. The memory footprint of |
| 368 | the optimized approach saves 4 bytes of data memory that would otherwise have to |
| 369 | stay in cache. |
| 370 | |
| 371 | On the macro-benchmark side, no significant difference in performance has been |
| 372 | found between the vanilla kernel and a kernel "marked" with the standard LTTng |
| 373 | instrumentation. |
| 374 | |
| 375 | |