bc82195a |
1 | |
2 | |
3 | * Microbenchmarks |
4 | |
5 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
b9c0c958 |
6 | Machine : Pentium 4 3GHz, 1GB ram |
bc82195a |
7 | Fully preemptible kernel |
bc82195a |
8 | Linux Kernel Markers 0.19 |
b9c0c958 |
9 | Kernel : Linux 2.6.17 |
10 | |
11 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
12 | |
13 | This marker, with two elements (integer and pointer) have been chosen because it |
14 | is representative of high volume events. For instance, a trap entry event logs a |
15 | trap_id (long) and an address (pointer). The same applies to system calls, where |
16 | a system call entry event logs both the ID of the system call and the address of |
17 | the caller. |
18 | |
bc82195a |
19 | |
20 | * Execute an empty loop |
b9c0c958 |
21 | |
22 | - Without marker |
bc82195a |
23 | NR_LOOPS : 10000000 |
24 | time delta (cycles): 15026497 |
25 | cycles per loop : 1.50 |
b9c0c958 |
26 | |
bc82195a |
27 | - i386 "optimized" : immediate value, test and predicted branch |
28 | (non connected marker) |
29 | NR_LOOPS : 10000000 |
30 | time delta (cycles): 40031640 |
31 | cycles per loop : 4.00 |
b9c0c958 |
32 | cycles per loop for marker : 4.00-1.50=2.50 |
33 | |
bc82195a |
34 | - i386 "generic" : load, test and predicted branch |
35 | (non connected marker) |
36 | NR_LOOPS : 10000000 |
37 | time delta (cycles): 26697878 |
38 | cycles per loop : 2.67 |
b9c0c958 |
39 | cycles per loop for marker : 2.67-1.50=1.17 |
40 | |
bc82195a |
41 | |
42 | * Execute a loop of memcpy 4096 bytes |
b9c0c958 |
43 | |
44 | This test has been done to show the impact of markers on a system where the |
45 | memory is already used, which is more representative of a running kernel. |
46 | |
bc82195a |
47 | - Without marker |
48 | NR_LOOPS : 10000 |
49 | time delta (cycles): 12981555 |
50 | cycles per loop : 1298.16 |
b9c0c958 |
51 | |
bc82195a |
52 | - i386 "optimized" : immediate value, test and predicted branch |
53 | (non connected marker) |
54 | NR_LOOPS : 10000 |
55 | time delta (cycles): 12982290 |
56 | cycles per loop : 1298.23 |
b9c0c958 |
57 | cycles per loop for marker : 1298.23-1298.16=0.074 |
58 | |
bc82195a |
59 | - i386 "generic" : load, test and predicted branch |
60 | (non connected marker) |
61 | NR_LOOPS : 10000 |
62 | time delta (cycles): 13002788 |
63 | cycles per loop : 1300.28 |
b9c0c958 |
64 | cycles per loop for marker : 1300.28-1298.16=2.123 |
bc82195a |
65 | |
66 | |
67 | The following tests are done with the "optimized" markers only |
68 | |
b9c0c958 |
69 | - Execute a loop with marker enabled, with i386 "fastcall" register argument |
70 | setup, probe empty. With preemption disabling. |
bc82195a |
71 | NR_LOOPS : 100000 |
72 | time delta (cycles): 4407608 |
73 | cycles per loop : 44.08 |
b9c0c958 |
74 | cycles per loop to disable preemption and setup arguments in registers : |
75 | 44.08-4.00=40.08 |
bc82195a |
76 | |
b9c0c958 |
77 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
78 | setup, probe empty. With preemption disabling. |
bc82195a |
79 | NR_LOOPS : 100000 |
80 | time delta (cycles): 5210587 |
81 | cycles per loop : 52.11 |
82 | additional cycles per loop to setup var args : 52.11-44.08=8.03 |
83 | |
b9c0c958 |
84 | - Execute a loop with a marker enabled, with an empty probe. Var args argument |
85 | setup, probe empty. No preemption disabling. |
86 | NR_LOOPS : 100000 |
87 | time delta (cycles): 3363450 |
88 | cycles per loop : 33.63 |
89 | cycles per loop to disable preemption : 44.08-33.63=10.45 |
90 | |
91 | - Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected. |
92 | Data is copied by the probe. With preemption disabling. |
bc82195a |
93 | NR_LOOPS : 100000 |
94 | time delta (cycles): 5299837 |
95 | cycles per loop : 53.00 |
96 | additional cycles per loop to get arguments in probe (from stack) on x86 : |
b9c0c958 |
97 | 53.00-52.11=0.89 |
bc82195a |
98 | |
b9c0c958 |
99 | - Execute a loop with marker enabled, with var args probe expecting arguments. |
100 | Data is copied by the probe. With preemption disabling. |
bc82195a |
101 | NR_LOOPS : 100000 |
102 | time delta (cycles): 5574300 |
103 | cycles per loop : 55.74 |
104 | additional cycles per loop to get expected variable arguments on x86 : |
105 | 55.74-53.00=2.74 |
106 | |
b9c0c958 |
107 | - Execute a loop with marker enabled, with var args probe, format string |
108 | Data is copied by the probe. This is a 6 bytes string to decode. |
bc82195a |
109 | processing. |
110 | NR_LOOPS : 100000 |
111 | time delta (cycles): 9622117 |
112 | cycles per loop : 96.22 |
b9c0c958 |
113 | additional cycles per loop to dynamically parse arguments with a 6 bytes format |
114 | string : |
115 | 96.22-55.74=40.48 |
bc82195a |
116 | |
117 | |
118 | * Assembly code |
119 | |
b9c0c958 |
120 | The disassembly of the following marked function will be shown : |
121 | |
122 | static int my_open(struct inode *inode, struct file *file) |
123 | { |
124 | MARK(subsys_mark1, "%d %p", 1, NULL); |
125 | |
126 | return -EPERM; |
127 | } |
128 | |
bc82195a |
129 | |
130 | - Optimized |
131 | |
132 | static int my_open(struct inode *inode, struct file *file) |
133 | { |
134 | 0: 55 push %ebp |
135 | 1: 89 e5 mov %esp,%ebp |
136 | 3: 83 ec 0c sub $0xc,%esp |
137 | MARK(subsys_mark1, "%d %p", 1, NULL); |
b9c0c958 |
138 | 6: b0 00 mov $0x0,%al <-- immediate load 0 in al |
bc82195a |
139 | 8: 84 c0 test %al,%al |
140 | a: 75 07 jne 13 <my_open+0x13> |
141 | |
142 | return -EPERM; |
143 | } |
144 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
145 | 11: c9 leave |
146 | 12: c3 ret |
147 | 13: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
148 | 18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable |
bc82195a |
149 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
150 | 24: 00 |
151 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
152 | 2c: 00 |
153 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
b9c0c958 |
154 | 34: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
bc82195a |
155 | 3a: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
156 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable |
bc82195a |
157 | 44: eb c6 jmp c <my_open+0xc> |
158 | |
159 | |
160 | - Generic |
161 | |
162 | static int my_open(struct inode *inode, struct file *file) |
163 | { |
164 | 0: 55 push %ebp |
165 | 1: 89 e5 mov %esp,%ebp |
166 | 3: 83 ec 0c sub $0xc,%esp |
167 | MARK(subsys_mark1, "%d %p", 1, NULL); |
b9c0c958 |
168 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte |
bc82195a |
169 | d: 84 c0 test %al,%al |
170 | f: 75 07 jne 18 <my_open+0x18> |
171 | |
172 | return -EPERM; |
173 | } |
174 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
175 | 16: c9 leave |
176 | 17: c3 ret |
177 | 18: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
178 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable |
bc82195a |
179 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
180 | 29: 00 |
181 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
182 | 31: 00 |
183 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
b9c0c958 |
184 | 39: ff 15 74 10 00 00 call *0x1074 <-- function pointer |
bc82195a |
185 | 3f: b8 01 00 00 00 mov $0x1,%eax |
b9c0c958 |
186 | 44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable |
bc82195a |
187 | 49: eb c6 jmp 11 <my_open+0x11> |
188 | |
b9c0c958 |
189 | |
190 | |
191 | Here is the typical var arg probe that has been used in those tests. It saves |
192 | the values expectes as parameters in global variables. The DO_MARK1_FORMAT |
193 | define is used for probe registration to make sure that it will be connected |
194 | with a marker that has a matching format string. Note that this checking is |
195 | optional : the probe can register with a NULL format and afterward check itself |
196 | the format string received in parameter dynamically. |
197 | |
198 | int value; |
199 | void *ptr; |
200 | |
201 | #define DO_MARK1_FORMAT "%d %p" |
202 | void do_mark1(const char *format, ...) |
203 | { |
204 | va_list ap; |
205 | |
206 | va_start(ap, format); |
207 | value = va_arg(ap, int); |
208 | ptr = va_arg(ap, void*); |
209 | |
210 | va_end(ap); |
211 | } |
212 | |
213 | |
214 | Here is the disassembly of the probe : |
215 | |
216 | #define DO_MARK1_FORMAT "%d %p" |
217 | void do_mark1(const char *format, ...) |
218 | { |
219 | 0: 55 push %ebp |
220 | 1: 89 e5 mov %esp,%ebp |
221 | 3: 83 ec 04 sub $0x4,%esp |
222 | va_list ap; |
223 | |
224 | va_start(ap, format); |
225 | value = va_arg(ap, int); |
226 | 6: 8b 45 0c mov 0xc(%ebp),%eax |
227 | 9: a3 00 00 00 00 mov %eax,0x0 |
228 | ptr = va_arg(ap, void*); |
229 | e: 8b 45 10 mov 0x10(%ebp),%eax |
230 | 11: a3 00 00 00 00 mov %eax,0x0 |
231 | |
232 | va_end(ap); |
233 | } |
234 | 16: c9 leave |
235 | 17: c3 ret |
236 | |
237 | |
bc82195a |
238 | * Size (x86) |
239 | |
b9c0c958 |
240 | This is the size added by each marker to the memory image : |
241 | |
bc82195a |
242 | - Optimized |
243 | |
b9c0c958 |
244 | .text section : instructions |
bc82195a |
245 | Adds 6 bytes in the "likely" path. |
246 | Adds 32 bytes in the "unlikely" path. |
b9c0c958 |
247 | .data section : r/w data |
248 | 0 byte |
249 | .rodata.str1 : strings |
250 | Length of the marker name |
251 | .debug_str : strings (if loaded..) |
252 | Length of the marker name + 7 bytes (__mark_) |
253 | .markers |
254 | 8 bytes (2 pointers) |
255 | .markers.c |
256 | 12 bytes (3 pointers) |
257 | |
258 | - Generic |
259 | |
260 | .text section : instructions |
bc82195a |
261 | Adds 11 bytes in the "likely" path. |
262 | Adds 32 bytes in the "unlikely" path. |
b9c0c958 |
263 | .data section : r/w data |
264 | 1 byte (the activation flag) |
265 | .rodata.str1 : strings |
266 | Length of the marker name |
267 | .debug_str : strings (if loaded..) |
268 | Length of the marker name + 7 bytes (__mark_) |
269 | .markers |
270 | 8 bytes (2 pointers) |
271 | .markers.c |
272 | 12 bytes (3 pointers) |
bc82195a |
273 | |
274 | |
275 | * Macrobenchmarks |
276 | |
277 | Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache. |
278 | Running a 2.6.17 vanilla kernel : |
279 | real 8m2.443s |
280 | user 7m35.124s |
281 | sys 0m34.950s |
282 | |
283 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
284 | real 8m1.635s |
285 | user 7m34.552s |
286 | sys 0m36.298s |
287 | |
288 | Ping flood on loopback interface : |
289 | Running a 2.6.17 vanilla kernel : |
290 | 136596 packets transmitted, 136596 packets received, 0% packet loss |
291 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
292 | |
293 | real 0m10.840s |
294 | user 0m0.360s |
295 | sys 0m10.485s |
296 | |
297 | 12601 packets transmitted/s |
298 | |
299 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
300 | 108504 packets transmitted, 108504 packets received, 0% packet loss |
301 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
302 | |
303 | real 0m8.614s |
304 | user 0m0.264s |
305 | sys 0m8.353s |
306 | |
307 | 12596 packets transmitted/s |
308 | |
309 | |
310 | |
311 | Conclusion |
312 | |
313 | In an empty loop, the generic marker is faster than the optimized marker. This |
314 | may be due to better performances of the movzbl instruction over the movb on the |
315 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
316 | of the movzbl becomes greater because it uses the memory bandwidth. |
317 | |
318 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
319 | as much as dynamically parsing the format string to get the variable arguments |
320 | (40.48 cycles). |
321 | |
322 | There is almost no difference, on x86, between passing the arguments directly on |
323 | the stack and using a variable argument list when its layout is known |
324 | statically (0.89 cycles vs 2.74 cycles). |
325 | |
b9c0c958 |
326 | The int3 approach for adding instrumentation dynamically saves the 0.074 cycle |
327 | (typcal use, high memory usage) used by the optimized marker by adding the |
328 | ability to insert a breakpoint at any location without any impact on the code |
329 | when inactive. This breakpoint based approach is very useful to instrument core |
330 | kernel code that has not been previously marked without need to recompile and |
331 | reboot. We can therefore compare the case "without markers" to the null impact |
332 | of the int3 breakpoint based approach when inactive. |
333 | |
334 | |
335 | |
336 | |
337 | |
338 | |
339 | |
340 | |
341 | |
bc82195a |
342 | |
343 | |