bc82195a |
1 | |
2 | |
3 | * Microbenchmarks |
4 | |
5 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
6 | Machine : Pentium 4 3GHz |
7 | Fully preemptible kernel |
8 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
9 | Linux Kernel Markers 0.19 |
10 | |
11 | * Execute an empty loop |
12 | NR_LOOPS : 10000000 |
13 | time delta (cycles): 15026497 |
14 | cycles per loop : 1.50 |
15 | - i386 "optimized" : immediate value, test and predicted branch |
16 | (non connected marker) |
17 | NR_LOOPS : 10000000 |
18 | time delta (cycles): 40031640 |
19 | cycles per loop : 4.00 |
20 | cycles per loop for marker : 2.50 |
21 | - i386 "generic" : load, test and predicted branch |
22 | (non connected marker) |
23 | NR_LOOPS : 10000000 |
24 | time delta (cycles): 26697878 |
25 | cycles per loop : 2.67 |
26 | cycles per loop for marker : 1.17 |
27 | |
28 | * Execute a loop of memcpy 4096 bytes |
29 | - Without marker |
30 | NR_LOOPS : 10000 |
31 | time delta (cycles): 12981555 |
32 | cycles per loop : 1298.16 |
33 | - i386 "optimized" : immediate value, test and predicted branch |
34 | (non connected marker) |
35 | NR_LOOPS : 10000 |
36 | time delta (cycles): 12982290 |
37 | cycles per loop : 1298.23 |
38 | cycles per loop for marker : 0.074 |
39 | - i386 "generic" : load, test and predicted branch |
40 | (non connected marker) |
41 | NR_LOOPS : 10000 |
42 | time delta (cycles): 13002788 |
43 | cycles per loop : 1300.28 |
44 | cycles per loop for marker : 2.123 |
45 | |
46 | |
47 | The following tests are done with the "optimized" markers only |
48 | |
49 | Execute a loop with marker enabled, with i386 "fastcall" register argument |
50 | setup, probe empty. |
51 | NR_LOOPS : 100000 |
52 | time delta (cycles): 4407608 |
53 | cycles per loop : 44.08 |
54 | cycles per loop to setup arguments in registers : 44.08-4.00=40.08 |
55 | |
56 | Execute a loop with a marker enabled, with an empty probe. Var args argument |
57 | setup, probe empty. |
58 | NR_LOOPS : 100000 |
59 | time delta (cycles): 5210587 |
60 | cycles per loop : 52.11 |
61 | additional cycles per loop to setup var args : 52.11-44.08=8.03 |
62 | |
63 | Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected. |
64 | Data is copied by the probe. |
65 | NR_LOOPS : 100000 |
66 | time delta (cycles): 5299837 |
67 | cycles per loop : 53.00 |
68 | additional cycles per loop to get arguments in probe (from stack) on x86 : |
69 | 53.00-52.11=0.89 |
70 | |
71 | Execute a loop with marker enabled, with var args probe expecting arguments. |
72 | Data is copied by the probe. |
73 | NR_LOOPS : 100000 |
74 | time delta (cycles): 5574300 |
75 | cycles per loop : 55.74 |
76 | additional cycles per loop to get expected variable arguments on x86 : |
77 | 55.74-53.00=2.74 |
78 | |
79 | Execute a loop with marker enabled, with var args probe, format string |
80 | Data is copied by the probe. |
81 | processing. |
82 | NR_LOOPS : 100000 |
83 | time delta (cycles): 9622117 |
84 | cycles per loop : 96.22 |
85 | additional cycles per loop to dynamically parse arguments with format string : |
86 | 96.22-55.74=40.48 |
87 | |
88 | |
89 | * Assembly code |
90 | |
91 | |
92 | - Optimized |
93 | |
94 | static int my_open(struct inode *inode, struct file *file) |
95 | { |
96 | 0: 55 push %ebp |
97 | 1: 89 e5 mov %esp,%ebp |
98 | 3: 83 ec 0c sub $0xc,%esp |
99 | MARK(subsys_mark1, "%d %p", 1, NULL); |
100 | 6: b0 00 mov $0x0,%al |
101 | 8: 84 c0 test %al,%al |
102 | a: 75 07 jne 13 <my_open+0x13> |
103 | |
104 | return -EPERM; |
105 | } |
106 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
107 | 11: c9 leave |
108 | 12: c3 ret |
109 | 13: b8 01 00 00 00 mov $0x1,%eax |
110 | 18: e8 fc ff ff ff call 19 <my_open+0x19> |
111 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
112 | 24: 00 |
113 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
114 | 2c: 00 |
115 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
116 | 34: ff 15 74 10 00 00 call *0x1074 |
117 | 3a: b8 01 00 00 00 mov $0x1,%eax |
118 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> |
119 | 44: eb c6 jmp c <my_open+0xc> |
120 | |
121 | |
122 | - Generic |
123 | |
124 | static int my_open(struct inode *inode, struct file *file) |
125 | { |
126 | 0: 55 push %ebp |
127 | 1: 89 e5 mov %esp,%ebp |
128 | 3: 83 ec 0c sub $0xc,%esp |
129 | MARK(subsys_mark1, "%d %p", 1, NULL); |
130 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax |
131 | d: 84 c0 test %al,%al |
132 | f: 75 07 jne 18 <my_open+0x18> |
133 | |
134 | return -EPERM; |
135 | } |
136 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
137 | 16: c9 leave |
138 | 17: c3 ret |
139 | 18: b8 01 00 00 00 mov $0x1,%eax |
140 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> |
141 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
142 | 29: 00 |
143 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
144 | 31: 00 |
145 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
146 | 39: ff 15 74 10 00 00 call *0x1074 |
147 | 3f: b8 01 00 00 00 mov $0x1,%eax |
148 | 44: e8 fc ff ff ff call 45 <my_open+0x45> |
149 | 49: eb c6 jmp 11 <my_open+0x11> |
150 | |
151 | * Size (x86) |
152 | |
153 | - Optimized |
154 | |
155 | Adds 6 bytes in the "likely" path. |
156 | Adds 32 bytes in the "unlikely" path. |
157 | |
158 | - Generic |
159 | |
160 | Adds 11 bytes in the "likely" path. |
161 | Adds 32 bytes in the "unlikely" path. |
162 | |
163 | |
164 | * Macrobenchmarks |
165 | |
166 | Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache. |
167 | Running a 2.6.17 vanilla kernel : |
168 | real 8m2.443s |
169 | user 7m35.124s |
170 | sys 0m34.950s |
171 | |
172 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
173 | real 8m1.635s |
174 | user 7m34.552s |
175 | sys 0m36.298s |
176 | |
177 | Ping flood on loopback interface : |
178 | Running a 2.6.17 vanilla kernel : |
179 | 136596 packets transmitted, 136596 packets received, 0% packet loss |
180 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
181 | |
182 | real 0m10.840s |
183 | user 0m0.360s |
184 | sys 0m10.485s |
185 | |
186 | 12601 packets transmitted/s |
187 | |
188 | Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) : |
189 | 108504 packets transmitted, 108504 packets received, 0% packet loss |
190 | round-trip min/avg/max = 0.0/0.0/0.1 ms |
191 | |
192 | real 0m8.614s |
193 | user 0m0.264s |
194 | sys 0m8.353s |
195 | |
196 | 12596 packets transmitted/s |
197 | |
198 | |
199 | |
200 | Conclusion |
201 | |
202 | In an empty loop, the generic marker is faster than the optimized marker. This |
203 | may be due to better performances of the movzbl instruction over the movb on the |
204 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
205 | of the movzbl becomes greater because it uses the memory bandwidth. |
206 | |
207 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
208 | as much as dynamically parsing the format string to get the variable arguments |
209 | (40.48 cycles). |
210 | |
211 | There is almost no difference, on x86, between passing the arguments directly on |
212 | the stack and using a variable argument list when its layout is known |
213 | statically (0.89 cycles vs 2.74 cycles). |
214 | |
215 | |
216 | |