5 Use timestamp counter to calculate the time spent, with interrupts disabled.
6 Machine : Pentium 4 3GHz
7 Fully preemptible kernel
8 marker : MARK(subsys_mark1, "%d %p", 1, NULL);
9 Linux Kernel Markers 0.19
11 * Execute an empty loop
13 time delta (cycles): 15026497
14 cycles per loop : 1.50
15 - i386 "optimized" : immediate value, test and predicted branch
16 (non connected marker)
18 time delta (cycles): 40031640
19 cycles per loop : 4.00
20 cycles per loop for marker : 2.50
21 - i386 "generic" : load, test and predicted branch
22 (non connected marker)
24 time delta (cycles): 26697878
25 cycles per loop : 2.67
26 cycles per loop for marker : 1.17
28 * Execute a loop of memcpy 4096 bytes
31 time delta (cycles): 12981555
32 cycles per loop : 1298.16
33 - i386 "optimized" : immediate value, test and predicted branch
34 (non connected marker)
36 time delta (cycles): 12982290
37 cycles per loop : 1298.23
38 cycles per loop for marker : 0.074
39 - i386 "generic" : load, test and predicted branch
40 (non connected marker)
42 time delta (cycles): 13002788
43 cycles per loop : 1300.28
44 cycles per loop for marker : 2.123
47 The following tests are done with the "optimized" markers only
49 Execute a loop with a marker enabled, with an empty probe.
51 time delta (cycles): 5210587
52 cycles per loop : 52.11
53 cycles per loop for empty probe : 52.11-4.00=48.11
55 Execute a loop with marker enabled, with i386 direct argument passing.
57 time delta (cycles): 5299837
58 cycles per loop : 53.00
59 cycles per loop to get arguments in probe (from stack) on x86 : 53.00-52.11=0.89
61 Execute a loop with marker enabled, with var args probe.
63 time delta (cycles): 5574300
64 cycles per loop : 55.74
65 cycles per loop to get expected variable arguments on x86 : 55.74-53.00=2.74
67 Execute a loop with marker enabled, with var args probe, format string
70 time delta (cycles): 9622117
71 cycles per loop : 96.22
72 cycles per loop to dynamically parse arguments
73 with format string : 96.22-55.74=40.48
81 static int my_open(struct inode *inode, struct file *file)
84 1: 89 e5 mov %esp,%ebp
85 3: 83 ec 0c sub $0xc,%esp
86 MARK(subsys_mark1, "%d %p", 1, NULL);
89 a: 75 07 jne 13 <my_open+0x13>
93 c: b8 ff ff ff ff mov $0xffffffff,%eax
96 13: b8 01 00 00 00 mov $0x1,%eax
97 18: e8 fc ff ff ff call 19 <my_open+0x19>
98 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
100 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
102 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
103 34: ff 15 74 10 00 00 call *0x1074
104 3a: b8 01 00 00 00 mov $0x1,%eax
105 3f: e8 fc ff ff ff call 40 <my_open+0x40>
106 44: eb c6 jmp c <my_open+0xc>
111 static int my_open(struct inode *inode, struct file *file)
114 1: 89 e5 mov %esp,%ebp
115 3: 83 ec 0c sub $0xc,%esp
116 MARK(subsys_mark1, "%d %p", 1, NULL);
117 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax
118 d: 84 c0 test %al,%al
119 f: 75 07 jne 18 <my_open+0x18>
123 11: b8 ff ff ff ff mov $0xffffffff,%eax
126 18: b8 01 00 00 00 mov $0x1,%eax
127 1d: e8 fc ff ff ff call 1e <my_open+0x1e>
128 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
130 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
132 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
133 39: ff 15 74 10 00 00 call *0x1074
134 3f: b8 01 00 00 00 mov $0x1,%eax
135 44: e8 fc ff ff ff call 45 <my_open+0x45>
136 49: eb c6 jmp 11 <my_open+0x11>
142 Adds 6 bytes in the "likely" path.
143 Adds 32 bytes in the "unlikely" path.
147 Adds 11 bytes in the "likely" path.
148 Adds 32 bytes in the "unlikely" path.
154 In an empty loop, the generic marker is faster than the optimized marker. This
155 may be due to better performances of the movzbl instruction over the movb on the
156 Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
157 of the movzbl becomes greater because it uses the memory bandwidth.
159 The preemption disabling and call to a probe itself costs 48.11 cycles, almost
160 as much as dynamically parsing the format string to get the variable arguments
163 There is almost no difference, on x86, between passing the arguments directly on
164 the stack and using a variable argument list when its layout is known
165 statically (0.89 cycles vs 2.74 cycles).