| 1 | |
| 2 | |
| 3 | * Microbenchmarks |
| 4 | |
| 5 | Use timestamp counter to calculate the time spent, with interrupts disabled. |
| 6 | Machine : Pentium 4 3GHz |
| 7 | Fully preemptible kernel |
| 8 | marker : MARK(subsys_mark1, "%d %p", 1, NULL); |
| 9 | Linux Kernel Markers 0.19 |
| 10 | |
| 11 | * Execute an empty loop |
| 12 | NR_LOOPS : 10000000 |
| 13 | time delta (cycles): 15026497 |
| 14 | cycles per loop : 1.50 |
| 15 | - i386 "optimized" : immediate value, test and predicted branch |
| 16 | (non connected marker) |
| 17 | NR_LOOPS : 10000000 |
| 18 | time delta (cycles): 40031640 |
| 19 | cycles per loop : 4.00 |
| 20 | cycles per loop for marker : 2.50 |
| 21 | - i386 "generic" : load, test and predicted branch |
| 22 | (non connected marker) |
| 23 | NR_LOOPS : 10000000 |
| 24 | time delta (cycles): 26697878 |
| 25 | cycles per loop : 2.67 |
| 26 | cycles per loop for marker : 1.17 |
| 27 | |
| 28 | * Execute a loop of memcpy 4096 bytes |
| 29 | - Without marker |
| 30 | NR_LOOPS : 10000 |
| 31 | time delta (cycles): 12981555 |
| 32 | cycles per loop : 1298.16 |
| 33 | - i386 "optimized" : immediate value, test and predicted branch |
| 34 | (non connected marker) |
| 35 | NR_LOOPS : 10000 |
| 36 | time delta (cycles): 12982290 |
| 37 | cycles per loop : 1298.23 |
| 38 | cycles per loop for marker : 0.074 |
| 39 | - i386 "generic" : load, test and predicted branch |
| 40 | (non connected marker) |
| 41 | NR_LOOPS : 10000 |
| 42 | time delta (cycles): 13002788 |
| 43 | cycles per loop : 1300.28 |
| 44 | cycles per loop for marker : 2.123 |
| 45 | |
| 46 | |
| 47 | The following tests are done with the "optimized" markers only |
| 48 | |
| 49 | Execute a loop with a marker enabled, with an empty probe. |
| 50 | NR_LOOPS : 100000 |
| 51 | time delta (cycles): 5210587 |
| 52 | cycles per loop : 52.11 |
| 53 | cycles per loop for empty probe : 52.11-4.00=48.11 |
| 54 | |
| 55 | Execute a loop with marker enabled, with i386 direct argument passing. |
| 56 | NR_LOOPS : 100000 |
| 57 | time delta (cycles): 5299837 |
| 58 | cycles per loop : 53.00 |
| 59 | cycles per loop to get arguments in probe (from stack) on x86 : 53.00-52.11=0.89 |
| 60 | |
| 61 | Execute a loop with marker enabled, with var args probe. |
| 62 | NR_LOOPS : 100000 |
| 63 | time delta (cycles): 5574300 |
| 64 | cycles per loop : 55.74 |
| 65 | cycles per loop to get expected variable arguments on x86 : 55.74-53.00=2.74 |
| 66 | |
| 67 | Execute a loop with marker enabled, with var args probe, format string |
| 68 | processing. |
| 69 | NR_LOOPS : 100000 |
| 70 | time delta (cycles): 9622117 |
| 71 | cycles per loop : 96.22 |
| 72 | cycles per loop to dynamically parse arguments |
| 73 | with format string : 96.22-55.74=40.48 |
| 74 | |
| 75 | |
| 76 | * Assembly code |
| 77 | |
| 78 | |
| 79 | - Optimized |
| 80 | |
| 81 | static int my_open(struct inode *inode, struct file *file) |
| 82 | { |
| 83 | 0: 55 push %ebp |
| 84 | 1: 89 e5 mov %esp,%ebp |
| 85 | 3: 83 ec 0c sub $0xc,%esp |
| 86 | MARK(subsys_mark1, "%d %p", 1, NULL); |
| 87 | 6: b0 00 mov $0x0,%al |
| 88 | 8: 84 c0 test %al,%al |
| 89 | a: 75 07 jne 13 <my_open+0x13> |
| 90 | |
| 91 | return -EPERM; |
| 92 | } |
| 93 | c: b8 ff ff ff ff mov $0xffffffff,%eax |
| 94 | 11: c9 leave |
| 95 | 12: c3 ret |
| 96 | 13: b8 01 00 00 00 mov $0x1,%eax |
| 97 | 18: e8 fc ff ff ff call 19 <my_open+0x19> |
| 98 | 1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
| 99 | 24: 00 |
| 100 | 25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
| 101 | 2c: 00 |
| 102 | 2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
| 103 | 34: ff 15 74 10 00 00 call *0x1074 |
| 104 | 3a: b8 01 00 00 00 mov $0x1,%eax |
| 105 | 3f: e8 fc ff ff ff call 40 <my_open+0x40> |
| 106 | 44: eb c6 jmp c <my_open+0xc> |
| 107 | |
| 108 | |
| 109 | - Generic |
| 110 | |
| 111 | static int my_open(struct inode *inode, struct file *file) |
| 112 | { |
| 113 | 0: 55 push %ebp |
| 114 | 1: 89 e5 mov %esp,%ebp |
| 115 | 3: 83 ec 0c sub $0xc,%esp |
| 116 | MARK(subsys_mark1, "%d %p", 1, NULL); |
| 117 | 6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax |
| 118 | d: 84 c0 test %al,%al |
| 119 | f: 75 07 jne 18 <my_open+0x18> |
| 120 | |
| 121 | return -EPERM; |
| 122 | } |
| 123 | 11: b8 ff ff ff ff mov $0xffffffff,%eax |
| 124 | 16: c9 leave |
| 125 | 17: c3 ret |
| 126 | 18: b8 01 00 00 00 mov $0x1,%eax |
| 127 | 1d: e8 fc ff ff ff call 1e <my_open+0x1e> |
| 128 | 22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp) |
| 129 | 29: 00 |
| 130 | 2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp) |
| 131 | 31: 00 |
| 132 | 32: c7 04 24 0d 00 00 00 movl $0xd,(%esp) |
| 133 | 39: ff 15 74 10 00 00 call *0x1074 |
| 134 | 3f: b8 01 00 00 00 mov $0x1,%eax |
| 135 | 44: e8 fc ff ff ff call 45 <my_open+0x45> |
| 136 | 49: eb c6 jmp 11 <my_open+0x11> |
| 137 | |
| 138 | * Size (x86) |
| 139 | |
| 140 | - Optimized |
| 141 | |
| 142 | Adds 6 bytes in the "likely" path. |
| 143 | Adds 32 bytes in the "unlikely" path. |
| 144 | |
| 145 | - Generic |
| 146 | |
| 147 | Adds 11 bytes in the "likely" path. |
| 148 | Adds 32 bytes in the "unlikely" path. |
| 149 | |
| 150 | |
| 151 | |
| 152 | Conclusion |
| 153 | |
| 154 | In an empty loop, the generic marker is faster than the optimized marker. This |
| 155 | may be due to better performances of the movzbl instruction over the movb on the |
| 156 | Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact |
| 157 | of the movzbl becomes greater because it uses the memory bandwidth. |
| 158 | |
| 159 | The preemption disabling and call to a probe itself costs 48.11 cycles, almost |
| 160 | as much as dynamically parsing the format string to get the variable arguments |
| 161 | (40.48 cycles). |
| 162 | |
| 163 | There is almost no difference, on x86, between passing the arguments directly on |
| 164 | the stack and using a variable argument list when its layout is known |
| 165 | statically (0.89 cycles vs 2.74 cycles). |
| 166 | |
| 167 | |
| 168 | |