Hi,

Following the huge discussion thread about tracing/static vs dynamic
instrumentation/markers, a consensus seems to be emerging about the need for a
marker system in the Linux kernel. The main issues this mechanism addresses are:

- Identify code important to runtime data collection/analysis tools in tree so
  that it follows the code changes naturally.
- Be visually appealing to kernel developers.
- Have a very low impact on the system performance.
- Integrate in the standard kernel infrastructure : use C and loadable modules.

The time has come for some performance measurements of the Linux Kernel Markers,
which follow.


* Micro-benchmarks

The timestamp counter is used to measure the time spent, with interrupts
disabled.
Machine : Pentium 4 3GHz, 1GB ram
Fully preemptible kernel
Linux Kernel Markers 0.19
Kernel : Linux 2.6.17

marker : MARK(subsys_mark1, "%d %p", 1, NULL);

This marker, with two elements (integer and pointer), has been chosen because it
is representative of high volume events. For instance, a trap entry event logs a
trap_id (long) and an address (pointer). The same applies to system calls, where
a system call entry event logs both the ID of the system call and the address of
the caller.


* Execute an empty loop

- Without marker
NR_LOOPS : 10000000
time delta (cycles): 15026497
cycles per loop : 1.50

- i386 "optimized" : immediate value, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 40031640
cycles per loop : 4.00
cycles per loop for marker : 4.00-1.50=2.50

- i386 "generic" : load, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 26697878
cycles per loop : 2.67
cycles per loop for marker : 2.67-1.50=1.17
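
The per-marker figures above are obtained by dividing each cycle delta by
NR_LOOPS and subtracting the baseline loop. A minimal sketch of that arithmetic
follows; the function names are illustrative and are not part of the benchmark
code.

```c
#include <assert.h>
#include <stdint.h>

/* Average cost of one loop iteration, in cycles. */
static double cycles_per_loop(uint64_t delta_cycles, uint64_t nr_loops)
{
        assert(nr_loops > 0);
        return (double)delta_cycles / (double)nr_loops;
}

/* Cost attributed to the marker: marked loop minus the baseline loop. */
static double marker_overhead(uint64_t delta_marked, uint64_t delta_base,
                              uint64_t nr_loops)
{
        return cycles_per_loop(delta_marked, nr_loops) -
               cycles_per_loop(delta_base, nr_loops);
}
```

With the deltas reported above, marker_overhead(40031640, 15026497, 10000000)
gives about 2.50 cycles for the optimized marker, and
marker_overhead(26697878, 15026497, 10000000) gives about 1.17 cycles for the
generic one.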


* Execute a loop of memcpy 4096 bytes

This test has been done to show the impact of markers on a system whose memory
is already in use, which is more representative of a running kernel.

- Without marker
NR_LOOPS : 10000
time delta (cycles): 12981555
cycles per loop : 1298.16

- i386 "optimized" : immediate value, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000
time delta (cycles): 12982290
cycles per loop : 1298.23
cycles per loop for marker : 1298.23-1298.16=0.074

- i386 "generic" : load, test and predicted branch
  (non connected marker)
NR_LOOPS : 10000
time delta (cycles): 13002788
cycles per loop : 1300.28
cycles per loop for marker : 1300.28-1298.16=2.123


The following tests are done with the "optimized" markers only

- Execute a loop with marker enabled, with i386 "fastcall" register argument
  setup, probe empty. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 4407608
cycles per loop : 44.08
cycles per loop to disable preemption and setup arguments in registers :
44.08-4.00=40.08

- Execute a loop with a marker enabled, with var args argument setup and an
  empty probe. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5210587
cycles per loop : 52.11
additional cycles per loop to setup var args : 52.11-44.08=8.03

- Execute a loop with a marker enabled, with var args argument setup and an
  empty probe. No preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 3363450
cycles per loop : 33.63
cycles per loop to disable preemption : 44.08-33.63=10.45

- Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
  Data is copied by the probe. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5299837
cycles per loop : 53.00
additional cycles per loop to get arguments in probe (from stack) on x86 :
53.00-52.11=0.89

- Execute a loop with marker enabled, with var args probe expecting arguments.
  Data is copied by the probe. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5574300
cycles per loop : 55.74
additional cycles per loop to get expected variable arguments on x86 : 
  55.74-53.00=2.74

- Execute a loop with marker enabled, with var args probe parsing the format
  string dynamically. Data is copied by the probe. The format string to decode
  is 6 bytes long.
NR_LOOPS : 100000
time delta (cycles): 9622117
cycles per loop : 96.22
additional cycles per loop to dynamically parse arguments with a 6 bytes format
string : 96.22-55.74=40.48

- Execute a loop with marker enabled, with var args probe expecting arguments.
  Data is copied by the probe. With preemption disabling. An empty "kprobe" is
  connected to the probe.
NR_LOOPS : 100000
time delta (cycles): 423397455
cycles per loop : 4233.97
additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23
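
In the format string test above (96.22 cycles per loop), the probe does not
know its argument layout in advance and has to walk the format string to find
out what to fetch. The probe used for that measurement is not shown in this
post. The following is only a user-space sketch of the idea, restricted to the
"%d" and "%p" conversions; parse_probe, last_int and last_ptr are hypothetical
names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical sinks for the decoded arguments. */
static int last_int;
static void *last_ptr;

/*
 * Walk the format string and pull one argument per conversion.
 * Only "%d" and "%p" are handled. Returns the number of arguments
 * decoded, or -1 on an unsupported conversion.
 */
static int parse_probe(const char *format, ...)
{
        va_list ap;
        const char *p;
        int count = 0;

        assert(format != NULL);
        va_start(ap, format);
        for (p = format; *p != '\0'; p++) {
                if (*p != '%')
                        continue;
                p++;    /* look at the conversion character */
                switch (*p) {
                case 'd':
                        last_int = va_arg(ap, int);
                        count++;
                        break;
                case 'p':
                        last_ptr = va_arg(ap, void *);
                        count++;
                        break;
                default:
                        va_end(ap);
                        return -1;
                }
        }
        va_end(ap);
        return count;
}
```

The per-character loop and the switch are work that a probe with a statically
known layout does not have to do, which is consistent with the extra cost
measured for this case.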


* Assembly code

The disassembly of the following marked function will be shown :

static int my_open(struct inode *inode, struct file *file)
{
        MARK(subsys_mark1, "%d %p", 1, NULL);

        return -EPERM;
}


- Optimized

static int my_open(struct inode *inode, struct file *file)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 0c                sub    $0xc,%esp
        MARK(subsys_mark1, "%d %p", 1, NULL);
   6:   b0 00                   mov    $0x0,%al <-- immediate load 0 in al
   8:   84 c0                   test   %al,%al
   a:   75 07                   jne    13 <my_open+0x13>

        return -EPERM;
}
   c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  11:   c9                      leave  
  12:   c3                      ret    
  13:   b8 01 00 00 00          mov    $0x1,%eax
  18:   e8 fc ff ff ff          call   19 <my_open+0x19> <-- preempt_disable
  1d:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
  24:   00 
  25:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
  2c:   00 
  2d:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
  34:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
  3a:   b8 01 00 00 00          mov    $0x1,%eax
  3f:   e8 fc ff ff ff          call   40 <my_open+0x40> <-- preempt_enable
  44:   eb c6                   jmp    c <my_open+0xc>


- Generic 

static int my_open(struct inode *inode, struct file *file)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 0c                sub    $0xc,%esp
        MARK(subsys_mark1, "%d %p", 1, NULL);
   6:   0f b6 05 20 10 00 00    movzbl 0x1020,%eax <-- memory load byte
   d:   84 c0                   test   %al,%al
   f:   75 07                   jne    18 <my_open+0x18>

        return -EPERM;
}
  11:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  16:   c9                      leave  
  17:   c3                      ret    
  18:   b8 01 00 00 00          mov    $0x1,%eax
  1d:   e8 fc ff ff ff          call   1e <my_open+0x1e> <-- preempt_disable
  22:   c7 44 24 08 00 00 00    movl   $0x0,0x8(%esp)
  29:   00 
  2a:   c7 44 24 04 01 00 00    movl   $0x1,0x4(%esp)
  31:   00 
  32:   c7 04 24 0d 00 00 00    movl   $0xd,(%esp)
  39:   ff 15 74 10 00 00       call   *0x1074 <-- function pointer
  3f:   b8 01 00 00 00          mov    $0x1,%eax
  44:   e8 fc ff ff ff          call   45 <my_open+0x45> <-- preempt_enable
  49:   eb c6                   jmp    11 <my_open+0x11>
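

Both listings follow the same shape: test a flag, skip over the marker in the
likely case, otherwise disable preemption, call through a function pointer and
re-enable preemption. The sketch below expresses that control flow in
user-space C for the generic (memory flag) variant. It is not the actual MARK()
definition: MARK_SKETCH, struct marker_sketch and the preempt_*_sketch helpers
are stand-ins, and __builtin_expect assumes GCC or a compatible compiler.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*probe_fn)(const char *format, ...);

/* Stand-ins for the kernel's preempt_disable()/preempt_enable(). */
static int preempt_count;
static void preempt_disable_sketch(void) { preempt_count++; }
static void preempt_enable_sketch(void)
{
        assert(preempt_count > 0);
        preempt_count--;
}

/* Per-marker state: activation flag (one byte) and probe pointer. */
struct marker_sketch {
        volatile char enabled;
        probe_fn call;
};

/* Load the flag, test it, and take the unlikely path only when set. */
#define MARK_SKETCH(m, format, ...)                                     \
        do {                                                            \
                if (__builtin_expect((m)->enabled, 0)) {                \
                        preempt_disable_sketch();                       \
                        (m)->call(format, __VA_ARGS__);                 \
                        preempt_enable_sketch();                        \
                }                                                       \
        } while (0)

/* Trivial probe that only counts how often it was called. */
static int hits;
static void count_probe(const char *format, ...)
{
        (void)format;
        hits++;
}

static struct marker_sketch subsys_mark1 = { 0, count_probe };

static int my_open_sketch(void)
{
        MARK_SKETCH(&subsys_mark1, "%d %p", 1, (void *)NULL);
        return -1;      /* -EPERM, matching the 0xffffffff in the listing */
}
```

In the optimized variant the flag is not read from memory: it is the immediate
operand of the mov instruction at offset 6.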


Here is the typical var arg probe that has been used in those tests. It saves
the values it expects as parameters in global variables. The DO_MARK1_FORMAT
define is used at probe registration to make sure that the probe will only be
connected to a marker that has a matching format string. Note that this check
is optional : the probe can register with a NULL format and then check
dynamically, by itself, the format string it receives as a parameter.

int value;
void *ptr;

#define DO_MARK1_FORMAT "%d %p"
void do_mark1(const char *format, ...)
{
        va_list ap;

        va_start(ap, format);
        value = va_arg(ap, int);
        ptr = va_arg(ap, void*);

        va_end(ap);
}
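
The registration-time format check described above can be pictured as follows.
This is a sketch under assumptions, not the markers API: struct marker_desc,
connect_probe_sketch and noop_probe are hypothetical names, and the real
registration interface is not shown in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*probe_cb)(const char *format, ...);

/* Hypothetical marker descriptor: name, format string, connected probe. */
struct marker_desc {
        const char *name;
        const char *format;
        probe_cb call;
};

/*
 * Connect a probe to a marker. A non-NULL expected format must match the
 * marker's format string exactly; a NULL format skips the check and leaves
 * validation to the probe itself. Returns 0 on success, -1 on mismatch.
 */
static int connect_probe_sketch(struct marker_desc *m,
                                const char *expected_format, probe_cb probe)
{
        assert(m != NULL && probe != NULL);
        if (expected_format != NULL &&
            strcmp(expected_format, m->format) != 0)
                return -1;
        m->call = probe;
        return 0;
}

/* Probe that ignores its arguments, for illustration only. */
static void noop_probe(const char *format, ...)
{
        (void)format;
}

static struct marker_desc mark1 = { "subsys_mark1", "%d %p", NULL };
```

Calling connect_probe_sketch(&mark1, "%d %p", noop_probe) succeeds, while
passing "%d %d" is rejected and the marker is left unchanged.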


Here is the disassembly of the probe :

#define DO_MARK1_FORMAT "%d %p"
void do_mark1(const char *format, ...)
{
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 04                sub    $0x4,%esp
        va_list ap;

        va_start(ap, format);
        value = va_arg(ap, int);
   6:   8b 45 0c                mov    0xc(%ebp),%eax
   9:   a3 00 00 00 00          mov    %eax,0x0
        ptr = va_arg(ap, void*);
   e:   8b 45 10                mov    0x10(%ebp),%eax
  11:   a3 00 00 00 00          mov    %eax,0x0
        
        va_end(ap);
}
  16:   c9                      leave  
  17:   c3                      ret    


* Size (x86)

This is the size added by each marker to the memory image :

- Optimized

.text section : instructions
Adds 6 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.
.data section : r/w data
0 byte
.rodata.str1 : strings
Length of the marker name
.debug_str : strings (if loaded..)
Length of the marker name + 7 bytes (__mark_)
.markers
8 bytes (2 pointers)
.markers.c
12 bytes (3 pointers)

- Generic

.text section : instructions
Adds 11 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.
.data section : r/w data
1 byte (the activation flag)
.rodata.str1 : strings
Length of the marker name
.debug_str : strings (if loaded..)
Length of the marker name + 7 bytes (__mark_)
.markers
8 bytes (2 pointers)
.markers.c
12 bytes (3 pointers)
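
As a worked example, the per-section sizes above can be summed for a given
marker name. The sketch assumes that the "length of the marker name" counts
the characters without a terminating NUL, and it leaves out .debug_str since
that section is not necessarily loaded. marker_footprint is an illustrative
helper, not part of the markers code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum marker_kind { MARKER_OPTIMIZED, MARKER_GENERIC };

/* Sum of the per-marker sizes listed above, in bytes (x86). */
static size_t marker_footprint(enum marker_kind kind, const char *name)
{
        size_t likely_text = (kind == MARKER_OPTIMIZED) ? 6 : 11;
        size_t unlikely_text = 32;
        size_t data = (kind == MARKER_OPTIMIZED) ? 0 : 1;
        size_t markers = 8;     /* .markers : 2 pointers */
        size_t markers_c = 12;  /* .markers.c : 3 pointers */

        assert(name != NULL);
        return likely_text + unlikely_text + data + strlen(name) +
               markers + markers_c;
}
```

For the 12 character name "subsys_mark1" this gives 70 bytes for the
optimized marker and 76 bytes for the generic one.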


* Macro-benchmarks

Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
Running a 2.6.17 vanilla kernel :
real    8m2.443s
user    7m35.124s
sys     0m34.950s

Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
real    8m1.635s
user    7m34.552s
sys     0m36.298s

--> 0.17 % speedup with markers (real time, 482.443s vs 481.635s)

Ping flood on loopback interface :
Running a 2.6.17 vanilla kernel :
136596 packets transmitted, 136596 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.0/0.1 ms

real    0m10.840s
user    0m0.360s
sys     0m10.485s

12601 packets transmitted/s

Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
108504 packets transmitted, 108504 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.0/0.1 ms

real    0m8.614s
user    0m0.264s
sys     0m8.353s

12596 packets transmitted/s

--> 0.03 % slowdown with markers


Conclusion

In an empty loop, the generic marker is faster than the optimized marker. This
may be due to better performance of the movzbl instruction over the movb on the
Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
of the movzbl becomes greater because it uses memory bandwidth.

The preemption disabling and the call to a probe itself cost 48.11 cycles,
almost as much as dynamically parsing the format string to get the variable
arguments (40.48 cycles).

There is almost no difference, on x86, between passing the arguments directly on
the stack and using a variable argument list when its layout is known
statically (0.89 cycles vs 2.74 cycles).

The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
(typical use, high memory usage) used by the optimized marker : a breakpoint
can be inserted at any location and has no impact on the code when inactive.
This breakpoint-based approach is very useful to instrument core kernel code
that has not been previously marked, without the need to recompile and reboot.
We can therefore compare the case "without markers" to the null impact of an
inactive int3 breakpoint.

However, the performance impact of using a kprobe is not negligible when it is
activated. Assuming that kprobes had a mechanism to get the variables from the
caller's stack, it would perform the same task in at least 4178.23 cycles vs
55.74 for a marker and a probe (ratio : 75). While kprobes are very useful for
the reason explained earlier, the high event rate paths in the kernel would
clearly benefit from a marker mechanism when they are probed.

Code size and memory footprints are smaller with the optimized version : 6
bytes of code in the likely path compared to 11 bytes. The memory footprint of
the optimized approach saves 4 bytes of data memory that would otherwise have to
stay in cache.

On the macro-benchmark side, no significant difference in performance has been
found between the vanilla kernel and a kernel "marked" with the standard LTTng
instrumentation.