Adding support for "compact" 32 bits events.

Mathieu Desnoyers
March 12, 2007

Use a separate channel for compact events

Mux those events into this channel and magically they are "compact". Isn't it
beautiful.

event header

### COMPACT EVENTS

32 bits header
Aligned on 32 bits
  5 bits event ID
    32 events
  27 bits TSC (cut MSB)
    wraps 32 times per second at 4GHz
    wraps spaced 0.03125s apart
    100HZ clock : tick each 0.01s
      detect wrap at least each 3 jiffies (dangerous, may miss)
    granularity : 2^0 = 1 cycle : 0.25ns @4GHz
payload size known by facility

32 bits header
Aligned on 32 bits
  5 bits event ID
    32 events
  27 bits TSC (cut LSB)
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
    granularity : 2^5 = 32 cycles : 8ns @4GHz
payload size known by facility

32 bits header
Aligned on 32 bits
  6 bits event ID
    64 events
  26 bits TSC (cut LSB)
    wraps each second at 4GHz (26 + 6 = 32 bits of TSC range)
    100HZ clock : tick each 0.01s
    granularity : 2^6 = 64 cycles : 16ns @4GHz
payload size known by facility

32 bits header
Aligned on 32 bits
  7 bits event ID
    128 events
  25 bits TSC (cut LSB)
    wraps each second at 4GHz (25 + 7 = 32 bits of TSC range)
    100HZ clock : tick each 0.01s
    granularity : 2^7 = 128 cycles : 32ns @4GHz
payload size known by facility
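
The compact header variants above can be sketched as a pack/unpack pair. This
is only an illustration: the note does not say at which end of the 32 bits
word the event ID sits, so the sketch assumes the ID is in the most
significant bits. The names compact_layout, compact_pack, compact_id and
compact_tsc are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one compact header variant. */
struct compact_layout {
    unsigned id_bits;   /* 5, 6 or 7 in the variants above */
    unsigned lsb_cut;   /* number of low TSC bits dropped (0 for "cut MSB") */
};

/* Build the 32 bits header: [ event ID | truncated TSC ] (assumed order). */
uint32_t compact_pack(struct compact_layout l, uint32_t id, uint64_t tsc)
{
    unsigned tsc_bits;
    uint32_t tsc_mask;

    assert(l.id_bits > 0 && l.id_bits < 32 && l.lsb_cut < 64);
    assert(id < (UINT32_C(1) << l.id_bits));
    tsc_bits = 32 - l.id_bits;
    tsc_mask = (UINT32_C(1) << tsc_bits) - 1;
    return (id << tsc_bits) | ((uint32_t)(tsc >> l.lsb_cut) & tsc_mask);
}

/* Extract the event ID from a packed header. */
uint32_t compact_id(struct compact_layout l, uint32_t hdr)
{
    return hdr >> (32 - l.id_bits);
}

/* Extract the truncated TSC, scaled back to cycles.
 * The cut LSBs come back as zero, the cut MSBs are lost. */
uint64_t compact_tsc(struct compact_layout l, uint32_t hdr)
{
    uint32_t tsc_mask = (UINT32_C(1) << (32 - l.id_bits)) - 1;

    return (uint64_t)(hdr & tsc_mask) << l.lsb_cut;
}
```

With { id_bits = 5, lsb_cut = 5 } this is the second variant : what comes back
from compact_tsc() is the low 32 bits of the TSC with the 5 LSB cleared.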


### NORMAL EVENTS

64 bits header
Aligned on 64 bits
  32 bits TSC
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
  16 bits event id, (major 8 minor 8)
     65536 events
  16 bits event size (extra)

96 bits header (full 64 bits TSC, useful when no heartbeat available)
Aligned on 64 bits
  64 bits TSC
    wraps each 146.14 years at 4GHz
  16 bits event id, (major 8 minor 8)
     65536 events
  16 bits event size (extra)
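
The wrap intervals quoted above can be checked with a few lines of C. A
minimal sketch, with hypothetical function names; "bits" is the total TSC
range covered (kept bits plus cut LSBs). Note that the round figures in this
note treat 4GHz as 2^32 cycles per second : with exactly 4e9 Hz, 32 bits wrap
every ~1.07s and 27 bits every ~0.0336s.

```c
#include <assert.h>

/* Seconds between two wraps of a counter covering `bits` bits of TSC
 * range at `hz` cycles per second. */
double wrap_period_seconds(unsigned bits, double hz)
{
    double cycles = 1.0;
    unsigned i;

    assert(hz > 0.0);
    for (i = 0; i < bits; i++)
        cycles *= 2.0;  /* exact: powers of two up to 2^64 fit in a double */
    return cycles / hz;
}

/* Same period expressed in years of 365.25 days. */
double wrap_period_years(unsigned bits, double hz)
{
    return wrap_period_seconds(bits, hz) / (365.25 * 86400.0);
}
```

wrap_period_years(64, 4e9) gives the ~146.14 years quoted for the full 64 bits
TSC.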


## Discussion of compact events

Must put the event ID fields first in the large (64, 96-128 bits) event headers
What is the minimum granularity required ? (so we know how many LSB to cut)
  - How much can synchronized CPU TSCs drift apart from one another ?
    PLL
    http://en.wikipedia.org/wiki/Phase-locked_loop
    static phase offset -> tracking jitter
    25 MHz oscillator on motherboard for CPU
    jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns)
    http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM
    NEED MORE INFO.
  - What is the cacheline synchronization latency between the CPUs ?
    Worst case : Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo
    Unified L2 cache. http://www.intel.com/design/processor/manuals/253668.pdf
    Intel Core 2, Intel Xeon 5100 
    http://www.intel.com/design/processor/manuals/253665.pdf
    Up to 10.7 GB/s FSB
    http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html
                      Intel Core Duo     Intel Core 2 Duo
    L2 cache latency  14 cycles          14 cycles
    (round-trip : 28 cycles) 7ns @4GHz
    sparc64 : between threads : shares L1 cache.
    suspected to be ~2 cycles total (1+1) (to check)
  - How close (cycle-wise) can two consecutive recorded events be in the same
    buffer ? (~200ns, time for logging an event) (~800 cycles @4GHz)
  - Tracing code itself : if it's at a subbuffer boundary, more checks to do.
    Must see the maximum duration of a non interrupted probe.
    Worst case (had NMIs enabled) : 6997 cycles. 1749 ns @4GHz.
    TODO : test with NMIs disabled and HT disabled.
    Ordering can be changed if an interrupt comes between the memory operation
    and the tracer call. Therefore, we cannot rely on more precision than the
    expected interrupt handler duration. (guess : ~10000cycles, 2500ns@4GHz)
  - If there is a faster interconnect between the CPUs, it can be a problem, but
    seems to only be proprietary interconnects, not used in general.
  - IPIs are expected to take much more than 28 cycles.
What is the minimum wrap-around interval ? (must be safe for timer interrupt
miss and multiple timer HZ (configurable) and CPU MHZ frequencies)

Granularity : 200ns (800 cycles@4GHz) : 2^9 = 512 (remove 9 LSB)
  Probe never takes 1 cycle.
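  Number of LSB skipped : max(0, (long)find_first_bit(probe_duration_in_cycles)-1)
    (find_first_bit must be read here as the 1-based position of the highest
    set bit, like fls() : that is the reading which gives 10 - 1 = 9 for
    800 cycles)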
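
A minimal sketch of the LSB computation, assuming the bit search returns the
1-based position of the most significant set bit (0 when no bit is set), as
fls64() does in the Linux kernel. msb_position and lsb_to_skip are
hypothetical names, written in portable C so the sketch runs in user space.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based position of the most significant set bit, 0 if x == 0. */
unsigned msb_position(uint64_t x)
{
    unsigned pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Number of TSC LSBs that can be dropped: the largest n such that
 * 2^n <= probe_duration_in_cycles, or 0 when the duration is 0 or 1. */
unsigned lsb_to_skip(uint64_t probe_duration_in_cycles)
{
    unsigned pos = msb_position(probe_duration_in_cycles);
    unsigned skip = pos > 0 ? pos - 1 : 0;

    assert(skip < 64);  /* stays a valid shift count for a 64 bits TSC */
    return skip;
}
```

For the ~800 cycles probe measured above, lsb_to_skip(800) is 9, that is a
granularity of 2^9 = 512 cycles.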

Min wrap : 100HZ system, each 3 timer ticks : 0.03s (32-4 MSB for 4 GHZ : 28 bits, ~0.067s)
  (heartbeat each 100HZ, to be safe)
  Number of MSB to skip :
    32 - find_first_bit(( (expected_longest_interrupt_latency()[ms] +
       max_timer_interval[ms]) * cpu_khz[kcycles/s] )) - 1
    (the last -1 makes sure we never remove more bits than allowed : round
    toward 0, not up).
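
The MSB formula can be sketched the same way. Assumptions : the bit search is
again the 1-based position of the most significant set bit (like fls64()),
and the 0.03s budget is split, for the example only, into 20ms of interrupt
latency plus one 10ms timer interval. msb_position and msb_to_skip are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based position of the most significant set bit, 0 if x == 0. */
unsigned msb_position(uint64_t x)
{
    unsigned pos = 0;

    while (x) {
        pos++;
        x >>= 1;
    }
    return pos;
}

/* Number of MSBs of a 32 bits TSC that can be dropped while the remaining
 * bits still wrap less often than the longest gap between heartbeats. */
unsigned msb_to_skip(uint64_t irq_latency_ms, uint64_t timer_interval_ms,
                     uint64_t cpu_khz)
{
    uint64_t total_ms = irq_latency_ms + timer_interval_ms;
    uint64_t cycles;
    unsigned pos;

    /* The product below must not overflow 64 bits. */
    assert(cpu_khz == 0 || total_ms <= UINT64_MAX / cpu_khz);
    cycles = total_ms * cpu_khz;    /* ms * kcycles/s = cycles */
    pos = msb_position(cycles);
    /* Keep pos + 1 bits: one bit of margin, as the "- 1" in the formula. */
    return pos + 1 < 32 ? 32 - pos - 1 : 0;
}
```

msb_to_skip(20, 10, 4000000) is 4 : 30ms at 4GHz is 1.2e8 cycles, which needs
27 bits, so 28 bits are kept and 4 MSB are removed.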

Heartbeat timer :
  Each timer interrupt
  Event : 32 bytes in size
  each timer tick : 100HZ
  3.2kB/s

9LSB + 4MSB = 13 bits total. 13 bits for event IDs : 8192 events.
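
With 13 bits of event ID, 19 bits remain for the TSC field, covering TSC bits
9 to 27. On the reader side, the full timestamp has to be rebuilt from this
field and from the previous known full TSC (previous event or heartbeat). A
minimal sketch under the assumption that less than one wrap period elapses
between two consecutive timestamps in a buffer, which is what the heartbeat
is there to guarantee. tsc_extend is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild a 64 bits TSC from a truncated field.
 * prev_full  : last fully known TSC in the same buffer
 * field      : truncated TSC read from the compact header
 * field_bits : width of the field (19 in the example above)
 * lsb_cut    : number of LSBs dropped (9 in the example above) */
uint64_t tsc_extend(uint64_t prev_full, uint32_t field,
                    unsigned field_bits, unsigned lsb_cut)
{
    uint64_t granule, period, prev_trunc, candidate;

    assert(field_bits + lsb_cut < 64);
    assert((uint64_t)field < (UINT64_C(1) << field_bits));
    granule = UINT64_C(1) << lsb_cut;               /* cycles per unit */
    period = UINT64_C(1) << (field_bits + lsb_cut); /* cycles per wrap */
    /* Compare at the field granularity, the cut LSBs are unknown. */
    prev_trunc = prev_full & ~(granule - 1);
    /* Take the high bits from the previous TSC, the rest from the field. */
    candidate = (prev_full & ~(period - 1)) | ((uint64_t)field << lsb_cut);
    /* Going backwards means the field wrapped once since prev_full. */
    if (candidate < prev_trunc)
        candidate += period;
    return candidate;
}
```

If more than one wrap period can elapse without any event or heartbeat, the
reconstruction silently loses a period, which is the "dangerous, may miss"
case noted for the first compact variant.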