1 | Adding support for "compact" 32 bits events. |
2 | |
3 | Mathieu Desnoyers |
4 | March 12, 2007 |
5 | |
6 | Use a separate channel for compact events |
7 | |
8 | Mux those events into this channel and magically they are "compact". Isn't it |
9 | beautiful. |
10 | |
11 | event header |
12 | |
13 | ### COMPACT EVENTS |
14 | |
15 | 32 bits header |
16 | Aligned on 32 bits |
17 | 5 bits event ID |
18 | 32 events |
19 | 27 bits TSC (cut MSB) |
20 | wraps 32 times per second at 4GHz |
21 | wraps spaced 0.03125s apart (taking 4GHz as 2^32 Hz) |
22 | 100HZ clock : tick each 0.01s |
23 | detect wrap at least each 3 jiffies (dangerous, may miss) |
24 | granularity : 2^0 = 1 cycle : 0.25ns @4GHz |
25 | payload size known by facility |
26 | |
27 | 32 bits header |
28 | Aligned on 32 bits |
29 | 5 bits event ID |
30 | 32 events |
31 | 27 bits TSC (cut LSB) |
32 | wraps each second at 4GHz |
33 | 100HZ clock : tick each 0.01s |
34 | granularity : 2^5 = 32 cycles : 8ns @4GHz |
35 | payload size known by facility |
36 | |
37 | 32 bits header |
38 | Aligned on 32 bits |
39 | 6 bits event ID |
40 | 64 events |
41 | 26 bits TSC (cut LSB) |
42 | wraps each second at 4GHz (26 + 6 = 32 bits of cycles) |
43 | 100HZ clock : tick each 0.01s |
44 | granularity : 2^6 = 64 cycles : 16ns @4GHz |
45 | payload size known by facility |
46 | |
47 | 32 bits header |
48 | Aligned on 32 bits |
49 | 7 bits event ID |
50 | 128 events |
51 | 25 bits TSC (cut LSB) |
52 | wraps each second at 4GHz (25 + 7 = 32 bits of cycles) |
53 | 100HZ clock : tick each 0.01s |
54 | granularity : 2^7 = 128 cycles : 32ns @4GHz |
55 | payload size known by facility |
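
The four compact layouts above differ only in how many bits go to the event ID and how many TSC bits are cut. A minimal packing sketch follows. It assumes the event ID sits in the top bits of the 32-bit word and the truncated TSC in the rest; the note does not fix that order, and all function names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not taken from the tracer sources.
 * Assumed layout: [ event ID : id_bits ][ truncated TSC : 32 - id_bits ].
 * id_bits must be in 1..31 and lsb_cut in 0..63.
 */
static inline uint32_t compact_header_pack(uint32_t event_id, uint64_t tsc,
                                           unsigned id_bits, unsigned lsb_cut)
{
    unsigned tsc_bits = 32 - id_bits;
    uint32_t tsc_mask = (UINT32_C(1) << tsc_bits) - 1;
    uint32_t id_mask = (UINT32_C(1) << id_bits) - 1;

    /* Drop lsb_cut low bits of the TSC, then keep only tsc_bits of it. */
    return ((event_id & id_mask) << tsc_bits) |
           ((uint32_t)(tsc >> lsb_cut) & tsc_mask);
}

/* Extract the event ID from a packed header. */
static inline uint32_t compact_header_event_id(uint32_t header, unsigned id_bits)
{
    return header >> (32 - id_bits);
}

/* Extract the truncated TSC field (still in units of 2^lsb_cut cycles). */
static inline uint32_t compact_header_tsc(uint32_t header, unsigned id_bits)
{
    return header & ((UINT32_C(1) << (32 - id_bits)) - 1);
}
```

With id_bits = 5 and lsb_cut = 5 this gives the "27 bits TSC (cut LSB)" variant; with lsb_cut = 0 it gives the "cut MSB" variant.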
56 | |
57 | |
58 | |
59 | ### NORMAL EVENTS |
60 | |
61 | 64 bits header |
62 | Aligned on 64 bits |
63 | 32 bits TSC |
64 | wraps each second at 4GHz |
65 | 100HZ clock : tick each 0.01s |
66 | 16 bits event id, (major 8 minor 8) |
67 | 65536 events |
68 | 16 bits event size (extra) |
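
As an illustration only, the 64 bits header could be written as the C structure below. The field order is an assumption: it puts the event ID first, as the discussion section asks for large headers, whereas the list above names the TSC first. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the 64 bits "normal" event header. */
struct normal_event_header {
    uint8_t  id_major;    /* event ID, major part (8 bits) */
    uint8_t  id_minor;    /* event ID, minor part (8 bits) */
    uint16_t event_size;  /* extra event size (16 bits) */
    uint32_t tsc;         /* low 32 bits of the TSC */
};

/* The fields are naturally aligned, so no padding is expected. */
_Static_assert(sizeof(struct normal_event_header) == 8,
               "normal event header must be 64 bits");

/* Combine major and minor into the 16 bits event ID (65536 events). */
static inline uint16_t normal_event_id(const struct normal_event_header *h)
{
    return (uint16_t)((h->id_major << 8) | h->id_minor);
}
```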
69 | |
70 | 96 bits header (full 64 bits TSC, useful when no heartbeat available) |
71 | Aligned on 64 bits |
72 | 64 bits TSC |
73 | wraps each 146.14 years at 4GHz |
74 | 16 bits event id, (major 8 minor 8) |
75 | 65536 events |
76 | 16 bits event size (extra) |
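
The wrap intervals and granularities quoted for each header follow from two small formulas: a TSC field of N bits with S low bits cut wraps every 2^(N+S) cycles and has a granularity of 2^S cycles. The sketch below (hypothetical helper names) evaluates them for a given CPU frequency. Note that the figures in this note round 4GHz to 2^32 Hz, so exact results at 4e9 Hz differ slightly (for example 0.0336s rather than 0.03125s for 27 bits).

```c
#include <stdio.h>

/* 2^n as a double; a plain loop avoids needing math.h and -lm. */
static double two_pow(unsigned n)
{
    double v = 1.0;

    while (n--)
        v *= 2.0;
    return v;
}

/* Seconds before a tsc_bits-wide field with lsb_cut low bits dropped wraps. */
static double wrap_seconds(unsigned tsc_bits, unsigned lsb_cut, double cpu_hz)
{
    return two_pow(tsc_bits + lsb_cut) / cpu_hz;
}

/* Timestamp granularity in nanoseconds when lsb_cut low bits are dropped. */
static double granularity_ns(unsigned lsb_cut, double cpu_hz)
{
    return two_pow(lsb_cut) / cpu_hz * 1e9;
}

/* Print one line of the table for a given layout. */
static void print_layout(unsigned tsc_bits, unsigned lsb_cut, double cpu_hz)
{
    printf("%u bits TSC, %u LSB cut: wraps every %.4f s, granularity %.2f ns\n",
           tsc_bits, lsb_cut, wrap_seconds(tsc_bits, lsb_cut, cpu_hz),
           granularity_ns(lsb_cut, cpu_hz));
}
```

Calling print_layout(27, 5, 4e9) or print_layout(32, 0, 4e9) gives a quick check of any candidate layout before committing bits to it.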
77 | |
78 | |
79 | ## Discussion of compact events |
80 | |
81 | Must put the event ID fields first in the large (64, 96-128 bits) event headers |
82 | What is the minimum granularity required ? (so we know how much LSB to cut) |
83 | - How much can synchronized CPU TSCs drift apart from one another ? |
84 | PLL |
85 | http://en.wikipedia.org/wiki/Phase-locked_loop |
86 | static phase offset -> tracking jitter |
87 | 25 MHz oscillator on motherboard for CPU |
88 | jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns) |
89 | http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM |
90 | NEED MORE INFO. |
91 | - What is the cacheline synchronization latency between the CPUs ? |
92 | Worst case : Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo |
93 | Unified L2 cache. http://www.intel.com/design/processor/manuals/253668.pdf |
94 | Intel Core 2, Intel Xeon 5100 |
95 | http://www.intel.com/design/processor/manuals/253665.pdf |
96 | Up to 10.7 GB/s FSB |
97 | http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html |
98 | Intel Core Duo Intel Core 2 Duo |
99 | L2 cache latency 14 cycles 14 cycles |
100 | (round-trip : 28 cycles) 7ns @4GHz |
101 | sparc64 : between threads : shares L1 cache. |
102 | suspected to be ~2 cycles total (1+1) (to check) |
103 | - How close (cycle-wise) can two consecutive recorded events in the same |
104 | buffer be ? (~200ns, time for logging an event) (~800 cycles @4GHz) |
105 | - Tracing code itself : if it's at a subbuffer boundary, more checks to do. |
106 | Must see the maximum duration of a non interrupted probe. |
107 | Worst case (had NMIs enabled) : 6997 cycles. 1749 ns @4GHz. |
108 | TODO : test with NMIs disabled and HT disabled. |
109 | Ordering can be changed if an interrupt comes between the memory operation |
110 | and the tracer call. Therefore, we cannot rely on more precision than the |
111 | expected interrupt handler duration. (guess : ~10000cycles, 2500ns@4GHz) |
112 | - If there is a faster interconnect between the CPUs, it can be a problem, but |
113 | these seem to be proprietary interconnects only, not used in general. |
114 | - IPIs are expected to take much more than 28 cycles. |
115 | What is the minimum wrap-around interval ? (must be safe for timer interrupt |
116 | miss and multiple timer HZ (configurable) and CPU MHZ frequencies) |
117 | |
118 | Granularity : 200ns (800 cycles@4GHz) : 2^9 = 512 (remove 9 LSB) |
119 | Probe never takes 1 cycle. |
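120 | Number of LSB skipped : max(0, (long)find_first_bit(probe_duration_in_cycles)-1) (find_first_bit meant as the 1-based position of the highest set bit, like fls() : 800 cycles -> 10 - 1 = 9) |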
121 | |
122 | Min wrap : 100HZ system, each 3 timer ticks : 0.03s (32-4 MSB for 4 GHZ : 2^28 cycles, 0.067s) |
123 | (heartbeat each 100HZ, to be safe) |
124 | Number of MSB to skip : |
125 | 32 - find_first_bit(( (expected_longest_interrupt_latency()[ms] + |
126 | max_timer_interval[ms]) * cpu_khz[kcycles/s] )) - 1 |
127 | (the last -1 makes sure we remove at most the exact number of bits : round |
128 | toward 0, never up). |
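
The two bit-count formulas above can be sketched in C as follows. This is a sketch under one assumption: find_first_bit is read as the 1-based position of the most significant set bit (the convention of the kernel's fls()), since that is what reproduces the 9 LSB and 4 MSB figures in this note. The helper names are hypothetical.

```c
#include <stdint.h>

/* 1-based position of the highest set bit; 0 when x == 0 (fls-like). */
static unsigned msb_index(uint64_t x)
{
    unsigned n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Number of TSC LSBs that can be dropped: max(0, msb_index(duration) - 1). */
static unsigned lsb_to_skip(uint64_t probe_duration_cycles)
{
    unsigned n = msb_index(probe_duration_cycles);

    return n > 0 ? n - 1 : 0;
}

/*
 * Number of MSBs of a 32 bits TSC that can be dropped:
 * 32 - msb_index((latency_ms + timer_interval_ms) * cpu_khz) - 1,
 * clamped at 0. ms * kcycles/s gives cycles.
 */
static unsigned msb_to_skip(uint64_t irq_latency_ms, uint64_t timer_interval_ms,
                            uint64_t cpu_khz)
{
    uint64_t cycles = (irq_latency_ms + timer_interval_ms) * cpu_khz;
    unsigned n = msb_index(cycles);

    return n + 1 < 32 ? 32 - n - 1 : 0;
}
```

For example, lsb_to_skip(800) evaluates to 9, and msb_to_skip(0, 30, 4000000) (three 10ms ticks at 4GHz) evaluates to 4.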
129 | |
130 | Heartbeat timer : |
131 | Each timer interrupt |
132 | Event : 32 bytes in size |
133 | each timer tick : 100HZ |
134 | 3.2kB/s |
135 | |
136 | 9LSB + 4MSB = 13 bits total. 13 bits for event IDs : 8192 events. |
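
With 13 bits given to the event ID, the header keeps a 19 bits TSC field with 9 LSB cut, so a trace reader has to rebuild full timestamps and detect wraps. A minimal reader-side sketch follows. It assumes at least one event (for instance the heartbeat) is recorded per wrap period and that events in a buffer are in time order. The function name and interface are hypothetical, not the actual reader code.

```c
#include <stdint.h>

/*
 * Extend a truncated TSC field to a full 64 bits timestamp.
 * last_tsc : full timestamp of the previous event in the same buffer
 * field    : truncated TSC read from the compact header
 * tsc_bits : width of the field (e.g. 19), lsb_cut : LSBs dropped (e.g. 9)
 * Requires tsc_bits + lsb_cut < 64.
 */
static uint64_t tsc_extend(uint64_t last_tsc, uint32_t field,
                           unsigned tsc_bits, unsigned lsb_cut)
{
    uint64_t period = UINT64_C(1) << (tsc_bits + lsb_cut);
    uint64_t low = (uint64_t)field << lsb_cut;
    /* Previous timestamp rounded down to the field granularity. */
    uint64_t last_trunc = last_tsc & ~((UINT64_C(1) << lsb_cut) - 1);
    /* Reuse the high bits of the previous timestamp. */
    uint64_t candidate = (last_tsc & ~(period - 1)) | low;

    /* Going backwards means the field wrapped once since the last event. */
    if (candidate < last_trunc)
        candidate += period;
    return candidate;
}
```

If more than one wrap period can pass without any event, this reconstruction silently loses time, which is why the heartbeat interval has to stay well below the wrap interval.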
137 | |
138 | |
139 | |
140 | |
141 | |
142 | |
143 | |