Adding support for "compact" 32 bits events.

Mathieu Desnoyers
March 9, 2007


event header


32 bits header
  Aligned on 32 bits
  1 bit to select event type
  4 bits event ID
    16 events (too few)
  27 bits TSC (cut MSB)
    wraps about 32 times per second at 4GHz (2^27 cycles, taking 4GHz ~ 2^32 Hz)
    wraps are spaced by about 0.03125s
    100HZ clock : tick each 0.01s
    must detect a wrap at least every 3 jiffies (dangerous, may miss one)
    granularity : 2^0 = 1 cycle : 0.25ns @4GHz
  payload size known by facility

32 bits header
  Aligned on 32 bits
  1 bit to select event type
  4 bits event ID
    16 events (too few)
  27 bits TSC (cut LSB)
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s
    granularity : 2^5 = 32 cycles : 8ns @4GHz
  payload size known by facility

32 bits header
  Aligned on 32 bits
  1 bit to select event type
  5 bits event ID
    32 events
  26 bits TSC (cut LSB)
    wraps about each second at 4GHz (26 bits x 64 cycles = 2^32 cycles)
    100HZ clock : tick each 0.01s
    granularity : 2^6 = 64 cycles : 16ns @4GHz
  payload size known by facility

32 bits header
  Aligned on 32 bits
  1 bit to select event type
  6 bits event ID
    64 events
  25 bits TSC (cut LSB)
    wraps about each second at 4GHz (25 bits x 128 cycles = 2^32 cycles)
    100HZ clock : tick each 0.01s
    granularity : 2^7 = 128 cycles : 32ns @4GHz
  payload size known by facility
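
The 1 + 6 + 25 split listed last can be sketched as a pack/unpack pair. This is only an illustration: the notes give field widths, not bit positions, so placing the select bit in the MSB followed by the event ID is an assumption, and every name here (compact_header_pack and friends) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout for the 1 + 6 + 25 variant (bit positions are an
 * assumption; the notes only give field widths):
 *   bit 31      : event type select (1 = compact)
 *   bits 30..25 : event ID
 *   bits 24..0  : TSC bits 31..7 (7 LSB cut)
 */
#define COMPACT_ID_BITS   6
#define COMPACT_TSC_BITS  25
#define COMPACT_TSC_SHIFT 7	/* number of LSB cut from the TSC */
#define COMPACT_TSC_MASK  ((1u << COMPACT_TSC_BITS) - 1)

static uint32_t compact_header_pack(uint32_t event_id, uint64_t tsc)
{
	uint32_t tsc_field;

	assert(event_id < (1u << COMPACT_ID_BITS));
	/* Drop the low bits, then keep only the 25 bits that fit. */
	tsc_field = (uint32_t)(tsc >> COMPACT_TSC_SHIFT) & COMPACT_TSC_MASK;
	return (1u << 31) | (event_id << COMPACT_TSC_BITS) | tsc_field;
}

static uint32_t compact_header_id(uint32_t header)
{
	return (header >> COMPACT_TSC_BITS) & ((1u << COMPACT_ID_BITS) - 1);
}

/* Truncated timestamp scaled back to cycles (a multiple of 128). */
static uint32_t compact_header_tsc(uint32_t header)
{
	return (header & COMPACT_TSC_MASK) << COMPACT_TSC_SHIFT;
}
```

The other 32 bits variants only differ by the three constants, as long as the ID and TSC widths add up to 31.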

64 bits header
  Aligned on 32 bits
  1 bit to select event type
  15 bits event id (major 8 minor 8, select bit included)
    32768 events
  16 bits event size (extra)
  32 bits TSC
    wraps each second at 4GHz
    100HZ clock : tick each 0.01s

96 or 128 bits header (full 64 bits TSC, useful when no heartbeat is available;
size depends on internal alignment)
  Aligned on 32 bits
  1 bit to select event type
  15 bits event id (major 8 minor 8, select bit included)
    32768 events
  16 bits event size (extra)
  Align on 64 bits
  64 bits TSC
    wraps each 146.14 years at 4GHz
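
The wrap intervals quoted for each header can be rechecked with a few lines of C. This is a minimal sketch: tsc_wrap_seconds is a hypothetical helper, and it assumes a counter that keeps kept_bits bits after dropping lsb_cut low-order bits wraps every 2^(kept_bits + lsb_cut) cycles.

```c
#include <assert.h>

/*
 * Seconds between two wraps of a truncated TSC field.
 * kept_bits : width of the TSC field stored in the header
 * lsb_cut   : number of low-order TSC bits dropped (0 when cutting MSB only)
 * cpu_hz    : TSC frequency in Hz, e.g. 4e9 for the 4GHz used in the notes
 */
static double tsc_wrap_seconds(unsigned int kept_bits, unsigned int lsb_cut,
			       double cpu_hz)
{
	unsigned int span = kept_bits + lsb_cut;
	double cycles = 1.0;

	assert(span <= 64);
	assert(cpu_hz > 0.0);
	while (span--)
		cycles *= 2.0;	/* exact: powers of two fit in a double */
	return cycles / cpu_hz;
}
```

For instance, tsc_wrap_seconds(27, 0, 4e9) covers the "cut MSB" variant, tsc_wrap_seconds(27, 5, 4e9) the 27 bits "cut LSB" variant, and tsc_wrap_seconds(64, 0, 4e9) divided by the number of seconds in a year covers the full 64 bits TSC.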


Must put the event ID fields first in the large (64, 96-128 bits) event headers.
Create a "compact" facility which reserves the facility IDs with the MSB at 1.
  - or better : select mapping for events
What is the minimum granularity required ? (so we know how many LSB to cut)
  - How much can synchronized CPU TSCs drift apart from one another ?
    PLL
      http://en.wikipedia.org/wiki/Phase-locked_loop
    static phase offset -> tracking jitter
    25 MHz oscillator on motherboard for CPU
    jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns)
      http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM
    NEED MORE INFO.
  - What is the cacheline synchronization latency between the CPUs ?
    Worst case : Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo
      Unified L2 cache. http://www.intel.com/design/processor/manuals/253668.pdf
    Intel Core 2, Intel Xeon 5100
      http://www.intel.com/design/processor/manuals/253665.pdf
      Up to 10.7 GB/s FSB
    http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html
                          Intel Core Duo    Intel Core 2 Duo
      L2 cache latency    14 cycles         14 cycles
      (round-trip : 28 cycles) 7ns @4GHz
    sparc64 : between threads : shares L1 cache.
      suspected to be ~2 cycles total (1+1) (to check)
  - How close (cycle-wise) can two consecutive recorded events be in the same
    buffer ? (~200ns, time for logging an event) (~800 cycles @4GHz)
  - Tracing code itself : if it's at a subbuffer boundary, there are more checks
    to do.
    Must see the maximum duration of a non-interrupted probe.
      Worst case (had NMIs enabled) : 6997 cycles. 1749 ns @4GHz.
      TODO : test with NMIs disabled and HT disabled.
    Ordering can be changed if an interrupt comes between the memory operation
    and the tracer call. Therefore, we cannot rely on more precision than the
    expected interrupt handler duration. (guess : ~10000 cycles, 2500ns @4GHz)
  - If there is a faster interconnect between the CPUs, it can be a problem, but
    these seem to only be proprietary interconnects, not used in general.
  - IPIs are expected to take much more than 28 cycles.
What is the minimum wrap-around interval ? (must be safe against a missed timer
interrupt and across multiple (configurable) timer HZ and CPU MHz frequencies)
Must align _all_ headers on 32 bits, not 64.
119 | Must align _all_ headers on 32 bits, not 64. |
120 | |
121 | Granularity : 800ns (200 cycles@4GHz) : 2^9 = 512 (remove 9 LSB) |
122 | Number of LSB skipped : first_bit(probe_duration_in_cycles)-1 |
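
A minimal sketch of the LSB rule, assuming first_bit() returns the 1-based position of the highest set bit (the same convention as the kernel's fls()). The notes do not define first_bit(), so this reading is an assumption; it is the one that yields 9 for an 800-cycle probe. lsb_to_skip is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of the highest set bit, 0 when x == 0 (assumed semantics). */
static unsigned int first_bit(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/*
 * Number of TSC LSB that can be dropped: the largest power of two that
 * does not exceed the probe duration sets the granularity.
 */
static unsigned int lsb_to_skip(uint64_t probe_duration_in_cycles)
{
	assert(probe_duration_in_cycles > 0);
	return first_bit(probe_duration_in_cycles) - 1;
}
```

With the ~800 cycles measured for logging one event, lsb_to_skip(800) returns 9, which matches the 2^9 = 512 cycle granularity.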

Min wrap : 100HZ system, each 3 timer ticks : 0.03s
  (32-4 MSB for 4 GHZ : 2^28 cycles : 0.067s)
  (heartbeat each 100HZ, to be safe)
Number of MSB to skip :
  32 - first_bit(( (expected_longest_cli()[ms] + max_timer_interval[ms]) * 2 *
                   cpu_khz ))
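
The MSB rule can be sketched the same way. Two assumptions are made here: the interval in milliseconds is converted to cycles by multiplying by cpu_khz (milliseconds times kHz gives cycles), and first_bit() is again the 1-based index of the highest set bit. msb_to_skip and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of MSB that can be dropped from a 32 bits TSC while keeping the
 * wrap interval longer than twice the worst-case delay between heartbeats.
 */
static unsigned int msb_to_skip(uint32_t longest_cli_ms,
				uint32_t max_timer_interval_ms,
				uint32_t cpu_khz)
{
	/* ms * kHz = cycles; the factor 2 is the safety margin of the notes. */
	uint64_t cycles = ((uint64_t)longest_cli_ms + max_timer_interval_ms)
			  * 2 * cpu_khz;
	unsigned int bits = 0;

	assert(cycles > 0);
	/* first_bit(): 1-based index of the highest set bit. */
	while (cycles) {
		bits++;
		cycles >>= 1;
	}
	/* Nothing can be dropped if the interval needs 32 bits or more. */
	return bits >= 32 ? 0 : 32 - bits;
}
```

As an example with made-up inputs, a 7 ms interrupt-disabled section plus a 10 ms timer interval at 4GHz (cpu_khz = 4000000) gives 136000000 cycles, which needs 28 bits, so msb_to_skip(7, 10, 4000000) returns 4.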


9LSB + 4MSB = 13 bits total : 1 select bit + 12 bits for event IDs : 4096 events.
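
With 9 LSB and 4 MSB removed, the header carries 19 bits of TSC (bits 9 to 27), and a reader has to rebuild the full timestamp from a reference such as the last heartbeat. The following is a sketch of one way to do that, not the tracer's actual code: it assumes the event was recorded at or after the reference and less than one wrap interval (2^28 cycles) later, and compact_tsc_expand is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define FINAL_TSC_LSB_CUT 9	/* LSB dropped from the TSC */
#define FINAL_TSC_BITS    19	/* TSC bits kept in the header */

/*
 * Rebuild a full timestamp from the 19 bits field of a compact header.
 * ref_tsc : full TSC of a reference point (e.g. heartbeat) at or before
 *           the event, assumed to be less than 2^28 cycles older.
 * Returns the event TSC rounded down to the 512 cycles granularity.
 */
static uint64_t compact_tsc_expand(uint64_t ref_tsc, uint32_t field)
{
	const uint64_t period = 1ULL << (FINAL_TSC_LSB_CUT + FINAL_TSC_BITS);
	const uint64_t gran_mask = (1ULL << FINAL_TSC_LSB_CUT) - 1;
	uint64_t candidate;

	assert(field < (1u << FINAL_TSC_BITS));
	/* Take the high bits from the reference, the rest from the field. */
	candidate = (ref_tsc & ~(period - 1))
		    | ((uint64_t)field << FINAL_TSC_LSB_CUT);
	/* An apparent step backwards means the field wrapped once. */
	if (candidate < (ref_tsc & ~gran_mask))
		candidate += period;
	return candidate;
}
```

The comparison is done against the reference rounded down to the same granularity, so that an event falling in the same 512-cycle slot as the reference is not mistaken for a wrap.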
134 | |