| 1 | Adding support for "compact" 32 bits events. |
| 2 | |
| 3 | Mathieu Desnoyers |
| 4 | March 12, 2007 |
| 5 | |
| 6 | Use a separate channel for compact events |
| 7 | |
| 8 | Mux those events into this channel and magically they are "compact". Isn't it |
| 9 | beautiful? |
| 10 | |
| 11 | event header |
| 12 | |
| 13 | ### COMPACT EVENTS |
| 14 | |
| 15 | 32 bits header |
| 16 | Aligned on 32 bits |
| 17 | 5 bits event ID |
| 18 | 32 events |
| 19 | 27 bits TSC (cut MSB) |
| 20 | wraps about 32 times per second at 4GHz (4GHz taken as 2^32 Hz) |
| 21 | wraps are spaced about 0.03125s apart |
| 22 | 100HZ clock : tick each 0.01s |
| 23 | must detect a wrap at least every 3 jiffies (dangerous, may miss) |
| 24 | granularity : 2^0 = 1 cycle : 0.25ns @4GHz |
| 25 | payload size known by facility |
| 26 | |
| 27 | 32 bits header |
| 28 | Aligned on 32 bits |
| 29 | 5 bits event ID |
| 30 | 32 events |
| 31 | 27 bits TSC (cut LSB) |
| 32 | wraps each second at 4GHz |
| 33 | 100HZ clock : tick each 0.01s |
| 34 | granularity : 2^5 = 32 cycles : 8ns @4GHz |
| 35 | payload size known by facility |
| 36 | |
| 37 | 32 bits header |
| 38 | Aligned on 32 bits |
| 39 | 6 bits event ID |
| 40 | 64 events |
| 41 | 26 bits TSC (cut LSB) |
| 42 | wraps each second at 4GHz (26 bits above 6 cut LSB still span 2^32 cycles) |
| 43 | 100HZ clock : tick each 0.01s |
| 44 | granularity : 2^6 = 64 cycles : 16ns @4GHz |
| 45 | payload size known by facility |
| 46 | |
| 47 | 32 bits header |
| 48 | Aligned on 32 bits |
| 49 | 7 bits event ID |
| 50 | 128 events |
| 51 | 25 bits TSC (cut LSB) |
| 52 | wraps each second at 4GHz (25 bits above 7 cut LSB still span 2^32 cycles) |
| 53 | 100HZ clock : tick each 0.01s |
| 54 | granularity : 2^7 = 128 cycles : 32ns @4GHz |
| 55 | payload size known by facility |
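The "cut LSB" layouts above can be sketched as a pack/unpack pair. This is only an illustration: placing the event ID in the most significant bits of the 32-bit word is an assumption, and the function names `pack_compact_header` and `unpack_compact_header` are hypothetical, not taken from the tracer.

```python
def pack_compact_header(event_id, tsc, id_bits):
    """Build a 32-bit compact header for a "cut LSB" layout.

    Assumed layout (not confirmed by the notes): event ID in the top
    id_bits, truncated TSC in the remaining 32 - id_bits bits.
    """
    tsc_bits = 32 - id_bits
    if not 0 <= event_id < (1 << id_bits):
        raise ValueError("event ID does not fit in id_bits")
    # Drop id_bits LSBs (granularity of 2**id_bits cycles), then keep
    # the next tsc_bits bits; anything above is lost until the wrap
    # is detected by other means.
    tsc_field = (tsc >> id_bits) & ((1 << tsc_bits) - 1)
    return (event_id << tsc_bits) | tsc_field


def unpack_compact_header(header, id_bits):
    """Return (event_id, low 32 bits of the TSC with the cut LSBs zeroed)."""
    tsc_bits = 32 - id_bits
    event_id = header >> tsc_bits
    tsc_low32 = (header & ((1 << tsc_bits) - 1)) << id_bits
    return event_id, tsc_low32


if __name__ == "__main__":
    header = pack_compact_header(3, 0x123456789, id_bits=5)
    event_id, tsc_low32 = unpack_compact_header(header, id_bits=5)
    print(event_id, hex(tsc_low32))
```

With 5 ID bits the five low TSC bits are discarded, so the recovered timestamp is only exact to 32 cycles, which is the granularity listed for that layout.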
| 56 | |
| 57 | |
| 58 | |
| 59 | ### NORMAL EVENTS |
| 60 | |
| 61 | 64 bits header |
| 62 | Aligned on 64 bits |
| 63 | 32 bits TSC |
| 64 | wraps each second at 4GHz |
| 65 | 100HZ clock : tick each 0.01s |
| 66 | 16 bits event id, (major 8 minor 8) |
| 67 | 65536 events |
| 68 | 16 bits event size (extra) |
| 69 | |
| 70 | 96 bits header (full 64 bits TSC, useful when no heartbeat available) |
| 71 | Aligned on 64 bits |
| 72 | 64 bits TSC |
| 73 | wraps each 146.14 years at 4GHz |
| 74 | 16 bits event id, (major 8 minor 8) |
| 75 | 65536 events |
| 76 | 16 bits event size (extra) |
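A minimal sketch for rechecking the granularity and wrap figures of the header layouts above. It assumes the TSC field starts right above the cut LSBs and that the CPU runs at exactly 4GHz, so results differ slightly from figures rounded with 4GHz taken as 2^32 Hz. The function name `tsc_window` and the layout labels are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def tsc_window(tsc_bits, lsb_cut, cpu_hz=4_000_000_000):
    """Return (granularity in ns, wrap period in s) for a TSC field of
    tsc_bits bits taken right above lsb_cut discarded low bits."""
    granularity_cycles = 1 << lsb_cut
    wrap_cycles = 1 << (tsc_bits + lsb_cut)
    return granularity_cycles * 1e9 / cpu_hz, wrap_cycles / cpu_hz


if __name__ == "__main__":
    layouts = [
        ("compact, 27 bits, cut MSB", 27, 0),
        ("compact, 27 bits, cut 5 LSB", 27, 5),
        ("compact, 26 bits, cut 6 LSB", 26, 6),
        ("compact, 25 bits, cut 7 LSB", 25, 7),
        ("normal, 32 bits", 32, 0),
        ("normal, 64 bits", 64, 0),
    ]
    for name, bits, cut in layouts:
        gran_ns, wrap_s = tsc_window(bits, cut)
        print(f"{name}: {gran_ns:g} ns granularity, wraps every "
              f"{wrap_s:.6g} s ({wrap_s / SECONDS_PER_YEAR:.3g} years)")
```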
| 77 | |
| 78 | |
| 79 | ## Discussion of compact events |
| 80 | |
| 81 | Must put the event ID fields first in the large (64, 96-128 bits) event headers |
| 82 | What is the minimum granularity required ? (so we know how much LSB to cut) |
| 83 | - How much can synchronized CPU TSCs drift apart from one another ? |
| 84 | PLL |
| 85 | http://en.wikipedia.org/wiki/Phase-locked_loop |
| 86 | static phase offset -> tracking jitter |
| 87 | 25 MHz oscillator on motherboard for CPU |
| 88 | jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns) |
| 89 | http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM |
| 90 | NEED MORE INFO. |
| 91 | - What is the cacheline synchronization latency between the CPUs ? |
| 92 | Worst case : Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo |
| 93 | Unified L2 cache. http://www.intel.com/design/processor/manuals/253668.pdf |
| 94 | Intel Core 2, Intel Xeon 5100 |
| 95 | http://www.intel.com/design/processor/manuals/253665.pdf |
| 96 | Up to 10.7 GB/s FSB |
| 97 | http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html |
| 98 | L2 cache latency : |
| 99 | Intel Core Duo : 14 cycles ; Intel Core 2 Duo : 14 cycles |
| 100 | (round-trip : 28 cycles, 7ns @4GHz) |
| 101 | sparc64 : between threads : shares L1 cache. |
| 102 | suspected to be ~2 cycles total (1+1) (to check) |
| 103 | - How close (cycle-wise) can two consecutive recorded events in the same |
| 104 | buffer be ? (~200ns, time for logging an event) (~800 cycles @4GHz) |
| 105 | - Tracing code itself : if it's at a subbuffer boundary, more checks to do. |
| 106 | Must see the maximum duration of a non-interrupted probe. |
| 107 | Worst case (had NMIs enabled) : 6997 cycles. 1749 ns @4GHz. |
| 108 | TODO : test with NMIs disabled and HT disabled. |
| 109 | Ordering can be changed if an interrupt comes between the memory operation |
| 110 | and the tracer call. Therefore, we cannot rely on more precision than the |
| 111 | expected interrupt handler duration. (guess : ~10000cycles, 2500ns@4GHz) |
| 112 | - A faster interconnect between the CPUs could be a problem, but such |
| 113 | interconnects seem to be proprietary only, not in general use. |
| 114 | - IPIs are expected to take much more than 28 cycles. |
| 115 | What is the minimum wrap-around interval ? (must be safe against a missed timer |
| 116 | interrupt, and across configurable timer HZ and CPU MHz frequencies) |
| 117 | |
| 118 | Granularity : 200ns (800 cycles@4GHz) : 2^9 = 512 (remove 9 LSB) |
| 119 | Probe never takes 1 cycle. |
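| 120 | Number of LSB skipped : max(0, (long)find_first_bit(probe_duration_in_cycles)-1) (find_first_bit must give here the 1-based position of the most significant set bit, as fls() does : 800 cycles -> 10 - 1 = 9) |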
| 121 | |
| 122 | Min wrap : 100HZ system, each 3 timer ticks : 0.03s (32-4 MSB for 4 GHZ : 2^28 cycles, 0.067s) |
| 123 | (heartbeat each 100HZ, to be safe) |
| 124 | Number of MSB to skip : |
| 125 | 32 - find_first_bit((expected_longest_interrupt_latency()[ms] + |
| 126 | max_timer_interval[ms]) * cpu_khz[kcycles/s]) - 1 |
| 127 | (the last -1 makes sure we remove at most the exact number of bits : round |
| 128 | toward 0, never up). |
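The two bit-count formulas above can be tried out with a short sketch. It assumes that `find_first_bit` is meant as the 1-based position of the most significant set bit, which Python's `int.bit_length()` provides. The function names and the sample inputs (800-cycle probe, 3 ticks of 10 ms, 0.0025 ms interrupt latency, 4GHz) are illustrative values picked from figures elsewhere in these notes.

```python
def lsb_to_skip(probe_duration_cycles):
    """LSBs that carry no ordering information, given the probe duration."""
    # bit_length() is the 1-based index of the highest set bit (0 for 0).
    return max(0, probe_duration_cycles.bit_length() - 1)


def msb_to_skip(longest_irq_latency_ms, max_timer_interval_ms, cpu_khz):
    """MSBs that can be dropped while still covering the longest interval
    between two wrap detections (kcycles/s times ms gives cycles)."""
    cycles = int((longest_irq_latency_ms + max_timer_interval_ms) * cpu_khz)
    # The extra -1 keeps one more bit than strictly needed (round toward 0).
    return max(0, 32 - cycles.bit_length() - 1)


if __name__ == "__main__":
    lsb = lsb_to_skip(800)
    msb = msb_to_skip(0.0025, 30, 4_000_000)
    print(lsb, msb, 1 << (lsb + msb))  # 9 4 8192
```

With these inputs 13 bits are freed from the 32-bit timestamp, which leaves room for 8192 event IDs.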
| 129 | |
| 130 | Heartbeat timer : |
| 131 | Each timer interrupt |
| 132 | Event : 32 bytes in size |
| 133 | each timer tick : 100HZ |
| 134 | 3.2kB/s |
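One way a trace reader could use the heartbeat is sketched below: extend a truncated timestamp back to a full TSC value from a known reference. This assumes the heartbeat event carries a full TSC and that at most one wrap period elapses between the reference and the compact event. The names `truncate_tsc` and `extend_tsc` are hypothetical, not part of the tracer.

```python
def truncate_tsc(tsc, tsc_bits, lsb_cut):
    """Field stored in a compact event: tsc_bits bits above lsb_cut."""
    return (tsc >> lsb_cut) & ((1 << tsc_bits) - 1)


def extend_tsc(ref_tsc, field, tsc_bits, lsb_cut):
    """Rebuild the full TSC (rounded down to the granularity) of an event
    logged at or after ref_tsc and less than one wrap period later."""
    period = 1 << (tsc_bits + lsb_cut)       # cycles between two wraps
    low = field << lsb_cut                   # field scaled back to cycles
    candidate = (ref_tsc & ~(period - 1)) | low
    # Compare at the stored granularity, so that an event falling in the
    # same granule as the reference is not pushed one period ahead.
    ref_floor = ref_tsc & ~((1 << lsb_cut) - 1)
    if candidate < ref_floor:
        candidate += period                  # the field wrapped once
    return candidate


if __name__ == "__main__":
    bits, cut = 19, 9                        # 32 - 13 bits, 9 LSB removed
    heartbeat = (5 << 28) + (1 << 28) - 1000 # full TSC from the heartbeat
    event = heartbeat + 3000                 # true TSC, past the wrap
    field = truncate_tsc(event, bits, cut)
    print(extend_tsc(heartbeat, field, bits, cut) == event & ~((1 << cut) - 1))
```

If more than one wrap period can pass without a reference, the reconstruction is ambiguous, which is why the heartbeat interval has to stay below the wrap period.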
| 135 | |
| 136 | 9LSB + 4MSB = 13 bits total. 13 bits for event IDs : 8192 events. |
| 137 | |
| 138 | |
| 139 | |
| 140 | |
| 141 | |
| 142 | |
| 143 | |