| 1 | Adding support for "compact" 32 bits events. |
| 2 | |
| 3 | Mathieu Desnoyers |
| 4 | March 9, 2007 |
| 5 | |
| 6 | |
| 7 | event header |
| 8 | |
| 9 | |
| 10 | 32 bits header |
| 11 | Aligned on 32 bits |
| 12 | 1 bit to select event type |
| 13 | 4 bits event ID |
| 14 | 16 events (too few) |
| 15 | 27 bits TSC (cut MSB) |
| 16 | wraps about 32 times per second at 4GHz (taking 4GHz ~= 2^32 Hz) |
| 17 | wraps are spaced 0.03125s apart |
| 18 | 100HZ clock : tick each 0.01s |
| 19 | must detect a wrap at least every 3 jiffies (dangerous, may miss) |
| 20 | granularity : 2^0 = 1 cycle : 0.25ns @4GHz |
| 21 | payload size known by facility |
| 22 | |
| 23 | 32 bits header |
| 24 | Aligned on 32 bits |
| 25 | 1 bit to select event type |
| 26 | 4 bits event ID |
| 27 | 16 events (too few) |
| 28 | 27 bits TSC (cut LSB) |
| 29 | wraps each second at 4GHz |
| 30 | 100HZ clock : tick each 0.01s |
| 31 | granularity : 2^5 = 32 cycles : 8ns @4GHz |
| 32 | payload size known by facility |
| 33 | |
| 34 | 32 bits header |
| 35 | Aligned on 32 bits |
| 36 | 1 bit to select event type |
| 37 | 5 bits event ID |
| 38 | 32 events |
| 39 | 26 bits TSC (cut LSB) |
| 40 | wraps each second at 4GHz (26 bits + 6 cut LSB = 32 bits) |
| 41 | 100HZ clock : tick each 0.01s |
| 42 | granularity : 2^6 = 64 cycles : 16ns @4GHz |
| 43 | payload size known by facility |
| 44 | |
| 45 | 32 bits header |
| 46 | Aligned on 32 bits |
| 47 | 1 bit to select event type |
| 48 | 6 bits event ID |
| 49 | 64 events |
| 50 | 25 bits TSC (cut LSB) |
| 51 | wraps each second at 4GHz (25 bits + 7 cut LSB = 32 bits) |
| 52 | 100HZ clock : tick each 0.01s |
| 53 | granularity : 2^7 = 128 cycles : 32ns @4GHz |
| 54 | payload size known by facility |
| 55 | |
| 56 | 64 bits header |
| 57 | Aligned on 32 bits |
| 58 | 1 bit to select event type |
| 59 | 15 bits event id (major 8 minor 8, counting the select bit as the major MSB) |
| 60 | 32768 events |
| 61 | 16 bits event size (extra) |
| 62 | 32 bits TSC |
| 63 | wraps each second at 4GHz |
| 64 | 100HZ clock : tick each 0.01s |
| 65 | |
| 66 | 96 or 128 bits header (full 64 bits TSC, useful when no heartbeat available |
| 67 | size depends on internal alignment) |
| 68 | Aligned on 32 bits |
| 69 | 1 bit to select event type |
| 70 | 15 bits event id (major 8 minor 8, counting the select bit as the major MSB) |
| 71 | 32768 events |
| 72 | 16 bits event size (extra) |
| 73 | Align on 64 bits |
| 74 | 64 bits TSC |
| 75 | wraps each 146.14 years at 4GHz |
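The granularity and wrap figures in the layouts above all follow from two numbers: how many TSC bits are kept and how many LSB are cut. A minimal sketch of that arithmetic, assuming a 4GHz TSC as in this note (`tsc_field_props` is a hypothetical helper name). The note rounds 4GHz to about 2^32 cycles per second, so the exact wrap periods come out slightly longer than the rounded ones.

```python
CPU_HZ = 4_000_000_000  # 4GHz, the reference frequency used in this note


def tsc_field_props(kept_bits, lsb_cut, cpu_hz=CPU_HZ):
    """Return (granularity in ns, wrap period in s) for a truncated TSC field.

    kept_bits: width of the TSC field stored in the header.
    lsb_cut:   number of low-order TSC bits dropped before storing.
    """
    granularity_cycles = 1 << lsb_cut
    wrap_cycles = 1 << (kept_bits + lsb_cut)
    return granularity_cycles * 1e9 / cpu_hz, wrap_cycles / cpu_hz


# (kept bits, LSB cut) for the header variants listed above
for kept, cut in [(27, 0), (27, 5), (26, 6), (25, 7), (32, 0)]:
    gran_ns, wrap_s = tsc_field_props(kept, cut)
    print(f"{kept} bits, {cut} LSB cut: {gran_ns:g} ns, wraps every {wrap_s:.4f} s")
```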
| 76 | |
| 77 | |
| 78 | |
| 79 | |
| 80 | |
| 81 | Must put the event ID fields first in the large (64, 96-128 bits) event headers |
| 82 | Create a "compact" facility which reserves the facility IDs with the MSB at 1. |
| 83 | - or better : select mapping for events |
| 84 | What is the minimum granularity required ? (so we know how many LSB to cut) |
| 85 | - How much can synchronized CPU TSCs drift apart from one another ? |
| 86 | PLL |
| 87 | http://en.wikipedia.org/wiki/Phase-locked_loop |
| 88 | static phase offset -> tracking jitter |
| 89 | 25 MHz oscillator on motherboard for CPU |
| 90 | jitter : expressed in ±picoseconds (should therefore be lower than 0.25ns) |
| 91 | http://www.eetasia.com/ART_8800082274_480600_683c4e6b200103.HTM |
| 92 | NEED MORE INFO. |
| 93 | - What is the cacheline synchronization latency between the CPUs ? |
| 94 | Worst case : Intel Core 2, Intel Xeon 5100, Intel Core Solo, Intel Core Duo |
| 95 | Unified L2 cache. http://www.intel.com/design/processor/manuals/253668.pdf |
| 96 | Intel Core 2, Intel Xeon 5100 |
| 97 | http://www.intel.com/design/processor/manuals/253665.pdf |
| 98 | Up to 10.7 GB/s FSB |
| 99 | http://www.xbitlabs.com/articles/mobile/display/core2duo_2.html |
| 100 |                    Intel Core Duo    Intel Core 2 Duo |
| 101 | L2 cache latency   14 cycles         14 cycles |
| 102 | (round-trip : 28 cycles, 7ns @4GHz) |
| 103 | sparc64 : between threads : shares L1 cache. |
| 104 | suspected to be ~2 cycles total (1+1) (to check) |
| 105 | - How close (cycle-wise) can two consecutive recorded events in the same |
| 106 | buffer be ? (~200ns, time for logging an event) (~800 cycles @4GHz) |
| 107 | - Tracing code itself : if it's at a subbuffer boundary, more checks to do. |
| 108 | Must see the maximum duration of a non interrupted probe. |
| 109 | Worst case (had NMIs enabled) : 6997 cycles. 1749 ns @4GHz. |
| 110 | TODO : test with NMIs disabled and HT disabled. |
| 111 | Ordering can be changed if an interrupt comes between the memory operation |
| 112 | and the tracer call. Therefore, we cannot rely on more precision than the |
| 113 | expected interrupt handler duration. (guess : ~10000cycles, 2500ns@4GHz) |
| 114 | - If there is a faster interconnect between the CPUs, it can be a problem, but |
| 115 | seems to only be proprietary interconnects, not used in general. |
| 116 | - IPIs are expected to take much more than 28 cycles. |
| 117 | What is the minimum wrap-around interval ? (must be safe for timer interrupt |
| 118 | miss and multiple timer HZ (configurable) and CPU MHZ frequencies) |
| 119 | Must align _all_ headers on 32 bits, not 64. |
| 120 | |
| 121 | Granularity : 200ns (800 cycles@4GHz) : 2^9 = 512 (remove 9 LSB) |
| 122 | Number of LSB skipped : first_bit(probe_duration_in_cycles)-1 |
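A minimal sketch of this rule, assuming `first_bit()` means the 1-based position of the most significant set bit (the same convention as the kernel's `fls()`). Python's `int.bit_length()` follows that convention; `lsb_to_skip` is a hypothetical name.

```python
def lsb_to_skip(probe_duration_in_cycles):
    """Number of low-order TSC bits that can be dropped.

    Assumption: first_bit() is the 1-based index of the highest set bit,
    so first_bit(x) - 1 is floor(log2(x)).
    """
    return probe_duration_in_cycles.bit_length() - 1


# 800 cycles is the ~200ns event logging time at 4GHz quoted earlier.
print(lsb_to_skip(800))  # 2^9 = 512 <= 800 < 1024, so 9 LSB can go
```

The result rounds the probe duration down to a power of two, so the kept granularity (512 cycles here) never exceeds the shortest spacing between two events in the same buffer.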
| 123 | |
| 124 | Min wrap : 100HZ system, each 3 timer ticks : 0.03s (32-4 MSB = 28 bits for 4 GHZ : 0.067s) |
| 125 | (heartbeat each 100HZ, to be safe) |
| 126 | Number of MSB to skip : |
| 127 | 32 - first_bit(( (expected_longest_cli()[ms] + max_timer_interval[ms]) * 2 * |
| 128 | cpu_khz )) |
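A minimal sketch of the MSB computation, under two assumptions that are not spelled out in the note: `first_bit()` is again the 1-based index of the highest set bit, and the duration in milliseconds is converted to cycles by multiplying by `cpu_khz` (1 ms at 1 kHz is one cycle). The function name `msb_to_skip` and the 0 ms interrupts-off figure in the example are placeholders.

```python
def msb_to_skip(longest_cli_ms, max_timer_interval_ms, cpu_khz):
    """Number of high-order bits of a 32 bits TSC that can be dropped.

    The window that must be covered without ambiguity is twice the longest
    time that can pass between two wrap checks, expressed in cycles.
    """
    window_cycles = (longest_cli_ms + max_timer_interval_ms) * 2 * cpu_khz
    return 32 - window_cycles.bit_length()


# 100HZ system, wrap checked every 3 ticks (30 ms), 4GHz CPU,
# and a placeholder of 0 ms spent with interrupts disabled.
print(msb_to_skip(0, 30, 4_000_000))
```

With these inputs the window is 2.4e8 cycles, which needs 28 bits, so 4 MSB can be skipped. That matches the 4 MSB figure used in this note.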
| 129 | |
| 130 | |
| 131 | 9 LSB + 4 MSB = 13 bits cut, 19 bits of TSC kept. 32 - 19 - 1 select bit = 12 bits for event IDs : 4096 events. |
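The resulting 32 bits header could be packed and decoded as sketched here. The field order (select bit, then event ID, then TSC), the select bit value of 1 for compact events, and all function names are assumptions made for illustration; the note only fixes the field widths. `rebuild_tsc` shows how a reader could extend the 19 bits field back to a full TSC from the last full timestamp it has seen (for instance from a heartbeat event).

```python
LSB_CUT = 9    # low-order TSC bits dropped (512 cycles granularity)
TSC_BITS = 19  # TSC bits kept: 32 - 9 LSB - 4 MSB
ID_BITS = 12   # event ID bits: 4096 events


def pack_compact(event_id, tsc):
    """Build a 32 bits compact header: [1 select | 12 event ID | 19 TSC]."""
    assert 0 <= event_id < (1 << ID_BITS)
    tsc_field = (tsc >> LSB_CUT) & ((1 << TSC_BITS) - 1)
    return (1 << 31) | (event_id << TSC_BITS) | tsc_field


def unpack_compact(header):
    """Return (select bit, event ID, truncated TSC field)."""
    select = header >> 31
    event_id = (header >> TSC_BITS) & ((1 << ID_BITS) - 1)
    tsc_field = header & ((1 << TSC_BITS) - 1)
    return select, event_id, tsc_field


def rebuild_tsc(tsc_field, last_full_tsc):
    """Extend a truncated field using the last known full TSC.

    Only valid if less than one window (2^28 cycles) elapsed since
    last_full_tsc; the low LSB_CUT bits of the result are zero.
    """
    window = 1 << (TSC_BITS + LSB_CUT)
    base = last_full_tsc & ~(window - 1)
    candidate = base | (tsc_field << LSB_CUT)
    if candidate < (last_full_tsc & ~((1 << LSB_CUT) - 1)):
        candidate += window  # the truncated counter wrapped once
    return candidate


last = 0x123456789                    # full TSC from a heartbeat
header = pack_compact(42, last + 1000)
select, event_id, field = unpack_compact(header)
print(select, event_id, hex(rebuild_tsc(field, last)))
```

Going past one window between two full timestamps makes the reconstruction ambiguous, which is why the wrap interval must stay comfortably above the heartbeat period.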
| 132 | |
| 133 | |
| 134 | |