| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| 2 | <html> |
| 3 | <head> |
| 4 | <title>The new LTT trace format</title> |
| 5 | </head> |
| 6 | <body> |
| 7 | |
| 8 | <h1>The new LTT trace format</h1> |
| 9 | |
| 10 | <P> |
| 11 | A trace is contained in a directory tree. To send a trace remotely, |
| 12 | the directory tree may be tar-gzipped. Trace foo, placed in the home |
| 13 | directory of user john, /home/john, would have the following content: |
| 14 | |
| 15 | <PRE><TT> |
| 16 | $ cd /home/john |
| 17 | $ tree foo |
| 18 | foo/ |
| 19 | |-- eventdefs |
| 20 | |   |-- core.xml |
| 21 | |   |-- net.xml |
| 22 | |   |-- ipv4.xml |
| 23 | |   `-- ide.xml |
| 24 | |-- info |
| 25 | |   |-- bookmarks.xml |
| 26 | |   `-- system.xml |
| 27 | |-- control |
| 28 | |   |-- facilities |
| 29 | |   |-- interrupts |
| 30 | |   `-- processes |
| 31 | `-- cpu |
| 32 |     |-- 0 |
| 33 |     |-- 1 |
| 34 |     |-- 2 |
| 35 |     `-- 3 |
| 36 | </TT></PRE> |
| 37 | |
| 38 | <P> |
| 39 | The eventdefs directory contains the event descriptions for all the |
| 40 | facilities used. The syntax is a simple subset of XML; XML is widely |
| 41 | known and easily parsed or hand edited. Each file contains one or more |
| 42 | <FACILITY NAME=name>...</FACILITY> elements. Several |
| 43 | facilities may have the same name but different content (and thus |
| 44 | a different checksum), typically when the event descriptions |
| 45 | for a given facility change from one version to the next, for example when a module |
| 46 | is recompiled and reloaded during a trace. |
| 47 | |
| 48 | <P> |
| 49 | A small number of events are predefined as part of the "builtin" facility |
| 50 | and are not described in the eventdefs directory. These "builtin" events include "facility_load", |
| 51 | "block_start", "block_end" and "time_heartbeat". |
| 52 | |
| 53 | <P> |
| 54 | The cpu directory contains a tracefile for each cpu, numbered from 0, |
| 55 | in .trace format. A uniprocessor thus only contains the file cpu/0. |
| 56 | A multiprocessor with some unused (possibly hotplug) CPU slots may have some |
| 57 | unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs randomly |
| 58 | installed may produce tracefiles named 0, 1, 2, 4, 6, 7. |
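
<P>
Because CPU numbers may have gaps, a reader probably should scan the cpu
directory instead of assuming that files 0 to n - 1 all exist. The following is
a minimal sketch, assuming tracefile names are plain decimal CPU numbers; the
function name list_cpu_tracefiles is hypothetical and not part of any LTT tool.

```python
import os
import tempfile


def list_cpu_tracefiles(trace_dir):
    """Return {cpu_number: path} for the tracefiles found in <trace_dir>/cpu.

    Gaps in the numbering are expected (unused CPU slots), so the
    directory is scanned instead of assuming files 0..n-1 all exist.
    """
    cpu_dir = os.path.join(trace_dir, "cpu")
    found = {}
    for name in os.listdir(cpu_dir):
        if name.isdigit():  # ignore anything that is not a CPU number
            found[int(name)] = os.path.join(cpu_dir, name)
    return dict(sorted(found.items()))


if __name__ == "__main__":
    # Build a throwaway trace layout: 6 CPUs on an 8-slot board.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "cpu"))
        for n in (0, 1, 2, 4, 6, 7):
            open(os.path.join(root, "cpu", str(n)), "wb").close()
        print(list(list_cpu_tracefiles(root)))
```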
| 59 | |
| 60 | <P> |
| 61 | The files in the control directory also follow the .trace format. |
| 62 | The "facilities" file only contains "builtin" facility_load events |
| 63 | and is used to determine the facilities used and the code range assigned |
| 64 | to each facility. The other control files contain the initial system |
| 65 | state and various subsequent important events, for example process |
| 66 | creation and exit. The advantage of placing such subsequent events |
| 67 | in control trace files instead of (or in addition to) the per-CPU |
| 68 | trace files is that they may be accessed more quickly and conveniently, |
| 69 | and that they may be kept even when the per-CPU files are overwritten |
| 70 | in "flight recorder mode". |
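
<P>
One possible way to use the "facilities" file is to build a lookup table from
the facility_load events and use it to map an event type id found in a
tracefile back to its facility. The sketch below assumes each facility_load
event yields a name, a checksum and a base code, and that code ranges do not
overlap. Facility and FacilityTable are hypothetical names chosen for
illustration.

```python
import bisect
from collections import namedtuple

# One entry per facility_load event read from control/facilities.
Facility = namedtuple("Facility", "name checksum base_code")


class FacilityTable:
    """Maps a global event type id back to (facility, local event id)."""

    def __init__(self, facilities):
        self._sorted = sorted(facilities, key=lambda f: f.base_code)
        self._bases = [f.base_code for f in self._sorted]

    def resolve(self, event_id):
        # Pick the rightmost facility whose base_code is <= event_id.
        i = bisect.bisect_right(self._bases, event_id) - 1
        if i < 0:
            raise KeyError(event_id)
        fac = self._sorted[i]
        return fac, event_id - fac.base_code


if __name__ == "__main__":
    table = FacilityTable([
        Facility("builtin", 0, 0),
        Facility("core", 0x1234, 10),
    ])
    print(table.resolve(12))
```

<P>
Since only the base code is known here, this sketch cannot detect an id that
lies past the end of the last facility's range; a real reader would also need
the number of event types per facility, taken from the event descriptions.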
| 71 | |
| 72 | <P> |
| 73 | The info directory contains, in system.xml, a description of the system on which |
| 74 | the trace was created, as well as user annotations in bookmarks.xml. |
| 75 | This directory may also contain various information about the trace generated |
| 76 | during trace analysis (statistics, index...). |
| 77 | |
| 78 | |
| 79 | <H2>Trace format</H2> |
| 80 | |
| 81 | <P> |
| 82 | Each tracefile is divided into equal-size blocks with a uint32 at the block |
| 83 | end giving the offset to the last event in the block. Events are packed |
| 84 | sequentially in the block starting at offset 0 with a "block_start" event |
| 85 | and ending, at the offset stored in the last 4 bytes of the block, with a |
| 86 | block_end event. Both the block_start and block_end events |
| 87 | contain the kernel timestamp (timespec binary structure, |
| 88 | uint32 seconds, uint32 nanoseconds), the cycle counter (uint64 cycles), |
| 89 | and the buffer id (uint64). |
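
<P>
The block layout can be illustrated with a short sketch that splits a tracefile
into blocks and reads the trailing uint32 of each one. It assumes the whole
file is already in memory and that the reader runs on a machine with the same
byte order as the traced one (hence the "=" prefix); iter_blocks is a
hypothetical helper name.

```python
import struct


def iter_blocks(data, block_size):
    """Yield (block_bytes, last_event_offset) for each fixed-size block.

    The trailing uint32 of every block, in native byte order, holds
    the offset of the block_end event within that block.
    """
    if len(data) % block_size:
        raise ValueError("tracefile is not a whole number of blocks")
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        (last_offset,) = struct.unpack_from("=I", block, block_size - 4)
        if last_offset > block_size - 4:
            raise ValueError("offset points outside the block")
        yield block, last_offset


if __name__ == "__main__":
    size = 32
    fake = bytes(size - 4) + struct.pack("=I", 12)
    for block, offset in iter_blocks(fake * 2, size):
        print(len(block), offset)
```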
| 90 | |
| 91 | <P> |
| 92 | Each event consists of an event type id (a uint16 equal to the event type id |
| 93 | within the facility plus the facility base id), a time delta (a uint32, in cycles |
| 94 | or nanoseconds depending on configuration, measured from the last time value found in the |
| 95 | block header or in a "time_heartbeat" event) and the event type specific data. |
| 96 | All values are packed in binary format, in native byte order. |
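
<P>
As an illustration of the event layout, the sketch below reads the 6-byte event
header (uint16 id, uint32 time delta, no padding) and turns a delta into a time
value. It assumes "packed" means no alignment padding between the two header
fields; read_event_header and absolute_time are hypothetical names.

```python
import struct

# uint16 event type id followed by a uint32 time delta; the "=" prefix
# selects native byte order with no alignment padding (6 bytes in total).
EVENT_HEADER = struct.Struct("=HI")


def read_event_header(block, offset):
    """Return (event_id, time_delta, payload_offset) for the event at offset."""
    event_id, delta = EVENT_HEADER.unpack_from(block, offset)
    return event_id, delta, offset + EVENT_HEADER.size


def absolute_time(reference, delta):
    """Time of an event given the last full time value and its delta.

    `reference` is the value (cycles or nanoseconds, depending on the
    trace configuration) taken from the block header or from the most
    recent time_heartbeat event.
    """
    return reference + delta


if __name__ == "__main__":
    blk = struct.pack("=HI", 17, 2500) + b"payload"
    print(read_event_header(blk, 0))
```

<P>
The length of the event type specific data is not stored in the header, so
stepping to the next event requires the event description for that type id.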
| 97 | |
| 98 | |
| 99 | <H2>System description</H2> |
| 100 | |
| 101 | <P> |
| 102 | The system type description, in system.xml, looks like: |
| 103 | |
| 104 | <PRE><TT> |
| 105 | <system |
| 106 | node_name="vaucluse" |
| 107 | domainname="polymtl.ca" |
| 108 | cpu="4" |
| 109 | arch_size="ILP32" |
| 110 | endian="little" |
| 111 | kernel_name="Linux" |
| 112 | kernel_release="2.4.18-686-smp" |
| 113 | kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002" |
| 114 | machine="i686" |
| 115 | processor="unknown" |
| 116 | hardware_platform="unknown" |
| 117 | operating_system="Linux" |
| 118 | ltt_major_version="2" |
| 119 | ltt_minor_version="0" |
| 120 | ltt_block_size="100000" |
| 121 | > |
| 122 | Some comments about the system |
| 123 | </system> |
| 124 | </TT></PRE> |
| 125 | |
| 126 | <P> |
| 127 | The system attributes kernel_name, node_name, kernel_release, |
| 128 | kernel_version, machine, processor, hardware_platform and operating_system |
| 129 | come from the uname(1) program. The domainname attribute is obtained from |
| 130 | the "hostname --domain" command. The arch_size attribute is one of |
| 131 | LP32, ILP32, LP64 or ILP64 and specifies the length in bits of integers (I), |
| 132 | longs (L) and pointers (P). The endian attribute is "little" or "big". |
| 133 | While the arch_size and endian attributes could be deduced from the platform |
| 134 | type, stating them explicitly allows analysing traces from as yet unknown |
| 135 | platforms. The cpu attribute specifies the maximum number of processors in |
| 136 | the system; only tracefiles 0 to this maximum - 1 may exist in the cpu |
| 137 | directory. |
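
<P>
A reader would typically load these attributes before decoding any tracefile.
The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the
example, with every attribute value quoted because a strict XML parser requires
it. The function name parse_system and the keys of the returned dictionary are
hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<system
 node_name="vaucluse"
 cpu="4"
 arch_size="ILP32"
 endian="little"
 ltt_block_size="100000">
Some comments about the system
</system>"""


def parse_system(xml_text):
    """Extract the attributes a trace reader needs before decoding."""
    root = ET.fromstring(xml_text)
    arch = root.get("arch_size")  # e.g. "ILP32" or "LP64"
    letters, bits = arch[:-2], int(arch[-2:])
    if not letters or set(letters) - set("ILP"):
        raise ValueError(f"unexpected arch_size {arch!r}")
    return {
        # struct-module byte order prefix for the traced machine
        "byte_order": {"little": "<", "big": ">"}[root.get("endian")],
        # width in bytes of each type letter named in arch_size
        "sizes": {c: bits // 8 for c in letters},
        "max_cpus": int(root.get("cpu")),
        "block_size": int(root.get("ltt_block_size")),
        "comment": (root.text or "").strip(),
    }


if __name__ == "__main__":
    print(parse_system(SAMPLE))
```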
| 138 | |
| 139 | <P> |
| 140 | Within the system element, the enclosed text may further describe the |
| 141 | traced system. |
| 142 | |
| 143 | |
| 144 | <H2>Event type descriptions</H2> |
| 145 | |
| 146 | <P> |
| 147 | A facility contains the descriptions of several event types. When a structure |
| 148 | is reused in several event types, a named type is defined and may be referenced |
| 149 | by several other event types or named types. |
| 150 | |
| 151 | <PRE><TT> |
| 152 | <facility name=facility_name> |
| 153 | <description>Some text</description> |
| 154 | <event name=eventtype_name> |
| 155 | <description>Some text</description> |
| 156 | --type structure-- |
| 157 | </event> |
| 158 | ... |
| 159 | <type name=type_name> |
| 160 | --type structure-- |
| 161 | </type> |
| 162 | </facility> |
| 163 | </TT></PRE> |
| 164 | |
| 165 | <P> |
| 166 | The type structure may be one of the following primitive type elements. |
| 167 | Whenever the keyword isize is used, the allowed values are |
| 168 | short, medium, long, 1, 2, 4, 8, indicating the size in bytes. |
| 169 | The fsize keyword stands for one of medium, long, 4 or 8. |
| 170 | |
| 171 | <PRE><TT> |
| 172 | <int size=isize format="printf format"/> |
| 173 | |
| 174 | <uint size=isize format="printf format"/> |
| 175 | |
| 176 | <float size=fsize format="printf format"/> |
| 177 | |
| 178 | <string format="printf format"/> |
| 179 | |
| 180 | <enum size=isize format="printf format">label1 label2 ...</enum> |
| 181 | </TT></PRE> |
| 182 | |
| 183 | <P> |
| 184 | The string is null terminated. For the enumeration, the size of the integer |
| 185 | used for its representation is specified. |
| 186 | |
| 187 | <P> |
| 188 | The type structure may also be a compound type. |
| 189 | |
| 190 | <PRE><TT> |
| 191 | <array size=n> --type structure-- </array> |
| 192 | |
| 193 | <sequence lengthsize=isize> --type structure-- </sequence> |
| 194 | |
| 195 | <struct> |
| 196 | <field name=field_name> |
| 197 | <description>Some text</description> |
| 198 | --type structure-- |
| 199 | </field> |
| 200 | ... |
| 201 | </struct> |
| 202 | |
| 203 | <union typecodesize=isize> |
| 204 | <field name=field_name> |
| 205 | <description>Some text</description> |
| 206 | --type structure-- |
| 207 | </field> |
| 208 | ... |
| 209 | </union> |
| 210 | </TT></PRE> |
| 211 | |
| 212 | <P> |
| 213 | An array is a fixed size array of length size. A sequence is a variable size |
| 214 | array with its length stored as a prepended uint of length lengthsize. |
| 215 | A structure is simply an aggregation of fields. A union is one of its n |
| 216 | fields (variant record), as indicated by a preceding code (0 to n - 1) |
| 217 | of the specified size typecodesize. |
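
<P>
The compound types lend themselves to a small recursive decoder. The sketch
below is one possible approach, not the LTT implementation: type structures are
represented as hypothetical Python tuples such as ("uint", 4) or
("sequence", 2, elem), values are assumed to be packed without padding, and
float and enum types are left out for brevity.

```python
import sys


def decode(t, buf, off=0, order=sys.byteorder):
    """Decode one value of type structure `t` from buf starting at off.

    Returns (value, offset_after_value).
    """
    kind = t[0]
    if kind in ("int", "uint"):
        size = t[1]
        chunk = buf[off:off + size]
        if len(chunk) != size:
            raise ValueError("truncated integer")
        return int.from_bytes(chunk, order, signed=(kind == "int")), off + size
    if kind == "string":
        end = buf.index(b"\0", off)  # strings are null terminated
        return buf[off:end].decode("ascii"), end + 1
    if kind == "array":
        _, n, elem = t
        out = []
        for _ in range(n):
            val, off = decode(elem, buf, off, order)
            out.append(val)
        return out, off
    if kind == "sequence":
        _, lengthsize, elem = t
        # The element count is a prepended uint of lengthsize bytes.
        n, off = decode(("uint", lengthsize), buf, off, order)
        return decode(("array", n, elem), buf, off, order)
    if kind == "struct":
        out = {}
        for name, ftype in t[1]:
            out[name], off = decode(ftype, buf, off, order)
        return out, off
    if kind == "union":
        _, codesize, fields = t
        # The preceding code (0 to n - 1) selects which field is present.
        code, off = decode(("uint", codesize), buf, off, order)
        name, ftype = fields[code]
        val, off = decode(ftype, buf, off, order)
        return {name: val}, off
    raise ValueError(f"unknown type structure {kind!r}")
```

<P>
For example, decode(("sequence", 1, ("uint", 2)), data) would read a one-byte
count followed by that many 2-byte unsigned integers.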
| 218 | |
| 219 | <P> |
| 220 | Finally the type structure may be defined by referencing a named type. |
| 221 | |
| 222 | <PRE><TT> |
| 223 | <typeref name=type_name/> |
| 224 | </TT></PRE> |
| 225 | |
| 226 | <H2>Builtin events</H2> |
| 227 | |
| 228 | <P> |
| 229 | The facility named "builtin" is always present and contains at least the |
| 230 | following event types. |
| 231 | |
| 232 | <PRE><TT> |
| 233 | <event name=facility_load> |
| 234 | <description>Facility used in the trace</description> |
| 235 | <struct> |
| 236 | <field name="name"><string/></field> |
| 237 | <field name="checksum"><uint size=4/></field> |
| 238 | <field name="base_code"><uint size=4/></field> |
| 239 | </struct> |
| 240 | </event> |
| 241 | |
| 242 | <event name=block_start> |
| 243 | <description>Block start timestamp</description> |
| 244 | <typeref name=block_timestamp/> |
| 245 | </event> |
| 246 | |
| 247 | <event name=block_end> |
| 248 | <description>Block end timestamp</description> |
| 249 | <typeref name=block_timestamp/> |
| 250 | </event> |
| 251 | |
| 252 | <event name=time_heartbeat> |
| 253 | <description>System time values sent periodically to minimize cycle counter |
| 254 | drift with respect to real time clock and to detect cycle counter |
| 255 | rollovers |
| 256 | </description> |
| 257 | <typeref name=timestamp/> |
| 258 | </event> |
| 259 | |
| 260 | <type name=block_timestamp> |
| 261 | <struct> |
| 262 | <field name=timestamp><typeref name=timestamp/></field> |
| 263 | <field name=block_id><uint size=4/></field> |
| 264 | </struct> |
| 265 | </type> |
| 266 | |
| 267 | <type name=timestamp> |
| 268 | <struct> |
| 269 | <field name=time><typeref name=timespec/></field> |
| 270 | <field name="cycle_count"><uint size=8/></field> |
| 271 | </struct> |
| 272 | </type> |
| 273 | |
| 274 | <type name=timespec> |
| 275 | <struct> |
| 276 | <field name="seconds"><uint size=4/></field> |
| 277 | <field name="nanoseconds"><uint size=4/></field> |
| 278 | </struct> |
| 279 | </type> |
| 280 | </TT></PRE> |
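
<P>
To make the builtin payloads concrete, here is a minimal sketch that decodes
the data part of a facility_load event and of a block_start or block_end event.
It assumes the fields are packed in declaration order without padding, in
native byte order, and takes the field sizes from the descriptions above
(including a 4-byte block_id). The function names are hypothetical.

```python
import struct

# timespec seconds, timespec nanoseconds, cycle_count, block_id
BLOCK_TIMESTAMP = struct.Struct("=IIQI")


def decode_facility_load(buf, off=0):
    """Return ((name, checksum, base_code), next_offset)."""
    end = buf.index(b"\0", off)  # the name string is null terminated
    name = buf[off:end].decode("ascii")
    checksum, base_code = struct.unpack_from("=II", buf, end + 1)
    return (name, checksum, base_code), end + 1 + 8


def decode_block_timestamp(buf, off=0):
    """Return a dict for the payload of a block_start or block_end event."""
    sec, nsec, cycles, block_id = BLOCK_TIMESTAMP.unpack_from(buf, off)
    return {"seconds": sec, "nanoseconds": nsec,
            "cycle_count": cycles, "block_id": block_id}


if __name__ == "__main__":
    payload = b"core\0" + struct.pack("=II", 0xDEADBEEF, 16)
    print(decode_facility_load(payload))
```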
| 281 | |
| 282 | <H2>Control files</H2> |
| 283 | |
| 284 | <P> |
| 285 | The interrupts file reflects the content of the /proc/interrupts system file. |
| 286 | It contains one event describing each interrupt. At trace start, events are |
| 287 | generated describing all the current interrupts. If the assignment of |
| 288 | interrupts changes later, due to devices or device drivers being activated or |
| 289 | deactivated, additional events may be added to the file. Each interrupt |
| 290 | event has the following structure. |
| 291 | |
| 292 | <PRE><TT> |
| 293 | <event name=interrupt> |
| 294 | <description>Interrupt request number assignment</description> |
| 295 | <struct> |
| 296 | <field name="number"><uint size=4/></field> |
| 297 | <field name="count"><uint size=4/></field> |
| 298 | <field name="controller"><string/></field> |
| 299 | <field name="name"><string/></field> |
| 300 | </struct> |
| 301 | </event> |
| 302 | </TT></PRE> |
| 303 | |
| 304 | <P> |
| 305 | The processes file contains the list of processes already created when the |
| 306 | trace starts. Each event describing a process is modeled after the |
| 307 | /proc/self/status system file. The set of fields in this event is |
| 308 | expected to grow in the future to include groups, signal masks, |
| 309 | opened file descriptors and address maps. |
| 310 | |
| 311 | <PRE><TT> |
| 312 | <event name=process> |
| 313 | <description>Existing process</description> |
| 314 | <struct> |
| 315 | <field name="name"><string/></field> |
| 316 | <field name="pid"><uint size=4/></field> |
| 317 | <field name="ppid"><uint size=4/></field> |
| 318 | <field name="tracer_pid"><uint size=4/></field> |
| 319 | <field name="uid"><uint size=4/></field> |
| 320 | <field name="euid"><uint size=4/></field> |
| 321 | <field name="suid"><uint size=4/></field> |
| 322 | <field name="fsuid"><uint size=4/></field> |
| 323 | <field name="gid"><uint size=4/></field> |
| 324 | <field name="egid"><uint size=4/></field> |
| 325 | <field name="sgid"><uint size=4/></field> |
| 326 | <field name="fsgid"><uint size=4/></field> |
| 327 | <field name="state"><enum size=4> |
| 328 | Running WaitInterruptible WaitUninterruptible Zombie Traced Paging |
| 329 | </enum></field> |
| 330 | </struct> |
| 331 | </event> |
| 332 | </TT></PRE> |
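
<P>
As a rough illustration, the payload of a process event could be decoded as
follows. The sketch assumes packed fields in native byte order and, since the
numbering of enumeration labels is not spelled out here, that the labels are
numbered from 0 in the order listed. decode_process is a hypothetical name.

```python
import struct

FIELDS = ("pid", "ppid", "tracer_pid", "uid", "euid", "suid", "fsuid",
          "gid", "egid", "sgid", "fsgid")
# Assumption: enum labels are numbered from 0 in the order they appear.
STATES = ("Running", "WaitInterruptible", "WaitUninterruptible",
          "Zombie", "Traced", "Paging")
# Eleven uint32 ids followed by the uint32 state code.
FIXED = struct.Struct("=%dI" % (len(FIELDS) + 1))


def decode_process(buf, off=0):
    """Return (process_dict, next_offset) for one process event payload."""
    end = buf.index(b"\0", off)  # the name string is null terminated
    values = FIXED.unpack_from(buf, end + 1)
    proc = {"name": buf[off:end].decode("ascii")}
    proc.update(zip(FIELDS, values))  # zip stops before the state code
    proc["state"] = STATES[values[-1]]
    return proc, end + 1 + FIXED.size


if __name__ == "__main__":
    data = b"init\0" + struct.pack("=12I", 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    print(decode_process(data))
```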
| 333 | |
| 334 | <H2>Facilities</H2> |
| 335 | |
| 336 | <P> |
| 337 | Facilities define the granularity at which events are grouped for filtering, activation |
| 338 | and compilation. Each facility costs a table entry in the kernel (name, |
| 339 | checksum, event type code range), somewhere between 20 and 30 bytes. Having |
| 340 | one facility per tracing statement in the kernel would be too much (assuming |
| 341 | that they eventually are routinely inserted in the kernel code and replace |
| 342 | the 80000+ printk statements in some proportion). However, having a few |
| 343 | facilities, up to a few tens, would make sense. |
| 344 | |
| 345 | <P> |
| 346 | The "builtin" facility contains a small number of predefined events which must |
| 347 | always exist. The "core" facility contains a small subset of OS events which |
| 348 | are almost always of interest (scheduling, interrupts, faults, system calls). |
| 349 | Then, specialized facilities may exist for each subsystem (network, disks, |
| 350 | USB, SCSI...). |
| 351 | |
| 352 | |
| 353 | <H2>Bookmarks</H2> |
| 354 | |
| 355 | <P> |
| 356 | Bookmarks are user supplied information added to a trace. They contain user |
| 357 | annotations attached to a time interval. |
| 358 | |
| 359 | <PRE><TT> |
| 360 | <bookmarks> |
| 361 | <location name=name cpu=n start_time=t end_time=t>Some text</location> |
| 362 | ... |
| 363 | </bookmarks> |
| 364 | </TT></PRE> |
| 365 | |
| 366 | <P> |
| 367 | The interval is defined either in time, using "time=" alone or "start_time=" with |
| 368 | "end_time=", or in cycles, using "cycle=" alone or "start_cycle=" with "end_cycle=". |
| 369 | The time is in seconds with decimals up to nanoseconds and cycle counts |
| 370 | are unsigned integers with a 64-bit range. The cpu attribute is optional. |
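
<P>
A bookmarks file could be read along these lines. This is a sketch under stated
assumptions: attribute values are quoted so that a strict XML parser accepts
them, a lone "time=" or "cycle=" is treated as an interval whose start equals
its end, and times are converted to integer nanoseconds. The sample content and
the name parse_bookmarks are made up for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

SAMPLE = """<bookmarks>
  <location name="boot" cpu="0" start_time="12.000000500" end_time="12.5">Slow boot</location>
  <location name="irq storm" cycle="123456789012">Look here</location>
</bookmarks>"""


def _seconds_to_ns(text):
    # Decimal avoids the rounding a float would introduce at nanosecond scale.
    return int(Decimal(text) * 1_000_000_000)


def _interval(loc, unit, convert):
    """Return (start, end) from unit= or start_unit=/end_unit=, else None."""
    point = loc.get(unit)
    if point is not None:
        return convert(point), convert(point)
    start, end = loc.get("start_" + unit), loc.get("end_" + unit)
    if start is not None and end is not None:
        return convert(start), convert(end)
    return None


def parse_bookmarks(xml_text):
    """Return one dict per location element."""
    bookmarks = []
    for loc in ET.fromstring(xml_text).iter("location"):
        for unit, convert in (("time", _seconds_to_ns), ("cycle", int)):
            span = _interval(loc, unit, convert)
            if span is not None:
                break
        else:
            raise ValueError("location without a time or cycle interval")
        cpu = loc.get("cpu")  # optional attribute
        bookmarks.append({
            "name": loc.get("name"),
            "cpu": None if cpu is None else int(cpu),
            "unit": unit,  # "time" (nanoseconds) or "cycle"
            "start": span[0],
            "end": span[1],
            "text": (loc.text or "").strip(),
        })
    return bookmarks


if __name__ == "__main__":
    for bookmark in parse_bookmarks(SAMPLE):
        print(bookmark)
```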
| 371 | |
| 372 | </BODY> |
| 373 | </HTML> |
| 374 | |
| 375 | |
| 376 | |
| 377 | |