| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
| 2 | <html> |
| 3 | <head> |
| 4 | <title>The LTTng trace format</title> |
| 5 | </head> |
| 6 | <body> |
| 7 | |
| 8 | <h1>The LTTng trace format</h1> |
| 9 | |
| 10 | <P> |
| 11 | This document describes the LTTng trace format. It is intended only for |
| 12 | developers working on the LTTng tracer or the traceread LTTV library; other users |
| 13 | should rely on that library, which offers all the necessary abstractions on top of the raw trace data. |
| 14 | |
| 15 | <P> |
| 16 | A trace is contained in a directory tree. To send a trace remotely, |
| 17 | the directory tree may be tar-gzipped. Trace foo, placed in the home |
| 18 | directory of user john, /home/john, would have the following content: |
| 19 | |
| 20 | <PRE><TT> |
| 21 | $ cd /home/john |
| 22 | $ tree foo |
| 23 | foo/ |
| 24 | |-- eventdefs |
| 25 | | |-- core.xml |
| 26 | | |-- fs.xml |
| 27 | | |-- ipc.xml |
| 28 | | |-- kernel.xml |
| 29 | | |-- memory.xml |
| 30 | | |-- network.xml |
| 31 | | |-- process.xml |
| 32 | | |-- s390_kernel.xml |
| 33 | | |-- socket.xml |
| 34 | | |-- timer.xml |
| 35 | | `-- ... |
| 36 | |-- info |
| 37 | | |-- bookmarks.xml |
| 38 | | `-- system.xml |
| 39 | |-- control |
| 40 | | |-- facilities_0 |
| 41 | | |-- facilities_1 |
| 42 | | |-- facilities_... |
| 43 | | |-- interrupts_0 |
| 44 | | |-- interrupts_1 |
| 45 | | |-- interrupts_... |
| 46 | | |-- modules_0 |
| 47 | | |-- modules_1 |
| 48 | | |-- modules_... |
| 49 | | |-- processes_0 |
| 50 | | |-- processes_1 |
| 51 | | `-- processes_... |
| 52 | |-- cpu_0 |
| 53 | |-- cpu_1 |
| 54 | `-- cpu_... |
| 55 | |
| 56 | </TT></PRE> |
| 57 | |
| 58 | <P> |
| 59 | The eventdefs directory contains the event descriptions for all the |
| 60 | facilities used. The syntax is a simple subset of XML; XML is widely |
| 61 | known and easily parsed or hand edited. Each file contains one or more |
| 62 | <FACILITY NAME=name>...</FACILITY> elements. Indeed, several |
| 63 | facilities may have the same name but different content (and thus will |
| 64 | generate a different checksum). It typically happens when, while tracing |
| 65 | is enabled, a module using the named facility is unloaded, modified |
| 66 | (along with the description of some events), recompiled and reloaded. |
| 67 | Then, the trace will contain events from two different, similarly named, |
| 68 | facility versions. |
| 69 | |
| 70 | <P> |
| 71 | A small number of events are predefined, part of the "core" facility, |
| 72 | and are not present there. These "core" events include "facility_load", |
| 73 | "facility_unload", "time_heartbeat" and "state_dump_facility_load". |
| 74 | |
| 75 | <P> |
| 76 | The root directory contains a tracefile for each cpu, numbered from 0, |
| 77 | in .trace format. A uniprocessor trace thus only contains the file cpu_0. |
| 78 | A multi-processor with some unused (possibly hotplug) CPU slots may have some |
| 79 | unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs randomly |
| 80 | installed may produce tracefiles named cpu_0, cpu_1, cpu_2, cpu_4, cpu_6 and cpu_7. |
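<P>
Because CPU numbers may have gaps, a reader should discover the per-cpu
tracefiles by name rather than count from 0. The following is a minimal sketch
of that idea; the helper name cpu_tracefiles is hypothetical and is not part of
the traceread library.

```python
import os
import re

# Per-cpu tracefiles are named cpu_<number> in the trace root directory.
CPU_FILE = re.compile(r"cpu_(\d+)")

def cpu_tracefiles(names):
    """Map CPU number -> entry name for every cpu_N name, ordered by CPU number."""
    found = {}
    for name in names:
        match = CPU_FILE.fullmatch(name)
        if match:
            found[int(match.group(1))] = name
    return dict(sorted(found.items()))

def list_cpu_tracefiles(trace_root):
    """Same as cpu_tracefiles, but reads the names from a trace directory."""
    return cpu_tracefiles(os.listdir(trace_root))
```

Calling list_cpu_tracefiles("/home/john/foo") on the example trace would return
one entry per cpu_N file and ignore eventdefs, info and control.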
| 81 | |
| 82 | <P> |
| 83 | The files in the control directory also follow the .trace format and are also |
| 84 | per cpu. |
| 85 | The "facilities" file only contains "core" facility_load, facility_unload, |
| 86 | time_heartbeat and state_dump_facility_load events |
| 87 | and is used to determine the facilities used and the code range assigned |
| 88 | to each facility. The other control files contain the initial system |
| 89 | state and various subsequent important events, for example process |
| 90 | creation and exit. The advantage of placing such subsequent events |
| 91 | in control trace files instead of (or in addition to) in the per cpu |
| 92 | trace files is that they may be accessed more quickly and conveniently |
| 93 | and that they may be kept even when the per cpu files are overwritten |
| 94 | in "flight recorder mode". |
| 95 | |
| 96 | <P> |
| 97 | The info directory contains in system.xml a description of the system on which |
| 98 | the trace was created as well as different user annotations in bookmarks.xml. |
| 99 | This directory may also contain various information about the trace, generated |
| 100 | during trace analysis (statistics, index...). |
| 101 | |
| 102 | |
| 103 | <H2>Trace format</H2> |
| 104 | |
| 105 | <P> |
| 106 | Each tracefile is divided into equal size blocks with a header at the beginning |
| 107 | of the block. Events are packed sequentially in the block starting right after |
| 108 | the block header. |
| 109 | <P> |
| 110 | Each block consists of : |
| 111 | <PRE><TT> |
| 112 | block start/end header |
| 113 | trace header |
| 114 | event 1 header |
| 115 | event 1 variable length data |
| 116 | event 2 header |
| 117 | event 2 variable length data |
| 118 | .... |
| 119 | padding |
| 120 | </TT></PRE> |
| 121 | |
| 122 | <P> |
| 123 | The block start/end header |
| 124 | |
| 125 | <PRE><TT> |
| 126 | begin |
| 127 | * the beginning of buffer information |
| 128 | uint64 cycle_count |
| 129 | * TSC at the beginning of the buffer |
| 130 | uint64 freq |
| 131 | * frequency of the CPUs at the beginning of the buffer. |
| 132 | end |
| 133 | * the end of buffer information |
| 134 | uint64 cycle_count |
| 135 | * TSC at the end of the buffer |
| 136 | uint64 freq |
| 137 | * frequency of the CPUs at the end of the buffer. |
| 138 | uint32 lost_size |
| 139 | * number of bytes of padding at the end of the buffer. |
| 140 | uint32 buf_size |
| 141 | * size of the sub-buffer. |
| 142 | </TT></PRE> |
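<P>
As an illustration, the block start/end header can be decoded with a fixed
struct layout. This is only a sketch: it assumes the six fields are stored
back to back without padding (40 bytes) and that the byte order is already
known; parse_block_header is a hypothetical helper name.

```python
import struct

# begin.cycle_count, begin.freq, end.cycle_count, end.freq, lost_size, buf_size
BLOCK_HEADER_FIELDS = "QQQQII"

def parse_block_header(data, byte_order="<"):
    """Decode the block start/end header; return (fields, header_size).

    Assumption: fields are packed with no padding between them.
    byte_order is "<" for little-endian or ">" for big-endian.
    """
    fmt = byte_order + BLOCK_HEADER_FIELDS
    size = struct.calcsize(fmt)
    values = struct.unpack_from(fmt, data, 0)
    names = ("begin_cycle_count", "begin_freq",
             "end_cycle_count", "end_freq",
             "lost_size", "buf_size")
    return dict(zip(names, values)), size
```

The returned header_size is the offset at which the trace header would start
under the same packing assumption.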
| 143 | |
| 144 | |
| 145 | |
| 146 | <P> |
| 147 | The trace header |
| 148 | |
| 149 | <PRE><TT> |
| 150 | uint32 magic_number |
| 151 | * 0x00D6B7ED, used to check the trace byte order vs host byte order. |
| 152 | uint32 arch_type |
| 153 | * Architecture type of the traced machine. |
| 154 | uint32 arch_variant |
| 155 | * Architecture variant of the traced machine. May be unused on some arch. |
| 156 | uint32 float_word_order |
| 157 | * Byte order of floats and doubles, sometimes different from integer byte |
| 158 | order. Useful only for user space traces. |
| 159 | uint8 arch_size |
| 160 | * Size (in bytes) of the void * on the traced machine. |
| 161 | uint8 major_version |
| 162 | * major version of the trace. |
| 163 | uint8 minor_version |
| 164 | * minor version of the trace. |
| 165 | uint8 flight_recorder |
| 166 | * Is flight recorder mode activated ? If yes, data might be missing |
| 167 | (overwritten) in the trace. |
| 168 | uint8 has_heartbeat |
| 169 | * Does this trace have heartbeat timer event activated ? |
| 170 | Yes (1) -> Event header has 32 bits TSC |
| 171 | No (0) -> Event header has 64 bits TSC |
| 172 | uint8 has_alignment |
| 173 | * Is the information in this trace aligned ? |
| 174 | Yes (1) -> aligned on min(arch size, atomic data size). |
| 175 | No (0) -> data is packed. |
| 176 | uint32 freq_scale |
| 177 | * event time is always calculated from : |
| 178 | trace_start_time + ((event_tsc - trace_start_tsc) * (freq / freq_scale)) |
| 179 | uint64 start_freq |
| 180 | * CPUs clock frequency at the beginning of the trace. |
| 181 | uint64 start_tsc |
| 182 | * TSC at the beginning of the trace. |
| 183 | uint64 start_monotonic |
| 184 | * monotonically increasing time at the beginning of the trace. |
| 185 | (currently not supported) |
| 186 | start_time |
| 187 | * Real time at the beginning of the trace (as given by date, adjusted by NTP) |
| 188 | This is the only time reference with the real world : the rest of the trace |
| 189 | has monotonically increasing time from this point (with TSC difference and |
| 190 | clock frequency). |
| 191 | uint32 seconds |
| 192 | uint32 nanoseconds |
| 193 | </TT></PRE> |
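<P>
The magic number lets a reader work out the byte order of the trace before
decoding any other field. A minimal sketch, assuming the caller has already
located the four bytes of the magic_number field; detect_byte_order is a
hypothetical helper name.

```python
import struct

LTT_MAGIC = 0x00D6B7ED

def detect_byte_order(magic_bytes):
    """Return "<" (little-endian) or ">" (big-endian) for the trace.

    magic_bytes holds the 4 raw bytes of the magic_number field.
    """
    for prefix in ("<", ">"):
        (value,) = struct.unpack(prefix + "I", magic_bytes)
        if value == LTT_MAGIC:
            return prefix
    raise ValueError("not an LTTng trace header: bad magic number")
```

The returned prefix can be passed directly to the struct module when decoding
the remaining header fields.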
| 194 | |
| 195 | |
| 196 | <P> |
| 197 | Event header |
| 198 | |
| 199 | <P> |
| 200 | Event headers differ depending on two conditions: does the traced system have |
| 201 | a heartbeat timer? Is trace alignment activated? |
| 202 | |
| 203 | <P> |
| 204 | Event header : |
| 205 | <PRE><TT> |
| 206 | { uint32 timestamp |
| 207 | or |
| 208 | uint64 timestamp } |
| 209 | * if has_heartbeat : 32 LSB of the cycle counter at the event record time. |
| 210 | * else : 64 bits complete cycle counter. |
| 211 | uint8 facility_id |
| 212 | * Numerical ID of the facility corresponding to the event. See the facility |
| 213 | tracefile to know which facility ID matches which facility name and |
| 214 | description. |
| 215 | uint8 event_id |
| 216 | * Numerical ID of the event inside the facility. |
| 217 | uint16 event_size |
| 218 | * Size of the variable length data that follows this header. |
| 219 | </TT></PRE> |
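<P>
The two header variants can be decoded as sketched below. The sketch assumes a
packed trace (has_alignment = 0) and a known byte order; parse_event_header is
a hypothetical helper name.

```python
import struct

def parse_event_header(data, offset, has_heartbeat, byte_order="<"):
    """Decode one event header at offset; return (header, payload_offset).

    Assumption: the trace is packed (has_alignment == 0), so the fields
    follow each other without padding.
    """
    # 32-bit timestamp when the heartbeat is active, 64-bit otherwise.
    timestamp_code = "I" if has_heartbeat else "Q"
    fmt = byte_order + timestamp_code + "BBH"
    timestamp, facility_id, event_id, event_size = struct.unpack_from(
        fmt, data, offset)
    header = {
        "timestamp": timestamp,
        "facility_id": facility_id,
        "event_id": event_id,
        "event_size": event_size,
    }
    return header, offset + struct.calcsize(fmt)
```

Under these assumptions the next event header starts at
payload_offset + event_size.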
| 220 | |
| 221 | <P> |
| 222 | Event header alignment |
| 223 | |
| 224 | <P> |
| 225 | If trace alignment is activated (has_alignment), the event header is aligned |
| 226 | on the architecture size (void pointer size). In addition, padding is |
| 227 | added after the event header so that the variable length data is |
| 228 | also aligned on the architecture size. |
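<P>
The alignment rule amounts to rounding offsets up to a multiple of the
architecture size. The following sketch shows the arithmetic only; align_up
and aligned_event_offsets are hypothetical helper names, and header_size is
the unpadded header size (8 bytes with a 32-bit timestamp, 12 bytes with a
64-bit one).

```python
def align_up(offset, arch_size):
    """Round offset up to the next multiple of arch_size."""
    return (offset + arch_size - 1) // arch_size * arch_size

def aligned_event_offsets(offset, header_size, arch_size):
    """Return (header_offset, payload_offset) for an aligned trace."""
    header_offset = align_up(offset, arch_size)
    payload_offset = align_up(header_offset + header_size, arch_size)
    return header_offset, payload_offset
```

For example, with an 8-byte architecture size and a 12-byte header, an event
that would otherwise start at offset 13 has its header at 16 and its data
at 32.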
| 229 | |
| 230 | <P> |
| 231 | |
| 232 | <H2>System description</H2> |
| 233 | |
| 234 | <P> |
| 235 | The system type description, in system.xml, looks like: |
| 236 | |
| 237 | <PRE><TT> |
| 238 | <system |
| 239 | node_name="vaucluse" |
| 240 | domainname="polymtl.ca" |
| 241 | cpu="4" |
| 242 | arch_size="ILP32" |
| 243 | endian="little" |
| 244 | kernel_name="Linux" |
| 245 | kernel_release="2.4.18-686-smp" |
| 246 | kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002" |
| 247 | machine="i686" |
| 248 | processor="unknown" |
| 249 | hardware_platform="unknown" |
| 250 | operating_system="Linux" |
| 251 | ltt_major_version="2" |
| 252 | ltt_minor_version="0" |
| 253 | ltt_block_size="100000" |
| 254 | > |
| 255 | Some comments about the system |
| 256 | </system> |
| 257 | </TT></PRE> |
| 258 | |
| 259 | <P> |
| 260 | The system attributes kernel_name, node_name, kernel_release, |
| 261 | kernel_version, machine, processor, hardware_platform and operating_system |
| 262 | come from the uname(1) program. The domainname attribute is obtained from |
| 263 | the "hostname --domain" command. The arch_size attribute is one of |
| 264 | LP32, ILP32, LP64 or ILP64 and specifies the length in bits of integers (I), |
| 265 | long (L) and pointers (P). The endian attribute is "little" or "big". |
| 266 | While the arch_size and endian attributes could be deduced from the platform |
| 267 | type, having these explicit allows analysing traces from yet unknown |
| 268 | platforms. The cpu attribute specifies the maximum number of processors in |
| 269 | the system; only tracefiles cpu_0 to cpu_(maximum - 1) may exist in the trace |
| 270 | root directory. |
| 271 | |
| 272 | <P> |
| 273 | Within the system element, the text enclosed may describe further the |
| 274 | system traced. |
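<P>
A reader can extract the attributes it needs from system.xml with a standard
XML parser. The sketch below uses an abbreviated sample and assumes every
attribute value is quoted, as a conforming XML parser requires;
read_system_description is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; all attribute values are quoted.
SAMPLE = """<system
 node_name="vaucluse"
 domainname="polymtl.ca"
 cpu="4"
 arch_size="ILP32"
 endian="little"
 ltt_block_size="100000"
>
Some comments about the system
</system>"""

def read_system_description(xml_text):
    """Return the attributes a trace reader typically needs."""
    root = ET.fromstring(xml_text)
    return {
        "node_name": root.get("node_name"),
        "cpu": int(root.get("cpu")),
        "arch_size": root.get("arch_size"),
        # struct-style prefix: "<" little-endian, ">" big-endian
        "byte_order": "<" if root.get("endian") == "little" else ">",
        "block_size": int(root.get("ltt_block_size")),
        "comment": (root.text or "").strip(),
    }
```

The free text enclosed in the system element is returned as the comment.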
| 275 | |
| 276 | |
| 277 | <H2>Event type descriptions</H2> |
| 278 | |
| 279 | <P> |
| 280 | A facility contains the descriptions of several event types. When a structure |
| 281 | is reused in several event types, a named type is defined and may be referenced |
| 282 | by several other event types or named types. |
| 283 | |
| 284 | <PRE><TT> |
| 285 | <facility name=facility_name> |
| 286 | <description>Some text</description> |
| 287 | <event name=eventtype_name> |
| 288 | <description>Some text</description> |
| 289 | --type structure-- |
| 290 | </event> |
| 291 | ... |
| 292 | <type name=type_name> |
| 293 | --type structure-- |
| 294 | </type> |
| 295 | </facility> |
| 296 | </TT></PRE> |
| 297 | |
| 298 | <P> |
| 299 | The type structure may be one of the following primitive type elements. |
| 300 | Whenever the keyword isize is used, the allowed values are |
| 301 | short, medium, long, 1, 2, 4, 8, indicating the size in bytes. |
| 302 | The fsize keyword represents one of medium, long, 4 and 8 bytes. |
| 303 | |
| 304 | <PRE><TT> |
| 305 | <int size=isize format="printf format"/> |
| 306 | |
| 307 | <uint size=isize format="printf format"/> |
| 308 | |
| 309 | <float size=fsize format="printf format"/> |
| 310 | |
| 311 | <string format="printf format"/> |
| 312 | |
| 313 | <enum size=isize format="printf format">label1 label2 ...</enum> |
| 314 | </TT></PRE> |
| 315 | |
| 316 | <P> |
| 317 | The string is null terminated. For the enumeration, the size of the integer |
| 318 | used for its representation is specified. |
| 319 | |
| 320 | <P> |
| 321 | The type structure may also be a compound type. |
| 322 | |
| 323 | <PRE><TT> |
| 324 | <array size=n> --type structure-- </array> |
| 325 | |
| 326 | <sequence lengthsize=isize> --type structure-- </sequence> |
| 327 | |
| 328 | <struct> |
| 329 | <field name=field_name> |
| 330 | <description>Some text</description> |
| 331 | --type structure-- |
| 332 | </field> |
| 333 | ... |
| 334 | </struct> |
| 335 | |
| 336 | <union typecodesize=isize> |
| 337 | <field name=field_name> |
| 338 | <description>Some text</description> |
| 339 | --type structure-- |
| 340 | </field> |
| 341 | ... |
| 342 | </union> |
| 343 | </TT></PRE> |
| 344 | |
| 345 | <P> |
| 346 | Array is a fixed size array of length size. Sequence is a variable size |
| 347 | array with its length stored as a prepended uint of length lengthsize. |
| 348 | A structure is simply an aggregation of fields. A union is one of its n |
| 349 | fields (variant record), as indicated by a preceding code (0 to n - 1) |
| 350 | of the specified size typecodesize. |
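<P>
As an example of the compound types, a sequence of unsigned integers can be
decoded as sketched below. The sketch assumes packed data, numeric sizes
(1, 2, 4 or 8) and that the prepended length counts elements rather than
bytes; decode_uint_sequence is a hypothetical helper name.

```python
import struct

# struct codes for unsigned integers of 1, 2, 4 and 8 bytes
UINT_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}

def decode_uint_sequence(data, offset, lengthsize, element_size,
                         byte_order="<"):
    """Decode a sequence of uints; return (values, next_offset).

    Assumptions: data is packed, and the prepended length is the
    number of elements.
    """
    (count,) = struct.unpack_from(
        byte_order + UINT_CODES[lengthsize], data, offset)
    offset += lengthsize
    fmt = f"{byte_order}{count}{UINT_CODES[element_size]}"
    values = struct.unpack_from(fmt, data, offset)
    return list(values), offset + count * element_size
```

An array differs only in that its length comes from the size attribute of the
type description instead of a prepended field.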
| 351 | |
| 352 | <P> |
| 353 | Finally the type structure may be defined by referencing a named type. |
| 354 | |
| 355 | <PRE><TT> |
| 356 | <typeref name=type_name/> |
| 357 | </TT></PRE> |
| 358 | |
| 359 | <H2>Core events</H2> |
| 360 | |
| 361 | <P> |
| 362 | The facility named "core" is always present and contains at least the |
| 363 | following event types. |
| 364 | |
| 365 | <PRE><TT> |
| 366 | <event name=facility_load> |
| 367 | <description>Facility used in the trace</description> |
| 368 | <struct> |
| 369 | <field name="name"><string/></field> |
| 370 | <field name="checksum"><uint size=4/></field> |
| 371 | <field name="id"><uint size=4/></field> |
| 372 | <field name="int_size"><uint size=4/></field> |
| 373 | <field name="long_size"><uint size=4/></field> |
| 374 | <field name="pointer_size"><uint size=4/></field> |
| 375 | <field name="size_t_size"><uint size=4/></field> |
| 376 | <field name="has_alignment"><uint size=4/></field> |
| 377 | </struct> |
| 378 | </event> |
| 379 | |
| 380 | <event name=state_dump_facility_load> |
| 381 | <description>Facility used in the trace</description> |
| 382 | <struct> |
| 383 | <field name="name"><string/></field> |
| 384 | <field name="checksum"><uint size=4/></field> |
| 385 | <field name="id"><uint size=4/></field> |
| 386 | <field name="int_size"><uint size=4/></field> |
| 387 | <field name="long_size"><uint size=4/></field> |
| 388 | <field name="pointer_size"><uint size=4/></field> |
| 389 | <field name="size_t_size"><uint size=4/></field> |
| 390 | <field name="has_alignment"><uint size=4/></field> |
| 391 | </struct> |
| 392 | </event> |
| 393 | |
| 394 | <event name=time_heartbeat> |
| 395 | <description>System time values sent periodically to minimize cycle counter |
| 396 | drift with respect to real time clock and to detect cycle counter |
| 397 | rollovers |
| 398 | </description> |
| 399 | <typeref name=timestamp/> |
| 400 | </event> |
| 401 | |
| 402 | <type name=timestamp> |
| 403 | <struct> |
| 404 | <field name="seconds"><uint size=4/></field> |
| 405 | <field name="nanoseconds"><uint size=4/></field> |
| 406 | <field name="cycle_count"><uint size=8/></field> |
| 407 | </struct> |
| 408 | </type> |
| 409 | |
| 410 | </TT></PRE> |
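<P>
Following the facility_load description above, its variable length data is a
null terminated string followed by seven 4-byte unsigned integers. A sketch
of the decoding, assuming a packed trace and an ASCII facility name;
decode_facility_load is a hypothetical helper name.

```python
import struct

FACILITY_LOAD_UINTS = ("checksum", "id", "int_size", "long_size",
                       "pointer_size", "size_t_size", "has_alignment")

def decode_facility_load(payload, byte_order="<"):
    """Decode the data of a facility_load event.

    Assumptions: the trace is packed (no padding after the string)
    and the facility name is ASCII.
    """
    end = payload.index(b"\x00")          # null terminator of the name
    name = payload[:end].decode("ascii")
    values = struct.unpack_from(byte_order + "7I", payload, end + 1)
    event = {"name": name}
    event.update(zip(FACILITY_LOAD_UINTS, values))
    return event
```

The same layout applies to state_dump_facility_load, whose structure is
identical.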
| 411 | |
| 412 | <H2>Control files</H2> |
| 413 | |
| 414 | <P> |
| 415 | The interrupts file reflects the content of the /proc/interrupts system file. |
| 416 | It contains one event describing each interrupt. At trace start, events are |
| 417 | generated describing all the current interrupts. If the assignment of |
| 418 | interrupts changes later, due to devices or device drivers being activated or |
| 419 | deactivated, additional events may be added to the file. Each interrupt |
| 420 | event has the following structure. |
| 421 | |
| 422 | <PRE><TT> |
| 423 | <event name=interrupt> |
| 424 | <description>Interrupt request number assignment</description> |
| 425 | <struct> |
| 426 | <field name="number"><uint size=4/></field> |
| 427 | <field name="count"><uint size=4/></field> |
| 428 | <field name="controller"><string/></field> |
| 429 | <field name="name"><string/></field> |
| 430 | </struct> |
| 431 | </event> |
| 432 | </TT></PRE> |
| 433 | |
| 434 | <P> |
| 435 | The processes file contains the list of processes already created when the |
| 436 | trace starts. Each process describing event is modeled after the |
| 437 | /proc/self/status system file. The number of fields in this event is |
| 438 | expected to be expanded in the future to include groups, signal masks, |
| 439 | opened file descriptors and address maps. |
| 440 | |
| 441 | <PRE><TT> |
| 442 | <event name=process> |
| 443 | <description>Existing process</description> |
| 444 | <struct> |
| 445 | <field name="name"><string/></field> |
| 446 | <field name="pid"><uint size=4/></field> |
| 447 | <field name="ppid"><uint size=4/></field> |
| 448 | <field name="tracer_pid"><uint size=4/></field> |
| 449 | <field name="uid"><uint size=4/></field> |
| 450 | <field name="euid"><uint size=4/></field> |
| 451 | <field name="suid"><uint size=4/></field> |
| 452 | <field name="fsuid"><uint size=4/></field> |
| 453 | <field name="gid"><uint size=4/></field> |
| 454 | <field name="egid"><uint size=4/></field> |
| 455 | <field name="sgid"><uint size=4/></field> |
| 456 | <field name="fsgid"><uint size=4/></field> |
| 457 | <field name="state"><enum size=4> |
| 458 | Running WaitInterruptible WaitUninterruptible Zombie Traced Paging |
| 459 | </enum></field> |
| 460 | </struct> |
| 461 | </event> |
| 462 | </TT></PRE> |
| 463 | |
| 464 | <H2>Facilities</H2> |
| 465 | |
| 466 | <P> |
| 467 | Facilities define the granularity at which events are grouped for filtering, |
| 468 | activation and compilation. Each facility costs a table entry in the kernel (name, |
| 469 | checksum, event type code range), that is, somewhere between 20 and 30 bytes. Having |
| 470 | one facility per tracing statement in the kernel would be too much (assuming |
| 471 | that they eventually are routinely inserted in the kernel code and replace |
| 472 | the 80000+ printk statements in some proportion). However, having a few |
| 473 | facilities, up to a few tens, would make sense. |
| 474 | |
| 475 | <P> |
| 476 | The "builtin" facility contains a small number of predefined events which must |
| 477 | always exist. The "core" facility contains a small subset of OS events which |
| 478 | are almost always of interest (scheduling, interrupts, faults, system calls). |
| 479 | Then, specialized facilities may exist for each subsystem (network, disks, |
| 480 | USB, SCSI...). |
| 481 | |
| 482 | |
| 483 | <H2>Bookmarks</H2> |
| 484 | |
| 485 | <P> |
| 486 | Bookmarks are user supplied information added to a trace. They contain user |
| 487 | annotations attached to a time interval. |
| 488 | |
| 489 | <PRE><TT> |
| 490 | <bookmarks> |
| 491 | <location name=name cpu=n start_time=t end_time=t>Some text</location> |
| 492 | ... |
| 493 | </bookmarks> |
| 494 | </TT></PRE> |
| 495 | |
| 496 | <P> |
| 497 | The interval is defined using either "time=" or "start_time=" and |
| 498 | "end_time=", or "cycle=" or "start_cycle=" and "end_cycle=". |
| 499 | The time is in seconds with decimals up to nanoseconds and cycle counts |
| 500 | are unsigned integers with a 64-bit range. The cpu attribute is optional. |
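<P>
A bookmarks file with time intervals could be generated as sketched below.
The helper names and the dictionary layout of each location are hypothetical,
and the serializer quotes attribute values, unlike the unquoted template
above.

```python
import xml.etree.ElementTree as ET

def format_time(seconds, nanoseconds):
    """Seconds with nine decimals, i.e. nanosecond resolution."""
    return f"{seconds}.{nanoseconds:09d}"

def build_bookmarks(locations):
    """Serialize bookmark locations to a <bookmarks> document.

    Each location is a dict with "name", "start" and "end"
    ((seconds, nanoseconds) pairs), and optional "cpu" and "text".
    """
    root = ET.Element("bookmarks")
    for loc in locations:
        attrs = {
            "name": loc["name"],
            "start_time": format_time(*loc["start"]),
            "end_time": format_time(*loc["end"]),
        }
        if "cpu" in loc:                  # the cpu attribute is optional
            attrs["cpu"] = str(loc["cpu"])
        element = ET.SubElement(root, "location", attrs)
        element.text = loc.get("text", "")
    return ET.tostring(root, encoding="unicode")
```

Cycle-based intervals would use start_cycle and end_cycle attributes with
plain unsigned integers instead.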
| 501 | |
| 502 | </BODY> |
| 503 | </HTML> |
| 504 | |
| 505 | |
| 506 | |
| 507 | |