1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
2 | <html> |
3 | <head> |
4 | <title>The LTTng trace format</title> |
5 | </head> |
6 | <body> |
7 | |
8 | <h1>The LTTng trace format</h1> |
9 | |
10 | <P> |
11 | This document describes the LTTng trace format. It is intended only for |
12 | developers working on the LTTng tracer or on the traceread LTTV library; that |
13 | library offers all the abstractions other users need on top of the raw trace data. |
14 | |
15 | <P> |
16 | A trace is contained in a directory tree. To send a trace remotely, |
17 | the directory tree may be tar-gzipped. Trace foo, placed in the home |
18 | directory of user john, /home/john, would have the following content: |
19 | |
20 | <PRE><TT> |
21 | $ cd /home/john |
22 | $ tree foo |
23 | foo/ |
24 | |-- eventdefs |
25 | | |-- core.xml |
26 | | |-- fs.xml |
27 | | |-- ipc.xml |
28 | | |-- kernel.xml |
29 | | |-- memory.xml |
30 | | |-- network.xml |
31 | | |-- process.xml |
32 | | |-- s390_kernel.xml |
33 | | |-- socket.xml |
34 | | |-- timer.xml |
35 | | `-- ... |
36 | |-- info |
37 | | |-- bookmarks.xml |
38 | | `-- system.xml |
39 | |-- control |
40 | | |-- facilities_0 |
41 | | |-- facilities_1 |
42 | | |-- facilities_... |
43 | | |-- interrupts_0 |
44 | | |-- interrupts_1 |
45 | | |-- interrupts_... |
46 | | |-- modules_0 |
47 | | |-- modules_1 |
48 | | |-- modules_... |
49 | | `-- processes_0 |
50 | | `-- processes_1 |
51 | | `-- processes_... |
52 | |-- cpu_0 |
53 | |-- cpu_1 |
54 | `-- cpu_... |
55 | |
56 | </TT></PRE> |
57 | |
58 | <P> |
59 | The eventdefs directory contains the event descriptions for all the |
60 | facilities used. The syntax is a simple subset of XML; XML is widely |
61 | known and easily parsed or hand edited. Each file contains one or more |
62 | <FACILITY NAME=name>...</FACILITY> elements. Indeed, several |
63 | facilities may have the same name but different content (and thus will |
64 | generate a different checksum). It typically happens when, while tracing |
65 | is enabled, a module using the named facility is unloaded, modified |
66 | (along with the description of some events), recompiled and reloaded. |
67 | Then, the trace will contain events from two different, similarly named, |
68 | facility versions. |
69 | |
70 | <P> |
71 | A small number of events are predefined, part of the "core" facility, |
72 | and are not present there. These "core" events include "facility_load", |
73 | "facility_unload", "time_heartbeat" and "state_dump_facility_load". |
74 | |
75 | <P> |
76 | The root directory contains a tracefile for each cpu, numbered from 0, |
77 | in .trace format. A uniprocessor trace thus contains only the file cpu_0. |
78 | A multiprocessor with some unused (possibly hotplug) CPU slots may have some |
79 | unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs randomly |
80 | installed may produce tracefiles named cpu_0, cpu_1, cpu_2, cpu_4, cpu_6 and cpu_7. |
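<P>
The following minimal sketch shows one way a reader could discover which
per-CPU tracefiles are present without assuming that the CPU numbers are
contiguous. The function name cpu_tracefiles is hypothetical and is not part
of the tracer or of LTTV; the only thing taken from this document is the
cpu_N naming of the files in the trace root directory.

```python
import re
from pathlib import Path

# Per-CPU tracefiles are named cpu_0, cpu_1, ... in the trace root directory.
CPU_FILE = re.compile(r"^cpu_(\d+)$")


def cpu_tracefiles(trace_dir):
    """Return {cpu_number: path} for every cpu_N file, ordered by CPU number.

    Gaps in the numbering are expected (unused or hotplug CPU slots), so the
    directory is scanned instead of counting up from 0.
    """
    found = {}
    for entry in Path(trace_dir).iterdir():
        match = CPU_FILE.match(entry.name)
        if match and entry.is_file():
            found[int(match.group(1))] = entry
    return dict(sorted(found.items()))
```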
81 | |
82 | <P> |
83 | The files in the control directory also follow the .trace format and are also |
84 | per cpu. |
85 | The "facilities" file only contains "core" facility_load, facility_unload, |
86 | time_heartbeat and state_dump_facility_load events |
87 | and is used to determine the facilities used and the code range assigned |
88 | to each facility. The other control files contain the initial system |
89 | state and various subsequent important events, for example process |
90 | creation and exit. The advantage of placing such subsequent events |
91 | in control trace files instead of (or in addition to) the per cpu |
92 | trace files is that they may be accessed more quickly/conveniently |
93 | and that they may be kept even when the per cpu files are overwritten |
94 | in "flight recorder mode". |
95 | |
96 | <P> |
97 | The info directory contains in system.xml a description of the system on which |
98 | the trace was created as well as different user annotations in bookmarks.xml. |
99 | This directory may also contain various information about the trace, generated |
100 | during trace analysis (statistics, index...). |
101 | |
102 | |
103 | <H2>Trace format</H2> |
104 | |
105 | <P> |
106 | Each tracefile is divided into equal size blocks with a header at the beginning |
107 | of the block. Events are packed sequentially in the block starting right after |
108 | the block header. |
109 | <P> |
110 | Each block consists of: |
111 | <PRE><TT> |
112 | block start/end header |
113 | trace header |
114 | event 1 header |
115 | event 1 variable length data |
116 | event 2 header |
117 | event 2 variable length data |
118 | .... |
119 | padding |
120 | </TT></PRE> |
121 | |
122 | <P> |
123 | The block start/end header |
124 | |
125 | <PRE><TT> |
126 | begin |
127 | * the beginning of buffer information |
128 | timestamp |
129 | * Used only when no TSC is available. |
130 | uint32 seconds |
131 | uint32 microseconds |
132 | uint64 cycle_count |
133 | * TSC at the beginning of the buffer |
134 | uint64 freq |
135 | * frequency of the CPUs at the beginning of the buffer. |
136 | end |
137 | * the end of buffer information |
138 | timestamp |
139 | * Used only when no TSC is available. |
140 | uint32 seconds |
141 | uint32 microseconds |
142 | uint64 cycle_count |
143 | * TSC at the end of the buffer |
144 | uint64 freq |
145 | * frequency of the CPUs at the end of the buffer. |
146 | uint32 lost_size |
147 | * number of bytes of padding at the end of the buffer. |
148 | uint32 buf_size |
149 | * size of the sub-buffer. |
150 | </TT></PRE> |
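<P>
As an illustration of the layout above, the sketch below unpacks a block
start/end header with Python's struct module. It assumes the fields are packed
with no padding between them (has_alignment == 0) and that the byte order is
already known; the names BlockHeader, parse_block_header and used_bytes are
hypothetical and only serve this example.

```python
import struct
from collections import namedtuple

# begin: seconds, microseconds, cycle_count, freq
# end:   seconds, microseconds, cycle_count, freq
# then lost_size and buf_size.
BLOCK_HEADER_FIELDS = "IIQQ" + "IIQQ" + "II"

BlockHeader = namedtuple(
    "BlockHeader",
    "begin_sec begin_usec begin_cycles begin_freq "
    "end_sec end_usec end_cycles end_freq lost_size buf_size",
)


def parse_block_header(data, byte_order="<"):
    """Unpack a packed block header; return (header, header_size_in_bytes).

    byte_order is a struct prefix: "<" for little endian, ">" for big endian.
    Both prefixes use standard sizes and insert no padding.
    """
    fmt = struct.Struct(byte_order + BLOCK_HEADER_FIELDS)
    return BlockHeader(*fmt.unpack_from(data, 0)), fmt.size


def used_bytes(header):
    """Bytes of the sub-buffer that are not trailing padding."""
    return header.buf_size - header.lost_size
```

<P>
Under these assumptions the packed header occupies 56 bytes.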
151 | |
152 | |
153 | |
154 | <P> |
155 | The trace header |
156 | |
157 | <PRE><TT> |
158 | uint32 magic_number |
159 | * 0x00D6B7ED, used to check the trace byte order vs host byte order. |
160 | uint32 arch_type |
161 | * Architecture type of the traced machine. |
162 | uint32 arch_variant |
163 | * Architecture variant of the traced machine. May be unused on some arch. |
164 | uint32 float_word_order |
165 | * Byte order of floats and doubles, sometimes different from integer byte |
166 | order. Useful only for user space traces. |
167 | uint8 arch_size |
168 | * Size (in bytes) of the void * on the traced machine. |
169 | uint8 major_version |
170 | * major version of the trace. |
171 | uint8 minor_version |
172 | * minor version of the trace. |
173 | uint8 flight_recorder |
174 | * Is flight recorder mode activated ? If yes, data might be missing |
175 | (overwritten) in the trace. |
176 | uint8 has_heartbeat |
177 | * Does this trace have heartbeat timer event activated ? |
178 | Yes (1) -> Event header has 32 bits TSC |
179 | No (0) -> Event header has 64 bits TSC |
180 | uint8 has_alignment |
181 | * Is the information in this trace aligned ? |
182 | Yes (1) -> aligned on min(arch size, atomic data size). |
183 | No (0) -> data is packed. |
184 | uint8 has_tsc |
185 | * Does the traced machine have a working TSC ? |
186 | Yes (1) -> event time is calculated from : |
187 | trace_start_time + ((event_tsc - trace_start_tsc) / freq) |
188 | No (0) -> event time is calculated from : |
189 | trace_start_time |
190 | + (buffer start timestamp - trace start_monotonic) |
191 | + (event_time_delta) |
192 | (not supported) |
193 | uint64 start_freq |
194 | * CPU clock frequency at the beginning of the trace. |
195 | uint64 start_tsc |
196 | * TSC at the beginning of the trace. |
197 | uint64 start_monotonic |
198 | * monotonically increasing time at the beginning of the trace. |
199 | (currently not supported) |
200 | start_time |
201 | * Real time at the beginning of the trace (as given by date, adjusted by NTP) |
202 | This is the only time reference with the real world : the rest of the trace |
203 | has monotonically increasing time from this point (with TSC difference and |
204 | clock frequency). |
205 | uint32 seconds |
206 | uint32 nanoseconds |
207 | </TT></PRE> |
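<P>
Two parts of the trace header lend themselves to a short sketch: using
magic_number to find the byte order of the trace, and turning an event TSC into
a real time when has_tsc is set. The code below is only an illustration. It
interprets the frequency as cycles per second, so the cycle difference is
divided by it, and it works in integer nanoseconds to match the start_time
field. The function names are hypothetical.

```python
import struct

MAGIC_NUMBER = 0x00D6B7ED
NS_PER_SECOND = 1_000_000_000


def detect_byte_order(first_four_bytes):
    """Return the struct prefix ("<" or ">") under which the magic matches."""
    for prefix in ("<", ">"):
        (value,) = struct.unpack(prefix + "I", first_four_bytes)
        if value == MAGIC_NUMBER:
            return prefix
    raise ValueError("bad magic number: not a trace header")


def event_time_ns(start_sec, start_nsec, start_tsc, freq_hz, event_tsc):
    """Real time of an event in nanoseconds, for a trace with a working TSC.

    Assumes freq_hz is in cycles per second and stays constant over the
    interval, which is a simplification.
    """
    start_ns = start_sec * NS_PER_SECOND + start_nsec
    elapsed_cycles = event_tsc - start_tsc
    return start_ns + elapsed_cycles * NS_PER_SECOND // freq_hz
```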
208 | |
209 | |
210 | <P> |
211 | Event header |
212 | |
213 | <P> |
214 | Event headers differ depending on two conditions: does the traced system have |
215 | a heartbeat timer, and is trace alignment activated? |
216 | |
217 | <P> |
218 | Event header : |
219 | <PRE><TT> |
220 | { uint32 timestamp |
221 | or |
222 | uint64 timestamp } |
223 | * if has_heartbeat : 32 LSB of the cycle counter at the event record time. |
224 | * else : 64 bits complete cycle counter. |
225 | * note : if there is no working TSC (has_tsc == 0), then this field contains |
226 | either the complete monotonically increasing time or the time delta from the |
227 | previous heartbeat event. (unsupported) |
228 | uint8 facility_id |
229 | * Numerical ID of the facility corresponding to the event. See the facility |
230 | tracefile to know which facility ID matches which facility name and |
231 | description. |
232 | uint8 event_id |
233 | * Numerical ID of the event inside the facility. |
234 | uint16 event_size |
235 | * Size of the variable length data that follows this header. |
236 | </TT></PRE> |
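<P>
A possible reading of the event header, sketched with Python's struct module.
It covers only the packed case (has_alignment == 0) with a working TSC, and
the names EventHeader and parse_event_header are hypothetical.

```python
import struct
from collections import namedtuple

EventHeader = namedtuple(
    "EventHeader", "timestamp facility_id event_id event_size"
)


def parse_event_header(data, offset=0, has_heartbeat=True, byte_order="<"):
    """Unpack one packed event header starting at offset.

    Returns (header, offset_of_variable_length_data). The timestamp is 32 bits
    wide when the trace has heartbeat events and 64 bits wide otherwise.
    """
    timestamp_code = "I" if has_heartbeat else "Q"
    fmt = struct.Struct(byte_order + timestamp_code + "BBH")
    header = EventHeader(*fmt.unpack_from(data, offset))
    return header, offset + fmt.size
```

<P>
The returned offset plus event_size gives the start of the next event header
in the packed case.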
237 | |
238 | <P> |
239 | Event header alignment |
240 | |
241 | <P> |
242 | If trace alignment is activated (has_alignment), the event header is aligned |
243 | on the architecture size (void pointer size). In addition, padding is |
244 | added after the event header so that the variable length data is |
245 | also aligned on the architecture size. |
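<P>
The alignment rule can be sketched as follows. This is an illustration of the
paragraph above, not the tracer's code: align_up and payload_offset are
hypothetical helpers, and arch_size is the value of the arch_size field of the
trace header (in bytes).

```python
def align_up(offset, arch_size):
    """Round offset up to the next multiple of arch_size."""
    return (offset + arch_size - 1) // arch_size * arch_size


def payload_offset(header_start, header_size, arch_size, has_alignment):
    """Offset of the variable length data that follows an event header.

    With alignment, the header itself first moves to an aligned offset, then
    padding is added after it so that the data is aligned as well.
    """
    if not has_alignment:
        return header_start + header_size
    aligned_start = align_up(header_start, arch_size)
    return align_up(aligned_start + header_size, arch_size)
```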
246 | |
247 | <P> |
248 | |
249 | <H2>System description</H2> |
250 | |
251 | <P> |
252 | The system type description, in system.xml, looks like: |
253 | |
254 | <PRE><TT> |
255 | <system |
256 | node_name="vaucluse" |
257 | domainname="polymtl.ca" |
258 | cpu="4" |
259 | arch_size="ILP32" |
260 | endian="little" |
261 | kernel_name="Linux" |
262 | kernel_release="2.4.18-686-smp" |
263 | kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002" |
264 | machine="i686" |
265 | processor="unknown" |
266 | hardware_platform="unknown" |
267 | operating_system="Linux" |
268 | ltt_major_version="2" |
269 | ltt_minor_version="0" |
270 | ltt_block_size="100000" |
271 | > |
272 | Some comments about the system |
273 | </system> |
274 | </TT></PRE> |
275 | |
276 | <P> |
277 | The system attributes kernel_name, node_name, kernel_release, |
278 | kernel_version, machine, processor, hardware_platform and operating_system |
279 | come from the uname(1) program. The domainname attribute is obtained from |
280 | the "hostname --domain" command. The arch_size attribute is one of |
281 | LP32, ILP32, LP64 or ILP64 and specifies the length in bits of integers (I), |
282 | long (L) and pointers (P). The endian attribute is "little" or "big". |
283 | While the arch_size and endian attributes could be deduced from the platform |
284 | type, having these explicit allows analysing traces from yet unknown |
285 | platforms. The cpu attribute specifies the maximum number of processors in |
286 | the system; only tracefiles numbered 0 to this maximum - 1 may exist in the |
287 | trace root directory. |
288 | |
289 | <P> |
290 | Within the system element, the text enclosed may describe further the |
291 | system traced. |
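<P>
As a sketch of how an analysis tool might read this description, the example
below parses a shortened system element with Python's xml.etree.ElementTree.
It assumes every attribute value is quoted, since a standard XML parser
requires that. The function name read_system_description and the type_bits key
are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<system node_name="vaucluse" domainname="polymtl.ca" cpu="4"
  arch_size="ILP32" endian="little" ltt_block_size="100000">
Some comments about the system
</system>"""


def read_system_description(xml_text):
    """Return the system attributes as a dict, with a few decoded extras."""
    root = ET.fromstring(xml_text)
    info = dict(root.attrib)
    info["cpu"] = int(info["cpu"])
    info["comment"] = (root.text or "").strip()
    # arch_size such as "ILP32": the leading letters name the C types
    # (I = int, L = long, P = pointer) that share the trailing bit width.
    model = info["arch_size"]
    letters = model.rstrip("0123456789")
    bits = int(model[len(letters):])
    info["type_bits"] = {letter: bits for letter in letters}
    return info
```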
292 | |
293 | |
294 | <H2>Event type descriptions</H2> |
295 | |
296 | <P> |
297 | A facility contains the descriptions of several event types. When a structure |
298 | is reused in several event types, a named type is defined and may be referenced |
299 | by several other event types or named types. |
300 | |
301 | <PRE><TT> |
302 | <facility name=facility_name> |
303 | <description>Some text</description> |
304 | <event name=eventtype_name> |
305 | <description>Some text</description> |
306 | --type structure-- |
307 | </event> |
308 | ... |
309 | <type name=type_name> |
310 | --type structure-- |
311 | </type> |
312 | </facility> |
313 | </TT></PRE> |
314 | |
315 | <P> |
316 | The type structure may be one of the following primitive type elements. |
317 | Whenever the keyword isize is used, the allowed values are |
318 | short, medium, long, 1, 2, 4, 8, indicating the size in bytes. |
319 | The fsize keyword represents one of medium, long, 4 and 8 bytes. |
320 | |
321 | <PRE><TT> |
322 | <int size=isize format="printf format"/> |
323 | |
324 | <uint size=isize format="printf format"/> |
325 | |
326 | <float size=fsize format="printf format"/> |
327 | |
328 | <string format="printf format"/> |
329 | |
330 | <enum size=isize format="printf format">label1 label2 ...</enum> |
331 | </TT></PRE> |
332 | |
333 | <P> |
334 | The string is null terminated. For the enumeration, the size of the integer |
335 | used for its representation is specified. |
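<P>
A minimal sketch of decoding two of the primitive types from raw event data.
It assumes packed data, a numeric size (1, 2, 4 or 8), and that enumeration
labels are numbered from 0 in the order they are listed, which this document
does not state explicitly. The helper names are hypothetical.

```python
import struct

UINT_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}


def read_string(data, offset=0):
    """Read a null terminated string; return (text, offset_after_the_null)."""
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("ascii", errors="replace"), end + 1


def read_enum(data, offset, size, labels, byte_order="<"):
    """Read an enum stored as an unsigned integer of the given size in bytes.

    Assumes label i corresponds to the stored value i.
    """
    (value,) = struct.unpack_from(byte_order + UINT_CODES[size], data, offset)
    return labels[value], offset + size
```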
336 | |
337 | <P> |
338 | The type structure may also be a compound type. |
339 | |
340 | <PRE><TT> |
341 | <array size=n> --type structure-- </array> |
342 | |
343 | <sequence lengthsize=isize> --type structure-- </sequence> |
344 | |
345 | <struct> |
346 | <field name=field_name> |
347 | <description>Some text</description> |
348 | --type structure-- |
349 | </field> |
350 | ... |
351 | </struct> |
352 | |
353 | <union typecodesize=isize> |
354 | <field name=field_name> |
355 | <description>Some text</description> |
356 | --type structure-- |
357 | </field> |
358 | ... |
359 | </union> |
360 | </TT></PRE> |
361 | |
362 | <P> |
363 | Array is a fixed size array of length size. Sequence is a variable size |
364 | array with its length stored as a prepended uint of length lengthsize. |
365 | A structure is simply an aggregation of fields. A union is one of its n |
366 | fields (variant record), as indicated by a preceding code (0 to n - 1) |
367 | of the specified size typecodesize. |
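<P>
The sequence and union encodings described above can be sketched as small
readers that take a per-element reader as a parameter. The code assumes packed
data and numeric sizes; read_uint, read_sequence and read_union are
hypothetical names.

```python
import struct

UINT_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}


def read_uint(data, offset, size, byte_order="<"):
    """Read an unsigned integer of size bytes; return (value, new_offset)."""
    (value,) = struct.unpack_from(byte_order + UINT_CODES[size], data, offset)
    return value, offset + size


def read_sequence(data, offset, lengthsize, read_element):
    """Read a length prefix of lengthsize bytes, then that many elements."""
    count, offset = read_uint(data, offset, lengthsize)
    items = []
    for _ in range(count):
        item, offset = read_element(data, offset)
        items.append(item)
    return items, offset


def read_union(data, offset, typecodesize, field_readers):
    """Read the type code (0 to n - 1), then the single field it selects.

    field_readers is a list of (field_name, reader) pairs in field order.
    """
    code, offset = read_uint(data, offset, typecodesize)
    name, reader = field_readers[code]
    value, offset = reader(data, offset)
    return (name, value), offset
```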
368 | |
369 | <P> |
370 | Finally the type structure may be defined by referencing a named type. |
371 | |
372 | <PRE><TT> |
373 | <typeref name=type_name/> |
374 | </TT></PRE> |
375 | |
376 | <H2>Builtin events</H2> |
377 | |
378 | <P> |
379 | The facility named "builtin" is always present and contains at least the |
380 | following event types. |
381 | |
382 | <PRE><TT> |
383 | <event name=facility_load> |
384 | <description>Facility used in the trace</description> |
385 | <struct> |
386 | <field name="name"><string/></field> |
387 | <field name="checksum"><uint size=4/></field> |
388 | <field name="base_code"><uint size=4/></field> |
389 | </struct> |
390 | </event> |
391 | |
392 | <event name=block_start> |
393 | <description>Block start timestamp</description> |
394 | <typeref name=block_timestamp/> |
395 | </event> |
396 | |
397 | <event name=block_end> |
398 | <description>Block end timestamp</description> |
399 | <typeref name=block_timestamp/> |
400 | </event> |
401 | |
402 | <event name=time_heartbeat> |
403 | <description>System time values sent periodically to minimize cycle counter |
404 | drift with respect to real time clock and to detect cycle counter |
405 | rollovers |
406 | </description> |
407 | <typeref name=timestamp/> |
408 | </event> |
409 | |
410 | <type name=block_timestamp> |
411 | <struct> |
412 | <field name=timestamp><typeref name=timestamp/></field> |
413 | <field name=block_id><uint size=4/></field> |
414 | </struct> |
415 | </type> |
416 | |
417 | <type name=timestamp> |
418 | <struct> |
419 | <field name=time><typeref name=timespec/></field> |
420 | <field name="cycle_count"><uint size=8/></field> |
421 | </struct> |
422 | </type> |
423 | |
424 | <type name=timespec> |
425 | <struct> |
426 | <field name="seconds"><uint size=4/></field> |
427 | <field name="nanoseconds"><uint size=4/></field> |
428 | </struct> |
429 | </type> |
430 | </TT></PRE> |
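<P>
When has_heartbeat is set, event headers carry only the 32 least significant
bits of the cycle counter, and the periodic time_heartbeat events make
rollovers detectable. The sketch below shows one way a reader could rebuild
64-bit cycle counts from those 32-bit values. It assumes events are processed
in order and that at least one event (for example a heartbeat) is seen during
every 2^32 cycles. The class name TscExtender is hypothetical and this is not
the LTTV algorithm.

```python
class TscExtender:
    """Rebuild 64-bit cycle counts from in-order 32-bit timestamps."""

    LOW_MASK = 0xFFFFFFFF

    def __init__(self, start_tsc):
        # Full 64-bit counter at a known point, e.g. start_tsc from the trace
        # header or cycle_count from the block header.
        self.last = start_tsc

    def extend(self, low32):
        """Return the full counter for the next event's 32-bit timestamp."""
        candidate = (self.last & ~self.LOW_MASK) | low32
        if candidate < self.last:
            # The low 32 bits wrapped around since the previous event.
            candidate += 1 << 32
        self.last = candidate
        return candidate
```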
431 | |
432 | <H2>Control files</H2> |
433 | |
434 | <P> |
435 | The interrupts file reflects the content of the /proc/interrupts system file. |
436 | It contains one event describing each interrupt. At trace start, events are |
437 | generated describing all the current interrupts. If the assignment of |
438 | interrupts changes later, due to devices or device drivers being activated or |
439 | deactivated, additional events may be added to the file. Each interrupt |
440 | event has the following structure. |
441 | |
442 | <PRE><TT> |
443 | <event name=interrupt> |
444 | <description>Interrupt request number assignment</description> |
445 | <struct> |
446 | <field name="number"><uint size=4/></field> |
447 | <field name="count"><uint size=4/></field> |
448 | <field name="controller"><string/></field> |
449 | <field name="name"><string/></field> |
450 | </struct> |
451 | </event> |
452 | </TT></PRE> |
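<P>
As an example of applying an event description to raw data, the sketch below
decodes the variable length data of one interrupt event. It assumes packed
fields and a known byte order; Interrupt and parse_interrupt are hypothetical
names, and the sample values in the usage note are made up.

```python
import struct
from collections import namedtuple

Interrupt = namedtuple("Interrupt", "number count controller name")


def _read_cstring(data, offset):
    """Read a null terminated string; return (text, offset_after_the_null)."""
    end = data.index(b"\x00", offset)
    return data[offset:end].decode("ascii", errors="replace"), end + 1


def parse_interrupt(payload, byte_order="<"):
    """Decode number, count, controller and name, in that order."""
    number, count = struct.unpack_from(byte_order + "II", payload, 0)
    controller, offset = _read_cstring(payload, 8)
    name, _ = _read_cstring(payload, offset)
    return Interrupt(number, count, controller, name)
```

<P>
For instance, a payload built from the made-up values 14, 1200, "IO-APIC-edge"
and "ide0" decodes back to those four fields.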
453 | |
454 | <P> |
455 | The processes file contains the list of processes already created when the |
456 | trace starts. Each event describing a process is modeled after the |
457 | /proc/self/status system file. The number of fields in this event is |
458 | expected to be expanded in the future to include groups, signal masks, |
459 | opened file descriptors and address maps. |
460 | |
461 | <PRE><TT> |
462 | <event name=process> |
463 | <description>Existing process</description> |
464 | <struct> |
465 | <field name="name"><string/></field> |
466 | <field name="pid"><uint size=4/></field> |
467 | <field name="ppid"><uint size=4/></field> |
468 | <field name="tracer_pid"><uint size=4/></field> |
469 | <field name="uid"><uint size=4/></field> |
470 | <field name="euid"><uint size=4/></field> |
471 | <field name="suid"><uint size=4/></field> |
472 | <field name="fsuid"><uint size=4/></field> |
473 | <field name="gid"><uint size=4/></field> |
474 | <field name="egid"><uint size=4/></field> |
475 | <field name="sgid"><uint size=4/></field> |
476 | <field name="fsgid"><uint size=4/></field> |
477 | <field name="state"><enum size=4> |
478 | Running WaitInterruptible WaitUninterruptible Zombie Traced Paging |
479 | </enum></field> |
480 | </struct> |
481 | </event> |
482 | </TT></PRE> |
483 | |
484 | <H2>Facilities</H2> |
485 | |
486 | <P> |
487 | Facilities define a granularity of events grouping for filtering, activation |
488 | and compilation. Each facility costs a table entry in the kernel (name, |
489 | checksum, event type code range), somewhere between 20 and 30 bytes. Having |
490 | one facility per tracing statement in the kernel would be too much (assuming |
491 | that they eventually are routinely inserted in the kernel code and replace |
492 | the 80000+ printk statements in some proportion). However, having a few |
493 | facilities, up to a few tens, would make sense. |
494 | |
495 | <P> |
496 | The "builtin" facility contains a small number of predefined events which must |
497 | always exist. The "core" facility contains a small subset of OS events which |
498 | are almost always of interest (scheduling, interrupts, faults, system calls). |
499 | Then, specialized facilities may exist for each subsystem (network, disks, |
500 | USB, SCSI...). |
501 | |
502 | |
503 | <H2>Bookmarks</H2> |
504 | |
505 | <P> |
506 | Bookmarks are user supplied information added to a trace. They contain user |
507 | annotations attached to a time interval. |
508 | |
509 | <PRE><TT> |
510 | <bookmarks> |
511 | <location name=name cpu=n start_time=t end_time=t>Some text</location> |
512 | ... |
513 | </bookmarks> |
514 | </TT></PRE> |
515 | |
516 | <P> |
517 | The interval is defined using either "time=" or "start_time=" and |
518 | "end_time=", or "cycle=" or "start_cycle=" and "end_cycle=". |
519 | The time is in seconds with decimals up to nanoseconds and cycle counts |
520 | are unsigned integers with a 64-bit range. The cpu attribute is optional. |
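<P>
A minimal sketch of reading a bookmarks file with Python's
xml.etree.ElementTree. It assumes quoted attribute values, treats a single
"time=" or "cycle=" attribute as an interval whose start and end coincide
(an interpretation, not something this document specifies), and keeps times as
Decimal values so that nanosecond digits are not lost. The function names are
hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal


def _interval(attrs, unit, convert):
    """Return (start, end) for unit ("time" or "cycle"), or None if absent."""
    if unit in attrs:
        point = convert(attrs[unit])
        return point, point
    start = attrs.get("start_" + unit)
    end = attrs.get("end_" + unit)
    if start is not None and end is not None:
        return convert(start), convert(end)
    return None


def read_bookmarks(xml_text):
    """Return one dict per location element of a bookmarks document."""
    bookmarks = []
    for location in ET.fromstring(xml_text).iter("location"):
        attrs = location.attrib
        bookmarks.append({
            "name": attrs.get("name"),
            "cpu": int(attrs["cpu"]) if "cpu" in attrs else None,
            "time": _interval(attrs, "time", Decimal),
            "cycle": _interval(attrs, "cycle", int),
            "text": (location.text or "").strip(),
        })
    return bookmarks
```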
521 | |
522 | </BODY> |
523 | </HTML> |
524 | |
525 | |
526 | |
527 | |