1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
2 | <html> |
3 | <head> |
4 | <title>The LTTng trace format</title> |
5 | </head> |
6 | <body> |
7 | |
8 | <h1>The LTTng trace format</h1> |
9 | |
10 | <P> |
11 | This document describes the LTTng trace format. It is intended only for |
12 | developers working on the LTTng tracer or on the LTTV traceread library; other |
13 | users should rely on that library, which offers all the necessary abstractions on top of the raw trace data. |
14 | |
15 | <P> |
16 | A trace is contained in a directory tree. To send a trace remotely, |
17 | the directory tree may be tar-gzipped. Trace foo, placed in the home |
18 | directory of user john, /home/john, would have the following content: |
19 | |
20 | <PRE><TT> |
21 | $ cd /home/john |
22 | $ tree foo |
23 | foo/ |
24 | |-- eventdefs |
25 | | |-- core.xml |
26 | | |-- fs.xml |
27 | | |-- ipc.xml |
28 | | |-- kernel.xml |
29 | | |-- memory.xml |
30 | | |-- network.xml |
31 | | |-- process.xml |
32 | | |-- s390_kernel.xml |
33 | | |-- socket.xml |
34 | | |-- timer.xml |
35 | | `-- ... |
36 | |-- info |
37 | | |-- bookmarks.xml |
38 | | `-- system.xml |
39 | |-- control |
40 | | |-- facilities_0 |
41 | | |-- facilities_1 |
42 | | |-- facilities_... |
43 | | |-- interrupts_0 |
44 | | |-- interrupts_1 |
45 | | |-- interrupts_... |
46 | | |-- modules_0 |
47 | | |-- modules_1 |
48 | | |-- modules_... |
49 | | |-- processes_0 |
50 | | |-- processes_1 |
51 | | `-- processes_... |
52 | |-- cpu_0 |
53 | |-- cpu_1 |
54 | `-- cpu_... |
55 | |
56 | </TT></PRE> |
57 | |
58 | <P> |
59 | The eventdefs directory contains the event descriptions for all the |
60 | facilities used. The syntax is a simple subset of XML; XML is widely |
61 | known and easily parsed or hand edited. Each file contains one or more |
62 | <FACILITY NAME=name>...</FACILITY> elements. Indeed, several |
63 | facilities may have the same name but different content (and thus will |
64 | generate a different checksum). This typically happens when, while tracing |
65 | is enabled, a module using the named facility is unloaded, modified |
66 | (along with the description of some events), recompiled and reloaded. |
67 | The trace then contains events from two different, similarly named, |
68 | facility versions. |
69 | |
70 | <P> |
71 | A small number of events are predefined, part of the "core" facility, |
72 | and are not present there. These "core" events include "facility_load", |
73 | "facility_unload", "time_heartbeat" and "state_dump_facility_load". |
74 | |
75 | <P> |
76 | The root directory contains a tracefile for each CPU, numbered from 0, |
77 | in .trace format. A uniprocessor trace thus contains only the file cpu_0. |
78 | A multiprocessor with some unused (possibly hotplug) CPU slots may have some |
79 | unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs randomly |
80 | installed may produce tracefiles numbered 0, 1, 2, 4, 6 and 7. |
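<P>
Because CPU numbers may have gaps, a reader should list the tracefiles that
are actually present rather than count from 0. The following is a minimal
sketch of that idea, assuming the cpu_N naming shown in the example tree;
the function names are hypothetical and are not part of any LTTV API.

```python
import re
from pathlib import Path

# Assumed naming, taken from the example tree above: cpu_<number>
CPU_FILE = re.compile(r"cpu_(\d+)")


def cpu_numbers(names):
    """Return the sorted CPU numbers found among the given file names."""
    numbers = []
    for name in names:
        match = CPU_FILE.fullmatch(name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


def trace_cpu_numbers(trace_dir):
    """List the CPU numbers of the tracefiles in a trace root directory."""
    return cpu_numbers(p.name for p in Path(trace_dir).iterdir() if p.is_file())


# The 8-way board with 6 CPUs installed, plus an unrelated entry.
present = cpu_numbers(["cpu_0", "cpu_1", "cpu_2", "cpu_4", "cpu_6", "cpu_7", "info"])
print(present)
```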
81 | |
82 | <P> |
83 | The files in the control directory also follow the .trace format and are also |
84 | per CPU. |
85 | The "facilities" file contains only the "core" facility_load, facility_unload, |
86 | time_heartbeat and state_dump_facility_load events |
87 | and is used to determine the facilities used and the code range assigned |
88 | to each facility. The other control files contain the initial system |
89 | state and various subsequent important events, for example process |
90 | creation and exit. The advantage of placing such subsequent events |
91 | in control trace files instead of (or in addition to) the per-CPU |
92 | trace files is that they may be accessed more quickly and conveniently, |
93 | and that they may be kept even when the per-CPU files are overwritten |
94 | in "flight recorder mode". |
95 | |
96 | <P> |
97 | The info directory contains, in system.xml, a description of the system on which |
98 | the trace was created, as well as user annotations in bookmarks.xml. |
99 | This directory may also contain various information about the trace, generated |
100 | during trace analysis (statistics, index...). |
101 | |
102 | |
103 | <H2>Trace format</H2> |
104 | |
105 | <P> |
106 | Each tracefile is divided into equal size blocks with a header at the beginning |
107 | of the block. Events are packed sequentially in the block starting right after |
108 | the block header. |
109 | <P> |
110 | Each block consists of : |
111 | <PRE><TT> |
112 | block start/end header |
113 | trace header |
114 | event 1 header |
115 | event 1 variable length data |
116 | event 2 header |
117 | event 2 variable length data |
118 | .... |
119 | padding |
120 | </TT></PRE> |
121 | |
122 | <P> |
123 | The block start/end header |
124 | |
125 | <PRE><TT> |
126 | begin |
127 | * the beginning of buffer information |
128 | uint64 cycle_count |
129 | * TSC at the beginning of the buffer |
130 | uint64 freq |
131 | * frequency of the CPUs at the beginning of the buffer. |
132 | end |
133 | * the end of buffer information |
134 | uint64 cycle_count |
135 | * TSC at the end of the buffer |
136 | uint64 freq |
137 | * frequency of the CPUs at the end of the buffer. |
138 | uint32 lost_size |
139 | * number of bytes of padding at the end of the buffer. |
140 | uint32 buf_size |
141 | * size of the sub-buffer. |
142 | </TT></PRE> |
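<P>
As an illustration, the block start/end header can be decoded with a single
fixed layout. This is only a sketch: it assumes the fields are packed in the
order listed above with no padding and in little-endian byte order, neither of
which is guaranteed by this document. The names BlockHeader,
parse_block_header and used_bytes are hypothetical.

```python
import struct
from typing import NamedTuple

# Assumption: packed fields, little-endian, in the order listed above.
# 4 x uint64 (begin/end cycle_count and freq) followed by 2 x uint32.
BLOCK_HEADER = struct.Struct("<QQQQII")


class BlockHeader(NamedTuple):
    begin_cycle_count: int
    begin_freq: int
    end_cycle_count: int
    end_freq: int
    lost_size: int
    buf_size: int


def parse_block_header(data: bytes) -> BlockHeader:
    """Decode the block start/end header found at the start of a block."""
    return BlockHeader(*BLOCK_HEADER.unpack_from(data, 0))


def used_bytes(header: BlockHeader) -> int:
    """Bytes of the sub-buffer holding headers and events, padding excluded."""
    return header.buf_size - header.lost_size


# Synthetic header: 4096-byte sub-buffer ending with 96 bytes of padding.
raw = BLOCK_HEADER.pack(1000, 2_000_000_000, 5000, 2_000_000_000, 96, 4096)
header = parse_block_header(raw)
print(header.end_cycle_count - header.begin_cycle_count, used_bytes(header))
```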
143 | |
144 | |
145 | |
146 | <P> |
147 | The trace header |
148 | |
149 | <PRE><TT> |
150 | uint32 magic_number |
151 | * 0x00D6B7ED, used to check the trace byte order vs host byte order. |
152 | uint32 arch_type |
153 | * Architecture type of the traced machine. |
154 | uint32 arch_variant |
155 | * Architecture variant of the traced machine. May be unused on some arch. |
156 | uint32 float_word_order |
157 | * Byte order of floats and doubles, sometimes different from integer byte |
158 | order. Useful only for user space traces. |
159 | uint8 arch_size |
160 | * Size (in bytes) of the void * on the traced machine. |
161 | uint8 major_version |
162 | * major version of the trace. |
163 | uint8 minor_version |
164 | * minor version of the trace. |
165 | uint8 flight_recorder |
166 | * Is flight recorder mode activated ? If yes, data might be missing |
167 | (overwritten) in the trace. |
168 | uint8 has_heartbeat |
169 | * Does this trace have heartbeat timer event activated ? |
170 | Yes (1) -> Event header has 32 bits TSC |
171 | No (0) -> Event header has 64 bits TSC |
172 | uint8 has_alignment |
173 | * Is the information in this trace aligned ? |
174 | Yes (1) -> aligned on min(arch size, atomic data size). |
175 | No (0) -> data is packed. |
176 | uint32 freq_scale |
177 | event time is always calculated from : |
178 | trace_start_time + ((event_tsc - trace_start_tsc) * (freq / freq_scale)) |
179 | uint64 start_freq |
180 | * CPU clock frequency at the beginning of the trace. |
181 | uint64 start_tsc |
182 | * TSC at the beginning of the trace. |
183 | uint64 start_monotonic |
184 | * monotonically increasing time at the beginning of the trace. |
185 | (currently not supported) |
186 | start_time |
187 | * Real time at the beginning of the trace (as given by date, adjusted by NTP) |
188 | This is the only time reference with the real world : the rest of the trace |
189 | has monotonically increasing time from this point (with TSC difference and |
190 | clock frequency). |
191 | uint32 seconds |
192 | uint32 nanoseconds |
193 | </TT></PRE> |
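<P>
The magic number check can be sketched as follows: the four bytes of
magic_number are decoded in both byte orders, and the order that yields
0x00D6B7ED is the byte order of the trace. The function name
detect_byte_order is hypothetical, and locating the magic_number bytes inside
the block is left to the caller.

```python
import struct

MAGIC = 0x00D6B7ED  # value given in the trace header description


def detect_byte_order(magic_bytes: bytes) -> str:
    """Return the struct prefix ("<" or ">") matching the trace byte order."""
    for prefix in ("<", ">"):
        (value,) = struct.unpack(prefix + "I", magic_bytes)
        if value == MAGIC:
            return prefix
    raise ValueError("bad magic number: not an LTTng trace header")


little = struct.pack("<I", MAGIC)  # as written by a little-endian machine
big = struct.pack(">I", MAGIC)     # as written by a big-endian machine
print(detect_byte_order(little), detect_byte_order(big))
```

<P>
The returned prefix can then be reused for every later struct call on the
same trace, so the reader never depends on the host byte order.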
194 | |
195 | |
196 | <P> |
197 | Event header |
198 | |
199 | <P> |
200 | The event header differs depending on two conditions: does the traced system have |
201 | a heartbeat timer? Is trace alignment activated? |
202 | |
203 | <P> |
204 | Event header : |
205 | <PRE><TT> |
206 | { uint32 timestamp |
207 | or |
208 | uint64 timestamp } |
209 | * if has_heartbeat : 32 LSB of the cycle counter at the event record time. |
210 | * else : 64 bits complete cycle counter. |
211 | uint8 facility_id |
212 | * Numerical ID of the facility corresponding to the event. See the facility |
213 | tracefile to know which facility ID matches which facility name and |
214 | description. |
215 | uint8 event_id |
216 | * Numerical ID of the event inside the facility. |
217 | uint16 event_size |
218 | * Size of the variable length data that follows this header. |
219 | </TT></PRE> |
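<P>
A minimal sketch of reading one event header in the packed case
(has_alignment set to 0). The timestamp width is chosen from has_heartbeat as
described above. The names EventHeader and parse_event_header are
hypothetical, and the byte order is an assumption passed in by the caller.

```python
import struct
from typing import NamedTuple, Tuple


class EventHeader(NamedTuple):
    timestamp: int
    facility_id: int
    event_id: int
    event_size: int


def parse_event_header(buf: bytes, offset: int, has_heartbeat: bool,
                       byte_order: str = "<") -> Tuple[EventHeader, int]:
    """Parse one packed event header; return it and the offset of its data."""
    # 32-bit timestamp when heartbeat events are present, 64-bit otherwise.
    ts_code = "I" if has_heartbeat else "Q"
    layout = struct.Struct(byte_order + ts_code + "BBH")
    header = EventHeader(*layout.unpack_from(buf, offset))
    return header, offset + layout.size


# Synthetic event: facility 3, event 7, followed by 16 bytes of data.
sample = struct.pack("<IBBH", 0x12345678, 3, 7, 16) + bytes(16)
header, data_offset = parse_event_header(sample, 0, has_heartbeat=True)
next_event_offset = data_offset + header.event_size
print(header, data_offset, next_event_offset)
```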
220 | |
221 | <P> |
222 | Event header alignment |
223 | |
224 | <P> |
225 | If trace alignment is activated (has_alignment), the event header is aligned |
226 | on the architecture size (void pointer size). In addition, padding is |
227 | added after the event header so that the variable length data is |
228 | also aligned on the architecture size. |
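<P>
The padding arithmetic can be sketched as below. It assumes that offsets are
measured from a position that is itself aligned (for instance the start of the
block); the helper names align_up and padding_before are hypothetical.

```python
def align_up(offset: int, arch_size: int) -> int:
    """Round offset up to the next multiple of arch_size."""
    return ((offset + arch_size - 1) // arch_size) * arch_size


def padding_before(offset: int, arch_size: int) -> int:
    """Number of padding bytes needed to align offset on arch_size."""
    return align_up(offset, arch_size) - offset


# A 12-byte header (64-bit timestamp) starting at offset 40 on a machine
# with 8-byte pointers ends at offset 52, so its data starts at offset 56.
header_start = align_up(40, 8)
data_start = align_up(header_start + 12, 8)
print(header_start, data_start, padding_before(header_start + 12, 8))
```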
229 | |
230 | <P> |
231 | |
232 | <H2>System description</H2> |
233 | |
234 | <P> |
235 | The system type description, in system.xml, looks like: |
236 | |
237 | <PRE><TT> |
238 | <system |
239 | node_name="vaucluse" |
240 | domainname="polymtl.ca" |
241 | cpu="4" |
242 | arch_size="ILP32" |
243 | endian="little" |
244 | kernel_name="Linux" |
245 | kernel_release="2.4.18-686-smp" |
246 | kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002" |
247 | machine="i686" |
248 | processor="unknown" |
249 | hardware_platform="unknown" |
250 | operating_system="Linux" |
251 | ltt_major_version="2" |
252 | ltt_minor_version="0" |
253 | ltt_block_size="100000" |
254 | > |
255 | Some comments about the system |
256 | </system> |
257 | </TT></PRE> |
258 | |
259 | <P> |
260 | The system attributes kernel_name, node_name, kernel_release, |
261 | kernel_version, machine, processor, hardware_platform and operating_system |
262 | come from the uname(1) program. The domainname attribute is obtained from |
263 | the "hostname --domain" command. The arch_size attribute is one of |
264 | LP32, ILP32, LP64 or ILP64 and specifies the length in bits of integers (I), |
265 | long (L) and pointers (P). The endian attribute is "little" or "big". |
266 | While the arch_size and endian attributes could be deduced from the platform |
267 | type, stating them explicitly allows analysing traces from as yet unknown |
268 | platforms. The cpu attribute specifies the maximum number of processors in |
269 | the system; only tracefiles 0 to this maximum - 1 may exist in the trace |
270 | root directory. |
271 | |
272 | <P> |
273 | Within the system element, the text enclosed may describe further the |
274 | system traced. |
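<P>
A short sketch of reading the system description with a standard XML parser.
It assumes every attribute value is quoted, as such parsers require, and uses
a shortened copy of the example above. The function name read_system and the
keys of the returned dictionary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example system description, all attributes quoted.
SYSTEM_XML = """<system
  node_name="vaucluse"
  domainname="polymtl.ca"
  cpu="4"
  arch_size="ILP32"
  endian="little"
  ltt_major_version="2"
  ltt_minor_version="0"
  ltt_block_size="100000">
Some comments about the system
</system>"""


def read_system(text: str) -> dict:
    """Extract the attributes a trace reader needs before decoding blocks."""
    root = ET.fromstring(text)
    return {
        "cpu": int(root.get("cpu")),
        "arch_size": root.get("arch_size"),
        # struct-style byte order prefix derived from the endian attribute
        "byte_order": "<" if root.get("endian") == "little" else ">",
        "block_size": int(root.get("ltt_block_size")),
        "comment": (root.text or "").strip(),
    }


info = read_system(SYSTEM_XML)
print(info)
```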
275 | |
276 | |
277 | <H2>Event type descriptions</H2> |
278 | |
279 | <P> |
280 | A facility contains the descriptions of several event types. When a structure |
281 | is reused in several event types, a named type is defined and may be referenced |
282 | by several other event types or named types. |
283 | |
284 | <PRE><TT> |
285 | <facility name=facility_name> |
286 | <description>Some text</description> |
287 | <event name=eventtype_name> |
288 | <description>Some text</description> |
289 | --type structure-- |
290 | </event> |
291 | ... |
292 | <type name=type_name> |
293 | --type structure-- |
294 | </type> |
295 | </facility> |
296 | </TT></PRE> |
297 | |
298 | <P> |
299 | The type structure may be one of the following primitive type elements. |
300 | Whenever the keyword isize is used, the allowed values are |
301 | short, medium, long, 1, 2, 4, 8, indicating the size in bytes. |
302 | The fsize keyword accepts medium, long, 4 or 8, likewise in bytes. |
303 | |
304 | <PRE><TT> |
305 | <int size=isize format="printf format"/> |
306 | |
307 | <uint size=isize format="printf format"/> |
308 | |
309 | <float size=fsize format="printf format"/> |
310 | |
311 | <string format="printf format"/> |
312 | |
313 | <enum size=isize format="printf format">label1 label2 ...</enum> |
314 | </TT></PRE> |
315 | |
316 | <P> |
317 | The string is null terminated. For the enumeration, the size of the integer |
318 | used for its representation is specified. |
319 | |
320 | <P> |
321 | The type structure may also be a compound type. |
322 | |
323 | <PRE><TT> |
324 | <array size=n> --type structure-- </array> |
325 | |
326 | <sequence lengthsize=isize> --type structure-- </sequence> |
327 | |
328 | <struct> |
329 | <field name=field_name> |
330 | <description>Some text</description> |
331 | --type structure-- |
332 | </field> |
333 | ... |
334 | </struct> |
335 | |
336 | <union typecodesize=isize> |
337 | <field name=field_name> |
338 | <description>Some text</description> |
339 | --type structure-- |
340 | </field> |
341 | ... |
342 | </union> |
343 | </TT></PRE> |
344 | |
345 | <P> |
346 | Array is a fixed size array of length size. Sequence is a variable size |
347 | array with its length stored as a prepended uint of length lengthsize. |
348 | A structure is simply an aggregation of fields. A union is one of its n |
349 | fields (variant record), as indicated by a preceding code (0 to n - 1) |
350 | of the specified size typecodesize. |
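<P>
To make the sequence layout concrete, here is a sketch that decodes a
sequence of unsigned integers: a length of lengthsize bytes, then that many
elements. It assumes packed data, numeric sizes (1, 2, 4 or 8) and a byte
order chosen by the caller; decode_sequence is a hypothetical name.

```python
import struct

# struct codes for unsigned integers of 1, 2, 4 and 8 bytes
UINT_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}


def decode_sequence(buf, offset, lengthsize, elem_size, byte_order="<"):
    """Decode a sequence of unsigned integers; return (values, next_offset)."""
    # The element count is stored first, as a uint of lengthsize bytes.
    (count,) = struct.unpack_from(byte_order + UINT_CODES[lengthsize], buf, offset)
    offset += lengthsize
    layout = struct.Struct(byte_order + UINT_CODES[elem_size] * count)
    values = list(layout.unpack_from(buf, offset))
    return values, offset + layout.size


# A sequence with lengthsize=2 holding three 4-byte elements.
raw = struct.pack("<HIII", 3, 10, 20, 30)
values, next_offset = decode_sequence(raw, 0, lengthsize=2, elem_size=4)
print(values, next_offset)
```

<P>
A union could be decoded the same way: read the type code of typecodesize
bytes, then decode the field selected by that code.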
351 | |
352 | <P> |
353 | Finally the type structure may be defined by referencing a named type. |
354 | |
355 | <PRE><TT> |
356 | <typeref name=type_name/> |
357 | </TT></PRE> |
358 | |
359 | <H2>Core events</H2> |
360 | |
361 | <P> |
362 | The facility named "core" is always present and contains at least the |
363 | following event types. |
364 | |
365 | <PRE><TT> |
366 | <event name=facility_load> |
367 | <description>Facility used in the trace</description> |
368 | <struct> |
369 | <field name="name"><string/></field> |
370 | <field name="checksum"><uint size=4/></field> |
371 | <field name="id"><uint size=4/></field> |
372 | <field name="int_size"><uint size=4/></field> |
373 | <field name="long_size"><uint size=4/></field> |
374 | <field name="pointer_size"><uint size=4/></field> |
375 | <field name="size_t_size"><uint size=4/></field> |
376 | <field name="has_alignment"><uint size=4/></field> |
377 | </struct> |
378 | </event> |
379 | |
380 | <event name=state_dump_facility_load> |
381 | <description>Facility used in the trace</description> |
382 | <struct> |
383 | <field name="name"><string/></field> |
384 | <field name="checksum"><uint size=4/></field> |
385 | <field name="id"><uint size=4/></field> |
386 | <field name="int_size"><uint size=4/></field> |
387 | <field name="long_size"><uint size=4/></field> |
388 | <field name="pointer_size"><uint size=4/></field> |
389 | <field name="size_t_size"><uint size=4/></field> |
390 | <field name="has_alignment"><uint size=4/></field> |
391 | </struct> |
392 | </event> |
393 | |
394 | <event name=time_heartbeat> |
395 | <description>System time values sent periodically to minimize cycle counter |
396 | drift with respect to real time clock and to detect cycle counter |
397 | rollovers |
398 | </description> |
399 | <typeref name=timestamp/> |
400 | </event> |
401 | |
402 | <type name=timestamp> |
403 | <struct> |
404 | <field name="seconds"><uint size=4/></field> |
405 | <field name="nanoseconds"><uint size=4/></field> |
406 | <field name="cycle_count"><uint size=8/></field> |
407 | </struct> |
408 | </type> |
409 | |
410 | </TT></PRE> |
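<P>
As an example of how such a description maps onto bytes, the sketch below
decodes the variable length data of a facility_load event: a null terminated
name followed by seven 4-byte unsigned integers. It assumes packed data
(has_alignment set to 0), an ASCII name and a byte order chosen by the
caller; FacilityLoad and decode_facility_load are hypothetical names.

```python
import struct
from typing import NamedTuple


class FacilityLoad(NamedTuple):
    name: str
    checksum: int
    id: int
    int_size: int
    long_size: int
    pointer_size: int
    size_t_size: int
    has_alignment: int


def decode_facility_load(payload: bytes, byte_order: str = "<") -> FacilityLoad:
    """Decode the data of a facility_load event (packed layout assumed)."""
    end = payload.index(b"\0")  # the string field is null terminated
    name = payload[:end].decode("ascii")
    numbers = struct.unpack_from(byte_order + "7I", payload, end + 1)
    return FacilityLoad(name, *numbers)


# Synthetic payload for a facility named "network" with ID 5.
payload = b"network\0" + struct.pack("<7I", 0xDEADBEEF, 5, 4, 4, 4, 4, 0)
facility = decode_facility_load(payload)
print(facility)
```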
411 | |
412 | <H2>Control files</H2> |
413 | |
414 | <P> |
415 | The interrupts file reflects the content of the /proc/interrupts system file. |
416 | It contains one event describing each interrupt. At trace start, events are |
417 | generated describing all the current interrupts. If the assignment of |
418 | interrupts changes later, due to devices or device drivers being activated or |
419 | deactivated, additional events may be added to the file. Each interrupt |
420 | event has the following structure. |
421 | |
422 | <PRE><TT> |
423 | <event name=interrupt> |
424 | <description>Interrupt request number assignment</description> |
425 | <struct> |
426 | <field name="number"><uint size=4/></field> |
427 | <field name="count"><uint size=4/></field> |
428 | <field name="controller"><string/></field> |
429 | <field name="name"><string/></field> |
430 | </struct> |
431 | </event> |
432 | </TT></PRE> |
433 | |
434 | <P> |
435 | The processes file contains the list of processes already created when the |
436 | trace starts. Each event describing a process is modeled after the |
437 | /proc/self/status system file. The number of fields in this event is |
438 | expected to be expanded in the future to include groups, signal masks, |
439 | opened file descriptors and address maps. |
440 | |
441 | <PRE><TT> |
442 | <event name=process> |
443 | <description>Existing process</description> |
444 | <struct> |
445 | <field name="name"><string/></field> |
446 | <field name="pid"><uint size=4/></field> |
447 | <field name="ppid"><uint size=4/></field> |
448 | <field name="tracer_pid"><uint size=4/></field> |
449 | <field name="uid"><uint size=4/></field> |
450 | <field name="euid"><uint size=4/></field> |
451 | <field name="suid"><uint size=4/></field> |
452 | <field name="fsuid"><uint size=4/></field> |
453 | <field name="gid"><uint size=4/></field> |
454 | <field name="egid"><uint size=4/></field> |
455 | <field name="sgid"><uint size=4/></field> |
456 | <field name="fsgid"><uint size=4/></field> |
457 | <field name="state"><enum size=4> |
458 | Running WaitInterruptible WaitUninterruptible Zombie Traced Paging |
459 | </enum></field> |
460 | </struct> |
461 | </event> |
462 | </TT></PRE> |
463 | |
464 | <H2>Facilities</H2> |
465 | |
466 | <P> |
467 | Facilities define a granularity of event grouping for filtering, activation |
468 | and compilation. Each facility costs a table entry in the kernel (name, |
469 | checksum, event type code range), somewhere between 20 and 30 bytes. Having |
470 | one facility per tracing statement in the kernel would be too much (assuming |
471 | that they eventually are routinely inserted in the kernel code and replace |
472 | the 80000+ printk statements in some proportion). However, having a few |
473 | facilities, up to a few tens, would make sense. |
474 | |
475 | <P> |
476 | The "builtin" facility contains a small number of predefined events which must |
477 | always exist. The "core" facility contains a small subset of OS events which |
478 | are almost always of interest (scheduling, interrupts, faults, system calls). |
479 | Then, specialized facilities may exist for each subsystem (network, disks, |
480 | USB, SCSI...). |
481 | |
482 | |
483 | <H2>Bookmarks</H2> |
484 | |
485 | <P> |
486 | Bookmarks are user supplied information added to a trace. They contain user |
487 | annotations attached to a time interval. |
488 | |
489 | <PRE><TT> |
490 | <bookmarks> |
491 | <location name=name cpu=n start_time=t end_time=t>Some text</location> |
492 | ... |
493 | </bookmarks> |
494 | </TT></PRE> |
495 | |
496 | <P> |
497 | The interval is defined using either "time=" or "start_time=" and |
498 | "end_time=", or "cycle=" or "start_cycle=" and "end_cycle=". |
499 | The time is in seconds with decimals up to nanoseconds and cycle counts |
500 | are unsigned integers with a 64-bit range. The cpu attribute is optional. |
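<P>
A time attribute can be converted to whole seconds and nanoseconds without
going through floating point, which would lose precision. The sketch below
assumes a plain decimal notation such as "12.5"; parse_bookmark_time is a
hypothetical name.

```python
from typing import Tuple


def parse_bookmark_time(text: str) -> Tuple[int, int]:
    """Convert "seconds[.fraction]" into (seconds, nanoseconds)."""
    seconds, _, fraction = text.partition(".")
    if len(fraction) > 9:
        raise ValueError("more than nanosecond precision: " + text)
    # Pad the fraction on the right so that "5" means 500000000 ns.
    nanoseconds = int(fraction.ljust(9, "0")) if fraction else 0
    return int(seconds), nanoseconds


print(parse_bookmark_time("12.5"))
print(parse_bookmark_time("0.000000001"))
```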
501 | |
502 | </BODY> |
503 | </HTML> |
504 | |
505 | |
506 | |
507 | |