1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> |
2 | <html> |
3 | <head> |
4 | <title>The new LTT trace format</title> |
5 | </head> |
6 | <body> |
7 | |
8 | <h1>The new LTT trace format</h1> |
9 | |
10 | <P> |
11 | A trace is contained in a directory tree. To send a trace remotely, |
12 | the directory tree may be tar-gzipped. Trace foo, placed in the home |
13 | directory of user john, /home/john, would have the following content: |
14 | |
15 | <PRE><TT> |
16 | $ cd /home/john |
17 | $ tree foo |
18 | foo/ |
19 | |-- eventdefs |
20 | |   |-- core.xml |
21 | |   |-- net.xml |
22 | |   |-- ipv4.xml |
23 | |   `-- ide.xml |
24 | |-- info |
25 | |   |-- bookmarks.xml |
26 | |   `-- system.xml |
27 | |-- control |
28 | |   |-- facilities |
29 | |   |-- interrupts |
30 | |   `-- processes |
31 | `-- cpu |
32 |     |-- 0 |
33 |     |-- 1 |
34 |     |-- 2 |
35 |     `-- 3 |
36 | </TT></PRE> |
37 | |
38 | <P> |
39 | The eventdefs directory contains the event descriptions for all the |
40 | facilities used. The syntax is a simple subset of XML; XML is widely |
41 | known and easily parsed or hand edited. Each file contains one or more |
42 | <FACILITY NAME=name>...</FACILITY> elements. Several facilities may |
43 | have the same name but different content (and thus a different |
44 | checksum). This typically happens when the event descriptions for a |
45 | facility change from one version to the next, for example when a module |
46 | is recompiled and reloaded during a trace. |
47 | |
48 | <P> |
49 | A small number of events are predefined as part of the "builtin" facility |
50 | and are not described in eventdefs. These "builtin" events include "facility_load", |
51 | "block_start", "block_end" and "time_heartbeat". |
52 | |
53 | <P> |
54 | The cpu directory contains a tracefile for each cpu, numbered from 0, |
55 | in .trace format. A trace from a uniprocessor thus only contains the file cpu/0. |
56 | A multi-processor with some unused (possibly hotplug) CPU slots may have some |
57 | unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs installed in |
58 | arbitrary slots may produce tracefiles named 0, 1, 2, 4, 6, 7. |
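<P>
The gaps in CPU numbering described above mean a reader cannot assume that
tracefiles 0 to n - 1 all exist. The following is a minimal sketch, in Python,
of enumerating whatever per-cpu tracefiles are present. The function names
cpu_tracefiles and missing_cpus, and the cpu_count parameter, are hypothetical
and not part of the format.

```python
from pathlib import Path

def cpu_tracefiles(trace_dir):
    """Return {cpu_number: path} for the tracefiles found in <trace_dir>/cpu.

    Only files whose name is a plain decimal number are considered, so gaps
    in the numbering (unused CPU slots) are tolerated.
    """
    cpu_dir = Path(trace_dir) / "cpu"
    found = {}
    for entry in cpu_dir.iterdir():
        name = entry.name
        if entry.is_file() and name.isascii() and name.isdigit():
            found[int(name)] = entry
    return dict(sorted(found.items()))

def missing_cpus(found, cpu_count):
    """List the CPU numbers below cpu_count that have no tracefile."""
    return [n for n in range(cpu_count) if n not in found]
```

For a trace stored in /home/john/foo, cpu_tracefiles("/home/john/foo") would
return one entry per file under foo/cpu.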
59 | |
60 | <P> |
61 | The files in the control directory also follow the .trace format. |
62 | The "facilities" file only contains "builtin" facility_load events |
63 | and is used to determine the facilities used and the code range assigned |
64 | to each facility. The other control files contain the initial system |
65 | state and various subsequent important events, for example process |
66 | creation and exit. The advantage of placing such subsequent events |
67 | in control trace files instead of (or in addition to) the per-cpu |
68 | trace files is that they may be accessed more quickly and conveniently |
69 | and that they may be kept even when the per cpu files are overwritten |
70 | in "flight recorder mode". |
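<P>
Once the facility_load events of the "facilities" file have been read, finding
the facility that owns an event type id is a range lookup on the base codes.
The sketch below, in Python, assumes the facilities are available as
(base_code, name) pairs; the function name find_facility and the sample base
codes are hypothetical. Since only the base code of each range is known here,
the upper end of a range is taken to be the next facility's base code.

```python
import bisect

def find_facility(facilities, event_id):
    """Map an event type id to (facility_name, id_within_facility).

    facilities is a list of (base_code, name) pairs sorted by base_code.
    """
    bases = [base for base, _ in facilities]
    index = bisect.bisect_right(bases, event_id) - 1
    if index < 0:
        raise KeyError("event id %d is below every facility base code" % event_id)
    base, name = facilities[index]
    return name, event_id - base

# Hypothetical base codes, for illustration only.
FACILITIES = [(0, "builtin"), (4, "core"), (20, "net")]
```

With these sample values, find_facility(FACILITIES, 21) identifies event 1 of
the "net" facility.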
71 | |
72 | <P> |
73 | The info directory contains, in system.xml, a description of the system on which |
74 | the trace was created, as well as user annotations in bookmarks.xml. |
75 | This directory may also contain various information about the trace, generated |
76 | during trace analysis (statistics, index...). |
77 | |
78 | |
79 | <H2>Trace format</H2> |
80 | |
81 | <P> |
82 | Each tracefile is divided into equal-size blocks, with a uint32 at the block |
83 | end giving the offset to the last event in the block. Events are packed |
84 | sequentially in the block starting at offset 0 with a "block_start" event |
85 | and ending, at the offset stored in the last 4 bytes of the block, with a |
86 | block_end event. Both the block_start and block_end events |
87 | contain the kernel timestamp (timespec binary structure, |
88 | uint32 seconds, uint32 nanoseconds), the cycle counter (uint64 cycles), |
89 | and the buffer id (uint64). |
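<P>
The block layout can be illustrated with a short Python sketch that splits a
tracefile image into blocks and reads the trailing uint32 of each one. It
assumes little-endian data (the actual byte order is given by the endian
attribute of system.xml) and that the stored offset is relative to the start
of the block. The function names are hypothetical.

```python
import struct

def split_blocks(data, block_size):
    """Yield each fixed-size block of a tracefile image held in memory."""
    if block_size <= 0 or len(data) % block_size != 0:
        raise ValueError("tracefile length is not a multiple of the block size")
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]

def last_event_offset(block, byte_order="<"):
    """Return the offset of the last event (block_end) inside one block.

    The value is the uint32 stored in the last 4 bytes of the block.
    byte_order is a struct prefix: "<" for little-endian, ">" for big-endian.
    """
    if len(block) < 4:
        raise ValueError("block too small to hold the trailing uint32")
    (offset,) = struct.unpack_from(byte_order + "I", block, len(block) - 4)
    if offset > len(block) - 4:
        raise ValueError("last-event offset points outside the event area")
    return offset
```

The block size passed to split_blocks would come from the ltt_block_size
attribute of system.xml.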
90 | |
91 | <P> |
92 | Each event consists of an event type id (a uint16 equal to the event type id |
93 | within the facility plus the facility base id), a time delta (a uint32 in cycles |
94 | or nanoseconds, depending on configuration, measured from the last time value |
95 | found in the block header or in a "time_heartbeat" event) and the event type specific data. |
96 | All values are packed in native byte order binary format. |
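<P>
As an illustration of the event header, the Python sketch below decodes the
uint16 event type id and the uint32 time delta. It assumes the two fields are
stored back to back with no padding, and defaults to little-endian; the
function name read_event_header is hypothetical.

```python
import struct

def read_event_header(buf, offset, byte_order="<"):
    """Decode the header of the event starting at offset.

    Returns (event_id, time_delta, payload_offset), where payload_offset is
    the position of the event type specific data.
    """
    header = struct.Struct(byte_order + "HI")  # uint16 id, uint32 delta, unpadded
    event_id, time_delta = header.unpack_from(buf, offset)
    return event_id, time_delta, offset + header.size
```

The event time is then the last time value seen (block header or
time_heartbeat) plus time_delta, in whichever unit the trace was configured to
use.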
97 | |
98 | |
99 | <H2>System description</H2> |
100 | |
101 | <P> |
102 | The system type description, in system.xml, looks like: |
103 | |
104 | <PRE><TT> |
105 | <system |
106 | node_name="vaucluse" |
107 | domainname="polymtl.ca" |
108 | cpu="4" |
109 | arch_size="ILP32" |
110 | endian="little" |
111 | kernel_name="Linux" |
112 | kernel_release="2.4.18-686-smp" |
113 | kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002" |
114 | machine="i686" |
115 | processor="unknown" |
116 | hardware_platform="unknown" |
117 | operating_system="Linux" |
118 | ltt_major_version="2" |
119 | ltt_minor_version="0" |
120 | ltt_block_size="100000" |
121 | > |
122 | Some comments about the system |
123 | </system> |
124 | </TT></PRE> |
125 | |
126 | <P> |
127 | The system attributes kernel_name, node_name, kernel_release, |
128 | kernel_version, machine, processor, hardware_platform and operating_system |
129 | come from the uname(1) program. The domainname attribute is obtained from |
130 | the "hostname --domain" command. The arch_size attribute is one of |
131 | LP32, ILP32, LP64 or ILP64 and names which of int (I), long (L) and pointer (P) |
132 | have the stated width in bits. The endian attribute is "little" or "big". |
133 | While the arch_size and endian attributes could be deduced from the platform |
134 | type, stating them explicitly allows analysing traces from as yet unknown |
135 | platforms. The cpu attribute specifies the maximum number of processors in |
136 | the system; only tracefiles 0 to this maximum - 1 may exist in the cpu |
137 | directory. |
138 | |
139 | <P> |
140 | The text enclosed in the system element may further describe the |
141 | traced system. |
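<P>
A reader needs a few of these attributes before it can decode any tracefile.
The following Python sketch extracts them with the standard
xml.etree.ElementTree module. It assumes every attribute value is quoted, as
XML requires. The function name parse_system and the returned dictionary keys
are hypothetical; the widths in ARCH_SIZES are the conventional meaning of the
four data model names.

```python
import xml.etree.ElementTree as ET

# Width in bytes of int, long and pointer for each arch_size value.
ARCH_SIZES = {
    "LP32": {"int": 2, "long": 4, "pointer": 4},
    "ILP32": {"int": 4, "long": 4, "pointer": 4},
    "LP64": {"int": 4, "long": 8, "pointer": 8},
    "ILP64": {"int": 8, "long": 8, "pointer": 8},
}

def parse_system(xml_text):
    """Extract the attributes needed to decode tracefiles from system.xml."""
    root = ET.fromstring(xml_text)
    if root.tag != "system":
        raise ValueError("expected a <system> element")
    attrs = root.attrib
    if attrs["endian"] not in ("little", "big"):
        raise ValueError("endian must be 'little' or 'big'")
    return {
        "byte_order": "<" if attrs["endian"] == "little" else ">",
        "cpu": int(attrs["cpu"]),
        "block_size": int(attrs["ltt_block_size"]),
        "sizes": ARCH_SIZES[attrs["arch_size"]],
        "comment": (root.text or "").strip(),
    }
```

The byte_order value is a prefix suitable for Python's struct module, so it can
be passed directly to code that unpacks binary fields.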
142 | |
143 | |
144 | <H2>Event type descriptions</H2> |
145 | |
146 | <P> |
147 | A facility contains the descriptions of several event types. When a structure |
148 | is reused in several event types, a named type is defined and may be referenced |
149 | by several other event types or named types. |
150 | |
151 | <PRE><TT> |
152 | <facility name=facility_name> |
153 | <description>Some text</description> |
154 | <event name=eventtype_name> |
155 | <description>Some text</description> |
156 | --type structure-- |
157 | </event> |
158 | ... |
159 | <type name=type_name> |
160 | --type structure-- |
161 | </type> |
162 | </facility> |
163 | </TT></PRE> |
164 | |
165 | <P> |
166 | The type structure may be one of the following primitive type elements. |
167 | Whenever the keyword isize is used, the allowed values are |
168 | short, medium, long, 1, 2, 4, 8, indicating the size in bytes. |
169 | The fsize keyword represents one of medium, long, 4 and 8 bytes. |
170 | |
171 | <PRE><TT> |
172 | <int size=isize format="printf format"/> |
173 | |
174 | <uint size=isize format="printf format"/> |
175 | |
176 | <float size=fsize format="printf format"/> |
177 | |
178 | <string format="printf format"/> |
179 | |
180 | <enum size=isize format="printf format">label1 label2 ...</enum> |
181 | </TT></PRE> |
182 | |
183 | <P> |
184 | The string is null terminated. For the enumeration, the size of the integer |
185 | used for its representation is specified. |
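<P>
The primitive types map naturally onto fixed-width binary reads. The Python
sketch below handles int, uint and string for the numeric sizes 1, 2, 4 and 8
only; how the symbolic sizes short, medium and long translate to bytes is not
covered. It assumes unpadded little-endian data by default and ASCII strings,
and the function names are hypothetical.

```python
import struct

# struct codes for signed integers of each byte size; uppercase is unsigned.
INT_CODES = {1: "b", 2: "h", 4: "i", 8: "q"}

def read_int(buf, offset, size, signed=True, byte_order="<"):
    """Read an int or uint of the given byte size; return (value, next_offset)."""
    code = INT_CODES[size]
    if not signed:
        code = code.upper()
    (value,) = struct.unpack_from(byte_order + code, buf, offset)
    return value, offset + size

def read_string(buf, offset):
    """Read a null terminated string; return (text, next_offset)."""
    end = buf.index(b"\x00", offset)  # raises ValueError if no terminator
    return buf[offset:end].decode("ascii", "replace"), end + 1
```

Each reader returns the offset of the next field, so fields of a struct can be
decoded one after another by chaining the calls.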
186 | |
187 | <P> |
188 | The type structure may also be a compound type. |
189 | |
190 | <PRE><TT> |
191 | <array size=n> --type structure-- </array> |
192 | |
193 | <sequence lengthsize=isize> --type structure-- </sequence> |
194 | |
195 | <struct> |
196 | <field name=field_name> |
197 | <description>Some text</description> |
198 | --type structure-- |
199 | </field> |
200 | ... |
201 | </struct> |
202 | |
203 | <union typecodesize=isize> |
204 | <field name=field_name> |
205 | <description>Some text</description> |
206 | --type structure-- |
207 | </field> |
208 | ... |
209 | </union> |
210 | </TT></PRE> |
211 | |
212 | <P> |
213 | Array is a fixed size array of length size. Sequence is a variable size |
214 | array with its length stored as a prepended uint of length lengthsize. |
215 | A structure is simply an aggregation of fields. A union holds one of its n |
216 | fields (variant record), as indicated by a preceding code (0 to n - 1) |
217 | of the specified size typecodesize. |
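<P>
Sequences and unions both start with a small unsigned integer that controls
how the rest is read. The Python sketch below decodes both, assuming that the
sequence length counts elements (not bytes) and that fields are unpadded. The
function names and the reader callback convention, a function taking
(buf, offset) and returning (value, next_offset), are hypothetical.

```python
import struct

UINT_CODES = {1: "B", 2: "H", 4: "I", 8: "Q"}

def read_uint(buf, offset, size, byte_order="<"):
    """Read an unsigned integer of the given byte size."""
    (value,) = struct.unpack_from(byte_order + UINT_CODES[size], buf, offset)
    return value, offset + size

def read_sequence(buf, offset, lengthsize, read_element, byte_order="<"):
    """Read a length-prefixed sequence; return (items, next_offset)."""
    count, offset = read_uint(buf, offset, lengthsize, byte_order)
    items = []
    for _ in range(count):
        item, offset = read_element(buf, offset)
        items.append(item)
    return items, offset

def read_union(buf, offset, typecodesize, field_readers, byte_order="<"):
    """Read a union; field_readers is a list of (name, reader) in field order.

    Returns (field_name, value, next_offset) for the field selected by the
    preceding type code.
    """
    code, offset = read_uint(buf, offset, typecodesize, byte_order)
    if code >= len(field_readers):
        raise ValueError("union type code %d out of range" % code)
    name, reader = field_readers[code]
    value, offset = reader(buf, offset)
    return name, value, offset
```

An array would be read the same way as a sequence, except that the element
count comes from the size attribute of the type description instead of from
the data.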
218 | |
219 | <P> |
220 | Finally the type structure may be defined by referencing a named type. |
221 | |
222 | <PRE><TT> |
223 | <typeref name=type_name/> |
224 | </TT></PRE> |
225 | |
226 | <H2>Builtin events</H2> |
227 | |
228 | <P> |
229 | The facility named "builtin" is always present and contains at least the |
230 | following event types. |
231 | |
232 | <PRE><TT> |
233 | <event name=facility_load> |
234 | <description>Facility used in the trace</description> |
235 | <struct> |
236 | <field name="name"><string/></field> |
237 | <field name="checksum"><uint size=4/></field> |
238 | <field name="base_code"><uint size=4/></field> |
239 | </struct> |
240 | </event> |
241 | |
242 | <event name=block_start> |
243 | <description>Block start timestamp</description> |
244 | <typeref name=block_timestamp/> |
245 | </event> |
246 | |
247 | <event name=block_end> |
248 | <description>Block end timestamp</description> |
249 | <typeref name=block_timestamp/> |
250 | </event> |
251 | |
252 | <event name=time_heartbeat> |
253 | <description>System time values sent periodically to minimize cycle counter |
254 | drift with respect to real time clock and to detect cycle counter |
255 | rollovers |
256 | </description> |
257 | <typeref name=timestamp/> |
258 | </event> |
259 | |
260 | <type name=block_timestamp> |
261 | <struct> |
262 | <field name=timestamp><typeref name=timestamp/></field> |
263 | <field name=block_id><uint size=4/></field> |
264 | </struct> |
265 | </type> |
266 | |
267 | <type name=timestamp> |
268 | <struct> |
269 | <field name=time><typeref name=timespec/></field> |
270 | <field name="cycle_count"><uint size=8/></field> |
271 | </struct> |
272 | </type> |
273 | |
274 | <type name=timespec> |
275 | <struct> |
276 | <field name="seconds"><uint size=4/></field> |
277 | <field name="nanoseconds"><uint size=4/></field> |
278 | </struct> |
279 | </type> |
280 | </TT></PRE> |
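<P>
The builtin definitions translate directly into binary layouts. The Python
sketch below decodes the payload of a facility_load event and a
block_timestamp structure, following the field sizes given in the XML
definitions (including the 4-byte block_id). It assumes unpadded fields,
little-endian data by default and ASCII facility names; the function names are
hypothetical.

```python
import struct

def read_facility_load(buf, offset=0, byte_order="<"):
    """Decode facility_load: string name, uint32 checksum, uint32 base_code."""
    end = buf.index(b"\x00", offset)  # the name is null terminated
    name = buf[offset:end].decode("ascii", "replace")
    checksum, base_code = struct.unpack_from(byte_order + "II", buf, end + 1)
    event = {"name": name, "checksum": checksum, "base_code": base_code}
    return event, end + 1 + 8

def read_block_timestamp(buf, offset=0, byte_order="<"):
    """Decode block_timestamp: timespec, uint64 cycle_count, uint32 block_id."""
    layout = struct.Struct(byte_order + "IIQI")
    seconds, nanoseconds, cycle_count, block_id = layout.unpack_from(buf, offset)
    stamp = {
        "seconds": seconds,
        "nanoseconds": nanoseconds,
        "cycle_count": cycle_count,
        "block_id": block_id,
    }
    return stamp, offset + layout.size
```

A time_heartbeat payload would use the same layout without the trailing
block_id field.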
281 | |
282 | <H2>Control files</H2> |
283 | |
284 | <P> |
285 | The interrupts file reflects the content of the /proc/interrupts system file. |
286 | It contains one event describing each interrupt. At trace start, events are |
287 | generated describing all the current interrupts. If the assignment of |
288 | interrupts changes later, due to devices or device drivers being activated or |
289 | deactivated, additional events may be added to the file. Each interrupt |
290 | event has the following structure. |
291 | |
292 | <PRE><TT> |
293 | <event name=interrupt> |
294 | <description>Interrupt request number assignment</description> |
295 | <struct> |
296 | <field name="number"><uint size=4/></field> |
297 | <field name="count"><uint size=4/></field> |
298 | <field name="controller"><string/></field> |
299 | <field name="name"><string/></field> |
300 | </struct> |
301 | </event> |
302 | </TT></PRE> |
303 | |
304 | <P> |
305 | The processes file contains the list of processes already created when the |
306 | trace starts. Each event describing a process is modeled after the |
307 | /proc/self/status system file. The number of fields in this event is |
308 | expected to grow in the future to include groups, signal masks, |
309 | open file descriptors and address maps. |
310 | |
311 | <PRE><TT> |
312 | <event name=process> |
313 | <description>Existing process</description> |
314 | <struct> |
315 | <field name="name"><string/></field> |
316 | <field name="pid"><uint size=4/></field> |
317 | <field name="ppid"><uint size=4/></field> |
318 | <field name="tracer_pid"><uint size=4/></field> |
319 | <field name="uid"><uint size=4/></field> |
320 | <field name="euid"><uint size=4/></field> |
321 | <field name="suid"><uint size=4/></field> |
322 | <field name="fsuid"><uint size=4/></field> |
323 | <field name="gid"><uint size=4/></field> |
324 | <field name="egid"><uint size=4/></field> |
325 | <field name="sgid"><uint size=4/></field> |
326 | <field name="fsgid"><uint size=4/></field> |
327 | <field name="state"><enum size=4> |
328 | Running WaitInterruptible WaitUninterruptible Zombie Traced Paging |
329 | </enum></field> |
330 | </struct> |
331 | </event> |
332 | </TT></PRE> |
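<P>
The state field shows how an enumeration is meant to be displayed: the stored
integer selects one of the labels listed in the enum element. The Python
sketch below assumes that labels are numbered from 0 in the order they appear
and that the attribute values are quoted; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def enum_labels(enum_xml):
    """Return the whitespace-separated labels of an <enum> element."""
    element = ET.fromstring(enum_xml)
    if element.tag != "enum":
        raise ValueError("expected an <enum> element")
    return (element.text or "").split()

def enum_label(labels, value):
    """Translate a stored integer into its label, tolerating unknown values."""
    if 0 <= value < len(labels):
        return labels[value]
    return "unknown(%d)" % value

STATE_XML = """<enum size="4">
Running WaitInterruptible WaitUninterruptible Zombie Traced Paging
</enum>"""
```

Under these assumptions, a stored value of 3 in the state field would be
displayed as Zombie.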
333 | |
334 | <H2>Facilities</H2> |
335 | |
336 | <P> |
337 | Facilities define the granularity at which events are grouped for filtering, activation |
338 | and compilation. Each facility costs a table entry in the kernel (name, |
339 | checksum, event type code range), somewhere between 20 and 30 bytes. Having |
340 | one facility per tracing statement in the kernel would be too much (assuming |
341 | that they eventually are routinely inserted in the kernel code and replace |
342 | the 80000+ printk statements in some proportion). However, having a few |
343 | facilities, up to a few tens, would make sense. |
344 | |
345 | <P> |
346 | The "builtin" facility contains a small number of predefined events which must |
347 | always exist. The "core" facility contains a small subset of OS events which |
348 | are almost always of interest (scheduling, interrupts, faults, system calls). |
349 | Then, specialized facilities may exist for each subsystem (network, disks, |
350 | USB, SCSI...). |
351 | |
352 | |
353 | <H2>Bookmarks</H2> |
354 | |
355 | <P> |
356 | Bookmarks are user supplied information added to a trace. They contain user |
357 | annotations attached to a time interval. |
358 | |
359 | <PRE><TT> |
360 | <bookmarks> |
361 | <location name=name cpu=n start_time=t end_time=t>Some text</location> |
362 | ... |
363 | </bookmarks> |
364 | </TT></PRE> |
365 | |
366 | <P> |
367 | The interval is defined using either "time=" or "start_time=" and |
368 | "end_time=", or "cycle=" or "start_cycle=" and "end_cycle=". |
369 | The time is in seconds with decimals up to nanoseconds and cycle counts |
370 | are unsigned integers with a 64-bit range. The cpu attribute is optional. |
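<P>
The alternative ways of giving an interval can be normalised when the
bookmarks are loaded. The Python sketch below returns a (unit, start, end)
tuple for a location element, using Decimal for times so that nanosecond
digits are not lost to floating point rounding. It assumes quoted attribute
values; the function name bookmark_interval is hypothetical.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

def bookmark_interval(location):
    """Return (unit, start, end) for a <location> element.

    unit is "time" (seconds, as Decimal) or "cycle" (int). A single
    time= or cycle= attribute yields an interval of zero length.
    """
    attrs = location.attrib
    if "time" in attrs:
        t = Decimal(attrs["time"])
        return "time", t, t
    if "start_time" in attrs and "end_time" in attrs:
        return "time", Decimal(attrs["start_time"]), Decimal(attrs["end_time"])
    if "cycle" in attrs:
        c = int(attrs["cycle"])
        return "cycle", c, c
    if "start_cycle" in attrs and "end_cycle" in attrs:
        return "cycle", int(attrs["start_cycle"]), int(attrs["end_cycle"])
    raise ValueError("location has no complete time or cycle interval")

def load_bookmarks(xml_text):
    """Return a list of (name, cpu or None, interval, text) tuples."""
    root = ET.fromstring(xml_text)
    result = []
    for loc in root.iter("location"):
        cpu = loc.get("cpu")
        result.append((loc.get("name"), None if cpu is None else int(cpu),
                       bookmark_interval(loc), (loc.text or "").strip()))
    return result
```

Because the cpu attribute is optional, it is reported as None when absent.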
371 | |
372 | </BODY> |
373 | </HTML> |
374 | |
375 | |
376 | |
377 | |