<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>The new LTT trace format</title>
</head>
<body>

<h1>The new LTT trace format</h1>

<P>
A trace is contained in a directory tree. To send a trace remotely,
the directory tree may be tar-gzipped. Trace foo, placed in the home
directory of user john, /home/john, would have the following content:

<PRE><TT>
$ cd /home/john
$ tree foo
foo/
|-- eventdefs
|   |-- core.xml
|   |-- net.xml
|   |-- ipv4.xml
|   `-- ide.xml
|-- info
|   |-- bookmarks.xml
|   `-- system.xml
|-- control
|   |-- facilities
|   |-- interrupts
|   `-- processes
`-- cpu
    |-- 0
    |-- 1
    |-- 2
    `-- 3
</TT></PRE>

<P>
The eventdefs directory contains the event descriptions for all the
facilities used. The syntax is a simple subset of XML; XML is widely
known and easily parsed or hand edited. Each file contains one or more
<FACILITY NAME=name>...</FACILITY> elements. Several
facilities may have the same name but different content (and thus
generate different checksums). This typically happens when, while tracing
is enabled, a module using the named facility is unloaded, modified
(along with the description of some events), recompiled and reloaded.
The trace then contains events from two different, similarly named,
facility versions.

<P>
A small number of events are predefined as part of the "builtin" facility
and are not described in the eventdefs directory. These "builtin" events
include "facility_load", "block_start", "block_end" and "time_heartbeat".

<P>
The cpu directory contains a tracefile for each CPU, numbered from 0,
in .trace format. A uniprocessor trace thus contains only the file cpu/0.
A multiprocessor with some unused (possibly hotplug) CPU slots may have some
unused CPU numbers. For instance, an 8-way SMP board with 6 CPUs randomly
installed may produce tracefiles named 0, 1, 2, 4, 6, 7.

<P>
The files in the control directory also follow the .trace format.
The "facilities" file contains only "builtin" facility_load events
and is used to determine the facilities used and the code range assigned
to each facility. The other control files contain the initial system
state and various important subsequent events, for example process
creation and exit. The advantage of placing such subsequent events
in control trace files instead of (or in addition to) the per-CPU
trace files is that they may be accessed more quickly and conveniently,
and that they may be kept even when the per-CPU files are overwritten
in "flight recorder mode".

<P>
The info directory contains, in system.xml, a description of the system on
which the trace was created and, in bookmarks.xml, various user annotations.
This directory may also contain other information about the trace generated
during trace analysis (statistics, index...).


<H2>Trace format</H2>

<P>
Each tracefile is divided into equal-size blocks, with a uint32 at the end
of each block giving the offset of the last event in the block. Events are
packed sequentially in the block, starting at offset 0 with a "block_start"
event and ending, at the offset stored in the last 4 bytes of the block, with
a "block_end" event. Both the block_start and block_end events
contain the kernel timestamp (timespec binary structure,
uint32 seconds, uint32 nanoseconds), the cycle counter (uint64 cycles),
and the buffer id (uint64).

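<P>
As a rough illustration of the block layout described above, the following
Python sketch splits a tracefile into blocks and reads the trailing uint32 of
each one. The function name iter_blocks is hypothetical, and the sketch
assumes that the block size and byte order are supplied by the caller
(presumably from ltt_block_size and endian in system.xml) and that the stored
offset is relative to the start of the block.

```python
import struct

def iter_blocks(data, block_size, endian="little"):
    """Yield (block_index, block_end_offset) for each block in a tracefile.

    block_end_offset is the uint32 stored in the last 4 bytes of the block.
    It is assumed here to be relative to the start of the block.
    """
    prefix = "<" if endian == "little" else ">"
    if block_size <= 4 or len(data) % block_size != 0:
        raise ValueError("tracefile is not a whole number of blocks")
    for index in range(len(data) // block_size):
        block = data[index * block_size:(index + 1) * block_size]
        # The trailer occupies the last 4 bytes of the block.
        (end_offset,) = struct.unpack(prefix + "I", block[-4:])
        if end_offset >= block_size - 4:
            raise ValueError("block_end offset points outside the block")
        yield index, end_offset
```

<P>
A reader could then decode the block_start event at offset 0 of each block
and the block_end event at the returned offset.
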
<P>
Each event consists of an event type id (a uint16 equal to the event type id
within the facility plus the facility base id), a time delta (a uint32 in
cycles or nanoseconds, depending on configuration, measured from the last
time value, found in the block header or in a "time_heartbeat" event) and the
event-type-specific data. All values are packed in binary format, in native
byte order.


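<P>
The event header described above might be decoded as in the following Python
sketch. The name parse_event_header is hypothetical, and the sketch assumes
that "packed" means no alignment padding between the uint16 id and the uint32
time delta, so that the header occupies 6 bytes.

```python
import struct

def parse_event_header(block, offset, endian="little"):
    """Return (event_id, time_delta, payload_offset) for the event at offset.

    Assumes a packed 6-byte header: uint16 event type id followed by a
    uint32 time delta, both in the byte order of the traced machine.
    """
    # An explicit "<" or ">" prefix selects standard sizes and no padding.
    prefix = "<" if endian == "little" else ">"
    event_id, time_delta = struct.unpack_from(prefix + "HI", block, offset)
    return event_id, time_delta, offset + 6
```

<P>
The returned payload offset is where the event-type-specific data starts; its
length depends on the type structure declared for that event.
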
<H2>System description</H2>

<P>
The system type description, in system.xml, looks like:

<PRE><TT>
<system
  node_name="vaucluse"
  domainname="polymtl.ca"
  cpu="4"
  arch_size="ILP32"
  endian="little"
  kernel_name="Linux"
  kernel_release="2.4.18-686-smp"
  kernel_version="#1 SMP Sun Apr 14 12:07:19 EST 2002"
  machine="i686"
  processor="unknown"
  hardware_platform="unknown"
  operating_system="Linux"
  ltt_major_version="2"
  ltt_minor_version="0"
  ltt_block_size="100000"
>
Some comments about the system
</system>
</TT></PRE>

<P>
The system attributes kernel_name, node_name, kernel_release,
kernel_version, machine, processor, hardware_platform and operating_system
come from the uname(1) program. The domainname attribute is obtained from
the "hostname --domain" command. The arch_size attribute is one of
LP32, ILP32, LP64 or ILP64 and specifies the length in bits of integers (I),
longs (L) and pointers (P). The endian attribute is "little" or "big".
While the arch_size and endian attributes could be deduced from the platform
type, stating them explicitly allows analysing traces from as yet unknown
platforms. The cpu attribute specifies the maximum number of processors in
the system; only tracefiles 0 to this maximum - 1 may exist in the cpu
directory.

<P>
Within the system element, the enclosed text may further describe the
system traced.


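<P>
To show how an analysis tool might use these attributes, here is a minimal
Python sketch that reads a system element and derives the byte order, the
integer sizes and the tracefile names that may exist. The function name
read_system and the DATA_MODELS table are hypothetical; the sizes listed are
those of the usual C data models, and the sketch assumes that every attribute
value is quoted so that a standard XML parser accepts the file.

```python
import xml.etree.ElementTree as ET

# Sizes in bytes of (int, long, pointer) for the usual C data models.
DATA_MODELS = {
    "LP32": (2, 4, 4),
    "ILP32": (4, 4, 4),
    "LP64": (4, 8, 8),
    "ILP64": (8, 8, 8),
}

def read_system(xml_text):
    """Extract the attributes needed to decode a trace from system.xml text."""
    root = ET.fromstring(xml_text)
    if root.tag != "system":
        raise ValueError("expected a <system> element")
    int_size, long_size, pointer_size = DATA_MODELS[root.get("arch_size")]
    cpu = int(root.get("cpu"))
    return {
        "byteorder": root.get("endian"),  # "little" or "big"
        "int_size": int_size,
        "long_size": long_size,
        "pointer_size": pointer_size,
        # Only tracefiles 0 .. cpu - 1 may exist in the cpu directory.
        "cpu_files": [str(n) for n in range(cpu)],
        "block_size": int(root.get("ltt_block_size")),
        "comment": (root.text or "").strip(),
    }
```

<P>
The byteorder value can be passed directly to int.from_bytes, or mapped to
the "<" and ">" prefixes of the struct module.
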
<H2>Event type descriptions</H2>

<P>
A facility contains the descriptions of several event types. When a structure
is reused in several event types, a named type is defined and may be referenced
by several other event types or named types.

<PRE><TT>
<facility name=facility_name>
  <description>Some text</description>
  <event name=eventtype_name>
    <description>Some text</description>
    --type structure--
  </event>
  ...
  <type name=type_name>
    --type structure--
  </type>
</facility>
</TT></PRE>

<P>
The type structure may be one of the following primitive type elements.
Whenever the keyword isize is used, the allowed values are
short, medium, long, 1, 2, 4, 8, indicating the size in bytes.
The fsize keyword is one of medium, long, 4 or 8.

<PRE><TT>
<int size=isize format="printf format"/>

<uint size=isize format="printf format"/>

<float size=fsize format="printf format"/>

<string format="printf format"/>

<enum size=isize format="printf format">label1 label2 ...</enum>
</TT></PRE>

<P>
The string is null terminated. For the enumeration, the size of the integer
used for its representation is specified.

<P>
The type structure may also be a compound type.

<PRE><TT>
<array size=n> --type structure-- </array>

<sequence lengthsize=isize> --type structure-- </sequence>

<struct>
  <field name=field_name>
    <description>Some text</description>
    --type structure--
  </field>
  ...
</struct>

<union typecodesize=isize>
  <field name=field_name>
    <description>Some text</description>
    --type structure--
  </field>
  ...
</union>
</TT></PRE>

<P>
An array is a fixed-size array of length size. A sequence is a variable-size
array with its length stored as a prepended uint of size lengthsize.
A structure is simply an aggregation of fields. A union is one of its n
fields (variant record), as indicated by a preceding code (0 to n - 1)
of the specified size typecodesize.

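<P>
As an example of decoding a compound type, the following Python sketch reads a
sequence whose elements are unsigned integers. The function name
decode_uint_sequence is hypothetical, numeric sizes are assumed (the symbolic
sizes short, medium and long would first have to be resolved to byte counts),
and elements are assumed to follow the length without padding.

```python
def decode_uint_sequence(buf, offset, lengthsize, elemsize, endian="little"):
    """Decode a sequence of uint elements stored at offset in buf.

    The sequence starts with its element count, a uint of lengthsize bytes,
    followed by that many uint elements of elemsize bytes each.
    Returns (values, offset_after_sequence).
    """
    if offset + lengthsize > len(buf):
        raise ValueError("buffer too short for the sequence length")
    count = int.from_bytes(buf[offset:offset + lengthsize], endian)
    offset += lengthsize
    if offset + count * elemsize > len(buf):
        raise ValueError("buffer too short for the sequence elements")
    values = []
    for _ in range(count):
        values.append(int.from_bytes(buf[offset:offset + elemsize], endian))
        offset += elemsize
    return values, offset
```

<P>
A union could be handled along the same lines: read the type code of size
typecodesize first, then decode only the field it selects.
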
<P>
Finally, the type structure may be defined by referencing a named type.

<PRE><TT>
<typeref name=type_name/>
</TT></PRE>

<H2>Builtin events</H2>

<P>
The facility named "builtin" is always present and contains at least the
following event types.

<PRE><TT>
<event name="facility_load">
  <description>Facility used in the trace</description>
  <struct>
    <field name="name"><string/></field>
    <field name="checksum"><uint size="4"/></field>
    <field name="base_code"><uint size="4"/></field>
  </struct>
</event>

<event name="block_start">
  <description>Block start timestamp</description>
  <typeref name="block_timestamp"/>
</event>

<event name="block_end">
  <description>Block end timestamp</description>
  <typeref name="block_timestamp"/>
</event>

<event name="time_heartbeat">
  <description>System time values sent periodically to minimize cycle counter
    drift with respect to real time clock and to detect cycle counter
    rollovers
  </description>
  <typeref name="timestamp"/>
</event>

<type name="block_timestamp">
  <struct>
    <field name="timestamp"><typeref name="timestamp"/></field>
    <field name="block_id"><uint size="4"/></field>
  </struct>
</type>

<type name="timestamp">
  <struct>
    <field name="time"><typeref name="timespec"/></field>
    <field name="cycle_count"><uint size="8"/></field>
  </struct>
</type>

<type name="timespec">
  <struct>
    <field name="seconds"><uint size="4"/></field>
    <field name="nanoseconds"><uint size="4"/></field>
  </struct>
</type>
</TT></PRE>

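<P>
The facility_load definition above can be turned into a small decoder, as in
this Python sketch. Both function names are hypothetical. The sketch assumes
that the fields are stored without alignment padding and that facility names
are ASCII. Because facility_load carries only a base code, find_facility
further assumes that an event id belongs to the facility with the largest
base code not exceeding it; the format description does not state this rule.

```python
import struct

def decode_facility_load(payload, offset=0, endian="little"):
    """Decode a facility_load payload: string name, uint32 checksum, uint32 base_code.

    Returns (facility_dict, offset_after_payload).
    """
    end = payload.index(b"\0", offset)  # the string is null terminated
    name = payload[offset:end].decode("ascii")
    prefix = "<" if endian == "little" else ">"
    checksum, base_code = struct.unpack_from(prefix + "II", payload, end + 1)
    facility = {"name": name, "checksum": checksum, "base_code": base_code}
    return facility, end + 1 + 8

def find_facility(facilities, event_id):
    """Return (facility, event id within the facility) for a global event id.

    Assumption: the owner is the facility with the largest base_code that
    does not exceed event_id.
    """
    candidates = [f for f in facilities if f["base_code"] <= event_id]
    if not candidates:
        raise LookupError("no facility covers event id %d" % event_id)
    owner = max(candidates, key=lambda f: f["base_code"])
    return owner, event_id - owner["base_code"]
```

<P>
Keeping the checksum alongside the name makes it possible to tell apart two
versions of a facility that share a name.
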
<H2>Control files</H2>

<P>
The interrupts file reflects the content of the /proc/interrupts system file.
It contains one event describing each interrupt. At trace start, events are
generated describing all the current interrupts. If the assignment of
interrupts changes later, due to devices or device drivers being activated or
deactivated, additional events may be added to the file. Each interrupt
event has the following structure.

<PRE><TT>
<event name="interrupt">
  <description>Interrupt request number assignment</description>
  <struct>
    <field name="number"><uint size="4"/></field>
    <field name="count"><uint size="4"/></field>
    <field name="controller"><string/></field>
    <field name="name"><string/></field>
  </struct>
</event>
</TT></PRE>

<P>
The processes file contains the list of processes already created when the
trace starts. Each event describing a process is modeled after the
/proc/self/status system file. The number of fields in this event is
expected to grow in the future to include groups, signal masks,
open file descriptors and address maps.

<PRE><TT>
<event name="process">
  <description>Existing process</description>
  <struct>
    <field name="name"><string/></field>
    <field name="pid"><uint size="4"/></field>
    <field name="ppid"><uint size="4"/></field>
    <field name="tracer_pid"><uint size="4"/></field>
    <field name="uid"><uint size="4"/></field>
    <field name="euid"><uint size="4"/></field>
    <field name="suid"><uint size="4"/></field>
    <field name="fsuid"><uint size="4"/></field>
    <field name="gid"><uint size="4"/></field>
    <field name="egid"><uint size="4"/></field>
    <field name="sgid"><uint size="4"/></field>
    <field name="fsgid"><uint size="4"/></field>
    <field name="state"><enum size="4">
      Running WaitInterruptible WaitUninterruptible Zombie Traced Paging
    </enum></field>
  </struct>
</event>
</TT></PRE>

<H2>Facilities</H2>

<P>
Facilities define the granularity at which events are grouped for filtering,
activation and compilation. Each facility costs a table entry in the kernel
(name, checksum, event type code range), somewhere between 20 and 30 bytes.
Having one facility per tracing statement in the kernel would be too much
(assuming that tracing statements eventually are routinely inserted in the
kernel code and replace the 80000+ printk statements in some proportion).
However, having a few facilities, up to a few tens, would make sense.

<P>
The "builtin" facility contains a small number of predefined events which must
always exist. The "core" facility contains a small subset of OS events which
are almost always of interest (scheduling, interrupts, faults, system calls).
Then, specialized facilities may exist for each subsystem (network, disks,
USB, SCSI...).


<H2>Bookmarks</H2>

<P>
Bookmarks are user-supplied information added to a trace. They contain user
annotations attached to a time interval.

<PRE><TT>
<bookmarks>
  <location name=name cpu=n start_time=t end_time=t>Some text</location>
  ...
</bookmarks>
</TT></PRE>

<P>
The interval is defined using either "time=", or "start_time=" and
"end_time=", or "cycle=", or "start_cycle=" and "end_cycle=".
The time is in seconds with decimals up to nanoseconds, and cycle counts
are unsigned integers with a 64-bit range. The cpu attribute is optional.

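<P>
The interval rules above could be applied as in the following Python sketch,
which reads a bookmarks element and normalizes each location to a start and
an end value. The function name read_bookmarks is hypothetical, and attribute
values are assumed to be quoted. Times are kept as Decimal values because a
binary float cannot hold a large number of seconds together with nanosecond
decimals exactly.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

def read_bookmarks(xml_text):
    """Return a list of dicts, one per <location> element."""
    bookmarks = []
    for location in ET.fromstring(xml_text).iter("location"):
        attrs = location.attrib
        # A location is expressed either in time or in cycle counts.
        if "time" in attrs or "start_time" in attrs:
            unit, convert = "time", Decimal
        else:
            unit, convert = "cycle", int
        if unit in attrs:
            # A single instant: start and end coincide.
            start = end = convert(attrs[unit])
        else:
            start = convert(attrs["start_" + unit])
            end = convert(attrs["end_" + unit])
        bookmarks.append({
            "name": attrs.get("name"),
            "cpu": int(attrs["cpu"]) if "cpu" in attrs else None,  # optional
            "unit": unit,
            "start": start,
            "end": end,
            "text": (location.text or "").strip(),
        })
    return bookmarks
```

<P>
A location that carries none of the interval attributes raises KeyError in
this sketch; a real tool would probably report the offending bookmark instead.
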
</BODY>
</HTML>