Commit | Line | Data |
---|---|---|
add19043 BP |
1 | Benjamin Poirier |
2 | benjamin.poirier@polymtl.ca | |
277e5b53 | 3 | 2009, 2010 |
add19043 BP |
4 | |
5 | + About time synchronization | |
6 | This framework performs offline time synchronization. This means that the | |
7 | synchronization is done after tracing is over. It is not the same as online | |
8 | synchronization like what is done by NTP. Nor is it directly influenced by it. | |
9 | ||
10 | Event timestamps are adjusted according to a clock correction function that | |
11 | palliates for initial offset and rate offset (ie. clocks that don't start out | |
12 | at the same value and clocks that don't run at the same speed). It can work on | |
13 | two or more traces. | |
14 | ||
15 | The synchronization is based on relations identified in network traffic | |
16 | between nodes. So, for it to work, there must be traffic exchanged between the | |
17 | nodes. At the moment, this must be TCP traffic. Any kind will do (ssh, http, | |
18 | ...) | |
19 | ||
20 | For scientific information about the algorithms used, see: | |
21 | * Duda, A., Harrus, G., Haddad, Y., and Bernard, G.: Estimating global time in | |
22 | distributed systems, Proc. 7th Int. Conf. on Distributed Computing Systems, | |
23 | Berlin, volume 18, 1987 | |
24 | * Ashton, P.: Algorithms for Off-line Clock Synchronisation, University of | |
25 | Canterbury, December 1995 | |
26 | http://www.cosc.canterbury.ac.nz/research/reports/TechReps/1995/tr_9512.pdf | |
27 | ||
28 | + Using time synchronization | |
29 | ++ Recording traces | |
30 | To use time synchronization you have to record traces on multiple nodes | |
31 | simultaneously with lttng (the tracer). While recording the traces, you have | |
32 | to make sure the following markers are enabled: | |
33 | * dev_receive | |
34 | * dev_xmit_extended | |
35 | * tcpv4_rcv_extended | |
36 | * udpv4_rcv_extended | |
bc7c054d | 37 | You can use 'ltt-armall -n' for this. |
9a9ca632 | 38 | |
add19043 BP |
39 | You also have to make sure there is some TCP traffic between the traced nodes. |
40 | ||
41 | ++ Viewing traces | |
42 | Afterwards, you have to make sure all the traces are accessible from a single | |
43 | machine, where lttv (the viewer) is run. | |
44 | ||
45 | Time synchronization is enabled and controlled via the following lttv options, | |
46 | as seen with "-h": | |
47 | --sync | |
48 | synchronize the time between the traces | |
49 | --sync-stats | |
50 | print statistics about the time synchronization | |
bc7c054d | 51 | See the section "Statistics" for more information. |
add19043 BP |
52 | --sync-null |
53 | read the events but do not perform any processing, this | |
54 | is mostly for performance evaluation | |
bc7c054d BP |
55 | --sync-analysis - argument: chull, linreg, eval |
56 | specify the algorithm to use for event analysis. See the | |
57 | section "Alogrithms". | |
add19043 BP |
58 | --sync-graphs |
59 | output gnuplot graph showing synchronization points | |
60 | --sync-graphs-dir - argument: DIRECTORY | |
61 | specify the directory where to store the graphs, by | |
62 | default in "graphs-<lttv-pid>" | |
63 | ||
64 | To enable synchronization, start lttv with the "--sync" option. It can be | |
65 | used in text mode or in GUI mode. You can add the traces one by one in the GUI | |
66 | but this will recompute the synchronization after every trace that is added. | |
67 | Instead, you can save some time by specifying all your traces on the command | |
68 | line (using -t). | |
69 | ||
70 | Example: | |
71 | lttv-gui -t traces/node1 -t traces/node2 --sync | |
72 | ||
73 | ++ Statistics | |
bc7c054d BP |
74 | The --sync-stats option is useful to know how well the synchronization |
75 | algorithms worked. Here is an example output (with added comments) from a | |
76 | successful chull (one of the synchronization algorithms) run of two traces: | |
add19043 BP |
77 | LTTV processing stats: |
78 | received frames: 452 | |
79 | received frames that are IP: 452 | |
80 | received and processed packets that are TCP: 268 | |
81 | sent packets that are TCP: 275 | |
82 | TCP matching stats: | |
83 | total input and output events matched together to form a packet: 240 | |
84 | Message traffic: | |
85 | 0 - 1 : sent 60 received 60 | |
86 | # Note that 60 + 60 < 240, this is because there was loopback traffic, which is | |
87 | # discarded. | |
88 | Convex hull analysis stats: | |
89 | out of order packets dropped from analysis: 0 | |
90 | Number of points in convex hulls: | |
91 | 0 - 1 : lower half-hull 7 upper half-hull 9 | |
92 | Individual synchronization factors: | |
93 | 0 - 1 : Middle a0= -1.33641e+08 a1= 1 - 4.5276e-08 accuracy 1.35355e-05 | |
94 | a0: -1.34095e+08 to -1.33187e+08 (delta= 907388) | |
95 | a1: 1 -6.81298e-06 to +6.72248e-06 (delta= 1.35355e-05) | |
bc7c054d BP |
96 | # "Middle" is the best type of synchronization for chull. See the section |
97 | # "Convex Hull" below. | |
add19043 BP |
98 | Resulting synchronization factors: |
99 | trace 0 drift= 1 offset= 0 (0.000000) start time= 18.799023588 | |
100 | trace 1 drift= 1 offset= 1.33641e+08 (0.066818) start time= 19.090688494 | |
101 | Synchronization time: | |
102 | real time: 0.113308 | |
103 | user time: 0.112007 | |
104 | system time: 0.000000 | |
105 | ||
106 | ++ Algorithms | |
107 | The synchronization framework is extensible and already includes two | |
108 | algorithms: chull and linreg. You can choose which analysis algorithm to use | |
109 | with the --sync-analysis option. | |
110 | ||
bc7c054d BP |
111 | +++ Convex Hull |
112 | chull, the default analysis module, can provide a garantee that there are no | |
113 | message inversions after synchronization. When printing the statistics, it | |
114 | will print, for each trace, the type of factors found: | |
115 | * "Middle", all went according to assumptions and there will be no message | |
116 | inversions | |
117 | * "Fallback", it was not possible to garantee no message inversion so | |
118 | approximate factors were given instead. This may happen during long running | |
119 | traces where the non-linearity of the clocks was notable. If you can, try to | |
120 | reduce the duration of the trace. (Sometimes this may happen during a trace | |
121 | as short as 120s. but sometimes traces 30 mins. or longer are ok, your | |
122 | milleage may vary). It would also be to improve the algorithms to avoid | |
123 | this, see the "Todo" section. In any case, you may get better results (but | |
124 | still no garantee) by choosing the linreg algorithm instead. | |
125 | * "Absent", the trace pair does not contain common communication events. Are | |
126 | you sure the nodes exchanged TCP traffic during the trace? | |
127 | ||
128 | There are also other, less common, types. See the enum ApproxType in | |
129 | event_analysis_chull.h. | |
130 | ||
131 | +++ Linear Regression | |
29dd991d | 132 | linreg sometimes gives more precise results than chull but it provides no |
bc7c054d BP |
133 | garantee |
134 | ||
135 | +++ Synchronization evaluation | |
136 | eval is a special module, it doesn't really perform synchronization, instead | |
137 | it calculates and prints different metrics about how well traces are | |
138 | synchronized. Although it can be run like other analysis modules, it is most | |
139 | useful when run in a postprocessing step, after another synchronization module | |
29dd991d | 140 | has been run. Eval is most commonly run in text mode. To do this, run: |
4378bba0 BP |
141 | lttv -m sync_chain_batch [usual options, ex: -t traces/node1 -t traces/node2 |
142 | --sync ...] | |
143 | It can also be run from the lttv source tree via runlttv: | |
144 | ./runlttv -m eval [usual runlttv options, ex: traces/node1 traces/node2] | |
bc7c054d BP |
145 | |
146 | eval provides a few more options: | |
147 | --eval-rtt-file - argument: FILE | |
148 | specify the file containing RTT information | |
149 | --eval-graphs - argument: none | |
150 | output gnuplot graph showing synchronization points | |
151 | --eval-graphs-dir - argument: eval-graphs-<lttv pid> | |
152 | specify the directory where to store the graphs | |
153 | ||
154 | The RTT file should contain information on the minimum round-trip time between | |
155 | nodes involved in the trace. This information is used (optionally) in the | |
156 | evaluation displayed and in the histogram graphs produced. The file should | |
157 | contain a series of lines of the form: | |
158 | 192.168.112.56 192.168.112.57 0.100 | |
159 | The first two fields are the IP addresses of the source and destination hosts. | |
160 | (hostnames are not supported). The last field is the minimum rtt in ms. The | |
161 | fields are separated by whitespace. '#' comments a line. | |
162 | ||
163 | Many commands can be used to measure the RTT, for example: | |
164 | ping -s 8 -A -c 8000 -w 10 192.168.112.57 | |
165 | ||
29dd991d BP |
166 | Note that this must be repeated in both directions in the file, that is: |
167 | 192.168.112.49 192.168.112.50 0.057 | |
168 | 192.168.112.50 192.168.112.49 0.050 | |
bc7c054d BP |
169 | |
170 | ++++ Linear Programming and GLPK | |
171 | The synchronization evaluation can optionally perform an analysis similar to | |
172 | chull but by using a linear program in one of the steps. This can be used to | |
173 | validate a part of the chull algorithm but it can also be used to provide a | |
174 | measure of the accuracy of the synchronization in any point (this is seen in | |
175 | the graph output). | |
176 | ||
177 | This is enabled by default at configure time (--with-glpk) if the GNU Linear | |
29dd991d BP |
178 | Programming Kit is available (libglpk). On Debian-like systems (ex. Ubuntu), |
179 | install the package "libglpk-dev". | |
180 | ||
181 | To see the output of this mode, run: | |
182 | lttv -m sync_chain_batch --eval-graphs [usual options, ex: -t traces/node1 -t | |
183 | traces/node2 --sync ...] | |
bc7c054d | 184 | |
add19043 BP |
185 | + Design |
186 | This part describes the design of the synchronization framework. This is to | |
187 | help programmers interested in: | |
188 | * adding new synchronization algorithms (analysis part) | |
189 | There are already two analysis algorithms available: chull and linreg | |
190 | * using new types of events (processing and matching parts) | |
bc7c054d BP |
191 | There are already two types of events supported: tcp messages and udp |
192 | broadcasts | |
add19043 BP |
193 | * using time synchronization with another data source/tracer (processing part) |
194 | There are already two data sources available: lttng and unittest | |
195 | ||
196 | ++ Sync chain | |
197 | This part is specific to the framework in use: the program doing | |
198 | synchronization, the executable linking to the event_*.o | |
199 | eg. LTTV, unittest | |
200 | ||
201 | This reads parameters, creates SyncState and calls the processing init | |
202 | function. The "sync chain" is the set of event-* modules. At the moment there | |
203 | is only one module at each stage. However, as more module are added, it will | |
204 | become relevant to have many modules at the same stage simultaneously. This | |
b670bb7c BP |
205 | will require some modifications. It is already partly supported at the |
206 | matching stage through encapsulation of other matching modules. | |
207 | ||
48b641c1 | 208 | sync_chain_unitest:main() provides a fairly simple example of sync chain |
b670bb7c | 209 | implementation. |
add19043 BP |
210 | |
211 | ++ Stage 1: Event processing | |
212 | Specific to the tracing data source. | |
213 | eg. LTTng, LTT userspace, libpcap | |
214 | ||
215 | Read the events from the trace and stuff them in an appropriate Event object. | |
216 | ||
217 | ++ Communication between stages 1 and 2: events | |
218 | Communication is done via objects specialized from Event. At the moment, all | |
219 | *Event are in data_structures.h. Specific event structures and functions could | |
220 | be in separate files. This way, adding a new set of modules would require | |
221 | shipping extra data_structures* files instead of modifying the existing one. | |
222 | For this to work, Event.type couldn't be an enum, it could be an int and use | |
f6691532 | 223 | #defines or constants defined in the specialized data_structures* files. |
add19043 BP |
224 | Event.event could be a void*. |
225 | ||
226 | ++ Stage 2: Event matching | |
227 | This stage and its modules are specific to the type of event. Event processing | |
228 | feeds the events one at a time but event analysis works on groups of events. | |
229 | Event matching is responsible for forming these groups. Generally speaking, | |
230 | these can have different types of relation ("one to one", "one to many", or a | |
231 | mix) and it will influence the overall behavior of the module. | |
232 | eg. TCP, UDP, MPI | |
233 | ||
f6691532 BP |
234 | matchEvent() takes an Event pointer. An actual matching module doesn't have to |
235 | be able to process every type of event. It will only be passed events of a | |
236 | type it can process (according to the .canMatch field of its MatchingModule | |
237 | struct). | |
add19043 BP |
238 | |
239 | ++ Communication between stages 2 and 3: event groups | |
240 | Communication consists of events grouped in Message, Exchange or Broadcast | |
241 | structs. | |
242 | ||
243 | About exchanges: | |
244 | If one event pair is a packet (more generally, something representable as a | |
245 | Message), an exchange is composed of at least two packets, one in each | |
246 | direction. There should be a non-negative minimum "round trip time" (RTT) | |
247 | between the first and last event of the exchange. This RTT should be as small | |
248 | as possible so these packets should be closely related in time like a data | |
249 | packet and an acknowledgement packet. If the events analyzed are such that the | |
250 | minimum RTT can be zero, there's nothing gained in analyzing exchanges beyond | |
251 | what can already be figured out by analyzing packets. | |
252 | ||
253 | An exchange can also consist of more than two packets, in case one packet | |
254 | single handedly acknowledges many data packets. In this case, it is best to | |
bc7c054d | 255 | use the last data packet. Assuming a linear clock, an acknowledged |
add19043 BP |
256 | packet is as good as any other. However, since the linear clock assumption is |
257 | further from reality as the interval grows longer, it is best to keep the | |
258 | interval between the two packets as short as possible. | |
259 | ||
260 | ++ Stage 3: Event analysis | |
261 | This stage and its modules are specific to the algorithm that analyzes events | |
262 | to deduce synchronization factors. | |
263 | eg. convex hull, linear regression, broadcast Maximum Likelihood Estimator | |
264 | ||
265 | Instead of having one analyzeEvents() function that can receive any sort of | |
266 | grouping of events, there are three prototypes: analyzeMessage(), | |
267 | analyzeExchange() and analyzeBroadcast(). A module implements only the | |
bc7c054d | 268 | relevant one(s) and the other function pointers are NULL. |
add19043 BP |
269 | |
270 | The approach is different from matchEvent() where there is one point of entry | |
271 | no mather the type of event. The analyze*() approach has the advantage that | |
272 | there is no casting or type detection to do. It is also possible to deduce | |
273 | from the functions pointers which groupings of events a module can analyze. | |
274 | However, it means each analysis module will have to be modified if there is | |
275 | ever a new type of event grouping. | |
276 | ||
277 | I chose this approach because: | |
278 | 1) I thought it likely that there will be new types of events but not so | |
279 | likely that there will be new types of event groups. | |
280 | 2) all events share some members (time, traceNb, ...) but not event groups | |
281 | 3) we'll see which one of the two approaches works best and we can adapt | |
282 | later. | |
283 | ||
284 | ++ Data flow | |
285 | Data from traces flows "down" from processing to matching to analysis. Factors | |
286 | come back up. | |
287 | ||
288 | ++ Evolution and adaptation | |
289 | It is possible to change/add another sync chain and to add other event_* | |
290 | modules. It has been done. New types of events may need to be added to | |
291 | data_structures.h. This is only to link between Event-* modules. If the data | |
292 | does not have to be shared, data_structures.h does not have to be modified. | |
293 | ||
294 | At the moment there is some code duplication in the last steps of linreg and | |
295 | chull analysis: the code to propagate the factors when there are more than two | |
296 | nodes. Maybe there could be a Stage 4 that does that? |