Fix: relayd: live client not notified of inactive streams
Observed issue
--------------
Some LTTng-tools live tests failures appear to show babeltrace2
hanging (failing to print expected events). The problem is intermittent,
but Kienan was able to develop a test case that's reproducible for him.
The test case performs the following steps:
- Start a ust application and leave it running
- Configure and then start an lttng live session
- Connect a live viewer (babeltrace)
- Run a second ust application
- Wait for the expected number of events
- In the failing case, no events are seen by babeltrace
Using per-uid buffers, the test typically completes normally. With
per-pid buffers the test fails, hanging indefinitely if waiting for the
specified number of events. While "hanging", babeltrace2 is polling the
relayd.
This affects for babeltrace2 stable-2.0 and master while using
lttng-tools master.
For more information, see the description of bug #1406[1]
Cause
-----
When consuming a live trace captured in per-PID mode, Babeltrace
periodically requests the index of the next packet it should consume.
As part of the reply, it gets a 'flags' field which is used to announce
that new streams, or new metadata, are available to the viewer.
Unfortunately, these 'flags' are only set when the relay daemon has new
tracing data to deliver. It is not set when the relay daemon indicates
that the stream is inactive (see LTTNG_VIEWER_INDEX_INACTIVE).
In the average case where an application is spawned while others are
actively emiting events, a request for new data will result in a reply
that returns an index entry (code LTTNG_VIEWER_INDEX_OK) for an
available packet accompanied by the LTTNG_VIEWER_FLAG_NEW_STREAM flag.
This flag indicates to the viewer that it should request new
streams (using the LTTNG_VIEWER_GET_NEW_STREAMS live protocol command)
before consuming the new data.
In the cases where we observe a hang, an application is running but not
emiting new events. As such, the relay daemon periodically emits "live
beacons" to indicate that the session's streams are inactive up to a
given time 'T'.
Since the existing application remains inactive and the viewer is never
notified that new streams are available, the viewer effectively remains
"stuck" and never notices the new application being traced.
The LTTNG_VIEWER_FLAG_NEW_METADATA communicates a similar semantic with
regards to the metadata. However, ignoring it for inactive streams isn't
as deleterious: the same information is made available to the viewer the
next time it will successfully request a new index to the relay daemon.
This would only become a problem if the tracers start to express
non-layout data (like supplemental environment information, but I don't
see a real use-case) as part of the metadata stream that should be made
available downstream even during periods of inactivity.
Note that the same problem most likely affects the per-UID buffer
allocation mode when multiple users are being traced.
Solution
--------
On the producer end, LTTNG_VIEWER_FLAG_NEW_STREAM is set even when
returning an inactivity index.
Note that to preserve compatibility with older live consumers that don't
expect this flag in non-OK response, the LTTNG_VIEWER_FLAG_NEW_STREAM
notification is repeated until the next LTTNG_VIEWER_GET_NEW_STREAMS
command that returns LTTNG_VIEWER_INDEX_OK.
The 'new_streams' state is no longer cleared from relay sessions during
the processing of the LTTNG_VIEWER_GET_NEXT_INDEX commands. Instead, it
is cleared when the viewer requests new streams.
On Babeltrace's end, the handler of the LTTNG_VIEWER_GET_NEXT_INDEX
command (lttng_live_get_next_index) is modified to expect
LTTNG_VIEWER_FLAG_NEW_STREAM in the cases where the command returns:
- LTTNG_VIEWER_INDEX_OK (as done previously),
- LTTNG_VIEWER_INDEX_HUP (new),
- LTTNG_VIEWER_INDEX_INACTIVE (new).
Drawbacks
---------
This is arguably a protocol change as none of the producers ever set the
NEW_METADATA/NEW_STREAM flags when indicating an inactive stream.
References
----------
[1] https://bugs.lttng.org/issues/1406
Fixes #1406
Change-Id: I84f53f089597ac7b22ce8bd0962d4b28112b7ab6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
This page took 0.026114 seconds and 4 git commands to generate.