Fix: rotation may never complete in per-PID buffering mode
Issue
-----
The current scheme to ensure that a rotation is completed
consists of the following, from the session daemon's perspective:
Iterate on all channels:
- Ask the consumerd to sample the current "write" positions
- Increment a count of channels being rotated
Wait for the consumer daemon to notify the session daemon every time
the "read" positions of all of a channel's streams have reached the
sampled "write" positions.
The idea behind this is making sure that all the data that was
produced before a rotation was triggered has been consumed (i.e.
been written to a local FS or streamed to the relay daemon) before
marking the rotation as completed.
However, this assumes that the session daemon is always aware of
all channels/streams that exist at the moment at which the rotation is
initiated. This is only true for the kernel domain.
In per-PID buffer mode, it is possible for an application, and its
buffers, to be torn down at any moment. Thus the following scenario
can happen:
- The application fills its buffers, causing the consumerd to fall
behind
- The application exits, leaving its full buffers behind to be
extracted by the consumer daemon
- The session daemon removes anything to do with the application from
its internal structures, including its channels
- A rotation is initiated
- The positions of the application's buffers are never sampled as the
session daemon does not see the channels when iterating on the
session's channels
Multiple bad things can happen from there.
First, the rotation can be marked as "completed" while the consumerd
is still extracting the dead application's buffers, causing readers
to consume an incomplete/corrupted trace.
Second, if the session is being streamed to a relay daemon, it is
possible for the 'rename' command to be issued before the contents
of the buffers have been written, causing indexes to fail to be
flushed (as the relay daemon attempts to write them to a now-defunct
location).
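To make the failure mode concrete, here is a minimal sketch of the
scenario above. All names (consumer_channel_sketch, known_to_sessiond,
and both helpers) are hypothetical and only model the reasoning; they
are not the daemons' actual structures. A channel that the session
daemon has already forgotten is never counted, so the old scheme can
report nothing pending while data is still unconsumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical consumer-side view of a channel (illustration only). */
struct consumer_channel_sketch {
	bool known_to_sessiond;  /* false once the application's exit was processed */
	uint64_t write_position; /* how far the producer wrote */
	uint64_t read_position;  /* how far the consumer daemon has consumed */
};

/*
 * Old scheme, simplified: only channels still visible to the session
 * daemon are sampled and counted as "being rotated".
 */
static size_t old_scheme_pending_count(
		const struct consumer_channel_sketch *channels, size_t count)
{
	size_t pending = 0;

	assert(channels != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (channels[i].known_to_sessiond &&
				channels[i].read_position < channels[i].write_position) {
			pending++;
		}
	}
	return pending;
}

/* Ground truth: channels that still hold unconsumed data. */
static size_t channels_with_unconsumed_data(
		const struct consumer_channel_sketch *channels, size_t count)
{
	size_t unconsumed = 0;

	assert(channels != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (channels[i].read_position < channels[i].write_position) {
			unconsumed++;
		}
	}
	return unconsumed;
}
```

With a single channel left behind by a dead application, the first
helper returns 0 while the second returns 1, which is the window in
which the rotation is wrongly declared complete.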
Solution
--------
Eliminate the pipe between the session daemon and consumer daemon that
is used to signify that a rotation is completed, as the information it
carries is unreliable.
The rotation thread now periodically asks the consumer daemon to check
for channels that have a pending rotation for a given session_id or
that belong to the ongoing rotation archive id.
Hence, for every stream:
- If the archive id during which it was created is greater than that
of the ongoing rotation, we don't need to consider it
- If the current position is greater than or equal to the sampled
rotation position, we can consider its rotation 'done'
- If it belongs to the pending rotation archive id and doesn't have
a "target" position, it was unknown to the session daemon and the
application associated with it is dead. We must wait for the
stream to be flushed and torn down before assuming that the
rotation was completed.
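The three rules above can be sketched as a single predicate. This is
an illustration under stated assumptions, not the consumer daemon's
actual code: the structure, its field names and both functions are
hypothetical; positions are compared as plain 64-bit values with no
wrap-around handling; and any stream created at or before the ongoing
rotation is treated as belonging to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-stream state (illustration only). */
struct stream_sketch {
	uint64_t created_in_archive_id; /* archive id current at stream creation */
	bool has_rotate_position;       /* was a "target" position sampled? */
	uint64_t rotate_position;       /* sampled rotation position, if any */
	uint64_t consumed_position;     /* current position of the consumer */
};

/* Returns true while the stream still blocks the ongoing rotation. */
static bool stream_rotation_pending(const struct stream_sketch *stream,
		uint64_t rotating_archive_id)
{
	assert(stream != NULL);

	/* Rule 1: created after the ongoing rotation, not concerned. */
	if (stream->created_in_archive_id > rotating_archive_id) {
		return false;
	}
	/* Rule 2: a target exists; done once the consumer reached it. */
	if (stream->has_rotate_position) {
		return stream->consumed_position < stream->rotate_position;
	}
	/*
	 * Rule 3: no target was ever sampled, so the session daemon did
	 * not know this stream. It stays pending for as long as it
	 * exists; it stops counting only once flushed and torn down,
	 * i.e. once it is no longer in the set being checked.
	 */
	return true;
}

/* A rotation is pending if at least one remaining stream is pending. */
static bool any_rotation_pending(const struct stream_sketch *streams,
		size_t count, uint64_t rotating_archive_id)
{
	assert(streams != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (stream_rotation_pending(&streams[i], rotating_archive_id)) {
			return true;
		}
	}
	return false;
}
```

Note that the third rule has no position to compare against: the only
completion signal for such a stream is its disappearance, which is why
an empty set of streams is reported as not pending.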
Drawbacks
---------
This polling approach is somewhat inefficient and can cause rotations
to take longer to complete than necessary, especially in high-latency
networking conditions.
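As a rough illustration of where the extra latency comes from, the
polling performed by the rotation thread can be modelled as below.
This is a sketch only: the callback types, the fake_consumer helper
and poll_until_rotation_done() are hypothetical names, and the real
check is a command sent to the consumer daemon rather than a local
function call. Completion is only observed on the next poll, so a
rotation can appear to last up to one polling period longer than the
data actually took to flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks: ask whether a rotation is pending, and wait. */
typedef bool (*pending_check_fn)(void *ctx);
typedef void (*wait_fn)(void *ctx);

/*
 * Poll until the check reports that nothing is pending, or until
 * max_polls checks were made. Stores the number of checks performed
 * in *polls_used and returns true if the rotation completed.
 */
static bool poll_until_rotation_done(pending_check_fn is_pending,
		wait_fn wait_one_period, void *ctx,
		unsigned int max_polls, unsigned int *polls_used)
{
	unsigned int polls = 0;

	assert(is_pending != NULL);
	assert(polls_used != NULL);
	while (polls < max_polls) {
		polls++;
		if (!is_pending(ctx)) {
			*polls_used = polls;
			return true;
		}
		if (wait_one_period != NULL) {
			wait_one_period(ctx); /* one polling period elapses */
		}
	}
	*polls_used = polls;
	return false;
}

/* Stand-in for the consumer daemon: done after a number of periods. */
struct fake_consumer {
	unsigned int periods_until_done;
};

static bool fake_is_pending(void *ctx)
{
	const struct fake_consumer *consumer = ctx;

	return consumer->periods_until_done > 0;
}

static void fake_wait(void *ctx)
{
	struct fake_consumer *consumer = ctx;

	if (consumer->periods_until_done > 0) {
		consumer->periods_until_done--;
	}
}
```

In this model, a consumer that needs three periods to drain its
buffers is reported as done on the fourth check.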
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
23 files changed: