Fix: sessiond: size-based rotation threshold exceeded in per-pid tracing (1/2)
Issue observed
--------------
When tracing short-lived applications with buffers configured in per-pid
mode, the size-based rotation threshold is often greatly exceeded. In
the CI, this occasionally causes the size-based rotation tests to
time out for the per-pid case.
Cause
-----
There is a scenario where a session's consumed size is miscalculated.
When an application exits during per-pid tracing, both the session and
consumer daemons notice it. The session daemon sees the application's
command pipe hang up, while the consumer daemon sees the
application's data-ready pipe hang up.
Upon handling these events, both daemons tear down their representation of
the channels.
In an ideal world, we'd want to sample the streams' "consumed_size" at
the last possible moment to get the size of all consumed data for this
stream. However, this is problematic in the following scenario:
- the sessiond destroys the channel before the consumer daemon,
- the consumer daemon sends a final buffer stats sample on tear down,
- the sessiond can do nothing with the sample as it doesn't know that
  channel anymore.
(Note that the session daemon gracefully handles the case where it
doesn't know a channel.)
When applications have a short lifetime and are traced in per-PID
buffering mode, there is a high likelihood that the last buffer
statistics sample sent for a given channel will target a channel that
the session daemon has already torn down.
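The race described above can be illustrated with a small, self-contained model. This is only a sketch: the structures and function names (`sessiond_model`, `handle_sample()`, and so on) are hypothetical and do not come from the LTTng sources. It assumes samples carry an absolute, monotonically increasing consumed size per channel, and that samples for unknown channels are silently ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; none of these names exist in LTTng. */
#define MAX_CHANNELS 8

struct channel_entry {
	bool in_use;
	uint64_t key;
	uint64_t last_consumed_size; /* Last absolute value sampled. */
};

struct sessiond_model {
	struct channel_entry channels[MAX_CHANNELS];
	uint64_t session_consumed_size; /* Total accounted for the session. */
};

static struct channel_entry *find_channel(struct sessiond_model *s, uint64_t key)
{
	for (size_t i = 0; i < MAX_CHANNELS; i++) {
		if (s->channels[i].in_use && s->channels[i].key == key) {
			return &s->channels[i];
		}
	}
	return NULL;
}

static bool add_channel(struct sessiond_model *s, uint64_t key)
{
	for (size_t i = 0; i < MAX_CHANNELS; i++) {
		if (!s->channels[i].in_use) {
			s->channels[i] = (struct channel_entry) { .in_use = true, .key = key };
			return true;
		}
	}
	return false; /* Table full. */
}

static void remove_channel(struct sessiond_model *s, uint64_t key)
{
	struct channel_entry *channel = find_channel(s, key);

	if (channel) {
		channel->in_use = false;
	}
}

/*
 * Returns true if the sample was accounted for, false if it was dropped
 * because the channel is no longer known.
 */
static bool handle_sample(struct sessiond_model *s, uint64_t key, uint64_t consumed_size)
{
	struct channel_entry *channel = find_channel(s, key);

	if (!channel) {
		return false; /* Unknown channel: handled gracefully, data lost. */
	}
	/* Assumption of this model: absolute sizes never decrease. */
	assert(consumed_size >= channel->last_consumed_size);
	s->session_consumed_size += consumed_size - channel->last_consumed_size;
	channel->last_consumed_size = consumed_size;
	return true;
}
```

In this model, if a channel reports 4096 bytes, is removed, and then the consumer's final sample reports 65536 bytes, the session total stays at 4096: the last 61440 bytes are never counted toward the rotation threshold.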
Solution
--------
Consumed-size conditions are somewhat special: they are bound to a
session, but they are evaluated through a per-channel event (buffer
statistics samples taken by the channels' monitoring timer).
To work around the channel lifetime problem, we can rely on the fact
that sessions outlive channels and perform the accounting of the
consumed size at the session level.
This patch is the first step to implement this fix: new
notification-thread commands are introduced to announce the creation and
destruction of an `ltt_session`. Currently, the notification thread
infers the existence of a session by tracking its channels' creation
and destruction.
With this change, it no longer needs to do so; sessions are explicitly
created and destroyed. Their unique ID is also stored.
The key of `sessions_ht` becomes the `id` of the session to allow
efficient look-ups on the reception of a buffer statistics sample.
The existing call sites that use the session's name to perform a
look-up are modified to look up the ID by name (see
sample_session_id_by_name()).
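As an illustration of the keying change, here is a minimal sketch of a session table indexed by ID, with a slower by-name helper for call sites that only know the name. Everything in it is hypothetical (`session_table`, `session_add()`, `session_id_by_name()`, the fixed bucket count); it is a plain chained hash table written for this example, not the data structure or the functions used by the notification thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model; not the notification thread's actual types. */
#define BUCKET_COUNT 16

struct session_info {
	uint64_t id;
	char name[64];
	uint64_t consumed_size;
	struct session_info *next; /* Next entry in the same bucket. */
};

struct session_table {
	struct session_info *buckets[BUCKET_COUNT];
};

static size_t bucket_of(uint64_t id)
{
	return (size_t) (id % BUCKET_COUNT);
}

/* Fast path: a single bucket is visited, as the table is keyed by ID. */
static struct session_info *session_lookup_by_id(struct session_table *t, uint64_t id)
{
	for (struct session_info *s = t->buckets[bucket_of(id)]; s; s = s->next) {
		if (s->id == id) {
			return s;
		}
	}
	return NULL;
}

/* Explicit creation: fails if the ID is already known or on allocation error. */
static bool session_add(struct session_table *t, uint64_t id, const char *name)
{
	struct session_info *s;

	assert(name);
	if (session_lookup_by_id(t, id)) {
		return false;
	}
	s = calloc(1, sizeof(*s));
	if (!s) {
		return false;
	}
	s->id = id;
	strncpy(s->name, name, sizeof(s->name) - 1); /* calloc() keeps it terminated. */
	s->next = t->buckets[bucket_of(id)];
	t->buckets[bucket_of(id)] = s;
	return true;
}

/* Explicit destruction. */
static bool session_remove(struct session_table *t, uint64_t id)
{
	struct session_info **link = &t->buckets[bucket_of(id)];

	while (*link) {
		if ((*link)->id == id) {
			struct session_info *doomed = *link;

			*link = doomed->next;
			free(doomed);
			return true;
		}
		link = &(*link)->next;
	}
	return false;
}

/* Slow path for name-based callers: scans every bucket. */
static bool session_id_by_name(struct session_table *t, const char *name, uint64_t *id)
{
	for (size_t i = 0; i < BUCKET_COUNT; i++) {
		for (struct session_info *s = t->buckets[i]; s; s = s->next) {
			if (strcmp(s->name, name) == 0) {
				*id = s->id;
				return true;
			}
		}
	}
	return false;
}
```

The design choice mirrored here is that the frequent operation (handling a sample that carries an ID) gets the cheap look-up, while the less frequent name-based callers pay for a translation step first.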
The add/remove channel commands and rotation ongoing/completed commands
are modified to refer to sessions by ID since they can assume the
notification thread knows about the session.
Note
----
In a follow-up patch, buffer statistics samples are modified to include
the session's ID and the consumed size is modified to become a "delta"
relative to the previous sample associated with a given channel.
This makes it possible to perform the accounting of a session's consumed
size beyond the lifetime of its channels.
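The delta-based accounting announced for the follow-up patch could look roughly like the sketch below. It is not taken from that patch: the types (`channel_sampler`, `stats_sample`, `session_acct`), the function names, and the "greater than or equal" threshold comparison are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical consumer-side state: last absolute size reported per channel. */
struct channel_sampler {
	uint64_t last_consumed_size;
};

/* Hypothetical sample layout: targets a session and carries a delta. */
struct stats_sample {
	uint64_t session_id;
	uint64_t consumed_size_delta;
};

/* Hypothetical session-side accounting state. */
struct session_acct {
	uint64_t id;
	uint64_t consumed_size;
};

/* Consumer side: convert an absolute size into a delta since the last sample. */
static uint64_t take_delta(struct channel_sampler *sampler, uint64_t current_size)
{
	uint64_t delta;

	/* Assumption of this model: absolute sizes never decrease. */
	assert(current_size >= sampler->last_consumed_size);
	delta = current_size - sampler->last_consumed_size;
	sampler->last_consumed_size = current_size;
	return delta;
}

/*
 * Session side: no channel look-up is needed, so a sample that arrives
 * after the channel was torn down is still accounted for.
 */
static bool account_sample(struct session_acct *sessions, size_t count,
		const struct stats_sample *sample)
{
	for (size_t i = 0; i < count; i++) {
		if (sessions[i].id == sample->session_id) {
			sessions[i].consumed_size += sample->consumed_size_delta;
			return true;
		}
	}
	return false; /* Unknown session: the sample is dropped. */
}

/* Assumed semantics: the condition holds once the total reaches the threshold. */
static bool rotation_threshold_reached(const struct session_acct *session,
		uint64_t threshold)
{
	return session->consumed_size >= threshold;
}
```

With the same numbers as a channel that reports 4096 bytes and then a final 65536 bytes on tear down, the two deltas (4096 and 61440) add up to the full 65536 bytes on the session, regardless of whether the channel is still known when the second sample arrives.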
The follow-up patch is the "core" of the fix, but it requires these
prior changes.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I865e9ac5e1a63e62123209be63957dad28c588a8