From: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Date: Thu, 25 Jun 2020 23:22:24 +0000 (-0400)
Subject: Fix: post-clear trace chunk has a late beginning packet
X-Git-Tag: v2.12.2~22
X-Git-Url: http://git.lttng.org./?a=commitdiff_plain;h=fc0c52e9fb0608ede702556178a68d15a1d25af8;hp=fc0c52e9fb0608ede702556178a68d15a1d25af8;p=lttng-tools.git

Fix: post-clear trace chunk has a late beginning packet

Observed issue
==============

In the following scenario:
  - create a regular session
  - enable an event that occurs on only one core
  - start tracing
  - trace for a while
  - stop tracing
  - clear the session
  - start
  - trace for a while (referred to the "active period" later on)
  - stop

The resulting trace will contain a very short stream intersection as the
active stream will contain packets spanning the entire active period.
However, all other streams will contain a single packet at the end of
the active period with a duration of 0 ns.

This presents two problems:
  1) This makes the stream intersection mode of viewers unusable,
  2) This misleads the user into thinking that tracing was not active
     for some buffers.

Cause
=====

The packet beginning timestamps of the buffers are initialized on
creation (on the first "start" of a tracing session). When a "clear" is
performed on a session, all open packets are closed and the existing
contents are purged.

If a stream is inactive, it is possible for no packet to be "opened"
until the "stop" of the tracing session.

On stop, a "flush_empty" is performed. Such a flush opens a packet
(if it was not already the case), closes it, and marks the packet as
being ready for consumption.

Solution
========

Attempt to flush an empty packet as close to the rotation point as
possible. In the event where a stream remains inactive after the
rotation point, this ensures that the new trace chunk has a beginning
timestamp set at the begining of the trace chunk instead of only
creating an empty packet when the trace chunk is stopped.

This indicates to the viewers that the stream was being recorded, but
more importantly it allows viewers to determine a useable trace
intersection.

This presents a problem in the case where the ring-buffer is completely
full.

Consider the following scenario:
  - The consumption of data is slow (slow network, for instance),
  - The ring buffer is full,
  - A rotation is initiated,
    - The flush below does nothing (no space left to open a new packet),
  - The other streams rotate very soon, and new data is produced in the
    new chunk,
  - This stream completes its rotation long after the rotation was
    initiated
  - The session is stopped before any event can be produced in this
    stream's buffers.

The resulting trace chunk will have a single packet temporaly at the end
of the trace chunk for this stream making the stream intersection more
narrow than it should be.

To work-around this, an empty flush is performed after the first
consumption of a packet during a rotation if the initial flush failed.
The idea is that consuming a packet frees enough space to switch packets
in this scenario and allows the tracer to "stamp" the beginning of the
new trace chunk at the earliest possible point.

Note that metadata streams are always skipped when opening a packet.
This is done for two reasons:
  1) Timestamps are not relevant to the metadata stream
  2) Inserting an empty packet in the metadata stream after a rotation
     breaks the use of "clear" in live mode.

     The contents of the metadata streams of successive chunks must be
     strict superset of one another as live clients only receive the
     information appended to a metadata stream (i.e. the parts it
     already has received can't change).

     If a flush_empty was performed after a clear/rotation, it would
     result in an empty packet being inserted at the beginning of the
     metadata stream that wasn't present in the first chunk.

     This would cause the live client and relay daemon to have
     mismatching copies of the metadata stream.

Known drawbacks
===============

In the case of an inactive stream, this results in the completed trace
chunk archive containing an extra empty packet at the beginning of the
stream (typically 4kB).

In the case of an active stream, this change will cause the first packet
to be empty or contain few events.

Those are all efficiency losses that are inevitable (AFAIK) given the
current buffer control APIs. It will be possible to recoup those losses
if an API allowing the consumer daemon to open a new packet is
introduced.

As noted in the comments, this patch is not final. The flush after the
rotation should not be open-coded in lttng_consumer_read_subbuffer. It
should be a data-stream specific "post-consume" step.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7d8ab876e55e9d0718a55ec1bb77ec6466accc02
---