Fix: ust-app: per-PID app unregister vs tracing stop races
There are various races with UST application unregister performed
concurrently with tracing stop operation when tracing with per-pid
buffers. This randomly affects availability of data shortly after the
data pending check returns that no more data is available.
ust_app_stop_trace_all() iterates on all applications in the ust_app_ht
hash table to issue a flush on all buffers. This is needed to ensure
that the sub-buffers being written to are made available to the
consumer, for both data consumption, and for the data pending check.
Failure to execute the sub-buffer flush makes following data pending
check return that there is no data in the buffers too early, thus
resulting in an incomplete trace.
It is therefore important that an application flushes all its buffers
before it is removed from the ust_app_ht.
This is where ust_app_unregister() needs to be fixed. Note that
ust_app_unregister() executes concurrently with
ust_app_stop_trace_all(), only taking the per-session lock. The order of
flush vs hash table removal therefore matters:
We need to push the metadata before removing application from
ust_app_ht. We also need to issue a flush for all application buffers
before removing the application from ust_app_ht.
Once this is fixed, there is yet another race, this time in
ust_app_flush_trace() (now renamed ust_app_flush_session()). It is
caused by the use of ustctl_sock_flush_buffer() which asks the
application to perform the buffer flush. Unfortunately, if the
application vanishes (not reachable anymore), but its unregistration has
not yet been processed by sessiond, then ust_app_stop_trace_all() will
fail to flush the application buffers, because
ustctl_sock_flush_buffer() will fail.
This final issue is fixed by asking the consumer daemon to flush the
associated channel rather than relying on the application.
There are cases where a stream can be completely empty (no packet to
write) with UST: for instance, if a traced application is either
preempted for a long time, terminated, or stopped, between reserve and
commit. This will make the consumer consider that this stream has no
data ready. If this situation occurs in the first sub-buffer of a
stream, this stream will have no data at all (0 bytes).
Therefore, we need to let the data pending check consider that no data
is pending in this situation, otherwise it can make the data pending
check always return that there is data pending.
The "break" statement on error skips the rest of the functions, thus
leaving test applications running after the end of the test, which is a
side-effect on the following tests.
Fix: don't destroy the sockets if the snapshot was successful
Missing a goto to skip the error condition that was destroying the
relayd sockets even if a snapshot was successful. We want to keep them
open to reuse them for the next snapshots.
Philippe Proulx [Thu, 27 Nov 2014 22:35:32 +0000 (17:35 -0500)]
Fix: channel names are not validated
This patch ensures:
1. A channel name does not contain any '/' character, since
relative paths may be injected in the channel name
otherwise (knowing that the channel name is eventually
part of a file name)
2. A channel name does not start with a '.' character, since
trace readers (Babeltrace is one of them) could interpret
files starting with a dot as hidden files and ignore
them when opening the CTF trace
See the associated bug report for a lenghty explanation of the issue
and of this fix. It fixes an issue when saving a live session's
configuration that was created by loading an .lttng file.
Fix: Don't leave events enabled if they were saved in a disabled state
Events are enabled by default on creation. The session configuration
loader must make sure to disable them after creation when restoring an
event that was saved in the "disabled" state.
Introduce tmp_path to ensure that no code path can possibly try to free
the return value of utils_get_home_dir(). Re-using alloc_path for both
static and dynamically allocated pointer is error-prone.
Fix: Live tracing does not honor live timer after first tracefile with tracefile rotation
When we pass to the 2nd sub-file (or following sub-files) of a stream in
relayd, the live timer has no visible effect from a live reader
perspective, and then everything is flushed when we reach the following
sub-file.
This is caused by the reset of stream->total_index_received after each
tracefile rotation. It should keep on incrementing to match what is
expected by check in check_index_status():
Fix: UST subbuffers silently dropped on moderate trace traffic
Well, it looks like we really screwed up on this one.
lttng-tools commit 02b3d1769d5f8a33e4109b1e681141c9295dfda6 introduced
an important regression for lttng-ust tracing in the consumer daemon:
after reading a sub-buffer, a check has been added to see whether there
are more sub-buffers available to read, and if it is the case, it
ensures the wakeup pipe will be awakened again.
The issue lies in the use of ustctl_put_next_subbuf() in this check.
This acts as if the sub-buffer has been read, when in reality it has not
been read. It therefore trashes the data contained by this sub-buffer.
This check should use ustctl_put_subbuf(), which does not move the
consumer position.
This is a severe bug, and the fix needs to be applied to stable-2.6,
stable-2.5, and stable-2.4.
Julien Desfossez [Wed, 12 Nov 2014 23:36:17 +0000 (18:36 -0500)]
Fix: create/destroy a splice_pipe per stream
We had a per-thread splice_pipe (one for data and one for metadata), but
in case of error, we would end up filling the write side of the pipe and
never emptying it. This could lead to leaking data from one session to
the other, but also to stall the consumer trying to splice into a full
pipe.
Now we create a splice_pipe per-stream, so it is destroyed when the
session is destroyed.
David Goulet [Tue, 7 Oct 2014 19:05:48 +0000 (15:05 -0400)]
Fix: return EINVAL if agent registration fails
The errno value might be 0 thus not returning an error if so. It has
been seen with an unstable python agent code base which means it could
happen in the future if a third part decides to create an agent.
Signed-off-by: David Goulet <dgoulet@efficios.com>
Fix: check userspace perf counter name when looking up contexts
create_ust_app_channel_context() looks for a context's existance
in a channel before adding it. However, it only checks for
context types. This is valid for all context types except for
LTTNG_UST_CONTEXT_PERF_THREAD_COUNTER since multiple perf
thread counters may be enabled at the same time.
This fix ensures that the perf counter name is taken into
consideration when checking for a context's presence in a
channel.
Reported-by: Alexander Grigoriev <alexgri@tbricks.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
In order to correctly handle the use-case where events are enabled
_after_ trace is started, and _after_ applications are already being
traced, the event should be created in a "disabled" state, so that it
does not trace events until its filter is attached.
This fix needs to be done both in lttng-tools and lttng-ust. In order to
keep ABI compatibility between tools and ust within a stable release
cycle, we introduce a new "disabled" within struct lttng_ust_event
padding (previously zeroed). Newer LTTng-UST checks this flag, and
fallback on the old racy behavior (enabling the event on creation) if it
is unset.
Therefore, old session daemon works with newer lttng-ust of the same
stable release, and vice-versa. However, building lttng-tools requires
an upgraded lttng-ust, which contains the communication protocol with
the new "disabled" field.
This patch should be backported to stable-2.4, stable-2.5, stable-2.6
branches.
David Goulet [Fri, 31 Oct 2014 17:23:29 +0000 (13:23 -0400)]
Fix: UST consumer sync all available metadata
In live mode, the sync metadata function was only working on one single
metadata stream of a given session ID. However, we can have multiple
metadata stream for the same session ID thus failing to send the data in
live mode correctly for the other streams.
This fixes it by simply iterating over all metadata stream for a session
ID and syncing them all.
Signed-off-by: David Goulet <dgoulet@efficios.com>
The issue uncovered a more serious problem. The loop on ready FDs of the
thread was exiting at each branch thus not going on all fd. This is
problematic when the thread quit pipe is triggered and when there is
also at the same time a request for metadata from the consumer since the
metadata request could have been ignored.
This patch makes sure we go through all FDs in the loop when the thread
quit pipe or the metadata fd is triggered.
Signed-off-by: David Goulet <dgoulet@efficios.com>
Julien Desfossez [Wed, 27 Aug 2014 17:59:21 +0000 (13:59 -0400)]
Fix: make sure no index is in flight before using inactivity beacons
Since the index is sent in two parts on two separate connections from
the consumer, there can be cases where we receive an inactivity beacon
between the index creation and the data reception.
This fix prevents from using the inactivity beacon if we know a data
index is coming.
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com> Signed-off-by: David Goulet <dgoulet@efficios.com>
Fix: Parenthesize previous statement when adding conditions to a filter
Not parenthesizing the clauses in a filter string causes JUL events to be
traced even though they are not enabled when an enable-event command is
issued with a filter and the --loglevel-only option.
Fix: parse_prob_opts return the actual success of the function
This bug have been triggered by the mi merging and the use of a
command_ret in enable_events functions. Previously, enable_events was
reusing the ret variable for another operation and always replacing ret.
Parse_probe_event returned the last output of sscanf which represent
the number of match and not the success of the operation.
Fixes #830
Signed-off-by: Jonathan Rajotte Julien <jonathan.r.julien@gmail.com> Signed-off-by: David Goulet <dgoulet@efficios.com>
It is never locked in this function, but should be. This is triggering
spurious runtime failures on my system, where it seems that sessiond was
sometimes breaking the communication pipe with liblttng-ctl when the
unbalanced unlock is reached.
This should be backported to stable-2.4 and stable-2.5.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 16 Jul 2014 19:50:31 +0000 (15:50 -0400)]
Fix: validate file path creds in autoload mode
Validate the credentials only when in auto load mode for the auto/
directory. The lttng load command now checks the session file creds and
validate if it's readable before trying to do so.
Thus with this, a user can load a session from the system wide directory
as long as she/he has the rights to read it.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 16 Jul 2014 18:02:56 +0000 (14:02 -0400)]
Fix: change session file loading order
Libconfig now tries to load a session file from the user directory
followed by the system wide. A user session file should always
supersedes the system wide.
Signed-off-by: David Goulet <dgoulet@efficios.com>
Fix: get the stream_id when generating live beacons
When we send an empty index (beacon), we need to extract the stream_id
to avoid stalling the client on inactive streams on startup.
Since the live clients need to know this feature is implemented, we had
to bump the lttng-live protocol version.
This fix should be backported to stable-2.4 as well.
David Goulet [Wed, 9 Jul 2014 19:24:04 +0000 (15:24 -0400)]
Fix: use biggest subbuffer size for snapshot max-size
Instead of using the output max size divided by the total amount of
streams in the session, we find the largest subbuffer in the session's
channels. Using that value, we multiply it by the total amount of
streams which gives us the lower limit of the snapshot size. That is
enough to make sure that we can take the snapshot or not.
Once done, the max stream size possible used for the snapshot record is
the largest subbuffer size in the session. This is to make sure that
every channel can extract the same amount of data which ensure fairness
between each channel in the session.
Fixes #783
Acked-by: Julien Desfossez <jdesfossez@efficios.com> Signed-off-by: David Goulet <dgoulet@efficios.com>
Fix: Possible memory leak when multiple config files are loaded
Some configuration options could leak when initialized multiple times
from different configuration files and from the command line arguments.
The previous options are now freed' before being set.
Fixes #796
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com> Signed-off-by: David Goulet <dgoulet@efficios.com>