Fix: lttng: out-of-bound copy of arguments in 'view' command handler
The 'size' operand of memcpy() does not indicate the length of the
opts array; it is the size of the resulting array once the opts array
is concatenated with the options being added in this function. This
results in out-of-bound read(s) in the opts array.
Use 'sizeof(char *) * opts_len' as the length to copy at the beginning
of the resulting array.
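
A minimal sketch of the corrected copy, assuming a hypothetical helper
named concat_opts() (it is not the actual function of the 'view'
command handler). The byte count of the first memcpy() is derived from
opts_len, not from the size of the resulting array:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: returns a newly allocated,
 * NULL-terminated array holding the `opts_len` entries of `opts` followed
 * by the `extra_len` entries of `extra`. The caller frees the array
 * (not the strings it points to).
 */
static char **concat_opts(char **opts, size_t opts_len,
		char **extra, size_t extra_len)
{
	/* Size of the *resulting* array, including the NULL terminator. */
	const size_t total = opts_len + extra_len + 1;
	char **out = calloc(total, sizeof(char *));

	if (!out) {
		return NULL;
	}

	/* Copy only what `opts` holds; using `total` here reads past its end. */
	if (opts_len) {
		memcpy(out, opts, sizeof(char *) * opts_len);
	}
	if (extra_len) {
		memcpy(out + opts_len, extra, sizeof(char *) * extra_len);
	}
	return out;
}
```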
strncpy is called with the source's length in two cases in the
session save code. Use the destination and remaining destination
length as intended by the API.
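
As an illustration (append_bounded() is a hypothetical helper, not the
session save code itself), the bound passed to strncpy() is what is
left in the destination rather than the length of the source:

```c
#include <string.h>

/*
 * Hypothetical helper: appends `src` to the NUL-terminated string held in
 * `dst`, whose total capacity is `dst_size` bytes. Assumes `dst` is already
 * terminated within `dst_size`. Returns 0 on success, -1 if `src` does not
 * fit (in which case `dst` is left untouched).
 */
static int append_bounded(char *dst, size_t dst_size, const char *src)
{
	const size_t used = strlen(dst);
	/* Bytes left in the destination, terminator included. */
	const size_t remaining = dst_size - used;

	if (strlen(src) >= remaining) {
		return -1;
	}

	/* Bound by the remaining destination length, not by strlen(src). */
	strncpy(dst + used, src, remaining);
	return 0;
}
```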
sessiond: fix: possible unaligned access in packed structure
'&rsock->sock.fd' is passed to consumer_send_fds and may result in an
unaligned pointer value. Use the ALIGNED_CONST_PTR macro to create
an aligned copy of the fd that is being passed.
The ctf_index structure, being part of the ABI, is explicitly packed
using the LTTNG_PACKED macro. However, populating it by using pointers
to its members is not acceptable as it may cause the ust and kernel
tracer APIs to write their return values using unaligned
pointers.
Use automatic storage variables to fetch the various index fields and
populate the index at-once using a compound literal.
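
The pattern can be sketched as follows. The structure and the two
getters are illustrative stand-ins (not the real ctf_index nor the
tracer APIs), and __attribute__((packed)) assumes GCC or Clang:

```c
#include <stdint.h>

/* Illustrative stand-in for a packed ABI structure. */
struct packed_index {
	uint8_t version;
	uint64_t timestamp_begin;
	uint64_t timestamp_end;
} __attribute__((packed));

/* Hypothetical getters standing in for calls that write through a pointer. */
static int get_timestamp_begin(uint64_t *out) { *out = 100; return 0; }
static int get_timestamp_end(uint64_t *out) { *out = 200; return 0; }

static int fill_index(struct packed_index *index)
{
	/* Naturally aligned automatic variables receive the values. */
	uint64_t begin, end;

	if (get_timestamp_begin(&begin) || get_timestamp_end(&end)) {
		return -1;
	}

	/*
	 * Populate the packed structure at once. No pointer to a packed
	 * member is ever formed; the compiler emits the member stores.
	 */
	*index = (struct packed_index) {
		.version = 1,
		.timestamp_begin = begin,
		.timestamp_end = end,
	};
	return 0;
}
```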
Tests: fix: uninitialized values passed to close() on error
The fds array is not initialized resulting in uninitialized file
descriptors being passed to close() when an error is encountered in
the epoll-setting loop.
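
A sketch of the usual remedy, assuming a hypothetical setup_fds()
function (the test's real code is not shown here): every slot is set to
-1 before anything can fail, and the error path only closes valid
descriptors.

```c
#include <fcntl.h>
#include <unistd.h>

#define NR_FDS 4

/*
 * Hypothetical setup step. `fail_at` simulates an error while handling
 * the descriptor at that index (pass -1 for no simulated error).
 */
static int setup_fds(int fail_at)
{
	int fds[NR_FDS];
	int i, ret = 0;

	/* Mark every slot as "not opened" before anything can fail. */
	for (i = 0; i < NR_FDS; i++) {
		fds[i] = -1;
	}

	for (i = 0; i < NR_FDS; i++) {
		if (i == fail_at) {
			ret = -1;
			goto end;
		}
		fds[i] = open("/dev/null", O_RDONLY);
		if (fds[i] < 0) {
			ret = -1;
			goto end;
		}
	}
end:
	/* Only descriptors that were actually opened are closed. */
	for (i = 0; i < NR_FDS; i++) {
		if (fds[i] >= 0) {
			close(fds[i]);
		}
	}
	return ret;
}
```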
lttng-ctl: fix: lttng_data_pending confuses communication status
lttng_ctl_ask_sessiond can return a positive value even though it
failed to receive the variable length payload of a session message
reply. In this case, lttng_ctl_ask_sessiond ends up calling into
lttng_ctl_ask_sessiond_fds_varlen() which will return the (negated)
error code returned by the session daemon if it was not LTTNG_OK.
The peer could return anything here, which lttng_data_pending will end
up interpreting as the length of the variable data that was received.
In this case, if the sessiond returns '-1', '1' will be returned to
lttng_data_pending, which it will interpret as being the length of the
'data_pending' byte flag. It will then dereference 'pending', which is
NULL, and (most likely) crash.
Check for NULL on top of checking for the return code. This
communication layer needs love as much as it needs a bulldozer.
Fix: metadata stream is not marked as quiescent after packet commit
When a metadata stream's wait fd is hung-up or enters an error state,
it is checked for quiescence in lttng_ustconsumer_on_stream_hangup().
If the stream is not quiescent, the current packet is closed through
the flush_buffer operation.
Currently, all commits to metadata streams are done on a packet
basis. The various code paths using the commit_one_metadata_packet
helper all perform a flush directly after the commit. Performing this
flush leaves the stream in a "quiescent" state, but does not mark it
as such.
This results in an extraneous flush being performed in the err/hup
handler, which leaves an empty packet to be consumed. This packet is
then consumed during the execution of the err/hup handler.
This bug results in an empty packet being appended to metadata
streams. This packet is typically ignored by readers, but the fact
that it is written at the time of the destruction of a session
violates the immutability guarantee of the session stop
command. Moreover, following the introduction of trace chunks, this
results in the stream attempting to serialize the empty buffer to its
output file _after_ its trace chunk has been closed, causing an
assertion to hit.
Hence, this fix performs the buffer flush and sets the stream as
quiescent directly in commit_one_metadata_packet().
I observed that userspace tracing no longer worked when an
instrumented application (linked against liblttng-ust) was launched
before the session daemon.
While investigating this, I noticed that the shm_open() of
'/lttng-ust-wait-8' failed with EACCES. As the permissions on the
'/dev/shm' directory and the file itself should have allowed the
session daemon to open the shm, this pointed to a change in kernel
behaviour.
Moreover, it appeared that this could only be reproduced on my
system (running Arch Linux) and not on other systems.
It turns out that Linux 4.19 introduces a new protected_regular sysctl
to allow the mitigation of a class of TOCTOU security issues related
to the creation of files and FIFOs in sticky directories.
When this sysctl is not set to '0', it specifically blocks the way the
session daemon attempts to open the app notification shm that an
application has already created.
To quote a comment added in linux's fs/namei.c as part of 30aba6656f:
```
Block an O_CREAT open of a FIFO (or a regular file) when:
- sysctl_protected_fifos (or sysctl_protected_regular) is enabled
- the file already exists
- we are in a sticky directory
- we don't own the file
- the owner of the directory doesn't own the file
- the directory is world writable
```
While the concerns that led to the inclusion of this patch are valid,
the risks that are being mitigated do not apply to the session
daemon's and instrumented application's use of this shm. This shm is
only used to wake-up applications and get them to attempt to connect
to the session daemon's application socket. The application socket is
the part that is security sensitive. At worst, an attacker controlling
this shm could wake up the UST thread in applications which would then
attempt to connect to the session daemon.
Unfortunately (for us, at least), systemd v241+ sets the
protected_regular sysctl to 1 by default (see systemd commit 27325875),
causing the open of the shm by the session daemon to fail.
Introduce a fall-back to attempt a shm_open without the O_CREAT flag
when opening it with 'O_RDWR | O_CREAT' fails. The comments detail the
reason why those attempts are made in that specific order.
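
The order of the two attempts can be sketched as below. This is not the
session daemon's code: the opener is injected through a function
pointer (an assumption made so the logic can be exercised without a
sticky directory), and fake_protected_open() is a stand-in that mimics
the kernel refusing the O_CREAT open.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Same prototype as shm_open(), so shm_open itself can be passed in. */
typedef int (*shm_open_fn)(const char *name, int oflag, mode_t mode);

static int open_shm_with_fallback(shm_open_fn do_open, const char *name,
		mode_t mode)
{
	/* First attempt: create the shm if it does not exist yet. */
	int fd = do_open(name, O_RDWR | O_CREAT, mode);

	if (fd >= 0) {
		return fd;
	}

	/*
	 * Fall-back: the shm may already exist and be owned by another
	 * user, in which case only the O_CREAT open is blocked.
	 */
	return do_open(name, O_RDWR, 0);
}

/* Stand-in opener: rejects O_CREAT opens the way protected_regular would. */
static int fake_protected_open(const char *name, int oflag, mode_t mode)
{
	(void) name;
	(void) mode;
	if (oflag & O_CREAT) {
		errno = EACCES;
		return -1;
	}
	return 42; /* Pretend file descriptor. */
}
```

With a real shm, the call would look like
open_shm_with_fallback(shm_open, "/lttng-ust-wait-8", mode), where the
mode is whatever the caller normally uses.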
Fix: leak of filter bytecode and expression on agent event re-enable
The agent subsystem does not properly assume the clean-up of an
event's filter bytecode and expression when a previously disabled
event is re-enabled.
This change ensures that the ownership of both the filter bytecode
and expression is assumed by the agent subsystem and discarded
when a matching event is found.
Steps to reproduce the leak:
$ lttng create
$ lttng enable-event --python allo --filter 'a[42] == 241'
$ lttng disable-event --python allo
$ lttng enable-event --python allo --filter 'a[42] == 241'
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Setting the "filter" object to NULL prevents the call to
add_filter_app_ctx when needed.
We use the filter from the newly created event to
perform the check and the call to add_filter_app_ctx.
Jonathan Rajotte [Wed, 28 Aug 2019 20:36:03 +0000 (16:36 -0400)]
Fix: check validity of a stream before invoking ust flush command
At the time ustctl_flush_buffer is called the ustream object might have
already been freed on lttng-ust side.
This can happen following a lttng_consumer_cleanup_relayd and concurrent
consumer flush command (lttng stop).
The chain of events goes as follows.
An error on communication with lttng-relayd occurs.
lttng_consumer_cleanup_relayd flags the streams for deletion
(CONSUMER_ENDPOINT_INACTIVE). validate_endpoint_status_data_stream calls
consumer_del_stream.
At the same time the hash table of streams is iterated over in the
flush_channel function following a stop command. The loop is iterating on
a given stream. The current thread is unscheduled before taking the stream
lock.
In the initial thread, the same stream is the current iteration of
cds_lfht_for_each_entry in validate_endpoint_status_data_stream.
consumer_del_stream is called on it. The stream lock is acquired, and
destroy_close_stream is called. lttng_ustconsumer_del_stream is eventually
called and at this point the ustream is freed.
Going back to the iteration in flush_channel. The current stream is still
valid from the point of view of the iteration, ustctl_flush_buffer is then
called on a freed ustream object.
This can lead to undefined behaviour since there is no validation on the
lttng-ust side. The underlying memory of the ustream object is garbage at
this point.
To prevent such scenario, we check for the presence of the node in the
hash table via cds_lfht_is_node_deleted while holding the stream lock.
This is valid because the stream destruction removes the node from
the hash table and frees the ustream object with the stream lock held.
This duplicates similar "validation" checks of the stream object. [1][2]
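
The locking argument can be sketched with plain pthreads. Everything
here is a simplified stand-in: the node_deleted flag plays the role of
cds_lfht_is_node_deleted(), and the int pointer plays the role of the
ustream object.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct stream {
	pthread_mutex_t lock;
	bool node_deleted;	/* Stand-in for cds_lfht_is_node_deleted(). */
	int *ustream;		/* Stand-in for the lttng-ust stream object. */
};

/* Teardown path: removal and release both happen with the lock held. */
static void del_stream(struct stream *s)
{
	pthread_mutex_lock(&s->lock);
	s->node_deleted = true;
	s->ustream = NULL;	/* The real code frees the object here. */
	pthread_mutex_unlock(&s->lock);
}

/* Flush path: returns true if the flush was actually performed. */
static bool flush_stream(struct stream *s)
{
	bool flushed = false;

	pthread_mutex_lock(&s->lock);
	if (!s->node_deleted) {
		/* Safe: teardown cannot run while we hold the lock. */
		(*s->ustream)++;	/* Stand-in for ustctl_flush_buffer(). */
		flushed = true;
	}
	pthread_mutex_unlock(&s->lock);
	return flushed;
}
```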
This issue can be reproduced by the following scenario:
Modify flush_channel to sleep (e.g. 10s) before acquiring the lock on
a stream.
Modify lttng-ust ustctl_destroy_stream to set the
ring_buffer_clock_read callback to NULL.
Note: An assert on !cds_lfht_is_node_deleted in flush channel
after acquiring the lock can provide the same information. We are
modifying the callback to simulate the original backtrace from our
customer.
lttng-relayd
lttng-sessiond
lttng create --live
lttng enable-event -u -a
lttng start
Start some applications to generate data.
lttng stop
The stop command forces a flush of the channel/streams.
pkill -9 lttng-relayd
Expect assert or segfault
The original customer backtrace:
0 lib_ring_buffer_try_switch_slow (handle=<optimized out>, tsc=<synthetic pointer>, offsets=0x3fffa9b76c80, chan=0x3fff98006e90, buf=<optimized out>,
mode=<optimized out>) at /usr/src/debug/lttng-ust/2.9.1/git/libringbuffer/ring_buffer_frontend.c:1834
1 lib_ring_buffer_switch_slow (buf=0x3fff98016b40, mode=<optimized out>, handle=0x3fff98017670)
at /usr/src/debug/lttng-ust/2.9.1/git/libringbuffer/ring_buffer_frontend.c:1952
2 0x00003fffac680940 in ustctl_flush_buffer (stream=<optimized out>, producer_active=<optimized out>)
at /usr/src/debug/lttng-ust/2.9.1/git/liblttng-ust-ctl/ustctl.c:1568
3 0x0000000010031bc8 in flush_channel (chan_key=<optimized out>) at ust-consumer.c:772
4 lttng_ustconsumer_recv_cmd (ctx=<optimized out>, sock=<optimized out>, consumer_sockpoll=<optimized out>) at ust-consumer.c:1651
5 0x000000001000de50 in lttng_consumer_recv_cmd (ctx=<optimized out>, sock=<optimized out>, consumer_sockpoll=<optimized out>) at consumer.c:2011
6 0x0000000010014208 in consumer_thread_sessiond_poll (data=0x10079430) at consumer.c:3192
7 0x00003fffac608b30 in start_thread (arg=0x3fffa9b7bdb0) at pthread_create.c:462
8 0x00003fffac530d0c in .__clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:96
Fix: initialize syscall table when kernel tracer is lazily initialized
How to reproduce:
start lttng-sessiond while lttng-modules are not installed, then install
lttng-modules. Then issue "lttng list --syscall -k". It will show an
empty syscall list because the system call list has not been
initialized.
Jonathan Rajotte [Thu, 23 May 2019 18:02:26 +0000 (14:02 -0400)]
Fix: python binding: expose domain buffer type
On enable_channel the domain buffer type is used to create a temporary
channel. This currently fails for kernel channels since the buffer type is
not exposed at the binding level and defaults to LTTNG_BUFFER_PER_PID.
Channels for the kernel domain can only be created in LTTNG_BUFFER_GLOBAL
mode.
Exposing the buffer type also allows userspace channels to use the per-uid
buffering scheme.
The current bindings are in a rough state. This is to at least get them
to work with kernel domain.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
The debugging logging macros (e.g. DBG()) are used as printf in the
lttng-tools source files. The printf() implementation does not alter the
errno value, so the fact that log_add_time() (through clock_gettime())
can alter errno is unexpected. For instance, adding a logging statement
for debugging purposes within a function for which errno is expected to
stay unchanged on return will change the behavior between execution with
-vvv and non-verbose.
The relayd stream beacon_ts_end field is expected to have the value
-1ULL when unset (no beacon has been received since last index).
However, the initial state is wrong. It is left at the value 0, which
indicates that a live beacon has indeed been received (which is untrue),
which in turn causes a live beacon with ctf_stream_id of -1ULL to be
sent to babeltrace, which does not expect it, and fails.
This issue can be triggered with the following scenario:
1) create live session
2) setup UST per-uid buffers tracing
3) start tracing, without any active traced application
4) hook with babeltrace live client to view the trace
5) run a traced application
Step 5) will cause the babeltrace live client to receive a stream_id of
-1ULL, and error out.
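
The initial state described above can be illustrated with a reduced,
hypothetical structure (the real relayd stream has many more fields):
zero-initialization alone leaves the field looking like a received
beacon, so the sentinel has to be set explicitly.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Sentinel meaning "no beacon received since the last index". */
#define BEACON_UNSET ((uint64_t) -1ULL)

/* Reduced stand-in for the relayd stream state. */
struct stream_state {
	uint64_t beacon_ts_end;
};

static void stream_state_init(struct stream_state *s)
{
	memset(s, 0, sizeof(*s));
	/* A value of 0 would read as "a beacon was received". */
	s->beacon_ts_end = BEACON_UNSET;
}

static bool beacon_received(const struct stream_state *s)
{
	return s->beacon_ts_end != BEACON_UNSET;
}
```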
Fix tests: NULL pointer dereference in ust channel unit tests
The test_create_ust_channel() test case erroneously checks for
a NULL session instead of a channel. This can result in a
NULL pointer dereference on failure to create a ust channel.
The scope of usess is reduced to prevent similar mistakes in the
future. Moving 'dom' has made it obvious that this variable is
unused. Hence, it is removed.
Yannick Lamarre [Tue, 26 Mar 2019 19:53:06 +0000 (15:53 -0400)]
Fix: Properly sanitize input parameter
The lttng client uses the sizeof the containing buffer, defined as
LTTNG_SYMBOL_NAME_LEN, for input string sanitation instead of libc defined
macro NAME_MAX. lttng-enable_channel improperly verified user input
and wrongly discarded valid input in case NAME_MAX was less than the
sizeof the containing buffer for the channel's name.
This patch also fixes potential buffer overflow caused by an improperly
bounded strcpy in the case where NAME_MAX would have been greater than
LTTNG_SYMBOL_NAME_LEN.
Fix: consumer snapshot: handle unsigned long overflow
Comparing the consumed iterator and the produced position without
using a difference generates an empty snapshot when the iterator is
before unsigned long overflow and the produced position is after
unsigned long overflow.
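
A small sketch of the difference-based comparison, using a hypothetical
has_data() helper. The conversion of the unsigned difference to a
signed long is implementation-defined in C; the sketch assumes the
usual two's complement behaviour of GCC and Clang.

```c
/*
 * Returns nonzero when `consumed` is still behind `produced`, including
 * when `produced` has wrapped around and `consumed` has not.
 */
static int has_data(unsigned long consumed, unsigned long produced)
{
	/*
	 * Unsigned subtraction wraps modulo 2^N, so the difference is the
	 * distance between the two positions. Reading it as signed tells
	 * which one is ahead, as long as they are less than half the range
	 * apart.
	 */
	return (long) (consumed - produced) < 0;
}
```

A direct 'consumed < produced' test would report no data for a consumed
position just below the overflow and a produced position just above it.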
Fix: wrong error code returned by kernel_snapshot_record()
On snapshot error, kernel_snapshot_record() can return
LTTNG_ERR_KERN_CONSUMER_FAIL which means that the kernel consumer
daemon failed to launch. In this path, the appropriate error to
return is LTTNG_ERR_KERN_META_FAIL.
Docs: document the format of the lttng_session path member
Document that the path returned through a session listing operation
is not a path nor standard URL. While a UNIX path will be returned
when a session is configured to trace locally, a liblttng-ctl user
should not expect this field to contain a valid URL when a network
streaming (or live) output destination is configured. The "path"
field will hold a custom-formatted string describing the output.
This is arguably unexpected, but since this is currently the only
way to obtain the destination of an existing session, this format
will not be changed to preserve compatibility with existing tools
which could rely on this format.
A description of the formatting used by the session daemon is
added as part of this patch.
Fix: don't destroy the sockets if the snapshot was successful
Missing a goto to skip the error condition that was destroying the
relayd sockets even if a snapshot was successful. We want to keep them
open to reuse them for the next snapshots.
The run_as structure (handle) is allocated and initialized before
the fork() that spawns the run_as process. Currently, that structure
is only cleaned-up on the parent's end.
This fix performs the clean-up on the worker's side as well.
test_crash expects side-effects of directory creation to happen while
tracing is still stopped. In preparation for changing that behavior,
ensure that tracing is started when those side-effects are expected.
The semantic expected from max_t and min_t is to perform the max/min
comparison in the type provided as first parameter.
Cast the input parameters to the proper type before comparing them,
rather than after. There is no more need to cast the result of the
expression now that both inputs are cast to the right type.
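
A sketch of macros with the described semantic (the exact definitions
in the tree may differ; like most such macros, these evaluate their
arguments more than once):

```c
/* Both inputs are cast to `type` *before* the comparison takes place. */
#define max_t(type, a, b) \
	((type) (a) > (type) (b) ? (type) (a) : (type) (b))
#define min_t(type, a, b) \
	((type) (a) < (type) (b) ? (type) (a) : (type) (b))
```

For instance, max_t(int, -1, 1U) compares -1 and 1 as ints and yields
1. Casting only the result would compare -1 against 1U as unsigned
values, where -1 converts to UINT_MAX and wins the comparison.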
UST can receive the session start command before all probe provider
library constructors have completed running, therefore finding fewer
events than eventually enabled within the process. Moreover, with
per-uid buffers, many processes end up registering events into shared
buffers. Therefore, the guess based on number of events from the first
process to use the buffer is incorrect.
Considering that we typically have applications with more than 30
events, we will modify the session daemon so it selects the "large"
header type independently of the number of events.
Jonathan Rajotte [Tue, 11 Sep 2018 00:09:11 +0000 (20:09 -0400)]
Fix: double put on error path
Let relay_index_try_flush be responsible for the self-reference put on
error path.
Code flow of relay_index_try_flush is a bit tricky but the only error
flow (via relay_index_file_write) will always mark the index as flushed
and perform the self-reference put.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: acquire stream lock during kernel metadata snapshot
The stream lock is not taken when interacting with the kernel
metadata stream that is created at the time a snapshot is taken.
This was noticed while reviewing the code for an unrelated reason,
so there is no known problem caused by this. Nevertheless, this
is incorrect as the stream is globally visible in the consumer.
Moreover, the stream was not cleaned-up which can cause a leak
whenever a metadata snapshot fails.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
There is no value in listing a closed session. A viewer cannot hook
itself to a closed session in live mode and the session is about to be
removed from the sessions hash table.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: use LTTNG_VIEWER_ATTACH_UNK to report a closed session
LTTNG_VIEWER_NEW_STREAMS_HUP is not a valid error number for the
LTTNG_VIEWER_ATTACH_SESSION command. This results in erroneous error
reporting on the client side.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: perform relayd socket pair cleanup on control socket error
A reference to the local context for the socket pair is used to "force" an
evaluation of the data and metadata streams since we changed the endpoint
status. This imitates what is currently done for the data socket.
This prevents hitting network timeouts multiple times in a row when an
error occurs. For now, there is no retry mechanism, hence
"terminating" all communication makes sense and prevents unwanted delays
on operation.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Tests: do not bound test app iterations when in background mode
On systems with a high number of CPUs and slow disk, taking snapshots
can take a long time. When running a long regression test, the tests
sometimes outlive the test application.
The test application then exits since the required number of
iterations was completed (NR_ITER=2000000).
Set the iterations parameter to -1 to ensure the application keeps
producing events for the duration of the test.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
The snapshot command does not print explicit errors when
arguments are missing. This commit introduces more error
reporting and ensures that lttng_error_code and cmd_error_code
values are not freely mixed.
Fix: lttng-save command producing wrong XML fields
Saving a session configuration with a probe or a function event would
generate a XML file considered invalid by the lttng-load command.
This is due to the fact that, for a probe event, lttng-save would output
the following XML event type field:
<type>KPROBE</type>
but lttng-load command would be expecting the following field:
<type>PROBE</type>.
As a fix, the lttng-save command now rightfully outputs the PROBE field.
Given that this usecase never worked, changing the field is not a
breaking change.
Also, the save command was wrongfully using FUNCTION xml event type for
the LTTNG_KERNEL_FUNCTION event type when it is in fact the
FUNCTION_ENTRY xml event type.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: possible NULL dereference in uri_parse_str_urls()
The data_url parsing of uri_parse_str_urls assumes that a ctrl
URL was provided to check that both URLs point to the same
destination. A check for 'ctrl_uris != NULL' is added, but this
function needs to be refactored at some point as it is not clear
what its role is (i.e. it's probably doing too much).
Set consumer's verbosity to the max level on --verbose-consumer
The consumer's verbosity is set to '1' when --verbose-consumer
is used when launching the session daemon. This means that all
DBG2/3() statements are ignored.
This commit always sets the consumer's verbosity to the maximal
level.
Fix: missing context enum values in session xml schema
Handling of the following enum values is added:
LTTNG_EVENT_CONTEXT_INTERRUPTIBLE
LTTNG_EVENT_CONTEXT_PREEMPTIBLE
LTTNG_EVENT_CONTEXT_NEED_RESCHEDULE
LTTNG_EVENT_CONTEXT_MIGRATABLE
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Anders Wallin [Thu, 17 May 2018 20:50:41 +0000 (22:50 +0200)]
Tests: add session auto-loading test cases
lttng-sessiond can auto-load sessions at startup:
- with "--load" option to lttng-sessiond, load one file
or all sessions files in that directory
- from session files in $LTTNG_HOME/.lttng/sessions/auto/
- from session files in $sysconfdir/lttng/sessions/auto
This test case validates the first two scenarios.
Signed-off-by: Anders Wallin <wallinux@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: calling ht_{hash, match}_enum with wrong argument
ht_hash_enum and ht_match_enum are currently called with the address of the
pointer to a ust_registry_enum rather than the expected pointer to a
ust_registry_enum. This means that those function calls would end up
using garbage for hashing and comparing.
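
The mistake is easy to make because such callbacks take a 'void *' key,
so the compiler accepts either form. A reduced illustration, in which
the structure, the name-based hash and lookup_hash() are all
hypothetical:

```c
#include <stddef.h>

/* Reduced stand-in for a registry enum entry. */
struct reg_enum {
	char name[32];
	unsigned long id;
};

/* Hash callback: expects a pointer to a struct reg_enum. */
static unsigned long hash_enum(const void *_key)
{
	const struct reg_enum *key = _key;
	unsigned long hash = 5381;
	size_t i;

	for (i = 0; key->name[i] != '\0'; i++) {
		hash = hash * 33 + (unsigned char) key->name[i];
	}
	return hash;
}

static unsigned long lookup_hash(const struct reg_enum *reg_enum)
{
	/*
	 * Correct: pass the pointer itself. Passing '&reg_enum' would also
	 * compile, but the callback would then read the bytes of the
	 * pointer variable as if they were the structure.
	 */
	return hash_enum(reg_enum);
}
```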
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: probes should be compared strictly by events metadata
Currently, events are compared using names and signatures. Events
with different payloads but identical name and signatures could
lead to corrupted trace because the Session Daemon would consider them
identical and give them the same event ID.
Events should be compared using the name, loglevel, fields and
model_emf_uri to ensure that their respective metadata is the same.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: perform the initialization memory barrier out of loop body
The memory barrier used by the client thread should be performed
after the lttng_sessiond_ready counter has been seen to have
reached zero.
This ensures that loads are not speculatively performed before
this point as the thread will interact with data structures
initialized by the support threads for which it was waiting for
the initialization to complete.
See the comment as to why this read barrier is promoted to a
full barrier.
Michael Jeanson [Tue, 15 May 2018 20:19:49 +0000 (16:19 -0400)]
Port: fix format warnings on Cygwin
On Cygwin, be64toh() returns a "long long unsigned int" while the
format specifier PRIu64 expects a "long unsigned int". Both types
are 64bits integers, just cast the result to uint64_t to silence
the warnings.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: don't wait for the load thread before serving client commands
Since the session loading thread uses the same communication mechanism as
the external clients, it should not be included in the set of
threads that must be launched before the sessiond starts to serve
client commands.
Since the "load session" thread is guaranteed to be the last
essential thread to be initialized, it can explicitly signal
the parents that the sessiond is ready once it is done auto-loading
session configurations.
This commit also adds a lengthy comment explaining the initialization
of the session daemon.
Fix: sessiond fails to launch on --without-ust configuration
The sessiond will never signal that it is ready (in daemonize or
background modes) if it was built without lttng-ust. The fix in
7eac7803 made the main thread wait for the agent thread to be
ready before signalling that the session daemon is ready.
When agent tracing is not possible due to the absence of lttng-ust,
a stub function is used to launch the agent thread. This stub
must call sessiond_notify_ready() in order to unblock the main
thread.
Note that it would be _incorrect_ to not wait for the agent
thread to be launched as users expect all tracing features to
be available as soon as 'lttng-sessiond --daemonize/--background'
returns.
Not waiting for the thread to be ready caused very rare failures
of the agent tracing tests on the CI, especially on ARM and
PowerPC targets.
Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: agent thread poll set creation failure results in deadlock
Failing to initialize the agent thread's pollset will cause
the thread to exit before calling sessiond_notify_ready().
This will cause the main thread to wait forever for all threads
to be launched when such an error occurs.
The agent thread is not needed for the sessiond to work (except
to enable the tracing of Java and Python applications). Such
a failure should leave the sessiond in a useable state.
Fix: failure to launch agent thread is not reported
A session daemon may fail to launch its agent thread. In such
a case, the tracing of agent domains fails silently as events
never get enabled through the agent.
The problem that was reported was caused by a second session
daemon being already bound on the agent TCP socket port, which
prevented the launch of the agent thread.
While in this situation tracing is still not possible, the user
will at least get an error indicating so when enabling
an event in those domains.