Olivier Dion [Wed, 1 Feb 2023 21:04:14 +0000 (16:04 -0500)]
Tests: Add get_possible_cpus_count utility
lttng-ust uses the possible number of CPUs to allocate its ring buffers.
Certain tests have to take that into consideration in their calculation
instead of relying on online processors.
Thus, add the same logic for calculating the possible CPUs on the
system.
Change-Id: I9f14afba3e4adad9547cbd9386f8e1b1b55a3253 Signed-off-by: Olivier Dion <odion@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 23 Mar 2023 16:45:18 +0000 (12:45 -0400)]
Fix: truncated len in lttng_event_rule_user_tracepoint_serialize()
Observed issue
==============
On 64-bit big-endian platforms, the serialization/deserialization tests
of tracepoint event rules fail since the length of individual exclusions
is truncated in lttng_event_rule_user_tracepoint_serialize and appear as
"0" on the receiving end.
Cause
=====
The length of the exclusion name string is stored in a variable of type
`size_t`. However, since the protocol expects it to be expressed as a
uint32_t, the value is added to the payload by copying the first 4 bytes
of the value.
On a 32-bit system this would be fine since `sizeof(size_t) == 4`. Even
worse, it would work most of the time (assuming an exclusion name string
< 4GiB) on a little-endian 64-bit system as the least significant bits
would be copied and correctly express the length of the string.
On a big-endian 64-bit platform, the most-significant 4 bytes are copied
to the payload buffer thus making the string length appear as "0".
Solution
========
A temporary variable is used to hold the "casted" value and make it safe
to copy to the payload buffer regardless of the platform's endianness.
Known drawbacks
===============
None.
Change-Id: I64c03345fff7ffea2f8fcb84692a085da31c421b Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Mon, 31 Oct 2022 19:52:46 +0000 (15:52 -0400)]
Fix: lttng: poptGetArg doesn't provide string ownership
The string returned by poptGetArg() is 'const' because it is owned by
the popt librairy and is free'd by it when poptFreeContext() is called.
Copy those strings when we need to alter them to maintain proper
ownership.
The latest release of popt (v1.19) introduced a breaking
change (changing the ownership of left-over command line arguments) that
can cause double free()-s.
Fix: event field value: assertion fails on empty string
When converting msgpack objects to their event_field_value equivalent,
the following assertion fails: LTTNG_ASSERT(val);
#4 0x00007f1f65349486 in __assert_fail () from /usr/lib/libc.so.6
#5 0x00007f1f65584da7 in lttng_event_field_value_string_create_with_size (val=0x0, size=0) at event-field-value.cpp:186
#6 0x00007f1f65576a1a in event_field_value_from_obj (obj=0x557f597ccdb8, field_val=0x7ffcc9675dd0)
at conditions/event-rule-matches.cpp:1120
#7 0x00007f1f65577176 in event_field_value_from_capture_payload (condition=0x557f597c8520,
capture_payload=0x557f597c825b "\221\240", capture_payload_size=2) at conditions/event-rule-matches.cpp:1340
#8 0x00007f1f655772ea in lttng_evaluation_event_rule_matches_create (condition=0x557f597c8520,
capture_payload=0x557f597c825b "\221\240", capture_payload_size=2, decode_capture_payload=true)
at conditions/event-rule-matches.cpp:1398
#9 0x00007f1f655765fc in lttng_evaluation_event_rule_matches_create_from_payload (condition=0x557f597c8520,
view=0x7ffcc9675ff0, _evaluation=0x7ffcc9676080) at conditions/event-rule-matches.cpp:990
#10 0x00007f1f6557f273 in lttng_evaluation_create_from_payload (condition=0x557f597c8520, src_view=0x7ffcc9676100,
evaluation=0x7ffcc9676080) at evaluation.cpp:120
#11 0x00007f1f6559ba36 in lttng_notification_create_from_payload (src_view=0x7ffcc9676190, notification=0x7ffcc9676180)
at notification.cpp:123
#12 0x00007f1f65552577 in create_notification_from_current_message (channel=0x557f597c8ee0) at channel.cpp:124
#13 0x00007f1f6555298c in lttng_notification_channel_get_next_notification (channel=0x557f597c8ee0, _notification=0x7ffcc9676280)
at channel.cpp:292
The msgpack API represents string as p-style while the implementation of
event_field_value relies on null-terminated strings. When an empty
string is captured by a tracer, it is decoded as a msgpack_object with
`str = {size = 0, ptr = 0x0}`.
lttng_event_field_value_string_create_with_size does not require a
null-terminated string since it also receives the length. Hence, this
fix causes lttng_event_field_value_string_create_with_size to accept
null strings when their length is zero. A copy of an empty string is
made to accomodate the null-termination convention used by the rest of
that API.
Build fix: rpath of test libraries results in non-reproducible build
Yocto has trouble making LTTng-tools builds reproducibile since
abs_builddir is used as the rpath of the test libraries under
tests/regression/ust/ust-dl/.
From their commit message:
Specifing abs_builddir as an RPATH is plain wrong when cross
compiling. Sadly, removing the rpath makes libtool/automake do weird
things and breaks the build as shared libs are no longer generated.
We already try and delete the RPATH at do_install with chrpath however
that does leave the path in the string table so it doesn't help us
with reproducibility.
Their fix consists in hardcoding /usr/lib as the rpath of those
libraries. As mentionned, the rpath doesn't matter; it's used to
workaround libtool's behaviour.
This fix uses `$libdir`, which takes the user's specified prefix into
account to generate an rpath that is consistent with the one inserted in
the other artifacts.
Note that the change in the notification tests is a bit more involved
since we have to bypass libtool to instrument a test application with
uprobes.
Fixes #1361
Change-Id: I7739956f8bc8571ef90802c3b37a4e1f21352385 Reported-by: Alexander Kanavin <alex.kanavin@gmail.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: sessiond: abort called on undefined client command
Issue observed
--------------
When running in verbose mode, the session daemon calls abort()
when it receives an unknown client command:
#1 0x00007f66ffd69958 in raise () from /usr/lib/libc.so.6
#2 0x00007f66ffd5353d in abort () from /usr/lib/libc.so.6
#3 0x000055a671a6f6bd in lttcomm_sessiond_command_str (cmd=1633771873)
at ../../../src/common/sessiond-comm/sessiond-comm.hpp:199
#4 0x000055a671a73897 in process_client_msg (cmd_ctx=0x7f66f5ff6d10,
sock=0x7f66f5ff6c34, sock_error=0x7f66f5ff6c38) at client.cpp:1006
#5 0x000055a671a777fc in thread_manage_clients (data=0x55a673956100)
at client.cpp:2622
#6 0x000055a671a6d290 in launch_thread (data=0x55a673956170) at thread.cpp:68
Cause
-----
process_client_msg() logs the client command on entry. While it
previously logged the numerical value, it now provides the string-ified
version of the command id (since 19f912db8).
The lttcomm_sessiond_command_str() function aborts when it encounters an
unknown command id. This is reasonable (in so far that it is how we
handle these situations, typically). However, the validity of the
command must be checked beforehand as it comes from an
external (untrusted) source.
Solution
--------
Add lttcomm_sessiond_command_is_valid and tombstone command IDs to
lttcomm_sessiond_command to ensure only valid command ids are passed
to lttcomm_sessiond_command_str when logging.
Drawbacks
---------
None
Reported-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibd36f1e69da984c7f27b55ec68e5e3fe06d7ac91
Fix: consumer: snapshot: assertion on subsequent snapshot
Observed issue
==============
While a snapshot is being taken, the containing folder can disappear
unexpectedly. This can lead to the following errors, which are expected
and mostly handled fine:
PERROR - 14:47:32.002564464 [2922498/2922507]: Failed to open file relative to trace chunk file_path = "channel0_0", flags = 577, mode = 432: No such file or directory (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1411)
Error: Failed to open stream file "channel0_0"
Error: Snapshot channel failed
The problem happens on the subsequent snapshot for the session:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fbbdadb3859 in __GI_abort () at abort.c:79
#2 0x00007fbbdadb3729 in __assert_fail_base (fmt=0x7fbbdaf49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55c4212cfbb5 "!stream->trace_chunk", file=0x55c4212cf820 "kernel-co
#3 0x00007fbbdadc5006 in __GI___assert_fail (assertion=0x55c4212cfbb5 "!stream->trace_chunk", file=0x55c4212cf820 "kernel-consumer/kernel-consumer.cpp", line=188, function=0x55c4212cfb00 "
#4 0x000055c421268cc6 in lttng_kconsumer_snapshot_channel (channel=0x7fbbc4000b60, key=1, path=0x7fbbd37f8fd4 "", relayd_id=18446744073709551615, nb_packets_per_stream=0) at kernel-consume
#5 0x000055c42126b39d in lttng_kconsumer_recv_cmd (ctx=0x55c421b80a90, sock=31, consumer_sockpoll=0x7fbbd37fd280) at kernel-consumer/kernel-consumer.cpp:986
#6 0x000055c4212546d1 in lttng_consumer_recv_cmd (ctx=0x55c421b80a90, sock=31, consumer_sockpoll=0x7fbbd37fd280) at consumer/consumer.cpp:2090
#7 0x000055c421259963 in consumer_thread_sessiond_poll (data=0x55c421b80a90) at consumer/consumer.cpp:3281
#8 0x00007fbbdaf8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9 0x00007fbbdaeb0163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
How to reproduce:
1. Setting a breakpoint on snapshot_channel() inside
src/common/ust-consumer/ust-consumer.cpp
2. When the breakpoint hits, remove the the complete lttng directory
containing the session data.
3. Continue the lttng_consumerd process from gdb.
4. In that case you see a negative return value -1 from
consumer_stream_create_output_files() inside snapshot_channel().
5. Take another snapshot and lttng_consumerd crashes because
of the `assert(!stream->trace_chunk)` in snapshot_channel().
This last action does not require any breakpoint intervention.
Cause
=====
During the snapshot, the stream is assigned the channel current chunk.
It is expected that the stream does not have a chunk at this point.
The error handling is faulty here, the stream chunk must be
invalidated/reset on error to allow its reuse later on.
The problem exists for both consumer domains (user/kernel).
Solution
========
For the ust consumer, we can directly use the `error_close_stream`
label.
For the kernel consumer, the code path is slightly different since it
does not uses `consumer_stream_close`. Note that `consumer_stream_close`
cannot be used as is for the kernel consumer. The current implementation
partially resembles `consumer_stream_close` at the end of the iteration.
It is extracted to its own function for easier reuse from the new
`error_finalize_stream` label.
Fix: sessiond: instance uuid is not sufficiently unique
Observed issue
==============
Tracing a cluster of machines -- all launched simultaneously -- to the
same relay daemon occasionally produces corrupted traces.
The size of packets received (as seen from the relay daemon logs) and
that of those present in the on-disk stream occasionally didn't match.
The traces were observed to all relate to the same trace UUID, but
present packet begin timestamps that were not monotonic for a given
stream.
This causes both Babeltrace 1.x and 2.x to fail to open the traces with
different error messages related to clocks.
Cause
=====
On start, the session daemon generates a UUID to uniquely identify the
sessiond instance. Since the UUID generation utils use time() to seed
the random number generator, two session daemons launched within the
same second can end up with the same instance UUID.
Since the relay daemon relies on this UUID to uniquely identify a
session daemon accross its various connections, identifier clashes can
cause streams from the same `uid` or `pid` to become scrambled resulting
in corrupted traces.
Solution
========
The UUID utils now initializes its random seed using the getrandom() API
in non-blocking mode. If that fails -- most likely because the random
pool is depleted or the syscall is not available on the platform -- it
falls back to using a hash of two time readings (with nanosecond
precision), of the hostname, and the PID.
Known drawbacks
===============
This fix implements many fallbacks, each with their own caveats and we
don't have full test coverage for all of those for the moment.
This article presents the different drawbacks of using /dev/urandom vs
getrandom().
https://lwn.net/Articles/884875/
As for the pseudo-random time and configuration based fallback, it is
meant as a last resort for platforms or configurations where both
getrandom() (old kernels or non-Linux platforms) and /dev/urandom (e.g.
locked-down container) are not be available. I haven't done a formal
analysis of the entropy of this home-grown method. The practical
use-case we want to enable is launching multiple virtual machines (or
containers) at roughly the same time and ensure that they don't end up
using the same sessiond UUID.
In that respect, having a different host name and minute timing changes
seem enough to prevent a UUID clash.
Using the PID as part of the hash also helps when launching multiple
session daemons simultaneously for different users.
The waiter lttng_waiter_wait() implements a futex wait/wakeup
scheme similar to the liburcu workqueue code, which has an issue with
spurious wakeups.
A spurious wakeup on lttng_waiter_wait can cause
lttng_waiter_wait to reach label skip_futex_wait with a
waiter->state state of WAITER_WAITING, which is unexpected. It would
cause busy-waiting on WAITER_TEARDOWN state to start early. The
wait-teardown stage is done with WAIT_ATTEMPTS active attempts,
following by attempts spaced by 10ms sleeps. I do not expect that these
spurious wakeups will cause user-observable effects other than being
slightly less efficient that it should be.
This issue will cause spurious unexpected high CPU use, but will not
lead to data corruption.
Cause
=====
From futex(5):
FUTEX_WAIT
Returns 0 if the caller was woken up. Note that a wake-up can
also be caused by common futex usage patterns in unrelated code
that happened to have previously used the futex word's memory
location (e.g., typical futex-based implementations of Pthreads
mutexes can cause this under some conditions). Therefore, call‐
ers should always conservatively assume that a return value of 0
can mean a spurious wake-up, and use the futex word's value
(i.e., the user-space synchronization scheme) to decide whether
to continue to block or not.
Solution
========
We therefore need to validate whether the value differs from
WAITER_WAITING in user-space after the call to FUTEX_WAIT returns 0.
The futex futex_nto1_wait() implements a futex wait/wakeup scheme
identical to the liburcu workqueue code, which has an issue with
spurious wakeups.
A spurious wakeup on futex_nto1_wait can cause futex_nto1_wait to return
with a futex state of -1, which is unexpected.
futex_nto1_wait is used by the relayd live dispatcher thread, by the
relayd main dispatcher thread, as well as by the sessiond dispatcher
thread.
Given that following a futex_nto1_wait returning due to a spurious
wakeup futex_nto1_prepare will set the futex value to -1, things go
back to normal for the following futex_nto1_wait calls.
Therefore, the only impact of this issue is to spuriously use slightly
more CPU time than strictly required.
The effect is even shorter-lasting that in the liburcu counterparts
because futex_nto1_prepare explicitly sets the futex state to -1 rather
than use an atomic decrement, which immediately sets to state back to
a consistent state.
Cause
=====
From futex(5):
FUTEX_WAIT
Returns 0 if the caller was woken up. Note that a wake-up can
also be caused by common futex usage patterns in unrelated code
that happened to have previously used the futex word's memory
location (e.g., typical futex-based implementations of Pthreads
mutexes can cause this under some conditions). Therefore, call‐
ers should always conservatively assume that a return value of 0
can mean a spurious wake-up, and use the futex word's value
(i.e., the user-space synchronization scheme) to decide whether
to continue to block or not.
Solution
========
We therefore need to validate whether the value differs from -1 in
user-space after the call to FUTEX_WAIT returns 0.
Note that despite being created by the same function,
`utils_create_pid_file`, the permissions are not the same.
Cause
=====
`get_wait_shm` manipulates the umask and does not restore it, thus
influencing the outcome of following file creations that don't enforce
specific permissions (using chmod).
Also `fopen` defaults to mode `0666 & ~umask`, thus resulting in
unnecessarily lax permissions when the session daemon is started as a
non-privileged user (umask = 0002, most of the time).
Solution
========
Mimic other call sites of umask(), modify then revert the umask.
Open the pid and agent port files as 0644 letting the umask to do its
job as necessary for those files.
Remove unnecessary umask() usage when chmod is directly used.
Known drawbacks
===============
Use of umask in a multi-threaded process is not recommended. Still our
current usage is limited and mostly happens during the initialization
phase. The usage of umask() is required for the `wait_shm` since on
FreeBSD it is not possible to chmod an shm file descriptor. The default
umask would interfere here.
Discussion
==========
The usage in run-as is valid even when in no-clone mode (valgrind) since
it is the sole user of umask() following the initialization phase. When
spawned as a separate process the clearing of umask is totally valid
even if it is not ideal since we are ignoring any umask set by the user.
It seems like the current usage is the lesser evil here.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie224d254714fff05f4bced471ebfa8f19eede26a
Michael Jeanson [Tue, 9 Aug 2022 15:38:16 +0000 (15:38 +0000)]
Fix: tests: don't assume sequential cpuids
On Linux CPU ids aren't sequential if a CPU is offlined or unplugged.
Get the list of currently available CPU ids from sysfs and pick a random
one, if sysfs is not available use the previous behavior.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibdb63c7d036389104ac2f629827a6dce59e06983
A number of error codes were added to cmd_rotate_session since the
implementation of size-based rotations. The rotation thread doesn't
expect LTTNG_ERR_ROTATION_MULTIPLE_AFTER_STOP and
LTTNG_ERR_ROTATION_AFTER_STOP_CLEAR which are not fatal failures.
These rotations would simply result in an empty trace archive and are,
therefore, not produced. In both cases, it is safe to wait for the next
size cycle.
Fix: ust-consumerd: set `hangup_flush_done` in a locked context
hangup_flush_done is updated after releasing the stream lock. This
doesn't appear to be a problem right now since this attribute is
apparently always accessed by the same thread, but it is conceptually
sus.
Since c08136a3f, the rotation thread's handle_condition() checks that
the notification received matches the trigger that was registered.
As part of the equality check, the triggers' credentials are compared.
This checks fails systematically since the group id of a trigger's
credentials is not transported by the serialize/create_from functions.
The trigger that is received through the notification thus has an unset
group id, while the rotation trigger of the `ltt_session` has a group id
set; it was not stripped by the communication layer.
The check also fails since the trigger registered for the size-based
rotation is "hidden". This internal attribute is not propagated through
the communication layer, which causes the comparison to fail.
Solution
========
Since triggers only use the 'uid' part of lttng_credentials, we ensure
that lttng_trigger_set_credentials only sets this part of the structure.
Also, the `is_hidden` attribute of a trigger is now propagated through
the communication layer. This has no effect for external applications
since this attribute is not exposed through the API. However, it is
useful for internal triggers which use the same communication
facilities.
This allows the equality check in rotation-thread.cpp to go through as
expected.
Fix: lttng-snapshot: use after free of max size argument
gcc 12.1.0 reports:
commands/snapshot.cpp: In function ‘int cmd_snapshot(int, const char**)’:
../../../src/common/error.hpp:139:32: error: pointer ‘max_size_arg’ may be used after ‘void free(void*)’ [-Werror=use-after-free]
Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fa95633add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fa955e90c09 (/usr/lib/libpopt.so.0+0x3c09)
Direct leak of 3 byte(s) in 1 object(s) allocated from:
#0 0x7fb67ee0edd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fb67e964c09 (/usr/lib/libpopt.so.0+0x3c09)
Direct leak of 224 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb64175 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb6a291 in lttng_trigger* zmalloc<lttng_trigger>() ../../src/common/macros.hpp:89
#3 0x559fbeb64aa6 in lttng_trigger_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/trigger.cpp:58
#4 0x559fbe9dc417 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:87
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 208 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb16e21 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb16e31 in lttng_action_notify* zmalloc<lttng_action_notify>() ../../src/common/macros.hpp:89
#3 0x559fbeb168a0 in lttng_action_notify_create actions/notify.cpp:135
#4 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 160 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb3d7a1 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb3fa35 in lttng_condition_session_consumed_size* zmalloc<lttng_condition_session_consumed_size>() ../../src/common/macros.hpp:89
#3 0x559fbeb3e6fd in lttng_condition_session_consumed_size_create conditions/session-consumed-size.cpp:206
#4 0x559fbe9dc0f1 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:54
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 112 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb242ad in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb27062 in zmalloc<(anonymous namespace)::lttng_rate_policy_every_n> ../../src/common/macros.hpp:89
#3 0x559fbeb25e9f in lttng_rate_policy_every_n_create actions/rate-policy.cpp:492
#4 0x559fbeb168b9 in lttng_action_notify_create actions/notify.cpp:141
#5 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
#6 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#7 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#8 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#9 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#10 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 34 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e19319 in __interceptor_strdup /usr/src/debug/gcc/libsanitizer/asan/asan_interceptors.cpp:454
#1 0x559fbeb3f603 in lttng_condition_session_consumed_size_set_session_name conditions/session-consumed-size.cpp:442
#2 0x559fbe9dc2c4 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:71
#3 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#4 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#5 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#6 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#7 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
The rotation trigger of a session (used for size-based rotations) is
never cleaned-up. It is now cleaned up every time its condition is
hit and whenever the session is destroyed.
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7fef37a29fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7fef37792f2f in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x7fef3779573a in lttng_rotation_schedules* zmalloc<lttng_rotation_schedules>() ../../../src/common/macros.hpp:89
#3 0x7fef377947cc in lttng_rotation_schedules_create /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:353
#4 0x7fef37794aa0 in get_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:392
#5 0x7fef377956dc in lttng_session_list_rotation_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:665
#6 0x5646131713f2 in test_add_list_remove_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:252
#7 0x56461317157b in test_add_list_remove_size_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:270
#8 0x564613171680 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:307
#9 0x7fef373ae30f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fdbdc94add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fdbdc4a0c09 (/usr/lib/libpopt.so.0+0x3c09)
Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7f19429d9dd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7f19425342ad in poptGetNextOpt (/usr/lib/libpopt.so.0+0x82ad)
Direct leak of 112 byte(s) in 1 object(s) allocated from:
#0 0x7fdb9260cfb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7fdb9242348d in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x7fdb924295a9 in lttng_trigger* zmalloc<lttng_trigger>() ../../src/common/macros.hpp:89
#3 0x7fdb92423dbe in lttng_trigger_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/trigger.cpp:58
#4 0x56304832331f in register_trigger /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/trigger/utils/register-some-triggers.cpp:24
#5 0x5630483233f1 in register_trigger_action_list_notify /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/trigger/utils/register-some-triggers.cpp:46
#6 0x5630483239a0 in test_session_rotation_conditions /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/trigger/utils/register-some-triggers.cpp:246
#7 0x563048323d4d in main /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/trigger/utils/register-some-triggers.cpp:309
#8 0x7fdb91c6630f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7fe7c494fdd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fe7c44a5c09 (/usr/lib/libpopt.so.0+0x3c09)
Arguments obtained with poptGetOptArg() must be free'd.
Direct leak of 56 byte(s) in 1 object(s) allocated from:
#0 0x7f8b62634fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x557871869adb in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x55787186c8a0 in zmalloc<(anonymous namespace)::lttng_rate_policy_once_after_n> ../../src/common/macros.hpp:89
#3 0x55787186c173 in lttng_rate_policy_once_after_n_create actions/rate-policy.cpp:707
#4 0x55787186a368 in lttng_rate_policy_once_after_n_create_from_payload actions/rate-policy.cpp:183
#5 0x55787186ad02 in lttng_rate_policy_create_from_payload(lttng_payload_view*, lttng_rate_policy**) actions/rate-policy.cpp:287
#6 0x557871865b5b in test_rate_policy_once_after_n /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_rate_policy.cpp:231
#7 0x557871865dc9 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_rate_policy.cpp:250
#8 0x7f8b61c7130f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Direct leak of 56 byte(s) in 1 object(s) allocated from:
#0 0x7f8b62634fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x557871869adb in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x55787186c890 in zmalloc<(anonymous namespace)::lttng_rate_policy_every_n> ../../src/common/macros.hpp:89
#3 0x55787186b6cd in lttng_rate_policy_every_n_create actions/rate-policy.cpp:492
#4 0x55787186a699 in lttng_rate_policy_every_n_create_from_payload actions/rate-policy.cpp:220
#5 0x55787186ad02 in lttng_rate_policy_create_from_payload(lttng_payload_view*, lttng_rate_policy**) actions/rate-policy.cpp:287
#6 0x557871864cae in test_rate_policy_every_n /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_rate_policy.cpp:122
#7 0x557871865dc4 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_rate_policy.cpp:249
#8 0x7f8b61c7130f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
SUMMARY: AddressSanitizer: 112 byte(s) leaked in 2 allocation(s).
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7f40a9d4c1b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x55ab716e1def in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
#2 0x55ab716e1956 in lttng_dynamic_buffer_append(lttng_dynamic_buffer*, void const*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:52
#3 0x55ab716ca64e in lttng_log_level_rule_serialize(lttng_log_level_rule const*, lttng_payload*) /home/jgalar/EfficiOS/src/lttng-tools/src/common/log-level-rule.cpp:177
#4 0x55ab716c760f in test_log_level_rule_serialize_deserialize /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:60
#5 0x55ab716c8457 in test_log_level_rule_at_least_as_severe_as /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:177
#6 0x55ab716c84d3 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:185
#7 0x7f40a938830f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Direct leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7f40a9d4c1b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x55ab716e1def in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
#2 0x55ab716e1956 in lttng_dynamic_buffer_append(lttng_dynamic_buffer*, void const*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:52
#3 0x55ab716ca64e in lttng_log_level_rule_serialize(lttng_log_level_rule const*, lttng_payload*) /home/jgalar/EfficiOS/src/lttng-tools/src/common/log-level-rule.cpp:177
#4 0x55ab716c760f in test_log_level_rule_serialize_deserialize /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:60
#5 0x55ab716c8135 in test_log_level_rule_exactly /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:154
#6 0x55ab716c84ce in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_log_level_rule.cpp:184
#7 0x7f40a938830f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Direct leak of 8696 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707ddc6004 in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x55707ddceb17 in ltt_ust_session* zmalloc<ltt_ust_session>() ../../../src/common/macros.hpp:89
#3 0x55707ddc81e7 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:274
#4 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#5 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#6 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 24672 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707dee4ec1 in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x55707def774e in consumer_output* zmalloc<consumer_output>() ../../../src/common/macros.hpp:89
#3 0x55707dee90df in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:523
#4 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
#5 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#6 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#7 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bf985f in alloc_split_items_count /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:688
#2 0x7efed0bf985f in _cds_lfht_new /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:1642
Indirect leak of 656 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfac68 in __default_alloc_cds_lfht ../src/rculfhash-internal.h:172
#2 0x7efed0bfac68 in alloc_cds_lfht /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:81
Indirect leak of 48 byte(s) in 2 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:35
#2 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:28
Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707de3a9af in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x55707de3a9bf in lttng_ht* zmalloc<lttng_ht>() ../../src/common/macros.hpp:89
#3 0x55707de38461 in lttng_ht_new(unsigned long, lttng_ht_type) hashtable/hashtable.cpp:113
#4 0x55707dee9340 in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:535
#5 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
#6 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#7 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#8 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfac15 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:31
Direct leak of 32 byte(s) in 1 object(s) allocated from:
#0 0x7fdb86fcd1b2 in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x7fdb86d7c296 in lttng_dynamic_buffer_set_capacity(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:159
#2 0x7fdb86d7c060 in lttng_dynamic_buffer_set_size(lttng_dynamic_buffer*, unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/common/dynamic-buffer.cpp:112
#3 0x7fdb86d2589a in recv_payload_sessiond /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:230
#4 0x7fdb86d26fa5 in lttng_ctl_ask_sessiond_payload(lttng_payload_view*, lttng_payload*) /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:662
#5 0x7fdb86d2cd8d in lttng_list_tracepoint_fields /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.cpp:1767
#6 0x56481623cb4c in list_ust_event_fields commands/list.cpp:850
#7 0x5648162448d9 in cmd_list(int, char const**) commands/list.cpp:2394
#8 0x56481628fb3e in handle_command /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:238
#9 0x564816290601 in parse_args /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:427
#10 0x564816290908 in main /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng/lttng.cpp:476
#11 0x7fdb8661730f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
SUMMARY: AddressSanitizer: 32 byte(s) leaked in 1 allocation(s).
The session daemon's reply is indeed never released in
lttng_list_tracepoint_fields.
Fix: liblttng-ctl: non-packed structure used for tracker serialization
Using unpacked structures in liblttng-ctl's protocol can cause issues
when mixing sessiond and client of different bitness. In this specific
case I doubt it causes a problem, but it could rightfully do on some
architectures.
Fix: sessiond: assert on empty payload when handling client out event
Observed issue
==============
When servicing a large number of tracer notifications and sending
notifications to clients, the session daemon occasionally hits
an assertion:
#4 0x00007fb224d7d116 in __assert_fail () from /usr/lib/libc.so.6
#5 0x000056038b2fe4d7 in client_flush_outgoing_queue (client=0x7fb21400c3b0) at notification-thread-events.cpp:3586
#6 0x000056038b2ff819 in handle_notification_thread_client_out (state=0x7fb221974090, socket=77) at notification-thread-events.cpp:4104
#7 0x000056038b2f3d77 in thread_notification (data=0x56038cc7fe90) at notification-thread.cpp:763
#8 0x000056038b30ca7d in launch_thread (data=0x56038cc7e220) at thread.cpp:66
#9 0x00007fb224dcf5c2 in start_thread () from /usr/lib/libc.so.6
#10 0x00007fb224e54584 in clone () from /usr/lib/libc.so.6
Cause
=====
A client "out" event can be received when no payload is left
to send under some circumstances.
Many threads can flush a client's outgoing queue and, if they
had to queue their message (socket was full), will use the
"communication update" command to signal the (e)poll thread
to monitor for space being made available in the socket.
Commands are sent over an internal pipe serviced by the same
thread as the client sockets.
When space is made available in the socket, there is a race
between the (e)poll thread and the other threads that may
wish to use the client's socket to flush its outgoing queue.
A non-(e)poll thread may attempt (and succeed) in flushing
the queue before the (e)poll thread gets a chance to service
the client's "out" event.
In this situation, the (e)poll thread processing the client
out event will see an empty payload: there is nothing to do.
Solution
========
The (e)poll thread can simply ignore the "client out" event
when an empty payload is seen.
There is also no need to update the transmission status as
the other thread has already enqueued a "communication
update" command to do so.
During high throughput event notification generation scenarios the
following deadlock happens:
Thread 14 (Thread 0x7f74b4ff9700 (LWP 76062)):
#0 __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
#1 0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
#2 0x000056408704b207 in run_command_wait (handle=0x56408765ddd0, cmd=0x7f74b4ff7f70) at notification-thread-commands.cpp:31
#3 0x000056408704bcef in notification_thread_command_remove_tracer_event_source (handle=0x56408765ddd0, tracer_event_source_fd=54) at notification-thread-commands.cpp:319
#4 0x000056408708a0c1 in delete_ust_app (app=0x7f749c000bf0) at ust-app.cpp:1059
#5 0x000056408708a511 in delete_ust_app_rcu (head=0x7f749c000ca0) at ust-app.cpp:1122
#6 0x00007f74c988b4a7 in call_rcu_thread (arg=0x7f74b8004a80) at ../src/urcu-call-rcu-impl.h:369
#7 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 13 (Thread 0x7f74b57fa700 (LWP 76047)):
#0 0x00007f74c933a49e in epoll_wait (epfd=48, events=0x7f74a4000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b57f9240, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x00005640870abb65 in thread_agent_management (data=0x56408765f0b0) at agent-thread.cpp:424
#3 0x0000564087062b1a in launch_thread (data=0x56408765f150) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 12 (Thread 0x7f74b5ffb700 (LWP 76046)): [630/709]
#0 0x00007f74c933a49e in epoll_wait (epfd=47, events=0x7f74a0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b5ffa170, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x00005640870a4095 in thread_application_notification (data=0x56408765ee40) at notify-apps.cpp:78
#3 0x0000564087062b1a in launch_thread (data=0x56408765eed0) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 11 (Thread 0x7f74b67fc700 (LWP 76045)):
#0 0x00007f74c933a49e in epoll_wait (epfd=44, events=0x7f74ac000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b67fb170, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x00005640870723db in thread_application_management (data=0x56408765ebd0) at manage-apps.cpp:93
#3 0x0000564087062b1a in launch_thread (data=0x56408765ec60) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 10 (Thread 0x7f74b6ffd700 (LWP 76044)):
#0 0x00007f74c933a49e in epoll_wait (epfd=39, events=0x7f74a8000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b6ffc130, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x0000564087070a27 in thread_application_registration (data=0x56408765e940) at register.cpp:214
#3 0x0000564087062b1a in launch_thread (data=0x56408765e9f0) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 9 (Thread 0x7f74b77fe700 (LWP 76043)): [654/709]
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00005640870c8e25 in futex (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:72
#2 0x00005640870c8e6d in futex_async (uaddr=0x5640871e2800 <ust_cmd_queue>, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:104
#3 0x00005640870c939a in futex_nto1_wait (futex=0x5640871e2800 <ust_cmd_queue>) at futex.cpp:77
#4 0x000056408706f2af in thread_dispatch_ust_registration (data=0x56408765e740) at dispatch.cpp:453
#5 0x0000564087062b1a in launch_thread (data=0x56408765e760) at thread.cpp:66
#6 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 8 (Thread 0x7f74b7fff700 (LWP 76042)):
#0 0x00007f74c933a49e in epoll_wait (epfd=33, events=0x7f74b0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74b7ffad40, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x000056408706c424 in thread_manage_clients (data=0x56408765e4f0) at client.cpp:2528
#3 0x0000564087062b1a in launch_thread (data=0x56408765e560) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 7 (Thread 0x7f74c4b8f700 (LWP 76041)): [672/709]
#0 0x00007f74c933a49e in epoll_wait (epfd=31, events=0x7f74bc000b60, maxevents=3, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c4b8e240, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x000056408705f2b6 in thread_rotation (data=0x56408765e280) at rotation-thread.cpp:804
#3 0x0000564087062b1a in launch_thread (data=0x56408765e310) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 6 (Thread 0x7f74c5390700 (LWP 76040)):
#0 0x00007f74c925f1d2 in __GI___sigtimedwait (set=0x7f74c538f090, info=0x7f74c538f110, timeout=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:29
#1 0x000056408706138a in thread_timer (data=0x7ffc1fcbe3f0) at timer.cpp:359
#2 0x0000564087062b1a in launch_thread (data=0x56408765e0a0) at thread.cpp:66
#3 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 5 (Thread 0x7f74c5b91700 (LWP 76039)):
#0 __libc_write (nbytes=8, buf=0x7f74c5b8fc88, fd=24) at ../sysdeps/unix/sysv/linux/write.c:26
#1 __libc_write (fd=24, buf=0x7f74c5b8fc88, nbytes=8) at ../sysdeps/unix/sysv/linux/write.c:24
#2 0x00005640870eeb4f in lttng_write (fd=24, buf=0x7f74c5b8fc88, count=8) at readwrite.cpp:77
#3 0x000056408704b535 in run_command_no_wait (handle=0x56408765ddd0, in_cmd=0x7f74c5b8fdf0) at notification-thread-commands.cpp:92
#4 0x000056408704bf49 in notification_thread_client_communication_update (handle=0x56408765ddd0, id=2, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-command
#5 0x000056408707bc62 in client_handle_transmission_status (client=0x7f74b80050d0, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f74b8004410) at action-executor.cpp:258
#6 0x0000564087057525 in notification_client_list_send_evaluation (client_list=0x7f74b8004df0, trigger=0x7f74b0001030, evaluation=0x7f74b815d1d0, source_object_creds=0x0, client_report=0x5
#7 0x000056408707bce9 in action_executor_notify_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:269
#8 0x000056408707dd6d in action_executor_generic_handler (executor=0x7f74b8004410, work_item=0x7f74b815d430, item=0x7f74b80e48e0) at action-executor.cpp:670
#9 0x000056408707df01 in action_work_item_execute (executor=0x7f74b8004410, work_item=0x7f74b815d430) at action-executor.cpp:689
#10 0x000056408707e525 in action_executor_thread (_data=0x7f74b8004410) at action-executor.cpp:771 [698/709]
#11 0x0000564087062b1a in launch_thread (data=0x7f74b80044b0) at thread.cpp:66
#12 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 4 (Thread 0x7f74c6392700 (LWP 76038)):
#0 __lll_lock_wait (futex=futex@entry=0x56408765dde8, private=0) at lowlevellock.c:52
#1 0x00007f74c941a0a3 in __GI___pthread_mutex_lock (mutex=0x56408765dde8) at ../nptl/pthread_mutex_lock.c:80
#2 0x0000564087053c89 in handle_notification_thread_command (handle=0x56408765ddd0, state=0x7f74c63911b0) at notification-thread-events.cpp:3142
#3 0x000056408704ac81 in thread_notification (data=0x56408765ddd0) at notification-thread.cpp:715
#4 0x0000564087062b1a in launch_thread (data=0x56408765dec0) at thread.cpp:66
#5 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#run_command_no_wait6 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 3 (Thread 0x7f74c6b93700 (LWP 76037)):
#0 0x00007f74c933a49e in epoll_wait (epfd=21, events=0x7f74c0000b60, maxevents=2, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7f74c6b92170, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x000056408706400a in thread_manage_health (data=0x56408765db50) at health.cpp:140
#3 0x0000564087062b1a in launch_thread (data=0x56408765dbf0) at thread.cpp:66
#4 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#5 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 2 (Thread 0x7f74c7394700 (LWP 76036)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007f74c987d238 in futex (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:72
#2 futex_async (uaddr=0x564087659b10, op=0, val=-1, timeout=0x0, uaddr2=0x0, val3=0) at ../include/urcu/futex.h:104
#3 futex_wait (futex=0x564087659b10) at workqueue.c:136
#4 0x00007f74c987ced2 in workqueue_thread (arg=0x564087659ad0) at workqueue.c:237
#5 0x00007f74c9417609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007f74c933a163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7f74c73cd300 (LWP 76034)):
#0 0x00007f74c933a49e in epoll_wait (epfd=50, events=0x564087666880, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00005640870eafa6 in compat_epoll_wait (events=0x7ffc1fcbe280, timeout=-1, interruptible=false) at compat/poll.cpp:280
#2 0x0000564087062244 in sessiond_wait_for_quit_pipe (timeout_ms=-1) at thread-utils.cpp:83
#3 0x00005640870127dc in main (argc=1, argv=0x7ffc1fcbe668) at main.cpp:1921
Cause
=====
The event_pipe used to notify the notification poll loop is full and the
lttng_write call blocks with the locks for both the client and the
cmd_queue held.
Solution
========
Go back to using eventfd but without the use of EFD_SEMAPHORE (linux
2.6.30) to continue supporting kernel between 2.6.27 and 2.6.29.
The EFD_SEMAPHORE is emulated with a read, decrement, write as explained
by the initial committer of EFD_SEMAPHORE [1].
Known drawbacks
=========
This does not solve the actual block+lock problem but simply push it
back further. The lttng_write on the eventfd can block when reaching
UINT64_MAX. This would represent, at 1 command queued per ns (which is
ridiculous), ~584 years of queueing without a dequeue operation.
Jonathan Rajotte [Thu, 31 Mar 2022 13:46:15 +0000 (09:46 -0400)]
Fix: notification: assert on len > 0 for dropped notification message
Observed issue
==============
Using the notification client from
doc/examples/trigger-condition-event-matches/notification-client.cpp, an
assert is hit when the notification subsystem is under load.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f69eab58859 in __GI_abort () at abort.c:79
#2 0x00007f69eab58729 in __assert_fail_base (fmt=0x7f69eacee588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=<optimized out>) at assert.c:92
#3 0x00007f69eab6a006 in __GI___assert_fail (assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=0x7f69eae1d598 "ssize_t lttcomm_recv_unix_sock(int, void*, size_t)") at assert.c:101
#4 0x00007f69eadd5fe6 in lttcomm_recv_unix_sock (sock=3, buf=0x55da9ecd5f89, len=0) at unix.cpp:179
#5 0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
#6 0x00007f69ead7e478 in lttng_notification_channel_get_next_notification (channel=0x55da9ecd6ee0, _notification=0x7ffdefed2570) at channel.cpp:279
#7 0x000055da9e0e742f in main (argc=2, argv=0x7ffdefed2698) at notification-client.cpp:506
(gdb) frame
#5 0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
64 ret = lttcomm_recv_unix_sock(channel->socket,
Jonathan Rajotte [Mon, 28 Mar 2022 20:49:17 +0000 (16:49 -0400)]
Fix: sessiond: assertion hit in ltt_sessions_ht_empty
Observed issue
==============
Scenario:
gdb lttng-sessiond
set non-stop
break rotation-thread.cpp:584
^ simulates a slow rotation thread or not scheduled thread.
lttng create test1
lttng enable-event -u -a
lttng start test1
lttng create test2
lttng enable-event -u -a
lttng start test2
lttng destroy test1
This will hang on rotation pending checks on the CLI side.
In another shell:
lttng destroy test2
This will hang on rotation pending checks on the CLI side.
Back to gdb
thread 7
continue
Results in:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff786c859 in __GI_abort () at abort.c:79
#2 0x00007ffff786c729 in __assert_fail_base (fmt=0x7ffff7a02588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=<optimized out>) at assert.c:92
#3 0x00007ffff787e006 in __GI___assert_fail (assertion=0x5555556bb148 "count == lttng_ht_get_count(ltt_sessions_ht_by_name)", file=0x5555556bae9f "session.cpp", line=395, function=0x5555556bb129 "int ltt_sessions_ht_empty()") at assert.c:101
#4 0x0000555555586d59 in ltt_sessions_ht_empty () at session.cpp:395
#5 0x0000555555586e53 in del_session_ht (ls=0x7fffdc000c30) at session.cpp:418
#6 0x0000555555588a95 in session_release (ref=0x7fffdc001e50) at session.cpp:999
#7 0x000055555558620f in urcu_ref_put (ref=0x7fffdc001e50, release=0x5555555886eb <session_release(urcu_ref*)>) at /home/joraj/lttng/master/install/include/urcu/ref.h:68
#8 0x0000555555588c8f in session_put (session=0x7fffdc000c30) at session.cpp:1048
#9 0x00005555555bf995 in handle_job_queue (handle=0x55555575d260, state=0x7fffeeffc240, queue=0x555555758960) at rotation-thread.cpp:612
#10 0x00005555555c05da in thread_rotation (data=0x55555575d260) at rotation-thread.cpp:847
#11 0x00005555555c3b1c in launch_thread (data=0x55555575d2f0) at thread.cpp:66
#12 0x00007ffff7a46609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007ffff7969163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Other scenarios can lead to a similar backtrace when using the
`--no-wait` lttng destroy option.
Cause
=====
Since ed41e5709047ef545aa28082416e641e003b45e0 [1], hash table removal
for the session object for the `ltt_sessions_ht_by_name` and
`ltt_sessions_ht_by_name` are "decoupled". Removal from
`ltt_sessions_ht_by_name` is done early in `session_destroy()` while
removal from `ltt_sessions_ht_by_id` is done during `session_release` when
the last reference of a session object is released.
This can leads to `imbalances` between the size of the two hash tables
when multiple sessions are at play.
Solution
========
Rework `ltt_sessions_ht_empty()` to exit early when
`ltt_sessions_ht_by_id` is not empty. Perform a sanity check on
`ltt_sessions_ht_by_name` only when `ltt_sessions_ht_by_id` is empty.
Note
========
Ideally both hash tables' lifetime would be managed separately but it
seems easier in term of initialization to bundle them together for now
considering the limited scope of the `ltt_sessions_ht_by_name` hash
table.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I66c459f80298f929add703ac977cccd1da6dd556
Jonathan Rajotte [Thu, 31 Mar 2022 15:20:01 +0000 (11:20 -0400)]
Fix: tests: missing _GNU_SOURCE for F_GETPIPE_SZ
Per man 2 fcntl:
F_GETOWN_EX, F_SETOWN_EX, F_SETPIPE_SZ, F_GETPIPE_SZ, F_GETSIG, F_SETSIG,
F_NOTIFY, F_GETLEASE, and F_SETLEASE are Linux-specific. (Define the
_GNU_SOURCE macro to obtain these definitions.)
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2b61dfb79ffa384dc2bab56cd3510ddc6ae21e85
Jonathan Rajotte [Tue, 29 Mar 2022 20:31:44 +0000 (16:31 -0400)]
Fix: compat: 'LTTNG_UST_ABI_PROCNAME_LEN' is undeclared
Observed issue
==============
On old systems, the `lttng_pthread_setname_np` function fallsback to
using the compat prctl version. In that context,
`LTTNG_UST_ABI_PROCNAME_LEN` is indeed not declared.
Solution
========
Use `LTTNG_PTHREAD_NAMELEN`. This mimics what is done in other versions
of `lttng_pthread_setname_np`.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I73956cacd7b1e9400881b17b1cd89db2530d3a00
1474980 Uninitialized pointer read
Incorrect values could be read from, or even written to, an arbitrary memory location, causing incorrect computations.
In test_event_rule_userspace_probe(): Reads an uninitialized pointer or its target (CWE-457)
Michael Jeanson [Fri, 11 Feb 2022 15:26:02 +0000 (15:26 +0000)]
Add Log4j 2.x agent tests for the 'log4j' domain
Add integration tests for the new Log4j 2.x agent in Log4j 1.x compat
mode using the current 'log4j' domain, use the new configure switch
'--enable-test-java-agent-log4j2' to enable it.
To run only this new test, use this command :
cd tests/regression && make check TESTS="ust/java-log4j2/test_agent_log4j2_domain_log4j"
Fix: relayd: session id is ignored by 2.11+ create session command
The id of the session used by the sessiond is not returned by
cmd_create_session_2_11 and its caller sets the value in the
relay_session to an uninitialized value.
Up until recently this didn't have much effect as this uninitialized
value was stored and used to perform look-ups in the trace chunk
registry, which would work.
However, the recent multi-consumer rotation fixes make this problem more
significant as this 'id' is used as a key to join relay sessions
originating from the same session daemon.
This was discovered by enabling the '-Wunused-parameter' warning.
When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream. Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:
ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
READ of size 8 at 0x604000000318 thread T7
#0 in metadata_bucket_destroy
#1 in consumer_stream_free
#2 in consumer_stream_destroy
#3 in setup_metadata
#4 in lttng_ustconsumer_recv_cmd
#5 in lttng_consumer_recv_cmd
#6 in consumer_thread_sessiond_poll
#7 in start_thread nptl/pthread_create.c:481
#8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)
0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
freed by thread T7 here:
#0 in __interceptor_free
#1 in metadata_bucket_destroy
#2 in consumer_stream_free
#3 in consumer_stream_destroy
#4 in clean_channel_stream_list
#5 in consumer_del_channel
#6 in consumer_stream_destroy
#7 in setup_metadata
#8 in lttng_ustconsumer_recv_cmd
#9 in lttng_consumer_recv_cmd
#10 in consumer_thread_sessiond_poll
#11 in start_thread nptl/pthread_create.c:481
previously allocated by thread T7 here:
#0 in __interceptor_calloc
#1 in zmalloc
#2 in metadata_bucket_create
#3 in consumer_stream_enable_metadata_bucketization
#4 in lttng_ustconsumer_set_stream_ops
#5 in lttng_ustconsumer_on_recv_stream
#6 in lttng_consumer_on_recv_stream
#7 in create_ust_streams
#8 in ask_channel
#9 in lttng_ustconsumer_recv_cmd
#10 in lttng_consumer_recv_cmd
#11 in consumer_thread_sessiond_poll
#12 in start_thread nptl/pthread_create.c:481
Thread T7 created by T0 here:
#0 in __interceptor_pthread_create
#1 in main
#2 in __libc_start_main ../csu/libc-start.c:332
SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy
This can be easily reproduced by forcing a failure during the setup
of the metadata reproducible using the following change:
diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
index fa1c71299..97ed59632 100644
/* Send metadata stream to relayd if needed. */
if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
- ret = consumer_send_relayd_stream(metadata->metadata_stream,
- metadata->pathname);
+ ret = -1;
if (ret < 0) {
ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
goto error;
Cause
=====
Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.
When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.
In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.
This problem is present in a number of error paths.
Solution
========
Some error paths already manually removed the consumer stream from it's
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().
consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.
Drawbacks
=========
None.
Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Tested-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8
Direct leak of 240 byte(s) in 5 object(s) allocated from:
#0 0x7f5fce02cfb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7f5fcdd95a7a in zmalloc ../../../src/common/macros.h:23
#2 0x7f5fcdd95a7a in lttng_ust_ctl_create_stream /home/jgalar/EfficiOS/src/lttng-ust/src/lib/lttng-ust-ctl/ustctl.c:1649
A consumer stream can have an allocated
`struct lttng_ust_ctl_consumer_stream *` (ustream) even if it is
not globally visible at the time of its teardown.
In the case of the user space consumer, the only site that creates
consumer stream instances ensures that the allocation of the
lttng_ust_ctl_consumer_stream succeeded, ensuring that the
consumer stream's 'ustream' is always set.
compute_flattened_size() erroneously computes (over-estimates) the size
of the allocation required to hold the flat array of struct lttng_event
returned to the user by lttng_list_{events, syscalls, tracepoints}.
Fix: test: babeltrace1 python bindings exposes `op_enum` as a string
Note that `babeltrace2` is the "default" reader listed in the dependency
section of the readme but `babeltrace` is the actual reader used by the
test suite.
Change-Id: I5b47ba1e37a2671560f51ac866a7a35095be2338 Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
4045de280 is a backport of a fix against a C++ file in which `mutable`
is used to allow an ASSERT_LOCKED check. Remove the use of mutable and
make session_has_ongoing_rotation() non-const.
Fix: relayd: connection abruptly closed on viewer stream creation failure
Commit fe88e5175 explains (and fixes) an issue that could cause the
creation of viewer streams to fail. Currently, the error path causes the
relay daemon to abruptly close the connection to its live viewer peer.
This behaviour makes it impossible for the viewer to determine if an
error occurred or if the network connection simply failed.
Returning an `LTTNG_VIEWER_NEW_STREAMS_ERR` status code allows the
viewer to report a precise error. The viewer connection is closed since
the internal error is unlikely to be recoverable.
Fix: relayd: live client fails on clear of multi-domain session
Observed issue
==============
Two test cases of the clear/test_ust test suite occasionally fail in the
integration jobs testing cross-bitness (32/64) LTTng deployments.
Babeltrace fails with the following error when a clear occurs while a
live client consumes a trace:
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1706 [lttng-live] Error receiving get new streams reply
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x55eab7eb1120, muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x55eab7eb0470, muxer-comp-addr=0x55eab7eb0510, muxer-msg-iter-addr=0x55eab7eb1120, msg-iter-addr=0x55eab7eb0fb0, status=ERROR
02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
02-28 16:55:03.262 32362 32362 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910, comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
02-28 16:55:03.262 32362 32362 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully
ERROR: [Babeltrace CLI] (babeltrace2.c:2548)
Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760,
comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
(`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910,
comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
iter-upstream-comp-class-name="muxer",
iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
Cannot validate muxer's upstream message iterator wrapper:
muxer-msg-iter-addr=0x55eab7eb1120,
muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
iter-upstream-comp-class-name="lttng-live",
iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
Error preparing the next batch of messages:
live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1706)
Error receiving get new streams reply
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198)
Remote side has closed connection
Looking at the relay daemon logs, we see the following error:
DBG1 - 16:55:03.262106718 [32139/32146]: Adding new file "ust/pid/gen-ust-events-32373-20220228-165503/chan_0" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1310)
PERROR - 16:55:03.262133333 [32139/32146]: Failed to open fs handle to ust/pid/gen-ust-events-32373-20220228-165503/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker/fd-tracker.cpp:548)
Cause
=====
Adding more debugging logging allows us to see the following situation
takes place:
- relay thread: Create trace chunk on session 1.
- live thread: get new streams against session 1, returns NO_NEW_STREAMS
since the session has an 'ongoing_rotation'.
- live thread: get new streams against session 2, sees no rotation
ongoing and attempts to open `chan_0` when creating a viewer stream
The "ongoing rotation" check was introduced in a7ceb342d and, in a
nutshell, prevents live viewers from creating new viewer streams during
a rotation.
The "ongoing rotation" state is entered when a CREATE_NEW_TRACE_CHUNK
command is issued against a session.
However, this presumes that a relay_session maps 1:1 to a session on the
session daemon's end. This isn't the case as, in multi-domain
scenarios (tracing 32-bit, 64-bit, and kernel events), a single session
daemon session can map to multiple relay_session objects. This is
because the consumer daemons maintain independant connections to the
relay daemon.
To synchronize rotations accross related relay_session instances, the
relay daemon uses the same trace chunk instances accross relay_session
instances. This means that while a trace chunk is created against a
specific relay session, it can be used by other relay_session instances.
To manage shared trace chunks between relay_sessions, the relay daemon
makes use of the trace_chunk_registry. This registry allows
relay_sessions to share trace chunk instances using a unique key tuple:
- session daemon instance uuid,
- session daemon session id,
- trace chunk id.
There is no equivalent mechanism to track the "ongoing_rotation" state
accross relay_sessions originating from the same sessiond session.
In the current scenario, this causes the live client to correctly see
that no new streams are available for session 1 (say, the 32-bit user
space session). Unfortunately, this state is not entered for other
sessions (64-bit and kernel relay sessions). Hence, the viewer succeds
in acquiring new streams from session 2, exposing the race the 'ongoing
rotation' state aims to protect against.
Solution
========
Like the trace chunk instances, the "ongoing rotation" state must be
shared accross relay sessions that originate from the same session
daemon session.
To "emulate" this shared state, session_has_ongoing_rotation() checks
if any relay session originating from the same sessiond session
have an ongoing rotation. If it is the case, we temporarily prevent
live viewers from acquiring new streams.
Known drawbacks
===============
session_has_ongoing_rotation() iterates over all sessions, acquiring
their lock in the process, which is certainly undesirable from a
performance standpoint.
Optimizing this is not a great challenge, but is beyond the scope
of this immediate fix.
Michael Jeanson [Thu, 5 Aug 2021 20:49:26 +0000 (16:49 -0400)]
fix: msgpack requires limits.h for UINT_MAX
Building with '-Wundef' reveals this issue :
unpack.c: In function ‘template_callback_array’:
unpack.c:197:17: warning: "UINT_MAX" is not defined, evaluates to 0 [-Wundef]
197 | #if SIZE_MAX == UINT_MAX
| ^~~~~~~~
unpack.c: In function ‘template_callback_map’:
unpack.c:241:17: warning: "UINT_MAX" is not defined, evaluates to 0 [-Wundef]
241 | #if SIZE_MAX == UINT_MAX
| ^~~~~~~~
Change-Id: I7dadd9f7013d613509f66e67ff1beb8ae593d2bf Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Mon, 14 Feb 2022 16:23:28 +0000 (11:23 -0500)]
Fix: rotation: hang on destroy when using scheduled rotation based on timer
Observed issue
==============
The following scenario results in a hang for `lttng destroy`:
lttng create test
lttng enable-event -u -a
lttng enable-rotation --timer 100000
lttng start
lttng stop
lttng start
lttng destroy
Cause
=====
There is an imbalance in how many times we start the rotation timer.
The rotation timer is only removed on `lttng destroy` or when disabling
a time-based-rotation. On the other hand, the timer is "started"
on `lttng start` and when enabling a time based rotation.
The imbalance emerging from a start/stop/start sequence would prevent the
teardown of the session object since each time the timer is started a
reference to the session is held.
Solution
========
Do not start the rotation schedule timer if it was already launched.
Known drawbacks
=========
None.
Change-Id: Ic5b8938166358fe7629187bebdf02a09e90846c0 Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
if (chunk->in_registry_element) {
struct lttng_trace_chunk_registry_element *element;
element = container_of(chunk, typeof(*element), chunk);
if (element->registry) {
rcu_read_lock();
cds_lfht_del(element->registry->ht, &element->trace_chunk_registry_ht_node);
rcu_read_unlock();
-> call_rcu(&element->rcu_node, free_lttng_trace_chunk_registry_element);
} else {
```
The delayed reclaim of the `lttng_trace_chunk_registry_element` can
result in lttng-consumerd holding an open fd for the "chunk directory"
of the chunk since the close() is only done during the "*fini" phase of
the chunk (`lttng_trace_chunk_fini`).
Solution
========
Considering that the rcu lookup+refcount access scheme is used for the
trace chunk object and that at that point the refcount for the trace
chunk object is effectively zero, we can move the
`lttng_trace_chunk_fini` safely outside of the
`free_lttng_trace_chunk_registry_element` call_rcu call.
Known drawbacks
=========
Even if this solves the current situation, it is important to note that
the actual object holding the reference is itself refcounted and only
close the fd on release. This means that we are still exposed to this
problem if at some point the directory handle is shared and outlives the
trace chunk for some reason in the future.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6da3948824bf8b092fc8248b1bb0263fdd5887be
In file included from event.cpp:15:
event.cpp: In function ‘ssize_t lttng_event_create_from_payload(lttng_payload_view*, lttng_event**, lttng_event_exclusion**, char**, lttng_bytecode**)’:
../../src/common/error.h:191:28: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ssize_t’ {aka ‘int’} [-Wformat=]
191 | __lttng_print(PRINT_WARN, "Warning: " fmt "\n", ## args)
| ^~~~~~~~~~~~~~~~~~~~
../../src/common/error.h:139:51: note: in definition of macro ‘__lttng_print’
139 | fprintf((type) == PRINT_MSG ? stdout : stderr, fmt, ## args); \
| ^~~
event.cpp:624:4: note: in expansion of macro ‘WARN’
624 | WARN("Userspace probe location from the received buffer is not the advertised length: header length = %" PRIu32 ", payload length = %lu", event_comm->userspace_probe_location_len, ret);
| ^~~~
Solution
========
Albeit there is no "canonical" way of printing ssize_t, use '%zd' since
we already make use of it elsewhere.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id41e6ccf07bd580813f169b65d281a4fa305fb48
CID 1475801: Unused value (UNUSED_VALUE)
Assigning value LTTNG_ERR_FATAL to ret_code here, but that stored value is overwritten before it can be used
CID 1475802: Unused value (UNUSED_VALUE)
Assigning value LTTNG_ERR_NOMEM to ret_code here, but that stored value is overwritten before it can be used.
CID 1475814: Unused value (UNUSED_VALUE)
Assigning value LTTNG_ERR_UST_CHAN_NOT_FOUND to ret_code here, but that stored value is overwritten before it can be used.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I79fb5d65ea0f4f3dfbbef2a42ec3dcf0542043c5
When tracing all system calls, nothing guarantees that the first system
call won't come from some _other_ program on the system, on a CPU != 0,
and stay invariant between the two snapshots (when it should not be).
Tests: Fix: test_list_triggers_cli: support in-kernel builtin lttng-modules
This commit changes the grep call to remove the [lttng_tracer] string
from the pattern. When building the lttng modules directly in the kernel
there is not mention of [lttng_tracer].
Furthermore, the symbol type "t" is "T" in my test VM. The difference
may be due to the builtin nature of lttng on this VM.
Fix: lttng: truncated addresses and offsets on 32-bit builds
The lttng client parses hexadecimal addresses using, at some point,
strtoul(). Using this function effectively caps addresses and
offsets to MAX_UINT32 resulting in failures to enable kprobes
against a 64-bit kernel using a 32-bit client.
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_event_field is not packed
Observed issue
==============
For MI testing where the lttng-sessiond is 64 bit and the lttng CLI is
32 bit, the tracepoint field listing fails with partial garbage output.
The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.
sizeof(struct lttng_event_field):
x86: 1136
x86-64: 1144
The struct cannot be marked as LTTNG_PACKED since it is part of the API.
Solution
========
Adopt a similar pattern to the new APIs with a "serialize" &
"create_from_buffer" approach. The only particularity is that we need to
flatten the event_field on listing.
Most of the complexity is moved to `src/common/event.c`
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I280d9809d110237574e2606ee93a7aeba41e704e
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_event_context is not packed
Observed issue
==============
The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.
sizeof(struct lttng_event_context):
x86: 308
x86-64: 312
The struct cannot be marked as LTTNG_PACKED since it is part of the API.
Solution
========
Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach.
Most of the complexity is moved to `src/common/event.c`
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ieb400eab2a2df4070ff51cb2b44929d3ea945ce4