Jérémie Galarneau [Fri, 15 Apr 2022 06:09:53 +0000 (02:09 -0400)]
Fix: lttng: snapshot: add-output: leak of max size parameter
==
1920281==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fa95633add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fa955e90c09 (/usr/lib/libpopt.so.0+0x3c09)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I87ce90a77d9624add0cab5d3090a7e83734da7f4
Jérémie Galarneau [Fri, 15 Apr 2022 05:55:45 +0000 (01:55 -0400)]
Tests: fix: lttng-create: leaked command parameter
==
1853705==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 3 byte(s) in 1 object(s) allocated from:
#0 0x7fb67ee0edd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fb67e964c09 (/usr/lib/libpopt.so.0+0x3c09)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2f3346157cb26de6712c6e6ebd5fafa6b51fac08
Jérémie Galarneau [Fri, 15 Apr 2022 05:30:50 +0000 (01:30 -0400)]
Fix: sessiond: rotation trigger leak
==
1801304==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 224 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb64175 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb6a291 in lttng_trigger* zmalloc<lttng_trigger>() ../../src/common/macros.hpp:89
#3 0x559fbeb64aa6 in lttng_trigger_create /home/jgalar/EfficiOS/src/lttng-tools/src/common/trigger.cpp:58
#4 0x559fbe9dc417 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:87
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 208 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb16e21 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb16e31 in lttng_action_notify* zmalloc<lttng_action_notify>() ../../src/common/macros.hpp:89
#3 0x559fbeb168a0 in lttng_action_notify_create actions/notify.cpp:135
#4 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 160 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb3d7a1 in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb3fa35 in lttng_condition_session_consumed_size* zmalloc<lttng_condition_session_consumed_size>() ../../src/common/macros.hpp:89
#3 0x559fbeb3e6fd in lttng_condition_session_consumed_size_create conditions/session-consumed-size.cpp:206
#4 0x559fbe9dc0f1 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:54
#5 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#6 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#7 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#8 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#9 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 112 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e73fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x559fbeb242ad in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x559fbeb27062 in zmalloc<(anonymous namespace)::lttng_rate_policy_every_n> ../../src/common/macros.hpp:89
#3 0x559fbeb25e9f in lttng_rate_policy_every_n_create actions/rate-policy.cpp:492
#4 0x559fbeb168b9 in lttng_action_notify_create actions/notify.cpp:141
#5 0x559fbe9dc34b in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:80
#6 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#7 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#8 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#9 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#10 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
Indirect leak of 34 byte(s) in 2 object(s) allocated from:
#0 0x7fe0f4e19319 in __interceptor_strdup /usr/src/debug/gcc/libsanitizer/asan/asan_interceptors.cpp:454
#1 0x559fbeb3f603 in lttng_condition_session_consumed_size_set_session_name conditions/session-consumed-size.cpp:442
#2 0x559fbe9dc2c4 in subscribe_session_consumed_size_rotation(ltt_session*, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/rotate.cpp:71
#3 0x559fbe995d6f in cmd_rotation_set_schedule(ltt_session*, bool, lttng_rotation_schedule_type, unsigned long, notification_thread_handle*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:5993
#4 0x559fbe9fe559 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2246
#5 0x559fbea01378 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2624
#6 0x559fbe9ea642 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:68
#7 0x7fe0f44935c1 in start_thread (/usr/lib/libc.so.6+0x8d5c1)
The rotation trigger of a session (used for size-based rotations) is
never cleaned-up. It is now cleaned up every time its condition is
hit and whenever the session is destroyed.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5a89341535f87b7851b548ded9838c18bd1ccb95
Jérémie Galarneau [Fri, 15 Apr 2022 05:34:54 +0000 (01:34 -0400)]
Tests: fix: schedule api: leak of rotation schedule list
==
1769573==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7fef37a29fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7fef37792f2f in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x7fef3779573a in lttng_rotation_schedules* zmalloc<lttng_rotation_schedules>() ../../../src/common/macros.hpp:89
#3 0x7fef377947cc in lttng_rotation_schedules_create /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:353
#4 0x7fef37794aa0 in get_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:392
#5 0x7fef377956dc in lttng_session_list_rotation_schedules /home/jgalar/EfficiOS/src/lttng-tools/src/lib/lttng-ctl/rotate.cpp:665
#6 0x5646131713f2 in test_add_list_remove_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:252
#7 0x56461317157b in test_add_list_remove_size_schedule /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:270
#8 0x564613171680 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/regression/tools/rotation/schedule_api.c:307
#9 0x7fef373ae30f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9b7eb537d158791db76f9a7676ffeb5d4a1f2203
Jérémie Galarneau [Fri, 15 Apr 2022 05:29:46 +0000 (01:29 -0400)]
Fix: lttng: enable-rotation: leak of command parameter
==
1759491==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 6 byte(s) in 1 object(s) allocated from:
#0 0x7fdbdc94add9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fdbdc4a0c09 (/usr/lib/libpopt.so.0+0x3c09)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I29cc6ec4390e71829107f309f162247b9be2868c
Jérémie Galarneau [Fri, 15 Apr 2022 04:35:35 +0000 (00:35 -0400)]
Fix: lttng: track: leaked command parameter
==
1676099==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7f19429d9dd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7f19425342ad in poptGetNextOpt (/usr/lib/libpopt.so.0+0x82ad)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibafcaf42ad4f842b3fa74cf91dc5ecc8acb3487d
Jérémie Galarneau [Fri, 15 Apr 2022 00:22:03 +0000 (20:22 -0400)]
Fix: lttng: enable-channel: leak of popt arguments
==
1245463==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 5 byte(s) in 1 object(s) allocated from:
#0 0x7fe7c494fdd9 in __interceptor_malloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x7fe7c44a5c09 (/usr/lib/libpopt.so.0+0x3c09)
Arguments obtained with poptGetOptArg() must be free'd.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5a65ca6fbaa18f7717ea918a5bc7f42daeb1009a
Jérémie Galarneau [Fri, 15 Apr 2022 00:02:18 +0000 (20:02 -0400)]
Tests: fix: leak of some attributes of ltt_ust_session
==
1175545==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 8696 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707ddc6004 in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x55707ddceb17 in ltt_ust_session* zmalloc<ltt_ust_session>() ../../../src/common/macros.hpp:89
#3 0x55707ddc81e7 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:274
#4 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#5 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#6 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 24672 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707dee4ec1 in zmalloc_internal ../../../src/common/macros.hpp:60
#2 0x55707def774e in consumer_output* zmalloc<consumer_output>() ../../../src/common/macros.hpp:89
#3 0x55707dee90df in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:523
#4 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
#5 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#6 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#7 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bf985f in alloc_split_items_count /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:688
#2 0x7efed0bf985f in _cds_lfht_new /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash.c:1642
Indirect leak of 656 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfac68 in __default_alloc_cds_lfht ../src/rculfhash-internal.h:172
#2 0x7efed0bfac68 in alloc_cds_lfht /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:81
Indirect leak of 48 byte(s) in 2 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:35
#2 0x7efed0bfabd4 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:28
Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55707de3a9af in zmalloc_internal ../../src/common/macros.hpp:60
#2 0x55707de3a9bf in lttng_ht* zmalloc<lttng_ht>() ../../src/common/macros.hpp:89
#3 0x55707de38461 in lttng_ht_new(unsigned long, lttng_ht_type) hashtable/hashtable.cpp:113
#4 0x55707dee9340 in consumer_create_output(consumer_dst_type) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/consumer.cpp:535
#5 0x55707ddc8821 in trace_ust_create_session(unsigned long) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/trace-ust.cpp:321
#6 0x55707ddc2bea in test_create_one_ust_session /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:63
#7 0x55707ddc4941 in main /home/jgalar/EfficiOS/src/lttng-tools/tests/unit/test_ust_data.cpp:283
#8 0x7efed04f930f in __libc_start_call_main (/usr/lib/libc.so.6+0x2d30f)
Indirect leak of 16 byte(s) in 1 object(s) allocated from:
#0 0x7efed0f39fb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7efed0bfac15 in cds_lfht_alloc_bucket_table /home/jgalar/EfficiOS/src/userspace-rcu/src/rculfhash-mm-order.c:31
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib2ad82a197f2a4ccb86ae5799c1d93ff059888e3
Jérémie Galarneau [Thu, 14 Apr 2022 21:36:54 +0000 (17:36 -0400)]
Fix: liblttng-ctl: non-packed structure used for tracker serialization
Using unpacked structures in liblttng-ctl's protocol can cause issues
when mixing sessiond and client of different bitness. In this specific
case I doubt it causes a problem, but it could rightfully do on some
architectures.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie17096a55a4b7508c604e184cae877b83df6e451
Jonathan Rajotte [Fri, 25 Mar 2022 18:26:38 +0000 (14:26 -0400)]
Fix: test: lttng kernel modules still loaded after running test_clock_override
Observed issue
==============
After running test_clock_override, some lttng modules are still loaded.
$ lsmod | ag lttng
lttng_test 32768 0
lttng_tracer
2326528 1 lttng_test
lttng_statedump 749568 1 lttng_tracer
lttng_wrapper 16384 2 lttng_statedump,lttng_tracer
lttng_uprobes 16384 1 lttng_tracer
lttng_kprobes 16384 1 lttng_tracer
lttng_lib_ring_buffer 61440 1 lttng_tracer
lttng_kretprobes 16384 1 lttng_tracer
lttng_clock_plugin_test 16384 1
lttng_clock 16384 2 lttng_tracer,lttng_clock_plugin_test
Cause
=====
The order in which the modules are removed is important.
In `test_clock_override_timestamp` the last `modprobe --remove order` is
modprobe --remove lttng-clock-plugin-test lttng-clock lttng-test
While other callsites order is:
modprobe --remove lttng-test lttng-clock-plugin-test lttng-clock
Solution
========
Use
modprobe --remove lttng-test lttng-clock-plugin-test lttng-clock
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I528df2e3e90664433337a547a74cdbe476d4ee62
Jérémie Galarneau [Wed, 13 Apr 2022 18:28:01 +0000 (14:28 -0400)]
Update version to v2.12.11
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 6 Apr 2022 19:48:31 +0000 (15:48 -0400)]
Fix: notification: assert on len > 0 for dropped notification message
Observed issue
==============
Using the notification client from
doc/examples/trigger-condition-event-matches/notification-client.cpp, an
assert is hit when the notification subsystem is under load.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f69eab58859 in __GI_abort () at abort.c:79
#2 0x00007f69eab58729 in __assert_fail_base (fmt=0x7f69eacee588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=<optimized out>) at assert.c:92
#3 0x00007f69eab6a006 in __GI___assert_fail (assertion=0x7f69eae1d5dd "len > 0", file=0x7f69eae1d5cb "unix.cpp", line=179, function=0x7f69eae1d598 "ssize_t lttcomm_recv_unix_sock(int, void*, size_t)") at assert.c:101
#4 0x00007f69eadd5fe6 in lttcomm_recv_unix_sock (sock=3, buf=0x55da9ecd5f89, len=0) at unix.cpp:179
#5 0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
#6 0x00007f69ead7e478 in lttng_notification_channel_get_next_notification (channel=0x55da9ecd6ee0, _notification=0x7ffdefed2570) at channel.cpp:279
#7 0x000055da9e0e742f in main (argc=2, argv=0x7ffdefed2698) at notification-client.cpp:506
(gdb) frame
#5 0x00007f69ead7df3f in receive_message (channel=0x55da9ecd6ee0) at channel.cpp:64
64 ret = lttcomm_recv_unix_sock(channel->socket,
(gdb) print msg
$2 = {type = 5 '\005', size = 0, fds = 0, payload = 0x7ffdefed24a8 ""}
The msg type 5 is
`LTTNG_NOTIFICATION_CHANNEL_MESSAGE_TYPE_NOTIFICATION_DROPPED`
Cause
=====
The msg portion of a
`LTTNG_NOTIFICATION_CHANNEL_MESSAGE_TYPE_NOTIFICATION_DROPPED` is indeed
zero. There is no extra payload.
Solution
========
When the msg size is zero, skip the 'payload' reception phase.
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibabb922d0e410c9902414a5eabbe04738861d772
Jérémie Galarneau [Fri, 25 Mar 2022 19:34:47 +0000 (15:34 -0400)]
Fix: sessiond: cmd_enable_channel: negative error code used
A negative `lttng_error_code` value is returned (as an integer)
when a channel copy fails. Return a positive error code.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I340f739cc33858a06832bb75a7a6d5e18459551f
Jérémie Galarneau [Mon, 28 Mar 2022 18:59:37 +0000 (14:59 -0400)]
Update version to v2.12.10
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Mathieu Desnoyers [Fri, 25 Mar 2022 20:42:12 +0000 (16:42 -0400)]
Fix: Remove liblttng-ctl.so dependency on liburcu-cds.so and liburcu-common.so
Observed Issue
==============
After splitting libcommon into lgpl/gpl convenience libraries,
liblttng-ctl.so still depends on liburcu-cds.so and liburcu-common.so.
Cause
=====
The default behavior, for AC_CHECK_LIB when the `action-if-found` is NOT
defined, is to prepend the library to LIBS. [1]
"
If action-if-found is not specified, the default action prepends
-llibrary to LIBS and defines ‘HAVE_LIBlibrary’ (in all capitals).
"
It is important to note that the LIBS variable is used for ALL linking.
This is normally not a problem for most distribution since they force
the use of `--as-needed` at the toolchain level (gcc specs) (for example
debian [2]). One could also pass the `--as-needed` flag manually but
libtool reorganize flags in the case of shared object creation [3].
In our case, we always explicitly state the dependencies via the *_LIBADD
automake clause. We do not rely on the LIBS variable.
The current configure.ac does define what seems to be `empty but
defined` clause for the `action-if-found` as "[]". This is not a valid
"empty but defined" `action-if-find` clause and end up generating the
default behavior as defined in [1].
This leads to unnecessary dependencies for most of the shared object, at
link time, generated when using a distro that do not enforce the
`--as-needed` flag on linking.
Solution
========
Define an actual no-op shell operation `:` for the `action-if-found`
parameter.
Known drawbacks
=========
None.
References
==========
[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.69/html_node/Libraries.html
[2] https://salsa.debian.org/toolchain-team/gcc/-/blob/master/debian/patches/gcc-as-needed.diff
[3] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=347650
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ce4289f4022bd6f27f656639a451cb39b7c9a9e
Jérémie Galarneau [Mon, 28 Mar 2022 16:12:04 +0000 (12:12 -0400)]
Update version to v2.12.9
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 18 Mar 2022 14:58:03 +0000 (10:58 -0400)]
Fix: lttng-elf: wrong error label used by error path
1486805 Resource leak
The system resource will not be reclaimed and reused, reducing the future availability of the resource.
In lttng_elf_get_symbol_offset: Leak of memory or pointers to system resources (CWE-404)
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I78b868773b389c2eaf3f1d45151fe9416b1fe447
Michael Jeanson [Wed, 16 Mar 2022 15:40:52 +0000 (11:40 -0400)]
Fix: use the correct endian compat macros
Document which variant of the endian macros our compat header guarantees
across all platforms and fix incorrect uses.
This was discovered with -Wundef on macOS.
Change-Id: Iaf442fe5887063661273ac2a00c9fa4015e83d5c
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 16 Mar 2022 21:29:11 +0000 (17:29 -0400)]
Fix: lttng-elf: untrusted entry size divisor
1405557 Untrusted divisor
The divisor could be controlled by an attacker, who could cause a division by zero.
In lttng_elf_get_symbol_offset: An unscrutinized value from an untrusted source used as a divisor (CWE-369)
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I029708a0df4f62fe0031e374d50839c26f4f3f4b
Jérémie Galarneau [Wed, 16 Mar 2022 15:56:21 +0000 (11:56 -0400)]
Fix: event: erroneous bound check on perf counter name size
The wrong size if used when initializing a perf counter name from a
payload. The destination size must be used to prevent out of bound
writes.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8ea41d30815bd2c02bb2ad8b01e8cecd2d6549a8
Jérémie Galarneau [Wed, 16 Mar 2022 14:48:59 +0000 (10:48 -0400)]
Fix: relayd: missing session unlock on error path
1475890 Missing unlock May result in deadlock if there is another
attempt to acquire the lock.
In viewer_get_new_streams(relay_connection *): Missing a release of a
lock on a path (CWE-667)
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I66de344f5f39ac85bf8db93cf39a07d0c6cf7694
Michael Jeanson [Mon, 7 Mar 2022 16:37:49 +0000 (11:37 -0500)]
fix: warning '-Wstringop-truncation' with GCC 11.2
Building with GCC 11.2 results in the following warning :
In file included from ../../src/common/tracker.h:18,
from ../../src/bin/lttng-sessiond/trace-ust.h:17,
from test_ust_data.cpp:19:
../../src/common/sessiond-comm/sessiond-comm.h:569:14: note: while referencing ‘lttng_event_exclusion::names’
569 | char names[0][LTTNG_SYMBOL_NAME_LEN];
| ^~~~~
test_ust_data.cpp:209:16: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 256 equals destination size [-Wstringop-truncation]
209 | strncpy(LTTNG_EVENT_EXCLUSION_NAME_AT(exclusion, 0),
| ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
210 | get_random_string(), LTTNG_SYMBOL_NAME_LEN);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_ust_data.cpp:211:16: warning: ‘char* strncpy(char*, const char*, size_t)’ specified bound 256 equals destination size [-Wstringop-truncation]
211 | strncpy(LTTNG_EVENT_EXCLUSION_NAME_AT(exclusion, 1),
| ~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
212 | get_random_string(), LTTNG_SYMBOL_NAME_LEN);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Change-Id: I78eea760b4684227ee457c3368c6397d0a767af5
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Fri, 11 Feb 2022 15:26:02 +0000 (15:26 +0000)]
Add Log4j 2.x agent tests for the 'log4j' domain
Add integration tests for the new Log4j 2.x agent in Log4j 1.x compat
mode using the current 'log4j' domain, use the new configure switch
'--enable-test-java-agent-log4j2' to enable it.
To run only this new test, use this command :
cd tests/regression && make check TESTS="ust/java-log4j2/test_agent_log4j2_domain_log4j"
Change-Id: Id780c9ee13913c91c18548f58b14cc600e77e9fa
Depends-on: lttng-ust: I7fd5f79ad58c77175714bd4198d8ff5db2e6b846
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 10 Mar 2022 22:46:31 +0000 (17:46 -0500)]
Fix: relayd: session id is ignored by 2.11+ create session command
The id of the session used by the sessiond is not returned by
cmd_create_session_2_11 and its caller sets the value in the
relay_session to an uninitialized value.
Up until recently this didn't have much effect as this uninitialized
value was stored and used to perform look-ups in the trace chunk
registry, which would work.
However, the recent multi-consumer rotation fixes make this problem more
significant as this 'id' is used as a key to join relay sessions
originating from the same session daemon.
This was discovered by enabling the '-Wunused-parameter' warning.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7e33f5f93dc46bb630c431408c9472be3a75c030
Jérémie Galarneau [Thu, 3 Mar 2022 00:27:31 +0000 (19:27 -0500)]
Fix: consumerd: use-after-free of metadata bucket
Observed issue
==============
When consumer_stream_destroy() is called from, for example, the error
path in setup_metadata(), consumer_stream_free() can end up being called
twice on the same stream. Since the stream->metadata_bucket is not set
to NULL after being destroyed, it leads to a use-after-free:
ERROR: AddressSanitizer: heap-use-after-free on address 0x604000000318
READ of size 8 at 0x604000000318 thread T7
#0 in metadata_bucket_destroy
#1 in consumer_stream_free
#2 in consumer_stream_destroy
#3 in setup_metadata
#4 in lttng_ustconsumer_recv_cmd
#5 in lttng_consumer_recv_cmd
#6 in consumer_thread_sessiond_poll
#7 in start_thread nptl/pthread_create.c:481
#8 in clone (/lib/x86_64-linux-gnu/libc.so.6+0xfcbde)
0x604000000318 is located 8 bytes inside of 48-byte region [0x604000000310,0x604000000340)
freed by thread T7 here:
#0 in __interceptor_free
#1 in metadata_bucket_destroy
#2 in consumer_stream_free
#3 in consumer_stream_destroy
#4 in clean_channel_stream_list
#5 in consumer_del_channel
#6 in consumer_stream_destroy
#7 in setup_metadata
#8 in lttng_ustconsumer_recv_cmd
#9 in lttng_consumer_recv_cmd
#10 in consumer_thread_sessiond_poll
#11 in start_thread nptl/pthread_create.c:481
previously allocated by thread T7 here:
#0 in __interceptor_calloc
#1 in zmalloc
#2 in metadata_bucket_create
#3 in consumer_stream_enable_metadata_bucketization
#4 in lttng_ustconsumer_set_stream_ops
#5 in lttng_ustconsumer_on_recv_stream
#6 in lttng_consumer_on_recv_stream
#7 in create_ust_streams
#8 in ask_channel
#9 in lttng_ustconsumer_recv_cmd
#10 in lttng_consumer_recv_cmd
#11 in consumer_thread_sessiond_poll
#12 in start_thread nptl/pthread_create.c:481
Thread T7 created by T0 here:
#0 in __interceptor_pthread_create
#1 in main
#2 in __libc_start_main ../csu/libc-start.c:332
SUMMARY: AddressSanitizer: heap-use-after-free in metadata_bucket_destroy
This can be easily reproduced by forcing a failure during the setup
of the metadata reproducible using the following change:
diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
index
fa1c71299..
97ed59632 100644
--- a/src/common/ust-consumer/ust-consumer.c
+++ b/src/common/ust-consumer/ust-consumer.c
@@ -908,8 +908,7 @@ static int setup_metadata(struct lttng_consumer_local_data *ctx, uint64_t key)
/* Send metadata stream to relayd if needed. */
if (metadata->metadata_stream->net_seq_idx != (uint64_t) -1ULL) {
- ret = consumer_send_relayd_stream(metadata->metadata_stream,
- metadata->pathname);
+ ret = -1;
if (ret < 0) {
ret = LTTCOMM_CONSUMERD_ERROR_METADATA;
goto error;
Cause
=====
Channels have a list of streams that are being "setup" and are not
yet monitored for consumption. During this setup phase, the streams are
owned by the channel. On destruction of the channel, any stream in that
list will thus be cleaned-up.
When destroying a consumer stream, a reference to its channel is 'put'.
This can result in the destruction of the channel.
In the situation described above, the release of the channel's reference
is done before the stream is removed from the channel's stream list.
This causes the channel's clean-up to invoke (again) the current
stream's clean-up, resulting in the double-free of the metadata bucket.
This problem is present in a number of error paths.
Solution
========
Some error paths already manually removed the consumer stream from it's
channel's stream list before invoking consumer_stream_destroy(). The
various error paths that have to deal with this possible situation are
changed to simply invoke consumer_stream_destroy().
consumer_stream_destroy() is modified to always remove the stream from
its channel's list before performing the rest of the clean-up. This
ensures that those double clean-ups can't occur.
Drawbacks
=========
None.
Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Tested-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibeca9b675b86fc46be3f57826f7158de4da43df8
Jérémie Galarneau [Thu, 3 Mar 2022 22:52:33 +0000 (17:52 -0500)]
Fix: ust-consumerd: leak of stream control structure
The following leak is reported by LeakSanitizer when
setup_metadata() fails to send the metadata stream to the relay
daemon:
==
3050181==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 240 byte(s) in 5 object(s) allocated from:
#0 0x7f5fce02cfb9 in __interceptor_calloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7f5fcdd95a7a in zmalloc ../../../src/common/macros.h:23
#2 0x7f5fcdd95a7a in lttng_ust_ctl_create_stream /home/jgalar/EfficiOS/src/lttng-ust/src/lib/lttng-ust-ctl/ustctl.c:1649
A consumer stream can have an allocated
`struct lttng_ust_ctl_consumer_stream *` (ustream) even if it is
not globally visible at the time of its teardown.
In the case of the user space consumer, the only site that creates
consumer stream instances ensures that the allocation of the
lttng_ust_ctl_consumer_stream succeeded, ensuring that the
consumer stream's 'ustream' is always set.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia4be7969e85bd8812ae13b042e1e100812a63c1d
Jérémie Galarneau [Fri, 4 Mar 2022 20:29:12 +0000 (15:29 -0500)]
Fix: liblttng-ctl: erroneous flat size computation
compute_flattened_size() erroneously computes (over-estimates) the size
of the allocation required to hold the flat array of struct lttng_event
returned to the user by lttng_list_{events, syscalls, tracepoints}.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0a80ef0fa66428b7df34303804a024e80b635c69
Jérémie Galarneau [Mon, 7 Mar 2022 20:17:53 +0000 (15:17 -0500)]
Build fix: relayd: mismatching function signatures
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ba2e7cacbe4357ad0cdbe1dc2c85894f8bedc27
Jérémie Galarneau [Mon, 7 Mar 2022 20:12:35 +0000 (15:12 -0500)]
Build fix: use of mutable keyword in C file
4045de280 is a backport of a fix against a C++ file in which `mutable`
is used to allow an ASSERT_LOCKED check. Remove the use of mutable and
make session_has_ongoing_rotation() non-const.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id6317ad5c271f08701c9545827f75010f5e2e55d
Jérémie Galarneau [Wed, 2 Mar 2022 17:59:17 +0000 (12:59 -0500)]
Fix: relayd: connection abruptly closed on viewer stream creation failure
Commit
fe88e5175 explains (and fixes) an issue that could cause the
creation of viewer streams to fail. Currently, the error path causes the
relay daemon to abruptly close the connection to its live viewer peer.
This behaviour makes it impossible for the viewer to determine if an
error occurred or if the network connection simply failed.
Returning an `LTTNG_VIEWER_NEW_STREAMS_ERR` status code allows the
viewer to report a precise error. The viewer connection is closed since
the internal error is unlikely to be recoverable.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I557a8b66c1bd1c0bf361cfbabe962d8a6808f4f4
Jérémie Galarneau [Wed, 2 Mar 2022 17:37:39 +0000 (12:37 -0500)]
Fix: relayd: live client fails on clear of multi-domain session
Observed issue
==============
Two test cases of the clear/test_ust test suite occasionally fail in the
integration jobs testing cross-bitness (32/64) LTTng deployments.
Babeltrace fails with the following error when a clear occurs while a
live client consumes a trace:
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1706 [lttng-live] Error receiving get new streams reply
02-28 16:55:03.262 32362 32362 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x55eab7eb1120, muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
02-28 16:55:03.262 32362 32362 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x55eab7eb0470, muxer-comp-addr=0x55eab7eb0510, muxer-msg-iter-addr=0x55eab7eb1120, msg-iter-addr=0x55eab7eb0fb0, status=ERROR
02-28 16:55:03.262 32362 32362 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
02-28 16:55:03.262 32362 32362 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910, comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
02-28 16:55:03.262 32362 32362 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully
ERROR: [Babeltrace CLI] (babeltrace2.c:2548)
Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
Component's "consume" method failed: status=ERROR, comp-addr=0x55eab7eb0760,
comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
(`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x55eab7ebd910,
comp-class-so-handle-path="/root/workspace/joraj_integration_base_job/deps-64/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x55eab7eb0fb0, iter-upstream-comp-name="muxer",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
iter-upstream-comp-class-name="muxer",
iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
Cannot validate muxer's upstream message iterator wrapper:
muxer-msg-iter-addr=0x55eab7eb1120,
muxer-upstream-msg-iter-wrap-addr=0x55eab7eb3a70
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x55eab7eb1170, iter-upstream-comp-name="lttng-live",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
iter-upstream-comp-class-name="lttng-live",
iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
Error preparing the next batch of messages:
live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1706)
Error receiving get new streams reply
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198)
Remote side has closed connection
Looking at the relay daemon logs, we see the following error:
DBG1 - 16:55:03.
262106718 [32139/32146]: Adding new file "ust/pid/gen-ust-events-32373-
20220228-165503/chan_0" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1310)
PERROR - 16:55:03.
262133333 [32139/32146]: Failed to open fs handle to ust/pid/gen-ust-events-32373-
20220228-165503/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker/fd-tracker.cpp:548)
Cause
=====
Adding more debugging logging allows us to see the following situation
takes place:
- relay thread: Create trace chunk on session 1.
- live thread: get new streams against session 1, returns NO_NEW_STREAMS
since the session has an 'ongoing_rotation'.
- live thread: get new streams against session 2, sees no rotation
ongoing and attempts to open `chan_0` when creating a viewer stream
The "ongoing rotation" check was introduced in
a7ceb342d and, in a
nutshell, prevents live viewers from creating new viewer streams during
a rotation.
The "ongoing rotation" state is entered when a CREATE_NEW_TRACE_CHUNK
command is issued against a session.
However, this presumes that a relay_session maps 1:1 to a session on the
session daemon's end. This isn't the case as, in multi-domain
scenarios (tracing 32-bit, 64-bit, and kernel events), a single session
daemon session can map to multiple relay_session objects. This is
because the consumer daemons maintain independant connections to the
relay daemon.
To synchronize rotations accross related relay_session instances, the
relay daemon uses the same trace chunk instances accross relay_session
instances. This means that while a trace chunk is created against a
specific relay session, it can be used by other relay_session instances.
To manage shared trace chunks between relay_sessions, the relay daemon
makes use of the trace_chunk_registry. This registry allows
relay_sessions to share trace chunk instances using a unique key tuple:
- session daemon instance uuid,
- session daemon session id,
- trace chunk id.
There is no equivalent mechanism to track the "ongoing_rotation" state
accross relay_sessions originating from the same sessiond session.
In the current scenario, this causes the live client to correctly see
that no new streams are available for session 1 (say, the 32-bit user
space session). Unfortunately, this state is not entered for other
sessions (64-bit and kernel relay sessions). Hence, the viewer succeds
in acquiring new streams from session 2, exposing the race the 'ongoing
rotation' state aims to protect against.
Solution
========
Like the trace chunk instances, the "ongoing rotation" state must be
shared accross relay sessions that originate from the same session
daemon session.
To "emulate" this shared state, session_has_ongoing_rotation() checks
if any relay session originating from the same sessiond session
have an ongoing rotation. If it is the case, we temporarily prevent
live viewers from acquiring new streams.
Known drawbacks
===============
session_has_ongoing_rotation() iterates over all sessions, acquiring
their lock in the process, which is certainly undesirable from a
performance standpoint.
Optimizing this is not a great challenge, but is beyond the scope
of this immediate fix.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I457a32fa497d151ca933c25433c80665268a7c1c
Jérémie Galarneau [Thu, 10 Feb 2022 16:59:31 +0000 (11:59 -0500)]
Tests: fix: select_poll_epoll: test assumes epoll fd value
The test currently assumes that epoll fds are always == 3, which
is not always the case depending on the execution environment.
This change causes `select_poll_epoll` to produce a JSON file
containing the application's pid and epoll fd values that is
then used by the validation script.
Note that the test is converted to C++ to allow the use of
internal utils (common/error.h/cpp) without changing their linkage.
However, the code is still regular C to ease the backport of this
fix.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie373c63f6e6b9267ae2d785c9f0a532a5de37905
Jonathan Rajotte [Mon, 14 Feb 2022 16:23:28 +0000 (11:23 -0500)]
Fix: rotation: hang on destroy when using scheduled rotation based on timer
Observed issue
==============
The following scenario results in a hang for `lttng destroy`:
lttng create test
lttng enable-event -u -a
lttng enable-rotation --timer 100000
lttng start
lttng stop
lttng start
lttng destroy
Cause
=====
There is an imbalance in how many times we start the rotation timer.
The rotation timer is only removed on `lttng destroy` or when disabling
a time-based-rotation. On the other hand, the timer is "started"
on `lttng start` and when enabling a time based rotation.
The imbalance emerging from a start/stop/start sequence would prevent the
teardown of the session object since each time the timer is started a
reference to the session is held.
Solution
========
Do not start the rotation schedule timer if it was already launched.
Known drawbacks
=========
None.
Change-Id: Ic5b8938166358fe7629187bebdf02a09e90846c0
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Wed, 9 Feb 2022 19:23:18 +0000 (14:23 -0500)]
Fix: consumerd: fd still open after `lttng snapshot record` returns
Observed issue
=====
Using a snapshot output located on a pramfs mount:
lttng snapshot record
rm -rf /my_mount/my_trace_output
`rm` fails with ENOTEMPTY on rmdir for /my_mount/my_trace_output.
At that point, the lttng-consumerd daemon have an open fd on:
/my_mount/my_trace_output/ust
Note that a sleep between both command "fixes" the issue.
Cause
=====
The reclaim for the in-registry trace chunks can happen after the LTTng
CLI returns since we use `call_rcu`.
```
static
void lttng_trace_chunk_release(struct urcu_ref *ref)
....
if (chunk->in_registry_element) {
struct lttng_trace_chunk_registry_element *element;
element = container_of(chunk, typeof(*element), chunk);
if (element->registry) {
rcu_read_lock();
cds_lfht_del(element->registry->ht, &element->trace_chunk_registry_ht_node);
rcu_read_unlock();
-> call_rcu(&element->rcu_node, free_lttng_trace_chunk_registry_element);
} else {
```
The delayed reclaim of the `lttng_trace_chunk_registry_element` can
result in lttng-consumerd holding an open fd for the "chunk directory"
of the chunk since the close() is only done during the "*fini" phase of
the chunk (`lttng_trace_chunk_fini`).
Solution
========
Considering that the rcu lookup+refcount access scheme is used for the
trace chunk object and that at that point the refcount for the trace
chunk object is effectively zero, we can move the
`lttng_trace_chunk_fini` safely outside of the
`free_lttng_trace_chunk_registry_element` call_rcu call.
Known drawbacks
=========
Even if this solves the current situation, it is important to note that
the actual object holding the reference is itself refcounted and only
close the fd on release. This means that we are still exposed to this
problem if at some point the directory handle is shared and outlives the
trace chunk for some reason in the future.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6da3948824bf8b092fc8248b1bb0263fdd5887be
Jonathan Rajotte [Wed, 2 Mar 2022 15:50:16 +0000 (10:50 -0500)]
Fix: event: unchecked return value for lttng_strncpy
Reported by Coverity:
CID
1475806: Unchecked return value (CHECKED_RETURN)
Calling lttng_strncpy without checking return value.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib2b16bef61bcc24fea381c18fd8f914e9a84a6e7
Jonathan Rajotte [Wed, 2 Mar 2022 15:36:11 +0000 (10:36 -0500)]
Fix: event: format specifier for ssize_t
Observed issue
==============
In file included from event.cpp:15:
event.cpp: In function ‘ssize_t lttng_event_create_from_payload(lttng_payload_view*, lttng_event**, lttng_event_exclusion**, char**, lttng_bytecode**)’:
../../src/common/error.h:191:28: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘ssize_t’ {aka ‘int’} [-Wformat=]
191 | __lttng_print(PRINT_WARN, "Warning: " fmt "\n", ## args)
| ^~~~~~~~~~~~~~~~~~~~
../../src/common/error.h:139:51: note: in definition of macro ‘__lttng_print’
139 | fprintf((type) == PRINT_MSG ? stdout : stderr, fmt, ## args); \
| ^~~
event.cpp:624:4: note: in expansion of macro ‘WARN’
624 | WARN("Userspace probe location from the received buffer is not the advertised length: header length = %" PRIu32 ", payload length = %lu", event_comm->userspace_probe_location_len, ret);
| ^~~~
Solution
========
Albeit there is no "canonical" way of printing ssize_t, use '%zd' since
we already make use of it elsewhere.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id41e6ccf07bd580813f169b65d281a4fa305fb48
Jonathan Rajotte [Tue, 1 Mar 2022 19:23:52 +0000 (14:23 -0500)]
Fix: sessiond: lttng_channel object is not reclaimed
The reclaim is also missing on the error path.
Reported by Coverity:
CID
1475807: Resource leak (RESOURCE_LEAK)
Variable channel going out of scope leaks the storage it points to.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9ae59900f6bc33282b8b8a65b834a718428a7f50
Jonathan Rajotte [Tue, 1 Mar 2022 15:27:37 +0000 (10:27 -0500)]
Fix: common: local_attr might leak
Reported by Coverity:
CID
1475808: Resource leak (RESOURCE_LEAK)
Variable local_attr going out of scope leaks the storage it points to.
CID
1475810: Resource leak (RESOURCE_LEAK)
Variable local_attr going out of scope leaks the storage it points to.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibe2ee6516618827b6f52676d9a9f7d510cadf1bc
Jérémie Galarneau [Tue, 1 Mar 2022 19:20:30 +0000 (14:20 -0500)]
Fix: session: end goto label used for error path
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1e5137c2de6c7b821485c9571ba24ee1036442e9
Jonathan Rajotte [Tue, 1 Mar 2022 14:50:06 +0000 (09:50 -0500)]
Fix: ser/des: missing null terminator on payload append
Reported by the lttng-ust-java-tests_master_build CI job.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I47b048c65a9855347c9b02fe44d9fdbae7e06dbf
Jonathan Rajotte [Tue, 1 Mar 2022 17:33:53 +0000 (12:33 -0500)]
Fix: test: test plan/skip must come after TAP initialization
TAP is initialized during the sourcing of `utils.sh`. Calling it before
yield nothing.
Move the platform validation after the `PLAN_TEST` call.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib7b4434440ed7590b39b0580ea294045ace40495
Jonathan Rajotte [Tue, 1 Mar 2022 02:30:44 +0000 (21:30 -0500)]
Fix: test: missing test plan for non-x86/arm platforms (i.e powerpc)
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9a34a0b45fd011943436ce6e45c8f0d1666df26f
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_event_field is not packed
Observed issue
==============
For MI testing where the lttng-sessiond is 64 bit and the lttng CLI is
32 bit, the tracepoint field listing fails with partial garbage output.
The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.
sizeof(struct lttng_event_field):
x86: 1136
x86-64: 1144
The struct cannot be marked as LTTNG_PACKED since it is part of the API.
Solution
========
Adopt a similar pattern to the new APIs with a "serialize" &
"create_from_buffer" approach. The only particularity is that we need to
flatten the event_field on listing.
Most of the complexity is moved to `src/common/event.c`
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I280d9809d110237574e2606ee93a7aeba41e704e
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_event_context is not packed
Observed issue
==============
The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.
sizeof(struct lttng_event_context):
x86: 308
x86-64: 312
The struct cannot be marked as LTTNG_PACKED since it is part of the API.
Solution
========
Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach.
Most of the complexity is moved to `src/common/event.c`
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ieb400eab2a2df4070ff51cb2b44929d3ea945ce4
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_event is not packed
Observed issue
==============
In `lttcomm_session_msg` the lttng_event struct is marked as
LTTNG_PACKED. This statement have no effect as explained in commit [2].
Solution
========
Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach.
Most of the complexity is moved to `src/common/event.c`
Known drawbacks
=========
None.
Reference
========
[1] https://review.lttng.org/gitweb?p=lttng-tools.git;a=commit;h=
7bd95aee4660c6419a4a65429fc27754481e7e90
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I35d848519dacb2b119324e88f262aa95951e4ac6
Jonathan Rajotte [Sat, 29 Jan 2022 00:29:58 +0000 (19:29 -0500)]
libcommon: move event.c to libcommon-lgpl
The `event.c` license is already LGPL. There is no technical reason why
it was not part of the lgpl side of libcommon, simply that nothing that
is LGPL needed it. This will change in upcoming commits with the
addition of ser/des functions of `struct lttng_event` and other structs
related to `lttng_event` for liblttng-ctl.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I574e9053645b768c37505f1f27e18e3da69c772a
Jonathan Rajotte [Wed, 12 Jan 2022 23:18:08 +0000 (18:18 -0500)]
Fix: liblttng-ctl comm: lttng_channel is not packed
Observed issue
==============
The size of the struct differs between bitness for x86-64 and x86
leading to serialization/deserialization problem across client
(liblttng-ctl) and lttng-sessiond.
sizeof(struct lttng_channel):
x86: 608
x86-64: 624
The struct cannot be marked as LTTNG_PACKED since it is part of the API.
Solution
========
Adopt a similar pattern to the new API with a "serialize" &
"create_from_buffer" approach. The only particularity is that we need to
flatten the channels on listing.
Most of the complexity is moved to `src/common/channel.c`
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I59b0e92286e36c4f183d2950417cb180d4efc200
Jonathan Rajotte [Mon, 10 Jan 2022 22:27:13 +0000 (17:27 -0500)]
Fix: lttng-ctl: erroneous uses of LTTNG_PACKED
The LTTNG_PACKED macro uses gcc attributes to indicate that a structure
should be packed. Hence, this macro obeys the same rules as the gcc
attribute.
Various mis-uses of the LTTNG_PACKED macros may result in structure not
being packed:
- The LTTNG_PACKED macro should always be placed _before_ an identifier
when a structure is declared in-place.
- Adding LTTNG_PACKED at the definition site has no effect if the
structure was declared elsewhere.
Those mis-uses cause issues when mixing the bitness (32/64) of the
session daemon and liblttng-ctl.
Outstanding issues include the following structures that are not
tagged as LTTNG_PACKED:
- struct lttng_event
- struct lttng_channel
- struct lttng_event_context
Unfortunately, those structures are exposed by the public API and
can't be tagged as being "packed". Doing so would break the ABI
of liblttng-ctl.
These structures should be packed/unpacked explicitly.
JORAJ CHERRY PICK: Change-Id: I095dc0dffc6bf9e15dc7a7ec797958a5780ef150
Change-Id: I33f026a613813af312de65adee23ac829a071cf8
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Tue, 18 Jan 2022 14:46:48 +0000 (09:46 -0500)]
Backport: lttng_buffer_view_is_valid, lttng_buffer_view_contains_string
Change-Id: I23892e6f20e2847037ba16ad1bba8b2d5a9e0a7c
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Thu, 27 Jan 2022 19:22:22 +0000 (14:22 -0500)]
Fix: conversion from KB to bytes overflow on arm32
Observed issue
==============
On enable channel the memory available check fails on arm32 when
available memory, in bytes, is larger than 2^32.
Cause
=====
`read_proc_meminfo_field` converts the read value (in KB) to bytes and
stores it into a size_t variable.
On the system running the reproducer the value of the `value_kb` variable
is
4839692, yielding an overflow when multiplied with 1024 since
`size_t` is 32 bit long. `size_t` can be larger in certain situation
(i.e LARGEFILE) but this is irrelevant to the problem at hand.
Solution
========
Convert all the checks to use uint64_t.
Known drawbacks
=========
None.
References
==========
The multiplication overflow check scheme is borrowed from
`src/common/time.c`
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I76fe7e57da149c1b4c230a411e0346ba6f9ef7b1
Jonathan Rajotte [Tue, 11 Jan 2022 17:28:15 +0000 (12:28 -0500)]
Fix: Unexpected payload size in cmd_recv_stream_2_11
Observed issue
==============
For the following scenario:
lttng-relayd: 64 bit
lttng-sessiond: 64 bit
lttng-consumerd: 32 bit
application: 32 bit
Commands
lttng create --set-url=net://127.0.0.1
lttng enable-event -u -a
lttng start
./application
On application start the lttng-relayd reports this error:
DEBUG1 - 14:16:38.
216442600 [
2004731/
2004735]: Done receiving control command payload: fd = 19, payload size = 4376 bytes (in relay_process_control_receive_payload() at main.c:3456)
DEBUG3 - 14:16:38.
216469462 [
2004731/
2004735]: Processing "RELAYD_ADD_STREAM" command for socket 19 (in relay_process_control_command() at main.c:3327)
Error: Unexpected payload size in "cmd_recv_stream_2_11": expected >=
3519925694 bytes, got 4376 bytes
Cause
=====
In `relayd_add_stream`, instead of taking the > 2.11 protocol path, the
`relayd_add_stream_2_2` function is called.
The value of the rsock version number are:
major: 21845
minor: 2
Which is simply invalid since we know that the version should be 2.12.
The relayd sock version numbers are set during the
LTTNG_CONSUMER_ADD_RELAYD_SOCKET command between the lttng-sessiond and
the lttng-consumerd process. It is important to note here that both
processes do NOT have the same bitness.
The serialization and deserialization of `struct lttcomm_relayd_sock` is
the culprit.
`struct lttcomm_relayd_sock` contains a `struct lttcomm_sock`:
struct lttcomm_sock {
int32_t fd;
enum lttcomm_sock_proto proto;
struct lttcomm_sockaddr sockaddr;
const struct lttcomm_proto_ops *ops;
} LTTNG_PACKED;
Note that `ops` is a pointer and its size varies based on the bitness of
the application. Hence the size of the `struct lttcomm_sock` differs
across bitness. Since it is the first member of `struct
lttcomm_relayd_sock`, the memory layout is simply invalid across
bitness (amd64/x86).
This results in invalid parsing for the overall "struct
lttcomm_relayd_sock" when dealing with a lttng-consumerd with a
different bitness than the lttng-sessiond. As far as I know local
tracing scenarios are not affected since this is only relevant when
dealing with a lttng-relayd.
Solution
========
Pass the socket protocol type, relayd major, relayd minor in
`lttcomm_consumer_msg`. On the receiver side, query the network stack to
get the peer information to populate a basic `lttcomm_sock`. Leaving
this work to the OS saves us from having to serialize the `sockaddr_in*`
structs.
Known drawbacks
=========
We rely on `getpeername` for the first time. Compatibility might be a
problem.
This code path assumes a lot of thing that cannot be asserted against
such as the fact that the socket from which we fetch the info must be
`connected`. Still at this point, the socket is completely setup and the
rest of the code depends on it already.
From GETPEERNAME(2):
```
For stream sockets, once a connect(2) has been performed, either
socket can call getpeername() to obtain the address of the peer
socket. On the other hand, datagram sockets are connectionless.
Calling connect(2) on a datagram socket merely sets the peer
address for outgoing datagrams sent with write(2) or recv(2).
The caller of connect(2) can use getpeername() to obtain the
peer address that it earlier set for the socket. However, the
peer socket is unaware of this information, and calling
getpeername() on the peer socket will return no useful
information (unless a connect(2) call was also executed on the
peer). Note also that the receiver of a datagram can obtain the
address of the sender when using recvfrom(2).
```
But here we are always "the caller of connect".
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I930ef9bbbf18fa881222850ba0fbbba026dc0220
Jonathan Rajotte [Mon, 31 Jan 2022 19:47:35 +0000 (14:47 -0500)]
Tests: add kernel test into the `make check` test suite.
The Jenkins CI mostly run the `make check` suite. Only the Lava base CI
run the root_regression test suite. Most of those test can be run on
`make check` without incurring any major extra time.
Only `regression/tools/streaming/test_high_throughput_limits` is left in
root_regression since it is currently "unreasonable" in term of the time
it takes to run. This could be tackled another time.
Change-Id: I29d40fa8bec872bf2e22a8bd933f58fa6376ee22
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 26 Jan 2022 22:31:19 +0000 (17:31 -0500)]
Update version to v2.12.8
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Mathieu Desnoyers [Wed, 12 Jan 2022 21:46:21 +0000 (16:46 -0500)]
Relicence all source and header files included in LGPL code
All code included in libcommon-lgpl.a should be LGPL. Some were licensed
as GPLv2 by mistake. We need to relicense those.
EfficiOS owns the copyright to all of the affected source files and
agrees to this relicensing from GPLv2 to LGPLv2.1.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib4aa1a7cc8a6f8e2a9891f1bd05c7ea2f8530e9b
Jérémie Galarneau [Wed, 26 Jan 2022 01:16:59 +0000 (20:16 -0500)]
Fix: sessiond: use of null session when session is not found
A bad backport introduced this use of `session` even when it is
NULL.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If7d540229ef3a2a50e1e58d4f408cb61252ed45d
Mathieu Desnoyers [Wed, 19 Jan 2022 16:35:20 +0000 (11:35 -0500)]
Move utils_expand_path and utils_expand_path_keep_symlink to libpath.la
Move the GPLv2 helper functions utils_expand_path and
utils_expand_path_keep_symlink to libpath.la. This will allow utils.cpp
to be relicensed to LGPLv2.1 by making sure EfficiOS owns the copyright
for the entire source file.
Statically include libpath.la into libcommon-gpl.la.
The "lttng" executable is GPLv2 and only depends on libcommon-lgpl.la,
so it needs to explicitly list libpath.la as its dependency.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2108e0bd35ce75797d4a20e65891c820a1dd79f8
Mathieu Desnoyers [Wed, 12 Jan 2022 20:23:09 +0000 (15:23 -0500)]
Link lttng executable on libcommon-lgpl.a
Allows testing whether liblttng-ctl.so has any dependency on
libcommon-gpl.a. Will trigger a link-time failure if it is the case.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie6f864dccdd61f896a7e238e202dc6188ec4f9c5
Mathieu Desnoyers [Tue, 11 Jan 2022 20:57:37 +0000 (15:57 -0500)]
Introduce libcommon-lgpl for liblttng-ctl
liblttng-ctl is a LGPLv2.1 library should should not use GPLv2 code.
Introduce libcommon-lgpl as a static archive containing only LGPLv2.1
compatible code.
This also removes the dependency from liblttng-ctl to liburcu.
Include some source files in libcommon-lgpl.a which are indirectly needed
by source files required in libcommon-lgpl.a:
- endpoint.cpp,
- lttng-elf.cpp,
- lttng-elf.h.
Include some source files in libcommon-lgpl.a which are only needed to
link the lttng executable:
- domain.cpp,
- spawn-viewer.cpp, spawn-viewer.h.
Introduce the new source file hashtable/seed.cpp to move the
lttng_ht_seed symbol in a source file which does not require
liburcu-cds, so it can be present in libcommon-lgpl. This allows
building compile units which are needed in the lgpl common library which
also contain functions which directly refer to lttng_ht_seed.
Programs and libraries which use libhashtable.la are changed to use
libcommon-gpl.la instead. libhashtable becomes internal to libcommon.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I27d2acb823a6d951692a5da88ce32bbe6bafb072
Mathieu Desnoyers [Tue, 11 Jan 2022 19:46:37 +0000 (14:46 -0500)]
Rename libcommon.so to libcommon-gpl.so
libcommon is a static library is currently used by both liblttng-ctl
(LGPLv2.1) and all lttng-tools executables (GPLv2).
Given that some code in libcommon depends on liburcu, this introduces an
indirect dependency from liblttng-ctl to liburcu, which is unwanted.
This first step renames libcommon.so to libcommon-gpl.so. Following
steps will introduce a more lightweight libcommon-lgpl.so which only
contains LGPLv2.1 code, and removes the dependency on liburcu.
Backport Notes
--------------
ini_config has to link against liblttng-ctl since since the internal
configuration library is not split into ini-config and libconfig
(see
3299fd310).
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia8f37ea229f68550200cbb1528216a505bbbd45f
Mathieu Desnoyers [Thu, 20 Jan 2022 15:30:01 +0000 (10:30 -0500)]
Copyright ownership transfer
Apply copyright ownership transfer from Jonathan Rajotte-Julien and
Olivier Cotte to EfficiOS Inc.
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030111.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030113.html
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Olivier Moussavou Cotte <olivier.moussavoucotte@mail.mcgill.ca>
Signed-off-by: Jonathan Rajotte-Julien <jonathan.r.julien@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0c2800ec17f5d435c8a18f2dae096809b8037b5d
Jérémie Galarneau [Fri, 14 Jan 2022 20:42:03 +0000 (15:42 -0500)]
Copyright ownership transfer
Apply copyright ownership transfer from David Goulet, Julien Desfossez,
and Simon Marchi to EfficiOS Inc.
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030087.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030092.html
Link: https://lists.lttng.org/pipermail/lttng-dev/2022-January/030091.html
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Signed-off-by: David Goulet <dgoulet@ev0ke.net>
Signed-off-by: Julien Desfossez <ju@klipix.org>
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Change-Id: Id13575afd4a2a09bb91a8d2b7a12dc3db8dc329f
Jérémie Galarneau [Fri, 23 Jul 2021 20:56:53 +0000 (16:56 -0400)]
Fix: relayd: erroneous rundir permission logging message
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2eb0901896feab684d397aa43f8c9036018d9c95
Jérémie Galarneau [Thu, 9 Dec 2021 21:27:29 +0000 (16:27 -0500)]
Fix: sessiond: rotation thread: fatal error when not finding a session
The rotation thread implements scheduled rotations (by size) by
registering a trigger that monitors the session's consumed size and
notifies when the next rotation's size threshold is exceeded.
The notification is delivered asynchronously which doesn't prevent
the session from being destroyed before the rotation thread has
had the time to process the notification (and perform a rotation).
Since it is possible for a session to be destroyed by the time the
notification is processed, the rotation thread shouldn't handle
this eventuality as a fatal error (shutting down the thread).
Note that nobody reported this issue nor did I attempt to reproduce it.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I588054bad3542854851f28d34f2c758bdf420a34
Jérémie Galarneau [Wed, 12 Jan 2022 20:48:00 +0000 (15:48 -0500)]
Fix: relayd: rotation failure for multi-domain session
Observed issue
==============
Rotating a multi-domain streaming session results in the following
error:
$ lttng rotate
Waiting for rotation to complete...
Error: Failed to retrieve rotation state.
Meanwhile, the relay daemon logs indicate the following:
DBG1 - 14:56:04.
213163667 [265774/265778]: lttng_trace_chunk_rename_path from .tmp_new_chunk to (null) (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.cpp:759)
PERROR - 14:56:04.
213242941 [265774/265778]: Failed to move trace chunk directory ".tmp_new_chunk" to "20220112T145604-0500-1": No such file or directory (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.cpp:799)
DBG1 - 14:56:04.
213396931 [265774/265778]: aborting session 2 (in session_abort() at session.cpp:588)
DBG1 - 14:56:04.
213512198 [265774/265778]: Control connection closed with 22 (in relay_thread_close_connection() at main.cpp:3874)
The 'abort' of session 2 here causes the kernel consumer to fail to
consume subbuffers:
Error: Relayd send index failed. Cleaning up relayd 3.
Error: Error consuming subbuffer: (0)
[...]
Cause
=====
Following the flow of execution in the relay daemon shows that different
trace chunks are used by the two relay sessions that result from the
streaming of a single multi-domain session. Both trace chunks "own" the
same output directory.
When a rotation is performed, the first trace chunk to be closed will
move the directory. Then, the second trace chunk to be closed will
attempt to do the same, failing to do so as seen in the relay daemon
log.
Solution
========
Using different trace chunk instances for relay sessions belonging to a
single sessiond session goes against the intended use of the sessiond
trace chunk registry.
A sessiond trace chunk registry allows the relay daemon to share trace
chunks used by different "relay sessions" when they were created for the
same user-visible session daemon session. Tracing multiple domains (e.g.
ust and kernel) results in per-domain relay sessions being created.
Sharing trace chunks, and their output directory more specifically, is
essential to properly implement session rotations. The sharing of output
directory handles allows directory renames to be performed once and
without races that would stem from from multiple renames.
The reason why sessiond trace chunk registry returns different trace
chunk instances for two relay sessions is that the wrong session `id` is
used to publish trace chunks. The `id` that must be used to share trace
chunks accross the relay sessions that belong to the same sessiond
session is `id_sessiond`.
`id_sessiond` is optional as it is only provided by consumers v2.11+.
Otherwise, it is fine to use the relay session `id`: it is a unique id
for a given session daemon instance and those consumers will not issue a
session rotation (or clear) as the feature didn't exist.
A reference counting bug revealed by this change is also fixed in the
implementation of the sessiond trace chunk registry.
When the trace chunk is first published, two references to the published
chunks exist. One is taken by the registry while the other is being
returned to the caller. In the use case of the relay daemon, the
reference held by the registry itself is undesirable.
We want the trace chunk to be removed from the registry as soon as it is
not being used by the relay daemon (through a session or a stream). This
differs from the behaviour of the consumer daemon which relies on an
explicit command from the session daemon to release the registry's
reference.
In cases where the trace chunk had already been published, the reference
belonging to the sessiond trace chunk registry instance has already been
'put' by the firt publication. We must simply return the published trace
chunk with a reference taken on behalf of the caller.
Known drawbacks
===============
None.
Change-Id: Ic33443b114a87574a1b26ac5ccb022e47f886ddd
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Mathieu Desnoyers [Tue, 11 Jan 2022 18:59:15 +0000 (13:59 -0500)]
Fix: lttng-ctl: lttng_list_sessions: initialize out_sessions to NULL when returning 0
Observed issue
==============
Users of lttng-ctl API's lttng_list_sessions observe application crash
when freeing the *out_sessions output value when lttng_list_sessions
returns 0.
Cause
=====
The implementation does not set *out_sessions to NULL when
lttng_ctl_ask_sessiond() sets the sessions variable to NULL.
This causes the user application to attempt to free(3) an uninitialized
pointer.
Solution
========
Initialize out_sessions to NULL before invoking
lttng_ctl_ask_sessiond(), so it is initialized when lttng_list_sessions
returns 0, thus allowing *out_sessions to be subsequently freed.
A free(3) on a NULL pointer is a no-op.
Known drawbacks
===============
None.
History
=======
This was introduced by those two commits:
b178f53e90 ("Generate session name and default output on sessiond's end")
27ea4ba825 ("Fix: error when listing sessions with no session")
This is a regression present in the stable-2.11, stable-2.12,
stable-2.13, and master branches.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I34125d708a32674d79b831e5004c48321ebd711e
Simon Marchi [Tue, 30 Nov 2021 03:15:53 +0000 (22:15 -0500)]
Fix: lttng: initialize variable in run_command_string
I got some crashes when using `lttng track` and hitting some error
paths. The tracker_handle variable is run_command_string is passed to
lttng_process_attr_tracker_handle_destroy uninitialized if
lttng_session_get_tracker_handle fails.
$ valgrind lttng track --kernel --pid 569878
==634572== Memcheck, a memory error detector
==634572== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==634572== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==634572== Command: lttng track --kernel --pid 569878
==634572==
Error: Unknown error
==634572== Conditional jump or move depends on uninitialised value(s)
==634572== at 0x4875007: lttng_process_attr_tracker_handle_destroy (tracker.cpp:25)
==634572== by 0x13AC55: run_command_string(cmd_type, char const*, lttng_domain_type, lttng_process_attr, char const*, mi_writer*) (track-untrack.cpp:485)
==634572== by 0x13ADA5: run_command(cmd_type, char const*, process_attr_command_args const*, mi_writer*) (track-untrack.cpp:535)
==634572== by 0x13B472: cmd_track_untrack(cmd_type, int, char const**, char const*) (track-untrack.cpp:740)
==634572== by 0x13B5D9: cmd_track(int, char const**) (track-untrack.cpp:805)
==634572== by 0x14C598: handle_command(int, char**) (lttng.cpp:237)
==634572== by 0x14CCE9: parse_args(int, char**) (lttng.cpp:426)
==634572== by 0x14CE65: main (lttng.cpp:475)
Fix it by initializing it to NULL.
Change-Id: Id2693e75c3c5c83cef58db3109973d7ab679b859
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Mathieu Desnoyers [Mon, 22 Nov 2021 19:38:35 +0000 (14:38 -0500)]
Fix: consumer-stream: live viewers observe timestamps going backwards
Observed issue
==============
When stress-testing over an entire weekend, we caught the following
occurrences of timestamps going backwards with babeltrace in live viewer
mode:
One occurrence:
# Test ust streaming live clear with viewer with new metadata after clear
# Parameters: tracing_active=0, clear_twice=0, buffer_type=pid
6-7 occurrences:
# Test ust basic streaming live with viewer
# Parameters: tracing_active=1, clear_twice=0, buffer_type=uid
Relevant log of the relayd focused on the context of the stream
triggering the issue (stream 994):
506:DBG1 - 17:26:54.
966288486 [
3866648/
3866655]: Relay viewer stream 994 not found (in viewer_stream_get_by_id() at viewer-stream.cpp:268)
603:DBG1 - 17:26:54.
970265349 [
3866648/
3866655]: Sending stream 994 to viewer (in send_viewer_streams() at live.cpp:241)
843:DBG1 - 17:26:55.
959835404 [
3866648/
3866652]: stream_add_index for stream 994 (in stream_add_index() at stream.cpp:1233)
844:DBG3 - 17:26:55.
959866303 [
3866648/
3866652]: Finding index for stream id 994 and seq_num 0 (in relay_index_get_by_id_or_create() at index.cpp:112)
845:DBG2 - 17:26:55.
959896611 [
3866648/
3866652]: Creating relay index for stream id 994 and seqnum 0 (in relay_index_create() at index.cpp:34)
846:DBG2 - 17:26:55.
959928097 [
3866648/
3866652]: Adding relay index with stream id 994 and seqnum 0 (in relay_index_add_unique() at index.cpp:70)
847:DBG2 - 17:26:55.
959968163 [
3866648/
3866652]: Index found or created in HT for stream ID 994 and seqnum 0 (in relay_index_get_by_id_or_create() at index.cpp:144)
884:DBG3 - 17:26:55.
961676906 [
3866648/
3866652]: Receiving data for stream id 994 seqnum 0, 0 bytes received, 84 bytes left to receive (in relay_process_data_receive_payload() at main.cpp:3583)
885:DBG1 - 17:26:55.
961783540 [
3866648/
3866652]: Wrote to stream 994: data_length = 84, padding_length = 0 (in stream_write() at stream.cpp:1113)
886:DBG1 - 17:26:55.
961862441 [
3866648/
3866652]: Wrote to stream 994: data_length = 0, padding_length = 4012 (in stream_write() at stream.cpp:1113)
887:DBG1 - 17:26:55.
961895593 [
3866648/
3866652]: handle_index_data: stream 994 net_seq_num 0 data offset 0 (in stream_update_index() at stream.cpp:1140)
888:DBG3 - 17:26:55.
961945901 [
3866648/
3866652]: Finding index for stream id 994 and seq_num 0 (in relay_index_get_by_id_or_create() at index.cpp:112)
889:DBG2 - 17:26:55.
961983056 [
3866648/
3866652]: Index found or created in HT for stream ID 994 and seqnum 0 (in relay_index_get_by_id_or_create() at index.cpp:144)
894:DBG2 - 17:26:55.
962334285 [
3866648/
3866652]: Writing index for stream ID 994 and seq num 0 (in relay_index_try_flush() at index.cpp:275)
895:DBG2 - 17:26:55.
962390756 [
3866648/
3866652]: index put for stream id 994 and seqnum 0 refcount 1 (in relay_index_put() at index.cpp:237)
1743:DBG1 - 17:26:56.
083287172 [
3866648/
3866655]: Check index status: index_received_seqcount 1 index_sent_seqcount 0 for stream 994 (in check_index_status() at live.cpp:1454)
1746:DBG1 - 17:26:56.
083446379 [
3866648/
3866655]: Sending viewer index for stream 994 offset 0 (in viewer_get_next_index() at live.cpp:1801)
1748:DBG1 - 17:26:56.
083544877 [
3866648/
3866655]: Index 1 for stream 994 sent (in viewer_get_next_index() at live.cpp:1842)
1751:DBG1 - 17:26:56.
083778149 [
3866648/
3866655]: Sent 4108 bytes for stream 994 (in viewer_get_packet() at live.cpp:1968)
2858:DBG1 - 17:26:56.
907744916 [
3866648/
3866652]: stream_add_index for stream 994 (in stream_add_index() at stream.cpp:1233)
2859:DBG1 - 17:26:56.
907762798 [
3866648/
3866652]: Received live beacon for stream 994 (in stream_add_index() at stream.cpp:1237)
2862:DBG3 - 17:26:56.
907855597 [
3866648/
3866652]: Receiving data for stream id 994 seqnum 1, 0 bytes received, 84 bytes left to receive (in relay_process_data_receive_payload() at main.cpp:3583)
2863:DBG1 - 17:26:56.
907950195 [
3866648/
3866652]: Wrote to stream 994: data_length = 84, padding_length = 0 (in stream_write() at stream.cpp:1113)
2864:DBG1 - 17:26:56.
908002911 [
3866648/
3866652]: Wrote to stream 994: data_length = 0, padding_length = 4012 (in stream_write() at stream.cpp:1113)
2865:DBG1 - 17:26:56.
908024312 [
3866648/
3866652]: handle_index_data: stream 994 net_seq_num 1 data offset 4096 (in stream_update_index() at stream.cpp:1140)
2866:DBG3 - 17:26:56.
908043082 [
3866648/
3866652]: Finding index for stream id 994 and seq_num 1 (in relay_index_get_by_id_or_create() at index.cpp:112)
2867:DBG2 - 17:26:56.
908061879 [
3866648/
3866652]: Creating relay index for stream id 994 and seqnum 1 (in relay_index_create() at index.cpp:34)
2868:DBG2 - 17:26:56.
908082115 [
3866648/
3866652]: Adding relay index with stream id 994 and seqnum 1 (in relay_index_add_unique() at index.cpp:70)
2869:DBG2 - 17:26:56.
908101275 [
3866648/
3866652]: Index found or created in HT for stream ID 994 and seqnum 1 (in relay_index_get_by_id_or_create() at index.cpp:144)
3011:DBG1 - 17:26:56.
913436908 [
3866648/
3866655]: Check index status: index_received_seqcount 1 index_sent_seqcount 1 for stream 994 (in check_index_status() at live.cpp:1454)
3012:DBG1 - 17:26:56.
913457688 [
3866648/
3866655]: Check index status: inactive with beacon, for stream 994 (in check_index_status() at live.cpp:1492)
3014:DBG1 - 17:26:56.
913507164 [
3866648/
3866655]: Index 1 for stream 994 sent (in viewer_get_next_index() at live.cpp:1842)
3043:DBG1 - 17:26:56.
914167206 [
3866648/
3866652]: stream_add_index for stream 994 (in stream_add_index() at stream.cpp:1233)
3044:DBG3 - 17:26:56.
914186324 [
3866648/
3866652]: Finding index for stream id 994 and seq_num 1 (in relay_index_get_by_id_or_create() at index.cpp:112)
3045:DBG2 - 17:26:56.
914205597 [
3866648/
3866652]: Index found or created in HT for stream ID 994 and seqnum 1 (in relay_index_get_by_id_or_create() at index.cpp:144)
3046:DBG2 - 17:26:56.
914227502 [
3866648/
3866652]: Writing index for stream ID 994 and seq num 1 (in relay_index_try_flush() at index.cpp:275)
3047:DBG2 - 17:26:56.
914299536 [
3866648/
3866652]: index put for stream id 994 and seqnum 1 refcount 1 (in relay_index_put() at index.cpp:237)
3587:DBG1 - 17:26:57.
067977800 [
3866648/
3866652]: Set begin data pending flag to stream 994 (in relay_begin_data_pending() at main.cpp:2240)
3644:DBG1 - 17:26:57.
070915787 [
3866648/
3866652]: Data pending for stream id 994: prev_data_seq 1, prev_index_seq 1, and last_seq 1 (in relay_data_pending() at main.cpp:2091)
3913:DBG1 - 17:26:57.
093492259 [
3866648/
3866652]: try_rotate_stream_index: Stream 994 (rotate_at_packet_seq_num = 2, received_packet_seq_num = (value = 1, is_set = 1)) (in try_rotate_stream_index() at stream.cpp:482)
3914:DBG1 - 17:26:57.
093525950 [
3866648/
3866652]: Rotating stream 994 index file (in try_rotate_stream_index() at stream.cpp:511)
3915:DBG1 - 17:26:57.
093561064 [
3866648/
3866652]: try_rotate_stream_data: Stream 994 (rotate_at_index_packet_seq_num = 2, rotate_at_prev_data_net_seq = 1, prev_data_seq = 1) (in try_rotate_stream_data() at stream.cpp:357)
3916:DBG1 - 17:26:57.
093591085 [
3866648/
3866652]: Rotating stream 994 data file with size 8192 (in stream_rotate_data_file() at stream.cpp:138)
3917:DBG1 - 17:26:57.
093626697 [
3866648/
3866652]: stream_rotate_data_file: reset tracefile_size_current for stream 994 was 8192 (in stream_rotate_data_file() at stream.cpp:169)
3918:DBG1 - 17:26:57.
093656578 [
3866648/
3866652]: Rotation completed for stream 994 (in stream_complete_rotation() at stream.cpp:66)
4238:DBG1 - 17:26:57.
635064782 [
3866648/
3866652]: Trying to close stream 994 (in try_stream_close() at stream.cpp:883)
4239:DBG1 - 17:26:57.
635098224 [
3866648/
3866652]: Succeeded in closing stream 994 (in try_stream_close() at stream.cpp:983)
4744:DBG1 - 17:26:57.
741972785 [
3866648/
3866655]: Check index status: index_received_seqcount 2 index_sent_seqcount 1 for stream 994 (in check_index_status() at live.cpp:1454)
4745:DBG1 - 17:26:57.
742030216 [
3866648/
3866655]: Sending viewer index for stream 994 offset 4096 (in viewer_get_next_index() at live.cpp:1801)
4747:DBG1 - 17:26:57.
742088421 [
3866648/
3866655]: Index 2 for stream 994 sent (in viewer_get_next_index() at live.cpp:1842)
4750:DBG1 - 17:26:57.
742197990 [
3866648/
3866655]: Sent 4108 bytes for stream 994 (in viewer_get_packet() at live.cpp:1968)
4755:DBG1 - 17:26:57.
932525633 [
3866648/
3866655]: Releasing stream id 994 (in stream_release() at stream.cpp:778)
4756:DBG1 - 17:26:57.
932555313 [
3866648/
3866655]: Rotation completed for stream 994 (in stream_complete_rotation() at stream.cpp:66)
Cause
=====
This log shows the following sequence:
consumer-data consumer-live-timer relayd-main relayd-live babeltrace
1) lttng_consumer_read_subbuffer
2) get next subbuf
3) write data packet to data socket
(starting here the data packet is
somewhere on the network)
4) put next subbuf
5) post_consume()
6) consumer_stream_sync_metadata_index()
7) wait for metadata
8) consumer_stream_sync_metadata()
9) check_stream()
10) set missed_metadata_flush
11) call send_live_beacon(()
12) sample empty ring buffer
13) read current timestamp
14) send inactivity beacon (empty packet)
15) receives a live beacon (@ 17:26:56.
907762798)
16) call consumer_stream_send_index()
17) send packet index to relayd
18) receives a data packet (@ 17:26:56.
907855597)
(at this point the data
packet is received from the
network)
19) ask for next index
20) informs the live viewer of the live beacon (@ 17:26:56.
913457688)
21) receives an index packet (@ 17:26:56.
914227502)
22) ask for next index
23) sends the packet to the viewer (@ 17:26:57.
742197990)
24) observes time going
backwards between the
previous live beacon
and the data packet.
The issue is caused by consumer_stream_sync_metadata_index which is
called after sending a data packet (therefore after having consumed a
data packet from the ring buffer). It invokes the send_live_beacon
callback before sending the index associated with the data packet that
was just sent.
However, this introduces a discrepancy between the live beacon
inactivity guarantees and the yet-to-be-sent packet index: the data
packet sent at [3] can be anywhere on the network, not even received by
the relay daemon, when the live beacon is sampling a now empty ring
buffer at [12], and thus sends a live inactivity beacon to the relay
daemon. Then, when the index is sent by consumer_stream_send_index
[16], its timestamp is in the past compared to the inactivity beacon
sent by send_live_beacon [11].
The purpose of the field "stream->indexes_in_flight" is to prevent
setting the inactivity timestamp in the relay stream when data is
missing for indexes that were received. This works because the indexes
are sent over the control socket, which is where the inactivity beacons
are also sent. It does not however prevent issues the other way around:
data sent prior to the inactivity beacon may or may not have reached
the relay daemon. It is therefore important to make sure that consuming
ring buffer data and sending that data's index vs sampling for an empty
ring buffer and sending an inactivity beacon are correctly ordered.
Solution
========
Send inactivity beacon after packet index.
Also document the purpose of sending an inactivity beacon in this
scenario.
Note
====
This issue is present since lttng-tools 2.7.0 (backported to 2.6.1),
where lttng_ustconsumer_read_subbuffer() invokes
consumer_flush_ust_index() prior to call consumer_stream_write_index().
It was introduced by commit
288bdb302a1 ("Fix: sessiond vs consumerd
push/get metadata deadlock").
Known drawbacks
===============
None.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ied43cc8229413b25e69b40ac218adad420f18797
Mathieu Desnoyers [Fri, 19 Nov 2021 20:11:37 +0000 (15:11 -0500)]
Fix: relayd: ressource leaks on viewer_stream_create error
Observed issue
==============
When facing failure to open viewer stream chunks in the context of "Fix:
relayd: failure to open chunk files concurrently with session clear",
we observe that the relay daemon triggers an assertion due to a
non-empty session hash table on cleanup.
Cause
=====
viewer_stream_create() does a stream_get(), but without any matching
stream_put() on error. This in turn holds a reference on the ctf_trace,
which holds a reference on the session.
By inspecting the code, we notice that the following ressources can be
leaked on error:
- vstream->stream_file.handle,
- vstream->index_file,
- vstream->stream.
In non-error scenarios, viewer_stream_release() is responsible for
releasing references on the composite objects.
The vstream->stream_file.trace_chunk is not an issue because it is put
in the destroy handler (as well as within the release, before having its
vstream->stream_file.trace_chunk pointer set to NULL).
Solution
========
Properly put references on all objects which are contained by the viewer
stream on error by introducing
viewer_stream_release_composite_objects(), which is used both in the
error path of viewer_stream_create() and in viewer_stream_release().
Note
====
Why not move those "put" operations in viewer_stream_destroy ?
This is done in the release to ensure we put references on composite
objects immediately when our own reference reaches 0, rather than
waiting for a grace period through call_rcu, which could then cause
chained call_rcu callbacks and require multiple invocation of
rcu_barrier on relayd exit to guarantee that all callbacks have been
executed and all ressources properly freed.
Known drawbacks
===============
None.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I864b58eda94ebda5e6faea45c922d4e814a15daa
Mathieu Desnoyers [Fri, 19 Nov 2021 16:12:25 +0000 (11:12 -0500)]
Fix: relayd: live: erroneous message timestamp observed from live viewer
Observed issue
==============
Another situation where erroneous message timestamp is observed by the
live viewer. Happens rarely (only two occurrences while running
ust_clear in stress-tests overnight on a 16-core machine).
Triggered with the following test:
# Test ust streaming live clear with viewer with new metadata after clear
# Parameters: tracing_active=0, clear_twice=1, buffer_type=pid
Babeltrace 2 error:
11-19 06:21:02.571 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE handle_late_message@lttng-live.c:1206 [lttng-live] Invalid live stream state: have a late message that is not a packet discarded or event discarded message: late-msg-type=PACKET_BEGINNING
11-19 06:21:02.578 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE next_stream_iterator_for_trace@lttng-live.c:1360 [lttng-live] Late message could not be handled correctly: lttng-live-msg-iter-addr=0x55b4cdee1590, stream-name="stream-1024", curr-msg-ts=
1637302861380968303, last-msg-ts=
1637302861472798701
11-19 06:21:02.578 488497 488497 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1802 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
11-19 06:21:02.579 488497 488497 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x55b4cdecfe30, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
Cause
=====
viewer_get_next_index() does not protect its use of the session
ongoing_rotation state by any synchronization mechanism. It only takes
the stream lock, which is not used to protect changes to the ongoing
rotation state with respect to the chunk rename operation performed by
chunk creation at the beginning of the clear command.
Solution
========
Protect the use of the session ongoing rotation state and file open
operations by the session lock in viewer_get_next_index().
Known drawbacks
===============
I don't expect this to cause any real scalability concern considering
the fact that the relay daemon has only two threads, one to handle
session daemon commands, and the other to handle viewer commands.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I242942ac6bc7a9199a17cd9800fc461a397b8955
Mathieu Desnoyers [Thu, 18 Nov 2021 20:22:39 +0000 (15:22 -0500)]
Fix: relayd: failure to open chunk files concurrently with session clear
Observed issue
==============
When stress-testing ust clear with an active live viewer, we observe a
situation where the live viewer thread fails to open chunk files in
make_viewer_streams() when executed after the creation of the new trace
chunk at the beginning of the clear command:
DBG3 - 16:19:50.
923577790 [40834/40838]: Processing "RELAYD_CREATE_TRACE_CHUNK" command for socket 19 (in relay_process_control_command() at main.cpp:3262)
DBG1 - 16:19:50.
923577730 [40834/40841]: Relay viewer stream 225 not found (in viewer_stream_get_by_id() at viewer-stream.cpp:265)
DBG1 - 16:19:50.
923600762 [40834/40838]: lttng_trace_chunk_rename_path from to .tmp_old_chunk (in lttng_trace_chunk_rename_path_no_lock() at trace-chunk.cpp:759)
DBG1 - 16:19:50.
923627202 [40834/40841]: Opening trace chunk file "ust/uid/0/64-bit/my_chan-0_26" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1359)
DBG1 - 16:19:50.
923664685 [40834/40841]: Adding new file "ust/uid/0/64-bit/my_chan-0_26" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1309)
DBG1 - 16:19:50.
923706441 [40834/40841]: Relay viewer stream 226 not found (in viewer_stream_get_by_id() at viewer-stream.cpp:265)
DBG1 - 16:19:50.
923727770 [40834/40841]: Opening trace chunk file "ust/uid/0/64-bit/my_chan-0_27" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1359)
DBG1 - 16:19:50.
923744686 [40834/40841]: Adding new file "ust/uid/0/64-bit/my_chan-0_27" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1309)
DBG1 - 16:19:50.
923773427 [40834/40841]: Relay viewer stream 227 not found (in viewer_stream_get_by_id() at viewer-stream.cpp:265)
DBG1 - 16:19:50.
923803791 [40834/40841]: Opening trace chunk file "ust/uid/0/64-bit/my_chan-0_28" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1359)
DBG1 - 16:19:50.
923831589 [40834/40841]: Adding new file "ust/uid/0/64-bit/my_chan-0_28" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1309)
DBG1 - 16:19:50.
923865981 [40834/40841]: Relay viewer stream 228 not found (in viewer_stream_get_by_id() at viewer-stream.cpp:265)
DBG1 - 16:19:50.
923889329 [40834/40841]: Opening trace chunk file "ust/uid/0/64-bit/index/my_chan-0_29.idx" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1359)
DBG1 - 16:19:50.
923905142 [40834/40838]: Creating trace chunk: chunk_id = 1, creation time =
20211118-161950 (in lttng_trace_chunk_create() at trace-chunk.cpp:440)
DBG1 - 16:19:50.
923907984 [40834/40841]: Adding new file "ust/uid/0/64-bit/index/my_chan-0_29.idx" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1309)
DBG1 - 16:19:50.
923937804 [40834/40838]: Chunk name set to "20211118T161950+0000-1" (in lttng_trace_chunk_create() at trace-chunk.cpp:471)
PERROR - 16:19:50.
923984288 [40834/40841]: Failed to open fs handle to ust/uid/0/64-bit/index/my_chan-0_29.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.cpp:548)
DBG1 - 16:19:50.
924050763 [40834/40841]: Opening trace chunk file "ust/uid/0/64-bit/my_chan-0_29" (in _lttng_trace_chunk_open_fs_handle_locked() at trace-chunk.cpp:1359)
DBG1 - 16:19:50.
924074480 [40834/40841]: Adding new file "ust/uid/0/64-bit/my_chan-0_29" to trace chunk "(unnamed)" (in lttng_trace_chunk_add_file() at trace-chunk.cpp:1309)
PERROR - 16:19:50.
924094720 [40834/40841]: Failed to open fs handle to ust/uid/0/64-bit/my_chan-0_29, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.cpp:548)
DBG1 - 16:19:50.
924193679 [40834/40841]: Viewer connection closed with 23 (in thread_worker() at live.cpp:2542)
DBG1 - 16:19:50.
924227482 [40834/40838]: Attempting to publish trace chunk: sessiond {
34038782-6f74-4b2d-801e-
752cf3d8885e}, session_id = 7, chunk_id = 1 (in sessiond_trace_chunk_registry_publish_chunk() at sessiond-trace-chunks.cpp:385)
DBG1 - 16:19:50.
924312916 [40834/40838]: Reset communication state of relay connection (fd = 19) (in connection_reset_protocol_state() at connection.cpp:82)
DBG3 - 16:19:50.
924350200 [40834/40838]: Relayd worker thread polling... (in relay_thread_worker() at main.cpp:3833)
DBG3 - 16:19:50.
924365222 [40834/40841]: Relayd live viewer worker thread polling... (in thread_worker() at live.cpp:2456)
DBG1 - 16:19:50.
926718319 [40834/40838]: Done receiving control command header: fd = 19, cmd = 18, cmd_version = 0, payload size = 532 bytes (in relay_process_control_receive_header() at main.cpp:3422)
DBG3 - 16:19:50.
926755574 [40834/40838]: Relayd worker thread polling... (in relay_thread_worker() at main.cpp:3833)
DBG1 - 16:19:50.
926787638 [40834/40838]: Done receiving control command payload: fd = 19, payload size = 532 bytes (in relay_process_control_receive_payload() at main.cpp:3339)
DBG3 - 16:19:50.
926811247 [40834/40838]: Processing "RELAYD_ROTATE_STREAMS" command for socket 19 (in relay_process_control_command() at main.cpp:3258)
Cause
=====
This is caused by relay_create_trace_chunk() using
lttng_trace_chunk_rename_path() to move away each trace subdirectory
into the subdirectory .tmp_old_chunk, and making this the new top-level
chunk directory (temporarily). This is a temporary state which will be
resorbed on relay_close_trace_chunk(), moving back the top-level chunk
directory to its original place.
Attempts to open chunk files from the prior chunk may result in failures,
because the chunk lock protecting the chunk rename operation only
protects the chunk owned by the relay thread, not its copy(ies) owned by
the live viewer thread.
This intermediate state should _not_ be observed by the live viewer
thread. The session ongoing rotation state should prevent the live
viewer threads from observing this.
Solution
========
Set the ongoing rotation state in relay_create_trace_chunk() earlier:
before invoking lttng_trace_chunk_rename_path(). Also ensure that the
session ongoing rotation state is protected by the session lock.
On the live thread side, introduce use of the session ongoing rotation
state in viewer_get_new_streams() and viewer_attach_session() to
effectively skip creation of the viewer streams if a session has a
rotation ongoing.
Viewers are expected to deal with the LTTNG_VIEWER_NEW_STREAMS_NO_NEW
reply (or handle the fact that no streams are currently available) and
try again later.
Both Babeltrace 2.0 and Babeltrace 1.5 appear to handle those replies
correctly.
Known drawbacks
===============
None.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3ad8d59dff2994bfb62efa02a620f3c0ffbb5e3b
Mathieu Desnoyers [Wed, 17 Nov 2021 16:55:36 +0000 (11:55 -0500)]
Fix: relayd: live: metadata stream reference count < 0 assert
Observed issue
==============
While running tests/regression/tools/clear/test_ust test in a loop we
eventually witness the following error:
The symptom on the Babeltrace side is a Connection reset by peer. This
is caused by a relayd abort after an assertion failure due to a
reference count being lower than 0.
Test case:
# Test ust streaming live clear with viewer
# Parameters: tracing_active=0, clear_twice=0, buffer_type=pid
Tools commit: https://review.lttng.org/c/lttng-tools/+/6754/1
Relay daemon assertion failure stack trace:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffa2e850859 in __GI_abort () at abort.c:79
#2 0x00007ffa2e850729 in __assert_fail_base (fmt=0x7ffa2e9e6588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x56221bdb96f4 "res >= 0",
file=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=66, function=<optimized out>) at assert.c:92
#3 0x00007ffa2e861f36 in __GI___assert_fail (assertion=assertion@entry=0x56221bdb96f4 "res >= 0", file=file@entry=0x56221bdb97a0 "/root/virtenv/usr/include/urcu/ref.h", line=line@entry=66,
function=function@entry=0x56221bdbd638 <__PRETTY_FUNCTION__.7554> "urcu_ref_put") at assert.c:101
#4 0x000056221bd6a1cc in urcu_ref_put (release=<optimized out>, ref=0x7ffa24008cb0) at /root/virtenv/usr/include/urcu/ref.h:66
#5 viewer_stream_put (vstream=vstream@entry=0x7ffa24008cb0) at viewer-stream.c:279
#6 0x000056221bd5e4c5 in viewer_get_metadata (conn=conn@entry=0x7ffa0c000fc0) at live.c:2211
#7 0x000056221bd63778 in process_control (conn=0x7ffa0c000fc0, recv_hdr=0x7ffa297c5af0) at live.c:2376
#8 thread_worker (data=<optimized out>) at live.c:2541
#9 0x00007ffa2ea28609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#10 0x00007ffa2e94d293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Babeltrace crash:
11-17 05:13:05.197 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1494 [live] Requesting next index for stream: cmd=GET_NEXT_INDEX, viewer-stream-id=3352
11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:245 [live] Error receiving from Relay: Connection reset by peer.
11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_next_index@viewer-connection.c:1522 [live] Error receiving get next index reply
11-17 05:13:05.215 47007 47007 D PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_msg_on_stream@lttng-live.c:1069 [live] Returning from advancing live stream iterator: status=LTTNG_LIVE_ITERATOR_STATUS_ERROR, stream-name="stream-3352", viewer-stream-id=3352
11-17 05:13:05.215 47007 47007 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1802 [live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07700, iter-upstream-comp-name="live", iter-upstream-comp-log-level=TRACE, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUT
PUT, iter-upstream-port-name="out", status=ERROR
11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:446 [mux] Upstream iterator's next method returned an error: status=ERROR
11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:989 [mux] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x563b7dd04740, muxer-upstream-msg-iter-wrap-addr=0x563b7dd0db30
11-17 05:13:05.216 47007 47007 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1417 [mux] Cannot get next message: comp-addr=0x563b7dd1a760, muxer-comp-addr=0x563b7dd1b930, muxer-msg-iter-addr=0x563b7dd04740, msg-iter-addr=0x563b7dd07590, status=ERROR
11-17 05:13:05.216 47007 47007 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:866 Component input port message iterator's "next" method failed: iter-addr=0x563b7dd07590, iter-upstream-comp-name="mux", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT,
iter-upstream-port-name="out", status=ERROR
11-17 05:13:05.216 47007 47007 W LIB/GRAPH consume_graph_sink@graph.c:462 Component's "consume" method failed: status=ERROR, comp-addr=0x563b7dd06d20, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x563b7dd051b0, comp-class-so-handle-path="/root/virtenv/usr/lib/babeltr
ace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
11-17 05:13:05.217 47007 47007 E CLI cmd_run@babeltrace2.c:2537 Graph failed to complete successfully
Cause
=====
Both relay and live threads can put the ownership reference on the
metadata viewer stream concurrently without synchronization, thus
leading to a reference count going lower than 0.
The viewer stream ownership design initially planned for being owned by
the live thread, thus allowing the live thread to put the ownership
reference as soon as the associated relay stream is observed as closed,
and the viewer stream is considered as hung up.
However, in the specific case of the metadata viewer stream, the
responsibility of closing the metadata viewer stream is shared between
the relay and live threads, because the viewers expect to observe a
LTTNG_VIEWER_NO_NEW_METADATA message before the metadata stream
hangs up (see comment in viewer_get_metadata()). Therefore, if
viewer_get_metadata() is done before the metadata stream is closed, the
viewer will receive the LTTNG_VIEWER_NO_NEW_METADATA message, and set
the no_new_metadata_notified state to true. It's then the relay thread's
relay_close_stream() which will invoke the ownership put. However,
the live thread may concurrently try to put the viewer stream ownership
as well from a subsequent viewer_get_metadata(), thus leading to a
reference count < 0.
Solution
========
Fix this by putting the ownership reference from the live viewer thread
rather than the relay thread. This can be done by tracking the state of
no_new_metadata_notified within the live viewer thread.
Known drawbacks
===============
This will postpone reclaim of the metadata viewer stream from the
relay stream close to the following viewer_get_metadata (after a
LTTNG_VIEWER_NO_NEW_METADATA message has been replied), which I don't
think is an issue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9783d3a8402dddaa364a7a69c0764eaf30c16fde
Jérémie Galarneau [Fri, 17 Dec 2021 17:00:36 +0000 (12:00 -0500)]
Update version to v2.12.7
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Mathieu Desnoyers [Wed, 24 Nov 2021 20:56:16 +0000 (15:56 -0500)]
Validate channel context mismatch across UST applications
Observed issue
==============
Applications traced with LTTng-UST are expected to all provide the exact
same layout for their channel's context fields, else it leads to
corrupted traces. This is only enforced within LTTng-UST. There is
nothing in the session daemon that prevents this scenario, and it is
only observable when reading the corrupted trace.
This makes the entire trace unreadable from the point where it is
corrupted.
Cause
=====
Even though LTTng-UST sends the entire description of its context fields
along with the channel registration notification, there is no validation
of the context fields' content against the context fields present in the
ust registry.
Solution
========
Validate each registered UST channel context fields against the fields
present in the registry. Reject the application if there is a mismatch.
Known drawbacks
===============
None.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8b49032bf4f766e549dfccfafdce8cddcbb2873f
Mathieu Desnoyers [Thu, 18 Nov 2021 15:37:29 +0000 (10:37 -0500)]
Fix: relayd: compare viewer chunks by ID rather than address
Observed issue
==============
In "Fix: relayd: live: erroneous message timestamp observed from live
viewer", we observe that sometimes viewer streams unexpectedly end up
performing a viewer stream rotation at session destroy.
Cause
=====
This may happen in the following scenario:
1) Relay stream A is rotated to NULL.
2) viewer_get_next_index for viewer stream A:
2.1) observes a NULL rstream->trace_chunk, updates the viewer session
current trace chunk to NULL (viewer_session_set_trace_chunk_copy).
2.2) "Transition the viewer stream into the latest trace chunk
available." does not issue viewer_stream_rotate_to_trace_chunk, because
the condition (rstream->completed_rotation_count ==
vstream->last_seen_rotation_count + 1 && !rstream->trace_chunk)
evaluates to "true", and thus the entire if () evaluates to false.
3) check_index_status detects rstream->closed and
index_received_seqcount == index_sent_seqcount, thus replying HUP to
viewer, effectively releasing ownership of the viewer stream.
4) viewer_get_next_index for viewer stream B (not rotated to NULL yet):
4.1) observes a non-NULL rstream->trace_chunk, updates the viewer
session current trace chunk to *a new copy* of the non-NULL
rstream->trace_chunk (viewer_session_set_trace_chunk_copy).
4.2) the comparison (conn->viewer_session->current_trace_chunk !=
vstream->stream_file.trace_chunk) done by pointer don't match, because
the viewer session current trace chunk is a new copy.
Therefore, due to those stream close scenarios where the viewer session
can go back and forth between NULL and _different copies_ of the relay
chunk, we cannot use a comparison of chunks by address on the viewer
chunks.
Solution
========
Compare the viewer stream chunks by ID rather than address.
Known drawbacks
===============
The comparison is probably slightly slower, but I don't expect this to
be significant.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I05e4f97f26b2659c007726cc29d3edafa17bdb98
Mathieu Desnoyers [Tue, 16 Nov 2021 20:59:01 +0000 (15:59 -0500)]
Fix: relayd: live: erroneous message timestamp observed from live viewer
Observed issue
==============
While running tests/regression/tools/clear/test_ust test in a loop we
eventually witness the following error:
# Test ust basic streaming live with viewer
# Parameters: tracing_active=0, clear_twice=0, buffer_type=uid
ok 425 - Create session 6jbcTSKUG7s2RIp5 with uri:net://localhost and opts: --live
ok 426 - Enable channel chan for session 6jbcTSKUG7s2RIp5
ok 427 - Enable ust event tp:tptest for session 6jbcTSKUG7s2RIp5
ok 428 - Start tracing for session 6jbcTSKUG7s2RIp5
# Waiting for live trace at url: net://localhost
ok 429 - Waiting for live trace at url: net://localhost
# Waiting for live viewers on url: net://localhost
ok 430 - Waiting for live viewers on url: net://localhost
# Wait until viewer sees all 10 expected events
ok 431 - Live viewer read 10 events, expect 10
ok 432 - Destroy session 6jbcTSKUG7s2RIp5
# Waiting for application to exit
ok 433 - Wait for application to exit
# Wait for viewer to exit
10-28 22:07:37.935 764967 764967 E PLUGIN/SRC.CTF.LTTNG-LIVE next_stream_iterator_for_trace@lttng-live.c:1222 [lttng-live] Message's timestamp is less than lttng-live's message iterator's last returned timestamp: lttng-live-msg-iter-addr=0x55fe45d4977
0, ts=
1635458857116911882, last-msg-ts=
1635458857123908033
10-28 22:07:37.937 764967 764967 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1662 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
10-28 22:07:37.937 764967 764967 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:861 Component input port message iterator's "next" method failed: iter-addr=0x55fe45d38c50, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING,
iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
10-28 22:07:37.937 764967 764967 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:452 [muxer] Upstream iterator's next method returned an error: status=ERROR
10-28 22:07:37.938 764967 764967 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:986 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x55fe45d33640, muxer-upstream-msg-iter-wrap-addr=0x55fe45d3
af20
10-28 22:07:37.938 764967 764967 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1411 [muxer] Cannot get next message: comp-addr=0x55fe45d49cb0, muxer-comp-addr=0x55fe45d49d30, muxer-msg-iter-addr=0x55fe45d33640, msg-iter-addr=0x55fe45d38ae0, sta
tus=ERROR
10-28 22:07:37.938 764967 764967 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:861 Component input port message iterator's "next" method failed: iter-addr=0x55fe45d38ae0, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter
-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
10-28 22:07:37.938 764967 764967 W LIB/GRAPH consume_graph_sink@graph.c:469 Component's "consume" method failed: status=ERROR, comp-addr=0x55fe45d49550, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-c
lass-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x55fe45d38460, comp-class-so-handle-path="/root/virtenv/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.la", comp-input-port-count=1, comp-out
put-port-count=0
10-28 22:07:37.938 764967 764967 E CLI cmd_run@babeltrace2.c:2547 Graph failed to complete successfully
Cause
=====
When doing the rotation associated with a clear,
viewer_stream_sync_tracefile_array_tail aims at pushing forward the
index_sent_seqcount for sessions in tracefile rotation mode to the
oldest available seqcount (tail) for this stream. It does not take into
consideration that the current index_sent_seqcount may already be passed
that oldest available seqcount position, thus eventually re-emitting
trace data.
For sessions *not* in tracefile rotation mode (this is known because
`tracefile_array_get_seq_tail()` returns -1ULL), this function
erroneously resets the index_sent_seqcount to 0, thus also causing trace
data to be re-emitted.
Solution
========
Solve this by using the maximum between the current index_sent_seqcount
and the tracefile array tail position as new position in
viewer_stream_sync_tracefile_array_tail.
Notes
=====
This symptom is also observed without using the clear command, simply on
destroy with a live viewer attached. This is caused by another issue
(not addressed by this patch) which causes
viewer_stream_rotate_to_trace_chunk to be sometimes invoked when streams
are closed on destroy.
Known drawbacks
===============
None.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ifb877d6e2bd94831ddf93077eac3cc968945d6f7
Jérémie Galarneau [Tue, 14 Dec 2021 19:52:24 +0000 (14:52 -0500)]
Fix: relayd comm: improperly packed rotate streams command header
Observed issue
==============
The implicit rotation of a session performed during its destruction
fails on LTTng 2.12 (and upwards) with the following errors:
Error: Relayd rotate streams replied error 152
Error: Relayd rotate stream failed. Cleaning up relayd 2
Error: Rotate channel failed
Failed to find relay daemon socket: relayd_id = 2
Error: Failed to perform a quiet rotation as part of the destruction of session "my_session": Rotation failure on consumer
Cause
=====
Error 152 matches the LTTNG_ERR_INVALID_PROTOCOL error, which implies
that the consumer daemon sent an unexpected command to the relay daemon.
It was determined that the RELAYD_ROTATE_STREAMS command header is not
properly packed since the LTTNG_PACKED annotation was omitted from its
`new_chunk_id` optional field. The documentation of LTTNG_OPTIONAL_COMM
duly indicates that this is required.
Without the use of LTTNG_PACKED, various lengths of padding (3 or 7
bytes) are inserted between new_chunk_id's `is_set` and `value` field to
align `value`, which results in an incorrect interpretation of the
command's arguments.
The relay daemon catches the protocol error when it is built in a
configuration that inserts 7 bytes of padding, while the consumer only
inserts three.
Solution
========
The solution proposed here is not perfect, see "Known drawbacks".
First, if we were to annotate the field, patched consumer daemons would
send unintelligible command headers to unpatched relay daemons. Leaving
it as is is the least of all evils, see "Known drawbacks" for more
details.
From the relay daemon end, we can easily anticipate the padding problem
by reading the `stream_count` field and use it to determine the expected
size of the payload.
The difference between the actual size of the payload and the expected
size allows us to determine the padding size and use the appropriate
declaration of the structure to interpret the command's arguments.
Known drawbacks
===============
While this fix causes the relay daemon to handle all improperly packed
command headers received from an unpatched consumer daemon, the reverse
is not completely true.
The following tables show which cases are known to work and which are
known to be broken when patched and unpatched versions of the relay
and consumer daemons are mixed, with the various alignment constraints.
Note that here:
- 4 byte alignement implies "daemon running on an architecture on
which uint64_t is aligned on an 4-byte boundary" (e.g. x86),
- 8 byte alignement implies "daemon running on an architecture on
which uint64_t is aligned on an 8-byte boundary" (e.g. x86-64).
Scenario 1 - Unpatched relay daemon and unpatched consumer daemon
-----------------------------------------------------------------------------------
| Architecture alignment | 4 byte alignement consumerd | 8 byte alignment consumerd |
|------------------------|-----------------------------|----------------------------|
| 4 byte alignment relay | Works | Fails |
| 8 byte alignment relay | Fails | Works |
-----------------------------------------------------------------------------------
Scenario 2 - Patched relay daemon and unpatched consumer daemon
-----------------------------------------------------------------------------------
| Architecture alignment | 4 byte alignement consumerd | 8 byte alignment consumerd |
|------------------------|-----------------------------|----------------------------|
| 4 byte alignment relay | Works | Works |
| 8 byte alignment relay | Works | Works |
-----------------------------------------------------------------------------------
Scenario 3 - Unpatched relay daemon and patched consumer daemon
-----------------------------------------------------------------------------------
| Architecture alignment | 4 byte alignement consumerd | 8 byte alignment consumerd |
|------------------------|-----------------------------|----------------------------|
| 4 byte alignment relay | Works | Fails |
| 8 byte alignment relay | Fails | Works |
-----------------------------------------------------------------------------------
Scenario 4 - Patched relay daemon and patched consumer daemon
-----------------------------------------------------------------------------------
| Architecture alignment | 4 byte alignement consumerd | 8 byte alignment consumerd |
|------------------------|-----------------------------|----------------------------|
| 4 byte alignment relay | Works | Works |
| 8 byte alignment relay | Works | Works |
-----------------------------------------------------------------------------------
Note that Scenarios 1 and 3 are the same since this fix doesn't
change the behaviour of the consumer daemon.
Also note that packing the `new_chunk_id` field would break the two
working cases of scenario 3 which are, in all likelyhood, the more
common cases.
A new command using a properly packed version of the command's header
could be implemented in future versions, but this presents no benefit as
part of this fix.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3b822a9766ea5be51996a771a0d4a90efe29ce0b
Jonathan Rajotte [Thu, 18 Nov 2021 15:47:16 +0000 (10:47 -0500)]
test: snapshot after regenerate metadata
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I48ae0e9d3d4e67172c4a686d42ceadf6bcb81ead
Jonathan Rajotte [Wed, 17 Nov 2021 21:18:59 +0000 (16:18 -0500)]
Fix: ust-consumer: segfault on snapshot after regenerate metadata
Observed issue
==============
lttng-consumerd segfaults for the following scenario:
$ lttng create test --snapshot
$ lttng enable-event -u -an
$ lttng start
# Run an app just to have some events
$ lttng regenerate metadata
$ lttng snapshot record
The following backtrace is obtained:
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
#1 0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
#2 0x000055b383d00197 in lttng_ustconsumer_request_metadata (ctx=0x55b3855a1e50, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:3336
#3 0x000055b383cf9e76 in snapshot_metadata (metadata_channel=0x7fe90000a760, key=4, path=0x7fe911a09944 "uid/1000/64-bit", relayd_id=
18446744073709551615, ctx=0x55b3855a1e50) at ust-consum
#4 0x000055b383cfbe73 in lttng_ustconsumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at ust-consumer.c:1853
#5 0x000055b383ccf9b7 in lttng_consumer_recv_cmd (ctx=0x55b3855a1e50, sock=28, consumer_sockpoll=0x7fe911a0dbb0) at consumer.c:2097
#6 0x000055b383cd3bfd in consumer_thread_sessiond_poll (data=0x55b3855a1e50) at consumer.c:3284
#7 0x00007fe914c22609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#8 0x00007fe914b47293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) up
#1 0x000055b383cfaed3 in lttng_ustconsumer_recv_metadata (sock=29, key=4, offset=0, len=12391, version=1, channel=0x7fe90000a760, timer=0, wait=1) at ust-consumer.c:1347
1347 pthread_mutex_lock(&channel->metadata_stream->lock);
(gdb) print channel->metadata_stream
$1 = (struct lttng_consumer_stream *) 0x0
Note that the following scenario also leads to a similar backtrace:
$ lttng create test --snapshot
$ lttng enable-event -u -an
$ lttng start
# Run an app just to have some events with a short duration
$ lttng regenerate metadata
# Run a second app just to have some events and a metadata flush while
# the metadata cache status is set as `invalidated`.
^ lttng-consumerd segfault on app termination.
The backtrace:
(gdb) bt
#0 __GI___pthread_mutex_lock (mutex=0x130) at ../nptl/pthread_mutex_lock.c:67
#1 0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
#2 0x00005612a5a14d0a in lttng_ustconsumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at ust-consumer.c:1818
#3 0x00005612a59e89b7 in lttng_consumer_recv_cmd (ctx=0x5612a6feee50, sock=28, consumer_sockpoll=0x7f3989494bb0) at consumer.c:2097
#4 0x00005612a59ecbfd in consumer_thread_sessiond_poll (data=0x5612a6feee50) at consumer.c:3284
#5 0x00007f398c6a9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007f398c5ce293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) up
#1 0x00005612a5a13ed3 in lttng_ustconsumer_recv_metadata (sock=28, key=2, offset=0, len=12391, version=1, channel=0x7f3978005400, timer=0, wait=1) at ust-consumer.c:1347
1347 pthread_mutex_lock(&channel->metadata_stream->lock);
(gdb) print channel->metadata_stream
$1 = (struct lttng_consumer_stream *) 0x0
(gdb)
Cause
=====
For session configured in snapshot mode the metadata channel
metadata_stream field is NULL except for a "short" window during the
actual snapshot record action (snapshot_metadata).
The `regenerate metadata` effectively flag the metadata cache as invalid
leading to handling the cache invalidation state
(`CONSUMER_METADATA_CACHE_WRITE_STATUS_INVALIDATED`) in
`lttng_ustconsumer_recv_metadata`. This was introduced by
b1316da1ffbd276fc8271e7a9438e683ad352781 [1].
At that point the function expects the `channel->metadata_stream` to be
non null. This is simply not true for snapshot session metadata channels.
Note that we cannot simply swap `lttng_ustconsumer_request_metadata` and
`create_ust_streams` in `snapshot_metadata` to ensure that the
`channel->metadata_stream` is non null since
`lttng_ustconsumer_recv_metadata` ends up being called on metadata flush when
an app quit. This sequence of events corresponds to the second scenario
put forward in the `Observed Issue` section.
Solution
========
Null check `channel->metadata_stream` and perform only the operation
when it is non null. This partly mirror what is done in `consumer_metadata_wakeup_pipe`.
I am not sure if the check on `channel->monitor` is required but it
seems irrelevant to the notion of resetting the stream consumed position
when the stream exists.
With this taken care off, we find ourself with another
backtrace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 [93/122]
#1 0x00007f75cf9b3859 in __GI_abort () at abort.c:79
#2 0x00007f75cf9b3729 in __assert_fail_base (fmt=0x7f75cfb49588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004
#3 0x00007f75cf9c4f36 in __GI___assert_fail (assertion=0x55ab004e9c68 "pthread_mutex_trylock(&stream->lock)", file=0x55ab004e8d7a "ust-consumer.c", line=1285, function=0x55ab004eb8a0 <__PR
#4 0x000055ab004b1b9c in metadata_stream_reset_cache_consumed_position (stream=0x7f75b400a850) at ust-consumer.c:1285
#5 0x000055ab004b4fef in commit_one_metadata_packet (stream=0x7f75b400a850) at ust-consumer.c:2551
#6 0x000055ab004b5f46 in get_next_subbuffer_metadata (stream=0x7f75b400a850, subbuffer=0x7f75cc972630) at ust-consumer.c:2927
#7 0x000055ab0048b6a9 in lttng_consumer_read_subbuffer (stream=0x7f75b400a850, ctx=0x55ab01d4ee50, locked_by_caller=true) at consumer.c:3522
#8 0x000055ab004b0f66 in snapshot_metadata (metadata_channel=0x7f75b4005400, key=2, path=0x7f75cc972944 "uid/1000/64-bit", relayd_id=
18446744073709551615, ctx=0x55ab01d4ee50) at ust-consum
#9 0x000055ab004b2e86 in lttng_ustconsumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at ust-consumer.c:1861
#10 0x000055ab004869b7 in lttng_consumer_recv_cmd (ctx=0x55ab01d4ee50, sock=28, consumer_sockpoll=0x7f75cc976bb0) at consumer.c:2097
#11 0x000055ab0048abfd in consumer_thread_sessiond_poll (data=0x55ab01d4ee50) at consumer.c:3284
#12 0x00007f75cfb8b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f75cfab0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Which is also caused in part to the invalidation of the cache.
`metadata_stream_reset_cache_consumed_position` expect that the stream
at that point be locked. Which is not the case despite what the last argument
to `lttng_consumer_read_subbuffer` indicates. To alleviate that we hold
the lock during the call to `lttng_consumer_read_subbuffer`.
Known drawbacks
=========
None.
References
=========
[1] https://git.lttng.org/?p=lttng-tools.git;a=commit;h=
b1316da1ffbd276fc8271e7a9438e683ad352781
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib09fdc989b1baae26d0f496afeabc3e7cb27228c
Jérémie Galarneau [Fri, 10 Dec 2021 19:10:13 +0000 (14:10 -0500)]
Tests: live kernel: no plan printed when non-root
The live kernel test does not produce a valid TAP output when
skipping due to not running the test as root.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4a10de494260084216ddb1b8f4ee27a546c4d8ed
Jérémie Galarneau [Wed, 3 Nov 2021 02:31:15 +0000 (22:31 -0400)]
Fix: sessiond: snapshot: leak of trace chunk
Valgrind reports a leak after every snapshot record command:
==827791== 430 (280 direct, 150 indirect) bytes in 1 blocks are definitely lost in loss record 34 of 37
==827791== at 0x48435FF: calloc (vg_replace_malloc.c:1117)
==827791== by 0x223D01: zmalloc (macros.h:45)
==827791== by 0x224B79: lttng_trace_chunk_allocate (trace-chunk.c:387)
==827791== by 0x224E41: lttng_trace_chunk_create (trace-chunk.c:427)
==827791== by 0x150B55: session_create_new_trace_chunk (session.c:656)
==827791== by 0x164A11: snapshot_record (cmd.c:5113)
==827791== by 0x1651EE: cmd_snapshot_record (cmd.c:5302)
==827791== by 0x196E74: process_client_msg (client.c:2166)
==827791== by 0x198AF1: thread_manage_clients (client.c:2742)
==827791== by 0x18E245: launch_thread (thread.c:66)
==827791== by 0x4B9E258: start_thread (in /usr/lib/libpthread-2.33.so)
==827791== by 0x4CB45E2: clone (in /usr/lib/libc-2.33.so)
session_set_trace_chunk() on line 5162 returns a reference to the
current trace chunk which is never released.
This also causes tests/regression/tools/snapshots/test_ust_long to fail
due to a file descriptor exhaustion (presumably from using too many
directory file descriptors) when it is executed by an unprivileged user.
The CI doesn't catch this since the long regression test suite is
executed as root.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4b6e45df48c3daafa2294c80ccd8a2b4d91401e1
Jonathan Rajotte [Tue, 23 Nov 2021 17:15:42 +0000 (12:15 -0500)]
Fix: test: use BABELTRACE_BIN instead of babeltrace
Observed issue
==============
The System tests jobs fails on multi-session test since the move to bt2.
Cause
=====
The tests uses `babeltrace` instead of `BABELTRACE_BIN`.
Solution
========
Use `BABELTRACE_BIN`.
Add a babelrace bail out.
While there fix easy shellcheck warning.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I441d736e85c447c5765bffd520ec2f267c86048f
Mathieu Desnoyers [Thu, 2 Dec 2021 22:33:55 +0000 (17:33 -0500)]
Fix: relayd: `!vsession->current_trace_chunk` assertion failed
Observed issue
==============
When performing:
#!/bin/bash
lttng create py_syscalls --live
lttng enable-event -u -a
lttng enable-event -k -a
lttng start
babeltrace2 -i lttng-live net://localhost/host/raton/py_syscalls
The relay daemon hits this assertion:
Thread 8 (Thread 0x7fffeeffd700 (LWP 167040) "lttng-relayd"):
#0 0x00007ffff7b1618b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff7af5859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff7af5729 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff7b06f36 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00005555555889bb in viewer_session_attach (vsession=0x7fffdc001400, session=session@entry=0x7fffe8001180) at viewer-session.c:80
#5 0x000055555557bcff in viewer_attach_session (conn=0x7fffd0001140) at live.c:1275
#6 process_control (conn=0x7fffd0001140, recv_hdr=0x7fffeeffcaf0) at live.c:2341
#7 thread_worker (data=<optimized out>) at live.c:2515
#8 0x00007ffff7ccd609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9 0x00007ffff7bf2293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
Cause
=====
This assert appears to be entirely wrong.
It checks that the "viewer session" has a NULL current trace chunk when
attaching a session to a viewer session, but in the case where a viewer
session has multiple sessions (e.g. with kernel and ust tracing
combined), we are attaching each session individually to the viewer
session, and we set the current trace chunk of the viewer session when
we attach the first session to it.
So it is expected to be non-NULL when attaching the second session.
Solution
========
Remove the assertion.
Known limitations
=================
None.
Fixes: #1335
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4d8f5d5347b4588144ddf449976cae5a94b81b3a
Simon Marchi [Wed, 10 Nov 2021 13:42:25 +0000 (08:42 -0500)]
Fix: tests: fix unused-but-set warning in test_fd_tracker.c
When building with clang-14 on Ubuntu 20.04, I get:
CC test_fd_tracker.o
/home/smarchi/src/lttng-tools/tests/unit/test_fd_tracker.c:169:15: error: variable 'fds_set_to_minus_1' set but not used [-Werror,-Wunused-but-set-variable]
unsigned int fds_set_to_minus_1 = 0;
^
The compiler seems right, so remove fds_set_to_minus_1. It might be
that the intention was to assert something using this variable, but I
couldn't figure it out.
Change-Id: I12bfd07bca7829de8d5b85d375d9b52bd84d677a
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Wed, 10 Nov 2021 13:39:22 +0000 (08:39 -0500)]
Fix: sessiond: fix possible buffer overflow warning
When compiling with clang-14 on Ubuntu 20.04, I get:
CC lttng-syscall.lo
/home/smarchi/src/lttng-tools/src/bin/lttng-sessiond/lttng-syscall.c:70:13: error: 'fscanf' may overflow; destination buffer in argument 4 has size 255, but the corresponding specifier may require size 256 [-Werror,-Wfortify-source]
&index, name, &bitness) == 3) {
^
I think the compiler is right, we read a string when length up to 255 in
a buffer of size 255. We need one more byte for the NULL terminator,
fix that.
Change-Id: I6b2eec401af3ef6230dd4b6c8559032de9b54584
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Tue, 19 Oct 2021 19:22:39 +0000 (15:22 -0400)]
Fix: lttng-ctl: tracing_group memory leaks
Observed issue
==============
liblttng-ctl leaks memory if `lttng_set_tracing_group` is called at least
1 time by an API client.
joraj@~/lttng/master/lttng-tools-dev [master][]$ valgrind --leak-check=full lttng --group=joraj list
==24823== Memcheck, a memory error detector
==24823== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24823== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==24823== Command: lttng --group=joraj list
==24823==
Error: No session daemon is available
==24823==
==24823== HEAP SUMMARY:
==24823== in use at exit: 8 bytes in 1 blocks
==24823== total heap usage: 55 allocs, 54 frees, 87,023 bytes allocated
==24823==
==24823== 8 bytes in 1 blocks are definitely lost in loss record 1 of 1
==24823== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==24823== by 0x4BA7DC7: __vasprintf_internal (vasprintf.c:71)
==24823== by 0x4C4B742: __asprintf_chk (asprintf_chk.c:34)
==24823== by 0x48687D9: asprintf (stdio2.h:181)
==24823== by 0x48687D9: lttng_set_tracing_group (lttng-ctl.c:2620)
==24823== by 0x4011B89: call_init.part.0 (dl-init.c:72)
==24823== by 0x4011C90: call_init (dl-init.c:30)
==24823== by 0x4011C90: _dl_init (dl-init.c:119)
==24823== by 0x4001139: ??? (in /usr/lib/x86_64-linux-gnu/ld-2.31.so)
==24823== by 0x2: ???
==24823== by 0x1FFEFFFCFE: ???
==24823== by 0x1FFEFFFD04: ???
==24823== by 0x1FFEFFFD12: ???
==24823==
==24823== LEAK SUMMARY:
==24823== definitely lost: 8 bytes in 1 blocks
==24823== indirectly lost: 0 bytes in 0 blocks
==24823== possibly lost: 0 bytes in 0 blocks
==24823== still reachable: 0 bytes in 0 blocks
==24823== suppressed: 0 bytes in 0 blocks
==24823==
==24823== For lists of detected and suppressed errors, rerun with: -s
==24823== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Cause
=====
The allocated pointer in the library constructor is not freed on
subsequent assignation.
Solution
========
Free the pointer.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie1d4c45df2764a88c74d56de691783df9215633c
Francis Deslauriers [Mon, 1 Nov 2021 19:31:26 +0000 (15:31 -0400)]
Fix: Tests: unchecked `close()` return value
CID
1465101 (#1 of 1): Unchecked return value (CHECKED_RETURN)
9. check_return: Calling close without checking return value (as is done
elsewhere 177 out of 185 times).
CID
1465100 (#1 of 1): Unchecked return value (CHECKED_RETURN)
4. check_return: Calling close without checking return value (as is done
elsewhere 177 out of 185 times)
CID
1465099 (#1 of 1): Unchecked return value (CHECKED_RETURN) 4.
check_return: Calling close without checking return value (as is done
elsewhere 177 out of 185 times).
CID
1465098 (#1 of 1): Unchecked return value (CHECKED_RETURN) 4.
check_return: Calling close without checking return value (as is done
elsewhere 177 out of 185 times).
CID
1465097 (#1 of 1): Unchecked return value (CHECKED_RETURN) 4.
check_return: Calling close without checking return value (as is done
elsewhere 177 out of 185 times).
Reported-by: Coverity Scan
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8e2552c75ab7cec5aa3707e2c1c4d9f2484b501a
Jérémie Galarneau [Mon, 1 Nov 2021 19:43:55 +0000 (15:43 -0400)]
Fix: relayd: live: mishandled initial null trace chunk
Observed issue
==============
As reported in #1323 (https://bugs.lttng.org/issues/1323), crashes of
the relay daemon are observed when running the user space clear tests.
The crash occurs with the following stack trace:
#0 0x000055fbb861d6ae in urcu_ref_get_unless_zero (ref=0x28) at /usr/local/include/urcu/ref.h:85
#1 lttng_trace_chunk_get (chunk=0x0) at trace-chunk.c:1836
#2 0x000055fbb86051e2 in make_viewer_streams (relay_session=relay_session@entry=0x7f6ea002d540, viewer_session=<optimized out>, seek_t=seek_t@entry=LTTNG_VIEWER_SEEK_BEGINNING, nb_total=nb_total@entry=0x7f6ea9607b00, nb_unsent=nb_unsent@entry=0x7f6ea9607aec, nb_created=nb_created@entry=0x7f6ea9607ae8, closed=<optimized out>) at live.c:405
#3 0x000055fbb86061d9 in viewer_get_new_streams (conn=0x7f6e94000fc0) at live.c:1155
#4 process_control (conn=0x7f6e94000fc0, recv_hdr=0x7f6ea9607af0) at live.c:2353
#5 thread_worker (data=<optimized out>) at live.c:2515
#6 0x00007f6eae86a609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7 0x00007f6eae78f293 in clone () from /lib/x86_64-linux-gnu/libc.so.6
The race window during which this occurs seems very small as it can take
hours to reproduce this crash. However, a minimal reproducer could be
identified, as stated in the bug report.
Essentially, the same crash can be reproduced by attaching a live viewer
to a session that has seen events being produced, been stopped and been
cleared.
Cause
=====
The crash occurs as an attempt is made to take a reference to a viewer
session’s trace chunk as viewer streams are created. The crux of the
problem is that the code doesn’t expect a viewer session’s trace chunk
to be NULL.
The viewer session’s current trace chunk is initially set, when a viewer
attaches to the viewer session, to a copy the corresponding
relay_session’s current trace chunk.
A live session always attempts to "catch-up" to the newest available
trace chunk. This means that when a viewer reaches the end of a trace
chunk, the viewer session may not transition to the "next" one: it jumps
to the most recent trace chunk available (the one being produced by the
relay_session). Hence, if the producer performs multiple rotations
before a viewer completes the consumption of a trace chunk, it will skip
over those "intermediary" trace chunks.
A viewer session updates its current trace chunk when:
1) new viewer streams are created,
2) a new index is requested,
3) metadata is requested.
Hence, as a general principle, the viewer session will reference the
most recent trace chunk available _even if its streams do not point to
it_. It indicates which trace chunk viewer streams should transition to
when the end of their current trace chunk is reached.
The live code properly handles transitions to a null chunk. This can be
verified by attaching a viewer to a live session, stopping the session,
clearing it (thus entering a null trace chunk), and resuming tracing.
The only issue is that the case where the first trace chunk of a viewer
session is "null" (no active trace chunk) is mishandled in two places:
1) in make_viewer_streams(), where the crash is observed,
2) in viewer_get_metadata().
Solution
========
In make_viewer_streams(), it is assumed that a viewer session will have
a non-null trace chunk whenever a rotation is not ongoing. This is
reflected by the fact that a reference is always acquired on the viewer
session’s trace chunk.
That code is one of the three places that can cause a viewer session’s
trace chunk to be updated. We still want to update the viewer session to
the most recently seen trace chunk (null, in this case). However, there
is no reference to acquire and the trace chunk to use for the creation
of the viewer stream is NULL. This is properly handled by
viewer_stream_create().
The second site to change is viewer_get_metadata() which doesn’t handle
a viewer metadata stream not having an active trace chunk at all.
Thankfully, the protocol allows us to express this condition by
returning the LTTNG_VIEWER_NO_NEW_METADATA status code when a viewer
metadata stream doesn’t have an open file and doesn’t have a current
trace chunk.
Surprisingly, this bug didn’t trigger in the case where a transition to
a null chunk occurred _after_ attaching to a viewer session.
This is because viewers will typically ask for metadata as a result of an
LTTNG_VIEWER_FLAG_NEW_METADATA reply to the GET_NEXT_INDEX command. When
a session is stopped and all data was consumed, this command returns
that no new data is available, causing the viewers to wait and ask again
later.
However, when attaching, babeltrace2 (at least, and probably babeltrace 1.x)
always asks for an initial segment of metadata before asking for an
index.
Known drawbacks
===============
None.
Fixes: #1323
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I516fca60755e6897f6b7170c12d706ef57ad61a5
Francis Deslauriers [Thu, 30 Sep 2021 18:43:11 +0000 (14:43 -0400)]
Fix: configure.ac: reporting SDT uprobe as a UST feature
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I86638d6a148b04e7131e4af7ec830c5e56817fdc
Francis Deslauriers [Thu, 7 Oct 2021 18:52:27 +0000 (14:52 -0400)]
Fix: Tests: leaking epoll fd
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5ec4fcdb87159f35932c20e7314cda764d14967c
Jérémie Galarneau [Mon, 18 Oct 2021 21:10:08 +0000 (17:10 -0400)]
Update version to v2.12.6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Thu, 7 Oct 2021 20:19:41 +0000 (16:19 -0400)]
Fix: sessiond: previously created channel cannot be enabled
Observed issue
==============
A previously created channel cannot be enabled back once a session is
started.
Cause
=====
The check validating that the session was started is to early in the
`cmd_enable_channel` function.
Solution
========
Move the check at the creation code path when the channel is not found.
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: I8e7d62b7e97246e65f1cf9022270293a6dd34cc9
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Thu, 8 Jul 2021 18:17:51 +0000 (14:17 -0400)]
Fix: ust: app stuck on recv message during UST comm timeout scenario
Observed issue
==============
The following scenario lead to the UST thread to be "stuck" on recvmsg
on the notify socket.
The problem manifest itself when an application is unresponsive during
the ustctl_start_session call. Note that the default timeout for ust
communication is 5 seconds.
# Start an instrumented app
./app
gdb lttng-sessiond
# put a breakpoint on ustctl_start_session
lttng create my_session
lttng enable-event -u -a
lttng start
# The tracepoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
sleep 5 # This make sure lttng-sessiond unregister the app from its point of view
kill -s SIGCONT $(pgrep app)
gdb -p $(pgrep app)
thread apply all bt
App stack trace:
Thread 3 (Thread 0x7fe2c6f58700 (LWP 48172)):
#0 __libc_recvmsg (flags=0, msg=0x7fe2c6f56ac0, fd=4) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
#1 __libc_recvmsg (fd=fd@entry=4, msg=msg@entry=0x7fe2c6f56ac0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:25
#2 0x00007fe2c7a010ba in ustcomm_recv_unix_sock (sock=sock@entry=4, buf=buf@entry=0x7fe2c6f56ea0, len=len@entry=48) at lttng-ust-comm.c:308
#3 0x00007fe2c7a037c3 in ustcomm_register_channel (sock=4, session=session@entry=0x7fe2c0000ba0, session_objd=<optimized out>, channel_objd=<optimized out>, nr_ctx_fields=nr_ctx_fields@entry=0, ctx_fields=<optimized out>, chan_id=0x7fe2
c6f5716c, header_type=0x7fe2c0012b18) at lttng-ust-comm.c:1544
#4 0x00007fe2c7a10787 in lttng_session_enable (session=0x7fe2c0000ba0) at lttng-events.c:444
#5 0x00007fe2c7a0b785 in lttng_session_cmd (objd=1, cmd=128, arg=
140611977311672, uargs=0x7fe2c6f57800, owner=0x7fe2c7a5da00 <local_apps>) at lttng-ust-abi.c:576
#6 0x00007fe2c7a07d6d in handle_message (lum=0x7fe2c6f57590, sock=3, sock_info=0x7fe2c7a5da00 <local_apps>) at lttng-ust-comm.c:1003
#7 ust_listener_thread (arg=0x7fe2c7a5da00 <local_apps>) at lttng-ust-comm.c:1712
#8 0x00007fe2c7993609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9 0x00007fe2c78ba293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
...
Cause
=====
When the app continues after the timeout from lttng-sessiond side, the
actual start_session message is received on the application side then
UST, app side, send commands on the notify socket. On lttng-sessiond
side, the command is received but no reply is sent.
This is due to the fact that the lookup against the
ust_app_ht_by_notify_sock hash table (find_app_by_notify_sock)
return nothing since the app is unregistered at this point and the hash
table node was removed on unregistration.
Solution
========
When the app lookup fails, return an error that will trigger the cleanup
of the notify socket.
Known drawbacks
=========
None
Note
=========
Subsequent error path in reply_ust_register_channel,
add_event_ust_registry, and add_enum_ust_registry might lead to the same
type of problem since no reply is sent to the app. Still, for those
cases the complete application/notify socket should not be destroyed
since the error path relate to either a session or a sub object of a
session.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: Iea0dc027ca1ee772e84c7e545114f1be69fd1f63
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Wed, 23 Jun 2021 02:17:03 +0000 (22:17 -0400)]
Fix: ust: UST communication can return -EAGAIN
Observed issue
==============
The following scenario lead to an abort on event creation. The
problem manifest itself when an application is unresponsive. Note that
the default timeout for ust communication is 5 seconds.
# Start an instrumented app
./app
gdb lttng-sessiond
# put a breakpoint on ustctl_create_event.
lttng create my_session
lttng enable-event -u -a
lttng start
# The tracepoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
# lttng-sessiond will abort.
Note that for UST this is not an expected behaviour. Expected
communication failure with a single app should not invalidate the
complete channel, compromise its setup or result in an abort.
Note that a similar scenario for the following ustctl call sites also
lead to scenario where failure of a single app lead to error reporting
and/or error propagation to upper level object.
Problematic callsites:
ustctl_set_exclusion
ustctl_set_filter
ustctl_disable_channel
These callsites are also fixed by this patch.
Cause
=====
For an unresponsive application, EAGAIN is returned and is treated as an
"unknown" hard error.
In this particular case the abort() call was introduced by commit:
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. It is not clear if this is
a leftover from debugging session since this is the only callsite where
an abort is issued on communication failure via ustctl.
Solution
========
Handle EAGAIN coming from ustctl_* and treat it the same way a
dying application is handled. The only minor difference is that we WARN
on communication time out. Albeit not the most useful thing for a CLI
client, it could help overall user of lttng-sessiond in time out
situation.
Most call site already handled "unknown" error correctly. For those call
site we simply end up bringing more info in regards to the timeout
issue instead of mentioning that "-11" was returned.
Note, the reclamation of "app" is handled by the poll loop and
ust_app_unregister since the socket is shutdown by lttng-ust internally
on error, including EAGAIN.
Note that the application will try to register itself back to the
lttng-sessiond based on its configuration.
Known drawbacks
=========
None
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
88e3c2f5610b9ac89b0923d448fee34140fc46fb
Fixes: #1384
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: If364b5d48e7fd2b664276a0fb1b7eec2c45ed683
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Mon, 12 Jul 2021 20:44:38 +0000 (16:44 -0400)]
Fix: ust: segfault on lttng start on filter bytecode copy
Observed issue
==============
A segmentation fault is observed for multiple UST timeout scenarios.
Backtrace:
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1 0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
#2 0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
#3 0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
#4 0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
#5 0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
#6 0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
#7 0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
#8 0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
#9 0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
#10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
#11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
#12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The scenario:
# Start an instrumented app
./app
gdb lttng-sessiond
# put a breakpoint on ustctl_set_filter
lttng create my_session
lttng enable-event -u tp:tp_test
lttng start
lttng enable-event -u __dummy --filter 'my_field == "user34"'
# The tracepoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
# enable-event will return an error. This a bug in itself, still let's
# continue with the current bug.
lttng stop
# Start a new app that will register.
./app &
sleep 1
lttng start
# lttng-sessiond should segfault.
Cause
=====
During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.
On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.
The problem was introduced by commit
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.
Solution
========
Go to `end` label to prevent freeing of the uevent object.
Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.
The `error` label is now obsolete.
Known drawbacks
=========
None.
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
88e3c2f5610b9ac89b0923d448fee34140fc46fb
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 15 Oct 2021 21:03:38 +0000 (17:03 -0400)]
Build fix: Missing message in LTTNG_DEPRECATED invocation
Coverity scan build jobs fail since LTTNG_DEPRECATED expects a string
and none is provided at the lttng_metadata_regenerate use site.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7e6701abd24c679f578b0adead771ac93b6566cd
Simon Marchi [Wed, 6 Oct 2021 14:57:24 +0000 (10:57 -0400)]
include: remove spurious spaces in condition/session-rotation.h
Change-Id: Ia525d24c3b4098dff5c50fb2c5d93c16f6e08f5c
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 5 Oct 2021 20:10:18 +0000 (16:10 -0400)]
tests: fix header of regression/ust/getcpu-override/run-getcpu-override
The "SPDX-License-Identifier:" header is not in a comment, so is
interpreted as a bash command. This is harmless, but it appears in the
test output:
ok 13 - Start tracing for session sequence-cpu
# Launching app with getcpu-plugin wrapper
./tests/regression/ust/getcpu-override//run-getcpu-override: 2: SPDX-License-Identifier:: not found
ok 14 - Application with wrapper done
Fix that, and add a proper copyright notice, based on the other files
that were added at the same time as this one.
Change-Id: Icdf5e2fd5aec4080b2e5cad10cca4813bad26394
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 5 Aug 2021 20:48:51 +0000 (16:48 -0400)]
fix: wrong define used for GCC version check
As far as I can tell, the __GNUC_MAJOR__ define has never existed, the
proper define for the major version is __GNUC__. See
https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html for
more details.
Change-Id: I0d47d524e7efd204fd2f8976311c62e872eb6170
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Mon, 4 Oct 2021 16:41:51 +0000 (12:41 -0400)]
Fix: userspace-probe: unreported error on string copy error
Issue
=====
String copy errors, either due to the length or an allocation failure,
are not reported by
lttng_userspace_probe_location_tracepoint_create_from_payload
and don't log a clear error message.
This allowed truncation bugs like the one fixed in
b45a296 to go
unnoticed.
Fix
===
Return an "invalid" status code and log a more descriptive error
message.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia07cac7cba315ea79337262e9082dd06eb60950f
Francis Deslauriers [Fri, 1 Oct 2021 20:10:24 +0000 (16:10 -0400)]
Fix: userspace-probe: truncating binary path for SDT
Issue
=====
This issue was uncovered when we enabled the testing of the SDT
userspace probe instrumentation on the CI, where the paths to file are
specially long.
The reported error is:
- rule: ma-probe-sdt (type: kernel:uprobe, location type: SDT, location: /root/workspace/dev_gerrit_lttng-tools_rootbuild/arch/amd64/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/node/amd64-rootnode/test_type/base/src/lttng-tools/tests/utils/testapp/userspace-probe-sdt-binary/.libs/userspace-probe-sdt-binary:foobar:tp1)
+ rule: ma-probe-sdt (type: kernel:uprobe, location type: SDT, location: /root/workspace/dev_gerrit_lttng-tools_rootbuild/arch/amd64/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/node/amd64-rootnode/test_type/base/src/lttng-tools/tests/utils/testapp/userspace-probe-sdt-binary/.libs/userspace-probe-s:foobar:tp1)
The important part to notice is that the path to the binary is truncated
compared to was is expected by the test case.
The problem is caused by the
`lttng_userspace_probe_location_tracepoint_create_from_payload()`
function that strdup() the path string using the wrong defined value.
Fix
===
Use LTTNG_PATH_MAX rather then LTTNG_SYMBOL_NAME_LEN to copy the binary
path.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I24cbf413baba405bf4c4b534ccbc2b18f8d5d43f
This page took 0.070917 seconds and 4 git commands to generate.