Jonathan Rajotte [Wed, 23 Jun 2021 02:17:03 +0000 (22:17 -0400)]
Fix: ust: UST communication can return -EAGAIN
Observed issue
==============
The following scenario lead to an abort on event creation. The
problem manifest itself when an application is unresponsive. Note that
the default timeout for ust communication is 5 seconds.
# Start an instrumented app
./app
gdb lttng-sessiond
# put a breakpoint on ustctl_create_event.
lttng create my_session
lttng enable-event -u -a
lttng start
# The tracepoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
# lttng-sessiond will abort.
Note that for UST this is not an expected behaviour. Expected
communication failure with a single app should not invalidate the
complete channel, compromise its setup or result in an abort.
Note that a similar scenario for the following ustctl call sites also
lead to scenario where failure of a single app lead to error reporting
and/or error propagation to upper level object.
Problematic callsites:
ustctl_set_exclusion
ustctl_set_filter
ustctl_disable_channel
These callsites are also fixed by this patch.
Cause
=====
For an unresponsive application, EAGAIN is returned and is treated as an
"unknown" hard error.
In this particular case the abort() call was introduced by commit:
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. It is not clear if this is
a leftover from debugging session since this is the only callsite where
an abort is issued on communication failure via ustctl.
Solution
========
Handle EAGAIN coming from ustctl_* and treat it the same way a
dying application is handled. The only minor difference is that we WARN
on communication time out. Albeit not the most useful thing for a CLI
client, it could help overall user of lttng-sessiond in time out
situation.
Most call site already handled "unknown" error correctly. For those call
site we simply end up bringing more info in regards to the timeout
issue instead of mentioning that "-11" was returned.
Note, the reclamation of "app" is handled by the poll loop and
ust_app_unregister since the socket is shutdown by lttng-ust internally
on error, including EAGAIN.
Note that the application will try to register itself back to the
lttng-sessiond based on its configuration.
Known drawbacks
=========
None
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
88e3c2f5610b9ac89b0923d448fee34140fc46fb
Fixes: #1384
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: If364b5d48e7fd2b664276a0fb1b7eec2c45ed683
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Mon, 12 Jul 2021 20:44:38 +0000 (16:44 -0400)]
Fix: ust: segfault on lttng start on filter bytecode copy
Observed issue
==============
A segmentation fault is observed for multiple UST timeout scenarios.
Backtrace:
#0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:384
#1 0x0000557fe0395df9 in copy_filter_bytecode (orig_f=0x7f9c5802b790) at ust-app.c:1196
#2 0x0000557fe0397702 in shadow_copy_event (ua_event=0x7f9c58025ff0, uevent=0x7f9c58033560) at ust-app.c:1824
#3 0x0000557fe039ac46 in create_ust_app_event (ua_sess=0x7f9c5802ec20, ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, app=0x7f9c5c001da0) at ust-app.c:3192
#4 0x0000557fe03a054d in ust_app_channel_synchronize_event (ua_chan=0x7f9c58025cc0, uevent=0x7f9c58033560, ua_sess=0x7f9c5802ec20, app=0x7f9c5c001da0) at ust-app.c:5096
#5 0x0000557fe03a0772 in ust_app_synchronize (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5173
#6 0x0000557fe03a0a70 in ust_app_global_update (usess=0x7f9c580074a0, app=0x7f9c5c001da0) at ust-app.c:5255
#7 0x0000557fe03a00e0 in ust_app_start_trace_all (usess=0x7f9c580074a0) at ust-app.c:4987
#8 0x0000557fe0355c6a in cmd_start_trace (session=0x7f9c5800a190) at cmd.c:2668
#9 0x0000557fe0382e70 in process_client_msg (cmd_ctx=0x7f9c58003d70, sock=0x7f9c74bf44e0, sock_error=0x7f9c74bf44e4) at client.c:1527
#10 0x0000557fe03848a2 in thread_manage_clients (data=0x557fe06d9440) at client.c:2200
#11 0x0000557fe037d1cb in launch_thread (data=0x557fe06d94b0) at thread.c:75
#12 0x00007f9c796af609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f9c795b6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
The scenario:
# Start an instrumented app
./app
gdb lttng-sessiond
# put a breakpoint on ustctl_set_filter
lttng create my_session
lttng enable-event -u tp:tp_test
lttng start
lttng enable-event -u __dummy --filter 'my_field == "user34"'
# The tracepoint should hit. Do not continue.
kill -s SIGSTOP $(pgrep app)
# Continue lttng-sessiond.
# enable-event will return an error. This a bug in itself, still let's
# continue with the current bug.
lttng stop
# Start a new app that will register.
./app &
sleep 1
lttng start
# lttng-sessiond should segfault.
Cause
=====
During the "lttng enable-event" command, the timeout error bubbles up
all the way to event_ust_enable_tracepoint and is different from
LTTNG_UST_ERR_EXIST. `trace_ust_destroy_event` is called and frees the
`uevent` object. Note that contrary to the comment `uevent` is added to
the channel event hash table at this point.
On the next `lttng start` command, the event node is still present in
the hash table and is iterated on. lttng-sessiond segfault on the first
data access of the previously freed memory.
The problem was introduced by commit
88e3c2f5610b9ac89b0923d448fee34140fc46fb [1]. Which essentially move the
callsite of `add_unique_ust_event` before `ust_app_*_event_glb` calls.
Solution
========
Go to `end` label to prevent freeing of the uevent object.
Note that app synchronization should not force an error at the channel
level, since a single app can fail but the whole channel should not.
The `error` label is now obsolete.
Known drawbacks
=========
None.
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
88e3c2f5610b9ac89b0923d448fee34140fc46fb
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: Ifaf3f4c71bb2da869c7b441aaa4b367f8f7cbdd6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 15 Oct 2021 21:03:38 +0000 (17:03 -0400)]
Build fix: Missing message in LTTNG_DEPRECATED invocation
Coverity scan build jobs fail since LTTNG_DEPRECATED expects a string
and none is provided at the lttng_metadata_regenerate use site.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7e6701abd24c679f578b0adead771ac93b6566cd
Simon Marchi [Wed, 6 Oct 2021 14:57:24 +0000 (10:57 -0400)]
include: remove spurious spaces in condition/session-rotation.h
Change-Id: Ia525d24c3b4098dff5c50fb2c5d93c16f6e08f5c
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 5 Oct 2021 20:10:18 +0000 (16:10 -0400)]
tests: fix header of regression/ust/getcpu-override/run-getcpu-override
The "SPDX-License-Identifier:" header is not in a comment, so is
interpreted as a bash command. This is harmless, but it appears in the
test output:
ok 13 - Start tracing for session sequence-cpu
# Launching app with getcpu-plugin wrapper
./tests/regression/ust/getcpu-override//run-getcpu-override: 2: SPDX-License-Identifier:: not found
ok 14 - Application with wrapper done
Fix that, and add a proper copyright notice, based on the other files
that were added at the same time as this one.
Change-Id: Icdf5e2fd5aec4080b2e5cad10cca4813bad26394
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 5 Aug 2021 20:48:51 +0000 (16:48 -0400)]
fix: wrong define used for GCC version check
As far as I can tell, the __GNUC_MAJOR__ define has never existed, the
proper define for the major version is __GNUC__. See
https://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html for
more details.
Change-Id: I0d47d524e7efd204fd2f8976311c62e872eb6170
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Mon, 4 Oct 2021 16:41:51 +0000 (12:41 -0400)]
Fix: userspace-probe: unreported error on string copy error
Issue
=====
String copy errors, either due to the length or an allocation failure,
are not reported by
lttng_userspace_probe_location_tracepoint_create_from_payload
and don't log a clear error message.
This allowed truncation bugs like the one fixed in
b45a296 to go
unnoticed.
Fix
===
Return an "invalid" status code and log a more descriptive error
message.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia07cac7cba315ea79337262e9082dd06eb60950f
Francis Deslauriers [Fri, 1 Oct 2021 20:10:24 +0000 (16:10 -0400)]
Fix: userspace-probe: truncating binary path for SDT
Issue
=====
This issue was uncovered when we enabled the testing of the SDT
userspace probe instrumentation on the CI, where the paths to file are
specially long.
The reported error is:
- rule: ma-probe-sdt (type: kernel:uprobe, location type: SDT, location: /root/workspace/dev_gerrit_lttng-tools_rootbuild/arch/amd64/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/node/amd64-rootnode/test_type/base/src/lttng-tools/tests/utils/testapp/userspace-probe-sdt-binary/.libs/userspace-probe-sdt-binary:foobar:tp1)
+ rule: ma-probe-sdt (type: kernel:uprobe, location type: SDT, location: /root/workspace/dev_gerrit_lttng-tools_rootbuild/arch/amd64/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/node/amd64-rootnode/test_type/base/src/lttng-tools/tests/utils/testapp/userspace-probe-sdt-binary/.libs/userspace-probe-s:foobar:tp1)
The important part to notice is that the path to the binary is truncated
compared to was is expected by the test case.
The problem is caused by the
`lttng_userspace_probe_location_tracepoint_create_from_payload()`
function that strdup() the path string using the wrong defined value.
Fix
===
Use LTTNG_PATH_MAX rather then LTTNG_SYMBOL_NAME_LEN to copy the binary
path.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I24cbf413baba405bf4c4b534ccbc2b18f8d5d43f
Francis Deslauriers [Thu, 19 Aug 2021 21:14:46 +0000 (17:14 -0400)]
Fix: statements with side-effects in assert statements
Background
==========
When building with the NDEBUG definition the `assert()` statements are
removed.
Issue
=====
Currently, a few `assert()` statements in the code base contain
statements that have side effects and removing them changes the
behavior for the program.
Fix
===
Extract the statements with side effects out of the `assert()`
statements.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0b11c8e25c3380563332b4c0fad15f70b09a7335
Jonathan Rajotte [Thu, 16 Sep 2021 15:20:07 +0000 (11:20 -0400)]
Fix: lttng_trace_archive_location_serialize is called on freed memory
Observed issue
==============
The following backtrace have been reported [1].
#0 __GI_raise (sig=sig@entry=6) at /usr/src/debug/glibc/2.31+gitAUTOINC+
f84949f1c4-r0/git/sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000003123025528 in __GI_abort () at /usr/src/debug/glibc/2.31+gitAUTOINC+
f84949f1c4-r0/git/stdlib/abort.c:79
#2 0x0000000000419884 in lttng_trace_archive_location_serialize (location=0x7f1c9c001160, buffer=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/location.c:230
#3 0x00000000004c8f06 in lttng_evaluation_session_rotation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/conditions/session-rotation.c:539
#4 0x00000000004a80fa in lttng_evaluation_serialize (evaluation=0x7f1cb000a7f0, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/evaluation.c:42
#5 0x00000000004bc24f in lttng_notification_serialize (notification=0x7f1cb961c310, payload=0x7f1cb961c320) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/common/notification.c:63
#6 0x0000000000458b7d in notification_client_list_send_evaluation (client_list=0x7f1cb0008f90, trigger=0x7f1ca40113d0, evaluation=<optimized out>, source_object_creds=0x7f1cb000a874, client_report=0x475840 <client_handle_transmission_status>, user_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/notification-thread-events.c:4379
#7 0x0000000000476586 in action_executor_generic_handler (item=0x7f1cb0009600, work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:696
#8 action_work_item_execute (work_item=0x7f1cb000a820, executor=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:715
#9 action_executor_thread (_data=0x7f1cb0006010) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/action-executor.c:797
#10 0x0000000000462327 in launch_thread (data=0x7f1cb00060b0) at /usr/src/debug/lttng-tools/2.13.0-r0/lttng-tools-2.13.0/src/bin/lttng-sessiond/thread.c:66
#11 0x0000003123408ea4 in start_thread (arg=<optimized out>) at /usr/src/debug/glibc/2.31+gitAUTOINC+
f84949f1c4-r0/git/nptl/pthread_create.c:477
#12 0x00000031230f8dcf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This can be easily reproduced with the following session and trigger
configuration:
lttng create test
lttng enable-event -u -a
lttng start
# Register two similar triggers via a dummy C program since rotation
# completed condition is not exposed on the CLI for now. Yielding the
# following triggers:
lttng list-triggers
- name: trigger0
owner uid: 1000
condition: session rotation completed
session name: test
errors: none
action:notify
errors: none
- name: trigger1
owner uid: 1000
condition: session rotation completed
session name: test
errors: none
action:notify
errors: none
lttng rotate <- abort happens here.
Cause
=====
The problem lies in how the location (`lttng_trace_archive_location`)
object is assigned to the `lttng_evaluation` objects. A single location
object can end up being shared between multiple `lttng_evaluation` objects
since we iterate over all triggers and create an `lttng_evaluation` object
with the location each time as needed.
See `src/bin/lttng-sessiond/notification-thread-events.c:1956`.
The location object is then freed when the first notification is
completely serialized. The second serialization end up having a
reference to a freed `lttng_trace_archive_location` object.
Solution
========
Implement ref counting for the lttng_trace_archive_location object.
Note
=======
This also fixes a leak that was present in `cmd_destroy_session_reply`.
The location is created by `session_get_trace_archive_location` and is
never `destroyed`/`put`.
Known drawbacks
=========
None.
References
==========
[1] https://bugs.lttng.org/issues/1325
Fixes: #1325
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Change-Id: I99dc595ee5b0288c727b193ed061f5273752bd24
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Mon, 13 Sep 2021 20:49:48 +0000 (16:49 -0400)]
Fix: sessiond: ust session is inactive during ust_app_global_update
Observed issue
==============
The following scenario leads to an abort of lttng-sessiond.
lttng-sessiond (with kernel tracing available)
lttng create system-trace --snapshot -U /tmp/snapshot
lttng enable-channel -k system-trace --subbuf-size=4k --num-subbuf=256
lttng enable-event -c system-trace -k 'sched_wak*' -s system-trace
lttng start system-trace
lttng enable-event -u -a
Fails as expected with:
Error: Events: The command tried to enable an event in a new domain for
a session that has already been started once. (channel channel0,
session system-trace)
Launch any ust app such as easy_ust from the lttng-ust repository.
The following backtrace is generated:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7af0859 in __GI_abort () at abort.c:79
#2 0x00007ffff7af0729 in __assert_fail_base (fmt=0x7ffff7c86588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55555564b765 "usess->active", file=0x555555649a60 "ust-app.c", line
#3 0x00007ffff7b01f36 in __GI___assert_fail (assertion=0x55555564b765 "usess->active", file=0x555555649a60 "ust-app.c", line=5123, function=0x55555564ecf0 <__PRETTY_FUNCTION__.14199> "ust_
#4 0x00005555555d1f5e in ust_app_global_update (usess=0x7fffe001fb90, app=0x7fffac000b80) at ust-app.c:5123
#5 0x00005555555b60d4 in update_ust_app (app_sock=82) at dispatch.c:71
#6 0x00005555555b7025 in thread_dispatch_ust_registration (data=0x5555556a07f0) at dispatch.c:409
#7 0x00005555555ad5ab in launch_thread (data=0x5555556a0810) at thread.c:65
#8 0x00007ffff7ce6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9 0x00007ffff7bed293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
This also happens for the track command. You can replace the `lttng
enable-event -u -a` with `lttng track --userspace --vuid=0` then launch
an app and the same backtrace gets generated.
Cause
=====
During `process_client_msg` the `create_ust_session` function is called
and a ust session is assigned to the "system_trace" session with a
state of `active` set to 0 (false). This is not a problem.
The problem seems to lie with a single call site for
`ust_app_global_update` in `update_ust_app`. The status of the ust
session is not checked before calling the `ust_app_global_update`. It is
important to note that all `ust_app_global_update_all` callsites guard
the call with a check against the status of the session.
Solution
========
Guard the call to `ust_app_global_update` with a check of the ust
session active state.
Known drawbacks
=========
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I14d25d99d0609689247cdfa86130bd0219613581
Francis Deslauriers [Wed, 8 Sep 2021 14:16:23 +0000 (10:16 -0400)]
Fix: Tests: race condition in test_ns_contexts_change
Issue
=====
The test script doesn't wait for the test application to complete before
stopping the tracing session. The race is that depending on the
scheduling the application is not always done generating events when the
session is stopped.
Fix
===
Make the test script wait for the termination of the test app before
stopping the session.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I29d9b41d2a2ed60a6c42020509c2067442ae332c
Jérémie Galarneau [Wed, 8 Sep 2021 19:54:32 +0000 (15:54 -0400)]
Fix: Tests: race condition in test_event_tracker
Background
==========
The `test_event_tracker` file contains test cases when the event
generating app in executed in two distinct steps. Those two steps are
preparation and execution.
1. the preparation is the launching the app in the background, and
2. the execution is actually generating the event that should or
should not be traced depending on the test case.
This is useful to test the tracker feature since we want to ensure that
already running apps are notified properly when changing their tracking
status.
Issue
=====
The `test_event_vpid_track_untrack` test case suffers from a race
condition that is easy to reproduce on Yocto.
The issue is that sometimes events are end up the trace when none is
expected.
This is due to the absence of synchronization point at the launch of the
app which leads to the app being scheduled in-between the track-untrack
calls leading to events being recorded to the trace.
It's easy to reproduce this issue on my machine by adding a `sleep 5`
between the track and untrack calls and setting the `NR_USEC_WAIT`
variable to 1.
Fix
===
Using the testapp `--sync-before-last-event-touch` flag to make the app
create a file when all but the last event are executed. We then have the
app wait until we create a file (`--sync-before-last-event`) to generate
that last event. This way, we are sure no event will be generated when
running the track and untrack commands.
Notes
=====
- This issue affects other test cases in this file.
- This commit fixes a typo in the test header.
- This commit adds `diag` calls to help tracking to what test the output
relates to when reading the log.
Backport
========
The changes applying to `test_event_tracker` introduced by
33e5571 were
squashed into this commit as the clean-up affected a lot of areas
impacted by this change.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3c7dcb8e3c3427944fb9c07d2d0b9b6213b6ab32
Jérémie Galarneau [Fri, 20 Aug 2021 19:12:20 +0000 (15:12 -0400)]
Fix: man: lttng-rotate: trace file count/size limitation does not apply
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I21f4809215df1b8e9b1cf24edabd478e9b530969
Jérémie Galarneau [Fri, 6 Aug 2021 16:50:37 +0000 (12:50 -0400)]
Update version to v2.12.5
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Francis Deslauriers [Fri, 6 Aug 2021 13:40:20 +0000 (09:40 -0400)]
Fix: runas: less-than-zero comparison of an unsigned value
Fixes two defects found by Coverity related to unsigned integers being
treated as signed.
Reported by Coverity:
CID
1461333: Control flow issues (NO_EFFECT)
This less-than-zero comparison of an unsigned value is never true. "buf_size < 0UL".
CID
1461332: Integer handling issues (NEGATIVE_RETURNS)
"buf_size" is passed to a parameter that cannot be negative.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id6d4a71960f2ef34f14c05e66ef5d934b7a3e524
Francis Deslauriers [Fri, 23 Jul 2021 20:27:00 +0000 (16:27 -0400)]
Fix: runas: supplementary groups are ignored on lttng save
Observed issue
==============
On `lttng save` the following is reported to the user:
$ sudo -u my_user lttng save -o /tmp/my_dir my_session_name
Error: Permission denied
Note that:
* the running lttng-sessiond is root,
* "my_user" is part of the tracing group,
* "my_user" primary group is "my_user" and is part of group "my_dummy_group"
* The "/tmp/my_dir" has the following permissions:
drwxrwx--- 2 root my_dummy_group 4096 Jul 26 16:39 /tmp/my_dir/
Cause
=====
The supplementary groups are not initialized when the run-as process
demote itself to the user "my_user" to perform the recursive mkdir
required by the `lttng save` command.
From the point of the view the kernel, at the moment of performing the
mkdir call the permissions looks like this:
euid: uid of "my_user"
egid: primary gid of "my_user"
supplementary group list: "root"
Note that the kernel does not treat the presence of the root group in
the supplementary group list in any special way. Since "root gid" !=
"my_dummy_group gid" the directory creation is refused.
Solution
========
Use initgroups(3) to initialize the supplementary group list.
Known drawbacks
=========
None.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I58656a3107e4f7b59a2391a4759988401cad7a2b
Simon Marchi [Mon, 2 Aug 2021 01:02:39 +0000 (21:02 -0400)]
Fix: lttng: free sessions in cmd_destroy
When doing `lttng destroy`, I get:
Direct leak of 4385 byte(s) in 1 object(s) allocated from:
#0 0x7f74ae025459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7f74add4129a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
#2 0x7f74add42b9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
#3 0x7f74add42f9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
#4 0x7f74add41714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
#5 0x7f74add41747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
#6 0x7f74add4a922 in lttng_list_sessions /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2105
#7 0x56472bcbdf80 in cmd_destroy /home/simark/src/lttng-tools/src/bin/lttng/commands/destroy.c:330
#8 0x56472bd00764 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#9 0x56472bd01218 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
#10 0x56472bd0151a in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
#11 0x7f74ad963b24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
This is due to cmd_destroy not free'ing the result of
lttng_list_sessions. Fix that.
Change-Id: Iff2e75e6ec1cdcd0bdfdbbc3d5099422e592905b
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Mon, 2 Aug 2021 00:33:23 +0000 (20:33 -0400)]
Fix: lttng: free domains and channels in get_session_stats_str
When doing `lttng stop`, I get:
Direct leak of 656 byte(s) in 1 object(s) allocated from:
#0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
#2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
#3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
#4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
#5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
#6 0x7f9706ec4604 in lttng_list_channels /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2262
#7 0x55837235c4e7 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:499
#8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
#9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
#10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
#11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
#13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
#14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Direct leak of 308 byte(s) in 1 object(s) allocated from:
#0 0x7f970719e459 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x7f9706eba29a in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
#2 0x7f9706ebbb9d in recv_sessiond_optional_data /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:494
#3 0x7f9706ebbf9a in lttng_ctl_ask_sessiond_fds_varlen /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:596
#4 0x7f9706eba714 in lttng_ctl_ask_sessiond_varlen_no_cmd_header /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:58
#5 0x7f9706eba747 in lttng_ctl_ask_sessiond /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl-helper.h:78
#6 0x7f9706ec421c in lttng_list_domains /home/simark/src/lttng-tools/src/lib/lttng-ctl/lttng-ctl.c:2220
#7 0x55837235c3d3 in get_session_stats_str /home/simark/src/lttng-tools/src/bin/lttng/utils.c:484
#8 0x55837235bf73 in print_session_stats /home/simark/src/lttng-tools/src/bin/lttng/utils.c:445
#9 0x55837231cc12 in stop_tracing /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:138
#10 0x55837231d062 in cmd_stop /home/simark/src/lttng-tools/src/bin/lttng/commands/stop.c:229
#11 0x55837235e63e in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#12 0x55837235f0f2 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:426
#13 0x55837235f3f4 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:475
#14 0x7f9706adcb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
This is due to the get_session_stats_str function not free'ing the
results of lttng_list_channels and lttng_list_domains. Fix that.
Change-Id: I4c200d3df41bf09bdce8eadb000abbff7fe5a751
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 16 Jun 2021 19:08:21 +0000 (15:08 -0400)]
Fix: rotation client example: leak of handle on error
1452927 Resource leak
The system resource will not be reclaimed and reused, reducing the
future availability of the resource.
In setup_session: Leak of memory or pointers to system
resources (CWE-404)
CID
1452927 (#1 of 1): Resource leak (RESOURCE_LEAK)8. leaked_storage:
Variable chan_handle going out of scope leaks the storage it points to
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4c215ac4a86f9f70fd5c9d3aa13f944d3d7a2cc7
Michael Jeanson [Tue, 15 Jun 2021 18:59:21 +0000 (14:59 -0400)]
.gitreview: Set default branch to 'stable-2.12'
Change-Id: I5ba90465e49e01203172d3db7d252cfda2e4fcf8
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Tue, 11 May 2021 14:29:50 +0000 (10:29 -0400)]
tests: Move tap-driver.sh out of the autotools aux directory
We have made local modifications to this script, move it to the test
suite directory so it doesn't get overwritten by an updated version from
autotools.
Change-Id: I22823176cc5b98901865c9415250ce8e49384cbd
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Francis Deslauriers [Wed, 16 Jun 2021 16:10:42 +0000 (12:10 -0400)]
Fix: use of uninitialised bytes valgrind warning
Issue
=====
Valgrind reports usage of uninitialised stack allocated memory:
==
2961363== Thread 9 Client manageme:
==
2961363== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
==
2961363== at 0x521418D: __libc_sendmsg (sendmsg.c:28)
==
2961363== by 0x521418D: sendmsg (sendmsg.c:25)
==
2961363== by 0x53411B: lttcomm_send_unix_sock (unix.c:294)
==
2961363== by 0x48AA8C: send_unix_sock (client.c:896)
==
2961363== by 0x484F45: thread_manage_clients (client.c:2865)
==
2961363== by 0x480FB4: launch_thread (thread.c:66)
==
2961363== by 0x5208608: start_thread (pthread_create.c:477)
==
2961363== by 0x5346292: clone (clone.S:95)
==
2961363== Address 0x7575389 is 25 bytes inside a block of size 16,384 alloc'd
==
2961363== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==
2961363== by 0x4EB618: lttng_dynamic_buffer_set_capacity (dynamic-buffer.c:166)
==
2961363== by 0x4EB52C: lttng_dynamic_buffer_append (dynamic-buffer.c:55)
==
2961363== by 0x48CBA1: setup_lttng_msg (client.c:125)
==
2961363== by 0x48AD70: setup_lttng_msg_no_cmd_header (client.c:860)
==
2961363== by 0x489825: process_client_msg (client.c:2253)
==
2961363== by 0x484A97: thread_manage_clients (client.c:2807)
==
2961363== by 0x480FB4: launch_thread (thread.c:66)
==
2961363== by 0x5208608: start_thread (pthread_create.c:477)
==
2961363== by 0x5346292: clone (clone.S:95)
==
2961363== Uninitialised value was created by a stack allocation
==
2961363== at 0x485FE4: process_client_msg (client.c:928)
After some digging, I found that this warning was caused by the padding
of the `struct lttng_session_list_schedules_return` during the
`LTTNG_SESSION_LIST_ROTATION_SCHEDULES` command.
All the fields are of the stack allocated struct are initialised by the
designated initializer but the padding is not.
These padding bytes are reported by Valgrind as being used
uninitialised.
Fix
===
Remove the padding by adding the LTTNG_PACKED attribute to the nested
structs in `struct lttng_session_list_schedules_return`.
Notes
=====
In light of the actual root cause, this is stacktrace is not really
useful.
The realloc call to grow the buffer makes it hard to find what is the
actual uninitialised stack allocation because Valgrind reports the
realloc call as the problematic site.
I was able to track this issue by adding a "consuming" step in the
`lttng_dynamic_buffer_append()` function. This consuming step would sum
all the bytes of the `buf` parameter so as to force Valgrind to check
each byte and not wait until the `sendmsg()` call. This way, I was able
to get a more precise location of the root cause of the issue.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib4a729575e9117cf95716ad25e1417c833f4232b
Jérémie Galarneau [Mon, 14 Jun 2021 19:21:23 +0000 (15:21 -0400)]
Fix: bump minimal urcu dependency to 0.11
cds_lfht_destroy is used from within RCU read-side critical sections
which is only allowed for urcu >= 0.10 (see userspace-rcu commit
d0ec0ed2f).
Such uses were introduced as part of the 2.11 release (contemporary
to urcu 0.10).
This version of the fix differs from
879000628 since the configure
script has changed significantly for the 2.13 release (use of PKG_CONFIG
rather than looking for a version-specific symbol).
The configure script looks for `urcu_memb_barrier` which was introduced
in liburcu 0.11.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7d8daba877cbdcfaf93cdc4c9b16a3b3df974782
Mathieu Desnoyers [Thu, 27 May 2021 20:45:55 +0000 (16:45 -0400)]
Fix: consumer: unbalanced RCU read-side lock on error
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I22ddc8585af84d07e8f3d19e9df582599352a1b8
Jonathan Rajotte [Mon, 15 Mar 2021 17:13:37 +0000 (13:13 -0400)]
Fix: tests: missing LOG_DRIVER and LOG_DRIVER_FLAGS
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I808bb2bb80824a574850cc3015cb24eec7b3ff81
Mathieu Desnoyers [Thu, 27 May 2021 21:11:22 +0000 (17:11 -0400)]
Fix: list_lttng_agent_events: unbalanced RCU read-side lock on error
The error label jumps to the end label which releases the RCU read-side
lock. There are many error paths in this function which goto error
without holding the RCU read-side lock, thus causing unbalanced RCU
read-side lock.
There is no point in keeping so short RCU read-side critical sections,
so cover the entire function with a single read-side critical section.
[ Applies to stable-2.12 and possibly prior versions. Does _not_ apply
to stable-2.13+. ]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5b20c229a5df22d22ecfdc64dbbb87ee118649d2
Jérémie Galarneau [Tue, 18 May 2021 21:53:45 +0000 (17:53 -0400)]
Update version to v2.12.4
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Tue, 1 Dec 2020 17:19:53 +0000 (12:19 -0500)]
Fix: snapshot path have domain subdir duplicate "ust/ust" or "kernel/kernel"
This is a combination of 2 commits.
This is the 1st commit message:
Observed issue
==============
lttng-ivc observed unexpected path generated for streamed snapshot:
joraj-alpa/
test-
20190319-120000-
20210113-110101/
snapshot-1-
20210113-110102-0/
* ust/
* ust/
pid/
app-ust-
2362198-
20210113-110101/
channel_0
channel_1
metadata
channel_2
channel_3
"ust" the domain subdir is present two-time instead off only one time.
The same problem is seen for kernel snapshots.
Cause
=====
Based on dissection the problem was introduced by commit
5da88b0f58d7f838068037ea449ddfb25d3e85ad [1]
For snapshots, the consumer output object of the snapshot output and
information is fetched from the *original* consumer output.
(code block around src/bin/lttng-sessiond/cmd.c:4748)
The snapshot consumer output does not contains the necessary information
to populate domain_subdir on copy (domain_subdir is '\0').
This would lead to a len evaluation of 1 for the consumer_path_offset in
setup_channel_trace_path:70. This would end up not "skipping" the "ust"
and "kernel" part of the path.
Solution
========
Part of the solution is to copy the domain_subdir from the original
session consumer output to the snapshot output.
Still, the problem was still present since that now that the
domain_subdir was not "\0", the value was suffixed to the passed
session_path. In the snapshot code path, "ust/" and "kernel/" were
already present in the session_path passed to setup_channel_trace_path.
A quick modification at the caller level in the snapshot code path fixes
the issues once and for all.
Tests
=======
The test suit is augmented for certain key tests to validate the
complete path of a trace.
Path validation is based on crude bash pattern matching.
Known drawbacks
=========
None
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
5da88b0f58d7f838068037ea449ddfb25d3e85ad
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
This is the commit message #2:
Fix: expected procname should not have -ust suffix
commit ("Fix: ustcomm: application name uses the '-ust'-suffixed thread
name") in LTTng-UST removes "-ust" suffix from procname, which appears
in the output trace directory hierarchy.
Adapt tests to not expect this quirk.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie7cd2d5471ee3a942fa511e2f4cab09e3aa499e4
Jérémie Galarneau [Wed, 7 Oct 2020 18:10:35 +0000 (14:10 -0400)]
Fix: relayd: failure to read index entry or stream packet after clear
Observed issue
==============
The clear tests occasionally fail with the following babeltrace error
when a live session is stopped following a "clear". Unfortunately, this
problem only seems to occur on certain machines. In my case, I only
managed to reproduce this on the CI's workers.
10-07 12:39:48.333 7679 7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_get_stream_bytes@viewer-connection.c:1610 [lttng-live] Received get_data_packet response: error
10-07 12:39:48.333 7679 7679 E PLUGIN/CTF/MSG-ITER request_medium_bytes@msg-iter.c:563 [lttng-live] User function failed: status=ERROR
10-07 12:39:48.333 7679 7679 E PLUGIN/CTF/MSG-ITER ctf_msg_iter_get_next_message@msg-iter.c:2899 [lttng-live] Cannot handle state: msg-it-addr=0x5603c28e2830, state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
10-07 12:39:48.333 7679 7679 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_iterator_next_handle_one_active_data_stream@lttng-live.c:845 [lttng-live] CTF message iterator failed to get next message: msg-iter=0x5603c28e2830, msg-iter-status=ERROR
10-07 12:39:48.333 7679 7679 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
10-07 12:39:48.333 7679 7679 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
10-07 12:39:48.333 7679 7679 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
10-07 12:39:48.333 7679 7679 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x5603c28dbe70, muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
10-07 12:39:48.333 7679 7679 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x5603c28dc960, muxer-comp-addr=0x5603c28db0a0, muxer-msg-iter-addr=0x5603c28dbe70, msg-iter-addr=0x5603c28caf80, status=ERROR
10-07 12:39:48.333 7679 7679 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
10-07 12:39:48.333 7679 7679 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140, comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
10-07 12:39:48.333 7679 7679 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully
10-07 12:39:48.333 7679 7679 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_detach@viewer-connection.c:1227 [lttng-live] Unknown detach return code 0
ERROR: [Babeltrace CLI] (babeltrace2.c:2548)
Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
Component's "consume" method failed: status=ERROR, comp-addr=0x5603c28dcb60,
comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
(`text` fo", comp-class-is-frozen=0, comp-class-so-handle-addr=0x5603c28c8140,
comp-class-so-handle-path="/home/jenkins/jgalar-debug/build/usr/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x5603c28caf80, iter-upstream-comp-name="muxer",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
iter-upstream-comp-class-name="muxer",
iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
Cannot validate muxer's upstream message iterator wrapper:
muxer-msg-iter-addr=0x5603c28dbe70,
muxer-upstream-msg-iter-wrap-addr=0x5603c28cd0f0
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x5603c28cb0f0, iter-upstream-comp-name="lttng-live",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
iter-upstream-comp-class-name="lttng-live",
iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
Error preparing the next batch of messages:
live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:845)
CTF message iterator failed to get next message: msg-iter=0x5603c28e2830,
msg-iter-status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:2899)
Cannot handle state: msg-it-addr=0x5603c28e2830,
state=DSCOPE_TRACE_PACKET_HEADER_BEGIN
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (msg-iter.c:563)
User function failed: status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1610)
Received get_data_packet response: error
This occurs immediately following a 'stop' on the session. As the error
indicates, a request to obtain a data packet fails with a generic
error reply.
Moreover, the following LTTNG_VIEWER_DETACH_SESSION appears to fail
with an invalid status code. This is addressed in a different commit.
Reproducing the test's failure without redirecting the relay daemon's
allows us to see the following errors after the first stop:
PERROR - 14:33:44.
929675253 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
PERROR - 14:33:45.
030037417 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
PERROR - 14:33:45.
130429370 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
PERROR - 14:33:45.
230829447 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
PERROR - 14:33:45.
331223320 [25108/25115]: Failed to open fs handle to ust/uid/1001/64-bit/index/chan_0.idx, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
This is produced with the following back-trace:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff69648b1 in __GI_abort () at abort.c:79
#2 0x00005555555b4f1f in fd_tracker_open_fs_handle (tracker=0x55555582c620, directory=0x7fffe8006680,
path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0, mode=0x7ffff0a24508) at fd-tracker.c:550
#3 0x0000555555595c34 in _lttng_trace_chunk_open_fs_handle_locked (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx",
flags=0, mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1388
#4 0x0000555555595eef in lttng_trace_chunk_open_fs_handle (chunk=0x7fffe0002130, file_path=0x7ffff0a25870 "ust/uid/1001/64-bit/index/chan_1.idx", flags=0,
mode=432, out_handle=0x7ffff0a24710, expect_no_file=true) at trace-chunk.c:1433
#5 0x00005555555da6c2 in _lttng_index_file_create_from_trace_chunk (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, unlink_existing_file=false, flags=0,
expect_no_file=true, file=0x7fffe0002270) at index.c:97
#6 0x00005555555dad8a in lttng_index_file_create_from_trace_chunk_read_only (chunk=0x7fffe0002130, channel_path=0x7fffe8018c30 "ust/uid/1001/64-bit",
stream_name=0x7fffe8018c10 "chan_1", stream_file_size=0, stream_file_index=0, index_major=1, index_minor=1, expect_no_file=true, file=0x7fffe0002270)
at index.c:186
#7 0x000055555557640f in try_open_index (vstream=0x7fffe0002250, rstream=0x7fffe8018c50) at live.c:1378
#8 0x0000555555577155 in viewer_get_next_index (conn=0x7fffd4001440) at live.c:1643
#9 0x0000555555579a01 in process_control (recv_hdr=0x7ffff0a27c30, conn=0x7fffd4001440) at live.c:2311
#10 0x000055555557a1db in thread_worker (data=0x0) at live.c:2482
#11 0x00007ffff6d1c6db in start_thread (arg=0x7ffff0a28700) at pthread_create.c:463
#12 0x00007ffff6a45a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
That problem is mostly cosmetic in nature (the open can fail
"legitimately") as the PERROR should simply not be printed and is
addressed in a different commit.
This error is also produced after a 'clear' is issued:
PERROR - 14:33:45.
532782268 [25108/25115]: Failed to read from file system handle of viewer stream id 1, offset: 4096: No such file or directory (in viewer_get_packet() at live.c:1849)
Which is produced with the following back-trace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f53e297c8b1 in __GI_abort () at abort.c:79
#2 0x000055dd77ccef2c in viewer_get_packet (conn=0x7f53c4001100) at live.c:1850
#3 0x000055dd77cd0a15 in process_control (recv_hdr=0x7f53dca3fc30, conn=0x7f53c4001100) at live.c:2315
#4 0x000055dd77cd11db in thread_worker (data=0x0) at live.c:2483
#5 0x00007f53e2d346db in start_thread (arg=0x7f53dca40700) at pthread_create.c:463
#6 0x00007f53e2a5da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
A similar problem occurs, although more rarely, when reading an
index entry in viewer_get_next_index().
Cause
=====
The following situation leads to both failures to get a
packet and failures to get the next index:
- Viewer connects to an existing session,
- Viewer consumes a number of packets, alternating the
GET_NEXT_INDEX and GET_PACKET command,
- The session's streams are rotated to a new trace chunk
(as part of a clear),
- The session is started and stopped, causing new packets
to be produced and received,
- The session is stopped and destroyed, causing the session's
streams to rotate into a "null" trace chunk (no active
trace files),
- Viewer issues GET_NEXT_INDEX or GET_PACKET, but the fact
that a rotation occurred on the receiving end is not detected
as the relay streams' trace chunk are "null".
The crux of the problem is that lttng_trace_chunk_ids_equal() is
bypassed when the current trace chunk of a relay stream is "null".
The rationale for skipping this check is that it is assumed that the
files currently opened by the live server can can still be used even
if the consumer has rotated the corresponding streams into a 'null'
trace chunk, meaning no trace chunk is 'set' for those streams.
This makes sense in one scenario: the session was destroyed and we wish
to allow a connected live client to finish consuming the trace packets
up to the end of the session's lifetime.
Here, the situation is different. The viewer is reading chunk 'A'.
Meanwhile, a rotation occurs into chunk 'B' and packets are received for
chunk 'B'. Then, a rotation to a 'null' chunk (no active chunk) occurs.
In essence, the live server never sees the rotation between chunk 'A'
and 'B', and simply assumes that a rotation from 'A' to 'null' occurred,
as would happen at the end of a session.
In terms of the code, in viewer_get_next_index(), a call to
check_index_status() is performed to determine if an index is available.
The function checks that `index_received_seqcount` is greater than
`index_sent_seqcount`. In that case, it determines that an index must be
available.
Unfortunately, there is no way for the live server to determine that the
remaining indexes are in a chunk that doesn't exist anymore (chunk 'B').
Thus, viewer_get_next_index() attempts to read an index entry from the
current index file and fails.
Solution
========
1) lttng_trace_chunk_ids_equal() is modified to properly handle
'null' trace chunks:
- A null and a non-null trace chunk are not equal,
- Two null trace chunks are equal.
2) Rotation count
A rotation counter is introduced to track the number of rotations
that occurred during a relay stream's lifetime. This counter is
sampled by the matching viewer streams on creation and on rotation
and is used to determine if all rotations were "seen" by the viewer
stream.
Hence, this allows us to handle the special case where a viewer
is consuming the contents of a relay stream that just transitioned
into a 'null' trace chunk (see comments in patch).
The rest of the modifications simply allow the live server to handle
null trace chunks in viewer streams. This fixes another unrelated bug
that I observed while investigating this: sessions that don't have an
active trace chunk are not shown when listing sessions with babeltrace.
To reproduce, simply stop, clear a session, and attempt to list the
sessions of the associated relay daemon.
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibb3116990e34b7ec3b477f3482d0c0ff1e848d09
Jérémie Galarneau [Thu, 22 Apr 2021 18:47:14 +0000 (14:47 -0400)]
Fix: sessiond: leak of config_path on duplicate --config option
1452207 Resource leak
The system resource will not be reclaimed and reused, reducing the
future availability of the resource.
CID
1452373 (#3 of 3): Resource leak (RESOURCE_LEAK)
26. overwrite_var: Overwriting config_path in config_path =
utils_expand_path(optarg) leaks the storage that config_path points to.
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I70eff0f4de5ac9b84b699e88232669259eab54db
Jonathan Rajotte [Thu, 6 May 2021 15:14:52 +0000 (11:14 -0400)]
Fix: relayd: live: data is missing between viewer attach and retry
Observed issue
==============
Data produced between the time a live viewer attach and a viewer
subsequent "polling" is missing from the point of view of the viewer.
Using the following reproducer:
# A lttng-sessiond is already running
lttng-relayd -b -vvv > relayd.log 2>&1
lttng create live --live
lttng enable-event -u "hello_world:*"
lttng start
date
# Set the "polling" interval at ~30 seconds
babeltrace2 --retry-duration=
30000000 -i lttng-live net://localhost/host/$HOSTNAME/live &
pid=$!
# Make sure we are in the retry phase of bt2
date
sleep 2
# Produce events
# Note here that hello perform a sleep of 1 seconds before generating
# 10 events to make sure lttng-ust had time to register. A total of
# 100 events is generated.
date
for i in `seq 1 10`; do
./hello
done
# Here we expect to see 100 events outputted at some point (~18
# seconds (30 - 2 - 10) from that time).
# No events are received by babeltrace2.
date
sleep 40
date
# We retry the same exercise a second time.
echo "Moving to phase 2"
date
for i in `seq 1 10`; do
./hello
done
# Here data is received correctly.
date
sleep 40
date
kill $pid
wait
lttng destroy live
pkill lttng-relayd
Cause
=====
At the moment the viewer attach is done, no streams exist. On the
following viewer_get_new_streams (~30 seconds later), streams now exist
since they were allocated lazily following the first application
registration, but LTTNG_VIEWER_SEEK_LAST is used. This essentially
"discards" any events in the stream that happened between the attach
time and the viewer_get_new_stream execution.
Note that the same problem would happen if a new UID (in per-uid mode)
would happen to begin tracing in between viewer retry.
Solution
========
Always use `LTTNG_VIEWER_SEEK_BEGINNING` during
`viewer_get_new_streams`.
Known drawbacks
=========
This does not fix the per-pid problem where an app run and die in between
the viewer retries.
References
==========
https://lists.lttng.org/pipermail/lttng-dev/2021-May/029953.html
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If0e2df639572e4bce91bac03994494ddbe138315
Jérémie Galarneau [Mon, 3 May 2021 16:50:25 +0000 (12:50 -0400)]
Fix: lttng-ctl: erroneous check if user is part of the tracing group
in_tgroup is set to `-1` whenever the current user is not part of the
tracing group _or_ if an error occurred while looking up if the user
is part of the tracing group. In other words, the value '0' is unused.
in_tgroup must be explicitly checked against '1' and can't be assumed
to behave as a boolean value.
This is _not_ a security issue: if the user is not part of the tracing
group, she will fail to open the root session damon's socket because
of the kernel-side permission checking. However, the behaviour of the
lttng client (and error reporting) will be confusing.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1c12285fd0f42bb73c560a32a56cab03f1a5a6e1
Mathieu Desnoyers [Fri, 30 Apr 2021 16:02:47 +0000 (12:02 -0400)]
Fix: kernel consumer: get next subbuffer EAGAIN handling
The caller of get next subbuffer (data and metadata) callbacks only
expects -ENODATA when there is no data to read. However, the kernel
tracer distinguishes between no data for a finalized stream (-ENODATA)
and no data for a non-finalized stream (-EAGAIN).
Given that the consumer daemon uses the POLLHUP returned by epoll to
detect stream end of life, it does not care about the distinction
between -EAGAIN and -ENODATA when streaming.
However, taking a snapshot of a metadata stream uses the distinction
between nodata and again. Change this so it considers a return value of
0 from lttng_consumer_read_subbuffer to mean there is no more data to
read, so we can combine -EAGAIN and -ENODATA within get next subbuffer
callbacks and return -ENODATA for both.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ifb157dbe28498279dce30e9efa3b5aedcf3f9b1d
Mathieu Desnoyers [Fri, 30 Apr 2021 14:44:44 +0000 (10:44 -0400)]
Fix: kernel consumer: signal metadata ready condition variable
The behaviour of the kernel consumer does not match that of the user
space consumer. When a live metadata stream goes back to "sleep", it
must broadcast on the metadata_rdv to wake-up any thread that is
performing a metadata sync.
This mismatch causes a hang in the kernel consumer during the kernel
clear tests.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7c93007b6d08f340881c08f23931582443bfa397
Jérémie Galarneau [Thu, 29 Apr 2021 20:17:45 +0000 (16:17 -0400)]
Fix: consumerd: unbalanced subbuffer 'get' when checking operation availability
A WARN_ON in lttng-modules is hit in the get_subbuf ioctl, which
pointed to an unbalanced get/put pair when accessing the subbuffers
of a channel.
517716.168856] ------------[ cut here ]------------
[517716.171559] WARNING: CPU: 1 PID: 19313 at /home/efficios/git/lttng-modules/src/lib/ringbuffer/ring_buffer_frontend.c:1263 lib_ring_buffer_get_subbuf+0x24f/0x260 [lttng_lib_ring_buffer]
[517716.180096] Modules linked in: lttng_test(O) lttng_probe_x86_exceptions(O) lttng_probe_x86_irq_vectors(O) lttng_probe_writeback(O) lttng_probe_workqueue(O) lttng_probe_vmscan(O) lttng_probe_udp(O) lttng_probe_timer(O) lttng_probe_sunrpc(O) lttng_probe_statedump(O) lttng_probe_sock(O) lttng_probe_skb(O) lttng_probe_signal(O) lttng_probe_scsi(O) lttng_probe_sched(O) lttng_probe_regulator(O) lttng_probe_regmap(O) lttng_probe_rcu(O) lttng_probe_random(O) lttng_probe_printk(O) lttng_probe_power(O) lttng_probe_net(O) lttng_probe_napi(O) lttng_probe_module(O) lttng_probe_kmem(O) lttng_probe_jbd2(O) lttng_probe_irq(O) lttng_probe_i2c(O) lttng_probe_gpio(O) lttng_probe_ext4(O) lttng_probe_compaction(O) lttng_probe_btrfs(O) lttng_probe_block(O) lttng_counter_client_percpu_32_modular(O) lttng_counter_client_percpu_64_modular(O) lttng_counter(O) lttng_ring_buffer_event_notifier_client(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O)
[517716.180815] lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_wrapper(O) lttng_uprobes(O) lttng_clock(O) lttng_kprobes(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O) [last unloaded: lttng_wrapper]
[517716.213228] CPU: 1 PID: 19313 Comm: lttng-consumerd Tainted: G O 5.11.2 #80
[517716.215573] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[517716.220341] RIP: 0010:lib_ring_buffer_get_subbuf+0x24f/0x260 [lttng_lib_ring_buffer]
[517716.222579] Code: 50 f0 ff 00 0f 0b 49 03 5f 28 44 8b 85 80 00 00 00 49 8b 77 30 45 85 c0 48 89 d9 0f 85 4f ff ff ff e9 25 ff ff ff f0 ff 45 00 <0f> 0b b8 f0 ff ff ff e9 a6 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00
[517716.227610] RSP: 0018:
ffffbafd09023e88 EFLAGS:
00010202
[517716.229147] RAX:
0000000000000000 RBX:
ffff9906ed069a00 RCX:
ffff9905c4e4a400
[517716.231186] RDX:
ffffdafcffc4ec90 RSI:
0000000000200000 RDI:
ffffdafcffc4e9f0
[517716.233265] RBP:
ffff9905c4e4a400 R08:
0000000000300000 R09:
0000000000200000
[517716.235284] R10:
0000000000200000 R11:
0000000000000000 R12:
0000000000000000
[517716.237333] R13:
0000000000000000 R14:
000000000000005e R15:
0000000000000000
[517716.239360] FS:
00007ff327fff700(0000) GS:
ffff9905a7a40000(0000) knlGS:
0000000000000000
[517716.241634] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[517716.243262] CR2:
000055b91cf07730 CR3:
000000076f45e003 CR4:
00000000001706e0
[517716.245316] Call Trace:
[517716.246281] lib_ring_buffer_ioctl+0x181/0x300 [lttng_lib_ring_buffer]
[517716.248301] lttng_stream_ring_buffer_ioctl+0x1a3/0x200 [lttng_tracer]
[517716.252621] __x64_sys_ioctl+0x8e/0xd0
[517716.253931] do_syscall_64+0x33/0x80
[517716.255016] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[517716.256448] RIP: 0033:0x7ff3372f46d7
[517716.257586] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
[517716.262472] RSP: 002b:
00007ff327ffe2c8 EFLAGS:
00000246 ORIG_RAX:
0000000000000010
[517716.264634] RAX:
ffffffffffffffda RBX:
00007ff310002740 RCX:
00007ff3372f46d7
[517716.266674] RDX:
0000000000000000 RSI:
000000000000f605 RDI:
000000000000005e
[517716.268713] RBP:
00007ff327ffe310 R08:
00007ff310002870 R09:
a002000000000000
[517716.270732] R10:
000055992b0a6530 R11:
0000000000000246 R12:
000055992c47da70
[517716.272768] R13:
00007ff318005e80 R14:
00007ff310002740 R15:
000055992b0a6528
[517716.274800] irq event stamp:
4526705
[517716.275897] hardirqs last enabled at (
4526713): [<
ffffffff9016a474>] console_unlock+0x4b4/0x5b0
[517716.278320] hardirqs last disabled at (
4526722): [<
ffffffff9016a3d0>] console_unlock+0x410/0x5b0
[517716.280759] softirqs last enabled at (
4526658): [<
ffffffff9120030f>] __do_softirq+0x30f/0x432
[517716.285125] softirqs last disabled at (
4526653): [<
ffffffff91001052>] asm_call_irq_on_stack+0x12/0x20
[517716.287648] ---[ end trace
506e55b312b731bf ]---
The check for the availability of the 'get_next_check_metadata'
operation attempts to use the operation at the creation of the metadata
stream. Most of the time this occurs before any metadata could be
generated.
However, the check will sometimes (very rarely) occur after the
generation of some metadata causing the 'get' to succeed and,
consequently, a subbuffer to be acquired. In those cases, the subbuffer
must be released immediately.
Fixes #1313
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic4f3deb6fa665a13e3673fd859b23d7ef803f963
Francis Deslauriers [Thu, 22 Apr 2021 21:24:57 +0000 (17:24 -0400)]
Fix: tests: gen-ust-events-ns: Uninitialized argument value
If both `if (snprintf(...` of the `get_ns_inum()` function fail, the
function will not uninitialize the `ns_inum` output parameter and still
return 0. Leading to the argument `ns1` of debug_printf() being used
uninitialized.
Reported-by: scan-build.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I090e57d5c896fbac7381989e887ae7281697b4de
Jérémie Galarneau [Thu, 22 Apr 2021 23:54:47 +0000 (19:54 -0400)]
Fix: consumerd: strlen called on uninitialized path
closed_trace_chunk_path is uninitialized when relayd_id != 0.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I71564bdce0617de7749fc5848764a8cda4d122ea
Jérémie Galarneau [Thu, 22 Apr 2021 16:32:26 +0000 (12:32 -0400)]
Fix: tests: health thread stall: only stop consumerd when required
Since
a0f8e3109, stop_lttng_consumerd will report a failure when
there is no consumer daemon to kill. This fix ensures it is only
invoked for tests that launch a consumer daemon.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ifba6c0a87158d5cf20eeacb2b65fe049197d9b55
Francis Deslauriers [Mon, 19 Apr 2021 18:58:31 +0000 (14:58 -0400)]
Fix: tests: quote variable in case it's not set
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I34d39498a271e3a1644f49ac7f86f3e7f6d37a0f
Jonathan Rajotte [Mon, 15 Mar 2021 21:52:24 +0000 (17:52 -0400)]
Fix: sessiond: session destroy hang in per-uid when context cannot be added
Observed issue
==============
The system_test CI jobs hang on the perf test suite during the destroy
command steps of the ust perf raw subtest.
Cause
=====
The system_test are running inside a kvm as root. It turns out that the
PMU (UNHALTED_REFERENCE_CYCLES) the test suite is trying to add is
unavailable on the qemu host.
ustctl_add_context return -1024 since it fails to add the context.
This leads us down the error path for the callstack leading to the
ustctl_add_context call.
1) `ust_app_channel_create` returns `ret` != 0;
2) `find_or_create_ust_app_channel` returns `ret != 0`;
3) `ust_app_synchronize` based on the `ret` value goes directly to the
end of the function to an error path without passing on the
`create_ust_app_metadata` function and clean-up structure related to
the app.
Note that being in per-uid mode, data and metadata
channel/streams/buffer allocation is done on the fly for the first app
during `ust_app_synchronize` and its callee. For the current problematic
scenario, only the data channels have been allocated on the consumer for
the uid at that point. The metadata for that uid is not yet created.
Now that we know more of what is going on during an ""add context""
let's take a look at the actual hang.
The client never complete the destroy command since the consumerd
indicates that the trace chunk for the session is not closed. The trace
chunk still exists despite the fact that a close chunk command has been
issued. This is the case since its refcount never reaches zero and thus
the release does not complete.
In a normal execution without the use of contexts, the release of the
trace hunk (refcount == 0) occurs during the final rotation on destroy.
Upon further comparison between a working execution and a non-working
execution, in a non-working execution the `cmd_rotate_session` does not
issue the rotation for the data channels since the loop detects that no
metadata is present. Which, as we discussed earlier, can happen if we
fail to add the context to the app channel.
[1]
```
cds_list_for_each_entry(reg, &usess->buffer_reg_uid_list, lnode) {
struct buffer_reg_channel *reg_chan;
struct consumer_socket *socket;
if (!reg->registry->reg.ust->metadata_key) {
/* Skip since no metadata is present */
continue;
}
....
/* Rotate the data channels. */
cds_lfht_for_each_entry(reg->registry->channels->ht, &iter.iter,
reg_chan, node.node) {
ret = consumer_rotate_channel(socket,
reg_chan->consumer_key,
usess->uid, usess->gid,
usess->consumer,
/* is_metadata_channel */ false);
if (ret < 0) {
cmd_ret = LTTNG_ERR_ROTATION_FAIL_CONSUMER;
goto error;
}
}
....
}
```
Solution
========
Move the metadata check after the data channel rotation since it is
possible to have data channels but no metadata channel, although it is a
corner case.
Note that per-pid mode and kernel are not affected by the current bug
since a complete teardown of all objects is done. This only affect
per-uid due to the "on the fly" allocation nature of it since we need to
share the channel/stream/buffers across apps.
Known drawbacks
=========
None.
References
==========
[1] https://github.com/lttng/lttng-tools/blob/
3d1384a4add389c38f8554130e8dec2e2d06009d/src/bin/lttng-sessiond/ust-app.c#L7057
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I611761ac4051c9a80158705a87e59541fb37660c
Jonathan Rajotte [Wed, 13 Jan 2021 15:09:06 +0000 (10:09 -0500)]
Fix: backward relayd: path contains a leading "ust" folder
Observed issue
==============
test_output_path_relayd[lttng-tools-2.13-lttng-tools-2.10-uid] from
the lttng-ivc fails on path validation for the resulting trace.
Here lttng-sessiond is 2.13 and lttng-relayd is 2.10
Traces are generated and the following hierarchy is found in the
lttng-relayd trace folder.
lttng_home
└── lttng-traces
* └── ust
└── joraj-alpa
├── auto-
20210113-165054
│ └── ust
│ └── uid
Note the extra "ust" (*) in the hierarchy.
The tests expects:
lttng_home
└── lttng-traces
└── joraj-alpa
├── auto-
20210113-165054
│ └── ust
│ └── uid
Cause
=====
Introduced by:
5da88b0f58d7f838068037ea449ddfb25d3e85ad [1]
relayd_add_stream now suffixes the domain_name to the pathname.
This is only necessary for cases where the corresponding
lttng-relayd version is greater than 2.10. In other cases, modification
of pathname is not necessary.
Solution
========
Perform domain suffixing only for relayd > 2.10.
Known drawbacks
=========
None.
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
5da88b0f58d7f838068037ea449ddfb25d3e85ad
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If43719c0cf8285d3242d846316d7277d4f5519f1
Christophe Bedard [Fri, 28 Aug 2020 15:00:30 +0000 (11:00 -0400)]
Fix: lttng_destroy_session_no_wait: return 0 on success
lttng_destroy_session_no_wait() is supposed to behave like
lttng_destroy_session():
> Return 0 on success else a negative LTTNg error code.
However, it returns LTTNG_OK on success. Make it return 0 instead.
Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I787307d4291babfb9ddcb12937efdbfbb5fd7009
Simon Marchi [Fri, 2 Apr 2021 19:52:27 +0000 (15:52 -0400)]
config: fix typo in error message
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iba45933cf70452a8c4e89cb471a11cd42bcd08b8
Mathieu Desnoyers [Mon, 25 May 2020 20:19:18 +0000 (16:19 -0400)]
Tests: array expressions without contant index are invalid
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iaff8d199295359aae9d82d645a30f6b3bf81212b
Mathieu Desnoyers [Mon, 25 May 2020 20:15:40 +0000 (16:15 -0400)]
Fix: validate that array expression contains constant
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic4f9b178d918b2c7c9dd27bc6a30c1798a171727
Jonathan Rajotte [Wed, 2 Dec 2020 22:01:35 +0000 (17:01 -0500)]
Fix: test: base-path tests are not run
Observed issue
==============
The base-path tests are never run during `make check`.
Cause
=====
Albeit
2a1668643ca94195d5c3889d0337e19165805a42 [1] introduces the tests,
the test file is not added to the test list.
Solution
========
Add the test file to the test list.
One of the substest is failing.
not ok 20 - Snapshot recorded
# Failed test 'Snapshot recorded'
# in .//../../../utils/utils.sh:lttng_snapshot_record() at line 1468.
ok 21 - Destroy session ust_app_snapshot_base_path
not ok 22 - Validate trace for event tp:tptest
This is caused by the removal of `trace_path=$2` in
c28fcefd993b7539716bb5cd9557a08a217ec463 [2]. The removal is most
probably a merge error or simply a mistake on my end. This commit revert
that change.
Also add load-stream-extra-path.lttng in EXTRA_DIST.
Known drawbacks
=========
None
References
==========
[1] https://github.com/lttng/lttng-tools/commit/
2a1668643ca94195d5c3889d0337e19165805a42
[2] https://github.com/lttng/lttng-tools/commit/
2a1668643ca94195d5c3889d0337e19165805a42
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I255908165d59e02e0e11b123d3538a74f53fb584
Simon Marchi [Fri, 9 Apr 2021 15:19:17 +0000 (11:19 -0400)]
Fix: filter: memory leak in filter_parser_ctx
When running
$ lttng add-trigger --condition on-event -u ust_tests_demo2:loop --capture intfield --action notify
I get the leaks pasted below. It seems like filter_parser_ctx_free
doesn't free everything in filter_parser_ctx. Add what's missing.
Re-order the frees so that they are in the same order as the members of
the struct, just because it's easier to follow and make sure we didn't
forget anything.
=================================================================
==
1073803==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 128 byte(s) in 1 object(s) allocated from:
#0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
#2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
#3 0x5555556a3d61 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:667
#4 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
#5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Direct leak of 112 byte(s) in 1 object(s) allocated from:
#0 0x7ffff767783a in __interceptor_realloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:164
#1 0x5555556833be in bytecode_reserve /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:59
#2 0x55555568360f in bytecode_push /home/simark/src/lttng-tools/src/common/bytecode/bytecode.c:79
#3 0x5555556a1b94 in visit_node_load_expression_legacy /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:198
#4 0x5555556a1d18 in visit_node_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:231
#5 0x5555556a2540 in visit_node_load /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:399
#6 0x5555556a3a8b in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:622
#7 0x5555556a12fa in visit_node_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:53
#8 0x5555556a3a76 in recursive_visit_gen_bytecode /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:620
#9 0x5555556a3c55 in filter_visitor_bytecode_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-bytecode.c:661
#10 0x55555569c9b1 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:394
#11 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#12 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#13 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#14 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#15 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#16 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#17 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#18 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Direct leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a3dd2 in make_op_root /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:35
#2 0x5555556a73a5 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:874
#3 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#4 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#5 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#6 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#7 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#8 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#9 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#10 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#11 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#12 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 40 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a4f1d in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:280
#2 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#3 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#4 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#5 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#6 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#7 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#8 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#9 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#10 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#11 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#12 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#13 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#14 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a484d in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:201
#2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
#3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a4e64 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:262
#2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
#3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 24 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a4bbc in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:233
#2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
#3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 9 byte(s) in 1 object(s) allocated from:
#0 0x7ffff761fa69 in __interceptor_strdup /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:452
#1 0x5555556a4c41 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:238
#2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
#3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
Indirect leak of 8 byte(s) in 1 object(s) allocated from:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x5555556a4829 in create_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:196
#2 0x5555556a5040 in make_op_load_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:287
#3 0x5555556a696f in make_expression /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:637
#4 0x5555556a73df in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:882
#5 0x5555556a7382 in generate_ir_recursive /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:870
#6 0x5555556a74d6 in filter_visitor_ir_generate /home/simark/src/lttng-tools/src/common/filter/filter-visitor-generate-ir.c:903
#7 0x55555569c859 in filter_parser_ctx_create_from_filter_expression /home/simark/src/lttng-tools/src/common/filter/filter-parser.y:353
#8 0x55555560542e in parse_event_rule /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:704
#9 0x555555607429 in handle_condition_event /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1088
#10 0x555555608760 in parse_condition /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1326
#11 0x55555560bca0 in cmd_add_trigger /home/simark/src/lttng-tools/src/bin/lttng/commands/add_trigger.c:1925
#12 0x555555616b55 in handle_command /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:237
#13 0x555555617516 in parse_args /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:421
#14 0x555555617812 in main /home/simark/src/lttng-tools/src/bin/lttng/lttng.c:470
#15 0x7ffff700bb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
SUMMARY: AddressSanitizer: 409 byte(s) leaked in 9 allocation(s).
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic338ea1689d3f002bf9cade6d4f23e62d935968b
Simon Marchi [Wed, 31 Mar 2021 19:06:34 +0000 (15:06 -0400)]
Fix: sessiond: fix -Wshadow error in save.c
I think this actually fixes a bug. Because of the second ret variable,
the return value from init_ust_event_from_agent_event or save_ust_event
would not be forwarded correctly, in case of error.
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ieb61d9b267ab90e18cc349d86754d9b89b5703f1
Simon Marchi [Mon, 12 Apr 2021 17:23:39 +0000 (13:23 -0400)]
Fix: utils: avoid strncpy overlap in utils_partial_realpath
When running the test_utils_expand_path test with ASan enabled, I get:
➜ lttng-tools ./tests/unit/test_utils_expand_path
1..29
INPUT: /a/b/c/d/e
=================================================================
==
1485873==ERROR: AddressSanitizer: strncpy-param-overlap: memory ranges [0x621000021d00,0x621000021d0b) and [0x621000021d00, 0x621000021d0b) overlap
#0 0x7ffff761fd97 in __interceptor_strncpy /build/gcc/src/gcc/libsanitizer/asan/asan_interceptors.cpp:481
#1 0x555555573834 in utils_partial_realpath /home/simark/src/lttng-tools/src/common/utils.c:195
#2 0x55555557410b in _utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:374
#3 0x555555574340 in utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:420
#4 0x555555570b28 in test_utils_expand_path /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:274
#5 0x55555557119e in main /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:345
#6 0x7ffff725fb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
#7 0x55555556fa3d in _start (/home/simark/build/lttng-tools/tests/unit/test_utils_expand_path+0x1ba3d)
0x621000021d00 is located 0 bytes inside of 4096-byte region [0x621000021d00,0x621000022d00)
allocated by thread T0 here:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55555557269d in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
#2 0x555555573d34 in _utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:335
#3 0x555555574340 in utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:420
#4 0x555555570b28 in test_utils_expand_path /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:274
#5 0x55555557119e in main /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:345
#6 0x7ffff725fb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
0x621000021d00 is located 0 bytes inside of 4096-byte region [0x621000021d00,0x621000022d00)
allocated by thread T0 here:
#0 0x7ffff7677639 in __interceptor_calloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:154
#1 0x55555557269d in zmalloc /home/simark/src/lttng-tools/src/common/macros.h:45
#2 0x555555573d34 in _utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:335
#3 0x555555574340 in utils_expand_path /home/simark/src/lttng-tools/src/common/utils.c:420
#4 0x555555570b28 in test_utils_expand_path /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:274
#5 0x55555557119e in main /home/simark/src/lttng-tools/tests/unit/test_utils_expand_path.c:345
#6 0x7ffff725fb24 in __libc_start_main (/usr/lib/libc.so.6+0x27b24)
The sole caller of utils_partial_realpath, _utils_expand_path, passes
the same buffer (resolved_path) for the input and output. This causes
utils_partial_realpath to call strncpy with overlapping strings.
Fix it by making utils_partial_realpath allocate new memory for the
returned string itself. This causes one more allocation than the
current code, because we don't re-use the existing buffer, but this
should be fine since this isn't exactly performance-critical code.
I think the code is easier to follow as a result.
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iab983e2f44fa57563b11ac6e9c03a41150669d9e
Simon Marchi [Mon, 12 Apr 2021 17:18:15 +0000 (13:18 -0400)]
Clean-up: utils: make utils_partial_realpath static
It is only used within its file.
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1467f5eeb11c66ecbe358290a172c37db34d881a
Anders Wallin via lttng-dev [Wed, 31 Mar 2021 18:54:39 +0000 (20:54 +0200)]
Fix: added missing test_channel and removed double defined test_syscall
Signed-off-by: Anders Wallin <wallinux@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia65951dffc641bffa41793a5c52474b8db90eb4d
Jérémie Galarneau [Wed, 7 Apr 2021 15:55:26 +0000 (11:55 -0400)]
Fix: sessiond: kernel: invalid error code check
The `EEXIST` error code does not make sense for a kernctl_disable()
operation. In fact, not existing would be an error.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8f66c6d6769a4a10e8ff446736c6a7d55a9c32be
Jérémie Galarneau [Wed, 7 Apr 2021 15:49:39 +0000 (11:49 -0400)]
Tests: fix: unchecked sscanf return value
1407934 Unchecked return value
If the function returns an error value, the error value may be mistaken for a normal value.
In parse_arguments: Value returned from a function is not checked for errors before being used (CWE-252)
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I26e4d324c97833acedab4ebd030d412848dcbfe4
Mathieu Desnoyers [Thu, 25 Mar 2021 19:18:13 +0000 (15:18 -0400)]
compiler warning cleanup: is_signed_type: compare -1 to 1
Comparing -1 to 0 triggers compiler warnings (gcc -Wtype-limits and
-Wbool-compare) and Coverity warning "Macro compares unsigned to 0".
Comparing -1 to 1 instead takes care of silencing those warnings while
keeping the same behavior.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id4aebce6eed214f0ccba2c0289fb3b034808cb64
Jérémie Galarneau [Tue, 23 Feb 2021 03:08:18 +0000 (22:08 -0500)]
Update version to v2.12.3
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Mon, 8 Feb 2021 19:40:33 +0000 (14:40 -0500)]
Fix: ust-consumer: metadata thread not woken-up after version change
Issue observed
==============
The metadata regeneration test fails, very rarely, in the "streaming"
case on the CI. The interesting part of the test boils down to:
1) start session
2) launch an app tracing one event
3) stop session
4) delete metadata file
5) start session
6) regenerate metadata
7) stop session
8) destroy session
9) read trace: babeltrace fails on an invalid metadata file.
The problem is hard to capture, but modifying the test allows us to see
that there appears to be a short window between steps 7 and 8 where the
metadata file is empty or doesn't exist.
Cause
=====
When metadata is regenerated, its version is bumped and the metadata
cache is "reset". In some cases, such as in this test, the new metadata
will have exactly the same size as it had prior as nothing happened to
change that (e.g. no new apps/probes were registered).
When this occurs, the metadata thread is not woken-up by
consumer_metadata_cache_write() as it sees that max_offset of the
metadata cache didn't change; the data was replaced but it has the same
size.
The metadata consumption thread also checks for version bumps and
resets the amount of consumed metadata. Hence, if the "cache write"
operation woke up the metadata consumption thread, the stream's
"ust metadata pushed" state would be reset and the new contents would
be consumed.
Solution
========
The metadata stream's "ust metadata pushed" position is directly reset
to zero when a metadata version change is detected by the metadata
cache. The metadata poll thread is also woken up to resume the
consumption of the newly-available data.
It is unclear why the change to the consumption position was only done
on the metadata consumption thread's code path and not directly by the
session daemon command handling.
Note that a session rotation will also result in a reset of the pushed
position and a wake-up of the metadata poll thread from the command
handling thread. I am speculating that this couldn't be done due to the
design of the locking at the time of the original
implementation (I haven't checked).
In implementing this change, the metadata reception code path is
untangled a bit to separate the logic that affects the metadata stream
from the logic that manages the metadata cache. I suspect the original
error stems from a mix-up/confusion between both concerns.
When a metadata version change happens, the metadata cache resets its
'max_offset' (in other words, it's current size) and notifies the
caller. The caller then resets the "ust pushed metadata" position to
zero and wakes-up the metadata thread to consume the new contents of the
metadata cache.
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I142ef957140d497ac7fc4294ca65a55c12518598
Jérémie Galarneau [Mon, 8 Feb 2021 18:00:56 +0000 (13:00 -0500)]
Clean-up: ust-consumer: simplify metadata cache unlock on error path
The metadata cache lock can be released directly after the call to
consumer_metadata_cache_write() since nothing depends on the cache
being locked after.
This simplifies the unlocking of the metadata cache by removing an
extra "unlock" that is specific to an error path.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib0151d9640309005dfc83e95efb538e3c4f0df4a
Jérémie Galarneau [Wed, 10 Feb 2021 19:25:57 +0000 (14:25 -0500)]
Fix: sessiond: timer: unitiliazed sigevent fields
Coverity reports:
1445773 Uninitialized scalar variable
The variable will contain an arbitrary value left from earlier computations.
In timer_start: Use of an uninitialized variable (CWE-457)
CID
1445773 (#1 of 1): Uninitialized scalar variable (UNINIT)2.
uninit_use_in_call: Using uninitialized value sev. Field sev._sigev_un
is uninitialized when calling timer_create
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I306ea7e58192223408fac9d354dbbfebe6c0cc70
Jérémie Galarneau [Wed, 10 Feb 2021 19:20:34 +0000 (14:20 -0500)]
Fix: sessiond: ust-registry: dereference of NULL pointer on allocation failure
Coverity reports:
1445771 Dereference after null check
Either the check against null is unnecessary, or there may be a null pointer dereference.
In ust_registry_channel_add: Pointer is checked against null but then dereferenced anyway (CWE-476)
Reported-by: Coverity Scan
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3976e0d7b059b47ca233bc9a20585eff2e1290fc
Jérémie Galarneau [Mon, 8 Feb 2021 18:03:20 +0000 (13:03 -0500)]
Fix: ust-consumer: metadata cache lock not taken when sampling max offset
Found by inspecting the code while searching for an unrelated problem.
The 'max_offset' field is probably only accessed by the sessiond
poll thread, but this isn't a documented (nor reasonably maintainable)
guarantee.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1d2b29e9a483aa8e49c2db589284b853cce65491
Mathieu Desnoyers [Wed, 3 Feb 2021 15:21:42 +0000 (10:21 -0500)]
Fix: use MT-safe strtok_r in multithreaded context
append_list_to_probes uses the non-multithread-safe "strtok" while in
multithreaded context.
It is called by modprobe_lttng_data(), which is called from
init_kernel_tracer().
init_kernel_tracer is used from main() after other threads are created,
and also from process_client_msg() when a user attempts to interact with
the kernel domain if the kernel tracer has not yet been successfully
initialized.
Fixed by using the MT-safe strtok_r() instead.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic3f81551b8508451e0f9733c5de56a4d4407b813
Jérémie Galarneau [Tue, 12 Jan 2021 22:41:54 +0000 (17:41 -0500)]
Fix: liblttng-ctl: unreported truncations when copying strings
gcc 10.2 reports a large number of string truncation warning in
liblttng-ctl. Replace the uses of lttng_ctl_copy_string() util by
lttng_strncpy() (handling the null source case when applicable) and
report the truncations when they occur.
Example gcc warning:
lttng-ctl.c:86:3: warning: ‘strncpy’ output may be truncated copying 254 bytes from a string of length 254 [-Wstringop-truncation]
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Icca5f4c2490c6796b451999d7694db8597bae719
Jérémie Galarneau [Mon, 11 Jan 2021 23:11:46 +0000 (18:11 -0500)]
Fix: configure: support Autoconf 2.70
The newly-released autoconf 2.70 introduces a number of breaking
changes [1] and is being rolled-out by some distros.
Amongst those changes, the AC_PROG_CC_STDC macro is marked as obsolete
and was merged into AC_PROG_CC, which we already use. On 2.70, this
results in a warning which we handle as an error.
A version check is added to invoke the AC_PROG_CC_STDC macro only when
running a pre-2.70 version of autoconf, fixing the issue.
A single use of the AC_HELP_STRING macro is replaced by AS_HELP_STRING
as the former was marked as obsolete.
The AC_PROG_LEX now takes an argument, and the argument-less version is
marked as obsolete. The macro is invoked with the `noyywrap` option, as
recommended in the documentation.
Also, the AX_PTHREAD macro makes use of the $as_echo built-in shell
variable which no longer exists in 2.70. A patch was submitted to the
GNU Autoconf archive in March, but there have been no signs of life
given since then [2].
As such, our local copy is updated to the latest version and the patch
(which looks fairly straight-forward / safe) is applied. This should
minimize changes once we go back to an "official" version of the macro.
[1] https://lwn.net/Articles/839395/
[2] https://savannah.gnu.org/patch/?9906
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Reviewed-by: Michael Jeanson <michael.jeanson@efficios.com>
Change-Id: Ie949de73442770f60cbef55300265205527731c6
Jérémie Galarneau [Tue, 1 Dec 2020 21:51:23 +0000 (16:51 -0500)]
Fix: sessiond: metadata not created on app unregistration during start
Issue observed
==============
A test for an incoming feature (trigger actions on on-event conditions)
hangs. While this problem was discovered using this test, it exercises a
scenario that is problematic as of this fix.
The destruction of a session can hang if a single application being
traced unregisters (dies) during the 'start' of a session.
Cause
=====
When a per-uid session is started, its buffers (channels and streams)
are allocated only if an instrumented application is registered to the
session daemon at that moment.
For historical reasons, the 'data' and 'metadata' buffers are allocated
in separate code paths. The 'data' buffers are allocated in
ust_app_synchronize() and the 'metadata' buffers are allocated in
ust_app_start_trace(). Both functions perform their own look-up for an
application session and will gracefully fail if an application session
can't be found; it typically means the application has exited.
This leaves a race window open where ust_app_synchronize() can succeed
in looking-up the application session, and ust_app_start_trace() can
fail following the death of the application.
When this occurs, the session is left with 'data' buffers allocated and
unallocated ''metadata' buffers. This is an unexpected state and results
in the rotation code attempting to rotate a partially initialized
metadata stream.
The rotation of this partially initialized metadata stream never
completes which, in turn, never allows the session to complete its
implicit rotation on destruction.
This race window is fairly narrow, but can be reproduced by sleep()-ing
at the beginning of ust_app_start_trace() and killing an application
that is being traced during the sleep period.
Solution
========
The creation of the metadata channel is performed as part of
ust_app_synchronize() if the application look-up succeeds. When it
fails, both 'data' and 'metadata' streams will fail to be created
resulting in an expected and valid state.
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ice0ec16734a39a6bb885986d3ad70d20cd2618e0
Francis Deslauriers [Tue, 15 Sep 2020 16:10:18 +0000 (12:10 -0400)]
Cleanup: use `modprobe --remove` rather than `rmmod`
Background
==========
According to the rmmod(8) man page:
rmmod is a trivial program to remove a module (when module unloading
support is provided) from the kernel. Most users will want to use
modprobe(8) with the -r option instead.
`rmmod` simply unloads the provided module and decrements the refcount
of the modules it depended on but doesn't unload those dependencies if
their refcount is zero.
Issue
=====
With the following scenario we can end up if modules with a zero
refcount still loaded in the kernel:
modprobe lttng-test
lttng-sessiond
... (test case) ...
ctrl+c sessiond
rmmod lttng-test
When we teardown the lttng-sessiond, some modules are kept in the kernel
because the `lttng-test` module depends on them. So unloading
`lttng-test` using `rmmod` keeps those dependencies in the kernel.
Solution
========
Use `modprobe --remove` to unload modules and their now unused
dependencies.
From the modprobe(8) man page:
-r, --remove
This option causes modprobe to remove rather than insert a module.
If the modules it depends on are also unused, modprobe will try to
remove them too. Unlike insertion, more than one module can be
specified on the command line
Note
====
This commit also replaces existing uses of `modprobe -r` to `modprobe
--remove` for consistency.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7be83a645097e1eddd478cfbb717906b971f04ea
Jonathan Rajotte [Thu, 28 May 2020 01:29:05 +0000 (21:29 -0400)]
Tests: Fix: 99% fill ratio for high buffer usage is too high for larger events
If the event being registered is bigger than 1% of a subbuffer, the 99%
ratio cannot be achieved since the "last event" necessary to go over 99%
will always be dropped by the tracer.
e.g:
DBG1 - 19:31:07.
665963875 [Notification]: [notification-thread] High buffer usage condition being evaluated: threshold = 16220, highest usage = 16196 (in evaluate_buffer_usage_condition() at notification-thread-events.c:3733)
We use a ratio of 90% to keep a little headroom.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I06180735e0b5e88209b888e51cc83b4ac7d98193
Jérémie Galarneau [Fri, 16 Oct 2020 18:43:39 +0000 (14:43 -0400)]
Fix: common: poll: compat_poll_wait never finishes
compat_poll_wait hangs when poll returns an array of file
descriptors of the form:
[ Inactive Active ]
The logic to find the first idle pollfd entry is bogus and actually
skips the first idle entry. This causes the follow-up loop to never
conclude.
The pollfd array defragmentation logic is re-written in a simpler
style to handle those cases appropriately.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8669a870df1ec1160f05e35e83671917bb80d6f9
Jérémie Galarneau [Fri, 11 Sep 2020 03:57:54 +0000 (23:57 -0400)]
Build fix: implicit declaration of function 'PERROR' on Solaris
Solaris 10/11 CI builds complain that PERROR is not declared
in memstream.h. This addresses the warning on both platforms.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I09ce7cea216e075d0ef7256f738ea7400d193b6b
Jérémie Galarneau [Tue, 15 Sep 2020 20:22:14 +0000 (16:22 -0400)]
Fix: PERROR spam when `tracing` group does not exist
The session daemon prints a PERROR on launch when the tracing group does
not exist. This should not occur when the group simply does not exist as
this is not an error. In that case (ESRCH), a DBG statement is
sufficient.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3ade29071a8f4e9fe2eb56bf05ff4150b70fd463
Jérémie Galarneau [Thu, 20 Aug 2020 19:38:18 +0000 (15:38 -0400)]
Fix: memcpy used on potentially overlapping regions
Caught by reviewing unrelated code, these two uses of memcpy
can operate on overlapping buffers. I checked all other uses
of "raw" memcpy and those appear safe.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I72b1204bc52a92015042adb6a67b022d140f5b4e
Jérémie Galarneau [Mon, 3 Aug 2020 20:50:36 +0000 (16:50 -0400)]
Update version to v2.12.2
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 30 Jul 2020 03:13:16 +0000 (23:13 -0400)]
Tests: rework tracefile_count test to meet the tracefile count limit
The original test simply fires a test application and checks that file
count limits are not exceeded.
The test is reworked to produce events until the total stream size
stabilizes. Then the number of files is checked against the expected
limit.
Some shellcheck warnings are fixed as part of this change.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id9ca3e16c05898c4009b4fb8fcf7882fdcb9dcf1
Jérémie Galarneau [Fri, 31 Jul 2020 15:20:40 +0000 (11:20 -0400)]
Fix: extraneous empty/inactive flush on rotation out of a trace chunk
Observed issue
==============
A test (tests/regression/tools/tracefile-limits/test_tracefile_count)
occasionally fails on ppc64. The trace validation steps in the fails in
the case where the trace file count limit is set to 1.
Examining the resulting trace shows that the last packet of data
produced by the test application appears to be missing.
The test case enables a channel in "overwrite" mode. Normally, this
would guarantee that the last data produced will always be available in
the resulting trace.
Cause
=====
An empty/inactive flush is performed when rotating "out" of a non-null
trace chunk to ensure that the trace chunk contained at least one
packet.
Looking at the test's resulting trace and by following the consumerd
logs, we see that the test application runs on one CPU for most of its
lifetime. The stream file is repeatedly replaced to make room for the
latest data.
Eventually, the application is migrated to another CPU. A number of
packets are written to this new stream. The session is then stopped
which causes an active flush to occur to close the current packet of
all streams (see `ust_app_stop_trace_all`).
Then, when the session is destroyed, an empty/inactive flush is
performed to ensure that at least one packet was produced in the current
trace chunk [1].
At the moment of writing this empty packet, the consumer daemon sees
that there is not enough space left in the stream file to honour the
trace file size restriction. It thus overwrites the file resulting in
the loss of the last events to replace them with the empty "end of
chunk" packet that occupies a single page.
While the problem is not specific to PowerPC 64, it has a lot more
chances to occur there as pages are typically configured to be of 64kb
length. Due to current implementation limitations, empty packets have
a size of one page.
In other words, 4kb pages typically fit in the space left in the file,
causing the problem to not be easily reproducible on x64.
Note that while the file size limit is specified as "3 * PAGE_SIZE" in
the test, it is rounded-up to 512kb to accomodate at least one
sub-buffer.
Solution
========
[1] This empty/inactive flush is no longer necessary since
f96af312b
as an "open packet" (which performs an empty/inactive flush) is
performed when a stream enters a non-null trace chunk. There is no
concern that a trace chunk will be left empty unless this initial flush
fails (see patch comments and work-around).
The empty flush that was performed for data streams is converted
into an active flush under most circumstances; the packet is simply
closed.
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5602b7ab8318374f75060489cf9c27af4e058805
Jérémie Galarneau [Thu, 30 Jul 2020 16:14:21 +0000 (12:14 -0400)]
Fix: relayd: double unlock on viewer stream creation error
viewer_stream_create must be called with the relay stream's
lock held since
9edaf114d. A call to pthread_mutex_unlock
was forgotten in the error path of viewer_stream_create resulting
in a double-unlock in some error scenarios.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I613f67440669af52f625b7bdc4e9ae6ce75956ea
Jérémie Galarneau [Tue, 28 Jul 2020 15:29:18 +0000 (11:29 -0400)]
Fix: relayd: live connection fails to open file during clear
Issue observed
==============
A `session clear` test occasionaly fails on the CI (very rarely, and
more often on PowerPC executors for some reason) with babeltrace
reporting that the live connection was closed by the remote side:
PASS: tools/clear/test_ust 276 - Waiting for live viewers on url: net://localhost
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_recv@viewer-connection.c:198 [lttng-live] Remote side has closed connection
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE/VIEWER lttng_live_session_get_new_streams@viewer-connection.c:1701 [lttng-live] Error receiving get new streams reply
07-21 16:17:07.058 23855 23855 E PLUGIN/SRC.CTF.LTTNG-LIVE lttng_live_msg_iter_next@lttng-live.c:1665 [lttng-live] Error preparing the next batch of messages: live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
07-21 16:17:07.058 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE, iter-upstream-comp-class-name="lttng-live", iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_upstream_msg_iter_next@muxer.c:454 [muxer] Upstream iterator's next method returned an error: status=ERROR
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER validate_muxer_upstream_msg_iters@muxer.c:991 [muxer] Cannot validate muxer's upstream message iterator wrapper: muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210
07-21 16:17:07.059 23855 23855 E PLUGIN/FLT.UTILS.MUXER muxer_msg_iter_next@muxer.c:1415 [muxer] Cannot get next message: comp-addr=0x1014cca8, muxer-comp-addr=0x1015afd8, muxer-msg-iter-addr=0x1014d668, msg-iter-addr=0x1014d598, status=ERROR
07-21 16:17:07.059 23855 23855 W LIB/MSG-ITER bt_message_iterator_next@iterator.c:864 Component input port message iterator's "next" method failed: iter-addr=0x1014d598, iter-upstream-comp-name="muxer", iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER, iter-upstream-comp-class-name="muxer", iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu", iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
07-21 16:17:07.059 23855 23855 W LIB/GRAPH consume_graph_sink@graph.c:473 Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128, comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK, comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages (`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8, comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so", comp-input-port-count=1, comp-output-port-count=0
07-21 16:17:07.059 23855 23855 E CLI cmd_run@babeltrace2.c:2548 Graph failed to complete successfully
ERROR: [Babeltrace CLI] (babeltrace2.c:2548)
Graph failed to complete successfully
CAUSED BY [libbabeltrace2] (graph.c:473)
Component's "consume" method failed: status=ERROR, comp-addr=0x1014d128,
comp-name="pretty", comp-log-level=WARNING, comp-class-type=SINK,
comp-class-name="pretty", comp-class-partial-descr="Pretty-print messages
(`text` fo", comp-class-is-frozen=1, comp-class-so-handle-addr=0x10159dd8,
comp-class-so-handle-path="/home/jenkins/workspace/lttng-tools_stable-2.12_portbuild/arch/powerpc/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.12/test_type/base/deps/build/lib/babeltrace2/plugins/babeltrace-plugin-text.so",
comp-input-port-count=1, comp-output-port-count=0
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x1014d598, iter-upstream-comp-name="muxer",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=FILTER,
iter-upstream-comp-class-name="muxer",
iter-upstream-comp-class-partial-descr="Sort messages from multiple inpu",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:991)
Cannot validate muxer's upstream message iterator wrapper:
muxer-msg-iter-addr=0x1014d668, muxer-upstream-msg-iter-wrap-addr=0x1014e210
CAUSED BY [muxer: 'filter.utils.muxer'] (muxer.c:454)
Upstream iterator's next method returned an error: status=ERROR
CAUSED BY [libbabeltrace2] (iterator.c:864)
Component input port message iterator's "next" method failed:
iter-addr=0x1014d6a8, iter-upstream-comp-name="lttng-live",
iter-upstream-comp-log-level=WARNING, iter-upstream-comp-class-type=SOURCE,
iter-upstream-comp-class-name="lttng-live",
iter-upstream-comp-class-partial-descr="Connect to an LTTng relay daemon",
iter-upstream-port-type=OUTPUT, iter-upstream-port-name="out", status=ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (lttng-live.c:1665)
Error preparing the next batch of messages:
live-iter-status=LTTNG_LIVE_ITERATOR_STATUS_ERROR
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:1701)
Error receiving get new streams reply
CAUSED BY [lttng-live: 'source.ctf.lttng-live'] (viewer-connection.c:198)
Remote side has closed connection
ok 277 - Clear session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 277 - Clear session J7WXjh7fmMleTCfE
ok 278 - Clear session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 278 - Clear session J7WXjh7fmMleTCfE
ok 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 279 - Stop lttng tracing for session J7WXjh7fmMleTCfE
ok 280 - Destroy session J7WXjh7fmMleTCfE
PASS: tools/clear/test_ust 280 - Destroy session J7WXjh7fmMleTCfE
# Wait for viewer to exit
not ok 281 - Babeltrace succeeds
Cause
=====
Looking at the relay daemon logs, it appears that the live client
requests an enumeration of the available streams while a rotation is
ongoing (clear).
Ultimately, this results in the relay daemon attempting to open a
non-existing file:
PERROR - 16:33:59.
242388809 [734380/734387]: Failed to open fs handle to ust/uid/1000/64-bit/chan_0, open() returned: No such file or directory (in fd_tracker_open_fs_handle() at fd-tracker.c:550)
The logs indicate that this file existed at some point. However, it
was unlinked and its newest instance was created in a trace chunk
named "20200727T163359-0400-1".
This chunk name is a temporary name used until the original trace
chunk can be unlinked (cleared) and the newest can be moved in its
place.
The file is being opened as part of the creation of a viewer stream
when make_viewer_stream() fails to find it. This implies that, somehow,
an outdated trace chunk is being used to open the viewer stream's file.
The reason why is that make_viewer_stream is called with the
viewer session's current trace chunk. During a rotation, the use of the
viewer session's current trace chunk is touchy due to the way the
switch-over to a new chunk is handled.
How viewer session/stream trace chunks are rotated
--------------------------------------------------
The viewer polls the relay daemon for new data/metadata to consume using
the `GET_NEXT_INDEX` and `GET_METADATA` commands. Both commands will
indicate to the viewer that it should try again later if a rotation is
ongoing on the "side" of the relay session.
When a rotation is not ongoing, the relay compares the `id` of the
target viewer stream's trace chunk with the relay session's current
trace chunk. If those `id`s don't match, the viewer session's current
trace chunk is then updated to a copy of the relay session's trace
chunk. The viewer stream's files are then closed and re-opened in the
context of the viewer session's now-updated trace chunk.
While the live protocol allows `GET_NEXT_INDEX` and `GET_METADATA` to
communicate to the viewer that it should "retry" later, there is no such
provisions made for the paths that lead to the creation of the viewer
streams. This means viewer streams must be created even if a rotation is
ongoing.
Solution
========
If a rotation is ongoing at the moment of the creation of the viewer
streams, we wish to use a copy of the trace chunk being used by the
relay stream. This way, we can be sure that the streams' files exist.
It is okay for viewer streams to hold references to different copies of
the trace chunks since no user-visible actions are performed when the
reference to those chunks is released. This is different from relay
streams where the detection of the completion of a rotation is done when
the last relay stream releases its reference to a specific trace chunk
instance.
Known drawbacks
===============
None beyond a slight increase (temporary until the next rotation) in the
number of FDs used when a client connects during a session rotation.
Note
====
Since make_viewer_streams now acts on relay and viewer counterparts of
the session and stream objects, the various variables are prefixed with
`relay` and `viewer` to make the code easier to understand.
The locking period of the relay stream is extended to most of the
iteration in make_viewer_stream() rather than only in
viewer_stream_create(). As a result, callers of viewer_stream_create()
must hold the relay stream's lock.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3b9c52a57835134712eadba0f68f6a21e922356b
Jérémie Galarneau [Mon, 27 Jul 2020 22:14:30 +0000 (18:14 -0400)]
Fix: relayd: wrong stream type used in DBG statement
get_next_index() is never invoked for metadata streams; the
debug statement was erroneously copy-pasted from the get_metadata()
function.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia57dac7f102eddc78e24781905b390b4fa54bd80
Jérémie Galarneau [Tue, 23 Jun 2020 03:35:02 +0000 (23:35 -0400)]
Fix: lttng: leak of userspace probe path on listing
realpath() returns a malloc'ed string when `resolved_path` is NULL.
`binary_path` can be safely free'd after it is printed.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic35862f3275f58274ae2572e1d6be736f8a2107e
Jérémie Galarneau [Mon, 15 Jun 2020 22:44:23 +0000 (18:44 -0400)]
Fix: uprobe: missing error code on allocation failure
lttng_userspace_probe_location_function_create_from_payload
does not report an allocation failure when strdup fails.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibed9906cabeca81081647d220a7aef1f54d251d4
Jérémie Galarneau [Mon, 15 Jun 2020 22:43:02 +0000 (18:43 -0400)]
Fix: sessiond: don't negate error code on list error
Listing errors are already negative. Negating in the error path
causes error codes to be interpreted as a number of events and
cause a communication error further on.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6df6fa14e4e443798dfdc1aa7a8fd7cfe380835b
Jonathan Rajotte [Thu, 9 Jul 2020 20:22:19 +0000 (16:22 -0400)]
Fix: send/received actual size is overwritten by 'expected' size
Observed issue
==============
None.
Cause
=====
None.
Solution
========
Return the actual send/received size. Caller code correctly handles
partial send/receive.
Known drawbacks
=========
None
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie52c39d0ab778108e3fde1b2375c0535caec97b1
Jonathan Rajotte [Thu, 9 Jul 2020 18:57:57 +0000 (14:57 -0400)]
Fix: partial recv lead to client disconnect
On EWOULDBLOCK and EAGAIN, the ret value from
`lttcomm_send_unix_sock_non_block` and
`lttcomm_recv_unix_sock_non_block` would be equal to -1.
Solution
=====
For such cases, set ret equal to zero since it is effectively what was
received/sent. Caller should treat ret >= 0 as `success` anyway.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I47677f26850bf544f672d270972de7cc2c6a2817
Michael Jeanson [Mon, 13 Jul 2020 19:41:01 +0000 (15:41 -0400)]
tests: return the proper TAP exit code
The C TAP library provides the 'exit_status()' function that will return
the proper exit code according to the number of tests that succeeded or
failed.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0de2349609eb34b1c5e58f09012c1db0126923c0
Francis Deslauriers [Thu, 9 Jul 2020 20:44:32 +0000 (16:44 -0400)]
Tests: live/test_{lttng_,}kernel: use lttng_test_filter_event instead of sched_switch
Background
==========
These tests currently rely on system load (the `sched_switch` event) to
generate trace data.
Issue
=====
This is can be problematic for the `test_kernel`
test case because it has a fixed sized buffer to store the trace:
#define mmap_size 524288
This caused this test failure to randomly happen on my machine:
ok 7 - Get one index per stream
# mmap_size not big enough
not ok 8 - Get one data packet for stream 0, offset 0, len 4096
# Failed test (live_test.c:main() at line 709)
[error] Error detaching viewer session
not ok 9 - Detach viewer session
# Failed test (live_test.c:main() at line 715)
Solution
========
Using the `lttng_test_filter_event` event to control the size and
number of the event expected in the trace rather then depending on how
many Electon apps are currently fighting for my CPUs.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I15d500d5becf9c5e526ae11ff0b2a2f4f6b753ac
Francis Deslauriers [Thu, 9 Jul 2020 21:38:17 +0000 (17:38 -0400)]
Fix: consumer: Move sanity check within `consumer_subbuffer` functions
The sanity check on the number bytes written by the `consumer_subbuffer`
callback was not correct channel configured in the splice output type in
a live session.
To simplify this, move checks in the callback themselves so they can be
specialized.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4e47a305860684c461ba7ffffd5e3bb3a21990b0
Jérémie Galarneau [Tue, 21 Jul 2020 16:51:48 +0000 (12:51 -0400)]
Fix: sessiond: wrong variable checked for error code
Coverity reports:
1430513 Logically dead code
The indicated dead code may have performed some action; that action
will never occur.
In ust_app_open_packets: Code can never be reached because of a
logical contradiction (CWE-561)
The `open_ret` variable should be checked instead of `ret` after calling
consumer_open_channel_packets.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I49ad2d2b747e29414fee7f0727df1feb343695bb
Jérémie Galarneau [Tue, 21 Jul 2020 16:44:59 +0000 (12:44 -0400)]
Fix: consumerd: double unlock on rotate channel error path
Coverity Scan reports:
1430541 Double unlock
May result in undefined behavior.
In lttng_consumer_rotate_channel: Attempt to release a non-recursive
lock that is not held (CWE-765)
The error path should jump to end_unlock_channel as the stream
lock is only held for the duration of the call to
consumer_stream_open_packet.
This bug was introduced by the previous commit.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2de684eece963973d0861e5924263575363484f6
Jérémie Galarneau [Mon, 20 Jul 2020 19:29:41 +0000 (15:29 -0400)]
Fix: consumerd: packet sent before channel rotation
Issue observed
==============
A clear test occasionally fails with the following output:
# Test ust streaming rotate-clear
# Parameters: tracing_active=0, clear_twice=1, rotate_before=0, rotate_after=0, buffer_type=uid
ok 605 - Create session S0BXcJKWrmAwNSzm with uri:net://localhost and opts:
PASS: tools/clear/test_ust 605 - Create session S0BXcJKWrmAwNSzm with uri:net://localhost and opts:
ok 606 - Enable channel chan for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 606 - Enable channel chan for session S0BXcJKWrmAwNSzm
ok 607 - Enable ust event tp:tptest for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 607 - Enable ust event tp:tptest for session S0BXcJKWrmAwNSzm
ok 608 - Start tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 608 - Start tracing for session S0BXcJKWrmAwNSzm
ok 609 - Rotate session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 609 - Rotate session S0BXcJKWrmAwNSzm
ok 610 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 610 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
ok 611 - Clear session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 611 - Clear session S0BXcJKWrmAwNSzm
ok 612 - Clear session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 612 - Clear session S0BXcJKWrmAwNSzm
Error: Relayd rotate streams replied error 97
Error: Relayd rotate stream failed. Cleaning up relayd 33
Error: Relayd send index failed. Cleaning up relayd 33.
Error: Rotate channel failed
Error: Stream 76 relayd ID 33 unknown. Can't write index.
Error: Stream 74 relayd ID 33 unknown. Can't write index.
Error: Stream 72 relayd ID 33 unknown. Can't write index.
ok 613 - Start tracing for session S0BXcJKWrmAwNSzm
PASS: tools/clear/test_ust 613 - Start tracing for session S0BXcJKWrmAwNSzm
not ok 614 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
FAIL: tools/clear/test_ust 614 - Stop lttng tracing for session S0BXcJKWrmAwNSzm
# Failed test 'Stop lttng tracing for session S0BXcJKWrmAwNSzm'
# in ./tools/clear//../../../utils/utils.sh:stop_lttng_tracing_opt() at line 1311.
ok 615 - Validate trace for event tp:tptest, 1 events
PASS: tools/clear/test_ust 615 - Validate trace for event tp:tptest, 1 events
not ok 616 - Read a total of 1 events, expected 4
FAIL: tools/clear/test_ust 616 - Read a total of 1 events, expected 4
# Failed test 'Read a total of 1 events, expected 4'
# in ./tools/clear//../../../utils/utils.sh:validate_trace_count() at line 1764.
Error: Failed to perform an implicit rotation as part of the destruction of session "S0BXcJKWrmAwNSzm": Unknown error code
not ok 617 - Destroy session S0BXcJKWrmAwNSzm
FAIL: tools/clear/test_ust 617 - Destroy session S0BXcJKWrmAwNSzm
# Failed test 'Destroy session S0BXcJKWrmAwNSzm'
# in ./tools/clear//../../../utils/utils.sh:destroy_lttng_session() at line 1347.
# Test ust streaming clear-rotate
Looking at the relay daemon log when the problem is reproduced,
we see:
Error: Protocol error: received a packet for a stream that doesn't have a current trace chunk: stream_id = 1, channel_name = chan_0
Cause
=====
The "rotate channel" consumer command iterates over a channel's streams
to perform a rotation and open a new packet when necessary
(see comments).
In the case where a channel is associated with a relay daemon, the
rotation positions are accumulated to send a single "rotate channel
streams" command to the relay daemon. This is done to reduce the time
needed to complete a rotation when tracing to a relay daemon through an
high-latency network connection.
Unfortunately, this causes packets to be opened before the rotation
command was sent to the relay daemon as the "open packet" command
is performed during the iteration on the streams.
Solution
========
Streams for which a packet should be opened are accumulated into an
array of stream pointers. The "open packet" is performed after a
successful rotation of the streams as a second "pass".
Known drawbacks
===============
None.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I6cb0b92619da73ebf223e1dfee88530e4244b56b
Jérémie Galarneau [Fri, 17 Jul 2020 18:26:17 +0000 (14:26 -0400)]
Fix: relayd: wrong specifier used in DBG format string
`len` is of type uint64_t while the format string specifies a size of
`%zd`. This results in a warning on most 32-bit architectures.
In file included from ../../../src/common/common.h:12:0,
from live.c:33:
live.c: In function `viewer_get_metadata`:
../../../src/common/error.h:161:35: warning: format `%zd` expects
argument of type `signed size_t`, but argument 6 has type `uint64_t
{aka long long unsigned int}` [-Wformat=]
#define DBG(fmt, args...) _ERRMSG("DEBUG1", PRINT_DBG, fmt, ## args)
^
../../../src/common/error.h:136:51: note: in definition of macro `__lttng_print`
fprintf((type) == PRINT_MSG ? stdout : stderr, fmt, ## args); \
^~~
../../../src/common/error.h:161:27: note: in expansion of macro `_ERRMSG`
#define DBG(fmt, args...) _ERRMSG("DEBUG1", PRINT_DBG, fmt, ## args)
^~~~~~~
live.c:2051:4: note: in expansion of macro `DBG`
DBG("Failed to read metadata: requested = %zd, got = %zd",
^~~
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I108191e0a94e712b0596d0d52a21cf8c971d6c1e
Jérémie Galarneau [Thu, 16 Jul 2020 21:17:08 +0000 (17:17 -0400)]
Tests: add a "new metadata after clear" test
Add a test that validates that the relay daemon performs a
metadata stream rotation after a clear.
The test enables a single event and launches a test application to
trigger it 10 times. Once the 10 event occurrences have been seen
by the live client, the session is cleared.
Then, two new events (statedump start and end) are enabled. This causes
new event descriptions to be appended to the metadata stream.
After a clear, the relay daemon should rotate the metadata stream
and send the new contents to the client. The client will then
be able to decode the statedump start/end events.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ie6057577e3ee8a7c3ed1a84941308d10d372c16c
Jérémie Galarneau [Thu, 16 Jul 2020 16:39:20 +0000 (12:39 -0400)]
Fix: relayd: send_viewer_streams sends stack data in padding
A single stack-allocated instance of `struct lttng_viewer_stream` is
used to send the various streams to the live viewer. This structure
contains a path and channel name which remain uninitialized beyond the
null terminator.
The structure is zeroed on every iteration.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I1c60ee18e01e8f56157f0fe44bd3bd64b1194703
Jérémie Galarneau [Fri, 26 Jun 2020 22:40:12 +0000 (18:40 -0400)]
Fix: stream intersection fails on snapshot of cleared session
Observed issue
==============
In the following scenario:
lttng create --snapshot
lttng enable-event -u -a
lttng start
taskset -c 0 <tracepoint producing app>
lttng clear
taskset -c 0 <tracepoint producing app>
lttng snapshot record
lttng destroy
When using the stream-intersection mode, babetrace complains that the
time range for the intersection is invalid since the begin timestamp is
after the end timestamp.
This is caused by the presence of "inactive" streams for which no events
are recorded between the clear action and the recording of the snapshot.
These streams have a begin timestamp roughly equal to the moment when
the snapshot was taken (i.e. the end timestamp). Babeltrace, in
stream-intersection mode, attempts to use the latest beginning timestamp
of all streams as the start of the intersection and the earliest end
timestamp as the end boundary.
Path │File size │Packets │Timestamp: beginning │Timestamp: end │
snapshot-1-
20200622-212617-1/ust/uid/1000/64-bit/channel0_0 │ 4.000 KiB│ 1│2020-06-22 21:26:01.
903685878│2020-06-22 21:26:17.
630456312│
snapshot-1-
20200622-212617-1/ust/uid/1000/64-bit/channel0_1 │ 4.000 KiB│ 1│2020-06-22 21:26:17.
630909310│2020-06-22 21:26:17.
630909310│
snapshot-1-
20200622-212617-1/ust/uid/1000/64-bit/channel0_2 │ 4.000 KiB│ 1│2020-06-22 21:26:17.
631295033│2020-06-22 21:26:17.
631295033│
snapshot-1-
20200622-212617-1/ust/uid/1000/64-bit/channel0_3 │ 4.000 KiB│ 1│2020-06-22 21:26:17.
631673614│2020-06-22 21:26:17.
631673614│
Cause
=====
The packet beginning timestamps of the buffers are initialized on
creation (on the first "start" of a tracing session). When a "clear" is
performed on a session, all open packets are closed and the existing
contents are purged.
If a stream is inactive, it is possible for no packet to be "opened"
until a snapshot of the tracing session is recorded.
Solution
========
A new consumer command, "open channel packets" is added. This command
performs a "flush empty" operation on all streams of a channel.
This command is invoked after a clear (after the tracing is re-started)
and on start. This ensures that streams are opened as soon as possible
after a clear, a rotation, or a session start.
Known drawbacks
===============
In the case of an inactive stream, this results an extra empty packet at
the beginning of the inactive streams (typically 4kB) in the snapshots.
In the case of an active stream, this change will cause the first packet
to be empty or contain few events. If the stream is active enough to
wrap-around, that empty packet will simply be overwritten.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I19b5c630fa8bddfb13c3c10f86c6cc9dc4990b08
Jérémie Galarneau [Mon, 13 Jul 2020 22:42:22 +0000 (18:42 -0400)]
Fix: relayd: viewer metadata is not rotated after a session clear
Issue observed
==============
Following a session clear, babeltrace sometimes doesn't receive
the content of the metadata that is announced in the get_metadata
reply header.
This causes babeltrace2 to assert (a fix has been submitted) since
the protocol state becomes de-synchronized, causing babeltrace to
interpret follow-up replies as garbage.
This was occasionally observed on the CI when running the "clear" tests.
Cause
=====
There are no provisions made for rotating a viewer metadata stream when
a clear is performed on a live session. This means that a client can
request metadata that is only present in a newer chunk.
In the case of the crashes observed on the CI, the relay daemon attempts
to service a get_new_metadata request of size `4096`. It then fails to
read the data (as it was never present in the original trace chunk).
The relay daemon does not interpret the `0` returned by lttng_read() as
an error and sends a reply announcing `4096`bytes of metadata and no
payload.
Solution
========
Two fixes are rolled into this patch.
First, the return of lttng_read is checked for `0` and that situation is
handled as an error. However, this still leaves the problem of the
metadata stream not being rotated.
Secondly, the metadata relay_stream is checked for a rotation on every
`get_metadata` command. If a rotation has been detected, a viewer
rotation is performed on the metadata stream (very similar to the data
stream).
This solves the problem, but it leaves a case which the protocol does
not account for.
Essentially, the following can take place:
- relayd sets the "NEW_METADATA" flag as part of a `get_next_index`
query reply
- A rotation of the metadata stream occurs, no data is sent.
- client requests metadata
- metadata sent > received (was reset to 0 as part of the rotation)
In this scenario, the current implementation returned NO_NEW_METADATA,
but it is erroneous. Returning this guarantees to the viewer that it
will be able to decode all data packets that follow (until new metadata
is signalled, if ever).
Ideally, we would return a `RETRY` code, as is done by the data stream
handler when it detects that a rotation is taking place. Unfortunately,
such a code doesn't exist for the `get_metadata` command.
We return ̀ OK` with a length of 0, which is technically correct since
viewers are supposed to fetch metadata until the relay daemon returns
the `NO_NEW_METADATA` status code. However, supporting this has required
changes to babeltrace2's lttng-live source component.
I'm anticipating that most implementations don't handle the 0-length
case any better.
Known drawbacks
===============
Older viewers may not handle `OK` replies with a length of 0 gracefully.
Sending `NO_NEW_METADATA` is not an option as it breaks the guarantee
that all necessary metadata will be received before `NO_NEW_METADATA`.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I99cc1cea15125b48f5ee4e407d782b34b5e3dfb5
Jérémie Galarneau [Thu, 25 Jun 2020 23:22:24 +0000 (19:22 -0400)]
Fix: post-clear trace chunk has a late beginning packet
Observed issue
==============
In the following scenario:
- create a regular session
- enable an event that occurs on only one core
- start tracing
- trace for a while
- stop tracing
- clear the session
- start
- trace for a while (referred to the "active period" later on)
- stop
The resulting trace will contain a very short stream intersection as the
active stream will contain packets spanning the entire active period.
However, all other streams will contain a single packet at the end of
the active period with a duration of 0 ns.
This presents two problems:
1) This makes the stream intersection mode of viewers unusable,
2) This misleads the user into thinking that tracing was not active
for some buffers.
Cause
=====
The packet beginning timestamps of the buffers are initialized on
creation (on the first "start" of a tracing session). When a "clear" is
performed on a session, all open packets are closed and the existing
contents are purged.
If a stream is inactive, it is possible for no packet to be "opened"
until the "stop" of the tracing session.
On stop, a "flush_empty" is performed. Such a flush opens a packet
(if it was not already the case), closes it, and marks the packet as
being ready for consumption.
Solution
========
Attempt to flush an empty packet as close to the rotation point as
possible. In the event where a stream remains inactive after the
rotation point, this ensures that the new trace chunk has a beginning
timestamp set at the begining of the trace chunk instead of only
creating an empty packet when the trace chunk is stopped.
This indicates to the viewers that the stream was being recorded, but
more importantly it allows viewers to determine a useable trace
intersection.
This presents a problem in the case where the ring-buffer is completely
full.
Consider the following scenario:
- The consumption of data is slow (slow network, for instance),
- The ring buffer is full,
- A rotation is initiated,
- The flush below does nothing (no space left to open a new packet),
- The other streams rotate very soon, and new data is produced in the
new chunk,
- This stream completes its rotation long after the rotation was
initiated
- The session is stopped before any event can be produced in this
stream's buffers.
The resulting trace chunk will have a single packet temporaly at the end
of the trace chunk for this stream making the stream intersection more
narrow than it should be.
To work-around this, an empty flush is performed after the first
consumption of a packet during a rotation if the initial flush failed.
The idea is that consuming a packet frees enough space to switch packets
in this scenario and allows the tracer to "stamp" the beginning of the
new trace chunk at the earliest possible point.
Note that metadata streams are always skipped when opening a packet.
This is done for two reasons:
1) Timestamps are not relevant to the metadata stream
2) Inserting an empty packet in the metadata stream after a rotation
breaks the use of "clear" in live mode.
The contents of the metadata streams of successive chunks must be
strict superset of one another as live clients only receive the
information appended to a metadata stream (i.e. the parts it
already has received can't change).
If a flush_empty was performed after a clear/rotation, it would
result in an empty packet being inserted at the beginning of the
metadata stream that wasn't present in the first chunk.
This would cause the live client and relay daemon to have
mismatching copies of the metadata stream.
Known drawbacks
===============
In the case of an inactive stream, this results in the completed trace
chunk archive containing an extra empty packet at the beginning of the
stream (typically 4kB).
In the case of an active stream, this change will cause the first packet
to be empty or contain few events.
Those are all efficiency losses that are inevitable (AFAIK) given the
current buffer control APIs. It will be possible to recoup those losses
if an API allowing the consumer daemon to open a new packet is
introduced.
As noted in the comments, this patch is not final. The flush after the
rotation should not be open-coded in lttng_consumer_read_subbuffer. It
should be a data-stream specific "post-consume" step.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7d8ab876e55e9d0718a55ec1bb77ec6466accc02
Jérémie Galarneau [Tue, 7 Jul 2020 22:55:19 +0000 (18:55 -0400)]
Fix: kconsumer: missing wait for metadata thread in do_sync_metadata
The `do_sync_metadata` function is invoked everytime a data sub-buffer
is consumed in live mode.
In the user space case, lttng_ustconsumer_sync_metadata() returns
EAGAIN (positive) when there is new metadata to consume. This causes the
"metadata rendez-vous" synchronization to take place. However, the
kernel variant of this function returns 0 when there is new data to
consume, causing the "rendez-vous" to be skipped.
I have not observed an issue caused by this first-hand, but the check
appears bogus and skips over an essential synchronization step.
This check has been in place since at least 2013, although the callees
and their return values may have changed at some point in the past.
Solution
--------
The user space and kernel code paths mix various return code conventions
(negative errno, positive errno, 0/-1) which makes it difficult to
understand the final return codes and most likely lead to this confusion
in the first place.
Moreover, returning EAGAIN to indicate that data is ready to be consumed
is not appropriate in view of the existing conventions in the code base.
An explicit `enum sync_metadata_status` is returned by the domains'
sync_metadata operations which allows the common code to handle the
various conditions in a straight-forward manner and for the
"rendez-vous" to take place in the kernel case.
Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib022eee97054c0b376853dd05593e3b94bc9a8ca
Jérémie Galarneau [Thu, 9 Jul 2020 15:44:47 +0000 (11:44 -0400)]
Fix: tests: interrupting get_next_notification causes test to fail
Attaching a debugger to the `notification` test application causes
tests to fail when the application is waiting for a notification
as the function will return an `INTERRUPTED` status code.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4ac9ac13d02a96366791e81746e53bb5dde94885
Francis Deslauriers [Mon, 6 Jul 2020 16:31:00 +0000 (12:31 -0400)]
Fix: consumer.c: wrong order of parameter in `DBG()` statement
This debug statement is printing the padded_subbuf_size as the
subbuf_size and inversely.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Change-Id: I8c719e21999ffaa8df2de6089b65b90f1b211a72
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jonathan Rajotte [Thu, 25 Jun 2020 20:52:33 +0000 (16:52 -0400)]
Fix: consumer: dangling chunk on buffer allocation failure
Observed issue
==============
Using docker the /dev/shm mount size is 64 MB by default. On system
with many threads (256), combined with the default subbuffer count and
subbufer size, allocation failure of the subbuffer is inevitable since
the /dev/shm mountpoint is too small.
When a user try to destroy the problematic session, the destroy
command hangs.
# Force the size of /dev/shm, make sure to remount it to its orignal
# size when done testing.
sudo mount -o remount,size=1G /dev/shm
lttng create
lttng enable-channel --subbuf-size 500M -u test
lttng enable-event -u -a -c test
lttng start
# Run an app;
../lttng-ust/doc/examples/easy-ust/sample
lttng-sessiond should output:
Error: ask_channel_creation consumer command failed
Error: Error creating UST channel "test" on the consumer daemon
lttng destroy
Output hang:
Destroying session auto-
20200626-112733..........
Cause
=====
The hang is caused by the check of ongoing rotation which never finishes.
The consumer reports that the trace chunk still exists. This is caused
by an imbalance of the reference count for the trace chunk on close.
This is caused by a missing lttng_trace_chunk_put in destroy_channel
which is called on the error path for the consumer channel creation.
Solution
========
Call lttng_trace_chunk_put if the channel->trace_chunk is assigned
inside the channel_destroy function.
The error handling in ust_app_channel_create is also reworked since the
return value for ust_app_channel_send would be squashed for scenario
where contexts are present.
Known drawbacks
=========
None
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Idf19b1d306cf347619ccfda1621d91d8303f3c7c
Jérémie Galarneau [Wed, 17 Jun 2020 23:13:50 +0000 (19:13 -0400)]
Fix: consumerd: uninitialized written_bytes on no-data sleep
e66d26f51 introduces a jump to a label which causes `written_bytes`
(the return value) to not be initialized.
written_bytes may be used uninitialized in this function [-Wmaybe-uninitialized]
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I72c5c07298093f27fa72b72cb157ce4eedb81adb
Jérémie Galarneau [Wed, 17 Jun 2020 22:54:09 +0000 (18:54 -0400)]
Build fix: consumerd misnamed label
`sleep_stream` label was changed from `stream_sleep` as part of
a fixup/rebase which didn't make it into the master branch.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ifd257e2b6d1f522d018cf0284a89f49a92b12b02
This page took 0.069268 seconds and 4 git commands to generate.