Kienan Stewart [Wed, 8 Jan 2025 14:22:02 +0000 (09:22 -0500)]
Tests: Support redirection to identical stderr/stdout in`_run_lttng_cmd`
Observed issue
==============
`tests/regression/kernel/test_userspace_probe` fails with the
following error:
```
Bail out! The current `_run_lttng_cmd` implementation does not support
redirecting stdout and stderr to the same file.
```
Cause
=====
The commit aae350f3c2ac878227bae41281e7abca65e33c0f ("Tests: Add path
check to prevent incorrect redirection") introduces a
safety check to bail out if the same stderr and stdout file are used
in `_run_lttng_cmd`. This avoids some foot-gun scenarios where output
is expected in a file from both, but the implementation didn't
actually support it.
There has been a number of changes to both `_run_lttng_cmd` for
logging, and it's common for both `stdout_dest` and `stderr_dest` to
be set (usually to `/dev/null`).
In particular, this check ends up being triggered in
`tests/regression/kernel/test_userspace_probe`:
```
1..175
# Userspace probe - Testing userspace probe on ELF symbol
ok 1 - 0 LTTng modules loaded, expected count = 0
# export LTTNG_SESSION_CONFIG_XSD_PATH=/root/workspace/lttng-tools_master_root_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.14/node/sles15sp4-amd64-rootnode/platform/sles15sp4-amd64/src/lttng-tools/tests/../src/common/
# env /root/workspace/lttng-tools_master_root_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.14/node/sles15sp4-amd64-rootnode/platform/sles15sp4-amd64/src/lttng-tools/tests/../src/bin/lttng-sessiond/lttng-sessiond --consumerd64-path=/root/workspace/lttng-tools_master_root_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/stable-0.14/node/sles15sp4-amd64-rootnode/platform/sles15sp4-amd64/src/lttng-tools/tests/../src/bin/lttng-consumerd/lttng-consumerd 1
ok 2 - Start session daemon
# Userspace probe enable on non-existant file
# Killing (signal SIGTERM) lttng-sessiond and lt-lttng-sessiond pids: 18167 18182
Bail out! The current `_run_lttng_cmd` implementation does not support redirecting stdout and stderr to the same file.
# Looks like you planned 175 tests but only ran 2.
```
Solution
========
Add support for invoking `_run_lttng_cmd` with the same stderr and
stdout destinations.
Known drawbacks
===============
None.
Change-Id: I9133ccfd1f524000dbf02dfc05730feca1c59ab7 Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Clean-up: Test list triggers: Remove unused tmp variables
The variables can be removed since the functions that compare the
actual and expected output for these tests, `list_triggers_matches_ok`
and `list_triggers_matches_mi_ok` define `tmp_stdout` and `tmp_stderr`
variables locally.
Kienan Stewart [Thu, 7 Nov 2024 17:20:58 +0000 (12:20 -0500)]
Tests: Use serial runner
Observed issue
==============
Running `make check -jN` with N greater than 1 will fail a number of
different checks.
Cause
=====
The majority of the tests aren't designed to work in
isolation (e.g. using `LTTNG_HOME` and `LTTNG_RUNDIR`), and some tests
that exercise global system resources (e.g. kernel modules) will never
work reliably in parallel.
Solution
========
Create a serial test runner that can execute all the tests that
require global resources while the other tests that are defined as
'safe to parallelize' can be run concurrently.
Known drawbacks
===============
None.
Change-Id: I0b87d2cd5e870ee7f9241ec78390ed96a01efb38 Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 28 Nov 2024 18:41:06 +0000 (13:41 -0500)]
Clean-up: Tests: Provide default cwd when testing relayd working dir
In `tap.sh`, `is()` ends up aborting the test script if it is passed
an empty (null) value. When, for example, lttng-relayd fails to start,
the `cwd` will be empty as `readlink` will fail.
E.g. Output when a test fails and limits the runs:
```
...
ok 4 - lttng-relayd already started
---
duration_ms: 12.914679
...
not ok 5 - Found lttng-relayd
---
duration_ms: 49.119194
...
./tests/regression/tools/working-directory//../../..//utils/tap/tap.sh: line 288: 1: parameter null or not set
```
Change-Id: I54ac5c23a6d2604d261e8240679a827a12a57a1a Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 14 Nov 2024 19:28:06 +0000 (14:28 -0500)]
Tests: Wait for specific PIDs of background applications
When running lttng-relayd or any other SUT from utils.sh in a fashion
that doesn't take it out of the test's process tree (e.g. when started
without '-d' or '-b'), `wait` will hang on those processes are well.
Change-Id: I82e0365d5165088980c0fce8e451fa0557ee201a Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Tue, 12 Nov 2024 20:23:06 +0000 (15:23 -0500)]
lttng-relayd: Add `--pid-file` parameter
Intended to be used a mechanism to get the PID of the process when
starting with `-b` or `-d`.
`lttng-sessiond` writes to a well-known path inside `$LTTNG_HOME`
instead.
The exit for the lttng-relayd `main()` function is changed from
`exit()` to `return`. When using `exit()` or `std::exit()`, the
end-of-scope objects (e.g. from `lttng::make_scope_exit()` do not
fire.
Change-Id: I1ce6fd316356ac251cb3a0c0039c565f1ca1210f Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 7 Nov 2024 13:44:36 +0000 (08:44 -0500)]
Tests: Start sessiond in python test environments with `--no-kernel`
Observed issue
==============
One of the goals of the lttngtest Environment is to be able to produce
separate test environments for each run in order to eventually run
some tests in parallel.
Solution
========
If `lttng-sessiond` is run as root, by default it will attempt to load
the lttng-modules kernel modules.
The session daemon is now started with `--no-kernel` to disable the
kernel domain and prevent that instance from loading the kernel
modules. This behaviour can be changed for specific tests that
exercise kernel tracing.
Known drawbacks
===============
The `--no-kernel` start is less representative of the default usage of
the session daemon.
Change-Id: Iacb85d0a33d6981bd038d57bf629d61117f9344d Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 7 Nov 2024 13:38:13 +0000 (08:38 -0500)]
Tests: Check python version before using signal.strsignal
Observed issue
==============
When running CI tests against sles12sp5 (with python 3.4) the clean-up
of a stalled test produced the following output:
```
Traceback (most recent call last):
File "./tools/live/test_early_inactive_app.py", line 111, in <module>
log=tap.diagnostic, with_relayd=True, with_sessiond=True
File "/usr/lib64/python3.4/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/home/jenkins/workspace/dev_review_lttng-tools_master_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/platform/sles12sp5-amd64/src/lttng-tools/tests/utils/lttngtest/environment.py", line 1192, in test_environment
with_sessiond, log, with_relayd, extra_env_vars, skip_temporary_lttng_home
File "/home/jenkins/workspace/dev_review_lttng-tools_master_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/platform/sles12sp5-amd64/src/lttng-tools/tests/utils/lttngtest/environment.py", line 677, in __init__
self._launch_lttng_relayd() if with_relayd else None
File "/home/jenkins/workspace/dev_review_lttng-tools_master_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/platform/sles12sp5-amd64/src/lttng-tools/tests/utils/lttngtest/environment.py", line 933, in _launch_lttng_relayd
time.sleep(0.1)
File "/home/jenkins/workspace/dev_review_lttng-tools_master_slesbuild/babeltrace_version/stable-2.0/build/std/conf/agents/liburcu_version/master/platform/sles12sp5-amd64/src/lttng-tools/tests/utils/lttngtest/environment.py", line 1044, in _handle_termination_signal
signal_name=signal.strsignal(signal_number)
AttributeError: 'module' object has no attribute 'strsignal'
```
Cause
=====
`signal.strsignal` is introduced in Python 3.8[1].
Solution
========
Check the system version and fall back to using `str(signal_number)`
which will give the numeric signal (e.g. signal.SIGKILL -> 9).
The settings are retrieved by the script and used if they were
specified.
Note that we should probably do the same for `black` as they
change the formatting from version to version. Right now I
feel we imply "use the latest version", but we should probably
impose one and update it on a release-to-release basis.
Kienan Stewart [Tue, 10 Dec 2024 17:11:37 +0000 (12:11 -0500)]
misc: Add pre-commit hook for common checks
The initial version of the pre-commit hook, if installed, will check
the changed with clang-format, clang-tidy, and
python-black. Furthermore, C/C++ source files will be parsed using
python-clang and the comment style (starting with `/*` and ending with
`*/`) will be validated.
This pre-commit hook will check only the staged files.
```
$ git hook run pre-commit 2>&1
ERROR:root:Failed rule 'cpp-comment-style'
Wrong comment style at <SourceRange start <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 109, column 1>, end <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 109, column 6>>
// hi
Wrong comment style at <SourceRange start <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 111, column 1>, end <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 111, column 12>>
/** hey **/
Wrong comment style at <SourceRange start <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 113, column 1>, end <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 113, column 11>>
/* sup **/
Wrong comment style at <SourceRange start <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 115, column 1>, end <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 115, column 3>>
//
Wrong comment style at <SourceRange start <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 117, column 1>, end <SourceLocation file 'src/bin/lttng-sessiond/main.cpp', line 117, column 5>>
/**/
ERROR:root:Failed rule 'clang-format'
src/bin/lttng-sessiond/main.cpp:1516:63: error: code should be clang-formatted [-Wclang-format-violations]
/* Queue of rotation jobs populated by the sessiond-timer. */
^
src/bin/lttng-sessiond/main.cpp:1518:47: error: code should be clang-formatted [-Wclang-format-violations]
struct lttng_thread *client_thread = nullptr;
^
ERROR: root:Failed rule 'python-black'
would reformat asdf.py
Kienan Stewart [Fri, 8 Nov 2024 20:59:43 +0000 (15:59 -0500)]
lttng: Warn on session start when client-visible shm path is too small
Observed issue
==============
When adding one or more channels to a session, it is possible to
configure sessions such that the minimum memory required to
successfully trace exceeds the size of the shared memory location.
There is no user feedback when starting sessions that are configured
in such a way that they will never work. It therefore leads to
user confusion when no events are recorded.
As allocations are made lazily, there is no feedback for this type of
scenario visible to an lttng client. The session can successfully
start; however, there will be allocation errors visible in the
consumer daemon logs.
Solution
========
When starting a session, the minimum shared memory size required is
estimated using the configured non-kernel channels. This assumes
per-PID or per-UID buffers with a single PID or a single UID. An extra
set of buffers are estimated when the session is configured in
snapshot mode.
E.g.
```
$ lttng create
Session auto-20241112-102937 created.
Traces will be output to
/home/kstewart/src/efficios/lttng/master/home//lttng-traces/auto-20241112-102937
$ lttng enable-channel -u --subbuf-size=256M --num-subbuf=8 a2
user space channel `a2` enabled for session `auto-20241112-102937`
$ lttng start
Warning: The estimated minimum shared memory size for all non-kernel channels of session 'auto-20241112-102937' is greater than the total shared memory allocated to the default shared memory location (8192MiB >= 7873MiB). Tracing for this session may not record events due to allocation failures.
Tracing started for session `auto-20241112-102937`
$ lttng list
Available recording sessions:
1) auto-20241112-102937 [active]
Trace output:
/home/kstewart/src/efficios/lttng/master/home//lttng-traces/auto-20241112-102937
```
Known drawbacks
===============
The intention of this change is to be backwards compatible, and
therefore doesn't stop the session from being started.
With per-PID or per-UID sub-buffers, the estimate is based only a
single PID or UID. Additional PIDs or UIDs could cause allocation
failures later on in the runtime which will continue to only be
visible in the consumer daemon logs.
The shm path is estimated by the lttng client rather than the consumer
daemon. It is possible for the consumer daemon be running in a
different container whose shared memory path maps to a different
backing storage.
This checks only against the total size of the shared memory storage -
not the current available usage. It does not take into account other
current enabled sessions.
Fix: sessiond: notification-thread: add to pollset fails silently
Noticed when reviewing the code, so I don't have a reproducer
for the issue. However, the "ADD_TRACER_EVENT_SOURCE" command
returns LTTNG_OK even when the notification thread fails to add
the tracer event source to the poll set.
The error path properly performs the requisite clean-up, but
the command emitter will be under the impression that the command
succeeded. In doing so, it will most likely use the
"REMOVE_TRACER_EVENT_SOURCE" at a later point which will cause
a failed assertion to hit when the `source_element` isn't found.
Kienan Stewart [Tue, 29 Oct 2024 12:48:45 +0000 (08:48 -0400)]
docs: Update coding standard blurb in contributions readme
Observed issue
==============
The CodingStyle document now includes style guidelines for Python,
Shell (bash), and legacy C code.
Solution
========
* Fix wording in the first sentence and use a relative link to the file
* Remove paragraph that other languages do not have a coding style
* Add a link to the test README
Known drawbacks
===============
None.
Change-Id: I057ae8d51afaa07892aa40b47c2f1716574b3b06 Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: sessiond: cmd_add_ctx: leak of internal channel members
lttng_channel instances must be released using channel_attr_destroy.
However, an error path of cmd_add_ctx uses free() directly, which causes
internal structures of lttng_channel to be leaked.
Wrap the lttng_channel instance to use a unique_ptr which invokes
channel_attr_destroy on release.
Clean-up: sessiond: enable_event: use unique_ptr to manage memory
Simplify the error paths used to create the internal events associated
with agent events as copies of the filter bytecode and expressions are
performed.
Fix: sessiond: leak of filter_expression, bytecode, and exclusion
The filter_expression, bytecode, and exclusion arguments are leaked
whenever a client omits the transmission of an event-rule as part of an
enable-event command (a protocol error).
Ensure the three arguments are automatically managed using smart
pointers before performing the protocol consistency check.
Fix: lttng-ctl: Validate lttng_enable_event_with_exclusions inputs at beginning of function
Checking for ev == nullptr after having dereferenced ev triggers this
Coverity warning:
CID 1566411: Null pointer dereferences (REVERSE_INULL)
Null-checking "ev" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
Move the handle and ev nullptr checks before the LTTNG_EVENT_ALL
special-case. This is relevant because within LTTNG_EVENT_ALL, after
modifying the event type, lttng_enable_event_with_exclusions is invoked
again with the modified type, which will end up doing the input
validation anyway.
Move the original_filter_expression before the LTTNG_EVENT_ALL event
type check for the same reason: it will end up being checked with the
modified event type in the nested call anyway, so favor early checking
of input arguments.
On SLES12SP5 (gcc 4.8.5), occasional crashes were observed with the
following log entry:
```
DBG1 - 18:49:41.084261033 [Notification]: Poll wait returned (1) (in thread_notification() at notification-thread.cpp:632)
DBG1 - 18:49:41.084264635 [Notification]: Handling fd (28) activity (1) (in thread_notification() at notification-thread.cpp:656)
DBG1 - 18:49:41.084269451 [Notification]: Received `REMOVE_TRACER_EVENT_SOURCE` command (in handle_notification_thread_command() at notification-thread-events.cpp:3178)
lttng-sessiond: notification-thread-events.cpp:2228: int
handle_notification_thread_command_remove_tracer_event_source(notification_thread_state*,
int, lttng_error_code*): Assertion `source_element' failed
```
The easiest reproduce was to comment out most all tests in
`regression/ust/ust-app-ctl-paths/test_ust_app_ctl_paths` except
`test_trace_self_default_paths`. Then run a loop until a failure is
detected, e.g.
```
make -j4 && \
ITER=0 && \
while true ; do LTTNG_UST_DEBUG=1 LTTNG_TEST_LOG_DIR=- LTTNG_TEST_VERBOSE_SESSIOND=1 ./tests/regression/ust/ust-app-ctl-paths/test_ust_app_ctl_paths 2>&1 | tee sessiond.log ; if grep -qE '^not ok' sessiond.log ; then break; fi ; ITER=$((ITER+1)) ; echo "Done $ITER iterations" ; sleep 1 ; done
```
Cause
=====
After investigation, it appears to be that `lta = new ust_app;` may
occasionally return the address multiple times when previous
allocation is rapidly de-allocated.
For example, in the following snippet of a log tracking an application
address that shows re-use:
```
DBG1 - 19:47:29.103043500 [UST registration dispatch]: Create new ust app allocation 0x7fce14001da0 for pid 23256 sock 53 (in ust_app_create() at ust-app.cpp:4061)
DBG1 - 19:47:29.115347169 [UST registration dispatch]: wait_node_in_queue looked up app 0x7fce14001da0 pid 23256 (in thread_dispatch_ust_registration() at dispatch.cpp:362)
DBG1 - 19:47:29.115933991 [UST registration dispatch]: ust app version app 0x7fce14001da0 pid 23256 (in thread_dispatch_ust_registration() at dispatch.cpp:392)
DBG1 - 19:47:29.115954872 [UST registration dispatch]: Added ust app 0x7fce14001da0 pid 23256 (in thread_dispatch_ust_registration() at dispatch.cpp:403)
DBG1 - 19:47:29.115958125 [UST registration dispatch]: ust_app_setup_event_notifier_group app 0x7fce14001da0 pid 23256 (in ust_app_setup_event_notifier_group() at ust-app.cpp:4226)
DBG1 - 19:47:29.116367064 [UST registration dispatch]: app 0x7fce14001da0 add tracer event source in setup_event_notifier_group fd 60 ret 10 (in ust_app_setup_event_notifier_group() at ust-app.cpp:4279)
DBG1 - 19:47:29.117653161 [UST registration dispatch]: app 0x7fce14001da0 pid 23256 return from setup trace event notifier group ret 0, event notifier group 0x7fce14003180 (in ust_app_setup_event_notifier_group() at ust-app.cpp:4313)
DBG1 - 19:47:29.152058781 [23308/23332]: delete_ust_app 0x7fce14001da0 pid 23256 sock 53 (in delete_ust_app() at ust-app.cpp:1021)
DBG1 - 19:47:29.152076408 [23308/23332]: No event notifier group object remove tracer event source fd 60 app 0x7fce14001da0 (in delete_ust_app() at ust-app.cpp:1075)
DBG1 - 19:47:29.191427936 [UST registration dispatch]: Create new ust app allocation 0x7fce14001da0 for pid 23371 sock 71 (in ust_app_create() at ust-app.cpp:4061)
DBG1 - 19:47:29.204450310 [UST registration dispatch]: wait node 0x7fce14001d10, app 0x7fce14001da0 (in sanitize_wait_queue() at dispatch.cpp:144)
DBG1 - 19:47:29.204454373 [UST registration dispatch]: Culling app from wait queue: pid=23371 app 0x7fce14001da0 (in sanitize_wait_queue() at dispatch.cpp:146)
DBG1 - 19:47:29.259291727 [23308/23332]: delete_ust_app 0x7fce14001da0 pid 23371 sock 71 (in delete_ust_app() at ust-app.cpp:1021)
DBG1 - 19:47:29.590230014 [23308/23332]: No event notifier group object remove tracer event source fd 81 app 0x7fce14001da0 (in delete_ust_app() at ust-app.cpp:1075)
lttng-sessiond: notification-thread-events.cpp:2233: int handle_notification_thread_command_remove_tracer_event_source(notification_thread_state*, int, lttng_error_code*):
Assertion `source_element' failed.
```
As this variable is default initialized[1], the memory contents are
indeterminate. In `ust_app_create`, many fields are manually
initialized; however `lta->event_notifier_group.object` is not set
explicitly.
The next time `event_notifier_group.object` is set is when the the
`notification_thread_command_add_tracer_event_source` succeeds. If the
app dies before than, or if that registration fails, the
`ust_app.event_notifier_group.object` may be non-NULL. When this
object passes to `delete_ust_app`, the only check made before trying
to remove the tracer event
source (`notification_thread_command_remove_tracer_event_source`) is
if `event_notifier_group.object` is NULL or not.
If the file descriptor hasn't been previously registered, the assert
is triggered.
While I haven't confirmed my hypothesis, it is possible that the
behaviour of GCC regarding default initialization has changed between
GCC 4.8.5 and now.
Solution
========
Use an explicit initialization for the new allocation.
Known drawbacks
===============
None.
References
==========
[1]: https://en.cppreference.com/w/cpp/language/default_initialization (See case 2)
Change-Id: I12de1864d2995be014d5526283fb843d0f99093e Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: Close per-process event notifier error accounting fds on registration
On application registration, the event notifier error accounting file
descriptors are duplicated to send the error accounting counter objects
to the application.
Those are left open until the application unregisters.
There is one file descriptor per CPU, so on larger systems (228 CPUs
Intel or 192 CPUs AMD EPYC), this adds up to a lot of file descriptors
when the number of registered applications is large, which can result in
file descriptor exhaustion errors.
Moreover, the application unregistration is done from delete_ust_app(),
which is used from a call_rcu() worker thread, thus after an RCU grace
period delay. This means that a steady stream of short-lived
applications with a short enough lifetime could end up allocating more
file descriptors than can be closed.
Fix this by closing those file descriptors immediately after the objects
are sent to the application, similarly to what is done for the ring
buffer streams.
Maintain recording channel configuration objects in ltt_session
Maintain the state of event rules in the ltt_session's channel
configurations. This is laying the groundwork to maintain triggers (and
ultimately, event-rules) associated with map channels.
liblttng-ctl now converts the lttng_event specifications into
event-rules which are used by the session daemon to maintain recording
channel configurations.
The objective is to share the synchronization code of
tracers (essentially, the description of enablers and channel
configuration) with the session configurations when adding support for
maps.
As of this commit, the event-rule-configurations are maintained
internally, but aren't use to synchronize the tracer configuration.
Fix: sessiond: assertion on inconsistent filter bytecode and expression
The session daemon correctly expects that a recording event-rule that
specifies a filter must have both a bytecode and an expression (or
neither of them).
However, it shouldn't assert as those elements are user-specified. The
handling is changed to return an "invalid parameter" error to the
client.
Fix: uprobe event-rule: logging mentions pattern instead of name
A uprobe event rule names a new event; it does not match existing
events. Thus, it doesn't make sense to use the "pattern" terminology
used by some other event rule types.
In order to make a follow-up change which adds non-POD members to
ltt_session, allocate it using the `new operator which causes
ltt_session's constructor to run.
Kienan Stewart [Thu, 31 Oct 2024 12:33:04 +0000 (08:33 -0400)]
Tests: Fix: Use '.logfile' instead of '.log' for test app output
Observed issue
==============
Frequent CI errors related to parsing TAP log files.
E.g.
```
17:04:55 Parsing TAP test result [/var/lib/jenkins/jobs/dev_review_lttng-tools_master_linuxbuild/configurations/axis-babeltrace_version/stable-2.0/axis-build/std/axis-conf/std/axis-liburcu_version/stable-0.14/axis-platform/deb12-amd64/builds/857/tap-master-files/tap/regression/ust/ust-app-ctl-paths/test_blocking.log.d/babeltrace.err.uMIm4i.log].
17:04:55 org.tap4j.parser.ParserException: Error parsing TAP Stream:
Missing TAP Plan.
```
Cause
=====
The TAP collector in the [CI][1] uses `**/*.*`[2] as the ANT pattern
for globbing test log files.
Consequently if, for example, `./test.log.d/session.XXXXX.log` exists,
the content will try to be parsed as a TAP report and potentially
fail.
The Jenkins TAP plugin[3] does not expose the option of adding
excludes[4] to the globbing pattern.
Solution
========
Use `.logfile` as the suffix and modify the CI TAP collector
configuration to use `**/*.log`, as `.log` is the suffix for test logs
produced when running `make check`[5].
Kienan Stewart [Thu, 18 Apr 2024 20:35:40 +0000 (16:35 -0400)]
sessiond/lttng-ctl: Introduce LTTNG_RUNDIR
Observed issue
==============
Starting multiple instances of lttng-sessiond as the root user isn't
possible, even with setting different values for the `LTTNG_HOME`
environment variable.
Cause
=====
When starting `lttng-sessiond` the `apps_unix_sock_path`,
`client_unix_sock_path`, and `health_unix_sock_path` are set to
configure time static defines under `CONFIG_LTTNG_SYSTEM_RUNDIR`,
e.g. `/var/lib/lttng`.
Solution
========
A new environment variable `LTTNG_RUNDIR` is introduced to control the
base directory used by applications to find communication sockets.
The default behaviour is to emulate the existing divide: if
`LTTNG_RUNDIR` is not set the root `lttng-sessiond` will continue to
use `CONFIG_LTTNG_SYSTEM_RUNDIR`, and a non-root user `lttng-sessiond`
will use `LTTNG_HOME/.lttng`.
When `LTTNG_RUNDIR` is set, it takes priority over `LTTNG_HOME` for
determining the base directory. The output directory for traces is
not affected by `LTTNG_RUNDIR`, it continues to respect `LTTNG_HOME`.
Example with starting multiple root `lttng-sessiond`s:
In the example above, as `LTTNG_HOME` is not set, the default output
directory for traces by both over the `lttng-sessiond` instances will
be `$HOME/lttng-traces`.
Kienan Stewart [Thu, 26 Oct 2023 20:08:27 +0000 (16:08 -0400)]
tests: Add tests for LTTNG_UST_APP_PATH and LTTNG_UST_CTL_PATH
test_blocking_mode: Verifies that the sessiond starts (or doesn't start)
appropriately depending on the combination of path settings in
conjuction with `LTTNG_UST_ALLOW_BLOCKING` and `--blocking-timeout`.
test_path_separators:Path separators: Verifies the behaviour of the
sessiond and applications with multiple `LTTNG_UST_APP_PATH`s and
multiple `LTTNG_UST_CTL_PATHS`, including the verification of Java JUL
and Python agents.
test_ust_app_ctl_paths: Verifies the sessiond and traced applications
with different combinations of `LTTNG_UST_APP_PATH` and
`LTTNG_UST_CTL_PATH` settings.
Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9f25cdc20fd482cdcc71294aadff3a2a6b34c01e
Kienan Stewart [Mon, 27 Nov 2023 16:32:42 +0000 (11:32 -0500)]
sessiond: Add initial support for multiple LTTNG_UST_CTL_PATHs
LTTNG_UST_CTL_PATH may be separated with ':'. There is no provision
for escaping the ':' path separator (similar to `$PATH`). Subsequent
paths after the first are ignored.
Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I80f7d63e4be164f7afe3fe478794bd55c71c31fb
* Control the tracer sessiond by setting LTTNG_HOME=/tmp/tracer-home
* Control the tracee sessiond by setting LTTNG_HOME=/tmp/tracee-home
* Run an application traced by the tracee sessiond
```
LTTNG_UST_APP_PATH=/tmp/tracee-ust myapp
```
* Run an application traced by the tracer sessiond
Some comment blocks use /** in lieu of /* on the first line. Since the
latter style is prevalent in the code base, replace all instances to
attain One True Style.
Simon Marchi [Mon, 16 Sep 2024 09:00:35 +0000 (05:00 -0400)]
Bump required clang-format version to 16
Bump the required clang-format version from 14 to 16. For reference,
this is the version the Babeltrace project uses [1], and it is checked
by the LTTng CI [2], so I think it makes sense to go to that version.
More recent versions might not be readily available on the CI machines
(mostly running Debian 12 at the moment).
FWIW, I think that the code changes generated by this version change are
improvements.
Kienan Stewart [Thu, 17 Oct 2024 14:40:48 +0000 (10:40 -0400)]
Fix: relayd: leaked socket for live connections
Observed issue
==============
While exploring an issue where Python programs using the `bt2` module
would crash on shutdown, it was noticed that the same case also
highlighted a leaked file descriptor.
The following Python program may be used to demonstrate the leak.
```
import os
import socket
import time
import bt2
def is_connected():
ctf_live_cc = bt2.find_plugin("ctf").source_component_classes["lttng-live"]
q = bt2.QueryExecutor(ctf_live_cc, "sessions", params={"url": "net://localhost"})
connected = False
try:
for x in q.query():
print(x)
if x['session-name'] == 'test' and x['client-count'] >= 1:
connected = True
break
except Exception as e:
print(e)
return connected
os.system("lttng create test --live")
os.system("lttng enable-event -u --all")
os.system("lttng start")
During the clean-up at the end of the live worker thread, all remaining
viewer connections have their references put (which should cause the
connection to be released).
When releasing connections, `close` is never called on the socket's file
descriptor. Furthermore, the release doesn't modify `the_fd_tracker`.
Solution
========
Explicitly close and stop tracking the sockets for viewer connections
just prior to the final put.
Known drawbacks
===============
None.
Change-Id: I5eb0a3e5b9cb14dc1199be463bc312cbc72d8244 Signed-off-by: Kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Fri, 11 Oct 2024 15:40:34 +0000 (15:40 +0000)]
Tests: Add environment variables in tests to attach gdbserver
When debugging tests, the current infrastructure in both the python and
bash test harnesses require that the user start the sessiond or relayd
beforehand, and run the tests with environment variables to stop the
spawning of the respective programs.
To facilitate the process, new environment variables are added to allow
gdbserver to be spawned and attach to the relayd or sessiond. The user
may then connect with gdb, for example: `gdb -ex "target
localhost:1001"`.
Change-Id: Id4d1b446c7d6682c011ef27682198fb4a503f5f4 Signed-off-by: kienan Stewart <kstewart@efficios.com> Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Specify that the caller of `get_session_name()` must explicitly free the
memory associated with the returned string.
Historically this software was written in C. C programmers have the
reflex to assume they are responsible for freeing memory. However, now
that this project is mixed C/C++ and is transitioning towards C++ the
assumption that developers will automatically know to free memory
(according to C programming conventions) does not hold as well. For this
reason, clarify that memory must be explicitly freed by the function
caller.
There are other C functions in this software that return variables that
must explicitly be freed by the caller. However, these changes have not
been applied consistently to all these cases. The assumption is that
this individual clarification still reduces confusion even if not
applied consistently.
Fix: relayd: viewer_stream leak causes assertion failure on exit
Observed issue
==============
Running the test proposed in change #11584[1], the relay daemon aborts
when destroying the viewer_streams_ht as it is not empty.
Cause
=====
A viewer stream reference is leaked when sending streams to the live
client causing them to remain published in the viewer_streams_ht beyond
the lifetime of the viewer_connection.
The send_viewer_streams() function operates in two phases. First, it
iterates over the viewer_streams_ht to find streams that belong to the
target session and have not been sent yet.
In the second phase, it iterates over the session's unannounced stream
list. The commit message of 98b82dfa2 gives more background on the role
of the unannounced stream list.
When a viewer stream is created, two references are acquired:
- one belongs to the global viewer_streams_ht,
- the other belongs to the unannounced stream list.
When the viewer stream is eventually sent to the client, it is removed
from the unannounced stream list and that reference must be dropped.
Unfortunately, the reference is not dropped during the first phase.
Solution
========
Put the reference of the viewer streams that are sent during the first
phase of send_viewer_streams().
g++ emits warnings that it can't recognize the clang-specific diagnostic
pragmas. They are replaced by the internal compiler-specific macros so
that nothing is emitted when g++ is used.
Disable clang warning for injected class name ambiguity in non_copyable_reference
clang raises a warning (-Winjected-class-name) due to ambiguity between
a constructor name and a type within the non_copyable_reference code.
Since clang could not infer the correct type context, this commit uses
`#pragma clang diagnostic` to disable the specific warning in the
affected area of the code.
The `push` and `pop` pragmas ensure that the warning is disabled only
where needed, preventing it from affecting other parts of the codebase,
and allowing us to maintain clean and clear code without unnecessary
compiler warnings.
A static_assert enforces that CustomDeleter::deleter is indeed a type,
although interpreting it as a constructor would be non-sensical here.
Use fmtlib to format the session attribute string when saving
the current session to .lttngrc. This eliminates a warning
emitted by clang (VLAs are not standard in C++).