Philippe Proulx [Thu, 14 Mar 2024 18:41:02 +0000 (14:41 -0400)]
extras/zsh-completion/_lttng: add missing "requires" verb
Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com>
Change-Id: I9221a31e2a265e93b17242b152a00e8e1851cab7
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Tue, 12 Mar 2024 19:57:31 +0000 (15:57 -0400)]
Clean-up: sessiond: use empty() instead of comparing size to 0
Harmonize the project's coding style a little by favoring the use of the
'empty()' method of containers rather than comparing their size to 0.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I22e6b7fe4d94d8f43362fe119b4ca6d480587291
Jérémie Galarneau [Tue, 12 Mar 2024 19:48:09 +0000 (15:48 -0400)]
Build fix: missing operator- for iterator on g++7
The project fails to build on 'g++ (SUSE Linux) 7.5.0' since its STL
implementation assumes that operator- is available for random access
iterators.
The build fails with the following error:
event_name.cpp:82:71: required from here
/usr/include/c++/7/bits/stl_iterator_base_funcs.h:104:21: error: no match for ‘operator-’ (operand types are ‘lttng::utils::random_access_container_wrapper<const bt_value*, const char*, event_name_set_operations>::_iterator<const lttng::utils::random_access_container_wrapper<const bt_value*, const char*, event_name_set_operations>, const char* const>’ and ‘lttng::utils::random_access_container_wrapper<const bt_value*, const char*, event_name_set_operations>::_iterator<const lttng::utils::random_access_container_wrapper<const bt_value*, const char*, event_name_set_operations>, const char* const>’)
A trivial implementation of that operator is provided and allows the
build to succeed.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ib1637e81e5cdc42cd5a142dcee21150ced9fcc55
Jérémie Galarneau [Fri, 15 Dec 2023 16:44:34 +0000 (11:44 -0500)]
Fix: relayd: live client not notified of inactive streams
Observed issue
--------------
Some LTTng-tools live test failures appear to show babeltrace2
hanging (failing to print expected events). The problem is intermittent,
but Kienan was able to develop a test case that's reproducible for him.
The test case performs the following steps:
- Start a ust application and leave it running
- Configure and then start an lttng live session
- Connect a live viewer (babeltrace)
- Run a second ust application
- Wait for the expected number of events
- In the failing case, no events are seen by babeltrace
Using per-uid buffers, the test typically completes normally. With
per-pid buffers the test fails, hanging indefinitely if waiting for the
specified number of events. While "hanging", babeltrace2 is polling the
relayd.
This affects babeltrace2 stable-2.0 and master while using
lttng-tools master.
For more information, see the description of bug #1406[1]
Cause
-----
When consuming a live trace captured in per-PID mode, Babeltrace
periodically requests the index of the next packet it should consume.
As part of the reply, it gets a 'flags' field which is used to announce
that new streams, or new metadata, are available to the viewer.
Unfortunately, these 'flags' are only set when the relay daemon has new
tracing data to deliver. They are not set when the relay daemon indicates
that the stream is inactive (see LTTNG_VIEWER_INDEX_INACTIVE).
In the average case where an application is spawned while others are
actively emitting events, a request for new data will result in a reply
that returns an index entry (code LTTNG_VIEWER_INDEX_OK) for an
available packet accompanied by the LTTNG_VIEWER_FLAG_NEW_STREAM flag.
This flag indicates to the viewer that it should request new
streams (using the LTTNG_VIEWER_GET_NEW_STREAMS live protocol command)
before consuming the new data.
In the cases where we observe a hang, an application is running but not
emitting new events. As such, the relay daemon periodically emits "live
beacons" to indicate that the session's streams are inactive up to a
given time 'T'.
Since the existing application remains inactive and the viewer is never
notified that new streams are available, the viewer effectively remains
"stuck" and never notices the new application being traced.
The LTTNG_VIEWER_FLAG_NEW_METADATA communicates a similar semantic with
regards to the metadata. However, ignoring it for inactive streams isn't
as deleterious: the same information is made available to the viewer the
next time it successfully requests a new index from the relay daemon.
This would only become a problem if the tracers start to express
non-layout data (like supplemental environment information, but I don't
see a real use-case) as part of the metadata stream that should be made
available downstream even during periods of inactivity.
Note that the same problem most likely affects the per-UID buffer
allocation mode when multiple users are being traced.
Solution
--------
On the producer end, LTTNG_VIEWER_FLAG_NEW_STREAM is set even when
returning an inactivity index.
Note that to preserve compatibility with older live consumers that don't
expect this flag in non-OK responses, the LTTNG_VIEWER_FLAG_NEW_STREAM
notification is repeated until the next LTTNG_VIEWER_GET_NEW_STREAMS
command that returns LTTNG_VIEWER_INDEX_OK.
The 'new_streams' state is no longer cleared from relay sessions during
the processing of the LTTNG_VIEWER_GET_NEXT_INDEX commands. Instead, it
is cleared when the viewer requests new streams.
On Babeltrace's end, the handler of the LTTNG_VIEWER_GET_NEXT_INDEX
command (lttng_live_get_next_index) is modified to expect
LTTNG_VIEWER_FLAG_NEW_STREAM in the cases where the command returns:
- LTTNG_VIEWER_INDEX_OK (as done previously),
- LTTNG_VIEWER_INDEX_HUP (new),
- LTTNG_VIEWER_INDEX_INACTIVE (new).
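The relay-side behaviour described above can be modelled as a small
state machine. The following Python sketch is only an illustration: the
class, its method names, and the flag value are assumptions made for
this example and do not mirror the relay daemon's actual implementation.

```python
from enum import Enum

# Illustrative value only; the authoritative definition lives in the
# live protocol headers.
FLAG_NEW_STREAM = 1 << 1


class IndexStatus(Enum):
    OK = "LTTNG_VIEWER_INDEX_OK"
    INACTIVE = "LTTNG_VIEWER_INDEX_INACTIVE"


class RelaySessionSketch:
    """Hypothetical stand-in for a relay session's 'new_streams' state."""

    def __init__(self):
        self.new_streams = False

    def add_stream(self):
        # A new application (per-PID buffers) announces a new stream.
        self.new_streams = True

    def get_next_index(self, has_packet):
        # Models LTTNG_VIEWER_GET_NEXT_INDEX: the flag is reported for
        # both OK and INACTIVE replies and is NOT cleared here.
        status = IndexStatus.OK if has_packet else IndexStatus.INACTIVE
        flags = FLAG_NEW_STREAM if self.new_streams else 0
        return status, flags

    def get_new_streams(self):
        # Models LTTNG_VIEWER_GET_NEW_STREAMS: only this command clears
        # the 'new_streams' state.
        self.new_streams = False
```

In this model, a viewer that ignores the flag on an inactivity reply
still sees it on every later reply until it requests the new streams.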
Drawbacks
---------
This is arguably a protocol change as none of the producers ever set the
NEW_METADATA/NEW_STREAM flags when indicating an inactive stream.
References
----------
[1] https://bugs.lttng.org/issues/1406
Fixes #1406
Change-Id: I84f53f089597ac7b22ce8bd0962d4b28112b7ab6
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 8 Mar 2024 21:18:29 +0000 (16:18 -0500)]
Clean-up: tests: bt2 plug-ins: modernize the plug-ins
By virtue of their use of the C Babeltrace 2 APIs, the test plug-ins
perform a fair amount of manual resource management.
To make it possible to adopt a more modern C++ style in those plug-ins,
a number of helpers are introduced.
Introduce reference wrappers for the Babeltrace 2 interface:
- value_ref: wraps a bt_value reference using std::unique_ptr
- message_const_ref: wraps a constant message reference using a
unique_ptr
- message_iterator_ref: wraps a message iterator reference using a
unique_ptr
- event_class_const_ref: wraps a constant event class reference using
a unique_ptr
A random_access_container_wrapper is specialized to wrap
bt_value arrays of strings.
In doing so, it is possible to eliminate the use of gotos and manual
reference management on error paths. Some struct/classes are renamed to
eliminate ambiguities that arose over the refactoring.
The changes allow some simplifications of the code flow in places which
are applied directly.
Change-Id: I25c148d7970cb89add55a86f2c162973d3d27e4a
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Tue, 12 Mar 2024 01:53:15 +0000 (21:53 -0400)]
Clean-up: typo in make_unique_wrapper comment
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Idd5d203bd26ef2e3d2eab94e30f9ef5f8e3a1d90
Jérémie Galarneau [Fri, 8 Mar 2024 21:17:46 +0000 (16:17 -0500)]
Move the lttng::free util under the lttng::memory namespace
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I40bf5aefaa8f441f470c0866b71b2957a6c30154
Jérémie Galarneau [Fri, 8 Mar 2024 17:06:30 +0000 (12:06 -0500)]
format: use unique_ptr to wrap unmangled string
Change-Id: I8459507a55caf2c77a21fcc3442bcde069b2601b
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 28 Sep 2023 20:54:42 +0000 (16:54 -0400)]
tests: Replace babelstats.pl with bt2 plugins
Observed Issue
==============
`tests/regression/tools/filtering/test_valid_filters` is a long running
test, especially when running as root and exercising the tests across
the kernel domain.
I observed that a sizable amount of time was being spent in the analysis
of the results using `babelstats.pl`.
Solution
========
Instead of using a script to parse the pretty output of babeltrace2, I
decided to write two C++ plugins to replicate the behaviour of the
`babelstats.pl` script.
I measured the time using `sudo -E time ./path/to/test`
| Test                 | Time with `babelstats.pl` | Time with bt2 plugins |
|----------------------|---------------------------|-----------------------|
| test_tracefile_count | 13.04s                    | 11.73s                |
| test_exclusion       | 22.75s                    | 22.07s                |
| test_valid_filter    | 301.04s                   | 144.41s               |
The switch to using babeltrace2 plugins reduces the runtime of the
`test_valid_filter` test (when running with kernel tests) by half. The
runtime changes to the other tests that were modified are not
significant.
Known drawbacks
===============
The field_stats plugin behaviour differs from `babelstats.pl` with
regards to enumeration fields ("container" in `babelstats.pl`). However,
no tests depend on that behaviour to pass.
The field_stats sink plugin doesn't perform a lot of run-time
error-checking of functions it invokes, and doesn't fully clean up all
the references it allocates through the babeltrace2 API. As the intended
usage is for short lived invocations with relatively small traces, the
principal drawback of this approach is that errors in the plugin may be
harder to debug.
Building tests of lttng-tools will now depend on having the babeltrace2
development headers and libraries available.
Change-Id: Ie8ebdd255b6901a7d0d7c4cd584a02096cccd4fb
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Tue, 26 Sep 2023 13:39:41 +0000 (09:39 -0400)]
tests: Run relayd-grouping tests by grouping type
Observed issue
==============
The `relayd-grouping/test_ust` test takes ~2 minutes to run. A
significant amount of that time is spent starting and stopping the relay
and session daemons.
Solution
========
Each test function is run with a different grouping setup for the
relayd. Rather than iterating over each test and then grouping
variations, the iteration can be changed to organize the tests run by
grouping setup. This allows us to start the relay and session daemons
once per grouping setup, rather than twice for each test function.
Furthermore, each test function is run twice: once with auto-generated
session names, once with user-defined session names. This behaviour can
be cut out to reduce the runtime of the test further.
On my development machine, the test went from running in 113s to 18s.
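The effect of reordering the loops can be illustrated by counting daemon
start/stop cycles. This is a Python sketch for illustration only: the
function names and the placeholder test and grouping names are
assumptions, not the identifiers used by the actual test script.

```python
def count_daemon_starts_per_test(test_functions, grouping_setups):
    """Previous ordering: daemons are restarted for every (test, grouping) pair."""
    starts = 0
    for _test in test_functions:
        for _grouping in grouping_setups:
            starts += 1  # start daemons, run one test, stop daemons
    return starts


def count_daemon_starts_per_grouping(test_functions, grouping_setups):
    """New ordering: daemons are started once per grouping setup."""
    starts = 0
    for _grouping in grouping_setups:
        starts += 1  # start daemons once for this grouping setup
        for _test in test_functions:
            pass  # run the test against the already running daemons
    return starts


# Hypothetical placeholders for the real test functions and groupings.
tests = ["test_a", "test_b", "test_c"]
groupings = ["grouping_1", "grouping_2"]
print(count_daemon_starts_per_test(tests, groupings))
print(count_daemon_starts_per_grouping(tests, groupings))
```

The number of start/stop cycles in the new ordering depends only on the
number of grouping setups, not on the number of test functions.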
Known drawbacks
===============
This no longer exercises the automatic session naming. I don't think
that the automatic session naming paths are pertinent with regards to
the grouping settings; however it appears it can impact output
directories (eg. in
`test_ust_uid_streaming_snapshot_add_output_custom_name`).
Change-Id: I89d8cb224e594dd68b7e8f3367d1907ecfa2bf13
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 7 Mar 2024 20:20:17 +0000 (15:20 -0500)]
tests: Split test_ust_constructor into several tests
Observed issue
==============
TAP parsers fail when parsing a single executable that contains
several plans. Eg.,
```
ok 44 - Found no unexpected events
PASS: ust/ust-constructor/test_ust_constructor.py 44 - Found no unexpected events
1..44
ERROR: ust/ust-constructor/test_ust_constructor.py - multiple test plans
ok 1 - Create a session
ERROR: ust/ust-constructor/test_ust_constructor.py 1 - Create a session # UNPLANNED
```
and
```
14:03:23 org.tap4j.parser.ParserException: Error parsing TAP Stream: Duplicated TAP Plan found.
14:03:23 at org.tap4j.parser.Tap13Parser.parseTapStream(Tap13Parser.java:257)
14:03:23 at org.tap4j.parser.Tap13Parser.parseFile(Tap13Parser.java:231)
14:03:23 at org.tap4j.plugin.TapParser.parse(TapParser.java:172)
14:03:23 at org.tap4j.plugin.TapPublisher.loadResults(TapPublisher.java:475)
14:03:23 at org.tap4j.plugin.TapPublisher.performImpl(TapPublisher.java:352)
14:03:23 at org.tap4j.plugin.TapPublisher.perform(TapPublisher.java:312)
14:03:23 at jenkins.tasks.SimpleBuildStep.perform(SimpleBuildStep.java:123)
14:03:23 at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:80)
14:03:23 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
14:03:23 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
14:03:23 at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
14:03:23 at hudson.model.Build$BuildExecution.post2(Build.java:179)
14:03:23 at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
14:03:23 at hudson.model.Run.execute(Run.java:1918)
14:03:23 at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
14:03:23 at hudson.model.ResourceController.execute(ResourceController.java:101)
14:03:23 at hudson.model.Executor.run(Executor.java:442)
14:03:23 Caused by: org.tap4j.parser.ParserException: Duplicated TAP Plan found.
14:03:23 at org.tap4j.parser.Tap13Parser.parseLine(Tap13Parser.java:354)
14:03:23 at org.tap4j.parser.Tap13Parser.parseTapStream(Tap13Parser.java:252)
14:03:23 ... 16 more
```
Cause
=====
09a872ef0b4e1432329aa42fecc61f50e9baa367 introduced multiple plans into
test_ust_constructor.
Solution
========
Split the script into several smaller test scripts sharing a common
import for data and the bulk of execution.
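As an aside, the condition that trips the parsers can be detected with a
few lines of Python. This is a minimal sketch, assuming that a plan is a
line of the form `1..N`; the function name is hypothetical and the check
is far less thorough than a real TAP parser.

```python
import re

# A TAP plan line such as "1..44", optionally followed by a directive.
PLAN_RE = re.compile(r"^1\.\.\d+\s*(#.*)?$")


def count_tap_plans(tap_output):
    """Return the number of plan lines found in a TAP stream."""
    return sum(
        1 for line in tap_output.splitlines() if PLAN_RE.match(line.strip())
    )
```

A stream produced by a single test script should yield a count of 1; any
higher count corresponds to the "multiple test plans" errors shown
above.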
Known drawbacks
===============
None.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I81649d714afe0e325996b730d5c72cfd5b28d1f8
Kienan Stewart [Tue, 23 Jan 2024 15:08:34 +0000 (10:08 -0500)]
tests: Add diagnostic info for kernel bug, warning, and oops
When test_select_poll_epoll fails with an error due to hitting a new
WARNING, OOPS, or BUG statement in dmesg, the user must go and read
the logs themselves to try and find the matching statements.
Providing the previous and new messages in diagnostic output will allow
a person reading the test results to more quickly ascertain if the
messages are pertinent to lttng-modules or not. That being said, there
is no guarantee that there are not other WARNINGs, OOPs, or BUGs in
dmesg between before and after that are pertinent.
Change-Id: Ida026dfe852cafdcc55979089c92995949e2ef0d
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 7 Mar 2024 19:01:10 +0000 (14:01 -0500)]
Clean-up: run clang-format 14 on the tree
Miscellaneous code style changes to correct little violations that
slipped through the cracks.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Id378ff3fa42cb69a8543b43c08d60b9a2f2c1c06
Kienan Stewart [Fri, 9 Feb 2024 15:23:39 +0000 (10:23 -0500)]
tests: Add C versions of gen-ust-events-constructor
Observed issue
==============
The constructor tests exercise only the case where C++ applications
are built.
Solution
========
Adding C test applications allows us to reuse the existing test
infrastructure to cover these cases.
Known drawbacks
===============
None.
Change-Id: Ib178dfd33cce0f1d0aa125aaee078c2dcb84ecb9
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 8 Feb 2024 19:29:49 +0000 (14:29 -0500)]
tests: test_ust_constructor: Split test_ust_constructor binary
Observed issue
==============
The single test executable gen-ust-events-constructor covers a lot of
different cases in a single executable. This decreases the legibility of
the test results and debuggability of the test application as many
different pieces are in play.
Solution
========
The test functionality covered by the executable is split into two main
parts: one using a dynamically loaded shared object, and the second
using a statically linked archive.
Known drawbacks
===============
Rather than creating a second test script, the same script is re-used to
run multiple TapGenerator instances sequentially. This could hamper future efforts
to parallelize python-based tests.
Change-Id: I86d247780ce5412570eada6ebadb83a01547f2b0
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Fri, 9 Feb 2024 14:16:26 +0000 (09:16 -0500)]
tests: Ensure `_process` is set in _TraceTestApplications
Observed issue
==============
An exception is thrown when deleting a _TraceTestApplication object that
has thrown an exception during its `__init__` method. Eg.
```
Exception ignored in: <function _TraceTestApplication.__del__ at 0x7fcbc9a21620>
Traceback (most recent call last):
File "/home/kstewart/src/efficios/lttng/master/src/lttng-tools/tests/utils/lttngtest/environment.py", line 348, in __del__
self._process.kill()
^^^^^^^^^^^^^
AttributeError: '_TraceTestApplication' object has no attribute '_process'
```
Similarly, this can happen to _WaitTraceTestApplication objects.
Cause
=====
The object's `_process` attribute is set during `__init__`; however,
if an exception is thrown during `subprocess.Popen` a value is never
assigned to the attribute.
Solution
========
A default value for the `_process` attribute is set and checked as
part of the condition when executing the `__del__` method.
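The pattern can be sketched as follows. This is a simplified
illustration, not the code from `environment.py`: the class name and its
constructor arguments are assumptions made for the example.

```python
import subprocess


class TraceTestApplicationSketch:
    """Hypothetical, reduced version of a test application wrapper."""

    def __init__(self, argv):
        # Set a default first so that __del__ can always read the
        # attribute, even if Popen raises below.
        self._process = None
        self._process = subprocess.Popen(argv)

    def __del__(self):
        # Only kill a process that was actually created and is still
        # running.
        if self._process is not None and self._process.poll() is None:
            self._process.kill()
            self._process.wait()
```

With the default in place, a failed `subprocess.Popen` call (for
example, a missing binary) still raises from `__init__`, but the
destructor no longer raises an `AttributeError` of its own.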
Known drawbacks
===============
None.
Change-Id: I2220ae764be49fafb3b977a5e723931421485d63
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Fri, 9 Feb 2024 14:08:07 +0000 (09:08 -0500)]
tests: Correct tap_generator skip() when count is greater than 1
Issue observed
==============
Output when skipping multiple tests was incorrectly printing the test case
number, eg.
```
ok 3 - Start session `session_ldr8cxix`
41
ok 4 # Skip: Test application 'gen-ust-events-constructor/gen-ust-events-constructor-so' not found
ok 6 # Skip: Test application 'gen-ust-events-constructor/gen-ust-events-constructor-so' not found
ok 8 # Skip: Test application 'gen-ust-events-constructor/gen-ust-events-constructor-so' not found
```
Cause
=====
The `test_number` was adding the current index to the already modified
`self._last_test_case_id`.
Solution
========
Use `self._last_test_case_id` with no changes.
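A reduced sketch of the corrected numbering follows. Only
`_last_test_case_id` comes from the description above; the class name,
the `lines` attribute, and the method signatures are assumptions made
for this illustration.

```python
class TapGeneratorSketch:
    """Hypothetical, reduced TAP generator that records its output lines."""

    def __init__(self, total_test_count):
        self._last_test_case_id = 0
        self.lines = ["1..{}".format(total_test_count)]

    def ok(self, description):
        self._last_test_case_id += 1
        self.lines.append(
            "ok {} - {}".format(self._last_test_case_id, description)
        )

    def skip(self, reason, count=1):
        for _ in range(count):
            self._last_test_case_id += 1
            # The counter was just advanced, so it is used as-is. Adding
            # the loop index on top of it produced 4, 6, 8 instead of
            # 4, 5, 6.
            self.lines.append(
                "ok {} # Skip: {}".format(self._last_test_case_id, reason)
            )
```

Skipping three tests after three passing ones now numbers them 4, 5,
and 6.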
Known drawbacks
===============
None.
Change-Id: I8ff16b83619cf6e6db2636eeccd58725cc03d0f8
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 8 Feb 2024 14:02:48 +0000 (09:02 -0500)]
tests: test_ust_constructor: Use a C-compiled shared object
Similar to the previous change, this change splits the c-style
constructors for the shared object into a separate object which can be
compiled with gcc instead of g++.
This makes it possible to test that the constructors are traced even if
LTTng-UST uses the LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP build
configuration.
Change-Id: Icd96cb30cedc1615951a6fec3c72731776f95d81
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 8 Feb 2024 13:46:46 +0000 (08:46 -0500)]
tests: test_ust_constructor: Use a C-compiled static archive
Observed issue
==============
The test output describes the tracepoint as
`constructor_c_provider_static_archive`, which can be a bit misleading.
The tracepoints are indeed emitted inside C-style constructors. However,
as the tracepoints are being compiled inside a C++ translation unit,
they were never traceable when using a heap allocated implementation. If
the static archive is compiled as C and linked against the C++
application, the tracepoints are expected to always be visible.
Solution
========
In splitting the c-style constructors for the static archive into a
separate object the compilation can be made to use gcc instead of g++.
Drawback
========
This change neither keeps a C-style constructor inside a C++ application
nor asserts that it is indeed not traced when compiled using a heap
allocated implementation.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3837fe318b2f8e1d9572ee0bfb6f6bbbd047c5f5
Kienan Stewart [Wed, 7 Feb 2024 20:49:26 +0000 (15:49 -0500)]
tests: Handle test failures for ust-constructors with heap allocation
Observed issue
==============
A number of tests from `ust/ust-constructor/test_ust_constructor.py`
fail when compiled with gcc-4.8 (observed on SLES12SP5). Eg.
```
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 8 - Found expected event name="tp_a:constructor_c_provider_static_archive" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 10 - Found expected event name="tp_a:constructor_cplusplus_provider_static_archive" msg="global - static archive define and provider"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 11 - Found expected event name="tp:constructor_c_across_units_before_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 12 - Found expected event name="tp:constructor_cplusplus" msg="global - across units before define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 13 - Found expected event name="tp:constructor_c_same_unit_before_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 14 - Found expected event name="tp:constructor_c_same_unit_after_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 15 - Found expected event name="tp:constructor_cplusplus" msg="global - same unit before define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 16 - Found expected event name="tp:constructor_cplusplus" msg="global - same unit after define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 17 - Found expected event name="tp:constructor_c_across_units_after_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 18 - Found expected event name="tp:constructor_cplusplus" msg="global - across units after define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 19 - Found expected event name="tp:constructor_c_same_unit_before_provider" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 20 - Found expected event name="tp:constructor_c_same_unit_after_provider" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 21 - Found expected event name="tp:constructor_cplusplus" msg="global - same unit before provider"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 34 - Found expected event name="tp:destructor_cplusplus" msg="global - same unit before provider"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 35 - Found expected event name="tp:destructor_cplusplus" msg="global - across units after define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 36 - Found expected event name="tp:destructor_cplusplus" msg="global - same unit after define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 37 - Found expected event name="tp:destructor_cplusplus" msg="global - same unit before define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 38 - Found expected event name="tp:destructor_cplusplus" msg="global - across units before define"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 39 - Found expected event name="tp_a:destructor_cplusplus_provider_static_archive" msg="global - static archive define and provider"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 41 - Found expected event name="tp:destructor_c_across_units_after_provider" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 42 - Found expected event name="tp:destructor_c_same_unit_after_provider" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 43 - Found expected event name="tp:destructor_c_same_unit_before_provider" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 44 - Found expected event name="tp:destructor_c_across_units_after_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 45 - Found expected event name="tp:destructor_c_same_unit_after_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 46 - Found expected event name="tp:destructor_c_same_unit_before_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 47 - Found expected event name="tp:destructor_c_across_units_before_define" msg="None"
12:22:17 FAIL: ust/ust-constructor/test_ust_constructor.py 48 - Found expected event name="tp_a:destructor_c_provider_static_archive" msg="None"
```
Cause
=====
As gcc-4.8 and earlier don't support C99 compound literals, the
lttngust `ust-compiler.h` falls back to using heap allocated
compound literals[1][2].
The probe registration in these cases is done via a C++ object[3].
As C-style constructors are executed before the C++ runtime is
processed, the probe is not yet registered[4].
In a case where g++ <= 4.8 is being used or
`-DLTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP` is defined, the
following tracepoints will not be recorded:
* C-style constructors and destructors in statically linked archives
* C-style constructors and destructors in the application itself
* Some C++ constructors and destructors invoked during the
initialization of the static global variables
* Note: this depends on the initialization order both between translation
units, which is not specified, and the initialization order (usually
lexicographical) within a given translation unit.
This is a known limitation; however, the test does not support
verifying that it's being run in such a situation.
Solution
========
A small program has been added which returns a different status code
depending on whether `LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP` is
defined or not.
The test script uses this application to signal that certain events
may fail (in that they may be present, or they may be absent).
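The interaction between the test script and the helper program might
look like the following Python sketch. Everything here is an assumption
made for illustration: the function names, the convention that a
non-zero exit status means the macro is defined, and the stand-in
command used in place of the real helper program.

```python
import subprocess
import sys


def stand_in_probe(exit_status):
    """Build a stand-in command that exits with the given status.

    It replaces the real helper program for the purpose of this sketch.
    """
    return [sys.executable, "-c", "raise SystemExit({})".format(exit_status)]


def heap_allocation_in_use(probe_argv):
    # Assumption: a non-zero exit status signals that
    # LTTNG_UST_ALLOCATE_COMPOUND_LITERAL_ON_HEAP is defined.
    return subprocess.run(probe_argv).returncode != 0


def event_check_passes(found, may_fail):
    # An event that "may fail" is accepted whether it is present or not.
    return found or may_fail
```

The check for an expected event then only fails when the event is
missing and was not marked as allowed to fail.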
Drawbacks
=========
None.
References
==========
[1]: https://github.com/lttng/lttng-ust/commit/e1904921db97b70d94e69f0ab3264c6f7fe62f32
[2]: https://github.com/lttng/lttng-ust/commit/7edfc1722684982b9df894c054d69808dc588a6a
[3]: https://github.com/lttng/lttng-ust/commit/05bfa3dc3a6e6b2ece3686a5f384b6645c2a5010
[4]: https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/include/lttng/ust-compiler.h#L110
Change-Id: I49159df4f85126c641aaf5fb0a8b5b22fd91bf12
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Xiangyu Chen [Mon, 12 Feb 2024 14:23:54 +0000 (09:23 -0500)]
tests: add check_skip_kernel_test to check root user and lttng kernel modules
The current tests run both userspace and kernel tests. Some use cases
only use lttng for one kind of tracing on an embedded device (e.g.
userspace). In this scenario, the kernel modules might not be installed
in the target rootfs, so the test cases fail and exit.
Add LTTNG_TOOLS_DISABLE_KERNEL_TESTS to skip the lttng kernel feature
tests. This flag can be set via "make":
make check LTTNG_TOOLS_DISABLE_KERNEL_TESTS=1
When this flag is set, all kernel-related test cases are marked as
SKIP in the results.
Since LTTNG_TOOLS_DISABLE_KERNEL_TESTS is checked in the
check_skip_kernel_test function, and many test cases also need to check
for root permission, the root permission check is merged into
check_skip_kernel_test.
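The combined check can be summarized with the Python sketch below. It is
an illustration of the decision logic only, not a transcription of the
test utilities: the parameters, the skip messages, and the comparison of
the flag against "1" are assumptions.

```python
import os


def check_skip_kernel_test(environ=None, euid=None):
    """Return a skip reason, or None if the kernel tests should run."""
    environ = os.environ if environ is None else environ
    euid = os.geteuid() if euid is None else euid

    # Assumption: the flag is considered set when its value is "1",
    # as in `make check LTTNG_TOOLS_DISABLE_KERNEL_TESTS=1`.
    if environ.get("LTTNG_TOOLS_DISABLE_KERNEL_TESTS") == "1":
        return "LTTNG_TOOLS_DISABLE_KERNEL_TESTS was set"

    # Kernel tracing requires root privileges.
    if euid != 0:
        return "root user is needed for kernel tests"

    return None
```

A test case calls the single function and marks its kernel tests as
SKIP, with the returned reason, whenever the result is not None.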
Change-Id: I49a1f642a9869c21a69e0186c296fd917bd7b525
Signed-off-by: Xiangyu Chen <xiangyu.chen@windriver.com>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Fri, 1 Mar 2024 18:09:51 +0000 (13:09 -0500)]
Fix: force _lttng python binding to be linked with g++
Observed issue
==============
On Enterprise Linux 7 CI nodes, several tests using the python binding
were failing with errors such as the following:
```
ERROR: ust/exit-fast/test_exit-fast
===================================
Warning: Failed to produce a random seed using getrandom(), falling back to pseudo-random device seed generation which will block until its pool is initialized: getrandom() is not supported by this platform [getrandom_nonblock() random.cpp:90]
Traceback (most recent call last):
File "/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el7-amd64/src/lttng-tools/extras/bindings/swig/python/lttng.py", line 24, in swig_import_helper
fp, pathname, description = imp.find_module('_lttng', [dirname(__file__)])
File "/usr/lib64/python3.6/imp.py", line 297, in find_module
raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named '_lttng'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./ust/exit-fast/test_exit-fast.py", line 19, in <module>
from test_utils import *
File "/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el7-amd64/src/lttng-tools/tests/utils/test_utils.py", line 24, in <module>
from lttng import *
File "/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el7-amd64/src/lttng-tools/extras/bindings/swig/python/lttng.py", line 34, in <module>
_lttng = swig_import_helper()
File "/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el7-amd64/src/lttng-tools/extras/bindings/swig/python/lttng.py", line 26, in swig_import_helper
import _lttng
ImportError: /home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el7-amd64/src/lttng-tools/extras/bindings/swig/python/.libs/_lttng.so: undefined symbol: _ZNSt13runtime_errorC2EPKc
ERROR: ust/exit-fast/test_exit-fast - missing test plan
```
The link mode can be seen in the commands used to do the linking in the
CI node build logs. For example,
```
libtool: link: gcc -shared -fPIC -DPIC .libs/lttng_wrap.o -Wl,--whole-archive ../../../../src/common/.libs/libsessiond-comm.a ../../../../src/common/.libs/libcommon-gpl.a -Wl,--no-whole-archive -Wl,-rpath -Wl,/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el8-amd64/src/lttng-tools/src/lib/lttng-ctl/.libs -Wl,-rpath -Wl,/build/lib64 -L/home/jenkins/workspace/lttng-tools_master_elbuild/babeltrace_version/stable-2.0/build/std/conf/std/liburcu_version/master/platform/el8-amd64/deps/build/lib64 ../../../../src/lib/lttng-ctl/.libs/liblttng-ctl.so -lxml2 -L/build/lib64 -lurcu -lurcu-common -lurcu-cds -lrt -pthread -g -O2 -pthread -Wl,-soname -Wl,_lttng.so.0 -o .libs/_lttng.so.0.0.0
```
Cause
=====
Automake chooses the link mode based on the types of files in the
library or executable. Given that the generated bindings are only C
code, automake uses the gcc link mode.
Solution
========
By adding a dummy (nonexistent) C++ source file to the library,
automake can be 'forced' to switch the link mode to `g++`.
Example link command in `g++` mode:
```
libtool: link: g++-10 -std=gnu++11 -fPIC -DPIC -shared -nostdlib
/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/crti.o
/usr/lib/gcc/x86_64-linux-gnu/10/crtbeginS.o .libs/lttng_wrap.o
-Wl,--whole-archive ../../../../src/common/.libs/libsessiond-comm.a
../../../../src/common/.libs/libcommon-gpl.a -Wl,--no-whole-archive
-Wl,-rpath
-Wl,/home/kstewart/src/efficios/lttng/master/src/lttng-tools/src/lib/lttng-ctl/.libs
-Wl,-rpath -Wl,/home/kstewart/src/efficios/lttng/master/usr/lib
-Wl,-rpath -Wl,/home/kstewart/src/efficios/lttng/master/usr/lib
../../../../src/lib/lttng-ctl/.libs/liblttng-ctl.so -lxml2
-L/home/kstewart/src/efficios/lttng/master/usr/lib
/home/kstewart/src/efficios/lttng/master/usr/lib/liburcu.so
/home/kstewart/src/efficios/lttng/master/usr/lib/liburcu-cds.so
/home/kstewart/src/efficios/lttng/master/usr/lib/liburcu-common.so
-lrt -L/usr/lib/gcc/x86_64-linux-gnu/10
-L/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu
-L/usr/lib/gcc/x86_64-linux-gnu/10/../../../../lib
-L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu
-L/usr/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/10/../../.. -lstdc++
-lm -lc -lgcc_s /usr/lib/gcc/x86_64-linux-gnu/10/crtendS.o
/usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/crtn.o -g
-O2 -fuse-ld=lld -pthread -Wl,-soname -Wl,_lttng.so.0 -o
.libs/_lttng.so.0.0.0
```
Known drawbacks
===============
None.
Change-Id: I5f1dedec435089518e36cc12cd09c2bb151adb67
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 21 Feb 2024 13:57:43 +0000 (08:57 -0500)]
tests: Add test for live viewer hanging when connecting after a clear
References: https://review.lttng.org/c/lttng-tools/+/11819
Change-Id: Ic40f3ee674657a802d4081e008cdb67247cd70ff
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 21 Feb 2024 13:57:19 +0000 (08:57 -0500)]
tests: Add mechanism to start relayd in python testing environment
Change-Id: I787528c4d281d1047d1ab119bde86c95decb9cca
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 21 Feb 2024 13:56:31 +0000 (08:56 -0500)]
tests: Add clear command to python tests
Change-Id: I9eba90a4ebfe2b983a2fac1344d6a472d8b1c849
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Sat, 17 Feb 2024 13:57:47 +0000 (08:57 -0500)]
Fix: relayd: live: dispose of zombie viewer metadata stream
Issue observed
==============
In the CI, builds on SLES15SP5 frequently experience timeouts. From
prior inspections, there are hangs during
tests/regression/tools/clear/test_ust while waiting for babeltrace to
exit.
It is possible to reproduce the problem fairly easily:
$ lttng create --live
$ lttng enable-event --userspace --all
$ lttng start
# Launch an application that emits a couple of events
$ ./my_app
$ lttng stop
# Clear the data, this eventually results in the deletion of all
# trace files on the relay daemon's end.
$ lttng clear
# Attach to the live session from another terminal
$ babeltrace -i lttng-live net://...
# The 'destroy' command completes, but the viewer never exits.
$ lttng destroy
Cause
=====
After the clear command completes, the relay daemon no longer has any
data to serve. We notice that the live client loops endlessly, repeatedly
sending GET_METADATA requests. In response, the relay daemon replies
with the NO_NEW_METADATA status.
In concrete terms, the viewer_get_metadata() function short-circuits to
send that reply when it sees that the metadata stream has no active
trace chunk (i.e., there are no backing files from which to read the
data at the moment).
This situation is not abnormal in itself: it is legitimate for a client
to wait for the metadata to become available again. For example, in the
reproducer above, it would be possible for the user to restart the
tracing (lttng start), which would create a new trace chunk and make the
metadata stream available. New events could also be emitted following
this restart.
However, when a session's connection is closed, there is no hope that
the metadata stream will ever transition back to an active trace chunk.
Solution
========
When the metadata stream has no active chunk and the corresponding
consumerd-side connection has been closed, there is no way the relay
daemon will be able to serve the metadata contents to the client.
As such, the viewer stream can be disposed-of since it will no longer be
of any use to the client. Since some client implementations expect at
least one GET_METADATA command to result in NO_NEW_METADATA, that status
code is initially returned.
Later, when the client emits a follow-up GET_METADATA request for that
same stream, it will receive an "error" status indicating that the
stream no longer exists. This situation is not treated as an error by
the clients. For instance, babeltrace2 will simply close the
corresponding trace and indicate it ended.
The 'no_new_metadata_notified' flag doesn't appear to be necessary to
implement the behaviour expected by the clients (seeing at least one
NO_NEW_METADATA status reply for every metadata stream). The
viewer_get_metadata() function is refactored a bit to drop the global
reference to the viewer metadata stream as it exits, while still
returning the NO_NEW_METADATA status code.
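The behaviour described above can be modelled with a small toy sketch.
It is not the relay daemon's code: the `RelaySketch` class, its fields
and the `Status` names are hypothetical and only mirror the sequence of
replies a client would observe.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical reply codes for this sketch only.
    OK = "ok"
    NO_NEW_METADATA = "no new metadata"
    ERR = "error"


class RelaySketch:
    """Toy model of the viewer metadata streams known to a relay daemon."""

    def __init__(self):
        # stream id -> (has_active_chunk, connection_closed)
        self.viewer_streams = {}

    def add_stream(self, stream_id, has_active_chunk, connection_closed):
        self.viewer_streams[stream_id] = (has_active_chunk, connection_closed)

    def get_metadata(self, stream_id):
        state = self.viewer_streams.get(stream_id)
        if state is None:
            # The stream was disposed of earlier: the client receives an
            # "error" status and can close the corresponding trace.
            return Status.ERR

        has_active_chunk, connection_closed = state
        if has_active_chunk:
            # Backing files exist; serving the contents is out of scope here.
            return Status.OK

        if connection_closed:
            # No chunk and no hope of getting one: drop the stream, but
            # still reply NO_NEW_METADATA for this request.
            del self.viewer_streams[stream_id]
        return Status.NO_NEW_METADATA
```

With a stream that has no active chunk and a closed connection, the
first request yields NO_NEW_METADATA and the follow-up request yields
the error status. With an open connection, the client keeps receiving
NO_NEW_METADATA and may keep waiting.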
Known drawbacks
===============
None.
Note
====
The commit message of e8b269fa provides more details behind the
intention of the 'no_new_metadata_notified' flag.
Change-Id: Ib1b80148d7f214f7aed221d3559e479b69aedd82
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Mon, 19 Feb 2024 20:45:49 +0000 (15:45 -0500)]
Docs: relayd: viewer stream has no lock member
The viewer stream object has no lock. This outdated comment can be
removed.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I81fd56dbcd4ffb7637f63c098c58cf2b59dabae3
Jérémie Galarneau [Mon, 19 Feb 2024 18:20:53 +0000 (13:20 -0500)]
Docs: relayd: received metadata position is reset on clear
Correct a comment in the relayd documentation that incorrectly mentioned
the 'sent' position being reset by the 'clear' command.
The correct behavior resets the metadata stream's 'received' position to
'0', not the 'sent' position. The relay daemon expects to re-receive the
metadata contents that match the previous contents up to the previous
'received' position.
The client, however, does not expect to receive the original contents of
the metadata stream a second time.
Note that from the relay daemon's perspective, a "clear" command does
not exist per se. It is implemented as a stream rotation that moves the
streams from a trace chunk that has an associated 'DELETE' close command
to a new one (which may also be a 'nil' chunk).
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I598fe736c57ab3e934ff0207674d0ecff2bf3e74
Philippe Proulx [Wed, 30 Aug 2023 16:33:13 +0000 (12:33 -0400)]
Add Zsh completion files for public LTTng CLI commands
Zsh is an extended Bourne shell with many improvements, including some
features of Bash, ksh, and tcsh. Zsh features a powerful completion
system which makes it possible to improve the interactive user
experience greatly when using an LTTng command.
Those four new files are Very Sophisticated Zsh completion files,
especially `extras/zsh-completion/_lttng`.
Notable features for all commands:
* Support of LTTng 2.5 through LTTng 2.14, with version-specific
completion.
Set `LTTNG_ZSH_COMP_IGNORE_VERSION_LIMIT=1` to disable the upper limit
of the version check. This should be safe most of the time, but if
there's a breaking change in option/argument interaction, the
completions might be wrong.
* Exclusion of options and arguments depending on the current options
and arguments, according to the manual pages.
For example, for `lttng enable-channel`, you cannot specify
`--buffers-uid` if you already specified `--kernel` (and vice versa).
Notable features for the `lttng` command:
* Full support, except for the condition and action specifiers of the
`add-trigger` subcommand: although I may now add "skillful in Zsh
completion" to my resume, the positional design of `--condition` and
`--action` needs even more spicy Zsh wizardry which I didn't explore
yet.
* Custom tags and support for the `verbose` style to customize the
completion behaviour and look with `zstyle`.
* For any dynamic completion (relying on some output of the `lttng`
command), connect to the right session daemon depending on the
selected tracing group (`g`/`--group`).
* User/group ID completion with displayed corresponding Unix user/group
names.
* Dynamic recording session name completion with a summary of properties
(activity and mode).
Only the relevant ones are added to the completion set. For example,
names of active sessions are not part of the completion set for
`lttng start`.
* Current recording session taken into account for subcommands needing
one when you don't specify the dedicated recording session
option/argument.
* Dynamic channel name completion depending on the selected recording
session and tracing domain, with a summary of properties (status,
tracing domain, event record loss mode).
Only the relevant ones are added to the completion set. For example,
names of enabled channels are not part of the completion set for
`lttng enable-channel`.
* Dynamic recording event rule name condition completion for
`lttng disable-event`.
* Dynamic instrumentation point name completion depending on the
selected tracing domain.
* Dynamic context field type completion depending on the selected
tracing domain.
* Log level name completion depending on the selected tracing domain.
* Dynamic trigger name completion depending on the selected owner
user ID.
Notable features for the `lttng-sessiond` command:
* LTTng kernel probe module name completion (checks within the
`/usr/lib/modules` directory).
Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If8c2c58a50664f41ecc41ab1df72879127d1cd02
Christophe Bedard [Tue, 23 Jan 2024 22:52:55 +0000 (14:52 -0800)]
configure.ac: introduce --{disable,enable}-lib-lttng-ctl
The goal is to be able to only build liblttng-ctl, for example without
needing to build bin/lttng.
Since liblttng-ctl is required when building some of the binaries,
./configure will fail if --disabled (explicitly) unless those binaries
are --disabled too.
Previously, the following would result in liblttng-ctl not getting
built, but it now gets built by default:
./configure \
--disable-bin-lttng \
--disable-bin-lttng-relayd \
--disable-bin-lttng-sessiond
Change-Id: I9338c46e64c031360aa762a3ce891511a3dbba39
Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 29 Feb 2024 17:15:12 +0000 (12:15 -0500)]
Cleanup: run black on tree
Change-Id: I8974e9955dd4200b50b954697a03f428e4474c89
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 29 Feb 2024 17:06:49 +0000 (12:06 -0500)]
Misc: add pyproject.toml
This file provides the metadata of what versions of python are required
to run `make check` and tests[1]. The required versions of python will
also inform `black` for choosing which syntaxes and linting formats to
use[2].
An additional section for the `black` linter[3] is provided. While
empty, it allows `blacken-mode`[4] for emacs to use the
`blacken-only-if-project-is-blackened` setting.
References
==========
[1]: https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#python-requires
[2]: https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html#t-target-version
[3]: https://black.readthedocs.io/en/stable/usage_and_configuration/the_basics.html#where-black-looks-for-the-file
[4]: https://github.com/pythonic-emacs/blacken
Change-Id: I07765dbd088059f336d278a6212ac8d4ee87c79d
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 29 Feb 2024 19:52:21 +0000 (14:52 -0500)]
Tests: namespace TAP_AUTOTIME under LTTNG_TESTS
Test suite variables that are user exposed are usually namespaced under
the 'LTTNG_TESTS_' prefix.
Change-Id: I7ec31efa08050460e2a1e274ef35889b97768f87
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 29 Feb 2024 19:42:41 +0000 (14:42 -0500)]
Tests: standardize TAP_AUTOTIME parsing in python
Disable TAP autotime only when 'TAP_AUTOTIME == 0'. Also remove the
version check as we don't support Python <= 3.3 and there is already an
assertion in the code.
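As an illustration only, the rule "disable only when the value is 0"
could look like the sketch below. The function name and the handling of
unparsable values are assumptions, not the code touched by this commit.

```python
import os


def autotime_enabled(environ=None):
    """Return False only when TAP_AUTOTIME parses to the integer 0."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TAP_AUTOTIME")
    if raw is None:
        # Unset: automatic timing stays enabled by default.
        return True
    try:
        return int(raw) != 0
    except ValueError:
        # Assumption of this sketch: an unparsable value does not
        # disable the feature.
        return True
```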
Change-Id: Idf8badb5a27b1a01cbe7c230495eec342b9c3878
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Wed, 17 Jan 2024 15:50:26 +0000 (10:50 -0500)]
Tests: Disable 'stdbuf' when TAP autotime is disabled
The 'stdbuf' command is used by default in 'tap-driver.sh' to force line
buffering. It was added to help with TAP autotime output to log files.
However, 'stdbuf' causes issues in our 32-64 integration tests where we
mix 32 and 64 bit binaries. It uses an LD_PRELOAD library that is not
in a multiarch path which results in the following warning message on
stderr when a 32-bit binary is executed on a 64-bit system:
ERROR: ld.so: object '/usr/libexec/coreutils/libstdbuf.so' from
LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS64): ignored.
Many of our tests compare the content of stderr to an expected file
which results in their failure.
We already have an environment variable "TAP_AUTOTIME" to disable the
autotime feature, make it disable the use of 'stdbuf' as well.
Change-Id: I307cbfcddd7772f69e8211c51b03fb9a3da8e841
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Michael Jeanson [Thu, 29 Feb 2024 19:34:14 +0000 (14:34 -0500)]
Tests: fix TAP_AUTOTIME parsing in tap.c
The atoi() function can return '0' on error and is not guaranteed to set
errno accordingly. Use strtol() instead which can also return 0 on error
but will set errno properly.
Check errno after the function call to distinguish between the value '0'
and the error code '0'.
Change-Id: I5a82bb0f18d7e398dc3594aede5a38e6fc10dd7b
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 29 Feb 2024 19:21:28 +0000 (14:21 -0500)]
Tests: lttngtest: confusing comment regarding supported python versions
The comment in _get_time_ns() caused me to do a double-take since the
function checks for Python > 3.3 and mentions Python 3.8 for unrelated
reasons.
I am clarifying the comments a bit to explain the reason for the two
versions being mentioned.
Also, the version check is changed to match 3.3 (although we don't
support that version) since that is when time.monotonic was introduced.
A type annotation is also added to clarify the function's intended usage
(i.e., it will not return fractional nanoseconds).
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9f4fd7cf1d0673e80f6bce9e3cf86b9697fe3f91
Kienan Stewart [Mon, 12 Feb 2024 21:02:32 +0000 (16:02 -0500)]
docs: coding style: shell scripts should be linted using shellcheck
Change-Id: I6532c73c2217918619a7d08d03a5a8da156beede
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Mon, 12 Feb 2024 21:00:42 +0000 (16:00 -0500)]
docs: coding style: Add usage of black for formatting python code
Change-Id: I73131b9cae9de05be0e85d197edb1976b762402e
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Mon, 12 Feb 2024 20:22:17 +0000 (15:22 -0500)]
docs: Group C++ coding style points together under a heading
In preparation for adding coding styles for other languages that are
used in parts of the project, eg. shell (bash), and python.
Change-Id: I9a7c234d1aed6814f80ee5d448be804f83b82763
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Fri, 9 Feb 2024 17:08:11 +0000 (12:08 -0500)]
Docs: Add the example commit from CONTRIBUTING as a template
Observed issue
==============
I found myself referring to the contributing guide each time I made a
new commit to ensure that I had the appropriate style and sections.
Solution
========
A stub template `.commit_template` has been added and the instructions
in `CONTRIBUTING.md` indicate that the local checkout can be
configured to use it as a template for new commits.
Known drawbacks
===============
None.
Change-Id: Idfacd4d726657cb57f193f0a3375a840d8a9c746
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Tue, 27 Feb 2024 04:09:37 +0000 (23:09 -0500)]
lttng: enable-event: print kernel tracer status on error
Use the new kernel status query API to present a more descriptive error
when a kernel event rule fails to be enabled.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Icad2518bacec1a9ab3103a44052c0085eadda1a7
Jérémie Galarneau [Tue, 27 Feb 2024 02:57:32 +0000 (21:57 -0500)]
lttng: enable-event: use the terminology of the documentation
Rework most of the human-readable messages of the enable-event to use
the terminology used throughout the online documentation and the man
pages.
Some clean-ups are also done to follow the rest of the project's
conventions, such as quoting user input with back-ticks, not ending
messages with a period, etc.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I00d89d6e3c32ccbde60081ef427a099fb8cd206e
Jérémie Galarneau [Mon, 26 Feb 2024 20:56:27 +0000 (15:56 -0500)]
lttng: enable-event: treat 'all' case as a regular pattern
The cmd_enable_events function is essentially duplicated to handle the
"all events" case, but it simply substitutes the event name for '*'.
The case can be eliminated if we simply add '*' as one of the patterns
to enable when the '--all' option is used.
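The idea can be sketched as follows. This is a hypothetical helper, not
the client's actual code: `--all` simply contributes the `*` pattern to
the list that the regular code path already handles.

```python
def patterns_to_enable(event_names, enable_all=False):
    """Build the list of event name patterns to enable."""
    patterns = list(event_names)
    if enable_all:
        # '--all' is just another pattern; no dedicated code path needed.
        patterns.append("*")
    return patterns
```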
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If4235c391c2ce38a67208184c97bbe0f5c40c97d
Jérémie Galarneau [Mon, 11 Dec 2023 21:57:37 +0000 (16:57 -0500)]
lttng: enable-event: remove gotos from cmd_enable_event
The use of automated resource management makes it possible to remove the
numerous uses of gotos in cmd_enable_event. Replace them by simple
return statements.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic9f3207d8b1e9c5b044506e0233468230db1acd0
Jérémie Galarneau [Mon, 11 Dec 2023 19:10:09 +0000 (14:10 -0500)]
lttng: enable-event: replace raw session string by std::string
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2eb0e64e690d8de70bebc25f069589e0151907e8
Jérémie Galarneau [Mon, 11 Dec 2023 19:00:13 +0000 (14:00 -0500)]
lttng: enable-event: wrap the use of poptContext into a unique_ptr
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Idf6cdb405cc4b316100b832dfe1b850d77ee161b
Jérémie Galarneau [Mon, 11 Dec 2023 18:48:42 +0000 (13:48 -0500)]
lttng: enable-event: wrap mi_writer use in a unique_ptr
To allow further clean-ups and simplify the use of STL containers, wrap
the manually managed mi_writer instance.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8a0b21f0647460333bae1c0a2afeb5d2193a2c9b
Jérémie Galarneau [Wed, 6 Dec 2023 20:07:38 +0000 (15:07 -0500)]
lttng: enable-event: move static vars/funcs to anonymous NS
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: If2661aa3c73b385e1cc0e0a8e038d03491462113
Jérémie Galarneau [Wed, 6 Dec 2023 19:35:27 +0000 (14:35 -0500)]
lttng: enable-channel: move kernel tracer status check to util
In order to re-use the same logic in the enable-event command, move the
kernel status checking and printing to a common utility function.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I85956298c27fd7073ac02aac901d39e3d82bb280
Jérémie Galarneau [Tue, 5 Dec 2023 21:06:07 +0000 (16:06 -0500)]
lttng-ctl: manage memory automatically in kernel tracer status check
Use a unique_ptr to manage the dynamically allocated payload returned by
lttng_ctl_ask_sessiond.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I685fc03c1da7ff7903503ed82636d27f98f9895e
Jérémie Galarneau [Tue, 5 Dec 2023 19:51:25 +0000 (14:51 -0500)]
sessiond: kernel: log kernel tracer status on change
Log the value of the newly-introduced kernel tracer status when it is
set.
This will make it easier to confirm that the
lttng_get_kernel_tracer_status API returns a given value given the logs
of the session daemon.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I7a49493d05bf34868acd32f7f7fba301a41f69d5
Jérémie Galarneau [Tue, 5 Dec 2023 19:26:42 +0000 (14:26 -0500)]
sessiond: kernel: clean-up: move static variables to anonymous namespace
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3ee0e5ff2ff4c47c23f6f27f5707be08107f70f1
Kienan Stewart [Wed, 22 Nov 2023 21:28:01 +0000 (16:28 -0500)]
sessiond: lttng: Add command to check kernel tracer status
Issue observed
--------------
When `lttng enable-channel --kernel` fails, little feedback is
available to the user to help them understand the cause.
Eg.
```
Error: Channel asdf: Kernel tracer not available (session auto-
20231123-092621)
```
Solution
--------
The semantic status of the kernel tracer is tracked and persisted in
the session daemon (through `init_kernel_tracer` and
`cleanup_tracer_tracer`).
A new client command `lttng_kernel_tracer_status` is added to request
the current value of the `kernel_tracer_status`. The `lttng` client
uses this command after enabling a kernel-domain channel fails to
provide the user with a more specific cause of the failure.
Eg.
```
Error: Channel asdf: Kernel tracer not available (session auto-
20231123-092621)
Missing one or more required kernel modules
Consult lttng-sessiond logs for more information
```
The kernel tracer status is tracked with an enum defined in
`include/lttng/kernel.h` to avoid passing potentially different errno values
or locale-dependent strings between the LTTng client and session
daemon.
Loading modules and checking signatures can fail with a number of
different errno values. For example:
C.f. https://gitlab.com/linux-kernel/stable/-/blob/master/kernel/module/signing.c#L70
* `EKEYREJECTED`
* Any other error code
C.f. https://gitlab.com/linux-kernel/stable/-/blob/master/Documentation/security/keys/core.rst
* `EKEYREVOKED`
* `EKEYEXPIRED`
* `ENOKEY`
* Others, such as `ENOMEM`
Known drawbacks
---------------
None.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I2ae4b188f0110a472200c2511439b9e3e600527d
Kienan Stewart [Thu, 8 Jun 2023 15:05:17 +0000 (11:05 -0400)]
Tests: Avoid looping on app PIDs during cleanup
When the list of app PIDs becomes long, eg. in the case for
`tests/regression/ust/nprocesses/test_nprocesses`, then performing the
following type of loop can be quite slow, especially if the apps have
some teardown time internally:
```
for p in ${APP_PIDS} ; do
kill ${p}
wait ${p}
done
```
Both `kill` and `wait` take a list of PIDs, so the cleanup is easily
factorable to
```
kill ${APP_PIDS}
wait ${APP_PIDS}
```
In the case of `test_nprocesses`, the test run time drops from ~25s to
~4s. The difference is less important in tests with fewer apps.
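The same "signal everything first, then wait for everything" pattern
can be sketched in Python for illustration. The helper name and the
dummy sleeping applications are assumptions; the test suite itself uses
the shell commands shown above.

```python
import subprocess
import sys


def stop_all(procs):
    """Signal every process first, then wait for all of them."""
    for proc in procs:
        proc.terminate()
    # All processes tear down concurrently; the total wait is bounded by
    # the slowest one rather than by the sum of all teardown times.
    return [proc.wait() for proc in procs]


# Dummy applications standing in for the test apps.
apps = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(4)
]
exit_codes = stop_all(apps)
```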
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9619939b4f1201a99f9666dee3e19551a67c9fb6
Kienan Stewart [Wed, 10 Jan 2024 16:11:08 +0000 (11:11 -0500)]
tests: Correct timing of python tests with python3 < 3.7
`time.monotonic_ns()` was introduced in python 3.7. Prior to that, the
other available monotonic time function was `time.monotonic()` which was
itself introduced in python 3.3.
For python3 < 3.3, the automatic timing of TAP tests is disabled.
The use of underscores for readable integers was also only introduced in
python 3.6.
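A minimal sketch of such a fallback is shown below. It uses feature
detection rather than a version check, and the function name is only
illustrative; Python versions older than 3.3 are not covered.

```python
import time


def get_time_ns():
    """Return a monotonic timestamp as an integer number of nanoseconds."""
    if hasattr(time, "monotonic_ns"):
        # Available since Python 3.7.
        return time.monotonic_ns()
    # Python 3.3 to 3.6: convert float seconds. The literal is written
    # without underscores since that syntax requires Python 3.6.
    return int(time.monotonic() * 1000000000)
```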
Change-Id: Ibf85669c4d108347097d2cea7ab5d28cde9d0cc6
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 10 Jan 2024 15:59:40 +0000 (10:59 -0500)]
tests: Remove debug statement from tap.sh
It looks like a debug statement got left in for commit 2a69bf1437. The
debug statement uses print, which is not necessarily a commonly
available command, and led to spurious errors on certain CI nodes.
Eg. https://ci.lttng.org/job/lttng-tools_master_slesbuild/976/babeltrace_version=stable-2.0,build=std,conf=agents,liburcu_version=master,platform=sles15sp4-amd64/consoleFull
Change-Id: If035e57489b5bf6185f73d1046bfdbd03f4a1fed
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Tue, 26 Sep 2023 15:36:24 +0000 (11:36 -0400)]
tests: Reduce runtime of tools/tracker/test_event_tracker tests
Observed issue
==============
When running the test as root, so both UST and kernel tests are
exercised, this test takes about 100s to run on my development system.
Solution
========
By using session destroy with '--no-wait', the runtime is reduced by
30-40s. For the kernel tests in particular this introduces a detail to
keep in mind with regards to unloading the lttng-tests kernel
module. More details in the 'Known drawbacks' section.
The test applications (both userspace and kernel) also execute much
more quickly than the 0.5s sleep they are given. By reducing the sleep
to a hundredth of a second, another 15s or so can be shaved off the
test runtime.
Overall, the test runtime is reduced from 102s to 45s on my
development machine.
Known drawbacks
===============
If `modprobe -r lttng-tests` is run too quickly after the last session
destruction with `--no-wait`, the removal will fail. This patch uses a
simple one second sleep to give some time for the processes using that
module to shut down completely. I think it could be somewhat
brittle on systems that are slow or overcommitted; however, it seemed
more 'maintainable' than remembering to ensure that the last kernel
test session destruction doesn't use `--no-wait`.
Change-Id: Ib953ef22299d30507f46d2e6507fbd0f5641aa27
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Mon, 25 Sep 2023 15:00:17 +0000 (11:00 -0400)]
tests: Use `--no-wait` when destroying sessions in relayd-grouping
By using `--no-wait`, the test completes about 22s faster on
my development machine.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I53cb0665bbb5a038fbe8da6ece924579e8e91549
Kienan Stewart [Thu, 21 Sep 2023 16:08:17 +0000 (12:08 -0400)]
tests: Use '--no-wait' to reduce clear/test_ust runtime
Motivation
==========
`regression/clear/test_ust` is one of the longer running regression
tests.
Solution
========
By using `--no-wait` when destroying sessions and reducing the value
of `DELAYUS` from the default of `1000000` (us) to `500000` (us), the
test runtime decreases from 5m48s to 4m36s on my development machine.
Known drawbacks
===============
When `DELAYUS` is further decreased, the events are no longer recorded
by the connected babeltrace live viewer.
Using `--no-wait` causes additional warnings and "errors" to be
emitted, which increase the effort required to find actual errors in
the test logs. For example,
```
PERROR - 12:07:59.932183981 [Rotation]: sendmsg: Broken pipe (in lttcomm_send_unix_sock() at unix.cpp:300)
Error: Failed to send result of the destruction of session
"s5yRHQuEBtpmQ8sF" to client
```
Change-Id: I58e71fe34cb4ee97e0879ebd5e08e6d9a1e6c07f
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 21 Sep 2023 15:49:27 +0000 (11:49 -0400)]
tests: Add POPT_CFLAGS to gen-ust-events-ns and gen-ns-events
Observed issue
==============
When building in an environment where popt was in a non-standard
location, the builds for these two test binaries failed with the
following error:
```
gen-ns-events.cpp:19:10: fatal error: popt.h: No such file or directory
19 | #include <popt.h>
| ^~~~~~~~
```
Solution
========
Set the binary-specific CPPFLAGS in the `Makefile.am`
Known drawbacks
===============
None
Change-Id: I5563e24f330be86d630c68c32eaafaedf53a6c87
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 6 Sep 2023 15:56:45 +0000 (11:56 -0400)]
tests: Use destroy with no-wait during filter tests
Motivation
==========
The regression/tools/filtering/test_valid_filter test is one of the
longest running tests in the test suite, largely due to the number of
filters tested.
The session destroy step for each test takes on the order of 500-600ms
on my development machine and accounts for an important sum of time
across a run as each filter test will create and destroy a session.
Solution
========
By passing "--no-wait" to the session destroy command, the step time
is reduced from 500-600ms to ~20ms. The overall test time is
reduced (when running without root tests) from ~150s to ~40s.
Using "--no-wait" with the session destroy seems safe in this case as
the session stop is called before, which will finalize the writes to
disk.
Known drawbacks
===============
The patch causes errors similar to the following to be logged to
stderr:
```
PERROR - 11:55:26.730314528 [Rotation]: sendmsg: Broken pipe (in lttcomm_send_unix_sock() at unix.cpp:300)
Error: Failed to send result of the destruction of session "valid_filter" to client
```
Change-Id: Ic005116bbe910cb3da3e99aa85dc90244ed72f5b
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 31 Aug 2023 20:06:09 +0000 (16:06 -0400)]
tests: Use session discarded counter to validate blocking test
Motivation
==========
On my development machine, the block test takes about 75s to complete,
with the majority of the time being used in the infinite blocking
test.
Solution
========
Using the discarded counter provided by `lttng list SESSION`, the test
no longer spends time reading and processing the ~1G of traces with
babeltrace2. This reduces the run time of the entire suite from 75s to
about 15s.
Known drawbacks
===============
There's no longer a validation that the trace files written to disk
are parseable by a CTF reader.
Change-Id: I0ccdef53ef80f1ffe5cfff970b92c3de4ba460ec
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 31 Aug 2023 18:00:26 +0000 (14:00 -0400)]
tests: Reduce sleep time when waiting for various daemons to die
Motivation
==========
Using the rough duration of the tests 'Wait after kill \w+ daemon',
approximately 200s were being spent in those areas when running
non-root, non-destructive tests on my development machine.
```
$ find . -iname '*.log' -exec grep -A4 -HE 'Wait for kill [a-z]+ daemon'
{} \; | grep duration_ms | cut -d':' -f2 | datamash sum 1
198078.707216
```
Solution
========
The `stop_x_opt()` functions with timers in `tests/utils/utils.sh`
have the timeouts multiplied by 5 to reduce the counted sleep interval
to 0.1s.
After applying this change, the time spent in the tests as counted
with the find command in the 'Motivation' section above was reduced to
92061.768ms, a reduction of about 1.5 minutes.
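The effect of a finer polling interval can be sketched as follows. The
function is hypothetical and not taken from `tests/utils/utils.sh`; it
assumes the previous interval was 0.5s, so that five times as many
retries at 0.1s keep the same overall timeout.

```python
import time


def wait_until(done, retries, interval_s=0.1, sleep=time.sleep):
    """Poll done() up to `retries` times, sleeping between attempts."""
    for _ in range(retries):
        if done():
            return True
        sleep(interval_s)
    # One last check after the final sleep.
    return done()
```

For example, 10 retries at 0.5s and 50 retries at 0.1s both time out
after about 5s, but the second form returns at most 0.1s after the
daemon has actually exited.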
Known drawbacks
===============
None
Change-Id: I1d4ad899808f37f6cb7b88bbab0ff05ab0c2b787
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 31 Aug 2023 17:23:12 +0000 (13:23 -0400)]
tests: Reduce sleep in regression/tools/clear/test_ust
Motivation
==========
This test is one of the longer running non-kernel
non-destructive tests, taking ~7 minutes to run
in my development environment.
Solution
========
By reducing the sleeps from a half second to a tenth of a second,
the test's run time drops from 364.0s to 309.1s.
Known drawbacks
===============
None
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibe6c37d474edf5ca118c82669e073089d888ff84
Kienan Stewart [Fri, 25 Aug 2023 19:31:16 +0000 (15:31 -0400)]
tests: Automatically time TAP tests
Output approximate test timing by default for TAP tests by
initializing a timer when the TAP test plan is initialized, and
resetting it after every result.
Automatic timing of tests may be disabled by setting `TAP_AUTOTIME=0`
in the environment. In `tap.sh`, `autotime` is provided as a public
command to update configuration at runtime.
tap.sh and the tap-driver.sh scripts use a small helper
`tests/utils/tap/clock` to print the result of `lttng_clock_gettime`.
Originally `date` was used, but there are two principal drawbacks:
* the `%N` formatter, which provides sub-second resolution, is specific
to GNU date and unavailable on other platforms (eg. FreeBSD).
* destructive tests that modify the system date would cause strange
results (eg. a test that takes 10 years to run)
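The timing scheme described above can be sketched in Python as follows. This is only an illustration: `TapTimer` is a hypothetical name, `time.monotonic` stands in for the `tests/utils/tap/clock` helper, and the indentation of the YAML block is an assumption. Only the `duration_ms` key comes from this commit.

```python
import time


class TapTimer:
    """Emit TAP result lines followed by a YAML block holding `duration_ms`."""

    def __init__(self, clock=time.monotonic):
        # A monotonic clock is unaffected by tests that change the system date.
        self._clock = clock
        self._last = None
        self._count = 0

    def plan(self, test_count):
        # The timer starts when the plan is emitted.
        self._last = self._clock()
        return "1..{}".format(test_count)

    def result(self, ok, description):
        now = self._clock()
        if self._last is None:
            # No plan was emitted first: the first duration is meaningless.
            self._last = now
        duration_ms = (now - self._last) * 1000.0
        self._last = now  # reset after every result
        self._count += 1
        status = "ok" if ok else "not ok"
        return "\n".join(
            [
                "{} {} - {}".format(status, self._count, description),
                "  ---",
                "    duration_ms: {:.6f}".format(duration_ms),
                "  ...",
            ]
        )
```

Injecting the clock as a parameter keeps the sketch testable without sleeping.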
Known drawbacks
===============
The automatic timing depends on having plan called first. If plan
isn't called (eg. in a late plan mode), the first test time will be
wrong.
The duration key is hardcoded to "duration_ms", as used by
https://github.com/jenkinsci/tap-plugin
As the timing information for the TAP tests is stored in a multiline
YAML block following the result line, any unexpected output (eg. from
stderr) could be written in that region. These lines can cause tools
parsing the TAP log to fail as the lines written may not be valid
YAML. For ci.lttng.org, the TAP parser should be configured to remove
invalid YAML instead of causing the build to become "unstable".
After a test run, lines other than `duration_ms` in the TAP YAML block
can be identified with the following command:
find . -iname '*.log' -exec sed -n '/ ---/,/ \.\.\./p' {} \; \
| grep -v -E ' ---| \.\.\.| duration_ms'
Some solutions to the above issue were considered and discarded:
* Using a named pipe to pass stderr through sed so lines may be
prefixed with '# '
* Switching the `tap-driver.sh` to run with bash and take advantage
of process substitution for performing the prefixing of '# '
The above options ended up causing more coherency issues in the output
file than were resolved due to differing types of buffering and
processing delays.
* Redirection of the stderr of the test script to another file
The '*.log' and '*.trs' cleanups are driven by the automake log driver
implementation, which is not aware of other files that may be produced
during the invocation of the modified `tap-driver.sh`. I didn't find
an easy way to hook into the automake behaviour to add additional file
patterns to clean up.
* Cleanup in the various test scripts to systematically prefix
stderr, or to respect the `ERROR_OUTPUT_DEST` of
`tests/utils/utils.sh`.
The scope of the patch would be significantly increased and for
relatively low added value compared to instructing the CI systems to
discard invalid YAML. Furthermore the values `OUTPUT_DEST` and
`ERROR_OUTPUT_DEST` are set to `/dev/null`, which would further
reduce the ability to understand failures based on the test logs.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iabc35fc00f14085ef035d4b4e19a2c30cd86d851
Kienan Stewart [Wed, 23 Aug 2023 18:43:54 +0000 (14:43 -0400)]
tests: Output test time to trs files
Observed issue
==============
The test suite for lttng-tools is relatively large and can take
upwards of an hour or more to run in CI, especially with root
tests enabled.
Neither the automake driver (`tests/utils/tap-driver.sh`) nor the
standalone tap script (`tests/utils/tap/tap.sh`) provide timing
information for the test runs or the steps within a test run.
Solution
========
The TAP driver invoked by `make check` has been modified to include a
`:time-taken:` metadata field (in seconds) in the produced trs file.
Known drawbacks
===============
`date` on FreeBSD does not support the `%N` formatter for nanoseconds,
which results in the precision dropping to 1s.
Metadata which isn't recognized by the automake test harness is
currently ignored, but it is a behaviour that remains open to change.
C.f. https://www.gnu.org/software/automake/manual/html_node/Log-files-generation-and-test-results-recording.html
Further work is required on the CI side to make use of the information.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I94c7ddd7eb9388c595794e76dbbedc9bfe64d206
Kienan Stewart [Wed, 20 Dec 2023 15:58:42 +0000 (10:58 -0500)]
tests: Fix typo in tests/regression/kernel/test_ns_contexts
Change-Id: I50e6027f87b6d4a08a61337782356f8fbc6a64ae
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Tue, 12 Dec 2023 21:54:41 +0000 (16:54 -0500)]
Fix: sessiond: freeze on channel creation on restart
Issue observed
--------------
When using lttng via a script, the session and consumer daemons appear
to completely lock up when we request that a channel be created. The
conditions for this lockup seem to be created by killing a session
daemon and then launching a new one in quick succession.
This can be reproduced, on some systems, by launching a session daemon
and running the following commands:
$ sudo killall lttng-sessiond
$ sudo lttng-sessiond --daemonize
$ lttng create my_session --snapshot --output /tmp/demo-output
$ lttng enable-channel --kernel my_channel
Note that 'killall' above is racy as it does not wait for the session
daemon to be killed. Hence, it is not unexpected for the incoming
session daemon to see the smoldering ashes of the "outgoing" session
daemon. However, it would be helpful if the second session daemon
instance warned the user of the existing session daemon instance.
From the logs captured from both instances of the lttng-sessiond (the
outgoing and incoming instances), there appears to be a time period
during which both session daemons are active at once.
This behaviour is unexpected as the session daemon guards itself (in
theory) from running multiple conflicting instances.
The guarding mechanism works in two steps (see the implementation of
`check_existing_daemon` @ src/bin/lttng-sessiond/main.cpp:926)
When a session daemon is launched, it attempts to connect to any active
session daemon's 'client' endpoint (a UNIX socket, the same used by
liblttng-ctl to communicate with the session daemon).
If the daemon receives a reply, it can assume that another session
daemon instance is already active and abort its launch. Conversely, when
no reply is received, it uses a "lock file" mechanism to check for other
running instances.
The lock file-based check creates a file (typically
/var/run/lttng/lttng-sessiond.lck in the case of a root session daemon)
and acquires an exclusive (write) POSIX lock on it [1]. The assumption
is that any other instance would own the lock and cause the operation to
fail.
On a reproducer system, we noticed that the client thread of the
outgoing session daemon was torn down before the incoming session
daemon started its initialization. This caused the incoming session
daemon to not receive a reply to its connection attempt and to fall
back to the lock file-based mechanism.
Surprisingly, it appears that the lock file check succeeds even though
the outgoing session daemon still holds the lock file according to its
log.
See the original bug report for more information about the investigation
and how to reproduce the problem.
Cause
-----
The POSIX file locking API has a number of surprising behaviours[2] that
have seen it being superseded by platform-specific APIs. In our case,
the one that bit us is that any file lock held by a process is
automatically released when any of the file descriptors that reference
the file's description is released.
In practical terms, if a process forks and its child dies, it loses its
file lock since the child's file descriptors are closed on exit.
The LWN article linked below describes a similar scenario:
It's common to have a library routine that opens a file, reads or
writes to it, and then closes it again, without the calling
application ever being aware that has occurred. If the application
happens to be holding a lock on the file when that occurs, it can lose
that lock without ever being aware of it.
The problem affects any use of the --background/--daemonize options
since, as part of the daemonization process (which occurs after the lock
file acquisition), the session daemon forks and its parent process
exits. This causes one of the descriptors pointing to the lock file to
be closed and the lock to be released.
After that point, any other instance of the session daemon process would
succeed in acquiring the lock file and assume it is the sole instance on
the system.
Solution
--------
The lock file code is modified to use the non-POSIX `flock`[3]
interface which is available on Linux and some BSDs[4]. `flock` provides
us with the guarantee we thought we had: that the file lock is only
released when _all_ file descriptors pointing to a given file
description are closed.
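The difference between the two locking interfaces can be observed with a small experiment. The following Python sketch is not the session daemon's code. It assumes a Linux-like platform with `fork`, where `fcntl.lockf` wraps the POSIX `fcntl()` locks and `fcntl.flock` wraps `flock(2)`. The helper names are hypothetical.

```python
import fcntl
import os
import tempfile


def held_by_other_process(path, use_flock):
    """Fork a child that tries to take the lock; True if it is refused."""
    pid = os.fork()
    if pid == 0:  # child: probe with its own descriptor, then exit
        status = 2
        try:
            fd = os.open(path, os.O_RDWR)
            lock = fcntl.flock if use_flock else fcntl.lockf
            try:
                lock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                status = 0  # acquired: nobody else held the lock
            except OSError:
                status = 1  # refused: the lock is held elsewhere
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 1


def survives_unrelated_close(use_flock):
    """Lock a file, open and close a second descriptor, report if still locked."""
    lock = fcntl.flock if use_flock else fcntl.lockf
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDWR)
        try:
            lock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            # Something a library routine might do behind the caller's back.
            os.close(os.open(tmp.name, os.O_RDONLY))
            return held_by_other_process(tmp.name, use_flock)
        finally:
            os.close(fd)
```

On Linux, `survives_unrelated_close(use_flock=False)` is expected to report that the POSIX lock was lost, while `survives_unrelated_close(use_flock=True)` reports that the `flock` lock is still held.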
Drawbacks
---------
As a fallback, platforms that don't support `flock` will use the original
locking mechanism. Since this is a "hint" to warn users when they
erroneously launch a second privileged session daemon, it seems
acceptable for it to not be completely reliable on secondary platforms.
References
----------
[1] https://man7.org/linux/man-pages/man2/fcntl.2.html (see F_SETLK)
[2] https://lwn.net/Articles/586904/
[3] https://linux.die.net/man/2/flock
[4] https://man.freebsd.org/cgi/man.cgi?query=flock&sektion=2
Fixes #1405
Reported-by: Erica Bugden <ebugden@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic505ff0671c321f808050831ef2b7152cdbf4b8a
Jérémie Galarneau [Tue, 12 Dec 2023 21:13:59 +0000 (16:13 -0500)]
common: move utils_create_lock_file to its own file
A follow-up change introduces platform-specific implementations of this
function. Moving the function to a separate file makes it possible to
add other implementations without polluting utils.cpp with more
platform-specific code.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ibd566d8710380fe378a8f3df9454e21e83655b62
Kienan Stewart [Tue, 19 Dec 2023 19:01:47 +0000 (14:01 -0500)]
tests: tools/clear/test_ust wait for specific test app pid
Observed issue
==============
When debugging failing tests manually, one step that is sometimes done
is to quickly swap the commands that start the relay or sessiond in
`tests/utils/utils.sh` (eg. in `start_lttng_relayd_opt`) for the version
which uses a verbose output to a logfile.
When doing this, the `relayd` wasn't using the background
`process_mode`, and was a child of the running test.
This caused `test_ust_local_snapshot_per_pid` in
`tests/regression/tools/clear/test_ust` to hang as it waited for all
child processes to terminate.
Solution
========
The test has been updated to wait for only the specific test application
pid.
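The idea of waiting on one specific child rather than on all of them can be sketched in Python as follows. This is only an analogue of the shell test: the two child commands are stand-ins for the relay daemon and the test application, and `wait_for_app_only` is a hypothetical name.

```python
import subprocess
import sys


def wait_for_app_only():
    """Wait for one specific child while another child keeps running."""
    # Stand-in for a relay daemon left running as a child of the test.
    background = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    # Stand-in for the short-lived test application.
    app = subprocess.Popen([sys.executable, "-c", "pass"])
    try:
        # Waiting on `app` alone returns as soon as it exits; waiting for
        # every child would block on `background`.
        return app.wait(timeout=10)
    finally:
        background.kill()
        background.wait()
```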
Known drawbacks
===============
None.
Change-Id: I8761649a52fceda92a5545c71818dc2eb027bfcf
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Mon, 18 Dec 2023 16:44:01 +0000 (11:44 -0500)]
README.adoc: Jenkins CI badge points to non-existent job
The lttng-tools_master_build job has been superseded by
lttng-tools_master_linuxbuild. The badge now points to this job as it is
the "main" build configuration.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I72e4b53709ccef95c4bb708dc3ac49f45f48b6d7
Kienan Stewart [Wed, 22 Nov 2023 16:18:23 +0000 (11:18 -0500)]
sessiond: log error message when libkmod operations fail
Issue observed
--------------
When loading or unloading modules and libkmod is being used, there are
no details available in the `lttng-sessiond` error output.
For example,
```
$ sudo -E lttng-sessiond
Error: Unable to load required module lttng-ring-buffer-client-discard
Warning: No kernel tracer available
```
When libkmod is not available or disabled in the configuration
options, `lttng-sessiond` falls back to invoking `modprobe` via a
`system` call. The command's error output will be visible and provide
the necessary details, eg.
```
$ sudo -E lttng-sessiond
modprobe: FATAL: Module lttng-ring-buffer-client-discard not found in directory /usr/lib/modules/6.2.0-36-generic
Error: Unable to load required module lttng-ring-buffer-client-discard
Warning: No kernel tracer available
```
Solution
--------
Include the error message from `strerror` in the message that is
logged via the `DBG` or `ERR` macros.
The error is now clearer for users, eg.
```
$ sudo -E lttng-sessiond
PERROR - 12:00:05.004593045 [Main]: Unable to load required module lttng-ring-buffer-client-discard: No such file or directory (in modprobe_lttng() at modprobe.cpp:396)
Warning: No kernel tracer available
```
Known drawbacks
---------------
The debug message emitted when a non-obligatory kernel module fails to
load is now printed with `PERROR`.
Change-Id: Ibd25614a6c5b5dd3b801063eafc272a4017058cd
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 13 Jul 2023 19:13:12 +0000 (15:13 -0400)]
Fix: sessiond: crash when sending data_pending to an active session
Observed Issue
==============
When a data_pending command is sent to an active session, the sessiond
crashes with the following assert
```
lttng-sessiond: client.cpp:2647: void* thread_manage_clients(void*): Assertion `cmd_ctx.reply_payload.buffer.size >= sizeof(*llm)' failed.
Error: 1 trace chunks are leaked by lttng-consumerd. This can be caused by an internal error of the session daemon.
```
Cause
=====
When a session is active, cmd.cpp:cmd_data_pending() returns
LTTNG_ERR_SESSION_STARTED. In client.cpp:process_client_msg(), this
return value causes the execution to go to the setup_error label. In
the setup_error label, no default LLM header is added to the reply,
so the reply has a size of zero, which triggers the assert above.
Solution
========
When cmd_data_pending() returns a value that is neither 0 nor 1, the
return code is set appropriately as follows:
* when the return value is outside the range of lttng error codes,
LTTNG_ERR_UNK is used
* otherwise, the return value is used
The execution then jumps to the error label so that the default LLM
message header can be added.
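The selection of the reply code can be sketched as follows. This Python snippet only illustrates the rule stated above. The function name and its parameters are hypothetical, and the numeric values used with it are placeholders rather than LTTng's real error codes.

```python
def reply_code_for_data_pending(ret, first_error, last_error, unknown_error):
    """Pick the error code to report for a cmd_data_pending()-style result.

    Returns None when `ret` is 0 or 1, the two regular "data pending"
    answers. `first_error` and `last_error` delimit the valid error code
    range and `unknown_error` plays the role of LTTNG_ERR_UNK.
    """
    if ret in (0, 1):
        return None
    if first_error <= ret <= last_error:
        return ret
    return unknown_error


# Placeholder range and codes, for illustration only.
FIRST, LAST, UNKNOWN = 10, 200, 11
print(reply_code_for_data_pending(42, FIRST, LAST, UNKNOWN))
print(reply_code_for_data_pending(-5, FIRST, LAST, UNKNOWN))
```

In both error cases the caller then takes the path that adds the default message header, so the reply is never empty.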
Known Drawbacks
===============
None.
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Iff46f87c7725d25c131a86ac3dbaed5c99b4d16b
Kienan Stewart [Thu, 23 Nov 2023 21:34:18 +0000 (16:34 -0500)]
docs: Update contributing guide
Indicate that Gerrit (https://review.lttng.org) is the principal place
where patches are submitted and reviewed, rather than the mailing list.
Based on feedback received on the mailing list:
https://lists.lttng.org/pipermail/lttng-dev/2023-November/030670.html
Change-Id: Icb0bc3e45bb35fa85eca272d8043e5553465f700
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Fri, 6 Oct 2023 19:17:41 +0000 (15:17 -0400)]
Fix: sessiond: cmd potentially used uninitialized
GCC reports the following warning:
libtool: compile: g++ -std=gnu++11 -DHAVE_CONFIG_H -I../../../include -I../../../include -I../../../src -I../../../src -include config.h -I/home/mjeanson/opt/include -I/home/mjeanson/opt/include -DINSTALL_BIN_PATH=\"/home/mjeanson/opt/lib/lttng/libexec\" -DINSTALL_LIB_PATH=\"/home/mjeanson/opt/lib\" -fvisibility=hidden -fvisibility-inlines-hidden -fno-strict-aliasing -Wall -Wextra -Wmissing-declarations -Wnull-dereference -Wundef -Wredundant-decls -Wshadow -Wsuggest-attribute=format -Wwrite-strings -Wformat=2 -Wstrict-aliasing -Wmissing-noreturn -Wduplicated-cond -Wduplicated-branches -Wlogical-op -Winit-self -Wno-incomplete-setjmp-declaration -Wno-gnu-folding-constant -Wno-sign-compare -Werror -pthread -g -O2 -MT notification-thread-events.lo -MD -MP -MF .deps/notification-thread-events.Tpo -c notification-thread-events.cpp -fPIC -DPIC -o .libs/notification-thread-events.o
In file included from field.hpp:14,
from ust-field-convert.hpp:11,
from ust-app.hpp:14,
from event-notifier-error-accounting.hpp:11,
from notification-thread-events.cpp:12:
../../../src/vendor/optional.hpp: In function ‘int handle_notification_thread_command(notification_thread_handle*, notification_thread_state*)’:
../../../src/vendor/optional.hpp:877:17: error: ‘cmd’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
877 | return &data;
| ^~~~
notification-thread-events.cpp:3181:45: note: ‘cmd’ was declared here
3181 | struct notification_thread_command *cmd;
|
When failing to pop a command, cmd can indeed be used uninitialized when
jumping to the error label.
Reported-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5e63451ed4936638e9fbe05146f16c86ea2e42e2
Jérémie Galarneau [Mon, 2 Oct 2023 19:37:58 +0000 (15:37 -0400)]
.gitignore: add man{1,3,7,8} symlinks
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I9695833394aa51c1fa62f86cc77ce8981088cbbf
Olivier Dion [Mon, 2 Oct 2023 19:09:44 +0000 (15:09 -0400)]
utils: Allow users to define LTTNG_MANPATH
Currently, the configured value `MANPATH` is used when executing `man`.
This forces `lttng --help` to use man pages where they will be
installed, even with the `pre-inst-env` script.
Instead, let the user provide a `LTTNG_MANPATH` environment variable. If
not defined, fallback to the configured `MANPATH`.
This allows developers to do:
$ ./pre-inst-env lttng --help
to read the locally generated man pages where the `pre-inst-env` script
was generated.
Also add `LTTNG_MAN_BIN_PATH` to `pre-inst-env` since `man` could
be installed someplace else.
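The lookup order can be sketched in Python as follows. The real implementation lives in the C++ client. `resolve_manpath` is a hypothetical name, the example paths are made up, and treating an empty `LTTNG_MANPATH` as "defined" is an assumption.

```python
import os


def resolve_manpath(configured_manpath, environ=None):
    """Return LTTNG_MANPATH when set, else the build-time configured path."""
    env = os.environ if environ is None else environ
    return env.get("LTTNG_MANPATH", configured_manpath)
```

For example, `resolve_manpath("/usr/share/man", {"LTTNG_MANPATH": "/build/doc/man"})` yields the build directory, while an environment without the variable yields the configured path.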
Change-Id: I32d9af480737bb80732dc5d690f947242aacac4f
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Olivier Dion [Mon, 2 Oct 2023 16:01:29 +0000 (12:01 -0400)]
doc/man/Makefile: Mimic mandb(5) path hierarchy
The following allows developers to read locally generated man pages by
using the `pre-inst-env' script. For example:
$ ./pre-inst-env man lttng-add-context
will open the `lttng-add-context.1' man page in the build directory
under which the `pre-inst-env' was generated.
This is done by:
1. Symlinking `build/doc/man{1,3,7,8}' to `build/doc/man'
2. Adding MANPATH to `pre-inst-env'
The symlinking part is a hack to force `man' to use our current doc
layout, doing a less invasive change than would be required otherwise.
Change-Id: I2ea1af779f237fe1808a1d44d4f3b1c3a8535e2d
Signed-off-by: Olivier Dion <odion@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 27 Sep 2023 12:54:22 +0000 (08:54 -0400)]
Tests: valid_filters: temporary trace path uses 'invalid'
Likely a result of a copy/paste error, the temporary trace path used for
the 'valid' filter tests uses the word 'invalid'.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I70123f372b0f52660f9d3cb2c27e7ff90e9c3c7a
Jérémie Galarneau [Sat, 23 Sep 2023 15:24:26 +0000 (11:24 -0400)]
waiter: modernize the waiter interface
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I3e8ffac4324e36dc3bf7a7f79d874f228a48def6
Jérémie Galarneau [Thu, 24 Aug 2023 17:33:44 +0000 (13:33 -0400)]
License: common: error_query: fix typo in SPDX specifier
The error-query API files were erroneously licensed under "GPL-2.1", a
license which doesn't exist.
As the author of those files, I hereby confirm that the intention was to
license these files under LGPL-2.1 as evidenced by their presence in the
libcommon_lgpl internal library.
Reported-by: Christophe Bedard <christophe.bedard@apex.ai>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0d56372ca009e44732e0737bd8f10ad4a4d000c5
Jérémie Galarneau [Tue, 22 Aug 2023 19:22:22 +0000 (15:22 -0400)]
Docs: CodingStyle.md: remove extraneous underscore in emphasis
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I802551140889d23dc0ac6521d95254ea441ffa22
Erica Bugden [Thu, 10 Aug 2023 21:16:25 +0000 (17:16 -0400)]
docs: fix: Match stated automake requirement
to the applied requirement. Previously, the README stated the minimum
required version of automake was 1.10. However, since commit 343a7a984
(Require automake >= 1.12), the configuration script (configure.ac)
actually enforced a minimum of 1.12:
AM_INIT_AUTOMAKE([1.12 foreign dist-bzip2 no-dist-gzip tar-pax ...
Change-Id: I4f9fcc3aca340e4638e93d69155c51f82247e29d
Signed-off-by: Erica Bugden <ebugden@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Mon, 19 Jun 2023 19:32:54 +0000 (15:32 -0400)]
Tests: use CPU ids from online ranges
test_tracefile_count could fail randomly on systems where there are CPUs
present but not online. For example:
$ cat /sys/devices/system/cpu/online
0-7
$ cat /sys/devices/system/cpu/present
0-39
When a CPU is present, it will have an entry in
/sys/devices/system/cpu/cpuX for its ID, and thus the test may pick
that CPU's ID. However, a present CPU which is not online is not a valid
target for taskset.
In cases where `get_any_available_cpu` is used with taskset, the tests
could fail for a similar reason. This case is somewhat less common
because it would return the numerically lowest CPU first; however, with
the online set shown below, CPU 0 isn't available and taskset fails.
$ cat /sys/devices/system/cpu/online
18-19,135,142
$ cat /sys/devices/system/cpu/present
0-167
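Expanding the kernel's range-list format into usable CPU ids can be sketched as follows. The tests themselves are shell scripts. This Python version is only an illustration and `parse_cpu_list` and `read_online_cpus` are hypothetical names.

```python
def parse_cpu_list(text):
    """Expand a range list such as "18-19,135,142" into a list of CPU ids."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        # A lone id such as "135" is a range of one element.
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def read_online_cpus(path="/sys/devices/system/cpu/online"):
    """Return the ids of the CPUs that are online, hence valid for taskset."""
    with open(path) as online_file:
        return parse_cpu_list(online_file.read())
```

Picking a CPU from this list, rather than from the `cpuX` directory entries, avoids selecting a CPU that is present but offline.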
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ia7fb7ff69ecdd7aa6bac9dcfdf72344df08f6782
Simon Marchi [Tue, 27 Jun 2023 17:02:04 +0000 (13:02 -0400)]
Move lttng_session unique_ptr to lttng/session-internal.hpp
Make it possible to use this unique_ptr elsewhere.
Change-Id: I30141efac45d842f4bc3414ca03fffb2e4ba5cce
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 27 Jun 2023 16:39:42 +0000 (12:39 -0400)]
lttng: make session_storage::_array a unique_ptr to array
By making this field a unique_ptr of lttng_session[] instead of
lttng_session, we can use the subscript operator on it, making the code
in session_list_operations a bit simpler.
Change-Id: Ic4e441a23be834e68c5af3f4cca2794f86f2f57e
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Mon, 26 Jun 2023 19:38:11 +0000 (15:38 -0400)]
sessiond: remove ust-metadata.cpp
This file is essentially empty, remove it. Note that it's no longer
listed in Makefile.am, so it isn't compiled.
Change-Id: I88251890cbe10e10ed7c135dc88be5d4c0e2ef5a
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 20 Jun 2023 20:40:48 +0000 (16:40 -0400)]
Tests: fix formatting in gen-ust-events.cpp
I see this change when running format-cpp.
Change-Id: If80837682bb65dfc14fe8a5df22aca94e7047e44
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 20 Jun 2023 20:38:14 +0000 (16:38 -0400)]
sessiond: disable clang-format to work around unstable output
When running format-cpp multiple times, I see clang-format-14
alternating between these two forms:
_environment += lttng::format(
" {} = \"{}\";\n", field.name, escape_tsdl_env_string_value(field.value));
_environment += lttng::format(" {} = \"{}\";\n",
field.name,
escape_tsdl_env_string_value(field.value));
Disable clang-format locally to avoid always having some spurious
changes.
Change-Id: I71b10a2ad1a5264f26c61f54743f298eb10917bf
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Simon Marchi [Tue, 20 Jun 2023 20:32:51 +0000 (16:32 -0400)]
format-cpp: run clang-format in parallel
Use the -P option of GNU xargs to run multiple instances of clang-format
in parallel, which speeds up the execution quite a bit (depending on the
number of cores, of course).
Inspired by this babeltrace commit:
http://git.efficios.com/?p=babeltrace.git;a=commit;h=66c3bce11973e6e96a3791c378a9e5f98ddaa280
Change-Id: I201535244ef4c3614dfd742ae6f1c427994e6147
Signed-off-by: Simon Marchi <simon.marchi@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 27 Jul 2023 18:09:00 +0000 (14:09 -0400)]
Fix: format-cpp: don't pass -i twice to clang-format
From Simon Marchi's original commit message:
I'm trying to run format-cpp, with clang-format-14 in my PATH, and I
get a ton of these messages:
clang-format-14: for the -i option: may only occur zero or one times!
This is because -i is present in the FORMATTER variables, as well as in
the command line where the formatter is invoked.
Remove the one in the command-line.
Instead of assuming the FORMATTER variable contains '-i', assume it
doesn't since the options are not semantically part of the formatter's
name. The '-i' option is passed to the formatter invocation directly
since it is always needed.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0c26ff0d8d4d99b3f161ca9e5aa94ff867e3f916
Michael Jeanson [Tue, 27 Jun 2023 17:33:18 +0000 (17:33 +0000)]
Tests: python: path-like object introduced in python 3.6
Prior to python 3.6, the functions of the os.path module (such as
os.path.exists()) expected a string or bytes object for the pathname.
Use a compat method to convert the path-like object to a string on
interpreters that lack PEP-519 [1] support.
Traceback (most recent call last):
File "tests/regression/tools/context/test_ust.py", line 156, in <module>
tap, test_env, lttngtest.VpidContextType(), lambda test_app: test_app.vpid
File "tests/regression/tools/context/test_ust.py", line 114, in test_static_context
test_app = test_env.launch_wait_trace_test_application(50)
File "tests/utils/lttngtest/environment.py", line 541, in launch_wait_trace_test_application
wait_before_exit_file_path,
File "tests/utils/lttngtest/environment.py", line 163, in __init__
self._wait_for_file_to_be_created(pathlib.Path(app_ready_file_path))
File "tests/utils/lttngtest/environment.py", line 168, in _wait_for_file_to_be_created
if os.path.exists(sync_file_path):
File "/usr/lib/python3.5/genericpath.py", line 19, in exists
os.stat(path)
TypeError: argument should be string, bytes or integer, not PosixPath
[1] https://peps.python.org/pep-0519/
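A compat helper of this kind can be sketched as follows. This is a guess at the general shape, not the code added to `lttngtest`, and `path_to_str` is a hypothetical name.

```python
import os
import pathlib


def path_to_str(path):
    """Return a string usable by os.path functions on any Python 3 version."""
    if hasattr(os, "fspath"):
        # Python >= 3.6: honours the PEP 519 path protocol.
        return os.fspath(path)
    # Older interpreters: pathlib objects stringify to their path.
    return str(path)


print(os.path.exists(path_to_str(pathlib.PurePosixPath("/"))))
```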
Change-Id: I783e36f61223d44667294ccbf4b3aec5bff68701
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Thu, 27 Jul 2023 16:26:42 +0000 (12:26 -0400)]
Fix: lttng-add-context: context type options possible null dereference
Coverity reports that:
** CID 1518091: Null pointer dereferences (FORWARD_NULL)
/src/bin/lttng/commands/add_context.cpp: 820 in destroy_ctx_type(<unnamed>::ctx_type *)(
Free application context options only if type->opt isn't null.
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Icc27d04480c4821ed33127f5baf293510cdb314e
Jérémie Galarneau [Thu, 29 Jun 2023 18:04:37 +0000 (14:04 -0400)]
Fix: consumerd: slow metadata push slows down application registration
Issue observed
--------------
When rotating the channels of a session configured with a "per-pid"
buffer sharing policy, applications with a long registration
timeout (e.g. LTTNG_UST_REGISTER_TIMEOUT=-1, see LTTNG-UST(3)) sometimes
experience long start-up times.
Cause
-----
The session list lock is held during the registration of an application
and during the setup of a rotation.
While setting up a rotation in the userspace domain, the session daemon
flushes its metadata cache to the userspace consumer daemon and waits
for a confirmation that all metadata emitted before that point in time
has been serialized (whether on disk or sent through a network output).
As the consumer daemon waits for the metadata to be consumed, it
periodically checks the metadata stream's output position with a 200ms
delay (see DEFAULT_METADATA_AVAILABILITY_WAIT_TIME).
In practice, in per-uid mode, this delay is seldom encountered since
the metadata has already been pushed by the consumption thread.
Moreover, if it was not, a single polling iteration will typically
suffice.
However, in per-pid buffering mode and with a sustained "heavy" data
production rate, this delay becomes problematic since:
- metadata is pushed for every application,
- the delay is hit almost systematically as the consumption thread
is busy and has to catch up to consume the most recent metadata.
Hence, some rotation setups can easily take multiple seconds (at least
200ms per application). This makes the locking scheme employed on that
path unsuitable as it blocks some operations (like application
registrations) for an extended period of time.
Solution
--------
The polling "back-off" delay is eliminated by using a waiter that allows
the consumer daemon thread that runs the metadata push command to
wake-up whenever the criteria used to evaluate the "pushed" metadata
position are changed.
Those criteria are:
- the metadata stream's pushed position
- the lifetime of the metadata channel's stream
- the status of the session's endpoint
Whenever those states are affected, the waiters are woken-up to force a
re-evaluation of the metadata cache flush position and, eventually,
cause the metadata push command to complete.
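The move from periodic polling to an explicit wake-up can be sketched with a condition variable. This Python snippet is not the waiter queue adapted from liburcu. `PushedPositionWaiter` and its methods are hypothetical names, and only two of the three criteria listed above are modelled.

```python
import threading


class PushedPositionWaiter:
    """Block until the pushed position reaches a target or the stream closes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pushed = 0
        self._closed = False

    def advance(self, position):
        # Called by the consumption thread after it pushes metadata.
        with self._cond:
            self._pushed = max(self._pushed, position)
            self._cond.notify_all()

    def close(self):
        # Called when the stream or endpoint goes away.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def wait_for(self, target, timeout=None):
        # Every notification re-evaluates the predicate; no fixed delay.
        with self._cond:
            self._cond.wait_for(
                lambda: self._closed or self._pushed >= target, timeout
            )
            return self._pushed >= target
```

The waiting thread is woken as soon as one of the tracked states changes, instead of sleeping through a fixed back-off delay between checks.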
Note
----
The waiter queue is adapted from urcu-wait.h of liburcu (also LGPL
licensed).
Change-Id: Ib86c2e878abe205c73f930e6de958c0b10486a37
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 29 Jun 2023 14:32:59 +0000 (10:32 -0400)]
Docs: Fix broken reference in lttng-add-trigger
Change-Id: I4068570d188fbf75e402898234944b6e21cfa2a1
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Thu, 6 Jul 2023 20:20:24 +0000 (16:20 -0400)]
Docs: Fix broken reference to lttng-concepts(7) man page
Change-Id: Iaa700e06ec98a3a451f10b4e287c7b28e6ff4524
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Kienan Stewart [Wed, 21 Jun 2023 13:39:06 +0000 (09:39 -0400)]
Tests: Preemptively fail infinite blocking tests when low on disk space
In the system tests run by LAVA, the infinite blocking tests were
hanging when the system under test ran out of disk space. This is the
expected behaviour of the failing test, but the condition can be
detected and the tests preemptively failed with a clear error of what
needs to be addressed in the system being tested.
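Such a preemptive check can be sketched as follows. The actual test is a shell script. This Python version is only an illustration, and `has_enough_disk_space` and the required size are hypothetical.

```python
import shutil


def has_enough_disk_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Fail early with a clear message instead of hanging mid-test.
if not has_enough_disk_space(".", 0):
    raise SystemExit("not enough free disk space to run the blocking test")
```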
Change-Id: I9e6126408b57c2cd5aa64c2e360e0672f9eb2314
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Jérémie Galarneau [Wed, 17 May 2023 17:41:03 +0000 (13:41 -0400)]
Fix: sessiond: bad fd used while rotating exiting app's buffers
Issue observed
--------------
From bug #1372:
We are observing seemingly random crashes in the LTTng consumer daemon
when tracing a C++ application with LTTng-UST. Our workload has a single
printf-like tracepoint, where each string is in the order of 1kb and the
total output is around 30MB/s.
LTTng is set up with a single session and channel enabling this
tracepoint, and we enabled rotation with a maximum size of 100MB or
every 30 seconds. We are periodically starting new traced processes and
the system runs close to 100% CPU load. This ran on an AWS
Graviton2 (ARM) instance with CentOS 7 and a 5.4 kernel, using LTTng-UST
2.13.5 and LTTng-tools 2.13.8.
The first reported error is a write to a bad file descriptor (-1),
apparently when waking up the metadata poll thread during a rotation.
Cause
-----
Inspecting the logs, we see that the metadata channel with key 574 has a
negative poll fd write end which causes the write in
consumer_metadata_wakeup_pipe to fail because of an invalid file
descriptor:
DBG1 - 15:12:13.271001175 [6593/6605]: Waking up metadata poll thread (writing to pipe): channel name = 'metadata', channel key = 574 (in consumer_metadata_wakeup_pipe() at consumer.c:888)
DBG3 - 15:12:13.271010093 [6593/6605]: write() fd = -1 (in consumer_metadata_wakeup_pipe() at consumer.c:892)
PERROR - 15:12:13.271014655 [6593/6605]: Failed to write to UST metadata pipe while attempting to wake-up the metadata poll thread: Bad file descriptor (in consumer_metadata_wakeup_pipe() at consumer.c:907)
Error: Failed to dump the metadata cache
Error: Rotate channel failed
Meanwhile, a lot of applications seem to be unregistering. Notably, the
application associated with that metadata channel is being torn down.
Leading up to the use of a bad file descriptor, the chain of events is:
1) The "rotation" thread starts to issue "Consumer rotate channel" on
key 574 (@ `15:12:12.865621802`), but blocks on the consumer socket
lock. We can deduce this from the fact that thread "6605" in the
consumer wakes up to process an unrelated command originating from the
same socket.
We don't see that command being issued by the session daemon, most
likely because it occurs just before the captured logs start. All
call sites that use this socket take the socket lock, issue their
command, wait for a reply, and release the socket lock.
2) The application unregisters (@ `15:12:13.269722736`). The
`registry_session`, which owns the metadata contents, is destroyed
during `delete_ust_app_session` which is done directly as a consequence
of the app unregistration (through a deferred RCU call), see
`ust_app_unregister`.
This is problematic since the consumer will request the metadata during
the rotation of the metadata channel. In the logs, we can see that
the "close_metadata" command blocks on the consumer socket lock.
However, the problem occurs when the `manage-apps` thread acquires the
lock before the "rotation" thread. In this instance, the "close_metadata"
command is performed by the consumer daemon, closing the metadata
poll file descriptor.
3) As the "close_metadata" command completes, the rotation thread
successfully acquires the socket lock. It is not aware of the
unregistration of the application and of the subsequent tear-down of the
application, registry, and channels since it was already iterating on
the application's channels.
The consumer starts to process the channel rotation command (@
`15:12:13.270633213`) which fails on the metadata poll fd.
Essentially, we must ensure that the lifetime of metadata
channel/streams exceeds any ongoing rotation, and prevent a rotation
from being launched when an application is being torn-down in per-PID
buffering mode.
The problem is fairly hard to reproduce as it requires threads to
wake-up in the problematic order described above. I don't have a
straight-forward reproducer for the moment.
Solution
--------
During the execution of a rotation on a per-pid session, the session
daemon iterates on all applications to rotate their data and metadata
channels.
The `ust_app` itself is correctly protected: it is owned by an RCU HT
(`ust_app_ht`) and the RCU read lock is acquired as required to protect
the lifetime of the storage of `ust_app`. However, there is no way to
lock an `ust_app` instance itself.
The rotation command assumes that if it finds the `ust_app`, it will be
able to rotate all of its channels. This isn't true: the `ust_app` can
be unregistered by the `manage-applications` thread which monitors the
application sockets for their deaths in order to tear down the
applications.
The `ust_app` doesn't directly own its channels; they are owned by an
`ust_app_session` which, itself, has a `lock` mutex. Also, the metadata
of the application is owned by the "session registry", which itself can
also be locked.
At a high-level, we want to ensure that the metadata isn't closed while
a rotation is being set up. The registry lock could provide this
guarantee. However, it currently needs to remain unlocked during the
setup of the rotation as it is used when providing the metadata to the
consumer daemon.
Taking the registry lock over the duration of the setup would result in
a deadlock like so:
- the consumer buffer consumption thread consumed a data buffer and attempts
a metadata sync,
- the command handling thread of the consumer daemon attempts to rotate
any stream that is already at its rotation position and locks on the
channel lock held by the consumption thread,
- the metadata sync launches a metadata request against the session
daemon which attempts to refresh the metadata contents through the
command socket,
- the command handling thread never services the metadata "refresh" sent
by the session daemon since it is locked against the same channel as
the buffer consumption thread, resulting in a deadlock.
Instead, a different approach is required: extending the lifetime of the
application's channels over the duration of the setup of a rotation.
To do so, the `ust_app` structure (which represents a registered
application) is now reference-counted. A reference is acquired over the
duration of the rotation's setup phase. This reference transitively
holds a reference to the application's tracing buffers.
Note that taking a reference doesn't prevent applications from
unregistering; it simply defers the reclamation of their buffers to the
end of the rotation setup.
As the rotation completes its setup phase, the references to the
application (and thus, its tracing buffers) are released, allowing the
reclamation of all buffering resources.
Note that the setup phase of the rotation doesn't last long so it
shouldn't significantly change the observable behaviour in terms of
memory usage. The setup phase mostly consists in sampling the
consumption/production positions of all buffers in order to establish a
switch-over point between the old and new files.
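The deferred reclamation described above can be illustrated with a minimal
sketch. It makes several assumptions: `std::shared_ptr` and a mutex-protected
map stand in for the reference count and the RCU hash table used by the
session daemon, and the names `traced_app`, `app_registry`, and `acquire_all`
are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered application and its buffers.
// The destructor flags the moment the buffers are reclaimed.
struct traced_app {
	explicit traced_app(bool& reclaimed_flag) : _reclaimed(reclaimed_flag)
	{
	}

	~traced_app()
	{
		_reclaimed = true;
	}

	bool& _reclaimed;
};

// Hypothetical registry: a mutex-protected map replaces the RCU hash table.
class app_registry {
public:
	void register_app(int pid, std::shared_ptr<traced_app> app)
	{
		std::lock_guard<std::mutex> guard(_lock);
		_apps[pid] = std::move(app);
	}

	// Unregistering only drops the registry's own reference.
	void unregister_app(int pid)
	{
		std::lock_guard<std::mutex> guard(_lock);
		_apps.erase(pid);
	}

	// Acquire a reference to every application for the duration of a
	// rotation's setup phase.
	std::vector<std::shared_ptr<traced_app>> acquire_all() const
	{
		std::lock_guard<std::mutex> guard(_lock);
		std::vector<std::shared_ptr<traced_app>> references;

		references.reserve(_apps.size());
		for (const auto& entry : _apps) {
			references.push_back(entry.second);
		}

		return references;
	}

private:
	mutable std::mutex _lock;
	std::unordered_map<int, std::shared_ptr<traced_app>> _apps;
};
```

In this sketch, an application that unregisters while the rotation setup
holds the result of `acquire_all()` is only destroyed once that vector is
released, which corresponds to the end of the setup phase.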
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I8dc1ee45dd00c85556dd70d34a3af4f3a4d4e7cb
Jérémie Galarneau [Mon, 24 Jul 2023 20:45:07 +0000 (16:45 -0400)]
Fix: sessiond: leak of application context in channel
Issue observed
--------------
ASAN generates the following report when the session daemon exits after
running the tests/regression/tools/context/test_ust.py test suite.
lttng-sessiond: ==930543==ERROR: LeakSanitizer: detected memory leaks
lttng-sessiond: Direct leak of 8 byte(s) in 1 object(s) allocated from:
lttng-sessiond: 0 0x7f8d1706c33a in __interceptor_strdup /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_interceptors.cpp:454
lttng-sessiond: 1 0x55e36fa6d107 in alloc_ust_app_ctx /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:1368
lttng-sessiond: 2 0x55e36fa82f73 in create_ust_app_channel_context /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:2912
lttng-sessiond: 3 0x55e36fa9eeac in ust_app_channel_create /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5062
lttng-sessiond: 4 0x55e36faa9fef in find_or_create_ust_app_channel /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5936
lttng-sessiond: 5 0x55e36faab610 in ust_app_synchronize_all_channels /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6147
lttng-sessiond: 6 0x55e36faac12e in ust_app_synchronize /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6208
lttng-sessiond: 7 0x55e36faacc29 in ust_app_global_update(ltt_ust_session*, ust_app*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6268
lttng-sessiond: 8 0x55e36faa910e in ust_app_start_trace_all(ltt_ust_session*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5850
lttng-sessiond: 9 0x55e36f920343 in cmd_start_trace(ltt_session*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:2826
lttng-sessiond: 10 0x55e36f9ffac5 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:1779
lttng-sessiond: 11 0x55e36fa077c0 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2588
lttng-sessiond: 12 0x55e36f9e4d85 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:67
lttng-sessiond: 13 0x7f8d15c9d44a (/usr/lib/libc.so.6+0x8744a) (BuildId: 2f005a79cd1a8e385972f5a102f16adba414d75e)
lttng-sessiond: Direct leak of 5 byte(s) in 1 object(s) allocated from:
lttng-sessiond: 0 0x7f8d1706c33a in __interceptor_strdup /usr/src/debug/gcc/gcc/libsanitizer/asan/asan_interceptors.cpp:454
lttng-sessiond: 1 0x55e36fa6d059 in alloc_ust_app_ctx /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:1367
lttng-sessiond: 2 0x55e36fa82f73 in create_ust_app_channel_context /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:2912
lttng-sessiond: 3 0x55e36fa9eeac in ust_app_channel_create /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5062
lttng-sessiond: 4 0x55e36faa9fef in find_or_create_ust_app_channel /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5936
lttng-sessiond: 5 0x55e36faab610 in ust_app_synchronize_all_channels /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6147
lttng-sessiond: 6 0x55e36faac12e in ust_app_synchronize /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6208
lttng-sessiond: 7 0x55e36faacc29 in ust_app_global_update(ltt_ust_session*, ust_app*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:6268
lttng-sessiond: 8 0x55e36faa910e in ust_app_start_trace_all(ltt_ust_session*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/ust-app.cpp:5850
lttng-sessiond: 9 0x55e36f920343 in cmd_start_trace(ltt_session*) /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/cmd.cpp:2826
lttng-sessiond: 10 0x55e36f9ffac5 in process_client_msg /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:1779
lttng-sessiond: 11 0x55e36fa077c0 in thread_manage_clients /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/client.cpp:2588
lttng-sessiond: 12 0x55e36f9e4d85 in launch_thread /home/jgalar/EfficiOS/src/lttng-tools/src/bin/lttng-sessiond/thread.cpp:67
lttng-sessiond: 13 0x7f8d15c9d44a (/usr/lib/libc.so.6+0x8744a) (BuildId: 2f005a79cd1a8e385972f5a102f16adba414d75e)
lttng-sessiond: SUMMARY: AddressSanitizer: 13 byte(s) leaked in 2 allocation(s).
Cause
-----
In the case of application contexts, alloc_ust_app_ctx() copies the
provider and application context names. However, these fields are not
free'd by delete_ust_app_ctx().
Solution
--------
The application context and provider names are free'd during
delete_ust_app_ctx() when the context type is LTTNG_UST_ABI_CONTEXT_APP.
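The shape of the fix can be sketched as follows. This is a simplified
illustration and not the code of ust-app.cpp: the `app_context` structure,
the `context_type` enumeration, and the functions below are hypothetical
stand-ins for `alloc_ust_app_ctx()`, `delete_ust_app_ctx()`, and
`LTTNG_UST_ABI_CONTEXT_APP`.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical context types; only `app` owns heap-allocated names.
enum class context_type { pid, app };

struct app_context {
	context_type type;
	char *provider_name;
	char *ctx_name;
};

// Copy a C string into freshly allocated storage (nullptr on failure).
static char *duplicate_string(const char *source)
{
	const std::size_t size = std::strlen(source) + 1;
	char *copy = static_cast<char *>(std::malloc(size));

	if (copy) {
		std::memcpy(copy, source, size);
	}

	return copy;
}

void delete_app_context(app_context *ctx)
{
	if (!ctx) {
		return;
	}

	// The names are only allocated for application contexts; releasing
	// them here is what prevents the leak reported above.
	if (ctx->type == context_type::app) {
		std::free(ctx->provider_name);
		std::free(ctx->ctx_name);
	}

	std::free(ctx);
}

app_context *alloc_app_context(
	context_type type, const char *provider_name, const char *ctx_name)
{
	auto *ctx = static_cast<app_context *>(std::calloc(1, sizeof(app_context)));

	if (!ctx) {
		return nullptr;
	}

	ctx->type = type;
	if (type == context_type::app) {
		ctx->provider_name = duplicate_string(provider_name);
		ctx->ctx_name = duplicate_string(ctx_name);
		if (!ctx->provider_name || !ctx->ctx_name) {
			// Partial allocation: release whatever was copied.
			delete_app_context(ctx);
			return nullptr;
		}
	}

	return ctx;
}
```

Keying the release on the context type keeps the deletion path symmetric
with the allocation path, since other context types never allocate the two
name fields in this sketch.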
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I0759018ec1811cf6246b5a80d4f5a7545c63910a