Test the lttng_ust_lib:load, lttng_ust_lib:build_id,
lttng_ust_lib:debug_link, and lttng_ust_lib:unload events from
lttng-ust, which track the state of loaded libraries. This ensures we
correctly handle dlopen of libraries with direct dependencies.
Allow regenerating the statedump of a running session
The "lttng regenerate statedump" command can be used to regenerate the
statedump of a running session whenever needed. This is particularly
useful in snapshot and trace-file rotation modes where the original
statedump may be lost.
Rename the "metadata regenerate" command to "regenerate metadata"
Prepare the deprecation of the "metadata regenerate" command since we
need to regenerate the statedump as well, so it is more convenient to
have one command to regenerate various session attributes.
Fix: handle negative (unlimited) system stack size limits
This also changes the stack size selection policy to select
the largest of:
1) default pthread stack size (dictated by libc)
2) system soft limit
3) 2 MB
This is bounded by the system's hard limit on stack size.
If this limit is smaller than 2 MB, the default size mentioned
in pthread_create(3) for Linux, we warn the user that the daemons
may be unreliable and advise bumping this limit.
Note that it is most likely possible to operate the daemons with
far less than 2 MB of stack space. However, this was not
extensively tested.
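The selection policy above can be sketched as a pure function. This is a minimal illustration, not the lttng-tools implementation: the names select_stack_size and MIN_RECOMMENDED_STACK_SIZE are hypothetical, and treating an unlimited (RLIM_INFINITY) soft limit as "no contribution" is an assumption. The caller is expected to obtain the libc default (for instance with pthread_attr_getstacksize()) and the limits (with getrlimit(RLIMIT_STACK, ...)).

```c
#include <stddef.h>
#include <sys/resource.h>

/* 2 MB, the default mentioned in pthread_create(3) for Linux. */
#define MIN_RECOMMENDED_STACK_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper: pick the largest of the libc default, the soft
 * limit and 2 MB, then cap the result by the hard limit.
 * RLIM_INFINITY means "unlimited" and is never used as a size
 * (assumption of this sketch).
 */
static size_t select_stack_size(size_t pthread_default, rlim_t soft,
		rlim_t hard)
{
	size_t selected = pthread_default;

	if (soft != RLIM_INFINITY && soft > selected) {
		selected = (size_t) soft;
	}
	if (selected < MIN_RECOMMENDED_STACK_SIZE) {
		selected = MIN_RECOMMENDED_STACK_SIZE;
	}
	if (hard != RLIM_INFINITY && selected > hard) {
		/* Bounded by the hard limit; a caller could warn here. */
		selected = (size_t) hard;
	}
	return selected;
}
```

With a 1 MB hard limit the function returns 1 MB, which is the case where a warning about possibly unreliable daemons would be appropriate.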
Julien Desfossez [Tue, 28 Jun 2016 21:45:52 +0000 (17:45 -0400)]
Test for select, poll and epoll syscall overrides
This test for root_regression checks if the syscall overrides for
select, pselect6, poll, ppoll, epoll_ctl, epoll_wait and epoll_pwait
work as expected on arm and x86 (32 and 64-bit).
There are 11 test cases that check for normal and abnormal behaviour. If
the test system has the Babeltrace python bindings, the test validates
the content of the events, otherwise only the presence of the generated
events is checked.
We also check if kernel OOPS, WARNING or BUG were generated during the
test.
Tests: tap.sh spams tests' output when no plan is set
Some tests are implemented in C (using tap.h) or in Python
and don't use tap.sh's facilities. However, tap.sh is sourced
by utils.sh and prints an error message during its clean-up
because a plan was never set.
Michael Jeanson [Wed, 15 Jun 2016 21:18:07 +0000 (17:18 -0400)]
Fix: Set thread stack size to ulimit soft value
Some libcs don't honor the limit set for the stack size and use their own
empirically chosen static value. Detect this behavior by checking if the
current stack size is smaller than the soft limit and, in that case, set
the pthread stack size to the soft limit value.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Run the cleanup thread in the unit tests to ensure that the hash
tables are properly cleaned-up even in the context of tests. This
allows us to reliably check for memory leaks in the unit tests.
Allow the lttng cmd line and liblttng-ctl users to override the channel
mode in snapshot sessions.
Note that liblttng-ctl users who explicitly set the "overwrite" mode
to "0" (discard) for a snapshot session will now get a channel in
discard mode.
The DEFAULT_CHANNEL_OVERWRITE used by liblttng-ctl
lttng_channel_set_default_attr() is changed to "-1".
Fix: validate number of subbuffers after tweaking properties
There are properties that are tweaked by each of ust and kernel channel
create functions after a validation on the number of subbuffers for
overwrite channels. Move the validation after those property
modifications.
The ht_cleanup thread is shut down before the queue of rcu
callbacks is emptied by the rcu_barrier(). Since callbacks added
by call_rcu can push hash tables through the ht_cleanup pipe, we run
into cases where the clean-up thread has been shutdown and
hash tables pushed through the clean-up pipe are leaked.
For channels configured with large sub-buffer size, the relayd copies
the entire trace sub-buffer (trace packet) into a large buffer, and then
copies the large buffer to disk. This is inefficient in terms of
cache locality.
Use a 64k buffer on the stack instead, and move the data piece-wise.
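The piece-wise approach can be sketched as follows. This is an illustration using plain file descriptors with read(2) and write(2); the function name copy_piecewise and the COPY_CHUNK_SIZE constant are hypothetical, and the actual relayd code path (which receives data from a socket) is not reproduced here.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define COPY_CHUNK_SIZE (64 * 1024)

/*
 * Hypothetical helper: copy exactly `len` bytes from in_fd to out_fd
 * through a fixed 64k stack buffer. Returns 0 on success, -1 on error
 * or premature end of input.
 */
static int copy_piecewise(int in_fd, int out_fd, size_t len)
{
	char buf[COPY_CHUNK_SIZE];

	while (len > 0) {
		size_t want = len < sizeof(buf) ? len : sizeof(buf);
		size_t off = 0;
		ssize_t got = read(in_fd, buf, want);

		if (got < 0) {
			if (errno == EINTR) {
				continue;
			}
			return -1;
		}
		if (got == 0) {
			return -1;	/* Input ended before `len` bytes. */
		}
		/* write() may be partial: loop until the chunk is out. */
		while (off < (size_t) got) {
			ssize_t put = write(out_fd, buf + off,
					(size_t) got - off);

			if (put < 0) {
				if (errno == EINTR) {
					continue;
				}
				return -1;
			}
			off += (size_t) put;
		}
		len -= (size_t) got;
	}
	return 0;
}
```

The working set stays at 64k regardless of the sub-buffer size, so each chunk is likely still in cache when it is written back out.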
Fix: reduce scope of kconsumer consumed_pos and produced_pos
The consumed_pos and produced_pos accesses are protected by the
stream mutex, which is fine as-is. However, consumed_pos is
passed to consumer_get_consume_start_pos() and is flagged by
Coverity as a possible use of a "stale" consumed_pos.
From an analyzer's standpoint, this makes sense since
both lttng_kconsumer_get_produced_snapshot() and
lttng_kconsumer_get_consumed_snapshot() could leave their output
parameter uninitialized and return 0 since they both assume that
ioctl() will set errno if ret != 0.
IOCTL(3P) specifies that errno is only set if ret < 0.
A bug in lttng-modules could cause ioctl() to return a positive
value, leaving the errno variable unset. In such a case,
both functions would return 0, leaving the positions uninitialized.
A follow-up fix enforces this assumption (ret never > 0) as part
of the kernctl API.
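One way to enforce the "ret never > 0" assumption is to normalize the ioctl() result before callers see it. The sketch below is an assumption about how such a check could look, not the kernctl code itself: normalize_ioctl_ret is a hypothetical name, and mapping an unexpected positive value to -EINVAL is an arbitrary choice made for illustration. It only applies to calls whose contract is "0 on success".

```c
#include <errno.h>

/*
 * Hypothetical helper: turn a raw ioctl() result into 0 or a negative
 * errno-style code. `saved_errno` is the errno value captured right
 * after the ioctl() call.
 */
static int normalize_ioctl_ret(int ret, int saved_errno)
{
	if (ret < 0) {
		/* Documented failure: errno is meaningful. */
		return -saved_errno;
	}
	if (ret > 0) {
		/*
		 * Unexpected positive value: errno was not set, so report
		 * an error explicitly instead of pretending success.
		 */
		return -EINVAL;
	}
	return 0;
}
```

With this in place, a caller that sees 0 knows the output parameter was filled in, which is the property the analyzer could not establish.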
Jonathan Rajotte [Thu, 26 May 2016 22:14:37 +0000 (18:14 -0400)]
Fix: set the logger level to prevent unexpected level inheritance
BSF and other jars can ship with an embedded log4j.properties
file. This causes problems when launching an application with a general
class path (e.g. /usr/share/java/*) since log4j will look for a
configuration file in all loaded jars. If any contains a directive for
the root logger, it will affect any logger with no level that is
directly under the root logger. This can result in unexpected
behaviour (e.g. no events triggered).
Link: https://issues.apache.org/jira/browse/BSF-24
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Cleanup error.h __lttng_print() used for message printing
The loglevels have never really been a mask, and it is useless to try to
use them as masks, because the compiler statically knows the value of
the loglevel requested, and can therefore optimise away all the logic.
This takes care of Coverity warning about mixed bitwise and boolean
logic, which was technically correct, but more complex than needed.
Philippe Proulx [Tue, 17 May 2016 23:30:39 +0000 (19:30 -0400)]
doc/man: put AsciiDoc attributes in their own file
This facilitates the generation of man pages using another
asciidoc.conf file, but keeping the same attributes, without
having to split the generated configuration file.
Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
The new environment variable LTTNG_ABORT_ON_ERROR allows each
lttng-tools program to call abort() on PERROR() and ERR() after the
error message has been printed to stderr.
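The behaviour can be sketched as below. The helper names abort_on_error_enabled and report_error are hypothetical, and the assumption that the variable must be set to "1" to take effect should be checked against the lttng-tools documentation; only the order (print first, then abort) comes from the description above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 when LTTNG_ABORT_ON_ERROR is set to
 * "1" (assumed activation value), 0 otherwise.
 */
static int abort_on_error_enabled(void)
{
	const char *value = getenv("LTTNG_ABORT_ON_ERROR");

	return value != NULL && strcmp(value, "1") == 0;
}

/*
 * Hypothetical stand-in for ERR(): print the message to stderr, then
 * abort() if the environment asks for it.
 */
static void report_error(const char *msg)
{
	fprintf(stderr, "Error: %s\n", msg);
	if (abort_on_error_enabled()) {
		abort();
	}
}
```

Aborting after the message is printed leaves a core dump at the point of the first error, with the message already visible in the log.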
Fix: ust-consumer: flush empty packets on snapshot channel
Snapshot operation on a non-stopped stream should use a "final" flush to
ensure empty packets are flushed, so we gather timestamps at the moment
where the snapshot is taken. This is important for streams that have a
low amount of activity, which might be on an empty packet when the
snapshot is triggered.
PRINT_ERR maps to 0x1, PRINT_WARN maps to 0x2, which is fine so far to
use as masks, but PRINT_BUG maps to 0x3, which is the same as both
PRINT_ERR and PRINT_WARN, and does not make sense to use in masks with
__lttng_print:
(type & (PRINT_WARN | PRINT_ERR | PRINT_BUG))
Fix this by ensuring PRINT_BUG has its own mask, and express all
constants as shifts to eliminate the risk of re-introducing a similar
bug in the future.
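A sketch of constants expressed as shifts is shown below. The enum name, the ordering of the bits, the extra PRINT_MSG value and the goes_to_stderr helper are assumptions made for illustration; only the three names PRINT_ERR, PRINT_WARN and PRINT_BUG and the mask expression come from the text above.

```c
/*
 * Each constant owns exactly one bit, so no value can be the union of
 * two others (the problem with PRINT_BUG == 0x3).
 */
enum sketch_print_type {
	PRINT_ERR  = 1 << 0,
	PRINT_BUG  = 1 << 1,
	PRINT_WARN = 1 << 2,
	PRINT_MSG  = 1 << 3,	/* Hypothetical non-error type. */
};

/* Hypothetical helper mirroring the mask test quoted above. */
static int goes_to_stderr(int type)
{
	return (type & (PRINT_WARN | PRINT_ERR | PRINT_BUG)) != 0;
}
```

Adding a new type means adding the next shift, which cannot collide with an existing mask by construction.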
We should flush the last packet after stop, not before. Otherwise, we
may end up with events written immediately after the flush, which
defeats the purpose of flushing.
Fix: UST should not generate packet at destroy after stop
In the following scenario:
- create, enable events (ust),
- start
- ...
- stop (wait for data_pending to complete)
- destroy
- rm the trace directory
We would expect that the "rm" operation would not conflict with the
consumer daemon trying to output data into the trace files, since the
"stop" operation ensured that there was no data_pending.
However, the "destroy" operation currently generates an extra packet
after the data_pending check (the "on_stream_hangup"). This causes the
consumer daemon to try to perform trace file rotation concurrently with
the trace directory removal in the scenario above, which triggers
errors. The main reason why this empty packet is generated by "destroy"
is to deal with a trace start/stop scenario which would otherwise generate
a completely empty stream.
Therefore, introduce the concept of a "quiescent stream". It is
initialized to false on stream creation (first packet is empty). When
tracing is started, it is set to false (for cases of start/stop/start).
When tracing is stopped, if the stream is not quiescent, perform a
"final" flush (which will generate an empty packet if the current packet
was empty), and set quiescent to true. On "destroy" stream and on
application hangup: if the stream is not quiescent, perform a "final"
flush, and set the quiescent state to true.
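The state transitions described above can be modelled with a small sketch. The structure and function names are hypothetical, and the "final" flush is reduced to a counter so that the number of packets generated by flushes can be observed; the real consumer code does considerably more.

```c
#include <stdbool.h>

/* Hypothetical model of a stream; not the consumer's real structure. */
struct sketch_stream {
	bool quiescent;
	unsigned int final_flush_count;
};

static void sketch_stream_init(struct sketch_stream *s)
{
	s->quiescent = false;	/* First packet is empty. */
	s->final_flush_count = 0;
}

static void sketch_stream_start(struct sketch_stream *s)
{
	s->quiescent = false;	/* Covers start/stop/start. */
}

/* Shared by stop, destroy and application hangup. */
static void sketch_stream_make_quiescent(struct sketch_stream *s)
{
	if (!s->quiescent) {
		s->final_flush_count++;	/* Stands in for the final flush. */
		s->quiescent = true;
	}
}

static void sketch_stream_stop(struct sketch_stream *s)
{
	sketch_stream_make_quiescent(s);
}

static void sketch_stream_destroy(struct sketch_stream *s)
{
	sketch_stream_make_quiescent(s);
}
```

In this model, start, stop, destroy performs a single final flush (at stop), so destroy no longer produces a packet after the data_pending check. A stream destroyed without ever being stopped still gets one final flush, which avoids a completely empty stream.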
The test case for '*', which enables all events, is flaky by its
nature since buffers may be filled by other kernel events preventing
the test script from finding the test event (it is often discarded).
Fix: bad file descriptors on close after rotation error
Ensure we don't try to close output stream file descriptors twice when a
trace file rotation error occurs (once at tracefile rotation, once when
closing the stream). Set the fd value to -1 after the first close to
ensure we don't try to close it again.
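The pattern amounts to the following sketch, where close_fd_once is a hypothetical helper name and the descriptor is passed by pointer so it can be invalidated in place:

```c
#include <unistd.h>

/*
 * Hypothetical helper: close *fd if it is still valid and mark it as
 * closed. A second call is a no-op and returns 0.
 */
static int close_fd_once(int *fd)
{
	int ret = 0;

	if (*fd >= 0) {
		ret = close(*fd);
		/* Invalidate even on error: the fd must not be reused. */
		*fd = -1;
	}
	return ret;
}
```

Both the rotation error path and the stream teardown path can then call the helper unconditionally without risking an EBADF on the second close.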
Michael Jeanson [Wed, 18 May 2016 19:43:06 +0000 (15:43 -0400)]
Fix: merge tap tests stdout and stderr
This keeps the output and error statements ordered in the log
file and ensures that the first line is the TAP test plan. Some TAP
parsers are confused if the test plan is not on the first line.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Fix: remove logically dead code in send_channel_uid_to_ust
Found by Coverity:
at_most: At condition ret < 0, the value of ret must be at most -1.
cannot_set: At condition ret < 0, the value of ret cannot be equal
to any of {-1030, -32}.
dead_error_condition: The condition ret < 0 must be true.
    } else if (ret < 0) {
            goto error_stream_unlock;
    }
CID 1323135 (#1 of 1): Logically dead code
(DEADCODE)dead_error_line: Execution cannot reach this statement: goto
error_stream_unlock;.
Fix: unchecked return value in low throughput test
Found by Coverity:
CID 1019967 (#1 of 1): Unchecked return value from library
(CHECKED_RETURN)2. check_return: Calling poll(NULL, 0UL, 60000) without
checking return value. This library function may fail and return an
error code.
We really don't care whether this poll succeeds or not.