David Goulet [Mon, 18 Feb 2013 19:08:19 +0000 (14:08 -0500)]
Fix: set relayd sock sent flag per consumer socket
The session daemon usually handles two consumers, one for 32-bit
applications and one for 64-bit.
When sending relayd information to the consumer, we must send it to
every existing consumer in case a session contains both 32-bit and
64-bit applications. So, the sent flag is moved from the consumer output
object to be per consumer socket.
This bug was seen and diagnosed during the 2.2 development phase and is
fixed in master branch in the following commit.
David Goulet [Wed, 30 Jan 2013 16:08:54 +0000 (11:08 -0500)]
Fix: change health poll update to entry/exit calls
This gives the code flow clearer semantics. Furthermore, the current
counter of the health state is now validated, raising an assert() if the
value is unexpected.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Tue, 29 Jan 2013 17:26:09 +0000 (12:26 -0500)]
Fix: remove consumer health poll update on startup
With the TLS health state, the consumer thread has to register in order
to be validated during the health check, so the poll update workaround
is no longer needed and is replaced with a simple code update just after
the health registration of the thread.
This was reported after the TLS feature ticket #411 was implemented.
Fixes #428
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 24 Jan 2013 19:29:49 +0000 (14:29 -0500)]
Cleanup unused health state reference
The old health state structures are not used anymore since we now rely
on the TLS health state.
This is backported from master because it does not alter any behavior
and makes the next fix easier to merge from the master branch since this
commit changes many lines.
Signed-off-by: David Goulet <dgoulet@efficios.com>
So, the check was always made over an uninitialized variable on the
stack. Fortunately, worst case scenario, new_size is set to the maximum
allowed or kept untouched.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Tue, 22 Jan 2013 17:13:13 +0000 (12:13 -0500)]
Fix: add missing rcu lock for UST lookup
Trace UST channel and event were not protected by RCU lock when calling
their find function. Furthermore, the event lookup was not protected at
all during and after the lookup.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 10 Jan 2013 17:07:35 +0000 (12:07 -0500)]
Fix: update next_net_seq_num after sending header
Increment the sequence number once we are sure that the relayd has
correctly received the data header. If an error occurs when sending the
header, the data won't be extracted from the buffers, thus keeping this
sequence number untouched.
Furthermore, after sending the header, if the relayd dies, this value
won't matter much and if there is an error on the stream when reading
the trace data, the stream will be deleted thus closed on the relayd
making this value useless.
It's important to note that this sequence number is updated on the
relayd side if the full expected data packet was received. So,
incrementing the value after the transmission of the header does not
change anything in terms of value coherency. The point is to have clear
semantics: once the value has been read and used successfully
(transmission to relayd), we update it.
In that code flow, the stream's lock is acquired so there is no need to
read/update it atomically. I've also added comments to better explain
the purpose of this variable and how to use it.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 10 Jan 2013 15:18:31 +0000 (10:18 -0500)]
Fix: wrong loop continuation in metadata thread
The validation of the endpoint status can change the metadata hash
table, meaning stream(s) can be removed from it and from the poll set.
After that, continuing the for loop made the thread use possibly invalid
file descriptors that were no longer in the hash table, triggering the
lookup assert of the node just after the for loop.
The very important part here is that when the metadata ht changes, we
MUST go back to the poll wait() to synchronize the subset of fd we are
looking at.
Reported-by: Jesus Garcia <jesus.garcia@ericsson.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 20 Dec 2012 01:56:04 +0000 (20:56 -0500)]
Fix: bad check of accept() return value
Also fix a missing ret = -1 assignment. Although hitting a positive ret
value that does not match the structure size is unlikely, better safe
than sorry.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 19 Dec 2012 23:11:12 +0000 (18:11 -0500)]
Fix: change perror to debug statement
Most of the changes here remove a double PERROR, since the transport
layer already prints one. The debug message now indicates where the
transport error occurred.
Also, don't print an error if the relayd is not found. This is possible
if the relayd dies so an error here is useless to the common user but
useful as a debug statement.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 19 Dec 2012 22:54:25 +0000 (17:54 -0500)]
Fix: don't print EPIPE error which can happen
Anytime a relayd is killed, writing on a closed fd is entirely possible,
so reporting EPIPE with PERROR is useless as an error. It is now printed
as a dbg message instead.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 19 Dec 2012 20:36:59 +0000 (15:36 -0500)]
Fix: Off by one in seq num for data pending command
Like the close stream command, the data pending command needs to use the
next sequence number of the stream minus 1, or else we are off by one on
the relayd during the check. For instance, 4 data packets mean a
prev_seq value of 4 but a last_next_seq_num of 5, hence the off by one
in the data pending check.
Furthermore, the check was actually wrong on the relayd side. Having a
previous sequence number lower than the last one seen does NOT mean that
the data is not pending so the check needed was actually equal or
greater.
Signed-off-by: David Goulet <dgoulet@efficios.com>
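The off-by-one described in the commit above can be illustrated with a small sketch. This is a Python model of the logic only, not the actual C code: the function names are hypothetical, and only prev_seq and the "next sequence number minus 1" rule come from the commit message.

```python
def last_seq_for_data_pending(next_net_seq_num):
    """Consumer side (hypothetical helper): sequence number of the last
    packet actually sent, i.e. the next sequence number minus 1."""
    return next_net_seq_num - 1


def relayd_data_pending(prev_seq, last_seq):
    """Relayd side (hypothetical helper): data is NOT pending only when the
    last sequence number seen by the relayd (prev_seq) is equal to or
    greater than the last one the consumer reports having sent."""
    return not (prev_seq >= last_seq)


# Example from the commit message: 4 data packets sent and received.
prev_seq = 4
next_net_seq_num = 5

# Sending the next sequence number unmodified reports pending data forever.
print(relayd_data_pending(prev_seq, next_net_seq_num))

# Sending it minus 1 lets the relayd see that everything was received.
print(relayd_data_pending(prev_seq, last_seq_for_data_pending(next_net_seq_num)))
```

The first call reports data as pending even though all 4 packets arrived; the second does not.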
David Goulet [Wed, 19 Dec 2012 19:13:24 +0000 (14:13 -0500)]
Fix: wrong check on session started on stop command
This is problematic for applications that live longer than the tracing
session. Unfortunately, make check did not catch this problem since we
either kill the applications before the stop or wait for them to die.
I will quote a colleague of mine on IRC after discovering this:
14:14 < cbab> moar tests!
:)
Signed-off-by: David Goulet <dgoulet@efficios.com>
Christian Babeux [Tue, 18 Dec 2012 21:31:17 +0000 (16:31 -0500)]
run-report: Allow tests to spawn and control their own sessiond
The run-report script can spawn a sessiond if the 'daemon' key value is
set to 'True' in the test description dictionary. If the 'daemon' key is
set to 'False', the TEST_NO_SESSIOND environment variable is set so no
sessiond can be spawned in the tests. This variable is also set when the
run-report script spawns its own sessiond.
This behavior has the unfortunate side-effect of restricting any kind of
spawning and control of the sessiond via the tests.
Fix this issue by allowing the tests to spawn their own sessiond. We
need to pass an additional env dictionary to the TestWorker in order to
spawn the test with the proper environment variables set.
To indicate that a test will spawn and manage its own sessiond, the
'daemon' key value should be set to the "test" string.
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
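As a rough illustration of the three 'daemon' modes described in the commit above, the environment handed to a test could be built as follows. This is a sketch under assumptions, not the run-report code: build_test_env is a hypothetical helper, the value "1" for TEST_NO_SESSIOND is assumed, and leaving the variable unset in "test" mode is inferred from the commit message.

```python
import os


def build_test_env(daemon, base_env=None):
    """Return the environment for a test given its 'daemon' key value.

    Hypothetical helper: True and False both set TEST_NO_SESSIOND (the
    script either spawned the sessiond itself or none is wanted), while
    the "test" string leaves it unset so the test can spawn its own.
    """
    env = dict(os.environ if base_env is None else base_env)
    if daemon == "test":
        # Assumption: the test manages its own sessiond, so do not forbid it.
        env.pop("TEST_NO_SESSIOND", None)
    else:
        # Assumption: any non-empty value is enough; "1" is illustrative.
        env["TEST_NO_SESSIOND"] = "1"
    return env
```

The returned dictionary would then be passed as the env argument when spawning the test process.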
Christian Babeux [Tue, 18 Dec 2012 21:31:16 +0000 (16:31 -0500)]
run-report: Fix CPU usage stats computation
The CPU usage statistics are computed by grepping the top command
output. The top output format has since changed, so the CPU usage
statistics were not properly computed.
Fix this by adjusting to the new top command output format.
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
Christian Babeux [Tue, 18 Dec 2012 21:31:15 +0000 (16:31 -0500)]
run-report: Restore SIGPIPE default handler in subprocess calls
Python overrides the SIGPIPE default handler because it prefers to check
every write and raise an IOError exception rather than taking SIGPIPE
[1].
This behavior has the unfortunate side-effect of polluting stdout with
broken pipe messages on shell pipeline invocations (e.g. echo foo |
grep something | etc.) in shell scripts spawned via subprocess.Popen().
This commit fixes the pollution of stdout by restoring the default
SIGPIPE handler on subprocess calls.
[1] - http://bugs.python.org/issue1652
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
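A common way to restore the default SIGPIPE handler for a child process, as the commit above describes, is the preexec_fn argument of subprocess.Popen(). The sketch below is not the run-report code itself, and the names restore_sigpipe and run_shell are hypothetical. Note that on Python 3 the restore_signals argument of Popen (true by default) already resets SIGPIPE in the child, so the explicit hook mostly matters on Python 2.

```python
import signal
import subprocess


def restore_sigpipe():
    """Runs in the child between fork() and exec(): undo Python's SIG_IGN."""
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)


def run_shell(command):
    """Run a shell command line; return (stdout, stderr, exit status)."""
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        preexec_fn=restore_sigpipe,  # POSIX only
    )
    out, err = proc.communicate()
    return out, err, proc.returncode
```

With the default handler restored, a process in a shell pipeline that writes to a closed pipe is terminated by SIGPIPE instead of printing a broken pipe message.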
Christian Babeux [Tue, 18 Dec 2012 21:31:14 +0000 (16:31 -0500)]
run-report: Use libtool wrapper to spawn the sessiond for tests
The run-report script was using the sessiond binary generated via
libtool under the ".libs/" folder. When using this binary, the consumerd
used when starting the sessiond is the one installed system-wide (if
any). This could lead to test failures if no consumer is installed on
the system or if a version mismatch occurs.
This commit fixes this by using the consumerd that was built with
libtool in the local source tree.
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Tue, 18 Dec 2012 19:02:14 +0000 (14:02 -0500)]
Fix: flag metadata stream on quiescent control cmd
For the relayd, when doing a quiescent control command, we have to flag
the corresponding metadata stream or else it will simply stay alive
until a close stream, and the end data pending command will always
return that data is in flight.
Add a stream id to the relayd command so the relayd can identify which
stream to flag.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Tue, 18 Dec 2012 17:05:24 +0000 (12:05 -0500)]
Fix: remove ua_sess->started assert on stop trace
It's totally possible that a start failed for a specific app but the
started flag is set for the global session making a stop trace possible
on a failed started session.
The assert is no longer valid since this code flow is possible.
Signed-off-by: David Goulet <dgoulet@efficios.com>
Julien Desfossez [Mon, 17 Dec 2012 17:13:38 +0000 (12:13 -0500)]
Set classes of traffic in high_throughput_limits
This patch creates 2 classes for the bandwidth limited test instead of
one. The intent is to have multiple queues in the kernel instead of just
one. That way we can prioritize the control port over the data port and
make sure it gets its share of the bandwidth.
With this update, the control port gets 1/10th of the limit and the data
port gets the remaining 9/10th. If unused, the data connection can
borrow the remaining bandwidth.
Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Mon, 17 Dec 2012 17:19:56 +0000 (12:19 -0500)]
Fix: force the poll() return value to be nb_fd
With poll(), we have to iterate over all fds in the pollset since it is
handled in user space, whereas we don't have to with epoll.
This is a first patch to fix the fact that we should iterate over the
number of fds that the lttng_poll_wait() call returns, which is the
number of returned events for epoll and the whole set of fds for poll.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
Christian Babeux [Thu, 13 Dec 2012 23:39:13 +0000 (18:39 -0500)]
Tests: Add health check testpoint fail test
This test triggers a failure in a specified thread by using the
testpoint mechanism. The testpoint behavior is implemented in
health_fail.c. The testpoint code simply returns 1 (non-zero values are
considered errors for testpoints) to trigger the specific thread error
handling mechanism.
This test ensures that we can detect a health failure for each thread's
error handling path.
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
Christian Babeux [Thu, 13 Dec 2012 23:38:56 +0000 (18:38 -0500)]
Add return code to the testpoint mechanism
The testpoint processing could fail and currently there is no mechanism
to notify the caller of such failures. This patch adds an int return
code to the testpoint prototype. A non-zero return code indicates
failure.
When using the testpoint mechanism, the caller should properly handle
testpoint failure cases and trigger the appropriate response (error
handling, thread teardown, etc.).
Signed-off-by: Christian Babeux <christian.babeux@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 13 Dec 2012 22:51:45 +0000 (17:51 -0500)]
Fix: RCU unlock out of error path
On channel error, the RCU read-side lock was not released. Furthermore,
remove a check for a NULL session that also did not go through an RCU
unlock. Change it to an assert.
This also adds a channel subbuf size check when enabling a channel.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Thu, 13 Dec 2012 01:16:33 +0000 (20:16 -0500)]
Fix data pending for inflight streaming
The consumer_data_pending() function had a badly named label. The goto
label data_not_pending was actually going to the return value of pending
data (1). So, this patch fixes that by renaming the label to match its
actual meaning.
Add a missing destroy of the relayd session id mapping hash table.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 12 Dec 2012 22:05:45 +0000 (17:05 -0500)]
Add the relayd create session command
This is needed in order to fix a specific condition of the data pending
command where we need to have streams associated with a session. This
command will also be used for new features in the future.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 12 Dec 2012 16:23:20 +0000 (11:23 -0500)]
Make the consumer sends a ACK after each command
This is needed to avoid buffer bloat when throttling communication
between the consumer and the relayd. With a very low bandwidth limit
between the relayd and consumerd, the session daemon would send a high
rate of commands to the consumer without the unix socket queue ever
being emptied. The UNIX socket then reaches buffer full conditions,
which is prone to trigger corner-case behaviors in blocking send/recv
with MSG_WAITALL. This is likely the cause of the hang experienced when
limiting relayd bandwidth.
Adding an ACK to each command makes sure that we, the consumer, tell the
session daemon that we have emptied the unix socket buffer.
NOTE: In consumer_add_relayd_socket(), there might be a problem with the
error path and message status to the sessiond. A subsequent patch might
fix a possible issue but for now it is not at all critical since any
critical error on the consumer side will notify the sessiond through the
error socket.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Wed, 12 Dec 2012 18:39:37 +0000 (13:39 -0500)]
Remove MSG_WAITALL on every recvmsg() socket type
In order to handle messages that are possibly larger than the socket
buffer size set by wmem_max and rmem_max /proc files, ensure that the
recv-side reads the data chunk-wise rather than hanging on a
MSG_WAITALL.
In addition to fixing this issue, chances are that it will also help
fixing hangs detected due to UNIX socket buffers filling up. The
MSG_WAITALL behavior in such situations might be unexpected.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
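The chunk-wise receive described in the commit above can be sketched as a loop that keeps calling recv() until the expected length has arrived, instead of relying on MSG_WAITALL. This is a Python model of the idea, not the C recvmsg() code from the patch; recv_exact is a hypothetical name.

```python
import socket


def recv_exact(sock, length):
    """Read exactly `length` bytes from `sock`, one chunk at a time.

    Each recv() call returns whatever is currently available (up to the
    remaining size), so a message larger than the socket buffer is simply
    assembled over several iterations.
    """
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            # The peer closed the connection before sending everything.
            raise ConnectionError("peer closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The loop handles short reads explicitly, which avoids depending on how MSG_WAITALL behaves when the message does not fit in the socket buffer.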
David Goulet [Mon, 10 Dec 2012 21:03:58 +0000 (16:03 -0500)]
Fix: Use stream deletion function when cleaning up
In theory, once the destroy stream ht function is called with the hash
table, it should be empty. However, for some fatal errors, it might not
be, so it's imperative that we gracefully delete the stream and free it
using an RCU call so both hash tables (stream and the one for the
pending command) are synchronized.
Simply freeing the stream could have created fd leaks and invalid nodes
in the data pending hash table.
Signed-off-by: David Goulet <dgoulet@efficios.com>
David Goulet [Mon, 10 Dec 2012 17:16:15 +0000 (12:16 -0500)]
Fix: Relayd and sessiond version check
Now only checks for the major version to be equal. After the 2.1 stable
release, both components will adapt to the lowest minor version for the
same major version. For this, the session daemon now sends its version
values to the relayd, so there is a slight change in the protocol here.
For instance, with a relayd 2.4 talking to a sessiond 2.8, the
communication and available features will only be those of version 2.4.
For a relayd at, say, 3.2 and a sessiond at 2.2, the communication stops
right there since the major versions differ.
Acked-by: Julien Desfossez <julien.desfossez@efficios.com>
Signed-off-by: David Goulet <dgoulet@efficios.com>
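The version rule described in the commit above (same major version required, lowest minor version used) can be expressed in a few lines. This is a Python sketch of the rule only; negotiate_version is a hypothetical name and versions are modeled as (major, minor) tuples.

```python
def negotiate_version(relayd, sessiond):
    """Return the (major, minor) both sides will speak, or None.

    Hypothetical helper: the major versions must be equal, and the
    communication falls back to the lowest minor version of the two.
    """
    relayd_major, relayd_minor = relayd
    sessiond_major, sessiond_minor = sessiond
    if relayd_major != sessiond_major:
        # Different major versions: the communication stops right there.
        return None
    return (relayd_major, min(relayd_minor, sessiond_minor))


# The two cases from the commit message.
print(negotiate_version((2, 4), (2, 8)))  # relayd 2.4, sessiond 2.8
print(negotiate_version((3, 2), (2, 2)))  # relayd 3.2, sessiond 2.2
```

The first call yields version 2.4 for both sides; the second yields None because the major versions differ.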