Observed issue
==============
The system_test CI jobs hang on the perf test suite during the destroy
command steps of the ust perf raw subtest.
Cause
=====
The system_test are running inside a kvm as root. It turns out that the
PMU (UNHALTED_REFERENCE_CYCLES) the test suite is trying to add is
unavailable on the qemu host.
ustctl_add_context return -1024 since it fails to add the context.
This leads us down the error path for the callstack leading to the
ustctl_add_context call.
1) `ust_app_channel_create` returns `ret` != 0;
2) `find_or_create_ust_app_channel` returns `ret != 0`;
3) `ust_app_synchronize` based on the `ret` value goes directly to the
end of the function to an error path without passing on the
`create_ust_app_metadata` function and clean-up structure related to
the app.
Note that being in per-uid mode, data and metadata
channel/streams/buffer allocation is done on the fly for the first app
during `ust_app_synchronize` and its callee. For the current problematic
scenario, only the data channels have been allocated on the consumer for
the uid at that point. The metadata for that uid is not yet created.
Now that we know more of what is going on during an ""add context""
let's take a look at the actual hang.
The client never complete the destroy command since the consumerd
indicates that the trace chunk for the session is not closed. The trace
chunk still exists despite the fact that a close chunk command has been
issued. This is the case since its refcount never reaches zero and thus
the release does not complete.
In a normal execution without the use of contexts, the release of the
trace hunk (refcount == 0) occurs during the final rotation on destroy.
Upon further comparison between a working execution and a non-working
execution, in a non-working execution the `cmd_rotate_session` does not
issue the rotation for the data channels since the loop detects that no
metadata is present. Which, as we discussed earlier, can happen if we
fail to add the context to the app channel.
[1]
```
cds_list_for_each_entry(reg, &usess->buffer_reg_uid_list, lnode) {
struct buffer_reg_channel *reg_chan;
struct consumer_socket *socket;
if (!reg->registry->reg.ust->metadata_key) {
/* Skip since no metadata is present */
continue;
}
....
/* Rotate the data channels. */
cds_lfht_for_each_entry(reg->registry->channels->ht, &iter.iter,
reg_chan, node.node) {
ret = consumer_rotate_channel(socket,
reg_chan->consumer_key,
usess->uid, usess->gid,
usess->consumer,
/* is_metadata_channel */ false);
if (ret < 0) {
cmd_ret = LTTNG_ERR_ROTATION_FAIL_CONSUMER;
goto error;
}
}
....
}
```
Solution
========
Move the metadata check after the data channel rotation since it is
possible to have data channels but no metadata channel, although it is a
corner case.
Note that per-pid mode and kernel are not affected by the current bug
since a complete teardown of all objects is done. This only affect
per-uid due to the "on the fly" allocation nature of it since we need to
share the channel/stream/buffers across apps.
Known drawbacks
=========
None.
References
==========
[1] https://github.com/lttng/lttng-tools/blob/
3d1384a4add389c38f8554130e8dec2e2d06009d/src/bin/lttng-sessiond/ust-app.c#L7057
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I611761ac4051c9a80158705a87e59541fb37660c