From 690732c05af4afe24d01a7c8509b7baf321dc221 Mon Sep 17 00:00:00 2001 From: Jonathan Rajotte Date: Mon, 15 Mar 2021 17:52:24 -0400 Subject: [PATCH] Fix: sessiond: session destroy hang in per-uid when context cannot be added MIME-Version: 1.0 Content-Type: text/plain; charset=utf8 Content-Transfer-Encoding: 8bit Observed issue ============== The system_test CI jobs hang on the perf test suite during the destroy command steps of the ust perf raw subtest. Cause ===== The system_test are running inside a kvm as root. It turns out that the PMU (UNHALTED_REFERENCE_CYCLES) the test suite is trying to add is unavailable on the qemu host. ustctl_add_context return -1024 since it fails to add the context. This leads us down the error path for the callstack leading to the ustctl_add_context call. 1) `ust_app_channel_create` returns `ret` != 0; 2) `find_or_create_ust_app_channel` returns `ret != 0`; 3) `ust_app_synchronize` based on the `ret` value goes directly to the end of the function to an error path without passing on the `create_ust_app_metadata` function and clean-up structure related to the app. Note that being in per-uid mode, data and metadata channel/streams/buffer allocation is done on the fly for the first app during `ust_app_synchronize` and its callee. For the current problematic scenario, only the data channels have been allocated on the consumer for the uid at that point. The metadata for that uid is not yet created. Now that we know more of what is going on during an ""add context"" let's take a look at the actual hang. The client never complete the destroy command since the consumerd indicates that the trace chunk for the session is not closed. The trace chunk still exists despite the fact that a close chunk command has been issued. This is the case since its refcount never reaches zero and thus the release does not complete. In a normal execution without the use of contexts, the release of the trace hunk (refcount == 0) occurs during the final rotation on destroy. Upon further comparison between a working execution and a non-working execution, in a non-working execution the `cmd_rotate_session` does not issue the rotation for the data channels since the loop detects that no metadata is present. Which, as we discussed earlier, can happen if we fail to add the context to the app channel. [1] ``` cds_list_for_each_entry(reg, &usess->buffer_reg_uid_list, lnode) { struct buffer_reg_channel *reg_chan; struct consumer_socket *socket; if (!reg->registry->reg.ust->metadata_key) { /* Skip since no metadata is present */ continue; } .... /* Rotate the data channels. */ cds_lfht_for_each_entry(reg->registry->channels->ht, &iter.iter, reg_chan, node.node) { ret = consumer_rotate_channel(socket, reg_chan->consumer_key, usess->uid, usess->gid, usess->consumer, /* is_metadata_channel */ false); if (ret < 0) { cmd_ret = LTTNG_ERR_ROTATION_FAIL_CONSUMER; goto error; } } .... } ``` Solution ======== Move the metadata check after the data channel rotation since it is possible to have data channels but no metadata channel, although it is a corner case. Note that per-pid mode and kernel are not affected by the current bug since a complete teardown of all objects is done. This only affect per-uid due to the "on the fly" allocation nature of it since we need to share the channel/stream/buffers across apps. Known drawbacks ========= None. References ========== [1] https://github.com/lttng/lttng-tools/blob/3d1384a4add389c38f8554130e8dec2e2d06009d/src/bin/lttng-sessiond/ust-app.c#L7057 Signed-off-by: Jonathan Rajotte Signed-off-by: Jérémie Galarneau Change-Id: I611761ac4051c9a80158705a87e59541fb37660c --- src/bin/lttng-sessiond/ust-app.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/src/bin/lttng-sessiond/ust-app.c b/src/bin/lttng-sessiond/ust-app.c index 6c1505400..35e4e3bce 100644 --- a/src/bin/lttng-sessiond/ust-app.c +++ b/src/bin/lttng-sessiond/ust-app.c @@ -2488,7 +2488,7 @@ static int do_consumer_create_channel(struct ltt_ust_session *usess, health_code_update(); /* - * Now get the channel from the consumer. This call wil populate the stream + * Now get the channel from the consumer. This call will populate the stream * list of that channel and set the ust objects. */ if (usess->consumer->enabled) { @@ -6277,11 +6277,6 @@ enum lttng_error_code ust_app_rotate_session(struct ltt_session *session) struct buffer_reg_channel *reg_chan; struct consumer_socket *socket; - if (!reg->registry->reg.ust->metadata_key) { - /* Skip since no metadata is present */ - continue; - } - /* Get consumer socket to use to push the metadata.*/ socket = consumer_find_socket_by_bitness(reg->bits_per_long, usess->consumer); @@ -6304,6 +6299,19 @@ enum lttng_error_code ust_app_rotate_session(struct ltt_session *session) } } + /* + * The metadata channel might not be present. + * + * Consumer stream allocation can be done + * asynchronously and can fail on intermediary + * operations (i.e add context) and lead to data + * channels created with no metadata channel. + */ + if (!reg->registry->reg.ust->metadata_key) { + /* Skip since no metadata is present. */ + continue; + } + (void) push_metadata(reg->registry->reg.ust, usess->consumer); ret = consumer_rotate_channel(socket, -- 2.34.1