From: Jérémie Galarneau Date: Thu, 21 Mar 2024 20:55:08 +0000 (-0400) Subject: Fix: consumerd: leak of tracing buffers on relayd connectivity issue X-Git-Url: https://git.lttng.org./?a=commitdiff_plain;h=e793ddc8acc7f978bde7a991644cc6749ac1bcc9;p=lttng-tools.git Fix: consumerd: leak of tracing buffers on relayd connectivity issue Observed issue ============== A leak of the tracing buffers can be noticed when the relay daemon is terminated following the creation of a live session, but prior to the initiation of any applications. The issue can be reproduced with the following steps: # Create a live session $ lttng create --live # Kill the relay daemon before the allocation of the buffers $ killall lttng-relayd $ lttng enable-event --userspace --all $ lttng start # Launch an instrumented application $ ./my_app # Destroy all sessions $ lttng destroy --all # List the open file descriptors of the lttng-consumerd process # and notice how the tracing buffer are still visible. $ ls -lah /proc/$pid_of_lttng_consumerd/fd [...] lrwx------ 1 root root 64 Mar 19 19:50 987 -> '/dev/shm/shm-ust-consumer-358446 (deleted)' lrwx------ 1 root root 64 Mar 19 19:50 988 -> '/dev/shm/shm-ust-consumer-358446 (deleted)' lrwx------ 1 root root 64 Mar 19 19:50 989 -> '/dev/shm/shm-ust-consumer-358446 (deleted)' [...] Cause ===== The consumer daemon allocates recording channels and their tracing buffers in a two-step process. First, the session daemon emits an `ASK_CHANNEL_CREATION` command, which results in the allocation of the internal consumer channel structures and of the actual tracing buffers. The channel's unique key is returned to the session daemon. After this command, the channel temporarily holds a list of streams which are waiting to be sent to the session and relay daemons as a result of the `GET_CHANNEL` command. At this moment, the channel's reference count is one over the number of streams as they all hold a back-reference to their parent channel and there is a global reference held by the session daemon. The session daemon uses the key it received to emit the `GET_CHANNEL` command. When executing this command, the consumer daemon attempts to send the streams to the relay daemon. On failure to do so, the session daemon is informed of the connection error. The consumer daemon then omits a step of the command: the streams are never handed-off from the channel's internal list to the consumption/monitoring thread. This hand-off is what is internally referred-to as making the streams "globally visible". The session daemon, upon receiving the failure error code of the GET_CHANNEL command, tears down its internal ust_app channel structures. As part of that process, it emits the `DESTROY_CHANNEL` command to reclaim the channel on the consumer daemon's end. This command is deferred to the channel poll thread as the `CHANNEL_DEL` internal command. As part of this internal command, the channel poll thread cleans the channel's stream list to clean-up any streams that are not "globally visible": all of them, in our case. Then, the session daemon's global reference is released which should normally result in the reclamation of the channel itself. While reproducing the problem, we noted that channel wasn't reclaimed and that its reference count matched the number of CPUs on the system at the time the `CHANNEL_DEL` command completed. This hinted at the streams holding a reference to the channel even after the completion of the reclamation command. Looking at clean_channel_stream_list(), which cleans up the channel's temporary stream list, we note that the streams' monitor property is overridden to `false` just before the call to consumer_stream_destroy(). This is strange and a comment (added as part of 212d67a2 in 2014) hints at a locking problem that was being worked-around. In all likelihood, this no longer applies as the locking strategies used have evolved quite a bit since then. Still, setting the monitor property to `false` is problematic as, in that mode (say, channels that are used to record a snapshot), streams do not hold a reference to their parent channel. This causes the clean-up code to forego the clean-up of the channel, resulting in its leak. Since the channel ultimately owns the 'stream_fds' which represent the shared memory files, those files (and associated memory) are also leaked (they are closed during the execution of lttng_ustconsumer_del_channel()). Solution ======== We simply remove the stream monitor mode override to leave it in its appropriate state. The clean-up then proceeds normally, ensuring the tracing buffers are properly reclaimed. Known drawbacks =============== None. Fixes #1411 Signed-off-by: Jérémie Galarneau Change-Id: I4a2fb8cddd2f9da9a2c9df19ba36229627ad2569 --- diff --git a/src/common/consumer/consumer.cpp b/src/common/consumer/consumer.cpp index ed844f8df..1da243601 100644 --- a/src/common/consumer/consumer.cpp +++ b/src/common/consumer/consumer.cpp @@ -178,13 +178,6 @@ static void clean_channel_stream_list(struct lttng_consumer_channel *channel) /* Delete streams that might have been left in the stream list. */ cds_list_for_each_entry_safe (stream, stmp, &channel->streams.head, send_node) { - /* - * Once a stream is added to this list, the buffers were created so we - * have a guarantee that this call will succeed. Setting the monitor - * mode to 0 so we don't lock nor try to delete the stream from the - * global hash table. - */ - stream->monitor = 0; consumer_stream_destroy(stream, nullptr); } }