From: Jérémie Galarneau Date: Fri, 15 Dec 2023 16:44:34 +0000 (-0500) Subject: Fix: relayd: live client not notified of inactive streams X-Git-Tag: v2.12.15~2 X-Git-Url: https://git.lttng.org./?a=commitdiff_plain;h=5c764488e215742fbad29fd12ef904fda59878db;p=lttng-tools.git Fix: relayd: live client not notified of inactive streams Observed issue -------------- Some LTTng-tools live tests failures appear to show babeltrace2 hanging (failing to print expected events). The problem is intermittent, but Kienan was able to develop a test case that's reproducible for him. The test case performs the following steps: - Start a ust application and leave it running - Configure and then start an lttng live session - Connect a live viewer (babeltrace) - Run a second ust application - Wait for the expected number of events - In the failing case, no events are seen by babeltrace Using per-uid buffers, the test typically completes normally. With per-pid buffers the test fails, hanging indefinitely if waiting for the specified number of events. While "hanging", babeltrace2 is polling the relayd. This affects for babeltrace2 stable-2.0 and master while using lttng-tools master. For more information, see the description of bug #1406[1] Cause ----- When consuming a live trace captured in per-PID mode, Babeltrace periodically requests the index of the next packet it should consume. As part of the reply, it gets a 'flags' field which is used to announce that new streams, or new metadata, are available to the viewer. Unfortunately, these 'flags' are only set when the relay daemon has new tracing data to deliver. It is not set when the relay daemon indicates that the stream is inactive (see LTTNG_VIEWER_INDEX_INACTIVE). In the average case where an application is spawned while others are actively emiting events, a request for new data will result in a reply that returns an index entry (code LTTNG_VIEWER_INDEX_OK) for an available packet accompanied by the LTTNG_VIEWER_FLAG_NEW_STREAM flag. This flag indicates to the viewer that it should request new streams (using the LTTNG_VIEWER_GET_NEW_STREAMS live protocol command) before consuming the new data. In the cases where we observe a hang, an application is running but not emiting new events. As such, the relay daemon periodically emits "live beacons" to indicate that the session's streams are inactive up to a given time 'T'. Since the existing application remains inactive and the viewer is never notified that new streams are available, the viewer effectively remains "stuck" and never notices the new application being traced. The LTTNG_VIEWER_FLAG_NEW_METADATA communicates a similar semantic with regards to the metadata. However, ignoring it for inactive streams isn't as deleterious: the same information is made available to the viewer the next time it will successfully request a new index to the relay daemon. This would only become a problem if the tracers start to express non-layout data (like supplemental environment information, but I don't see a real use-case) as part of the metadata stream that should be made available downstream even during periods of inactivity. Note that the same problem most likely affects the per-UID buffer allocation mode when multiple users are being traced. Solution -------- On the producer end, LTTNG_VIEWER_FLAG_NEW_STREAM is set even when returning an inactivity index. Note that to preserve compatibility with older live consumers that don't expect this flag in non-OK response, the LTTNG_VIEWER_FLAG_NEW_STREAM notification is repeated until the next LTTNG_VIEWER_GET_NEW_STREAMS command that returns LTTNG_VIEWER_INDEX_OK. The 'new_streams' state is no longer cleared from relay sessions during the processing of the LTTNG_VIEWER_GET_NEXT_INDEX commands. Instead, it is cleared when the viewer requests new streams. On Babeltrace's end, the handler of the LTTNG_VIEWER_GET_NEXT_INDEX command (lttng_live_get_next_index) is modified to expect LTTNG_VIEWER_FLAG_NEW_STREAM in the cases where the command returns: - LTTNG_VIEWER_INDEX_OK (as done previously), - LTTNG_VIEWER_INDEX_HUP (new), - LTTNG_VIEWER_INDEX_INACTIVE (new). Drawbacks --------- This is arguably a protocol change as none of the producers ever set the NEW_METADATA/NEW_STREAM flags when indicating an inactive stream. References ---------- [1] https://bugs.lttng.org/issues/1406 Fixes #1406 Change-Id: I84f53f089597ac7b22ce8bd0962d4b28112b7ab6 Signed-off-by: Jérémie Galarneau --- diff --git a/src/bin/lttng-relayd/live.c b/src/bin/lttng-relayd/live.c index f02ee02b9..c266465fb 100644 --- a/src/bin/lttng-relayd/live.c +++ b/src/bin/lttng-relayd/live.c @@ -88,6 +88,27 @@ static uint64_t last_relay_viewer_session_id; static pthread_mutex_t last_relay_viewer_session_id_lock = PTHREAD_MUTEX_INITIALIZER; +static const char * +lttng_viewer_next_index_return_code_str(enum lttng_viewer_next_index_return_code code) +{ + switch (code) { + case LTTNG_VIEWER_INDEX_OK: + return "INDEX_OK"; + case LTTNG_VIEWER_INDEX_RETRY: + return "INDEX_RETRY"; + case LTTNG_VIEWER_INDEX_HUP: + return "INDEX_HUP"; + case LTTNG_VIEWER_INDEX_ERR: + return "INDEX_ERR"; + case LTTNG_VIEWER_INDEX_INACTIVE: + return "INDEX_INACTIVE"; + case LTTNG_VIEWER_INDEX_EOF: + return "INDEX_EOF"; + default: + abort(); + } +} + /* * Cleanup the daemon */ @@ -156,28 +177,30 @@ static int check_new_streams(struct relay_connection *conn) { struct relay_session *session; - unsigned long current_val; int ret = 0; + rcu_read_lock(); if (!conn->viewer_session) { goto end; } - rcu_read_lock(); - cds_list_for_each_entry_rcu(session, - &conn->viewer_session->session_list, - viewer_session_node) { + + cds_list_for_each_entry_rcu( + session, &conn->viewer_session->session_list, viewer_session_node) + { if (!session_get(session)) { continue; } - current_val = uatomic_cmpxchg(&session->new_streams, 1, 0); - ret = current_val; + + ret = uatomic_read(&session->new_streams); session_put(session); if (ret == 1) { goto end; } } + end: rcu_read_unlock(); + DBG("Viewer connection has%s new streams: socket_fd = %d", ret == 0 ? " no" : "", conn->sock->fd); return ret; } @@ -1187,6 +1210,8 @@ int viewer_get_new_streams(struct relay_connection *conn) response.status = htobe32(LTTNG_VIEWER_NEW_STREAMS_ERR); goto send_reply_unlock; } + + uatomic_set(&session->new_streams, 0); send_streams = 1; response.status = htobe32(LTTNG_VIEWER_NEW_STREAMS_OK); @@ -1601,6 +1626,7 @@ int viewer_get_next_index(struct relay_connection *conn) struct relay_stream *rstream = NULL; struct ctf_trace *ctf_trace = NULL; struct relay_viewer_stream *metadata_viewer_stream = NULL; + bool attached_sessions_have_new_streams = false; assert(conn); @@ -1648,6 +1674,17 @@ int viewer_get_next_index(struct relay_connection *conn) goto send_reply; } + ret = check_new_streams(conn); + if (ret < 0) { + viewer_index.status = LTTNG_VIEWER_INDEX_ERR; + ERR("Error checking for new streams in the attached sessions, returning status=%s", + lttng_viewer_next_index_return_code_str( + (enum lttng_viewer_next_index_return_code) viewer_index.status)); + goto send_reply; + } else if (ret == 1) { + attached_sessions_have_new_streams = true; + } + if (rstream->ongoing_rotation.is_set) { /* Rotation is ongoing, try again later. */ viewer_index.status = htobe32(LTTNG_VIEWER_INDEX_RETRY); @@ -1710,6 +1747,7 @@ int viewer_get_next_index(struct relay_connection *conn) */ goto send_reply; } + /* At this point, ret is 0 thus we will be able to read the index. */ assert(!ret); @@ -1769,14 +1807,6 @@ int viewer_get_next_index(struct relay_connection *conn) vstream->stream_file.handle = fs_handle; } - ret = check_new_streams(conn); - if (ret < 0) { - viewer_index.status = htobe32(LTTNG_VIEWER_INDEX_ERR); - goto send_reply; - } else if (ret == 1) { - viewer_index.flags |= LTTNG_VIEWER_FLAG_NEW_STREAM; - } - ret = lttng_index_file_read(vstream->index_file, &packet_index); if (ret) { ERR("Relay error reading index file"); @@ -1821,6 +1851,10 @@ send_reply: pthread_mutex_unlock(&metadata_viewer_stream->stream->lock); } + if (attached_sessions_have_new_streams) { + viewer_index.flags |= LTTNG_VIEWER_FLAG_NEW_STREAM; + } + viewer_index.flags = htobe32(viewer_index.flags); health_code_update(); diff --git a/src/bin/lttng-relayd/main.c b/src/bin/lttng-relayd/main.c index 902fcab8e..88284d40e 100644 --- a/src/bin/lttng-relayd/main.c +++ b/src/bin/lttng-relayd/main.c @@ -1524,6 +1524,7 @@ static void publish_connection_local_streams(struct relay_connection *conn) if (session->viewer_attached) { uatomic_set(&session->new_streams, 1); } + pthread_mutex_unlock(&session->lock); }