From: Jonathan Rajotte Date: Tue, 5 May 2020 20:06:58 +0000 (-0400) Subject: Fix: notification: deadlock on cmd_queue.lock and client->lock X-Git-Tag: v2.13.0-rc1~499 X-Git-Url: https://git.lttng.org./?a=commitdiff_plain;h=97a71ea68e8e05a20f44903db11e9e57dbbd29af;p=lttng-tools.git Fix: notification: deadlock on cmd_queue.lock and client->lock Observed issue ============== Deadlock between the notification thread and the action executor thread. Thread 5 holds cmd_queue.lock and request the client lock. Thread 6 holds the client lock and request the cmd_queue lock. Thread 5 have little value in holding the queue lock considering it effectively to a "pop" of the cmd_queue. Thread 9 is waiting on the cmd_queue lock but does not hold any other locks and thus not part of the deadlock but is a casualties of this deadlock and leave a client "hanging". Other threads are all in their respective waiting state. Thread 9 (Thread 0x7f76f2ffd700 (LWP 240467)): #0 __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52 [1070/1123] #1 0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80 #2 0x00000000004611dd in run_command_wait (handle=0x1ad12f0, cmd=0x7f76f2fe31e0) at notification-thread-commands.c:31 #3 0x000000000046143a in notification_thread_command_unregister_trigger (handle=0x1ad12f0, trigger=0x7f76e4000ef0) at notification-thread-commands.c:148 #4 0x00000000004444af in cmd_unregister_trigger (cmd_ctx=0x7f76e4000d40, sock=68, notification_thread=0x1ad12f0) at cmd.c:4618 #5 0x0000000000483d23 in process_client_msg (cmd_ctx=0x7f76e4000d40, sock=0x7f76f2ffcba4, sock_error=0x7f76f2ffcb90) at client.c:2001 #6 0x000000000047f00b in thread_manage_clients (data=0x1ad1a80) at client.c:2402 #7 0x000000000047b303 in launch_thread (data=0x1ad1af0) at thread.c:66 #8 0x00007f77052c5609 in start_thread (arg=) at pthread_create.c:477 #9 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 6 (Thread 0x7f7700fcf700 (LWP 240464)): #0 __lll_lock_wait (futex=futex@entry=0x1ad1308, private=0) at lowlevellock.c:52 #1 0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x1ad1308) at ../nptl/pthread_mutex_lock.c:80 #2 0x0000000000461bf2 in run_command_no_wait (handle=0x1ad12f0, in_cmd=0x7f7700fce340) at notification-thread-commands.c:87 #3 0x0000000000461b93 in notification_thread_client_communication_update (handle=0x1ad12f0, id=1, transmission_status=CLIENT_TRANSMISSION_STATUS_QUEUED) at notification-thread-commands.c:400 #4 0x0000000000497658 in client_handle_transmission_status (client=0x7f76f8004e30, status=CLIENT_TRANSMISSION_STATUS_QUEUED, user_data=0x7f76f8004a00) at action-executor.c:154 #5 0x0000000000467be7 in notification_client_list_send_evaluation (client_list=0x7f76f8004fe0, condition=0x7f76e40041a0, evaluation=0x7f76cc000cc0, trigger_creds=0x7f76e4004288, source_object_creds=0x0, client_report=0x4971a0 , user_data=0x7f76f8004a00) at notification-thread-events.c:4007 #6 0x00000000004956bb in action_executor_notify_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:199 #7 0x00000000004953fd in action_executor_generic_handler (executor=0x7f76f8004a00, work_item=0x7f76f80062d0, action=0x7f76e4004210) at action-executor.c:493 #8 0x0000000000495101 in action_work_item_execute (executor=0x7f76f8004a00, work_item=0x7f76f80062d0) at action-executor.c:506 #9 0x0000000000493ff5 in action_executor_thread (_data=0x7f76f8004a00) at action-executor.c:559 #10 0x000000000047b303 in launch_thread (data=0x7f76f8004aa0) at thread.c:66 #11 0x00007f77052c5609 in start_thread (arg=) at pthread_create.c:477 #12 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Thread 5 (Thread 0x7f77017d0700 (LWP 240463)): #0 __lll_lock_wait (futex=futex@entry=0x7f76f8004e30, private=0) at lowlevellock.c:52 #1 0x00007f77052c80a3 in __GI___pthread_mutex_lock (mutex=0x7f76f8004e30) at ../nptl/pthread_mutex_lock.c:80 #2 0x0000000000463080 in handle_notification_thread_command (handle=0x1ad12f0, state=0x7f77017cfb00) at notification-thread-events.c:2936 #3 0x000000000045e881 in thread_notification (data=0x1ad12f0) at notification-thread.c:705 #4 0x000000000047b303 in launch_thread (data=0x1ad1420) at thread.c:66 #5 0x00007f77052c5609 in start_thread (arg=) at pthread_create.c:477 #6 0x00007f77051cc103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 Cause ===== The action executor holds the client lock across the communication to prevent simultaneous update to the client state. The notification thread holds the cmd_queue lock across operation for no apparent reason (TODO make sure there is no internal add to the queue. if so we should reacquire the lock only when necessery.) Solution ======== Reduce the windows for which the cmd_queue lock is held by the notification thread to only the "pop" action on the queue. As soon as we have the lock, get the cmd, remove it from the list and release the lock. This prevent inverted lock acquisition base on the pattern of the action executor thread. Signed-off-by: Jonathan Rajotte Signed-off-by: Jérémie Galarneau Change-Id: I91d30c134bc1a128c96058f0e0cdd325808c91bc --- diff --git a/src/bin/lttng-sessiond/notification-thread-events.c b/src/bin/lttng-sessiond/notification-thread-events.c index 3f8cb4804..6ba6c415b 100644 --- a/src/bin/lttng-sessiond/notification-thread-events.c +++ b/src/bin/lttng-sessiond/notification-thread-events.c @@ -2412,6 +2412,8 @@ int handle_notification_thread_command( pthread_mutex_lock(&handle->cmd_queue.lock); cmd = cds_list_first_entry(&handle->cmd_queue.list, struct notification_thread_command, cmd_list_node); + cds_list_del(&cmd->cmd_list_node); + pthread_mutex_unlock(&handle->cmd_queue.lock); switch (cmd->type) { case NOTIFICATION_COMMAND_TYPE_REGISTER_TRIGGER: DBG("[notification-thread] Received register trigger command"); @@ -2474,19 +2476,16 @@ int handle_notification_thread_command( goto error_unlock; } end: - cds_list_del(&cmd->cmd_list_node); if (cmd->is_async) { free(cmd); cmd = NULL; } else { lttng_waiter_wake_up(&cmd->reply_waiter); } - pthread_mutex_unlock(&handle->cmd_queue.lock); return ret; error_unlock: /* Wake-up and return a fatal error to the calling thread. */ lttng_waiter_wake_up(&cmd->reply_waiter); - pthread_mutex_unlock(&handle->cmd_queue.lock); cmd->reply_code = LTTNG_ERR_FATAL; error: /* Indicate a fatal error to the caller. */