Fix: abort on decrement_sem_count during concurrent tracing start and teardown
Observed issue
==============
The following backtrace has been reported:
#0 __GI_raise (sig=sig@entry=6)
at /usr/src/debug/glibc/2.31/git/sysdeps/unix/sysv/linux/raise.c:50
#1 0x0000007f90b3fdd4 in __GI_abort () at /usr/src/debug/glibc/2.31/git/stdlib/abort.c:79
#2 0x0000007f90b4bf50 in __assert_fail_base (fmt=0x7f90c3da98 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count",
file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c",
line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count")
at /usr/src/debug/glibc/2.31/git/assert/assert.c:92
#3 0x0000007f90b4bfb4 in __GI___assert_fail (assertion=assertion@entry=0x7f9112cb90 "uatomic_read(&sem_count) >= count",
file=file@entry=0x7f9112cb30 "/usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c",
line=line@entry=664, function=function@entry=0x7f911317e8 <__PRETTY_FUNCTION__.10404> "decrement_sem_count")
at /usr/src/debug/glibc/2.31/git/assert/assert.c:101
#4 0x0000007f910e3830 in decrement_sem_count (count=<optimized out>)
at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:664
#5 0x0000007f910e5d28 in handle_pending_statedump (sock_info=0x7f9115c608 <global_apps>)
at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:737
#6 handle_message (lum=0x7f8dde46d8, sock=3, sock_info=0x7f9115c608 <global_apps>)
at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:1410
#7 ust_listener_thread (arg=0x7f9115c608 <global_apps>)
at /usr/src/debug/lttng-ust/2_2.13.0-r0/lttng-ust-2.13.0/src/lib/lttng-ust/lttng-ust-comm.c:2055
#8 0x0000007f90af73e0 in start_thread (arg=0x7fc27a82f6)
at /usr/src/debug/glibc/2.31/git/nptl/pthread_create.c:477
#9 0x0000007f90bead5c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
It turns out that the main thread is at that point iterating over the
libraries destructors:
Thread 3 (LWP 1983):
#0 0x0000007f92a68a0c in _dl_fixup (l=0x7f9054e510, reloc_arg=432)
at /usr/src/debug/glibc/2.31/git/elf/dl-runtime.c:69
#1 0x0000007f92a6ea3c in _dl_runtime_resolve () at ../sysdeps/aarch64/dl-trampoline.S:100
#2 0x0000007f905170f8 in __do_global_dtors_aux () from <....>/crash/work/rootfs/usr/lib/libbsd.so.0
#3 0x0000007f92a697f8 in _dl_fini () at /usr/src/debug/glibc/2.31/git/elf/dl-fini.c:138
#4 0x0000007f90b54864 in __run_exit_handlers (status=0, listp=0x7f90c65648 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:108
#5 0x0000007f90b549f4 in __GI_exit (status=<optimized out>)
at /usr/src/debug/glibc/2.31/git/stdlib/exit.c:139
#6 0x0000000000404c98 in a_function_name (....) at main.c:152
#7 0x0000000000404a98 in main (argc=3, argv=0x7fc27a8858, env=0x7fc27a8878) at main.c:97
Cause
=====
An enable command is processed at the same time that the lttng-ust
destructor is run. At the end of the command handling,
`handle_pending_statedump` is called. Multiple variables from the
`sock_info` struct are checked outside the UST lock at that point.
lttng-ust-comm.c +1406:
/*
* Performed delayed statedump operations outside of the UST
* lock. We need to take the dynamic loader lock before we take
* the UST lock internally within handle_pending_statedump().
*/
handle_pending_statedump(sock_info);
Namely:
registration_done
statedump_pending
initial_statedump_done
`statedump_pending` is set during the enable command
(`lttng_session_statedump`, lttng-events.c +631) in the same thread.
As for `registration_done` and `initial_statedump_done` they are invariant
from the registration of the app until `lttng_ust_cleanup` is called.
`cleanup_sock_info` called by `lttng_ust_cleanup`, itself called by
`lttng_ust_exit` resets the `registration_done` and
`initial_statedump_done` fields. Note that these operations are done
outside of the listener thread.
Note that by that point `lttng_ust_exit` expects all "getters" on
`sock_info` to fail while trying to acquire the UST lock due to
`lttng_ust_comm_should_quit` set to 1. Note that the listener threads
can still exist because we do not join them, we only execute
pthread_cancel which is async.
Clearly we are missing mutual exclusion provided by locking
when accessing `registration_done` and `initial_statedump_done`.
Solution
========
Here we can do better and simply not require any mutual exclusion based on locking.
`registration_done` and `initial_statedump_done` only need to be reset
to zero when we are not actually exiting (`lttng_ust_after_fork_child`).
In this case, no concurrent listener thread exists at that point
that could access those fields during the reset. Hence we can move the
reset to only the non-exiting code path and alleviate the current
situation.
Known drawbacks
===============
None.
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I45ba3eaee20c49a3988837a87fa680ce0a6ed953
This page took 0.026068 seconds and 4 git commands to generate.