Fix: sessiond: freeze on channel creation on restart
Issue observed
--------------
When using lttng via a script, the session and consumer daemons appear
to completely lock up when we request that a channel be created. The
conditions for this lockup seem to be created by destroying a sessiond
and then creating a sessiond in quick sequence.
This can be reproduced, on some systems, by launching a session daemon
and running the following commands:
$ sudo killall lttng-sessiond
$ sudo lttng-sessiond --daemonize
$ lttng create my_session --snapshot --output /tmp/demo-output
$ lttng enable-channel --kernel my_channel
Note that 'killall' above is racy as it does not wait for the session
daemon to be killed. Hence, it is not unexpected for the incoming
session daemon to see the smoldering ashes of the "outgoing" session
daemon. However, it would be helpful if the second session daemon
instance warned the user of the existing session daemon instance.
From the logs captured from both instances of the lttng-sessiond (the
outgoing and incoming instances), there appears to be a time period
during which both session daemons are active at once.
This behaviour is unexpected as the session daemon guards itself (in
theory) from running multiple conflicting instances.
The guarding mechanism works in two steps (see the implementation of
`check_existing_daemon` @ src/bin/lttng-sessiond/main.cpp:926)
When a session daemon is launched, it attempts to connect to any active
session daemon's 'client' endpoint (a UNIX socket, the same used by
liblttng-ctl to communicate with the session daemon).
If the daemon receives a reply, it can assume that another session
daemon instance is already active and abort its launch. Conversely, when
no reply is received, it uses a "lock file" mechanism to check for other
running instances.
The lock file-based check creates a file (typically
/var/run/lttng/lttng-sessiond.lck in the case of a root session daemon)
and acquires an exclusive (write) POSIX lock on it [1]. The assumption
is that any other instance would own the lock and cause the operation to
fail.
On a reproducer system, we could notice that the client thread of the
outgoing sessiond daemon was torn down before the launch of the
initialization of the incoming session daemon. This caused the incoming
session daemon to not receive a reply to its connection attempt and
fall-back to the lock file-based mechanism.
Surprinsingly, it appears that the lock file checks succeeds even though
the outgoing session daemon still holds the lock file according to its
log.
See the original bug report for more information about the investigation
and how to reproduce the problem.
Cause
-----
The POSIX file locking API has a number of surprising behaviours[2] that
have seen it being superseded by platform-specific APIs. In our case,
the one that bit us is that any file lock held by a process is
automatically released when any of the file descriptors that reference
the file's description is released.
In practical terms, if a process forks and its child dies, it loses its
file lock since the child's file descriptors are closed on exit.
The LWN article linked below describes a similar scenario:
It's common to have a library routine that opens a file, reads or
writes to it, and then closes it again, without the calling
application ever being aware that has occurred. If the application
happens to be holding a lock on the file when that occurs, it can lose
that lock without ever being aware of it.
The problem affects any use of the --background/--daemonize options
since, as part of the daemonization process (which occurs after the lock
file acquisition), the session daemon forks and its parent process
exits. This causes one of the descriptors pointing to the lock file to
be closed and the lock to be released.
After that point, any other instance of the session daemon process would
succeed in acquiring the lock file and assume it is the sole instance on
the system.
Solution
--------
The lock file code is modified to use the non-POSIX `flock`[3]
interface which is available on Linux and some BSDs[4]. `flock` provides
us with the guarantee we thought we had: that the file lock is only
released when _all_ file descriptors pointing to a given file
description are closed.
Drawbacks
---------
As a fallback, platforms that don't support `flock` will use the original
locking mechanism. Since this is a "hint" to warn users when erroneously
launch a second privileged session daemon, it seems acceptable for it
to not be completely reliable on secondary platforms.
References
----------
[1] https://man7.org/linux/man-pages/man2/fcntl.2.html (see F_SETLK)
[2] https://lwn.net/Articles/586904/
[3] https://linux.die.net/man/2/flock
[4] https://man.freebsd.org/cgi/man.cgi?query=flock&sektion=2
Fixes #1405
Reported-by: Erica Bugden <ebugden@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: Ic505ff0671c321f808050831ef2b7152cdbf4b8a
This page took 0.026118 seconds and 4 git commands to generate.