From: Jérémie Galarneau Date: Tue, 12 Dec 2023 21:54:41 +0000 (-0500) Subject: Fix: sessiond: freeze on channel creation on restart X-Git-Tag: v2.12.15~4 X-Git-Url: https://git.lttng.org./?a=commitdiff_plain;h=e78066f186cdc1e63ed65b023b90986bd251f724;p=lttng-tools.git Fix: sessiond: freeze on channel creation on restart Issue observed -------------- When using lttng via a script, the session and consumer daemons appear to completely lock up when we request that a channel be created. The conditions for this lockup seem to be created by destroying a sessiond and then creating a sessiond in quick sequence. This can be reproduced, on some systems, by launching a session daemon and running the following commands: $ sudo killall lttng-sessiond $ sudo lttng-sessiond --daemonize $ lttng create my_session --snapshot --output /tmp/demo-output $ lttng enable-channel --kernel my_channel Note that 'killall' above is racy as it does not wait for the session daemon to be killed. Hence, it is not unexpected for the incoming session daemon to see the smoldering ashes of the "outgoing" session daemon. However, it would be helpful if the second session daemon instance warned the user of the existing session daemon instance. From the logs captured from both instances of the lttng-sessiond (the outgoing and incoming instances), there appears to be a time period during which both session daemons are active at once. This behaviour is unexpected as the session daemon guards itself (in theory) from running multiple conflicting instances. The guarding mechanism works in two steps (see the implementation of `check_existing_daemon` @ src/bin/lttng-sessiond/main.cpp:926) When a session daemon is launched, it attempts to connect to any active session daemon's 'client' endpoint (a UNIX socket, the same used by liblttng-ctl to communicate with the session daemon). If the daemon receives a reply, it can assume that another session daemon instance is already active and abort its launch. Conversely, when no reply is received, it uses a "lock file" mechanism to check for other running instances. The lock file-based check creates a file (typically /var/run/lttng/lttng-sessiond.lck in the case of a root session daemon) and acquires an exclusive (write) POSIX lock on it [1]. The assumption is that any other instance would own the lock and cause the operation to fail. On a reproducer system, we could notice that the client thread of the outgoing sessiond daemon was torn down before the launch of the initialization of the incoming session daemon. This caused the incoming session daemon to not receive a reply to its connection attempt and fall-back to the lock file-based mechanism. Surprinsingly, it appears that the lock file checks succeeds even though the outgoing session daemon still holds the lock file according to its log. See the original bug report for more information about the investigation and how to reproduce the problem. Cause ----- The POSIX file locking API has a number of surprising behaviours[2] that have seen it being superseded by platform-specific APIs. In our case, the one that bit us is that any file lock held by a process is automatically released when any of the file descriptors that reference the file's description is released. In practical terms, if a process forks and its child dies, it loses its file lock since the child's file descriptors are closed on exit. The LWN article linked below describes a similar scenario: It's common to have a library routine that opens a file, reads or writes to it, and then closes it again, without the calling application ever being aware that has occurred. If the application happens to be holding a lock on the file when that occurs, it can lose that lock without ever being aware of it. The problem affects any use of the --background/--daemonize options since, as part of the daemonization process (which occurs after the lock file acquisition), the session daemon forks and its parent process exits. This causes one of the descriptors pointing to the lock file to be closed and the lock to be released. After that point, any other instance of the session daemon process would succeed in acquiring the lock file and assume it is the sole instance on the system. Solution -------- The lock file code is modified to use the non-POSIX `flock`[3] interface which is available on Linux and some BSDs[4]. `flock` provides us with the guarantee we thought we had: that the file lock is only released when _all_ file descriptors pointing to a given file description are closed. Drawbacks --------- As a fallback, platforms that don't support `flock` will use the original locking mechanism. Since this is a "hint" to warn users when erroneously launch a second privileged session daemon, it seems acceptable for it to not be completely reliable on secondary platforms. References ---------- [1] https://man7.org/linux/man-pages/man2/fcntl.2.html (see F_SETLK) [2] https://lwn.net/Articles/586904/ [3] https://linux.die.net/man/2/flock [4] https://man.freebsd.org/cgi/man.cgi?query=flock&sektion=2 Fixes #1405 Reported-by: Erica Bugden Signed-off-by: Jérémie Galarneau Change-Id: Ic505ff0671c321f808050831ef2b7152cdbf4b8a --- diff --git a/configure.ac b/configure.ac index 243aa7244..e5fbf3240 100644 --- a/configure.ac +++ b/configure.ac @@ -235,7 +235,7 @@ AC_CHECK_FUNCS([ \ mkdir munmap putenv realpath rmdir socket strchr strcspn strdup \ strncasecmp strndup strnlen strpbrk strrchr strstr strtol strtoul \ strtoull dirfd gethostbyname2 getipnodebyname epoll_create1 \ - sched_getcpu sysconf sync_file_range getrandom + sched_getcpu sysconf sync_file_range getrandom flock ]) # Check if clock_gettime, timer_create, timer_settime, and timer_delete are available in lib rt, and if so, diff --git a/src/common/lockfile.c b/src/common/lockfile.c index a34838c55..bf1ddf9e3 100644 --- a/src/common/lockfile.c +++ b/src/common/lockfile.c @@ -12,49 +12,93 @@ #include +#include + #ifdef HAVE_FLOCK +#include + +static int lock_file(const char *filepath, int fd) +{ + int ret; + + /* + * Attempt to lock the file. If this fails, there is + * already a process using the same lock file running + * and we should exit. + */ + ret = flock(fd, LOCK_EX | LOCK_NB); + if (ret == -1) { + /* EWOULDBLOCK are expected if the file is locked: don't spam the logs. */ + if (errno != EWOULDBLOCK) { + PERROR("Failed to apply lock on lock file: file_path=`%s`", filepath); + } + } + + return ret; +} + #else /* HAVE_FLOCK */ -#include +static int lock_file(const char *filepath, int fd) +{ + int ret; + struct flock lock = {}; + + lock.l_whence = SEEK_SET; + lock.l_type = F_WRLCK; + + /* + * Attempt to lock the file. If this fails, there is + * already a process using the same lock file running + * and we should exit. + */ + ret = fcntl(fd, F_SETLK, &lock); + if (ret == -1) { + /* EAGAIN and EACCESS are expected if the file is locked: don't spam the logs. */ + if (errno != EAGAIN && errno != EACCES) { + PERROR("Failed to set lock on lock file: file_path=`%s`", filepath); + } + } + + return ret; +} + +#endif /* HAVE_FLOCK */ int utils_create_lock_file(const char *filepath) { - int ret; - int fd; - struct flock lock; + int ret, fd; assert(filepath); - memset(&lock, 0, sizeof(lock)); fd = open(filepath, O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP); if (fd < 0) { - PERROR("open lock file %s", filepath); + PERROR("Failed to open lock file `%s`", filepath); fd = -1; goto error; } /* - * Attempt to lock the file. If this fails, there is - * already a process using the same lock file running - * and we should exit. + * Attempt to lock the file. If this fails, there is already a process using the same lock + * file running and we should exit. + * + * lock_file is chosen based on the build configuration, see implementations above. */ - lock.l_whence = SEEK_SET; - lock.l_type = F_WRLCK; - - ret = fcntl(fd, F_SETLK, &lock); + ret = lock_file(filepath, fd); if (ret == -1) { - PERROR("fcntl lock file"); - ERR("Could not get lock file %s, another instance is running.", filepath); + ERR("Could not get lock file `%s`, another instance is running.", filepath); + if (close(fd)) { - PERROR("close lock file"); + PERROR("Failed to close lock file fd: fd=%d", fd); } + fd = ret; goto error; } + DBG("Acquired lock file: file_path=`%s`", filepath); + error: return fd; } - -#endif /* HAVE_FLOCK */