We can use it to understand what the RCU core is going to free. For
example, some users may be interested in when the RCU core starts
freeing reclaimable slabs like dentry to reduce memory pressure.
Fix: filter interpreter early-exits on uninitialized value
I observed that syscall filtering on string arguments wouldn't work on
my development machines, both running 5.11.2-arch1-1 (Arch Linux).
For instance, enabling the tracing of the `openat()` syscall with the
'filename == "/proc/cpuinfo"' filter would not produce events even
though matching events were present in another session that had no
filtering active. The same problem occurred with `execve()`.
I tried a couple of kernel versions before (5.11.1 and 5.10.13, if
memory serves me well) and I had the same problem. Meanwhile, I couldn't
reproduce the problem on various Debian machines (the LTTng CI) nor on a
fresh Ubuntu 20.04 with both the stock kernel and with an updated 5.11.2
kernel.
I built the lttng-modules with the interpreter debugging printout and
saw the following warning:
LTTng: [debug bytecode in /home/jgalar/EfficiOS/src/lttng-modules/src/lttng-bytecode-interpreter.c:bytecode_interpret@1508] Bytecode warning: loading a NULL string.
After a shedload (yes, a _shed_load) of digging, I figured that the
problem was hidden in plain sight near that logging statement.
In the `BYTECODE_OP_LOAD_FIELD_REF_USER_STRING` operation, the 'ax'
register's 'user_str' is initialized with the stack value (the user
space string's address in our case). However, a NULL check is performed
against the register's 'str' member.
I initially suspected that both members would be part of the same union
and alias each other, but they are actually contiguous members of a
structure.
On the unaffected machines, I could confirm that the `str` member held
an uninitialized non-zero value, causing the NULL check to evaluate to
false.
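A simplified sketch of the faulty pattern (not the verbatim interpreter
code; the 'ax' register and member names follow the description above):

    /* The user-space string address ends up in 'user_str': */
    ax->u.s.user_str = *(const char * const *) &interpreter_stack_data[ref->offset];

    /* The NULL check must therefore test 'user_str'. The buggy code tested
     * 'ax->u.s.str', a distinct member that is uninitialized at this point,
     * so on kernels that zero-initialize stack variables the check always
     * fired and the interpreter bailed out before evaluating the filter. */
    if (unlikely(!ax->u.s.user_str)) {
            ret = -EINVAL;  /* "Bytecode warning: loading a NULL string." */
            goto end;
    }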
Francis Deslauriers reproduced the problem by initializing the
interpreter stack to zero.
I am unsure of the exact kernel configuration option that reveals this
issue on Arch Linux, but my kernel has the following option enabled:
CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL:
Zero-initialize any stack variables that may be passed by reference
and had not already been explicitly initialized. This is intended to
eliminate all classes of uninitialized stack variable exploits and
information exposures.
I have not tried to build without this enabled as, anyhow, this seems
to be a legitimate issue.
I have spotted what appears to be an identical problem in
`BYTECODE_OP_LOAD_FIELD_REF_USER_SEQUENCE` and corrected it. However,
I have not exercised that code path.
The commit that introduced this problem is 5b4ad89.
The debug print-out of the `BYTECODE_OP_LOAD_FIELD_REF_USER_STRING`
operation is modified to print the user string (truncated to 31 chars).
Both filter runtime and event enabler ref objects are owned by the
event notifier, but are not freed upon destruction of the event notifier
object, thus leaking memory.
Use the GPL-exported bdi_dev_name introduced in kernel 5.7. Do not use
static inline bdi_dev_name in prior kernels because it uses the bdi_unknown_name
symbol which is not exported to GPL modules.
It is currently backported into stable branches 5.4 and 5.5, but appears
to be missing from the 4.4, 4.9, 4.14, 4.19 LTS branches.
Implement our own lttng_bdi_dev_name wrapper to provide this behavior on
builds against stable kernels which do not have the fix.
There is one user-visible change with this commit: for builds against
kernels < 4.4.0, the writeback_work_class events did use the
default_backing_dev_info to handle cases where the device is NULL,
writing "default" into the trace. This behavior is now aligned to
match what is done in kernels >= 4.4.0, which is to write "(unknown)"
into the name field.
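A hedged sketch of what such a wrapper can look like. This is simplified:
the real wrapper header also has to account for the stable branches that
carry the backport, and the exact lttng-modules code may differ.

    #include <linux/version.h>
    #include <linux/backing-dev.h>
    #include <linux/device.h>

    #if (LINUX_VERSION_CODE >= KERNEL_VERSION(5, 7, 0))
    static inline const char *lttng_bdi_dev_name(struct backing_dev_info *bdi)
    {
            return bdi_dev_name(bdi);       /* GPL-exported since 5.7 */
    }
    #else
    static inline const char *lttng_bdi_dev_name(struct backing_dev_info *bdi)
    {
            if (!bdi || !bdi->dev)
                    return "(unknown)";     /* matches >= 4.4.0 behaviour */
            return dev_name(bdi->dev);
    }
    #endif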
Michael Jeanson [Mon, 8 Feb 2021 20:32:47 +0000 (15:32 -0500)]
fix: RT_PATCH_VERSION is close to overflow
We allocated only 8 bits for RT_PATCH_VERSION in LTTNG_RT_VERSION_CODE;
the current RT patch version for the 4.4 branch is 214, which is getting
close to 256. Bump it to 16 bits to avoid breakage in the future.
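Illustration of the encoding change (macro names and layout here are
assumed for the example, not copied from the actual header):

    /* Before: only 8 bits for the RT patch number, which wraps at 256. */
    #define LTTNG_RT_VERSION_CODE_8(kver, rt_patch)  (((kver) << 8) | (rt_patch))
    /* After: 16 bits, good for RT patch numbers up to 65535. */
    #define LTTNG_RT_VERSION_CODE_16(kver, rt_patch) (((kver) << 16) | (rt_patch))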
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I0666bbe996854696ac98e025eb02e5fced0540b1
Michael Jeanson [Fri, 5 Feb 2021 20:21:55 +0000 (15:21 -0500)]
fix: UTS_UBUNTU_RELEASE_ABI is close to overflow
We allocated only 8 bits for UTS_UBUNTU_RELEASE_ABI in
LTTNG_UBUNTU_KERNEL_VERSION; the current Xenial kernel has an ABI of 207,
which is getting close to 256. Bump it to 16 bits to avoid breakage in
the future.
Change-Id: Iee99757bb28cdd958b044b31df3232b9f8816873
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 5 Feb 2021 17:08:40 +0000 (12:08 -0500)]
fix: sublevel version overflow in LINUX_VERSION_CODE
The 4.4.256 and 4.9.256 stable releases overflow the 8 bits allocated to
the sublevel in LINUX_VERSION_CODE, which means they report themselves
as 4.5.0 and 4.10.0 respectively. The next releases in these stable
branches will have the sublevel clamped at 255 and will thus report
themselves as 4.4.255 and 4.9.255 for all subsequent releases.
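The overflow is easy to see from the classic version-code encoding, which
packs the sublevel into the low 8 bits:

    /* Classic definition from <linux/version.h>: */
    #define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

    /* KERNEL_VERSION(4, 4, 256) == 0x040500 == KERNEL_VERSION(4, 5, 0)
     * KERNEL_VERSION(4, 9, 256) == 0x040a00 == KERNEL_VERSION(4, 10, 0) */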
We need a way to properly detect these releases since I doubt they will
stop breaking tracepoint declarations. As a workaround, extract the
version information from the Makefile in the kernel headers and use this
information to generate a version code when the sublevel is equal to or
greater than 256.
Change-Id: I96ae9f22c0c1ba8c619643946a5311c767fbcf8c
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: counter-api: always inline counter add function
The counter add function uses cmpxchg() and cmpxchg_local() on 1, 2, 4,
and 8 bytes types.
In libcounter, the 8 bytes type is only supported on 64-bit
architectures, but the 1, 2, 4 byte type code is present for all
architectures, even though only the 4 byte code is currently used by
lttng-modules.
The ARM implementation of cmpxchg uses the "__bad_cmpxchg" linker error
to report use of cmpxchg on an unsupported size.
Considering that "inline" does not strictly mean always inline (depends
on CONFIG_OPTIMIZE_INLINING on some kernels, and does not mean forced
inlining in recent kernels), the compiler is free to generate a function
rather than perform inlining. If that happens, then the __bad_cmpxchg
linker error is generated even if the 1 and 2 bytes types are unused.
Therefore, use __always_inline for functions in counter-api.h to force
inlining, and therefore removal of unused code before linking, which is
required by this Linux kernel __bad_cmpxchg linker error trick.
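A reduced sketch of the pattern (not the actual libcounter code; names
are assumed): as long as the switch is folded into each caller with a
constant size, the unused 1- and 2-byte branches are removed and never
reference __bad_cmpxchg.

    #include <linux/types.h>
    #include <linux/compiler.h>
    #include <linux/atomic.h>

    static __always_inline void counter_add(void *ptr, size_t size, long v)
    {
            switch (size) {
            case 1: {       /* unused today, must vanish at compile time */
                    u8 old, res;

                    do {
                            old = READ_ONCE(*(u8 *) ptr);
                            res = cmpxchg_local((u8 *) ptr, old, old + (u8) v);
                    } while (res != old);
                    break;
            }
            case 4: {       /* the only case currently used by lttng-modules */
                    u32 old, res;

                    do {
                            old = READ_ONCE(*(u32 *) ptr);
                            res = cmpxchg_local((u32 *) ptr, old, old + (u32) v);
                    } while (res != old);
                    break;
            }
            }
    }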
No more (ab)use in drivers finally. There is still the modular build of
PPC/KVM which needs it, so restrict it to this case which still makes it
unavailable for most drivers.
block: merge struct block_device and struct hd_struct
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one. This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.
Change-Id: I47702d1867fda0d8fc0754d761aa4d1ae702cdeb
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The kretprobe hash is mostly superfluous, replace it with a per-task
variable.
This gets rid of the task hash and its related locking.
Note that this may change the kprobes module-exported API for kretprobe
handlers. If any out-of-tree kretprobe user uses ri->rp, use
get_kretprobe(ri) instead.
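For an out-of-tree user, the adjustment mentioned above looks roughly
like this (the handler name is hypothetical):

    static int lttng_kretprobe_handler(struct kretprobe_instance *ri,
                    struct pt_regs *regs)
    {
            struct kretprobe *rp = get_kretprobe(ri);       /* was: ri->rp */

            /* use rp as before */
            return 0;
    }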
Also remove the confusing comment about checking if a fd exists. I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
Fix: notifier: use store-release/load-acquire for error counter
The "record_error" operation is executed concurrently with setting the
error counter in the notifier group without locking, so we need to
explicitly provide existence guarantees.
The only visible transition is from NULL -> !NULL, because the only
situation reverting the error counter back to NULL is on destruction of
the notification group, after an RCU synchronisation guarantees that no
record_error can observe this pointer anymore.
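The pattern, sketched with illustrative names (not the exact
lttng-modules fields):

    /* Publisher side (notifier group setup): make the fully initialized
     * counter visible in a single release store. */
    static void set_error_counter(struct lttng_event_notifier_group *group,
                    struct lttng_counter *counter)
    {
            smp_store_release(&group->error_counter, counter);
    }

    /* record_error() path: pair with a load-acquire so observing a non-NULL
     * pointer also guarantees the counter's initialization is visible. */
    static void record_error(struct lttng_event_notifier_group *group)
    {
            struct lttng_counter *counter =
                    smp_load_acquire(&group->error_counter);

            if (!counter)
                    return;         /* no error counter attached yet */
            /* increment the error counter here */
    }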
Problems
========
We can't reuse the same function pointer field in `struct
lttng_bytecode_runtime` as both interpreter functions will have
different signatures.
We can't change the signature of this existing filter function because
it's used in the tracepoint probes.
Solution
========
Add a union of callbacks to hold both interpreter functions. This also
doesn't change the layout of the `struct lttng_bytecode_runtime`
objects.
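A hypothetical shape of the change (field and parameter names assumed;
other members of the struct are unchanged and omitted here):

    struct lttng_bytecode_runtime {
            union {
                    uint64_t (*filter)(void *interpreter_data,
                                    const char *interpreter_stack_data);
                    uint64_t (*capture)(void *interpreter_data,
                                    const char *interpreter_stack_data,
                                    struct lttng_interpreter_output *output);
            } interpreter_funcs;
    };

Since both members are function pointers, the union occupies the same
space the single callback did, which is why the object layout is
preserved.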
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ic4423100e1f97654fe43c0927d3b81de2d1d724f
Add `lttng_bytecode_interpret_format_output()` for top of stack extraction
This new static function will be used to extract the register on the top of
stack after the execution of the bytecode. This is currently not used by the
filter bytecode but will be used by capture bytecode.
The returned value is saved in a tagged union struct named `struct
lttng_interpreter_output` and can be used by the caller of the interpreter
function.
Typically, this struct will be allocated on the stack to avoid dynamic
allocation inside the tracepoint probes.
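A hedged sketch of what such a tagged union can look like (member names
assumed, not the exact lttng-modules declaration):

    enum lttng_interpreter_type {
            LTTNG_INTERPRETER_TYPE_S64,
            LTTNG_INTERPRETER_TYPE_U64,
            LTTNG_INTERPRETER_TYPE_STRING,
    };

    struct lttng_interpreter_output {
            enum lttng_interpreter_type type;       /* tag */
            union {
                    int64_t s;
                    uint64_t u;
                    struct {
                            const char *str;
                            size_t len;
                    } str;
            } u;
    };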
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I1cfd3ab6e84b7e308c48ed7a8a9555a3e338eea7
Fix: bytecode: Validate register type for instructions expecting unknown type
The bytecode validator allows unknown type as input for some
instructions which are not specialized. The interpreter therefore needs
to check the register type for their input.
This requires that every instruction in the interpreter sets the
register type of the output it populates (unless it is unchanged).
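For instance, a non-specialized instruction has to branch on the concrete
type of its input register at interpretation time (sketch only; register
and type names are assumed for illustration):

    switch (ax->type) {
    case REG_S64:
            result = (ax->u.v != 0);
            break;
    case REG_STRING:
            result = (ax->u.str != NULL);
            break;
    default:
            ret = -EINVAL;          /* unexpected register type */
            goto end;
    }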
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I3339c36340645937b801f6bf6dbf517d06416a14
Fix: syscalls: address of statically allocated element never null
This check is intended to confirm that the table element for that syscall
is indeed populated, but it tested whether the element itself is NULL.
That is never the case because the address of an element of a statically
allocated array cannot be NULL.
Fix this by checking whether the function pointer is NULL instead, which
is what indicates that the element is not populated.
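Sketch of the difference (table, struct, and field names assumed):

    const struct trace_syscall_entry *entry = &sc_table[syscall_nr];

    /* Old (broken) check, always false for a static array element:
     *     if (unlikely(!entry))
     *             return;
     * Fixed check: an unpopulated table slot has a NULL probe function. */
    if (unlikely(!entry->event_func))
            return;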
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I1d769d6609fa4517199f022e1a262c4494c8f63a
Michael Jeanson [Mon, 23 Nov 2020 17:15:43 +0000 (12:15 -0500)]
Improve the release script
* Use git-archive; this removes all the custom code to clean up the
repo, and the script can now be used in an unclean repo since the code
is exported from a specific tag.
* Add parameters, this will allow using the script on any machine
while keeping the default behavior for the maintainer.
Change-Id: I9f29d0e1afdbf475d0bbaeb9946ca3216f725e86
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Currently the tracepoint site will iterate a vector and issue indirect
calls to however many handlers are registered (ie. the vector is
long).
Using static_call() it is possible to optimize this for the common
case of only having a single handler registered. In this case the
static_call() can directly call this handler. Otherwise, if the vector
is longer than 1, call a function that iterates the whole vector like
the current code.
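Conceptually (a sketch only, not the kernel's actual tracepoint macros;
the static call name is assumed to be defined per tracepoint with
DEFINE_STATIC_CALL):

    /* One registered probe: the static call branches straight to it.
     * More than one: fall back to iterating the funcs vector. */
    if (nr_funcs == 1)
            static_call(tp_func_myprobe)(data, arg0, arg1);
    else
            for (i = 0; i < nr_funcs; i++)
                    funcs[i].func(funcs[i].data, arg0, arg1);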
Change-Id: I739dd84d62cc1a821b8bd8acff74fa29aa25d22f
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed
Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
values instead of overloading RET_PF_RETRY. In the short term, the
unique values add clarity to the code and RET_PF_SPURIOUS will be used
by set_spte() to avoid unnecessary work for spurious faults.
In the long term, TDX will use RET_PF_FIXED to deterministically map
memory during pre-boot. The page fault flow may bail early for benign
reasons, e.g. if the mmu_notifier fires for an unrelated address. With
only RET_PF_RETRY, it's impossible for the caller to distinguish between
"cool, page is mapped" and "darn, need to try again", and thus cannot
handle benign cases like the mmu_notifier retry.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie0855c78852b45f588e131fe2463e15aae1bc023
Add functions to handle page faults in the TDP MMU. These page faults
are currently handled in much the same way as the x86 shadow paging
based MMU, however the ordering of some operations is slightly
different. Future patches will add eager NX splitting, a fast page fault
handler, and parallel page faults.
Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie56959cb6c77913d2f1188b0ca15da9114623a4e
KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it adds IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I638fa29ef7d8bb432de42a33f9ae4db43259b915
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Examples of such functions
include a simple block allocator, an idempotent block bitmap update
function, etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ia65cf44e108f2df0b458f0d335f33a8f18f50baa
The kretprobe behavior is to fire twice (on entry and exit of the target
function), so placing an event notifier on such a function does not make
sense at first glance. If we come up with a use case, it will be quite
easy to enable.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I6f214501706d1ef170c81b80a1f82c039d687502
Fix: event notifier: adapt read iterator state to poll expectations
When the read of a sub-buffer completes, ensure that the state of the
iterator is moved forward so the "put_subbuf" is performed before
returning to the user; this way, poll() will not return POLLIN when
there is actually no data available to read.