random: rather than entropy_store abstraction, use global
Originally, the RNG used several pools, so having things abstracted out
over a generic entropy_store object made sense. These days, there's only
one input pool, and then an uneven mix of usage via the abstraction and
usage via &input_pool. Rather than this uneasy mixture, just get rid of
the abstraction entirely and have things always use the global. This
simplifies the code and makes reading it a bit easier.
Change-Id: I1a2a14d7b6e69a047804e1e91e00fe002f757431 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
btrfs: pass fs_info to trace_btrfs_transaction_commit
The root on trans->root can be anything, and generally we're
committing from the transaction kthread so it's usually the tree_root.
Change this to just take an fs_info, and to maintain compatibility
simply put the ROOT_TREE_OBJECTID as the root objectid for the
tracepoint. This will allow us to remove trans->root.
Change-Id: Ie5a4804330edabffac0714fcb9c25b8c8599e424 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
mm: compaction: fix the migration stats in trace_mm_compaction_migratepages()
migrate_pages() has changed to return the number of {normal page,
THP, hugetlb} pages that were not migrated, so we should not use its
return value to calculate the number of pages migrated successfully.
Instead, we can just use 'nr_succeeded', which indicates the number
of normal pages migrated successfully, to calculate the non-migrated
pages in trace_mm_compaction_migratepages().
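A minimal sketch of the resulting accounting (helper name
hypothetical; the real change happens in the compaction code that
calls the tracepoint):

/* non-migrated pages derived from the explicit success count,
 * not from migrate_pages()'s return value */
static unsigned long nr_not_migrated(unsigned long nr_to_migrate,
				     unsigned long nr_succeeded)
{
	return nr_to_migrate - nr_succeeded;
}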
block: don't call blk_status_to_errno in blk_update_request
We only need to call it to resolve the blk_status_t -> errno mapping
for tracing, so move the conversion into the tracepoints, which are
not called at all when tracing isn't enabled.
Change-Id: Ic556cee1d82e44a93a1467f55d45b6e17a48d387 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: MMU: change tracepoints arguments to kvm_page_fault
Pass struct kvm_page_fault to tracepoints instead of extracting the
arguments from the struct. This also lets the kvm_mmu_spte_requested
tracepoint pick the gfn directly from fault->gfn, instead of using
the address.
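A minimal sketch of the idea, with the struct abridged to the fields
mentioned here (the real struct kvm_page_fault carries many more):

struct kvm_page_fault {
	unsigned long addr;	/* faulting guest address */
	unsigned long gfn;	/* guest frame number of that address */
};

/* tracepoint handlers now take the fault struct and read
 * fault->gfn directly instead of deriving it from the address */
static unsigned long spte_requested_gfn(const struct kvm_page_fault *f)
{
	return f->gfn;
}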
Change-Id: I5ee3f344a8d1cd0ed185cdeecb3a3121183c9b43 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
Extend the get_exit_info static call to provide the reason for the VM
exit. Modify relevant tracepoints to use this rather than extracting
the reason in the caller.
Change-Id: I28903e658eb7cbfc6666e35ba4cffba5e49d1445 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Daniel Gomez [Fri, 8 Oct 2021 17:38:02 +0000 (19:38 +0200)]
fix: implicit-int error in EXPORT_SYMBOL_GPL
Add module header to fix error:
error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
[-Werror=implicit-int]
Log:
/workdir/build/workspace/sources/lttng-modules/src/wrapper/random.c:54:1:
warning: data definition has no type or storage class
54 EXPORT_SYMBOL_GPL(wrapper_get_bootid);
^~~~~~~~~~~~~~~~~
/workdir/build/workspace/sources/lttng-modules/src/wrapper/random.c:54:1:
error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
[-Werror=implicit-int]
/workdir/build/workspace/sources/lttng-modules/src/wrapper/random.c:54:1:
warning: parameter names (without types) in function declaration
cc1: some warnings being treated as errors
make[3]: ***
[/workdir/build/tmp/work-shared/qt5222/kernel-source/scripts/Makefile.build:271:
/workdir/build/workspace/sources/lttng-modules/src/wrapper/random.o]
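The fix itself is one include; a minimal sketch (the prototype of
wrapper_get_bootid is assumed here):

#include <linux/module.h>	/* declares EXPORT_SYMBOL_GPL() */

int wrapper_get_bootid(char *bootid);

EXPORT_SYMBOL_GPL(wrapper_get_bootid);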
Signed-off-by: Daniel Gomez <daniel@qtec.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: If1b96f6fcc19988b94dc09b12d071f2c61491b83
Michael Jeanson [Mon, 13 Sep 2021 18:16:22 +0000 (14:16 -0400)]
fix: Revert "Makefile: Enable -Wimplicit-fallthrough for Clang" (v5.15)
Starting with v5.15, "-Wimplicit-fallthrough=5" was added to the build
flags which requires the use of "__attribute__((__fallthrough__))" to
annotate fallthrough case statements.
It turns out that the problem with the clang -Wimplicit-fallthrough
warning is not about the kernel source code, but about clang itself, and
that the warning is unusable until clang fixes its broken ways.
In particular, when you enable this warning for clang, you not only get
warnings about implicit fallthroughs. You also get this:
warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
which is completely broken because it
(a) doesn't even tell you where the problem is (seriously: no line
numbers, no filename, no nothing).
(b) is fundamentally broken anyway, because there are perfectly valid
reasons to have a fallthrough statement even if it turns out that
it can perhaps not be reached.
In the kernel, an example of that second case is code in the scheduler:
	switch (state) {
	case cpuset:
		if (IS_ENABLED(CONFIG_CPUSETS)) {
			cpuset_cpus_allowed_fallback(p);
			state = possible;
			break;
		}
		fallthrough;
	case possible:
where if CONFIG_CPUSETS is enabled you actually never hit the
fallthrough case at all. But that in no way makes the fallthrough
wrong.
So the warning is completely broken, and enabling it for clang is a very
bad idea.
In the meantime, we can keep the gcc option enabled, and make the gcc
build use
-Wimplicit-fallthrough=5
which means that we will at least continue to require a proper
fallthrough statement, and that gcc won't silently accept the magic
comment versions. Because gcc does this all correctly, and while the odd
"=5" part is kind of obscure, it's documented in [1]:
"-Wimplicit-fallthrough=5 doesn’t recognize any comments as
fallthrough comments, only attributes disable the warning"
so if clang ever fixes its bad behavior we can try enabling it there again.
Change-Id: Iea69849592fb69ac04fb9bb28efcd6b8dce8ba88 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The counting 'rwsem' hackery of get|put_online_cpus() is going to be
replaced by percpu rwsem.
Rename the functions to make it clear that it's locking and not some
refcount style interface. These new functions will be used for the
preparatory patches which make the code ready for the percpu rwsem
conversion.
Rename all instances in the cpu hotplug code while at it.
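A minimal usage sketch of the renamed locking API (the new names are
cpus_read_lock()/cpus_read_unlock(); the loop body is illustrative):

#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

static void walk_online_cpus(void)
{
	int cpu;

	cpus_read_lock();		/* was: get_online_cpus() */
	for_each_online_cpu(cpu)
		pr_info("cpu %d is online\n", cpu);
	cpus_read_unlock();		/* was: put_online_cpus() */
}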
Change-Id: I5a37cf5afc075a402b7347989fac637dfa60a1ed Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
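A minimal sketch of the access pattern the rename enforces (the
helper is hypothetical; the new field name is __state):

#include <linux/sched.h>

static inline unsigned int task_state_snapshot(struct task_struct *p)
{
	return READ_ONCE(p->__state);	/* was: p->state, volatile long */
}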
Change-Id: I3a379192d6b977753fe58d4f67833a78dd7a0a47 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().
It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.
Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make @page parameter optional.
By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.
To accommodate this change, replace the @page parameter of
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes page_index info, the existing start/len should be
enough for most usage.
Change-Id: If96e99c2d9533d96d9d1aa6460bb7fd3ac9ed7ab Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Tue, 11 May 2021 21:22:12 +0000 (17:22 -0400)]
Disable block rwbs bitwise enum in default build
Only generate the bitwise enumerations when
CONFIG_LTTNG_EXPERIMENTAL_BITWISE_ENUM is enabled, so the default build
does not generate traces which lead to warnings when viewed with
babeltrace 1.x and babeltrace 2 with default options.
Michael Jeanson [Tue, 11 May 2021 21:15:15 +0000 (17:15 -0400)]
Disable sched_switch bitwise enum in default build
Only generate the bitwise enumerations when
CONFIG_LTTNG_EXPERIMENTAL_BITWISE_ENUM is enabled, so the default build
does not generate traces which lead to warnings when viewed with
babeltrace 1.x and babeltrace 2 with default options.
Michael Jeanson [Wed, 12 May 2021 20:21:50 +0000 (16:21 -0400)]
Add experimental bitwise enum config option
Only generate the bitwise enumerations when
CONFIG_LTTNG_EXPERIMENTAL_BITWISE_ENUM is enabled, so the default build
does not generate traces which lead to warnings when viewed with
babeltrace 1.x and babeltrace 2 with default options.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Id54ae3df470b9cdbc0edc0a528fa79532493d1ad
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that an inode got there because the flush
worker decided it's time to write back the dirty inode time stamps
(either because we are syncing or because of age). However, we can
detect this directly in __writeback_single_inode() and there's no need
for the strange propagation with the I_DIRTY_TIME_EXPIRE flag.
Change-Id: I6e7c0ced13acd4fcd88bcd572d0ba1f9b254c58c Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 10 May 2021 15:39:24 +0000 (11:39 -0400)]
fix: block: remove disk_part_iter (v5.12)
In v5.12 a refactoring of the genhd code was started and the symbols
related to 'disk_part_iter' were unexported. In v5.13 they were
completely removed.
This patch removes the short-lived compat code specific to v5.12 and
replaces it with a generic internal implementation, used on v5.12 and
up, that iterates directly on the 'disk->part_tbl' xarray.
This seems like a better option than keeping compat code that will
only work on v5.12 and makes maintenance more complicated. The compat
code was backported to the stable branches but isn't yet part of a
point release, so it can be safely replaced.
Now that no fast path lookups in the partition table are left, there is
no point in micro-optimizing the data structure for it. Just use a bog
standard xarray.
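A minimal sketch of the replacement iteration, assuming the v5.12+
layout where disk->part_tbl is an xarray of struct block_device
pointers (the locking requirements of the real code may differ):

#include <linux/blkdev.h>
#include <linux/rcupdate.h>
#include <linux/xarray.h>

static void walk_partitions(struct gendisk *disk)
{
	struct block_device *part;
	unsigned long idx;

	rcu_read_lock();	/* keep entries alive across the walk */
	xa_for_each(&disk->part_tbl, idx, part) {
		/* inspect 'part' */
	}
	rcu_read_unlock();
}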
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I43d8ef8463cb7a83dc8859a32dc29502cd897ddf
Fix: increment buffer offset when failing to copy from user-space
Upon failure to copy from user-space due to a failing access_ok()
check, the ring buffer offset is not incremented, which could generate
unreadable traces because we don't account for the padding we write
into the ring buffer.
Note that this typically won't affect the common use-case of copying
strings from user-space: unless mprotect is invoked within a narrow
race window (between the user strlen and the user strcpy), the strlen
will fail the access_ok() check when calculating the space to reserve,
which will match what happens on strcpy.
Sync `show_inode_state()` macro with Ubuntu 4.15 kernel
The following commit changed the `show_inode_state()` macro which
triggered a warning on our CI build:
commit 63388062bea96e5cd8b8d7abf7b7142f8666ca1f
Author: Jan Kara <jack@suse.cz>
Date: Mon Jan 25 12:37:43 2021 -0800
writeback: Drop I_DIRTY_TIME_EXPIRE
Also, this commit adds a comment to clarify why we keep these
`#if/#elif` blocks even though we don't use the macro.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I2dd53a1a286ab8a431977bda6cde01f700f0c7d9
He Zhe [Mon, 19 Apr 2021 09:09:28 +0000 (09:09 +0000)]
fix: mm, tracing: kfree event name mismatching with provider kmem (v5.12)
a8bc8ae5c932 ("fix: mm, tracing: record slab name for kmem_cache_free() (v5.12)")
introduces the following call trace for kfree. This is caused by a
mismatch between the kfree event and its provider kmem.
Add a helper to call kobject_uevent for the disk and all partitions, and
unexport the disk_part_iter_* helpers that are now only used in the core
block code.
Change-Id: If6e8797049642ab382d5699660ee1dd734e92c90 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: bytecode linker: validate event and field array/sequence encoding
The bytecode linker should only allow linking filter expressions that
load fields which are string-encoded arrays and sequences for
comparison against a string, and reject arrays and sequences without
encoding, so the filter interpreter does not attempt to load
non-NULL-terminated arrays/sequences as if they were strings.
The `filter_bytecode_runtime_head` list is currently not initialized
for the return event of the kretprobe. This caused a kernel NULL
pointer dereference when destroying a session. It can be reproduced
with the following commands:
We can use it to understand what the RCU core is going to free. For
example, some users may be interested in when the RCU core starts
freeing reclaimable slabs like dentry to reduce memory pressure.
Fix: filter interpreter early-exits on uninitialized value
I observed that syscall filtering on string arguments wouldn't work on
my development machines, both running 5.11.2-arch1-1 (Arch Linux).
For instance, enabling the tracing of the `openat()` syscall with the
'filename == "/proc/cpuinfo"' filter would not produce events even
though matching events were present in another session that had no
filtering active. The same problem occurred with `execve()`.
I tried a couple of kernel versions before (5.11.1 and 5.10.13, if
memory serves me well) and I had the same problem. Meanwhile, I couldn't
reproduce the problem on various Debian machines (the LTTng CI) nor on a
fresh Ubuntu 20.04 with both the stock kernel and with an updated 5.11.2
kernel.
I built the lttng-modules with the interpreter debugging printout and
saw the following warning:
LTTng: [debug bytecode in /home/jgalar/EfficiOS/src/lttng-modules/src/lttng-bytecode-interpreter.c:bytecode_interpret@1508] Bytecode warning: loading a NULL string.
After a shedload (yes, a _shed_load) of digging, I figured that the
problem was hidden in plain sight near that logging statement.
In the `BYTECODE_OP_LOAD_FIELD_REF_USER_STRING` operation, the 'ax'
register's 'user_str' is initialized with the stack value (the user
space string's address in our case). However, a NULL check is performed
against the register's 'str' member.
I initially suspected that both members would be part of the same
union and alias each other, but they are actually contiguous in a
structure.
On the unaffected machines, I could confirm that the `str` member was
uninitialized to a non-zero value causing the condition to evaluate to
false.
Francis Deslauriers reproduced the problem by initializing the
interpreter stack to zero.
I am unsure of the exact kernel configuration option that reveals this
issue on Arch Linux, but my kernel has the following option enabled:
CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL:
Zero-initialize any stack variables that may be passed by reference
and had not already been explicitly initialized. This is intended to
eliminate all classes of uninitialized stack variable exploits and
information exposures.
I have not tried to build without this enabled as, anyhow, this seems
to be a legitimate issue.
I have spotted what appears to be an identical problem in
`BYTECODE_OP_LOAD_FIELD_REF_USER_SEQUENCE` and corrected it. However,
I have not exercised that code path.
The commit that introduced this problem is 5b4ad89.
The debug print-out of the `BYTECODE_OP_LOAD_FIELD_REF_USER_STRING`
operation is modified to print the user string (truncated to 31 chars).
Also remove the confusing comment about checking if a fd exists. I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
Use the GPL-exported bdi_dev_name introduced in kernel 5.7. Do not use
the static inline bdi_dev_name in prior kernels because it uses the
bdi_unknown_name symbol, which is not exported to GPL modules.
It is currently backported into stable branches 5.4 and 5.5, but appears
to be missing from the 4.4, 4.9, 4.14, 4.19 LTS branches.
Implement our own lttng_bdi_dev_name wrapper to provide this fix on
builds against stable kernels which do not have this fix.
There is one user-visible change with this commit: for builds against
kernels < 4.4.0, the writeback_work_class events did use the
default_backing_dev_info to handle cases where the device is NULL,
writing "default" into the trace. This behavior is now aligned to
match what is done in kernels >= 4.4.0, which is to write "(unknown)"
into the name field.
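A minimal sketch of such a wrapper (version cutoff from this commit;
the exact implementation in the source may differ):

#include <linux/version.h>
#include <linux/device.h>
#include <linux/backing-dev.h>

#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 7, 0)
static inline const char *lttng_bdi_dev_name(struct backing_dev_info *bdi)
{
	return bdi_dev_name(bdi);	/* GPL-exported since 5.7 */
}
#else
static inline const char *lttng_bdi_dev_name(struct backing_dev_info *bdi)
{
	if (!bdi || !bdi->dev)
		return "(unknown)";
	return dev_name(bdi->dev);
}
#endif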
Michael Jeanson [Mon, 8 Feb 2021 20:32:47 +0000 (15:32 -0500)]
fix: RT_PATCH_VERSION is close to overflow
We allocated only 8 bits for RT_PATCH_VERSION in LTTNG_RT_VERSION_CODE;
the current RT patch version for the 4.4 branch is 214, which is
getting close to 256. Bump it to 16 bits to avoid breakage in the
future.
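A minimal sketch of the widened encoding (the macro layout here is
hypothetical, not copied from the source):

/* RT patch field widened from 8 to 16 bits */
#define LTTNG_RT_VERSION_CODE(kversion, rt_patch) \
	(((unsigned long long)(kversion) << 16) | (rt_patch))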
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: If791cecf3ac8ceb2a4d0c2879410ae6b4199117d
Michael Jeanson [Fri, 5 Feb 2021 20:21:55 +0000 (15:21 -0500)]
fix: UTS_UBUNTU_RELEASE_ABI is close to overflow
We allocated only 8 bits for UTS_UBUNTU_RELEASE_ABI in
LTTNG_UBUNTU_KERNEL_VERSION; the current Xenial kernel has an ABI of
207, which is getting close to 256. Bump it to 16 bits to avoid
breakage in the future.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Id6046ac402b33c8aff577d66a7d68397a1f08d5c
Michael Jeanson [Fri, 5 Feb 2021 17:08:40 +0000 (12:08 -0500)]
fix: sublevel version overflow in LINUX_VERSION_CODE
The 4.4.256 and 4.9.256 stable releases overflow the 8 bits allocated
to the sublevel in LINUX_VERSION_CODE, which means they report
themselves as 4.5.0 and 4.10.0 respectively. The next releases in
these stable branches will have the sublevel clamped at 255 and will
thus report themselves as 4.4.255 and 4.9.255 for all subsequent
releases.
We need a way to properly detect these releases since I doubt they
will stop breaking tracepoint declarations. As a workaround, extract
the version information from the Makefile in the kernel headers and
use this information to generate a version code when the sublevel is
equal to or greater than 256.
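The aliasing follows directly from the kernel's own encoding; a
minimal worked example:

/* LINUX_VERSION_CODE packs (major << 16) + (minor << 8) + sublevel,
 * with 8 bits per field, so: */
#define KVER(a, b, c)	(((a) << 16) + ((b) << 8) + (c))
/* KVER(4, 4, 256) == KVER(4, 5, 0) == 0x040500 */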
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie12dec7d2f28fc83343262f42680d42a2f40b59e
No more (ab)use in drivers, finally. There is still the modular build
of PPC/KVM which needs it, so restrict it to this case, which still
makes it unavailable for most drivers.
block: merge struct block_device and struct hd_struct
Instead of having two structures that represent each block device with
different life time rules, merge them into a single one. This also
greatly simplifies the reference counting rules, as we can use the inode
reference count as the main reference count for the new struct
block_device, with the device model reference front ending it for device
model interaction.
Change-Id: I47702d1867fda0d8fc0754d761aa4d1ae702cdeb Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The kretprobe hash is mostly superfluous; replace it with a per-task
variable.
This gets rid of the task hash and its related locking.
Note that this may change the kprobes module-exported API for kretprobe
handlers. If any out-of-tree kretprobe user uses ri->rp, use
get_kretprobe(ri) instead.
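A minimal sketch of the adjusted handler (handler body illustrative):

#include <linux/kprobes.h>

static int my_ret_handler(struct kretprobe_instance *ri,
			  struct pt_regs *regs)
{
	struct kretprobe *rp = get_kretprobe(ri);	/* was: ri->rp */

	(void)rp;	/* use rp as before */
	return 0;
}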
Michael Jeanson [Mon, 23 Nov 2020 17:15:43 +0000 (12:15 -0500)]
Improve the release script
* Use git-archive, this removes all custom code to cleanup the repo, it
can now be used in an unclean repo as the code will be exported from
a specific tag.
* Add parameters, this will allow using the script on any machine
while keeping the default behavior for the maintainer.
Change-Id: I9f29d0e1afdbf475d0bbaeb9946ca3216f725e86 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Currently the tracepoint site will iterate a vector and issue indirect
calls to however many handlers are registered (i.e., however long the
vector is).
Using static_call() it is possible to optimize this for the common
case of only having a single handler registered. In this case the
static_call() can directly call this handler. Otherwise, if the vector
is longer than 1, call a function that iterates the whole vector like
the current code.
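A heavily simplified sketch of the dispatch idea (the real tracepoint
macros are far more involved):

#include <linux/static_call.h>

static void tp_iterate_all(void *data)
{
	/* walk the registered-handlers vector, as before */
}

DEFINE_STATIC_CALL(tp_dispatch, tp_iterate_all);

static inline void tracepoint_hit(void *data)
{
	/* patched to call the lone handler directly when only one
	 * is registered; otherwise points at tp_iterate_all() */
	static_call(tp_dispatch)(data);
}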
Change-Id: I739dd84d62cc1a821b8bd8acff74fa29aa25d22f Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed
Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
values instead of overloading RET_PF_RETRY. In the short term, the
unique values add clarity to the code and RET_PF_SPURIOUS will be used
by set_spte() to avoid unnecessary work for spurious faults.
In the long term, TDX will use RET_PF_FIXED to deterministically map
memory during pre-boot. The page fault flow may bail early for benign
reasons, e.g. if the mmu_notifier fires for an unrelated address. With
only RET_PF_RETRY, it's impossible for the caller to distinguish between
"cool, page is mapped" and "darn, need to try again", and thus cannot
handle benign cases like the mmu_notifier retry.
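A sketch of the resulting return-value space (ordering and the
companion values are illustrative):

enum {
	RET_PF_RETRY,		/* caller should retry the fault */
	RET_PF_EMULATE,
	RET_PF_INVALID,
	RET_PF_FIXED,		/* new: fault fixed by this call */
	RET_PF_SPURIOUS,	/* new: fault was already fixed */
};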
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie0855c78852b45f588e131fe2463e15aae1bc023
Add functions to handle page faults in the TDP MMU. These page faults
are currently handled in much the same way as the x86 shadow paging
based MMU, however the ordering of some operations is slightly
different. Future patches will add eager NX splitting, a fast page fault
handler, and parallel page faults.
Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie56959cb6c77913d2f1188b0ca15da9114623a4e
KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it adds IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I638fa29ef7d8bb432de42a33f9ae4db43259b915
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ia65cf44e108f2df0b458f0d335f33a8f18f50baa
Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.
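The reshaped helper, for reference (old form recalled from memory, so
treat it as approximate):

#include <linux/fs.h>

/* old: int kernel_read(struct file *, loff_t, char *, unsigned long); */
ssize_t kernel_read(struct file *file, void *buf, size_t count, loff_t *pos);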
Change-Id: I825c3fcbcc17e9b46e2a661fadc66b52a94eb2da Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 24 Sep 2020 19:38:35 +0000 (15:38 -0400)]
fix: Use 'kernel_read' to read from procfs
Use the 'kernel_read' helper to read files in procfs; it's present in
the kernel since the 2.6 series and does the right thing both on
kernels that require the set_fs dance and on newer ones which don't.
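A minimal usage sketch (path handling and error paths trimmed):

#include <linux/err.h>
#include <linux/fcntl.h>
#include <linux/fs.h>

static ssize_t read_proc_file(const char *path, char *buf, size_t len)
{
	struct file *filp;
	loff_t pos = 0;
	ssize_t ret;

	filp = filp_open(path, O_RDONLY, 0);
	if (IS_ERR(filp))
		return PTR_ERR(filp);
	ret = kernel_read(filp, buf, len, &pos);
	filp_close(filp, NULL);
	return ret;
}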
Change-Id: I1a53fda379e0bb9acc79331626925bbdba63d727 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 25 Sep 2020 20:05:00 +0000 (16:05 -0400)]
fix: don't allow userspace copy to read kernel memory
This patch fixes a security issue which allows the root user to read
arbitrary kernel memory. Considering the security model used in LTTng
userspace tooling for kernel tracing, this bug also allows members of
the 'tracing' group to read arbitrary kernel memory.
Calls to __copy_from_user_inatomic() were wrongly enclosed in
set_fs(KERNEL_DS), defeating the access_ok() calls and allowing reads
from kernel memory if a kernel address is provided.
Remove all set_fs() calls around __copy_from_user_inatomic().
As a side effect this will allow us to support v5.10 which should remove
set_fs().
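A sketch contrasting the removed pattern with the fixed one
(simplified; the two-argument access_ok() of v5.0+ is shown):

#include <linux/uaccess.h>

static int copy_user_payload(void *dst, const void __user *src, size_t len)
{
	/*
	 * Removed pattern:
	 *	old_fs = get_fs();
	 *	set_fs(KERNEL_DS);	<- made access_ok() accept
	 *	__copy_from_user_inatomic(dst, src, len);   kernel addresses
	 *	set_fs(old_fs);
	 */
	if (!access_ok(src, len))
		return -EFAULT;
	return __copy_from_user_inatomic(dst, src, len) ? -EFAULT : 0;
}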
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I35e4562c835217352c012ed96a7b8f93e941381e
The system call filter table has effectively been unused for a long
time due to system call name prefix mismatch. This means the overhead of
selective system call tracing was larger than it should have been because
the event payload preparation would be done for all system calls as soon
as a single system call is traced.
However, fixing this underlying issue unearths several issues that crept
unnoticed when the "enabler" concept was introduced (after the original
implementation of the system call filter table).
Here is a list of the issues which are resolved here:
- Split lttng_syscalls_unregister into an unregister and a destroy
function, thus waiting for a grace period (and therefore quiescence
of the users) after unregistering the system call tracepoints before
freeing the system call filter data structures. This effectively fixes
a use-after-free.
- The state for enabling "all" system calls vs enabling specific system
calls (and sequences of enable-disable) was incorrect with respect to
the "enablers" semantic. This is solved by always tracking the
bitmap of enabled system calls, and keeping this bitmap even when
enabling all system calls. The sc_filter is now always allocated
before system call tracing is registered to tracepoints, which means
it does not need to be RCU dereferenced anymore.
Padding fields in the ABI are reserved to select whether to:
- Trace either native or compat system call (or both, which is the
behavior currently implemented),
- Trace either system call entry or exit (or both, which is the
behavior currently implemented),
- Select the system call to trace by name (behavior currently
implemented) or by system call number.
writeback: Fix sync livelock due to b_dirty_time processing
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result
in writeback never completing if there's a steady stream of inodes
added to the b_dirty_time list, as writeback rechecks the dirty lists
after each writeback round whether there's more work to be done. Fix
the problem by using the sync(2) start time as the inode expiry value
when processing the b_dirty_time list, similarly to ordinarily dirtied
inodes. This requires some refactoring of the older_than_this
handling, which simplifies the code noticeably as a bonus.
Change-Id: I8b894b13ccc14d9b8983ee4c2810a927c319560b Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that an inode got there because the flush
worker decided it's time to write back the dirty inode time stamps
(either because we are syncing or because of age). However, we can
detect this directly in __writeback_single_inode() and there's no need
for the strange propagation with the I_DIRTY_TIME_EXPIRE flag.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I92e37c2ff3ec36d431e8f9de5c8e37c5a2da55ea
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.
Change-Id: Ife8880bd9378dca2972da8838f40fc35ccdfaaac Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.
After patching, we observed that the sys CPU ratio dropped and the
system throughput increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:
Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s
Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I5169cb24853d4da32e2862a6626f1f058689b053
Beniamin Sandu [Thu, 13 Aug 2020 13:24:39 +0000 (16:24 +0300)]
Kconfig: fix dependency issue when building in-tree without CONFIG_FTRACE
When building in-tree, one could disable CONFIG_FTRACE in the kernel
config, which leaves CONFIG_TRACEPOINTS selected by the LTTng modules
but generates a lot of linker errors like the ones below, because
other required infrastructure is left out, e.g.:
trace.c:(.text+0xd86b): undefined reference to `trace_event_buffer_reserve'
ld: trace.c:(.text+0xd8de): undefined reference to `trace_event_buffer_commit'
ld: trace.c:(.text+0xd926): undefined reference to `event_triggers_call'
ld: trace.c:(.text+0xd942): undefined reference to `trace_event_ignore_this_pid'
ld: net/mac80211/trace.o: in function `trace_event_raw_event_drv_tdls_cancel_channel_switch':
It appears to be caused by the fact that TRACE_EVENT macros in the Linux
kernel depend on the Ftrace ring buffer as soon as CONFIG_TRACEPOINTS is
enabled.
Steps to reproduce:
- Get a clone of an upstream stable kernel and use scripts/built-in.sh on it
- Configure a standard x86-64 build, enable built-in LTTNG but disable
CONFIG_FTRACE from Kernel Hacking-->Tracers using menuconfig