This shows exactly how btrfs processes the delayed refs onto disks,
which is very helpful for understanding the delayed ref mechanism and
debugging related bugs.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Tue, 9 Jan 2018 20:43:19 +0000 (15:43 -0500)]
Fix: debian kernel version parsing
The debian version script only worked for ckt kernels and that was fine
until now because we only had checks for those versions in the code.
ckt (Canonical Kernel Team) kernels were used for a while during the jessie
cycle; their versioning is a bit different. They track the upstream vanilla
stable updates but don't update the minor version number and instead append
a -cktX suffix. They were all 3.16.7-cktX, and after a while the version
switched back to upstream style at 3.16.36.
Knowing that, we can compare regular debian and ckt kernel versions
using this scheme:
MAJOR.PATCHLEVEL.SUBLEVEL.CKT.DEBABI.DEBPATCH
with CKT set to zero for non-ckt kernels.
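The six-field scheme above amounts to a field-by-field comparison. A minimal
sketch, assuming a hypothetical struct deb_kver and deb_kver_cmp() helper
(neither name comes from lttng-modules, which does this in its build
scripts and version macros):

```c
#include <stddef.h>

/* Hypothetical six-field version tuple following the scheme above. */
struct deb_kver {
	unsigned int major, patchlevel, sublevel;
	unsigned int ckt;	/* 0 for non-ckt kernels */
	unsigned int debabi, debpatch;
};

/* Returns <0, 0 or >0 when a is older than, equal to or newer than b. */
int deb_kver_cmp(const struct deb_kver *a, const struct deb_kver *b)
{
	const unsigned int fa[] = { a->major, a->patchlevel, a->sublevel,
				    a->ckt, a->debabi, a->debpatch };
	const unsigned int fb[] = { b->major, b->patchlevel, b->sublevel,
				    b->ckt, b->debabi, b->debpatch };
	size_t i;

	/* The first differing field, from most to least significant, decides. */
	for (i = 0; i < sizeof(fa) / sizeof(fa[0]); i++) {
		if (fa[i] != fb[i])
			return fa[i] < fb[i] ? -1 : 1;
	}
	return 0;
}
```

With this ordering, any 3.16.7-cktX kernel sorts before 3.16.36 because the
SUBLEVEL field is compared before CKT.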
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Update: kvm instrumentation for 3.16.52 and 3.2.97
Starting from 3.16.52 and 3.2.97, the 3.16 and 3.2 stable kernel
branches backport a kvm instrumentation change introduced in 4.15 which
affects the prototype of the kvm_mmio event.
locking/atomics, mm: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.
However, for some features it is necessary to instrument reads and
writes separately, which is not possible with ACCESS_ONCE(). This
distinction is critical to correct operation.
It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, this doesn't handle comments, leaving references
to ACCESS_ONCE() instances which have been removed. As a preparatory
step, this patch converts the mm code and comments to use
{READ,WRITE}_ONCE() consistently.
----
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
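For illustration, here is what the conversion looks like on a small piece of
C code. The macros below are simplified user-space stand-ins written for
this sketch, not the kernel definitions, and struct counter is hypothetical:

```c
/*
 * Simplified stand-ins for illustration only: the kernel macros also
 * cope with non-scalar types and add instrumentation hooks.
 */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical shared object. */
struct counter {
	int value;
};

/* Before the conversion: return ACCESS_ONCE(c->value); */
int counter_read(struct counter *c)
{
	return READ_ONCE(c->value);
}

/* Before the conversion: ACCESS_ONCE(c->value) = v; */
void counter_write(struct counter *c, int v)
{
	WRITE_ONCE(c->value, v);
}
```

Having distinct read and write macros is what lets a tool instrument the two
kinds of access separately.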
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
timer: Prepare to change timer callback argument type
Modern kernel callback systems pass the structure associated with a
given callback to the callback function. The timer callback remains one
of the legacy cases where an arbitrary unsigned long argument continues
to be passed as the callback argument. This has several problems:
- This bloats the timer_list structure with a normally redundant
.data field.
- No type checking is being performed, forcing callbacks to do
explicit type casts of the unsigned long argument into the object
that was passed, rather than using container_of(), as done in most
of the other callback infrastructure.
- Neighboring buffer overflows can overwrite both the .function and
the .data field, providing attackers with a way to elevate from a buffer
overflow into a simplistic ROP-like mechanism that allows calling
arbitrary functions with a controlled first argument.
- For future Control Flow Integrity work, this creates a unique function
prototype for timer callbacks, instead of allowing them to continue to
be clustered with other void functions that take a single unsigned long
argument.
This adds a new timer initialization API, which will ultimately replace
the existing setup_timer(), setup_{deferrable,pinned,etc}_timer() family,
named timer_setup() (to mirror hrtimer_setup(), making instances of its
use much easier to grep for).
In order to support the migration of existing timers into the new
callback arguments, timer_setup() casts its arguments to the existing
legacy types, and explicitly passes the timer pointer as the legacy
data argument. Once all setup_*timer() callers have been replaced with
timer_setup(), the casts can be removed, and the data argument can be
dropped with the timer expiration code changed to just pass the timer
to the callback directly.
Since the regular pattern of using container_of() during local variable
declaration repeats the need for the variable type declaration
to be included, this adds a helper modeled after other from_*()
helpers that wrap container_of(), named from_timer(). This helper uses
typeof(*variable), removing the type redundancy and minimizing the need
for line wraps in forthcoming conversions from "unsigned long data" to
"struct timer_list *" in the timer callbacks.
Finally, in order to support the handful of timer users that perform
open-coded assignments of the .function (and .data) fields, provide
cast macros (TIMER_FUNC_TYPE and TIMER_DATA_TYPE) that can be used
temporarily. Once conversion has been completed, these can be trivially
removed globally.
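The from_timer() helper described above can be sketched in user space as
follows. The struct timer_list and container_of() definitions are reduced
stand-ins for this example, and struct my_device with its watchdog field is
hypothetical:

```c
#include <stddef.h>

/* Reduced stand-in for the kernel's struct timer_list. */
struct timer_list {
	unsigned long expires;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Same shape as the helper described above: the type comes from var. */
#define from_timer(var, callback_timer, timer_fieldname) \
	container_of(callback_timer, __typeof__(*var), timer_fieldname)

/* Hypothetical object embedding its timer. */
struct my_device {
	int expirations;
	struct timer_list watchdog;
};

/* New-style callback: receives the timer, not an unsigned long. */
void my_watchdog_fn(struct timer_list *t)
{
	struct my_device *dev = from_timer(dev, t, watchdog);

	dev->expirations++;
}
```

The callback no longer needs a cast from unsigned long, and the object type
is written only once, in the declaration of dev.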
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.
The static function __vmalloc_node is not visible by KALLSYMS_ALL on at
least some kernels, which leads to a call to a NULL function when trying
to perform allocation of lttng buffer memory under memory fragmentation
conditions (kmalloc_node failure).
Use __vmalloc_node_range instead, and check that the returned pointer
is non-NULL to ensure this type of failure does not happen in any
condition.
Fall back to __vmalloc(), even though it is not NUMA-aware, in case
we fail to find __vmalloc_node_range, and print an explicit warning
to the user console about the need to enable KALLSYMS_ALL.
This affects kernels < 4.12. Later kernels provide kvmalloc(), which
we use.
Comparing a signed return value against an unsigned nr_pages performs
the comparison as "unsigned", and therefore mistakenly considers
get_user_pages_fast() errors as success.
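A minimal sketch of this pitfall, using two hypothetical helpers that stand
in for the check performed on the get_user_pages_fast() return value (the
real lttng-modules code is not reproduced here):

```c
/*
 * Buggy check: ret is implicitly converted to unsigned long, so a
 * negative error code becomes a huge value and is never "less than"
 * nr_pages. Errors are reported as success.
 */
int pinned_all_buggy(int ret, unsigned long nr_pages)
{
	return !(ret < nr_pages);
}

/* Fixed check: reject negative values before comparing magnitudes. */
int pinned_all_fixed(int ret, unsigned long nr_pages)
{
	return ret >= 0 && (unsigned long)ret >= nr_pages;
}
```

For example, with ret = -14 (the value of -EFAULT on Linux) and
nr_pages = 1, the buggy helper reports success while the fixed one reports
failure.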
By passing an invalid pointer to write() to the /proc/lttng-logger
interface, unprivileged user-space processes can trigger a kernel OOPS.
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages is
accounted on the zone. This can be coped with to some extent but it's
confusing, so this patch moves the relevant file-based accounting. Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.
mm: move vmscan writes and file write accounting to the node
As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node. For consistency,
also account page writes and page dirtying on a per-node basis.
After this patch, there are a few remaining zone counters that may appear
strange but are fine. NUMA stats are still per-zone as this is a
user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O. The
block_device has different lifetime rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Tue, 26 Sep 2017 18:16:47 +0000 (14:16 -0400)]
Fix: vmalloc wrapper on kernel < 2.6.38
Ensure that all probes end up including the vmalloc wrapper through the
lttng-tracer.h header so the trace_*() static inlines are generated
through inclusion of include/trace/events/kmem.h before we define
CREATE_TRACE_POINTS.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
There are many open coded kmalloc with vmalloc fallback instances in the
tree. Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator, which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds, and b) the page allocator can
invoke really disruptive steps like the OOM killer to move forward,
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.
As can be seen, implementing kvmalloc requires quite an intimate
knowledge of the page allocator and the memory reclaim internals, which
strongly suggests that a helper should be implemented in the memory
subsystem proper.
Most callers I could find have been converted to use the helper
instead. This is patch 6. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well, and Eric Dumazet
was not opposed [2] to converting them.
Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure not to create
heavy memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
not to warn about allocation failures. This also rules out the OOM
killer, as vmalloc is a more appropriate fallback than a disruptive
user-visible action.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Some architectures (e.g. implementations of arm64) implement their
caches based on virtual addresses (rather than physical addresses).
This has the upside of making the cache access faster (no TLB lookup
required to access the cache line), but the downside of requiring
virtual mappings (e.g. kernel vs user-space) to be aligned on the number
of bits used for cache aliasing.
Perform dcache flushing for the entire sub-buffer in the get_subbuf
operation on those architectures, thus ensuring we don't end up with
cache aliasing issues.
An alternative approach we could eventually take would be to create a
kernel mapping for the ring buffer that is aligned with the user-space
mapping.
Two variables in ext4_inode_info, i_reserved_meta_blocks and
i_allocated_meta_blocks, are unused. Removing them saves a little
memory per in-memory inode and cleans up clutter in several tracepoints.
Adjust tracepoint output from ext4_alloc_da_blocks() for consistency
and fix a typo and whitespace near these changes.
Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: Sleeping function called from invalid context
It affects system call instrumentation for accept, accept4 and connect,
only on the x86-64 architecture.
We need to use the LTTng accessing functions to touch user-space memory,
which take care of disabling the page fault handler, so we don't preempt
while in preempt-off context (tracepoints disable preemption).
The "pid" notion exposed by LTTng translates to the "tgid" notion in the
Linux kernel. Therefore using "current->pid" as argument to the PID
tracker actually ends up behaving as a "tid" tracker, which matches
neither the intent nor the user-space tracer behavior.
Fix: NULL pointer dereference of THIS_MODULE with built-in modules
THIS_MODULE is defined to 0 when a module is built into the kernel [1].
This caused a NULL pointer dereference when booting a kernel with the
lttng-modules built-in.
To fix this issue, add an #if guard around the wrapper_lttng_fixup_sig
function checking if the MODULE macro is defined to confirm that this
piece of code will end up in a module and not in the kernel itself.
Fix: add "flush empty" ioctl for stream intersection
Changing the behavior of the "snapshot" lttng command to implicitly do a
buffer "flush" (even when current packet is empty) had unwanted
side-effects: for instance, the snapshot ABI is used by the live timer
to grab the buffer positions, and we don't want to generate useless
empty packets in that scenario.
Therefore, add the "flush empty" behavior as a new ioctl to the ring
buffer. This allows lttng-tools to perform a buffer flush (even for empty
packets) when it needs to. Given that this new ioctl is added within
stable branches as well, lttng-tools always needs to handle "-ENOSYS"
gracefully.
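One way a caller could handle "-ENOSYS" gracefully is sketched below. This
is not the lttng-tools code: the flush_op type, flush_stream() and the two
stub operations are hypothetical, and falling back to the regular flush is
an assumption made for this example.

```c
#include <errno.h>

/* Hypothetical callback type standing in for an ioctl wrapper. */
typedef int (*flush_op)(int fd);

/* Stub: behaves like modules that lack the "flush empty" ioctl. */
int stub_flush_empty_unsupported(int fd)
{
	(void)fd;
	return -ENOSYS;
}

/* Stub: behaves like an operation that succeeds. */
int stub_flush_ok(int fd)
{
	(void)fd;
	return 0;
}

/*
 * Try "flush empty" first; when the running modules do not implement
 * it (-ENOSYS), fall back to the regular flush. Other errors are
 * returned to the caller unchanged.
 */
int flush_stream(int fd, flush_op flush_empty, flush_op flush)
{
	int ret = flush_empty(fd);

	if (ret == -ENOSYS)
		ret = flush(fd);
	return ret;
}
```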
There is no need to bump the LTTNG_MODULES_ABI_MINOR_VERSION
since the multiple wildcard feature introduced as part of the 2.10
release already bumps it from 2 to 3.
Use SIZE_MAX instead of -1ULL for size_t parameter
strutils_star_glob_match() receives a size_t. Passing -1ULL truncates
the value implicitly on systems where size_t is 32-bit. It is cleaner to
use SIZE_MAX.
Philippe Proulx [Sun, 19 Feb 2017 01:01:34 +0000 (20:01 -0500)]
Add support for star globbing patterns in event names
This patch adds support for full star-only globbing patterns used in
the event names (enabler names).
strutils_star_glob_match() is always used to perform the match when
the enabler is LTTNG_ENABLER_STAR_GLOB. This enabler is set when it is
detected that its name contains at least one non-escaped star with
strutils_is_star_glob_pattern().
The match is performed by strutils_star_glob_match(), the same function
that the filter interpreter uses.
Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Philippe Proulx [Sun, 19 Feb 2017 01:04:11 +0000 (20:04 -0500)]
Filtering: add support for star-only globbing patterns
This patch adds the support for "full" star-only globbing patterns to be
used in filter literal strings. A star-only globbing pattern is a
globbing pattern with the star (`*`) being the only special character.
This means `?` and character sets (`[abc-k]`) are not supported here. We
cannot support them without a strategy to differentiate the globbing
pattern because `?` and `[` are not special characters in filter literal
strings right now. The eventual strategy to support them would probably
look like this:
filename =* "?sys*.[ch]"
The filter bytecode generator in LTTng-tools's session daemon creates
the new FILTER_OP_LOAD_STAR_GLOB_STRING operation when the interpreter
should load a star globbing pattern literal string. Even if both
"plain", or legacy strings and star globbing pattern strings are literal
strings, they do not represent the same thing, that is, the == and !=
operators act differently.
The validation process checks that:
1. There's no binary operator between two
FILTER_OP_LOAD_STAR_GLOB_STRING operations. It is illegal to compare
two star globbing patterns, as this is not trivial to implement, and
completely useless as far as I know.
2. Only the == and != binary operators are allowed between a
star globbing pattern and a string.
For the special case of star globbing patterns with a star at the end
only, the current behaviour is not changed to preserve a maximum of
backward compatibility. This is also why the ABI version is changed from
2.2 to 2.3, not to 3.0.
== or != operations between REG_STRING and REG_STAR_GLOB_STRING
registers are specialized to FILTER_OP_EQ_STAR_GLOB_STRING and
FILTER_OP_NE_STAR_GLOB_STRING. Which side is the actual globbing pattern
(the one with the REG_STAR_GLOB_STRING type) is checked at execution
time. The strutils_star_glob_match() function is used to perform the
match operation. See the implementation for more details.
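To give an idea of what star-only matching involves, here is a minimal
sketch. It is not strutils_star_glob_match(): the star_glob_match() name is
made up for this example, it works on NUL-terminated strings, and it does
not handle escaped stars.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Match candidate against a pattern in which '*' is the only special
 * character. Simplified: no escape handling, NUL-terminated inputs.
 */
bool star_glob_match(const char *pattern, const char *candidate)
{
	const char *retry_p = NULL;	/* pattern position after last '*' */
	const char *retry_c = NULL;	/* candidate position to retry from */

	while (*candidate) {
		if (*pattern == '*') {
			/* Remember the star; first try matching nothing. */
			retry_p = ++pattern;
			retry_c = candidate;
		} else if (*pattern == *candidate) {
			pattern++;
			candidate++;
		} else if (retry_p) {
			/* Mismatch: let the last star absorb one more char. */
			pattern = retry_p;
			candidate = ++retry_c;
		} else {
			return false;
		}
	}
	/* Only trailing stars may remain in the pattern. */
	while (*pattern == '*')
		pattern++;
	return *pattern == '\0';
}
```

Only the most recent star needs to be remembered, so the match runs without
recursion.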
Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: use of uninitialized ret value in lttng_abi_open_metadata_stream
Fixes the following compiler warning:
lttng-abi.c: In function ‘lttng_metadata_ioctl’:
lttng-abi.c:971:6: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized]
int ret;
^
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The underlying type of `struct kref` changed in kernel 4.11 from an
atomic_t to a refcount_t. This change was introduced in kernel
commit 10383ae. This commit also added built-in overflow checks to
`kref_get()`, so we use it.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: atomic_add_unless() returns true/false rather than prior value
The previous implementation assumed that `atomic_add_unless` returned
the prior value of the atomic counter when in fact it returned whether
the addition was performed (true) or not performed (false).
Since `atomic_add_unless` cannot return INT_MAX, `lttng_kref_get`
always reported that the call was successful.
This issue had a low likelihood of being triggered since the two refcounts
of the counters used with this call are both bounded by the maximum
number of file descriptors on the system.
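The return-value semantic described above can be sketched in user space
with C11 atomics. add_unless() and ref_get() are hypothetical stand-ins
written for this example, not the kernel or lttng-modules functions:

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Add a to *v unless *v equals u. Returns true if the addition was
 * performed, false otherwise (NOT the prior value of the counter).
 */
bool add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	do {
		if (old == u)
			return false;
		/* On failure, old is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(v, &old, old + a));
	return true;
}

/* Take a reference, refusing to overflow past INT_MAX. */
bool ref_get(atomic_int *refcount)
{
	return add_unless(refcount, 1, INT_MAX);
}
```

A caller that compared the result against INT_MAX, as if it were the prior
counter value, would never see a failure.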
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
On 32-bit systems, the algorithm within lttng-modules that ensures the
nmi-safe clock increases monotonically on a CPU assumes at least one
clock read per 32-bit LSB overflow period, which is not guaranteed. It
also has an issue on the first clock reads after module load, because
the initial value for the last LSB is 0. This can cause the time to stay
stuck at the same value for a few seconds at the beginning of the trace.
That is unfortunate for the first trace after module load, because this
is where the offset between realtime and trace_clock is sampled, and a
stuck clock prevents correlation of kernel and user-space traces for
that session.
It only affects 32-bit systems with kernels >= 3.17.
Fix this by using the non-nmi-safe clock source on 32-bit systems.
While we are there, remove an implementation-defined C99 behavior
regarding casting u64 to long by using unsigned arithmetic instead:
turn:
if (((long) now - (long) last) < 0)
into:
if (U64_MAX / 2 < now - last)
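The unsigned form of this test can be tried in user space with the sketch
below. The clock_went_backwards() name is made up for this example, and
UINT64_MAX from <stdint.h> stands in for the kernel's U64_MAX:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when now is "before" last, treating the 64-bit counter as a
 * ring: the unsigned difference then lands in the upper half of the
 * range. Unsigned subtraction wraps, so no implementation-defined
 * conversion to a signed type is involved.
 */
bool clock_went_backwards(uint64_t now, uint64_t last)
{
	return UINT64_MAX / 2 < now - last;
}
```

A counter that wrapped around between the two reads still compares as
moving forward, because the wrapped difference stays small.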
from /home/compudj/git/lttng-modules/lttng-context-perf-counters.c:23:
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c: In function ‘lttng_add_perf_counter_to_ctx’:
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c:353:22: error: ‘cpu’ undeclared (first use in this function)
for_each_online_cpu(cpu) {
^
./include/linux/cpumask.h:223:8: note: in definition of macro ‘for_each_cpu’
for ((cpu) = -1; \
^
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c:353:2: note: in expansion of macro ‘for_each_online_cpu’
for_each_online_cpu(cpu) {
^
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c:353:22: note: each undeclared identifier is reported only once for each function it appears in
for_each_online_cpu(cpu) {
^
./include/linux/cpumask.h:223:8: note: in definition of macro ‘for_each_cpu’
for ((cpu) = -1; \
^
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c:353:2: note: in expansion of macro ‘for_each_online_cpu’
for_each_online_cpu(cpu) {
^
./include/linux/cpumask.h:224:38: warning: left-hand operand of comma expression has no effect [-Wunused-value]
(cpu) = cpumask_next((cpu), (mask)), \
^
./include/linux/cpumask.h:717:36: note: in expansion of macro ‘for_each_cpu’
#define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
^
/home/compudj/git/lttng-modules/lttng-context-perf-counters.c:353:2: note: in expansion of macro ‘for_each_online_cpu’
for_each_online_cpu(cpu) {
^
scripts/Makefile.build:289: recipe for target '/home/compudj/git/lttng-modules/lttng-context-perf-counters.o' failed
make[2]: *** [/home/compudj/git/lttng-modules/lttng-context-perf-counters.o] Error 1
make[2]: *** Waiting for unfinished jobs....
Fix: bump stable kernel version ranges for clock work-around
Linux commit 27727df240c7 ("Avoid taking lock in NMI path with
CONFIG_DEBUG_TIMEKEEPING"), changed the logic to open-code
the timekeeping_get_ns() function, but forgot to include
the unit conversion from cycles to nanoseconds, breaking the
function's output, which impacts LTTng.
We expected Linux commit 58bfea9532 "timekeeping: Fix
__ktime_get_fast_ns() regression" to make its way into stable
kernels promptly, but it appears new stable kernel releases were
done before the fix was cherry-picked from the master branch.
We therefore need to bump the version ranges for the work-around
in lttng-modules.
Linux commit 27727df240c7 ("Avoid taking lock in NMI path with
CONFIG_DEBUG_TIMEKEEPING"), changed the logic to open-code
the timekeeping_get_ns() function, but forgot to include
the unit conversion from cycles to nanoseconds, breaking the
function's output, which impacts LTTng.
The following kernel versions are affected: 4.8, 4.7.4+, 4.4.20+,
4.1.32+
We expect that the upstream fix will reach the master and stable
branches in a timely manner before the next releases, so we use 4.8.1,
4.7.7, 4.4.24, and 4.1.34 as upper bounds (exclusive).
Fall back to the non-NMI-safe trace clock for those kernel versions.
We simply discard events from NMI context with an in_nmi() check,
as we did before Linux 3.17.
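The version ranges listed above can be expressed as inclusive lower bounds
and exclusive upper bounds. A minimal sketch, assuming hypothetical KVER()
and KVER_IN_RANGE() macros and a needs_clock_workaround() helper (these are
not the lttng-modules wrappers):

```c
/* Same packing as the kernel's classic KERNEL_VERSION() macro. */
#define KVER(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

/* Lower bound inclusive, upper bound exclusive. */
#define KVER_IN_RANGE(v, lo, hi)	((v) >= (lo) && (v) < (hi))

/* True when kernel version v falls in one of the affected ranges. */
int needs_clock_workaround(int v)
{
	return KVER_IN_RANGE(v, KVER(4, 8, 0), KVER(4, 8, 1)) ||
	       KVER_IN_RANGE(v, KVER(4, 7, 4), KVER(4, 7, 7)) ||
	       KVER_IN_RANGE(v, KVER(4, 4, 20), KVER(4, 4, 24)) ||
	       KVER_IN_RANGE(v, KVER(4, 1, 32), KVER(4, 1, 34));
}
```

Bumping the work-around for newer stable releases then only means raising
the upper bound of the relevant range.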
Simon Marchi [Tue, 4 Oct 2016 21:07:05 +0000 (17:07 -0400)]
Add support for i2c tracepoints
This patch teaches lttng-modules about the i2c tracepoints in the Linux
kernel.
It contains the following tracepoints:
* i2c_write
* i2c_read
* i2c_reply
* i2c_result
I translated the fields and assignments from the kernel's
include/trace/events/i2c.h as well as I could. I also tried building
this module against a kernel without CONFIG_I2C, and it built fine (the
required types are unconditionally defined). So I don't think any "#if
CONFIG_I2C" or similar are required.
A module parameter (extract_sensitive_payload) controls the extraction
of possibly sensitive data from events.
[ With edit by Mathieu Desnoyers. ]
Signed-off-by: Simon Marchi <simon.marchi@ericsson.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Version checks in makefiles should always be a disjunctive normal form
where the conjunctions consist of one or more "equals" comparisons and
at most a single greater-or-equal comparison.