sched: Remove vruntime from trace_sched_stat_runtime()
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.
Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.
Change-Id: I74acf0b8340c371e8411116e07e5c97b10f9c756
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
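As a rough illustration of why the absolute vruntime is redundant in the trace: an observer can sum the runtime deltas and, knowing the task's load weight, approximate the vruntime delta on its own. The sketch below assumes the CFS-style scaling `delta * NICE_0_LOAD / weight`. The helper names, the `NICE_0_LOAD` value and the plain integer division are assumptions for illustration, not the kernel's fixed-point implementation.

```c
#include <stdint.h>

/* Load weight of a nice-0 task (assumed value for this sketch). */
#define NICE_0_LOAD 1024ULL

/* Accumulate the runtime deltas reported by successive trace events. */
uint64_t sum_runtime(const uint64_t *deltas_ns, unsigned int n)
{
	uint64_t total = 0;

	for (unsigned int i = 0; i < n; i++)
		total += deltas_ns[i];
	return total;
}

/* Approximate a vruntime delta from a runtime delta and a load weight. */
uint64_t vruntime_delta(uint64_t runtime_delta_ns, uint64_t weight)
{
	return runtime_delta_ns * NICE_0_LOAD / weight;
}
```

With a weight equal to `NICE_0_LOAD` the two deltas coincide; a heavier task accumulates vruntime more slowly.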
Michael Jeanson [Mon, 16 Dec 2024 20:02:23 +0000 (15:02 -0500)]
fix: add missing check for __must_check 'lttng_file_ref_put()' (v6.13)
Add a missing return code check to a call to 'lttng_file_ref_put()'
which is marked as 'must_check', otherwise it results in a build
failure when using -Werror.
Change-Id: Ib50ec669ffc0fe87a367b25b788518d148f7a85e
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
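A minimal sketch of the pattern behind this fix, using hypothetical names (`demo_ref`, `demo_ref_put()`, `demo_release()`) rather than the real lttng or kernel helpers. A function marked with the `warn_unused_result` attribute, which is what `__must_check` expands to on GCC and Clang, triggers a warning when its result is dropped, so the caller has to consume it.

```c
#include <stdbool.h>

/* Same attribute the kernel's __must_check expands to on GCC/Clang. */
#define demo_must_check __attribute__((warn_unused_result))

/* Hypothetical reference counter, for illustration only. */
struct demo_ref {
	int count;
};

/* Returns true when the last reference was dropped. */
demo_must_check bool demo_ref_put(struct demo_ref *ref)
{
	return --ref->count == 0;
}

/* The caller consumes the return value, so -Werror builds stay clean. */
int demo_release(struct demo_ref *ref, int *nr_freed)
{
	if (demo_ref_put(ref))
		(*nr_freed)++;
	return ref->count;
}
```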
Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that caller held rcu_read_lock()
across the lookup and whatever it wanted to read from the struct
file found. When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for lookup result, requiring the caller to call fput()
afterwards.
But that turned them pointless - e.g.
	rcu_read_lock();
	file = lookup_fdget_rcu(fd);
	rcu_read_unlock();
is equivalent to
	file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form. Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change-Id: I98f1c617e8bd7ad9db7a9af2a1fa76c5eb26e8b8
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Port files to rely on file_ref reference to improve scaling and gain
overflow protection.
- We continue to WARN during get_file() in case a file that is already
marked dead is revived as get_file() is only valid if the caller
already holds a reference to the file. This hasn't changed; only the
check itself changes.
- The semantics for epoll and ttm's dmabuf usage have changed. Both
epoll and ttm synchronize with __fput() to prevent the underlying file
from being freed.
Change-Id: I9f376af50835f15f74ff7fc82bdb752e09f77222
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Currently, trace point mm_page_alloc_zone_locked() doesn't show correct
information.
First, when alloc_flag has ALLOC_HARDER/ALLOC_CMA, a page can be allocated
from MIGRATE_HIGHATOMIC/MIGRATE_CMA. Nevertheless, the tracepoint reports
the requested migration type, not MIGRATE_HIGHATOMIC or MIGRATE_CMA.
Second, after commit 44042b4498728 ("mm/page_alloc: allow high-order pages
to be stored on the per-cpu lists") the percpu-list can store high-order
pages. But the tracepoint determines whether it is a refill of the
percpu-list by comparing the requested order with 0.
To handle these problems, make mm_page_alloc_zone_locked() only be called
by __rmqueue_smallest with correct migration type. With a new argument
called percpu_refill, it can show roughly whether it is a refill of
percpu-list.
uprobes: make uprobe_register() return struct uprobe *
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *"
rather than inode + offset. This simplifies the code and allows to avoid
the unnecessary find_uprobe() + put_uprobe() in these functions.
TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that
this uprobe can't be freed before up_write(&uprobe->register_rwsem).
With uprobe_unregister() having grown a synchronize_srcu(), it becomes
fairly slow to call. Esp. since both users of this API call it in a
loop.
Peel off the sync_srcu() and do it once, after the loop.
We also need to add uprobe_unregister_sync() into uprobe_register()'s
error handling path, as we need to be careful about returning to the
caller before we have a guarantee that partially attached consumer won't
be called anymore. This is an unlikely slow path and this should be
totally fine to be slow in the case of a failed attach.
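The batching idea described above can be modelled in plain C as follows. This is only a user-space sketch: `demo_unregister_nosync()` and `demo_unregister_sync()` are hypothetical stand-ins, and the counter merely records how often the expensive wait (the role played by synchronize_srcu() in the kernel) would run.

```c
#include <stddef.h>

/* Hypothetical stand-in for a uprobe consumer; illustration only. */
struct demo_consumer {
	int attached;
};

/* Counts how many times the expensive grace-period wait ran. */
unsigned int demo_nr_syncs;

/* Cheap part: detach one consumer without waiting. */
void demo_unregister_nosync(struct demo_consumer *c)
{
	c->attached = 0;
}

/* Expensive part: stands in for the SRCU synchronization. */
void demo_unregister_sync(void)
{
	demo_nr_syncs++;
}

/* Detach every consumer, then wait once instead of once per consumer. */
void demo_detach_all(struct demo_consumer *consumers, size_t n)
{
	for (size_t i = 0; i < n; i++)
		demo_unregister_nosync(&consumers[i]);
	if (n > 0)
		demo_unregister_sync();
}
```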
It doesn't make any sense to have 2 versions of _register(). Note that
trace_uprobe_enable(), the only user of uprobe_register(), doesn't need
to check tu->ref_ctr_offset to decide which one should be used, it could
safely pass ref_ctr_offset == 0 to uprobe_register_refctr().
Add this argument to uprobe_register(), update the callers, and kill
uprobe_register_refctr().
Change-Id: I8d1f9a5db1f19c2bc2029709ae36f82e86f6fe58
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 28 Oct 2024 20:02:59 +0000 (16:02 -0400)]
Fix: silence 'non-consumed' message for non-started sessions
Destroying a session with at least one enabled event and which has never
been started will currently result in an error message in the kernel log
about 'non-consumed data' for each of the per-cpu buffers. This happens
because a packet header is created in the buffer but never consumed if
the session is not started.
Add a check in the buffer cleanup code to avoid printing 'non-consumed
data' errors for buffers associated with a session that was never
started.
Change-Id: I1358e1ae49d03544a961515b97b115a488434e27
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Support is divided into two main areas:
- reading VPD pages and setting sdev request_queue limits
- support WRITE ATOMIC (16) command and tracing
The relevant block limits VPD page needs to be read to allow the block layer
request_queue atomic write limits to be set. These VPD page limits are
described in sbc4r22 section 6.6.4 - Block limits VPD page.
There are five limits of interest:
- MAXIMUM ATOMIC TRANSFER LENGTH
- ATOMIC ALIGNMENT
- ATOMIC TRANSFER LENGTH GRANULARITY
- MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY
- MAXIMUM ATOMIC BOUNDARY SIZE
MAXIMUM ATOMIC TRANSFER LENGTH is the maximum length for a WRITE ATOMIC
(16) command. It will not be greater than the device MAXIMUM TRANSFER
LENGTH.
ATOMIC ALIGNMENT and ATOMIC TRANSFER LENGTH GRANULARITY are the minimum
alignment and length values for an atomic write in terms of logical blocks.
Unlike NVMe, SCSI does not specify an LBA space boundary, but does specify
a per-IO boundary granularity. The maximum boundary size is specified in
MAXIMUM ATOMIC BOUNDARY SIZE. When used, this boundary value is set in the
WRITE ATOMIC (16) ATOMIC BOUNDARY field - layout for the WRITE_ATOMIC_16
command can be found in sbc4r22 section 5.48. This boundary value is the
granularity size at which the device may atomically write the data. A value
of zero in WRITE ATOMIC (16) ATOMIC BOUNDARY field means that all data must
be atomically written together.
MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY is the maximum atomic write
length if a non-zero boundary value is set.
For atomic write support, the WRITE ATOMIC (16) boundary is not of much
interest, as the block layer expects each request submitted to be executed
atomically. However, the SCSI spec does leave itself open to a quirky
scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero, yet MAXIMUM ATOMIC
TRANSFER LENGTH WITH BOUNDARY and MAXIMUM ATOMIC BOUNDARY SIZE are both
non-zero. This case will be supported.
To set the block layer request_queue atomic write capabilities, sanitize
the VPD page limits and set limits as follows:
- atomic_write_unit_min is derived from granularity and alignment values.
If no granularity value is set, use physical block size
- atomic_write_unit_max is derived from MAXIMUM ATOMIC TRANSFER LENGTH. In
the scenario where MAXIMUM ATOMIC TRANSFER LENGTH is zero and boundary
limits are non-zero, use MAXIMUM ATOMIC BOUNDARY SIZE for
atomic_write_unit_max. New flag scsi_disk.use_atomic_write_boundary is
set for this scenario.
- atomic_write_boundary_bytes is set to zero always
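The derivation in the list above can be sketched as follows. This is a simplified model under stated assumptions, not the sd driver code: the structs and `demo_set_atomic_limits()` are hypothetical, taking the larger of granularity and alignment for the minimum unit is an assumption, and further sanitization of the VPD values is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Block Limits VPD fields, in logical blocks (hypothetical struct). */
struct demo_vpd_limits {
	uint32_t max_atomic;               /* MAXIMUM ATOMIC TRANSFER LENGTH */
	uint32_t atomic_alignment;         /* ATOMIC ALIGNMENT */
	uint32_t atomic_granularity;       /* ATOMIC TRANSFER LENGTH GRANULARITY */
	uint32_t max_atomic_with_boundary; /* MAX ATOMIC TRANSFER LENGTH WITH BOUNDARY */
	uint32_t max_atomic_boundary;      /* MAXIMUM ATOMIC BOUNDARY SIZE */
};

struct demo_queue_limits {
	uint32_t unit_min_bytes;
	uint32_t unit_max_bytes;
	uint32_t boundary_bytes;
	bool use_atomic_write_boundary;
};

/* Returns false when the device advertises no usable atomic write limits. */
bool demo_set_atomic_limits(const struct demo_vpd_limits *vpd,
			    uint32_t logical_block_size,
			    uint32_t physical_block_size,
			    struct demo_queue_limits *q)
{
	/* Quirky case: no plain maximum, but both boundary limits are set. */
	bool boundary_only = !vpd->max_atomic &&
			     vpd->max_atomic_with_boundary &&
			     vpd->max_atomic_boundary;
	uint32_t min_blocks, max_blocks;

	if (!vpd->max_atomic && !boundary_only)
		return false;

	/* Fall back to the physical block size when no granularity is set. */
	min_blocks = vpd->atomic_granularity ?
		     vpd->atomic_granularity :
		     physical_block_size / logical_block_size;
	/* Assumption: the larger of granularity and alignment wins. */
	if (vpd->atomic_alignment > min_blocks)
		min_blocks = vpd->atomic_alignment;

	max_blocks = boundary_only ? vpd->max_atomic_boundary : vpd->max_atomic;

	q->unit_min_bytes = min_blocks * logical_block_size;
	q->unit_max_bytes = max_blocks * logical_block_size;
	q->boundary_bytes = 0; /* always zero, as stated in the list */
	q->use_atomic_write_boundary = boundary_only;
	return true;
}
```

For a device with 512-byte logical blocks, 4096-byte physical blocks and only MAXIMUM ATOMIC TRANSFER LENGTH set to 64, this sketch yields a minimum unit of 4096 bytes and a maximum unit of 32768 bytes.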
SCSI also supports a WRITE ATOMIC (32) command, which is for type 2
protection enabled. This is not going to be supported now, so check for
T10_PI_TYPE2_PROTECTION when setting any request_queue limits.
To handle an atomic write request, add support for WRITE ATOMIC (16)
command in handler sd_setup_atomic_cmnd(). Flag use_atomic_write_boundary
is checked here for encoding ATOMIC BOUNDARY field.
Trace info is also added for WRITE_ATOMIC_16 command.
Change-Id: Ie072002fe2184615c72531ac081a324ef18cfb03
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However, there is a special case in btrfs_create_dio_extent() where,
for NOCOW/PREALLOC ordered extents, we cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Change-Id: I2e3245bb0d1f5263e902659aa05848d5e231909b
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
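The rule stated in this message, that block_start is derivable from the remaining members, can be written as a small helper. The struct below is a reduced, hypothetical view of an extent map, and the `regular` flag is an assumption standing in for however the real code tells regular extents apart from the others.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced, hypothetical view of an extent map; illustration only. */
struct demo_extent_map {
	uint64_t disk_bytenr;
	uint64_t offset;
	bool regular; /* false for anything that is not a regular extent */
};

/* Recompute the value the removed block_start member used to hold. */
uint64_t demo_block_start(const struct demo_extent_map *em)
{
	if (em->regular)
		return em->disk_bytenr + em->offset;
	return em->disk_bytenr;
}
```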
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).
Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.
For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.
Change-Id: Ib03fc685b4e876bf4e53afdd28ca9826342a0e4e
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Change-Id: I025a30d49b3e3ddc37d7846acc191ebbdf2ff19e
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
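Both block_len and orig_start are described above as derivable from the members that remain. A minimal sketch of the two derivations follows. The struct is a reduced, hypothetical view, the `compressed` flag is an assumption about how a caller would distinguish the two cases, and `demo_orig_start()` is only meaningful for extents that are neither holes nor inline.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reduced, hypothetical view of an extent map; illustration only. */
struct demo_extent_map {
	uint64_t start;
	uint64_t len;
	uint64_t offset;
	uint64_t disk_num_bytes;
	bool compressed;
};

/* Old block_len: on-disk size if compressed, else the logical length. */
uint64_t demo_block_len(const struct demo_extent_map *em)
{
	return em->compressed ? em->disk_num_bytes : em->len;
}

/* Old orig_start: only valid for extents that are not hole or inline. */
uint64_t demo_orig_start(const struct demo_extent_map *em)
{
	return em->start - em->offset;
}
```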
ext4: make ext4_da_reserve_space() reserve multi-clusters
Add an 'nr_resv' parameter to ext4_da_reserve_space(), which indicates the
number of clusters to reserve, so that it can reserve multiple clusters
at a time.
Change-Id: Ib1ce8c3023d53a6d22ec444a435fdb3c871f64c5
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
skb does not include enough information to find out receiving
sockets/services and netns/containers on packet drops. In theory
skb->dev tells about netns, but it can get cleared/reused, e.g. by TCP
stack for OOO packet lookup. Similarly, skb->sk often identifies a local
sender, and tells nothing about a receiver.
Allow passing an extra receiving socket to the tracepoint to improve
the visibility on receiving drops.
Change-Id: I33c8ce1a48006456f198ab1592f733b55be01016
Signed-off-by: Kienan Stewart <kstewart@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Introduce two extension points for trace hit counters:
1) Future "actions" to perform other than "increment",
2) Future dimension indexing schemes (keys) other than tokens.
Change the layout of struct lttng_kernel_abi_counter_key_dimension
by adding a "key_type" field. A new struct lttng_kernel_abi_counter_key_dimension_tokens
inherits from struct lttng_kernel_abi_counter_key_dimension, and contains
the uint32_t nr_key_tokens field. The only currently supported key_type
is LTTNG_KERNEL_ABI_KEY_TYPE_TOKENS = 0.
Change the layout of struct lttng_kernel_abi_counter_event by adding an
"action" field. The only currently supported action is
LTTNG_KERNEL_ABI_COUNTER_ACTION_INCREMENT = 0.
Change the struct lttng_kernel_abi_key_token_string so it inherits from
struct lttng_kernel_abi_key_token. The "len" field of
struct lttng_kernel_abi_key_token now includes the length of the entire
child structure.
Remove struct lttng_kernel_abi_counter_key: it was previously expecting
all key dimensions to have the same size. But because each dimension can
be of a different type, each may have its own distinct size.
Change the newly introduced API between probe providers to change the
"event_counter_add" callback into a "counter_hit" callback, which takes
one less argument (no integer value), but takes additional stack_data,
probe_ctx, and event_counter_ctx arguments for future use.
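The "inherits from" layout described above, where a parent header carries a "len" covering the whole child structure, can be sketched generically as follows. The struct names, field order and types here are hypothetical and do not reproduce the actual lttng_kernel_abi_* definitions. The point is only that a reader can hop from token to token through "len" without knowing every child type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical parent header: "len" covers the whole child structure. */
struct demo_key_token {
	uint32_t len;
	uint32_t type;
};

/* Hypothetical child: embeds the parent as its first member. */
struct demo_key_token_string {
	struct demo_key_token parent;
	uint32_t string_len;
	char str[];
};

/*
 * Count the tokens packed in a buffer by following each header's "len".
 * Returns -1 when a header is truncated or its length is malformed.
 */
int demo_count_tokens(const void *buf, size_t buf_len)
{
	const unsigned char *p = buf;
	size_t pos = 0;
	int n = 0;

	while (pos < buf_len) {
		struct demo_key_token hdr;

		if (buf_len - pos < sizeof(hdr))
			return -1;
		memcpy(&hdr, p + pos, sizeof(hdr));
		if (hdr.len < sizeof(hdr) || hdr.len > buf_len - pos)
			return -1;
		pos += hdr.len;
		n++;
	}
	return n;
}
```

Because every element announces its own size, elements of different types and sizes can sit side by side, which is the property that made a fixed-size key structure unnecessary.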
Fix: event notifier: set eval_capture to false for kprobe, kretprobe and uprobe
Trying to capture fields for kprobe, kretprobe and uprobe event
notifications will end up dereferencing NULL pointers. Prevent execution
of capture code in those cases.
Michael Jeanson [Tue, 18 Jun 2024 18:35:38 +0000 (14:35 -0400)]
Implement REUSE 3.0 with SPDX identifiers
Implement the full REUSE spec [1] to help with copyright and licensing
audits and compliance. This will reduce a lot of manual work for the
licensing audit required in Debian on each update and also allow using
automated tools.
For files that lacked copyright and licensing information, I used the
following guidelines. If a clear author could be determined from the git
history use it, otherwise use 'EfficiOS Inc.'. For code use
'GPL-2.0-only OR LGPL-2.1-only' unless otherwise stated, for
documentation 'CC-BY-SA-4.0' and for data files 'CC0-1.0'.
Freeform text files were converted to Markdown to allow licensing
comments.
Running the reuse tool on the repo is now successful:
$ reuse lint
# SUMMARY
* Bad licenses: 0
* Deprecated licenses: 0
* Licenses without file extension: 0
* Missing licenses: 0
* Unused licenses: 0
* Used licenses: CC0-1.0, GPL-2.0-only, CC-BY-SA-4.0, MIT, LGPL-2.1-only
* Read errors: 0
* files with copyright information: 358 / 358
* files with license information: 358 / 358
Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)
[1] https://reuse.software/tutorial/
Change-Id: I1755cab24a6fcec7a6c9a2136891418203ec34b8
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Wed, 17 Jul 2024 15:32:31 +0000 (11:32 -0400)]
fix: add 'static inline' to lttng_kretprobes_init_event()
Add missing 'static inline' to lttng_kretprobes_init_event() placeholder
function when CONFIG_KRETPROBES is not set. Also some minor reformatting
to improve readability.
Change-Id: I23cf83dff99f4168ae0f339c2b4911796e0b0273
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
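The placeholder pattern this fix refers to usually looks like the sketch below. The names (`DEMO_CONFIG_KRETPROBES`, `demo_kretprobes_init_event()`) and the `-ENOSYS` return value are assumptions for illustration, not the actual lttng-modules code. Marking the header stub 'static inline' keeps each including file's copy private, which avoids duplicate definitions at link time and unused-function warnings.

```c
#include <errno.h>
#include <stddef.h>

struct demo_event; /* opaque, hypothetical */

#ifdef DEMO_CONFIG_KRETPROBES
/* Real implementation lives in a .c file when the feature is enabled. */
int demo_kretprobes_init_event(struct demo_event *event);
#else
/* Placeholder used when the feature is compiled out. */
static inline int demo_kretprobes_init_event(struct demo_event *event)
{
	(void)event;
	return -ENOSYS;
}
#endif
```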
Michael Jeanson [Tue, 16 Jul 2024 21:02:36 +0000 (17:02 -0400)]
fix: copy_struct_from_user() for non-LTS branches < v4.19
The 'linux/bits.h' was backported to LTS branches but is not available
on non-LTS before v4.19. Use 'asm/byteorder.h' instead to get the
__LITTLE_ENDIAN define which is available on all kernel versions we
support.
Change-Id: Icfe733ab944616b3bd6d0023ad0869eefb830b34
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Both lttng_abi_copy_user_old_counter_conf and
lttng_abi_copy_user_counter_conf should zero-init the counter_conf
destination argument, else the "dimension->flags" field is uninitialized
before being OR'd with flags.
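A small sketch of the bug class described above, with hypothetical structure and flag names: when a destination is only ever OR'd into, it has to be zeroed first, otherwise stale bits from the caller's storage leak into the result.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_NR_DIMENSIONS 4
#define DEMO_FLAG_OVERFLOW (1U << 1)

/* Hypothetical counter configuration; illustration only. */
struct demo_dimension {
	uint64_t size;
	uint32_t flags;
};

struct demo_counter_conf {
	struct demo_dimension dimension[DEMO_NR_DIMENSIONS];
};

void demo_copy_counter_conf(struct demo_counter_conf *dst, int has_overflow)
{
	/* Without this, the |= below would mix in whatever was in *dst. */
	memset(dst, 0, sizeof(*dst));
	if (has_overflow)
		dst->dimension[0].flags |= DEMO_FLAG_OVERFLOW;
}
```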
Fix the following warning by splitting lttng_counter_ioctl (which has a
lot of local variables in the switch/case legs) into many sub-functions.
/home/efficios/git/lttng-modules/src/lttng-abi.c: In function ‘lttng_counter_ioctl’:
/home/efficios/git/lttng-modules/src/lttng-abi.c:1227:1: warning: the frame size of 1032 bytes is larger than 1024 bytes [-Wframe-larger-than=]
1227 | }
| ^
Errors returned by lttng_kernel_event_create are handled by the caller,
and may happen e.g. when a kprobe or kretprobe symbol does not exist.
It should not generate a warning in the kernel console.
Fix: circular dependency on symbol lttng_id_tracker_lookup
Adding lttng_id_tracker_lookup feature into kprobes, uprobes and
kretprobes introduces a circular dependency between lttng-tracer.ko and
the respective probe modules.
There is no real reason for having the kprobes/uprobes/kretprobes
modules separate from the tracer core, so combine those.
Commit 0badc02f82b38 ("Fix: adjust SLE version ranges to build with SP2
and SP3") introduced code duplication. Modify the version match logic to
remove duplicated code.
Also remove the confusing comment about checking if a fd exists. I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
Michael Jeanson [Wed, 29 May 2024 19:02:15 +0000 (15:02 -0400)]
Warn and return on fd overflow fdt
The fdt should only grow and iterate_fd() holds file_lock, which should
ensure the fdt does not change while the lock is taken, but be cautious
and check anyway.
Change-Id: Icd6a3263026734cbe3f296f6087f79add4148a8f
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
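The defensive check described in this message amounts to a bounds test before indexing the table. The user-space sketch below uses a hypothetical `demo_fdtable` and prints a message where the kernel code would warn; it is not the lttng-modules implementation.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, reduced file descriptor table; illustration only. */
struct demo_fdtable {
	unsigned int max_fds;
	void **fd;
};

/* Returns the entry for fd, or NULL (with a warning) when out of range. */
void *demo_lookup_fd(const struct demo_fdtable *fdt, unsigned int fd)
{
	if (fd >= fdt->max_fds) {
		fprintf(stderr, "warning: fd %u beyond table of %u entries\n",
			fd, fdt->max_fds);
		return NULL;
	}
	return fdt->fd[fd];
}
```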
net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
and destination IP/port whereas this information can be critical in case
of UDP/syslog.
Change-Id: I0c337c5817b0a120298cbf5088d60671d9625b0d
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>