ext4: adjust reserved cluster count when removing extents
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree. Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.
Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster. To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed. This prevents another thread
from allocating the freed block before the reservation can be reapplied.
Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.
Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
There are no more users of SEND_SIG_FORCED so it may be safely removed.
Remove the definition of SEND_SIG_FORCED, it's use in is_si_special,
it's use in TP_STORE_SIGINFO, and it's use in __send_signal as without
any users the uses of SEND_SIG_FORCED are now unncessary.
This makes the code simpler, easier to understand and use. Users of
signal sending functions now no longer need to ask themselves do I
need to use SEND_SIG_FORCED.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
signal: Distinguish between kernel_siginfo and siginfo
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.
The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.
So split siginfo in two; kernel_siginfo and siginfo. Keeping the
traditional name for the userspace definition. While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.
The definition of struct kernel_siginfo I have put in include/signal_types.h
A set of buildtime checks has been added to verify the two structures have
the same field offsets.
To make it easy to verify the change kernel_siginfo retains the same
size as siginfo. The reduction in size comes in a following change.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Julien Desfossez [Fri, 26 Oct 2018 19:55:18 +0000 (15:55 -0400)]
CPU topology statedump on x86
New statedump tracepoint to dump the active CPU/NUMA topology. This
allows to know which CPUs are SMT sibling or on the same socket. For now
only x86 is supported because all architectures has different fields.
The field "architecture" is statically defined and should be present in
all implementations so parsing tools know what content to expect.
Upstream Linux commit 46e0c9be20 introduces relative references in the
struct tracepoint array of pointers.
Up to (including) v4.19-rc7, the upstream kernel has a type mismatch bug
that allows it to pass an out-of-bound end of array to modules
coming/going notifiers.
The fix for upstream Linux is to introduce a new type: tracepoint_ptr_t,
which can be used to adequately iterate on the array. It is introduced
prior to v4.19 as commit 9c0be3f6b5d77 "tracepoint: Fix tracepoint array
element size mismatch".
Fix: sync event enablers before choosing header type
On session start, we should allocate the event IDs before figuring
out the number of events per channel and select the proper header
type.
Without this, the number of events is always perceived to be 0,
which selects the "compact" header type. For a channel containing
many events (e.g. enable-event -k -a), this selects an inefficient
header type. With this fix, it selects the "large" header type,
which is more appropriate for a larger number of event IDs.
This will lead to a reduced trace throughput for tracing workloads
that have many events.
* si_mem_available() was added in kernel 4.6 with commit d02bd27.
* {set, clear}_current_oom_origin() were added in kernel 3.8 with commit: e1e12d2f
Solution
========
Add wrappers around these functions such that older kernels will build
with these functions defined as NOP or trivial return value.
wrapper_check_enough_free_pages() uses the si_mem_available() kernel
function to compute if the number pages requested passed as parameter is
smaller than the number of pages available on the machine. If the
si_mem_available() kernel function is unavailable, we always return
true.
wrapper_set_current_oom_origin() function wraps the
set_current_oom_origin() kernel function when it is available.
If set_current_oom_origin() is unavailable the wrapper is empty.
wrapper_clear_current_oom_origin() function wraps the
clear_current_oom_origin() kernel function when it is available.
If clear_current_oom_origin() is unavailable the wrapper is empty.
Drawbacks
=========
None.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Prevent allocation of buffers if exceeding available memory
Issue
=====
The running system can be rendered unusable by creating a channel
buffers larger than the available memory of the system, resulting in
random processes being killed by the OOM-killer.
These simple commands trigger the crash on my 15G of RAM laptop:
lttng create
lttng enable-channel -k --subbuf-size=16G --num-subbuf=1 chan0
Note that the subbuf-size * num-subbuf is larger than the physical
memory.
Solution
========
Get an estimate of the number of available pages and return ENOMEM if
there are not enough pages to cover the needs of the caller. Also, mark
the calling user thread as the first target for the OOM killer in case
the estimate of available pages was wrong.
This greatly reduces the attack surface of this issue as well as reducing
its potential impact.
This approach is inspired by the one taken by the Linux kernel
trace ring buffer[1].
Drawback
========
This approach is imperfect because it's based on an estimate.
rcu: Convert rcu_grace_period tracepoint to gp_seq
This commit makes the rcu_grace_period tracepoint use gp_seq instead
of ->gpnum or ->completed. It also introduces a "cpuofl-bgp" string to
less obscurely indicate when a CPU has gone offline while a grace period
is waiting on it.
tracing: Centralize preemptirq tracepoints and unify their usage
This patch detaches the preemptirq tracepoints from the tracers and
keeps it separate.
Advantages:
* Lockdep and irqsoff event can now run in parallel since they no longer
have their own calls.
* This unifies the usecase of adding hooks to an irqsoff and irqson
event, and a preemptoff and preempton event.
3 users of the events exist:
- Lockdep
- irqsoff and preemptoff tracers
- irqs and preempt trace events
The unification cleans up several ifdefs and makes the code in preempt
tracer and irqsoff tracers simpler. It gets rid of all the horrific
ifdeferry around PROVE_LOCKING and makes configuration of the different
users of the tracepoints more easy and understandable. It also gets rid
of the time_* function calls from the lockdep hooks used to call into
the preemptirq tracer which is not needed anymore. The negative delta in
lines of code in this patch is quite large too.
In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
as a single point for registering probes onto the tracepoints. With
this,
the web of config options for preempt/irq toggle tracepoints and its
users becomes:
net: expose sk wmem in sock_exceed_buf_limit tracepoint
Currently trace_sock_exceed_buf_limit() only show rmem info,
but wmem limit may also be hit.
So expose wmem info in this tracepoint as well.
Regarding memcg, I think it is better to introduce a new tracepoint(if
that is needed), i.e. trace_memcg_limit_hit other than show memcg info in
trace_sock_exceed_buf_limit.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Jonathan Rajotte [Wed, 19 Sep 2018 21:48:49 +0000 (17:48 -0400)]
Fix: access migrate_disable field directly
For stable real time kernel > 4.9, the __migrate_disabled utility symbol
is not always exported. This can result in linking problem at build time
and runtime, preventing the loading of the tracer.
The problem was reported to the RT community. [1] [2]
A solution is to access the field directly instead of using the
utility wrapper.
It is important to note that the field is now available for other
configurations than CONFIG_PREEMPT_RT_FULL. For now, we choose to
expose the migratable context only for configurations where
CONFIG_PREEMPT_RT_FULL is set.
Based on the configuration dependency of the kernels, selecting
CONFIG_PREEMPT_RT_FULL ensures the presence of the migrate_disable
field.
CPU hotplug handles teardown on failure to complete adding an instance
of CPU hotplug. Trying to remove after a failed "add" on that instance
triggers a NULL pointer dereference OOPS.
Fix: instruction pointer has different names across arch
Different terms are used to refer to the instruction pointer depending
on the CPU architecture.
For example:
x86 -> ip
powerpc -> nip
RISC-V -> sepc
ARM -> ARM_pc
Microblaze -> pc
To fix this issue, we use the instruction_pointer() kernel function
(or macro depending on the arch) to get the right field in the pt_regs
struct for the current architecture.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Problems
========
- There is a typo in the struct name of the parameters of stub version
of the lttng_uprobes_add_callsite function,
- We are building the lttng-uprobes.o object file even when
CONFIG_UPROBES is absent.
Both of these are causing build errors.
Fixes
=====
- Replace struct lttng_kernel_callsite_uprobe by struct
lttng_kernel_event_callsite,
- Only add the lttng-uprobes.o object file to the needed artefacts if
CONFIG_UPROBES is present.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 9 Aug 2018 15:55:32 +0000 (11:55 -0400)]
Fix: adjust SLE version ranges to build with SP2 and SP3
The early kernel versions of SuSE 12 SP3 overlap with the range from the
later SP2 kernels but are from a different source trees. This patch adds
specific ranges for the SP3 kernels that overlap and allows compatibility
with both SP2 and SP3 kernels.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 9 Aug 2018 15:55:31 +0000 (11:55 -0400)]
Fix: Allow alphanumeric characters in SLE version
Allow alphanumeric characters in the long version string before
extracting specific version numbers. This prevents failure in detecting
a SuSE kernel when the version string was customized by the end user.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 29 Jun 2018 21:28:33 +0000 (17:28 -0400)]
Cleanup: move to kernel style SPDX license identifiers
The SPDX identifier is a legally binding shorthand, which can be used instead
of the full boiler plate text. According to kernel documentation it has
to be inserted on the first or second line of a file.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
btrfs: trace: Remove unnecessary fs_info parameter for btrfs__reserve_extent event class
fs_info can be extracted from btrfs_block_group_cache, and all
btrfs_block_group_cache is created by btrfs_create_block_group_cache()
with fs_info initialized, no need to worry about NULL pointer
dereference.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
btrfs: use fs_info for btrfs_handle_em_exist tracepoint
We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
After a change to the snd_jack structure, the 'name' member
is no longer available in all configurations, which results in a
build failure in the tracing code:
include/trace/events/asoc.h: In function 'trace_event_raw_event_snd_soc_jack_report':
include/trace/events/asoc.h:240:32: error: 'struct snd_jack' has no member named 'name'
The name field is normally initialized from the card shortname and
the jack "id" field:
This changes the tracing output to just contain the 'id' by
itself, which slightly changes the output format but avoids the
link error and is hopefully still enough to see what is going on.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The snd_soc_dapm_input_path and snd_soc_dapm_output_path trace events are
identical except for the direction. Instead of having two events have a
single one that has a field that contains the direction.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The ASoC framework is in the process of migrating all IO operations to regmap.
regmap has its own more sophisticated tracing infrastructure for IO operations,
which means that the ASoC level IO tracing becomes redundant, hence this patch
removes them. There are still a handful of ASoC drivers left that do not use
regmap yet, but hopefully the removal of the ASoC IO tracing will be an
additional incentive to switch to regmap.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Userspace callstack context often triggers kernel pagefaults that can be
traced by the kernel tracer which might then attempt to gather the
userspace callstack again... This recursion will be stop by the
RING_BUFFER_MAX_NESTING check but will still pollute the traces with
redundant information.
To prevent this, check if the tracer is already gathering the userspace
callstack and if it's the case don't record it again.
Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Use a limit that fits in a 4096 bytes page on a 64-bit system. The only
reason for the prior 25 entries limitation was a bug in the header size
calculation (now fixed).
Pass same arguments to get_size_arg() than to record(). This new
operation has the same effect than get_size(), and the client code can
implement either one.
Signed-off-by: Francis Giraldeau <francis.giraldeau@gmail.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
This shows exactly how btrfs processes the delayed refs onto disks,
which is very helpful on understanding delayed ref mechanism and
debugging related bugs.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
While debugging truncate problems, I found that these tracepoints could
help us quickly know what went wrong.
Two sets of tracepoints are created to track regular/prealloc file item
and inline file item respectively, I put inline as a separate one since
what inline file items cares about are way less than the regular one.
This adds four tracepoints:
- btrfs_get_extent_show_fi_regular
- btrfs_get_extent_show_fi_inline
- btrfs_truncate_show_fi_regular
- btrfs_truncate_show_fi_inline
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
BUILD_BUG_ON(): fix it and a couple of bogus uses of it
gcc permitting variable length arrays makes the current construct used for
BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
controlling expression isn't really constant. Instead, this patch makes
it so that a bit field gets used here. Consequently, those uses where the
condition isn't really constant now also need fixing.
Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
the expression is compile time constant (__builtin_constant_p() yields
true), the array is still deemed of variable length by gcc, and hence the
whole expression doesn't have the intended effect.
BUILD_BUG_ON used to use the optimizer to do code elimination or fail
at link time; it was changed to first the size of a negative array (a
nicer compile time error), then (in 8c87df457cb58fe75b9b893007917cf8095660a0) to a bitfield.
This forced us to change some non-constant cases to MAYBE_BUILD_BUG_ON();
as Jan points out in that commit, it didn't work as intended anyway.
bitfields: needs a literal constant at parse time, and can't be put under
"if (__builtin_constant_p(x))" for example.
negative array: can handle anything, but if the compiler can't tell it's
a constant, silently has no effect.
link time: breaks link if the compiler can't determine the value, but the
linker output is not usually as informative as a compiler error.
If we use the negative-array-size method *and* the link time trick,
we get the ability to use BUILD_BUG_ON() under __builtin_constant_p()
branches, and maximal ability for the compiler to detect errors at
build time.
We also document it thoroughly.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: pid tracker should track "pgid" for noargs probes
The "pid" notion exposed by LTTng translates to the "pgid" notion in the
Linux kernel. Therefore using "current->pid" as argument to the PID
tracker actually ends up behaving as a "tid" tracker, which does not
match the intent nor the user-space tracer behavior.
The probes taking arguments were fixed by a prior commit, but it missed
probes without arguments.
Clean up: struct rpc_task carries a pointer to a struct rpc_clnt,
and in fact task->tk_client is always what is passed into trace
points that are already passing @task.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event
The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
parameters! Seven of them are from the reclaim_stat structure. This
structure is currently local to mm/vmscan.c. By moving it to the global
vmstat.h header, we can also reference it from the vmscan tracepoints.
In moving it, it brings down the overhead of passing so many arguments
to the trace event. In the future, we may limit the number of arguments
that a trace event may pass (ideally just 6, but more realistically it
may be 8).
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
Kswapd will not wakeup if per-zone watermarks are not failing or if too
many previous attempts at background reclaim have failed.
This can be true if there is a lot of free memory available. For high-
order allocations, kswapd is responsible for waking up kcompactd for
background compaction. If the zone is not below its watermarks or
reclaim has recently failed (lots of free memory, nothing left to
reclaim), kcompactd does not get woken up.
When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
woken up even if kswapd will not reclaim. This allows high-order
allocations, such as thp, to still trigger background compaction even
when the zone has an abundance of free memory.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Lars Persson [Sun, 11 Mar 2018 14:02:43 +0000 (15:02 +0100)]
Fix: do not use CONFIG_HOTPLUG_CPU for the new hotplug API
Kernel configurations without CONFIG_HOTPLUG_CPU throw an unknown
symbol error when attempting to insert the lttng-trace module:
lttng_tracer: Unknown symbol lttng_hp_prepare (err 0)
lttng_tracer: Unknown symbol lttng_hp_online (err 0)
This was caused by lttng-events and lttng-context-perf-counter not
agreeing on which preprocessor condition that should guard the use of
the hotplug API. In fact the API is available also on kernels built
without CONFIG_HOTPLUG_CPU.
Signed-off-by: Lars Persson <larper@axis.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>