lttng-modules.git
Michael Jeanson [Mon, 18 Mar 2019 20:20:35 +0000 (16:20 -0400)] 
Fix: Revert "KVM: MMU: show mmu_valid_gen..." (v5.1)

See upstream commit :

  commit b59c4830ca185ba0e9f9e046fb1cd10a4a92627a
  Author: Sean Christopherson <sean.j.christopherson@intel.com>
  Date:   Tue Feb 5 13:01:30 2019 -0800

    Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints"

    ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
    is one part of a revert all patches from the series that introduced the
    mechanism[1].

    This reverts commit 2248b023219251908aedda0621251cffc548f258.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Mar 2019 20:20:34 +0000 (16:20 -0400)] 
Fix: pipe: stop using ->can_merge (v5.1)

See upstream commit:

  commit 01e7187b41191376cee8bea8de9f907b001e87b4
  Author: Jann Horn <jannh@google.com>
  Date:   Wed Jan 23 15:19:18 2019 +0100

    pipe: stop using ->can_merge

    Al Viro pointed out that since there is only one pipe buffer type to which
    new data can be appended, it isn't necessary to have a ->can_merge field in
    struct pipe_buf_operations, we can just check for a magic type.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Mar 2019 20:20:33 +0000 (16:20 -0400)] 
Fix: rcu: Remove wrapper definitions for obsolete RCU... (v5.1)

See upstream commit :

  commit 6ba7d681aca22e53385bdb35b1d7662e61905760
  Author: Paul E. McKenney <paulmck@linux.ibm.com>
  Date:   Wed Jan 9 15:22:03 2019 -0800

    rcu: Remove wrapper definitions for obsolete RCU update functions

    None of synchronize_rcu_bh, synchronize_rcu_bh_expedited, call_rcu_bh,
    rcu_barrier_bh, synchronize_sched, synchronize_sched_expedited,
    call_rcu_sched, rcu_barrier_sched, get_state_synchronize_sched, and
    cond_synchronize_sched are actually used.  This commit therefore removes
    their trivial wrapper-function definitions.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Mar 2019 20:20:32 +0000 (16:20 -0400)] 
Fix: mm: create the new vm_fault_t type (v5.1)

See upstream commit:

  commit 3d3539018d2cbd12e5af4a132636ee7fd8d43ef0
  Author: Souptick Joarder <jrdr.linux@gmail.com>
  Date:   Thu Mar 7 16:31:14 2019 -0800

    mm: create the new vm_fault_t type

    Page fault handlers are supposed to return VM_FAULT codes, but some
    drivers/file systems mistakenly return error numbers.  Now that all
    drivers/file systems have been converted to use the vm_fault_t return
    type, change the type definition to no longer be compatible with 'int'.
    By making it an unsigned int, the function prototype becomes
    incompatible with a function which returns int.  Sparse will detect any
    attempts to return a value which is not a VM_FAULT code.

    VM_FAULT_SET_HINDEX and VM_FAULT_GET_HINDEX values are changed to avoid
    conflict with other VM_FAULT codes.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 15 Mar 2019 15:13:39 +0000 (11:13 -0400)] 
Fix: extra-version-git.sh redirect stderr to /dev/null

Running make in a git repo that does not contain any tag prints:

fatal: No names found, cannot describe anything.

in the make and make clean outputs.

It's fine to have no tag name available (extra-version-git.sh will
return the value 0), but we should not print an error in the make
output. Redirect this error to /dev/null.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Michael Jeanson <mjeanson@efficios.com>
Mathieu Desnoyers [Tue, 12 Mar 2019 16:15:15 +0000 (12:15 -0400)]
Version 2.9.12 (tag: v2.9.12)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Jonathan Rajotte [Thu, 7 Mar 2019 19:57:59 +0000 (14:57 -0500)] 
Blacklist: kprobe for arm

This upstream kernel commit broke optimized kprobes:

  commit e46daee53bb50bde38805f1823a182979724c229
  Author: Kees Cook <keescook@chromium.org>
  Date:   Tue Oct 30 22:12:56 2018 +0100

    ARM: 8806/1: kprobes: Fix false positive with FORTIFY_SOURCE

    The arm compiler internally interprets an inline assembly label
    as an unsigned long value, not a pointer. As a result, under
    CONFIG_FORTIFY_SOURCE, the address of a label has a size of 4 bytes,
    which was tripping the runtime checks. Instead, we can just cast the label
    (as done with the size calculations earlier).

    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1639397

    Reported-by: William Cohen <wcohen@redhat.com>
    Fixes: 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions")
    Cc: stable@vger.kernel.org
    Acked-by: Laura Abbott <labbott@redhat.com>
    Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
    Tested-by: William Cohen <wcohen@redhat.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

It was introduced in the 4.20 cycle and was also backported to the 4.19
and 4.14 branches.

This issue is fixed upstream by [1], which is present in the 5.0 kernel
release.

[1] 0ac569bf6a7983c0c5747d6df8db9dc05bc92b6c

The fix was backported to the 4.20, 4.19 and 4.14 branches.
It is included starting at:
    v5.0.0
    v4.20.13
    v4.19.26
    v4.14.104

Fixes #1174

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 14 Feb 2019 16:40:50 +0000 (11:40 -0500)] 
Cleanup: tp mempool: Remove logically dead code

Found by Coverity:
CID 1391045 (#1 of 1): Logically dead code (DEADCODE)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 10 Jan 2019 19:56:15 +0000 (14:56 -0500)] 
Fix: btrfs: Remove fsid/metadata_fsid fields from btrfs_info

Introduced in v5.0.

See upstream commit :

  commit de37aa513105f864d3c21105bf5542d498f21ca2
  Author: Nikolay Borisov <nborisov@suse.com>
  Date:   Tue Oct 30 16:43:24 2018 +0200

    btrfs: Remove fsid/metadata_fsid fields from btrfs_info

    Currently btrfs_fs_info structure contains a copy of the
    fsid/metadata_uuid fields. Same values are also contained in the
    btrfs_fs_devices structure which fs_info has a reference to. Let's
    reduce duplication by removing the fields from fs_info and always refer
    to the ones in fs_devices. No functional changes.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Wed, 9 Jan 2019 19:59:17 +0000 (14:59 -0500)] 
Fix: SUNRPC: Simplify defining common RPC trace events (v5.0)

See upstream commit :

  commit dc5820bd21d84ee34770b0a1e2fca9378f8f7456
  Author: Chuck Lever <chuck.lever@oracle.com>
  Date:   Wed Dec 19 11:00:16 2018 -0500

    SUNRPC: Simplify defining common RPC trace events

    Clean up, no functional change is expected.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Wed, 9 Jan 2019 19:59:16 +0000 (14:59 -0500)] 
Fix: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid

Introduced in v3.12.

See upstream commit :

  commit 92cb6c5be8134db6f7c38f25f6afd13e444cebaf
  Author: Trond Myklebust <Trond.Myklebust@netapp.com>
  Date:   Wed Sep 4 22:09:50 2013 -0400

    SUNRPC: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid

    Instead of the pointer values, use the task and client identifier values
    for tracing purposes.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Wed, 9 Jan 2019 19:59:15 +0000 (14:59 -0500)] 
Fix: Remove 'type' argument from access_ok() function (v5.0)

See upstream commit :

  commit 96d4f267e40f9509e8a66e2b39e8b95655617693
  Author: Linus Torvalds <torvalds@linux-foundation.org>
  Date:   Thu Jan 3 18:57:57 2019 -0800

    Remove 'type' argument from access_ok() function

    Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access.  But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model.  And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

     - csky still had the old "verify_area()" name as an alias.

     - the iter_iov code had magical hardcoded knowledge of the actual
       values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
       really used it)

     - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 6 Dec 2018 16:31:51 +0000 (11:31 -0500)] 
Fix: timer instrumentation for RHEL 7.6

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 5 Nov 2018 16:35:54 +0000 (11:35 -0500)] 
Fix: ext4: adjust reserved cluster count when removing extents (v4.20)

See upstream commit :

  commit 9fe671496b6c286f9033aedfc1718d67721da0ae
  Author: Eric Whitney <enwlinux@gmail.com>
  Date:   Mon Oct 1 14:25:08 2018 -0400

    ext4: adjust reserved cluster count when removing extents

    Modify ext4_ext_remove_space() and the code it calls to correct the
    reserved cluster count for pending reservations (delayed allocated
    clusters shared with allocated blocks) when a block range is removed
    from the extent tree.  Pending reservations may be found for the clusters
    at the ends of written or unwritten extents when a block range is removed.
    If a physical cluster at the end of an extent is freed, it's necessary
    to increment the reserved cluster count to maintain correct accounting
    if the corresponding logical cluster is shared with at least one
    delayed and unwritten extent as found in the extents status tree.

    Add a new function, ext4_rereserve_cluster(), to reapply a reservation
    on a delayed allocated cluster sharing blocks with a freed allocated
    cluster.  To avoid ENOSPC on reservation, a flag is applied to
    ext4_free_blocks() to briefly defer updating the freeclusters counter
    when an allocated cluster is freed.  This prevents another thread
    from allocating the freed block before the reservation can be reapplied.

    Redefine the partial cluster object as a struct to carry more state
    information and to clarify the code using it.

    Adjust the conditional code structure in ext4_ext_remove_space to
    reduce the indentation level in the main body of the code to improve
    readability.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 5 Nov 2018 16:35:53 +0000 (11:35 -0500)] 
Fix: signal: Remove SEND_SIG_FORCED (v4.20)

See upstream commit :

  commit 4ff4c31a6e85f4c49fbeebeaa28018d002884b5a
  Author: Eric W. Biederman <ebiederm@xmission.com>
  Date:   Mon Sep 3 10:39:04 2018 +0200

    signal: Remove SEND_SIG_FORCED

    There are no more users of SEND_SIG_FORCED so it may be safely removed.

    Remove the definition of SEND_SIG_FORCED, its use in is_si_special,
    its use in TP_STORE_SIGINFO, and its use in __send_signal as without
    any users the uses of SEND_SIG_FORCED are now unnecessary.

    This makes the code simpler, easier to understand and use.  Users of
    signal sending functions now no longer need to ask themselves do I
    need to use SEND_SIG_FORCED.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 5 Nov 2018 16:35:52 +0000 (11:35 -0500)] 
Fix: signal: Distinguish between kernel_siginfo and siginfo (v4.20)

See upstream commit :

  commit ae7795bc6187a15ec51cf258abae656a625f9980
  Author: Eric W. Biederman <ebiederm@xmission.com>
  Date:   Tue Sep 25 11:27:20 2018 +0200

    signal: Distinguish between kernel_siginfo and siginfo

    Linus recently observed that if we did not worry about the padding
    member in struct siginfo it is only about 48 bytes, and 48 bytes is
    much nicer than 128 bytes for allocating on the stack and copying
    around in the kernel.

    The obvious thing of only adding the padding when userspace is
    including siginfo.h won't work as there are sigframe definitions in
    the kernel that embed struct siginfo.

    So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
    traditional name for the userspace definition.  While the version that
    is used internally to the kernel and ultimately will not be padded to
    128 bytes is called kernel_siginfo.

    The definition of struct kernel_siginfo I have put in include/signal_types.h

    A set of buildtime checks has been added to verify the two structures have
    the same field offsets.

    To make it easy to verify the change kernel_siginfo retains the same
    size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 1 Nov 2018 22:32:18 +0000 (23:32 +0100)]
Version 2.9.11 (tag: v2.9.11)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 26 Oct 2018 22:01:17 +0000 (18:01 -0400)] 
Fix: update kvm instrumentation for SLES12 SP2 LTSS >= 4.4.121-92.92

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Wed, 24 Oct 2018 19:43:49 +0000 (20:43 +0100)] 
Fix: Add missing const to lttng_tracepoint_ptr_deref prototype

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 12 Oct 2018 18:47:53 +0000 (14:47 -0400)] 
Fix: adapt to kernel relative references

Upstream Linux commit 46e0c9be20 introduces relative references in the
struct tracepoint array of pointers.

Up to and including v4.19-rc7, the upstream kernel has a type mismatch bug
that allows it to pass an out-of-bound end of array to modules
coming/going notifiers.

The fix for upstream Linux is to introduce a new type: tracepoint_ptr_t,
which can be used to adequately iterate on the array. It is introduced
prior to v4.19 as commit 9c0be3f6b5d77 "tracepoint: Fix tracepoint array
element size mismatch".
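
The idea of a relative reference can be sketched in plain userspace C.
This is only an illustration under stated assumptions: the demo_* names
are hypothetical and are not the kernel or LTTng API. Each table slot
stores a 32-bit offset from the slot's own address to its target, so the
array elements are 4 bytes wide rather than pointer-sized, which is why
the array must be iterated with a dedicated element type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct tracepoint. */
struct demo_tracepoint {
	const char *name;
};

/* Hypothetical analogue of tracepoint_ptr_t: a self-relative offset. */
typedef int32_t demo_tracepoint_ptr_t;

/*
 * Targets and table live in one object so the byte distance between
 * them is well-defined and fits in 32 bits.
 */
static struct {
	struct demo_tracepoint tp[2];
	demo_tracepoint_ptr_t table[2];
} demo_section = {
	.tp = { { "demo_event_a" }, { "demo_event_b" } },
};

/* Store the distance from the slot itself to the target. */
static void demo_ptr_set(demo_tracepoint_ptr_t *slot,
		struct demo_tracepoint *tp)
{
	ptrdiff_t delta = (char *) tp - (char *) slot;

	assert(delta >= INT32_MIN && delta <= INT32_MAX);
	*slot = (demo_tracepoint_ptr_t) delta;
}

/* Recover the target by adding the stored offset to the slot address. */
static struct demo_tracepoint *demo_ptr_deref(demo_tracepoint_ptr_t *slot)
{
	return (struct demo_tracepoint *) ((char *) slot + *slot);
}

/* Fill every slot of the table with the offset to its target. */
static void demo_section_init(void)
{
	size_t i;

	for (i = 0; i < 2; i++)
		demo_ptr_set(&demo_section.table[i], &demo_section.tp[i]);
}
```

Walking demo_section.table with a pointer-sized stride would step over
two slots at a time on a 64-bit machine, which is the kind of mismatch
a dedicated element type avoids.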

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Francis Deslauriers [Tue, 16 Oct 2018 19:23:22 +0000 (15:23 -0400)] 
Fix: implicit declarations caused by buffer size checks.

Issue
=====
Three kernel functions used in the following commit are unavailable on
some supported kernels:

commit 1f0ab1eb0409d23de5f67cc588c3ea4cee4d10e0
Prevent allocation of buffers if exceeding available memory

* si_mem_available() was added in kernel 4.6 with commit d02bd27.
* {set, clear}_current_oom_origin() were added in kernel 3.8 with commit:
  e1e12d2f

Solution
========
Add wrappers around these functions such that older kernels will build
with these functions defined as NOP or trivial return value.

wrapper_check_enough_free_pages() uses the si_mem_available() kernel
function to check whether the number of pages requested, passed as a
parameter, is smaller than the number of pages available on the machine.
If the si_mem_available() kernel function is unavailable, we always
return true.

wrapper_set_current_oom_origin() function wraps the
set_current_oom_origin() kernel function when it is available.
If set_current_oom_origin() is unavailable the wrapper is empty.

wrapper_clear_current_oom_origin() function wraps the
clear_current_oom_origin() kernel function when it is available.
If clear_current_oom_origin() is unavailable the wrapper is empty.
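
The wrapper pattern described above can be sketched as follows. This is
a userspace approximation, not the actual patch: the DEMO_* macros and
demo_* functions are hypothetical stand-ins for the kernel version
macros, si_mem_available() and the OOM-origin helpers, and the default
version (4.4) is an assumption chosen so that one wrapper takes the
fallback path and the others take the real path.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for KERNEL_VERSION() and LINUX_VERSION_CODE. */
#define DEMO_KERNEL_VERSION(a, b, c)	(((a) << 16) + ((b) << 8) + (c))
#ifndef DEMO_LINUX_VERSION_CODE
#define DEMO_LINUX_VERSION_CODE		DEMO_KERNEL_VERSION(4, 4, 0)
#endif

/* Stands in for the per-task OOM-origin flag. */
static bool demo_oom_origin;

#if DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(4, 6, 0)
/* Mock of si_mem_available(): a fixed estimate for the demo. */
static long demo_si_mem_available(void)
{
	return 1024;
}

static bool demo_wrapper_check_enough_free_pages(unsigned long num_pages)
{
	return num_pages < (unsigned long) demo_si_mem_available();
}
#else
/* No estimate available on this kernel: always report enough pages. */
static bool demo_wrapper_check_enough_free_pages(unsigned long num_pages)
{
	(void) num_pages;
	return true;
}
#endif

#if DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(3, 8, 0)
static void demo_wrapper_set_current_oom_origin(void)
{
	demo_oom_origin = true;
}

static void demo_wrapper_clear_current_oom_origin(void)
{
	demo_oom_origin = false;
}
#else
/* Helpers unavailable on this kernel: the wrappers are empty. */
static void demo_wrapper_set_current_oom_origin(void)
{
}

static void demo_wrapper_clear_current_oom_origin(void)
{
}
#endif

/* Sketch of a caller: check, mark the task, allocate, then unmark. */
static int demo_alloc_buffers(unsigned long num_pages)
{
	if (!demo_wrapper_check_enough_free_pages(num_pages))
		return -1;
	demo_wrapper_set_current_oom_origin();
	/* ... the allocation itself would happen here ... */
	demo_wrapper_clear_current_oom_origin();
	assert(!demo_oom_origin);
	return 0;
}
```

Callers only ever use the wrapper names, so the version checks stay in
one place instead of being repeated at every call site.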

Drawbacks
=========
None.

Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Francis Deslauriers [Thu, 11 Oct 2018 21:37:00 +0000 (17:37 -0400)] 
Prevent allocation of buffers if exceeding available memory

Issue
=====
The running system can be rendered unusable by creating channel
buffers larger than the available memory of the system, resulting in
random processes being killed by the OOM-killer.

These simple commands trigger the crash on my laptop with 15G of RAM:
  lttng create
  lttng enable-channel -k --subbuf-size=16G --num-subbuf=1 chan0

Note that the subbuf-size * num-subbuf is larger than the physical
memory.

Solution
========
Get an estimate of the number of available pages and return ENOMEM if
there are not enough pages to cover the needs of the caller. Also, mark
the calling user thread as the first target for the OOM killer in case
the estimate of available pages was wrong.

This greatly reduces the attack surface of this issue as well as reducing
its potential impact.

This approach is inspired by the one taken by the Linux kernel
trace ring buffer[1].

Drawback
========
This approach is imperfect because it's based on an estimate.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/trace/ring_buffer.c#n1172

Signed-off-by: Francis Deslauriers <francis.deslauriers@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Wed, 10 Oct 2018 18:17:46 +0000 (14:17 -0400)] 
Fix: Convert rcu tracepoints to gp_seq (v4.19)

See upstream commits :

  commit 477351f7829d2268769c5d545511081555066529
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 12:54:11 2018 -0700

    rcu: Convert rcu_grace_period tracepoint to gp_seq

    This commit makes the rcu_grace_period tracepoint use gp_seq instead
    of ->gpnum or ->completed.  It also introduces a "cpuofl-bgp" string to
    less obscurely indicate when a CPU has gone offline while a grace period
    is waiting on it.

  commit 63d86a7e85f84b8ac3b2f394570965aedbb03787
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 13:08:46 2018 -0700

    rcu: Convert rcu_grace_period_init tracepoint to gp_seq

    This commit makes the rcu_grace_period_init tracepoint use gp_seq instead
    of ->gpnum.

  commit 598ce09480efb6b48799df60c66bac70bea5ef54
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 13:35:20 2018 -0700

    rcu: Convert rcu_preempt_task tracepoint to ->gp_seq

    This commit makes the rcu_preempt_task tracepoint use ->gp_seq instead
    of ->gpnum.

  commit 865aa1e08d8aefdfd1f5d30ecfce1b8ef8cd520a
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 13:35:20 2018 -0700

    rcu: Convert rcu_unlock_preempted_task tracepoint to ->gp_seq

    This commit makes the rcu_unlock_preempted_task tracepoint use ->gp_seq
    instead of ->gpnum.

  commit db023296f0115d2fe01fdabad54678f2b806da23
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 13:35:20 2018 -0700

    rcu: Convert rcu_quiescent_state_report tracepoint to ->gp_seq

    This commit makes the rcu_quiescent_state_report tracepoint use ->gp_seq
    instead of ->gpnum.

  commit fee5997c17562e95fb1fecc142efb2da0934baa4
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Tue May 1 13:35:20 2018 -0700

    rcu: Convert rcu_fqs tracepoint to ->gp_seq

    This commit makes the rcu_fqs tracepoint use ->gp_seq instead of ->gpnum.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 7 Sep 2018 16:21:12 +0000 (12:21 -0400)] 
Fix: net: expose sk wmem in sock_exceed_buf_limit tracepoint (4.19)

See upstream commit:

  commit d6f19938eb031ee2158272757db33258153ae59c
  Author: Yafang Shao <laoar.shao@gmail.com>
  Date:   Sun Jul 1 23:31:30 2018 +0800

    net: expose sk wmem in sock_exceed_buf_limit tracepoint

    Currently trace_sock_exceed_buf_limit() only show rmem info,
    but wmem limit may also be hit.
    So expose wmem info in this tracepoint as well.

    Regarding memcg, I think it is better to introduce a new tracepoint(if
    that is needed), i.e. trace_memcg_limit_hit other than show memcg info in
    trace_sock_exceed_buf_limit.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Jonathan Rajotte [Wed, 19 Sep 2018 21:48:49 +0000 (17:48 -0400)] 
Fix: access migrate_disable field directly

For stable real-time kernels > 4.9, the __migrate_disabled utility symbol
is not always exported. This can result in linking problems at build time
and at runtime, preventing the loading of the tracer.

The problem was reported to the RT community. [1] [2]

A solution is to access the field directly instead of using the
utility wrapper.

It is important to note that the field is now available for other
configurations than CONFIG_PREEMPT_RT_FULL. For now, we choose to
expose the migratable context only for configurations where
CONFIG_PREEMPT_RT_FULL is set.

Based on the configuration dependency of the kernels, selecting
CONFIG_PREEMPT_RT_FULL ensures the presence of the migrate_disable
field.

Initial bug report [3].

[1] https://marc.info/?l=linux-rt-users&m=153730414126984&w=2
[2] https://marc.info/?l=linux-rt-users&m=153729444223779&w=2
[3] https://lists.lttng.org/pipermail/lttng-dev/2018-September/028216.html

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Fri, 7 Sep 2018 21:55:32 +0000 (17:55 -0400)] 
Fix: out of memory error handling

CPU hotplug handles teardown on failure to complete adding an instance
of CPU hotplug. Trying to remove after a failed "add" on that instance
triggers a NULL pointer dereference OOPS.

Fixes: #1167
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Mathieu Desnoyers [Thu, 9 Aug 2018 19:22:26 +0000 (15:22 -0400)]
Version 2.9.10 (tag: v2.9.10)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 9 Aug 2018 15:56:56 +0000 (11:56 -0400)] 
Fix: adjust SLE version ranges to build with SP2 and SP3

The early kernel versions of SuSE 12 SP3 overlap with the range from the
later SP2 kernels but are from a different source tree. This patch adds
specific ranges for the SP3 kernels that overlap and allows compatibility
with both SP2 and SP3 kernels.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 9 Aug 2018 15:56:55 +0000 (11:56 -0400)] 
Fix: Allow alphanumeric characters in SLE version

Allow alphanumeric characters in the long version string before
extracting specific version numbers. This prevents failure in detecting
a SuSE kernel when the version string was customized by the end user.
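
As an illustration of the idea only: the sketch below extracts the
numeric fields from a release string while ignoring whatever suffix
follows them. The "a.b.c-d.e-suffix" layout, the struct and the function
name are assumptions made for this example; the actual detection in this
repository is not shown here and may work differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical container for the numeric parts of a release string. */
struct demo_sle_version {
	unsigned int major, minor, patch;
	unsigned int rel_major, rel_minor;
};

/*
 * Parse the leading "a.b.c-d.e" fields of a release string. Anything
 * after them, such as "-default" or a user-chosen alphanumeric suffix,
 * is ignored. Returns 0 on success and -1 if the prefix is missing.
 */
static int demo_parse_sle_version(const char *release,
		struct demo_sle_version *v)
{
	int fields;

	assert(release != NULL && v != NULL);
	fields = sscanf(release, "%u.%u.%u-%u.%u",
			&v->major, &v->minor, &v->patch,
			&v->rel_major, &v->rel_minor);
	return fields == 5 ? 0 : -1;
}
```

Because only the numeric prefix is matched, a customized suffix does not
prevent the version from being recognized.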

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 2 Aug 2018 18:34:30 +0000 (14:34 -0400)] 
Fix: Adjust range for SuSE 4.4.103-92 kernels

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 29 Jun 2018 21:28:30 +0000 (17:28 -0400)] 
Add extra version information framework

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Jun 2018 18:53:19 +0000 (14:53 -0400)] 
Fix: btrfs: Remove unnecessary fs_info parameter

See upstream commit:

  commit 3dca5c942dac60164e6a6e89172f25b86af07ce7
  Author: Qu Wenruo <wqu@suse.com>
  Date:   Thu Apr 26 14:24:25 2018 +0800

    btrfs: trace: Remove unnecessary fs_info parameter for btrfs__reserve_extent event class

    fs_info can be extracted from btrfs_block_group_cache, and all
    btrfs_block_group_cache is created by btrfs_create_block_group_cache()
    with fs_info initialized, no need to worry about NULL pointer
    dereference.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Jun 2018 18:53:17 +0000 (14:53 -0400)] 
Fix: asoc: Remove snd_soc_cache_sync() implementation

See upstream commit:

  commit 427d204c86e095bb91eb8af381bd90a48376a860
  Author: Lars-Peter Clausen <lars@metafoo.de>
  Date:   Sat Nov 8 16:38:07 2014 +0100

    ASoC: Remove snd_soc_cache_sync() implementation

    This function has no more non regmap user, which means we can remove the
    implementation of the function and associated functions and structure
    fields.

    For convenience we keep a static inline version of the function that
    forwards calls to regcache_sync() unconditionally.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Jun 2018 18:53:16 +0000 (14:53 -0400)] 
Fix: asoc: fix printing jack name

See upstream commit:

  commit f4833a519aec793cf8349bf479589d37473ef6a7
  Author: Arnd Bergmann <arnd@arndb.de>
  Date:   Wed Feb 24 17:38:14 2016 +0100

    ASoC: trace: fix printing jack name

    After a change to the snd_jack structure, the 'name' member
    is no longer available in all configurations, which results in a
    build failure in the tracing code:

    include/trace/events/asoc.h: In function 'trace_event_raw_event_snd_soc_jack_report':
    include/trace/events/asoc.h:240:32: error: 'struct snd_jack' has no member named 'name'

    The name field is normally initialized from the card shortname and
    the jack "id" field:

            snprintf(jack->name, sizeof(jack->name), "%s %s",
                     card->shortname, jack->id);

    This changes the tracing output to just contain the 'id' by
    itself, which slightly changes the output format but avoids the
    link error and is hopefully still enough to see what is going on.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Jun 2018 18:53:15 +0000 (14:53 -0400)] 
Fix: asoc: Consolidate path trace events

See upstream commit:

  commit 6e588a0d839b51bae49852b68740a25cacc91978
  Author: Lars-Peter Clausen <lars@metafoo.de>
  Date:   Tue Aug 11 21:38:01 2015 +0200

    ASoC: dapm: Consolidate path trace events

    The snd_soc_dapm_input_path and snd_soc_dapm_output_path trace events are
    identical except for the direction. Instead of having two events have a
    single one that has a field that contains the direction.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Mon, 18 Jun 2018 18:53:14 +0000 (14:53 -0400)] 
Fix: ASoC level IO tracing removed upstream

Removed in v3.16.

See upstream commit:

  Author: Lars-Peter Clausen <lars@metafoo.de>
  Date:   Tue Apr 22 13:23:17 2014 +0200

    ASoC: Remove ASoC level IO tracing

    The ASoC framework is in the process of migrating all IO operations to regmap.
    regmap has its own more sophisticated tracing infrastructure for IO operations,
    which means that the ASoC level IO tracing becomes redundant, hence this patch
    removes them. There are still a handful of ASoC drivers left that do not use
    regmap yet, but hopefully the removal of the ASoC IO tracing will be an
    additional incentive to switch to regmap.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 7 Jun 2018 19:32:49 +0000 (15:32 -0400)] 
Fix: dyntick field added to trace_rcu_dyntick in v4.16

See upstream commit:

  commit dec98900eae1e22467182e58688abe5fae98bd5f
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Wed Oct 4 16:24:29 2017 -0700

    rcu: Add ->dynticks field to rcu_dyntick trace event

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 7 Jun 2018 16:24:28 +0000 (12:24 -0400)] 
Fix: BUILD_BUG_ON with compile time constant on < v2.6.38

See upstream commits :

  commit 8c87df457cb58fe75b9b893007917cf8095660a0
  Author: Jan Beulich <JBeulich@novell.com>
  Date:   Tue Sep 22 16:43:52 2009 -0700

    BUILD_BUG_ON(): fix it and a couple of bogus uses of it

    gcc permitting variable length arrays makes the current construct used for
    BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
    controlling expression isn't really constant.  Instead, this patch makes
    it so that a bit field gets used here.  Consequently, those uses where the
    condition isn't really constant now also need fixing.

    Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
    MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
    the expression is compile time constant (__builtin_constant_p() yields
    true), the array is still deemed of variable length by gcc, and hence the
    whole expression doesn't have the intended effect.

  commit 7ef88ad561457c0346355dfd1f53e503ddfde719
  Author: Rusty Russell <rusty@rustcorp.com.au>
  Date:   Mon Jan 24 14:45:10 2011 -0600

    BUILD_BUG_ON: make it handle more cases

    BUILD_BUG_ON used to use the optimizer to do code elimination or fail
    at link time; it was changed to first the size of a negative array (a
    nicer compile time error), then (in
    8c87df457cb58fe75b9b893007917cf8095660a0) to a bitfield.

    This forced us to change some non-constant cases to MAYBE_BUILD_BUG_ON();
    as Jan points out in that commit, it didn't work as intended anyway.

    bitfields: needs a literal constant at parse time, and can't be put under
            "if (__builtin_constant_p(x))" for example.
    negative array: can handle anything, but if the compiler can't tell it's
            a constant, silently has no effect.
    link time: breaks link if the compiler can't determine the value, but the
            linker output is not usually as informative as a compiler error.

    If we use the negative-array-size method *and* the link time trick,
    we get the ability to use BUILD_BUG_ON() under __builtin_constant_p()
    branches, and maximal ability for the compiler to detect errors at
    build time.

    We also document it thoroughly.
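
The negative-array-size method discussed in the quoted commit can be
sketched in plain userspace C. This is only an illustration, not the
kernel's current macro: MY_BUILD_BUG_ON() and struct sample_header are
hypothetical names introduced for this example.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for BUILD_BUG_ON(): when cond is a non-zero
 * compile-time constant, the array size becomes -1 and the build fails.
 */
#define MY_BUILD_BUG_ON(cond) ((void) sizeof(char[1 - 2 * !!(cond)]))

/* Hypothetical structure whose size must stay within 8 bytes. */
struct sample_header {
	unsigned int id;
	unsigned int len;
};

/* Returns the header size; the check adds no run-time cost. */
static size_t sample_header_size(void)
{
	/* Refuses to compile if the layout ever grows past 8 bytes. */
	MY_BUILD_BUG_ON(sizeof(struct sample_header) > 8);
	return sizeof(struct sample_header);
}
```

As the quoted message points out, this form silently has no effect when
the compiler cannot tell that the condition is constant, which is why
the upstream change combines it with the link-time trick.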

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: pid tracker should track "pgid" for noargs probes
Mathieu Desnoyers [Tue, 15 May 2018 21:51:24 +0000 (17:51 -0400)] 
Fix: pid tracker should track "pgid" for noargs probes

The "pid" notion exposed by LTTng translates to the "pgid" notion in the
Linux kernel. Therefore using "current->pid" as argument to the PID
tracker actually ends up behaving as a "tid" tracker, which does not
match the intent nor the user-space tracer behavior.

The probes taking arguments were fixed by a prior commit, but it missed
probes without arguments.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoVersion 2.9.9 v2.9.9
Mathieu Desnoyers [Wed, 9 May 2018 18:12:56 +0000 (14:12 -0400)] 
Version 2.9.9

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update RCU instrumentation for 4.17
Mathieu Desnoyers [Tue, 1 May 2018 20:42:44 +0000 (16:42 -0400)] 
Fix: update RCU instrumentation for 4.17

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: sunrpc instrumentation for 4.17
Michael Jeanson [Tue, 17 Apr 2018 15:07:47 +0000 (11:07 -0400)] 
Fix: sunrpc instrumentation for 4.17

See upstream commit:

  commit e671edb9428c8a61662aaf8c39f5edced7cc45c7
  Author: Chuck Lever <chuck.lever@oracle.com>
  Date:   Fri Mar 16 10:33:44 2018 -0400

    sunrpc: Simplify synopsis of some trace points

    Clean up: struct rpc_task carries a pointer to a struct rpc_clnt,
    and in fact task->tk_client is always what is passed into trace
    points that are already passing @task.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: use struct reclaim_stat in mm_vmscan_lru_shrink_inactive for 4.17
Michael Jeanson [Tue, 17 Apr 2018 15:07:46 +0000 (11:07 -0400)] 
Fix: use struct reclaim_stat in mm_vmscan_lru_shrink_inactive for 4.17

See upstream commit:

  commit d51d1e64500fcb48fc6a18c77c965b8f48a175f2
  Author: Steven Rostedt <rostedt@goodmis.org>
  Date:   Tue Apr 10 16:28:07 2018 -0700

    mm, vmscan, tracing: use pointer to reclaim_stat struct in trace event

    The trace event trace_mm_vmscan_lru_shrink_inactive() currently has 12
    parameters! Seven of them are from the reclaim_stat structure.  This
    structure is currently local to mm/vmscan.c.  By moving it to the global
    vmstat.h header, we can also reference it from the vmscan tracepoints.
    In moving it, it brings down the overhead of passing so many arguments
    to the trace event.  In the future, we may limit the number of arguments
    that a trace event may pass (ideally just 6, but more realistically it
    may be 8).

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: Add gfp_flags arg to mm_vmscan_kswapd_wake for 4.17
Michael Jeanson [Tue, 17 Apr 2018 15:07:45 +0000 (11:07 -0400)] 
Fix: Add gfp_flags arg to mm_vmscan_kswapd_wake for 4.17

See upstream commit:

  commit 5ecd9d403ad081ed2de7b118c1e96124d4e0ba6c
  Author: David Rientjes <rientjes@google.com>
  Date:   Thu Apr 5 16:25:16 2018 -0700

    mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory

    Kswapd will not wakeup if per-zone watermarks are not failing or if too
    many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available.  For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction.  If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim.  This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate: kvm instrumentation for ubuntu 4.13.0-38
Khalid Elmously [Sun, 25 Mar 2018 15:06:03 +0000 (11:06 -0400)] 
Update: kvm instrumentation for ubuntu 4.13.0-38

Starting from 4.13.0-38, the Ubuntu kernel backports a kvm
instrumentation change introduced in 4.15, which affects the prototype
of the kvm_mmio event.

Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update kvm instrumentation for Ubuntu 3.13.0-144
Michael Jeanson [Fri, 23 Mar 2018 15:41:46 +0000 (11:41 -0400)] 
Fix: update kvm instrumentation for Ubuntu 3.13.0-144

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: btrfs instrumentation namespacing
Mathieu Desnoyers [Thu, 22 Mar 2018 21:33:32 +0000 (17:33 -0400)] 
Fix: btrfs instrumentation namespacing

Trips this warning:

[  122.301894] WARNING: CPU: 6 PID: 1654 at /home/efficios/git/lttng-modules/lttng-probes.c:99 fixup_lazy_probes+0x195/0x200 [lttng_tracer]
[  122.304974] Modules linked in: lttng_probe_compaction(O+) lttng_probe_btrfs(O) lttng_probe_block(O) lttng_ring_buffer_metadata_mmap_client(O) lttng_ring_buffer_client_mmap_overwrite(O) lttng_ring_buffer_client_mmap_discard(O) lttng_ring_buffer_metadata_client(O) lttng_ring_buffer_client_overwrite(O) lttng_ring_buffer_client_discard(O) lttng_tracer(O) lttng_statedump(O) lttng_ftrace(O) lttng_kprobes(O) lttng_clock(O) lttng_lib_ring_buffer(O) lttng_kretprobes(O)
[  122.314772] CPU: 6 PID: 1654 Comm: modprobe Tainted: G           O     4.16.0-rc6+ #54
[  122.316738] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  122.320280] RIP: 0010:fixup_lazy_probes+0x195/0x200 [lttng_tracer]
[  122.321825] RSP: 0018:ffffc90008467ca0 EFLAGS: 00010286
[  122.323137] RAX: 00000000ffffffff RBX: ffffffffa01e7000 RCX: 0000000000000061
[  122.324847] RDX: 0000000000000005 RSI: ffffffffa01e21ac RDI: ffffffffa01e233b
[  122.326528] RBP: ffffffffa017f078 R08: 0000000000000062 R09: 0000000000000345
[  122.328154] R10: 0000000000000000 R11: ffffc90008467a28 R12: 0000000000000005
[  122.329791] R13: 0000000000000010 R14: 0000000000000010 R15: 0000000000000006
[  122.331410] FS:  00007f6c8d9a7740(0000) GS:ffff880c0fb80000(0000) knlGS:0000000000000000
[  122.333323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  122.334673] CR2: 00007ffcc9698ff8 CR3: 0000000c0afae004 CR4: 00000000001606e0
[  122.336300] Call Trace:
[  122.337011]  ? __event_probe__compaction_migratepages+0x250/0x250 [lttng_probe_compaction]
[  122.338901]  lttng_get_probe_list_head.part.2+0x19/0x20 [lttng_tracer]
[  122.340349]  lttng_probe_register+0xd5/0xe0 [lttng_tracer]
[  122.341607]  ? __event_probe__compaction_migratepages+0x250/0x250 [lttng_probe_compaction]
[  122.343453]  do_one_initcall+0x3d/0x16e
[  122.344383]  ? _cond_resched+0x15/0x30
[  122.345323]  ? kmem_cache_alloc_trace+0xe1/0x1b0
[  122.346394]  ? do_init_module+0x22/0x20c
[  122.347329]  do_init_module+0x5a/0x20c
[  122.350037]  load_module+0x244f/0x2980
[  122.350958]  ? m_show+0x190/0x190
[  122.351774]  ? security_capable+0x41/0x60
[  122.352723]  SYSC_finit_module+0x80/0xb0
[  122.353716]  do_syscall_64+0x76/0x1a0
[  122.354565]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  122.355669] RIP: 0033:0x7f6c8d4c73c9
[  122.356502] RSP: 002b:00007ffcc969c248 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[  122.358209] RAX: ffffffffffffffda RBX: 000055763df4fee9 RCX: 00007f6c8d4c73c9
[  122.359684] RDX: 0000000000000000 RSI: 000055763df4fee9 RDI: 0000000000000004
[  122.361182] RBP: 0000000000000000 R08: 0000000000000000 R09: 000055763f39a450
[  122.362663] R10: 0000000000000004 R11: 0000000000000206 R12: 000055763f392400
[  122.364144] R13: 000055763f396cb0 R14: 000055763f3925a0 R15: 0000000000040000
[  122.365690] Code: 25 14 a0 4a 8b 04 f0 48 8b 30 31 c0 e8 25 3b 10 e1 48 8b 43 08 48 8b 33 4c 89 e2 4a 8b 04 f0 48 8b 38 e8 9f b7 b1 e1 85 c0 74 07 <0f> 0b e9 b3 fe ff ff 48 c7 c7 16 26 14 a0 e8 f8 3a 10 e1 48 8b
[  122.369348] ---[ end trace 15840f1166edf835 ]---

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoCleanup: comment about CONFIG_HOTPLUG_CPU ifdef
Michael Jeanson [Tue, 13 Mar 2018 16:14:43 +0000 (12:14 -0400)] 
Cleanup: comment about CONFIG_HOTPLUG_CPU ifdef

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: do not use CONFIG_HOTPLUG_CPU for the new hotplug API
Lars Persson [Sun, 11 Mar 2018 14:02:43 +0000 (15:02 +0100)] 
Fix: do not use CONFIG_HOTPLUG_CPU for the new hotplug API

Kernel configurations without CONFIG_HOTPLUG_CPU throw an unknown
symbol error when attempting to insert the lttng-tracer module:
 lttng_tracer: Unknown symbol lttng_hp_prepare (err 0)
 lttng_tracer: Unknown symbol lttng_hp_online (err 0)

This was caused by lttng-events and lttng-context-perf-counter not
agreeing on which preprocessor condition should guard the use of the
hotplug API. In fact, the API is also available on kernels built
without CONFIG_HOTPLUG_CPU.

Signed-off-by: Lars Persson <larper@axis.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update kvm instrumentation for 4.1.50+
Michael Jeanson [Thu, 8 Mar 2018 16:18:56 +0000 (11:18 -0500)] 
Fix: update kvm instrumentation for 4.1.50+

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUse the memory pool instead of kmalloc
Julien Desfossez [Fri, 23 Feb 2018 16:37:11 +0000 (11:37 -0500)] 
Use the memory pool instead of kmalloc

Replace the use of kmalloc/kfree in the tracepoint probes that need
dynamic allocation with the tracepoint memory pool alloc/free.

Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoCreate a memory pool for temporary tracepoint probes storage
Julien Desfossez [Fri, 23 Feb 2018 16:37:10 +0000 (11:37 -0500)] 
Create a memory pool for temporary tracepoint probes storage

This memory pool is created when the lttng-tracer module is loaded. It
allocates 4 buffers of 4k on each CPU. These buffers are designed to
allow tracepoint probes to temporarily store data that does not fit on
the stack (during the code_pre and code_post phases). The memory is
freed when the lttng-tracer module is unloaded.

This removes the need for dynamic allocation during the execution of
tracepoint probes, which does not behave well on PREEMPT_RT kernels,
even when invoked with the GFP_ATOMIC | GFP_NOWAIT flags.
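
The idea can be sketched in userspace C as a pool that is allocated once
and then only hands out preallocated buffers. This is a simplified
illustration, not the module's code: the function names, the fixed
POOL_NR_CPUS value and the in_use flag are assumptions, and the sketch
ignores concurrency entirely. Only the "4 buffers of 4k per CPU" sizing
comes from the description above.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_NR_CPUS		4	/* assumption: fixed CPU count for the sketch */
#define POOL_BUFS_PER_CPU	4	/* from the commit message */
#define POOL_BUF_SIZE		4096	/* from the commit message: 4k buffers */

struct pool_buf {
	int in_use;
	char data[POOL_BUF_SIZE];
};

static struct pool_buf *pool;	/* POOL_NR_CPUS * POOL_BUFS_PER_CPU entries */

/* Allocate every buffer up front (module load in the real tracer). */
static int pool_init(void)
{
	pool = calloc(POOL_NR_CPUS * POOL_BUFS_PER_CPU, sizeof(*pool));
	return pool ? 0 : -1;
}

/* Hand out a free buffer owned by cpu, or NULL if none is available. */
static void *pool_alloc(int cpu, size_t len)
{
	int i;

	if (!pool || cpu < 0 || cpu >= POOL_NR_CPUS || len > POOL_BUF_SIZE)
		return NULL;
	for (i = 0; i < POOL_BUFS_PER_CPU; i++) {
		struct pool_buf *buf = &pool[cpu * POOL_BUFS_PER_CPU + i];

		if (!buf->in_use) {
			buf->in_use = 1;
			return buf->data;
		}
	}
	return NULL;
}

/* Give a buffer obtained from pool_alloc() back to its pool. */
static void pool_free(void *ptr)
{
	struct pool_buf *buf;

	if (!ptr)
		return;
	buf = (struct pool_buf *) ((char *) ptr - offsetof(struct pool_buf, data));
	buf->in_use = 0;
}

/* Release all buffers (module unload in the real tracer). */
static void pool_exit(void)
{
	free(pool);
	pool = NULL;
}
```

A probe would call pool_alloc() in its pre phase and pool_free() in its
post phase, so no allocator call happens while the probe runs.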

Signed-off-by: Julien Desfossez <jdesfossez@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: use proper pid_ns in the process statedump
Michael Jeanson [Wed, 21 Feb 2018 21:36:17 +0000 (16:36 -0500)] 
Fix: use proper pid_ns in the process statedump

The pid_ns we currently use from the nsproxy struct is not the task's
pid_ns but the one that children of this task will use.

As stated in include/linux/nsproxy.h :

  The pid namespace is an exception -- it's accessed using
  task_active_pid_ns.  The pid namespace here is the
  namespace that children will use.

While the two are the same most of the time, the nsproxy value reports
incorrect information in some situations. Switching to the task's own
pid_ns also simplifies the code and removes kernel version checks.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: add variable quoting to shell scripts
Michael Jeanson [Tue, 20 Feb 2018 17:16:25 +0000 (12:16 -0500)] 
Fix: add variable quoting to shell scripts

Prevent errors if a path contains spaces.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate: kvm instrumentation for fedora 4.14.13-300
Michael Jeanson [Tue, 20 Feb 2018 17:10:05 +0000 (12:10 -0500)] 
Update: kvm instrumentation for fedora 4.14.13-300

Starting from 4.14.13-300, the Fedora kernel backports a kvm
instrumentation change introduced in 4.15, which affects the prototype
of the kvm_mmio event.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: Add Fedora version macros
Loïc Gelle [Tue, 20 Feb 2018 17:10:04 +0000 (12:10 -0500)] 
Fix: Add Fedora version macros

Signed-off-by: Loïc Gelle <loic.gelle@polymtl.ca>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update btrfs instrumentation for SuSE 4.4.114-92
Michael Jeanson [Tue, 13 Feb 2018 20:23:51 +0000 (15:23 -0500)] 
Fix: update btrfs instrumentation for SuSE 4.4.114-92

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update block instrumentation for SuSE 4.4.114-92
Michael Jeanson [Tue, 13 Feb 2018 20:23:50 +0000 (15:23 -0500)] 
Fix: update block instrumentation for SuSE 4.4.114-92

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update rcu instrumentation for v4.16
Michael Jeanson [Mon, 12 Feb 2018 17:32:25 +0000 (18:32 +0100)] 
Fix: update rcu instrumentation for v4.16

See upstream commits :

  commit dec98900eae1e22467182e58688abe5fae98bd5f
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Wed Oct 4 16:24:29 2017 -0700

    rcu: Add ->dynticks field to rcu_dyntick trace event

  commit 84585aa8b6ad24e5bdfba9db4a320a6aeed192ab
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Wed Oct 4 15:55:16 2017 -0700

    rcu: Shrink ->dynticks_{nmi_,}nesting from long long to long

    Because the ->dynticks_nesting field now only contains the process-based
    nesting level instead of a value encoding both the process nesting level
    and the irq "nesting" level, we no longer need a long long, even on
    32-bit systems.  This commit therefore changes both the ->dynticks_nesting
    and ->dynticks_nmi_nesting fields to long.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update vmscan instrumentation for v4.16
Michael Jeanson [Mon, 12 Feb 2018 17:32:12 +0000 (18:32 +0100)] 
Fix: update vmscan instrumentation for v4.16

See upstream commit :

  commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
  Author: Josef Bacik <jbacik@fb.com>
  Date:   Wed Jan 31 16:16:26 2018 -0800

    mm: use sc->priority for slab shrink targets

    Previously we were using the ratio of the number of lru pages scanned to
    the number of eligible lru pages to determine the number of slab objects
    to scan.  The problem with this is that these two things have nothing to
    do with each other, so in slab heavy work loads where there is little to
    no page cache we can end up with the pages scanned being a very low
    number.  This means that we reclaim next to no slab pages and waste a
    lot of time reclaiming small amounts of space.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update timer instrumentation on 4.16 and 4.14-rt
Rasmus Villemoes [Mon, 12 Feb 2018 17:31:40 +0000 (18:31 +0100)] 
Fix: update timer instrumentation on 4.16 and 4.14-rt

See upstream commit :

  commit 63e2ed3659752a4850e0ef3a07f809988fcd74a4
  Author: Anna-Maria Gleixner <anna-maria@linutronix.de>
  Date:   Thu Dec 21 11:41:38 2017 +0100

    tracing/hrtimer: Print the hrtimer mode in the 'hrtimer_start' tracepoint

    The 'hrtimer_start' tracepoint lacks the mode information. The mode is
    important because consecutive starts can switch from ABS to REL or from
    PINNED to non PINNED.

    Append the mode field.

See linux-rt commit :

  commit 6ee32a49b1ed61c08ac9f1c9fcbf83d3c749b71d
  Author: Anna-Maria Gleixner <anna-maria@linutronix.de>
  Date:   Sun Oct 22 23:39:46 2017 +0200

    tracing: hrtimer: Print hrtimer mode in hrtimer_start tracepoint

    The hrtimer_start tracepoint lacks the mode information. The mode is
    important because consecutive starts can switch from ABS to REL or from
    PINNED to non PINNED.

    Add the mode information.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate kvm instrumentation for debian kernel 4.14.0-3
Michael Jeanson [Tue, 30 Jan 2018 21:48:36 +0000 (16:48 -0500)] 
Update kvm instrumentation for debian kernel 4.14.0-3

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoVersion 2.9.8 v2.9.8
Mathieu Desnoyers [Tue, 30 Jan 2018 20:52:51 +0000 (15:52 -0500)] 
Version 2.9.8

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: network instrumentation protocol enum
Mathieu Desnoyers [Thu, 25 Jan 2018 17:41:57 +0000 (12:41 -0500)] 
Fix: network instrumentation protocol enum

The enumeration field within the header payload should keep the
enumeration describing the header field, and not use the variant
selector enumeration.

This issue was introduced by commit "Fix: network instrumentation
handling of corrupted TCP headers".

It causes the following warning messages in babeltrace:

[warning] Unknown value 6 in enum.
[warning] Unknown value 17 in enum.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update btrfs instrumentation for SuSE 4.4.103-6
Michael Jeanson [Tue, 23 Jan 2018 21:03:25 +0000 (16:03 -0500)] 
Fix: update btrfs instrumentation for SuSE 4.4.103-6

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update block instrumentation for SuSE 4.4.73-5
Michael Jeanson [Tue, 23 Jan 2018 21:03:24 +0000 (16:03 -0500)] 
Fix: update block instrumentation for SuSE 4.4.73-5

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: global_dirty_limit for kernel v4.2 and up
Michael Jeanson [Tue, 23 Jan 2018 21:00:07 +0000 (16:00 -0500)] 
Fix: global_dirty_limit for kernel v4.2 and up

global_dirty_limit was moved into wb_domain.

See upstream commit :

  commit dcc25ae76eb7b8ff883eaaab57e30e8f2f085be3
  Author: Tejun Heo <tj@kernel.org>
  Date:   Fri May 22 18:23:22 2015 -0400

    writeback: move global_dirty_limit into wb_domain

    This patch is a part of the series to define wb_domain which
    represents a domain that wb's (bdi_writeback's) belong to and are
    measured against each other in.  This will enable IO backpressure
    propagation for cgroup writeback.

    global_dirty_limit exists to regulate the global dirty threshold which
    is a property of the wb_domain.  This patch moves hard_dirty_limit,
    dirty_lock, and update_time into wb_domain.

    This is pure reorganization and doesn't introduce any behavioral
    changes.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: network instrumentation handling of corrupted TCP headers
Mathieu Desnoyers [Thu, 18 Jan 2018 19:18:14 +0000 (14:18 -0500)] 
Fix: network instrumentation handling of corrupted TCP headers

A malformed packet may contain a valid IPv4/IPv6 header, but an
inconsistent TCP header. As a result, the trace contains a fully
formed IPv4/IPv6 header, including the "protocol" or "nexthdr"
fields indicating TCP, but no following TCP header.

This scenario leads to an unreadable CTF trace, because the
trace viewer expects a TCP header, but instead gets the next
event.

Therefore, using the IP header fields as the selector for the
transport layer variant is not the right approach. Introduce our
own selector field instead, which lets us handle this corner case
properly.
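
A minimal sketch of a dedicated selector, in userspace C. The enum, the
function name and the length-based validity check are all assumptions
made for illustration; the commit message does not say how the probe
decides that a TCP header is usable.

```c
#include <stddef.h>

#define IPPROTO_TCP_NUM	6	/* IANA protocol number for TCP */
#define TCP_MIN_HDR_LEN	20	/* minimum TCP header length, in bytes */

/* Hypothetical selector, decoupled from the IP "protocol" field. */
enum transport_header_type {
	TH_NONE = 0,	/* no transport header recorded */
	TH_TCP = 1,	/* a TCP header follows in the trace */
};

/*
 * Derive the variant selector from what can actually be recorded,
 * not from what the IP header claims.
 */
static enum transport_header_type
select_transport_header(unsigned int ip_protocol, size_t transport_len)
{
	if (ip_protocol == IPPROTO_TCP_NUM && transport_len >= TCP_MIN_HDR_LEN)
		return TH_TCP;
	return TH_NONE;
}
```

With a selector like this, a packet whose IP header announces TCP but
whose payload is too short is recorded with TH_NONE, so a trace reader
never expects a TCP header that was not written.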

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: add missing uaccess.h include from kstrtox.h wrapper
Mathieu Desnoyers [Wed, 17 Jan 2018 18:37:26 +0000 (13:37 -0500)] 
Fix: add missing uaccess.h include from kstrtox.h wrapper

Required to build lttng-modules against kernel < 3.0.0 on ARM.

Fixes #1148

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate: kvm instrumentation for 4.14.14+, 4.9.77+, 4.4.112+
Mathieu Desnoyers [Wed, 17 Jan 2018 16:17:08 +0000 (11:17 -0500)] 
Update: kvm instrumentation for 4.14.14+, 4.9.77+, 4.4.112+

Starting from 4.14.14, 4.9.77, and 4.4.112, the 4.14, 4.9, and 4.4
stable kernel branches backport a kvm instrumentation change introduced
in 4.15 which affects the prototype of the kvm_mmio event.
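
The kind of version test this implies can be sketched as follows, using
the three stable thresholds from the commit title (4.14.14, 4.9.77,
4.4.112). VERSION_CODE(), in_range() and has_new_kvm_mmio() are
hypothetical names for this example; the sketch covers only the branches
named in this commit, not distribution kernels or other stable branches.

```c
/* Same encoding as the kernel's KERNEL_VERSION() macro of that era. */
#define VERSION_CODE(a, b, c)	(((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: true when low <= code < high. */
static int in_range(unsigned int code, unsigned int low, unsigned int high)
{
	return code >= low && code < high;
}

/*
 * Returns 1 if a kernel with this version code carries the kvm_mmio
 * prototype change, either natively (4.15+) or through a backport.
 */
static int has_new_kvm_mmio(unsigned int code)
{
	return code >= VERSION_CODE(4, 15, 0)
		|| in_range(code, VERSION_CODE(4, 14, 14), VERSION_CODE(4, 15, 0))
		|| in_range(code, VERSION_CODE(4, 9, 77), VERSION_CODE(4, 10, 0))
		|| in_range(code, VERSION_CODE(4, 4, 112), VERSION_CODE(4, 5, 0));
}
```

Each backport needs its own bounded range because a later branch, such
as 4.10, does not necessarily carry the change.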

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: btrfs_delayed_ref_head was unwired since v3.12
Michael Jeanson [Tue, 9 Jan 2018 22:40:00 +0000 (17:40 -0500)] 
Fix: btrfs_delayed_ref_head was unwired since v3.12

See upstream commit:

  commit 599c75ec3f7f3b606e8a0a684c00f12190712de8
  Author: Liu Bo <bo.li.liu@oracle.com>
  Date:   Tue Jul 16 19:03:36 2013 +0800

    Btrfs/tracepoint: update delayed ref tracepoints

    This shows exactly how btrfs processes the delayed refs onto disks,
    which is very helpful on understanding delayed ref mechanism and
    debugging related bugs.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate kvm instrumentation for debian kernel 4.9.65-3
Michael Jeanson [Tue, 9 Jan 2018 20:43:20 +0000 (15:43 -0500)] 
Update kvm instrumentation for debian kernel 4.9.65-3

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: debian kernel version parsing
Michael Jeanson [Tue, 9 Jan 2018 20:43:19 +0000 (15:43 -0500)] 
Fix: debian kernel version parsing

The Debian version script only worked for ckt kernels, which was fine
until now because we only had checks for those versions in the code.

ckt (Canonical Kernel Team) kernels were used for a while during the jessie
cycle, and their versioning is a bit different. They track the upstream
vanilla stable updates, but they don't update the minor version number and
instead add an additional -cktX. They were all 3.16.7-cktX, and after a
while the version switched back to upstream style at 3.16.36.

Knowing that, we can compare regular Debian and ckt kernel versions
using this scheme:

  MAJOR.PATCHLEVEL.SUBLEVEL.CKT.DEBABI.DEBPATCH

with CKT set to zero for non-ckt kernels.
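
Comparing two versions under this scheme amounts to a field-by-field
comparison, most significant field first. The sketch below assumes the
six numbers have already been parsed; struct deb_kver and deb_kver_cmp()
are hypothetical names, not part of the version script.

```c
/* Hypothetical holder for MAJOR.PATCHLEVEL.SUBLEVEL.CKT.DEBABI.DEBPATCH. */
struct deb_kver {
	unsigned int field[6];
};

/* Lexicographic comparison: negative, zero or positive, like strcmp(). */
static int deb_kver_cmp(const struct deb_kver *a, const struct deb_kver *b)
{
	int i;

	for (i = 0; i < 6; i++) {
		if (a->field[i] != b->field[i])
			return a->field[i] < b->field[i] ? -1 : 1;
	}
	return 0;
}
```

Because SUBLEVEL is compared before CKT, any 3.16.7-cktX kernel sorts
before 3.16.36, whose CKT field is zero.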

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: block instrumentation 4.14+ NULL pointer dereference
Mathieu Desnoyers [Tue, 9 Jan 2018 16:04:36 +0000 (11:04 -0500)] 
Fix: block instrumentation 4.14+ NULL pointer dereference

Support for block layer instrumentation on Linux kernels 4.14+
introduces the following NULL pointer dereference:

181.6723  [ 3819.390121] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
181.6724  [ 3819.394856] IP: __event_probe__block_get_rq+0x127/0x4a0 [lttng_probe_block]
181.6725  [ 3819.394856] PGD 7b924067 P4D 7b924067 PUD 733a7067 PMD 0
181.6726  [ 3819.394856] Oops: 0000 [#1] SMP
181.6727  [ 3819.394856] Modules linked in: lttng_test(OE) lttng_probe_x86_exceptions(OE) lttng_probe_x86_irq_vectors(OE) lttng_probe_writeback(OE) lttng_probe_workqueue(OE) lttng_probe_vmscan(OE) lttng_probe_udp(OE) lttng_probe_timer(OE) lttng_probe_sunrpc(OE) lttng_probe_statedump(OE) lttng_probe_sock(OE) lttng_probe_skb(OE) lttng_probe_signal(OE) lttng_probe_scsi(OE) lttng_probe_sched(OE) lttng_probe_regulator(OE) lttng_probe_regmap(OE) lttng_probe_rcu(OE) lttng_probe_random(OE) lttng_probe_printk(OE) lttng_probe_power(OE) lttng_probe_net(OE) lttng_probe_napi(OE) lttng_probe_module(OE) lttng_probe_kvm_x86_mmu(OE) lttng_probe_kvm_x86(OE) lttng_probe_kvm(OE) lttng_probe_kmem(OE) lttng_probe_jbd2(OE) lttng_probe_irq(OE) lttng_probe_i2c(OE) lttng_probe_gpio(OE) lttng_probe_ext4(OE) lttng_probe_compaction(OE) lttng_probe_btrfs(OE)
181.6728  [ 3819.394856] lttng_probe_block(OE) lttng_ring_buffer_metadata_mmap_client(OE) lttng_ring_buffer_client_mmap_overwrite(OE) lttng_ring_buffer_client_mmap_discard(OE) lttng_ring_buffer_metadata_client(OE) lttng_ring_buffer_client_overwrite(OE) lttng_ring_buffer_client_discard(OE) lttng_tracer(OE) lttng_statedump(OE) lttng_ftrace(OE) lttng_kprobes(OE) lttng_clock(OE) lttng_lib_ring_buffer(OE) lttng_kretprobes(OE) [last unloaded: lttng_statedump]
181.6729  [ 3819.394856] CPU: 1 PID: 17541 Comm: kworker/u4:2 Tainted: G OE 4.14.0 #1
181.6730  [ 3819.394856] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
181.6731  [ 3819.394856] Workqueue: events_freezable_power_ disk_events_workfn
181.6732  [ 3819.394856] task: ffff9cd5b9bb1cc0 task.stack: ffffbf4100444000
181.6733  [ 3819.394856] RIP: 0010:__event_probe__block_get_rq+0x127/0x4a0 [lttng_probe_block]
181.6734  [ 3819.394856] RSP: 0018:ffffbf4100447b40 EFLAGS: 00010246
181.6735  [ 3819.394856] RAX: 0000000000000000 RBX: ffff9cd5b39757a8 RCX: ffff9cd5ae850000
181.6736  [ 3819.394856] RDX: 000000000000042a RSI: 0000000000000bd6 RDI: ffffdf40ffd04470
181.6737  [ 3819.394856] RBP: ffffbf4100447c50 R08: 0000000000800000 R09: 0000000000019bd6
181.6738  [ 3819.394856] R10: ffffdf40ffd04470 R11: 0000000000000000 R12: 0000000000000000
181.6739  [ 3819.394856] R13: 000000000001d060 R14: ffff9cd5bb9988a0 R15: ffff9cd5b992b480
181.6740  [ 3819.394856] FS: 0000000000000000(0000) GS:ffff9cd5bfd00000(0000) knlGS:0000000000000000
181.6741  [ 3819.394856] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
181.6742  [ 3819.394856] CR2: 0000000000000008 CR3: 00000000736ab000 CR4: 00000000000006e0
181.6743  [ 3819.394856] Call Trace:
181.6744  [ 3819.394856] ? scsi_old_init_rq+0x84/0x100
181.6745  [ 3819.394856] ? mempool_alloc+0x5f/0x150
181.6746  [ 3819.394856] ? kvm_clock_read+0x1e/0x20
181.6747  [ 3819.394856] get_request+0x4db/0x7e0
181.6748  [ 3819.394856] ? wait_woken+0x80/0x80
181.6749  [ 3819.394856] blk_get_request+0x9c/0x110
181.6750  [ 3819.394856] scsi_execute+0x40/0x260
181.6751  [ 3819.394856] sr_check_events+0x7d/0x290
181.6752  [ 3819.394856] cdrom_check_events+0x18/0x30
181.6753  [ 3819.394856] sr_block_check_events+0x2a/0x30
181.6754  [ 3819.394856] disk_check_events+0x51/0x130
181.6755  [ 3819.394856] disk_events_workfn+0x16/0x20
181.6756  [ 3819.394856] process_one_work+0x156/0x3f0
181.6757  [ 3819.394856] worker_thread+0x4b/0x460
181.6758  [ 3819.394856] kthread+0x109/0x140
181.6759  [ 3819.394856] ? process_one_work+0x3f0/0x3f0
181.6760  [ 3819.394856] ? kthread_create_on_node+0x40/0x40
181.6761  [ 3819.394856] ret_from_fork+0x25/0x30
181.6762  [ 3819.394856] Code: 00 00 00 00 48 89 85 20 ff ff ff 48 8d 85 10 ff ff ff 8b 73 04 48 89 85 28 ff ff ff 49 8b 47 48 ff 50 28 85 c0 0f 88 78 01 00 00 <49> 8b 44 24 08 ba 04 00 00 00 48 8d b5 08 ff ff ff 48 8d bd 20
181.6763  [ 3819.394856] RIP: __event_probe__block_get_rq+0x127/0x4a0 [lttng_probe_block] RSP: ffffbf4100447b40
181.6764  [ 3819.394856] CR2: 0000000000000008
181.6765  [ 3819.394856] ---[ end trace b08f087751369a25 ]---

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate: kvm instrumentation for 3.16.52 and 3.2.97
Mathieu Desnoyers [Tue, 2 Jan 2018 16:07:05 +0000 (11:07 -0500)] 
Update: kvm instrumentation for 3.16.52 and 3.2.97

Starting from 3.16.52 and 3.2.97, the 3.16 and 3.2 stable kernel
branches backport a kvm instrumentation change introduced in 4.15 which
affects the prototype of the kvm_mmio event.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: kvm instrumentation for 4.15
Mathieu Desnoyers [Wed, 27 Dec 2017 14:07:30 +0000 (09:07 -0500)] 
Fix: kvm instrumentation for 4.15

Incorrect version range.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate sock instrumentation for 4.15
Mathieu Desnoyers [Tue, 26 Dec 2017 14:47:36 +0000 (09:47 -0500)] 
Update sock instrumentation for 4.15

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoUpdate kvm instrumentation for 4.15
Mathieu Desnoyers [Tue, 26 Dec 2017 14:47:22 +0000 (09:47 -0500)] 
Update kvm instrumentation for 4.15

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: ACCESS_ONCE() removed in kernel 4.15
Michael Jeanson [Tue, 19 Dec 2017 20:06:42 +0000 (15:06 -0500)] 
Fix: ACCESS_ONCE() removed in kernel 4.15

The ACCESS_ONCE() macro was removed in kernel 4.15 and should be
replaced by READ_ONCE and WRITE_ONCE, which were introduced in kernel
3.19.

This commit replaces all calls to ACCESS_ONCE() with the appropriate
READ_ONCE or WRITE_ONCE and adds compatibility macros for kernels that
lack them.
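
For kernels that predate READ_ONCE and WRITE_ONCE, compatibility macros
could look like the sketch below. This is an assumption about their
shape, not the definitions added by this commit: it falls back to the
volatile cast that ACCESS_ONCE() was built on, and shared_flag,
set_flag() and get_flag() are hypothetical names.

```c
/*
 * Hypothetical fallbacks, used only when the macros are not already
 * defined. __typeof__ is a GNU C extension.
 */
#ifndef READ_ONCE
#define READ_ONCE(x)		(*(volatile __typeof__(x) *) &(x))
#endif

#ifndef WRITE_ONCE
#define WRITE_ONCE(x, val)	((void) (*(volatile __typeof__(x) *) &(x) = (val)))
#endif

static int shared_flag;

/* Was: ACCESS_ONCE(shared_flag) = v; */
static void set_flag(int v)
{
	WRITE_ONCE(shared_flag, v);
}

/* Was: return ACCESS_ONCE(shared_flag); */
static int get_flag(void)
{
	return READ_ONCE(shared_flag);
}
```

The two call sites match the two rewrite rules of the Coccinelle script
quoted below: an assignment becomes WRITE_ONCE(), every other use
becomes READ_ONCE().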

See this upstream commit:

  commit b03a0fe0c5e4b46dcd400d27395b124499554a71
  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Date:   Mon Oct 23 14:07:25 2017 -0700

    locking/atomics, mm: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE()

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't currently harmful.

    However, for some features it is necessary to instrument reads and
    writes separately, which is not possible with ACCESS_ONCE(). This
    distinction is critical to correct operation.

    It's possible to transform the bulk of kernel code using the Coccinelle
    script below. However, this doesn't handle comments, leaving references
    to ACCESS_ONCE() instances which have been removed. As a preparatory
    step, this patch converts the mm code and comments to use
    {READ,WRITE}_ONCE() consistently.

    ----
    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: sched instrumentation on stable RT kernels
Michael Jeanson [Mon, 18 Dec 2017 19:35:55 +0000 (14:35 -0500)] 
Fix: sched instrumentation on stable RT kernels

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agotimer API transition for kernel 4.15
Michael Jeanson [Wed, 29 Nov 2017 22:03:21 +0000 (17:03 -0500)] 
timer API transition for kernel 4.15

The timer API changes starting from kernel 4.15.0.

There's an interesting LWN article on this subject:

  https://lwn.net/Articles/735887/

Check these upstream commits for more details:

  commit 686fef928bba6be13cabe639f154af7d72b63120
  Author: Kees Cook <keescook@chromium.org>
  Date:   Thu Sep 28 06:38:17 2017 -0700

    timer: Prepare to change timer callback argument type

    Modern kernel callback systems pass the structure associated with a
    given callback to the callback function. The timer callback remains one
    of the legacy cases where an arbitrary unsigned long argument continues
    to be passed as the callback argument. This has several problems:

    - This bloats the timer_list structure with a normally redundant
      .data field.

    - No type checking is being performed, forcing callbacks to do
      explicit type casts of the unsigned long argument into the object
      that was passed, rather than using container_of(), as done in most
      of the other callback infrastructure.

    - Neighboring buffer overflows can overwrite both the .function and
      the .data field, providing attackers with a way to elevate from a buffer
      overflow into a simplistic ROP-like mechanism that allows calling
      arbitrary functions with a controlled first argument.

    - For future Control Flow Integrity work, this creates a unique function
      prototype for timer callbacks, instead of allowing them to continue to
      be clustered with other void functions that take a single unsigned long
      argument.

    This adds a new timer initialization API, which will ultimately replace
    the existing setup_timer(), setup_{deferrable,pinned,etc}_timer() family,
    named timer_setup() (to mirror hrtimer_setup(), making instances of its
    use much easier to grep for).

    In order to support the migration of existing timers into the new
    callback arguments, timer_setup() casts its arguments to the existing
    legacy types, and explicitly passes the timer pointer as the legacy
    data argument. Once all setup_*timer() callers have been replaced with
    timer_setup(), the casts can be removed, and the data argument can be
    dropped with the timer expiration code changed to just pass the timer
    to the callback directly.

    Since the regular pattern of using container_of() during local variable
    declaration repeats the need for the variable type declaration
    to be included, this adds a helper modeled after other from_*()
    helpers that wrap container_of(), named from_timer(). This helper uses
    typeof(*variable), removing the type redundancy and minimizing the need
    for line wraps in forthcoming conversions from "unsigned long data" to
    "struct timer_list *" in the timer callbacks:

    -void callback(unsigned long data)
    +void callback(struct timer_list *t)
    {
    -   struct some_data_structure *local = (struct some_data_structure *)data;
    +   struct some_data_structure *local = from_timer(local, t, timer);

    Finally, in order to support the handful of timer users that perform
    open-coded assignments of the .function (and .data) fields, provide
    cast macros (TIMER_FUNC_TYPE and TIMER_DATA_TYPE) that can be used
    temporarily. Once conversion has been completed, these can be globally
    trivially removed.
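
The callback conversion described above can be mimicked in userspace C.
This is only a sketch under stated assumptions: every sketch_ name is
hypothetical, the timer structure is a stand-in for struct timer_list, and
the macro is merely modeled on the from_timer() helper, not copied from the
kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct timer_list: the callback receives the timer itself. */
struct sketch_timer_list {
	void (*function)(struct sketch_timer_list *);
};

/* Recover the enclosing object from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Modeled on from_timer(): the type is taken from the variable itself. */
#define sketch_from_timer(var, timer_ptr, field) \
	sketch_container_of(timer_ptr, __typeof__(*var), field)

struct sketch_device {
	int fired;
	struct sketch_timer_list timer;	/* embedded timer */
};

/* New-style callback: no unsigned long cast, the type is checked. */
static void sketch_callback(struct sketch_timer_list *t)
{
	struct sketch_device *dev = sketch_from_timer(dev, t, timer);

	dev->fired++;
}

/* Hypothetical analogue of timer_setup(): only records the callback. */
static void sketch_timer_setup(struct sketch_timer_list *t,
			       void (*fn)(struct sketch_timer_list *))
{
	assert(t != NULL && fn != NULL);
	t->function = fn;
}

/* Hypothetical expiry path: passes the timer to its own callback. */
static void sketch_timer_expire(struct sketch_timer_list *t)
{
	t->function(t);
}
```

Because the callback receives the timer pointer, no separate .data field is
needed to find the owning structure.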

    ...

  commit e99e88a9d2b067465adaa9c111ada99a041bef9a
  Author: Kees Cook <keescook@chromium.org>
  Date:   Mon Oct 16 14:43:17 2017 -0700

    treewide: setup_timer() -> timer_setup()

    This converts all remaining cases of the old setup_timer() API into using
    timer_setup(), where the callback argument is the structure already
    holding the struct timer_list. These should have no behavioral changes,
    since they just change which pointer is passed into the callback with
    the same available pointers after conversion. It handles the following
    examples, in addition to some other variations.

    ...

  commit 185981d54a60ae90942c6ba9006b250f3348cef2
  Author: Kees Cook <keescook@chromium.org>
  Date:   Wed Oct 4 16:26:58 2017 -0700

    timer: Remove init_timer_pinned() in favor of timer_setup()

    This refactors the only users of init_timer_pinned() to use
    the new timer_setup() and from_timer(). Drops the definition of
    init_timer_pinned().

    ...

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: Don't nest get online cpus
Mathieu Desnoyers [Wed, 13 Dec 2017 18:40:42 +0000 (13:40 -0500)] 
Fix: Don't nest get online cpus

Since the cpu hotplug refactoring in the Linux kernel, CPU hotplug
"online cpus" read lock cannot be nested anymore.

Fix this by disabling preemption around the section instead.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: lttng_channel_syscall_mask() bool use in bitfield
Mathieu Desnoyers [Fri, 8 Dec 2017 19:17:21 +0000 (14:17 -0500)] 
Fix: lttng_channel_syscall_mask() bool use in bitfield

gcc 7 warns about using ~ on a bool. Pass a char as input type instead.
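
A small userspace sketch of why bitwise negation of a bool is suspicious.
The function names are hypothetical and this is not the lttng-modules
bitfield code; the explicit casts only make visible what integer promotion
does implicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Bitwise NOT after promotion to int: true (1) becomes -2, which is
 * still nonzero, so it is not the boolean inverse. */
static int sketch_bitwise_not(bool b)
{
	return ~(int)b;
}

/* Logical NOT yields the boolean inverse. */
static bool sketch_logical_not(bool b)
{
	return !b;
}

/* With a plain integer input type, ~ followed by a mask operates on
 * ordinary bits and flips the low bit as intended. */
static unsigned char sketch_flip_low_bit(unsigned char v)
{
	return (unsigned char)(~v & 1u);
}
```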

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
6 years agoFix: update kmem instrumentation for kernel 4.15
Michael Jeanson [Tue, 28 Nov 2017 21:02:45 +0000 (16:02 -0500)] 
Fix: update kmem instrumentation for kernel 4.15

See upstream commit:

  commit 2d4894b5d2ae0fe1725ea7abd57b33bfbbe45492
  Author: Mel Gorman <mgorman@techsingularity.net>
  Date:   Wed Nov 15 17:37:59 2017 -0800

    mm: remove cold parameter from free_hot_cold_page*

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoVersion 2.9.7 v2.9.7
Mathieu Desnoyers [Wed, 8 Nov 2017 19:10:27 +0000 (14:10 -0500)] 
Version 2.9.7

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: lttng_kvmalloc helper NULL pointer OOPS
Mathieu Desnoyers [Tue, 7 Nov 2017 21:44:36 +0000 (16:44 -0500)] 
Fix: lttng_kvmalloc helper NULL pointer OOPS

The static function __vmalloc_node is not visible by KALLSYMS_ALL on at
least some kernels, which leads to a call to a NULL function when trying
to perform allocation of lttng buffer memory under memory fragmentation
conditions (kmalloc_node failure).

Use __vmalloc_node_range instead, and check that the returned pointer
is non-NULL to ensure this type of failure does not happen in any
condition.

Fall back to __vmalloc(), even though it is not NUMA-aware, in case
we fail to find __vmalloc_node_range, and print an explicit warning
to the user console about the need to enable KALLSYMS_ALL.

This affects kernels < 4.12. Later kernels provide kvmalloc(), which
we use.
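
The lookup-then-check pattern described above can be sketched in userspace
C. All sketch_ names are hypothetical, malloc() stands in for the kernel
allocators, and the lookup function is only a stand-in for a symbol lookup
that may return NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void *(*sketch_alloc_fn)(size_t size);

/* Stand-ins for the preferred (NUMA-aware) and fallback allocators. */
static void *sketch_numa_alloc(size_t size) { return malloc(size); }
static void *sketch_plain_alloc(size_t size) { return malloc(size); }

/* Hypothetical lookup: NULL means "symbol not visible". */
static sketch_alloc_fn sketch_lookup(const char *name, int symbols_visible)
{
	if (symbols_visible && strcmp(name, "sketch_numa_alloc") == 0)
		return sketch_numa_alloc;
	return NULL;
}

/* Records which path the last allocation took (1 = fallback). */
static int sketch_used_fallback;

static void *sketch_alloc(size_t size, int symbols_visible)
{
	sketch_alloc_fn fn = sketch_lookup("sketch_numa_alloc", symbols_visible);

	/* Never call through the pointer without checking it first. */
	if (fn) {
		sketch_used_fallback = 0;
		return fn(size);
	}
	fprintf(stderr, "sketch: symbol not found, using fallback allocator\n");
	sketch_used_fallback = 1;
	return sketch_plain_alloc(size);
}
```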

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoVersion 2.9.6 v2.9.6
Mathieu Desnoyers [Fri, 3 Nov 2017 21:27:11 +0000 (17:27 -0400)] 
Version 2.9.6

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: lttng-logger get_user_pages_fast error handling
Mathieu Desnoyers [Tue, 31 Oct 2017 22:23:59 +0000 (18:23 -0400)] 
Fix: lttng-logger get_user_pages_fast error handling

Comparing a signed return value against an unsigned nr_pages performs
the comparison as "unsigned", and therefore mistakenly considers
get_user_pages_fast() errors as success.

By passing an invalid pointer to write() to the /proc/lttng-logger
interface, unprivileged user-space processes can trigger a kernel OOPS.
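
The comparison pitfall can be reproduced in userspace C. This is a sketch
with hypothetical names, not the lttng-logger code: the explicit cast in the
buggy variant spells out the conversion that C applies implicitly when an
int is compared with an unsigned long.

```c
#include <assert.h>

#define SKETCH_EFAULT 14	/* illustrative error number */

/* Returns -1 when a failure is detected, 0 otherwise.
 * Buggy: a negative ret converts to a huge unsigned value, so the
 * "fewer pages than requested" test never fires for errors. */
static int sketch_check_buggy(int ret, unsigned long nr_pages)
{
	return (unsigned long)ret < nr_pages ? -1 : 0;
}

/* Fixed: test for a negative error code before comparing counts. */
static int sketch_check_fixed(int ret, unsigned long nr_pages)
{
	if (ret < 0)
		return -1;
	return (unsigned long)ret < nr_pages ? -1 : 0;
}
```

Checking the sign first keeps the page-count comparison meaningful for
non-negative return values only.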

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoVersion 2.9.5 v2.9.5
Mathieu Desnoyers [Thu, 5 Oct 2017 21:05:53 +0000 (17:05 -0400)] 
Version 2.9.5

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: update block instrumentation for 4.14 kernel
Mathieu Desnoyers [Thu, 5 Oct 2017 18:52:15 +0000 (14:52 -0400)] 
Fix: update block instrumentation for 4.14 kernel

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoRevert "Fix: update block instrumentation for kernel 4.14"
Mathieu Desnoyers [Thu, 5 Oct 2017 18:45:43 +0000 (14:45 -0400)] 
Revert "Fix: update block instrumentation for kernel 4.14"

This reverts commit 49447902967115fe5a07ee7a1df3d17fbf4b1ab8.

It introduces a NULL pointer dereference:

[ 37.862398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 37.864108] IP: [<ffffffffa01c41b7>] __event_probe__block_get_rq+0x127/0x4b0 [lttng_probe_block]
[ 37.864108] PGD 7a402067 PUD 7a4c7067 PMD 0
[ 37.864108] Oops: 0000 [#1] SMP
[ 37.864108] Modules linked in: lttng_probe_x86_exceptions(OE) lttng_probe_x86_irq_vectors(OE) lttng_probe_writeback(OE) lttng_probe_workqueue(OE) lttng_probe_vmscan(OE) lttng_probe_udp(OE) lttng_probe_timer(OE) lttng_probe_sunrpc(OE) lttng_probe_statedump(OE) lttng_probe_sock(OE) lttng_probe_skb(OE) lttng_probe_signal(OE) lttng_probe_scsi(OE) lttng_probe_sched(OE) lttng_probe_regulator(OE) lttng_probe_regmap(OE) lttng_probe_rcu(OE) lttng_probe_random(OE) lttng_probe_printk(OE) lttng_probe_power(OE) lttng_probe_net(OE) lttng_probe_napi(OE) lttng_probe_module(OE) lttng_probe_kvm_x86_mmu(OE) lttng_probe_kvm_x86(OE) lttng_probe_kvm(OE) lttng_probe_kmem(OE) lttng_probe_jbd2(OE) lttng_probe_irq(OE) lttng_probe_i2c(OE) lttng_probe_gpio(OE) lttng_probe_ext4(OE) lttng_probe_compaction(OE) lttng_probe_btrfs(OE) lttng_probe_block(OE) lttng_ring_buffer_metadata_mmap_client(OE) lttng_ring_buffer_client_mmap_overwrite(OE) lttng_ring_buffer_client_mmap_discard(OE) lttng_ring_buffer_metadata_client(OE) lttng_ring_buffer_client_overwrite(OE) lttng_ring_buffer_client_discard(OE) lttng_tracer(OE) lttng_statedump(OE) lttng_ftrace(OE) lttng_kprobes(OE) lttng_clock(OE) lttng_lib_ring_buffer(OE) lttng_kretprobes(OE)
[ 37.864108] CPU: 1 PID: 6 Comm: kworker/u4:0 Tainted: G OE 4.4.90 #1
[ 37.864108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 37.864108] Workqueue: events_freezable_power_ disk_events_workfn
[ 37.864108] task: ffff88007c861bc0 ti: ffff88007c868000 task.ti: ffff88007c868000
[ 37.864108] RIP: 0010:[<ffffffffa01c41b7>] [<ffffffffa01c41b7>] __event_probe__block_get_rq+0x127/0x4b0 [lttng_probe_block]
[ 37.864108] RSP: 0018:ffff88007c86ba98 EFLAGS: 00010246
[ 37.864108] RAX: 0000000000000000 RBX: ffff880073683348 RCX: ffff8800747d0000
[ 37.864108] RDX: 00000008d0c5bde9 RSI: 00000000000009f2 RDI: 0000000000400000
[ 37.864108] RBP: ffff88007c86bba8 R08: 00000000001789ed R09: 0000000000100000
[ 37.864108] R10: ffffe8ffffd02460 R11: 0000000000000000 R12: 0000000000000000
[ 37.864108] R13: 0000000000017fe0 R14: ffff88007363c6e8 R15: ffff88007bef83c0
[ 37.864108] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[ 37.864108] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 37.864108] CR2: 0000000000000008 CR3: 000000007a4d0000 CR4: 00000000000006e0
[ 37.864108] Stack:
[ 37.864108] 0000000000000000 ffffffff8115a46b ffff88007c86bbe8 ffff88007bc67e30
[ 37.864108] ffff880073683348 00000000ffffff01 ffff88007a7a1000 ffff88007c86bab8
[ 37.864108] 0000000000000028 0000000100000001 ffffe8ffffd02460 0000000000000035
[ 37.864108] Call Trace:
[ 37.864108] [<ffffffff8115a46b>] ? ktime_get_mono_fast_ns+0x4b/0x90
[ 37.864108] [<ffffffff81532849>] ? alloc_request_struct+0x19/0x20
[ 37.864108] [<ffffffff811e8d8f>] ? mempool_alloc+0x5f/0x150
[ 37.864108] [<ffffffffa021815c>] ? __event_probe__kmem_alloc+0x1dc/0x2c0 [lttng_probe_kmem]
[ 37.864108] [<ffffffff810ad85e>] ? kvm_clock_read+0x1e/0x20
[ 37.864108] [<ffffffff81535f4f>] get_request+0x4af/0x760
[ 37.864108] [<ffffffff8112c270>] ? wake_atomic_t_function+0x60/0x60
[ 37.864108] [<ffffffff81536283>] blk_get_request+0x83/0xe0
[ 37.864108] [<ffffffff81773b5d>] scsi_execute+0x3d/0x1d0
[ 37.864108] [<ffffffff817758fe>] scsi_execute_req_flags+0x8e/0xf0
[ 37.864108] [<ffffffff81788f4d>] sr_check_events+0x8d/0x2a0
[ 37.864108] [<ffffffff81547590>] ? disk_check_events+0x130/0x130
[ 37.864108] [<ffffffff8181b618>] cdrom_check_events+0x18/0x30
[ 37.864108] [<ffffffff8178935a>] sr_block_check_events+0x2a/0x30
[ 37.864108] [<ffffffff815474b1>] disk_check_events+0x51/0x130
[ 37.864108] [<ffffffff815475a6>] disk_events_workfn+0x16/0x20
[ 37.864108] [<ffffffff81102b85>] process_one_work+0x165/0x480
[ 37.864108] [<ffffffff81102eeb>] worker_thread+0x4b/0x4c0
[ 37.864108] [<ffffffff81102ea0>] ? process_one_work+0x480/0x480
[ 37.864108] [<ffffffff81108d86>] kthread+0xd6/0xf0
[ 37.864108] [<ffffffff81108cb0>] ? kthread_create_on_node+0x180/0x180
[ 37.864108] [<ffffffff81aa690f>] ret_from_fork+0x3f/0x70
[ 37.864108] [<ffffffff81108cb0>] ? kthread_create_on_node+0x180/0x180
[ 37.864108] Code: 00 00 00 00 48 89 85 20 ff ff ff 48 8d 85 10 ff ff ff 8b 73 04 48 89 85 28 ff ff ff 49 8b 47 48 ff 50 28 85 c0 0f 88 5d 01 00 00 <49> 8b 44 24 08 48 85 c0 0f 84 3d 03 00 00 8b 00 89 85 08 ff ff
[ 37.864108] RIP [<ffffffffa01c41b7>] __event_probe__block_get_rq+0x127/0x4b0 [lttng_probe_block]
[ 37.864108] RSP <ffff88007c86ba98>
[ 37.864108] CR2: 0000000000000008

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoVersion 2.9.4 v2.9.4
Mathieu Desnoyers [Tue, 3 Oct 2017 18:37:51 +0000 (14:37 -0400)] 
Version 2.9.4

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: version check error in btrfs instrumentation
Michael Jeanson [Fri, 29 Sep 2017 20:40:36 +0000 (16:40 -0400)] 
Fix: version check error in btrfs instrumentation

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: update btrfs instrumentation for kernel 4.14
Michael Jeanson [Wed, 20 Sep 2017 16:12:41 +0000 (12:12 -0400)] 
Fix: update btrfs instrumentation for kernel 4.14

See upstream commit:

  Author: Jeff Mahoney <jeffm@suse.com>
  Date:   Wed Jun 28 21:56:54 2017 -0600

    btrfs: constify tracepoint arguments

    Tracepoint arguments are all read-only.  If we mark the arguments
    as const, we're able to keep or convert those arguments to const
    where appropriate.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: update writeback instrumentation for kernel 4.14
Michael Jeanson [Wed, 20 Sep 2017 16:12:40 +0000 (12:12 -0400)] 
Fix: update writeback instrumentation for kernel 4.14

See upstream commits:

  commit 11fb998986a72aa7e997d96d63d52582a01228c5
  Author: Mel Gorman <mgorman@techsingularity.net>
  Date:   Thu Jul 28 15:46:20 2016 -0700

    mm: move most file-based accounting to the node

    There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone.  This can be coped with to some extent but it's
    confusing so this patch moves the relevant file-based accounting.  Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

  commit c4a25635b60d08853a3e4eaae3ab34419a36cfa2
  Author: Mel Gorman <mgorman@techsingularity.net>
  Date:   Thu Jul 28 15:46:23 2016 -0700

    mm: move vmscan writes and file write accounting to the node

    As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node.  For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine.  NUMA stats are still per-zone as this is a
    user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: update block instrumentation for kernel 4.14
Michael Jeanson [Wed, 20 Sep 2017 16:12:39 +0000 (12:12 -0400)] 
Fix: update block instrumentation for kernel 4.14

See upstream commit:

  commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a
  Author: Christoph Hellwig <hch@lst.de>
  Date:   Wed Aug 23 19:10:32 2017 +0200

    block: replace bi_bdev with a gendisk pointer and partitions index

    This way we don't need a block_device structure to submit I/O.  The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open.  Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device.  But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: vmalloc wrapper on kernel < 2.6.38
Michael Jeanson [Tue, 26 Sep 2017 18:16:47 +0000 (14:16 -0400)] 
Fix: vmalloc wrapper on kernel < 2.6.38

Ensure that all probes end up including the vmalloc wrapper through the
lttng-tracer.h header so the trace_*() static inlines are generated
through inclusion of include/trace/events/kmem.h before we define
CREATE_TRACE_POINTS.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: vmalloc wrapper on kernel >= 4.12
Michael Jeanson [Tue, 26 Sep 2017 17:46:30 +0000 (13:46 -0400)] 
Fix: vmalloc wrapper on kernel >= 4.12

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoAdd kmalloc failover to vmalloc
Michael Jeanson [Mon, 25 Sep 2017 14:56:20 +0000 (10:56 -0400)] 
Add kmalloc failover to vmalloc

This patch is based on the kvmalloc helpers introduced in kernel 4.12.

It will gracefully fail over memory allocations of more than one page to
vmalloc for systems under high memory pressure or fragmentation.
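
The failover idea can be sketched in userspace C. This is a minimal model
with hypothetical sketch_ names: malloc() stands in for both allocators, a
flag simulates fragmentation, and the 4096-byte page size is an assumption.
It is not the lttng-modules helper nor the kernel's kvmalloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

enum sketch_source { SKETCH_NONE, SKETCH_CONTIGUOUS, SKETCH_VIRTUAL };

/* Nonzero simulates fragmentation: multi-page contiguous requests fail. */
static int sketch_fragmented;

/* Stand-in for the physically contiguous allocator. */
static void *sketch_contiguous_alloc(size_t size)
{
	if (sketch_fragmented && size > SKETCH_PAGE_SIZE)
		return NULL;
	return malloc(size);
}

/* Stand-in for the virtually contiguous allocator. */
static void *sketch_virtual_alloc(size_t size)
{
	return malloc(size);
}

/* Try the contiguous allocator first; fail over only for requests
 * larger than one page, reporting which allocator served the request. */
static void *sketch_kvmalloc(size_t size, enum sketch_source *src)
{
	void *p = sketch_contiguous_alloc(size);

	assert(src != NULL);
	if (p) {
		*src = SKETCH_CONTIGUOUS;
		return p;
	}
	/* A request of at most one page gains nothing from the fallback. */
	if (size <= SKETCH_PAGE_SIZE) {
		*src = SKETCH_NONE;
		return NULL;
	}
	p = sketch_virtual_alloc(size);
	*src = p ? SKETCH_VIRTUAL : SKETCH_NONE;
	return p;
}
```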

See Linux kernel commit:
  commit a7c3e901a46ff54c016d040847eda598a9e3e653
  Author: Michal Hocko <mhocko@suse.com>
  Date:   Mon May 8 15:57:09 2017 -0700

    mm: introduce kv[mz]alloc helpers

    Patch series "kvmalloc", v5.

    There are many open coded kmalloc with vmalloc fallback instances in the
    tree.  Most of them are not careful enough or simply do not care about
    the underlying semantic of the kmalloc/page allocator which means that
    a) some vmalloc fallbacks are basically unreachable because the kmalloc
    part will keep retrying until it succeeds b) the page allocator can
    invoke really disruptive steps like the OOM killer to move forward
    which doesn't sound appropriate when we consider that the vmalloc
    fallback is available.

    As can be seen, implementing kvmalloc requires quite intimate
    knowledge of the page allocator and the memory reclaim internals which
    strongly suggests that a helper should be implemented in the memory
    subsystem proper.

    Most callers, I could find, have been converted to use the helper
    instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
    in the networking stack which I have converted as well and Eric Dumazet
    was not opposed [2] to convert them as well.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with the vmalloc fallback for larger allocations is a
    common pattern in the kernel code.  Yet we do not have any common helper
    for that and so users have invented their own helpers.  Some of them are
    really creative when doing so.  Let's just add kv[mz]alloc and make sure
    it is implemented properly.  This implementation makes sure to not make
    a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
    to not warn about allocation failures.  This also rules out the OOM
    killer as the vmalloc is a more appropriate fallback than a disruptive
    user visible action.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
7 years agoFix: mmap: caches aliased on virtual addresses
Mathieu Desnoyers [Tue, 19 Sep 2017 16:16:58 +0000 (12:16 -0400)] 
Fix: mmap: caches aliased on virtual addresses

Some architectures (e.g. implementations of arm64) implement their
caches based on virtual addresses (rather than physical addresses).
It has the upside of making the cache access faster (no TLB lookup
required to access the cache line), but the downside of requiring
virtual mappings (e.g. kernel vs user-space) to be aligned on the number
of bits used for cache aliasing.
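
To make the alignment requirement concrete, here is a userspace sketch that
tests whether two virtual mappings of the same page share the same cache
"color". The geometry (4 KiB pages, 16 KiB per cache way) and all sketch_
names are assumptions chosen for illustration, not values taken from any
particular architecture or from lttng-modules.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry, for illustration only. */
#define SKETCH_PAGE_SIZE      4096u	/* 4 KiB pages */
#define SKETCH_CACHE_WAY_SIZE 16384u	/* 16 KiB per cache way */

/* Virtual address bits above the page offset that still index the cache. */
#define SKETCH_ALIAS_MASK \
	((uintptr_t)(SKETCH_CACHE_WAY_SIZE - 1) & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1))

/* Two mappings are alias-free when their cache-index bits match. */
static int sketch_mappings_alias_safely(uintptr_t va_a, uintptr_t va_b)
{
	return ((va_a ^ va_b) & SKETCH_ALIAS_MASK) == 0;
}
```

With this assumed geometry, two mappings whose addresses differ by a
multiple of 16 KiB index the same cache lines, while mappings that differ
in bits 12 or 13 may hold separate copies of the same data.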

Perform dcache flushing for the entire sub-buffer in the get_subbuf
operation on those architectures, thus ensuring we don't end up with
cache aliasing issues.

An alternative approach we could eventually take would be to create a
kernel mapping for the ring buffer that is aligned with the user-space
mapping.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>