The kretprobe hash is mostly superfluous, replace it with a per-task
variable.
This gets rid of the task hash and it's related locking.
Note that this may change the kprobes module-exported API for kretprobe
handlers. If any out-of-tree kretprobe user uses ri->rp, use
get_kretprobe(ri) instead.
Also remove the confusing comment about checking if a fd exists. I
could not find one instance in the entire kernel that still matches
the description or the reason for the name fcheck.
The need for better names became apparent in the last round of
discussion of this set of changes[1].
Michael Jeanson [Mon, 23 Nov 2020 17:15:43 +0000 (12:15 -0500)]
Improve the release script
* Use git-archive, this removes all custom code to cleanup the repo, it
can now be used in an unclean repo as the code will be exported from
a specific tag.
* Add parameters, this will allow using the script on any machine
while keeping the default behavior for the maintainer.
Change-Id: I9f29d0e1afdbf475d0bbaeb9946ca3216f725e86 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Currently the tracepoint site will iterate a vector and issue indirect
calls to however many handlers are registered (ie. the vector is
long).
Using static_call() it is possible to optimize this for the common
case of only having a single handler registered. In this case the
static_call() can directly call this handler. Otherwise, if the vector
is longer than 1, call a function that iterates the whole vector like
the current code.
Change-Id: I739dd84d62cc1a821b8bd8acff74fa29aa25d22f Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed
Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
values instead of overloading RET_PF_RETRY. In the short term, the
unique values add clarity to the code and RET_PF_SPURIOUS will be used
by set_spte() to avoid unnecessary work for spurious faults.
In the long term, TDX will use RET_PF_FIXED to deterministically map
memory during pre-boot. The page fault flow may bail early for benign
reasons, e.g. if the mmu_notifier fires for an unrelated address. With
only RET_PF_RETRY, it's impossible for the caller to distinguish between
"cool, page is mapped" and "darn, need to try again", and thus cannot
handle benign cases like the mmu_notifier retry.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie0855c78852b45f588e131fe2463e15aae1bc023
Add functions to handle page faults in the TDP MMU. These page faults
are currently handled in much the same way as the x86 shadow paging
based MMU, however the ordering of some operations is slightly
different. Future patches will add eager NX splitting, a fast page fault
handler, and parallel page faults.
Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ie56959cb6c77913d2f1188b0ca15da9114623a4e
KVM: x86: Add intr/vectoring info and error code to kvm_exit tracepoint
Extend the kvm_exit tracepoint to align it with kvm_nested_vmexit in
terms of what information is captured. On SVM, add interrupt info and
error code, while on VMX it add IDT vectoring and error code. This
sets the stage for macrofying the kvm_exit tracepoint definition so that
it can be reused for kvm_nested_vmexit without loss of information.
Opportunistically stuff a zero for VM_EXIT_INTR_INFO if the VM-Enter
failed, as the field is guaranteed to be invalid. Note, it'd be
possible to further filter the interrupt/exception fields based on the
VM-Exit reason, but the helper is intended only for tracepoints, i.e.
an extra VMREAD or two is a non-issue, the failed VM-Enter case is just
low hanging fruit.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I638fa29ef7d8bb432de42a33f9ae4db43259b915
This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: Ia65cf44e108f2df0b458f0d335f33a8f18f50baa
Use proper ssize_t and size_t types for the return value and count
argument, move the offset last and make it an in/out argument like
all other read/write helpers, and make the buf argument a void pointer
to get rid of lots of casts in the callers.
Change-Id: I825c3fcbcc17e9b46e2a661fadc66b52a94eb2da Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Thu, 24 Sep 2020 19:38:35 +0000 (15:38 -0400)]
fix: Use 'kernel_read' to read from procfs
Use the 'kernel_read' helper to read files in procfs, it's present in
the kernel since the 2.6 series and does the right thing on kernels that
require the set_fs dance and newer one which don't.
Change-Id: I1a53fda379e0bb9acc79331626925bbdba63d727 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Michael Jeanson [Fri, 25 Sep 2020 20:05:00 +0000 (16:05 -0400)]
fix: don't allow userspace copy to read kernel memory
This patch fixes a security issue which allows the root user to read
arbitrary kernel memory. Considering the security model used in LTTng
userspace tooling for kernel tracing, this bug also allows members of
the 'tracing' group to read arbitrary kernel memory.
Calls to __copy_from_user_inatomic() where wrongly enclosed in
set_fs(KERNEL_DS) defeating the access_ok() calls and allowing to read
from kernel memory if a kernel address is provided.
Remove all set_fs() calls around __copy_from_user_inatomic().
As a side effect this will allow us to support v5.10 which should remove
set_fs().
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I35e4562c835217352c012ed96a7b8f93e941381e
The system call filter table has effectively been unused for a long
time due to system call name prefix mismatch. This means the overhead of
selective system call tracing was larger than it should have been because
the event payload preparation would be done for all system calls as soon
as a single system call is traced.
However, fixing this underlying issue unearths several issues that crept
unnoticed when the "enabler" concept was introduced (after the original
implementation of the system call filter table).
Here is a list of the issues which are resolved here:
- Split lttng_syscalls_unregister into an unregister and destroy
function, thus awaiting for a grace period (and therefore quiescence
of the users) after unregistering the system call tracepoints before
freeing the system call filter data structures. This effectively fixes
a use-after-free.
- The state for enabling "all" system calls vs enabling specific system
calls (and sequences of enable-disable) was incorrect with respect to
the "enablers" semantic. This is solved by always tracking the
bitmap of enabled system calls, and keeping this bitmap even when
enabling all system calls. The sc_filter is now always allocated
before system call tracing is registered to tracepoints, which means
it does not need to be RCU dereferenced anymore.
Padding fields in the ABI are reserved to select whether to:
- Trace either native or compat system call (or both, which is the
behavior currently implemented),
- Trace either system call entry or exit (or both, which is the
behavior currently implemented),
- Select the system call to trace by name (behavior currently
implemented) or by system call number,
writeback: Fix sync livelock due to b_dirty_time processing
When we are processing writeback for sync(2), move_expired_inodes()
didn't set any inode expiry value (older_than_this). This can result in
writeback never completing if there's steady stream of inodes added to
b_dirty_time list as writeback rechecks dirty lists after each writeback
round whether there's more work to be done. Fix the problem by using
sync(2) start time is inode expiry value when processing b_dirty_time
list similarly as for ordinarily dirtied inodes. This requires some
refactoring of older_than_this handling which simplifies the code
noticeably as a bonus.
Change-Id: I8b894b13ccc14d9b8983ee4c2810a927c319560b Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The only use of I_DIRTY_TIME_EXPIRE is to detect in
__writeback_single_inode() that inode got there because flush worker
decided it's time to writeback the dirty inode time stamps (either
because we are syncing or because of age). However we can detect this
directly in __writeback_single_inode() and there's no need for the
strange propagation with I_DIRTY_TIME_EXPIRE flag.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I92e37c2ff3ec36d431e8f9de5c8e37c5a2da55ea
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.
Change-Id: Ife8880bd9378dca2972da8838f40fc35ccdfaaac Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
In the scenario of writing sparse files, the per-inode prealloc list may
be very long, resulting in high overhead for ext4_mb_use_preallocated().
To circumvent this problem, we limit the maximum length of per-inode
prealloc list to 512 and allow users to modify it.
After patching, we observed that the sys ratio of cpu has dropped, and
the system throughput has increased significantly. We created a process
to write the sparse file, and the running time of the process on the
fixed kernel was significantly reduced, as follows:
Running time on unfixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m2.051s
user 0m0.008s
sys 0m2.026s
Running time on fixed kernel:
[root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
real 0m0.471s
user 0m0.004s
sys 0m0.395s
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I5169cb24853d4da32e2862a6626f1f058689b053
Beniamin Sandu [Thu, 13 Aug 2020 13:24:39 +0000 (16:24 +0300)]
Kconfig: fix dependency issue when building in-tree without CONFIG_FTRACE
When building in-tree, one could disable CONFIG_FTRACE from kernel
config which will leave CONFIG_TRACEPOINTS selected by LTTNG modules,
but generate a lot of linker errors like below because it leaves out
other stuff, e.g.:
trace.c:(.text+0xd86b): undefined reference to `trace_event_buffer_reserve'
ld: trace.c:(.text+0xd8de): undefined reference to `trace_event_buffer_commit'
ld: trace.c:(.text+0xd926): undefined reference to `event_triggers_call'
ld: trace.c:(.text+0xd942): undefined reference to `trace_event_ignore_this_pid'
ld: net/mac80211/trace.o: in function `trace_event_raw_event_drv_tdls_cancel_channel_switch':
It appears to be caused by the fact that TRACE_EVENT macros in the Linux
kernel depend on the Ftrace ring buffer as soon as CONFIG_TRACEPOINTS is
enabled.
Steps to reproduce:
- Get a clone of an upstream stable kernel and use scripts/built-in.sh on it
- Configure a standard x86-64 build, enable built-in LTTNG but disable
CONFIG_FTRACE from Kernel Hacking-->Tracers using menuconfig
commit 92143b2c5656 ("Fix: metadata stream leak, missing list removal and locking")
missed taking a lock protecting the metadata stream list iteration on
session destroy. This opens a race window between iteration and item
removal/free which triggers kernel OOPS.
Fix: metadata stream leak, missing list removal and locking
The metadata stream is part of a list of metadata streams in the
metadata cache. Its addition to the list should be protected by
the metadata cache lock. It needs to be paired with protection
of list iteration with the same lock.
Removal from the list is entirely missing, and should be added
to lttng_metadata_ring_buffer_release (with proper locking).
This missing list removal was probably not causing issues because the
metadata stream structure was leaked: a kfree() is missing from
lttng_metadata_ring_buffer_release as well.
Fix: coherent state not changed atomically with metadata written
commit 122c63cb4310 ("Fix: Implement RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK")
introduces a new ioctl which returns a flag indicating whether the
metadata is in consistent state at the end of the sub-buffer.
That commit is meant to address metadata consistency issues observable
in live sessions.
However, the "consistent" state is false as soon as a producer is
active (between an outermost metadata_begin/end pair). Unfortunately,
if the last "RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK" operation is
done between the last metadata printf and "end" of the transaction, the
last consistency state will be false, and the consumer daemon will never
send metadata to the relay daemon. This in turn causes a live viewer to
wait for metadata endlessly.
This issue can be reproduced by running lttng-tools:
tests/regression/tools/live/test_kernel
as root in a loop.
We observe two things:
1) the poll operation blocks when there is no more metadata to send,
which means there is no mean to unblock when the consistency state
changes back to "true" without producing additional metadata,
2) Even if (1) was fixed, the expectation from an ABI perspective is
that the "coherent" state is only populated when
RING_BUFFER_GET_NEXT_SUBBUF_METADATA_CHECK succeeds. Therefore,
there is no way to let user-space know about conherency transition
unless additional metadata is generated.
Fixing this requires to hold the metadata cache lock across the entire
production of a coherent metadata transaction. This simpler scheme is
possible because the metadata is generated in a reallocated memory area
and not directly into a ring buffer anymore. This was not the case in
earlier lttng-modules versions, when the metadata was generated directly
into a ring buffer, which explains why this simpler scheme was not
implemented.
mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.
These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.
So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.
A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).
Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput. With this
patch, there are fewer writes which are each larger on average.
Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. static trace points and warning printks which mentioned this
counter no longer report it.
Change-Id: I18080ca62bc6c1cd7d6da4cb27cc1521fbdca5e1 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
block: remove the error argument to the block_bio_complete tracepoint
The status can be trivially derived from the bio itself. That also avoid
callers like NVMe to incorrectly pass a blk_status_t instead of the errno,
and the overhead of translating the blk_status_t to the errno in the I/O
completion fast path when no tracing is enabled.
Fixes: 35fe0d12c8a3 ("nvme: trace bio completion")
Change-Id: I8d1463184d79bfab418a1755bfc6a0200170fff3 Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Get next metadata subbuffer, returning a flag indicating whether the
metadata is guaranteed to be in a consistent state at the end of this
sub-buffer (can be parsed).
This can be used by the consumer to know whether the metadata can be
parsed at the end of this sub-buffer, which is useful to distinguish
between errors and incomplete metadata in live tracing.
Ovidiu Panait [Thu, 14 May 2020 10:05:24 +0000 (13:05 +0300)]
Fix: Use vmalloc_sync_mappings on kernel 5.6 as well
Upstream commit [1], that got rid of vmalloc_sync_all and introduced
vmalloc_sync_mappings, is a v5.6 commit:
$ git tag --contains 763802b53a427ed3cbd419dbba255c414fdd9e7c
v5.6
v5.6-rc7
v5.7-rc1
v5.7-rc2
v5.7-rc3
Extend the LINUX_VERSION_CODE check to v5.6 to fix the following warnings:
...
[ 483.242037] LTTng: vmalloc_sync_all symbol lookup failed.
[ 483.257056] Page fault handler and NMI tracing might trigger faults.
...
Linux commit 0bd476e6c67190b5eb7b6e105c8db8ff61103281 ("kallsyms:
unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()") breaks
LTTng-modules by removing symbols used by the LTTng-modules out-of-tree
tracer.
I pointed this out when the change was originally considered before the
5.7 merge window. This generated some discussion but it did not lead to
any concrete proposal to fix the issue. [1]
The commit has been merged in the 5.7 merge window. At that point, as
maintainer of LTTng, I immediately raised a flag about this issue,
proposing an alternative approach to solve this: expose the few symbols
needed by LTTng to GPL modules. This was NACKed on the ground that the
Linux kernel cannot export GPL symbols when there are no in-tree
users. [2]
Steven Rostedt has shown interest in merging LTTng-modules upstream.
LTTng-modules being LGPL, this is very much doable. I have prepared a
tree of LTTng-modules "for upstreaming" and sent it to him privately so
he can review it. Even if in an ideal scenario LTTng-modules is merged
for the following merge window, it leaves LTTng-modules broken on the
5.7 kernel.
In order to ensure that the LTTng-modules kernel tracer continues working
for my end users on kernels 5.7 onwards, as a very last resort, this is
with great reluctance that I created this fix for LTTng modules. It
basically uses kprobes to lookup the kallsyms_lookup_name symbol, and
continues using kallsyms_lookup_name as before.
Because lttng-tracer depends on lttng-statedump, we cannot just put all
wrappers into lttng-tracer.o, because it would create a circular
dependency. This will be an issue if we introduce common wrappers which
are used in both lttng-tracer.o and in lttng-statedump.o.
Introduce a new lttng-wrapper.o to contain all wrapper symbols for all
lttng modules.
The lttng-ftrace integration (LTTNG_KERNEL_FUNCTION instrumentation
type) was unused for a while now. The "function" probing is actually
done with kprobes and kretprobes (LTTNG_KERNEL_KPROBE and
LTTNG_KERNEL_KRETPROBE).
Remove it so a use of kallsyms_lookup_name() can be removed as well.
Note that in the future we could add back this support by using
register_ftrace_function() which is exported to kernel modules, but
considering that we have not been using this code for a while,
just remove the implementation for now.
Remove dependency on kallsyms for splice_to_pipe (kernel 4.2+)
Upstream commit 2b514574f7e88 "net: af_unix: implement splice for stream
af_unix sockets" exported the "splice_to_pipe" symbol, so use it to
remove a dependency on kallsyms_lookup_name().
Remove dependency on kallsyms for irq_to_desc (kernel 3.4+)
Upstream commit 3911ff30f5d "genirq: export handle_edge_irq() and
irq_to_desc()" exported the irq_to_desc symbol, so use it to remove a
dependency on kallsyms_lookup_name().
Remove work-around for signed tracepoint module tainting (kernel 3.15+)
Upstream commit 66cc69e34e86a "Fix: module signature vs tracepoints: add
new TAINT_UNSIGNED_MODULE" fixed an issue where the kernel was
considering unsigned modules as tainting the kernel in the same way as a
force-loaded modules, which was causing the tracepoints within those
modules to be hidden.
This fix was merged in kernel 3.15, so there is no use in applying this
work-around starting from that kernel.
This removes a dependency on kallsyms_lookup_name() for the symbol
"tracepoint_module_notify".
This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.
workqueue: add worker function to workqueue_execute_end tracepoint
It's surprising that workqueue_execute_end includes only the work when
its counterpart workqueue_execute_start has both the work and the worker
function.
You can't set a tracing filter or trigger based on the function, and
postprocessing scripts interested in specific functions are harder to
write since they have to remember the work from _start and match it up
with the same field in _end.
Add the function name, taking care to use the copy stashed in the
worker since the work is no longer safe to touch.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
media: v4l2: abstract timeval handling in v4l2_buffer
As a preparation for adding 64-bit time_t support in the uapi,
change the drivers to no longer care about the format of the
timestamp field in struct v4l2_buffer.
The v4l2_timeval_to_ns() function is no longer needed in the
kernel after this, but there is userspace code relying on
it to be part of the uapi header.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
rcu: Remove kfree_rcu() special casing and lazy-callback handling
This commit removes kfree_rcu() special-casing and the lazy-callback
handling from Tree RCU. It moves some of this special casing to Tiny RCU,
the removal of which will be the subject of later commits.
This results in a nice negative delta.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
This fixes a data-race where `atomic_t dynticks` is copied by value. The
copy is performed non-atomically, resulting in a data-race if `dynticks`
is updated concurrently.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item
ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM
Convert a plethora of parameters and variables in the MMU and page fault
flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
addresses. When TDP is enabled, the fault address is a guest physical
address and thus can be a 64-bit value, even when both KVM and its guest
are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
64-bit field, not a natural width field.
Using a gva_t for the fault address means KVM will incorrectly drop the
upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
translate L2 GPAs to L1 GPAs.
Opportunistically rename variables and parameters to better reflect the
dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
"addr" instead of "vaddr" when the address may be either a GVA or an L2
GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
a confusing "gpa_t gva" declaration; this also sets the stage for a
future patch to combing nonpaging_page_fault() and tdp_page_fault() with
minimal churn.
Sprinkle in a few comments to document flows where an address is known
to be a GVA and thus can be safely truncated to a 32-bit value. Add
WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
document such cases and detect bugs.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
proc: decouple proc from VFS with "struct proc_ops"
Currently core /proc code uses "struct file_operations" for custom hooks,
however, VFS doesn't directly call them. Every time VFS expands
file_operations hook set, /proc code bloats for no reason.
Introduce "struct proc_ops" which contains only those hooks which /proc
allows to call into (open, release, read, write, ioctl, mmap, poll). It
doesn't contain module pointer as well.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
There are no in-kernel users remaining, but there may still be users that
include linux/time.h instead of sys/time.h from user space, so leave the
types available to user space while hiding them from kernel space.
Only the __kernel_old_* versions of these types remain now.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Change-Id: I986a813ad8b1c753ab1fa07f726b0cc481f049cb