Fix: ring buffer: handle concurrent update in nested buffer wrap around check
With stress-test loads that trigger sub-buffer switches very frequently
(small 4kB sub-buffers, frequent flush), we currently observe this kind
of warning once every few minutes:
[65335.896208] ring buffer relay-overwrite-mmap, cpu 5: records were lost. Caused by:
[65335.896208] [ 0 buffer full, 1 nest buffer wrap-around, 0 event too big ]
It appears that the check for nested buffer wrap-around does not take
into account that concurrent execution contexts (either nested, for
per-cpu buffers, or from another CPU or nested, for global buffers) can
update the commit_count value concurrently.
What we really want this check to ensure is that if we enter a
sub-buffer with an unbalanced reserve/commit count, and there is no hope
that it gets rebalanced promptly, we detect this and drop the current
event. However, when the commit counter has been concurrently updated by
another reserve or a switch, we want to retry the entire reserve
operation.
One way to detect this is to sample the reserve offset twice, around
the commit counter read, along with the appropriate memory barriers. We
can therefore detect whether a mismatch between the reserve and commit
counters is actually caused by a concurrent update, since such an update
necessarily changes the reserve counter as well.
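A minimal sketch of this double-sampling scheme follows; the structure,
field and helper names are illustrative, not the exact lttng-modules
code:

#include <linux/compiler.h>	/* READ_ONCE() */
#include <linux/errno.h>
#include <asm/barrier.h>	/* smp_rmb() */

struct rb_buf {				/* hypothetical, simplified */
	unsigned long offset;		/* reserve offset */
	unsigned long commit_count;	/* commit count, current sub-buffer */
};

/* Hypothetical helper: true if reserve and commit counts disagree. */
static bool reserve_commit_mismatch(unsigned long offset,
		unsigned long commit_count);

static int check_nested_wraparound(struct rb_buf *buf)
{
	unsigned long offset_before, offset_after, commit_count;

	offset_before = READ_ONCE(buf->offset);
	smp_rmb();	/* order first offset read before commit_count read */
	commit_count = READ_ONCE(buf->commit_count);
	smp_rmb();	/* order commit_count read before second offset read */
	offset_after = READ_ONCE(buf->offset);

	if (offset_before != offset_after)
		return -EAGAIN;	/* concurrent update: retry the reserve */
	if (reserve_commit_mismatch(offset_before, commit_count))
		return -EIO;	/* genuine imbalance: drop the event */
	return 0;
}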
lib_ring_buffer_write(), lib_ring_buffer_memset() and
lib_ring_buffer_copy_from_user_inatomic() could be passed a length of 0.
This typically has no side-effect as far as writing into the buffers is
concerned, except for one detail: in overwrite mode, there is a check to
make sure the sub-buffer can be written into. This check is performed
even if length is 0. In the case where this would fall exactly at the
end of a sub-buffer, the check would fail, because the offset would fall
exactly at the beginning of the next sub-buffer.
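Assuming the fix is an early return for zero-length writes, a sketch
with hypothetical names (not the exact lib ring buffer code):

#include <linux/compiler.h>	/* unlikely() */
#include <linux/types.h>

struct rb_ctx;	/* hypothetical write context */
extern void rb_check_subbuf_writable(struct rb_ctx *ctx);	/* hypothetical */
extern void rb_do_copy(struct rb_ctx *ctx, const void *src, size_t len);

static void rb_write(struct rb_ctx *ctx, const void *src, size_t len)
{
	if (unlikely(!len))
		return;	/* zero-length write: nothing to check or copy */
	/*
	 * Reached for len > 0 only, so the offset cannot fall exactly at
	 * the beginning of the next sub-buffer and the overwrite-mode
	 * writability check below is meaningful.
	 */
	rb_check_subbuf_writable(ctx);
	rb_do_copy(ctx, src, len);
}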
Fix: ring buffer: RING_BUFFER_FLUSH ioctl buffer corruption
lib_ring_buffer_switch_slow() clearly states:
* Note, however, that as a v_cmpxchg is used for some atomic
* operations, this function must be called from the CPU which owns the
* buffer for a ACTIVE flush.
But unfortunately, the RING_BUFFER_FLUSH ioctl does not follow these
important directives. Therefore, whenever the consumer daemon or session
daemon explicitly triggers a "flush" on a buffer, it can race with data
being written to the buffer, leading to corruption of the reserve/commit
counters, and therefore corruption of data in the buffer. It triggers
these warnings for overwrite mode buffers:
[65356.890016] WARNING: at
/home/compudj/git/lttng-modules/wrapper/ringbuffer/../../lib/ringbuffer/../../wrapper/ringbuffer/../../lib/ringbuffer/backend.h:110 lttng_event_write+0x118/0x140 [lttng_ring_buffer_client_mmap_overwrite]()
This indicates that we are trying to write into a sub-buffer for which
we don't have exclusive access. It also causes the following warnings to
show up:
[65335.896208] ring buffer relay-overwrite-mmap, cpu 5: records were lost. Caused by:
[65335.896208] [ 0 buffer full, 80910 nest buffer wrap-around, 0 event too big ]
These are caused by a corrupted commit counter.
Fix this by sending an IPI to the CPU owning the flushed buffer for
per-cpu synchronization. For global synchronization, no IPI is needed,
since we allow writes from remote CPUs.
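A sketch of the per-cpu case using the kernel's cross-call API
(smp_call_function_single() is the real API; the wrapper shape here is
illustrative):

#include <linux/smp.h>

static void remote_flush(void *info)
{
	struct lib_ring_buffer *buf = info;

	/* Runs on the CPU owning the buffer, as the comment requires. */
	lib_ring_buffer_switch_slow(buf, SWITCH_ACTIVE);
}

static void flush_buffer_ipi(struct lib_ring_buffer *buf, int cpu)
{
	/* Send an IPI to the owning CPU and wait for completion. */
	smp_call_function_single(cpu, remote_flush, buf, 1);
}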
Samuel Martin [Mon, 17 Jun 2013 14:28:51 +0000 (10:28 -0400)]
Fix build and load against linux-2.6.33.x
* lttng-event.h declared lttng_add_perf_counter_to_ctx on kernels
>= 2.6.33, but the implementation was in lttng-context-perf-counters.c,
which was only included for kernels >= 2.6.34. This prevented the
module from being loaded.
* on kernel 2.6.33.x, lttng-context-perf-counters.c complains about
implicit declarations of {get,put}_online_cpus and
{,un}register_cpu_notifier; fix the header inclusions accordingly.
Signed-off-by: Samuel Martin <smartin@aldebaran-robotics.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fix: statedump hang/too early completion due to logic error
The previous "Fix: statedump hang due to incorrect wait/wakeup use" was
not actually fixing the real problem.
The issue is that we should pass the expected condition to wait_event()
rather than its negation.
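A sketch of the inversion, with hypothetical wait-queue and counter
names:

/* Buggy: returns immediately while worker threads are still running. */
wait_event(statedump_wq, atomic_read(&kernel_threads_to_run) != 0);

/* Fixed: sleeps until every worker thread has completed. */
wait_event(statedump_wq, atomic_read(&kernel_threads_to_run) == 0);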
This bug has been sitting there for a while. I suspect that a recent
change in the Linux scheduler behavior for newly spawned worker threads
might have contributed to trigger the hang more reliably.
The effects of this bug are:
- possible hang of the lttng-sessiond (within the kernel) at tracing
start,
- the statedump end event is traced before all worker threads have
actually completed, which can confuse LTTng viewer state systems.
Reported-by: Phil Wilshire <sysdcs@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
The type of fields exchanged between kernel and userspace must be
compat_ulong_t instead of unsigned long in the compat case, where
userspace is 32-bit and the kernel is 64-bit.
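An illustrative ABI structure (the struct name and fields are
hypothetical; compat_ulong_t comes from <linux/compat.h>):

#include <linux/compat.h>

/*
 * unsigned long is 8 bytes in a 64-bit kernel but 4 bytes in a 32-bit
 * userspace, so a structure crossing that boundary must use the
 * fixed-layout compat type.
 */
struct ring_buffer_pos_compat {		/* hypothetical */
	compat_ulong_t consumed;	/* 32 bits on both sides */
	compat_ulong_t produced;
};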
Fix ring_buffer_frontend.c: missing include lttng-tracer-core.h
In lib/ringbuffer/ring_buffer_frontend.c, RING_BUFFER_ALIGN is undefined,
leading to no alignment offset being recorded after the call to
config->cb.record_header_size() in lib_ring_buffer_try_reserve_slow().
However, lttng-ring-buffer-client.h does define RING_BUFFER_ALIGN, so
the alignment offset will be produced when the packet header is written
in lttng_write_event_header().
This discrepancy may be observed on architectures that don't set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, such as ARM, as a babeltrace
error indicating that the actual content size differs from the
calculated one due to the difference in alignment. Including the
appropriate header file in ring_buffer_frontend.c solves the problem.
fix timestamps on architectures without CONFIG_KTIME_SCALAR
trace_clock_monotonic_wrapper() should return a u64 representing the
number of nanoseconds since system startup.
ktime_get() provides that value directly within its .tv64 field only
on those architectures defining CONFIG_KTIME_SCALAR, whereas in all
other cases (e.g. PowerPC) a ktime_to_ns() conversion (which
translates back to .tv64 when CONFIG_KTIME_SCALAR is defined)
becomes necessary.
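The resulting wrapper, following the description above (treat this as a
sketch of the fix, not necessarily the verbatim patch):

#include <linux/ktime.h>

static inline u64 trace_clock_monotonic_wrapper(void)
{
	ktime_t ktime = ktime_get();

	/*
	 * ktime_to_ns() simply returns .tv64 when CONFIG_KTIME_SCALAR is
	 * defined, and performs the proper conversion everywhere else
	 * (e.g. PowerPC).
	 */
	return ktime_to_ns(ktime);
}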
CC [M] drivers/staging/lttng/probes/lttng-probe-statedump.o
In file included from drivers/staging/lttng/probes/../instrumentation/events/lttng-module/../../../probes/lttng-events.h:221:0,
from drivers/staging/lttng/probes/../instrumentation/events/lttng-module/../../../probes/define_trace.h:117,
from drivers/staging/lttng/probes/../instrumentation/events/lttng-module/lttng-statedump.h:162,
from drivers/staging/lttng/probes/lttng-probe-statedump.c:41:
drivers/staging/lttng/probes/../instrumentation/events/lttng-module/../../../probes/../instrumentation/events/lttng-module/lttng-statedump.h:29:1: error: 'TASK_COMM_LEN' undeclared here (not in a function)
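The error shows lttng-statedump.h using TASK_COMM_LEN without a prior
declaration; the likely fix (an assumption) is to include the header
that defines it:

#include <linux/sched.h>	/* for TASK_COMM_LEN */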
Fix: dmesg printout should not print metadata warnings
The metadata channel retries when an event is dropped by the underlying
buffer, so we should not print a message on the dmesg console telling
the user that the event has been dropped.
The underlying issue was a mismatch between the ring buffer
configuration's description of the number of clock bits saved (32) and
the number actually used (27).
Introduce LTTNG_COMPACT_EVENT_BITS and LTTNG_COMPACT_TSC_BITS across
the code, removing all hardcoded instances of these values, to ensure
this kind of mistake does not happen again.
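A sketch of the idea (LTTNG_COMPACT_TSC_BITS = 27 per the text above;
the event-bits value shown here is illustrative):

#define LTTNG_COMPACT_EVENT_BITS	5
#define LTTNG_COMPACT_TSC_BITS		27

Both the compact event header writer and the ring buffer configuration
description now derive from the same shared constants, so the two can
no longer drift apart silently.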
The config pointer points to data that may vanish across the lifetime
of the ring buffer stream references. It is only guaranteed valid
between channel create and destroy, but the streams can keep an extra
reference on the channel, and thus need the config to stay valid.
Perform a copy of the config rather than keeping a pointer to it.
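A hypothetical sketch of the change, embedding the configuration by
value instead of by pointer (the exact struct holding the copy may
differ in the real code):

struct channel {
	struct lib_ring_buffer_config config;	/* was: const ... *config */
	/* other fields unchanged */
};

static void channel_set_config(struct channel *chan,
		const struct lib_ring_buffer_config *config)
{
	chan->config = *config;	/* copy outlives the client's own data */
}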
lttng lib: ring buffer move null pointer check to open
Let's move the NULL buf check to the file "open", where it belongs. The
"open" file operation is the actual interface between lib ring buffer
and the modules using it.
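A sketch of the moved check, assuming the usual file_operations shape
(illustrative, not the verbatim patch):

#include <linux/fs.h>

static int lib_ring_buffer_open(struct inode *inode, struct file *file)
{
	struct lib_ring_buffer *buf = inode->i_private;

	if (!buf)
		return -EINVAL;	/* reject at the interface boundary */
	file->private_data = buf;
	return 0;
}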
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
* Dan Carpenter <dan.carpenter@oracle.com> wrote:
> Sparse complains that these signed bitfields look "dubious". The
> problem is that instead of being either 0 or 1 like people would expect,
> signed one bit variables like this are either 0 or -1. It doesn't cause
> a problem in this case but it's ugly so lets fix them.
* walter harms (wharms@bfs.de) wrote:
> hi,
> This patch looks ok to me but this design is ugly by itself.
> It should be replaced by an uchar uint whatever or use a
> real bool (obviously not preferred by this programmes).
bool :1, uchar :1 or uint :1 could make sense. uchar:1/bool:1 won't save
any space here, because the surrounding fields are either uint or
pointers, so alignment will just add padding.
I try to use int/uint whenever possible because x86 CPUs tend to get
fewer register false dependencies when using instructions that modify
the whole register (generated by using int/uint types) rather than only
part of it (uchar/char/bool). I only use char/uchar/bool when there is a
clear, wanted space gain.
The reason why I never use the bool type within a structure when I want
a compact representation is that bool takes a whole byte just to
represent one bit:
struct usebitfield {
	int a;
	unsigned int f:1, g:1, h:1, i:1, j:1;
	int b;
};

struct usebool {
	int a;
	bool f, g, h, i, j;
	int b;
};

struct useboolbf {
	int a;
	bool f:1, g:1, h:1, i:1, j:1;
	int b;
};
This is because each bool takes one byte, while the bitfields are put in
units of "unsigned int" (or bool for the 3rd struct). So in this
example, we need 5 bytes + 3 bytes alignment for the bool, but only 4
bytes to hold the "unsigned int" unit for the bitfields.
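A small userspace program confirms the layout (sizes are for a typical
LP64 target with GCC; exact padding is implementation-defined):

#include <stdbool.h>
#include <stdio.h>

struct usebitfield { int a; unsigned int f:1, g:1, h:1, i:1, j:1; int b; };
struct usebool     { int a; bool f, g, h, i, j; int b; };
struct useboolbf   { int a; bool f:1, g:1, h:1, i:1, j:1; int b; };

int main(void)
{
	/* Typically prints: 12 16 12 */
	printf("%zu %zu %zu\n", sizeof(struct usebitfield),
	       sizeof(struct usebool), sizeof(struct useboolbf));
	return 0;
}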
The choice between bool and bitfields must also take into account the
frequency of access to the variable, because bitfields require mask
operations to access the selected bit(s). You will notice that none of
these bitfields are accessed on the tracing fast-path: only in
slow-paths. Therefore, space gain is more important than speed here.
One might argue that I have so few of these fields here that it does not
make an actual difference to go for bitfield or bool. I am just trying
to choose types best suited for their intended purpose, ensuring they
are future-proof and will allow simply adding more fields using the same
type, as needed.
So I guess I'll go for unsigned int :1 (rather than uint:1, so we
keep source-level compatibility with user-space LTTng-UST).
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
External modules don't seem to actually execute depmod on make install,
even though the console output says so. Document in the README file that
depmod needs to be executed manually.