kretprobe not consistently being triggered for sys_futex calls?


Ashley <ashley@...>
 

Hi all,

I have a problem which has been plaguing me for a few days, and was hoping for some insight or information that might help get to the bottom of it. I think it might be a bug in the kernel instrumentation code or kernel/futex.c, but it could equally just be me misunderstanding something.

I've registered a kprobe and a kretprobe on sys_futex on kernel 4.15.0-43 (Ubuntu 18.04 LTS), and I'm finding quite unusual results where I rarely seem to get the kretprobe triggering for sys_futex.

This seems to be consistent across different types of futex operations - for example, sometimes I trace a futex wake operation which never triggers the retprobe, and similarly for other operations, including wake.

I haven't yet found any combinations of parameters to sys_futex that causes this reliably. I've read some of kernel/futex.c and the functions it calls into to see if perhaps it is doing some crazy things to jump back into user space which would avoid triggering the return probe, but I didn't see anything that looked particularly suspicious (disclaimer: I'm nowhere near an expert on the topic so it's likely I'd miss something like this)


If I run this alongside a small shell script along the lines of (excuse the fish syntax!):
while true
echo "a" >> file
end

I get the following output after a minute or so:

fish-16555 [001] .... 237577.357940: 0x00000001: Total futex in: 363
fish-16555 [001] .N.. 237577.357946: 0x00000001: Total futex out: 176
fish-16555 [001] .N.. 237577.357947: 0x00000001: Total write in: 258608
fish-16555 [001] .N.. 237577.357947: 0x00000001: Total write out: 258607

I suspect the difference of one between the write in/out is due to either instrumentation starting while another thread was already inside a write system call, or perhaps a process was killed inside the system call (if that's possible?).

I am quite sure there is an issue with futex that can't be explained by any of these possibilities, though: I'm able to observe specific, continuing processes which have managed to execute a futex call and then get back to user space without triggering the retprobe, and even trigger multiple futex calls in consequence without triggering a single retprobe (I'm not doing anything crazy in my user code -- just spawning some std C++ threads, and I see the same issues tracing MySQL Server).

On the other hand, I've also seen the same program do the same set of actions on a different run, but manage to trigger every retprobe (or some subset, at least) that I expect them to.

I've written a StackOverflow question here: https://stackoverflow.com/questions/54579794/why-is-a-kretprobe-on-sys-futex-called-less-often-than-a-corresponding-kprobe but I anticipate that the question is too niche to get much visibility, so I was hoping it'd be okay to broadcast the question here as well. If we end up coming to an answer here then I'm very happy to accept your answer there.

Any input at all is appreciated :)

Thanks in advance!
Ashley


PS: Apologies if I'm using the mailing list incorrectly; please feel free to prod me about etiquette - I've never used one before

PPS: Incase it helps, here's an abridged log of my initial program which traces specific threads' futex operations, which was my first exposure to this issue. FUTEX << signifies a kprobe call, >> FUTEX a kretprobe call, and {WAKE XXXX} are annotations I added myself based on some extra output I removed to abridge this output:

main-6311 .... 238378.639978: 0x00000001: FUTEX <<
main-6312 .... 238378.640005: 0x00000001: FUTEX <<
main-6310 .... 238379.141966: 0x00000001: FUTEX{WAKE 6311} <<
main-6310 .... 238379.141986: 0x00000001: FUTEX <<
main-6311 d... 238379.142007: 0x00000001: >> FUTEX
main-6311 .... 238379.642507: 0x00000001: FUTEX{WAKE 6312} <<
main-6311 d... 238379.642525: 0x00000001: >> FUTEX
main-6311 .... 238379.642563: 0x00000001: FUTEX{WAKE 6310} <<
main-6311 dN.. 238379.642573: 0x00000001: >> FUTEX
main-6310 .... 238379.642581: 0x00000001: FUTEX <<
main-6312 .... 238380.143266: 0x00000001: FUTEX <<
main-6312 .... 238380.143290: 0x00000001: FUTEX{WAKE 6310} <<
main-6310 d... 238380.143305: 0x00000001: >> FUTEX

The notable thing is that we see threads like main-6310 issuing multiple futex calls without any retprobe triggering between them (lines 3-4, 10).


Paul Chaignon
 

Hi Ashley,

Could it be that you're reaching the maxactive bound and thus some probes
are missed? From the kprobes documentation [1]:

While the probed function is executing, its return address is
stored in an object of type kretprobe_instance. Before calling
register_kretprobe(), the user sets the maxactive field of the
kretprobe struct to specify how many instances of the specified
function can be probed simultaneously. register_kretprobe()
pre-allocates the indicated number of kretprobe_instance objects.

For example, if the function is non-recursive and is called with a
spinlock held, maxactive = 1 should be enough. If the function is
non-recursive and can never relinquish the CPU (e.g., via a semaphore
or preemption), NR_CPUS should be enough. If maxactive <= 0, it is
set to a default value. If CONFIG_PREEMPT is enabled, the default
is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.

I tried to patch bcc [2] to set the maxactive value to 1000 on
attach_kretprobe, but it didn't make a difference when running your
script. Maybe you could check the nmissed field mentioned in the
documentation (with a kernel module instead of bpf)?

Paul

1 - https://www.kernel.org/doc/Documentation/kprobes.txt
2 - https://github.com/pchaigno/bcc/tree/bump-maxactive


Paul Chaignon
 

On Fri, Feb 08, 2019 at 01:49:44PM -0800, Ashley wrote:
Hi Paul,

Thanks for the prompt and detailed response! I hadn't considered trying it with a kernel module, but I set one up to measure this as per your suggestion, and -- almost frustratingly -- it seems to catch them all.

Here's output from the kernel module:
[ 2130.067718] sys_futex returned 1 and took 6419 ns to execute 372 task 6920
[ 2130.067829] sys_futex returned 0 and took 500407745 ns to execute 376 task 6921
[ 2130.567894] sys_futex returned 1 and took 6279 ns to execute 380 task 6921
[ 2130.567923] sys_futex returned 1 and took 3998 ns to execute 381 task 6921
[ 2130.567925] sys_futex returned 0 and took 500442358 ns to execute 381 task 6920
[ 2130.568043] sys_futex returned 0 and took 1000852410 ns to execute 385 task 6922
[ 2131.174619] sys_futex returned 0 and took 779 ns to execute 399 task 6922
[ 2131.174713] sys_futex returned 1 and took 5393 ns to execute 400 task 6922
[ 2131.174887] sys_futex returned 0 and took 607250418 ns to execute 403 task 6920
What's the source code of that kernel module?


And here's my output from my eBPF program:
ashley@ashley-VirtualBox ~/4/e> cat test | grep "FUTEX <<"
            main-6921  [000] ....  2129.844847: 0x00000001: FUTEX <<
            main-6922  [000] ....  2129.844870: 0x00000001: FUTEX <<
            main-6920  [000] ....  2130.344893: 0x00000001: FUTEX <<
            main-6920  [000] ....  2130.344906: 0x00000001: FUTEX <<
            main-6921  [000] ....  2130.845068: 0x00000001: FUTEX <<
            main-6921  [000] ....  2130.845102: 0x00000001: FUTEX <<
            main-6920  [000] ....  2130.845113: 0x00000001: FUTEX <<
            main-6922  [000] ....  2131.451799: 0x00000001: FUTEX <<
            main-6922  [000] .N..  2131.451887: 0x00000001: FUTEX <<
ashley@ashley-VirtualBox ~/4/e> cat test | grep ">> FUTEX"
            main-6921  [000] d...  2130.345015: 0x00000001: >> FUTEX
            main-6922  [000] d...  2130.845228: 0x00000001: >> FUTEX

So my eBPF program received all 9 futex calls as expected, but only 2 returns, whereas the kernel module successfully gets all of them consistently.

I'm also able to accurately track k[ret]probes for sys_write and sys_read through eBPF, even though they seem to happen in higher density on my system. The problem still occurs even if I don't use bpf_trace_printk within the retprobe also.
Weird, I have the same problem you have for sys_futex if I adapt the
script to track sys_write instead...


Any ideas where might be a good place to look from here? I'm totally stumped; it's quite tempting to just migrate my code to kernel modules, but everything else I'm instrumenting seems to work perfectly in eBPF, so it'd be a shame to do that.
I'm thinking it could be caused by a difference in kprobes configuration
between the kernel module and bcc.


Best,
Ashley
Paul


Ashley <ashley@...>
 

What's the source code of that kernel module?

I adapted code from an example in a repository Spotify have, which is here: https://github.com/spotify/linux/blob/master/samples/kprobes/kretprobe_example.c. The only change I made was just to print some extra information to correlate the calls with specific threads, and I added a hack before the printk call to only do so if the process ID was within a boundary which my program's process IDs are located within. The changes look like this:

    if (ri->task->pid >= 7000 || ri->task->pid <= 5000) return 0;
    printk(KERN_INFO "%s returned %d and took %lld ns to execute %d task %s %d\n",
            func_name, retval, (long long)delta, calls, ri->task->comm, ri->task->pid);

If I don't filter the print calls with the pid check, I start getting high numbers of missed calls, which I believe is because the printk call causes rsyslog to make futex calls; re-invoking the probe and saturating itself with calls. I wonder if this is what was happening with your sys_write code? I'm able to reliably track sys_write only when I'm not causing writing calls from within the eBPF program.

Another thing of note is that lowering the maxactive parameter to 1 or 10 leads to it never printing anything, which I find strange too. Is the maxactive parameter local to this specific kernel function and probe? I find it quite strange that with the default eBPF configuration I can get plenty of retprobes for syscalls like write but seemingly few for futex operations (which aren't necessarily denser) 

If I track specific futex operations, such as `futex_wake` and `futex_wait`, it seems like the retprobes for these work perfectly, which I think is also a bit strange.



One thing that comes to mind: is it possible that some optimisations are interfering with this? In `do_futex` (which I assume `sys_futex` is hooking into), most operations are of the form:

return futex_wake(uaddr, flags, val, val3);

I seem to be able to reliably trace the return of these specific futex operations from eBPF (I have tried `_wake` and `_wait`), but they mostly return values, rather than another function call. Is it possible that the retprobe isn't being triggered because control jumps into `futex_X` and returns straight back to user space?

This wouldn't explain why the probes added by the kernel module can detect them, though, so I doubt this is the problem.

>I'm thinking it could be caused by a difference in kprobes configuration
>between the kernel module and bcc.

I'll look into playing with the parameters in BCC and see if I get anywhere with that. I've been using the pre-built binaries to this point so I'll have to spend some time setting that up, so it might be a few days before I get anywhere with this. I'll update you if I find anything with this.

Ashley

On Sat, Feb 9, 2019 at 6:03 PM Paul Chaignon <paul.chaignon@...> wrote:

On Fri, Feb 08, 2019 at 01:49:44PM -0800, Ashley wrote:
> Hi Paul,
>
> Thanks for the prompt and detailed response! I hadn't considered trying it with a kernel module, but I set one up to measure this as per your suggestion, and -- almost frustratingly -- it seems to catch them all.
>
> Here's output from the kernel module:
> [ 2130.067718] sys_futex returned 1 and took 6419 ns to execute 372 task 6920
> [ 2130.067829] sys_futex returned 0 and took 500407745 ns to execute 376 task 6921
> [ 2130.567894] sys_futex returned 1 and took 6279 ns to execute 380 task 6921
> [ 2130.567923] sys_futex returned 1 and took 3998 ns to execute 381 task 6921
> [ 2130.567925] sys_futex returned 0 and took 500442358 ns to execute 381 task 6920
> [ 2130.568043] sys_futex returned 0 and took 1000852410 ns to execute 385 task 6922
> [ 2131.174619] sys_futex returned 0 and took 779 ns to execute 399 task 6922
> [ 2131.174713] sys_futex returned 1 and took 5393 ns to execute 400 task 6922
> [ 2131.174887] sys_futex returned 0 and took 607250418 ns to execute 403 task 6920

What's the source code of that kernel module?

>
> And here's my output from my eBPF program:
> ashley@ashley-VirtualBox ~/4/e> cat test | grep "FUTEX <<"
>             main-6921  [000] ....  2129.844847: 0x00000001: FUTEX <<
>             main-6922  [000] ....  2129.844870: 0x00000001: FUTEX <<
>             main-6920  [000] ....  2130.344893: 0x00000001: FUTEX <<
>             main-6920  [000] ....  2130.344906: 0x00000001: FUTEX <<
>             main-6921  [000] ....  2130.845068: 0x00000001: FUTEX <<
>             main-6921  [000] ....  2130.845102: 0x00000001: FUTEX <<
>             main-6920  [000] ....  2130.845113: 0x00000001: FUTEX <<
>             main-6922  [000] ....  2131.451799: 0x00000001: FUTEX <<
>             main-6922  [000] .N..  2131.451887: 0x00000001: FUTEX <<
> ashley@ashley-VirtualBox ~/4/e> cat test | grep ">> FUTEX"
>             main-6921  [000] d...  2130.345015: 0x00000001: >> FUTEX
>             main-6922  [000] d...  2130.845228: 0x00000001: >> FUTEX
>
> So my eBPF program received all 9 futex calls as expected, but only 2 returns, whereas the kernel module successfully gets all of them consistently.
>
> I'm also able to accurately track k[ret]probes for sys_write and sys_read through eBPF, even though they seem to happen in higher density on my system. The problem still occurs even if I don't use bpf_trace_printk within the retprobe also.

Weird, I have the same problem you have for sys_futex if I adapt the
script to track sys_write instead...

>
> Any ideas where might be a good place to look from here? I'm totally stumped; it's quite tempting to just migrate my code to kernel modules, but everything else I'm instrumenting seems to work perfectly in eBPF, so it'd be a shame to do that.

I'm thinking it could be caused by a difference in kprobes configuration
between the kernel module and bcc.

>
> Best,
> Ashley

Paul


Ashley <ashley@...>
 

So I upgraded my kernel version to 5.0.0-rc6, and it seems to consistently work perfectly now, even with a much higher density of futex calls than I was working with before.

I'm still curious what the issue is, and I sadly can't say with any confidence that it's definitely the kernel version, since I accidentally broke my filesystem trying to resize it and had to make a fresh VM, so I may have different BCC or dependency versions installed on this machine, but I'm hoping to find time to dig a bit more in a few weeks to see if I can determine what was going on here.

Thanks again for the help, Paul! I wouldn't have considered playing with the kernel version without realising it was working fine  for the kernel module.

Regards,
Ashley


Paul Chaignon
 

On Tue, Feb 12, 2019 at 07:02:58AM -0800, Ashley wrote:
So I upgraded my kernel version to 5.0.0-rc6, and it seems to consistently
work perfectly now, even with a much higher density of futex calls than I
was working with before.

I'm still curious what the issue is, and I sadly can't say with any
confidence that it's definitely the kernel version, since I accidentally
broke my filesystem trying to resize it and had to make a fresh VM, so I
may have different BCC or dependency versions installed on this machine,
but I'm hoping to find time to dig a bit more in a few weeks to see if I
can determine what was going on here.
bcc now supports the maxactive kretprobe parameter [1], so if you want to
test whether a higher maxactive value solves the issue you had, it's as
easy as calling BPF.attach_kretprobe with a maxactive=your_value argument.

1 - https://github.com/iovisor/bcc/pull/2224


Thanks again for the help, Paul! I wouldn't have considered playing with
the kernel version without realising it was working fine  for the kernel
module.

Regards,
Ashley