kretprobe not consistently being triggered for sys_futex calls?
Ashley <ashley@...>
Hi all,
I have a problem that has been plaguing me for a few days, and I was hoping for some insight or information that might help get to the bottom of it. I think it might be a bug in the kernel instrumentation code or in kernel/futex.c, but it could equally just be me misunderstanding something.
I've registered a kprobe and a kretprobe on sys_futex on kernel 4.15.0-43 (Ubuntu 18.04 LTS), and I'm seeing quite unusual results: the kretprobe only rarely triggers.
This seems to be consistent across different types of futex operations. For example, sometimes I trace a futex wake operation that never triggers the retprobe, and the same happens for the other operations.
I haven't yet found any combination of parameters to sys_futex that causes this reliably. I've read some of kernel/futex.c and the functions it calls into, to see if perhaps it does something unusual to jump back into user space in a way that would avoid triggering the return probe, but I didn't see anything that looked particularly suspicious (disclaimer: I'm nowhere near an expert on the topic, so it's quite possible I'd miss something like that).
I have a small Python program which should be able to demonstrate this at https://gist.githubusercontent.com/ashleydavies/465f9171c3530284f477ae7097966156/raw/f0d86fdccf11441c46e77451fbbb29ded62cb0d2/trace_example.py
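For reference, the approach boils down to a kprobe/kretprobe pair on each syscall bumping per-event counters. Here's a minimal sketch of the same idea (the gist is the authoritative version -- the handler names and counter layout below are my own, and I read the counters from user space here rather than printing from the probes as the gist does):

#!/usr/bin/env python3
# Minimal sketch of the counting approach (assumes bcc's Python bindings;
# handler names and counter layout are illustrative, not the gist's own).
from bcc import BPF
import time

prog = r"""
BPF_ARRAY(counts, u64, 4);  // 0/1: futex in/out, 2/3: write in/out

static inline void bump(int idx) {
    u64 *val = counts.lookup(&idx);
    if (val)
        __sync_fetch_and_add(val, 1);
}

int futex_in(struct pt_regs *ctx)  { bump(0); return 0; }
int futex_out(struct pt_regs *ctx) { bump(1); return 0; }
int write_in(struct pt_regs *ctx)  { bump(2); return 0; }
int write_out(struct pt_regs *ctx) { bump(3); return 0; }
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("futex"), fn_name="futex_in")
b.attach_kretprobe(event=b.get_syscall_fnname("futex"), fn_name="futex_out")
b.attach_kprobe(event=b.get_syscall_fnname("write"), fn_name="write_in")
b.attach_kretprobe(event=b.get_syscall_fnname("write"), fn_name="write_out")

while True:
    time.sleep(5)
    c = b["counts"]
    print("futex in/out: %d/%d, write in/out: %d/%d"
          % (c[0].value, c[1].value, c[2].value, c[3].value))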
If I run this alongside a small shell script along the lines of (excuse the fish syntax!):
while true
    echo "a" >> file
end
I get the following output after a minute or so:
fish-16555 [001] .... 237577.357940: 0x00000001: Total futex in: 363
fish-16555 [001] .N.. 237577.357946: 0x00000001: Total futex out: 176
fish-16555 [001] .N.. 237577.357947: 0x00000001: Total write in: 258608
fish-16555 [001] .N.. 237577.357947: 0x00000001: Total write out: 258607
I suspect the difference of one between the write in/out counts is due either to instrumentation starting while another thread was already inside a write system call, or perhaps to a process being killed inside the system call (if that's possible?).
I am quite sure there is an issue with futex that can't be explained by any of these possibilities, though: I'm able to observe specific, still-running processes which have managed to execute a futex call and get back to user space without triggering the retprobe, and even to make multiple futex calls in succession without a single retprobe firing. (I'm not doing anything crazy in my user code -- just spawning some standard C++ threads -- and I see the same issues tracing MySQL Server.)
On the other hand, I've also seen the same program perform the same set of actions on a different run and trigger every retprobe (or at least some subset) that I'd expect it to.
I've written a StackOverflow question here: https://stackoverflow.com/questions/54579794/why-is-a-kretprobe-on-sys-futex-called-less-often-than-a-corresponding-kprobe but I anticipate that the question is too niche to get much visibility, so I was hoping it'd be okay to broadcast it here as well. If we end up coming to an answer here, I'm very happy to accept your answer there too.
Any input at all is appreciated :)
Thanks in advance!
Ashley
PS: Apologies if I'm using the mailing list incorrectly; please feel free to prod me about etiquette - I've never used one before
PPS: In case it helps, here's an abridged log from my initial program, which traces specific threads' futex operations and was my first exposure to this issue. FUTEX << signifies a kprobe call, >> FUTEX a kretprobe call, and {WAKE XXXX} are annotations I added myself based on some extra output I removed to abridge this log:
main-6311 .... 238378.639978: 0x00000001: FUTEX <<
main-6312 .... 238378.640005: 0x00000001: FUTEX <<
main-6310 .... 238379.141966: 0x00000001: FUTEX{WAKE 6311} <<
main-6310 .... 238379.141986: 0x00000001: FUTEX <<
main-6311 d... 238379.142007: 0x00000001: >> FUTEX
main-6311 .... 238379.642507: 0x00000001: FUTEX{WAKE 6312} <<
main-6311 d... 238379.642525: 0x00000001: >> FUTEX
main-6311 .... 238379.642563: 0x00000001: FUTEX{WAKE 6310} <<
main-6311 dN.. 238379.642573: 0x00000001: >> FUTEX
main-6310 .... 238379.642581: 0x00000001: FUTEX <<
main-6312 .... 238380.143266: 0x00000001: FUTEX <<
main-6312 .... 238380.143290: 0x00000001: FUTEX{WAKE 6310} <<
main-6310 d... 238380.143305: 0x00000001: >> FUTEX
The notable thing is that we see threads like main-6310 issuing multiple futex calls without any retprobe triggering in between (log lines 3-4 and 10).
Paul Chaignon
Hi Ashley,
Could it be that you're reaching the maxactive bound and thus some probes are missed? From the kprobes documentation [1]:

    While the probed function is executing, its return address is stored in
    an object of type kretprobe_instance. Before calling register_kretprobe(),
    the user sets the maxactive field of the kretprobe struct to specify how
    many instances of the specified function can be probed simultaneously.
    register_kretprobe() pre-allocates the indicated number of
    kretprobe_instance objects. For example, if the function is non-recursive
    and is called with a spinlock held, maxactive = 1 should be enough. If the
    function is non-recursive and can never relinquish the CPU (e.g., via a
    semaphore or preemption), NR_CPUS should be enough. If maxactive <= 0, it
    is set to a default value. If CONFIG_PREEMPT is enabled, the default is
    max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS.

sys_futex seems like a prime candidate for hitting that bound: a futex wait can sleep for a long time, so many calls can be in flight at once, well beyond those defaults.

I tried to patch bcc [2] to set the maxactive value to 1000 on attach_kretprobe, but it didn't make a difference when running your script. Maybe you could check the nmissed field mentioned in the documentation (with a kernel module instead of bpf)?

Paul

1 - https://www.kernel.org/doc/Documentation/kprobes.txt
2 - https://github.com/pchaigno/bcc/tree/bump-maxactive
Paul Chaignon
On Fri, Feb 08, 2019 at 01:49:44PM -0800, Ashley wrote:
> Hi Paul, [...]

What's the source code of that kernel module?

Weird, I have the same problem you have for sys_futex if I adapt the script to track sys_write instead... I'm thinking it could be caused by a difference in kprobes configuration between the kernel module and bcc.

Paul
Ashley <ashley@...>
> What's the source code of that kernel module?
I adapted code from an example in a repository Spotify maintains, here: https://github.com/spotify/linux/blob/master/samples/kprobes/kretprobe_example.c.
The only changes I made were to print some extra information to correlate the calls with specific threads, and a hack before the printk call so it only fires when the process ID falls within the range my program's process IDs occupy.
The changes look like this:
/* hack: only report pids in the range my program's threads occupy */
if (ri->task->pid >= 7000 || ri->task->pid <= 5000)
        return 0;

printk(KERN_INFO "%s returned %d and took %lld ns to execute %d task %s %d\n",
        func_name, retval, (long long)delta, calls, ri->task->comm, ri->task->pid);
If I don't filter the printk calls with the pid check, I start getting high numbers of missed calls. I believe that's because printk causes rsyslog to make futex calls of its own, re-triggering the probe and saturating it with calls. I wonder if that's what was happening with your sys_write test? I'm only able to track sys_write reliably when I'm not generating write calls from within the eBPF program itself.
Another thing of note: lowering the maxactive parameter to 1 or 10 leads to it never printing anything, which I also find strange. Is the maxactive parameter local to this specific kernel function and probe? It seems odd that with the default eBPF configuration I can get plenty of retprobes for syscalls like write, but seemingly few for futex operations (which aren't necessarily more frequent).
If I track specific futex operations, such as `futex_wake` and `futex_wait`, the retprobes for those seem to work perfectly, which I also find a bit strange.
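For reference, I'm attaching to those helpers roughly like this (a minimal sketch; `futex_wait` is an internal symbol in kernel/futex.c rather than a stable interface, so this is specific to my kernel version, and the handler name is just illustrative):

from bcc import BPF

b = BPF(text=r"""
int wait_ret(struct pt_regs *ctx) {
    // PT_REGS_RC(ctx) holds the helper's return value
    bpf_trace_printk(">> futex_wait -> %d\n", PT_REGS_RC(ctx));
    return 0;
}
""")
# futex_wait may be renamed, inlined, or absent on other kernel versions
b.attach_kretprobe(event="futex_wait", fn_name="wait_ret")
b.trace_print()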
One thing that comes to mind: is it possible that some optimisation is interfering with this? In `do_futex` (which I assume `sys_futex` calls into), most operations are of the form:
return futex_wake(uaddr, flags, val, val3);
I seem to be able to reliably trace the return of these specific futex operations from eBPF (I have tried `futex_wake` and `futex_wait`), and they mostly return plain values rather than calling into yet another function. Is it possible that the retprobe isn't being triggered because control jumps into `futex_X` and returns straight back to user space from there?
That wouldn't explain why the probes added by the kernel module do catch these returns, though, so I doubt this is the problem.
> I'm thinking it could be caused by a difference in kprobes configuration
> between the kernel module and bcc.
I'll look into playing with the parameters in bcc to see if I get anywhere with that. I've been using the pre-built binaries up to this point, so I'll have to spend some time setting up a build; it might be a few days before I make any progress. I'll update you if I find anything.
Ashley
Ashley <ashley@...>
So I upgraded my kernel version to 5.0.0-rc6, and it seems to consistently work perfectly now, even with a much higher density of futex calls than I was working with before.
I'm still curious what the issue was, and I sadly can't say with any confidence that it's definitely the kernel version: I accidentally broke my filesystem trying to resize it and had to make a fresh VM, so I may have different bcc or dependency versions installed on this machine. I'm hoping to find time to dig a bit more in a few weeks to see if I can determine what was going on here.

Thanks again for the help, Paul! I wouldn't have considered playing with the kernel version without realising it was working fine for the kernel module.

Regards,
Ashley
Paul Chaignon
On Tue, Feb 12, 2019 at 07:02:58AM -0800, Ashley wrote:
> So I upgraded my kernel version to 5.0.0-rc6, and it seems to consistently
> work perfectly now, even with a much higher density of futex calls than I
> was working with before.

bcc now supports the maxactive kretprobe parameter [1], so if you want to test whether a higher maxactive value solves the issue you had, it's as easy as calling BPF.attach_kretprobe with a maxactive=your_value argument.
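For example (assuming the BPF program defines a handler named futex_out):

b.attach_kretprobe(event=b.get_syscall_fnname("futex"),
                   fn_name="futex_out", maxactive=1000)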
1 - https://github.com/iovisor/bcc/pull/2224