Topics

Invalid filename/mode in openat tracepoint data


alessandro.gario@...
 

Hello everyone,

I'll start with some backstory first: I wrote my own BPF library to trace functions/syscalls and yesterday I noticed that I am sometimes receiving broken openat() tracepoint data. This happens randomly, often when processes are created in a short burst (like opening a new terminal instance with zsh + oh-my-zsh installed).

I initially thought it was my fault, and proceeded to debug the generated IR code and double check my tracepoint data definition (which, for reference, can be found here: https://github.com/trailofbits/ebpfpub/blob/master/ebpfpub/src/tracepointserializers.cpp#L425). I ended up giving up, not finding the reason this was failing.

Today, I have tried to replicate the same functionality using BCC so I could compare the output with my library and I ended up inside the same weird behavior:

Full script here: https://gist.github.com/alessandrogario/968b9c3ea78559f470bc650c8496449e#file-bcc_openat_tracepoint-py

--
bpf_trace_printk("sys_enter_openat mode:%ld "
"filename:%s (%ld)\\n",
args->mode,
args->filename,
args->filename);

2608.223222000 b'git' 8998 b'sys_enter_openat mode:0 filename: (93849603522670)
--

I was able to replicate this problem on Ubuntu 20.20 (5.4.0), Arch Linux (5.7.9) and Ubuntu 19.10 (5.3.0).

Has anyone ever encountered this problem, or has a few pointers as to why it happening?

Thanks!

Alessandro


Tristan Mayfield
 

I don't have an answer, but I verified this with the following
bpftrace script and using the action of switching to zsh/oh-my-zsh
from bash.

---
tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
printf("sys_enter_openat mode:%ld filename:%s (%ld)\n",
args->mode, str(args->filename), args->filename);
}
---

Here's some example data (not all the generated output) with spaces
around some of the issue lines:

sys_enter_openat mode:0 filename: (94797689127022)

sys_enter_openat mode:0 filename:/usr/lib/locale/locale-archive
(139635662831568)
sys_enter_openat mode:0 filename:/usr/share/locale/locale.alias
(140728940893664)
sys_enter_openat mode:0
filename:/usr/share/locale/en_US/LC_MESSAGES/git.mo (94797710736928)
sys_enter_openat mode:0
filename:/usr/share/locale/en/LC_MESSAGES/git.mo (94797710737472)
sys_enter_openat mode:0
filename:/usr/share/locale-langpack/en_US/LC_MESSAGES/git.mo
(94797710737712)
sys_enter_openat mode:0
filename:/usr/share/locale-langpack/en/LC_MESSAGES/git.mo
(94797710737584)
sys_enter_openat mode:438 filename:/dev/null (139809161489144)
sys_enter_openat mode:0 filename:/etc/ld.so.cache (140236659837824)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libpcre2-8.so.0
(140236659879440)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libz.so.1
(140236659639520)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libpthread.so.0
(140236659640784)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libc.so.6
(140236659642080)

sys_enter_openat mode:0 filename: (94426721874030)

sys_enter_openat mode:0 filename:/usr/lib/locale/locale-archive
(140236658581456)
sys_enter_openat mode:0 filename:/usr/share/locale/locale.alias
(140728357496384)

I'm tempted to think that this is some behavior of the system I don't
understand yet, rather than being a bug. But I can't say for sure.

Tristan

On 7/24/20, alessandro.gario@... <alessandro.gario@...> wrote:
Hello everyone,

I'll start with some backstory first: I wrote my own BPF library to
trace functions/syscalls and yesterday I noticed that I am sometimes
receiving broken openat() tracepoint data. This happens randomly, often
when processes are created in a short burst (like opening a new
terminal instance with zsh + oh-my-zsh installed).

I initially thought it was my fault, and proceeded to debug the
generated IR code and double check my tracepoint data definition
(which, for reference, can be found here:
https://github.com/trailofbits/ebpfpub/blob/master/ebpfpub/src/tracepointserializers.cpp#L425).

I ended up giving up, not finding the reason this was failing.

Today, I have tried to replicate the same functionality using BCC so I
could compare the output with my library and I ended up inside the same
weird behavior:

Full script here:
https://gist.github.com/alessandrogario/968b9c3ea78559f470bc650c8496449e#file-bcc_openat_tracepoint-py

--
bpf_trace_printk("sys_enter_openat mode:%ld "
"filename:%s (%ld)\\n",
args->mode,
args->filename,
args->filename);

2608.223222000 b'git' 8998 b'sys_enter_openat mode:0 filename:
(93849603522670)
--

I was able to replicate this problem on Ubuntu 20.20 (5.4.0), Arch
Linux (5.7.9) and Ubuntu 19.10 (5.3.0).

Has anyone ever encountered this problem, or has a few pointers as to
why it happening?

Thanks!

Alessandro






alessandro.gario@...
 

Hello Tristan,

thanks for spending the time to check this out!

One thing I forgot to mention is that I can verify with strace that the filename parameter is always present.
I initially suspected that the pointer wasn't mapped at the time the probe attempted to read from it, but shouldn't the tracepoint interface make sure it is accessible?

Alessandro Gario

On Fri, Jul 24, 2020 at 10:27 am, Tristan Mayfield <mayfieldtristan@...> wrote:
I don't have an answer, but I verified this with the following
bpftrace script and using the action of switching to zsh/oh-my-zsh
from bash.
---
tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
printf("sys_enter_openat mode:%ld filename:%s (%ld)\n",
args->mode, str(args->filename), args->filename);
}
---
Here's some example data (not all the generated output) with spaces
around some of the issue lines:
sys_enter_openat mode:0 filename: (94797689127022)
sys_enter_openat mode:0 filename:/usr/lib/locale/locale-archive
(139635662831568)
sys_enter_openat mode:0 filename:/usr/share/locale/locale.alias
(140728940893664)
sys_enter_openat mode:0
filename:/usr/share/locale/en_US/LC_MESSAGES/git.mo (94797710736928)
sys_enter_openat mode:0
filename:/usr/share/locale/en/LC_MESSAGES/git.mo (94797710737472)
sys_enter_openat mode:0
filename:/usr/share/locale-langpack/en_US/LC_MESSAGES/git.mo
(94797710737712)
sys_enter_openat mode:0
filename:/usr/share/locale-langpack/en/LC_MESSAGES/git.mo
(94797710737584)
sys_enter_openat mode:438 filename:/dev/null (139809161489144)
sys_enter_openat mode:0 filename:/etc/ld.so.cache (140236659837824)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libpcre2-8.so.0
(140236659879440)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libz.so.1
(140236659639520)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libpthread.so.0
(140236659640784)
sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libc.so.6
(140236659642080)
sys_enter_openat mode:0 filename: (94426721874030)
sys_enter_openat mode:0 filename:/usr/lib/locale/locale-archive
(140236658581456)
sys_enter_openat mode:0 filename:/usr/share/locale/locale.alias
(140728357496384)
I'm tempted to think that this is some behavior of the system I don't
understand yet, rather than being a bug. But I can't say for sure.
Tristan
On 7/24/20, alessandro.gario@... <alessandro.gario@...> wrote:
Hello everyone,
I'll start with some backstory first: I wrote my own BPF library to
trace functions/syscalls and yesterday I noticed that I am sometimes
receiving broken openat() tracepoint data. This happens randomly, often
when processes are created in a short burst (like opening a new
terminal instance with zsh + oh-my-zsh installed).
I initially thought it was my fault, and proceeded to debug the
generated IR code and double check my tracepoint data definition
(which, for reference, can be found here:
https://github.com/trailofbits/ebpfpub/blob/master/ebpfpub/src/tracepointserializers.cpp#L425).
I ended up giving up, not finding the reason this was failing.
Today, I have tried to replicate the same functionality using BCC so I
could compare the output with my library and I ended up inside the same
weird behavior:
Full script here:
https://gist.github.com/alessandrogario/968b9c3ea78559f470bc650c8496449e#file-bcc_openat_tracepoint-py
--
bpf_trace_printk("sys_enter_openat mode:%ld "
"filename:%s (%ld)\\n",
args->mode,
args->filename,
args->filename);
2608.223222000 b'git' 8998 b'sys_enter_openat mode:0 filename:
(93849603522670)
--
I was able to replicate this problem on Ubuntu 20.20 (5.4.0), Arch
Linux (5.7.9) and Ubuntu 19.10 (5.3.0).
Has anyone ever encountered this problem, or has a few pointers as to
why it happening?
Thanks!
Alessandro


Tristan Mayfield
 

I ran the same test with strace. One of the file data points that doesn't show up is this:

bpftrace:
sys_enter_openat mode:0 filename: (93911401193582)

strace:
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3

But "locale-archive" does show up in different contexts in bpftrace.
The major commonality I'm seeing is that the file opened right before the "no-name" file seems to be a shared object that was (presumably) dynamically used. Here are some examples:

sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libc.so.6 (140092012560096)
sys_enter_openat mode:0 filename: (93826516217966)

sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libtinfo.so.6 (139814679237888)
sys_enter_openat mode:0 filename: (139814679027664)

sys_enter_openat mode:0 filename:/lib/x86_64-linux-gnu/libc.so.6 (140231836626656)
sys_enter_openat mode:0 filename: (94880667103342)

This might be a linking issue and openat isn't getting supplied a filename? I'll keep debugging since this is interesting. Have you looked through bug reports for bpftrace or BCC?

Tristan


Tristan Mayfield
 

Alessandro,

I figured out that it's non-deterministic. So sometimes certain commands (git, awk, rm, uname, etc.) will have an openat with no filename, but other times they will.
I ran these commands experimentally and got results similar to what I have below for all of them:

$ rm something
sys_enter_openat comm: rm pid:3512 filename:/etc/ld.so.cache (140398792747904)
sys_enter_openat comm: rm pid:3512 filename:/lib/x86_64-linux-gnu/libc.so.6 (140398792789520)
sys_enter_openat comm: rm pid:3512 filename:/usr/lib/locale/locale-archive (140398792339408)

sys_enter_openat comm: rm pid:3514 filename:/etc/ld.so.cache (139648615484288)
sys_enter_openat comm: rm pid:3514 filename:/lib/x86_64-linux-gnu/libc.so.6 (139648615525904)
sys_enter_openat comm: rm pid:3514 filename: (139648615075792)

Because it's been so consistent, I believe the missing file is... always? Most of the time? At least a good part of the time "/usr/lib/locale/locale-archive".
I'm not sure why an archive file would behave differently, but it seems to be causing this issue. You can use the below bpftrace script to figure out which commands most often create the no-name situation.

tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
        if( str(args->filename) == "") {
        printf("sys_enter_openat comm: %s pid:%d filename:%s (%ld)\n",comm,pid, str(args->filename), args->filename);
        }
}

Tristan


alessandro.gario@...
 

Hello Tristan!

That is the same path I found when debugging with strace! I think I also saw a missing comm string during my tests (with printk from BCC), but I would have to reproduce it again to be sure.

I'm going to test this one more time on kernel 4.18, as I don't remember finding this problem when I started writing the library on Ubuntu 18.10 (and maybe I'll also try to take a look at the openat implementation).

Thanks so much for your help!

Alessandro Gario

On Fri, Jul 24, 2020 at 10:11 am, Tristan Mayfield <mayfieldtristan@...> wrote:
Alessandro,
I figured out that it's non-deterministic. So sometimes certain commands (git, awk, rm, uname, etc.) will have an openat with no filename, but other times they will.
I ran these commands experimentally and got results similar to what I have below for all of them:
$ rm something
sys_enter_openat comm: rm pid:3512 filename:/etc/ld.so.cache (140398792747904)
sys_enter_openat comm: rm pid:3512 filename:/lib/x86_64-linux-gnu/libc.so.6 (140398792789520)
sys_enter_openat comm: rm pid:3512 filename:/usr/lib/locale/locale-archive (140398792339408)
sys_enter_openat comm: rm pid:3514 filename:/etc/ld.so.cache (139648615484288)
sys_enter_openat comm: rm pid:3514 filename:/lib/x86_64-linux-gnu/libc.so.6 (139648615525904)
sys_enter_openat comm: rm pid:3514 filename: (139648615075792)
Because it's been so consistent, I believe the missing file is... always? Most of the time? At least a good part of the time "/usr/lib/locale/locale-archive".
I'm not sure why an archive file would behave differently, but it seems to be causing this issue. You can use the below bpftrace script to figure out which commands most often create the no-name situation.
tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
if( str(args->filename) == "") {
printf("sys_enter_openat comm: %s pid:%d filename:%s (%ld)\n",comm,pid, str(args->filename), args->filename);
}
}
Tristan