Gianluca Borello <g.borello@...>
I wanted to share my recent experience with trouble accessing user memory when page faults occur, and ask for opinions.
A bit of introduction first. It's no surprise that a very good portion of the tracers out there heavily leverage the ability to dereference user memory for different purposes. Typical use cases in eBPF land can be found in all the bcc scripts containing references to bpf_probe_read() and bpf_probe_read_str().
Both implementations internally use pagefault_disable() prior to actually accessing the memory: this is needed because accessing arbitrary user memory might trigger a page fault, which could cause a rescheduling action if mm->mmap_sem is contended or if the fault requires loading the missing page from backing storage (major fault). In most cases tracers are not run from a context where it's safe to reschedule, either because there is no user context or because we are in an atomic context (e.g. tracepoint handlers being called in an RCU read-side critical section). With page faults disabled, in such cases the memory can't be accessed and those BPF helpers will return an error, for example -EFAULT.
One might think that in practice this is not a big problem, because most of the time the memory has already been accessed by userspace and is thus correctly loaded. That is not the case, however: there are several instances where, even if the memory is loaded, a minor page fault will be triggered upon accessing it, preventing the tracer from reading memory that is actually there.
I wanted to share with you two cases that I recently studied; they don't happen often enough to be a deal breaker, but they do happen often enough to be annoying. I'm not considering major page fault events (e.g. swap or first executable load), since that part is straightforward and it's easier to understand why a tracer can't access that memory in those cases.
Case 1: fork
When a process forks, the child's duplicated address space keeps referencing the parent's memory (at least until that memory is written to). As the man page of fork() says, "...the only penalty that it incurs is the time and memory required to duplicate the parent's page tables...". So, one would expect that a child can keep referencing the parent's memory without incurring any page faults if the parent already accessed that memory, which is a typical case.
However, it turns out the kernel doesn't fully copy the parent's page table: to make forking more efficient, it avoids copying page table entries referring to, for example, shared read-only non-anonymous VMAs. This means that a newly created process will actually incur a bunch of minor page faults when trying to access memory that was accessed from the parent just before the fork.
What does this mean from a tracer point of view? Suppose you want to intercept the arguments of an execve() system call that is being executed right after a fork(): there are very good chances some of the passed arguments are constant strings coming straight from the executable binary, and very good chances the parent already referenced them (or something in their vicinity) at some point, so they have already been loaded from disk.
The child, however, will not find those pages mapped because of the above behavior, and the kernel will incur a minor page fault during the various copy_from_user() calls while executing the execve(). Since the tracer likely executes before these copy_from_user() calls happen (e.g. hooking a syscall tracepoint), reading that data with bpf_probe_read() won't be possible. On my system, a simple "apt-get install foo" consistently reproduces a couple of those missing reads every time.
Case 2: Automatic NUMA balancing
This one was very interesting for me to find. When the kernel supports automatic NUMA balancing and it's enabled, either manually or via CONFIG_NUMA_BALANCING_DEFAULT_ENABLED (as happens in Ubuntu kernels), the kernel will periodically unmap pages of multi-threaded processes by writing PROT_NONE to the respective page table entries, forcing a subsequent memory access to cause a minor page fault so that the page fault handler can decide to migrate the memory to another area closer to the CPU where the process is currently running. As you can imagine, this means that if a process rarely accesses a portion of its memory, that area will remain unmapped, preventing a tracer from accessing it. Unlike the previous fork case, the criteria used to decide which VMAs can be unmapped are more aggressive, so even areas such as the stack can be targeted.
What does this mean from a tracer point of view? Just the other day I wanted to read a process's arguments and environment via current->mm->arg_start, which points near the top of the userspace stack, and to my surprise bpf_probe_read() was failing; I had never seen the top of the stack unmapped before. By manually printing the PTE flags, it was clear the unmapping was caused by PROT_NONE and this NUMA feature (unfortunately /proc/pid/pagemap didn't help, since its output doesn't distinguish between PAGE_BIT_PRESENT and PAGE_BIT_PROTNONE, which was the key to identifying this behavior).
I wanted to share this story to ultimately ask the question: have you ever faced something similar in your tracing experiences? If so, have you worked around the problem in some way? One solution is to try to make sure the tracer actually accesses the memory immediately after either the userspace process or the respective kernel code accesses it, effectively shielding the tracing code from the page faults, but that is not always easy or possible to do.
Also, and this might be pure uneducated speculation, I'm wondering if we could ever get to a point where, for example, instead of immediately aborting the page fault handler when the context is not right, the abort could be delayed until we are sure either that mm->mmap_sem can't be acquired without contention or that the fault is actually going to be major. From what I have seen, such a change would dramatically reduce the amount of these corner cases, although of course what I just said might be gibberish since I've never worked in that code.
Thanks for reading.