Re: Accessing user memory and minor page faults

Alexei Starovoitov

On Thu, Aug 31, 2017 at 01:41:11PM -0700, Gianluca Borello via iovisor-dev wrote:

I wanted to share my recent experience with troubles accessing user memory
when page faults occur, and ask for opinions.

A bit of introduction first. It's no surprise that a good portion of
the tracers out there rely heavily on the ability to dereference user
memory for different purposes. Typical use cases in eBPF land can be found
in all the bcc scripts containing references to bpf_probe_read() and
bpf_probe_read_str() [1].

Both implementations internally use pagefault_disable() prior to actually
accessing the memory [2]: this is needed because accessing arbitrary user
memory might trigger a page fault, which could cause a rescheduling action
in case the mm->mmap_sem is contended or the fault requires loading the
missing page from backing storage (major fault). In most cases tracers
are not run from a context where it's safe to reschedule, either because
there is no user context or because we are in an atomic context (e.g.
tracepoint handlers being called in an RCU read-side critical section).
With page faults disabled, the memory can't be accessed in such cases, and
those BPF helpers will return, for example, -EFAULT.

One might think that in practice this is not a big problem, because most of
the time the memory has already been accessed by userspace and thus
correctly loaded. However, that is not the case: there are several instances
where, even if the memory is loaded, a minor page fault will be triggered
upon accessing the memory, preventing the tracer from being able to access
memory that is actually there.

I wanted to share with you two cases that I recently studied. They don't
happen often enough to be a deal breaker, but they do happen often enough
to be annoying. I'm not considering major page fault events (e.g. swap or
first executable load) since that part is straightforward and it's easier
to understand why a tracer can't access that memory in those cases.

Case 1: fork

When a process forks, the child's duplicated address space keeps
referencing the parent's memory (as long as that memory is only read, at
least). As the man page of fork() says, "...the only penalty that it incurs
is the time and memory required to duplicate the parent's page tables...".
So, one would expect that a child can keep referencing the parent's memory
without incurring any page faults if the parent already accessed that
memory, which is a typical case.

However, it turns out the kernel doesn't fully copy the page table of the
parent, and it tries to avoid copying page table entries referring to, for
example, shared read-only non-anonymous VMAs, in order to make the forking
more efficient [3]. This means that a newly created process will actually
incur a bunch of minor page faults when trying to access memory that was
just recently accessed from the parent before the fork.
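
The effect described above can be observed from userspace. The following is
a minimal sketch, not something from the original message: it assumes Linux
and Python 3, and the function names and the use of a temporary file are
purely illustrative. The parent maps a file read-only, touches every page,
then forks; the child reports how many minor faults it takes while touching
the same pages again. The number is only indicative, since the kernel may map
several pages per fault and the interpreter itself adds a few faults.

```python
import mmap
import os
import resource
import tempfile

PAGE = mmap.PAGESIZE


def touch(mapping, npages):
    """Read one byte from every page so that each page gets mapped."""
    total = 0
    for i in range(npages):
        total += mapping[i * PAGE]
    return total


def child_minor_faults(npages=256):
    """Return the minor faults a forked child takes on pages the parent
    already touched in a read-only, file-backed shared mapping."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * (npages * PAGE))
        f.flush()
        mapping = mmap.mmap(f.fileno(), npages * PAGE,
                            flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)
        touch(mapping, npages)  # parent faults every page in before forking

        rfd, wfd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: measure minor faults around a second pass over the pages.
            os.close(rfd)
            before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            touch(mapping, npages)
            after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            os.write(wfd, str(after - before).encode())
            os._exit(0)

        # Parent: collect the child's measurement.
        os.close(wfd)
        data = os.read(rfd, 64)
        os.close(rfd)
        os.waitpid(pid, 0)
        mapping.close()
        return int(data)


if __name__ == "__main__":
    print("minor faults in child:", child_minor_faults())
```

If the child's count is clearly above zero, the pages were resident but not
mapped in the child's page tables, which is exactly the situation in which
a tracer running with page faults disabled would fail to read them.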

What does this mean from a tracer point of view? Suppose you want to
intercept the arguments of an execve() system call that is being executed
right after a fork(): there are very good chances some of the passed
arguments are going to be constant strings right from the executable
binary, and there are very good chances the parent already referenced them
(or something in that surrounding) at some point and so they have already
been loaded from disk.

The child, however, will not find those loaded because of the above
behavior, and the kernel will incur a minor page fault during the
various copy_from_user() calls while executing the execve(). Since the
tracer is likely executing before these copy_from_user() calls happen (e.g.
hooking to a syscall tracepoint), reading that data with bpf_probe_read()
won't be possible. On my system, a simple "apt-get install foo" is enough to
consistently reproduce at least a couple of those missing reads every time.

Case 2: Automatic NUMA balancing

This one was very interesting for me to find. When the kernel supports
automatic NUMA balancing and it's enabled either manually or automatically
via CONFIG_NUMA_BALANCING_DEFAULT_ENABLED (like it happens in Ubuntu
kernels), the kernel will periodically unmap pages of multi-threaded
processes by writing PROT_NONE to the respective page table entries [4],
forcing a subsequent memory access to cause a minor page fault, so that the
page fault handler can decide to migrate the memory to another area closer
to the CPU where the process is currently running [5]. As you can imagine,
this means that if a process rarely accesses a portion of the memory, that
area will remain unmapped, preventing a tracer from accessing it. Unlike
the previous fork case, the criteria to decide which VMAs can be unmapped
are more aggressive, so even areas such as the stack can be targeted.
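
To check whether a given machine is subject to this behavior, one can look
at the kernel.numa_balancing sysctl and at the NUMA hinting counters in
/proc/vmstat. The following is a minimal sketch, assuming Linux and Python 3;
the function names are hypothetical, and the counters only appear on kernels
built with NUMA balancing support.

```python
def numa_balancing_state(path="/proc/sys/kernel/numa_balancing"):
    """Return the sysctl value as an int (0 means disabled), or None if
    the file is missing or unreadable (e.g. kernel built without support)."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def numa_counters(vmstat_text):
    """Extract the NUMA hinting counters from the text of /proc/vmstat."""
    wanted = ("numa_pte_updates", "numa_hint_faults")
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            counters[parts[0]] = int(parts[1])
    return counters


if __name__ == "__main__":
    print("numa_balancing:", numa_balancing_state())
    try:
        with open("/proc/vmstat") as f:
            print(numa_counters(f.read()))
    except OSError:
        print("/proc/vmstat not available")
```

A non-zero sysctl value together with growing numa_pte_updates and
numa_hint_faults counters suggests that page table entries are being
periodically marked and re-faulted as described above.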

What does this mean from a tracer point of view? Just the other day I
wanted to read a process's arguments and environment via
current->mm->arg_start, which points in the surroundings of the top of
the userspace stack, and bpf_probe_read() was failing. To my surprise: I had
never seen the top of the stack unmapped before. By manually printing the
PTE flags, it was clear the unmapping was caused by PROT_NONE and this NUMA
feature (unfortunately /proc/pid/pagemap didn't help since the output shown
by proc doesn't distinguish between PAGE_BIT_PRESENT and PAGE_BIT_PROTNONE,
which was the key to identify this behavior).
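
For reference, this is roughly what reading /proc/pid/pagemap looks like. It
is a minimal sketch assuming Linux and Python 3, with hypothetical function
names; the bit positions follow the kernel's pagemap documentation (bit 63
present, bit 62 swapped, bit 61 file-mapped or shared anonymous, bit 55
soft-dirty, bits 0-54 the PFN when present). The format carries a single
"present" bit, which is consistent with the observation above that it cannot
tell the two PTE states apart.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE
PFN_MASK = (1 << 55) - 1  # bits 0-54


def decode_pagemap_entry(entry):
    """Decode one 64-bit pagemap entry into its documented fields."""
    present = bool((entry >> 63) & 1)
    return {
        "present": present,
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared": bool((entry >> 61) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # The PFN is only meaningful for present pages, and is reported as
        # zero to unprivileged readers on recent kernels.
        "pfn": (entry & PFN_MASK) if present else None,
    }


def read_pagemap_entry(pid, vaddr):
    """Read and decode the pagemap entry covering virtual address vaddr."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // PAGE_SIZE) * 8)  # one 8-byte entry per page
        raw = f.read(8)
    if len(raw) != 8:
        raise OSError("short read from pagemap")
    return decode_pagemap_entry(struct.unpack("=Q", raw)[0])
```

Calling read_pagemap_entry("self", addr) for an address inside a mapped
region returns the decoded flags for that page; none of them says whether a
plain access would fault.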

I wanted to share this story to ultimately ask the question: have you ever
faced something similar in your tracing experiences? If so, have you worked
around the problem in some way? One solution is to try to make sure the
tracer actually accesses the memory immediately after either the userspace
process or the respective kernel code accesses it, effectively shielding
the tracing code from the page faults, but that is not always easy or
possible to do.

Also, and this might be pure uneducated speculation, I'm wondering if we
could ever get to a point where, for example, instead of immediately
aborting the page fault handler when the context is not right [6], the
abort could be delayed until we are sure either that the mm->mmap_sem
semaphore can't be acquired without contention or that the fault is
actually going to be major. From what I have seen, such a change would
dramatically reduce the number of these corner cases, although of course
what I just said might just be gibberish since I've never worked in that

Thank you for the excellent summary of the problem.
Do you think adding a new read helper that relies on __get_user_pages_fast()
can solve some of it? If I understand the code correctly, it won't deal with
prot_none, so NUMA rebalancing won't be solved, but for the fork() case it
should work. Doing down_read_trylock(&mm->mmap_sem) and calling into
__get_user_pages_locked() won't work, since it needs to sleep in some cases,
which we cannot do from a bpf program.

Another alternative is to create a new type of bpf program attached to
syscalls that wouldn't run under RCU and would allow full copy_from_user()
there, but it would also mean that we'd need to redesign map access from
such progs.

Yet another alternative is to trace LSM hooks instead of syscalls.
Instead of sys_open(), attach to security_file_open(), and all user-supplied
data will be available at that point.
