Accessing user memory and minor page faults
Gianluca Borello <g.borello@...>
I wanted to share my recent experience with troubles accessing user memory when page faults occur, and ask for opinions.
A bit of introduction first. It's no surprise that a good portion of the tracers out there rely heavily on the ability to dereference user memory for different purposes. Typical use cases in eBPF land can be found in all the bcc scripts containing references to bpf_probe_read() and bpf_probe_read_str().
Both implementations internally use pagefault_disable() prior to actually accessing the memory. This is needed because accessing arbitrary user memory might trigger a page fault, which could cause a rescheduling action if mm->mmap_sem is contended or the fault requires loading the missing page from backing storage (major fault). In most cases tracers do not run from a context where it's safe to reschedule, either because there is no user context or because we are in an atomic context (e.g. tracepoint handlers being called in an RCU read-side critical section). With page faults disabled, in such cases the memory can't be accessed and those BPF helpers will return, for example, -EFAULT.
One might think that in practice this is not a big problem, because most of the time the memory has already been accessed by userspace and is thus correctly loaded. However, that is not the case: there are several instances where, even if the memory is loaded, a minor page fault will be triggered upon accessing it, preventing the tracer from reading memory that is actually there.
I wanted to share with you two cases that I recently studied. They don't happen often enough to be a deal breaker, but they do happen often enough to be annoying. I'm not considering major page fault events (e.g. swap or first executable load), since that part is straightforward and it's easier to understand why a tracer can't access that memory in those cases.
Case 1: fork
When a process forks, the child's duplicated address space keeps referencing the parent's memory (as long as that memory is only read, at least). As the man page of fork() says, "...the only penalty that it incurs is the time and memory required to duplicate the parent's page tables...". So, one would expect that a child can keep referencing the parent's memory without incurring any page faults if the parent already accessed that memory, which is the typical case.
However, it turns out the kernel doesn't fully copy the page tables of the parent: to make forking more efficient, it avoids copying page table entries referring to, for example, shared read-only non-anonymous VMAs. This means that a newly created process will actually incur a number of minor page faults when trying to access memory that was just recently accessed by the parent before the fork.
What does this mean from a tracer point of view? Suppose you want to intercept the arguments of an execve() system call that is being executed right after a fork(): there are very good chances some of the passed arguments are going to be constant strings right from the executable binary, and there are very good chances the parent already referenced them (or something in that surrounding) at some point and so they have already been loaded from disk.
The child, however, will not find those mapped because of the above behavior, and the kernel will incur a minor page fault during the various copy_from_user() calls while executing execve(). Since the tracer is likely executing before these copy_from_user() calls happen (e.g. hooked to a syscall tracepoint), reading that data with bpf_probe_read() won't be possible. On my system, a simple "apt-get install foo" is enough to consistently reproduce at least a couple of those missing reads every time.
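The fork behavior described above can be observed from plain userspace. The following is a minimal sketch, not part of the original experiments: it maps one page of a scratch file read-only, touches it in the parent, forks, and counts the minor faults the child takes when it touches the same page. The function names and the /tmp scratch path are assumptions made for illustration, and the count may include unrelated faults (e.g. copy-on-write on the child's stack), so treat it as an indication only.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Minor faults taken so far by the calling process, or -1 on error. */
long minor_faults_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/*
 * Hypothetical helper: returns how many minor faults a fork() child takes
 * when reading a file-backed page the parent has already touched.
 * Returns -1 on any setup error.
 */
long child_minor_faults_after_fork(void)
{
    char path[] = "/tmp/forkfault-XXXXXX"; /* scratch file, assumed writable */
    long page = sysconf(_SC_PAGESIZE);
    long delta = -1;
    int fds[2];

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    /* Shared, read-only, file-backed mapping: no anonymous memory involved. */
    volatile char *p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    (void)p[0]; /* the parent faults the page in */

    if (pipe(fds) != 0) {
        munmap((void *)p, page);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        long before = minor_faults_so_far();
        (void)p[0]; /* same page, now accessed from the child */
        long after = minor_faults_so_far();
        long d = (before < 0 || after < 0) ? -1 : after - before;
        ssize_t w = write(fds[1], &d, sizeof d);
        _exit(w == (ssize_t)sizeof d ? 0 : 1);
    }

    close(fds[1]);
    if (pid > 0) {
        if (read(fds[0], &delta, sizeof delta) != (ssize_t)sizeof delta)
            delta = -1;
        waitpid(pid, NULL, 0);
    }
    close(fds[0]);
    munmap((void *)p, page);
    return delta;
}
```

Calling child_minor_faults_after_fork() from a small main() and printing the result is enough to try it. A non-zero result would be consistent with the PTEs not having been copied at fork time, but the exact number depends on the kernel version and configuration.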
Case 2: Automatic NUMA balancing
This one was very interesting for me to find. When the kernel supports automatic NUMA balancing and it's enabled either manually or automatically via CONFIG_NUMA_BALANCING_DEFAULT_ENABLED (as happens in Ubuntu kernels), the kernel will periodically unmap pages of multi-threaded processes by writing PROT_NONE to the respective page table entries, forcing a subsequent memory access to cause a minor page fault, so that the page fault handler can decide to migrate the memory to another area closer to the CPU where the process is currently running. As you can imagine, this means that if a process rarely accesses a portion of the memory, that area will remain unmapped, preventing a tracer from accessing it. Unlike the previous fork case, the criteria to decide which VMAs can be unmapped are more aggressive, so even areas such as the stack can be targeted.
What does this mean from a tracer point of view? Just the other day I wanted to read a process's arguments and environment via current->mm->arg_start, which points near the top of the userspace stack, and to my surprise bpf_probe_read() was failing: I had never seen the top of the stack unmapped before. By manually printing the PTE flags, it was clear the unmapping was caused by PROT_NONE and this NUMA feature (unfortunately /proc/pid/pagemap didn't help, since the output shown by proc doesn't distinguish between PAGE_BIT_PRESENT and PAGE_BIT_PROTNONE, which was the key to identifying this behavior).
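For reference, this is roughly what a pagemap lookup looks like. It is a minimal sketch that reads the calling process's own entry via /proc/self/pagemap and decodes the two status bits documented by the kernel (bit 63 "present", bit 62 "swapped"). The function and macro names are made up for illustration. Since the entry carries a single "present" bit, a lookup like this cannot separate a normally mapped page from one marked PROT_NONE, in line with the observation above.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Status bits of a 64-bit pagemap entry, per the kernel's pagemap docs. */
#define PM_PRESENT (UINT64_C(1) << 63)
#define PM_SWAPPED (UINT64_C(1) << 62)

int pagemap_entry_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

int pagemap_entry_swapped(uint64_t entry)
{
    return (entry & PM_SWAPPED) != 0;
}

/*
 * Hypothetical helper: reads the pagemap entry covering addr in the calling
 * process. Returns 0 on success and -1 on error (e.g. no /proc available).
 */
int read_own_pagemap_entry(const void *addr, uint64_t *entry)
{
    long page = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    off_t off = (off_t)((uintptr_t)addr / (uintptr_t)page) *
                (off_t)sizeof(*entry);
    ssize_t n = pread(fd, entry, sizeof(*entry), off);

    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```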
I wanted to share this story to ultimately ask the question: have you ever faced something similar in your tracing experiences? If so, have you worked around the problem in some way? One solution is to try to make sure the tracer actually accesses the memory immediately after either the userspace process or the respective kernel code accesses it, effectively shielding the tracing code from the page faults, but that is not always easy or possible to do.
Also, and this might be pure uneducated speculation, I'm wondering if we could ever get to a point where, for example, instead of immediately aborting the page fault handler when the context is not right, the abort could be delayed until we are sure either that the mm->mmap_sem semaphore can't be acquired without contention or that the fault is actually going to be major. From what I have seen, such a change would dramatically reduce the number of these corner cases, although of course what I just said might just be gibberish since I've never worked in that code.
Thanks for reading.
On Thu, Aug 31, 2017 at 01:41:11PM -0700, Gianluca Borello via iovisor-dev wrote:
Hi, thank you for the excellent summary of the problem.
Do you think adding new read helper that is relying on __get_user_pages_fast()
can solve some of it? If I understand the code it won't deal with prot_none,
so numa rebalancing won't be solved, but for the fork() it should work.
Doing down_read_trylock(&mm->mmap_sem) and call into __get_user_pages_locked()
won't work, since it needs to sleep in some cases which we cannot do
from bpf program.
Another alternative is to create new type of bpf programs attached to syscalls
that won't be running in rcu and allow full copy_from_user() there,
but it would also mean that we'd need to redesign map access from such progs.
Yet another alternative is to trace lsm hooks instead of syscalls.
Instead of sys_open() attach to security_file_open() and all user supplied
data will be available at that point.
Gianluca Borello <g.borello@...>
On Tue, Sep 5, 2017 at 7:29 PM, Alexei Starovoitov
Thank you Alexei for your thoughtful reply and sorry for the delay,
but you gave me a lot to think about :)
> Do you think adding new read helper that is relying on __get_user_pages_fast()
If I'm not mistaken, it won't work.
__get_user_pages_fast() works if you already have the PTEs in the page
table, which is not the case for the fork() child, since the
optimization done during mm duplication does not even copy the PTEs.
In other words, without properly taking the mm->mmap_sem lock and
accessing the VMA operations, I don't think it's possible. With a
quick experiment, I indeed get blocked at pud_none() if I try to
call __get_user_pages_fast() on an address that has not yet been
faulted in after fork() (whereas it works fine in the parent).
In contrast, what the page fault handler does in such a situation is
recognizing that there's no PTE, so it calculates the address offset
using the VMA base address, and passes this offset to the proper
vma->vm_ops fault()/map_pages() operation, which in case of simple
file mapping will do a radix lookup in the private page tree and, if
successful, will also allocate a new PTE entry pointing to such page,
without ever touching the disk.
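The offset calculation mentioned above can be written down as a small sketch. It mirrors the way the kernel derives a file page index from a faulting address (the address's page offset within the VMA plus the VMA's own offset into the file). The toy_vma struct and the function name are illustrative stand-ins rather than the kernel's definitions, and 4 KiB pages are assumed.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Illustrative subset of the fields a real VMA carries. */
struct toy_vma {
    uint64_t vm_start; /* first address covered by the mapping */
    uint64_t vm_end;   /* first address past the mapping */
    uint64_t vm_pgoff; /* offset of the mapping into the file, in pages */
};

/*
 * Returns the index of the file page backing addr, or UINT64_MAX if addr
 * falls outside the mapping. This index is what a page cache lookup
 * would be keyed on.
 */
uint64_t toy_linear_page_index(const struct toy_vma *vma, uint64_t addr)
{
    if (addr < vma->vm_start || addr >= vma->vm_end)
        return UINT64_MAX;
    return ((addr - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}
```

For example, with a mapping starting at 0x7f0000001000 that begins 3 pages into the file, the address 0x7f0000003abc lies 2 pages into the mapping and therefore resolves to file page 5.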
As far as the other case of NUMA balancing goes, as you already
pointed out __get_user_pages_fast() will bail even if the PTE is there
Maybe it's not what you meant, in which case I apologize for the
> Doing down_read_trylock(&mm->mmap_sem) and call into __get_user_pages_locked()
You are correct, and my point was: since it is indeed possible to
serve a page fault without ever sleeping if there's no contention over
the locks and the page is already in memory somewhere (like in the
case I just described above, and like you said "it needs to sleep in
*some* cases"), why not take advantage of these cases?
So, why wouldn't it be possible, in theory, to modify the page fault
handler and propagate a bool flag named "can_sleep" through all the
functions, mirroring faulthandler_disabled(), instead of
bailing out immediately at the start of the page fault handler? If all the
fault functions were aware of this flag and would act accordingly
(e.g. using trylock for their locks), the page fault handler could
keep going until we hit either a major fault or a lock contention. The
result would be that all these cases such as the one above would start
immediately working with the standard bpf_probe_read().
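To make the idea a bit more concrete, here is a toy userspace model of the control flow I have in mind. It is only a sketch under heavy assumptions: toy_mm, toy_fault_nosleep() and the single atomic flag standing in for mm->mmap_sem are all invented for illustration, and none of this is kernel code. The point is just the shape: try the lock, bail if contended, bail if the page is not already cached, and otherwise serve the access without ever sleeping.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an address space. */
struct toy_mm {
    atomic_flag lock;        /* stands in for mm->mmap_sem */
    const char *cached_page; /* non-NULL if the data is already in memory */
    size_t page_len;         /* number of valid bytes in cached_page */
};

/*
 * Serves a "fault" only if that is possible without sleeping.
 * Returns 0 on success and -EFAULT when the real handler would have had to
 * wait for the lock or go to backing storage (a major fault).
 */
int toy_fault_nosleep(struct toy_mm *mm, char *dst, size_t len)
{
    int ret = -EFAULT;

    /* Equivalent of a trylock: never wait for a contended lock. */
    if (atomic_flag_test_and_set(&mm->lock))
        return -EFAULT;

    /* Minor fault path: the data is already in memory, just copy it. */
    if (mm->cached_page != NULL && len <= mm->page_len) {
        memcpy(dst, mm->cached_page, len);
        ret = 0;
    }

    atomic_flag_clear(&mm->lock);
    return ret;
}
```

In this model the caller sees the same -EFAULT as today in the two cases that really cannot be served, and a successful read in the case that is currently rejected up front.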
Again, I understand this would be a massive change that would likely
never happen, but I like the mental exercise of determining whether
something like that would in theory be possible or not, because maybe
I'm missing something else, and sleeping is not the only reason why
faulthandler_disabled() is set.
> Another alternative is to create new type of bpf programs attached to syscalls
This would probably work for me, although I admit I have found myself
wanting to also access user memory from an arbitrary kprobe from time
to time and that didn't work either :) (like arbitrarily accessing the
top of the stack after the numa balancing unmapped the pages).
> Yet another alternative is to trace lsm hooks instead of syscalls.
Yes, that's a good alternative for some applications, although as you
can imagine it's not as granular as the entire system call API. What I
have also done is defer a lot of the memory access to the end of the
system call itself (sys_exit tracepoint), so that hopefully the memory
is already paged in, but there are other applications where it'd be
handy to access the memory before the kernel does.
Thanks a lot!
On Thu, Sep 07, 2017 at 12:40:21PM -0700, Gianluca Borello wrote:
This issue was discussed at Plumbers and it seems there may be
a solution in sight. The work on 'speculative page faults' will
remove mm->mmap_sem in favor of srcu approach with sequence numbers
and we will be able to do find_vma() and vma->vm_ops->access() from
the non-sleepable context.
From the bpf program's point of view it will probably be a new helper
bpf_probe_read_harder() ;) or something that will try normal
pagefault_disabled read first and if it fails will try
Gianluca Borello <g.borello@...>
On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov
Thank you Alexei for your reply and sorry for the delay, I just
finally found the time over the weekend to go over your message more
I applied the speculative page fault patch to my tree to better
understand the implications of your comment and indeed this patch (way
over my head!) seems a huge leap forward because it allows us to
lookup a VMA without taking any lock, so we can do it in a
However, I am still missing how this could be a complete fix. Let's
imagine for example the case I mentioned above where we have a fork()
child and right after the fork all VMAs referring to mapped files will
not have any valid PTEs (but the file is already in the page cache).
In this case, there's little we can do besides grabbing the VMA and
asking some vma->vm_ops to give us the page corresponding to the
address we're looking for. With the speculative fault, we can do it
also from a BPF helper, however some vm_ops methods are not ready to
be called in a non-sleepable context. For example, for filemap:
- fault() is not safe because it consistently ends up in a
might_sleep() invocation 
- map_pages() seems safe (but is it also for other VMA implementations?)
- access() is not defined
So, which ones would this BPF helper call in order to guarantee
usefulness while not causing blocking? Just calling vm_ops->access()
wouldn't help in this case since it's not defined. Looking at the code
for __access_remote_vm(), it seems it does a mix of get_user_pages()
(which in turn calls vm_ops->fault() and/or vm_ops->map_pages()) and
as a fallback it uses vm_ops->access(), but of course that one can
Perhaps the solution is much simpler and I just didn't grasp all the
implications of this work? (sorry again, it's the first time I dabble
in this subsystem).
On Sun, Oct 01, 2017 at 02:08:36PM -0700, Gianluca Borello via iovisor-dev wrote:
> On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov
my understanding of the speculative page fault patch is that the whole
get_user_pages will operate under srcu, so we can call it
if necessary (not only find_vma will be safe to call).
> Perhaps the solution is much simpler and I just didn't grasp all the
all correct. It's not simpler than what you described.
I thought access() will be available in the situation we care about.
The only bit I'm missing is why tracing exec() args would go
all the way to filemap-backed pages of the executable?
The strings are in the heap/stack, no?
we can try to implement filemap_access and it probably won't
need to lock_page to access it, since readahead does it without locking.
Accessing things via kmap or ioremap (the way generic_access_phys does it)
is too costly, so probably not the right approach.
Gianluca Borello <g.borello@...>
On Sun, Oct 1, 2017 at 7:00 PM, Alexei Starovoitov
My understanding, and please correct me if I'm wrong, is that yes,
get_user_pages() can definitely work under srcu, but if the PTE is not
found (like after fork), then get_user_pages will call some operations
in vma->vm_ops that may cause the caller to sleep and there's no way
or flag to prevent this behavior (except perhaps using
__get_user_pages_fast). While this isn't a problem in the places where
get_user_pages is typically called (e.g. access_process_vm), we can't
do it from a BPF helper because we run it with preempt_disable()
regardless of the srcu (at least the BPF programs from kprobes and
tracepoints, which I use the most).
Well, in my experiments done on normal workloads I can see many
constant strings that come directly from the mapped executable and are
passed as arguments to system calls. As a rule of thumb, running an
"apt-get install xxx" on any system will generate several dozens of
them, I analyzed a few and they are all instances of fork() +
> we can try to implement filemap_access and it probably won't
Implementing filemap_access sounds like a feasible strategy, I wasn't
sure about that option since it would effectively mean patching all
the vm_ops definitions under fs/. I haven't read the implementation of
filemap yet but I'll take a better look as soon as I have some time.
At the same time, for anonymous VMAs it would be useful to have some
sort of BPF helper working like __get_user_pages_fast which would
somehow ignore the PROT_NONE set on the PTEs during the NUMA balancing
(not even sure that's actually possible, will have to experiment).
By the way, none of this is urgent as I have my workarounds to get
around that (I mostly parse arguments at the exit of the system call
tracepoint), but it'd be nice to have a few more powerful helpers that
can dig into all sorts of memory situations :-)
On Mon, Oct 02, 2017 at 07:29:51PM -0700, Gianluca Borello via iovisor-dev wrote:
> On Sun, Oct 1, 2017 at 7:00 PM, Alexei Starovoitov
not following why all fs-es need to be patched.
can we do a mini version of filemap_fault() that only operates
on pages in cache? We cannot serve major faults anyway and real
fs access is not necessary.
> At the same time, for anonymous VMAs it would be useful to have some
should be possible. passing a flag through all of the gup_*()
is probably not going to be acceptable, since all of these functions
already have 6 args, but checking some global state in pte_protnone()
may be a way out.
btw the .config-s I care about don't have CONFIG_NUMA_BALANCING set.
> By the way, none of this is urgent as I have my workarounds to get
yep. same here. It's not urgent, but we need a solution long term.
Gianluca Borello <g.borello@...>
On Fri, Oct 6, 2017 at 3:11 AM, Alexei Starovoitov
Good point. Indeed we can, I thought a strategy might have been
leaving the helper implementation as generic as possible and rely on a
call to vma->vm_ops->access() if present, but it seems we could
definitely try to directly look into the page cache if we see
vma->vm_file->f_mapping exists, bypassing the vm_ops (hope that's what
> btw the .config-s I care about don't have CONFIG_NUMA_BALANCING set.
It's interesting, most kernels I've seen don't have the feature
enabled by default, whereas all Ubuntu kernels, widely used on
servers, have it.
Thanks! I'll be glad to experiment with some of this at some point :)