Accessing user memory and minor page faults
Gianluca Borello <g.borello@...>
Hi,

I wanted to share my recent experience with troubles accessing user memory when page faults occur, and ask for opinions. A bit of introduction first.

It's no surprise that a good portion of the tracers out there rely heavily on the ability to dereference user memory for different purposes. Typical use cases in eBPF land can be found in all the bcc scripts containing references to bpf_probe_read() and bpf_probe_read_str() [1]. Both implementations internally call pagefault_disable() before actually accessing the memory [2]. This is needed because accessing arbitrary user memory might trigger a page fault, which could cause a reschedule if mm->mmap_sem is contended or if the fault requires loading the missing page from backing storage (major fault). In most cases tracers are not run from a context where it's safe to reschedule, either because there is no user context or because we are in an atomic context (e.g. tracepoint handlers being called in an RCU read-side critical section). With page faults disabled, in such cases the memory can't be accessed and those BPF helpers return an error, for example -EFAULT.

One might think that in practice this is not a big problem, because most of the time the memory has already been accessed by userspace and is thus already loaded. However, that is not the case: there are several instances where, even if the memory is loaded, a minor page fault will be triggered upon accessing it, preventing the tracer from reading memory that is actually there.

I wanted to share with you two cases that I recently studied. They don't happen often enough to be a deal breaker, but they do happen often enough to be annoying. I'm not considering major page fault events (e.g. swap or first executable load), since that part is straightforward and it's easier to understand why a tracer can't access that memory in those cases.
Case 1: fork

When a process forks, the child's duplicated address space keeps referencing the parent's memory (at least for as long as that memory is only read). As the man page of fork() says, "...the only penalty that it incurs is the time and memory required to duplicate the parent's page tables...". So, one would expect that a child can keep referencing the parent's memory without incurring any page faults if the parent already accessed that memory, which is a typical case.

However, it turns out the kernel doesn't fully copy the parent's page tables: it avoids copying page table entries referring to, for example, shared read-only non-anonymous VMAs, in order to make forking more efficient [3]. This means that a newly created process will actually incur a bunch of minor page faults when trying to access memory that was recently accessed by the parent before the fork.

What does this mean from a tracer's point of view? Suppose you want to intercept the arguments of an execve() system call that is executed right after a fork(): there is a very good chance that some of the passed arguments are constant strings coming straight from the executable binary, and a very good chance that the parent already referenced them (or something nearby) at some point, so they have already been loaded from disk. The child, however, will not find them mapped because of the behavior above, and the kernel will incur a minor page fault during the various copy_from_user() calls while executing the execve(). Since the tracer is likely executing before these copy_from_user() calls happen (e.g. hooked to a syscall tracepoint), reading that data with bpf_probe_read() won't be possible. On my system, a simple "apt-get install foo" is enough to consistently reproduce at least a couple of those missing reads every time.

Case 2: Automatic NUMA balancing

This one was very interesting for me to find. When the kernel supports automatic NUMA balancing and it's enabled either manually or automatically via CONFIG_NUMA_BALANCING_DEFAULT_ENABLED (as happens in Ubuntu kernels), the kernel will periodically unmap pages of multi-threaded processes by writing PROT_NONE to the respective page table entries [4], forcing a subsequent memory access to cause a minor page fault, so that the page fault handler can decide to migrate the memory to another area closer to the CPU where the process is currently running [5]. As you can imagine, this means that if a process rarely accesses a portion of its memory, that area will remain unmapped, preventing a tracer from accessing it. Unlike the previous fork case, the criteria used to decide which VMAs can be unmapped are more aggressive, so even areas such as the stack can be targeted.

What does this mean from a tracer's point of view? Just the other day I wanted to read a process's arguments and environment via current->mm->arg_start, which points near the top of the userspace stack, and to my surprise bpf_probe_read() was failing: I had never seen the top of the stack unmapped before. By manually printing the PTE flags, it was clear the unmapping was caused by PROT_NONE and this NUMA feature (unfortunately /proc/pid/pagemap didn't help since the output shown by proc doesn't distinguish between PAGE_BIT_PRESENT and PAGE_BIT_PROTNONE, which was the key to identifying this behavior).

I wanted to share this story to ultimately ask the question: have you ever faced something similar in your tracing experiences? If so, have you worked around the problem in some way? One solution is to try to make sure the tracer accesses the memory immediately after either the userspace process or the respective kernel code accesses it, effectively shielding the tracing code from the page faults, but that is not always easy or possible to do.
Also, and this might be pure uneducated speculation, I'm wondering if we could ever get to a point where, for example, instead of immediately aborting in the page fault handler when the context is not right [6], the abort could be delayed until we are sure either that the mm->mmap_sem semaphore can't be acquired without contention or that the fault is actually going to be major. From what I have seen, such a change would dramatically reduce the number of these corner cases, although of course what I just said might be gibberish since I've never worked in that code.

Thanks for reading.
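To illustrate the PTE flag distinction mentioned in the NUMA balancing case, here is a minimal userspace sketch. It assumes x86-64, where the hardware present bit is bit 0 and the kernel's PROTNONE software flag reuses bit 8 (the "global" bit). The helper and macro names (PTE_PRESENT, PTE_PROTNONE, pte_hw_present, pte_sw_present, pte_is_protnone) are made up for this example; only the bit positions and the masking logic are meant to mirror the kernel's pte_present() and pte_protnone().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 bit positions: bit 0 is the hardware "present" bit,
 * bit 8 is the "global" bit that the kernel reuses as PROTNONE when
 * the present bit is clear. The macro names are hypothetical. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_PROTNONE (UINT64_C(1) << 8)

/* True when the CPU itself would accept the entry without faulting
 * on the present check. */
static inline bool pte_hw_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/* Mirrors the logic of the kernel's pte_present() on x86: an entry
 * counts as "present" if either bit is set. */
static inline bool pte_sw_present(uint64_t pte)
{
    return (pte & (PTE_PRESENT | PTE_PROTNONE)) != 0;
}

/* Mirrors the logic of pte_protnone(): PROTNONE set, present clear.
 * This is the state left behind by a NUMA hinting unmap. */
static inline bool pte_is_protnone(uint64_t pte)
{
    return (pte & (PTE_PRESENT | PTE_PROTNONE)) == PTE_PROTNONE;
}
```

An entry with only bit 8 set is "present" for pte_sw_present() but still faults in hardware, which is consistent with a present-style report not being enough to tell the two states apart; printing the raw flags and checking both bits is what separates them.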
Alexei Starovoitov
On Thu, Aug 31, 2017 at 01:41:11PM -0700, Gianluca Borello via iovisor-dev wrote:
Hi, thank you for the excellent summary of the problem.

Do you think adding a new read helper that relies on __get_user_pages_fast() can solve some of it? If I understand the code correctly, it won't deal with prot_none, so NUMA rebalancing won't be solved, but for fork() it should work. Doing down_read_trylock(&mm->mmap_sem) and calling into __get_user_pages_locked() won't work, since it needs to sleep in some cases, which we cannot do from a bpf program.

Another alternative is to create a new type of bpf program attached to syscalls that won't run under rcu, and to allow full copy_from_user() there, but it would also mean that we'd need to redesign map access from such progs.

Yet another alternative is to trace lsm hooks instead of syscalls. Instead of sys_open(), attach to security_file_open(), and all user-supplied data will be available at that point.
Gianluca Borello <g.borello@...>
On Tue, Sep 5, 2017 at 7:29 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:

Thank you Alexei for your thoughtful reply, and sorry for the delay, but you gave me a lot to think about :)

> Do you think adding a new read helper that relies on __get_user_pages_fast() [...]

If I'm not mistaken, it won't work. __get_user_pages_fast() works if you already have the PTEs in the page table, which is not the case for the fork() child, since the optimization done during mm duplication does not even copy the PTEs. In other words, without properly taking the mm->mmap_sem lock and accessing the VMA operations, I don't think it's possible. With a quick experiment, I indeed get stopped at pud_none() [1] if I try to call __get_user_pages_fast() on an address that has not yet been faulted in after fork() (whereas it works fine in the parent).

In contrast, what the page fault handler does in such a situation is recognize that there's no PTE, calculate the address offset using the VMA base address, and pass this offset to the proper vma->vm_ops fault()/map_pages() operation, which in the case of a simple file mapping will do a radix tree lookup in the private page tree and, if successful, will also allocate a new PTE pointing to that page, without ever touching the disk.

As for the other case of NUMA balancing, as you already pointed out, __get_user_pages_fast() will bail even if the PTE is there [2].

Maybe it's not what you meant, in which case I apologize for the misunderstanding.

> Doing down_read_trylock(&mm->mmap_sem) and calling into __get_user_pages_locked() [...]

You are correct, and my point was: since it is indeed possible to serve a page fault without ever sleeping if there's no contention over the locks and the page is already in memory somewhere (like in the case I just described above, and like you said, "it needs to sleep in *some* cases"), why not take advantage of these cases?
So, why wouldn't it be possible, in theory, to modify the page fault handler and propagate a bool flag named "can_sleep" through all the methods, mimicking faulthandler_disabled(), instead of bailing out immediately at the start of the page fault handler? If all the fault functions were aware of this flag and acted accordingly (e.g. using trylock for their locks), the page fault handler could keep going until it hit either a major fault or lock contention. The result would be that all cases such as the one above would immediately start working with the standard bpf_probe_read().

Again, I understand this would be a massive change that would likely never happen, but I like the mental exercise of determining whether something like that would in theory be possible, because maybe I'm missing something else, and sleeping is not the only reason why faulthandler_disabled() is set.

> Another alternative is to create a new type of bpf program attached to syscalls [...]

This would probably work for me, although I admit I have found myself wanting to also access user memory from an arbitrary kprobe from time to time, and that didn't work either :) (like arbitrarily accessing the top of the stack after NUMA balancing unmapped the pages).

> Yet another alternative is to trace lsm hooks instead of syscalls.

Yes, that's a good alternative for some applications, although as you can imagine it's not as granular as the entire system call API. What I have also done is defer a lot of the memory accesses to the end of the system call itself (sys_exit tracepoint), so that hopefully the memory is already paged in, but there are other applications where it'd be handy to access the memory before the kernel does.

Thanks a lot!

[1] https://github.com/torvalds/linux/blob/v4.13/mm/gup.c#L1580
[2] https://github.com/torvalds/linux/blob/v4.13/mm/gup.c#L1296
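The "can_sleep" idea above can be pictured with a toy userspace model. This is only a sketch under loud assumptions: none of the types or functions below exist in the kernel, mmap_sem contention is reduced to a single boolean, and "page in memory" is reduced to another boolean. It only shows the control flow being proposed: keep going on a fault, and give up only at the point where sleeping would really be required.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical stand-ins, not kernel types. */
struct toy_mm {
    bool mmap_sem_held_by_writer;   /* models contention on mm->mmap_sem */
};

struct toy_page_state {
    bool pte_present;     /* a PTE is already installed */
    bool in_page_cache;   /* page is in memory but may not be mapped */
};

enum toy_fault_result {
    TOY_FAULT_SERVED,       /* access can proceed */
    TOY_FAULT_WOULD_BLOCK,  /* lock contended and sleeping not allowed */
    TOY_FAULT_MAJOR         /* needs I/O and sleeping not allowed */
};

static inline enum toy_fault_result
toy_handle_fault(struct toy_mm *mm, struct toy_page_state *pg, bool can_sleep)
{
    assert(mm && pg);

    if (pg->pte_present)
        return TOY_FAULT_SERVED;             /* no fault to handle */

    if (mm->mmap_sem_held_by_writer) {
        if (!can_sleep)
            return TOY_FAULT_WOULD_BLOCK;    /* the trylock would fail */
        mm->mmap_sem_held_by_writer = false; /* model: wait for the writer */
    }

    if (!pg->in_page_cache) {
        if (!can_sleep)
            return TOY_FAULT_MAJOR;          /* would have to read from disk */
        pg->in_page_cache = true;            /* model: blocking read */
    }

    pg->pte_present = true;                  /* minor fault: install the PTE */
    return TOY_FAULT_SERVED;
}
```

In this model a non-sleeping caller still gets the page when the lock is free and the page is cached, which corresponds to the fork() case discussed in the thread; whether the real fault path could be restructured this way is exactly the open question raised above.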
Alexei Starovoitov
On Thu, Sep 07, 2017 at 12:40:21PM -0700, Gianluca Borello wrote:
This issue was discussed at Plumbers and it seems there may be a solution in sight. The work on 'speculative page faults' will remove mm->mmap_sem in favor of an srcu approach with sequence numbers, and we will be able to do find_vma() and vma->vm_ops->access() from a non-sleepable context. From the bpf program's point of view it will probably be a new helper, bpf_probe_read_harder() ;) or something, that tries the normal pagefault_disabled read first and, if that fails, tries the srcu_read_lock + vma->access approach.
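The "try the normal read first, then fall back" shape of such a helper can be sketched in plain userspace C. Everything here is hypothetical: toy_probe_read_harder and the two reader callbacks are stand-ins invented for this example, the fast reader simply pretends to fault, and the slow reader is a memcpy() standing in for whatever the srcu + vma->access path would eventually be.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* A reader copies len bytes from src to dst; returns 0 or a negative errno. */
typedef int (*toy_read_fn)(void *dst, size_t len, const void *src);

/* Stand-in for a read that hits an unmapped page with faults disabled:
 * the destination is zeroed and -EFAULT is returned. */
static inline int toy_read_faulting(void *dst, size_t len, const void *src)
{
    (void)src;
    memset(dst, 0, len);
    return -EFAULT;
}

/* Stand-in for a slower path that manages to reach the data. */
static inline int toy_read_memcpy(void *dst, size_t len, const void *src)
{
    memcpy(dst, src, len);
    return 0;
}

/* Try the cheap path first; only on -EFAULT try the more expensive one. */
static inline int toy_probe_read_harder(void *dst, size_t len, const void *src,
                                        toy_read_fn fast_read,
                                        toy_read_fn slow_read)
{
    assert(dst && src && fast_read && slow_read);

    int err = fast_read(dst, len, src);
    if (err != -EFAULT)
        return err;                 /* success, or an unrelated error */

    return slow_read(dst, len, src);
}
```

Keeping the fast path first means the common case costs the same as today, and the fallback only runs for the reads that currently fail.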
Gianluca Borello <g.borello@...>
On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov <alexei.starovoitov@...> wrote:

Thank you Alexei for your reply, and sorry for the delay; I finally found the time over the weekend to go over your message more deeply.

I applied the speculative page fault patch to my tree to better understand the implications of your comment, and indeed this patch (way over my head!) seems a huge leap forward because it allows us to look up a VMA without taking any lock, so we can do it in a non-sleepable context.

However, I am still missing how this could be a complete fix. Let's imagine, for example, the case I mentioned above, where we have a fork() child and, right after the fork, none of the VMAs referring to mapped files have any valid PTEs (but the file is already in the page cache). In this case, there's little we can do besides grabbing the VMA and asking some vma->vm_ops to give us the page corresponding to the address we're looking for. With the speculative fault, we can also do it from a BPF helper; however, some vm_ops methods are not ready to be called in a non-sleepable context. For example, for filemap:

- fault() is not safe because it consistently ends up in a might_sleep() invocation [1][2]
- map_pages() seems safe (but is it also for other VMA implementations?)
- access() is not defined

So, which ones would this BPF helper call in order to guarantee usefulness while not blocking? Just calling vm_ops->access() wouldn't help in this case since it's not defined. Looking at the code for __access_remote_vm(), it seems it does a mix of get_user_pages() (which in turn calls vm_ops->fault() and/or vm_ops->map_pages()) and, as a fallback, uses vm_ops->access(), but of course that one can sleep.

Perhaps the solution is much simpler and I just didn't grasp all the implications of this work? (sorry again, it's the first time I dabble in this subsystem).

Thanks

[1] https://github.com/torvalds/linux/blob/v4.13/mm/filemap.c#L2372
[2] https://github.com/torvalds/linux/blob/v4.13/include/linux/pagemap.h#L496
Alexei Starovoitov
On Sun, Oct 01, 2017 at 02:08:36PM -0700, Gianluca Borello via iovisor-dev wrote:
> On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov [...]

My understanding of the speculative page fault patch is that the whole get_user_pages will operate under srcu, so we can call it if necessary (not only find_vma will be safe to call).

> Perhaps the solution is much simpler and I just didn't grasp all the [...]

All correct. It's not simpler than what you described. I thought access() would be available in the situation we care about. The only bit I'm missing is why tracing exec() args would go all the way to filemap-backed pages of the executable. The strings are in the heap/stack, no?

We can try to implement filemap_access and it probably won't need to lock_page to access it, since readahead does it without locking. Accessing things via kmap or ioremap (the way generic_access_phys does it) is too costly, so it's probably not the right approach.
Gianluca Borello <g.borello@...>
On Sun, Oct 1, 2017 at 7:00 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:

Hi Alexei,

My understanding, and please correct me if I'm wrong, is that yes, get_user_pages() can definitely work under srcu, but if the PTE is not found (like after fork), then get_user_pages() will call some operations in vma->vm_ops that may cause the caller to sleep, and there's no way or flag to prevent this behavior (except perhaps using __get_user_pages_fast()). While this isn't a problem in the places where get_user_pages() is typically called (e.g. access_process_vm()), we can't do it from a BPF helper because we run it with preempt_disable() regardless of the srcu (at least for the BPF programs attached to kprobes and tracepoints, which I use the most).

Well, in my experiments on normal workloads I can see many constant strings that come directly from the mapped executable and are passed as arguments to system calls. As a rule of thumb, running "apt-get install xxx" on any system will generate several dozen of them; I analyzed a few and they are all instances of fork() + execve(some_arg_from_text_section).

> We can try to implement filemap_access and it probably won't [...]

Implementing filemap_access sounds like a feasible strategy. I wasn't sure about that option since it would effectively mean patching all the vm_ops definitions under fs/. I haven't read the implementation of filemap yet, but I'll take a better look as soon as I have some time.

At the same time, for anonymous VMAs it would be useful to have some sort of BPF helper working like __get_user_pages_fast() that would somehow ignore the PROT_NONE set on the PTEs during NUMA balancing (not even sure that's actually possible, will have to experiment).

By the way, none of this is urgent as I have my workarounds to get around that (I mostly parse arguments at the exit of the system call tracepoint), but it'd be nice to have a few more powerful helpers that can dig into all sorts of memory situations :-)

Thanks
Alexei Starovoitov
On Mon, Oct 02, 2017 at 07:29:51PM -0700, Gianluca Borello via iovisor-dev wrote:
> On Sun, Oct 1, 2017 at 7:00 PM, Alexei Starovoitov [...]

Not following why all fs-es need to be patched. Can we do a mini version of filemap_fault() that only operates on pages in cache? We cannot serve major faults anyway and real fs access is not necessary.

> At the same time, for anonymous VMAs it would be useful to have some [...]

Should be possible. Passing a flag through all of the gup_*() functions is probably not going to be acceptable, since all of these functions already have 6 args, but checking some global state in pte_protnone() may be a way out.

btw the .config-s I care about don't have CONFIG_NUMA_BALANCING set.

> By the way, none of this is urgent as I have my workarounds to get [...]

Yep, same here. It's not urgent, but we need a solution long term.
Gianluca Borello <g.borello@...>
On Fri, Oct 6, 2017 at 3:11 AM, Alexei Starovoitov <alexei.starovoitov@...> wrote:

Good point. Indeed we can. I thought a strategy might have been to leave the helper implementation as generic as possible and rely on a call to vma->vm_ops->access() if present, but it seems we could definitely try to look directly into the page cache if we see that vma->vm_file->f_mapping exists, bypassing the vm_ops (hope that's what you meant).

> btw the .config-s I care about don't have CONFIG_NUMA_BALANCING set.

It's interesting, most kernels I've seen don't have the feature enabled by default, whereas all Ubuntu kernels, widely used on servers, have it.

Thanks! I'll be glad to experiment with some of this at some point :)
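The cache-only lookup discussed at the end of the thread can be sketched as a toy model. It assumes 4 KiB pages and a file-backed VMA; the types toy_vma and toy_page_cache and all function names are invented for this example, and the fixed array is a stand-in for the real page cache. The index arithmetic follows the usual rule for a linear file mapping (offset from the VMA start in pages, plus the VMA's starting page offset in the file).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12   /* assumed 4 KiB pages */
#define TOY_CACHE_SLOTS 16   /* size of the toy page cache */

/* Hypothetical stand-in for the few VMA fields needed here. */
struct toy_vma {
    uint64_t vm_start;   /* first mapped address */
    uint64_t vm_end;     /* one past the last mapped address */
    uint64_t vm_pgoff;   /* file offset of vm_start, in pages */
};

/* Hypothetical page cache: slot i holds file page i, or NULL if absent. */
struct toy_page_cache {
    const unsigned char *pages[TOY_CACHE_SLOTS];
};

/* File page index backing a user address in a linear file mapping. */
static inline uint64_t toy_page_index(const struct toy_vma *vma, uint64_t addr)
{
    assert(addr >= vma->vm_start && addr < vma->vm_end);
    return ((addr - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Read one byte using only cached pages. Returns 0 on success and -1
 * when the page is not cached, i.e. when real I/O would be needed. */
static inline int toy_read_cached_byte(const struct toy_vma *vma,
                                       const struct toy_page_cache *pc,
                                       uint64_t addr, unsigned char *out)
{
    uint64_t index = toy_page_index(vma, addr);
    uint64_t offset = addr & ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1);

    if (index >= TOY_CACHE_SLOTS || pc->pages[index] == NULL)
        return -1;               /* would be a major fault: give up */

    *out = pc->pages[index][offset];
    return 0;
}
```

The point of the model is that the lookup never needs a PTE: it goes from address to file page index to cached page, and fails cleanly instead of sleeping when the page is missing.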