Re: Accessing user memory and minor page faults


Gianluca Borello <g.borello@...>
 

On Tue, Sep 5, 2017 at 7:29 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

thank you for the excellent summary of the problem.
Thank you Alexei for your thoughtful reply and sorry for the delay,
but you gave me a lot to think about :)


Do you think adding new read helper that is relying on __get_user_pages_fast()
can solve some of it? If I understand the code it won't deal with prot_none,
so numa rebalancing won't be solved, but for the fork() it should work.
If I'm not mistaken, it won't work.

__get_user_pages_fast() works if you already have the PTEs in the page
table, which is not the case for the fork() child, since the
optimization done during mm duplication does not even copy the PTEs.
In other words, without properly taking the mm->mmap_sem lock and
accessing the VMA operations, I don't think it's possible. With a
quick experiment, I indeed get blocked at pud_none() [1] if I try to
call __get_user_pages_fast() on an address that has not yet been
faulted in after fork() (whereas it works fine in the parent).

In contrast, what the page fault handler does in such a situation is
recognizing that there's no PTE, so it calculates the address offset
using the VMA base address, and passes this offset to the proper
vma->vm_ops fault()/map_pages() operation, which in case of simple
file mapping will do a radix lookup in the private page tree and, if
successful, will also allocate a new PTE entry pointing to such page,
without ever touching the disk.

As far as the other case of numa unbalancing goes, as you already
pointed out __get_user_pages_fast() will bail even if the PTE is there
[2].

Maybe it's not what you meant, in which case I apologize for the
misunderstanding.


Doing down_read_trylock(&mm->mmap_sem) and call into __get_user_pages_locked()
won't work, since it needs to sleep in some cases which we cannot do
from bpf program.
You are correct, and my point was: since it is indeed possible to
serve a page fault without ever sleeping if there's no contention over
the locks and the page is already in memory somewhere (like in the
case I just described above, and like you said "it needs to sleep in
*some* cases"), why not taking advantage of these cases?

So, why wouldn't in theory be possible to modify the page fault
handler and propagate a bool flag throughout all the methods named
"can_sleep", which would mimic faulthandler_disabled(), instead of
bailing out immediately at the start of the page handler? If all the
fault functions were aware of this flag and would act accordingly
(e.g. using trylock for their locks), the page fault handler could
keep going until we hit either a major fault or a lock contention. The
result would be that all these cases such as the one above would start
immediately working with the standard bpf_probe_read().

Again, I understand this would be a massive change that would likely
never happen, but I like the mental exercise of determining whether
something like that would in theory be possible or not, because maybe
I'm missing something else, and sleeping is not the only reason why
faulthandler_disabled() is set.


Another alternative is to create new type of bpf programs attached to syscalls
that won't be running in rcu and allow full copy_from_user() there,
but it would also mean that we'd need to redesign map access from such progs.
This would probably work for me, although I admit I have found myself
wanting to also access user memory from an arbitrary kprobe from time
to time and that didn't work either :) (like arbitrarily accessing the
top of the stack after the numa balancing unmapped the pages).


Yet another alternative is to trace lsm hooks instead of syscalls.
Instead of sys_open() attach to security_file_open() and all user supplied
data will be available at that point.
Yes, that's a good alternative for some applications, although as you
can imagine it's not as granular as the entire system call API. What I
have also done is defer a lot of the memory access at the end of the
system call itself (sys_exit tracepoint), so that hopefully the memory
is already paged in, but there are other applications where it'd be
handy to access the memory before the kernel does.

Thanks a lot!

[1] https://github.com/torvalds/linux/blob/v4.13/mm/gup.c#L1580
[2] https://github.com/torvalds/linux/blob/v4.13/mm/gup.c#L1296

Join iovisor-dev@lists.iovisor.org to automatically receive all group messages.