Re: Accessing user memory and minor page faults


Gianluca Borello <g.borello@...>
 

On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

> this issue was discussed at Plumbers and it seems there may be
> a solution in sight. The work on 'speculative page faults' will
> remove mm->mmap_sem in favor of srcu approach with sequence numbers
> and we will be able to do find_vma() and vma->vm_ops->access() from
> the non-sleepable context.
> From bpf program point of view it probably be a new helper
> bpf_probe_read_harder() ;) or something that will try normal
> pagefault_disabled read first and if it fails will try
> srcu_read_lock+vma->access approach.

Thank you, Alexei, for your reply, and sorry for the delay. I only
found the time over the weekend to go over your message in more
depth.

I applied the speculative page fault patch to my tree to better
understand the implications of your comment, and indeed this patch (way
over my head!) seems like a huge leap forward: it lets us look up a
VMA without taking mm->mmap_sem, so we can do it in a non-sleepable
context.

However, I still don't see how this could be a complete fix. Take,
for example, the case I mentioned earlier: right after a fork(), the
child's VMAs for mapped files will not have any valid PTEs (even
though the file is already in the page cache).
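
To make the scenario concrete, here is a minimal userspace sketch
(not kernel code). The function name and the temp file path are made
up for illustration. The parent maps a file and touches it, then the
child reads the same address after fork() and reports how many minor
faults that read cost according to getrusage(). On Linux I would
expect at least one, but the count can include unrelated faults, so
treat it as indicative only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the minor faults the child takes on its
 * first read of a file page the parent already touched, or -1 on error. */
static long child_minor_faults_on_first_touch(void)
{
    char path[] = "/tmp/minflt-demo-XXXXXX"; /* illustrative path */
    long page = sysconf(_SC_PAGESIZE);
    long delta = -1;
    int pfd[2];
    int fd;

    assert(page > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file stays alive until fd is closed */

    if (ftruncate(fd, page) != 0 || pipe(pfd) != 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(pfd[0]);
        close(pfd[1]);
        close(fd);
        return -1;
    }
    volatile const char *map = p;
    (void)map[0]; /* parent touch: page is now in the page cache */

    pid_t pid = fork();
    if (pid == 0) {
        struct rusage before, after;
        /* Pre-touch the structs so stack COW faults are not counted. */
        memset(&before, 0, sizeof before);
        memset(&after, 0, sizeof after);
        getrusage(RUSAGE_SELF, &before);
        (void)map[0]; /* child touch: no valid PTE yet */
        getrusage(RUSAGE_SELF, &after);
        long d = after.ru_minflt - before.ru_minflt;
        ssize_t w = write(pfd[1], &d, sizeof d);
        _exit(w == (ssize_t)sizeof d ? 0 : 1);
    }

    close(pfd[1]);
    if (pid > 0) {
        long d;
        if (read(pfd[0], &d, sizeof d) == (ssize_t)sizeof d)
            delta = d;
        waitpid(pid, NULL, 0);
    }
    close(pfd[0]);
    munmap(p, (size_t)page);
    close(fd);
    return delta;
}
```

Calling child_minor_faults_on_first_touch() from a small main() and
printing the result is enough to try it. A BPF program probing that
address in the child before the fault would hit exactly the case
described above.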

In this case, there's little we can do besides grabbing the VMA and
asking one of the vma->vm_ops methods to give us the page
corresponding to the address we're looking for. With the speculative
fault work, we can also do that from a BPF helper; however, some
vm_ops methods are not ready to be called in a non-sleepable context.
For example, for filemap:

- fault() is not safe because it consistently ends up in a
might_sleep() invocation [1][2]
- map_pages() seems safe (but is it also safe for other VMA implementations?)
- access() is not defined

So, which ones would this BPF helper call in order to be useful
without blocking? Just calling vm_ops->access() wouldn't help in this
case since it's not defined. Looking at the code for
__access_remote_vm(), it seems it does a mix of get_user_pages()
(which in turn calls vm_ops->fault() and/or vm_ops->map_pages()) and
as a fallback it uses vm_ops->access(), but of course that one can
sleep.
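
To spell out the question, here is a toy userspace model of the
two-step helper (plain read first, then a fallback limited to
callbacks assumed not to sleep). Everything here is hypothetical: the
model_* types and functions are stand-ins I made up, not the kernel's
vm_operations_struct or a real BPF helper, and treating map_pages()
as non-sleeping is precisely the assumption I am unsure about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct model_vma;

/* Hypothetical stand-in for vm_ops: each callback tries to make the
 * page present and reports whether it succeeded. */
struct model_vm_ops {
    bool (*fault)(struct model_vma *vma);     /* may sleep */
    bool (*map_pages)(struct model_vma *vma); /* assumed not to sleep */
    bool (*access)(struct model_vma *vma);    /* NULL if not defined */
};

struct model_vma {
    const struct model_vm_ops *ops;
    const char *page_cache; /* cached file data, or NULL if not cached */
    bool pte_present;       /* false right after fork() */
};

enum model_result { MODEL_FAST_PATH, MODEL_FALLBACK, MODEL_EFAULT };

/* Models fault(): would always succeed, but could block on I/O. */
static bool model_filemap_fault(struct model_vma *vma)
{
    vma->pte_present = true;
    return true;
}

/* Models map_pages(): maps the page only if it is already cached. */
static bool model_filemap_map_pages(struct model_vma *vma)
{
    if (!vma->page_cache)
        return false;
    vma->pte_present = true;
    return true;
}

/* Mirrors the filemap list above: access() is not defined. */
static const struct model_vm_ops model_filemap_ops = {
    .fault = model_filemap_fault,
    .map_pages = model_filemap_map_pages,
    .access = NULL,
};

static enum model_result model_probe_read_harder(char *dst,
                                                 struct model_vma *vma,
                                                 size_t len)
{
    assert(dst && vma);

    /* Step 1: the plain read, as if done with page faults disabled. */
    if (vma->pte_present && vma->page_cache) {
        memcpy(dst, vma->page_cache, len);
        return MODEL_FAST_PATH;
    }

    /* Step 2: fallback using only the callback assumed not to sleep.
     * fault() is skipped on purpose and access() is never consulted. */
    if (vma->ops && vma->ops->map_pages && vma->ops->map_pages(vma)) {
        memcpy(dst, vma->page_cache, len);
        return MODEL_FALLBACK;
    }

    return MODEL_EFAULT;
}
```

In this model the post-fork case (page cached, no PTE) succeeds only
through map_pages(), and a page that is not cached still fails. A
helper built on access() alone would never get that far for filemap.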

Perhaps the solution is much simpler and I just didn't grasp all the
implications of this work? (Sorry again, it's the first time I've
dabbled in this subsystem.)

Thanks

[1] https://github.com/torvalds/linux/blob/v4.13/mm/filemap.c#L2372
[2] https://github.com/torvalds/linux/blob/v4.13/include/linux/pagemap.h#L496
