Re: Accessing user memory and minor page faults

Alexei Starovoitov

On Sun, Oct 01, 2017 at 02:08:36PM -0700, Gianluca Borello via iovisor-dev wrote:
On Mon, Sep 25, 2017 at 9:36 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

this issue was discussed at Plumbers and it seems there may be
a solution in sight. The work on 'speculative page faults' will
remove mm->mmap_sem in favor of srcu approach with sequence numbers
and we will be able to do find_vma() and vma->vm_ops->access() from
the non-sleepable context.
From bpf program point of view it probably be a new helper
bpf_probe_read_harder() ;) or something that will try normal
pagefault_disabled read first and if it fails will try
srcu_read_lock+vma->access approach.
Thank you Alexei for your reply and sorry for the delay, I just
finally found the time over the weekend to go over your message more

I applied the speculative page fault patch to my tree to better
understand the implications of your comment and indeed this patch (way
over my head!) seems a huge leap forward because it allows us to
lookup a VMA without taking any lock, so we can do it in a
non-sleepable context.

However, I am still missing how this could be a resolutive fix. Let's
imagine for example the case I mentioned above where we have a fork()
child and right after the fork all VMAs referring to mapped files will
not have any valid PTEs (but the file is already in the page cache).

In this case, there's little we can do beside grabbing the VMA and
asking some vma->vm_ops to give us the page corresponding to the
address we're looking for. With the speculative fault, we can do it
also from a BPF helper, however some vm_ops methods are not ready to
be called in a non-sleepable context. For example, for filemap:

- fault() is not safe because it consistently ends up in a
might_sleep() invocation [1][2]
- map_pages() seems safe (but is it also for other VMA implementations?)
- access() is not defined

So, which ones would this BPF helper call in order to guarantee
usefulness while not causing blocking? Just calling vm_ops->access()
wouldn't help in this case since it's not defined. Looking at the code
for __access_remote_vm(), it seems it does a mix of get_user_pages()
(which in turn calls vm_ops->fault() and/or vm_ops->map_pages()) and
as a fallback it uses vm_ops->access(), but of course that one can
my understanding of speculative page fault patch is that the whole
get_user_pages will operate under srcu, so we can call it
if necessary (not only find_vma will be safe to call).

Perhaps the solution is much simpler and I just didn't grasp all the
implications of this work? (sorry again, it's the first time I dabble
in this subsystem).
all correct. It's not simpler than what you described.
I thought access() will be available in the situation we care about.
The only bit I'm missing is why to trace exec() args it will be going
all the way to filemap-backed pages of the executable ?
The strings are in the heap/stack, no?
we can try to implement filemap_access and it probably won't
need to lock_page to access it, since readahead does it without locking.
Accessing things via kmap or ioremap (the way generic_access_phys does it)
is too costly, so probably not the right approach.

Join to automatically receive all group messages.