Re: Accessing user memory and minor page faults

Alexei Starovoitov

On Mon, Oct 02, 2017 at 07:29:51PM -0700, Gianluca Borello via iovisor-dev wrote:
On Sun, Oct 1, 2017 at 7:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

my understanding of speculative page fault patch is that the whole
get_user_pages will operate under srcu, so we can call it
if necessary (not only find_vma will be safe to call).

Hi Alexei,

My understanding, and please correct me if I'm wrong, is that yes,
get_user_pages() can definitely work under srcu, but if the PTE is not
found (like after fork), then get_user_pages will call some operations
in vma->vm_ops that may cause the caller to sleep and there's no way
or flag to prevent this behavior (except perhaps using
__get_user_pages_fast). While this isn't a problem in the places where
get_user_pages is typically called (e.g. access_process_vm), we can't
do it from a BPF helper because we run it with preempt_disable()
regardless of the srcu (at least the BPF programs from kprobes and
tracepoints, which I use the most).
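
For readers following along, the "PTE is not found" case can be observed from userspace. The sketch below is only an illustration, not kernel or BPF code: it maps fresh anonymous memory, touches each page once, and reports how many minor faults getrusage() attributes to that. The function names minor_faults() and faults_on_first_touch() are made up for this example. Each of those faults is work a helper running under preempt_disable() could not safely trigger.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS and getrusage() under strict -std */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor faults taken by this process so far, or -1 on error. */
long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/*
 * Map `pages` anonymous pages, write one byte to each, and return how
 * many minor faults the process took while doing so (-1 on error).
 * Hypothetical helper for illustration only.
 */
long faults_on_first_touch(size_t pages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    volatile char *buf;
    long before, after;

    if (pages == 0 || page <= 0)
        return -1;

    len = pages * (size_t)page;
    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    before = minor_faults();
    for (size_t i = 0; i < pages; i++)
        buf[i * (size_t)page] = 1;   /* first touch: no PTE yet, so a fault */
    after = minor_faults();

    munmap((void *)buf, len);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling faults_on_first_touch(64) on a typical Linux system should return a positive count, though the exact number depends on the kernel configuration (e.g. transparent huge pages).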

all correct. It's not simpler than what you described.

I thought access() will be available in the situation we care about.
The only bit I'm missing is why tracing exec() args would go
all the way to filemap-backed pages of the executable?
The strings are in the heap/stack, no?

Well, in my experiments done on normal workloads I can see many
constant strings that come directly from the mapped executable and are
passed as arguments to system calls. As a rule of thumb, running an
"apt-get install xxx" on any system will generate several dozens of
them, I analyzed a few and they are all instances of fork() +

we can try to implement filemap_access and it probably won't
need to lock_page to access it, since readahead does it without locking.

Implementing filemap_access sounds like a feasible strategy, I wasn't
sure about that option since it would effectively mean patching all
the vm_ops definitions under fs/. I haven't read the implementation of
filemap yet but I'll take a better look as soon as I have some time.

not following why all fs-es need to be patched.
can we do a mini version of filemap_fault() that only operates
on pages in cache? We cannot serve major faults anyway and real
fs access is not necessary.
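
A rough userspace analogy for "only operate on pages in cache", as a sketch and not the kernel-side filemap path: mincore() reports which pages of a file mapping are already resident, without faulting them in. The helper name resident_pages() is made up for this example. Note that newer kernels may report residency conservatively for files the caller could not write to, so treat the count as a hint.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for mincore() under strict -std */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count how many pages of `path` mincore() reports as resident.
 * Returns the count, or -1 on any error (including an empty file).
 * Hypothetical helper for illustration only.
 */
long resident_pages(const char *path)
{
    long result = -1;
    long page = sysconf(_SC_PAGESIZE);
    struct stat st;
    size_t len, pages;
    unsigned char *vec;
    void *map;
    int fd;

    if (page <= 0)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fstat(fd, &st) != 0 || st.st_size <= 0)
        goto out_close;

    len = (size_t)st.st_size;
    pages = (len + (size_t)page - 1) / (size_t)page;

    /* Mapping alone does not read the file; pages fault in on access. */
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    vec = malloc(pages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        result = 0;
        for (size_t i = 0; i < pages; i++)
            result += vec[i] & 1;   /* bit 0 set: page is resident */
    }

    free(vec);
    munmap(map, len);
out_close:
    close(fd);
    return result;
}
```

For example, resident_pages("/proc/self/exe") gives a rough idea of how much of the running executable is currently resident. A non-sleeping helper would be limited to pages like these, and would have to give up on the rest.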

At the same time, for anonymous VMAs it would be useful to have some
sort of BPF helper working like __get_user_pages_fast which would
somehow ignore the PROT_NONE set on the PTEs during the NUMA balancing
(not even sure that's actually possible, will have to experiment).

should be possible. passing a flag through all of the gup_*()
is probably not going to be acceptable, since all of these functions
already have 6 args, but checking some global state in pte_protnone()
may be a way out.
btw the .config-s I care about don't have CONFIG_NUMA_BALANCING set.

By the way, none of this is urgent as I have my workarounds to get
around that (I mostly parse arguments at the exit of the system call
tracepoint), but it'd be nice to have a few more powerful helpers that
can dig into all sorts of memory situations :-)

yep. same here. It's not urgent, but we need a solution long term.
