Re: The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer

On Wed, 4 May 2016 22:22:07 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

> On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
> <brouer@...> wrote:
>
> > I've started a separate document for designing my page-pool idea.
> >
> > I see the page-pool as a component for allowing fast forwarding with
> > XDP, at the packet-page level, cross device.
> >
> > I want your input on how you imagine XDP/eBPF forwarding would work?
> > I could imagine:
> >  1) eBPF returns an ifindex it wants to forward to,
> >  2) look up whether the netdevice supports the new NDO for XDP-page-fwd,
> >  3A) call XDP-page-fwd with the packet-page,
> >  3B) if there is no XDP-page-fwd, construct an SKB and xmit directly on the device,
> >  4) (in both cases above) later, at TX-DMA completion, return the page to the page-pool.
>
> I think the first step is option 0, where the program will return a
> single return code 'TX' and the driver side will figure out which TX
> queue to use to avoid conflicts.
> More sophisticated selection of ifindex and/or TX queue can be built
> on top.
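The dispatch described in steps 2-3B can be sketched in plain C. This is a userspace simulation, not kernel code: `netdev_sim`, `select_fwd_path`, and the `has_xdp_page_fwd` flag are all invented names standing in for the (hypothetical at this point) XDP-page-fwd NDO.

```c
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes of the forwarding decision discussed above. */
enum fwd_path { FWD_XDP_PAGE, FWD_SKB_FALLBACK, FWD_DROP };

/* Stand-in for a net_device; has_xdp_page_fwd simulates "does the
 * driver implement the new NDO for XDP-page-fwd?" */
struct netdev_sim {
    int ifindex;
    bool has_xdp_page_fwd;
};

/* Steps 2/3: prefer the fast page-level TX path if the target device
 * supports it (3A); otherwise fall back to building a full SKB and
 * transmitting through the normal stack (3B). */
enum fwd_path select_fwd_path(const struct netdev_sim *dev, int target_ifindex)
{
    if (dev == NULL || dev->ifindex != target_ifindex)
        return FWD_DROP;          /* no matching device: drop */
    if (dev->has_xdp_page_fwd)
        return FWD_XDP_PAGE;      /* 3A: superfast page-level TX */
    return FWD_SKB_FALLBACK;      /* 3B: SKB alloc + normal stack TX */
}
```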
I agree that the driver should choose which TX queue to use to avoid
conflicts, allowing lockless access.

I think XDP/BPF "forward"-mode should always select an egress/TX
ifindex/netdevice. If the ifindex happens to match the driver itself,
then the driver can do the superfast TX into a driver TX-ring queue. But
if the ifindex is for another device (that does not support this), then
we fall back to a full SKB alloc and normal stack TX towards that
ifindex/netdevice (likely bypassing the rx_handler).
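The "driver chooses the TX queue" idea usually boils down to dedicating a queue per CPU, so no lock is ever contended. A minimal sketch of that mapping (the function name `pick_txq` is invented; in-kernel code would use smp_processor_id() for the CPU argument):

```c
/* Map the running CPU to a driver TX queue. With one queue per CPU no
 * two CPUs share a queue, so TX-ring access needs no lock; when there
 * are fewer queues than CPUs we wrap around (and locking would be
 * needed again for the shared queues). */
unsigned int pick_txq(unsigned int cpu, unsigned int num_txq)
{
    return cpu % num_txq;
}
```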

> > Avoid NUMA problems, return to same CPU
>
> I think at this stage the NUMA part can be ignored.
> We should assume one socket and deal with NUMA later,
> since such things are out of bpf control and not part of
> the API that we need to stabilize right now.
> We may have some sysctl knobs or ethtool in the future.
You misunderstood me. This was about the page-pool design. It
absolutely needs this "return _page_ to same CPU". Don't worry about
this part.
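The "return _page_ to same CPU" rule can be illustrated with a small userspace sketch. The structures (`sim_page`, `cpu_pool`) and the recycling policy shown are my invention for illustration, assuming the page records which CPU allocated it:

```c
#include <stdbool.h>

/* Simulated page: remembers the CPU whose pool it was allocated from. */
struct sim_page { int alloc_cpu; };

/* Per-CPU recycle pool; count tracks pages currently held. */
struct cpu_pool { int count; };

/* Return path: recycle into the local pool only when the page came
 * from this CPU (keeping the pool strictly per-CPU means no atomics
 * on the fast path). A cross-CPU return falls back to the page
 * allocator instead; returns true if the page was recycled locally. */
bool pool_return(struct cpu_pool *pool, const struct sim_page *page, int cur_cpu)
{
    if (page->alloc_cpu != cur_cpu)
        return false;   /* hand back to the main page allocator */
    pool->count++;      /* recycle locally, lock-free */
    return true;
}
```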

> > For performance reasons, the accounting should be kept per CPU.
>
> In general that's absolutely correct, but by default XDP should
> not have any counters. It's up to the program to keep the stats
> on the number of dropped packets. Thankfully per-cpu hash maps
> already exist.
I also think you misunderstood me here. This is also about the page-pool
design. Of course, XDP should not have any counters.
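The per-CPU accounting both sides agree on follows the same pattern as BPF's per-cpu maps: each CPU bumps only its own slot, so the fast path needs no atomic operations and no shared cache lines; a reader sums the slots afterwards. A minimal sketch (all names invented, fixed CPU count for simplicity):

```c
#define NCPUS 4

/* One counter slot per CPU; writers never touch another CPU's slot. */
struct pcpu_counter { unsigned long cnt[NCPUS]; };

/* Fast path: plain increment, no atomics needed since each CPU owns
 * its own slot. */
void pcpu_inc(struct pcpu_counter *c, int cpu) { c->cnt[cpu]++; }

/* Slow path (stats readout): sum across all CPUs. */
unsigned long pcpu_sum(const struct pcpu_counter *c)
{
    unsigned long sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += c->cnt[i];
    return sum;
}

/* Tiny demo: three increments spread over two CPUs. */
unsigned long pcpu_demo(void)
{
    struct pcpu_counter c = { { 0 } };
    pcpu_inc(&c, 0);
    pcpu_inc(&c, 0);
    pcpu_inc(&c, 3);
    return pcpu_sum(&c);
}
```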

> > XDP pool return hook
> >
> > What about allowing an eBPF hook at the page-pool "return" point? That
> > would allow eBPF to function as an "egress" meter (in circuit-breaker
> > use-cases).
>
> I think we don't have cycles to do anything sophisticated
> at the 'pool return' point. Something like a hard limit (ethtool
> configurable) on the number of recycle-able pages should be good enough.
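The hard-limit policy Alexei suggests is cheap enough to sketch in a few lines. This is an illustrative userspace model, not the actual driver code; `recycle_pool` and `recycle_or_release` are invented names, and `limit` stands in for the ethtool-configurable knob:

```c
#include <stdbool.h>

/* size: pages currently held for recycling; limit: the (hypothetical)
 * ethtool-configurable cap on recycle-able pages. */
struct recycle_pool { unsigned int size; unsigned int limit; };

/* On page return: keep the page for recycling only while below the
 * hard limit; past it, the caller should release the page back to the
 * main page allocator. Returns true if the page was kept. */
bool recycle_or_release(struct recycle_pool *p)
{
    if (p->size >= p->limit)
        return false;   /* over the cap: give back to page allocator */
    p->size++;          /* under the cap: keep for recycling */
    return true;
}
```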

> > The question is whether the PCP "high" watermark could be
> > dynamically determined by the same method proposed for
> > determining the steady-state criteria?
>
> I think we'll try to pick a good default for most of the use cases,
> but ultimately it's another knob. If program processing time
> is high, the user would have to increase this knob to keep
> all pages in the recycle-able pool instead of talking to the
> main page allocator. Even when this knob is not optimal,
> the performance will still be acceptable, since the cost
> of page_alloc+mmap-s will be amortized.
I also think you misunderstood me here. This was about bringing some of
the ideas from the page-pool into the page allocator itself. In
general I'm very much against adding more knobs to the kernel. That has
become one of the big pitfalls of the kernel.

Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat