Re: The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer

On Thu, 5 May 2016 11:01:52 -0700
Tom Herbert <tom@...> wrote:

> On Thu, May 5, 2016 at 10:41 AM, Alexei Starovoitov
> <alexei.starovoitov@...> wrote:
>> On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
>>> On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
>>> <alexei.starovoitov@...> wrote:
>>>> On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
>>>> <brouer@...> wrote:
>>>>>
>>>>> I've started a separate document for designing my page-pool idea.
>>>>>
>>>>> I see the page-pool as a component, for allowing fast forwarding with
>>>>> XDP, at the packet-page level, cross device.
>>>>>
>>>>> I want your input on how you imagine XDP/eBPF forwarding would work?
>>>>> I could imagine,
>>>>> 1) eBPF returns an ifindex it wants to forward to,
>>>>> 2) look if netdevice supports new NDO for XDP-page-fwd
>>>>> 3A) call XDP-page-fwd with packet-page,
>>>>> 3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
>>>>> 4) (for both above cases) later at TX-DMA completion, return to page-pool.
>>>>
>>>> I think the first step is option 0 where program will return
>>>> single return code 'TX' and driver side will figure out which tx queue
>>>> to use to avoid conflicts.
>>>
>>> I'm not sure what this means. In XDP the driver should not be making
>>> any decisions (i.e. driver does not implement any). If there is a
>>> choice of TX queue that should be made by the BPF code. Maybe for the
>>> first instantiation there is only one queue and BPF always returns
>>> index of zero-- this will be sufficient for most L4 load balancers and
>>> ILA router.
>>
>> There are always multiple rx and tx queues.
>> It makes the program portable across different nics and hw configuration
>> when it doesn't know rx queue number and doesn't make decision about tx queues.
>> I don't see a use case for selecting tx queue. The driver side
>> should be making this decision to make sure the performance is optimal
>> and everything is lock-less. Like it can allocate N+M TX queues and N RX
>> queues where N is multiple of cpu count and use M TX queues for normal
>> tcp stack tx traffic. Then everything is collision free and lockless.
>
> Right, the TX queues used by the stack need to be completely
> independent of those used by XDP. If an XDP instance (e.g. an RX
> queue) has exclusive access to a TX queue there is no locking and no
> collisions. Neither is there any need for the instance to transmit on
> multiple queues except in the case that the different COS is offered
> by different queues (e.g. priority), but again COS would be decided by
> the BPF not the driver. In other words, for XDP we need one TX queue
> per COS per each instance (RX queue) of XDP. There should be at most
> one RX queue serviced per CPU also.

I almost agree, but there are some details ;-)

Yes, for XDP-TX we likely cannot piggy-back on the normal stack TX
queues (like we do on the RX queues). Thus, when a driver supports the
XDP-TX feature, it needs to provide additional TX queues for XDP. For
lockless TX I assume we need one XDP-TX queue per CPU.

The way I understand you, you want the BPF program to choose the TX
queue number. I disagree, as BPF should have no knowledge about TX
queue numbers. (It would be hard to get lockless TX queues if the BPF
program chooses.) IMHO the BPF program can choose the egress netdevice
(e.g. via ifindex). Then we call the NDO "XDP-page-fwd", and inside that
call the actual TX queue is chosen based on the currently running CPU
(maybe simply via a this_cpu_xxx call).
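
To make the split of responsibilities concrete, here is a rough userspace
model in C (not kernel code). Everything in it is an assumption made up
for illustration: the struct, the function names, and the layout where
the stack's TX queues come first and the per-CPU XDP-TX queues follow.
The only point is that the BPF side hands over an ifindex, and the queue
number falls out of the CPU id on the driver side.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one netdevice's TX queue layout.
 * Assumed layout: queues [0, n_stack_txq) belong to the normal stack,
 * queues [n_stack_txq, n_stack_txq + n_cpus) are XDP-TX, one per CPU. */
struct xdp_dev_model {
	int ifindex;              /* what the BPF program returns */
	unsigned int n_stack_txq; /* TX queues reserved for the stack */
	unsigned int n_cpus;      /* one dedicated XDP-TX queue per CPU */
};

/* Driver-side choice: the XDP-TX queue is a pure function of the CPU. */
static unsigned int xdp_txq_for_cpu(const struct xdp_dev_model *dev,
				    unsigned int cpu)
{
	assert(cpu < dev->n_cpus);
	return dev->n_stack_txq + cpu;
}

/* Stand-in for the "XDP-page-fwd" NDO: look up the egress device by
 * ifindex and return the TX queue this CPU would use, or -1 if the
 * ifindex is unknown. The caller never passes a queue number. */
static int xdp_page_fwd_model(const struct xdp_dev_model *devs, size_t ndevs,
			      int ifindex, unsigned int cpu)
{
	size_t i;

	for (i = 0; i < ndevs; i++) {
		if (devs[i].ifindex == ifindex)
			return (int)xdp_txq_for_cpu(&devs[i], cpu);
	}
	return -1;
}
```

In a real driver the cpu argument would come from the running context
rather than from the caller, which is what keeps the queue private to
one CPU.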

Getting the TX queues lockless has one problem: TX DMA completion
interrupts. Today, TX completion (the "cleanup" of the TX ring-queue) can
run on another CPU. This breaks the lockless scheme. We need to deal with
this somehow, by setting up our XDP-TX queues more strictly from the
kernel side and not allowing userspace to change smp_affinity (simply
chmod the proc file ;-)).
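
A small userspace sketch of the constraint, again with made-up names and
no claim to match any real driver: the ring has a producer index (xmit
path) and a consumer index (TX completion path), and it can only stay
lock-free if both are touched by the same CPU. The model simply refuses
any access from a CPU that does not own the queue.

```c
#include <assert.h>

#define RING_SIZE 8 /* arbitrary size for the sketch */

/* Hypothetical per-CPU XDP-TX ring. No locks and no atomics: this is
 * only safe because a single CPU touches both indexes. */
struct xdp_txq_model {
	unsigned int owner_cpu; /* the only CPU allowed to use the ring */
	unsigned int prod;      /* advanced by the xmit path */
	unsigned int cons;      /* advanced by TX completion cleanup */
};

/* Queue one packet-page. Returns 0 on success, -1 if called from a
 * foreign CPU (would need a lock), -2 if the ring is full. */
static int txq_xmit(struct xdp_txq_model *q, unsigned int cpu)
{
	if (cpu != q->owner_cpu)
		return -1;
	if (q->prod - q->cons == RING_SIZE)
		return -2;
	q->prod++;
	return 0;
}

/* TX completion cleanup. Returns the number of descriptors cleaned
 * (pages that could go back to the page-pool), or -1 if the completion
 * ran on a CPU other than the owner. */
static int txq_clean(struct xdp_txq_model *q, unsigned int cpu)
{
	unsigned int cleaned;

	if (cpu != q->owner_cpu)
		return -1;
	cleaned = q->prod - q->cons;
	q->cons = q->prod;
	return (int)cleaned;
}
```

The -1 return in txq_clean() is exactly the case a movable TX completion
IRQ would hit, which is why the affinity needs to be pinned.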

Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of
