
Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
I'm not sure what this means. In XDP the driver should not be making
any decisions (i.e. the driver does not implement any policy). If there
is a choice of TX queue, that choice should be made by the BPF code.
Maybe for the first instantiation there is only one queue and BPF always
returns an index of zero -- this will be sufficient for most L4 load
balancers and the ILA router.

Tom

More sophisticated selection of ifindex and/or tx queue can be built
on top.

Avoid NUMA problems, return to same CPU
I think at this stage the numa part can be ignored.
We should assume one socket and deal with numa later,
since such things are out of bpf control and not part of
API that we need to stabilize right now.
We may have some sysctl knobs or ethtool in the future.

For performance reasons, the accounting should be kept as per CPU
structures.
In general that's absolutely correct, but by default XDP should
not have any counters. It's up to the program to keep the stats
on number of dropped packets. Thankfully per-cpu hash maps
already exist.

XDP pool return hook
--------------------

What about allowing a eBPF hook at page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).
I think we don't have cycles to do anything sophisticated
at 'pool return' point. Something like hard limit (ethtool configurable)
on number of recycle-able pages should be good enough.

The question is, whether the PCP "high" watermark could be
dynamically determined by the same method proposed for
determining the steady-state criteria?
I think we'll try to pick the good default for most of the use cases,
but ultimately it's another knob. If program processing time
is high, the user would have to increase this knob to keep
all pages in the recycle-able pool instead of talking to
main page-allocator. Even when this knob is not optimal,
the performance will still be acceptable, since the cost
of page_alloc+mmap-s will be amortized.


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
I'm not sure what this means. In XDP the driver should not be making
any decisions (i.e. driver does not implement any). If there is a
choice of TX queue that should be made by the BPF code. Maybe for the
first instantiation there is only one queue and BPF always returns
index of zero-- this will be sufficient for most L4 load balancers and
ILA router.
There are always multiple RX and TX queues.
The program stays portable across different NICs and HW configurations
when it doesn't know the RX queue number and doesn't make decisions
about TX queues. I don't see a use case for selecting a TX queue.
The driver side should be making this decision to make sure the
performance is optimal and everything is lock-less. For example, it can
allocate N+M TX queues and N RX queues, where N is a multiple of the
CPU count, and use the M TX queues for normal TCP stack TX traffic.
Then everything is collision-free and lockless.
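
A minimal sketch of that N+M split, assuming N equals the CPU count;
smp_processor_id() is a real kernel helper, the other names are only
illustrative:

/* Hypothetical queue layout:
 *   TX queues [0 .. N-1]     one per CPU, reserved for XDP (lockless)
 *   TX queues [N .. N+M-1]   normal stack TX traffic
 */
static inline int xdp_txq_index(void)
{
        /* each CPU owns exactly one XDP TX queue, so no lock is needed */
        return smp_processor_id();
}

static inline int stack_txq_index(int num_xdp_queues, int stack_queue)
{
        /* stack queues start right after the XDP-reserved block */
        return num_xdp_queues + stack_queue;
}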


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 10:41 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
I'm not sure what this means. In XDP the driver should not be making
any decisions (i.e. driver does not implement any). If there is a
choice of TX queue that should be made by the BPF code. Maybe for the
first instantiation there is only one queue and BPF always returns
index of zero-- this will be sufficient for most L4 load balancers and
ILA router.
There are always multiple rx and tx queues.
It makes the program portable across different nics and hw configuration
when it doesn't know rx queue number and doesn't make decision about tx queues.
I don't see a use case for selecting tx queue. The driver side
should be making this decision to make sure the performance is optimal
and everything is lock-less. Like it can allocate N+M TX queues and N RX
queues where N is multiple of cpu count and use M TX queues for normal
tcp stack tx traffic. Then everything is collision free and lockless.
Right, the TX queues used by the stack need to be completely
independent of those used by XDP. If an XDP instance (e.g. an RX
queue) has exclusive access to a TX queue, there is no locking and no
collisions. Neither is there any need for the instance to transmit on
multiple queues, except in the case where different COS are offered
by different queues (e.g. priority), but again the COS would be decided
by the BPF program, not the driver. In other words, for XDP we need one
TX queue per COS per instance (RX queue) of XDP. There should also be
at most one RX queue serviced per CPU.

Tom


Re: The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer
 

On Thu, 5 May 2016 11:01:52 -0700
Tom Herbert <tom@...> wrote:

On Thu, May 5, 2016 at 10:41 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
I'm not sure what this means. In XDP the driver should not be making
any decisions (i.e. driver does not implement any). If there is a
choice of TX queue that should be made by the BPF code. Maybe for the
first instantiation there is only one queue and BPF always returns
index of zero-- this will be sufficient for most L4 load balancers and
ILA router.
There are always multiple rx and tx queues.
It makes the program portable across different nics and hw configuration
when it doesn't know rx queue number and doesn't make decision about tx queues.
I don't see a use case for selecting tx queue. The driver side
should be making this decision to make sure the performance is optimal
and everything is lock-less. Like it can allocate N+M TX queues and N RX
queues where N is multiple of cpu count and use M TX queues for normal
tcp stack tx traffic. Then everything is collision free and lockless.
Right, the TX queues used by the stack need to be completely
independent of those used by XDP. If an XDP instance (e.g. an RX
queue) has exclusive access to a TX queue there is no locking and no
collisions. Neither is there any need for the instance to transmit on
multiple queues except in the case that the different COS is offered
by different queues (e.g. priority), but again COS would be decided by
the BPF not the driver. In other words, for XDP we need one TX queue
per COS per each instance (RX queue) of XDP. There should be at most
one RX queue serviced per CPU also.
I almost agree, but there are some details ;-)

Yes, for XDP-TX we likely cannot piggy-back on the normal stack TX
queues (like we do on the RX queues). Thus, when a driver supports the
XDP-TX feature, it needs to provide some additional TX queues for XDP.
For lockless TX I assume we need an XDP-TX queue per CPU.

The way I understand you, you want the BPF program to choose the TX
queue number. I disagree, as BPF should have no knowledge of TX
queue numbers. (It would be hard to get lockless TX queues if the BPF
program chooses.) IMHO the BPF program can choose the egress netdevice
(e.g. via ifindex). Then we call the NDO "XDP-page-fwd", and inside that
call the actual TX queue is chosen based on the currently running CPU
(maybe simply via a this_cpu_xxx call).
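
A rough sketch of such an NDO picking the TX queue from the current CPU;
netdev_priv() and smp_processor_id() are real helpers, the mydrv_* names
and the NDO itself are hypothetical:

static int mydrv_ndo_xdp_page_fwd(struct net_device *dev, struct page *page,
                                  unsigned int len)
{
        struct mydrv_priv *priv = netdev_priv(dev);
        /* one XDP-TX ring per CPU -> lockless enqueue from NAPI context */
        struct mydrv_tx_ring *ring = &priv->xdp_tx_ring[smp_processor_id()];

        return mydrv_xmit_page(ring, page, len);
}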


Getting TX queues lockless has one problem: TX DMA completion
interrupts. Today TX completion, the "cleanup" of the TX ring-queue, can
run on another CPU. This breaks the lockless scheme. We need to deal
with this somehow, and set up our XDP-TX-queue more strictly from the
kernel side, and not allow userspace to change smp_affinity (simply
chmod the proc file ;-)).
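
One kernel-side option for the completion-interrupt part, only a sketch:
pin each XDP-TX queue's completion IRQ to the CPU that owns the queue.
irq_set_affinity_hint(), cpumask_clear() and cpumask_set_cpu() are real
helpers; the ring structure is assumed. Note this only hints the
affinity; actually preventing userspace from moving it, as suggested
above, would need something stronger.

static void mydrv_pin_xdp_txq_irq(struct mydrv_tx_ring *ring, int cpu)
{
        /* run TX-DMA completion cleanup on the CPU that filled the ring,
         * so the ring stays single-CPU and lockless */
        cpumask_clear(&ring->affinity_mask);
        cpumask_set_cpu(cpu, &ring->affinity_mask);
        irq_set_affinity_hint(ring->irq, &ring->affinity_mask);
}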

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Re: The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer
 

On Wed, 4 May 2016 22:22:07 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
More sophisticated selection of ifindex and/or tx queue can be built
on top.
I agree that the driver should choose which TX queue to use to avoid
conflicts, allowing lockless access.

I think the XDP/BPF "forward" mode should always select an egress/TX
ifindex/netdevice. If the ifindex happens to match the driver itself,
then the driver can do the superfast TX into a driver TX-ring queue. But
if the ifindex is for another device (one that does not support this),
then we fall back to a full-SKB alloc and normal stack TX towards that
ifindex/netdevice (likely bypassing the rx_handler).
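
A sketch of that dispatch; dev_get_by_index_rcu() and dev_net() are real,
while the ndo_xdp_page_fwd NDO and the SKB fallback helper are the
hypothetical pieces under discussion:

/* BPF chose an egress ifindex; the kernel decides fast vs. slow path.
 * Assumes RCU read-side protection, which NAPI context provides. */
static int xdp_forward(struct net_device *rx_dev, struct page *page,
                       unsigned int len, int ifindex)
{
        struct net_device *tx_dev = dev_get_by_index_rcu(dev_net(rx_dev), ifindex);

        if (!tx_dev)
                return -ENODEV;

        if (tx_dev->netdev_ops->ndo_xdp_page_fwd)
                /* fast path: hand the raw page to the device's XDP TX ring */
                return tx_dev->netdev_ops->ndo_xdp_page_fwd(tx_dev, page, len);

        /* slow path: build a full SKB and use the normal stack TX */
        return xdp_build_skb_and_xmit(tx_dev, page, len);
}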


Avoid NUMA problems, return to same CPU
I think at this stage the numa part can be ignored.
We should assume one socket and deal with numa later,
since such things are out of bpf control and not part of
API that we need to stabilize right now.
We may have some sysctl knobs or ethtool in the future.
You misunderstood me. This was about the page-pool design. It
absolutely needs this "return _page_ to same CPU". Don't worry about
this part.


For performance reasons, the accounting should be kept as per CPU
structures.
In general that's absolutely correct, but by default XDP should
not have any counters. It's up to the program to keep the stats
on number of dropped packets. Thankfully per-cpu hash maps
already exist.
I also think you misunderstood me here. This is also about the page-pool
design. Of course, XDP should not have any counters.


XDP pool return hook
--------------------

What about allowing a eBPF hook at page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).
I think we don't have cycles to do anything sophisticated
at 'pool return' point. Something like hard limit (ethtool configurable)
on number of recycle-able pages should be good enough.

The question is, whether the PCP "high" watermark could be
dynamically determined by the same method proposed for
determining the steady-state criteria?
I think we'll try to pick the good default for most of the use cases,
but ultimately it's another knob. If program processing time
is high, the user would have to increase this knob to keep
all pages in the recycle-able pool instead of talking to
main page-allocator. Even when this knob is not optimal,
the performance will still be acceptable, since the cost
of page_alloc+mmap-s will be amortized.
I also think you misunderstood me here. This was about bringing some of
the ideas from the page-pool into the page allocator itself. In
general I'm very much against adding more knobs to the kernel. It has
become one of the big pitfalls of the kernel.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 12:11 PM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Thu, 5 May 2016 11:01:52 -0700
Tom Herbert <tom@...> wrote:

On Thu, May 5, 2016 at 10:41 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
I'm not sure what this means. In XDP the driver should not be making
any decisions (i.e. driver does not implement any). If there is a
choice of TX queue that should be made by the BPF code. Maybe for the
first instantiation there is only one queue and BPF always returns
index of zero-- this will be sufficient for most L4 load balancers and
ILA router.
There are always multiple rx and tx queues.
It makes the program portable across different nics and hw configuration
when it doesn't know rx queue number and doesn't make decision about tx queues.
I don't see a use case for selecting tx queue. The driver side
should be making this decision to make sure the performance is optimal
and everything is lock-less. Like it can allocate N+M TX queues and N RX
queues where N is multiple of cpu count and use M TX queues for normal
tcp stack tx traffic. Then everything is collision free and lockless.
Right, the TX queues used by the stack need to be completely
independent of those used by XDP. If an XDP instance (e.g. an RX
queue) has exclusive access to a TX queue there is no locking and no
collisions. Neither is there any need for the instance to transmit on
multiple queues except in the case that the different COS is offered
by different queues (e.g. priority), but again COS would be decided by
the BPF not the driver. In other words, for XDP we need one TX queue
per COS per each instance (RX queue) of XDP. There should be at most
one RX queue serviced per CPU also.
I almost agree, but there are some details ;-)

Yes, for XDP-TX we likely cannot piggy-back on the normal stack TX
queues (like we do on the RX queues). Thus, when a driver support the
XDP-TX feature, they need to provide some more TX queue's for XDP. For
lockless TX I assume we need a XDP-TX queue per CPU.

The way I understand you, you want the BPF program to choose the TX
queue number. I disagree, as BPF should have no knowledge about TX
queue numbers. (It would be hard to get lockless TX queue's if BPF
program chooses). IMHO the BPF program can choose the egress netdevice
(e.g. via ifindex). Then we call the NDO "XDP-page-fwd", inside that
call, the actual TX queue is chosen based on the current-running-CPU
(maybe simply via a this_cpu_xxx call).
I think we're saying the same thing, just using different notation.
The BPF program returns an index which the driver maps to a queue, but
this index is relative to the XDP instance. So if a device offers 3
levels of priority queues, then the BPF program can return 0, 1, or 2.
The driver can map this return value to a queue (probably from a set of
three queues dedicated to the XDP instance). What I am saying is that
this driver mapping should be trivial and does not implement any policy
other than restricting the XDP instance to its set -- e.g. the mapping
to the actual queue number could be 3*N+R, where N is the instance # of
XDP and R is the return index. Egress on a different interface can work
the same way; for instance, index 0 might queue for the local interface
and index 1 might queue for another interface. This simple
return-value-to-queue mapping is a lot easier for crossing devices if
they are managed by the same driver, I think.
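
A minimal sketch of that 3*N+R mapping (the queue count per instance is
just the example value from above):

#define XDP_TXQS_PER_INSTANCE 3     /* e.g. 3 priority levels */

/* Map return index R (0..2) of XDP instance N (== RX queue id) to an
 * actual TX queue out of the set dedicated to that instance. */
static inline int xdp_ret_to_txq(int instance, int ret_index)
{
        return XDP_TXQS_PER_INSTANCE * instance + ret_index;
}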


Getting TX queues lockless, have one problem: TX DMA completion
interrupts. Today TX completion, "cleanup" of TX ring-queue can run on
another CPU. This breaks the lockless scheme. We need deal with this
somehow, and setup our XDP-TX-queue "more-strict" somehow from the
kernel side, and not allow userspace to change smp_affinity (simply
chmod the proc file ;-)).
Hmm, how does DPDK deal with this? Hopefully you wouldn't need an
actual lock for this either, atomic ops on producer/consumer pointers
should work for most devices?
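
For reference, the lock-free scheme alluded to here is essentially a
single-producer/single-consumer ring; a bare-bones sketch using the
kernel's ordering primitives (not tied to any real driver) might look
like:

/* Minimal SPSC ring: the producer (XDP TX path) and the consumer
 * (TX-completion cleanup) each own one free-running index; no lock,
 * only acquire/release ordering on the shared indices. */
struct spsc_ring {
        unsigned int head;      /* written only by the producer */
        unsigned int tail;      /* written only by the consumer */
        unsigned int size;      /* power of two */
        void *slot[];
};

static int spsc_produce(struct spsc_ring *r, void *item)
{
        unsigned int head = r->head;

        if (head - smp_load_acquire(&r->tail) == r->size)
                return -1;                              /* ring full */
        r->slot[head & (r->size - 1)] = item;
        smp_store_release(&r->head, head + 1);          /* publish slot first */
        return 0;
}

static void *spsc_consume(struct spsc_ring *r)
{
        unsigned int tail = r->tail;
        void *item;

        if (tail == smp_load_acquire(&r->head))
                return NULL;                            /* ring empty */
        item = r->slot[tail & (r->size - 1)];
        smp_store_release(&r->tail, tail + 1);          /* free slot after read */
        return item;
}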

Tom

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Re: The page-pool as a component for XDP forwarding

Thomas Monjalon <thomas.monjalon@...>
 

2016-05-04 14:01, Tom Herbert:
On Wed, May 4, 2016 at 12:55 PM, Thomas Monjalon
<thomas.monjalon@...> wrote:
2016-05-04 12:47, Tom Herbert:
Maybe we can get basic forwarding to work first ;-). From a system
design point of view mixing different types of NICs on the same server
is not very good anyway.
Mixing NICs on a server is probably not common. But I wonder wether it
could allow to leverage different offload capabilities for an asymmetrical
traffic?
Maybe, but it's a lot of complexity. Do you have a specific use case in mind?
No real use case now, but offloads in NICs are becoming more and more
complex and really different depending on the vendor.
I think the tunnel encapsulation offload use case is becoming real.
We can also think of different flow steering depending on the tunnel type.

Please could you elaborate why mixing is not very good?
Harder to design, test, don't see much value in it. Supporting such
things forces us to continually raise the abstraction and generalize
interfaces more and more which is exactly how we wind up with things
like 400 bytes skbuffs, locking, soft queues, etc. XDP is expressly
not meant to be a general solution, and that gives us liberty to cut
out anything that doesn't yield performance like trying to preserve a
high performance interface between two arbitrary drivers (but still
addressing the 90% case).
Interesting point of view. Thanks


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 1:46 PM, Thomas Monjalon
<thomas.monjalon@...> wrote:
2016-05-04 14:01, Tom Herbert:
On Wed, May 4, 2016 at 12:55 PM, Thomas Monjalon
<thomas.monjalon@...> wrote:
2016-05-04 12:47, Tom Herbert:
Maybe we can get basic forwarding to work first ;-). From a system
design point of view mixing different types of NICs on the same server
is not very good anyway.
Mixing NICs on a server is probably not common. But I wonder wether it
could allow to leverage different offload capabilities for an asymmetrical
traffic?
Maybe, but it's a lot of complexity. Do you have a specific use case in mind?
No real use case now but offload in NICs are becoming more and more complex
and really different depending of the vendor.
We're trying hard to discourage vendors from doing that. All these
complex HW offloads aren't helping matters! (e.g. see the continuing
saga of getting vendors to give us protocol-generic checksum
offload...). Other than checksum offload and RSS, I'm not seeing much
we can leverage from the HW offloads for XDP. Of course, when we can
offload the BPF program to HW, that might be a different story.

I think tunnel encapsulation offload use case is becoming real.
We can also think to different flow steering depending of the tunnel type.
Won't we be able to implement encap/decap in XDP just as easily but in
a way that is completely user programmable?

Tom


Please could you elaborate why mixing is not very good?
Harder to design, test, don't see much value in it. Supporting such
things forces us to continually raise the abstraction and generalize
interfaces more and more which is exactly how we wind up with things
like 400 bytes skbuffs, locking, soft queues, etc. XDP is expressly
not meant to be a general solution, and that gives us liberty to cut
out anything that doesn't yield performance like trying to preserve a
high performance interface between two arbitrary drivers (but still
addressing the 90% case).
Interesting point of view. Thanks


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
We'd need a way to specify a priority queue from the bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
The lower 8 bits to encode the action should be enough.
The first merge-able step is to do 0, 1, 2 in one driver (like mlx4) and
start building it in other drivers.
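
That encoding could look roughly like this as plain macros (the names
mirror the proposal above; the layout itself is just a sketch):

/* Action in the low 8 bits, an action-specific argument in the upper bits. */
#define XDP_RET_ACTION(ret)     ((ret) & 0xff)
#define XDP_RET_ARG(ret)        ((ret) >> 8)

enum {
        BPF_XDP_DROP              = 0,
        BPF_XDP_PASS              = 1,
        BPF_XDP_TX                = 2,
        BPF_XDP_TX_PRIO           = 3,  /* arg = priority */
        BPF_XDP_TX_PHYS_IFINDEX   = 4,  /* arg = ifindex */
        BPF_XDP_RX_NETDEV_IFINDEX = 5,  /* arg = ifindex of veth or any netdev */
};

/* A program requesting TX at priority 2 would then do:
 *      return BPF_XDP_TX_PRIO | (2 << 8);
 */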


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Thu, May 05, 2016 at 09:32:32PM +0200, Jesper Dangaard Brouer wrote:
On Wed, 4 May 2016 22:22:07 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0 where program will return
single return code 'TX' and driver side will figure out which tx queue
to use to avoid conflicts.
More sophisticated selection of ifindex and/or tx queue can be built
on top.
I agree that driver choose TX queue to use to avoid conflicts, allowing
lockless access.

I think XDP/BPF "forward"-mode" should always select an egress/TX
ifindex/netdevice. If the ifindex happen to match the driver itself,
then driver can to the superfast TX into a driver TX-ring queue. But
if the ifindex is for another device (that does not support this) then
we fallback to full-SKB alloc and normal stack TX towards that
ifindex/netdevice (likely bypassing the rx_handler).
NO. See my ongoing rant on performance vs generality.
'Then it looks so generic and nice' arguments are not applicable to XDP.
Even if the ifindex check didn't cost any performance, it still doesn't
make sense to do it, since the ifindex is dynamic, so the program
would need to be tailored for a specific ifindex: either compiled
on-the-fly once the ifindex is known, or with an extra map lookup to
figure out which ifindex to use. For the load balancer/ILA router use
cases it's unnecessary, so the program will not be dealing with an
ifindex.
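
For illustration only, the 'extra map lookup' mentioned here (the very
thing the load-balancer/ILA programs are argued not to need) would look
roughly like this in classic samples/bpf style, reusing the hypothetical
return-code encoding sketched earlier:

#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC(), bpf_map_lookup_elem() */

/* Array map holding the target ifindex, filled in by userspace at load time. */
struct bpf_map_def SEC("maps") tx_ifindex_map = {
        .type        = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 1,
};

SEC("xdp")
int xdp_fwd_prog(struct xdp_md *ctx)
{
        __u32 key = 0;
        __u32 *ifindex = bpf_map_lookup_elem(&tx_ifindex_map, &key);

        if (!ifindex)
                return BPF_XDP_DROP;
        /* hypothetical encoding: action in low 8 bits, ifindex in upper bits */
        return BPF_XDP_TX_PHYS_IFINDEX | (*ifindex << 8);
}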


Re: The page-pool as a component for XDP forwarding

Daniel Borkmann
 

On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
lower 8-bits to encode action should be enough.
First merge-able step is to do 0,1,2 in one driver (like mlx4) and
start building it in other drivers.
Can't this be done in a second step, with some per-cpu scratch data
as we have for redirect? That would seem easier to use to me, and easier
to extend with further data required to TX or to RX to the stack ... The
return code could have a flag telling the driver to look at the scratch
data, for example.
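
A rough sketch of this idea, assuming a per-cpu scratch area and a flag
bit inside the action byte (all names hypothetical; this_cpu_ptr() and
DEFINE_PER_CPU are real):

/* Scratch the program fills in; the driver only reads it when the
 * return code carries the flag. */
struct xdp_scratch {
        u32 tx_ifindex;
        u32 tx_prio;
};
DEFINE_PER_CPU(struct xdp_scratch, xdp_scratch_area);

#define XDP_RET_F_SCRATCH (1u << 7)     /* flag bit inside the action byte */

/* driver side, after the program returned 'ret': */
static u32 xdp_handle_ret(u32 ret)
{
        u32 action = ret & 0x7f;

        if (ret & XDP_RET_F_SCRATCH) {
                struct xdp_scratch *s = this_cpu_ptr(&xdp_scratch_area);

                pr_debug("xdp: fwd to ifindex %u prio %u\n",
                         s->tx_ifindex, s->tx_prio);
        }
        return action;  /* then switch (action) as usual */
}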


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Fri, May 06, 2016 at 12:00:57AM +0200, Daniel Borkmann wrote:
On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
lower 8-bits to encode action should be enough.
First merge-able step is to do 0,1,2 in one driver (like mlx4) and
start building it in other drivers.
Can't this be done in a second step, with some per-cpu scratch data
as we have for redirect? That would seem easier to use to me, and easier
to extend with further data required to tx or rx to stack ... The return
code could have a flag to tell to look at the scratch data, for example.
Yes. Codes 3, 4, 5, 6, 7, ... can look at per-cpu scratch data too.
My point is that for step one we define the semantics for opcodes 0, 1, 2
in the first 8 bits of the return value. Everything else is reserved and
defaults to drop.
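
Driver-side handling of that first step could then be as small as
(sketch; the XDP_* names follow the proposal above):

/* Only the low 8 bits carry the action; anything not yet defined drops. */
switch (ret & 0xff) {
case BPF_XDP_PASS:
        /* build an SKB and hand the packet to the normal stack */
        break;
case BPF_XDP_TX:
        /* queue the page on this CPU's XDP TX ring */
        break;
case BPF_XDP_DROP:
default:
        /* reserved/unknown codes default to drop: recycle the page */
        break;
}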


Re: The page-pool as a component for XDP forwarding

Daniel Borkmann
 

On 05/06/2016 12:04 AM, Alexei Starovoitov wrote:
On Fri, May 06, 2016 at 12:00:57AM +0200, Daniel Borkmann wrote:
On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
lower 8-bits to encode action should be enough.
First merge-able step is to do 0,1,2 in one driver (like mlx4) and
start building it in other drivers.
Can't this be done in a second step, with some per-cpu scratch data
as we have for redirect? That would seem easier to use to me, and easier
to extend with further data required to tx or rx to stack ... The return
code could have a flag to tell to look at the scratch data, for example.
yes. 3,4,5,6,7,.. code can look at per-cpu scratch data too.
My point that for step one we define semantic for opcodes 0,1,2 in
the first 8 bits of return value. Everything else is reserved and
defaults to drop.
Yep, a first step with opcodes 0=drop, 1=pass/stack, 2=tx/fwd defined
sounds reasonable to me, with the rest treated as drop.


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without the BPF
tag, to allow the possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. The index maps to a priority, a queue,
another device, whatever. The caller will need to understand what the
different possible indices mean, but this can be negotiated out of band
and up front before programming.

BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
Similarly, just do XDP_RX with some index that has meaning to the caller
and the driver. (Also, NETDEV and IFINDEX are kernel-specific terms; we
should avoid those in the interface.)

So that just leaves three actions: XDP_DROP, XDP_TX, XDP_RX!

lower 8-bits to encode action should be enough.
First merge-able step is to do 0,1,2 in one driver (like mlx4) and
start building it in other drivers.


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three
opcodes and two of them take parameters. The parameters are generic, so
they can indicate arbitrary instructions on what to do with the packet
(they can point to a priority queue, an HW rate-limited queue, a tap
queue, whatever). This is a way that drivers can extend the capabilities
of the interface for their own features without requiring any changes to
the interface. A single index allows one array lookup which returns
whatever information is needed for the driver to act on the packet. So
this scheme is both generic and performant -- it allows generality in
that the TX and RX actions can be arbitrarily extended, and it is
performant since all the driver needs to do is look up the index in the
array to complete the action. If we don't do something like this, then
every time someone adds some new functionality we have to add another
action -- and that doesn't scale. Priority queues are a perfect example
of this: they are not a commonly supported feature and should not be
exposed in the base action.

This also means the return code is simple, just two fields: the opcode
and its parameter. In phase one the parameter would always be zero.

Tom


Re: The page-pool as a component for XDP forwarding

Daniel Borkmann
 

On 05/06/2016 01:00 AM, Tom Herbert via iovisor-dev wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three
opcodes and two of them take parameters. The parameters are generic so
they can indicate arbitrary instructions on what to do with the packet
(these can point to priority queue, HW rate limited queue, tap queue,
whatever). This is a way that drivers can extend the capabilities of
the interface for their own features without requiring any changes to
But that would mean that XDP programs are not portable anymore across
different drivers, no? So they'd have to be rewritten when porting to a
different NIC, or could not be supported there due to missing features.

the interface. A singe index allows one array lookup which returns
what every information is needed for driver to act on the packet. So
this scheme is both generic and performant-- it allows generality in
that the TX and RX actions can be arbitrarily extended and is
performant since all the driver needs to do is look the index in the
array to complete the action. If we don't do something like this, then
every time someone adds some new functionality we have to add another
action-- that doesn't scale. Priority queues are a perfect example of
this, these are not a common supported feature and should not be
exposed in the base action.
For portability of one XDP program across different NICs, this
would either just be a *hint* to the driver, and drivers not supporting
it don't care about that, or drivers need to indicate their capabilities
in some way to the verifier, so that the verifier makes sure that such an
action is possible at all for the given driver. For the prio queues case,
the first option is probably better.

This also means the return code is simple, just two fields: the opcode
and its parameter. In phase one the parameter would always be zero.

Tom


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three
opcodes and two of them take parameters. The parameters are generic so
they can indicate arbitrary instructions on what to do with the packet
(these can point to priority queue, HW rate limited queue, tap queue,
whatever). This is a way that drivers can extend the capabilities of
the interface for their own features without requiring any changes to
the interface. A singe index allows one array lookup which returns
what every information is needed for driver to act on the packet. So
this scheme is both generic and performant-- it allows generality in
that the TX and RX actions can be arbitrarily extended and is
performant since all the driver needs to do is look the index in the
array to complete the action.
Then the semantics will become driver dependent. That will be unusable
from the program's point of view. Non-portable programs, etc.
Also,
if ((u8)ret == TX && ret>>8 == current_dev->ifindex) do_xmit
_is_ slower than
if ((u8)ret == TX) do_xmit
At line rate every cycle and every branch counts.

If we don't do something like this, then
every time someone adds some new functionality we have to add another
action-- that doesn't scale.
Why? I think it's the opposite. TX with prio is clearly a different
action. Some drivers may support it, some not.
And the programs will clearly indicate to the driver side what
they want, so it's only one 'switch ((u8)ret)' on the driver side
to understand the action and do it.


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 4:18 PM, Daniel Borkmann <daniel@...> wrote:
On 05/06/2016 01:00 AM, Tom Herbert via iovisor-dev wrote:

On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:

On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:

On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:

I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.

+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex

I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.

No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three
opcodes and two of them take parameters. The parameters are generic so
they can indicate arbitrary instructions on what to do with the packet
(these can point to priority queue, HW rate limited queue, tap queue,
whatever). This is a way that drivers can extend the capabilities of
the interface for their own features without requiring any changes to

But that would mean, that XDP programs are not portable anymore across
different drivers, no? So they'd have to be rewritten when porting to a
different nic or cannot be supported there due to missing features.
Actually, I believe the opposite is true. The interface I propose does
not require any features to be supported except that the driver can
drop, transmit on a default queue, and receive packets. Using
anything else is an "advanced" feature which will vary from device to
device, and we don't want to mandate anything beyond that. To deal
with this, allow the BPF program to be parameterized at runtime to get
the mappings of logical features to indices. Priority is a great
example: if we have an action called BPF_XDP_TX_PRIO but the device
doesn't support priority queues, then what does that mean? It doesn't
make sense to arbitrarily require this to be supported. Instead, we
can ask the driver up front what it supports and adjust the program
(mappings) as we see fit.
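
One way to realize that parameterization, sketched with today's bpf
map-update wrapper (the capability query and the feature ids are
hypothetical):

/* Userspace control plane, before attaching the program: ask the driver
 * what it offers and publish the logical-feature -> index mapping in an
 * array map that the XDP program consults at runtime. */
struct xdp_caps caps;                           /* hypothetical */
__u32 key, val;

query_driver_xdp_caps(ifindex, &caps);          /* hypothetical ethtool/netlink query */

key = FEATURE_HIGH_PRIO_TX;                     /* made-up logical feature id */
val = caps.has_prio_queues ? caps.prio_txq_index : caps.default_txq_index;
bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);   /* real bpf(2) wrapper */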

Tom

the interface. A singe index allows one array lookup which returns
what every information is needed for driver to act on the packet. So
this scheme is both generic and performant-- it allows generality in
that the TX and RX actions can be arbitrarily extended and is
performant since all the driver needs to do is look the index in the
array to complete the action. If we don't do something like this, then
every time someone adds some new functionality we have to add another
action-- that doesn't scale. Priority queues are a perfect example of
this, these are not a common supported feature and should not be
exposed in the base action.

For portability on one XDP program across different NICs, then this
would either just be a *hint* to the driver and drivers not supporting
this don't care about that, or drivers need to indicate their capabilities
in some way to the verifier, so that verifier makes sure that such action
is possible at all for the given driver. For prio queues case, probably
first option might be better.

This also means the return code is simple, just two fields: the opcode
and its parameter. In phase one the parameter would always be zero.

Tom


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Thu, May 5, 2016 at 4:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation.
BPF program returns an index which the driver maps to a queue, but
this index is relative to XDP instance. So if a device offers 3 levels
priority queues then BPF program can return 0,1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set-- like mapping to actual queue
number could be 3*N+R where N in instance # of XDP and R is return
index. Egress on a different interface can work the same way, for
instance 0 index might queue for local interface, 1 index might queue
for interface. This simple return value to queue mapping is lot easier
for crossing devices if they are managed by the same driver I think.
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF
tag to allow possibility that some non-BPF entity really wants to use
this interface ;-) ).

Just have XDP_TX with some index. Index maps to priority, queue, other
device, what ever. The caller will need to understand what the
different possible indices mean but this can be negotiated out of band
and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance'
Combining them into one generic TX code is not simpler. Not for
the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three
opcodes and two of them take parameters. The parameters are generic so
they can indicate arbitrary instructions on what to do with the packet
(these can point to priority queue, HW rate limited queue, tap queue,
whatever). This is a way that drivers can extend the capabilities of
the interface for their own features without requiring any changes to
the interface. A singe index allows one array lookup which returns
what every information is needed for driver to act on the packet. So
this scheme is both generic and performant-- it allows generality in
that the TX and RX actions can be arbitrarily extended and is
performant since all the driver needs to do is look the index in the
array to complete the action.
then semantics will become driver dependent. that will be unusable
from the program point view. Non-portable programs etc.
Also
if ((u8)ret == TX && ret>>8 == curent-dev->ifindex) do_xmit
_is_ slower than
if ((u8)ret == TX) do_xmit
At line rate every cycle and every branch count.
You have the same problem if you allow a priority to be returned.


If we don't do something like this, then
every time someone adds some new functionality we have to add another
action-- that doesn't scale.
why? I think it's the opposite. tx with prio is clearly different
action. Some drivers may support it, some not.
And the programs will clearly indicate to the driver side what
they want, so it's only one 'switch ((u8)ret)' on the driver side
to understand the action and do it.
But then so is rate limiting, so is receiving to a tap queue, so is
sending on a paced queue, etc. All of these just boil down to which HW
queue is used to provide the COS.
