Re: The page-pool as a component for XDP forwarding


Daniel Borkmann
 

On 05/06/2016 12:04 AM, Alexei Starovoitov wrote:
On Fri, May 06, 2016 at 12:00:57AM +0200, Daniel Borkmann wrote:
On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same thing, just using different notation.
The BPF program returns an index which the driver maps to a queue, but
this index is relative to the XDP instance. So if a device offers 3 levels
of priority queues then the BPF program can return 0, 1, or 2. The driver can
map this return value to a queue (probably from a set of three queues
dedicated to the XDP instance). What I am saying is that this driver
mapping should be trivial and does not implement any policy other than
restricting the XDP instance to its set -- e.g. the mapping to the actual
queue number could be 3*N+R where N is the instance # of XDP and R is the
return index. Egress on a different interface can work the same way; for
instance, index 0 might queue for the local interface and index 1 might
queue for another interface. This simple return-value-to-queue mapping is a
lot easier for crossing devices if they are managed by the same driver, I think.
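(For illustration only: the trivial driver-side mapping Tom describes could be sketched roughly as below; the function and parameter names are hypothetical, not an existing API.)

#include <linux/types.h>

/* Map the program's relative return index R to an absolute hw queue,
 * where each XDP instance N owns a dedicated block of queues,
 * e.g. queue = 3*N + R with three priority levels per instance.
 */
static inline u16 xdp_ret_to_queue(u16 xdp_instance, u16 ret_index,
				   u16 queues_per_instance)
{
	return xdp_instance * queues_per_instance + ret_index;
}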
+1
we'd need a way to specify priority queue from bpf program.
Probably not as a first step though.
Something like
BPF_XDP_DROP 0
BPF_XDP_PASS 1
BPF_XDP_TX 2
BPF_XDP_TX_PRIO 3 | upper bits used for prio
BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
lower 8-bits to encode action should be enough.
First merge-able step is to do 0,1,2 in one driver (like mlx4) and
start building it in other drivers.
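(Rough illustration of the encoding sketched above; the names mirror the proposal but are hypothetical, nothing here is merged code.)

#include <linux/types.h>

/* Action in the lower 8 bits of the return value, auxiliary data
 * (prio or ifindex) in the upper bits.
 */
enum {
	BPF_XDP_DROP			= 0,
	BPF_XDP_PASS			= 1,
	BPF_XDP_TX			= 2,
	BPF_XDP_TX_PRIO			= 3,
	BPF_XDP_TX_PHYS_IFINDEX		= 4,
	BPF_XDP_RX_NETDEV_IFINDEX	= 5,
};

#define XDP_ACTION_MASK		0xff

static inline u32 xdp_encode(u32 action, u32 aux)
{
	return (aux << 8) | (action & XDP_ACTION_MASK);
}

static inline u32 xdp_action(u32 ret)
{
	return ret & XDP_ACTION_MASK;		/* lower 8 bits: opcode */
}

static inline u32 xdp_aux(u32 ret)
{
	return ret >> 8;			/* upper bits: prio or ifindex */
}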
Can't this be done in a second step, with some per-cpu scratch data
as we have for redirect? That would seem easier to use to me, and easier
to extend with further data required to tx or to rx to the stack ... The return
code could have a flag telling the driver to look at the scratch data, for example.
yes. 3,4,5,6,7,.. codes can look at per-cpu scratch data too.
My point is that for step one we define the semantics for opcodes 0,1,2 in
the first 8 bits of the return value. Everything else is reserved and
defaults to drop.
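(Purely a sketch of the per-cpu scratch idea discussed above, analogous to the scratch data used for redirect; the struct and field names are made up.)

#include <linux/types.h>
#include <linux/percpu.h>

/* Hypothetical per-cpu scratch area the program fills in before returning;
 * a flag in the return code would tell the driver to consult it for the
 * extra tx/rx details (target ifindex, priority, ...).
 */
struct xdp_scratch {
	u32	ifindex;	/* target netdev for tx, or for rx to stack */
	u16	prio;		/* priority queue hint */
	u16	flags;
};

static DEFINE_PER_CPU(struct xdp_scratch, xdp_scratch_data);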
Yep, a first step with opcodes 0=drop, 1=pass/stack, 2=tx/fwd defined sounds
reasonable to me, with the rest treated as drop.
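(To make the agreed first step concrete: a toy program along these lines would only ever return one of the three defined opcodes. The opcode names are the hypothetical ones from the sketch above and the parsing is just an example.)

#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <arpa/inet.h>

/* Toy example: drop UDP, reflect TCP back out the same port, pass the
 * rest up the stack.  Uses only the step-one opcodes 0/1/2.
 */
static int xdp_prog_example(void *data, void *data_end)
{
	struct ethhdr *eth = data;
	struct iphdr *iph;

	if ((void *)(eth + 1) > data_end)
		return BPF_XDP_DROP;		/* 0: malformed, drop */
	if (eth->h_proto != htons(ETH_P_IP))
		return BPF_XDP_PASS;		/* 1: let the stack see it */

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return BPF_XDP_DROP;

	if (iph->protocol == IPPROTO_UDP)
		return BPF_XDP_DROP;
	if (iph->protocol == IPPROTO_TCP)
		return BPF_XDP_TX;		/* 2: tx back out the same port */

	return BPF_XDP_PASS;
}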
