XDP seeking input from NIC hardware vendors


Jesper Dangaard Brouer
 

Would it make sense, from a hardware point of view, to split the XDP
eBPF program into two stages?

Stage-1: Filter (restricted eBPF / no-helper calls)
Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the
likely more advanced stage-2 on the kernel side. Do HW vendors see a
benefit of this approach?


The generic problem I'm trying to solve is parsing: the first step in
every XDP program will be to parse the packet data, in order to
determine whether this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
- DROP (like XDP_DROP, early drop)
- PASS (like XDP_PASS, normal netstack)
- MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that
makes sense for the stage-2 program, e.g. proto id and/or data offset.
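
To make the proposed split concrete, here is a minimal userspace model in Python. It is only a sketch, not eBPF and not an existing kernel API: the names Stage1Action, stage1_filter, stage2_program and build_udp_frame are hypothetical, and so is the choice to pack the proto id into the upper 16 bits and the data offset into the lower 16 bits of the opaque code. Stage-1 only parses and calls no helpers; stage-2 trusts the opaque code and skips re-parsing.

```python
import struct
from enum import Enum

class Stage1Action(Enum):
    DROP = 0    # like XDP_DROP: early drop
    PASS = 1    # like XDP_PASS: normal netstack
    MATCH = 2   # call stage-2 and carry over an opaque return code

ETH_HLEN = 14
ETH_P_IP = 0x0800
IPPROTO_UDP = 17

def stage1_filter(pkt: bytes):
    """Parse only (no helper calls). Returns (action, opaque)."""
    if len(pkt) < ETH_HLEN + 20:
        return Stage1Action.DROP, 0           # too short for Ethernet + IPv4
    (ethertype,) = struct.unpack_from("!H", pkt, 12)
    if ethertype != ETH_P_IP:
        return Stage1Action.PASS, 0           # not ours, leave it to the stack
    ihl = (pkt[ETH_HLEN] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < 20 or len(pkt) < ETH_HLEN + ihl:
        return Stage1Action.DROP, 0           # malformed header
    proto = pkt[ETH_HLEN + 9]
    if proto != IPPROTO_UDP:
        return Stage1Action.PASS, 0
    l4_off = ETH_HLEN + ihl
    # Hypothetical opaque layout: proto id << 16 | data offset.
    return Stage1Action.MATCH, (proto << 16) | l4_off

def stage2_program(pkt: bytes, opaque: int):
    """Uses the carried-over offset instead of parsing again."""
    proto, l4_off = opaque >> 16, opaque & 0xFFFF
    if len(pkt) < l4_off + 4:
        return None
    _sport, dport = struct.unpack_from("!HH", pkt, l4_off)
    return proto, dport

def build_udp_frame(dport: int) -> bytes:
    """Tiny Ethernet/IPv4/UDP frame for experiments (checksums left zero)."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45, 0, 0, 28, 0, 0, 0, 0, 64, IPPROTO_UDP, 0, 0]) + bytes(8)
    udp = struct.pack("!HHHH", 12345, dport, 8, 0)
    return eth + ip + udp
```

In this model a NIC that can run stage1_filter itself would hand only the MATCH packets, together with the opaque code, to the host-side stage2_program.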

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Fastabend, John R <john.r.fastabend@...>
 

Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to do the classification/parsing and then passed a cookie to the software, which used the cookie to look up a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions, this can provide real performance benefits. Also, a lot of existing hardware supports this today, assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc. lookup tables.

If the goal is just to separate XDP traffic from non-XDP traffic, you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs, and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device; by the way, it is called the bifurcated driver. It's not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director, and then you would have a stack of XDP programs: one running in hardware and a set running on the queues in software.

The other interesting thing would be to do more than just packet steering and actually run a more complete XDP program. Netronome supports this, right? The question I have, though, is whether this is a stack of XDP programs, one or more designated for hardware and some running in software, perhaps with some annotation in the program so the hardware JIT knows where to place programs, or whether we expect the JIT itself to try to decide what is best to offload. I think the easiest way to start is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data onto the front or end of a packet, so you could push metadata calculated by the program there in a generic way without having to extend XDP-defined metadata structures. Another option is to DMA the metadata to a specified address. With this metadata, the consumer/producer XDP programs have to agree on the format, but no one else does.
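
As a rough illustration of metadata prepended to the packet, the Python sketch below has a producer push a small header in front of the frame and a consumer pop it off again. The layout (u16 proto id, u16 L4 offset, u32 flow hash) and the names META_FMT, push_metadata and pop_metadata are assumptions made up for this example. Only these two functions need to know the format; nothing else in the path interprets the bytes.

```python
import struct

# Hypothetical private layout, network byte order:
#   u16 proto id | u16 L4 offset | u32 flow hash
META_FMT = "!HHI"
META_LEN = struct.calcsize(META_FMT)   # 8 bytes, no padding with "!"

def push_metadata(pkt: bytes, proto: int, l4_off: int, flow_hash: int) -> bytes:
    """Producer side: stick the metadata onto the front of the packet."""
    return struct.pack(META_FMT, proto, l4_off, flow_hash) + pkt

def pop_metadata(buf: bytes):
    """Consumer side: split the buffer back into (metadata, packet)."""
    if len(buf) < META_LEN:
        raise ValueError("buffer shorter than metadata header")
    meta = struct.unpack_from(META_FMT, buf)
    return meta, buf[META_LEN:]
```

Changing the layout means changing only this producer/consumer pair, which is the point of keeping the format out of any shared XDP structure.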

FWIW I was hoping to get some data to show performance overhead vs. how deep we parse into the packets. I just won't have time to get to it for a while, but that could tell us how much perf gain the hardware could provide.

Thanks,
John

-----Original Message-----
From: Jesper Dangaard Brouer [mailto:brouer@...]
Sent: Thursday, July 7, 2016 3:43 AM
To: iovisor-dev@...
Cc: brouer@...; Brenden Blanco <bblanco@...>; Alexei Starovoitov <alexei.starovoitov@...>; Rana Shahout <ranas@...>; Ari Saha <as754m@...>; Tariq Toukan <tariqt@...>; Or Gerlitz <ogerlitz@...>; netdev@...; Simon Horman <horms@...>; Simon Horman <simon.horman@...>; Jakub Kicinski <jakub.kicinski@...>; Edward Cree <ecree@...>; Fastabend, John R <john.r.fastabend@...>
Subject: XDP seeking input from NIC hardware vendors




Jakub Kicinski
 

On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
The other interesting thing would be to do more than just packet
steering but actually run a more complete XDP program. Netronome
supports this right. The question I have though is this a stacked of
XDP programs one or more designated for hardware and some running in
software perhaps with some annotation in the program so the hardware
JIT knows where to place programs or do we expect the JIT itself to
try and decide what is best to offload. I think the easiest to start
with is to annotate the programs.

Also as far as I know a lot of hardware can stick extra data to the
front or end of a packet so you could push metadata calculated by the
program here in a generic way without having to extend XDP defined
metadata structures. Another option is to DMA the metadata to a
specified address. With this metadata the consumer/producer XDP
programs have to agree on the format but no one else.
Yes!

At the XDP summit we were discussing pipe-lining XDP programs in
general, with different stages of the pipeline potentially using
specific hardware capabilities or even being directly mappable on
fixed HW functions.

Designating parsing as one of the specialized blocks makes sense in the
long run, probably as the first stage with recirculation possible. We
also have some parsing HW we could utilize at some point. However, I'm
worried that it's too early to impose constraints and APIs. I agree
that we should first set a standard way to pass metadata across tail
calls to facilitate any form of pipelining, regardless of which parts
of the pipeline HW is able to offload.


Tom Herbert <tom@...>
 

On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
<jakub.kicinski@...> wrote:
On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
The other interesting thing would be to do more than just packet
steering but actually run a more complete XDP program. Netronome
supports this right. The question I have though is this a stacked of
XDP programs one or more designated for hardware and some running in
software perhaps with some annotation in the program so the hardware
JIT knows where to place programs or do we expect the JIT itself to
try and decide what is best to offload. I think the easiest to start
with is to annotate the programs.

Also as far as I know a lot of hardware can stick extra data to the
front or end of a packet so you could push metadata calculated by the
program here in a generic way without having to extend XDP defined
metadata structures. Another option is to DMA the metadata to a
specified address. With this metadata the consumer/producer XDP
programs have to agree on the format but no one else.
Yes!

At the XDP summit we were discussing pipe-lining XDP programs in
general, with different stages of the pipeline potentially using
specific hardware capabilities or even being directly mappable on
fixed HW functions.

Designating parsing as one of specialized blocks makes sense in a long
run, probably at the first stage with recirculation possible. We also
have some parsing HW we could utilize at some point. However, I'm
worried that it's too early to impose constraints and APIs. I agree
that we should first set a standard way to pass metadata across tail
calls to facilitate any form of pipe lining, regardless of which parts
of pipeline HW is able to offload.
+1

I don't see any reason why XDP programs can't be turned into a pipeline,
but this is an implementation based on the output of one program being
the input of the next. While XDP may work with a pipeline, it does not
require it or define it. This makes XDP different from P4 and the
match-action paradigm.

Tom


John Fastabend
 

On 16-07-07 10:53 AM, Tom Herbert wrote:
On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
<jakub.kicinski@...> wrote:
On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
The other interesting thing would be to do more than just packet
steering but actually run a more complete XDP program. Netronome
supports this right. The question I have though is this a stacked of
XDP programs one or more designated for hardware and some running in
software perhaps with some annotation in the program so the hardware
JIT knows where to place programs or do we expect the JIT itself to
try and decide what is best to offload. I think the easiest to start
with is to annotate the programs.

Also as far as I know a lot of hardware can stick extra data to the
front or end of a packet so you could push metadata calculated by the
program here in a generic way without having to extend XDP defined
metadata structures. Another option is to DMA the metadata to a
specified address. With this metadata the consumer/producer XDP
programs have to agree on the format but no one else.
Yes!

At the XDP summit we were discussing pipe-lining XDP programs in
general, with different stages of the pipeline potentially using
specific hardware capabilities or even being directly mappable on
fixed HW functions.

Designating parsing as one of specialized blocks makes sense in a long
run, probably at the first stage with recirculation possible. We also
have some parsing HW we could utilize at some point. However, I'm
worried that it's too early to impose constraints and APIs. I agree
that we should first set a standard way to pass metadata across tail
calls to facilitate any form of pipe lining, regardless of which parts
of pipeline HW is able to offload.
+1

I don't see any reason why XDP programs can't be turned into a pipeline,
but this is an implementation based on the output of one program being
the input of the next. While XDP may work with a pipeline, it does not
require it or define it. This makes XDP different from P4 and the
match-action paradigm.

Tom
Sounds like we all agree. Just a note: XDP is a reasonable target
for P4; in fact, we have a P4-to-eBPF target already working. We may end
up with a set of DSLs running on top of XDP, where P4 is one of them.

.John


Alexei Starovoitov
 

On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to do the classification/parsing then passed a cookie to the software which used the cookie to lookup a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions this can provide real performance benefits. Also a lot of existing hardware supports this today assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc lookup tables.
Looking at bpf programs written at plumgrid, facebook and cisco,
I can say with full certainty that a parse/action split doesn't exist.
Parsing is always interleaved with lookups and actions.
The cpu spends a tiny fraction of its time doing parsing. Lookups are the heaviest.
Trying to split a single logical program into parsing/after_parse stages
has no practical benefit.

If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.
The above sounds like a much better approach than Jesper's and my prog_per_ring stuff.
If we can split the nic via sriov and have a dedicated netdev via a VF just for XDP, that's a way cleaner approach.
I guess we won't need to do xdp_rxqmask after all.


John Fastabend
 

On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
Hi Jesper,

I have done some previous work on proprietary systems where we
used hardware to do the classification/parsing then passed a cookie to the
software which used the cookie to lookup a program to run on the packet.
When your programs are structured as a bunch of parsing followed by some
actions this can provide real performance benefits. Also a lot of
existing hardware supports this today assuming you use headers the
hardware "knows" about. It's a natural model for hardware that uses a
parser followed by tcam/cam/sram/etc lookup tables.
looking at bpf programs written in plumgrid, facebook and cisco
with full certainty I can assure that parse/action split doesn't exist.
Parsing is always interleaved with lookups and actions.
cpu spends a tiny fraction of time doing parsing. Lookups are the heaviest.
What is heavy about a lookup? Is it the key generation? That the key
generation can be provided by the hardware is what I was really alluding
to. If your data structures are eBPF maps, though, it's probably a hash
or array table, and the benefit of leveraging hardware would likely be
much better if/when there are software structures for LPM or wildcard
lookups.

Trying to split single logical program into parsing/after_parse stages
has no practical benefit.

If the goal is to just separate XDP traffic from non-XDP traffic
you could accomplish this with a combination of SR-IOV/macvlan to separate
the device queues into multiple netdevs and then run XDP on just one of
the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
steer traffic to the netdev. This is how we support multiple networking
stacks on one device by the way it is called the bifurcated driver. Its
not too far of a stretch to think we could offload some simple XDP
programs to program the splitting of traffic instead of
cls_u32/flower/flow_director and then you would have a stack of XDP
programs. One running in hardware and a set running on the queues in
software.
the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
I guess we won't need to do xdp_rxqmask after all.
Right, and this works today, so all it would require is adding the XDP
engine code to the VF drivers, which should be relatively
straightforward if you have the PF driver working.

.John


Alexei Starovoitov
 

On Thu, Jul 07, 2016 at 09:05:29PM -0700, John Fastabend wrote:
On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
Hi Jesper,

I have done some previous work on proprietary systems where we
used hardware to do the classification/parsing then passed a cookie to the
software which used the cookie to lookup a program to run on the packet.
When your programs are structured as a bunch of parsing followed by some
actions this can provide real performance benefits. Also a lot of
existing hardware supports this today assuming you use headers the
hardware "knows" about. It's a natural model for hardware that uses a
parser followed by tcam/cam/sram/etc lookup tables.
looking at bpf programs written in plumgrid, facebook and cisco
with full certainty I can assure that parse/action split doesn't exist.
Parsing is always interleaved with lookups and actions.
cpu spends a tiny fraction of time doing parsing. Lookups are the heaviest.
What is heavy about a lookup? Is it the key generation? The key
generation can be provided by the hardware is what I was really alluding
to. If your data structures are ebpf maps though its probably a hash
or array table and the benefit of leveraging hardware would likely be
much better if/when there are software structures for LPM or wildcard
lookups.
There is only a hash map in the sw, and its main cost was doing the jhash
math and the occasional miss in the hashtable.
'key generation' is only copying bytes, so it's mostly free,
just like parsing, which is a few branches that tend to be predicted
by the cpu quite well.
In the case of our L4 loadbalancer we need to do a consistent hash, which
fixed hw probably won't be able to provide.
Unless the hw is programmable :)
In general, when we developed and benchmarked the programs,
redesigning a program to remove an extra hash lookup gave a performance
improvement, whereas simplifying parsing logic (like removing vlan
handling or ip options) showed no difference in performance.
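
For readers unfamiliar with the term, the sketch below shows one common form of consistent hashing, a hash ring with virtual nodes, in Python. It is not the algorithm used by the load balancer mentioned above, which the thread does not describe. SHA-256 stands in for jhash purely for convenience, and HashRing, its vnodes parameter and the backend names are hypothetical.

```python
import bisect
import hashlib

def _h(data: bytes) -> int:
    """64-bit hash of data (SHA-256 truncated; a stand-in, not jhash)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class HashRing:
    """Consistent hash ring: each backend owns many points on the ring."""

    def __init__(self, backends, vnodes=100):
        if not backends:
            raise ValueError("need at least one backend")
        self._ring = sorted(
            (_h(f"{b}#{i}".encode()), b) for b in backends for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, flow_key: bytes) -> str:
        """Return the backend owning the first ring point after the key."""
        idx = bisect.bisect(self._points, _h(flow_key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for a load balancer is that removing one backend only remaps the flows that were on that backend; all other flows keep their mapping. A fixed-function hash in hardware generally does not give that guarantee, which fits the remark about needing programmable hw.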

Trying to split single logical program into parsing/after_parse stages
has no practical benefit.

If the goal is to just separate XDP traffic from non-XDP traffic
you could accomplish this with a combination of SR-IOV/macvlan to separate
the device queues into multiple netdevs and then run XDP on just one of
the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
steer traffic to the netdev. This is how we support multiple networking
stacks on one device by the way it is called the bifurcated driver. Its
not too far of a stretch to think we could offload some simple XDP
programs to program the splitting of traffic instead of
cls_u32/flower/flow_director and then you would have a stack of XDP
programs. One running in hardware and a set running on the queues in
software.
the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
I guess we won't need to do xdp_rxqmask after all.
Right and this works today so all it would require is adding the XDP
engine code to the VF drivers. Which should be relatively straight
forward if you have the PF driver working.
Good point. I think the next step should be to enable xdp in VF drivers
and measure performance.


Jakub Kicinski
 

On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.
the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
I guess we won't need to do xdp_rxqmask after all.
+1

I was thinking about using eBPF to direct to NIC queues but concluded
that doing a redirect to a VF is cleaner. Especially if the PF driver
supports VF representatives we could potentially just use
bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
the same stack.


Jesper Dangaard Brouer
 

On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:

On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:

If the goal is to just separate XDP traffic from non-XDP traffic
you could accomplish this with a combination of SR-IOV/macvlan to
separate the device queues into multiple netdevs and then run XDP
on just one of the netdevs. Then use flow director (ethtool) or
'tc cls_u32/flower' to steer traffic to the netdev. This is how
we support multiple networking stacks on one device by the way it
is called the bifurcated driver. Its not too far of a stretch to
think we could offload some simple XDP programs to program the
splitting of traffic instead of cls_u32/flower/flow_director and
then you would have a stack of XDP programs. One running in
hardware and a set running on the queues in software.

the above sounds like much better approach then Jesper/mine
prog_per_ring stuff.

If we can split the nic via sriov and have dedicated netdev via VF
just for XDP that's way cleaner approach. I guess we won't need to
do xdp_rxqmask after all.
+1

I was thinking about using eBPF to direct to NIC queues but concluded
that doing a redirect to a VF is cleaner. Especially if the PF driver
supports VF representatives we could potentially just use
bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
the same stack.
I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
then run a single/specific XDP program on that queue.

Why do I want this?

This is part of solving a very fundamental CS problem (early demux) when
wanting to support zero-copy on RX. The basic problem is that the NIC
driver needs to map RX pages into the RX ring prior to receiving
packets. Thus, we need HW support to steer packets, to gain enough
isolation (e.g. between tenant domains) to allow zero-copy.


Based on the flexibility of the HW filter, the granularity achievable
for isolation (e.g. application specific) is much more flexible than
splitting up the entire NIC with SR-IOV, VFs or macvlans.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Jakub Kicinski
 

On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
If the goal is to just separate XDP traffic from non-XDP traffic
you could accomplish this with a combination of SR-IOV/macvlan to
separate the device queues into multiple netdevs and then run XDP
on just one of the netdevs. Then use flow director (ethtool) or
'tc cls_u32/flower' to steer traffic to the netdev. This is how
we support multiple networking stacks on one device by the way it
is called the bifurcated driver. Its not too far of a stretch to
think we could offload some simple XDP programs to program the
splitting of traffic instead of cls_u32/flower/flow_director and
then you would have a stack of XDP programs. One running in
hardware and a set running on the queues in software.

the above sounds like much better approach then Jesper/mine
prog_per_ring stuff.

If we can split the nic via sriov and have dedicated netdev via VF
just for XDP that's way cleaner approach. I guess we won't need to
do xdp_rxqmask after all.
+1

I was thinking about using eBPF to direct to NIC queues but concluded
that doing a redirect to a VF is cleaner. Especially if the PF driver
supports VF representatives we could potentially just use
bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
the same stack.
I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
then run a single/specific XDP program on that queue.

Why do I want this?

This part of solving a very fundamental CS problem (early demux), when
wanting to support Zero-copy on RX. The basic problem that the NIC
driver need to map RX pages into the RX ring, prior to receiving
packets. Thus, we need HW support to steer packets, for gaining enough
isolation (e.g between tenants domains) for allowing zero-copy.


Based on the flexibility of the HW-filter, the granularity achievable
for isolation (e.g. application specific) is much more flexible. Than
splitting up the entire NIC with SR-IOV, VFs or macvlans.
I think of SR-IOV VFs as a way of grouping queues. If HW is capable of
directing to a queue, it's usually capable of directing to a VF as well.
And the VF could have all other traffic disabled, so you would get only
packets directed to it by the (BPF) filter, same as you would for the
queue. Does that make sense for zero-copy apps?


John Fastabend
 

On 16-07-08 09:07 AM, Jakub Kicinski wrote:
On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
If the goal is to just separate XDP traffic from non-XDP traffic
you could accomplish this with a combination of SR-IOV/macvlan to
separate the device queues into multiple netdevs and then run XDP
on just one of the netdevs. Then use flow director (ethtool) or
'tc cls_u32/flower' to steer traffic to the netdev. This is how
we support multiple networking stacks on one device by the way it
is called the bifurcated driver. Its not too far of a stretch to
think we could offload some simple XDP programs to program the
splitting of traffic instead of cls_u32/flower/flow_director and
then you would have a stack of XDP programs. One running in
hardware and a set running on the queues in software.

the above sounds like much better approach then Jesper/mine
prog_per_ring stuff.

If we can split the nic via sriov and have dedicated netdev via VF
just for XDP that's way cleaner approach. I guess we won't need to
do xdp_rxqmask after all.
+1

I was thinking about using eBPF to direct to NIC queues but concluded
that doing a redirect to a VF is cleaner. Especially if the PF driver
supports VF representatives we could potentially just use
bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
the same stack.
I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
then run a single/specific XDP program on that queue.

Why do I want this?

This part of solving a very fundamental CS problem (early demux), when
wanting to support Zero-copy on RX. The basic problem that the NIC
driver need to map RX pages into the RX ring, prior to receiving
packets. Thus, we need HW support to steer packets, for gaining enough
isolation (e.g between tenants domains) for allowing zero-copy.


Based on the flexibility of the HW-filter, the granularity achievable
for isolation (e.g. application specific) is much more flexible. Than
splitting up the entire NIC with SR-IOV, VFs or macvlans.
I think of SR-IOV VFs a way of grouping queues. If HW is capable of
directing to a queue it's usually capable of directing to a VF as well.
And the VF could have all other traffic disabled so you would get only
packets directed to it by the (BPF) filter - same as you would for the
queue. Does that make sense for zero copy apps?
The only distinction between VFs and queue groupings on my side is that VFs
provide RSS, whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if an "RSS"
program can be loaded into the NIC to select queues, but for existing
hardware the distinction is there.

If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower}, I think we can support both. This
just depends on the programmability of the hardware. Note that flow_director
and cls_{u32|flower} steering to VFs is already in place.

The question I have is whether the "filter" part of the eBPF program should
be a separate program from the XDP program and loaded using specific
semantics (e.g. a "load_hardware_demux" ndo op), at the risk of building
an ever-growing set of "ndo" ops. If you are running multiple XDP
programs on the same NIC hardware, then I think this actually makes
sense; otherwise, how would the hardware and even software find the
"demux" logic? In this model there is a "demux" program that selects
a queue/VF and a program that runs on the netdev queues.

Any thoughts?

.John


Jakub Kicinski
 

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS where as queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue, which I think is all
Jesper wanted. So we will have to tackle queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when the program is not offloaded to HW... Implementing
queue selection on the HW side is trivial.

If you demux using a eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Yes, for steering to VFs we could potentially reuse a lot of existing
infrastructure.

The question I have is should the "filter" part of the eBPF program
be a separate program from the XDP program and loaded using specific
semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
a ever growing set of "ndo" ops. If you are running multiple XDP
programs on the same NIC hardware then I think this actually makes
sense otherwise how would the hardware and even software find the
"demux" logic. In this model there is a "demux" program that selects
a queue/VF and a program that runs on the netdev queues.
I don't think we should enforce the separation here. What we may want
to do before forwarding to the VF can be much more complicated than
pure demux/filtering (a simple example: pop VLAN/tunnel). The VF
representative model works well here as a fallback: if the program could
not be offloaded, it will be run on the host and "trombone" packets via
the VFR into the VF.

If we have a chain of BPF programs we can order them in increasing
level of complexity/features required and then HW could transparently
offload the first parts - the easier ones - leaving more complex
processing on the host.

This should probably be paired with some sort of "skip-sw" flag to let
user space enforce the HW offload on the fast path part.
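
The chain idea can be modelled in a few lines of Python. This is a sketch under assumptions: Prog, split_chain and the feature names are hypothetical, and the per-program skip_sw field is just one possible reading of the "skip-sw" flag suggested above. The chain is assumed to be already ordered from simplest to most complex; the HW takes the longest prefix whose requirements it supports, and the rest stays on the host.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prog:
    name: str
    needs: frozenset        # HW features this program requires (hypothetical names)
    skip_sw: bool = False   # user insists this program must run in HW

def split_chain(chain, hw_features):
    """Return (offloaded, host): the longest offloadable prefix and the rest."""
    offloaded = []
    for prog in chain:
        if not prog.needs <= hw_features:   # first program HW cannot run
            break
        offloaded.append(prog)
    host = chain[len(offloaded):]
    # A skip-sw program left on the host is an error, not a silent fallback.
    refused = [p.name for p in host if p.skip_sw]
    if refused:
        raise RuntimeError(f"cannot honour skip-sw for: {', '.join(refused)}")
    return offloaded, host
```

Stopping at the first program the HW cannot run keeps packet ordering through the chain intact: nothing later in the chain is offloaded ahead of a stage that runs on the host.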


Jesper Dangaard Brouer
 

On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski@...> wrote:

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS where as queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue which I think is all
Jesper wanted. So we will have to tackle the queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when program is not offloaded to HW... Implementing
queue selection on HW side is trivial.
Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe. But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

ethtool -K eth2 ntuple on
ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets. And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).
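
To make the steering step concrete, here is a small userspace C model
of what the ntuple rule above decides. It is only a sketch: struct
flow_key, struct ntuple_rule and ntuple_steer() are hypothetical names,
and real hardware matches on the wire-format headers, not on a
pre-parsed struct.

```c
#include <stdint.h>

#define IPPROTO_UDP_NUM 17   /* IANA protocol number for UDP */
#define NO_MATCH (-1)

/* Hypothetical, simplified view of the fields an ntuple rule can match. */
struct flow_key {
    uint8_t  ip_proto;
    uint32_t dst_ip;    /* host byte order, for simplicity */
    uint16_t dst_port;  /* host byte order */
};

struct ntuple_rule {
    struct flow_key match;
    int rx_queue;       /* the "action" value given to ethtool -N */
};

/* Returns the RX queue chosen by the rule, or NO_MATCH (normal RSS applies). */
static int ntuple_steer(const struct ntuple_rule *r, const struct flow_key *k)
{
    if (k->ip_proto == r->match.ip_proto &&
        k->dst_ip == r->match.dst_ip &&
        k->dst_port == r->match.dst_port)
        return r->rx_queue;
    return NO_MATCH;
}

/* udp4, dst-ip 192.168.254.1, dst-port 53 -> queue 42, as in the rule above. */
static const struct ntuple_rule dns_rule = {
    .match = { .ip_proto = IPPROTO_UDP_NUM,
               .dst_ip = (192u << 24) | (168u << 16) | (254u << 8) | 1u,
               .dst_port = 53 },
    .rx_queue = 42,
};
```

Only packets for which ntuple_steer() returns 42 would reach the queue
the XDP program is bound to, which is what lets that program promise to
consume everything it sees.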


Yes, for steering to VFs we could potentially reuse a lot of existing
infrastructure.

The question I have is should the "filter" part of the eBPF program
be a separate program from the XDP program and loaded using specific
semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
an ever-growing set of "ndo" ops. If you are running multiple XDP
programs on the same NIC hardware then I think this actually makes
sense; otherwise how would the hardware and even software find the
"demux" logic? In this model there is a "demux" program that selects
a queue/VF and a program that runs on the netdev queues.
I don't think we should enforce the separation here. What we may want
to do before forwarding to the VF can be much more complicated than
pure demux/filtering (simple eg - pop VLAN/tunnel). VF representative
model works well here as fallback - if program could not be offloaded
it will be run on the host and "trombone" packets via VFR into the VF.
That is an interesting idea.

If we have a chain of BPF programs we can order them in increasing
level of complexity/features required and then HW could transparently
offload the first parts - the easier ones - leaving more complex
processing on the host.
I'll try to keep out of the discussion of how to structure the BPF
program, as it is outside my "area".

This should probably be paired with some sort of "skip-sw" flag to let
user space enforce the HW offload on the fast path part.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Alexei Starovoitov
 

On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski@...> wrote:

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue which I think is all
Jesper wanted. So we will have to tackle the queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when program is not offloaded to HW... Implementing
queue selection on HW side is trivial.
Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe. But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

ethtool -K eth2 ntuple on
ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets. And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).
so such ntuple rule will send udp4 traffic for specific ip and port
into a queue then it will somehow get zero-copied to vm?
. looks like a lot of other pieces about zero-copy and qemu need to be
implemented (or at least architected) for this scheme to be conceivable
. and when all that happens what vm is going to do with this very specific
traffic? vm won't have any tcp or even ping?

the network virtualization traffic is typically encapsulated,
so if xdp is used to steer the traffic, the program would need
to figure out vm id based on headers, strip tunnel, apply policy before
forwarding the packet further. Clearly hw ntuple is not going to suffice.

If there is no networking virtualization and VMs are operating in the
flat network, then there is no policy, no ip filter, no vm migration.
Only mac per vm and sriov handles this case just fine.
When hw becomes more programmable we'll be able to load xdp program
into hw that does tunnel, policy and forwards into vf then sriov will
become actually usable for cloud providers.
hw xdp into vf is more interesting than into a queue, since there is
more than one queue/interrupt per vf and network heavy vm can actually
consume large amount of traffic.


John Fastabend
 

On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski@...> wrote:

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue which I think is all
Jesper wanted. So we will have to tackle the queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when program is not offloaded to HW... Implementing
queue selection on HW side is trivial.
Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe. But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

ethtool -K eth2 ntuple on
ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets. And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).
so such ntuple rule will send udp4 traffic for specific ip and port
into a queue then it will somehow get zero-copied to vm?
. looks like a lot of other pieces about zero-copy and qemu need to be
implemented (or at least architected) for this scheme to be conceivable
. and when all that happens what vm is going to do with this very specific
traffic? vm won't have any tcp or even ping?
I have perhaps a different motivation to have queue steering in 'tc
cls-u32' and eventually xdp. The general idea is I have thousands of
queues and I can bind applications to the queues. When I know an
application is bound to a queue I can enable per queue busy polling (to
be implemented), set specific interrupt rates on the queue
(implementation will be posted soon), bind the queue to the correct
cpu, etc.

ntuple works OK for this now but xdp provides more flexibility and
also lets us add additional policy on the queue other than simply
queue steering.

I'm not convinced though that the demux queue selection should be part
of the XDP program itself, just because it has no software analog; to me
it sits in front of the set of XDP programs. But I think I could perhaps
be convinced it does if there is some reasonable way to do it. I guess
the single program method would result in an XDP program that reads like

  if (rx_queue == x)
          do_foo
  if (rx_queue == y)
          do_bar

A hardware jit may be able to sort that out. Or use per queue sections.
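
One way to picture "per queue sections" is a table with one program
slot per RX queue instead of a chain of if statements. The following is
only a userspace C sketch under that assumption: struct pkt, the
verdict values, queue_progs and run_for_queue() are hypothetical names,
and do_foo/do_bar are placeholders borrowed from the pseudocode above.

```c
#include <stddef.h>

enum verdict { VERDICT_PASS, VERDICT_DROP };   /* hypothetical verdicts */

/* Hypothetical packet context carrying the RX queue it arrived on. */
struct pkt {
    unsigned int rx_queue;
    size_t len;
};

typedef enum verdict (*queue_prog_t)(const struct pkt *);

/* Placeholder per-queue programs. */
static enum verdict do_foo(const struct pkt *p) { (void)p; return VERDICT_DROP; }
static enum verdict do_bar(const struct pkt *p) { (void)p; return VERDICT_PASS; }

#define NUM_QUEUES 64

/* One slot per RX queue; NULL means no program is bound to that queue. */
static queue_prog_t queue_progs[NUM_QUEUES] = {
    [3]  = do_foo,
    [42] = do_bar,
};

static enum verdict run_for_queue(const struct pkt *p)
{
    if (p->rx_queue >= NUM_QUEUES || !queue_progs[p->rx_queue])
        return VERDICT_PASS;   /* unbound queue: leave it to the normal stack */
    return queue_progs[p->rx_queue](p);
}
```

The table form keeps the dispatch cost constant as the number of queues
grows, and makes the queue-to-program binding explicit data that a
driver or JIT could inspect.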


the network virtualization traffic is typically encapsulated,
so if xdp is used to steer the traffic, the program would need
to figure out vm id based on headers, strip tunnel, apply policy before
forwarding the packet further. Clearly hw ntuple is not going to suffice.

If there is no networking virtualization and VMs are operating in the
flat network, then there is no policy, no ip filter, no vm migration.
Only mac per vm and sriov handles this case just fine.
When hw becomes more programmable we'll be able to load xdp program
into hw that does tunnel, policy and forwards into vf then sriov will
become actually usable for cloud providers.
Yep :)

hw xdp into vf is more interesting than into a queue, since there is
more than one queue/interrupt per vf and network heavy vm can actually
consume large amount of traffic.
Another use case I have is to make a really high performance AF_PACKET
interface. So if there was a way to say bind a queue to an AF_PACKET
ring and run a policy XDP program before hitting the AF_PACKET
descriptor bit that would be really interesting because it would solve
some of my need for poll mode drivers in userspace.

.John


Jakub Kicinski
 

On Tue, 12 Jul 2016 12:13:01 -0700, John Fastabend wrote:
On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski@...> wrote:

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue which I think is all
Jesper wanted. So we will have to tackle the queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when program is not offloaded to HW... Implementing
queue selection on HW side is trivial.
Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe. But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

ethtool -K eth2 ntuple on
ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets. And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).
so such ntuple rule will send udp4 traffic for specific ip and port
into a queue then it will somehow get zero-copied to vm?
. looks like a lot of other pieces about zero-copy and qemu need to be
implemented (or at least architected) for this scheme to be conceivable
. and when all that happens what vm is going to do with this very specific
traffic? vm won't have any tcp or even ping?
I have perhaps a different motivation to have queue steering in 'tc
cls-u32' and eventually xdp. The general idea is I have thousands of
queues and I can bind applications to the queues. When I know an
application is bound to a queue I can enable per queue busy polling (to
be implemented), set specific interrupt rates on the queue
(implementation will be posted soon), bind the queue to the correct
cpu, etc.

ntuple works OK for this now but xdp provides more flexibility and
also lets us add additional policy on the queue other than simply
queue steering.

I'm not convinced though that the demux queue selection should be part
of the XDP program itself, just because it has no software analog; to me
it sits in front of the set of XDP programs.
Yes, although if we expect XDP to be target of offloading efforts
putting the demux here doesn't seem like an entirely bad idea. We
could say demux is just an API that more capable drivers/HW can
implement.

But I think I could perhaps
be convinced it does if there is some reasonable way to do it. I guess
the single program method would result in an XDP program that reads like

  if (rx_queue == x)
          do_foo
  if (rx_queue == y)
          do_bar

A hardware jit may be able to sort that out.
+1


Jesper Dangaard Brouer
 

On Tue, 12 Jul 2016 12:13:01 -0700
John Fastabend <john.fastabend@...> wrote:

On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski@...> wrote:

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
The only distinction between VFs and queue groupings on my side is VFs
provide RSS whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if a "RSS"
program can be loaded into the NIC to select queues but for existing
hardware the distinction is there.
To do BPF RSS we need a way to select the queue which I think is all
Jesper wanted. So we will have to tackle the queue selection at some
point. The main obstacle with it for me is to define what queue
selection means when program is not offloaded to HW... Implementing
queue selection on HW side is trivial.
Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe. But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower} I think we can support both. And this
just depends on the programmability of the hardware. Note flow_director
and cls_{u32|flower} steering to VFs is already in place.
Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

ethtool -K eth2 ntuple on
ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets. And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).
so such ntuple rule will send udp4 traffic for specific ip and port
into a queue then it will somehow get zero-copied to vm?
. looks like a lot of other pieces about zero-copy and qemu need to be
implemented (or at least architected) for this scheme to be conceivable
. and when all that happens what vm is going to do with this very specific
traffic? vm won't have any tcp or even ping?
I have perhaps a different motivation to have queue steering in 'tc
cls-u32' and eventually xdp. The general idea is I have thousands of
queues and I can bind applications to the queues. When I know an
application is bound to a queue I can enable per queue busy polling (to
be implemented), set specific interrupt rates on the queue
(implementation will be posted soon), bind the queue to the correct
cpu, etc.
+1

binding applications to queues.

This is basically what our customers are requesting. They have one or
two applications that need DPDK speeds. But they don't like dedicating
an entire NIC per application (like DPDK requires).

The basic idea is actually more fundamental. It reminds me of Van
Jacobson's netchannels[1], where he talks about "Channelize" (slides 24+).
Creating a full "application" channel allows for a lock-free single-producer
single-consumer (SPSC) queue directly into the application.

[1] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
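
For readers unfamiliar with the SPSC pattern mentioned above, here is a
minimal C11 sketch of a lock-free single-producer single-consumer ring.
It is a generic textbook-style ring, not the netchannels code and not
an existing kernel API; struct spsc_ring, spsc_push() and spsc_pop()
are names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4   /* must be a power of two */

struct spsc_ring {
    void *slots[RING_SIZE];
    _Atomic size_t head;   /* written only by the producer */
    _Atomic size_t tail;   /* written only by the consumer */
};

/* Producer side: returns false when the ring is full. */
static bool spsc_push(struct spsc_ring *r, void *item)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;
    r->slots[head & (RING_SIZE - 1)] = item;
    /* Release: the slot write must be visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns NULL when the ring is empty. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    void *item;

    if (head == tail)
        return NULL;
    item = r->slots[tail & (RING_SIZE - 1)];
    /* Release: the slot read must complete before the slot is reused. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return item;
}
```

Because each index has exactly one writer, no lock or compare-and-swap
is needed; this only holds as long as a queue really has a single
producer and a single consumer, which is the point of binding one
application to one RX queue.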


ntuple works OK for this now but xdp provides more flexibility and
also lets us add additional policy on the queue other than simply
queue steering.

I'm not convinced though that the demux queue selection should be part
of the XDP program itself, just because it has no software analog; to me
it sits in front of the set of XDP programs. But I think I could perhaps
be convinced it does if there is some reasonable way to do it. I guess
the single program method would result in an XDP program that reads like

  if (rx_queue == x)
          do_foo
  if (rx_queue == y)
          do_bar
Yes, that is also why I wanted an XDP program per RX queue. But the
"channelize" concept is more important.


A hardware jit may be able to sort that out. Or use per queue
sections.


the network virtualization traffic is typically encapsulated,
so if xdp is used to steer the traffic, the program would need
to figure out vm id based on headers, strip tunnel, apply policy
before forwarding the packet further. Clearly hw ntuple is not
going to suffice.

If there is no networking virtualization and VMs are operating in
the flat network, then there is no policy, no ip filter, no vm
migration. Only mac per vm and sriov handles this case just fine.
When hw becomes more programmable we'll be able to load xdp program
into hw that does tunnel, policy and forwards into vf then sriov
will become actually usable for cloud providers.
Yep :)

hw xdp into vf is more interesting than into a queue, since there is
more than one queue/interrupt per vf and network heavy vm can
actually consume large amount of traffic.
Another use case I have is to make a really high performance AF_PACKET
interface. So if there was a way to say bind a queue to an AF_PACKET
ring and run a policy XDP program before hitting the AF_PACKET
descriptor bit that would be really interesting because it would solve
some of my need for poll mode drivers in userspace.
+1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP.
It would basically allow for implementing DPDK or netmap on top of XDP
(at least the RX side) without needing to run a NIC driver in userspace.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Thomas Monjalon <thomas.monjalon@...>
 

Hi,

About RX filtering, there is an ongoing effort in DPDK to write an API
which could leverage most of the hardware capabilities of any NIC:
https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
I understand that XDP does not aim to support every hardware feature,
though it may be an interesting approach to check.

2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
On Tue, 12 Jul 2016 12:13:01 -0700
John Fastabend <john.fastabend@...> wrote:

Another use case I have is to make a really high performance AF_PACKET
interface. So if there was a way to say bind a queue to an AF_PACKET
ring and run a policy XDP program before hitting the AF_PACKET
descriptor bit that would be really interesting because it would solve
some of my need for poll mode drivers in userspace.
Have you started this work?
Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?

+1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP.
It would basically allow for implementing DPDK or netmap on top of XDP
(at least the RX side) without needing to run a NIC driver in userspace.
Why would TX not be possible through AF_PACKET?


Tom Herbert <tom@...>
 

On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
<thomas.monjalon@...> wrote:
Hi,

About RX filtering, there is an ongoing effort in DPDK to write an API
which could leverage most of the hardware capabilities of any NIC:
https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
I understand that XDP does not aim to support every hardware feature,
though it may be an interesting approach to check.
Thomas,

A major goal of XDP is to leverage and in fact encourage innovation in
hardware features. But, we are asking that vendors design the APIs
with the community in mind. For instance, if XDP supports crypto
offload it should have one API that different companies share; we don't
want every vendor coming up with their own.

2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
On Tue, 12 Jul 2016 12:13:01 -0700
John Fastabend <john.fastabend@...> wrote:

Another use case I have is to make a really high performance AF_PACKET
interface. So if there was a way to say bind a queue to an AF_PACKET
ring and run a policy XDP program before hitting the AF_PACKET
descriptor bit that would be really interesting because it would solve
some of my need for poll mode drivers in userspace.
Have you started this work?
Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?
I don't understand why you would use AF_PACKET with DPDK. They should be
mutually exclusive. XDP over DPDK does make sense.

Tom