XDP seeking input from NIC hardware vendors
Would it make sense, from a hardware point of view, to split the XDP eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no helper calls)
 Stage-2: Program

Then the HW can choose to offload the stage-1 "filter" and keep the likely more advanced stage-2 on the kernel side. Do HW vendors see a benefit in this approach?

The generic problem I'm trying to solve is parsing, i.e. that the first step in every XDP program will be to parse the packet data in order to determine whether this is a packet the XDP program should process.

Actions from the stage-1 "filter" program:
 - DROP  (like XDP_DROP, early drop)
 - PASS  (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that makes sense for the stage-2 program, e.g. proto id and/or data offset.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
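To make the proposed split concrete, here is a minimal sketch in plain userspace C (not kernel eBPF). Every name in it (`stage1_filter`, `stage2_program`, `struct stage1_result`, the `S1_*` verdicts) is hypothetical: XDP defines verdicts such as XDP_DROP and XDP_PASS, but no MATCH action and no two-stage hook. The sketch assumes stage-1 matches on IPv4 only and carries the EtherType and payload offset over to stage-2 as the "opaque return code".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stage-1 verdicts, modelled on the proposal above. */
enum stage1_action { S1_DROP, S1_PASS, S1_MATCH };

/* Hypothetical carry-over data handed from stage-1 to stage-2. */
struct stage1_result {
    enum stage1_action action;
    uint16_t proto_id;     /* assumption: the EtherType that matched */
    uint16_t data_offset;  /* first byte after the parsed L2 header */
};

#define S1_ETH_HLEN      14
#define S1_ETHERTYPE_IP  0x0800
#define S1_IPV4_MIN_HLEN 20

/* Stage-1: bounds-checked parsing only; no helper calls, no lookups. */
struct stage1_result stage1_filter(const uint8_t *pkt, size_t len)
{
    struct stage1_result r = { S1_DROP, 0, 0 };

    if (len < S1_ETH_HLEN)
        return r;                       /* runt frame: early drop */

    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    if (ethertype != S1_ETHERTYPE_IP) {
        r.action = S1_PASS;             /* not ours: normal netstack */
        return r;
    }

    r.action = S1_MATCH;                /* hand over to stage-2 */
    r.proto_id = ethertype;
    r.data_offset = S1_ETH_HLEN;
    return r;
}

/*
 * Stage-2: uses the carried-over offset instead of re-parsing L2.
 * Returns the IPv4 protocol number, or -1 if the header is truncated.
 */
int stage2_program(const uint8_t *pkt, size_t len, struct stage1_result r)
{
    if (len < (size_t)r.data_offset + S1_IPV4_MIN_HLEN)
        return -1;
    return pkt[r.data_offset + 9];      /* protocol field of the IPv4 header */
}
```

In this model only `stage1_filter` would be a candidate for hardware offload; `stage2_program` runs on the host and trusts the offset it is given.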
|
Fastabend, John R <john.r.fastabend@...>
Hi Jesper,
I have done some previous work on proprietary systems where we used hardware to do the classification/parsing and then passed a cookie to the software, which used the cookie to look up a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions, this can provide real performance benefits. Also, a lot of existing hardware supports this today, assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc. lookup tables.

If the goal is just to separate XDP traffic from non-XDP traffic, you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device, by the way; it is called the bifurcated driver. It's not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director, and then you would have a stack of XDP programs: one running in hardware and a set running on the queues in software.

The other interesting thing would be to do more than just packet steering and actually run a more complete XDP program. Netronome supports this, right? The question I have, though, is whether this is a stack of XDP programs, one or more designated for hardware and some running in software, perhaps with some annotation in the program so the hardware JIT knows where to place programs, or whether we expect the JIT itself to try to decide what is best to offload. I think the easiest way to start is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data onto the front or end of a packet, so you could push metadata calculated by the program there in a generic way without having to extend XDP-defined metadata structures. Another option is to DMA the metadata to a specified address. With this metadata, the consumer/producer XDP programs have to agree on the format, but no one else does.

FWIW I was hoping to get some data to show performance overhead vs. how deep we parse into the packets. I just won't have time to get to it for a while, but that could tell us how much perf gain the hardware could provide.

Thanks,
John
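The cookie and prepended-metadata ideas above can be sketched together in plain userspace C. This is only an illustration under stated assumptions: `struct pkt_meta`, `meta_push`, `meta_dispatch` and the two sample programs are hypothetical names, the producer function stands in for what hardware would do, and the layout is a private agreement between producer and consumer rather than anything XDP defines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical private format: only producer and consumer agree on it. */
struct pkt_meta {
    uint32_t cookie;     /* classification result, used to pick a program */
    uint16_t l3_offset;  /* where the L3 header starts inside the packet */
    uint16_t reserved;
};

typedef int (*prog_fn)(const uint8_t *pkt, uint16_t l3_offset);

/* Two sample "programs" selected by cookie. */
int prog_drop(const uint8_t *pkt, uint16_t l3_offset)
{
    (void)pkt; (void)l3_offset;
    return 0;
}

int prog_ip_version(const uint8_t *pkt, uint16_t l3_offset)
{
    return pkt[l3_offset] >> 4;  /* high nibble of the first L3 byte */
}

/*
 * Producer side (stands in for hardware): write the metadata into the
 * headroom directly in front of the packet data. Returns the new start
 * of the buffer contents, or NULL if there is not enough headroom.
 */
uint8_t *meta_push(uint8_t *buf_start, uint8_t *data, const struct pkt_meta *m)
{
    if (data < buf_start || (size_t)(data - buf_start) < sizeof(*m))
        return NULL;
    uint8_t *p = data - sizeof(*m);
    memcpy(p, m, sizeof(*m));
    return p;
}

/*
 * Consumer side: read the metadata back and use the cookie as an index
 * into a program table. Returns -1 for an unknown cookie.
 */
int meta_dispatch(const uint8_t *meta_start, const prog_fn *progs, uint32_t nprogs)
{
    struct pkt_meta m;
    memcpy(&m, meta_start, sizeof(m));
    if (m.cookie >= nprogs || progs[m.cookie] == NULL)
        return -1;
    return progs[m.cookie](meta_start + sizeof(m), m.l3_offset);
}
```

Because the cookie indexes a table directly, the software side does no parsing before choosing a program, which is the benefit described in the message above.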
|
Jakub Kicinski
On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
> The other interesting thing would be to do more than just packet [...]

Yes! At the XDP summit we were discussing pipelining XDP programs in general, with different stages of the pipeline potentially using specific hardware capabilities or even being directly mappable onto fixed HW functions. Designating parsing as one of the specialized blocks makes sense in the long run, probably as the first stage, with recirculation possible. We also have some parsing HW we could utilize at some point. However, I'm worried that it's too early to impose constraints and APIs. I agree that we should first set a standard way to pass metadata across tail calls to facilitate any form of pipelining, regardless of which parts of the pipeline HW is able to offload.
|
Tom Herbert <tom@...>
On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski <jakub.kicinski@...> wrote:
> On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
>> The other interesting thing would be to do more than just packet [...]
> Yes! [...]

+1

I don't see any reason why XDP programs can't be turned into a pipeline, but that is a matter of implementation, based on the output of one program being the input of the next. While XDP may work with a pipeline, it does not require or define one. This makes XDP different from P4 and the match-action paradigm.

Tom
|
John Fastabend
On 16-07-07 10:53 AM, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski [...]

Sounds like we all agree. Just a note: XDP is a reasonable target for P4; in fact, we have a P4-to-eBPF target already working. We may end up with a set of DSLs running on top of XDP, where P4 is one of them.

.John
|
Alexei Starovoitov
On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
> Hi Jesper, [...]

Looking at bpf programs written at plumgrid, facebook and cisco, I can assure you with full certainty that the parse/action split doesn't exist. Parsing is always interleaved with lookups and actions. The cpu spends a tiny fraction of its time doing parsing. Lookups are the heaviest. Trying to split a single logical program into parsing/after_parse stages has no practical benefit.

> If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.

The above sounds like a much better approach than Jesper's/my prog_per_ring stuff. If we can split the nic via sriov and have a dedicated netdev via a VF just for XDP, that's a way cleaner approach. I guess we won't need to do xdp_rxqmask after all.
|
John Fastabend
On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
> On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
>> Hi Jesper, [...]
> looking at bpf programs written in plumgrid, facebook and cisco [...]

What is heavy about a lookup? Is it the key generation? Key generation can be provided by the hardware, which is what I was really alluding to. If your data structures are eBPF maps, though, it's probably a hash or array table, and the benefit of leveraging hardware would likely be much greater if/when there are software structures for LPM or wildcard lookups.

> Trying to split single logical program into parsing/after_parse stages [...]

Right, and this works today, so all it would require is adding the XDP engine code to the VF drivers, which should be relatively straightforward if you have the PF driver working.

.John
|
Alexei Starovoitov
On Thu, Jul 07, 2016 at 09:05:29PM -0700, John Fastabend wrote:
> On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
>> On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
>>> Hi Jesper, [...]

There is only a hash map in the sw, and its main cost was doing the jhash math and the occasional miss in the hashtable. 'Key generation' is only copying bytes, so it's mostly free, just like parsing, which is a few branches that tend to be predicted by the cpu quite well. In the case of our L4 loadbalancer we need to do consistent hashing, which fixed hw probably won't be able to provide. Unless the hw is programmable :)

In general, when we developed and benchmarked the programs, redesigning a program to remove an extra hash lookup gave a performance improvement, whereas simplifying the parsing logic (like removing vlan handling or ip options) showed no difference in performance.

>> Trying to split single logical program into parsing/after_parse stages [...]
> Right and this works today so all it would require is adding the XDP [...]

Good point. I think the next step should be to enable xdp in VF drivers and measure performance.
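For readers unfamiliar with why a load balancer wants consistent hashing, the following sketch shows one common form of it, rendezvous (highest-random-weight) hashing, in plain C. This is a swapped-in technique chosen for brevity; the thread does not say which scheme the L4 load balancer used. `mix64` is a generic 64-bit mixer standing in for jhash, and `pick_backend` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic 64-bit mixer (splitmix64 finalizer); a stand-in for jhash. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/*
 * Rendezvous hashing: score every live backend against the flow key and
 * pick the highest score. A backend's score depends only on the flow key
 * and that backend's id, so removing one backend moves only the flows
 * that were mapped to it.
 * Returns the index of the chosen backend, or -1 if none is alive.
 */
int pick_backend(uint64_t flow_key, const uint32_t *backend_ids,
                 const uint8_t *alive, size_t n)
{
    int best = -1;
    uint64_t best_score = 0;

    for (size_t i = 0; i < n; i++) {
        if (!alive[i])
            continue;
        uint64_t score = mix64(flow_key ^ mix64(backend_ids[i]));
        if (best < 0 || score > best_score) {
            best = (int)i;
            best_score = score;
        }
    }
    return best;
}
```

Note that this costs one hash per backend per packet, which is in line with the point above that hashing and lookups, not parsing, dominate the cost.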
|
Jakub Kicinski
On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>> If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.
> the above sounds like much better approach then Jesper/mine prog_per_ring stuff.

+1

I was thinking about using eBPF to direct to NIC queues but concluded that doing a redirect to a VF is cleaner. Especially if the PF driver supports VF representatives, we could potentially just use bpf_redirect(VFR netdev), and the VF doesn't even have to be handled by the same stack.
|
On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>>> If the goal is to just separate XDP traffic from non-XDP traffic [...]
> +1

I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and then run a single/specific XDP program on that queue.

Why do I want this?

This is part of solving a very fundamental CS problem (early demux) when wanting to support zero-copy on RX. The basic problem is that the NIC driver needs to map RX pages into the RX ring prior to receiving packets. Thus, we need HW support to steer packets in order to gain enough isolation (e.g. between tenant domains) to allow zero-copy.

Depending on the flexibility of the HW filter, the granularity achievable for isolation (e.g. application specific) is much finer than what you get by splitting up the entire NIC with SR-IOV, VFs or macvlans.

-- 
Best regards,
  Jesper Dangaard Brouer
|
Jakub Kicinski
On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
>> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>>>> If the goal is to just separate XDP traffic from non-XDP traffic [...]
>> +1
> I actually disagree. [...]

I think of SR-IOV VFs as a way of grouping queues. If HW is capable of directing to a queue, it's usually capable of directing to a VF as well. And the VF could have all other traffic disabled, so you would get only packets directed to it by the (BPF) filter, the same as you would for the queue. Does that make sense for zero-copy apps?
|
John Fastabend
On 16-07-08 09:07 AM, Jakub Kicinski wrote:
> On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
>>> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>>>>> If the goal is to just separate XDP traffic from non-XDP traffic [...]
>>> +1
>> I actually disagree. [...]
> I think of SR-IOV VFs a way of grouping queues. If HW is capable of [...]

The only distinction between VFs and queue groupings on my side is that VFs provide RSS, whereas queue groupings have to be selected explicitly. In a programmable NIC world the distinction might be lost if an "RSS" program can be loaded into the NIC to select queues, but for existing hardware the distinction is there.

Whether you demux using an eBPF program or via a filter model like flow_director or cls_{u32|flower}, I think we can support both. This just depends on the programmability of the hardware. Note that flow_director and cls_{u32|flower} steering to VFs is already in place.

The question I have is: should the "filter" part of the eBPF program be a separate program from the XDP program, loaded using specific semantics (e.g. a "load_hardware_demux" ndo op), at the risk of building an ever-growing set of "ndo" ops? If you are running multiple XDP programs on the same NIC hardware, then I think this actually makes sense; otherwise, how would the hardware and even the software find the "demux" logic? In this model there is a "demux" program that selects a queue/VF and a program that runs on the netdev queues.

Any thoughts?

.John
|
Jakub Kicinski
On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> The only distinction between VFs and queue groupings on my side is VFs [...]

To do BPF RSS we need a way to select the queue, which I think is all Jesper wanted. So we will have to tackle queue selection at some point. The main obstacle for me is defining what queue selection means when the program is not offloaded to HW... Implementing queue selection on the HW side is trivial.

> If you demux using a eBPF program or via a filter model like [...]

Yes, for steering to VFs we could potentially reuse a lot of existing infrastructure.

> The question I have is should the "filter" part of the eBPF program [...]

I don't think we should enforce the separation here. What we may want to do before forwarding to the VF can be much more complicated than pure demux/filtering (a simple example: pop VLAN/tunnel). The VF representative model works well here as a fallback: if the program could not be offloaded, it will be run on the host and "trombone" packets via the VFR into the VF.

If we have a chain of BPF programs, we can order them in increasing level of complexity/features required, and then HW could transparently offload the first parts (the easier ones), leaving more complex processing on the host.

This should probably be paired with some sort of "skip-sw" flag to let user space enforce the HW offload on the fast-path part.
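The "offload the first parts of the chain" idea can be modelled in a few lines of plain C. This is a sketch under assumptions, not an existing kernel interface: the capability bits, `struct chain_prog`, `plan_offload` and the per-program `skip_sw` field are all hypothetical names chosen to mirror the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical feature bits a program may need from the hardware. */
enum {
    CAP_PARSE      = 1u << 0,
    CAP_VLAN_POP   = 1u << 1,
    CAP_MAP_LOOKUP = 1u << 2,
};

struct chain_prog {
    const char *name;
    unsigned needed_caps;  /* features this program requires */
    bool skip_sw;          /* must run in HW; never fall back to the host */
};

/*
 * The chain is assumed to be ordered by increasing complexity. Offload
 * the longest prefix whose requirements the hardware meets; the rest
 * runs on the host. Returns the number of offloaded programs, or -1 if
 * a program marked skip_sw would end up in software.
 */
int plan_offload(const struct chain_prog *chain, size_t n, unsigned hw_caps)
{
    size_t split = 0;

    while (split < n && (chain[split].needed_caps & ~hw_caps) == 0)
        split++;

    for (size_t i = split; i < n; i++)
        if (chain[i].skip_sw)
            return -1;

    return (int)split;
}
```

The split has to be a prefix because each program consumes the output of the previous one; once one stage runs on the host, everything after it does too.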
|
On Fri, 8 Jul 2016 18:51:07 +0100 Jakub Kicinski <jakub.kicinski@...> wrote:
> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
>> The only distinction between VFs and queue groupings on my side is VFs [...]
> To do BPF RSS we need a way to select the queue which I think is all [...]

Yes, I do see the problem of fallback when the program's "filter" demux cannot be offloaded to hardware.

At first I thought it was a good idea to keep the "demux-filter" part of the eBPF program, as a software fallback can still apply this filter in SW and just mark the packets as not-zero-copy-safe. But when HW offloading is not possible, packets can be delivered to every RX queue, and SW would need to handle that, which is hard to keep transparent.

>> If you demux using a eBPF program or via a filter model like [...]

Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want by setting up ntuple filters and (if Alexei allows it) assigning an application-specific XDP eBPF program to a specific RX queue.

 ethtool -K eth2 ntuple on
 ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42 and promise/guarantee that it will consume all packets. And then the backing page-pool can allow zero-copy RX (and enable scrubbing when refilling the pool).

> Yes, for steering to VFs we could potentially reuse a lot of existing [...]

That is an interesting idea.

> If we have a chain of BPF programs we can order them in increasing [...]

I'll try to keep out of the discussion of how to structure the BPF program, as it is outside my "area".

-- 
Best regards,
  Jesper Dangaard Brouer
|
Alexei Starovoitov
On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Jul 2016 18:51:07 +0100 [...]

So such an ntuple rule will send udp4 traffic for a specific ip and port into a queue, and then it will somehow get zero-copied to the vm?
 . It looks like a lot of other pieces about zero-copy and qemu need to be implemented (or at least architected) for this scheme to be conceivable.
 . And when all that happens, what is the vm going to do with this very specific traffic? The vm won't have any tcp or even ping?

Network virtualization traffic is typically encapsulated, so if xdp is used to steer the traffic, the program would need to figure out the vm id based on headers, strip the tunnel, and apply policy before forwarding the packet further. Clearly hw ntuple is not going to suffice.

If there is no network virtualization and VMs are operating in a flat network, then there is no policy, no ip filter, no vm migration, only a mac per vm, and sriov handles this case just fine. When hw becomes more programmable, we'll be able to load an xdp program into hw that does tunnel and policy handling and forwards into a vf; then sriov will become actually usable for cloud providers.

hw xdp into a vf is more interesting than into a queue, since there is more than one queue/interrupt per vf, and a network-heavy vm can actually consume a large amount of traffic.
|
John Fastabend
On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
> On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Jul 2016 18:51:07 +0100 [...]
> so such ntuple rule will send udp4 traffic for specific ip and port [...]

I have perhaps a different motivation for having queue steering in 'tc cls-u32' and eventually xdp. The general idea is that I have thousands of queues and I can bind applications to the queues. When I know an application is bound to a queue, I can enable per-queue busy polling (to be implemented), set specific interrupt rates on the queue (implementation will be posted soon), bind the queue to the correct cpu, etc.

ntuple works OK for this now, but xdp provides more flexibility and also lets us add additional policy on the queue other than simply queue steering.

I'm not convinced, though, that the demux queue selection should be part of the XDP program itself, just because it has no software analog; to me it sits in front of the set of XDP programs. But I think I could perhaps be convinced it does if there is some reasonable way to do it. I guess the single-program method would result in an XDP program that reads like

  if (rx_queue == x)
      do_foo
  if (rx_queue == y)
      do_bar

A hardware jit may be able to sort that out. Or use per-queue sections.

> [...]

Yep :)

> hw xdp into vf is more interesting than into a queue, since there is [...]

Another use case I have is to make a really high performance AF_PACKET interface. So if there was a way to, say, bind a queue to an AF_PACKET ring and run a policy XDP program before hitting the AF_PACKET descriptor bit, that would be really interesting, because it would solve some of my need for poll-mode drivers in userspace.

.John
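The "per-queue sections" alternative to the if-chain above can be pictured as a table lookup keyed by RX queue. The sketch below is plain userspace C with hypothetical names throughout (`struct queue_table`, `run_on_queue`, the `V_*` verdicts, and `do_foo`/`do_bar` borrowed from the pseudo-code); it only illustrates the dispatch shape and is not an XDP API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts for this sketch. */
enum verdict { V_PASS, V_DROP, V_TO_USER };

typedef enum verdict (*queue_prog)(const uint8_t *pkt, size_t len);

#define NUM_RX_QUEUES 64

/* One optional program per RX queue; NULL means "no program bound". */
struct queue_table {
    queue_prog prog[NUM_RX_QUEUES];
};

/* Sample per-queue policies. */
enum verdict do_foo(const uint8_t *pkt, size_t len)
{
    (void)pkt;
    return len >= 64 ? V_TO_USER : V_DROP;  /* e.g. reject short frames */
}

enum verdict do_bar(const uint8_t *pkt, size_t len)
{
    (void)pkt; (void)len;
    return V_DROP;
}

/*
 * Dispatch on the queue the packet arrived on. Queues without a bound
 * program fall through to the normal stack.
 */
enum verdict run_on_queue(const struct queue_table *t, uint32_t rx_queue,
                          const uint8_t *pkt, size_t len)
{
    if (rx_queue >= NUM_RX_QUEUES || t->prog[rx_queue] == NULL)
        return V_PASS;
    return t->prog[rx_queue](pkt, len);
}
```

Compared with a single program full of `rx_queue` comparisons, the table keeps each application's policy separate and leaves the steering decision (which queue a flow lands on) outside the programs.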
|
Jakub Kicinski
On Tue, 12 Jul 2016 12:13:01 -0700, John Fastabend wrote:
> On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
>> On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
>>> On Fri, 8 Jul 2016 18:51:07 +0100 [...]
>> so such ntuple rule will send udp4 traffic for specific ip and port [...]
> I have perhaps a different motivation to have queue steering in 'tc [...]

Yes, although if we expect XDP to be the target of offloading efforts, putting the demux here doesn't seem like an entirely bad idea. We could say demux is just an API that more capable drivers/HW can implement.

> But I think I could perhaps [...]

+1
|
On Tue, 12 Jul 2016 12:13:01 -0700 John Fastabend <john.fastabend@...> wrote:
> On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
>> On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
>>> On Fri, 8 Jul 2016 18:51:07 +0100 [...]
>> so such ntuple rule will send udp4 traffic for specific ip and port [...]
> I have perhaps a different motivation to have queue steering in 'tc [...]

+1 to binding applications to queues.

This is basically what our customers are requesting. They have one or two applications that need DPDK speeds, but they don't like dedicating an entire NIC per application (as DPDK requires).

The basic idea is actually more fundamental. It reminds me of Van Jacobson's netchannels[1], where he talks about "Channelize" (slides 24+). Creating a full "application" channel allows for a lock-free single producer single consumer (SPSC) queue directly into the application.

[1] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf

> ntuple works OK for this now but xdp provides more flexibility and [...]

Yes, that is also why I wanted an XDP program per RX queue. But the "channelize" concept is more important.

> A hardware jit may be able to sort that out. Or use per queue [...]

+1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP. It would basically allow for implementing DPDK or netmap on top of XDP (at least the RX side) without needing to run a NIC driver in userspace.

-- 
Best regards,
  Jesper Dangaard Brouer
|
Thomas Monjalon <thomas.monjalon@...>
Hi,
About RX filtering, there is an ongoing effort in DPDK to write an API that could leverage most of the hardware capabilities of any NIC:
 https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
 http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
I understand that XDP does not aim to support every hardware feature, though it may be an interesting approach to check.

2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
> On Tue, 12 Jul 2016 12:13:01 -0700 [...]

Have you started this work? Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?

> +1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP. [...]

Why would TX not be possible through AF_PACKET?
|
Tom Herbert <tom@...>
On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon <thomas.monjalon@...> wrote:
> Hi, [...]

Thomas,

A major goal of XDP is to leverage and in fact encourage innovation in hardware features. But we are asking that vendors design the APIs with the community in mind. For instance, if XDP supports crypto offload, it should have one API shared by different companies; we don't want every vendor coming up with their own.

> 2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
>> On Tue, 12 Jul 2016 12:13:01 -0700 [...]
> Have you started this work? [...]

I don't understand why you would use AF_PACKET with DPDK. They should be mutually exclusive. XDP over DPDK does make sense.

Tom
|