XDP: Contents of XDP meta-data
What should be the content of the XDP meta-data structure?
I find the HW RX-hash very useful for DDoS filtering, and I would like for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP packet? (E.g. an XDP-hook call could prepend a header, changing the offset and length.)

What about (HW) checksum states?

Do we (or HW) care about the input/ingress "port"?

What about offload feature flags?

Tom Herbert talked about making this both portable and extensible for specific HW needs of the future... conflicting goals? John Fastabend mentioned some approach that he might be able to share?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
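For concreteness, the questions above could map onto a structure roughly like the following. This is purely a sketch; every field name here is hypothetical and nothing in the thread settles on this layout.

```c
#include <stdint.h>

/* Hypothetical XDP metadata layout -- illustrative only, not a
 * proposed kernel API.  Each field corresponds to one of the open
 * questions in the thread. */
struct xdp_md_sketch {
    uint32_t rx_hash;      /* HW RX-hash, e.g. for DDoS filtering;
                            * should survive until TX completion */
    uint16_t offset;       /* start of packet data within the page */
    uint16_t len;          /* packet length; both may change if a
                            * program prepends or strips headers */
    uint32_t csum;         /* checksum state (e.g. checksum complete) */
    uint16_t ingress_port; /* input port/queue, if not implied */
    uint16_t flags;        /* HW offload/feature flags */
};
```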
Brenden Blanco <bblanco@...>
On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer <brouer@...> wrote:
> Do we need the (page-)offset and length when forwarding/TX'ing an XDP packet?

Not in the metadata structure itself, but it may be visible to the verifier/JIT for accounting purposes. I do foresee a need to grow/shrink head/tail, which would be exposed as a helper function that operates on the underlying structure.

> Do we (or HW) care about the input/ingress "port"?

Only if we can come up with a platform-agnostic identification scheme, and only if there is a strong use case for it. So far this is not the case.
John Fastabend
On 16-06-02 09:19 AM, Brenden Blanco wrote:
The high level idea is to not hard code the metadata as we have traditionally done in NIC descriptors, but to let this be programmed via the XDP program. For example, a parser block written in bpf can specify that the NIC should extract various fields, call some hash routine on them, and put the results in a block somewhere. This sequence may only exist in a single XDP program, but flexible fixed ASIC hardware, NPUs, FPGAs, etc. should be able to "jit" this onto their hardware. (Do we really want to call this 'jit'? It's more of a translation to me, but OK.) The less flexible hardware that we have today will have to do some pattern matching to see if it can support the operation, instead of programming the device directly.

One way we do this is to define a packed struct in the program:

    struct my_metadata {
        struct header_foo outer_header;
        u32 hash_fields0;
        u32 hash_fields1;
        u64 foobar;
    } __attribute__ xdp_metadata

Then XDP can define a "standard" way to pack structures with the xdp_metadata attribute for the host program.

One option I've looked at briefly is to push the metadata onto the front of the packet. Other options are to have an opaque pointer to it in the descriptor or, if it's small enough, pack it inline in the descriptor. Packing the metadata inline can be limiting though, depending on how much metadata starts to get passed around. This scheme can handle vlan header information, for example, as just another bit of metadata.

Thanks,
.John
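The "push the metadata onto the front of the packet" option John mentions could be sketched as follows. The struct fields (`hash_fields0`, `foobar`, etc.) come from his example; the helper and its name are hypothetical, just to show how a consumer would split the buffer into the HW-filled block and the real packet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata block that the HW is asked to prepend to each
 * packet buffer; field names follow John's example, minus the
 * hw-specific outer_header. */
struct my_metadata {
    uint32_t hash_fields0;
    uint32_t hash_fields1;
    uint64_t foobar;
} __attribute__((packed));

/* Split a raw buffer into the metadata block and the start of the
 * actual packet data.  Returns the packet pointer, or NULL if the
 * buffer is too short to hold the metadata.  Illustrative only. */
static const uint8_t *split_metadata(const uint8_t *buf, size_t buflen,
                                     struct my_metadata *md)
{
    if (buflen < sizeof(*md))
        return NULL;
    memcpy(md, buf, sizeof(*md));   /* metadata sits in front */
    return buf + sizeof(*md);       /* packet data starts here */
}
```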
Tom Herbert <tom@...>
On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer <brouer@...> wrote:

> Do we need the (page-)offset and length when forwarding/TX'ing an XDP packet?

Yes, we'll probably need offset and length as input and output. But I think we are restricting packets to a single page, so we don't need a vector of these.

> What about (HW) checksum states?

Yes, but IMO we should only have protocol-generic interfaces: checksum complete on receive and csum_start,csum_offset on transmit (no checksum-unnecessary, no IPv4- or IPv6-specific checksum mechanisms).

> Do we (or HW) care about the input/ingress "port"?

I assume each port (queue) has its own instance of XDP, so the ingress port is implied. Maybe this requires per-instance data?

> What about offload feature flags?

I think we need to consider them on a case-by-case basis. csum and rxhash are probably "in" I would think. TSO, LRO, and encapsulation acceleration (VLAN insertion, etc.) I tend to think are less interesting for XDP. The API for something like encryption should probably be considered separately, since it is far more complex than the simple offloads already established.

> Tom Herbert talked about making this both portable and extensible for specific HW needs of the future... conflicting goals?

I think that depends on how we define portability. Portability can be defined by the API that a program uses, so if the program runs on one platform that supports the API used by the program, it should run on any platform that supports the same API. Obviously, if a program is restricted to use only the most basic API we expect it to be nearly universally portable, and one that uses more complex features, say some crypto offload, we expect it to be portable to only the subset of devices that provide the crypto API. The one trap that I think we want to avoid is transparently doing software emulation of features for the sake of portability. If a program uses some API not supported by a particular piece of HW, that should be addressed before run time (the program may be altered to call SW emulation, eliminate attempts to use the feature, or do some other explicit workaround). Also, we are not talking about binary compatibility, only source compatibility; recompilation and reconfiguration are allowed.

Tom
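The protocol-generic "checksum complete" Tom argues for is just the ones-complement sum over the whole packet, with no parsing. A minimal software model of it (an illustrative helper, not kernel code) might look like:

```c
#include <stddef.h>
#include <stdint.h>

/* "Checksum complete": ones-complement sum over the entire packet,
 * carries folded back in.  No protocol knowledge needed, which is why
 * it generalizes where IPv4/IPv6-specific validation does not.
 * Illustrative model only. */
static uint16_t csum_complete(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)        /* sum 16-bit words */
        sum += (uint32_t)data[i] << 8 | data[i + 1];
    if (len & 1)                            /* odd trailing byte */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                       /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

On transmit, the complementary csum_start,csum_offset pair just tells the device where to start summing and where to write the folded result, so the same generic primitive covers both directions.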
Alexei Starovoitov
On Thu, Jun 2, 2016 at 9:53 AM, John Fastabend <john.fastabend@...> wrote:
> The high level idea is to not hard code the metadata as we have traditionally done in NIC descriptors, but to let this be programmed via the XDP program.

Exactly. I think that very much resonates with what Thomas said in the other thread:

> I see little value in enabling VLAN acceleration in an XDP

sounds like we all are on the same page here. I also think the program should see the packet the way it is transmitted on the wire. That applies not only to the vlan header, but to other dsa-like headers that some switches put in front of the packet, and some nics do too. Technically it's packet metadata, but since it's on the wire, the program should see it the same way and access it with 'packet load/store' instructions. If we start moving such md from the buffer that was dma-ed into main memory into an additional md structure that the program is seeing, we're only wasting cycles copying things around and reducing generality.

John's idea of having a 'canned' bpf program that runs in hw just before the 'main' bpf program runs on the cpu is quite interesting! Is the purpose of this 'canned' program to define such a 'struct my_md' in a hw-specific way, using the same C language as the main bpf program? Different hw nics will provide their own 'bpf-rx' program? Would the same approach work for tx? Something hw specific before and after the main bpf prog?

    phys_wire -> bpf_rx (hw specific, runs in hw) -> main_bpf ->
              -> bpf_tx (hw specific, runs in hw) -> phys_wire

The nic vendors provide the bpf_rx/bpf_tx programs and users write their own middle part? If we take it a step further, some hw-specific super optimizer can take all 3 programs and jit them all at once. From the user point of view, it will be easy enough to read the C code of the bpf_rx and bpf_tx programs to understand what fancy things this hw has.
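The staged rx/main/tx composition Alexei describes can be sketched as plain function chaining. Everything here is hypothetical (XDP has no such chaining API in this discussion); it only illustrates the control flow of vendor stage -> user stage -> vendor stage.

```c
#include <stdint.h>

/* A stage gets the packet and its length and returns 0 to continue
 * down the pipeline, nonzero as an early verdict (e.g. drop). */
typedef int (*xdp_stage_t)(uint8_t *pkt, uint16_t *len);

/* Sketch of Alexei's pipeline: hw-specific bpf_rx first, the user's
 * main program in the middle, hw-specific bpf_tx last.  Names and
 * signatures are illustrative only. */
static int run_pipeline(xdp_stage_t bpf_rx, xdp_stage_t main_bpf,
                        xdp_stage_t bpf_tx, uint8_t *pkt, uint16_t *len)
{
    int verdict;

    if ((verdict = bpf_rx(pkt, len)) != 0)    /* vendor rx stage */
        return verdict;
    if ((verdict = main_bpf(pkt, len)) != 0)  /* user's program */
        return verdict;
    return bpf_tx(pkt, len);                  /* vendor tx stage */
}

/* Trivial stages for demonstration. */
static int stage_pass(uint8_t *pkt, uint16_t *len)
{
    (void)pkt; (void)len;
    return 0;   /* continue */
}

static int stage_drop(uint8_t *pkt, uint16_t *len)
{
    (void)pkt; (void)len;
    return 1;   /* early verdict */
}
```

A "super optimizer" in this picture would fuse the three stages and split the fused program back across hw and sw boundaries, rather than running them as separate calls.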
Hannes Frederic Sowa
On 02.06.2016 23:14, Alexei Starovoitov via iovisor-dev wrote:
> sounds like we all are on the same page here.

I asked that during the meeting because of my limited exposure to hardware and its programmability, but maybe John can answer this. Does e.g. a VF connected to a network card's internal switch reparse the vlan id, instead of taking it out of the descriptor? I hope it does. I would be happy if the packets can simply be processed as-is, but I would like to prevent any discrepancy between what XDP assumes and what hardware expects. Otherwise dropping vlan headers or other headers is absolutely unnecessary.

Bye,
Hannes
Alexei Starovoitov
On Thu, Jun 2, 2016 at 3:30 PM, Hannes Frederic Sowa
<hannes@...> wrote:

> Does e.g. a VF connected to a network card's internal switch reparse the vlan id, instead of taking it out of the descriptor?

ixgbe, mlx, e1000 drivers can disable vlan filtering. What happens with the internal nic switch and vlan doesn't matter. If the internal nic switch doesn't have a 'vlan' problem, good; if it does, oh well, it's a bug from the XDP point of view and they would have to fix the hw or firmware.

Like what I see in mlx4 is that the rx descriptor doesn't provide checksum complete and instead tries to parse ip and computes the csum of the rest. That's a bug. If the firmware cannot be fixed to do real checksum_complete, it won't be used in scenarios where xdp programs need to deal with full packet csum. There will be other hw limitations. That's ok.

Rephrasing what Tom said in the other email: xdp should not do sw emulation when hw cannot do the feature. Like if there is a nic that always puts the vlan in descriptor md, the xdp approach will not be efficient there, since moving mac headers for every packet in sw is a waste. Either vendors have to fix it in firmware, or else. At present I think the xdp hw requirements are very minimal and very reasonable.
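To make the per-packet cost Alexei alludes to concrete: if a NIC strips the VLAN tag into its descriptor, software has to move the 12 bytes of MAC addresses for every packet to put the tag back in line. A sketch of that re-insertion (illustrative helper, not driver code; assumes 4 bytes of headroom before the packet):

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN  6   /* length of one MAC address */
#define VLAN_HLEN 4   /* 802.1Q tag: 2-byte TPID + 2-byte TCI */

/* Re-insert a VLAN tag that the NIC stripped into its descriptor.
 * 'pkt' points at the untagged frame and must have VLAN_HLEN bytes of
 * headroom in front of it.  Returns the new packet start. */
static uint8_t *vlan_reinsert(uint8_t *pkt, uint16_t tpid, uint16_t tci)
{
    uint8_t *new_start = pkt - VLAN_HLEN;

    /* Move dst+src MAC addresses 4 bytes toward the headroom... */
    memmove(new_start, pkt, 2 * ETH_ALEN);
    /* ...then write the tag between them and the original ethertype
     * (network byte order). */
    new_start[12] = tpid >> 8;
    new_start[13] = tpid & 0xff;
    new_start[14] = tci >> 8;
    new_start[15] = tci & 0xff;
    return new_start;
}
```

That memmove on every received packet is the waste being argued against: if the tag simply stays in the buffer as it was on the wire, none of this is needed.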
John Fastabend
[...]
> Exactly. ... sounds like we all are on the same page here.

Great.

> I also think the program should see the packet the way it is transmitted on the wire.

Hmm, interesting. "see" here is going to get pretty loose when the hardware is modifying fields and packet data. But I think I agree with the spirit. The XDP program shouldn't be trying to decide if the hardware popped tag X or not and then repair the packet. I think a compiler pass, on the other hand, should be free to make these optimizations.

> That applies not only to vlan header, but to other dsa-like headers

The dsa-like headers are hardware specific, or at least vendor specific. I'm not sure you want to see them. But anyways, most of the hardware I've worked with can either strip the tags or not. And "on the wire" here is also sort of loose. At least on the hardware/software I have, it's only on the "wire" when you try to cluster multiple chips to make it look like one switch, or when you are using some non-standard bonding features across switches, and probably a few other corner cases I've never used. The tags are never pushed into the actual network; it is more of an internal accounting thing.

> Is the purpose of this 'canned' program to define such a 'struct my_md' in a hw-specific way?

That is my idea, yes. On some hardware the bpf program will be "canned", meaning you, the software guy, can't change it. But at least you can unambiguously define what the hardware is doing. An added plus is that I get these from the hardware architects and don't have to spend hours reading over English-prose 1000+ page spec sheets. Another benefit is that it should fit into the APIs being generated for your bpf maps/programs. On more flexible hardware this program doesn't need to be canned; it can actually be written by the software developer. With the Netronome cards, for example, when their JIT gets upstream, you could have two programs: one for hardware and one for software.

> Different hw nics will provide their own 'bpf-rx' program?

Yes, if the hardware is not flexible you get these canned rx/tx programs. Or when you power on your NIC it will use the vendor-provided rx/tx program.

> Something hw specific before and after main bpf prog?

Yep.

> If we take it a step further, some hw specific super optimizer can take all 3 programs and jit them all at once.

Yep. And eventually you write all three components. Or, as you say, I would like to write the super optimizer that just takes one program and generates an "optimal" decomposition into hw-rx - sw-rx - sw-tx - hw-tx. The FPGA and GPU folks have some of this logic started with OpenCL and other tools. And LLVM today already understands a lot of this.

.John