XDP: Contents of XDP meta-data


Jesper Dangaard Brouer
 

What should be the content of the XDP meta-data structure?

I find the HW RX-hash very useful for DDoS filtering, and I would like
for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP
packet? (E.g. an XDP-hook call could prepend a header, changing the
offset and length).

What about (HW) checksum states?

Do we (or HW) care about the input/ingress "port"?

What about offload feature flags?


Tom Herbert talked about making this both portable and extensible for
specific HW needs of the future... conflicting goals?

John Fastabend mentioned some approach that he might be able to share?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Brenden Blanco <bblanco@...>
 

On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer <brouer@...> wrote:

What should be the content of the XDP meta-data structure?

I find the HW RX-hash very useful for DDoS filtering, and I would like
for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP
packet? (E.g. an XDP-hook call could prepend a header, changing the
offset and length).

Not in the meta data structure itself, but it may be visible to the
verifier/jit for accounting purposes. I do foresee a need for
grow/shrink head/tail, which would be exposed as a helper function that
operates on the underlying structure.
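
As a rough illustration of such a grow/shrink-head helper, here is a minimal user-space sketch. The struct pkt_buf layout and the name pkt_adjust_head are assumptions made for this example only; they are not an existing kernel interface. The helper only moves the packet start inside a single-page buffer and refuses any move that would leave the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one single-page packet buffer (not a kernel struct). */
struct pkt_buf {
    uint8_t *hard_start; /* first usable byte of the page */
    uint8_t *data;       /* current start of the packet */
    uint8_t *data_end;   /* one past the last packet byte */
    uint8_t *hard_end;   /* one past the last usable byte */
};

/* Move the packet start by delta bytes: a negative delta grows the head
 * (room to prepend a header), a positive delta shrinks it.
 * Returns 0 on success, -1 if the move would leave the buffer or pass
 * the end of the packet. */
static int pkt_adjust_head(struct pkt_buf *p, int delta)
{
    assert(p && p->hard_start <= p->data && p->data <= p->data_end);
    if (delta < 0 && (p->data - p->hard_start) < -(ptrdiff_t)delta)
        return -1; /* not enough headroom */
    if (delta > 0 && (p->data_end - p->data) < (ptrdiff_t)delta)
        return -1; /* would shrink past the packet end */
    p->data += delta;
    return 0;
}

/* Current packet length as seen by the program. */
static size_t pkt_len(const struct pkt_buf *p)
{
    return (size_t)(p->data_end - p->data);
}
```

With a helper of this shape the offset and length never need to be copied into the metadata; they follow from data and data_end after each adjustment.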


What about (HW) checksum states?

Do we (or HW) care about the input/ingress "port"?

Only if we can come up with a platform agnostic identification scheme,
and only if there is a strong use case for it. So far this is not the case.

What about offload feature flags?


Tom Herbert talked about making this both portable and extensible for
specific HW needs of the future... conflicting goals?

John Fastabend mentioned some approach that he might be able to share?



John Fastabend
 

On 16-06-02 09:19 AM, Brenden Blanco wrote:
On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer
<brouer@...> wrote:


What should be the content of the XDP meta-data structure?

I find the HW RX-hash very useful for DDoS filtering, and I would like
for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP
packet? (E.g. an XDP-hook call could prepend a header, changing the
offset and length).


Not in the meta data structure itself, but it may be visible to the
verifier/jit for accounting purposes. I do foresee a need for
grow/shrink head/tail, which would be exposed as a helper function that
operates on the underlying structure.


What about (HW) checksum states?

Do we (or HW) care about the input/ingress "port"?


Only if we can come up with a platform agnostic identification scheme,
and only if there is a strong use case for it. So far this is not the case.


What about offload feature flags?


Tom Herbert talked about making this both portable and extensible for
specific HW needs of the future... conflicting goals?

John Fastabend mentioned some approach that he might be able to share?
The high level idea is to not hard code the metadata as we have
traditionally done in NIC descriptors but to let this be programmed via
the XDP program.

For example, a parser block written in bpf can specify that the NIC
should extract various fields, call some hash routine on them, and put
the results in a block somewhere. This sequence may only exist in a
single XDP program, but flexible fixed asic hardware, npus, fpgas, etc.
should be able to "jit" this onto their hardware. (Do we really want to
call this 'jit'? It's more of a translation to me, but ok.) The less
flexible hardware that we have today will have to do some pattern
matching to see if it can support the operation instead of programming
the device directly.

One way to do this is to define a packed struct in the program:

    struct my_metadata {
        struct header_foo outer_header;
        u32 hash_fields0;
        u32 hash_fields1;
        u64 foobar;
    } __attribute__((xdp_metadata)); /* proposed attribute, does not exist yet */

Then XDP can define a "standard" way to pack structures with the
xdp_metadata attribute for the host program. One option I've looked at
briefly is to push the metadata onto the front of the packet. Other
options are to have an opaque pointer to it in the descriptor or, if it's
small enough, to pack it inline in the descriptor. Packing the metadata
inline can be limiting though, depending on how much metadata starts to
get passed around.

This scheme can handle vlan header information for example as just
another bit of metadata.
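
To make the "push the metadata onto the front of the packet" option concrete, here is a minimal user-space sketch. Everything in it is an assumption for illustration: the field set is a simplified stand-in for the struct above, the real GCC/Clang packed attribute is used in place of the proposed xdp_metadata attribute, and md_push/md_read are hypothetical names for the device side and the host side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout only; the fields are placeholders. */
struct my_metadata {
    uint32_t hash_fields0;
    uint32_t hash_fields1;
    uint16_t vlan_tci;      /* vlan info carried as just another field */
} __attribute__((packed));

/* "Device" side: write md into the headroom directly in front of the
 * packet. Returns the start of the metadata, or NULL without headroom. */
static uint8_t *md_push(uint8_t *buf_start, uint8_t *pkt,
                        const struct my_metadata *md)
{
    assert(buf_start <= pkt);
    if ((size_t)(pkt - buf_start) < sizeof(*md))
        return NULL;
    uint8_t *dst = pkt - sizeof(*md);
    memcpy(dst, md, sizeof(*md));
    return dst;
}

/* Host side: copy the metadata out again after checking that the
 * advertised region really holds a full struct. */
static int md_read(const uint8_t *md_start, const uint8_t *pkt,
                   struct my_metadata *out)
{
    if (md_start > pkt || (size_t)(pkt - md_start) < sizeof(*out))
        return -1;
    memcpy(out, pkt - sizeof(*out), sizeof(*out));
    return 0;
}
```

The packet bytes themselves are untouched; only the headroom in front of them is used, so the amount of metadata is limited by headroom rather than by descriptor size.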

Thanks,
.John


Tom Herbert <tom@...>
 

On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

What should be the content of the XDP meta-data structure?

I find the HW RX-hash very useful for DDoS filtering, and I would like
for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP
packet? (E.g. an XDP-hook call could prepend a header, changing the
offset and length).
Yes, we'll probably need offset and length as input and output. But I
think we are restricting packets to a single page so we don't need a
vector of these.

What about (HW) checksum states?
Yes, but IMO we should only have protocol generic interfaces: checksum
complete on receive and csum_start/csum_offset on transmit (no
checksum-unnecessary, no IPv4 or IPv6 specific checksum mechanisms).
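
A minimal sketch of what these two protocol generic interfaces amount to, written as plain user-space C. The function names are made up for this example. It assumes the usual 16-bit one's-complement Internet checksum, and that on transmit the checksum field already holds whatever seed the protocol needs (zero here, or a pseudo-header sum).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Folded 16-bit one's-complement sum over len bytes, big-endian words.
 * A 32-bit accumulator is enough for single-page packets. */
static uint16_t csum_fold_sum(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8; /* odd trailing byte, zero padded */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Receive: "checksum complete" is just the raw sum over the bytes the
 * device covered; no protocol knowledge is involved. */
static uint16_t rx_csum_complete(const uint8_t *pkt, size_t len)
{
    return csum_fold_sum(pkt, len, 0);
}

/* Transmit: sum from csum_start to the end of the packet and store the
 * complement at csum_start + csum_offset. Returns -1 on bad offsets. */
static int tx_csum_fill(uint8_t *pkt, size_t len,
                        size_t csum_start, size_t csum_offset)
{
    assert(pkt != NULL);
    if (csum_start > len || csum_offset + 2 > len - csum_start)
        return -1;
    uint16_t c = (uint16_t)~csum_fold_sum(pkt + csum_start,
                                          len - csum_start, 0);
    pkt[csum_start + csum_offset]     = (uint8_t)(c >> 8);
    pkt[csum_start + csum_offset + 1] = (uint8_t)(c & 0xff);
    return 0;
}
```

Neither function parses IPv4, IPv6, TCP or UDP, which is the point of keeping the interface generic.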

Do we (or HW) care about the input/ingress "port"?
I assume each port (queue) has its own instance of XDP so the ingress
port is implied. Maybe this requires per-instance data?

What about offload feature flags?
I think we need to consider them on a case-by-case basis. csum and
rxhash are probably "in" I would think. TSO, LRO, encapsulation
acceleration (VLAN insertion, etc.) I tend to think are less
interesting for XDP. The API for something like encryption should
probably be considered separately since it is far more complex than
the simple offloads already established.


Tom Herbert talked about making this both portable and extensible for
specific HW needs of the future... conflicting goals?
I think that depends on how we define portability. Portability can be
defined by the API that a program uses, so if the program runs on one
platform that supports the API used by the program, it should run on
any platform that supports the same API. Obviously, if a program is
restricted to use only the most basic API, we expect it to be nearly
universally portable, and one that uses more complex features, say
some crypto offload, we expect to be portable to only the subset of
devices that provide the crypto API. The one trap that I think we want to
avoid is transparently doing software emulation of features for the
sake of portability. If a program uses some API not supported by a
particular piece of HW, that should be addressed before run time (the
program may be altered to call SW emulation, eliminate attempts to use
the feature, or do some other explicit workaround). Also, we are not
talking about binary compatibility, only source compatibility;
recompilation and reconfiguration are allowed.
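
One way to picture "addressed before run time" is a check at attach time that compares what the program needs against what the device offers and refuses to load on a mismatch, with no silent fallback. The sketch below is only an illustration; the feature bits and function names are invented for it and do not correspond to any existing API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits; no such enum exists in the kernel. */
enum xdp_feat {
    XDP_FEAT_RXHASH        = 1u << 0,
    XDP_FEAT_CSUM_COMPLETE = 1u << 1,
    XDP_FEAT_CRYPTO        = 1u << 2,
    XDP_FEAT_ALL           = (1u << 3) - 1,
};

/* Bits the program requires but the device does not provide. */
static uint32_t xdp_missing_feats(uint32_t required, uint32_t supported)
{
    assert((required & ~(uint32_t)XDP_FEAT_ALL) == 0); /* known bits only */
    return required & ~supported;
}

/* Attach succeeds only when nothing is missing; there is deliberately
 * no software-emulation path here. Returns 0 on success, -1 otherwise. */
static int xdp_try_attach(uint32_t required, uint32_t supported)
{
    return xdp_missing_feats(required, supported) == 0 ? 0 : -1;
}
```

A program that needs only the basic bits attaches almost everywhere, while one that sets the crypto bit attaches only on devices advertising it; the caller sees the missing bits and can rebuild the program with an explicit workaround.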

Tom

John Fastabend mentioned some approach that he might be able to share?



Alexei Starovoitov
 

On Thu, Jun 2, 2016 at 9:53 AM, John Fastabend <john.fastabend@...> wrote:
On 16-06-02 09:19 AM, Brenden Blanco wrote:
On Thu, Jun 2, 2016 at 3:20 AM, Jesper Dangaard Brouer
<brouer@...> wrote:


What should be the content of the XDP meta-data structure?

I find the HW RX-hash very useful for DDoS filtering, and I would like
for it to survive until TX completion.

Do we need the (page-)offset and length when forwarding/TX'ing an XDP
packet? (E.g. an XDP-hook call could prepend a header, changing the
offset and length).


Not in the meta data structure itself, but it may be visible to the
verifier/jit for accounting purposes. I do foresee a need for
grow/shrink head/tail, which would be exposed as a helper function that
operates on the underlying structure.


What about (HW) checksum states?

Do we (or HW) care about the input/ingress "port"?


Only if we can come up with a platform agnostic identification scheme,
and only if there is a strong use case for it. So far this is not the case.


What about offload feature flags?


Tom Herbert talked about making this both portable and extensible for
specific HW needs of the future... conflicting goals?

John Fastabend mentioned some approach that he might be able to share?
The high level idea is to not hard code the metadata as we have
traditionally done in NIC descriptors but to let this be programmed via
the XDP program.

For example, a parser block written in bpf can specify that the NIC
should extract various fields, call some hash routine on them, and put
the results in a block somewhere. This sequence may only exist in a
single XDP program, but flexible fixed asic hardware, npus, fpgas, etc.
should be able to "jit" this onto their hardware. (Do we really want to
call this 'jit'? It's more of a translation to me, but ok.) The less
flexible hardware that we have today will have to do some pattern
matching to see if it can support the operation instead of programming
the device directly.

One way to do this is to define a packed struct in the program:

    struct my_metadata {
        struct header_foo outer_header;
        u32 hash_fields0;
        u32 hash_fields1;
        u64 foobar;
    } __attribute__((xdp_metadata)); /* proposed attribute, does not exist yet */

Then XDP can define a "standard" way to pack structures with the
xdp_metadata attribute for the host program. One option I've looked at
briefly is to push the metadata onto the front of the packet. Other
options are to have an opaque pointer to it in the descriptor or, if it's
small enough, to pack it inline in the descriptor. Packing the metadata
inline can be limiting though, depending on how much metadata starts to
get passed around.

This scheme can handle vlan header information for example as just
another bit of metadata.
Exactly.
I think that very much resonates with what Thomas said in the other thread:

I see little value in enabling VLAN acceleration in an XDP
environment. A VLAN header would be parsed or constructed with
BPF instructions. Any offload of VLAN parsing or constructions
should be done by offloading the BPF program itself.
sounds like we all are on the same page here.
I also think the program should see the packet the way it
is transmitted on the wire.
That applies not only to vlan header, but to other dsa-like headers
that some switches put in front of the packet and some nics do too.
Technically it's packet metadata, but since it's on the wire, the program
should see it the same way and access it with 'packet load/store' instructions.
If we start moving such md from the buffer that was dma-ed into main memory
into additional md structure that program is seeing, we're only wasting
cycles copying things around and reducing generality.
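
To illustrate what "see it the same way and access it with packet load/store" looks like for a vlan header, here is a minimal sketch in plain C that reads the tag straight out of the packet bytes with bounds checks, the way a bpf program would. The function and macro names are chosen for this example, and only a single 802.1Q tag (TPID 0x8100) is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14      /* dst mac + src mac + ethertype */
#define VLAN_TAG_LEN    4       /* TPID + TCI */
#define ETHERTYPE_8021Q 0x8100

/* Plain big-endian 16-bit packet load. */
static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the L2 header as it appears on the wire.
 * Returns the payload ethertype, or -1 if the packet is truncated.
 * *vid is the vlan id, or -1 when the frame is untagged.
 * *l3_off is the offset of the first byte after the L2 header. */
static int parse_l2(const uint8_t *data, const uint8_t *data_end,
                    int *vid, size_t *l3_off)
{
    assert(data <= data_end);
    size_t len = (size_t)(data_end - data);
    size_t off = ETH_HDR_LEN;

    if (len < ETH_HDR_LEN)
        return -1;
    uint16_t proto = load_be16(data + 12);
    *vid = -1;
    if (proto == ETHERTYPE_8021Q) {
        if (len < off + VLAN_TAG_LEN)
            return -1;
        *vid = load_be16(data + 14) & 0x0fff;   /* low 12 bits of the TCI */
        proto = load_be16(data + 16);
        off += VLAN_TAG_LEN;
    }
    *l3_off = off;
    return proto;
}
```

No descriptor field is consulted; if the tag is in the buffer, the program finds it with two extra loads.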

John's idea of having a 'canned' bpf program that runs
in hw just before the 'main' bpf program runs on the cpu is quite interesting!
Is the purpose of this 'canned' program to define such 'struct my_md'
in a hw specific way using the same C language as the main bpf program?
Would different hw nics provide their own 'bpf-rx' program?
Would the same approach work for tx?
Something hw specific before and after the main bpf prog?
  phys_wire -> bpf_rx prog (hw specific, runs in hw) -> main_bpf ->
  -> bpf_tx (hw specific program, runs in hw) -> phys_wire
The nic vendors provide bpf_rx/bpf_tx programs and users
write their own middle part?

If we take it a step further, some hw specific super optimizer can take
all 3 programs and jit them all at once.

From the user point of view, it will be easy enough to read
the C code of bpf_rx and bpf_tx programs to understand what
fancy things this hw has.
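
The bpf_rx -> main_bpf -> bpf_tx chain can be pictured as three stages sharing one context. The sketch below is only a software model of that idea under invented names (pkt_ctx, stage_fn, run_pipeline); the vendor stages here are toy stand-ins for code that would really run in hw.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet context shared by all stages. */
struct pkt_ctx {
    const uint8_t *data;
    const uint8_t *data_end;
    uint32_t rx_hash;       /* filled in by the rx stage */
};

enum verdict { V_DROP, V_TX };
typedef enum verdict (*stage_fn)(struct pkt_ctx *);

/* Stand-in for a vendor 'bpf_rx': computes a toy hash over the packet. */
static enum verdict vendor_rx(struct pkt_ctx *c)
{
    uint32_t h = 0;
    for (const uint8_t *p = c->data; p < c->data_end; p++)
        h = h * 31 + *p;
    c->rx_hash = h;
    return V_TX;
}

/* User-written middle part: an arbitrary policy on the hash. */
static enum verdict user_main(struct pkt_ctx *c)
{
    return (c->rx_hash & 1) ? V_DROP : V_TX;
}

/* Stand-in for a vendor 'bpf_tx': nothing to do in this model. */
static enum verdict vendor_tx(struct pkt_ctx *c)
{
    (void)c;
    return V_TX;
}

/* Run the stages in order; the first drop ends processing. */
static enum verdict run_pipeline(const stage_fn *stages, size_t n,
                                 struct pkt_ctx *ctx)
{
    assert(stages && ctx);
    for (size_t i = 0; i < n; i++)
        if (stages[i](ctx) == V_DROP)
            return V_DROP;
    return V_TX;
}
```

Because all three stages are ordinary code over the same context, a combined optimizer could in principle inline them into one program, which is the "jit them all at once" case.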


Hannes Frederic Sowa
 

On 02.06.2016 23:14, Alexei Starovoitov via iovisor-dev wrote:
sounds like we all are on the same page here.
I also think the program should see the packet the way it
is transmitted on the wire.
That applies not only to vlan header, but to other dsa-like headers
that some switches put in front of the packet and some nics do too.
Technically it's packet metadata, but since it's on the wire, the program
should see it the same way and access it with 'packet load/store' instructions.
If we start moving such md from the buffer that was dma-ed into main memory
into additional md structure that program is seeing, we're only wasting
cycles copying things around and reducing generality.
I asked that during the meeting because of my limited exposure to
hardware and its programmability, but maybe John can answer those.

Does e.g. a VF connected to a network card internal switch reparse the
vlan id instead of taking it out of the descriptor? I hope it does, I
would be happy if the packets can simply be processed as is, but I would
like to prevent some discrepancy between what XDP assumes and what
hardware expects.

Otherwise dropping vlan headers or other headers is absolutely unnecessary.

Bye,
Hannes


Alexei Starovoitov
 

On Thu, Jun 2, 2016 at 3:30 PM, Hannes Frederic Sowa
<hannes@...> wrote:
On 02.06.2016 23:14, Alexei Starovoitov via iovisor-dev wrote:
sounds like we all are on the same page here.
I also think the program should see the packet the way it
is transmitted on the wire.
That applies not only to vlan header, but to other dsa-like headers
that some switches put in front of the packet and some nics do too.
Technically it's packet metadata, but since it's on the wire, the program
should see it the same way and access it with 'packet load/store' instructions.
If we start moving such md from the buffer that was dma-ed into main memory
into additional md structure that program is seeing, we're only wasting
cycles copying things around and reducing generality.
I asked that during the meeting because of my limited exposure with
hardware and their programmability, but maybe John can answer those.

Does e.g. a VF connected to a network card internal switch reparse the
vlan id instead of taking it out of the descriptor? I hope it does, I
would be happy if the packets can simply be processed as is, but I would
like to prevent some discrepancy between what XDP assumes and what
hardware expects.
ixgbe, mlx, e1000 drivers can disable vlan filtering.
What happens with internal nic switch and vlan doesn't matter.
If internal nic switch doesn't have 'vlan' problem - good,
if it does, oh well, it's a bug from XDP point of view and
they would have to fix hw or firmware.
For example, what I see in mlx4 is that the rx descriptor doesn't
provide checksum complete; instead the hw tries to parse ip and computes
the csum of the rest. That's a bug. If the firmware cannot be fixed
to do real checksum_complete, it won't be used in scenarios
where xdp programs need to deal with the full packet csum.
There will be other hw limitations. That's ok.
Rephrasing what Tom said in the other email:
xdp should not do sw emulation when hw cannot do the feature.
For example, if there is a nic that always puts the vlan in a descriptor md,
the xdp approach will not be efficient there, since moving mac
headers for every packet in sw is a waste. Either vendors
have to fix it in firmware or else.
At present I think xdp hw requirements are very minimal
and very reasonable.
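
For a sense of what "moving mac headers for every packet in sw" means, here is a minimal sketch of putting a vlan tag that the nic stripped into its descriptor back into the packet. The function name and buffer layout are assumptions for this example; the per-packet cost is the 12-byte move plus the headroom check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Re-insert an 802.1Q tag that hw delivered out of band.
 * Needs 4 bytes of headroom in front of data. Returns the new packet
 * start, or NULL if there is no room. */
static uint8_t *vlan_reinsert(uint8_t *buf_start, uint8_t *data, uint16_t tci)
{
    assert(buf_start <= data);
    if (data - buf_start < 4)
        return NULL;
    uint8_t *new_data = data - 4;
    memmove(new_data, data, 12);        /* shift dst + src mac forward */
    new_data[12] = 0x81;                /* TPID 0x8100 */
    new_data[13] = 0x00;
    new_data[14] = (uint8_t)(tci >> 8); /* TCI from the descriptor */
    new_data[15] = (uint8_t)(tci & 0xff);
    return new_data;                    /* original ethertype now at 16..17 */
}
```

If the tag simply stays in the buffer as received, none of this work is needed.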


John Fastabend
 

[...]


The high level idea is to not hard code the metadata as we have
traditionally done in NIC descriptors but to let this be programmed via
the XDP program.

For example, a parser block written in bpf can specify that the NIC
should extract various fields, call some hash routine on them, and put
the results in a block somewhere. This sequence may only exist in a
single XDP program, but flexible fixed asic hardware, npus, fpgas, etc.
should be able to "jit" this onto their hardware. (Do we really want to
call this 'jit'? It's more of a translation to me, but ok.) The less
flexible hardware that we have today will have to do some pattern
matching to see if it can support the operation instead of programming
the device directly.

One way to do this is to define a packed struct in the program:

    struct my_metadata {
        struct header_foo outer_header;
        u32 hash_fields0;
        u32 hash_fields1;
        u64 foobar;
    } __attribute__((xdp_metadata)); /* proposed attribute, does not exist yet */

Then XDP can define a "standard" way to pack structures with the
xdp_metadata attribute for the host program. One option I've looked at
briefly is to push the metadata onto the front of the packet. Other
options are to have an opaque pointer to it in the descriptor or, if it's
small enough, to pack it inline in the descriptor. Packing the metadata
inline can be limiting though, depending on how much metadata starts to
get passed around.

This scheme can handle vlan header information for example as just
another bit of metadata.
Exactly.
I think that very much resonates with what Thomas said in the other thread:
Great.

I see little value in enabling VLAN acceleration in an XDP
environment. A VLAN header would be parsed or constructed with
BPF instructions. Any offload of VLAN parsing or constructions
should be done by offloading the BPF program itself.
sounds like we all are on the same page here.
I also think the program should see the packet the way it
is transmitted on the wire.
hmm interesting, "see" here is going to get pretty loose when the
hardware is modifying fields and packet data. But I think I agree
with the spirit. The XDP program shouldn't be trying to decide if
the hardware popped tag X or not and then repair the packet. I
think a compiler pass, on the other hand, should be free to make these
optimizations.

That applies not only to vlan header, but to other dsa-like headers
that some switches put in front of the packet and some nics do too.
Technically it's packet metadata, but since it's on the wire, the program
should see it the same way and access it with 'packet load/store' instructions.
The dsa-like headers are hardware specific or at least vendor specific.
I'm not sure you want to see them. But anyway, most of the hardware I've
worked with can either strip the tags or not. And "on the wire" here is
also sort of loose. At least on the hardware/software I have, it's only
on the "wire" when you try to cluster multiple chips to make it look
like one switch or when you are using some non-standard bonding
features across switches, and probably a few other corner cases I've
never used. The tags are never pushed into the actual network; it is
more of an internal accounting thing.

If we start moving such md from the buffer that was dma-ed into main memory
into additional md structure that program is seeing, we're only wasting
cycles copying things around and reducing generality.

John's idea of having 'canned' bpf program that runs
in hw just before 'main' bpf program runs in the cpu is quite interesting!
Is the purpose of this 'canned' program to define such 'struct my_md'
in hw specific way using the same C language of main bpf program ?
That is my idea, yes. On some hardware the bpf program will be "canned",
meaning you, the software guy, can't change it. But at least you can
unambiguously define what the hardware is doing. An added plus is that I get
these from the hardware architects and don't have to spend hours reading
over 1000+ page spec sheets of English prose. Another benefit is it should
fit into the APIs being generated for your bpf maps/programs.

On more flexible hardware this program doesn't need to be canned; it
can actually be written by the software developer. With the netronome
cards, for example, when their JIT gets upstream you could have two
programs, one for hardware and one for software.

Different hw nics will provide their own 'bpf-rx' program ?
Would the same approach work for tx ?
Yes if the hardware is not flexible you get these canned rx/tx
programs. Or when you power on your NIC it will use the vendor
provided rx/tx program.

Something hw specific before and after main bpf prog ?
phys_wire -> bpf_rx prog (hw specific runs in hw) -> main_bpf ->
-> bpf_tx (hw specific program runs in hw) -> phys_wire
The nic vendors provide bpf_rx/bpf_tx programs and users
write their own middle part?
yep.

If we take it a step further, some hw specific super optimizer can take
all 3 programs and jit them all at once.
yep.


From the user point of view, it will be easy enough to read
the C code of bpf_rx and bpf_tx programs to understand what
fancy things this hw has.
Yep, and eventually you write all three components. Or, as you say, I would
like to write the super optimizer that just takes one program and
generates an "optimal" decomposition into hw-rx - sw-rx - sw-tx - hw-tx.
The FPGA and GPU folks have some of this logic started with OpenCL and
other tools. And LLVM today already understands a lot of this.

.John