Re: minutes: IO Visor TSC and Dev members call


Jesper Dangaard Brouer
 

On Wed, 2 Mar 2016 22:28:42 -0800
Brenden Blanco <bblanco@...> wrote:

Thanks all for joining today,

We had a very interesting session focused entirely on XDP (express data
path), a new initiative to improve packet processing performance in the
Linux kernel. The details are best covered by the slides, which I'll be
sure to bug Tom for a copy of to share, so I won't make the situation
worse by sharing my possibly erroneous notes.

In a nutshell, the goal is to give the low-level driver architecture of the
kernel some TLC, improving PPS and BPFifying it.

Some early prototypes are already in the works!
I'm doing my usual benchmark-driven development, which means I'm
currently benchmarking the lowest RX layer of the drivers and just
dropping packets inside the driver.

Current results from driver mlx4 (40 Gbit/s) indicate that interacting
with the page-allocator is costing us 30% overhead.
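
To make the test setup concrete, here is a minimal sketch of what such an
early-drop hook could look like in a driver's NAPI RX poll loop. All the
helper names (napi_poll_rx, rx_desc_ready, rx_next_desc, rx_recycle_page,
build_skb_from_desc, benchmark_early_drop) are made up for illustration;
this is not the actual mlx4 code, and only napi_gro_receive() is an
existing kernel API:

/* Illustrative-only sketch of an early-drop benchmark hook in a
 * driver's NAPI RX poll loop; none of these helpers are real mlx4
 * functions.
 */
static int napi_poll_rx(struct rx_ring *ring, int budget)
{
	int done = 0;

	while (done < budget && rx_desc_ready(ring)) {
		struct rx_desc *desc = rx_next_desc(ring);

		if (benchmark_early_drop) {
			/* Drop before SKB alloc and before touching
			 * pkt-data: only descriptor + page handling is
			 * measured (the ~14.5 Mpps case below).
			 */
			rx_recycle_page(ring, desc);
			done++;
			continue;
		}

		/* Normal path: build SKB, touch headers, hand to stack */
		napi_gro_receive(&ring->napi, build_skb_from_desc(ring, desc));
		done++;
	}
	return done;
}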

The performance goals are:
 - 20 Mpps per-CPU drop rate
 - 14 Mpps per-CPU forwarding rate
 - 100 Gbit/s per-CPU GRO

Driver mlx4 early drop tests:
 - 6 Mpps => SKB drop (just calling dev_kfree_skb)
   * main overhead is the first cache-miss on the pkt-data header
 - 14.5 Mpps => driver drop before SKB alloc, no pkt-data touched
   * main overhead (30%) is page-allocator related
 - 20 Mpps => MAX bound, if all 30% page-alloc overhead is removed
   * this is just the upper possible bound... stop tuning when getting close to it
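
A quick back-of-the-envelope time budget (my own rough numbers) shows how
tight these goals are:

  20 Mpps => 1/20,000,000 sec = 50 ns per packet (~150 cycles at 3 GHz)
  14 Mpps => 1/14,000,000 sec ~ 71 ns per packet
  (for reference, 14.88 Mpps is 10GbE wire-rate with 64-byte frames)

A single cache-miss to main memory costs roughly 50-100 ns, which is why
the first pkt-data cache-miss dominates the SKB-drop case, and why the
30% page-allocator overhead is worth removing.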

The mlx4 driver already implements its own page-allocator-cache, but
does not do proper recycling. I want us to implement a more generic
page-allocator-cache that drivers can use, and that supports recycling.
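
To illustrate the direction I'm thinking of, a minimal sketch of what such
a generic, per-ring page-allocator-cache API could look like. The names
and signatures below are purely illustrative, not an existing kernel API:

/* Illustrative-only sketch of a generic page-allocator-cache with
 * recycling, one instance per RX ring.
 */
struct page_cache;	/* opaque; holds a small per-ring pool of pages */

struct page_cache *page_cache_create(unsigned int pool_size, int numa_node);
void page_cache_destroy(struct page_cache *pc);

/* Fast-path: take a page from the recycle pool, falling back to the
 * page allocator only when the pool is empty.
 */
struct page *page_cache_alloc(struct page_cache *pc);

/* Return a page for reuse instead of handing it back to the page
 * allocator, so the next RX refill avoids the ~30% alloc overhead.
 */
void page_cache_recycle(struct page_cache *pc, struct page *page);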


There were some opinions on the initial use cases that some of us would
like to apply this to:
- Drop (DDOS mitigation)
I see DDoS as project goal #1

- Passthrough / Forwarding
I see forwarding as project goal #2

- Delivery to socket
I would actually say we don't want to deliver this kind of RX frame
into sockets. The primary reason is memory consumption: the amount of
time a packet can stay on a socket is unbounded.

For small packets, we might even consider doing a copy if the dest is a
local socket. This is what drivers already do for small packets, but I
would like this "copy-break" to be pushed up a level.
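
A rough sketch of what copy-break could look like once pushed up a level.
The threshold is an assumption, the "is the dest a local socket" check is
left to the caller, and page_cache_recycle() is the hypothetical API from
the sketch above; napi_alloc_skb(), skb_put_data() and page_address() are
existing kernel helpers:

/* Illustrative-only copy-break helper: for small packets headed to a
 * local socket, copy into a small SKB and recycle the RX page at once.
 */
#define COPYBREAK_LEN 256	/* assumed threshold; drivers often use 128-256 */

static struct sk_buff *rx_copybreak(struct napi_struct *napi,
				    struct page_cache *pc,
				    struct page *page, unsigned int offset,
				    unsigned int len)
{
	struct sk_buff *skb;

	if (len > COPYBREAK_LEN)
		return NULL;	/* caller keeps using the page directly */

	skb = napi_alloc_skb(napi, len);
	if (!skb)
		return NULL;

	skb_put_data(skb, page_address(page) + offset, len);
	/* The RX page goes straight back into the ring's recycle pool,
	 * so socket queues never pin the (large) RX pages.
	 */
	page_cache_recycle(pc, page);
	return skb;
}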

In the future we can consider zero-copy RX socket delivery, but the
socket would likely need to opt in via a setsockopt, and the
userspace API programming model also needs some changes.
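
Just to show the kind of opt-in I mean (the option name below is purely
hypothetical and does not exist today; only setsockopt() itself is real):

/* Purely hypothetical socket option -- only illustrates the opt-in idea */
int one = 1;
setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY_RX /* hypothetical */, &one, sizeof(one));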


- Delivery to VM
Delivery into a VM is a very interesting feature. I actually see this as
goal #3, even though it is fairly complicated.

I tried to explain my VM design plan on the call; maybe it's
easier over email:

Once we have our own page-allocator-cache in place, we assign a
separate page-allocator-cache to each HW RX ring queue (primarily for
performance reasons in the normal use-case).

For VM delivery, we create a new RX ring queue and use ntuple HW
filters in the NIC to direct packets to the VM-specific RXQ. (I see
this as HW-based early demux.)
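
As a concrete example of the HW early demux, such a filter can be
installed with ethtool's ntuple interface; the device name, match fields
and queue number here are just placeholders:

  ethtool -N eth0 flow-type udp4 dst-ip 10.1.1.2 dst-port 4789 action 7

(action 7 steers matching packets to RX queue 7, which would be the
VM-specific RXQ.)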

The page-allocator-cache assigned to this VM-specific RXQ is configured
to be VM specific. That is, the kernel pages are pre-mapped, memory
shared with the VM process. Thus, the DMA RX engine will deliver
packet-data into pages that are already available in the VM's memory
space, and thus available as zero-copy RX. (The API for delivering and
returning pages also needs some careful consideration, e.g. designing
for bulking from the start. More work is required here.)
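
A minimal sketch of how that bulk-oriented, VM-aware variant of the
page-allocator-cache API could look; again, everything here (the struct
names, vm_page_cache_create, the bulk calls) is purely illustrative:

/* Illustrative-only: a VM-specific page-allocator-cache whose pages come
 * from memory pre-mapped into the VM process, with bulk calls designed
 * in from the start.
 */
struct page_cache *vm_page_cache_create(struct vm_rx_mapping *vm_mem,
					unsigned int pool_size, int numa_node);

/* Refill the RX ring with up to 'n' pages in one call; returns the
 * number of pages actually provided.
 */
int page_cache_alloc_bulk(struct page_cache *pc, struct page **pages,
			  unsigned int n);

/* Return a batch of pages (e.g. once the VM is done with them) in one
 * call, amortizing the per-page cost.
 */
void page_cache_put_bulk(struct page_cache *pc, struct page **pages,
			 unsigned int n);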


Thanks!

Attendees:
Alexei Starovoitov
Alex Reece
Brendan Gregg
Brenden Blanco
Daniel Borkmann
Deepa Kalani
Jesper Brouer
Jianwen Pi
John Fastabend
Mihai Budiu
Pere Monclus
Prem Jonnalagadda
Thomas Graf
Tom Herbert
Yunsong Lu


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
