Re: minutes: IO VIsor TSC and Dev members call
Jesper Dangaard Brouer
On Wed, 2 Mar 2016 22:28:42 -0800
Brenden Blanco <bblanco@...> wrote:
> Thanks all for joining today,

I'm doing my usual benchmark-driven development, which means I'm
currently benchmarking the lowest RX layer of the drivers and simply
dropping packets inside the driver.
Current results from the mlx4 driver (40Gbit/s) indicate that
interacting with the page-allocator is costing us 30% overhead.
The performance goals are:

Driver: mlx4 early drop tests
 - 6 Mpps    => SKB drop (just calling dev_kfree_skb)
   * main overhead is the first cache-miss on the pkt-data header
 - 14.5 Mpps => driver drop before SKB alloc, no pkt-data touched
   * main overhead (30%) is page-allocator related
 - 20 Mpps   => MAX bound, if all 30% page-alloc overhead were removed
   * this is just the upper possible bound; stop tuning when getting close to it
The mlx4 driver already implements its own page-allocator-cache, but
does not do proper recycling. I want us to implement a more generic
page-allocator-cache that drivers can use, and that supports recycling.
> There were some opinions on the initial use cases that some of us would
> [...]
> - DDoS

I see DDoS as project goal #1.
> - Passthrough / Forwarding

I see forwarding as project goal #2.
> - Delivery to socket

I would actually say that we don't want to deliver these kinds of RX
frames into sockets. The primary reason is memory consumption: the
amount of time a packet can stay on a socket is unbounded.
For small packets, we might even consider doing a copy if the
destination is a local socket. This is what drivers already do for small
packets, but I would like this "copy-break" to be pushed "up a level".
In the future we can consider zero-copy RX socket delivery, but the
socket would likely need to opt-in, with a setsockopt, and the
userspace API programming model also needs some changes.
> - Delivery to VM

Delivery into a VM is a very interesting feature. I actually see this as
goal #3, even though it is fairly complicated.
I tried to explain on the call what my VM design plan was, maybe it's
easier over email:
Once we have our own page-allocator-cache in place, we assign a
separate page-allocator-cache to each HW RX ring queue (primarily for
performance reasons in the normal use case).
For VM delivery, we create a new RX ring queue and use ntuple HW
filters in the NIC to direct packets to the VM-specific RXQ. (I see
this as HW-based early demux.)
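For reference, ntuple steering of this kind can already be exercised
from userspace with ethtool; the device name, addresses and queue
number below are just placeholders:

```shell
# Enable ntuple filtering on the NIC
ethtool -K eth0 ntuple on

# Steer UDP flows for the VM's address to a dedicated RX queue (e.g. 7)
ethtool -N eth0 flow-type udp4 dst-ip 10.0.0.2 action 7

# Inspect the installed rules
ethtool -n eth0
```

In the design above, the kernel would install an equivalent rule when
the VM-specific RXQ is created, rather than relying on manual setup.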
The page-allocator-cache assigned to this VM-specific RXQ is configured
to be VM specific. That is, the kernel pages are (pre-mapped) memory
shared with the VM process. Thus, the DMA RX engine delivers
packet-data into pages that are already available in the VM's memory
space, making them usable for zero-copy RX. (The API for delivering and
returning pages also needs careful consideration, e.g. designing in
bulk from the start. More work is required here.)
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org