Re: minutes: IO Visor TSC and Dev members call

John Fastabend

On 16-03-03 12:48 PM, Jesper Dangaard Brouer wrote:
On Thu, 3 Mar 2016 10:12:27 -0800
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Thu, Mar 3, 2016 at 1:57 AM, Jesper Dangaard Brouer via iovisor-dev
<iovisor-dev@...> wrote:
On Wed, 2 Mar 2016 22:28:42 -0800
Brenden Blanco <bblanco@...> wrote:

Thanks all for joining today,

We had a very interesting session focused entirely on XDP (express data
path), a new initiative to improve packet processing performance in the
linux kernel. The details will best be covered by the slides, which I'll be
sure to bug Tom to get a copy of to share, so I won't make the situation
worse by sharing my possibly erroneous notes.

In a nutshell, the goal is to give the low level driver architecture of the
kernel some TLC, improving PPS and BPFifying it.

Some early prototypes are already in the works!
I'm doing my usual benchmark-driven development, which means I'm
currently benchmarking the lowest RX layer of the drivers and just
dropping packets inside the driver.

Current results from driver:mlx4 (40Gbits/s) indicate that interacting
with the page-allocator is costing us 30% overhead.

The performance goals are:
20M pps per-cpu drop rate
14M pps per-cpu forwarding rate
100Gbps per-cpu GRO

Driver: mlx4 early drop tests
- 6 Mpps => SKB drop (just calling dev_kfree_skb)
* (main overhead is first cache-miss on pkt-data hdr)
- 14.5 Mpps => Driver drop before SKB alloc, no-pkt-data touched
* main overhead 30% is page-allocator related
awesome. that's a great baseline.

- 20 Mpps => MAX bound, if removing all 30% page-alloc overhead
* this is just the upper possible bound... stop tuning when getting close to this
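
To make the arithmetic behind these numbers explicit, here is a minimal sketch in C. The function names are hypothetical and only illustrate the calculation; they are not part of any driver. Assuming the 30% page-allocator share could be removed completely, 14.5 Mpps scales to roughly 20.7 Mpps, which is in line with the ~20 Mpps upper bound quoted above.

```c
/* Back-of-the-envelope helpers for the drop-rate numbers quoted above.
 * Function names are hypothetical, for illustration only. */

/* Time budget per packet, in nanoseconds, at a given rate in Mpps.
 * 1e9 ns per second divided by (mpps * 1e6) packets per second. */
double ns_per_packet(double mpps)
{
    return 1000.0 / mpps;
}

/* Rate that would result if a fraction of the per-packet cost
 * (e.g. 0.30 for the page-allocator share) were removed entirely. */
double mpps_without_overhead(double mpps, double overhead_fraction)
{
    return mpps / (1.0 - overhead_fraction);
}
```

For example, ns_per_packet(20.0) is 50 ns, so the 20 Mpps goal leaves about 50 ns of CPU time per packet.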

The mlx4 driver already implements its own page-allocator-cache, but
does not do proper recycling. I want us to implement a more generic
page-allocator-cache that drivers can use, and that supports recycling.
I think the next step here is to make mlx4 recycle
pages and rx descriptors on its own. Later we can generalize it
into something that other drivers can use. Right now
I'd try to get to maximum possible drop rate with
minimal changes.
Yes, but we might as well start by making the allocator hacks in
mlx4 more generic when adding recycling, while still keeping them local
to that file.
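
The recycling idea discussed above can be sketched as a small user-space toy. This is not the mlx4 code or any kernel API: the struct, the function names, the cache depth and the use of malloc() as a stand-in for the page allocator are all assumptions made for illustration.

```c
#include <stdlib.h>

#define PAGE_BYTES   4096   /* assumed page size for the sketch */
#define CACHE_SLOTS  64     /* hypothetical cache depth */

struct page_cache {
    void *slot[CACHE_SLOTS];   /* stack of recycled pages */
    int   count;               /* number of pages currently cached */
    long  allocs;              /* times we fell back to the allocator */
};

/* Take a page: reuse a recycled one if available, else allocate. */
void *page_cache_get(struct page_cache *pc)
{
    if (pc->count > 0)
        return pc->slot[--pc->count];
    pc->allocs++;
    return malloc(PAGE_BYTES);   /* stand-in for the real page allocator */
}

/* Return a page: keep it for reuse if there is room, else free it. */
void page_cache_put(struct page_cache *pc, void *page)
{
    if (pc->count < CACHE_SLOTS)
        pc->slot[pc->count++] = page;
    else
        free(page);
}
```

In a pure drop test every page handed out by page_cache_get() comes straight back through page_cache_put(), so after warm-up the allocs counter stops growing. That is the effect recycling is meant to have on the page-allocator overhead.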

Or, we're talking about benchmarking MLX4_EN_FLAG_RX_FILTER_NEEDED
That's the place where we plan to add XDP hook.
I had some issue with benchmarking just before MLX4_EN_FLAG_RX_FILTER_NEEDED.

I added a drop (via goto next;) just inside the statement:

    if (dev->features & NETIF_F_GRO) {
        goto next;

That allowed me some flexibility to enable/disable it easily.

Jesper, can you share 'perf report' ?
It's easiest to see with a FlameGraph:

The problem with normal perf report output is that there are so many
page functions that each one looks small percentage-wise, but once you
add them up, they account for a lot.
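
The "add them up" step can be sketched as follows. This is only an illustration: the struct and function names are hypothetical, and it assumes the per-symbol percentages have already been extracted from the profiler output by some other means.

```c
#include <stddef.h>
#include <string.h>

struct perf_sample {
    const char *symbol;   /* function name as reported by the profiler */
    double      percent;  /* share of samples attributed to it */
};

/* Sum the share of every symbol whose name contains `needle`,
 * e.g. "page" to aggregate the many small page-related entries. */
double sum_matching(const struct perf_sample *s, size_t n, const char *needle)
{
    double total = 0.0;

    for (size_t i = 0; i < n; i++)
        if (strstr(s[i].symbol, needle) != NULL)
            total += s[i].percent;
    return total;
}
```

Substring matching is a crude filter and will miss page-related helpers whose names do not contain the needle, which is one reason a FlameGraph is the easier way to see the total.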

The "NAPI_force_on" in the name was a hack, where I forced it to never
exit softirq, but the performance didn't improve. It did make the perf
record more focused on what we want to look at.

John, if you can share similar numbers for ixgbe or i40e
that would be great, so we can have some driver competition :)
Also it will help us to see how different drivers can
recycle pages. imo only then we can generalize it into
common page-allocator-cache-with-recycle infra.

There were some opinions on the initial use cases that some of us would
like to apply this to:
- Drop (DDoS mitigation)
I see DDoS as project goal #1

- Passthrough / Forwarding
I see forward as proj goal #2

- Delivery to socket
- Delivery to VM
Delivery into VM is a very interesting feature. I actually see this as
goal #3, even though this is actually fairly complicated.
yeah, let's worry about it later. We need to walk before we can fly.
For practical implementations yes.

But I need/want to work a bit in this area... because I'm attending
MM-summit, and I want to present on this idea there. Getting something
like this integrated into the MM-area is going to take time, and we
need to present our ideas in this area as early as possible to the
MM-people. At least I'm hoping to get some MM-feedback on things I
should not do ;-)
If you are looking at this you might want to check out what we have
today, or perhaps you're already aware of it.

For steering flows to specific queues we have the ethtool interface, and
soon 'tc' will support this as well via u32, and eventually eBPF.

I guess I never added explicit ethtool support for this as customers
have custom software running in the control plane that manages this.
Then VFs map onto userspace dataplane or VM or whatever. I think your
team wrote this,

If you want something that doesn't require SRIOV we have these patches
that were rejected due to security concerns.

Of course my colleagues run DPDK on the top of these queues but there
are two alternatives to doing this that we started working on but never
made it very far. But you can hook this directly into qemu so you have a
direct queue into a VM; I think this is the best bet. Although you
will need to wait until we get hardware support to protect the queue
pair dma ring if you push that all the way up into userspace for best
performance. A middle ground is to sanitize the dma addresses in
the driver by kicking it with a system call or something else. I
measured this @ about 15% overhead in 2015 but I am told system
calls are getting better as time goes on. We also had some crazy
schemes where the in-kernel driver polled on some shared mmap bit,
but this never worked very well and, worse, it burned a core. Neil
Horman had some other ideas around catching TLB misses or something
but I forget exactly what he was up to.


I think in parallel the Mellanox folks need to fix the mlx5 driver
to allocate the skb only after the packet has arrived (similar to mlx4).
Yes, I already told them to do so...
