[PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support


Jesper Dangaard Brouer
 

On Wed, 7 Sep 2016 20:07:01 +0300
Saeed Mahameed <saeedm@...> wrote:

On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
<saeedm@...> wrote:
On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or@...> wrote:
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@...> wrote:

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, and TC drop action on the RX side compared to
XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
1. Baseline: before this patch, with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop

Streams   Baseline (TC drop)   This patch (TC drop)   This patch (XDP fast drop)
---------------------------------------------------------------------------------
1         5.51 Mpps            5.14 Mpps              13.5 Mpps
2         11.5 Mpps            10.0 Mpps              25.1 Mpps
4         16.3 Mpps            17.2 Mpps              35.4 Mpps
8         29.6 Mpps            28.2 Mpps              45.8 Mpps*
16        34.0 Mpps            30.1 Mpps              45.8 Mpps*
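
The "XDP fast drop" case above corresponds to attaching a BPF program
that simply returns XDP_DROP for every packet received on the ring. A
minimal sketch of such a program is shown below; the ELF section name
and the clang build command are illustrative assumptions, not part of
this patch set.

/* xdp_drop.c - drop every packet at the driver RX hook.
 * Build (illustrative): clang -O2 -target bpf -c xdp_drop.c -o xdp_drop.o
 */
#include <linux/bpf.h>

#define SEC(name) __attribute__((section(name), used))

SEC("xdp")
int xdp_drop_prog(struct xdp_md *ctx)
{
	/* Dropping here lets the driver recycle the RX buffer
	 * before any SKB is allocated. */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";
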
Rana, Guys, congrats!!

When you say X streams, is each stream mapped by RSS to a different RX
ring, or are we on the same RX ring for all rows of the above table?
Yes, I will make this clearer in the actual submission.
Here we are talking about different RSS core rings.


In the CX3 work, we had X sender "streams" that all mapped to the same
RX ring; I don't think we went beyond one RX ring.
Here we did. The first row is what you are describing; the other rows
are the same test with an increasing number of RSS receiving cores.
The xmit side is sending as many streams as possible, so that they are
spread as uniformly as possible across the different RSS cores on the
receiver.
Hi Saeed,

Please report CPU utilization also. The expectation is that
performance should scale linearly with an increasing number of CPUs
(i.e. pps/CPU_utilization should be constant).
That was my expectation too.
Be careful with such expectations at these extreme speeds, because we
are starting to hit PCI-express limitations and CPU cache-coherency
limitations (if any atomic/RMW operations still exist per packet).

Consider that in the small 64-byte packet case, the driver's PCI
bandwidth need/overhead is actually quite large, as every descriptor is
also a 64-byte transfer.
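
To put a rough number on that overhead, here is a back-of-envelope
sketch. It assumes one 64-byte completion descriptor per packet and
ignores PCIe TLP headers, doorbells, and the packet data itself.

/* Rough estimate of descriptor-only PCIe traffic at the peak rate from
 * the table above. This is a sketch, not a measurement. */
#include <stdio.h>

int main(void)
{
	const double pps = 45.8e6;      /* packets per second */
	const double desc_bytes = 64.0; /* assumed descriptor size */
	double gbps = pps * desc_bytes * 8.0 / 1e9;

	printf("descriptor traffic alone: ~%.1f Gbit/s\n", gbps); /* ~23.5 */
	return 0;
}

So at 45.8 Mpps, roughly 23 Gbit/s of PCIe bandwidth would go to
descriptors alone, on top of the 64-byte packets themselves.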


Anyway we will share more accurate results when we have them, with CPU
utilization statistics as well.
It is interesting to monitor the CPU utilization, because (if C-states
are enabled) you will likely see the CPU frequency being reduced, or the
CPU even entering idle states, in case your software (XDP) gets faster
than the HW (PCI or NIC). I've seen that happen with mlx4/CX3-pro.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 7:54 PM, Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
<saeedm@...> wrote:
On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or@...> wrote:
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@...> wrote:

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, and TC drop action on the RX side compared to
XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
1. Baseline: before this patch, with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop

Streams   Baseline (TC drop)   This patch (TC drop)   This patch (XDP fast drop)
---------------------------------------------------------------------------------
1         5.51 Mpps            5.14 Mpps              13.5 Mpps
2         11.5 Mpps            10.0 Mpps              25.1 Mpps
4         16.3 Mpps            17.2 Mpps              35.4 Mpps
8         29.6 Mpps            28.2 Mpps              45.8 Mpps*
16        34.0 Mpps            30.1 Mpps              45.8 Mpps*
Rana, Guys, congrats!!

When you say X streams, is each stream mapped by RSS to a different RX
ring, or are we on the same RX ring for all rows of the above table?
Yes, I will make this clearer in the actual submission.
Here we are talking about different RSS core rings.


In the CX3 work, we had X sender "streams" that all mapped to the same
RX ring; I don't think we went beyond one RX ring.
Here we did. The first row is what you are describing; the other rows
are the same test with an increasing number of RSS receiving cores.
The xmit side is sending as many streams as possible, so that they are
spread as uniformly as possible across the different RSS cores on the
receiver.
Hi Saeed,

Please report CPU utilization also. The expectation is that
performance should scale linearly with an increasing number of CPUs
(i.e. pps/CPU_utilization should be constant).
Hi Tom

That was my expectation too.

We didn't do the full analysis yet; it could be that RSS was not
spreading the workload evenly across all the cores.
Those numbers are from my humble machine with quick-and-dirty testing;
the idea of this submission is to let folks look at the code while we
continue testing and analyzing those patches.

Anyway we will share more accurate results when we have them, with CPU
utilization statistics as well.

Thanks,
Saeed.

Tom


Tom Herbert <tom@...>
 

On Wed, Sep 7, 2016 at 7:48 AM, Saeed Mahameed
<saeedm@...> wrote:
On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or@...> wrote:
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@...> wrote:

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, and TC drop action on the RX side compared to
XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
1. Baseline: before this patch, with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop

Streams   Baseline (TC drop)   This patch (TC drop)   This patch (XDP fast drop)
---------------------------------------------------------------------------------
1         5.51 Mpps            5.14 Mpps              13.5 Mpps
2         11.5 Mpps            10.0 Mpps              25.1 Mpps
4         16.3 Mpps            17.2 Mpps              35.4 Mpps
8         29.6 Mpps            28.2 Mpps              45.8 Mpps*
16        34.0 Mpps            30.1 Mpps              45.8 Mpps*
Rana, Guys, congrats!!

When you say X streams, is each stream mapped by RSS to a different RX
ring, or are we on the same RX ring for all rows of the above table?
Yes, I will make this clearer in the actual submission.
Here we are talking about different RSS core rings.


In the CX3 work, we had X sender "streams" that all mapped to the same
RX ring; I don't think we went beyond one RX ring.
Here we did. The first row is what you are describing; the other rows
are the same test with an increasing number of RSS receiving cores.
The xmit side is sending as many streams as possible, so that they are
spread as uniformly as possible across the different RSS cores on the
receiver.
Hi Saeed,

Please report CPU utilization also. The expectation is that
performance should scale linearly with an increasing number of CPUs
(i.e. pps/CPU_utilization should be constant).

Tom
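
Tom's expectation can be phrased as a simple invariant: Mpps divided by
the CPU utilization spent receiving should stay roughly constant as RSS
rings are added. A sketch of that check follows, using the XDP fast
drop column from the table above and placeholder utilization values
(one fully busy core per stream is an assumption; no utilization
numbers were reported in this thread).

/* Linear-scaling check: pps / busy CPUs should stay ~constant if the
 * drop path scales linearly. The busy_cpus values are placeholders;
 * real numbers would come from CPU utilization measurements. */
#include <stdio.h>

struct sample {
	int streams;
	double mpps;       /* XDP fast drop column from the table */
	double busy_cpus;  /* placeholder: one fully busy core per stream */
};

int main(void)
{
	struct sample s[] = {
		{  1, 13.5,  1.0 },
		{  4, 35.4,  4.0 },
		{ 16, 45.8, 16.0 },
	};

	for (unsigned i = 0; i < sizeof(s) / sizeof(s[0]); i++)
		printf("%2d streams: %.2f Mpps per busy CPU\n",
		       s[i].streams, s[i].mpps / s[i].busy_cpus);
	return 0;
}

A falling ratio would indicate the path is not scaling linearly,
whether because of cross-CPU contention or because something else
(PCIe, the NIC) has become the bottleneck; that is exactly what the
requested CPU utilization numbers would reveal.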


Here, I guess you first want to get an initial max for N pktgen TX
threads all sending the same stream, so you land on a single RX ring,
and then move to M * N pktgen TX threads to push that further.

I don't see how the current Linux stack would be able to happily drive 34M PPS
(== allocate SKB, etc, you know...) on a single CPU, Jesper?

Or.


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 4:32 PM, Or Gerlitz <gerlitz.or@...> wrote:
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@...> wrote:

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, and TC drop action on the RX side compared to
XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
1. Baseline: before this patch, with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop

Streams   Baseline (TC drop)   This patch (TC drop)   This patch (XDP fast drop)
---------------------------------------------------------------------------------
1         5.51 Mpps            5.14 Mpps              13.5 Mpps
2         11.5 Mpps            10.0 Mpps              25.1 Mpps
4         16.3 Mpps            17.2 Mpps              35.4 Mpps
8         29.6 Mpps            28.2 Mpps              45.8 Mpps*
16        34.0 Mpps            30.1 Mpps              45.8 Mpps*
Rana, Guys, congrats!!

When you say X streams, is each stream mapped by RSS to a different RX
ring, or are we on the same RX ring for all rows of the above table?
Yes, I will make this clearer in the actual submission.
Here we are talking about different RSS core rings.


In the CX3 work, we had X sender "streams" that all mapped to the same
RX ring; I don't think we went beyond one RX ring.
Here we did. The first row is what you are describing; the other rows
are the same test with an increasing number of RSS receiving cores.
The xmit side is sending as many streams as possible, so that they are
spread as uniformly as possible across the different RSS cores on the
receiver.


Here, I guess you first want to get an initial max for N pktgen TX
threads all sending the same stream, so you land on a single RX ring,
and then move to M * N pktgen TX threads to push that further.

I don't see how the current Linux stack would be able to happily drive 34M PPS
(== allocate SKB, etc, you know...) on a single CPU, Jesper?

Or.
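
Or's point about SKB allocation is why the XDP drop numbers diverge so
sharply from the TC drop numbers: the XDP program runs on the raw RX
buffer before any SKB is built. Below is a rough, illustrative sketch
of how a driver RX path might consult an attached program; the function
name is made up and this is not the actual mlx5e code from this series.

/* Illustrative driver-side hook: run the attached XDP program on the raw
 * RX buffer; on XDP_DROP the caller just recycles the page, so no SKB is
 * ever allocated or freed for the packet. */
#include <linux/filter.h>	/* bpf_prog_run_xdp(), struct xdp_buff */

static inline bool rx_xdp_should_drop(struct bpf_prog *prog,
				      void *data, unsigned int len)
{
	struct xdp_buff xdp;

	if (!prog)
		return false;	/* no program attached: normal SKB path */

	xdp.data = data;
	xdp.data_end = data + len;

	switch (bpf_prog_run_xdp(prog, &xdp)) {
	case XDP_PASS:
		return false;	/* hand the packet to the stack as usual */
	case XDP_DROP:
	default:
		return true;	/* drop early, recycle the RX buffer */
	}
}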