[PATCH RFC 11/11] net/mlx5e: XDP TX xmit more


Eric Dumazet <eric.dumazet@...>
 

On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote:

With small packets and the "default" ring size for this NIC/driver
combination, is the BQL large enough that the ring fills before one hits
the BQL?
It depends on how TX completion (NAPI handler) is implemented in the
driver.

For example, how many packets can be dequeued by each invocation.

Drivers have a lot of variations there.


Rick Jones <rick.jones2@...>
 

On 09/08/2016 11:16 AM, Tom Herbert wrote:
On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <tom@...> wrote:
Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?
Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost. The tailptr/doorbell costs around 70ns. (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps: (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small-packet budget is 67.2ns, so with a
fixed overhead per packet of 70ns we can never reach 10G wirespeed.
But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.
With small packets and the "default" ring size for this NIC/driver combination, is the BQL large enough that the ring fills before one hits the BQL?

rick jones


Tom Herbert <tom@...>
 

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
<brouer@...> wrote:

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.
I've also played with shrinking the NIC descriptor ring size. It works,
but it is an ugly hack to get NIC push-back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size like cache usage, but that is unrelated to this).


BQL is not helping with that?
Exactly. But the BQL _byte_ limit is not what is needed, what we need
to know is the _packets_ currently "in-flight", which Tom already has
a patch for :-) Once we have that, the algorithm is simple.

Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk dequeue building phase,
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence number of packets) that needs to be queued
in the device in order to prevent the link from going idle while there
are outstanding packets to be sent. The algorithm is based on counting
bytes not packets because bytes are roughly an equal cost unit of
work. So if we've queued 100K bytes on the queue, we know how long
that takes (around 80 usecs @10G), but if we count packets then we
really don't know much about that. 100 packets enqueued could
represent 6400 bytes or 6400K worth of data so time to transmit is
anywhere from 5usecs to 5msecs....

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?
Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost. The tailptr/doorbell costs around 70ns. (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps: (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small-packet budget is 67.2ns, so with a
fixed overhead per packet of 70ns we can never reach 10G wirespeed.
But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.

The idea/algo is trying to predict the future. If we see a given/high
packet rate, which equals a high transaction cost, then let's try not
calling the driver, and instead backlog the packet in the qdisc,
speculatively hoping the current rate continues. This will in effect
allow bulking and amortize the 70ns transaction cost over N packets.

Instead of tracking a rate of packets or doorbells per sec, I will let
BQL's packets-in-flight tell me when the driver sees a rate high enough
that the driver (DMA-TX completion) considers several packets to be
in-flight.
When that happens, I will bet that I can stop sending packets to the
device, and instead queue them in the qdisc layer. If I'm unlucky and
the flow stops, then I'm hoping that the last packet stuck in the qdisc
will be picked up by the next napi-schedule, before the device driver runs
"dry".
This is exactly what BQL already does (except the queue limit is on
bytes). Once the byte limit is reached the queue is stopped. At TX
completion time some number of bytes are freed up so that a bulk of
packets can be sent to the queue limit.



Jesper Dangaard Brouer
 

On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
<brouer@...> wrote:

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.
I've also played with shrinking the NIC descriptor ring size. It works,
but it is an ugly hack to get NIC push-back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size like cache usage, but that is unrelated to this).


BQL is not helping with that?
Exactly. But the BQL _byte_ limit is not what is needed, what we need
to know is the _packets_ currently "in-flight", which Tom already has
a patch for :-) Once we have that, the algorithm is simple.

Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk dequeue building phase,
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence number of packets) that needs to be queued
in the device in order to prevent the link from going idle while there
are outstanding packets to be sent. The algorithm is based on counting
bytes not packets because bytes are roughly an equal cost unit of
work. So if we've queued 100K bytes on the queue, we know how long
that takes (around 80 usecs @10G), but if we count packets then we
really don't know much about that. 100 packets enqueued could
represent 6400 bytes or 6400K worth of data so time to transmit is
anywhere from 5usecs to 5msecs....

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?
Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a
transaction cost. The tailptr/doorbell costs around 70ns. (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps: (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small-packet budget is 67.2ns, so with a
fixed overhead per packet of 70ns we can never reach 10G wirespeed.
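
For reference, the arithmetic behind these two figures can be reproduced
with a few lines of Python. This is only a sketch: the helper name
ns_per_packet is made up for illustration, and the 84-byte on-wire size
(64-byte minimum frame plus 20 bytes of preamble and inter-frame gap) is
the usual assumption behind the 67.2ns figure, not something stated in
the patch description.

```python
def ns_per_packet(mpps: float) -> float:
    """Time spent per packet, in nanoseconds, at a rate given in Mpps."""
    return 1000.0 / mpps

# Single-stream rates quoted in the patch description.
without_bulk_ns = ns_per_packet(4.90)  # doorbell rung on every packet
with_bulk_ns = ns_per_packet(7.50)     # doorbell rung once per NAPI poll
doorbell_ns = without_bulk_ns - with_bulk_ns

# Assumed on-wire size of a minimum-size Ethernet frame:
# 64-byte frame + 8-byte preamble + 12-byte inter-frame gap.
wire_bits = (64 + 20) * 8
budget_ns = wire_bits / 10.0           # 10 Gbit/s moves 10 bits per ns

print(f"estimated doorbell cost: {doorbell_ns:.2f} ns")
print(f"10G small-packet budget: {budget_ns:.1f} ns")
print("doorbell alone exceeds the budget:", doorbell_ns > budget_ns)
```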

The idea/algo is trying to predict the future. If we see a given/high
packet rate, which equals a high transaction cost, then let's try not
calling the driver, and instead backlog the packet in the qdisc,
speculatively hoping the current rate continues. This will in effect
allow bulking and amortize the 70ns transaction cost over N packets.
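
To make the amortization concrete, here is a rough sketch of how the
per-packet share of the doorbell cost shrinks with the bulk size N. The
70.7ns constant is just the estimate derived in this thread, and the
function name is hypothetical.

```python
DOORBELL_NS = 70.7  # estimated tailptr/doorbell cost from this thread

def doorbell_overhead_per_packet(bulk: int) -> float:
    """Doorbell cost attributed to each packet when one doorbell
    covers `bulk` packets."""
    if bulk < 1:
        raise ValueError("bulk must be at least 1")
    return DOORBELL_NS / bulk

for n in (1, 2, 4, 8, 16, 64):
    share = doorbell_overhead_per_packet(n)
    print(f"bulk of {n:2d}: {share:6.2f} ns of doorbell cost per packet")
```

Under these assumptions, even a bulk of 8 brings the doorbell share below
10ns per packet.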

Instead of tracking a rate of packets or doorbells per sec, I will let
BQL's packets-in-flight tell me when the driver sees a rate high enough
that the driver (DMA-TX completion) considers several packets to be
in-flight.
When that happens, I will bet that I can stop sending packets to the
device, and instead queue them in the qdisc layer. If I'm unlucky and
the flow stops, then I'm hoping that the last packet stuck in the qdisc
will be picked up by the next napi-schedule, before the device driver runs
"dry".

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Tom Herbert <tom@...>
 

On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer
<brouer@...> wrote:

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.
I've also played with shrinking the NIC descriptor ring size. It works,
but it is an ugly hack to get NIC push-back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size like cache usage, but that is unrelated to this).


BQL is not helping with that?
Exactly. But the BQL _byte_ limit is not what is needed, what we need
to know is the _packets_ currently "in-flight", which Tom already has
a patch for :-) Once we have that, the algorithm is simple.

Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk dequeue building phase,
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
Sorry, I'm still missing it. The point of BQL is that we minimize the
amount of data (and hence number of packets) that needs to be queued
in the device in order to prevent the link from going idle while there
are outstanding packets to be sent. The algorithm is based on counting
bytes not packets because bytes are roughly an equal cost unit of
work. So if we've queued 100K bytes on the queue, we know how long
that takes (around 80 usecs @10G), but if we count packets then we
really don't know much about that. 100 packets enqueued could
represent 6400 bytes or 6400K worth of data so time to transmit is
anywhere from 5usecs to 5msecs....
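
The numbers above are easy to check. The sketch below assumes a 10 Gbit/s
link and ignores per-frame overhead (preamble, inter-frame gap); the
64-byte and 64000-byte packet sizes are my reading of the "6400 bytes or
6400K" range for 100 packets, and the function name is illustrative only.

```python
LINK_BPS = 10e9  # 10 Gbit/s link, per-frame wire overhead ignored

def tx_time_us(nbytes: int) -> float:
    """Microseconds needed to serialize nbytes onto the link."""
    return nbytes * 8 / LINK_BPS * 1e6

queued_bytes = tx_time_us(100_000)      # 100K bytes queued
small_pkts = tx_time_us(100 * 64)       # 100 packets of 64 bytes
large_pkts = tx_time_us(100 * 64_000)   # 100 packets of 64000 bytes

print(f"100K bytes:         {queued_bytes:8.2f} us")
print(f"100 x 64 bytes:     {small_pkts:8.2f} us")
print(f"100 x 64000 bytes:  {large_pkts:8.2f} us")
```

The same packet count spans three orders of magnitude in transmit time,
which is the reason given here for counting bytes.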

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?

Tom



Jesper Dangaard Brouer
 

On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.
I've also played with shrinking the NIC descriptor ring size. It works,
but it is an ugly hack to get NIC push-back, and I foresee it will
hurt normal use-cases. (There are other reasons for shrinking the ring
size like cache usage, but that is unrelated to this).


BQL is not helping with that?
Exactly. But the BQL _byte_ limit is not what is needed, what we need
to know is the _packets_ currently "in-flight", which Tom already has
a patch for :-) Once we have that, the algorithm is simple.

Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough"
packets in-flight, the qdisc starts its bulk dequeue building phase,
before calling the driver. The allowed max qdisc bulk size should
likely be related to pkts-in-flight.
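
A toy model of that dequeue decision might look like the following. It is
a sketch of the idea only, not kernel code: the function name, the
"enough" threshold of 4 and the choice to cap the bulk at pkts-in-flight
are all assumptions made for illustration.

```python
from collections import deque

def dequeue_bulk(qdisc: deque, pkts_in_flight: int, enough: int = 4) -> list:
    """Return the packets to hand to the driver in one call.

    If the device has fewer than `enough` packets in flight, send a
    single packet right away. Otherwise build a bulk whose size is
    capped by pkts_in_flight (an assumed policy).
    """
    if not qdisc:
        return []
    if pkts_in_flight < enough:
        return [qdisc.popleft()]          # device nearly idle: no bulking
    bulk = []
    while qdisc and len(bulk) < pkts_in_flight:
        bulk.append(qdisc.popleft())      # one doorbell will cover these
    return bulk

queue = deque(range(10))
first = dequeue_bulk(queue, pkts_in_flight=0)   # idle device
second = dequeue_bulk(queue, pkts_in_flight=5)  # busy device
print(first, second, len(queue))
```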

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Tom Herbert <tom@...>
 

On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.
BQL is not helping with that?

Tom

.John


John Fastabend
 

On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if
I shrink the nic descriptor ring size to only be slightly larger than
the qdisc layer bulking size I get more bulking and better perf numbers.
At least on microbenchmarks. The reason being the nic pushes back more
on the qdisc. So maybe a case for making the ring size in the NIC some
factor of the expected number of queues feeding the descriptor ring.

.John


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
I do not think qdisc is relevant here.

Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
because only this layer can know if more packets are to come.


What I am saying is that regardless of skb->xmit_more being set or not,
(for example if no qdisc is even used)
a NAPI driver can arm a bit asking for the doorbell to be sent at the end of
NAPI.

I am not saying this must be done, only that the idea could be extended
to the non-XDP world, if we care enough.
Yes, and I am just trying to suggest ideas that do not require
communication between RX (NAPI) and TX.

The problem here is the synchronization (TX doorbell from RX), which is
not as simple as an atomic operation for some drivers.

How about RX bulking? It can also help here, since in the forwarding
case the forwarding path will be able to process a bulk of RX SKBs and
can bulk xmit the portion of SKBs that will be forwarded.

As Jesper suggested, let's talk at NetDev 1.2 in Jesper's session (if
you are joining, of course).

Thanks
Saeed.


Jesper Dangaard Brouer
 

On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
[...]

Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc
layer, by having the drivers provide some feedback to the qdisc layer
indicating xmit_more should be possible. This will be a topic at the
Network Performance Workshop[1] at NetDev 1.2, where I will hopefully
challenge people to come up with a good solution ;-)

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop


Eric Dumazet <eric.dumazet@...>
 

On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:

Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.
I do not think qdisc is relevant here.

Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool),
because only this layer can know if more packets are to come.

What I am saying is that regardless of skb->xmit_more being set or not,
(for example if no qdisc is even used)
a NAPI driver can arm a bit asking for the doorbell to be sent at the end of
NAPI.

I am not saying this must be done, only that the idea could be extended
to the non-XDP world, if we care enough.


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.

Here we introduce a xmit more like mechanism that will queue up more
than one packet into SQ (up to RX napi budget) w/o notifying the hardware.

Once RX napi budget is consumed and we exit napi RX loop, we will
flush (doorbell) all XDP looped packets in case there are such.
Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.
Yes, but it is more complicated than the XDP case, where the RX ring posts
the TX descriptors and, once done, hits the doorbell once for all the TX
descriptors it posted. That is the only possible place to hit a doorbell
for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such
simple connection between them, and hitting a doorbell from the RX ring
napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit more mechanism sufficient?
Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.
Jesper has a similar idea: make the qdisc think it is under
pressure when the device TX ring is idle most of the time. I think
his idea can come in handy here. I am not fully involved in the
details, maybe he can elaborate more.

But if it works, it will be transparent to napi, and xmit more will
happen by design.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost (i.e. if the cost of this cmpxchg() is way
smaller than the doorbell one).


Eric Dumazet <eric.dumazet@...>
 

On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.

Here we introduce a xmit more like mechanism that will queue up more
than one packet into SQ (up to RX napi budget) w/o notifying the hardware.

Once RX napi budget is consumed and we exit napi RX loop, we will
flush (doorbell) all XDP looped packets in case there are such.
Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.
Yes, but it is more complicated than the XDP case, where the RX ring posts
the TX descriptors and, once done, hits the doorbell once for all the TX
descriptors it posted. That is the only possible place to hit a doorbell
for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such
simple connection between them, and hitting a doorbell from the RX ring
napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit more mechanism sufficient?
Only if a qdisc is present and pressure is high enough.

But in a forwarding setup, we likely receive at a lower rate than the
NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really
cared about doorbell cost (i.e. if the cost of this cmpxchg() is way
smaller than the doorbell one).
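
One way to picture such a scheme is sketched below. This is an assumption
about what a cmpxchg-based handshake could look like, not a description
of any driver: the class and method names are invented, and since Python
has no atomic compare-and-exchange the primitive is emulated with a lock.
The point is only that whoever flips the state from PENDING to IDLE is
the single context that rings the doorbell.

```python
import threading

IDLE, PENDING = 0, 1

class AtomicWord:
    """Lock-based stand-in for an atomic word with compare-and-exchange."""
    def __init__(self, value: int = IDLE):
        self._value = value
        self._lock = threading.Lock()

    def cmpxchg(self, old: int, new: int) -> int:
        """Store `new` if the current value equals `old`.
        Returns the value that was seen before the attempt."""
        with self._lock:
            seen = self._value
            if seen == old:
                self._value = new
            return seen

class TxQueue:
    """Hypothetical TX queue where the doorbell is owed but not yet rung."""
    def __init__(self):
        self.state = AtomicWord(IDLE)
        self.posted = 0
        self.doorbells = 0

    def post_descriptor(self) -> None:
        self.posted += 1
        self.state.cmpxchg(IDLE, PENDING)   # mark a doorbell as owed

    def try_ring_doorbell(self) -> bool:
        # Only the caller that wins the PENDING -> IDLE transition rings.
        if self.state.cmpxchg(PENDING, IDLE) == PENDING:
            self.doorbells += 1
            return True
        return False

txq = TxQueue()
for _ in range(3):
    txq.post_descriptor()                   # e.g. from the xmit path
print(txq.try_ring_doorbell())              # e.g. at the end of NAPI
print(txq.try_ring_doorbell())              # a second caller finds nothing owed
print(txq.posted, txq.doorbells)
```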


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.

Here we introduce a xmit more like mechanism that will queue up more
than one packet into SQ (up to RX napi budget) w/o notifying the hardware.

Once RX napi budget is consumed and we exit napi RX loop, we will
flush (doorbell) all XDP looped packets in case there are such.
Why does this idea depend on XDP?

It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.
Yes, but it is more complicated than the XDP case, where the RX ring posts
the TX descriptors and, once done, hits the doorbell once for all the TX
descriptors it posted. That is the only possible place to hit a doorbell
for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such
simple connection between them, and hitting a doorbell from the RX ring
napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case?
Isn't the existing xmit more mechanism sufficient? Maybe we
can have a fence from the napi RX function
that will hold the xmit queue until done and then flush the TX queue
while setting the right xmit more flags, without the need
to explicitly intervene in the TX flow (hitting the doorbell).


Saeed Mahameed
 

On Wed, Sep 7, 2016 at 4:44 PM, John Fastabend via iovisor-dev
<iovisor-dev@...> wrote:
On 16-09-07 05:42 AM, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.

Here we introduce a xmit more like mechanism that will queue up more
than one packet into SQ (up to RX napi budget) w/o notifying the hardware.

Once RX napi budget is consumed and we exit napi RX loop, we will
flush (doorbell) all XDP looped packets in case there are such.

XDP forward packet rate:

Comparing XDP with and w/o xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*
Hi Saeed,

How many cores are you using with these numbers? Just a single
core? Or are streams being RSS'd across cores somehow?
Hi John,

Right, I should have been clearer here: the number of streams refers to
the number of active RSS cores.
We just manipulate the number of rings with ethtool -L to test this.

*It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
we will be working on the analysis and will publish the conclusions
later.
Thanks,
John


John Fastabend
 

On 16-09-07 05:42 AM, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.

Here we introduce a xmit more like mechanism that will queue up more
than one packet into SQ (up to RX napi budget) w/o notifying the hardware.

Once RX napi budget is consumed and we exit napi RX loop, we will
flush (doorbell) all XDP looped packets in case there are such.
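
As a rough illustration of the difference (a toy model, not the mlx5e
code), the sketch below counts doorbells for one NAPI poll with and
without the deferred flush. The function name is hypothetical and the
budget of 64 is an assumed value for the example.

```python
def napi_poll(rx_packets: int, budget: int = 64, xmit_more: bool = True):
    """Model one NAPI RX poll that forwards every packet via XDP TX.

    Returns (packets_forwarded, doorbells_rung).
    """
    forwarded = 0
    doorbells = 0
    for _ in range(min(rx_packets, budget)):
        forwarded += 1            # post one TX descriptor to the XDP SQ
        if not xmit_more:
            doorbells += 1        # old behaviour: doorbell per packet
    if xmit_more and forwarded:
        doorbells += 1            # new behaviour: one flush on poll exit
    return forwarded, doorbells

print(napi_poll(100, xmit_more=False))
print(napi_poll(100, xmit_more=True))
```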

XDP forward packet rate:

Comparing XDP with and w/o xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*
Hi Saeed,

How many cores are you using with these numbers? Just a single
core? Or are streams being RSS'd across cores somehow?

*It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
we will be working on the analysis and will publish the conclusions
later.
Thanks,
John