On 16-09-07 05:42 AM, Saeed Mahameed wrote: Previously we rang XDP SQ doorbell on every forwarded XDP packet.
Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such.
XDP forward packet rate:
Comparing XDP with and w/o xmit more (bulk transmit):
Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall of 27.5Mpps for 8 and 16 streams; we will be working on the analysis and will publish the conclusions later.

Hi Saeed,

How many cores are you using with these numbers? Just a single core? Or are the streams being RSS'd across cores somehow?
Thanks, John
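For readers following along, a minimal sketch of the "post now, ring the doorbell once per NAPI poll" pattern described in the patch might look like the following C; the structure and helper names are hypothetical illustration, not the actual mlx5 code:

#include <linux/types.h>

struct xdp_tx_sq {
        u16  pc;                 /* producer counter of posted descriptors */
        bool doorbell_pending;   /* descriptors posted, doorbell still owed */
};

/* Called for every forwarded packet inside the RX NAPI poll loop. */
static void xdp_tx_queue_frame(struct xdp_tx_sq *sq, void *data, u32 len)
{
        xdp_tx_post_desc(sq, data, len);     /* hypothetical: fill one TX descriptor */
        sq->pc++;
        sq->doorbell_pending = true;         /* defer the MMIO doorbell write */
}

/* Called once when the RX NAPI budget is consumed and the RX loop exits. */
static void xdp_tx_flush(struct xdp_tx_sq *sq)
{
        if (sq->doorbell_pending) {
                xdp_tx_ring_doorbell(sq, sq->pc);   /* hypothetical MMIO write */
                sq->doorbell_pending = false;
        }
}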
On Wed, Sep 7, 2016 at 4:44 PM, John Fastabend via iovisor-dev <iovisor-dev@...> wrote: On 16-09-07 05:42 AM, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.
Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such.
XDP forward packet rate:
Comparing XDP with and w/o xmit more (bulk transmit):
Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*
Hi Saeed,
How many cores are you using with these numbers? Just a single core? Or are streams being RSS'd across cores somehow.
Hi John,

Right, I should have been clearer here: the number of streams refers to the number of active RSS cores. We just manipulate the number of rings with ethtool -L to test this.

*It seems we hit a wall of 27.5Mpps for 8 and 16 streams; we will be working on the analysis and will publish the conclusions later.
Thanks, John
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote: On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.
Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such.

Why does this idea depend on XDP?
It looks like we could apply it to any driver having one IRQ servicing one RX and one TX, without XDP being involved.
Yes, but it is more complicated than the XDP case, where the RX ring posts the TX descriptors and, once done, hits the doorbell once for all the TX descriptors it posted; that is the only possible place to hit a doorbell for the XDP TX ring.

For a regular TX and RX ring sharing the same IRQ, there is no such simple connection between them, and hitting a doorbell from RX ring napi would race with the xmit ndo function of the TX ring.

How do you synchronize in such a case? Isn't the existing xmit more mechanism sufficient? Maybe we could have a fence in the napi RX function that holds the xmit queue until done and then flushes the TX queue with the right xmit more flags set, without explicitly intervening in the TX flow (hitting the doorbell).
Eric Dumazet <eric.dumazet@...>
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote: On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.
Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such. Why is this idea depends on XDP ?
It looks like we could apply it to any driver having one IRQ servicing one RX and one TX, without XDP being involved.
Yes but it is more complicated than XDP case, where the RX ring posts the TX descriptors and once done the RX ring hits the doorbell once for all the TX descriptors it posted, and it is the only possible place to hit a doorbell for XDP TX ring.
For regular TX and RX ring sharing the same IRQ, there is no such simple connection between them, and hitting a doorbell from RX ring napi would race with xmit ndo function of the TX ring.
How do you synchronize in such case ? isn't the existing xmit more mechanism sufficient enough ?

Only if a qdisc is present and the pressure is high enough. But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.

A simple cmpxchg could be used to synchronize the thing, if we really cared about the doorbell cost (i.e. if the cost of this cmpxchg() is way smaller than the doorbell's).
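Purely as an illustration of the cmpxchg idea (speculative; the flag and helpers below are hypothetical, and where exactly the flush would be called from is the open question in this thread):

#include <linux/atomic.h>

/* 0 = nothing owed, 1 = descriptors posted, doorbell still owed. */
static atomic_t tx_doorbell_owed = ATOMIC_INIT(0);

/* ndo_start_xmit path: post the descriptor, but only mark the doorbell
 * as owed instead of ringing it immediately. */
static void tx_post_deferred(void)
{
        /* ... post TX descriptor(s) onto the ring ... */
        atomic_set(&tx_doorbell_owed, 1);
}

/* Called from the RX NAPI poll (end of budget) and/or the xmit path:
 * whoever wins the cmpxchg rings the doorbell exactly once. */
static void tx_flush_doorbell(void)
{
        if (atomic_cmpxchg(&tx_doorbell_owed, 1, 0) == 1)
                tx_ring_doorbell();          /* hypothetical MMIO write */
}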
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote: On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
Previously we rang XDP SQ doorbell on every forwarded XDP packet.
Here we introduce a xmit more like mechanism that will queue up more than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
Once RX napi budget is consumed and we exit napi RX loop, we will flush (doorbell) all XDP looped packets in case there are such. Why is this idea depends on XDP ?
It looks like we could apply it to any driver having one IRQ servicing one RX and one TX, without XDP being involved.
Yes but it is more complicated than XDP case, where the RX ring posts the TX descriptors and once done the RX ring hits the doorbell once for all the TX descriptors it posted, and it is the only possible place to hit a doorbell for XDP TX ring.
For regular TX and RX ring sharing the same IRQ, there is no such simple connection between them, and hitting a doorbell from RX ring napi would race with xmit ndo function of the TX ring.
How do you synchronize in such case ? isn't the existing xmit more mechanism sufficient enough ? Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Jesper has a similar idea: make the qdisc think it is under pressure even when the device TX ring is idle most of the time. I think his idea can come in handy here. I am not fully involved in the details; maybe he can elaborate more. But if it works, it will be transparent to napi, and xmit more will happen by design.

A simple cmpxchg could be used to synchronize the thing, if we really cared about doorbell cost. (Ie if the cost of this cmpxchg() is way smaller than doorbell one)
Eric Dumazet <eric.dumazet@...>
On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote: Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design.

I do not think the qdisc is relevant here. Right now, skb->xmit_more is set only by the qdisc layer (and the pktgen tool), because only this layer can know whether more packets are to come.

What I am saying is that regardless of skb->xmit_more being set or not (for example if no qdisc is even used), a NAPI driver can arm a bit asking for the doorbell to be rung at the end of NAPI.

I am not saying this must be done, only that the idea could be extended to the non-XDP world, if we care enough.
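For context, this is roughly how a driver consumes skb->xmit_more (the field as it existed at the time) to defer the doorbell; the descriptor-posting and doorbell helpers below are hypothetical, generic illustration rather than any particular driver:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq =
                netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

        example_post_tx_desc(dev, skb);      /* hypothetical descriptor fill */

        /* Ring the doorbell only when the stack says no more packets are
         * coming, or the queue has been stopped (about to fill). */
        if (!skb->xmit_more || netif_xmit_stopped(txq))
                example_ring_doorbell(dev);  /* hypothetical MMIO write */

        return NETDEV_TX_OK;
}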
Jesper Dangaard Brouer
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote: On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...] Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.

Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design.
Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, where I will hopefully challenge people to come up with a good solution ;-)

[1] http://netdevconf.org/1.2/session.html?jesper-performance-workshop

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
On Wed, Sep 7, 2016 at 9:19 PM, Eric Dumazet <eric.dumazet@...> wrote: On Wed, 2016-09-07 at 19:57 +0300, Saeed Mahameed wrote:
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. I do not think qdisc is relevant here.
Right now, skb->xmit_more is set only by qdisc layer (and pktgen tool), because only this layer can know if more packets are to come.
What I am saying is that regardless of skb->xmit_more being set or not, (for example if no qdisc is even used) a NAPI driver can arm a bit asking the doorbell being sent at the end of NAPI.
I am not saying this must be done, only that the idea could be extended to non XDP world, if we care enough.
Yes, and I am just trying to suggest ideas that do not require communication between RX (NAPI) and TX. The problem here is the synchronization (TX doorbell from RX), which is not as simple as an atomic operation for some drivers.

How about RX bulking? It could also help here, since for the forwarding case the forwarding path would be able to process a bulk of RX SKBs and bulk-xmit the portion of the SKBs that gets forwarded.

As Jesper suggested, let's talk at NetDev 1.2 in Jesper's session (if you are joining, of course).

Thanks,
Saeed
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote: On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-)
One thing I've noticed, but haven't yet actually analyzed much, is that if I shrink the NIC descriptor ring size to be only slightly larger than the qdisc layer's bulking size, I get more bulking and better perf numbers, at least on microbenchmarks. The reason is that the NIC pushes back more on the qdisc. So maybe there is a case for making the NIC ring size some factor of the expected number of queues feeding the descriptor ring.

.John
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote: On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring.
.John

BQL is not helping with that?

Tom
Jesper Dangaard Brouer
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote: On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design.

Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, where I will hopefully challenge people to come up with a good solution ;-)

One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring.

I've also played with shrinking the NIC descriptor ring size; it works, but it is an ugly hack to get the NIC to push back, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size, like cache usage, but that is unrelated to this.)

BQL is not helping with that?

Exactly. But the BQL _byte_ limit is not what is needed; what we need to know is the _packets_ currently "in-flight", which Tom already has a patch for :-) Once we have that, the algorithm is simple.

Qdisc dequeue looks at BQL pkts-in-flight; if the driver has "enough" packets in flight, the qdisc starts its bulk-dequeue building phase before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
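Purely as illustration of the dequeue-side gating described here, the logic might look something like the following; dql_pkts_in_flight() and both constants are hypothetical, no such API exists today:

#include <linux/netdevice.h>

#define PKTS_IN_FLIGHT_THRESH   8       /* hypothetical tuning knob */
#define MAX_QDISC_BULK          32      /* hypothetical upper bound */

static unsigned int qdisc_bulk_budget(struct netdev_queue *txq)
{
        /* Hypothetical helper: packets the driver has posted but not yet
         * completed (BQL tracks bytes; this would track packets). */
        unsigned int in_flight = dql_pkts_in_flight(&txq->dql);

        /* Plenty in flight: the device will not run dry immediately, so it
         * is safe to spend time building an xmit_more bulk first. */
        if (in_flight >= PKTS_IN_FLIGHT_THRESH)
                return min_t(unsigned int, in_flight, MAX_QDISC_BULK);

        return 1;       /* device nearly idle: hand the packet over now */
}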
On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer <brouer@...> wrote: On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring.
I've also played with shrink the NIC descriptor ring size, it works, but it is an ugly hack to get NIC pushes backs, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size like cache usage, but that is unrelated to this).
BQL is not helping with that? Exactly. But the BQL _byte_ limit is not what is needed, what we need to know is the _packets_ currently "in-flight". Which Tom already have a patch for :-) Once we have that the algorithm is simple.
Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" packets in-flight, the qdisc start it's bulk dequeue building phase, before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight.
Sorry, I'm still missing it. The point of BQL is that we minimize the amount of data (and hence the number of packets) that needs to be queued in the device in order to prevent the link from going idle while there are outstanding packets to be sent. The algorithm is based on counting bytes, not packets, because bytes are a roughly equal-cost unit of work. So if we've queued 100K bytes on the queue, we know roughly how long that takes to transmit, around 80 usecs @10G; but if we count packets, we really don't know much. 100 packets enqueued could represent 6400 bytes or 6400K worth of data, so the time to transmit is anywhere from 5 usecs to 5 msecs.

Shouldn't the qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets?

Tom
Jesper Dangaard Brouer
On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert <tom@...> wrote: On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer <brouer@...> wrote:
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-) One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring. I've also played with shrink the NIC descriptor ring size, it works, but it is an ugly hack to get NIC pushes backs, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size like cache usage, but that is unrelated to this).
BQL is not helping with that? Exactly. But the BQL _byte_ limit is not what is needed, what we need to know is the _packets_ currently "in-flight". Which Tom already have a patch for :-) Once we have that the algorithm is simple.
Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" packets in-flight, the qdisc start it's bulk dequeue building phase, before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight. Sorry, I'm still missing it. The point of BQL is that we minimize the amount of data (and hence number of packets) that needs to be queued in the device in order to prevent the link from going idle while there are outstanding packets to be sent. The algorithm is based on counting bytes not packets because bytes are roughly an equal cost unit of work. So if we've queued 100K of bytes on the queue we know how long that takes around 80 usecs @10G, but if we count packets then we really don't know much about that. 100 packets enqueued could represent 6400 bytes or 6400K worth of data so time to transmit is anywhere from 5usecs to 5msecs....
Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets?

Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account.

The reason for wanting packets-in-flight is that we are attacking a transaction cost. The tailptr/doorbell costs around 70ns (based on data in this patch description: 4.9Mpps -> 7.5Mpps, (1/4.90 - 1/7.5) * 1000 = 70.74ns). The 10G wirespeed small-packet budget is 67.2ns, so with a fixed overhead of 70ns per packet we can never reach 10G wirespeed.

The idea/algo is trying to predict the future. If we see a given/high packet rate, which equals a high transaction cost, then let's try not calling the driver and instead backlog the packet in the qdisc, speculatively hoping the current rate continues. This will in effect allow bulking and amortize the 70ns transaction cost over N packets.

Instead of tracking a rate of packets or doorbells per second, I will let BQL's packets-in-flight tell me when the driver sees a rate high enough that the driver (DMA-TX completion) considers several packets to be in flight. When that happens, I will bet that I can stop sending packets to the device and instead queue them in the qdisc layer. If I'm unlucky and the flow stops, then I'm hoping that the last packet stuck in the qdisc will be picked up by the next napi-schedule, before the device driver runs "dry".

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
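Spelling out the arithmetic behind those numbers (userspace C, reproducing the figures from the patch description quoted above):

#include <stdio.h>

int main(void)
{
        double no_bulk_mpps = 4.90, bulk_mpps = 7.50;

        /* Per-packet time saved by bulking = amortized doorbell cost. */
        double doorbell_ns = (1.0 / no_bulk_mpps - 1.0 / bulk_mpps) * 1000.0;

        /* 10G wirespeed with minimum-size frames is 14.88 Mpps. */
        double budget_ns = 1000.0 / 14.88;

        printf("doorbell cost  ~ %.2f ns/packet\n", doorbell_ns);   /* ~70.75 */
        printf("10G pkt budget ~ %.2f ns\n", budget_ns);            /* ~67.20 */

        /* Amortizing one doorbell over a bulk of N packets. */
        for (int n = 1; n <= 16; n *= 2)
                printf("bulk of %2d -> %5.2f ns doorbell overhead/packet\n",
                       n, doorbell_ns / n);
        return 0;
}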
On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer <brouer@...> wrote: On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 10:11 PM, Jesper Dangaard Brouer <brouer@...> wrote:
On Wed, 7 Sep 2016 20:21:24 -0700 Tom Herbert <tom@...> wrote:
On Wed, Sep 7, 2016 at 7:58 PM, John Fastabend <john.fastabend@...> wrote:
On 16-09-07 11:22 AM, Jesper Dangaard Brouer wrote:
On Wed, 7 Sep 2016 19:57:19 +0300 Saeed Mahameed <saeedm@...> wrote:
On Wed, Sep 7, 2016 at 6:32 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 18:08 +0300, Saeed Mahameed wrote:
On Wed, Sep 7, 2016 at 5:41 PM, Eric Dumazet <eric.dumazet@...> wrote:
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote: [...]
Only if a qdisc is present and pressure is high enough.
But in a forwarding setup, we likely receive at a lower rate than the NIC can transmit.
Yes, I can confirm this happens in my experiments.
Jesper has a similar Idea to make the qdisc think it is under pressure, when the device TX ring is idle most of the time, i think his idea can come in handy here. I am not fully involved in the details, maybe he can elaborate more.
But if it works, it will be transparent to napi, and xmit more will happen by design. Yes. I have some ideas around getting more bulking going from the qdisc layer, by having the drivers provide some feedback to the qdisc layer indicating xmit_more should be possible. This will be a topic at the Network Performance Workshop[1] at NetDev 1.2, I have will hopefully challenge people to come up with a good solution ;-)
One thing I've noticed but haven't yet actually analyzed much is if I shrink the nic descriptor ring size to only be slightly larger than the qdisc layer bulking size I get more bulking and better perf numbers. At least on microbenchmarks. The reason being the nic pushes back more on the qdisc. So maybe a case for making the ring size in the NIC some factor of the expected number of queues feeding the descriptor ring.
I've also played with shrink the NIC descriptor ring size, it works, but it is an ugly hack to get NIC pushes backs, and I foresee it will hurt normal use-cases. (There are other reasons for shrinking the ring size like cache usage, but that is unrelated to this).
BQL is not helping with that? Exactly. But the BQL _byte_ limit is not what is needed, what we need to know is the _packets_ currently "in-flight". Which Tom already have a patch for :-) Once we have that the algorithm is simple.
Qdisc dequeue look at BQL pkts-in-flight, if driver have "enough" packets in-flight, the qdisc start it's bulk dequeue building phase, before calling the driver. The allowed max qdisc bulk size should likely be related to pkts-in-flight.
Sorry, I'm still missing it. The point of BQL is that we minimize the amount of data (and hence number of packets) that needs to be queued in the device in order to prevent the link from going idle while there are outstanding packets to be sent. The algorithm is based on counting bytes not packets because bytes are roughly an equal cost unit of work. So if we've queued 100K of bytes on the queue we know how long that takes around 80 usecs @10G, but if we count packets then we really don't know much about that. 100 packets enqueued could represent 6400 bytes or 6400K worth of data so time to transmit is anywhere from 5usecs to 5msecs....
Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account.
The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell cost around 70ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small packets budget is 67.2ns, this with fixed overhead per packet of 70ns we can never reach 10G wirespeed.
But you should be able to do this with BQL, and it is more accurate. BQL tells how many bytes need to be sent, and that can be used to create a bulk of packets to send with one doorbell.

The idea/algo is trying to predict the future. If we see a given/high packet rate, which equals a high transaction cost, then lets try not calling the driver, and instead backlog the packet in the qdisc, speculatively hoping the current rate continues. This will in effect allow bulking and amortize the 70ns transaction cost over N packets.
Instead of tracking a rate of packets or doorbells per sec, I will let BQLs packet-in-flight tell me when the driver sees a rate high enough that the drivers (DMA-TX completion) consider several packets are in-flight. When that happens, I will bet on, I can stop sending packets to the device, and instead queue them in the qdisc layer. If I'm unlucky and the flow stops, then I'm hoping that the last packet stuck in the qdisc, will be picked by the next napi-schedule, before the device driver runs "dry".
This is exactly what BQL already does (except the queue limit is on bytes). Once the byte limit is reached, the queue is stopped. At TX completion time, some number of bytes are freed up so that a bulk of packets can be sent, up to the queue limit.
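For reference, the BQL hooks being discussed are the netdev_tx_sent_queue()/netdev_tx_completed_queue() pair; a stripped-down sketch of how a driver typically wires them up (ring details and the posting helper are hypothetical):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Transmit path: account the queued bytes.  When the BQL limit is
 * reached, the stack stops the queue -- that is the push-back. */
static netdev_tx_t example_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct netdev_queue *txq =
                netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

        example_post_tx_desc(dev, skb);      /* hypothetical */
        netdev_tx_sent_queue(txq, skb->len);
        return NETDEV_TX_OK;
}

/* TX completion (NAPI): report the finished work.  BQL adapts its byte
 * limit and may restart the queue so the next bulk can be sent. */
static void example_tx_complete(struct netdev_queue *txq,
                                unsigned int pkts, unsigned int bytes)
{
        netdev_tx_completed_queue(txq, pkts, bytes);
}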
Rick Jones <rick.jones2@...>
On 09/08/2016 11:16 AM, Tom Herbert wrote: On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer <brouer@...> wrote:
On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert <tom@...> wrote:
Shouldn't qdisc bulk size be based on the BQL limit? What is the simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account.
The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell cost around 70ns. (Based on data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 = 70.74). The 10G wirespeed small packets budget is 67.2ns, this with fixed overhead per packet of 70ns we can never reach 10G wirespeed.
But you should be able to do this with BQL and it is more accurate. BQL tells how many bytes need to be sent and that can be used to create a bulk of packets to send with one doorbell.

With small packets and the "default" ring size for this NIC/driver combination, is the BQL limit large enough that the ring fills before one hits BQL?

rick jones
Eric Dumazet <eric.dumazet@...>
On Thu, 2016-09-08 at 11:48 -0700, Rick Jones wrote:

With small packets and the "default" ring size for this NIC/driver combination, is the BQL large enough that the ring fills before one hits the BQL?

It depends on how TX completion (the NAPI handler) is implemented in the driver, i.e. how many packets can be dequeued by each invocation. Drivers have a lot of variation there.
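As an illustration of that variation, a TX-completion handler that bounds how many descriptors it reclaims per NAPI invocation might look like this (the ring type, helpers and budget value are all hypothetical):

#include <linux/netdevice.h>

#define EXAMPLE_TX_CLEAN_BUDGET  64     /* hypothetical per-poll limit */

struct example_tx_ring;                 /* hypothetical driver ring */

static unsigned int example_clean_tx(struct example_tx_ring *ring,
                                     struct netdev_queue *txq)
{
        unsigned int pkts = 0, bytes = 0;

        /* Reclaim completed descriptors, but only up to the budget; the size
         * of this budget decides how much BQL credit (and ring space) is
         * returned per invocation. */
        while (pkts < EXAMPLE_TX_CLEAN_BUDGET && example_tx_desc_done(ring)) {
                bytes += example_free_tx_desc(ring);   /* hypothetical, returns skb len */
                pkts++;
        }

        netdev_tx_completed_queue(txq, pkts, bytes);
        return pkts;
}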