README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more


Jesper Dangaard Brouer
 

I'm sorry but I have a problem with this patch!

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!

The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
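Something along these lines is what I have in mind. This is a rough
sketch only; the bundle struct and the flush helper are hypothetical
names, not existing driver API:

struct xdp_tx_bundle {
	struct mlx5e_dma_info *di[64];	/* single XDP_TX destination */
	u16 len[64];
	int count;
};

/* Fill the SQ from the bundle, then ring the doorbell once */
static void mlx5e_xmit_xdp_bundle(struct mlx5e_sq *sq,
				  struct xdp_tx_bundle *b)
{
	int i;

	for (i = 0; i < b->count; i++)
		if (!mlx5e_xmit_xdp_frame(sq, b->di[i], MLX5_RX_HEADROOM,
					  b->len[i]))
			break;	/* SQ full: stop early, drop the rest */
	b->count = 0;
	mlx5e_xmit_xdp_doorbell(sq);
}

The point is that the RX loop only appends to the bundle, and all the
SQ and doorbell code then runs as one compact stage at the end.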

In the future, XDP needs to support XDP_FWD forwarding of
packets/pages out of other interfaces. I also want bulk transmit from
day-1 here. It is slightly more tricky to sort packets for multiple
outgoing interfaces efficiently in the poll loop.

But the mSwitch[1] article actually already solved this destination
sorting. Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps toward a smarter data structure, once we
start to have more TX "ports". And perhaps align your single XDP_TX
destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
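
For the multi-port case, the destination sorting could look something
like the following. Again purely a sketch with hypothetical names,
just to illustrate the data structure direction:

struct xdp_tx_port {
	struct mlx5e_sq *sq;
	struct xdp_tx_bundle bundle;	/* per-port bundle, as above */
};

static void xdp_fwd_flush_ports(struct xdp_tx_port *ports,
				unsigned long *used)
{
	int p;

	/* Flush cost scales with the number of active destinations,
	 * not with the total number of configured ports. */
	for_each_set_bit(p, used, 64)
		mlx5e_xmit_xdp_bundle(ports[p].sq, &ports[p].bundle);
	bitmap_zero(used, 64);
}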

--Jesper
(top post)

On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@...> wrote:

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that will queue up more
than one packet into the SQ (up to the RX napi budget) without
notifying the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and without xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will
work on the analysis and publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@...>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index df2c9e0..6846208 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -265,7 +265,8 @@ struct mlx5e_cq {

struct mlx5e_rq;
typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
- struct mlx5_cqe64 *cqe);
+ struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
u16 ix);

@@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);

void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle);
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 912a0e2..ed93251 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
struct mlx5e_cq *cq,
int update_owner_only,
- int budget_rem)
+ int budget_rem,
+ bool *xdp_doorbell)
{
u32 cqcc = cq->wq.cc + update_owner_only;
u32 cqe_count;
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);

mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
- rq->handle_rx_cqe(rq, &cq->title);
+ rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
}
mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
cq->wq.cc = cqcc;
@@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,

static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
struct mlx5e_cq *cq,
- int budget_rem)
+ int budget_rem,
+ bool *xdp_doorbell)
{
mlx5e_read_title_slot(rq, cq, cq->wq.cc);
mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
- rq->handle_rx_cqe(rq, &cq->title);
+ rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
cq->mini_arr_idx++;

- return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
+ return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
}

void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
@@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
sq->pc += MLX5E_XDP_TX_WQEBBS;

- /* TODO: xmit more */
+ /* mlx5e_xmit_xdp_doorbell will be called after the RX napi loop */
+ return true;
+}
+
+static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
+{
+ struct mlx5_wq_cyc *wq = &sq->wq;
+ struct mlx5e_tx_wqe *wqe;
+ u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
+
+ wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+
wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);

+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
/* fill sq edge with nops to avoid wqe wrap around */
while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
mlx5e_send_nop(sq, false);
}
- return true;
+#endif
}

/* returns true if packet was consumed by xdp */
static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
const struct bpf_prog *prog,
struct mlx5e_dma_info *di,
- void *data, u16 len)
+ void *data, u16 len,
+ bool *xdp_doorbell)
{
bool consumed = false;
struct xdp_buff xdp;
@@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
MLX5_RX_HEADROOM,
len);
+ if (unlikely(!consumed) && (*xdp_doorbell)) {
+ /* SQ is full, ring doorbell */
+ mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+ *xdp_doorbell = false;
+ }
rq->stats.xdp_tx += consumed;
+ *xdp_doorbell |= consumed;
return consumed;
default:
bpf_warn_invalid_xdp_action(act);
@@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
return false;
}

-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell)
{
struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
struct mlx5e_dma_info *di;
@@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
goto wq_ll_pop;
}

- if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+ if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
goto wq_ll_pop; /* page/packet was consumed by XDP */

skb = build_skb(va, RQ_PAGE_SIZE(rq));
@@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
skb->len += headlen;
}

-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell)
{
u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
u16 wqe_id = be16_to_cpu(cqe->wqe_id);
@@ -860,13 +883,15 @@ mpwrq_cqe_out:
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
{
struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+ bool xdp_doorbell = false;
int work_done = 0;

if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
return 0;

if (cq->decmprs_left)
- work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
+ work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
+ &xdp_doorbell);

for (; work_done < budget; work_done++) {
struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
@@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
work_done +=
mlx5e_decompress_cqes_start(rq, cq,
- budget - work_done);
+ budget - work_done,
+ &xdp_doorbell);
continue;
}

mlx5_cqwq_pop(&cq->wq);

- rq->handle_rx_cqe(rq, cqe);
+ rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
}

+ if (xdp_doorbell)
+ mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+
mlx5_cqwq_update_db_record(&cq->wq);

/* ensure cq space is freed before enabling more cqes */
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Alexei Starovoitov
 

On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring the doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out a
good algorithm for xmit_more in the stack.

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.

The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
not sure what you're proposing here?
Sounds like you want to extend it to multi-port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets, that the RX side is preparing and then
sends the whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.

In the future, XDP needs to support XDP_FWD forwarding of
packets/pages out of other interfaces. I also want bulk transmit from
day-1 here. It is slightly more tricky to sort packets for multiple
outgoing interfaces efficiently in the poll loop.
I don't think so. Multi-port is a natural extension to this set of
patches. With multi-port, the end of RX will tell the multiple ports
(that were used to TX) to ring the bell. Pretty trivial and doesn't
involve any extra arrays or linked lists.

But the mSwitch[1] article actually already solved this destination
sorting. Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps toward a smarter data structure, once we
start to have more TX "ports". And perhaps align your single XDP_TX
destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

--Jesper
(top post)
since when is it OK to top-post?

On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@...> wrote:

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that will queue up more
than one packet into the SQ (up to the RX napi budget) without
notifying the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and without xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will
work on the analysis and publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@...>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)
...
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);

mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);

+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
Saeed,
please make sure to remove such debug bits.


Jesper Dangaard Brouer
 

On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring the doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out a
good algorithm for xmit_more in the stack.
What I'm proposing can also be used by the normal stack.

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.
This is doing "hidden" bulking and not really taking advantage of
using the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.


The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
not sure what you're proposing here?
Sounds like you want to extend it to multi-port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets, that the RX side is preparing and then
sends the whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.
I just explained that it will be more efficient due to better use of the icache.


In the future, XDP needs to support XDP_FWD forwarding of
packets/pages out of other interfaces. I also want bulk transmit from
day-1 here. It is slightly more tricky to sort packets for multiple
outgoing interfaces efficiently in the poll loop.
I don't think so. Multi-port is a natural extension to this set of
patches. With multi-port, the end of RX will tell the multiple ports
(that were used to TX) to ring the bell. Pretty trivial and doesn't
involve any extra arrays or linked lists.
So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for
other devices to access. In my design I also want to support devices
that don't have this HW capability, and e.g. only have one TX queue.


But the mSwitch[1] article actually already solved this destination
sorting. Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps toward a smarter data structure, once we
start to have more TX "ports". And perhaps align your single XDP_TX
destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

--Jesper
(top post)
since when is it OK to top-post?
What a kneejerk reaction. When writing something general we often
reply to the top of the email, and then often delete the rest (which
makes it hard for latecomers to follow). I was bcc'ing some people
who needed the context, so it was a service note to you that I
didn't write anything below.


On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@...> wrote:

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that will queue up more
than one packet into the SQ (up to the RX napi budget) without
notifying the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and without xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will
work on the analysis and publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@...>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)
...
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);

mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);

+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
Saeed,
please make sure to remove such debug bits.


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Alexei Starovoitov
 

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.
This is doing "hidden" bulking and not really taking advantage of
using the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
correct. And why is that a problem?
As we discussed numerous times before, XDP is deliberately not trying
to work with 10% of the traffic. If most of the traffic is going into
the stack there is no reason to use XDP. We have tc and netfilter
to deal with it. The cases where most of the traffic needs an skb
should not use XDP. If we try to add such use cases to XDP we will
only hurt XDP performance, increase complexity and gain nothing back.

Let's say a user wants to send 50% into the stack->tcp->socket->user and
another 50% via XDP_TX. The performance is going to be dominated by the stack.
So everything that XDP does to receive and/or transmit is irrelevant.
If we try to optimize XDP for that, we gain nothing in performance.
The user could have used netfilter just as well in such scenario.
The performance would have been the same.

XDP only makes sense when it's servicing most of the traffic,
like L4 load balancer, ILA router or DoS prevention use cases.
Sorry for the broken record. XDP is not a solution for every
networking use case. It only makes sense for packet in and out.
When a packet goes to the host it has to go through an skb, and
optimizing that path is a task that is orthogonal to the XDP patches.

To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.


Saeed Mahameed
 

On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
<iovisor-dev@...> wrote:
On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring the doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out a
good algorithm for xmit_more in the stack.

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
Jesper, what we did here is also bulking; instead of bulking in a
temporary list in the driver, we list the packets in the HW and once
done we transmit all at once via the xdp_doorbell indication.

I agree with you that we can take advantage and improve the icache by
bulking first in software and then queue all at once in the hw then
ring one doorbell.

but I also agree with Alexei that this will introduce an extra
pointer/list handling in the driver and we need to do the comparison
between both approaches before we decide which is better.

this must be marked as future work and should not be in from the start.


mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.

The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
not sure what you're proposing here?
Sounds like you want to extend it to multi-port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets, that the RX side is preparing and then
sends the whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.

In the future, XDP needs to support XDP_FWD forwarding of
packets/pages out of other interfaces. I also want bulk transmit from
day-1 here. It is slightly more tricky to sort packets for multiple
outgoing interfaces efficiently in the poll loop.
I don't think so. Multi-port is a natural extension to this set of
patches. With multi-port, the end of RX will tell the multiple ports
(that were used to TX) to ring the bell. Pretty trivial and doesn't
involve any extra arrays or linked lists.

But the mSwitch[1] article actually already solved this destination
sorting. Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps toward a smarter data structure, once we
start to have more TX "ports". And perhaps align your single XDP_TX
destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

--Jesper
(top post)
since when is it OK to top-post?

On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@...> wrote:

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that will queue up more
than one packet into the SQ (up to the RX napi budget) without
notifying the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and without xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will
work on the analysis and publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@...>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)
...
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);

mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);

+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
Saeed,
please make sure to remove such debug bits.
Sure, will fix this.

Thanks,
Saeed.


Tom Herbert <tom@...>
 

On Thu, Sep 8, 2016 at 10:36 PM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Thu, 8 Sep 2016 20:22:04 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
is it because the variable is called 'xdp_doorbell'?
Frankly I see nothing scary in this patch.
It extends existing code by adding a flag to ring the doorbell or not.
The end of rx napi is used as an obvious heuristic to flush the pipe.
Looks pretty generic to me.
The same code can be used for non-xdp as well once we figure out a
good algorithm for xmit_more in the stack.
What I'm proposing can also be used by the normal stack.

Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.
This is doing "hidden" bulking and not really taking advantage of
using the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.


The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
not sure what you're proposing here?
Sounds like you want to extend it to multi-port in the future?
Sure. The proposed code is easily extendable.

Or do you want to see something like a linked list of packets,
or an array of packets, that the RX side is preparing and then
sends the whole array/list to the TX port?
I don't think that would be efficient, since it would mean
unnecessary copying of pointers.
I just explained that it will be more efficient due to better use of the icache.


In the future, XDP needs to support XDP_FWD forwarding of
packets/pages out of other interfaces. I also want bulk transmit from
day-1 here. It is slightly more tricky to sort packets for multiple
outgoing interfaces efficiently in the poll loop.
I don't think so. Multi-port is a natural extension to this set of
patches. With multi-port, the end of RX will tell the multiple ports
(that were used to TX) to ring the bell. Pretty trivial and doesn't
involve any extra arrays or linked lists.
So, have you solved the problem of exclusive access to a TX ring of a
remote/different net_device when sending?

In your design you assume there exist many TX rings available for
other devices to access. In my design I also want to support devices
that don't have this HW capability, and e.g. only have one TX queue.
Right, but segregating TX queues used by the stack from those used
by XDP is pretty fundamental to the design. If we start mixing them,
then we need to pull in several features (such as BQL, which seems
like what you're proposing) into the XDP path. If this starts to slow
things down, or we need to reinvent a bunch of existing features to
not use skbuffs, that seems to run contrary to the "as simple as
possible" model for XDP -- may as well use the regular stack at that
point, maybe...

Tom


But the mSwitch[1] article actually already solved this destination
sorting. Please read section 3.3 "Switch Fabric Algorithm" of [1] to
understand the next steps toward a smarter data structure, once we
start to have more TX "ports". And perhaps align your single XDP_TX
destination data structure to this future development.

[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
I don't see how this particular paper applies to the existing kernel code.
It's great to take ideas from research papers, but real code is different.

--Jesper
(top post)
since when is it OK to top-post?
What a kneejerk reaction. When writing something general we often
reply to the top of the email, and then often delete the rest (which
makes it hard for latecomers to follow). I was bcc'ing some people
who needed the context, so it was a service note to you that I
didn't write anything below.


On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm@...> wrote:

Previously we rang the XDP SQ doorbell on every forwarded XDP packet.

Here we introduce an xmit_more-like mechanism that will queue up more
than one packet into the SQ (up to the RX napi budget) without
notifying the hardware.

Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, in case there are any.

XDP forward packet rate:

Comparing XDP with and without xmit more (bulk transmit):

Streams    XDP TX      XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps    7.50Mpps
2          9.50Mpps    14.8Mpps
4          16.5Mpps    25.1Mpps
8          21.5Mpps    27.5Mpps*
16         24.1Mpps    27.5Mpps*

*It seems we hit a wall at 27.5Mpps for 8 and 16 streams; we will
work on the analysis and publish the conclusions later.

Signed-off-by: Saeed Mahameed <saeedm@...>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)
...
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);

mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);

+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
Saeed,
please make sure to remove such debug bits.


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Jesper Dangaard Brouer
 

On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.
This is doing "hidden" bulking and not really taking advantage of
using the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
correct. And why is that a problem?
It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance. What I hear is that you are arguing
against a model that will yield better performance; that does not
make sense to me. Let me explain this again, in another way.

This is a mental model switch. Stop seeing the lowest driver RX as
something that works on a per-packet basis. Maybe it is easier to
understand if we instead see this as vector processing? This is about
having a vector of packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle (directly measurable with perf, while I-cache
is not directly measurable).


Let's assume everything fits into the I-cache (XDP+driver code). The
CPU frontend still has to decode the instructions from the I-cache
into micro-ops. The next level of optimization is to reuse the
decoded I-cache by running it on all elements in the packet-vector.

The Intel "64 and IA-32 Architectures Optimization Reference Manual"
(section 3.4.2.6 "Optimization for Decoded ICache"[1][2]) states that
each hot code block should be less than about 500 instructions. Thus,
the different "stages" working on the packet-vector need to be rather
small and compact.

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
[2] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
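
To make the idea concrete, a stage-split RX loop could look like the
following. This is pure illustration; the three stage functions are
made-up placeholders for the existing per-packet work:

static void rx_process_vector(void **pkts, u16 *lens, int n)
{
	int i;

	for (i = 0; i < n; i++)		/* stage 1: dma sync + prefetch */
		rx_stage_prefetch(pkts[i], lens[i]);

	for (i = 0; i < n; i++)		/* stage 2: run the XDP program */
		rx_stage_xdp(pkts[i], lens[i]);

	for (i = 0; i < n; i++)		/* stage 3: build SKB or XDP_TX */
		rx_stage_deliver(pkts[i], lens[i]);
}

Each stage is decoded once and its micro-ops get reused across the
whole vector, instead of decoding all three stages per packet.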



Notice: The same mental model switch applies to delivering packets to
the regular netstack. I've brought this up before[3]. Instead of
flushing the driver's I-cache for every packet, by calling the stack,
let us instead bundle up N packets in the driver before calling the
stack. I showed a 10% speedup with a naive implementation of this
approach. Edward Cree also showed[4] a 10% performance boost, and
went further into the stack, showing a 25% increase.

A goal is also to make optimizing the netstack code-size independent
of the driver code-size, by separating the netstack's I-cache usage
from the driver's.

[3] http://lists.openwall.net/netdev/2016/01/15/51
[4] http://lists.openwall.net/netdev/2016/04/19/89
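
The naive version measured in [3] boils down to something like this
sketch; the SKB-building helper is hypothetical, the rest is standard
NAPI API:

static int rx_poll_bundled(struct napi_struct *napi, int budget)
{
	struct sk_buff *bundle[16];
	int n = 0, i;

	/* Driver stage: build SKBs while the driver code is hot */
	while (n < budget && n < ARRAY_SIZE(bundle)) {
		struct sk_buff *skb = rx_build_next_skb();	/* hypothetical */

		if (!skb)
			break;
		bundle[n++] = skb;
	}

	/* Stack stage: hand the whole bundle to the netstack
	 * back-to-back, so its I-cache working set loads once */
	for (i = 0; i < n; i++)
		napi_gro_receive(napi, bundle[i]);

	return n;
}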
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Jesper Dangaard Brouer
 

On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed <saeedm@...> wrote:

On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
<iovisor-dev@...> wrote:
On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
Jesper, what we did here is also bulking; instead of bulking in a
temporary list in the driver, we list the packets in the HW and once
done we transmit all at once via the xdp_doorbell indication.

I agree with you that we can take advantage and improve the icache by
bulking first in software and then queue all at once in the hw then
ring one doorbell.

but I also agree with Alexei that this will introduce an extra
pointer/list handling in the driver and we need to do the comparison
between both approaches before we decide which is better.
I welcome implementing both approaches and benchmarking them against
each other; I'll gladly dedicate time for this!

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is between 10% and 25% perf gain here.

The key point is to stop seeing the lowest driver RX as something
that works on a per-packet basis. It might be easier to view this as
a kind of vector processing. This is about having a vector of
packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle. The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector processing stages small enough
to fit into the CPU's decoded-I-cache section.


It might also be important to mention that, for netstack delivery, I
don't imagine bulking 64 packets. Instead, I imagine doing 8-16
packets. Why? Because the NIC-HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack "slowly" processes packets. You can view this as "bulking"
from the RX ring queue, with a "look-back" before exiting the NAPI
poll loop.


this must be marked as future work and should not be in from the start.
We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO. And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.

[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Jesper Dangaard Brouer
 

On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
[...]
To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.
The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets. I
implement this by rewriting the destination-MAC to be my monitor
machine and then XDP_TX the packet.

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully) catch only the malicious traffic (the monitor
machine can help diagnose false positives). Now, due to interleaving
the DDoS traffic with the clean traffic, the efficiency of XDP_TX is
reduced due to more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack. This is due to hitting a bottleneck in the page
allocator. I'm working on removing that bottleneck with page_pool,
and that solution is orthogonal to this problem.
It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Alexei Starovoitov
 

On Mon, Sep 12, 2016 at 10:56:55AM +0200, Jesper Dangaard Brouer wrote:
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
Let's do bundling/bulking from the start!
mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
See nothing wrong about it. RX side processes the packets and
when it's done it tells TX to xmit whatever it collected.
This is doing "hidden" bulking and not really taking advantage of
using the icache more efficiently.

Let me explain the problem I see a little more clearly, so you
hopefully see where I'm going.

Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
correct. And why is that a problem?
It is good that you can see and acknowledge the I-cache problem.

XDP is all about performance. What I hear is that you are arguing
against a model that will yield better performance; that does not
make sense to me. Let me explain this again, in another way.
I'm arguing against your proposal that I think will be more complex and
lower performance than what Saeed and the team already implemented.
Therefore I don't think it's fair to block the patch and ask them to
reimplement it just to test an idea that may or may not improve performance.

Getting maximum performance is tricky. Good is better than perfect.
It's important to argue about user space visible bits upfront, but
on the kernel performance side we should build/test incrementally.
This particular patch 11/11 is simple, easy to review and provides
good performance. What's not to like?


Alexei Starovoitov
 

On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
[...]
To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.
The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets. I
implement this by rewriting the destination-MAC to be my monitor
machine and then XDP_TX the packet.
not following the use case. you want to implement a DDoS generator?
Or just forward all bad packets from the affected host to another
host in the same rack? so two servers will be spammed with traffic
and even more load on the ToR? I really don't see how this is useful
for anything but stress testing.

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)
hmm. DoS prevention use case is when 99% of the traffic is dropped.

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully) catch only the malicious traffic (the monitor
machine can help diagnose false positives). Now, due to interleaving
the DDoS traffic with the clean traffic, the efficiency of XDP_TX is
reduced due to more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack. This is due to hitting a bottleneck in the page
allocator. I'm working on removing that bottleneck with page_pool,
and that solution is orthogonal to this problem.
sure. no one arguing against improving page allocator.

It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.
no. it's the opposite. If the host is under DoS there is no way
the host can tell in advance which rx queue will be seeing bad packets.


Jesper Dangaard Brouer
 

On Mon, 12 Sep 2016 12:56:28 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Mon, Sep 12, 2016 at 01:30:25PM +0200, Jesper Dangaard Brouer wrote:
On Thu, 8 Sep 2016 23:30:50 -0700
Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Fri, Sep 09, 2016 at 07:36:52AM +0200, Jesper Dangaard Brouer wrote:
[...]
Imagine you have packets intermixed towards the stack and XDP_TX.
Every time you call the stack code, you flush your icache. When
returning to the driver code, you will have to reload all the icache
associated with the XDP_TX; this is a costly operation.
[...]
To make further progress in this discussion can we talk about
the use case you have in mind instead? Then the solution will
be much clearer, I hope.
The DDoS use-case _is_ affected by this "hidden" bulking design.

Let's say I want to implement a DDoS facility. Instead of just
dropping the malicious packets, I want to see the bad packets. I
implement this by rewriting the destination-MAC to be my monitor
machine and then XDP_TX the packet.
not following the use case. you want to implement a DDoS generator?
Or just forward all bad packets from the affected host to another
host in the same rack? so two servers will be spammed with traffic
and even more load on the ToR? I really don't see how this is useful
for anything but stress testing.
As I wrote below, the purpose of the monitor machine is to diagnose
false positives. If you worry about the added load, I would either
forward out another interface (which is not supported yet) or simply
do sampling of packets being forwarded to the monitor host.

In the DDoS use-case, you have loaded your XDP/eBPF program, and 100%
of the traffic is delivered to the stack. (See note 1)
hmm. DoS prevention use case is when 99% of the traffic is dropped.
As I wrote below, until the DDoS attack starts, all packets are
delivered to the stack.

Once the DDoS attack starts, the traffic pattern changes, and XDP
should (hopefully) catch only the malicious traffic (the monitor
machine can help diagnose false positives). Now, due to interleaving
the DDoS traffic with the clean traffic, the efficiency of XDP_TX is
reduced due to more icache misses...



Note(1): Notice I have already demonstrated that loading an XDP/eBPF
program with 100% delivery to the stack actually slows down the
normal stack. This is due to hitting a bottleneck in the page
allocator. I'm working on removing that bottleneck with page_pool,
and that solution is orthogonal to this problem.
sure. no one arguing against improving page allocator.

It is actually an excellent argument for why you would want to run a
DDoS XDP filter only on a restricted number of RX queues.
no. it's the opposite. If the host is under DoS there is no way
the host can tell in advance which rx queue will be seeing bad
packets.
Sorry, this note was not related to the DoS use-case. You
misunderstood it.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Tom Herbert <tom@...>
 

On Mon, Sep 12, 2016 at 3:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Fri, 9 Sep 2016 18:03:09 +0300
Saeed Mahameed <saeedm@...> wrote:

On Fri, Sep 9, 2016 at 6:22 AM, Alexei Starovoitov via iovisor-dev
<iovisor-dev@...> wrote:
On Thu, Sep 08, 2016 at 10:11:47AM +0200, Jesper Dangaard Brouer wrote:

I'm sorry but I have a problem with this patch!
Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.


What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.

Why not implement a TX bulking interface directly instead?!?

Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (the benefit is smaller, but every cycle counts in this area)

This whole XDP exercise is about avoiding a transaction cost per
packet; that means "bulking" or "bundling" of packets, where possible.

Let's do bundling/bulking from the start!
Jesper, what we did here is also bulking; instead of bulking in a
temporary list in the driver, we list the packets in the HW and once
done we transmit all at once via the xdp_doorbell indication.

I agree with you that we can take advantage and improve the icache by
bulking first in software and then queue all at once in the hw then
ring one doorbell.

but I also agree with Alexei that this will introduce an extra
pointer/list handling in the driver and we need to do the comparison
between both approaches before we decide which is better.
I welcome implementing both approaches and benchmarking them against
each other; I'll gladly dedicate time for this!
Yes, please implement this so we can have something clear to evaluate
and compare. There is far too much spewing of "expert opinions"
happening here :-(

I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is between 10% and 25% perf gain here.

The key point is to stop seeing the lowest driver RX as something
that works on a per-packet basis. It might be easier to view this as
a kind of vector processing. This is about having a vector of
packets, where we apply some action/operation.

This is about using the CPU more efficiently, getting it to do more
instructions per cycle. The next level of optimization (for >= Sandy
Bridge CPUs) is to make these vector processing stages small enough
to fit into the CPU's decoded-I-cache section.


It might also be important to mention that, for netstack delivery, I
don't imagine bulking 64 packets. Instead, I imagine doing 8-16
packets. Why? Because the NIC-HW runs independently and has the
opportunity to deliver more frames into the RX ring queue while the
stack "slowly" processes packets. You can view this as "bulking"
from the RX ring queue, with a "look-back" before exiting the NAPI
poll loop.


this must be marked as future work and should not be in from the start.
We both know that statement is BS, and the other approach will never be
implemented once this patch is accepted upstream.


mlx4 already does bulking and this proposed mlx5 set of patches
does bulking as well.
I'm reacting exactly because mlx4 is also doing "bulking" in the wrong
way IMHO. And now mlx5 is building on the same principle. That is why
I'm yelling STOP.


The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost
that would hit the common single-packet use-case.

It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.

[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Edward Cree <ecree@...>
 

On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is between 10% and 25% perf gain here.

[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51
BTW, I'd also still rather like to see that happen; I never really
understood the objections people had to those patches when I posted them. I
still believe that dealing in skb-lists instead of skbs, and thus
'automatically' bulking similar packets, is better than trying to categorise
packets into flows early on based on some set of keys. The problem with the
latter approach is that there are now two definitions of "similar":
1) the set of fields used to index the flow
2) what will actually cause the stack's behaviour to differ if not using the
cached values.
Quite apart from the possibility of bugs if one changes but not the other,
this forces (1) to be conservative, only considering things "similar" if the
entire stack will. Whereas with bundling, the stack can keep packets
together until they reach a layer at which they are no longer "similar"
enough. Thus, for instance, packets with the same IP 3-tuple but different
port numbers can be grouped together for IP layer processing, then split
apart for L4.
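
As a rough sketch of the shape I mean, with a plain sk_buff_head as
the bundle and the per-layer helpers being hypothetical:

static void ip_list_rcv_sketch(struct sk_buff_head *bundle)
{
	struct sk_buff *skb;

	/* IP-layer work runs over the whole bundle, which is still
	 * "similar" at this layer (same 3-tuple). */
	skb_queue_walk(bundle, skb)
		ip_process_one(skb);		/* hypothetical */

	/* Split only where behaviour diverges: from here each
	 * packet goes to its own L4 flow. */
	while ((skb = __skb_dequeue(bundle)) != NULL)
		l4_deliver(skb);		/* hypothetical */
}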

-Ed


Eric Dumazet <eric.dumazet@...>
 

On Tue, 2016-09-13 at 16:20 +0100, Edward Cree wrote:
On 12/09/16 11:15, Jesper Dangaard Brouer wrote:
I'm reacting so loudly because this is a mental model switch that
needs to be applied to the full driver RX path, also for normal stack
delivery of SKBs. As both Edward Cree[1] and I[2] have demonstrated,
there is between 10% and 25% perf gain here.

[1] http://lists.openwall.net/netdev/2016/04/19/89
[2] http://lists.openwall.net/netdev/2016/01/15/51
BTW, I'd also still rather like to see that happen; I never really
understood the objections people had to those patches when I posted them. I
still believe that dealing in skb-lists instead of skbs, and thus
'automatically' bulking similar packets, is better than trying to categorise
packets into flows early on based on some set of keys. The problem with the
latter approach is that there are now two definitions of "similar":
1) the set of fields used to index the flow
2) what will actually cause the stack's behaviour to differ if not using the
cached values.
Quite apart from the possibility of bugs if one changes but not the other,
this forces (1) to be conservative, only considering things "similar" if the
entire stack will. Whereas with bundling, the stack can keep packets
together until they reach a layer at which they are no longer "similar"
enough. Thus, for instance, packets with the same IP 3-tuple but different
port numbers can be grouped together for IP layer processing, then split
apart for L4.
To be fair you never showed us the numbers for DDoS traffic, and you did
not show us how typical TCP + netfilter modules kind of traffic would be
handled.

Show us real numbers, not synthetic ones, say when receiving traffic on
100,000 or more TCP sockets.

We also care about icache pressure, and GRO/TSO already provides
bundling where it is applicable, without adding insane complexity in the
stacks.

Just look at how complex the software fallbacks for GSO/checksumming
are, how many bugs we had to fix... And this is only at the edge of our
stack.


Jesper Dangaard Brouer
 

On Tue, 13 Sep 2016 08:58:30 -0700
Eric Dumazet <eric.dumazet@...> wrote:

We also care about icache pressure, and GRO/TSO already provides
bundling where it is applicable, without adding insane complexity in
the stacks.
Sorry, I cannot resist. The GRO code is really bad regarding icache
pressure/usage, due to how everything is function pointers calling
function pointers, even when the general case is calling the function
defined just next to it in the same C-file (which would usually allow
inlining). I can easily get 10% more performance for UDP use-cases by
simply disabling the GRO code, and I measure a significant drop in
icache misses.

Edward's solution should lower icache pressure.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer