A couple of eBPF and clsact questions


Dan Siemon
 

Hello,

I listened in on the IOVisor call today but wasn't sure if user
questions and use cases were appropriate there, so I started writing
this...

There are two things I was going to discuss:

1)

We use clsact and eBPF extensively (https://www.preseem.com).

I'm hoping someone can point me in the right direction... we track
metrics per IP address (latency, loss, and others) via an eBPF program
and also manage traffic via HTB and FQ-CoDel. A given end customer can
have several IP addresses assigned, which all need to go through the
same HTB class (and attached FQ-CoDel instance) to enforce the plan
rate.

At present, we extract the metrics in clsact, which is pre-qdisc. The
problem with this is that the byte and packet counts are too high
because packets can be dropped in the qdiscs. It also messes a bit with
the loss and latency calculations.

Is there a way to hook post-qdisc? I looked a bit at XDP, but it seems
that is Rx-only for now?
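
For concreteness, here is a minimal sketch of the kind of per-IP
accounting program described above, written as a cls_bpf program for
clsact egress in direct-action mode. The map name, sizes and the
IPv4-only parsing are illustrative assumptions, not Preseem's actual
code:

/* Illustrative sketch only: count bytes/packets per destination IPv4
 * address from a cls_bpf program attached at clsact egress. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct ip_stats {
	__u64 packets;
	__u64 bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);             /* IPv4 destination address */
	__type(value, struct ip_stats);
} ip_counters SEC(".maps");

SEC("classifier")
int count_per_ip(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	struct ip_stats *st;
	__u32 key;

	if ((void *)(eth + 1) > data_end ||
	    eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_OK;

	key = iph->daddr;
	st = bpf_map_lookup_elem(&ip_counters, &key);
	if (st) {
		/* In-place update through the pointer returned by the
		 * lookup -- the pattern question 2) below asks about. */
		st->packets += 1;
		st->bytes += skb->len;
	} else {
		struct ip_stats init = { .packets = 1, .bytes = skb->len };

		bpf_map_update_elem(&ip_counters, &key, &init, BPF_ANY);
	}
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

Something along the lines of "tc qdisc add dev eth0 clsact; tc filter
add dev eth0 egress bpf da obj prog.o sec classifier" would attach it,
and this is the pre-qdisc position the question is about.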

2)

The Cilium BPF docs are great, but one thing I don't have a good handle
on is concurrent access to map data. Taking the hash map as an example,
does the map update function need to be called from the eBPF program
for each update to make it safe against concurrent access via the
userspace bpf() syscall? At present we do some map data updates just
using the pointer returned from the lookup function.


Daniel Borkmann
 

On 08/08/2018 09:18 PM, Dan Siemon wrote:
> Hello,
>
> I listened in on the IOVisor call today but wasn't sure if user
> questions and use cases were appropriate there, so I started writing
> this...
>
> There are two things I was going to discuss:
>
> 1)
>
> We use clsact and eBPF extensively (https://www.preseem.com).
>
> I'm hoping someone can point me in the right direction... we track
> metrics per IP address (latency, loss, and others) via an eBPF program
> and also manage traffic via HTB and FQ-CoDel. A given end customer can
> have several IP addresses assigned, which all need to go through the
> same HTB class (and attached FQ-CoDel instance) to enforce the plan
> rate.
>
> At present, we extract the metrics in clsact, which is pre-qdisc. The
> problem with this is that the byte and packet counts are too high
> because packets can be dropped in the qdiscs. It also messes a bit with
> the loss and latency calculations.
>
> Is there a way to hook post-qdisc? I looked a bit at XDP, but it seems
> that is Rx-only for now?

There's a tracepoint right before netdev_start_xmit() called
trace_net_dev_start_xmit(). So you could combine sch_clsact egress with
cls_bpf and a BPF prog on the tracepoint right before the skb is handed
to the driver. They could share a map, for example for the
tuple-to-counters mapping, so you would still be able to do the major
work in cls_bpf outside the qdisc lock.
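
As a rough sketch of what the tracepoint side could look like (untested,
with invented names, written in modern libbpf/CO-RE style; on older
kernels you would use bpf_probe_read() with explicit offsets instead):
a raw tracepoint program on net_dev_start_xmit that bumps post-qdisc
counters in a pinned map the cls_bpf program can share, keyed by a flow
id the egress program is assumed to have stashed in skb->mark (the
stashing options are discussed further down the thread):

/* Illustrative sketch only. For the net_dev_start_xmit raw tracepoint,
 * args[0] is the skb about to be handed to the driver, i.e. post-qdisc. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct flow_tx {
	__u64 packets;
	__u64 bytes;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);                  /* flow id from skb->mark */
	__type(value, struct flow_tx);
	__uint(pinning, LIBBPF_PIN_BY_NAME); /* share via /sys/fs/bpf */
} flow_tx_counters SEC(".maps");

SEC("raw_tracepoint/net_dev_start_xmit")
int post_qdisc_count(struct bpf_raw_tracepoint_args *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->args[0];
	__u32 key = BPF_CORE_READ(skb, mark);
	__u32 len = BPF_CORE_READ(skb, len);
	struct flow_tx *tx;

	if (!key)       /* no flow id stashed by the egress program */
		return 0;

	tx = bpf_map_lookup_elem(&flow_tx_counters, &key);
	if (tx) {
		__sync_fetch_and_add(&tx->packets, 1);
		__sync_fetch_and_add(&tx->bytes, len);
	} else {
		struct flow_tx init = { .packets = 1, .bytes = len };

		bpf_map_update_elem(&flow_tx_counters, &key, &init, BPF_ANY);
	}
	return 0;
}

char _license[] SEC("license") = "GPL";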

> 2)
>
> The Cilium BPF docs are great, but one thing I don't have a good handle
> on is concurrent access to map data. Taking the hash map as an example,
> does the map update function need to be called from the eBPF program
> for each update to make it safe against concurrent access via the
> userspace bpf() syscall? At present we do some map data updates just
> using the pointer returned from the lookup function.

The latter works for e.g. counters using BPF_XADD instructions, or by
switching to a per-CPU map for the counters to avoid the atomic op
altogether. For a regular (non-per-CPU) map with non-XADD updates you
might need to use the map update function instead to avoid races.
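
To illustrate the two safe patterns side by side (map and function
names are invented; this is a sketch, not a recommendation for any
particular workload):

/* Illustrative sketch of the two counter-update styles discussed above. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct counters {
	__u64 packets;
	__u64 bytes;
};

/* Option 1: regular hash map, updated with atomic adds
 * (__sync_fetch_and_add compiles down to BPF_XADD). */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u32);
	__type(value, struct counters);
} shared_counters SEC(".maps");

/* Option 2: per-CPU hash map -- every CPU gets its own copy of the
 * value, so plain stores are race-free; userspace reads all per-CPU
 * copies of an element and sums them. */
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_HASH);
	__uint(max_entries, 16384);
	__type(key, __u32);
	__type(value, struct counters);
} percpu_counters SEC(".maps");

static __always_inline void bump(__u32 key, __u64 len)
{
	struct counters *c;

	c = bpf_map_lookup_elem(&shared_counters, &key);
	if (c) {
		__sync_fetch_and_add(&c->packets, 1); /* atomic (BPF_XADD) */
		__sync_fetch_and_add(&c->bytes, len);
	}

	c = bpf_map_lookup_elem(&percpu_counters, &key);
	if (c) {
		c->packets += 1;                      /* this CPU's copy only */
		c->bytes += len;
	}
}

SEC("classifier")
int demo(struct __sk_buff *skb)
{
	bump(skb->ifindex, skb->len);   /* arbitrary demo key */
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

On the userspace side, a lookup on the per-CPU map returns one struct
counters per possible CPU, which the reader then sums.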

Thanks,
Daniel


Dan Siemon
 

On Wed, 2018-08-08 at 21:43 +0200, Daniel Borkmann wrote:
>> Is there a way to hook post-qdisc? I looked a bit at XDP, but it
>> seems that is Rx-only for now?
>
> There's a tracepoint right before netdev_start_xmit() called
> trace_net_dev_start_xmit(). So you could combine sch_clsact egress
> with cls_bpf and a BPF prog on the tracepoint right before the skb is
> handed to the driver. They could share a map, for example for the
> tuple-to-counters mapping, so you would still be able to do the major
> work in cls_bpf outside the qdisc lock.

Thanks. I don't know much about tracepoints but will look into this. I
gather these are capable of the same rates as the tc hooks?

Where would the context extracted from the packet in the BPF prog (e.g.
the 5-tuple) be stashed so that the tracepoint program can get at it
without parsing the headers again?

Ideally this context is extracted once at the ingress port and flows
with the SKB through to the egress port so we don't need to parse the
headers more than once.

Is the XDP-on-Tx idea even worth talking about, or is the tracepoint
basically equivalent?

Similarly, does it make sense to add a post-qdisc tc hook where clsact
could be attached? In this model, the same program could count pre- or
post-qdisc based on where it was attached.

Thanks for the help.


Daniel Borkmann
 

On 08/08/2018 10:48 PM, Dan Siemon wrote:
> On Wed, 2018-08-08 at 21:43 +0200, Daniel Borkmann wrote:
>>> Is there a way to hook post-qdisc? I looked a bit at XDP, but it
>>> seems that is Rx-only for now?
>>
>> There's a tracepoint right before netdev_start_xmit() called
>> trace_net_dev_start_xmit(). So you could combine sch_clsact egress
>> with cls_bpf and a BPF prog on the tracepoint right before the skb is
>> handed to the driver. They could share a map, for example for the
>> tuple-to-counters mapping, so you would still be able to do the major
>> work in cls_bpf outside the qdisc lock.
>
> Thanks. I don't know much about tracepoints but will look into this. I
> gather these are capable of the same rates as the tc hooks?

Depends on what rates you are targeting; you might want to check BPF
raw tracepoints to reduce overhead, given this would be in the hot
path. Commit f6ef56589374 ("Merge branch 'bpf-raw-tracepoints'"), which
tested samples/bpf/test_overhead performance on 1 CPU, reports:

tracepoint     base   kprobe+bpf   tracepoint+bpf   raw_tracepoint+bpf
task_rename    1.1M   769K         947K             1.0M
urandom_read   789K   697K         750K             755K

> Where would the context extracted from the packet in the BPF prog (e.g.
> the 5-tuple) be stashed so that the tracepoint program can get at it
> without parsing the headers again?

It probably makes sense to flatten part of the key and map it into
skb->mark, or store it in skb->cb[], or store an offset there that
points into the packet.
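
A rough sketch of the skb->mark variant on the cls_bpf side (the
address-to-id map and the names are invented for illustration):
skb->mark is writable from tc programs and should still be visible at
the net_dev_start_xmit tracepoint, so the tracepoint sketch earlier in
the thread could key its counters off it. The skb->cb[] alternative
would write the value into the __sk_buff cb[0..4] fields instead.

/* Illustrative sketch only: resolve the packet's IPv4 destination to a
 * customer/flow id once at clsact egress and stash it in skb->mark so
 * later hooks don't have to parse headers again. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, __u32);     /* IPv4 address */
	__type(value, __u32);   /* flow/customer id */
} addr_to_flow_id SEC(".maps");

SEC("classifier")
int stash_flow_id(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	__u32 addr, *id;

	if ((void *)(eth + 1) > data_end ||
	    eth->h_proto != bpf_htons(ETH_P_IP))
		return TC_ACT_OK;

	iph = (void *)(eth + 1);
	if ((void *)(iph + 1) > data_end)
		return TC_ACT_OK;

	addr = iph->daddr;
	id = bpf_map_lookup_elem(&addr_to_flow_id, &addr);
	if (id)
		skb->mark = *id;   /* flattened key, visible post-qdisc */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";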

> Ideally this context is extracted once at the ingress port and flows
> with the SKB through to the egress port so we don't need to parse the
> headers more than once.
>
> Is the XDP-on-Tx idea even worth talking about, or is the tracepoint
> basically equivalent?
>
> Similarly, does it make sense to add a post-qdisc tc hook where clsact
> could be attached? In this model, the same program could count pre- or
> post-qdisc based on where it was attached.

I think it might be useful; a sch_clsact subhook would avoid having to
unclone or linearize the skb. There's also the option to place cls_bpf
in direct-action mode on sch_fq_codel, which would come after your HTB
(see fq_codel_classify()), but I presumed you also want the hook after
that.