[RFC PATCH 00/11] OVS eBPF datapath.


William Tu
 

Existing OVS has three datapath types: Linux kernel (dpif-netlink),
userspace (dpif-netdev), and Windows. This series adds another type of
OVS datapath: the eBPF datapath (dpif-bpf).

eBPF stands for extended Berkeley Packet Filter. It enables userspace
applications to customize and extend the Linux kernel’s functionality.
Thus, implementing the OVS datapath in eBPF brings three benefits:
flexibility, since new features can be added as eBPF bytecode and loaded
dynamically into the Linux kernel; safety, since the BPF verifier
guarantees that the eBPF bytecode cannot crash the kernel; and
portability, since eBPF bytecode is platform-agnostic, so hopefully the
same implementation can run on different platforms.

The implementation tries to re-implement the functionality under the
Linux kernel's net/openvswitch/* in eBPF code. However, this series is
still far from complete; a couple of eBPF limitations make it difficult.


OVS eBPF Architecture
=====================
OVS has the following architecture:
                _
               |  +-------------------+
               |  |    ovs-vswitchd   |
               |  +-------------------+
     userspace |  |      ofproto      |<-->OpenFlow controllers
               |  +--------+-+--------+
               |  | netdev | |ofproto-|
               |  +--------+ |  dpif  |
               |  | netdev | +--------+
*eBPF hook --> |  |provider| |  dpif  |
               |  +---||---+ +--------+
               |      ||     |  dpif  | <--- *eBPF provider
               |      ||     |provider|
               |_     ||     +---||---+
                      ||         ||
                _ +---||-----+---||---+
               |  |          |datapath| <--- *eBPF datapath
        kernel |  |          +--------+
               |  |                   |
               |_ +--------||---------+
                           ||
                        physical
                           NIC

And this patch series adds:
- eBPF hook for attaching eBPF/XDP programs to a netdev,
  files: lib/netdev-linux.*
- eBPF dpif provider, an interface to communicate with the eBPF datapath,
  files: lib/dpif-bpf.*, lib/dpif-bpf-odp.*
- eBPF datapath, the implementation of the OVS datapath in eBPF,
  files: bpf/datapath.c, bpf/*.h

Most of the design and implementation is described in the OSR 2018
paper[1], "Building an Extensible Open vSwitch Datapath", and the
OVS Conference talk[2], "Offloading OVS Flow Processing using eBPF".
[1] https://dl.acm.org/citation.cfm?id=3139657
[2] http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf


eBPF/XDP
========
A single BPF bytecode object, 'bpf/datapath.o', is generated and loaded
onto every netdev attached to OVS, at either the ingress or the egress of
the netdev. A packet traversing the OVS eBPF datapath typically goes
through three stages: parse, lookup, and action. Each stage consists of
multiple eBPF programs, and the stages are chained together by tail calls.
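
As an illustration of that tail-call chaining, below is a minimal sketch,
assuming a BPF_MAP_TYPE_PROG_ARRAY named 'tail_table', made-up stage
indices, and the SEC()/map/helper macros from the kernel samples' style
"bpf_helpers.h"; the real bpf/datapath.c uses the numeric 'tail-NN'
sections shown further below and different names.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"    /* SEC(), struct bpf_map_def, bpf_tail_call() */

enum {
    STAGE_PARSER = 0,
    STAGE_LOOKUP,
    STAGE_ACTION,
};

/* Program array used to jump from one stage to the next. */
struct bpf_map_def SEC("maps") tail_table = {
    .type        = BPF_MAP_TYPE_PROG_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u32),
    .max_entries = 8,
};

SEC("stage-parser")
int stage_parser(struct __sk_buff *skb)
{
    /* ... parse headers into a flow key kept in per-CPU map memory ... */
    bpf_tail_call(skb, &tail_table, STAGE_LOOKUP);
    return TC_ACT_OK;       /* reached only if the tail call fails */
}

SEC("stage-lookup")
int stage_lookup(struct __sk_buff *skb)
{
    /* ... look up the flow key in the flow table, pick the actions ... */
    bpf_tail_call(skb, &tail_table, STAGE_ACTION);
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";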

'objdump -h bpf/datapath.o' shows the OVS eBPF datapath object file.

1 tail-32 000018b0 0000000000000000 0000000000000000 00000040 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
2 tail-33 00001b08 0000000000000000 0000000000000000 000018f0 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
3 tail-0 000000b8 0000000000000000 0000000000000000 000033f8 2**3
CONTENTS, ALLOC, LOAD, READONLY, CODE
4 tail-1 00000a48 0000000000000000 0000000000000000 000034b0 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
<skip>
16 tail-13 00000aa0 0000000000000000 0000000000000000 000095d8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
17 xdp 00000070 0000000000000000 0000000000000000 0000a078 2**3
CONTENTS, ALLOC, LOAD, READONLY, CODE
18 af_xdp 00000010 0000000000000000 0000000000000000 0000a0e8 2**3
CONTENTS, ALLOC, LOAD, READONLY, CODE
19 tail-35 000008b0 0000000000000000 0000000000000000 0000a0f8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
20 ingress 00000178 0000000000000000 0000000000000000 0000a9a8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
21 egress 00000188 0000000000000000 0000000000000000 0000ab20 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
22 downcall 000003b0 0000000000000000 0000000000000000 0000aca8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
23 maps 000000fc 0000000000000000 0000000000000000 0000b058 2**2

The 'tail-{0-13}' programs are the OVS action implementations, see
bpf/action.h. The ingress, egress, and downcall programs are the three
possible entry points through which a packet triggers the eBPF datapath;
from there, they tail call into the next stage. The xdp and af_xdp
programs are still empty, reserved for future integration.
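
As a rough illustration of how the ingress/egress sections arise, each
entry point is simply a separate program placed in its own ELF section
with SEC(). The sketch below reuses the hypothetical 'tail_table' map and
stage indices from the earlier sketch; the function names and bodies are
placeholders, not the series' actual code.

SEC("ingress")
int entry_ingress(struct __sk_buff *skb)
{
    /* packet received on a netdev attached to OVS: start at the parser */
    bpf_tail_call(skb, &tail_table, STAGE_PARSER);
    return TC_ACT_OK;
}

SEC("egress")
int entry_egress(struct __sk_buff *skb)
{
    /* packet about to be transmitted by a netdev attached to OVS */
    bpf_tail_call(skb, &tail_table, STAGE_PARSER);
    return TC_ACT_OK;
}

/* "downcall" would be the third entry point, used when ovs-vswitchd
 * injects a packet and asks the datapath to execute actions on it. */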

Currently llvm/clang 4.0 does not work; we have to use version 3.8.

Testsuite
=========
We created a set of test cases under tests/system-bpf-traffic.at,
which is a subset of the kernel datapath testsuite (system-traffic.at).

'make check-bpf' kicks off the testing; so far the results for the
'BPF datapath-sanity' group are:

1: datapath - basic BPF commands ok
2: datapath - ping between two ports ok
3: datapath - http between two ports ok
4: datapath - ping between two ports on vlan ok
5: datapath - ping between two ports on cvlan ok
6: datapath - ping6 between two ports ok
7: datapath - ping6 between two ports on vlan ok
8: datapath - ping6 between two ports on cvlan ok
9: datapath - ping over bond skipped (system-bpf-traffic.at:210)
10: datapath - ping over vxlan tunnel ok
11: datapath - ping over vxlan6 tunnel ok
12: datapath - ping over gre tunnel ok
13: datapath - ping over geneve tunnel ok
14: datapath - ping over geneve6 tunnel ok
15: datapath - clone action FAILED (system-bpf-traffic.at:445)
and the remaining tests FAIL.
The log of each test is saved at tests/system-bpf-testsuite.dir/<id>/

Discussion
==========
We are still actively working on finishing the features; the basic
forwarding and tunnel features work, but they are still under heavy
debugging and development. The purpose of this RFC is to get some early
feedback and direction for completing the full feature set of the
existing kernel OVS datapath (net/openvswitch/*).
The three major issues we are worried about are:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
c. Verifier limitation.

The patch is based on top of the OVS 2.9.1
commit f8b6477aa019 ("Set release date for 2.9.1.")

Or from my GitHub, at branch 'rfc':
# git clone https://github.com/williamtu/ovs-ebpf

Joe Stringer (7):
ovs-bpf: add documentation and configuration.
netdev: add ebpf support for netdev provider.
lib: implement perf event ringbuffer for upcall.
lib/bpf: add support for managing bpf program/map.
dpif: add 'dpif-bpf' provider.
dpif-bpf-odp: Add bpf datapath interface and impl.
utilities: Add ovs-bpfctl utility.

William Tu (4):
bpf: implement OVS BPF datapath.
vswitch/bridge.c: add bpf datapath initialization.
tests: Add "make check-bpf" traffic target.
vagrant: add ebpf support using ubuntu/bionic

Documentation/automake.mk | 1 +
Documentation/index.rst | 2 +-
Documentation/intro/install/bpf.rst | 142 +++
Documentation/intro/install/index.rst | 1 +
Makefile.am | 13 +-
Vagrantfile-eBPF | 99 ++
acinclude.m4 | 39 +
bpf/.gitignore | 4 +
bpf/action.h | 628 +++++++++++
bpf/api.h | 279 +++++
bpf/automake.mk | 60 +
bpf/datapath.c | 187 +++
bpf/datapath.h | 71 ++
bpf/generated_headers.h | 185 +++
bpf/helpers.h | 209 ++++
bpf/lookup.h | 227 ++++
bpf/maps.h | 170 +++
bpf/odp-bpf.h | 254 +++++
bpf/openvswitch.h | 49 +
bpf/ovs-p4.h | 112 ++
bpf/ovs-proto.p4 | 329 ++++++
bpf/parser.h | 412 +++++++
bpf/xdp.h | 35 +
configure.ac | 1 +
include/linux/pkt_cls.h | 21 +
lib/automake.mk | 12 +
lib/bpf.c | 524 +++++++++
lib/bpf.h | 69 ++
lib/dpif-bpf-odp.c | 943 ++++++++++++++++
lib/dpif-bpf-odp.h | 47 +
lib/dpif-bpf.c | 1995 +++++++++++++++++++++++++++++++++
lib/dpif-netdev.c | 29 +-
lib/dpif-provider.h | 1 +
lib/dpif.c | 3 +
lib/netdev-bsd.c | 2 +
lib/netdev-dpdk.c | 2 +
lib/netdev-dummy.c | 2 +
lib/netdev-linux.c | 436 ++++++-
lib/netdev-linux.h | 2 +
lib/netdev-provider.h | 11 +
lib/netdev-vport.c | 145 ++-
lib/netdev.c | 25 +
lib/netdev.h | 4 +
lib/packets.h | 6 +-
lib/perf-event.c | 288 +++++
lib/perf-event.h | 43 +
ofproto/ofproto-dpif.c | 69 +-
tests/.gitignore | 1 +
tests/automake.mk | 30 +-
tests/ofproto-macros.at | 8 +
tests/system-bpf-macros.at | 112 ++
tests/system-bpf-testsuite.at | 25 +
tests/system-bpf-testsuite.patch | 10 +
tests/system-bpf-traffic.at | 809 +++++++++++++
utilities/automake.mk | 9 +
utilities/ovs-bpfctl.8.xml | 45 +
utilities/ovs-bpfctl.c | 248 ++++
vswitchd/bridge.c | 21 +
58 files changed, 9453 insertions(+), 53 deletions(-)
create mode 100644 Documentation/intro/install/bpf.rst
create mode 100644 Vagrantfile-eBPF
create mode 100644 bpf/.gitignore
create mode 100644 bpf/action.h
create mode 100644 bpf/api.h
create mode 100644 bpf/automake.mk
create mode 100644 bpf/datapath.c
create mode 100644 bpf/datapath.h
create mode 100644 bpf/generated_headers.h
create mode 100644 bpf/helpers.h
create mode 100644 bpf/lookup.h
create mode 100644 bpf/maps.h
create mode 100644 bpf/odp-bpf.h
create mode 100644 bpf/openvswitch.h
create mode 100644 bpf/ovs-p4.h
create mode 100644 bpf/ovs-proto.p4
create mode 100644 bpf/parser.h
create mode 100644 bpf/xdp.h
create mode 100644 lib/bpf.c
create mode 100644 lib/bpf.h
create mode 100644 lib/dpif-bpf-odp.c
create mode 100644 lib/dpif-bpf-odp.h
create mode 100644 lib/dpif-bpf.c
create mode 100644 lib/perf-event.c
create mode 100644 lib/perf-event.h
create mode 100644 tests/system-bpf-macros.at
create mode 100644 tests/system-bpf-testsuite.at
create mode 100644 tests/system-bpf-testsuite.patch
create mode 100644 tests/system-bpf-traffic.at
create mode 100644 utilities/ovs-bpfctl.8.xml
create mode 100644 utilities/ovs-bpfctl.c

--
2.7.4


Alexei Starovoitov
 

On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature, currently
the basic forwarding and tunnel feature work, but still under
heavy debugging and development. The purpose of this RFC is to
get some early feedbacks and direction for finishing the complete
features in existing kernel's OVS datapath (the net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. Non scalable megaflow map is no go. I'd like to see packet classification
algorithm like hicuts or efficuts to be implemented instead, since it can be
shared by generic bpf, bpfilter, ovs and likely others.
B. instead of helpers to interface with conntrack the way ovs did, I prefer
a generic conntrack mechanism that can be used out of xdp too

c. Verifier limitation.
Not sure what limitations you're concerned about.


William Tu
 

Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature, currently
the basic forwarding and tunnel feature work, but still under
heavy debugging and development. The purpose of this RFC is to
get some early feedbacks and direction for finishing the complete
features in existing kernel's OVS datapath (the net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. Non scalable megaflow map is no go. I'd like to see packet classification
algorithm like hicuts or efficuts to be implemented instead, since it can be
shared by generic bpf, bpfilter, ovs and likely others.
We did try the decision tree approach using dpdk's acl lib. The lookup
speed is 6 times faster than the megaflow tuple-space search.
However, update/insertion requires rebuilding/re-balancing the decision
tree, so it's way too slow; I think hicuts and efficuts suffer from the
same issue. So decision tree algos are scalable only for the lookup
operation, thanks to their optimization of tree depth, but not scalable
under update/insert/delete operations.

On a customer's system we see a megaflow update/insert rate of around 10
rules/sec; this makes decision trees unusable unless we invent something
to speed up the update/insert path or an incremental update for these
decision tree algorithms.
Now my backup plan is to implement megaflow in BPF.
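
To make that backup plan concrete, here is a minimal sketch of what a
tuple-space (megaflow-style) lookup could look like in BPF, assuming a
small fixed number of masks, a toy 16-byte flow key, and map/struct names
(mask_array, flow_table, ...) invented for the example; it is not the
series' flow-table code.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"    /* SEC(), struct bpf_map_def, bpf_map_lookup_elem() */

#define MAX_MASKS 8
#define KEY_WORDS 4                      /* 4 x u32 = toy 16-byte flow key */

struct flow_key  { __u32 w[KEY_WORDS]; };
struct flow_mask { __u32 w[KEY_WORDS]; };

struct bpf_map_def SEC("maps") mask_array = {
    .type        = BPF_MAP_TYPE_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(struct flow_mask),
    .max_entries = MAX_MASKS,
};

struct bpf_map_def SEC("maps") flow_table = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(struct flow_key),
    .value_size  = sizeof(__u32),        /* e.g. index of an action list */
    .max_entries = 4096,
};

SEC("megaflow-lookup")
int megaflow_lookup(struct __sk_buff *skb)
{
    struct flow_key key = {};            /* in reality filled by the parser */
    struct flow_key masked;
    __u32 *act = 0;

#pragma unroll                           /* no bounded loops in the verifier yet */
    for (__u32 i = 0; i < MAX_MASKS; i++) {
        __u32 idx = i;
        struct flow_mask *mask = bpf_map_lookup_elem(&mask_array, &idx);

        if (!mask)
            break;
#pragma unroll
        for (int j = 0; j < KEY_WORDS; j++)
            masked.w[j] = key.w[j] & mask->w[j];

        act = bpf_map_lookup_elem(&flow_table, &masked);
        if (act)
            break;                       /* first matching mask wins */
    }

    if (!act)
        return TC_ACT_OK;                /* miss: upcall to ovs-vswitchd */

    /* ... tail call into the action stage selected by *act ... */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

As with the kernel's tuple-space search, lookup cost grows linearly with
the number of masks, but inserting a new megaflow is just a hash-map
update plus, at most, one new mask entry.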

B. instead of helpers to interface with conntrack the way ovs did, I prefer
a generic conntrack mechanism that can be used out of xdp too
OK, we will work in this direction.

c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to the stack. The flow key OVS uses (struct sw_flow_key)
is 464 bytes. We trim a lot, down to around 300 bytes now, but that is
still huge considering that the BPF stack limit is 512 bytes.

We can always break a large program up and tail call, but sometimes
registers spill onto the stack, and when they are restored the verifier
state is gone and verification fails. This is more difficult for us to
work around.

Below is an example:
----
at 203: r7 is a constant and is stored on the stack (r10 - 248)
at 250: r2 reads (r10 - 248) back.
at 251: fails the verifier

from 27 to 201: R0=map_value(id=0,off=0,ks=4,vs=4352,imm=0)
R7=inv(id=0,umax_value=31,var_off=(0x0; 0x1f))
R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
201: (7b) *(u64 *)(r10 -256) = r0
202: (27) r7 *= 136
203: (7b) *(u64 *)(r10 -248) = r7
204: (bf) r6 = r0
205: (0f) r6 += r7
206: (b7) r8 = 2
207: (15) if r6 == 0x0 goto pc+93
R0=map_value(id=0,off=0,ks=4,vs=4352,imm=0)
R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=inv(id=0,umax_value=4216,var_off=(0x0; 0x1ff8)) R8=inv2
R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 fp-256=map_value
208: (b7) r1 = 681061
209: (63) *(u32 *)(r10 -200) = r1
210: (18) r1 = 0x6b73616d20746573
212: (7b) *(u64 *)(r10 -208) = r1
213: (bf) r1 = r10
214: (07) r1 += -208
215: (b7) r2 = 12
216: (85) call bpf_trace_printk#6
217: (bf) r7 = r6
218: (07) r7 += 8
219: (61) r1 = *(u32 *)(r6 +8)
R0=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7_w=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
220: (15) if r1 == 0x7 goto pc+82
R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0;
0xffffffff)) R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
221: (55) if r1 != 0x4 goto pc+228
R0=inv(id=0) R1=inv4
R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
222: (61) r1 = *(u32 *)(r9 +80)
223: (7b) *(u64 *)(r10 -264) = r1
224: (61) r6 = *(u32 *)(r9 +76)
225: (b7) r1 = 0
226: (73) *(u8 *)(r10 -198) = r1
227: (b7) r1 = 2674
228: (6b) *(u16 *)(r10 -200) = r1
229: (18) r1 = 0x6568746520746573
231: (7b) *(u64 *)(r10 -208) = r1
232: (bf) r1 = r10
233: (07) r1 += -208
234: (b7) r2 = 11
235: (85) call bpf_trace_printk#6
236: (bf) r1 = r6
237: (07) r1 += 14
238: (79) r2 = *(u64 *)(r10 -264)
239: (2d) if r1 > r2 goto pc+61
R0=inv(id=0) R1=pkt(id=0,off=14,r=14,imm=0)
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
240: (71) r1 = *(u8 *)(r7 +10)
R0=inv(id=0) R1_w=pkt(id=0,off=14,r=14,imm=0)
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
241: (73) *(u8 *)(r6 +0) = r1
242: (71) r1 = *(u8 *)(r7 +11)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
243: (73) *(u8 *)(r6 +1) = r1
244: (71) r1 = *(u8 *)(r7 +12)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
245: (73) *(u8 *)(r6 +2) = r1
246: (71) r1 = *(u8 *)(r7 +13)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
247: (73) *(u8 *)(r6 +3) = r1
248: (79) r4 = *(u64 *)(r10 -256)
249: (bf) r1 = r4
250: (79) r2 = *(u64 *)(r10 -248)
251: (0f) r1 += r2
math between map_value pointer and register with unbounded min value
is not allowed


Alexei Starovoitov
 

On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature, currently
the basic forwarding and tunnel feature work, but still under
heavy debugging and development. The purpose of this RFC is to
get some early feedbacks and direction for finishing the complete
features in existing kernel's OVS datapath (the net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. Non scalable megaflow map is no go. I'd like to see packet classification
algorithm like hicuts or efficuts to be implemented instead, since it can be
shared by generic bpf, bpfilter, ovs and likely others.
We did try the decision tree approach using dpdk's acl lib. The lookup
speed is 6 times faster than the megaflow using tuple space.
However, the update/insertion requires rebuilding/re-balancing the decision
tree so it's way too slow. I think hicuts or efficuts suffers the same issue.
So decision tree algos are scalable only for lookup operation due to its
optimization over tree depth, but not scalable under
update/insert/delete operations.

On customer's system we see megaflow update/insert rate around 10 rules/sec,
this makes decision tree unusable, unless we invent something to optimize the
update/insert time or incremental update of these decision tree algo.
is this a typo? you probably meant 10K rule updates a second ?
Last time I've dealt with these algorithms we had 100K acl updates a second.
It was an important metric that we were optimizing for.
I'm pretty sure '*cuts' algos do many thousands per second non optimized.

c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to stack. The flow key OVS uses (struct sw_flow_key)
is 464 byte. We trim a lot, now around 300 byte, but still huge, considering
the BPF's stack limit is 512 byte.
have you tried using per-cpu array of one element with large value
instead of stack?
In the latest verifier most of the operations that can be done with the stack
pointer can be done with pointer to map value too.
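
For concreteness, a minimal sketch of that suggestion, assuming a
placeholder struct 'big_flow_key' of roughly the trimmed key size and a
one-element BPF_MAP_TYPE_PERCPU_ARRAY named 'percpu_flow_key'; the names
and sizes are invented for the example, not taken from the series.

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"    /* SEC(), struct bpf_map_def, bpf_map_lookup_elem() */

struct big_flow_key {
    __u8 data[304];                      /* stand-in for the ~300-byte key */
};

/* One-element per-CPU array used as scratch space instead of the BPF stack. */
struct bpf_map_def SEC("maps") percpu_flow_key = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(struct big_flow_key),
    .max_entries = 1,
};

SEC("parser")
int parse_into_percpu_key(struct __sk_buff *skb)
{
    __u32 zero = 0;
    struct big_flow_key *key = bpf_map_lookup_elem(&percpu_flow_key, &zero);

    if (!key)
        return TC_ACT_OK;

    /* Parsed fields are written through 'key' (map memory), so the 512-byte
     * stack stays untouched; later stages fetch the same per-CPU slot. */
    key->data[0] = 0;                    /* placeholder for real parsing */
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

The one-element per-CPU array behaves like thread-local scratch memory: no
locking is needed, and the value is still there when a later tail-called
stage on the same CPU looks the slot up again.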


William Tu
 

On Tue, Jul 3, 2018 at 10:56 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature, currently
the basic forwarding and tunnel feature work, but still under
heavy debugging and development. The purpose of this RFC is to
get some early feedbacks and direction for finishing the complete
features in existing kernel's OVS datapath (the net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. Non scalable megaflow map is no go. I'd like to see packet classification
algorithm like hicuts or efficuts to be implemented instead, since it can be
shared by generic bpf, bpfilter, ovs and likely others.
We did try the decision tree approach using dpdk's acl lib. The lookup
speed is 6 times faster than the megaflow using tuple space.
However, the update/insertion requires rebuilding/re-balancing the decision
tree so it's way too slow. I think hicuts or efficuts suffers the same issue.
So decision tree algos are scalable only for lookup operation due to its
optimization over tree depth, but not scalable under
update/insert/delete operations.

On customer's system we see megaflow update/insert rate around 10 rules/sec,
this makes decision tree unusable, unless we invent something to optimize the
update/insert time or incremental update of these decision tree algo.
is this a typo? you probably meant 10K rule updates a second ?
I mean "new" rules being added at 10 rules/sec.
Update rate might be much higher.

Last time I've dealt with these algorithms we had 100K acl updates a second.
It was an important metric that we were optimizing for.
I'm pretty sure '*cuts' algos do many thousands per second non optimized.
When adding a new rule, do these algorithms require rebuilding the
entire tree?

In our evaluation, updating an existing entry in the decision tree
performs OK, because it amounts to a lookup and replace: lookup is fast
and the update is just an atomic swap. But inserting a new rule is slow,
because it requires re-building the tree from all existing rules.
And we see new rules being added at a rate of 10 per second,
so we are constantly rebuilding the entire tree.

If the entire tree has 100k rules, it takes around 2 seconds to rebuild,
based on the dpdk acl library. Without an incremental algorithm, adding
1 new rule triggers a rebuild over all 100k rules, which is too slow.

Reading through HyperCuts and EffiCuts, I'm not sure how they support
incrementally adding a new rule without rebuilding the entire tree.
http://ccr.sigcomm.org/online/files/p207.pdf
http://cseweb.ucsd.edu/~susingh/papers/hyp-sigcomm03.pdf

The HyperCuts paper says
"A fast update algorithm can also be implemented; however we do not
go into the details of incremental update in this paper"


c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to stack. The flow key OVS uses (struct sw_flow_key)
is 464 byte. We trim a lot, now around 300 byte, but still huge, considering
the BPF's stack limit is 512 byte.
have you tried using per-cpu array of one element with large value
instead of stack?
yes, we now store the flow key in a percpu array with 1 element.

In the latest verifier most of the operations that can be done with the stack
pointer can be done with pointer to map value too.
Once the flow key is stored in a map, another eBPF program
needs to use that key to look up the flow table (another map).
So we have to copy the flow key onto the stack first, in order to
use it as the key for the flow table lookup.

Is there a way to work around it?
Thanks
William


Paul Chaignon
 

On Wed, Jul 04, 2018 at 07:25:50PM -0700, William Tu wrote:
On Tue, Jul 3, 2018 at 10:56 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature, currently
the basic forwarding and tunnel feature work, but still under
heavy debugging and development. The purpose of this RFC is to
get some early feedbacks and direction for finishing the complete
features in existing kernel's OVS datapath (the net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. Non scalable megaflow map is no go. I'd like to see packet classification
algorithm like hicuts or efficuts to be implemented instead, since it can be
shared by generic bpf, bpfilter, ovs and likely others.
We did try the decision tree approach using dpdk's acl lib. The lookup
speed is 6 times faster than the megaflow using tuple space.
However, the update/insertion requires rebuilding/re-balancing the decision
tree so it's way too slow. I think hicuts or efficuts suffers the same issue.
So decision tree algos are scalable only for lookup operation due to its
optimization over tree depth, but not scalable under
update/insert/delete operations.

On customer's system we see megaflow update/insert rate around 10 rules/sec,
this makes decision tree unusable, unless we invent something to optimize the
update/insert time or incremental update of these decision tree algo.
is this a typo? you probably meant 10K rule updates a second ?
I mean "new" rules being added at 10 rules/sec.
Update rate might be much higher.

Last time I've dealt with these algorithms we had 100K acl updates a second.
It was an important metric that we were optimizing for.
I'm pretty sure '*cuts' algos do many thousands per second non optimized.
When adding a new rule, do these algorithms require rebuilding the
entire tree?

In our evaluation, updating an existing entry in the decision tree
performs OK, because it is equal to lookup and replace, and lookup
is fast, update is just atomic swap. But inserting a new rule is slow,
because it requires re-building the tree using all existing rules.
And we see new rule being added at rate 10 rules per second.
So we are constantly rebuilding the entire tree.

If the entire tree has 100k of rules, it takes around 2 seconds to rebuild,
based on the dpdk acl library. And without an incremental algorithm,
adding 1 new rule will trigger rebuilding the 100k of rules, and it is too slow.

Reading through HyperCuts and EffiCuts, I'm not sure how it supports
incrementally adding a new rule, without rebuilding the entire tree.
http://ccr.sigcomm.org/online/files/p207.pdf
http://cseweb.ucsd.edu/~susingh/papers/hyp-sigcomm03.pdf

The HyperCuts papers says
"A fast update algorithm can also be implemented; however we do not
go into the details of incremental update in this paper"


c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to stack. The flow key OVS uses (struct sw_flow_key)
is 464 byte. We trim a lot, now around 300 byte, but still huge, considering
the BPF's stack limit is 512 byte.
have you tried using per-cpu array of one element with large value
instead of stack?
yes, now we store the flow key in percpu array with 1 element.

In the latest verifier most of the operations that can be done with the stack
pointer can be done with pointer to map value too.
Once the flow key is stored in map, another eBPF program
needs to use that key to lookup flow table (another map).
So we have to store the flow key on stack first, in order to
use it as key to lookup the flow table map.

Is there a way to work around it?
d71962f ("bpf: allow map helpers access to map values directly") removes
that limitation from the verifier and should allow you to use map values
as map keys directly. 4.18-rc1 has it.
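
A sketch of what that looks like in practice, reusing the hypothetical
'percpu_flow_key' map and 'struct big_flow_key' from the per-CPU-array
sketch earlier in the thread, plus an invented hash map 'flow_table_big'
keyed by the full key; on kernels with d71962f (4.18-rc1 and later) the
map-value pointer can be passed straight to the second lookup, with no
copy onto the BPF stack.

/* Flow table keyed directly by the large key (name invented for the example). */
struct bpf_map_def SEC("maps") flow_table_big = {
    .type        = BPF_MAP_TYPE_HASH,
    .key_size    = sizeof(struct big_flow_key),
    .value_size  = sizeof(__u32),
    .max_entries = 4096,
};

SEC("lookup")
int lookup_without_stack_copy(struct __sk_buff *skb)
{
    __u32 zero = 0;
    struct big_flow_key *key = bpf_map_lookup_elem(&percpu_flow_key, &zero);
    __u32 *act;

    if (!key)
        return TC_ACT_OK;

    /* Before d71962f the verifier insisted that 'key' point into the BPF
     * stack here; since 4.18-rc1 a pointer into map memory is accepted. */
    act = bpf_map_lookup_elem(&flow_table_big, key);
    if (!act)
        return TC_ACT_OK;                /* flow miss: upcall path */

    /* ... tail call into the action stage indicated by *act ... */
    return TC_ACT_OK;
}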

Thanks
William


William Tu
 


d71962f ("bpf: allow map helpers access to map values directly") removes
that limitation from the verifier and should allow you to use map values
as map keys directly. 4.18-rc1 has it.

Thanks
William
Hi Paul,

Thanks a lot! This is very helpful.
I'm testing it now, works great so far, and saves lots of bpf stack space.

Regards,
William