[PATCHv2 RFC 0/3] AF_XDP support for OVS


William Tu
 

From: root <ovs-smartnic@...>

The patch series introduces AF_XDP support for the OVS netdev.
AF_XDP is a new address family that works together with eBPF.
In short, a socket of the AF_XDP family can receive and send
packets from an eBPF/XDP program attached to the netdev.
For more details about AF_XDP, please see the Linux kernel's
Documentation/networking/af_xdp.rst.

OVS has a couple of netdev types, e.g., system, tap, and
internal. The patch series first adds a new netdev type called
"afxdp" and implements its configuration, packet reception,
and transmit functions. Since the AF_XDP socket, xsk,
operates in userspace, once ovs-vswitchd receives packets
from the xsk, the proposed architecture reuses the existing
userspace dpif-netdev datapath. As a result, most of the
packet processing happens in userspace instead of in the
Linux kernel.

Architecture
============
              _
             |   +-------------------+
             |   |    ovs-vswitchd   |<-->ovsdb-server
             |   +-------------------+
             |   |      ofproto      |<-->OpenFlow controllers
             |   +--------+-+--------+
             |   | netdev | |ofproto-|
   userspace |   +--------+ |  dpif  |
             |   | netdev | +--------+
             |   |provider| |  dpif  |
             |   +---||---+ +--------+
             |       ||     | dpif-  |
             |       ||     | netdev |
             |_      ||     +--------+
                     ||
              _  +---||-----+--------+
             |   | af_xdp prog +     |
      kernel |   |   xsk_map         |
             |_  +--------||---------+
                          ||
                      physical
                         NIC

To get started, create an OVS userspace bridge that uses dpif-netdev
by setting datapath_type to netdev:
# ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev

Then attach a Linux netdev with type afxdp:
# ovs-vsctl add-port br0 afxdp-p0 -- \
    set interface afxdp-p0 type="afxdp"

Most of the implementation follows the AF_XDP sample code in the
Linux kernel under samples/bpf/xdpsock_user.c.

Configuration
=============
When a new afxdp netdev is added to OVS, the patch does
the following configuration:
1) attach the afxdp program and map to the netdev (see bpf/xdp.h)
   (currently a maximum of 4 afxdp netdevs is supported)
2) create an AF_XDP socket (XSK) for the afxdp netdev
3) allocate a virtually contiguous memory region, called umem, and
   register this memory with the XSK
4) set up the rx/tx rings and the umem's fill/completion rings
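
For reference, below is a condensed sketch of steps 2)-4), loosely
following samples/bpf/xdpsock_user.c. The constants (NUM_FRAMES,
FRAME_SIZE, RING_SIZE) and the xsk_configure() helper name are
illustrative, not the values or names used by the patch, and error
handling is omitted.

#include <linux/if_xdp.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_XDP            /* older libc headers may lack these */
#define AF_XDP 44
#endif
#ifndef SOL_XDP
#define SOL_XDP 283
#endif

#define NUM_FRAMES 1024   /* illustrative sizes, not the patch's */
#define FRAME_SIZE 2048
#define RING_SIZE  1024

static int
xsk_configure(int ifindex, int queue_id)
{
    int fd = socket(AF_XDP, SOCK_RAW, 0);          /* 2) create the XSK */

    /* 3) allocate a virtually contiguous umem and register it. */
    void *bufs;
    posix_memalign(&bufs, getpagesize(), NUM_FRAMES * FRAME_SIZE);
    struct xdp_umem_reg mr = {
        .addr = (uintptr_t) bufs,
        .len = NUM_FRAMES * FRAME_SIZE,
        .chunk_size = FRAME_SIZE,
        .headroom = 0,
    };
    setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));

    /* 4) size the umem fill/completion rings and the rx/tx rings; the
     *    rings themselves are then mmap()ed via XDP_MMAP_OFFSETS. */
    int ndescs = RING_SIZE;
    setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &ndescs, sizeof(ndescs));
    setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ndescs, sizeof(ndescs));
    setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs));
    setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs));

    /* Bind the XSK to the netdev queue served by the afxdp program. */
    struct sockaddr_xdp sxdp = {
        .sxdp_family = AF_XDP,
        .sxdp_ifindex = ifindex,
        .sxdp_queue_id = queue_id,
    };
    bind(fd, (struct sockaddr *) &sxdp, sizeof(sxdp));
    return fd;
}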

Packet Flow
===========
Currently, the af_xdp eBPF program loaded onto the netdev does
nothing but forward the packet to the XSK bound to its receiving
queue id.
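
Conceptually, such a program is a one-line redirect into an XSKMAP.
The sketch below is illustrative only and is not the patch's
bpf/xdp.h; the map name and the 4-entry size just mirror the
xdpsock sample.

#include <linux/bpf.h>
#include "bpf_helpers.h"           /* SEC() and bpf_map_def, as in samples/bpf */

struct bpf_map_def SEC("maps") xsks_map = {
    .type = BPF_MAP_TYPE_XSKMAP,   /* values are AF_XDP socket fds */
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 4,
};

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
    /* Redirect the packet to whatever XSK is stored at the index of
     * the rx queue it arrived on; the packet is dropped if the map
     * slot is empty. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
}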

The v1 patch simplified the buffer/ring management by introducing
a copy from the umem into OVS's internal buffer when receiving a
packet, and, when sending the packet out to another netdev, a copy
of the packet into that netdev's umem.

The v2 patch implements a simple push/pop umem list, where the
list contains the unused umem elements. Before receiving any
packets, pop N elements from the list and place them into the FILL
queue. When a batch of packets is received, e.g., 32 packets,
pop another 32 unused umem elements and refill the FILL queue.

When there are N packets to send, pop N elements from the list,
copy the packet data into each umem element, and issue the send.
Once finished, recycle the sent packets' umem elements from the
COMPLETION queue and push them back onto the umem list.
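
The perf profile below shows umem_elem_push() from this list. The
following is only a hypothetical sketch of such a free list (the
patch's lib/xdpsock.c may organize it differently), kept as a LIFO
stack of free umem chunk addresses:

/* Hypothetical free list of unused umem chunks; names and layout
 * are illustrative only. */
struct umem_pool {
    void **chunks;      /* addresses of unused umem chunks */
    int top;            /* number of free chunks currently stacked */
};

static inline void
umem_elem_push(struct umem_pool *pool, void *chunk)
{
    pool->chunks[pool->top++] = chunk;
}

static inline void *
umem_elem_pop(struct umem_pool *pool)
{
    return pool->top > 0 ? pool->chunks[--pool->top] : NULL;
}

Before an rx pass, a batch of addresses is popped and fed to the FILL
ring; addresses reaped from the COMPLETION ring after tx are pushed
back.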

With v2, an AF_XDP packet forwarded from one netdev (ovs-p0) to
another netdev (ovs-p1) goes through the following path,
considering the driver (XDP_DRV) mode:
1) the xdp program at ovs-p0 copies the packet into the netdev's rx umem
2) ovs-vswitchd receives the packet from ovs-p0
3) ovs dpif-netdev parses the packet, looks up the flow table, and
   the resulting action is forwarding to another afxdp port
4) ovs-vswitchd copies the packet into the umem of ovs-p1 and kicks tx
5) the kernel copies the packet from the umem to the ovs-p1 tx queue

Thus, the total number of copies between two ports is three.
I haven't tried the AF_XDP zero-copy mode driver yet.
Hopefully, with AF_XDP zero-copy mode, copies 1) and 5) will be
removed, so the best case will be one copy between the two
netdevs.
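
Step 4) is also where the l2fwd profile below spends its time: every
transmit ends with a sendto() kick. A hypothetical, heavily
simplified sketch of that step (the patch's netdev-linux.c keeps real
ring state, memory barriers, and batching) could look like:

#include <linux/if_xdp.h>
#include <string.h>
#include <sys/socket.h>

/* Userspace view of one mmap()ed AF_XDP TX ring; the pointer offsets
 * come from the XDP_MMAP_OFFSETS getsockopt.  Names are illustrative;
 * wrap/full checks and memory barriers are omitted. */
struct xsk_tx_ring {
    __u32 *producer;            /* free-running producer index */
    struct xdp_desc *descs;     /* array of tx descriptors */
    __u32 size;                 /* number of descriptors, power of two */
};

static void
xsk_tx_one(int xsk_fd, struct xsk_tx_ring *tx,
           void *umem_chunk, __u64 umem_addr,
           const void *pkt, __u32 len)
{
    /* Copy the packet into the umem chunk popped from the free list. */
    memcpy(umem_chunk, pkt, len);

    /* Produce one descriptor pointing at that umem offset. */
    __u32 idx = *tx->producer & (tx->size - 1);
    tx->descs[idx].addr = umem_addr;
    tx->descs[idx].len = len;
    (*tx->producer)++;

    /* kick_tx: tell the kernel there is work on the TX ring.  This
     * sendto() is what dominates the l2fwd perf profile below; the
     * chunk is recycled later via the COMPLETION ring. */
    sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}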

Performance
===========
The performance evaluation uses two machines connected back-to-back
by a dual-port 10Gbps link, using the ixgbe driver. Each machine has
an Intel Xeon CPU E5-2440 v2 at 1.90GHz with 8 cores.
One machine sends a single flow of 64-byte packets at 14Mpps to the
other machine, which runs OVS with 2 afxdp ports. One port is used
for receiving packets, and the other port for sending packets back.
All experiments use a single core.

Baseline performance using xdpsock (Mpps, 64-byte)
Benchmark   XDP_SKB   XDP_DRV
rxdrop      1.1       2.9
txpush      1.5       1.5
l2fwd       1.08      1.12

With this patch, OVS with AF_XDP (Mpps, 64-byte)
Benchmark   XDP_SKB   XDP_DRV
rxdrop      1.4       3.3 *[1]
txpush      N/A       N/A
l2fwd       0.3       0.4

The rxdrop case uses the rule
ovs-ofctl add-flow br0 "in_port=1 actions=drop"
The l2fwd case uses
ovs-ofctl add-flow br0 "in_port=1 actions=output:2"
where ports 1 and 2 are ixgbe 10Gbps ports.

Apparently we need to find out why l2fwd is so slow.

[1] The number is higher than the baseline; I guess it is due to
OVS's pmd-mode netdev.

Evaluation1: perf
=================
1) RX drop 3Mpps
15.43% pmd7 ovs-vswitchd [.] miniflow_extract
11.63% pmd7 ovs-vswitchd [.] dp_netdev_input__
9.69% pmd7 libc-2.23.so [.] __clock_gettime
7.23% pmd7 ovs-vswitchd [.] umem_elem_push
7.12% pmd7 ovs-vswitchd [.] pmd_thread_main
6.76% pmd7 [vdso] [.] __vdso_clock_gettime
6.65% pmd7 ovs-vswitchd [.] netdev_linux_rxq_xsk
6.46% pmd7 [kernel.vmlinux] [k] nmi
6.36% pmd7 ovs-vswitchd [.] odp_execute_actions
5.85% pmd7 ovs-vswitchd [.] netdev_linux_rxq_recv
4.62% pmd7 ovs-vswitchd [.] time_timespec__
3.77% pmd7 ovs-vswitchd [.] dp_netdev_process_rxq_port

2) L2fwd 0.4Mpps
20.05% pmd7 [kernel.vmlinux] [k] entry_SYSCALL_64_trampoline
11.40% pmd7 [kernel.vmlinux] [k] syscall_return_via_sysret
11.27% pmd7 [kernel.vmlinux] [k] __sys_sendto
6.10% pmd7 [kernel.vmlinux] [k] __fget
4.81% pmd7 [kernel.vmlinux] [k] xsk_sendmsg
3.62% pmd7 [kernel.vmlinux] [k] sockfd_lookup_light
3.29% pmd7 [kernel.vmlinux] [k] aa_label_sk_perm
3.16% pmd7 libpthread-2.23.so [.] __GI___libc_sendto
3.07% pmd7 [kernel.vmlinux] [k] nmi
2.90% pmd7 [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
2.89% pmd7 [kernel.vmlinux] [k] do_syscall_64

note: lots of sendto syscalls on l2fwd due to tx;
lots of time is spent in [kernel.vmlinux], not in ovs-vswitchd

Evaluation2: strace -c
======================
1) RX drop 3Mpps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 24.94    0.000199           2       114       108 recvmsg
 23.06    0.000184          11        17           poll
 19.67    0.000157           3        54        54 accept
 14.41    0.000115           2        54        40 read
  5.39    0.000043           7         6           sendmsg
  4.76    0.000038           2        21        18 recvfrom
  3.88    0.000031           2        19           getrusage
  3.01    0.000024          24         1           restart_syscall
  0.88    0.000007           4         2           sendto
------ ----------- ----------- --------- --------- ----------------
100.00    0.000798                   288       220 total

2) L2fwd 0.4Mpps
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.35   11.180104      272685        41           poll
  0.64    0.072016       72016         1           restart_syscall
  0.00    0.000256           1       264       252 recvmsg
  0.00    0.000253           2       126       126 accept
  0.00    0.000102           1       126        92 read
  0.00    0.000071           6        12           sendmsg
  0.00    0.000054           1        50        42 recvfrom
  0.00    0.000035           1        44           getrusage
  0.00    0.000015           4         4           sendto
------ ----------- ----------- --------- --------- ----------------
100.00   11.252906                   668       512 total

note: strace shows lots of poll syscalls, but using ftrace,
I saw lots of sendto syscalls.

Evaluation3: perf stat
======================
1) RX drop 3Mpps
20,010.33 msec cpu-clock # 1.000 CPUs utilized
20,010.35 msec task-clock # 1.000 CPUs utilized
93 context-switches # 4.648 M/sec
346,728,358 cache-references # 17327754.023 M/sec (19.06%)
13,206 cache-misses # 0.004 % of all cache refs (19.11%)
45,404,902,649 cycles # 2269110.577 GHz (19.13%)
79,272,455,182 instructions # 1.75 insn per cycle
# 0.29 stalled cycles per insn (23.88%)
16,481,635,945 branches # 823669962.269 M/sec (23.88%)
0 faults # 0.000 K/sec
22,691,286,646 stalled-cycles-frontend # 49.98% frontend cycles idle (23.86%)
<not supported> stalled-cycles-backend
14,650,538 branch-misses # 0.09% of all branches (23.81%)
1,982,308,381 bus-cycles # 99065886.107 M/sec (23.79%)
37,669,050,133 ref-cycles # 1882511251.024 M/sec (28.55%)
195,627,994 LLC-loads # 9776511.444 M/sec (23.79%)
1,804 LLC-load-misses # 0.00% of all LL-cache hits (23.79%)
150,563,094 LLC-stores # 7524392.504 M/sec (9.51%)
126 LLC-store-misses # 6.297 M/sec (9.51%)
807,489,043 LLC-prefetches # 40354275.012 M/sec (9.51%)
3,794 LLC-prefetch-misses # 189.605 M/sec (9.51%)
22,972,900,217 dTLB-loads # 1148070975.362 M/sec (9.51%)
4,386,366 dTLB-load-misses # 0.02% of all dTLB cache hits (9.51%)
15,018,543,088 dTLB-stores # 750551878.461 M/sec (9.51%)
22,187 dTLB-store-misses # 1108.796 M/sec (9.51%)
<not supported> dTLB-prefetches
<not supported> dTLB-prefetch-misses
5,248,771 iTLB-loads # 262307.396 M/sec (9.51%)
74,790 iTLB-load-misses # 1.42% of all iTLB cache hits (14.27%)

2) L2fwd 0.4Mpps
Performance counter stats for process id '7095':

10,005.64 msec cpu-clock # 1.000 CPUs utilized
10,005.64 msec task-clock # 1.000 CPUs utilized
47 context-switches # 4.698 M/sec
33,231,952 cache-references # 3321534.433 M/sec (19.09%)
4,093 cache-misses # 0.012 % of all cache refs (19.13%)
21,940,530,397 cycles # 2192956.561 GHz (19.13%)
15,082,324,838 instructions # 0.69 insn per cycle
# 0.95 stalled cycles per insn (23.89%)
3,180,559,568 branches # 317897008.296 M/sec (23.89%)
0 faults # 0.000 K/sec
14,284,571,595 stalled-cycles-frontend # 65.11% frontend cycles idle (23.83%)
<not supported> stalled-cycles-backend
73,279,708 branch-misses # 2.30% of all branches (23.79%)
997,678,140 bus-cycles # 99717955.022 M/sec (23.79%)
18,958,361,307 ref-cycles # 1894888686.357 M/sec (28.54%)
20,159,851 LLC-loads # 2014977.611 M/sec (23.79%)
458 LLC-load-misses # 0.00% of all LL-cache hits (23.79%)
11,159,143 LLC-stores # 1115356.622 M/sec (9.51%)
42 LLC-store-misses # 4.198 M/sec (9.51%)
54,652,812 LLC-prefetches # 5462549.925 M/sec (9.51%)
1,650 LLC-prefetch-misses # 164.918 M/sec (9.51%)
4,662,831,793 dTLB-loads # 466050154.223 M/sec (9.51%)
137,048,612 dTLB-load-misses # 2.94% of all dTLB cache hits (9.51%)
3,427,752,805 dTLB-stores # 342603978.511 M/sec (9.51%)
1,083,813 dTLB-store-misses # 108327.136 M/sec (9.51%)
<not supported> dTLB-prefetches
<not supported> dTLB-prefetch-misses
4,982,148 iTLB-loads # 497965.817 M/sec (9.51%)
14,543,131 iTLB-load-misses # 291.90% of all iTLB cache hits (14.27%)

note: perf highlights 2 stats in red: stalled-cycles-frontend and
iTLB-load-misses. I think l2fwd's high syscall rate causes the high
iTLB-load-misses. I don't know why, in both cases, the frontend
cycles (instruction fetch and decode) are so high (49% and 65%).

Evaluation4: ovs-appctl dpif-netdev/pmd-stats-show
==================================================
1) RX drop 3Mpps
pmd thread numa_id 0 core_id 11:
packets received: 82106476
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 82106156
megaflow hits: 318
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 2
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 198086106617 (60.53%)
processing cycles: 129181332198 (39.47%)
avg cycles per packet: 3985.89 (327267438815/82106476)
avg processing cycles per packet: 1573.34 (129181332198/82106476)

2) L2fwd 0.4Mpps
pmd thread numa_id 0 core_id 11:
packets received: 8555669
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 8555509
megaflow hits: 159
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 89532679391 (57.74%)
processing cycles: 65538391726 (42.26%)
avg cycles per packet: 18124.95 (155071071117/8555669)
avg processing cycles per packet: 7660.23 (65538391726/8555669)

note: avg cycles per packet: 3985 vs. 18124

Next Steps
==========
1) optimize the tx path as well as l2fwd
2) try the zero-copy mode driver

v1->v2:
- add a list to maintain unused umem elements
- remove the copy from the rx umem to OVS's internal buffer
- use hugetlb to reduce misses (not much difference)
- use the pmd-mode netdev in OVS (huge performance improvement)
- remove the malloc'd dp_packet; instead, place the dp_packet in the umem

William Tu (3):
  afxdp: add ebpf code for afxdp and xskmap.
  netdev-linux: add new netdev type afxdp.
  tests: add afxdp test cases.

acinclude.m4 | 1 +
bpf/api.h | 6 +
bpf/helpers.h | 2 +
bpf/maps.h | 12 +
bpf/xdp.h | 42 ++-
lib/automake.mk | 5 +-
lib/bpf.c | 41 +-
lib/bpf.h | 6 +-
lib/dp-packet.c | 20 +
lib/dp-packet.h | 27 +-
lib/dpif-netdev-perf.h | 16 +-
lib/dpif-netdev.c | 59 ++-
lib/if_xdp.h | 79 ++++
lib/netdev-dummy.c | 1 +
lib/netdev-linux.c | 808 +++++++++++++++++++++++++++++++++++++++-
lib/netdev-provider.h | 2 +
lib/netdev-vport.c | 1 +
lib/netdev.c | 11 +
lib/netdev.h | 1 +
lib/xdpsock.c | 70 ++++
lib/xdpsock.h | 82 ++++
tests/automake.mk | 17 +
tests/ofproto-macros.at | 1 +
tests/system-afxdp-macros.at | 148 ++++++++
tests/system-afxdp-testsuite.at | 25 ++
tests/system-afxdp-traffic.at | 38 ++
vswitchd/bridge.c | 1 +
27 files changed, 1492 insertions(+), 30 deletions(-)
create mode 100644 lib/if_xdp.h
create mode 100644 lib/xdpsock.c
create mode 100644 lib/xdpsock.h
create mode 100644 tests/system-afxdp-macros.at
create mode 100644 tests/system-afxdp-testsuite.at
create mode 100644 tests/system-afxdp-traffic.at

--
2.7.4