
Re: [RFC PATCH 00/11] OVS eBPF datapath.

Paul Chaignon
 

On Wed, Jul 04, 2018 at 07:25:50PM -0700, William Tu wrote:
On Tue, Jul 3, 2018 at 10:56 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature; currently
the basic forwarding and tunnel features work, but they are still
under heavy debugging and development. The purpose of this RFC is to
get some early feedback and direction on completing the features of
the existing kernel OVS datapath (net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried about:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. A non-scalable megaflow map is a no-go. I'd like to see a packet
classification algorithm like HiCuts or EffiCuts implemented instead,
since it could be shared by generic bpf, bpfilter, OVS, and likely others.
We did try the decision tree approach using DPDK's ACL library. Lookup
is six times faster than megaflow's tuple space search. However,
update/insertion requires rebuilding/re-balancing the decision tree,
so it is way too slow; I think HiCuts and EffiCuts suffer from the
same issue. Decision tree algorithms scale well for lookups, thanks to
their optimization of tree depth, but not for update/insert/delete
operations.

On customers' systems we see megaflow update/insert rates of around 10
rules/sec; this makes decision trees unusable unless we invent something
to optimize the update/insert time, or add incremental updates to these
decision tree algorithms.
is this a typo? you probably meant 10K rule updates a second?
I mean "new" rules being added at 10 rules/sec.
The update rate might be much higher.

Last time I dealt with these algorithms we had 100K ACL updates a
second. It was an important metric that we were optimizing for.
I'm pretty sure the '*cuts' algos do many thousands per second unoptimized.
When adding a new rule, do these algorithms require rebuilding the
entire tree?

In our evaluation, updating an existing entry in the decision tree
performs OK, because it amounts to lookup-and-replace: lookup is fast
and the update is an atomic swap. But inserting a new rule is slow,
because it requires rebuilding the tree from all existing rules.
Since we see new rules being added at 10 per second, we are
constantly rebuilding the entire tree.

If the tree holds 100k rules, a rebuild takes around 2 seconds with the
DPDK ACL library. Without an incremental algorithm, adding one new rule
triggers a rebuild over all 100k rules, which is too slow.

Reading through HyperCuts and EffiCuts, I'm not sure how they support
incrementally adding a new rule without rebuilding the entire tree.
http://ccr.sigcomm.org/online/files/p207.pdf
http://cseweb.ucsd.edu/~susingh/papers/hyp-sigcomm03.pdf

The HyperCuts paper says:
"A fast update algorithm can also be implemented; however we do not
go into the details of incremental update in this paper"


c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to the stack. The flow key OVS uses (struct sw_flow_key)
is 464 bytes. We trim it a lot, down to around 300 bytes, but that is
still huge considering BPF's 512-byte stack limit.
have you tried using a per-cpu array of one element with a large value
instead of the stack?
Yes, we now store the flow key in a one-element per-CPU array.

In the latest verifier, most operations that can be done with a stack
pointer can also be done with a pointer to a map value.
Once the flow key is stored in a map, another eBPF program needs to
use that key to look up the flow table (another map). So we have to
copy the flow key onto the stack first, in order to use it as the
lookup key for the flow table map.

Is there a way to work around it?
Commit d71962f ("bpf: allow map helpers access to map values directly")
removes that limitation from the verifier and should allow you to use
map values as map keys directly. 4.18-rc1 has it.

Thanks
William


Re: [RFC PATCH 00/11] OVS eBPF datapath.

William Tu
 

On Tue, Jul 3, 2018 at 10:56 AM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature; currently
the basic forwarding and tunnel features work, but they are still
under heavy debugging and development. The purpose of this RFC is to
get some early feedback and direction on completing the features of
the existing kernel OVS datapath (net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried about:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. A non-scalable megaflow map is a no-go. I'd like to see a packet
classification algorithm like HiCuts or EffiCuts implemented instead,
since it could be shared by generic bpf, bpfilter, OVS, and likely others.
We did try the decision tree approach using DPDK's ACL library. Lookup
is six times faster than megaflow's tuple space search. However,
update/insertion requires rebuilding/re-balancing the decision tree,
so it is way too slow; I think HiCuts and EffiCuts suffer from the
same issue. Decision tree algorithms scale well for lookups, thanks to
their optimization of tree depth, but not for update/insert/delete
operations.

On customers' systems we see megaflow update/insert rates of around 10
rules/sec; this makes decision trees unusable unless we invent something
to optimize the update/insert time, or add incremental updates to these
decision tree algorithms.
is this a typo? you probably meant 10K rule updates a second?
I mean "new" rules being added at 10 rules/sec.
The update rate might be much higher.

Last time I dealt with these algorithms we had 100K ACL updates a
second. It was an important metric that we were optimizing for.
I'm pretty sure the '*cuts' algos do many thousands per second unoptimized.
When adding a new rule, do these algorithms require rebuilding the
entire tree?

In our evaluation, updating an existing entry in the decision tree
performs OK, because it amounts to lookup-and-replace: lookup is fast
and the update is an atomic swap. But inserting a new rule is slow,
because it requires rebuilding the tree from all existing rules.
Since we see new rules being added at 10 per second, we are
constantly rebuilding the entire tree.

If the tree holds 100k rules, a rebuild takes around 2 seconds with the
DPDK ACL library. Without an incremental algorithm, adding one new rule
triggers a rebuild over all 100k rules, which is too slow.

Reading through HyperCuts and EffiCuts, I'm not sure how they support
incrementally adding a new rule without rebuilding the entire tree.
http://ccr.sigcomm.org/online/files/p207.pdf
http://cseweb.ucsd.edu/~susingh/papers/hyp-sigcomm03.pdf

The HyperCuts paper says:
"A fast update algorithm can also be implemented; however we do not
go into the details of incremental update in this paper"


c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to the stack. The flow key OVS uses (struct sw_flow_key)
is 464 bytes. We trim it a lot, down to around 300 bytes, but that is
still huge considering BPF's 512-byte stack limit.
have you tried using a per-cpu array of one element with a large value
instead of the stack?
Yes, we now store the flow key in a one-element per-CPU array.

In the latest verifier, most operations that can be done with a stack
pointer can also be done with a pointer to a map value.
Once the flow key is stored in a map, another eBPF program needs to
use that key to look up the flow table (another map). So we have to
copy the flow key onto the stack first, in order to use it as the
lookup key for the flow table map.

Is there a way to work around it?
Thanks
William


Re: [RFC PATCH 00/11] OVS eBPF datapath.

Alexei Starovoitov
 

On Thu, Jun 28, 2018 at 07:19:35AM -0700, William Tu wrote:
Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature; currently
the basic forwarding and tunnel features work, but they are still
under heavy debugging and development. The purpose of this RFC is to
get some early feedback and direction on completing the features of
the existing kernel OVS datapath (net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried about:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. A non-scalable megaflow map is a no-go. I'd like to see a packet
classification algorithm like HiCuts or EffiCuts implemented instead,
since it could be shared by generic bpf, bpfilter, OVS, and likely others.
We did try the decision tree approach using DPDK's ACL library. Lookup
is six times faster than megaflow's tuple space search. However,
update/insertion requires rebuilding/re-balancing the decision tree,
so it is way too slow; I think HiCuts and EffiCuts suffer from the
same issue. Decision tree algorithms scale well for lookups, thanks to
their optimization of tree depth, but not for update/insert/delete
operations.

On customers' systems we see megaflow update/insert rates of around 10
rules/sec; this makes decision trees unusable unless we invent something
to optimize the update/insert time, or add incremental updates to these
decision tree algorithms.
is this a typo? you probably meant 10K rule updates a second?
Last time I dealt with these algorithms we had 100K ACL updates a
second. It was an important metric that we were optimizing for.
I'm pretty sure the '*cuts' algos do many thousands per second unoptimized.

c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to the stack. The flow key OVS uses (struct sw_flow_key)
is 464 bytes. We trim it a lot, down to around 300 bytes, but that is
still huge considering BPF's 512-byte stack limit.
have you tried using a per-cpu array of one element with a large value
instead of the stack?
In the latest verifier, most operations that can be done with a stack
pointer can also be done with a pointer to a map value.


Re: Verifier error: variable stack access var_off

Teng Qin
 

First, note that bpf_probe_read_str() appends an extra '\0' to the string it reads, so your max should be sizeof(data.argv) - 1 instead, in order for data.argv[len] = ' ' to work (from the verifier's perspective; logically you don't need that extra delimiter).

Then, Yonghong had a patch a few months ago addressing a very similar issue. See the example in the patch series
"bpf: improve verifier ARG_CONST_SIZE_OR_ZERO semantics".
Does your kernel have those patches?

However, even with all of those, the data.argv[len] = ' ' part still fails with something about the stack offset not being fixed. I will debug more to see how to fix that. For now, you can use a per-CPU array of size 1 for the data instead of allocating it on the stack. The following works for me:
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>
 
#define ARGSIZE  128
 
struct data_t {
    char argv[ARGSIZE];
};
 
BPF_PERF_OUTPUT(events);
BPF_PERCPU_ARRAY(mem, struct data_t, 1);
//
// Here's what I'm trying to do. Let's say this has:
//   __argv[0] = "ls"
//   __argv[1] = "-l"
// I'm trying to create a buffer with "ls -l", by doing bpf_probe_read_str() for
// each element into the buffer, while keeping track of the length of
// the previous read so I can insert a space delimiter at that offset,
// and begin the next read after the delimiter.
//
int on_event(struct pt_regs *ctx,
    const char __user *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
{
    int zero = 0;
    struct data_t* data = mem.lookup(&zero);
    if (!data)
      return 0;
 
    uint64_t max = sizeof(data->argv) - 1;
    const char *argp = NULL;
    bpf_probe_read(&argp, sizeof(argp), (void *)&__argv[0]);
    uint64_t len = bpf_probe_read_str(&(data->argv), max, argp);
    len &= 0xffffffff; // to avoid: "math between fp pointer and register errs"
    bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid
 
    data->argv[len] = ' ';
 
    events.perf_submit(ctx, data, len);
    return 0;
}


Verifier error: variable stack access var_off

Brendan Gregg
 

I'm hoping someone knows a workaround here.

I have a char buf[128] and I'd like to write to arbitrary offsets, but
keep hitting this error. Any workaround? I've included a sample bcc
program below, which has a block comment as to what I'm trying to do
(join an argv[]). Thanks,

Brendan

---execsnoop2.py---
#!/usr/bin/python
# From execsnoop (bcc/eBPF).

from __future__ import print_function
from bcc import BPF
from bcc.utils import ArgString, printb
import bcc.utils as utils
import argparse
import ctypes as ct
import re
import time

# arguments
examples = """examples:
./execsnoop # trace all exec() syscalls
"""
parser = argparse.ArgumentParser(
    description="Trace exec() syscalls",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
args = parser.parse_args()

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>
#include <linux/fs.h>

#define ARGSIZE 128

struct data_t {
char argv[ARGSIZE];
};

BPF_PERF_OUTPUT(events);

//
// Here's what I'm trying to do. Let's say this has:
// __argv[0] = "ls"
// __argv[1] = "-l"
// I'm trying to create a buffer with "ls -l", by doing bpf_probe_read_str() for
// each element into the buffer, while keeping track of the length of
// the previous read so I can insert a space delimiter at that offset,
// and begin the next read after the delimiter.
//
int syscall__execve(struct pt_regs *ctx,
    const char __user *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
{
    struct data_t data = {};
    uint64_t max = sizeof(data.argv);
    const char *argp = NULL;
    bpf_probe_read(&argp, sizeof(argp), (void *)&__argv[0]);
    uint64_t len = bpf_probe_read_str(&data.argv, max, argp);
    len &= 0xffffffff; // to avoid: "math between fp pointer and register errs"
    bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid

    data.argv[len] = ' ';
    // XXX this fails with:
    // "variable stack access var_off=(0x0; 0xffffffff) off=-128 size=1"
    // how do I fix this?

    // events.perf_submit(ctx, &data, len);
    // XXX this fails with:
    // "R5 min value is negative, either use unsigned or 'var &= const'"
    // how do I fix this? In the meantime, I'm passing the whole buffer out:
    events.perf_submit(ctx, &data, sizeof(data.argv));
    return 0;
}
"""

if args.ebpf:
    print(bpf_text)
    exit()

# initialize BPF
b = BPF(text=bpf_text)
execve_fnname = b.get_syscall_fnname("execve")
b.attach_kprobe(event=execve_fnname, fn_name="syscall__execve")

ARGSIZE = 128 # should match #define in C above

class Data(ct.Structure):
    _fields_ = [
        ("argv", ct.c_char * ARGSIZE),
    ]

print("running")

# process event
def print_event(cpu, data, size):
    event = ct.cast(data, ct.POINTER(Data)).contents
    printb(b"%s" % event.argv)

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    b.perf_buffer_poll()
---execsnoop2.py---


Re: [RFC PATCH 00/11] OVS eBPF datapath.

William Tu
 

Hi Alexei,

Thanks a lot for the feedback!

On Wed, Jun 27, 2018 at 8:00 PM, Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature; currently
the basic forwarding and tunnel features work, but they are still
under heavy debugging and development. The purpose of this RFC is to
get some early feedback and direction on completing the features of
the existing kernel OVS datapath (net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried about:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. A non-scalable megaflow map is a no-go. I'd like to see a packet
classification algorithm like HiCuts or EffiCuts implemented instead,
since it could be shared by generic bpf, bpfilter, OVS, and likely others.
We did try the decision tree approach using DPDK's ACL library. Lookup
is six times faster than megaflow's tuple space search. However,
update/insertion requires rebuilding/re-balancing the decision tree,
so it is way too slow; I think HiCuts and EffiCuts suffer from the
same issue. Decision tree algorithms scale well for lookups, thanks to
their optimization of tree depth, but not for update/insert/delete
operations.

On customers' systems we see megaflow update/insert rates of around 10
rules/sec; this makes decision trees unusable unless we invent something
to optimize the update/insert time, or add incremental updates to these
decision tree algorithms.
Now my backup plan is to implement megaflow in BPF.

B. Instead of helpers that interface with conntrack the way OVS did,
I'd prefer a generic conntrack mechanism that can be used from XDP too.
OK. We will work on this direction.

c. Verifier limitation.
Not sure what limitations you're concerned about.
Mostly related to the stack. The flow key OVS uses (struct sw_flow_key)
is 464 bytes. We trim it a lot, down to around 300 bytes, but that is
still huge considering BPF's 512-byte stack limit.

We can always break a large program up and tail-call, but sometimes a
register is spilled onto the stack, and when it is restored its verifier
state is gone and verification fails. This is more difficult for us to
work around.

Below is an example:
----
at 203: r7, a bounded scalar, is stored on the stack at (r10 - 248)
at 250: r2 reads (r10 - 248) back
at 251: the verifier fails

from 27 to 201: R0=map_value(id=0,off=0,ks=4,vs=4352,imm=0)
R7=inv(id=0,umax_value=31,var_off=(0x0; 0x1f))
R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
201: (7b) *(u64 *)(r10 -256) = r0
202: (27) r7 *= 136
203: (7b) *(u64 *)(r10 -248) = r7
204: (bf) r6 = r0
205: (0f) r6 += r7
206: (b7) r8 = 2
207: (15) if r6 == 0x0 goto pc+93
R0=map_value(id=0,off=0,ks=4,vs=4352,imm=0)
R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=inv(id=0,umax_value=4216,var_off=(0x0; 0x1ff8)) R8=inv2
R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1 fp-256=map_value
208: (b7) r1 = 681061
209: (63) *(u32 *)(r10 -200) = r1
210: (18) r1 = 0x6b73616d20746573
212: (7b) *(u64 *)(r10 -208) = r1
213: (bf) r1 = r10
214: (07) r1 += -208
215: (b7) r2 = 12
216: (85) call bpf_trace_printk#6
217: (bf) r7 = r6
218: (07) r7 += 8
219: (61) r1 = *(u32 *)(r6 +8)
R0=inv(id=0) R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7_w=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
220: (15) if r1 == 0x7 goto pc+82
R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0;
0xffffffff)) R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
221: (55) if r1 != 0x4 goto pc+228
R0=inv(id=0) R1=inv4
R6=map_value(id=0,off=0,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value
222: (61) r1 = *(u32 *)(r9 +80)
223: (7b) *(u64 *)(r10 -264) = r1
224: (61) r6 = *(u32 *)(r9 +76)
225: (b7) r1 = 0
226: (73) *(u8 *)(r10 -198) = r1
227: (b7) r1 = 2674
228: (6b) *(u16 *)(r10 -200) = r1
229: (18) r1 = 0x6568746520746573
231: (7b) *(u64 *)(r10 -208) = r1
232: (bf) r1 = r10
233: (07) r1 += -208
234: (b7) r2 = 11
235: (85) call bpf_trace_printk#6
236: (bf) r1 = r6
237: (07) r1 += 14
238: (79) r2 = *(u64 *)(r10 -264)
239: (2d) if r1 > r2 goto pc+61
R0=inv(id=0) R1=pkt(id=0,off=14,r=14,imm=0)
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
240: (71) r1 = *(u8 *)(r7 +10)
R0=inv(id=0) R1_w=pkt(id=0,off=14,r=14,imm=0)
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
241: (73) *(u8 *)(r6 +0) = r1
242: (71) r1 = *(u8 *)(r7 +11)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
243: (73) *(u8 *)(r6 +1) = r1
244: (71) r1 = *(u8 *)(r7 +12)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
245: (73) *(u8 *)(r6 +2) = r1
246: (71) r1 = *(u8 *)(r7 +13)
R0=inv(id=0) R1_w=inv(id=0,umax_value=255,var_off=(0x0; 0xff))
R2=pkt_end(id=0,off=0,imm=0) R6=pkt(id=0,off=0,r=14,imm=0)
R7=map_value(id=0,off=8,ks=4,vs=4352,umax_value=4216,var_off=(0x0;
0x1ff8)) R8=inv2 R9=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
fp-256=map_value fp-264=pkt_end
247: (73) *(u8 *)(r6 +3) = r1
248: (79) r4 = *(u64 *)(r10 -256)
249: (bf) r1 = r4
250: (79) r2 = *(u64 *)(r10 -248)
251: (0f) r1 += r2
math between map_value pointer and register with unbounded min value
is not allowed


Re: [RFC PATCH 00/11] OVS eBPF datapath.

Alexei Starovoitov
 

On Sat, Jun 23, 2018 at 05:16:32AM -0700, William Tu wrote:

Discussion
==========
We are still actively working on finishing the feature; currently
the basic forwarding and tunnel features work, but they are still
under heavy debugging and development. The purpose of this RFC is to
get some early feedback and direction on completing the features of
the existing kernel OVS datapath (net/openvswitch/*).
Thank you for sharing the patches.

Three major issues we are worried about:
a. Megaflow support in BPF.
b. Connection Tracking support in BPF.
my opinion on the above two didn't change.
To recap:
A. A non-scalable megaflow map is a no-go. I'd like to see a packet
classification algorithm like HiCuts or EffiCuts implemented instead,
since it could be shared by generic bpf, bpfilter, OVS, and likely others.
B. Instead of helpers that interface with conntrack the way OVS did,
I'd prefer a generic conntrack mechanism that can be used from XDP too.

c. Verifier limitation.
Not sure what limitations you're concerned about.


Re: minutes: IO Visort TSC/Dev Meeting

Brenden Blanco
 

On Wed, Jun 27, 2018 at 4:02 PM Brenden Blanco <bblanco@...> wrote:

Hi All,

As usual, here are my notes from the meeting today.
s/Visort/Visor/


-Brenden

=== Discussion ===
Kenny:
How is transition to groups.io mailer?
- no complaints

Yonghong:
- rewriter improvements
- lot of new corner cases being worked on
- some python3 improvements
- request to enable tests

Brenden:
- to implement fedora 28, ubuntu 18 buildbots
- python3 testing support in test cases

Saeed:
- XDP metadata RFC looking for feedback
- xdp programs currently ask for specific flags, cross reference with netdev
supported flags, fail otherwise
- is this behavior acceptable?
- pro: better than detecting at runtime and silently failing
- con: flag checking prior to metadata conversion at packet ingress is
  burdensome to the driver, tightly bound to kernel version
- if metadata at the front of the packet needs to be read first, could be
caching/perf issues as metadata is moved around
- data duplicated, e.g. checksum in 3 places: descriptor+skb+xdp_md

Okash:
- BTF:
- some changes in pahole to convert dwarf function description to btf

Yifeng:
- OVS BPF rfc is pushed
- Joe planning to review the patches
- conntrack support in progress (adding helper approach)
- would be nice to have a native conntrack implementation
- question: how to handle reassembly?

Daniel:
- firefighting current bpf bugs
- syzcaller generated issues
- instrumented allocator failures
- todo: devmap redirect bug from tailcall


=== Attendees ===
Brenden Blanco
Daniel Borkmann
Jack Jones
Jakub Kicinski
Jesper Brouer
Kenny Paul
Marco Leogrande
Martin Lau
Nic Viljoen
Quentin Monnet
Saeed
Yifeng Sun
Joe Stringer
Yonghong Song
David Beckett
John F
Neerav Parikh


minutes: IO Visort TSC/Dev Meeting

Brenden Blanco
 

Hi All,

As usual, here are my notes from the meeting today.

-Brenden

=== Discussion ===
Kenny:
How is transition to groups.io mailer?
- no complaints

Yonghong:
- rewriter improvements
- lot of new corner cases being worked on
- some python3 improvements
- request to enable tests

Brenden:
- to implement fedora 28, ubuntu 18 buildbots
- python3 testing support in test cases

Saeed:
- XDP metadata RFC looking for feedback
- xdp programs currently ask for specific flags, cross reference with netdev
supported flags, fail otherwise
- is this behavior acceptable?
- pro: better than detecting at runtime and silently failing
- con: flag checking prior to metadata conversion at packet ingress is
  burdensome to the driver, tightly bound to kernel version
- if metadata at the front of the packet needs to be read first, could be
caching/perf issues as metadata is moved around
- data duplicated, e.g. checksum in 3 places: descriptor+skb+xdp_md

Okash:
- BTF:
- some changes in pahole to convert dwarf function description to btf

Yifeng:
- OVS BPF rfc is pushed
- Joe planning to review the patches
- conntrack support in progress (adding helper approach)
- would be nice to have a native conntrack implementation
- question: how to handle reassembly?

Daniel:
- firefighting current bpf bugs
- syzkaller-generated issues
- instrumented allocator failures
- todo: devmap redirect bug from tailcall


=== Attendees ===
Brenden Blanco
Daniel Borkmann
Jack Jones
Jakub Kicinski
Jesper Brouer
Kenny Paul
Marco Leogrande
Martin Lau
Nic Viljoen
Quentin Monnet
Saeed
Yifeng Sun
Joe Stringer
Yonghong Song
David Beckett
John F
Neerav Parikh


reminder: IO Visor TSC/Dev Meeting

Brenden Blanco
 

Please join us tomorrow for our bi-weekly call. As usual, this meeting is
open to everybody and completely optional.
You might be interested to join if:
You want to know what is going on in BPF land
You are doing something interesting yourself with BPF and would like to
share
You want to know what the heck BPF is

=== IO Visor Dev/TSC Meeting ===

Every 2 weeks on Wednesday, from Wednesday, January 25, 2017, to no end date
11:00 am  |  Pacific Daylight Time (San Francisco, GMT-07:00)  |  30 min

https://bluejeans.com/568677804/


[RFC PATCH 11/11] vagrant: add ebpf support using ubuntu/bionic

William Tu
 

VAGRANT_VAGRANTFILE=Vagrantfile-eBPF vagrant up

Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
---
Makefile.am | 1 +
Vagrantfile-eBPF | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
create mode 100644 Vagrantfile-eBPF

diff --git a/Makefile.am b/Makefile.am
index ec1fc53b1060..d26c765a285a 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -86,6 +86,7 @@ EXTRA_DIST = \
$(MAN_ROOTS) \
Vagrantfile \
Vagrantfile-FreeBSD \
+ Vagrantfile-eBPF \
.mailmap
bin_PROGRAMS =
sbin_PROGRAMS =
diff --git a/Vagrantfile-eBPF b/Vagrantfile-eBPF
new file mode 100644
index 000000000000..7b9be32b8f03
--- /dev/null
+++ b/Vagrantfile-eBPF
@@ -0,0 +1,99 @@
+# -*- mode: ruby -*-
+# vi: set ft=ruby :
+
+$bootstrap = <<SCRIPT
+ pwd
+ apt-get update
+ apt-get -y install \
+ build-essential dpkg-dev lintian devscripts fakeroot \
+ debhelper dh-autoreconf uuid-runtime \
+ autoconf automake libtool \
+ python-all python-twisted-core python-twisted-conch \
+ xdg-utils groff graphviz netcat curl \
+ wget python-six ethtool \
+ libcap-ng-dev libssl-dev python-dev openssl \
+ python-pyftpdlib python-flake8 python-tftpy \
+ linux-headers-`uname -r`
+ apt-get install -y cmake libbison-dev bison flex bc libelf-dev
+ apt-get install -y libmnl-dev gcc-multilib libc6-dev-i386 pkg-config
+SCRIPT
+
+$install_iproute2 = <<SCRIPT
+ pwd
+ mkdir -p build
+ cd build
+ rm -rf iproute2
+ git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2.git && \
+ cd iproute2 && \
+ ./configure && \
+ make -j `getconf _NPROCESSORS_ONLN` && make install
+SCRIPT
+
+$install_llvm = <<SCRIPT
+ pwd
+ cd build
+ curl -Ssl -o clang+llvm.tar.xz http://releases.llvm.org/3.8.1/clang+llvm-3.8.1-x86_64-linux-gnu-ubuntu-16.04.tar.xz
+ tar -C /usr/local -xJf ./clang+llvm.tar.xz || true
+ mv /usr/local/clang+llvm-3.8.1-x86_64-linux-gnu-ubuntu-16.04 /usr/local/clang+llvm || true
+ rm -f clang+llvm.tar.xz
+ export PATH="/usr/local/clang+llvm/bin:$PATH"
+ llc --version
+ clang --version
+SCRIPT
+
+$install_libbpf = <<SCRIPT
+ pwd
+ cd build
+ rm -rf linux
+ git clone --branch v4.15 --depth 1 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ cd linux && make defconfig && make -C tools/lib/bpf/ && cd tools/lib/bpf/ && make install
+SCRIPT
+
+$build_ovs = <<SCRIPT
+ cd /home/vagrant/ovs
+ pwd
+ export PATH="/usr/local/clang+llvm/bin:$PATH"
+ which clang
+ which llc
+ pwd
+ make distclean
+ ./boot.sh
+ sudo ./configure --with-bpf=/home/vagrant/build/linux/tools/
+ make
+SCRIPT
+
+$ovs_check = <<SCRIPT
+ pwd
+ cd /home/vagrant/ovs
+ sudo make check TESTSUITEFLAGS='1'
+SCRIPT
+
+$sparse_check = <<SCRIPT
+ pwd
+ cd /home/vagrant/ovs
+ touch lib/dpif-bpf-odp.c
+ touch lib/dpif-bpf.c
+ make C=1 CF="-Wsparse-all -D__CHECKER__ -D__CHECK_ENDIAN__ -Wbitwise" lib/dpif-bpf-odp.o
+ make C=1 CF="-Wsparse-all -D__CHECKER__ -D__CHECK_ENDIAN__ -Wbitwise" lib/dpif-bpf.o
+SCRIPT
+
+$ovs_check_bpf = <<SCRIPT
+ pwd
+ cd /home/vagrant/ovs
+ export LD_LIBRARY_PATH=/home/vagrant/build/linux/tools/lib/bpf/:$LD_LIBRARY_PATH
+ export PATH="/usr/local/clang+llvm/bin:$PATH"
+ objdump -h /home/vagrant/ovs/bpf/datapath.o
+ sudo make check-bpf TESTSUITEFLAGS='1'
+SCRIPT
+
+Vagrant.configure("2") do |config|
+ config.vm.box = "ubuntu/bionic64"
+ config.vm.synced_folder ".", "/home/vagrant/ovs"
+
+ config.vm.provision "bootstrap", type: "shell", inline: $bootstrap
+ config.vm.provision "install_iproute2", type: "shell", inline: $install_iproute2
+ config.vm.provision "install_llvm", type: "shell", inline: $install_llvm
+ config.vm.provision "install_libbpf", type: "shell", inline: $install_libbpf
+ config.vm.provision "build_ovs", type: "shell", inline: $build_ovs
+ config.vm.provision "ovs_check", type: "shell", inline: $ovs_check
+end
--
2.7.4


[RFC PATCH 10/11] tests: Add "make check-bpf" traffic target.

William Tu
 

Add a separate test file tests/system-bpf-traffic.at for
bpf testing. The test cases are a subset of the existing
system-traffic.at, and with additional bpf-specific tests.
When test passes, the log file is saved under:
tests/system-bpf-testsuite.dir/<ID>/

Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Signed-off-by: Joe Stringer <joe@...>
Co-authored-by: Joe Stringer <joe@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
tests/.gitignore | 1 +
tests/automake.mk | 30 +-
tests/ofproto-macros.at | 8 +
tests/system-bpf-macros.at | 112 ++++++
tests/system-bpf-testsuite.at | 25 ++
tests/system-bpf-testsuite.patch | 10 +
tests/system-bpf-traffic.at | 809 +++++++++++++++++++++++++++++++++++++++
7 files changed, 994 insertions(+), 1 deletion(-)
create mode 100644 tests/system-bpf-macros.at
create mode 100644 tests/system-bpf-testsuite.at
create mode 100644 tests/system-bpf-testsuite.patch
create mode 100644 tests/system-bpf-traffic.at

diff --git a/tests/.gitignore b/tests/.gitignore
index 3e2ddf2e9e5d..98890e011afc 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -11,6 +11,7 @@
/ovs-pki.log
/pki/
/system-kmod-testsuite
+/system-bpf-testsuite
/system-userspace-testsuite
/system-offloads-testsuite
/test-aes128
diff --git a/tests/automake.mk b/tests/automake.mk
index 52ed53fd16d4..a6a57acea394 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -4,15 +4,18 @@ EXTRA_DIST += \
$(SYSTEM_TESTSUITE_AT) \
$(SYSTEM_KMOD_TESTSUITE_AT) \
$(SYSTEM_USERSPACE_TESTSUITE_AT) \
+ $(SYSTEM_BPF_TESTSUITE_AT) \
$(SYSTEM_OFFLOADS_TESTSUITE_AT) \
$(TESTSUITE) \
$(SYSTEM_KMOD_TESTSUITE) \
$(SYSTEM_USERSPACE_TESTSUITE) \
+ $(SYSTEM_BPF_TESTSUITE) \
$(SYSTEM_OFFLOADS_TESTSUITE) \
tests/atlocal.in \
$(srcdir)/package.m4 \
$(srcdir)/tests/testsuite \
- $(srcdir)/tests/testsuite.patch
+ $(srcdir)/tests/testsuite.patch \
+ $(srcdir)/tests/system-bpf-testsuite.patch

COMMON_MACROS_AT = \
tests/ovsdb-macros.at \
@@ -110,6 +113,11 @@ SYSTEM_KMOD_TESTSUITE_AT = \
tests/system-kmod-testsuite.at \
tests/system-kmod-macros.at

+SYSTEM_BPF_TESTSUITE_AT = \
+ tests/system-bpf-testsuite.at \
+ tests/system-bpf-macros.at \
+ tests/system-bpf-traffic.at
+
SYSTEM_USERSPACE_TESTSUITE_AT = \
tests/system-userspace-testsuite.at \
tests/system-ovn.at \
@@ -134,6 +142,8 @@ TESTSUITE = $(srcdir)/tests/testsuite
TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch
SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite
SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite
+SYSTEM_BPF_TESTSUITE = $(srcdir)/tests/system-bpf-testsuite
+BPF_TESTSUITE_PATCH = $(srcdir)/tests/system-bpf-testsuite.patch
SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite
DISTCLEANFILES += tests/atconfig tests/atlocal

@@ -174,6 +184,15 @@ check-lcov: all $(check_DATA) clean-lcov
lcov $(LCOV_OPTS) -o tests/lcov/coverage.info
genhtml $(GENHTML_OPTS) -o tests/lcov tests/lcov/coverage.info
@echo "coverage report generated at tests/lcov/index.html"
+
+check-bpf-lcov: all $(check_DATA) clean-lcov
+ find . -name '*.gcda' | xargs -n1 rm -f
+ -set $(SHELL) '$(SYSTEM_BPF_TESTSUITE)' -C tests AUTOTEST_PATH=$(AUTOTEST_PATH) $(TESTSUITEFLAGS); \
+ "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+ $(MKDIR_P) tests/lcov
+ lcov $(LCOV_OPTS) -o tests/lcov/coverage.info
+ genhtml $(GENHTML_OPTS) -o tests/lcov tests/lcov/coverage.info
+ @echo "coverage report generated at tests/lcov/index.html"

# valgrind support

@@ -254,6 +273,10 @@ check-system-userspace: all
set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)

+check-bpf: all
+ set $(SHELL) '$(SYSTEM_BPF_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
+ "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
+
check-offloads: all
set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \
"$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck)
@@ -282,6 +305,11 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP
$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
$(AM_V_at)mv $@.tmp $@

+$(SYSTEM_BPF_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_BPF_TESTSUITE_AT) $(BPF_TESTSUITE_PATCH) $(COMMON_MACROS_AT)
+ $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
+ $(AM_V_at)mv $@.tmp $@
+ $(AM_V_at)patch -p1 $@ tests/system-bpf-testsuite.patch
+
$(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT)
$(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at
$(AM_V_at)mv $@.tmp $@
diff --git a/tests/ofproto-macros.at b/tests/ofproto-macros.at
index c8bfe5b5c262..6c110b7cfc81 100644
--- a/tests/ofproto-macros.at
+++ b/tests/ofproto-macros.at
@@ -335,6 +335,8 @@ m4_define([_OVS_VSWITCHD_START],
AT_CAPTURE_FILE([ovs-vswitchd.log])
on_exit "kill_ovs_vswitchd `cat ovs-vswitchd.pid`"
AT_CHECK([[sed < stderr '
+/bpf|INFO|/d
+/odp_util.*ERR/d
/ovs_numa|INFO|Discovered /d
/vlog|INFO|opened log file/d
/vswitchd|INFO|ovs-vswitchd (Open vSwitch)/d
@@ -344,6 +346,7 @@ m4_define([_OVS_VSWITCHD_START],
/ofproto|INFO|datapath ID changed to fedcba9876543210/d
/dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable/d
/netdev: Flow API/d
+/Re-using preloaded BPF datapath/d
/tc: Using policy/d']])
])

@@ -395,6 +398,11 @@ check_logs () {
sed -n "$1
/reset by peer/d
/Broken pipe/d
+/bpf.*|WARN/d
+/bpf.*|ERR/d
+/dpif.*|WARN/d
+/odp_util.*revalidator.*|ERR/d
+/ofproto_dpif_upcall.*|WARN/d
/timeval.*Unreasonably long [[0-9]]*ms poll interval/d
/timeval.*faults: [[0-9]]* minor, [[0-9]]* major/d
/timeval.*disk: [[0-9]]* reads, [[0-9]]* writes/d
diff --git a/tests/system-bpf-macros.at b/tests/system-bpf-macros.at
new file mode 100644
index 000000000000..23c170d73119
--- /dev/null
+++ b/tests/system-bpf-macros.at
@@ -0,0 +1,112 @@
+# _ADD_BR([name])
+#
+# Expands into the proper ovs-vsctl commands to create a bridge with the
+# appropriate type and properties
+m4_define([_ADD_BR], [[add-br $1 -- set Bridge $1 datapath_type="bpf" protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 fail-mode=secure ]])
+
+# OVS_TRAFFIC_VSWITCHD_START([vsctl-args], [vsctl-output], [=override])
+#
+# Creates a database and starts ovsdb-server, starts ovs-vswitchd
+# connected to that database, calls ovs-vsctl to create a bridge named
+# br0 with predictable settings, passing 'vsctl-args' as additional
+# commands to ovs-vsctl. If 'vsctl-args' causes ovs-vsctl to provide
+# output (e.g. because it includes "create" commands) then 'vsctl-output'
+# specifies the expected output after filtering through uuidfilt.pl.
+m4_define([OVS_TRAFFIC_VSWITCHD_START],
+ [
+ export OVS_PKGDATADIR=$(pwd)
+ #OVS_WAIT_WHILE([ip link show ovs-system])
+ umount /sys/fs/bpf/
+ AT_CHECK([mount -t bpf none /sys/fs/bpf])
+ AT_CHECK([mkdir -p /sys/fs/bpf/ovs])
+ _OVS_VSWITCHD_START([--disable-system])
+ dnl Add bridges, ports, etc.
+ ip link del br0
+ #OVS_WAIT_WHILE([ip link show br0])
+ AT_CHECK([ovs-vsctl -- _ADD_BR([br0]) -- $1 m4_if([$2], [], [], [| ${PERL} $srcdir/uuidfilt.pl])], [0], [$2])
+ on_exit 'ovs-vsctl del-br br0'
+ on_exit 'ip link del ovs-system'
+ on_exit 'tail -500 /sys/kernel/debug/tracing/trace > trace'
+])
+
+# OVS_TRAFFIC_VSWITCHD_STOP([WHITELIST], [extra_cmds])
+#
+# Gracefully stops ovs-vswitchd and ovsdb-server, checking their log files
+# for messages with severity WARN or higher and signaling an error if any
+# is present. The optional WHITELIST may contain shell-quoted "sed"
+# commands to delete any warnings that are actually expected, e.g.:
+#
+# OVS_TRAFFIC_VSWITCHD_STOP(["/expected error/d"])
+#
+# 'extra_cmds' are shell commands to be executed after OVS_VSWITCHD_STOP() is
+# invoked. They can be used to perform additional cleanups such as name space
+# removal.
+m4_define([OVS_TRAFFIC_VSWITCHD_STOP],
+ [OVS_VSWITCHD_STOP([$1])
+ AT_CHECK([:; $2])
+ AT_CHECK([umount /sys/fs/bpf])
+ AT_CAPTURE_FILE([trace])
+ ])
+
+# CONFIGURE_VETH_OFFLOADS([VETH])
+#
+# Disable TX offloads for veths. The userspace datapath uses the AF_PACKET
+# socket to receive packets for veths. Unfortunately, the AF_PACKET socket
+# doesn't play well with offloads:
+# 1. GSO packets are received without segmentation and therefore discarded.
+# 2. Packets with offloaded partial checksum are received with the wrong
+# checksum, therefore discarded by the receiver.
+#
+# By disabling tx offloads in the non-OVS side of the veth peer we make sure
+# that the AF_PACKET socket will not receive bad packets.
+#
+# This is a workaround, and should be removed when offloads are properly
+# supported in netdev-linux.
+m4_define([CONFIGURE_VETH_OFFLOADS],
+ [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore])]
+)
+
+# CHECK_CONNTRACK()
+#
+# Perform requirements checks for running conntrack tests.
+#
+m4_define([CHECK_CONNTRACK],
+ [AT_SKIP_IF([test $HAVE_PYTHON = no])]
+)
+
+# CHECK_CONNTRACK_ALG()
+#
+# Perform requirements checks for running conntrack ALG tests. The userspace
+# datapath doesn't support ALGs yet, so skip the tests.
+#
+m4_define([CHECK_CONNTRACK_ALG],
+[
+ AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_FRAG()
+#
+# Perform requirements checks for running conntrack fragmentation tests.
+# The userspace doesn't support fragmentation yet, so skip the tests.
+m4_define([CHECK_CONNTRACK_FRAG],
+[
+ AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_LOCAL_STACK()
+#
+# Perform requirements checks for running conntrack tests with local stack.
+# While the kernel connection tracker automatically passes all the connection
+# tracking state from an internal port to the Open vSwitch kernel module, there
+# is simply no way of doing that with the userspace datapath, so skip the tests.
+m4_define([CHECK_CONNTRACK_LOCAL_STACK],
+[
+ AT_SKIP_IF([:])
+])
+
+# CHECK_CONNTRACK_NAT()
+#
+# Perform requirements checks for running conntrack NAT tests. The userspace
+# datapath supports NAT.
+#
+m4_define([CHECK_CONNTRACK_NAT])
diff --git a/tests/system-bpf-testsuite.at b/tests/system-bpf-testsuite.at
new file mode 100644
index 000000000000..54ebbcba17dc
--- /dev/null
+++ b/tests/system-bpf-testsuite.at
@@ -0,0 +1,25 @@
+AT_INIT
+
+AT_COPYRIGHT([Copyright (c) 2015 Nicira, Inc.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at:
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.])
+
+m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS])
+
+m4_include([tests/ovs-macros.at])
+m4_include([tests/ovsdb-macros.at])
+m4_include([tests/ofproto-macros.at])
+m4_include([tests/system-bpf-macros.at])
+m4_include([tests/system-common-macros.at])
+
+m4_include([tests/system-bpf-traffic.at])
diff --git a/tests/system-bpf-testsuite.patch b/tests/system-bpf-testsuite.patch
new file mode 100644
index 000000000000..94f3771d4ee9
--- /dev/null
+++ b/tests/system-bpf-testsuite.patch
@@ -0,0 +1,10 @@
+--- system-bpf-testsuite 2018-05-31 05:10:16.425135086 -0700
++++ system-bpf-testsuite 2018-05-31 05:13:46.556051030 -0700
+@@ -2369,7 +2369,6 @@
+ else
+ if test -d "$at_group_dir"; then
+ find "$at_group_dir" -type d ! -perm -700 -exec chmod u+rwx \{\} \;
+- rm -fr "$at_group_dir"
+ fi
+ rm -f "$at_test_source"
+ fi
diff --git a/tests/system-bpf-traffic.at b/tests/system-bpf-traffic.at
new file mode 100644
index 000000000000..0044eba1b861
--- /dev/null
+++ b/tests/system-bpf-traffic.at
@@ -0,0 +1,809 @@
+AT_BANNER([BPF datapath-sanity])
+
+AT_SETUP([datapath - basic BPF commands])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-appctl dpif/dump-dps], [0], [dnl
+bpf@br0
+])
+AT_CHECK([ovs-appctl dpif/show], [0], [dnl
+bpf@ovs-bpf: hit:0 missed:0
+ br0:
+ br0 65534/1: (tap)
+])
+AT_CHECK([ovs-appctl dpctl/dump-flows bpf@ovs-bpf], [0], [dnl
+])
+AT_CHECK([ovs-appctl dpif/dump-flows br0], [0], [dnl
+])
+AT_CHECK([ovs-bpfctl show], [0], [stdout])
+
+dnl NOTE: BPF datapath does not support megaflow, so the
+dnl rules below won't match any packet
+AT_CHECK([ovs-appctl dpctl/add-flow bpf@ovs-bpf "in_port(1),eth(),eth_type(0x0806),arp()" 2], [0], [stdout])
+
+AT_CHECK([ovs-appctl dpctl/add-flow bpf@ovs-bpf "in_port(1),eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15),eth_type(0x0800),ipv4(src=35.8.2.41,dst=172.16.0.20,proto=5,tos=0x80,ttl=128,frag=no)" 2], [0], [stdout])
+
+AT_CHECK([ovs-appctl dpctl/add-flow bpf@ovs-bpf "in_port(1),eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15),eth_type(0x86dd),ipv6(src=::1,dst=::2,label=0,proto=6,tclass=0,hlimit=128,frag=no),tcp(src=80,dst=8080)" 2], [0], [stdout])
+
+dnl this will print "receive tunnel port not found" and cause failure
+dnl AT_CHECK([ovs-appctl dpctl/add-flow bpf@ovs-bpf "skb_priority(0),tunnel(tun_id=0x7f10354,src=10.10.10.10,dst=20.20.20.20,ttl=64,flags(csum|key)),skb_mark(0x1234),recirc_id(0),dp_hash(0),in_port(1),eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15)" 2], [0], [stdout])
+
+dnl AT_CHECK([ovs-appctl dpctl/add-flow bpf@ovs-bpf "skb_priority(0x1234),tunnel(tun_id=0xfedcba9876543210,src=10.10.10.10,dst=20.20.20.20,tos=0x8,ttl=64,flags(key)),skb_mark(0),recirc_id(0),dp_hash(0),in_port(1),eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15),eth_type(0x8100),vlan(vid=99,pcp=7),encap(eth_type(0x86dd),ipv6(src=::1,dst=::2,label=0,proto=58,tclass=0,hlimit=128,frag=no),icmpv6(type=136,code=0),nd(target=::3,sll=00:05:06:07:08:09,tll=00:0a:0b:0c:0d:0e))" 2],[0], [stdout])
+
+dnl AT_CHECK([ovs-appctl dpif/del-flows br0], [0], [dnl
+dnl ])
+
+dnl AT_CHECK([ovs-dpctl add-flow bpf@ovs-bpf "in_port(1),eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15),eth_type(0x0800),ipv4(src=35.8.2.41,dst=172.16.0.20,proto=5,tos=0x80,ttl=128,frag=no)" 2], [0], [dnl
+dnl ])
+
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - http between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_START_L7([at_ns1], [http])
+NS_CHECK_EXEC([at_ns0], [wget 10.1.1.2 -t 3 -T 1 --retry-connrefused -v -o wget0.log])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+ADD_VLAN(p0, at_ns0, 100, "10.2.2.1/24")
+ADD_VLAN(p1, at_ns1, 100, "10.2.2.2/24")
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping between two ports on cvlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+ADD_SVLAN(p0, at_ns0, 4094, "10.255.2.1/24")
+ADD_SVLAN(p1, at_ns1, 4094, "10.255.2.2/24")
+
+ADD_CVLAN(p0.4094, at_ns0, 100, "10.2.2.1/24")
+ADD_CVLAN(p1.4094, at_ns1, 100, "10.2.2.2/24")
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping -c 1 10.2.2.2])
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.2.2.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports on vlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
+
+ADD_VLAN(p0, at_ns0, 100, "fc00:1::1/96")
+ADD_VLAN(p1, at_ns1, 100, "fc00:1::2/96")
+
+dnl Linux seems to take a little time to get its IPv6 stack in order. Without
+dnl waiting, we get occasional failures due to the following error:
+dnl "connect: Cannot assign requested address"
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping6 between two ports on cvlan])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "fc00::1/96")
+ADD_VETH(p1, at_ns1, br0, "fc00::2/96")
+
+ADD_SVLAN(p0, at_ns0, 4094, "fc00:ffff::1/96")
+ADD_SVLAN(p1, at_ns1, 4094, "fc00:ffff::2/96")
+
+ADD_CVLAN(p0.4094, at_ns0, 100, "fc00:1::1/96")
+ADD_CVLAN(p1.4094, at_ns1, 100, "fc00:1::2/96")
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00:1::2])
+
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 1600 -q -c 3 -i 0.3 -w 6 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping6 -s 3200 -q -c 3 -i 0.3 -w 6 fc00:1::2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over bond])
+AT_SKIP_IF([echo > /dev/null])
+OVS_TRAFFIC_VSWITCHD_START()
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH_BOND(p1 p2, at_ns1, br0, bond0, lacp=active bond_mode=balance-tcp, "10.1.1.2/24")
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping -c 1 10.1.1.2])
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 1600 -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+NS_CHECK_EXEC([at_ns0], [ping -s 3200 -q -c 3 -i 0.3 -w 2 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan tunnel])
+OVS_CHECK_VXLAN()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+ip link del vxlan_sys_4789
+on_exit 'ip link del vxlan_sys_4789'
+on_exit 'ip link del br-underlay'
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([vxlan], [br0], [at_vxlan0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([vxlan], [at_vxlan1], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+ [id 0 dstport 4789])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over vxlan6 tunnel])
+OVS_CHECK_VXLAN_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+ip link del vxlan_sys_4789
+
+on_exit 'ip link del vxlan_sys_4789'
+on_exit 'ip link del br-underlay'
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([vxlan], [br0], [at_vxlan0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([vxlan], [at_vxlan1], [at_ns0], [fc00::100], [10.1.1.1/24],
+ [id 0 dstport 4789 udp6zerocsumtx udp6zerocsumrx])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 6 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over gre tunnel])
+OVS_CHECK_GRE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24], [options:key=100])
+ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24], [key 100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve tunnel])
+OVS_CHECK_GENEVE()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+ip link del genev_sys_6081
+on_exit 'ip link del genev_sys_6081'
+on_exit 'ip link del br-underlay'
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([geneve], [br0], [at_gnv0], [172.31.1.1], [10.1.1.100/24], [options:key=22])
+ADD_NATIVE_TUNNEL([geneve], [ns_gnv0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+ [vni 22])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 2 172.31.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - ping over geneve6 tunnel])
+OVS_CHECK_GENEVE_UDP6ZEROCSUM()
+
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-underlay])
+
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+ADD_NAMESPACES(at_ns0)
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "fc00::1/64", [], [], "nodad")
+AT_CHECK([ip addr add dev br-underlay "fc00::100/64" nodad])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL6([geneve], [br0], [at_gnv0], [fc00::1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL6([geneve], [ns_gnv0], [at_ns0], [fc00::100], [10.1.1.1/24],
+ [vni 0 udp6zerocsumtx udp6zerocsumrx])
+
+OVS_WAIT_UNTIL([ip netns exec at_ns0 ping6 -c 1 fc00::100])
+
+dnl First, check the underlay
+NS_CHECK_EXEC([at_ns0], [ping6 -q -c 3 -i 0.3 -w 2 fc00::100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+dnl Okay, now check the overlay with different packet sizes
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.100 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - clone action])
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_NAMESPACES(at_ns0, at_ns1, at_ns2)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
+
+AT_CHECK([ovs-vsctl -- set interface ovs-p0 ofport_request=1 \
+ -- set interface ovs-p1 ofport_request=2])
+
+AT_DATA([flows.txt], [dnl
+priority=1 actions=NORMAL
+priority=10 in_port=1,ip,actions=clone(mod_dl_dst(50:54:00:00:00:0a),set_field:192.168.3.3->ip_dst), output:2
+priority=10 in_port=2,ip,actions=clone(mod_dl_src(ae:c6:7e:54:8d:4d),mod_dl_dst(50:54:00:00:00:0b),set_field:192.168.4.4->ip_dst, controller), output:1
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-ofctl monitor br0 65534 invalid_ttl --detach --no-chdir --pidfile 2> ofctl_monitor.log])
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+AT_CHECK([cat ofctl_monitor.log | STRIP_MONITOR_CSUM], [0], [dnl
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+icmp,vlan_tci=0x0000,dl_src=ae:c6:7e:54:8d:4d,dl_dst=50:54:00:00:00:0b,nw_src=10.1.1.2,nw_dst=192.168.4.4,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0 icmp_csum: <skip>
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+AT_SETUP([datapath - mpls actions])
+OVS_TRAFFIC_VSWITCHD_START([_ADD_BR([br1])])
+
+ADD_NAMESPACES(at_ns0, at_ns1)
+
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+ADD_VETH(p1, at_ns1, br1, "10.1.1.2/24")
+
+AT_CHECK([ip link add patch0 type veth peer name patch1])
+on_exit 'ip link del patch0'
+
+AT_CHECK([ip link set dev patch0 up])
+AT_CHECK([ip link set dev patch1 up])
+AT_CHECK([ovs-vsctl add-port br0 patch0])
+AT_CHECK([ovs-vsctl add-port br1 patch1])
+
+AT_DATA([flows.txt], [dnl
+table=0,priority=100,dl_type=0x0800 actions=push_mpls:0x8847,set_mpls_label:3,resubmit(,1)
+table=0,priority=100,dl_type=0x8847,mpls_label=3 actions=pop_mpls:0x0800,resubmit(,1)
+table=0,priority=10 actions=resubmit(,1)
+table=1,priority=10 actions=normal
+])
+
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+AT_CHECK([ovs-ofctl add-flows br1 flows.txt])
+
+NS_CHECK_EXEC([at_ns0], [ping -q -c 3 -i 0.3 -w 6 10.1.1.2 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+NS_CHECK_EXEC([at_ns1], [ping -q -c 3 -i 0.3 -w 6 10.1.1.1 | FORMAT_PING], [0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+AT_SETUP([datapath - basic truncate action])
+AT_SKIP_IF([test $HAVE_NC = no])
+OVS_TRAFFIC_VSWITCHD_START()
+AT_CHECK([ovs-ofctl del-flows br0])
+
+dnl Create p0 and ovs-p0(1)
+ADD_NAMESPACES(at_ns0)
+ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
+NS_CHECK_EXEC([at_ns0], [ip link set dev p0 address e6:66:c1:11:11:11])
+NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
+
+dnl Create p1(3) and ovs-p1(2), packets received from ovs-p1 will appear in p1
+AT_CHECK([ip link add p1 type veth peer name ovs-p1])
+on_exit 'ip link del ovs-p1'
+AT_CHECK([ip link set dev ovs-p1 up])
+AT_CHECK([ip link set dev p1 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p1 -- set interface ovs-p1 ofport_request=2])
+dnl Use p1 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p1 -- set interface p1 ofport_request=3])
+
+dnl Create p2(5) and ovs-p2(4)
+AT_CHECK([ip link add p2 type veth peer name ovs-p2])
+on_exit 'ip link del ovs-p2'
+AT_CHECK([ip link set dev ovs-p2 up])
+AT_CHECK([ip link set dev p2 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=4])
+dnl Use p2 to check the truncated packet
+AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=5])
+
+dnl basic test
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl use this file as payload file for ncat
+AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
+on_exit 'rm -f payload200.bin'
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl packet with truncated size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=100
+])
+dnl packet with original size
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=242
+])
+
+dnl more complicated output actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+in_port=3 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=5 dl_dst=e6:66:c1:22:22:22 actions=drop
+in_port=1 dl_dst=e6:66:c1:22:22:22 actions=output(port=2,max_len=100),output:4,output(port=2,max_len=100),output(port=4,max_len=100),output:2,output(port=4,max_len=200),output(port=2,max_len=65535)
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
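For reference, the expected n_bytes values above follow from summing min(max_len, frame_len) over the clones sent to each port. A minimal sketch of that arithmetic in plain C (function names are illustrative, not part of OVS):

```c
#include <stddef.h>
#include <stdint.h>

/* Length of one cloned output: output(port=N,max_len=M) truncates the
 * frame to M bytes; frames shorter than M pass through unchanged. */
static uint32_t trunc_out_len(uint32_t frame_len, uint32_t max_len)
{
    return frame_len < max_len ? frame_len : max_len;
}

/* Total bytes a port sees across several output clones of one frame. */
static uint32_t sum_outputs(uint32_t frame_len,
                            const uint32_t *max_lens, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += trunc_out_len(frame_len, max_lens[i]);
    }
    return total;
}
```

With the 242-byte frame (ETH 14 + IP 20 + UDP 8 + 200 payload), the clones toward port 2 (100, 100, untruncated, 65535) sum to 684, and those toward port 4 (untruncated, 100, 200) sum to 542, matching the counters checked above.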
+
+dnl SLOW_ACTION: disable kernel datapath truncate support
+dnl Repeat the test above, but exercise the SLOW_ACTION code path
+AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
+
+dnl SLOW_ACTION test1: check datapath actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-appctl ofproto/trace br0 "in_port=1,dl_type=0x800,dl_src=e6:66:c1:11:11:11,dl_dst=e6:66:c1:22:22:22,nw_src=192.168.0.1,nw_dst=192.168.0.2,nw_proto=6,tp_src=8,tp_dst=9"], [0], [stdout])
+AT_CHECK([tail -3 stdout], [0],
+[Datapath actions: trunc(100),3,5,trunc(100),3,trunc(100),5,3,trunc(200),5,trunc(65535),3
+This flow is handled by the userspace slow path because it:
+ - Uses action(s) not supported by datapath.
+])
+
+dnl SLOW_ACTION test2: check actual packet truncate
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 1234 < payload200.bin])
+
+dnl 100 + 100 + 242 + min(65535,242) = 684
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=3" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=684
+])
+
+dnl 242 + 100 + min(242,200) = 542
+AT_CHECK([ovs-ofctl dump-flows br0 table=0 | grep "in_port=5" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=542
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+dnl Create 2 bridges and 2 namespaces to test truncate over
+dnl GRE tunnel:
+dnl br0: overlay bridge
+dnl ns1: connected to br0, with IP 10.1.1.2
+dnl br-underlay: underlay bridge, with IP 172.31.1.100
+dnl ns0: connected to br-underlay (172.31.1.1), tunnel endpoint IP 10.1.1.1
+AT_SETUP([datapath - truncate and output to gre tunnel])
+AT_SKIP_IF([test $HAVE_NC = no])
+OVS_CHECK_GRE()
+OVS_TRAFFIC_VSWITCHD_START()
+
+ADD_BR([br-underlay])
+ADD_NAMESPACES(at_ns0)
+ADD_NAMESPACES(at_ns1)
+AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"])
+AT_CHECK([ovs-ofctl add-flow br-underlay "actions=normal"])
+
+dnl Set up underlay link from host into the namespace using veth pair.
+ADD_VETH(p0, at_ns0, br-underlay, "172.31.1.1/24")
+AT_CHECK([ip addr add dev br-underlay "172.31.1.100/24"])
+AT_CHECK([ip link set dev br-underlay up])
+
+dnl Set up tunnel endpoints on OVS outside the namespace and with a native
+dnl linux device inside the namespace.
+ADD_OVS_TUNNEL([gre], [br0], [at_gre0], [172.31.1.1], [10.1.1.100/24])
+ADD_NATIVE_TUNNEL([gretap], [ns_gre0], [at_ns0], [172.31.1.100], [10.1.1.1/24],
+ [], [address e6:66:c1:11:11:11])
+AT_CHECK([ovs-vsctl -- set interface at_gre0 ofport_request=1])
+NS_CHECK_EXEC([at_ns0], [arp -s 10.1.1.2 e6:66:c1:22:22:22])
+
+dnl Set up (p1 and ovs-p1) at br0
+ADD_VETH(p1, at_ns1, br0, '10.1.1.2/24')
+AT_CHECK([ovs-vsctl -- set interface ovs-p1 ofport_request=2])
+NS_CHECK_EXEC([at_ns1], [ip link set dev p1 address e6:66:c1:22:22:22])
+NS_CHECK_EXEC([at_ns1], [arp -s 10.1.1.1 e6:66:c1:11:11:11])
+
+dnl Set up (p2 and ovs-p2) as loopback for verifying packet size
+AT_CHECK([ip link add p2 type veth peer name ovs-p2])
+on_exit 'ip link del ovs-p2'
+AT_CHECK([ip link set dev ovs-p2 up])
+AT_CHECK([ip link set dev p2 up])
+AT_CHECK([ovs-vsctl add-port br0 ovs-p2 -- set interface ovs-p2 ofport_request=3])
+AT_CHECK([ovs-vsctl add-port br0 p2 -- set interface p2 ofport_request=4])
+
+dnl use this file as payload file for ncat
+AT_CHECK([dd if=/dev/urandom of=payload200.bin bs=200 count=1 2> /dev/null])
+on_exit 'rm -f payload200.bin'
+
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_DATA([flows.txt], [dnl
+priority=99,in_port=1,actions=output(port=2,max_len=100),output(port=3,max_len=100)
+priority=99,in_port=2,udp,actions=output(port=1,max_len=100)
+priority=1,in_port=4,ip,actions=drop
+priority=1,actions=drop
+])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+AT_CHECK([ovs-ofctl del-flows br-underlay])
+AT_DATA([flows-underlay.txt], [dnl
+priority=99,dl_type=0x0800,nw_proto=47,in_port=1,actions=LOCAL
+priority=99,dl_type=0x0800,nw_proto=47,in_port=LOCAL,ip_dst=172.31.1.1/24,actions=1
+priority=1,actions=drop
+])
+
+AT_CHECK([ovs-ofctl add-flows br-underlay flows-underlay.txt])
+
+dnl check tunnel push path, from at_ns1 to at_ns0
+NS_CHECK_EXEC([at_ns1], [nc $NC_EOF_OPT -u 10.1.1.1 1234 < payload200.bin])
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+dnl Before truncation = ETH(14) + IP(20) + UDP(8) + 200 = 242B
+AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=2" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=242
+])
+dnl After truncation = outer ETH(14) + outer IP(20) + GRE(4) + 100 = 138B
+AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=138
+])
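The 138-byte figure comes from adding the GRE encapsulation overhead to the truncated inner frame. A small sketch of the size bookkeeping (constant names are illustrative; this assumes the base 4-byte GRE header with no key/checksum/sequence options, as used here):

```c
#include <stdint.h>

enum {
    ETH_HDR_LEN  = 14,   /* Ethernet header, no VLAN tag */
    IPV4_HDR_LEN = 20,   /* IPv4 header, no options */
    UDP_HDR_LEN  = 8,
    GRE_HDR_LEN  = 4,    /* base GRE header, no optional fields */
};

/* On-wire size of the original UDP frame for a given payload size. */
static uint32_t udp_frame_len(uint32_t payload_len)
{
    return ETH_HDR_LEN + IPV4_HDR_LEN + UDP_HDR_LEN + payload_len;
}

/* On-wire size after GRE encapsulation of an (already truncated)
 * inner frame: outer Ethernet + outer IPv4 + GRE + inner bytes. */
static uint32_t gre_frame_len(uint32_t inner_len)
{
    return ETH_HDR_LEN + IPV4_HDR_LEN + GRE_HDR_LEN + inner_len;
}
```

udp_frame_len(200) gives the 242 bytes counted before encapsulation, and gre_frame_len(100) gives the 138 bytes counted on the underlay after trunc(100).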
+
+dnl check tunnel pop path, from at_ns0 to at_ns1
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 5678 < payload200.bin])
+dnl After truncation = 100 bytes at loopback device p2(4)
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=4" | ofctl_strip], [0], [dnl
+ n_packets=1, n_bytes=100, priority=1,ip,in_port=4 actions=drop
+])
+
+dnl SLOW_ACTION: disable datapath truncate support
+dnl Repeat the test above, but exercise the SLOW_ACTION code path
+AT_CHECK([ovs-appctl dpif/set-dp-features br0 trunc false], [0])
+
+dnl SLOW_ACTION test1: check datapath actions
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+
+dnl SLOW_ACTION test2: check actual packet truncate
+AT_CHECK([ovs-ofctl del-flows br0])
+AT_CHECK([ovs-ofctl add-flows br0 flows.txt])
+AT_CHECK([ovs-ofctl del-flows br-underlay])
+AT_CHECK([ovs-ofctl add-flows br-underlay flows-underlay.txt])
+
+dnl check tunnel push path, from at_ns1 to at_ns0
+NS_CHECK_EXEC([at_ns1], [nc $NC_EOF_OPT -u 10.1.1.1 1234 < payload200.bin])
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+
+dnl Before truncation = ETH(14) + IP(20) + UDP(8) + 200 = 242B
+AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=2" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=242
+])
+dnl After truncation = outer ETH(14) + outer IP(20) + GRE(4) + 100 = 138B
+AT_CHECK([ovs-ofctl dump-flows br-underlay | grep "in_port=LOCAL" | sed -n 's/.*\(n\_bytes=[[0-9]]*\).*/\1/p'], [0], [dnl
+n_bytes=138
+])
+
+dnl check tunnel pop path, from at_ns0 to at_ns1
+NS_CHECK_EXEC([at_ns0], [nc $NC_EOF_OPT -u 10.1.1.2 5678 < payload200.bin])
+dnl After truncation = 100 bytes at loopback device p2(4)
+AT_CHECK([ovs-appctl revalidator/purge], [0])
+AT_CHECK([ovs-ofctl dump-flows br0 | grep "in_port=4" | ofctl_strip], [0], [dnl
+ n_packets=1, n_bytes=100, priority=1,ip,in_port=4 actions=drop
+])
+
+OVS_TRAFFIC_VSWITCHD_STOP
+AT_CLEANUP
+
+
+dnl simple test case for BPF
+AT_SETUP([ovn -- 1 LR connects 2 LSes])
+AT_KEYWORDS([ovnbpf])
+
+ovn_start
+OVS_TRAFFIC_VSWITCHD_START()
+ADD_BR([br-int])
+
+# Set external-ids in br-int needed for ovn-controller
+# Use vxlan here
+ovs-vsctl \
+ -- set Open_vSwitch . external-ids:system-id=hv1 \
+ -- set Open_vSwitch . external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
+ -- set Open_vSwitch . external-ids:ovn-encap-type=vxlan \
+ -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
+ -- set bridge br-int fail-mode=secure other-config:disable-in-band=true
+
+# Start ovn-controller
+start_daemon ovn-controller
+
+# Logical network:
# 1 LR (R1) and 2 LSes (foo and bar): R1 has switches foo (192.168.1.0/24)
# and bar (192.168.2.0/24) connected to it.
+#
+# foo ------- R1 ------- bar
+# 192.168.1.0/24 192.168.2.0/24
+#
+
+ovn-nbctl create Logical_Router name=R1
+
+ovn-nbctl ls-add foo
+ovn-nbctl ls-add bar
+
+# Connect foo to R1
+ovn-nbctl lrp-add R1 foo 00:00:01:01:02:03 192.168.1.1/24
+ovn-nbctl lsp-add foo rp-foo -- set Logical_Switch_Port rp-foo \
+ type=router options:router-port=foo addresses=\"00:00:01:01:02:03\"
+
+# Connect bar to R1
+ovn-nbctl lrp-add R1 bar 00:00:01:01:02:04 192.168.2.1/24
+ovn-nbctl lsp-add bar rp-bar -- set Logical_Switch_Port rp-bar \
+ type=router options:router-port=bar addresses=\"00:00:01:01:02:04\"
+
+# Logical port 'foo1' in switch 'foo'.
+ADD_NAMESPACES(foo1)
+ADD_VETH(foo1, foo1, br-int, "192.168.1.2/24", "f0:00:00:01:02:03", \
+ "192.168.1.1")
+ovn-nbctl lsp-add foo foo1 \
+ -- lsp-set-addresses foo1 "f0:00:00:01:02:03 192.168.1.2"
+
+ADD_NAMESPACES(bar1)
+ADD_VETH(bar1, bar1, br-int, "192.168.2.2/24", "f0:00:00:01:02:05", \
+"192.168.2.1")
+ovn-nbctl lsp-add bar bar1 \
+ -- lsp-set-addresses bar1 "f0:00:00:01:02:05 192.168.2.2"
+
+# wait for ovn-controller to catch up.
+ovn-nbctl --wait=hv sync
+
+# 'bar1' should be able to ping 'foo1' directly.
+NS_CHECK_EXEC([bar1], [ping -q -c 3 -i 0.3 -w 8 192.168.1.2 | FORMAT_PING], \
+[0], [dnl
+3 packets transmitted, 3 received, 0% packet loss, time 0ms
+])
+
+OVS_APP_EXIT_AND_WAIT([ovn-controller])
+
+as ovn-sb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as ovn-nb
+OVS_APP_EXIT_AND_WAIT([ovsdb-server])
+
+as northd
+OVS_APP_EXIT_AND_WAIT([ovn-northd])
+
+as
+OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
+/connection dropped.*/d"])
+AT_CLEANUP
+
--
2.7.4


[RFC PATCH 09/11] utilities: Add ovs-bpfctl utility.

William Tu
 

From: Joe Stringer <joe@...>

This new utility is used for standalone probing of BPF datapath state.

Signed-off-by: Joe Stringer <joe@...>
Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
utilities/automake.mk | 9 ++
utilities/ovs-bpfctl.8.xml | 45 ++++++++
utilities/ovs-bpfctl.c | 248 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 302 insertions(+)
create mode 100644 utilities/ovs-bpfctl.8.xml
create mode 100644 utilities/ovs-bpfctl.c

diff --git a/utilities/automake.mk b/utilities/automake.mk
index 1636cb93e677..9de28eb1eb7d 100644
--- a/utilities/automake.mk
+++ b/utilities/automake.mk
@@ -39,6 +39,7 @@ utilities/ovs-lib: $(top_builddir)/config.status

EXTRA_DIST += \
utilities/ovs-appctl-bashcomp.bash \
+ utilities/ovs-bpfctl.8.xml \
utilities/ovs-check-dead-ifs.in \
utilities/ovs-ctl.in \
utilities/ovs-dev.py \
@@ -103,6 +104,7 @@ CLEANFILES += \

man_MANS += \
utilities/ovs-appctl.8 \
+ utilities/ovs-bpfctl.8 \
utilities/ovs-ctl.8 \
utilities/ovs-testcontroller.8 \
utilities/ovs-dpctl.8 \
@@ -148,4 +150,11 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \
utilities/checkpatch.py utilities/ovs-dev.py \
utilities/ovs-tcpdump.in

+if HAVE_BPF
+bin_PROGRAMS += \
+ utilities/ovs-bpfctl
+utilities_ovs_bpfctl_SOURCES = utilities/ovs-bpfctl.c
+utilities_ovs_bpfctl_LDADD = lib/libopenvswitch.la
+endif
+
include utilities/bugtool/automake.mk
diff --git a/utilities/ovs-bpfctl.8.xml b/utilities/ovs-bpfctl.8.xml
new file mode 100644
index 000000000000..6160d5eb06aa
--- /dev/null
+++ b/utilities/ovs-bpfctl.8.xml
@@ -0,0 +1,45 @@
+<?xml version="1.0" encoding="utf-8"?>
+<manpage program="ovs-bpfctl" section="8" title="ovs-bpfctl">
+ <h1>Name</h1>
+ <p>ovs-bpfctl -- administer Open vSwitch BPF state</p>
+
+ <h1>Synopsis</h1>
+ <p><code>ovs-bpfctl</code> [<var>options</var>] <var>command</var> [<var>arg</var>...]</p>
+
+ <h1>Description</h1>
+ <p>This utility can be used to probe and manage OVS BPF state.</p>
+
+ <h1>Commands</h1>
+ <dl>
+ <dt><code>show</code></dt>
+ <dd>
+ Prints a brief overview of the current BPF configuration state.
+ </dd>
+
+ <dt><code>load-dp</code> <var>filename</var></dt>
+ <dd>
+ Loads a BPF datapath implementation from <var>filename</var> into the
+ kernel, and pins it to the filesystem.
+ </dd>
+ </dl>
+
+ <h1>Options</h1>
+ <xi:include href="lib/common.xml" xmlns:xi="http://www.w3.org/2003/XInclude"/>
+
+ <h1>Exit Status</h1>
+ <dl>
+ <dt>0</dt>
+ <dd>Successful program execution.</dd>
+ <dt>1</dt>
+ <dd>Usage or syntax error.</dd>
+ </dl>
+
+ <h1>See also</h1>
+ <p><code>tc</code>(8), <code>tc-bpf</code>(8)</p>
+
+ <h1>Authors</h1>
+ <p>Manpage written by Joe Stringer.</p>
+ <p>Please report corrections or improvements to
+ <code>&lt;bugs@...&gt;</code></p>
+
+</manpage>
diff --git a/utilities/ovs-bpfctl.c b/utilities/ovs-bpfctl.c
new file mode 100644
index 000000000000..10b238a3d79e
--- /dev/null
+++ b/utilities/ovs-bpfctl.c
@@ -0,0 +1,248 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+#include <errno.h>
+#include <getopt.h>
+#include <inttypes.h>
+#include <signal.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/stat.h>
+
+#include "bpf.h"
+#include "command-line.h"
+#include "fatal-signal.h"
+#include "util.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+
+static int verbosity = 0;
+static bool read_only = false;
+
+typedef int bpfctl_command_handler(int argc, const char *argv[]);
+struct bpfctl_command {
+ const char *name;
+ const char *usage;
+ int min_args;
+ int max_args;
+ bpfctl_command_handler *handler;
+ enum { DP_RO, DP_RW} mode;
+};
+
+OVS_NO_RETURN static void usage(void *userdata OVS_UNUSED);
+static void parse_options(int argc, char *argv[]);
+static int bpfctl_run_command(int argc, const char *argv[]);
+
+static void
+bpfctl_print(void *userdata OVS_UNUSED, bool error, const char *msg)
+{
+ FILE *outfile = error ? stderr : stdout;
+ fputs(msg, outfile);
+}
+
+static void
+bpfctl_error(int err_no, const char *fmt, ...)
+{
+ const char *subprogram_name = get_subprogram_name();
+ struct ds ds = DS_EMPTY_INITIALIZER;
+ int save_errno = errno;
+ va_list args;
+
+ if (subprogram_name[0]) {
+ ds_put_format(&ds, "%s(%s): ", program_name, subprogram_name);
+ } else {
+ ds_put_format(&ds, "%s: ", program_name);
+ }
+
+ va_start(args, fmt);
+ ds_put_format_valist(&ds, fmt, args);
+ va_end(args);
+
+ if (err_no != 0) {
+ ds_put_format(&ds, " (%s)", ovs_retval_to_string(err_no));
+ }
+ ds_put_cstr(&ds, "\n");
+
+ bpfctl_print(NULL, true, ds_cstr(&ds));
+
+ ds_destroy(&ds);
+
+ errno = save_errno;
+}
+
+int
+main(int argc, char *argv[])
+{
+ int error;
+ set_program_name(argv[0]);
+ parse_options(argc, argv);
+ fatal_ignore_sigpipe();
+
+ error = bpfctl_run_command(argc - optind, (const char **) argv + optind);
+ return error ? EXIT_FAILURE : EXIT_SUCCESS;
+}
+
+static void
+parse_options(int argc, char *argv[])
+{
+ enum {
+ OPT_CLEAR = UCHAR_MAX + 1,
+ OPT_MAY_CREATE,
+ OPT_READ_ONLY,
+ VLOG_OPTION_ENUMS
+ };
+ static const struct option long_options[] = {
+ {"read-only", no_argument, NULL, OPT_READ_ONLY},
+ {"more", no_argument, NULL, 'm'},
+ {"help", no_argument, NULL, 'h'},
+ {"option", no_argument, NULL, 'o'},
+ {"version", no_argument, NULL, 'V'},
+ VLOG_LONG_OPTIONS,
+ {NULL, 0, NULL, 0},
+ };
+ char *short_options = ovs_cmdl_long_options_to_short_options(long_options);
+
+ for (;;) {
+ int c;
+
+ c = getopt_long(argc, argv, short_options, long_options, NULL);
+ if (c == -1) {
+ break;
+ }
+
+ switch (c) {
+ case OPT_READ_ONLY:
+ read_only = true;
+ break;
+
+ case 'm':
+ verbosity++;
+ break;
+
+ case 'h':
+ usage(NULL);
+
+ case 'o':
+ ovs_cmdl_print_options(long_options);
+ exit(EXIT_SUCCESS);
+
+ case 'V':
+ ovs_print_version(0, 0);
+ exit(EXIT_SUCCESS);
+
+ VLOG_OPTION_HANDLERS
+
+ case '?':
+ exit(EXIT_FAILURE);
+
+ default:
+ abort();
+ }
+ }
+ free(short_options);
+}
+
+static void
+usage(void *userdata OVS_UNUSED)
+{
+ printf("%s: Open vSwitch bpf management utility\n"
+ "usage: %s [OPTIONS] COMMAND [ARG...]\n"
+ " show show basic info on bpf datapaths\n"
+ " load-dp FILENAME load datapath from FILENAME\n",
+ program_name, program_name);
+ vlog_usage();
+ printf(" -m, --more increase verbosity of output\n"
+ " -h, --help display this help message\n"
+ " -V, --version display version information\n");
+ exit(EXIT_SUCCESS);
+}
+
+static int
+bpfctl_show(int argc OVS_UNUSED, const char *argv[] OVS_UNUSED)
+{
+ struct bpf_state bpf;
+
+ if (!bpf_get(&bpf, verbosity)) {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ bpf_format_state(&ds, &bpf);
+ printf("%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ bpf_put(&bpf);
+ }
+ return 0;
+}
+
+static int
+bpfctl_load_dp(int argc OVS_UNUSED, const char *argv[])
+{
+ int error;
+
+ error = bpf_init();
+ if (error) {
+ return error;
+ }
+ return bpf_load(argv[1]);
+}
+
+static const struct bpfctl_command all_commands[] = {
+ { "load-dp", "[file]", 1, 1, bpfctl_load_dp, DP_RW },
+ { "show", "", 0, 0, bpfctl_show, DP_RO },
+ { NULL, NULL, 0, 0, NULL, DP_RO },
+};
+
+/* Runs the command designated by argv[0] within the command table specified by
+ * 'commands', which must be terminated by a command whose 'name' member is a
+ * null pointer. */
+static int
+bpfctl_run_command(int argc, const char *argv[])
+{
+ const struct bpfctl_command *p;
+
+ if (argc < 1) {
+ bpfctl_error(0, "missing command name; use --help for help");
+ return EINVAL;
+ }
+
+ for (p = all_commands; p->name != NULL; p++) {
+ if (!strcmp(p->name, argv[0])) {
+ int n_arg = argc - 1;
+ if (n_arg < p->min_args) {
+ bpfctl_error(0, "'%s' command requires at least %d arguments",
+ p->name, p->min_args);
+ return EINVAL;
+ } else if (n_arg > p->max_args) {
+ bpfctl_error(0, "'%s' command takes at most %d arguments",
+ p->name, p->max_args);
+ return EINVAL;
+ } else {
+ if (p->mode == DP_RW && read_only) {
+ bpfctl_error(0,
+ "'%s' command does not work in read only mode",
+ p->name);
+ return EINVAL;
+ }
+ return p->handler(argc, argv);
+ }
+ }
+ }
+
+ bpfctl_error(0, "unknown command '%s'; use --help for help",
+ argv[0]);
+ return EINVAL;
+}
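The dispatch above follows the common OVS pattern of a NULL-terminated command table with per-command arity checks. Stripped of the OVS-specific error plumbing, the core of the pattern looks like this (types and names are illustrative, not the actual OVS API):

```c
#include <string.h>

struct cmd {
    const char *name;
    int min_args, max_args;
    int (*handler)(int argc, const char *argv[]);
};

/* Dispatch argv[0] against a table terminated by a NULL name.
 * Returns the handler's result, or -1 for unknown or bad-arity input. */
static int dispatch(const struct cmd *table, int argc, const char *argv[])
{
    for (const struct cmd *p = table; p->name != NULL; p++) {
        if (strcmp(p->name, argv[0]) == 0) {
            int n_arg = argc - 1;
            if (n_arg < p->min_args || n_arg > p->max_args) {
                return -1;          /* arity error */
            }
            return p->handler(argc, argv);
        }
    }
    return -1;                      /* unknown command */
}

/* Trivial handler for demonstration. */
static int demo_show(int argc, const char *argv[])
{
    (void)argc; (void)argv;
    return 0;
}
```

One design point worth noting: putting the read-only check in the dispatcher (as bpfctl_run_command does with DP_RO/DP_RW) keeps individual handlers free of mode checks.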
--
2.7.4


[RFC PATCH 07/11] bpf: implement OVS BPF datapath.

William Tu
 

This patch adds the OVS-eBPF datapath implementation for dpif-bpf.
Three stages are added: parse, lookup, and actions; each stage tail
calls into the next. When executing multiple actions, the current
action also tail calls the subsequent action, based on the result of
the flow table lookup.

The protocol headers are auto-generated and defined at generated_headers.h.
The bpf_flow_key is extracted using the P4-to-eBPF compiler from
the bcc project. A couple of manual tweaks are required, see parser.h.

Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Signed-off-by: Joe Stringer <joe@...>
Co-authored-by: Joe Stringer <joe@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
Makefile.am | 1 +
bpf/action.h | 628 ++++++++++++++++++++++++++++++++++++++++++++++++
bpf/api.h | 279 +++++++++++++++++++++
bpf/automake.mk | 60 +++++
bpf/datapath.c | 187 ++++++++++++++
bpf/datapath.h | 71 ++++++
bpf/generated_headers.h | 185 ++++++++++++++
bpf/helpers.h | 209 ++++++++++++++++
bpf/lookup.h | 227 +++++++++++++++++
bpf/maps.h | 170 +++++++++++++
bpf/odp-bpf.h | 254 ++++++++++++++++++++
bpf/openvswitch.h | 49 ++++
bpf/ovs-p4.h | 112 +++++++++
bpf/ovs-proto.p4 | 329 +++++++++++++++++++++++++
bpf/parser.h | 412 +++++++++++++++++++++++++++++++
bpf/xdp.h | 35 +++
16 files changed, 3208 insertions(+)
create mode 100644 bpf/action.h
create mode 100644 bpf/api.h
create mode 100644 bpf/automake.mk
create mode 100644 bpf/datapath.c
create mode 100644 bpf/datapath.h
create mode 100644 bpf/generated_headers.h
create mode 100644 bpf/helpers.h
create mode 100644 bpf/lookup.h
create mode 100644 bpf/maps.h
create mode 100644 bpf/odp-bpf.h
create mode 100644 bpf/openvswitch.h
create mode 100644 bpf/ovs-p4.h
create mode 100644 bpf/ovs-proto.p4
create mode 100644 bpf/parser.h
create mode 100644 bpf/xdp.h

diff --git a/Makefile.am b/Makefile.am
index 21e27fa32965..ec1fc53b1060 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -440,6 +440,7 @@ dist-docs:

include Documentation/automake.mk
include m4/automake.mk
+include bpf/automake.mk
include lib/automake.mk
include ofproto/automake.mk
include utilities/automake.mk
diff --git a/bpf/action.h b/bpf/action.h
new file mode 100644
index 000000000000..49213698c00b
--- /dev/null
+++ b/bpf/action.h
@@ -0,0 +1,628 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+/* OVS Datapath Execution
+ * ======================
+ *
+ * When a lookup is successful, the eBPF program gets a list of actions
+ * to be executed, such as outputting the packet to a certain port or
+ * pushing a VLAN tag. The list of actions is configured in ovs-vswitchd
+ * and may be of variable length depending on the desired network
+ * processing behaviour. For example, an L2 switch flooding an unknown
+ * destination sends the packet to all of its current ports. The OVS
+ * datapath's actions are derived from the OpenFlow action specification
+ * and the OVSDB schema for ovs-vswitchd.
+ *
+ */
+#include <errno.h>
+#include <stdint.h>
+#include <iproute2/bpf_elf.h>
+#include <linux/ip.h>
+
+#include "api.h"
+#include "maps.h"
+#include "helpers.h"
+
+#define ALIGNED_CAST(TYPE, ATTR) ((TYPE) (void *) (ATTR))
+
+#define IP_CSUM_OFF (ETH_HLEN + offsetof(struct iphdr, check))
+#define TOS_OFF (ETH_HLEN + offsetof(struct iphdr, tos))
+#define TTL_OFF (ETH_HLEN + offsetof(struct iphdr, ttl))
+#define DST_OFF (ETH_HLEN + offsetof(struct iphdr, daddr))
+#define SRC_OFF (ETH_HLEN + offsetof(struct iphdr, saddr))
+
+static inline void set_ip_tos(struct __sk_buff *skb, __u8 new_tos)
+{
+ __u8 old_tos = load_byte(skb, TOS_OFF);
+
+ bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_tos, new_tos, 2);
+
+ /* Use the helper here because direct packet
+ * access causes a verifier error.
+ */
+ bpf_skb_store_bytes(skb, TOS_OFF, &new_tos, sizeof(new_tos), 0);
+}
+
+static inline void set_ip_ttl(struct __sk_buff *skb, __u8 new_ttl)
+{
+ __u8 old_ttl = load_byte(skb, TTL_OFF);
+
+ bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ttl, new_ttl, 2);
+ bpf_skb_store_bytes(skb, TTL_OFF, &new_ttl, sizeof(new_ttl), 0);
+}
+
+static inline void set_ip_dst(struct __sk_buff *skb, __u32 new_dst)
+{
+ __u32 old_dst = load_word(skb, DST_OFF);
+
+ bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_dst, new_dst, 4);
+ bpf_skb_store_bytes(skb, DST_OFF, &new_dst, sizeof(new_dst), 0);
+}
+
+static inline void set_ip_src(struct __sk_buff *skb, __u32 new_src)
+{
+ __u32 old_src = load_word(skb, SRC_OFF);
+
+ bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_src, new_src, 4);
+ bpf_skb_store_bytes(skb, SRC_OFF, &new_src, sizeof(new_src), 0);
+}
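The set_ip_*() helpers above rely on bpf_l3_csum_replace() to patch the IPv4 header checksum incrementally instead of recomputing it over the whole header. The underlying arithmetic is the RFC 1624 update, sketched here in plain user-space C for clarity (function names are illustrative):

```c
#include <stdint.h>

/* Naive one's-complement checksum over 16-bit words, for comparison. */
static uint16_t csum16_full(const uint16_t *words, int n)
{
    uint32_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += words[i];
    }
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);  /* fold end-around carry */
    }
    return (uint16_t)~sum;
}

/* Incremental update per RFC 1624, HC' = ~(~HC + ~m + m'): patch the
 * stored checksum when one 16-bit field changes from old_val to
 * new_val, without re-summing the header. This mirrors what
 * bpf_l3_csum_replace() does in the kernel for the helpers above. */
static uint16_t csum16_replace(uint16_t csum, uint16_t old_val,
                               uint16_t new_val)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~old_val;
    sum += new_val;
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

The incremental result matches a full recompute for any single-field change, which is why the helpers only need the old and new values plus the field size.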
+
+/*
+ * Every OVS action needs to look up the action list and, using the
+ * current action index, find the next action to process.
+ */
+static inline struct bpf_action *pre_tail_action(struct __sk_buff *skb,
+ struct bpf_action_batch **__batch)
+{
+ uint32_t index = ovs_cb_get_action_index(skb);
+ struct bpf_action *action = NULL;
+ struct bpf_action_batch *batch;
+ int zero_index = 0;
+
+ if (index >= BPF_DP_MAX_ACTION) {
+ printt("ERR max ebpf action hit\n");
+ return NULL;
+ }
+
+ if (skb->cb[OVS_CB_DOWNCALL_EXE]) {
+ /* Downcall packet has a dedicated action list */
+ batch = bpf_map_lookup_elem(&execute_actions, &zero_index);
+ } else {
+ struct bpf_flow_key *exe_flow_key, flow_key;
+
+ exe_flow_key = bpf_map_lookup_elem(&percpu_executing_key,
+ &zero_index);
+ if (!exe_flow_key) {
+ printt("empty percpu_executing_key\n");
+ return NULL;
+ }
+
+ flow_key = *exe_flow_key;
+ batch = bpf_map_lookup_elem(&flow_table, &flow_key);
+ }
+ if (!batch) {
+ printt("no batch action found\n");
+ return NULL;
+ }
+
+ *__batch = batch;
+ action = &((batch)->actions[index]);
+ return action;
+}
+
+/*
+ * After processing the action, tail call the next.
+ */
+static inline int post_tail_action(struct __sk_buff *skb,
+ struct bpf_action_batch *batch)
+{
+ struct bpf_action *next_action;
+ uint32_t index;
+
+ if (!batch)
+ return TC_ACT_SHOT;
+
+ index = skb->cb[OVS_CB_ACT_IDX] + 1;
+ skb->cb[OVS_CB_ACT_IDX] = index;
+
+ if (index >= BPF_DP_MAX_ACTION)
+ goto finish;
+
+ next_action = &batch->actions[index];
+ if (next_action->type == 0)
+ goto finish;
+
+ printt("next action type = %d\n", next_action->type);
+ bpf_tail_call(skb, &tailcalls, next_action->type);
+
+ printt("[BUG] tail call missing\n");
+ return TC_ACT_SHOT;
+
+finish:
+ if (skb->cb[OVS_CB_DOWNCALL_EXE]) {
+ int index = 0;
+ bpf_map_delete_elem(&execute_actions, &index);
+ }
+ return TC_ACT_STOLEN;
+}
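Collapsed into ordinary control flow, the pre_tail_action()/post_tail_action() pair walks an action array: each tail-called program re-derives its slot from cb[OVS_CB_ACT_IDX], executes it, and jumps to the program for the next action's type, stopping at a zero type. A user-space sketch of the equivalent loop (types and names are illustrative):

```c
#include <stddef.h>

#define BPF_DP_MAX_ACTION 32   /* illustrative bound, mirrors the map size */

struct action { int type; };   /* type 0 terminates the batch */
struct action_batch { struct action actions[BPF_DP_MAX_ACTION]; };

/* In the BPF datapath each iteration is a separate tail-called
 * program; collapsed into a loop, the traversal looks like this.
 * Returns the number of actions executed, or -1 on failure. */
static int run_batch(const struct action_batch *batch,
                     int (*execute)(const struct action *))
{
    for (int i = 0; i < BPF_DP_MAX_ACTION; i++) {
        const struct action *act = &batch->actions[i];
        if (act->type == 0) {   /* OVS_ACTION_ATTR_UNSPEC: end of list */
            return i;
        }
        if (execute(act) < 0) {
            return -1;          /* TC_ACT_SHOT equivalent */
        }
    }
    return BPF_DP_MAX_ACTION;
}

/* Trivial executor for demonstration: accept every action. */
static int accept_action(const struct action *a)
{
    (void)a;
    return 0;
}
```

The tail-call form avoids loops (which the verifier restricts) at the cost of re-looking-up the batch on every step, which is why pre_tail_action() must re-fetch the flow key from the per-CPU map each time.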
+
+/*
+ * Use this action to indicate end of action list
+ * BPF program: tail-0
+ */
+__section_tail(OVS_ACTION_ATTR_UNSPEC)
+static int tail_action_unspec(struct __sk_buff *skb)
+{
+ int index OVS_UNUSED = ovs_cb_get_action_index(skb);
+
+ printt("action index = %d, end of processing\n", index);
+
+ /* Handle actions=drop: return SHOT so that the device's dropped
+ * stats are incremented (see sch_handle_ingress()).
+ *
+ * If there are more actions, e.g. actions=a1,a2,drop, this is
+ * handled in post_tail_action(), which returns STOLEN.
+ */
+ return TC_ACT_SHOT;
+}
+
+/*
+ * BPF program: tail-1
+ */
+__section_tail(OVS_ACTION_ATTR_OUTPUT)
+static int tail_action_output(struct __sk_buff *skb)
+{
+ int ret __attribute__((__unused__));
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+ int flags;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ /* An internal dev is a tap-type device hooked only to the BPF egress
+ * filter. When outputting to an internal device, the packet is
+ * clone-redirected to the device's ingress so that the kernel stack
+ * processes it: if it were sent to the device's egress, it would be
+ * delivered to the tap device's socket, not to the kernel.
+ */
+ flags = action->u.out.flags & OVS_BPF_FLAGS_TX_STACK ? BPF_F_INGRESS : 0;
+ printt("output action port = %d ingress? %d\n",
+ action->u.out.port, (flags));
+
+ bpf_clone_redirect(skb, action->u.out.port, flags);
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements OVS userspace
+ * BPF program: tail-2
+ */
+__section_tail(OVS_ACTION_ATTR_USERSPACE)
+static int tail_action_userspace(struct __sk_buff *skb)
+{
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ /* XXX If this declaration is moved to the top, the stack overflows. */
+ struct bpf_upcall md = {
+ .type = OVS_UPCALL_ACTION,
+ .skb_len = skb->len,
+ .ifindex = skb->ifindex,
+ };
+
+ if (action->u.userspace.nlattr_len > sizeof(md.uactions)) {
+ printt("userspace action is too large\n");
+ return TC_ACT_SHOT;
+ }
+
+ memcpy(md.uactions, action->u.userspace.nlattr_data, sizeof(md.uactions));
+ md.uactions_len = action->u.userspace.nlattr_len;
+
+ struct ebpf_headers_t *hdrs = bpf_get_headers();
+ if (!hdrs) {
+ printt("headers is NULL\n");
+ return TC_ACT_SHOT;
+ }
+
+ memcpy(&md.key.headers, hdrs, sizeof(*hdrs));
+
+ uint64_t flags = skb->len;
+ flags <<= 32;
+ flags |= BPF_F_CURRENT_CPU;
+ int err = skb_event_output(skb, &upcalls, flags, &md, sizeof md);
+
+ if (err) {
+ printt("skb_event_output of userspace action: %d", err);
+ return TC_ACT_SHOT;
+ }
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements BPF tunnel
+ * BPF program: tail-3
+ */
+__section_tail(OVS_ACTION_ATTR_SET)
+static int tail_action_tunnel_set(struct __sk_buff *skb)
+{
+ struct bpf_tunnel_key key;
+ int ret;
+ uint64_t flags;
+
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+ struct ovs_action_set_tunnel *tunnel;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ if (action->is_set) {
+ /* set_masked */
+ printt("ERR: this is set tunnel action\n");
+ return TC_ACT_SHOT;
+ }
+
+ tunnel = &action->u.tunnel;
+
+ /* hard-coded now, should fetch it from action->u */
+ __builtin_memset(&key, 0x0, sizeof(key));
+ key.tunnel_id = tunnel->tunnel_id;
+ key.tunnel_tos = tunnel->tunnel_tos;
+ key.tunnel_ttl = tunnel->tunnel_ttl;
+
+ printt("tunnel_id = %x\n", key.tunnel_id);
+
+ /* TODO: handle BPF_F_DONT_FRAGMENT and BPF_F_SEQ_NUMBER */
+ flags = BPF_F_ZERO_CSUM_TX;
+ if (!tunnel->use_ipv6) {
+ key.remote_ipv4 = tunnel->remote_ipv4;
+ flags &= ~BPF_F_TUNINFO_IPV6;
+ } else {
+ memcpy(&key.remote_ipv4, &tunnel->remote_ipv4, 16);
+ flags |= BPF_F_TUNINFO_IPV6;
+ }
+
+ ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), flags);
+ if (ret < 0)
+ printt("ERR setting tunnel key\n");
+
+ if (tunnel->gnvopt_valid) {
+ ret = bpf_skb_set_tunnel_opt(skb, &tunnel->gnvopt,
+ sizeof tunnel->gnvopt);
+ if (ret < 0)
+ printt("ERR setting tunnel opt\n");
+ }
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements VLAN push
+ * BPF program: tail-4
+ */
+__section_tail(OVS_ACTION_ATTR_PUSH_VLAN)
+static int tail_action_push_vlan(struct __sk_buff *skb)
+{
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ printt("vlan push tci %d\n", action->u.push_vlan.vlan_tci);
+ printt("vlan push tpid %d\n", action->u.push_vlan.vlan_tpid);
+ bpf_skb_vlan_push(skb, action->u.push_vlan.vlan_tpid,
+ action->u.push_vlan.vlan_tci & ~VLAN_TAG_PRESENT);
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements VLAN pop
+ * BPF program: tail-5
+ */
+__section_tail(OVS_ACTION_ATTR_POP_VLAN)
+static int tail_action_pop_vlan(struct __sk_buff *skb)
+{
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ printt("vlan pop\n");
+ bpf_skb_vlan_pop(skb);
+
+ /* FIXME: invalidate_flow_key()? */
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements sample
+ * BPF program: tail-6
+ */
+__section_tail(OVS_ACTION_ATTR_SAMPLE)
+static int tail_action_sample(struct __sk_buff *skb OVS_UNUSED)
+{
+ printt("ERR: Sample action not implemented\n");
+
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This action implements recirculation
+ * BPF program: tail-7
+ */
+__section_tail(OVS_ACTION_ATTR_RECIRC)
+static int tail_action_recirc(struct __sk_buff *skb)
+{
+ u32 recirc_id = 0;
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+ struct ebpf_metadata_t *ebpf_md;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ /* recirc should be the last action; recirculation depth (level)
+ * is not handled. */
+
+ /* Don't check is_flow_key_valid(); for now, always re-parse
+ * the header.
+ */
+ recirc_id = action->u.recirc_id;
+ printt("recirc id = %d\n", recirc_id);
+
+ /* update metadata */
+ ebpf_md = bpf_get_mds();
+ if (!ebpf_md) {
+ printt("lookup metadata failed\n");
+ return TC_ACT_SHOT;
+ }
+ ebpf_md->md.recirc_id = recirc_id;
+
+ skb->cb[OVS_CB_ACT_IDX] = 0;
+ skb->cb[OVS_CB_DOWNCALL_EXE] = 0;
+
+ /* FIXME: recirc should not call this. */
+ bpf_tail_call(skb, &tailcalls, MATCH_ACTION_CALL);
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This action implements hash
+ * BPF program: tail-8
+ */
+__section_tail(OVS_ACTION_ATTR_HASH)
+static int tail_action_hash(struct __sk_buff *skb)
+{
+ u32 hash = 0;
+ int index = 0;
+ struct ebpf_metadata_t *ebpf_md;
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ printt("skb->hash before = %x\n", skb->hash);
+ hash = bpf_get_hash_recalc(skb);
+ printt("skb->hash after = %x\n", skb->hash);
+ if (!hash)
+ hash = 0x1;
+
+ ebpf_md = bpf_map_lookup_elem(&percpu_metadata, &index);
+ if (!ebpf_md) {
+ printt("LOOKUP metadata failed\n");
+ return TC_ACT_SHOT;
+ }
+ printt("save hash to ebpf_md->md.dp_hash\n");
+ ebpf_md->md.dp_hash = hash; /* or create a ovs_flow_hash?*/
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements MPLS push
+ * BPF program: tail-9
+ */
+__section_tail(OVS_ACTION_ATTR_PUSH_MPLS)
+static int tail_action_mpls_push(struct __sk_buff *skb OVS_UNUSED)
+{
+ printt("ERR: Push MPLS action not implemented\n");
+
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This action implements MPLS pop
+ * BPF program: tail-10
+ */
+__section_tail(OVS_ACTION_ATTR_POP_MPLS)
+static int tail_action_mpls_pop(struct __sk_buff *skb OVS_UNUSED)
+{
+ printt("ERR: Pop MPLS action not implemented\n");
+
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This action sets packet fields; masks are not supported and
+ * many fields are not implemented yet.
+ * BPF program: tail-11
+ * TODO: we hit the verifier limit here; the work may need to be
+ * split across more programs and tail calls.
+ */
+__section_tail(OVS_ACTION_ATTR_SET_MASKED)
+static int tail_action_set_masked(struct __sk_buff *skb)
+{
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ switch (action->u.mset.key_type) {
+ case OVS_KEY_ATTR_ETHERNET: {
+ u8 *data = (u8 *)(long)skb->data;
+ u8 *data_end = (u8 *)(long)skb->data_end;
+ struct ethhdr *eth;
+ struct ovs_key_ethernet *ether;
+ int i;
+
+ /* packet data */
+ eth = (struct ethhdr *)data;
+ if (data + sizeof(*eth) > data_end)
+ return TC_ACT_SHOT;
+
+ /* value from map */
+ ether = &action->u.mset.key.ether;
+ for (i = 0; i < 6; i++)
+ eth->h_dest[i] = ether->eth_dst.ea[i];
+ for (i = 0; i < 6; i++)
+ eth->h_source[i] = ether->eth_src.ea[i];
+ break;
+ }
+ case OVS_KEY_ATTR_IPV4: {
+ u8 *data = (u8 *)(long)skb->data;
+ u8 *data_end = (u8 *)(long)skb->data_end;
+ struct iphdr *nh;
+ struct ovs_key_ipv4 *ipv4;
+
+ /* packet data */
+ nh = ALIGNED_CAST(struct iphdr *, data + sizeof(struct ethhdr));
+ if ((u8 *)nh + sizeof(struct iphdr) + 12 > data_end) {
+ return TC_ACT_SHOT;
+ }
+
+ /* value from map */
+ ipv4 = &action->u.mset.key.ipv4;
+ memcpy(&nh->saddr, &ipv4->ipv4_src, 8);
+ nh->protocol = ipv4->ipv4_proto;
+ nh->tos = ipv4->ipv4_tos;
+ nh->ttl = ipv4->ipv4_ttl;
+
+ set_ip_tos(skb, ipv4->ipv4_tos);
+ set_ip_ttl(skb, ipv4->ipv4_ttl);
+ //set_ip_src(skb, ipv4->ipv4_src);
+ //set_ip_dst(skb, ipv4->ipv4_dst);
+
+ //bpf_l3_csum_replace(skb, IP_CSUM_OFF, nh->saddr, ipv4->ipv4_src, 4);
+ //bpf_l3_csum_replace(skb, IP_CSUM_OFF, nh->daddr, ipv4->ipv4_dst, 4);
+ //bpf_l3_csum_replace(skb, IP_CSUM_OFF, nh->protocol, ipv4->ipv4_proto, 1);
+ //bpf_l3_csum_replace(skb, IP_CSUM_OFF, nh->tos, ipv4->ipv4_tos, 2);
+ //bpf_l3_csum_replace(skb, IP_CSUM_OFF, nh->ttl, ipv4->ipv4_ttl, 1);
+
+ /* XXX ignore frag */
+
+ break;
+ }
+ case OVS_KEY_ATTR_UNSPEC:
+ case OVS_KEY_ATTR_ENCAP:
+ case OVS_KEY_ATTR_PRIORITY: /* u32 skb->priority */
+ case OVS_KEY_ATTR_IN_PORT: /* u32 OVS dp port number */
+ case OVS_KEY_ATTR_VLAN: /* be16 VLAN TCI */
+ case OVS_KEY_ATTR_ETHERTYPE: /* be16 Ethernet type */
+ case OVS_KEY_ATTR_IPV6: /* struct ovs_key_ipv6 */
+ case OVS_KEY_ATTR_TCP: /* struct ovs_key_tcp */
+ case OVS_KEY_ATTR_UDP: /* struct ovs_key_udp */
+ case OVS_KEY_ATTR_ICMP: /* struct ovs_key_icmp */
+ case OVS_KEY_ATTR_ICMPV6: /* struct ovs_key_icmpv6 */
+ case OVS_KEY_ATTR_ARP: /* struct ovs_key_arp */
+ case OVS_KEY_ATTR_ND: /* struct ovs_key_nd */
+ case OVS_KEY_ATTR_SKB_MARK: /* u32 skb mark */
+ case OVS_KEY_ATTR_TUNNEL: /* Nested set of ovs_tunnel attributes */
+ case OVS_KEY_ATTR_SCTP: /* struct ovs_key_sctp */
+ case OVS_KEY_ATTR_TCP_FLAGS: /* be16 TCP flags. */
+ case OVS_KEY_ATTR_DP_HASH: /* u32 hash value. Value 0 indicates the hash */
+ case OVS_KEY_ATTR_RECIRC_ID: /* u32 recirc id */
+ case OVS_KEY_ATTR_MPLS: /* array of struct ovs_key_mpls. */
+ case OVS_KEY_ATTR_CT_STATE: /* u32 bitmask of OVS_CS_F_* */
+ case OVS_KEY_ATTR_CT_ZONE: /* u16 connection tracking zone. */
+ case OVS_KEY_ATTR_CT_MARK: /* u32 connection tracking mark */
+ case OVS_KEY_ATTR_CT_LABELS: /* 16-octet connection tracking labels */
+ case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4: /* struct ovs_key_ct_tuple_ipv4 */
+ case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6: /* struct ovs_key_ct_tuple_ipv6 */
+ case OVS_KEY_ATTR_NSH: /* Nested set of ovs_nsh_key_* */
+#ifdef __KERNEL__
+ case OVS_KEY_ATTR_TUNNEL_INFO: /* struct ovs_tunnel_info */
+#endif
+#ifndef __KERNEL__
+ case OVS_KEY_ATTR_PACKET_TYPE: /* be32 packet type */
+#endif
+ case __OVS_KEY_ATTR_MAX:
+ default:
+ printt("ERR: unimplemented set key_type %d\n", action->u.mset.key_type);
+ return TC_ACT_SHOT;
+ }
+
+ return post_tail_action(skb, batch);
+}
+
+/*
+ * This action implements connection tracking
+ * BPF program: tail-12
+ */
+__section_tail(OVS_ACTION_ATTR_CT)
+static int tail_action_ct(struct __sk_buff *skb OVS_UNUSED)
+{
+ printt("ERR: CT (connection tracking) action not implemented\n");
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This action implements packet truncate
+ * BPF program: tail-13
+ */
+__section_tail(OVS_ACTION_ATTR_TRUNC)
+static int tail_action_trunc(struct __sk_buff *skb)
+{
+ struct bpf_action *action;
+ struct bpf_action_batch *batch;
+
+ action = pre_tail_action(skb, &batch);
+ if (!action)
+ return TC_ACT_SHOT;
+
+ printt("len before: %d\n", skb->len);
+ printt("truncate to %d\n", action->u.trunc.max_len);
+
+ /* The helper will resize the skb to the given new size */
+ bpf_skb_change_tail(skb, action->u.trunc.max_len, 0);
+
+ printt("len after: %d\n", skb->len);
+ return post_tail_action(skb, batch);
+}
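The tail actions above all follow the same pattern: fetch the current action from the per-flow batch via `pre_tail_action()`, do the work, then let `post_tail_action()` advance to the next action (their bodies live in `action.h`, outside this hunk). A minimal userspace model of that index walk, with illustrative types rather than the patch's own:

```c
#include <assert.h>

/* Simplified model of the action-batch dispatch: each tail-called
 * action program fetches its action via an index kept in skb->cb
 * (OVS_CB_ACT_IDX), and post_tail_action() advances the index so the
 * next tail call runs the next action. Names here are illustrative. */
enum { ACT_END = 0, ACT_OUTPUT = 1, ACT_PUSH_VLAN = 2, ACT_TRUNC = 13 };

struct fake_skb { int cb_act_idx; };
struct fake_batch { int actions[8]; };

/* Return the type of the next action to run, or ACT_END when done. */
static int next_action(struct fake_skb *skb, const struct fake_batch *b)
{
    int type = b->actions[skb->cb_act_idx];
    if (type != ACT_END)
        skb->cb_act_idx++;   /* post_tail_action() equivalent */
    return type;
}
```

In the real datapath the "run the next action" step is a `bpf_tail_call()` into the program array keyed by the action type, so each action executes as its own BPF program.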
diff --git a/bpf/api.h b/bpf/api.h
new file mode 100644
index 000000000000..f2db1f729157
--- /dev/null
+++ b/bpf/api.h
@@ -0,0 +1,279 @@
+#ifndef __BPF_API__
+#define __BPF_API__
+
+/* Note:
+ *
+ * This file can be included into eBPF kernel programs. It contains
+ * a couple of useful helper functions, map/section ABI (bpf_elf.h),
+ * misc macros and some eBPF specific LLVM built-ins.
+ */
+
+#include <linux/bpf.h>
+#include <stdint.h>
+
+#define UNSPEC_CALL 0
+#define OUTPUT_CALL 1
+#define PARSER_CALL 32
+#define MATCH_ACTION_CALL 33
+#define DEPARSER_CALL 34
+#define UPCALL_CALL 35
+
+#ifndef TC_ACT_OK
+#define TC_ACT_OK 0
+#define TC_ACT_RECLASSIFY 1
+#define TC_ACT_SHOT 2
+#define TC_ACT_PIPE 3
+#define TC_ACT_STOLEN 4
+#define TC_ACT_QUEUED 5
+#define TC_ACT_REPEAT 6
+#define TC_ACT_REDIRECT 7
+#endif
+
+/** Misc macros. */
+
+#ifndef __stringify
+# define __stringify(X) #X
+#endif
+
+#ifndef __maybe_unused
+# define __maybe_unused __attribute__((__unused__))
+#endif
+
+#ifndef htons
+# define htons(X) __constant_htons((X))
+#endif
+
+#ifndef ntohs
+# define ntohs(X) __constant_ntohs((X))
+#endif
+
+#ifndef htonl
+# define htonl(X) __constant_htonl((X))
+#endif
+
+#ifndef ntohl
+# define ntohl(X) __constant_ntohl((X))
+#endif
+
+#ifndef __inline__
+# define __inline__ __attribute__((always_inline))
+#endif
+
+#ifndef __section
+# define __section(NAME) \
+ __attribute__((section(NAME), used))
+#endif
+
+#ifndef __section_tail
+# define __section_tail(KEY) \
+ __section("tail-" __stringify(KEY))
+#endif
+
+#ifndef __section_license
+# define __section_license \
+ __section(ELF_SECTION_LICENSE)
+#endif
+
+#ifndef __section_maps
+# define __section_maps \
+ __section(ELF_SECTION_MAPS)
+#endif
+
+#ifndef BPF_LICENSE
+# define BPF_LICENSE(NAME) \
+ char ____license[] __section_license = NAME
+#endif
+
+#ifndef __BPF_MAP
+# define __BPF_MAP(NAME, TYPE, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM) \
+ struct bpf_map_def __section_maps NAME = { \
+ .type = (TYPE), \
+ .key_size = (SIZE_KEY), \
+ .value_size = (SIZE_VALUE), \
+ .max_entries = (MAX_ELEM), \
+ .map_flags = 0, \
+ }
+#endif
+
+#ifndef BPF_HASH
+# define BPF_HASH(NAME, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM) \
+ __BPF_MAP(NAME, BPF_MAP_TYPE_HASH, ID, SIZE_KEY, SIZE_VALUE, \
+ PIN, MAX_ELEM)
+#endif
+
+#ifndef BPF_PERCPU_HASH
+# define BPF_PERCPU_HASH(NAME, ID, SIZE_KEY, SIZE_VALUE, PIN, MAX_ELEM) \
+ __BPF_MAP(NAME, BPF_MAP_TYPE_PERCPU_HASH, ID, SIZE_KEY, SIZE_VALUE, \
+ PIN, MAX_ELEM)
+#endif
+
+#ifndef BPF_ARRAY
+# define BPF_ARRAY(NAME, ID, SIZE_VALUE, PIN, MAX_ELEM) \
+ __BPF_MAP(NAME, BPF_MAP_TYPE_ARRAY, ID, sizeof(uint32_t), \
+ SIZE_VALUE, PIN, MAX_ELEM)
+#endif
+
+#ifndef BPF_PERCPU_ARRAY
+# define BPF_PERCPU_ARRAY(NAME, ID, SIZE_VALUE, PIN, MAX_ELEM) \
+ __BPF_MAP(NAME, BPF_MAP_TYPE_PERCPU_ARRAY, ID, sizeof(uint32_t), \
+ SIZE_VALUE, PIN, MAX_ELEM)
+#endif
+
+#ifndef BPF_PROG_ARRAY
+# define BPF_PROG_ARRAY(NAME, ID, PIN, MAX_ELEM) \
+ __BPF_MAP(NAME, BPF_MAP_TYPE_PROG_ARRAY, ID, sizeof(uint32_t), \
+ sizeof(uint32_t), PIN, MAX_ELEM)
+#endif
+
+#ifndef BPF_PERF_OUTPUT
+# define BPF_PERF_OUTPUT(name, pin) \
+ __BPF_MAP(name, BPF_MAP_TYPE_PERF_EVENT_ARRAY, 0, sizeof(uint32_t), \
+ sizeof(uint32_t), pin, __NR_CPUS__)
+#endif
+
+/** Classifier helper */
+
+#ifndef BPF_H_DEFAULT
+# define BPF_H_DEFAULT -1
+#endif
+
+/** BPF helper functions for tc. Individual flags are in linux/bpf.h */
+
+#ifndef BPF_FUNC
+# define BPF_FUNC(NAME, ...) \
+ (* NAME)(__VA_ARGS__) __maybe_unused = (void *) BPF_FUNC_##NAME
+#endif
+
+#ifndef BPF_FUNC2
+# define BPF_FUNC2(NAME, ...) \
+ (* NAME)(__VA_ARGS__) __maybe_unused
+#endif
+
+/* Map access/manipulation */
+static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);
+static int BPF_FUNC(map_update_elem, void *map, const void *key,
+ const void *value, uint32_t flags);
+static int BPF_FUNC(map_delete_elem, void *map, const void *key);
+
+/* Time access */
+static uint64_t BPF_FUNC(ktime_get_ns, void);
+
+/* Debugging */
+
+/* FIXME: __attribute__ ((format(printf, 1, 3))) not possible unless
+ * llvm bug https://llvm.org/bugs/show_bug.cgi?id=26243 gets resolved.
+ * It would require ____fmt to be made const, which generates a reloc
+ * entry (non-map).
+ */
+static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);
+
+#ifndef printt
+# ifdef DEBUG_BPF_OFF
+# define printt(fmt, ...)
+# else
+# define printt(fmt, ...) \
+ ({ \
+ char ____fmt[] = fmt; \
+ trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
+ })
+# endif
+#endif
+
+/* Random numbers */
+static uint32_t BPF_FUNC(get_prandom_u32, void);
+
+/* Tail calls */
+static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
+ uint32_t index);
+
+/* System helpers */
+static uint32_t BPF_FUNC(get_smp_processor_id, void);
+
+/* Packet misc meta data */
+static uint32_t BPF_FUNC(get_hash_recalc, struct __sk_buff *skb);
+
+static int BPF_FUNC(skb_under_cgroup, void *map, uint32_t index);
+
+/* Packet redirection */
+static int BPF_FUNC(redirect, int ifindex, uint32_t flags);
+static int BPF_FUNC(clone_redirect, struct __sk_buff *skb, int ifindex,
+ uint32_t flags);
+
+/* Packet manipulation */
+static int BPF_FUNC(skb_load_bytes, struct __sk_buff *skb, uint32_t off,
+ void *to, uint32_t len);
+static int BPF_FUNC(skb_store_bytes, struct __sk_buff *skb, uint32_t off,
+ const void *from, uint32_t len, uint32_t flags);
+
+static int BPF_FUNC(l3_csum_replace, struct __sk_buff *skb, uint32_t off,
+ uint32_t from, uint32_t to, uint32_t flags);
+static int BPF_FUNC(l4_csum_replace, struct __sk_buff *skb, uint32_t off,
+ uint32_t from, uint32_t to, uint32_t flags);
+static int BPF_FUNC(csum_diff, void *from, uint32_t from_size, void *to,
+ uint32_t to_size, uint32_t seed);
+
+static int BPF_FUNC(skb_change_type, struct __sk_buff *skb, uint32_t type);
+static int BPF_FUNC(skb_change_proto, struct __sk_buff *skb, uint32_t proto,
+ uint32_t flags);
+static int BPF_FUNC(skb_change_tail, struct __sk_buff *skb, uint32_t nlen,
+ uint32_t flags);
+
+/* Packet vlan encap/decap */
+static int BPF_FUNC(skb_vlan_push, struct __sk_buff *skb, uint16_t proto,
+ uint16_t vlan_tci);
+static int BPF_FUNC(skb_vlan_pop, struct __sk_buff *skb);
+
+/* Packet tunnel encap/decap */
+static int BPF_FUNC(skb_get_tunnel_key, struct __sk_buff *skb,
+ struct bpf_tunnel_key *to, uint32_t size, uint32_t flags);
+static int BPF_FUNC(skb_set_tunnel_key, struct __sk_buff *skb,
+ const struct bpf_tunnel_key *from, uint32_t size,
+ uint32_t flags);
+
+static int BPF_FUNC(skb_get_tunnel_opt, struct __sk_buff *skb,
+ void *to, uint32_t size);
+static int BPF_FUNC(skb_set_tunnel_opt, struct __sk_buff *skb,
+ const void *from, uint32_t size);
+
+/* Events for user space */
+static int BPF_FUNC2(skb_event_output, struct __sk_buff *skb, void *map, uint64_t index,
+ const void *data, uint32_t size) = (void *)BPF_FUNC_perf_event_output;
+
+/** LLVM built-ins, mem*() routines work for constant size */
+
+#ifndef lock_xadd
+# define lock_xadd(ptr, val) ((void) __sync_fetch_and_add(ptr, val))
+#endif
+
+#ifndef memset
+# define memset(s, c, n) __builtin_memset((s), (c), (n))
+#endif
+
+#ifndef memcpy
+# define memcpy(d, s, n) __builtin_memcpy((d), (s), (n))
+#endif
+
+#ifndef memmove
+# define memmove(d, s, n) __builtin_memmove((d), (s), (n))
+#endif
+
+/* FIXME: __builtin_memcmp() is not yet fully useable unless llvm bug
+ * https://llvm.org/bugs/show_bug.cgi?id=26218 gets resolved. Also
+ * this one would generate a reloc entry (non-map), otherwise.
+ */
+#if 0
+#ifndef memcmp
+# define memcmp(a, b, n) __builtin_memcmp((a), (b), (n))
+#endif
+#endif
+
+unsigned long long load_byte(void *skb, unsigned long long off)
+ asm ("llvm.bpf.load.byte");
+
+unsigned long long load_half(void *skb, unsigned long long off)
+ asm ("llvm.bpf.load.half");
+
+unsigned long long load_word(void *skb, unsigned long long off)
+ asm ("llvm.bpf.load.word");
+
+#endif /* __BPF_API__ */
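One detail of `__section_tail(KEY)` worth checking: `api.h` defines `__stringify` as a one-level `#X`, and a one-level stringize does not macro-expand its argument, while the comments throughout the patch assume numeric section names like `tail-13`. Whether the key ends up numeric therefore depends on how the `OVS_ACTION_ATTR_*` keys are defined. A quick illustration of the one-level vs two-level difference (macro names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <string.h>

#define STR1(X) #X         /* one level: argument is NOT macro-expanded */
#define STR2_(X) #X
#define STR2(X) STR2_(X)   /* two levels: X is expanded first */

#define KEY 13

/* Section-name construction as in __section("tail-" __stringify(KEY)) */
static const char *one_level = "tail-" STR1(KEY);
static const char *two_level = "tail-" STR2(KEY);
```

With the one-level form, a preprocessor-defined key stringifies to its name, not its value; enum constants never stringify to numbers at all, so a loader matching `tail-<number>` sections would need the two-level form or literal numeric keys.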
diff --git a/bpf/automake.mk b/bpf/automake.mk
new file mode 100644
index 000000000000..3028c585b6cc
--- /dev/null
+++ b/bpf/automake.mk
@@ -0,0 +1,60 @@
+bpf_sources = bpf/datapath.c
+bpf_headers = \
+ bpf/api.h \
+ bpf/datapath.h \
+ bpf/odp-bpf.h \
+ bpf/ovs-p4.h \
+ bpf/helpers.h \
+ bpf/openvswitch.h \
+ bpf/maps.h \
+ bpf/parser.h \
+ bpf/lookup.h \
+ bpf/action.h \
+ bpf/generated_headers.h \
+ bpf/xdp.h
+bpf_extra = \
+ bpf/ovs-proto.p4
+
+# Even if OVS is configured to build with GCC, the BPF programs must be
+# compiled with clang, since GCC doesn't have a BPF backend. Clang doesn't
+# support some of GCC's flags, so we filter them out.
+
+bpf_FILTER_FLAGS := $(filter-out -Wbool-compare, $(AM_CFLAGS))
+bpf_FILTER_FLAGS2 := $(filter-out -Wduplicated-cond, $(bpf_FILTER_FLAGS))
+bpf_FILTER_FLAGS3 := $(filter-out --coverage, $(bpf_FILTER_FLAGS2))
+bpf_CFLAGS := $(bpf_FILTER_FLAGS3)
+bpf_CFLAGS += -D__NR_CPUS__=$(shell nproc) -O2 -Wall -Werror -emit-llvm
+bpf_CFLAGS += -I$(top_builddir)/include -I$(top_srcdir)/include
+bpf_CFLAGS += -Wno-error=pointer-arith # Allow skb->data arithmetic
+bpf_CFLAGS += -I${IPROUTE2_SRC_PATH}/include/uapi/
+# FIXME:
+#bpf_CFLAGS += -D__KERNEL__
+
+dist_sources = $(bpf_sources)
+dist_headers = $(bpf_headers)
+build_sources = $(dist_sources)
+build_headers = $(dist_headers)
+build_objects = $(patsubst %.c,%.o,$(build_sources))
+
+LLC ?= llc-3.8
+CLANG ?= clang-3.8
+
+bpf: $(build_objects)
+bpf/datapath.o: $(bpf_sources) $(bpf_headers)
+ $(MKDIR_P) $(dir $@)
+ @which $(CLANG) >/dev/null 2>&1 || \
+ (echo "Unable to find clang; install the clang (>= 3.7) package"; exit 1)
+ $(AM_V_CC) $(CLANG) $(bpf_CFLAGS) -c $< -o - | \
+ $(LLC) -march=bpf -filetype=obj -o $@
+
+bpf/datapath_dbg.o: $(bpf_sources) $(bpf_headers)
+ @which clang-4.0 > /dev/null 2>&1 || \
+ (echo "Unable to find clang-4.0 for debugging"; exit 1)
+ clang-4.0 $(bpf_CFLAGS) -g -c $< -o -| llc-4.0 -march=bpf -filetype=obj -o $@_dbg
+ llvm-objdump-4.0 -S -no-show-raw-insn $@_dbg > $@_dbg.objdump
+
+EXTRA_DIST += $(dist_sources) $(dist_headers) $(bpf_extra)
+if HAVE_BPF
+dist_bpf_DATA += $(build_objects)
+endif
+
diff --git a/bpf/datapath.c b/bpf/datapath.c
new file mode 100644
index 000000000000..627177208059
--- /dev/null
+++ b/bpf/datapath.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#include <errno.h>
+#include <stdint.h>
+#include <iproute2/bpf_elf.h>
+
+#include "api.h"
+#include "odp-bpf.h"
+#include "datapath.h"
+
+/*
+ * Instead of having multiple BPF object files,
+ * include all headers and generate a single datapath.o.
+ */
+#include "maps.h"
+#include "parser.h"
+#include "lookup.h"
+#include "action.h"
+#include "xdp.h"
+
+/* We don't rely on a specific kernel version; however, libbpf requires
+ * the version to be both specified and non-zero. */
+static const __maybe_unused __section("version") uint32_t version = 0x1;
+
+static inline void __maybe_unused
+bpf_debug(struct __sk_buff *skb, enum ovs_dbg_subtype subtype, int error)
+{
+ uint64_t cpu = get_smp_processor_id();
+ uint64_t flags = skb->len;
+ struct bpf_upcall md = {
+ .type = OVS_UPCALL_DEBUG,
+ .subtype = subtype,
+ .ifindex = skb->ingress_ifindex,
+ .cpu = cpu,
+ .skb_len = skb->len,
+ .error = error
+ };
+
+ flags <<= 32;
+ flags |= BPF_F_CURRENT_CPU;
+
+ skb_event_output(skb, &upcalls, flags, &md, sizeof(md));
+}
+
+/*
+ * This program forwards the packet to userspace, using the
+ * perf_event_output helper function.
+ * BPF program: tail-35
+ */
+__section_tail(UPCALL_CALL)
+static inline int process_upcall(struct __sk_buff *skb)
+{
+ struct bpf_upcall md = {
+ .type = OVS_UPCALL_MISS,
+ .skb_len = skb->len,
+ //.ifindex = ovs_cb_get_ifindex(skb),
+ };
+ int stat, err;
+ struct ebpf_headers_t *hdrs = bpf_get_headers();
+ struct ebpf_metadata_t *mds = bpf_get_mds();
+
+ if (!hdrs || !mds) {
+ printt("headers/mds is NULL\n");
+ return TC_ACT_OK;
+ }
+
+ md.ifindex = mds->md.in_port;
+
+ memcpy(&md.key.headers, hdrs, sizeof(struct ebpf_headers_t));
+ memcpy(&md.key.mds, mds, sizeof(struct ebpf_metadata_t));
+
+ if (hdrs->valid & VLAN_VALID) {
+ printt("upcall skb->len(%d) with vlan %x %x\n",
+ skb->len, hdrs->vlan.etherType, hdrs->vlan.tci);
+ skb_vlan_push(skb, hdrs->vlan.etherType,
+ hdrs->vlan.tci & ~VLAN_TAG_PRESENT);
+ md.skb_len = skb->len;
+ }
+
+ uint64_t flags = skb->len;
+ flags <<= 32;
+ flags |= BPF_F_CURRENT_CPU;
+
+ err = skb_event_output(skb, &upcalls, flags, &md, sizeof(md));
+ stat = !err ? OVS_DP_STATS_MISSED
+ : err == -ENOSPC ? OVS_DP_STATS_LOST
+ : OVS_DP_STATS_ERRORS;
+ stats_account(stat);
+ return TC_ACT_OK;
+}
+
+/*
+ * This is the ENTRY POINT for packet seen at ingress queue
+ */
+__section("ingress")
+static int to_stack(struct __sk_buff *skb)
+{
+ printt("\n\ningress from %d (%d)\n", skb->ingress_ifindex, skb->ifindex);
+
+ ovs_cb_init(skb, true);
+ bpf_tail_call(skb, &tailcalls, PARSER_CALL);
+
+ printt("ERR: tail call fail in ingress\n");
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This is the ENTRY POINT for packet seen at egress queue
+ */
+__section("egress")
+static int from_stack(struct __sk_buff *skb)
+{
+ printt("\n\negress from %d (%d)\n", skb->ingress_ifindex, skb->ifindex);
+
+ ovs_cb_init(skb, false);
+ bpf_tail_call(skb, &tailcalls, PARSER_CALL);
+
+ printt("ERR: tail call fail in egress\n");
+ return TC_ACT_SHOT;
+}
+
+/*
+ * This is the ENTRY POINT for downcall packet
+ */
+__section("downcall")
+static int execute(struct __sk_buff *skb)
+{
+ struct bpf_downcall md;
+ u32 ebpf_zero = 0;
+ int flags, ofs;
+
+ ofs = skb->len - sizeof(md);
+ skb_load_bytes(skb, ofs, &md, sizeof(md));
+ flags = md.flags & OVS_BPF_FLAGS_TX_STACK ? BPF_F_INGRESS : 0;
+
+ printt("downcall (%d) from %d flags %d\n", md.type,
+ md.ifindex, flags);
+
+ bpf_map_update_elem(&percpu_metadata, &ebpf_zero, &md.md, BPF_ANY);
+
+ skb_change_tail(skb, ofs, 0);
+
+ switch (md.type) {
+ case OVS_BPF_DOWNCALL_EXECUTE: {
+ struct bpf_action_batch *action_batch;
+
+ action_batch = bpf_map_lookup_elem(&execute_actions, &ebpf_zero);
+ if (action_batch) {
+ printt("get valid action_batch\n");
+ skb->cb[OVS_CB_DOWNCALL_EXE] = 1;
+ bpf_tail_call(skb, &tailcalls, action_batch->actions[0].type);
+ } else {
+ printt("get null action_batch\n");
+ }
+ break;
+ }
+ case OVS_BPF_DOWNCALL_OUTPUT: {
+ /* Skip writing the BPF metadata in parser */
+ skb->cb[OVS_CB_ACT_IDX] = -1;
+ /* Redirect to the device this packet came from, so it's as though the
+ * packet was freshly received. This should execute PARSER_CALL. */
+ return redirect(md.ifindex, flags);
+ }
+ default:
+ printt("Unknown downcall type %d\n", md.type);
+ break;
+ }
+ return 0;
+}
+
+BPF_LICENSE("GPL");
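Both `bpf_debug()` and `process_upcall()` build the `skb_event_output()` flags the same way: the number of packet bytes to sample goes in the upper 32 bits, and the CPU selector in the lower 32. A small sketch of that packing (the `BPF_F_CURRENT_CPU` value is the one from `linux/bpf.h`):

```c
#include <assert.h>
#include <stdint.h>

#define BPF_F_CURRENT_CPU 0xffffffffULL  /* from linux/bpf.h */

/* Pack perf_event_output flags the way process_upcall() does:
 * sample length in the upper 32 bits, CPU selector in the lower 32. */
static uint64_t upcall_flags(uint32_t skb_len)
{
    uint64_t flags = skb_len;
    flags <<= 32;
    flags |= BPF_F_CURRENT_CPU;
    return flags;
}
```

Passing `skb->len` in the upper bits asks the helper to attach the full packet payload to the perf event, which is how the miss upcall delivers the packet to userspace.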
diff --git a/bpf/datapath.h b/bpf/datapath.h
new file mode 100644
index 000000000000..d9f48461cc79
--- /dev/null
+++ b/bpf/datapath.h
@@ -0,0 +1,71 @@
+/*
+ * Copyright (c) 2017, 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#include "odp-bpf.h"
+
+#define SKB_CB_U32S 5 /* According to linux/bpf.h. */
+
+enum ovs_cb_idx {
+ OVS_CB_ACT_IDX, /* Next action to process in action batch. */
+ OVS_CB_INGRESS, /* 0 = egress; nonzero = ingress. */
+ OVS_CB_DOWNCALL_EXE, /* 0 = match/execute, 1 = downcall/execute. */
+};
+
+static void
+ovs_cb_init(struct __sk_buff *skb, bool ingress)
+{
+ for (int i = 0; i < SKB_CB_U32S; i++)
+ skb->cb[i] = 0;
+ skb->cb[OVS_CB_INGRESS] = ingress;
+}
+
+static bool
+ovs_cb_is_initial_parse(struct __sk_buff *skb) {
+ int index = skb->cb[OVS_CB_ACT_IDX];
+
+ if (index != 0) {
+ printt("recirc, don't update metadata, index %d\n", index);
+ }
+ return index == 0;
+}
+
+static uint32_t
+ovs_cb_get_action_index(struct __sk_buff *skb)
+{
+ return skb->cb[OVS_CB_ACT_IDX];
+}
+
+static uint32_t OVS_UNUSED
+ovs_cb_get_ifindex(struct __sk_buff *skb)
+{
+ uint32_t ifindex;
+
+ if (!skb)
+ return 0;
+
+ /* This works around a compiler optimization issue. */
+ if (skb->cb[OVS_CB_INGRESS]) {
+ __asm__ __volatile__("": : :"memory");
+ return skb->ingress_ifindex;
+ }
+
+ ifindex = skb->ifindex;
+ __asm__ __volatile__("": : :"memory");
+
+ return ifindex;
+}
diff --git a/bpf/generated_headers.h b/bpf/generated_headers.h
new file mode 100644
index 000000000000..52e33a8601a6
--- /dev/null
+++ b/bpf/generated_headers.h
@@ -0,0 +1,185 @@
+#ifndef P4_GENERATED_HEADERS
+#define P4_GENERATED_HEADERS
+
+/* We sometimes disable IPv6 to work
+ * around the 512-byte BPF stack limit.
+ */
+#define BPF_ENABLE_IPV6
+
+#ifndef BPF_TYPES
+#define BPF_TYPES
+typedef signed char s8;
+typedef unsigned char u8;
+typedef signed short s16;
+typedef unsigned short u16;
+typedef signed int s32;
+typedef unsigned int u32;
+typedef signed long long s64;
+typedef unsigned long long u64;
+#endif
+
+struct ipv6_t {
+ u8 version; /* 4 bits */
+ u8 trafficClass; /* 8 bits */
+ u32 flowLabel; /* 20 bits */
+ u16 payloadLen; /* 16 bits */
+ u8 nextHdr; /* 8 bits */
+ u8 hopLimit; /* 8 bits */
+ char srcAddr[16]; /* 128 bits */
+ char dstAddr[16]; /* 128 bits */
+};
+struct pkt_metadata_t {
+ u32 recirc_id; /* 32 bits */
+ u32 dp_hash; /* 32 bits */
+ u32 skb_priority; /* 32 bits */
+ u32 pkt_mark; /* 32 bits */
+ u16 ct_state; /* 16 bits */
+ u16 ct_zone; /* 16 bits */
+ u32 ct_mark; /* 32 bits */
+ char ct_label[16]; /* 128 bits */
+ u32 in_port; /* 32 bits */
+ u32 packet_length;
+};
+struct udp_t {
+ u16 srcPort; /* 16 bits */
+ u16 dstPort; /* 16 bits */
+ u16 length_; /* 16 bits */
+ u16 checksum; /* 16 bits */
+};
+struct arp_rarp_t {
+ ovs_be16 ar_hrd; /* format of hardware address */
+ ovs_be16 ar_pro; /* format of protocol address */
+ unsigned char ar_hln; /* length of hardware address */
+ unsigned char ar_pln; /* length of protocol address */
+ ovs_be16 ar_op; /* ARP opcode (command) */
+
+ /* Ethernet+IPv4 specific members. */
+ unsigned char ar_sha[6]; /* sender hardware address */
+ unsigned char ar_sip[4]; /* sender IP address: be32 */
+ unsigned char ar_tha[6]; /* target hardware address */
+ unsigned char ar_tip[4]; /* target IP address: be32 */
+} __attribute__((packed));
+struct icmp_t {
+ u8 type;
+ u8 code;
+};
+struct icmpv6_t {
+ u8 type;
+ u8 code;
+ u16 csum;
+ union {
+ uint32_t data32[1]; /* type-specific field */
+ uint16_t data16[2]; /* type-specific field */
+ uint8_t data8[4]; /* type-specific field */
+ } dataun;
+};
+struct ipv4_t {
+ u8 ttl; /* 8 bits */
+ u8 protocol; /* 8 bits */
+ ovs_be32 srcAddr; /* 32 bits */
+ ovs_be32 dstAddr; /* 32 bits */
+};
+struct gnv_opt {
+ ovs_be16 opt_class;
+ uint8_t type;
+ uint8_t length:5;
+ uint8_t r3:1;
+ uint8_t r2:1;
+ uint8_t r1:1;
+ uint8_t opt_data[4]; /* hard-coded to 4 byte */
+};
+struct flow_tnl_t {
+ union {
+ struct {
+ u32 ip_dst; /* 32 bits; BPF uses host byte order */
+ u32 ip_src; /* 32 bits */
+ } ip4;
+#ifdef BPF_ENABLE_IPV6
+ struct {
+ char ipv6_dst[16]; /* 128 bits */
+ char ipv6_src[16]; /* 128 bits */
+ } ip6;
+#endif
+ };
+ u32 tun_id; /* 32 bits */
+ u16 flags; /* 16 bits */
+ u8 ip_tos; /* 8 bits */
+ u8 ip_ttl; /* 8 bits */
+ ovs_be16 tp_src; /* 16 bits */
+ ovs_be16 tp_dst; /* 16 bits */
+ u16 gbp_id; /* 16 bits */
+ u8 gbp_flags; /* 8 bits */
+ u8 use_ipv6: 4,
+ gnvopt_valid: 4;
+ struct gnv_opt gnvopt;
+ char pad1[0]; /* 40 bits */
+};
+struct tcp_t {
+ ovs_be16 srcPort; /* 16 bits */
+ ovs_be16 dstPort; /* 16 bits */
+ u32 seqNo; /* 32 bits */
+ u32 ackNo; /* 32 bits */
+ u8 dataOffset:4, /* 4 bits */
+ res:4; /* 4 bits */
+ u8 flags; /* 8 bits */
+ u16 window; /* 16 bits */
+ u16 checksum; /* 16 bits */
+ u16 urgentPtr; /* 16 bits */
+};
+struct ethernet_t {
+ char dstAddr[6]; /* 48 bits */
+ char srcAddr[6]; /* 48 bits */
+ ovs_be16 etherType; /* 16 bits */
+};
+struct vlan_tag_t {
+ union {
+ u16 pcp:3,
+ cfi:1,
+ vid:12;
+ ovs_be16 tci; /* host byte order */
+ };
+ ovs_be16 etherType; /* network byte order */
+};
+struct mpls_t {
+ ovs_be32 top_lse; /* top label stack entry */
+};
+
+enum proto_valid {
+ ETHER_VALID = 1 << 0,
+ MPLS_VALID = 1 << 1,
+ IPV4_VALID = 1 << 2,
+ IPV6_VALID = 1 << 3,
+ ARP_VALID = 1 << 4,
+ TCP_VALID = 1 << 5,
+ UDP_VALID = 1 << 6,
+ ICMP_VALID = 1 << 7,
+ VLAN_VALID = 1 << 8,
+ CVLAN_VALID = 1 << 9,
+ ICMPV6_VALID = 1 << 10,
+};
+
+struct ebpf_headers_t {
+ u32 valid;
+ struct ethernet_t ethernet;
+ struct mpls_t mpls;
+ union {
+ struct ipv4_t ipv4;
+#ifdef BPF_ENABLE_IPV6
+ struct ipv6_t ipv6;
+#endif
+ struct arp_rarp_t arp;
+ };
+ union {
+ struct tcp_t tcp;
+ struct udp_t udp;
+ struct icmp_t icmp;
+ struct icmpv6_t icmpv6;
+ };
+ struct vlan_tag_t vlan;
+ struct vlan_tag_t cvlan;
+};
+struct ebpf_metadata_t {
+ struct pkt_metadata_t md;
+ struct flow_tnl_t tnl_md;
+};
+#endif
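The parser records each recognized header by setting a bit in `ebpf_headers_t.valid`, and later stages (upcall, actions) test combinations of those bits, as in the `hdrs->valid & VLAN_VALID` check in `process_upcall()`. A sketch of that usage, mirroring the `proto_valid` bits above (the `is_ipv4_tcp()` helper is illustrative, not from the patch):

```c
#include <assert.h>

/* Mirror of the proto_valid bits from generated_headers.h. */
enum proto_valid {
    ETHER_VALID = 1 << 0,
    MPLS_VALID  = 1 << 1,
    IPV4_VALID  = 1 << 2,
    IPV6_VALID  = 1 << 3,
    ARP_VALID   = 1 << 4,
    TCP_VALID   = 1 << 5,
    UDP_VALID   = 1 << 6,
    ICMP_VALID  = 1 << 7,
    VLAN_VALID  = 1 << 8,
};

/* A flow is IPv4/TCP only if all three layers parsed successfully. */
static int is_ipv4_tcp(unsigned valid)
{
    unsigned want = ETHER_VALID | IPV4_VALID | TCP_VALID;
    return (valid & want) == want;
}
```

Because the L3 and L4 headers live in unions, the valid bits are the only way to know which union member is meaningful.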
diff --git a/bpf/helpers.h b/bpf/helpers.h
new file mode 100644
index 000000000000..69fdbb344075
--- /dev/null
+++ b/bpf/helpers.h
@@ -0,0 +1,209 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#ifndef __OVSBPF_HELPERS_H
+#define __OVSBPF_HELPERS_H
+#include <stdbool.h>
+#include <stdio.h>
+#include <linux/bpf.h>
+
+/* Additional headers */
+# define printk(fmt, ...) \
+({ \
+ char ____fmt[] = fmt; \
+ bpf_trace_printk(____fmt, sizeof(____fmt), \
+ ##__VA_ARGS__); \
+})
+
+#define ERR_EXIT() \
+ ({printk("[ERROR] \n"); return TC_ACT_OK;})
+
+#define NOT_HERE() \
+ ({printk("[ERROR] Program should not reach here\n");})
+
+#ifndef BPF_TYPES
+#define BPF_TYPES
+typedef signed char s8;
+typedef unsigned char u8;
+typedef signed short s16;
+typedef unsigned short u16;
+typedef signed int s32;
+typedef unsigned int u32;
+typedef signed long long s64;
+typedef unsigned long long u64;
+#endif
+
+#define ___constant_swab16(x) ((__u16)( \
+ (((__u16)(x) & (__u16)0x00ffU) << 8) | \
+ (((__u16)(x) & (__u16)0xff00U) >> 8)))
+
+#define ___constant_swab32(x) ((__u32)( \
+ (((__u32)(x) & (__u32)0x000000ffUL) << 24) | \
+ (((__u32)(x) & (__u32)0x0000ff00UL) << 8) | \
+ (((__u32)(x) & (__u32)0x00ff0000UL) >> 8) | \
+ (((__u32)(x) & (__u32)0xff000000UL) >> 24)))
+
+#define ___constant_swab64(x) ((__u64)( \
+ (((__u64)(x) & (__u64)0x00000000000000ffULL) << 56) | \
+ (((__u64)(x) & (__u64)0x000000000000ff00ULL) << 40) | \
+ (((__u64)(x) & (__u64)0x0000000000ff0000ULL) << 24) | \
+ (((__u64)(x) & (__u64)0x00000000ff000000ULL) << 8) | \
+ (((__u64)(x) & (__u64)0x000000ff00000000ULL) >> 8) | \
+ (((__u64)(x) & (__u64)0x0000ff0000000000ULL) >> 24) | \
+ (((__u64)(x) & (__u64)0x00ff000000000000ULL) >> 40) | \
+ (((__u64)(x) & (__u64)0xff00000000000000ULL) >> 56)))
+
+#define __constant_htonl(x) (___constant_swab32((x)))
+#define __constant_ntohl(x) (___constant_swab32(x))
+#define __constant_htons(x) (___constant_swab16((x)))
+#define __constant_ntohs(x) ___constant_swab16((x))
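These constant byte-swap macros are what `htons`/`ntohl` expand to in BPF programs, where libc is unavailable; on a little-endian host the conversion is exactly the swap. A compact restatement of the 16- and 32-bit forms for sanity checking:

```c
#include <assert.h>
#include <stdint.h>

/* Same constant-folding byte swaps that helpers.h defines;
 * compacted here with fixed-width types for illustration. */
#define swab16(x) ((uint16_t)((((uint16_t)(x) & 0x00ffU) << 8) | \
                              (((uint16_t)(x) & 0xff00U) >> 8)))

#define swab32(x) ((uint32_t)((((uint32_t)(x) & 0x000000ffUL) << 24) | \
                              (((uint32_t)(x) & 0x0000ff00UL) <<  8) | \
                              (((uint32_t)(x) & 0x00ff0000UL) >>  8) | \
                              (((uint32_t)(x) & 0xff000000UL) >> 24)))
```

Since the operands are compile-time constants in typical use (e.g. `htons(ETH_P_IP)`), the compiler folds the whole expression to an immediate, so no helper call or runtime swap appears in the generated BPF bytecode.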
+
+/* helper macro to place programs, maps, license in
+ * different sections in elf_bpf file. Section names
+ * are interpreted by elf_bpf loader
+ */
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+/* helper functions called from eBPF programs written in C */
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+ (void *) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value,
+ unsigned long long flags) =
+ (void *) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) =
+ (void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
+ (void *) BPF_FUNC_probe_read;
+static unsigned long long (*bpf_ktime_get_ns)(void) =
+ (void *) BPF_FUNC_ktime_get_ns;
+static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
+ (void *) BPF_FUNC_trace_printk;
+static void (*bpf_tail_call)(void *ctx, void *map, int index) =
+ (void *) BPF_FUNC_tail_call;
+static unsigned long long (*bpf_get_smp_processor_id)(void) =
+ (void *) BPF_FUNC_get_smp_processor_id;
+static unsigned long long (*bpf_get_current_pid_tgid)(void) =
+ (void *) BPF_FUNC_get_current_pid_tgid;
+static unsigned long long (*bpf_get_current_uid_gid)(void) =
+ (void *) BPF_FUNC_get_current_uid_gid;
+static int (*bpf_get_current_comm)(void *buf, int buf_size) =
+ (void *) BPF_FUNC_get_current_comm;
+static int (*bpf_perf_event_read)(void *map, int index) =
+ (void *) BPF_FUNC_perf_event_read;
+static int (*bpf_clone_redirect)(void *ctx, int ifindex, int flags) =
+ (void *) BPF_FUNC_clone_redirect;
+static int (*bpf_redirect)(int ifindex, int flags) =
+ (void *) BPF_FUNC_redirect;
+static int (*bpf_perf_event_output)(void *ctx, void *map,
+ unsigned long long flags, void *data,
+ int size) =
+ (void *) BPF_FUNC_perf_event_output;
+static int (*bpf_get_stackid)(void *ctx, void *map, int flags) =
+ (void *) BPF_FUNC_get_stackid;
+static int (*bpf_probe_write_user)(void *dst, void *src, int size) =
+ (void *) BPF_FUNC_probe_write_user;
+static int (*bpf_current_task_under_cgroup)(void *map, int index) =
+ (void *) BPF_FUNC_current_task_under_cgroup;
+static int (*bpf_skb_get_tunnel_key)(void *ctx, void *key, int size, int flags) =
+ (void *) BPF_FUNC_skb_get_tunnel_key;
+static int (*bpf_skb_set_tunnel_key)(void *ctx, void *key, int size, int flags) =
+ (void *) BPF_FUNC_skb_set_tunnel_key;
+static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) =
+ (void *) BPF_FUNC_skb_get_tunnel_opt;
+static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
+ (void *) BPF_FUNC_skb_set_tunnel_opt;
+static unsigned long long (*bpf_get_prandom_u32)(void) =
+ (void *) BPF_FUNC_get_prandom_u32;
+static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
+ (void *) BPF_FUNC_xdp_adjust_head;
+static int (*bpf_skb_vlan_push)(void *ctx, int vlan_proto, int vlan_tci) =
+ (void *) BPF_FUNC_skb_vlan_push;
+static int (*bpf_skb_vlan_pop)(void *ctx) =
+ (void *) BPF_FUNC_skb_vlan_pop;
+static int (*bpf_skb_change_tail)(void *ctx, int len, int flags) =
+ (void *) BPF_FUNC_skb_change_tail;
+static int (*bpf_get_hash_recalc)(void *ctx) =
+ (void *) BPF_FUNC_get_hash_recalc;
+
+/* llvm builtin functions that eBPF C program may use to
+ * emit BPF_LD_ABS and BPF_LD_IND instructions
+ */
+struct sk_buff;
+unsigned long long load_byte(void *skb,
+ unsigned long long off) asm("llvm.bpf.load.byte");
+unsigned long long load_half(void *skb,
+ unsigned long long off) asm("llvm.bpf.load.half");
+unsigned long long load_word(void *skb,
+ unsigned long long off) asm("llvm.bpf.load.word");
+
+/* a helper structure used by eBPF C program
+ * to describe map attributes to elf_bpf loader
+ */
+struct bpf_map_def {
+ unsigned int type;
+ unsigned int key_size;
+ unsigned int value_size;
+ unsigned int max_entries;
+ unsigned int map_flags;
+ unsigned int id;
+ unsigned int pinning;
+};
+
+/* used in TC */
+/*
+struct bpf_elf_map {
+ __u32 type;
+ __u32 key_size;
+ __u32 value_size;
+ __u32 max_entries;
+ __u32 map_flags;
+ __u32 id;
+ __u32 pinning;
+};
+*/
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+ (void *) BPF_FUNC_skb_load_bytes;
+static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
+ (void *) BPF_FUNC_skb_store_bytes;
+static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
+ (void *) BPF_FUNC_l3_csum_replace;
+static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flags) =
+ (void *) BPF_FUNC_l4_csum_replace;
+static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
+ (void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_change_head)(void *, int len, int flags) =
+ (void *) BPF_FUNC_skb_change_head;
+
+#if defined(__x86_64__)
+#define PT_REGS_PARM1(x) ((x)->di)
+#define PT_REGS_PARM2(x) ((x)->si)
+#define PT_REGS_PARM3(x) ((x)->dx)
+#define PT_REGS_PARM4(x) ((x)->cx)
+#define PT_REGS_PARM5(x) ((x)->r8)
+#define PT_REGS_RET(x) ((x)->sp)
+#define PT_REGS_FP(x) ((x)->bp)
+#define PT_REGS_RC(x) ((x)->ax)
+#define PT_REGS_SP(x) ((x)->sp)
+#define PT_REGS_IP(x) ((x)->ip)
+#endif
+#define BPF_KPROBE_READ_RET_IP(ip, ctx) ({ \
+ bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); })
+#define BPF_KRETPROBE_READ_RET_IP(ip, ctx) ({ \
+ bpf_probe_read(&(ip), sizeof(ip), \
+ (void *)(PT_REGS_FP(ctx) + sizeof(ip))); })
+#endif
diff --git a/bpf/lookup.h b/bpf/lookup.h
new file mode 100644
index 000000000000..db60289b46b9
--- /dev/null
+++ b/bpf/lookup.h
@@ -0,0 +1,227 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+#include <openvswitch/compiler.h>
+#include "ovs-p4.h"
+#include "api.h"
+#include "helpers.h"
+#include "maps.h"
+
+/* eBPF executes actions via tail calls because eBPF does not support
+ * loops, and unrolling them produces oversized code.
+ *
+ * Each action handler uses the current packet's key to look up the next
+ * action. However, the key can be changed by some actions (e.g. hash), so a
+ * stable copy of the key is kept in the percpu_executing_key eBPF map. An
+ * action handler first fetches the stable key from percpu_executing_key,
+ * then uses it to look up the actions being executed.
+ * skb->cb[OVS_CB_ACT_IDX] points to the next action.
+ */
+static inline void ovs_execute_actions(struct __sk_buff *skb,
+ struct bpf_action *action)
+{
+ enum ovs_action_attr type;
+ type = action->type;
+
+ printt("action type %d\n", type);
+
+ /* Note: this is not a loop; the tail call below will not return. */
+ switch (type) {
+ case OVS_ACTION_ATTR_UNSPEC:
+ printt("end of action processing\n");
+ break;
+ case OVS_ACTION_ATTR_OUTPUT:
+ printt("output action port = %d\n", action->u.out.port);
+ break;
+ case OVS_ACTION_ATTR_USERSPACE:
+ printt("userspace action, len = %d, ifindex = %d upcall back\n",
+ action->u.userspace.nlattr_len, ovs_cb_get_ifindex(skb));
+ break;
+ case OVS_ACTION_ATTR_SET:
+ printt("set action, remote ipv4 = %x, is_set = %d\n",
+ action->u.tunnel.remote_ipv4, action->is_set);
+ break;
+ case OVS_ACTION_ATTR_PUSH_VLAN:
+ printt("vlan push tci %d\n", action->u.push_vlan.vlan_tci);
+ break;
+ case OVS_ACTION_ATTR_POP_VLAN:
+ printt("vlan pop\n");
+ break;
+ case OVS_ACTION_ATTR_RECIRC:
+ printt("recirc\n");
+ break;
+ case OVS_ACTION_ATTR_HASH:
+ printt("hash\n");
+ break;
+ case OVS_ACTION_ATTR_SET_MASKED:
+ printt("set masked\n");
+ break;
+ case OVS_ACTION_ATTR_CT:
+ printt("ct\n");
+ break;
+ case OVS_ACTION_ATTR_TRUNC:
+ printt("truncate\n");
+ break;
+ case OVS_ACTION_ATTR_SAMPLE: /* Nested case OVS_SAMPLE_ATTR_*. */
+ case OVS_ACTION_ATTR_PUSH_MPLS: /* struct ovs_action_push_mpls. */
+ case OVS_ACTION_ATTR_POP_MPLS: /* __be16 ethertype. */
+ case OVS_ACTION_ATTR_PUSH_ETH: /* struct ovs_action_push_eth. */
+ case OVS_ACTION_ATTR_POP_ETH: /* No argument. */
+ case OVS_ACTION_ATTR_CT_CLEAR: /* No argument. */
+ case OVS_ACTION_ATTR_PUSH_NSH: /* Nested case OVS_NSH_KEY_ATTR_*. */
+ case OVS_ACTION_ATTR_POP_NSH: /* No argument. */
+#ifndef __KERNEL__
+ case OVS_ACTION_ATTR_TUNNEL_PUSH: /* struct ovs_action_push_tnl*/
+ case OVS_ACTION_ATTR_TUNNEL_POP: /* u32 port number. */
+ case OVS_ACTION_ATTR_CLONE: /* Nested case OVS_CLONE_ATTR_*. */
+ case OVS_ACTION_ATTR_METER: /* u32 meter number. */
+#endif
+ case __OVS_ACTION_ATTR_MAX:
+#ifdef __KERNEL__
+ case OVS_ACTION_ATTR_SET_TO_MASKED: /* Kernel module internal masked
+ * set action converted from
+ * case OVS_ACTION_ATTR_SET. */
+#endif
+ default:
+ printt("ERR: action type %d not supported\n", type);
+ break;
+ }
+
+ bpf_tail_call(skb, &tailcalls, type);
+
+ /* OVS_NOT_REACHED */
+ return;
+}
+
+static inline void
+stats_account(enum ovs_bpf_dp_stats index)
+{
+ uint32_t stat = 1;
+ uint64_t *value;
+
+ value = map_lookup_elem(&datapath_stats, &index);
+ if (value) {
+ __sync_fetch_and_add(value, stat);
+ }
+}
+
+/* The OVS revalidator thread reads each entry in the eBPF maps
+ * (flow_table and dp_flow_stats), reports the statistics to the
+ * OpenFlow tables, and decides whether to remove or keep the entry
+ * by comparing its timestamp.
+ */
+static inline void
+flow_stats_account(struct ebpf_headers_t *headers,
+ struct ebpf_metadata_t *mds,
+ size_t bytes)
+{
+ struct bpf_flow_key flow_key;
+ struct bpf_flow_stats *flow_stats;
+
+ flow_key.headers = *headers;
+ flow_key.mds = *mds;
+
+ flow_stats = bpf_map_lookup_elem(&dp_flow_stats, &flow_key);
+ if (!flow_stats) {
+ struct bpf_flow_stats s = {0, 0, 0};
+ int err;
+
+ printt("flow not found in flow stats, first install\n");
+ s.packet_count = 1;
+ s.byte_count = bytes;
+ s.used = bpf_ktime_get_ns() / (1000*1000); /* msec */
+ err = bpf_map_update_elem(&dp_flow_stats, &flow_key, &s, BPF_ANY);
+ if (err) {
+ return;
+ }
+ } else {
+ flow_stats->packet_count += 1;
+ flow_stats->byte_count += bytes;
+ flow_stats->used = bpf_ktime_get_ns() / (1000*1000); /* msec */
+ printt("current: packets %d count %d ts %d\n",
+ flow_stats->packet_count, flow_stats->byte_count, flow_stats->used);
+ }
+
+ return;
+}
+
+static inline struct bpf_action_batch *
+ovs_lookup_flow(struct ebpf_headers_t *headers,
+ struct ebpf_metadata_t *mds)
+{
+ struct bpf_flow_key flow_key;
+
+ flow_key.headers = *headers;
+ flow_key.mds = *mds;
+
+ return bpf_map_lookup_elem(&flow_table, &flow_key);
+}
+
+__section_tail(MATCH_ACTION_CALL)
+static int lookup(struct __sk_buff* skb OVS_UNUSED)
+{
+ struct bpf_action_batch *action_batch;
+ struct ebpf_headers_t *headers;
+ struct ebpf_metadata_t *mds;
+
+ headers = bpf_get_headers();
+ if (!headers) {
+ printt("no packet header found\n");
+ ERR_EXIT();
+ }
+
+ mds = bpf_get_mds();
+ if (!mds) {
+ printt("no packet metadata found\n");
+ ERR_EXIT();
+ }
+
+ /* LOOKUP */
+ action_batch = ovs_lookup_flow(headers, mds);
+ if (!action_batch) {
+ printt("no action found, upcall to userspace\n");
+ bpf_tail_call(skb, &tailcalls, UPCALL_CALL);
+
+ /* OVS_NOT_REACHED */
+ return TC_ACT_OK;
+ } else {
+ /* DP Stats Update */
+ stats_account(OVS_DP_STATS_HIT);
+ /* Flow Stats Update */
+ flow_stats_account(headers, mds, skb->len);
+ }
+
+ /* Hit verifier limit when moving declaration up. */
+ struct bpf_flow_key flow_key;
+ flow_key.headers = *headers;
+ flow_key.mds = *mds;
+ int index = 0;
+ int error = bpf_map_update_elem(&percpu_executing_key, &index,
+ &flow_key, BPF_ANY);
+ if (error) {
+ printt("update percpu_executing_key failed: %d\n", error);
+ return TC_ACT_OK;
+ }
+
+ /* the subsequent actions will be tail called. */
+ ovs_execute_actions(skb, &action_batch->actions[0]);
+
+ printt("ERROR: tail call fails\n");
+
+ /* OVS_NOT_REACHED */
+ return TC_ACT_OK;
+}
diff --git a/bpf/maps.h b/bpf/maps.h
new file mode 100644
index 000000000000..aa1c15864975
--- /dev/null
+++ b/bpf/maps.h
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * This file is offered under your choice of two licenses: Apache 2.0 or GNU
+ * GPL 2.0 or later. The permission statements for each of these licenses is
+ * given below. You may license your modifications to this file under either
+ * of these licenses or both. If you wish to license your modifications under
+ * only one of these licenses, delete the permission text for the other
+ * license.
+ *
+ * ----------------------------------------------------------------------
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * ----------------------------------------------------------------------
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ * ----------------------------------------------------------------------
+ */
+
+#ifndef BPFMAP_OPENVSWITCH_H
+#define BPFMAP_OPENVSWITCH_H 1
+
+#include "api.h"
+#include "openvswitch.h"
+#include "ovs-p4.h"
+
+/* ovs-vswitchd as the writer updates these maps; the BPF datapath
+ * as the reader looks them up and processes packets. */
+
+/* FIXME: copy from iproute2 */
+enum {
+ BPF_MAP_ID_PROTO,
+ BPF_MAP_ID_QUEUE,
+ BPF_MAP_ID_DROPS,
+ BPF_MAP_ID_ACTION,
+ BPF_MAP_ID_INGRESS,
+ __BPF_MAP_ID_MAX,
+#define BPF_MAP_ID_MAX __BPF_MAP_ID_MAX
+};
+
+/* A bpf flow key is extracted from the
+ * parser.h and saved in
+ * 1) percpu_headers, and
+ * 2) percpu_metadata
+ * Access: BPF is the only writer/reader
+ */
+BPF_PERCPU_ARRAY(percpu_headers,
+ 0,
+ sizeof(struct ebpf_headers_t),
+ 0,
+ 1
+);
+BPF_PERCPU_ARRAY(percpu_metadata,
+ 0,
+ sizeof(struct ebpf_metadata_t),
+ 0,
+ 1
+);
+
+/* BPF flow table
+ * Access: BPF is the reader for lookup,
+ * ovs-vswitchd is the writer
+ */
+BPF_HASH(flow_table,
+ 0,
+ sizeof(struct bpf_flow_key),
+ sizeof(struct bpf_action_batch),
+ 0,
+ 256
+);
+
+/* BPF flow stats table
+ * Access: BPF is the writer for updating,
+ * ovs-vswitchd/revalidator is the reader
+ */
+BPF_HASH(dp_flow_stats,
+ 0,
+ sizeof(struct bpf_flow_key),
+ sizeof(struct bpf_flow_stats),
+ 0,
+ 256
+);
+
+/*
+ * Map for implementing the upcall, which forwards the first packet
+ * of a flow (on lookup miss) to ovs-vswitchd
+ */
+BPF_PERF_OUTPUT(upcalls, 0);
+
+
+/* BPF datapath stats
+ * Access: BPF is the writer,
+ * ovs-vswitchd is the reader
+ * XXX: switch to percpu to improve performance
+ */
+BPF_ARRAY(datapath_stats,
+ 0,
+ sizeof(uint64_t),
+ 0,
+ __OVS_DP_STATS_MAX
+);
+
+/* Global tail call map:
+ * index 0-31 for actions (OVS_ACTION_ATTR_*)
+ * index 32-63 for others
+ */
+BPF_PROG_ARRAY(tailcalls,
+ 0,
+ 0,
+ 64
+);
+
+/* A dedicated action list for downcall packets.
+ * Access: ovs-vswitchd is the writer,
+ * BPF is the reader
+ */
+BPF_ARRAY(execute_actions,
+ 0,
+ sizeof(struct bpf_action_batch),
+ 0,
+ 1
+);
+
+/* A dedicated key for downcall packets.
+ * Access: ovs-vswitchd is the writer,
+ * BPF is the reader
+ */
+BPF_PERCPU_ARRAY(percpu_executing_key,
+ 0,
+ sizeof(struct bpf_flow_key),
+ 0,
+ 1
+);
+
+struct ebpf_headers_t;
+struct ebpf_metadata_t;
+
+static inline struct ebpf_headers_t *bpf_get_headers()
+{
+ int ebpf_zero = 0;
+ return bpf_map_lookup_elem(&percpu_headers, &ebpf_zero);
+}
+
+static inline struct ebpf_metadata_t *bpf_get_mds()
+{
+ int ebpf_zero = 0;
+ return bpf_map_lookup_elem(&percpu_metadata, &ebpf_zero);
+}
+
+#endif /* BPFMAP_OPENVSWITCH_H */
diff --git a/bpf/odp-bpf.h b/bpf/odp-bpf.h
new file mode 100644
index 000000000000..b1df3bbe6840
--- /dev/null
+++ b/bpf/odp-bpf.h
@@ -0,0 +1,254 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * This file is offered under your choice of two licenses: Apache 2.0 or GNU
+ * GPL 2.0 or later. The permission statements for each of these licenses is
+ * given below. You may license your modifications to this file under either
+ * of these licenses or both. If you wish to license your modifications under
+ * only one of these licenses, delete the permission text for the other
+ * license.
+ *
+ * ----------------------------------------------------------------------
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * ----------------------------------------------------------------------
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ * ----------------------------------------------------------------------
+ */
+
+#ifndef BPF_OPENVSWITCH_H
+#define BPF_OPENVSWITCH_H 1
+
+#include "odp-netlink.h"
+#include "generated_headers.h"
+
+enum ovs_upcall_cmd {
+ OVS_UPCALL_UNSPEC = OVS_PACKET_CMD_UNSPEC,
+
+ /* Kernel-to-user notifications. */
+ OVS_UPCALL_MISS = OVS_PACKET_CMD_MISS,
+ OVS_UPCALL_ACTION = OVS_PACKET_CMD_ACTION,
+
+ /* Userspace commands. */
+ OVS_UPCALL_EXECUTE = OVS_PACKET_CMD_EXECUTE,
+
+ OVS_UPCALL_DEBUG,
+};
+
+enum ovs_dbg_subtype {
+ OVS_DBG_ST_UNSPEC,
+ OVS_DBG_ST_REDIRECT,
+ __OVS_DBG_ST_MAX,
+};
+#define OVS_DBG_ST_MAX (__OVS_DBG_ST_MAX - 1)
+
+static const char *bpf_upcall_subtypes[] OVS_UNUSED = {
+ [OVS_DBG_ST_UNSPEC] = "Unspecified",
+ [OVS_DBG_ST_REDIRECT] = "Downcall redirect",
+};
+
+/* Used with 'datapath_stats' map. */
+enum ovs_bpf_dp_stats {
+ OVS_DP_STATS_UNSPEC,
+ OVS_DP_STATS_HIT,
+ OVS_DP_STATS_MISSED,
+ OVS_DP_STATS_LOST,
+ OVS_DP_STATS_FLOWS,
+ OVS_DP_STATS_MASK_HIT,
+ OVS_DP_STATS_MASKS,
+ OVS_DP_STATS_ERRORS,
+ __OVS_DP_STATS_MAX,
+};
+#define OVS_DP_STATS_MAX (__OVS_DP_STATS_MAX - 1)
+
+struct bpf_flow {
+ uint64_t value; /* XXX */
+};
+
+struct bpf_flow_stats {
+ uint64_t packet_count; /* Number of packets matched. */
+ uint64_t byte_count; /* Number of bytes matched. */
+ uint64_t used; /* Last used time (in jiffies). */
+ //spinlock_t lock; /* Lock for atomic stats update. */
+ //__be16 tcp_flags; /* Union of seen TCP flags. */
+};
+
+struct bpf_flow_key {
+ struct ebpf_headers_t headers;
+ struct ebpf_metadata_t mds;
+};
+
+struct bpf_upcall {
+ uint8_t type;
+ uint8_t subtype;
+ uint32_t ifindex; /* Incoming device */
+ uint32_t cpu;
+ uint32_t error;
+ uint32_t skb_len;
+#ifdef BPF_ENABLE_IPV6
+ uint8_t uactions[24]; /* Contains 'struct nlattr' */
+#else
+ uint8_t uactions[64];
+#endif
+ uint32_t uactions_len;
+ struct bpf_flow_key key;
+ /* Followed by 'skb_len' of packet data. */
+};
+
+#define OVS_BPF_FLAGS_TX_STACK (1 << 0)
+
+#define OVS_BPF_DOWNCALL_UNSPEC 0
+#define OVS_BPF_DOWNCALL_OUTPUT 1
+#define OVS_BPF_DOWNCALL_EXECUTE 2
+
+struct bpf_downcall {
+ uint32_t type;
+ uint32_t ifindex;
+ uint32_t debug;
+ uint32_t flags;
+ struct ebpf_metadata_t md;
+ /* Followed by packet data. */
+};
+
+#define ETH_ALEN 6
+
+#define OVS_ACTION_ATTR_UNSPEC 0
+#define OVS_ACTION_ATTR_OUTPUT 1
+#define OVS_ACTION_ATTR_USERSPACE 2
+#define OVS_ACTION_ATTR_SET 3
+#define OVS_ACTION_ATTR_PUSH_VLAN 4
+#define OVS_ACTION_ATTR_POP_VLAN 5
+#define OVS_ACTION_ATTR_SAMPLE 6
+#define OVS_ACTION_ATTR_RECIRC 7
+#define OVS_ACTION_ATTR_HASH 8
+#define OVS_ACTION_ATTR_PUSH_MPLS 9
+#define OVS_ACTION_ATTR_POP_MPLS 10
+#define OVS_ACTION_ATTR_SET_MASKED 11
+#define OVS_ACTION_ATTR_CT 12
+#define OVS_ACTION_ATTR_TRUNC 13
+#define OVS_ACTION_ATTR_PUSH_ETH 14
+#define OVS_ACTION_ATTR_POP_ETH 15
+
+#define VLAN_CFI_MASK 0x1000 /* Canonical Format Indicator */
+#define VLAN_TAG_PRESENT VLAN_CFI_MASK
+
+struct flow_key {
+ __be32 src;
+ __be32 dst;
+ union {
+ __be32 ports;
+ __be16 port16[2];
+ };
+ __u32 ip_proto;
+};
+
+struct ovs_action_set_tunnel {
+ /* light weight tunnel key */
+ __u32 tunnel_id; /* tunnel id in host byte order */
+ union {
+ __u32 remote_ipv4; /* host byte order */
+ __u32 remote_ipv6[4];
+ };
+ __u8 tunnel_tos;
+ __u8 tunnel_ttl;
+ __u16 tunnel_ext;
+ __u32 tunnel_label;
+ struct gnv_opt gnvopt;
+ __u8 gnvopt_valid;
+ __u8 use_ipv6;
+};
+
+struct ovs_action_set_masked {
+ enum ovs_key_attr key_type;
+ union {
+ struct ovs_key_ethernet ether;
+ struct ovs_key_mpls mpls;
+ struct ovs_key_ipv4 ipv4;
+ struct ovs_key_ipv6 ipv6;
+ struct ovs_key_tcp tcp;
+ struct ovs_key_udp udp;
+ struct ovs_key_sctp sctp;
+ struct ovs_key_icmp icmp;
+ struct ovs_key_icmpv6 icmpv6;
+ struct ovs_key_arp arp;
+ } key;
+#if 0
+ /* BPF datapath does not support mask */
+ union {
+ struct ovs_key_ethernet ether;
+ struct ovs_key_mpls mpls;
+ struct ovs_key_ipv4 ipv4;
+ struct ovs_key_ipv6 ipv6;
+ struct ovs_key_tcp tcp;
+ struct ovs_key_udp udp;
+ struct ovs_key_sctp sctp;
+ struct ovs_key_icmp icmp;
+ struct ovs_key_icmpv6 icmpv6;
+ struct ovs_key_arp arp;
+ } mask;
+#endif
+};
+
+struct ovs_action_output {
+ uint32_t port;
+ uint32_t flags;
+};
+
+struct ovs_action_ct {
+ int commit;
+ /* XXX: Include everything in enum ovs_ct_attr. */
+};
+
+struct ovs_action_userspace {
+ __u16 nlattr_len;
+ __u8 nlattr_data[64];
+};
+
+struct bpf_action {
+ enum ovs_action_attr type; /* action type */
+ uint32_t is_set;
+ union {
+ struct ovs_action_output out; /* OVS_ACTION_ATTR_OUTPUT: 8B */
+ struct ovs_action_trunc trunc; /* OVS_ACTION_ATTR_TRUNC: 4B */
+ struct ovs_action_hash hash; /* OVS_ACTION_ATTR_HASH: 8B */
+ struct ovs_action_push_mpls mpls; /* OVS_ACTION_ATTR_PUSH_MPLS: 6B */
+ ovs_be16 ethertype; /* OVS_ACTION_ATTR_POP_MPLS: 2B */
+ struct ovs_action_push_vlan push_vlan; /* OVS_ACTION_ATTR_PUSH_VLAN: 4B */
+ /* OVS_ACTION_ATTR_POP_VLAN: 0B */
+ uint32_t recirc_id; /* OVS_ACTION_ATTR_RECIRC: 4B */
+ struct ovs_action_set_tunnel tunnel;
+ struct ovs_action_set_masked mset; /* OVS_ACTION_ATTR_SET_MASKED: */
+ struct ovs_action_ct ct; /* OVS_ACTION_ATTR_CT: */
+ struct ovs_action_userspace userspace; /* OVS_ACTION_ATTR_USERSPACE: */
+
+ uint64_t aligned[16]; /* pad the union to 128 bytes */
+ } u;
+};
+
+#define BPF_DP_MAX_ACTION 32
+struct bpf_action_batch {
+ struct bpf_action actions[BPF_DP_MAX_ACTION];
+};
+
+#endif /* BPF_OPENVSWITCH_H */
diff --git a/bpf/openvswitch.h b/bpf/openvswitch.h
new file mode 100644
index 000000000000..602e223bd280
--- /dev/null
+++ b/bpf/openvswitch.h
@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * This file is offered under your choice of two licenses: Apache 2.0 or GNU
+ * GPL 2.0 or later. The permission statements for each of these licenses is
+ * given below. You may license your modifications to this file under either
+ * of these licenses or both. If you wish to license your modifications under
+ * only one of these licenses, delete the permission text for the other
+ * license.
+ *
+ * ----------------------------------------------------------------------
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * ----------------------------------------------------------------------
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ * ----------------------------------------------------------------------
+ */
+
+#ifndef __BPF_OPENVSWITCH__
+#define __BPF_OPENVSWITCH__
+#include <stdint.h>
+#include "odp-netlink.h"
+
+#ifndef BPFNL_OPENVSWITCH_H
+#define BPFNL_OPENVSWITCH_H 1
+#endif /* BPFNL_OPENVSWITCH_H */
+
+#endif /* __BPF_OPENVSWITCH__ */
diff --git a/bpf/ovs-p4.h b/bpf/ovs-p4.h
new file mode 100644
index 000000000000..49937894083a
--- /dev/null
+++ b/bpf/ovs-p4.h
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#ifndef BPFP4_OPENVSWITCH_H
+#define BPFP4_OPENVSWITCH_H 1
+
+#include "helpers.h"
+#include "generated_headers.h"
+/*
+ * From BCC src/cc/export/helpers.h
+ */
+#define MASK(_n) ((_n) < 64 ? (1ull << (_n)) - 1 : ((u64)-1LL))
+#define MASK128(_n) ((_n) < 128 ? ((unsigned __int128)1 << (_n)) - 1 : ((unsigned __int128)-1))
+
+static inline u16 bpf_ntohs(u16 val) {
+ /* will be recognized by gcc into rotate insn and eventually rolw 8 */
+ return (val << 8) | (val >> 8);
+}
+static inline u32 bpf_ntohl(u32 val) {
+ /* gcc will use bswapsi2 insn */
+ return __builtin_bswap32(val);
+}
+static inline u64 bpf_ntohll(u64 val) {
+ /* gcc will use bswapdi2 insn */
+ return __builtin_bswap64(val);
+}
+static inline u16 bpf_htons(u16 val) {
+ return bpf_ntohs(val);
+}
+static inline u32 bpf_htonl(u32 val) {
+ return bpf_ntohl(val);
+}
+static inline u64 bpf_htonll(u64 val) {
+ return bpf_ntohll(val);
+}
+static inline u64 load_dword(void *skb, u64 off) {
+ return ((u64)load_word(skb, off) << 32) | load_word(skb, off + 4);
+}
+
+static inline __attribute__((always_inline))
+void bpf_dins_pkt(void *pkt, u64 off, u64 bofs, u64 bsz, u64 val) {
+ // The load_xxx function does a bswap before returning the short/word/dword,
+ // so the value in register will always be host endian. However, the bytes
+ // written back need to be in network order.
+ if (bofs == 0 && bsz == 8) {
+ bpf_skb_store_bytes(pkt, off, &val, 1, 0);
+ } else if (bofs + bsz <= 8) {
+ u8 v = load_byte(pkt, off);
+ v &= ~(MASK(bsz) << (8 - (bofs + bsz)));
+ v |= ((val & MASK(bsz)) << (8 - (bofs + bsz)));
+ bpf_skb_store_bytes(pkt, off, &v, 1, 0);
+ } else if (bofs == 0 && bsz == 16) {
+ u16 v = bpf_htons(val);
+ bpf_skb_store_bytes(pkt, off, &v, 2, 0);
+ } else if (bofs + bsz <= 16) {
+ u16 v = load_half(pkt, off);
+ v &= ~(MASK(bsz) << (16 - (bofs + bsz)));
+ v |= ((val & MASK(bsz)) << (16 - (bofs + bsz)));
+ v = bpf_htons(v);
+ bpf_skb_store_bytes(pkt, off, &v, 2, 0);
+ } else if (bofs == 0 && bsz == 32) {
+ u32 v = bpf_htonl(val);
+ bpf_skb_store_bytes(pkt, off, &v, 4, 0);
+ } else if (bofs + bsz <= 32) {
+ u32 v = load_word(pkt, off);
+ v &= ~(MASK(bsz) << (32 - (bofs + bsz)));
+ v |= ((val & MASK(bsz)) << (32 - (bofs + bsz)));
+ v = bpf_htonl(v);
+ bpf_skb_store_bytes(pkt, off, &v, 4, 0);
+ } else if (bofs == 0 && bsz == 64) {
+ u64 v = bpf_htonll(val);
+ bpf_skb_store_bytes(pkt, off, &v, 8, 0);
+ } else if (bofs + bsz <= 64) {
+ u64 v = load_dword(pkt, off);
+ v &= ~(MASK(bsz) << (64 - (bofs + bsz)));
+ v |= ((val & MASK(bsz)) << (64 - (bofs + bsz)));
+ v = bpf_htonll(v);
+ bpf_skb_store_bytes(pkt, off, &v, 8, 0);
+ }
+}
+
+enum ErrorCode {
+ p4_pe_no_error,
+ p4_pe_index_out_of_bounds,
+ p4_pe_out_of_packet,
+ p4_pe_header_too_long,
+ p4_pe_header_too_short,
+ p4_pe_unhandled_select,
+ p4_pe_checksum,
+ p4_pe_too_many_encap,
+ p4_pe_ipv6_disabled,
+};
+
+#define EBPF_MASK(t, w) ((((t)(1)) << (w)) - (t)1)
+#define BYTES(w) ((w + 7) / 8)
+
+#endif
diff --git a/bpf/ovs-proto.p4 b/bpf/ovs-proto.p4
new file mode 100644
index 000000000000..c6ebdb510b75
--- /dev/null
+++ b/bpf/ovs-proto.p4
@@ -0,0 +1,329 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * This file is offered under your choice of two licenses: Apache 2.0 or GNU
+ * GPL 2.0 or later. The permission statements for each of these licenses is
+ * given below. You may license your modifications to this file under either
+ * of these licenses or both. If you wish to license your modifications under
+ * only one of these licenses, delete the permission text for the other
+ * license.
+ *
+ * ----------------------------------------------------------------------
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * ----------------------------------------------------------------------
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ * ----------------------------------------------------------------------
+ */
+
+/* OVS P4 1.0 protocol file
+ * use bcc to generate eBPF C file
+ * see bcc project: https://github.com/iovisor/bcc.git
+ * under ~/bcc/src/cc/frontends/p4/test/
+ */
+#define ETH_P_8021Q 0x8100 /* 802.1Q VLAN Extended Header */
+#define ETH_P_8021AD 0x88A8 /* 802.1ad Service VLAN */
+#define ETH_P_ARP 0x0806
+#define ETH_P_IPV4 0x0800
+#define ETH_P_IPV6 0x86DD
+
+#define IPPROTO_ICMP 1
+#define IPPROTO_IGMP 2
+#define IPPROTO_TCP 6
+#define IPPROTO_UDP 17
+#define IPPROTO_GRE 47
+#define IPPROTO_SCTP 132
+
+header_type ethernet_t {
+ fields {
+ dstAddr : 48;
+ srcAddr : 48;
+ etherType : 16;
+ }
+}
+
+header_type vlan_tag_t {
+ fields {
+ pcp : 3;
+ cfi : 1;
+ vid : 12;
+ etherType : 16;
+ }
+}
+
+header_type mpls_t {
+ fields {
+ label : 20;
+ exp : 3;
+ bos : 1;
+ ttl : 8;
+ }
+}
+
+header_type arp_rarp_t {
+ fields {
+ hwType : 16;
+ protoType : 16;
+ hwAddrLen : 8;
+ protoAddrLen : 8;
+ opcode : 16;
+ }
+}
+
+header_type arp_rarp_ipv4_t {
+ fields {
+ srcHwAddr : 48;
+ srcProtoAddr : 32;
+ dstHwAddr : 48;
+ dstProtoAddr : 32;
+ }
+}
+
+header_type ipv4_t {
+ fields {
+ version : 4;
+ ihl : 4;
+ diffserv : 8;
+ totalLen : 16;
+ identification : 16;
+ flags : 3;
+ fragOffset : 13;
+ ttl : 8;
+ protocol : 8;
+ hdrChecksum : 16;
+ srcAddr : 32;
+ dstAddr: 32;
+ }
+}
+
+header_type ipv6_t {
+ fields {
+ version : 4;
+ trafficClass : 8;
+ flowLabel : 20;
+ payloadLen : 16;
+ nextHdr : 8;
+ hopLimit : 8;
+ srcAddr : 128;
+ dstAddr : 128;
+ }
+}
+
+header_type icmp_t {
+ fields {
+ typeCode : 16;
+ hdrChecksum : 16;
+ }
+}
+
+header_type tcp_t {
+ fields {
+ srcPort : 16;
+ dstPort : 16;
+ seqNo : 32;
+ ackNo : 32;
+ dataOffset : 4;
+ res : 4;
+ flags : 8;
+ window : 16;
+ checksum : 16;
+ urgentPtr : 16;
+ }
+}
+
+header_type udp_t {
+ fields {
+ srcPort : 16;
+ dstPort : 16;
+ length_ : 16;
+ checksum : 16;
+ }
+}
+
+header_type sctp_t {
+ fields {
+ srcPort : 16;
+ dstPort : 16;
+ verifTag : 32;
+ checksum : 32;
+ }
+}
+
+header_type gre_t {
+ fields {
+ C : 1;
+ R : 1;
+ K : 1;
+ S : 1;
+ s : 1;
+ recurse : 3;
+ flags : 5;
+ ver : 3;
+ proto : 16;
+ }
+}
+
+/* ----------------- metadata ---------------- */
+header_type pkt_metadata_t {
+ fields {
+ recirc_id : 32; /* Recirculation id carried with the
+ recirculating packets. 0 for packets
+ received from the wire. */
+ dp_hash : 32; /* hash value computed by the recirculation
+ action. */
+ skb_priority : 32; /* Packet priority for QoS. */
+ pkt_mark : 32; /* Packet mark. */
+ ct_state : 16; /* Connection state. */
+ ct_zone : 16; /* Connection zone. */
+ ct_mark : 32; /* Connection mark. */
+ ct_label : 128; /* Connection label. */
+ in_port : 32; /* Input port. */
+ }
+}
+
+header_type flow_tnl_t {
+ fields {
+ /* struct flow_tnl:
+ * Tunnel information used in flow key and metadata.
+ */
+ ip_dst : 32;
+ ipv6_dst : 64;
+ ip_src: 32;
+ ipv6_src : 64;
+ tun_id : 64;
+ flags : 16;
+ ip_tos : 8;
+ ip_ttl : 8;
+ tp_src : 16;
+ tp_dst : 16;
+ gbp_id : 16;
+ gbp_flags : 8;
+ pad1: 40; /* Pad to 64 bits. */
+ /* struct tun_metadata metadata; */
+ }
+}
+
+header ethernet_t ethernet;
+header ipv4_t ipv4;
+header ipv6_t ipv6;
+header arp_rarp_t arp;
+header tcp_t tcp;
+header udp_t udp;
+header icmp_t icmp;
+header vlan_tag_t vlan;
+metadata pkt_metadata_t md;
+metadata flow_tnl_t tnl_md;
+
+parser start {
+ return parse_ethernet;
+}
+
+parser parse_ethernet{
+ extract(ethernet);
+ return select(latest.etherType) {
+ ETH_P_8021Q: parse_vlan;
+ ETH_P_8021AD: parse_vlan;
+ ETH_P_ARP: parse_arp;
+ ETH_P_IPV4: parse_ipv4;
+ ETH_P_IPV6: parse_ipv6;
+ default: ingress;
+ }
+}
+
+parser parse_vlan {
+ extract(vlan);
+ return select(latest.etherType) {
+ ETH_P_ARP: parse_arp;
+ ETH_P_IPV4: parse_ipv4;
+ ETH_P_IPV6: parse_ipv6;
+ default: ingress;
+ }
+}
+
+parser parse_arp {
+ extract(arp);
+ return ingress;
+}
+
+parser parse_ipv4 {
+ extract(ipv4);
+ return select(latest.protocol) {
+ IPPROTO_TCP: parse_tcp;
+ IPPROTO_UDP: parse_udp;
+ IPPROTO_ICMP: parse_icmp;
+ default: ingress;
+ }
+}
+
+parser parse_ipv6 {
+ extract(ipv6);
+ return select(latest.nextHdr) {
+ IPPROTO_TCP: parse_tcp;
+ IPPROTO_UDP: parse_udp;
+ IPPROTO_ICMP: parse_icmp;
+ default: ingress;
+ }
+}
+
+parser parse_tcp {
+ extract(tcp);
+ return ingress;
+}
+
+parser parse_udp {
+ extract(udp);
+ return ingress;
+}
+
+parser parse_icmp {
+ extract(icmp);
+ return ingress;
+}
+/* ------------------------------------------------------------------------- */
+action nop() {}
+
+table ovs_tbl {
+ reads {
+ /* Prevent the compiler from optimizing these out,
+ although we are not using them at all */
+ ethernet.dstAddr: exact;
+ vlan.etherType: exact;
+ ipv4.dstAddr: exact;
+ ipv6.dstAddr: exact;
+ icmp.typeCode: exact;
+ tcp.dstPort: exact;
+ udp.dstPort: exact;
+ md.in_port: exact;
+ tnl_md.tun_id: exact;
+ }
+ actions {
+ nop;
+ }
+}
+
+control ingress
+{
+ apply(ovs_tbl);
+}
+
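
The parse graph above (start → parse_ethernet → parse_vlan/arp/ipv4/ipv6 → ingress) bottoms out in a select on EtherType. As a sanity check of that dispatch, here is a hypothetical stand-alone C sketch of the same state transition; the enum and function names are illustrative, not taken from the generated code:

```c
#include <stdint.h>

/* Next parser state chosen from the EtherType of the current header.
 * in_vlan mirrors the fact that the P4 program only follows a VLAN
 * tag from the outer Ethernet header (no nested tags). */
enum parse_state { S_INGRESS, S_VLAN, S_ARP, S_IPV4, S_IPV6 };

static enum parse_state next_state(uint16_t ether_type, int in_vlan)
{
    switch (ether_type) {
    case 0x8100:                 /* ETH_P_8021Q */
    case 0x88A8:                 /* ETH_P_8021AD */
        return in_vlan ? S_INGRESS : S_VLAN;
    case 0x0806: return S_ARP;   /* ETH_P_ARP */
    case 0x0800: return S_IPV4;  /* ETH_P_IPV4 */
    case 0x86DD: return S_IPV6;  /* ETH_P_IPV6 */
    default:     return S_INGRESS;
    }
}
```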
diff --git a/bpf/parser.h b/bpf/parser.h
new file mode 100644
index 000000000000..ab43d5e30730
--- /dev/null
+++ b/bpf/parser.h
@@ -0,0 +1,412 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+/*
+ * Protocol parser generated from P4 1.0
+ */
+#include "ovs-p4.h"
+#include "api.h"
+#include "helpers.h"
+#include "maps.h"
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+
+__section_tail(PARSER_CALL)
+static int ovs_parser(struct __sk_buff* skb) {
+ struct ebpf_headers_t ebpf_headers = {};
+ struct ebpf_metadata_t ebpf_metadata = {};
+ unsigned skbOffsetInBits = 0;
+ enum ErrorCode ebpf_error = p4_pe_no_error;
+ u32 ebpf_zero = 0;
+ int offset = 0;
+ void *data = (void *)(long)skb->data;
+ struct ethhdr *eth = data;
+
+ if ((char *)data + sizeof(*eth) > (char *)(long)skb->data_end) {
+ return 0;
+ }
+
+ ebpf_headers.valid = 0;
+ printt("proto = %x len = %d vlan_tci = %x\n",
+ eth->h_proto, skb->len, (int)skb->vlan_tci);
+ printt("skb->ingress_ifindex %d skb->ifindex %d\n",
+ skb->ingress_ifindex, skb->ifindex);
+
+ if (skb->cb[OVS_CB_ACT_IDX] != 0) {
+ printt("this is a downcall packet\n");
+ }
+
+ if (skb_load_bytes(skb, offset, &ebpf_headers.ethernet, 14) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.valid |= ETHER_VALID;
+ offset += 14;
+ skbOffsetInBits = offset * 8;
+
+ /* vlan_tci is in host byte order. */
+ if (skb->vlan_tci) {
+ ebpf_headers.vlan.tci = skb->vlan_tci | VLAN_TAG_PRESENT;
+ ebpf_headers.vlan.etherType = skb->vlan_proto;
+ ebpf_headers.valid |= VLAN_VALID;
+ printt("vlan proto %x tci %x\n", skb->vlan_proto, skb->vlan_tci);
+ }
+
+ u32 tmp_3 = eth->h_proto;
+ if (tmp_3 == 0x0081 || tmp_3 == 0xA888) {
+ if (ebpf_headers.valid & VLAN_VALID) {
+ goto parse_cvlan;
+ }
+
+ printt("Nested vlan? not supported!\n");
+ if (1) return 0;
+ if (skb->vlan_tci) {
+ goto parse_cvlan;
+ } else {
+ goto parse_vlan;
+ }
+ } else if (tmp_3 == 0x0608) {
+ goto parse_arp;
+ } else if (tmp_3 == 0x0008) {
+ goto parse_ipv4;
+ } else if (tmp_3 == 0xDD86) {
+ goto parse_ipv6;
+ } else {
+ goto ovs_tbl_4;
+ }
+
+ parse_vlan: {
+ struct vlan_tag_t *vlan = &ebpf_headers.vlan;
+ if (skb_load_bytes(skb, offset, vlan, 4) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ printt("parsing vlan\n");
+ offset += 4;
+ skbOffsetInBits = offset * 8;
+
+ {
+ u32 tmp_5 = ebpf_headers.vlan.etherType;
+ if (tmp_5 == 0x0608)
+ goto parse_arp;
+ if (tmp_5 == 0x0008)
+ goto parse_ipv4;
+ if (tmp_5 == 0xDD86)
+ goto parse_ipv6;
+ if (tmp_5 == 0x0081 || tmp_5 == 0xA888) {
+ printt("layer-3 vlan not supported");
+ goto parse_cvlan;
+ } else
+ goto ovs_tbl_4;
+ }
+ }
+ parse_cvlan: {
+ if (skb_load_bytes(skb, offset, &ebpf_headers.cvlan, 4) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ printt("parsing cvlan\n");
+ offset += 4;
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= CVLAN_VALID;
+ u32 tmp_5 = ebpf_headers.cvlan.etherType;
+ if (tmp_5 == 0x0608)
+ goto parse_arp;
+ if (tmp_5 == 0x0008)
+ goto parse_ipv4;
+ if (tmp_5 == 0xDD86)
+ goto parse_ipv6;
+ if (tmp_5 == 0x0081) {
+ ebpf_error = p4_pe_too_many_encap;
+ goto end;
+ }
+ else
+ goto ovs_tbl_4;
+ }
+ parse_arp: {
+ struct arp_rarp_t *arp = &ebpf_headers.arp;
+ if (skb_load_bytes(skb, offset, arp, sizeof ebpf_headers.arp) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ if (arp->ar_hrd == 0x0100 &&
+ arp->ar_pro == 0x0008 &&
+ arp->ar_hln == 6 &&
+ arp->ar_pln == 4) {
+
+ printt("valid arp\n");
+ } else {
+ printt("Invalid arp\n");
+ }
+ offset += sizeof ebpf_headers.arp;
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= ARP_VALID;
+ goto ovs_tbl_4;
+ }
+ parse_ipv4: {
+ struct iphdr nh;
+ if (skb_load_bytes(skb, offset, &nh, 20) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ offset += nh.ihl * 4;
+ ebpf_headers.ipv4.ttl = nh.ttl;
+ ebpf_headers.ipv4.protocol = nh.protocol;
+ ebpf_headers.ipv4.srcAddr = nh.saddr;
+ ebpf_headers.ipv4.dstAddr = nh.daddr;
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= IPV4_VALID;
+ u32 tmp_6 = ebpf_headers.ipv4.protocol;
+ if (tmp_6 == 6)
+ goto parse_tcp;
+ if (tmp_6 == 17)
+ goto parse_udp;
+ if (tmp_6 == 1)
+ goto parse_icmp;
+ else
+ goto ovs_tbl_4;
+ }
+ parse_ipv6: {
+#ifdef BPF_ENABLE_IPV6
+ if (skb->len < BYTES(skbOffsetInBits + 4)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.ipv6.version = ((load_byte(skb, (skbOffsetInBits + 0) / 8)) >> (4)) & EBPF_MASK(u8, 4);
+ skbOffsetInBits += 4;
+ if (skb->len < BYTES(skbOffsetInBits + 8)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ //ebpf_headers.ipv6.trafficClass = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (4)) & EBPF_MASK(u16, 8);
+ ebpf_headers.ipv6.trafficClass = 0;
+ skbOffsetInBits += 8;
+ if (skb->len < BYTES(skbOffsetInBits + 20)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.ipv6.flowLabel = ((load_word(skb, (skbOffsetInBits + 0) / 8)) >> (8)) & EBPF_MASK(u32, 20);
+ skbOffsetInBits += 20;
+ if (skb->len < BYTES(skbOffsetInBits + 16)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ //ebpf_headers.ipv6.payloadLen = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ ebpf_headers.ipv6.payloadLen = 0;
+ skbOffsetInBits += 16;
+ if (skb->len < BYTES(skbOffsetInBits + 8)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.ipv6.nextHdr = ((load_byte(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ skbOffsetInBits += 8;
+ if (skb->len < BYTES(skbOffsetInBits + 8)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ //ebpf_headers.ipv6.hopLimit = ((load_byte(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ ebpf_headers.ipv6.hopLimit = 0;
+ skbOffsetInBits += 8;
+ if (skb->len < BYTES(skbOffsetInBits + 8*16*2)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ if (skb_load_bytes(skb, skbOffsetInBits/8, &ebpf_headers.ipv6.srcAddr, 32) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ skbOffsetInBits += 8*16*2;
+ ebpf_headers.valid |= IPV6_VALID;
+ u32 tmp_7 = ebpf_headers.ipv6.nextHdr;
+ printt("ipv6 proto %d\n", tmp_7);
+ if (tmp_7 == 6)
+ goto parse_tcp;
+ if (tmp_7 == 17)
+ goto parse_udp;
+ if (tmp_7 == 58)
+ goto parse_icmpv6;
+ if (tmp_7 == 41 || tmp_7 == 43 || tmp_7 == 44 || tmp_7 == 51) {
+ printt("ipv6 extension header not supported");
+ return TC_ACT_SHOT;
+ }
+ else {
+ printt("ipv6 proto %x not parsed\n", tmp_7);
+ goto ovs_tbl_4;
+ }
+#else
+ ebpf_error = p4_pe_ipv6_disabled;
+ goto end;
+#endif
+ }
+ parse_tcp: {
+ if (skb_load_bytes(skb, offset, &ebpf_headers.tcp, 4) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ offset += sizeof ebpf_headers.tcp - 1;
+
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= TCP_VALID;
+ goto ovs_tbl_4;
+ }
+ parse_udp: {
+ if (skb->len < BYTES(skbOffsetInBits + 16)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.udp.srcPort = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ skbOffsetInBits += 16;
+ if (skb->len < BYTES(skbOffsetInBits + 16)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ ebpf_headers.udp.dstPort = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ skbOffsetInBits += 16;
+ if (skb->len < BYTES(skbOffsetInBits + 16)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ //ebpf_headers.udp.length_ = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ ebpf_headers.udp.length_ = 0;
+ skbOffsetInBits += 16;
+ if (skb->len < BYTES(skbOffsetInBits + 16)) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ // Remove from key
+ // ebpf_headers.udp.checksum = ((load_half(skb, (skbOffsetInBits + 0) / 8)) >> (0));
+ ebpf_headers.udp.checksum = 0;
+ skbOffsetInBits += 16;
+ ebpf_headers.valid |= UDP_VALID;
+ goto ovs_tbl_4;
+ }
+ parse_icmp: {
+ if (skb_load_bytes(skb, offset, &ebpf_headers.icmp, 2) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ printt("icmp type = %x code = %x\n", ebpf_headers.icmp.type,
+ ebpf_headers.icmp.code);
+
+#if 0 /* the ICMP packet might be ip fragment */
+ if (ebpf_headers.ipv4.flags & IP_FRAGMENT) {
+ ebpf_headers.icmp.type = 0;
+ ebpf_headers.icmp.code = 0;
+ }
+#endif
+ offset += 8;
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= ICMP_VALID;
+ goto ovs_tbl_4;
+ }
+#ifdef BPF_ENABLE_IPV6
+ parse_icmpv6: {
+ if (skb_load_bytes(skb, offset, &ebpf_headers.icmpv6,
+ sizeof(struct icmpv6_t)) < 0) {
+ ebpf_error = p4_pe_header_too_short;
+ goto end;
+ }
+ printt("icmpv6 type = %x code = %x\n", ebpf_headers.icmpv6.type,
+ ebpf_headers.icmpv6.code);
+
+ offset += 16;
+ skbOffsetInBits = offset * 8;
+ ebpf_headers.valid |= ICMPV6_VALID;
+ goto ovs_tbl_4;
+ }
+#endif
+
+ /* Most of the code above is generated by P4C-EBPF;
+ manual code starts here. */
+ ovs_tbl_4:
+ {
+ int ret;
+ struct bpf_tunnel_key key;
+
+ ebpf_metadata.md.skb_priority = skb->priority;
+
+ /* Don't use ovs_cb_get_ifindex(), that gets optimized into something
+ * that can't be verified. >:( */
+ if (skb->cb[OVS_CB_INGRESS]) {
+ ebpf_metadata.md.in_port = skb->ingress_ifindex;
+ }
+ if (!skb->cb[OVS_CB_INGRESS]) {
+ ebpf_metadata.md.in_port = skb->ifindex;
+ }
+ ebpf_metadata.md.pkt_mark = skb->mark;
+ ebpf_metadata.md.packet_length = skb->len;
+
+ ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+ if (!ret) {
+ printt("bpf_skb_get_tunnel_key id = %d ipv4\n", key.tunnel_id);
+ ebpf_metadata.tnl_md.tun_id = key.tunnel_id;
+ ebpf_metadata.tnl_md.ip4.ip_src = key.remote_ipv4;
+ ebpf_metadata.tnl_md.ip_tos = key.tunnel_tos;
+ ebpf_metadata.tnl_md.ip_ttl = key.tunnel_ttl;
+ ebpf_metadata.tnl_md.use_ipv6 = 0;
+ ebpf_metadata.tnl_md.flags = 0;
+#ifdef BPF_ENABLE_IPV6
+ } else if (ret == -EPROTO) {
+ ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
+ BPF_F_TUNINFO_IPV6);
+ if (!ret) {
+ printt("bpf_skb_get_tunnel_key id = %d ipv6\n", key.tunnel_id);
+ ebpf_metadata.tnl_md.tun_id = key.tunnel_id;
+ memcpy(&ebpf_metadata.tnl_md.ip6.ipv6_src, &key.remote_ipv4, 16);
+ ebpf_metadata.tnl_md.ip_tos = key.tunnel_tos;
+ ebpf_metadata.tnl_md.ip_ttl = key.tunnel_ttl;
+ ebpf_metadata.tnl_md.use_ipv6 = 1;
+ ebpf_metadata.tnl_md.flags = 0;
+ }
+#endif
+ }
+
+ if (!ret) {
+ ret = bpf_skb_get_tunnel_opt(skb, &ebpf_metadata.tnl_md.gnvopt,
+ sizeof ebpf_metadata.tnl_md.gnvopt);
+ if (ret > 0)
+ ebpf_metadata.tnl_md.gnvopt_valid = 1;
+ printt("bpf_skb_get_tunnel_opt ret = %d\n", ret);
+ }
+ }
+
+end:
+ if (ebpf_error != p4_pe_no_error) {
+ printt("parse error, drop\n");
+ return TC_ACT_SHOT;
+ }
+
+ /* write flow key and md to key map */
+ printt("Parser: updating flow key\n");
+ bpf_map_update_elem(&percpu_headers,
+ &ebpf_zero, &ebpf_headers, BPF_ANY);
+
+ if (ovs_cb_is_initial_parse(skb)) {
+ bpf_map_update_elem(&percpu_metadata,
+ &ebpf_zero, &ebpf_metadata, BPF_ANY);
+ }
+ skb->cb[OVS_CB_ACT_IDX] = 0;
+
+ /* tail call next stage */
+ printt("tail call match + lookup stage\n");
+ bpf_tail_call(skb, &tailcalls, MATCH_ACTION_CALL);
+
+ printt("[ERROR] missing tail call\n");
+ return TC_ACT_OK;
+}
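
One detail worth calling out in ovs_parser() above: the EtherType literals (0x0008, 0xDD86, 0x0081, 0xA888) are the familiar values byte-swapped, because eth->h_proto is read as it sits in the packet, i.e. network byte order, on a little-endian host. The relationship is a plain 16-bit byte swap; a minimal sketch (helper name is illustrative):

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value: maps the canonical EtherType
 * to the raw value the parser sees in eth->h_proto on little-endian. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```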
diff --git a/bpf/xdp.h b/bpf/xdp.h
new file mode 100644
index 000000000000..2d2102a6ba28
--- /dev/null
+++ b/bpf/xdp.h
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2018 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+#include "ovs-p4.h"
+#include "api.h"
+#include "helpers.h"
+
+__section("xdp")
+static int xdp_ingress(struct xdp_md *ctx OVS_UNUSED)
+{
+ /* TODO: see p4c-xdp project */
+ printt("return XDP_PASS\n");
+ return XDP_PASS;
+}
+
+__section("af_xdp")
+static int af_xdp_ingress(struct xdp_md *ctx OVS_UNUSED)
+{
+ /* TODO: see xdpsock_kern.c and xdpsock_user.c */
+ return XDP_PASS;
+}
--
2.7.4


[RFC PATCH 08/11] vswitch/bridge.c: add bpf datapath initialization.

William Tu
 

This patch initializes the BPF datapath when the bridge starts.
The check_support() step could be avoided, since we already know
which features the BPF datapath program supports.
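
The init_ebpf() added below guards bpf_init()/bpf_load() with an ovsthread_once block so the datapath program is loaded exactly once and later calls return the cached result. A hypothetical stand-alone sketch of that pattern, with pthread_once standing in for OVS's ovsthread_once (names here are illustrative, not the patch's code):

```c
#include <pthread.h>
#include <stdbool.h>

/* One-shot init: do_init runs once no matter how many threads call
 * ebpf_init_once(); every caller sees the same cached error code. */
static pthread_once_t once = PTHREAD_ONCE_INIT;
static int init_error;
static bool initialized;

static void do_init(void)
{
    /* In the real patch: build the path to bpf/datapath.o, then
     * init_error = bpf_init(); if OK, bpf_load(path). */
    init_error = 0;
    initialized = true;
}

static int ebpf_init_once(void)
{
    pthread_once(&once, do_init);
    return init_error;
}
```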

Signed-off-by: Joe Stringer <joe@...>
Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
lib/packets.h | 6 ++++-
ofproto/ofproto-dpif.c | 69 ++++++++++++++++++++++++++++++++++----------------
vswitchd/bridge.c | 21 +++++++++++++++
3 files changed, 73 insertions(+), 23 deletions(-)

diff --git a/lib/packets.h b/lib/packets.h
index 9a71aa3abbdb..2379c8f6d19d 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -47,7 +47,8 @@ static inline bool ipv6_addr_is_set(const struct in6_addr *addr);
static inline bool
flow_tnl_dst_is_set(const struct flow_tnl *tnl)
{
- return tnl->ip_dst || ipv6_addr_is_set(&tnl->ipv6_dst);
+ return tnl->ip_dst || ipv6_addr_is_set(&tnl->ipv6_dst) ||
+ tnl->ip_src || ipv6_addr_is_set(&tnl->ipv6_src);
}

struct in6_addr flow_tnl_dst(const struct flow_tnl *tnl);
@@ -154,7 +155,10 @@ pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
* we can just zero out ip_dst and the rest of the data will never be
* looked at. */
md->tunnel.ip_dst = 0;
+ md->tunnel.ip_src = 0;
md->tunnel.ipv6_dst = in6addr_any;
+ md->tunnel.ipv6_src = in6addr_any;
+
md->in_port.odp_port = port;
}
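
The widened flow_tnl_dst_is_set() above now treats a tunnel as "set" when either endpoint is populated, so source-only matches also count. A stand-alone sketch of that predicate over a simplified IPv4-only tunnel struct (names hypothetical, the real code also checks the IPv6 addresses):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct flow_tnl: IPv4 endpoints only. */
struct tnl4 { uint32_t ip_src, ip_dst; };

static bool tnl_is_set(const struct tnl4 *t)
{
    /* Patched semantics: a populated source address alone is enough. */
    return t->ip_dst || t->ip_src;
}
```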

diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 3365d4185926..115c138505ac 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1338,28 +1338,53 @@ CHECK_FEATURE__(ct_orig_tuple6, ct_orig_tuple6, ct_nw_proto, 1, ETH_TYPE_IPV6)
static void
check_support(struct dpif_backer *backer)
{
- /* Actions. */
- backer->rt_support.odp.recirc = check_recirc(backer);
- backer->rt_support.odp.max_vlan_headers = check_max_vlan_headers(backer);
- backer->rt_support.odp.max_mpls_depth = check_max_mpls_depth(backer);
- backer->rt_support.masked_set_action = check_masked_set_action(backer);
- backer->rt_support.trunc = check_trunc_action(backer);
- backer->rt_support.ufid = check_ufid(backer);
- backer->rt_support.tnl_push_pop = dpif_supports_tnl_push_pop(backer->dpif);
- backer->rt_support.clone = check_clone(backer);
- backer->rt_support.sample_nesting = check_max_sample_nesting(backer);
- backer->rt_support.ct_eventmask = check_ct_eventmask(backer);
- backer->rt_support.ct_clear = check_ct_clear(backer);
-
- /* Flow fields. */
- backer->rt_support.odp.ct_state = check_ct_state(backer);
- backer->rt_support.odp.ct_zone = check_ct_zone(backer);
- backer->rt_support.odp.ct_mark = check_ct_mark(backer);
- backer->rt_support.odp.ct_label = check_ct_label(backer);
-
- backer->rt_support.odp.ct_state_nat = check_ct_state_nat(backer);
- backer->rt_support.odp.ct_orig_tuple = check_ct_orig_tuple(backer);
- backer->rt_support.odp.ct_orig_tuple6 = check_ct_orig_tuple6(backer);
+ if (!strcmp(backer->type, "bpf")) {
+ /* Actions. */
+ backer->rt_support.odp.recirc = check_recirc(backer);
+ backer->rt_support.odp.max_vlan_headers = check_max_vlan_headers(backer);
+ backer->rt_support.odp.max_mpls_depth = check_max_mpls_depth(backer);
+ backer->rt_support.masked_set_action = check_masked_set_action(backer);
+ backer->rt_support.trunc = check_trunc_action(backer);
+ backer->rt_support.ufid = check_ufid(backer);
+ backer->rt_support.tnl_push_pop = dpif_supports_tnl_push_pop(backer->dpif);
+ backer->rt_support.clone = check_clone(backer);
+ backer->rt_support.sample_nesting = check_max_sample_nesting(backer);
+ backer->rt_support.ct_eventmask = false;
+ backer->rt_support.ct_clear = false;
+
+ /* Flow fields. */
+ backer->rt_support.odp.ct_state = false;
+ backer->rt_support.odp.ct_zone = false;
+ backer->rt_support.odp.ct_mark = false;
+ backer->rt_support.odp.ct_label = false;
+
+ backer->rt_support.odp.ct_state_nat = false;
+ backer->rt_support.odp.ct_orig_tuple = false;
+ backer->rt_support.odp.ct_orig_tuple6 = false;
+ } else {
+ /* Actions. */
+ backer->rt_support.odp.recirc = check_recirc(backer);
+ backer->rt_support.odp.max_vlan_headers = check_max_vlan_headers(backer);
+ backer->rt_support.odp.max_mpls_depth = check_max_mpls_depth(backer);
+ backer->rt_support.masked_set_action = check_masked_set_action(backer);
+ backer->rt_support.trunc = check_trunc_action(backer);
+ backer->rt_support.ufid = check_ufid(backer);
+ backer->rt_support.tnl_push_pop = dpif_supports_tnl_push_pop(backer->dpif);
+ backer->rt_support.clone = check_clone(backer);
+ backer->rt_support.sample_nesting = check_max_sample_nesting(backer);
+ backer->rt_support.ct_eventmask = check_ct_eventmask(backer);
+ backer->rt_support.ct_clear = check_ct_clear(backer);
+
+ /* Flow fields. */
+ backer->rt_support.odp.ct_state = check_ct_state(backer);
+ backer->rt_support.odp.ct_zone = check_ct_zone(backer);
+ backer->rt_support.odp.ct_mark = check_ct_mark(backer);
+ backer->rt_support.odp.ct_label = check_ct_label(backer);
+
+ backer->rt_support.odp.ct_state_nat = check_ct_state_nat(backer);
+ backer->rt_support.odp.ct_orig_tuple = check_ct_orig_tuple(backer);
+ backer->rt_support.odp.ct_orig_tuple6 = check_ct_orig_tuple6(backer);
+ }
}

static int
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index f44f950a4fce..ca6d73810420 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -20,6 +20,7 @@
#include <stdlib.h>

#include "async-append.h"
+#include "bpf.h"
#include "bfd.h"
#include "bitmap.h"
#include "cfm.h"
@@ -508,6 +509,25 @@ bridge_exit(bool delete_datapath)
ovsdb_idl_destroy(idl);
}

+static int
+init_ebpf(const struct ovsrec_open_vswitch *ovs_cfg OVS_UNUSED)
+{
+ static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+ static int error = 0;
+
+ if (ovsthread_once_start(&once)) {
+ char *bpf_elf = xasprintf("%s/bpf/datapath.o", ovs_pkgdatadir());
+
+ error = bpf_init();
+ if (!error) {
+ error = bpf_load(bpf_elf);
+ }
+ free(bpf_elf);
+ ovsthread_once_done(&once);
+ }
+ return error;
+}
+
/* Looks at the list of managers in 'ovs_cfg' and extracts their remote IP
* addresses and ports into '*managersp' and '*n_managersp'. The caller is
* responsible for freeing '*managersp' (with free()).
@@ -2979,6 +2999,7 @@ bridge_run(void)
if (cfg) {
netdev_set_flow_api_enabled(&cfg->other_config);
dpdk_init(&cfg->other_config);
+ init_ebpf(cfg);
}

/* Initialize the ofproto library. This only needs to run once, but
--
2.7.4


[RFC PATCH 06/11] dpif-bpf-odp: Add bpf datapath interface and impl.

William Tu
 

From: Joe Stringer <joe@...>

Add an implementation of the API between the userspace "Open
vSwitch Datapath Protocol" and the BPF datapath.

Signed-off-by: Joe Stringer <joe@...>
Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
lib/automake.mk | 12 +
lib/dpif-bpf-odp.c | 943 +++++++++++++++++++++++++++++++++++++++++++++++++++++
lib/dpif-bpf-odp.h | 47 +++
3 files changed, 1002 insertions(+)
create mode 100644 lib/dpif-bpf-odp.c
create mode 100644 lib/dpif-bpf-odp.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 8ecad12415a3..61fef23152d3 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -9,6 +9,7 @@ lib_LTLIBRARIES += lib/libopenvswitch.la

lib_libopenvswitch_la_LIBADD = $(SSL_LIBS)
lib_libopenvswitch_la_LIBADD += $(CAPNG_LDADD)
+lib_libopenvswitch_la_LIBADD += $(BPF_LDADD)

if WIN32
lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
@@ -358,6 +359,7 @@ endif

if LINUX
lib_libopenvswitch_la_SOURCES += \
+ lib/bpf.h \
lib/dpif-netlink.c \
lib/dpif-netlink.h \
lib/dpif-netlink-rtnl.c \
@@ -383,6 +385,16 @@ lib_libopenvswitch_la_SOURCES += \
lib/tc.h
endif

+if HAVE_BPF
+lib_libopenvswitch_la_SOURCES += \
+ lib/bpf.c \
+ lib/dpif-bpf.c \
+ lib/dpif-bpf-odp.c \
+ lib/dpif-bpf-odp.h \
+ lib/perf-event.c \
+ lib/perf-event.h
+endif
+
if DPDK_NETDEV
lib_libopenvswitch_la_SOURCES += \
lib/dpdk.c \
diff --git a/lib/dpif-bpf-odp.c b/lib/dpif-bpf-odp.c
new file mode 100644
index 000000000000..0e10e38511ad
--- /dev/null
+++ b/lib/dpif-bpf-odp.c
@@ -0,0 +1,943 @@
+/*
+ * Copyright (c) 2017 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include "dpif-bpf-odp.h"
+
+#include <errno.h>
+
+#include "bpf/odp-bpf.h"
+#include "openvswitch/flow.h"
+#include "openvswitch/vlog.h"
+#include "netlink.h"
+#include "util.h"
+
+VLOG_DEFINE_THIS_MODULE(dpif_bpf_odp);
+
+static void
+ct_action_to_bpf(const struct nlattr *ct, struct bpf_action *dst)
+{
+ const struct nlattr *nla;
+ int left;
+
+ NL_ATTR_FOR_EACH_UNSAFE(nla, left, ct, ct->nla_len) {
+ switch ((enum ovs_ct_attr)nla->nla_type) {
+ case OVS_CT_ATTR_COMMIT:
+ dst->u.ct.commit = true;
+ break;
+ case OVS_CT_ATTR_ZONE:
+ case OVS_CT_ATTR_MARK:
+ case OVS_CT_ATTR_LABELS:
+ case OVS_CT_ATTR_HELPER:
+ case OVS_CT_ATTR_NAT:
+ case OVS_CT_ATTR_FORCE_COMMIT:
+ case OVS_CT_ATTR_EVENTMASK:
+ default:
+ VLOG_INFO("Ignoring CT attribute %d", nla->nla_type);
+ break;
+ case OVS_CT_ATTR_UNSPEC:
+ case __OVS_CT_ATTR_MAX:
+ OVS_NOT_REACHED();
+ }
+ }
+}
+
+enum odp_key_fitness
+odp_tun_to_bpf_tun(const struct nlattr *nla, size_t nla_len,
+ struct flow_tnl_t *tun)
+{
+ const struct nlattr *a;
+ size_t left;
+
+ NL_ATTR_FOR_EACH(a, left, nla, nla_len) {
+ enum ovs_tunnel_key_attr type = nl_attr_type(a);
+
+ switch (type) {
+ case OVS_TUNNEL_KEY_ATTR_ID:
+ tun->tun_id = ntohl(be64_to_be32(nl_attr_get_be64(a)));
+ break;
+ case OVS_TUNNEL_KEY_ATTR_IPV4_SRC:
+ tun->ip4.ip_src = ntohl(nl_attr_get_be32(a));
+ tun->use_ipv6 = 0;
+ break;
+ case OVS_TUNNEL_KEY_ATTR_IPV4_DST:
+ tun->ip4.ip_dst = ntohl(nl_attr_get_be32(a));
+ tun->use_ipv6 = 0;
+ break;
+ case OVS_TUNNEL_KEY_ATTR_TOS:
+ tun->ip_tos = nl_attr_get_u8(a);
+ break;
+ case OVS_TUNNEL_KEY_ATTR_TTL:
+ tun->ip_ttl = nl_attr_get_u8(a);
+ break;
+ case OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT:
+ //tun->flags |= FLOW_TNL_F_DONT_FRAGMENT;
+ // the bpf helper does not extract tun_flags
+ break;
+ case OVS_TUNNEL_KEY_ATTR_TP_DST:
+ tun->tp_dst = nl_attr_get_be16(a);
+ break;
+ case OVS_TUNNEL_KEY_ATTR_TP_SRC:
+ tun->tp_src = nl_attr_get_be16(a);
+ break;
+ case OVS_TUNNEL_KEY_ATTR_IPV6_SRC:
+#ifdef BPF_ENABLE_IPV6
+ memcpy(&tun->ip6.ipv6_src, nl_attr_get(a), 16);
+ tun->use_ipv6 = 1;
+#endif
+ break;
+ case OVS_TUNNEL_KEY_ATTR_IPV6_DST:
+#ifdef BPF_ENABLE_IPV6
+ memcpy(&tun->ip6.ipv6_dst, nl_attr_get(a), 16);
+ tun->use_ipv6 = 1;
+#endif
+ break;
+ case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS: /* Array of Geneve options. */
+ if (nl_attr_get_size(a) != sizeof tun->gnvopt) {
+ VLOG_ERR("%s: geneve opts size is %ld, expect %ld", __func__,
+ nl_attr_get_size(a), sizeof tun->gnvopt);
+ } else {
+ memcpy(&tun->gnvopt, nl_attr_get(a), sizeof tun->gnvopt);
+ tun->gnvopt_valid = 1;
+ }
+ break;
+ case OVS_TUNNEL_KEY_ATTR_CSUM: /* No argument. CSUM packet. */
+ case OVS_TUNNEL_KEY_ATTR_OAM: /* No argument. OAM frame. */
+ case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS: /* Nested OVS_VXLAN_EXT_* */
+ case OVS_TUNNEL_KEY_ATTR_PAD:
+ case __OVS_TUNNEL_KEY_ATTR_MAX:
+ VLOG_INFO("%s: unknown type %d", __func__, type);
+ break;
+ default:
+ VLOG_INFO("%s: unknown type %d", __func__, type);
+ OVS_NOT_REACHED();
+ }
+ }
+
+ return ODP_FIT_PERFECT;
+}
+
+/* Converts the OVS netlink-formatted action 'src' into a BPF action in 'dst'.
+ *
+ * Returns 0 on success, or a positive errno value on failure.
+ */
+int
+odp_action_to_bpf_action(const struct nlattr *src, struct bpf_action *dst)
+{
+ enum ovs_action_attr type = nl_attr_type(src);
+
+ switch (type) {
+ case OVS_ACTION_ATTR_PUSH_VLAN: {
+ const struct ovs_action_push_vlan *vlan = nl_attr_get(src);
+ dst->u.push_vlan = *vlan;
+ VLOG_DBG("push vlan tpid %x tci %x", vlan->vlan_tpid, vlan->vlan_tci);
+ break;
+ }
+ case OVS_ACTION_ATTR_CT:
+ ct_action_to_bpf(nl_attr_get(src), dst);
+ break;
+ case OVS_ACTION_ATTR_RECIRC:
+ dst->u.recirc_id = nl_attr_get_u32(src);
+ break;
+ case OVS_ACTION_ATTR_SAMPLE:
+ // XXX: ignore
+ return 1;
+ case OVS_ACTION_ATTR_USERSPACE:
+ if (nl_attr_get_size(src) <= sizeof dst->u.userspace.nlattr_data) {
+ size_t len = nl_attr_get_size(src);
+ memcpy(dst->u.userspace.nlattr_data, nl_attr_get(src), len);
+ dst->u.userspace.nlattr_len = len;
+ VLOG_INFO("size of userspace action is %ld", len);
+ } else {
+ VLOG_WARN("Size of userspace action too large: %ld > %ld",
+ nl_attr_get_size(src),
+ sizeof dst->u.userspace.nlattr_data);
+ return EOPNOTSUPP;
+ }
+ break;
+ case OVS_ACTION_ATTR_HASH: {
+ const struct ovs_action_hash *hash_act = nl_attr_get(src);
+ dst->u.hash = *hash_act;
+ break;
+ }
+ case OVS_ACTION_ATTR_SET: {
+ const struct nlattr *a;
+ a = nl_attr_get(src);
+
+ switch (nl_attr_type(a)) {
+ case OVS_KEY_ATTR_TUNNEL: {
+ enum odp_key_fitness ret;
+ struct flow_tnl_t tunnel;
+
+ tunnel.tun_id = 0;
+ ret = odp_tun_to_bpf_tun(nl_attr_get(a), nl_attr_get_size(a),
+ &tunnel);
+ if (ret != ODP_FIT_PERFECT) {
+ return EOPNOTSUPP;
+ }
+
+ dst->u.tunnel.tunnel_id = tunnel.tun_id;
+ if (!tunnel.use_ipv6)
+ dst->u.tunnel.remote_ipv4 = tunnel.ip4.ip_dst;
+#ifdef BPF_ENABLE_IPV6
+ else
+ memcpy(dst->u.tunnel.remote_ipv6, tunnel.ip6.ipv6_dst, 16);
+#endif
+ dst->u.tunnel.tunnel_tos = tunnel.ip_tos;
+ dst->u.tunnel.tunnel_ttl = tunnel.ip_ttl;
+ dst->u.tunnel.use_ipv6 = tunnel.use_ipv6;
+
+ if (tunnel.gnvopt_valid) {
+ dst->u.tunnel.gnvopt = tunnel.gnvopt;
+ dst->u.tunnel.gnvopt_valid = 1;
+ }
+ break;
+ }
+ default:
+ VLOG_INFO("%s: set %d is not supported", __func__,
+ nl_attr_type(a));
+ return EOPNOTSUPP;
+ }
+ break;
+ }
+ case OVS_ACTION_ATTR_SET_MASKED: {
+ const struct nlattr *a;
+ a = nl_attr_get(src);
+
+ dst->u.mset.key_type = nl_attr_type(a);
+
+ switch (nl_attr_type(a)) {
+ case OVS_KEY_ATTR_ETHERNET: {
+ struct ovs_key_ethernet *ether;
+
+ //ovs_assert(nl_attr_get_size(a) == 2 * sizeof *ether);
+
+ ether = &dst->u.mset.key.ether;
+ memcpy(ether, nl_attr_get(a), sizeof *ether);
+ break;
+ }
+ case OVS_KEY_ATTR_IPV4: {
+ struct ovs_key_ipv4 *ip;
+
+ //ovs_assert(nl_attr_get_size(a) == 2 * sizeof *ip);
+
+ ip = &dst->u.mset.key.ipv4;
+ memcpy(ip, nl_attr_get(a), sizeof *ip);
+ break;
+ }
+ default:
+ VLOG_INFO("%s: set_mask %d is not supported", __func__,
+ nl_attr_type(a));
+ return EOPNOTSUPP;
+ }
+ dst->is_set = 1;
+ break;
+ }
+ case OVS_ACTION_ATTR_TRUNC: {
+ const struct ovs_action_trunc *trunc = nl_attr_get(src);
+ dst->u.trunc = *trunc;
+ VLOG_INFO("truncate to %d byte", trunc->max_len);
+ break;
+ }
+ case OVS_ACTION_ATTR_POP_VLAN:
+ case OVS_ACTION_ATTR_PUSH_MPLS:
+ case OVS_ACTION_ATTR_POP_MPLS:
+ case OVS_ACTION_ATTR_PUSH_ETH:
+ case OVS_ACTION_ATTR_POP_ETH:
+ case OVS_ACTION_ATTR_TUNNEL_PUSH:
+ case OVS_ACTION_ATTR_TUNNEL_POP:
+ case OVS_ACTION_ATTR_CLONE:
+ case OVS_ACTION_ATTR_METER:
+ case OVS_ACTION_ATTR_CT_CLEAR:
+ case OVS_ACTION_ATTR_PUSH_NSH:
+ case OVS_ACTION_ATTR_POP_NSH:
+ VLOG_WARN("Unsupported action type %d", nl_attr_type(src));
+ return EOPNOTSUPP;
+ case OVS_ACTION_ATTR_UNSPEC:
+ case OVS_ACTION_ATTR_OUTPUT:
+ case __OVS_ACTION_ATTR_MAX:
+ OVS_NOT_REACHED();
+ }
+
+ return 0;
+}
+
+int
+bpf_actions_to_odp_actions(struct bpf_action_batch *batch, struct ofpbuf *out)
+{
+ int i;
+
+ for (i = 0; i < BPF_DP_MAX_ACTION; i++) {
+ struct bpf_action *act = &batch->actions[i];
+ enum ovs_action_attr type = act->type;
+
+ switch (type) {
+ case OVS_ACTION_ATTR_UNSPEC:
+ /* End of actions list. */
+ return 0;
+
+ case OVS_ACTION_ATTR_OUTPUT: {
+ /* XXX: ifindex to odp translation */
+ nl_msg_put_u32(out, type, act->u.out.port);
+ break;
+ }
+ case OVS_ACTION_ATTR_PUSH_VLAN: {
+ nl_msg_put_unspec(out, type, &act->u.push_vlan,
+ sizeof act->u.push_vlan);
+ break;
+ }
+ case OVS_ACTION_ATTR_RECIRC:
+ nl_msg_put_u32(out, type, act->u.recirc_id);
+ break;
+ case OVS_ACTION_ATTR_TRUNC:
+ nl_msg_put_unspec(out, type, &act->u.trunc, sizeof act->u.trunc);
+ break;
+ case OVS_ACTION_ATTR_HASH:
+ nl_msg_put_unspec(out, type, &act->u.hash, sizeof act->u.hash);
+ break;
+ case OVS_ACTION_ATTR_PUSH_MPLS:
+ nl_msg_put_unspec(out, type, &act->u.mpls, sizeof act->u.mpls);
+ break;
+ case OVS_ACTION_ATTR_POP_MPLS:
+ nl_msg_put_be16(out, type, act->u.ethertype);
+ break;
+ case OVS_ACTION_ATTR_SAMPLE: {
+ VLOG_WARN("XXX FIXME attr sample");
+ break;
+ }
+ case OVS_ACTION_ATTR_SET: {
+ /* See parse_tc_flower_to_match(). */
+ size_t start_ofs;
+ size_t tun_key_ofs;
+ struct ovs_action_set_tunnel *tun;
+
+ tun = &act->u.tunnel;
+ start_ofs = nl_msg_start_nested(out, OVS_ACTION_ATTR_SET);
+ tun_key_ofs = nl_msg_start_nested(out, OVS_KEY_ATTR_TUNNEL);
+
+ nl_msg_put_be64(out, OVS_TUNNEL_KEY_ATTR_ID,
+ be32_to_be64(htonl(tun->tunnel_id)));
+
+ if (!tun->use_ipv6) {
+ if (tun->remote_ipv4) {
+ nl_msg_put_be32(out, OVS_TUNNEL_KEY_ATTR_IPV4_DST,
+ htonl(tun->remote_ipv4));
+ }
+#ifdef BPF_ENABLE_IPV6
+ } else {
+ if (ipv6_addr_is_set((const struct in6_addr *)&tun->remote_ipv6)) {
+ nl_msg_put_in6_addr(out, OVS_TUNNEL_KEY_ATTR_IPV6_DST,
+ (const struct in6_addr *)&tun->remote_ipv6);
+ }
+#endif
+ }
+
+#if 0
+ if (!tnl_type || !strcmp(tnl_type, "geneve")) {
+ tun_metadata_to_geneve_nlattr(tun_key, tun_flow_key, key_buf, a);
+ }
+#endif
+ nl_msg_end_nested(out, tun_key_ofs);
+ nl_msg_end_nested(out, start_ofs);
+ break;
+ }
+ case OVS_ACTION_ATTR_SET_MASKED: {
+ VLOG_WARN("XXX FIXME attr set masked");
+ size_t offset = nl_msg_start_nested(out, OVS_ACTION_ATTR_SET_MASKED);
+
+ nl_msg_end_nested(out, offset);
+ break;
+ }
+
+ case OVS_ACTION_ATTR_USERSPACE: {
+ VLOG_WARN("XXX FIXME attr userspace");
+#if 0
+ size_t offset;
+ struct ovs_action_userspace *au;
+
+ au = &act->u.userspace;
+
+ offset = nl_msg_start_nested(out, OVS_ACTION_ATTR_USERSPACE);
+ nl_msg_put_u32(out, OVS_USERSPACE_ATTR_PID, 123);
+ if (nlattr_len != 0) {
+ memcpy(nl_msg_put_unspec_zero(odp_actions, OVS_USERSPACE_ATTR_USERDATA,
+ MAX(8, userdata_size)),
+ userdata, userdata_size);
+ }
+ nl_msg_end_nested(out, offset);
+#endif
+ break;
+ }
+ case OVS_ACTION_ATTR_CT:
+ case OVS_ACTION_ATTR_POP_VLAN:
+ case OVS_ACTION_ATTR_PUSH_ETH:
+ case OVS_ACTION_ATTR_POP_ETH:
+ case OVS_ACTION_ATTR_TUNNEL_PUSH:
+ case OVS_ACTION_ATTR_TUNNEL_POP:
+ case OVS_ACTION_ATTR_CLONE:
+ case OVS_ACTION_ATTR_METER:
+ case OVS_ACTION_ATTR_CT_CLEAR:
+ case OVS_ACTION_ATTR_PUSH_NSH:
+ case OVS_ACTION_ATTR_POP_NSH:
+ VLOG_WARN("Unexpected action type %d", type);
+ return EOPNOTSUPP;
+ case __OVS_ACTION_ATTR_MAX:
+ default:
+ OVS_NOT_REACHED();
+ break;
+ }
+ }
+ return 0;
+}
+
+/* Extracts packet metadata from the BPF-formatted flow key in 'key' into a
+ * flow structure in 'flow'. Returns an ODP_FIT_* value that indicates how well
+ * 'key' fits our expectations for what a flow key should contain.
+ *
+ * Note that flow->in_port will still contain an ifindex after this call, the
+ * caller is responsible for converting it to an odp_port number.
+ */
+void
+bpf_flow_key_extract_metadata(const struct bpf_flow_key *key,
+ struct flow *flow)
+{
+ const struct pkt_metadata_t *md = &key->mds.md;
+
+ /* metadata parsing */
+ flow->packet_type = htonl(PT_ETH);
+ flow->in_port.odp_port = u32_to_odp(md->in_port);
+ flow->recirc_id = md->recirc_id;
+ flow->dp_hash = md->dp_hash;
+ flow->skb_priority = md->skb_priority;
+ flow->pkt_mark = md->pkt_mark;
+ flow->ct_state = md->ct_state;
+ flow->ct_zone = md->ct_zone;
+ flow->ct_mark = md->ct_mark;
+ if (flow->recirc_id != 0) {
+ VLOG_INFO("recirc_id = %d", flow->recirc_id);
+ }
+
+ const struct flow_tnl_t *tun = &key->mds.tnl_md;
+ if (!tun->use_ipv6) {
+ flow->tunnel.ip_src = htonl(tun->ip4.ip_src);
+ flow->tunnel.ip_dst = htonl(tun->ip4.ip_dst);
+#ifdef BPF_ENABLE_IPV6
+ } else {
+ memcpy(&flow->tunnel.ipv6_src, tun->ip6.ipv6_src, 16);
+ memcpy(&flow->tunnel.ipv6_dst, tun->ip6.ipv6_dst, 16);
+#endif
+ }
+ flow->tunnel.ip_tos = tun->ip_tos;
+ flow->tunnel.ip_ttl = tun->ip_ttl;
+ flow->tunnel.tun_id = htonll(tun->tun_id);
+ /* XXX: Setting FLOW_TNL_F_DONT_FRAGMENT here makes the key differ
+ * from the one installed by the datapath. */
+ flow->tunnel.flags = 0;
+
+ if (tun->gnvopt_valid) {
+ memcpy(flow->tunnel.metadata.opts.gnv, &tun->gnvopt,
+ sizeof tun->gnvopt);
+ flow->tunnel.metadata.present.len = sizeof tun->gnvopt;
+ flow->tunnel.flags |= FLOW_TNL_F_UDPIF;
+ }
+
+ /* XXX: Consider setting the don't-fragment flag (IP_DF, 0x4000). */
+ /* TODO */
+ /*
+ flow->ct_label = md.ct_label;
+ ct_nw_proto
+ ct_{nw,tp}_{src,dst}
+ flow_tnl_copy__()
+ */
+}
+
+/* XXX The caller must perform in_port translation. */
+void
+bpf_metadata_from_flow(const struct flow *flow, struct ebpf_metadata_t *md)
+{
+ if (flow->packet_type != htonl(PT_ETH)) {
+ VLOG_WARN("Cannot convert flow to bpf metadata: non-ethernet");
+ }
+ md->md.in_port = odp_to_u32(flow->in_port.odp_port); /* XXX */
+ md->md.recirc_id = flow->recirc_id;
+ md->md.dp_hash = flow->dp_hash;
+ md->md.skb_priority = flow->skb_priority;
+ md->md.pkt_mark = flow->pkt_mark;
+ md->md.ct_state = flow->ct_state;
+ md->md.ct_zone = flow->ct_zone;
+ md->md.ct_mark = flow->ct_mark;
+
+ /* TODO */
+ /*
+ md->md.ct_label = flow.ct_label;
+ flow_tnl_copy__()
+ */
+}
+
+enum odp_key_fitness
+bpf_flow_key_to_flow(const struct bpf_flow_key *key, struct flow *flow)
+{
+ const struct ebpf_headers_t *hdrs = &key->headers;
+
+ memset(flow, 0, sizeof *flow);
+ bpf_flow_key_extract_metadata(key, flow);
+
+ /* L2 */
+ if (hdrs->valid & ETHER_VALID) {
+ memcpy(&flow->dl_dst, &hdrs->ethernet.dstAddr, sizeof(struct eth_addr));
+ memcpy(&flow->dl_src, &hdrs->ethernet.srcAddr, sizeof(struct eth_addr));
+ flow->dl_type = hdrs->ethernet.etherType;
+ }
+ if (hdrs->valid & VLAN_VALID) {
+ flow->vlans[0].tpid = hdrs->vlan.etherType;
+ flow->vlans[0].tci = hdrs->vlan.tci | htons(VLAN_CFI);
+ }
+
+ /* L3 */
+ if (hdrs->valid & IPV4_VALID) {
+ flow->nw_src = hdrs->ipv4.srcAddr;
+ flow->nw_dst = hdrs->ipv4.dstAddr;
+ flow->nw_ttl = hdrs->ipv4.ttl;
+ flow->nw_proto = hdrs->ipv4.protocol;
+#ifdef BPF_ENABLE_IPV6
+ } else if (hdrs->valid & IPV6_VALID) {
+ memcpy(&flow->ipv6_src, &hdrs->ipv6.srcAddr, sizeof flow->ipv6_src);
+ memcpy(&flow->ipv6_dst, &hdrs->ipv6.dstAddr, sizeof flow->ipv6_dst);
+ flow->ipv6_label = htonl(hdrs->ipv6.flowLabel);
+ /* XXX: flow->nw_frag */
+ flow->nw_tos = hdrs->ipv6.trafficClass;
+ flow->nw_ttl = hdrs->ipv6.hopLimit;
+ flow->nw_proto = hdrs->ipv6.nextHdr;
+#endif
+ } else if (hdrs->valid & ARP_VALID) {
+ memcpy(&flow->arp_sha, key->headers.arp.ar_sha, 6);
+ memcpy(&flow->arp_tha, key->headers.arp.ar_tha, 6);
+ memcpy(&flow->nw_src, key->headers.arp.ar_sip, 4); /* be32 */
+ memcpy(&flow->nw_dst, key->headers.arp.ar_tip, 4);
+
+ if (ntohs(key->headers.arp.ar_op) < 0xff) {
+ flow->nw_proto = ntohs(key->headers.arp.ar_op);
+ } else {
+ flow->nw_proto = 0;
+ }
+ }
+
+ /* L4 */
+ if (hdrs->valid & TCP_VALID) {
+ flow->tcp_flags = htons(hdrs->tcp.flags);
+ flow->tp_src = hdrs->tcp.srcPort;
+ flow->tp_dst = hdrs->tcp.dstPort;
+ } else if (hdrs->valid & UDP_VALID) {
+ flow->tp_src = htons(hdrs->udp.srcPort);
+ flow->tp_dst = htons(hdrs->udp.dstPort);
+ } else if (hdrs->valid & ICMP_VALID) {
+ /* XXX: validate */
+ flow->tp_src = htons(hdrs->icmp.type); // u8 to be16
+ flow->tp_dst = htons(hdrs->icmp.code);
+ } else if (hdrs->valid & ICMPV6_VALID) {
+ flow->tp_src = htons(hdrs->icmpv6.type); // u8 to be16
+ flow->tp_dst = htons(hdrs->icmpv6.code);
+ } /* XXX: IGMP */
+
+ return ODP_FIT_PERFECT;
+}
+
+/* Converts the 'nla_len' bytes of OVS netlink-formatted flow key in 'nla' into
+ * the bpf flow structure in 'key'. Returns an ODP_FIT_* value that indicates
+ * how well 'nla' fits into the BPF flow key format. On success, 'in_port' will
+ * be populated with the in_port specified by 'nla', which the caller must
+ * convert from an ODP port number into an ifindex and place into 'key'.
+ */
+enum odp_key_fitness
+odp_key_to_bpf_flow_key(const struct nlattr *nla, size_t nla_len,
+ struct bpf_flow_key *key, odp_port_t *in_port,
+ bool inner, bool verbose)
+{
+ bool found_in_port = false;
+ const struct nlattr *a;
+ size_t left;
+
+ NL_ATTR_FOR_EACH(a, left, nla, nla_len) {
+ enum ovs_key_attr type = nl_attr_type(a);
+
+ switch (type) {
+ case OVS_KEY_ATTR_PRIORITY:
+ key->mds.md.skb_priority = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_IN_PORT: {
+ /* The caller must convert the ODP port number into ifindex. */
+ *in_port = nl_attr_get_odp_port(a);
+ found_in_port = true;
+ break;
+ }
+ case OVS_KEY_ATTR_ETHERNET: {
+ const struct ovs_key_ethernet *eth = nl_attr_get(a);
+
+ for (int i = 0; i < ARRAY_SIZE(eth->eth_dst.ea); i++) {
+ key->headers.ethernet.dstAddr[i] = eth->eth_dst.ea[i];
+ key->headers.ethernet.srcAddr[i] = eth->eth_src.ea[i];
+ }
+ key->headers.valid |= ETHER_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_VLAN: {
+ ovs_be16 tci = nl_attr_get_be16(a);
+ struct vlan_tag_t *vlan = inner ? &key->headers.cvlan
+ : &key->headers.vlan;
+ vlan->tci = tci;
+ /* etherType is set below in OVS_KEY_ATTR_ETHERTYPE. */
+ key->headers.valid |= VLAN_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_ETHERTYPE:
+ /* etherType to set depends on encapsulation. */
+ if (key->headers.valid & VLAN_VALID) {
+ key->headers.vlan.etherType = key->headers.ethernet.etherType;
+ }
+
+ key->headers.ethernet.etherType = nl_attr_get_be16(a);
+ key->headers.valid |= ETHER_VALID; /* FIXME */
+ break;
+ case OVS_KEY_ATTR_IPV4: {
+ const struct ovs_key_ipv4 *ipv4 = nl_attr_get(a);
+
+ key->headers.ipv4.srcAddr = ipv4->ipv4_src;
+ key->headers.ipv4.dstAddr = ipv4->ipv4_dst;
+ key->headers.ipv4.protocol = ipv4->ipv4_proto;
+ key->headers.ipv4.ttl = ipv4->ipv4_ttl;
+ /* XXX: ipv4->ipv4_frag; One of OVS_FRAG_TYPE_*. */
+ key->headers.valid |= IPV4_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_IPV6: {
+#ifdef BPF_ENABLE_IPV6
+ const struct ovs_key_ipv6 *ipv6 = nl_attr_get(a);
+
+ memcpy(&key->headers.ipv6.srcAddr, &ipv6->ipv6_src,
+ sizeof key->headers.ipv6.srcAddr);
+ memcpy(&key->headers.ipv6.dstAddr, &ipv6->ipv6_dst,
+ sizeof key->headers.ipv6.dstAddr);
+ key->headers.ipv6.flowLabel = ntohl(ipv6->ipv6_label);
+ key->headers.ipv6.nextHdr = ipv6->ipv6_proto;
+ key->headers.ipv6.trafficClass = ipv6->ipv6_tclass;
+ key->headers.ipv6.hopLimit = ipv6->ipv6_hlimit;
+ /* XXX: ipv6_frag; One of OVS_FRAG_TYPE_*. */
+ key->headers.valid |= IPV6_VALID;
+#endif
+ break;
+ }
+ case OVS_KEY_ATTR_TCP: {
+ const struct ovs_key_tcp *tcp = nl_attr_get(a);
+
+ key->headers.tcp.srcPort = tcp->tcp_src;
+ key->headers.tcp.dstPort = tcp->tcp_dst;
+ key->headers.valid |= TCP_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_UDP: {
+ const struct ovs_key_udp *udp = nl_attr_get(a);
+
+ key->headers.udp.srcPort = ntohs(udp->udp_src);
+ key->headers.udp.dstPort = ntohs(udp->udp_dst);
+ key->headers.valid |= UDP_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_ICMP: {
+ const struct ovs_key_icmp *icmp = nl_attr_get(a);
+ /* XXX: Double-check */
+ key->headers.icmp.type = icmp->icmp_type;
+ key->headers.icmp.code = icmp->icmp_code;
+ key->headers.valid |= ICMP_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_ARP: {
+ const struct ovs_key_arp *arp = nl_attr_get(a);
+
+ key->headers.arp.ar_op = arp->arp_op;
+ memcpy(key->headers.arp.ar_sip, &arp->arp_sip, 4);
+ memcpy(key->headers.arp.ar_tip, &arp->arp_tip, 4); /* be32 */
+ memcpy(key->headers.arp.ar_sha, &arp->arp_sha, 6);
+ memcpy(key->headers.arp.ar_tha, &arp->arp_tha, 6);
+ key->headers.valid |= ARP_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_SKB_MARK:
+ key->mds.md.pkt_mark = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_TCP_FLAGS: {
+ ovs_be16 flags_be = nl_attr_get_be16(a);
+ uint16_t flags = ntohs(flags_be);
+
+ key->headers.tcp.flags = flags;
+ key->headers.tcp.res = flags >> 8;
+ key->headers.valid |= TCP_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_DP_HASH:
+ key->mds.md.dp_hash = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_RECIRC_ID:
+ key->mds.md.recirc_id = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_CT_STATE:
+ key->mds.md.ct_state = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_CT_ZONE:
+ key->mds.md.ct_zone = nl_attr_get_u16(a);
+ break;
+ case OVS_KEY_ATTR_CT_MARK:
+ key->mds.md.ct_mark = nl_attr_get_u32(a);
+ break;
+ case OVS_KEY_ATTR_CT_LABELS:
+ memcpy(&key->mds.md.ct_label, nl_attr_get(a),
+ sizeof(key->mds.md.ct_label));
+ break;
+ case OVS_KEY_ATTR_PACKET_TYPE: {
+ ovs_be32 pt = nl_attr_get_be32(a);
+ if (pt != htonl(PT_ETH)) {
+ return ODP_FIT_ERROR;
+ }
+ break;
+ }
+ case OVS_KEY_ATTR_MPLS: {
+ const struct ovs_key_mpls *mpls = nl_attr_get(a);
+ key->headers.mpls.top_lse = mpls->mpls_lse;
+ break;
+ }
+ case OVS_KEY_ATTR_ENCAP: {
+ enum odp_key_fitness ret;
+ ret = odp_key_to_bpf_flow_key(nl_attr_get(a), nl_attr_get_size(a),
+ key, in_port, true, verbose);
+ if (ret != ODP_FIT_PERFECT) {
+ return ret;
+ }
+ break;
+ }
+ case OVS_KEY_ATTR_TUNNEL: {
+ enum odp_key_fitness ret;
+ ret = odp_tun_to_bpf_tun(nl_attr_get(a), nl_attr_get_size(a),
+ &key->mds.tnl_md);
+ if (ret != ODP_FIT_PERFECT) {
+ VLOG_ERR("%s odp key to bpf tunnel key error", __func__);
+ return ret;
+ }
+ break;
+ }
+ case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV4:
+ case OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV6:
+ case OVS_KEY_ATTR_ICMPV6: {
+ const struct ovs_key_icmpv6 *icmpv6 = nl_attr_get(a);
+
+ key->headers.icmpv6.type = icmpv6->icmpv6_type;
+ key->headers.icmpv6.code = icmpv6->icmpv6_code;
+ key->headers.valid |= ICMPV6_VALID;
+ break;
+ }
+ case OVS_KEY_ATTR_ND: {
+ // XXX skip
+ break;
+ }
+ case OVS_KEY_ATTR_SCTP:
+ case OVS_KEY_ATTR_NSH:
+ {
+ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 20);
+ /* XXX: odp_format_key_attr() does not compile here; log the raw
+ * attribute type instead. */
+ VLOG_INFO_RL(&rl, "Cannot convert attribute type %d", nl_attr_type(a));
+ return ODP_FIT_ERROR;
+ }
+ case OVS_KEY_ATTR_UNSPEC:
+ case __OVS_KEY_ATTR_MAX:
+ default:
+ OVS_NOT_REACHED();
+ }
+ }
+
+ if (!inner && !found_in_port) {
+ VLOG_ERR("in_port not found in flow key");
+ return ODP_FIT_ERROR;
+ }
+
+ if (!inner && verbose) {
+ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ ds_put_format(&ds, "%s\nODP:\n", __func__);
+ odp_flow_key_format(nla, nla_len, &ds);
+ ds_put_cstr(&ds, "\nBPF:\n");
+ bpf_flow_key_format(&ds, key);
+ VLOG_INFO_RL(&rl, "%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+
+ return ODP_FIT_PERFECT;
+}
+
+#define TABSPACE " "
+
+static void
+indent(struct ds *ds, struct ds *tab, const char *string)
+{
+ ds_put_format(ds, "%s%s", ds_cstr(tab), string);
+ ds_put_cstr(tab, TABSPACE);
+}
+
+static void
+trim(struct ds *ds, struct ds *tab)
+{
+ ds_chomp(ds, '\n');
+ ds_put_char(ds, '\n');
+ ds_truncate(tab, tab->length ? tab->length - strlen(TABSPACE) : 0);
+}
+
+#define PUT_FIELD(STRUCT, NAME, FORMAT) \
+ if (STRUCT->NAME) \
+ ds_put_format(ds, #NAME"=%"FORMAT",", STRUCT->NAME)
+
+void
+bpf_flow_key_format(struct ds *ds, const struct bpf_flow_key *key)
+{
+ struct ds tab = DS_EMPTY_INITIALIZER;
+
+ indent(ds, &tab, "headers:\n");
+ {
+ if (key->headers.valid & ETHER_VALID) {
+ const struct ethernet_t *eth = &key->headers.ethernet;
+ const struct eth_addr *src = (struct eth_addr *)&eth->srcAddr;
+ const struct eth_addr *dst = (struct eth_addr *)&eth->dstAddr;
+
+ ds_put_format(ds, "%sethernet(", ds_cstr(&tab));
+ PUT_FIELD(eth, etherType, "#"PRIx16);
+ ds_put_format(ds, "dst="ETH_ADDR_FMT",", ETH_ADDR_ARGS(*dst));
+ ds_put_format(ds, "src="ETH_ADDR_FMT",", ETH_ADDR_ARGS(*src));
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ if (key->headers.valid & IPV4_VALID) {
+ const struct ipv4_t *ipv4 = &key->headers.ipv4;
+
+ ds_put_format(ds, "%sipv4(", ds_cstr(&tab));
+ PUT_FIELD(ipv4, ttl, "#"PRIx8);
+ PUT_FIELD(ipv4, protocol, "#"PRIx8);
+ ds_put_format(ds, "srcAddr="IP_FMT",", IP_ARGS(ipv4->srcAddr));
+ ds_put_format(ds, "dstAddr="IP_FMT",", IP_ARGS(ipv4->dstAddr));
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+#ifdef BPF_ENABLE_IPV6
+ if (key->headers.valid & IPV6_VALID) {
+ const struct ipv6_t *ipv6 = &key->headers.ipv6;
+
+ ds_put_format(ds, "%sipv6(", ds_cstr(&tab));
+ PUT_FIELD(ipv6, version, "#"PRIx8);
+ PUT_FIELD(ipv6, trafficClass, "#"PRIx8);
+ PUT_FIELD(ipv6, flowLabel, "#"PRIx32);
+ PUT_FIELD(ipv6, payloadLen, "#"PRIx16);
+ PUT_FIELD(ipv6, nextHdr, "#"PRIx8);
+ PUT_FIELD(ipv6, hopLimit, "#"PRIx8);
+ ds_put_cstr(ds, "src=");
+ ipv6_format_addr((struct in6_addr *)&ipv6->srcAddr, ds);
+ ds_put_cstr(ds, ",dst=");
+ ipv6_format_addr((struct in6_addr *)&ipv6->dstAddr, ds);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+#endif
+ if (key->headers.valid & ARP_VALID) {
+ const struct arp_rarp_t *arp = &key->headers.arp;
+
+ ds_put_format(ds, "%sarp(", ds_cstr(&tab));
+ PUT_FIELD(arp, ar_hrd, "#"PRIx16);
+ PUT_FIELD(arp, ar_pro, "#"PRIx16);
+ PUT_FIELD(arp, ar_hln, "#"PRIx8);
+ PUT_FIELD(arp, ar_pln, "#"PRIx8);
+ PUT_FIELD(arp, ar_op, "#"PRIx16);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ if (key->headers.valid & TCP_VALID) {
+ const struct tcp_t *tcp = &key->headers.tcp;
+
+ ds_put_format(ds, "%stcp(", ds_cstr(&tab));
+ PUT_FIELD(tcp, srcPort, PRIu16);
+ PUT_FIELD(tcp, dstPort, PRIu16);
+ PUT_FIELD(tcp, seqNo, "#"PRIx32);
+ PUT_FIELD(tcp, ackNo, "#"PRIx32);
+ PUT_FIELD(tcp, dataOffset, "#"PRIx8);
+ PUT_FIELD(tcp, res, "#"PRIx8);
+ PUT_FIELD(tcp, flags, "#"PRIx8);
+ PUT_FIELD(tcp, window, "#"PRIx16);
+ PUT_FIELD(tcp, checksum, "#"PRIx16);
+ PUT_FIELD(tcp, urgentPtr, "#"PRIx16);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ if (key->headers.valid & UDP_VALID) {
+ const struct udp_t *udp = &key->headers.udp;
+
+ ds_put_format(ds, "%sudp(", ds_cstr(&tab));
+ PUT_FIELD(udp, srcPort, PRIu16);
+ PUT_FIELD(udp, dstPort, PRIu16);
+ PUT_FIELD(udp, length_, "#"PRIx16);
+ PUT_FIELD(udp, checksum, "#"PRIx16);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ if (key->headers.valid & ICMP_VALID) {
+ const struct icmp_t *icmp = &key->headers.icmp;
+
+ ds_put_format(ds, "%sicmp(", ds_cstr(&tab));
+ PUT_FIELD(icmp, type, "#"PRIx8);
+ PUT_FIELD(icmp, code, "#"PRIx8);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ if (key->headers.valid & VLAN_VALID) {
+ const struct vlan_tag_t *vlan = &key->headers.vlan;
+
+ ds_put_format(ds, "%svlan(", ds_cstr(&tab));
+ PUT_FIELD(vlan, pcp, "#"PRIx8);
+ PUT_FIELD(vlan, cfi, "#"PRIx8);
+ PUT_FIELD(vlan, vid, "#"PRIx16);
+ PUT_FIELD(vlan, tci, "#"PRIx16);
+ PUT_FIELD(vlan, etherType, "#"PRIx16);
+ ds_chomp(ds, ',');
+ ds_put_format(ds, ")\n");
+ }
+ }
+ trim(ds, &tab);
+ indent(ds, &tab, "metadata:\n");
+ {
+ indent(ds, &tab, "md:\n");
+ {
+ ds_put_hex_dump(ds, &key->mds.md, sizeof key->mds.md, 0, false);
+ }
+ trim(ds, &tab);
+ indent(ds, &tab, "tnl_md:\n");
+ {
+ ds_put_hex_dump(ds, &key->mds.tnl_md, sizeof key->mds.tnl_md, 0,
+ false);
+ }
+ trim(ds, &tab);
+ }
+ trim(ds, &tab);
+ ds_chomp(ds, '\n');
+
+ ds_destroy(&tab);
+}
diff --git a/lib/dpif-bpf-odp.h b/lib/dpif-bpf-odp.h
new file mode 100644
index 000000000000..ddf9b5fec6af
--- /dev/null
+++ b/lib/dpif-bpf-odp.h
@@ -0,0 +1,47 @@
+/*
+ * Copyright (c) 2017 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_BPF_ODP_H
+#define DPIF_BPF_ODP_H 1
+
+#include "odp-util.h"
+
+struct flow;
+struct flow_tnl_t;
+struct nlattr;
+struct bpf_flow_key;
+struct bpf_action;
+struct ebpf_metadata_t;
+struct bpf_action_batch;
+
+int odp_action_to_bpf_action(const struct nlattr *, struct bpf_action *);
+int bpf_actions_to_odp_actions(struct bpf_action_batch *, struct ofpbuf *out);
+enum odp_key_fitness bpf_flow_key_to_flow(const struct bpf_flow_key *,
+ struct flow *);
+void bpf_flow_key_extract_metadata(const struct bpf_flow_key *,
+ struct flow *flow);
+void bpf_metadata_from_flow(const struct flow *flow,
+ struct ebpf_metadata_t *md);
+enum odp_key_fitness odp_key_to_bpf_flow_key(const struct nlattr *, size_t,
+ struct bpf_flow_key *,
+ odp_port_t *in_port,
+ bool inner, bool verbose);
+enum odp_key_fitness odp_tun_to_bpf_tun(const struct nlattr *nla,
+ size_t nla_len,
+ struct flow_tnl_t *tun);
+void bpf_flow_key_format(struct ds *ds, const struct bpf_flow_key *key);
+
+#endif /* dpif-bpf-odp.h */
--
2.7.4


[RFC PATCH 05/11] dpif: add 'dpif-bpf' provider.

William Tu
 

From: Joe Stringer <joe@...>

Implement a new datapath interface for use with BPF datapaths.

Like dpif-netlink, dpif-bpf is backed by an implementation which resides
within the kernel. It uses the BPF functionality available in recent
versions of Linux to create the datapath. Unlike dpif-netlink there is no
datapath notion of a bridge with ports attached; dpif-bpf is implemented
by attaching BPF programs directly to individual devices using TC.

Upcalls are implemented using a perf event ringbuffer, which is polled
by handler threads. Flow execution is implemented by sending the packet
plus metadata on a dedicated tap device, where there is a BPF program
that understands the format of the packet coming from userspace. When
this device receives a message, it strips the metadata, uses it to
determine how to execute the packet, then forwards the packet onwards.
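
The per-handler bookkeeping used to poll these per-CPU ringbuffers can be
sketched as below. This is a simplified, self-contained model: the struct
mirrors the offset/count/index fields of the patch's `struct bpf_handler`,
while `distribute_channels()` and `next_channel()` are illustrative helpers,
not functions from the patch.

```c
#include <assert.h>

/* Hypothetical simplification: each handler thread owns a contiguous
 * slice of the per-CPU perf channels and cycles through it round-robin. */
struct bpf_handler {
    int offset; /* First channel owned by this handler. */
    int count;  /* Number of channels owned. */
    int index;  /* Next channel to poll, relative to 'offset'. */
};

/* Evenly partition 'n_channels' among 'n_handlers'; earlier handlers
 * absorb the remainder so every channel has exactly one owner. */
static void
distribute_channels(struct bpf_handler *handlers, int n_handlers,
                    int n_channels)
{
    int base = n_channels / n_handlers;
    int extra = n_channels % n_handlers;
    int offset = 0;

    for (int i = 0; i < n_handlers; i++) {
        handlers[i].offset = offset;
        handlers[i].count = base + (i < extra ? 1 : 0);
        handlers[i].index = 0;
        offset += handlers[i].count;
    }
}

/* Return the absolute channel id to poll next and advance the cursor. */
static int
next_channel(struct bpf_handler *h)
{
    int ch = h->offset + h->index;

    h->index = (h->index + 1) % h->count;
    return ch;
}
```

With 5 CPUs and 2 handlers, for example, handler 0 owns channels 0-2 and
handler 1 owns channels 3-4, and each cycles only within its own slice.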

This initial implementation has a number of limitations which are
expected to go away over time:
* The set of matches and actions supported by the datapath is not
as wide as the full set known by OVS, so if a flow cannot be
expressed in the current eBPF API, OVS will log an error and fail
the flow put.
* Only the input port and packet length are passed as metadata from
the datapath to userspace during upcall. Key extraction is done
purely from the packet provided by the datapath.
* Conversely, only the output port is sent down during execution;
no other actions are currently supported, and only a single
output is allowed.
* Ingress policing cannot be configured on BPF
datapath devices.
* On startup, if the OVS BPF datapath is already loaded into the
kernel and pinned to the filesystem, it will reuse this datapath,
even if the datapath is out-of-date.

Documentation/intro/install/bpf.rst contains further information on how
to build and use the bpf datapath.

For more details on the design and implementation, see our OSR paper [1]
and OVS conference talk [2]:
[1] https://dl.acm.org/citation.cfm?id=3139657
[2] http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf

Signed-off-by: Joe Stringer <joe@...>
Signed-off-by: William Tu <u9012063@...>
Signed-off-by: Yifeng Sun <pkusunyifeng@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
lib/dpif-bpf.c | 1995 +++++++++++++++++++++++++++++++++++++++++++++++++++
lib/dpif-provider.h | 1 +
lib/dpif.c | 3 +
3 files changed, 1999 insertions(+)
create mode 100644 lib/dpif-bpf.c

diff --git a/lib/dpif-bpf.c b/lib/dpif-bpf.c
new file mode 100644
index 000000000000..d0931af78278
--- /dev/null
+++ b/lib/dpif-bpf.c
@@ -0,0 +1,1995 @@
+/*
+ * Copyright (c) 2016, 2017, 2018 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include <errno.h>
+#include <openvswitch/hmap.h>
+#include <openvswitch/types.h>
+#include <openvswitch/vlog.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+
+#include "bpf.h"
+#include "bpf/odp-bpf.h"
+#include "dirs.h"
+#include "dpif.h"
+#include "dpif-provider.h"
+#include "dpif-bpf-odp.h"
+#include "dpif-netlink-rtnl.h"
+#include "fat-rwlock.h"
+#include "netdev.h"
+#include "netdev-provider.h"
+#include "netdev-vport.h"
+#include "odp-util.h"
+#include "ovs-numa.h"
+#include "perf-event.h"
+#include "sset.h"
+#include "openvswitch/poll-loop.h"
+
+VLOG_DEFINE_THIS_MODULE(dpif_bpf);
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(60, 60);
+
+/* Protects against changes to 'bpf_datapaths'. */
+static struct ovs_mutex bpf_datapath_mutex = OVS_MUTEX_INITIALIZER;
+
+/* Contains all 'struct dpif_bpf_dp's. */
+static struct shash bpf_datapaths OVS_GUARDED_BY(bpf_datapath_mutex)
+ = SHASH_INITIALIZER(&bpf_datapaths);
+
+struct bpf_handler {
+ /* Into owning dpif_bpf_dp->channels */
+ int offset;
+ int count;
+ int index; /* next channel to use */
+};
+
+struct dpif_bpf_dp {
+ struct dpif *dpif;
+ const char *const name;
+ struct ovs_refcount ref_cnt;
+ atomic_flag destroyed;
+
+ /* Ports.
+ *
+ * Any lookup into 'ports' requires taking 'port_mutex'. */
+ struct ovs_mutex port_mutex;
+ struct hmap ports_by_odp OVS_GUARDED;
+ struct hmap ports_by_ifindex OVS_GUARDED;
+ struct seq *port_seq; /* Incremented whenever a port changes. */
+ uint64_t last_seq;
+
+ /* Handlers */
+ struct fat_rwlock upcall_lock;
+ uint32_t n_handlers;
+ struct bpf_handler *handlers;
+
+ /* Upcall channels. */
+ size_t page_size;
+ int n_pages;
+ int n_channels;
+ struct perf_channel channels[];
+};
+
+struct dpif_bpf {
+ struct dpif dpif;
+ struct dpif_bpf_dp *dp;
+};
+
+struct dpif_bpf_port {
+ struct hmap_node odp_node; /* Node in dpif_bpf_dp 'ports_by_odp'. */
+ struct hmap_node if_node; /* Node in dpif_bpf_dp 'ports_by_ifindex'. */
+ struct netdev *netdev;
+ odp_port_t port_no;
+ int ifindex;
+ char *type; /* Port type as requested by user. */
+ struct netdev_saved_flags *sf;
+
+ unsigned n_rxq;
+ struct netdev_rxq **rxqs;
+};
+
+static void
+vlog_hex_dump(const uint8_t *buf, size_t count)
+{
+ struct ds ds = DS_EMPTY_INITIALIZER;
+ ds_put_hex_dump(&ds, buf, count, 0, false);
+ VLOG_DBG("\n%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+}
+
+int create_dp_bpf(const char *name, struct dpif_bpf_dp **dp);
+static void dpif_bpf_close(struct dpif *dpif);
+static int do_add_port(struct dpif_bpf_dp *dp, const char *devname,
+ const char *type, odp_port_t port_no)
+ OVS_REQUIRES(dp->port_mutex);
+static void do_del_port(struct dpif_bpf_dp *dp, struct dpif_bpf_port *port)
+ OVS_REQUIRES(dp->port_mutex);
+static int dpif_bpf_delete_all_flow(void);
+
+static struct dpif_bpf *
+dpif_bpf_cast(const struct dpif *dpif)
+{
+ ovs_assert(dpif->dpif_class == &dpif_bpf_class);
+ return CONTAINER_OF(dpif, struct dpif_bpf, dpif);
+}
+
+static struct dpif_bpf_dp *
+get_dpif_bpf_dp(const struct dpif *dpif)
+{
+ return dpif_bpf_cast(dpif)->dp;
+}
+
+static struct dp_bpf {
+ struct bpf_state bpf;
+ struct netdev *outport; /* Used for downcall. */
+} datapath;
+
+static int
+configure_outport(struct netdev *outport)
+{
+ int error;
+
+ error = netdev_set_filter(outport, &datapath.bpf.downcall);
+ if (error) {
+ return error;
+ }
+
+ error = netdev_set_flags(outport, NETDEV_UP, NULL);
+ if (error) {
+ return error;
+ }
+
+ return 0;
+}
+
+static int
+dpif_bpf_init(void)
+{
+ static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+ static int error = 0;
+
+ if (ovsthread_once_start(&once)) {
+ struct netdev *outport;
+
+ error = bpf_get(&datapath.bpf, true);
+ if (!error) {
+ /* FIXME: is "ovs-system" the right name for this device? */
+ error = netdev_open("ovs-system", "tap", &outport);
+ if (!error) {
+ VLOG_INFO("%s: created BPF tap downcall device %s",
+ __func__, outport->name);
+
+ error = configure_outport(outport);
+ if (error) {
+ VLOG_ERR("%s: configure downcall device failed", __func__);
+ netdev_close(outport);
+ } else {
+ datapath.outport = outport;
+ }
+ }
+ }
+
+ if (!error) {
+ dpif_bpf_delete_all_flow();
+ }
+ ovsthread_once_done(&once);
+ }
+ return error;
+}
+
+static int
+dpif_bpf_enumerate(struct sset *all_dps,
+ const struct dpif_class *dpif_class OVS_UNUSED)
+{
+ struct shash_node *node;
+
+ ovs_mutex_lock(&bpf_datapath_mutex);
+ SHASH_FOR_EACH(node, &bpf_datapaths) {
+ sset_add(all_dps, node->name);
+ }
+ ovs_mutex_unlock(&bpf_datapath_mutex);
+
+ return 0;
+}
+
+static const char *
+dpif_bpf_port_open_type(const struct dpif_class *dpif_class OVS_UNUSED,
+ const char *type)
+{
+ return strcmp(type, "internal") ? type : "tap";
+}
+
+static struct dpif *
+create_dpif_bpf(struct dpif_bpf_dp *dp)
+ OVS_REQUIRES(bpf_datapath_mutex)
+{
+ uint16_t netflow_id = hash_string(dp->name, 0);
+ struct dpif_bpf *dpif;
+
+ ovs_refcount_ref(&dp->ref_cnt);
+
+ dpif = xmalloc(sizeof *dpif);
+ dpif_init(&dpif->dpif, &dpif_bpf_class, dp->name, netflow_id >> 8, netflow_id);
+ dpif->dp = dp;
+
+ return &dpif->dpif;
+}
+
+static int
+dpif_bpf_open(const struct dpif_class *dpif_class OVS_UNUSED,
+ const char *name, bool create OVS_UNUSED, struct dpif **dpifp)
+{
+ struct dpif_bpf_dp *dp;
+ int error;
+
+ error = dpif_bpf_init();
+ if (error) {
+ VLOG_ERR("dpif_bpf_init failed");
+ return error;
+ }
+
+ ovs_mutex_lock(&bpf_datapath_mutex);
+ dp = shash_find_data(&bpf_datapaths, name);
+ if (!dp) {
+ error = create ? create_dp_bpf(name, &dp) : ENODEV;
+ } else {
+ ovs_assert(dpif_class == &dpif_bpf_class);
+ error = create ? EEXIST : 0;
+ }
+ if (!error) {
+ *dpifp = create_dpif_bpf(dp);
+ if (create) { /* XXX */
+ dp->dpif = *dpifp;
+ }
+ }
+ ovs_mutex_unlock(&bpf_datapath_mutex);
+
+ return error;
+}
+
+static int
+perf_event_channels_init(struct dpif_bpf_dp *dp)
+{
+ size_t length = dp->page_size * (dp->n_pages + 1);
+ int error = 0;
+ int i, cpu;
+
+ for (cpu = 0; cpu < dp->n_channels; cpu++) {
+ struct perf_channel *channel = &dp->channels[cpu];
+
+ error = perf_channel_open(channel, cpu, length);
+ if (error) {
+ goto error;
+ }
+ }
+
+error:
+ if (error) {
+ for (i = 0; i < cpu; i++) {
+ perf_channel_close(&dp->channels[i]);
+ }
+ }
+
+ return error;
+}
+
+static void
+dpif_bpf_free(struct dpif_bpf_dp *dp)
+ OVS_REQUIRES(bpf_datapath_mutex)
+{
+ shash_find_and_delete(&bpf_datapaths, dp->name);
+
+ if (ovs_refcount_read(&dp->ref_cnt) == 0) {
+ ovs_mutex_destroy(&dp->port_mutex);
+ seq_destroy(dp->port_seq);
+ fat_rwlock_destroy(&dp->upcall_lock);
+ hmap_destroy(&dp->ports_by_ifindex);
+ hmap_destroy(&dp->ports_by_odp);
+ if (dp->n_handlers) {
+ free(dp->handlers);
+ }
+ free(dp);
+ }
+}
+
+int
+create_dp_bpf(const char *name, struct dpif_bpf_dp **dp_)
+ OVS_REQUIRES(bpf_datapath_mutex)
+{
+ int max_cpu;
+ struct dpif_bpf_dp *dp;
+ int i, error;
+
+ max_cpu = ovs_numa_get_n_cores();
+
+ dp = xzalloc(sizeof *dp + max_cpu * sizeof(struct perf_channel));
+ ovs_refcount_init(&dp->ref_cnt);
+ atomic_flag_clear(&dp->destroyed);
+ hmap_init(&dp->ports_by_odp);
+ hmap_init(&dp->ports_by_ifindex);
+ fat_rwlock_init(&dp->upcall_lock);
+ dp->port_seq = seq_create();
+ ovs_mutex_init(&dp->port_mutex);
+ dp->n_pages = 8;
+ dp->page_size = sysconf(_SC_PAGESIZE);
+ dp->n_channels = max_cpu;
+ dp->last_seq = seq_read(dp->port_seq);
+
+ *CONST_CAST(const char **, &dp->name) = xstrdup(name);
+ shash_add(&bpf_datapaths, name, dp); /* XXX */
+
+ error = perf_event_channels_init(dp);
+ if (error) {
+ dpif_bpf_free(dp);
+ return error;
+ }
+
+ ovs_assert(datapath.bpf.upcalls.fd != -1);
+
+ for (i = 0; i < dp->n_channels; i++) {
+ error = bpf_map_update_elem(datapath.bpf.upcalls.fd, &i,
+ &dp->channels[i].fd, 0);
+ if (error) {
+ VLOG_WARN("failed to insert channel fd on cpu=%d: %s",
+ i, ovs_strerror(error));
+ goto out;
+ }
+ }
+
+out:
+ if (error) {
+ dpif_bpf_free(dp);
+ } else {
+ *dp_ = dp;
+ }
+ return error;
+}
+
+static void
+dpif_bpf_close(struct dpif *dpif_)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+
+ ovs_mutex_lock(&bpf_datapath_mutex);
+ if (ovs_refcount_unref_relaxed(&dp->ref_cnt) == 1) {
+ struct dpif_bpf_port *port, *next;
+ int i;
+
+ fat_rwlock_wrlock(&dp->upcall_lock);
+ for (i = 0; i < dp->n_channels; i++) {
+ struct perf_channel *channel = &dp->channels[i];
+
+ perf_channel_close(channel);
+ }
+ fat_rwlock_unlock(&dp->upcall_lock);
+
+ ovs_mutex_lock(&dp->port_mutex);
+ HMAP_FOR_EACH_SAFE (port, next, odp_node, &dp->ports_by_odp) {
+ do_del_port(dp, port);
+ }
+ ovs_mutex_unlock(&dp->port_mutex);
+ dpif_bpf_free(dp);
+ }
+ ovs_mutex_unlock(&bpf_datapath_mutex);
+
+ free(dpif_bpf_cast(dpif_));
+}
+
+static int
+dpif_bpf_destroy(struct dpif *dpif_)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+
+ if (!atomic_flag_test_and_set(&dp->destroyed)) {
+ if (ovs_refcount_unref_relaxed(&dp->ref_cnt) == 1) {
+ /* Can't happen: 'dpif' still owns a reference to 'dp'.
+ * The workflow is first call dpif_class->destroy() then
+ * dpif->close(). */
+ OVS_NOT_REACHED();
+ }
+ }
+#if 0
+ if (datapath.outport) {
+ netdev_close(datapath.outport);
+ }
+#endif
+
+ return 0;
+}
+
+static int
+dpif_bpf_get_stats(const struct dpif *dpif OVS_UNUSED,
+ struct dpif_dp_stats *stats)
+{
+ uint32_t key, n_flows = 0;
+ struct bpf_flow_key flow_key;
+ int err = 0;
+
+ memset(stats, 0, sizeof(*stats));
+ key = OVS_DP_STATS_HIT;
+ if (bpf_map_lookup_elem(datapath.bpf.datapath_stats.fd, &key,
+ &stats->n_hit)) {
+ VLOG_INFO("datapath_stats lookup failed (%d): %s", key,
+ ovs_strerror(errno));
+ }
+ key = OVS_DP_STATS_MISSED;
+ if (bpf_map_lookup_elem(datapath.bpf.datapath_stats.fd, &key,
+ &stats->n_missed)) {
+ VLOG_INFO("datapath_stats lookup failed (%d): %s", key,
+ ovs_strerror(errno));
+ }
+
+ /* Count the number of datapath flow entries. */
+ memset(&flow_key, 0, sizeof flow_key);
+ do {
+ err = bpf_map_get_next_key(datapath.bpf.flow_table.fd,
+ &flow_key, &flow_key);
+ if (!err) {
+ n_flows++;
+ }
+ } while (!err);
+
+ stats->n_flows = n_flows;
+
+ /* XXX: Other missing stats */
+ return 0;
+}
+
+static struct dpif_bpf_port *
+bpf_lookup_port(const struct dpif_bpf_dp *dp, odp_port_t port_no)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port;
+
+ HMAP_FOR_EACH_WITH_HASH (port, odp_node, netdev_hash_port_no(port_no),
+ &dp->ports_by_odp) {
+ if (port->port_no == port_no) {
+ return port;
+ }
+ }
+ return NULL;
+}
+
+static odp_port_t
+choose_port(struct dpif_bpf_dp *dp)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ uint32_t port_no;
+
+ for (port_no = 1; port_no <= UINT16_MAX; port_no++) {
+ if (!bpf_lookup_port(dp, u32_to_odp(port_no))) {
+ return u32_to_odp(port_no);
+ }
+ }
+
+ return ODPP_NONE;
+}
+
+static int
+get_port_by_name(struct dpif_bpf_dp *dp, const char *devname,
+ struct dpif_bpf_port **portp)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port;
+
+ HMAP_FOR_EACH (port, odp_node, &dp->ports_by_odp) {
+ if (!strcmp(netdev_get_name(port->netdev), devname)) {
+ *portp = port;
+ return 0;
+ }
+ }
+
+ *portp = NULL;
+ return ENOENT;
+}
+
+static uint32_t
+hash_ifindex(int ifindex)
+{
+ return hash_int(ifindex, 0);
+}
+
+static int
+get_port_by_ifindex(struct dpif_bpf_dp *dp, int ifindex,
+ struct dpif_bpf_port **portp)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port;
+
+ HMAP_FOR_EACH_WITH_HASH (port, if_node, hash_ifindex(ifindex),
+ &dp->ports_by_ifindex) {
+ if (port->ifindex == ifindex) {
+ *portp = port;
+ return 0;
+ }
+ }
+
+ *portp = NULL;
+ return ENOENT;
+}
+
+static odp_port_t
+ifindex_to_odp(struct dpif_bpf_dp *dp, int ifindex)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port;
+
+ if (get_port_by_ifindex(dp, ifindex, &port)) {
+ return ODPP_NONE;
+ }
+
+ return port->port_no;
+}
+
+static bool output_to_local_stack(struct netdev *netdev)
+{
+ return !strcmp(netdev_get_type(netdev), "tap");
+}
+
+static bool netdev_support_xdp(struct netdev *netdev OVS_UNUSED)
+{
+ return true;
+}
+
+static uint32_t
+get_port_flags(struct netdev *netdev)
+{
+ return output_to_local_stack(netdev) ? OVS_BPF_FLAGS_TX_STACK : 0;
+}
+
+static uint16_t
+odp_port_to_ifindex(struct dpif_bpf_dp *dp, odp_port_t port_no, uint32_t *flags)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port = bpf_lookup_port(dp, port_no);
+
+ if (port) {
+ if (flags) {
+ *flags = get_port_flags(port->netdev);
+ }
+ return port->ifindex;
+ }
+ return 0;
+}
+
+/* Modelled after dpif-netdev 'port_create', minus pmd and txq logic, plus bpf
+ * filter set. */
+static int
+port_create(const char *devname, const char *type,
+ odp_port_t port_no, struct dpif_bpf_port **portp)
+{
+ struct netdev_saved_flags *sf;
+ struct dpif_bpf_port *port;
+ enum netdev_flags flags;
+ struct netdev *netdev;
+ int n_open_rxqs = 0;
+ int i, error;
+ int ifindex;
+
+ *portp = NULL;
+
+ /* Open and validate network device. */
+ error = netdev_open(devname, type, &netdev);
+
+ VLOG_DBG("%s %s type %s error %d", __func__, devname, type, error);
+ if (error) {
+ return error;
+ }
+ /* XXX reject non-Ethernet devices */
+
+ netdev_get_flags(netdev, &flags);
+ if (flags & NETDEV_LOOPBACK) {
+ VLOG_ERR_RL(&rl, "%s: cannot add a loopback device", devname);
+ error = EINVAL;
+ goto out;
+ }
+
+ if (netdev_is_reconf_required(netdev)) {
+ error = netdev_reconfigure(netdev);
+ if (error) {
+ goto out;
+ }
+ }
+
+ ifindex = netdev_get_ifindex(netdev);
+ if (ifindex < 0) {
+ VLOG_WARN_RL(&rl, "%s: Failed to get ifindex", devname);
+ error = -ifindex;
+ goto out;
+ }
+
+ VLOG_DBG("%s ifindex = %d", devname, ifindex);
+
+ /* For internal ports (e.g. br0, br-underlay, br-int), attach the BPF
+ * program only to the egress queue, due to the nature of tap devices.
+ * For other port types (e.g. eth0, vxlan_sys), attach it to the
+ * ingress queue.
+ *
+ * A tap device's egress queue is tied to a socket from which userspace
+ * receives packets by opening /dev/tun0; conversely, a send on that
+ * socket shows up in the tap device's ingress queue. */
+ if (output_to_local_stack(netdev)) {
+ error = netdev_set_filter(netdev, &datapath.bpf.egress);
+ } else {
+ error = netdev_set_filter(netdev, &datapath.bpf.ingress);
+ }
+ if (error) {
+ goto out;
+ }
+
+ if (netdev_support_xdp(netdev)) {
+ error = netdev_set_xdp(netdev, &datapath.bpf.xdp);
+ if (error) {
+ VLOG_WARN("%s XDP set failed", __func__);
+ goto out;
+ }
+ VLOG_DBG("%s %s XDP set done", __func__, netdev->name);
+ }
+
+ port = xzalloc(sizeof *port);
+ port->port_no = port_no;
+ port->ifindex = ifindex;
+ port->netdev = netdev;
+ port->n_rxq = netdev_n_rxq(netdev);
+ port->rxqs = xcalloc(port->n_rxq, sizeof *port->rxqs);
+ port->type = xstrdup(type);
+
+ for (i = 0; i < port->n_rxq; i++) {
+ error = netdev_rxq_open(netdev, &port->rxqs[i], i);
+ if (error) {
+ VLOG_ERR("%s: cannot receive packets on this network device (queue %d) (%s)",
+ devname, i, ovs_strerror(errno));
+ goto out_rxq_close;
+ }
+ n_open_rxqs++;
+ }
+
+ error = netdev_turn_flags_on(netdev, NETDEV_PROMISC, &sf);
+ if (error) {
+ goto out_rxq_close;
+ }
+ port->sf = sf;
+
+ *portp = port;
+ return 0;
+
+out_rxq_close:
+ for (i = 0; i < n_open_rxqs; i++) {
+ netdev_rxq_close(port->rxqs[i]);
+ }
+ free(port->type);
+ free(port->rxqs);
+ free(port);
+
+out:
+ netdev_close(netdev);
+ return error;
+}
+
+static int
+do_add_port(struct dpif_bpf_dp *dp, const char *devname,
+ const char *type, odp_port_t port_no)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ struct dpif_bpf_port *port;
+ int error;
+
+ if (!get_port_by_name(dp, devname, &port)) {
+ return EEXIST;
+ }
+
+ error = port_create(devname, type, port_no, &port);
+ if (error) {
+ VLOG_ERR("port_create returned %d", error);
+ return error;
+ }
+
+ hmap_insert(&dp->ports_by_odp, &port->odp_node,
+ netdev_hash_port_no(port->port_no));
+ hmap_insert(&dp->ports_by_ifindex, &port->if_node,
+ hash_ifindex(port->ifindex));
+ seq_change(dp->port_seq);
+
+ return 0;
+}
+
+static int
+dpif_bpf_port_add(struct dpif *dpif, struct netdev *netdev,
+ odp_port_t *port_nop)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif);
+ char namebuf[NETDEV_VPORT_NAME_BUFSIZE];
+ const char *dpif_port;
+ odp_port_t port_no;
+ int error;
+
+ if (!strcmp(netdev_get_type(netdev), "vxlan") ||
+ !strcmp(netdev_get_type(netdev), "gre") ||
+ !strcmp(netdev_get_type(netdev), "geneve")) {
+
+ VLOG_INFO("Creating %s device", netdev_get_type(netdev));
+ error = dpif_netlink_rtnl_port_create(netdev);
+ if (error) {
+ if (error != EOPNOTSUPP) {
+ VLOG_WARN_RL(&rl, "Failed to create %s with rtnetlink: %s",
+ netdev_get_name(netdev), ovs_strerror(error));
+ }
+ return error;
+ }
+ }
+
+ ovs_mutex_lock(&dp->port_mutex);
+ dpif_port = netdev_vport_get_dpif_port(netdev, namebuf, sizeof namebuf);
+ if (*port_nop != ODPP_NONE) {
+ port_no = *port_nop;
+ error = bpf_lookup_port(dp, *port_nop) ? EBUSY : 0;
+ } else {
+ port_no = choose_port(dp);
+ error = port_no == ODPP_NONE ? EFBIG : 0;
+ }
+ if (error) {
+ goto unlock;
+ }
+
+ *port_nop = port_no;
+ error = do_add_port(dp, dpif_port, netdev_get_type(netdev), port_no);
+
+unlock:
+ ovs_mutex_unlock(&dp->port_mutex);
+ return error;
+}
+
+static void
+do_del_port(struct dpif_bpf_dp *dp, struct dpif_bpf_port *port)
+ OVS_REQUIRES(dp->port_mutex)
+{
+ int i, error;
+
+ seq_change(dp->port_seq);
+ hmap_remove(&dp->ports_by_odp, &port->odp_node);
+ hmap_remove(&dp->ports_by_ifindex, &port->if_node);
+
+ error = netdev_set_filter(port->netdev, NULL);
+ if (error) {
+ VLOG_WARN("%s: Failed to clear filter from netdev",
+ netdev_get_name(port->netdev));
+ }
+
+ if (netdev_support_xdp(port->netdev)) {
+ error = netdev_set_xdp(port->netdev, NULL);
+ if (error) {
+ VLOG_WARN("%s: Failed to clear XDP from netdev",
+ netdev_get_name(port->netdev));
+ }
+ }
+
+ netdev_close(port->netdev);
+ netdev_restore_flags(port->sf);
+ for (i = 0; i < port->n_rxq; i++) {
+ netdev_rxq_close(port->rxqs[i]);
+ }
+
+ free(port->type);
+ free(port->rxqs);
+ free(port);
+}
+
+static int
+dpif_bpf_port_del(struct dpif *dpif, odp_port_t port_no)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif);
+ struct dpif_bpf_port *port;
+ int error = 0;
+
+ ovs_mutex_lock(&dp->port_mutex);
+ port = bpf_lookup_port(dp, port_no);
+ if (!port) {
+ VLOG_WARN("deleting port %d, but it doesn't exist", port_no);
+ error = EINVAL;
+ } else {
+ do_del_port(dp, port);
+ }
+ ovs_mutex_unlock(&dp->port_mutex);
+
+ return error;
+}
+
+static void
+answer_port_query(const struct dpif_bpf_port *port,
+ struct dpif_port *dpif_port)
+{
+ dpif_port->name = xstrdup(netdev_get_name(port->netdev));
+ dpif_port->type = xstrdup(port->type);
+ dpif_port->port_no = port->port_no;
+}
+
+static int
+dpif_bpf_port_query_by_number(const struct dpif *dpif_, odp_port_t port_no,
+ struct dpif_port *port_)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+ struct dpif_bpf_port *port;
+ int error = 0;
+
+ ovs_mutex_lock(&dp->port_mutex);
+ port = bpf_lookup_port(dp, port_no);
+ if (!port) {
+ error = ENOENT;
+ goto out;
+ }
+ answer_port_query(port, port_);
+
+out:
+ ovs_mutex_unlock(&dp->port_mutex);
+ return error;
+}
+
+static int
+dpif_bpf_port_query_by_name(const struct dpif *dpif_, const char *devname,
+ struct dpif_port *dpif_port)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+ struct dpif_bpf_port *port;
+ int error;
+
+ ovs_mutex_lock(&dp->port_mutex);
+ error = get_port_by_name(dp, devname, &port);
+ if (!error && dpif_port) {
+ answer_port_query(port, dpif_port);
+ }
+ ovs_mutex_unlock(&dp->port_mutex);
+
+ return error;
+}
+
+struct dpif_bpf_port_state {
+ struct hmap_position position;
+ char *name;
+};
+
+static int
+dpif_bpf_port_dump_start(const struct dpif *dpif OVS_UNUSED, void **statep)
+{
+ *statep = xzalloc(sizeof(struct dpif_bpf_port_state));
+ return 0;
+}
+
+static int
+dpif_bpf_port_dump_next(const struct dpif *dpif_, void *state_,
+ struct dpif_port *dpif_port)
+{
+ struct dpif_bpf_port_state *state = state_;
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+ struct hmap_node *node;
+ int retval;
+
+ ovs_mutex_lock(&dp->port_mutex);
+ node = hmap_at_position(&dp->ports_by_odp, &state->position);
+ if (node) {
+ struct dpif_bpf_port *port;
+
+ port = CONTAINER_OF(node, struct dpif_bpf_port, odp_node);
+
+ free(state->name);
+ state->name = xstrdup(netdev_get_name(port->netdev));
+ dpif_port->name = state->name;
+ dpif_port->type = port->type;
+ dpif_port->port_no = port->port_no;
+
+ retval = 0;
+ } else {
+ retval = EOF;
+ }
+ ovs_mutex_unlock(&dp->port_mutex);
+
+ return retval;
+}
+
+static int
+dpif_bpf_port_dump_done(const struct dpif *dpif OVS_UNUSED,
+ void *state_)
+{
+ struct dpif_bpf_port_state *state = state_;
+
+ free(state->name);
+ free(state);
+ return 0;
+}
+
+static int
+dpif_bpf_port_poll(const struct dpif *dpif_, char **devnamep OVS_UNUSED)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+ uint64_t new_port_seq;
+
+ new_port_seq = seq_read(dp->port_seq);
+ if (dp->last_seq != new_port_seq) {
+ dp->last_seq = new_port_seq;
+ return ENOBUFS;
+ }
+
+ return EAGAIN;
+}
+
+static void
+dpif_bpf_port_poll_wait(const struct dpif *dpif_)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+
+ seq_wait(dp->port_seq, dp->last_seq);
+}
+
+static int
+dpif_bpf_flow_flush(struct dpif *dpif OVS_UNUSED)
+{
+ struct bpf_flow_key key;
+ int err = 0;
+
+ /* Flow Entry Table */
+ memset(&key, 0, sizeof key);
+ do {
+ err = bpf_map_get_next_key(datapath.bpf.flow_table.fd, &key, &key);
+ if (!err) {
+ bpf_map_delete_elem(datapath.bpf.flow_table.fd, &key);
+ }
+ } while (!err);
+
+ /* Flow Stats Table */
+ memset(&key, 0, sizeof key);
+ do {
+ err = bpf_map_get_next_key(datapath.bpf.dp_flow_stats.fd, &key, &key);
+ if (!err) {
+ bpf_map_delete_elem(datapath.bpf.dp_flow_stats.fd, &key);
+ }
+ } while (!err);
+
+ return errno == ENOENT ? 0 : errno;
+}
+
+struct dpif_bpf_flow_dump {
+ struct dpif_flow_dump up;
+ int status;
+ struct bpf_flow_key pos;
+ struct ovs_mutex mutex;
+};
+
+static struct dpif_bpf_flow_dump *
+dpif_bpf_flow_dump_cast(struct dpif_flow_dump *dump)
+{
+ return CONTAINER_OF(dump, struct dpif_bpf_flow_dump, up);
+}
+
+static struct dpif_flow_dump *
+dpif_bpf_flow_dump_create(const struct dpif *dpif_, bool terse,
+ char *type OVS_UNUSED)
+{
+ struct dpif_bpf_flow_dump *dump;
+
+ dump = xzalloc(sizeof *dump);
+ dpif_flow_dump_init(&dump->up, dpif_);
+ dump->up.terse = terse;
+ ovs_mutex_init(&dump->mutex);
+
+ return &dump->up;
+}
+
+static int
+dpif_bpf_flow_dump_destroy(struct dpif_flow_dump *dump_)
+{
+ struct dpif_bpf_flow_dump *dump = dpif_bpf_flow_dump_cast(dump_);
+ int status = dump->status;
+
+ ovs_mutex_destroy(&dump->mutex);
+ free(dump);
+
+ return status == ENOENT ? 0 : status;
+}
+
+struct dpif_bpf_flow_dump_thread {
+ struct dpif_flow_dump_thread up;
+ struct dpif_bpf_flow_dump *dump;
+ struct ofpbuf buf; /* Stores key,mask,acts for a particular dump. */
+};
+
+static struct dpif_bpf_flow_dump_thread *
+dpif_bpf_flow_dump_thread_cast(struct dpif_flow_dump_thread *thread)
+{
+ return CONTAINER_OF(thread, struct dpif_bpf_flow_dump_thread, up);
+}
+
+static struct dpif_flow_dump_thread *
+dpif_bpf_flow_dump_thread_create(struct dpif_flow_dump *dump_)
+{
+ struct dpif_bpf_flow_dump *dump = dpif_bpf_flow_dump_cast(dump_);
+ struct dpif_bpf_flow_dump_thread *thread;
+
+ thread = xmalloc(sizeof *thread);
+ dpif_flow_dump_thread_init(&thread->up, &dump->up);
+ thread->dump = dump;
+ ofpbuf_init(&thread->buf, 1024);
+ return &thread->up;
+}
+
+static void
+dpif_bpf_flow_dump_thread_destroy(struct dpif_flow_dump_thread *thread_)
+{
+ struct dpif_bpf_flow_dump_thread *thread =
+ dpif_bpf_flow_dump_thread_cast(thread_);
+ ofpbuf_uninit(&thread->buf);
+ free(thread);
+}
+
+static int
+fetch_flow(struct dpif_bpf_dp *dp, struct dpif_flow *flow,
+ struct ofpbuf *out, const struct bpf_flow_key *key)
+{
+ struct flow f;
+ struct odp_flow_key_parms parms = {
+ .flow = &f,
+ };
+ struct bpf_action_batch action;
+ struct bpf_flow_stats stats;
+ int err;
+
+ memset(flow, 0, sizeof *flow);
+
+ err = bpf_map_lookup_elem(datapath.bpf.flow_table.fd, key, &action);
+ if (err) {
+ return errno;
+ }
+
+ /* XXX: Extract 'dp_flow' into 'flow'. */
+ if (bpf_flow_key_to_flow(key, &f) == ODP_FIT_ERROR) {
+ VLOG_WARN("%s: bpf flow key parsing error", __func__);
+ return EINVAL;
+ }
+ f.in_port.odp_port = ifindex_to_odp(dp,
+ odp_to_u32(f.in_port.odp_port));
+
+ /* Translate BPF flow into netlink format. */
+ ofpbuf_clear(out);
+
+ /* Use 'out->header' to point to the flow key, 'out->msg' for actions */
+ out->header = out->data;
+ odp_flow_key_from_flow(&parms, out);
+ out->msg = ofpbuf_tail(out);
+ err = bpf_actions_to_odp_actions(&action, out);
+ if (err) {
+ VLOG_ERR("%s: bpf_actions_to_odp_actions failed", __func__);
+ return err;
+ }
+
+ flow->key = out->header;
+ flow->key_len = ofpbuf_headersize(out);
+ flow->actions = out->msg;
+ flow->actions_len = ofpbuf_msgsize(out);
+
+ dpif_flow_hash(dp->dpif, flow->key, flow->key_len, &flow->ufid);
+ flow->ufid_present = false; /* XXX */
+
+ /* Fetch datapath flow stats */
+ err = bpf_map_lookup_elem(datapath.bpf.dp_flow_stats.fd, key, &stats);
+ if (err) {
+ VLOG_DBG("flow stats lookup failed, fd %d err = %d %s",
+ datapath.bpf.dp_flow_stats.fd, err, ovs_strerror(errno));
+ return errno;
+ } else {
+ VLOG_DBG("flow stats lookup OK");
+ /* bpf_flow_stats mirrors the leading three uint64_t fields of
+ * dpif_flow_stats: packet count, byte count, and last-used time. */
+ memcpy(&flow->stats, &stats, 3 * sizeof(uint64_t));
+ }
+
+ return 0;
+}
+
+static int
+dpif_bpf_insert_flow(struct bpf_flow_key *flow_key,
+ struct bpf_action_batch *actions)
+{
+ int err;
+ struct bpf_flow_stats flow_stats;
+
+ VLOG_DBG("Insert bpf_flow_key:");
+ vlog_hex_dump((unsigned char *)flow_key, sizeof *flow_key);
+
+ VLOG_DBG("Insert action:");
+ vlog_hex_dump((unsigned char *)actions, sizeof actions[0]);
+
+ ovs_assert(datapath.bpf.flow_table.fd != -1);
+ err = bpf_map_update_elem(datapath.bpf.flow_table.fd,
+ flow_key,
+ actions, BPF_ANY);
+ if (err) {
+ VLOG_ERR("Failed to add flow into flow table, map fd %d, error %s",
+ datapath.bpf.flow_table.fd, ovs_strerror(errno));
+ return errno;
+ }
+
+ flow_stats.packet_count = 1;
+ flow_stats.byte_count = flow_key->mds.md.packet_length;
+ flow_stats.used = 0;
+
+ err = bpf_map_update_elem(datapath.bpf.dp_flow_stats.fd,
+ flow_key,
+ &flow_stats, BPF_ANY);
+ if (err) {
+ VLOG_ERR("Failed to add flow into flow stats table, map fd %d, error %s",
+ datapath.bpf.dp_flow_stats.fd, ovs_strerror(errno));
+ return errno;
+ }
+
+ return 0;
+}
+
+static int
+dpif_bpf_delete_flow(struct bpf_flow_key *flow_key,
+ struct dpif_flow_stats *stats)
+{
+ int err;
+ struct bpf_action_batch actions;
+
+ ovs_assert(datapath.bpf.flow_table.fd != -1);
+
+ err = bpf_map_lookup_elem(datapath.bpf.flow_table.fd, flow_key, &actions);
+ if (err) {
+ VLOG_WARN("bpf_flow_key not found in flow table, map fd %d: %s",
+ datapath.bpf.flow_table.fd, ovs_strerror(errno));
+ vlog_hex_dump((unsigned char *)flow_key, sizeof *flow_key);
+ goto delete_stats;
+ }
+
+ err = bpf_map_delete_elem(datapath.bpf.flow_table.fd, flow_key);
+ if (err) {
+ VLOG_ERR("Failed to delete flow from flow table, map fd %d: %s",
+ datapath.bpf.flow_table.fd, ovs_strerror(errno));
+ return errno;
+ }
+
+ if (!stats) {
+ return 0;
+ }
+
+ /* XXX: Stats */
+ memset(stats, 0, sizeof *stats);
+
+delete_stats:
+ err = bpf_map_delete_elem(datapath.bpf.dp_flow_stats.fd, flow_key);
+ if (err) {
+ /* The stats entry may simply not exist; don't treat this as fatal. */
+ VLOG_WARN("Failed to delete flow from flow stats table, map fd %d: %s",
+ datapath.bpf.dp_flow_stats.fd, ovs_strerror(errno));
+ }
+ return 0;
+}
+
+static int
+dpif_bpf_delete_all_flow(void)
+{
+ int err;
+ struct bpf_flow_key key;
+
+ do {
+ err = bpf_map_get_next_key(datapath.bpf.flow_table.fd, NULL, &key);
+ if (err) {
+ /* ENOENT means no more entries; treat an empty table as success. */
+ return errno == ENOENT ? 0 : errno;
+ }
+
+ err = bpf_map_delete_elem(datapath.bpf.flow_table.fd, &key);
+ } while (!err);
+
+ return errno;
+}
+
+static int
+dpif_bpf_flow_dump_next(struct dpif_flow_dump_thread *thread_,
+ struct dpif_flow *flows, int max_flows)
+{
+ struct dpif_bpf_flow_dump_thread *thread =
+ dpif_bpf_flow_dump_thread_cast(thread_);
+ struct dpif_bpf_flow_dump *dump = thread->dump;
+ int n = 0;
+ int err;
+
+ ovs_mutex_lock(&dump->mutex);
+ err = dump->status;
+ if (err) {
+ goto unlock;
+ }
+
+ while (n < max_flows) {
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dump->up.dpif);
+
+ err = bpf_map_get_next_key(datapath.bpf.flow_table.fd,
+ &dump->pos, &dump->pos);
+ if (err) {
+ err = errno;
+ break;
+ }
+ err = fetch_flow(dp, &flows[n], &thread->buf, &dump->pos);
+ if (err == ENOENT) {
+ /* Flow disappeared. Oh well, we tried. */
+ continue;
+ } else if (err) {
+ break;
+ }
+ n++;
+ }
+ dump->status = err;
+unlock:
+ ovs_mutex_unlock(&dump->mutex);
+ return n;
+}
+
+struct dpif_bpf_downcall_parms {
+ uint32_t type;
+ odp_port_t port_no;
+ struct bpf_action_batch *action_batch;
+};
+
+static int
+dpif_bpf_downcall(struct dpif *dpif_, struct dp_packet *packet,
+ const struct flow *flow,
+ struct dpif_bpf_downcall_parms *parms)
+{
+ struct dp_packet_batch batch;
+ struct bpf_downcall md = {
+ .type = parms->type,
+ .debug = 0xC0FFEEEE,
+ };
+ uint32_t ifindex;
+ uint32_t flags;
+ int error;
+ int queue = 0;
+ struct dp_packet *clone_pkt;
+
+ ovs_assert(datapath.bpf.execute_actions.fd != -1);
+
+ bpf_metadata_from_flow(flow, &md.md);
+
+ ifindex = odp_port_to_ifindex(get_dpif_bpf_dp(dpif_),
+ flow->in_port.odp_port, &flags);
+#if 0
+ /* this is ok at check_support time */
+ if (!ifindex) {
+ VLOG_WARN("%s: in_port.odp_port %d found",
+ __func__, flow->in_port.odp_port);
+ return ENODEV;
+ }
+#endif
+
+ md.md.md.in_port = ifindex;
+ md.ifindex = ifindex;
+
+ if (parms->action_batch) {
+ int zero_index = 0;
+ error = bpf_map_update_elem(datapath.bpf.execute_actions.fd,
+ &zero_index, parms->action_batch, 0);
+ if (error) {
+ VLOG_ERR("%s: map update failed", __func__);
+ return error;
+ }
+ }
+
+ /* XXX: Check that ovs-system device MTU is large enough to include md. */
+ dp_packet_put(packet, &md, sizeof md);
+ clone_pkt = dp_packet_clone(packet);
+ dp_packet_batch_init_packet(&batch, clone_pkt);
+
+ VLOG_INFO("send downcall (%d)", parms->type);
+ error = netdev_send(datapath.outport, queue, &batch, false);
+ dp_packet_set_size(packet, dp_packet_size(packet) - sizeof md);
+
+ return error;
+}
+
+static int OVS_UNUSED
+dpif_bpf_output(struct dpif *dpif_, struct dp_packet *packet,
+ const struct flow *flow, odp_port_t port_no,
+ uint32_t flags OVS_UNUSED)
+{
+ struct dpif_bpf_downcall_parms parms = {
+ .port_no = port_no,
+ .type = OVS_BPF_DOWNCALL_OUTPUT,
+ .action_batch = NULL
+ };
+ return dpif_bpf_downcall(dpif_, packet, flow, &parms);
+}
+
+static int
+dpif_bpf_execute_(struct dpif *dpif_, struct dp_packet *packet,
+ const struct flow *flow,
+ struct bpf_action_batch *action_batch)
+{
+ struct dpif_bpf_downcall_parms parms = {
+ .type = OVS_BPF_DOWNCALL_EXECUTE,
+ .action_batch = action_batch,
+ };
+ return dpif_bpf_downcall(dpif_, packet, flow, &parms);
+}
+
+static int
+dpif_bpf_serialize_actions(struct dpif_bpf_dp *dp,
+ struct bpf_action_batch *action_batch,
+ const struct nlattr *nlactions,
+ size_t actions_len)
+{
+ const struct nlattr *a;
+ unsigned int left, count = 0, skipped = 0;
+ struct bpf_action *actions;
+
+ memset(action_batch, 0, sizeof(*action_batch));
+ actions = action_batch->actions;
+
+ NL_ATTR_FOR_EACH_UNSAFE (a, left, nlactions, actions_len) {
+ enum ovs_action_attr type = nl_attr_type(a);
+ actions[count].type = type;
+
+ if (type == OVS_ACTION_ATTR_OUTPUT) {
+ struct dpif_bpf_port *port;
+ odp_port_t port_no = nl_attr_get_odp_port(a);
+
+ ovs_mutex_lock(&dp->port_mutex);
+ port = bpf_lookup_port(dp, port_no);
+ if (port) {
+ VLOG_INFO("output action to port %d ifindex %d", port_no,
+ port->ifindex);
+ actions[count].u.out.port = port->ifindex;
+ actions[count].u.out.flags = get_port_flags(port->netdev);
+ }
+ ovs_mutex_unlock(&dp->port_mutex);
+ } else {
+ if (odp_action_to_bpf_action(a, &actions[count])) {
+ skipped++;
+ }
+ }
+ count++;
+ }
+
+ VLOG_INFO("Processing flow actions (%d/%d skipped)", skipped, count);
+ if (skipped) {
+ /* XXX: VLOG actions that couldn't be processed */
+ }
+ return 0;
+}
+
+static int
+dpif_bpf_execute(struct dpif *dpif_, struct dpif_execute *execute)
+{
+ struct bpf_action_batch batch;
+ int error = 0;
+
+ error = dpif_bpf_serialize_actions(get_dpif_bpf_dp(dpif_), &batch, execute->actions,
+ execute->actions_len);
+ if (error) {
+ return error;
+ }
+
+ error = dpif_bpf_execute_(dpif_, execute->packet,
+ execute->flow, &batch);
+ return error;
+}
+
+/* Translates 'port' into an ifindex and sets it inside 'key'.
+ *
+ * Returns 0 on success, or a positive errno otherwise. */
+static int
+set_in_port(struct dpif_bpf_dp *dp, struct bpf_flow_key *key, odp_port_t port)
+{
+ uint16_t ifindex;
+
+ ifindex = odp_port_to_ifindex(dp, port, NULL);
+ if (!ifindex && port) {
+ VLOG_WARN("Could not find ifindex corresponding to port %"PRIu32,
+ port);
+ return ENODEV;
+ }
+
+ key->mds.md.in_port = ifindex;
+ return 0;
+}
+
+/* Converts 'key' (of size 'key_len') into a bpf flow key in 'key_out', and
+ * optionally 'actions' (of size 'actions_len') into 'batch'. 'mask' (of size
+ * 'mask_len') may optionally be used for logging, of which the verbosity is
+ * controlled by 'verbose'.
+ *
+ * Returns 0 on success, or a positive errno otherwise.
+ */
+static int
+prepare_bpf_flow__(struct dpif_bpf_dp *dp,
+ const struct nlattr *key, size_t key_len,
+ const struct nlattr *mask, size_t mask_len,
+ const struct nlattr *actions, size_t actions_len,
+ struct bpf_flow_key *key_out, struct bpf_action_batch *batch,
+ bool verbose)
+{
+ odp_port_t in_port;
+ int err = EINVAL;
+
+ if (1) {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ /* XXX: Use dpif_format_flow()? */
+ odp_flow_format(key, key_len, mask, mask_len, NULL, &ds, true);
+ ds_put_cstr(&ds, ", actions=");
+ format_odp_actions(&ds, actions, actions_len, NULL);
+ VLOG_WARN("Translating odp key to bpf key:\n%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+
+ memset(key_out, 0, sizeof *key_out);
+ if (odp_key_to_bpf_flow_key(key, key_len, key_out,
+ &in_port, false, verbose)) {
+ if (verbose) {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ /* XXX: Use dpif_format_flow()? */
+ odp_flow_format(key, key_len, mask, mask_len, NULL, &ds,
+ true);
+ VLOG_WARN("Failed to translate odp key to bpf key:\n%s",
+ ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+ return err;
+ }
+
+ err = set_in_port(dp, key_out, in_port);
+ if (err) {
+ return err;
+ }
+ if (batch) {
+ err = dpif_bpf_serialize_actions(dp, batch, actions, actions_len);
+ if (err) {
+ return err;
+ }
+ }
+
+ /* Convert the bpf key back to a flow to sanity-check the translation. */
+ if (1) {
+ struct flow flow;
+ enum odp_key_fitness res;
+
+ res = bpf_flow_key_to_flow(key_out, &flow);
+ if (res != ODP_FIT_PERFECT) {
+ VLOG_ERR("converting bpf key back to flow failed");
+ } else {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ flow_format(&ds, &flow, NULL);
+ ds_put_cstr(&ds, ", actions=");
+ format_odp_actions(&ds, actions, actions_len, NULL);
+ VLOG_WARN("Translating back:\n%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+ }
+
+ return 0;
+}
+
+static int
+prepare_bpf_flow(struct dpif_bpf_dp *dp, const struct nlattr *key,
+ size_t key_len, struct bpf_flow_key *key_out, bool verbose)
+{
+ return prepare_bpf_flow__(dp, key, key_len, NULL, 0, NULL, 0, key_out,
+ NULL, verbose);
+}
+
+static void
+dpif_bpf_operate(struct dpif *dpif_, struct dpif_op **ops, size_t n_ops)
+{
+ struct dpif_bpf_dp *dp = get_dpif_bpf_dp(dpif_);
+
+ for (size_t i = 0; i < n_ops; i++) {
+ struct dpif_op *op = ops[i];
+
+ switch (op->type) {
+ case DPIF_OP_EXECUTE:
+ op->error = dpif_bpf_execute(dpif_, &op->u.execute);
+ break;
+ case DPIF_OP_FLOW_PUT: {
+ struct dpif_flow_put *put = &op->u.flow_put;
+ bool verbose = !(put->flags & DPIF_FP_PROBE);
+ struct bpf_action_batch action_batch;
+ struct bpf_flow_key key;
+ int err;
+
+ err = prepare_bpf_flow__(dp, put->key, put->key_len,
+ put->mask, put->mask_len,
+ put->actions, put->actions_len,
+ &key, &action_batch, verbose);
+ if (!err) {
+ err = dpif_bpf_insert_flow(&key, &action_batch);
+ }
+ op->error = err;
+ break;
+ }
+ case DPIF_OP_FLOW_GET: {
+ struct dpif_flow_get *get = &op->u.flow_get;
+ struct bpf_flow_key key;
+ int err;
+
+ err = prepare_bpf_flow(dp, get->key, get->key_len, &key, true);
+ if (!err) {
+ err = fetch_flow(dp, get->flow, get->buffer, &key);
+ }
+ op->error = err;
+ break;
+ }
+ case DPIF_OP_FLOW_DEL: {
+ struct dpif_flow_del *del = &op->u.flow_del;
+ struct bpf_flow_key key;
+ int err;
+
+ err = prepare_bpf_flow(dp, del->key, del->key_len, &key, true);
+ if (!err) {
+ err = dpif_bpf_delete_flow(&key, del->stats);
+ }
+ op->error = err;
+ break;
+ }
+ default:
+ OVS_NOT_REACHED();
+ }
+ }
+}
+
+static int
+dpif_bpf_recv_set(struct dpif *dpif_, bool enable)
+{
+ struct dpif_bpf_dp *dpif = get_dpif_bpf_dp(dpif_);
+ int stored_error = 0;
+
+ for (int i = 0; i < dpif->n_channels; i++) {
+ int error = perf_channel_set(&dpif->channels[i], enable);
+ if (error) {
+ VLOG_ERR("failed to set recv_set %s (%s)",
+ enable ? "true": "false", ovs_strerror(error));
+ stored_error = error;
+ }
+ }
+
+ return stored_error;
+}
+
+static int
+dpif_bpf_handlers_set__(struct dpif_bpf_dp *dp, uint32_t n_handlers)
+ OVS_REQUIRES(&dp->upcall_lock)
+{
+ struct bpf_handler prev;
+ int i, extra;
+
+ memset(&prev, 0, sizeof prev);
+ if (dp->n_handlers) {
+ free(dp->handlers);
+ dp->handlers = NULL;
+ dp->n_handlers = 0;
+ }
+
+ if (!n_handlers) {
+ return 0;
+ }
+
+ dp->handlers = xzalloc(sizeof *dp->handlers * n_handlers);
+ for (i = 0; i < n_handlers; i++) {
+ struct bpf_handler *curr = dp->handlers + i;
+
+ if (i >= dp->n_channels) {
+ VLOG_INFO("Ignoring extraneous handlers (%d for %d channels)",
+ n_handlers, dp->n_channels);
+ break;
+ }
+
+ curr->offset = prev.offset + prev.count;
+ curr->count = dp->n_channels / n_handlers;
+ prev = *curr;
+ }
+ extra = dp->n_channels % n_handlers;
+ if (extra) {
+ VLOG_INFO("Extra %d channels; distributing across handlers", extra);
+ for (i = 0; i < extra; i++) {
+ struct bpf_handler *curr = dp->handlers + n_handlers - i - 1;
+
+ curr->offset = curr->offset + extra - i - 1;
+ curr->count++;
+ }
+ }
+
+ dp->n_handlers = n_handlers;
+ return 0;
+}
+
+static int
+dpif_bpf_handlers_set(struct dpif *dpif_, uint32_t n_handlers)
+{
+ struct dpif_bpf_dp *dpif = get_dpif_bpf_dp(dpif_);
+ int error;
+
+ fat_rwlock_wrlock(&dpif->upcall_lock);
+ error = dpif_bpf_handlers_set__(dpif, n_handlers);
+ fat_rwlock_unlock(&dpif->upcall_lock);
+
+ return error;
+}
+
+static int
+extract_key(struct dpif_bpf_dp *dpif, const struct bpf_flow_key *key,
+ struct dp_packet *packet, struct ofpbuf *buf)
+{
+ struct flow flow;
+ struct odp_flow_key_parms parms = {
+ .flow = &flow,
+ };
+ parms.support.recirc = true;
+
+ {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ bpf_flow_key_format(&ds, key);
+ VLOG_INFO("bpf_flow_key_format\n%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+
+ /* This function goes first because it zeros out flow. */
+ flow_extract(packet, &flow);
+
+ bpf_flow_key_extract_metadata(key, &flow);
+
+ VLOG_INFO("packet.md.port = %d", packet->md.in_port.odp_port);
+
+ if (flow.in_port.odp_port != 0) {
+ flow.in_port.odp_port = ifindex_to_odp(dpif,
+ odp_to_u32(flow.in_port.odp_port));
+ } else {
+ flow.in_port.odp_port = packet->md.in_port.odp_port;
+ }
+ VLOG_INFO("flow.in_port.odp_port %d", flow.in_port.odp_port);
+
+ if (1) {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ flow_format(&ds, &flow, NULL);
+ VLOG_WARN("Upcall flow:\n%s",
+ ds_cstr(&ds));
+ ds_destroy(&ds);
+
+ }
+
+ odp_flow_key_from_flow(&parms, buf);
+
+ return 0;
+}
+
+struct ovs_ebpf_event {
+ struct perf_event_raw sample;
+ struct bpf_upcall header;
+ uint8_t data[];
+};
+
+static void OVS_UNUSED
+dpif_bpf_flow_dump_all(struct dpif_bpf_dp *dp OVS_UNUSED)
+{
+ struct dpif_bpf_flow_dump dump;
+ int err;
+
+ memset(&dump, 0, sizeof dump);
+ while (1) {
+ err = bpf_map_get_next_key(datapath.bpf.flow_table.fd,
+ &dump.pos, &dump.pos);
+ if (err) {
+ VLOG_INFO("err is %d", err);
+ break;
+ }
+ vlog_hex_dump((unsigned char *)&dump.pos, sizeof dump.pos);
+ }
+}
+
+/* perf_channel_read() fills the first part of 'buffer' with the full event.
+ * Here, the key will be extracted immediately following it, and 'upcall'
+ * will be initialized to point within 'buffer'.
+ */
+static int
+perf_sample_to_upcall__(struct dpif_bpf_dp *dp, struct ovs_ebpf_event *e,
+ struct dpif_upcall *upcall, struct ofpbuf *buffer)
+{
+ size_t sample_len = e->sample.size - sizeof e->header;
+ size_t pkt_len = e->header.skb_len;
+ size_t pre_key_len;
+ odp_port_t port_no;
+ int err;
+
+ if (pkt_len < ETH_HEADER_LEN) {
+ VLOG_WARN_RL(&rl, "Unexpectedly short packet (%"PRIuSIZE")", pkt_len);
+ return EINVAL;
+ }
+ if (sample_len < pkt_len) {
+ VLOG_WARN_RL(&rl,
+ "Packet longer than sample (pkt=%"PRIuSIZE", sample=%"PRIuSIZE")",
+ pkt_len, sample_len);
+ return EINVAL;
+ }
+
+ port_no = ifindex_to_odp(dp, e->header.ifindex);
+ VLOG_INFO("ifindex %d odp %d", e->header.ifindex, port_no);
+ if (port_no == ODPP_NONE) {
+ VLOG_WARN_RL(&rl, "failed to map upcall ifindex=%d to odp",
+ e->header.ifindex);
+ return EINVAL;
+ }
+
+ memset(upcall, 0, sizeof *upcall);
+
+ /* Use buffer->header to point to the packet, and buffer->msg to point to
+ * the extracted flow key. Therefore, when extract_key() reallocates
+ * 'buffer', we can easily get pointers back to the packet and start of
+ * extracted key. */
+ buffer->header = e->data;
+ buffer->msg = ofpbuf_tail(buffer);
+ pre_key_len = buffer->size;
+
+ VLOG_INFO("upcall key hex\n");
+ vlog_hex_dump((unsigned char *)&e->header.key, sizeof e->header.key);
+ //VLOG_INFO("list of bpf keys\n");
+ //dpif_bpf_flow_dump_all(dp);
+ VLOG_INFO("raw packet data in e->data");
+ vlog_hex_dump(e->data, MIN(pkt_len, 100));
+
+ dp_packet_use_stub(&upcall->packet, e->data, pkt_len);
+ dp_packet_set_size(&upcall->packet, pkt_len);
+ pkt_metadata_init(&upcall->packet.md, port_no);
+
+ err = extract_key(dp, &e->header.key, &upcall->packet, buffer);
+ if (err) {
+ return err;
+ }
+
+ upcall->key = buffer->msg;
+ upcall->key_len = buffer->size - pre_key_len;
+ dpif_flow_hash(dp->dpif, upcall->key, upcall->key_len, &upcall->ufid);
+
+ return 0;
+}
+
+/* Converts the perf sample in 'e' into a flow-miss upcall. See
+ * perf_sample_to_upcall__() for how 'upcall' and 'buffer' are initialized.
+ */
+static int
+perf_sample_to_upcall_miss(struct dpif_bpf_dp *dp, struct ovs_ebpf_event *e,
+ struct dpif_upcall *upcall, struct ofpbuf *buffer)
+{
+ int err;
+
+ err = perf_sample_to_upcall__(dp, e, upcall, buffer);
+ if (err) {
+ return err;
+ }
+
+ ofpbuf_prealloc_tailroom(buffer, sizeof(struct bpf_downcall));
+ upcall->type = DPIF_UC_MISS;
+
+ return 0;
+}
+
+/* Modified from perf_sample_to_upcall.
+ */
+static int
+perf_sample_to_upcall_userspace(struct dpif_bpf_dp *dp, struct ovs_ebpf_event *e,
+ struct dpif_upcall *upcall,
+ struct ofpbuf *buffer)
+{
+ const struct nlattr *actions = (struct nlattr *)e->header.uactions;
+ const struct nlattr *a;
+ unsigned int left;
+ int err;
+
+ err = perf_sample_to_upcall__(dp, e, upcall, buffer);
+ if (err) {
+ return err;
+ }
+
+ NL_ATTR_FOR_EACH_UNSAFE (a, left, actions, e->header.uactions_len) {
+ switch (nl_attr_type(a)) {
+ case OVS_USERSPACE_ATTR_PID:
+ //nl_attr_get_u32(a);
+ break;
+ case OVS_USERSPACE_ATTR_USERDATA:
+ upcall->userdata = CONST_CAST(struct nlattr *, a);
+ break;
+ default:
+ VLOG_INFO("%s unsupported userspace action. %d",
+ __func__, nl_attr_type(a));
+ return EOPNOTSUPP;
+ }
+ }
+
+ upcall->type = DPIF_UC_ACTION;
+ return 0;
+}
+
+static void
+bpf_debug_print(int subtype, int error)
+{
+ int level = error ? VLL_WARN : VLL_DBG;
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ if (subtype >= 0 && subtype < ARRAY_SIZE(bpf_upcall_subtypes)) {
+ ds_put_cstr(&ds, bpf_upcall_subtypes[subtype]);
+ } else {
+ ds_put_format(&ds, "Unknown subtype %d", subtype);
+ }
+ ds_put_format(&ds, " reports: %s", ovs_strerror(error));
+
+ VLOG_RL(&rl, level, "%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+}
+
+static int
+recv_perf_sample(struct dpif_bpf_dp *dpif, struct ovs_ebpf_event *e,
+ struct dpif_upcall *upcall, struct ofpbuf *buffer)
+{
+ if (e->sample.header.size < sizeof *e
+ || e->sample.size < sizeof e->header) {
+ VLOG_WARN_RL(&rl, "Unexpectedly short sample (%"PRIu32")",
+ e->sample.size);
+ return EINVAL;
+ }
+
+ VLOG_INFO("\nreceived upcall %d", e->header.type);
+
+ switch (e->header.type) {
+ case OVS_UPCALL_MISS:
+ return perf_sample_to_upcall_miss(dpif, e, upcall, buffer);
+ break;
+ case OVS_UPCALL_DEBUG:
+ bpf_debug_print(e->header.subtype, e->header.error);
+ return EAGAIN;
+ case OVS_UPCALL_ACTION:
+ return perf_sample_to_upcall_userspace(dpif, e, upcall, buffer);
+ break;
+ default:
+ break;
+ }
+
+ VLOG_WARN_RL(&rl, "Unfamiliar upcall type %d", e->header.type);
+ return EINVAL;
+}
+
+static int
+dpif_bpf_recv(struct dpif *dpif_, uint32_t handler_id,
+ struct dpif_upcall *upcall, struct ofpbuf *buffer)
+{
+ struct dpif_bpf_dp *dpif = get_dpif_bpf_dp(dpif_);
+ struct bpf_handler *handler;
+ int error = EAGAIN;
+ int i;
+
+ fat_rwlock_rdlock(&dpif->upcall_lock);
+ handler = dpif->handlers + handler_id;
+ for (i = 0; i < handler->count; i++) {
+ int channel_idx = (handler->index + i) % handler->count;
+ struct perf_channel *channel;
+
+ channel = &dpif->channels[handler->offset + channel_idx];
+ error = perf_channel_read(channel, buffer);
+ if (!error) {
+ error = recv_perf_sample(dpif, buffer->header, upcall, buffer);
+ }
+ if (error != EAGAIN) {
+ break;
+ }
+ }
+ if (handler->count) {
+ handler->index = (handler->index + 1) % handler->count;
+ }
+ fat_rwlock_unlock(&dpif->upcall_lock);
+
+ return error;
+}
+
+static char *
+dpif_bpf_get_datapath_version(void)
+{
+ return xstrdup("<built-in>");
+}
+
+static void
+dpif_bpf_recv_wait(struct dpif *dpif_, uint32_t handler_id)
+{
+ struct dpif_bpf_dp *dpif = get_dpif_bpf_dp(dpif_);
+ struct bpf_handler *handler;
+ int i;
+
+ fat_rwlock_rdlock(&dpif->upcall_lock);
+ handler = dpif->handlers + handler_id;
+ for (i = 0; i < handler->count; i++) {
+ poll_fd_wait(dpif->channels[handler->offset + i].fd, POLLIN);
+ }
+ fat_rwlock_unlock(&dpif->upcall_lock);
+}
+
+static void
+dpif_bpf_recv_purge(struct dpif *dpif_)
+{
+ struct dpif_bpf_dp *dpif = get_dpif_bpf_dp(dpif_);
+ int i;
+
+ fat_rwlock_rdlock(&dpif->upcall_lock);
+ for (i = 0; i < dpif->n_channels; i++) {
+ struct perf_channel *channel = &dpif->channels[i];
+
+ perf_channel_flush(channel);
+ }
+ fat_rwlock_unlock(&dpif->upcall_lock);
+}
+
+const struct dpif_class dpif_bpf_class = {
+ "bpf",
+ dpif_bpf_init,
+ dpif_bpf_enumerate,
+ dpif_bpf_port_open_type,
+ dpif_bpf_open,
+ dpif_bpf_close,
+ dpif_bpf_destroy,
+ NULL, /* run */
+ NULL, /* wait */
+ dpif_bpf_get_stats,
+ dpif_bpf_port_add,
+ dpif_bpf_port_del,
+ NULL, /* port_set_config */
+ dpif_bpf_port_query_by_number,
+ dpif_bpf_port_query_by_name,
+ NULL, /* port_get_pid */
+ dpif_bpf_port_dump_start,
+ dpif_bpf_port_dump_next,
+ dpif_bpf_port_dump_done,
+ dpif_bpf_port_poll,
+ dpif_bpf_port_poll_wait,
+ dpif_bpf_flow_flush,
+ dpif_bpf_flow_dump_create,
+ dpif_bpf_flow_dump_destroy,
+ dpif_bpf_flow_dump_thread_create,
+ dpif_bpf_flow_dump_thread_destroy,
+ dpif_bpf_flow_dump_next,
+ dpif_bpf_operate,
+ dpif_bpf_recv_set,
+ dpif_bpf_handlers_set,
+ NULL, /* set_config */
+ NULL, /* queue_to_priority */
+ dpif_bpf_recv,
+ dpif_bpf_recv_wait,
+ dpif_bpf_recv_purge,
+ NULL, /* register_dp_purge_cb */
+ NULL, /* register_upcall_cb */
+ NULL, /* enable_upcall */
+ NULL, /* disable_upcall */
+ dpif_bpf_get_datapath_version,
+ NULL, /* ct_dump_start */
+ NULL, /* ct_dump_next */
+ NULL, /* ct_dump_done */
+ NULL, /* ct_flush */
+ NULL, /* ct_set_maxconns */
+ NULL, /* ct_get_maxconns */
+ NULL, /* ct_get_nconns */
+ NULL, /* meter_get_features */
+ NULL, /* meter_set */
+ NULL, /* meter_get */
+ NULL, /* meter_del */
+};
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index 62b3598acfc5..ae21593ab1b2 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -476,6 +476,7 @@ struct dpif_class {

extern const struct dpif_class dpif_netlink_class;
extern const struct dpif_class dpif_netdev_class;
+extern const struct dpif_class dpif_bpf_class;

#ifdef __cplusplus
}
diff --git a/lib/dpif.c b/lib/dpif.c
index f03763ec55b4..43d97ec1582a 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -71,6 +71,9 @@ static const struct dpif_class *base_dpif_classes[] = {
#if defined(__linux__) || defined(_WIN32)
&dpif_netlink_class,
#endif
+#if HAVE_BPF /* XXX: Linux 4.9+ */
+ &dpif_bpf_class,
+#endif
&dpif_netdev_class,
};

--
2.7.4


[RFC PATCH 04/11] lib/bpf: add support for managing bpf program/map.

William Tu
 

From: Joe Stringer <joe@...>

Through libbpf, this patch adds support for loading BPF programs and
maps, pinning them to /sys/fs/bpf/ovs/, managing the file descriptor
of each loaded map, and printing their state.

Signed-off-by: Joe Stringer <joe@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
lib/bpf.c | 524 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
lib/bpf.h | 69 +++++++++
2 files changed, 593 insertions(+)
create mode 100644 lib/bpf.c
create mode 100644 lib/bpf.h

diff --git a/lib/bpf.c b/lib/bpf.c
new file mode 100644
index 000000000000..48c677e54659
--- /dev/null
+++ b/lib/bpf.c
@@ -0,0 +1,524 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+
+#include <errno.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <linux/bpf.h>
+#include <linux/limits.h>
+#include <linux/magic.h>
+#include <iproute2/bpf_elf.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/vfs.h>
+#include <sys/resource.h>
+
+#include "bpf.h"
+#include "bpf/odp-bpf.h"
+#include "util.h"
+#include "openvswitch/dynamic-string.h"
+#include "openvswitch/vlog.h"
+
+#define BPF_FS_PATH "/sys/fs/bpf/ovs/"
+static const char *ovs_bpf_path = BPF_FS_PATH;
+
+#define MAX_BPF_PROG_ARRAY 64 //FIXME
+VLOG_DEFINE_THIS_MODULE(bpf);
+
+static void
+bpf_format_prog(struct ds *ds, const struct bpf_prog *prog)
+{
+ ds_put_format(ds, " %s:\n", prog->name);
+ ds_put_format(ds, " handle: %08"PRIx32"\n", prog->handle);
+}
+
+typedef void map_element_writer_t(struct ds *, uint64_t, void *);
+
+static void
+format_dp_stats(struct ds *ds, uint64_t key, void *value_)
+{
+ uint64_t value = *(uint64_t *)value_;
+
+ switch (key) {
+ case OVS_DP_STATS_UNSPEC:
+ while (ds_chomp(ds, ' ')) {
+ /* nom nom nom */
+ }
+ break;
+ case OVS_DP_STATS_HIT:
+ ds_put_cstr(ds, "hit");
+ break;
+ case OVS_DP_STATS_MISSED:
+ ds_put_cstr(ds, "missed");
+ break;
+ case OVS_DP_STATS_LOST:
+ ds_put_cstr(ds, "lost");
+ break;
+ case OVS_DP_STATS_FLOWS:
+ ds_put_cstr(ds, "flows");
+ break;
+ case OVS_DP_STATS_MASK_HIT:
+ ds_put_cstr(ds, "masks_hit");
+ break;
+ case OVS_DP_STATS_MASKS:
+ ds_put_cstr(ds, "masks");
+ break;
+ case OVS_DP_STATS_ERRORS:
+ ds_put_cstr(ds, "errors");
+ break;
+ default:
+ ds_put_format(ds, "unknown-%"PRIu64, key);
+ break;
+ }
+ if (key) {
+ ds_put_format(ds, ": %"PRIu64"\n", value);
+ }
+}
+
+static void
+format_upcalls(struct ds *ds, uint64_t key, void *value OVS_UNUSED)
+{
+ ds_put_format(ds, "cpu-%"PRIu64"\n", key);
+}
+
+static void
+format_tailcalls(struct ds *ds, uint64_t key, void *value_)
+{
+ uint32_t value = *(uint32_t *)value_;
+ ds_put_format(ds, "index-%"PRIu64" prog_fd-%d\n", key, value);
+}
+
+static int
+lookup_elem(int fd, void *key, size_t key_len, void *value)
+{
+ int err = bpf_map_lookup_elem(fd, (uint64_t *)key, (uint64_t *)value);
+ if (err) {
+ struct ds ds = DS_EMPTY_INITIALIZER;
+
+ ds_put_cstr(&ds, "error occurred looking up elem ");
+ ds_put_hex(&ds, key, key_len);
+ ds_put_format(&ds, ": %s", ovs_strerror(errno));
+ VLOG_DBG("%s", ds_cstr(&ds));
+ ds_destroy(&ds);
+ }
+
+ return err;
+}
+
+#define MAP_FORMAT_FUNC(NAME, KTYPE, VTYPE, PRINT_COUNT) \
+ static void NAME(struct ds *ds, const struct bpf_map *map, \
+ map_element_writer_t fmt) \
+ { \
+ KTYPE key = 0; \
+ VTYPE value; \
+ int count = 0; \
+ \
+ VLOG_DBG("reading map %s", map->name); \
+ ds_put_format(ds, " %s:\n", map->name); \
+ if (!lookup_elem(map->fd, &key, sizeof key, &value)) { \
+ count++; \
+ if (fmt) { \
+ ds_put_cstr(ds, " "); \
+ fmt(ds, key, &value); \
+ } \
+ } \
+ while (!bpf_map_get_next_key(map->fd, &key, &key)) { \
+ count++; \
+ if (fmt) { \
+ if (!lookup_elem(map->fd, &key, sizeof key, &value)) { \
+ ds_put_cstr(ds, " "); \
+ fmt(ds, key, &value); \
+ } \
+ } \
+ }; \
+ if (PRINT_COUNT) { \
+ ds_put_format(ds, " count: %d\n", count); \
+ } \
+ }
+
+MAP_FORMAT_FUNC(bpf_format_map_stats, uint64_t, uint64_t, false);
+MAP_FORMAT_FUNC(bpf_format_map_flows, uint64_t, struct bpf_flow, true);
+MAP_FORMAT_FUNC(bpf_format_map_upcalls, uint32_t, uint32_t, true);
+MAP_FORMAT_FUNC(bpf_format_map_tailcalls, uint32_t, uint32_t, true);//FIXME
+//MAP_FORMAT_FUNC(bpf_format_map_dp_flow_stats,
+
+void
+bpf_format_state(struct ds *ds, struct bpf_state *state)
+{
+ ds_put_format(ds, "path: %s\n", ovs_bpf_path);
+ ds_put_cstr(ds, "maps:\n");
+ bpf_format_map_stats(ds, &state->datapath_stats, format_dp_stats);
+ bpf_format_map_flows(ds, &state->flow_table, NULL);
+ bpf_format_map_upcalls(ds, &state->upcalls, format_upcalls);
+ bpf_format_map_tailcalls(ds, &state->tailcalls, format_tailcalls);
+ //bpf_format_map_dp_flow_stats(ds, &state->dp_flow_stats, NULL);
+ ds_put_cstr(ds, "programs:\n");
+ bpf_format_prog(ds, &state->downcall);
+ bpf_format_prog(ds, &state->egress);
+ bpf_format_prog(ds, &state->ingress);
+ bpf_format_prog(ds, &state->xdp);
+}
+
+/* Populates 'state' with the standard set of programs and maps for the Open
+ * vSwitch datapath, as sourced from pinned programs at ovs_bpf_path.
+ *
+ * Returns 0 on success, or positive errno on error. If successful, the caller
+ * is responsible for releasing the resources in 'state' via bpf_put().
+ */
+int
+bpf_get(struct bpf_state *state, bool verbose)
+{
+ const struct {
+ int *fd;
+ const char *path;
+ } objs[] = {
+ /* BPF Programs */
+ {&state->ingress.fd, "ingress/0"},
+ {&state->egress.fd, "egress/0"},
+ {&state->downcall.fd, "downcall/0"},
+ {&state->xdp.fd, "xdp/0"},
+ /* BPF Maps */
+ {&state->upcalls.fd, "upcalls"},
+ {&state->flow_table.fd, "flow_table"},
+ {&state->datapath_stats.fd, "datapath_stats"},
+ {&state->tailcalls.fd, "tailcalls"},
+ {&state->execute_actions.fd, "execute_actions"},
+ {&state->dp_flow_stats.fd, "dp_flow_stats"},
+ };
+ int i, k, error = 0;
+ char buf[BUFSIZ];
+ int prog_array_fd;
+
+ for (i = 0; i < ARRAY_SIZE(objs); i++) {
+ struct stat s;
+
+ snprintf(buf, ARRAY_SIZE(buf), "%s/%s", ovs_bpf_path, objs[i].path);
+ if (stat(buf, &s)) {
+ error = errno;
+ break;
+ }
+ error = bpf_obj_get(buf);
+ if (error > 0) {
+ VLOG_DBG("Loaded BPF object at %s fd %d", buf, error);
+ *objs[i].fd = error;
+ error = 0;
+ continue;
+ } else {
+ error = errno;
+ break;
+ }
+ }
+
+ memset(state->tailarray, 0, sizeof state->tailarray);
+
+ /* Only populate the tail call array if everything above loaded cleanly. */
+ if (!error) {
+ prog_array_fd = state->tailcalls.fd;
+
+ VLOG_DBG("start loading/pinning program array\n");
+ for (k = 0; k < BPF_MAX_PROG_ARRAY; k++) {
+ struct stat s;
+ int prog_fd;
+
+ snprintf(buf, ARRAY_SIZE(buf), "%s/tail-%d/0", ovs_bpf_path, k);
+ if (stat(buf, &s)) {
+ continue;
+ }
+
+ prog_fd = bpf_obj_get(buf);
+ if (prog_fd > 0) {
+ VLOG_DBG("Loaded BPF object at %s", buf);
+ state->tailarray[k].fd = prog_fd;
+ error = bpf_map_update_elem(prog_array_fd, &k, &prog_fd, BPF_ANY);
+ if (error < 0) {
+ VLOG_ERR("Cannot add %s to the tail call program array\n", buf);
+ break;
+ }
+ } else {
+ error = errno;
+ break;
+ }
+ }
+ }
+
+ if (error) {
+ VLOG(verbose ? VLL_WARN : VLL_DBG, "Failed to load %s: %s",
+ buf, ovs_strerror(error));
+
+ for (int j = 0; j < i; j++) {
+ close(*objs[j].fd);
+ *objs[j].fd = 0;
+ }
+
+ for (int j = 0; j < BPF_MAX_PROG_ARRAY; j++) {
+ if (state->tailarray[j].fd) {
+ close(state->tailarray[j].fd);
+ }
+ }
+ }
+
+ if (!error) {
+ state->ingress.handle = INGRESS_HANDLE;
+ state->ingress.name = xstrdup("ovs_cls_ingress");
+ state->egress.handle = EGRESS_HANDLE;
+ state->egress.name = xstrdup("ovs_cls_egress");
+ state->downcall.handle = INGRESS_HANDLE;
+ state->downcall.name = xstrdup("ovs_cls_downcall");
+ state->upcalls.name = xstrdup("upcalls");
+ state->xdp.name = xstrdup("xdp");
+ state->flow_table.name = xstrdup("flow_table");
+ state->datapath_stats.name = xstrdup("datapath_stats");
+ state->dp_flow_stats.name = xstrdup("dp_flow_stats");
+ // add parser, lookup, action, deparser
+ state->tailcalls.name = xstrdup("tailcalls");
+
+ }
+
+ return error;
+}
+
+static void
+xclose(int fd, const char *name)
+{
+ int error = close(fd);
+ if (error) {
+ VLOG_WARN("Failed to close BPF fd %s: %s", name, ovs_strerror(errno));
+ }
+}
+
+/* Frees resources allocated by bpf_get(). */
+void
+bpf_put(struct bpf_state *state)
+{
+ xclose(state->ingress.fd, state->ingress.name);
+ xclose(state->egress.fd, state->egress.name);
+ xclose(state->downcall.fd, state->downcall.name);
+ xclose(state->upcalls.fd, state->upcalls.name);
+ xclose(state->xdp.fd, state->xdp.name);
+ xclose(state->flow_table.fd, "ovs_map_flow_table");
+ xclose(state->datapath_stats.fd, "ovs_datapath_stats");
+ xclose(state->dp_flow_stats.fd, state->dp_flow_stats.name);
+ free((void *)state->ingress.name);
+ free((void *)state->egress.name);
+ free((void *)state->downcall.name);
+ free((void *)state->upcalls.name);
+ free((void *)state->xdp.name);
+ free((void *)state->flow_table.name);
+ free((void *)state->datapath_stats.name);
+ free((void *)state->dp_flow_stats.name);
+}
+
+static void
+process(struct bpf_object *obj)
+{
+ struct bpf_program *prog;
+ struct bpf_map *map;
+
+ VLOG_DBG("Opened object '%s'\n", bpf_object__name(obj));
+ VLOG_DBG("Programs:\n");
+ bpf_object__for_each_program(prog, obj) {
+ const char *title = bpf_program__title(prog, false);
+ int error;
+
+ VLOG_DBG(" - %s\n", title);
+ if (strstr(title, "xdp")) {
+ error = bpf_program__set_xdp(prog);
+ } else {
+ error = bpf_program__set_sched_cls(prog); // or sched_act?
+ }
+ if (error) {
+ VLOG_WARN("Failed to set '%s' prog type: %s\n", title,
+ ovs_strerror(error));
+ }
+
+ }
+
+ if (VLOG_IS_DBG_ENABLED()) {
+ VLOG_DBG("Maps:\n");
+ bpf_map__for_each(map, obj) {
+ const char *name = bpf_map__name(map);
+ VLOG_DBG(" - %s\n", name);
+ }
+ }
+}
+
+/* Attempts to load the BPF datapath in the form of an ELF compiled for the BPF
+ * ISA in 'path', install it into the kernel, and pin it to the filesystem
+ * under ovs_bpf_path/{maps,progs}/foo.
+ *
+ * Returns 0 on success, or positive errno on error.
+ */
+int
+bpf_load(const char *path)
+{
+ const char *stage = NULL;
+ struct bpf_state state;
+ struct bpf_object *obj;
+ long error;
+ struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+
+ if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+ error = errno;
+ VLOG_ERR("Failed to set rlimit: %s", ovs_strerror(error));
+ return error;
+ }
+
+ if (!bpf_get(&state, false)) {
+ /* XXX: Restart; Upgrade */
+ VLOG_INFO("Re-using preloaded BPF datapath");
+ bpf_put(&state);
+ return 0;
+ }
+
+ obj = bpf_object__open(path);
+ error = libbpf_get_error(obj);
+ if (error) {
+ stage = "open";
+ goto out;
+ }
+ process(obj);
+ error = bpf_object__load(obj);
+ if (error) {
+ stage = "load";
+ goto close;
+ }
+ error = bpf_object__pin(obj, ovs_bpf_path);
+ if (error) {
+ stage = "pin";
+ goto close;
+ }
+
+ error = bpf_object__unload(obj);
+ if (error) {
+ stage = "unload";
+ goto close;
+ }
+
+close:
+ bpf_object__close(obj);
+out:
+ if (error < 0) {
+ error = -error;
+ } else if (!error) {
+ VLOG_DBG("Loaded BPF datapath from %s", path);
+ }
+ if (error > __LIBBPF_ERRNO__START && error < __LIBBPF_ERRNO__END) {
+ char buf[BUFSIZ];
+
+ libbpf_strerror(error, buf, ARRAY_SIZE(buf));
+ VLOG_WARN("Failed to %s BPF datapath: %s\n", stage ? stage : "", buf);
+ error = EINVAL;
+ }
+ return error;
+}
+
+#define PRINT_FN(NAME) \
+static int \
+print_##NAME(const char *fmt, ...) \
+{ \
+ va_list args; \
+ \
+ va_start(args, fmt); \
+ vlog_valist(&this_module, VLL_##NAME, fmt, args); \
+ va_end(args); \
+ return 0; \
+}
+
+PRINT_FN(WARN);
+PRINT_FN(INFO);
+PRINT_FN(DBG);
+
+#define stringize_(x) #x
+#define stringize(x) stringize_(x)
+
+static int OVS_UNUSED
+mount_bpf(void)
+{
+ struct statfs st_fs;
+ char path[PATH_MAX] = "";
+ char type[NAME_MAX] = "";
+ int err = 0;
+ FILE *fp;
+ int idx;
+
+ fp = fopen("/proc/mounts", "r");
+ if (fp) {
+ const char *fmt;
+ int match;
+
+ fmt = "%*s %"stringize(PATH_MAX)"s %"stringize(NAME_MAX)"s %*s\n";
+ for (match = 0; match != EOF; match = fscanf(fp, fmt, path, type)) {
+ if (match == 2 && !strcmp(type, "bpf"))
+ break;
+ }
+ if (fclose(fp)) {
+ err = errno;
+ VLOG_INFO("Failed to close /proc/mounts: %s", ovs_strerror(err));
+ }
+ if (strcmp(type, "bpf")) {
+ err = ENOENT;
+ VLOG_DBG("Couldn't find bpf mountpoint in /proc/mounts");
+ }
+ } else {
+ err = errno;
+ VLOG_INFO("Cannot open /proc/mounts: %s", ovs_strerror(err));
+ }
+ if (err || strlen(path) == 0) {
+ VLOG_DBG("Using %s for BPF filesystem mountpoint", BPF_FS_PATH);
+ strcpy(path, BPF_FS_PATH);
+ }
+
+ if (!statfs(path, &st_fs) && st_fs.f_type == BPF_FS_MAGIC) {
+ VLOG_INFO("BPF filesystem already mounted to %s", path);
+ return 0;
+ }
+
+ if (mkdir(path, 0755) && errno != EEXIST) {
+ VLOG_WARN("Failed to create %s: %s", path, ovs_strerror(errno));
+ return errno;
+ }
+
+ if (mount("bpf", path, "bpf", 0, NULL)) {
+ VLOG_WARN("Failed to mount BPF filesystem: %s", ovs_strerror(errno));
+ return errno;
+ }
+
+ idx = strlen(path);
+ if (idx >= PATH_MAX - strlen("/ovs")) {
+ VLOG_WARN("BPF filesystem path \"%s\" is too long.", path);
+ return ENAMETOOLONG;
+ } else {
+ strcpy(&path[idx], "/ovs");
+ }
+
+ if (mkdir(path, 0755) && errno != EEXIST) {
+ VLOG_WARN("Failed to create %s: %s", path, ovs_strerror(errno));
+ return errno;
+ }
+
+ if (ovs_bpf_path != BPF_FS_PATH) {
+ free(CONST_CAST(char *, ovs_bpf_path));
+ }
+ ovs_bpf_path = xstrdup(path);
+ return 0;
+}
+
+int
+bpf_init(void)
+{
+ libbpf_set_print(print_WARN, print_INFO, print_DBG);
+ /* skip using mount_bpf */
+ return 0;
+}
diff --git a/lib/bpf.h b/lib/bpf.h
new file mode 100644
index 000000000000..4b5afaf4f77f
--- /dev/null
+++ b/lib/bpf.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef LIB_BPF_H
+#define LIB_BPF_H 1
+
+#include <errno.h>
+#include "openvswitch/compiler.h"
+
+#define INGRESS_HANDLE 0xFFFFFFF2
+#define EGRESS_HANDLE 0xFFFFFFF3
+
+struct bpf_prog {
+ const char *name;
+ uint32_t handle; /* tc handle */
+ int fd;
+};
+
+struct bpf_map {
+ const char *name;
+ int fd;
+};
+
+#if HAVE_BPF
+struct bpf_state;
+struct ds;
+
+#define BPF_MAX_PROG_ARRAY 64
+struct bpf_state {
+ /* File descriptors for programs. */
+ struct bpf_prog ingress; /* BPF_PROG_TYPE_SCHED_CLS */
+ struct bpf_prog egress; /* BPF_PROG_TYPE_SCHED_CLS */
+ struct bpf_prog downcall; /* BPF_PROG_TYPE_SCHED_CLS */
+ struct bpf_prog tailarray[BPF_MAX_PROG_ARRAY];
+ struct bpf_prog xdp; /* BPF_PROG_TYPE_XDP */
+ // william: struct bpf_prog parser, deparser, action,
+
+ struct bpf_map upcalls; /* BPF_MAP_TYPE_PERF_EVENT_ARRAY */
+ struct bpf_map flow_table; /* BPF_MAP_TYPE_HASH */
+ struct bpf_map datapath_stats; /* BPF_MAP_TYPE_ARRAY */
+ struct bpf_map tailcalls; /* BPF_MAP_TYPE_PROG_ARRAY */
+ struct bpf_map execute_actions; /* BPF_MAP_TYPE_ARRAY */
+ struct bpf_map dp_flow_stats; /* BPF_MAP_TYPE_HASH */
+};
+
+int bpf_get(struct bpf_state *state, bool verbose);
+void bpf_put(struct bpf_state *state);
+int bpf_load(const char *path);
+int bpf_init(void);
+void bpf_format_state(struct ds *ds, struct bpf_state *state);
+#else /* !HAVE_BPF */
+static inline int bpf_load(const char *path OVS_UNUSED) { return EOPNOTSUPP; }
+static inline int bpf_init(void) { return 0; }
+#endif /* HAVE_BPF */
+
+#endif /* LIB_BPF_H */
--
2.7.4


[RFC PATCH 03/11] lib: implement perf event ringbuffer for upcall.

William Tu
 

From: Joe Stringer <joe@...>

A flow missed by the match-action table in eBPF triggers an upcall,
which forwards the packet and metadata to ovs-vswitchd using the
skb_perf_event_output helper function. This patch implements the
userspace receiving logic.

Signed-off-by: Joe Stringer <joe@...>
Signed-off-by: William Tu <u9012063@...>
---
lib/perf-event.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
lib/perf-event.h | 43 +++++++++
2 files changed, 331 insertions(+)
create mode 100644 lib/perf-event.c
create mode 100644 lib/perf-event.h

diff --git a/lib/perf-event.c b/lib/perf-event.c
new file mode 100644
index 000000000000..c51c936033db
--- /dev/null
+++ b/lib/perf-event.c
@@ -0,0 +1,288 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include <config.h>
+#include "perf-event.h"
+
+#include <errno.h>
+#include <linux/perf_event.h>
+#include <linux/unistd.h>
+#include <openvswitch/vlog.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include "coverage.h"
+#include "openvswitch/util.h"
+#include "ovs-atomic.h"
+
+VLOG_DEFINE_THIS_MODULE(perf_event);
+
+COVERAGE_DEFINE(perf_lost);
+COVERAGE_DEFINE(perf_sample);
+COVERAGE_DEFINE(perf_unknown);
+
+struct perf_event_lost {
+ struct perf_event_header header;
+ uint64_t id;
+ uint64_t lost;
+};
+
+struct rb_cursor {
+ struct perf_event_mmap_page *page;
+ uint64_t head, tail;
+};
+
+static int
+perf_event_open_fd(int *fd_out, int cpu)
+{
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_SOFTWARE,
+ .size = sizeof(struct perf_event_attr),
+ .config = PERF_COUNT_SW_BPF_OUTPUT,
+ .sample_type = PERF_SAMPLE_RAW,
+ .watermark = 0,
+ .wakeup_events = 1,
+ };
+ int fd, error;
+
+ fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
+ if (fd < 0) {
+ error = errno;
+ VLOG_ERR("failed to open perf events (%s)", ovs_strerror(error));
+ return error;
+ }
+
+ if (ioctl(fd, PERF_EVENT_IOC_RESET, 1) == -1) {
+ error = errno;
+ VLOG_ERR("failed to reset perf events (%s)", ovs_strerror(error));
+ close(fd);
+ return error;
+ }
+
+ *fd_out = fd;
+ return 0;
+}
+
+int
+perf_channel_open(struct perf_channel *channel, int cpu, size_t page_len)
+{
+ int fd = 0, error;
+ void *page;
+
+ error = perf_event_open_fd(&fd, cpu);
+ if (error) {
+ VLOG_WARN("failed to open perf channel (cpu %d): %s",
+ cpu, ovs_strerror(error));
+ return error;
+ }
+
+ page = mmap(NULL, page_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (page == MAP_FAILED) {
+ error = errno;
+ VLOG_ERR("failed to mmap perf event fd (cpu %d): %s",
+ cpu, ovs_strerror(error));
+ close(fd);
+ return error;
+ }
+ channel->page = page;
+ channel->cpu = cpu;
+ channel->fd = fd;
+ channel->length = page_len;
+
+ return 0;
+}
+
+int
+perf_channel_set(struct perf_channel *channel, bool enable)
+{
+ int request = enable ? PERF_EVENT_IOC_ENABLE : PERF_EVENT_IOC_DISABLE;
+
+ if (ioctl(channel->fd, request, 0) == -1) {
+ return errno;
+ }
+ return 0;
+}
+
+void
+perf_channel_close(struct perf_channel *channel)
+{
+ if (ioctl(channel->fd, PERF_EVENT_IOC_DISABLE, 0) == -1) {
+ int error = errno;
+ VLOG_ERR("failed to disable perf events (%s)",
+ ovs_strerror(error));
+ }
+
+ if (munmap((void *)channel->page, channel->length)) {
+ VLOG_WARN("Failed to unmap page for cpu %d: %s",
+ channel->cpu, ovs_strerror(errno));
+ }
+ if (close(channel->fd)) {
+ VLOG_WARN("Failed to close perf event fd for cpu %d: %s",
+ channel->cpu, ovs_strerror(errno));
+ }
+ channel->page = NULL;
+ channel->fd = 0;
+ channel->length = 0;
+}
+
+static uint8_t *
+rb_base(struct rb_cursor *cursor)
+{
+ return ((uint8_t *)cursor->page) + cursor->page->data_offset;
+}
+
+static uint8_t *
+rb_end(struct rb_cursor *cursor)
+{
+ return rb_base(cursor) + cursor->page->data_size;
+}
+
+static uint64_t
+cursor_event_offset(struct rb_cursor *cursor)
+{
+ return cursor->tail % cursor->page->data_size;
+}
+
+static uint64_t
+cursor_end_offset(struct rb_cursor *cursor)
+{
+ return cursor->head % cursor->page->data_size;
+}
+
+static void *
+cursor_peek(struct rb_cursor *cursor)
+{
+ void *next = rb_base(cursor) + cursor_event_offset(cursor);
+ void *end = rb_base(cursor) + cursor_end_offset(cursor);
+
+ return (next != end) ? next : NULL;
+}
+
+static uint8_t *
+event_end(struct perf_event_header *header)
+{
+ return (uint8_t *)header + header->size;
+}
+
+static bool
+init_cursor(struct rb_cursor *cursor,
+ struct perf_event_mmap_page *page)
+{
+ uint64_t head = *((volatile uint64_t *)&page->data_head);
+ uint64_t tail = page->data_tail;
+
+ /* Separate the read of 'data_head' from the read of the ringbuffer data.*/
+ atomic_thread_fence(memory_order_consume);
+
+ cursor->page = page;
+ cursor->head = head;
+ cursor->tail = tail;
+
+ return head != tail;
+}
+
+static void
+perf_event_pull(struct perf_event_mmap_page *page, uint64_t tail)
+{
+ /* Separate reads in the ringbuffer from the writing of the tail. */
+ atomic_thread_fence(memory_order_release);
+ page->data_tail = tail;
+}
+
+static bool
+perf_event_copy(struct rb_cursor *cursor, struct ofpbuf *buffer)
+{
+ struct perf_event_header *header = cursor_peek(cursor);
+
+ if (!header) {
+ return false;
+ }
+
+ ofpbuf_clear(buffer);
+ if (event_end(header) <= rb_end(cursor)) {
+ ofpbuf_push(buffer, header, header->size);
+ } else {
+ uint64_t seg1_len = rb_end(cursor) - (uint8_t *)header;
+ uint64_t seg2_len = header->size - seg1_len;
+
+ ofpbuf_put(buffer, header, seg1_len);
+ ofpbuf_put(buffer, rb_base(cursor), seg2_len);
+ }
+
+ buffer->header = buffer->data;
+ cursor->tail += header->size;
+
+ return true;
+}
+
+/* Reads the next full perf event from 'channel' into 'buffer'.
+ *
+ * 'buffer' may be reallocated, so the caller must subsequently uninitialize
+ * it. 'buffer->header' will be updated to point to the beginning of the event,
+ * which starts with a 'struct perf_event_header'.
+ *
+ * Returns 0 if there is a new OVS event, otherwise a positive errno value.
+ * Returns EAGAIN if there are no new events.
+ */
+int
+perf_channel_read(struct perf_channel *channel, struct ofpbuf *buffer)
+{
+ struct rb_cursor cursor;
+ int error = EAGAIN;
+
+ if (!init_cursor(&cursor, channel->page)) {
+ return error;
+ }
+
+ if (perf_event_copy(&cursor, buffer)) {
+ struct perf_event_header *header = buffer->header;
+
+ switch (header->type) {
+ case PERF_RECORD_SAMPLE:
+ /* Success! */
+ COVERAGE_INC(perf_sample);
+ error = 0;
+ break;
+ case PERF_RECORD_LOST: {
+ struct perf_event_lost *e = buffer->header;
+ COVERAGE_ADD(perf_lost, e->lost);
+ error = ENOBUFS;
+ break;
+ }
+ default:
+ COVERAGE_INC(perf_unknown);
+ error = EPROTO;
+ break;
+ }
+
+ perf_event_pull(channel->page, cursor.tail);
+ }
+
+ return error;
+}
+
+void
+perf_channel_flush(struct perf_channel *channel)
+{
+ struct perf_event_mmap_page *page = channel->page;
+ uint64_t head = *((volatile uint64_t *)&page->data_head);
+
+ /* The memory_order_consume fence is unnecessary when we don't read any
+ * of the data from the ringbuffer - see perf_output_put_handle().
+ * However, we still need to order the above read wrt the tail write. */
+ perf_event_pull(page, head);
+}
diff --git a/lib/perf-event.h b/lib/perf-event.h
new file mode 100644
index 000000000000..74bc8e961dbc
--- /dev/null
+++ b/lib/perf-event.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2016 Nicira, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef PERF_EVENT_H
+#define PERF_EVENT_H 1
+
+#include <linux/perf_event.h>
+#include "openvswitch/ofpbuf.h"
+#include "openvswitch/types.h"
+
+struct perf_event_raw {
+ struct perf_event_header header;
+ uint32_t size;
+ /* Followed by uint8_t data[size]; */
+};
+
+struct perf_channel {
+ struct perf_event_mmap_page *page;
+ int cpu;
+ int fd;
+ size_t length;
+};
+
+int perf_channel_open(struct perf_channel *, int cpu, size_t page_len);
+int perf_channel_set(struct perf_channel *channel, bool enable);
+int perf_channel_read(struct perf_channel *, struct ofpbuf *);
+void perf_channel_flush(struct perf_channel *);
+void perf_channel_close(struct perf_channel *);
+
+#endif /* PERF_EVENT_H */
--
2.7.4
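The subtle part of the reader above is perf_event_copy()'s split copy: a record in the mmap'ed perf ring may straddle the end of the data area, in which case it is reassembled from two segments. A standalone sketch of that logic, assuming a toy ring (`mini_ring` and `ring_read` are made-up names, not the OVS API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 16

struct mini_ring {
    uint8_t data[RING_SIZE];
    uint64_t tail;              /* Monotonic read offset, like data_tail. */
};

/* Copies one 'len'-byte record starting at 'ring->tail' into 'out',
 * handling wrap-around, then advances the tail past the record. */
void
ring_read(struct mini_ring *ring, uint8_t *out, size_t len)
{
    uint64_t off = ring->tail % RING_SIZE;
    size_t seg1 = RING_SIZE - off;      /* Bytes before the wrap point. */

    if (len <= seg1) {
        memcpy(out, ring->data + off, len);
    } else {
        memcpy(out, ring->data + off, seg1);        /* First segment. */
        memcpy(out + seg1, ring->data, len - seg1); /* Wrapped remainder. */
    }
    ring->tail += len;
}
```

An 8-byte record placed at offset 12 of the 16-byte ring comes back reassembled from its 4-byte head and 4-byte wrapped remainder, which is exactly the two ofpbuf_put() calls in perf_event_copy().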


[RFC PATCH 02/11] netdev: add ebpf support for netdev provider.

William Tu
 

From: Joe Stringer <joe@...>

To receive packets, an eBPF program has to be attached to a netdev
through the tc ingress/egress hooks, or an XDP program has to be
attached to the netdev's XDP hook point. This patch introduces two
new netdev_class functions, set_filter and set_xdp, for this purpose.
Two netdev types, netdev-linux and netdev-vport, provide the actual
implementations.

Signed-off-by: William Tu <u9012063@...>
Co-authored-by: William Tu <u9012063@...>
Co-authored-by: Yifeng Sun <pkusunyifeng@...>
---
include/linux/pkt_cls.h | 21 +++
lib/dpif-netdev.c | 29 ++--
lib/netdev-bsd.c | 2 +
lib/netdev-dpdk.c | 2 +
lib/netdev-dummy.c | 2 +
lib/netdev-linux.c | 436 +++++++++++++++++++++++++++++++++++++++++++++++-
lib/netdev-linux.h | 2 +
lib/netdev-provider.h | 11 ++
lib/netdev-vport.c | 145 +++++++++++++++-
lib/netdev.c | 25 +++
lib/netdev.h | 4 +
11 files changed, 655 insertions(+), 24 deletions(-)
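Before the diff, the shape of the new provider hooks can be sketched in isolation: the generic layer dispatches to a per-class function pointer and reports EOPNOTSUPP when the class leaves the hook NULL. The types below are simplified stand-ins for illustration, not the real OVS structures:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct bpf_prog {
    int fd;                     /* BPF program file descriptor. */
};

struct netdev_class {
    int (*set_filter)(const struct bpf_prog *);
    int (*set_xdp)(const struct bpf_prog *);
};

struct netdev {
    const struct netdev_class *class;
};

/* Mirrors the generic netdev_set_filter() wrapper: dispatch to the
 * class hook if present, else report lack of support. */
int
netdev_set_filter(struct netdev *netdev, const struct bpf_prog *prog)
{
    return (netdev->class->set_filter
            ? netdev->class->set_filter(prog)
            : EOPNOTSUPP);
}

/* A class that accepts any program with a valid fd, standing in for a
 * provider such as netdev-linux. */
int
accept_filter(const struct bpf_prog *prog)
{
    return (prog && prog->fd >= 0) ? 0 : EINVAL;
}
```

Providers such as netdev-dpdk or netdev-dummy simply leave both slots NULL, so callers see EOPNOTSUPP rather than a crash.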

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index f7bc7ea708d7..770af90a5c64 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -104,6 +104,27 @@ enum {
__TCA_BASIC_MAX
};

+/* BPF classifier */
+
+#define TCA_BPF_FLAG_ACT_DIRECT (1 << 0)
+
+enum {
+ TCA_BPF_UNSPEC,
+ TCA_BPF_ACT,
+ TCA_BPF_POLICE,
+ TCA_BPF_CLASSID,
+ TCA_BPF_OPS_LEN,
+ TCA_BPF_OPS,
+ TCA_BPF_FD,
+ TCA_BPF_NAME,
+ TCA_BPF_FLAGS,
+ TCA_BPF_FLAGS_GEN,
+ TCA_BPF_TAG,
+ __TCA_BPF_MAX,
+};
+
+#define TCA_BPF_MAX (__TCA_BPF_MAX - 1)
+
/* Flower classifier */

enum {
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index ba62128c758c..baff020fe3d0 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1505,12 +1505,6 @@ dp_netdev_reload_pmd__(struct dp_netdev_pmd_thread *pmd)
ovs_mutex_unlock(&pmd->cond_mutex);
}

-static uint32_t
-hash_port_no(odp_port_t port_no)
-{
- return hash_int(odp_to_u32(port_no), 0);
-}
-
static int
port_create(const char *devname, const char *type,
odp_port_t port_no, struct dp_netdev_port **portp)
@@ -1525,6 +1519,7 @@ port_create(const char *devname, const char *type,

/* Open and validate network device. */
error = netdev_open(devname, type, &netdev);
+ VLOG_INFO("%s %s error %d", __func__, devname, error);
if (error) {
return error;
}
@@ -1578,7 +1573,7 @@ do_add_port(struct dp_netdev *dp, const char *devname, const char *type,
return error;
}

- hmap_insert(&dp->ports, &port->node, hash_port_no(port_no));
+ hmap_insert(&dp->ports, &port->node, netdev_hash_port_no(port_no));
seq_change(dp->port_seq);

reconfigure_datapath(dp);
@@ -1596,6 +1591,8 @@ dpif_netdev_port_add(struct dpif *dpif, struct netdev *netdev,
odp_port_t port_no;
int error;

+ VLOG_INFO("%s", __func__);
+
ovs_mutex_lock(&dp->port_mutex);
dpif_port = netdev_vport_get_dpif_port(netdev, namebuf, sizeof namebuf);
if (*port_nop != ODPP_NONE) {
@@ -1648,7 +1645,8 @@ dp_netdev_lookup_port(const struct dp_netdev *dp, odp_port_t port_no)
{
struct dp_netdev_port *port;

- HMAP_FOR_EACH_WITH_HASH (port, node, hash_port_no(port_no), &dp->ports) {
+ HMAP_FOR_EACH_WITH_HASH (port, node, netdev_hash_port_no(port_no),
+ &dp->ports) {
if (port->port_no == port_no) {
return port;
}
@@ -1808,7 +1806,7 @@ dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
odp_port_t in_port)
{
struct dpcls *cls;
- uint32_t hash = hash_port_no(in_port);
+ uint32_t hash = netdev_hash_port_no(in_port);
CMAP_FOR_EACH_WITH_HASH (cls, node, hash, &pmd->classifiers) {
if (cls->in_port == in_port) {
/* Port classifier exists already */
@@ -1824,7 +1822,7 @@ dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd,
OVS_REQUIRES(pmd->flow_mutex)
{
struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
- uint32_t hash = hash_port_no(in_port);
+ uint32_t hash = netdev_hash_port_no(in_port);

if (!cls) {
/* Create new classifier for in_port */
@@ -3311,7 +3309,7 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
{
struct tx_port *tx;

- HMAP_FOR_EACH_IN_BUCKET (tx, node, hash_port_no(port_no), hmap) {
+ HMAP_FOR_EACH_IN_BUCKET (tx, node, netdev_hash_port_no(port_no), hmap) {
if (tx->port->port_no == port_no) {
return tx;
}
@@ -4034,13 +4032,13 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
if (netdev_has_tunnel_push_pop(tx_port->port->netdev)) {
tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
hmap_insert(&pmd->tnl_port_cache, &tx_port_cached->node,
- hash_port_no(tx_port_cached->port->port_no));
+ netdev_hash_port_no(tx_port_cached->port->port_no));
}

if (netdev_n_txq(tx_port->port->netdev)) {
tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
- hash_port_no(tx_port_cached->port->port_no));
+ netdev_hash_port_no(tx_port_cached->port->port_no));
}
}
}
@@ -4793,7 +4791,8 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
tx->flush_time = 0LL;
dp_packet_batch_init(&tx->output_pkts);

- hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
+ hmap_insert(&pmd->tx_ports, &tx->node,
+ netdev_hash_port_no(tx->port->port_no));
pmd->need_reload = true;
}

@@ -5965,7 +5964,7 @@ dpif_dummy_change_port_number(struct unixctl_conn *conn, int argc OVS_UNUSED,

/* Reinsert with new port number. */
port->port_no = port_no;
- hmap_insert(&dp->ports, &port->node, hash_port_no(port_no));
+ hmap_insert(&dp->ports, &port->node, netdev_hash_port_no(port_no));
reconfigure_datapath(dp);

seq_change(dp->port_seq);
diff --git a/lib/netdev-bsd.c b/lib/netdev-bsd.c
index 05974c100895..1460ae2504c5 100644
--- a/lib/netdev-bsd.c
+++ b/lib/netdev-bsd.c
@@ -1516,6 +1516,8 @@ netdev_bsd_update_flags(struct netdev *netdev_, enum netdev_flags off,
NULL, /* set_advertisement */ \
NULL, /* get_pt_mode */ \
NULL, /* set_policing */ \
+ NULL, /* set_filter */ \
+ NULL, /* set_xdp */ \
NULL, /* get_qos_type */ \
NULL, /* get_qos_capabilities */ \
NULL, /* get_qos */ \
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 52d8fe6b7ac2..20116c22137e 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -3854,6 +3854,8 @@ unlock:
NULL, /* get_pt_mode */ \
\
netdev_dpdk_set_policing, \
+ NULL, /* set_filter */ \
+ NULL, /* set_xdp */ \
netdev_dpdk_get_qos_types, \
NULL, /* get_qos_capabilities */ \
netdev_dpdk_get_qos, \
diff --git a/lib/netdev-dummy.c b/lib/netdev-dummy.c
index 4246af3b9c86..44c9458a9a22 100644
--- a/lib/netdev-dummy.c
+++ b/lib/netdev-dummy.c
@@ -1427,6 +1427,8 @@ netdev_dummy_update_flags(struct netdev *netdev_,
NULL, /* get_pt_mode */ \
\
NULL, /* set_policing */ \
+ NULL, /* set_filter */ \
+ NULL, /* set_xdp */ \
NULL, /* get_qos_types */ \
NULL, /* get_qos_capabilities */ \
NULL, /* get_qos */ \
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 4e0473cf331f..121dd3bc738e 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -46,6 +46,9 @@
#include <string.h>
#include <unistd.h>

+#include <bpf/libbpf.h> /* linux/tools/bpf/libbpf.h */
+
+#include "bpf.h"
#include "coverage.h"
#include "dp-packet.h"
#include "dpif-netlink.h"
@@ -227,6 +230,9 @@ enum {
VALID_VPORT_STAT_ERROR = 1 << 5,
VALID_DRVINFO = 1 << 6,
VALID_FEATURES = 1 << 7,
+ VALID_INGRESS_FILTER = 1 << 8,
+ VALID_EGRESS_FILTER = 1 << 9,
+ VALID_XDP_FILTER = 1 << 10,
};

/* Traffic control. */
@@ -421,6 +427,7 @@ static const struct tc_ops tc_ops_sfq;
static const struct tc_ops tc_ops_default;
static const struct tc_ops tc_ops_noop;
static const struct tc_ops tc_ops_other;
+static const struct tc_ops tc_ops_clsact;

static const struct tc_ops *const tcs[] = {
&tc_ops_htb, /* Hierarchy token bucket (see tc-htb(8)). */
@@ -431,6 +438,7 @@ static const struct tc_ops *const tcs[] = {
&tc_ops_noop, /* Non operating qos type. */
&tc_ops_default, /* Default qdisc (see tc-pfifo_fast(8)). */
&tc_ops_other, /* Some other qdisc. */
+ &tc_ops_clsact, /* Classifier with nested action. */
NULL
};

@@ -442,8 +450,12 @@ static struct tcmsg *netdev_linux_tc_make_request(const struct netdev *,
int type,
unsigned int flags,
struct ofpbuf *);
+static int clsact_install__(struct netdev *netdev_);
static int tc_add_policer(struct netdev *,
uint32_t kbits_rate, uint32_t kbits_burst);
+static int tc_add_filter(struct netdev *, int fd, uint32_t parent,
+ const char *name);
+static bool tc_is_clsact(const struct tc *tc);

static int tc_parse_qdisc(const struct ofpbuf *, const char **kind,
struct nlattr **options);
@@ -485,13 +497,19 @@ struct netdev_linux {
long long int carrier_resets;
uint32_t kbits_rate; /* Policing data. */
uint32_t kbits_burst;
+ uint32_t ingress_filter; /* BPF ingress filter fd. */
+ uint32_t egress_filter; /* BPF egress filter fd. */
+ uint32_t ingress_xdp_filter;/* XDP ingress filter fd. */
int vport_stats_error; /* Cached error code from vport_get_stats().
0 or an errno value. */
int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */
int ether_addr_error; /* Cached error code from set/get etheraddr. */
int netdev_policing_error; /* Cached error code from set policing. */
+ int ingress_filter_error; /* Cached error code from set filter. */
+ int egress_filter_error; /* Cached error code from set filter. */
int get_features_error; /* Cached error code from ETHTOOL_GSET. */
int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */
+ int ingress_xdp_error;

enum netdev_features current; /* Cached from ETHTOOL_GSET. */
enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */
@@ -2159,8 +2177,14 @@ netdev_linux_set_policing(struct netdev *netdev_,
if (kbits_rate) {
error = tc_add_del_ingress_qdisc(ifindex, true);
if (error) {
- VLOG_WARN_RL(&rl, "%s: adding policing qdisc failed: %s",
- netdev_name, ovs_strerror(error));
+ const char *bpf_conflict = "";
+
+ if (error == EEXIST && (netdev->ingress_filter
+ || netdev->egress_filter)) {
+ bpf_conflict = " (conflicts with BPF)";
+ }
+ VLOG_WARN_RL(&rl, "%s: adding policing qdisc failed: %s%s",
+ netdev_name, ovs_strerror(error), bpf_conflict);
goto out;
}

@@ -2184,6 +2208,268 @@ out:
return error;
}

+/* Attempts to set a BPF filter on the device. Returns 0 if successful,
+ * otherwise a positive errno value. */
+static int
+netdev_linux_set_filter__(struct netdev *netdev_, const struct bpf_prog *prog,
+ unsigned int valid_bit, int *filter_error,
+ uint32_t *netdev_filter)
+{
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+ const char *netdev_name = netdev_get_name(netdev_);
+ int error = 0;
+
+ if (prog) {
+ VLOG_DBG("Setting %s filter %d on %s (handle %08"PRIx32")",
+ prog->name, prog->fd, netdev_name, prog->handle);
+ }
+
+ if (netdev->cache_valid & valid_bit) {
+ error = *filter_error;
+ if (error || (prog && prog->fd == *netdev_filter)) {
+ /* Assume that settings haven't changed since we last set them. */
+ goto out;
+ }
+ netdev->cache_valid &= ~valid_bit;
+ }
+
+ /* Remove non-clsact qdiscs. */
+ if (netdev->tc && !tc_is_clsact(netdev->tc)) {
+ error = tc_del_qdisc(netdev_);
+ if (error) {
+ VLOG_WARN_RL(&rl, "%s: removing qdisc failed: %s",
+ netdev_name, ovs_strerror(error));
+ goto out;
+ }
+ }
+
+ if (prog) {
+ if (!netdev->tc || !tc_is_clsact(netdev->tc)) {
+ error = clsact_install__(netdev_);
+ if (error && error != EEXIST) {
+ VLOG_WARN_RL(&rl, "%s: clsact qdisc setup failed: %s",
+ netdev_name, ovs_strerror(error));
+ goto out;
+ }
+ }
+
+ error = tc_add_filter(netdev_, prog->fd, prog->handle, prog->name);
+ if (error) {
+ VLOG_WARN_RL(&rl, "%s: adding filter %s failed: %s",
+ netdev_name, prog->name, ovs_strerror(error));
+ goto out;
+ }
+ }
+
+ *netdev_filter = prog ? prog->fd : 0;
+
+out:
+ if (!error || error == ENODEV) {
+ *filter_error = error;
+ netdev->cache_valid |= valid_bit;
+ }
+ return error;
+}
+
+static int
+netdev_linux_set_filter(struct netdev *netdev_, const struct bpf_prog *prog)
+{
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+ int error;
+
+ ovs_mutex_lock(&netdev->mutex);
+ if (!prog || prog->handle == INGRESS_HANDLE) {
+ error = netdev_linux_set_filter__(netdev_, prog, VALID_INGRESS_FILTER,
+ &netdev->ingress_filter_error,
+ &netdev->ingress_filter);
+ } else {
+ error = netdev_linux_set_filter__(netdev_, prog, VALID_EGRESS_FILTER,
+ &netdev->egress_filter_error,
+ &netdev->egress_filter);
+ }
+ ovs_mutex_unlock(&netdev->mutex);
+
+ return error;
+}
+
+#ifndef SOL_NETLINK
+#define SOL_NETLINK 270
+#endif
+
+/* Extracted from libbpf. */
+int
+bpf_set_link_xdp_fd(int ifindex, int fd, uint32_t flags)
+{
+ struct sockaddr_nl sa;
+ int sock, seq = 0, len, ret = -1;
+ char buf[4096];
+ struct nlattr *nla, *nla_xdp;
+ struct {
+ struct nlmsghdr nh;
+ struct ifinfomsg ifinfo;
+ char attrbuf[64];
+ } req;
+ struct nlmsghdr *nh;
+ struct nlmsgerr *err;
+ socklen_t addrlen;
+ int one = 1;
+
+ memset(&sa, 0, sizeof(sa));
+ sa.nl_family = AF_NETLINK;
+
+ sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+ if (sock < 0) {
+ return -errno;
+ }
+
+ if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
+ &one, sizeof(one)) < 0) {
+ VLOG_WARN_RL(&rl, "Netlink error reporting not supported");
+ }
+
+ if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+ ret = -errno;
+ goto cleanup;
+ }
+
+ addrlen = sizeof(sa);
+ if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
+ ret = -errno;
+ goto cleanup;
+ }
+
+ if (addrlen != sizeof(sa)) {
+ goto cleanup;
+ }
+
+ memset(&req, 0, sizeof(req));
+ req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+ req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ req.nh.nlmsg_type = RTM_SETLINK;
+ req.nh.nlmsg_pid = 0;
+ req.nh.nlmsg_seq = ++seq;
+ req.ifinfo.ifi_family = AF_UNSPEC;
+ req.ifinfo.ifi_index = ifindex;
+
+ /* start nested attribute for XDP */
+ nla = (struct nlattr *)(((char *)&req)
+ + NLMSG_ALIGN(req.nh.nlmsg_len));
+ nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+ nla->nla_len = NLA_HDRLEN;
+
+ /* add XDP fd */
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = IFLA_XDP_FD;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+ nla->nla_len += nla_xdp->nla_len;
+
+ /* if user passed in any flags, add those too */
+ if (flags) {
+ nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+ nla_xdp->nla_type = IFLA_XDP_FLAGS;
+ nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+ memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+ nla->nla_len += nla_xdp->nla_len;
+ }
+
+ req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+ /* send */
+ if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+ ret = -errno;
+ goto cleanup;
+ }
+
+ /* recv */
+ len = recv(sock, buf, sizeof(buf), 0);
+ if (len < 0) {
+ ret = -errno;
+ goto cleanup;
+ }
+
+ for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+ nh = NLMSG_NEXT(nh, len)) {
+ if (nh->nlmsg_pid != sa.nl_pid) {
+ ret = -1;
+ goto cleanup;
+ }
+ if (nh->nlmsg_seq != seq) {
+ ret = -1;
+ goto cleanup;
+ }
+ switch (nh->nlmsg_type) {
+ case NLMSG_ERROR:
+ err = (struct nlmsgerr *)NLMSG_DATA(nh);
+ if (!err->error) {
+ continue;
+ }
+ ret = err->error;
+ /* nla_dump_errormsg(nh); */
+ goto cleanup;
+ case NLMSG_DONE:
+ break;
+ default:
+ break;
+ }
+ }
+
+ ret = 0;
+
+cleanup:
+ close(sock);
+ return ret;
+}
+
+static int
+netdev_linux_set_xdp__(struct netdev *netdev_, const struct bpf_prog *prog,
+ unsigned int valid_bit, int *filter_error,
+ uint32_t *netdev_filter)
+{
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+ const char *netdev_name = netdev_get_name(netdev_);
+ int ifindex = netdev->ifindex;
+ int error;
+
+ VLOG_DBG("Setting %s XDP filter %d on %s (ifindex %d)", prog->name,
+ prog->fd, netdev_name, ifindex);
+
+ if (netdev->cache_valid & valid_bit) {
+ error = *filter_error;
+ if (error || (prog && prog->fd == *netdev_filter)) {
+ /* Assume that settings haven't changed since we last set them. */
+ goto out;
+ }
+ netdev->cache_valid &= ~valid_bit;
+ }
+ /* bpf_set_link_xdp_fd() returns a negative errno; convert it to the
+ * positive-errno convention used by the rest of this file. */
+ error = -bpf_set_link_xdp_fd(ifindex, prog->fd, XDP_FLAGS_SKB_MODE);
+ if (error) {
+ VLOG_WARN_RL(&rl, "%s: adding XDP filter %s failed: %s",
+ netdev_name, prog->name, ovs_strerror(error));
+ goto out;
+ }
+
+out:
+ if (!error || error == ENODEV) {
+ *filter_error = error;
+ netdev->cache_valid |= valid_bit;
+ }
+ return error;
+}
+
+static int
+netdev_linux_set_xdp(struct netdev *netdev_, const struct bpf_prog *prog)
+{
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+ int error;
+
+ ovs_mutex_lock(&netdev->mutex);
+ error = netdev_linux_set_xdp__(netdev_, prog, VALID_XDP_FILTER,
+ &netdev->ingress_xdp_error,
+ &netdev->ingress_xdp_filter);
+ ovs_mutex_unlock(&netdev->mutex);
+
+ return error;
+}
+
static int
netdev_linux_get_qos_types(const struct netdev *netdev OVS_UNUSED,
struct sset *types)
@@ -2879,6 +3165,8 @@ netdev_linux_update_flags(struct netdev *netdev_, enum netdev_flags off,
NULL, /* get_pt_mode */ \
\
netdev_linux_set_policing, \
+ netdev_linux_set_filter, \
+ netdev_linux_set_xdp, \
netdev_linux_get_qos_types, \
netdev_linux_get_qos_capabilities, \
netdev_linux_get_qos, \
@@ -4671,6 +4959,74 @@ static const struct tc_ops tc_ops_other = {
NULL /* class_dump_stats */
};

+/* "linux-clsact" traffic control class. */
+static int
+clsact_setup_qdisc(struct netdev *netdev)
+{
+ struct ofpbuf request;
+ struct tcmsg *tcmsg;
+
+ tcmsg = netdev_linux_tc_make_request(netdev, RTM_NEWQDISC,
+ NLM_F_EXCL | NLM_F_CREATE, &request);
+ if (!tcmsg) {
+ return ENODEV;
+ }
+ tcmsg->tcm_handle = tc_make_handle(0xFFFF, 0);
+ tcmsg->tcm_parent = TC_H_INGRESS;
+ nl_msg_put_string(&request, TCA_KIND, "clsact");
+ nl_msg_put_unspec(&request, TCA_OPTIONS, NULL, 0);
+
+ return tc_transact(&request, NULL);
+}
+
+static int
+clsact_install__(struct netdev *netdev_)
+{
+ static const struct tc tc = TC_INITIALIZER(&tc, &tc_ops_clsact);
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+ int error;
+
+ error = clsact_setup_qdisc(netdev_);
+ if (error) {
+ return error;
+ }
+
+ /* Nothing but a tc class implementation is allowed to write to a tc. This
+ * class never does that, so we can legitimately use a const tc object. */
+ netdev->tc = CONST_CAST(struct tc *, &tc);
+
+ return 0;
+}
+
+static int
+clsact_tc_install(struct netdev *netdev,
+ const struct smap *details OVS_UNUSED)
+{
+ return clsact_install__(netdev);
+}
+
+static int
+clsact_tc_load(struct netdev *netdev, struct ofpbuf *nlmsg OVS_UNUSED)
+{
+ return clsact_install__(netdev);
+}
+
+static const struct tc_ops tc_ops_clsact = {
+ "clsact", /* linux_name */
+ "linux-clsact", /* ovs_name */
+ 0, /* n_queues */
+ clsact_tc_install,
+ clsact_tc_load,
+ NULL, /* tc_destroy */
+ NULL, /* qdisc_get */
+ NULL, /* qdisc_set */
+ NULL, /* class_get */
+ NULL, /* class_set */
+ NULL, /* class_delete */
+ NULL, /* class_get_stats */
+ NULL /* class_dump_stats */
+};
+
/* Traffic control. */

/* Number of kernel "tc" ticks per second. */
@@ -4775,6 +5131,49 @@ tc_add_policer(struct netdev *netdev,
return 0;
}

+/* Adds a filter to 'netdev' corresponding to the BPF program associated with 'fd'.
+ *
+ * This function is equivalent to running:
+ * /sbin/tc filter add dev <devname> <parent> bpf da object-pinned <path>
+ *
+ * The configuration and stats may be seen with the following command:
+ * /sbin/tc -s filter show dev <devname> <parent>
+ *
+ * Returns 0 if successful, otherwise a positive errno value.
+ */
+static int
+tc_add_filter(struct netdev *netdev, int fd, uint32_t parent, const char *name)
+{
+ struct ofpbuf request;
+ struct tcmsg *tcmsg;
+ size_t opts_offset;
+ int error;
+
+ tcmsg = netdev_linux_tc_make_request(netdev, RTM_NEWTFILTER,
+ NLM_F_EXCL | NLM_F_CREATE, &request);
+ if (!tcmsg) {
+ return ENODEV;
+ }
+ tcmsg->tcm_handle = tc_make_handle(0, 0x1);
+ tcmsg->tcm_parent = parent;
+ tcmsg->tcm_info = tc_make_handle(0, /* preference */
+ (OVS_FORCE uint16_t) htons(ETH_P_ALL));
+
+ nl_msg_put_string(&request, TCA_KIND, "bpf");
+ opts_offset = nl_msg_start_nested(&request, TCA_OPTIONS);
+ nl_msg_put_u32(&request, TCA_BPF_FLAGS, TCA_BPF_FLAG_ACT_DIRECT);
+ nl_msg_put_u32(&request, TCA_BPF_FD, fd);
+ nl_msg_put_string(&request, TCA_BPF_NAME, name);
+ nl_msg_end_nested(&request, opts_offset);
+
+ error = tc_transact(&request, NULL);
+ if (error) {
+ return error;
+ }
+
+ return 0;
+}
+
static void
read_psched(void)
{
@@ -5060,21 +5459,21 @@ tc_delete_class(const struct netdev *netdev, unsigned int handle)
return error;
}

-/* Equivalent to "tc qdisc del dev <name> root". */
+/* Equivalent to "tc qdisc del dev <name> handle <handle> <parent>". */
static int
-tc_del_qdisc(struct netdev *netdev_)
+tc_del_qdisc__(struct netdev_linux *netdev, uint32_t parent, uint32_t handle)
{
- struct netdev_linux *netdev = netdev_linux_cast(netdev_);
struct ofpbuf request;
struct tcmsg *tcmsg;
int error;

- tcmsg = netdev_linux_tc_make_request(netdev_, RTM_DELQDISC, 0, &request);
+ tcmsg = netdev_linux_tc_make_request(&netdev->up, RTM_DELQDISC, 0,
+ &request);
if (!tcmsg) {
return ENODEV;
}
- tcmsg->tcm_handle = tc_make_handle(1, 0);
- tcmsg->tcm_parent = TC_H_ROOT;
+ tcmsg->tcm_handle = handle;
+ tcmsg->tcm_parent = parent;

error = tc_transact(&request, NULL);
if (error == EINVAL) {
@@ -5092,6 +5491,27 @@ tc_del_qdisc(struct netdev *netdev_)
}

static bool
+tc_is_clsact(const struct tc *tc)
+{
+ if (!tc || !tc->ops->linux_name) {
+ return false;
+ }
+ return !strcmp(tc->ops->linux_name, "clsact");
+}
+
+static int
+tc_del_qdisc(struct netdev *netdev_)
+{
+ struct netdev_linux *netdev = netdev_linux_cast(netdev_);
+
+ if (netdev->tc && tc_is_clsact(netdev->tc)) {
+ return tc_del_qdisc__(netdev, TC_H_INGRESS,
+ tc_make_handle(TC_H_INGRESS, 0));
+ }
+ return tc_del_qdisc__(netdev, TC_H_ROOT, tc_make_handle(1, 0));
+}
+
+static bool
getqdisc_is_safe(void)
{
static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
diff --git a/lib/netdev-linux.h b/lib/netdev-linux.h
index 880f86402a1e..8257d4c695f9 100644
--- a/lib/netdev-linux.h
+++ b/lib/netdev-linux.h
@@ -29,6 +29,8 @@ int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag,
const char *flag_name, bool enable);
int linux_get_ifindex(const char *netdev_name);

+int bpf_set_link_xdp_fd(int ifindex, int fd, uint32_t flags);
+
#define LINUX_FLOW_OFFLOAD_API \
netdev_tc_flow_flush, \
netdev_tc_flow_dump_create, \
diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
index 25bd671c1382..3e53a5b76272 100644
--- a/lib/netdev-provider.h
+++ b/lib/netdev-provider.h
@@ -32,6 +32,7 @@
extern "C" {
#endif

+struct bpf_prog;
struct netdev_tnl_build_header_params;
#define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC

@@ -505,6 +506,16 @@ struct netdev_class {
int (*set_policing)(struct netdev *netdev, unsigned int kbits_rate,
unsigned int kbits_burst);

+ /* Attempts to attach a traffic filter in the form of an (e)BPF program.
+ *
+ * This function may be set to null if filters are not supported. */
+ int (*set_filter)(struct netdev *netdev, const struct bpf_prog *);
+
+ /* Attempts to attach an XDP eBPF program.
+ *
+ * This function may be set to null if filters are not supported. */
+ int (*set_xdp)(struct netdev *netdev, const struct bpf_prog *);
+
/* Adds to 'types' all of the forms of QoS supported by 'netdev', or leaves
* it empty if 'netdev' does not support QoS. Any names added to 'types'
* should be documented as valid for the "type" column in the "QoS" table
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 52aa12d79933..4341c89894a3 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -22,12 +22,14 @@
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
+#include <linux/rtnetlink.h>
#include <net/if.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <sys/ioctl.h>

+#include "bpf.h"
#include "byte-order.h"
#include "daemon.h"
#include "dirs.h"
@@ -43,6 +45,7 @@
#include "route-table.h"
#include "smap.h"
#include "socket-util.h"
+#include "tc.h"
#include "unaligned.h"
#include "unixctl.h"
#include "openvswitch/vlog.h"
@@ -72,6 +75,10 @@ struct vport_class {
struct netdev_class netdev_class;
};

+/* This is set pretty low because we probably won't learn anything from the
+ * additional log messages. */
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+
bool
netdev_vport_is_vport_class(const struct netdev_class *class)
{
@@ -866,6 +873,140 @@ netdev_vport_get_ifindex(const struct netdev *netdev_)
return linux_get_ifindex(name);
}

+/* "linux-clsact" traffic control class. */
+static int
+clsact_setup_qdisc(struct netdev *netdev)
+{
+ struct ofpbuf request;
+ struct tcmsg *tcmsg;
+ int ifindex;
+
+ ifindex = netdev_vport_get_ifindex(netdev);
+
+ tcmsg = tc_make_request(ifindex, RTM_NEWQDISC, NLM_F_EXCL | NLM_F_CREATE,
+ &request);
+ if (!tcmsg) {
+ return ENODEV;
+ }
+ tcmsg->tcm_handle = tc_make_handle(0xFFFF, 0);
+ tcmsg->tcm_parent = TC_H_INGRESS;
+ nl_msg_put_string(&request, TCA_KIND, "clsact");
+ nl_msg_put_unspec(&request, TCA_OPTIONS, NULL, 0);
+
+ return tc_transact(&request, NULL);
+}
+
+static int
+tc_add_filter(struct netdev *netdev, int fd, uint32_t parent, const char *name)
+{
+ struct ofpbuf request;
+ struct tcmsg *tcmsg;
+ size_t opts_offset;
+ int ifindex;
+ int error;
+
+ ifindex = netdev_vport_get_ifindex(netdev);
+
+ tcmsg = tc_make_request(ifindex, RTM_NEWTFILTER, NLM_F_EXCL | NLM_F_CREATE,
+ &request);
+ if (!tcmsg) {
+ return ENODEV;
+ }
+ tcmsg->tcm_handle = tc_make_handle(0, 0x1);
+ tcmsg->tcm_parent = parent;
+#define ETH_P_ALL 0x0003
+ tcmsg->tcm_info = tc_make_handle(0, /* preference */
+ (OVS_FORCE uint16_t) htons(ETH_P_ALL));
+
+ nl_msg_put_string(&request, TCA_KIND, "bpf");
+ opts_offset = nl_msg_start_nested(&request, TCA_OPTIONS);
+ nl_msg_put_u32(&request, TCA_BPF_FLAGS, TCA_BPF_FLAG_ACT_DIRECT);
+ nl_msg_put_u32(&request, TCA_BPF_FD, fd);
+ nl_msg_put_string(&request, TCA_BPF_NAME, name);
+ nl_msg_end_nested(&request, opts_offset);
+
+ error = tc_transact(&request, NULL);
+ if (error) {
+ return error;
+ }
+
+ return 0;
+}
+
+/* Attempts to set a BPF filter on the device. Returns 0 if successful,
+ * otherwise a positive errno value. */
+static int
+netdev_vport_set_filter__(struct netdev *netdev_, const struct bpf_prog *prog,
+ unsigned int valid_bit OVS_UNUSED,
+ int *filter_error OVS_UNUSED,
+ uint32_t *netdev_filter OVS_UNUSED)
+{
+ struct netdev_vport *netdev OVS_UNUSED = netdev_vport_cast(netdev_);
+ const char *netdev_name = netdev_get_name(netdev_);
+ int error;
+
+ if (!prog) {
+ return 0;
+ }
+
+ VLOG_DBG("Setting %s filter %d on %s (handle %08"PRIx32")", prog->name,
+ prog->fd, netdev_name, prog->handle);
+
+ error = clsact_setup_qdisc(netdev_);
+ if (error && error != EEXIST) {
+ VLOG_WARN("%s: clsact qdisc setup failed: %s",
+ netdev_name, ovs_strerror(error));
+ goto out;
+ }
+
+ error = tc_add_filter(netdev_, prog->fd, prog->handle, prog->name);
+ if (error) {
+ VLOG_WARN_RL(&rl, "%s: adding filter %s failed: %s",
+ netdev_name, prog->name, ovs_strerror(error));
+ goto out;
+ }
+
+out:
+ VLOG_INFO("%s %d", __func__, error);
+ return error;
+}
+
+static int
+netdev_vport_set_filter(struct netdev *netdev_, const struct bpf_prog *prog)
+{
+ struct netdev_vport *netdev = netdev_vport_cast(netdev_);
+ int error = 0;
+
+ ovs_mutex_lock(&netdev->mutex);
+ if (!prog || prog->handle == INGRESS_HANDLE) {
+ error = netdev_vport_set_filter__(netdev_, prog, 0, NULL, NULL);
+ }
+ ovs_mutex_unlock(&netdev->mutex);
+
+ VLOG_INFO("%s %d", __func__, error);
+
+ return error;
+}
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, uint32_t flags);
+
+static int
+netdev_vport_set_xdp(struct netdev *netdev_, const struct bpf_prog *prog)
+{
+ struct netdev_vport *netdev = netdev_vport_cast(netdev_);
+ int error = 0;
+ int ifindex;
+
+ ovs_mutex_lock(&netdev->mutex);
+ ifindex = netdev_vport_get_ifindex(netdev_);
+ error = -bpf_set_link_xdp_fd(ifindex, prog->fd,
+ XDP_FLAGS_SKB_MODE);
+ ovs_mutex_unlock(&netdev->mutex);
+
+ VLOG_INFO("%s %d", __func__, error);
+
+ return error;
+}
+
#define NETDEV_VPORT_GET_IFINDEX netdev_vport_get_ifindex
#define NETDEV_FLOW_OFFLOAD_API LINUX_FLOW_OFFLOAD_API
#else /* !__linux__ */
@@ -914,6 +1055,8 @@ netdev_vport_get_ifindex(const struct netdev *netdev_)
get_pt_mode, \
\
NULL, /* set_policing */ \
+ netdev_vport_set_filter, /* set_filter */ \
+ netdev_vport_set_xdp, /* set_xdp */ \
NULL, /* get_qos_types */ \
NULL, /* get_qos_capabilities */ \
NULL, /* get_qos */ \
@@ -972,7 +1115,7 @@ netdev_vport_tunnel_register(void)
TUNNEL_CLASS("gre", "gre_sys", netdev_gre_build_header,
netdev_gre_push_header,
netdev_gre_pop_header,
- NULL),
+ NETDEV_VPORT_GET_IFINDEX),
TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
netdev_tnl_push_udp_header,
netdev_vxlan_pop_header,
diff --git a/lib/netdev.c b/lib/netdev.c
index be05dc64024a..c44a1a683b92 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -759,6 +759,13 @@ netdev_get_pt_mode(const struct netdev *netdev)
: NETDEV_PT_LEGACY_L2);
}

+/* Returns a 32-bit hash of the given port number. */
+uint32_t
+netdev_hash_port_no(odp_port_t port_no)
+{
+ return hash_int(odp_to_u32(port_no), 0);
+}
+
/* Sends 'batch' on 'netdev'. Returns 0 if successful (for every packet),
* otherwise a positive errno value. Returns EAGAIN without blocking if
 * at least one of the packets cannot be queued immediately. Returns EMSGSIZE
@@ -1449,6 +1456,24 @@ netdev_set_policing(struct netdev *netdev, uint32_t kbits_rate,
: EOPNOTSUPP);
}

+/* Attempts to apply (e)BPF filter 'prog' to the netdev. */
+int
+netdev_set_filter(struct netdev *netdev, struct bpf_prog *prog)
+{
+ return (netdev->netdev_class->set_filter
+ ? netdev->netdev_class->set_filter(netdev, prog)
+ : EOPNOTSUPP);
+}
+
+/* Attempts to attach XDP program 'prog' to the netdev. */
+int
+netdev_set_xdp(struct netdev *netdev, struct bpf_prog *prog)
+{
+ return (netdev->netdev_class->set_xdp
+ ? netdev->netdev_class->set_xdp(netdev, prog)
+ : EOPNOTSUPP);
+}
+
/* Adds to 'types' all of the forms of QoS supported by 'netdev', or leaves it
* empty if 'netdev' does not support QoS. Any names added to 'types' should
* be documented as valid for the "type" column in the "QoS" table in
diff --git a/lib/netdev.h b/lib/netdev.h
index ff1b604b24e2..3388504d85c9 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -59,6 +59,7 @@ extern "C" {
* netdev and access each of those from a different thread.)
*/

+struct bpf_prog;
struct dp_packet_batch;
struct dp_packet;
struct netdev_class;
@@ -167,6 +168,7 @@ bool netdev_mtu_is_user_config(struct netdev *);
int netdev_get_ifindex(const struct netdev *);
int netdev_set_tx_multiq(struct netdev *, unsigned int n_txq);
enum netdev_pt_mode netdev_get_pt_mode(const struct netdev *);
+uint32_t netdev_hash_port_no(odp_port_t port_no);

/* Packet reception. */
int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
@@ -316,6 +318,8 @@ struct netdev_queue_stats {

int netdev_set_policing(struct netdev *, uint32_t kbits_rate,
uint32_t kbits_burst);
+int netdev_set_filter(struct netdev *netdev, struct bpf_prog *prog);
+int netdev_set_xdp(struct netdev *netdev, struct bpf_prog *prog);

int netdev_get_qos_types(const struct netdev *, struct sset *types);
int netdev_get_qos_capabilities(const struct netdev *,
--
2.7.4
