
bpf verification fails when accessing an array element

William Tu
 

Hi,

I'm new to BPF and I'm trying to implement the following logic, but it fails with "R0 invalid mem access 'inv'". (Apologies if this is not the right mailing list.)

I have an integer array of size 32 as the value of a BPF hash map. I save the array index in skb->cb[0], so that another BPF program can tail_call this one with a different index value. I've checked that the index is within range, and I can't understand why this fails. Any comments are appreciated!

--- bpf code ---
struct actions {
    int action[32];
};
struct bpf_map_def SEC("maps") test_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(uint32_t),
    .value_size = sizeof(struct actions),
    .max_entries = 1024,
};

SEC("socket2")
int bpf_prog2(struct __sk_buff *skb)
{
    u32 key = 0;
    int v = 0;
    char fmt[] = "%d\n";
    uint32_t index = 0;
    struct actions *acts;
    acts = bpf_map_lookup_elem(&test_map, &key);   
    if (!acts)
        return 0; 

    index = skb->cb[0];
    if (index >= 32)
        return 0;
   
    v = acts->action[index];
    bpf_trace_printk(fmt, sizeof(fmt), v);
    return 0;
}

--- error log ---
bpf_prog_load() err=13
0: (bf) r6 = r1
1: (b7) r1 = 0
2: (63) *(u32 *)(r10 -4) = r1
3: (b7) r1 = 680997
4: (63) *(u32 *)(r10 -8) = r1
5: (bf) r2 = r10
6: (07) r2 += -4
7: (18) r1 = 0x1c10c5a0
9: (85) call 1
10: (15) if r0 == 0x0 goto pc+9
 R0=map_value(ks=4,vs=128) R6=ctx R10=fp
11: (61) r1 = *(u32 *)(r6 +56)
12: (25) if r1 > 0x1f goto pc+7
 R0=map_value(ks=4,vs=128) R1=inv R6=ctx R10=fp
13: (67) r1 <<= 2
14: (0f) r0 += r1
15: (61) r3 = *(u32 *)(r0 +0)
R0 invalid mem access 'inv'

BTW, if I change it to use a switch-case, basically listing all 32 cases, then the program passes.

I wonder if I'm doing something wrong, or if BPF fundamentally does not allow us to do this?
Thank you~

William


Re: bpf verification fails when accessing an array element

Alexei Starovoitov
 

On Mon, Apr 18, 2016 at 12:47 PM, William Tu via iovisor-dev
<iovisor-dev@...> wrote:
Currently there is such a limitation, since in the following:
     index = skb->cb[0];
     if (index >= 32)
         return 0;
     v = acts->action[index];
the verifier couldn't recognize that the 'index' variable is actually capped
to a valid range.
It is possible to address, though.
I'm actually working on something similar to make packet access
work with direct loads.
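
For readers hitting the same error: the usual trick is to make the bound explicit with a constant power-of-two mask right before the access. Verifiers with value-range tracking accept this; on the kernel discussed in this thread it may still be rejected, in which case the fully unrolled switch mentioned above remains the reliable fallback. A minimal sketch of the masking variant, reusing the program above:

    index = skb->cb[0];
    index &= 31;               /* bound is explicit in the register now;
                                  older verifiers may still require the
                                  unrolled switch instead */
    v = acts->action[index];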


Re: bpf verification fails when accessing an array element

William Tu
 

Hi Alexei,

I see, thanks a lot!
I look forward to your work.

Regards,
William



MPLS helper function in eBPF

William Tu
 

Hi,

I saw the VLAN push/pop helper functions in BPF and was wondering if there are any MPLS-related helper functions available?

Looking at sockex3_kern.c, PROG(PARSE_MPLS) parses the MPLS header. But to implement mpls_push/mpls_pop, are we able to use the load/store BPF helper functions to add/delete an MPLS header? That seems insufficient, because some skb metadata also needs updating, such as skb->len and the skb headroom.
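
For reference, a minimal sketch of the 32-bit MPLS label stack entry (RFC 3032) that a parser, or a hypothetical mpls_push/mpls_pop helper, would have to build or strip; the struct and macro names below are illustrative, not an existing kernel or BPF API:

    struct mpls_entry {             /* one 32-bit label stack entry */
        __u32 entry;                /* network byte order on the wire */
    };

    /* Field layout: label(20) | TC(3) | bottom-of-stack(1) | TTL(8).
     * The extraction below assumes the entry has already been converted
     * to host order, e.g. with bpf_ntohl(). */
    #define MPLS_LABEL(e)   (((e) >> 12) & 0xfffff)
    #define MPLS_TC(e)      (((e) >> 9) & 0x7)
    #define MPLS_BOS(e)     (((e) >> 8) & 0x1)
    #define MPLS_TTL(e)     ((e) & 0xff)

A push would additionally have to grow the headroom and fix up skb->len and skb->protocol, which is exactly the part plain load/store helpers cannot do.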

Thank you
William


Re: MPLS helper function in eBPF

Alexei Starovoitov
 

On Sat, Apr 23, 2016 at 9:45 AM, William Tu via iovisor-dev
<iovisor-dev@...> wrote:
Hi,

I saw the VLAN push/pop helper functions in BPF and was wondering if there are any
MPLS-related helper functions available?

Looking at sockex3_kern.c, PROG(PARSE_MPLS) parses the MPLS header. But
to implement mpls_push/mpls_pop, are we able to use the load/store BPF helper
functions to add/delete an MPLS header? That seems insufficient, because some
skb metadata also needs updating, such as skb->len and the skb headroom.

Yes, there are no mpls_push/pop helpers yet. If needed, they will be done as
helpers, since they change the skb size and mess with metadata.
What's the use case?


Re: MPLS helper function in eBPF

William Tu
 

Hi Alexei,

I'm just starting to experiment with BPF, exploring what it can do, and don't have a specific use case at this point. Thanks for your reply!

Regards,
William



Matching key inserted by python

Rudi Floren
 

Hello guys,

I already had an issue resolved by one of you on GitHub.
I continued working on my project and came across a problem.

Basically, I insert a bunch of domain names via Python into my hash table, and in my BPF program I want to check whether the name in the incoming packet is in that hash table.

For filling my table I use this little Python snippet:
import struct
from ctypes import c_ubyte

def encode_dns(name):
  size = 32
  if len(name) > 253:
    raise Exception("DNS Name too long.")
  b = bytearray()
  elements = name.split(".")
  for element in elements:
    b.extend(struct.pack("!B", len(element)))  # length byte of each label
    b.extend(element.encode())                 # label bytes

  # zero-pad up to the fixed key size
  for i in range(len(b), size):
    b.append(0)
  return (c_ubyte * size).from_buffer(b)

cache = bpf.get_table("cache")
key = cache.Key()
key.p = encode_dns("foo.bar")
cache[key] = leaf


When dumping the keys of the hash table with hexdump, they seem to be right (DNS-encoding wise).

But my BPF program doesn't match the key.
It is basically the same as the fixed one in my GitHub issue (#446):
struct Key key = {};
...
if (cursor == sentinel) goto end; c = cursor_advance(cursor, 1); key.p[i++] = c->c; //repeating
...
struct Leaf *lookup_leaf = cache.lookup(&key);
if (lookup_leaf) {
    bpf_trace_printk("yes im in cache");
}
...
I tried to bpf_trace_printk my key to verify that it is the same as the one in the cache, but didn't succeed.

Does anyone have an idea what I am doing wrong?
Full code is available in this gist: https://gist.github.com/valkum/0d70028b864b89639b4c0f6616612463

Thanks,

Rudi


Re: Matching key inserted by python

Rudi Floren
 

Just a short update: I finally managed to get it working. The sentinel was off by 4 bytes, as the DNS question has the two fields qtype and qclass at the end.
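
In other words, the end-of-name sentinel has to sit 4 bytes before the end of the question section, since qtype and qclass follow the encoded name. A minimal C sketch of that adjustment (the names below are illustrative, not taken from the actual gist):

    /* The DNS question section ends with two 16-bit fields after the
     * encoded name. */
    struct dns_question_tail {
        unsigned short qtype;       /* e.g. A = 1 */
        unsigned short qclass;      /* e.g. IN = 1 */
    };

    /* end_of_question is assumed to point just past the question section;
     * the sentinel for the name-copy loop sits 4 bytes earlier. */
    void *sentinel = (char *)end_of_question - sizeof(struct dns_question_tail);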


Thanks,
Rudi




Re: Matching key inserted by python

Alexei Starovoitov
 

On Tue, Apr 26, 2016 at 8:51 AM, Rudi Floren via iovisor-dev
<iovisor-dev@...> wrote:
Just a short update.
I managed to get it working finally. The sentinel was off by 4 bytes, as
the dns question has the 2 fields qtype and qclass at the end.

Good to know.
If you don't mind, could you push your working script to examples/networking/,
so that when we implement 'bounded loop' instructions your use case is accounted for.

thanks



Re: Matching key inserted by python

Rudi Floren
 

Sure, I have to ask my advisor, but I think there is no problem.



reschedule: IO Visor TSC and Dev Members Call

Brenden Blanco <bblanco@...>
 

Hi All,

This week is a conference week for me, and I know a few others are busy, so I think we should delay this week's call. Let's reschedule for next week, May 4.

Thanks,
Brenden


reminder: IO Visor TSC and Dev Members Call

Brenden Blanco <bblanco@...>
 

Hi All,

Please join us for our (rescheduled) bi-weekly call. This meeting is open to everybody and completely optional.

This week, we'll primarily cover developer topics:

 - Tracing (USDT)
 - XDP
 - Kernel


IOVisor TSC/Dev Meeting
Wednesday, May 4, 2016
11:00 am  |  Pacific Daylight Time (San Francisco, GMT-07:00)  |  1 hr
 
Join WebEx meeting
Meeting number: 282 427 272
Meeting password: iovisor
 
Join by phone
+1-415-655-0003 US TOLL
Access code: 282 427 272
Global call-in numbers
 
Add this meeting to your calendar. (Cannot add from mobile devices.)
 


The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer
 

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component for allowing fast forwarding with
XDP, at the packet-page level, across devices.

I want your input on how you imagine XDP/eBPF forwarding would work.
I could imagine (see the sketch below):
1) eBPF returns an ifindex it wants to forward to,
2) look up whether the netdevice supports the new NDO for XDP-page-fwd,
3A) if so, call XDP-page-fwd with the packet-page,
3B) if not, construct an SKB and xmit directly on the device,
4) (for both cases above) later, at TX-DMA completion, return the page to the page-pool.
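
A rough pseudo-code sketch of that flow, just to make the steps concrete; xdp_page_fwd_supported() and ndo_xdp_page_fwd() are hypothetical names, nothing like them exists yet:

    void xdp_forward_page(struct net_device *dev, struct page *pkt_page)
    {
            if (xdp_page_fwd_supported(dev)) {
                    /* 3A: device accepts the raw packet-page directly */
                    ndo_xdp_page_fwd(dev, pkt_page);
            } else {
                    /* 3B: fall back to an SKB and the normal xmit path */
                    struct sk_buff *skb = build_skb(page_address(pkt_page),
                                                    PAGE_SIZE);
                    skb->dev = dev;
                    dev_queue_xmit(skb);
            }
            /* 4: in both cases the page returns to the page-pool at
             *    TX-DMA completion time */
    }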

Below I propose that we use XDP eBPF in a new fashion, based on the
state of the feedback loop that the page-pool offers. With eBPF hooks
at both RX and page-return time, we can implement a super powerful DDoS
protection mechanism. Does it make sense?

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Designing the page-pool
=======================
:Version: 0.1.1
:Authors: Jesper Dangaard Brouer

Introduction
============

Motivation for page recycling is primarily performance related (due to
network stack use-cases), as bottlenecks exist in both the page
allocator and DMA APIs.

It was agreed at MM-summit 2016 that the ideas developed to speed up
the per-CPU caching layer should be integrated into the page
allocator where possible, and that we should try to share data
structures. (https://lwn.net/Articles/684616/)

A per-device page-pool still has merits, as it can:
1) solve the DMA API (IOMMU) overhead issue, which
2) indirectly makes pages writable by the network stack,
3) provide a feedback-loop at the device level.

See the MM-summit presentation for the DMA use-case and why
higher-order pages are considered problematic.

The MM-summit 2016 presentation is available here:
* http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf


XDP (eXpress Data Path)
-----------------------

The page-pool is a component that XDP needs in order to perform packet
forwarding at the page level.

* https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf
* http://lwn.net/Articles/682538/


Avoid NUMA problems, return to same CPU
=======================================

A classical problem, especially for NUMA systems, is that in an
asymmetric workload memory can be allocated on one CPU but freed on
a remote CPU (worst-case on a remote NUMA node). Thus, CPU-"local"
recycling on free is problematic.

Upfront, our design solves this issue by requiring that pages are recycled
back to the originating CPU. (This feature is also beneficial for the
feedback-loop and associated accounting.)


Reduce page (buddy) fragmentation
=================================

Another benefit of a page-pool layer on top of a driver, which can
maintain a steady-state working-set, is that the page allocator has
less chance of getting fragmented.


Feedback loop
=============

With drivers' current approach (calling the page allocator directly),
the number of pages a driver can hand out is unbounded.

The page-pool provides a feedback-loop facility at the device level.

A classical problem is that a single device can take up an unfairly
large portion of the shared memory resources if, e.g., an application
(or guest VM) does not free the resources fast enough, negatively
impacting the entire system and possibly leading to
Out-Of-Memory (OOM) conditions.

The protection mechanism the page-pool can provide (at the device
level) MUST NOT be seen as a congestion-control mechanism. It should
be seen as a "circuit-breaker": a last-resort facility to protect other
parts of the system.

Congestion-control-aware traffic usually handles the situation (and
adjusts its rate to stabilize the network). Thus, a circuit-breaker
must allow sufficient time for congestion-control-aware traffic to
stabilize.

The situations that are relevant for the circuit-breaker are
excessive and persistent non-congestion-controlled traffic that
affects other parts of the system.

Drop policy
-----------

When the circuit-breaker is in effect (e.g. dropping all packets and
recycling the page directly), an XDP/eBPF hook could decide to
change the drop verdict.

With the XDP hook in place, it is possible to implement arbitrary
drop policies. If the XDP hook gets the RX HW-hash, then it can
implement flow-based policies without touching packet data.
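
A minimal sketch of such a flow-based policy, in the style of the kernel samples; the hook context and its rx_hash field are assumptions (no such hook exists yet), the verdict macros and the per-flow budget are purely illustrative, and only the map definition and the bpf_map_*() helpers are existing API:

    struct bpf_map_def SEC("maps") flow_cnt = {
            .type        = BPF_MAP_TYPE_HASH,
            .key_size    = sizeof(__u32),    /* RX HW-hash */
            .value_size  = sizeof(__u64),    /* packets seen */
            .max_entries = 65536,
    };

    SEC("xdp_drop_policy")
    int drop_policy(struct xdp_hook_ctx *ctx)  /* hypothetical hook context */
    {
            __u32 hash = ctx->rx_hash;         /* assumed to carry the HW-hash */
            __u64 one = 1, *cnt;

            cnt = bpf_map_lookup_elem(&flow_cnt, &hash);
            if (!cnt) {
                    bpf_map_update_elem(&flow_cnt, &hash, &one, BPF_ANY);
                    return POLICY_PASS;        /* illustrative verdicts */
            }
            if (++(*cnt) > FLOW_PKT_BUDGET)    /* illustrative per-flow budget */
                    return POLICY_DROP;
            return POLICY_PASS;
    }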


Detecting driver overload
-------------------------

It might be difficult to determine when the circuit-breaker should
kick in, based on an excessive working-set size of pages.

But at the driver level it is easy to detect when the system is
overloaded to such an extent that it cannot process packets
fast enough. This is simply indicated by the driver not being able to
empty the RX queue fast enough, so the HW is dropping RX packets (FIFO taildrop).

This indication could be passed to an XDP hook, which can implement a
drop policy. Filtering packets at this level can likely restore
normal system operation, building on the principle of spending as few
CPU cycles as possible on packets that would need to be dropped anyhow
(by a deeper layer).

It is important to realize that dropping at the XDP driver level is
extremely efficient. Experiments show that the filter capacity of an
XDP filter is 14.8 Mpps (with DDIO, touching the packet and updating an eBPF
map), while iptables-raw is 6 Mpps, and hitting the socket limit is around
0.7 Mpps. Thus, an attacker can actually consume significant CPU
resources by simply sending UDP packets to a closed port.


Performance vs feedback-loop accounting
---------------------------------------

For performance reasons, the accounting should be kept in per-CPU
structures.

For NIC drivers it actually makes sense to keep accounting 100% per
CPU. In essence, we would like the circuit-breaker to kick in per RX
HW queue, as that would allow traffic on the remaining RX queues to keep flowing.

RX queues are usually bound to a specific CPU to avoid packet
reordering (and NIC RSS hashing tries to keep flows per RX queue).
Thus, keeping page recycling and stats in per-CPU structures basically
achieves the same as binding a page-pool per RX queue.

If the RX queue SMP affinity changes at runtime, it does not matter. An
RX ring-queue can contain pages "belonging" to another CPU, but that
is fine, as eventually they will be returned to the owning
CPU.


It would also be possible to keep a more central state for a
page-pool, because the number of pages it manages only changes when
(re)filling from or returning pages to the page allocator, which should
be an infrequent event. I would prefer not to.


Determining steady-state working-set
------------------------------------

For optimal performance and to minimize memory usage, the page-pool
should only maintain the number of pages required for the steady-state
working-set.

The size of the steady-state working-set will vary depending on the
workload. E.g. in a forwarding workload it will be fairly small, while
for a TCP (local)host delivery workload it will be bigger. Thus,
the steady-state working-set should be dynamically determined.

Steady state can be detected by observing that no (re)filling has
occurred for a while and that the number of "free" pages
in the pool is not excessive.

Idea: could we track the number of page-pool recycle allocs and frees
within N jiffies, and if the numbers (rates) are approximately the same,
record the number of outstanding pages as the steady-state number? (This
could be implemented as a single signed counter, reset every N jiffies,
incremented/decremented on alloc/free; approaching zero at the reset
point == stable. See the sketch below.)
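
A minimal sketch of that signed counter in pseudo-kernel-code; the struct, PP_SLACK, and PP_INTERVAL names are illustrative tunables, only jiffies/time_after() are existing kernel API:

    struct pp_steady {
            long          delta;        /* +1 on alloc, -1 on free; reset each interval */
            unsigned int  outstanding;  /* pages currently handed out */
            unsigned int  steady_state; /* last recorded estimate */
            unsigned long next_reset;   /* jiffies of next evaluation point */
    };

    static void pp_account(struct pp_steady *s, bool alloc)
    {
            s->delta       += alloc ? 1 : -1;
            s->outstanding += alloc ? 1 : -1;

            if (time_after(jiffies, s->next_reset)) {
                    /* alloc and free rates roughly balanced over the interval:
                     * record outstanding pages as the steady-state estimate */
                    if (s->delta > -PP_SLACK && s->delta < PP_SLACK)
                            s->steady_state = s->outstanding;
                    s->delta      = 0;
                    s->next_reset = jiffies + PP_INTERVAL;
            }
    }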


If the RX rate is bigger than the TX/consumption rate, queueing theory
says a queue will form. While the queue builds (somewhere outside of our
control), the page-pool needs to request more and more pages from the
page allocator. The number of outstanding pages, as seen from
the page-pool, increases proportionally to the queue in the system.

This RX > TX situation is an overload situation. Congestion-control-aware
traffic will self-stabilize. Assuming we are dealing with
non-congestion-controlled traffic, several different scenarios exist:

1. (good-queue) Overload only exists for a short period of time, like a
   traffic burst. This is "good-queue", where we absorb bursts.

2. (bad-queue) The situation persists, but some other limit is hit and
   packets get dropped, like a qdisc limit on forwarding or a local
   socket limit. This could be interpreted as a "steady-state", as
   page-recycling reaches a certain level, and maybe it should be?

3. (OOM) The situation persists and no natural resource limit is hit.
   Eventually the system runs dry of memory pages and OOMs. This situation
   should be caught by our circuit-breaker mechanism before OOM.

4. For forwarding, the whole code path from RX to TX takes longer than
   the packet arrival rate allows. Drops happen at the HW level by
   overflowing the RX queue (as it is not emptied fast enough). This is
   possible to detect inside the driver, and we could start an eBPF
   program to filter?

After an overload situation, RX decreases (or stops), so RX < TX
(likely for a short period of time). Then we have the opportunity to
return/free objects/pages back to the page allocator.

Q: How quickly should we do so (return pages)?
Q: How much slack to handle bursts?
Q: Is "steady-state" number of pages an absolute limit?


XDP pool return hook
--------------------

What about allowing an eBPF hook at the page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).

The XDP eBPF hook can maintain its own internal data structures to
track pages.

We could save the RX HW hash (maybe in struct page); then eBPF could
implement flow metering without touching packet data.

The eBPF program can even do its own timestamping on RX and compare it at
the pool "return" point, essentially implementing a CoDel-like scheme
measuring "time spent in the network stack". (For this to make sense, it
would likely need to group by RX-HW-hash, as multiple paths through the
netstack exist, so it cannot be viewed as a FIFO. See the sketch below.)
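
A minimal sketch of the timestamping idea, assuming both hook points exist and are handed the RX HW-hash (neither hook is an existing API); bpf_ktime_get_ns() and the map helpers are the only real pieces:

    struct bpf_map_def SEC("maps") rx_tstamp = {
            .type        = BPF_MAP_TYPE_HASH,
            .key_size    = sizeof(__u32),    /* RX HW-hash */
            .value_size  = sizeof(__u64),    /* ns timestamp at RX */
            .max_entries = 65536,
    };

    /* at the RX hook: remember when this flow's packet entered the stack */
    static __always_inline void record_rx(__u32 hash)
    {
            __u64 now = bpf_ktime_get_ns();
            bpf_map_update_elem(&rx_tstamp, &hash, &now, BPF_ANY);
    }

    /* at the page-pool "return" hook: time spent in the network stack */
    static __always_inline __u64 sojourn_ns(__u32 hash)
    {
            __u64 *then = bpf_map_lookup_elem(&rx_tstamp, &hash);
            return then ? bpf_ktime_get_ns() - *then : 0;
    }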


Conclusion
----------

The resource limitation/protection feature offered by the page-pool
is primarily a circuit-breaker facility for protecting other parts of
the system. Combined with an XDP/eBPF hook, it offers powerful and
more fine-grained control.

It requires more work and research if we want to react
"earlier", e.g. before the circuit-breaker kicks in. Here one should
be careful not to interfere with congestion-aware traffic, by giving
it sufficient time to react.

At the driver level it is also possible to detect if the system is not
processing RX packets fast enough. This is not an inherent feature of
the page-pool, but it would be useful input for an eBPF filter.

For the XDP/eBPF hook, this means that it should take a "signal" as
input describing the machine's current operating state.

Considering the states:
* State "circuit-breaker" - eBPF can choose to approve packets, else the stack drops them
* State "RX-overload" - eBPF can choose to drop packets to restore normal operation

Relating to page allocator
==========================

The current page allocator has a per-CPU caching layer for
order-0 pages, called PCP (per-CPU pages)::

    struct per_cpu_pages {
        int count;  /* number of pages in the list */
        int high;   /* high watermark, emptying needed */
        int batch;  /* chunk size for buddy add/remove */

        /* Lists of pages, one per migrate type stored on the pcp-lists */
        struct list_head lists[MIGRATE_PCPTYPES];
    };

The "high" watermark can be compared to the (dynamic) steady-state
number; it determines how many cached (order-0) pages are kept
before they are returned to the page allocator.

For PCP, once the "high" watermark is hit, "batch" pages are
returned. (Using a batch removes the pathological case of a
two-object working-set being recycled on the "edge" of
the "high" watermark, causing too much interaction with the page
allocator.)

On my 8-core (i7-4790K CPU @ 4.00GHz) machine with 16GB RAM, the values
for PCP are high=186 and batch=31 (note 31*6 = 186). These
settings are likely not optimal for networking, as e.g. TX DMA
completion is by default allowed to free up to 256 pages.

The question is whether the PCP "high" watermark could be
dynamically determined by the same method proposed for
determining the steady-state criteria.


Background material
===================

Circuit Breaker
---------------

Quotes from:

.. _RFC-Circuit-Breaker:
https://tools.ietf.org/html/draft-ietf-tsvwg-circuit-breaker-14

RFC-Circuit-Breaker_ ::

[...] non-congestion-controlled traffic, including many applications
using the User Datagram Protocol (UDP), can form a significant
proportion of the total traffic traversing a link. The current
Internet therefore requires that non-congestion-controlled traffic is
considered to avoid persistent excessive congestion


RFC-Circuit-Breaker_ ::

This longer period is needed to provide sufficient time for transport
congestion control (or applications) to adjust their rate following
congestion, and for the network load to stabilize after any
adjustment.


RFC-Circuit-Breaker_ ::

In contrast, Circuit Breakers are recommended for non-congestion-
controlled Internet flows and for traffic aggregates, e.g., traffic
sent using a network tunnel. They operate on timescales much longer
than the packet RTT, and trigger under situations of abnormal
(excessive) congestion.


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component for allowing fast forwarding with
XDP, at the packet-page level, across devices.

I want your input on how you imagine XDP/eBPF forwarding would work.
I could imagine,
1) eBPF returns an ifindex it wants to forward to,
2) look up whether the netdevice supports the new NDO for XDP-page-fwd,
3A) if so, call XDP-page-fwd with the packet-page,
3B) if not, construct an SKB and xmit directly on the device,
4) (for both cases above) later, at TX-DMA completion, return the page to the page-pool.

Below I propose that we use XDP eBPF in a new fashion, based on the
state of the feedback loop that the page-pool offers. With eBPF hooks
at both RX and page-return time, we can implement a super powerful DDoS
protection mechanism. Does it make sense?

Mostly ;-). I like the idea of returning an index from eBPF which
basically just gives a queue to transmit on. Presumably, each receive
queue would have its own XDP transmit queue that it can use locklessly.
Also, I think it is reasonable that we could cross devices but within
the _same_ driver (like supporting forwarding between two Mellanox
NICs). In that case each RX queue has one dedicated XDP TX queue for
each device.

For forwarding on a non-XDP queue (like 3B, or crossing different types
of devices) I don't think we should do anything special. Just pass the
packet to the stack like in the olden days and use the stack
forwarding path. This is obviously the slow path, but not really very
interesting to optimize in XDP.

One thing I am not sure how to deal with is flow control, i.e. if the
transmit queue is being blocked, who should do the drop? Preferably,
we'd want to know the queue occupancy in BPF to do an intelligent
drop (some crude fq-codel or the like?)

Tom



Re: The page-pool as a component for XDP forwarding

Jesper Dangaard Brouer
 

On Wed, 4 May 2016 09:52:08 -0700
Tom Herbert <tom@...> wrote:

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component for allowing fast forwarding with
XDP, at the packet-page level, across devices.

I want your input on how you imagine XDP/eBPF forwarding would work.
I could imagine,
1) eBPF returns an ifindex it wants to forward to,
2) look up whether the netdevice supports the new NDO for XDP-page-fwd,
3A) if so, call XDP-page-fwd with the packet-page,
3B) if not, construct an SKB and xmit directly on the device,
4) (for both cases above) later, at TX-DMA completion, return the page to the page-pool.

Below I propose that we use XDP eBPF in a new fashion, based on the
state of the feedback loop that the page-pool offers. With eBPF hooks
at both RX and page-return time, we can implement a super powerful DDoS
protection mechanism. Does it make sense?

Mostly ;-). I like the idea of returning an index from eBPF which
basically just gives a queue to transmit on. Presumably, each receive
queue would have its own XDP transmit queue that it can use locklessly.
Also, I think it is reasonable that we could cross devices but within
the _same_ driver (like supporting forwarding between two Mellanox
NICs). In that case each RX queue has one dedicated XDP TX queue for
each device.

I'm not sure how you can get lockless TX with only one XDP-TX queue.

Remember we have to build in bulk TX from day one. Why? The TX tailptr
write costs in the area of 100 ns, which is too expensive for sending
single TX frames.


For forwarding on a non-XDP queue (like 3B, or crossing different types
of devices) I don't think we should do anything special. Just pass the
packet to the stack like in the olden days and use the stack
forwarding path. This is obviously the slow path, but not really very
interesting to optimize in XDP.

Yes for 3B.
For 3A I want to support crossing device drivers.

One thing I am not sure how to deal with is flow control, i.e. if the
transmit queue is being blocked, who should do the drop? Preferably,
we'd want to know the queue occupancy in BPF to do an intelligent
drop (some crude fq-codel or the like?)

Flow control or push-back is an interesting problem to solve.


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Designing the page-pool
=======================
:Version: 0.1.1
:Authors: Jesper Dangaard Brouer

Introduction
============

Motivation for page recycling is primarily performance related (due to
network stack use-cases), as bottlenecks exist in both the page
allocator and DMA APIs.

It was agreed at MM-summit 2016, that the ideas developed to speedup
the per CPU caching layer, should be integrated into the page
allocator, where possible. And we should try to share data
structures. (https://lwn.net/Articles/684616/)

The page-pool per device, still have merits, as it can:
1) solve the DMA API (IOMMU) overhead issue, which
2) in-directly make pages writable by network-stack,
3) provide a feedback-loop at the device level

Referring to MM-summit presentation, for the DMA use-case, and why
larger order pages are considered problematic.

MM-summit 2016 presentation avail here:
* http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf


XDP (eXpress Data Path)
-----------------------

The page-pool is a component that XDP need in-order to perform packet
forwarding at the page level.

* https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf
* http://lwn.net/Articles/682538/


Avoid NUMA problems, return to same CPU
=======================================

A classical problem, especially for NUMA systems, is that in a
asymmetric workload memory can be allocated on one CPU but free'ed on
a remote CPU (worst-case a remote NUMA node). (Thus, CPU "local"
recycling on free is problematic).

Upfront, our design solves this issue, by requiring pages are recycled
back to the originating CPU. (This feature is also beneficial for the
feedback-loop and associated accounting.)


Reduce page (buddy) fragmentation
=================================

Another benefit of a page-pool layer on-top of a driver, which can
maintain a steady-state working-set, is that the page allocator have
less chances of getting fragmented.


Feedback loop
=============

With drivers current approach (of calling the page allocator directly)
the number of pages a driver can hand-out is unbounded.

The page-pool provide the ability to get a feedback loop facility, at
the device level.

A classical problem is that a single device can take up an unfair
large portion of the shared memory resources, if e.g. an application
(or guest VM) does not free the resources (fast-enough). Thus,
negatively impacting the entire system, possibly leading to
Out-Of-Memory (OOM) conditions.

The protection mechanism the page-pool can provide (at the device
level) MUST not be seen as a congestion-control mechanism. It should
be seen as a "circuit-breaker" last resort facility to protect other
parts of the system.

Congestion-control aware traffic usually handle the situation (and
adjust their rate to stabilize the network). Thus, a circuit-breaker
must allow sufficient time for congestion-control aware traffic to
stabilize.

The situations that are relevant for the circuit-breaker, are
excessive and persistent non-congestion-controlled traffic, that
affect other parts of the system.

Drop policy
-----------

When the circuit-breaker is in effect (e.g. dropping all packets and
recycling the page directly), then XDP/eBPF hook could decide to
change the drop verdict.

With the XDP hook in-place, it is possible to implement arbitrarily
drop policies. If the XDP hook, gets the RX HW-hash, then it can
implement flow based policies without touching packet data.


Detecting driver overload
-------------------------

It might be difficult to determine when the circuit-breaker should
kick-in, based on an excessive working-set size of pages.

But at the driver level, it is easy to detect when the system is
overloaded, to such an extend that it cannot process packets
fast-enough. This is simply indicated by the driver cannot empty the
RX queue fast-enough, thus HW is RX dropping packets (FIFO taildrop).

This indication could be passed to a XDP hook, which can implement a
drop policy. Filtering packets at this level can likely restore
normal system operation. Building on the principal of spending as few
CPU cycles as possible on packets that need to be dropped anyhow (by a
deeper layer).

It is important to realize that, dropping the the XDP driver level is
extremely efficient. Experiments show that, the filter capacity of
XDP filter is 14.8Mpps (DDIO touching packet and updating up an eBPF
map), while iptables-raw is 6Mpps, and hitting socket limit is around
0.7Mpps. Thus, an attacker can actually consume significant CPU
resources by simply sending UDP packets to a closed port.


Performance vs feedback-loop accounting
---------------------------------------

For performance reasons, the accounting should be kept as per CPU
structures.

For NIC drivers it actually makes sense to keep accounting 100% per
CPU. In essence, we would like the circuit-breaker to kick-in per RX
HW queue, as that would allow remaining RX queue traffic flow.

RX queues are usually bound to a specific CPU, to avoid packet
reordering (and NIC RSS hashing (try-to) keep flows per RX queue).
Thus, keeping page recycling and stats per CPU structures, basically
achieves the same as binding a page-pool per RX queue.

If RX queue SMP affinity change runtime, then it does not matter. A
RX ring-queue can contain pages "belonging" to another CPU, but it
does not matter, as eventually they will be returned to the owning
CPU.


It would be possible to also keep a more central state for a
page-pool, because the number of pages it manage only change when
(re)filling or returning pages to the page allocator, which should be
a more infrequent event. I would prefer not to.


Determining steady-state working-set
------------------------------------

For optimal performance and to minimize memory usage, the page-pool
should only maintain the number of pages required for the steady-state
working-set.

The size of the steady-state working-set will vary depending on the
workload. E.g. in a forwarding workload it will be fairly small.
E.g. for a TCP (local)host delivery workload it will be bigger. Thus,
the steady-state working-set should be dynamically determined.

Detecting steady-state by realizing, that in steady state, no
(re)filling have occurred for a while, and the number of "free" pages
in the pool is not excessive.

Idea: could we track number of page-pool recycle alloc and free's
within N x jiffies, and if the numbers (rate) are approx the same,
record number of outstanding pages as the steady-state number? (Could
be implemented as single signed counter reset every N jiffies, inc/dec
for alloc/free, approaching zero (at reset point) == stable)


If RX rate is bigger than TX/consumption rate, queue theory says a
queue will form. While the queue builds (somewhere outside out our
control), the page-pool need to request more and more pages from
page-allocator. The number of outstanding pages increase, seen from
the page-pool, proportional to the queue in the system.

This, RX > TX is an overload situation. Congestion-control aware
traffic will self stabilize. Assuming dealing with
non-congestion-controlled traffic, some different scenarios exist:

1. (good-queue) Overload only exist for a short period of time, like a
traffic burst. This is "good-queue", where we absorb bursts.

2. (bad-queue) Situation persists, but some other limit is hit, and
packets get dropped. Like qdisc limit on forwarding, or local
socket limit. This could be interpreted as a "steady-steady", as
page-recycling reach a certain level, and maybe it should?

3. (OOM) Situation persists, and no natural resource limit is hit.
Eventually system runs dry of memory pages and OOM. This situation
should be caught by our circuit-breaker mechanism, before OOM.

4. For forwarding, the hole code path from RX to TX, takes longer than
the packet arrival rate. Drops happen at HW level by overflowing
RX queue (as it is not emptied fast enough). Possible to detect
inside driver, and we could start a eBPF program to filter?

After an overload situation, when RX decrease (or stop), so RX < TX
(likely for a short period of time). Then, we have the opportunity to
return/free objects/pages back to the page-allocator.

Q: How quickly should we do so (return pages)?
Q: How much slack to handle bursts?
Q: Is "steady-state" number of pages an absolute limit?


XDP pool return hook
--------------------

What about allowing an eBPF hook at the page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).

The XDP eBPF hook can maintain its own internal data structure, to
track pages.

We could save the RX HW hash (maybe in struct page), then eBPF could
implement flow metering without touching packet data.

The eBPF prog can even do its own timestamping on RX and compare at
the pool "return" point, essentially implementing a CoDel-like scheme
measuring "time-spent-in-network-stack". (For this to make sense, it
would likely need to group by RX-HW-hash, as multiple paths through
the netstack exist, thus it cannot be viewed as a single FIFO.)
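
As a rough sketch only: assuming a hypothetical pool-return hook and
context struct (no such hook exists today; the section name, context
layout and the 10 ms threshold are made up for illustration), the
metering side could look something like::

    struct pool_return_md {             /* hypothetical hook context */
            __u32 rx_hash;              /* RX HW hash saved at receive time */
    };

    struct bpf_map_def SEC("maps") rx_tstamp = {
            .type        = BPF_MAP_TYPE_HASH,
            .key_size    = sizeof(__u32),   /* RX HW hash (flow key) */
            .value_size  = sizeof(__u64),   /* ns timestamp recorded at RX */
            .max_entries = 16384,
    };

    /* At RX time, the regular XDP program would store a timestamp:
     *   __u64 now = bpf_ktime_get_ns();
     *   bpf_map_update_elem(&rx_tstamp, &hash, &now, BPF_ANY);
     */

    SEC("pool_return")                  /* hypothetical attach point */
    int pool_return_meter(struct pool_return_md *ctx)
    {
            __u32 hash = ctx->rx_hash;
            __u64 now = bpf_ktime_get_ns();
            __u64 *then = bpf_map_lookup_elem(&rx_tstamp, &hash);

            if (!then)
                    return 0;

            if (now - *then > 10 * 1000 * 1000ULL) {
                    /* flow spent > 10 ms in the stack: CoDel-like signal,
                     * e.g. raise a drop probability for this rx_hash */
            }

            bpf_map_delete_elem(&rx_tstamp, &hash);
            return 0;
    }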


Conclusion
----------

The resource limitation/protection feature offered by the page-pool
is primarily a circuit-breaker facility for protecting other parts of
the system. Combined with an XDP/eBPF hook, it offers powerful and
more fine-grained control.

It requires more work and research if we want to react "earlier",
e.g. before the circuit-breaker kicks in. Here one should be careful
not to interfere with congestion-aware traffic, by giving it
sufficient time to react.

At the driver level it is also possible to detect if the system is not
processing RX packets fast enough. This is not an inherent feature of
the page-pool, but it would be useful input for an eBPF filter.

For the XDP/eBPF hook, this means that it should take a "signal"
describing the current operating state of the machine as input.

Considering the states:
* State:"circuit-breaker" - eBPF can choose to approve packets, else stack drop
* State:"RX-overload"     - eBPF can choose to drop packets to restore operation


Relating to page allocator
==========================

The current page allocator has a per-CPU caching layer for
order-0 pages, called PCP (per CPU pages)::

    struct per_cpu_pages {
            int count;  /* number of pages in the list */
            int high;   /* high watermark, emptying needed */
            int batch;  /* chunk size for buddy add/remove */

            /* Lists of pages, one per migrate type stored on the pcp-lists */
            struct list_head lists[MIGRATE_PCPTYPES];
    };

The "high" watermark, can be compared to (dynamic) steady-state
number, which determine how many cached (order-0) pages are kept,
before they are returned to the page allocator.

For PCP once the "high" watermark is hit, then "batch" number of
pages are returned. (Using a batch (re)moves the pathological
case of two object working-set being recycles on the "edge" of
the "high" watermark, causing too much interaction with the page
alloactor).
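
A toy illustration of that high/batch policy (this is not the actual
mm/page_alloc.c code, just the idea)::

    #define PCP_HIGH   186   /* example watermark */
    #define PCP_BATCH  31    /* example batch size */

    static int pcp_count;    /* pages currently cached on this CPU */

    static void pcp_free_one(void)
    {
            pcp_count++;
            if (pcp_count >= PCP_HIGH) {
                    /* Return a whole batch to the buddy allocator, so a
                     * working-set sitting right at the watermark does not
                     * bounce single pages back and forth. */
                    pcp_count -= PCP_BATCH;
                    /* ... hand PCP_BATCH pages back to the zone ... */
            }
    }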

On my 8 core (i7-4790K CPU @ 4.00GHz) with 16GB RAM, the values
for PCP are high=186 and batch=31 (note 31*6 = 186). These
settings are likely not optimal for networking, as e.g. TX DMA
completion is by default allowed to free up to 256 pages.

The question is whether the PCP "high" watermark could be
determined dynamically, by the same method proposed above for
determining the steady-state working-set.


Background material
===================

Circuit Breaker
---------------

Quotes from:

.. _RFC-Circuit-Breaker:
https://tools.ietf.org/html/draft-ietf-tsvwg-circuit-breaker-14

RFC-Circuit-Breaker_ ::

[...] non-congestion-controlled traffic, including many applications
using the User Datagram Protocol (UDP), can form a significant
proportion of the total traffic traversing a link. The current
Internet therefore requires that non-congestion-controlled traffic is
considered to avoid persistent excessive congestion


RFC-Circuit-Breaker_ ::

This longer period is needed to provide sufficient time for transport
congestion control (or applications) to adjust their rate following
congestion, and for the network load to stabilize after any
adjustment.


RFC-Circuit-Breaker_ ::

In contrast, Circuit Breakers are recommended for non-congestion-
controlled Internet flows and for traffic aggregates, e.g., traffic
sent using a network tunnel. They operate on timescales much longer
than the packet RTT, and trigger under situations of abnormal
(excessive) congestion.


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Wed, May 4, 2016 at 11:13 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
On Wed, 4 May 2016 09:52:08 -0700
Tom Herbert <tom@...> wrote:

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.

Below I propose that we use XDP eBPF in a new fashion, based on the
state of the feedback loop that the page-pool offer. With eBPF hooks
at both RX and page-return-time, we can implement a super powerful DDoS
protection mechanism. Does it make sense?
Mostly ;-). I like the idea of returning an index from eBPF which
basically just gives a queue to transmit. Presumably, each receive
queue would have its own XDP transmit queue that it can use locklessly.
Also, I think it is reasonable that we could cross devices but within
the _same_ driver (like supporting forwarding between two Mellanox
NICs). In that case each RX queue has one dedicated XDP TX queue for
each device.
I'm not sure how you can get lockless TX with only one XDP-TX queue.
One XDP-TX queue per receiving CPU. Also, we might have even more
queues for priority. We don't want to bake in any assumption of 1-1
relationship between RX and TX queues either, there's more benefit to
#TX >= #RX

Remember we have to build in bulk TX from day one. Why? Remember the
TX tailptr write is costing in the area of 100ns. Thus, too expensive
to send single TX frames.
That can be done in the backend driver, although would be nice if BPF
code can request flush.

For forwarding on a non XDP queue (like 3B or crossing different type
devices) I don't think we should do anything special. Just pass the
packet to the stack like in the olden days and use the stack
forwarding path. This is obviously slow path, but not really very
interesting to optimize in XDP.
Yes for 3B.
For 3A I want to support cross device driver.
Maybe we can get basic forwarding to work first ;-). From a system
design point of view mixing different types of NICs on the same server
is not very good anyway.

Tom

One thing I am not sure how to deal with is flow control, i.e. if the
transmit queue is being blocked, who should do the drop. Preferably,
we'd want to know the queue occupancy in BPF to do an intelligent
drop (some crude fq-codel or the like?)
Flow control or push-back is an interesting problem to solve.


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


Re: The page-pool as a component for XDP forwarding

Thomas Monjalon <thomas.monjalon@...>
 

2016-05-04 12:47, Tom Herbert:
On Wed, May 4, 2016 at 11:13 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
Tom Herbert <tom@...> wrote:
For forwarding on a non XDP queue (like 3B or crossing different type
devices) I don't think we should do anything special. Just pass the
packet to the stack like in the olden days and use the stack
forwarding path. This is obviously slow path, but not really very
interesting to optimize in XDP.
Yes for 3B.
For 3A I want to support cross device driver.
Maybe we can get basic forwarding to work first ;-). From a system
design point of view mixing different types of NICs on the same server
is not very good anyway.
Mixing NICs on a server is probably not common. But I wonder whether it
could allow leveraging different offload capabilities for asymmetrical
traffic?
Please could you elaborate why mixing is not very good?


Re: The page-pool as a component for XDP forwarding

Tom Herbert <tom@...>
 

On Wed, May 4, 2016 at 12:55 PM, Thomas Monjalon
<thomas.monjalon@...> wrote:
2016-05-04 12:47, Tom Herbert:
On Wed, May 4, 2016 at 11:13 AM, Jesper Dangaard Brouer
<brouer@...> wrote:
Tom Herbert <tom@...> wrote:
For forwarding on a non XDP queue (like 3B or crossing different type
devices) I don't think we should do anything special. Just pass the
packet to the stack like in the olden days and use the stack
forwarding path. This is obviously slow path, but not really very
interesting to optimize in XDP.
Yes for 3B.
For 3A I want to support cross device driver.
Maybe we can get basic forwarding to work first ;-). From a system
design point of view mixing different types of NICs on the same server
is not very good anyway.
Mixing NICs on a server is probably not common. But I wonder whether it
could allow leveraging different offload capabilities for asymmetrical
traffic?
Maybe, but it's a lot of complexity. Do you have a specific use case in mind?

Please could you elaborate why mixing is not very good?
Harder to design, test, don't see much value in it. Supporting such
things forces us to continually raise the abstraction and generalize
interfaces more and more which is exactly how we wind up with things
like 400 bytes skbuffs, locking, soft queues, etc. XDP is expressly
not meant to be a general solution, and that gives us liberty to cut
out anything that doesn't yield performance like trying to preserve a
high performance interface between two arbitrary drivers (but still
addressing the 90% case).


Re: The page-pool as a component for XDP forwarding

Alexei Starovoitov
 

On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer
<brouer@...> wrote:

I've started a separate document for designing my page-pool idea.

I see the page-pool as a component, for allowing fast forwarding with
XDP, at the packet-page level, cross device.

I want your input on how you imagine XDP/eBPF forwarding would work?
I could imagine,
1) eBPF returns an ifindex it want to forward to,
2) look if netdevice supports new NDO for XDP-page-fwd
3A) call XDP-page-fwd with packet-page,
3B) No XDP-page-fwd, then construct SKB and xmit directly on device,
4) (for both above cases) later at TX-DMA completion, return to page-pool.
I think the first step is option 0, where the program returns a
single return code 'TX' and the driver side figures out which TX queue
to use to avoid conflicts.
More sophisticated selection of ifindex and/or TX queue can be built
on top.

Avoid NUMA problems, return to same CPU
I think at this stage the numa part can be ignored.
We should assume one socket and deal with numa later,
since such things are out of bpf control and not part of
API that we need to stabilize right now.
We may have some sysctl knobs or ethtool in the future.

For performance reasons, the accounting should be kept as per CPU
structures.
In general that's absolutely correct, but by default XDP should
not have any counters. It's up to the program to keep the stats
on number of dropped packets. Thankfully per-cpu hash maps
already exist.
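
As a minimal sketch of such program-side accounting (the map and
section names are made up, and the drop verdict name XDP_DROP is
assumed here, not confirmed by this thread):

struct bpf_map_def SEC("maps") drop_cnt = {
    .type        = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size    = sizeof(u32),
    .value_size  = sizeof(u64),
    .max_entries = 1,
};

SEC("xdp_drop_counter")
int xdp_count_drops(void *ctx)
{
    u32 key = 0;
    u64 *cnt = bpf_map_lookup_elem(&drop_cnt, &key);

    if (cnt)
        (*cnt)++;   /* per-CPU slot, no atomics needed */
    return XDP_DROP;    /* assumed name for the drop verdict */
}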

XDP pool return hook
--------------------

What about allowing a eBPF hook at page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).
I think we don't have cycles to do anything sophisticated
at 'pool return' point. Something like hard limit (ethtool configurable)
on number of recycle-able pages should be good enough.

The question is, whether the PCP "high" watermark could be
dynamically determined by the same method proposed for
determining the steady-state criteria?
I think we'll try to pick the good default for most of the use cases,
but ultimately it's another knob. If program processing time
is high, the user would have to increase this knob to keep
all pages in the recycle-able pool instead of talking to
main page-allocator. Even when this knob is not optimal,
the performance will still be acceptable, since the cost
of page_alloc+mmap-s will be amortized.


minutes: IO Visor TSC and Dev Members Call

Brenden Blanco <bblanco@...>
 

Hi All,

Below, please find the meeting minutes from yesterday.

IO Visor had two sessions at OpenStack Summit in Austin last week. The
first was from Yunsong@Huawei [1], which covered how they are using IO
Visor/BPF in different form factors. The second was from Deepa and
myself, covering how we wrote IO Modules to enforce
application/container policy within CloudFoundry.

[1] https://github.com/iovisor/bpf-docs/blob/master/openstack/2016-04-25/OpenStackSummitAustin2016_iovisor_v1.0.pdf

On the kernel side, multiple updates:

Alexei
- finishing up direct packet access to improve performance
- won't need pseudo skb
- simple descriptor: *data, length
- restrictions:
  - tc: read-only access
  - xdp: read-write

Daniel
- constant blinding being worked on
- push-pop: "change" helper has been in flight for some time, tricky
- only really suitable for xdp

Jesper
- mm summit update (see other mail thread)
- optimizing page allocator making progress
- recycling facility to be used cross devices
- page pool allocator would need to be implemented by the driver to add support
- dma map/unmap fits in as a constructor/destructor of the page when
  returned from the pool to the allocator

John
- lockless qdisc coming soon
- possible lockless htb

One idea Alexei and Brenden were throwing around was to incorporate
CTF (Compact C Type Format), to be used instead of header files for
discovering kernel struct layouts. This could also be used to describe
bpf maps, enabling automatic table pretty-printing from the kernel.

We also discussed XDP ideas a bit. For now, as Alexei updated in the
page-pool thread, we should try to create a simple forward action that
takes no parameters, where the device has a simple rxq -> txq mapping.
With this action, the packet leaves the same port it came in on. Use
cases could be a load-balancer type of device in one-arm mode. More
complicated forwarding behavior could be built on top of this without
breaking any ABI.

Attendees:
Jesper Brouer
Brenden Blanco
Daniel Borkmann
Alexei Starovoitov
Prem Jonnalagadda
John Fastabend
