Re: The page-pool as a component for XDP forwarding


Jesper Dangaard Brouer
 

On Wed, 4 May 2016 12:47:23 -0700 Tom Herbert <tom@...> wrote:
On Wed, May 4, 2016 at 11:13 AM, Jesper Dangaard Brouer <brouer@...> wrote:
[...]

One thing I am not sure how to deal with is flow control, i.e. if the
transmit queue is being blocked, who should do the drop. Preferably,
we'd want to know the queue occupancy in BPF to do an intelligent
drop (some crude fq-codel or the like?)
Flow control or push-back is an interesting problem to solve.
The page-pool doc section "Feedback loop" was primarily about how the
page-pool's need to recycle pages offers a way to handle and implement
flow control.

The doc identified two states, but you just identified another, e.g. when
the TX-queue/egress is blocking/full. And yes, I also think we can handle
that situation.

From conclusion:
For the XDP/eBPF hook, this means that it should take a "signal" as
input, describing the current operating state of the machine.

Considering the states:
* State:"circuit-breaker"- eBPF can choose to approve packets, else stack drop
* State:"RX-overload" - eBPF can choose to drop packets to restore operation
New state: "TX-overload"

I'll think some more about if and how this state differs from the above
states.
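
To make the "signal" input concrete, here is a minimal sketch (the enum
and how the driver would hand it to the hook are invented for
illustration, nothing like it exists today), now including the new
TX-overload state::

  /* Hypothetical machine-state signal handed to the XDP/eBPF hook.
   * Purely illustrative; not an existing kernel interface. */
  enum xdp_pool_state {
          XDP_POOL_OK              = 0, /* normal operation                  */
          XDP_POOL_RX_OVERLOAD     = 1, /* RX queue not emptied fast enough  */
          XDP_POOL_TX_OVERLOAD     = 2, /* TX queue/egress blocking or full  */
          XDP_POOL_CIRCUIT_BREAKER = 3, /* last resort: default verdict drop */
  };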



Designing the page-pool
=======================
:Version: 0.1.1
:Authors: Jesper Dangaard Brouer
[...]

Feedback loop
=============

With the drivers' current approach (calling the page allocator directly),
the number of pages a driver can hand out is unbounded.

The page-pool provides a feedback-loop facility at the device level.

A classical problem is that a single device can take up an unfairly
large portion of the shared memory resources, e.g. if an application
(or guest VM) does not free the resources fast enough. This negatively
impacts the entire system, possibly leading to Out-Of-Memory (OOM)
conditions.

The protection mechanism the page-pool can provide (at the device
level) MUST NOT be seen as a congestion-control mechanism. It should
be seen as a "circuit-breaker", a last-resort facility to protect other
parts of the system.

Congestion-control-aware traffic usually handles the situation (by
adjusting its rate to stabilize the network). Thus, a circuit-breaker
must allow sufficient time for congestion-control-aware traffic to
stabilize.

The situations relevant for the circuit-breaker are excessive and
persistent non-congestion-controlled traffic that affects other parts
of the system.

Drop policy
-----------

When the circuit-breaker is in effect (e.g. dropping all packets and
recycling the page directly), the XDP/eBPF hook could decide to change
the drop verdict.

With the XDP hook in place, it is possible to implement arbitrary
drop policies. If the XDP hook gets the RX HW-hash, it can implement
flow-based policies without touching packet data.
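
As a rough illustration, below is a minimal eBPF sketch of such a
flow-based drop policy. The ctx->rx_hash field is an assumption (the
RX HW-hash is not part of struct xdp_md); map size and threshold are
arbitrary::

  /* Sketch of a flow-based drop policy keyed on the RX HW-hash.
   * ASSUMPTION: ctx->rx_hash does not exist in the real struct xdp_md;
   * it stands in for "the driver hands us the HW hash". */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_LRU_HASH);
          __uint(max_entries, 16384);
          __type(key, __u32);   /* RX HW-hash used as flow key */
          __type(value, __u64); /* packets seen from this flow */
  } flow_cnt SEC(".maps");

  #define FLOW_PKT_LIMIT 1000   /* arbitrary per-flow budget */

  SEC("xdp")
  int xdp_flow_drop(struct xdp_md *ctx)
  {
          __u32 hash = ctx->rx_hash;   /* hypothetical field */
          __u64 init = 1, *cnt;

          cnt = bpf_map_lookup_elem(&flow_cnt, &hash);
          if (!cnt) {
                  bpf_map_update_elem(&flow_cnt, &hash, &init, BPF_ANY);
                  return XDP_PASS;
          }
          __sync_fetch_and_add(cnt, 1);

          /* heavy-hitter flows get dropped while the breaker is active */
          return (*cnt > FLOW_PKT_LIMIT) ? XDP_DROP : XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";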


Detecting driver overload
-------------------------

It might be difficult to determine when the circuit-breaker should
kick in, based on an excessive working-set size of pages.

But at the driver level, it is easy to detect when the system is
overloaded to such an extent that it cannot process packets fast
enough. This is simply indicated by the driver not being able to empty
the RX queue fast enough, causing the HW to drop RX packets (FIFO
taildrop).

This indication could be passed to an XDP hook, which can implement a
drop policy. Filtering packets at this level can likely restore normal
system operation, building on the principle of spending as few CPU
cycles as possible on packets that need to be dropped anyhow (by a
deeper layer).
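
A hand-wavy sketch of the driver side; the mydrv_* names, the
ring->state field and the enum values are invented for illustration::

  /* Sketch (kernel-style, invented names): inside the driver's NAPI
   * poll loop, detect that the RX ring could not be emptied within the
   * budget and record an overload indication, which a later XDP
   * invocation could consume as its state "signal". */
  static int mydrv_napi_poll(struct napi_struct *napi, int budget)
  {
          struct mydrv_rx_ring *ring =
                  container_of(napi, struct mydrv_rx_ring, napi);
          int work_done = mydrv_clean_rx_ring(ring, budget);

          /* Full budget used and descriptors still pending: we are
           * falling behind, HW taildrop is likely imminent. */
          if (work_done == budget && mydrv_rx_ring_pending(ring))
                  ring->state = XDP_POOL_RX_OVERLOAD;
          else
                  ring->state = XDP_POOL_OK;

          if (work_done < budget)
                  napi_complete_done(napi, work_done);
          return work_done;
  }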

It is important to realize that dropping at the XDP driver level is
extremely efficient. Experiments show that the filter capacity of an
XDP filter is 14.8 Mpps (with DDIO touching the packet and updating an
eBPF map), while iptables-raw drops at 6 Mpps, and dropping at the
socket limit is around 0.7 Mpps. Thus, an attacker can actually consume
significant CPU resources by simply sending UDP packets to a closed
port.


Performance vs feedback-loop accounting
---------------------------------------

For performance reasons, the accounting should be kept in per-CPU
structures.

For NIC drivers it actually makes sense to keep accounting 100% per
CPU. In essence, we would like the circuit-breaker to kick in per RX
HW queue, as that would allow traffic on the remaining RX queues to
keep flowing.

RX queues are usually bound to a specific CPU, to avoid packet
reordering (and NIC RSS hashing tries to keep flows on the same RX
queue). Thus, keeping page recycling and stats in per-CPU structures
basically achieves the same as binding a page-pool per RX queue.

If the RX queue SMP affinity changes at runtime, it does not matter.
An RX ring-queue can contain pages "belonging" to another CPU, but
eventually they will be returned to the owning CPU.


It would also be possible to keep a more central state for a
page-pool, because the number of pages it manages only changes when
(re)filling from, or returning pages to, the page allocator, which
should be a more infrequent event. I would prefer not to.
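
A rough sketch of that split (struct and field names are mine, not the
actual implementation): the hot recycle path only touches per-CPU
counters, while a central counter only changes on the rarer
refill/return path::

  /* Sketch only -- not the real page-pool structures. */
  #include <linux/percpu.h>
  #include <linux/atomic.h>

  struct pp_cpu_stats {
          u64 alloc;    /* pages handed to the driver (fast path) */
          u64 recycle;  /* pages recycled back into the pool      */
  };

  struct pp_pool {
          struct pp_cpu_stats __percpu *stats; /* per-CPU, lock free    */
          atomic_t outstanding;                /* central: only updated */
                                               /* on (re)fill/return to */
                                               /* the page allocator    */
  };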


Determining steady-state working-set
------------------------------------

For optimal performance and to minimize memory usage, the page-pool
should only maintain the number of pages required for the steady-state
working-set.

The size of the steady-state working-set will vary depending on the
workload. E.g. in a forwarding workload it will be fairly small, while
for a TCP (local)host delivery workload it will be bigger. Thus, the
steady-state working-set should be dynamically determined.

Steady state can be detected by realizing that, in steady state, no
(re)filling has occurred for a while, and the number of "free" pages
in the pool is not excessive.

Idea: could we track the number of page-pool recycle allocs and frees
within N x jiffies, and if the rates are approximately the same, record
the number of outstanding pages as the steady-state number? (Could be
implemented as a single signed counter, reset every N jiffies,
incremented/decremented on alloc/free; approaching zero at the reset
point == stable, as sketched below.)
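
A small sketch of that counter idea (invented names; assume it runs
from a periodic timer every N jiffies, and that the pool keeps an
alloc_minus_free counter bumped on the fast path)::

  /* Sketch: a single signed counter, +1 on alloc, -1 on recycle/free.
   * Sampled and reset every N jiffies; a sample near zero means the
   * alloc and free rates matched, i.e. we were in steady state. */
  #define PP_STABLE_SLACK 8   /* arbitrary tolerance */

  static void pp_sample_steady_state(struct pp_pool *pool)
  {
          s64 delta = atomic64_xchg(&pool->alloc_minus_free, 0);

          if (delta > -PP_STABLE_SLACK && delta < PP_STABLE_SLACK)
                  pool->steady_state_pages = pool->pages_outstanding;
  }

  /* fast-path hooks (conceptual):
   *   on alloc:        atomic64_inc(&pool->alloc_minus_free);
   *   on recycle/free: atomic64_dec(&pool->alloc_minus_free);
   */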


If the RX rate is bigger than the TX/consumption rate, queueing theory
says a queue will form. While the queue builds (somewhere outside our
control), the page-pool needs to request more and more pages from the
page allocator. The number of outstanding pages, as seen from the
page-pool, increases proportionally to the queue in the system.

Thus, RX > TX is an overload situation. Congestion-control-aware
traffic will self-stabilize. Assuming we are dealing with
non-congestion-controlled traffic, several different scenarios exist:

1. (good-queue) Overload only exists for a short period of time, like
   a traffic burst. This is "good-queue", where we absorb bursts.

2. (bad-queue) The situation persists, but some other limit is hit and
   packets get dropped, like the qdisc limit on forwarding, or the
   local socket limit. This could be interpreted as a "steady state",
   as page-recycling reaches a certain level, and maybe it should be?

3. (OOM) The situation persists, and no natural resource limit is hit.
   Eventually the system runs dry of memory pages and hits OOM. This
   situation should be caught by our circuit-breaker mechanism, before
   OOM.

4. For forwarding, the whole code path from RX to TX takes longer than
   the packet inter-arrival time. Drops happen at the HW level by
   overflowing the RX queue (as it is not emptied fast enough). This is
   possible to detect inside the driver, and we could start an eBPF
   program to filter?

After an overload situation, the RX rate decreases (or stops), so
RX < TX (likely for a short period of time). Then we have the
opportunity to return/free objects/pages back to the page allocator.

Q: How quickly should we do so (return pages)?
Q: How much slack to handle bursts?
Q: Is "steady-state" number of pages an absolute limit?


XDP pool return hook
--------------------

What about allowing an eBPF hook at the page-pool "return" point? That
would allow eBPF to function as an "egress" meter (in circuit-breaker
terminology).

The XDP eBPF hook can maintain its own internal data structures to
track pages.

We could save the RX HW-hash (maybe in struct page); then eBPF could
implement flow metering without touching packet data.

The eBPF prog can even do its own timestamping on RX and compare at
the pool "return" point, essentially implementing a CoDel-like scheme
measuring "time-spent-in-network-stack". (For this to make sense, it
would likely need to group by RX-HW-hash, as multiple paths through
the netstack exist, so it cannot be viewed as a single FIFO.)
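
A very rough sketch of such a scheme. Both the SEC("pool_return")
attach point and the ctx->rx_hash field are assumptions (neither
exists today); the RX-side XDP program is assumed to have stored
bpf_ktime_get_ns() in the map::

  /* Sketch: per RX-HW-hash "time-spent-in-network-stack" check at a
   * hypothetical page-pool return hook. */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_LRU_HASH);
          __uint(max_entries, 16384);
          __type(key, __u32);   /* RX HW-hash         */
          __type(value, __u64); /* RX timestamp in ns */
  } rx_tstamp SEC(".maps");

  #define TARGET_SOJOURN_NS (5 * 1000 * 1000ULL)  /* 5 ms, CoDel-ish */

  SEC("pool_return")                    /* hypothetical hook */
  int check_sojourn(struct xdp_md *ctx)
  {
          __u32 hash = ctx->rx_hash;    /* hypothetical field */
          __u64 *rx_ns = bpf_map_lookup_elem(&rx_tstamp, &hash);

          if (rx_ns && bpf_ktime_get_ns() - *rx_ns > TARGET_SOJOURN_NS)
                  bpf_printk("flow 0x%x over sojourn target", hash);
          return 0;
  }

  char _license[] SEC("license") = "GPL";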


Conclusion
----------

The resource limitation/protection feature offered by the page-pool
is primarily a circuit-breaker facility for protecting other parts of
the system. Combined with an XDP/eBPF hook, it offers powerful and
more fine-grained control.

It requires more work and research if we want to react "earlier",
e.g. before the circuit-breaker kicks in. Here one should be careful
not to interfere with congestion-aware traffic, by giving it
sufficient time to react.

At the driver level it is also possible to detect if the system is not
processing RX packets fast enough. This is not an inherent feature of
the page-pool, but it would be useful input for an eBPF filter.

For the XDP/eBPF hook, this means that it should take a "signal" as
input, describing the current operating state of the machine.

Considering the states:
* State:"circuit-breaker"- eBPF can choose to approve packets, else stack drop
* State:"RX-overload" - eBPF can choose to drop packets to restore operation
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
