
Jesper Dangaard Brouer
On Wed, 4 May 2016 12:47:23 -0700 Tom Herbert <tom@...> wrote: On Wed, May 4, 2016 at 11:13 AM, Jesper Dangaard Brouer <brouer@...> wrote: [...]
One thing I am not sure how to deal with is flow control, i.e. if the transmit queue is being blocked, who should do the drop? Preferably, we'd want to know the queue occupancy in BPF to do an intelligent drop (some crude fq-codel or the like?)

Flow control or push-back is an interesting problem to solve.
The page-pool doc section "Feedback loop" was primarily about how the page-pool's need to recycle pages offers a way to handle and implement flow control. The doc identified two states, but you just identified another, e.g. when the TX-queue/egress is blocking/full. And yes, I also think we can handle that situation. From the conclusion: For the XDP/eBPF hook, this means that it should take a "signal" as input, describing the current operating state of the machine.
Considering the states:

* State: "circuit-breaker" - eBPF can choose to approve packets, else stack drop
* State: "RX-overload" - eBPF can choose to drop packets to restore operation

New state: "TX-overload" - I'll think some more about if and how this state differs from the above states.

Designing the page-pool
=======================
:Version: 0.1.1
:Authors: Jesper Dangaard Brouer
[...]

Feedback loop
=============
With the drivers' current approach (of calling the page allocator directly), the number of pages a driver can hand out is unbounded.
The page-pool provides a feedback-loop facility at the device level.
A classical problem is that a single device can take up an unfairly large portion of the shared memory resources, if e.g. an application (or guest VM) does not free the resources fast enough. This negatively impacts the entire system, possibly leading to Out-Of-Memory (OOM) conditions.
The protection mechanism the page-pool can provide (at the device level) MUST not be seen as a congestion-control mechanism. It should be seen as a "circuit-breaker", a last-resort facility to protect other parts of the system.
Congestion-control-aware traffic usually handles the situation (and adjusts its rate to stabilize the network). Thus, a circuit-breaker must allow sufficient time for congestion-control-aware traffic to stabilize.
The situations relevant for the circuit-breaker are excessive and persistent non-congestion-controlled traffic that affects other parts of the system.
Drop policy
-----------
When the circuit-breaker is in effect (e.g. dropping all packets and recycling the pages directly), an XDP/eBPF hook could decide to change the drop verdict.
With the XDP hook in place, it is possible to implement arbitrary drop policies. If the XDP hook gets the RX HW hash, it can implement flow-based policies without touching packet data.
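As a rough illustration (not an existing XDP/eBPF API; the hook signature, the state "signal" and the credit refill are assumptions made for the example), a flow-bucket drop policy keyed only on the RX HW hash could look like this::

    /*
     * Minimal sketch of a flow-based drop policy keyed on the RX HW hash.
     * The hook signature, the "state" signal and the credit refill are
     * hypothetical; this only illustrates deciding per flow-bucket without
     * touching packet data.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define FLOW_BUCKETS    256

    /* Per-bucket packet credit, refilled periodically (e.g. from a timer). */
    static uint32_t flow_credit[FLOW_BUCKETS];

    enum pool_state { STATE_OK, STATE_RX_OVERLOAD, STATE_CIRCUIT_BREAKER };

    /* Return true to drop (and recycle the page directly), false to pass. */
    static bool drop_policy(uint32_t rx_hw_hash, enum pool_state state)
    {
            uint32_t bucket = rx_hw_hash & (FLOW_BUCKETS - 1);

            if (state == STATE_OK)
                    return false;

            /* Overloaded: shed traffic per flow-bucket, so flows hashing
             * into buckets that still have credit keep flowing. */
            if (flow_credit[bucket] == 0)
                    return true;
            flow_credit[bucket]--;
            return false;
    }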
Detecting driver overload
-------------------------
It might be difficult to determine when the circuit-breaker should kick in based on an excessive working-set size of pages.
But at the driver level, it is easy to detect when the system is overloaded to such an extent that it cannot process packets fast enough. This is simply indicated by the driver not being able to empty the RX queue fast enough, which means the HW is dropping RX packets (FIFO taildrop).
This indication could be passed to an XDP hook, which can implement a drop policy. Filtering packets at this level can likely restore normal system operation, building on the principle of spending as few CPU cycles as possible on packets that would be dropped anyhow (by a deeper layer).
It is important to realize that dropping at the XDP driver level is extremely efficient. Experiments show that the filter capacity of an XDP filter is 14.8 Mpps (DDIO touching the packet and updating an eBPF map), while iptables-raw is 6 Mpps, and hitting the socket limit is around 0.7 Mpps. Thus, an attacker can actually consume significant CPU resources by simply sending UDP packets to a closed port.
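A sketch of how a driver could derive the overload indication described above, assuming it can read a per-queue HW RX-drop counter each poll cycle (the counter plumbing is driver specific; only the bookkeeping is shown)::

    /*
     * Sketch: if the HW RX-drop (FIFO taildrop) counter advanced since the
     * last poll, the RX ring was not emptied fast enough. How the counter is
     * read is driver specific and not shown here.
     */
    #include <stdint.h>

    static uint64_t last_hw_rx_drops;

    /* Call once per poll cycle with the current HW RX-drop counter value. */
    static int rx_overloaded(uint64_t hw_rx_drops)
    {
            int overloaded = (hw_rx_drops != last_hw_rx_drops);

            last_hw_rx_drops = hw_rx_drops;
            return overloaded;  /* feed this "signal" to the XDP drop policy */
    }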
Performance vs feedback-loop accounting
---------------------------------------
For performance reasons, the accounting should be kept in per-CPU structures.
For NIC drivers it actually makes sense to keep the accounting 100% per CPU. In essence, we would like the circuit-breaker to kick in per RX HW queue, as that would allow traffic on the remaining RX queues to keep flowing.
RX queues are usually bound to a specific CPU to avoid packet reordering (and NIC RSS hashing tries to keep flows per RX queue). Thus, keeping page recycling and stats in per-CPU structures basically achieves the same as binding a page-pool per RX queue.
If the RX queue SMP affinity changes at runtime, it does not matter. An RX ring-queue can contain pages "belonging" to another CPU, but that is fine, as they will eventually be returned to the owning CPU.
It would be possible to also keep a more central state for a page-pool, because the number of pages it manages only changes when (re)filling from or returning pages to the page allocator, which should be an infrequent event. I would prefer not to.
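A sketch of what the per-CPU accounting could look like; the field names and CPU bound are illustrative, and in kernel code this would be a DEFINE_PER_CPU() structure owned by the page-pool::

    /*
     * Illustrative per-CPU feedback-loop accounting. With one RX queue per
     * CPU, a per-CPU check effectively makes the circuit-breaker trigger
     * per RX queue, with no cross-CPU synchronization on the fast path.
     */
    #include <stdint.h>

    #define NR_CPUS_EXAMPLE 64      /* illustrative bound */

    struct page_pool_cpu_stats {
            uint64_t alloc;         /* pages handed out to the driver */
            uint64_t recycle;       /* pages returned via the pool fast-path */
            uint64_t refill;        /* pages requested from the page allocator */
            uint64_t returned;      /* pages given back to the page allocator */
            uint64_t outstanding;   /* pages currently "out", proxy for queue depth */
    };

    static struct page_pool_cpu_stats pool_stats[NR_CPUS_EXAMPLE];

    static int circuit_breaker_active(unsigned int cpu, uint64_t limit)
    {
            return pool_stats[cpu].outstanding > limit;
    }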
Determining steady-state working-set
------------------------------------
For optimal performance and to minimize memory usage, the page-pool should only maintain the number of pages required for the steady-state working-set.
The size of the steady-state working-set will vary depending on the workload. In a forwarding workload it will be fairly small; for a TCP (local)host delivery workload it will be bigger. Thus, the steady-state working-set should be dynamically determined.
Steady state can be detected by realizing that, in steady state, no (re)filling has occurred for a while, and the number of "free" pages in the pool is not excessive.
Idea: could we track the number of page-pool recycle allocs and frees within N x jiffies, and if the rates are approximately the same, record the number of outstanding pages as the steady-state number? (This could be implemented as a single signed counter, reset every N jiffies, incremented/decremented on alloc/free; approaching zero at the reset point == stable.)
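A sketch of that counter idea, with illustrative names and without the timer plumbing::

    /*
     * Sketch of the single-signed-counter idea: increment on alloc, decrement
     * on free/recycle, and at every N-jiffies reset point record the number
     * of outstanding pages if the counter is close to zero.
     */
    #include <stdint.h>

    #define STEADY_SLACK    16      /* tolerated |alloc - free| per period */

    struct steady_detect {
            int64_t  delta;                 /* allocs minus frees this period */
            uint64_t outstanding;           /* pages currently handed out */
            uint64_t steady_state_pages;    /* recorded working-set estimate */
    };

    static void on_alloc(struct steady_detect *s) { s->delta++; s->outstanding++; }
    static void on_free(struct steady_detect *s)  { s->delta--; s->outstanding--; }

    /* Called every N jiffies, e.g. from a timer. */
    static void period_reset(struct steady_detect *s)
    {
            int64_t d = s->delta < 0 ? -s->delta : s->delta;

            if (d <= STEADY_SLACK)                          /* rates roughly equal */
                    s->steady_state_pages = s->outstanding; /* record steady-state */
            s->delta = 0;
    }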
If the RX rate is bigger than the TX/consumption rate, queueing theory says a queue will form. While the queue builds (somewhere outside our control), the page-pool needs to request more and more pages from the page allocator. The number of outstanding pages, as seen from the page-pool, increases proportionally to the queue in the system.
Thus, RX > TX is an overload situation. Congestion-control-aware traffic will self-stabilize. Assuming we are dealing with non-congestion-controlled traffic, several different scenarios exist:
1. (good-queue) Overload only exists for a short period of time, like a traffic burst. This is "good-queue", where we absorb bursts.
2. (bad-queue) The situation persists, but some other limit is hit and packets get dropped, like the qdisc limit on forwarding or a local socket limit. This could be interpreted as a "steady-state", as page recycling reaches a certain level, and maybe it should be?
3. (OOM) The situation persists, and no natural resource limit is hit. Eventually the system runs dry of memory pages and hits OOM. This situation should be caught by our circuit-breaker mechanism before OOM.
4. For forwarding, the whole code path from RX to TX takes longer than the packet inter-arrival time. Drops happen at the HW level by overflowing the RX queue (as it is not emptied fast enough). This is possible to detect inside the driver, and we could start an eBPF program to filter?
After an overload situation, when RX decreases (or stops) so that RX < TX (likely for a short period of time), we have the opportunity to return/free pages back to the page allocator.
Q: How quickly should we do so (return pages)?
Q: How much slack to handle bursts?
Q: Is "steady-state" number of pages an absolute limit?
XDP pool return hook
--------------------
What about allowing an eBPF hook at the page-pool "return" point? That would allow eBPF to function as an "egress" meter (in circuit-breaker terminology).
The XDP eBPF hook can maintain its own internal data structure to track pages.
We could save the RX HW hash (maybe in struct page), then eBPF could implement flow metering without touching packet data.
The eBPF prog can even do its own timestamping on RX and compare at the pool "return" point, essentially implementing a CoDel-like scheme measuring "time-spent-in-network-stack". (For this to make sense, it would likely need to group by RX HW hash, as multiple paths through the netstack exist, thus it cannot be viewed as a single FIFO.)
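A sketch of such an egress meter, with hypothetical hook points and clock_gettime() standing in for whatever timestamp source would actually be used::

    /*
     * Sketch of the "egress meter": timestamp at RX, compare at the pool
     * return hook, and track a CoDel-like sojourn time per RX-hash bucket.
     */
    #include <stdint.h>
    #include <time.h>

    #define METER_BUCKETS   256
    #define TARGET_NS       (5 * 1000 * 1000ULL)    /* 5 ms target sojourn time */

    static uint64_t sojourn_ewma[METER_BUCKETS];    /* per RX-hash-bucket estimate */

    static uint64_t now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    /* RX hook: stash this timestamp with the page (e.g. in struct page). */
    static uint64_t meter_rx(void)
    {
            return now_ns();
    }

    /* Pool "return" hook: returns non-zero when a bucket stays above target,
     * which the RX-side drop policy could use as a congestion signal. */
    static int meter_return(uint32_t rx_hw_hash, uint64_t rx_timestamp)
    {
            uint32_t b = rx_hw_hash & (METER_BUCKETS - 1);
            uint64_t sojourn = now_ns() - rx_timestamp;

            /* EWMA: new = 7/8 * old + 1/8 * sample */
            sojourn_ewma[b] = (sojourn_ewma[b] * 7 + sojourn) / 8;
            return sojourn_ewma[b] > TARGET_NS;
    }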
Conclusion
----------
The resource limitation/protection feature offered by the page-pool is primarily a circuit-breaker facility for protecting other parts of the system. Combined with an XDP/eBPF hook, it offers powerful and more fine-grained control.
More work and research is required if we want to react "earlier", e.g. before the circuit-breaker kicks in. Here one should be careful not to interfere with congestion-aware traffic, by giving it sufficient time to react.
At the driver level it is also possible to detect if the system is not processing RX packets fast enough. This is not an inherent feature of the page-pool, but it would be useful input for an eBPF filter.
For the XDP/eBPF hook, this means that it should take a "signal" as input, describing the current operating state of the machine.
Considering the states:

* State: "circuit-breaker" - eBPF can choose to approve packets, else stack drop
* State: "RX-overload" - eBPF can choose to drop packets to restore operation
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer
|
On Thu, May 05, 2016 at 07:22:49PM -0700, Tom Herbert wrote: If we allow priority and we allow specifying the interface as in your suggested interface, then the next obvious thing is that we'd want to return both a priority and interface at the same time which becomes another opcode. And then if we add rate limiting, then we may want to return any combination of rate limit, priority, and interface-- which means three additional opcodes for a total of six. So this pattern degenerates to needing N! number of different op codes for N independent features-- that is not an extensible interface. A such problem doesn't exist. Look at existing TC_ACT_DIRECT/OK/SHOT They're already parametrized while being distinct codes. Having multiple codes doesn't mean that each code won't have its arguments. All codes are indepdenent actions and won't be combined. All cases where action might have multiple parameters will be represented as such via bits in the code and via per-cpu scratch data. Muxing everything into single TX action is simply horrible interface. Sending to magic id would mean encryption? and another magic id would mean priority ? how about both? yet another id? Then it will end up real N! problem. Also, if we're allocating bits in the return code each time a feature is added we'll run out of bits pretty quickly ;-) (that is a real imo 256 return values in 8-bits is more than enough. We'll never be able to exhaust them. TC took 10 years to go from code 6 to 7 sure, all makes sense, but it's easier to do with TX, TX_PRIO, etc enums instead of magic numbers. But not extensible. Think of all the qdiscs in the kernel and how people might want to provide support for those in HW. Think of the features of network virtualization in HW like the encapsulation service.
please be specific. I don't see any problems with extensibility. Think of how we could provide for offloading security, compression. exactly. encrypt/decrypt should be different opcodes. "TX into id=5 == encrypt" is beyond ugly. interoperate in the future. Think of how people will want to combine use of these features in arbitrary combinations. yes. that's why we have other 56-bits in the return code and a ton of per-cpu scratch space. Take a look at bpf_redirect() helper for example. an inflexible interface from the get-go. Supporting priority and egress interface are only two features and its nice that they're getting thought, but this is no where near to being a complete set of features we might need to support. Exactly. That's why representing everything as an id is a non starter. I think you're exaggerating the programmatic complexity. In the normal case, I would expect that the BPF parses the packet, does a hash lookup on the results of the parse, and the returns whatever is saved in the hash structure. The table is programmed from userspace, so at runtime BPF doesn't need to worry about the contents or structure. Not really. That wasn't the case so far. I suspect you're thinking of p4/openflow style, when action codes are embedded inside tables. Typical bpf program doesn't store its action as a value inside the table. It may store ifindex in the table, but not the opcode.
|
On Thu, May 5, 2016 at 6:17 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote: On Thu, May 05, 2016 at 05:26:53PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 4:51 PM, Tom Herbert <tom@...> wrote:
On Thu, May 5, 2016 at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, whatever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and the rant about 'generality vs performance'. Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to a priority queue, HW rate-limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A single index allows one array lookup which returns whatever information is needed for the driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look up the index in the array to complete the action. then semantics will become driver dependent. that will be unusable from the program point of view. Non-portable programs etc. Also if ((u8)ret == TX && ret>>8 == current-dev->ifindex) do_xmit _is_ slower than if ((u8)ret == TX) do_xmit At line rate every cycle and every branch count. You have the same problem if you allow a priority to be returned. Not really. The TX case stays fast and unchanged. When TX_PRIO will be added, it will be handled by a different case in one switch ((u8)ret).
See the N! problem below. General parameterized return value is always a single lookup regardless of how many features HW might add for us. If we use array of 256 entries for indices we don't even need to check array bounds (is a problem for switch). If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. why? I think it's the opposite. tx with prio is clearly different action. Some drivers may support it, some not. And the programs will clearly indicate to the driver side what they want, so it's only one 'switch ((u8)ret)' on the driver side to understand the action and do it.
But then so is rate limiting, so is receiving to a tap queue, so is sending on a paced queue, etc. All of these just boil down to which HW queue is used that provides the COS.
yes and all of them should be different opcodes that have normal names in an uapi enum. Magic numbers are not human friendly.
If we allow priority and we allow specifying the interface as in your suggested interface, then the next obvious thing is that we'd want to return both a priority and interface at the same time which becomes another opcode. And then if we add rate limiting, then we may want to return any combination of rate limit, priority, and interface-- which means three additional opcodes for a total of six. So this pattern degenerates to needing N! number of different op codes for N independent features-- that is not an extensible interface. A parameterized return value is extensible to express any combination of features and instruction. Every time a new HW feature is exposed it can be absorbed without ever having to change the base interface. Also, if we're allocating bits in the return code each time a feature is added we'll run out of bits pretty quickly ;-) (that is a real problem as we see often on netdev, features are only added, hardly ever removed). As for being "human readable" I don't believe there is any issue. Everything can still be abstracted to be human readable. Here's a more specific example of how this might work with priority.
1) Someone writes a BPF program that transmits on three priorities we'll call A, B, C. 2) At runtime we ask the driver for three indices referring to the three priorities. If it can fulfill the request (I1, I2, I3) then we just map A->I1, B->I2, C->I3 and everyone's happy. 3) If the driver cannot fulfill the request, then the user has a decision to make. Either don't proceed because we want to use a feature that is unsupported, or make do with what we can get. Maybe we only get one index for all transmit so it might be reasonable to map priorities to one queue A->I1, B->I1, C->I1. sure, all makes sense, but it's easier to do with TX, TX_PRIO, etc enums instead of magic numbers.
But not extensible. Think of all the qdiscs in the kernel and how people might want to provide support for those in HW. Think of the features of network virtualization in HW like the encapsulation service. Think of how we could provide for offloading security, compression. Think of switchdev and how someone might want to interoperate in the future. Think of how people will want to combine use of these features in arbitrary combinations. I don't necessarily believe that any any or all of those have merit for XDP, but I do believe we don't want to preclude anything in the future by creating an inflexible interface from the get-go. Supporting priority and egress interface are only two features and its nice that they're getting thought, but this is no where near to being a complete set of features we might need to support. User space can ask the driver whether it supports TX_PRIO and what range of priorities is available then it can choose to abort the whole thing if TX_PRIO is not supported, or if the range is too big or too small it can dynamically recompile the program for specific available range of priorities. Exactly my point, the driver will need to provide capabilities up front and the program will need to be adjusted for that. But the program stays simple and its C code has 'return TX_PRIO | prio' which is easier to debug instead of magic 'return TX | id' which is mistake prone. I think you're exaggerating the programmatic complexity. In the normal case, I would expect that the BPF parses the packet, does a hash lookup on the results of the parse, and the returns whatever is saved in the hash structure. The table is programmed from userspace, so at runtime BPF doesn't need to worry about the contents or structure. We humans deal with pointers all the time and no one every complains that they're not human readable ;-) Tom
The important thing is that it is always the _user's_ decision what to do with the available resources. The driver should never on its own try to compensate for a lack of resources, and it should *never* resort to a software solution to provide a feature, lest we wind up reinventing the whole set of queuing disciplines. Completely agree. The driver should never try to be smart and emulate things in software. There should be a mechanism to ask the driver whether it supports TX_PRIO, TX_TO_NETDEV or else. If user space chooses to ignore that and loads a program that returns TX_PRIO, the driver will do a packet drop instead. In general any unknown return code must be a drop, which is code 0 for a reason. It's an exception code after div_by_0.
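To make the scheme being argued for here concrete, a minimal sketch of the layout floated in this thread (action in the low 8 bits, any per-action parameter in the upper bits, one switch on the driver side); the names and values mirror the proposals above and are not a finalized uapi, and the handler bodies are placeholders:

    /*
     * Illustrative only: mirrors the return-code layout discussed in this
     * thread (low 8 bits = action, upper bits = per-action parameter).
     */
    #include <stdint.h>

    enum xdp_action_proposal {
            XDP_DROP = 0,           /* default; unknown codes must also drop */
            XDP_PASS = 1,           /* hand the packet to the normal stack */
            XDP_TX   = 2,           /* transmit back out the same device */
            XDP_TX_PRIO = 3,        /* upper bits carry the priority */
            XDP_TX_PHYS_IFINDEX = 4,/* upper bits carry the egress ifindex */
    };

    /* Driver-side dispatch: one switch on the low 8 bits keeps the plain
     * XDP_TX case free of extra compares, as argued above. */
    static void xdp_handle_ret(uint64_t ret)
    {
            uint64_t param = ret >> 8;      /* prio or ifindex, action dependent */

            (void)param;                    /* a real driver would use it below */
            switch ((uint8_t)ret) {
            case XDP_PASS:
                    /* ... pass to stack ... */
                    break;
            case XDP_TX:
                    /* ... xmit on the same device ... */
                    break;
            case XDP_TX_PRIO:
                    /* ... xmit on priority queue 'param', if supported ... */
                    break;
            case XDP_TX_PHYS_IFINDEX:
                    /* ... xmit via device with ifindex 'param', if supported ... */
                    break;
            case XDP_DROP:
            default:
                    /* unknown or unsupported action codes drop the packet */
                    break;
            }
    }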
|
On Thu, May 05, 2016 at 05:26:53PM -0700, Tom Herbert wrote: On Thu, May 5, 2016 at 4:51 PM, Tom Herbert <tom@...> wrote:
On Thu, May 5, 2016 at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. then semantics will become driver dependent. that will be unusable from the program point view. Non-portable programs etc. Also if ((u8)ret == TX && ret>>8 == curent-dev->ifindex) do_xmit _is_ slower than if ((u8)ret == TX) do_xmit At line rate every cycle and every branch count. You have the same problem if you allow a priority to be returned. Not really. The TX case stays fast and unchanged. When TX_PRIO will be added, it will be handled by different case in one switch ((u8)ret). If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. why? I think it's the opposite. tx with prio is clearly different action. Some drivers may support it, some not. And the programs will clearly indicate to the driver side what they want, so it's only one 'switch ((u8)ret)' on the driver side to understand the action and do it.
But then so is rate limiting, so is receiving to a tap queue, so is sending on a paced queue, etc. All of these just boil down to which HW queue is used that provides the COS.
yes and all of them should be different opcodes that have normal names in an uapi enum. Magic numbers are not human friendly. Here's a more specific example of how this might work with priority.
1) Someone writes a BPF program that transmits on three priorities we'll call A, B, C. 2) At runtime we ask the driver for three indices referring to the three priorities. If it can fulfill the request (I1, I2,I3) then we just map A->I1, B->I2, C->I3 and everyone's happy. 3) If this the driver cannot fulfill the request, then the user has a decision to make. Either don't proceed because we want to use a feature that is unsupported, or make due with what we can get. Maybe we only get one index for all transmit so it might be reasonable to map priorities to one queue A->I1, B->I1, C->I1. sure, all makes sense, but it's easier to do with TX, TX_PRIO, etc enums instead of magic numbers. User space can ask the driver whether it supports TX_PRIO and what range of priorities is available then it can choose to abort the whole thing if TX_PRIO is not supported, or if the range is too big or too small it can dynamically recompile the program for specific available range of priorities. But the program stays simple and its C code has 'return TX_PRIO | prio' which is easier to debug instead of magic 'return TX | id' which is mistake prone. The important thing is that it is the always _user's_ decision what to do with the available resources. The driver should never on its own try compensate for a lack of resources, and it should *never* resort to software solution to providing a feature lest we wind up reinventing the whole queuing disciplines. Completely agree. The driver should never try to be smart and emulate things in software. There should be a mechanism to ask the driver whether it supports TX_PRIO, TX_TO_NETDEV or else. If user space choose to ignore that and loaded the program that returns TX_PRIO, the driver will do packet drop instead. In general any unknown return code must be drop which is code 0 for a reason. It's an exception code after div_by_0.
|
On Thu, May 5, 2016 at 4:51 PM, Tom Herbert <tom@...> wrote: On Thu, May 5, 2016 at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. then semantics will become driver dependent. that will be unusable from the program point view. Non-portable programs etc. Also if ((u8)ret == TX && ret>>8 == curent-dev->ifindex) do_xmit _is_ slower than if ((u8)ret == TX) do_xmit At line rate every cycle and every branch count. You have the same problem if you allow a priority to be returned.
If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. why? I think it's the opposite. tx with prio is clearly different action. Some drivers may support it, some not. And the programs will clearly indicate to the driver side what they want, so it's only one 'switch ((u8)ret)' on the driver side to understand the action and do it.
But then so is rate limiting, so is receiving to a tap queue, so is sending on a paced queue, etc. All of these just boil down to which HW queue is used that provides the COS. Here's a more specific example of how this might work with priority. 1) Someone writes a BPF program that transmits on three priorities we'll call A, B, C. 2) At runtime we ask the driver for three indices referring to the three priorities. If it can fulfill the request (I1, I2,I3) then we just map A->I1, B->I2, C->I3 and everyone's happy. 3) If this the driver cannot fulfill the request, then the user has a decision to make. Either don't proceed because we want to use a feature that is unsupported, or make due with what we can get. Maybe we only get one index for all transmit so it might be reasonable to map priorities to one queue A->I1, B->I1, C->I1. The important thing is that it is the always _user's_ decision what to do with the available resources. The driver should never on its own try compensate for a lack of resources, and it should *never* resort to software solution to providing a feature lest we wind up reinventing the whole queuing disciplines. Tom
|
On Thu, May 5, 2016 at 4:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote: On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. then semantics will become driver dependent. that will be unusable from the program point view. Non-portable programs etc. Also if ((u8)ret == TX && ret>>8 == curent-dev->ifindex) do_xmit _is_ slower than if ((u8)ret == TX) do_xmit At line rate every cycle and every branch count. You have the same problem if you allow a priority to be returned.
If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. why? I think it's the opposite. tx with prio is clearly different action. Some drivers may support it, some not. And the programs will clearly indicate to the driver side what they want, so it's only one 'switch ((u8)ret)' on the driver side to understand the action and do it.
But then so is rate limiting, so is receiving to a tap queue, so is sending on a paced queue, etc. All of these just boil down to which HW queue is used that provides the COS.
|
On Thu, May 5, 2016 at 4:18 PM, Daniel Borkmann <daniel@...> wrote: On 05/06/2016 01:00 AM, Tom Herbert via iovisor-dev wrote:
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think.
+1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming.
No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to
But that would mean, that XDP programs are not portable anymore across different drivers, no? So they'd have to be rewritten when porting to a different nic or cannot be supported there due to missing features.
Actually, I believe the opposite is true. The interface I propose does not require any features to be supported except that the driver can drop, transmit on a default queue, and receive packets. To use anything else is an "advanced" feature which will vary from device to device, and we don't want to mandate anything beyond that. To deal with this, allow the BPF program to be parameterized at runtime to get the mappings of logical features to indices. Priority is a great example, if we have an action called BPF_XDP_TX_PRIO but the device doesn't support priority queues then what does that mean? It's doesn't make sense to arbitrarily require this to be supported. Instead, we can ask the driver up front what it supports and adjust the program (mappings) as we see fit. Tom the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. Priority queues are a perfect example of this, these are not a common supported feature and should not be exposed in the base action. For portability on one XDP program across different NICs, then this would either just be a *hint* to the driver and drivers not supporting this don't care about that, or drivers need to indicate their capabilities in some way to the verifier, so that verifier makes sure that such action is possible at all for the given driver. For prio queues case, probably first option might be better.
This also means the return code is simple, just two fields: the opcode and its parameter. In phase one the parameter would always be zero.
Tom
|
On Thu, May 05, 2016 at 04:00:10PM -0700, Tom Herbert wrote: On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. then semantics will become driver dependent. that will be unusable from the program point view. Non-portable programs etc. Also if ((u8)ret == TX && ret>>8 == curent-dev->ifindex) do_xmit _is_ slower than if ((u8)ret == TX) do_xmit At line rate every cycle and every branch count. If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. why? I think it's the opposite. tx with prio is clearly different action. Some drivers may support it, some not. And the programs will clearly indicate to the driver side what they want, so it's only one 'switch ((u8)ret)' on the driver side to understand the action and do it.
|
On 05/06/2016 01:00 AM, Tom Herbert via iovisor-dev wrote: On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to But that would mean, that XDP programs are not portable anymore across different drivers, no? So they'd have to be rewritten when porting to a different nic or cannot be supported there due to missing features. the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. Priority queues are a perfect example of this, these are not a common supported feature and should not be exposed in the base action. For portability on one XDP program across different NICs, then this would either just be a *hint* to the driver and drivers not supporting this don't care about that, or drivers need to indicate their capabilities in some way to the verifier, so that verifier makes sure that such action is possible at all for the given driver. For prio queues case, probably first option might be better. This also means the return code is simple, just two fields: the opcode and its parameter. In phase one the parameter would always be zero.
Tom
|
On Thu, May 5, 2016 at 3:41 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote: On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote:
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
Sorry, I'm missing your point. The simple model is that we have three opcodes and two of them take parameters. The parameters are generic so they can indicate arbitrary instructions on what to do with the packet (these can point to priority queue, HW rate limited queue, tap queue, whatever). This is a way that drivers can extend the capabilities of the interface for their own features without requiring any changes to the interface. A singe index allows one array lookup which returns what every information is needed for driver to act on the packet. So this scheme is both generic and performant-- it allows generality in that the TX and RX actions can be arbitrarily extended and is performant since all the driver needs to do is look the index in the array to complete the action. If we don't do something like this, then every time someone adds some new functionality we have to add another action-- that doesn't scale. Priority queues are a perfect example of this, these are not a common supported feature and should not be exposed in the base action. This also means the return code is simple, just two fields: the opcode and its parameter. In phase one the parameter would always be zero. Tom
|
On Thu, May 05, 2016 at 03:21:24PM -0700, Tom Herbert wrote: On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ).
Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. No. See my comment to Jesper and rant about 'generality vs performance' Combining them into one generic TX code is not simpler. Not for the program and not for the driver side.
|
On Thu, May 5, 2016 at 2:44 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote: On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
I think we're saying the same the thing just using different notation. BPF program returns an index which the driver maps to a queue, but this index is relative to XDP instance. So if a device offers 3 levels priority queues then BPF program can return 0,1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and does not implement any policy other than restricting the XDP instance to its set-- like mapping to actual queue number could be 3*N+R where N in instance # of XDP and R is return index. Egress on a different interface can work the same way, for instance 0 index might queue for local interface, 1 index might queue for interface. This simple return value to queue mapping is lot easier for crossing devices if they are managed by the same driver I think. +1 we'd need a way to specify priority queue from bpf program. Probably not as a first step though. Something like BPF_XDP_DROP 0 BPF_XDP_PASS 1 BPF_XDP_TX 2 BPF_XDP_TX_PRIO 3 | upper bits used for prio BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex I think we can simplify these three into just XDP_TX (and without BPF tag to allow possibility that some non-BPF entity really wants to use this interface ;-) ). Just have XDP_TX with some index. Index maps to priority, queue, other device, what ever. The caller will need to understand what the different possible indices mean but this can be negotiated out of band and up front before programming. BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev Similarly, just do XDP_RX with some index that has meaning to caller and driver. (Also NETDEV and IFINDEX are kernel specific terms, we should avoid that in the interface). So that just leaves three actions XDP_DROP, XDP_TX, XDP_RX! lower 8-bits to encode action should be enough. First merge-able step is to do 0,1,2 in one driver (like mlx4) and start building it in other drivers.
|
On 05/06/2016 12:04 AM, Alexei Starovoitov wrote: On Fri, May 06, 2016 at 12:00:57AM +0200, Daniel Borkmann wrote:
On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
[...] Can't this be done in a second step, with some per-cpu scratch data as we have for redirect? That would seem easier to use to me, and easier to extend with further data required to tx or rx to the stack ... The return code could have a flag telling the driver to look at the scratch data, for example.
yes. The 3,4,5,6,7,... codes can look at per-cpu scratch data too. My point is that for step one we define the semantics of opcodes 0,1,2 in the first 8 bits of the return value. Everything else is reserved and defaults to drop.
Yep, a first step with opcodes 0=drop, 1=pass/stack, 2=tx/fwd defined sounds reasonable to me, with the rest as drop.
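One way to picture the per-cpu scratch-data idea for the later opcodes (purely a sketch; the flag bit, struct layout and names are invented, not an existing kernel interface):

  #include <linux/percpu.h>

  /* Hypothetical "consult the per-cpu scratch area" flag, placed above the
   * low 8 action bits, plus an invented scratch layout. */
  #define SKETCH_ACTION_MASK  0xffU
  #define SKETCH_F_SCRATCH    (1U << 8)

  struct sketch_xdp_scratch {
          unsigned int tx_prio;     /* priority class requested by the program */
          unsigned int ifindex;     /* egress device requested by the program */
  };

  static DEFINE_PER_CPU(struct sketch_xdp_scratch, sketch_xdp_scratch_area);

  static void sketch_handle_verdict(unsigned int verdict)
  {
          unsigned int act = verdict & SKETCH_ACTION_MASK;

          if (verdict & SKETCH_F_SCRATCH) {
                  struct sketch_xdp_scratch *s =
                          this_cpu_ptr(&sketch_xdp_scratch_area);
                  /* act would be interpreted together with s->tx_prio and
                   * s->ifindex, filled in by the program before returning. */
                  (void)s;
          }
          /* Otherwise only the plain 0/1/2 semantics apply. */
          (void)act;
  }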
|
On Fri, May 06, 2016 at 12:00:57AM +0200, Daniel Borkmann wrote: On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote:
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
[...] Can't this be done in a second step, with some per-cpu scratch data as we have for redirect? That would seem easier to use to me, and easier to extend with further data required to tx or rx to the stack ... The return code could have a flag telling the driver to look at the scratch data, for example.
yes. The 3,4,5,6,7,... codes can look at per-cpu scratch data too. My point is that for step one we define the semantics of opcodes 0,1,2 in the first 8 bits of the return value. Everything else is reserved and defaults to drop.
|
On 05/05/2016 11:44 PM, Alexei Starovoitov via iovisor-dev wrote: On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote:
[...] Can't this be done in a second step, with some per-cpu scratch data as we have for redirect? That would seem easier to use to me, and easier to extend with further data required to tx or rx to the stack ... The return code could have a flag telling the driver to look at the scratch data, for example.
|
On Thu, May 05, 2016 at 09:32:32PM +0200, Jesper Dangaard Brouer wrote: On Wed, 4 May 2016 22:22:07 -0700 Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer <brouer@...> wrote:
I've started a separate document for designing my page-pool idea.
I see the page-pool as a component for allowing fast forwarding with XDP, at the packet-page level, across devices.
I want your input on how you imagine XDP/eBPF forwarding would work? I could imagine: 1) eBPF returns an ifindex it wants to forward to, 2) look up whether the netdevice supports the new NDO for XDP-page-fwd, 3A) call XDP-page-fwd with the packet-page, 3B) if there is no XDP-page-fwd, construct an SKB and xmit directly on the device, 4) (for both cases above) later, at TX-DMA completion, return the page to the page-pool. I think the first step is option 0, where the program will return a single return code 'TX' and the driver side will figure out which TX queue to use to avoid conflicts. More sophisticated selection of ifindex and/or TX queue can be built on top. I agree that the driver should choose which TX queue to use to avoid conflicts, allowing lockless access.
I think XDP/BPF "forward"-mode should always select an egress/TX ifindex/netdevice. If the ifindex happens to match the driver itself, then the driver can do the superfast TX into a driver TX-ring queue. But if the ifindex is for another device (that does not support this), then we fall back to a full-SKB alloc and normal stack TX towards that ifindex/netdevice (likely bypassing the rx_handler). NO. See my ongoing rant on performance vs generality. 'Then it looks so generic and nice' arguments are not applicable to XDP. Even if the ifindex check didn't cost any performance, it still doesn't make sense to do it, since the ifindex is dynamic, so the program would need to be tailored for a specific ifindex: either compiled on-the-fly when the ifindex is known, or with an extra map lookup to figure out which ifindex to use. For the load balancer/ILA router use cases it's unnecessary, so the program will not be dealing with ifindex.
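As a sketch of the flow Jesper describes (the xdp_sketch_page_fwd() stub stands in for a hypothetical new NDO; only dev_get_by_index_rcu(), build_skb(), skb_put() and dev_queue_xmit() are real kernel APIs, and error handling is kept minimal):

  #include <linux/errno.h>
  #include <linux/mm.h>
  #include <linux/netdevice.h>
  #include <linux/skbuff.h>

  /* Hypothetical stand-in for a new "XDP page forward" NDO on the egress
   * driver; -EOPNOTSUPP means the device has no such fast path. */
  static int xdp_sketch_page_fwd(struct net_device *tx_dev, struct page *page,
                                 unsigned int len)
  {
          return -EOPNOTSUPP;
  }

  /* Try the page-level fast path on the egress device, otherwise fall back
   * to a full SKB and the normal xmit path.  Caller is assumed to hold
   * rcu_read_lock() and to pass a page whose packet data starts at offset 0. */
  static int xdp_sketch_forward(struct net_device *rx_dev, struct page *page,
                                unsigned int len, int egress_ifindex)
  {
          struct net_device *tx_dev;
          struct sk_buff *skb;
          int ret;

          tx_dev = dev_get_by_index_rcu(dev_net(rx_dev), egress_ifindex);
          if (!tx_dev)
                  return -ENODEV;

          ret = xdp_sketch_page_fwd(tx_dev, page, len);
          if (ret != -EOPNOTSUPP)
                  return ret;             /* fast path took (or dropped) the packet */

          skb = build_skb(page_address(page), PAGE_SIZE);
          if (!skb)
                  return -ENOMEM;
          skb_put(skb, len);
          skb->dev = tx_dev;
          return dev_queue_xmit(skb);     /* page is released, ideally back to the pool,
                                           * at TX completion */
  }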
|
On Thu, May 05, 2016 at 01:19:37PM -0700, Tom Herbert wrote: [...] +1, we'd need a way to specify a priority queue from the bpf program. Probably not as a first step though. Something like:
  BPF_XDP_DROP 0
  BPF_XDP_PASS 1
  BPF_XDP_TX 2
  BPF_XDP_TX_PRIO 3 | upper bits used for prio
  BPF_XDP_TX_PHYS_IFINDEX 4 | upper bits for ifindex
  BPF_XDP_RX_NETDEV_IFINDEX 5 | upper bits for ifindex of veth or any netdev
The lower 8 bits to encode the action should be enough. The first merge-able step is to do 0,1,2 in one driver (like mlx4) and start building it in other drivers.
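The first merge-able step could then be a three-way dispatch on the driver's RX path, roughly like this (a sketch with invented names, not actual mlx4 code):

  /* Minimal driver-side dispatch for return codes 0/1/2; reserved codes
   * default to drop.  All names are illustrative. */
  enum sketch_verdict { SKETCH_DROP = 0, SKETCH_PASS = 1, SKETCH_TX = 2 };

  struct sketch_rx_ctx { void *page; unsigned int len; }; /* stand-in for driver RX state */

  static void sketch_xmit_back(struct sketch_rx_ctx *ctx)     { /* queue on the XDP TX ring */ }
  static void sketch_pass_to_stack(struct sketch_rx_ctx *ctx) { /* build an SKB for the stack */ }
  static void sketch_recycle_page(struct sketch_rx_ctx *ctx)  { /* return the page to the pool */ }

  static void sketch_rx_dispatch(unsigned int verdict, struct sketch_rx_ctx *ctx)
  {
          switch (verdict & 0xffU) {      /* low 8 bits carry the action */
          case SKETCH_TX:
                  sketch_xmit_back(ctx);
                  break;
          case SKETCH_PASS:
                  sketch_pass_to_stack(ctx);
                  break;
          case SKETCH_DROP:
          default:
                  sketch_recycle_page(ctx);
                  break;
          }
  }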
|
On Thu, May 5, 2016 at 1:46 PM, Thomas Monjalon <thomas.monjalon@...> wrote: 2016-05-04 14:01, Tom Herbert:
On Wed, May 4, 2016 at 12:55 PM, Thomas Monjalon <thomas.monjalon@...> wrote:
2016-05-04 12:47, Tom Herbert:
Maybe we can get basic forwarding to work first ;-). From a system design point of view, mixing different types of NICs on the same server is not very good anyway. Mixing NICs on a server is probably not common. But I wonder whether it could allow us to leverage different offload capabilities for asymmetrical traffic? Maybe, but it's a lot of complexity. Do you have a specific use case in mind? No real use case now, but offloads in NICs are becoming more and more complex and really different depending on the vendor. We're trying hard to discourage vendors from doing that. All these complex HW offloads aren't helping matters! (e.g. see the continuing saga of getting vendors to give us protocol-generic checksum offload...). Other than checksum offload and RSS, I'm not seeing much we can leverage from the HW offloads for XDP. Of course, when we can offload the BPF program to HW, that might be a different story. I think the tunnel encapsulation offload use case is becoming real. We can also think of different flow steering depending on the tunnel type. Won't we be able to implement encap/decap in XDP just as easily, but in a way that is completely user programmable? Tom
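To illustrate why encap looks tractable in XDP: at the packet level, encapsulation is just prepending an outer header when there is headroom for it. A generic sketch in plain C over a raw buffer (not a real XDP helper API; the program-facing interface for this was still under discussion):

  #include <stddef.h>
  #include <string.h>

  /* Prepend an outer header in front of the existing packet data, given
   * enough headroom.  Returns the new start of the packet, or NULL if the
   * packet cannot grow. */
  static unsigned char *sketch_encap(unsigned char *data, unsigned int *len,
                                     unsigned int headroom,
                                     const unsigned char *outer_hdr,
                                     unsigned int outer_len)
  {
          if (headroom < outer_len)
                  return NULL;
          data -= outer_len;                  /* grow the packet into the headroom */
          memcpy(data, outer_hdr, outer_len); /* write the new outer header */
          *len += outer_len;
          return data;
  }

Decap is the inverse: advance the data pointer past the outer header and shrink the length.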
Please could you elaborate on why mixing is not very good? Harder to design and test, and I don't see much value in it. Supporting such things forces us to continually raise the abstraction and generalize interfaces more and more, which is exactly how we wind up with things like 400-byte skbuffs, locking, soft queues, etc. XDP is expressly not meant to be a general solution, and that gives us the liberty to cut out anything that doesn't yield performance, like trying to preserve a high-performance interface between two arbitrary drivers (while still addressing the 90% case). Interesting point of view. Thanks
|
Thomas Monjalon <thomas.monjalon@...>
2016-05-04 14:01, Tom Herbert: On Wed, May 4, 2016 at 12:55 PM, Thomas Monjalon <thomas.monjalon@...> wrote:
2016-05-04 12:47, Tom Herbert:
[...]
Interesting point of view. Thanks
|
On Thu, May 5, 2016 at 12:11 PM, Jesper Dangaard Brouer <brouer@...> wrote: On Thu, 5 May 2016 11:01:52 -0700 Tom Herbert <tom@...> wrote:
On Thu, May 5, 2016 at 10:41 AM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Thu, May 05, 2016 at 10:06:40AM -0700, Tom Herbert wrote:
On Wed, May 4, 2016 at 10:22 PM, Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Wed, May 4, 2016 at 2:15 AM, Jesper Dangaard Brouer <brouer@...> wrote:
I've started a separate document for designing my page-pool idea.
I see the page-pool as a component for allowing fast forwarding with XDP, at the packet-page level, across devices.
[...] I think the first step is option 0, where the program will return a single return code 'TX' and the driver side will figure out which TX queue to use to avoid conflicts. I'm not sure what this means. In XDP the driver should not be making any decisions (i.e. the driver does not implement any policy). If there is a choice of TX queue, that choice should be made by the BPF code. Maybe for the first instantiation there is only one queue and BPF always returns an index of zero; this will be sufficient for most L4 load balancers and the ILA router. There are always multiple RX and TX queues. It makes the program portable across different NICs and HW configurations when it doesn't know the RX queue number and doesn't make decisions about TX queues. I don't see a use case for selecting the TX queue. The driver side should be making this decision, to make sure the performance is optimal and everything is lockless. For example, it can allocate N+M TX queues and N RX queues, where N is a multiple of the CPU count, and use the M TX queues for normal TCP stack TX traffic. Then everything is collision-free and lockless.
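A sketch of that N+M queue split and the per-CPU pick (numbers and names are invented; smp_processor_id() is the only real kernel API used):

  #include <linux/smp.h>

  /* Illustrative layout: XDP owns TX queues [0, N), the regular stack owns
   * [N, N+M).  With N a multiple of the CPU count, each CPU gets its own
   * XDP TX queue and no locking is needed on the XDP TX path. */
  struct sketch_txq_layout {
          unsigned int n_xdp;     /* N: reserved for XDP, >= number of CPUs */
          unsigned int m_stack;   /* M: used by the normal TCP/IP stack */
  };

  static unsigned int sketch_xdp_txq(const struct sketch_txq_layout *l)
  {
          /* In NAPI/RX context preemption is off, so the CPU id is stable. */
          return smp_processor_id() % l->n_xdp;
  }

  static unsigned int sketch_first_stack_txq(const struct sketch_txq_layout *l)
  {
          return l->n_xdp;        /* stack queues start right after the XDP ones */
  }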
Right, the TX queues used by the stack need to be completely independent of those used by XDP. If an XDP instance (e.g. an RX queue) has exclusive access to a TX queue, there is no locking and no collisions. Neither is there any need for the instance to transmit on multiple queues, except in the case where different COS (classes of service) are offered by different queues (e.g. priority), but again the COS would be decided by the BPF program, not the driver. In other words, for XDP we need one TX queue per COS per instance (RX queue) of XDP. There should also be at most one RX queue serviced per CPU. I almost agree, but there are some details ;-)
Yes, for XDP-TX we likely cannot piggy-back on the normal stack TX queues (like we do on the RX queues). Thus, when a driver supports the XDP-TX feature, it needs to provide some extra TX queues for XDP. For lockless TX I assume we need an XDP-TX queue per CPU.
The way I understand you, you want the BPF program to choose the TX queue number. I disagree, as BPF should have no knowledge about TX queue numbers. (It would be hard to get lockless TX queues if the BPF program chooses.) IMHO the BPF program can choose the egress netdevice (e.g. via ifindex). Then we call the NDO "XDP-page-fwd"; inside that call, the actual TX queue is chosen based on the currently-running CPU (maybe simply via a this_cpu_xxx call).
I think we're saying the same thing, just using different notation. The BPF program returns an index which the driver maps to a queue, but this index is relative to the XDP instance. So if a device offers 3 priority-queue levels, the BPF program can return 0, 1, or 2. The driver can map this return value to a queue (probably from a set of three queues dedicated to the XDP instance). What I am saying is that this driver mapping should be trivial and should not implement any policy other than restricting the XDP instance to its own set; e.g. the mapping to an actual queue number could be 3*N+R, where N is the XDP instance # and R is the returned index. Egress on a different interface can work the same way: for instance, index 0 might queue for the local interface and index 1 might queue for another interface. This simple return-value-to-queue mapping is a lot easier for crossing devices if they are managed by the same driver, I think.
Getting lockless TX queues has one problem: TX DMA completion interrupts. Today the TX completion "cleanup" of the TX ring-queue can run on another CPU. This breaks the lockless scheme. We need to deal with this somehow, and set up our XDP-TX queues "more strictly" somehow from the kernel side, and not allow userspace to change smp_affinity (simply chmod the proc file ;-)).
Hmm, how does DPDK deal with this? Hopefully you wouldn't need an actual lock for this either; atomic ops on producer/consumer pointers should work for most devices? Tom
--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
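For reference, the kind of lock-free handoff hinted at above is a single-producer/single-consumer ring: the RX/XDP CPU produces TX descriptors (or returned pages), the completion CPU consumes them, and only atomic index updates are shared. A user-space flavored sketch with C11 atomics (illustrative only, not any driver's actual completion code):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  #define RING_SIZE 256U                       /* power of two, illustrative */

  struct spsc_ring {
          _Atomic unsigned int prod;           /* written only by the TX submitter */
          _Atomic unsigned int cons;           /* written only by the completion cleaner */
          void *slots[RING_SIZE];
  };

  static bool spsc_push(struct spsc_ring *r, void *page)
  {
          unsigned int p = atomic_load_explicit(&r->prod, memory_order_relaxed);
          unsigned int c = atomic_load_explicit(&r->cons, memory_order_acquire);

          if (p - c == RING_SIZE)
                  return false;                /* ring full */
          r->slots[p & (RING_SIZE - 1)] = page;
          atomic_store_explicit(&r->prod, p + 1, memory_order_release);
          return true;
  }

  static void *spsc_pop(struct spsc_ring *r)
  {
          unsigned int c = atomic_load_explicit(&r->cons, memory_order_relaxed);
          unsigned int p = atomic_load_explicit(&r->prod, memory_order_acquire);
          void *page;

          if (c == p)
                  return NULL;                 /* ring empty */
          page = r->slots[c & (RING_SIZE - 1)];
          atomic_store_explicit(&r->cons, c + 1, memory_order_release);
          return page;
  }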
|