BPF Concurrency
Kanthi P <Pavuluri.kanthi@...>
Hi, I’ve been reading that the hash map’s update-element operation is atomic, and also that we can use BPF_XADD to make the entire map update atomically. But I think that doesn’t guarantee that these updates are thread safe, meaning one CPU core can overwrite another core’s update. Is there a clean way of keeping them thread safe? Unfortunately I can’t use per-CPU maps as I need global counters. And spin locks sound like a costly operation. Can you please shed some light? Regards, Kanthi
Yonghong Song
On Fri, May 22, 2020 at 1:07 PM Kanthi P <Pavuluri.kanthi@...> wrote:
BPF_XADD makes the update of a single map element inside the kernel atomic. Could you file an issue with more details? This way, we will have a better record. Not sure what you mean here; yes, one CPU can update a map element and another can modify it. What kind of primitive do you want? Compare-and-exchange?
Andrii Nakryiko
On Fri, May 22, 2020 at 1:07 PM Kanthi P <Pavuluri.kanthi@...> wrote:
Stating that spin locks are costly without empirical data seems premature. What's the scenario? What's the number of CPUs? What's the level of contention? Under light contention, spin locks in practice would be almost as fast as atomic increments. Under heavy contention, spin locks would probably be even better than atomics, because they will not waste as much CPU as a typical atomic retry loop would. But basically, depending on your use case (which you should probably describe to get a better answer), you can either:
- do atomic increment/decrement if you need to update a counter (see examples in kernel selftests using __sync_fetch_and_add);
- use a map with bpf_spin_lock (there are also examples in selftests).
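For the second option, here is a minimal sketch of what a bpf_spin_lock-protected counter could look like (this assumes kernel >= 5.1 and a BTF-aware toolchain; the map, struct and function names are illustrative, not from this thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Map value: the lock plus the data it protects. */
struct bucket {
    struct bpf_spin_lock lock;
    __u64 tokens;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct bucket);
} bucket_map SEC(".maps");

SEC("xdp")
int take_token(struct xdp_md *ctx)
{
    __u32 key = 0;
    struct bucket *b = bpf_map_lookup_elem(&bucket_map, &key);
    int action = XDP_DROP;

    if (!b)
        return XDP_PASS;

    bpf_spin_lock(&b->lock);
    if (b->tokens > 0) {
        b->tokens--;            /* consume one token under the lock */
        action = XDP_PASS;
    }
    bpf_spin_unlock(&b->lock);
    return action;
}

char _license[] SEC("license") = "GPL";

The lock is mainly useful when several fields must change together or the new value depends on the old one; for a plain counter, the atomic increment mentioned first is simpler.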
Kanthi P <Pavuluri.kanthi@...>
Thanks Song and Andrii for the response. The use-case is global rate-limiting for incoming TCP connections, and we want to implement the token bucket algorithm using XDP for this purpose. So we are planning to have a map that holds a token counter which gets two kinds of updates:
1. Periodic increments with 'x' number of tokens per second.
2. Decrements as incoming connections arrive and consume tokens.
Most of our systems are 64-core machines. Since every core would try to update the counter in parallel as packets arrive on each of them, the problem I am imagining is that I might miss a few updates of the counter, as one core's update can overwrite another's. I guess it is still OK to lose the case 2 type of updates, as that might just allow a small fraction more or fewer connections than what is configured. But I cannot afford to lose case 1 kind of updates, as that could mean I cannot process a bunch of connections until the next second.
So if I use "__sync_fetch_and_add" for incrementing the counter (for case 1), would it guarantee that this update is never missed (even though some other core is trying to update the map at the same time to decrement the counter for an incoming connection)? My understanding is that __sync_fetch_and_add translates to BPF_XADD internally.
And it looks like spin locks are supported from 5.x kernel versions; we are on a lower version, so can't try that one at the moment.
Regards,
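For concreteness, a minimal sketch of the pattern being asked about (map and function names are assumptions, not from this thread; the map syntax assumes a recent libbpf/clang, older setups would use the legacy bpf_map_def style):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Single-entry array map holding the shared token counter. */
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __s64);
} tokens SEC(".maps");

SEC("xdp")
int consume_token(struct xdp_md *ctx)
{
    __u32 key = 0;
    __s64 *cnt = bpf_map_lookup_elem(&tokens, &key);

    if (!cnt)
        return XDP_PASS;

    /* A plain read-modify-write ("*cnt = *cnt - 1") can be lost when two
     * CPUs read the same old value.  The atomic form below is lowered to
     * BPF_XADD, so no CPU's update overwrites another's.  The periodic
     * refill (case 1) would use the same primitive, e.g.
     * __sync_fetch_and_add(cnt, x). */
    __sync_fetch_and_add(cnt, -1);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";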
Andrii Nakryiko
On Sun, Jun 14, 2020 at 4:45 PM Kanthi P <Pavuluri.kanthi@...> wrote:
You should use __sync_fetch_and_add() for both cases, and then yes, you won't lose any update. You would probably want __sync_add_and_fetch() to get the counter value after the update, but that's not supported by BPF yet; you should still get far enough with __sync_fetch_and_add().
Also, if you can use BPF global variables instead of BPF maps directly, you will avoid the map lookup overhead on the BPF side. See the BPF selftests for examples; global vars are used quite extensively there.
BTW, you mentioned that you are going to update the counter on every packet, right? On a 64-core machine, even __sync_fetch_and_add() might be too much overhead. I recommend looking at Paul McKenney's book ([0]), in particular the chapter on counting. It might give you good ideas on how to scale this further with per-CPU counters, if need be.
[0] https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
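A hedged sketch of the global-variable variant (this relies on kernel support for BPF global data, which is newer than 4.15; the variable and function names are illustrative):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Global variable: libbpf backs this with an internal data-section map
 * and the program accesses it directly, so no bpf_map_lookup_elem()
 * call is needed on the fast path. */
__s64 tokens = 0;

SEC("xdp")
int consume_token(struct xdp_md *ctx)
{
    /* Still lowered to BPF_XADD, just without the lookup overhead. */
    __sync_fetch_and_add(&tokens, -1);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

If even the atomic add turns out to be too hot on 64 cores, the per-CPU counting schemes from the book linked above (each CPU counts locally, a reader sums the per-CPU values) are the usual next step.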
Jesper Dangaard Brouer
Hi Kanthi and Quentin,
You mention a token counter as the use-case, for some kind of rate-limiting (TCP connections in your case?). That reminds me that Quentin implemented[1] a two-color token-bucket rate-limiter in BPF. (Notice the code is 3 years old and e.g. uses the old syntax for maps.) There is a really good README[1] with illustrations.
[1] https://github.com/qmonnet/tbpoc-bpf
What I found interesting is that the code doesn't actually implement this via a token counter, but instead uses a sliding time window, based on timestamps from bpf_ktime_get_ns()[2]. (P.S. As we have discussed on the xdp-newbies list, getting this timestamp also has overhead, but we can mitigate it by only taking it once per NAPI cycle.)
I would like to hear from Quentin why he avoided maintaining the token counter.
-Jesper
[2] https://github.com/qmonnet/tbpoc-bpf/blob/master/tokenbucket.c#L98
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
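For readers who don't want to open the repo, here is a very rough fixed-window approximation of the idea (this is my own sketch, not Quentin's actual code, which is linked above; constants and names are made up). Note that the window update below is a plain read-modify-write, i.e. not safe against concurrent CPUs as written:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define WINDOW_NS 1000000000ULL   /* 1 second */
#define RATE      1000ULL         /* packets allowed per window */

struct window {
    __u64 start_ns;   /* start of the current window */
    __u64 count;      /* packets accepted in this window */
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct window);
} win_map SEC(".maps");

SEC("xdp")
int ratelimit(struct xdp_md *ctx)
{
    __u32 key = 0;
    struct window *w = bpf_map_lookup_elem(&win_map, &key);
    __u64 now = bpf_ktime_get_ns();

    if (!w)
        return XDP_PASS;

    if (now - w->start_ns > WINDOW_NS) {
        /* Window expired: slide it forward ("refill the bucket").
         * This and the count update below are NOT atomic. */
        w->start_ns = now;
        w->count = 0;
    }
    if (w->count >= RATE)
        return XDP_DROP;
    w->count++;
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";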
Quentin Monnet
2020-06-16 14:18 UTC+0200 ~ Jesper Dangaard Brouer <brouer@...>
Hi Jesper, Kanthi,
This token bucket implementation was realised in the context of a research project (BEBA), for which the OPP interface, based on “extended finite state machines”, was developed. This abstraction interface can be used to implement various programs, including the token bucket application, on several platforms (we had proof-of-concept implementations on FPGAs or in software switches, for example). Since it is strictly event-based, the model does not really allow for “external” interactions such as map updates from user space, which would be necessary to increment the token counter periodically. More information on BEBA and OPP is available from the links in the GitHub README.
My objective at the time was simply to translate this token bucket example into an eBPF program, in order to show that eBPF could be a target for OPP, so I stuck to the OPP-based algorithm we had. It should work like a regular token bucket implementation; the sliding window is just another way to represent the maximum number of packets to accept in a given period of time, without having to periodically increment the counter. For example, a long shift of the sliding window (case 2 on the diagram) equates to refilling the bucket entirely.
We may have run some benchmarks at the time, but we did not compare it to a more standard implementation, so I could not really tell which is best in terms of performance. The code is from 2017, but if you want to test it, I would still expect it to load (I haven't checked recently though). Jesper, the syntax for maps is still the one used today when loading programs with iproute2, I believe?
Anyway, not sure how I can help further, but let me know if you have questions.
Best regards,
Quentin
Kanthi P <Pavuluri.kanthi@...>
Thanks Andrii. __sync_fetch_and_add doesn't seem to work as expected: it is adding the increment, but it is returning the wrong value.
I am actually hitting the same issue mentioned here: https://lists.iovisor.org/g/iovisor-dev/topic/problems_with/23670176?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,20,23670176
Can anyone tell me whether this has been fixed recently? I am on a 4.15 kernel.
Thanks,
Kanthi
Kanthi P <Pavuluri.kanthi@...>
Hi Jesper and Quentin,
Nice, I checked that logic. If I understand it right, that implementation would also need a few operations to be atomic, for example the window movements (whenever R and B are being added/subtracted). That's the issue I am attempting to solve, but I haven't been able to conclude anything yet.
Regards,
Kanthi
Yonghong Song
On Sun, Jun 21, 2020 at 4:17 PM Kanthi P <Pavuluri.kanthi@...> wrote:
You cannot use the return value. A recent LLVM should return an error if you try to use it.
There is some preliminary work to have more atomic operations in the BPF ISA: https://reviews.llvm.org/D72184. We could add a version of fetch_and_add with a proper return value. This may take some time, as we need to ensure the kernel has proper support.
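A tiny sketch of the distinction being described (illustrative names; not from the thread):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

__u64 counter;

SEC("xdp")
int bump(struct xdp_md *ctx)
{
    /* Fine: the result is ignored, so this lowers to BPF_XADD. */
    __sync_fetch_and_add(&counter, 1);

    /* Not fine before the BPF atomics extensions: BPF_XADD has no
     * "fetch", so a recent LLVM rejects the line below, while older
     * LLVM silently produced a wrong value (the 4.15-era problem
     * reported above).
     *
     *     __u64 old = __sync_fetch_and_add(&counter, 1);
     */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";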
Kanthi P <Pavuluri.kanthi@...>
Thanks, fetch_and_add would be more appropriate for my use-case.