Polling multiple BPF_MAP_TYPE_PERF_EVENT_ARRAY causing dropped events


Ian
 

The project I am working on generically loads BPF object files, pins their respective maps, and then proceeds to use perf_buffer__poll from libbpf to poll the maps. I currently am polling the multiple maps this way after loading and setting everything else up:

        while(true) {
            LIST_FOREACH(evt, &event_head, list) {
                if(evt->map_loaded == 1) {
                    err = perf_buffer__poll(evt->pb, 10);
                    if(err < 0) {
                        break;
                    }
                }
            }
        }

where an evt is a structure that looks like:

struct evt_struct {
    char * map_name;
    FILE * fp;
    int map_loaded;
    ...<some elements removed for clarity>...
    struct perf_buffer * pb;
    LIST_ENTRY(evt_struct) list;
};

Essentially, each event (evt) in this program corresponds to a BPF program. I loop through the events and call perf_buffer__poll for each of them. This doesn't seem efficient: because I iterate over the events myself, the epoll_wait that perf_buffer__poll calls loses most of its benefit. Inside perf_buffer__poll, epoll is used to poll each CPU. Is there a more efficient way to poll multiple maps like this? Does it involve dropping perf? I don't like that I have to make a separate epoll context for each BPF program I am going to poll, one that just checks the CPUs. It would be better if I just had two sets for epoll to monitor, but then I would lose the built-in perf functionality.

More important than efficiency, my current polling implementation drops a significant number of events (i.e. the lost event callback in the perf options is called). This is the issue that really must be fixed. I have some ideas that might be worth trying, but I wanted to gather more information before I do any substantial refactoring:

1) I was thinking about dropping perf and just using another BPF map type (Hash, Array) to pass elements back to user space, then using a standard epoll context to monitor all the maps' FDs. I wouldn't lose any events that way (or if I did, I would never know). But I have read in various books that perf maps are the ideal way to send data to user space...

2) Do perf maps or their buffer pages (for the mmap ring buffer) get cleaned up automatically? When do analyzed entries get removed? I tried increasing the page count of my perf buffer, and it just took longer before I started getting lost events, which almost suggests I am leaking memory. Am I using perf incorrectly? Each perf buffer is created by:

pb_opts.sample_cb = handle_events;
pb_opts.lost_cb = handle_lost_events;
evt->pb = perf_buffer__new(map_fd, 16, &pb_opts); // Where the map_fd is received from a bpf_object_get call

Any help or advice would be appreciated!

- Ian
 


Andrii Nakryiko
 

On Mon, Aug 10, 2020 at 5:22 AM Ian <@iwcampbell> wrote:

> 1) I was thinking about dropping perf and just using another BPF map type (Hash, Array) to pass elements back to user space, then using a standard epoll context to monitor all the maps' FDs. I wouldn't lose any events that way (or if I did, I would never know). But I have read in various books that perf maps are the ideal way to send data to user space...

If you have the luxury of using Linux kernel 5.8 or newer, you can try
the new BPF ring buffer map, which provides an MPSC queue (so you can
enqueue from multiple CPUs simultaneously, while the BPF perf buffer only
lets you enqueue on your current CPU). But what's more important for you,
libbpf's ring_buffer interface allows you to do exactly what you need:
poll multiple independent ring buffers simultaneously from a single
epoll FD. See [0] for an example of using that API in user-space, plus
[1] for the corresponding BPF-side code.

But having said that, we should probably extend libbpf's perf_buffer
API to support similar use cases. I'll try to do this some time soon.

[0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c#L54-L62
[1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/test_ringbuf_multi.c


> 2) Do perf maps or their buffer pages (for the mmap ring buffer) get cleaned up automatically? When do analyzed entries get removed? I tried increasing the page count of my perf buffer, and it just took longer before I started getting lost events, which almost suggests I am leaking memory. Am I using perf incorrectly? Each perf buffer is created by:
>
> pb_opts.sample_cb = handle_events;
> pb_opts.lost_cb = handle_lost_events;
> evt->pb = perf_buffer__new(map_fd, 16, &pb_opts); // Where the map_fd is received from a bpf_object_get call

Yes, after your handle_events() callback returns, libbpf marks that
sample as consumed and the space it was taking is now available for
new samples to be enqueued. You are right, though, that by increasing
the size of each per-CPU perf ring buffer, you'll delay the drops,
because now you can accumulate more samples in the ring before the
ring buffer is full.
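
To make the "delay the drops" point concrete, here is a toy model in plain C (not libbpf code). It assumes a fixed number of samples produced and consumed per tick; the function name and the tick-based model are illustrative assumptions, not anything from the kernel or libbpf.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of one per-CPU ring: on each tick the producer tries to enqueue
 * `produced` samples, then the consumer dequeues up to `consumed` samples.
 * Returns the first tick (1-based) on which a sample would be dropped,
 * or 0 if no drop happens within `max_ticks`.
 */
static size_t first_drop_tick(size_t capacity, size_t produced,
                              size_t consumed, size_t max_ticks)
{
    size_t queued = 0;

    assert(capacity > 0);
    for (size_t tick = 1; tick <= max_ticks; tick++) {
        if (queued + produced > capacity)
            return tick;            /* ring is full: this would be a lost event */
        queued += produced;
        queued -= (queued < consumed) ? queued : consumed;
    }
    return 0;                       /* consumer kept up for the whole run */
}
```

With 3 samples produced and 2 consumed per tick, a ring of 16 samples overflows on tick 15 and a ring of 256 samples on tick 255. The larger ring only postpones the first drop; it goes away only when the consumer keeps up with the producer.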


Any help or advice would be appreciated!

- Ian


Ian
 

Unfortunately, my project is currently targeting Ubuntu 20.04, which ships with Linux kernel version 5.4. It is a shame, because the new ring buffer interface looks excellent! That said, would you still suggest we use the perf functionality? Or is this currently an incorrect usage? (More on possible changes below.)

When you say delay the drops, do you mean that the threshold for dropping events is larger? So if I made my page count 256, would that make it far less likely to receive dropped events altogether? What would a suggested page count be? I initially thought 16 seemed like plenty, but I haven't found any research to support this. Will I always lose some events? Because that is the behavior I am witnessing right now: it seems like I always eventually start to lose events. Some of this might be due to a feedback loop, where my BPF program that monitors file opens collects events triggered by my own user-space program. I was thinking about having my user-space program write its PID into a BPF map, and having all my BPF programs read that map and skip any events with a matching PID. Any advice or thoughts on this would be appreciated!

Some of the event loss might also be attributed to the inefficiencies of my looping mechanism, although I think the feedback loop might be the bigger culprit. I am thinking about following the Sysdig approach, which is to have a single perf buffer that is used by all my BPF programs (16 in total). This would remove the loop and eliminate all but 1 perf buffer. I would think that would be more efficient because I am removing 15 perf buffers and their epoll_waits. Then I would use an ID member of each passed data structure to decode the data correctly.
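
A rough back-of-envelope for the looping cost, as a sketch: it assumes that a perf_buffer__poll() call on an idle buffer blocks for its full timeout before the loop moves on to the next buffer. The helper name is made up for illustration.

```c
#include <assert.h>

/*
 * Worst-case time (in ms) before the loop gets back to one busy buffer
 * when all the other buffers are idle: every idle buffer costs one full
 * poll timeout per sweep.
 */
static unsigned worst_case_revisit_ms(unsigned n_buffers, unsigned timeout_ms)
{
    assert(n_buffers > 0);
    return (n_buffers - 1) * timeout_ms;
}
```

With the 16 buffers and the 10 ms timeout used in this thread, that is up to 150 ms during which a busy buffer is not drained, which is plenty of time for a 16-page ring to fill.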


Andrii Nakryiko
 

On Wed, Aug 12, 2020 at 5:38 AM Ian <@iwcampbell> wrote:

> Unfortunately, my project is currently targeting Ubuntu 20.04, which ships with Linux kernel version 5.4. It is a shame, because the new ring buffer interface looks excellent! That said, would you still suggest we use the perf functionality? Or is this currently an incorrect usage? (More on possible changes below.)

No, the perf buffer is just fine for passing data from the BPF program in the
kernel to the user-space part for post-processing.


> When you say delay the drops, do you mean that the threshold for dropping events is larger? So if I made my page count 256, would that make it far less likely to receive dropped events altogether? What would a suggested page count be? I initially thought 16 seemed like plenty, but I haven't found any research to support this. Will I always lose some events? Because that is the behavior I am witnessing right now: it seems like I always eventually start to lose events. Some of this might be due to a feedback loop, where my BPF program that monitors file opens collects events triggered by my own user-space program. I was thinking about having my user-space program write its PID into a BPF map, and having all my BPF programs read that map and skip any events with a matching PID. Any advice or thoughts on this would be appreciated!

It's hard to give you a definitive answer; it all depends. But think
about it this way. A perf buffer is a queue. Let's say that your per-CPU
buffer size is 1MB and each of your samples is, say, 1KB. What does that
mean? It means that at any given time you can have at most 1024 samples
enqueued. So, if your BPF program in the kernel generates those 1024
samples faster than the user-space side consumes them, then you'll
have drops. So you have many ways to reduce drops:

1. Generate events at a lower rate. E.g., add sampling, filter out
events you don't need, etc. This will give the user-space side time to
consume.
2. Speed up user-space. Many things can influence this. You can do
less work per item. You can ensure you start reacting to items sooner
by increasing the priority of your consumer thread and/or pinning it to
a dedicated CPU, etc.
3. Reduce the size of the event. If you can reduce the sample size from
1KB to 512B through more efficient data encoding or by dropping
unnecessary data, you will suddenly be able to produce up to 2048 events
before running out of space. That will give your user-space more time to
consume data.
4. Increase the per-CPU buffer size. Going from 1MB to 2MB will have the
same effect as reducing the sample size from 1KB to 512B: again, it
increases the capacity of your buffer and thus gives you more time to
consume data.
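
The arithmetic behind options 3 and 4 can be sketched in a few lines of C. This is a simplification: it ignores the per-record header that the kernel adds to each sample, so the real capacity is somewhat lower. The helper names and the 10,000 events/s rate in the usage note are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the number of samples that fit in one per-CPU buffer
 * (per-record header overhead ignored, to match the 1MB / 1KB example). */
static size_t max_queued_samples(size_t buffer_bytes, size_t sample_bytes)
{
    assert(sample_bytes > 0);
    return buffer_bytes / sample_bytes;
}

/* How long (in microseconds) the consumer can stall before a full-rate
 * producer fills an initially empty buffer of `samples` slots. */
static unsigned long long headroom_us(size_t samples,
                                      unsigned long long events_per_sec)
{
    assert(events_per_sec > 0);
    return samples * 1000000ULL / events_per_sec;
}
```

For example, 1MB / 1KB gives 1024 slots; at a hypothetical 10,000 events per second on one CPU, that is about 102 ms of headroom. Halving the sample size or doubling the buffer both yield 2048 slots and roughly twice the headroom.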

Hope that makes sense and helps show why I can't answer your
questions directly; you'll need to do the analysis on your own, based on
your specific implementation and problem domain.
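
As one instance of option 1, the self-PID filter raised earlier in the thread can be modelled in plain C. This is only a model of the decision logic, not a BPF program: struct filter_cfg stands in for the value user space would write into a one-element BPF map, and both names are hypothetical. It relies on the fact that bpf_get_current_pid_tgid() returns the thread-group ID (what user space calls the PID) in the upper 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical config value that user space would store in a BPF map. */
struct filter_cfg {
    uint32_t self_tgid;     /* PID of the consumer process, as seen by user space */
};

/* Upper 32 bits of the pid_tgid value: the thread-group ID. */
static uint32_t tgid_of(uint64_t pid_tgid)
{
    return (uint32_t)(pid_tgid >> 32);
}

/* Returns true when the event was caused by the consumer itself and
 * should be skipped instead of being written to the perf buffer. */
static bool is_self_event(const struct filter_cfg *cfg, uint64_t pid_tgid)
{
    assert(cfg != NULL);
    return tgid_of(pid_tgid) == cfg->self_tgid;
}
```

In a real BPF program the same comparison would run before the sample is submitted, so filtered events never take up space in the ring.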

> Some of the event loss might also be attributed to the inefficiencies of my looping mechanism, although I think the feedback loop might be the bigger culprit. I am thinking about following the Sysdig approach, which is to have a single perf buffer that is used by all my BPF programs (16 in total). This would remove the loop and eliminate all but 1 perf buffer. I would think that would be more efficient because I am removing 15 perf buffers and their epoll_waits. Then I would use an ID member of each passed data structure to decode the data correctly.

Yes, that would be a good approach. It's better to have a single
perf_buffer that is 16x bigger and shared across all BPF programs than
16 separate smaller perf buffers, because you can absorb event spikes
more effectively.
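
A minimal sketch of the user-space side of such a shared buffer, assuming every BPF program prefixes its sample with a common header. The struct event_hdr layout, the PROG_* IDs and the counters are all hypothetical. The function takes the same arguments as a libbpf perf buffer sample callback (ctx, cpu, data, size), but returns a status code so that it can be checked in isolation; the real callback returns void.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common header placed at the start of every sample. */
struct event_hdr {
    uint32_t prog_id;   /* which BPF program produced the sample */
    uint32_t len;       /* sample length in bytes, header included */
};

enum { PROG_OPEN = 0, PROG_EXEC = 1, PROG_MAX };

static unsigned long seen[PROG_MAX];    /* per-program sample counters */

/* Returns 0 if the sample was dispatched, -1 if it was malformed. */
static int dispatch_sample(void *ctx, int cpu, void *data, uint32_t size)
{
    struct event_hdr hdr;

    (void)ctx;
    (void)cpu;
    assert(data != NULL);
    if (size < sizeof(hdr))
        return -1;
    memcpy(&hdr, data, sizeof(hdr));    /* copy out to avoid unaligned reads */
    /* `size` may exceed hdr.len if the record was padded, so only reject
     * samples that claim to be longer than what was delivered. */
    if (hdr.prog_id >= PROG_MAX || hdr.len > size)
        return -1;
    seen[hdr.prog_id]++;    /* a real consumer would decode the payload here */
    return 0;
}
```

The design choice is that one callback handles every program's events and branches on prog_id, so only one perf buffer and one poll loop are needed.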

One way I can help you, if you do need to consume multiple
PERF_EVENT_ARRAY maps, is to add perf_buffer APIs similar to
ring_buffer's that would allow you to epoll all of them
simultaneously. Let me know if you are interested. That would
effectively eliminate your outer loop (LIST_FOREACH(evt, &event_head,
list)); you would just be doing while(true) perf_buffer__poll() across
all perf buffers simultaneously. But a single perf_buffer effectively
allows you to do the same.
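
The idea of waiting on everything at once can be sketched with plain epoll, independent of libbpf. In this sketch the file descriptors are generic (the usage note uses eventfd objects as stand-ins for per-CPU perf event FDs), and watch_all / wait_ready are made-up helper names, not libbpf APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Registers every fd in `fds` with one epoll instance. data.u32 carries the
 * index so the caller can tell which source fired. Returns the epoll fd,
 * or -1 on error. At most 32 sources are supported by this sketch. */
static int watch_all(const int *fds, int n)
{
    int epfd;

    assert(n > 0 && n <= 32);
    epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.u32 = (uint32_t)i };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* One wait covers all sources. Returns the number of ready sources (or -1
 * on error) and sets bit i of *ready_mask for each ready source i. */
static int wait_ready(int epfd, int timeout_ms, uint32_t *ready_mask)
{
    struct epoll_event evs[32];
    int n = epoll_wait(epfd, evs, 32, timeout_ms);

    *ready_mask = 0;
    for (int i = 0; i < n; i++)
        *ready_mask |= 1u << evs[i].data.u32;
    return n;
}
```

As a usage example, create three descriptors with eventfd(0, EFD_NONBLOCK), pass them to watch_all(), and write an 8-byte counter value to one of them: a single wait_ready() call then reports just that source, and an idle source never costs a separate timeout.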