Re: Polling multiple BPF_MAP_TYPE_PERF_EVENT_ARRAY causing dropped events


Andrii Nakryiko

On Mon, Aug 10, 2020 at 5:22 AM Ian <icampbe14@gmail.com> wrote:

The project I am working on generically loads BPF object files, pins their respective maps, and then uses perf_buffer__poll from libbpf to poll the maps. After loading and setting everything else up, I am currently polling the multiple maps this way:

while (true) {
    LIST_FOREACH(evt, &event_head, list) {
        if (evt->map_loaded == 1) {
            err = perf_buffer__poll(evt->pb, 10);
            if (err < 0) {
                break;
            }
        }
    }
}

Where an evt is a structure that looks like:

struct evt_struct {
    char *map_name;
    FILE *fp;
    int map_loaded;
    ...<some elements removed for clarity>...
    struct perf_buffer *pb;
    LIST_ENTRY(evt_struct) list;
};

Essentially, each event (evt) in this program corresponds to a BPF program. I am looping through the events and calling perf_buffer__poll for each of them. This doesn't seem efficient, and to me it makes the epoll_wait that perf_buffer__poll calls lose any of its efficiency, since I am looping through the events beforehand anyway. Inside perf_buffer__poll, epoll is used to poll each CPU's buffer. Is there a more efficient way to poll multiple maps like this? Does it involve dropping perf? I don't like that I have to make a separate epoll context for each BPF program I am going to poll, one that just checks the CPUs. It would be better if I just had two sets for epoll to monitor, but then I would lose the built-in perf functionality.

More than just being inefficient, my current polling implementation drops a significant number of events (i.e. the lost-event callback in the perf options is called). This is the issue that really must be fixed. I have some ideas that might be worth trying, but I wanted to gather more information before doing any substantial refactoring:

1) I was thinking about dropping perf and just using another BPF map type (hash, array) to pass elements back to user space, then using a standard epoll context to monitor all of the maps' FDs. I wouldn't lose any events that way (or if I did, I would never know). But I have read in various books that perf maps are the ideal way to send data to user space...
If you have the luxury of using Linux kernel 5.8 or newer, you can try
the new BPF ring buffer map, which provides an MPSC queue (so you can
enqueue from multiple CPUs simultaneously, while the BPF perf buffer
only allows you to enqueue on your current CPU). But what's more
important for you, libbpf's ring_buffer interface allows you to do
exactly what you need: poll multiple independent ring buffers
simultaneously from a single epoll FD. See [0] for an example of using
that API in user space, plus [1] for the corresponding BPF-side code.

But having said that, we should probably extend libbpf's perf_buffer
API to support similar use cases. I'll try to do this some time soon.

[0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/prog_tests/ringbuf_multi.c#L54-L62
[1] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/progs/test_ringbuf_multi.c
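
For illustration, here is a minimal user-space sketch of that pattern, loosely following the selftest in [0]. The poll_all() helper, the handle_sample() callback, and the idea of passing in an array of pinned ringbuf map FDs are placeholders invented for this thread, not code taken from either link:

#include <stdio.h>
#include <bpf/libbpf.h>

/* Hypothetical per-sample callback: data points into the ring buffer and
 * is only valid until this function returns, so copy it out if you need
 * it for longer. */
static int handle_sample(void *ctx, void *data, size_t size)
{
    /* process one sample from whichever ring buffer produced it */
    return 0;
}

/* Register every BPF_MAP_TYPE_RINGBUF map FD with a single ring_buffer
 * manager, then drain all of them from one poll loop (libbpf keeps one
 * epoll FD underneath). */
static int poll_all(const int *map_fds, int nr_maps)
{
    struct ring_buffer *rb = NULL;
    int i, err = 0;

    for (i = 0; i < nr_maps; i++) {
        if (!rb)
            rb = ring_buffer__new(map_fds[i], handle_sample, NULL, NULL);
        else
            err = ring_buffer__add(rb, map_fds[i], handle_sample, NULL);
        if (!rb || err) {
            fprintf(stderr, "failed to set up ring buffer for fd %d\n", map_fds[i]);
            ring_buffer__free(rb); /* no-op if rb is NULL */
            return -1;
        }
    }

    /* ring_buffer__poll() returns the number of samples consumed, or a
     * negative error; it blocks for at most the given timeout in ms. */
    while ((err = ring_buffer__poll(rb, 100)) >= 0)
        ;

    ring_buffer__free(rb);
    return err;
}

On the BPF side, each program would then emit its records into a BPF_MAP_TYPE_RINGBUF map via bpf_ringbuf_reserve()/bpf_ringbuf_submit() instead of bpf_perf_event_output(), as shown in [1].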


2) Do perf maps or their buffer pages (for the mmap'd ring buffer) get cleaned up automatically? When do processed entries get removed? I tried increasing the size (page count) of my perf buffer, and it just took longer before I started getting lost events, which almost suggests I am leaking memory. Am I using perf incorrectly? Each perf buffer is created by:

struct perf_buffer_opts pb_opts = {};
pb_opts.sample_cb = handle_events;
pb_opts.lost_cb = handle_lost_events;
/* map_fd is received from a bpf_object_get call; 16 is the per-CPU ring size in pages */
evt->pb = perf_buffer__new(map_fd, 16, &pb_opts);
Yes, after your handle_events() callback returns, libbpf marks that
sample as consumed, and the space it was taking becomes available for
new samples to be enqueued. You are right, though, that by increasing
the size of each per-CPU perf ring buffer you only delay the drops,
because now you can accumulate more samples in the ring before the
ring buffer is full.
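
To make that consumption model concrete, here is a hedged sketch of what the two callbacks could look like; the struct event payload and the printed fields are made-up placeholders, but the signatures are the ones perf_buffer_opts expects:

#include <stdio.h>
#include <string.h>
#include <bpf/libbpf.h>

/* Hypothetical fixed-size record emitted by the BPF programs. */
struct event {
    int pid;
    char comm[16];
};

/* Called once per sample. data points into the mmap'd per-CPU ring and is
 * only valid for the duration of this call, so copy out anything you need
 * to keep. Once it returns, libbpf advances the consumer position and the
 * space is reused for new samples. */
static void handle_events(void *ctx, int cpu, void *data, __u32 size)
{
    struct event e;

    if (size < sizeof(e))
        return;
    memcpy(&e, data, sizeof(e));
    printf("cpu %d: pid %d comm %.16s\n", cpu, e.pid, e.comm);
}

/* Called when the kernel had to drop cnt samples on this CPU because the
 * ring filled up faster than user space drained it. */
static void handle_lost_events(void *ctx, int cpu, __u64 cnt)
{
    fprintf(stderr, "lost %llu events on CPU %d\n",
            (unsigned long long)cnt, cpu);
}

The practical implication is that keeping the sample callback short (copy the data out and defer any heavy processing) lets each ring drain faster and reduces how often handle_lost_events fires.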


Any help or advice would be appreciated!

- Ian
