Re: Polling multiple BPF_MAP_TYPE_PERF_EVENT_ARRAY causing dropped events
On Wed, Aug 12, 2020 at 5:38 AM Ian <firstname.lastname@example.org> wrote:
No, perf buffer is just fine for passing data from the BPF program in the
kernel to the user-space part for post-processing.
It's hard to give you any definitive answer; it all depends. But think
about this. Perf buffer is a queue. Let's say that your per-CPU buffer
size is 1MB and each of your samples is, say, 1KB. What does that mean? It
means that at any given time you can have at most 1024 samples
enqueued. So, if your BPF program in the kernel generates those 1024
samples faster than the user-space side consumes them, you'll
have drops. So you have many ways to reduce drops:
1. Generate events at a lower rate, e.g., add sampling, filter out
useless events, etc. This gives the user-space side time to consume them.
2. Speed up user space. Many things can influence this. You can do
less work per item. You can ensure you start reacting to items sooner
by increasing the priority of your consumer thread and/or pinning it to a
dedicated CPU, etc.
3. Reduce the size of the event. If you can reduce the sample size from
1KB to 512B through more efficient data encoding or by dropping unnecessary
data, you will suddenly be able to produce up to 2048 events before
running out of space. That will give your user space more time to
consume them.
4. Increase the per-CPU buffer size. Going from 1MB to 2MB will have the
same effect as reducing the sample size from 1KB to 512B, again
increasing the capacity of your buffer and thus giving user space more
time to consume events.
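To make the queue arithmetic above concrete, here is a small C model. It is a sketch under simplifying assumptions: fixed-size samples, a producer and a consumer that each run at a constant rate per "tick", and no per-record header or alignment overhead (real perf samples carry a small header, so actual capacity is a bit lower). `max_enqueued()` and `simulate_drops()` are hypothetical helpers, not libbpf APIs.

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size samples fit in one per-CPU buffer at once.
 * Simplified: ignores the per-record header and alignment padding. */
static size_t max_enqueued(size_t buf_bytes, size_t sample_bytes)
{
    assert(sample_bytes > 0);
    return buf_bytes / sample_bytes;
}

/* Tick-based model of one per-CPU buffer with capacity `cap` samples.
 * Each tick the BPF side tries to write `produce` samples (anything that
 * does not fit is dropped), then user space drains up to `consume`.
 * Returns the total number of dropped samples. */
static size_t simulate_drops(size_t cap, size_t produce, size_t consume,
                             size_t ticks)
{
    size_t queued = 0, dropped = 0;

    for (size_t t = 0; t < ticks; t++) {
        size_t room = cap - queued;          /* queued never exceeds cap */

        if (produce > room) {
            dropped += produce - room;       /* buffer full: lose the rest */
            queued = cap;
        } else {
            queued += produce;
        }
        queued -= consume < queued ? consume : queued;  /* user space drains */
    }
    return dropped;
}
```

With a 1MB buffer of 1KB samples and a producer running 10% faster than the consumer (110 vs. 100 samples per tick), 200 ticks lose 1076 samples; doubling capacity, via a 2MB buffer or 512B samples, cuts that to 52. The model also shows the limit of options 3 and 4: when the average production rate exceeds the consumption rate, a larger buffer only delays the drops, which is why options 1 and 2 matter under sustained load.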
Hope that makes sense and helps show why I can't give definitive answers to
your questions; you'll need to do the analysis on your own based on your
specific implementation and problem domain.
Some of the event loss might also be attributed to the inefficiencies of my looping mechanism, although I think the feedback loop might be the bigger culprit. I am thinking about following the Sysdig approach, which is to have a single perf buffer that is used by all my BPF programs (16 in total). This would remove the loop and eliminate all but 1 perf buffer. I would think that would be more efficient because I am removing 15 perf buffers and their epoll_waits. Then I would use an ID member of each passed data structure to properly read the data.
Yes, that would be a good approach. It's better to have a single,
16x bigger perf_buffer shared across all BPF programs than 16 separate
smaller perf buffers, because you can absorb event spikes more
effectively.
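A minimal sketch of the consumer side of the shared-buffer approach: every event begins with a common header carrying a type ID, and a single callback dispatches on it. The event types, struct layouts, and counters are hypothetical placeholders. The callback's shape mirrors libbpf's `perf_buffer_sample_fn` (`void (*)(void *ctx, int cpu, void *data, __u32 size)`), but plain `uint32_t` is used so the snippet compiles without libbpf headers. The BPF programs would have to write these same layouts into the one shared PERF_EVENT_ARRAY map; that side is not shown.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical event IDs, one per BPF program (the setup above has 16). */
enum event_type { EVT_EXEC = 1, EVT_OPEN = 2 };

/* Common header at the start of every event so the consumer can tell them apart. */
struct event_hdr {
    uint32_t type;
};

/* Hypothetical payloads; each embeds the header as its first member. */
struct exec_event { struct event_hdr hdr; uint32_t pid; char comm[16]; };
struct open_event { struct event_hdr hdr; uint32_t pid; int32_t flags; };

/* Hypothetical counters standing in for real post-processing. */
static unsigned long n_exec, n_open, n_bad;

/* Single sample callback for the shared buffer: read the ID, then dispatch. */
static void handle_event(void *ctx, int cpu, void *data, uint32_t size)
{
    struct event_hdr hdr;

    (void)ctx;
    (void)cpu;

    if (size < sizeof(hdr)) {        /* too short to even hold the ID */
        n_bad++;
        return;
    }
    memcpy(&hdr, data, sizeof(hdr)); /* copy instead of casting: no alignment assumptions */

    switch (hdr.type) {
    case EVT_EXEC:
        if (size < sizeof(struct exec_event)) { n_bad++; return; }
        n_exec++;                    /* real code would decode struct exec_event here */
        break;
    case EVT_OPEN:
        if (size < sizeof(struct open_event)) { n_bad++; return; }
        n_open++;                    /* real code would decode struct open_event here */
        break;
    default:
        n_bad++;                     /* unknown ID */
        break;
    }
}
```

Checking `size` against each payload type before decoding it guards against truncated or mislabeled records.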
One way I can help you, if you do need to have multiple
PERF_EVENT_ARRAY maps to consume, is to add perf_buffer
APIs similar to ring_buffer's that would allow you to epoll all of them
simultaneously. Let me know if you are interested. That would
effectively eliminate your outer loop (LIST_FOREACH(evt, &event_head,
list)); you'd just be doing while (true) perf_buffer__poll() across
all perf buffers simultaneously. But a single perf_buffer effectively
lets you do the same.
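The idea of epolling all buffers simultaneously can be illustrated with plain Linux epoll: register every file descriptor with one epoll instance and make a single `epoll_wait()` call, instead of iterating over the buffers and waiting on each in turn. As far as I know, libbpf's perf_buffer already does this internally across the per-CPU buffers of a single map. In this sketch, pipes stand in for perf buffer file descriptors, and `epoll_add_all()` / `poll_ready()` are hypothetical helpers, not libbpf functions.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 16

/* Register all fds with one epoll instance. data.u32 stores each fd's
 * index so the caller knows which source (e.g. which buffer) is ready.
 * Returns the epoll fd, or -1 on error. */
static int epoll_add_all(const int *fds, int n)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (epfd < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.u32 = (uint32_t)i };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }
    return epfd;
}

/* A single wait covers every registered source. Stores the indices of
 * ready sources in `ready` and returns their count (0 on timeout, -1 on
 * error). `max` must be at least 1. */
static int poll_ready(int epfd, int timeout_ms, int *ready, int max)
{
    struct epoll_event evs[MAX_EVENTS];

    if (max > MAX_EVENTS)
        max = MAX_EVENTS;
    int n = epoll_wait(epfd, evs, max, timeout_ms);
    for (int i = 0; i < n; i++)
        ready[i] = (int)evs[i].data.u32;
    return n;
}
```

Each returned index tells the caller which source to drain, so no time is spent blocked on an idle buffer while a busy one fills up.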