High volume bpf_perf_output tracing


wes.vaske@...
 

I'm currently working on a Python script to trace the nvme driver. I'm hitting a performance bottleneck in the event callback in Python and am looking for the best way (or maybe a quick and dirty way) to improve performance.

Currently I'm attaching to a kprobe and 2 tracepoints and using perf_submit to pass information back to userspace.

When my callback is:
def count_only(cpu, data, size):
    global event_count
    event_count += 1

My throughput is ~2,000,000 events per second
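
For reference, in both cases the callback is wired up the standard bcc way, roughly like this sketch (the map name "events" and the page_cnt value are placeholders rather than my exact setup):

b["events"].open_perf_buffer(count_only, page_cnt=64)  # larger ring buffer helps avoid lost samples
while True:
    b.perf_buffer_poll()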

When my callback is my full event processing, the throughput drops to ~40,000 events per second.
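
The full-processing version follows the usual bcc parsing pattern, roughly like the sketch below (the field names and handle() are placeholders for my real event struct and per-event work):

def full_processing(cpu, data, size):
    # bcc casts the raw sample into the generated ctypes struct
    event = b["events"].event(data)
    # per-event attribute access plus the Python-side work is where the time goes
    handle(event.ts, event.opcode, event.len)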

My first idea was to put the event_data in a Queue and have multiple worker processes handle the parsing. Unfortunately, the bcc.Table classes aren't pickleable. As soon as we start parsing data to put in the queue, we drop to 150k events per second without even touching the Queue, just from converting data types.
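
That attempt looked roughly like the sketch below (field names are placeholders); the conversion to plain, picklable Python types is the part that already costs the throughput:

import multiprocessing as mp

work_queue = mp.Queue()

def enqueue_event(cpu, data, size):
    event = b["events"].event(data)
    # ctypes-backed bcc objects can't be pickled, so copy into plain Python types first
    work_queue.put((event.ts, event.opcode, event.len))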

My next idea was to just store the data in memory and process it after the fact (for this use case, I effectively have "unlimited" memory for the trace). This ranges from 100k to 450k events per second. (I think Python has issues allocating memory quickly with list.append(), and with tuning I should be able to get 450k sustained.) This isn't terrible, but I'd like to be above 1,000,000 events per second.
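
Concretely that looks roughly like the sketch below: copy the raw sample bytes in the callback and decode them after the trace finishes (EventStruct stands in for the generated ctypes type):

import ctypes as ct

raw_events = []

def buffer_only(cpu, data, size):
    # copy the raw perf sample out of the ring buffer; decode it after the run
    raw_events.append(ct.string_at(data, size))

# after tracing:
# events = [EventStruct.from_buffer_copy(raw) for raw in raw_events]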

My next idea was to see if I can attach multiple reader processes to the same BPF map. This is where I hit the wall and came here. It looks like there isn't a way to do this with the Python API; at least not easily.

With that context, I have 2 questions:
  1. Is there a way I can attach multiple Python processes to the same BPF map to poll in parallel? Event ordering doesn't matter; I'll just post-process it all anyway. This doesn't need to be a final solution, just something to get me through the next month.
  2. What is the "right" way to do this? My primary concern is increasing the rate at which I can move data from the BPF_PERF_OUTPUT map to userspace. It looks like the Python API is being deprecated in favor of libbpf, so I'm assuming a C++ version of this script would be the "right" way? (I've never touched C/C++ outside the BPF C code, so this would need to be a future project for me.)


Thanks!


Daniel Xu
 

Hi,

Ideally you’d want to do as much work in the kernel as possible. Passing that much data to user space is kind of misusing BPF.
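
For example, if what you ultimately need per event is a count or a distribution, you can aggregate in a BPF map and read the table once at the end. A rough sketch (the probe point and the key are placeholders):

from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(counts, u64, u64);

int trace_nvme(struct pt_regs *ctx) {
    u64 key = 0;              // could key on opcode, queue id, etc.
    counts.increment(key);    // the per-event work stays in the kernel
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="nvme_setup_cmd", fn_name="trace_nvme")  # placeholder probe point
# ... run the workload, then read the aggregated table once ...
for k, v in b["counts"].items():
    print(k.value, v.value)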

What kind of work are you doing that can only be done in user space?

But otherwise, yeah, if you need performance, you might get more out of a lower-level language. C/C++ is one option; you could also check out libbpf-rs if you prefer to write in Rust.

Daniel
