I'm currently working on a python script to trace the nvme driver. I'm hitting a performance bottleneck on the event callback in python and am looking for the best way (or maybe a quick and dirty way) to improve performance.
Currently I'm attaching to a kprobe and 2 tracepoints and using perf_submit to pass information back to userspace.
When my callback is:
def count_only(cpu, data, size):
event_count += 1
My throughput is ~2,000,000 events per second
When my callback is my full event processing the throughput drops to ~40,000 events per second.
My first idea was to put the event_data in a Queue and have multiple worker processes handle the parsing. Unfortunately the bcc.Table classes aren't pickleable. As soon as we start parsing data to put in the queue we drop down to 150k events per second without even touching the Queue, just converting data types.
My next idea was to just store the data in memory and process after the fact (for this use case, I effectively have "unlimited" memory for the trace). This ranges from 100k to 450k events per second. (I think python his issues allocating memory quickly with a list.append() and with tuning I should be able to get 450k sustained). This isn't terrible but I'd like to be above 1,000,000 events per second.
My next idea was to see if I can attach multiple reader processes to the same BPF map. This is where I hit the wall and came here. It looks like there isn't a way to do this with the Python API; at least not easily.
With that context, I have 2 questions: