
__builtin_memcpy behavior

Tristan Mayfield
 

The other day I was in the process of porting a little libbpf application from Ubuntu 20 (Linux 5.4) to CentOS 8 (Linux 4.18). This program uses tracepoint:tcp:tcp_send_reset. Here's the relevant BPF code:

struct tcp_send_rst_args {
    long long pad;
    const void * skbaddr;
    const void * skaddr;
    int state;
    u16 sport;
    u16 dport;
    u8 saddr[4];
    u8 daddr[4];
    u8 saddr_v6[16];
    u8 daddr_v6[16];
};

SEC("tracepoint/tcp/tcp_send_reset")
int tcp_send_reset_prog(struct tcp_send_rst_args * args) {

    struct tcprstsend_data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;

    data.sport = args->sport;
    data.dport = args->dport;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    __builtin_memcpy(&data.saddr, args->saddr, sizeof(data.saddr));
    __builtin_memcpy(&data.daddr, args->daddr, sizeof(data.daddr));

    __builtin_memcpy(&data.saddr_v6, args->saddr_v6, sizeof(data.saddr_v6));
    __builtin_memcpy(&data.daddr_v6, args->daddr_v6, sizeof(data.daddr_v6));

    bpf_perf_event_output(args, &tcprstsend_events, BPF_F_CURRENT_CPU, &data, sizeof(data));
    return 0;
}

What I found was that this code compiles and loads into the kernel, but fails when attaching to the tracepoint.
It fails with a permission error stating that it can't be attached to the perf fd. Here's the actual message:

libbpf: program 'tracepoint/tcp/tcp_send_reset': failed to attach to pfd 92: Permission denied
libbpf: program 'tracepoint/tcp/tcp_send_reset': failed to attach to tracepoint 'tcp/tcp_send_reset': Permission denied

Switching from __builtin_memcpy to bpf_probe_read resolved the permission errors and let me attach to the tracepoint, but the data wasn't read correctly. The "state" member of the tcp_send_rst_args struct I defined isn't present in CentOS 8/kernel 4.18, so all my reads were off by four bytes on CentOS. It works fine if I redefine the struct as:

struct tcp_send_rst_args {
    long long pad;
    const void * skbaddr;
    const void * skaddr;
#ifndef RHEL_RELEASE_CODE
    int state; // This needs to be removed for CentOS 8/Linux 4.18
#endif
    u16 sport;
    u16 dport;
    u8 saddr[4];
    u8 daddr[4];
    u8 saddr_v6[16];
    u8 daddr_v6[16];
};

Now I'm a bit confused, because __builtin_memcpy seemed to fail at attach time rather than load time. Still, it did actually fail (albeit with error messages that were really hard to debug; I will never NOT check the tracepoint format file again), whereas bpf_probe_read just happily read past the struct.
I'm not sure where the memory it was reading came from, or whether that should be defined behavior, but I thought I would send this here and see if this is intended or if I have actually found something unexpected. Should __builtin_memcpy be used, or bpf_probe_read? If bpf_probe_read is recommended, is there a way to verify that we're not reading garbage data in this context, other than having a human eyeball the data returned? Or is that just a necessary part of BPF development here? Is this something the verifier can even check at load time? I can provide more information on the program and/or bug if needed, thanks!


Re: android adeb KASAN_SHADOW_SCALE_SHIFT

Yonghong Song
 

Unfortunately, the value is defined in a kernel Makefile:

```
ifeq ($(CONFIG_KASAN_SW_TAGS), y)
KASAN_SHADOW_SCALE_SHIFT := 4
else ifeq ($(CONFIG_KASAN_GENERIC), y)
KASAN_SHADOW_SCALE_SHIFT := 3
endif

KBUILD_CFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_CPPFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_AFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
```

We could add something like the above to helpers.h, e.g.:
```
#if defined(__aarch64__)
#if defined(CONFIG_KASAN_SW_TAGS)
#define KASAN_SHADOW_SCALE_SHIFT 4
#elif defined(CONFIG_KASAN_GENERIC)
#define KASAN_SHADOW_SCALE_SHIFT 3
#endif
#endif
```

You can also add the above code to the tool itself.

On Wed, Feb 10, 2021 at 10:18 AM katrina lulz
<anotherworkqueue@...> wrote:

Hi *,
I managed to setup adeb on a pixel4 with custom kernel compiled as suggested by adeb's README.
The setup works fine for some BCC tools, such as vfsstat, but a few, such as opensnoop and the trace command, return the following error:

In file included from ./arch/arm64/include/asm/thread_info.h:13:
./arch/arm64/include/asm/memory.h:136:24: error: use of undeclared identifier 'KASAN_SHADOW_SCALE_SHIFT'
return kimage_vaddr - KIMAGE_VADDR;

I verified via the config.gz on the device that IKHEADERS and the other BPF-related configs are correctly enabled.
Any ideas on how to fix the above error?

thanks,
best.


Re: BPF perf event: runq length

Yonghong Song
 

On Mon, Feb 15, 2021 at 3:45 AM Raga lahari <ragalahari.potti@...> wrote:

Hi,


I am trying to write a BPF perf event program to get the CPU runq length. The following is the code snippet. I am observing a large integer (len is 2839296536) as the queue length in the trace output for some instances.


Can someone please let me know whether this approach helps to get the length?

Take a look at the bcc tool runqlen.py. Did you get an abnormal len with runqlen.py?



struct cfs_rq_partial {
    struct load_weight load;
    unsigned long runnable_weight;
    unsigned int nr_running;
    unsigned int h_nr_running;
};

#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    struct cfs_rq_partial *my_q = NULL;
    struct task_struct *task = NULL;
    unsigned int len;

    task = (struct task_struct *)bpf_get_current_task();
    my_q = _(task->se.cfs_rq);
    len = _(my_q->nr_running);
    bpf_printk("len is %u", len);

    …..
}


I have tested with another program and confirmed that cfs_rq has the runnable_weight field.




Regards,
Ragalahari


BPF perf event: runq length

Raga lahari
 

Hi,


I am trying to write a BPF perf event program to get the CPU runq length. The following is the code snippet. I am observing a large integer (len is 2839296536) as the queue length in the trace output for some instances.


Can someone please let me know whether this approach helps to get the length?


struct cfs_rq_partial {
    struct load_weight load;
    unsigned long runnable_weight;
    unsigned int nr_running;
    unsigned int h_nr_running;
};

#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})

SEC("perf_event")
int do_sample(struct bpf_perf_event_data *ctx)
{
    struct cfs_rq_partial *my_q = NULL;
    struct task_struct *task = NULL;
    unsigned int len;

    task = (struct task_struct *)bpf_get_current_task();
    my_q = _(task->se.cfs_rq);
    len = _(my_q->nr_running);
    bpf_printk("len is %u", len);

    …..
}


I have tested with another program and confirmed that cfs_rq has the runnable_weight field.




Regards,
Ragalahari


android adeb KASAN_SHADOW_SCALE_SHIFT

katrina lulz
 

Hi *,
I managed to setup adeb on a pixel4 with custom kernel compiled as suggested by adeb's README.
The setup works fine for some BCC tools, such as vfsstat, but a few, such as opensnoop and the trace command, return the following error:

In file included from ./arch/arm64/include/asm/thread_info.h:13:
./arch/arm64/include/asm/memory.h:136:24: error: use of undeclared identifier 'KASAN_SHADOW_SCALE_SHIFT'
        return kimage_vaddr - KIMAGE_VADDR;
I verified via the config.gz on the device that IKHEADERS and the other BPF-related configs are correctly enabled.
Any ideas on how to fix the above error?

thanks,
best.


get function latency using ebpf-uprobe when using coroutine

Forrest Chen
 

Bcc has funclatency.py, which supports getting function latency for a user program, using pid_tgid as the key.
But when it comes to a program written in Go, which supports coroutines (goroutines), it doesn't work anymore.

Is there another way to handle this situation?

Thanks!


Re: Weird behaviour when updating a hash map from userspace

Yonghong Song
 

On Fri, Jan 15, 2021 at 12:42 PM William Findlay
<williamfindlay@...> wrote:

Hi all.

Currently debugging a very strange behaviour with eBPF hash maps and was wondering if anyone else has run into a similar issue? I am using libbpf-rs with BPF CO-RE and my kernel version is 5.9.14.

My setup: I have a map with some compound key and I am updating it once from userspace using libbpf and once (later) from a BPF program, using the same key both times, but with different values.

Here's the weird part: Somehow both key,value pairs are being stored in the map, according to output from bpftool. Even more bizarre, the value provided from userspace is essentially a "ghost value" the entire time -- all map lookups fail until the map has been updated from a BPF program as described above.

To be clear, the weirdness is two-fold:

  1. Lookup should not fail after updating the map the first time; and
  2. The second value should be overwriting the first one.

After performing both updates, here is the output from bpftool showcasing the weird behaviour:

[{
        "key": {
            "id": 3069983010007500772,
            "device": 48
        },
        "value": 10
    },{
        "key": {
            "id": 3069983010007500772,
            "device": 48
        },
        "value": 40
    }

]
Does your key data structure have padding? Different padding values
will cause different actual keys.

If padding is not an issue in your case, could you construct a test
case (best if no rust involved) so we can take a deep look?
You can file a task to document this issue if you intend to send a test case.
Thanks!


This behaviour also seems to be inconsistent between different maps and yet consistent between different runs. For some maps, I get the expected result and for others I get this weirdness instead.

Is this possibly a bug in the kernel? Any assistance would be greatly appreciated.

Regards,
William
[...]


Weird behaviour when updating a hash map from userspace

williamfindlay@...
 

Hi all.

Currently debugging a very strange behaviour with eBPF hash maps and was wondering if anyone else has run into a similar issue? I am using libbpf-rs with BPF CO-RE and my kernel version is 5.9.14.

My setup: I have a map with some compound key and I am updating it once from userspace using libbpf and once (later) from a BPF program, using the same key both times, but with different values.

Here's the weird part: Somehow both key,value pairs are being stored in the map, according to output from bpftool. Even more bizarre, the value provided from userspace is essentially a "ghost value" the entire time -- all map lookups fail until the map has been updated from a BPF program as described above.

To be clear, the weirdness is two-fold:

  1. Lookup should not fail after updating the map the first time; and
  2. The second value should be overwriting the first one.

After performing both updates, here is the output from bpftool showcasing the weird behaviour:

[{
        "key": {
            "id": 3069983010007500772,
            "device": 48
        },
        "value": 10
    },{
        "key": {
            "id": 3069983010007500772,
            "device": 48
        },
        "value": 40
    }

]

This behaviour also seems to be inconsistent between different maps and yet consistent between different runs. For some maps, I get the expected result and for others I get this weirdness instead.

Is this possibly a bug in the kernel? Any assistance would be greatly appreciated.

Regards,
William


Re: verifier: variable offset stack access question

Yonghong Song
 

On Fri, Dec 25, 2020 at 5:41 PM Andrei Matei <andreimatei1@...> wrote:

For posterity, I think I can now answer my own question. I suspect
things were different in 2018 (because otherwise I don’t see how the
referenced exchange makes sense); here’s my understanding about the
verifier’s rules for stack accesses today:

There’s two distinct aspects relevant to the use of variable stack offsets:

1) “Direct” stack access with variable offset. This is simply
forbidden; you can’t read or write from a dynamic offset in the stack
because, in the case of reads, the verifier doesn’t know what type of
memory would be returned (is it “misc” data? Is it a spilled
register?) and, in the case of writes, what stack slot’s memory type
should be updated.
Separately, when reading from the stack with a fixed offset, the
respective memory needs to have been initialized (i.e. written to)
before.

2) Passing pointers to the stack to helper functions which will write
through the pointer (such as bpf_probe_read_user()). Here, if the
stack offset is variable, then all the memory that falls within the
possible bounds has to be initialized.
If the offset is fixed, then the memory doesn’t necessarily need to be
initialized (at least not if the helper’s argument is of type
ARG_PTR_TO_UNINIT_MEM). Why the restriction in the variable offset
case? Because, in that case, it cannot be known what memory the helper
will end up initializing; if the verifier pretended that all the
memory within the offset bounds would be initialized then further
reads could leak uninitialized stack memory.
I think your above assessment is kind of correct. For any read/write
to stack in bpf programs, the stack offset must be known so the
verifier knows exactly what the program tries to do. For helpers,
variable length of stack is permitted and the verifier will do
analysis to ensure the stack meets the memory (esp. initialized
memory) requirement as stated in helper proto definition.






Re: verifier: variable offset stack access question

Yonghong Song
 

On Wed, Dec 23, 2020 at 2:21 PM Andrei Matei <andreimatei1@...> wrote:

Hello Yonghong, all,

I'm curious about a verifier workaround that Yonghong provided two years ago, in this thread.
Brendan Gregg was asking about accessing stack buffers through a register with a variable offset, and Yonghong suggested a memset as a solution:
"
You can initialize the array with ' ' to workaround the issue:
struct data_t data;
uint64_t max = sizeof(data.argv);
const char *argp = NULL;
memset(&data, ' ', sizeof(data));
bpf_probe_read(&argp, sizeof(argp), (void *)&__argv[0]);
uint64_t len = bpf_probe_read_str(&data.argv, max, argp);
len &= 0xffffffff; // to avoid: "math between fp pointer and register errs"
bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid
"

My question is - how does the memset help? I sort of understand the trouble with variable stack access (different regions of the stack can hold memory of different types), and I've looked through the verifier's code but I've failed to get a clue.
I cannot remember the details. Here, what memset did is initialize the related bytes on the stack. I guess maybe at that point bpf_probe_read_str required initialized memory?

Right now, bpf_probe_read_str does not require initialized memory, so
memset may not be necessary.


As far as actually trying the trick, I've had difficulty importing <string.h> in my bpf program. I'm not working in the context of BCC, so maybe that makes the difference. I've tried zero-ing out my buffer manually, and it didn't seem to change anything. I've had better success allocating my buffer using map memory rather than stack memory, but I'm still curious what a memset could do for me.
A lot of string.h functions are implemented as external functions in glibc. This won't work for bpf programs, as the bpf program is not linked against glibc. The clang compiler will translate the above memset into inline stores if the memset() size is small enough. Better yet, use clang's __builtin_memset(), which has no dependence on glibc.


Thanks a lot!

- Andrei


Re: verifier: variable offset stack access question

Andrei Matei
 

For posterity, I think I can now answer my own question. I suspect
things were different in 2018 (because otherwise I don’t see how the
referenced exchange makes sense); here’s my understanding about the
verifier’s rules for stack accesses today:

There’s two distinct aspects relevant to the use of variable stack offsets:

1) “Direct” stack access with variable offset. This is simply
forbidden; you can’t read or write from a dynamic offset in the stack
because, in the case of reads, the verifier doesn’t know what type of
memory would be returned (is it “misc” data? Is it a spilled
register?) and, in the case of writes, what stack slot’s memory type
should be updated.
Separately, when reading from the stack with a fixed offset, the
respective memory needs to have been initialized (i.e. written to)
before.

2) Passing pointers to the stack to helper functions which will write
through the pointer (such as bpf_probe_read_user()). Here, if the
stack offset is variable, then all the memory that falls within the
possible bounds has to be initialized.
If the offset is fixed, then the memory doesn’t necessarily need to be
initialized (at least not if the helper’s argument is of type
ARG_PTR_TO_UNINIT_MEM). Why the restriction in the variable offset
case? Because, in that case, it cannot be known what memory the helper
will end up initializing; if the verifier pretended that all the
memory within the offset bounds would be initialized then further
reads could leak uninitialized stack memory.


verifier: variable offset stack access question

Andrei Matei
 

Hello Yonghong, all,

I'm curious about a verifier workaround that Yonghong provided two years ago, in this thread.
Brendan Gregg was asking about accessing stack buffers through a register with a variable offset, and Yonghong suggested a memset as a solution:
"
You can initialize the array with ' ' to workaround the issue:
    struct data_t data;
    uint64_t max = sizeof(data.argv);
    const char *argp = NULL;
    memset(&data, ' ', sizeof(data));
    bpf_probe_read(&argp, sizeof(argp), (void *)&__argv[0]);
    uint64_t len = bpf_probe_read_str(&data.argv, max, argp);
    len &= 0xffffffff; // to avoid: "math between fp pointer and register errs"
    bpf_trace_printk("len: %d\\n", len); // sanity check: len is indeed valid
"

My question is - how does the memset help? I sort of understand the trouble with variable stack access (different regions of the stack can hold memory of different types), and I've looked through the verifier's code but I've failed to get a clue.

As far as actually trying the trick, I've had difficulty importing <string.h> in my bpf program. I'm not working in the context of BCC, so maybe that makes the difference. I've tried zero-ing out my buffer manually, and it didn't seem to change anything. I've had better success allocating my buffer using map memory rather than stack memory, but I'm still curious what a memset could do for me.

Thanks a lot!

- Andrei


[Warning ⚠] Do you understand how to build a bpf file for Snort on Fedora?

Dorian ROSSE
 

Hello, 


[Warning ⚠] Do you understand how to build a bpf file for Snort on Fedora?

Thank you in advance, 

I hope the success, 

Regards. 


Dorian Rosse 
Téléchargez Outlook pour Android


Re: High volume bpf_perf_output tracing

Daniel Xu
 

Hi,

Ideally you’d want to do as much work in the kernel as possible. Passing that much data to user space is kind of misusing BPF.

What kind of work are you doing that can only be done in user space?

But otherwise, yeah, if you need performance, you might get more power from a lower-level language. C/C++ is one option; you could also check out libbpf-rs if you prefer to write in Rust.

Daniel

On Thu, Nov 19, 2020, at 5:56 PM, wes.vaske@... wrote:
I'm currently working on a python script to trace the nvme driver. I'm
hitting a performance bottleneck on the event callback in python and am
looking for the best way (or maybe a quick and dirty way) to improve
performance.

Currently I'm attaching to a kprobe and 2 tracepoints and using
perf_submit to pass information back to userspace.

When my callback is:
def count_only(cpu, data, size):
    event_count += 1

My throughput is ~2,000,000 events per second

When my callback is my full event processing the throughput drops to
~40,000 events per second.

My first idea was to put the event_data in a Queue and have multiple
worker processes handle the parsing. Unfortunately the bcc.Table
classes aren't pickleable. As soon as we start parsing data to put in
the queue we drop down to 150k events per second without even touching
the Queue, just converting data types.

My next idea was to just store the data in memory and process after the
fact (for this use case, I effectively have "unlimited" memory for the
trace). This ranges from 100k to 450k events per second. (I think
python his issues allocating memory quickly with a list.append() and
with tuning I should be able to get 450k sustained). This isn't
terrible but I'd like to be above 1,000,000 events per second.

My next idea was to see if I can attach multiple reader processes to
the same BPF map. This is where I hit the wall and came here. It looks
like there isn't a way to do this with the Python API; at least not
easily.

With that context, I have 2 questions:
1. Is there a way I can attach multiple python processes to the same
BPF map to poll in parallel? Event ordering doesn't matter, I'll just
post process it all anyway. This doesn't need to be a final solution,
just something to get me through the next month
2. What is the "right" way to do this? My primary concern is
increasing the rate at which I can move data from the BPF_PERF_OUTPUT
map to userspace. It looks like the Python API is being deprecated in
favor of libbpf. So I'm assuming a C++ version of this script would be
the "right" way? (I've never touched C/C++ outside the BPF C code so
this would need to be a future project for me)


Thanks!


Re: BPF Maps with wildcards

Yonghong Song
 

On Thu, Nov 19, 2020 at 9:57 AM Marinos Dimolianis
<dimolianis.marinos@...> wrote:

Thanks for the response.
LPM is actually the closest solution; however, I wanted a structure closer to the way TCAMs operate, in which you can also have wildcards in the interior bits.
I believe that something like that does not exist and I need to implement it using available structures in
eBPF/XDP.

Right, BPF does not have TCAM style maps. If you organize data
structure properly, you may be able to use LPM.


On Thu, 19 Nov 2020 at 5:27 AM, Y Song <ys114321@...> wrote:

On Wed, Nov 18, 2020 at 6:20 AM <dimolianis.marinos@...> wrote:

Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
I could not find anything relevant to that, does anyone know anything further.
Are there any efforts towards that functionality?
The closest map is lpm (trie) map. You may want to take a look.


High volume bpf_perf_output tracing

wes.vaske@...
 

I'm currently working on a python script to trace the nvme driver. I'm hitting a performance bottleneck on the event callback in python and am looking for the best way (or maybe a quick and dirty way) to improve performance.

Currently I'm attaching to a kprobe and 2 tracepoints and using perf_submit to pass information back to userspace.

When my callback is:
def count_only(cpu, data, size):
    event_count += 1

My throughput is ~2,000,000 events per second

When my callback is my full event processing the throughput drops to ~40,000 events per second.

My first idea was to put the event_data in a Queue and have multiple worker processes handle the parsing. Unfortunately the bcc.Table classes aren't pickleable. As soon as we start parsing data to put in the queue we drop down to 150k events per second without even touching the Queue, just converting data types.

My next idea was to just store the data in memory and process after the fact (for this use case, I effectively have "unlimited" memory for the trace). This ranges from 100k to 450k events per second. (I think python has issues allocating memory quickly with list.append(), and with tuning I should be able to get 450k sustained). This isn't terrible but I'd like to be above 1,000,000 events per second.

My next idea was to see if I can attach multiple reader processes to the same BPF map. This is where I hit the wall and came here. It looks like there isn't a way to do this with the Python API; at least not easily.

With that context, I have 2 questions:
  1. Is there a way I can attach multiple python processes to the same BPF map to poll in parallel? Event ordering doesn't matter, I'll just post process it all anyway. This doesn't need to be a final solution, just something to get me through the next month
  2. What is the "right" way to do this? My primary concern is increasing the rate at which I can move data from the BPF_PERF_OUTPUT map to userspace. It looks like the Python API is being deprecated in favor of libbpf. So I'm assuming a C++ version of this script would be the "right" way? (I've never touched C/C++ outside the BPF C code so this would need to be a future project for me)


Thanks!


Re: BPF Maps with wildcards

Marinos Dimolianis
 

Thanks for the response.
LPM is actually the closest solution; however, I wanted a structure closer to the way TCAMs operate, in which you can also have wildcards in the interior bits.
I believe that something like that does not exist and I need to implement it using available structures in eBPF/XDP.

On Thu, 19 Nov 2020 at 5:27 AM, Y Song <ys114321@...> wrote:

On Wed, Nov 18, 2020 at 6:20 AM <dimolianis.marinos@...> wrote:
>
> Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
> I could not find anything relevant to that, does anyone know anything further.
> Are there any efforts towards that functionality?

The closest map is lpm (trie) map. You may want to take a look.


Re: BPF Maps with wildcards

Yonghong Song
 

On Wed, Nov 18, 2020 at 6:20 AM <dimolianis.marinos@...> wrote:

Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
I could not find anything relevant to that, does anyone know anything further.
Are there any efforts towards that functionality?
The closest map is lpm (trie) map. You may want to take a look.


BPF Maps with wildcards

Marinos Dimolianis
 

Hi all, I am trying to find a way to represent wildcards in BPF Map Keys?
I could not find anything relevant to that, does anyone know anything further.
Are there any efforts towards that functionality?
Regards,
Marinos


Attaching dynamic uprobe to C++ library/application #bcc

harnan@...
 

Hi all,

I am learning about eBPF and the BCC tools/library. I have a question about dynamic uprobes on C++ code. I have been able to attach a uprobe successfully by looking up the mangled symbol name. However, I am curious how the BPF program can access the parameters or arguments of a function I am probing. For a C++ object, do I just create an equivalent C struct that represents the application's C++ object/class, and then typecast the argument (from PT_REGS_PARM[x](ctx))?

Thanks!
Siva