Date   

Questions about samples/bpf/sockex3_user.c

yaofighting
 

Hello iovisor friends,

I found that there seems to be a problem in the eBPF sample source code of the Linux kernel samples/bpf/sockex3_user.c. This program contains the following code:

printf("%s.%05d -> %s.%05d %12lld %12lld",
                   inet_ntoa((struct in_addr){htonl(next_key.src)}),
                   next_key.port16[0],
                   inet_ntoa((struct in_addr){htonl(next_key.dst)}),
                   next_key.port16[1],
                   value.bytes, value.packets);
the "port16[0]" is the source port and the "port16[1]" is the destination port.
I found that "ports" was obtained in the following way when I view the samples/bpf/sockex3_kern.c:

g->flow.ports = load_word(skb, nhoff);

The byte order of my host is Little-Endian. In this case, port[1] should be the source port, while port[0] is the destination port.I tested this on my machine. Is this an error in this example?I hope I can get your answers and help.


Thank you,

Siyao Zhou


Re: one-shot BPF program in the context of a specific PID

Andrei Matei
 


On Wed, Dec 21, 2022 at 2:59 PM Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Tue, Dec 20, 2022 at 8:03 AM Andrei Matei <andreimatei1@...> wrote:
>
> Hi Alexei,
>
> I'm playing around with using BPF to read a given process' memory for debugging purposes. I'd like to recreate some of the experience that a debugger gives you by stopping and ptrace-ing another process, except without the "stopping" part. One of the aspects is around getting a snapshot of the running process; for example, for a Go program, this involves reading the information that the Go runtime has about all the Goroutines, walking their stacks, and collecting different variables from the different stack frames.
> You'll notice that this use case doesn't quite fit as a uprobe - I don't want the BPF program to be run when a particular program counter is hit; instead, I want the BPF program to run whenever the debugger decides to run it. Crucially, the BPF program needs to run *within the virtual memory context* of the debugged program, so it can bpf_probe_read_user() its memory. So, I want process A to trigger a BPF program that will execute within process B. Or, is there perhaps a way to read the virtual memory of an arbitrary program?

yes. That's what bpf iterator of task->vma is for.
The prog doesn't need to execute "within process B" to read its memory.

That's great news! I will play around with these iterators.

Thank you!

 

> > Have you considered using a task iterator parametrized with a particular task?
>
> I had not. I'm reading about it now, but I'm not sure if it helps me. If it applies, can you please say more?

I think that is exactly what you need.
The iterator can read mm of another process as long as
you can ptrace it.
The same permission checks as gdb.
The main difference is that bpf iter doesn't stop another process.
See task iter and vma selftests including bpf_find_vma helper
that can also be useful.

> Thanks!
>
> On Tue, Dec 20, 2022 at 9:16 AM Alexei Starovoitov <alexei.starovoitov@...> wrote:
>>
>> On Sun, Dec 18, 2022 at 4:09 PM Andrei Matei <andreimatei1@...> wrote:
>> >
>> > Hello iovisor friends,
>> >
>> > I'm curious what my options are for running a BPF program once, immediately, in the virtual memory context of a particular (user space) process. For example, say I want to read the current value from a known virtual memory address in the process' space. I'm curious if there's an official answer or, short of that, tricks that people might have used.
>> > What I want is similar in spirit to BPF_PROG_RUN, I think, except that I think I want my program type to be perf-event (and BPF_PROG_RUN doesn't seem to support this program type), and I want to also control specify which process I'm interested in.
>> >
>> > I feel like one solution might be around sending a signal to the process I'm interested in and placing a uprobe somewhere on the signal handling path, but I'm not sure of a general way to do this. Any suggestion is most welcome.
>>
>> Could you describe what prog is going to do?
>> Have you considered using a task iterator parametrized with a particular task?


Re: one-shot BPF program in the context of a specific PID

Alexei Starovoitov
 

On Tue, Dec 20, 2022 at 8:03 AM Andrei Matei <andreimatei1@...> wrote:

Hi Alexei,

I'm playing around with using BPF to read a given process' memory for debugging purposes. I'd like to recreate some of the experience that a debugger gives you by stopping and ptrace-ing another process, except without the "stopping" part. One of the aspects is around getting a snapshot of the running process; for example, for a Go program, this involves reading the information that the Go runtime has about all the Goroutines, walking their stacks, and collecting different variables from the different stack frames.
You'll notice that this use case doesn't quite fit as a uprobe - I don't want the BPF program to be run when a particular program counter is hit; instead, I want the BPF program to run whenever the debugger decides to run it. Crucially, the BPF program needs to run *within the virtual memory context* of the debugged program, so it can bpf_probe_read_user() its memory. So, I want process A to trigger a BPF program that will execute within process B. Or, is there perhaps a way to read the virtual memory of an arbitrary program?
yes. That's what bpf iterator of task->vma is for.
The prog doesn't need to execute "within process B" to read its memory.

Have you considered using a task iterator parametrized with a particular task?
I had not. I'm reading about it now, but I'm not sure if it helps me. If it applies, can you please say more?
I think that is exactly what you need.
The iterator can read mm of another process as long as
you can ptrace it.
The same permission checks as gdb.
The main difference is that bpf iter doesn't stop another process.
See task iter and vma selftests including bpf_find_vma helper
that can also be useful.

Thanks!

On Tue, Dec 20, 2022 at 9:16 AM Alexei Starovoitov <alexei.starovoitov@...> wrote:

On Sun, Dec 18, 2022 at 4:09 PM Andrei Matei <andreimatei1@...> wrote:

Hello iovisor friends,

I'm curious what my options are for running a BPF program once, immediately, in the virtual memory context of a particular (user space) process. For example, say I want to read the current value from a known virtual memory address in the process' space. I'm curious if there's an official answer or, short of that, tricks that people might have used.
What I want is similar in spirit to BPF_PROG_RUN, I think, except that I think I want my program type to be perf-event (and BPF_PROG_RUN doesn't seem to support this program type), and I want to also control specify which process I'm interested in.

I feel like one solution might be around sending a signal to the process I'm interested in and placing a uprobe somewhere on the signal handling path, but I'm not sure of a general way to do this. Any suggestion is most welcome.
Could you describe what prog is going to do?
Have you considered using a task iterator parametrized with a particular task?


Re: one-shot BPF program in the context of a specific PID

Andrei Matei
 

Hi Alexei,

I'm playing around with using BPF to read a given process' memory for debugging purposes. I'd like to recreate some of the experience that a debugger gives you by stopping and ptrace-ing another process, except without the "stopping" part. One of the aspects is around getting a snapshot of the running process; for example, for a Go program, this involves reading the information that the Go runtime has about all the Goroutines, walking their stacks, and collecting different variables from the different stack frames.
You'll notice that this use case doesn't quite fit as a uprobe - I don't want the BPF program to be run when a particular program counter is hit; instead, I want the BPF program to run whenever the debugger decides to run it. Crucially, the BPF program needs to run *within the virtual memory context* of the debugged program, so it can bpf_probe_read_user() its memory. So, I want process A to trigger a BPF program that will execute within process B. Or, is there perhaps a way to read the virtual memory of an arbitrary program?

> Have you considered using a task iterator parametrized with a particular task?

I had not. I'm reading about it now, but I'm not sure if it helps me. If it applies, can you please say more?

Thanks!

On Tue, Dec 20, 2022 at 9:16 AM Alexei Starovoitov <alexei.starovoitov@...> wrote:
On Sun, Dec 18, 2022 at 4:09 PM Andrei Matei <andreimatei1@...> wrote:
>
> Hello iovisor friends,
>
> I'm curious what my options are for running a BPF program once, immediately, in the virtual memory context of a particular (user space) process. For example, say I want to read the current value from a known virtual memory address in the process' space. I'm curious if there's an official answer or, short of that, tricks that people might have used.
> What I want is similar in spirit to BPF_PROG_RUN, I think, except that I think I want my program type to be perf-event (and BPF_PROG_RUN doesn't seem to support this program type), and I want to also control specify which process I'm interested in.
>
> I feel like one solution might be around sending a signal to the process I'm interested in and placing a uprobe somewhere on the signal handling path, but I'm not sure of a general way to do this. Any suggestion is most welcome.

Could you describe what prog is going to do?
Have you considered using a task iterator parametrized with a particular task?


Re: one-shot BPF program in the context of a specific PID

Alexei Starovoitov
 

On Sun, Dec 18, 2022 at 4:09 PM Andrei Matei <andreimatei1@...> wrote:

Hello iovisor friends,

I'm curious what my options are for running a BPF program once, immediately, in the virtual memory context of a particular (user space) process. For example, say I want to read the current value from a known virtual memory address in the process' space. I'm curious if there's an official answer or, short of that, tricks that people might have used.
What I want is similar in spirit to BPF_PROG_RUN, I think, except that I think I want my program type to be perf-event (and BPF_PROG_RUN doesn't seem to support this program type), and I want to also control specify which process I'm interested in.

I feel like one solution might be around sending a signal to the process I'm interested in and placing a uprobe somewhere on the signal handling path, but I'm not sure of a general way to do this. Any suggestion is most welcome.
Could you describe what prog is going to do?
Have you considered using a task iterator parametrized with a particular task?


one-shot BPF program in the context of a specific PID

Andrei Matei
 

Hello iovisor friends,

I'm curious what my options are for running a BPF program once, immediately, in the virtual memory context of a particular (user space) process. For example, say I want to read the current value from a known virtual memory address in the process' space. I'm curious if there's an official answer or, short of that, tricks that people might have used.
What I want is similar in spirit to BPF_PROG_RUN, I think, except that I think I want my program type to be perf-event (and BPF_PROG_RUN doesn't seem to support this program type), and I want to also control specify which process I'm interested in.

I feel like one solution might be around sending a signal to the process I'm interested in and placing a uprobe somewhere on the signal handling path, but I'm not sure of a general way to do this. Any suggestion is most welcome.

Thank you!

- Andrei


Re: Print slow commit operations in nfsslower

TATSUKAWA KOSUKE(立川 江介)
 

Hi,

On Tue, Nov 15, 2022 at 2:04 AM TATSUKAWA KOSUKE(立川 江介)
<tatsu-ab1@...> wrote:

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.
The concept and code looks okay to me. One thing is you could use
RAW_TRACEPOINT_PROBE macro to be consistent with some other
bcc tools raw tracepoint usages.

Please submit a pull request.
thank you for the reply. I'll change the code to use
RAW_TRACEPOINT_PROBE macro used in other bcc tools.

While testing on some old environments which doesn't support raw
tracepoints, I noticed that linux/nfs_fs.h header file causes bcc to
print the error message "use of undeclared identifier 'KBUILD_MODNAME'".
This problem seems to affect other bcc tools, which #define this
symbol to work around it.

I'll submit a pull request once I've done a little more testing.

Best regards.


Please let me hear your opinions.

NFS commits are quite slow and are not performed often, so the patch
shouldn't incur visible overhead.

Below is a sample output from the modified tool when writing a large
file using dd command. Most commits are issued from kworker threads,
so the commit output may be mostly dropped when filtered by PID.

TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
...
16:46:11 dd 1212 W 1048576 651264 46.58 testfile
16:46:11 dd 1212 W 1048576 653312 54.41 testfile
16:46:11 dd 1212 W 1048576 654336 18.96 testfile
16:46:11 dd 1212 W 1048576 655360 49.05 testfile
16:46:11 dd 1212 W 1048576 657408 82.96 testfile
16:46:11 dd 1212 W 1048576 659456 109.25 testfile
16:46:12 dd 1212 W 1048576 660480 163.55 testfile
16:46:12 dd 1212 W 1048576 662528 205.44 testfile
16:46:13 dd 1212 W 1048576 663552 751.02 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27449.05 testfile
16:46:37 kworker/u8:5 1207 C 0 0 26725.16 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27592.04 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22188.75 testfile
16:46:37 kworker/u8:4 1206 C 0 0 26092.59 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27268.90 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22303.24 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27081.34 testfile
16:46:37 dd 1212 W 1048576 664576 24337.80 testfile
16:46:38 dd 1212 W 1048576 958464 61.77 testfile
16:46:38 dd 1212 W 1048576 960512 56.60 testfile
16:46:38 dd 1212 W 1048576 963584 55.75 testfile
16:46:38 dd 1212 W 1048576 965632 54.84 testfile
...

Best regards.
---
Kosuke TATSUKAWA | 1st Platform Software Division
| NEC Solution Innovators
| tatsu-ab1@...

------------------------------------------------------------------------
diff -rup a/tools/nfsslower.py b/tools/nfsslower.py
--- a/tools/nfsslower 2022-11-14 12:39:23.677073240 +0900
+++ b/tools/nfsslower 2022-11-14 14:44:49.960073240 +0900
@@ -9,6 +9,8 @@
# This script traces some common NFS operations: read, write, opens and
# getattr. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
+# The script also traces commit operations, which is specific to nfs and could
+# be pretty slow.
#
# WARNING: This adds low-overhead instrumentation to these NFS operations,
# including reads and writes from the file system cache. Such reads and writes
@@ -25,6 +27,8 @@
# Currently there aren't any tracepoints available for nfs_read_file,
# nfs_write_file and nfs_open_file, nfs_getattr does have entry and exit
# tracepoints but we chose to use kprobes for consistency
+# Raw tracepoints are used to trace nfs:nfs_initiate_commit and
+# nfs:nfs_commit_done.
#
# 31-Aug-2017 Samuel Nair created this. Should work with NFSv{3,4}

@@ -41,8 +45,8 @@ examples = """
./nfsslower -p 121 # trace pid 121 only
"""
parser = argparse.ArgumentParser(
- description="""Trace READ, WRITE, OPEN \
-and GETATTR NFS calls slower than a threshold,\
+ description="""Trace READ, WRITE, OPEN, GETATTR \
+and COMMIT NFS calls slower than a threshold,\
supports NFSv{3,4}""",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
@@ -66,11 +70,13 @@ bpf_text = """
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
+#include <linux/nfs_fs.h>

#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_GETATTR 3
+#define TRACE_COMMIT 4

struct val_t {
u64 ts;
@@ -79,6 +85,12 @@ struct val_t {
struct dentry *d;
};

+struct commit_t {
+ u64 ts;
+ u64 offset;
+ u64 count;
+};
+
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
@@ -93,6 +105,7 @@ struct data_t {

BPF_HASH(entryinfo, u64, struct val_t);
BPF_PERF_OUTPUT(events);
+BPF_HASH(commitinfo, u64, struct commit_t);

int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
struct iov_iter *data)
@@ -227,6 +240,56 @@ int trace_getattr_return(struct pt_regs
return trace_exit(ctx, TRACE_GETATTR);
}

+int raw_nfs_initiate_commit(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[0];
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct commit_t c = { 0 };
+
+ c.ts = bpf_ktime_get_ns();
+ bpf_probe_read_kernel(&c.offset, sizeof(cd->args.offset),
+ &cd->args.offset);
+ bpf_probe_read_kernel(&c.count, sizeof(cd->args.count), &cd->args.count);
+ commitinfo.update(&key, &c);
+ return 0;
+}
+
+int raw_nfs_commit_done(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[1];
+ struct commit_t *cp = commitinfo.lookup(&key);
+
+ if (cp) {
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct nfs_open_context *p;
+ struct dentry *de;
+ struct qstr qs;
+ u64 ts = bpf_ktime_get_ns();
+ u64 delta_us = (ts - cp->ts) / 1000;
+ u32 pid = bpf_get_current_pid_tgid() >> 32;
+ struct data_t data = {.type = TRACE_COMMIT, .offset = cp->offset,
+ .size = cp->count, .ts_us = ts/1000, .delta_us = delta_us,
+ .pid = pid};
+
+ commitinfo.delete(&key);
+ bpf_get_current_comm(&data.task, sizeof(data.task));
+
+ if(FILTER_PID)
+ return 0;
+
+ if (FILTER_US)
+ return 0;
+
+ bpf_probe_read_kernel(&p, sizeof(p), &cd->context);
+ bpf_probe_read_kernel(&de, sizeof(de), &p->dentry);
+ bpf_probe_read_kernel(&qs, sizeof(qs), &de->d_name);
+ if (qs.len) {
+ bpf_probe_read_kernel(&data.file, sizeof(data.file),
+ (void *)qs.name);
+ events.perf_submit(ctx, &data, sizeof(data));
+ }
+ }
+ return 0;
+}
+
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
@@ -253,6 +316,8 @@ def print_event(cpu, data, size):
type = 'O'
elif event.type == 3:
type = 'G'
+ elif event.type == 4:
+ type = 'C'

if(csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
@@ -273,7 +338,18 @@ def print_event(cpu, data, size):
# Currently specifically works for NFSv4, the other kprobes are generic
# so it should work with earlier NFS versions

-b = BPF(text=bpf_text)
+# The following warning is shown on kernels after linux-5.18 when using bcc.
+# Add compile option to silence it.
+# In file included from /virtual/main.c:7:
+# In file included from include/linux/nfs_fs.h:31:
+# In file included from include/linux/sunrpc/auth.h:13:
+# In file included from include/linux/sunrpc/sched.h:19:
+# include/linux/sunrpc/xdr.h:751:10: warning: result of comparison of constant 4611686018427387903 with expression of type '__u32' (aka 'unsigned int') is always false [-Wtautological-constant-out-of-range-compare]
+# if (len > SIZE_MAX / sizeof(*p))
+# ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~
+# 1 warning generated.
+b = BPF(text=bpf_text,
+ cflags=["-Wno-tautological-constant-out-of-range-compare"])
b.attach_kprobe(event="nfs_file_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs_file_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs4_file_open", fn_name="trace_file_open_entry")
@@ -286,6 +362,9 @@ b.attach_kretprobe(event="nfs4_file_open
b.attach_kretprobe(event="nfs_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_getattr", fn_name="trace_getattr_return")

+b.attach_raw_tracepoint("nfs_initiate_commit", "raw_nfs_initiate_commit")
+b.attach_raw_tracepoint("nfs_commit_done", "raw_nfs_commit_done")
+
if(csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:

---
Kosuke TATSUKAWA | 1st Platform Software Division
| NEC Solution Innovators
| tatsu-ab1@...


Re: Print slow commit operations in nfsslower

Yonghong Song
 

On Tue, Nov 15, 2022 at 2:04 AM TATSUKAWA KOSUKE(立川 江介)
<tatsu-ab1@...> wrote:

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.
The concept and code looks okay to me. One thing is you could use
RAW_TRACEPOINT_PROBE macro to be consistent with some other
bcc tools raw tracepoint usages.

Please submit a pull request.


Please let me hear your opinions.

NFS commits are quite slow and are not performed often, so the patch
shouldn't incur visible overhead.

Below is a sample output from the modified tool when writing a large
file using dd command. Most commits are issued from kworker threads,
so the commit output may be mostly dropped when filtered by PID.

TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
...
16:46:11 dd 1212 W 1048576 651264 46.58 testfile
16:46:11 dd 1212 W 1048576 653312 54.41 testfile
16:46:11 dd 1212 W 1048576 654336 18.96 testfile
16:46:11 dd 1212 W 1048576 655360 49.05 testfile
16:46:11 dd 1212 W 1048576 657408 82.96 testfile
16:46:11 dd 1212 W 1048576 659456 109.25 testfile
16:46:12 dd 1212 W 1048576 660480 163.55 testfile
16:46:12 dd 1212 W 1048576 662528 205.44 testfile
16:46:13 dd 1212 W 1048576 663552 751.02 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27449.05 testfile
16:46:37 kworker/u8:5 1207 C 0 0 26725.16 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27592.04 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22188.75 testfile
16:46:37 kworker/u8:4 1206 C 0 0 26092.59 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27268.90 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22303.24 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27081.34 testfile
16:46:37 dd 1212 W 1048576 664576 24337.80 testfile
16:46:38 dd 1212 W 1048576 958464 61.77 testfile
16:46:38 dd 1212 W 1048576 960512 56.60 testfile
16:46:38 dd 1212 W 1048576 963584 55.75 testfile
16:46:38 dd 1212 W 1048576 965632 54.84 testfile
...

Best regards.
---
Kosuke TATSUKAWA | 1st Platform Software Division
| NEC Solution Innovators
| tatsu-ab1@...

------------------------------------------------------------------------
diff -rup a/tools/nfsslower.py b/tools/nfsslower.py
--- a/tools/nfsslower 2022-11-14 12:39:23.677073240 +0900
+++ b/tools/nfsslower 2022-11-14 14:44:49.960073240 +0900
@@ -9,6 +9,8 @@
# This script traces some common NFS operations: read, write, opens and
# getattr. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
+# The script also traces commit operations, which is specific to nfs and could
+# be pretty slow.
#
# WARNING: This adds low-overhead instrumentation to these NFS operations,
# including reads and writes from the file system cache. Such reads and writes
@@ -25,6 +27,8 @@
# Currently there aren't any tracepoints available for nfs_read_file,
# nfs_write_file and nfs_open_file, nfs_getattr does have entry and exit
# tracepoints but we chose to use kprobes for consistency
+# Raw tracepoints are used to trace nfs:nfs_initiate_commit and
+# nfs:nfs_commit_done.
#
# 31-Aug-2017 Samuel Nair created this. Should work with NFSv{3,4}

@@ -41,8 +45,8 @@ examples = """
./nfsslower -p 121 # trace pid 121 only
"""
parser = argparse.ArgumentParser(
- description="""Trace READ, WRITE, OPEN \
-and GETATTR NFS calls slower than a threshold,\
+ description="""Trace READ, WRITE, OPEN, GETATTR \
+and COMMIT NFS calls slower than a threshold,\
supports NFSv{3,4}""",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
@@ -66,11 +70,13 @@ bpf_text = """
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
+#include <linux/nfs_fs.h>

#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_GETATTR 3
+#define TRACE_COMMIT 4

struct val_t {
u64 ts;
@@ -79,6 +85,12 @@ struct val_t {
struct dentry *d;
};

+struct commit_t {
+ u64 ts;
+ u64 offset;
+ u64 count;
+};
+
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
@@ -93,6 +105,7 @@ struct data_t {

BPF_HASH(entryinfo, u64, struct val_t);
BPF_PERF_OUTPUT(events);
+BPF_HASH(commitinfo, u64, struct commit_t);

int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
struct iov_iter *data)
@@ -227,6 +240,56 @@ int trace_getattr_return(struct pt_regs
return trace_exit(ctx, TRACE_GETATTR);
}

+int raw_nfs_initiate_commit(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[0];
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct commit_t c = { 0 };
+
+ c.ts = bpf_ktime_get_ns();
+ bpf_probe_read_kernel(&c.offset, sizeof(cd->args.offset),
+ &cd->args.offset);
+ bpf_probe_read_kernel(&c.count, sizeof(cd->args.count), &cd->args.count);
+ commitinfo.update(&key, &c);
+ return 0;
+}
+
+int raw_nfs_commit_done(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[1];
+ struct commit_t *cp = commitinfo.lookup(&key);
+
+ if (cp) {
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct nfs_open_context *p;
+ struct dentry *de;
+ struct qstr qs;
+ u64 ts = bpf_ktime_get_ns();
+ u64 delta_us = (ts - cp->ts) / 1000;
+ u32 pid = bpf_get_current_pid_tgid() >> 32;
+ struct data_t data = {.type = TRACE_COMMIT, .offset = cp->offset,
+ .size = cp->count, .ts_us = ts/1000, .delta_us = delta_us,
+ .pid = pid};
+
+ commitinfo.delete(&key);
+ bpf_get_current_comm(&data.task, sizeof(data.task));
+
+ if(FILTER_PID)
+ return 0;
+
+ if (FILTER_US)
+ return 0;
+
+ bpf_probe_read_kernel(&p, sizeof(p), &cd->context);
+ bpf_probe_read_kernel(&de, sizeof(de), &p->dentry);
+ bpf_probe_read_kernel(&qs, sizeof(qs), &de->d_name);
+ if (qs.len) {
+ bpf_probe_read_kernel(&data.file, sizeof(data.file),
+ (void *)qs.name);
+ events.perf_submit(ctx, &data, sizeof(data));
+ }
+ }
+ return 0;
+}
+
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
@@ -253,6 +316,8 @@ def print_event(cpu, data, size):
type = 'O'
elif event.type == 3:
type = 'G'
+ elif event.type == 4:
+ type = 'C'

if(csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
@@ -273,7 +338,18 @@ def print_event(cpu, data, size):
# Currently specifically works for NFSv4, the other kprobes are generic
# so it should work with earlier NFS versions

-b = BPF(text=bpf_text)
+# The following warning is shown on kernels after linux-5.18 when using bcc.
+# Add compile option to silence it.
+# In file included from /virtual/main.c:7:
+# In file included from include/linux/nfs_fs.h:31:
+# In file included from include/linux/sunrpc/auth.h:13:
+# In file included from include/linux/sunrpc/sched.h:19:
+# include/linux/sunrpc/xdr.h:751:10: warning: result of comparison of constant 4611686018427387903 with expression of type '__u32' (aka 'unsigned int') is always false [-Wtautological-constant-out-of-range-compare]
+# if (len > SIZE_MAX / sizeof(*p))
+# ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~
+# 1 warning generated.
+b = BPF(text=bpf_text,
+ cflags=["-Wno-tautological-constant-out-of-range-compare"])
b.attach_kprobe(event="nfs_file_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs_file_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs4_file_open", fn_name="trace_file_open_entry")
@@ -286,6 +362,9 @@ b.attach_kretprobe(event="nfs4_file_open
b.attach_kretprobe(event="nfs_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_getattr", fn_name="trace_getattr_return")

+b.attach_raw_tracepoint("nfs_initiate_commit", "raw_nfs_initiate_commit")
+b.attach_raw_tracepoint("nfs_commit_done", "raw_nfs_commit_done")
+
if(csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:




Print slow commit operations in nfsslower

TATSUKAWA KOSUKE(立川 江介)
 

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.

Please let me hear your opinions.

NFS commits are quite slow and are not performed often, so the patch
shouldn't incur visible overhead.

Below is a sample output from the modified tool when writing a large
file using dd command. Most commits are issued from kworker threads,
so the commit output may be mostly dropped when filtered by PID.

TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
...
16:46:11 dd 1212 W 1048576 651264 46.58 testfile
16:46:11 dd 1212 W 1048576 653312 54.41 testfile
16:46:11 dd 1212 W 1048576 654336 18.96 testfile
16:46:11 dd 1212 W 1048576 655360 49.05 testfile
16:46:11 dd 1212 W 1048576 657408 82.96 testfile
16:46:11 dd 1212 W 1048576 659456 109.25 testfile
16:46:12 dd 1212 W 1048576 660480 163.55 testfile
16:46:12 dd 1212 W 1048576 662528 205.44 testfile
16:46:13 dd 1212 W 1048576 663552 751.02 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27449.05 testfile
16:46:37 kworker/u8:5 1207 C 0 0 26725.16 testfile
16:46:37 kworker/u8:5 1207 C 0 0 27592.04 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22188.75 testfile
16:46:37 kworker/u8:4 1206 C 0 0 26092.59 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27268.90 testfile
16:46:37 kworker/u8:4 1206 C 0 0 22303.24 testfile
16:46:37 kworker/u8:4 1206 C 0 0 27081.34 testfile
16:46:37 dd 1212 W 1048576 664576 24337.80 testfile
16:46:38 dd 1212 W 1048576 958464 61.77 testfile
16:46:38 dd 1212 W 1048576 960512 56.60 testfile
16:46:38 dd 1212 W 1048576 963584 55.75 testfile
16:46:38 dd 1212 W 1048576 965632 54.84 testfile
...

Best regards.
---
Kosuke TATSUKAWA | 1st Platform Software Division
| NEC Solution Innovators
| tatsu-ab1@...

------------------------------------------------------------------------
diff -rup a/tools/nfsslower.py b/tools/nfsslower.py
--- a/tools/nfsslower 2022-11-14 12:39:23.677073240 +0900
+++ b/tools/nfsslower 2022-11-14 14:44:49.960073240 +0900
@@ -9,6 +9,8 @@
# This script traces some common NFS operations: read, write, opens and
# getattr. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
+# The script also traces commit operations, which is specific to nfs and could
+# be pretty slow.
#
# WARNING: This adds low-overhead instrumentation to these NFS operations,
# including reads and writes from the file system cache. Such reads and writes
@@ -25,6 +27,8 @@
# Currently there aren't any tracepoints available for nfs_read_file,
# nfs_write_file and nfs_open_file, nfs_getattr does have entry and exit
# tracepoints but we chose to use kprobes for consistency
+# Raw tracepoints are used to trace nfs:nfs_initiate_commit and
+# nfs:nfs_commit_done.
#
# 31-Aug-2017 Samuel Nair created this. Should work with NFSv{3,4}

@@ -41,8 +45,8 @@ examples = """
./nfsslower -p 121 # trace pid 121 only
"""
parser = argparse.ArgumentParser(
- description="""Trace READ, WRITE, OPEN \
-and GETATTR NFS calls slower than a threshold,\
+ description="""Trace READ, WRITE, OPEN, GETATTR \
+and COMMIT NFS calls slower than a threshold,\
supports NFSv{3,4}""",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
@@ -66,11 +70,13 @@ bpf_text = """
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
+#include <linux/nfs_fs.h>

#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_GETATTR 3
+#define TRACE_COMMIT 4

struct val_t {
u64 ts;
@@ -79,6 +85,12 @@ struct val_t {
struct dentry *d;
};

+struct commit_t {
+ u64 ts;
+ u64 offset;
+ u64 count;
+};
+
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
@@ -93,6 +105,7 @@ struct data_t {

BPF_HASH(entryinfo, u64, struct val_t);
BPF_PERF_OUTPUT(events);
+BPF_HASH(commitinfo, u64, struct commit_t);

int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
struct iov_iter *data)
@@ -227,6 +240,56 @@ int trace_getattr_return(struct pt_regs
return trace_exit(ctx, TRACE_GETATTR);
}

+int raw_nfs_initiate_commit(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[0];
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct commit_t c = { 0 };
+
+ c.ts = bpf_ktime_get_ns();
+ bpf_probe_read_kernel(&c.offset, sizeof(cd->args.offset),
+ &cd->args.offset);
+ bpf_probe_read_kernel(&c.count, sizeof(cd->args.count), &cd->args.count);
+ commitinfo.update(&key, &c);
+ return 0;
+}
+
+int raw_nfs_commit_done(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[1];
+ struct commit_t *cp = commitinfo.lookup(&key);
+
+ if (cp) {
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct nfs_open_context *p;
+ struct dentry *de;
+ struct qstr qs;
+ u64 ts = bpf_ktime_get_ns();
+ u64 delta_us = (ts - cp->ts) / 1000;
+ u32 pid = bpf_get_current_pid_tgid() >> 32;
+ struct data_t data = {.type = TRACE_COMMIT, .offset = cp->offset,
+ .size = cp->count, .ts_us = ts/1000, .delta_us = delta_us,
+ .pid = pid};
+
+ commitinfo.delete(&key);
+ bpf_get_current_comm(&data.task, sizeof(data.task));
+
+ if(FILTER_PID)
+ return 0;
+
+ if (FILTER_US)
+ return 0;
+
+ bpf_probe_read_kernel(&p, sizeof(p), &cd->context);
+ bpf_probe_read_kernel(&de, sizeof(de), &p->dentry);
+ bpf_probe_read_kernel(&qs, sizeof(qs), &de->d_name);
+ if (qs.len) {
+ bpf_probe_read_kernel(&data.file, sizeof(data.file),
+ (void *)qs.name);
+ events.perf_submit(ctx, &data, sizeof(data));
+ }
+ }
+ return 0;
+}
+
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
@@ -253,6 +316,8 @@ def print_event(cpu, data, size):
type = 'O'
elif event.type == 3:
type = 'G'
+ elif event.type == 4:
+ type = 'C'

if(csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
@@ -273,7 +338,18 @@ def print_event(cpu, data, size):
# Currently specifically works for NFSv4, the other kprobes are generic
# so it should work with earlier NFS versions

-b = BPF(text=bpf_text)
+# The following warning is shown on kernels after linux-5.18 when using bcc.
+# Add compile option to silence it.
+# In file included from /virtual/main.c:7:
+# In file included from include/linux/nfs_fs.h:31:
+# In file included from include/linux/sunrpc/auth.h:13:
+# In file included from include/linux/sunrpc/sched.h:19:
+# include/linux/sunrpc/xdr.h:751:10: warning: result of comparison of constant 4611686018427387903 with expression of type '__u32' (aka 'unsigned int') is always false [-Wtautological-constant-out-of-range-compare]
+# if (len > SIZE_MAX / sizeof(*p))
+# ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~
+# 1 warning generated.
+b = BPF(text=bpf_text,
+ cflags=["-Wno-tautological-constant-out-of-range-compare"])
b.attach_kprobe(event="nfs_file_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs_file_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs4_file_open", fn_name="trace_file_open_entry")
@@ -286,6 +362,9 @@ b.attach_kretprobe(event="nfs4_file_open
b.attach_kretprobe(event="nfs_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_getattr", fn_name="trace_getattr_return")

+b.attach_raw_tracepoint("nfs_initiate_commit", "raw_nfs_initiate_commit")
+b.attach_raw_tracepoint("nfs_commit_done", "raw_nfs_commit_done")
+
if(csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:


Re: Tracing queries within DB transactions with dbslower

Dave M
 

Hi Kevin,

I see you've made a Github issue as well
(https://github.com/iovisor/bcc/issues/4256). That's good, as I
would've recommended making one since the mailing list isn't
frequently used these days. I will respond in the issue as well and we
can continue conversation wherever you'd prefer.

The folks who did most of the implementation of `dbslower`
(https://github.com/iovisor/bcc/pulls?q=is%3Apr+dbslower+is%3Aclosed)
aren't particularly active in this repo these days. They might have
the mysql expertise necessary to answer your question. Unfortunately
I'm not sure if any of the current maintainers have experience tracing
MySQL - myself included. Regardless, if you're willing to dig around
in the MySQL source I do think your suggested approach of modifying
which functions are uprobe'd is likely to work. If current tracing of
'dispatch_command' isn't catching the actual statements in the
transaction, presumably there's some other function that is processing
the text of those statements that should be fine to trace in lieu of /
in addition to what's currently being traced.

If you do get this working we'd certainly appreciate a PR!

-Dave

On Tue, Oct 4, 2022 at 3:56 AM Kevin Maier <kevin.maier@...> wrote:

Hello,
as part of a university project I am analysing the dbslower tool to evaluate the performance cost of tracing the queries and to test what can be traced with it. As MySQL dropped support for Dtrace for v8.0+ I am using the alternative with uprobes. During my own testing with MySQL v8.0.28 I found out that regular queries get traced without issues, but if they are within a transaction only the queries b'START TRANSACTION' and b'COMMIT' are detected and none of the queries within the transaction get traced. Is there a way to adapt the tracing to include the queries within a transaction aswell, e.g. by adapting the mysql_func_name?

Best regards,
Kevin Maier


Questions about cachetop.py tool #bcc

thesummernightcat@...
 

I have some questions about cachetop.py tool.
Why the total accesses is defined as the times of entering mark_page_accessed plus mark_buffer_dirty functions?
Doesn't enter mark_buffer_dirty must also enter mark_page_accessed, that is, doesn't they overlap?
I don't really get it...... If someone can tell me why, I will be grateful!                                                                


Tracing queries within DB transactions with dbslower

Kevin Maier
 

Hello,
as part of a university project I am analysing the dbslower tool to evaluate the performance cost of tracing the queries and to test what can be traced with it. As MySQL dropped support for Dtrace for v8.0+ I am using the alternative with uprobes. During my own testing with MySQL v8.0.28 I found out that regular queries get traced without issues, but if they are within a transaction only the queries b'START TRANSACTION' and b'COMMIT' are detected and none of the queries within the transaction get traced. Is there a way to adapt the tracing to include the queries within a transaction aswell, e.g. by adapting the mysql_func_name?

Best regards,
Kevin Maier


Parsing Non linear packet payload in TC BPF programs

deepankur.gupta.97@...
 

Hi all,

We are parsing DHCP packets in TC egress program, In our case, the packet after the UDP header was not present in between skb->data and skb->data_end. On further investigation we found that it lies in non-linear portion. and we have bpf_skb_pull_data(skb,len) to direct access non-linear data.

Few questions based on above:

1. After call to bpf_skb_pull_data(skb, skb->len); the value of  skb->data and skb->data_end pointers changed. Can there be any implication down the stack with change in value of skb->data pointer. Also is this helper function analogous to skb_pull which changes the skb->data pointer and is typically moved when the packet goes up the stack as a result of packet being parsed on that layer? Do we have something similar to skb_linearize() in BPF or any other way to parse non-linear portion?

2. For our case, packet after the UDP header was in non-linear portion, can it happen that packet after IP header go in non-linear or packet after ethernet-header?

 

Regards,

Deepankur


Re: Looking for some bcc issues for new bie

Yonghong Song
 

Hi, Jules,

Just go to https://github.com/iovisor/bcc/issues and you can find
quite some issues for you to investigate. Thanks!

Yonghong

On Wed, Aug 10, 2022 at 3:02 AM Jules Irenge <jbi.octave@...> wrote:

Hi
I am looking for some bcc issues for newbie.
Any link ?

Jules
..


Looking for some bcc issues for new bie

Jules Irenge
 

Hi 
I am looking for some bcc issues for newbie.
Any link ?

Jules
..


LPC 2022 Networking and BPF Track CFP (Final Reminder)

Daniel Borkmann
 

This is the final reminder for the Call for Proposals (CFP) for the Networking
and BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which
is planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


LPC 2022 Networking and BPF Track CFP (Reminder)

Daniel Borkmann
 

This is a reminder for the Call for Proposals (CFP) for the Networking and
BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which is
planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


stackcount return *blank* stack.

Ye ZhengMao
 

https://github.com/iovisor/bcc/issues/3891
I have the same issue, anyone can help me to fix it?


LPC 2022 Networking and BPF Track CFP

Daniel Borkmann
 

We are pleased to announce the Call for Proposals (CFP) for the Networking and
BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which is
planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


RPI install with snapcraft

theycallmevirgo@...
 

I installed on latest Raspbian (bullseye) with Snapcraft. However, when I tried to run hello_world.py with python3, I got 


modprobe: FATAL: Module kheaders not found in directory /lib/modules/5.10.63-v7+

Unable to find kernel headers. Try rebuilding kernel with CONFIG_IKHEADERS=m (mo                                                                 dule) or installing the kernel development package for your running kernel versi                                                                 on.

chdir(/lib/modules/5.10.63-v7+/build): No such file or directory

Traceback (most recent call last):

  File "/bcc/examples/./hello_world.py", line 12, in <module>

    BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!                                                                 \\n"); return 0; }').trace_print()

  File "/usr/lib/python3/dist-packages/bcc/__init__.py", line 364, in __init__

    raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))

Exception: Failed to compile BPF module <text>

I guess this means I have to rebuild kernel?

Thanks so much in advance

Joe