
Re: Print slow commit operations in nfsslower

TATSUKAWA KOSUKE(立川 江介)
 

Hi,

On Tue, Nov 15, 2022 at 2:04 AM TATSUKAWA KOSUKE(立川 江介)
<tatsu-ab1@...> wrote:

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.
The concept and code look okay to me. One thing: you could use the
RAW_TRACEPOINT_PROBE macro to be consistent with raw tracepoint usage
in some other bcc tools.

Please submit a pull request.
Thank you for the reply. I'll change the code to use the
RAW_TRACEPOINT_PROBE macro used in other bcc tools.
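For reference, a rough, untested sketch of how the initiate-commit handler from the attached patch might look when rewritten with bcc's RAW_TRACEPOINT_PROBE macro (struct and map names follow the patch; the C code is shown as the Python string it would be embedded in):

```python
# Sketch: the patch's raw-tracepoint entry handler rewritten with bcc's
# RAW_TRACEPOINT_PROBE macro. The macro generates the function name from
# the tracepoint name, and bcc then auto-attaches it, so the explicit
# b.attach_raw_tracepoint("nfs_initiate_commit", ...) call can be dropped.
bpf_commit_text = r"""
RAW_TRACEPOINT_PROBE(nfs_initiate_commit)
{
    u64 key = (u64)ctx->args[0];
    struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
    struct commit_t c = { 0 };

    c.ts = bpf_ktime_get_ns();
    bpf_probe_read_kernel(&c.offset, sizeof(cd->args.offset),
                          &cd->args.offset);
    bpf_probe_read_kernel(&c.count, sizeof(cd->args.count), &cd->args.count);
    commitinfo.update(&key, &c);
    return 0;
}
"""
```

The nfs_commit_done handler would be converted the same way.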

While testing on some old environments that don't support raw
tracepoints, I noticed that the linux/nfs_fs.h header file causes bcc to
print the error message "use of undeclared identifier 'KBUILD_MODNAME'".
This problem seems to affect other bcc tools, which #define this
symbol to work around it.
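For anyone hitting the same error, a sketch of that workaround: define the symbol inside the embedded C before the offending include. The value used here is arbitrary:

```python
# Workaround used by several bcc tools: some kernel headers expect
# KBUILD_MODNAME, which is normally supplied by the kernel build system,
# so define it before the include. The define's value is arbitrary.
bpf_text_prefix = r"""
#define KBUILD_MODNAME "nfsslower"
#include <linux/nfs_fs.h>
"""

# Sanity check: the define must precede the include.
assert (bpf_text_prefix.index("#define KBUILD_MODNAME")
        < bpf_text_prefix.index("#include <linux/nfs_fs.h>"))
```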

I'll submit a pull request once I've done a little more testing.

Best regards.




Re: Print slow commit operations in nfsslower

Yonghong Song
 

On Tue, Nov 15, 2022 at 2:04 AM TATSUKAWA KOSUKE(立川 江介)
<tatsu-ab1@...> wrote:

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.
The concept and code look okay to me. One thing: you could use the
RAW_TRACEPOINT_PROBE macro to be consistent with raw tracepoint usage
in some other bcc tools.

Please submit a pull request.




Print slow commit operations in nfsslower

TATSUKAWA KOSUKE(立川 江介)
 

Hi,

I was using nfsslower in bcc tools and noticed that it doesn't print out
slow commit operations, which sometimes cause long delays in NFS writes.
I tried adding code to trace slow commit operations and created the
attached patch.

Please let me hear your opinions.

NFS commits are quite slow and are not performed often, so the patch
shouldn't incur visible overhead.

Below is a sample output from the modified tool when writing a large
file using the dd command. Most commits are issued from kworker threads,
so the commit output may be mostly dropped when filtering by PID.

TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
...
16:46:11 dd             1212   W 1048576 651264     46.58 testfile
16:46:11 dd             1212   W 1048576 653312     54.41 testfile
16:46:11 dd             1212   W 1048576 654336     18.96 testfile
16:46:11 dd             1212   W 1048576 655360     49.05 testfile
16:46:11 dd             1212   W 1048576 657408     82.96 testfile
16:46:11 dd             1212   W 1048576 659456    109.25 testfile
16:46:12 dd             1212   W 1048576 660480    163.55 testfile
16:46:12 dd             1212   W 1048576 662528    205.44 testfile
16:46:13 dd             1212   W 1048576 663552    751.02 testfile
16:46:37 kworker/u8:5   1207   C 0       0        27449.05 testfile
16:46:37 kworker/u8:5   1207   C 0       0        26725.16 testfile
16:46:37 kworker/u8:5   1207   C 0       0        27592.04 testfile
16:46:37 kworker/u8:4   1206   C 0       0        22188.75 testfile
16:46:37 kworker/u8:4   1206   C 0       0        26092.59 testfile
16:46:37 kworker/u8:4   1206   C 0       0        27268.90 testfile
16:46:37 kworker/u8:4   1206   C 0       0        22303.24 testfile
16:46:37 kworker/u8:4   1206   C 0       0        27081.34 testfile
16:46:37 dd             1212   W 1048576 664576   24337.80 testfile
16:46:38 dd             1212   W 1048576 958464     61.77 testfile
16:46:38 dd             1212   W 1048576 960512     56.60 testfile
16:46:38 dd             1212   W 1048576 963584     55.75 testfile
16:46:38 dd             1212   W 1048576 965632     54.84 testfile
...

Best regards.
---
Kosuke TATSUKAWA | 1st Platform Software Division
| NEC Solution Innovators
| tatsu-ab1@...

------------------------------------------------------------------------
diff -rup a/tools/nfsslower.py b/tools/nfsslower.py
--- a/tools/nfsslower 2022-11-14 12:39:23.677073240 +0900
+++ b/tools/nfsslower 2022-11-14 14:44:49.960073240 +0900
@@ -9,6 +9,8 @@
# This script traces some common NFS operations: read, write, opens and
# getattr. It measures the time spent in these operations, and prints details
# for each that exceeded a threshold.
+# The script also traces commit operations, which is specific to nfs and could
+# be pretty slow.
#
# WARNING: This adds low-overhead instrumentation to these NFS operations,
# including reads and writes from the file system cache. Such reads and writes
@@ -25,6 +27,8 @@
# Currently there aren't any tracepoints available for nfs_read_file,
# nfs_write_file and nfs_open_file, nfs_getattr does have entry and exit
# tracepoints but we chose to use kprobes for consistency
+# Raw tracepoints are used to trace nfs:nfs_initiate_commit and
+# nfs:nfs_commit_done.
#
# 31-Aug-2017 Samuel Nair created this. Should work with NFSv{3,4}

@@ -41,8 +45,8 @@ examples = """
./nfsslower -p 121 # trace pid 121 only
"""
parser = argparse.ArgumentParser(
- description="""Trace READ, WRITE, OPEN \
-and GETATTR NFS calls slower than a threshold,\
+ description="""Trace READ, WRITE, OPEN, GETATTR \
+and COMMIT NFS calls slower than a threshold,\
supports NFSv{3,4}""",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=examples)
@@ -66,11 +70,13 @@ bpf_text = """
#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/dcache.h>
+#include <linux/nfs_fs.h>

#define TRACE_READ 0
#define TRACE_WRITE 1
#define TRACE_OPEN 2
#define TRACE_GETATTR 3
+#define TRACE_COMMIT 4

struct val_t {
u64 ts;
@@ -79,6 +85,12 @@ struct val_t {
struct dentry *d;
};

+struct commit_t {
+ u64 ts;
+ u64 offset;
+ u64 count;
+};
+
struct data_t {
// XXX: switch some to u32's when supported
u64 ts_us;
@@ -93,6 +105,7 @@ struct data_t {

BPF_HASH(entryinfo, u64, struct val_t);
BPF_PERF_OUTPUT(events);
+BPF_HASH(commitinfo, u64, struct commit_t);

int trace_rw_entry(struct pt_regs *ctx, struct kiocb *iocb,
struct iov_iter *data)
@@ -227,6 +240,56 @@ int trace_getattr_return(struct pt_regs
return trace_exit(ctx, TRACE_GETATTR);
}

+int raw_nfs_initiate_commit(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[0];
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct commit_t c = { 0 };
+
+ c.ts = bpf_ktime_get_ns();
+ bpf_probe_read_kernel(&c.offset, sizeof(cd->args.offset),
+ &cd->args.offset);
+ bpf_probe_read_kernel(&c.count, sizeof(cd->args.count), &cd->args.count);
+ commitinfo.update(&key, &c);
+ return 0;
+}
+
+int raw_nfs_commit_done(struct bpf_raw_tracepoint_args *ctx) {
+ u64 key = (u64)ctx->args[1];
+ struct commit_t *cp = commitinfo.lookup(&key);
+
+ if (cp) {
+ struct nfs_commit_data *cd = (struct nfs_commit_data *)key;
+ struct nfs_open_context *p;
+ struct dentry *de;
+ struct qstr qs;
+ u64 ts = bpf_ktime_get_ns();
+ u64 delta_us = (ts - cp->ts) / 1000;
+ u32 pid = bpf_get_current_pid_tgid() >> 32;
+ struct data_t data = {.type = TRACE_COMMIT, .offset = cp->offset,
+ .size = cp->count, .ts_us = ts/1000, .delta_us = delta_us,
+ .pid = pid};
+
+ commitinfo.delete(&key);
+ bpf_get_current_comm(&data.task, sizeof(data.task));
+
+ if(FILTER_PID)
+ return 0;
+
+ if (FILTER_US)
+ return 0;
+
+ bpf_probe_read_kernel(&p, sizeof(p), &cd->context);
+ bpf_probe_read_kernel(&de, sizeof(de), &p->dentry);
+ bpf_probe_read_kernel(&qs, sizeof(qs), &de->d_name);
+ if (qs.len) {
+ bpf_probe_read_kernel(&data.file, sizeof(data.file),
+ (void *)qs.name);
+ events.perf_submit(ctx, &data, sizeof(data));
+ }
+ }
+ return 0;
+}
+
"""
if min_ms == 0:
bpf_text = bpf_text.replace('FILTER_US', '0')
@@ -253,6 +316,8 @@ def print_event(cpu, data, size):
type = 'O'
elif event.type == 3:
type = 'G'
+ elif event.type == 4:
+ type = 'C'

if(csv):
print("%d,%s,%d,%s,%d,%d,%d,%s" % (
@@ -273,7 +338,18 @@ def print_event(cpu, data, size):
# Currently specifically works for NFSv4, the other kprobes are generic
# so it should work with earlier NFS versions

-b = BPF(text=bpf_text)
+# The following warning is shown on kernels after linux-5.18 when using bcc.
+# Add compile option to silence it.
+# In file included from /virtual/main.c:7:
+# In file included from include/linux/nfs_fs.h:31:
+# In file included from include/linux/sunrpc/auth.h:13:
+# In file included from include/linux/sunrpc/sched.h:19:
+# include/linux/sunrpc/xdr.h:751:10: warning: result of comparison of constant 4611686018427387903 with expression of type '__u32' (aka 'unsigned int') is always false [-Wtautological-constant-out-of-range-compare]
+# if (len > SIZE_MAX / sizeof(*p))
+# ~~~ ^ ~~~~~~~~~~~~~~~~~~~~~
+# 1 warning generated.
+b = BPF(text=bpf_text,
+ cflags=["-Wno-tautological-constant-out-of-range-compare"])
b.attach_kprobe(event="nfs_file_read", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs_file_write", fn_name="trace_rw_entry")
b.attach_kprobe(event="nfs4_file_open", fn_name="trace_file_open_entry")
@@ -286,6 +362,9 @@ b.attach_kretprobe(event="nfs4_file_open
b.attach_kretprobe(event="nfs_file_open", fn_name="trace_file_open_return")
b.attach_kretprobe(event="nfs_getattr", fn_name="trace_getattr_return")

+b.attach_raw_tracepoint("nfs_initiate_commit", "raw_nfs_initiate_commit")
+b.attach_raw_tracepoint("nfs_commit_done", "raw_nfs_commit_done")
+
if(csv):
print("ENDTIME_us,TASK,PID,TYPE,BYTES,OFFSET_b,LATENCY_us,FILE")
else:


Re: Tracing queries within DB transactions with dbslower

Dave M
 

Hi Kevin,

I see you've made a Github issue as well
(https://github.com/iovisor/bcc/issues/4256). That's good, as I
would've recommended making one since the mailing list isn't
frequently used these days. I will respond in the issue as well and we
can continue the conversation wherever you'd prefer.

The folks who did most of the implementation of `dbslower`
(https://github.com/iovisor/bcc/pulls?q=is%3Apr+dbslower+is%3Aclosed)
aren't particularly active in this repo these days. They might have
the MySQL expertise necessary to answer your question. Unfortunately,
I'm not sure if any of the current maintainers have experience tracing
MySQL, myself included. Regardless, if you're willing to dig around
in the MySQL source I do think your suggested approach of modifying
which functions are uprobe'd is likely to work. If current tracing of
'dispatch_command' isn't catching the actual statements in the
transaction, presumably there's some other function that is processing
the text of those statements that should be fine to trace in lieu of /
in addition to what's currently being traced.
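As a rough sketch of that approach, one could scan the mysqld symbol table for candidate query-processing functions to uprobe. The mangled names below are hypothetical and for illustration only:

```python
import re

# Toy illustration of the suggested approach: pick candidate
# query-processing functions out of a symbol list, to uprobe instead of
# (or in addition to) dispatch_command. These mangled C++ names are
# hypothetical examples, not real mysqld symbols.
symbols = [
    "_Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command",
    "_Z21mysql_execute_commandP3THDb",
    "_Z11do_commandP3THD",
]

def find_uprobe_targets(syms, pattern=r"dispatch_command|mysql_execute_command"):
    """Return symbols whose name matches the candidate-function pattern."""
    return [s for s in syms if re.search(pattern, s)]

targets = find_uprobe_targets(symbols)
# Each target would then be attached with something like:
#   b.attach_uprobe(name="/usr/sbin/mysqld", sym=t, fn_name="query_start")
```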

If you do get this working we'd certainly appreciate a PR!

-Dave

On Tue, Oct 4, 2022 at 3:56 AM Kevin Maier <kevin.maier@...> wrote:

Hello,
as part of a university project I am analysing the dbslower tool to evaluate the performance cost of tracing queries and to test what can be traced with it. Since MySQL dropped DTrace support in v8.0+, I am using the uprobe-based alternative. During my own testing with MySQL v8.0.28 I found that regular queries get traced without issues, but if they are within a transaction, only the queries b'START TRANSACTION' and b'COMMIT' are detected, and none of the queries within the transaction get traced. Is there a way to adapt the tracing to include the queries within a transaction as well, e.g. by adapting mysql_func_name?

Best regards,
Kevin Maier


Questions about cachetop.py tool #bcc

thesummernightcat@...
 

I have some questions about the cachetop.py tool.
Why is the total access count defined as the number of times the mark_page_accessed function is entered plus the number of times mark_buffer_dirty is entered?
Doesn't entering mark_buffer_dirty also mean entering mark_page_accessed; that is, don't the two counts overlap?
I don't really get it. If someone can tell me why, I will be grateful!
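To make the overlap concern concrete, here is a toy model with made-up counts. It only illustrates the question being asked; it does not claim how cachetop actually behaves:

```python
# Toy model of the double-counting question: if every mark_buffer_dirty()
# call also ends up entering mark_page_accessed(), then summing the two
# counters counts each dirty access twice. All counts below are made up.
mpa = 100   # hypothetical: times mark_page_accessed() was entered
mbd = 30    # hypothetical: times mark_buffer_dirty() was entered

total_as_counted = mpa + mbd   # the definition the question describes
total_if_contained = mpa       # if every mbd event is already inside mpa
overlap_excess = total_as_counted - total_if_contained  # double-counted part
```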


Tracing queries within DB transactions with dbslower

Kevin Maier
 

Hello,
as part of a university project I am analysing the dbslower tool to evaluate the performance cost of tracing queries and to test what can be traced with it. Since MySQL dropped DTrace support in v8.0+, I am using the uprobe-based alternative. During my own testing with MySQL v8.0.28 I found that regular queries get traced without issues, but if they are within a transaction, only the queries b'START TRANSACTION' and b'COMMIT' are detected, and none of the queries within the transaction get traced. Is there a way to adapt the tracing to include the queries within a transaction as well, e.g. by adapting mysql_func_name?

Best regards,
Kevin Maier


Parsing Non linear packet payload in TC BPF programs

deepankur.gupta.97@...
 

Hi all,

We are parsing DHCP packets in a TC egress program. In our case, the packet data after the UDP header was not present between skb->data and skb->data_end. On further investigation we found that it lies in the non-linear portion, and that bpf_skb_pull_data(skb, len) can be used to make non-linear data directly accessible.

A few questions based on the above:

1. After the call to bpf_skb_pull_data(skb, skb->len), the values of the skb->data and skb->data_end pointers changed. Can there be any implications down the stack from this change to the skb->data pointer? Also, is this helper function analogous to skb_pull, which changes the skb->data pointer and is typically advanced as the packet goes up the stack and each layer is parsed? Do we have something similar to skb_linearize() in BPF, or any other way to parse the non-linear portion?

2. In our case, the data after the UDP header was in the non-linear portion. Can it also happen that the data after the IP header, or even after the Ethernet header, ends up in the non-linear portion?
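For reference, the fixed-header arithmetic for where the DHCP payload starts, assuming an untagged Ethernet frame and an IPv4 header without options:

```python
# Offset of the DHCP payload within the packet, assuming an untagged
# Ethernet frame and a minimal IPv4 header (IHL == 5). VLAN tags or IP
# options would push the payload further out.
ETH_HLEN = 14   # Ethernet header
IP_HLEN = 20    # minimal IPv4 header
UDP_HLEN = 8    # UDP header

dhcp_offset = ETH_HLEN + IP_HLEN + UDP_HLEN  # byte offset of DHCP payload

def in_linear_part(offset, linear_len):
    """True if the byte at `offset` lies within the skb's linear area of
    length `linear_len` (i.e. between skb->data and skb->data_end)."""
    return offset < linear_len
```

If the linear area ends at or before `dhcp_offset`, the DHCP payload sits entirely in the non-linear portion, which matches the behaviour described above.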

 

Regards,

Deepankur


Re: Looking for some bcc issues for newbie

Yonghong Song
 

Hi, Jules,

Just go to https://github.com/iovisor/bcc/issues and you can find
quite some issues for you to investigate. Thanks!

Yonghong

On Wed, Aug 10, 2022 at 3:02 AM Jules Irenge <jbi.octave@...> wrote:

Hi
I am looking for some bcc issues for a newbie.
Any link ?

Jules
..


Looking for some bcc issues for newbie

Jules Irenge
 

Hi 
I am looking for some bcc issues for a newbie.
Any link ?

Jules
..


LPC 2022 Networking and BPF Track CFP (Final Reminder)

Daniel Borkmann
 

This is the final reminder for the Call for Proposals (CFP) for the Networking
and BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which
is planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


LPC 2022 Networking and BPF Track CFP (Reminder)

Daniel Borkmann
 

This is a reminder for the Call for Proposals (CFP) for the Networking and
BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which is
planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


stackcount returns *blank* stack.

Ye ZhengMao
 

https://github.com/iovisor/bcc/issues/3891
I have the same issue, anyone can help me to fix it?


LPC 2022 Networking and BPF Track CFP

Daniel Borkmann
 

We are pleased to announce the Call for Proposals (CFP) for the Networking and
BPF track at the 2022 edition of the Linux Plumbers Conference (LPC), which is
planned to be held in Dublin, Ireland, on September 12th - 14th, 2022.

Note that the conference is planned to be both in person and remote (hybrid).
CFP submitters should ideally be able to give their presentation in person to
minimize technical issues if circumstances permit, although presenting remotely
will also be possible.

This year's Networking and BPF track technical committee is comprised of:

David S. Miller <davem@...>
Jakub Kicinski <kuba@...>
Paolo Abeni <pabeni@...>
Eric Dumazet <edumazet@...>
Alexei Starovoitov <ast@...>
Daniel Borkmann <daniel@...>
Andrii Nakryiko <andrii@...>

We are seeking proposals of 40 minutes in length (including Q&A discussion).

Any kind of advanced Linux networking and/or BPF-related topic will be considered.

Please submit your proposals through the official LPC website at:

https://lpc.events/event/16/abstracts/

Make sure to select "eBPF & Networking" in the track pull-down menu.

Proposals must be submitted by August 10th, and submitters will be notified of
acceptance by August 12th.

Final slides (as PDF) are due on the first day of the conference.

We are very much looking forward to a great conference and seeing you all!


RPI install with snapcraft

theycallmevirgo@...
 

I installed bcc on the latest Raspbian (bullseye) with Snapcraft. However, when I tried to run hello_world.py with python3, I got:


modprobe: FATAL: Module kheaders not found in directory /lib/modules/5.10.63-v7+

Unable to find kernel headers. Try rebuilding kernel with CONFIG_IKHEADERS=m (module) or installing the kernel development package for your running kernel version.

chdir(/lib/modules/5.10.63-v7+/build): No such file or directory

Traceback (most recent call last):

  File "/bcc/examples/./hello_world.py", line 12, in <module>

    BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()

  File "/usr/lib/python3/dist-packages/bcc/__init__.py", line 364, in __init__

    raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))

Exception: Failed to compile BPF module <text>

I guess this means I have to rebuild the kernel?

Thanks so much in advance

Joe
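
For what it's worth, the error message above already points at two possible remedies. A rough sketch of each on Raspberry Pi OS follows; the package name is an assumption about what the distribution provides, and none of this has been tested here:

```shell
# Option 1 (assumed package name): install the kernel headers so that
# /lib/modules/$(uname -r)/build exists and bcc can find them.
sudo apt update
sudo apt install raspberrypi-kernel-headers

# Option 2: if you rebuild the kernel with CONFIG_IKHEADERS=m, load the
# module so bcc can extract the headers at runtime.
sudo modprobe kheaders
```

If the headers package does not match the exact running kernel version, a reboot onto the matching kernel (or a kernel rebuild) may still be necessary.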


Re: #bcc Count map.ringbuf_reserve() failures #bcc

นิวัฒน์ ไชยจันทร์
 


On Tue, Nov 2, 2021 at 21:31, Eelco Chaudron <echaudro@...> wrote:

[Edited Message Follows]

On Tue, Nov 2, 2021 at 03:24 PM, Eelco Chaudron wrote:

Hi,

Was wondering if there is a way to count the number of times map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?

This way I can get notified in userspace that I have missed events, and might need to increase the buffer size.

Cheers,

Eelco

Thought I added the #bcc tag but I no longer see it :( So just in case it's not clear: this is with BCC.


Re: #bcc Count map.ringbuf_reserve() failures #bcc

Eelco Chaudron
 

On 3 Nov 2021, at 6:26, Y Song wrote:

On Tue, Nov 2, 2021 at 7:31 AM Eelco Chaudron <echaudro@...> wrote:

[Edited Message Follows]

On Tue, Nov 2, 2021 at 03:24 PM, Eelco Chaudron wrote:

Hi,

Was wondering if there is a way to count the number of times map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?

This way I can get notified in userspace that I have missed events, and might need to increase the buffer size.

Cheers,

Eelco

Thought I added the #bcc tag but I no longer see it :( So just in case it's not clear: this is with BCC.
You can check the return value of map.ringbuf_reserve(). If the
reservation fails, you can notify user space through a map, another
side-channel ringbuf, a perf buffer, etc. Depending on your program type
and program running context, you might be able to use the
bpf_send_signal() helper to send a signal to the *current* process.
Thanks, I was looking at BCC to solve this in the wrappers, but you are right, a simple BPF_TABLE() solved it.

//Eelco
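
As a concrete illustration of the approach described in this thread, a minimal BCC sketch might look like the following. The map layout, event struct, and attach point are illustrative assumptions, not tested code:

```python
from bcc import BPF

prog = r"""
BPF_RINGBUF_OUTPUT(events, 8);
// Slot 0 counts failed reservations, i.e. events dropped because the
// ring buffer was full.
BPF_TABLE("percpu_array", u32, u64, drop_cnt, 1);

struct event_t { u64 ts; };

// Attach point is illustrative; pick whatever your tool actually traces.
int kprobe__sys_sync(void *ctx) {
    struct event_t *e = events.ringbuf_reserve(sizeof(struct event_t));
    if (!e) {
        u32 key = 0;
        u64 *val = drop_cnt.lookup(&key);
        if (val)
            __sync_fetch_and_add(val, 1);
        return 0;
    }
    e->ts = bpf_ktime_get_ns();
    events.ringbuf_submit(e, 0);
    return 0;
}
"""

b = BPF(text=prog)
# User space can periodically read drop_cnt (b["drop_cnt"]) to detect
# missed events and decide whether to enlarge the ring buffer.
```

The key point is simply that ringbuf_reserve() returns NULL on failure, so the BPF side can account for every drop itself rather than relying on the wrappers.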


Re: #bcc Count map.ringbuf_reserve() failures #bcc

Yonghong Song
 

On Tue, Nov 2, 2021 at 7:31 AM Eelco Chaudron <echaudro@...> wrote:

[Edited Message Follows]

On Tue, Nov 2, 2021 at 03:24 PM, Eelco Chaudron wrote:

Hi,

Was wondering if there is a way to count the number of times map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?

This way I can get notified in userspace that I have missed events, and might need to increase the buffer size.

Cheers,

Eelco

Thought I added the #bcc tag but I no longer see it :( So just in case it's not clear: this is with BCC.
You can check the return value of map.ringbuf_reserve(). If the
reservation fails, you can notify user space through a map, another
side-channel ringbuf, a perf buffer, etc. Depending on your program type
and program running context, you might be able to use the
bpf_send_signal() helper to send a signal to the *current* process.



Re: #bcc Count map.ringbuf_reserve() failures #bcc

Eelco Chaudron
 

On Tue, Nov 2, 2021 at 03:24 PM, Eelco Chaudron wrote:

Hi,

Was wondering if there is a way to count the number of times map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?

This way I can get notified in userspace that I have missed events, and might need to increase the buffer size.

Cheers,

Eelco

Thought I added the #bcc tag but I no longer see it :( So just in case it's not clear: this is with BCC.


Count map.ringbuf_reserve() failures

Eelco Chaudron
 

Hi,

Was wondering if there is a way to count the number of times map.ringbuf_reserve() fails for a BPF_RINGBUF_OUTPUT buffer?

This way I can get notified in userspace that I have missed events, and might need to increase the buffer size.

Cheers,

Eelco


Re: Access packet payload in TC egress programs

Yonghong Song
 

On Fri, Oct 22, 2021 at 12:31 AM Federico Parola
<federico.parola@...> wrote:

Thanks for the answer, I wasn't aware of the existence of that helper.
I have two additional comments:

1. The documentation of the helper says that passing a length of zero
should pull the whole length of the packet [1]; however, with that
parameter the length of directly accessible data stays unchanged. I think
there is a mismatch between the behavior and the documentation.
The source code is
BPF_CALL_2(bpf_skb_pull_data, struct sk_buff *, skb, u32, len)
{
        /* Idea is the following: should the needed direct read/write
         * test fail during runtime, we can pull in more data and redo
         * again, since implicitly, we invalidate previous checks here.
         *
         * Or, since we know how much we need to make read/writeable,
         * this can be done once at the program beginning for direct
         * access case. By this we overcome limitations of only current
         * headroom being accessible.
         */
        return bpf_try_make_writable(skb, len ? : skb_headlen(skb));
}

So if len is 0, it will only try to make the *existing* linear data
writable, so you are right. It seems we are not trying to pull more
data in. I will check with other kernel developers later.


2. I'd like to avoid re-parsing all the headers after I have pulled new
data. To do so I save the offset I just reached (the end of the TCP
header), pull data, get the new data and data_end pointers and add the
offset to data. However the verifier does not accept my accesses to the
packet from this point on. Here is some example code:
The current behavior is that after a data pull you will need to re-parse
the packet. There are a lot of helpers fitting this case:

bool bpf_helper_changes_pkt_data(void *func)
{
        if (func == bpf_skb_vlan_push ||
            func == bpf_skb_vlan_pop ||
            func == bpf_skb_store_bytes ||
            func == bpf_skb_change_proto ||
            func == bpf_skb_change_head ||
            func == sk_skb_change_head ||
            func == bpf_skb_change_tail ||
            func == sk_skb_change_tail ||
            func == bpf_skb_adjust_room ||
            func == sk_skb_adjust_room ||
            func == bpf_skb_pull_data ||
            func == sk_skb_pull_data ||
            func == bpf_clone_redirect ||
            func == bpf_l3_csum_replace ||
            func == bpf_l4_csum_replace ||
            func == bpf_xdp_adjust_head ||
            func == bpf_xdp_adjust_meta ||
            func == bpf_msg_pull_data ||
            func == bpf_msg_push_data ||
            func == bpf_msg_pop_data ||
            func == bpf_xdp_adjust_tail ||
#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
            func == bpf_lwt_seg6_store_bytes ||
            func == bpf_lwt_seg6_adjust_srh ||
            func == bpf_lwt_seg6_action ||
#endif
#ifdef CONFIG_INET
            func == bpf_sock_ops_store_hdr_opt ||
#endif
            func == bpf_lwt_in_push_encap ||
            func == bpf_lwt_xmit_push_encap)
                return true;

        return false;
}

It is possible that we could fine-tune this behavior, as some helpers
like bpf_skb_pull_data() may not need to start over again. But I could
be missing some conditions.

Could you post your questions at bpf@...?
Networking people in the mailing list may give you a better
answer why this behavior for bpf_skb_pull_data() and whether
it can be improved.


unsigned payload_offset = (void *)tcph + (tcph->doff << 2) - data;
bpf_skb_pull_data(ctx, ctx->len);
data = (void *)(long)ctx->data;
data_end = (void *)(long)ctx->data_end;

struct tls_record_hdr *rech = data + payload_offset;
if ((void *)(rech + 1) > data_end)
        return TC_ACT_OK;

if (rech->type == TLS_CONTENT_TYPE_HANDSAHKE)
        bpf_trace_printk("It's a handshake");

Running this code gives me the error "R1 offset is outside of the
packet" even though I performed the correct check on packet boundaries.
If I re-parse all the headers, the code is accepted. Is there a way to
solve the problem?

[1]
https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L2312

On 20/10/21 08:11, Y Song wrote:
On Tue, Oct 19, 2021 at 8:13 AM Federico Parola
<federico.parola@...> wrote:

Dear all,
how can I access the payload of the packet in a program attached to the
TC egress hook (SCHED_CLS attached to clsact qdisc)?
ctx->data_end points to the end of the L4 header, while on the ingress
hook it points to the end of the packet (tested on kernel v5.14).
It could be that the linear data only covers up to the end of the
L4 header. In such cases, you can use the bpf_skb_pull_data() helper
to pull more data into the linear region; after that, your ctx->data_end
will point to much later packet data.


Best regards,
Federico Parola
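
To summarize the advice in this thread: after bpf_skb_pull_data() the verifier discards earlier bounds checks, so packet pointers must be re-derived from ctx and every header re-checked. A rough tc classifier sketch of that pattern follows (header walk simplified, function and section names are illustrative, untested):

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("classifier")
int handle_egress(struct __sk_buff *ctx)
{
        /* Pull the whole packet into the linear area first; this
         * invalidates any packet pointers obtained earlier, so
         * everything below starts from fresh data/data_end loads. */
        bpf_skb_pull_data(ctx, ctx->len);

        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end ||
            eth->h_proto != bpf_htons(ETH_P_IP))
                return TC_ACT_OK;

        struct iphdr *iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end || iph->protocol != IPPROTO_TCP)
                return TC_ACT_OK;

        struct tcphdr *tcph = (void *)iph + (iph->ihl << 2);
        if ((void *)(tcph + 1) > data_end)
                return TC_ACT_OK;

        /* Payload begins after the TCP header; it must be
         * bounds-checked again before any access. */
        unsigned char *payload = (void *)tcph + (tcph->doff << 2);
        if ((void *)(payload + 1) > data_end)
                return TC_ACT_OK;

        return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

The cost of re-parsing from the Ethernet header is what the earlier question tried to avoid by saving an offset, but with the current verifier behavior the full re-walk appears to be the reliable pattern.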






