Date   

[PATCH] Report proper module on kernel backtrace

Jiri Olsa
 

Raghavendra Rao reported that memleak does not display
proper name of the related kernel module, but just the
"kernel" string, like here for xfs module functions:

131072 bytes in 4 allocations from stack
..
bvec_alloc+0x92 [kernel]
bio_alloc_bioset+0x13f [kernel]
xfs_add_to_ioend+0x2df [kernel]
xfs_do_writepage+0x148 [kernel]
write_cache_pages+0x171 [kernel]
xfs_vm_writepages+0x59 [kernel]
do_writepages+0x43 [kernel]
...

The kernel resolver code is parsing /proc/kallsyms, which
already has the module information in.

This patch is adding support to parse the module info
from /proc/kallsyms and initialize the module with
proper value.

Above memleak backtrace now looks like:

131072 bytes in 4 allocations from stack

bvec_alloc+0x92 [kernel]
bio_alloc_bioset+0x13f [kernel]
xfs_add_to_ioend+0x2df [xfs]
xfs_do_writepage+0x148 [xfs]
write_cache_pages+0x171 [kernel]
xfs_vm_writepages+0x59 [xfs]
do_writepages+0x43 [kernel]
...

Reported-by: Raghavendra Rao <ragrao@...>
Signed-off-by: Jiri Olsa <jolsa@...>
---
src/cc/bcc_proc.c | 20 ++++++++++++++++++--
src/cc/bcc_proc.h | 2 +-
src/cc/bcc_syms.cc | 8 ++++----
src/cc/libbpf | 2 +-
src/cc/syms.h | 5 +++--
tests/cc/test_c_api.cc | 2 +-
6 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/src/cc/bcc_proc.c b/src/cc/bcc_proc.c
index c8f4041ed27b..d6e17282eed6 100644
--- a/src/cc/bcc_proc.c
+++ b/src/cc/bcc_proc.c
@@ -208,7 +208,7 @@ done:
=20
int bcc_procutils_each_ksym(bcc_procutils_ksymcb callback, void *payload=
) {
char line[2048];
- char *symname, *endsym;
+ char *symname, *endsym, *modname, *endmod =3D NULL;
FILE *kallsyms;
unsigned long long addr;
=20
@@ -237,7 +237,23 @@ int bcc_procutils_each_ksym(bcc_procutils_ksymcb cal=
lback, void *payload) {
while (*endsym && !isspace(*endsym)) endsym++;
*endsym =3D '\0';
=20
- callback(symname, addr, payload);
+ // Parse module name if it's available
+ modname =3D endsym + 1;
+ while (*modname && isspace(*endsym)) modname++;
+
+ if (*modname && *modname =3D=3D '[') {
+ endmod =3D ++modname;
+ while (*endmod && *endmod !=3D ']') endmod++;
+ if (*endmod)
+ *(endmod) =3D '\0';
+ else
+ endmod =3D NULL;
+ }
+
+ if (!endmod)
+ modname =3D "kernel";
+
+ callback(symname, modname, addr, payload);
}
=20
fclose(kallsyms);
diff --git a/src/cc/bcc_proc.h b/src/cc/bcc_proc.h
index a56f680d5f56..368f27d9e0c3 100644
--- a/src/cc/bcc_proc.h
+++ b/src/cc/bcc_proc.h
@@ -42,7 +42,7 @@ typedef struct mod_info {
typedef int (*bcc_procutils_modulecb)(mod_info *, int, void *);
=20
// Symbol name, address, payload
-typedef void (*bcc_procutils_ksymcb)(const char *, uint64_t, void *);
+typedef void (*bcc_procutils_ksymcb)(const char *, const char *, uint64_=
t, void *);
=20
char *bcc_procutils_which_so(const char *libname, int pid);
char *bcc_procutils_which(const char *binpath);
diff --git a/src/cc/bcc_syms.cc b/src/cc/bcc_syms.cc
index a946fe1a1048..95ec8bad398a 100644
--- a/src/cc/bcc_syms.cc
+++ b/src/cc/bcc_syms.cc
@@ -47,9 +47,9 @@ bool ProcStat::is_stale() {
ProcStat::ProcStat(int pid)
: procfs_(tfm::format("/proc/%d/exe", pid)), inode_(getinode_()) {}
=20
-void KSyms::_add_symbol(const char *symname, uint64_t addr, void *p) {
+void KSyms::_add_symbol(const char *symname, const char *modname, uint64=
_t addr, void *p) {
KSyms *ks =3D static_cast<KSyms *>(p);
- ks->syms_.emplace_back(symname, addr);
+ ks->syms_.emplace_back(symname, modname, addr);
}
=20
void KSyms::refresh() {
@@ -67,13 +67,13 @@ bool KSyms::resolve_addr(uint64_t addr, struct bcc_sy=
mbol *sym, bool demangle) {
if (syms_.empty())
goto unknown_symbol;
=20
- it =3D std::upper_bound(syms_.begin(), syms_.end(), Symbol("", addr));
+ it =3D std::upper_bound(syms_.begin(), syms_.end(), Symbol("", "", add=
r));
if (it !=3D syms_.begin()) {
it--;
sym->name =3D (*it).name.c_str();
if (demangle)
sym->demangle_name =3D sym->name;
- sym->module =3D "kernel";
+ sym->module =3D (*it).mod.c_str();
sym->offset =3D addr - (*it).addr;
return true;
}
diff --git a/src/cc/libbpf b/src/cc/libbpf
index 45ad8626010b..33b017498543 160000
--- a/src/cc/libbpf
+++ b/src/cc/libbpf
@@ -1 +1 @@
-Subproject commit 45ad8626010b384306bc277c46a4107f21414b22
+Subproject commit 33b017498543167b65fa948d3a0267794c78787f
diff --git a/src/cc/syms.h b/src/cc/syms.h
index 351da1dffbf4..021b28aa4c3a 100644
--- a/src/cc/syms.h
+++ b/src/cc/syms.h
@@ -50,8 +50,9 @@ public:
=20
class KSyms : SymbolCache {
struct Symbol {
- Symbol(const char *name, uint64_t addr) : name(name), addr(addr) {}
+ Symbol(const char *name, const char *mod, uint64_t addr) : name(name=
), mod(mod), addr(addr) {}
std::string name;
+ std::string mod;
uint64_t addr;
=20
bool operator<(const Symbol &rhs) const { return addr < rhs.addr; }
@@ -59,7 +60,7 @@ class KSyms : SymbolCache {
=20
std::vector<Symbol> syms_;
std::unordered_map<std::string, uint64_t> symnames_;
- static void _add_symbol(const char *, uint64_t, void *);
+ static void _add_symbol(const char *, const char *, uint64_t, void *);
=20
public:
virtual bool resolve_addr(uint64_t addr, struct bcc_symbol *sym, bool =
demangle =3D true) override;
diff --git a/tests/cc/test_c_api.cc b/tests/cc/test_c_api.cc
index 8b96c649b65c..e3744be4e16a 100644
--- a/tests/cc/test_c_api.cc
+++ b/tests/cc/test_c_api.cc
@@ -67,7 +67,7 @@ TEST_CASE("binary resolution with `which`", "[c_api]") =
{
free(ld);
}
=20
-static void _test_ksym(const char *sym, uint64_t addr, void *_) {
+static void _test_ksym(const char *sym, const char *mod, uint64_t addr, =
void *_) {
if (!strcmp(sym, "startup_64"))
REQUIRE(addr !=3D 0x0ull);
}
--=20
2.21.0


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

neirac
 

I agree with that approach then, is easier than mantain and keep in sync more namei functions.
Thanks a lot.

El vie., 19 de jul. de 2019 12:07, Y Song <ys114321@...> escribió:
On Fri, Jul 19, 2019 at 6:02 AM cnb <cneirabustos@...> wrote:
>
> kern_path is the one, should I move this and dependencies to a new file called bfp_namei.c, in there I'll change it to use GFP_ATOMIC or create a new function to replace kern_path. What do you think?

I think adding GFP_ATOMIC to function getname_kernel() is reasonable.

                const size_t size = offsetof(struct filename, iname[1]);
                struct filename *tmp;

                tmp = kmalloc(size, GFP_KERNEL);

/* fs/open.c */
struct audit_names;
struct filename {
        const char              *name;  /* pointer to actual string */
        const __user char       *uptr;  /* original userland pointer */
        int                     refcnt;
        struct audit_names      *aname;
        const char              iname[];
};

The size is pretty small. on x64, it should be 25 bytes. If the system
cannot honor
this 25 bytes in non-blocking mode, it is probably already in stress,
bpf subsystem itself may not work reliably as it uses some GFP_ATOMIC
as well for some map (e.g., update) helpers.

So let us have this patch (adding GFP_ATOMIC) in the first series. We can
think of alternatives (e.g., separate functions)  if anybody objects.

>
> El jue., 18 de jul. de 2019 22:10, Y Song <ys114321@...> escribió:
>>
>> On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
>> <cneirabustos@...> wrote:
>> >
>> > Hi,
>> > Yes, I'm still interested as I need this capability at $WORK, but haven't had the time to re-write the dentry functions we need, as currently they could sleep.
>> > I could resume work on this next monday when I get back from vacations.
>> > I really appreciate your help and guidance on this.
>>
>> Great. Which dentry function are you referring to? dput? kern_path
>> cannot be used as it uses
>> kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.
>>
>> >
>> > Bests
>> >
>> >
>> > El jue., 18 de jul. de 2019 20:29, Y Song <ys114321@...> escribió:
>> >>
>> >> Hi, Carlos,
>> >>
>> >> Are you still interested in upstreaming this patch? Looks like there
>> >> still a desire
>> >> to make bcc work inside the containers.
>> >>
>> >> The bpf-next will open next week. If you would like, could you submit again?
>> >> I will review the patch on bpf-next as well to make sure we made
>> >> forward progress.
>> >>
>> >> Please let me know. Thanks!
>> >>
>> >> Yonghong
>> >>
>> >> On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
>> >> <cneirabustos@...> wrote:
>> >> >
>> >> > As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
>> >> > it calls getname_kernel() which may sleep.
>> >> > The inode is accessed directly, as we are just interested in the inode's s_dev.
>> >> > Let me know if this approach is the correct one.
>> >> > -------------------------------------------------------------------------------
>> >> > From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
>> >> > From: Carlos <cneirabustos@...>
>> >> > Date: Tue, 16 Apr 2019 17:10:46 -0400
>> >> > Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
>> >> >   from current task
>> >> >
>> >> > This helper obtains the active namespace from current and returns pid, tgid,
>> >> > device and namespace id as seen from that namespace, allowing to instrument
>> >> > a process inside a container.
>> >> > Device is read from /proc/self/ns/pid, as in the future it's possible that
>> >> > different pid_ns files may belong to different devices, according
>> >> > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
>> >> > conference.
>> >> > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
>> >> > scripts but this helper returns the pid as seen by the root namespace which is
>> >> > fine when a bcc script is not executed inside a container.
>> >> > When the process of interest is inside a container, pid filtering will not work
>> >> > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
>> >> > returning the pid as it's seen by the current namespace where the script is
>> >> > executing.
>> >> >
>> >> > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
>> >> > used to do pid filtering even inside a container.
>> >> >
>> >> > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
>> >> >
>> >> >         u32 pid = bpf_get_current_pid_tgid() >> 32;
>> >> >         if (pid != <pid_arg_passed_in>)
>> >> >                 return 0;
>> >> > Could be modified to use bpf_get_current_pidns_info() as follows:
>> >> >
>> >> >         struct bpf_pidns pidns;
>> >> >         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>> >> >         u32 pid = pidns.tgid;
>> >> >         u32 nsid = pidns.nsid;
>> >> >         if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>> >> >                 return 0;
>> >> >
>> >> > To find out the name PID namespace id of a process, you could use this command:
>> >> >
>> >> > $ ps -h -o pidns -p <pid_of_interest>
>> >> >
>> >> > Or this other command:
>> >> >
>> >> > $ ls -Li /proc/<pid_of_interest>/ns/pid
>> >> >
>> >> > Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
>> >> > ---
>> >> >  include/linux/bpf.h      |  1 +
>> >> >  include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
>> >> >  kernel/bpf/core.c        |  1 +
>> >> >  kernel/bpf/helpers.c     | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >  kernel/trace/bpf_trace.c |  2 ++
>> >> >  5 files changed, 93 insertions(+), 1 deletion(-)
>> >> >
>> >> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> >> > index e4d4c1771ab0..4393f8f088cc 100644
>> >> > --- a/include/linux/bpf.h
>> >> > +++ b/include/linux/bpf.h
>> >> > @@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
>> >> >  extern const struct bpf_func_proto bpf_spin_lock_proto;
>> >> >  extern const struct bpf_func_proto bpf_spin_unlock_proto;
>> >> >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>> >> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>> >> >
>> >> >  /* Shared helpers among cBPF and eBPF. */
>> >> >  void bpf_user_rnd_init_once(void);
>> >> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> >> > index 31a27dd337dc..7bf457875c31 100644
>> >> > --- a/include/uapi/linux/bpf.h
>> >> > +++ b/include/uapi/linux/bpf.h
>> >> > @@ -2500,6 +2500,18 @@ union bpf_attr {
>> >> >   *     Return
>> >> >   *             0 if iph and th are a valid SYN cookie ACK, or a negative error
>> >> >   *             otherwise.
>> >> > + *
>> >> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
>> >> > + *     Description
>> >> > + *             Copies into *pidns* pid, namespace id and tgid as seen by the
>> >> > + *             current namespace and also device from /proc/self/ns/pid.
>> >> > + *             *size_of_pidns* must be the size of *pidns*
>> >> > + *
>> >> > + *             This helper is used when pid filtering is needed inside a
>> >> > + *             container as bpf_get_current_tgid() helper returns always the
>> >> > + *             pid id as seen by the root namespace.
>> >> > + *     Return
>> >> > + *             0 on success -EINVAL on error.
>> >> >   */
>> >> >  #define __BPF_FUNC_MAPPER(FN)          \
>> >> >         FN(unspec),                     \
>> >> > @@ -2602,7 +2614,8 @@ union bpf_attr {
>> >> >         FN(skb_ecn_set_ce),             \
>> >> >         FN(get_listener_sock),          \
>> >> >         FN(skc_lookup_tcp),             \
>> >> > -       FN(tcp_check_syncookie),
>> >> > +       FN(tcp_check_syncookie),        \
>> >> > +       FN(get_current_pidns_info),
>> >> >
>> >> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> >> >   * function eBPF program intends to call
>> >> > @@ -3298,4 +3311,14 @@ struct bpf_line_info {
>> >> >  struct bpf_spin_lock {
>> >> >         __u32   val;
>> >> >  };
>> >> > +
>> >> > +/* helper bpf_get_current_pidns_info will store the following
>> >> > + * data, dev will contain major/minor from /proc/self/pid.
>> >> > +*/
>> >> > +struct bpf_pidns_info {
>> >> > +       __u32 dev;
>> >> > +       __u32 nsid;
>> >> > +       __u32 tgid;
>> >> > +       __u32 pid;
>> >> > +};
>> >> >  #endif /* _UAPI__LINUX_BPF_H__ */
>> >> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> >> > index ace8c22c8b0e..ecbdc72ba459 100644
>> >> > --- a/kernel/bpf/core.c
>> >> > +++ b/kernel/bpf/core.c
>> >> > @@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>> >> >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>> >> >  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>> >> >  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
>> >> > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>> >> >
>> >> >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>> >> >  {
>> >> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> >> > index a411fc17d265..2de82d14424d 100644
>> >> > --- a/kernel/bpf/helpers.c
>> >> > +++ b/kernel/bpf/helpers.c
>> >> > @@ -18,6 +18,11 @@
>> >> >  #include <linux/sched.h>
>> >> >  #include <linux/uidgid.h>
>> >> >  #include <linux/filter.h>
>> >> > +#include <linux/pid_namespace.h>
>> >> > +#include <linux/major.h>
>> >> > +#include <linux/stat.h>
>> >> > +#include <linux/namei.h>
>> >> > +#include <linux/version.h>
>> >> >
>> >> >  /* If kernel subsystem is allowing eBPF programs to call this function,
>> >> >   * inside its own verifier_ops->get_func_proto() callback it should return
>> >> > @@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>> >> >         preempt_enable();
>> >> >  }
>> >> >
>> >> > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
>> >> > +        size)
>> >> > +{
>> >> > +       const char *pidnspath = "/proc/self/ns/pid";
>> >> > +       struct pid_namespace *pidns = NULL;
>> >> > +       DEFINE_SPINLOCK(bpf_spinlock);
>> >> > +       struct inode *inode;
>> >> > +       struct kstat ks;
>> >> > +       struct path kp;
>> >> > +       pid_t tgid = 0;
>> >> > +       pid_t pid = 0;
>> >> > +       int ret;
>> >> > +
>> >> > +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
>> >> > +               goto clear;
>> >> > +
>> >> > +       pidns = task_active_pid_ns(current);
>> >> > +
>> >> > +       if (unlikely(!pidns))
>> >> > +               goto clear;
>> >> > +
>> >> > +       pidns_info->nsid =  pidns->ns.inum;
>> >> > +       pid = task_pid_nr_ns(current, pidns);
>> >> > +
>> >> > +       if (unlikely(!pid))
>> >> > +               goto clear;
>> >> > +
>> >> > +       tgid = task_tgid_nr_ns(current, pidns);
>> >> > +
>> >> > +       if (unlikely(!tgid))
>> >> > +               goto clear;
>> >> > +
>> >> > +       pidns_info->tgid = (u32) tgid;
>> >> > +       pidns_info->pid = (u32) pid;
>> >> > +
>> >> > +       spin_lock(&bpf_spinlock);
>> >> > +        ret = kern_path(pidnspath, 0, &kp);
>> >> > +       if (ret)
>> >> > +               goto clear;
>> >> > +       inode = d_backing_inode(kp.dentry);
>> >> > +       pidns_info->dev = inode->i_sb->s_dev;
>> >> > +       spin_unlock(&bpf_spinlock);
>> >> > +
>> >> > +       return 0;
>> >> > +
>> >> > +       clear:
>> >> > +       if (pidns_info)
>> >> > +               memset((void *)pidns, 0, (size_t) size);
>> >> > +
>> >> > +       return -EINVAL;
>> >> > +}
>> >> > +
>> >> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
>> >> > +       .func   = bpf_get_current_pidns_info,
>> >> > +       .gpl_only       = false,
>> >> > +       .ret_type       = RET_INTEGER,
>> >> > +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
>> >> > +       .arg2_type      = ARG_CONST_SIZE,
>> >> > +};
>> >> > +
>> >> >  #ifdef CONFIG_CGROUPS
>> >> >  BPF_CALL_0(bpf_get_current_cgroup_id)
>> >> >  {
>> >> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> >> > index d64c00afceb5..2ef0de78d4ec 100644
>> >> > --- a/kernel/trace/bpf_trace.c
>> >> > +++ b/kernel/trace/bpf_trace.c
>> >> > @@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> >> >         case BPF_FUNC_get_current_cgroup_id:
>> >> >                 return &bpf_get_current_cgroup_id_proto;
>> >> >  #endif
>> >> > +       case BPF_FUNC_get_current_pidns_info:
>> >> > +               return &bpf_get_current_pidns_info_proto;
>> >> >         default:
>> >> >                 return NULL;
>> >> >         }
>> >> > --
>> >> > 2.11.0
>> >> >
>


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

Yonghong Song
 

On Fri, Jul 19, 2019 at 6:02 AM cnb <cneirabustos@...> wrote:

kern_path is the one, should I move this and dependencies to a new file called bfp_namei.c, in there I'll change it to use GFP_ATOMIC or create a new function to replace kern_path. What do you think?
I think adding GFP_ATOMIC to function getname_kernel() is reasonable.

const size_t size = offsetof(struct filename, iname[1]);
struct filename *tmp;

tmp = kmalloc(size, GFP_KERNEL);

/* fs/open.c */
struct audit_names;
struct filename {
const char *name; /* pointer to actual string */
const __user char *uptr; /* original userland pointer */
int refcnt;
struct audit_names *aname;
const char iname[];
};

The size is pretty small. on x64, it should be 25 bytes. If the system
cannot honor
this 25 bytes in non-blocking mode, it is probably already in stress,
bpf subsystem itself may not work reliably as it uses some GFP_ATOMIC
as well for some map (e.g., update) helpers.

So let us have this patch (adding GFP_ATOMIC) in the first series. We can
think of alternatives (e.g., separate functions) if anybody objects.


El jue., 18 de jul. de 2019 22:10, Y Song <ys114321@...> escribió:

On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
<cneirabustos@...> wrote:

Hi,
Yes, I'm still interested as I need this capability at $WORK, but haven't had the time to re-write the dentry functions we need, as currently they could sleep.
I could resume work on this next monday when I get back from vacations.
I really appreciate your help and guidance on this.
Great. Which dentry function are you referring to? dput? kern_path
cannot be used as it uses
kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.


Bests


El jue., 18 de jul. de 2019 20:29, Y Song <ys114321@...> escribió:

Hi, Carlos,

Are you still interested in upstreaming this patch? Looks like there
still a desire
to make bcc work inside the containers.

The bpf-next will open next week. If you would like, could you submit again?
I will review the patch on bpf-next as well to make sure we made
forward progress.

Please let me know. Thanks!

Yonghong

On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
<cneirabustos@...> wrote:

As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
it calls getname_kernel() which may sleep.
The inode is accessed directly, as we are just interested in the inode's s_dev.
Let me know if this approach is the correct one.
-------------------------------------------------------------------------------
From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@...>
Date: Tue, 16 Apr 2019 17:10:46 -0400
Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
from current task

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

u32 pid = bpf_get_current_pid_tgid() >> 32;
if (pid != <pid_arg_passed_in>)
return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

struct bpf_pidns pidns;
bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
u32 pid = pidns.tgid;
u32 nsid = pidns.nsid;
if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
kernel/bpf/core.c | 1 +
kernel/bpf/helpers.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 2 ++
5 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4d4c1771ab0..4393f8f088cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
extern const struct bpf_func_proto bpf_spin_lock_proto;
extern const struct bpf_func_proto bpf_spin_unlock_proto;
extern const struct bpf_func_proto bpf_get_local_storage_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;

/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31a27dd337dc..7bf457875c31 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2500,6 +2500,18 @@ union bpf_attr {
* Return
* 0 if iph and th are a valid SYN cookie ACK, or a negative error
* otherwise.
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ * Description
+ * Copies into *pidns* pid, namespace id and tgid as seen by the
+ * current namespace and also device from /proc/self/ns/pid.
+ * *size_of_pidns* must be the size of *pidns*
+ *
+ * This helper is used when pid filtering is needed inside a
+ * container as bpf_get_current_tgid() helper returns always the
+ * pid id as seen by the root namespace.
+ * Return
+ * 0 on success -EINVAL on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2602,7 +2614,8 @@ union bpf_attr {
FN(skb_ecn_set_ce), \
FN(get_listener_sock), \
FN(skc_lookup_tcp), \
- FN(tcp_check_syncookie),
+ FN(tcp_check_syncookie), \
+ FN(get_current_pidns_info),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3298,4 +3311,14 @@ struct bpf_line_info {
struct bpf_spin_lock {
__u32 val;
};
+
+/* helper bpf_get_current_pidns_info will store the following
+ * data, dev will contain major/minor from /proc/self/pid.
+*/
+struct bpf_pidns_info {
+ __u32 dev;
+ __u32 nsid;
+ __u32 tgid;
+ __u32 pid;
+};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ace8c22c8b0e..ecbdc72ba459 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;

const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
{
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a411fc17d265..2de82d14424d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -18,6 +18,11 @@
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/filter.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>

/* If kernel subsystem is allowing eBPF programs to call this function,
* inside its own verifier_ops->get_func_proto() callback it should return
@@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
preempt_enable();
}

+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+ size)
+{
+ const char *pidnspath = "/proc/self/ns/pid";
+ struct pid_namespace *pidns = NULL;
+ DEFINE_SPINLOCK(bpf_spinlock);
+ struct inode *inode;
+ struct kstat ks;
+ struct path kp;
+ pid_t tgid = 0;
+ pid_t pid = 0;
+ int ret;
+
+ if (unlikely(size != sizeof(struct bpf_pidns_info)))
+ goto clear;
+
+ pidns = task_active_pid_ns(current);
+
+ if (unlikely(!pidns))
+ goto clear;
+
+ pidns_info->nsid = pidns->ns.inum;
+ pid = task_pid_nr_ns(current, pidns);
+
+ if (unlikely(!pid))
+ goto clear;
+
+ tgid = task_tgid_nr_ns(current, pidns);
+
+ if (unlikely(!tgid))
+ goto clear;
+
+ pidns_info->tgid = (u32) tgid;
+ pidns_info->pid = (u32) pid;
+
+ spin_lock(&bpf_spinlock);
+ ret = kern_path(pidnspath, 0, &kp);
+ if (ret)
+ goto clear;
+ inode = d_backing_inode(kp.dentry);
+ pidns_info->dev = inode->i_sb->s_dev;
+ spin_unlock(&bpf_spinlock);
+
+ return 0;
+
+ clear:
+ if (pidns_info)
+ memset((void *)pidns, 0, (size_t) size);
+
+ return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+ .func = bpf_get_current_pidns_info,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg2_type = ARG_CONST_SIZE,
+};
+
#ifdef CONFIG_CGROUPS
BPF_CALL_0(bpf_get_current_cgroup_id)
{
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d64c00afceb5..2ef0de78d4ec 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_get_current_cgroup_id:
return &bpf_get_current_cgroup_id_proto;
#endif
+ case BPF_FUNC_get_current_pidns_info:
+ return &bpf_get_current_pidns_info_proto;
default:
return NULL;
}
--
2.11.0


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

neirac
 

kern_path is the one, should I move this and dependencies to a new file called bfp_namei.c, in there I'll change it to use GFP_ATOMIC or create a new function to replace kern_path. What do you think?


El jue., 18 de jul. de 2019 22:10, Y Song <ys114321@...> escribió:
On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
<cneirabustos@...> wrote:
>
> Hi,
> Yes, I'm still interested as I need this capability at $WORK, but haven't had the time to re-write the dentry functions we need, as currently they could sleep.
> I could resume work on this next monday when I get back from vacations.
> I really appreciate your help and guidance on this.

Great. Which dentry function are you referring to? dput? kern_path
cannot be used as it uses
kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.

>
> Bests
>
>
> El jue., 18 de jul. de 2019 20:29, Y Song <ys114321@...> escribió:
>>
>> Hi, Carlos,
>>
>> Are you still interested in upstreaming this patch? Looks like there
>> still a desire
>> to make bcc work inside the containers.
>>
>> The bpf-next will open next week. If you would like, could you submit again?
>> I will review the patch on bpf-next as well to make sure we made
>> forward progress.
>>
>> Please let me know. Thanks!
>>
>> Yonghong
>>
>> On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
>> <cneirabustos@...> wrote:
>> >
>> > As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
>> > it calls getname_kernel() which may sleep.
>> > The inode is accessed directly, as we are just interested in the inode's s_dev.
>> > Let me know if this approach is the correct one.
>> > -------------------------------------------------------------------------------
>> > From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
>> > From: Carlos <cneirabustos@...>
>> > Date: Tue, 16 Apr 2019 17:10:46 -0400
>> > Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
>> >   from current task
>> >
>> > This helper obtains the active namespace from current and returns pid, tgid,
>> > device and namespace id as seen from that namespace, allowing to instrument
>> > a process inside a container.
>> > Device is read from /proc/self/ns/pid, as in the future it's possible that
>> > different pid_ns files may belong to different devices, according
>> > to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
>> > conference.
>> > Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
>> > scripts but this helper returns the pid as seen by the root namespace which is
>> > fine when a bcc script is not executed inside a container.
>> > When the process of interest is inside a container, pid filtering will not work
>> > if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
>> > returning the pid as it's seen by the current namespace where the script is
>> > executing.
>> >
>> > This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
>> > used to do pid filtering even inside a container.
>> >
>> > For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
>> >
>> >         u32 pid = bpf_get_current_pid_tgid() >> 32;
>> >         if (pid != <pid_arg_passed_in>)
>> >                 return 0;
>> > Could be modified to use bpf_get_current_pidns_info() as follows:
>> >
>> >         struct bpf_pidns pidns;
>> >         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>> >         u32 pid = pidns.tgid;
>> >         u32 nsid = pidns.nsid;
>> >         if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>> >                 return 0;
>> >
>> > To find out the name PID namespace id of a process, you could use this command:
>> >
>> > $ ps -h -o pidns -p <pid_of_interest>
>> >
>> > Or this other command:
>> >
>> > $ ls -Li /proc/<pid_of_interest>/ns/pid
>> >
>> > Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
>> > ---
>> >  include/linux/bpf.h      |  1 +
>> >  include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
>> >  kernel/bpf/core.c        |  1 +
>> >  kernel/bpf/helpers.c     | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >  kernel/trace/bpf_trace.c |  2 ++
>> >  5 files changed, 93 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> > index e4d4c1771ab0..4393f8f088cc 100644
>> > --- a/include/linux/bpf.h
>> > +++ b/include/linux/bpf.h
>> > @@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
>> >  extern const struct bpf_func_proto bpf_spin_lock_proto;
>> >  extern const struct bpf_func_proto bpf_spin_unlock_proto;
>> >  extern const struct bpf_func_proto bpf_get_local_storage_proto;
>> > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>> >
>> >  /* Shared helpers among cBPF and eBPF. */
>> >  void bpf_user_rnd_init_once(void);
>> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> > index 31a27dd337dc..7bf457875c31 100644
>> > --- a/include/uapi/linux/bpf.h
>> > +++ b/include/uapi/linux/bpf.h
>> > @@ -2500,6 +2500,18 @@ union bpf_attr {
>> >   *     Return
>> >   *             0 if iph and th are a valid SYN cookie ACK, or a negative error
>> >   *             otherwise.
>> > + *
>> > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
>> > + *     Description
>> > + *             Copies into *pidns* pid, namespace id and tgid as seen by the
>> > + *             current namespace and also device from /proc/self/ns/pid.
>> > + *             *size_of_pidns* must be the size of *pidns*
>> > + *
>> > + *             This helper is used when pid filtering is needed inside a
>> > + *             container as bpf_get_current_tgid() helper returns always the
>> > + *             pid id as seen by the root namespace.
>> > + *     Return
>> > + *             0 on success -EINVAL on error.
>> >   */
>> >  #define __BPF_FUNC_MAPPER(FN)          \
>> >         FN(unspec),                     \
>> > @@ -2602,7 +2614,8 @@ union bpf_attr {
>> >         FN(skb_ecn_set_ce),             \
>> >         FN(get_listener_sock),          \
>> >         FN(skc_lookup_tcp),             \
>> > -       FN(tcp_check_syncookie),
>> > +       FN(tcp_check_syncookie),        \
>> > +       FN(get_current_pidns_info),
>> >
>> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>> >   * function eBPF program intends to call
>> > @@ -3298,4 +3311,14 @@ struct bpf_line_info {
>> >  struct bpf_spin_lock {
>> >         __u32   val;
>> >  };
>> > +
>> > +/* helper bpf_get_current_pidns_info will store the following
>> > + * data, dev will contain major/minor from /proc/self/pid.
>> > +*/
>> > +struct bpf_pidns_info {
>> > +       __u32 dev;
>> > +       __u32 nsid;
>> > +       __u32 tgid;
>> > +       __u32 pid;
>> > +};
>> >  #endif /* _UAPI__LINUX_BPF_H__ */
>> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> > index ace8c22c8b0e..ecbdc72ba459 100644
>> > --- a/kernel/bpf/core.c
>> > +++ b/kernel/bpf/core.c
>> > @@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>> >  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>> >  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>> >  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
>> > +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>> >
>> >  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>> >  {
>> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> > index a411fc17d265..2de82d14424d 100644
>> > --- a/kernel/bpf/helpers.c
>> > +++ b/kernel/bpf/helpers.c
>> > @@ -18,6 +18,11 @@
>> >  #include <linux/sched.h>
>> >  #include <linux/uidgid.h>
>> >  #include <linux/filter.h>
>> > +#include <linux/pid_namespace.h>
>> > +#include <linux/major.h>
>> > +#include <linux/stat.h>
>> > +#include <linux/namei.h>
>> > +#include <linux/version.h>
>> >
>> >  /* If kernel subsystem is allowing eBPF programs to call this function,
>> >   * inside its own verifier_ops->get_func_proto() callback it should return
>> > @@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>> >         preempt_enable();
>> >  }
>> >
>> > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
>> > +        size)
>> > +{
>> > +       const char *pidnspath = "/proc/self/ns/pid";
>> > +       struct pid_namespace *pidns = NULL;
>> > +       DEFINE_SPINLOCK(bpf_spinlock);
>> > +       struct inode *inode;
>> > +       struct kstat ks;
>> > +       struct path kp;
>> > +       pid_t tgid = 0;
>> > +       pid_t pid = 0;
>> > +       int ret;
>> > +
>> > +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
>> > +               goto clear;
>> > +
>> > +       pidns = task_active_pid_ns(current);
>> > +
>> > +       if (unlikely(!pidns))
>> > +               goto clear;
>> > +
>> > +       pidns_info->nsid =  pidns->ns.inum;
>> > +       pid = task_pid_nr_ns(current, pidns);
>> > +
>> > +       if (unlikely(!pid))
>> > +               goto clear;
>> > +
>> > +       tgid = task_tgid_nr_ns(current, pidns);
>> > +
>> > +       if (unlikely(!tgid))
>> > +               goto clear;
>> > +
>> > +       pidns_info->tgid = (u32) tgid;
>> > +       pidns_info->pid = (u32) pid;
>> > +
>> > +       spin_lock(&bpf_spinlock);
>> > +        ret = kern_path(pidnspath, 0, &kp);
>> > +       if (ret)
>> > +               goto clear;
>> > +       inode = d_backing_inode(kp.dentry);
>> > +       pidns_info->dev = inode->i_sb->s_dev;
>> > +       spin_unlock(&bpf_spinlock);
>> > +
>> > +       return 0;
>> > +
>> > +       clear:
>> > +       if (pidns_info)
>> > +               memset((void *)pidns, 0, (size_t) size);
>> > +
>> > +       return -EINVAL;
>> > +}
>> > +
>> > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
>> > +       .func   = bpf_get_current_pidns_info,
>> > +       .gpl_only       = false,
>> > +       .ret_type       = RET_INTEGER,
>> > +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
>> > +       .arg2_type      = ARG_CONST_SIZE,
>> > +};
>> > +
>> >  #ifdef CONFIG_CGROUPS
>> >  BPF_CALL_0(bpf_get_current_cgroup_id)
>> >  {
>> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> > index d64c00afceb5..2ef0de78d4ec 100644
>> > --- a/kernel/trace/bpf_trace.c
>> > +++ b/kernel/trace/bpf_trace.c
>> > @@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> >         case BPF_FUNC_get_current_cgroup_id:
>> >                 return &bpf_get_current_cgroup_id_proto;
>> >  #endif
>> > +       case BPF_FUNC_get_current_pidns_info:
>> > +               return &bpf_get_current_pidns_info_proto;
>> >         default:
>> >                 return NULL;
>> >         }
>> > --
>> > 2.11.0
>> >


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

Yonghong Song
 

On Thu, Jul 18, 2019 at 6:21 PM carlos antonio neira bustos
<cneirabustos@...> wrote:

Hi,
Yes, I'm still interested as I need this capability at $WORK, but haven't had the time to re-write the dentry functions we need, as currently they could sleep.
I could resume work on this next monday when I get back from vacations.
I really appreciate your help and guidance on this.
Great. Which dentry function are you referring to? dput? kern_path
cannot be used as it uses
kmalloc without GFP_ATOMIC. See https://github.com/iovisor/bcc/issues/1329.


Bests


El jue., 18 de jul. de 2019 20:29, Y Song <ys114321@...> escribió:

Hi, Carlos,

Are you still interested in upstreaming this patch? Looks like there
still a desire
to make bcc work inside the containers.

The bpf-next will open next week. If you would like, could you submit again?
I will review the patch on bpf-next as well to make sure we made
forward progress.

Please let me know. Thanks!

Yonghong

On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
<cneirabustos@...> wrote:

As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
it calls getname_kernel() which may sleep.
The inode is accessed directly, as we are just interested in the inode's s_dev.
Let me know if this approach is the correct one.
-------------------------------------------------------------------------------
From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@...>
Date: Tue, 16 Apr 2019 17:10:46 -0400
Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
from current task

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

u32 pid = bpf_get_current_pid_tgid() >> 32;
if (pid != <pid_arg_passed_in>)
return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

struct bpf_pidns pidns;
bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
u32 pid = pidns.tgid;
u32 nsid = pidns.nsid;
if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
kernel/bpf/core.c | 1 +
kernel/bpf/helpers.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 2 ++
5 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4d4c1771ab0..4393f8f088cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
extern const struct bpf_func_proto bpf_spin_lock_proto;
extern const struct bpf_func_proto bpf_spin_unlock_proto;
extern const struct bpf_func_proto bpf_get_local_storage_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;

/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31a27dd337dc..7bf457875c31 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2500,6 +2500,18 @@ union bpf_attr {
* Return
* 0 if iph and th are a valid SYN cookie ACK, or a negative error
* otherwise.
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ * Description
+ * Copies into *pidns* pid, namespace id and tgid as seen by the
+ * current namespace and also device from /proc/self/ns/pid.
+ * *size_of_pidns* must be the size of *pidns*
+ *
+ * This helper is used when pid filtering is needed inside a
+ * container as bpf_get_current_tgid() helper returns always the
+ * pid id as seen by the root namespace.
+ * Return
+ * 0 on success -EINVAL on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2602,7 +2614,8 @@ union bpf_attr {
FN(skb_ecn_set_ce), \
FN(get_listener_sock), \
FN(skc_lookup_tcp), \
- FN(tcp_check_syncookie),
+ FN(tcp_check_syncookie), \
+ FN(get_current_pidns_info),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3298,4 +3311,14 @@ struct bpf_line_info {
struct bpf_spin_lock {
__u32 val;
};
+
+/* helper bpf_get_current_pidns_info will store the following
+ * data, dev will contain major/minor from /proc/self/pid.
+*/
+struct bpf_pidns_info {
+ __u32 dev;
+ __u32 nsid;
+ __u32 tgid;
+ __u32 pid;
+};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ace8c22c8b0e..ecbdc72ba459 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;

const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
{
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a411fc17d265..2de82d14424d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -18,6 +18,11 @@
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/filter.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>

/* If kernel subsystem is allowing eBPF programs to call this function,
* inside its own verifier_ops->get_func_proto() callback it should return
@@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
preempt_enable();
}

+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+ size)
+{
+ const char *pidnspath = "/proc/self/ns/pid";
+ struct pid_namespace *pidns = NULL;
+ DEFINE_SPINLOCK(bpf_spinlock);
+ struct inode *inode;
+ struct kstat ks;
+ struct path kp;
+ pid_t tgid = 0;
+ pid_t pid = 0;
+ int ret;
+
+ if (unlikely(size != sizeof(struct bpf_pidns_info)))
+ goto clear;
+
+ pidns = task_active_pid_ns(current);
+
+ if (unlikely(!pidns))
+ goto clear;
+
+ pidns_info->nsid = pidns->ns.inum;
+ pid = task_pid_nr_ns(current, pidns);
+
+ if (unlikely(!pid))
+ goto clear;
+
+ tgid = task_tgid_nr_ns(current, pidns);
+
+ if (unlikely(!tgid))
+ goto clear;
+
+ pidns_info->tgid = (u32) tgid;
+ pidns_info->pid = (u32) pid;
+
+ spin_lock(&bpf_spinlock);
+ ret = kern_path(pidnspath, 0, &kp);
+ if (ret)
+ goto clear;
+ inode = d_backing_inode(kp.dentry);
+ pidns_info->dev = inode->i_sb->s_dev;
+ spin_unlock(&bpf_spinlock);
+
+ return 0;
+
+ clear:
+ if (pidns_info)
+ memset((void *)pidns, 0, (size_t) size);
+
+ return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+ .func = bpf_get_current_pidns_info,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg2_type = ARG_CONST_SIZE,
+};
+
#ifdef CONFIG_CGROUPS
BPF_CALL_0(bpf_get_current_cgroup_id)
{
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d64c00afceb5..2ef0de78d4ec 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_get_current_cgroup_id:
return &bpf_get_current_cgroup_id_proto;
#endif
+ case BPF_FUNC_get_current_pidns_info:
+ return &bpf_get_current_pidns_info_proto;
default:
return NULL;
}
--
2.11.0


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

neirac
 

Hi,
Yes, I'm still interested as I need this capability at $WORK, but haven't had the time to re-write the dentry functions we need, as currently they could sleep.
I could resume work on this next monday when I get back from vacations.
I really appreciate your help and guidance on this.

Bests


El jue., 18 de jul. de 2019 20:29, Y Song <ys114321@...> escribió:
Hi, Carlos,

Are you still interested in upstreaming this patch? Looks like there
still a desire
to make bcc work inside the containers.

The bpf-next will open next week. If you would like, could you submit again?
I will review the patch on bpf-next as well to make sure we made
forward progress.

Please let me know. Thanks!

Yonghong

On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
<cneirabustos@...> wrote:
>
> As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
> it calls getname_kernel() which may sleep.
> The inode is accessed directly, as we are just interested in the inode's s_dev.
> Let me know if this approach is the correct one.
> -------------------------------------------------------------------------------
> From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
> From: Carlos <cneirabustos@...>
> Date: Tue, 16 Apr 2019 17:10:46 -0400
> Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
>   from current task
>
> This helper obtains the active namespace from current and returns pid, tgid,
> device and namespace id as seen from that namespace, allowing to instrument
> a process inside a container.
> Device is read from /proc/self/ns/pid, as in the future it's possible that
> different pid_ns files may belong to different devices, according
> to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
> conference.
> Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
> scripts but this helper returns the pid as seen by the root namespace which is
> fine when a bcc script is not executed inside a container.
> When the process of interest is inside a container, pid filtering will not work
> if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
> returning the pid as it's seen by the current namespace where the script is
> executing.
>
> This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
> used to do pid filtering even inside a container.
>
> For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):
>
>         u32 pid = bpf_get_current_pid_tgid() >> 32;
>         if (pid != <pid_arg_passed_in>)
>                 return 0;
> Could be modified to use bpf_get_current_pidns_info() as follows:
>
>         struct bpf_pidns pidns;
>         bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
>         u32 pid = pidns.tgid;
>         u32 nsid = pidns.nsid;
>         if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
>                 return 0;
>
> To find out the name PID namespace id of a process, you could use this command:
>
> $ ps -h -o pidns -p <pid_of_interest>
>
> Or this other command:
>
> $ ls -Li /proc/<pid_of_interest>/ns/pid
>
> Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
> ---
>  include/linux/bpf.h      |  1 +
>  include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
>  kernel/bpf/core.c        |  1 +
>  kernel/bpf/helpers.c     | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/trace/bpf_trace.c |  2 ++
>  5 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index e4d4c1771ab0..4393f8f088cc 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
>  extern const struct bpf_func_proto bpf_spin_lock_proto;
>  extern const struct bpf_func_proto bpf_spin_unlock_proto;
>  extern const struct bpf_func_proto bpf_get_local_storage_proto;
> +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;
>
>  /* Shared helpers among cBPF and eBPF. */
>  void bpf_user_rnd_init_once(void);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 31a27dd337dc..7bf457875c31 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2500,6 +2500,18 @@ union bpf_attr {
>   *     Return
>   *             0 if iph and th are a valid SYN cookie ACK, or a negative error
>   *             otherwise.
> + *
> + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
> + *     Description
> + *             Copies into *pidns* pid, namespace id and tgid as seen by the
> + *             current namespace and also device from /proc/self/ns/pid.
> + *             *size_of_pidns* must be the size of *pidns*
> + *
> + *             This helper is used when pid filtering is needed inside a
> + *             container as bpf_get_current_tgid() helper returns always the
> + *             pid id as seen by the root namespace.
> + *     Return
> + *             0 on success -EINVAL on error.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -2602,7 +2614,8 @@ union bpf_attr {
>         FN(skb_ecn_set_ce),             \
>         FN(get_listener_sock),          \
>         FN(skc_lookup_tcp),             \
> -       FN(tcp_check_syncookie),
> +       FN(tcp_check_syncookie),        \
> +       FN(get_current_pidns_info),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -3298,4 +3311,14 @@ struct bpf_line_info {
>  struct bpf_spin_lock {
>         __u32   val;
>  };
> +
> +/* helper bpf_get_current_pidns_info will store the following
> + * data, dev will contain major/minor from /proc/self/pid.
> +*/
> +struct bpf_pidns_info {
> +       __u32 dev;
> +       __u32 nsid;
> +       __u32 tgid;
> +       __u32 pid;
> +};
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index ace8c22c8b0e..ecbdc72ba459 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
>  const struct bpf_func_proto bpf_get_current_comm_proto __weak;
>  const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
>  const struct bpf_func_proto bpf_get_local_storage_proto __weak;
> +const struct bpf_func_proto bpf_get_current_pidns_info __weak;
>
>  const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
>  {
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index a411fc17d265..2de82d14424d 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -18,6 +18,11 @@
>  #include <linux/sched.h>
>  #include <linux/uidgid.h>
>  #include <linux/filter.h>
> +#include <linux/pid_namespace.h>
> +#include <linux/major.h>
> +#include <linux/stat.h>
> +#include <linux/namei.h>
> +#include <linux/version.h>
>
>  /* If kernel subsystem is allowing eBPF programs to call this function,
>   * inside its own verifier_ops->get_func_proto() callback it should return
> @@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
>         preempt_enable();
>  }
>
> +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
> +        size)
> +{
> +       const char *pidnspath = "/proc/self/ns/pid";
> +       struct pid_namespace *pidns = NULL;
> +       DEFINE_SPINLOCK(bpf_spinlock);
> +       struct inode *inode;
> +       struct kstat ks;
> +       struct path kp;
> +       pid_t tgid = 0;
> +       pid_t pid = 0;
> +       int ret;
> +
> +       if (unlikely(size != sizeof(struct bpf_pidns_info)))
> +               goto clear;
> +
> +       pidns = task_active_pid_ns(current);
> +
> +       if (unlikely(!pidns))
> +               goto clear;
> +
> +       pidns_info->nsid =  pidns->ns.inum;
> +       pid = task_pid_nr_ns(current, pidns);
> +
> +       if (unlikely(!pid))
> +               goto clear;
> +
> +       tgid = task_tgid_nr_ns(current, pidns);
> +
> +       if (unlikely(!tgid))
> +               goto clear;
> +
> +       pidns_info->tgid = (u32) tgid;
> +       pidns_info->pid = (u32) pid;
> +
> +       spin_lock(&bpf_spinlock);
> +        ret = kern_path(pidnspath, 0, &kp);
> +       if (ret)
> +               goto clear;
> +       inode = d_backing_inode(kp.dentry);
> +       pidns_info->dev = inode->i_sb->s_dev;
> +       spin_unlock(&bpf_spinlock);
> +
> +       return 0;
> +
> +       clear:
> +       if (pidns_info)
> +               memset((void *)pidns, 0, (size_t) size);
> +
> +       return -EINVAL;
> +}
> +
> +const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
> +       .func   = bpf_get_current_pidns_info,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_UNINIT_MEM,
> +       .arg2_type      = ARG_CONST_SIZE,
> +};
> +
>  #ifdef CONFIG_CGROUPS
>  BPF_CALL_0(bpf_get_current_cgroup_id)
>  {
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index d64c00afceb5..2ef0de78d4ec 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>         case BPF_FUNC_get_current_cgroup_id:
>                 return &bpf_get_current_cgroup_id_proto;
>  #endif
> +       case BPF_FUNC_get_current_pidns_info:
> +               return &bpf_get_current_pidns_info_proto;
>         default:
>                 return NULL;
>         }
> --
> 2.11.0
>


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

Yonghong Song
 

Hi, Carlos,

Are you still interested in upstreaming this patch? Looks like there
still a desire
to make bcc work inside the containers.

The bpf-next will open next week. If you would like, could you submit again?
I will review the patch on bpf-next as well to make sure we made
forward progress.

Please let me know. Thanks!

Yonghong

On Tue, Apr 16, 2019 at 6:31 PM Carlos Antonio Neira Bustos
<cneirabustos@...> wrote:

As a bpf program cannot sleep, I needed to add a spinlock to kern_path, as
it calls getname_kernel() which may sleep.
The inode is accessed directly, as we are just interested in the inode's s_dev.
Let me know if this approach is the correct one.
-------------------------------------------------------------------------------
From 35b7bfcbf6524ec807f107302209a0fb07614cc8 Mon Sep 17 00:00:00 2001
From: Carlos <cneirabustos@...>
Date: Tue, 16 Apr 2019 17:10:46 -0400
Subject: [PATCH] [PATCH bpf-next 1/3] BPF: New helper to obtain namespace data
from current task

This helper obtains the active namespace from current and returns pid, tgid,
device and namespace id as seen from that namespace, allowing to instrument
a process inside a container.
Device is read from /proc/self/ns/pid, as in the future it's possible that
different pid_ns files may belong to different devices, according
to the discussion between Eric Biederman and Yonghong in 2017 linux plumbers
conference.
Currently bpf_get_current_pid_tgid(), is used to do pid filtering in bcc's
scripts but this helper returns the pid as seen by the root namespace which is
fine when a bcc script is not executed inside a container.
When the process of interest is inside a container, pid filtering will not work
if bpf_get_current_pid_tgid() is used. This helper addresses this limitation
returning the pid as it's seen by the current namespace where the script is
executing.

This helper has the same use cases as bpf_get_current_pid_tgid() as it can be
used to do pid filtering even inside a container.

For example a bcc script using bpf_get_current_pid_tgid() (tools/funccount.py):

u32 pid = bpf_get_current_pid_tgid() >> 32;
if (pid != <pid_arg_passed_in>)
return 0;
Could be modified to use bpf_get_current_pidns_info() as follows:

struct bpf_pidns pidns;
bpf_get_current_pidns_info(&pidns, sizeof(struct bpf_pidns));
u32 pid = pidns.tgid;
u32 nsid = pidns.nsid;
if ((pid != <pid_arg_passed_in>) && (nsid != <nsid_arg_passed_in>))
return 0;

To find out the name PID namespace id of a process, you could use this command:

$ ps -h -o pidns -p <pid_of_interest>

Or this other command:

$ ls -Li /proc/<pid_of_interest>/ns/pid

Signed-off-by: Carlos Antonio Neira Bustos <cneirabustos@...>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 25 ++++++++++++++++++-
kernel/bpf/core.c | 1 +
kernel/bpf/helpers.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 2 ++
5 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e4d4c1771ab0..4393f8f088cc 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -987,6 +987,7 @@ extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
extern const struct bpf_func_proto bpf_spin_lock_proto;
extern const struct bpf_func_proto bpf_spin_unlock_proto;
extern const struct bpf_func_proto bpf_get_local_storage_proto;
+extern const struct bpf_func_proto bpf_get_current_pidns_info_proto;

/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31a27dd337dc..7bf457875c31 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2500,6 +2500,18 @@ union bpf_attr {
* Return
* 0 if iph and th are a valid SYN cookie ACK, or a negative error
* otherwise.
+ *
+ * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns)
+ * Description
+ * Copies into *pidns* pid, namespace id and tgid as seen by the
+ * current namespace and also device from /proc/self/ns/pid.
+ * *size_of_pidns* must be the size of *pidns*
+ *
+ * This helper is used when pid filtering is needed inside a
+ * container as bpf_get_current_tgid() helper returns always the
+ * pid id as seen by the root namespace.
+ * Return
+ * 0 on success -EINVAL on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2602,7 +2614,8 @@ union bpf_attr {
FN(skb_ecn_set_ce), \
FN(get_listener_sock), \
FN(skc_lookup_tcp), \
- FN(tcp_check_syncookie),
+ FN(tcp_check_syncookie), \
+ FN(get_current_pidns_info),

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3298,4 +3311,14 @@ struct bpf_line_info {
struct bpf_spin_lock {
__u32 val;
};
+
+/* helper bpf_get_current_pidns_info will store the following
+ * data, dev will contain major/minor from /proc/self/pid.
+*/
+struct bpf_pidns_info {
+ __u32 dev;
+ __u32 nsid;
+ __u32 tgid;
+ __u32 pid;
+};
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ace8c22c8b0e..ecbdc72ba459 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2046,6 +2046,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak;
const struct bpf_func_proto bpf_get_local_storage_proto __weak;
+const struct bpf_func_proto bpf_get_current_pidns_info __weak;

const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
{
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a411fc17d265..2de82d14424d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -18,6 +18,11 @@
#include <linux/sched.h>
#include <linux/uidgid.h>
#include <linux/filter.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>

/* If kernel subsystem is allowing eBPF programs to call this function,
* inside its own verifier_ops->get_func_proto() callback it should return
@@ -317,6 +322,66 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
preempt_enable();
}

+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+ size)
+{
+ const char *pidnspath = "/proc/self/ns/pid";
+ struct pid_namespace *pidns = NULL;
+ DEFINE_SPINLOCK(bpf_spinlock);
+ struct inode *inode;
+ struct kstat ks;
+ struct path kp;
+ pid_t tgid = 0;
+ pid_t pid = 0;
+ int ret;
+
+ if (unlikely(size != sizeof(struct bpf_pidns_info)))
+ goto clear;
+
+ pidns = task_active_pid_ns(current);
+
+ if (unlikely(!pidns))
+ goto clear;
+
+ pidns_info->nsid = pidns->ns.inum;
+ pid = task_pid_nr_ns(current, pidns);
+
+ if (unlikely(!pid))
+ goto clear;
+
+ tgid = task_tgid_nr_ns(current, pidns);
+
+ if (unlikely(!tgid))
+ goto clear;
+
+ pidns_info->tgid = (u32) tgid;
+ pidns_info->pid = (u32) pid;
+
+ spin_lock(&bpf_spinlock);
+ ret = kern_path(pidnspath, 0, &kp);
+ if (ret)
+ goto clear;
+ inode = d_backing_inode(kp.dentry);
+ pidns_info->dev = inode->i_sb->s_dev;
+ spin_unlock(&bpf_spinlock);
+
+ return 0;
+
+ clear:
+ if (pidns_info)
+ memset((void *)pidns, 0, (size_t) size);
+
+ return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+ .func = bpf_get_current_pidns_info,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg2_type = ARG_CONST_SIZE,
+};
+
#ifdef CONFIG_CGROUPS
BPF_CALL_0(bpf_get_current_cgroup_id)
{
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index d64c00afceb5..2ef0de78d4ec 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -603,6 +603,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_get_current_cgroup_id:
return &bpf_get_current_cgroup_id_proto;
#endif
+ case BPF_FUNC_get_current_pidns_info:
+ return &bpf_get_current_pidns_info_proto;
default:
return NULL;
}
--
2.11.0


Re: math between pkt pointer and register with unbounded min value is not allowed #verifier

Yonghong Song
 

Hi, Simon,

Finally got some time to deep dive into the verifier for this issue.
I filed an issue https://github.com/iovisor/bcc/issues/2463 with
possible suggestion to add a new checksum calculation helper
for xdp. I forgot what is your ultimate workaround for this issue.
Could you comment on the issue?

Thanks!

Yonghong

On Thu, Mar 7, 2019 at 9:37 AM Y Song <ys114321@...> wrote:

On Wed, Mar 6, 2019 at 7:08 AM <contact@...> wrote:

I'm playing with bcc to prototype an UDP load balancer.

I'm facing an issue that I didn't succeed to understand...

In my code I tried to validate my UDP packet using code like this :

struct udphdr *udp;
udp = iph + 1;
if (udp + 1 > data_end)
return XDP_DROP;
__u16 udp_len = bpf_ntohs(udp->len);
//__u16 udp_len = 8;
if (udp_len < 8)
return XDP_DROP;
if (udp_len > 512) // TODO use a more approriate max value
return XDP_DROP;
if ((void *) udp + udp_len > data_end)
return XDP_DROP;

And the verifier does not like it ..
This is caused by compiler optimizations.


28: (71) r2 = *(u8 *)(r7 +23)
29: (b7) r0 = 2
30: (55) if r2 != 0x11 goto pc+334
R0=inv2 R1=pkt_end(id=0,off=0,imm=0) R2=inv17 R3=inv5 R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=34,imm=0) R8=pkt(id=0,off=34,r=34,imm=0) R9=pkt(id=0,off=14,r=34,imm=0) R10=fp0,call_-1
31: (bf) r2 = r8
32: (07) r2 += 8
33: (b7) r0 = 1
34: (2d) if r2 > r1 goto pc+330
R0=inv1 R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=42,r=42,imm=0) R3=inv5 R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) R10=fp0,call_-1
35: (69) r3 = *(u16 *)(r7 +38)
36: (dc) r3 = be16 r3
r3 get the value from memory, its value could be any one as permitted
by the type.

37: (bf) r2 = r3
38: (07) r2 += -8
39: (57) r2 &= 65535
40: (b7) r0 = 1
41: (25) if r2 > 0x1f8 goto pc+323
test is done by r2. We indeed get better range for r2 (below:
R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) )
but r3 range is not tightened.

R0=inv1 R1=pkt_end(id=0,off=0,imm=0) R2=inv(id=0,umax_value=504,var_off=(0x0; 0x1ff)) R3=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=pkt(id=0,off=0,r=42,imm=0) R8=pkt(id=0,off=34,r=42,imm=0) R9=pkt(id=0,off=14,r=42,imm=0) R10=fp0,call_-1
42: (bf) r2 = r7
43: (0f) r2 += r3
math between pkt pointer and register with unbounded min value is not allowed
Here, we use r3 to do arith and the verifier not happy. If we use the
tightened old r2, it may work.
The compiler does the right thing, just verifier is not advanced enough.


I'm pretty sure the issue is about udp_len, that's why I tried to validate its value before to use it ... but without success...
When I set udp_len to 8 (just for testing) this seems to works. Any idea about that ?
Yes, you will need some source workaround. You could try below (untested):

if (udp_len > 512) // TODO use a more approriate max value
return XDP_DROP;
+ udp_len = udp_len & 0x1ff;
if ((void *) udp + udp_len > data_end)
return XDP_DROP;


Full code is available here : https://gist.github.com/sbernard31/d4fee7518a1ff130452211c0d355b3f7

(I'm using python-bpfcc 0.8.0-4 from debian sid with a 4.19.12 kernel)
(I don't know if this is the right place for this kind of question, )


BPF tracing book

Brendan Gregg
 

G'Day,

Many of you already know (and have been helping, thanks) and we've
discussed this on the iovisor call before: my book is finally launched
properly. Some URLs:

http://www.brendangregg.com/blog/2019-07-15/bpf-performance-tools-book.html
http://www.brendangregg.com/bpf-performance-tools-book.html
http://www.informit.com/title/9780136554820
https://www.amazon.com/dp/0136554822

Brendan


Re: sockmap redirect doesn't work all the time

Ferenc Fejes
 

Hi!

Are you sure about the problem? Try to log (with bpf_printk for example) the incoming packets in the BPF program to check if the really missing. I also encountered your problem in my experiments, but then it turned out the packets dropped because of insufficient receiver memory on the peer socket. I recommend you to try it with small messages (like 10kbyte, so you can count all the packets if the arriving properly, based on the BPF parser log). 

Thanks,
Ferenc


Linux Plumbers BPF micro-conference CFP (reminder)

Daniel Borkmann
 

This is a call for proposals for the BPF micro-conference at this
years' Linux Plumbers Conference (LPC) 2019 which will be held in
Lisbon, Portugal for September 9-11.

The goal of the BPF micro-conference is to bring BPF developers
together to discuss topics around Linux kernel work related to
the BPF core infrastructure as well as its many subsystems under
tracing, networking, security, and BPF user space tooling (LLVM,
libbpf, bpftool and many others).

The format of the micro-conference has a main focus on discussion,
therefore each accepted topic will provide a short 1-2 slide
introduction with subsequent discussion for the rest of the given
time slot.

The BPF micro-conference is a community-driven event and open to
all LPC attendees, there is no additional registration required.

Please submit your discussion proposals to the LPC BPF micro-conference
organizers at:

lpc-bpf@...

Proposals must be submitted until August 2nd, and submitters will
be notified of acceptance at latest by August 9. (Please note that
proposals must not be sent as html mail as they are otherwise dropped
by vger.)

The format of the submission and many other details can be found at:

http://vger.kernel.org/lpc-bpf.html

Looking forward to seeing you all in Lisbon in September!


Go-bpf with TC

Kanthi P
 

Hi,

Can anyone help me with an example of a GO-bpf program using TC

Thanks,
Kanthi


sockmap redirect doesn't work all the time

Forrest Chen
 

Hi all,

Recently I've been testing with ebpf sockmap. I know that after setting the sock fd in sockmap, the data send to this socket can redirect to another socket that register in the sockmap. 

I rewrite an example based on https://github.com/dippynark/bpf-sockmap and find something unexpected. In my example, my program(an LB instance with sockmap support to redirect data from client to the backend real server) will accept a connection from client and build a new connection to real server. Than set both the two sock fds to sockmap and the verdict program will redirect packet from one socket to another.
I find that some packets are not handled by sockmap if the packet arrived before I set sock fd. So I have to read the packets from my program and send it out to the real server.

In my expectation, I think that after set sockfd to sockmap, all packets should be handled by parser/verdict ebpf functions and should not read any data from userspace, even if the packet received before the sock fd. But it seems I'm wrong...

If the description above is as expected, in what scenario I could use the sockmap since it doesn't make sure all packet handled by ebpf redirect.
Or if I am wrong, or using the sockmap in the wrong way/scenario, please point this out

Thanks,

Forrest Chen


Re: [agenda] IO Visor TSC/Dev Meeting

Brenden Blanco
 

Hi all,

Let's skip the meeting this week, since there isn't a specific agenda
to discuss. Going forward, I'll send out a call for agenda reminder 2
days in advance so the meeting isn't decided last minute.

Thanks,
Brenden

On Wed, Jun 26, 2019 at 9:27 AM Brenden Blanco <bblanco@...> wrote:

On Tue, Jun 25, 2019 at 1:39 PM Brenden Blanco <bblanco@...> wrote:

Hi All,

As per the discussion from last meeting, this week's meeting will be
provisional on having a proposed agenda rather than free-form. Therefore,
please reply if there is a topic that you would like to discuss live with the
other BPF developers.
I have only received the following agenda from Brendan so far:

I have a meeting clash so I can only join for a bit; my agenda items are:
BPF tracing book
ply
Are there any further agenda items?



=== IO Visor Dev/TSC Meeting ===

Every 2 weeks on Wednesday, from Wednesday, January 25, 2017, to no end date
11:00 am | Pacific Daylight Time (San Francisco, GMT-07:00) | 30 min

https://bluejeans.com/568677804/


Re: [agenda] IO Visor TSC/Dev Meeting

Brenden Blanco
 

On Tue, Jun 25, 2019 at 1:39 PM Brenden Blanco <bblanco@...> wrote:

Hi All,

As per the discussion from last meeting, this week's meeting will be
provisional on having a proposed agenda rather than free-form. Therefore,
please reply if there is a topic that you would like to discuss live with the
other BPF developers.
I have only received the following agenda from Brendan so far:

I have a meeting clash so I can only join for a bit; my agenda items are:
BPF tracing book
ply
Are there any further agenda items?



=== IO Visor Dev/TSC Meeting ===

Every 2 weeks on Wednesday, from Wednesday, January 25, 2017, to no end date
11:00 am | Pacific Daylight Time (San Francisco, GMT-07:00) | 30 min

https://bluejeans.com/568677804/


[agenda] IO Visor TSC/Dev Meeting

Brenden Blanco
 

Hi All,

As per the discussion from last meeting, this week's meeting will be
provisional on having a proposed agenda rather than free-form. Therefore,
please reply if there is a topic that you would like to discuss live with the
other BPF developers.


=== IO Visor Dev/TSC Meeting ===

Every 2 weeks on Wednesday, from Wednesday, January 25, 2017, to no end date
11:00 am | Pacific Daylight Time (San Francisco, GMT-07:00) | 30 min

https://bluejeans.com/568677804/


bpftrace v0.9.1

Matheus Marchini <mat@...>
 

We just released bpftrace v0.9.1. This turned into a pretty big
release, with almost 200 commits and 3 months since v0.9.0. We'll try
to have smaller development cycles for the next releases.

Some highlights of this version:

- Compound assignment operators (+= and friends)
- Support arrays and IPv6 in ntop
- Add basic support to enums
- Add basic macro definition support
- Allow comparison of two string variables
- Add pre and post behavior to ++ and -- operators
- Ban kprobes that cause CPU deadlocks
- Add unsafe-mode and make default execution mode safe-mode

Full release notes can be found at:
https://github.com/iovisor/bpftrace/releases/tag/v0.9.1

Hope y'all enjoy this version! Let us know if you find any issues.
Cheers,


Re: Performance of veth XDP

Forrest Chen
 

On Wed, Jun 12, 2019 at 09:31 PM, Toshiaki Makita wrote:
I cannot reproduce it.
Is it an XDP-related problem? What happens if you use bridge in place of XDP_REDIRECT?
Did you collect tcpdump result in server-side?
Also how about checking "ethtool -S" and "nstat" (not netstat) periodically?
Thanks. I have tested in non-XDP mode and the problem also happen. It maybe a netperf bug...


Re: bpf_probe_read() split: bpftrace RFC

Matheus Marchini <mat@...>
 

How will bpf_probe_read_user/bpf_probe_read_kernel be enforced in the
Kernel? In other words, how bpf_probe_read_user will detect and report
when it get's a Kernel address as parameter, and vice-versa? Will it
be accomplished by the verifier (is it even possible to do this
reliably with the verifier) or only on runtime?

If the kernel will only test it during runtime, and it returns an
unique error code (different than errors that probe_read can return
today, we might need to create a new error code) , we could do the
following for the dereference operands (*/str()):

typedef int (probe_read_t)(void *dst, int size, void *src);

// Assuming bpf_probe_read_[user,kernel] will return EINVALADDRSPC
// if the user tires to access an address with the wrong function
int err;

// space_ctx is defined according to Brendan's email
probe_read_t default_probe_read; = space_ctx == KERNEL ?
bpf_probe_read_kernel : bpf_probe_read_user;
probe_read_t fallback_probe_read;
if (addr_space_ctx == KERNEL) {
default_probe_read = bpf_probe_read_kernel;
fallback_probe_read = bpf_probe_read_user;
}
else {
default_probe_read = bpf_probe_read_user;
fallback_probe_read = bpf_probe_read_kernel;
}

if (err = (*default_probe_read)(dst, size, src) == EINVALADDRSPC) {
err = (*fallback_probe_read)(dst, size, src);
}
if (err < 0)
{
bpf_trace_printk("Error while reading address %x\n", src);
return;
}

With this approach we can avoid breaking any scripts. The only
difference is that it will add more overhead when the fallback
probe_read is used (and if the user is affected by this overhead, they
can still use kptr/uptr/kstr/ustr). We could also: print to
stdout/syslog when the fallback method is used if bpftrace is running
in verbose mode, and provide a "strict" mode which would not try to
run the fallback probe_read.

On Thu, Jun 13, 2019 at 11:32 AM Brendan Gregg
<brendan.d.gregg@...> wrote:

G'Day,

This is the biggest change afoot to the bpftrace API, and I think we
can sort it out quickly without fuss, but it is worth sharing here.
This is from https://github.com/iovisor/bpftrace/issues/614 .

bpftrace currently allows pointer dereferencing via *addr, and
str(addr) for strings. But the future split of bpf_probe_read() into
bpf_probe_read_user() and bpf_probe_read_kernel() (to support SPARC,
etc) may break a lot of bpftrace tools and documentation. Or it may
not, if we are clever about it.

The proposal is this: add the following bpftrace builtins:

- uptr(addr): dereference user address
- ustr(addr): fetch NULL-terminated user string
- kptr(addr): dereference kernel address
- kstr(addr): fetch NULL-terminated kernel string

AND, to introduce a "context" for probe actions -- user or kernel --
where *addr and str(addr) work relative to that context. The context
would be:

- kprobes/kretprobes: kernel
- uprobes/uretprobes: user
- tracepoints: kernel (with the exception of syscall tracepoints: user)
- other probe types: kernel

It's possible that this context approach leaves us with zero broken
tools and documentation (ie, there are zero cases so far where we even
need to use uptr/ustr/kptr/kstr). I'm still checking and looking for
exceptions. Where you can help: can you think of a syscall tracepoint
that has a kernel address as an argument? Or another non-syscall
tracepoint that has a user-address as an argument? Or can you think of
any other problem with this plan?

thanks,

Brendan



bpf_probe_read() split: bpftrace RFC

Brendan Gregg
 

G'Day,

This is the biggest change afoot to the bpftrace API, and I think we
can sort it out quickly without fuss, but it is worth sharing here.
This is from https://github.com/iovisor/bpftrace/issues/614 .

bpftrace currently allows pointer dereferencing via *addr, and
str(addr) for strings. But the future split of bpf_probe_read() into
bpf_probe_read_user() and bpf_probe_read_kernel() (to support SPARC,
etc) may break a lot of bpftrace tools and documentation. Or it may
not, if we are clever about it.

The proposal is this: add the following bpftrace builtins:

- uptr(addr): dereference user address
- ustr(addr): fetch NULL-terminated user string
- kptr(addr): dereference kernel address
- kstr(addr): fetch NULL-terminated kernel string

AND, to introduce a "context" for probe actions -- user or kernel --
where *addr and str(addr) work relative to that context. The context
would be:

- kprobes/kretprobes: kernel
- uprobes/uretprobes: user
- tracepoints: kernel (with the exception of syscall tracepoints: user)
- other probe types: kernel

It's possible that this context approach leaves us with zero broken
tools and documentation (ie, there are zero cases so far where we even
need to use uptr/ustr/kptr/kstr). I'm still checking and looking for
exceptions. Where you can help: can you think of a syscall tracepoint
that has a kernel address as an argument? Or another non-syscall
tracepoint that has a user-address as an argument? Or can you think of
any other problem with this plan?

thanks,

Brendan

281 - 300 of 2020