
Re: CPU Concurrency Issues

Yonghong Song
 

On Tue, Aug 20, 2019 at 12:05 PM Arnaldo Carvalho de Melo
<arnaldo.melo@...> wrote:

Em Tue, Aug 20, 2019 at 11:22:50AM -0700, Yonghong Song escreveu:
On Tue, Aug 20, 2019 at 11:02 AM Matthew Ahrens <mahrens@...> wrote:
On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song <ys114321@...> wrote:
Also, for tracing programs, we have a per-CPU counter to prevent another bpf program from running when one is interrupted.
I think that means that if an interrupt fires while the bpf program is running, the interrupt will run, but if the interrupt causes another tracing event to fire, the associated bpf program will not run (i.e. the event will be ignored / dropped). Is that right?
Yes. See
https://github.com/torvalds/linux/blob/master/kernel/trace/bpf_trace.c#L88-L97
Yonghong, I thought there would be some counter to at least let users
know that drop happened, has this ever surfaced? Or is there some way to
know about those drops that I'm missing?
You did not miss anything. Currently, there are no counters to count those
drops, whether due to nmi or due to a bpf program already running on that cpu.

There is an effort by Daniel Xu to expose nhit/nmisses counters
from the k/uprobe trace infra. But even when the kprobe itself does not miss,
the bpf program may still not fire for the above reasons.
https://lore.kernel.org/bpf/20190820214819.16154-1-dxu@dxuuu.xyz/T/#t

debugfs has k/uprobe_profile to count nhit/nmisses from k/uprobe trace infra.

We could add a counter to trace_event_call->event to count hits/misses. Hits
can also be counted by the bpf program itself. Misses should be rare, and
most bpf programs, e.g. in bcc, are designed to tolerate an occasional probe miss,
which should not affect the final aggregation results much.

How strongly do you feel that such a hit/miss counter for tracing bpf programs
is needed?


- Arnaldo


Re: CPU Concurrency Issues

brad.lewis@...
 

Thanks Matt and Yonghong.  Info on the preempt_disable()/preempt_enable() region is exactly what I was looking for.


Re: CPU Concurrency Issues

Matt Ahrens
 



On Tue, Aug 20, 2019 at 11:23 AM Y Song <ys114321@...> wrote:
On Tue, Aug 20, 2019 at 11:02 AM Matthew Ahrens <mahrens@...> wrote:
>
>
>
> On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song <ys114321@...> wrote:
>>
>> On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens <mahrens@...> wrote:
>> >
>> > On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song <ys114321@...> wrote:
>> >>
>> >> On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > I'm trying to verify that there are no concurrency issues with an approach I'm using cpu_id as a key to a HASH_MAP.   My understanding is that bcc disables preemption but the details are fuzzy and I haven't been able to find anything in the code.  Can anyone shed some light on this?
>> >>
>> >> preemption is a kernel thing, bcc does not disable it.  You need to
>> >> check kernel configuration CONFIG_PREEMPT in your host.
>> >
>> >
>> > When running bpf code from a kprobe / kretprobe, does anything ensure that cpu_id doesn't change while the bpf is running (e.g. due to preemption)?  Does anything ensure that no other bpf code runs on this CPU while this kprobe is running (e.g. due to an interrupt firing and hitting a different kprobe)?  If either of those things can happen, it seems difficult to atomically increment an entry in a HASH_MAP (even when using the cpu_id as a key).
>>
>> bpf program run is wrapped in preempt_disable()/preempt_enable()
>> region. Also we have per cpu counter to prevent when bpf program
>> interrupted then another bpf to run for tracing programs. Therefore,
>> for tracing programs, the situation you mentioned in the above won't
>> happen.
>>
>> The only thing which can happen is networking and tracing program
>> share the same map, e.g.,
>>    networking bpf program can be nmi interrupted and a bpf tracing
>> program (perf_event type) may be running.
>>
>
> Great, thanks Yonghong!  And I get that this is part of the infrastructure that calls into the bpf code (e.g. perf_events), not bcc.
>
> > Also we have per cpu counter to prevent when bpf program interrupted then another bpf to run for tracing programs.
>
> I think that means that if an interrupt fires while the bpf program is run, the interrupt will run, but if the interrupt causes another tracing event to fire, the associated bpf program will not run (i.e. the event will be ignored / dropped).  Is that right?

Yes. See
https://github.com/torvalds/linux/blob/master/kernel/trace/bpf_trace.c#L88-L97


Thanks for the pointer to the code!

--matt


Re: CPU Concurrency Issues

Yonghong Song
 

On Tue, Aug 20, 2019 at 11:02 AM Matthew Ahrens <mahrens@...> wrote:



On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song <ys114321@...> wrote:

On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens <mahrens@...> wrote:

On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song <ys114321@...> wrote:

On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:

Hi all,

I'm trying to verify that there are no concurrency issues with an approach where I'm using cpu_id as a key to a HASH_MAP. My understanding is that bcc disables preemption, but the details are fuzzy and I haven't been able to find anything in the code. Can anyone shed some light on this?
preemption is a kernel thing; bcc does not disable it. You need to
check the CONFIG_PREEMPT kernel configuration on your host.

When running bpf code from a kprobe / kretprobe, does anything ensure that cpu_id doesn't change while the bpf is running (e.g. due to preemption)? Does anything ensure that no other bpf code runs on this CPU while this kprobe is running (e.g. due to an interrupt firing and hitting a different kprobe)? If either of those things can happen, it seems difficult to atomically increment an entry in a HASH_MAP (even when using the cpu_id as a key).
bpf program runs are wrapped in a preempt_disable()/preempt_enable()
region. Also, for tracing programs, we have a per-CPU counter to prevent
another bpf program from running when one is interrupted on that CPU.
Therefore, for tracing programs, the situation you mention above won't
happen.

The only case where this can happen is when a networking and a tracing
program share the same map, e.g., a networking bpf program is
nmi-interrupted while a bpf tracing program (perf_event type) is running.
Great, thanks Yonghong! And I get that this is part of the infrastructure that calls into the bpf code (e.g. perf_events), not bcc.

Also, for tracing programs, we have a per-CPU counter to prevent another bpf program from running when one is interrupted.
I think that means that if an interrupt fires while the bpf program is running, the interrupt will run, but if the interrupt causes another tracing event to fire, the associated bpf program will not run (i.e. the event will be ignored / dropped). Is that right?
Yes. See
https://github.com/torvalds/linux/blob/master/kernel/trace/bpf_trace.c#L88-L97


--matt





--matt


Re: CPU Concurrency Issues

Matt Ahrens
 



On Tue, Aug 20, 2019 at 10:47 AM Yonghong Song <ys114321@...> wrote:
On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens <mahrens@...> wrote:
>
> On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song <ys114321@...> wrote:
>>
>> On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:
>> >
>> > Hi all,
>> >
>> > I'm trying to verify that there are no concurrency issues with an approach I'm using cpu_id as a key to a HASH_MAP.   My understanding is that bcc disables preemption but the details are fuzzy and I haven't been able to find anything in the code.  Can anyone shed some light on this?
>>
>> preemption is a kernel thing, bcc does not disable it.  You need to
>> check kernel configuration CONFIG_PREEMPT in your host.
>
>
> When running bpf code from a kprobe / kretprobe, does anything ensure that cpu_id doesn't change while the bpf is running (e.g. due to preemption)?  Does anything ensure that no other bpf code runs on this CPU while this kprobe is running (e.g. due to an interrupt firing and hitting a different kprobe)?  If either of those things can happen, it seems difficult to atomically increment an entry in a HASH_MAP (even when using the cpu_id as a key).

bpf program runs are wrapped in a preempt_disable()/preempt_enable()
region. Also, for tracing programs, we have a per-CPU counter to prevent
another bpf program from running when one is interrupted on that CPU.
Therefore, for tracing programs, the situation you mention above won't
happen.

The only case where this can happen is when a networking and a tracing
program share the same map, e.g., a networking bpf program is
nmi-interrupted while a bpf tracing program (perf_event type) is running.


Great, thanks Yonghong!  And I get that this is part of the infrastructure that calls into the bpf code (e.g. perf_events), not bcc.

> Also we have per cpu counter to prevent when bpf program interrupted then another bpf to run for tracing programs.

I think that means that if an interrupt fires while the bpf program is running, the interrupt will run, but if the interrupt causes another tracing event to fire, the associated bpf program will not run (i.e. the event will be ignored / dropped).  Is that right?

--matt
 


>
> --matt




Re: CPU Concurrency Issues

Yonghong Song
 

On Tue, Aug 20, 2019 at 10:38 AM Matthew Ahrens <mahrens@...> wrote:

On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song <ys114321@...> wrote:

On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:

Hi all,

I'm trying to verify that there are no concurrency issues with an approach where I'm using cpu_id as a key to a HASH_MAP. My understanding is that bcc disables preemption, but the details are fuzzy and I haven't been able to find anything in the code. Can anyone shed some light on this?
preemption is a kernel thing; bcc does not disable it. You need to
check the CONFIG_PREEMPT kernel configuration on your host.

When running bpf code from a kprobe / kretprobe, does anything ensure that cpu_id doesn't change while the bpf is running (e.g. due to preemption)? Does anything ensure that no other bpf code runs on this CPU while this kprobe is running (e.g. due to an interrupt firing and hitting a different kprobe)? If either of those things can happen, it seems difficult to atomically increment an entry in a HASH_MAP (even when using the cpu_id as a key).
bpf program runs are wrapped in a preempt_disable()/preempt_enable()
region. Also, for tracing programs, we have a per-CPU counter to prevent
another bpf program from running when one is interrupted on that CPU.
Therefore, for tracing programs, the situation you mention above won't
happen.

The only case where this can happen is when a networking and a tracing
program share the same map, e.g., a networking bpf program is
nmi-interrupted while a bpf tracing program (perf_event type) is running.




--matt


Re: CPU Concurrency Issues

Matt Ahrens
 

On Mon, Aug 19, 2019 at 4:30 PM Yonghong Song <ys114321@...> wrote:
On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:
>
> Hi all,
>
> I'm trying to verify that there are no concurrency issues with an approach I'm using cpu_id as a key to a HASH_MAP.   My understanding is that bcc disables preemption but the details are fuzzy and I haven't been able to find anything in the code.  Can anyone shed some light on this?

preemption is a kernel thing, bcc does not disable it.  You need to
check kernel configuration CONFIG_PREEMPT in your host.

When running bpf code from a kprobe / kretprobe, does anything ensure that cpu_id doesn't change while the bpf is running (e.g. due to preemption)?  Does anything ensure that no other bpf code runs on this CPU while this kprobe is running (e.g. due to an interrupt firing and hitting a different kprobe)?  If either of those things can happen, it seems difficult to atomically increment an entry in a HASH_MAP (even when using the cpu_id as a key).

--matt


agenda: IO Visor TSC/Dev Meeting

Brenden Blanco
 

Hi All,

We have the bi-weekly phone conference scheduled for two days from now, does
anybody have a discussion topic to add to the agenda? As a reminder, we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden


Re: CPU Concurrency Issues

Yonghong Song
 

On Mon, Aug 19, 2019 at 1:54 PM <brad.lewis@...> wrote:

Hi all,

I'm trying to verify that there are no concurrency issues with an approach where I'm using cpu_id as a key to a HASH_MAP. My understanding is that bcc disables preemption, but the details are fuzzy and I haven't been able to find anything in the code. Can anyone shed some light on this?
preemption is a kernel thing; bcc does not disable it. You need to
check the CONFIG_PREEMPT kernel configuration on your host.




Thank you,

Brad Lewis






CPU Concurrency Issues

brad.lewis@...
 

Hi all, 

I'm trying to verify that there are no concurrency issues with an approach where I'm using cpu_id as a key to a HASH_MAP. My understanding is that bcc disables preemption, but the details are fuzzy and I haven't been able to find anything in the code. Can anyone shed some light on this?

 

Thank you, 

Brad Lewis 

 

 


Re: agenda: IO Visor TSC/Dev Meeting

Brenden Blanco
 

Hi All,

Since we haven't received any agenda suggestions for the meeting
today, please consider it canceled.

Thanks,
Brenden

On Mon, Aug 5, 2019 at 8:55 PM Brenden Blanco <bblanco@...> wrote:

Hi All,

We have the bi-weekly phone conference scheduled for two days from now, does
anybody have a discussion topic to add to the agenda? As a reminder, we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden


Does bpf_redirect_map support redirect packet to bridge device or veth which master is a bridge?

Forrest Chen
 

Hi all,

I attach an XDP prog (SKB mode) to a host net device (enp0s9), and the prog redirects packets to a veth device.
If there is only a veth device, I can redirect packets into the netns through the veth device on the host. But if the veth device is attached to a bridge device, the redirect fails (bpf_redirect_map returns XDP_REDIRECT, but no packet is seen in the netns).

My sample code is here
The network topo shows below
note 1: if I remove br1, I can redirect successfully.
note 2: currently our k8s network mode uses a bridge device to forward packets. If a bridge makes XDP not work, I will have to consider eliminating the bridge.
note 3: I have tested this sample on both 4.18 and 5.0.0; both of them failed.


Re: agenda: IO Visor TSC/Dev Meeting

alessandro.gario@...
 

Hello Song!

I'm using sys_enter_execve/sys_exit_execve and
sys_enter_execveat/sys_exit_execveat.

Bound loops seem really useful! This is going to solve many of the
issues I had.

I don't have code I can show, but I'm using LLVM and Clang directly by
generating IR. Data is acquired in a similar way to execsnoop (i.e.
looping through argv[] looking for the null terminator).

The second example is causing me issues as I can only capture a low
amount of parameters before I run out of instruction space. Bound loops
will certainly help quite a lot in this scenario!

Thanks so much for your help!

Alessandro Gario

On Tue, 2019-08-06 at 11:17 -0700, Y Song wrote:
On Tue, Aug 6, 2019 at 7:31 AM <alessandro.gario@...> wrote:
Hello Brenden!

I’m not sure if it counts as an agenda item, but I’m interested in
recording process events using tracepoints, and I would like to
know
what are the best practices when attempting to do so.
Which tracepoint do you have in mind for your particular use case?

Due to project goals (endpoint monitoring) one of the requirements
is
to avoid losing any event data.

It is probably not a surprise given the limits imposed by the
verifier,
but I’m having trouble with variadic functions and long strings.

The following are some events I would like to capture with
reasonable
success:

String padding, causing the string I need to be truncated:

bash -c “<padding whitespace> /bin/rm -rf /home”
The recent kernel (5.3) added bounded loop support up to 1M
instructions. You can have a bounded loop like
    start = ...
    for (i = 0; i < 256 && start < end && start[i] == ' ')
        start++;
The verifier should be able to handle this properly.

In the old kernel, you will have to manually unroll the loop
and do the checking.

Argument padding, causing the BPF program to not reach the last
elements:

sudo bash --verbose --verbose .. --verbose -c ‘printf
“SELINUX=disabled\nSELINUXTYPE=targeted\n” > /etc/selinux/config’
Not sure what the issue is here.
Maybe you can describe your bpf program and tracepoint setup
in more detail, so we can better understand the problem.

I thought about trying to (tail?) call additional BPF programs to
work
around the second issue, but I’m not sure how to proceed with the
first
one.

Thanks!

Alessandro Gario

On Mon, 2019-08-05 at 20:55 -0700, Brenden Blanco wrote:
Hi All,

We have the bi-weekly phone conference scheduled for two days
from
now, does
anybody have a discussion topic to add to the agenda? As a
reminder,
we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden




Re: agenda: IO Visor TSC/Dev Meeting

Yonghong Song
 

On Tue, Aug 6, 2019 at 7:31 AM <alessandro.gario@...> wrote:

Hello Brenden!

I’m not sure if it counts as an agenda item, but I’m interested in
recording process events using tracepoints, and I would like to know
what are the best practices when attempting to do so.
Which tracepoint do you have in mind for your particular use case?


Due to project goals (endpoint monitoring) one of the requirements is
to avoid losing any event data.

It is probably not a surprise given the limits imposed by the verifier,
but I’m having trouble with variadic functions and long strings.

The following are some events I would like to capture with reasonable
success:

String padding, causing the string I need to be truncated:

bash -c “<padding whitespace> /bin/rm -rf /home”
The recent kernel (5.3) added bounded loop support up to 1M
instructions. You can have a bounded loop like
    start = ...
    for (i = 0; i < 256 && start < end && start[i] == ' ')
        start++;
The verifier should be able to handle this properly.

In the old kernel, you will have to manually unroll the loop
and do the checking.


Argument padding, causing the BPF program to not reach the last
elements:

sudo bash --verbose --verbose .. --verbose -c ‘printf
“SELINUX=disabled\nSELINUXTYPE=targeted\n” > /etc/selinux/config’
Not sure what the issue is here.
Maybe you can describe your bpf program and tracepoint setup
in more detail, so we can better understand the problem.


I thought about trying to (tail?) call additional BPF programs to work
around the second issue, but I’m not sure how to proceed with the first
one.

Thanks!

Alessandro Gario

On Mon, 2019-08-05 at 20:55 -0700, Brenden Blanco wrote:
Hi All,

We have the bi-weekly phone conference scheduled for two days from
now, does
anybody have a discussion topic to add to the agenda? As a reminder,
we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden





Re: agenda: IO Visor TSC/Dev Meeting

alessandro.gario@...
 

Hello Brenden!

I’m not sure if it counts as an agenda item, but I’m interested in
recording process events using tracepoints, and I would like to know
what are the best practices when attempting to do so.

Due to project goals (endpoint monitoring) one of the requirements is
to avoid losing any event data.

It is probably not a surprise given the limits imposed by the verifier,
but I’m having trouble with variadic functions and long strings.

The following are some events I would like to capture with reasonable
success:

String padding, causing the string I need to be truncated:

bash -c “<padding whitespace> /bin/rm -rf /home”

Argument padding, causing the BPF program to not reach the last
elements:

sudo bash --verbose --verbose .. --verbose -c ‘printf
“SELINUX=disabled\nSELINUXTYPE=targeted\n” > /etc/selinux/config’

I thought about trying to (tail?) call additional BPF programs to work
around the second issue, but I’m not sure how to proceed with the first
one.

Thanks!

Alessandro Gario

On Mon, 2019-08-05 at 20:55 -0700, Brenden Blanco wrote:
Hi All,

We have the bi-weekly phone conference scheduled for two days from
now, does
anybody have a discussion topic to add to the agenda? As a reminder,
we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden



Re: sockmap redirect doesn't work all the time

Forrest Chen
 


 Hi Ferenc,

I think sock_ops is a good option for me; it seems the cilium project also uses this to accelerate networking.

Did all of your curl requests successfully establish their TCP connections? If there are connection reset failures, maybe you could increase net.somaxconn and the backlog size of the listener socket.
Yes, all TCP connections established. There is no connection pressure, and I send curl requests manually.

Thanks,
Forrest chen


agenda: IO Visor TSC/Dev Meeting

Brenden Blanco
 

Hi All,

We have the bi-weekly phone conference scheduled for two days from now, does
anybody have a discussion topic to add to the agenda? As a reminder, we are
planning to hold the meeting only if agenda items are proposed.

Cheers,
Brenden


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtainnamespace data from current task.

Yonghong Song
 

On Mon, Aug 5, 2019 at 2:19 PM carlos antonio neira bustos
<cneirabustos@...> wrote:

Hi Yonghong,



I had time yesterday to finish the code to avoid using getname_kernel , self-tests run ok with expected results.
looks good. A couple of comments below.


Here is the current code to avoid the getname_kernel call.



diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..bfb2bbf37a58 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,12 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+

 #include "../../lib/kstrtox.h"

@@ -312,6 +318,81 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }

+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	   size)
+{
+	const char* name = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	int len = strlen(name) + 1;
+	struct filename *tmp = NULL;

reverse Christmas tree order.

+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret;

Looks like -EINVAL is returned for all error cases.
This is suboptimal. There are cases where another error
code should be returned.

+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		goto clear;

directly return -EINVAL.

+
+	pidns = task_active_pid_ns(current);
+
+	if (unlikely(!pidns))
+		goto clear;
+
+	pidns_info->nsid = pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+
+	if (unlikely(!pid))
+		goto clear;
+
+	tgid = task_tgid_nr_ns(current, pidns);
+
+	if (unlikely(!tgid))
+		goto clear;

Not 100% sure whether pidns/pid/tgid can go wrong. During process exit?
You can keep it for now. This needs more study, or other people can comment on it.

+
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+
+	if(!names_cachep)
+		goto clear;
+
+	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		goto clear;
+	}
+
+	memcpy((char *)tmp->name, name, len);
+	tmp->uptr = NULL;
+	tmp->aname = NULL;
+	tmp->refcnt = 1;
+
+	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
+
+	if (ret)
+		goto clear;

We should return the error code "ret" instead of -EINVAL.

+
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = inode->i_sb->s_dev;
+
+	return 0;
+
+ clear:
+	if (pidns_info)
+		memset((void *)pidns, 0, (size_t) size);

pidns_info is not NULL, guaranteed by the verifier, so "if (pidns_info)"
can be removed.
It should be
	memset((void *)pidns_info, ...)
Accidentally zeroing pidns may cause kernel errors (not sure whether
it is related to the fault below or not).

After making the above suggested changes, you can directly send the
patch set upstream.
Please cc me at yhs@....

+
+	return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func		= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {





Now the problem I’m facing is that after running the sample code to call the helper, I’m seeing this once but then it’s never repeated running in later runs of the same sample code.



[ 75.097213] CPU: 0 PID: 1382 Comm: bash Not tainted 5.3.0-rc1patch12-00120-g5d8525007580-dirty #7
[ 75.097213] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[ 75.097215] RIP: 0010:idr_alloc_u32+0xa8/0xc0
[ 75.097216] Code: 31 d2 48 89 e6 48 89 ef e8 e5 54 00 00 31 c0 48 8b 7c 24 20 65 48 33 3c 25 28 00 00 00 75 1a 48 83 c4 28 5b 5d 41 5c 41 5d c3 <0f> 0b 0d 04 00 80 00 89 47 04 e9 78 ff ff ff e8 14 ed 93 ff 0f 1f
[ 75.097216] RSP: 0018:ffffac39007c7d08 EFLAGS: 00010046
[ 75.097217] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000001ffff
[ 75.097218] RDX: ffffac39007c7d5c RSI: 0000000000000d99 RDI: ffffffffbea46b88
[ 75.097218] RBP: ffffffffbea46b88 R08: 0000000000000a20 R09: 0000000000000000
[ 75.097218] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000
[ 75.097219] R13: ffffac39007c7d5c R14: 000000000000012c R15: ffffffffbea46b80
[ 75.097219] FS: 00007f7ba7691b40(0000) GS:ffffa0fef5c00000(0000) knlGS:0000000000000000
[ 75.097220] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 75.097220] CR2: 00000000022eea64 CR3: 000000022f498004 CR4: 00000000003606f0
[ 75.097246] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 75.097246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 75.097247] Call Trace:
[ 75.097251] ? kmem_cache_alloc+0x161/0x600
[ 75.097252] idr_alloc_cyclic+0x53/0xb0
[ 75.097254] alloc_pid+0xc1/0x2a0
[ 75.097255] copy_process+0xdc4/0x1c20
[ 75.097256] _do_fork+0x74/0x360
[ 75.097257] ? recalc_sigpending+0x17/0x50
[ 75.097258] __x64_sys_clone+0x7f/0xa0
[ 75.097260] do_syscall_64+0x55/0x130
[ 75.097261] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 75.097263] RIP: 0033:0x7f7ba6d6538b
[ 75.097264] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00
[ 75.097265] RSP: 002b:00007ffc9a6cf180 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 75.097265] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7ba6d6538b
[ 75.097266] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 75.097266] RBP: 00007ffc9a6cf1b0 R08: 00007f7ba7691b40 R09: 0000000000000002
[ 75.097266] R10: 00007f7ba7691e10 R11: 0000000000000246 R12: 0000000000000000
[ 75.097267] R13: 00000000004cd15e R14: 0000000000000000 R15: 0000000000000000
[ 75.097268] ---[ end trace 2937ef4d051e39cd ]---



I need a couple of days to debug this; if nobody spots anything odd in the code first, I hope to have good news tomorrow and will then submit the patch.



Bests



From: Y Song
Sent: 05 August 2019 16:16
To: carlos antonio neira bustos
Cc: Alexei Starovoitov; Tom Herbert via iovisor-dev; Eric Biederman
Subject: Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtainnamespace data from current task.



Carlos,



Any update on this? It would be good to submit the patch early in the
release cycle, so it has a chance of being accepted in the current cycle.



On Fri, Jul 26, 2019 at 3:41 AM carlos antonio neira bustos

<cneirabustos@...> wrote:

Thanks Yonghong!
Yes, I was looking at filename_lookup today to see how to do it; I'll work on that.
Thank you very much.
El vie., 26 de jul. de 2019 01:18, Yonghong Song <ys114321@...> escribió:
On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov
<alexei.starovoitov@...> wrote:
On Thu, Jul 25, 2019 at 11:53 AM neirac <cneirabustos@...> wrote:
Hey Yonghong,
I have changed the getname_kernel interface to specify the allocation type;
all previous callers keep GFP_KERNEL as the allocation type. After this change,
getname_kernel can be used with GFP_ATOMIC in ebpf helpers.
If this change goes in, I could complete bpf_get_current_pidns_info helper.
Let me know your thoughts.
please figure out how to get pidns_info w/o calling into getname_kernel().
Looks like we can call filename_lookup() directly. We do not need to
call getname_kernel().
We know the exact path name. We can do a atomic allocation inside the helper.
Carlos, could you take a look along this line? In this way, we do not
need to change
any files under linux/fs.


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtainnamespace data from current task.

neirac
 

Hi Yonghong,

 

I had time yesterday to finish the code to avoid using getname_kernel , self-tests run ok with expected results.

Here is the current code to avoid the getname_kernel call.

 

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 5e28718928ca..bfb2bbf37a58 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -11,6 +11,12 @@
 #include <linux/uidgid.h>
 #include <linux/filter.h>
 #include <linux/ctype.h>
+#include <linux/pid_namespace.h>
+#include <linux/major.h>
+#include <linux/stat.h>
+#include <linux/namei.h>
+#include <linux/version.h>
+

 #include "../../lib/kstrtox.h"

@@ -312,6 +318,81 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
 	preempt_enable();
 }

+BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32,
+	   size)
+{
+	const char* name = "/proc/self/ns/pid";
+	struct pid_namespace *pidns = NULL;
+	int len = strlen(name) + 1;
+	struct filename *tmp = NULL;
+	struct inode *inode;
+	struct path kp;
+	pid_t tgid = 0;
+	pid_t pid = 0;
+	int ret;
+
+	if (unlikely(size != sizeof(struct bpf_pidns_info)))
+		goto clear;
+
+	pidns = task_active_pid_ns(current);
+
+	if (unlikely(!pidns))
+		goto clear;
+
+	pidns_info->nsid = pidns->ns.inum;
+	pid = task_pid_nr_ns(current, pidns);
+
+	if (unlikely(!pid))
+		goto clear;
+
+	tgid = task_tgid_nr_ns(current, pidns);
+
+	if (unlikely(!tgid))
+		goto clear;
+
+	pidns_info->tgid = (u32) tgid;
+	pidns_info->pid = (u32) pid;
+
+	if(!names_cachep)
+		goto clear;
+
+	tmp = kmem_cache_alloc(names_cachep, GFP_ATOMIC);
+	if (unlikely(!tmp)) {
+		goto clear;
+	}
+
+	memcpy((char *)tmp->name, name, len);
+	tmp->uptr = NULL;
+	tmp->aname = NULL;
+	tmp->refcnt = 1;
+
+	ret = filename_lookup(AT_FDCWD, tmp, 0, &kp, NULL);
+
+	if (ret)
+		goto clear;
+
+	inode = d_backing_inode(kp.dentry);
+	pidns_info->dev = inode->i_sb->s_dev;
+
+	return 0;
+
+ clear:
+	if (pidns_info)
+		memset((void *)pidns, 0, (size_t) size);
+
+	return -EINVAL;
+}
+
+const struct bpf_func_proto bpf_get_current_pidns_info_proto = {
+	.func		= bpf_get_current_pidns_info,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg2_type	= ARG_CONST_SIZE,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {

 

 

Now the problem I’m facing is that after running the sample code that calls the helper, I see the warning below once, but it never reappears in later runs of the same sample code.

 

[ 75.097213] CPU: 0 PID: 1382 Comm: bash Not tainted 5.3.0-rc1patch12-00120-g5d8525007580-dirty #7

[   75.097213] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018

[   75.097215] RIP: 0010:idr_alloc_u32+0xa8/0xc0

[   75.097216] Code: 31 d2 48 89 e6 48 89 ef e8 e5 54 00 00 31 c0 48 8b 7c 24 20 65 48 33 3c 25 28 00 00 00 75 1a 48 83 c4 28 5b 5d 41 5c 41 5d c3 <0f> 0b 0d 04 00 80 00 89 47 04 e9 78 ff ff ff e8 14 ed 93 ff 0f 1f

[   75.097216] RSP: 0018:ffffac39007c7d08 EFLAGS: 00010046

[   75.097217] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000001ffff

[   75.097218] RDX: ffffac39007c7d5c RSI: 0000000000000d99 RDI: ffffffffbea46b88

[   75.097218] RBP: ffffffffbea46b88 R08: 0000000000000a20 R09: 0000000000000000

[   75.097218] R10: 0000000001200000 R11: 0000000000000000 R12: 0000000000000000

[   75.097219] R13: ffffac39007c7d5c R14: 000000000000012c R15: ffffffffbea46b80

[   75.097219] FS:  00007f7ba7691b40(0000) GS:ffffa0fef5c00000(0000) knlGS:0000000000000000

[   75.097220] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[   75.097220] CR2: 00000000022eea64 CR3: 000000022f498004 CR4: 00000000003606f0

[   75.097246] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

[   75.097246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

[   75.097247] Call Trace:

[   75.097251]  ? kmem_cache_alloc+0x161/0x600

[   75.097252]  idr_alloc_cyclic+0x53/0xb0

[   75.097254]  alloc_pid+0xc1/0x2a0

[   75.097255]  copy_process+0xdc4/0x1c20

[   75.097256]  _do_fork+0x74/0x360

[   75.097257]  ? recalc_sigpending+0x17/0x50

[   75.097258]  __x64_sys_clone+0x7f/0xa0

[   75.097260]  do_syscall_64+0x55/0x130

[   75.097261]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[   75.097263] RIP: 0033:0x7f7ba6d6538b

[   75.097264] Code: db 45 85 f6 0f 85 95 01 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 de 00 00 00 85 c0 41 89 c5 0f 85 e5 00 00

[   75.097265] RSP: 002b:00007ffc9a6cf180 EFLAGS: 00000246 ORIG_RAX: 0000000000000038

[   75.097265] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7ba6d6538b

[   75.097266] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011

[   75.097266] RBP: 00007ffc9a6cf1b0 R08: 00007f7ba7691b40 R09: 0000000000000002

[   75.097266] R10: 00007f7ba7691e10 R11: 0000000000000246 R12: 0000000000000000

[   75.097267] R13: 00000000004cd15e R14: 0000000000000000 R15: 0000000000000000

[   75.097268] ---[ end trace 2937ef4d051e39cd ]---

 

I need a couple of days to debug this and then I’ll submit the patch. Unless somebody spots something odd in the code first, I hope to have good news tomorrow.

 

Bests

 

From: Y Song
Sent: 05 August 2019 16:16
To: carlos antonio neira bustos
Cc: Alexei Starovoitov; Tom Herbert via iovisor-dev; Eric Biederman
Subject: Re: [iovisor-dev][PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

 

Carlos,

 

Any update on this? It would be good to submit the patch early in a release cycle so it has a chance of being accepted in the current cycle.

 

On Fri, Jul 26, 2019 at 3:41 AM carlos antonio neira bustos

<cneirabustos@...> wrote:

> Thanks Yonghong!.

> Yes, I was looking at filename_lookup today on how to do it, I'll work on that.

> Thank you very much.

> El vie., 26 de jul. de 2019 01:18, Yonghong Song <ys114321@...> escribió:

>> 

>> On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov

>> <alexei.starovoitov@...> wrote:

>> >

>> > On Thu, Jul 25, 2019 at 11:53 AM neirac <cneirabustos@...> wrote:

>> > >

>> > > Hey Yonghong,

>> > > I have changed getname_kernel interface to specify the allocation type,

>> > > all previous callers preserve GFP_KERNEL as the allocation type, after this change

>> > > getname_kernel could be used with GFP_ATOMIC in ebpf helpers.

>> > > If this change goes in, I could complete bpf_get_current_pidns_info helper.

>> > > Let me know your thoughts.

>> >

>> > please figure out how to get pidns_info w/o calling into getname_kernel().

>> 

>> Looks like we can call filename_lookup() directly. We do not need to

>> call getname_kernel().

>> We know the exact path name. We can do an atomic allocation inside the helper.

>> Carlos, could you take a look along this line? In this way, we do not

>> need to change

>> any files under linux/fs.

>> 

>>

>> 

 


Re: [PATCHv5 RFC 1/3] BPF: New helper to obtain namespace data from current task.

Yonghong Song
 

Carlos,

Any update on this? It would be good to submit the patch early in a release cycle so it has a chance of being accepted in the current cycle.

On Fri, Jul 26, 2019 at 3:41 AM carlos antonio neira bustos
<cneirabustos@...> wrote:

Thanks Yonghong!.
Yes, I was looking at filename_lookup today on how to do it, I'll work on that.
Thank you very much.


El vie., 26 de jul. de 2019 01:18, Yonghong Song <ys114321@...> escribió:

On Thu, Jul 25, 2019 at 12:11 PM Alexei Starovoitov
<alexei.starovoitov@...> wrote:

On Thu, Jul 25, 2019 at 11:53 AM neirac <cneirabustos@...> wrote:

Hey Yonghong,
I have changed getname_kernel interface to specify the allocation type,
all previous callers preserve GFP_KERNEL as the allocation type, after this change
getname_kernel could be used with GFP_ATOMIC in ebpf helpers.
If this change goes in, I could complete bpf_get_current_pidns_info helper.
Let me know your thoughts.
please figure out how to get pidns_info w/o calling into getname_kernel().
Looks like we can call filename_lookup() directly. We do not need to
call getname_kernel().
We know the exact path name. We can do an atomic allocation inside the helper.
Carlos, could you take a look along this line? In this way, we do not
need to change
any files under linux/fs.

