Hi,
I have another issue with timestamps returned by bpf_ktime_get_ns()
but I am not sure I identified the root cause:
https://github.com/weaveworks/scope/issues/2650
|
Join GitHub today. GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
|
I have a few questions that might help to understand:
- Does the issue happen when you use a percpu map?
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array
|
bcc - BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more
|
- Do you test on bare metal or on some kind of VM such as on EC2, GCE
or VirtualBox?
My problem does not happen on bare metal, only in virtualized
environments. I noticed that one BPF function can take more than 100ms
in virtualized environments, presumably because the hypervisor
schedule the CPU to another VM. I took two timestamps with
bpf_ktime_get_ns() in the same BPF function and I check the
difference. On bare metal, I have not noticed this kind of variable
time of execution.
- The time between your bpf_ktime_get_ns() and your tsp.update() is
not instantaneous. Another CPU might execute the BPF function
concurrently and finish faster than the first CPU. I have not observed
this kind of issues in practice on bare metal but I have seen that on
GCE.
- I noticed this worrying comment in __ktime_get_fast_ns (code behind
bpf_ktime_get_ns()):
"other CPUs are likely to be able observe [time going backward]"
https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388
|
linux - Linux kernel source tree
|
I don't understand the full details.
- Which kernel version and which Linux distribution do you use? There
was a kernel bug that caused timestamps to be wrong:
See
https://github.com/weaveworks/scope/issues/2334#issue-214111575:
|
There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) that makes timestamps be wrong and results in messages like this in scope's logs: <probe> ERRO: 2017/03/13 12:33:19.623...
|
> This bug was introduced in 4.8 but was backported to older kernels by distros. The fix was included in 4.9 and also backported. For example, in Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until kernel 4.4.0-51.
See
https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef
|
When I added some extra sanity checking in timekeeping_get_ns() under CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns() method was using timekeeping_get_ns(). Thus the loc...
|
and
https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d
|
In commit 27727df240c7 ("Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code the timekeeping_get_ns() function, but I forgot to include the unit conversi...
|
Cheers,
Alban
On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev
<
iovisor-dev@...> wrote:
> Hi,
>
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes, bpf_ktime_get_ns() returns a previous time stamp. Whenever the
> time stamps become out of order, the subsequent events follow the earlier
> timestamp. For example, something like:
>
>
> Old TSP New TSP
>
>
> 23456 23458
>
> 23458 23461
>
> ....
>
> ....
>
> 23509 23467
>
> 23467 23470
>
> 23470.......
>
> ...... 23527
>
> 23527 23446
>
> 23446 .........
>
>
> I have used a return probe with sys_clone(0) to store the thread Ids of a
> particular task, and used this information in the sched_switch tracepoint to
> trace the switching events of my task. The task is run on an 8-core machine.
>
>
> What could be reason for this behaviour? I have included the code, in case I
> have done something wrong???
>
>
> ##############################
##############################
##########
>
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
>
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t{
> u32 threadId;
> char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32); //Stores threadIds of participating
> threads
> BPF_HASH(tsp, u32, u64, 1); //Stores timestamp of previous event
>
> /*struct sched_switch_args {
> // from /sys/kernel/debug/tracing/
events/random/urandom_read/
format
> u64 __unused__;
> char prev_comm[16];
> pid_t prev_pid;
> int prev_prio;
> long prev_state;
> char next_comm[16];
> pid_t next_pid;
> int next_prio;
> };*/
>
> //Return Probe for sys_clone() - stores the thread Ids of a task with
> command, "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>
> struct values_t value = {};
> u32 one = 1, *val, *val1;
>
> bpf_get_current_comm(&value.
comm, sizeof(value.comm));
>
> if(value.comm[0]=='p' && value.comm[1] == 'r' && value.comm[2]=='o'
> && value.comm[3]=='d') {
>
> //Store the parent thread ID
> value.threadId = bpf_get_current_pid_tgid();
> val = threadList.lookup_or_init(&
value.threadId, &one);
>
> //Stores other thread IDs
> struct values_t value1={};
> value1.threadId = PT_REGS_RC(ctx);
> val1 = threadList.lookup_or_init(&
value1.threadId, &one);
> }
> return 0;
> }
>
>
> //Tracepoint probe for the Sched_Switch tracepoint - stores the previous
> event timestamp in a map
>
> TRACEPOINT_PROBE(sched, sched_switch){
>
> u32 *prev, *next; //Pointers to entries in threadList map
> u32 next_pid, prev_pid, key=0;
>
> //Copy data to BPF stack
> bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
> bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
> //Look up thread ids in the list created by sys_clone()
> next = threadList.lookup(&next_pid);
> prev = threadList.lookup(&prev_pid);
>
> //If task-threads are involved in Context Switch
> if( prev || next){
>
> u64 newTS = bpf_ktime_get_ns();
> u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous
> timestamp
>
> if(oldTS==0)
> return 0;
>
> if(newTS > *oldTS)
> bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS,
> newTS);
> else
> bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS,
> newTS);
>
> tsp.update(&key, &newTS); //Update timestamp
> }
> return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_
clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
> try:
> (task, pid, cpu, flags, ts, msg) = b.trace_fields()
> print("%s" % (msg))
>
> except ValueError:
> continue
>
>
> ##############################
##############################
####################
>
>
> The tool cpudist.py also mentions about a case when timestamps may be out of
> order.
>
>
> Many Thanks ,
>
> Reena.
>
>
>
>
>
> ______________________________
_________________
> iovisor-dev mailing list
>
iovisor-dev@...
>
https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Discussion list for IOVisor development. To see the collection of prior postings to the list, visit the iovisor-dev Archives. Using iovisor-dev
|
>