Re: Out of order clock values when tracing scheduling events


Yonghong Song
 

Just a wild guess. Could it be due to subsequent bpf invocations are not in
the same cpu? The previous  invocation reads timestamp in one cpu,
and the next invocation gets timestamp in a different cpu.
If two cpus are not sync'ing their cpu's perfectly, you may get an issue?
If there is an update (e.g., NTP), different cpu may have different pace of
adjustment, could this contribute the temporary timestamp divergence?

On Mon, Jul 24, 2017 at 7:43 AM, Nair, Reena via iovisor-dev <iovisor-dev@...> wrote:


Hi,


I am running the program on bare metal, Ubuntu 16.04.2 LTS, kernel version: 4.8.0-40800-generic. I addressed the following suggestions:


  1. - The time between your bpf_ktime_get_ns() and your tsp.update() is
    not instantaneous.   - Changed
  2. Replaced BPF_HASH with PERCPU_ARRAY
Still the problem persists...

Reena.







From: Alban Crequy <alban@...>
Sent: 24 July 2017 16:02:51
To: Nair, Reena
Cc: iovisor-dev@...
Subject: Re: [iovisor-dev] Out of order clock values when tracing scheduling events
 
Hi,

I have another issue with timestamps returned by bpf_ktime_get_ns()
but I am not sure I identified the root cause:
https://github.com/weaveworks/scope/issues/2650
Join GitHub today. GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.



I have a few questions that might help to understand:

- Does the issue happen when you use a percpu map?
https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#7-bpf_percpu_array
bcc - BCC - Tools for BPF-based Linux IO analysis, networking, monitoring, and more



- Do you test on bare metal or on some kind of VM such as on EC2, GCE
or VirtualBox?

My problem does not happen on bare metal, only in virtualized
environments. I noticed that one BPF function can take more than 100ms
in virtualized environments, presumably because the hypervisor
schedule the CPU to another VM. I took two timestamps with
bpf_ktime_get_ns() in the same BPF function and I check the
difference. On bare metal, I have not noticed this kind of variable
time of execution.

- The time between your bpf_ktime_get_ns() and your tsp.update() is
not instantaneous. Another CPU might execute the BPF function
concurrently and finish faster than the first CPU. I have not observed
this kind of issues in practice on bare metal but I have seen that on
GCE.

- I noticed this worrying comment in __ktime_get_fast_ns (code behind
bpf_ktime_get_ns()):
"other CPUs are likely to be able observe [time going backward]"
https://github.com/torvalds/linux/blob/174ddfd5dfbfc2d91a45332f809977050ac3fdc5/kernel/time/timekeeping.c#L386-L388
linux - Linux kernel source tree


I don't understand the full details.

- Which kernel version and which Linux distribution do you use? There
was a kernel bug that caused timestamps to be wrong:
See https://github.com/weaveworks/scope/issues/2334#issue-214111575:
There is a kernel bug introduced by torvalds/linux@27727df (see bcc's report) that makes timestamps be wrong and results in messages like this in scope's logs: <probe> ERRO: 2017/03/13 12:33:19.623...


> This bug was introduced in 4.8 but was backported to older kernels by distros. The fix was included in 4.9 and also backported. For example, in Ubuntu, the bug was introduced in kernel 4.4.0-42 and it's not fixed until kernel 4.4.0-51.
See https://github.com/torvalds/linux/commit/27727df240c7cc84f2ba6047c6f18d5addfd25ef
When I added some extra sanity checking in timekeeping_get_ns() under CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns() method was using timekeeping_get_ns(). Thus the loc...


and https://github.com/torvalds/linux/commit/58bfea9532552d422bde7afa207e1a0f08dffa7d
In commit 27727df240c7 ("Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code the timekeeping_get_ns() function, but I forgot to include the unit conversi...



Cheers,
Alban


On Mon, Jul 24, 2017 at 9:49 AM, Nair, Reena via iovisor-dev
<iovisor-dev@...> wrote:
> Hi,
>
>
> While tracing the scheduling events of a particular task, I noticed that
> sometimes,  bpf_ktime_get_ns() returns a previous time stamp. Whenever the
> time stamps become out of order, the subsequent events follow the  earlier
> timestamp. For example, something like:
>
>
> Old TSP            New TSP
>
>
> 23456               23458
>
> 23458               23461
>
> ....
>
> ....
>
> 23509               23467
>
> 23467               23470
>
> 23470.......
>
> ......                    23527
>
> 23527                23446
>
> 23446                .........
>
>
> I have used a return probe with sys_clone(0) to store the thread Ids of a
> particular task, and used this information in the sched_switch tracepoint to
> trace the switching events of my task. The task is run on an 8-core machine.
>
>
> What could be reason for this behaviour? I have included the code, in case I
> have done something wrong???
>
>
> ######################################################################
>
>
> from bcc import BPF
>
> # define BPF program
> bpf_text = """
>
> #include <uapi/linux/ptrace.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct values_t{
>         u32 threadId;
>         char comm[TASK_COMM_LEN];
> };
>
> BPF_HASH(threadList, u32, u32);     //Stores threadIds of participating
> threads
> BPF_HASH(tsp, u32, u64, 1);         //Stores timestamp of previous event
>
> /*struct sched_switch_args {
>     // from /sys/kernel/debug/tracing/events/random/urandom_read/format
>     u64 __unused__;
>     char prev_comm[16];
>     pid_t prev_pid;
>     int prev_prio;
>     long prev_state;
>     char next_comm[16];
>     pid_t next_pid;
>     int next_prio;
> };*/
>
> //Return Probe for sys_clone() - stores the thread Ids of a task with
> command, "prod"
> int trace_sys_clone(struct pt_regs *ctx) {
>
>         struct values_t value = {};
>         u32 one = 1, *val, *val1;
>
>         bpf_get_current_comm(&value.comm, sizeof(value.comm));
>
>         if(value.comm[0]=='p' && value.comm[1] == 'r' && value.comm[2]=='o'
> && value.comm[3]=='d') {
>
>                 //Store the parent thread ID
>                 value.threadId = bpf_get_current_pid_tgid();
>                 val = threadList.lookup_or_init(&value.threadId, &one);
>
>                 //Stores other thread IDs
>                 struct values_t value1={};
>                 value1.threadId = PT_REGS_RC(ctx);
>                 val1 = threadList.lookup_or_init(&value1.threadId, &one);
>         }
>         return 0;
> }
>
>
> //Tracepoint probe for the Sched_Switch tracepoint - stores the previous
> event timestamp in a map
>
> TRACEPOINT_PROBE(sched, sched_switch){
>
>     u32 *prev, *next;  //Pointers to entries in threadList map
>     u32 next_pid, prev_pid, key=0;
>
>     //Copy data to BPF stack
>     bpf_probe_read(&next_pid, sizeof(next_pid), &args->next_pid);
>     bpf_probe_read(&prev_pid, sizeof(prev_pid), &args->prev_pid);
>
>     //Look up thread ids in the list created by sys_clone()
>     next = threadList.lookup(&next_pid);
>     prev = threadList.lookup(&prev_pid);
>
>     //If task-threads are involved in Context Switch
>     if( prev || next){
>
>         u64 newTS = bpf_ktime_get_ns();
>         u64 *oldTS = tsp.lookup_or_init(&key, &newTS); //Retrieve previous
> timestamp
>
>         if(oldTS==0)
>                 return 0;
>
>         if(newTS > *oldTS)
>                 bpf_trace_printk("Correct: Old: %llu New: %llu\\n", *oldTS,
> newTS);
>         else
>                 bpf_trace_printk("Wrong: Old: %llu New: %llu\\n", *oldTS,
> newTS);
>
>         tsp.update(&key, &newTS); //Update timestamp
>      }
>     return 0;
> }
> """
>
> # load BPF program
> b = BPF(text=bpf_text)
> b.attach_kretprobe(event="sys_clone", fn_name="trace_sys_clone")
>
> # format output
> while 1:
>    try:
>       (task, pid, cpu, flags, ts, msg) = b.trace_fields()
>       print("%s" % (msg))
>
>    except ValueError:
>         continue
>
>
> ################################################################################
>
>
> The tool cpudist.py also mentions about a case when timestamps may be out of
> order.
>
>
> Many Thanks ,
>
> Reena.
>
>
>
>
>
> _______________________________________________
> iovisor-dev mailing list
> iovisor-dev@...
> https://lists.iovisor.org/mailman/listinfo/iovisor-dev
Discussion list for IOVisor development. To see the collection of prior postings to the list, visit the iovisor-dev Archives. Using iovisor-dev


>

_______________________________________________
iovisor-dev mailing list
iovisor-dev@...
https://lists.iovisor.org/mailman/listinfo/iovisor-dev


Join {iovisor-dev@lists.iovisor.org to automatically receive all group messages.