off-cpu:
view
# test in centos7
echo 1 | sudo tee /proc/sys/kernel/sched_schedstats
perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -p $(cat /var/run/cs.pid) -g -o perf.data.raw sleep 10
perf inject -v -s -i perf.data.raw -o perf.data # The step is merging sched_start and sched_switch events
perf script -F comm,pid,tid,cpu,time,period,event,ip,sym,dso,trace | awk '
NF > 4 { exec = $1; period_ms = int($5 / 1000000) }
NF > 1 && NF <= 4 && period_ms > 0 { print $2 }
NF < 2 && period_ms > 0 { printf "%s\n%d\n\n", exec, period_ms }' | \
/opt/FlameGraph/stackcollapse.pl | \
/opt/FlameGraph/flamegraph.pl --countname=ms --title="Off-CPU Time Flame Graph" --colors=io > offcpu.svg
原理:
1、Application stack traces don't change while off-CPU.
This means we only need to measure the stack trace once, either at the beginning or end of the off-CPU period. The end is usually easier, since you're recording the time interval then anyway. Here is tracing pseudocode for measuring off-CPU time with stack traces:
on context switch finish:
sleeptime[prev_thread_id] = timestamp
if !sleeptime[thread_id]
return
delta = timestamp - sleeptime[thread_id]
totaltime[pid, execname, user stack, kernel stack] += delta
sleeptime[thread_id] = 0
on tracer exit:
for each key in totaltime:
print key
print totaltime[key]
Q: sched_switch事件似乎已经够用了,为什么还要sched_stat_sleep呢? 因为sched_switch虽然能知道花了多少时间,但是并不清楚时间都花在哪里,而且时间包含了involuntary context switches,而我们感兴趣的由于block event而发生的voluntary context switches。怎么获取? -> sched_stat_sleep, 因为block event发生时,程序会相应的进入TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE状态,获取伪代码如下
update_stats_dequeue
if (tsk->state & TASK_INTERRUPTIBLE)
__schedstat_set(se->statistics.sleep_start, rq_clock(rq_of(cfs_rq)));
if (tsk->state & TASK_UNINTERRUPTIBLE)
__schedstat_set(se->statistics.block_start, rq_clock(rq_of(cfs_rq)));
update_stats_enqueue
update_stats_enqueue_sleeper(cfs_rq, se); if (flags & ENQUEUE_WAKEUP)
if (sleep_start)
trace_sched_stat_sleep(tsk, delta);
if (block_start)
trace_sched_stat_iowait(tsk, delta); if (tsk->in_iowait)
trace_sched_stat_blocked(tsk, delta);
综上:
perf inject:
通过关联以下2个tracepoint, 我们可以得到任务sleep的时长及其对应的callchain
- sched_iowait/sleep/blocked. 获得sleep的时长
- sched_switch. 获得调用栈
commit 26a031e136f4f8dc82c64df48cca0eb3b5d3eb4f
Author: Andrew Vagin <[email protected]>
Date: Tue Aug 7 16:56:04 2012 +0400
perf inject: Merge sched_stat_* and sched_switch events
You may want to know where and how long a task is sleeping. A callchain
may be found in sched_switch and a time slice in stat_iowait, so I add
handler in perf inject for merging this events.
My code saves sched_switch event for each process and when it meets
stat_iowait, it reports the sched_switch event, because this event
contains a correct callchain. By another words it replaces all
stat_iowait events on proper sched_switch events.
参考:
- Linux perf_events Off-CPU Time Flame Graph
- https://perf.wiki.kernel.org/index.php/Tutorial
- PingCAP 工欲性能调优,必先利其器(2)- 火焰图
- 透过Tracepoint理解内核 - 调度器框架和性能 - J.FW的文章 - 知乎