1. Background
In an earlier post, 获取进程或线程级别的iodelay的方法_io验证延时链-CSDN博客, we described how to obtain the iodelay of a process or thread. The value obtained there, however, is cumulative; it cannot tell you how large each individual iodelay actually was. This post monitors each individual iodelay, and also monitors the start and end of a thread's D state. For every captured event we record a stack trace. To keep the recording and persisting efficient, the hot paths that capture iodelay and D-state start/stop only write the stack's function addresses into a ring buffer as part of the event; the actual output is done later in a work item, which prints the event type and stack along with the process's cmdline and its parent's cmdline. For the details of printing those cmdlines, see the earlier post 内核模块里获取当前进程和父进程的cmdline的方法及注意事项,涉及父子进程管理,和rcu的初步介绍_内核获取cmdline-CSDN博客. Here we use the more concise approach of calling get_cmdline through a function pointer obtained via kallsyms_lookup_name, a technique also used in the earlier post 内核模块里访问struct rq及获取rq_clock_task时间的方法-CSDN博客.
In Chapter 2 we present the kernel module source, explain it, and show the results. At the end of Chapter 2 we also provide a shell script that captures, at the moment it runs, the information and stacks of all threads currently in the D state, as a complement to the kernel module's capture of D-state start/stop events: since the script reports every thread that is in D right now, a thread stuck in D for a long time can be caught directly, with no need to read the persisted file and work backwards through timestamps.
Then, in Chapter 3, we analyze the source from Chapter 2 and explain the principles behind it.
2. Source code and results
Section 2.1 presents the kernel module source that captures D-state start/stop events, iodelay events (the duration of each individual iodelay), and D events (the duration of each individual D state). Section 2.2 shows the run results with some commentary. Section 2.3 presents a shell script that captures the stacks of all threads currently in the D state, plus the iowait on each CPU; the concept of CPU iowait will be introduced in a later post and is not covered here.
2.1 Kernel module source for capturing D-state start/stop events and iodelay events
The kernel module source below uses one tracepoint, for the duration of each individual iodelay, that has to be patched into the kernel. This change enables more precise per-instance iodelay monitoring, because the added tracepoint fires exactly where the kernel's delayacct accounting records the I/O delay. That said, you can turn off the IODELAY_TRACEPOINT_ENABLE macro in the code below and monitor without modifying the kernel image; the iodelay timing is then just not as precise as with the added tracepoint, but the error stays within one tick period, typically 4 ms.
Turning off the IODELAY_TRACEPOINT_ENABLE macro simply means commenting it out in the code below, like this:
This chapter presents everything with IODELAY_TRACEPOINT_ENABLE enabled.
2.1.1 The tracepoint added to the kernel that per-instance iodelay monitoring depends on
The figure below shows the tracepoint added to precisely monitor each individual iodelay value. This monitoring also depends on certain kernel config options and kernel boot (grub) parameters; for those prerequisites and their explanation, see section 2.1 of the earlier post 获取进程或线程级别的iodelay的方法_io验证延时链-CSDN博客.
Without this tracepoint, and even without the iodelay-related kernel config options just mentioned, monitoring is still possible with an error of roughly 4 ms or less (in practice the error is usually smaller); later sections of this post cover that case.
For how to add the tracepoint header and the DECLARE_TRACE statement, see the earlier post 内核tracepoint的注册回调及添加的方法_tracepoint 自定义回调-CSDN博客; we will not repeat it here.
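Since the screenshot of the kernel change is not reproduced here, below is a minimal sketch of what the declaration could look like. The tracepoint name iodelay_account and its argument list are taken from the module's probe in 2.1.2; the exact header and call site are assumptions, not the verbatim patch:
/* Sketch only: a bare tracepoint matching the module's probe
 * cb_iodelay_account(void *, struct task_struct *, unsigned long long).
 * On recent kernels a matching DEFINE_TRACE(...) is also needed in one
 * compilation unit. */
#include <linux/tracepoint.h>
DECLARE_TRACE(iodelay_account,
TP_PROTO(struct task_struct *tsk, unsigned long long delta),
TP_ARGS(tsk, delta));
/* At the point where delayacct accounts a finished block-I/O delay
 * (placement is an assumption), fire it with the measured delta: */
/* trace_iodelay_account(tsk, delta); */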
2.1.2 Kernel module source
#include <linux/module.h>
#include <linux/capability.h>
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/proc_fs.h>
#include <linux/ctype.h>
#include <linux/seq_file.h>
#include <linux/poll.h>
#include <linux/types.h>
#include <linux/ioctl.h>
#include <linux/errno.h>
#include <linux/stddef.h>
#include <linux/lockdep.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/wait.h>
#include <linux/init.h>
#include <asm/atomic.h>
#include <trace/events/workqueue.h>
#include <linux/sched/clock.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/interrupt.h>
#include <linux/tracepoint.h>
#include <trace/events/osmonitor.h>
#include <trace/events/sched.h>
#include <trace/events/irq.h>
#include <trace/events/kmem.h>
#include <linux/ptrace.h>
#include <asm/processor.h>
#include <linux/sched/task_stack.h>
#include <linux/nmi.h>
#include <asm/apic.h>
#include <linux/version.h>
#include <linux/sched/mm.h>
#include <asm/irq_regs.h>
#include <linux/kallsyms.h>
#include <linux/kprobes.h>
#include <linux/stop_machine.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("zhaoxin");
MODULE_DESCRIPTION("Module for monitor D tasks.");
MODULE_VERSION("1.0");
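// Comment out this macro to build without the custom iodelay_account
// tracepoint from 2.1.1 (per-instance iodelay is then less precise).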
#define IODELAY_TRACEPOINT_ENABLE
#define TEST_STACK_TRACE_ENTRIES 32
typedef unsigned int (*stack_trace_save_tsk_func)(struct task_struct *task,
unsigned long *store, unsigned int size,
unsigned int skipnr);
stack_trace_save_tsk_func _stack_trace_save_tsk;
typedef int (*get_cmdline_func)(struct task_struct *task, char *buffer, int buflen);
get_cmdline_func _get_cmdline_func;
#define TESTDIOMONITOR_SAMPLEDESC_SWDSTART "swDstart"
#define TESTDIOMONITOR_SAMPLEDESC_WADSTOP "waDstop"
#define TESTDIOMONITOR_SAMPLEDESC_SWDIOSTART "swDiostart"
#define TESTDIOMONITOR_SAMPLEDESC_WADIOSTOP "waDiostop"
#define TESTDIOMONITOR_SAMPLEDESC_DEXCEED "Dexceed"
#define TESTDIOMONITOR_SAMPLEDESC_DIOEXCEED "Dioexceed"
#define TESTDIOMONITOR_SAMPLEDESC_IOEXCEED "Ioexceed"
// 1ms
//#define TESTDIOMONITOR_DEXCEED_THRESHOLD 1000ull//1000000ull
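/*
 * The struct definitions below (uclamp_bucket/uclamp_rq, cfs_rq, rt_rq,
 * dl_rq, rq) are copied from kernel/sched/sched.h so the module can read
 * fields such as rq->clock_task; see section 3.2. They must match the
 * layout of the running kernel.
 */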
struct uclamp_bucket {
unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
};
struct uclamp_rq {
unsigned int value;
struct uclamp_bucket bucket[UCLAMP_BUCKETS];
};
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
unsigned int nr_running;
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
u64 exec_clock;
u64 min_vruntime;
#ifdef CONFIG_SCHED_CORE
unsigned int forceidle_seq;
u64 min_vruntime_fi;
#endif
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
#endif
struct rb_root_cached tasks_timeline;
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;
struct sched_entity *next;
struct sched_entity *last;
struct sched_entity *skip;
#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
#endif
#ifdef CONFIG_SMP
/*
* CFS load tracking
*/
struct sched_avg avg;
#ifndef CONFIG_64BIT
u64 last_update_time_copy;
#endif
struct {
raw_spinlock_t lock ____cacheline_aligned;
int nr;
unsigned long load_avg;
unsigned long util_avg;
unsigned long runnable_avg;
} removed;
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
long propagate;
long prop_runnable_sum;
/*
* h_load = weight * f(tg)
*
* Where f(tg) is the recursive weight fraction assigned to
* this group.
*/
unsigned long h_load;
u64 last_h_load_update;
struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
/*
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
*
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
* This list is used during load balance.
*/
int on_list;
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
/* Locally cached copy of our task_group's idle value */
int idle;
#ifdef CONFIG_CFS_BANDWIDTH
int runtime_enabled;
s64 runtime_remaining;
u64 throttled_pelt_idle;
#ifndef CONFIG_64BIT
u64 throttled_pelt_idle_copy;
#endif
u64 throttled_clock;
u64 throttled_clock_pelt;
u64 throttled_clock_pelt_time;
int throttled;
int throttle_count;
struct list_head throttled_list;
#ifdef CONFIG_SMP
struct list_head throttled_csd_list;
#endif
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
struct rt_prio_array {
DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
struct list_head queue[MAX_RT_PRIO];
};
/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
struct rt_prio_array active;
unsigned int rt_nr_running;
unsigned int rr_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
int curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
int next; /* next highest */
#endif
} highest_prio;
#endif
#ifdef CONFIG_SMP
unsigned int rt_nr_migratory;
unsigned int rt_nr_total;
int overloaded;
struct plist_head pushable_tasks;
#endif /* CONFIG_SMP */
int rt_queued;
int rt_throttled;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
raw_spinlock_t rt_runtime_lock;
#ifdef CONFIG_RT_GROUP_SCHED
unsigned int rt_nr_boosted;
struct rq *rq;
struct task_group *tg;
#endif
};
/* Deadline class' related fields in a runqueue */
struct dl_rq {
/* runqueue is an rbtree, ordered by deadline */
struct rb_root_cached root;
unsigned int dl_nr_running;
#ifdef CONFIG_SMP
/*
* Deadline values of the currently executing and the
* earliest ready task on this rq. Caching these facilitates
* the decision whether or not a ready but not running task
* should migrate somewhere else.
*/
struct {
u64 curr;
u64 next;
} earliest_dl;
unsigned int dl_nr_migratory;
int overloaded;
/*
* Tasks on this rq that can be pushed away. They are kept in
* an rb-tree, ordered by tasks' deadlines, with caching
* of the leftmost (earliest deadline) element.
*/
struct rb_root_cached pushable_dl_tasks_root;
#else
struct dl_bw dl_bw;
#endif
/*
* "Active utilization" for this runqueue: increased when a
* task wakes up (becomes TASK_RUNNING) and decreased when a
* task blocks
*/
u64 running_bw;
/*
* Utilization of the tasks "assigned" to this runqueue (including
* the tasks that are in runqueue and the tasks that executed on this
* CPU and blocked). Increased when a task moves to this runqueue, and
* decreased when the task moves away (migrates, changes scheduling
* policy, or terminates).
* This is needed to compute the "inactive utilization" for the
* runqueue (inactive utilization = this_bw - running_bw).
*/
u64 this_bw;
u64 extra_bw;
/*
* Maximum available bandwidth for reclaiming by SCHED_FLAG_RECLAIM
* tasks of this rq. Used in calculation of reclaimable bandwidth(GRUB).
*/
u64 max_bw;
/*
* Inverse of the fraction of CPU utilization that can be reclaimed
* by the GRUB algorithm.
*/
u64 bw_ratio;
};
struct rq {
/* runqueue lock: */
raw_spinlock_t __lock;
/*
* nr_running and cpu_load should be in the same cacheline because
* remote CPUs use both these fields when doing load calculation.
*/
unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
unsigned int numa_migrate_on;
#endif
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
unsigned long last_blocked_load_update_tick;
unsigned int has_blocked_load;
call_single_data_t nohz_csd;
#endif /* CONFIG_SMP */
unsigned int nohz_tick_stopped;
atomic_t nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_SMP
unsigned int ttwu_pending;
#endif
u64 nr_switches;
#ifdef CONFIG_UCLAMP_TASK
/* Utilization clamp values based on CPU's RUNNABLE tasks */
struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
unsigned int uclamp_flags;
#define UCLAMP_FLAG_IDLE 0x01
#endif
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
#ifdef CONFIG_FAIR_GROUP_SCHED
/* list of leaf cfs_rq on this CPU: */
struct list_head leaf_cfs_rq_list;
struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */
/*
* This is part of a global counter where only the total sum
* over all CPUs matters. A task can increase this counter on
* one CPU and if it got migrated afterwards it may decrease
* it on another CPU. Always updated under the runqueue lock:
*/
unsigned int nr_uninterruptible;
struct task_struct __rcu *curr;
struct task_struct *idle;
struct task_struct *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
unsigned int clock_update_flags;
u64 clock;
/* Ensure that all clocks are in the same cache line */
u64 clock_task ____cacheline_aligned;
u64 clock_pelt;
unsigned long lost_idle_time;
atomic_t nr_iowait;
#ifdef CONFIG_SCHED_DEBUG
u64 last_seen_need_resched_ns;
int ticks_without_resched;
#endif
#ifdef CONFIG_MEMBARRIER
int membarrier_state;
#endif
#ifdef CONFIG_SMP
struct root_domain *rd;
struct sched_domain __rcu *sd;
unsigned long cpu_capacity;
unsigned long cpu_capacity_orig;
struct callback_head *balance_callback;
unsigned char nohz_idle_balance;
unsigned char idle_balance;
unsigned long misfit_task_load;
/* For active balancing */
int active_balance;
int push_cpu;
struct cpu_stop_work active_balance_work;
/* CPU of this runqueue: */
int cpu;
int online;
struct list_head cfs_tasks;
struct sched_avg avg_rt;
struct sched_avg avg_dl;
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
struct sched_avg avg_irq;
#endif
#ifdef CONFIG_SCHED_THERMAL_PRESSURE
struct sched_avg avg_thermal;
#endif
u64 idle_stamp;
u64 avg_idle;
unsigned long wake_stamp;
u64 wake_avg_idle;
/* This is used to determine avg_idle's max value */
u64 max_idle_balance_cost;
#ifdef CONFIG_HOTPLUG_CPU
struct rcuwait hotplug_wait;
#endif
#endif /* CONFIG_SMP */
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
u64 prev_steal_time_rq;
#endif
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
call_single_data_t hrtick_csd;
#endif
struct hrtimer hrtick_timer;
ktime_t hrtick_time;
#endif
#ifdef CONFIG_SCHEDSTATS
/* latency stats */
struct sched_info rq_sched_info;
unsigned long long rq_cpu_time;
/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
/* sys_sched_yield() stats */
unsigned int yld_count;
/* schedule() stats */
unsigned int sched_count;
unsigned int sched_goidle;
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
#endif
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
#endif
#ifdef CONFIG_SMP
unsigned int nr_pinned;
#endif
unsigned int push_busy;
struct cpu_stop_work push_work;
#ifdef CONFIG_SCHED_CORE
/* per rq */
struct rq *core;
struct task_struct *core_pick;
unsigned int core_enabled;
unsigned int core_sched_seq;
struct rb_root core_tree;
/* shared state -- careful with sched_core_cpu_deactivate() */
unsigned int core_task_seq;
unsigned int core_pick_seq;
unsigned long core_cookie;
unsigned int core_forceidle_count;
unsigned int core_forceidle_seq;
unsigned int core_forceidle_occupation;
u64 core_forceidle_start;
#endif
};
typedef struct testdiomonitor_sample {
struct timespec64 time;
int cpu;
int pid;
int tgid;
int ppid;
char comm[TASK_COMM_LEN];
char ppidcomm[TASK_COMM_LEN];
// 0 or 1
int bin_iowait;
/*
* "swDstart" // 在sched_switch里
* "waDstop" // 在sched_waking里
* "swDiostart" // 在sched_switch里
* "waDiostop" // 在sched_waking里
* "Dexceed" // 超出阈值,非iowait
* "Dioexceed" // 超出阈值,iowait
*/
const char* desc;
u64 dtimens; // in nanoseconds; how long the D state lasted
u64 iowaittimens; // in nanoseconds; time spent waiting for I/O
int stackn;
void* parray_stack[TEST_STACK_TRACE_ENTRIES];
u32 writedone; // 0 or 1
} testdiomonitor_sample;
#define TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT 8192
typedef struct testdiomonitor_sample_ringbuff {
testdiomonitor_sample* parray_sample;
volatile u64 wp; // Index is wp & (TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT - 1).
volatile u64 rp; // Index is rp & (TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT - 1).
u32 skipcount; // 0 means no abnormal event has been dropped
} testdiomonitor_sample_ringbuff;
#define TESTDIOMONITOR_LINEBUFF 1024
typedef struct testdiomonitor_env {
struct file* file;
char file_linebuff[TESTDIOMONITOR_LINEBUFF];
int headoffset;
loff_t file_pos;
testdiomonitor_sample_ringbuff ringbuff;
} testdiomonitor_env;
static testdiomonitor_env _env;
static struct delayed_work work_write_file;
static struct workqueue_struct *wq_write_file;
#define FILENAME "test.txt"
void init_file(void)
{
_env.file = filp_open(FILENAME, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (IS_ERR(_env.file)) {
_env.file = NULL;
}
}
void exit_file(void)
{
if (_env.file) {
filp_close(_env.file, NULL);
}
}
void testdiomonitor_write_file(char* i_pchar, int i_size)
{
if (_env.file) {
kernel_write(_env.file, i_pchar, i_size, &_env.file_pos);
}
}
void testdiomonitor_write_file_emptyline(void)
{
testdiomonitor_write_file("\n", strlen("\n"));
}
void testdiomonitor_file_oneline(const char* i_format, ...)
{
char* pcontent = &_env.file_linebuff[_env.headoffset];
va_list args;
va_start(args, i_format);
vsnprintf(pcontent, TESTDIOMONITOR_LINEBUFF - _env.headoffset, i_format, args);
va_end(args);
testdiomonitor_write_file(_env.file_linebuff, strlen(_env.file_linebuff));
}
void testdiomonitor_replace_null_with_space(char *str, int n) {
for (int i = 0; i < n - 1; i++) {
if (str[i] == '\0') {
str[i] = ' ';
}
}
}
void testdiomonitor_set_cmdline(char* i_pbuff, int i_buffsize, struct task_struct* i_ptask)
{
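// get_cmdline() fills the buffer with argv strings separated by NULs;
// turn the separators into spaces so the command line prints as one string.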
int ret = _get_cmdline_func(i_ptask, i_pbuff, i_buffsize);
if (ret <= 0) {
i_pbuff[0] = '\0';
return;
}
testdiomonitor_replace_null_with_space(i_pbuff, ret);
i_pbuff[ret - 1] = '\0';
}
void testdiomonitor_checkget_parentinfo_and_cmdline(testdiomonitor_sample* io_psample, struct task_struct* i_ptask)
{
struct task_struct* parent;
rcu_read_lock();
parent = rcu_dereference(i_ptask->real_parent);
io_psample->ppid = parent->pid;
strlcpy(io_psample->ppidcomm, parent->comm, TASK_COMM_LEN);
rcu_read_unlock();
}
#define TESTDIOMONITOR_COMMANDLINE_MAX 128
static void write_file(struct work_struct *w)
{
ssize_t ret;
u32 index;
testdiomonitor_sample* psample;
struct tm t;
char timestr[64];
char exceedstr[64];
char temp_commandline[TESTDIOMONITOR_COMMANDLINE_MAX];
struct pid* pid_struct;
struct task_struct* ptask;
int stacki;
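// Single consumer: drain entries in order and stop at the first slot
// whose producer has not finished filling it yet (writedone == 0).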
while (_env.ringbuff.rp != _env.ringbuff.wp) {
index = (_env.ringbuff.rp & (TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT - 1));
psample = &_env.ringbuff.parray_sample[index];
if (psample->writedone != 1) {
break;
}
testdiomonitor_write_file_emptyline();
_env.headoffset = sprintf(_env.file_linebuff, "[%llu][%s] ", _env.ringbuff.rp, psample->desc);
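// Convert to local time; the +8h is a hard-coded UTC+8 offset.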
time64_to_tm(psample->time.tv_sec + 8 * 60 * 60, 0, &t);
snprintf(timestr, 64, "%04ld-%02d-%02d-%02d_%02d_%02d.%09ld",
1900 + t.tm_year, t.tm_mon + 1, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec, psample->time.tv_nsec);
if (psample->desc == TESTDIOMONITOR_SAMPLEDESC_DEXCEED) {
snprintf(exceedstr, 64, "dtimens[%llu]", psample->dtimens);
}
else if (psample->desc == TESTDIOMONITOR_SAMPLEDESC_DIOEXCEED) {
snprintf(exceedstr, 64, "iowaittimens[%llu]", psample->iowaittimens);
}
else if (psample->desc == TESTDIOMONITOR_SAMPLEDESC_IOEXCEED) {
snprintf(exceedstr, 64, "delayacct_iowaittimens[%llu]", psample->iowaittimens);
}
else {
exceedstr[0] = '\0';
}
testdiomonitor_file_oneline("begin...time[%s]cpu[%d]desc[%s]%s\n",
timestr, psample->cpu, psample->desc, exceedstr);
testdiomonitor_file_oneline("tgid[%d]pid[%d]comm[%s]ppid[%d]ppidcomm[%s]\n",
psample->tgid, psample->pid, psample->comm, psample->ppid, psample->ppidcomm);
pid_struct = find_get_pid(psample->pid);
if (pid_struct) {
ptask = get_pid_task(pid_struct, PIDTYPE_PID);
if (ptask) {
testdiomonitor_set_cmdline(temp_commandline, TESTDIOMONITOR_COMMANDLINE_MAX, ptask);
put_task_struct(ptask);
}
else {
temp_commandline[0] = '\0';
}
put_pid(pid_struct);
}
else {
temp_commandline[0] = '\0';
}
testdiomonitor_file_oneline("commandline[%s]\n", temp_commandline);
pid_struct = find_get_pid(psample->ppid);
if (pid_struct) {
ptask = get_pid_task(pid_struct, PIDTYPE_PID);
if (ptask) {
testdiomonitor_set_cmdline(temp_commandline, TESTDIOMONITOR_COMMANDLINE_MAX, ptask);
put_task_struct(ptask);
}
else {
temp_commandline[0] = '\0';
}
put_pid(pid_struct);
}
else {
temp_commandline[0] = '\0';
}
testdiomonitor_file_oneline("ppid_commandline[%s]\n", temp_commandline);
testdiomonitor_file_oneline("stack[%d]:\n", psample->stackn);
for (stacki = 0; stacki < psample->stackn; stacki++) {
testdiomonitor_file_oneline("%*c%pS\n", 5, ' ', (void *)psample->parray_stack[stacki]);
}
testdiomonitor_write_file_emptyline();
psample->writedone = 0;
_env.ringbuff.rp ++;
}
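// Re-arm the work on the last CPU with a 1-jiffy delay so the ring
// keeps draining periodically.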
queue_delayed_work_on(nr_cpu_ids - 1, wq_write_file,
&work_write_file, 1);
}
static void init_write_file(void)
{
init_file();
wq_write_file = alloc_workqueue("testdiomonitor_write_file", WQ_MEM_RECLAIM, 0);
INIT_DELAYED_WORK(&work_write_file, write_file);
queue_delayed_work_on(nr_cpu_ids - 1, wq_write_file,
&work_write_file, 3);
}
static void exit_write_file(void)
{
cancel_delayed_work_sync(&work_write_file);
destroy_workqueue(wq_write_file);
exit_file();
}
void init_testdiomonitor_sample_ringbuff(void)
{
_env.ringbuff.parray_sample = kvzalloc(sizeof(testdiomonitor_sample) * TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT, GFP_KERNEL);
}
void exit_testdiomonitor_sample_ringbuff(void)
{
kvfree(_env.ringbuff.parray_sample);
}
testdiomonitor_sample* testdiomonitor_get_psample(void)
{
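/*
 * Lock-free multi-producer reservation: read wp, give up if the ring
 * is full (the drop is counted in skipcount), then try to claim the
 * slot with a cmpxchg on wp; on contention, retry with the fresh wp.
 */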
u64 windex_raw, windex_raw_old;
u32 windex;
while (1) {
windex_raw = _env.ringbuff.wp;
if (windex_raw - _env.ringbuff.rp >= (u64)(TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT)) {
_env.ringbuff.skipcount ++;
return NULL;
}
// atomic64_cmpxchg returns the old value
windex_raw_old = atomic64_cmpxchg((atomic64_t*)&_env.ringbuff.wp,
windex_raw, windex_raw + 1);
if (windex_raw_old == windex_raw) {
break;
}
}
windex = (u32)(windex_raw & (u64)(TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT - 1));
return &_env.ringbuff.parray_sample[windex];
}
void testdiomonitor_add_sample(const char* i_desc, struct task_struct* i_task, u64 i_timens)
{
testdiomonitor_sample* psample = testdiomonitor_get_psample();
if (!psample) {
return;
}
ktime_get_real_ts64(&psample->time);
psample->cpu = task_cpu(i_task);
psample->pid = i_task->pid;
psample->tgid = i_task->tgid;
strlcpy(psample->comm, i_task->comm, TASK_COMM_LEN);
testdiomonitor_checkget_parentinfo_and_cmdline(psample, i_task);
psample->bin_iowait = i_task->in_iowait;
psample->desc = i_desc;
if (i_desc == TESTDIOMONITOR_SAMPLEDESC_DEXCEED) {
psample->dtimens = i_timens;
}
else if (i_desc == TESTDIOMONITOR_SAMPLEDESC_DIOEXCEED || i_desc == TESTDIOMONITOR_SAMPLEDESC_IOEXCEED) {
psample->iowaittimens = i_timens;
}
psample->stackn = _stack_trace_save_tsk(i_task, (unsigned long*)psample->parray_stack, TEST_STACK_TRACE_ENTRIES, 0);
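// Publish the entry: the consumer only processes slots with writedone == 1.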
psample->writedone = 1;
}
static void cb_sched_switch(void *i_data, bool i_preempt,
struct task_struct *i_prev,
struct task_struct *i_next,
unsigned int i_prev_state)
{
if (i_prev_state == TASK_UNINTERRUPTIBLE) {
if (i_prev->in_iowait) {
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_SWDIOSTART, i_prev, 0);
}
else {
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_SWDSTART, i_prev, 0);
}
}
}
static void cb_sched_waking(void *i_data, struct task_struct *i_p) {
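// The task is about to be woken. If it was sleeping in D, record the
// stop event, and approximate the D duration as local_clock() minus
// se.exec_start (refreshed when the task last scheduled out).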
if (i_p->__state == TASK_UNINTERRUPTIBLE) {
if (i_p->in_iowait) {
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_WADIOSTOP, i_p, 0);
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_DIOEXCEED, i_p, local_clock() - i_p->se.exec_start);
}
else {
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_WADSTOP, i_p, 0);
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_DEXCEED, i_p, local_clock() - i_p->se.exec_start);
}
}
}
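// Probe for the custom iodelay_account tracepoint from 2.1.1: the kernel
// passes the per-instance I/O delay (i_delta) directly.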
static void cb_iodelay_account(void *i_data, struct task_struct *i_curr,
unsigned long long i_delta)
{
testdiomonitor_add_sample(TESTDIOMONITOR_SAMPLEDESC_IOEXCEED, i_curr, i_delta);
}
struct kern_tracepoint {
void *callback;
struct tracepoint *ptr;
bool bregister;
};
static void clear_kern_tracepoint(struct kern_tracepoint *tp)
{
if (tp->bregister) {
tracepoint_probe_unregister(tp->ptr, tp->callback, NULL);
}
}
#define INIT_KERN_TRACEPOINT(tracepoint_name) \
static struct kern_tracepoint mykern_##tracepoint_name = {.callback = NULL, .ptr = NULL, .bregister = false};
#define TRACEPOINT_CHECK_AND_SET(tracepoint_name) \
static void tracepoint_name##_tracepoint_check_and_set(struct tracepoint *tp, void *priv) \
{ \
if (!strcmp(#tracepoint_name, tp->name)) \
{ \
((struct kern_tracepoint *)priv)->ptr = tp; \
return; \
} \
}
INIT_KERN_TRACEPOINT(sched_switch)
TRACEPOINT_CHECK_AND_SET(sched_switch)
INIT_KERN_TRACEPOINT(sched_waking)
TRACEPOINT_CHECK_AND_SET(sched_waking)
#ifdef IODELAY_TRACEPOINT_ENABLE
INIT_KERN_TRACEPOINT(iodelay_account)
TRACEPOINT_CHECK_AND_SET(iodelay_account)
#endif
typedef unsigned long (*kallsyms_lookup_name_func)(const char *name);
kallsyms_lookup_name_func _kallsyms_lookup_name_func;
void* get_func_by_symbol_name_kallsyms_lookup_name(void)
{
int ret;
void* pfunc = NULL;
struct kprobe kp;
memset(&kp, 0, sizeof(kp));
kp.symbol_name = "kallsyms_lookup_name";
kp.pre_handler = NULL;
kp.addr = NULL; // set to NULL deliberately, to emphasize that the lookup goes through symbol_name
ret = register_kprobe(&kp);
if (ret < 0) {
printk("register_kprobe fail!\n");
return NULL;
}
printk("register_kprobe succeed!\n");
pfunc = (void*)kp.addr;
unregister_kprobe(&kp);
return pfunc;
}
void* get_func_by_symbol_name(const char* i_symbol)
{
if (_kallsyms_lookup_name_func == NULL) {
return NULL;
}
return (void*)_kallsyms_lookup_name_func(i_symbol);
}
static int __init testdiomonitor_init(void)
{
_kallsyms_lookup_name_func = get_func_by_symbol_name_kallsyms_lookup_name();
// Resolve the unexported helpers first: failing here leaves nothing to undo.
_stack_trace_save_tsk = get_func_by_symbol_name("stack_trace_save_tsk");
if (_stack_trace_save_tsk == NULL) {
printk(KERN_ERR "get_func_by_symbol_name stack_trace_save_tsk failed!\n");
return -1;
}
_get_cmdline_func = get_func_by_symbol_name("get_cmdline");
if (_get_cmdline_func == NULL) {
printk(KERN_ERR "get_func_by_symbol_name get_cmdline failed!\n");
return -1;
}
init_testdiomonitor_sample_ringbuff();
if (_env.ringbuff.parray_sample == NULL) {
printk(KERN_ERR "kvzalloc for the sample ringbuff failed!\n");
return -1;
}
init_write_file();
mykern_sched_switch.callback = cb_sched_switch;
for_each_kernel_tracepoint(sched_switch_tracepoint_check_and_set, &mykern_sched_switch);
if (!mykern_sched_switch.ptr) {
printk(KERN_ERR "mykern_sched_switch register failed!\n");
goto fail;
}
printk(KERN_INFO "mykern_sched_switch register succeeded!\n");
tracepoint_probe_register(mykern_sched_switch.ptr, mykern_sched_switch.callback, NULL);
mykern_sched_switch.bregister = 1;
mykern_sched_waking.callback = cb_sched_waking;
for_each_kernel_tracepoint(sched_waking_tracepoint_check_and_set, &mykern_sched_waking);
if (!mykern_sched_waking.ptr) {
printk(KERN_ERR "mykern_sched_waking register failed!\n");
goto fail;
}
printk(KERN_INFO "mykern_sched_waking register succeeded!\n");
tracepoint_probe_register(mykern_sched_waking.ptr, mykern_sched_waking.callback, NULL);
mykern_sched_waking.bregister = 1;
#ifdef IODELAY_TRACEPOINT_ENABLE
mykern_iodelay_account.callback = cb_iodelay_account;
for_each_kernel_tracepoint(iodelay_account_tracepoint_check_and_set, &mykern_iodelay_account);
if (!mykern_iodelay_account.ptr) {
printk(KERN_ERR "mykern_iodelay_account register failed!\n");
goto fail;
}
printk(KERN_INFO "mykern_iodelay_account register succeeded!\n");
tracepoint_probe_register(mykern_iodelay_account.ptr, mykern_iodelay_account.callback, NULL);
mykern_iodelay_account.bregister = 1;
#endif
return 0;
fail:
// Unwind everything set up so far: a failed init must not leave the
// delayed work or any registered probe behind.
clear_kern_tracepoint(&mykern_sched_switch);
clear_kern_tracepoint(&mykern_sched_waking);
tracepoint_synchronize_unregister();
exit_write_file();
exit_testdiomonitor_sample_ringbuff();
return -1;
}
static void __exit testdiomonitor_exit(void)
{
clear_kern_tracepoint(&mykern_sched_switch);
clear_kern_tracepoint(&mykern_sched_waking);
#ifdef IODELAY_TRACEPOINT_ENABLE
clear_kern_tracepoint(&mykern_iodelay_account);
#endif
tracepoint_synchronize_unregister();
exit_write_file();
exit_testdiomonitor_sample_ringbuff();
}
module_init(testdiomonitor_init);
module_exit(testdiomonitor_exit);
2.2 Results
The design defines the seven monitored event types shown in the figure below (swDstart, waDstop, swDiostart, waDiostop, Dexceed, Dioexceed, Ioexceed). The current implementation does not filter them against any threshold; every event is recorded in full:
2.2.1 The meaning of the seven monitored events and what they capture
About swDstart and swDiostart:
swDstart means that in sched_switch we check whether the prev task is in the D state and, if so, record an event. In the current implementation, if a task in the D state is also in iowait, the event is counted as swDiostart instead of swDstart.
swDstart:
swDiostart:
About waDstop and waDiostop:
waDstop means that in sched_waking we find that the task being woken was in the D state before this wakeup, and record an event. Likewise, if the task was also in iowait while in D, the event is counted as waDiostop instead of waDstop.
waDstop:
waDiostop:
About Dexceed and Dioexceed:
A Dexceed event is recorded in sched_waking: we compute how long the D state lasted (see Chapter 3 for how) and record that duration. Again, if the task is also in iowait at that moment, the event counts as Dioexceed rather than Dexceed.
Dexceed:
Dioexceed:
About Ioexceed:
Ioexceed events rely entirely on the tracepoint added in 2.1.1; no extra computation or other logic is involved.
2.2.2 The I/O delay measured without the iodelay kernel options and grub changes differs little from the delay measured via the kernel's delayacct events
As explained in 2.2.1, Dioexceed events are captured in sched_waking, while Ioexceed events are captured through the kernel's delayacct mechanism. As the figure below shows, the times reported by the two are close:
Even the larger deviations stay within 4 ms:
2.3 Capturing the stacks of all threads currently in the D state, and each CPU's iowait, with a shell script
Here is the script (it samples each CPU's iowait and dumps the stacks of all tasks currently in the D and R states; CPU iowait will be covered in detail with examples in a later post, not in this one):
#!/bin/bash
for ((timei=1; timei<=10; timei++))
do
# number of CPU cores on this system
cpu_cores=$(nproc)
# compute the first line of the per-CPU averages for tail
tail_start_line=$((cpu_cores + 8))
# use the computed line number instead of the fixed 40
iowait_percentage=$(mpstat -P ALL 1 1 | awk '{print $6}' | tail -n +"$tail_start_line")
number=0
for i in $iowait_percentage; do
echo "cpu[$number] iowait:$i"
((number++))
done
load=$(uptime | awk -F 'load average:' '{print $2}' | awk '{gsub(/,/, "", $1); print $1}')
echo "load=$load"
ps_output=$(ps -L -eo pid,tid,psr,rtprio,ni,%cpu,state,stat,args,lstart,etime,cls,wchan:32,flags:10 | sort -k3)
d_processes=$(echo "$ps_output" | awk '$7=="D"')
if [ -n "$d_processes" ]; then
echo -e "\nD状态的进程:"
while IFS= read -r d_process; do
tid=$(echo "$d_process" | awk '{print $2}')
echo "$d_process"
cat "/proc/$tid/stack"
done <<< "$d_processes"
fi
r_processes=$(echo "$ps_output" | awk '$7=="R"')
if [ -n "$r_processes" ]; then
echo -e "\nR状态的进程:"
while IFS= read -r r_process; do
tid=$(echo "$r_process" | awk '{print $2}')
echo "$r_process"
cat "/proc/$tid/stack"
done <<< "$r_processes"
fi
sleep 1
echo -e "\n\n\n"
done
3. Source analysis and how it works
This chapter analyzes the source code presented in Chapter 2 and explains the principles behind it.
3.1 Definitions and principles of the seven monitored events
The following seven events are defined (the desc strings in the module source): swDstart, waDstop, swDiostart, waDiostop, Dexceed, Dioexceed, Ioexceed.
Their principles and meaning were already covered in 2.2.1; we restate them here:
About swDstart and swDiostart:
swDstart means that in sched_switch we check whether the prev task is in the D state and, if so, record an event. In the current implementation, if a task in the D state is also in iowait, the event is counted as swDiostart instead of swDstart.
About waDstop and waDiostop:
waDstop means that in sched_waking we find that the task being woken was in the D state before this wakeup, and record an event. Likewise, if the task was also in iowait while in D, the event is counted as waDiostop instead of waDstop.
About Dexceed and Dioexceed:
A Dexceed event is recorded in sched_waking: we compute how long the D state lasted (see section 3.2 for how) and record that duration. Again, if the task is also in iowait at that moment, the event counts as Dioexceed rather than Dexceed.
About Ioexceed:
Ioexceed events rely entirely on the tracepoint added in 2.1.1; no extra computation or other logic is involved.
3.2 Copying the struct rq definition and its dependencies into the module so we can read rq's clock_task
Both Dexceed and Dioexceed compute the duration of the whole D state in sched_waking, as follows:
The core logic is the part boxed in red in the original screenshot; it uses the my_rq_clock_task helper:
This helper takes clock_task from struct rq and subtracts se.exec_start, which was recorded when the task gave up the CPU; exec_start is itself based on the rq's clock_task:
For more details on rq_clock_task, see the earlier post 内核模块里访问struct rq及获取rq_clock_task时间的方法-CSDN博客.
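The screenshots are not reproduced here, so below is a minimal sketch of the idea, written from the description above and the earlier rq blog post; the names my_cpu_rq/my_rq_clock_task and the per-CPU runqueues lookup are assumptions, not the exact code. Note that the listing in 2.1.2 approximates the same thing with local_clock() instead:
/* Sketch, assuming the struct rq copy from 2.1.2 matches the running
 * kernel and that the "runqueues" per-CPU symbol resolves via kallsyms. */
static struct rq *my_cpu_rq(int i_cpu)
{
static struct rq *_prq; /* per-CPU base of the "runqueues" symbol */
if (!_prq)
_prq = (struct rq *)get_func_by_symbol_name("runqueues");
return per_cpu_ptr(_prq, i_cpu);
}
static u64 my_rq_clock_task(struct task_struct *i_ptask)
{
return my_cpu_rq(task_cpu(i_ptask))->clock_task;
}
/* D duration at wakeup: rq clock_task now, minus se.exec_start recorded
 * when the task last gave up the CPU: */
/* u64 dtimens = my_rq_clock_task(i_p) - i_p->se.exec_start; */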
3.3 Writing events to a file from a kworker
The kworker-related definitions (work_write_file and wq_write_file in the listing above):
The logic that creates the kworker (init_write_file above):
The file-persisting logic (write_file above):
It also fetches the task's cmdline and the parent's cmdline; for details see the earlier post 内核模块里获取当前进程和父进程的cmdline的方法及注意事项,涉及父子进程管理,和rcu的初步介绍_内核获取self进程cmdline-CSDN博客.
3.4 The ring buffer logic
The producer/consumer model of this ring buffer is built with performance as the top concern; its detailed logic will be described, abstracted, and extended in a later post. Here we simply use it. The core logic is as follows:
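The screenshot is not reproduced here; the core of the reservation path is the loop in testdiomonitor_get_psample from 2.1.2, restated with comments:
while (1) {
windex_raw = _env.ringbuff.wp;
/* Full: the writer is a whole ring ahead of the reader; drop the
 * event and remember it in skipcount. */
if (windex_raw - _env.ringbuff.rp >= (u64)(TESTDIOMONITOR_SAMPLE_RINGBUFF_MAXCOUNT)) {
_env.ringbuff.skipcount ++;
return NULL;
}
/* Only one producer's cmpxchg can move wp from windex_raw to
 * windex_raw + 1; the winner owns slot windex_raw & (size - 1). */
windex_raw_old = atomic64_cmpxchg((atomic64_t*)&_env.ringbuff.wp,
windex_raw, windex_raw + 1);
if (windex_raw_old == windex_raw) {
break; /* slot claimed */
}
/* Lost the race; retry with the fresh wp. */
}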