1. Background
The kernel keeps a dedicated statistic for the time each application (at thread or process granularity) spends sleeping because of I/O. It is called iodelay, because that is the column name pidstat uses when we inspect a process's I/O delay, as in the red box in the screenshot below:
The iodelay printed by pidstat -d is in units of USER_HZ. Chapter 2 covers the details of the existing approach, i.e. pidstat -d, as well as the system configuration required before any of this monitoring works. Chapter 3 shows how to monitor iodelay with a program we write ourselves. Chapter 4 walks through some concepts related to iodelay, runs a few experiments, and previews topics that later posts will expand on.
2. Obtaining iodelay with existing tools
The existing tool for viewing iodelay is pidstat. Whether you use it or write your own code as described in Chapter 3, the kernel must be compiled with the options described in section 2.1 and delayacct must be enabled.
2.1 Kernel options CONFIG_TASK_DELAY_ACCT and CONFIG_TASKSTATS, plus delayacct on the kernel command line via grub
Whether you use the existing tool or write your own code as in Chapter 3, the steps in this section are required.
Build the kernel with CONFIG_TASK_DELAY_ACCT=y and CONFIG_TASKSTATS=y; the default Ubuntu kernel image already enables both.
Then add delayacct to the kernel command line via grub:
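On Ubuntu this can be done roughly as follows (the existing contents of GRUB_CMDLINE_LINUX_DEFAULT on your system will differ):
# /etc/default/grub: append delayacct to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash delayacct"
Then regenerate the grub config and reboot:
sudo update-grub
sudo reboot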
2.1.1 The kernel.task_delayacct sysctl can also toggle delay accounting at runtime
Delay accounting can be toggled at runtime via the kernel.task_delayacct sysctl. This method has one limitation: it does not affect programs that are already running, only programs started after the sysctl change. This is tested in section 4.1.3.
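For example, on kernels new enough to expose this sysctl:
sudo sysctl -w kernel.task_delayacct=1    # turn delay accounting on
sysctl kernel.task_delayacct              # verify the current value
sudo sysctl -w kernel.task_delayacct=0    # turn it back off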
2.2 Using pidstat -d
To monitor iodelay with pidstat -d, make sure the steps in section 2.1 have been done.
2.2.1 Installing pidstat
Install pidstat with:
sudo apt-get update
sudo apt-get install sysstat
In an earlier post, 调度时延的观测 杰克崔-CSDN博客, we showed how pidstat can observe a thread's scheduling delay; here we move on to using pidstat to view a thread's or process's iodelay.
2.2.2 The iodelay shown by pidstat -d is in USER_HZ units, i.e. it increments by 1 per 10ms
The unit matches fields 14 and 15 of /proc/<pid>/stat, utime and stime: USER_HZ ticks, one tick per 10ms. For example, for a program busy-looping in user space, sampling its utime once per second shows the value growing by 100 each second.
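A quick way to observe this (assuming the process name contains no spaces, since awk splits /proc/<pid>/stat on whitespace):
awk '{print $14}' /proc/<pid>/stat; sleep 1; awk '{print $14}' /proc/<pid>/stat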
pidstat -d reports iodelay in exactly the same unit; section 4.1.2 verifies this experimentally.
Let's look at how man pidstat explains the iodelay column of -d:
So be careful not to read the "clock ticks" in the red box above as CPU cycles; these clock ticks are USER_HZ ticks, 10ms each.
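USER_HZ is 100 on typical Linux systems, which can be confirmed with:
getconf CLK_TCK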
2.2.3 pidstat -d <interval_s> only captures each process's main thread's iodelay
With <interval_s> given, pidstat reports the iodelay delta over each interval.
Without <interval_s>, it reports the cumulative iodelay from process start until now.
Without -t, pidstat -d only captures the iodelay of a process's main thread; grabbing iodelay with -p <pid> has the same limitation.
We use the high-iodelay simulator from section 4.1.2: its main thread does nothing except spawn one thread that performs the I/O, and that thread drives iodelay very high because it writes the file with O_SYNC | O_DIRECT.
Watching the system-wide iodelay changes with pidstat -d 1, we cannot see the non-main thread's iodelay:
Watching this high-iodelay process with pidstat -d -p <pid> 1, we likewise cannot see the non-main thread's iodelay:
2.2.4 pidstat -d <interval_s> -t captures iodelay for all threads
The screenshot below captures the per-interval iodelay changes of all threads on the system:
It can also capture the iodelay changes of one specific thread:
And it can capture the cumulative iodelay of one specific thread:
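For reference, the commands behind these three captures are roughly:
pidstat -d -t 1            # all threads system-wide, 1s deltas
pidstat -d -t -p <pid> 1   # threads of one process, 1s deltas
pidstat -d -t -p <pid>     # threads of one process, totals since start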
3. Writing our own code to obtain a process's or thread's iodelay
To use the test programs in this chapter, just as with pidstat -d, make sure the steps in section 2.1 have been done.
Section 3.1 presents a trimmed-down program that fetches a process's or thread's iodelay.
Section 3.2 presents the full program, taken from an existing tool in the kernel source tree.
Section 4.2 explains the logic in the code and how it cooperates with the kernel.
3.1 The trimmed-down iodelay program: source code and usage
3.1.1 Source code
The source below is the full program from 3.2 trimmed down to just the iodelay-related parts.
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <poll.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <signal.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>
#include <linux/cgroupstats.h>
/*
* Generic macros for dealing with netlink sockets. Might be duplicated
* elsewhere. It is recommended that commercial grade applications use
* libnl or libnetlink and use the interfaces provided by the library
*/
#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
#define err(code, fmt, arg...) \
do { \
fprintf(stderr, fmt, ##arg); \
exit(code); \
} while (0)
int done;
int rcvbufsz;
char name[100];
int dbg;
int print_delays = 0;
#define PRINTF(fmt, arg...) { \
if (dbg) { \
printf(fmt, ##arg); \
} \
}
/* Maximum size of response requested or message sent */
#define MAX_MSG_SIZE 1024
/* Maximum number of cpus expected to be specified in a cpumask */
#define MAX_CPUS 32
struct msgtemplate {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[MAX_MSG_SIZE];
};
char cpumask[100 + 6 * MAX_CPUS];
static void usage(void)
{
fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
"[-m cpumask] [-t tgid] [-p pid]\n");
fprintf(stderr, " -d: print delayacct stats\n");
fprintf(stderr, " -i: print IO accounting (works only with -p)\n");
fprintf(stderr, " -l: listen forever\n");
fprintf(stderr, " -v: debug on\n");
fprintf(stderr, " -C: container path\n");
}
/*
* Create a raw netlink socket and bind
*/
static int create_nl_socket(int protocol)
{
int fd;
struct sockaddr_nl local;
fd = socket(AF_NETLINK, SOCK_RAW, protocol);
if (fd < 0)
return -1;
if (rcvbufsz)
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
&rcvbufsz, sizeof(rcvbufsz)) < 0) {
fprintf(stderr, "Unable to set socket rcv buf size to %d\n",
rcvbufsz);
goto error;
}
memset(&local, 0, sizeof(local));
local.nl_family = AF_NETLINK;
if (bind(fd, (struct sockaddr*)&local, sizeof(local)) < 0)
goto error;
return fd;
error:
close(fd);
return -1;
}
static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
__u8 genl_cmd, __u16 nla_type,
void* nla_data, int nla_len)
{
struct nlattr* na;
struct sockaddr_nl nladdr;
int r, buflen;
char* buf;
struct msgtemplate msg;
msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
msg.n.nlmsg_type = nlmsg_type;
msg.n.nlmsg_flags = NLM_F_REQUEST;
msg.n.nlmsg_seq = 0;
msg.n.nlmsg_pid = nlmsg_pid;
msg.g.cmd = genl_cmd;
msg.g.version = 0x1;
na = (struct nlattr*)GENLMSG_DATA(&msg);
na->nla_type = nla_type;
na->nla_len = nla_len + NLA_HDRLEN;
memcpy(NLA_DATA(na), nla_data, nla_len);
msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
buf = (char*)&msg;
buflen = msg.n.nlmsg_len;
memset(&nladdr, 0, sizeof(nladdr));
nladdr.nl_family = AF_NETLINK;
while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr*)&nladdr,
sizeof(nladdr))) < buflen) {
if (r > 0) {
buf += r;
buflen -= r;
}
else if (errno != EAGAIN)
return -1;
}
return 0;
}
/*
* Probe the controller in genetlink to find the family id
* for the TASKSTATS family
*/
static int get_family_id(int sd)
{
struct {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[256];
} ans;
int id = 0, rc;
struct nlattr* na;
int rep_len;
strcpy(name, TASKSTATS_GENL_NAME);
rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
CTRL_ATTR_FAMILY_NAME, (void*)name,
strlen(TASKSTATS_GENL_NAME) + 1);
if (rc < 0)
return 0; /* sendto() failure? */
rep_len = recv(sd, &ans, sizeof(ans), 0);
if (ans.n.nlmsg_type == NLMSG_ERROR ||
(rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
return 0;
na = (struct nlattr*)GENLMSG_DATA(&ans);
na = (struct nlattr*)((char*)na + NLA_ALIGN(na->nla_len));
if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
id = *(__u16*)NLA_DATA(na);
}
return id;
}
#define average_ms(t, c) (t / 1000000ULL / (c ? c : 1))
static void print_delayacct(struct taskstats* t)
{
printf("\n\ncount:%15llu\ndelay total:%15llu\ndelay average:%15llums\n",
(unsigned long long)t->blkio_count,
(unsigned long long)t->blkio_delay_total,
average_ms(t->blkio_delay_total, t->blkio_count)
);
}
int main(int argc, char* argv[])
{
int c, rc, rep_len, aggr_len, len2;
int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC;
__u16 id;
__u32 mypid;
struct nlattr* na;
int nl_sd = -1;
int len = 0;
pid_t tid = 0;
pid_t rtid = 0;
int fd = 0;
int count = 0;
int maskset = 0;
int loop = 0;
int cfd = 0;
int forking = 0;
sigset_t sigset;
struct msgtemplate msg;
while (!forking) {
c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:c:");
if (c < 0)
break;
switch (c) {
case 'd':
printf("print delayacct stats ON\n");
print_delays = 1;
break;
case 't':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid tgid\n");
cmd_type = TASKSTATS_CMD_ATTR_TGID;
break;
case 'p':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid pid\n");
cmd_type = TASKSTATS_CMD_ATTR_PID;
break;
default:
usage();
exit(-1);
}
}
nl_sd = create_nl_socket(NETLINK_GENERIC);
if (nl_sd < 0)
err(1, "error creating Netlink socket\n");
mypid = getpid();
id = get_family_id(nl_sd);
if (!id) {
fprintf(stderr, "Error getting family id, errno %d\n", errno);
goto err;
}
PRINTF("family id %d\n", id);
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
PRINTF("Sent register cpumask, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending register cpumask\n");
goto err;
}
}
/*
* If we forked a child, wait for it to exit. Cannot use waitpid()
* as all the delicious data would be reaped as part of the wait
*/
if (tid && forking) {
int sig_received;
sigwait(&sigset, &sig_received);
}
if (tid) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
cmd_type, &tid, sizeof(__u32));
PRINTF("Sent pid/tgid, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending tid/tgid cmd\n");
goto done;
}
}
if (!maskset && !tid) {
usage();
goto err;
}
do {
rep_len = recv(nl_sd, &msg, sizeof(msg), 0);
PRINTF("received %d bytes\n", rep_len);
if (rep_len < 0) {
fprintf(stderr, "nonfatal reply error: errno %d\n",
errno);
continue;
}
if (msg.n.nlmsg_type == NLMSG_ERROR ||
!NLMSG_OK((&msg.n), rep_len)) {
struct nlmsgerr* err = (struct nlmsgerr*)NLMSG_DATA(&msg);
fprintf(stderr, "fatal reply error, errno %d\n",
err->error);
goto done;
}
PRINTF("nlmsghdr size=%zu, nlmsg_len=%d, rep_len=%d\n",
sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len);
rep_len = GENLMSG_PAYLOAD(&msg.n);
na = (struct nlattr*)GENLMSG_DATA(&msg);
len = 0;
while (len < rep_len) {
len += NLA_ALIGN(na->nla_len);
switch (na->nla_type) {
case TASKSTATS_TYPE_AGGR_TGID:
/* Fall through */
case TASKSTATS_TYPE_AGGR_PID:
aggr_len = NLA_PAYLOAD(na->nla_len);
len2 = 0;
/* For nested attributes, na follows */
na = (struct nlattr*)NLA_DATA(na);
done = 0;
while (len2 < aggr_len) {
switch (na->nla_type) {
case TASKSTATS_TYPE_PID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("PID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_TGID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("TGID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_STATS:
count++;
if (print_delays)
print_delayacct((struct taskstats*)NLA_DATA(na));
if (fd) {
if (write(fd, NLA_DATA(na), na->nla_len) < 0) {
err(1, "write error\n");
}
}
if (!loop)
goto done;
break;
case TASKSTATS_TYPE_NULL:
break;
default:
fprintf(stderr, "Unknown nested"
" nla_type %d\n",
na->nla_type);
break;
}
len2 += NLA_ALIGN(na->nla_len);
na = (struct nlattr*)((char*)na +
NLA_ALIGN(na->nla_len));
}
break;
default:
fprintf(stderr, "Unknown nla_type %d\n",
na->nla_type);
case TASKSTATS_TYPE_NULL:
break;
}
na = (struct nlattr*)(GENLMSG_DATA(&msg) + len);
}
} while (loop);
done:
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
printf("Sent deregister mask, retval %d\n", rc);
if (rc < 0)
err(rc, "error sending deregister cpumask\n");
}
err:
close(nl_sd);
if (fd)
close(fd);
if (cfd)
close(cfd);
return 0;
}
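A plausible way to build the code above, producing the binary name used in the next section (the kernel headers providing linux/taskstats.h must be installed):
gcc testgetdelay.c -o testgetdelay.out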
3.1.2 Usage
As in section 2.2, first start the high-iodelay test program from 4.1.2; top -H -p <pid> then shows that the child thread doing the writes is 72581 (the process is 72579):
Run the program built in 3.1.1, testgetdelay.out, as below, first with -p:
-p looks at iodelay at thread granularity; the sample program's main thread issues no writes, so the iodelay of main thread 72579 is 0:
Running it with -p on the child thread id 72581, we can see:
In the output above, count is the number of iodelay events and delay total is in nanoseconds; on average, one iodelay event costs about 12ms.
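The average comes from the average_ms macro in the source: delay_total / 1000000 / count. With hypothetical round numbers, a delay total of 12,000,000,000ns over 1000 events gives 12000000000 / 1000000 / 1000 = 12ms.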
The program can also be run with -t, which aggregates the iodelay of all threads in one tgid, i.e. one process, into a single output:
3.2 The full program (from the existing tool in the kernel source): source code and usage
3.2.1 Source code
The code below is tools/accounting/getdelays.c from the kernel source:
// SPDX-License-Identifier: GPL-2.0
/* getdelays.c
*
* Utility to get per-pid and per-tgid delay accounting statistics
* Also illustrates usage of the taskstats interface
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2005
* Copyright (C) Balbir Singh, IBM Corp. 2006
* Copyright (c) Jay Lan, SGI. 2006
*
* Compile with
* gcc -I/usr/src/linux/include getdelays.c -o getdelays
*/
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <poll.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <signal.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>
#include <linux/cgroupstats.h>
/*
* Generic macros for dealing with netlink sockets. Might be duplicated
* elsewhere. It is recommended that commercial grade applications use
* libnl or libnetlink and use the interfaces provided by the library
*/
#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
#define err(code, fmt, arg...) \
do { \
fprintf(stderr, fmt, ##arg); \
exit(code); \
} while (0)
int done;
int rcvbufsz;
char name[100];
int dbg;
int print_delays = 0;
int print_io_accounting = 0;
int print_task_context_switch_counts = 0;
#define PRINTF(fmt, arg...) { \
if (dbg) { \
printf(fmt, ##arg); \
} \
}
/* Maximum size of response requested or message sent */
#define MAX_MSG_SIZE 1024
/* Maximum number of cpus expected to be specified in a cpumask */
#define MAX_CPUS 32
struct msgtemplate {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[MAX_MSG_SIZE];
};
char cpumask[100 + 6 * MAX_CPUS];
static void usage(void)
{
fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
"[-m cpumask] [-t tgid] [-p pid]\n");
fprintf(stderr, " -d: print delayacct stats\n");
fprintf(stderr, " -i: print IO accounting (works only with -p)\n");
fprintf(stderr, " -l: listen forever\n");
fprintf(stderr, " -v: debug on\n");
fprintf(stderr, " -C: container path\n");
}
/*
* Create a raw netlink socket and bind
*/
static int create_nl_socket(int protocol)
{
int fd;
struct sockaddr_nl local;
fd = socket(AF_NETLINK, SOCK_RAW, protocol);
if (fd < 0)
return -1;
if (rcvbufsz)
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
&rcvbufsz, sizeof(rcvbufsz)) < 0) {
fprintf(stderr, "Unable to set socket rcv buf size to %d\n",
rcvbufsz);
goto error;
}
memset(&local, 0, sizeof(local));
local.nl_family = AF_NETLINK;
if (bind(fd, (struct sockaddr*)&local, sizeof(local)) < 0)
goto error;
return fd;
error:
close(fd);
return -1;
}
static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
__u8 genl_cmd, __u16 nla_type,
void* nla_data, int nla_len)
{
struct nlattr* na;
struct sockaddr_nl nladdr;
int r, buflen;
char* buf;
struct msgtemplate msg;
msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
msg.n.nlmsg_type = nlmsg_type;
msg.n.nlmsg_flags = NLM_F_REQUEST;
msg.n.nlmsg_seq = 0;
msg.n.nlmsg_pid = nlmsg_pid;
msg.g.cmd = genl_cmd;
msg.g.version = 0x1;
na = (struct nlattr*)GENLMSG_DATA(&msg);
na->nla_type = nla_type;
na->nla_len = nla_len + NLA_HDRLEN;
memcpy(NLA_DATA(na), nla_data, nla_len);
msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
buf = (char*)&msg;
buflen = msg.n.nlmsg_len;
memset(&nladdr, 0, sizeof(nladdr));
nladdr.nl_family = AF_NETLINK;
while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr*)&nladdr,
sizeof(nladdr))) < buflen) {
if (r > 0) {
buf += r;
buflen -= r;
}
else if (errno != EAGAIN)
return -1;
}
return 0;
}
/*
* Probe the controller in genetlink to find the family id
* for the TASKSTATS family
*/
static int get_family_id(int sd)
{
struct {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[256];
} ans;
int id = 0, rc;
struct nlattr* na;
int rep_len;
strcpy(name, TASKSTATS_GENL_NAME);
rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
CTRL_ATTR_FAMILY_NAME, (void*)name,
strlen(TASKSTATS_GENL_NAME) + 1);
if (rc < 0)
return 0; /* sendto() failure? */
rep_len = recv(sd, &ans, sizeof(ans), 0);
if (ans.n.nlmsg_type == NLMSG_ERROR ||
(rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
return 0;
na = (struct nlattr*)GENLMSG_DATA(&ans);
na = (struct nlattr*)((char*)na + NLA_ALIGN(na->nla_len));
if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
id = *(__u16*)NLA_DATA(na);
}
return id;
}
#define average_ms(t, c) (t / 1000000ULL / (c ? c : 1))
static void print_delayacct(struct taskstats* t)
{
printf("\n\nCPU %15s%15s%15s%15s%15s\n"
" %15llu%15llu%15llu%15llu%15.3fms\n"
"IO %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"SWAP %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"RECLAIM %12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"THRASHING%12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"COMPACT %12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"WPCOPY %12s%15s%15s\n"
" %15llu%15llu%15llums\n",
"count", "real total", "virtual total",
"delay total", "delay average",
(unsigned long long)t->cpu_count,
(unsigned long long)t->cpu_run_real_total,
(unsigned long long)t->cpu_run_virtual_total,
(unsigned long long)t->cpu_delay_total,
average_ms((double)t->cpu_delay_total, t->cpu_count),
"count", "delay total", "delay average",
(unsigned long long)t->blkio_count,
(unsigned long long)t->blkio_delay_total,
average_ms(t->blkio_delay_total, t->blkio_count),
"count", "delay total", "delay average",
(unsigned long long)t->swapin_count,
(unsigned long long)t->swapin_delay_total,
average_ms(t->swapin_delay_total, t->swapin_count),
"count", "delay total", "delay average",
(unsigned long long)t->freepages_count,
(unsigned long long)t->freepages_delay_total,
average_ms(t->freepages_delay_total, t->freepages_count),
"count", "delay total", "delay average",
(unsigned long long)t->thrashing_count,
(unsigned long long)t->thrashing_delay_total,
average_ms(t->thrashing_delay_total, t->thrashing_count),
"count", "delay total", "delay average",
(unsigned long long)t->compact_count,
(unsigned long long)t->compact_delay_total,
average_ms(t->compact_delay_total, t->compact_count),
"count", "delay total", "delay average",
(unsigned long long)t->wpcopy_count,
(unsigned long long)t->wpcopy_delay_total,
average_ms(t->wpcopy_delay_total, t->wpcopy_count));
}
static void task_context_switch_counts(struct taskstats* t)
{
printf("\n\nTask %15s%15s\n"
" %15llu%15llu\n",
"voluntary", "nonvoluntary",
(unsigned long long)t->nvcsw, (unsigned long long)t->nivcsw);
}
static void print_cgroupstats(struct cgroupstats* c)
{
printf("sleeping %llu, blocked %llu, running %llu, stopped %llu, "
"uninterruptible %llu\n", (unsigned long long)c->nr_sleeping,
(unsigned long long)c->nr_io_wait,
(unsigned long long)c->nr_running,
(unsigned long long)c->nr_stopped,
(unsigned long long)c->nr_uninterruptible);
}
static void print_ioacct(struct taskstats* t)
{
printf("%s: read=%llu, write=%llu, cancelled_write=%llu\n",
t->ac_comm,
(unsigned long long)t->read_bytes,
(unsigned long long)t->write_bytes,
(unsigned long long)t->cancelled_write_bytes);
}
int main(int argc, char* argv[])
{
int c, rc, rep_len, aggr_len, len2;
int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC;
__u16 id;
__u32 mypid;
struct nlattr* na;
int nl_sd = -1;
int len = 0;
pid_t tid = 0;
pid_t rtid = 0;
int fd = 0;
int count = 0;
int write_file = 0;
int maskset = 0;
char* logfile = NULL;
int loop = 0;
int containerset = 0;
char* containerpath = NULL;
int cfd = 0;
int forking = 0;
sigset_t sigset;
struct msgtemplate msg;
while (!forking) {
c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:c:");
if (c < 0)
break;
switch (c) {
case 'd':
printf("print delayacct stats ON\n");
print_delays = 1;
break;
case 'i':
printf("printing IO accounting\n");
print_io_accounting = 1;
break;
case 'q':
printf("printing task/process context switch rates\n");
print_task_context_switch_counts = 1;
break;
case 'C':
containerset = 1;
containerpath = optarg;
break;
case 'w':
logfile = strdup(optarg);
printf("write to file %s\n", logfile);
write_file = 1;
break;
case 'r':
rcvbufsz = atoi(optarg);
printf("receive buf size %d\n", rcvbufsz);
if (rcvbufsz < 0)
err(1, "Invalid rcv buf size\n");
break;
case 'm':
strncpy(cpumask, optarg, sizeof(cpumask));
cpumask[sizeof(cpumask) - 1] = '\0';
maskset = 1;
printf("cpumask %s maskset %d\n", cpumask, maskset);
break;
case 't':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid tgid\n");
cmd_type = TASKSTATS_CMD_ATTR_TGID;
break;
case 'p':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid pid\n");
cmd_type = TASKSTATS_CMD_ATTR_PID;
break;
case 'c':
/* Block SIGCHLD for sigwait() later */
if (sigemptyset(&sigset) == -1)
err(1, "Failed to empty sigset");
if (sigaddset(&sigset, SIGCHLD))
err(1, "Failed to set sigchld in sigset");
sigprocmask(SIG_BLOCK, &sigset, NULL);
/* fork/exec a child */
tid = fork();
if (tid < 0)
err(1, "Fork failed\n");
if (tid == 0)
if (execvp(argv[optind - 1],
&argv[optind - 1]) < 0)
exit(-1);
/* Set the command type and avoid further processing */
cmd_type = TASKSTATS_CMD_ATTR_PID;
forking = 1;
break;
case 'v':
printf("debug on\n");
dbg = 1;
break;
case 'l':
printf("listen forever\n");
loop = 1;
break;
default:
usage();
exit(-1);
}
}
if (write_file) {
fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC,
S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
if (fd == -1) {
perror("Cannot open output file\n");
exit(1);
}
}
nl_sd = create_nl_socket(NETLINK_GENERIC);
if (nl_sd < 0)
err(1, "error creating Netlink socket\n");
mypid = getpid();
id = get_family_id(nl_sd);
if (!id) {
fprintf(stderr, "Error getting family id, errno %d\n", errno);
goto err;
}
PRINTF("family id %d\n", id);
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
PRINTF("Sent register cpumask, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending register cpumask\n");
goto err;
}
}
if (tid && containerset) {
fprintf(stderr, "Select either -t or -C, not both\n");
goto err;
}
/*
* If we forked a child, wait for it to exit. Cannot use waitpid()
* as all the delicious data would be reaped as part of the wait
*/
if (tid && forking) {
int sig_received;
sigwait(&sigset, &sig_received);
}
if (tid) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
cmd_type, &tid, sizeof(__u32));
PRINTF("Sent pid/tgid, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending tid/tgid cmd\n");
goto done;
}
}
if (containerset) {
cfd = open(containerpath, O_RDONLY);
if (cfd < 0) {
perror("error opening container file");
goto err;
}
rc = send_cmd(nl_sd, id, mypid, CGROUPSTATS_CMD_GET,
CGROUPSTATS_CMD_ATTR_FD, &cfd, sizeof(__u32));
if (rc < 0) {
perror("error sending cgroupstats command");
goto err;
}
}
if (!maskset && !tid && !containerset) {
usage();
goto err;
}
do {
rep_len = recv(nl_sd, &msg, sizeof(msg), 0);
PRINTF("received %d bytes\n", rep_len);
if (rep_len < 0) {
fprintf(stderr, "nonfatal reply error: errno %d\n",
errno);
continue;
}
if (msg.n.nlmsg_type == NLMSG_ERROR ||
!NLMSG_OK((&msg.n), rep_len)) {
struct nlmsgerr* err = (struct nlmsgerr*)NLMSG_DATA(&msg);
fprintf(stderr, "fatal reply error, errno %d\n",
err->error);
goto done;
}
PRINTF("nlmsghdr size=%zu, nlmsg_len=%d, rep_len=%d\n",
sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len);
rep_len = GENLMSG_PAYLOAD(&msg.n);
na = (struct nlattr*)GENLMSG_DATA(&msg);
len = 0;
while (len < rep_len) {
len += NLA_ALIGN(na->nla_len);
switch (na->nla_type) {
case TASKSTATS_TYPE_AGGR_TGID:
/* Fall through */
case TASKSTATS_TYPE_AGGR_PID:
aggr_len = NLA_PAYLOAD(na->nla_len);
len2 = 0;
/* For nested attributes, na follows */
na = (struct nlattr*)NLA_DATA(na);
done = 0;
while (len2 < aggr_len) {
switch (na->nla_type) {
case TASKSTATS_TYPE_PID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("PID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_TGID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("TGID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_STATS:
count++;
if (print_delays)
print_delayacct((struct taskstats*)NLA_DATA(na));
if (print_io_accounting)
print_ioacct((struct taskstats*)NLA_DATA(na));
if (print_task_context_switch_counts)
task_context_switch_counts((struct taskstats*)NLA_DATA(na));
if (fd) {
if (write(fd, NLA_DATA(na), na->nla_len) < 0) {
err(1, "write error\n");
}
}
if (!loop)
goto done;
break;
case TASKSTATS_TYPE_NULL:
break;
default:
fprintf(stderr, "Unknown nested"
" nla_type %d\n",
na->nla_type);
break;
}
len2 += NLA_ALIGN(na->nla_len);
na = (struct nlattr*)((char*)na +
NLA_ALIGN(na->nla_len));
}
break;
case CGROUPSTATS_TYPE_CGROUP_STATS:
print_cgroupstats((struct cgroupstats*)NLA_DATA(na));
break;
default:
fprintf(stderr, "Unknown nla_type %d\n",
na->nla_type);
case TASKSTATS_TYPE_NULL:
break;
}
na = (struct nlattr*)(GENLMSG_DATA(&msg) + len);
}
} while (loop);
done:
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
printf("Sent deregister mask, retval %d\n", rc);
if (rc < 0)
err(rc, "error sending deregister cpumask\n");
}
err:
close(nl_sd);
if (fd)
close(fd);
if (cfd)
close(cfd);
return 0;
}
3.2.2 Usage
Usage is the same as the trimmed version, as in the example below; the only difference is that besides the iodelay under the IO section, the output contains many other statistics, which we won't expand on here.
4. Concepts and experiments related to iodelay
This chapter covers some concepts related to iodelay. Part of them are explained and tested directly here; the rest will be expanded in later posts, with both analysis of the underlying mechanics and experiments.
4.1 A test program simulating high iodelay; confirming pidstat's iodelay unit and the effect of the kernel.task_delayacct switch on iodelay monitoring
Two units dominate the relevant system nodes: nanoseconds, as in /proc/<pid>/schedstat, and USER_HZ ticks, as in /proc/<pid>/stat; pidstat's iodelay also uses USER_HZ ticks. We first verify the unit of utime and stime in /proc/<pid>/stat.
Then we write a test program that simulates high iodelay to confirm that pidstat's iodelay uses the same unit.
4.1.1 A busy-loop program first confirms the unit in /proc/<pid>/stat is USER_HZ, i.e. one user tick per 10ms
First run the busy-loop program:
Then cat the program's stat node, sleep 1 second, and cat it again:
In the screenshot above, comparing fields 14 and 15 (utime and stime) across the two captures, utime grew by 100 within 1 second. 100 units per second means one unit is 10ms, which is exactly the USER_HZ tick. (Since the loop spins in user space, the time lands in utime.)
4.1.2 The high-iodelay simulator confirms pidstat's iodelay is also in USER_HZ units
The test program below writes a file opened with O_DIRECT | O_SYNC; writing this way drives the thread's iodelay very high. The code:
#include <stdio.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <cstring>
#include <fcntl.h>
#include <pthread.h>
#define WRITE_SIZE 0x1000000
int fd_;
void* aligned_data;
// Thread function: keep writing the aligned buffer to the O_DIRECT|O_SYNC fd
void* write_to_file(void* arg) {
while (1) {
int ret = write(fd_, aligned_data, WRITE_SIZE);
if (ret != WRITE_SIZE) {
printf("write fail, ret=%d\n", ret);
break;
}
}
return nullptr;
}
int main() {
// Allocate 512-byte-aligned memory, as required for O_DIRECT I/O
int result = posix_memalign(&aligned_data, 512, WRITE_SIZE);
if (result != 0) {
perror("posix_memalign failed");
return EXIT_FAILURE;
}
// Open the file with O_SYNC | O_DIRECT so every write hits the disk synchronously
fd_ = open("test.txt", O_RDWR | O_CREAT | O_SYNC | O_DIRECT, 0644);
if (fd_ == -1) {
perror("open failed");
free(aligned_data);
return EXIT_FAILURE;
}
pthread_t thread;
// Create the writer thread
if (pthread_create(&thread, nullptr, write_to_file, nullptr) != 0) {
perror("pthread_create failed");
close(fd_);
free(aligned_data);
return EXIT_FAILURE;
}
// Wait for the thread to exit
pthread_join(thread, nullptr);
// Clean up resources
close(fd_);
free(aligned_data);
return 0;
}
The code above spawns one thread to do the O_DIRECT | O_SYNC writes; using a dedicated thread makes it convenient to capture per-thread iodelay with pidstat -d -t. There are some subtleties in how to perform O_DIRECT writes correctly, which we leave for a later post.
After running the program, pidstat -d -t 1 shows thread 72240's iodelay reaching around 90:
One second is 100 USER_HZ ticks in total; around 90 of them are accounted to iodelay, and the remaining time is on-CPU, as the top output below shows:
So, counted in USER_HZ, the total time essentially adds up.
4.1.3 Turning on the kernel.task_delayacct switch via sysctl only affects the iodelay monitoring of programs started after the change
The screenshot below shows that delayacct is currently off on this system:
Even after we set kernel.task_delayacct to 1:
a program like systemd, started before the change, still accumulates no iodelay:
while programs started afterwards do show iodelay changes:
But when the switch is turned off, iodelay stops being counted for every program, no matter whether it started while the switch was on or off:
4.2 How fetching iodelay cooperates with the kernel
The user-space programs of Chapter 3 use netlink to fetch the kernel's taskstats per-task statistics.
struct taskstats carries many key per-task statistics; only the following two items relate to iodelay:
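As declared in include/uapi/linux/taskstats.h (comments paraphrased here):
__u64	blkio_count;		/* number of synchronous block I/O delays experienced */
__u64	blkio_delay_total;	/* total delay waiting for synchronous block I/O, in ns */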
The code in Chapter 3 that prints the iodelay information reuses this very structure:
In fact the header include/uapi/linux/taskstats.h is shared between kernel and user space, and it is where struct taskstats is defined.
Let's look at how blkio_count and blkio_delay_total are assigned:
delayacct_add_tsk() in delayacct.c (called from fill_stats/fill_stats_for_tgid/fill_tgid_exit) is where iodelay and the other statistics are finally reported to user space. It fills in blkio_delay_total and blkio_count for the reply as follows:
As the highlighted part shows, blkio_delay_total is aggregated from task_struct.delays->blkio_delay.
blkio_delay and blkio_count hold the task's cumulative iodelay and its event count; they are accumulated in the __delayacct_blkio_end() function:
blkio_start is stamped in __delayacct_blkio_start():
So the time from __delayacct_blkio_start() to __delayacct_blkio_end() is what counts as one iodelay event.
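For reference, the pair looks roughly like this in kernel/delayacct.c (simplified; details vary across kernel versions):
void __delayacct_blkio_start(void)
{
	/* stamp the moment the task starts sleeping for block I/O, in ns */
	current->delays->blkio_start = local_clock();
}

void __delayacct_blkio_end(struct task_struct *p)
{
	/* adds (now - blkio_start) to blkio_delay and increments blkio_count */
	delayacct_end(&p->delays->lock, &p->delays->blkio_start,
		      &p->delays->blkio_delay, &p->delays->blkio_count);
}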
__delayacct_blkio_start() and __delayacct_blkio_end() are called through delayacct_blkio_start() and delayacct_blkio_end():
Inside the __schedule() function, the blkio start time is recorded depending on whether the prev task is flagged as being in iowait:
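The relevant fragment of __schedule() in kernel/sched/core.c looks roughly like this (simplified):
if (prev->in_iowait) {
	atomic_inc(&rq->nr_iowait);
	delayacct_blkio_start();	/* the task is about to sleep waiting for I/O */
}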
Which situations set in_iowait to true is something we will expand on in a later post about iowait.
Now let's see where delayacct_blkio_end() is called:
As shown, it is called in the following two wakeup functions:
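Both call sites follow the same pattern (simplified from the wakeup paths in kernel/sched/core.c):
if (p->in_iowait) {
	delayacct_blkio_end(p);	/* close the iodelay interval at wakeup time */
	atomic_dec(&task_rq(p)->nr_iowait);
}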
So iodelay only counts up to the moment of wakeup; it does not include the time from wakeup to the actual switch-in, and that time is precisely the scheduling delay. For its details, see the earlier post 调度时延的观测 杰克崔-CSDN博客.
4.3 Other important iodelay-related topics
Later posts on the I/O subsystem will cover, among other things, the following topics:
1) iowait, an important CPU metric, and its relationship with iodelay
2) how to capture and analyze data about iowait events
3) the relationship between iodelay and a thread's D state
4) how to write files via direct I/O and what to watch out for
5) the common call chains that trigger iodelay for direct I/O and page-cache writes