1. Background
The kernel keeps a dedicated statistic for the time each application (at thread or process granularity) spends sleeping because of I/O. It is called iodelay, because that is the column name pidstat uses when we inspect a process's I/O delay, as in the red box in the screenshot below:
The iodelay printed by pidstat -d is in units of USER_HZ. Chapter 2 covers the details of the existing approach, i.e. pidstat -d, as well as the system configuration required before any of this monitoring works. Chapter 3 shows how to monitor iodelay with a program we write ourselves. Chapter 4 walks through some concepts related to iodelay, runs a few experiments, and previews topics that later posts will expand on.
2. Obtaining iodelay with existing tools
The existing tool for viewing iodelay is pidstat. Whether you use it or write your own code as described in Chapter 3, the kernel must be compiled with the options described in section 2.1 and delayacct must be enabled.
2.1 Kernel options CONFIG_TASK_DELAY_ACCT and CONFIG_TASKSTATS, plus delayacct on the kernel command line via grub
Whether you use the existing tool or write your own code as in Chapter 3, the steps in this section are required.
Build the kernel with CONFIG_TASK_DELAY_ACCT=y and CONFIG_TASKSTATS=y; the default Ubuntu kernel image already enables both.
Then add delayacct to the kernel command line via grub:
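On Ubuntu this can be done roughly as follows (the existing contents of GRUB_CMDLINE_LINUX_DEFAULT on your system will differ):
# /etc/default/grub: append delayacct to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash delayacct"
Then regenerate the grub config and reboot:
sudo update-grub
sudo reboot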
2.1.1 The kernel.task_delayacct sysctl can also toggle delay accounting at runtime
Delay accounting can be toggled at runtime via the kernel.task_delayacct sysctl. This method has one limitation: it does not affect programs that are already running, only programs started after the sysctl change. This is tested in section 4.1.3.
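For example, on kernels new enough to expose this sysctl:
sudo sysctl -w kernel.task_delayacct=1    # turn delay accounting on
sysctl kernel.task_delayacct              # verify the current value
sudo sysctl -w kernel.task_delayacct=0    # turn it back off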
2.2 Using pidstat -d
To monitor iodelay with pidstat -d, make sure the steps in section 2.1 have been done.
2.2.1 Installing pidstat
Install pidstat with:
sudo apt-get update
sudo apt-get install sysstat
In an earlier post, 调度时延的观测 杰克崔-CSDN博客, we showed how pidstat can observe a thread's scheduling delay; here we move on to using pidstat to view a thread's or process's iodelay.
2.2.2 The iodelay shown by pidstat -d is in USER_HZ units, i.e. it increments by 1 per 10ms
The unit matches fields 14 and 15 of /proc/<pid>/stat, utime and stime: USER_HZ ticks, one tick per 10ms. For example, for a program busy-looping in user space, sampling its utime once per second shows the value growing by 100 each second.
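A quick way to observe this (assuming the process name contains no spaces, since awk splits /proc/<pid>/stat on whitespace):
awk '{print $14}' /proc/<pid>/stat; sleep 1; awk '{print $14}' /proc/<pid>/stat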
pidstat -d reports iodelay in exactly the same unit; section 4.1.2 verifies this experimentally.
Let's look at how man pidstat explains the iodelay column of -d:
So be careful not to read the "clock ticks" in the red box above as CPU cycles; these clock ticks are USER_HZ ticks, 10ms each.
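USER_HZ is 100 on typical Linux systems, which can be confirmed with:
getconf CLK_TCK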
2.2.3 pidstat -d <interval_s> only captures each process's main thread's iodelay
With <interval_s> given, pidstat reports the iodelay delta over each interval.
Without <interval_s>, it reports the cumulative iodelay from process start until now.
Without -t, pidstat -d only captures the iodelay of a process's main thread; grabbing iodelay with -p <pid> has the same limitation.
We use the high-iodelay simulator from section 4.1.2: its main thread does nothing except spawn one thread that performs the I/O, and that thread drives iodelay very high because it writes the file with O_SYNC | O_DIRECT.
Watching the system-wide iodelay changes with pidstat -d 1, we cannot see the non-main thread's iodelay:
Watching this high-iodelay process with pidstat -d -p <pid> 1, we likewise cannot see the non-main thread's iodelay:
2.2.4 pidstat -d <interval_s> -t captures iodelay for all threads
The screenshot below captures the per-interval iodelay changes of all threads on the system:
It can also capture the iodelay changes of one specific thread:
And it can capture the cumulative iodelay of one specific thread:
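For reference, the commands behind these three captures are roughly:
pidstat -d -t 1            # all threads system-wide, 1s deltas
pidstat -d -t -p <pid> 1   # threads of one process, 1s deltas
pidstat -d -t -p <pid>     # threads of one process, totals since start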
3. Writing our own code to obtain a process's or thread's iodelay
To use the test programs in this chapter, just as with pidstat -d, make sure the steps in section 2.1 have been done.
Section 3.1 presents a trimmed-down program that fetches a process's or thread's iodelay.
Section 3.2 presents the full program, taken from an existing tool in the kernel source tree.
Section 4.2 explains the logic in the code and how it cooperates with the kernel.
3.1 The trimmed-down iodelay program: source code and usage
3.1.1 Source code
The source below is the full program from 3.2 trimmed down to just the iodelay-related parts.
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <poll.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <signal.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>
#include <linux/cgroupstats.h>
/*
* Generic macros for dealing with netlink sockets. Might be duplicated
* elsewhere. It is recommended that commercial grade applications use
* libnl or libnetlink and use the interfaces provided by the library
*/
#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
#define err(code, fmt, arg...) \
do { \
fprintf(stderr, fmt, ##arg); \
exit(code); \
} while (0)
int done;
int rcvbufsz;
char name[100];
int dbg;
int print_delays = 0;
#define PRINTF(fmt, arg...) { \
if (dbg) { \
printf(fmt, ##arg); \
} \
}
/* Maximum size of response requested or message sent */
#define MAX_MSG_SIZE 1024
/* Maximum number of cpus expected to be specified in a cpumask */
#define MAX_CPUS 32
struct msgtemplate {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[MAX_MSG_SIZE];
};
char cpumask[100 + 6 * MAX_CPUS];
static void usage(void)
{
fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
"[-m cpumask] [-t tgid] [-p pid]\n");
fprintf(stderr, " -d: print delayacct stats\n");
fprintf(stderr, " -i: print IO accounting (works only with -p)\n");
fprintf(stderr, " -l: listen forever\n");
fprintf(stderr, " -v: debug on\n");
fprintf(stderr, " -C: container path\n");
}
/*
* Create a raw netlink socket and bind
*/
static int create_nl_socket(int protocol)
{
int fd;
struct sockaddr_nl local;
fd = socket(AF_NETLINK, SOCK_RAW, protocol);
if (fd < 0)
return -1;
if (rcvbufsz)
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
&rcvbufsz, sizeof(rcvbufsz)) < 0) {
fprintf(stderr, "Unable to set socket rcv buf size to %d\n",
rcvbufsz);
goto error;
}
memset(&local, 0, sizeof(local));
local.nl_family = AF_NETLINK;
if (bind(fd, (struct sockaddr*)&local, sizeof(local)) < 0)
goto error;
return fd;
error:
close(fd);
return -1;
}
static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
__u8 genl_cmd, __u16 nla_type,
void* nla_data, int nla_len)
{
struct nlattr* na;
struct sockaddr_nl nladdr;
int r, buflen;
char* buf;
struct msgtemplate msg;
msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
msg.n.nlmsg_type = nlmsg_type;
msg.n.nlmsg_flags = NLM_F_REQUEST;
msg.n.nlmsg_seq = 0;
msg.n.nlmsg_pid = nlmsg_pid;
msg.g.cmd = genl_cmd;
msg.g.version = 0x1;
na = (struct nlattr*)GENLMSG_DATA(&msg);
na->nla_type = nla_type;
na->nla_len = nla_len + NLA_HDRLEN;
memcpy(NLA_DATA(na), nla_data, nla_len);
msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
buf = (char*)&msg;
buflen = msg.n.nlmsg_len;
memset(&nladdr, 0, sizeof(nladdr));
nladdr.nl_family = AF_NETLINK;
while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr*)&nladdr,
sizeof(nladdr))) < buflen) {
if (r > 0) {
buf += r;
buflen -= r;
}
else if (errno != EAGAIN)
return -1;
}
return 0;
}
/*
* Probe the controller in genetlink to find the family id
* for the TASKSTATS family
*/
static int get_family_id(int sd)
{
struct {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[256];
} ans;
int id = 0, rc;
struct nlattr* na;
int rep_len;
strcpy(name, TASKSTATS_GENL_NAME);
rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
CTRL_ATTR_FAMILY_NAME, (void*)name,
strlen(TASKSTATS_GENL_NAME) + 1);
if (rc < 0)
return 0; /* sendto() failure? */
rep_len = recv(sd, &ans, sizeof(ans), 0);
if (ans.n.nlmsg_type == NLMSG_ERROR ||
(rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
return 0;
na = (struct nlattr*)GENLMSG_DATA(&ans);
na = (struct nlattr*)((char*)na + NLA_ALIGN(na->nla_len));
if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
id = *(__u16*)NLA_DATA(na);
}
return id;
}
#define average_ms(t, c) (t / 1000000ULL / (c ? c : 1))
static void print_delayacct(struct taskstats* t)
{
printf("\n\ncount:%15llu\ndelay total:%15llu\ndelay average:%15llums\n",
(unsigned long long)t->blkio_count,
(unsigned long long)t->blkio_delay_total,
average_ms(t->blkio_delay_total, t->blkio_count)
);
}
int main(int argc, char* argv[])
{
int c, rc, rep_len, aggr_len, len2;
int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC;
__u16 id;
__u32 mypid;
struct nlattr* na;
int nl_sd = -1;
int len = 0;
pid_t tid = 0;
pid_t rtid = 0;
int fd = 0;
int count = 0;
int maskset = 0;
int loop = 0;
int cfd = 0;
int forking = 0;
sigset_t sigset;
struct msgtemplate msg;
while (!forking) {
c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:c:");
if (c < 0)
break;
switch (c) {
case 'd':
printf("print delayacct stats ON\n");
print_delays = 1;
break;
case 't':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid tgid\n");
cmd_type = TASKSTATS_CMD_ATTR_TGID;
break;
case 'p':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid pid\n");
cmd_type = TASKSTATS_CMD_ATTR_PID;
break;
default:
usage();
exit(-1);
}
}
nl_sd = create_nl_socket(NETLINK_GENERIC);
if (nl_sd < 0)
err(1, "error creating Netlink socket\n");
mypid = getpid();
id = get_family_id(nl_sd);
if (!id) {
fprintf(stderr, "Error getting family id, errno %d\n", errno);
goto err;
}
PRINTF("family id %d\n", id);
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
PRINTF("Sent register cpumask, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending register cpumask\n");
goto err;
}
}
/*
* If we forked a child, wait for it to exit. Cannot use waitpid()
* as all the delicious data would be reaped as part of the wait
*/
if (tid && forking) {
int sig_received;
sigwait(&sigset, &sig_received);
}
if (tid) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
cmd_type, &tid, sizeof(__u32));
PRINTF("Sent pid/tgid, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending tid/tgid cmd\n");
goto done;
}
}
if (!maskset && !tid) {
usage();
goto err;
}
do {
rep_len = recv(nl_sd, &msg, sizeof(msg), 0);
PRINTF("received %d bytes\n", rep_len);
if (rep_len < 0) {
fprintf(stderr, "nonfatal reply error: errno %d\n",
errno);
continue;
}
if (msg.n.nlmsg_type == NLMSG_ERROR ||
!NLMSG_OK((&msg.n), rep_len)) {
struct nlmsgerr* err = (struct nlmsgerr*)NLMSG_DATA(&msg);
fprintf(stderr, "fatal reply error, errno %d\n",
err->error);
goto done;
}
PRINTF("nlmsghdr size=%zu, nlmsg_len=%d, rep_len=%d\n",
sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len);
rep_len = GENLMSG_PAYLOAD(&msg.n);
na = (struct nlattr*)GENLMSG_DATA(&msg);
len = 0;
while (len < rep_len) {
len += NLA_ALIGN(na->nla_len);
switch (na->nla_type) {
case TASKSTATS_TYPE_AGGR_TGID:
/* Fall through */
case TASKSTATS_TYPE_AGGR_PID:
aggr_len = NLA_PAYLOAD(na->nla_len);
len2 = 0;
/* For nested attributes, na follows */
na = (struct nlattr*)NLA_DATA(na);
done = 0;
while (len2 < aggr_len) {
switch (na->nla_type) {
case TASKSTATS_TYPE_PID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("PID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_TGID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("TGID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_STATS:
count++;
if (print_delays)
print_delayacct((struct taskstats*)NLA_DATA(na));
if (fd) {
if (write(fd, NLA_DATA(na), na->nla_len) < 0) {
err(1, "write error\n");
}
}
if (!loop)
goto done;
break;
case TASKSTATS_TYPE_NULL:
break;
default:
fprintf(stderr, "Unknown nested"
" nla_type %d\n",
na->nla_type);
break;
}
len2 += NLA_ALIGN(na->nla_len);
na = (struct nlattr*)((char*)na +
NLA_ALIGN(na->nla_len));
}
break;
default:
fprintf(stderr, "Unknown nla_type %d\n",
na->nla_type);
case TASKSTATS_TYPE_NULL:
break;
}
na = (struct nlattr*)(GENLMSG_DATA(&msg) + len);
}
} while (loop);
done:
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
printf("Sent deregister mask, retval %d\n", rc);
if (rc < 0)
err(rc, "error sending deregister cpumask\n");
}
err:
close(nl_sd);
if (fd)
close(fd);
if (cfd)
close(cfd);
return 0;
}
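A plausible way to build the code above, producing the binary name used in the next section (the kernel headers providing linux/taskstats.h must be installed):
gcc testgetdelay.c -o testgetdelay.out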
3.1.2 Usage
As in section 2.2, first start the high-iodelay test program from 4.1.2; top -H -p <pid> then shows that the child thread doing the writes is 72581 (the process is 72579):
Run the program built in 3.1.1, testgetdelay.out, as below, first with -p:
-p looks at iodelay at thread granularity; the sample program's main thread issues no writes, so the iodelay of main thread 72579 is 0:
Running it with -p on the child thread id 72581, we can see:
In the output above, count is the number of iodelay events and delay total is in nanoseconds; on average, one iodelay event costs about 12ms.
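The average comes from the average_ms macro in the source: delay_total / 1000000 / count. With hypothetical round numbers, a delay total of 12,000,000,000ns over 1000 events gives 12000000000 / 1000000 / 1000 = 12ms.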
The program can also be run with -t, which aggregates the iodelay of all threads in one tgid, i.e. one process, into a single output:
3.2 The full program (from the existing tool in the kernel source): source code and usage
3.2.1 Source code
The code below is tools/accounting/getdelays.c from the kernel source:
// SPDX-License-Identifier: GPL-2.0
/* getdelays.c
*
* Utility to get per-pid and per-tgid delay accounting statistics
* Also illustrates usage of the taskstats interface
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2005
* Copyright (C) Balbir Singh, IBM Corp. 2006
* Copyright (c) Jay Lan, SGI. 2006
*
* Compile with
* gcc -I/usr/src/linux/include getdelays.c -o getdelays
*/
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <poll.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <signal.h>
#include <linux/genetlink.h>
#include <linux/taskstats.h>
#include <linux/cgroupstats.h>
/*
* Generic macros for dealing with netlink sockets. Might be duplicated
* elsewhere. It is recommended that commercial grade applications use
* libnl or libnetlink and use the interfaces provided by the library
*/
#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
#define err(code, fmt, arg...) \
do { \
fprintf(stderr, fmt, ##arg); \
exit(code); \
} while (0)
int done;
int rcvbufsz;
char name[100];
int dbg;
int print_delays = 0;
int print_io_accounting = 0;
int print_task_context_switch_counts = 0;
#define PRINTF(fmt, arg...) { \
if (dbg) { \
printf(fmt, ##arg); \
} \
}
/* Maximum size of response requested or message sent */
#define MAX_MSG_SIZE 1024
/* Maximum number of cpus expected to be specified in a cpumask */
#define MAX_CPUS 32
struct msgtemplate {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[MAX_MSG_SIZE];
};
char cpumask[100 + 6 * MAX_CPUS];
static void usage(void)
{
fprintf(stderr, "getdelays [-dilv] [-w logfile] [-r bufsize] "
"[-m cpumask] [-t tgid] [-p pid]\n");
fprintf(stderr, " -d: print delayacct stats\n");
fprintf(stderr, " -i: print IO accounting (works only with -p)\n");
fprintf(stderr, " -l: listen forever\n");
fprintf(stderr, " -v: debug on\n");
fprintf(stderr, " -C: container path\n");
}
/*
* Create a raw netlink socket and bind
*/
static int create_nl_socket(int protocol)
{
int fd;
struct sockaddr_nl local;
fd = socket(AF_NETLINK, SOCK_RAW, protocol);
if (fd < 0)
return -1;
if (rcvbufsz)
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
&rcvbufsz, sizeof(rcvbufsz)) < 0) {
fprintf(stderr, "Unable to set socket rcv buf size to %d\n",
rcvbufsz);
goto error;
}
memset(&local, 0, sizeof(local));
local.nl_family = AF_NETLINK;
if (bind(fd, (struct sockaddr*)&local, sizeof(local)) < 0)
goto error;
return fd;
error:
close(fd);
return -1;
}
static int send_cmd(int sd, __u16 nlmsg_type, __u32 nlmsg_pid,
__u8 genl_cmd, __u16 nla_type,
void* nla_data, int nla_len)
{
struct nlattr* na;
struct sockaddr_nl nladdr;
int r, buflen;
char* buf;
struct msgtemplate msg;
msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
msg.n.nlmsg_type = nlmsg_type;
msg.n.nlmsg_flags = NLM_F_REQUEST;
msg.n.nlmsg_seq = 0;
msg.n.nlmsg_pid = nlmsg_pid;
msg.g.cmd = genl_cmd;
msg.g.version = 0x1;
na = (struct nlattr*)GENLMSG_DATA(&msg);
na->nla_type = nla_type;
na->nla_len = nla_len + NLA_HDRLEN;
memcpy(NLA_DATA(na), nla_data, nla_len);
msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
buf = (char*)&msg;
buflen = msg.n.nlmsg_len;
memset(&nladdr, 0, sizeof(nladdr));
nladdr.nl_family = AF_NETLINK;
while ((r = sendto(sd, buf, buflen, 0, (struct sockaddr*)&nladdr,
sizeof(nladdr))) < buflen) {
if (r > 0) {
buf += r;
buflen -= r;
}
else if (errno != EAGAIN)
return -1;
}
return 0;
}
/*
* Probe the controller in genetlink to find the family id
* for the TASKSTATS family
*/
static int get_family_id(int sd)
{
struct {
struct nlmsghdr n;
struct genlmsghdr g;
char buf[256];
} ans;
int id = 0, rc;
struct nlattr* na;
int rep_len;
strcpy(name, TASKSTATS_GENL_NAME);
rc = send_cmd(sd, GENL_ID_CTRL, getpid(), CTRL_CMD_GETFAMILY,
CTRL_ATTR_FAMILY_NAME, (void*)name,
strlen(TASKSTATS_GENL_NAME) + 1);
if (rc < 0)
return 0; /* sendto() failure? */
rep_len = recv(sd, &ans, sizeof(ans), 0);
if (ans.n.nlmsg_type == NLMSG_ERROR ||
(rep_len < 0) || !NLMSG_OK((&ans.n), rep_len))
return 0;
na = (struct nlattr*)GENLMSG_DATA(&ans);
na = (struct nlattr*)((char*)na + NLA_ALIGN(na->nla_len));
if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
id = *(__u16*)NLA_DATA(na);
}
return id;
}
#define average_ms(t, c) (t / 1000000ULL / (c ? c : 1))
static void print_delayacct(struct taskstats* t)
{
printf("\n\nCPU %15s%15s%15s%15s%15s\n"
" %15llu%15llu%15llu%15llu%15.3fms\n"
"IO %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"SWAP %15s%15s%15s\n"
" %15llu%15llu%15llums\n"
"RECLAIM %12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"THRASHING%12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"COMPACT %12s%15s%15s\n"
" %15llu%15llu%15llums\n"
"WPCOPY %12s%15s%15s\n"
" %15llu%15llu%15llums\n",
"count", "real total", "virtual total",
"delay total", "delay average",
(unsigned long long)t->cpu_count,
(unsigned long long)t->cpu_run_real_total,
(unsigned long long)t->cpu_run_virtual_total,
(unsigned long long)t->cpu_delay_total,
average_ms((double)t->cpu_delay_total, t->cpu_count),
"count", "delay total", "delay average",
(unsigned long long)t->blkio_count,
(unsigned long long)t->blkio_delay_total,
average_ms(t->blkio_delay_total, t->blkio_count),
"count", "delay total", "delay average",
(unsigned long long)t->swapin_count,
(unsigned long long)t->swapin_delay_total,
average_ms(t->swapin_delay_total, t->swapin_count),
"count", "delay total", "delay average",
(unsigned long long)t->freepages_count,
(unsigned long long)t->freepages_delay_total,
average_ms(t->freepages_delay_total, t->freepages_count),
"count", "delay total", "delay average",
(unsigned long long)t->thrashing_count,
(unsigned long long)t->thrashing_delay_total,
average_ms(t->thrashing_delay_total, t->thrashing_count),
"count", "delay total", "delay average",
(unsigned long long)t->compact_count,
(unsigned long long)t->compact_delay_total,
average_ms(t->compact_delay_total, t->compact_count),
"count", "delay total", "delay average",
(unsigned long long)t->wpcopy_count,
(unsigned long long)t->wpcopy_delay_total,
average_ms(t->wpcopy_delay_total, t->wpcopy_count));
}
static void task_context_switch_counts(struct taskstats* t)
{
printf("\n\nTask %15s%15s\n"
" %15llu%15llu\n",
"voluntary", "nonvoluntary",
(unsigned long long)t->nvcsw, (unsigned long long)t->nivcsw);
}
static void print_cgroupstats(struct cgroupstats* c)
{
printf("sleeping %llu, blocked %llu, running %llu, stopped %llu, "
"uninterruptible %llu\n", (unsigned long long)c->nr_sleeping,
(unsigned long long)c->nr_io_wait,
(unsigned long long)c->nr_running,
(unsigned long long)c->nr_stopped,
(unsigned long long)c->nr_uninterruptible);
}
static void print_ioacct(struct taskstats* t)
{
printf("%s: read=%llu, write=%llu, cancelled_write=%llu\n",
t->ac_comm,
(unsigned long long)t->read_bytes,
(unsigned long long)t->write_bytes,
(unsigned long long)t->cancelled_write_bytes);
}
int main(int argc, char* argv[])
{
int c, rc, rep_len, aggr_len, len2;
int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC;
__u16 id;
__u32 mypid;
struct nlattr* na;
int nl_sd = -1;
int len = 0;
pid_t tid = 0;
pid_t rtid = 0;
int fd = 0;
int count = 0;
int write_file = 0;
int maskset = 0;
char* logfile = NULL;
int loop = 0;
int containerset = 0;
char* containerpath = NULL;
int cfd = 0;
int forking = 0;
sigset_t sigset;
struct msgtemplate msg;
while (!forking) {
c = getopt(argc, argv, "qdiw:r:m:t:p:vlC:c:");
if (c < 0)
break;
switch (c) {
case 'd':
printf("print delayacct stats ON\n");
print_delays = 1;
break;
case 'i':
printf("printing IO accounting\n");
print_io_accounting = 1;
break;
case 'q':
printf("printing task/process context switch rates\n");
print_task_context_switch_counts = 1;
break;
case 'C':
containerset = 1;
containerpath = optarg;
break;
case 'w':
logfile = strdup(optarg);
printf("write to file %s\n", logfile);
write_file = 1;
break;
case 'r':
rcvbufsz = atoi(optarg);
printf("receive buf size %d\n", rcvbufsz);
if (rcvbufsz < 0)
err(1, "Invalid rcv buf size\n");
break;
case 'm':
strncpy(cpumask, optarg, sizeof(cpumask));
cpumask[sizeof(cpumask) - 1] = '\0';
maskset = 1;
printf("cpumask %s maskset %d\n", cpumask, maskset);
break;
case 't':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid tgid\n");
cmd_type = TASKSTATS_CMD_ATTR_TGID;
break;
case 'p':
tid = atoi(optarg);
if (!tid)
err(1, "Invalid pid\n");
cmd_type = TASKSTATS_CMD_ATTR_PID;
break;
case 'c':
/* Block SIGCHLD for sigwait() later */
if (sigemptyset(&sigset) == -1)
err(1, "Failed to empty sigset");
if (sigaddset(&sigset, SIGCHLD))
err(1, "Failed to set sigchld in sigset");
sigprocmask(SIG_BLOCK, &sigset, NULL);
/* fork/exec a child */
tid = fork();
if (tid < 0)
err(1, "Fork failed\n");
if (tid == 0)
if (execvp(argv[optind - 1],
&argv[optind - 1]) < 0)
exit(-1);
/* Set the command type and avoid further processing */
cmd_type = TASKSTATS_CMD_ATTR_PID;
forking = 1;
break;
case 'v':
printf("debug on\n");
dbg = 1;
break;
case 'l':
printf("listen forever\n");
loop = 1;
break;
default:
usage();
exit(-1);
}
}
if (write_file) {
fd = open(logfile, O_WRONLY | O_CREAT | O_TRUNC,
S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
if (fd == -1) {
perror("Cannot open output file\n");
exit(1);
}
}
nl_sd = create_nl_socket(NETLINK_GENERIC);
if (nl_sd < 0)
err(1, "error creating Netlink socket\n");
mypid = getpid();
id = get_family_id(nl_sd);
if (!id) {
fprintf(stderr, "Error getting family id, errno %d\n", errno);
goto err;
}
PRINTF("family id %d\n", id);
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
PRINTF("Sent register cpumask, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending register cpumask\n");
goto err;
}
}
if (tid && containerset) {
fprintf(stderr, "Select either -t or -C, not both\n");
goto err;
}
/*
* If we forked a child, wait for it to exit. Cannot use waitpid()
* as all the delicious data would be reaped as part of the wait
*/
if (tid && forking) {
int sig_received;
sigwait(&sigset, &sig_received);
}
if (tid) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
cmd_type, &tid, sizeof(__u32));
PRINTF("Sent pid/tgid, retval %d\n", rc);
if (rc < 0) {
fprintf(stderr, "error sending tid/tgid cmd\n");
goto done;
}
}
if (containerset) {
cfd = open(containerpath, O_RDONLY);
if (cfd < 0) {
perror("error opening container file");
goto err;
}
rc = send_cmd(nl_sd, id, mypid, CGROUPSTATS_CMD_GET,
CGROUPSTATS_CMD_ATTR_FD, &cfd, sizeof(__u32));
if (rc < 0) {
perror("error sending cgroupstats command");
goto err;
}
}
if (!maskset && !tid && !containerset) {
usage();
goto err;
}
do {
rep_len = recv(nl_sd, &msg, sizeof(msg), 0);
PRINTF("received %d bytes\n", rep_len);
if (rep_len < 0) {
fprintf(stderr, "nonfatal reply error: errno %d\n",
errno);
continue;
}
if (msg.n.nlmsg_type == NLMSG_ERROR ||
!NLMSG_OK((&msg.n), rep_len)) {
struct nlmsgerr* err = (struct nlmsgerr*)NLMSG_DATA(&msg);
fprintf(stderr, "fatal reply error, errno %d\n",
err->error);
goto done;
}
PRINTF("nlmsghdr size=%zu, nlmsg_len=%d, rep_len=%d\n",
sizeof(struct nlmsghdr), msg.n.nlmsg_len, rep_len);
rep_len = GENLMSG_PAYLOAD(&msg.n);
na = (struct nlattr*)GENLMSG_DATA(&msg);
len = 0;
while (len < rep_len) {
len += NLA_ALIGN(na->nla_len);
switch (na->nla_type) {
case TASKSTATS_TYPE_AGGR_TGID:
/* Fall through */
case TASKSTATS_TYPE_AGGR_PID:
aggr_len = NLA_PAYLOAD(na->nla_len);
len2 = 0;
/* For nested attributes, na follows */
na = (struct nlattr*)NLA_DATA(na);
done = 0;
while (len2 < aggr_len) {
switch (na->nla_type) {
case TASKSTATS_TYPE_PID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("PID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_TGID:
rtid = *(int*)NLA_DATA(na);
if (print_delays)
printf("TGID\t%d\n", rtid);
break;
case TASKSTATS_TYPE_STATS:
count++;
if (print_delays)
print_delayacct((struct taskstats*)NLA_DATA(na));
if (print_io_accounting)
print_ioacct((struct taskstats*)NLA_DATA(na));
if (print_task_context_switch_counts)
task_context_switch_counts((struct taskstats*)NLA_DATA(na));
if (fd) {
if (write(fd, NLA_DATA(na), na->nla_len) < 0) {
err(1, "write error\n");
}
}
if (!loop)
goto done;
break;
case TASKSTATS_TYPE_NULL:
break;
default:
fprintf(stderr, "Unknown nested"
" nla_type %d\n",
na->nla_type);
break;
}
len2 += NLA_ALIGN(na->nla_len);
na = (struct nlattr*)((char*)na +
NLA_ALIGN(na->nla_len));
}
break;
case CGROUPSTATS_TYPE_CGROUP_STATS:
print_cgroupstats((struct cgroupstats*)NLA_DATA(na));
break;
default:
fprintf(stderr, "Unknown nla_type %d\n",
na->nla_type);
case TASKSTATS_TYPE_NULL:
break;
}
na = (struct nlattr*)(GENLMSG_DATA(&msg) + len);
}
} while (loop);
done:
if (maskset) {
rc = send_cmd(nl_sd, id, mypid, TASKSTATS_CMD_GET,
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK,
&cpumask, strlen(cpumask) + 1);
printf("Sent deregister mask, retval %d\n", rc);
if (rc < 0)
err(rc, "error sending deregister cpumask\n");
}
err:
close(nl_sd);
if (fd)
close(fd);
if (cfd)
close(cfd);
return 0;
}
3.2.2 Usage
Usage is the same as the trimmed version, as in the example below; the only difference is that besides the iodelay under the IO section, the output contains many other statistics, which we won't expand on here.
4. Concepts and experiments related to iodelay
This chapter covers some concepts related to iodelay. Part of them are explained and tested directly here; the rest will be expanded in later posts, with both analysis of the underlying mechanics and experiments.
4.1 A test program simulating high iodelay; confirming pidstat's iodelay unit and the effect of the kernel.task_delayacct switch on iodelay monitoring
Two units dominate the relevant system nodes: nanoseconds, as in /proc/<pid>/schedstat, and USER_HZ ticks, as in /proc/<pid>/stat; pidstat's iodelay also uses USER_HZ ticks. We first verify the unit of utime and stime in /proc/<pid>/stat.
Then we write a test program that simulates high iodelay to confirm that pidstat's iodelay uses the same unit.
4.1.1 A busy-loop program first confirms the unit in /proc/<pid>/stat is USER_HZ, i.e. one user tick per 10ms
First run the busy-loop program:
Then cat the program's stat node, sleep 1 second, and cat it again:
In the screenshot above, comparing fields 14 and 15 (utime and stime) across the two captures, utime grew by 100 within 1 second. 100 units per second means one unit is 10ms, which is exactly the USER_HZ tick. (Since the loop spins in user space, the time lands in utime.)
4.1.2 The high-iodelay simulator confirms pidstat's iodelay is also in USER_HZ units
The test program below writes a file opened with O_DIRECT | O_SYNC; writing this way drives the thread's iodelay very high. The code:
#include <stdio.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>
#include <ctime>
#include <cstring>
#include <fcntl.h>
#include <pthread.h>
#define WRITE_SIZE 0x1000000
int fd_;
void* aligned_data;
// Thread function: keep writing the aligned buffer to the O_DIRECT|O_SYNC fd
void* write_to_file(void* arg) {
while (1) {
int ret = write(fd_, aligned_data, WRITE_SIZE);
if (ret != WRITE_SIZE) {
printf("write fail, ret=%d\n", ret);
break;
}
}
return nullptr;
}
int main() {
// Allocate 512-byte-aligned memory, as required for O_DIRECT I/O
int result = posix_memalign(&aligned_data, 512, WRITE_SIZE);
if (result != 0) {
perror("posix_memalign failed");
return EXIT_FAILURE;
}
// Open the file with O_SYNC | O_DIRECT so every write hits the disk synchronously
fd_ = open("test.txt", O_RDWR | O_CREAT | O_SYNC | O_DIRECT, 0644);
if (fd_ == -1) {
perror("open failed");
free(aligned_data);
return EXIT_FAILURE;
}
pthread_t thread;
// Create the writer thread
if (pthread_create(&thread, nullptr, write_to_file, nullptr) != 0) {
perror("pthread_create failed");
close(fd_);
free(aligned_data);
return EXIT_FAILURE;
}
// Wait for the thread to exit
pthread_join(thread, nullptr);
// Clean up resources
close(fd_);
free(aligned_data);
return 0;
}
The code above spawns one thread to do the O_DIRECT | O_SYNC writes; using a dedicated thread makes it convenient to capture per-thread iodelay with pidstat -d -t. There are some subtleties in how to perform O_DIRECT writes correctly, which we leave for a later post.
After running the program, pidstat -d -t 1 shows thread 72240's iodelay reaching around 90:
One second is 100 USER_HZ ticks in total; around 90 of them are accounted to iodelay, and the remaining time is on-CPU, as the top output below shows:
So, counted in USER_HZ, the total time essentially adds up.
4.1.3 Turning on the kernel.task_delayacct switch via sysctl only affects the iodelay monitoring of programs started after the change
The screenshot below shows that delayacct is currently off on this system:
Even after we set kernel.task_delayacct to 1:
a program like systemd, started before the change, still accumulates no iodelay:
while programs started afterwards do show iodelay changes:
But when the switch is turned off, iodelay stops being counted for every program, no matter whether it started while the switch was on or off:
4.2 How fetching iodelay cooperates with the kernel
The user-space programs of Chapter 3 use netlink to fetch the kernel's taskstats per-task statistics.
struct taskstats carries many key per-task statistics; only the following two items relate to iodelay:
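As declared in include/uapi/linux/taskstats.h (comments paraphrased here):
__u64	blkio_count;		/* number of synchronous block I/O delays experienced */
__u64	blkio_delay_total;	/* total delay waiting for synchronous block I/O, in ns */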
The code in Chapter 3 that prints the iodelay information reuses this very structure:
In fact the header include/uapi/linux/taskstats.h is shared between kernel and user space, and it is where struct taskstats is defined.
Let's look at how blkio_count and blkio_delay_total are assigned:
delayacct_add_tsk() in delayacct.c (called from fill_stats/fill_stats_for_tgid/fill_tgid_exit) is where iodelay and the other statistics are finally reported to user space. It fills in blkio_delay_total and blkio_count for the reply as follows:
As the highlighted part shows, blkio_delay_total is aggregated from task_struct.delays->blkio_delay.
blkio_delay and blkio_count hold the task's cumulative iodelay and its event count; they are accumulated in the __delayacct_blkio_end() function:
blkio_start is stamped in __delayacct_blkio_start():
So the time from __delayacct_blkio_start() to __delayacct_blkio_end() is what counts as one iodelay event.
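For reference, the pair looks roughly like this in kernel/delayacct.c (simplified; details vary across kernel versions):
void __delayacct_blkio_start(void)
{
	/* stamp the moment the task starts sleeping for block I/O, in ns */
	current->delays->blkio_start = local_clock();
}

void __delayacct_blkio_end(struct task_struct *p)
{
	/* adds (now - blkio_start) to blkio_delay and increments blkio_count */
	delayacct_end(&p->delays->lock, &p->delays->blkio_start,
		      &p->delays->blkio_delay, &p->delays->blkio_count);
}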
__delayacct_blkio_start() and __delayacct_blkio_end() are called through delayacct_blkio_start() and delayacct_blkio_end():
Inside the __schedule() function, the blkio start time is recorded depending on whether the prev task is flagged as being in iowait:
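The relevant fragment of __schedule() in kernel/sched/core.c looks roughly like this (simplified):
if (prev->in_iowait) {
	atomic_inc(&rq->nr_iowait);
	delayacct_blkio_start();	/* the task is about to sleep waiting for I/O */
}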
Which situations set in_iowait to true is something we will expand on in a later post about iowait.
Now let's see where delayacct_blkio_end() is called:
As shown, it is called in the following two wakeup functions:
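Both call sites follow the same pattern (simplified from the wakeup paths in kernel/sched/core.c):
if (p->in_iowait) {
	delayacct_blkio_end(p);	/* close the iodelay interval at wakeup time */
	atomic_dec(&task_rq(p)->nr_iowait);
}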
So iodelay only counts up to the moment of wakeup; it does not include the time from wakeup to the actual switch-in, and that time is precisely the scheduling delay. For its details, see the earlier post 调度时延的观测 杰克崔-CSDN博客.
4.3 Other important iodelay-related topics
Later posts on the I/O subsystem will cover, among other things, the following topics:
1) iowait, an important CPU metric, and its relationship with iodelay
2) how to capture and analyze data about iowait events
3) the relationship between iodelay and a thread's D state
4) how to write files via direct I/O and what to watch out for
5) the common call chains that trigger iodelay for direct I/O and page-cache writes