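The Linux Subreaper Mechanism and a Kernel-Mode Way to Escape It (PR_SET_CHILD_SUBREAPER, prctl, systemd)

Tags: PR systemd SET struct pid task reaper child subreaper

PS: If you repost this article, please credit the source. All rights reserved by the author.

PS: This is based only on my own understanding. If it conflicts with your principles or views, please bear with me and don't flame.

Environment

None

Preface

For unrelated reasons, while testing another problem we noticed something odd. Our long-held, naive understanding was this: if a program creates a parent-process and a child-process, and the parent-process exits while the child-process is still running, the child-process is reparented to the init process. But pstree -p showed otherwise. The child was reparented to a special process that is not init itself but a process underneath init. The phenomenon can be verified as follows:

Test program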

#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>

int main(int argc, char * argv[])
{
    int pid = fork();

    if (pid < 0)
        printf("fork failed.\n");

    else if (pid > 0){

        printf("parent : child pid = %d\n", pid);
    }
    else{

        printf("child doing ... ...\n");
        printf("first get ppid = %d\n", getppid());
        sleep(20);
        printf("second get ppid = %d\n", getppid());
        sleep(1000);
    }
    sleep(15);
    return 0;
}

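Running this program produces the output shown in the figure below:

From the figure we can tell that the running a.out process has pid 72140 and the child process has pid 72141. After the parent-process exits, the second call to getppid prints 3203.

Next, an excerpt of the process tree while the parent-process had not yet exited:

From the figure, our a.out sits in the chain systemd(3203)->gnome-terminal-(3741)->bash(5073)->a.out(72140).

Next, an excerpt of the process tree after the parent-process exited:

From the figure, once the parent-process exits, child 72141 is reparented to systemd(3203), not to the familiar init process with pid 1. As a spoiler: 3203 is a systemd --user process (and also a subreaper).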
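Instead of reading pstree screenshots, the same check can be done from a program. The following is a minimal sketch, assuming a Linux system with /proc mounted; the helper names proc_ppid and proc_cmdline are hypothetical and not part of the original test program. It reads the parent pid of a process from /proc/<pid>/status and the command line of any pid from /proc/<pid>/cmdline, which is enough to confirm that the adopting process is something like systemd --user.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Parent pid of `pid` as reported by /proc/<pid>/status, or -1 on failure. */
pid_t proc_ppid(pid_t pid)
{
    char path[64], line[256];
    int ppid = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    /* The line of interest looks like "PPid:\t3203". */
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "PPid: %d", &ppid) == 1)
            break;
    fclose(f);
    return (pid_t)ppid;
}

/*
 * Copies the command line of `pid` into buf, replacing the NUL separators
 * between arguments with spaces. Returns the number of bytes read, or -1.
 */
int proc_cmdline(pid_t pid, char *buf, size_t len)
{
    char path[64];
    size_t n, i;
    FILE *f;

    assert(buf != NULL && len > 0);
    snprintf(path, sizeof(path), "/proc/%d/cmdline", (int)pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, len - 1, f);
    fclose(f);
    for (i = 0; i + 1 < n; i++)
        if (buf[i] == '\0')
            buf[i] = ' ';
    buf[n] = '\0';
    return (int)n;
}
```

Calling proc_ppid on the orphaned child's pid and then proc_cmdline on the result would show which program adopted it.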

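With this question in mind, I read the relevant material and ran some experiments, and found that the cause is PR_SET_CHILD_SUBREAPER. Hence this article.

What is a Subreaper (PR_SET_CHILD_SUBREAPER)?

For this question we should turn to the man page: https://man7.org/linux/man-pages/man2/prctl.2.html

Through the prctl function we can apply many interesting settings to the current process. One of them is the PR_SET_CHILD_SUBREAPER option, which exists to collect orphaned processes. It is typically used by processes that manage daemons (for example systemd, mentioned above), so that a process can manage all of its descendants. Under the hood it operates on the is_child_subreaper field of the current process's signal_struct (reached through task_struct->signal). Here is an excerpt of the implementation: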

//kernel/sys.c
static int propagate_has_child_subreaper(struct task_struct *p, void *data)
{
	/*
	 * If task has has_child_subreaper - all its descendants
	 * already have these flag too and new descendants will
	 * inherit it on fork, skip them.
	 *
	 * If we've found child_reaper - skip descendants in
	 * it's subtree as they will never get out pidns.
	 */
	if (p->signal->has_child_subreaper ||
	    is_child_reaper(task_pid(p)))
		return 0;

	p->signal->has_child_subreaper = 1;
	return 1;
}

//kernel/sys.c
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
		unsigned long, arg4, unsigned long, arg5)
{
    struct task_struct *me = current;
    //...
    switch (option){
        //...
        case PR_SET_CHILD_SUBREAPER:
            me->signal->is_child_subreaper = !!arg2;
            if (!arg2)
                break;
            // walk_process_tree visits every descendant of the current process and calls propagate_has_child_subreaper on each one to set has_child_subreaper.
            walk_process_tree(me, propagate_has_child_subreaper, NULL);
            break;
        case PR_GET_CHILD_SUBREAPER:
            error = put_user(me->signal->is_child_subreaper,
                    (int __user *)arg2);
            break;
        //...
    }
    //...
}

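From the source above we now know part of how PR_SET_CHILD_SUBREAPER is implemented. What we still do not know is why, once it is set, a process that exits while it still has children hands those children to the subreaper. As a spoiler, the key is the has_child_subreaper field, which is also reached through task_struct->signal.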
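Before digging into the kernel side, the user-space effect can be reproduced directly. The following is a minimal sketch, assuming Linux 3.4 or later (where PR_SET_CHILD_SUBREAPER exists); the function name orphan_ppid_under_subreaper and the struct report are made up for illustration. The caller marks itself as a subreaper, forks a child that forks a grandchild and exits at once, and the orphaned grandchild then reports which parent it sees through a pipe.

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct report {
    pid_t self;    /* pid of the orphaned grandchild */
    pid_t parent;  /* parent pid it sees after being orphaned */
};

/*
 * Marks the caller as a subreaper, orphans a grandchild, and returns the
 * parent pid the grandchild observes afterwards (-1 on error).
 */
pid_t orphan_ppid_under_subreaper(void)
{
    struct report rep = { -1, -1 };
    int fds[2];
    pid_t child;

    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) == -1)
        return -1;
    if (pipe(fds) == -1)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        pid_t middle = getpid();          /* the parent that is about to exit */
        pid_t grandchild = fork();

        if (grandchild == 0) {
            while (getppid() == middle)   /* wait until we have been reparented */
                usleep(1000);
            rep.self = getpid();
            rep.parent = getppid();
            if (write(fds[1], &rep, sizeof(rep)) != (ssize_t)sizeof(rep))
                _exit(1);
            _exit(0);
        }
        _exit(0);                         /* exit immediately, orphaning the grandchild */
    }

    close(fds[1]);
    waitpid(child, NULL, 0);              /* reap the intermediate parent */
    if (read(fds[0], &rep, sizeof(rep)) != (ssize_t)sizeof(rep)) {
        close(fds[0]);
        return -1;
    }
    close(fds[0]);
    assert(rep.self > 0);                 /* a complete report always carries the grandchild's pid */
    waitpid(rep.self, NULL, 0);           /* we adopted the grandchild, so we must reap it */
    prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);  /* undo the setting for the caller */
    return rep.parent;
}
```

Under these assumptions the returned pid should equal the caller's own getpid() rather than 1, which is the same behavior as systemd --user adopting a.out's child.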

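How are the children reparented when a parent with children exits?

The question itself hints at the answer: reparenting happens when a process exits, so the process-exit code is the place to look. Generally, a process exits through the exit system call, which on the kernel side ends up in do_exit. Searching the whole tree for has_child_subreaper turns up the related code. A partial excerpt: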

void __noreturn do_exit(long code)
{
	struct task_struct *tsk = current;
    int group_dead;

    //...
    exit_notify(tsk, group_dead);
    //...
}

/*
 * Send signals to all our closest relatives so that they know
 * to properly mourn us..
 */
static void exit_notify(struct task_struct *tsk, int group_dead)
{
    //...
	LIST_HEAD(dead);

    //...
	forget_original_parent(tsk, &dead);
    //...
}

/*
 * This does two things:
 *
 * A.  Make init inherit all the child processes
 * B.  Check to see if any process groups have become orphaned
 *	as a result of our exiting, and if they have any stopped
 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
 */
static void forget_original_parent(struct task_struct *father,
					struct list_head *dead)
{
	struct task_struct *p, *t, *reaper;

	if (unlikely(!list_empty(&father->ptraced)))
		exit_ptrace(father, dead);

	/* Can drop and reacquire tasklist_lock */
	// Look up a reaper via task_active_pid_ns()->child_reaper and return it. Normally the process found here is the init process of the current pid namespace.
	reaper = find_child_reaper(father, dead);
	if (list_empty(&father->children))
		return;

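	// Starting from the init process and the has_child_subreaper flag, find the reaper that actually qualifies
	reaper = find_new_reaper(father, reaper);
	list_for_each_entry(p, &father->children, sibling) {// iterate over every child of the exiting process
		for_each_thread(p, t) {// iterate over every thread of that child
			RCU_INIT_POINTER(t->real_parent, reaper);// set the real parent, which is the qualifying reaper found above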
			BUG_ON((!t->ptrace) != (rcu_access_pointer(t->parent) == father));
			if (likely(!t->ptrace))
				t->parent = t->real_parent;
			if (t->pdeath_signal)
				group_send_sig_info(t->pdeath_signal,
						    SEND_SIG_NOINFO, t,
						    PIDTYPE_TGID);
		}
		/*
		 * If this is a threaded reparent there is no need to
		 * notify anyone anything has happened.
		 */
		if (!same_thread_group(reaper, father))
			reparent_leader(father, p, dead);
	}
	list_splice_tail_init(&father->children, &reaper->children);
}

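In forget_original_parent we can see that the whole function does one thing: it finds a reaper and hands all the children over to that reaper.

How do we escape the effect of PR_SET_CHILD_SUBREAPER?

The answer lies in find_new_reaper, called from forget_original_parent; that is, in how the has_child_subreaper flag takes effect. Let's look at what this function does: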

/*
 * When we die, we re-parent all our children, and try to:
 * 1. give them to another thread in our thread group, if such a member exists
 * 2. give it to the first ancestor process which prctl'd itself as a
 *    child_subreaper for its children (like a service manager)
 * 3. give it to the init process (PID 1) in our pid namespace
 */
static struct task_struct *find_new_reaper(struct task_struct *father,
					   struct task_struct *child_reaper)
{
	struct task_struct *thread, *reaper;

	thread = find_alive_thread(father);
	if (thread)
		return thread;

	if (father->signal->has_child_subreaper) {// this is where the has_child_subreaper flag takes effect
		unsigned int ns_level = task_pid(father)->level;
		/*
		 * Find the first ->is_child_subreaper ancestor in our pid_ns.
		 * We can't check reaper != child_reaper to ensure we do not
		 * cross the namespaces, the exiting parent could be injected
		 * by setns() + fork().
		 * We check pid->level, this is slightly more efficient than
		 * task_active_pid_ns(reaper) != task_active_pid_ns(father).
		 */
		for (reaper = father->real_parent;
		     task_pid(reaper)->level == ns_level;
		     reaper = reaper->real_parent) {
			if (reaper == &init_task)
				break;
			if (!reaper->signal->is_child_subreaper)
				continue;
			thread = find_alive_thread(reaper);
			if (thread)
				return thread;
		}
	}

	return child_reaper;
}

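From find_new_reaper we learn the following. After first trying a live thread in the exiting process's own thread group, if has_child_subreaper is set, the search starts at the exiting process's parent and walks up the ancestors; the first ancestor whose is_child_subreaper is set and that still has a live thread is returned as the real reaper. If has_child_subreaper is not set, the children are reparented to the init process.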
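The decision described above can be replayed in user space with a toy model. This is only a sketch under simplifying assumptions: struct toy_task and toy_find_new_reaper are invented names, each process has a single thread that is always alive, there is one pid namespace, and the walk stops at the init process instead of at the kernel's init_task. It is not the kernel implementation, just the same two-flag logic in miniature.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the fields of task_struct/signal_struct that matter here. */
struct toy_task {
    const char *name;
    struct toy_task *real_parent;
    int is_child_subreaper;   /* this process called prctl(PR_SET_CHILD_SUBREAPER, 1) */
    int has_child_subreaper;  /* some ancestor is a subreaper */
};

/*
 * Picks the process that adopts father's children when father exits.
 * init plays the role of the child_reaper argument in the kernel function.
 */
struct toy_task *toy_find_new_reaper(struct toy_task *father,
                                     struct toy_task *init)
{
    struct toy_task *reaper;

    assert(father != NULL && init != NULL);

    if (father->has_child_subreaper) {
        /* Walk up the ancestors and return the first one marked as a subreaper. */
        for (reaper = father->real_parent;
             reaper != NULL && reaper != init;
             reaper = reaper->real_parent) {
            if (reaper->is_child_subreaper)
                return reaper;
        }
    }
    /* Flag not set, or no subreaper found: fall back to init. */
    return init;
}
```

With a chain init -> systemd --user (is_child_subreaper = 1) -> bash -> a.out, where bash and a.out have has_child_subreaper = 1, the function picks systemd --user as the reaper for a.out's children. Clearing has_child_subreaper on a.out alone makes it return init, which is exactly the lever the kernel module later in this article pulls.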

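From this reasoning, there are two ways to escape the effect of PR_SET_CHILD_SUBREAPER:

1. Change the place that actually sets PR_SET_CHILD_SUBREAPER so that the attribute is never enabled, for example by modifying the systemd source.
2. Write a small kernel-mode tool that modifies has_child_subreaper of a given process. Once that flag is cleared, the process's children are reparented to the init process when it exits.

How do we escape the effect of PR_SET_CHILD_SUBREAPER?

Following the conclusion of the previous section: we normally would not modify open-source system programs such as systemd. So we chose to write a basic kernel module that modifies the relevant kernel data structure of the task directly. The module source is as follows: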

#include <linux/module.h>	/* Needed by all modules */
#include <linux/kernel.h>	/* Needed for KERN_INFO */

#include <linux/pid.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/sched/mm.h>
#include <linux/mm_types.h>
#include <linux/rwsem.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/mmap_lock.h>
#include <linux/pid_namespace.h>

MODULE_AUTHOR("sky <[email protected]>");
MODULE_DESCRIPTION("sky's hack");
MODULE_LICENSE("GPL");
MODULE_VERSION("1.0.0");

static int hack_pid = -1;

/* The parameter type must match the C type of hack_pid, which is int, not uint. */
module_param_named(hack_pid, hack_pid, int, S_IRUGO);
MODULE_PARM_DESC(hack_pid, "hack_pid");


int init_module(void)
{
	struct pid *pid_struct;
	struct task_struct *task;
	struct mm_struct *mm;
	struct pid_namespace *pid_ns;

	printk(KERN_INFO "Hello sky_hack.\n");
	printk(KERN_INFO "hack pid %d\n", hack_pid);

	/* RCU only needs to cover the pid lookup; get_pid_task() takes a reference on the task. */
	rcu_read_lock();
	pid_struct = find_vpid(hack_pid);
	if (!pid_struct) {
		rcu_read_unlock();
		printk(KERN_ERR "get pid struct failed.\n");
		return -ESRCH;
	}
	task = get_pid_task(pid_struct, PIDTYPE_PID);
	rcu_read_unlock();
	if (!task) {
		printk(KERN_ERR "get task struct failed.\n");
		return -ESRCH;
	}

	/*
	 * Print the executable path for reference. mmap_read_lock() may sleep,
	 * so this runs outside the RCU read-side critical section.
	 */
	mm = get_task_mm(task);
	if (mm) {
		mmap_read_lock(mm);
		if (mm->exe_file) {
			char *pathname = kmalloc(PATH_MAX, GFP_KERNEL);

			if (pathname) {
				char *p = d_path(&mm->exe_file->f_path, pathname, PATH_MAX);

				/* d_path() returns an ERR_PTR on failure */
				if (!IS_ERR(p))
					printk(KERN_INFO "process full path %s\n", p);
				kfree(pathname);
			}
		}
		mmap_read_unlock(mm);
		mmput(mm);	/* drop the reference taken by get_task_mm() */
	} else {
		printk(KERN_INFO "get mm struct failed.\n");
	}

	rcu_read_lock();
	pid_ns = task_active_pid_ns(task);
	if (pid_ns)
		printk(KERN_INFO "pid_ns->child_reaper=%p, current task_struct=%p\n", pid_ns->child_reaper, task);
	rcu_read_unlock();

	printk(KERN_INFO "is_child_subreaper %d\n", task->signal->is_child_subreaper);
	printk(KERN_INFO "has_child_subreaper %d\n", task->signal->has_child_subreaper);

	/* Clear the flag so that find_new_reaper(), reached from do_exit(), skips the subreaper search. */
	task->signal->has_child_subreaper = 0;

	put_task_struct(task);	/* drop the reference taken by get_pid_task() */

	return 0;
}

void cleanup_module(void)
{
	printk(KERN_INFO "Goodbye sky_hack.\n");
}

The sole purpose of this driver is to set has_child_subreaper of the process with the given pid to 0, so that the process escapes the subreaper.

Makefile for building

obj-m += sky_hack.o
all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

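Now we repeat the earlier test: run a.out, look at its process tree, then load the module with insmod sky_hack.ko hack_pid=<pid of the a.out parent process>. The module has to be loaded while the parent is still alive, because find_new_reaper checks the flag of the exiting parent. Then wait for the a.out parent to exit and look at the process tree again.

Output of running a.out:

We can see that after following the steps above, the second ppid printed by a.out is now 1, which means we escaped the subreaper.

Next, the output of insmod sky_hack.ko hack_pid=14749:

We can see that the driver printed has_child_subreaper of the a.out process as 1. The driver then reset it, so on exit the child was reparented to the init process.

Now let's look at the process tree throughout this stage:

The process layout here is the same as at the beginning of the article.

After escaping the subreaper:

This time the process layout differs from the one at the beginning: we successfully reparented our child process to the init process.

Afterword

We ran into this phenomenon while working on a different problem, dug into its cause, and finally designed a technical way to escape it. Along the way we touched some kernel source and wrote a driver, and deepened our understanding of subreapers. Going through this process gives a fresh view of the Linux kernel and of Linux application development, and it also strengthens our overall problem-solving ability.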

From: https://www.cnblogs.com/Iflyinsky/p/17520936.html
