Linux调度pick_next_task_fair整体框架解读

标签：task rq fair next Linux prev cfs se

pick_next_task_fair是CFS调度类中选择next任务的主要路径，其主要功能是从当前CPU的就绪队列cfs_rq中选出一个可运行的任务作为"next任务"，并将前一个任务prev重新放到就绪队列。

下面是这段代码框架流程解读。

1 判断rq->cfs.nr_running>0？如果不满足说明没有可运行任务则goto idle

2 从最顶层的就绪队列rq->cfs逐层向下遍历，(1)检查每一层的cfs_rq是否throttle，如果throttle则判断是否还有可运行任务，若无则goto idle; (2) pick_next_entity(cfs_rq, curr)选择对应的调度实体，直到叶子节点task(se->my_q为空)

3 经过第2)就选好了下一个要运行的task，接下来就要将prev task从CPU上”put”卸载下来；同时将next task从就绪队列set到curr。

这里要考虑几个因素：(1) prev task从CPU卸载下来时，它的上层级se，如parent是否也要卸载？(2) next task从就绪队列set为curr时，它的上层se，如parent是否也要同样set? (3) 如果prev task和next task在同一个cgroup控制组，即它们都parent是同一个se，情况如何呢？

3.1 从prev和next task的叶子节点开始向上层级遍历(用pse和se分别代表prev和next的调度实体)，直到prev和next处于同一层级，即 pse->cfs_rq == se->cfs_rq

3.2 在每次遍历过程中判断pse和se的层级深度，如果(pse->depth >= se->depth)则先put_prev_entity(pse)然后prev往parent上升一级pse = pse->parent;

相反，如果(se->depth >= pse->depth)则先set_next_entity(se)然后se往parent上升一级se = se->parent;

3.3 最终，se和pse处于同一个层级，同为sibling

下面是代码和注释，代码基于开源linux-5.10:

struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
        struct cfs_rq *cfs_rq = &rq->cfs;    //此rq的顶层cfs_rq队列
        struct sched_entity *se;
        struct task_struct *p;
        int new_tasks;

again:
        if (!sched_fair_runnable(rq))    //rq->cfs.nr_running>0； 思考：rq->cfs.nr_running和rq->nr_running孰大孰小？
                goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED
        if (!prev || prev->sched_class != &fair_sched_class)
                goto simple;

        /*
         * Because of the set_next_buddy() in dequeue_task_fair() it is rather
         * likely that a next task is from the same cgroup as the current.
         *
         * Therefore attempt to avoid putting and setting the entire cgroup
         * hierarchy, only change the part that actually changes.
         */
       //由于set_next_buddy() in dequeue_task_fair() 很可能将next和prev在同一个cgroup，因此尽量只针对修改过的部分，而非调整整个整个hierarchy
        do {    //从最顶层的cfs_rq开始遍历，直到层级的叶子节点se，即真正的task(判断叶子节点方法：se->my_q为NULL)
                struct sched_entity *curr = cfs_rq->curr;

                /*
                 * Since we got here without doing put_prev_entity() we also
                 * have to consider cfs_rq->curr. If it is still a runnable
                 * entity, update_curr() will update its vruntime, otherwise
                 * forget we've ever seen it.
                 */
                if (curr) {
                        if (curr->on_rq)
                                update_curr(cfs_rq);
                        else
                                curr = NULL;

                        /*
                         * This call to check_cfs_rq_runtime() will do the
                         * throttle and dequeue its entity in the parent(s).
                         * Therefore the nr_running test will indeed
                         * be correct.
                         */
                        if (unlikely(check_cfs_rq_runtime(cfs_rq))) {    //【0】
                                cfs_rq = &rq->cfs;

                                if (!cfs_rq->nr_running)
                                        goto idle;

                                goto simple;
                        }
                }

                se = pick_next_entity(cfs_rq, curr);    //【1】从cfs_rq选一个next se，具体算法：todo
                cfs_rq = group_cfs_rq(se);        //se->my_q
        } while (cfs_rq);

        p = task_of(se);

        /*
         * Since we haven't yet done put_prev_entity and if the selected task
         * is a different task than we started out with, try and touch the
         * least amount of cfs_rqs.
         */
        if (prev != p) {
                struct sched_entity *pse = &prev->se;
          //从prev和se的叶子节点不断向上level调整层级，直到二者的se->cfs_rq一致，即同一个level为止
                while (!(cfs_rq = is_same_group(se, pse))) {    //循环对prev和next  se做put或set，直到它们都在同一个level
                        int se_depth = se->depth;                    //如何判断它们都层级深浅呢？通过se->depth来判定的。
                        int pse_depth = pse->depth;

                        if (se_depth <= pse_depth) {    //如果se<=prev的层级，则put prev，且prev往parent生一个level，最终目的是要和se保持同一个level
                                put_prev_entity(cfs_rq_of(pse), pse);
                                pse = parent_entity(pse);
                        }
                        if (se_depth >= pse_depth) {
                                set_next_entity(cfs_rq_of(se), se);
                                se = parent_entity(se);
                        }
                }
          //到此二者的层级一致，此层级下面的层级都已经更新
         // 这样se->parent的状态和se的状态保持一致，例如se->statistics.wait_start都会清0
        //另外，如果se和pse在同一个group，会发生什么？pse是等待状态，但是pse->parent和se->parent是同一个
                put_prev_entity(cfs_rq, pse);        //【2】
                set_next_entity(cfs_rq, se);        //【3】
        }

        goto done;
simple:
#endif
        if (prev)
                put_prev_task(rq, prev);

        do {
                se = pick_next_entity(cfs_rq, NULL);
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

        p = task_of(se);

done: __maybe_unused;
#ifdef CONFIG_SMP
        /*
         * Move the next running task to the front of
         * the list, so our cfs_tasks list becomes MRU
         * one.
         */
        list_move(&p->se.group_node, &rq->cfs_tasks);
#endif

        if (hrtick_enabled(rq))
                hrtick_start_fair(rq, p);

        update_misfit_status(p, rq);

        return p;

idle:
        if (!rf)
                return NULL;

        new_tasks = newidle_balance(rq, rf);

        /*
         * Because newidle_balance() releases (and re-acquires) rq->lock, it is
         * possible for any higher priority task to appear. In that case we
         * must re-start the pick_next_entity() loop.
         */
        if (new_tasks < 0)
                return RETRY_TASK;

        if (new_tasks > 0)
                goto again;

        /*
         * rq is about to be idle, check if we need to update the
         * lost_idle_time of clock_pelt
         */
        update_idle_rq_clock_pelt(rq);

        return NULL;
}

标签：task,rq,fair,next,Linux,prev,cfs,se
From： https://www.cnblogs.com/liuhailong0112/p/17981594

Linux调度pick_next_task_fair整体框架解读

相关文章

赞助商

阅读排行