Scheduling Policy
Linux scheduling is based on the time-sharing technique: several processes run in "time multiplexing" because the CPU time is divided into slices (called quanta), one for each runnable process.
An alternative classification distinguishes three classes of processes (instead of the traditional I/O-bound versus CPU-bound distinction):
- Interactive processes
These interact constantly with their users, and therefore spend a lot of time waiting for keypresses and mouse operations.
- Batch processes
These do not need user interaction, and hence they often run in the background. Because such processes do not need to be very responsive, they are often penalized by the scheduler.
- Real-time processes
These have very stringent scheduling requirements. Such processes should never be blocked by lower-priority processes and should have a short guaranteed response time with a minimum variance. A typical example is a program that collects data from physical sensors.
While real-time programs are explicitly recognized as such by the scheduling algorithm in Linux, there is no easy way to distinguish between interactive and batch programs: the scheduler can only tell them apart by observing the behavior of a process over time.
Process Preemption
As mentioned in the first chapter, Linux processes are preemptable. When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run. Of course, a process also may be preempted when its time quantum expires. When this occurs, the TIF_NEED_RESCHED flag in the thread_info structure of the current process is set, so the scheduler is invoked when the timer interrupt handler terminates.
How Long Must a Quantum Last?
Not too long, not too short: a quantum that is too short wastes CPU time on process-switch overhead, while one that is too long makes the system appear unresponsive to interactive users.
The Scheduling Algorithm
The scheduling algorithm of Linux 2.6 selects the process to run in constant time, independently of the number of runnable processes. It also scales well with the number of processors because each CPU has its own queue of runnable processes. Furthermore, the new algorithm does a better job of distinguishing interactive processes and batch processes.
there is always at least one runnable process: the swapper process, which has PID 0 and executes only when the CPU cannot execute other processes. As mentioned in Chapter 3, every CPU of a multiprocessor system has its own swapper process with PID equal to 0.
Classification of the scheduling strategy
The scheduling algorithm behaves quite differently depending on whether the process is conventional or real-time.
Every Linux process is always scheduled according to one of the following scheduling classes:
- SCHED_FIFO
A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority real-time process is runnable, the process continues to use the CPU as long as it wishes, even if other real-time processes that have the same priority are runnable.
- SCHED_RR
A Round Robin real-time process. When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the runqueue list. This policy ensures a fair assignment of CPU time to all SCHED_RR real-time processes that
have the same priority.
- SCHED_NORMAL
A conventional, time-shared process.
Scheduling of Conventional Processes
Every conventional process has its own static priority, which is a value used by the scheduler to rate the process with respect to the other conventional processes in the system.
A new process always inherits the static priority of its parent; a process can change its own static priority through the nice() system call.
Base time quantum
The base time quantum, in milliseconds, is derived from the static priority: (140 − static priority) × 20 if the static priority is lower than 120, and (140 − static priority) × 5 otherwise. Processes with a higher static priority (a lower value) thus get longer base quanta.
Dynamic priority and average sleep time
Besides a static priority, a conventional process also has a dynamic priority, which is a value ranging from 100 (highest priority) to 139 (lowest priority). The dynamic priority is the number actually looked up by the scheduler when selecting the new process to run.
The bonus is a value ranging from 0 to 10; a value less than 5 represents a penalty that lowers the dynamic priority, while a value greater than 5 is a premium that raises it. The bonus is derived from the average sleep time of the process, and the dynamic priority is computed as: dynamic priority = max(100, min(static priority − bonus + 5, 139)).
average sleep time
Roughly, the average sleep time is the average number of nanoseconds that the process spent while sleeping. Be warned, however, that this is not an average operation on the elapsed time. For instance, sleeping in TASK_INTERRUPTIBLE state contributes to the average sleep time in a different way from sleeping in TASK_UNINTERRUPTIBLE state. Moreover, the average sleep time decreases while a process is running. Finally, the average sleep time can never become larger than 1 second.
The average sleep time is also used by the scheduler to determine whether a given process should be considered interactive or batch. More precisely, a process is considered “interactive” if it satisfies the following formula:
dynamic priority ≤ 3 × static priority / 4 + 28
which is equivalent to:
bonus - 5 ≥ static priority / 4 − 28
The expression static priority / 4 − 28 is called the interactive delta.
Active and expired processes
Even if conventional processes having higher static priorities get larger slices of the CPU time, they should not completely lock out the processes having lower static priority. To achieve this, the scheduler keeps two disjoint sets of runnable processes:
- Active processes
These runnable processes have not yet exhausted their time quantum.
- Expired processes
These runnable processes have exhausted their time quantum.
strategy
An active interactive process that finishes its time quantum usually remains active: the scheduler refills its time quantum and leaves it in the set of active processes. However, the scheduler moves an interactive process that finished its time quantum into the set of expired processes if the eldest expired process has already waited for a long time, or if an expired process has higher static priority (lower value) than the interactive process.
Scheduling of Real-Time Processes
Every real-time process is associated with a real-time priority, which is a value ranging from 1 (highest priority) to 99 (lowest priority). The scheduler always favors a higher priority runnable process over a lower priority one; in other words, a real-time process inhibits the execution of every lower-priority process while it remains runnable.
A real-time process is replaced by another process only when one of the following events occurs:
- The process is preempted by another process having higher real-time priority.
- The process performs a blocking operation, and it is put to sleep.
- The process is stopped (in state TASK_STOPPED or TASK_TRACED), or it is killed.
- The process voluntarily relinquishes the CPU by invoking the sched_yield() system call.
- The process is Round Robin real-time (SCHED_RR), and it has exhausted its time quantum.
The nice() and setpriority() system calls, when applied to a Round Robin real-time process, do not change the real-time priority but rather the duration of the base time quantum.
Data Structures used by the Scheduler
The runqueue Data Structure
Each CPU in the system has its own runqueue; all runqueue structures are stored in the runqueues per-CPU variable. The this_rq() macro yields the address of the runqueue of the local CPU, while the cpu_rq(n) macro yields the address of the runqueue of the CPU having index n.
Every runnable process in the system belongs to one, and just one, runqueue. As long as a runnable process remains in the same runqueue, it can be executed only by the CPU owning that runqueue. However, as we’ll see later, runnable processes may migrate from one runqueue to another.
Periodically, the role of the two prio_array_t data structures in the runqueue's arrays field changes: the active processes suddenly become the expired processes, and the expired processes become the active ones. To achieve this change, the scheduler simply exchanges the contents of the active and expired fields of the runqueue.
Process Descriptor
time_slice
Ticks left in the time quantum of the process.
When a process forks, the number of ticks left to the parent is split in two halves: one for the parent and one for the child. More generally speaking, a process cannot hog CPU time by forking multiple descendants (unless it has the privileges to give itself a real-time policy).
array
Pointer to the runqueue’s prio_array_t set that includes the process.
first_time_slice
Flag set to 1 if the process has never exhausted its time quantum.
sched_clock()
essentially, this function returns the contents of the 64-bit TSC register converted to nanoseconds.
Functions Used by the Scheduler
Related functions
scheduler_tick()
Keeps the time_slice counter of current up-to-date
try_to_wake_up()
Awakens a sleeping process
recalc_task_prio()
Updates the dynamic priority of a process
schedule()
Selects a new process to be executed
load_balance()
Keeps the runqueues of a multiprocessor system balanced
The scheduler_tick() Function
timer_interrupt() -> update_process_times() -> scheduler_tick()
scheduler_tick() decreases the time slice counter of the current process, and checks whether its quantum is exhausted.
1. Stores the current timestamp in the timestamp_last_tick field of the local runqueue.
2. Checks whether the current process is the swapper (PID 0) process of the local CPU. If so, it performs the following substeps:
   - If the local runqueue includes another runnable process besides swapper, it sets the TIF_NEED_RESCHED flag of the current process to force rescheduling.
   - Jumps to step 7.
3. Checks whether current->array points to the active list of the local runqueue. If not, the process has expired its time quantum but has not yet been replaced: it sets the TIF_NEED_RESCHED flag and jumps to step 7.
4. Acquires the this_rq()->lock spin lock.
5. Decreases the time slice counter of the current process and checks whether the quantum is exhausted; the details depend on the scheduling class of the process (see below).
6. Releases the this_rq()->lock spin lock.
7. Invokes the rebalance_tick() function, which should ensure that the runqueues of the various CPUs contain approximately the same number of runnable processes.
Updating the time slice of a real-time process
- FIFO real-time process
Do nothing: the process keeps running, and there is no need to update its time slice.
- RR real-time process
Decrease the time slice counter and, when it reaches zero, refill it and move the process to the end of its priority queue:
if (current->policy == SCHED_RR && !--current->time_slice) {
    /* refill the time slice of current; task_timeslice() returns the
       base time quantum based on the static priority of current */
    current->time_slice = task_timeslice(current);
    current->first_time_slice = 0;
    set_tsk_need_resched(current);
    list_del(&current->run_list);
    list_add_tail(&current->run_list,
                  this_rq()->active->queue + current->prio);
}
Updating the time slice of a conventional process
- Decreases the time slice counter (current->time_slice).
- Checks the time slice counter. If the time quantum is exhausted, the function performs the following operations:
- Invokes dequeue_task() to remove current from the this_rq()->active set of runnable processes.
- Invokes set_tsk_need_resched() to set the TIF_NEED_RESCHED flag.
- Updates the dynamic priority of current: current->prio = effective_prio(current);
- Refills the time quantum of the process.
- Moves the process into the expired set of the runqueue and updates the runqueue's expired-set bookkeeping (p. 273).
- Otherwise, if the time quantum is not exhausted (current->time_slice is not zero), checks whether the remaining time slice of the current process is too long; if so, the process may be moved to the end of its priority list so that other runnable processes of equal priority get a chance to run.
The try_to_wake_up() Function
The try_to_wake_up() function awakes a sleeping or stopped process by setting its state to TASK_RUNNING and inserting it into the runqueue of the local CPU. The function receives as its parameters:
- The descriptor pointer (p) of the process to be awakened.
- A mask of the process states (state) that can be awakened.
- A flag (sync) that forbids the awakened process to preempt the process currently running on the local CPU.
The function performs the following operations:
1. Invokes the task_rq_lock() function to disable local interrupts and to acquire the lock of the runqueue rq owned by the CPU that was last executing the process (it could be different from the local CPU). The logical number of that CPU is stored in the p->thread_info->cpu field.
2. Checks whether the state of the process, p->state, belongs to the mask of states passed as the state argument to the function; if not, it jumps to step 9 to terminate the function.
3. If the p->array field is not NULL, the process already belongs to a runqueue; therefore, it jumps to step 8.
4. In multiprocessor systems, checks whether the process to be awakened should be migrated from the runqueue of the CPU that last executed it to the runqueue of another CPU (p. 274 lists some heuristic rules used to select the CPU).
5. If the process is in the TASK_UNINTERRUPTIBLE state, decreases the nr_uninterruptible field of the target runqueue and sets the p->activated field of the process descriptor to -1.
6. Invokes the activate_task() function, which in turn performs the following substeps:
   - Invokes sched_clock() to get the current timestamp in nanoseconds.
   - Invokes recalc_task_prio(), passing to it the process descriptor pointer and the timestamp computed in the previous step (this updates the dynamic priority).
   - Sets the value of the p->activated field.
   - Sets the p->timestamp field with the timestamp computed before.
   - Inserts the process descriptor in the active set.
7. If either the target CPU is not the local CPU or the sync flag is not set, checks whether the new runnable process has a dynamic priority higher than that of the current process of the rq runqueue (p->prio < rq->curr->prio); if so, invokes resched_task() to preempt rq->curr. In multiprocessor systems, resched_task() also checks some kernel flags and, if needed, invokes smp_send_reschedule() to raise an IPI and force rescheduling on the target CPU.
8. Sets the p->state field of the process to TASK_RUNNING.
9. Invokes task_rq_unlock() to unlock the rq runqueue and reenable local interrupts.
10. Returns 1 if the process has been successfully awakened, or 0 otherwise.
The recalc_task_prio() Function
try_to_wake_up() -> recalc_task_prio()
The recalc_task_prio() function updates the average sleep time and the dynamic priority of a process. It receives as its parameters a process descriptor pointer p and a timestamp now computed by the sched_clock() function.
The function executes the following operations:
1. Stores in the sleep_time local variable the result of min(now − p->timestamp, 10^9). The p->timestamp field contains the timestamp of the process switch that put the process to sleep.
2. If sleep_time is not greater than zero, jumps to step 8 so as to skip updating the average sleep time of the process.
3. Checks whether the process is not a kernel thread, whether it is awakening from the TASK_UNINTERRUPTIBLE state (p->activated field equal to −1), and whether it has been continuously asleep beyond a given sleep-time threshold (related to its static priority). If so, sets sleep_time to an empirical value (the equivalent of 900 ticks). The goal of this rule is to ensure that processes that have been asleep for a long time in uninterruptible mode (usually waiting for disk I/O operations) get a predefined sleep average value that is large enough to let them be quickly serviced, but not so large as to cause starvation of other processes.
4. Executes the CURRENT_BONUS macro to compute the bonus value of the previous average sleep time of the process: the lower the current average sleep time is, the more rapidly it rises (sleep_time is multiplied by a larger factor).
5. If the process is in the TASK_UNINTERRUPTIBLE mode and is not a kernel thread, performs some steps that limit the increment of the average sleep time.
6. Adds sleep_time to the average sleep time of the process.
7. Checks whether p->sleep_avg exceeds the equivalent of 1000 ticks; if so, cuts it down to 1000 ticks.
8. Updates the dynamic priority of the process: p->prio = effective_prio(p);
The effective_prio() function reads the static_prio and sleep_avg fields of the process and computes its dynamic priority.
The schedule() Function
The schedule() function implements the scheduler. Its objective is to find a process in the runqueue list and then assign the CPU to it. It is invoked, directly or in a lazy (deferred) way, by several kernel routines.
Direct invocation
The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available.
For example, the exit() system call ends up invoking schedule() directly, because the terminating process must relinquish the CPU for good.
The scheduler is also directly invoked by many device drivers that execute long iterative tasks. At each iteration cycle, the driver checks the value of the TIF_NEED_RESCHED flag and, if necessary, invokes schedule() to voluntarily relinquish the CPU.
Lazy invocation
The scheduler can also be invoked in a lazy way by setting the TIF_NEED_RESCHED flag of current. Because a check on the value of this flag is always made before resuming the execution of a User Mode process, schedule() will definitely be invoked at some time in the near future.
Typical examples of lazy invocation of the scheduler are:
- When current has used up its quantum of CPU time; this is done by the scheduler_tick() function.
- When a process is woken up and its priority is higher than that of the current process; this task is performed by the try_to_wake_up() function.
- When a sched_setscheduler() system call is issued.
Actions performed by schedule() before a process switch
- Disables kernel preemption and initializes a few local variables (struct task_struct *prev = current;).
- Handles the Big Kernel Lock.
- Invokes sched_clock() and computes the duration of the CPU time used by prev.
- Favors processes with a higher average sleep time by executing run_time /= (CURRENT_BONUS(prev) ? : 1);.
- Before starting to look at the runnable processes, schedule() must disable the local interrupts and acquire the spin lock that protects the runqueue.
- Looks at the PF_DEAD flag to detect the termination of the process.
- Examines the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, it should be removed from the runqueue (by invoking deactivate_task()). However, if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING and leaves it in the runqueue, giving it a chance to run again.
- Before starting to look at the runnable processes, schedule() must disable the local interrupts and acquire the spin lock that protects the runqueue.
- Checks the number of runnable processes left in the runqueue. If there are runnable processes, the function invokes the dependent_sleeper() function, which checks whether the process that is going to be selected for execution has a significantly lower priority than a sibling process already running on a logical CPU of the same physical CPU; in this particular case, schedule() refuses to select the lower-priority process and executes the swapper process instead (this occurs in kernels supporting the hyper-threading technology).
- If no runnable process exists, the function invokes idle_balance() to move some runnable processes from another runqueue to the local runqueue (similar to load_balance(), see below); if this fails, the swapper process is selected to run.
- Then checks that at least one of the runnable processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure.
- Looks up a runnable process in the active prio_array_t data structure and stores it in the next local variable.
Action performed by schedule() to make the process switch
prefetch()
The prefetch macro is a hint to the CPU control unit to bring the contents of the first fields of the process descriptor of next (which overlap the thread_info structure) into the hardware cache. It is there just to improve the performance of schedule().
steps
- Does some administrative work (modifies flags and computes the time the process slept).
- It is quite possible that prev and next are the same process; in this case, the function skips the process switch.
- Otherwise, invokes context_switch().
The context_switch() function sets up the address space of next. The active_mm field of the process descriptor points to the memory descriptor that is used by the process, while the mm field points to the memory descriptor owned by the process. For normal processes, the two fields hold the same address; however, a kernel thread does not have its own address space, and its mm field is always set to NULL. The context_switch() function therefore ensures that if next is a kernel thread, it uses the address space used by prev, while if next is a regular process, it replaces the address space of prev with the one of next. If prev is a kernel thread or an exiting process, context_switch() saves the pointer to the memory descriptor used by prev in the runqueue's prev_mm field and resets prev->active_mm. Finally, context_switch() invokes switch_to() to perform the actual process switch.
Actions performed by schedule() after a process switch
The first instructions after a process switch are:
barrier();
finish_task_switch(prev);
Runqueue Balancing in Multiprocessor Systems
Starting from kernel version 2.6.7, Linux sports a sophisticated runqueue balancing algorithm based on the notion of “scheduling domains.”
Scheduling Domains
a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel. Generally speaking, scheduling domains are hierarchically organized: the top-most scheduling domain, which usually spans all CPUs in the system, includes children scheduling domains, each of which includes a subset of the CPUs. Thanks to the hierarchy of scheduling domains, workload balancing can be done in a rather efficient way.
Workload balancing is always done between groups of a scheduling domain. In other words, a process is
moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.
Every scheduling domain is represented by a sched_domain descriptor, while every group inside a scheduling domain is represented by a sched_group descriptor. Each sched_domain descriptor includes a field groups, which points to the first element in a list of group descriptors.
The rebalance_tick() Function
To keep the runqueues in the system balanced, the rebalance_tick() function is invoked by scheduler_tick() once every tick.
- The rebalance_tick() function first determines the number of processes in the runqueue and updates the runqueue's average workload.
- Then, rebalance_tick() starts a loop over all scheduling domains in the path from the base domain (referenced by the sd field of the local runqueue descriptor) to the top-level domain, and invokes load_balance() if needed (based on the CPU state and the flags passed to rebalance_tick()).
The load_balance() Function
The load_balance() function checks whether a scheduling domain is significantly unbalanced; more precisely, it checks whether unbalancing can be reduced by moving some processes from the busiest group to the runqueue of the local CPU.
load_balance() ----- find_busiest_group()
----- find_busiest_queue()
----- move_tasks()
The move_tasks() Function
The move_tasks() function moves processes from a source runqueue to the local runqueue.
move_tasks() ---- can_migrate_task()
---- pull_task() ---- dequeue_task()
---- enqueue_task()
---- resched_task()
System Calls Related to Scheduling
The nice() System Call
The nice() system call allows processes to change their base priority. It is maintained for backward compatibility only; it has been replaced by the setpriority() system call described next.
The getpriority() and setpriority() System Calls
The nice() system call affects only the process that invokes it. Two other system calls, getpriority() and setpriority(), act on the base priorities of all processes in a given group.
The sched_getaffinity() and sched_setaffinity() System Calls
The sched_getaffinity() and sched_setaffinity() system calls respectively return and set up the CPU affinity mask of a process—the bit mask of the CPUs that are allowed to execute the process.
System Calls Related to Real-Time Processes
This is a group of system calls that allow processes to change their scheduling discipline and, in particular, to become real-time processes. As usual, a process must have the CAP_SYS_NICE capability to modify the values of the rt_priority and policy process descriptor fields of any process, including itself.
- The sched_getscheduler() and sched_setscheduler() system calls
The sched_getscheduler() system call queries the scheduling policy currently applied to the process identified by the pid parameter; sched_setscheduler() sets both the scheduling policy and the associated parameters.
- The sched_getparam() and sched_setparam() system calls
The sched_getparam() system call retrieves the scheduling parameters (task_struct->rt_priority) for the process identified by pid.
- The sched_yield() system call
The sched_yield() system call allows a process to relinquish the CPU voluntarily without being suspended; the process remains in the TASK_RUNNING state.
- The sched_get_priority_min() and sched_get_priority_max() system calls
These system calls return, respectively, the minimum and the maximum real-time static priority value that can be used with the scheduling policy identified by the policy parameter.
- The sched_rr_get_interval() system call
The sched_rr_get_interval() system call writes into a structure stored in the User Mode address space the Round Robin time quantum for the real-time process identified by the pid parameter.