7.5. Runqueue Balancing in Multiprocessor SystemsWe have seen in Chapter 4 that Linux sticks to the Symmetric Multiprocessing model (SMP ); this means, essentially, that the kernel should not have any bias toward one CPU with respect to the others. However, multiprocessor machines come in many different flavors, and the scheduler behaves differently depending on the hardware characteristics. In particular, we will consider the following three types of multiprocessor machines:
These basic kinds of multiprocessor systems are often combined. For instance, a motherboard that includes two different hyper-threaded CPUs is seen by the kernel as four logical CPUs. As we have seen in the previous section, the schedule( ) function picks the new process to run from the runqueue of the local CPU. Therefore, a given CPU can execute only the runnable processes that are contained in the corresponding runqueue. On the other hand, a runnable process is always stored in exactly one runqueue: no runnable process ever appears in two or more runqueues. Therefore, until a process remains runnable, it is usually bound to one CPU. This design choice is usually beneficial for system performance, because the hardware cache of every CPU is likely to be filled with data owned by the runnable processes in the runqueue. In some cases, however, binding a runnable process to a given CPU might induce a severe performance penalty. For instance, consider a large number of batch processes that make heavy use of the CPU: if most of them end up in the same runqueue, one CPU in the system will be overloaded, while the others will be nearly idle. Therefore, the kernel periodically checks whether the workloads of the runqueues are balanced and, if necessary, moves some process from one runqueue to another. However, to get the best performance from a multiprocessor system, the load balancing algorithm should take into consideration the topology of the CPUs in the system. Starting from kernel version 2.6.7, Linux sports a sophisticated runqueue balancing algorithm based on the notion of "scheduling domains." Thanks to the scheduling domains, the algorithm can be easily tuned for all kinds of existing multiprocessor architectures (and even for recent architectures such as those based on the "multi-core" microprocessors). 7.5.1. Scheduling DomainsEssentially, a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel. Generally speaking, scheduling domains are hierarchically organized: the top-most scheduling domain, which usually spans all CPUs in the system, includes children scheduling domains, each of which include a subset of the CPUs. Thanks to the hierarchy of scheduling domains, workload balancing can be done in a rather efficient way. Every scheduling domain is partitioned, in turn, in one or more groups, each of which represents a subset of the CPUs of the scheduling domain. Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain. Figure 7-2 illustrates three examples of scheduling domain hierarchies, corresponding to the three main architectures of multiprocessor machines. Figure 7-2. Three examples of scheduling domain hierarchies![]() Figure 7-2 (a) represents a hierarchy composed by a single scheduling domain for a 2-CPU classic multiprocessor architecture. The scheduling domain includes only two groups, each of which includes one CPU. Figure 7-2 (b) represents a two-level hierarchy for a 2-CPU multiprocessor box with hyper-threading technology. The top-level scheduling domain spans all four logical CPUs in the system, and it is composed by two groups. Each group of the top-level domain corresponds to a child scheduling domain and spans a physical CPU. The bottom-level scheduling domains (also called base scheduling domains ) include two groups, one for each logical CPU. Finally, Figure 7-2 (c) represents a two-level hierarchy for an 8-CPU NUMA architecture with two nodes and four CPUs per node. The top-level domain is organized in two groups, each of which corresponds to a different node. Every base scheduling domain spans the CPUs inside a single node and has four groups, each of which spans a single CPU. Every scheduling domain is represented by a sched_domain descriptor, while every group inside a scheduling domain is represented by a sched_group descriptor. Each sched_domain descriptor includes a field groups, which points to the first element in a list of group descriptors. Moreover, the parent field of the sched_domain structure points to the descriptor of the parent scheduling domain, if any. The sched_domain descriptors of all physical CPUs in the system are stored in the per-CPU variable phys_domains. If the kernel does not support the hyper-threading technology, these domains are at the bottom level of the domain hierarchy, and the sd fields of the runqueue descriptors point to themthat is, they are the base scheduling domains. Conversely, if the kernel supports the hyper-threading technology, the bottom-level scheduling domains are stored in the per-CPU variable cpu_domains. 7.5.2. The rebalance_tick( ) FunctionTo keep the runqueues in the system balanced, the rebalance_tick( ) function is invoked by scheduler_tick( ) once every tick. It receives as its parameters the index this_cpu of the local CPU, the address this_rq of the local runqueue, and a flag, idle, which can assume the following values:
The rebalance_tick( ) function determines first the number of processes in the runqueue and updates the runqueue's average workload; to do this, the function accesses the nr_running and cpu_load fields of the runqueue descriptor. Then, rebalance_tick( ) starts a loop over all scheduling domains in the path from the base domain (referenced by the sd field of the local runqueue descriptor) to the top-level domain. In each iteration the function determines whether the time has come to invoke the load_balance( ) function, thus executing a rebalancing operation on the scheduling domain. The value of idle and some parameters stored in the sched_domain descriptor determine the frequency of the invocations of load_balance( ). If idle is equal to SCHED_IDLE, then the runqueue is empty, and rebalance_tick( ) invokes load_balance( ) quite often (roughly once every one or two ticks for scheduling domains corresponding to logical and physical CPUs). Conversely, if idle is equal to NOT_IDLE, rebalance_tick( ) invokes load_balance( ) sparingly (roughly once every 10 milliseconds for scheduling domains corresponding to logical CPUs, and once every 100 milliseconds for scheduling domains corresponding to physical CPUs). 7.5.3. The load_balance( ) FunctionThe load_balance( ) function checks whether a scheduling domain is significantly unbalanced; more precisely, it checks whether unbalancing can be reduced by moving some processes from the busiest group to the runqueue of the local CPU. If so, the function attempts this migration. It receives four parameters:
The function performs the following operations:
7.5.4. The move_tasks( ) FunctionThe move_tasks( ) function moves processes from a source runqueue to the local runqueue. It receives six parameters: this_rq and this_cpu (the local runqueue descriptor and the local CPU index), busiest (the source runqueue descriptor), max_nr_move (the maximum number of processes to be moved), sd (the address of the scheduling domain descriptor in which this balancing operation is carried on), and the idle flag (beside SCHED_IDLE and NOT_IDLE, this flag can also be set to NEWLY_IDLE when the function is indirectly invoked by idle_balance( ); see the section "The schedule( ) Function" earlier in this chapter). The function first analyzes the expired processes of the busiest runqueue, starting from the higher priority ones. When all expired processes have been scanned, the function scans the active processes of the busiest runqueue. For each candidate process, the function invokes can_migrate_task( ), which returns 1 if all the following conditions hold:
If can_migrate_task( ) returns the value 1, move_tasks( ) invokes the pull_task( ) function to move the candidate process to the local runqueue. Essentially, pull_task( ) executes dequeue_task( ) to remove the process from the remote runqueue, then executes enqueue_task( ) to insert the process in the local runqueue, and finally, if the process just moved has higher dynamic priority than current, invokes resched_task( ) to preempt the current process of the local CPU. |