7.5. Runqueue Balancing in Multiprocessor Systems

We have seen in Chapter 4 that Linux sticks to the Symmetric Multiprocessing model (SMP ); this means, essentially, that the kernel should not have any bias toward one CPU with respect to the others. However, multiprocessor machines come in many different flavors, and the scheduler behaves differently depending on the hardware characteristics. In particular, we will consider the following three types of multiprocessor machines:


Classic multiprocessor architecture

Until recently, this was the most common architecture for multiprocessor machines. These machines have a common set of RAM chips shared by all CPUs.


Hyper-threading

A hyper-threaded chip is a microprocessor that executes several threads of execution at once; it includes several copies of the internal registers and quickly switches between them. This technology, which was invented by Intel, allows the processor to exploit the machine cycles to execute another thread while the current thread is stalled for a memory access. A hyper-threaded physical CPU is seen by Linux as several different logical CPUs.


NUMA

CPUs and RAM chips are grouped in local "nodes" (usually a node includes one CPU and a few RAM chips). The memory arbiter (a special circuit that serializes the accesses to RAM performed by the CPUs in the system, see the section "Memory Addresses" in Chapter 2) is a bottleneck for the performance of the classic multiprocessor systems. In a NUMA architecture, when a CPU accesses a "local" RAM chip inside its own node, there is little or no contention, thus the access is usually fast; on the other hand, accessing a "remote" RAM chip outside of its node is much slower. We'll mention in the section "Non-Uniform Memory Access (NUMA)" in Chapter 8 how the Linux kernel memory allocator supports NUMA architectures.

These basic kinds of multiprocessor systems are often combined. For instance, a motherboard that includes two different hyper-threaded CPUs is seen by the kernel as four logical CPUs.

As we have seen in the previous section, the schedule( ) function picks the new process to run from the runqueue of the local CPU. Therefore, a given CPU can execute only the runnable processes that are contained in the corresponding runqueue. On the other hand, a runnable process is always stored in exactly one runqueue: no runnable process ever appears in two or more runqueues. Therefore, until a process remains runnable, it is usually bound to one CPU.

This design choice is usually beneficial for system performance, because the hardware cache of every CPU is likely to be filled with data owned by the runnable processes in the runqueue. In some cases, however, binding a runnable process to a given CPU might induce a severe performance penalty. For instance, consider a large number of batch processes that make heavy use of the CPU: if most of them end up in the same runqueue, one CPU in the system will be overloaded, while the others will be nearly idle.

Therefore, the kernel periodically checks whether the workloads of the runqueues are balanced and, if necessary, moves some process from one runqueue to another. However, to get the best performance from a multiprocessor system, the load balancing algorithm should take into consideration the topology of the CPUs in the system. Starting from kernel version 2.6.7, Linux sports a sophisticated runqueue balancing algorithm based on the notion of "scheduling domains." Thanks to the scheduling domains, the algorithm can be easily tuned for all kinds of existing multiprocessor architectures (and even for recent architectures such as those based on the "multi-core" microprocessors).

7.5.1. Scheduling Domains

Essentially, a scheduling domain is a set of CPUs whose workloads should be kept balanced by the kernel. Generally speaking, scheduling domains are hierarchically organized: the top-most scheduling domain, which usually spans all CPUs in the system, includes children scheduling domains, each of which include a subset of the CPUs. Thanks to the hierarchy of scheduling domains, workload balancing can be done in a rather efficient way.

Every scheduling domain is partitioned, in turn, in one or more groups, each of which represents a subset of the CPUs of the scheduling domain. Workload balancing is always done between groups of a scheduling domain. In other words, a process is moved from one CPU to another only if the total workload of some group in some scheduling domain is significantly lower than the workload of another group in the same scheduling domain.

Figure 7-2 illustrates three examples of scheduling domain hierarchies, corresponding to the three main architectures of multiprocessor machines.

Figure 7-2. Three examples of scheduling domain hierarchies


Figure 7-2 (a) represents a hierarchy composed by a single scheduling domain for a 2-CPU classic multiprocessor architecture. The scheduling domain includes only two groups, each of which includes one CPU.

Figure 7-2 (b) represents a two-level hierarchy for a 2-CPU multiprocessor box with hyper-threading technology. The top-level scheduling domain spans all four logical CPUs in the system, and it is composed by two groups. Each group of the top-level domain corresponds to a child scheduling domain and spans a physical CPU. The bottom-level scheduling domains (also called base scheduling domains ) include two groups, one for each logical CPU.

Finally, Figure 7-2 (c) represents a two-level hierarchy for an 8-CPU NUMA architecture with two nodes and four CPUs per node. The top-level domain is organized in two groups, each of which corresponds to a different node. Every base scheduling domain spans the CPUs inside a single node and has four groups, each of which spans a single CPU.

Every scheduling domain is represented by a sched_domain descriptor, while every group inside a scheduling domain is represented by a sched_group descriptor. Each sched_domain descriptor includes a field groups, which points to the first element in a list of group descriptors. Moreover, the parent field of the sched_domain structure points to the descriptor of the parent scheduling domain, if any.

The sched_domain descriptors of all physical CPUs in the system are stored in the per-CPU variable phys_domains. If the kernel does not support the hyper-threading technology, these domains are at the bottom level of the domain hierarchy, and the sd fields of the runqueue descriptors point to themthat is, they are the base scheduling domains. Conversely, if the kernel supports the hyper-threading technology, the bottom-level scheduling domains are stored in the per-CPU variable cpu_domains.

7.5.2. The rebalance_tick( ) Function

To keep the runqueues in the system balanced, the rebalance_tick( ) function is invoked by scheduler_tick( ) once every tick. It receives as its parameters the index this_cpu of the local CPU, the address this_rq of the local runqueue, and a flag, idle, which can assume the following values:


SCHED_IDLE

The CPU is currently idle, that is, current is the swapper process.


NOT_IDLE

The CPU is not currently idle, that is, current is not the swapper process.

The rebalance_tick( ) function determines first the number of processes in the runqueue and updates the runqueue's average workload; to do this, the function accesses the nr_running and cpu_load fields of the runqueue descriptor.

Then, rebalance_tick( ) starts a loop over all scheduling domains in the path from the base domain (referenced by the sd field of the local runqueue descriptor) to the top-level domain. In each iteration the function determines whether the time has come to invoke the load_balance( ) function, thus executing a rebalancing operation on the scheduling domain. The value of idle and some parameters stored in the sched_domain descriptor determine the frequency of the invocations of load_balance( ). If idle is equal to SCHED_IDLE, then the runqueue is empty, and rebalance_tick( ) invokes load_balance( ) quite often (roughly once every one or two ticks for scheduling domains corresponding to logical and physical CPUs). Conversely, if idle is equal to NOT_IDLE, rebalance_tick( ) invokes load_balance( ) sparingly (roughly once every 10 milliseconds for scheduling domains corresponding to logical CPUs, and once every 100 milliseconds for scheduling domains corresponding to physical CPUs).

7.5.3. The load_balance( ) Function

The load_balance( ) function checks whether a scheduling domain is significantly unbalanced; more precisely, it checks whether unbalancing can be reduced by moving some processes from the busiest group to the runqueue of the local CPU. If so, the function attempts this migration. It receives four parameters:


this_cpu

The index of the local CPU


this_rq

The address of the descriptor of the local runqueue


sd

Points to the descriptor of the scheduling domain to be checked


idle

Either SCHED_IDLE (local CPU is idle) or NOT_IDLE

The function performs the following operations:

  1. Acquires the this_rq->lock spin lock.

  2. Invokes the find_busiest_group( ) function to analyze the workloads of the groups inside the scheduling domain. The function returns the address of the sched_group descriptor of the busiest group, provided that this group does not include the local CPU; in this case, the function also returns the number of processes to be moved into the local runqueue to restore balancing. On the other hand, if either the busiest group includes the local CPU or all groups are essentially balanced, the function returns NULL. This procedure is not trivial, because the function tries to filter the statistical fluctuations in the workloads.

  3. If find_busiest_group( ) did not find a group not including the local CPU that is significantly busier than the other groups in the scheduling domain, the function releases the this_rq->lock spin lock, tunes the parameters in the scheduling domain descriptor so as to delay the next invocation of load_balance( ) on the local CPU, and terminates.

  4. Invokes the find_busiest_queue( ) function to find the busiest CPUs in the group found in step 2. The function returns the descriptor address busiest of the corresponding runqueue.

  5. Acquires a second spin lock, namely the busiest->lock spin lock. To prevent deadlocks, this has to be done carefully: the this_rq->lock is first released, then the two locks are acquired by increasing CPU indices.

  6. Invokes the move_tasks( ) function to try moving some processes from the busiest runqueue to the local runqueue this_rq (see the next section).

  7. If the move_task( ) function failed in migrating some process to the local runqueue, the scheduling domain is still unbalanced. Sets to 1 the busiest->active_balance flag and wakes up the migration kernel thread whose descriptor is stored in busiest->migration_thread. The migration kernel thread walks the chain of the scheduling domain, from the base domain of the busiest runqueue to the top domain, looking for an idle CPU. If an idle CPU is found, the kernel thread invokes move_tasks( ) to move one process into the idle runqueue.

  8. Releases the busiest->lock and this_rq->lock spin locks.

  9. Terminates.

7.5.4. The move_tasks( ) Function

The move_tasks( ) function moves processes from a source runqueue to the local runqueue. It receives six parameters: this_rq and this_cpu (the local runqueue descriptor and the local CPU index), busiest (the source runqueue descriptor), max_nr_move (the maximum number of processes to be moved), sd (the address of the scheduling domain descriptor in which this balancing operation is carried on), and the idle flag (beside SCHED_IDLE and NOT_IDLE, this flag can also be set to NEWLY_IDLE when the function is indirectly invoked by idle_balance( ); see the section "The schedule( ) Function" earlier in this chapter).

The function first analyzes the expired processes of the busiest runqueue, starting from the higher priority ones. When all expired processes have been scanned, the function scans the active processes of the busiest runqueue. For each candidate process, the function invokes can_migrate_task( ), which returns 1 if all the following conditions hold:

  • The process is not being currently executed by the remote CPU.

  • The local CPU is included in the cpus_allowed bitmask of the process descriptor.

  • At least one of the following holds:

    • The local CPU is idle. If the kernel supports the hyper-threading technology, all logical CPUs in the local physical chip must be idle.

    • The kernel is having trouble in balancing the scheduling domain, because repeated attempts to move processes have failed.

    • The process to be moved is not "cache hot" (it has not recently executed on the remote CPU, so one can assume that no data of the process is included in the hardware cache of the remote CPU).

If can_migrate_task( ) returns the value 1, move_tasks( ) invokes the pull_task( ) function to move the candidate process to the local runqueue. Essentially, pull_task( ) executes dequeue_task( ) to remove the process from the remote runqueue, then executes enqueue_task( ) to insert the process in the local runqueue, and finally, if the process just moved has higher dynamic priority than current, invokes resched_task( ) to preempt the current process of the local CPU.




Copyright @ 2007 OpenSourceProject.org.cn.部分作品为网上收集整理,供开源爱好者学习使用,如侵犯了您的权益,请联系chinaperl@gmail.com,本站将立即删除。