- Home›
- Technology and Research›
- Intel Technology Journal›
- Multi-Core Software
Multi-Core Software
Process Scheduling Challenges in the Era of Multi-Core Processors
MULTI-CORE SCHEDULING
Shared resource topologies in multi-core platforms pose interesting challenges and opportunities to the system software. Shared resources between cores like shared cache hierarchy, provide good resource utilization and make inter-core communication efficient. However, heterogeneous data access patterns of memory-intensive tasks running on the cores sharing caches can lead to cache contention and sub-optimal performance. Contention and its impact on performance depend on the resources shared, the number of active tasks, and the access patterns of the individual tasks. A fair amount of CPU time allocated to each task by the process scheduler will not essentially translate into efficient and fair usage of the shared resources. The main challenge before the process scheduler is to identify and predict the resource needs of each task and schedule them in a fashion that will minimize shared resource contention, maximize shared resource utilization, and exploit the advantage of shared resources between cores. To achieve this, the process scheduler needs to be aware of multi-core, shared resource topology, resource requirements of tasks, and the inter-relationships between the tasks.
In the following sections, we describe some of the multi-core scheduling mechanisms; challenges in exploiting optimal performance, and power savings in the SMP platform. We analyze the impact of Intel® Dynamic Acceleration Technology and processor power management technologies on these scheduling mechanisms. We also look into some of the heuristics that today's system software can exploit to minimize the shared resource contention among the cores sharing resources.
Experimental Setup
For our experiments and analysis, we primarily considered a dual-package SMP platform, with each package having two cores sharing a 4MB last-level cache. Different workloads such as SPECjbb2000, SPECjbb2005, SPECfp_rate of CPU2000, and an in-memory database search (IMDS) are considered for our analysis. These workloads are widely known except for the IMDS one. The IMDS workload is a non-standard workload simulating CPU and memory behavior of a typical database search algorithm. This workload is considered because of its high cycles per instruction (CPI) characteristic when compared to the other workloads used in our experiments. Run to run variations of these workloads are within 1%. Each of these workloads was run three times and the middle number was used for the performance comparisons.
Some of these workloads (SPECjbb2000, IMDS, for example) spawn threads sharing a process address space and some (like SPECfp_rate) spawn different processes, each having its own address space. Platform under test is run at 3GHz processor frequency unless otherwise stated and doesn't support Intel Dynamic Acceleration Technology.
Scheduling on Cores Sharing Resources vs. Not Sharing
In this workload scenario, all the considered workloads were run in a 2-task configuration. For example, the SPECjbb workload was run in two warehouse configurations (where each warehouse is represented by a user-level thread), and the SPECfp rate was run in the base configuration with two users (where each user is represented by an individual process). Similarly, the IMDS workload used two threads (belonging to the same process) to process the database queries.

Table 1: Performance difference between scheduling two tasks running on two cores using different last-level cache vs. scheduling on cores sharing same last-level cache. The higher % means that scheduling on cores with different caches is better.
click image for larger view
With two running threads on a dual-core, dual-package SMP platform, the main choices before the process scheduler are to schedule the two running threads on the cores in the different (Option 1) or same (Option 2) packages. Option 1 will result in maximum resource utilization, and as the other core in each package is idle, there is no shared resource contention. Option 2 will result in one busy package (with both the cores busy running the tasks), and the other package being completely idle. While this is not the best solution from the resource utilization and shared resource contention (tasks running in one package may contend for shared resources between cores) perspective, this mechanism will take advantage of the data sharing between tasks, if any.
Table 1 shows the results of different workloads with different scheduling mechanisms on a dual-core, dual-package SMP platform. As shown in the table, all the workloads benefited from distributing the load to two different packages. This indicates that the considered workloads take advantage of the increased available cache and the shared resource (primarily last-level cache) contention is playing a significant role when both the tasks run on the same package. Moreover, the contention is present whether the running tasks belong to the same process (where there is some data sharing, for example, SPECjbb) or different processes (for example, SPECfp_rate of cpu2000). Among the workloads, the IMDS workload in fact performs almost the same, irrespective of the threads running on cores sharing the same or different packages. This is primarily because the workload doesn't exhibit good locality of memory references and as such doesn't get affected much by sharing the last-level cache between two threads.
Last-level Cache Size Influence on Shared Resource Contentions
Hardware designers and researchers are looking into different options (like optimum size, layout of shared resources, design and management of these shared resources) and solutions for maximizing resource utilization and at the same time minimizing the resource contention. Moore's law [6] is dictating the cache size increase on Intel® platforms from generation to generation. The current x86 generation of 65nm processors features up to 4MB of L2 cache in the dual-core version and up to 8MB in the quad-core version, and the leading-edge 45nm generation [1] of x86 processors sports up to 6MB of L2 cache in the dual-core version and up to 12MB in the quad-core version. The degree of cache associativity is increasing with the increase in cache size, leading to hit rate improvement and better utilization.
Figure 3 shows the impact of the last-level cache size on the process scheduling mechanisms for the workloads we considered in Table 1. While the platforms considered for this experiment are quite different from each other (different characteristics and properties), each of the platforms under test is configured to work as dual-core, dual-package platforms. Each of these platforms has a different last-level cache size shared by two cores that reside in the processor package.
The data in Figure 3 show that for the given task load (two tasks in our experiments), as the shared resources among the cores increases, one can expect that the amount of shared resource contention will decrease accordingly. For example, SPECfp_rate of CPU2000 was performing the same whether the two tasks were running in the same package or in different packages with 16MB of last-level cache. The impact of last-level cache size is fairly negligible for the two threaded IMDS workloads. As noticed before, this is primarily because of the poor locality of memory references.

Figure 3: Impact of last-level cache size on the performance differences between scenarios doing scheduling on cores using same vs. different last-level cache. The smaller number indicates less resource contention in the scenarios when tasks run on cores sharing last-level cache.
click image for larger view
Influence of Intel Dynamic Acceleration Technology
Intel® Dynamic Acceleration Technology is currently available in Intel® Core™2 Duo mobile processors. Let us look into the influence of this technology on process scheduling, if this support is available in the future for the server platforms supporting multiple processor packages.
Using the Linux* kernel CPUfreq subsystem, we simulated the concept of Intel Dynamic Acceleration Technology in today's mainstream dual-package platform based on Intel Core 2 Duo processors. With the help of CPUfreq subsystem, the processor frequency can be changed to a specific value that the processor supports. Using this infrastructure, 3GHz-capable processors were run at 2.66GHz (a bin down) in the mode when the process scheduler schedules the two running tasks on two cores belonging to the same package. In the mode when the scheduler selects two different packages for running the two tasks, processors were run at 3GHz (as Intel Dynamic Acceleration Technology will enhance the speed of the active core, while one or more cores in the same package are idle). Figure 4 shows the performance numbers, which include the effects of running on different caches and at improved processor speeds as a result of dynamic acceleration.

Figure 4: Performance difference between running two tasks on a) cores running at 2.66GHz, sharing last-level cache vs. b) cores running at 2.66GHz, different last-level cache vs. c) cores running at 3GHz, different last-level cache
click image for larger view
Figure 4 shows that the dynamic acceleration favors the scheduling policy of distributing the load among the available processor packages for optimal performance. In the presence of dynamic acceleration, IMDS workloads also benefited when the two IMDS threads were run on different packages.
Scheduling for Optimal Power Savings
Consider the same dual-package experimental system that we looked at before. If we have two runnable tasks, as observed in the previous sub sections, resource contention will be minimized when these two tasks are scheduled on different packages. But, because of the P-state coordination in the current generation of multi-core platforms, we are restricting other idle cores in both packages to run at higher power P-state (voltage/frequency pair). Similarly, the shared block in both packages will reside in higher power C0 state (because of one busy core). This will result in a non-optimal solution from a power-savings perspective. Instead, if the scheduler picks the same package for both tasks, other packages with all cores being idle, will transition into the lowest-power P and C-state, resulting in more power savings. For optimal power savings, the number of physical packages carrying the load needs to be minimized. But as the cores share resources (like last-level cache) as seen in previous sections, scheduling both tasks to the same package may or may not lead to optimal behavior from a performance perspective.
Task Group Scheduling
In the workload scenario where all the shared resources and packages are busy, the main challenge for the process scheduler is to schedule the tasks in such a way that will minimize the shared resource contention and take maximum advantage of the shared resources between cores.
If all the running tasks are resource intensive, the challenge before the process scheduler is to identify the tasks that share data and schedule them on the cores sharing the last-level cache. This will help minimize the shared resource contentions and shared data duplication. This will also result in efficient data communication between the tasks that share data. The system software has some inherent knowledge about data sharing between tasks. For example, threads belonging to a process share the same address space and as such share everything (text, data, heap, etc.). Similarly, processes attached to the same shared memory segment will share the data in that segment. The process scheduler can do optimizations such as grouping threads belonging to a process or grouping processes attached to the same shared memory segment and co-schedule them in the cores sharing the package resources. To highlight the group-scheduling potential, we ran two instances of the SPECjbb workload, with each instance having two warehouses (each warehouse represented by a thread) on the dual-core, dual-package platform, considered before. Table 2 shows that grouping the threads belonging to a process onto the same package takes advantage of shared resources between cores and helps minimize the shared resource contentions.

Table 2: Performance improvement seen when threads belonging to a process scheduled to two cores residing on same package when compared to scheduling them on different packages. Workload considered is with two instances of SPECjbb in a two warehouse (two processes with two threads each) configuration.
click image for larger view
Scheduling Challenges
For exploiting optimal performance, the process scheduler needs to schedule tasks in such a way that all the platform resources are used effectively. And this effective mechanism varies with workload, processor, platform topology, and system load.
Some workloads will exploit optimal benefit when the tasks are scheduled on the cores that share resources. For example, the IMDS workload performed the same, whether the tasks were run on cores sharing resources or not. For such workloads, in the presence of idle packages, by scheduling the tasks on the cores residing in a package, optimal performance and power-savings will be achieved. Similarly, workloads that share data and take maximum advantage of the shared resources between cores will achieve optimal performance when run on cores that are closer. For example, if the data shared between the tasks are modified and exchanged often or if one executing task prefetches data for the other task, optimal performance will be achieved when the tasks are scheduled closer to each other.
For some workloads, even in the presence of data sharing, distributing the load among the available idle packages will lead to optimal performance. This distribution will lead to shared data duplication in the caches of the packages carrying the load. If the shared data are mostly read-only, this data duplication still may be better than leaving the shared resources idle. In this scenario, tasks can take advantage of the increased shared resources (caches in our considered setup) that are available and can cache more shared data and/or task private data.
The presence of technologies, like dynamic acceleration, influence the process scheduling mechanisms for some workloads in the presence of idle cores and packages. As seen in Figure 4, when the load is uniformly distributed among the available packages, the resulting core speed increases, resulting from dynamic acceleration, helped achieve optimal performance. For some workloads, even in the presence of dynamic acceleration, running on the cores sharing caches may give optimal performance.
In future, as more cores are integrated into the processor package, the available shared resources will also increase accordingly. As such, the amount of shared resource contention will be minimal when few of the available cores in the package are busy (similar to what we see in Figure 3). The challenge for the process scheduler is to track the shared resource usages and the associated contentions. In such a scenario, the scheduler can minimize the processor packages carrying load, and when there is a contention for the shared resources, the scheduler can distribute the load to minimize resource contentions. To address the challenge, the process scheduler needs to track the micro-architectural information like the task's cycles per instruction (CPI) and how the task's CPI is affected by the co-running tasks in the other cores sharing resources. An individual task's CPI will also help the process scheduler in making decisions such as which tasks benefit most from the increased core speed that the dynamic acceleration technology brings in.
In a scenario where all the shared resources and packages are busy, the process scheduler needs to minimize the resource contention for exploiting optimal performance. For example, grouping CPU-intensive and memory-intensive tasks onto the cores sharing the same last-level cache will result in minimized cache contention. Task characteristics and behavior can be predicted using the micro-architectural history of a task by using performance counters. In the absence of such micro-architectural information, the system software can also use some heuristics to estimate the resource requirements. For example, one can use the number of physical pages that are accessed (using the Accessed bit in the page tables that manage virtual to physical address translation in x86 architecture) for certain intervals or use the tasks memory Resident Set Size (RSS). The process scheduler can use this information and group schedule tasks on the cores residing in a physical package with the goal of minimizing shared resource contention.
