Tera-scale Computing
Datacenter-on-Chip Architectures: Tera-scale Opportunities and Challenges
ADAPTABILITY CHALLENGES AND SOLUTIONS
Flexible and dynamic management of platform resources is important in DoC tera-scale architectures since multiple VMs will be running simultaneously. Traditionally, the execution environment (a virtual machine monitor (VMM) or hypervisor in DoC) attempts to control the visible resources (number of cores and memory capacity for instance). However, this alone will not suffice for CMP platforms where more cores might be available to run the virtualized applications simultaneously, but they end up contending for other (invisible) shared resources that have first-order performance impact [2, 4, 10, 16, 18, 34]. Key among these invisible resources are cache space and memory bandwidth. In addition to cache and memory, other resources that are shared include interconnects, micro-architectural resources in the core (shared between hardware threads), and power.
While sharing resources is generally the most efficient approach to maximize resource utilization, having no control over management of these resources can lead to loss of determinism, lack of performance isolation, and an overall lack of the notion of QoS provided to an individual application running on the platform. This has a very direct impact on the datacenter consolidation environments where more and more heterogeneous workloads are consolidated into a single platform contending for the shared hardware resources. Another important aspect to consider when managing shared resources is the relative importance of each of the consolidated applications. Not all applications consolidated may be of equal importance. The difference in priority could be based purely on the service level agreement provided to the customer or could be based on the relative throughput requirements of each of the consolidated applications. It could also be decided by the VMM layer based on the workload behavior (cache friendly, IOVM, etc.).

Figure 8: OLTP cache performance under consolidation
We start by studying the extent to which contention for shared cache space affects an individual OLTP application when running with one or more other consolidated applications. We performed a trace-driven simulation [9] of a 32-core processor with 8MB of last-level cache for five scenarios: (a) an 8-threaded OLTP application (based on TPC-C) running alone, (b) OLTP consolidated with an 8-threaded J2EE application server workload (based on SPECjAppServer2004), (c) OLTP consolidated with an 8-threaded ERP application (based on SAP SD/2T), (d) OLTP consolidated with an 8-threaded Java workload (based on SPECjbb2005), and (e) OLTP consolidated with all three of the above applications. Figure 8 shows the impact of consolidation on OLTP cache occupancy as well as OLTP miss rate. As the occupancy is reduced from 100% when running alone to as low as 20% when running with all other workloads, the miss rate goes up significantly (by as much as 3X). It should be noted that even though the compute resources available to the OLTP application remain the same whether it runs alone or in consolidated mode, its performance will be significantly affected by the increase in miss rate.
QoS and Virtual Platform Architectures
Managing the allocation of shared resources in the platform is key to addressing the contention effects shown above and to providing performance differentiation, performance isolation, and the overall notion of QoS. Today, Intel and other processor manufacturers support hardware virtualization features. While these features support functionally isolated VMs, they do not provide performance isolation. Our goal is to define mechanisms that allow VMs to transform into Virtual Platform Architectures (VPAs). A VPA is defined as the collection of virtual resources (i.e., some fraction of each of the physical shared resources) that a VM is provided. In this section we introduce our Platform QoS research that enables QoS-aware platforms and VPAs.

Figure 9: Platform QoS approach
Platform QoS
For DoC tera-scale architectures, there are three key questions that the Platform QoS research attempts to answer: (a) how much of each shared resource is an application or VM using, (b) how can the resource allocation be modified to improve individual QoS or overall performance, and (c) what are the most appropriate interfaces and mechanisms needed between hardware and software to achieve QoS and VPAs?
The Platform QoS approach attempts to address these questions by enabling three major components: (a) monitoring, (b) enforcement, and (c) exposure. Figure 9 shows the components and their relationship in terms of information flow. The monitoring and enforcement components are implemented in hardware, whereas the QoS policies and exposure can either be guided by software or by hardware. The monitoring component keeps track of shared resource usage on a per-application or per-VM basis. The resource monitoring ability allows the platform to pass back information to the execution environment (VMM or hypervisor in DoC) to determine the VPA that each VM ends up with in a platform. In addition, providing this information back to the software domain allows the VMM to optimize scheduling decisions or pass down hints for resource enforcement. The monitoring ability may also be useful to the system administrator to determine (a) whether a VM should be migrated to a different platform (if it is getting too few resources consistently), (b) what QoS hints should be passed down to the platform to modify resource allocation, or (c) what the end-customer should be charged based on resource usage. Alternatively, the administrator may be able to set up a QoS policy that performs one or more of the above dynamically, based on monitoring data.
The resource enforcement component implements shared resource partitioning based on software guidance. This requires an architectural interface to be exposed to the execution environment that allows the specification of resource allocation requirements on a per-VM basis. While we expect QoS policies to be determined primarily by software, it is also important to allow a path for future platform optimizations that dynamically manage resources entirely in hardware. The resource enforcement component enables the VMM to create VPAs with a user-specified amount of resources. To achieve a scalable low-cost QoS solution, we propose resource partitioning and QoS exposure on a class of service basis instead of a per-VM basis. This is sufficient since it is unlikely that all of the VMs running on the platform need performance isolation simultaneously. Instead, one or more VMs can be mapped to a single class of service as specified by the VMM, and a smaller number of classes of service can be supported by the platform. In this paper, we use the terms "priority class," "priority level" and "class of service" interchangeably.
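The class-of-service grouping described above can be illustrated with a minimal Python sketch of a VMM-side table. The class `ClassOfServiceTable`, its method names, the VM names, and the percentage values are all hypothetical and chosen for illustration only; the source does not define a software interface.

```python
class ClassOfServiceTable:
    """Hypothetical VMM-side table mapping many VMs onto a few classes."""

    def __init__(self, cache_share_pct):
        # cache_share_pct: class name -> maximum share of the shared cache (%).
        # The platform only needs one entry per class, not one per VM.
        self.cache_share_pct = dict(cache_share_pct)
        self.vm_class = {}

    def assign(self, vm_name, cos):
        """Map a VM to an existing class of service."""
        if cos not in self.cache_share_pct:
            raise KeyError(f"unknown class of service: {cos}")
        self.vm_class[vm_name] = cos

    def members(self, cos):
        """Return the VMs currently mapped to a class, sorted by name."""
        return sorted(vm for vm, c in self.vm_class.items() if c == cos)


# Two classes cover four VMs (illustrative values).
table = ClassOfServiceTable({"high": 100.0, "low": 12.5})
table.assign("oltp", "high")
for vm in ("j2ee", "erp", "java"):
    table.assign(vm, "low")
```

Because several VMs share one entry, the hardware state grows with the number of classes rather than with the number of VMs.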
To help clearly describe the Platform QoS approach and mechanisms required, we now present a case study using the shared cache as the platform resource.
QoS Case Study Using Shared Caches
Since contention for the shared cache (e.g., the last-level cache) is a key issue, we now describe the implementation considerations for shared cache monitoring, enforcement, and exposure (highlighted in Figure 10).
In the case of cache monitoring, the goal is to keep track of cache space consumed on a per-application or per-VM basis. In order to do so, the VMM needs to pass down a unique identity (ID) to the platform for each running VM. This can be easily done by writing the ID to a new register, a Platform QoS Register (PQR), that is part of the processor architectural state. Since the number of IDs is finite, IDs might have to be recycled amongst VMs (when the number of VMs is larger than the number of IDs). Once the ID is passed down, each load/store generated by the CPU is tagged with the ID so that it is passed down to the last-level cache. In the last-level cache, each cache line is tagged with the ID, and a global cache occupancy counter is also maintained per ID. When a line is evicted from the cache, the appropriate cache occupancy counter is decremented. When a new line is allocated into the cache, the appropriate cache occupancy counter is incremented. The implementation can be optimized for area by employing set sampling techniques [37] if so desired.
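The counter bookkeeping can be illustrated with a small software model. The following Python sketch assumes a toy set-associative cache with LRU replacement; the class `MonitoredCache`, its parameters, and the LRU policy are assumptions made for illustration and do not describe the hardware design.

```python
from collections import OrderedDict, defaultdict


class MonitoredCache:
    """Toy set-associative LRU cache with a per-ID occupancy counter."""

    def __init__(self, num_sets=4, ways=2, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # Each set maps line number -> ID tag, ordered from LRU to MRU.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        # One global occupancy counter per ID, counted in cache lines.
        self.occupancy = defaultdict(int)

    def access(self, vm_id, address):
        """Model one load/store tagged with vm_id; return True on a hit."""
        line = address // self.line_size
        cache_set = self.sets[line % self.num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)  # refresh LRU position, keep the tag
            return True
        if len(cache_set) >= self.ways:
            # Evict the LRU line and decrement the counter of its owner.
            _, owner = cache_set.popitem(last=False)
            self.occupancy[owner] -= 1
        # Allocate the new line, tag it, and increment the requester's counter.
        cache_set[line] = vm_id
        self.occupancy[vm_id] += 1
        return False
```

Reading `occupancy[vm_id]` at any point gives the number of lines currently tagged with that ID, which is the quantity the monitoring component reports back to the VMM.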

Figure 10: Cache QoS architecture and techniques
For cache enforcement, we are investigating the use of several forms of capacity partitioning. One potential approach attempts to limit the number of cache lines in the entire cache used by a certain class of service. Since the class of service is also associated with each cache line, this enforcement is accomplished by modifying the cache replacement policy to pick the next victim from a class that is currently exceeding its assigned cache quota.
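A minimal Python sketch of such a quota-aware victim selection follows. The function `pick_victim`, its arguments, and the fallback to plain LRU when no class exceeds its quota are assumptions for illustration; the source does not specify how ties or the no-offender case are handled.

```python
def pick_victim(lines, occupancy, quota):
    """Choose the way to evict from one full cache set.

    lines:     list of (class_of_service, last_use_time), one entry per way
    occupancy: class -> lines currently held in the entire cache
    quota:     class -> maximum lines that class may hold in the entire cache
    """
    # Ways whose owning class currently exceeds its assigned quota.
    over_quota = [
        way for way, (cos, _) in enumerate(lines)
        if occupancy[cos] > quota[cos]
    ]
    # Assumption: if no class is over quota, fall back to plain LRU.
    candidates = over_quota or range(len(lines))
    # Among the candidates, evict the least recently used line.
    return min(candidates, key=lambda way: lines[way][1])


# Example set with four ways (illustrative values).
ways = [("high", 5), ("low", 9), ("low", 7), ("high", 1)]
victim = pick_victim(ways, {"high": 10, "low": 6}, {"high": 16, "low": 4})
```

In this example the low class holds 6 lines against a quota of 4, so the victim is the older of its two lines in the set, even though a high-class line was used less recently.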
For cache QoS exposure, we introduce the PQR. The PQR allows software to specify (a) the VM ID, (b) the class of service (also referred to as priority level or priority class) that this VM should be mapped to, and (c) an optional resource allocation target for that class of service. As described above, the VM ID is used by the platform to monitor cache occupancy per application. The class of service is used to guide the QoS-aware replacement decision.
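To make the three PQR fields concrete, the following Python sketch packs them into a single integer. The field widths, the field ordering, and the choice to express the allocation target as a percentage are hypothetical; the source does not define a bit layout for the PQR.

```python
# Hypothetical field widths; the source does not define a bit layout.
VM_ID_BITS = 8     # VM ID
COS_BITS = 2       # class of service (priority level)
TARGET_BITS = 7    # optional allocation target, assumed to be a percentage

COS_SHIFT = VM_ID_BITS
TARGET_SHIFT = VM_ID_BITS + COS_BITS


def _check(name, value, bits):
    """Reject values that do not fit in the given field width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")


def encode_pqr(vm_id, cos, target_pct=0):
    """Pack the three fields into one register value."""
    _check("vm_id", vm_id, VM_ID_BITS)
    _check("cos", cos, COS_BITS)
    _check("target_pct", target_pct, TARGET_BITS)
    return vm_id | (cos << COS_SHIFT) | (target_pct << TARGET_SHIFT)


def decode_pqr(value):
    """Unpack a register value into (vm_id, cos, target_pct)."""
    vm_id = value & ((1 << VM_ID_BITS) - 1)
    cos = (value >> COS_SHIFT) & ((1 << COS_BITS) - 1)
    target_pct = (value >> TARGET_SHIFT) & ((1 << TARGET_BITS) - 1)
    return vm_id, cos, target_pct
```

Under these assumptions, a VMM would compute `encode_pqr(vm_id, cos, target_pct)` and write the result to the PQR when it schedules a VM onto a core.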
To study the potential benefits of cache QoS enforcement, we extended our trace-based cache simulations to implement cache enforcement. We conducted performance simulations of various consolidation scenarios where we limited the amount of cache space available to the low-priority VMs, but allowed high-priority VMs to allocate anywhere in the cache. In our example, we chose the OLTP application as the high-priority VM (with access to 100% of shared cache) and the three other consolidated applications as the low-priority VMs (limited to X% of the cache space). Figure 11 shows the OLTP miss rate as a function of X% (on the x-axis). As expected, reducing X from 100% to 12.5% improves the cache performance of the high-priority OLTP application significantly. It may be noted that this will negatively impact the performance of the low-priority VMs, but that is expected as an outcome of performance differentiation and QoS.

Figure 11: Case study of cache QoS benefits
In the previous sections, we focused on the cache-sharing impact and cache QoS implementations. However, the implications are similar for other shared platform resources. For example, memory bandwidth is another resource that has a direct impact on application performance. Memory QoS [11] can be achieved by implementing priority queues in the memory controller or by enabling rate control of the request stream. Once all shared resources are enabled with QoS support, we could provide differentiated service to the individual VMs running on top of these resources. This, combined with hardware-supported virtualization, provides a complete VPA where both functional and performance isolation are provided to VMs in a DoC architecture.
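The priority-queue option for memory QoS can be sketched in a few lines of Python. The class `PriorityMemoryQueue` and its convention that a lower number means higher priority are assumptions for illustration; the sketch does not model rate control, and it does not guard against starvation of low-priority requests.

```python
import heapq
import itertools


class PriorityMemoryQueue:
    """Toy memory-controller queue that serves higher-priority requests first."""

    def __init__(self):
        self._heap = []
        # A running sequence number keeps FIFO order within one priority class.
        self._seq = itertools.count()

    def enqueue(self, priority, request):
        """Queue a request; a lower priority number is served earlier."""
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dequeue(self):
        """Return the next request to issue to memory."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = PriorityMemoryQueue()
queue.enqueue(1, "low-a")
queue.enqueue(0, "high-a")
queue.enqueue(1, "low-b")
queue.enqueue(0, "high-b")
order = [queue.dequeue() for _ in range(len(queue))]
```

Requests from the high-priority class drain first, and requests within a class keep their arrival order.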