Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.01

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 5 of 10  

Integration Challenges and Tradeoffs for Tera-scale Architectures

CACHE HIERARCHY AND COHERENCE PROTOCOL

Diversity of workloads and concentration of compute resources in the tera-scale architecture put tremendous demands on the cache hierarchy and coherency protocol. This requires a flexible cache organization that can adapt to workload demands and puts minimal restrictions on the software to fully realize the performance potential. The associated coherency protocol needs to be efficient and scalable. It should also be flexible in terms of the requirements it imposes on the building blocks of the tera-scale architecture. In this section we highlight the challenges and tradeoffs associated with the cache hierarchy and protocol and point out potential directions for tera-scale architecture.

Developing parallel applications to harness and effectively use the massively parallel tera-scale processors is likely to be the key challenge for tera-scale computing. Many parallel programming models and languages have been deployed in different contexts over the last few decades and in fact, parallel programming remains an area of active research. A clear lesson, however, that we can draw from the history of parallel computing to date, is that hardware shared memory has proven to be a particularly successful programming model for general-purpose systems. Accordingly, tera-scale architecture should include first-class hardware support for shared memory. Industry and academic experience with coherence protocols for large-scale, hardware-shared memory machines has demonstrated that shared memory machines scaling to hundreds of processors can be successfully built. In fact, implementing a message-passing library such as the Message Passing Interface (MPI) over hardware-shared memory often results in higher bandwidth and lower latency than equivalent implementations using specialized low-latency cluster networks [19]. In addition, hardware support for shared memory will allow tera-scale processors to support common operating systems assuming that such operating systems overcome any existing scalability bottlenecks to harness the capabilities of tera-scale architecture.

A cache hierarchy should efficiently support a wide range of programming models and workloads. These are some important classes:

  • Multiprogrammed workloads where there is no communication and data sharing among the processes running in different cores.
  • Workloads with a mix of scalar and parallel sections. The performance of these workloads on tera-scale architecture is limited by the performance of the scalar section as indicated by Amdahl's law.
  • Highly parallel workloads, where most of the computations can be parallelized. These workloads may exhibit one or more of the types of parallelism described below:
    • Thread parallelism: Threads may be similar to or very different from one another and may or may not share data with other threads. Threads are created based on the granularities exposed by the application and then scheduled on available hardware contexts through task queues or other constructs. Examples of this programming model can be found in transaction processing and Web applications.
    • Data parallelism: A similar task is performed on different data sets, where some data may be shared between tasks. Applications are more structured, and algorithms are typically modified to fit the underlying cache organization. The number of threads used in this model is typically the same as or less than the number of hardware contexts available. Examples of this programming model can be found in media, numerical analysis, and data-mining workloads.
    • Streams: Programs are structured as kernels where input data are processed and output data are fed into other kernels. In this model threads (or kernels) are statically scheduled to hardware contexts. Within each kernel, thread- or data-level parallelism constructs can be applied to break tasks into ever finer sizes. Examples of this programming model can be found in media and graphics applications.
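
The Amdahl's law limit mentioned for workloads with mixed scalar and parallel sections can be made concrete with a short calculation. The sketch below is illustrative only: the function name `amdahl_speedup` and the 95% parallel fraction are assumptions chosen for the example, not figures from this article.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at single-core speed; the parallel part is
    # divided evenly across all cores (an idealized assumption).
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the execution time is parallelizable.
    for cores in (8, 64, 512):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Under this idealized model the speedup for a 95% parallel workload can never reach 1 / 0.05 = 20, no matter how many cores are added, which is why scalar-section performance still matters on a many-core design.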

Cache Organization

A combination of different workloads and different types of parallelisms within these workloads presents unique architectural and design challenges for the cache hierarchy of tera-scale architecture. Architectural challenges center on the organization and policies associated with the cache hierarchy to meet performance, scalability, and energy-efficiency goals. Cache organization deals with the number of levels in the cache hierarchy, and with the size, associativity, latency, and bandwidth parameters at each level. Cache policies determine accessibility, allocation, and eviction policies to effectively utilize on-chip cache resources.

The objective of a cache hierarchy is to minimize the latency to frequently accessed data. In a traditional uniprocessor cache hierarchy, we move cache blocks closer and closer to the core through the levels in the cache hierarchy, based on access frequency. The same principle applies to multi-core cache hierarchies, but we have to take into account whether cores have to share a given level in the cache hierarchy or whether a level is implemented as a single physical block or as multiple physically distributed banks with non-uniform access latency to each bank.

In multi-core processors released over the last few years, the first one or two levels in the cache hierarchy are private to each core. However, different designs have pursued a range of options in sharing the last-level cache. In some designs such as those described in [20], the last-level cache is private to a core. In others, such as those described in [18, 23], the last-level cache is shared among multiple cores.

In chip multiprocessors (CMPs) with only a few cores, the last-level cache is typically implemented as a single physical block with uniform access latency to the entire cache for all the cores sharing it. As the number of cores and cache banks increases, physically distributed caches become attractive from a physical design perspective [15]. Moreover, by collocating a portion of the cache with a subset of the cores, there is an opportunity to reduce access latency to a portion of the cache, instead of offering equally high latency to all the cache. Figure 6 summarizes different multi-core cache organizations according to their suitability for the types of workloads, assuming a distributed multibank last-level cache.



Figure 6: Cache organization options for multi-core architectures

In a tera-scale processor with a last-level cache physically distributed across multiple tiles, private and shared caches introduce distinct tradeoffs. A shared cache design increases effective cache capacity because only a single copy of a block shared by multiple cores resides in the cache. The downside is that any given block, whether private or shared, may be placed in a tile arbitrarily and be far away from the core(s) using it. In contrast, a private cache design will have all blocks used by a specific core on its local tile. However, since read-shared blocks will be replicated in multiple tiles, the effective cache capacity is reduced, and off-die traffic may increase.
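
The capacity side of this tradeoff can be approximated with a deliberately simple model. The sketch below assumes that a fixed set of read-shared blocks is replicated in every tile of a private design and that everything else is unique; the function names and the example sizes are hypothetical and ignore latency entirely.

```python
def unique_blocks_shared(tiles: int, blocks_per_tile: int) -> int:
    """Shared design: one copy of each block, so every slot holds a distinct block."""
    return tiles * blocks_per_tile


def unique_blocks_private(tiles: int, blocks_per_tile: int,
                          replicated_blocks: int) -> int:
    """Private design: `replicated_blocks` read-shared blocks occupy a slot
    in every tile, leaving the rest of each tile for distinct blocks."""
    if not 0 <= replicated_blocks <= blocks_per_tile:
        raise ValueError("replicated_blocks must fit within one tile")
    private_slots = tiles * (blocks_per_tile - replicated_blocks)
    # The replicated set counts once toward the number of distinct blocks.
    return private_slots + replicated_blocks


if __name__ == "__main__":
    # Hypothetical configuration: 16 tiles, 4096 blocks per tile,
    # 1024 read-shared blocks replicated everywhere.
    print(unique_blocks_shared(16, 4096))
    print(unique_blocks_private(16, 4096, 1024))
```

In this example the private design holds 50,176 distinct blocks against 65,536 for the shared design, which is the reduction in effective capacity described above; what the model does not capture is that every one of those private blocks sits on the local tile.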

Recent work suggests that other hybrid alternatives are possible: these combine the advantages of private and shared caches while avoiding their shortcomings. The key observation is that in a physically distributed cache design where some cache banks are closer to a specific core than others, one can optimize cache performance by optimizing the placement of blocks in the cache banks so that they are closer to the point of use. A number of approaches in the literature have been proposed to achieve this [4, 9, 30, 31]. Such approaches are beneficial in any multi-core processor with differential access latency to a given portion of a shared cache, but are particularly effective in a tera-scale processor where there is large variation in the latency to access the cache in different tiles.

Fundamentally, all approaches have the following key policies to set: initial placement, read-shared block replication, block migration, and eviction. The initial placement policy defines where a block is placed in the cache hierarchy when it is fetched from memory. The replication policy determines whether multiple copies of a read-shared block can coexist in different cache banks. The block migration policy determines whether a block will move between tiles in response to processor accesses. Finally, the eviction policy determines what happens to a block evicted from a cache bank. Private and shared cache designs represent the end points in the design space with regard to these specific policies. For example, in a private design, a block is initially placed in the cache of the requesting core, while in a shared design, a block is placed in a cache bank determined by the physical address of the block (home tile). Hybrid approaches combine policies from private and shared designs, introduce new policies to perform better than either private or shared designs, or even dynamically switch between competing policies based on application demands. For example, Adaptive Selective Replication (ASR) [4] determines the replication level within the context of a private cache design based on program behavior.
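
The two end points of the initial placement policy can be sketched in a few lines. The mapping below is only one plausible choice: the 64-byte block size, the modulo interleaving of block numbers across tiles, and the names `home_tile` and `initial_placement` are assumptions for illustration, not details given in the article.

```python
BLOCK_BYTES = 64  # assumed cache block size


def home_tile(physical_address: int, num_tiles: int) -> int:
    """Shared design: the home tile is a fixed function of the block address.
    Here consecutive blocks are interleaved across tiles (an assumption)."""
    block_number = physical_address // BLOCK_BYTES
    return block_number % num_tiles


def initial_placement(physical_address: int, requesting_tile: int,
                      num_tiles: int, policy: str) -> int:
    """Return the tile whose cache bank receives a block fetched from memory."""
    if policy == "private":
        # Private design: the block lands next to the core that asked for it.
        return requesting_tile
    if policy == "shared":
        # Shared design: the block lands in its address-determined home tile.
        return home_tile(physical_address, num_tiles)
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    addr = 0x1040
    print(initial_placement(addr, requesting_tile=9, num_tiles=16, policy="private"))
    print(initial_placement(addr, requesting_tile=9, num_tiles=16, policy="shared"))
```

A hybrid scheme would replace the fixed `policy` argument with a decision made per block or per application, which is where the replication and migration policies come in.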

The enormous computing power available in tera-scale design implies that many applications (or applications consisting of many concurrent functions with distinct caching behavior) will be running concurrently (e.g., game physics with game AI and graphics rendering). Accordingly, when a level in the cache hierarchy is shared among multiple cores in the presence of diverse per-core access patterns and working sets, destructive interference can occur. One of the causes of destructive interference is the suboptimal behavior of the least recently used (LRU) replacement policy, typically implemented in processor caches, when the application working set exceeds the cache capacity. Sharing a cache level among multiple threads can further exacerbate the problem. This is a well-known issue for any shared cache, including disk page caches and file system caches. Recent work in this area, however, shows some promise of success [22].
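
The LRU behavior described above is easy to reproduce with a small simulation. The sketch below models a single fully associative cache and a purely cyclic access pattern; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity: int) -> float:
    """Hit rate of a fully associative LRU cache holding `capacity` blocks."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return hits / total if total else 0.0


def cyclic_accesses(working_set: int, passes: int):
    """Touch blocks 0..working_set-1 in order, repeated `passes` times."""
    return [block for _ in range(passes) for block in range(working_set)]


if __name__ == "__main__":
    print(lru_hit_rate(cyclic_accesses(4, 10), capacity=4))  # working set fits
    print(lru_hit_rate(cyclic_accesses(5, 10), capacity=4))  # one block too many
```

With a working set of four blocks the cache misses only on the first pass. Adding a fifth block makes LRU evict each block just before it is reused, so the hit rate drops to zero even though four of the five blocks would fit.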

In its generalized version, the tera-scale architecture is a collection of modular and heterogeneous building blocks with well-defined interfaces. Such a heterogeneous collection of elements puts its own unique requirements on the cache hierarchy, and meeting these with a single set of caching policies and a single cache hierarchy is quite challenging. For example, if an incarnation of tera-scale architecture is a collection of several general-purpose processors, some graphics coprocessors, a few network accelerators, a security coprocessor and so on, each of these processors exhibits very different data footprints and locality characteristics. Satisfying their needs through a unified cache hierarchy is challenging and requires further exploration.

Cache Coherency

The cache coherency protocol for tera-scale architecture must be scalable to a large number of caching agents and must enable efficient utilization of on-chip resources. The choice of a coherency protocol is closely linked to the cache organization and the interconnect. For example, a protocol designed for a cache organization without any shared caches may be designed to keep precise information about the lines present in private caches, such that off-chip reads and writes are minimal. A snoop broadcast protocol is suitable when there is a broadcast interconnect, but it does not scale to a large number of caching agents because every miss must be observed by every cache.
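
A rough message count shows why broadcast becomes a problem as the number of caching agents grows. The model below is a back-of-the-envelope sketch: it assumes every cache issues the same number of misses, that a snoop protocol probes every other cache on each miss, and that a tracking structure allows a miss to contact only the caches known to hold the block. The function names and the example numbers are hypothetical.

```python
def snoop_probes(num_caches: int, misses_per_cache: int) -> int:
    """Total probes when every miss is broadcast to all other caches."""
    return num_caches * misses_per_cache * (num_caches - 1)


def tracked_messages(num_caches: int, misses_per_cache: int,
                     avg_sharers: float) -> float:
    """Total messages when each miss performs one lookup and then contacts
    only the `avg_sharers` caches recorded as holding the block."""
    return num_caches * misses_per_cache * (1 + avg_sharers)


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, snoop_probes(n, 1000), tracked_messages(n, 1000, 2))
```

Broadcast traffic grows with the square of the number of caches in this model, while the tracked alternative grows linearly as long as the average number of sharers stays small.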

On-chip interconnects are capable of providing an order of magnitude smaller latency and an order of magnitude higher bandwidth than off-chip socket-to-socket interconnects. Therefore, latency and bandwidth optimizations may not seem to be the primary goals for an on-chip coherency protocol. However, since tera-scale processors are expected to have a concentrated density of computing throughput, they do impose tremendously high bandwidth demands on the interconnect. Since the power delivery, cooling, and off-chip bandwidth available to each chip are not scaling with process technology, the protocol must enable improved utilization of on-chip cache structures, and the interconnect overhead due to the protocol must be kept to a minimum to gain the maximum performance under these limits.

Directory-based protocols have been widely used in large-scale, multichip multiprocessors [7, 8, 17, 24], where a directory is used to keep track of copies of blocks in different caches. The same concept can be applied to on-chip cache coherence protocols in tera-scale architecture with some modification. As illustrated in Figure 7, a directory consists of entries corresponding to lines in caches where each entry has a state field and a field to store the identities (indicated as pointers in the directory structure in Figure 7) of the caches with a copy of the block. The state field indicates if a block may be present in one of the caches and the possible states the cached copies could be in. For a directory that is inclusive of all the on-chip caches, a directory miss or a state of I (invalid) in the directory indicates that none of the caches have a copy of the block; a state of S (shared) indicates that some caches may have copies of the block in Shared state; and a state of X (exclusive) indicates that one of the caches may have a copy in either Modified, Exclusive, or Shared state. When a directory entry is in S or X state, the identity field identifies the cache(s) with a copy of the memory block. The identity information can be stored in various ways, either as a set of bits with 1 bit for each cache (called full bit map), 1 bit for a group of caches (called coarse bit map) or a limited set of explicit cache identities with a mechanism to handle overflows. The cost of a full bit map directory may be acceptable for the first few generations of tera-scale architecture; however, a more compact representation may be desirable for further scalability.
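
The three ways of storing sharer identities differ in both size and precision, which the following sketch tries to make tangible. It is an illustration under stated assumptions: the function names are hypothetical, the limited-pointer format is assumed to need one extra overflow bit, and the state field and tag are left out of the bit counts.

```python
def full_map_bits(num_caches: int) -> int:
    """Full bit map: one presence bit per cache."""
    return num_caches


def coarse_map_bits(num_caches: int, group_size: int) -> int:
    """Coarse bit map: one presence bit per group of `group_size` caches."""
    return -(-num_caches // group_size)  # ceiling division


def limited_pointer_bits(num_caches: int, num_pointers: int) -> int:
    """Limited pointers: explicit cache IDs plus one overflow flag (assumed)."""
    pointer_width = max(1, (num_caches - 1).bit_length())
    return num_pointers * pointer_width + 1


def full_map_sharers(bitmap: int, num_caches: int) -> list:
    """Caches that hold a copy, decoded exactly from a full bit map."""
    return [c for c in range(num_caches) if (bitmap >> c) & 1]


def coarse_map_sharers(bitmap: int, num_caches: int, group_size: int) -> list:
    """Caches that MAY hold a copy: every member of each marked group."""
    return [c for c in range(num_caches) if (bitmap >> (c // group_size)) & 1]


if __name__ == "__main__":
    print(full_map_bits(64), coarse_map_bits(64, 4), limited_pointer_bits(64, 4))
    print(full_map_sharers(0b101, 8))        # exact sharers
    print(coarse_map_sharers(0b1, 8, 4))     # superset: whole group 0
```

For 64 caches the sketch gives 64 bits per entry for the full map, 16 bits for a coarse map with groups of four, and 25 bits for four pointers. The coarse map pays for its size with imprecision: an invalidation must go to every cache in a marked group, whether or not it actually holds the block.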



Figure 7: Directory structure to track cache lines

Since the purpose of the directory is to keep track of copies of a cache line in different private caches, the size, associativity, and replacement policy of the directory need to provide adequate coverage for the total capacity of the private caches. Some designs combine the directory information in the same structure as the cache at the higher level in the hierarchy (if there is one), which may reduce complexity and area at the expense of some performance disadvantage due to conflicting policy requirements on the directory and cache. The cost and scalability of directory structures may start becoming a problem when the number of entities being tracked becomes very large. At that point, mechanisms to reduce directory size [8] or distributed directory [12] implementations may have to be considered.
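
The coverage requirement can be checked with simple arithmetic. The sketch below assumes 64-byte lines and uses a hypothetical configuration of 64 cores with 256 KiB of private cache each; the function names are illustrative, and the calculation covers capacity only.

```python
def private_cache_lines(num_cores: int, private_cache_bytes: int,
                        line_bytes: int = 64) -> int:
    """Total number of lines that all private caches can hold at once."""
    return num_cores * (private_cache_bytes // line_bytes)


def directory_coverage(directory_entries: int, num_cores: int,
                       private_cache_bytes: int, line_bytes: int = 64) -> float:
    """Ratio of directory entries to trackable private cache lines.
    A value below 1.0 means some cached lines cannot be tracked at once."""
    return directory_entries / private_cache_lines(
        num_cores, private_cache_bytes, line_bytes)


if __name__ == "__main__":
    lines = private_cache_lines(64, 256 * 1024)
    print(lines)
    print(directory_coverage(lines, 64, 256 * 1024))
    print(directory_coverage(lines // 2, 64, 256 * 1024))
```

A coverage ratio of 1.0 is only a capacity bound. Because the directory has finite associativity, conflicts among entries can still force lines out of the private caches, which is why associativity and replacement policy matter alongside size.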

The enormous amount of computing resources in a tera-scale platform enables a richer set of interactions between the computer and the end user than was previously possible. These include speech, motion and gesture recognition, and enhanced visual effects, often within virtual worlds where multiple users directly interact with each other. Interactions with the physical world introduce real-time considerations, and the tera-scale architecture must properly address them. Caches, however, interact in unpredictable ways with real-time applications. For this reason, processors targeted to interactive applications often include hardware mechanisms to allow applications to control the caching behavior to the point where one can reason about their expected performance [1]. Accordingly, the tera-scale cache hierarchy should include support in the form of locking primitives or similar mechanisms to allow applications to keep critical data in the caches. The exact form of such support is an area of active research.

Tera-scale architecture may also require much tighter integration of off-chip memory and I/O interfaces to take full advantage of its compute capabilities. Therefore, the on-chip protocol must enable optimizations for efficiently accessing local memory and for interacting with other auxiliary engines such as special-purpose co-processors and I/O controllers.
