Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.06

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 4 of 8  

Datacenter-on-Chip Architectures: Tera-scale Opportunities and Challenges

SCALABILITY CHALLENGES AND SOLUTIONS

As described in the previous section, the tera-scale architecture offers a high compute density (large number of cores and threads) that is attractive for DoC usage models. However, in order to provide high performance and scalability, it is important to carefully design a balanced platform with sufficient resources (cache, memory, I/O, etc.). In this section, we present the DoC scalability considerations and discuss potential solutions that address the key challenges.

The first challenge is providing sufficient cache space to reduce memory stalls and relieve memory bandwidth bottlenecks. Previous work [36] has shown that die area and cost will significantly restrict the amount of cache space that can be provided in tera-scale processors. In DoC usage models, several multi-threaded server applications run simultaneously, which raises two considerations for cache hierarchy design: (a) since the threads within each server application tend to share code as well as data, cache space efficiency can be improved if these threads are allowed to share cache space; and (b) since the cache space usage of each server application can differ considerably at different times during execution, better utilization can be achieved if the cache space is shared. To take advantage of both of these sharing properties, we propose and evaluate a hierarchy of shared caches for tera-scale DoC platforms.

Figure 4 illustrates a three-level hierarchy of shared caches in a tera-scale platform. The hierarchy starts with an L1 cache (16K to 64K) that is private to the core but shared between the multiple threads within the core. The L2 (256K to 1M, mid-level) cache is shared by multiple cores within a "node." The node forms the basic building block for the architecture. The L3 (8 to 32M, last-level) cache is logically shared by all of the nodes in the socket. However, since the L3 cache is quite large, it is physically distributed around the die in smaller "slices." A scalable interconnect connects all the L3 cache slices and the nodes.

The benefits of sharing at each level are best explained with an example. Figure 5 compares the cache performance of private and shared L2 caches for an OLTP workload (based on TPC-C [28]). As shown in the figure, a shared cache organization (e.g., 512K shared by four cores) is equivalent in cache performance to a private cache organization with twice the total capacity (four cores, each with a 256K private L2 cache, for 1M in total). This shows a potential 2X space efficiency with a shared cache organization. Similar benefits were found for other server workloads as well as for other levels of the hierarchy.
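The space-efficiency argument can be illustrated with a toy model. The sketch below is not the trace-driven simulator used for Figure 5; it assumes a fully associative LRU cache, a synthetic trace in which four cores repeatedly walk the same small working set, and hypothetical names (`LRUCache`, `make_trace`, `misses_private`, `misses_shared`) and sizes chosen only for illustration. It shows why private caches duplicate shared lines while a shared cache holds each line once.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that counts misses (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            # Hit: mark the line as most recently used.
            self.lines.move_to_end(line)
            return
        self.misses += 1
        self.lines[line] = True
        if len(self.lines) > self.capacity:
            # Evict the least recently used line.
            self.lines.popitem(last=False)


def make_trace(cores=4, working_set=8, passes=10):
    """Synthetic trace: every core touches the same lines (assumed sharing)."""
    return [(core, line)
            for _ in range(passes)
            for line in range(working_set)
            for core in range(cores)]


def misses_private(trace, cores, lines_per_core):
    """Each core has its own cache, so shared lines are stored once per core."""
    caches = [LRUCache(lines_per_core) for _ in range(cores)]
    for core, line in trace:
        caches[core].access(line)
    return sum(cache.misses for cache in caches)


def misses_shared(trace, total_lines):
    """All cores use one cache, so each shared line is stored only once."""
    cache = LRUCache(total_lines)
    for _, line in trace:
        cache.access(line)
    return cache.misses


trace = make_trace()
private = misses_private(trace, cores=4, lines_per_core=8)  # 32 lines in total
shared = misses_shared(trace, total_lines=16)               # 16 lines in total
print(private, shared)
```

In this synthetic case the shared cache, at half the total capacity, incurs 8 misses against 32 for the private caches, because every core must fetch its own copy of each line in the private organization. Real server workloads share only part of their footprint, so the measured benefit depends on the workload.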

Having defined a cache hierarchy, the next major challenge is providing sufficient memory bandwidth to sustain the misses from the last-level cache. Figure 6 shows the cache scaling behavior of a consolidated server workload sharing a last-level cache. These data were obtained from trace-driven simulations of four 8-threaded workloads, based on TPC-C [28], SPECjbb2005 [26], SPECjAppServer2004 [25], and SAP SD/2T [24], running simultaneously on 32 single-threaded cores. The data show that consolidated workloads have good cache scaling behavior from 4MB all the way to 128MB of cache shared between the 32 cores.



Figure 4: Tera-scale DoC hierarchy of shared caches



Figure 5: Tera-scale shared L2 cache benefits



Figure 6: Tera-scale DoC L3 cache scaling behavior

To understand the memory bandwidth requirements of tera-scale DoC platforms, consider the simulation configuration with an 8MB L3 cache and 32 cores. In this configuration, we estimated that the bandwidth requirements can be as high as 20GB/s. Given that tera-scale processors may contain as many as 128 threads, the overall bandwidth requirements can be 100GB/s or higher. This in turn requires that a proportional number of memory channels be supported on the socket.

An alternative solution to the memory bandwidth bottleneck for tera-scale platforms is the use of large-capacity L4 caches. As shown in Figure 7, large-capacity L4 caches can be implemented either as an additional die on the package (in a multi-chip package) or stacked (using 3D stacking technologies [1]). To understand the potential of large-capacity L4 DRAM caches, which can provide as much as twice the bandwidth at as little as one-third of the memory latency, we conducted simulations of a 32-core, 8MB L3 cache configuration with and without a 32MB or 64MB L4 cache. We found that significant performance benefits (from 10 to 40%) can be achieved, depending on the organization of the DRAM cache, its exact bandwidth capability, and its latency advantage over main memory. However, the key benefit is the additional headroom in external memory bandwidth, which allows the number of memory channels implemented to be reduced without affecting performance.
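The bandwidth and channel-count reasoning above can be made concrete with a back-of-the-envelope sketch. Only the 20GB/s, 32-core, 128-thread, 100GB/s, and one-third latency figures come from the text. The per-channel bandwidth (6.4GB/s), the L4 hit rate (0.5), and all function names are assumptions chosen for illustration, and the latency model ignores the cost of an L4 lookup on a miss.

```python
import math


def scale_bandwidth(measured_gbps, measured_threads, target_threads):
    """Naive linear scaling of bandwidth demand with thread count."""
    return measured_gbps * target_threads / measured_threads


def channels_needed(demand_gbps, per_channel_gbps):
    """Memory channels required to sustain a given bandwidth demand."""
    return math.ceil(demand_gbps / per_channel_gbps)


def external_demand_with_l4(demand_gbps, l4_hit_rate):
    """Bandwidth that still reaches external memory after L4 filtering."""
    return demand_gbps * (1.0 - l4_hit_rate)


def relative_latency_with_l4(l4_hit_rate, l4_latency_ratio=1.0 / 3.0):
    """Average miss latency relative to main memory latency (simplified)."""
    return l4_hit_rate * l4_latency_ratio + (1.0 - l4_hit_rate)


PER_CHANNEL_GBPS = 6.4   # assumption: peak bandwidth of one memory channel
L4_HIT_RATE = 0.5        # assumption: hypothetical fraction of L3 misses hitting in L4

linear_demand = scale_bandwidth(20.0, 32, 128)   # linear estimate from the 32-core data
stated_demand = 100.0                            # "100GB/s or higher" from the text

channels_without_l4 = channels_needed(stated_demand, PER_CHANNEL_GBPS)
channels_with_l4 = channels_needed(
    external_demand_with_l4(stated_demand, L4_HIT_RATE), PER_CHANNEL_GBPS)

print(linear_demand, channels_without_l4, channels_with_l4)
print(relative_latency_with_l4(L4_HIT_RATE))
```

Linear scaling alone gives 80GB/s, which is below the 100GB/s or higher stated above, so the linear estimate is best read as a lower bound. Under the assumed channel bandwidth and hit rate, the L4 cache halves the number of channels required, which is the headroom argument in miniature.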



Figure 7: Tera-scale DoC L3 cache scaling behavior

Having addressed the cache/memory scalability challenges for tera-scale DoC architectures (using a hierarchy of shared caches and large L4 caches), we next turn our attention to adaptability concerns and solutions.
