Tera-scale Computing
Package Technology to Address the Memory Bandwidth Challenge for Tera-scale Computing
PACKAGE ARCHITECTURES TO MEET THE MEMORY BANDWIDTH NEEDS OF TERA-SCALE COMPUTING
To meet the memory bandwidth needs of tera-scale computing there needs to be an evolutionary transition in package architectures and technologies. In this section we discuss three architectures in detail: the CPU + memory 2D planar MCP, a package substrate embedded memory + CPU MCP architecture, and a 3D CPU + memory stacked die MCP. Figure 7 illustrates these MCP concepts.

Figure 7: Memory + CPU package architectures for addressing bandwidth challenges
Each of these package architectures has benefits and challenges associated with the technologies. The memory + CPU 2D planar MCP is the most straightforward to implement and can be implemented with today's packaging technologies. There are capability limits on the amount of additional memory bandwidth this architecture can provide, however. Embedding memory in the package is the next evolutionary step in enabling higher memory bandwidth at the package level. This potentially enables higher bandwidths than the memory + CPU 2D planar MCP but requires technology development and comes with integration challenges. The final architecture, 3D stacked die memory + CPU MCP, potentially enables bandwidth capability surpassing the previous two architectures, but requires much technology development and has significant integration challenges. Given these tradeoffs among the architecture challenges and their bandwidth capabilities, a transitional package technology roadmap makes sense and is proposed in Figure 8.
For the remainder of this section we provide details on the capabilities and challenges of each of these on-package memory architectures. For the purposes of this discussion, we assume a memory technology that is capable of delivering the bandwidth targets in terms of data rate and connection density that will be discussed. A discussion of the memory technology details is outside the scope of this paper. Also, the focus of our discussion is on memory bandwidth. Memory capacity and latency also play a key role in CPU system performance, but are outside the scope of this paper.

Figure 8: Proposed roadmap of package architecture transitions to address the memory bandwidth challenge
The first transition to on-package memory, and the most evolutionary one with respect to today's package technology, is the memory + CPU 2D planar MCP. Intel has used 2D planar MCP packaging in the past and continues to use MCP package technology today for many of its multi-core processors, so this is not a revolutionary technology. There are, however, unique challenges associated with a CPU + memory 2D planar MCP that realistically limit its bandwidth capabilities.
The key challenge with a CPU + memory 2D planar MCP is that heterogeneous die must be assembled onto a single package substrate while still achieving a very fast, dense memory bus interconnection scheme. This raises both design and electrical performance challenges. The key design challenges are form factor fit, die placement, and routability.
In general, the bump pitch of the CPU die and that of the memory or DRAM die will likely not match. This leads to routing issues that prevent the design from taking full advantage of the line/space density capabilities of the package technology. Figure 9 illustrates a typical scenario when trying to route 200 I/Os between two die on an MCP substrate. Single-layer routing becomes impossible because the routing lanes are cut off. The solution is to use two-layer routing, which increases package cost and creates performance challenges because the memory bus is divided across two routing layers.

Figure 9: Design and routing challenges with on-package memory 2D planar MCP
In addition to routing challenges, bump pitch constraints make it difficult for large numbers of I/Os to escape from each die. Table 1 summarizes results for 200 and 400 I/Os. While routing a memory bus with 200 I/Os is fairly scalable under reasonable assumptions about package technology and bump pitch routing capability scaling, increasing this to 400 I/Os becomes challenging.
Table 1: Design and routing/die-escape challenges with on-package memory 2D planar MCP
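The die-escape constraint can be approximated to first order. The sketch below is illustrative only: it assumes a simple single-layer escape model in which, per bump pitch along the die edge, the outer-row bump escapes directly and inner-row bumps escape through whatever traces fit between adjacent pads. The function names and every dimension used (150 µm pitch, 75 µm pads, 15 µm line/space, four bump rows) are hypothetical and are not taken from Table 1.

```python
import math


def traces_between_pads(pitch_um, pad_um, line_um, space_um):
    """Traces that fit in the gap between two adjacent bump pads on one layer.

    n traces need n*line + (n+1)*space of clearance, so
    n = floor((gap - space) / (line + space)).
    """
    gap_um = pitch_um - pad_um
    n = math.floor((gap_um - space_um) / (line_um + space_um))
    return max(n, 0)


def edge_length_needed_mm(num_ios, pitch_um, pad_um, line_um, space_um, bump_rows):
    """Die-edge length (mm) needed to escape num_ios signals on a single layer."""
    # Per bump pitch along the edge: the outer-row bump escapes directly,
    # and inner-row bumps escape through the channel between pads.
    channel = traces_between_pads(pitch_um, pad_um, line_um, space_um)
    per_pitch = min(bump_rows, 1 + channel)
    ios_per_mm = per_pitch * 1000.0 / pitch_um
    return num_ios / ios_per_mm


# Hypothetical dimensions, for illustration only.
for ios in (200, 400):
    edge = edge_length_needed_mm(ios, pitch_um=150, pad_um=75,
                                 line_um=15, space_um=15, bump_rows=4)
    print(f"{ios} I/Os need about {edge:.1f} mm of die edge on one layer")
```

Under these assumed dimensions, 200 I/Os need about 10 mm of die edge and 400 I/Os about 20 mm. That is one way to see why doubling the I/O count quickly runs into die-size and routing-layer limits unless the bump pitch or line/space scales.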
Signal integrity issues in an MCP configuration also lead to performance challenges. Because a substantial length of trace is still routed on the package between chips, and these traces are routed very densely, crosstalk limits performance. For a single-ended configuration, the upper limit is ~6-7Gb/s. Results of a signal integrity sensitivity study are summarized in Figures 10 and 11.

Figure 10: 2D planar MCP with on-package memory signal integrity results (eye height)

Figure 11: 2D planar MCP with on-package memory signal integrity results (eye width)
Combining the limits introduced by routing, fit, and signal integrity challenges, an estimate of the maximum sustainable bandwidth of a 2D planar MCP configuration is 100GB/s-200GB/s, depending upon the number of memory chips placed on the MCP. This also assumes a transition in memory technology to enable the types of connection densities and memory speeds used in this study. While this is a substantial amount of memory bandwidth capability, it is still not sufficient to meet the ultimate targets for tera-scale computing in the long term.
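As a rough check on how I/O count, per-pin data rate, and chip count combine into aggregate bandwidth, the following minimal sketch multiplies them out. It assumes every I/O is a data signal running at full utilization, which overstates what a real bus sustains. The function name and the operating points are illustrative assumptions, not the study's actual configuration.

```python
def peak_bandwidth_gbytes_per_s(num_chips, data_ios_per_chip, gbits_per_s_per_io):
    """Peak aggregate bandwidth in GB/s, assuming every I/O carries data."""
    total_gbits = num_chips * data_ios_per_chip * gbits_per_s_per_io
    return total_gbits / 8.0  # 8 bits per byte


# Illustrative operating points (assumed values).
one_chip = peak_bandwidth_gbytes_per_s(1, 200, 4.0)   # one chip, 200 I/Os at 4 Gb/s
two_chips = peak_bandwidth_gbytes_per_s(2, 200, 4.0)  # two chips, 200 I/Os each
print(f"1 chip:  {one_chip:.0f} GB/s")
print(f"2 chips: {two_chips:.0f} GB/s")
```

With these assumed inputs the result is 100 GB/s for one memory chip and 200 GB/s for two, which shows how the estimate can scale with the number of memory chips on the MCP. This is not necessarily how the figures in the text were derived.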
The next transition in package technology that can enable higher memory bandwidth than the CPU + memory MCP is a package embedded memory architecture. Figure 12 shows a schematic of this package architecture. It eliminates one level of transition between the CPU die and the DRAM die, namely the package routing that is responsible for the crosstalk limiting the ultimate I/O speed of the CPU + memory MCP architecture. This conceivably enables higher bandwidth by providing a very short and direct CPU-to-DRAM interconnect.

Figure 12: MCP with substrate embedded DRAM
Signal integrity simulations for a typical CPU-to-DRAM interconnect using a substrate embedded DRAM revealed very minimal impact due to crosstalk since the die-to-die connections are separated by an appreciable distance, equal to at least the bump pitch. The model used for the signal integrity studies is shown in Figure 13. At 4Gb/s, the substrate embedded memory configuration resulted in at least 30-40% more margin than the CPU + memory 2D planar MCP configuration results shown in Figure 10. Extrapolating from these results, it is conceivable that the substrate embedded memory architecture can easily achieve a bit rate of at least 10Gb/s. Given that direct connections between the CPU and memory can be made at the same pitch as the die bump pitch, hundreds of connections can be enabled in a small area. Consequently, this architecture can enable a memory bandwidth in the 200GB/s-1TB/s range.

Figure 13: MCP with substrate embedded DRAM
One challenge with this type of architecture is the thermal performance with an embedded die. Preliminary thermal modeling and historical data suggest that a limit of approximately ten watts or less for the embedded device should be maintained to avoid excessive refresh rates and increased power penalties. There are also substantial integration challenges with this architecture. This is, however, an intriguing architecture for enabling bandwidths in the 200GB/s-1TB/s range.
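To see what "hundreds of connections" means for this bandwidth range, the sketch below inverts the bandwidth arithmetic: given a target bandwidth and a per-connection bit rate, it returns the number of data connections required. It assumes 1 TB/s = 1000 GB/s, counts data connections only, and ignores protocol overhead. The function name is hypothetical.

```python
import math


def connections_needed(target_gbytes_per_s, gbits_per_s_per_connection):
    """Minimum number of data connections to reach a target peak bandwidth."""
    target_gbits = target_gbytes_per_s * 8  # 8 bits per byte
    return math.ceil(target_gbits / gbits_per_s_per_connection)


# Endpoints of the 200GB/s-1TB/s range at the 10 Gb/s bit rate cited in the text.
for target in (200, 1000):
    n = connections_needed(target, 10.0)
    print(f"{target} GB/s at 10 Gb/s per connection: {n} connections")
```

At 10 Gb/s per connection, the two ends of the range work out to 160 and 800 data connections respectively.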
To enable memory bandwidths beyond 1TB/s, the 3D CPU + memory stacked die MCP architecture becomes interesting. Because this will provide the shortest possible interconnect between the CPU and memory die, the bit rate will far exceed 10Gb/s. In addition, the interconnect density will scale to enable thousands of die-to-die interconnects. Intel recently demonstrated a single-chip teraflop processor, Polaris, with 80 cores. Polaris contains hooks for stacking a separate SRAM chip, Freya, in a 3D configuration [6] and [7].