Tera-scale Computing
Integration Challenges and Tradeoffs for Tera-scale Architectures
ON-DIE INTERCONNECT
The on-die interconnect is the primary "meeting ground" for various elements of the tiled architecture in Figure 1. Given its central nature, there are certain basic requirements for the on-die interconnect:
- Scalability: Given the requirements of a large number of nodes (agents) on the interconnect (high tens to low hundreds), we realistically desire a) a sub-linear growth in average distance with number of nodes, b) a relatively low per-hop latency through each switch under no-load conditions, and c) manageable growth in latency under loaded conditions.
- Partitionability: The topology, with appropriate routing support, should enable the tera-scale architecture to be dynamically partitioned to achieve both performance and fault isolation.
- Fault tolerance: The tera-scale architecture with its tiled structure has the potential for a graceful degradation under faults. Further, with the expected impact of variations on process technology, there is a greater susceptibility to "performance" faults (discussed in the next section). Hence, the topology, with appropriate support, should support routing around faults.
- Validation and testing: The interconnect should provide support for testing and validation, which is critical for high-volume manufacturing. For example, an interconnect that uses a deadlock-free routing approach is easier to test and validate compared to one using deadlock-recovery based routing.
- Regularity: In order to make the design of the tera-scale chip tractable, it is imperative that the layout, planning, and design of each tile be done in such a way as to make the tile physically symmetric. Thus integration may be achieved largely through abutment of tiles. To that end, each "tile" needs to plan its global wiring tracks.
- Flexibility and design friendliness:
- Designs should facilitate "choppability" so that with minimal redesign effort, a range of market segments can be satisfied.
- Furthermore, the basic router design should not change as underlying parameters, such as the number of processors, change with each process generation.
- The tiled architecture will have different-sized tiles arising possibly from the need for heterogeneous cores (e.g., some suited for throughput and others suited for single-thread performance), specialized engines, fixed function units, etc. The on-die interconnect design needs to incorporate the needs of each by, for example, clustering multiple low-bandwidth engines into a single routing agent.
Candidate Topologies

Figure 2: 2D embedding of a 64-node 3D-mesh network
In off-chip networks, both the number of links (which has a bearing on the topology) and the link widths are determined by pin-out limitations. In on-chip networks, this limitation is absent. The topology choices (apart from the number of routing agents that need to be supported) are determined by the wiring density and router complexity in terms of area and power. The wiring density is in turn determined by the number of metal layers available and the directionality constraints (uniform availability of horizontal and vertical metal layers), as well as the need for the topology to be embedded in 2D space. The latter point implies that for higher (greater than 2D) dimensional networks, topological adjacency does not lead to spatial adjacency [14]. This has significant implications both on the wire delay and on the wiring density. Consider the embedding of the 3D mesh in Figure 2. For the longest path, the topological distance is 9 hops, but three of these hops each span half the length of the die. Counting each of those long hops as four tile spans, the distance in tile-span units is 6 + 3 × 4 = 18, twice the hop count.
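The hop-count versus tile-span arithmetic above can be reproduced with a small sketch. The embedding used here is an assumption and is not taken from Figure 2: the four 4x4 planes of a 4x4x4 mesh are placed as quadrants of an 8x8 die in snake order, wire length is measured as Manhattan distance between tile positions, and the names `PLANE_ORIGIN`, `physical_position`, `dimension_order_route`, and `route_lengths` are hypothetical.

```python
# Assumed embedding (not taken from Figure 2): the four 4x4 planes of a
# 4x4x4 mesh are placed as quadrants of an 8x8 die in snake order, so that
# consecutive planes always sit in adjacent quadrants.
PLANE_ORIGIN = {0: (0, 0), 1: (4, 0), 2: (4, 4), 3: (0, 4)}


def physical_position(node):
    """Map a 3D-mesh node (x, y, z) to its tile coordinates on the 8x8 die."""
    x, y, z = node
    ox, oy = PLANE_ORIGIN[z]
    return (x + ox, y + oy)


def dimension_order_route(src, dst):
    """Return the nodes visited when correcting x, then y, then z."""
    path = [src]
    cur = list(src)
    for dim in range(3):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path


def route_lengths(src, dst):
    """Return (topological hops, total wire length in tile spans)."""
    path = dimension_order_route(src, dst)
    wire = 0
    for a, b in zip(path, path[1:]):
        (ax, ay), (bx, by) = physical_position(a), physical_position(b)
        # Assumption: each link is as long as the Manhattan distance it spans.
        wire += abs(ax - bx) + abs(ay - by)
    return len(path) - 1, wire


hops, tile_spans = route_lengths((0, 0, 0), (3, 3, 3))
print(hops, tile_spans)
```

Under this assumed layout, every hop in the third dimension crosses half the die, which is why the corner-to-corner route is much longer in tile spans than in hops.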
Considering wiring density, router complexity, and design friendliness, tera-scale architecture topologies will be fixed-degree and will have a low dimension (1-2) in the foreseeable future. Thus, ring and 2D torus/mesh networks and their many variants [2] will be candidate topologies.
In the rest of this section, we use the 2D mesh as an example topology for illustrative purposes only.
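To give a feel for how average distance grows in the low-dimensional candidates, the following sketch compares a bidirectional ring, a 2D mesh, and a 2D torus with the same number of nodes. It assumes uniform traffic between all pairs of distinct nodes and minimal routing. The function names are hypothetical, and the numbers say nothing about wiring or router cost.

```python
from itertools import product


def ring_distance(a, b, n):
    """Shortest hop count between positions a and b on a bidirectional n-node ring."""
    d = abs(a - b)
    return min(d, n - d)


def average_hops(nodes, distance):
    """Mean distance over all ordered pairs of distinct nodes."""
    total = sum(distance(a, b) for a in nodes for b in nodes if a != b)
    return total / (len(nodes) * (len(nodes) - 1))


def compare(k):
    """Average hop counts for k*k nodes arranged as a ring, 2D mesh, and 2D torus."""
    n = k * k
    grid = list(product(range(k), repeat=2))
    return {
        "ring": average_hops(list(range(n)), lambda a, b: ring_distance(a, b, n)),
        "mesh": average_hops(grid, lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])),
        "torus": average_hops(
            grid,
            lambda a, b: ring_distance(a[0], b[0], k) + ring_distance(a[1], b[1], k),
        ),
    }


for k in (4, 8, 12):
    print(k * k, {name: round(v, 2) for name, v in compare(k).items()})
```

The ring's average distance grows linearly with the node count, while the mesh and torus grow with its square root, which is the sub-linear behavior asked for under scalability.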
Interconnect Microarchitecture
The main challenge in on-die networks is to achieve the required bandwidth and latency under the constraints of power and area. While topological choices, as mentioned above, help with bandwidth scaling and keep latencies manageable, they come at increasing power costs.
Wang et al. [28] show that the router power is almost 2x the power of the wires in the MIT-RAW [25] chip. Further decomposition shows that roughly as much power is spent in the switch (crossbar) as in the buffers.
Wang et al. also propose segmented and cut-through crossbars as possible solutions to reduce crossbar power. Meanwhile Nicopoulos et al. [21] reduce buffer power through careful microarchitectural techniques to minimize the number of buffers required for the same network throughput.
Kumar et al. [16] observe that certain paths in a 2D mesh are common for a number of flows (between different source/destination pairs), and thus traffic traveling on these trunks could be aggregated and switched together, thus avoiding the need for packets to stop and be buffered at intermediate nodes. This in turn saves buffer power and reduces contention on network resources, the latter helping to improve the throughput and thus eventually the energy characteristics of the network.
Traffic Classes
Emerging workloads [10] may see different classes of traffic overlaid on the on-die network. Taylor et al. [25] and Gratz et al. [11] use a network fabric to route operands between different clusters of functional units. It is conceivable that the different cores of the tera-scale processor may be used to realize a virtual superscalar microarchitecture [29], thus necessitating fine-grained communication of operands and control, in addition to the cache-coherent and message-based communication in the cache-memory subsystem.
Furthermore, as media applications become a dominant consumer of compute capacity on a chip, they will place hard real-time constraints on different shared resources such as cache and interconnect. In addition, running disparate applications on the same multi-core is likely to result in bandwidth over-subscription by some applications at the cost of starvation of some others. Careful rationing of bandwidth while providing latency guarantees to the necessary applications will require careful architecture definition and design of the fabric.
Resiliency
The need for fault tolerance arises from both an increased susceptibility to faults and the opportunity to gracefully degrade in a tera-scale environment.
Future process technology trends are likely to adversely affect the resilience of a tera-scale processor chip. Such trends might include process variations becoming a more significant determinant of overall performance and insufficient burn-in time to weed out infant mortality. Consequently, there is a higher probability of in-field failures and accelerated degradation potentially shortening the expected lifetime of the product [6].
Interconnect Fault-tolerance Approaches
The following mechanisms can be adopted for addressing resilience in a tera-scale processor interconnect. These approaches can be used either to address true faults or performance faults (i.e., when underperforming or "out-of-spec" tiles are treated as failed tiles).
Sparing: Spare processor tiles paired with network interfaces and switches can potentially solve a multiplicity of fault scenarios including increased possibility of in-field failures. Upon detection of failures in some tile components, spare tiles are activated after the interconnection network is reconfigured as shown in Figure 3.

Figure 3: Illustrating use of spare tiles that maintain original topology
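One way to picture sparing is as a remapping from logical tile coordinates to working physical tiles. The sketch below assumes a specific scheme that the text does not prescribe: each mesh row carries its own spare tiles at the end of the row, and logical columns keep their left-to-right order while skipping failed tiles. The names `remap_row` and `build_mapping` are hypothetical.

```python
def remap_row(physical_cols, failed):
    """Return the working physical columns of one row, in left-to-right order."""
    return [c for c in range(physical_cols) if c not in failed]


def build_mapping(rows, logical_cols, spares_per_row, failed_tiles):
    """Map each logical tile (row, col) to a working physical tile (row, col).

    Assumption: spares sit at the end of each row and can only replace
    failed tiles in that same row.
    """
    mapping = {}
    for r in range(rows):
        failed = {c for (fr, c) in failed_tiles if fr == r}
        working = remap_row(logical_cols + spares_per_row, failed)
        if len(working) < logical_cols:
            raise ValueError(f"row {r} has more failures than spares")
        for logical_c, physical_c in enumerate(working[:logical_cols]):
            mapping[(r, logical_c)] = (r, physical_c)
    return mapping


# A 2x4 logical mesh with one spare per row and a failure at physical tile (0, 1).
mapping = build_mapping(rows=2, logical_cols=4, spares_per_row=1,
                        failed_tiles={(0, 1)})
print(mapping[(0, 1)], mapping[(1, 1)])
```

Software continues to see the original logical mesh. The cost in this scheme is that some logical neighbors end up two physical tiles apart.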
Fault-tolerant Routing
Fault-tolerant routing support is required in the interconnect to enable reconfiguration of the system components in the presence of failed tiles and routers. Upon system reset/initialization, a fault and topology discovery algorithm is run to determine the location/identity of failed components and to mark them in the interconnect. Other regions also need to be marked safe or unsafe from a deadlock-free routing perspective. A fault-tolerant routing algorithm is then configured to route around faulty and unsafe regions. Figure 4 shows faulty (dead) nodes in the interconnect. A few additional nodes are marked unsafe so as to form rectangular fault regions. After the fault-tolerant routing algorithm (such as in [5]) is configured, all working tiles in the fabric (and spare tiles, if sparing is used) can communicate with each other.

Figure 4: Illustrating need for fault-tolerant routing
The fault-tolerant routing algorithm should be simple to implement, be deadlock free, and handle a wide variety of faults. It is also desirable for the routing algorithm to adaptively respond to congestion that may occur in the network due to the additional effort needed to route around fault regions.
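The step of marking additional nodes unsafe to form rectangular fault regions can be sketched as follows. This is an illustrative assumption, not the algorithm of [5]: faulty nodes are grouped into bounding rectangles, rectangles that overlap or touch (including diagonally) are merged, and every healthy node left inside a rectangle is marked unsafe. The function names are hypothetical.

```python
def fault_regions(faulty):
    """Group faulty nodes into disjoint, non-touching bounding rectangles.

    Each rectangle is (min_x, min_y, max_x, max_y), inclusive.
    """
    rects = [(x, y, x, y) for (x, y) in sorted(faulty)]
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                a, b = rects[i], rects[j]
                # Merge rectangles that overlap or touch (including diagonally).
                if (a[0] <= b[2] + 1 and b[0] <= a[2] + 1
                        and a[1] <= b[3] + 1 and b[1] <= a[3] + 1):
                    rects[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects


def unsafe_nodes(faulty):
    """Healthy nodes that are disabled so that every fault region is rectangular."""
    unsafe = set()
    for (x0, y0, x1, y1) in fault_regions(faulty):
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                if (x, y) not in faulty:
                    unsafe.add((x, y))
    return unsafe


faulty = {(1, 1), (2, 2), (5, 5)}
print(fault_regions(faulty))
print(sorted(unsafe_nodes(faulty)))
```

The trade-off is visible in the example: two diagonal faults cost two healthy tiles, in exchange for fault regions that a simple routing rule can circumnavigate.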
Partitioning for Performance Isolation
We expect several partitions to be supported on a tera-scale processor, each partition with a fraction of the total number of processing units, special-purpose units, and other platform elements. There may be several different usage models for a partitioned tera-scale processor including multiple server partitions in a consolidated "server on chip" or, for example, multiple virtual appliances on a home server.
It is desirable that the performance of each of the multiple partitions on a tera-scale processor be unaffected by the performance of other partitions. Some partitions may be more sensitive to performance perturbations from other partitions or may have stricter Quality of Service (QoS) requirements.
Performance isolation: Performance isolation relies on confining intra-partition communication of a given QoS sensitive partition to physically distinct components of the on-die interconnect from that of other partitions. Figure 5 shows a configuration with three isolated partitions where traffic generated in one partition does not interfere with traffic from another partition.

Figure 5: Performance isolation in a 2D mesh with rectangular partitions
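A short sketch can show why rectangular partitions suit a 2D mesh. It assumes dimension-order (XY) routing, which the text does not mandate, and checks whether every route between two nodes of a partition visits only nodes of that partition. The names `xy_route`, `rectangle`, and `stays_inside` are hypothetical.

```python
def xy_route(src, dst):
    """Nodes visited by dimension-order routing: correct x first, then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def rectangle(x0, y0, x1, y1):
    """Set of mesh nodes in the inclusive rectangle (x0, y0) to (x1, y1)."""
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}


def stays_inside(partition):
    """True if every XY route between two partition nodes visits only partition nodes."""
    return all(hop in partition
               for s in partition for d in partition for hop in xy_route(s, d))


print(stays_inside(rectangle(0, 0, 3, 1)))        # rectangular partition
print(stays_inside({(0, 0), (0, 1), (1, 1)}))     # L-shaped partition
```

With XY routing, traffic inside a rectangle never leaves it, whereas the L-shaped partition routes (0, 0) to (1, 1) through (1, 0), a node that belongs to another partition.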
Virtualization of Network Interfaces
In order to better realize a uniform interface that can comprehend the diverse needs of accelerators, fixed function units, general-purpose processor cores, cache blocks, etc., it is desirable to formalize and possibly even export the interconnect as an abstraction to the application programmer and/or the run-time system. Thus, one could envision a programmer fine-tuning an application's inter-processor communication requirements, based specifically on that application. For example, a media application requiring little support for cache coherence may be better served in power and performance through a direct send/receive interface. Similarly, the programmer may want to commandeer different levels of resources of the interconnects, i.e., number of buffers, switching priority, or bandwidth, and leave the rest of the hardware for use by another application.
A network interface that can allow this level of control and flexibility, yet can achieve good performance would be powerful. In addition, such a network makes for easy and rapid integration of multiple IP blocks that conform to the same interface.
