Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.04

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 2 of 11  

Runtime Environment for Tera-scale Platforms

INTRODUCTION

System software tends to view a tera-scale chip multiprocessor (hereafter called TS-CMP) as a large-scale "symmetric multiprocessor (SMP) on a die"; yet TS-CMPs have several characteristics that are fundamentally different from those of SMPs. It is critical to address these differences in order to implement a scalable and effective software stack. In particular, it is important for the software stack to support (1) efficient fine-grain parallelism, (2) new concurrency abstractions that make parallel programming easier, and (3) platform and application heterogeneity.

Supporting Fine-grain Parallelism

A TS-CMP has a very different compute-to-cache ratio than a traditional SMP. A 32-way SMP system typically has more than 100 MB of aggregate cache, while a 32-core TS-CMP has less than 10 MB of cache. Thus a TS-CMP application needs to be threaded at a much finer granularity to reduce its working set. For example, MPEG4 encoding could be parallelized on a large-way SMP by encoding several frames in parallel. On a TS-CMP, the encoding of an individual frame needs to be parallelized, since the platform will not be able to cache multiple high-definition frames. Finally, many tera-scale applications benefit from fine-grain nested data parallelism rather than from coarse-grain task parallelism.
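A rough back-of-the-envelope check makes the working-set argument concrete. The sketch below assumes 1920x1080 frames stored as 8-bit YUV 4:2:0 (1.5 bytes per pixel) and treats the 100 MB and 10 MB figures above as round cache sizes. The frame format and the helper names are illustrative assumptions, not details from the article.

```python
MIB = 2**20  # bytes in one mebibyte

def frame_bytes(width: int, height: int) -> int:
    """Size of one raw 8-bit YUV 4:2:0 frame (assumed format).

    One full-resolution luma plane plus two chroma planes, each at
    half resolution in both dimensions, i.e. 1.5 bytes per pixel.
    """
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return luma + chroma

def frames_that_fit(cache_bytes: int, width: int = 1920, height: int = 1080) -> int:
    """Number of whole raw frames that fit in the aggregate cache."""
    return cache_bytes // frame_bytes(width, height)

if __name__ == "__main__":
    for label, cache in (("32-way SMP, 100 MB", 100 * MIB),
                         ("32-core TS-CMP, 10 MB", 10 * MIB)):
        print(label, "->", frames_that_fit(cache), "raw frames")
```

Under these assumptions the SMP's aggregate cache holds roughly one raw frame per processor, whereas the TS-CMP's cache holds only about three frames in total, so its 32 cores have to share the work within a single frame.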

On the other hand, a TS-CMP enables fine-grain parallelism because inter-core communication is much cheaper: core-to-core bandwidth is on the order of terabytes per second, as opposed to gigabytes per second for an SMP, and core-to-core latency is in the low tens of cycles (say, 20 cycles), as opposed to hundreds of cycles in an SMP. Moreover, the effective core-to-core latency is much smaller, since the high degree of threading in a TS-CMP core allows another thread on the same core to fully utilize the core resources while one thread is blocked on a cache miss.

Supporting New Concurrency Abstractions

Due to their high cost, large-way SMP systems have been restricted to niche markets, running applications written by sophisticated programmers. TS-CMP processors, by contrast, are targeted at mainstream price points and will bring parallelism to the average programmer. The success of TS-CMP processors and the applications that run on them depends on mainstream programmers embracing parallelism aggressively. Thus, the system software stack should include new higher-level concurrency abstractions that make it easier for the average programmer to deal with parallelism.

Supporting Heterogeneity

Unlike an SMP software stack, a TS-CMP software stack must comprehend heterogeneity at multiple levels. At the application level, TS-CMP processors will run a more diverse set of applications because they are targeted at a much broader market. At the hardware level, the TS-CMP platform may be heterogeneous, combining high-performance scalar cores, an array of high-throughput cores, and fixed-function units. The system software stack needs to support configurable policies (for example, configurable scheduling policies) to adapt to different applications, and it needs to schedule applications according to their hardware requirements.
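One way to picture these two requirements together is a set of per-hardware queues, each with its own policy. The following is only a toy model: `Task`, `SchedulingDomain`, `dispatch`, and the "fifo"/"lifo" policy names are hypothetical and are not drawn from McRT's actual interface.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    name: str
    needs: str  # kind of hardware the task requires, e.g. "scalar"

@dataclass
class SchedulingDomain:
    kind: str             # hardware kind served by this domain
    policy: str = "fifo"  # configurable per-domain policy: "fifo" or "lifo"
    queue: deque = field(default_factory=deque)

    def submit(self, task: Task) -> None:
        self.queue.append(task)

    def next_task(self) -> Optional[Task]:
        """Pick the next task according to this domain's own policy."""
        if not self.queue:
            return None
        if self.policy == "fifo":
            return self.queue.popleft()  # oldest task first
        return self.queue.pop()          # newest task first

def dispatch(tasks, domains) -> None:
    """Route each task to the domain whose hardware kind matches its needs."""
    by_kind = {d.kind: d for d in domains}
    for task in tasks:
        by_kind[task.needs].submit(task)

if __name__ == "__main__":
    scalar = SchedulingDomain("scalar", policy="fifo")
    throughput = SchedulingDomain("throughput", policy="lifo")
    dispatch([Task("parse", "scalar"), Task("block0", "throughput"),
              Task("block1", "throughput")], [scalar, throughput])
    print(scalar.next_task(), throughput.next_task())
```

Each domain is scheduled independently here, so changing the policy for the throughput cores does not affect how tasks on the scalar cores are ordered.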

In this paper we present the design and implementation of McRT, a runtime environment for tera-scale platforms. McRT provides a configurable runtime framework that addresses the key tera-scale runtime requirements in the following ways:

  • Fine-grain parallelism: McRT implements a significant fraction of its threading services, including thread creation, synchronization, and memory management, at the user level. It also provides efficient user-level abstractions, such as futures, that make it easier to express and extract fine-grain parallelism.
  • Concurrency abstractions: McRT includes a high-performance transactional memory library that supports an atomic construct in both C/C++ and Java. Transactional memory [15] provides a number of software engineering benefits compared to locks for managing access to shared data.
  • Heterogeneity: McRT supports a number of configurable runtime policies that can be adapted for a particular application. McRT also supports multiple scheduling domains: different hardware (HW) units can be mapped to different scheduling domains, and applications can be scheduled independently within each domain.
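The futures-based style mentioned in the first item can be sketched with Python's standard `concurrent.futures` module. This only illustrates the programming model: Python's futures run on a pool of operating-system threads rather than on a user-level scheduler, and `block_cost` and `process_frame` are hypothetical stand-ins for real per-block work.

```python
from concurrent.futures import ThreadPoolExecutor

def block_cost(block):
    """Stand-in for per-block work such as motion estimation (hypothetical)."""
    return sum(abs(px) for px in block)

def process_frame(frame, block_size, pool):
    """Split one frame into small blocks and create one future per block."""
    blocks = [frame[i:i + block_size] for i in range(0, len(frame), block_size)]
    futures = [pool.submit(block_cost, b) for b in blocks]
    # result() waits until the corresponding block has been processed.
    return sum(f.result() for f in futures)

if __name__ == "__main__":
    frame = list(range(-8, 8))  # toy "frame" of 16 samples
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(process_frame(frame, block_size=4, pool=pool))
```

The caller expresses only what can run in parallel (one future per block); how the blocks are mapped onto hardware threads is left to the runtime, which is what makes a cheap, user-level implementation of futures attractive for fine-grain work.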

We show McRT's scalability using media encoding and Recognition, Mining, Synthesis (RMS) applications [11] on a tera-scale simulator. The results show that McRT's efficient threading primitives enable the applications to scale almost linearly up to 64 HW threads. We show that transactional memory can significantly ease parallel programming. Applications can use coarse-grain atomic blocks to synchronize access to shared data; yet they can achieve the performance of fine-grain locking. We also show a prototype implementation of a heterogeneous HW platform that leverages the support for scheduling domains in McRT.
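The trade-off that atomic blocks aim to remove can be illustrated with plain locks. Python has no atomic construct, so in the sketch below a single global lock stands in for the place where a coarse-grain atomic block would be written; the `Account` class and both transfer functions are hypothetical examples, not code from McRT.

```python
import threading

class Account:
    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()  # used only by the fine-grain version

_global_lock = threading.Lock()

def transfer_coarse(src, dst, amount):
    """Coarse-grain: one lock covers every account.

    Simple to write and reason about, but all transfers are serialized.
    An atomic block would be written in the same coarse style.
    """
    with _global_lock:
        src.balance -= amount
        dst.balance += amount

def transfer_fine(src, dst, amount):
    """Fine-grain: lock only the two accounts involved.

    Locks are always taken in ident order so that two opposite transfers
    cannot deadlock. Assumes src and dst are distinct accounts.
    """
    first, second = sorted((src, dst), key=lambda a: a.ident)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount

if __name__ == "__main__":
    a, b = Account(1, 100), Account(2, 100)
    workers = [threading.Thread(target=transfer_fine, args=(a, b, 1)) for _ in range(50)]
    workers += [threading.Thread(target=transfer_fine, args=(b, a, 1)) for _ in range(50)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(a.balance, b.balance)
```

The fine-grain version allows unrelated transfers to proceed concurrently, but the programmer must get the lock ordering right. The claim in the text is that a transactional memory runtime lets the programmer write the coarse-grain form while approaching the concurrency of the fine-grain one.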
