Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.04

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 3 of 11  

Runtime Environment for Tera-scale Platforms

McRT ARCHITECTURE

At its core, McRT provides a set of user-level primitives: a scheduler, a memory manager, synchronization primitives, and a set of threading abstractions. We implemented these traditional operating system (OS) services at user level to avoid the expensive transitions between the user level and the OS level, which makes fine-grain parallelism more tractable. Figure 1 shows the McRT architecture.



Figure 1: McRT architecture

McRT provides two user-level threading abstractions: threads and futures. Threads are similar in functionality to POSIX threads, while futures are more lightweight and support a concurrency idiom found in languages such as MultiLisp [14] and Cilk [8]. Futures have serial execution semantics, but can be executed in parallel when additional hardware resources are available.
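The serial semantics of futures can be illustrated with a minimal Python sketch. It is not the McRT API: the `LazyFuture` class and its `run` and `get` methods are hypothetical names chosen for illustration. The idea shown is that a future records deferred work; if no other processor picks it up, the creating thread runs it inline when the value is needed, so the program computes the same result as its serial version.

```python
import threading


class LazyFuture:
    """Hypothetical future: whoever calls run() first executes the work."""

    def __init__(self, fn, *args):
        self._fn = fn
        self._args = args
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def run(self):
        # May be called by an idle worker or by the consumer.
        # Only the first caller executes the function; later callers reuse the value.
        with self._lock:
            if not self._done:
                self._value = self._fn(*self._args)
                self._done = True
        return self._value

    def get(self):
        # With no idle worker available, the consumer evaluates the future inline,
        # which is exactly the serial execution order.
        return self.run()


def fib(n):
    if n < 2:
        return n
    left = LazyFuture(fib, n - 1)  # could be executed by another processor
    right = fib(n - 2)
    return left.get() + right
```

In this sketch a scheduler with spare processors could call `run()` on pending futures before the consumer reaches `get()`; the result is the same either way.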

The user-level scheduler is built on task queues. An application can configure the number of task queues, e.g., one task queue per processor. It can also specify the scheduling policy, e.g., a work-sharing policy, where new tasks are distributed among the task queues, or a work-stealing policy, where idle processors search other queues for the next available task.
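The difference between the two policies can be sketched as follows. This is an illustrative, single-threaded model under stated assumptions, not McRT code: the `TaskQueues` class, its method names, the round-robin distribution for work sharing, and stealing from the tail of the victim's queue are all choices made for the example.

```python
from collections import deque


class TaskQueues:
    """Hypothetical model of configurable task queues (not thread-safe)."""

    def __init__(self, num_queues, policy="work-stealing"):
        self.queues = [deque() for _ in range(num_queues)]
        self.policy = policy
        self._next = 0  # round-robin cursor used by work sharing

    def submit(self, task, origin=0):
        if self.policy == "work-sharing":
            # Work sharing: spread new tasks across all queues at submit time.
            target = self._next
            self._next = (self._next + 1) % len(self.queues)
        else:
            # Work stealing: new tasks stay on the submitting processor's queue.
            target = origin
        self.queues[target].append(task)

    def next_task(self, worker):
        own = self.queues[worker]
        if own:
            return own.popleft()
        if self.policy == "work-stealing":
            # An idle processor searches the other queues for available work.
            for offset in range(1, len(self.queues)):
                victim = self.queues[(worker + offset) % len(self.queues)]
                if victim:
                    return victim.pop()
        return None
```

Under work sharing the balancing cost is paid on every submission; under work stealing it is paid only when a processor runs out of local work.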

McRT uses a cooperative scheduling policy, as opposed to the preemptive scheduling policy used predominantly in software stacks for SMPs. In an SMP system, the processing resource is expensive, so the system software timeshares it across multiple application threads by using preemptive scheduling. In a TS-CMP platform (say, a platform with 128 cores), the processing resource is both inexpensive and abundant, which led us to use cooperative scheduling. This in turn addresses scalability bottlenecks such as convoying, since the application controls when a thread gives up its processor.
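Cooperative scheduling can be sketched with Python generators, where each `yield` marks a point at which a task voluntarily gives up the processor. The function names `run_cooperatively` and `worker` are hypothetical, and the round-robin order is an assumption made for the example; the sketch only illustrates that a task is never descheduled between its own yield points.

```python
from collections import deque


def run_cooperatively(tasks):
    """Run generator-based tasks round-robin until all have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)  # run the task up to its next explicit yield point
        except StopIteration:
            continue  # task finished; do not reschedule it
        ready.append(task)


def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i))  # work done without any risk of preemption
        yield  # explicit, application-chosen yield point


log = []
run_cooperatively([worker("a", 2, log), worker("b", 1, log)])
```

Because the task chooses where it yields, it can avoid yielding while it holds a lock, which is the situation that leads to convoying under preemptive scheduling.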

McRT includes a user-level synchronization library that provides several scalable algorithms, such as MCS [24] locks and CLH [22] queues. It also includes a user-level memory allocator [16] that uses per-thread private allocation blocks. The allocator is completely non-blocking, which allows it to scale even under heavy oversubscription, where the number of software threads is much greater than the number of hardware processors.

Finally, McRT includes a number of client adaptors that translate existing popular paradigms such as OpenMP and pthreads to the core McRT API. The OpenMP adaptor implements the API used by the Intel® C compiler, while the pthreads adaptor translates the POSIX API.

The core services in McRT are modularized and can be used as standalone services. For example, the memory manager ships as part of the Threading Building Blocks, while the transactional memory module has been tightly integrated into several compilers including the Intel C compiler, the StarJIT compiler [1], and the Harmony JITtrino compiler [5].

Evaluating Support for Fine-Grain Parallelism

We used a number of micro-benchmarks to evaluate the efficiency of the McRT threading primitives, and hence its support for fine-grain parallelism. Figure 2 shows the results: the first row compares the cost of creating 255 threads; the second row compares the cost of 1000 consecutive lock acquire and release operations; and the final row compares the cost of 1000 context switches. In each case, the gettimeofday() system call was used for the measurements. All experiments were run on a 2.8GHz Intel® Xeon® processor. Column 2 reports the measurements for native threads on Linux* 2.4.9, while Column 3 reports the measurements for native threads on RedHat Enterprise* Linux 2.6.9-22ELsmp (NPTL 0.60).
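The shape of two of these micro-benchmarks can be sketched as follows. This is only an illustration of the measurement method, under clear assumptions: it times Python's standard `threading` module rather than McRT or native pthreads, it uses `time.perf_counter()` in place of gettimeofday(), and the helper names `time_call`, `lock_pairs`, and `create_threads` are hypothetical. Its absolute numbers are not comparable to those in Figure 2.

```python
import threading
import time


def time_call(fn):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def lock_pairs(n=1000):
    # n consecutive, uncontended acquire/release operations on one lock.
    lock = threading.Lock()
    for _ in range(n):
        lock.acquire()
        lock.release()


def create_threads(n=255):
    # Create n threads whose body returns immediately, then wait for all of them.
    threads = [threading.Thread(target=lambda: None) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


lock_seconds = time_call(lock_pairs)
create_seconds = time_call(create_threads)
```

Dividing each elapsed time by the operation count gives a per-operation cost that can be compared across threading implementations on the same machine.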



Figure 2: Micro-benchmark evaluation

We also measured the scalability of our threading primitives. Figure 3 compares the cost of creating thousands of threads on McRT and on Linux (2.6.9). Note that the efficiency of thread creation in McRT does not degrade even with thousands of threads.



Figure 3: Scalability of thread creation

As mentioned earlier, McRT also implements futures to provide a lighter-weight concurrency mechanism. To measure their overhead relative to McRT threads, we created batches of futures and batches of threads whose code simply returned immediately, and compared the time to complete each batch. Figure 4 shows the ratio of the execution time for threads to that for futures: futures are 40 to 100 times more efficient than threads, so they provide very good support for fine-grain parallelism.



Figure 4: Thread vs. future creation overhead
