Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.04

  • Volume 11
  • Issue 03
  • Published August 22, 2007

  Section 6 of 11  

Runtime Environment for Tera-scale Platforms

RESULTS

We used a cycle-accurate simulator to evaluate McRT's performance on a TS-CMP processor. The simulated platform consists of an array of up to 16 in-order cores, each of which has four hardware (HW) threads. Each cycle, a core selects a different thread in round-robin order, skipping any thread that is stalled (for example, because of a cache miss or because it is in the sleep state). The memory system consists of a 32 KB L1 data cache that is shared by all four threads in the core, a 2 MB L2 cache that is shared by all the cores, and an off-chip 4 MB L3 cache. All caches were simulated with an 8-way set-associative configuration. The L1 cache access time is 3 cycles, the L2 cache access time is 12 cycles, and the L3 cache access time is 40 cycles. The simulator performs a cycle-accurate simulation of the execution pipeline for all the HW threads, the caches, the coherence protocol, the bandwidth for data transfer between different parts of the memory system, and the interconnect to the external memory.
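The per-core thread selection described above can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the simulator's code: the function name `select_thread`, the boolean `stalled` flags, and the example stall pattern are all hypothetical.

```python
from typing import Optional, Sequence


def select_thread(last: int, stalled: Sequence[bool]) -> Optional[int]:
    """Pick the next ready thread after `last` in round-robin order.

    Returns None when every thread on the core is stalled.
    """
    n = len(stalled)
    for offset in range(1, n + 1):
        candidate = (last + offset) % n
        if not stalled[candidate]:
            return candidate
    return None


# One core with four HW threads; thread 2 is stalled (hypothetical pattern,
# e.g. waiting on a cache miss) for the whole example.
stalled = [False, False, True, False]
last = 3
schedule = []
for _ in range(6):
    chosen = select_thread(last, stalled)
    schedule.append(chosen)
    if chosen is not None:
        last = chosen

print(schedule)  # [0, 1, 3, 0, 1, 3]
```

In this sketch the stalled thread is simply skipped, so the remaining ready threads share the core's issue slots.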

We ported McRT to run directly on the simulator. Thus, the results reflect true execution-driven simulation and accurately account for inter-thread synchronization. The simulator was modified to support system calls, while McRT provided all the threading services required by the application.

We used the popular open source MPEG4 encoder XviD (www.xvid.org*) and a set of RMS kernels [11] for Singular Value Decomposition (SVD) and Self Organizing Maps (SOM) as our workloads. We ran the XviD encoder mainly on 1920x1080 frames, which correspond to the frame sizes of emerging high-definition video. We show the performance for encoding the P frames since these (along with the B frames) are the computationally intensive parts of the encoding. The simulated cache size does not allow multiple frames to be encoded in parallel; therefore, we had to parallelize the encoding of a single frame. A frame is partitioned into "k" sub-blocks, where "k" is the number of logical processors used for encoding. Because the work given to each logical processor shrinks as "k" grows, the scalability of MPEG4 encoding is a good test of the efficiency of McRT's fine-grain threading support.
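To give a feel for how small the per-thread work becomes, the sketch below splits a frame into k contiguous sub-blocks. The text does not say how the sub-blocks are shaped, so the scheme here is an assumption: the frame is treated as a raster-ordered list of 16x16 macroblocks and divided into k nearly equal ranges. The function name `partition_macroblocks` is hypothetical.

```python
def partition_macroblocks(width, height, k, mb_size=16):
    """Split a frame's macroblocks into k contiguous, nearly equal ranges.

    Returns a list of (start, end) macroblock indices in raster order,
    with `end` exclusive. Frame edges are rounded up to whole macroblocks.
    """
    mb_cols = -(-width // mb_size)   # ceiling division
    mb_rows = -(-height // mb_size)
    total = mb_cols * mb_rows
    base, extra = divmod(total, k)
    ranges = []
    start = 0
    for i in range(k):
        # The first `extra` sub-blocks take one additional macroblock.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 1920x1080 frame split across 64 logical processors (16 cores x 4 threads).
ranges = partition_macroblocks(1920, 1080, 64)
sizes = [end - start for start, end in ranges]
print(len(ranges), min(sizes), max(sizes))  # 64 127 128
```

Under these assumptions each of the 64 logical processors handles only 127 or 128 macroblocks per frame, which is why threading overhead matters for this workload.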

SVD has numerous applications in the areas of data mining and feature extraction, signal processing, and automated control; this workload uses the Jacobi method. An SOM is an unsupervised learning method represented by a two-layer neural network. Typically, it is used to map N-dimensional data to two dimensions to discern patterns. It is extensively applied in text and feature mining, pattern recognition, and medical diagnostics.
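For readers unfamiliar with Jacobi-style SVD, here is a minimal sketch. The text does not say which Jacobi variant the RMS kernel uses; this example assumes the one-sided (Hestenes) form, which repeatedly rotates pairs of columns until all columns are mutually orthogonal. The function name, sweep limit, and tolerance are illustrative choices, and the code computes singular values only.

```python
import math


def jacobi_singular_values(a, sweeps=30, tol=1e-12):
    """Singular values of `a` (a list of rows) via one-sided Jacobi rotations."""
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # work on a copy
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                # Columns p and q are already (numerically) orthogonal.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True
                # Rotation angle that zeroes the dot product of the two columns.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break
    # After convergence the column norms are the singular values.
    norms = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)


sv = jacobi_singular_values([[4.0, 0.0], [3.0, -5.0]])
print(sv)  # approximately [sqrt(40), sqrt(10)]
```

Each rotation touches only two columns, so rotations on disjoint column pairs do not interfere with one another; this sketch nevertheless runs them sequentially.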



Figure 9: XviD speedup

Figure 9 shows the speedup of encoding a single frame as we increase the number of processors. The x-axis shows the number of cores. Note that for k cores, the number of HW threads is k*4. The graph uses the execution time on a single core (4 threads) as the baseline. Even at 16 cores (64 threads) we get an almost linear speedup (the "Linear" line in the graph represents the speedup expected if the application were completely parallelized). The speedup on the 1080P (1920x1080) frame is slightly higher than on the 768P (1024x768) frame since the sub-block sizes are larger, and hence the cost of threading is amortized over more work.
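The way the speedup curves are normalized can be sketched as follows. The timings in the example are hypothetical placeholders chosen to make the arithmetic easy to follow; they are not measurements from this study. The function name `speedup_table` and the per-core efficiency column (speedup divided by core count) are additions for illustration.

```python
THREADS_PER_CORE = 4  # each simulated core has four HW threads


def speedup_table(times_by_cores):
    """Normalize execution times against the single-core (4-thread) run.

    `times_by_cores` maps core count -> execution time (any consistent unit).
    Returns rows of (cores, hw_threads, speedup, efficiency), where
    efficiency is speedup divided by core count (1.0 means linear scaling).
    """
    baseline = times_by_cores[1]
    rows = []
    for cores in sorted(times_by_cores):
        speedup = baseline / times_by_cores[cores]
        rows.append((cores, cores * THREADS_PER_CORE, speedup, speedup / cores))
    return rows


# Hypothetical placeholder timings, NOT results from the article.
example_times = {1: 100.0, 2: 50.0, 4: 40.0}
rows = speedup_table(example_times)
for cores, threads, speedup, efficiency in rows:
    print(cores, threads, speedup, efficiency)
```

With these placeholder numbers, the 4-core row shows a speedup of 2.5 on 16 HW threads, which would fall below the "Linear" line; the article's measured curves stay close to it.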

Figure 10 shows the speedup for the RMS workloads. (The x-axis represents the number of cores, and the baseline is the execution time on a single core.) Both SVD and SOM scale almost perfectly up to 64 HW threads.



Figure 10: RMS speedup
