Technology & Research

Intel® Technology Journal Home

Volume 11, Issue 03

Tera-scale Computing


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1103.05

  • Volume 11
  • Issue 03
  • Published August 22, 2007

Tera-scale Computing

  Section 5 of 9  

Architectural Support for Fine-Grained Parallelism on Multi-core Architectures

EXPERIMENTAL EVALUATION

Benchmarks

We evaluate our proposed hardware on benchmarks from a key emerging application domain: RMS. All benchmarks were parallelized within our lab. Table 1 and Table 2 give the benchmarks and their data sets.

Loop-level parallelism: We use primitive matrix operations and Gauss-Seidel as a set of benchmarks with loop-level parallelism since these are both very common in RMS applications and very useful for a wide range of problem sizes. Most of these benchmarks are standard operations and require little explanation. The sparse matrices are encoded in compressed row format. Gauss-Seidel iteratively solves a boundary value problem with finite differencing using a red-black Gauss-Seidel algorithm. These benchmarks are straightforward to parallelize; each parallel loop simply specifies a range of indices and the granularity of tasks. We evaluate each benchmark with several problem sizes to show the sensitivity of performance to problem sizes.

Task-level parallelism: We use modules from full RMS applications as a set of benchmarks with task-level parallelism. Task-level parallelism is more general than loop-level parallelism where each parallel section starts with a set of initial tasks and any task may enqueue other tasks. These benchmarks represent a set of common modules across the RMS domain. Some of the benchmarks are based on publicly available code, and the remaining ones are based on well-known algorithms. These benchmarks are as follows:

  1. The Binomial Tree uses a 1D binomial tree to price a single option. Given a tree of asset prices, the algorithm derives the value of an option at time 0 (that is, now) by starting at time T (that is, the expiration date) and iteratively stepping "backward" toward t=0 in a discrete number of time steps, N. At each time step t, it computes the corresponding value of the option Vi at each node of the tree i. The number of exposed tasks (i.e., nodes) is small at any time step and the task size is small. Hence, task queue overhead must be small to effectively exploit the available parallelism.
  2. The Game Physics constraint solver iteratively solves a set of force equations in a game physics constraint solver. For inputs that have few bodies and constraints, the amount of parallelism is limited, especially for a large number of cores.
  3. Cholesky, Backward Solve, and Forward Solve are operations on sparse matrices. Cholesky performs Cholesky factorization. Backward Solve and Forward Solve perform backward and forward triangular solve on a sparse matrix, respectively. These solvers use a data structure called elimination tree that encodes the dependency between tasks. The irregularity in sparse matrices results in high variation of task size. Therefore, we need very efficient task management to achieve good load balancing.
  4. The Canny Edge Detection computes an edge mask for an image using the Canny edge detection algorithm. It first finds a group of edge candidates and determines whether their neighbors are likely to be edges. If so, the algorithm adds them to the candidate list and recursively checks their neighbors for candidacy. The number of tasks correlates directly to the number of candidates and tends to be small. Further, the amount of work in checking for candidacy is also very small.



Figure 6: Performance of loop-level benchmarks
click image for larger view
 



Figure 7: Performance of task-level benchmarks
click image for larger view
 

Results

Figures 6 and 7 show the performance benefit of our proposed hardware for the loop-level and the task-level benchmarks, respectively, when running with 64 cores. In particular, the hardware proposal is compared with the best optimized software implementations and an idealized implementation (Ideal) in which tasks bypass the LTUs and are sent directly to/from GTU with zero interconnect latency. Additionally, the GTU processes these tasks instantly without any latency. The optimized software implementation uses a combination of widely used state of the art techniques [2, 3] that deliver the best performance.

The graphs represent the speedup over the one-thread execution using the Ideal implementation. For each benchmark, they show multiple bars. Each bar corresponds to a different data set shown in Tables 1 and 2.

For the loop-level benchmarks in Figure 6, the proposed hardware executes 88% faster on average than the optimized software implementation and only 3% slower than Ideal.

For the task-level benchmarks in Figure 7, on average the proposed hardware is 98% faster compared to the best software version and is within 2.7% of Ideal. For Game Physics with one data set, and Forward Solve with another data set, the amount of parallelism available is very limited. In the software implementations, the cores contend with each other to grab the few available tasks, which adversely impacts performance.

  Section 5 of 9  

Back to Top

In this article

Download a PDF of this article.