Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.04

  • Volume 11
  • Issue 04
  • Published November 15, 2007

  Section 4 of 12  

Intel® Performance Libraries: Multi-Core-Ready Software for Numeric-Intensive Computation

LAPACK

In the previous section we discussed the factors that go into optimizing functions and showed, using LU factorization (DGETRF) as an example, how choosing the right level for parallelization can substantially affect parallel performance as the number of cores increases. MKL provides threaded, optimized versions of many of the most important LAPACK functions. The underlying problem is usually the same: how to keep the arithmetic units fed. That translates into getting data into the caches and then reusing it often enough to bridge the substantial gap between the rate at which the floating-point hardware consumes data and the rate at which the memory subsystem supplies it.



Figure 4: LAPACK vs. BLAS-level threading

LAPACK largely replaced LINPACK and employs blocked algorithms instead of LINPACK's vector algorithms, which makes it much better suited to cache-based architectures. However, there are many places where LAPACK code can go further and use Level 3 BLAS [4] instead of lower-level functions. Doing so improves cache usage, which in turn improves parallel performance, including performance on multi-core systems. We provide several examples of how increasing the use of higher-level BLAS substantially improves the performance of the MKL implementation of LAPACK over the reference implementation.
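The cache argument can be made concrete with a rough operation count. The sketch below is not from the article: it uses textbook flop and data-element counts for one representative routine per BLAS level, and the function name blas_intensity is a hypothetical helper introduced only for illustration.

```python
def blas_intensity(n):
    """Flops per data element touched, for one routine at each BLAS level.

    Counts are textbook estimates for problem size n, not measurements.
    """
    return {
        # Level 1, DAXPY (y = a*x + y): 2n flops; read x and y, write y.
        "level1_daxpy": (2 * n) / (3 * n),
        # Level 2, DGEMV (y = A*x + y): 2n^2 flops; the n*n matrix dominates.
        "level2_dgemv": (2 * n * n) / (n * n + 3 * n),
        # Level 3, DGEMM (C = A*B + C): 2n^3 flops; read A, B, C and write C.
        "level3_dgemm": (2 * n ** 3) / (4 * n * n),
    }


if __name__ == "__main__":
    for name, ratio in blas_intensity(64).items():
        print(f"{name}: {ratio:.2f} flops per element")
```

Under these counts the Level 1 and Level 2 ratios stay bounded by a small constant, while the Level 3 ratio grows linearly with n. That growth is what lets a blocked algorithm reuse data already in cache.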

An important class of linear algebra computations is the double-sided decompositions, such as the singular value decomposition (SVD) and the symmetric eigenvalue decomposition. In MKL we block the chains of plane rotations using Level 3 BLAS, which improves performance remarkably. Figure 5 compares the resulting threaded symmetric eigensolver DSYEV against the reference implementation, showing improvements of up to approximately 18x. In this chart, the MKL performance [1] is measured with eight threads, computing all eigenvectors.
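The idea of blocking a chain of plane rotations can be sketched in a few lines. This is an illustration only, not MKL's implementation: the helper names are hypothetical, the triple-loop matmul merely stands in for a Level 3 BLAS call such as DGEMM, and the accumulator Q spans all rows for simplicity, whereas a practical blocked code would keep it small.

```python
import math


def apply_rotation(M, i, k, c, s):
    """Apply one plane (Givens) rotation to rows i and k of M, in place."""
    for j in range(len(M[0])):
        a, b = M[i][j], M[k][j]
        M[i][j] = c * a + s * b
        M[k][j] = -s * a + c * b


def matmul(X, Y):
    """Plain triple-loop product; stands in for a Level 3 BLAS call."""
    return [[sum(X[i][p] * Y[p][j] for p in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def apply_one_by_one(A, rotations):
    """Vector style: every rotation sweeps two full rows of A."""
    B = [row[:] for row in A]
    for i, k, theta in rotations:
        apply_rotation(B, i, k, math.cos(theta), math.sin(theta))
    return B


def apply_accumulated(A, rotations):
    """Blocked style: accumulate the chain into Q, then form Q*A once."""
    m = len(A)
    Q = [[1.0 if r == col else 0.0 for col in range(m)] for r in range(m)]
    for i, k, theta in rotations:
        apply_rotation(Q, i, k, math.cos(theta), math.sin(theta))
    return matmul(Q, A)
```

Both functions compute the same product of rotations applied to A. The second touches A only once, in a matrix-matrix multiply, which is the form that benefits from cache reuse and threading.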

A second example of a blocked implementation that allows the use of higher-level BLAS is the set of routines operating on packed storage format. This optimization requires additional workspace of size N*NB (where N is the size of the problem and NB is the block size, usually around 64). Such workspace is common in other LAPACK functions, and its cost in memory is small.
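To show why a workspace of N*NB elements is enough, here is a minimal sketch of lower packed, column-major storage and of copying one block of NB columns into a dense panel. The article does not describe how MKL uses its workspace, so this layout is an assumption for illustration, and packed_index, pack_lower, and unpack_panel are hypothetical helpers rather than LAPACK or MKL routines.

```python
def packed_index(i, j, n):
    """0-based position of A[i][j] (i >= j) in lower packed, column-major storage."""
    return i + j * (2 * n - j - 1) // 2


def pack_lower(A):
    """Store the lower triangle of A column by column in one flat list."""
    n = len(A)
    return [A[i][j] for j in range(n) for i in range(j, n)]


def unpack_panel(ap, n, j0, nb):
    """Copy columns j0 .. j0+nb-1 of the packed lower triangle into a dense
    n-by-nb column-major workspace of n*nb elements (zeros above the diagonal)."""
    work = [0.0] * (n * nb)
    for jj in range(min(nb, n - j0)):
        j = j0 + jj
        for i in range(j, n):
            work[i + jj * n] = ap[packed_index(i, j, n)]
    return work
```

Once a panel sits in ordinary dense storage with a fixed leading dimension, it can be handed to routines that expect full-format matrices, at a memory cost of N*NB elements rather than the N*N of a full unpacked copy.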



Figure 5: DSYEV improvements via Level 3 BLAS

For the Cholesky solver on packed storage format, the performance improvement is again around 18x on the same system as for DSYEV, as shown in Figure 6.

While restructuring the LAPACK code to use Level 3 BLAS improves performance markedly, more advanced techniques are needed to reduce the dependencies on sequential code that remain even after Level 3 BLAS is employed.



Figure 6: Packed-format Cholesky factorization



Figure 7: DGETRF-level versus BLAS-level threading

In functions such as LU and QR factorization [5], a look-ahead technique allows the next block factorization to begin before the matrix has been fully updated, which increases concurrency. Figure 7 compares the DGETRF performance of MKL with that of the netlib reference implementation on an 8-core system. As the chart shows, optimizations in MKL improve performance over the reference implementation even on one thread.
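The ordering behind look-ahead can be sketched as follows. This is a simplified, sequential illustration and not the MKL algorithm: it omits the partial pivoting that DGETRF performs (so it assumes a matrix, such as a diagonally dominant one, that is safe to factor without pivoting), and factor_panel, update_columns, and lu_lookahead are hypothetical names.

```python
def factor_panel(A, k, kb):
    """Unblocked LU without pivoting of the panel A[k:, k:k+kb], in place."""
    n = len(A)
    for j in range(k, k + kb):
        for i in range(j + 1, n):
            A[i][j] /= A[j][j]                    # multiplier, stored as L
            for c in range(j + 1, k + kb):
                A[i][c] -= A[i][j] * A[j][c]


def update_columns(A, k, kb, c0, c1):
    """Apply the factored panel k to columns c0 .. c1-1, in place."""
    n = len(A)
    for c in range(c0, c1):
        # Forward substitution with the unit-lower block gives the U entries.
        for i in range(k, k + kb):
            for p in range(k, i):
                A[i][c] -= A[i][p] * A[p][c]
        # Trailing update A22 -= L21 * U12 (a matrix-matrix product).
        for i in range(k + kb, n):
            for p in range(k, k + kb):
                A[i][c] -= A[i][p] * A[p][c]


def lu_lookahead(A, nb):
    """Blocked LU that factors the next panel before the trailing update ends."""
    n = len(A)
    A = [row[:] for row in A]
    factor_panel(A, 0, min(nb, n))
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        nxt = k + kb
        if nxt >= n:
            break
        nkb = min(nb, n - nxt)
        update_columns(A, k, kb, nxt, nxt + nkb)  # update the next panel first
        factor_panel(A, nxt, nkb)                 # look-ahead factorization
        update_columns(A, k, kb, nxt + nkb, n)    # rest of the trailing update
    return A
```

The last two calls in the loop touch disjoint columns: factoring the next panel writes only that panel, while the remaining update reads panel k and writes the columns to its right. In a threaded implementation those two steps could therefore run concurrently, which is the source of the added parallelism; here they simply run one after the other.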

[1] 2.4 GHz, Dual-socket, Intel® Xeon® processor 5300 Series 1067 MHz front-side bus. 2x4 MB L2 cache.

[2] Stream 1: preakness_59.94fps_Xvid_4Mbs_CBR.avi; Stream 2: Boss.avi; Stream 3: Taxi.avi; Stream 4: Term2.divx.
