
Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.01

  • Volume 11
  • Issue 04
  • Published November 15, 2007

  Section 8 of 12  

Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors

PERFORMANCE RESULTS

In this section we validate the performance of the new threadizer and vectorizer using SPEC* CPU2006, an industry-standard suite of compute-intensive benchmarks. Its CINT2006 suite comprises 12 integer C and C++ benchmarks, and its CFP2006 suite comprises 17 floating-point Fortran, C, and C++ benchmarks, all derived from real-life applications that have up to 932,818 lines of code. The SPEC CPU2006 benchmarks are widely used and are considered representative of a wide spectrum of application domains. The multi-core system used to measure performance is configured with two 2.67 GHz Intel® Core™2 Quad processors with a 4 MB L2 cache and 8 GB of RAM, and runs SuSE Linux*.



Figure 3: SPEC CPU2006 speedup estimates with auto-threadizer based on internal measurements
 

To evaluate the effectiveness of the new threadizer, we first measured the baseline performance with the option -fast (i.e., -ipo -O3 -xT -no-prec-div -static). We then added the -parallel switch and measured the speedup over this fully optimized baseline. Figure 3 shows the resulting speedup delivered by the auto-threadizer for each benchmark in the SPEC CFP2006 suite; the last column shows the geometric mean (geomean) gain of 15.45% across all benchmarks. Even though the default base optimizations already deliver acceptable performance, auto-threadization in the Intel® C++/Fortran compiler boosts the performance of a number of benchmarks substantially, up to a 2.52x speedup for 436.cactusADM. No benchmark suffered a noticeable slowdown due to the auto-threadizer.
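For readers unfamiliar with the geomean summary, the following minimal Python sketch shows how a set of per-benchmark speedup ratios can be reduced to a single geometric-mean figure and expressed as a percentage gain. The function names and the sample ratios are hypothetical and chosen only for illustration; they are not the measured values behind Figure 3.

```python
import math


def geomean(speedups):
    """Geometric mean of positive speedup ratios (1.0 means no change)."""
    if not speedups:
        raise ValueError("need at least one speedup ratio")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedup ratios must be positive")
    # Averaging the logarithms avoids overflow from a long running product.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def percent_gain(speedup):
    """Express a speedup ratio as a percentage gain over the baseline."""
    return (speedup - 1.0) * 100.0


# Hypothetical ratios, NOT the measured data from the article.
example_speedups = [1.00, 1.00, 2.00, 0.50]
example_summary = geomean(example_speedups)  # a 2x gain and a 2x loss cancel out
```

The geometric mean is the usual choice for summarizing ratios because a 2x speedup on one benchmark and a 2x slowdown on another cancel exactly, which is not the case for an arithmetic mean.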

Automatically converting a sequential program into threaded code is becoming an increasingly important technique for leveraging multi-core platforms transparently. Besides the gain for SPEC CFP2006, the auto-threadizer delivered a 12.17% gain (geomean) for SPEC CINT2006 on top of fully optimized serial code when using the -parallel and -par-runtime-control options, which contributed to a 4.63x speedup for 462.libquantum.



Figure 4: SPEC CPU2006 speedup estimates with auto-vectorizer based on internal measurements
 

Vectorization also accounts for a significant part of the performance improvements. To evaluate the effectiveness of the new vectorizer, we first measured the baseline performance using -fast with the vectorizer turned off (fast_xT_novec). We then measured the performance with the vectorizer enabled (fast_xT) to obtain the speedup over fast_xT_novec. Figure 4 shows the resulting speedup delivered by the auto-vectorizer for each benchmark in the SPEC CFP2006 suite; the last column shows the 5.11% geomean gain. Even though the baseline optimizations already provide high performance, the auto-vectorizer in the Intel C++/Fortran compiler boosts the performance of a number of benchmarks substantially, up to a 1.29x speedup for 436.cactusADM. Although vectorization generally favors floating-point applications, the advanced code generation also makes a noticeable contribution to integer applications: a 33.6% gain. In other cases, experience shows that it makes performance less sensitive to minor changes in the generated code.
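The comparison between two build configurations can be sketched as a simple ratio of run times. The Python snippet below is an illustration only: the benchmark labels and run times are invented placeholders, and the assumption that speedup is computed as baseline time divided by optimized time is ours rather than a statement of the exact procedure used for Figure 4.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup of the optimized build over the baseline (greater than 1.0 is faster)."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("run times must be positive")
    return baseline_seconds / optimized_seconds


def compare_configs(baseline_times, optimized_times):
    """Per-benchmark speedups for benchmarks present in both result sets."""
    return {
        name: speedup(baseline_times[name], optimized_times[name])
        for name in baseline_times
        if name in optimized_times
    }


# Hypothetical run times in seconds, NOT measurements from the article.
fast_xT_novec = {"bench_a": 129.0, "bench_b": 200.0}  # vectorizer off
fast_xT = {"bench_a": 100.0, "bench_b": 200.0}        # vectorizer on

per_benchmark = compare_configs(fast_xT_novec, fast_xT)
```

In this made-up data set, `bench_a` runs in 100 s instead of 129 s, a 1.29x speedup, while `bench_b` is unchanged at 1.0x.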
