Multi-Core Software
Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors
PERFORMANCE RESULTS
In this section we validate the performance of the new threadizer and vectorizer using the industry-standard, computationally intensive SPEC* CPU2006 benchmark suite. Its CINT2006 suite comprises 12 integer C and C++ benchmarks, and its CFP2006 suite comprises 17 floating-point Fortran, C, and C++ benchmarks, all derived from real-life applications with up to 932818 lines of code. The SPEC CPU2006 benchmarks are widely used and considered representative of a wide spectrum of application domains. The multi-core system used to measure performance is configured with two 2.67 GHz Intel® Core™2 Quad processors with a 4M L2 cache and 8 GB of RAM, running SuSE Linux*.

Figure 3: SPEC CPU2006 speedup estimates with auto-threadizer based on internal measurements
To evaluate the effectiveness of the new threadizer, we first measured the baseline performance with the -fast option (i.e., -ipo -O3 -xT -no-prec-div -static). We then added the -parallel switch and measured the speedup over this fully optimized baseline. The contribution from threadization is shown in Figure 3, which plots the speedup delivered by the auto-threadizer for the benchmarks in the SPEC CFP2006 suite. The last column shows the geometric mean (geomean) gain of 15.45% across all benchmarks. Even though the baseline optimizations already deliver good performance, auto-threadization in the Intel® C++/Fortran compiler substantially boosts the performance of a number of benchmarks, with a speedup of up to 2.52x for 436.cactusADM. No benchmark suffered a noticeable slowdown due to the auto-threadizer.
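The geomean gain quoted above can be related to the per-benchmark speedups as follows. This is a sketch that assumes the conventional definition of the geometric mean; the symbols $s_i$ (speedup of benchmark $i$, i.e., baseline time divided by threaded time), $n$ (number of benchmarks), and $G$ are notation introduced here and do not appear in the original measurements.

$$G = \left(\prod_{i=1}^{n} s_i\right)^{1/n}, \qquad \text{gain} = (G - 1) \times 100\%$$

Under this reading, a 15.45% geomean gain corresponds to $G \approx 1.1545$. Because the geometric mean multiplies the speedups rather than adding them, a single large speedup such as 2.52x raises $G$ less than it would raise an arithmetic mean.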
Automatically converting a sequential program into threaded code is becoming an increasingly important technique for exploiting multi-core platforms transparently. Beyond the gain for SPEC CFP2006, the auto-threadizer delivered a 12.17% geomean gain for SPEC CINT2006 on top of fully optimized serial code with the -parallel and -par-runtime-control options, which contributed to a 4.63x speedup for 462.libquantum.
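To make the kind of code an auto-threadizer can target more concrete, the following is a minimal sketch in C. It is not taken from SPEC CPU2006 or from the compiler itself; the function names `scale_add` and `prefix_sum` are hypothetical. The first loop has independent iterations and is the sort of candidate a parallelizing compiler may split across threads; the second carries a dependence from one iteration to the next, so it cannot be threaded as written.

```c
#include <stddef.h>

/* Hypothetical kernel: each iteration reads a[i] and b[i] and writes only
 * out[i], so no iteration depends on another. The C99 restrict qualifiers
 * promise the compiler that the three arrays do not overlap, which is the
 * kind of information dependence analysis needs.
 * Assumed build line, based on the options named in the text:
 *     icc -std=c99 -fast -parallel kernel.c
 */
void scale_add(size_t n, float alpha,
               const float *restrict a,
               const float *restrict b,
               float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = alpha * a[i] + b[i];
    }
}

/* Counter-example: x[i] depends on x[i - 1], a loop-carried dependence.
 * Iterations must run in order, so this loop is not a candidate for
 * straightforward auto-threadization. */
void prefix_sum(size_t n, float *x)
{
    for (size_t i = 1; i < n; ++i) {
        x[i] += x[i - 1];
    }
}
```

Whether a given compiler actually threads the first loop depends on its profitability analysis, including trip count and per-iteration work, so treat this as an illustration of the dependence property rather than a guarantee of speedup.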

Figure 4: SPEC CPU2006 speedup estimates with auto-vectorizer based on internal measurements
Vectorization also accounts for a significant part of the performance improvements. To evaluate the effectiveness of the new vectorizer, we first measured the baseline performance using -fast with the vectorizer turned off (fast_xT_novec). We then measured the performance with the vectorizer enabled (fast_xT) to obtain the speedup over fast_xT_novec. The contribution from vectorization is shown in Figure 4, which plots the speedup delivered by the auto-vectorizer for the benchmarks in the SPEC CFP2006 suite. The last column shows the 5.11% geomean gain. Even though the baseline optimizations already provide high performance, the auto-vectorizer of the Intel C++/Fortran compiler substantially boosts the performance of a number of benchmarks, with a speedup of up to 1.29x for 436.cactusADM. Although vectorization generally favors floating-point applications, the advanced code generation also makes a noticeable contribution to integer applications: a 33.6% gain. In other cases, experience shows that it makes performance less sensitive to minor changes in the generated code.
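As an illustration of loops that an auto-vectorizer can typically handle, here is a minimal sketch in C with one floating-point and one integer example. Neither function comes from the benchmarks; `dot` and `sum_abs_diff` are hypothetical names chosen for this example. Both loops access memory with unit stride and accumulate into a single scalar, a reduction pattern that vectorizers commonly recognize.

```c
#include <stddef.h>
#include <stdint.h>

/* Floating-point reduction: unit-stride loads and one accumulator.
 * A vectorizer may keep several partial sums in a SIMD register and
 * combine them after the loop, which changes the order of additions;
 * this is one reason relaxed floating-point settings matter here. */
float dot(size_t n, const float *restrict x, const float *restrict y)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Integer reduction: sum of absolute differences over 8-bit data.
 * Narrow integer elements let more values fit in one SIMD register,
 * so integer loops of this shape can also benefit from vectorization. */
uint32_t sum_abs_diff(size_t n,
                      const uint8_t *restrict p,
                      const uint8_t *restrict q)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = (int)p[i] - (int)q[i];
        total += (uint32_t)(d < 0 ? -d : d);
    }
    return total;
}
```

The comparison described in the text amounts to building such code twice, once with the vectorizer disabled and once with it enabled, and taking the ratio of the run times.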
