Technology & Research

Intel® Technology Journal Home

Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.01

  • Volume 11
  • Issue 04
  • Published November 15, 2007

Multi-Core Software

  Section 2 of 12  

Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors

INTRODUCTION

The aggressive delivery of Intel® multi-core processors to the mass computer market shows that, as the performance improvements from continuously increasing clock frequencies start to taper off, other architectural advances that reduce latency or increase memory bandwidth are gaining importance [9]. In particular, since packaging densities are still growing, integrating multiple processors on a single die and using SIMD extensions are becoming more widespread [1]. The Intel® Core™2 Duo and Quad processors are equipped with a rich set of micro-architectural and architectural features to boost performance:

  • dual-core or quad-core on a single chip
  • wider execution units for Streaming SIMD Extensions (SSE, SSE2, SSE3)
  • a set of new instructions referred to as Supplemental Streaming SIMD Extensions 3 (SSSE3)
  • advanced smart shared L2 cache among cores on the same chip

Due to the complexity of modern processors, compiler support has become an important part of obtaining higher performance. Most importantly, to assist programmers in leveraging all parallel capabilities of Intel's new processors, the Intel® C++/Fortran compiler provides an essential tool for unleashing the power of Intel® multi-core processors and SIMD instructions by means of high-level optimizations and advanced code generation.

The Intel® compilers perform automatic optimizations of programs using threadization [10], vectorization [1, 2, 5], classical loop transformations (e.g., distribution, unrolling, interchange, fusion) [7, 11, 12], scalar optimizations such as constant propagation, Partial Dead Store Elimination (PDSE), Partial Redundancy Elimination (PRE), copy propagation, Inter-Procedural Optimizations (IPO) [7], and advanced machine code generation techniques that together yield a significant performance gain compared to the default level of optimization. The contributions of the new threadizer and vectorizer are as follows:

  • The new threadizer yields up to 4.63x speedup (with 8 cores) by exploiting thread-level parallelism from a serial program in the SPEC* CPU2006 benchmark suites. Overall, the auto-threadization delivers a 15.45% gain (geomean with 8 cores) for SPEC CFP2006 suite and a 12.17% gain (geomean with 8 cores) for SPEC CINT2006 suite.
  • The new vectorizer yields up to 1.28x performance speedup by exploiting SIMD-type vector parallelism from a serial program in the SPEC CPU2006 suites. Overall, the auto-vectorization delivers a 5.11% gain (geomean) for SPEC CFP2006 suite and a 2.01% gain (geomean) for SPEC CINT2006 suite.

The rest of this paper is organized as follows. First, we provide some basics on the Intel® Core™ micro-architecture. Then, we discuss the design and implementation of the new threadizer and vectorizer, respectively, inside the Intel® 10.1 compilers. Subsequently, we discuss the loop optimizations and enhancements made to support efficient threadization and vectorization. We also present an overview of advanced code generation for the Intel Core 2 Duo and Quad processors. Finally, we provide performance results using the SPEC CPU2006 industry-standard benchmark suite built with the Intel 10.1 C++ and FORTRAN compilers.

  Section 2 of 12  

Back to Top

In this article

Download a PDF of this article.