Volume 11, Issue 04

Multi-Core Software


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1104.01

  • Published November 15, 2007

  Section 7 of 12  

Inside the Intel® 10.1 Compilers: New Threadizer and New Vectorizer for Intel® Core™2 Processors

ADVANCED CODE GENERATION

The Intel® compiler uses its detailed knowledge of the Intel® micro-architecture to guide instruction selection tradeoffs. It favors efficient instructions and instruction forms and avoids inefficient instruction sequences. In addition, it applies a restricted form of instruction scheduling to enhance performance.

Instruction Selection

The bit test instruction bt was introduced in the i386™ processor. In some implementations, including the Intel NetBurst® micro-architecture, the instruction has a high latency. The Intel® Core™ micro-architecture executes bt in a single cycle when the bit base operand is a register. Therefore, the Intel® C++/Fortran compiler uses the bt instruction to implement a common bit test idiom when optimizing for the Intel Core micro-architecture. The optimized code runs about 20% faster than the generic version on an Intel® Core™2 Duo processor. Both versions are shown below:

C source code

    int x, n;
    ...
    if (x & (1 << n)) ...

Generic code generation

    ; edx contains x, ecx contains n.
    mov  eax, 1
    shl  eax, cl
    test edx, eax
    je   taken

Intel Core micro-architecture code generation

    ; edx contains x, eax contains n.
    bt   edx, eax
    jae  taken
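For readers who want to experiment with the idiom, here is a minimal C sketch of the bit test in two equivalent forms. The function names are hypothetical and not part of the article. The sketch uses unsigned operands and reduces the bit number modulo 32 so that the C shifts are always defined, which also matches how bt treats a register bit base. Whether a given compiler emits bt for either form is an assumption that should be checked in the generated assembly.

```c
#include <stdint.h>

/* Shift-and-mask form of the idiom: build a one-bit mask, then AND it
   with x. The bit number is reduced modulo 32 so the shift count is
   always in range for a 32-bit operand. */
int bit_is_set_mask(uint32_t x, unsigned n)
{
    return (x & (UINT32_C(1) << (n & 31u))) != 0;
}

/* Equivalent form: move bit n down to position 0 and isolate it. */
int bit_is_set_shift(uint32_t x, unsigned n)
{
    return (int)((x >> (n & 31u)) & 1u);
}
```

Both functions return 1 when bit n of x is set and 0 otherwise, so either can stand in for the condition in the C source above.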

Variable-length instructions pose a challenge to the processor's instruction decoder, which must identify where one instruction ends and the next begins. Some instruction prefixes change the length of their instructions and cause a significant decoder stall in the Intel Core micro-architecture. Integer instructions that take immediate arguments and use the operand size override prefix 0x66 suffer from this penalty, because the size of the immediate operand is changed by the prefix. The compiler avoids these instructions, as shown below:

C source code

    short *p;
    ...
    *p &= 0x5555;

Generic code generation

    ; The and instruction encodes
    ; as hex 66 81 20 55 55.
    ; The immediate is 2 bytes.
    mov   eax, DWORD PTR _p
    and   WORD PTR [eax], 0x5555

Intel Core micro-architecture code generation

    ; The and instruction encodes
    ; as hex 25 55 55 00 00.
    ; The immediate is 4 bytes.
    mov   edx, DWORD PTR _p
    movzx eax, WORD PTR [edx]
    and   eax, 0x5555
    mov   WORD PTR [edx], ax
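The same transformation can be sketched at the C level: widen the 16-bit value to 32 bits, apply the mask with a 32-bit constant, and store the low 16 bits back. This is only an illustration of why the two sequences are equivalent. The function names are hypothetical, unsigned types are used for clarity, and the instructions a compiler actually selects for either function are not guaranteed.

```c
#include <stdint.h>

/* Direct form: a 16-bit read-modify-write with a 16-bit constant. */
void and16_direct(uint16_t *p)
{
    *p &= 0x5555u;
}

/* Widened form, mirroring the movzx / and / mov sequence. */
void and16_widened(uint16_t *p)
{
    uint32_t wide = *p;      /* zero-extend to 32 bits, like movzx      */
    wide &= 0x5555u;         /* 32-bit and with a 32-bit constant       */
    *p = (uint16_t)wide;     /* write back only the low 16 bits         */
}
```

Because the upper 16 bits of the widened value are zero before and after the and, discarding them on the store leaves the same result in memory as the direct form.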

The vector unpack low instructions are convenient for gather and broadcast operations, which occur frequently in vector code. With the exception of the 64-bit to 128-bit instructions punpcklqdq and unpcklpd, unpack instructions are costly in the Intel® Core™ micro-architecture compared to alternative code sequences. The Intel® C++/Fortran compiler therefore favors those alternatives when optimizing for the Intel Core micro-architecture. Two examples are given below:

Example I: Broadcast the least-significant single-precision floating-point vector element.

Generic code generation

    unpcklps xmm0, xmm0
    unpcklps xmm0, xmm0

Intel Core micro-architecture implementation

    movsldup xmm0, xmm0
    movlhps  xmm0, xmm0
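To check that the two broadcast sequences agree, the lane movements can be modeled in portable C. The sketch below is a model, not SIMD code: vec4 and the helper functions are hypothetical names that mimic the documented lane behavior of unpcklps, movsldup, and movlhps on four 32-bit lanes, with lane[0] as the least-significant element.

```c
#include <stdint.h>

typedef struct { uint32_t lane[4]; } vec4;   /* lane[0] is least significant */

/* unpcklps dst, src: interleave the low two lanes of dst and src. */
vec4 unpcklps(vec4 d, vec4 s)
{
    vec4 r = {{ d.lane[0], s.lane[0], d.lane[1], s.lane[1] }};
    return r;
}

/* movsldup dst, src: duplicate the even-numbered lanes of src. */
vec4 movsldup(vec4 s)
{
    vec4 r = {{ s.lane[0], s.lane[0], s.lane[2], s.lane[2] }};
    return r;
}

/* movlhps dst, src: copy the low 64 bits of src into the high 64 bits of dst. */
vec4 movlhps(vec4 d, vec4 s)
{
    vec4 r = {{ d.lane[0], d.lane[1], s.lane[0], s.lane[1] }};
    return r;
}

/* Generic sequence: two unpcklps steps. */
vec4 broadcast_generic(vec4 v)
{
    v = unpcklps(v, v);          /* {a0, a0, a1, a1} */
    return unpcklps(v, v);       /* {a0, a0, a0, a0} */
}

/* Intel Core sequence: movsldup followed by movlhps. */
vec4 broadcast_core(vec4 v)
{
    v = movsldup(v);             /* {a0, a0, a2, a2} */
    return movlhps(v, v);        /* {a0, a0, a0, a0} */
}
```

Both functions leave lane 0 of the input in all four lanes. The model says nothing about cost; it only confirms that the replacement sequence computes the same value.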

Example II: Gather four single-precision floating-point elements from locations 128 bytes apart.

Generic code generation

    movss    xmm3, [eax]
    movss    xmm2, [128+eax]
    movss    xmm0, [256+eax]
    movss    xmm1, [384+eax]
    unpcklps xmm3, xmm0
    unpcklps xmm2, xmm1
    unpcklps xmm3, xmm2

Intel Core micro-architecture implementation

    movss    xmm2, [eax]
    movss    xmm3, [128+eax]
    movss    xmm0, [256+eax]
    movss    xmm1, [384+eax]
    unpcklpd xmm2, xmm0
    unpcklpd xmm3, xmm1
    psllq    xmm3, 32
    orps     xmm3, xmm2
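The gather sequences are harder to follow by eye, so a lane-level model in portable C may help. As before, this is a sketch under stated assumptions: vec4 and the helpers are hypothetical names that imitate the documented behavior of movss (load form), unpcklps, unpcklpd, psllq by 32, and orps. Elements are treated as raw 32-bit patterns, and a stride of 32 array elements stands in for the 128-byte spacing in the example.

```c
#include <stdint.h>

typedef struct { uint32_t lane[4]; } vec4;   /* lane[0] is least significant */

/* movss xmm, [mem]: load one element into lane 0 and zero the rest. */
vec4 movss_load(const uint32_t *p)
{
    vec4 r = {{ *p, 0u, 0u, 0u }};
    return r;
}

/* unpcklps dst, src: interleave the low two 32-bit lanes. */
vec4 unpcklps(vec4 d, vec4 s)
{
    vec4 r = {{ d.lane[0], s.lane[0], d.lane[1], s.lane[1] }};
    return r;
}

/* unpcklpd dst, src: low 64 bits of dst, then low 64 bits of src. */
vec4 unpcklpd(vec4 d, vec4 s)
{
    vec4 r = {{ d.lane[0], d.lane[1], s.lane[0], s.lane[1] }};
    return r;
}

/* psllq xmm, 32: shift each 64-bit half left by 32 bits. */
vec4 psllq32(vec4 v)
{
    vec4 r = {{ 0u, v.lane[0], 0u, v.lane[2] }};
    return r;
}

/* orps dst, src: bitwise OR of all lanes. */
vec4 orps(vec4 d, vec4 s)
{
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = d.lane[i] | s.lane[i];
    return r;
}

/* Generic sequence: three unpcklps steps. */
vec4 gather_generic(const uint32_t *p)
{
    vec4 x3 = movss_load(p),      x2 = movss_load(p + 32);
    vec4 x0 = movss_load(p + 64), x1 = movss_load(p + 96);
    x3 = unpcklps(x3, x0);        /* {e0, e2, 0, 0}   */
    x2 = unpcklps(x2, x1);        /* {e1, e3, 0, 0}   */
    return unpcklps(x3, x2);      /* {e0, e1, e2, e3} */
}

/* Intel Core sequence: unpcklpd, shift, and OR. */
vec4 gather_core(const uint32_t *p)
{
    vec4 x2 = movss_load(p),      x3 = movss_load(p + 32);
    vec4 x0 = movss_load(p + 64), x1 = movss_load(p + 96);
    x2 = unpcklpd(x2, x0);        /* {e0, 0, e2, 0}   */
    x3 = unpcklpd(x3, x1);        /* {e1, 0, e3, 0}   */
    x3 = psllq32(x3);             /* {0, e1, 0, e3}   */
    return orps(x3, x2);          /* {e0, e1, e2, e3} */
}
```

The second sequence works because the movss load zeroes the upper lanes, so the shifted odd elements and the unshifted even elements never overlap when they are ORed together.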

The conditional move instruction cmovCC presents an interesting dilemma for the compiler. It can achieve dramatic performance improvements when it replaces a poorly predicted branch. On the other hand, replacing a branch with cmovCC may lengthen the critical path and cause a slowdown when the branch is well predicted. Branch predictability is difficult to determine at compile time, so the choice between a branch and a conditional move is made by rough heuristics that can often yield poor results. The Intel Core micro-architecture simplifies this tradeoff by providing a cmovCC implementation with lower latency than in previous generations. When optimizing for the Intel Core micro-architecture, the Intel compiler more aggressively eliminates branches in favor of cmovCC. This strategy yields a substantial speedup for some applications.
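The tradeoff can be pictured with a small C sketch. The two hypothetical functions below compute the same sum; the first is written with a data-dependent branch, the second as a select whose value is always computed. The select form is the shape that a compiler may turn into cmovCC, but that choice belongs to the compiler and is an assumption here, not something the source code guarantees.

```c
#include <stddef.h>
#include <stdint.h>

/* Branching form: the addition is skipped when the test fails. If the
   data is unpredictable, this branch is likely to mispredict often. */
int64_t sum_above_branch(const int32_t *v, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > t)
            sum += v[i];
    }
    return sum;
}

/* Select form: a value is chosen on every iteration and always added,
   so the addition sits on the critical path regardless of the data. */
int64_t sum_above_select(const int32_t *v, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t add = (v[i] > t) ? v[i] : 0;
        sum += add;
    }
    return sum;
}
```

With well-predicted data the first form can run ahead speculatively, while the second always waits for the comparison. With poorly predicted data the second form avoids misprediction penalties.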

Instruction Scheduling

In a dynamically scheduled environment like the Intel Core micro-architecture, the effectiveness of instruction scheduling at compile time is greatly reduced. Using its knowledge of machine internals, however, the Intel C++/Fortran compiler is able to schedule instructions to avoid micro-architectural pitfalls and to take advantage of micro-architectural features.

As described earlier, the Intel Core micro-architecture features a data prefetcher to speculatively load data into the caches. The L2 to L1 cache prefetcher uses a 256-entry table to map loads to load address predictors. This table is indexed by the lower eight bits of the instruction pointer (IP) address of the load. Since there is only one table entry per index, two loads offset by a multiple of 256 bytes cannot both reside in the table. If a conflict occurs in a loop and involves a predictable load, the effectiveness of the data prefetcher can be drastically reduced. In a critical loop, this can cause a significant reduction in overall application performance.
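The indexing rule described above can be written down directly. The helper names below are hypothetical, and the sketch assumes nothing beyond what the text states: a 256-entry table indexed by the low eight bits of the load's instruction pointer.

```c
#include <stdint.h>

/* Table index of a load, per the description in the text:
   the low eight bits of its instruction pointer. */
unsigned ip_prefetch_index(uint32_t load_ip)
{
    return (unsigned)(load_ip & 0xFFu);
}

/* Two loads compete for the same table entry when their indices match,
   that is, when their addresses differ by a multiple of 256 bytes. */
int ip_prefetch_conflict(uint32_t ip_a, uint32_t ip_b)
{
    return ip_prefetch_index(ip_a) == ip_prefetch_index(ip_b);
}
```

For example, loads at 0x00401234 and 0x00401334 both map to index 0x34 and conflict, while loads at 0x00401234 and 0x00401238 do not.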

The compiler attempts to avoid IP prefetch conflicts in inner loops. It first identifies and classifies load instructions, distinguishing between loads that are likely to benefit from prefetching and those that are not. For example, loads from constant addresses will not benefit from prefetching, so an IP prefetch conflict between two such loads is unlikely to affect performance. After identifying and classifying loads, the compiler inserts nop padding such that each prefetchable load has a modulo-256 address that is different from that of every other load in the inner loop.
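The article does not describe the padding algorithm itself, so the following is only a hypothetical greedy sketch of how such a pass could work. It assumes that loads are visited in address order, that padding is inserted one byte at a time immediately before a load, and that every pad byte shifts all later loads. The names pad_loads and ip_index are illustrative and do not come from the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Low eight bits of the load's instruction pointer (table index). */
unsigned ip_index(uint32_t ip)
{
    return (unsigned)(ip & 0xFFu);
}

/* Hypothetical greedy padding pass.
   addr[]         load addresses in ascending order, updated in place
   prefetchable[] nonzero for loads expected to benefit from prefetching
   pad[]          receives the nop bytes inserted before each load
   Returns the total number of pad bytes, or -1 if some load cannot be
   placed without a conflict (addr[] is then only partly updated). */
int pad_loads(uint32_t *addr, const int *prefetchable, int *pad, size_t n)
{
    uint32_t shift = 0;   /* bytes inserted before the current load so far */
    int total = 0;

    for (size_t i = 0; i < n; i++) {
        int tries = 0;
        addr[i] += shift;
        pad[i] = 0;
        for (;;) {
            int conflict = 0;
            /* A clash matters only if at least one load is prefetchable. */
            for (size_t j = 0; j < i; j++) {
                if ((prefetchable[i] || prefetchable[j]) &&
                    ip_index(addr[i]) == ip_index(addr[j])) {
                    conflict = 1;
                    break;
                }
            }
            if (!conflict)
                break;
            if (++tries > 255)
                return -1;            /* all 256 indices were tried */
            addr[i]++;                /* one more nop byte before load i */
            pad[i]++;
            shift++;
            total++;
        }
    }
    return total;
}
```

For loads at 0x1000, 0x1100, and 0x1204, where the first two are prefetchable, the sketch inserts one byte before the second load, moving it to 0x1101 and the third to 0x1205. A production pass would also have to weigh code size and alignment, which this sketch ignores.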

The Intel Core micro-architecture can combine an integer compare (cmp) or test (test) instruction and a subsequent conditional jump instruction (jCC) into a single micro-operation through a process called macro-fusion. For macro-fusion to occur between cmp and jCC, the jump condition must test only the carry and/or zero flags, which is typically the case for unsigned integer compare and jump operations. The Intel C++/Fortran compiler takes advantage of macro-fusion by detecting compare and jump instructions that are candidates for fusion and, during scheduling, forcing them to be adjacent. Note that this strategy conflicts with a traditional latency-based strategy, which tends to separate producers (the compare in this case) from consumers (the conditional jump).
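At the source level, the flag condition corresponds to the signedness of the comparison. The hypothetical pair of functions below differs only in operand type: an unsigned less-than is normally implemented with a carry-flag condition such as jb or jae, whereas a signed less-than uses jl or jge, which test the sign and overflow flags. Whether a particular compiler emits a compare and jump here at all, rather than a branch-free sequence, is an assumption that should be verified in the generated code.

```c
#include <stddef.h>

/* Unsigned comparison: a cmp followed by a jump on this condition
   would test only the carry flag, so it is a fusion candidate. */
size_t count_below_unsigned(const unsigned *v, size_t n, unsigned limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit)
            count++;
    }
    return count;
}

/* Signed comparison: the matching jump condition tests the sign and
   overflow flags, so a cmp and jCC pair here would not qualify. */
size_t count_below_signed(const int *v, size_t n, int limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < limit)
            count++;
    }
    return count;
}
```

The two functions are otherwise identical, which makes them a convenient pair for comparing the assembly a compiler produces for each case.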
