How to Use Code from the Intel® Itanium® Processor on the Itanium® 2 Processor


Last Modified On :   May 6, 2008 10:09 PM PDT


Challenge
Get better performance when moving an application written for the Intel® Itanium® processor to the Itanium® 2 processor. The Itanium 2 processor, the second generation of the family, has additional execution units, which allow more instruction combinations to execute in parallel without causing a split issue. Specifically, it adds two memory units, for a total of four (M0, M1, M2, and M3). These provide more memory-access capacity and, because they can also perform integer operations, bring the number of execution units able to execute integer instructions to six of the 11.

Solution
Recompile code developed for the Itanium processor before running it on the Itanium 2 processor, so that it takes advantage of the platform's advanced microarchitectural features. The Itanium 2 processor still disperses a maximum of six instructions per clock cycle, but the compiler has a greater choice in matching instructions to execution units. The following figure is a matrix comparing the full-dispersal bundle-pair configurations supported by the Itanium processor and the Itanium 2 processor:



The templates down the left column represent the first bundle in the dispersal window, and those across the top represent the second bundle. Bundle configurations fully dispersed by both the Itanium processor and the Itanium 2 processor are highlighted in orange; those permitted only on the Itanium 2 processor are in green. The availability of six memory-integer execution resources more than doubles the number of bundle combinations that achieve full dispersal on the Itanium 2 processor. Code generated for the Itanium 2 processor therefore has fewer nops and fewer split dispersals, and it performs better.

The Itanium processor does not support full dispersal of the MII-MII code sequence that follows:

{ .mii   // two bundles dispersed in one clock
alloc r33=ar.pfs,1,9,2,0 //0:->Cycle 1
mov r32=b0 //0:->Cycle 1
add r41=r40,41 //0:->Cycle 1
} { .mii
add r42=0x0,r0 //0:->Cycle 1
add r38=r37,r39 //0:->Cycle 1
add r36=r35,r34 //0:->Cycle 1
}

For the Itanium processor, the second bundle is instead compiled with an alternate template and filled with nops, so only four instructions issue in the clock cycle. The dispersal matrix shows that the Itanium 2 processor supports full dispersal for this bundle-pair combination. The following figure illustrates how the Itanium 2 processor disperses the individual instructions for execution in the same clock cycle:



Code compiled for the original Itanium processor runs on later family members without recompiling. Recompiling, however, makes better use of the additional execution resources and results in higher performance.
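
The resource argument behind the MII-MII example can be approximated with a small counting model. The Python sketch below is an illustration only, not the processors' actual dispersal logic: it ignores the template slot-to-port mapping rules and looks only at memory (M) and integer (I) units. The four M units on the Itanium 2 processor and the limit of six instructions per clock come from this article. The unit counts for the original Itanium processor, the classification of `alloc` as M-only and `mov r32=b0` as I-only, and the idea that a simple `add` can use either kind of unit are assumptions of the sketch, and the names `UNITS`, `MAX_ISSUE`, and `issued_in_one_cycle` are hypothetical.

```python
# Simplified counting model of integer/memory dispersal (illustrative only).

MAX_ISSUE = 6  # maximum instructions dispersed per clock (from the article)

# M and I unit counts. Itanium 2 has four M units (from the article);
# the other counts are assumptions of this sketch.
UNITS = {
    "itanium": {"M": 2, "I": 2},
    "itanium2": {"M": 4, "I": 2},
}

def issued_in_one_cycle(kinds, cpu):
    """Return how many leading instructions fit in a single clock.

    kinds: list of "M" (needs an M unit), "I" (needs an I unit),
           or "A" (simple ALU op, assumed to run on either unit type).
    Issue is in order and stops at the first instruction that does not
    fit, which stands in for a split issue.
    """
    units = UNITS[cpu]
    total = units["M"] + units["I"]
    issued = 0
    for n in range(1, len(kinds) + 1):
        prefix = kinds[:n]
        fits = (
            n <= MAX_ISSUE
            and prefix.count("M") <= units["M"]
            and prefix.count("I") <= units["I"]
            and n <= total
        )
        if not fits:
            break
        issued = n
    return issued

# The MII-MII pair from the article: alloc, mov from b0, then four adds.
MII_MII_PAIR = ["M", "I", "A", "A", "A", "A"]

print(issued_in_one_cycle(MII_MII_PAIR, "itanium"))   # 4
print(issued_in_one_cycle(MII_MII_PAIR, "itanium2"))  # 6
```

Under these assumptions the model issues four instructions per clock on the Itanium processor and all six on the Itanium 2 processor, which matches the figures quoted above. The real dispersal rules are more detailed, so treat the model as a way to reason about unit counts rather than as a predictor for arbitrary bundle pairs.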
