Avoid Manual Loop Unrolling
The Intel® Compiler can typically generate efficient vectorized code if a loop is not manually unrolled. It is better to let the compiler do the unrolling, which you can control using "#pragma unroll (n)". Vector alignment, loop collapsing, and interactions with other loop optimizations become much more complex if the compiler has to "undo" the manual unrolling. In all but the simplest cases, the compiler cannot do this on its own, so the user has to refactor the loop back to its rolled form to get the best-performing vector code.
In addition, manual loop unrolling tends to tune a loop for a particular processor or architecture, which can make it suboptimal for a future port of the application. In general, it is good practice to write code in the most readable, straightforward manner. This gives the compiler the best chance of optimizing a given loop structure.
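As a minimal sketch of the compiler-controlled approach, the loop below keeps its simple form and only requests an unroll factor through the pragma mentioned above. The function name scaled_add and the use of std::vector are assumptions made for illustration; they do not come from the original examples. Compilers that do not recognize this pragma typically ignore it, possibly with a warning.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration: computes y[i] = y[i] + a * x[i].
void scaled_add(double a, const std::vector<double>& x, std::vector<double>& y) {
    // Process only the elements that exist in both vectors.
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();

    // Request an unroll factor of 4. The loop body itself stays rolled,
    // so the compiler remains free to vectorize and align it as it sees fit.
#pragma unroll(4)
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = y[i] + a * x[i];
    }
}
```

Changing or removing the unroll factor here is a one-line edit, whereas a manually unrolled loop would have to be rewritten.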
Fortran Example where manual unrolling is done in the source:
m = MOD(N,4)
if ( m /= 0 ) THEN
   do i = 1 , m
      Dy(i) = Dy(i) + Da*Dx(i)
   end do
   if ( N < 4 ) RETURN
end if
mp1 = m + 1
do i = mp1 , N , 4
   Dy(i) = Dy(i) + Da*Dx(i)
   Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
   Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
   Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
end do
It is better to express this in the simple form of:
do i = 1 , N
   Dy(i) = Dy(i) + Da*Dx(i)
end do
This allows the compiler to generate efficient vector-code for the entire computation and also improves code readability.
C++ Example where manual unrolling is done in the source:
double accu1 = 0, accu2 = 0, accu3 = 0, accu4 = 0;
double accu5 = 0, accu6 = 0, accu7 = 0, accu8 = 0;
for (i = 0; i < NUM; i += 8) {
    accu1 = src1[i+0]*src2 + accu1;
    accu2 = src1[i+1]*src2 + accu2;
    accu3 = src1[i+2]*src2 + accu3;
    accu4 = src1[i+3]*src2 + accu4;
    accu5 = src1[i+4]*src2 + accu5;
    accu6 = src1[i+5]*src2 + accu6;
    accu7 = src1[i+6]*src2 + accu7;
    accu8 = src1[i+7]*src2 + accu8;
}
accu = accu1 + accu2 + accu3 + accu4 +
       accu5 + accu6 + accu7 + accu8;
It is better to express this in the simple form of:
double accu = 0;
for (i = 0; i < NUM; i++) {
    accu = src1[i]*src2 + accu;
}
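To make the comparison concrete, the following sketch wraps both forms in functions so they can be called on the same data. The function names and the std::vector interface are assumptions for illustration. The unroll factor is reduced to 4 to keep the sketch short, and a remainder loop is added because a hand-unrolled loop only covers whole groups of elements. Note that the two forms add the terms in a different order, so with general floating-point data the results can differ slightly.

```cpp
#include <cstddef>
#include <vector>

// Simple form: one accumulator and one statement per iteration.
// Unrolling and vectorization are left to the compiler.
double scaled_sum_simple(const std::vector<double>& src1, double src2) {
    double accu = 0.0;
    for (std::size_t i = 0; i < src1.size(); ++i) {
        accu = src1[i] * src2 + accu;
    }
    return accu;
}

// Manually unrolled form (by 4 in this sketch). It needs extra bookkeeping:
// several accumulators, a final combine step, and a remainder loop for
// sizes that are not a multiple of the unroll factor.
double scaled_sum_unrolled(const std::vector<double>& src1, double src2) {
    const std::size_t n = src1.size();
    const std::size_t main_end = n - (n % 4);  // last index covered by full groups of 4
    double accu1 = 0.0, accu2 = 0.0, accu3 = 0.0, accu4 = 0.0;
    for (std::size_t i = 0; i < main_end; i += 4) {
        accu1 = src1[i + 0] * src2 + accu1;
        accu2 = src1[i + 1] * src2 + accu2;
        accu3 = src1[i + 2] * src2 + accu3;
        accu4 = src1[i + 3] * src2 + accu4;
    }
    double accu = accu1 + accu2 + accu3 + accu4;
    // Remainder loop: handles the last n % 4 elements.
    for (std::size_t i = main_end; i < n; ++i) {
        accu = src1[i] * src2 + accu;
    }
    return accu;
}
```

The simple version is shorter, has no hidden assumption about the trip count, and leaves the choice of unroll factor and vector width to the compiler.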
NEXT STEPS
It is essential that you read this guide from start to finish, using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon processors. The paths provided in this guide reflect the steps necessary to get the best possible application performance.