Visible to Intel only — GUID: GUID-F6ADE969-EC9D-4670-AB1A-93CD4CF4BE94
Visible to Intel only — GUID: GUID-F6ADE969-EC9D-4670-AB1A-93CD4CF4BE94
Loop Constructs
Loops can be formed with the usual for and while constructs. Loops must have a single entry and a single exit to be vectorized. The following examples illustrate loop constructs that can and cannot be vectorized. The non-vectorizable structure example shows a loop that cannot be vectorized because of the inherent potential for an early exit from the loop.
Vectorizable structure:
void vec(float a[], float b[], float c[]) {
int i = 0;
while (i < 100) {
// The if branch is inside body of loop.
a[i] = b[i] * c[i];
if (a[i] < 0.0)
a[i] = 0.0;
i++;
}
}
Non-vectorizable structure:
void no_vec(float a[], float b[], float c[]) {
int i = 0;
while (i < 100) {
if (a[i] < 50)
// The next statement is a second exit
// that allows an early exit from the loop.
break;
++i;
}
}
Loop Exit Conditions
Loop exit conditions determine the number of iterations a loop executes. For example, fixed indexes for loops determine the iterations. The loop iterations must be countable and the number of iterations must be expressed as one of the following:
A constant.
A loop invariant term.
A linear function of outermost loop indices.
In the case where a loops exit depends on computation, the loops are not countable. The examples below show loop constructs that are countable and non-countable. The non-countable loop example demonstrates a loop construct that is non-countable due to dependency loop variant count value.
Countable loop, example one:
void cnt1(float a[], float b[], float c[],
int n, int lb) {
// Exit condition specified by "N-1b+1"
int cnt=n, i=0;
while (cnt >= lb) {
// lb is not affected within loop.
a[i] = b[i] * c[i];
cnt--;
i++;
}
}
Countable loop, example two:
void vec(float a[], float b[], float c[]) {
int i = 0;
while (i < 100) {
// The if branch is inside body of loop.
a[i] = b[i] * c[i];
if (a[i] < 0.0)
a[i] = 0.0;
i++;
}
}
Non-countable loop:
void no_cnt(float a[], float b[], float c[]) {
int i=0;
// Iterations dependent on a[i].
while (a[i]>0.0) {
a[i] = b[i] * c[i];
i++;
}
}
Strip-mining and Cleanup
Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encoding of loops, as well as a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure in two ways:
By increasing the temporal and spatial locality in the data cache if the data is reusable in different passes of an algorithm.
By reducing the number of iterations of the loop by a factor of the length of each vector, or number of operations being performed per SIMD operation. With the Intel® Streaming SIMD Extensions (Intel® SSE), the vector or strip-length is reduced by four times: four floating-point data items per single Intel® SSE single-precision floating-point SIMD operation are processed.
First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine.
The compiler automatically strip-mines your loop and generates a cleanup loop. For example, assume the compiler attempts to strip-mine the loop before vectorization. After vectorization, the compiler might handle the strip mining and loop cleaning by restructuring the loop.
Before vectorization:
i=0;
while(i<n) {
// Original loop code
a[i]=b[i]+c[i];
++i;
}
After vectorization:
// The vectorizer generates the following two loops
i=0;
while(i<(n-n%4)) {
// Vector strip-mined loop
// Subscript [i:i+3] denotes SIMD execution
a[i:i+3]=b[i:i+3]+c[i:i+3];
i=i+4;
}
while(i<n) {
// Scalar clean-up loop
a[i]=b[i]+c[i];
++i;
}
Loop Blocking
It is possible to treat loop blocking as strip-mining in two or more dimensions. Loop blocking is a useful technique for memory performance optimization. The main purpose of loop blocking is to eliminate as many cache misses as possible. This technique transforms the memory domain into smaller chunks rather than sequentially traversing through the entire memory domain. Each chunk should be small enough to fit all the data for a given computation into the cache, maximizing data reuse.
Consider the following example, loop blocking allows arrays A and B to be blocked into smaller rectangular chunks so that the total combined size of two blocked (A and B) chunks is smaller than cache size, which can improve data reuse.
The transformed loop after blocking example illustrates loop blocking the add function (from the original loop example). In order to benefit from this optimization, you might have to increase the cache size.
Original loop:
#include <time.h>
#include <stdio.h>
#define MAX 7000
void add(int a[][MAX], int b[][MAX]);
int main() {
int i, j;
int A[MAX][MAX];
int B[MAX][MAX];
time_t start, elaspe;
int sec;
//Initialize array
for(i=0;i<MAX;i++) {
for(j=0;j<MAX; j++) {
A[i][j]=j;
B[i][j]=j;
}
}
start= time(NULL);
add(A, B);
elaspe=time(NULL);
sec = elaspe - start;
printf("Time %d",sec); //List time taken to complete add function
}
void add(int a[][MAX], int b[][MAX]) {
int i, j;
for(i=0;i<MAX;i++) {
for(j=0; j<MAX;j++ {
a[i][j] = a[i][j] + b[j][i]; //Adds two matrices
}
}
}
Transformed loop after blocking:
#include <stdio.h>
#include <time.h>
#define MAX 7000
void add(int a[][MAX], int b[][MAX]);
int main() {
#define BS 8 //Block size is selected as the loop-blocking factor.
int i, j;
int A[MAX][MAX];
int B[MAX][MAX];
time_t start, elaspe;
int sec;
//initialize array
for(i=0;i<MAX;i++) {
for(j=0;j<MAX;j++) {
A[i][j]=j;
B[i][j]=j;
}
}
start= time(NULL);
add(A, B);
elapse=time(NULL);
sec = elapse - start;
printf("Time %d",sec); //Display time taken to complete loopBlocking function
}
void add(int a[][MAX], int b[][MAX]) {
int i, j, ii, jj;
for(i=0;i<MAX;i+=BS) {
for(j=0; j<MAX;j+=BS) {
for(ii=i; ii<i+BS; ii++) { //outer loop
for(jj=j;jj<j+BS; jj++) { //Array B experiences one cache miss
//for every iteration of outer loop
a[ii][jj] = a[ii][jj] + b[jj][ii]; //Add the two arrays
}
}
}
}
}
Loop Interchange and Subscripts with Matrix Multiply
Loop interchange is often used for improving memory access patterns. Matrix multiplication is commonly written as shown in the typical matrix multiplication example.
The use of B(K,J) is not a stride-1 reference and therefore will not be vectorized efficiently.
If the loops are interchanged, all the references become stride-1 as shown in the matrix multiplication with stride-1 example.
Typical matrix multiplication:
void matmul_slow(float *a[], float *b[], float *c[]) {
int N = 100;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
for (int k = 0; k < N; k++)
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
Matrix multiplication with stride-1:
void matmul_fast(float *a[], float *b[], float *c[]) {
int N = 100;
for (int i = 0; i < N; i++)
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++)
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
Interchanging is not always possible because of dependencies, which can lead to different results.