Visible to Intel only — GUID: GUID-ED06E4E1-C6F7-4B64-938A-2A9C9BE0DB42
Vectorization Recommendations for C++
Ineffective Peeled/Remainder Loop(s) Present
All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
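To make the terms concrete, the sketch below models how a vectorized loop's iterations might be divided among a peeled loop, the vectorized loop body, and a remainder loop. It is a simplified model, not the compiler's actual decision logic: the `LoopSplit` type and the `split_loop` function are hypothetical names, and the model assumes the peeled loop runs scalar iterations until the first vector-aligned address is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: where the source loop's iterations end up.
struct LoopSplit {
    std::size_t peeled;     // iterations run before the first aligned address
    std::size_t body;       // iterations run in the vectorized loop body
    std::size_t remainder;  // iterations left over after the last full vector
};

// Simplified model (an assumption, not real compiler behavior): peel until the
// access address is a multiple of vector_bytes, then run full vectors, then
// handle whatever is left in a remainder loop.
inline LoopSplit split_loop(std::uintptr_t start_addr, std::size_t elem_size,
                            std::size_t vector_bytes, std::size_t trip_count) {
    assert(elem_size > 0 && vector_bytes % elem_size == 0);
    assert(start_addr % elem_size == 0);

    const std::size_t lanes = vector_bytes / elem_size;       // elements per vector
    const std::size_t misalign = start_addr % vector_bytes;   // bytes past a boundary

    std::size_t peeled = (misalign == 0) ? 0 : (vector_bytes - misalign) / elem_size;
    if (peeled > trip_count) {
        peeled = trip_count;  // short loop: everything runs in the peeled loop
    }

    const std::size_t rest = trip_count - peeled;
    const std::size_t remainder = rest % lanes;
    return {peeled, rest - remainder, remainder};
}
```

In this model, 100 `float` iterations with 32-byte vectors and an aligned start give 96 body iterations and 4 remainder iterations. A 5-iteration loop that starts 4 bytes past a boundary never reaches the loop body at all.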
Align Data
One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.
Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:
float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);
Align static data using a 64-byte boundary:
__declspec(align(64)) float array[ARRAY_SIZE];
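Where the compiler-specific `_mm_malloc` and `__declspec(align)` are not available, standard C++17 offers comparable facilities. The following is a minimal sketch under that assumption: `kAlignment`, `kArraySize`, `static_array`, `is_aligned`, `allocate_aligned_floats`, and `free_aligned_floats` are hypothetical names chosen for illustration. It only aligns the data; telling the compiler about the alignment still requires a compiler-specific hint or, in C++20, `std::assume_aligned`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kAlignment = 64;    // assumed target alignment in bytes
constexpr std::size_t kArraySize = 1024;  // hypothetical element count

// Static data: alignas is the standard counterpart of __declspec(align(64)).
alignas(kAlignment) float static_array[kArraySize];

// Returns true if p sits on an `alignment`-byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Dynamic data: C++17 aligned operator new returns storage on the
// requested boundary.
inline float* allocate_aligned_floats(std::size_t count) {
    void* raw = ::operator new(count * sizeof(float), std::align_val_t{kAlignment});
    return static_cast<float*>(raw);
}

// Storage from aligned operator new must be released with the matching
// aligned operator delete.
inline void free_aligned_floats(float* p) {
    ::operator delete(p, std::align_val_t{kAlignment});
}
```

A caller would pair `allocate_aligned_floats(kArraySize)` with `free_aligned_floats`, and could use `is_aligned` in a debug check before entering the hot loop.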
Parallelize The Loop with Both Threads and SIMD Instructions
The loop is threaded and auto-vectorized; however, the trip count is not a multiple of vector length. To fix: Do all of the following:
- Use the #pragma omp parallel for simd directive to parallelize the loop with both threads and SIMD instructions. Specifically, this directive divides loop iterations into chunks (subsets) and distributes the chunks among threads, then chunk iterations execute concurrently using SIMD instructions.
- Add the schedule(simd: [kind]) modifier to the directive to guarantee the chunk size (number of iterations per chunk) is a multiple of vector length.
Original code sample:
void f(int a[], int b[], int c[], int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
Revised code sample:
void f(int a[], int b[], int c[], int n)
{
    #pragma omp parallel for simd schedule(simd:static)
    for (int i = 0; i < n; i++)
    {
        a[i] = b[i] + c[i];
    }
}
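The effect of the simd schedule modifier on chunk size can be sketched as plain arithmetic. The OpenMP specification describes the adjusted chunk size as the original chunk size rounded up to a multiple of an implementation-defined SIMD width. The helper names below (`simd_chunk_size`, `remainder_per_chunk`) are hypothetical, and the SIMD width an implementation actually picks may differ from the value used here.

```cpp
#include <cassert>
#include <cstddef>

// Round chunk_size up to the next multiple of simd_width, mirroring the
// ceil(chunk_size / simd_width) * simd_width rule for the simd modifier.
inline std::size_t simd_chunk_size(std::size_t chunk_size, std::size_t simd_width) {
    assert(simd_width > 0);
    return (chunk_size + simd_width - 1) / simd_width * simd_width;
}

// Iterations in one chunk that do not fill a whole vector and would
// therefore run in a remainder loop.
inline std::size_t remainder_per_chunk(std::size_t chunk_size, std::size_t simd_width) {
    assert(simd_width > 0);
    return chunk_size % simd_width;
}
```

For example, 1000 iterations split evenly across 4 threads give chunks of 250. With an assumed SIMD width of 8, each chunk leaves 2 remainder iterations. Rounding the chunk size up to 256 yields chunks of 256, 256, 256, and 232 iterations, each a multiple of 8.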
Force Scalar Remainder Generation
The compiler generated a masked vectorized remainder loop that contains too few iterations for efficient vector processing. A scalar loop may be more beneficial. To fix: Force scalar remainder generation using a directive: #pragma vector novecremainder.
void add_floats(float *a, float *b, float *c, float *d, float *e, int n)
{
    int i;
    // Force the compiler to not vectorize the remainder loop
    #pragma vector novecremainder
    for (i=0; i<n; i++)
    {
        a[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }
}