Vectorization Recommendations for C++
Ineffective Peeled/Remainder Loop(s) Present
All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.
One of the memory accesses in the source loop does not start at an optimally aligned address boundary. To fix: Align the data and tell the compiler the data is aligned.
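To make the effect of a misaligned start address concrete, here is a minimal sketch of how a loop's iterations might be divided among the peeled loop, the vectorized body, and the remainder loop. The `LoopSplit` type and `split_loop` function are hypothetical names introduced for illustration. The model (peel scalar iterations until the first vector-aligned address, then run full vectors) is a simplifying assumption, and an actual compiler may split the loop differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: one possible split of a loop's iterations.
struct LoopSplit {
    std::size_t peeled;     // scalar iterations before the first aligned address
    std::size_t body;       // iterations executed in the vectorized loop body
    std::size_t remainder;  // leftover iterations after the last full vector
};

// Illustrative model only (an assumption, not a description of any compiler):
// peel until the access is vector-aligned, then run as many full vectors as fit.
inline LoopSplit split_loop(std::uintptr_t start_address, std::size_t trip_count,
                            std::size_t elem_bytes, std::size_t vector_bytes) {
    assert(elem_bytes > 0 && vector_bytes % elem_bytes == 0);
    assert(start_address % elem_bytes == 0);

    const std::size_t lanes = vector_bytes / elem_bytes;  // elements per vector
    const std::size_t misalign_bytes = start_address % vector_bytes;

    // Elements needed to reach the next vector-aligned address.
    std::size_t peeled =
        (misalign_bytes == 0) ? 0 : (vector_bytes - misalign_bytes) / elem_bytes;
    if (peeled > trip_count) {
        peeled = trip_count;  // loop is too short to reach an aligned address
    }

    const std::size_t rest = trip_count - peeled;
    const std::size_t remainder = rest % lanes;
    return LoopSplit{peeled, rest - remainder, remainder};
}
```

Under this model, a 100-iteration loop over 4-byte floats that starts 16 bytes past a 64-byte boundary gives 12 peeled, 80 body, and 8 remainder iterations. With an aligned start, no iterations are peeled and only 4 fall into the remainder loop.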
Align dynamic data using a 64-byte boundary and tell the compiler the data is aligned:
float *array;
array = (float *)_mm_malloc(ARRAY_SIZE*sizeof(float), 64);
// Somewhere else
__assume_aligned(array, 64);
// Use array in loop
_mm_free(array);
Align static data using a 64-byte boundary:
__declspec(align(64)) float array[ARRAY_SIZE];
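The snippets above rely on compiler-specific extensions. As a rough portable alternative, the sketch below uses standard C++17 facilities (`alignas` and the aligned form of `operator new[]`) to get the same 64-byte alignment, and checks the result at run time. The names `kAlignment`, `kArraySize`, `static_array`, `is_aligned`, `allocate_aligned_floats`, and `free_aligned_floats` are hypothetical and chosen for illustration; the sketch assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kAlignment = 64;    // assumed target boundary, in bytes
constexpr std::size_t kArraySize = 1024;  // assumed element count

// Static data aligned with the standard alignas specifier.
alignas(kAlignment) static float static_array[kArraySize];

// Returns true when p lies on the requested byte boundary.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Dynamic data aligned with the C++17 aligned allocation function.
inline float* allocate_aligned_floats(std::size_t count) {
    assert(count > 0);
    void* raw = ::operator new[](count * sizeof(float),
                                 std::align_val_t{kAlignment});
    return static_cast<float*>(raw);
}

// Must be paired with allocate_aligned_floats (same alignment value).
inline void free_aligned_floats(float* p) {
    ::operator delete[](p, std::align_val_t{kAlignment});
}
```

Allocating aligned memory does not by itself tell the compiler about the alignment. That still requires a hint such as `__assume_aligned`, or `std::assume_aligned` in C++20.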
Parallelize The Loop with Both Threads and SIMD Instructions
The loop is threaded and auto-vectorized; however, the trip count is not a multiple of the vector length. To fix: Do all of the following:
- Use the