# Use the Intel® SPMD Program Compiler for CPU Vectorization in Games

Published: 08/17/2017

Last Updated: 08/17/2017

## Introduction

The open source LLVM*-based Intel® SPMD Program Compiler (commonly referred to in previous documents as ISPC) is not a replacement for the GNU* Compiler Collection (GCC) or the Microsoft* C++ compiler; instead, it should be considered more akin to a shader compiler for the CPU, one that can generate vector instructions for a variety of instruction sets such as Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 4 (Intel® SSE4), Intel® Advanced Vector Extensions (Intel® AVX), Intel® AVX2, and so on. The input shaders or kernels are C-based, and the output is a precompiled object file with an accompanying header file to be included in your application. Through a small number of keywords, the compiler can be explicitly directed on how the work should be split across the CPU vector units.

The extra performance from explicit vectorization is also available if a developer chooses to write intrinsics directly into their codebase; however, that approach carries a high complexity and maintenance cost. Intel SPMD Program Compiler kernels are written in a high-level language, so the development cost is low. It also becomes trivial to support multiple instruction sets, providing the best performance for the CPU the code is running on rather than targeting the lowest common denominator, such as Intel SSE4.

This article does not aim to teach the reader how to write Intel SPMD Program Compiler kernels; it simply demonstrates how to plug Intel SPMD Program Compiler into a Microsoft Visual Studio* solution and provides guidance on how to port simple High-Level Shading Language* (HLSL*) compute shaders to Intel SPMD Program Compiler kernels. For a more detailed overview of Intel SPMD Program Compiler, please refer to the online documentation.

The example code provided with this article is based on a modified version of the Microsoft DirectX* 12 n-body sample that has been ported to support Intel SPMD Program Compiler vectorized compute kernels. It is not intended to show performance deltas against the GPU, but to show the large performance gains that can be achieved when moving from standard scalar CPU code to vectorized CPU code.

While this application clearly does not represent a game due to the very light CPU load in the original sample, it does show the kind of performance scaling possible by using the vector units on multiple CPU cores.

Figure 1. Screenshot from the modified n-Body Gravity sample

## The Original DirectX* 12 n-Body Gravity Sample

Before starting the port to Intel SPMD Program Compiler, it would be useful to understand the original sample and its intent. The DirectX 12 n-Body Gravity sample was written to highlight how to use the separate compute engine in DirectX 12 to perform asynchronous compute; that is, the particle render is done in parallel to the particle update, all on the GPU. The sample generates 10,000 particles and updates and renders them each frame. The update involves every particle interacting with every other particle, to generate 100,000,000 interactions per simulation tick.

The HLSL compute shader maps a compute thread to each particle to perform the update. The particle data is double-buffered so that for each frame, the GPU renders from buffer 1 and asynchronously updates buffer 2, before flipping the buffers in preparation for the next frame.

That’s it. Pretty simple, and a good candidate for an Intel SPMD Program Compiler port because an asynchronous compute task lends itself perfectly to being run on the CPU; the code and engine have already been designed to perform the compute in a concurrent execution path, so by transferring some of this load onto the often underutilized CPU, the GPU can either finish its frame quicker, or can be given more work, while making full use of the CPU.

## Porting to Intel® SPMD Program Compiler

The recommended approach would be to port from HLSL to scalar C/C++ first. This ensures that the algorithm is correct and produces the correct results, interacts with the rest of the application correctly and, if applicable, handles multiple threads properly. As trivial as this sounds, there are a few things to consider:

1. How to share memory between the GPU and CPU.
2. How to synchronize between the GPU and CPU.
3. How to partition the work for single instruction/multiple data (SIMD) and multithreading.
4. Porting the HLSL code to scalar C.
5. Porting scalar C to an Intel SPMD Program Compiler kernel.

Some of these are easier than others.

### Sharing Memory

We know we need to share the memory between the CPU and the GPU, but how? Fortunately, DirectX 12 provides a few options, such as mapping GPU buffers into CPU memory, and so on. To keep this example simple and to minimize code changes, we just re-use the particle upload staging buffers that were used for the initialization of the GPU particle buffers, and we create a double-buffered CPU copy for CPU access. The usage model becomes:

• Update CPU-accessible particle buffer from the CPU.
• Call the DirectX 12 helper UpdateSubresources using the original upload staging buffer with the GPU particle buffer as the destination.
• Bind the GPU particle buffer and render.
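
The double-buffered CPU copy in this usage model can be sketched in C++ as follows. The `CpuParticleBuffers` helper and the `Particle` layout are illustrative stand-ins, not the sample's actual types; the real sample copies the write buffer to the GPU with the DirectX 12 `UpdateSubresources` helper after each flip.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative particle type; the sample's real layout lives in its D3D12 code.
struct Particle {
    float pos[4];
    float vel[4];
};

// A minimal double-buffered CPU-side particle store (hypothetical helper).
class CpuParticleBuffers {
public:
    explicit CpuParticleBuffers(std::size_t count)
    {
        m_buffers[0].resize(count);
        m_buffers[1].resize(count);
    }

    const std::vector<Particle>& readBuffer() const { return m_buffers[m_readIndex]; }
    std::vector<Particle>& writeBuffer() { return m_buffers[1 - m_readIndex]; }

    // After the CPU update, flip so the freshly written buffer is the one
    // uploaded to the GPU (via UpdateSubresources in the sample) and rendered.
    void flip() { m_readIndex = 1 - m_readIndex; }

    std::size_t readIndex() const { return m_readIndex; }

private:
    std::array<std::vector<Particle>, 2> m_buffers;
    std::size_t m_readIndex = 0;
};
```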

### Synchronization

Synchronization falls out naturally: the original async compute code already has a DirectX 12 fence object for marshalling the interaction between compute and render, and this is simply reused to signal to the render engine that the copy has finished.

### Partitioning the Work

To partition the work, we should first consider how the GPU partitions it, as this may be a natural fit for the CPU. Compute shaders control their partitioning in two ways. First is the dispatch size, which is passed to the API call when recording the command stream; it describes the number and dimensionality of the work groups to be run. Second is the size and dimensionality of the local work group, which is hard coded into the shader itself. Each item in the local work group can be considered a work thread, and each thread can share information with the other threads in its work group if shared memory is used.

Looking at the nBodyGravityCS.hlsl compute shader, we can see that the local work group size is 128 x 1 x 1 and it uses some shared memory to optimize some of the particle loads, but this may not be necessary on the CPU. Other than this, there is no interaction between the threads and each thread works on a different particle from the outer loop while interacting with all other particles in the inner loop.
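
As a cut-down sketch of the shape of such a shader (illustrative HLSL, not the sample's exact code), the local work group size and shared memory are declared like this:

```hlsl
#define blocksize 128
groupshared float4 sharedPos[blocksize];

[numthreads(blocksize, 1, 1)]   // local work group size: 128 x 1 x 1
void CSMain(uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID)
{
    // Each thread owns one particle (the outer loop) and iterates over all
    // other particles (the inner loop), staging blocks of them in shared memory.
}
```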

This seems a natural fit to the CPU vector width, so we could swap the 128 x 1 x 1 with 8 x 1 x 1 for Intel AVX2 or 4 x 1 x 1 for Intel SSE4. We can also use the dispatch size as a hint for how to multithread the code, so we could divide the 10,000 particles by 8 or 4, depending on the SIMD width. But, because we have discovered that there is no dependency between each thread, we could simplify it and just divide the number of particles by the available number of threads in the thread pool, or available logical cores on the device, which would be 8 on a typical quad core CPU with Intel® Hyper-Threading Technology enabled. When porting other compute shaders, this may require more thought.

This gives us the following pseudocode:

```
for each thread
    process N particles, where N = 10000 / threadCount
    for each M particles from N, where M is the SIMD width
        test interaction with all 10,000 particles
```


### Porting HLSL* to Scalar C

When writing Intel SPMD Program Compiler kernels, unless you are experienced, it is recommended that you have a scalar C version written first. This will ensure that all of the application glue, multithreading, and memory operations are working before you start vectorizing.

To that end, most of the HLSL code from nBodyGravityCS.hlsl will work in C with minimal modification, other than adding the outer loop over the particles and changing the shader math vector types to C-based equivalents. In this example, the float4/float3 types were exchanged for the DirectX XMFLOAT4/XMFLOAT3 types, and some vector math operations were split out into their scalar equivalents.
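
To illustrate the kind of scalar C that results, here is a sketch of a softened body-body interaction. The function name matches the sample's bodyBodyInteraction, but the `Vec3` type, the constants, and the exact math here are illustrative rather than the sample's actual code.

```cpp
#include <cmath>

// Minimal stand-in for the XMFLOAT3 storage type.
struct Vec3 { float x, y, z; };

// Accumulate the acceleration on one particle from another, using a softening
// term so the force stays finite when the particles nearly coincide.
void bodyBodyInteraction(Vec3& accel, const Vec3& selfPos, const Vec3& otherPos,
                         float mass, float softeningSq)
{
    const float rx = otherPos.x - selfPos.x;
    const float ry = otherPos.y - selfPos.y;
    const float rz = otherPos.z - selfPos.z;
    const float distSq = rx * rx + ry * ry + rz * rz + softeningSq;
    const float invDist = 1.0f / std::sqrt(distSq);
    const float s = mass * invDist * invDist * invDist; // 1 / dist^3 scaling
    accel.x += rx * s;
    accel.y += ry * s;
    accel.z += rz * s;
}
```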

The CPU particle buffers are used for reading and writing, and then the write buffer is uploaded to the GPU as described above, using the original fences for synchronization. For threading, the sample uses Microsoft's concurrency::parallel_for construct from the Parallel Patterns Library.

The code can be seen in D3D12nBodyGravity::SimulateCPU() and D3D12nBodyGravity::ProcessParticles().

Once the scalar code is working, it is worth doing a quick performance check to ensure there are no algorithmic hot spots that should be fixed before moving to Intel SPMD Program Compiler. In this sample, some basic hot spot analysis with Intel® VTune™ tools highlighted that a reciprocal square root was on a hot path, so it was replaced with the infamous fast reciprocal square root approximation from Quake*, which provided a small performance improvement with no perceivable impact from the loss of precision.

### Porting Scalar C to Scalar Intel® SPMD Program Compiler

Once your build system has been modified to build Intel SPMD Program Compiler kernels and link them into your application (Microsoft Visual Studio modifications are described later in this article), it is time to start writing Intel SPMD Program Compiler code and hook it into your application.

#### Hooks

To call any Intel SPMD Program Compiler kernels from your application code, you need to include the relevant auto-generated output header file and then call any of the exported functions as you would any normal library, remembering that all declarations are wrapped in the ispc namespace. In the sample, we call ispc::ProcessParticles from within the SimulateCPU() function.

#### Vector Math

Once the hooks are in, the next step is to get scalar Intel SPMD Program Compiler code working, and then vectorize it. Most of the scalar C code can be dropped straight into an Intel SPMD Program Compiler kernel with only a few simple modifications. In the sample, all vector math types needed defining; although Intel SPMD Program Compiler does provide some templated vector types, the types here are only needed for storage, so new structs were defined. Once done, all XMFLOAT types were converted to the new Vec3 and Vec4 types.

#### Keywords

We now need to start decorating the code with some Intel SPMD Program Compiler specific keywords to help direct the vectorization and compilation. The first keyword is export, which is used on a function signature, like a calling convention, to inform Intel SPMD Program Compiler that this is an entry point into the kernel. This does two things. First, it adds the function signature to the autogenerated header file along with any required structs. Second, it places some restrictions on the function signature, as all arguments need to be scalar, which leads us to the next two keywords to be used: varying and uniform.

A uniform variable is a scalar whose single value is shared across all SIMD lanes, while a varying variable is vectorized and holds a unique value in each SIMD lane. All variables are varying by default, so while the keyword can be added explicitly, it has not been used in this sample. In our first pass of creating a scalar version of the kernel, we decorate all variables with the uniform keyword to ensure the code is strictly scalar.
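
As a sketch, an exported entry point decorated this way might look like the following. This is illustrative ISPC, not the sample's exact signature:

```c
// Illustrative ISPC kernel fragment (not the sample's actual code).
// `export` makes the function callable from C/C++ via the generated header;
// exported arguments must be uniform (scalar) values or pointers.
export void ProcessParticles(uniform float positions[], uniform int particleCount)
{
    // uniform: one scalar value, shared by every SIMD lane.
    uniform float deltaTime = 0.1f;

    // varying (the default): one value per SIMD lane. In the scalar first
    // pass, every variable is declared uniform instead.
}
```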

#### Intel SPMD Program Compiler Standard Library

Intel SPMD Program Compiler provides a standard library containing many common functions that can aid the port, including floatbits() and intbits(), which are needed for the floating point bit casts in the fast reciprocal square root function.

### Vectorizing the Intel SPMD Program Compiler Kernel

When the Intel SPMD Program Compiler kernel is functioning as expected, it is time to vectorize. The main complexity is normally deciding what to parallelize and how to parallelize it. A rule of thumb for porting GPU compute shaders is to follow the original model of GPU vectorization which, in this case, had the core compute kernel invoked by multiple GPU execution units in parallel. So, where we added a new outer loop for the scalar version of the particle update, it is this outer loop that should most naturally be vectorized.

The layout of data is also important, as scatter/gather operations can be expensive for vector ISAs (although this is improved with the Intel AVX2 instruction set), so consecutive memory locations are normally preferred for frequent loads/stores.

#### Parallel Loops

In the n-body example, this rule of thumb was followed and the outer loop was vectorized, leaving the inner loop scalar. Therefore, 8 particles would be loaded into the Intel AVX registers and all 8 would then be tested against the entire 10,000 particles. These 10,000 positions would all be treated as scalar variables, shared across all SIMD lanes with no scatter/gather cost. Intel SPMD Program Compiler hides the actual vector width from us (unless we really want to know), which provides a nice abstraction to transparently support the different SIMD widths for Intel SSE4 or Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and so on.

The vectorization was done by replacing the outer for loop with an Intel SPMD Program Compiler foreach loop, which directs Intel SPMD Program Compiler to iterate over the range in N-sized chunks, where N is the current vector width. Hence, whenever the foreach loop iterator ii is used to dereference an array variable, the value of ii will be different for each SIMD lane of the vector, which allows each lane to work on a different particle.
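
A hedged sketch of the transformed outer loop (illustrative ISPC fragment, assumed to sit inside the kernel, not the sample's exact code):

```c
// The outer loop becomes a foreach, so `ii` holds a different particle
// index in each SIMD lane.
foreach (ii = 0 ... particleCount)
{
    varying float posX = positions[ii];  // gather: one particle per lane

    // The inner loop stays scalar: every lane tests its own particle
    // against the same shared particle j.
    for (uniform int j = 0; j < particleCount; ++j)
    {
        // body-body interaction between the lane's particle and particle j
    }
}
```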

#### Data Layout

At this point, it is important to briefly mention data layout. When using vector registers on the CPU it is important that they are loaded and unloaded efficiently; not doing so can cause a big performance slowdown. To achieve this, vector registers want to have the data loaded from a structure of arrays (SoA) data source, so a vector width number of memory-adjacent values can be loaded directly into the working vector register with a single instruction. If this cannot be achieved, then a slower gather operation is required to load a vector width of non-adjacent values into the vector register, and a scatter operation is required to save the data out again.

In this example, like many graphics applications, the particle data is kept in an array of structures (AoS) layout. This could be converted to SoA to avoid the scatter/gather, but due to the nature of the algorithm, the scatter/gather required in the outer loop becomes a small cost compared to processing the 10,000 scalar particles in the inner loop, so the data is left as AoS.
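
The two layouts can be contrasted with a small C++ sketch; the type names and the `toSoA` conversion helper are illustrative, not part of the sample:

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): the fields of one particle are adjacent, so
// loading eight x-components into a vector register requires a gather.
struct ParticleAoS { float x, y, z, w; };
using ParticlesAoS = std::vector<ParticleAoS>;

// Structure of arrays (SoA): each component is contiguous, so a vector
// width of adjacent x values loads with a single instruction.
struct ParticlesSoA {
    std::vector<float> x, y, z, w;
    explicit ParticlesSoA(std::size_t n) : x(n), y(n), z(n), w(n) {}
};

// Illustrative AoS-to-SoA conversion.
ParticlesSoA toSoA(const ParticlesAoS& aos)
{
    ParticlesSoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.x[i] = aos[i].x;
        soa.y[i] = aos[i].y;
        soa.z[i] = aos[i].z;
        soa.w[i] = aos[i].w;
    }
    return soa;
}
```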

#### Vector Variables

The aim is to vectorize the outer loop and keep the inner loop scalar, hence a vector width of outer loop particles will all be processing the same inner loop particle. To achieve this, we load the position, velocity, and acceleration from the outer loop particles into vector registers by declaring pos, vel, and accel as varying. This was done by removing the uniform decoration we added to the scalar kernel, so Intel SPMD Program Compiler knows these variables require vectorizing.

This needs propagating through the bodyBodyInteraction and Q_rsqrt functions to ensure they are correctly vectorized; it is simply a case of following the flow of the variables and checking for compiler errors. The result is that Q_rsqrt is fully vectorized and bodyBodyInteraction is mostly vectorized, apart from the inner loop particle position thatPos, which remains scalar.

This should be all that is required, and the Intel SPMD Program Compiler vectorized kernel should now run, providing good performance gains over the scalar version.

### Performance

The modified n-body application was tested on two different Intel CPUs, and the performance data was captured using PresentMon* to record the frame times from three runs of 10 seconds each, which were then averaged. This showed performance scaling in the region of 8-10x from scalar C/C++ code to an Intel AVX2 targeted Intel SPMD Program Compiler kernel. Both test systems used NVIDIA* GTX 1080 GPUs, and the sample used all available CPU cores.

| Processor | Scalar CPU Implementation | Intel® AVX2 Implementation Compiled with Intel® SPMD Program Compiler | Scaling |
| --- | --- | --- | --- |
| Intel® Core™ i7-7700K processor | 92.37 ms | 8.42 ms | 10.97x |
| Intel® Core™ i7-6950X Processor Extreme Edition | 55.84 ms | 6.44 ms | 8.67x |

## How to Integrate Intel SPMD Program Compiler into Microsoft Visual Studio*

1. Ensure the Intel SPMD Program Compiler executable is on the path or can easily be located from within Microsoft Visual Studio*.
2. Include your Intel SPMD Program Compiler kernel in your project. It will not be built by default, as the file type will not be recognized.
3. Right-click the file, open its Properties, and change the Item Type to be a Custom Build Tool:

4. Click OK and then re-open the Property pages, allowing you to modify the custom build tool.

a. Use the following command-line format:

```
ispc -O2 <filename> -o <output obj> -h <output header> --target=<target backends> --opt=fast-math
```

b. The full command line used in the sample is:

```
$(ProjectDir)..\..\..\..\third_party\ispc\ispc -O2 "%(Filename).ispc" -o "$(IntDir)%(Filename).obj" -h "$(ProjectDir)%(Filename)_ispc.h" --target=sse4,avx2 --opt=fast-math
```

c. Add the relevant compiler-generated outputs (i.e., the obj files):

```
$(IntDir)%(Filename).obj;$(IntDir)%(Filename)_sse4.obj;$(IntDir)%(Filename)_avx2.obj
```

d. Set Link Objects to Yes.

5. Now compile your Intel SPMD Program Compiler kernel. If successful, this should produce a header file and an object file.