We have added a new, simple SGEMM example to the Intel® SPMD Program Compiler GitHub* repo. The Intel® SPMD Program Compiler is colloquially referred to as “ISPC”, after the compiler’s executable name, “ispc.exe”. The new SGEMM sample is instructive because it walks through several variants of how to optimize a computation in ISPC. Single-precision General Matrix Multiply (SGEMM) is a nice, compact example that is familiar to many programmers, and having an ISPC version of SGEMM is helpful for comparing and contrasting with other programming languages and their approaches to optimized SGEMM code.
More generally, what lessons for programming-model evolution can we learn from ISPC? Lots! ISPC elegantly demonstrates an important future direction for the evolution of CPU multi-core SIMD languages, GPU compute languages, data-parallel C++ extensions, and embedded application- or domain-specific compute languages. Specifically, I want to discuss how ISPC exposes thread-level programming and, separately, how ISPC specifies SPMD iteration ranges.
At first glance, ISPC appears to be yet another kernel-based Single Program Multiple Data (SPMD) programming language, much like popular vendor-portable GPU compute languages such as OpenCL*, DirectX* Compute Shader, Vulkan* Compute, and Metal* Compute, as well as NVIDIA’s* CUDA* language. These languages employ scalar code kernels that are compiled as gangs of multiple instances to target Single Instruction Multiple Data (SIMD) or Single Instruction Multiple Thread (SIMT) vector compute architectures. Such languages enable efficient and easily vectorizable assembly code generation. And if data access across kernel instances is clearly contiguous, then vector instructions can efficiently load contiguous data directly into vector registers, without more costly scalar loads or non-contiguous gather/scatter operations. To demonstrate the point, I’ve included a typical ISPC kernel and the heart of its x86-64 codegen for the inner loop of a variant of the SGEMM code.
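As a rough, hedged sketch of the kind of kernel being described (illustrative only, with made-up names, not the exact code from the repo’s SGEMM sample, and omitting the generated assembly), a naive row-major variant might look like this:

```
// Illustrative sketch only; not the exact code from the ispc repo's SGEMM
// sample. Computes C (M x N) = A (M x K) * B (K x N), row-major, fp32.
export void sgemm_naive(uniform float A[], uniform float B[],
                        uniform float C[],
                        uniform int M, uniform int N, uniform int K)
{
    for (uniform int m = 0; m < M; m++) {
        // Each program instance in the gang owns one column index n, so the
        // accesses to B and C are contiguous across the gang and can compile
        // to plain vector loads/stores rather than gathers/scatters.
        foreach (n = 0 ... N) {
            float sum = 0.0f;
            for (uniform int k = 0; k < K; k++)
                sum += A[m * K + k] * B[k * N + n];
            C[m * N + n] = sum;
        }
    }
}
```

In the hot inner k loop, A[m * K + k] is a uniform value broadcast to the whole gang, while B[k * N + n] is a contiguous vector load: exactly the “clearly contiguous” case described above.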
Broadly speaking, these GPU compute languages and the ISPC language follow similar compilation strategies. But a key difference between ISPC and the others is how iteration ranges over kernel instances are specified.
Consider programming iteration in OpenCL: application programmers use driver APIs to specify an “N-Dimensional Range” (ND-Range), a hierarchical “global work” size that aggregates sub-sizes specified as local “work-groups”. The ranges can have one, two, or three dimensions. For advanced users, work-groups are revealed to be further subdivided into “sub-groups”. DirectX* Compute Shader and CUDA* have similar abstractions.
The ND-Range is intended to simplify programming with a single abstraction that drives both multiple hardware threads and SIMD execution. For groups of hardware threads executing kernel instances, the work-group abstraction exists to define a software-to-hardware mapping for shared (local) memory resources, barriers, and atomics. The sub-group abstraction exists to suggest to programmers how the compiler is targeting a single hardware thread with a gang of kernel instances. GPU compute frequently delivers impressive performance, so these compute language abstractions are clearly getting something right.
But the GPU compute languages and these abstractions also bring some needless drawbacks. Consider these:
- Iteration ranges are specified in API calls that are compiled and programmed separately from the compute kernels.
- Once kernels start running with an index range, they are stuck with it. Changing the iteration range requires more API calls and dispatching a new kernel.
- ND-ranges, work-groups, and sub-groups (thread-groups, warps, wavefronts, etc.) can be confusing concepts for both newbies and experts.
- Experts optimizing code such as SGEMM, convolutional neural networks, or ray tracing often need to subvert the work-group paradigm in an attempt to program directly to the hardware thread, so that they can manage data initialization or per-thread resources.
- The work-group abstraction is unnecessary complexity if your application does not use shared local memory or barriers, or perhaps on architectures with a high-performance cache hierarchy.
In the ISPC language, both the mapping to hardware threads and the programming of iteration ranges are different. In ISPC, hardware threads are mapped directly, and iteration ranges are specified directly in the kernel code via the foreach() construct. These seemingly simple language design choices enable a threaded and SPMD compute programming model with several desirable attributes:
- Kernel code before a foreach(), or in between foreach() constructs, maps predictably 1:1 to a hardware thread.
- “Programming directly to the hardware thread” makes it easier for programmers and compilers to initialize local arrays, load blocks of data, employ intrinsics to perform horizontal SIMD operations, or more tightly control register allocation.
- Programmers can easily specify multiple different iteration ranges within a single kernel, just by writing a new foreach() { } block. Some programmers call this “changing the axis of parallelism” within the kernel (see the sketch after this list).
- The SPMD iteration specified within foreach() clauses is clearly mapped to a single hardware thread.
- Within a single kernel, programmers can place foreach() blocks inside multiple levels of conventional for() loops. Because everything is programmed within a single kernel, the compiler can still make smart vectorization and optimization choices across the known loop bounds.
- Short, constant-sized SPMD iterations can be trivially unrolled, since the iteration range is known.
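To make these attributes concrete, here is a hedged sketch (hypothetical function and variable names, not taken from the SGEMM sample) that does per-thread setup outside any foreach(), places a foreach() inside a conventional for() loop, and then changes the axis of parallelism with a second foreach() over a different range:

```
// Hypothetical sketch of the attributes listed above; names are illustrative.
export void process_tile(uniform float data[], uniform float out[],
                         uniform int rows, uniform int cols)
{
    // Code outside foreach() maps 1:1 to the hardware thread: per-thread
    // scratch data can be set up here with no work-group machinery.
    uniform float rowScale[16];
    for (uniform int i = 0; i < 16; i++)
        rowScale[i] = 1.0f / (i + 1);

    // First axis of parallelism: program instances iterate over columns,
    // with the foreach() nested inside a conventional for() loop over rows.
    for (uniform int r = 0; r < rows; r++) {
        foreach (c = 0 ... cols) {
            out[r * cols + c] = data[r * cols + c] * rowScale[r & 15];
        }
    }

    // Change the axis of parallelism within the same kernel: a second
    // foreach() over a different range, with no new dispatch or API call.
    foreach (r = 0 ... rows) {
        float rowSum = 0.0f;
        for (uniform int c = 0; c < cols; c++)
            rowSum += out[r * cols + c];   // strided access becomes gathers
        out[r * cols] = rowSum;            // write each row's sum to column 0
    }
}
```

Everything here lives in one kernel, so the compiler sees both iteration structures and their bounds, and the programmer never touches a work-group or sub-group.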
These desirable attributes are good for both programmer and compiler. For programmers, they enable simple, intuitive code that is uncluttered by work-group, thread-group, and sub-group abstractions. The code’s mapping to hardware is intuitive and performance-transparent. And while we believe that cross-architecture performance portability is a myth, the performance of this style of code tends to be more robust when ported. For compilers, this language design very effectively represents SPMD vectorization blocks for high-performance machine code generation.
The call to action here is for programmers and compute language designers to internalize these important ISPC language design choices. Consider how GPU compute and CPU SIMD languages can evolve to incorporate similar kernel language semantics, creating a choice that frees programmers from the burdens of work-group and sub-group abstractions.
There are many other attractive ISPC language design choices that are worth their own discussion, including the uniform/varying variable specifiers, the bi-directional and trivial C/C++ language bindings, a dirt-simple task and launch syntax for task parallelism, and others. Perhaps in a future blog.
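As one small, hedged taste of that last point (illustrative names only; a host-side task system implementation, such as the one shipped with the ispc examples, must be linked in), the task and launch syntax looks roughly like this:

```
// Hypothetical sketch of ISPC task parallelism; names are illustrative.
task void scale_row(uniform float data[], uniform int cols, uniform float s)
{
    uniform int r = taskIndex;       // each launched task handles one row
    foreach (c = 0 ... cols)
        data[r * cols + c] *= s;
}

export void scale_matrix(uniform float data[], uniform int rows,
                         uniform int cols, uniform float s)
{
    launch[rows] scale_row(data, cols, s);  // spawn one task per row
    sync;                                   // wait for all launched tasks
}
```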