What is AOT Compilation
- Ahead-of-Time (AOT) compilation is a technique where source code is compiled into machine code before the program is run, rather than at runtime (Just-in-Time, or JIT, compilation).
- In SYCL, AOT compilation means compiling SYCL kernels into device-specific binaries before the application is executed.
How Does AOT Compilation Work
- Source Code Compilation: SYCL code is written in standard C++ with parallel kernels marked for execution on accelerators. The source includes both host code (runs on the CPU) and device code (runs on the accelerator, such as a GPU).
- Separate Compilation: The SYCL source file is compiled with the SYCL compiler, which splits the code into host code and device code.
- Device-Specific Compilation: The device code is compiled into an intermediate representation (IR), such as SPIR-V (Standard Portable Intermediate Representation). This IR is then compiled into device-specific binary code using device-specific backends (AOT compilers).
- Linking: The host code and the precompiled device binary are linked together to create the final executable.
- Execution: At runtime, the SYCL runtime loads the precompiled device binary and executes it on the target accelerator.
Benefits of AOT Compilation
- Performance:
- Reduced runtime overhead: Since the kernel is already compiled, the runtime does not need to perform JIT compilation.
- Faster startup time: The application starts faster because there is no need for runtime compilation.
- Optimization:
- Target-specific optimizations: AOT compilers can perform optimizations tailored to the specific hardware architecture.
- Stable performance: Performance is more predictable and consistent compared to JIT, where runtime conditions may affect optimization.
- Portability:
- Precompiled binaries: Device binaries can be shipped with the application, ensuring that the application runs on the target hardware without requiring the SYCL compiler at runtime.
- Deployment:
- Easier deployment: Applications can be deployed in environments where installing a compiler is not feasible or desirable.
Disadvantages of AOT Compilation
- Flexibility:
- AOT binaries are specific to the target architecture, reducing flexibility compared to JIT compilation, which can adapt to the runtime environment.
- Development Cycle:
- AOT adds an additional compilation step, which can increase build time.
- Testing kernel changes can be slower because each change requires recompilation.
- Binary Compatibility:
- The compiled binary is specific to a particular hardware architecture, so different devices require different binaries.
Differences between AOT and JIT Compilation
The most significant differences are in when device code is compiled, what the resulting binary contains after host code compilation, and how the final binary is deployed.
- Device compilation: With AOT, the IR code is compiled into a binary format suitable for the target device (e.g., GPU, FPGA) using the device-specific compiler at build time. Because this step is performed before the application is run, it is called "ahead-of-time".
- Binary contents: With AOT, after the host part of the SYCL code is compiled into native machine code by the host compiler, the resulting binary contains the host code and precompiled device code. With JIT compilation, the resulting binary instead contains the host code and the intermediate representation (IR) of the device part.
- Deployment: With AOT, the compiled binary is deployed and run on the target platform; since the device code is already compiled, no further compilation is needed at runtime. With JIT, the binary is also deployed and run on the target platform, but at runtime a JIT compiler must first translate the IR into machine code suitable for the target device.
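In practice, the choice between the two modes often comes down to compile flags. As a sketch using Intel's DPC++ compiler (icpx), where `spir64_gen` and the `"-device *"` backend option target Intel GPUs (adjust for your toolchain):

```shell
# JIT (default): device code is kept as SPIR-V inside the binary
# and compiled for the device the first time a kernel is launched
icpx -fsycl vector_add.cpp -o vector_add_jit

# AOT: the SPIR-V is additionally compiled to a device binary at build time
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device *" \
    vector_add.cpp -o vector_add_aot
```

Both executables are launched the same way; the difference is only whether kernel compilation happens at build time or at first launch.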
SYCL AOT Compilation Flow
- The SYCL compiler can logically be split into a host compiler and one or more device compilers.
- Host compilation works like plain C++ compilation.
- Device compilation first compiles the code into device object files containing LLVM IR, which are then translated into a SPIR-V module.
- The SPIR-V is compiled into device-specific binary code using a target-specific compiler for the target hardware.
- Linking the host object files with the device object files produces a "fat binary": a host binary with one or more embedded device binaries.
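A fat binary can carry several device images at once. As a hedged example, DPC++ accepts a comma-separated list in -fsycl-targets (assuming a recent icpx release), so the following embeds both a generic SPIR-V image for JIT fallback and an AOT-compiled Intel GPU image:

```shell
# spir64: generic SPIR-V image (JIT fallback on any supported device)
# spir64_gen: AOT-compiled image for Intel GPUs
icpx -fsycl -fsycl-targets=spir64,spir64_gen \
    -Xsycl-target-backend=spir64_gen "-device *" \
    vector_add.cpp -o vector_add
```

At run time, the SYCL runtime picks the image that matches the available device, falling back to JIT-compiling the SPIR-V image when no precompiled image fits.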
AOT Workflow Example
1. Write SYCL code. Here's an example with a vector addition kernel:

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

using namespace sycl;

constexpr size_t VEC_SIZE = 10;

// Adds a_vec and b_vec element-wise on the device, writing into sum_vec.
void addVec(queue &q, const std::vector<float> &a_vec,
            const std::vector<float> &b_vec, std::vector<float> &sum_vec) {
  buffer a_buf(a_vec);
  buffer b_buf(b_vec);
  buffer sum_buf(sum_vec);
  q.submit([&](handler &h) {
    accessor a_acc(a_buf, h, read_only);
    accessor b_acc(b_buf, h, read_only);
    accessor sum_acc(sum_buf, h, write_only, no_init);
    h.parallel_for(range<1>(a_vec.size()),
                   [=](id<1> i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

void initVec(std::vector<float> &vec) {
  for (size_t i = 0; i < vec.size(); i++) vec[i] = static_cast<float>(i);
}

int main() {
  std::vector<float> a(VEC_SIZE), b(VEC_SIZE), sum(VEC_SIZE);
  initVec(a);
  initVec(b);

  queue q;
  addVec(q, a, b, sum);  // buffers are destroyed inside, copying results back

  for (size_t i = 0; i < sum.size(); i++) {
    std::cout << "[" << i << "]: " << a[i] << " + " << b[i]
              << " = " << sum[i] << "\n";
  }
  std::cout << "Vector add successfully completed on device.\n";
  return 0;
}
2. Compile with the SYCL compiler:
Compile the host code and generate the device code. Use the -fsycl-targets option to specify the target device architectures for which the SYCL kernels should be compiled:
icpx -fsycl -fsycl-targets=spir64_gen -c vector_add.cpp -o vector_add.o
3. Link and create the executable:
Link the host object code with the precompiled device binary:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device *" vector_add.o -o vector_add
Here we build an executable suitable for all targets known to the OpenCL™ Offline Compiler (ocloc), so we pass "-device *".
4. Run the application:
Execute the application with the precompiled device binary:
./vector_add
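At run time you can steer which device the binary targets. For example, recent oneAPI runtimes honor the ONEAPI_DEVICE_SELECTOR environment variable (the exact variable and syntax depend on your oneAPI release):

```shell
# Run on a Level Zero GPU device
ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./vector_add

# Run on an OpenCL CPU device instead; note this only works if the
# fat binary contains a matching device image (or a SPIR-V image for JIT)
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./vector_add
```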
For details on using Intel® CPUs and GPUs as AOT targets, and for the available target-device options, see the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference.
Comparing Flow of AOT vs JIT
AOT Runtime Flow
- Compilation:
- Compile the SYCL source code to SPIR-V.
- Use the AOT compiler for the specific device (e.g., Intel GPU).
- Compile the host code and link it with the precompiled device code.
- Load Binary:
- The precompiled binary is loaded into memory.
- The binary includes both host and device code.
- Execution:
- The application starts executing the host code.
- When a SYCL kernel is enqueued for execution, the runtime loads the precompiled device code and runs it on the target device.
- Result:
- The device code executes immediately, without any runtime compilation overhead.
JIT Runtime Flow
- Compilation:
- Compile the SYCL source code (the SPIR-V remains in the binary).
- Load Binary:
- The binary containing the host code and IR code is loaded into memory.
- Execution:
- The application starts executing the host code.
- Device Code Compilation (JIT):
- When a SYCL kernel is enqueued for execution, the runtime compiles the IR into device-specific binary code using a JIT compiler.
- This step introduces additional overhead at runtime due to the need for compilation.
- Execution:
- Once the device code is compiled, it is loaded and executed on the target device.
- Result:
- The device code runs only after JIT compilation completes.
AOT Optimization Options
In addition to the standard -O levels for compile-time optimization, which range from -O0 (no optimization) to -O3 and -Ofast (aggressive optimization), SYCL-specific optimizations are also available.
Compile-Time Optimization Options
-fsycl-dead-args-optimization
This flag enables the optimization that removes unused arguments in SYCL kernels. When compiling SYCL code, if there are kernel arguments that are declared but not actually used within the kernel body, this optimization will exclude them from the compiled kernel binary.
Impacts:
- Smaller binary size: by excluding unused arguments, the size of the compiled kernel binaries is reduced.
- Potential performance improvement: removing unused arguments can reduce the overhead associated with argument passing, potentially leading to better runtime performance.
Usage Example:
icpx -fsycl -fsycl-dead-args-optimization my_sycl_program.cpp -o my_sycl_program
Runtime Optimization Options
-fsycl-device-code-split
The -fsycl-device-code-split option controls how the device code is split into multiple parts. This can be useful for optimizing the build process, reducing binary size, and improving load times.
Impacts:
- Modularization: splitting device code into smaller modules makes the resulting binary more modular, which can simplify linking and make large codebases easier to manage.
- Binary size: splitting device code can reduce the size of the binary that gets loaded onto the device, which can improve performance, especially on devices with limited memory.
- Linking efficiency: splitting can reduce the complexity of the final linking step, which can speed up the build process and reduce compilation times.
- Code management: it helps manage code complexity by logically grouping device code into separate parts, which is particularly beneficial in large projects where different modules have different update cycles.
Usage Examples:
- Auto splitting:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device *" -fsycl-device-code-split=auto my_sycl_program.cpp -o my_sycl_program
- Per-kernel splitting:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device *" -fsycl-device-code-split=per_kernel my_sycl_program.cpp -o my_sycl_program
Other accepted values include per_source (one module per source file) and off (a single module for all device code).
-fno-sycl-rdc
The -fno-sycl-rdc option disables relocatable device code. Relocatable Device Code (RDC) allows device code to be separately compiled and linked, providing flexibility similar to traditional host code compilation.
Impacts:
- Monolithic compilation: disabling RDC forces all device code to be compiled into a single module, which can simplify the build process but at the cost of flexibility.
- Performance: with RDC disabled, all device code is compiled together, potentially leading to better optimization opportunities since the compiler has visibility into all device code at once. However, this can also lead to larger binary sizes.
- Build simplicity: the build process can be simpler without RDC, as there is no need to handle multiple device code modules. This can be easier for projects that do not require the flexibility of separate compilation.
- Linking constraints: disabling RDC can make linking more constrained because all device code must be present and compiled together. This might not be suitable for very large projects or projects that benefit from incremental builds.
Usage Example:
- Disable RDC:
icpx -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device *" -fno-sycl-rdc my_sycl_program.cpp -o my_sycl_program
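For context, RDC is what allows a kernel in one translation unit to call a SYCL_EXTERNAL function defined in another. A hedged two-file sketch (util.cpp and main.cpp are hypothetical file names):

```shell
# With RDC (the default): device code from util.o and main.o is linked
# together, so a kernel in main.cpp can call a SYCL_EXTERNAL function
# defined in util.cpp
icpx -fsycl -c util.cpp -o util.o
icpx -fsycl -c main.cpp -o main.o
icpx -fsycl util.o main.o -o app

# With -fno-sycl-rdc: each translation unit's device code must be
# self-contained, with no cross-file device calls to resolve at link time
icpx -fsycl -fno-sycl-rdc -c main.cpp -o main.o
icpx -fsycl -fno-sycl-rdc main.o -o app
```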