Just-In-Time Compilation in SYCL
The Intel® oneAPI DPC++ Compiler converts the device code in a SYCL program into an intermediate representation called SPIR-V and stores it in the binary produced by the compilation process. The advantage of shipping this intermediate representation instead of device-specific assembly is that the code can run on any supported hardware platform: the SPIR-V is translated into the assembly code of the target platform at runtime. This translation of the intermediate code embedded in the binary is called Just-In-Time (JIT) compilation, and it happens on demand at runtime. There are several ways to control when this JIT compilation occurs. By default, all the SPIR-V code present in the binary is translated upfront when the first offloaded kernel starts executing.
//==============================================================
// Copyright © 2022 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <CL/sycl.hpp>
#include <array>
#include <chrono>
#include <iostream>

// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;

void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
                IntArray &sum) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto e = q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

void VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b,
                IntArray &sum) {
  sycl::range num_items{a.size()};

  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto e = q.submit([&](auto &h) {
    // Input accessors
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    // Output accessor
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);

    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++)
    a[i] = i;
}

int main() {
  sycl::default_selector d_selector;
  IntArray a, b, sum;
  InitializeArray(a);
  InitializeArray(b);

  sycl::queue q(d_selector, sycl::property::queue::enable_profiling{});

  std::cout << "Running on device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  auto start = std::chrono::steady_clock::now();
  VectorAdd1(q, a, b, sum);
  auto end = std::chrono::steady_clock::now();
  std::cout << "Initial Vector add1 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd1(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Second Vector add1 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd2(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Initial Vector add2 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";

  start = std::chrono::steady_clock::now();
  VectorAdd2(q, a, b, sum);
  end = std::chrono::steady_clock::now();
  std::cout << "Second Vector add2 successfully completed on device - took "
            << (end - start).count() << " nano-secs\n";
  return 0;
}
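Note that the queue in this example is created with the enable_profiling property, while the printed timings come from host-side std::chrono measurements, which include the JIT compilation cost on the first call. As a side illustration (not part of the original sample), the SYCL event returned by q.submit can also report device-side start and end timestamps, which exclude host overheads such as JIT compilation; comparing the two isolates the JIT cost. The following is a minimal sketch meant to sit in the same file as the example above; the function name VectorAdd1Profiled is hypothetical:

// Illustrative variant of VectorAdd1 (hypothetical name), assumed to be
// placed in the same file as the example above so IntArray is in scope.
// Requires the queue to be created with
// sycl::property::queue::enable_profiling, as in main() above.
void VectorAdd1Profiled(sycl::queue &q, const IntArray &a, const IntArray &b,
                        IntArray &sum) {
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);

  auto e = q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  e.wait();

  // Device-side timestamps in nanoseconds; these cover only the kernel's
  // execution on the device, not host-side work such as JIT compilation.
  auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "Device kernel time: " << (t1 - t0) << " nano-secs\n";
}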
When the example program is compiled with the following command (assuming the source file is named example.cpp):
icpx -fsycl -O3 -o example example.cpp
and run, the output will show that the first call to VectorAdd1 takes much longer than the other kernel calls in the program, due to the cost of JIT compilation, which is incurred when VectorAdd1 is executed for the first time.
The overhead of JIT compilation at runtime can be avoided by Ahead-Of-Time (AOT) compilation, which is enabled by appropriate switches on the compile line. With AOT compilation, the binary contains the actual assembly code of the platform selected during compilation instead of the SPIR-V intermediate code. The advantage is that the code does not need to be JIT compiled from SPIR-V to assembly during execution, so it starts running faster. The disadvantage is that the code can no longer run on any platform other than the one it was compiled for.
The example above can be compiled for a Gen9 GPU with AOT code generation using the following command:
icpx -fsycl -O3 -o example example.cpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend=spir64_gen-unknown-unknown-sycldevice "-device skl"
When this program is run, the output shows that all the kernel calls take roughly the same amount of time, unlike before, where the first kernel call took much longer because of JIT compilation.
If the application contains multiple kernels, eager or lazy JIT compilation can be selected with a compile-time switch. Eager JIT compilation translates all the kernels in the binary at once, when the first kernel is invoked, while lazy JIT compilation translates a kernel only when it is actually called during execution. In situations where certain kernels are never called, lazy compilation has the advantage of not translating code that never executes, avoiding unnecessary JIT compilation. The mode is selected at compile time with the following option:
-fsycl-device-code-split=<value>
where <value> is one of the following:
per_kernel: creates a separate device code module for each kernel, so a kernel is JIT compiled only when it is first called
per_source: creates one device code module per source file, so all kernels in a source file are JIT compiled when any kernel from that file is first called
off: creates a single device code module for the whole application, so all kernels are JIT compiled eagerly when the first kernel is called
auto: the default; the compiler uses heuristics to select the best way of splitting device code for JIT compilation
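The compile commands below assume the example has been restructured into three source files: vec1.cpp defining VectorAdd1, vec2.cpp defining VectorAdd2, and main.cpp containing InitializeArray and main. A minimal sketch of one possible split follows; the shared header name vector_add.hpp is assumed for illustration:

// vector_add.hpp -- shared declarations; the header name is assumed for this sketch
#pragma once
#include <CL/sycl.hpp>
#include <array>

constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;

void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b, IntArray &sum);
void VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b, IntArray &sum);

// vec1.cpp -- defines VectorAdd1 with the same body as in the single-file example
#include "vector_add.hpp"

void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
                IntArray &sum) {
  sycl::range num_items{a.size()};
  sycl::buffer a_buf(a);
  sycl::buffer b_buf(b);
  sycl::buffer sum_buf(sum.data(), num_items);
  q.submit([&](auto &h) {
    sycl::accessor a_acc(a_buf, h, sycl::read_only);
    sycl::accessor b_acc(b_buf, h, sycl::read_only);
    sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
    h.parallel_for(num_items,
                   [=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();
}

// vec2.cpp -- identical to vec1.cpp except that it defines VectorAdd2

// main.cpp -- contains InitializeArray and main, unchanged from the
// single-file example, plus #include "vector_add.hpp"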
If the program, split across these files, is compiled with this option:
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_kernel
and run, the kernel execution timings show that the first invocations of VectorAdd1 and VectorAdd2 take longer, while the second invocations take less time because they do not pay the cost of JIT compilation.
With VectorAdd1 and VectorAdd2 in separate source files, we can also compile with and without the per_source option to see the impact on the execution times of the kernels. When compiled with
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_source
and run, the execution times of the kernels show that each source file's JIT compilation cost is paid at the first invocation of a kernel from that file, while subsequent invocations do not pay it. But when the program is compiled with
icpx -fsycl -O3 -o example vec1.cpp vec2.cpp main.cpp
and run, the execution times of the kernels show that the entire JIT compilation cost is paid upfront at the first kernel invocation, and no subsequent kernel invocation pays any JIT compilation cost.