Just-In-Time Compilation in DPC++
The Intel
®
oneAPI DPC++ Compiler converts a DPC++ program into an
intermediate language called SPIR-V and stores that in the binary produced by
the compilation process. The advantage of producing this intermediate file
instead of the binary is that this code can be run on any hardware platform by
translating the SPIR-V code into the assembly code of the platform at runtime.
This process of translating the intermediate code present in the binary is
called JIT compilation (just-in-time compilation). JIT compilation can happen
on demand at runtime. There are multiple ways in which this JIT compilation can
be controlled. By default, all the SPIR-V code present in the binary is
translated upfront at the beginning of the execution of the first offloaded
kernel.#include <CL/sycl.hpp>
#include <array>
#include <chrono>
#include <iostream>
// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;
void VectorAdd1(sycl::queue &q, const IntArray &a, const IntArray &b,
IntArray &sum) {
sycl::range num_items{a.size()};
sycl::buffer a_buf(a);
sycl::buffer b_buf(b);
sycl::buffer sum_buf(sum.data(), num_items);
auto e = q.submit([&](auto &h) {
// Input accessors
sycl::accessor a_acc(a_buf, h, sycl::read_only);
sycl::accessor b_acc(b_buf, h, sycl::read_only);
// Output accessor
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(num_items,
[=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
});
q.wait();
}
void VectorAdd2(sycl::queue &q, const IntArray &a, const IntArray &b,
IntArray &sum) {
sycl::range num_items{a.size()};
sycl::buffer a_buf(a);
sycl::buffer b_buf(b);
sycl::buffer sum_buf(sum.data(), num_items);
auto e = q.submit([&](auto &h) {
// Input accessors
sycl::accessor a_acc(a_buf, h, sycl::read_only);
sycl::accessor b_acc(b_buf, h, sycl::read_only);
// Output accessor
sycl::accessor sum_acc(sum_buf, h, sycl::write_only, sycl::no_init);
h.parallel_for(num_items,
[=](auto i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
});
q.wait();
}
void InitializeArray(IntArray &a) {
for (size_t i = 0; i < a.size(); i++)
a[i] = i;
}
int main() {
sycl::default_selector d_selector;
IntArray a, b, sum;
InitializeArray(a);
InitializeArray(b);
sycl::queue q(d_selector, sycl::property::queue::enable_profiling{});
std::cout << "Running on device: "
<< q.get_device().get_info<sycl::info::device::name>() << "\n";
std::cout << "Vector size: " << a.size() << "\n";
auto start = std::chrono::steady_clock::now();
VectorAdd1(q, a, b, sum);
auto end = std::chrono::steady_clock::now();
std::cout << "Initial Vector add1 successfully completed on device - took "
<< (end - start).count() << " nano-secs\n";
start = std::chrono::steady_clock::now();
VectorAdd1(q, a, b, sum);
end = std::chrono::steady_clock::now();
std::cout << "Second Vector add1 successfully completed on device - took "
<< (end - start).count() << " nano-secs\n";
start = std::chrono::steady_clock::now();
VectorAdd2(q, a, b, sum);
end = std::chrono::steady_clock::now();
std::cout << "Initial Vector add2 successfully completed on device - took "
<< (end - start).count() << " nano-secs\n";
start = std::chrono::steady_clock::now();
VectorAdd2(q, a, b, sum);
end = std::chrono::steady_clock::now();
std::cout << "Second Vector add2 successfully completed on device - took "
<< (end - start).count() << " nano-secs\n";
return 0;
}
When the program above is compiled using the command below (assuming
that the name of the source file is
example.cpp
):dpcpp -O3 -o example example.cpp
and run, the output generated will show that the first call to
VectorAdd1
takes much longer than the calls to other kernels
in the program due to the cost of JIT compilation, which gets invoked
when vectorAdd1
is executed for the first time.The overhead of JIT compilation at runtime can be avoided by
ahead-of-time (AOT) compilation (it is enabled by appropriate switches
on the compile-line). With AOT compile, the binary will contain the
actual assembly code of the platform that was selected during
compilation instead of the SPIR-V intermediate code. The advantage
is that we do not need to JIT compile the code from SPIR-V to
assembly during execution, which makes the code run faster. The
disadvantage is that now the code cannot run anywhere other than the
platform for which it was compiled.
The example above can be compiled on a Gen9 GPU using the
following command with AOT code-generation:
dpcpp -O3 -o example example.cpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend=spir64_gen-unknown-unknown-sycldevice "-device skl"
When this compiled program is run, it can be seen from the output that the time
it takes to execute all the calls to the kernels takes the same amount of time,
unlike before where the first kernel takes a lot more time because of JIT
compilation.
If the application contains multiple kernels, one can force eager JIT
compilation or lazy JIT compilation using compile-time switches. Eager JIT
compilation will invoke the JITter on all the kernels in the binary at the
beginning of execution, while lazy JIT compilation will enable the JITter only
when the kernel is actually called during execution. In situations where
certain kernels are not called, this has the advantage of not translating code
that is never actually executed, which avoids unnecessary JIT compilation. This
mode can be enabled during compilation using the following option:
-fsycl-device-code-split=<value>
where <value> is
per_kernel: generates code to do JIT compilation of a kernel only when it is called per_source: generates code to do JIT compilation of all kernels in the source file when any of the kernels in the source file are called off: the default, which does eager JIT compilation of all kernels in the application
If the above program is compiled with this option:
dpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_kernel
and run, then from the timings of the kernel executions it can be seen that the
first invocations of
VectorAdd1
and VectorAdd2
take longer, while the
second invocations will take less time because they do not pay the cost of JIT
compilation.In the example above, we can put
VectorAdd1
and VectorAdd2
in separate
files and compile them with and without the per_source
option to see the
impact on the execution times of the kernels. When compiled withdpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp -fsycl-device-code-split=per_source
and run, the execution times of the kernels will show that the JIT compilation
cost is paid at the first kernel invocation, while the subsequent kernel
invocations do not pay the JIT compilation cost. But when the program is
compiled with
dpcpp -O3 -o example vec1.cpp vec2.cpp main.cpp
and run, the execution times of the kernels will show that the JIT compilation
cost is paid upfront at the first invocation of the kernel, and all subsequent
kernels do not pay the cost of JIT compilation.