Specialization Constants
SYCL has a feature called specialization constants that can explicitly trigger JIT compilation to generate code from the intermediate SPIR-V code based on the run-time values of these specialization constants. These JIT compilation actions are done during the execution of the program when the values of these constants are known. This is different from the JIT compilation, which is triggered based on the options provided to -fsycl-device-code-split.
In the example below, the call to set_specialization_constant binds the value returned by the call to function get_value, defined on line 10, to the SYCL kernel bundle. When the kernel bundle is initially compiled, this value is not known and so cannot be used for optimizations. At runtime, after function get_value is executed, the value is known, so it is used by command groups handler to trigger JIT compilation of the specialized kernel with this value.
//============================================================== // Copyright © 2022 Intel Corporation // // SPDX-License-Identifier: MIT // ============================================================= #include <CL/sycl.hpp> #include <vector> class specialized_kernel; // const static identifier of specialization constant const static sycl::specialization_id<float> value_id; // Fetch a value at runtime. float get_value() { return 10; }; int main() { sycl::queue queue; std::vector<float> vec(1); { sycl::buffer<float> buffer(vec.data(), vec.size()); queue.submit([&](auto &cgh) { sycl::accessor acc(buffer, cgh, sycl::write_only, sycl::no_init); // Set value of specialization constant. cgh.template set_specialization_constant<value_id>(get_value()); // Runtime builds the kernel with specialization constant // replaced by the literal value provided in the preceding // call of `set_specialization_constant<value_id>` cgh.template single_task<specialized_kernel>( [=](sycl::kernel_handler kh) { const float val = kh.get_specialization_constant<value_id>(); acc[0] = val; }); }); } queue.wait_and_throw(); std::cout << vec[0] << std::endl; return 0; }
The specialized kernel at line 24 will eventually become the code shown below:
cgh.single_task<specialized_kernel>( [=]() { acc[0] = 10; });
This JIT compilation also has an impact on the amount of time it takes to execute a kernel. This is illustrated by the example below:
//============================================================== // Copyright © 2022 Intel Corporation // // SPDX-License-Identifier: MIT // ============================================================= #include <CL/sycl.hpp> #include <chrono> #include <vector> class specialized_kernel; class literal_kernel; // const static identifier of specialization constant const static sycl::specialization_id<float> value_id; // Fetch a value at runtime. float get_value() { return 10; }; int main() { sycl::queue queue; // Get kernel ID from kernel class qualifier sycl::kernel_id specialized_kernel_id = sycl::get_kernel_id<specialized_kernel>(); // Construct kernel bundle with only specialized_kernel in the input state sycl::kernel_bundle kb_src = sycl::get_kernel_bundle<sycl::bundle_state::input>( queue.get_context(), {specialized_kernel_id}); // set specialization constant value kb_src.set_specialization_constant<value_id>(get_value()); auto start = std::chrono::steady_clock::now(); // build the kernel bundle for the set value sycl::kernel_bundle kb_exe = sycl::build(kb_src); auto end = std::chrono::steady_clock::now(); std::cout << "specialization took - " << (end - start).count() << " nano-secs\n"; std::vector<float> vec{0, 0, 0, 0, 0}; sycl::buffer<float> buffer1(vec.data(), vec.size()); sycl::buffer<float> buffer2(vec.data(), vec.size()); start = std::chrono::steady_clock::now(); { queue.submit([&](auto &cgh) { sycl::accessor acc(buffer1, cgh, sycl::write_only, sycl::no_init); // use the precompiled kernel bundle in the executable state cgh.use_kernel_bundle(kb_exe); cgh.template single_task<specialized_kernel>( [=](sycl::kernel_handler kh) { float v = kh.get_specialization_constant<value_id>(); acc[0] = v; }); }); queue.wait_and_throw(); } end = std::chrono::steady_clock::now(); { sycl::host_accessor host_acc(buffer1, sycl::read_only); std::cout << "result1 (c): " << host_acc[0] << " " << host_acc[1] << " " << host_acc[2] << " " << host_acc[3] << " " << host_acc[4] << std::endl; } std::cout << "execution took : " << (end - start).count() << " nano-secs\n"; start = std::chrono::steady_clock::now(); { queue.submit([&](auto &cgh) { sycl::accessor acc(buffer2, cgh, sycl::write_only, sycl::no_init); cgh.template single_task<literal_kernel>([=]() { acc[0] = 20; }); }); queue.wait_and_throw(); } end = std::chrono::steady_clock::now(); { sycl::host_accessor host_acc(buffer2, sycl::read_only); std::cout << "result2 (c): " << host_acc[0] << " " << host_acc[1] << " " << host_acc[2] << " " << host_acc[3] << " " << host_acc[4] << std::endl; } std::cout << "execution took - " << (end - start).count() << " nano-secs\n"; }
Looking at the runtimes reported by each of the timing messages, it can be seen that the initial translation of the kernel takes a long time, while the actual execution of the JIT-compiled kernel takes less time. The same kernel which had not been precompiled to the executable state takes longer because this kernel will have been JIT-compiled by the runtime before actually executing it.
Below we provide some examples showing simple use cases and applications of specialization constants.
Simple Trip Count Use Case
The following example performs a summation and uses specialization constants to set the trip count.
//============================================================== // Copyright © 2022 Intel Corporation // // SPDX-License-Identifier: MIT // ============================================================= #include <CL/sycl.hpp> class SpecializedKernel; // Identify the specialization constant. constexpr sycl::specialization_id<int> nx_sc; int main() { sycl::queue queue; std::cout << "Running on " << queue.get_device().get_info<sycl::info::device::name>() << "\n"; std::vector<float> vec(1); { sycl::buffer<float> buf(vec.data(), vec.size()); // Application execution stops here asking for input from user int Nx; std::cout << "Enter input number ..." << std::endl; std::cin >> Nx; queue.submit([&](sycl::handler &h) { sycl::accessor acc(buf, h, sycl::write_only, sycl::no_init); // set specialization constant with runtime variable h.set_specialization_constant<nx_sc>(Nx); h.single_task<SpecializedKernel>([=](sycl::kernel_handler kh) { // nx_sc value here will be input value provided at runtime and // can be optimized because JIT compiler now treats it as a constant. int runtime_const_trip_count = kh.get_specialization_constant<nx_sc>(); int accum = 0; for (int i = 0; i < runtime_const_trip_count; i++) { accum = accum + i; } acc[0] = accum; }); }); } std::cout << vec[0] << std::endl; return 0; }
The goal is to specialize the trip count variable Nx for the loop in the kernel. Since the user inputs the trip count after execution of the program starts, the host compiler does not know the value of Nx. The input value can be passed as a specialization constant to the JIT compiler, allowing the JIT compiler to apply some optimizations such as unrolling the loop.
Without the specialization constants feature, the variable Nx would need to be a constant expression for the whole program to achieve this. In this way, specialization constants can lead to more optimization and hence faster kernel code, by creating constant values from runtime variables.
In contrast, the host compiler cannot effectively optimize the example loop below where the trip count is not a constant, since it needs runtime checks for safety/legality.
for (int i = 0; i < Nx; i++) { // Optimizations are limited when Nx is not a constant. }
Modifed STREAM Triad Application
In the following example (a modified STREAM Triad) we have the classic STREAM Triad with several multiply and add operations where the variable multiplier and the number of multiply-add operations are determined by an input variable.
Below is a snippet of the original kernel code run on the device. The runtime variable inner_loop_size is used to set the loop upper bound.
auto q0_event = q.submit([&](sycl::handler &h) { h.parallel_for<non_specialized_kernel>(array_size / 2, [=](auto idx) { // set trip count to runtime variable auto runtime_trip_count_const = inner_loop_size; auto accum = 0; for (size_t j = 0; j < runtime_trip_count_const; j++) { auto multiplier = scalar * j; accum = accum + A0[idx] + B0[idx] * multiplier; } C0[idx] = accum; }); }); q.wait(); cl_ulong exec_time_ns0 = q0_event .get_profiling_info<sycl::info::event_profiling::command_end>() - q0_event .get_profiling_info<sycl::info::event_profiling::command_start>(); std::cout << "Execution time (iteration " << i << ") [sec]: " << (double)exec_time_ns0 * 1.0E-9 << "\n"; min_time_ns0 = std::min(min_time_ns0, exec_time_ns0);
In order to improve performance, we use the specialization constant feature to specialize the variable inner_loop_size. Below is a snippet of the kernel code run on the device - using a specialization constant.
auto q0_event = q.submit([&](sycl::handler &h) { // set specialization constant using runtime variable h.set_specialization_constant<trip_sc>(inner_loop_size); h.parallel_for<specialized_kernel>( array_size / 2, [=](auto idx, sycl::kernel_handler kh) { // set trip count to the now known specialization constant auto runtime_trip_count_const = kh.get_specialization_constant<trip_sc>(); auto accum = 0; for (size_t j = 0; j < runtime_trip_count_const; j++) { auto multiplier = scalar * j; accum = accum + A0[idx] + B0[idx] * multiplier; } C0[idx] = accum; }); }); q.wait(); cl_ulong exec_time_ns0 = q0_event .get_profiling_info<sycl::info::event_profiling::command_end>() - q0_event .get_profiling_info<sycl::info::event_profiling::command_start>(); std::cout << "Execution time (iteration " << i << ") [sec]: " << (double)exec_time_ns0 * 1.0E-9 << "\n";
We finally compare the specialization trip count value with the following example that uses a regular constant value. Below is a snippet of the kernel code run on the device using a regular constant.
auto q0_event = q.submit([&](sycl::handler &h) { h.parallel_for<regular_constant_kernel>(array_size / 2, [=](auto idx) { // set trip count to known regular constant size_t runtime_trip_count_const = 10; auto accum = 0; for (size_t j = 0; j < runtime_trip_count_const; j++) { auto multiplier = scalar * j; accum = accum + A0[idx] + B0[idx] * multiplier; } C0[idx] = accum; }); }); q.wait(); cl_ulong exec_time_ns0 = q0_event .get_profiling_info<sycl::info::event_profiling::command_end>() - q0_event .get_profiling_info<sycl::info::event_profiling::command_start>(); std::cout << "Execution time (iteration " << i << ") [sec]: " << (double)exec_time_ns0 * 1.0E-9 << "\n"; min_time_ns0 = std::min(min_time_ns0, exec_time_ns0);
Timings from the runs of the three different versions are displayed below. The stream size represents the size of arrays A0, B0, and C0. The inner trip count represents the value of the runtime_trip_count_const variable set using the specialization constant.
Displayed below are timing outputs for example runs of the different versions using a stream size of 134217728 elements (1024 MB) and an inner trip count of 10 as inputs.
Run with runtime variable: Time in sec (fastest run): 0.00161008
Run with specialization constant: Time in sec (fastest run): 0.00156256
Run with constant value: Time in sec (fastest run): 0.00155104
The results, as expected, show that using the specialization constant improves the performance of the computation on the device from the execution time seen with the runtime variable to one that more closely matches the execution time seen with the constant value. Furthermore, analysis of the generated code shows the specialized version of the application unrolls the main loop due to it’s added capability to specialize the loop trip count & JIT compile it as a known constant. We witness about inner_loop_size times (thus 10 times in this example) as many floating-point add instructions in the main loop of the program using specialization constants as compared to the one using a runtime variable.