Using the Timers
The standard C++ chrono library can be used to measure elapsed time at various precisions in a SYCL program. The following example uses std::chrono::steady_clock to time kernel execution from the host side.
#include <sycl/sycl.hpp>
#include <array>
#include <chrono>
#include <iostream>

using namespace sycl;
// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;
double VectorAdd(queue &q, const IntArray &a, const IntArray &b, IntArray &sum) {
  range<1> num_items{a.size()};

  buffer a_buf(a);
  buffer b_buf(b);
  buffer sum_buf(sum.data(), num_items);

  auto t1 = std::chrono::steady_clock::now(); // Start timing on the host

  q.submit([&](handler &h) {
    // Input accessors
    auto a_acc = a_buf.get_access<access::mode::read>(h);
    auto b_acc = b_buf.get_access<access::mode::read>(h);
    // Output accessor
    auto sum_acc = sum_buf.get_access<access::mode::write>(h);

    h.parallel_for(num_items, [=](id<1> i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  }).wait();

  auto t2 = std::chrono::steady_clock::now(); // Stop timing on the host

  return std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
}
void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++) a[i] = i;
}
int main() {
  default_selector d_selector;

  IntArray a, b, sum;
  InitializeArray(a);
  InitializeArray(b);

  queue q(d_selector);

  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  double t = VectorAdd(q, a, b, sum);

  std::cout << "Vector add successfully completed on device in " << t
            << " microseconds\n";
  return 0;
}
Note that this timing is purely from the host side: the kernel may start executing on the device well after the host submits it. SYCL also provides a profiling capability that lets you measure the time the kernel itself took to execute on the device.
#include <sycl/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;
// Array type and data size for this example.
constexpr size_t array_size = (1 << 16);
typedef std::array<int, array_size> IntArray;
double VectorAdd(queue &q, const IntArray &a, const IntArray &b, IntArray &sum) {
  range<1> num_items{a.size()};

  buffer a_buf(a);
  buffer b_buf(b);
  buffer sum_buf(sum.data(), num_items);

  event e = q.submit([&](handler &h) {
    // Input accessors
    auto a_acc = a_buf.get_access<access::mode::read>(h);
    auto b_acc = b_buf.get_access<access::mode::read>(h);
    // Output accessor
    auto sum_acc = sum_buf.get_access<access::mode::write>(h);

    h.parallel_for(num_items, [=](id<1> i) { sum_acc[i] = a_acc[i] + b_acc[i]; });
  });
  q.wait();

  // Device timestamps are reported in nanoseconds.
  return e.get_profiling_info<info::event_profiling::command_end>() -
         e.get_profiling_info<info::event_profiling::command_start>();
}
void InitializeArray(IntArray &a) {
  for (size_t i = 0; i < a.size(); i++) a[i] = i;
}
int main() {
  default_selector d_selector;

  IntArray a, b, sum;
  InitializeArray(a);
  InitializeArray(b);

  queue q(d_selector, property::queue::enable_profiling{});

  std::cout << "Running on device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  std::cout << "Vector size: " << a.size() << "\n";

  double t = VectorAdd(q, a, b, sum);

  std::cout << "Vector add successfully completed on device in " << t
            << " nanoseconds\n";
  return 0;
}
When these examples are run, the time reported by chrono is often much larger than the time reported by SYCL event profiling (note also that the two examples report in different units: microseconds versus nanoseconds). This is because the host-side measurement includes kernel submission overhead and any data transfers between the host and the offload device, while the profiling interval covers only the kernel's execution on the device.
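To see where the extra host-side time goes, an event's info::event_profiling::command_submit timestamp can be queried in addition to command_start and command_end; the gap between submission and start captures queuing and launch overhead. Below is a minimal sketch of this, using an illustrative USM kernel rather than the vector-add examples above; the queue must again be constructed with the enable_profiling property.

#include <sycl/sycl.hpp>
#include <iostream>

using namespace sycl;

int main() {
  default_selector d_selector;
  queue q(d_selector, property::queue::enable_profiling{});

  constexpr size_t n = (1 << 16);
  int *data = malloc_shared<int>(n, q);
  for (size_t i = 0; i < n; i++) data[i] = i;

  event e = q.parallel_for(range<1>{n}, [=](id<1> i) { data[i] *= 2; });
  e.wait();

  // All three timestamps are in nanoseconds on the device timebase.
  auto submit = e.get_profiling_info<info::event_profiling::command_submit>();
  auto start = e.get_profiling_info<info::event_profiling::command_start>();
  auto end = e.get_profiling_info<info::event_profiling::command_end>();

  std::cout << "Submit-to-start delay: " << (start - submit) << " ns\n";
  std::cout << "Kernel execution time: " << (end - start) << " ns\n";

  free(data, q);
  return 0;
}

On a first launch the submit-to-start gap can also include just-in-time compilation of the kernel, so it is worth running the kernel more than once before attributing the difference to data transfer or queuing alone.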