In DPC++, work is performed by enqueueing kernels into queues targeting
specific devices. These kernels are submitted by the host to the device and
then they get executed by the device and results are sent back. The
kernel submission by the host and the actual start of execution do not
happen immediately - they are asynchronous and as such we have to keep
track of the following timings associated with a kernel.
Kernel submission start time
This is the at which the host starts the process of submitting the kernel.
Kernel submission end time
This is the time at which the host finished submitting the kernel. The host
performs multiple tasks like queuing the arguments, allocating resources
in the runtime for the kernel to start execution on the device.
Kernel Launch time
This is the time at which the kernel that was submitted by the host starts
executing on the device. Note that this is not exactly same as the kernel
submission end time. There is a lag between the submission end time and the
kernel launch time which depends on the availability of the device. It is
possible for the host to queue up a number of kernels for execution before
the kernels are actually launched for execution. More over, there are a few
data transfers that need to happen before the actual kernel starts execution
which is typically not accounted separately from kernel launch time.
Kernel completion time
This is the time at which the kernel finishes execution on the device. The
current generation of devices are non-preemptive which means that once
a kernel starts it has to complete its execution.
Tools like VTune
provide a visual timeline for each of the above times
for every kernel in the application.
The following simple example, shows time being measured for the
and this will involve the kernel submission time on the host and the kernel
execution time on the device and any data transfer times (since there are
no buffers or memory this is usually zero in this case).
These overheads are highly dependent on the backend runtime being used and
the processing power of the host.
A way to measure the actual kernel execution time on the device is using
DPC++ built-in profiling API. The following code
demonstrates usage of the DPC++ profiling API to profile
kernel execution times. It also shows the kernel submission time. There is
no way to programmatically measure the kernel launch time since it is
dependent on the runtime and the device driver - we have to
depend on profiling tools to get this information.
The following picture shows the timeline of the execution for the
above example. This picture is generated from running
to generate a trace file and using Chrome* tracing to visualize the
timeline. In this timeline there are two swim lanes - one for the host
side and another for the device side. Notice that the only activity
on device side is the execution of the submitted kernel. A significant
amount of work is done on the host side to get the kernel prepared
for execution. In this case since the kernel is very small, total
execution time is dominated by the JIT compilation of the kernel which is the
in the figure below.
The following picture is the zoomed in version to show the detail
of the functions called on the host side to submit the kernel. Here
the time is dominated by the
. Also notice
that there is a lag between the completion of kernel submission on
the host and the actual launch of the kernel on the device.