View Performance Inefficiencies of Data-parallel Constructs
- The parallel algorithms are nested under the kernel name when the kernel name can be demangled correctly.
- The Efficiency column indicates the efficiency of the algorithm, when associated with the algorithm name. For the participating worker threads, the efficiency column indicates the efficiency of the thread while participating in the execution. This data is typically derived from the total time spent on the parallel construct and the time the thread spent participating in other parallel constructs.
- Task Count column indicates the number of tasks executed by the participating thread.
- Duration indicates the time the participating thread spends executing tasks from the parallel construct.
- CPU time is theDurationcolumn data expressed as a percentage of the wall clock time of the parallel construct.
- Other Timewill be 0 if the thread fully participates in the execution of tasks from the parallel construct. However, in runtimes such asIntel® oneAPI Threading Building Blocks, the participating threads may steal tasks from other parallel constructs submitted to the device to provide better dynamic load balancing and throughput. In such cases, theOther Timecolumn will indicate the percentage of the total wall clock time the participating thread spends executing tasks from other parallel constructs.
- Fork Imbalanceindicates the penalty for waking up threads to participate in the execution of tasks from the parallel construct. For more information, see Startup Penalty.
- Join Imbalanceindicates the degree of imbalanced execution of tasks from the parallel constructs by the participating worker threads. For more information, see Data Parallel Efficiency.