Issue: Data Parallel Construct Inefficiency
- Startup costs for kicking off the data parallel algorithm on the worker threads.
- Imbalance costs encountered during the execution of the algorithm.
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- If the algorithm uses range, the runtime is automatically picking the block size and may be causing the imbalance. You can override this by using nd_range and specifying a block size that would eliminate the imbalance.
- If nd_range is used, this issue may be caused by a block_size that is larger than optimal. Reducing the block size may improve performance; using range and letting the runtime choose the block size may also be an option.
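
To see why an oversized block size can cause imbalance, the following minimal sketch estimates the heaviest per-worker load for a given block size. It is not the runtime's actual scheduler: it assumes blocks are dealt to workers in simple round-robin order, and the helper name max_worker_load and its parameters are hypothetical, introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of any runtime API): returns the largest
// number of work items assigned to a single worker when `items` work items
// are split into blocks of `block_size` and the blocks are dealt to
// `workers` workers in round-robin order. Real runtimes may schedule
// blocks dynamically, so treat the result as a rough estimate only.
std::size_t max_worker_load(std::size_t items,
                            std::size_t block_size,
                            std::size_t workers) {
    assert(block_size > 0 && workers > 0);

    std::vector<std::size_t> load(workers, 0);
    std::size_t block = 0;
    for (std::size_t start = 0; start < items; start += block_size, ++block) {
        // The last block may be shorter than block_size.
        const std::size_t len = std::min(block_size, items - start);
        load[block % workers] += len;
    }
    return *std::max_element(load.begin(), load.end());
}
```

Under these assumptions, with 1000 items and 8 workers the ideal load is 125 items per worker. A block size of 500 yields only two blocks, so one worker carries 500 items while six workers sit idle. A block size of 125 gives every worker exactly one block. A block size of 100 is smaller but leaves two workers with 200 items each, which shows that the goal is a block size that divides the work evenly across the workers, not simply the smallest one.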