This article describes internal driver optimizations for developers using Intel Atom® processors, Intel® Celeron® processors, and Intel® Pentium® processors with Intel® HD Graphics 500 or 505. The intent is to clarify existing documentation. The optimizations described are completely transparent. The only change needed from a developer perspective is to be aware that, for this special case, applications should be designed for the thread pool configuration rather than the underlying hardware.
Driver Thread Pool Optimizations Maximize EU Performance
For Intel® Core™ and Intel® Xeon® processors with integrated graphics, the total number of execution units (EUs) and the number of EUs per subslice are large enough that mapping thread pools directly to subslices is efficient. Tying thread pool implementation to hardware means that application behavior and hardware details can be described together in a way that is easy to visualize and remember. Many reference documents, such as The Compute Architecture of Intel Processor Graphics Gen9, take this approach.
However, for the relatively small GPUs in the embedded processors listed above, this approach could sometimes result in non-optimal mapping. For these processors, EUs are now pooled across subslices, creating "virtual subslices" that do not match the hardware. In this case, it helps to understand where behavior is driven by thread pools instead of hardware layout. The resulting configurations are:
- Intel HD Graphics 505: 18 EUs, 3x6 physical, now using 2x9 thread pools (shown above)
- Intel HD Graphics 500: 12 EUs, 2x6 physical, now using 1x12 thread pools (not shown)
The thread pools, not the physical hardware, determine how you should write your application. For example, if you have Intel HD Graphics 505, your application should be written as if there were two subslices of 9 EUs each, not three subslices of 6 EUs each.
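As a rough illustration of this distinction, the sketch below records both layouts in plain C and returns the one an application should design for. The names `gpu_layout`, `gpu_config`, `total_eus`, and `design_layout` are hypothetical and are not part of any Intel driver or OpenCL API. The EU counts are the ones listed above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layout description: number of groups x EUs in each group. */
typedef struct {
    int groups;        /* subslices (physical) or thread pools (virtual) */
    int eus_per_group; /* EUs in each subslice or pool */
} gpu_layout;

typedef struct {
    const char *name;
    gpu_layout physical;    /* hardware subslices */
    gpu_layout thread_pool; /* driver thread pools: design for this one */
} gpu_config;

/* Values taken from the two configurations listed in this article. */
static const gpu_config CONFIGS[] = {
    { "Intel HD Graphics 505", {3, 6}, {2, 9}  },
    { "Intel HD Graphics 500", {2, 6}, {1, 12} },
};

int total_eus(gpu_layout l) { return l.groups * l.eus_per_group; }

/* Return the thread pool layout for a known GPU name, or {0, 0} if unknown. */
gpu_layout design_layout(const char *name) {
    size_t n = sizeof CONFIGS / sizeof CONFIGS[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(CONFIGS[i].name, name) == 0) {
            /* Pooling regroups EUs; it never adds or removes any. */
            assert(total_eus(CONFIGS[i].physical) ==
                   total_eus(CONFIGS[i].thread_pool));
            return CONFIGS[i].thread_pool;
        }
    }
    gpu_layout unknown = {0, 0};
    return unknown;
}
```

Calling `design_layout("Intel HD Graphics 505")` yields 2 groups of 9 EUs, which is the shape to tune for. A real application would not hard-code such a table, since the driver applies the pooling automatically.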
In extensive testing, the new configuration never performed worse than the legacy configuration; in the worst case, it matched it. The performance boost from switching to 2x9/1x12 often approaches 2x. Since no scenarios were found that benefit from the legacy configuration, there are no plans to add extensions to modify MEDIA_POOL_STATE.
Thread Pool Size vs. Physical Hardware Configuration
There are four main areas to consider:
- Optimal work group size is determined by thread pool configuration, not physical hardware. The driver will automatically handle the thread launch details to maximize thread occupancy. State tracking (such as branch masking) is handled at the pool and subslice level by the driver, but for the most part these are implementation details that can be ignored by applications.
- Local memory is shared by threads in the same pool. The number of bytes reported by CL_DEVICE_LOCAL_MEM_SIZE is physically located in the lowest level cache, not the subslice. This means either 1 (for 1x12 HD Graphics 500) or 2 (for 2x9 HD Graphics 505) regions are reserved to be shared by all threads in the same work group.
- Work group barriers: Again, behavior is tied to the work group, which is defined by the thread pool. There are now two types of internal barrier implementations: "local barriers" within a physical subslice and "linked barriers" spanning subslices. The driver selects between them automatically. The application cannot change this behavior, and no additional tuning knobs are provided.
- Subgroup extensions: Subgroups are "between" work groups and work items, so their mapping to hardware remains unchanged. Work items in a subgroup execute on the same EU thread. For more info see Ben Ashbaugh's excellent section on subgroups in our extension tutorial.
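To make the work group sizing point more concrete, here is a minimal C sketch of the arithmetic involved, under a deliberately simplified model. The function names are hypothetical. The number of hardware threads per EU is not given in this article, so it is passed in as a parameter that must come from the hardware documentation. The SIMD width (8, 16, or 32 work items per EU thread) is chosen by the compiler for each kernel. The model ignores local memory and barrier limits, so treat the result as an upper bound and not a prediction.

```c
#include <assert.h>

/* EU threads needed to hold one work group when each EU thread
 * executes simd_width work items (rounded up). */
int eu_threads_for_work_group(int work_group_size, int simd_width) {
    assert(work_group_size > 0);
    assert(simd_width == 8 || simd_width == 16 || simd_width == 32);
    return (work_group_size + simd_width - 1) / simd_width;
}

/* EU threads available in one thread pool.
 * threads_per_eu is an assumption supplied by the caller. */
int pool_thread_capacity(int eus_per_pool, int threads_per_eu) {
    assert(eus_per_pool > 0 && threads_per_eu > 0);
    return eus_per_pool * threads_per_eu;
}

/* Upper bound on work groups of this size that fit in one pool at once.
 * Ignores local memory and barrier resources. */
int work_groups_per_pool(int eus_per_pool, int threads_per_eu,
                         int work_group_size, int simd_width) {
    return pool_thread_capacity(eus_per_pool, threads_per_eu) /
           eu_threads_for_work_group(work_group_size, simd_width);
}
```

With T threads per EU, a 9-EU pool offers 9*T EU threads, compared with 6*T for a single physical subslice, so larger work groups can stay within one pool. In real code, query the actual limits with `clGetKernelWorkGroupInfo` (`CL_KERNEL_WORK_GROUP_SIZE`, `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`) instead of computing them by hand.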
Conclusion
In the past, thread pools were always configured to match physical hardware. Intel HD Graphics 500 and 505 GPUs are now a notable exception, because pooling EUs across subslices improves performance. The change is transparent and requires few, if any, application changes: to your application, Intel HD Graphics 500 has 1 subslice with 12 EUs and Intel HD Graphics 505 has 2 subslices with 9 EUs, even though the underlying hardware is 2x6 and 3x6. The details above are provided as conceptual background; Intel completed this work behind the scenes to make efficient use of EUs easy across the full range of GPU options. Extensive internal testing has shown significant improvements from this driver optimization, and we have not yet seen a performance regression. However, we are always open to feedback. If you find a scenario where the legacy thread pool configuration may be a better fit, please let us know.
For more information, see: Broxton Graphics Programmer's Reference Manual.