User Guide

  • 2021.3
  • 09/23/2021
  • Public Content

Bottlenecks Reference

Intel® GPA Graphics Frame Analyzer highlights graphics architectural blocks with bottlenecks in the Metrics pane. This section describes bottlenecks, metrics related to each bottleneck, and possible solutions. To learn more about how the tool identifies bottlenecks using hardware metrics, refer to the cookbook article "Performance Optimization for Intel® Processor Graphics".

Sampler

Sampling is the process of fetching a value from a texture at a given position. You can configure multiple sampling parameters, such as filtering mode, to balance visual results and sampling performance.
Intel® GPA Graphics Frame Analyzer checks the difference between the percentage of time when a Sampler Input is available and the percentage of time when a Sampler Output is ready.
Metric Name | Description
GPU / Sampler: Slice <N> Subslice <M> Sampler Input Available | Percentage of time there is input from the EUs on slice ‘N’ and subslice ‘M’ to the sampler.
GPU / Sampler: Slice <N> Subslice <M> Sampler Output Ready | Percentage of time there is output from the sampler to EUs on slice ‘N’ and subslice ‘M’.
When Input Available is >10 percent greater than Output Ready for a subslice of a given slice, the sampler is not returning data back to the EUs as fast as it is being requested. The sampler is probably the hotspot. This comparison only indicates a primary hotspot when the samplers are relatively busy, which means that both EU Occupancy and EU Stall are relatively high.
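As a rough illustration, the comparison above can be written as a small check over metric values copied out of the Metrics pane. This is a minimal sketch, not part of Intel® GPA: the function name, the dictionary layout, and the reading of ">10 percent greater" as a gap of more than 10 percentage points are all assumptions.

```python
def sampler_hotspot_subslices(samples, margin=10.0):
    """Return (slice, subslice) pairs where the sampler looks like a hotspot.

    `samples` maps (slice, subslice) to a tuple of
    (Sampler Input Available, Sampler Output Ready), both in percent.
    This layout is hypothetical; it is not an Intel GPA export format.
    """
    flagged = []
    for key, (input_available, output_ready) in samples.items():
        # Assumption: the rule is a gap of more than `margin` percentage points.
        if input_available - output_ready > margin:
            flagged.append(key)
    return sorted(flagged)


samples = {
    (0, 0): (72.0, 55.0),  # gap of 17 points: flagged
    (0, 1): (40.0, 38.0),  # gap of 2 points: not flagged
}
print(sampler_hotspot_subslices(samples))  # [(0, 0)]
```

A flagged subslice only points at a primary hotspot when EU Occupancy and EU Stall are also relatively high, so this check is best read together with those metrics.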
There can be multiple reasons for the sampler to be a hotspot. To speed up the sampler, you can try the following:
  • Reduce the texture size.
  • Change the filtering mode.
  • Choose a texture format that stores less data per pixel, or an uncompressed texture format, if possible. In some cases, an uncompressed format may cause a new bottleneck for larger textures.
  • Reduce the number of surfaces on the screen where the texture is applied.
  • Adjust the sampling access pattern to make texture accesses more linear.
With Intel® GPA Graphics Frame Analyzer, you can optimize the Sampler bottleneck with real-time experiments, such as changing texture size and filter parameters in a pixel shader. For step-by-step instructions, refer to the "Optimize Sampler" cookbook recipe.

Shader Execution

A shader is a program that handles programmable graphics pipeline stages or performs general-purpose computations on a GPU. Shaders are executed on execution units (EUs) of the GEN architecture. In each EU, the primary computation units are a pair of SIMD (Single Instruction, Multiple Data) floating-point units (FPUs). FPU0 processes floating-point and integer operations. FPU1 performs floating-point operations and extended math instructions, so it is also referred to as the Extended Math (EM) unit.
To detect whether Shader Execution is a bottleneck, Intel® GPA Graphics Frame Analyzer checks whether the load on either FPU pipe exceeds 90 percent. Usually, the Shader Execution bottleneck is caused by pixel and compute shaders that perform complex computations and are executed many times.
Metric Name | Description
EU Array / Pipes: EU FPU0 Pipe Active | Percentage of time the Floating Point Unit (FPU) pipe is actively executing instructions.
EU Array / Pipes: EU FPU1 Pipe Active | Percentage of time the Extended Math (EM) pipe is actively executing instructions.
When EU Array / Pipes: EU FPU0 Pipe Active or EU Array / Pipes: EU FPU1 Pipe Active is above 90 percent, it can indicate that the primary hotspot is due to the number of instructions per clock (IPC). If so, adjust shader algorithms to remove unnecessary instructions, or use more efficient instructions, to improve IPC. For IPC-limited pixel shaders, ensure maximum throughput by limiting shader temporary registers to ≤ 16.
For step-by-step instructions on how to optimize the Shader Execution bottleneck, refer to the "Optimize Shader Execution" cookbook recipe.
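The 90 percent rule for the two FPU pipes can be sketched as a short helper. The function name and the "FPU0"/"FPU1" labels are illustrative assumptions, and the inputs are assumed to be the two Pipe Active percentages read from the Metrics pane.

```python
def fpu_bound_pipes(fpu0_active, fpu1_active, threshold=90.0):
    """Return the names of FPU pipes whose active percentage exceeds `threshold`.

    The labels "FPU0" and "FPU1" are shorthand chosen for this sketch.
    """
    readings = {"FPU0": fpu0_active, "FPU1": fpu1_active}
    return [name for name, value in readings.items() if value > threshold]


print(fpu_bound_pipes(94.5, 41.0))  # ['FPU0']
```

An empty result means neither pipe crosses the threshold, so Shader Execution is unlikely to be the primary hotspot by this rule.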

LLC/EDRAM/DRAM - Graphics Interface to Memory Hierarchy (GTI)

Metric Name | Description
GTI: SQ is full | Percentage of time that the graphics-to-memory interface is fully saturated for the event(s) due to internal cache misses.
When GTI: SQ is full reports more than 90 percent, this is probably a primary hotspot. Improve the memory access pattern of the event(s) to reduce cache misses. Even if this isn’t a primary hotspot, memory latency can make this a minor hotspot any time the metric is above 30 percent.
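The two thresholds in this section can be expressed as a small classifier. This is a sketch under stated assumptions: the function name and the returned labels are hypothetical, and the input is the GTI: SQ is full percentage for the selected event(s).

```python
def classify_gti_pressure(sq_full):
    """Classify memory-interface pressure from the "GTI: SQ is full" percentage.

    The labels returned here are illustrative, not Intel GPA terminology.
    """
    if sq_full > 90.0:
        return "primary hotspot"
    if sq_full > 30.0:
        return "minor hotspot"
    return "no hotspot"


print(classify_gti_pressure(45.0))  # minor hotspot
```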

Pixel Back-End - Color Write and Post-Pixel Shader (PS) Operations (PBE)

Metric Name | Description
GPU / 3D Pipe: Slice <N> PS Output Available | Percentage of time that color data is ready from pixel shading to be processed by the pixel back-end for slice ‘N’.
GPU / 3D Pipe: Slice <N> Pixel Values Ready | Percentage of time that pixel data is ready in the pixel back-end (following post-PS operations) for color write.
There are two stages in the pixel back-end (PBE): post-pixel shader (PS) operations and color write. Post-PS operations occur after the color data is ready from pixel shading to the back-end, and can include blending, late depth/stencil, and so on. Following post-PS operations, the final color data is ready for write-out to the render target.
If GPU / 3D Pipe: Slice <N> PS Output Available is greater than 90%, the pixel back-end is probably the primary hotspot, either in post-PS operations or in color write. To check, compare PS Output Available with Pixel Values Ready. If PS Output Available is more than 10% greater than Pixel Values Ready, the post-PS operations are the primary hotspot. Otherwise, the primary hotspot is color write.
If the difference between the min and max values of Slice <N> PS Output Available or Slice <N> Pixel Values Ready across slices is greater than 10%, a color-pipe workload imbalance is probably the primary hotspot.
If there’s a post-PS hotspot, adjust the algorithm or optimize the post-PS operations. To improve color write, improve the locality of writes (for example, through geometry ordering) by using other render target formats, dimensions, and optimizations.
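The pixel back-end rules above can be combined into one pass over per-slice values. This is a minimal sketch: the function name, the list-per-slice input layout, the output strings, and the treatment of "10%" as 10 percentage points are assumptions made for illustration.

```python
def pixel_back_end_findings(ps_output_available, pixel_values_ready):
    """Apply the pixel back-end rules to per-slice percentages.

    Both arguments are equal-length lists indexed by slice number
    (a hypothetical layout chosen for this sketch).
    """
    findings = []
    pairs = zip(ps_output_available, pixel_values_ready)
    for n, (available, ready) in enumerate(pairs):
        if available > 90.0:
            # A large drop to Pixel Values Ready points at post-PS operations.
            if available - ready > 10.0:
                stage = "post-PS operations"
            else:
                stage = "color write"
            findings.append(f"slice {n}: {stage}")
    # A wide min/max spread across slices in either metric suggests imbalance.
    for values in (ps_output_available, pixel_values_ready):
        if values and max(values) - min(values) > 10.0:
            findings.append("color-pipe workload imbalance")
            break
    return findings


print(pixel_back_end_findings([95.0, 93.0], [80.0, 90.0]))
# ['slice 0: post-PS operations', 'slice 1: color write']
```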

EU Occupancy - Shader Thread EU Occupancy

Metric Name | Description
EU Array: EU Thread Occupancy | Percentage of time that all EU threads were occupied with shader threads.
For GPGPU cases, when EU Array: EU Thread Occupancy is below 90%, it can indicate a dispatch issue, and the kernel may need to be adjusted for more optimal dispatch.
For 3D cases, low occupancy means the EUs are starved of shader threads by a unit upstream of thread dispatch.

Thread Dispatch (TDL)

Metric Name | Description
EU Array: EU Thread Occupancy | Percentage of time that all EU threads were occupied with shader threads.
GPU / Thread Dispatcher: PS Thread Ready for Dispatch on Slice <N> Subslice <M> | Percentage of time that a PS thread is ready for dispatch on the slice ‘N’ subslice ‘M’ thread dispatcher.
GPU / Thread Dispatcher: NonPS Thread Ready For Dispatch on Slice <N> Subslice <M> | Percentage of time that a non-PS thread is ready for dispatch on the slice ‘N’ subslice ‘M’ thread dispatcher.
For GPGPU cases, when EU Array: EU Thread Occupancy is below 90%, it can indicate a dispatch issue, and the kernel may need to be adjusted for more optimal dispatch.
For 3D cases, the bottleneck in the shader dispatch logic may occur in two cases:
  • PS Thread Ready for Dispatch on Slice <N> Subslice <M> is greater than 80%
  • Difference between min and max values of Non-PS Thread Ready for Dispatch on Slice <N> Subslice <M> is greater than 5%
To improve thread dispatch, reduce the shader’s thread payload, for example, register usage.
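For 3D workloads, the two dispatch conditions listed above can be checked together. The sketch below assumes the per-subslice percentages have been collected into plain lists, and that the PS condition is met when any subslice exceeds 80 percent; the function name and both of these choices are illustrative assumptions.

```python
def thread_dispatch_bottleneck(ps_ready, non_ps_ready):
    """Return True if either 3D thread-dispatch condition holds.

    `ps_ready` and `non_ps_ready` are lists of percentages, one entry per
    slice/subslice thread dispatcher (hypothetical layout).
    """
    # Condition 1 (assumed reading): any PS Thread Ready value above 80%.
    ps_saturated = any(value > 80.0 for value in ps_ready)
    # Condition 2: min/max spread of non-PS Thread Ready above 5 points.
    spread = max(non_ps_ready) - min(non_ps_ready) if non_ps_ready else 0.0
    return ps_saturated or spread > 5.0


print(thread_dispatch_bottleneck([85.0, 40.0], [10.0, 12.0]))  # True
```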

Setup Back-End (SBE)

Metric Name | Description
GPU / Rasterizer / Early Depth Test: Slice <N> Post-Early Z Pixel Data Ready | Percentage of time that early depth/stencil had pixel data ready for dispatch.
When GPU / Rasterizer / Early Depth Test: Slice <N> Post-Early Z Pixel Data Ready is above 90%, it can indicate that pixel shader constant data and/or attributes are loading slowly from the graphics cache (L3). To resolve this, improve the memory access patterns to constants and attributes.

Early Depth/Stencil (Z/STC)

Metric Name | Description
GPU / Rasterizer: Slice <N> Rasterizer Output Ready | Percentage of time that input was available for early depth/stencil evaluation from the rasterization unit.
GPU / Rasterizer / Early Depth Test: Slice <N> Post-Early Z Pixel Data Ready | Percentage of time that early depth/stencil had pixel data ready for dispatch.
When GPU / Rasterizer: Slice <N> Rasterizer Output Ready is above 90%, early depth/stencil throughput is low. A drop of more than 10 percent between GPU / Rasterizer: Slice <N> Rasterizer Output Ready and GPU / Rasterizer / Early Depth Test: Slice <N> Post-Early Z Pixel Data Ready indicates that early depth/stencil could be a hotspot. Evaluating other stencil operations, improving geometry (that is, reducing spatial overlap), and improving memory locality can all improve performance.
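The two early depth/stencil signals can be reported side by side. The text does not say whether both must hold at once, so this sketch returns them separately; the function name, the dictionary keys, and the reading of the drop as percentage points are assumptions.

```python
def early_depth_stencil_flags(rasterizer_output_ready, post_early_z_ready):
    """Evaluate the two early depth/stencil signals for one slice.

    Inputs are the Rasterizer Output Ready and Post-Early Z Pixel Data Ready
    percentages. The key names are chosen for this sketch only.
    """
    drop = rasterizer_output_ready - post_early_z_ready
    return {
        "low_throughput": rasterizer_output_ready > 90.0,
        "possible_hotspot": drop > 10.0,
    }


print(early_depth_stencil_flags(95.0, 70.0))
# {'low_throughput': True, 'possible_hotspot': True}
```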

Rasterization

Metric Name | Description
GPU / Rasterizer: Slice <N> Rasterizer Input Available | Percentage of time that input was available to the rasterizer from geometry transformation (VS-GS + clip/setup).
GPU / 3D Pipe / Strip Fans: Polygon Data Ready | Percentage of time that geometry pipeline output is ready.
If GPU / Rasterizer: Slice <N> Rasterizer Input Available is greater than 90%, the rasterization back-end is slow. If Rasterizer Input Available is more than 10% greater than Rasterizer Output Ready, simplify or reduce the amount of geometry that must be rasterized (for example, fewer vertices, better clipping/culling, and so on).
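The back-end check in the preceding paragraph can be sketched as follows. It assumes that "Output Ready" refers to the Rasterizer Output Ready metric from the Early Depth/Stencil section and that the 10% gap is measured in percentage points; the function name and return shape are also assumptions.

```python
def rasterizer_back_end_flags(input_available, output_ready):
    """Return (back_end_slow, reduce_geometry) for one slice.

    `input_available` is Rasterizer Input Available and `output_ready` is
    assumed to be Rasterizer Output Ready, both in percent.
    """
    back_end_slow = input_available > 90.0
    reduce_geometry = input_available - output_ready > 10.0
    return back_end_slow, reduce_geometry


print(rasterizer_back_end_flags(96.0, 80.0))  # (True, True)
```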
If GPU / 3D Pipe / Strip Fans: Polygon Data Ready is greater than 90%, the rasterization front-end is slow. Simplify or reduce the amount of geometry that must be rasterized (for example, fewer vertices, better clipping/culling, and so on).

Geometry Transformation (Non-Slice)

Reaching this point in the flow indicates that geometry transformation takes up a significant amount of execution time, as shown by the low output from rasterization. Further optimization is needed to reduce the cost of shading, clip, and setup. Possible optimizations include optimizing the VS->GS shaders, reducing the number of off-screen polygons generated from shading, and reducing unnecessary state changes between draws.

Shader Execution Stalled

Metric Name | Description
EU Array: EU Stall | Percentage of time that the shader threads were stalled.
When EU Array: EU Stall is above 10 percent, the stall could come from internal dependencies or from memory accesses initiated by the shader, so evaluate the L3 and the sampler. Otherwise, the execution hotspot is unknown and needs further debugging.

Unknown Shader Execution Hotspot

When you reach this point with a low stall but high occupancy, it indicates some EU execution inefficiency associated with the workload. Optimize the shader code itself to improve IPC.

Graphics Cache (L3)

Metric Name | Description
GTI / L3: Slice <N> L3 Bank <M> Active | Percentage of time that L3 bank ‘M’ on slice ‘N’ is servicing memory requests.
GTI / L3: Slice <N> L3 Bank <M> Stalled | Percentage of time that L3 bank ‘M’ on slice ‘N’ has a memory request but cannot service it.
GTI / L3: Slice <N> L3 Bank <M> Input Available | Percentage of time that L3 bank ‘M’ on slice ‘N’ has input available. The metric is available starting with 10th generation Intel® Core™ processor family (code name: Ice Lake).
GTI / L3: Slice <N> L3 Bank <M> Output Ready | Percentage of time that L3 bank ‘M’ on slice ‘N’ has output ready. The metric is available starting with 10th generation Intel® Core™ processor family (code name: Ice Lake).
L3 can be identified as a hotspot in two cases:
  • When Slice <N> L3 Bank <M> Active > 80% and Slice <N> L3 Bank <M> Stalled > 5%
  • When Slice <N> L3 Bank <M> Input Available > 30% and (Slice <N> L3 Bank <M> Input Available * 4 - Slice <N> L3 Bank <M> Output Ready) / 4 > 30%
Several clients interface with L3 for memory requests, and when a hotspot is seen here, improve memory access patterns to SLM, UAVs, texture, and constants.
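The two L3 conditions can be sketched as a per-bank check. The function name and signature are assumptions. Input Available and Output Ready are optional here because they are only reported starting with Ice Lake, so the second condition is skipped when they are missing.

```python
def l3_bank_hotspot(active, stalled, input_available=None, output_ready=None):
    """Return True if one L3 bank meets either hotspot condition.

    All arguments are percentages for a single slice/bank pair.
    """
    # Case 1: the bank is busy and also stalling.
    if active > 80.0 and stalled > 5.0:
        return True
    # Case 2: needs the two metrics that are only available on newer hardware.
    if input_available is None or output_ready is None:
        return False
    backlog = (input_available * 4 - output_ready) / 4
    return input_available > 30.0 and backlog > 30.0


print(l3_bank_hotspot(85.0, 6.0))              # True
print(l3_bank_hotspot(50.0, 1.0, 40.0, 20.0))  # True, since (160 - 20) / 4 = 35
```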

Unknown Shader Stall

This bottleneck indicates that a stall was seen during shader execution, but the root cause is not clear, so further debugging is required to determine it. Reduce internal dependencies within the shader code and improve the memory access pattern for all memory operations performed during execution.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.