Intel® Graphics Performance Analyzers User Guide

ID 767266
Date 3/15/2023
Public

GPU Metrics

This section describes all the GPU metrics accessible from the Intel® GPA. The table below provides an overview of all GPU metrics available for Intel GPUs starting from the 3rd Generation Intel Core Processors.

NOTE:
  • Families of Intel® Xe graphics products starting with Intel® Arc™ Alchemist (formerly DG2) and newer generations feature GPU architecture terminology that shifts from legacy terms. For more information on the terminology changes and to understand their mapping with legacy content, see GPU Architecture Terminology for Intel® Xe Graphics.
  • For products formerly named Kaby Lake G, see GPU metrics description at https://gpuperfapi.readthedocs.io/en/latest/counters.html.
  • For DirectX* 11 targets, metrics are collected for a given application being profiled. For DirectX 12 targets, metrics are collected system-wide, including all running applications. While profiling a DirectX 12 application, it is recommended to stop all other running graphics applications.

Main Metrics

Metric Name

Description

GPU Duration

Represents the total GPU time for the frame, or for the selected event for Graphics Frame Analyzer within that frame.

Examples:

If GPU Duration is 80,000, it means that the GPU spends around 80 milliseconds to render the selected ergs.

Improving Performance:

When using GPU Duration as a metric to help analyze the performance of your game or application, it is important to understand the following:

  • If this value is too large, examine the underlying components of the rendering pipeline to see if one or more of these areas are too complex, resulting in potential performance bottlenecks. Check: Pixel Shader Duration, Vertex Shader Duration, Geometry Shader Duration metrics.
  • How effectively is the GPU working for the selected ergs? Check: GPU EUs Active, GPU EUs Stalled.
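
As a minimal sketch of the first check, assuming that duration metrics are reported in microseconds (as the 80,000 ≈ 80 ms example implies) and that the values have been copied by hand from the Graphics Frame Analyzer, the following Python snippet expresses each shader duration as a share of GPU Duration. The function name and the sample numbers are hypothetical and are not part of Intel GPA.

```python
def duration_shares(gpu_duration_us, stage_durations_us):
    """Return each stage duration as a percentage of GPU Duration."""
    if gpu_duration_us <= 0:
        raise ValueError("GPU Duration must be positive")
    return {
        stage: 100.0 * value / gpu_duration_us
        for stage, value in stage_durations_us.items()
    }


# Hypothetical values read from the metrics pane, in microseconds.
shares = duration_shares(80_000, {"PS": 40_000, "VS": 20_000, "GS": 0})
for stage, pct in shares.items():
    print(f"{stage} Duration: {pct:.1f}% of GPU Duration")
```

Because the shader durations are approximations, the shares are not guaranteed to sum to 100%.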

GPU Frequency

Represents the average GPU core frequency during the measurement period. The latest Intel GPUs support the Intel® Turbo Boost Technology 2.0 and can dynamically change frequency depending on CPU and GPU workloads.

Examples:

For Intel® HD Graphics 3000, the GPU Frequency increases to its maximum frequency when a heavy GPU load occurs.

Improving Performance:

Typically the system automatically adjusts the GPU Frequency to optimize total system performance between the CPU and the GPU.

When running the HUD, if the GPU frequency is always at its peak value for a particular system configuration, this could indicate that your system is GPU bound. If the GPU frequency is always at the lower end of the range, this could indicate that you are CPU bound, that the GPU is not being fully utilized, or both.

When running the Graphics Frame Analyzer, currently this metric does not provide an accurate measure of GPU performance, since the CPU is not being utilized as it would be during the running of your game when the frame was captured.

NOTE:

If the Intel graphics device supports multiple GPU frequencies, to minimize variation in metric values the Graphics Frame Analyzer locks the GPU at the maximum frequency available.

Avg GPU Core Frequency, MHz

Represents the average GPU Core Frequency in the measurement.

GPU Core Clocks

Represents the total number of GPU core clocks elapsed during the measurement period.

GPU Busy

Represents the percentage of time when the GPU is busy.

Examples:

For GPU-bound workloads, the value of the GPU Busy metric is 100%. A value less than 100% indicates that the GPU is spending time in an idle state, waiting for data from the CPU, in which case your game or application might be CPU-bound.

Improving Performance:

If GPU Busy is consistently less than 100% and you are encountering performance issues, consider threading your game and using the Graphics Trace Analyzer to understand the interaction between the CPU and GPU.
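
The rule of thumb above can be sketched as a small Python helper that averages GPU Busy samples collected over several frames. The 95% threshold, the function name, and the sample values are assumptions chosen for illustration; Intel GPA does not define such a cutoff.

```python
def classify_gpu_busy(samples_pct, threshold_pct=95.0):
    """Label a run from GPU Busy samples (percent). The threshold is an assumption."""
    if not samples_pct:
        raise ValueError("need at least one GPU Busy sample")
    average = sum(samples_pct) / len(samples_pct)
    label = "likely GPU-bound" if average >= threshold_pct else "possibly CPU-bound"
    return average, label


# Hypothetical per-frame GPU Busy readings.
average, label = classify_gpu_busy([60, 70, 80])
print(f"GPU Busy average {average:.1f}%: {label}")
```

A "possibly CPU-bound" result is only a hint to look at CPU and GPU interaction in the Graphics Trace Analyzer, not a diagnosis.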

HUD Overhead Time

Represents the Heads-up Display (HUD) overhead time.

Non-Culled Polygons

Represents the number of polygons processed that were not culled.

GTI Metrics

Metric Name

Description

GTI Write Throughput

Represents the total number of GPU memory bytes written to GTI.

GTI Read Throughput

Represents the total number of GPU memory bytes read from GTI.

DRAM LLC Throughput, bytes

Represents the approximate amount of GPU memory bytes transferred between LLC and DRAM controller.

LLC GPU Accesses, messages

Represents the total number of LLC cache lookups done from the GPU (64B reads, 32B writes).

NOTE:

This metric might show incorrect results and will be disabled with the next driver update.

LLC GPU Throughput, bytes

Represents the total number of GPU memory bytes transferred between GPU and LLC.

LLC GPU Hits, messages

Represents the total number of successful LLC cache lookups done from the GPU.

NOTE:

This metric might show incorrect results and will be disabled with the next driver update.

EU Array Metrics

Metric Name

Description

EU Idle %

Represents the percentage of time when the GPU execution units (EUs) were idle. An EU is idle when it is neither actively executing shader instructions nor stalled trying to execute shader instructions.

Examples:

  • If EU Idle % is 50, it means that the EUs were idle for 50% of the rendering time for selected ergs.
  • If EU Idle % is 0, it means that the EUs were either active or stalled for the entire duration of the rendering time for the selected erg.

Improving Performance:

If EU Idle % is significantly higher than 0%, this indicates that there are stalls elsewhere in the rendering pipeline.

EU Active %

Represents the percentage of time when the GPU execution units (EUs) were actively executing pixel, geometry, or vertex shader instructions.

Examples:

If EU Active % is 80, it means that the EUs were active 80% of the rendering time for the selected events.

Improving Performance:

If the EUs are not active, it means that they are either stalled waiting for a request to be fulfilled, or idle. You can see how much of the non-active time is caused by stalls by examining the EU Stall % metric. If the total EU busy time (EU Active % + EU Stall %) is significantly lower than 100%, this indicates that there are stalls elsewhere in the rendering pipeline.
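
The relationship between the three EU metrics can be sketched in Python as follows. The sketch assumes that EU Active %, EU Stall %, and EU Idle % together account for 100% of the measured time, as the definitions above suggest; the function name and sample values are hypothetical.

```python
def eu_breakdown(active_pct, stall_pct):
    """Derive total EU busy time, idle time, and the stalled share of busy time."""
    busy = active_pct + stall_pct
    idle = max(0.0, 100.0 - busy)
    # Share of the busy time that was spent stalled rather than executing.
    stall_share = 100.0 * stall_pct / busy if busy > 0 else 0.0
    return {"busy": busy, "idle": idle, "stall_share_of_busy": stall_share}


# Hypothetical readings: EU Active % = 50, EU Stall % = 25.
result = eu_breakdown(50.0, 25.0)
print(result)
```

A busy value well below 100 points to a bottleneck outside the EU array, while a large stalled share points to the shaders waiting on fixed function units.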

EU Stall %

Represents the percentage of time when the GPU execution units (EUs) were stalled. An EU becomes stalled when all of its threads are waiting for results from fixed function units (for example, a pixel shader requests texels from the texture sampler).

Examples:

  • If EU Stall % is 50, it means that EUs were stalled for 50% of the rendering time for the selected ergs.
  • If EU Stall % is 0, it means that there were no stalls in EUs or the stall time is very small.

Improving Performance:

If this metric is unexpectedly high, especially when compared with the EU Active % metric, you can analyze where the stalls happen by looking at the VS EU Stall %, GS EU Stall %, and PS EU Stall % metrics. If any of these metrics show that most of the stall time is in one particular shader, examine your shader code in the Graphics Frame Analyzer to determine why this shader might be causing the EUs to stall.
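
As an illustration of that comparison, the following Python sketch picks the shader stage with the largest EU stall percentage from values entered by hand. The function name and the sample readings are hypothetical.

```python
def dominant_stall_stage(stage_stalls_pct):
    """Return the (stage, percent) pair with the largest EU stall percentage."""
    if not stage_stalls_pct:
        raise ValueError("need at least one stage")
    stage = max(stage_stalls_pct, key=stage_stalls_pct.get)
    return stage, stage_stalls_pct[stage]


# Hypothetical VS/GS/PS EU Stall % readings for the selected ergs.
stage, pct = dominant_stall_stage({"VS": 5.0, "GS": 0.0, "PS": 40.0})
print(f"Most stall time is attributed to {stage} ({pct:.1f}%)")
```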

EU AVG IPC Rate, Number

Represents the average rate of IPC calculated for two FPU pipelines.

NOTE:

This metric might show incorrect results and will be disabled with the next driver update.

VS Duration

Represents an approximation of the total GPU time spent executing vertex shader code.

Examples:

  • If Vertex Shader Duration is 50,000, it means that GPU spends around 50 milliseconds to execute vertex shaders for selected ergs.
  • If Vertex Shader Duration is 0, it means that time spent in vertex shaders for selected ergs is very small.

Improving Performance:

If the Vertex Shader Duration time is significant compared to GPU Duration, vertex processing optimizations might be needed. In this situation, optimize the geometry by minimizing the Vertex Count, Primitive Count, and Vertex Shader Invocations Count. If you are using triangle lists, try to convert them to a single triangle strip to minimize the number of vertices sent to the pipeline. Also optimize the geometry for VCache (see Vertex Shader Invocations Count metric description). To see whether optimizations are possible, examine your vertex shader code in the Graphics Frame Analyzer. Refer to the Graphics API Performance Guide to find recommendations for vertex shader optimizations.

VS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Vertex Shader instructions.

Examples:

  • If VS EU Active % is 50%, half of the overall GPU time was spent actively executing Vertex Shader instructions.
  • If VS EU Active % is 0%, it means that no Vertex Shader was associated with the selected draw calls, or that the amount of time actively executing Vertex Shader instructions was negligible.

Improving Performance:

  • This metric is important if vertex processing seems to be a bottleneck for selected rendering calls. If VS EU Active % accounts for most of the EU active time, then to improve performance you should simplify the vertex shader or simplify and optimize the geometry of your primitives.
  • If VS EU Active % is significant, you should examine your vertex shader code to find reasons that might be causing stalls.

Inspect the shader code in the Graphics Frame Analyzer.

VS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Vertex Shader instructions.

NOTE:

This metric does not include the total amount of time stalled in the vertex shader, but only the fraction of the time when vertex shader stalls were causing the entire EU to stall. The entire EU stalls when all of its threads are stalled.

Examples:

  • If VS EU Stall % is 50% it means that half of the overall GPU time was spent stalled on Vertex Shader instructions.
  • If VS EU Stall % is 0% it means that no Vertex Shader was associated with selected rendering calls or Vertex Shader threads were not causing EU stalls.

Improving Performance:

  • This metric is important if vertex processing seems to be the bottleneck for selected rendering calls. If VS EU Stall % accounts for most of the EU active time, then to improve performance you may need to simplify the vertex shader or simplify and optimize geometry.
  • If VS EU Stall % is significant you need to concentrate on vertex shader code to find reasons causing stalls.

Inspect the shader code in the Graphics Frame Analyzer.

VS Invocations

Represents the number of vertex shader invocations - the vertex shader is invoked once per vertex. The number of vertex shader invocations depends both on the vertex and primitive counts and the operation of the post-transform vertex cache (VCache). In an optimal situation the GPU fetches already-processed vertices from the cache rather than recalculating this data, which could impact the value of this metric.

Therefore, when the VS Invocations and the Vertex Count have similar values, it means that the geometry is not optimized to take advantage of the VCache.

Examples:

The OptimizedMesh sample from the Microsoft* DirectX* SDK is a good example to illustrate the Vertex Count and VCache optimizations:

  • When rendering one un-optimized mesh as a triangle list, the Vertex Count is equal to 141K and the VS Invocations is 112K.
  • When rendering the same mesh as a triangle list that has been reordered for optimum VCache usage, the Vertex Count is still the same but the VS Invocations number drops to 27K, which is roughly four times fewer.
  • When rendering the same mesh as a VCache-optimized triangle strip, the Vertex Count drops to 52K and the VS Invocations drops to 25K.
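
To make the comparison concrete, this Python sketch computes the ratio of VS Invocations to Vertex Count for the three cases above, using the rounded figures quoted in the list. The function name is hypothetical; a ratio close to 1 suggests that the post-transform vertex cache is rarely hit.

```python
def vs_invocation_ratio(vs_invocations, vertex_count):
    """Return vertex shader invocations per submitted vertex."""
    if vertex_count <= 0:
        raise ValueError("Vertex Count must be positive")
    return vs_invocations / vertex_count


# (VS Invocations, Vertex Count), rounded as in the examples above.
cases = {
    "un-optimized triangle list": (112_000, 141_000),
    "VCache-optimized triangle list": (27_000, 141_000),
    "VCache-optimized triangle strip": (25_000, 52_000),
}
for name, (invocations, vertices) in cases.items():
    ratio = vs_invocation_ratio(invocations, vertices)
    print(f"{name}: {ratio:.2f} invocations per vertex")
```

The ratio is most useful for comparing orderings of the same primitive type. The triangle strip has a higher ratio than the optimized list only because its Vertex Count is much smaller; its absolute number of VS Invocations is still the lowest of the three.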

Improving Performance:

To improve vertex processing performance and reduce the number of vertex shader invocations, try to reorder the geometry for optimum VCache usage. The D3DX utility library contains functions that reorder the geometry to improve VCache utilization (ID3DXMesh::Optimize, D3DXOptimizeFaces, D3DXOptimizeVertices).

NOTE:
  • If you render point sprites, the metric is always equal to Vertex Count and Primitive Count (that is, no optimizations are necessary).

  • The size of the VCache varies for different GPU models, so you may see different metric values when using the same geometry on different hardware.

VS Send Pipe Active %

Represents the percentage of time in which EU send pipeline was actively processing a vertex shader instruction.

VS FPU0 Pipe Active %

Represents the percentage of time in which EU FPU0 pipeline was actively processing a vertex shader instruction.

VS FPU1 Pipe Active %

Represents the percentage of time in which EU FPU1 pipeline was actively processing a vertex shader instruction.

HS Duration

Represents the total amount of time the GPU spent executing hull shader code.

Examples:

  • If HS Duration is 50,000 it means that the GPU spent 50 milliseconds executing hull shader code for the selected ergs.
  • If HS Duration is 0, it means that either the time spent executing hull shader code was negligible, or there was no hull shader in use.

Improving Performance:

If the HS Duration is larger than you expect, you can examine your hull shader code in the Graphics Frame Analyzer to investigate possible optimizations.

HS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Hull Shader instructions.

HS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Hull Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.

NOTE:

This metric does not include the total amount of stalled time in the Hull Shader, but only the amount of time when the Hull Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.

Improving Performance:

If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls.

Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be 'hidden'. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities.

Inspect the shader code that was executed for a given draw call and experiment with optimizations in the Graphics Frame Analyzer.

HS Invocations

Represents the number of Hull Shader invocations. The Hull Shader is invoked once per patch.

Examples:

The SimpleBezier11 sample from the Microsoft* DirectX* SDK is a good example to understand Hull Shaders. This sample renders a Mobius strip comprised of four patches with 64 control points per patch. Execution of this sample will result in an HS Invocations value of four.

Improving Performance:

The Hull Shader is not usually a performance bottleneck, but it can definitely cause performance issues further down the rendering pipeline. If the Hull Shader specifies large tessellation factors, or as the HS Invocations value increases, it will result in more work for the fixed function tessellator as well as an increased number of DS Invocations and GS Invocations.

DS Duration

Represents the total amount of time the GPU spent executing domain shader code.

Examples:

  • If DS Duration is 50,000 it means that the GPU spent 50 milliseconds executing domain shader code for the selected ergs.
  • If DS Duration is 0, it means that either the time spent executing domain shader code was negligible, or there was no domain shader in use.

Improving Performance:

If DS Duration is larger than you expect, you can examine your domain shader code in the Graphics Frame Analyzer to investigate possible optimizations.

DS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Domain Shader instructions.

DS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Domain Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.

NOTE:

This metric does not include the total amount of stalled time in the Domain Shader, but only the amount of time when the Domain Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.

Improving Performance:

If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls. Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be 'hidden'. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities. You can inspect the shader code that was executed for a given draw call and experiment with optimizations in Graphics Frame Analyzer.

DS Invocations

Represents the number of Domain Shader invocations. The Domain Shader is invoked once per fixed function tessellator output point.

Examples:

The SimpleBezier11 sample from the Microsoft* DirectX* SDK is a good example to understand Domain Shaders. This sample renders a Mobius strip comprised of 4 patches with 64 control points per patch. Increasing the Patch Divisions slider increases the tessellation factors of the Hull Shader, which results in an increased number of inputs into the Domain Shader. When the Patch Divisions slider is set to 4.0, the DS Invocations value will be 192. When the Patch Divisions slider is set to 5.0, the DS Invocations value will be 320.

Improving Performance:

The purpose of a Domain Shader is to calculate the vertex positions for subdivided points output by the fixed function tessellator. The best way to improve performance is to minimize the number of DS Invocations. This can be done by decreasing the amount of tessellation performed, either by decreasing the number of HS Invocations or by decreasing the tessellation factors in the Hull Shader.

GS Duration

Represents the approximate total GPU time spent executing geometry shader code.

Examples:

  • If GS Duration is 50,000, it means that GPU spends around 50 milliseconds to execute geometry shaders for selected ergs.
  • If GS Duration is 0, it means that time spent in geometry shaders for selected ergs is very small or no geometry shaders were associated with selected ergs.

Improving Performance:

If you are encountering performance issues and the GS Duration time is more than 20% to 40% of the total GPU Duration, geometry shader code optimizations may be needed. Examine geometry shader code in the Graphics Frame Analyzer to see if optimizations are possible. Refer to the Graphics API Performance Guide for recommendations on how to optimize the geometry shader.

GS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Geometry Shader instructions.

Examples:

  • If GS EU Active % is 50% it means that half of the overall GPU time was spent actively executing geometry shader instructions.
  • If GS EU Active % is 0% it means that no geometry shader was associated with the selected draw calls, or that the amount of time actively executing geometry shader instructions was negligible.

Improving Performance:

  • This metric is important if geometry shader seems to be the bottleneck for selected rendering calls. If GS EU Active % accounts for most of the EU active time, then to improve performance you may need to simplify the geometry shader or simplify and optimize the geometry of the scene.
  • If GS EU Active % is more than a nominal amount, you may need to examine your geometry shader code to find reasons for what might be causing these stalls.

Inspect the shader code using the Graphics Frame Analyzer.

GS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Geometry Shader instructions.

NOTE:

This metric does not include the total amount of stalled time in the geometry shader but only the fraction of time when the geometry shader stalls were causing the entire EU to stall. The entire EU stalls when all of its threads are stalled.

Examples:

  • If GS EU Stall % is 50%, it means that half of the overall GPU time was spent stalled on Geometry Shader instructions.
  • If GS EU Stall % is 0%, it means that no Geometry Shader was associated with selected rendering calls or Geometry Shader threads were not causing EU stalls.

Improving Performance:

  • This metric is important if the geometry shader seems to be the bottleneck for selected rendering calls. If GS EU Stall % accounts for most of the EU active time, then to improve performance you may need to simplify the geometry shader or simplify and optimize geometry.
  • If GS EU Stall % is more than a nominal amount, you may need to examine your geometry shader code to find reasons for what might be causing these stalls.

Inspect the shader code using the Graphics Frame Analyzer.

GS Invocations

Represents the number of geometry shader invocations. The value is 0 if no geometry shader is associated with the rendering call.

NOTE:

See Microsoft* DirectX* SDK for a description of the shader invocation count.

Examples:

If GS Invocations is 1000 it means that the geometry shader was invoked for 1000 primitives.

Improving Performance:

The only way to minimize the number of geometry shader invocations is to minimize the number of input primitives. The impact on rendering performance of reducing the invocation count is highly dependent upon your specific game or application.

Post-GS Primitives

Represents the number of primitives that flowed out of the geometry shader (GS), if enabled, to the clipper. This metric is important if a geometry shader was associated with the selected rendering calls, and even more important if the number of primitives spawned by geometry shader code is dynamic.

NOTE:

If the GS was not enabled for the selected rendering calls, the metric returns a value of 0.

Examples:

If Post-GS Primitives is 1000 and Primitive Count is 100, it means that 1000 primitives were constructed in the geometry shader from the original 100.
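
The example above amounts to a simple amplification factor, sketched here in Python. The function name is hypothetical, and the sketch assumes a geometry shader was enabled so that Post-GS Primitives is meaningful.

```python
def gs_amplification(post_gs_primitives, primitive_count):
    """Return how many primitives the geometry shader emitted per input primitive."""
    if primitive_count <= 0:
        raise ValueError("Primitive Count must be positive")
    return post_gs_primitives / primitive_count


factor = gs_amplification(1000, 100)
print(f"The geometry shader emitted {factor:.1f} primitives per input primitive")
```

A factor that varies widely from frame to frame is a sign that the number of primitives spawned by the geometry shader is dynamic.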

Improving Performance:

Analyze the geometry shader code using Graphics Frame Analyzer.

PS Duration

Represents an approximation of the total GPU time spent executing pixel shader code.

Examples:

  • If Pixel Shader Duration is 50,000 it means that GPU spends around 50 milliseconds to execute pixel shaders for the selected ergs.
  • If Pixel Shader Duration is 0 it means that time spent in pixel shaders for selected ergs is very small.

Improving Performance:

Examine the Pixel Shader Duration time versus the GPU Duration; when Pixel Shader Duration is high you may improve overall rendering performance by optimizing your pixel shader code. Refer to the Graphics API Performance Guide to find advice for pixel shader optimizations.

PS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Pixel Shader instructions.

Examples:

  • If PS EU Active % is 50% it means that half of the overall GPU time was spent actively executing Pixel Shader instructions.
  • If PS EU Active % is 0% it means that no Pixel Shader was associated with the selected draw calls, or that the amount of time actively executing Pixel Shader instructions was negligible.

Improving Performance:

  • This metric is important if pixel shading seems to be the bottleneck for selected rendering calls.
  • If PS EU Active % accounts for most of the EU active time, then to improve performance you may need to simplify the pixel shader.
  • If PS EU Active % is larger than you would expect and you are encountering slow rendering times, you should examine the pixel shader code for potential reasons why these stalls may be occurring.

PS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Pixel Shader instructions.

NOTE:

This metric does not show total amount of stalled time in the pixel shader, but only the fraction of time when pixel shader stalls caused the entire EU to stall. The entire EU stalls when all of its threads are stalled.

Examples:

  • If PS EU Stall % is 50% it means that half of the overall GPU time was spent stalled on Pixel Shader instructions.
  • If PS EU Stall % is 0% it means that no Pixel Shader was associated with selected rendering calls or Pixel Shader threads were not causing EU stalls.

Improving Performance:

  • This metric is important if pixel shading seems to be the bottleneck for selected rendering calls. If PS EU Stall % accounts for most of the EU active time, then to improve performance you may need to simplify the pixel shader.
  • If PS EU Stall % is larger than you expect and you are encountering slow rendering times, you need to concentrate on pixel shader code to find reasons for these stalls.

PS Invocations

Represents the number of pixel shader invocations. The pixel shader is invoked once per pixel.

Examples:

If you render a quad with 8x8 pixels size that is located entirely within the viewing frustum, the Pixel Shader Invocation Count is 64.

Improving Performance:

Usually PS Invocations workloads are one of the most expensive in the rendering pipeline due to the processing time required within the pixel shader. Therefore, keeping the number of invocations as low as possible will likely improve your rendering performance.

NOTE:

For Intel® microarchitecture code name Ivy Bridge and Bay Trail, this metric includes pixels rejected by Early-Depth test, even though the pixel shader was not actually invoked for these pixels.

PS Send Pipeline Active %

Represents the percentage of time in which EU send pipeline was actively processing a pixel shader instruction.

PS FPU0 Pipe Active %

Represents the percentage of time in which EU FPU0 pipeline was actively processing a pixel shader instruction.

PS FPU1 Pipe Active %

Represents the percentage of time in which EU FPU1 pipeline was actively processing a pixel shader instruction.

EU FPU0 Pipe Active %

Represents the percentage of time during which the EU FPU0 pipeline was actively processing.

EU FPU1 Pipe Active %

Represents the percentage of time during which the EU FPU1 pipeline was actively processing.

EU Both FPU Pipes Active %

Represents the percentage of time in which both EU FPU pipelines were actively processing.

EU Send Pipe Active %

Represents the percentage of time during which the EU Send pipeline was actively processing.

CS Duration

Represents the total amount of time the GPU spent executing compute shader code.

Examples:

  • If CS Duration is 50,000 it means that the GPU spent 50 milliseconds executing compute shader code for the selected ergs.
  • If CS Duration is 0 it means that either the time spent executing compute shader code was negligible, or there was no compute shader in use.

Improving Performance:

If CS Duration is larger than you expect, you can examine your compute shader code in the Graphics Frame Analyzer to investigate possible optimizations.

CS EU Active %

Represents the percentage of overall GPU time that the EUs were actively executing Compute Shader instructions.

Examples:

  • If CS EU Active % is 0%, it means that no compute shader was associated with the selected draw calls, or that the amount of time actively executing compute shader instructions was negligible.

CS EU Stall %

Represents the percentage of overall GPU time that the EUs were stalled in Compute Shader instructions. A shader thread will stall when it reaches an instruction that cannot complete until some time-consuming operation is completed.

Examples:

  • If CS EU Stall % is 0%, it means that no Compute shader was associated with the selected draw calls, or that the amount of time stalled on Compute shader instructions was negligible.

NOTE:

This metric does not include the total amount of stalled time in the Compute Shader, but only the amount of time when the Compute Shader was causing the entire EU to stall. The EUs in the Intel® HD Graphics are hyperthreaded, which means that each EU can very quickly (within 2 clock cycles) switch from a stalled shader thread to another shader thread. Therefore, it is possible at any given time for a number of shader threads to be stalled on an EU, but for the EU to continue actively executing instructions on another shader thread. The entire EU is considered to be stalled only when all of its threads are stalled.

Improving Performance:

If a large amount of stall time seems to be occurring in a particular shader, then you should examine that shader to see whether you can reduce or eliminate some of the stalls. Short shaders might normally stall for a majority of their execution time, since in such situations instruction or data fetch (texels, constants) latency cannot be 'hidden'. If a large stall time occurs in longer shaders, it usually indicates inefficient shader execution and possible optimization opportunities.

Inspect the shader code that was executed for a given draw call and experiment with optimizations in Graphics Frame Analyzer.

CS Invocations

Represents the number of compute shader invocations. The Compute Shader is invoked once per thread per thread group. The number of threads per thread group is defined by the Compute Shader's numthreads attribute ( numthreads(tX, tY, tZ) ). The number of thread groups executed is determined by the parameters to the Dispatch call ( Dispatch(gX, gY, gZ) ). CS Invocations is equal to (gX*gY*gZ)*(tX*tY*tZ).

Examples:

  • If the numthreads attribute is numthreads(4, 4, 1) and Dispatch is called as Dispatch(16, 16, 16), the CS Invocations value will be equal to (16*16*16)*(4*4*1) = 65536.
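
The formula can be checked with a short Python sketch. The function name is hypothetical; the two tuples stand for the Dispatch(gX, gY, gZ) arguments and the numthreads(tX, tY, tZ) attribute.

```python
from math import prod


def cs_invocations(dispatch, numthreads):
    """Return (gX*gY*gZ) * (tX*tY*tZ) for one Dispatch call."""
    if len(dispatch) != 3 or len(numthreads) != 3:
        raise ValueError("expected three dimensions for each argument")
    return prod(dispatch) * prod(numthreads)


# Dispatch(16, 16, 16) with numthreads(4, 4, 1), as in the example above.
total = cs_invocations((16, 16, 16), (4, 4, 1))
print(total)  # 65536
```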

Sampler Metrics

Metric Name

Description

Sampler Busy %

Represents the percentage of time the texture sampler was busy handling texel fetch requests (that is, was either active or stalled).

NOTE:

This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples:

  • If Sampler Busy % is 50, it means that texture sampler was active 50% of the rendering time for the selected ergs.
  • If Sampler Busy % is 0, it means that texture sampler was not used or the time during which it was active is very small.

Improving Performance:

When Sampler Busy % is high, this might lead to execution unit stalls, especially if texture fetch latency is not covered by mathematical instructions running in parallel (the shader compiler attempts to arrange shader code to cover such latencies). Examine the EU Stall % metric to see the amount of EU stalls. If the percentage is high and the Sampler Busy % is close to 100%, most likely you have a texturing bottleneck. Try the 2x2 textures experiment in the Experiments pane in the Graphics Frame Analyzer to see if this is the case.
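
The check described above can be sketched as a Python predicate. The two thresholds (90% for "close to 100%" and 50% for a "high" stall percentage) and the function name are assumptions made for illustration; the guide does not give exact cutoffs.

```python
def looks_texture_bound(sampler_busy_pct, eu_stall_pct,
                        busy_threshold=90.0, stall_threshold=50.0):
    """Flag a possible texturing bottleneck. Both thresholds are assumptions."""
    return sampler_busy_pct >= busy_threshold and eu_stall_pct >= stall_threshold


# Hypothetical readings for the selected ergs.
suspect = looks_texture_bound(sampler_busy_pct=98.0, eu_stall_pct=60.0)
print("Possible texturing bottleneck" if suspect else "Texturing is probably not the limit")
```

A positive result is only a prompt to run the 2x2 textures experiment, which gives the actual confirmation.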

Sampler Texels, texels

Represents the number of texels returned from the texture sampler.

NOTE:

This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples:

If Sampler Texels, texels is 1000, it means that 1000 texels were delivered to the execution units (EUs) from the texture sampler.

Improving Performance:

A high number of texels fetched from textures leads to a higher texture bandwidth and a higher number of texture sampler unit stalls, which might cause a high number of EU stalls caused by shaders awaiting texels from the sampler unit. Note that this metric could indicate that the shader stalls while fetching texture data inside branching logic.

For example, if the shader fetches texture samples only inside an if() block in the code, this metric can help you understand how often the shader takes the branch.

NOTE:

This metric is accurate only to four texels, and generally is slightly larger than the actual number of texels used. This is because the texture sampler returns data in 2x2 texel quads. When sampling along angular edges, this inaccuracy becomes more pronounced.

Sampler Cache Misses, messages

Represents the number of bytes of texture data read from memory by the GPU due to texture cache misses when rendering this frame. Note that the Texture Sampler reads data from memory in 64-byte blocks, so the number of texture cache misses can be calculated by dividing the metric value by 64.

Examples:

  • If Sampler Cache Misses, messages is 64000, it means the Texture Sampler missed the cache 1000 times and needed to read 64000 bytes of memory.
  • If Sampler Cache Misses, messages is 0, it means that no texture data was read from memory for the selected ergs.
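The conversion used in these examples is a division by the 64-byte block size. The following sketch shows it; the constant and function names are illustrative and do not come from Intel® GPA.

```python
SAMPLER_BLOCK_BYTES = 64  # the Texture Sampler reads memory in 64-byte blocks


def texture_cache_misses(bytes_read):
    """Convert the metric value (bytes read) into a count of cache misses."""
    return bytes_read // SAMPLER_BLOCK_BYTES


print(texture_cache_misses(64000))  # 1000
print(texture_cache_misses(0))      # 0
```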

NOTE:

This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Improving Performance:

Usually a higher value for this metric leads to a higher percentage of Texture Sampler stalls. Therefore, utilize techniques that minimize the number of texture reads, such as shown in the "Improving Performance" section of the Sampler Stalled metric.

Sampler Bottleneck %

Represents the percentage of time that the texture sampler is a bottleneck. The sampler is stalling Execution Units (EUs) due to a full input FIFO and starving EUs due to a lack of results.

NOTE:

This metric is unreliable when protected HD media content is being played back on a system with Intel® HD Graphics 5000/ 4600 / 4400 / 4200, Intel® Iris® graphics 5100, or Intel® Iris® Pro graphics 5200 configuration.

Examples:

If Sampler Bottleneck % is 90, then the texture sampler is a bottleneck (stalling some EUs and/or causing other EUs to idle) 90% of the time.

Improving Performance:

The following techniques may improve the texture sampler performance:

  • Reducing the size of textures, by using a lower resolution or lower color precision (such as RGBA4444 instead of RGBA8888)
  • Using texture compression to reduce the amount of memory to transfer textures
  • Using mipmapping, so that smaller textures (mipmaps) can be used
  • Reducing the number of textures in the scene
  • Using a different filtering algorithm

For example, anisotropic filtering is more expensive to compute than a simpler algorithm, such as bilinear filtering. To help minimize overhead in this area, capture a typical frame while the game is running, use this frame as input to the Graphics Frame Analyzer, and try one or more of the following techniques:

  • the 2x2 Textures experiment in the Experiments tab to see if textures are a bottleneck
  • the Texture tab to see the texture size, format, and mip level
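To get a feel for how much data a lower color precision or a lower resolution saves, you can estimate the size of an uncompressed texture from its dimensions and bits per texel. This is a rough sketch with an illustrative function name; it covers a single mip level and ignores any driver-specific padding or tiling.

```python
def texture_bytes(width, height, bits_per_texel):
    """Estimate the size in bytes of one uncompressed mip level."""
    return width * height * bits_per_texel // 8


rgba8888 = texture_bytes(1024, 1024, 32)  # 8 bits per channel
rgba4444 = texture_bytes(1024, 1024, 16)  # 4 bits per channel
half_res = texture_bytes(512, 512, 32)    # half the resolution per axis

print(rgba8888)  # 4194304
print(rgba4444)  # 2097152
print(half_res)  # 1048576
```

Halving the color precision halves the data the sampler has to fetch, while halving the resolution on each axis reduces it to a quarter.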

NOTE:

This metric might show incorrect results and will be disabled with the next driver update.

Sampler Stalled

Represents the percentage of time the texture sampler was stalled. The texture sampler is stalled when its output queue is full, which can occur when it returns texture requests faster than the EUs can process them. When the texture sampler is stalled, it cannot process new requests.

Examples:

  • If Sampler Stalled is 50%, it means that for half of the time the texture sampler was busy, it was waiting for space to open up in its output queue.
  • If Sampler Stalled is 0%, it means that the texture sampler never stalled.

Improving Performance:

  • Reduce the number of texture fetches in the shader code.
  • Reduce texture size and texture filtering setting under the Texture tab in the Graphics Frame Analyzer to see if this helps improve performance without adversely affecting image quality.
  • Minimize anisotropic filtering, because it requires a high number of additional texel fetches and is therefore "expensive" to use.
  • Modify the texture fetching pattern in the shader code to optimize texture cache utilization.

To inspect shader code, see the Shaders tab in the Graphics Frame Analyzer.

3D Pipe Metrics

Metric Name

Description

Early Hi-Depth Test Fails, pixels

Represents the total number of pixels dropped on the early hierarchical depth test.

Early Depth Test Fails, pixels

Represents the number of pixels that failed the early depth/stencil tests.

Clipper Invocations

Represents the number of primitives processed by the Clipper.

Examples:

  • If you render 100 triangles and clipping is enabled, the Clipper Invocations is 100.
  • If you render 100 triangles and clipping is disabled, the Clipper Invocations is 0.

Improving Performance:

In most cases you do not need to worry about clipper performance on Intel® HD Graphics 2000/3000 GPUs because these graphics processors use a fast clipping algorithm implemented in silicon.

For more information on enabling/disabling hardware clipping read the Microsoft* DirectX* SDK documentation.

Post-Clip Primitives

Represents the number of primitives that flowed out of the clipper. The metric includes original primitives that passed the trivial clipping test (trivial accept), and new primitives that were created by the clipper as a result of the clipping operation.

Examples:

  • If you render 100 triangles and clipping is enabled and all the triangles are trivially accepted, the Post-Clip Primitives is 100.
  • If you render 100 triangles and clipping is enabled and all the triangles are trivially rejected, the Post-Clip Primitives is 0.
  • If you render 100 triangles and clipping is enabled and one or more triangles are partially located within the viewing frustum, the Post-Clip Primitives count returns a value which could be more or less than 100 depending on the number of triangles that were clipped. If the value is significantly higher than 100, it means that many triangles were partially clipped, and the clipper created additional triangles.

Improving Performance:

In most cases you do not need to worry about clipper performance on Intel® HD Graphics 2000/3000 GPUs because these graphics processors implement an efficient clipping algorithm in silicon.

For more information on enabling/disabling hardware clipping read the Microsoft* DirectX* SDK documentation.

Samples Killed in PS, pixels

Represents the total number of samples or pixels dropped in pixel shaders.

Primitive Count

Represents the number of primitives sent to the 3D hardware.

NOTE:

For Microsoft* DirectX* 9: the Primitive Count metric matches the PrimitiveCount parameter in the rendering calls.

Examples:

  • If you render 100 points, the IA stage assembles 100 point primitives and the Primitive Count is 100.
  • If you render two triangles as a triangle list, the IA stage assembles two triangles and the Primitive Count is two.
  • If you render two triangles as a triangle strip, the IA stage assembles two triangles and the Primitive Count is two.

Improving Performance:

If geometry/vertex processing becomes a bottleneck, try to reduce the number of primitives sent to the GPU for each frame by:

  • Simplifying your rendering geometry; for example, show small geometry details using bump maps instead of triangles, use lower detail models for far away objects, or use textures with multiple mip maps.
  • Optimizing your scene through various culling techniques; for example, use Binary Space Partitioning (BSP), Portal rendering, or Octrees.

Vertex Count

Represents the number of vertices sent to the 3D hardware pipeline during the D3D Input Assembler (IA) stage. The number of vertices depends on the primitive type and the number of primitives. The following formulas are used:

Primitive Type    Vertex Count
Point list        Number of Primitives
Triangle list     Number of Primitives * 3
Triangle strip    Number of Primitives + 2
Line list         Number of Primitives * 2
Line strip        Number of Primitives + 1
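The formulas above can be written as a small lookup. This is a sketch with illustrative names, and it assumes that a draw with zero primitives contributes zero vertices, which the table itself does not state.

```python
# Vertex count per primitive type, as a function of the primitive count n.
VERTEX_COUNT_FORMULAS = {
    "point list": lambda n: n,
    "triangle list": lambda n: n * 3,
    "triangle strip": lambda n: n + 2,
    "line list": lambda n: n * 2,
    "line strip": lambda n: n + 1,
}


def vertex_count(primitive_type, num_primitives):
    """Return the expected Vertex Count for a single draw call."""
    if num_primitives <= 0:
        return 0  # assumption: an empty draw sends no vertices
    return VERTEX_COUNT_FORMULAS[primitive_type](num_primitives)


print(vertex_count("point list", 100))    # 100
print(vertex_count("triangle list", 2))   # 6
print(vertex_count("triangle strip", 2))  # 4
```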

NOTE:
  • For Microsoft* DirectX* 9: When rendering indexed primitives, the Vertex Count metric does not match the NumVertices parameter in the ::DrawIndexedPrimitive and ::DrawIndexedPrimitiveUP functions because the Input Assembler counts shared vertices multiple times.
  • For Microsoft* DirectX* 10 and later: The Vertex Count metric does not include vertices created during the geometry shader stage.

Examples:

  • If you render 100 points, the IA stage assembles 100 point primitives with 100 vertices total, and the Vertex Count is 100.
  • If you render two triangles as a triangle list, the IA stage assembles two triangles with six vertices total, and the Vertex Count is six.
  • If you render two triangles as a triangle strip, the IA stage assembles two triangles with four vertices total, and the Vertex Count is four.

Improving Performance:

To minimize the number of vertices sent to the pipeline and thereby improve vertex processing performance, use graphics primitives that minimize the amount of data being sent to and processed by the GPU, such as using single triangle strips.

Samples Blended, pixels

Represents the total number of samples or pixels written to all render targets.

Samples Written

Represents the number of pixels/samples written to render targets.

The graphics driver 9.17.10 introduces a new notion of deferred clears. For the sake of optimization, the driver decides whether to defer the actual rendering of clear calls in case subsequent clear and draw calls make it unnecessary. As a result, when clear calls are deferred, the Graphics Frame Analyzer shows their GPU Duration and Samples Written as zero. If later it turns out that a clear call needs to be drawn, the work associated with that clear call gets included in the duration of the erg that was being drawn when this clear call was deferred, not necessarily a clear call. This means that in the Graphics Frame Analyzer metrics associated with a clear call accurately reflect the real work associated with that erg.

Alpha Test Fails

Represents the number of pixels that failed the alpha test and are ignored (not written to the surface).

Examples:

If Alpha Test Fails is 5000, then 5000 pixels failed the alpha test and were not written to the surface.

Pixels Rendered

Represents the number of pixels that passed the depth-test (both Z-buffer and Stencil if enabled). If the depth-test was disabled, Pixels Rendered counts all the pixels that passed through from the previous pipeline stage.

NOTE:

Pixels that passed the depth-test might not necessarily appear in the render target, which could occur if the color buffer write mask is set to 0.

Examples:

  • If you render a quad with 8x8 pixels, located entirely within the viewing frustum, and all the pixels pass the depth test, Pixels Rendered is 64.
  • If you render a quad with 8x8 pixels, located entirely within the viewing frustum and half of the pixels are rejected by depth tests or other stages of the graphics pipeline, Pixels Rendered is 32.

Improving Performance:

A high number of rendered pixels results in a high number of pixel shader executions, which requires more rendering time. To keep the number of rendered pixels as low as possible, optimize the rendering order to maximize Early-Z benefit or use a Z-only pass if possible.

To find areas with high depth complexity, use the Overdraw option in the Graphics Frame Analyzer.