Up Your Game and Know Your Intel® GPU Architecture

Many bottlenecks result either from too few threads being loaded in the GPU's execution unit or from stalls while the execution unit waits for reads or writes to complete.


Hello. Welcome to our GDC Showcase tech session, Up Your Game! Know Your Intel GPU Architecture. I am Pamela Harrison. I’m a software technical consulting engineer for Intel® Graphics Performance Analyzers (Intel® GPA).

Let me also introduce you to Stanislav Volkov. Stas is the Intel® GPA software architect. He and I are both here ready to answer your questions via the chat interface, throughout the session.

So, you want to increase your game’s performance. You’ve started implementation and want to check out your FPS (frames per second) rate or see if your GPU is idle anywhere. So, you crack open your favorite profiler. Intel® GPA is a fantastic choice. You capture a frame. You view the captured data in Intel® GPA’s Graphics Frame Analyzer, but what exactly does low EU Thread Occupancy mean in the context of your application? What’s the difference between EU Stall and EU Idle?

Understanding the precise meaning of the terms used by the profiler goes back to understanding the GPU EU (execution unit) architecture. Many types of bottlenecks result in either insufficient work provided to the execution unit or a stall in its execution. So, let’s go on a little tour of the GPU EU architecture, relate that to the metrics collected, learn what those metrics mean in the context of your game’s performance, and then learn how to apply this knowledge via a few examples using Intel® GPA for profiling.

First, just in case you haven’t used Intel® GPA, I will take a minute to tell you what it is. Then, before the EU architecture tour, we will take a quick look at a few of our latest features.

Intel® GPA is a set of analysis tools that help profile applications to find rendering bottlenecks. With this tool suite, you have the capability to analyze in real time or analyze captured data statically.

First, we have System Analyzer that allows you to visualize metrics and other data in real time as you play your game. This helps you find the areas of interest in your application that you want to analyze further.

Then, we have Graphics Trace Analyzer. After capturing a trace (data captured over a user-specified number of seconds), use this tool to visualize the big picture: whether you are CPU bound or GPU bound, and where tracks are idle.

And finally, we have Graphics Frame Analyzer, where you can view data from a captured frame or frames, detailed down to the shader, geometry, and texture levels.

Here are some of the most recent additions to our tool suite:

  • Across the Intel® GPA tool suite, we have added multi-GPU support.
  • In Graphics Trace Analyzer, two of our most recent additions are:
    • Sync highlighting and arrows between Signal, Render, and Present packages in the GPU queues and the calls that invoked them in the CPU threads
    • Indicators (percent and time) showing how much activity is occurring on the CPU cores, GPU queues, and CPU threads
  • In Graphics Frame Analyzer we have added:
    • Multiframe support in stream mode
    • And an awesome graph showing Render Target Dependency (currently for Direct3D* 11)
  • And finally, we have created an Intel® GPA plug-in for Unreal Engine* that you can find on GitHub*

Now we will begin the promised tour looking at some basic features of the EU.

In this presentation, we will describe the main building blocks of the EU using a sample Intel Xe architecture GPU, but outside of some specifics, this information is also applicable to other Intel® hardware generations. Intel Xe-LP (which you may know as TGL) can have up to 96 EUs facilitated by other hardware blocks, such as Thread Dispatch, Sampler, and L3 Cache. However, for the purpose of this presentation, we will concentrate only on the EU piece. We will follow up with more details in a deeper dive technical presentation video that we will publish later this quarter. From a 10,000-foot view, each EU logically consists of a General Register File (GRF), two Arithmetic Logic Units (ALU0 and ALU1), as well as Thread Control, Branch, and Send units. Let's go over some of these.

The General Register File or GRF is a memory space, which holds the shader execution state. It is split into seven slots, 4 KB each. Each slot holds the state of a single hardware thread. Each hardware thread can execute independent shader code, which is of some SIMD width (SIMD meaning Single Instruction, Multiple Data), such as SIMD8, SIMD16, or SIMD32. The higher the SIMD number, the more efficient the execution, but the more space required in the GRF slot (more on this in the upcoming deep dive video). So, if your shader uses a lot of temporary variables, it will force the Intel® compiler to fall back to a lower SIMD number. Also, it’s important to note that the more threads that are loaded into the GRF, the better, as they are used to hide execution latency, which we will discuss in a few slides.
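To make the register-pressure trade-off concrete, here is a minimal Python sketch. It assumes a very simplified model that is not from the talk: every live 32-bit temporary needs 4 bytes per SIMD lane, and nothing else competes for the 4 KB slot. The function name is hypothetical, and the real Intel® compiler's allocation is more involved.

```python
GRF_SLOT_BYTES = 4 * 1024   # one slot per hardware thread (from the talk)
GRF_SLOTS_PER_EU = 7        # seven hardware threads per EU (from the talk)


def widest_simd_that_fits(live_32bit_temps, widths=(32, 16, 8)):
    """Return the widest SIMD width whose live temporaries fit in one GRF slot.

    Simplified model (an assumption): each live 32-bit temporary occupies
    4 bytes per SIMD lane, and nothing else uses the slot.
    Returns None if even the narrowest width does not fit.
    """
    for width in widths:
        bytes_needed = live_32bit_temps * 4 * width
        if bytes_needed <= GRF_SLOT_BYTES:
            return width
    return None
```

Under these assumptions, a shader with 32 live temporaries still fits SIMD32, while one with 33 drops to SIMD16, which is the fallback described above.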

ALU0 and ALU1. All float and integer arithmetic operations are handled by these two arithmetic units. Keep in mind that they are not symmetric with respect to which instruction type they can execute. For example, on Intel Xe-LP, ALU0 handles simple float and int instructions, such as multiplication or addition, while ALU1 handles transcendental math instructions, such as square root, trig instructions, and so on. Each instruction type has a hardware SIMD-width with which the ALU can execute that particular instruction. For example, ALU0 can pipeline a SIMD8 float32 instruction in a single clock cycle. Higher SIMD instruction execution is split into phases and takes multiple clock cycles. For example, SIMD16 float32 will take two clock cycles to pipeline.
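The phase arithmetic can be written down in a couple of lines. This Python sketch assumes an instruction is split into equal one-clock phases of the native width. The SIMD8 and SIMD16 float32 figures come from the talk; anything wider is an extrapolation of the same rule, and the function name is hypothetical.

```python
import math

NATIVE_FLOAT32_WIDTH = 8  # from the talk: ALU0 pipelines SIMD8 float32 in one clock


def pipeline_cycles(simd_width, native_width=NATIVE_FLOAT32_WIDTH):
    """Clock cycles needed to pipeline one instruction of the given SIMD width.

    Assumes the instruction is split into ceil(simd_width / native_width)
    phases of one clock each.
    """
    if simd_width < 1 or native_width < 1:
        raise ValueError("widths must be positive")
    return math.ceil(simd_width / native_width)
```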

We won’t touch thread control or branch units today, but instead concentrate on the Send unit. The Send unit is responsible for all read and write operations, as well as some service messages, such as end-of-thread. These instructions are called asynchronous, as their execution time is not deterministic and depends on many factors, such as data locality in various cache levels and overall memory bus utilization rate. For this reason, they usually introduce a large execution latency while data is being read or written. The EU is designed to hide this latency by loading up to seven threads in each EU.

Let’s imagine that we have a hardware thread which executes some code and eventually hits a memory read instruction, implemented by Send. Now we have to wait for its result (in reality, the wait will happen on an instruction which tries to use the read result, not on Send itself, but for simplicity we will ignore this fact). So now our execution unit doesn’t execute any code because it is waiting for the read.

However, if we have one more thread loaded in the GRF of this EU, we can start to execute its code while we are waiting. This is exactly what the EU will do: it will switch between loaded threads in round robin fashion until it finds an instruction for which execution is not blocked by some dependency.

Let’s go back to the example. Eventually, our second thread can also hit a Send instruction and will need to wait for its result. If there are no other loaded threads, then the EU will wait for either Thread 1 or Thread 2 to be unblocked to continue execution.

In this particular example, there are two regions where the EU was able to hide data access latency by switching between threads; it switched to Thread 2 to hide latency in Thread 1, and was able to hide some latency in Thread 2 when Thread 1 was unblocked.

However, there is a region of time when execution in both hardware threads is blocked, and there were no loaded threads to hide it. This is called EU stall. This should be minimized as much as possible.
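The two-thread walkthrough can be reproduced with a toy simulation. This is only a sketch under stated assumptions: each thread is modeled as a list of hypothetical (alu_cycles, send_latency) segments, the EU stays on the current thread while it is ready and otherwise moves round robin to the next ready one, and issuing a Send costs no extra clock. It is not a model of the real hardware scheduler.

```python
def simulate_eu(programs):
    """Simulate one EU running the given hardware threads.

    programs: one list per thread of (alu_cycles, send_latency) segments.
    Each segment runs alu_cycles (>= 1) clocks of ALU work, then the thread
    is blocked for send_latency clocks while its Send completes. A Send at
    the end of a thread's last segment is not waited for.
    Returns (active_cycles, stall_cycles).
    """
    for program in programs:
        for alu_cycles, send_latency in program:
            if alu_cycles < 1 or send_latency < 0:
                raise ValueError("need alu_cycles >= 1 and send_latency >= 0")

    n = len(programs)
    segment = [0] * n                                   # current segment per thread
    alu_left = [p[0][0] if p else 0 for p in programs]  # ALU clocks left in it
    ready_at = [0] * n                                  # first clock the thread may run
    active = stall = 0
    current = 0
    cycle = 0

    def unfinished(i):
        return segment[i] < len(programs[i])

    while any(unfinished(i) for i in range(n)):
        # Round robin: start at the current thread, take the first ready one.
        chosen = None
        for k in range(n):
            i = (current + k) % n
            if unfinished(i) and ready_at[i] <= cycle:
                chosen = i
                break
        if chosen is None:
            stall += 1  # threads are loaded, but all are waiting on a Send
        else:
            active += 1
            current = chosen
            alu_left[chosen] -= 1
            if alu_left[chosen] == 0:
                # Segment finished: issue the Send and block until it returns.
                ready_at[chosen] = cycle + 1 + programs[chosen][segment[chosen]][1]
                segment[chosen] += 1
                if unfinished(chosen):
                    alu_left[chosen] = programs[chosen][segment[chosen]][0]
        cycle += 1
    return active, stall
```

With a thread that does 3 clocks of ALU work, waits 5 clocks on a Send, then does 2 more clocks of ALU work, `simulate_eu([[(3, 5), (2, 0)]])` returns `(5, 5)`: half the time is stall. Loading a second identical thread returns `(10, 3)`: twice the work, with only 3 of 13 clocks stalled.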

Now, with the basics described, we can properly understand the meaning of the following EU metrics, which represent various EU states, averaged across all EUs over the measurement interval:

  • EU Thread Occupancy: percent of occupied GRF slots. This generally should be as high as possible. A low EU Thread Occupancy value indicates either a bottleneck in preceding hardware blocks, such as vertex fetch or thread dispatch, or, for compute shaders, a suboptimal SIMD-width or Shared Local Memory usage.
  • EU Active: percent of time when ALU0 or ALU1 was executing some instruction. It should be as high as possible; a low value is caused either by a lot of stalls or by EUs being idle. Idle means that there are periods of time when there are no threads loaded on the EU at all.
  • EU Stall: percent of time when the EU was stalled, meaning, as you saw in the example, there is at least one thread loaded but no instructions are executing. This should be as low as possible and, as mentioned, a high stall value usually means a lot of memory accesses in shader code.
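These definitions can be expressed as a small calculation over per-clock samples. The sketch below is illustrative only: the sample format, a hypothetical (threads_loaded, alu_busy) pair per EU per clock, and the function name are assumptions, and this is not how Intel® GPA collects or computes its metrics.

```python
GRF_SLOTS_PER_EU = 7  # from the talk: seven hardware threads per EU


def eu_metrics(samples):
    """Compute EU metrics from (threads_loaded, alu_busy) samples.

    Each sample describes one EU during one clock. A sample with
    alu_busy=True is assumed to have at least one thread loaded.
    Returns percentages keyed by metric name.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    total = len(samples)
    occupied_slots = sum(threads for threads, _ in samples)
    active = sum(1 for _, busy in samples if busy)
    stalled = sum(1 for threads, busy in samples if threads > 0 and not busy)
    idle = sum(1 for threads, busy in samples if threads == 0 and not busy)
    return {
        "occupancy_pct": 100.0 * occupied_slots / (total * GRF_SLOTS_PER_EU),
        "active_pct": 100.0 * active / total,
        "stall_pct": 100.0 * stalled / total,
        "idle_pct": 100.0 * idle / total,
    }
```

In this model every clock is exactly one of active, stalled, or idle, so those three percentages always sum to 100, while occupancy is measured separately against the seven available slots.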

Now that we have some understanding of the EU architecture, let’s look at how that manifests in a profiler.

While it’s important to understand how the GPU works and what the metrics mean for efficient profiling, you don’t need to analyze each draw call in your frame manually in order to understand the problem type. To help with this sort of analysis, Intel® GPA provides an automatic Hotspot Analysis mode.

In Graphics Frame Analyzer, enable Hotspot Analysis by clicking on the button on the top left of the tool. The Bar Chart across the top then shows the bottlenecks, and the API Log in the left panel changes to show the bottleneck types. When you click on a bottleneck, the metrics viewer will show more details about the bottleneck with metrics descriptions and suggestions to alleviate the bottleneck.

In addition, in the upcoming demo, you will see the shader profiler. Enable it by clicking this button. Now you can see the source code with timings. You can also enable the assembly code. And change between the two modes: duration or execution count.

Let’s now look at our first example. In this case, if you look in GPA Metrics Viewer, you can see that occupancy is high, more than 90%, but there is still a stall on the EU. This means that EU threads are waiting for some data from memory, which is exactly what Metrics Viewer suggests by showing an L3 bottleneck.

For further analysis, we will use Shader Profiler, an Intel® GPA analysis type that shows per-instruction execution latencies. As you already know, latency is not always equal to stall. However, an instruction with higher latency has a higher probability of causing a stall. Therefore, when dealing with an EU stall bottleneck, Shader Profiler gives a good approximation of which instructions most likely caused the stall. So, in order to find the problem here, we will look for the longest Send instructions in assembly view, and then identify which shader source portions caused them. In this case, it will be the CalcUnshadowedAmountPCF2x2 function, which samples from ShadowMap and reads the constant buffer.

Graphics Frame Analyzer lets you analyze these resources in more detail and decide how to optimize access to them; for example, by changing formats, dimensions, or data layout. Further analysis here is outside the scope of this presentation.

Now let’s look at another example, a different type of performance problem. Here we have a Shader Execution bottleneck, which is characterized by very high occupancy and very low stall time, which is great. But this means that in order to reduce execution time we need to optimize the shader code itself. And this is where we can use Shader Profiler again.

From the previous step, we know that this problem is not related to an execution unit stall bottleneck. Therefore, we can safely ignore latencies of Send instructions, since low stall means that these latencies were perfectly hidden at the EU level and didn’t affect performance. Instead, we will analyze the hotspots in shader source code caused by arithmetic operations. By doing this, we will eventually find the CalcLightningColor function, which does calculations involving both simple and transcendental operations. In order to resolve this bottleneck, you will need to optimize this particular algorithm.

Now let’s look at an even more interesting example. In this case, there is a sequence of draw calls which have a thread dispatch bottleneck. As you may notice in these execution metrics, this results in suboptimal shader execution. In this particular case, we have a rather high stall rate (20%), but as we already know, this may be caused by an insufficient number of threads loaded on the EU, and this is exactly what EU thread occupancy indicates to us by showing 66%. So instead of directly fixing stalls in shader code, we need to increase the overall EU occupancy. Let’s look at what is causing this issue.
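The order of reasoning across these three examples can be summarized as a small decision function. This is a sketch only: the thresholds (80% occupancy, 10% stall) and the function name are hypothetical values chosen for illustration, not limits used by Intel® GPA's Hotspot Analysis.

```python
def triage(occupancy_pct, stall_pct, low_occupancy=80.0, high_stall=10.0):
    """Suggest where to look first, given EU Thread Occupancy and EU Stall.

    Thresholds are illustrative assumptions. Occupancy is checked first,
    because a stall seen at low occupancy may simply mean there were too
    few threads loaded to hide latency.
    """
    if occupancy_pct < low_occupancy:
        return "raise occupancy first"
    if stall_pct >= high_stall:
        return "inspect memory access (Send latencies)"
    return "optimize shader arithmetic"
```

For the draw calls in this third example, `triage(66, 20)` returns `"raise occupancy first"`, even though a 20% stall on its own might suggest looking at memory accesses.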

Here, we will again use Shader Profiler, but in Execution Count mode, which shows how many times each instruction was executed. If you look there at the pixel shader, you may notice that it has been compiled into both SIMD8 and SIMD16. And Shader Profiler shows that each instruction in the SIMD8 version was executed 24K times, while instructions in the SIMD16 version were executed 16K times. This is a 1.5x difference. As you may guess, it is preferable to have more SIMD16 threads, as they perform twice as many operations per single dispatch compared to SIMD8 threads. So what could be the reason for so many SIMD8 dispatches, and why are there two SIMD versions of the pixel shader in the first place?
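The execution counts quoted here can be turned into a quick breakdown. The Python sketch below takes the 24K and 16K figures from the talk and assumes each execution of an instruction corresponds to one thread covering `width` lanes; the function name and output keys are hypothetical.

```python
def dispatch_breakdown(exec_counts):
    """Break down per-instruction execution counts by SIMD width.

    exec_counts maps SIMD width -> per-instruction execution count.
    Assumes each execution processes `width` lanes.
    """
    if not exec_counts:
        raise ValueError("need at least one SIMD width")
    total_threads = sum(exec_counts.values())
    total_lanes = sum(width * count for width, count in exec_counts.items())
    widest = max(exec_counts)
    return {
        "thread_share": {w: c / total_threads for w, c in exec_counts.items()},
        "lane_share": {w: w * c / total_lanes for w, c in exec_counts.items()},
        # Threads needed if every lane had been shaded at the widest width
        # (ceiling division).
        "threads_if_all_widest": -(-total_lanes // widest),
    }


counts = {8: 24_000, 16: 16_000}  # figures quoted in the talk
breakdown = dispatch_breakdown(counts)
```

With these numbers, SIMD8 accounts for 60% of the executions but only about 43% of the lanes, and the same lane count would need 28,000 executions instead of 40,000 if everything ran as SIMD16.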

Let’s look at the geometry for these draw calls. As you may notice it is rather fine-grained. Let’s look even closer.

The observed anomaly boils down to how the GPU handles pixel shading. The shader compiler produced two SIMD versions (SIMD8 and SIMD16) of the pixel shader for a reason: the pixel shader dispatcher needs both so that it can choose which one to execute based on the rasterization result. The thing is, hardware doesn’t shade pixels one at a time. Instead, shading happens in groups; for example, a single pixel shader hardware thread shades 16 pixels at a time for SIMD16. Now in this case, if your primitive is rasterized into very few pixels or just a single pixel, then the GPU will still shade 16 pixels and will discard all those that turn out to be unnecessary. Therefore, in order to discard less, the pixel shader dispatcher schedules SIMD8 for very small primitives instead. And this is exactly what has happened in our case: a lot of highly detailed geometry produced a lot of SIMD8 invocations. As you may guess, in order to fix such a performance problem, you need to use geometry LODs in your game, LOD being “Level Of Detail”.
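The cost of shading tiny primitives can be estimated with a few lines of Python. This sketch assumes the simplest possible model: each primitive is shaded on its own, in whole threads of the given SIMD width, and any lane not covered by the primitive is wasted. Real hardware details, such as how pixels are grouped before dispatch, are deliberately ignored, and the function name is hypothetical.

```python
import math


def lane_utilization(covered_pixels, simd_width):
    """Fraction of shaded lanes that produce a visible pixel.

    Simplified model (an assumption): the primitive is shaded alone, in
    ceil(covered_pixels / simd_width) threads of simd_width lanes each.
    """
    if covered_pixels < 1 or simd_width < 1:
        raise ValueError("covered_pixels and simd_width must be positive")
    threads = math.ceil(covered_pixels / simd_width)
    return covered_pixels / (threads * simd_width)
```

For a primitive covering 5 pixels, this gives 0.3125 at SIMD16 and 0.625 at SIMD8, which is why the dispatcher prefers SIMD8 for very small primitives. A primitive covering 16 pixels reaches 1.0 at SIMD16, which is the situation coarser LODs aim for.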

This was a really quick overview of the EU and how it relates to the data that can be extracted from the execution of your application. We hope this helps you better understand how to increase the performance of your applications. We are in the process of creating a deeper dive video on this topic. You will find it later this quarter on the Intel® GPA Training page.

Thank you for attending this session. Don’t miss our next session: Program Your Games Today. Prepare for Tomorrow’s Intel CPU Architectures at 10:50. Enjoy the rest of the conference.