Expert Interview: Speed up Your Game on Intel® CPU and GPU



Pamela Harrison and Jennifer DiMatteo are accomplished software development engineers as well as Software Technical Consulting Engineers at Intel with a combined 15+ years in strategic high-touch customer support and performance optimization. They speak with us about the art of gaming workload optimization using Intel profiling and analysis tools.

Note: Emphasis and bullets are inserted to facilitate clarity.

Pamela Harrison and Jennifer DiMatteo created a detailed video presentation to demonstrate the use of Intel® Graphics Performance Analyzers (Intel® GPA) and Intel® VTune™ Profiler to take your gaming application performance to the next level. Pamela and Jennifer are Software Technical Consulting Engineers working with game developers on a daily basis. They are using their in-depth knowledge of developer pain points to help continuously improve the experience with analyzers and tools for game software optimization.

This spring at the Game Developer Conference in San Francisco, Pamela spoke with several game developers and artists to help them see how easy and important it is to do regular, quick profiling snapshots during the game development process.

In their demo, they feature Soma Games*’ The Lost Legends of Redwall* to show the power of Intel GPA to identify bottlenecks quickly in your games. You can then easily improve overall performance and frame rates with the help of Intel GPA and additional insights gathered by VTune Profiler.  

The video “Optimize Games Across Platforms” in the Intel® Game Dev Program’s All Access collection goes over the highlights of their talk.


Today we decided to meet up with Jennifer and Pamela again to talk more about the role Intel VTune Profiler and Intel GPA play in game development and game optimization. We asked them about some of the key analysis tool features that benefit this type of software development the most and which direction they believe profiling technology benefitting gamers is headed.

Intel® Graphics Performance Analyzers

Rob: Pamela, in the video you speak about the 3 components of Intel’s GPA tool suite.

They cover monitoring of real-time system metrics to test application robustness and system resource usage, trace analysis to visualize CPU/GPU workload distribution, and identify opportunities to tune workload balance between processing units.

Last but not least, with the Graphics Frame Analyzer you can capture and analyze your game’s performance, frame by frame.

You are able to capture frames in your game, inspect all aspects of them, and find performance issues at the individual draw-call level, with geometry and other resource visualizations. You can analyze the captures and share them with colleagues. For example, an artist on the team might notice a section of the game where her sprite's movements are choppy. She can capture a trace or frame in that part of the game and send the capture to a developer for analysis. Additionally, you can even replay those captures on other platforms to compare frame rate.

How would you describe the typical graphics-optimization workflow for a game? Let us assume you are a game developer. You are handed code that was developed a few years ago and is in need of an update. The original owner has left the company, and it is your job to identify opportunities to adapt the game to the latest available gaming hardware.

What is your general recipe to approach this scenario?

Pamela: Since we are not familiar with the game’s baseline performance, I would suggest starting with System Analyzer to find places in the game where either frame rate drops or visual quality suffers. As you play the game, visually note where there are issues as you watch the data flow in the tracks of System Analyzer.

Another option is to capture that data, either by using System Analyzer’s ability to save the data to a CSV file as you play through an interesting game segment over several minutes, or by capturing a trace or a frame with the click of a button.

After finding an area of interest, go back to Graphics Monitor and capture a trace. After the trace-capture is completed, you can double-click the new thumbnail that was generated in the upper right corner of Graphics Monitor. This will open the trace in Graphics Trace Analyzer.

Two of the most awesome things you can then do with this tool are:

  • Determine whether you are GPU- or CPU-bound by the existence of spaces along the timeline, either in the CPU core tracks or the GPU execution track.
  • Follow synchronization or dependency arrows:
    • For example, from a packet in the GPU execution track, follow along the orange arrows to that same packet in the GPU queue and then to the call that issued that packet in a CPU thread.
    • Similarly, follow the green arrows between Wait calls in CPU threads to Signal calls in the GPU queue.
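As a rough illustration of the first point, the sketch below estimates how much of a time window each track spends idle. It is not part of Intel GPA: the interval lists, the `likely_bound` helper, and the 15% gap threshold are hypothetical, and it assumes the common reading that large gaps in the GPU execution track mean the GPU is waiting on the CPU (CPU-bound), while large gaps in the CPU track suggest the opposite.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms) of one busy span on a track


def idle_fraction(busy: List[Interval], window: Interval) -> float:
    """Return the share of `window` not covered by any busy interval."""
    start, end = window
    covered = 0.0
    cursor = start  # everything before `cursor` has already been counted
    for s, e in sorted(busy):
        s, e = max(s, cursor), min(e, end)
        if e > s:
            covered += e - s
            cursor = e
    return 1.0 - covered / (end - start)


def likely_bound(gpu_busy: List[Interval], cpu_busy: List[Interval],
                 window: Interval, gap_threshold: float = 0.15) -> str:
    """Heuristic only: the track with the larger gaps is the one left waiting."""
    gpu_idle = idle_fraction(gpu_busy, window)
    cpu_idle = idle_fraction(cpu_busy, window)
    if gpu_idle > gap_threshold and gpu_idle > cpu_idle:
        return "CPU-bound"   # GPU sits idle, so the CPU is likely the limiter
    if cpu_idle > gap_threshold and cpu_idle > gpu_idle:
        return "GPU-bound"   # CPU sits idle, so the GPU is likely the limiter
    return "balanced or unclear"


# Made-up intervals for a 20 ms window: the GPU is busy only 8 ms of it.
gpu_track = [(0.0, 4.0), (10.0, 14.0)]
cpu_track = [(0.0, 19.0)]
print(likely_bound(gpu_track, cpu_track, (0.0, 20.0)))  # CPU-bound
```

In practice the timeline view makes this judgment visually; a script like this would only matter if you exported interval data for automated checks.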

And finally, use Graphics Frame Analyzer, especially if you are GPU-bound, to dig into the details such as API calls, resources, buffers, and shader profiling. You can also use it to find hotspots and ensure that shader code has not drawn outside the screen space.

Figure 1. A frame open in Graphics Frame Analyzer, showing the fully rendered frame after all draw calls are complete

Rob: In your collaboration with Soma Games, they saw significant improvement over a period of 3 weeks only investing a few hours each week. Is this typical for the benefit GPA provides? Have you seen similar levels of success from other customers?

Pamela: Absolutely. Great question, Rob. Yes. The thing is, you can optimize code all day, but if you are optimizing without knowing which optimizations will result in the biggest performance gains, you might not be spending your time in the best ways possible. Zach Taylor, Minister of Performance at Soma Games, was previously using Unity Profiler*. That makes sense; this profiler/debugger is a feature of the Unity game engine. And that was great. It helped.

But Intel GPA is much more powerful for profiling. It is a dedicated profiling tool. Unity is a game engine, so profiling is an extra benefit and not its main focus. In the first 5 minutes of one of our meetings where we showed Zach how to use Intel GPA, we all saw that tons of fog particle rendering was happening outside of the screen space view. Fixing this more than doubled performance on platforms without a discrete GPU and increased performance by 35% on high-end gaming platforms.

Rendering outside of where the camera can see is completely wasted time, when the GPU could be doing something useful instead.

Here is what that looks like.

  1. We open a frame in Graphics Frame Analyzer
  2. Select a draw call
  3. Select an output geometry
  4. Look at the visualization for that one element

Looking at Post Transform Mesh view for this particular draw call of a frame in a slow area of the game, everything looks awesome. You can move the view to look at the 3D rendering from any angle, in this case showing the many planes that represent fog.

Figure 2. Post Transform Mesh View of the output geometry of a frame draw call

Switching to Screen Space view (where the camera can see, not above, below, or behind), we see that most of the rendering for these planes of fog was done outside of screen space, so it will never be seen during the execution of this draw call in this frame. This particular draw call only took 3 μs. But multiplied over thousands of draw calls in thousands of frames, that extra rendering adds up.

Figure 3. Screen Space View of the output geometry of a frame draw call
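To make the idea of "outside of screen space" concrete, here is a minimal sketch of a per-vertex check on post-transform geometry. It is not how Graphics Frame Analyzer works internally; the function names and the sample vertices are hypothetical. It assumes clip-space coordinates (x, y, z, w) with the Direct3D/Vulkan depth convention, where a vertex is inside the view volume when -w <= x <= w, -w <= y <= w, and 0 <= z <= w.

```python
from typing import Iterable, Tuple

ClipVertex = Tuple[float, float, float, float]  # (x, y, z, w) after the vertex stage


def inside_view_volume(v: ClipVertex) -> bool:
    """True if the vertex lies inside the D3D/Vulkan-style clip volume."""
    x, y, z, w = v
    return w > 0 and -w <= x <= w and -w <= y <= w and 0 <= z <= w


def fraction_outside(vertices: Iterable[ClipVertex]) -> float:
    """Share of vertices that fall outside the view volume (0.0 for no input)."""
    verts = list(vertices)
    if not verts:
        return 0.0
    outside = sum(1 for v in verts if not inside_view_volume(v))
    return outside / len(verts)


# Made-up vertices: one visible, three outside (x, y, and z out of range).
fog_plane = [
    (0.0, 0.0, 0.5, 1.0),
    (2.0, 0.0, 0.5, 1.0),
    (0.0, -3.0, 0.5, 1.0),
    (0.0, 0.0, 2.0, 1.0),
]
print(fraction_outside(fog_plane))  # 0.75
```

A per-vertex test is only a rough proxy: a large triangle can cover the screen even when all of its vertices are outside it, which is one reason the visual Screen Space view is the more reliable check.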

Believe it or not, this is a common mistake. Sometimes game developers don’t realize their shader code is doing lots and lots of extra work. This is one of the first things I look at when working with customers.

The key is, it didn’t take long to fix. They just hadn’t known the issue was there. That is the power of Intel GPA: It quickly shows what you need to focus on to gain performance.

Rob: Which steps were taken to achieve their performance goals?

Pamela:  First, Zach started his game via Graphics Monitor and played to the part of the game where he knew it was slow. He captured a frame in that region.

Note:
When we started working with Soma Games, this slow portion of their game was running at 6 fps on a mainstream computer with no graphics card. That is too slow to play, so Zach did the frame capture using his platform—10th Gen Intel® Core™ processor along with an NVIDIA discrete GPU card. Then he profiled this frame on his Intel Core processor’s integrated GPU.

Cross-platform functionality is part of the power of Intel GPA. We don’t guarantee to support all hardware on all gaming titles, but we test on several platforms because we want developers to be able to make their games better.

Graphics Frame Analyzer can’t extract all of the rendering data from non-Intel hardware. That is why, no matter where you capture, you should do your profiling on Intel® hardware.

We opened that frame in Graphics Frame Analyzer to identify the hottest bottleneck—the place where optimization potentially gives you the biggest return in overall performance. You pinpoint a few calls that add up to the most time-consuming issue(s), and thus the biggest opportunity for improvement.

What we found in our first meeting was that Zach's team had used combined meshes to reduce the number of draw calls. This is a really good idea to ease the load on the CPU. However, this affected the Geometry Transformation stage of the graphics pipeline negatively, causing the GPU to draw each of the elements individually. When they changed the combined meshes to GPU instancing, the frame rate rose from 6 fps to 10 fps (about 1.7x) when running with only the integrated GPU, and from 50 fps to 65 fps (about 1.3x) when running with the discrete GPU.

In our second meeting, another challenge that we found, and one of the main issues, was the shader execution issue I talked about earlier. Screen space view in Graphics Frame Analyzer can show you in 2 minutes whether you are rendering outside of the camera view.

Changing from the Post Transform Mesh view of the rendered call in Graphics Frame Analyzer to the Screen Space view allowed us to identify the issue so they could fix it. Here is another example: in this one, the draw call did not contribute to the frame at all. The pixels were drawn completely outside of where the camera was aimed.

When you look at these images, it is so completely obvious, but remember, it is only obvious if you capture a frame and take a look. So make sure you are profiling during your development process.

Take a look at another example. This is a different call in the same frame with the post transform mesh on the left, and 2 images of the screen space view on the right. In the top right image you can see that the geometry is completely outside the screen space view. And in the lower right image you can see that there is a lot being drawn. The call takes 3 μs to complete. The total frame time is 25,826 μs. It’s only taking roughly 1/10,000th of the frame time. If this is the only call that is behaving this way, then it is probably not worth fixing; move on to a bigger bottleneck instead. But if this is caused by shader code, it is likely affecting many draw calls, and those add up.
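The arithmetic behind that judgment can be sketched as follows. The 3 μs and 25,826 μs figures come from the example above; the helper names and the 10% target are hypothetical, chosen only to show how quickly small calls add up.

```python
def share_of_frame(call_us: float, frame_us: float) -> float:
    """Fraction of the frame time spent in one draw call."""
    return call_us / frame_us


def calls_to_reach(target_share: float, call_us: float, frame_us: float) -> float:
    """How many equally costly calls would consume `target_share` of the frame."""
    return target_share * frame_us / call_us


call_us = 3         # duration of the single draw call, from the example
frame_us = 25826    # total frame time, from the example

share = share_of_frame(call_us, frame_us)
print(f"one call: {share:.4%} of the frame (about 1/{round(1 / share)})")

# Hypothetical target: wasted work amounting to 10% of the frame.
print(f"calls needed for 10%: {calls_to_reach(0.10, call_us, frame_us):.0f}")
```

One call is about 1/8,600 of the frame, in line with the rough 1/10,000 figure quoted. Under these assumptions, fewer than a thousand similarly wasteful calls would already account for a tenth of the frame, which is why a shader-level cause is worth chasing even when a single call looks negligible.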

Figure 4. Another example of geometry drawn outside of the screen space view

So ultimately, Soma Games is now able to get better gaming performance on a wider variety of platforms. Thus they can widen the market for their products by targeting more platforms.

Here are some results we achieved on the original integrated GPU and on newer platforms:

 

Frame rate (fps) by platform:

| Platform | Original Frame (week 1) | No Combined Meshes (week 2) | Fastest Frame (week 3) |
|---|---|---|---|
| Intel® UHD Graphics, 10th Generation Core (Comet Lake) | 6 | 10 | 27 |
| Intel® UHD Graphics, 12th Generation Core (Alder Lake) | 9 | 19 | 54 |
| Intel® UHD Graphics, 13th Generation Core (Raptor Lake) | 11 | 22 | 63 |
| Intel® UHD Graphics, 12th Generation Core (Tiger Lake) + Intel® Arc™ A770 (Alchemist) | 45 | 133 | 180 |

Table 1. Game Performance Progress over 3 Weeks
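For readers who want the relative gains rather than raw frame rates, the short sketch below computes speedups from the Table 1 values. The numbers are copied from the table; the shortened platform labels and the `speedup` helper are only for illustration.

```python
# Frame rates (fps) from Table 1: (week 1, week 2, week 3) per platform.
fps_by_platform = {
    "UHD Graphics, 10th Gen (Comet Lake)": (6, 10, 27),
    "UHD Graphics, 12th Gen (Alder Lake)": (9, 19, 54),
    "UHD Graphics, 13th Gen (Raptor Lake)": (11, 22, 63),
    "UHD Graphics (Tiger Lake) + Arc A770": (45, 133, 180),
}


def speedup(before_fps: float, after_fps: float) -> float:
    """Ratio of the new frame rate to the old one."""
    return after_fps / before_fps


for platform, (week1, week2, week3) in fps_by_platform.items():
    print(f"{platform}: week 2 = {speedup(week1, week2):.2f}x, "
          f"week 3 = {speedup(week1, week3):.2f}x over week 1")
```

Over the three weeks this works out to 4.5x, 6.0x, about 5.7x, and 4.0x for the four platforms, respectively.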

I love the work we did with Soma Games because it gave us the experience of working closely with a customer and tracking the progress as performance improvements were made.

I hope the Soma Games story will reach other developers who have not experienced the power of Intel GPA. It is free. No registration required. It supports Microsoft Direct3D* 11 and 12 as well as Vulkan*.

Rob: How do you believe the key learnings of the work with Soma Games can translate to the wider game developer community? What are the key potential choke points that any developer should be aware of and pay attention to?

Pamela: Well, it just takes a few minutes to capture a frame and look at a few output geometries in screen space view. This way, you verify that your shaders are behaving as you think they should.

And then it takes another minute or two to check the hotspots. It is good to look at the top 3. The top bottleneck may be something you just do not want to change: a highly detailed sprite that does some really cool motions, for example. But maybe hotspots 2 and 3 can be sped up.

Really, the most important thing to remember is to profile. Don’t just start optimizing. You may be spending most of your time on parts of your game that won’t contribute a lot to the big picture.

As for things to look out for, keep in mind that when you are looking at a single frame, you are looking at a tiny piece of the whole picture. It might be worthwhile to capture a few different frames. You could also look at multiple frames and metrics aggregated across multiple frames. Maybe the frame you chose is awesome, but in combination with the frame leading up to it, it may not be optimal.


Advanced Notes:

It is also a good idea to track execution unit (EU) states (EU Stall/EU Active/EU Thread Occupancy). How to track EUs and interpret the data was demonstrated at the Game Developer Conference 2021 in the presentation “Up Your Game, Know Your Intel GPU Architecture.”

  • High Occupancy with High Stall likely indicates an L3 Cache bottleneck. EU threads are waiting for R/W from memory. We recommend checking Graphics Frame Analyzer’s Shader Profiler to see which SEND instructions are taking too long, causing the stall. You can then proceed to figure out if you need to change dimensions, data layouts, or something else.
  • High Occupancy with Low Stall is awesome. But if you are looking for improvements you may need to optimize the Shader Code by again using the Shader Profiler. Instead of looking for long SEND instructions, look for arithmetic functionality that could be optimized.
  • Low Occupancy with High Stall indicates a Thread Dispatch bottleneck. There aren’t enough threads loaded on the EUs. This could be caused by a high level of detail in the geometries, in which case you might need to use geometry level of detail (LODs).
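The three cases above can be summarized as a small lookup, sketched here. The function name and the numeric cutoffs for "high" occupancy and stall are hypothetical; the interview gives no thresholds, so treat them as placeholders to tune for your own workloads.

```python
def classify_eu_state(occupancy_pct: float, stall_pct: float,
                      occupancy_high: float = 80.0,
                      stall_high: float = 30.0) -> str:
    """Map EU Thread Occupancy and EU Stall to the likely bottleneck.

    The cutoffs are illustrative assumptions, not Intel-documented values.
    """
    high_occupancy = occupancy_pct >= occupancy_high
    high_stall = stall_pct >= stall_high

    if high_occupancy and high_stall:
        return "likely L3 cache bottleneck: look for slow SEND instructions"
    if high_occupancy and not high_stall:
        return "healthy: further gains need shader arithmetic optimization"
    if not high_occupancy and high_stall:
        return "thread dispatch bottleneck: consider geometry LODs"
    return "not covered by the guidance above"


print(classify_eu_state(occupancy_pct=92.0, stall_pct=55.0))
print(classify_eu_state(occupancy_pct=35.0, stall_pct=60.0))
```

The fourth combination (low occupancy with low stall) is not discussed in the notes, so the sketch deliberately returns no diagnosis for it.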

It is also smart for game development studios to be able to do regression testing as they augment and upgrade their games. In addition to validating code correctness, regression testing can be a measure of performance improvement as development progresses.

  • Capture some key frames and/or traces at different points in the development process. This will allow comparisons of the data over time that can be correlated to various changes in the game.
  • Those comfortable with scripting can partially automate some of this testing:
    • Use System Analyzer to capture, say, 10 minutes of CSV data from a particular place in the game, and save that file.
    • After making changes, capture 10 minutes from that same place in the game and run the 2 CSV files through a script and make comparisons; e.g., it may be of interest to compare the number of times frame rate dropped below some value or the average frame rate.
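A comparison script along those lines might look like the sketch below. It assumes each CSV has a header row with a frame-rate column named `fps`; the real column name in a System Analyzer export may differ, and the 20 fps threshold and sample data are made up. A "drop" is counted each time the frame rate crosses from at-or-above the threshold to below it.

```python
import csv
import io
from statistics import mean
from typing import Dict, Iterable, List, TextIO


def read_fps(stream: TextIO, column: str = "fps") -> List[float]:
    """Read one numeric column from a CSV stream, skipping blank cells."""
    reader = csv.DictReader(stream)
    return [float(row[column]) for row in reader
            if (row.get(column) or "").strip()]


def summarize(fps: Iterable[float], threshold: float) -> Dict[str, float]:
    """Average frame rate and number of dips below `threshold`."""
    samples = list(fps)
    drops = 0
    below = False
    for value in samples:
        if value < threshold and not below:
            drops += 1  # count the dip once, when it starts
        below = value < threshold
    return {"average_fps": mean(samples), "drops": drops}


# Stand-ins for the two saved files; with real data, pass open(path, newline="").
before_csv = io.StringIO("fps\n30\n12\n14\n31\n10\n")
after_csv = io.StringIO("fps\n40\n35\n28\n33\n")

before = summarize(read_fps(before_csv), threshold=20.0)
after = summarize(read_fps(after_csv), threshold=20.0)
print("before:", before)
print("after: ", after)
print("average fps change:", after["average_fps"] - before["average_fps"])
```

Counting dips rather than individual slow samples keeps one long stutter from dominating the number; either definition is reasonable as long as it is applied to both captures.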

Note: The command line backend of Graphics Frame Analyzer (Intel GPA Framework) is available as a separate download and offers additional functionality to help with regression and other sorts of testing.


Intel® VTune™ Profiler

Rob: Thank you, Jennifer, for following up with us on the use of VTune Profiler for game performance optimization. You make the point that while GPA gives you all the detailed frame-level insights into the GPU performance, a traditional approach to software workload performance profiling can be very helpful for gaming workloads as well.

Where VTune Profiler can provide the most help for a gaming application is in assuring best load balancing between CPU and GPU. This also implies that the memory latency impact of data transfers between the processing units is minimized.

Intel VTune Profiler helps you:

  • Optimize CPU compute-intensive tasks, so you know what functions will benefit most from optimization and gain insights into how to do that.
  • Tune CPU threading performance, so you can resolve any issues with blocked tasks and improve parallelism.
  • Profile games built with Unity’s Real-Time Development Platform or Unreal Engine*. These engines use the VTune Profiler Instrumentation and Tracing Technology API, so you can view performance of annotated engine tasks.
  • Optimize cache usage, which can improve the speed of CPU instructions.
  • Get best performance on the latest Intel hardware, so you can take advantage of cutting edge architectures such as hybrid CPUs.  

Can you elaborate on how Unity and Unreal Engine integrate the ITT API and how VTune Profiler usage benefits from its use?

Jennifer: Both engines wrap well-known tasks using the ITT API. So when VTune displays results, it can correlate performance metrics with specific engine tasks familiar to the user. Seeing function names may not be very helpful if the developer doesn’t have the knowledge or ability to change engine code, but knowing that a lot of time is spent in a particular draw task is more actionable.   
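To illustrate why named tasks are more actionable than raw function names, here is a toy recorder that wraps work in named begin/end markers and totals the time per task. It does not use the ITT API or VTune Profiler at all; `TaskRecorder` and the task names are hypothetical, and the fake clock exists only to make the example deterministic.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class TaskRecorder:
    """Accumulates elapsed time per named task (conceptual stand-in only)."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.totals: Dict[str, float] = defaultdict(float)

    @contextmanager
    def task(self, name: str) -> Iterator[None]:
        start = self._clock()          # "task begin" marker
        try:
            yield
        finally:                       # "task end" marker, even on exceptions
            self.totals[name] += self._clock() - start

    def hottest(self) -> str:
        """Name of the task with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


# A fake clock returning preset timestamps, so the result is reproducible.
ticks = iter([0.0, 5.0, 5.0, 6.0])
recorder = TaskRecorder(clock=lambda: next(ticks))

with recorder.task("Draw"):
    pass  # engine draw work would run here
with recorder.task("Physics"):
    pass  # engine physics work would run here

print(recorder.hottest())  # Draw
```

A report keyed by task names like these tells a developer where the frame time goes without requiring any knowledge of the engine's internal functions, which is the benefit described above.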

Rob: Performance profiling always seems to start with identifying hotspots: bottlenecks that are executed frequently and thus have a significant impact on overall performance. If you are profiling a Unity or Unreal Engine game, you will also see which engine tasks are taking the most time. From there, you can then drill down into the function and source-code view.

Beyond that, you highlight the importance of understanding the correlation between threading behavior and CPU utilization. VTune Profiler can help uncover how efficiently synchronization tasks and context switches are employed throughout the code. The goal is to minimize waits and maximize parallel execution through concepts like task-stealing.

How does the Flame Graph View in VTune Profiler help with that?

Jennifer: The Flame Graph provides a top-down view of the call stack. Now, instead of opening up each function in the top-down tree, you can view the flame graph and see the full stack trace. Different colors are used to differentiate between user, system, and synchronization tasks, so you can quickly see where a synchronization function might be causing a performance problem. This view can be searched and filtered like the other tabs.

Rob: How do microarchitecture analysis and memory access analysis specifically help with game optimization? Are there memory-latency and cache-utilization insights that can help improve game performance, perhaps with the help of data prefetching?

Jennifer:  Yes, the microarchitecture and memory access analysis are particularly helpful to understand how to fix CPU bottlenecks. Slow memory accesses can have a major effect on overall performance because they prevent instructions from executing quickly regardless of threading or vectorization.

Other factors that impact performance at the instruction level include increased front-end latency. This happens when either the instructions themselves take a long time to fetch or decode, or the processor tries to predict the next instruction and fails a large percentage of the time.

Identifying these kinds of problems is generally considered the “deep end” of the optimization pool, but VTune provides additional guidance and helpful hints to understand these lower-level metrics.

Rob: Where do you see the future role of Intel VTune Profiler for game developers? How does it help with gaming performance on hybrid CPUs on the latest generation Intel Core processors?  How can we ensure affinity of compute-intensive threads to P-Cores?

Jennifer:  VTune allows users to see exactly how their code is utilizing the different cores. One misconception I have seen is that gamers may think they can get better performance by disabling their E-cores so everything is forced to run on the higher-performance P-cores. This is not the case and may even hurt performance, especially as game developers optimize more for hybrid CPUs. We have a fantastic group of engineers who have a deep understanding of all manner of game performance, and their suggestions, combined with VTune, will help developers get optimal performance by utilizing both core types.

We even have detailed performance-analysis recipes specifically targeting game development.

Wrap-Up

Rob: Thank you for your valuable insights. This was most enlightening. Our curiosity has been piqued. I really appreciate you sharing your perspective with us.

Do you have something else you would like to share with our readers to help them on their quest to develop the perfect game?

Pamela: Thank you, Rob.  It was great chatting. The one last thing I want to say is that we love to hear from game developers. Any feedback or feature requests are more than welcome. We do our best to stay in touch with what folks are looking for in their profiling tools. Guidance from customers is our best way of knowing that we really are giving folks what they need. Send feedback via the feedback feature in Graphics Frame Analyzer, or contact us through the forum.

Jennifer: Thank you for the opportunity to clarify some of the functionality and features of VTune with regard to the gaming segment. I realize it is a complex product, which can be intimidating. As Pamela said, we always appreciate feedback from developers, especially for VTune, which can certainly use more and better documentation when it comes to gaming.
