Finding the Right Balance: Optimizing Threading for Gaming Performance

Authors:

  • Leigh Davies

  • Martin Moerth


Introduction

Can you have too much of a good thing? When it comes to the number of CPU cores in a high-end gaming PC, some recent articles suggest that you can. However, a deeper dive into game processes shows that it's not normally the hardware at fault, but how the game chooses to use available resources.

Below are two articles which discuss game performance on modern multi-core hybrid hardware: 


Both articles share a common theme: disabling the Efficient-cores (E-cores) leads to better performance.

However, neither delves into the reasons behind this phenomenon. On closer inspection, it’s evident that the games reviewed adapt their code to the number of processors in the system, creating one thread for each physical core and attempting to distribute large parts of the workload evenly across them. This inadvertently leaves the games running a multitude of threads, each performing only a small amount of work, without actually reducing the work on the game’s critical path.

As hardware becomes even more complex and the OS attempts to manage all the software running simultaneously under many different usage scenarios, it is becoming more important than ever to use only the resources you need and allow the OS flexibility on how best to schedule the work.

Architecting the Workload

The first stage is to architect the workload in a way that gives the OS context about the behaviour the developer is expecting:


  • Identify the critical path work (per-frame tasks that could impact the critical path) and background/asynchronous work.
    • Arrange separate thread pools based on per-frame and background work.
  • Limit concurrency and thread pool size to what your workload really needs (see the pool sketch after this list).
    • This will have to be benchmarked (ideally across target system configurations).
      • Are you getting the expected gains on the overall/realistic workload when increasing thread pool size?
    • If you do not keep your pools busy, the OS will start parking cores and serialize your work onto fewer processors than you might expect.
    • Be aware that the overhead of managing threads with very small tasks might become greater than the gain from the extra parallelism in some situations.
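
To make the first two bullets concrete, here is a minimal sketch of a fixed-size worker pool; the class name and the sizes in the usage comment are illustrative assumptions, and a production engine would more likely build this on its existing job system:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A deliberately small, fixed-size pool. The game creates one instance for
// per-frame (critical path) work and another for background/asynchronous
// work; the sizes are explicit, benchmarked inputs rather than whatever
// processor count the system reports.
class WorkerPool {
public:
    explicit WorkerPool(unsigned threadCount) {
        for (unsigned i = 0; i < threadCount; ++i)
            workers.emplace_back([this] { Run(); });
    }
    ~WorkerPool() {
        { std::lock_guard lock(mutex); stop = true; }
        wake.notify_all();
        for (auto& worker : workers) worker.join();
    }
    void Submit(std::function<void()> task) {
        { std::lock_guard lock(mutex); tasks.push_back(std::move(task)); }
        wake.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock lock(mutex);
                wake.wait(lock, [this] { return stop || !tasks.empty(); });
                if (stop && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop_front();
            }
            task();
        }
    }
    std::vector<std::thread> workers;
    std::deque<std::function<void()>> tasks;
    std::mutex mutex;
    std::condition_variable wake;
    bool stop = false;
};

// Hypothetical sizing, to be tuned by benchmarking:
// WorkerPool perFramePool(4);   // critical path tasks
// WorkerPool backgroundPool(2); // streaming, decompression, etc.
```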


The advice in this section is valid for all multi-threaded workloads, whether or not you are executing them on a hybrid CPU. Amdahl’s Law is highly relevant for gaming: there is always some serial code, and if the workload is GPU bound you are waiting on something completely independent of the CPU. Effectively, there is only so much gain to be had from increasing a game’s thread count. Amdahl’s Law also deals only with theoretical limits; in the real world, the overhead of managing many threads means performance can start to decrease quickly once you pass the peak. By limiting the number of concurrent threads, you simplify the OS’s scheduling job while still benefiting from its ability to park cores to conserve power and potentially increase the frequency of the processors doing the important work. This is especially true in power- and/or thermal-limited scenarios.
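
For reference, Amdahl’s Law gives the theoretical bound, where p is the fraction of the frame that can run in parallel and n is the thread count:

```latex
S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

For example, with p = 0.8 (80% of the frame parallelizes perfectly), no thread count can ever deliver more than a 5x speedup, and the scheduling overhead described above erodes even that.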

Determining Maximum Concurrency

The next stage is to determine the maximum concurrency the game can expect from the OS based on the hardware available:


  • Determine processor count based on processors’ relative performance.
    • A: Count processors from higher relative performance cores.
    • B: Count processors from lower relative performance cores.
  • Determine cache hierarchy.
    • Remove any processors that don’t share a suitable last-level cache.
  • Your maximum concurrency is the sum of processors in A and B.
  • Further advice:
    • Reduce maximum application concurrency to leave headroom for middleware and OS overhead.
    • Compare the size of the thread pools identified in ‘Architecting the Workload’ above with the number of processors in A and B.
      • A → Critical path work thread pool.
      • B → Background/asynchronous work thread pool.
    • On mobile systems, be careful when restricting yourself to A for critical path work, as the processor count in A might be very limited.
    • If your workload does not require full concurrency, remove SMT siblings from A.


A game shouldn’t automatically use all available resources in the system. As hardware gets more specialized, the OS might be constrained from using some of the resources exposed through topology enumeration APIs (e.g. GetLogicalProcessorInformationEx or GetSystemCpuSetInformation). An example is the low-power E-cores included in CPUs such as Intel Core Ultra processors, which are intended to be used at the OS’s discretion so that, when most of the compute resources aren’t needed, those resources can be put to sleep. Using the processors’ relative performance together with the cache hierarchy is enough to filter these out.
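
Below is a minimal sketch of that filtering using GetSystemCpuSetInformation; the helper name and the “shares a last-level cache with a top-class core” heuristic are this article’s illustration, not a canonical recipe:

```cpp
#include <windows.h>
#include <set>
#include <vector>

// Counts logical processors in the higher (A) and lower (B) relative
// performance classes, dropping lower-class processors that do not share a
// last-level cache with any top-class core (this filters out the low-power
// E-cores described above). On non-hybrid CPUs, A is simply every processor.
bool CountProcessorClasses(ULONG& countA, ULONG& countB)
{
    ULONG length = 0;
    GetSystemCpuSetInformation(nullptr, 0, &length, GetCurrentProcess(), 0);
    std::vector<BYTE> buffer(length);
    auto* info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data());
    if (!GetSystemCpuSetInformation(info, length, &length, GetCurrentProcess(), 0))
        return false;

    // Pass 1: find the highest efficiency class (a larger value means a more
    // performant core) and the last-level caches its cores sit behind.
    BYTE topClass = 0;
    for (ULONG offset = 0; offset < length;) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data() + offset);
        if (e->Type == CpuSetInformation && e->CpuSet.EfficiencyClass > topClass)
            topClass = e->CpuSet.EfficiencyClass;
        offset += e->Size;
    }
    std::set<BYTE> topCaches;
    for (ULONG offset = 0; offset < length;) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data() + offset);
        if (e->Type == CpuSetInformation && e->CpuSet.EfficiencyClass == topClass)
            topCaches.insert(e->CpuSet.LastLevelCacheIndex);
        offset += e->Size;
    }

    // Pass 2: A = top-class processors, B = lower-class processors sharing a
    // last-level cache with a top-class core. Everything else is left for the
    // OS to use at its discretion.
    countA = countB = 0;
    for (ULONG offset = 0; offset < length;) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data() + offset);
        if (e->Type == CpuSetInformation) {
            if (e->CpuSet.EfficiencyClass == topClass)
                ++countA;
            else if (topCaches.count(e->CpuSet.LastLevelCacheIndex))
                ++countB;
        }
        offset += e->Size;
    }
    return true;
}
```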

Another factor to consider is the potential for hardware resource contention between threads, primarily arising in two ways:


  • Firstly, Hyper-Threading, or simultaneous multithreading (SMT): when two hardware threads operate on the same physical core, they share that core’s resources, including the local cache and internal execution units. While this arrangement can enhance overall throughput, it also reduces the Instructions Per Clock (IPC) of each individual thread.
  • Secondly, there is an increased likelihood of data contention: more active threads mean more caches to monitor, which raises latency because data must travel from more distant locations.

Steering the Workload

The third stage is configuring the game’s threads so they run in the best location:


  • Let the OS do it for you! *
  • Isolate threads with SIMD-heavy work (e.g. to support Intel® Thread Director classification).
    • Don’t mix SIMD and scalar work on the same thread (pool).
  • Critical path work threads:
    • Bump priority to ABOVE_NORMAL, or two levels above the general workers.
      • Only do this if critical path work is a small subset of the overall workload.
  • Background/asynchronous work threads:
    • Consider Eco QoS to bias latency-tolerant threads toward E-cores.
  • Notes on soft affinity:
    • Work soft-affinitized to lower relative performance cores might migrate to higher relative performance cores (e.g. when the OS starts parking lower relative performance cores for power efficiency).
    • Work soft-affinitized to higher relative performance cores might be prevented from migrating to lower relative performance cores.
  • Only use hard affinity if work must never migrate off a certain set of cores.
    • Hard affinity implies assumptions around execution time and cache coherency.
    • Always benchmark your assumptions (ideally across target SKUs).
      • Are you getting the expected gains on the overall workload?


* Yes, really, we mean it! If you try to control exactly where threads run on a modern PC, you will be fighting a host of input from both the OS and the underlying hardware about where threads should be scheduled. The important part is to give the OS enough context about the workload and leave it enough room to make good scheduling decisions.

For example, you should not mix tasks with very different workload characteristics on a single thread. Intel’s recent hybrid architectures include Intel® Thread Director, which monitors the workload and classifies it so the OS can schedule it better. If, on a single thread, SIMD-heavy tasks are mixed in with memory-heavy or branch-heavy tasks, the OS will classify everything the same. The benefits may not be obvious today, but separating types of work on a per-algorithm basis is good coding practice going forward and will make it much easier to adopt specialized hardware such as AI co-processors.
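
Reusing the hypothetical WorkerPool sketch from earlier, the separation might look like this (pool sizes and task names are illustrative):

```cpp
// Keep SIMD-heavy and scalar work on separate threads so hardware feedback
// (e.g. Intel® Thread Director classification) sees consistent behaviour.
WorkerPool simdPool(2);    // e.g. particle/animation SIMD batches
WorkerPool scalarPool(2);  // e.g. branchy gameplay and AI logic

void SubmitFrameWork()
{
    simdPool.Submit([] { /* UpdateParticlesSimd(); (hypothetical) */ });
    scalarPool.Submit([] { /* UpdateGameplay();    (hypothetical) */ });
}
```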

Thread priority is also a powerful tool, but it should not be overused. Windows uses priority to sort which threads get executed, not necessarily where they run. If an E-core is free at the time a high-priority thread is scheduled, the OS will use it unless it judges the cost of context switching threads around to be worth it (a fairly high threshold given the typical durations a game thread is scheduled for). Nevertheless, it’s worth ensuring the most important threads stand out from the general thread pool. Over time, Windows builds a hysteretic model of the running threads, so the longest-running threads will favour the most performant cores.
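
As a minimal sketch of that priority bump (the helper name, and the assumption that only a handful of threads receive it, are illustrative):

```cpp
#include <windows.h>

// Raise only the few critical path threads one step above the general worker
// pool, which stays at the default THREAD_PRIORITY_NORMAL.
void MarkCriticalPathThread(HANDLE thread)
{
    SetThreadPriority(thread, THREAD_PRIORITY_ABOVE_NORMAL);
}
```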

The table below shows how the two main threads in Hitman 3 by IO Interactive are scheduled almost entirely on the most performant cores without any affinity applied.

Thread Type     Total CPU [%]   P-Core [%]   E-Core [%]   P-Core [ms]   E-Core [ms]
Game Thread     96.98           96.72        0.26         19886.10      53.45
Render Thread   96.61           96.30        0.31         19799.75      63.73
Task Thread     38.13           20.70        17.43        4256.02       3583.69
Task Thread     36.51           20.05        16.46        4122.37       3384.25
Task Thread     36.41           19.89        16.52        4089.48       3396.59
Task Thread     36.39           20.22        16.17        4157.33       3324.63
Task Thread     36.36           19.60        16.76        4029.85       3445.93

Table 1. Most CPU-Intensive Threads in Hitman 3.

Using OS API features like Eco QoS is another good way to bias threads that have a high latency tolerance to the E-cores, leaving the P-cores free to do the remaining work.
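
A minimal sketch of opting a latency-tolerant thread into Eco QoS through the power-throttling hint (the helper name is ours; the API requires a recent version of Windows):

```cpp
#include <windows.h>

// Ask the scheduler to favour efficiency (e.g. E-cores, lower frequencies)
// for a background thread that can tolerate extra latency.
void EnableEcoQoS(HANDLE backgroundThread)
{
    THREAD_POWER_THROTTLING_STATE state = {};
    state.Version = THREAD_POWER_THROTTLING_CURRENT_VERSION;
    state.ControlMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    state.StateMask = THREAD_POWER_THROTTLING_EXECUTION_SPEED;
    SetThreadInformation(backgroundThread, ThreadPowerThrottling,
                         &state, sizeof(state));
}
```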

Hard thread affinity is seldom advised because it removes the OS’s ability to arrange threads intelligently based on the current workload and hardware state. While pinning tasks to specific cores can enhance performance in specific situations, it can also result in suboptimal resource utilization if not carefully managed. This challenge is especially pertinent in gaming, where software must operate across a diverse range of hardware configurations.
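
For completeness, here is a sketch contrasting soft and hard affinity (the CPU set IDs would come from GetSystemCpuSetInformation, as in the enumeration sketch earlier; the helper names are ours):

```cpp
#include <windows.h>

// Soft affinity: a preference the scheduler may override, e.g. when it parks
// the preferred cores. Generally the better choice for games.
void PreferCpuSets(HANDLE thread, const ULONG* cpuSetIds, ULONG count)
{
    SetThreadSelectedCpuSets(thread, cpuSetIds, count);
}

// Hard affinity: the thread can never run anywhere else, even if logical
// processor 0 is busy or parked. Use only when work must never migrate.
void PinToLogicalProcessor0(HANDLE thread)
{
    SetThreadAffinityMask(thread, 1ull << 0);
}
```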

Closing Thoughts

  • Selecting the right number of threads for a particular game is highly workload dependent; the most important thing is to test across target system configurations.
  • Don’t assume only using the P-cores will automatically fix performance issues. In many cases gains are achievable by better matching the application’s requirements with the available hardware. Using E-cores instead of both SMT siblings on a P-core might give significant benefits.
  • It is possible that small changes can significantly boost performance. In the case of Atlas Fallen mentioned at the start of this article, simply reducing the number of background threads spawned fixed the performance inversion and meant that systems with E-cores enabled were slightly faster than those without E-cores.