Intel® oneAPI DPC++/C++ Compiler and Profile Guided Optimization Usage Guide for Unreal Engine* 5

Authors:

  • Jeff Rous

Introduction

New enhancements to the profile guided optimization (PGO) feature set are available with the release of the Intel® oneAPI DPC++/C++ Compiler version 2024.0 (ICX), which is based on LLVM 17 compiler technology. Previously, only instrumented PGO was available. It required a specially compiled binary that captured its own profile traces to later feed into the linker, resulting in high overhead and low frame rates in games. With the ICX 2024.0 release, PGO is enhanced with hardware assistance (HWPGO). Developers can now leverage the processor’s dedicated performance monitoring unit to significantly reduce the collection overhead of these profile traces.

Beginning with Unreal Engine* version 5.4, support for the Intel ICX compiler, PGO, and HWPGO is available to Unreal Engine developers on Windows*. This guide outlines how to use these features and how to maximize their performance.

Enable Intel oneAPI DPC++/C++ Compiler

The Intel oneAPI DPC++/C++ Compiler (ICX) is free to use and can be downloaded from Intel.

Once the ICX compiler is installed, open Unreal Engine 5's BuildConfiguration.xml file and set Intel as the compiler. The Clang linker option must be enabled whenever link time optimization or PGO is used. Otherwise, it can be disabled to use the Microsoft Visual Studio* (MSVC) linker instead.

<?xml version="1.0" encoding="utf-8" ?>
<Configuration xmlns="https://www.unrealengine.com/BuildConfiguration">
 <WindowsPlatform>
  <Compiler>Intel</Compiler>
  <bAllowClangLinker>true</bAllowClangLinker>
 </WindowsPlatform>
</Configuration>

Hardware (Sample-Based) PGO

Hardware (sample-based) PGO relies on programmable counters built into the performance monitoring unit of newer CPUs to capture performance data. This hardware assistance results in very low overhead compared to instrumented PGO. Reaching an optimized build takes four steps: building with profiling enabled, capturing a representative profile, processing and merging the profiles, and finally applying the profiles for an optimized build.

Build Profile-Enabled Configuration

In addition to setting Intel as the compiler of choice and enabling the Clang linker in Unreal Engine 5's BuildConfiguration.xml file as described above, the sample-based PGO option must also be enabled:

<?xml version="1.0" encoding="utf-8" ?>
<Configuration xmlns="https://www.unrealengine.com/BuildConfiguration">
 <WindowsPlatform>
  <Compiler>Intel</Compiler>
  <bAllowClangLinker>true</bAllowClangLinker>
  <bSampleBasedPGO>true</bSampleBasedPGO>
 </WindowsPlatform>
</Configuration>
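
The two PGO modes in this guide differ only in the value of bSampleBasedPGO, so switching between them can be scripted. The following is a minimal Python sketch, not part of Unreal Engine or the Intel tooling. The helper name set_windows_platform_option is hypothetical, and the sketch assumes the file has the layout shown above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the BuildConfiguration.xml examples in this guide.
NS = "https://www.unrealengine.com/BuildConfiguration"


def set_windows_platform_option(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with <name>value</name> set under <WindowsPlatform>.

    Hypothetical helper: creates the element if it is missing,
    otherwise overwrites its text.
    """
    # Keep the default namespace unprefixed when serializing.
    ET.register_namespace("", NS)
    root = ET.fromstring(xml_text.encode("utf-8"))

    platform = root.find(f"{{{NS}}}WindowsPlatform")
    if platform is None:
        platform = ET.SubElement(root, f"{{{NS}}}WindowsPlatform")

    option = platform.find(f"{{{NS}}}{name}")
    if option is None:
        option = ET.SubElement(platform, f"{{{NS}}}{name}")
    option.text = value

    return ET.tostring(root, encoding="unicode")
```

Calling set_windows_platform_option(text, "bSampleBasedPGO", "true") selects hardware PGO, and passing "false" selects instrumented PGO. The returned string omits the XML declaration, so prepend it again before writing the file if it is needed.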

Additionally, the project's Target.cs file must have the following variables set to enable link time optimization and a PGO profile build:

bAllowLTCG = true;
bPGOProfile = true;
bPGOOptimize = false;

Capture Representative Profile

Capturing a representative profile relies on Sampling Enabling Product (SEP), a standalone command-line tool that performs hardware event-based sampling. SEP is installed by default with Intel® VTune™, which is free to use and can be downloaded from Intel.

To capture a profile, run the profile-enabled binary on a system with an Intel CPU using the following command. The command captures branch records as the game runs to determine the hot path of execution. The capture time (in seconds) is set with the -d option. The sampling frequency is set with the SA= option (smaller values sample more frequently). Common values for the SA= option include 1000003, 400009, 200017, 100003 and 10007. Longer captures can be taken with higher values of the SA= option. A good middle-ground capture of 4 minutes with a fast sample rate on the CitySample Unreal Engine 5 example workload is shown below:

sep -start -d 240 -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=100003:pdir:lbr:USR=YES,BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA=100003:lbr:USR=YES -lbr no_filter:usr -perf-script event,ip,brstack -app .\CitySample-Win64-Test.exe
  • Play the game for a representative sample of gameplay until the time expires and the game exits automatically.
  • A TB7 capture file and a performance data script are generated upon game exit. These files are used for the next steps.
  • Capture multiple profiles as needed to cover representative gameplay.
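
To experiment with different capture times and sample rates, the command above can be assembled from parameters. The following Python sketch only builds the argument list from the options already shown in this guide. The function name build_sep_command and its parameter names are hypothetical, and the sketch assumes both events use the same SA= value, as in the example.

```python
def build_sep_command(app_path, duration_s=240, sample_after=100003,
                      out_file="app.tb7"):
    """Build the SEP capture command from this guide as an argument list.

    duration_s maps to -d and sample_after maps to SA= on both events.
    """
    events = ",".join([
        f"BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA={sample_after}:pdir:lbr:USR=YES",
        f"BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA={sample_after}:lbr:USR=YES",
    ])
    return [
        "sep", "-start",
        "-d", str(duration_s),
        "-out", out_file,
        "-ec", events,
        "-lbr", "no_filter:usr",
        "-perf-script", "event,ip,brstack",
        "-app", app_path,
    ]


if __name__ == "__main__":
    # Print the command for a 4-minute capture of the CitySample Test build.
    print(" ".join(build_sep_command(r".\CitySample-Win64-Test.exe")))
```

The list can be passed to subprocess.run on a machine where sep is on the PATH. Keep a record of the sample_after value, because the processing step needs the same number.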

Process and Merge Profile(s)

The generated TB7 file and perf data script from the previous step need to be processed and then merged with any previous profdata file(s) to produce a final profdata file that the linker can apply for the profile-optimized build. This step uses llvm-profgen.exe and llvm-profdata.exe, which are included in the install directory of the Intel® oneAPI DPC++/C++ Compiler. These tools are different from the similarly named ones installed with Clang/LLVM and are not interoperable. The --sample-period option must match the SA= value used during capture for the profile to be processed correctly.

llvm-profgen.exe --perfscript app.perf.data.script --binary CitySample-Win64-Test.exe --output app.freq.prof --sample-period 100003 --perf-event BR_INST_RETIRED.NEAR_TAKEN:pdir

The app.freq.prof file and any previous profdata file(s) now need to be merged into a final profdata file that the linker can use for the profile-optimized build. Files captured on multiple different systems can be merged in this step to produce the final output file.

llvm-profdata.exe merge app.freq.prof --sample -output CitySample-Win64-Test.profdata
  • The profdata file should be named Project-Platform-Configuration.profdata, for example CitySample-Win64-Test.profdata for a CitySample Test build on Windows*.
  • The profdata file should be placed in ProjectDir\Platforms\Windows\Build\PGO\ so the build tool can find the profile during the link step.
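
The processing and merging steps above can be tied together so that the sample period, the output name, and the destination folder stay consistent. The following Python sketch only constructs the commands and the path. All function names are hypothetical, the project directory in the usage example is a placeholder, and the flags are copied from the commands in this guide.

```python
from pathlib import PureWindowsPath


def profdata_name(project, platform, configuration):
    """Name the profile as Project-Platform-Configuration.profdata."""
    return f"{project}-{platform}-{configuration}.profdata"


def profgen_command(perf_script, binary, sample_after, output="app.freq.prof"):
    """llvm-profgen call; sample_after must equal the SA= value used by SEP."""
    return [
        "llvm-profgen.exe",
        "--perfscript", perf_script,
        "--binary", binary,
        "--output", output,
        "--sample-period", str(sample_after),
        "--perf-event", "BR_INST_RETIRED.NEAR_TAKEN:pdir",
    ]


def merge_command(sample_profiles, output):
    """llvm-profdata call that merges one or more sample profiles."""
    return ["llvm-profdata.exe", "merge", *sample_profiles,
            "--sample", "-output", output]


def pgo_destination(project_dir, file_name):
    """Folder layout where the build tool looks for the profile."""
    return (PureWindowsPath(project_dir) / "Platforms" / "Windows"
            / "Build" / "PGO" / file_name)


if __name__ == "__main__":
    name = profdata_name("CitySample", "Win64", "Test")
    print(" ".join(profgen_command("app.perf.data.script",
                                   "CitySample-Win64-Test.exe", 100003)))
    print(" ".join(merge_command(["app.freq.prof"], name)))
    # Placeholder project directory, used only for illustration.
    print(pgo_destination(r"C:\Projects\CitySample", name))
```

Deriving the --sample-period argument and the SA= value from one variable avoids the mismatch that the text warns about.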

Build Profile-Optimized Configuration

The last step to complete for an optimized build is applying the profiles. In the project's Target.cs file, set the following variables to enable link time optimization and a PGO optimized build. Be sure the build configuration (Development, Test, Shipping) matches the configuration of the profdata generated in the last steps:

bAllowLTCG = true;
bPGOProfile = false;
bPGOOptimize = true;
  • Rebuild
  • Binary is now optimized by hardware PGO!
  • Repeat previous steps as needed with different combinations of capture times and sample rates to generate a profile that, when applied, gives the best performance uplift.
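
When repeating the workflow, it is easy to leave Target.cs in the wrong state. As a quick reference, the following Python sketch prints the Target.cs lines for each stage exactly as they appear in this guide. The stage names "profile" and "optimize" and the function target_cs_lines are hypothetical labels, not Unreal Engine identifiers.

```python
# Target.cs values for each stage, taken from the listings in this guide.
PGO_STAGES = {
    "profile": {"bAllowLTCG": True, "bPGOProfile": True, "bPGOOptimize": False},
    "optimize": {"bAllowLTCG": True, "bPGOProfile": False, "bPGOOptimize": True},
}


def target_cs_lines(stage):
    """Return the Target.cs assignments for the given stage as text."""
    flags = PGO_STAGES[stage]
    return "\n".join(
        f"{name} = {'true' if enabled else 'false'};"
        for name, enabled in flags.items()
    )


if __name__ == "__main__":
    for stage in PGO_STAGES:
        print(f"// {stage} build")
        print(target_cs_lines(stage))
```

In both stages link time optimization stays enabled, and only one of bPGOProfile and bPGOOptimize is true at a time.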

Instrumented PGO

Instrumented PGO relies on the compiled binary to capture its own performance data and generate a profile later used for optimization by the linker. No hardware assistance is used, so it has higher relative overhead compared to hardware PGO. Like hardware PGO, a few steps need to be completed to reach an optimized build: building with profiling enabled, capturing a representative profile, merging the profiles, and finally applying the profiles for an optimized build.

Build Profile-Enabled Configuration

To carry out instrumented PGO, Unreal Engine 5's BuildConfiguration.xml file must have Intel set as the compiler of choice and the Clang linker enabled as described previously. The Clang linker is required when enabling link-time code generation (LTCG) or PGO options. Note that Sample-Based PGO is disabled in this configuration:

<?xml version="1.0" encoding="utf-8" ?>
<Configuration xmlns="https://www.unrealengine.com/BuildConfiguration">
 <WindowsPlatform>
  <Compiler>Intel</Compiler>
  <bAllowClangLinker>true</bAllowClangLinker>
  <bSampleBasedPGO>false</bSampleBasedPGO>
 </WindowsPlatform>
</Configuration>

In the project's Target.cs file, set the following variables to enable link time optimization and a PGO profile build:

bAllowLTCG = true;
bPGOProfile = true;
bPGOOptimize = false;

Capture Representative Profile

Once the profile-enabled build has completed, a representative profile needs to be gathered by running the game.

  • Run the profile-enabled binary
  • Play the game for a representative sample of gameplay
  • Exit game
  • Profraw file is generated upon game exit. This file is used for the next steps.

Merge Profiles

The generated profraw file from the previous step and any previous profdata file(s) need to be merged into a final profdata file that the linker can apply for the profile-optimized build. This step uses llvm-profdata.exe, which is included in the install directory of the Intel® oneAPI DPC++/C++ Compiler. This tool is different from the similarly named one installed with Clang/LLVM and is not interoperable with it.

Multiple files from multiple different systems can be merged in this step to produce the final output file.

llvm-profdata.exe merge CitySample-Win64-Test-IRPGO.profraw [profile...] -output CitySample-Win64-Test.profdata
  • The profdata file should be named Project-Platform-Configuration.profdata, for example CitySample-Win64-Test.profdata for a CitySample Test build on Windows.
  • The profdata file should be placed in ProjectDir\Platforms\Windows\Build\PGO\ so the build tool can find the profile during the link step.
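
When several play sessions or machines contribute profiles, the merge command grows with each input. The following Python sketch assembles that command from a list of new profraw files plus any earlier profiles. The function name instrumented_merge_command is hypothetical, the extra file names in the usage example are placeholders, and the command shape follows the one shown above.

```python
def instrumented_merge_command(profraw_files, previous_profiles, output):
    """Build the llvm-profdata merge call for instrumented PGO.

    New profraw files come first, followed by earlier profiles.
    Duplicate paths are dropped so no file is passed twice.
    """
    inputs = []
    for path in [*profraw_files, *previous_profiles]:
        if path not in inputs:
            inputs.append(path)
    if not inputs:
        raise ValueError("at least one profile is required")
    return ["llvm-profdata.exe", "merge", *inputs, "-output", output]


if __name__ == "__main__":
    command = instrumented_merge_command(
        ["CitySample-Win64-Test-IRPGO.profraw"],
        ["previous-session.profdata"],  # placeholder file name
        "CitySample-Win64-Test.profdata",
    )
    print(" ".join(command))
```

Unlike the hardware PGO merge, this command does not pass --sample, matching the instrumented example in this section.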

Build Profile-Optimized Configuration

The last step to produce an optimized build is applying the profiles. In the project's Target.cs file, set the following variables to enable link time optimization and a PGO optimized build. Be sure the build configuration (Development, Test, Shipping) matches the configuration of the profdata generated in the last steps:

bAllowLTCG = true;
bPGOProfile = false;
bPGOOptimize = true;
  • Rebuild
  • Binary is now optimized by instrumented PGO!

Closing Thoughts

  • Profile guided optimization is a powerful technique to optimize your Unreal Engine 5 workload.
  • PGO on ICX works across all PC CPU hardware and vendors without any code changes.
  • PGO can increase overall performance on CPU-bound workloads and reduce power consumption on GPU-bound ones.
  • On systems with a shared power budget between the CPU and GPU (often mobile), lowering the demands on the CPU with PGO can increase overall performance by allowing the system to give more power to the GPU.
  • Profiles should be re-gathered when a large amount of code changes, for example as part of an engine upgrade or code refactor.