What is HWPGO
- Intel’s sample-based profile-guided optimization (SPGO), which is an alternative to traditional instrumented PGO, is referred to as Hardware PGO (HWPGO) due to the underlying mechanism it employs. HWPGO leverages hardware performance monitoring counters available on modern Intel CPUs.
- Performance monitoring counters (PMC) provide information about hardware events. Profiling tools such as Linux* Perf and Intel® VTune™ Profiler SEP provide sample-based profiles of these events, which can be used by the compiler.
- This results in a lower-overhead, faster, and more efficient way to build a performance optimization profile for the primary execution path of your workload.
Performance Monitoring Counters (PMC), or hardware performance counters, are specialized hardware registers embedded within CPUs designed to monitor various performance-related events occurring during program execution. These counters are part of the processor's Performance Monitoring Unit (PMU).
PMCs can track a wide range of events, including:
- Instructions retired: The number of instructions successfully executed by the processor.
- Cache misses: The number of times data or instructions are not found in a given cache and must be fetched from the next level of the memory hierarchy.
- Branch mispredictions: The number of times the processor predicts the wrong outcome of a branch instruction.
- Floating-point operations: The number of floating-point operations executed by the processor.
- Memory access: Information about memory accesses, such as the number of read and write operations.
- Hardware interrupts: Events related to hardware interrupts, such as the number of interrupts received.
Reduced Collection Overhead
As a form of Sample-Based PGO (SPGO), HWPGO has dramatically lower profiling overhead than traditional instrumentation-based PGO. This simplifies collection by removing the need to craft special "training" workloads; overhead is typically low enough that collection can be done on production workloads with production binaries.
While instrumentation performs aggregation continuously in user space, HWPGO programs a PMU counter to trigger a sample every N times an event occurs. It thus leverages dedicated hardware to perform much of the online aggregation. Overhead is typically on the order of 1%, and the sampling frequency can be decreased to reduce it further. Each sample also captures Last Branch Records (LBR), which log the most recently taken branches and so effectively form a short execution trace.
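The "sample every N events" mechanism can be pictured with a small software model. The sketch below is only an illustration under simplifying assumptions: the function names are hypothetical, and a real PMU does this counting in hardware rather than in a loop.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of period-based sampling: a counter is loaded with the period,
 * decremented on every event, and a sample is recorded each time it
 * reaches zero. Returns the number of samples taken. */
uint64_t simulate_sampling(uint64_t total_events, uint64_t period) {
    assert(period > 0);
    uint64_t remaining = period;
    uint64_t samples = 0;
    for (uint64_t e = 0; e < total_events; e++) {
        if (--remaining == 0) {
            samples++;          /* real hardware would raise an interrupt here */
            remaining = period; /* re-arm the counter */
        }
    }
    return samples;
}

/* Each sample stands in for roughly `period` occurrences of the event. */
uint64_t estimate_events(uint64_t samples, uint64_t period) {
    return samples * period;
}
```

For instance, 1050 events with a period of 100 yield 10 samples, which scale back to an estimate of 1000 events. The estimate is always within one period of the true count, which is why a larger period trades fidelity for lower overhead.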
New Feedback Capabilities
Beyond reduced overhead, HWPGO enables new types of feedback not possible with instrumentation. For example, PMU hardware can trigger samples when the hardware mispredicts a branch. This allows the construction of a branch mispredict profile. The Intel® Compilers in version 2024.0 and newer can use such a profile to improve code generation and more aggressively use conditional move ("cmov") instructions.
In the future, we hope to enable additional types of PMU feedback as well as improved tooling to help developers manage multiple PMU profiles.
Branch Mispredict Feedback
Branch misprediction occurs when the processor incorrectly predicts the outcome of a conditional branch instruction. Such branches arise from if-then-else statements, loops, and any other code that makes a decision. When the processor predicts a branch, it speculatively executes instructions along the predicted path.
If the prediction is incorrect, the processor must discard any speculative results and execute from the correct target. In this case the processor has used time and resources fetching and executing instructions unnecessarily, leading to decreased performance and pipeline inefficiencies. A mispredict profile must be combined with an execution frequency profile to understand mispredict ratios. Execution frequency feedback is a performance optimization technique used in compilers and runtime systems to dynamically adjust optimization decisions based on the execution frequency of different code paths or program segments. For an example of execution frequency feedback, please check this document. 
In this section, we use source code from here: hwpgo-mispredict-example
Branch Mispredict Feedback Example
Consider the following loop:
for (int i = 0; i < N; i++) {
  int *p;
  if(s1[i] > 8000) {
    p = &s2[i];
    int z = i * i * i * i * i * i * i;
    nop(p, z);
  } else {
    p = &s3[i];
    nop(p, 3);
  }
  dst[i] = *p;
}
It is difficult for a compiler's static analysis to guess how well hardware will predict the if statement's condition because it depends on the contents of the s1 array.
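To see why the contents of s1 matter, here is a minimal sketch that uses a textbook 2-bit saturating counter as a stand-in for the branch predictor. This toy model is an assumption made purely for illustration: real Intel predictors are far more sophisticated, and all function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 2-bit saturating-counter predictor for a single branch.
 * States 0 and 1 predict not-taken; states 2 and 3 predict taken. */
size_t count_mispredicts(const int *taken, size_t n) {
    assert(taken != NULL || n == 0);
    int state = 0;
    size_t mispredicts = 0;
    for (size_t i = 0; i < n; i++) {
        int predicted_taken = (state >= 2);
        int actually_taken = (taken[i] != 0);
        if (predicted_taken != actually_taken)
            mispredicts++;
        /* Move the counter toward the observed outcome, saturating at 0 and 3. */
        if (actually_taken) {
            if (state < 3) state++;
        } else {
            if (state > 0) state--;
        }
    }
    return mispredicts;
}

/* First half not taken, second half taken: 50% taken, very regular. */
void fill_sorted(int *taken, size_t n) {
    for (size_t i = 0; i < n; i++)
        taken[i] = (i >= n / 2);
}

/* Outcomes drawn from a xorshift32 generator: about 50% taken, irregular. */
void fill_irregular(int *taken, size_t n) {
    uint32_t x = 2463534242u; /* arbitrary nonzero seed */
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        taken[i] = (int)((x >> 16) & 1u);
    }
}
```

Both inputs take the branch about half the time, yet under this model the regular pattern is mispredicted only twice (at the single transition), while the irregular one is mispredicted on a large fraction of iterations. The taken ratio alone therefore says little about predictability.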
If it is well predicted then a conditional jump is appropriate, but if it is poorly predicted, it may be better for the compiler to eliminate control flow. The following loop is equivalent but has had control flow eliminated in favor of conditional expressions:
for (int i = 0; i < N; i++) {
  int *p;
  int c = (s1[i] > 8000);
  p = c ? (&s2[i]) : (&s3[i]);
  int z = i * i * i * i * i * i * i;
  int arg2 = c ? z : 3;
  nop(p, arg2);
  dst[i] = *p;
}
The disadvantage of this transformation is that z is computed unconditionally but not used when the condition, c, is false. In other words, the cost of eliminating control flow is that z must be speculatively computed. The profitability of this transformation depends on c being poorly predicted.
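As a quick sanity check on the equivalence claimed above, the following sketch runs both loop shapes on the same inputs and compares the results. Several things here are assumptions and not part of the example sources: nop is replaced by a hypothetical stub that folds its arguments into a checksum, z is computed in unsigned arithmetic to avoid signed overflow, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for nop(): records its arguments in a checksum
 * so that the two loop variants can be compared. */
static unsigned long checksum;
static void nop(const int *p, unsigned z) { checksum += (unsigned long)*p + z; }

/* i^7 in unsigned arithmetic (wraps instead of overflowing). */
static unsigned pow7(unsigned i) { return i * i * i * i * i * i * i; }

/* Variant with control flow, mirroring the first listing. */
unsigned long run_branchy(const int *s1, const int *s2, const int *s3,
                          int *dst, size_t n) {
    assert(s1 && s2 && s3 && dst);
    checksum = 0;
    for (size_t i = 0; i < n; i++) {
        const int *p;
        if (s1[i] > 8000) {
            p = &s2[i];
            nop(p, pow7((unsigned)i));
        } else {
            p = &s3[i];
            nop(p, 3u);
        }
        dst[i] = *p;
    }
    return checksum;
}

/* Variant with conditional expressions, mirroring the second listing. */
unsigned long run_branchless(const int *s1, const int *s2, const int *s3,
                             int *dst, size_t n) {
    assert(s1 && s2 && s3 && dst);
    checksum = 0;
    for (size_t i = 0; i < n; i++) {
        int c = (s1[i] > 8000);
        const int *p = c ? &s2[i] : &s3[i];
        unsigned z = pow7((unsigned)i); /* computed even when c is false */
        unsigned arg2 = c ? z : 3u;
        nop(p, arg2);
        dst[i] = *p;
    }
    return checksum;
}
```

Both functions return the same checksum and fill dst identically for any input; the only difference is that the second always pays for pow7, which is the speculative cost discussed above.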
Fortunately, HWPGO enables the compiler to make this decision for us based on the hardware's actual branch predictor results.
The rest of this section walks through applying HWPGO to this example.
First Compilation Phase for Branch Mispredict Feedback
The first step is to produce a binary suitable for PMU-based profiling. The primary requirement is that debug information which can be used to correlate instructions in the binary with source locations is available. The binary may otherwise be fully optimized.
Note: it's possible to split the debug information from binaries using generic -gsplit-dwarf functionality and DWARF package files.
Though not strictly required, the -fprofile-sample-generate option makes it easy to ensure that useful debug information is generated. We can compile the example as follows:
$ icx -O3 -fprofile-sample-generate unpredictable.c nop.c -o unpredictable
Let's note the performance of the result:
$ time ./unpredictable
./unpredictable  1.04s user 0.00s system 99% cpu 1.050 total
If we were to strip the binary or recompile without -fprofile-sample-generate, we would observe identical performance and identical code disassembly.
Profile Collection for Branch Mispredict Feedback
Let us take a closer look at what the hardware can tell us about the executable's behavior by generating PMU profiles.
- On Linux: $ perf record -b -e BR_INST_RETIRED.NEAR_TAKEN:uppp,BR_MISP_RETIRED.ALL_BRANCHES:upp -c 1000003 -- ./unpredictable
Perf tool is a performance analysis and profiling tool available on Linux systems.
perf record: This is the command to start recording performance events using perf.
-b: This option tells perf to collect Last Branch Records (LBRs) with each sample.
-e: This option specifies the events to profile, namely BR_INST_RETIRED.NEAR_TAKEN and BR_MISP_RETIRED.ALL_BRANCHES.
BR_INST_RETIRED.NEAR_TAKEN:uppp: This event counts taken branches. The u modifier restricts counting to user mode, and ppp requests the highest precision level together with precise distribution.
BR_MISP_RETIRED.ALL_BRANCHES:upp: This event counts mispredicted branches. The u modifier again restricts counting to user mode, and pp requests high precision.
-c 1000003: This option specifies the period with which perf will sample the performance events. In this case, it is set to 1000003, meaning that perf will collect a sample every 1000003 occurrences of the monitored event.
-- ./unpredictable: This part of the command specifies the program to be profiled. Here, ./unpredictable is the path to the executable application.
This command runs ./unpredictable under perf and samples both retired taken branches and retired mispredicted branches, each with a sampling period of 1000003 events. By default, perf record writes its data to perf.data in the current directory. The llvm-profgen commands later in this section read unpredictable.perf.data, so either rename the file or pass -o unpredictable.perf.data to perf record.
- On Windows, using VTune SEP: $ sep -start -out unpredictable.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES,BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA=1000003:lbr:USR=YES -lbr no_filter:usr -perf-script event,ip,brstack -app .\unpredictable
Note: SEP emits a simple textual perf.data.script file in addition to tb7 data when given the -perf-script option. Only the perf.data.script output is used for profile generation.
SEP is the command-line sampling collector that ships with Intel VTune Profiler.
sep: This is the command that invokes the SEP collector.
-start: This option starts profiling.
-out unpredictable.tb7: This option specifies the output file where the profiling data will be saved. In this case, the output file is named unpredictable.tb7.
-ec: This option specifies the events to profile, namely,  BR_INST_RETIRED.NEAR_TAKEN and BR_MISP_RETIRED.ALL_BRANCHES. 
BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=1000003:pdir:lbr:USR=YES:
This event counts taken branches. The options PRECISE=YES, SA=1000003, pdir, lbr, and USR=YES provide additional configuration parameters related to the precise event counting, sampling interval, precise distribution, last branch record, and user-mode events, respectively.
BR_MISP_RETIRED.ALL_BRANCHES:PRECISE=YES:SA=1000003:lbr:USR=YES:
This event counts mispredicted branches. The options PRECISE=YES, SA=1000003, lbr, and USR=YES provide additional configuration parameters related to the precise event counting, sampling interval, last branch record, and user-mode events, respectively. 
-lbr no_filter:usr: This option configures the profiler to collect the Last Branch Record (LBR) stack information without any filter applied, specifically for user-mode events.
-perf-script event,ip,brstack: This option specifies the fields to include in the textual perf.data.script output file. In this case, it includes event names, instruction pointers (IP), and branch stack information.
-app .\unpredictable: This specifies the application to be profiled. In this case, it's a relative path to the executable unpredictable.
In summary, this command starts a SEP collection that samples retired taken branches and retired mispredicted branches, capturing the instruction pointer (IP) and the Last Branch Record (LBR) stack with each sample. The collected data is saved to the file unpredictable.tb7, and the accompanying textual perf script output is what llvm-profgen consumes.
We've asked for a sample roughly every 1 million occurrences of each event with -c 1000003 or SA=1000003. A smaller sampling period implies a higher sampling frequency. A higher frequency yields better profile fidelity in exchange for slightly higher profiling overhead and greater storage requirements. Depending on the characteristics of your program, it may be worth increasing or decreasing the sampling period to balance overhead and fidelity. A prime sampling period is recommended in order to avoid aliasing with execution patterns in the program.
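The aliasing concern can be illustrated with a toy model. In this sketch, which is an assumption-laden simplification and not how perf or SEP attribute samples, a program cycles through a fixed number of branch sites in round-robin order, and every period-th event is attributed to the site that produced it. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: `num_sites` branch sites execute round-robin, one event each.
 * Every `period`-th event is sampled and attributed to its site.
 * hist[s] receives the number of samples attributed to site s. */
void sample_sites(uint64_t total_events, uint64_t period,
                  size_t num_sites, uint64_t *hist) {
    assert(period > 0 && num_sites > 0 && hist != NULL);
    for (size_t s = 0; s < num_sites; s++)
        hist[s] = 0;
    /* Events are numbered from 1; event e belongs to site (e - 1) % num_sites. */
    for (uint64_t e = period; e <= total_events; e += period)
        hist[(e - 1) % num_sites]++;
}
```

With 4 sites and a period of 1000 (a multiple of 4), every sample lands on the same site, so the profile wrongly suggests that one branch accounts for all the events. With the prime period 1009 the samples spread evenly across all 4 sites, matching the true distribution.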
We can generate source-level profiles using two invocations of llvm-profgen -- one for each PMU profile type. We use --format text for this example to generate human-readable profiles, but the default binary format should be used otherwise.
The first invocation generates a profile of code execution frequency using LBRs. The second invocation generates a profile of mispredicted branches based on the sampled instruction pointers:
- On Linux:
	llvm-profgen --format text --output=unpredictable.freq.prof --binary=unpredictable --sample-period=1000003 --perf-event=BR_INST_RETIRED.NEAR_TAKEN:uppp --perfdata=unpredictable.perf.data
	llvm-profgen --format text --output=unpredictable.misp.prof --binary=unpredictable --sample-period=1000003 --perf-event=BR_MISP_RETIRED.ALL_BRANCHES:upp --leading-ip-only --perfdata=unpredictable.perf.data
- On Windows:
	llvm-profgen --format text --output=unpredictable.freq.prof --binary=unpredictable --sample-period=1000003 --perf-event=BR_INST_RETIRED.NEAR_TAKEN:pdir --perfscript=unpredictable.perf.data.script
	llvm-profgen --format text --output=unpredictable.misp.prof --binary=unpredictable --sample-period=1000003 --perf-event=BR_MISP_RETIRED.ALL_BRANCHES --leading-ip-only --perfscript=unpredictable.perf.data.script
The Windows llvm-profgen commands are adjusted to use SEP's textual -perf-script output format and different event names.
Mispredict profiles should be generated using --leading-ip-only, as Last Branch Records (LBR) are only used for execution frequency profiles. 
Here's the resulting execution frequency profile:
unpredictable:12905587092:0
 3.1: 202483466
 3.2: 202483466
 5: 202483466
 6: 119064278
 7: 119064278
 8: 124612654 nop:124612654
 11: 87128858 nop:87128858
 13: 202483466
nop:194225418:211741512
 1: 194225418
Note that the first column in the profile denotes the source line offset from the beginning of the function, and the C code listings above start at line offset 3. The second column is the estimated number of executions. Additional columns, if present, indicate function call counts.
For example, the branch 5 source lines into the "unpredictable" function was executed about 200,000,000 times. We can check this against the sources: the branch is in the body of a loop that runs N * ITERS times, which is 200,000,000.
We can also see that the two nop calls corresponding to if/else cases of the branch are executed about 60% and 40% of the time, respectively. This doesn't tell us how well or poorly the branch is being predicted, however: it's entirely possible for a branch to be taken roughly 50% of the time but very well predicted. (See predictable.c for an example of this.)
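The text profile lines discussed above are simple enough to read mechanically. The following is a minimal sketch of a parser for a single body line; the function name is hypothetical, and it assumes the LLVM text sample-profile convention in which a suffix such as ".1" after the line offset is a discriminator that separates several code regions on the same source line. Trailing call-target columns are ignored.

```c
#include <assert.h>
#include <stdio.h>

/* Parses one body line of a text profile, for example " 3.1: 202483466"
 * or " 8: 124612654 nop:124612654". Returns 1 on success and 0 otherwise.
 * *discriminator is set to 0 when the line has no ".N" suffix. */
int parse_profile_line(const char *line, unsigned *offset,
                       unsigned *discriminator, unsigned long long *count) {
    assert(line && offset && discriminator && count);
    *discriminator = 0;
    /* Form with a discriminator: "offset.disc: count". */
    if (sscanf(line, " %u.%u: %llu", offset, discriminator, count) == 3)
        return 1;
    /* Form without a discriminator: "offset: count". */
    if (sscanf(line, " %u: %llu", offset, count) == 2)
        return 1;
    return 0;
}
```

Applied to " 8: 124612654 nop:124612654", this yields offset 8, discriminator 0, and a count of 124612654. Function header lines such as "unpredictable:12905587092:0" are rejected because they do not start with a numeric offset.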
To know conclusively whether or not the branch is being mispredicted, we can look at the mispredict profile:
unpredictable:172000516:0
 3.1: 0
 3.2: 0
 5: 86000258
 6: 0
 7: 0
 8: 0
 11: 0
 13: 0
 15: 0
This shows that the branch is extremely difficult for the hardware to predict: Out of 200 million iterations it's being mispredicted 86 million times! Since virtually all of the branch mispredicts in this program happen on this one branch, we can double-check this result using perf stat, which does not profile but simply counts:
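The ratio implied by the two profiles can be worked out directly. The helper names in this sketch are hypothetical; the numbers in the usage note come from the profiles shown above.

```c
#include <assert.h>

/* Fraction of a branch's executions that were mispredicted.
 * Returns 0.0 when the branch has no recorded executions. */
double mispredict_ratio(unsigned long long mispredicts,
                        unsigned long long executions) {
    if (executions == 0)
        return 0.0;
    return (double)mispredicts / (double)executions;
}

/* Number of whole sampling periods contained in a scaled profile count. */
unsigned long long periods_in_count(unsigned long long count,
                                    unsigned long long period) {
    assert(period > 0);
    return count / period;
}
```

For the branch at line offset 5, mispredict_ratio(86000258, 202483466) is roughly 0.42, so the branch is mispredicted on about 42% of its executions. The mispredict count is also exactly 86 times the sampling period of 1000003, which is consistent with the profile being built from 86 samples that were each scaled by the period.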
$ perf stat -e br_misp_retired.all_branches -- ./unpredictable
 Performance counter stats for './unpredictable':
        85,739,375      br_misp_retired.all_branches
       1.025948268 seconds time elapsed
The actual count is 85.7 million mispredicts.
Common llvm-profgen Issues
It is important to use the oneAPI llvm-profgen. If you'd like it to be in PATH, use the --include-intel-llvm option when performing oneAPI environment setup. Otherwise, you can use icx --print-prog-name=llvm-profgen to discover the right path. (Use icx /nologo /clang:--print-prog-name=llvm-profgen on Windows.)
It is common to see a few warnings, but the one to look out for is No samples in perf script! when producing the execution frequency profile. This means the PMU profile was empty, the wrong binary was profiled, or the wrong PMU event was specified with --perf-event.
Note that an empty branch mispredict profile is plausible, so ignoring No samples in perf script! when producing a branch mispredict profile may be ok.
Second Compilation Phase of Branch Mispredict Feedback
If we recompile providing the mispredict profile to the compiler, it can speculate more aggressively and avoid the branch altogether:
$ icx -O3 -fprofile-sample-use=unpredictable.freq.prof -mllvm -unpredictable-hints-file=unpredictable.misp.prof unpredictable.c nop.c -o unpredictable.hwpgo
Note that both the execution frequency and branch mispredict profile are given to the compiler. The compiler will use the execution profile in combination with the mispredict profile to estimate the branch mispredict ratio of individual branches.
Although this example is focused on branch mispredict feedback, the execution frequency profile is also useful on its own and provides information similar to basic block execution counts provided by Instrumented PGO.
Evaluation of Branch Mispredict Feedback
We can see the performance improvements from providing branch mispredict feedback in both execution time and hardware metrics:
Before HWPGO:
$ perf stat -e cycles:u,instructions,br_inst_retired.all_branches:u,br_misp_retired.all_branches -- ./unpredictable
 Performance counter stats for './unpredictable':
     3,243,043,047      cycles:u
     3,619,535,187      instructions:u       #   1.12  insn per cycle
       917,083,309      br_inst_retired.all_branches:u
        85,966,707      br_misp_retired.all_branches:u
       1.021617622 seconds time elapsed
After HWPGO:
$ perf stat -e cycles:u,instructions,br_inst_retired.all_branches:u,br_misp_retired.all_branches -- ./unpredictable.hwpgo
 Performance counter stats for './unpredictable.hwpgo':
     1,715,030,113      cycles:u
     4,000,954,710      instructions:u       #   2.33  insn per cycle
       600,132,829      br_inst_retired.all_branches:u
            13,322      br_misp_retired.all_branches:u
       0.541091594 seconds time elapsed
Here, we see a speedup of about 1.9x in execution time (1.02 seconds vs. 0.54 seconds), though a gain of this size is not typical.
We also see that the total number of executed branches has been reduced because the if/else is implemented as a conditional move rather than a conditional jump.
In turn, this dramatically reduces the number of mispredicted branches. We do have more total instructions executed in exchange, but in this case, it is profitable because eliminating the mispredicts more than doubles instructions per cycle (IPC), from 1.12 to 2.33.
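The headline figures can be re-derived from the raw perf stat counters. The two helpers below are a small hypothetical sketch for doing so; the inputs in the usage note are the values printed above.

```c
#include <assert.h>

/* Instructions per cycle computed from two raw counter values. */
double ipc(unsigned long long instructions, unsigned long long cycles) {
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* Speedup of the optimized run relative to the baseline run. */
double speedup(double baseline_seconds, double optimized_seconds) {
    assert(optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}
```

With the counters above, ipc(3619535187, 3243043047) is about 1.12 and ipc(4000954710, 1715030113) is about 2.33, matching the perf stat annotations, while speedup(1.021617622, 0.541091594) is about 1.89.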
Example Sources
Sources for the loop discussed in this article are available as the source file unpredictable.c in the following repository:
This repository also includes a similar example where cmov would be harmful.
Use Hardware Profile Guided Optimization with your Application
Hardware Performance Monitoring Counter Assisted Profile Guided Optimization (HWPGO) provides the developer with a low overhead approach to optimizing performance favoring the dominant execution path of a workload.
If you have been using Profile Guided Optimization (PGO) before and are looking for a more efficient way, check out the latest Intel® oneAPI DPC++/C++ Compiler today, either stand-alone or as part of the Intel® oneAPI Base Toolkit.