• 2022
  • 10/20/2022
  • Public Content

Page Faults

This recipe helps you identify and measure the impact of page faults on application performance by using the Microarchitecture Exploration, System Overview, and Memory Consumption analyses in Intel® VTune™ Profiler.
Content expert: Vitaly Slobodskoy
A page fault occurs when a running program accesses a memory page that is not currently mapped to the virtual address space of a process. Mapping is handled by the Memory Management Unit (MMU), which uses the Translation Lookaside Buffer (TLB) as a cache to reduce the time taken to access a memory location. When a TLB miss occurs, the page may be accessible to the process but simply not mapped yet, or the page content may need to be loaded from the storage device; in either case a page fault exception is raised. While page faults are a common mechanism for handling virtual memory, their impact on application performance can be significant. One way to reduce that impact is to increase the page size.
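To make the two kinds of faults described above concrete, here is a minimal sketch that reads the minor (page accessible but not yet mapped) and major (content loaded from storage) fault counters that Linux keeps per process. The helper name `page_faults` is hypothetical and not part of VTune Profiler; the sketch assumes a Linux system with `/proc` mounted.

```shell
#!/bin/bash
# page_faults: hypothetical helper that prints cumulative page-fault counters
# from a Linux /proc/<pid>/stat file (defaults to the current shell).
page_faults() {
    local stat_file="${1:-/proc/$$/stat}"
    local stat
    stat="$(cat "$stat_file")" || return 1
    # Drop the leading "pid (comm) " so a command name with spaces cannot shift the fields.
    stat="${stat##*) }"
    # Remaining fields: state ppid pgrp session tty_nr tpgid flags minflt cminflt majflt ...
    set -- $stat
    echo "minor=$8 major=${10}"
}

page_faults    # counters for the current shell
```

Calling `page_faults /proc/<pid>/stat` before and after a workload phase gives a rough count of the faults taken in between.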


This section lists the hardware and software tools used for the performance analysis scenario.
  • Application: the matrix sample app available from the product directory. For this recipe, change the size of matrices by modifying the NUM value from 2048 to 8192, then rebuild the application.
  • Performance analysis tools: Intel® oneAPI Base Toolkit (Beta) > Intel® VTune™ Profiler (Beta 04) > Microarchitecture Exploration, System Overview, and Memory Consumption analysis types
    • Starting with the 2020 release, Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler.
    • Most recipes in the Intel® VTune™ Profiler Performance Analysis Cookbook are flexible. You can apply them to different versions of Intel® VTune™ Profiler. In some cases, minor adjustments may be required.
    • Get the latest version of Intel® VTune™ Profiler.
  • Operating System: Ubuntu* 18.04.1 LTS 64-bit
  • CPU: Intel® Core™ i7-6700K

Identify TLB Issues with Microarchitecture Exploration Analysis

To get a full picture of hardware resource usage for your app, run the Microarchitecture Exploration analysis:
  1. Launch the VTune Profiler. By default, VTune Profiler opens with the sample (matrix) project as current. Make sure this project is configured to launch the matrix application with NUM=8192. Otherwise, create a new project for the updated application.
  2. Click the Configure Analysis... button on the Welcome page. The Configure Analysis window opens.
  3. Click the down arrow button and select Microarchitecture Exploration from the list of analysis types.
  4. Run the analysis. VTune Profiler collects the data and opens the result window with application-level statistics.
Explore the Back-End Bound issues caused by TLB misses.
The DTLB Overhead metric estimates the performance penalty paid for missing the TLB. Most of the overhead is attributed to the Load STLB Hit metric, which counts first-level (DTLB) misses that hit the second-level TLB (STLB). There is still a small value for the Load STLB Miss metric, which represents the fraction of cycles spent performing a hardware page walk. Note that these metrics do not account for the overall time spent within page fault exceptions. So, the Microarchitecture Exploration analysis helps diagnose TLB-related issues but cannot estimate the impact of page fault exceptions on the application elapsed time.

Trace Kernel Activity with System Overview Analysis

A page fault triggers an interrupt caught by the Linux kernel. To measure the exact CPU time spent within the Linux kernel, a more granular analysis is needed. The System Overview analysis in the Hardware Tracing mode uses Intel® Processor Trace technology to capture all the retired branch instructions on CPU cores. In particular, this analysis enables accurate tracing of all kernel activities, including interrupts.
Even with the Launch Application target configuration, this analysis performs a system-wide data collection.
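Since the Hardware Tracing mode relies on Intel® Processor Trace, it may be worth checking that the CPU reports the feature before collecting. The sketch below assumes a Linux system where the kernel lists the `intel_pt` flag in `/proc/cpuinfo`; the helper name `has_intel_pt` is hypothetical, and a missing flag (for example inside a virtual machine) is only a hint, not a definitive answer about what VTune Profiler supports.

```shell
#!/bin/bash
# has_intel_pt: hypothetical helper that checks the CPU flags reported by the Linux kernel.
has_intel_pt() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches intel_pt as a whole word within the flags line.
    grep -qw 'intel_pt' "$cpuinfo" 2>/dev/null
}

if has_intel_pt; then
    echo "Intel PT is reported by the CPU"
else
    echo "Intel PT is not reported; Hardware Tracing mode may be unavailable"
fi
```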
Because of the large number of branch instructions, this analysis collects a lot of raw data. You can launch the analysis from the command line and limit data collection to the first 3 seconds:
vtune -collect system-overview -knob collecting-mode=hw-tracing -d 3 -r matrix-so ./matrix
Before launching the vtune command-line interface, make sure to set up the environment variables from the product installation directory:
source env/
Open the result in the VTune Profiler GUI:
vtune-gui ./matrix-so
When the result opens, switch to the
tab and filter the collected data by the
matrix process using the filter bar drop-down menu:
From the Timeline pane that provides an over-time view, you can see that most of the CPU time is spent within the
module executing the
function. This function does not run continuously: it is usually interrupted within a few milliseconds, and the heaviest interrupts are caused by page faults.
The grid view shows that the sample application spends 6.1% of its overall time within the Linux kernel, with 439K kernel entries in just the first 3 seconds of execution. To reduce this overhead, consider using huge pages.

Calculate the Amount of Allocated Memory with Memory Consumption Analysis

To switch to huge pages, determine how many pages are needed. To do this, calculate the amount of memory the application allocates. For simple applications like matrix, it is trivial to inspect the source code. For more complex applications, consider using the Memory Consumption analysis. It provides the exact allocated memory size and helps identify objects that should use huge pages.
  1. Click Configure Analysis to open your project configuration.
  2. Click the down arrow button and select Memory Consumption from the list of analysis types.
  3. Change the Minimal dynamic memory object size to track option value to 1.
  4. Run the analysis. VTune Profiler collects the data and opens the result window with application-level statistics.
  5. In the grid view, right-click the Allocation Size column and select Show Data As to switch to a bytes representation.
  6. Right-click the grid again and choose Select All to see the total allocation size. The application allocates 2147557472 bytes.
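As a sanity check on the reported total, the arithmetic below compares it with what the matrices alone would occupy. This is a sketch under stated assumptions: the element size (8 bytes, as for a double) and the number of NUM x NUM arrays (four) are not given in this recipe and are assumed here purely for illustration.

```shell
#!/bin/bash
num=8192            # matrix dimension used in this recipe
elem_size=8         # assumption: 8-byte (double) elements
matrices=4          # assumption: four NUM x NUM arrays
total=2147557472    # total reported by the Memory Consumption analysis

arrays=$(( num * num * elem_size * matrices ))
echo "matrix data:       ${arrays} bytes"
echo "other allocations: $(( total - arrays )) bytes"
```

Under these assumptions, nearly all of the allocated memory belongs to the matrices, which is the data that benefits from huge pages.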

Reduce Page Faults with Huge Pages

By default, the page size is 4 KB. With huge pages, the default page size is 2 MB, and it can be increased up to 1 GB. Switching to huge pages is quite easy with the hugeadm tool and the HUGETLB_MORECORE environment variable used in the steps below.
First, calculate how many 2 MB pages you need. The matrix sample allocates 2147557472 bytes. This means that you need 2147557472 / 2097152 ≈ 1024.04, which rounds up to 1025 pages of 2 MB.
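The page-count calculation above is a ceiling division, which can be sketched with integer shell arithmetic. The helper name `pages_needed` is hypothetical and used only for illustration.

```shell
#!/bin/bash
# pages_needed: hypothetical helper that rounds bytes / page_size up to a whole page.
pages_needed() {
    local bytes="$1" page_size="$2"
    echo $(( (bytes + page_size - 1) / page_size ))
}

pages_needed 2147557472 $(( 2 * 1024 * 1024 ))    # 2 MB pages, prints 1025
```

Adding `page_size - 1` before dividing avoids floating-point math while still rounding any partial page up.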
To switch to huge pages:
  1. Configure the number of pages:
    sudo hugeadm --pool-pages-min 2Mb:1025
  2. Create a
    script with the following content:
    #!/bin/bash
    HUGETLB_MORECORE=yes ./matrix
  3. Set the executable mode for the script:
    chmod u+x ./
  4. Re-run the System Overview analysis:
    vtune -collect system-overview -knob collecting-mode=hw-tracing -d 3 -r matrix-so-hp ./
  5. Open the result in the VTune Profiler GUI:
    vtune-gui ./matrix-so-hp
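If the results with huge pages look unchanged, one thing to check is whether the pool configured in step 1 was actually reserved. The sketch below reads the `HugePages_Free` counter from `/proc/meminfo` on Linux, which describes the pool for the default huge page size (assumed here to be 2 MB). The helper name `free_huge_pages` is hypothetical.

```shell
#!/bin/bash
# free_huge_pages: hypothetical helper that prints the HugePages_Free counter.
free_huge_pages() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^HugePages_Free:/ { print $2 }' "$meminfo" 2>/dev/null
}

needed=1025    # page count calculated earlier in this recipe
free="$(free_huge_pages)"
if [ "${free:-0}" -ge "$needed" ]; then
    echo "pool has ${free} free huge pages"
else
    echo "only ${free:-0} free huge pages; ${needed} are needed"
fi
```

Run the check before launching the application, because pages already in use by a running process no longer count as free.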
The result now shows a 3.3% reduction of kernel CPU time and an 8.1x reduction in kernel-mode entries.
The elapsed time of the matrix application with huge pages is reduced from 106.4 s to 100.5 s, which is an improvement of around 5% in overall elapsed time without changing a single line of code.
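The relative improvement can be reproduced from the two elapsed times. The sketch below uses awk for the floating-point division; the helper name `improvement_pct` is hypothetical.

```shell
#!/bin/bash
# improvement_pct: hypothetical helper that prints the relative reduction in percent.
improvement_pct() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

improvement_pct 106.4 100.5    # prints 5.5
```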
