Intel® VTune™ Profiler Performance Analysis Cookbook

ID 766316

Date 9/05/2023

Profiling High Bandwidth Memory Performance on Intel® Xeon® CPU Max Series

Use Intel® VTune™ Profiler to profile memory-bound workloads in high performance computing (HPC) and artificial intelligence (AI) applications that use high bandwidth memory (HBM).

As HPC and AI applications grow more complex, their memory-bound workloads are increasingly limited by memory bandwidth. High bandwidth memory (HBM) technology in the Intel® Xeon® CPU Max Series of processors addresses this bandwidth challenge. This recipe describes how to use VTune Profiler to profile HBM performance in these memory-bound applications.

Content Expert: Vishnu Naikawadi, Min Yeol Lim, and Alexander Antonov

In this recipe, you use VTune Profiler to profile a memory-bound application on a system that has HBM. VTune Profiler displays HBM-specific performance metrics that help you understand how the workload uses HBM, so that you can analyze the performance of the workload in the context of HBM.

Memory Modes in HBM

The Intel® Xeon® CPU Max Series of processors offers HBM in three memory modes:

	HBM Only	HBM Flat Mode	HBM Caching Mode
Memory Configuration	HBM Memory. No DRAM.	Flat memory regions with HBM and DRAM	HBM caches DRAM
Workload Capacity	64 GB or less	64 GB or more	64 GB or more
Code Change	No code change.	Code change may be necessary to optimize performance.	No code change.
Usage	System boots and operates with HBM only.	Provides flexibility for applications that require large memory capacity.	Blend of HBM Only and HBM Flat Mode. Whole applications may fit in HBM cache. This mode blurs the line between cache and memory.

Switch HBM Modes

When you do not install DRAM, the processor operates in HBM Only mode. In this mode, HBM is the only memory available to the OS and all applications. The OS may see all of the installed HBM, while applications can only see what is exposed by the OS.

When you install DRAM, you can select different HBM memory modes by changing the BIOS memory mode configuration:

Open EDKII Menu.
In the Socket Configuration option, select Memory Map.
Open Volatile Memory.

NOTE:
The UI path to change the BIOS configuration may vary depending on the BIOS running on your system.

Change the HBM mode:
- To select the HBM Flat mode, select 1LM (or 1-Level Mode). This mode exposes both the HBM and the DRAM to the software. Each memory is available as a separate address space (NUMA node).
- To select the HBM Caching mode, select 2LM (or 2-Level Mode). In this mode, only the DRAM address space is visible. HBM functions as a transparent memory-side cache for DRAM.

Depending on your BIOS, additional changes may be necessary. For more information on switching between memory modes, see the Intel® Xeon® CPU Max Series Configuration and Tuning Guide.

Ingredients

Here are the hardware and software tools you need for this recipe.

Application: This recipe uses the STREAM benchmark.
Analysis Tools:
- VTune Profiler (version 2023.2 or newer)
- numactl - Use this application to control NUMA policy for processes or shared memory.
CPU: 4th Generation of Intel® Xeon® CPU Max Series processors (formerly code-named Sapphire Rapids HBM)
Operating System: Linux* OS

System Configuration

This recipe uses a system with:

2 sockets with 224 logical CPUs (Hyper-Threading enabled)
16 DRAM DIMMs of 32 GB each (8 DIMMs per socket)
HBM Flat mode with SNC4 enabled

As shown in the table below, the system used in this recipe has 8 NUMA nodes per socket:

	Socket 0	Socket 1
DRAM	Nodes 0,1,2,3	Nodes 4,5,6,7
HBM	Nodes 8,9,10,11	Nodes 12,13,14,15
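
One way to cross-check this layout from the OS is to look for NUMA nodes that have no CPUs attached, since in HBM Flat mode the HBM nodes are typically exposed as memory-only nodes. The following is a minimal sketch that reads the standard Linux sysfs node directory. The function name `list_cpuless_nodes` is hypothetical, and any other CPU-less memory on the system would be listed the same way.

```shell
#!/bin/sh
# Sketch: list NUMA nodes that have no CPUs attached.
# list_cpuless_nodes is a hypothetical helper, not part of numactl or VTune.
# Optional argument: sysfs node directory (defaults to the standard path).
list_cpuless_nodes() {
    root="${1:-/sys/devices/system/node}"
    for dir in "$root"/node[0-9]*; do
        # Skip unmatched globs and nodes without a readable cpulist.
        [ -r "$dir/cpulist" ] || continue
        # cpulist holds only whitespace for memory-only nodes.
        cpus="$(tr -d '[:space:]' < "$dir/cpulist")"
        if [ -z "$cpus" ]; then
            echo "${dir##*/node}"
        fi
    done | sort -n | paste -sd, -
}

echo "CPU-less NUMA nodes: $(list_cpuless_nodes)"
```

On a system configured like the one in this recipe, the list would be expected to match the HBM row of the table. The same information is available from `numactl --hardware`.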

Run Memory Access Analysis

In this recipe, you use VTune Profiler to run the Memory Access analysis type on the STREAM benchmark. You can run the VTune Profiler standalone application on the target system or use a web browser to access the GUI by running VTune Profiler Server.

This example uses VTune Profiler Server. To set up the server, on your target platform, run this command:

/opt/intel/oneapi/vtune/latest/bin64/vtune-backend --web-port <port_id> --allow-remote-access --data-directory /home/stream/results --enable-server-profiling

Here:

--web-port is the HTTP/HTTPS port for the web server UI and data APIs
--allow-remote-access enables remote access through a web browser
--data-directory is the root directory to store projects and results
--enable-server-profiling enables the selection of the hosting server as the profiling target

This command returns a token and a URL. Now you are ready to start the analysis.
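
If you start the server often, the command above can be wrapped in a small script. The following is a sketch only: the port value 8080 is an arbitrary placeholder, and `DRY_RUN`, `run`, and `start_backend` are hypothetical names introduced for this example. By default the script only prints the command so that you can review it first.

```shell
#!/bin/sh
# Sketch: wrap the vtune-backend launch in a function with a dry-run default.
# WEB_PORT=8080 is an arbitrary placeholder; DRY_RUN, run, and start_backend
# are hypothetical names introduced for this example.
VTUNE_BACKEND="${VTUNE_BACKEND:-/opt/intel/oneapi/vtune/latest/bin64/vtune-backend}"
WEB_PORT="${WEB_PORT:-8080}"
DATA_DIR="${DATA_DIR:-/home/stream/results}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Uses exactly the options described in this recipe.
start_backend() {
    run "$VTUNE_BACKEND" \
        --web-port "$WEB_PORT" \
        --allow-remote-access \
        --data-directory "$DATA_DIR" \
        --enable-server-profiling
}

start_backend
```

Set `DRY_RUN=0` in the environment to actually start the server.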

This recipe describes how VTune Profiler profiles the STREAM application using only HBM NUMA nodes. For this specific system configuration, the analysis uses NUMA nodes 8-15.

Open the URL returned at the command prompt.
Set a password to use VTune Profiler Server.
From the Welcome screen, create a new project.

In the Configure Analysis window, set these options:

Pane	Option	Setting
WHERE	-	VTune Profiler Server
WHAT	Target	Launch Application
	Application	Path to the `numactl` application. NOTE: Although STREAM is the actual application that gets profiled, you specify the `numactl` tool as the application in order to set NUMA affinity for the STREAM benchmark. You provide the benchmark in the Application parameters field instead.
	Application parameters	`-m 8-15` (binds memory allocation to HBM NUMA nodes 8-15), followed by the path to the STREAM benchmark
	Working directory	Path to application directory
HOW	Analysis type	Memory Access Analysis

Click Start to run the analysis.

In this default configuration, VTune Profiler collects HBM bandwidth data in addition to DRAM bandwidth. Therefore, you do not require additional settings.

NOTE:

To run the Memory Access analysis from the command line, type:

vtune -collect memory-access --app-working-dir=/home/stream -- /usr/bin/numactl -m "8-15" /home/stream/stream_app
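
The command-line collection can also be scripted so that the result lands in a known directory and a text summary is printed afterwards. The following is a sketch under stated assumptions: the result directory name, `DRY_RUN`, and the function names are hypothetical, and the node range 8-15 applies only to the system described in this recipe. It relies on the `-result-dir` and `-report summary` options of the `vtune` command line.

```shell
#!/bin/sh
# Sketch: run the Memory Access collection bound to the HBM nodes, then
# print a summary report. RESULT_DIR, DRY_RUN, and the function names are
# hypothetical; the node range 8-15 applies to this recipe's system only.
HBM_NODES="${HBM_NODES:-8-15}"
APP_DIR="${APP_DIR:-/home/stream}"
APP="${APP:-/home/stream/stream_app}"
RESULT_DIR="${RESULT_DIR:-/home/stream/results/ma_hbm_flat}"

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Collect Memory Access data while numactl binds allocations to HBM nodes.
collect_memory_access() {
    run vtune -collect memory-access \
        -result-dir "$RESULT_DIR" \
        --app-working-dir="$APP_DIR" \
        -- /usr/bin/numactl -m "$HBM_NODES" "$APP"
}

# Print the text summary of the collected result.
report_summary() {
    run vtune -report summary -result-dir "$RESULT_DIR"
}

collect_memory_access && report_summary
```

Keeping one result directory per memory mode makes it easier to compare runs later.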

Analyze Results

Once data collection is complete and VTune Profiler displays the results, open the Summary window to see general information about the execution. The information is sorted into several sections.

Elapsed Time

This section contains the following statistics:

Application execution per pipeline slots or clockticks
Total Elapsed Time - This includes idle time
CPU Time - This is the sum of CPU times of all threads and the Paused Time, which indicates the total time the application was paused (by commands from the GUI, CLI, or user API)

NOTE:

The HBM Bandwidth Bound metric is measured in terms of elapsed time.

Platform Diagram

Next, see the Platform Diagram, which presents the following information:

System topology
Average DRAM and HBM bandwidths for each package
Utilization metrics for Intel® Ultra Path Interconnect (Intel® UPI) cross-socket links and physical cores

Suboptimal application topology can cause cross-socket traffic, which in turn can limit the overall performance of the application.

Bandwidth Utilization

In this section, observe the bandwidth utilization in different domains. In this example, the system uses DRAM, UPI, and HBM domains.

NOTE:

You can see per-socket bandwidth information for DRAM and HBM domains.

To see the overall HBM utilization across the entire system, in the Bandwidth Domain drop-down menu, select HBM, GB/sec. This information is displayed as a histogram of bandwidth utilization (GB/sec) versus the aggregated elapsed time (sec) for each bandwidth utilization group.

In this example, HBM utilization is high, exceeding 1200 GB/sec for most of the run. This is because the STREAM benchmark is designed to maximize the use of memory bandwidth.

NOTE:

You can also observe the same bandwidth information in the HPC Performance Characterization analysis and the Input and Output analysis. To do this, check the Analyze memory bandwidth option before you run those analyses.

Timeline

Finally, switch to the Bottom-up window to observe the timeline. Here, you can examine the following bandwidths over time:

DRAM bandwidth (broken down per channel)
HBM bandwidth (broken down per package)
Intel® UPI links (broken down per link)

Use this information to identify potential issues, such as a misconfiguration that causes unnecessary UPI or DRAM traffic.

Hover over the graph to examine specific parts and see the bandwidth at the selected instant of time.

In the Grouping pane, select the Bandwidth Domain / Bandwidth Utilization / Type / Function / Call Stack grouping. Use this grouping to identify functions with high utilization in the HBM bandwidth domain.

To further optimize the performance of your application, run additional analyses to identify other performance issues.

This recipe describes how you measure performance when running the STREAM application in HBM Flat mode. To compare performance in the HBM Caching and HBM Only modes, switch the HBM mode and repeat the performance analysis. Find the mode with the shortest elapsed time. You can also compare DRAM and HBM bandwidths to look for higher overall bandwidth.
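
Once you have noted the Elapsed Time from the Summary window for each mode, picking the shortest one can be scripted. The following is a minimal sketch: the function name `fastest_mode`, the file name `elapsed_times.txt`, and the input format (one mode label and one elapsed time in seconds per line) are all assumptions made for this example, not something VTune Profiler produces.

```shell
#!/bin/sh
# Sketch: pick the memory mode with the shortest elapsed time.
# Input on stdin: one "<mode-label> <elapsed-seconds>" pair per line.
# fastest_mode and elapsed_times.txt are hypothetical names; fill the
# file by hand from the Summary window of each run.
fastest_mode() {
    awk 'NF >= 2 && (!seen || $2 + 0 < min) { min = $2 + 0; best = $1; seen = 1 }
         END { if (seen) print best }'
}

if [ -r elapsed_times.txt ]; then
    echo "Shortest elapsed time: $(fastest_mode < elapsed_times.txt)"
fi
```

Elapsed time is only one criterion. As noted above, the DRAM and HBM bandwidths for each run are also worth comparing.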
