Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 3/22/2024

Choosing the Best Configuration and Problem Sizes for CPUs

The performance of the Intel CPU Optimized HPCG depends on many system parameters, including (but not limited to) the hardware configuration of the host and the MPI implementation used. To get the best performance for a specific system configuration, choose a suitable combination of these parameters:

  • The number of MPI processes per host node

  • The number of OpenMP* threads per MPI process

  • The local problem size

On Intel® Xeon® processor-based clusters, use the Intel AVX2 or Intel AVX-512 optimized version of the benchmark, depending on the instruction set the processors support. For CPUs with one natural NUMA node per socket (up to 3rd Generation Intel® Xeon® Scalable processors), we recommend one MPI process per CPU socket and one OpenMP* thread per physical CPU core, skipping SMT threads.

Starting with 4th Generation Intel® Xeon® Scalable processors, there is often a natural NUMA-like division within each socket, and it is usually best to use multiple MPI processes per socket matching these divisions. For instance, each socket of the Intel® Xeon® Platinum 8480+ and Intel® Xeon® CPU Max 9480 processors contains four dies of 14 physical cores each (with an HBM stack attached to each die on the 9480), and four MPI processes per socket can give top performance. On processors without such a natural subdivision, it can still be worthwhile, as core counts per socket grow, to increase the number of MPI processes per socket and thereby reduce the number of OpenMP threads per MPI process, which improves balance and performance. To find the best configuration for your system, try a single MPI process per socket using all OpenMP threads, as well as multiple ranks per socket with 10–36 OpenMP threads per MPI process.
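
As a rough illustration of this tuning search, the following Python sketch enumerates candidate (MPI ranks per node, OpenMP threads per rank) combinations for a hypothetical two-socket node. The socket and core counts, the even-divisor restriction, and the helper function itself are illustrative assumptions rather than part of the benchmark; only the 10–36 thread target comes from the guidance above.

    # Hypothetical helper (not part of Intel Optimized HPCG): enumerate
    # (MPI ranks per node, OpenMP threads per rank) combinations to try,
    # following the guidance above.
    def candidate_configs(sockets_per_node=2, cores_per_socket=56,
                          min_threads=10, max_threads=36):
        configs = []
        # Baseline: one rank per socket, one thread per physical core (no SMT).
        configs.append((sockets_per_node, cores_per_socket))
        # Multiple ranks per socket: keep an even split of cores across ranks
        # and keep the thread count per rank in the suggested 10-36 range.
        for ranks_per_socket in range(2, cores_per_socket + 1):
            if cores_per_socket % ranks_per_socket:
                continue
            threads = cores_per_socket // ranks_per_socket
            if min_threads <= threads <= max_threads:
                configs.append((ranks_per_socket * sockets_per_node, threads))
        return configs

    for ranks, threads in candidate_configs():
        print(f"try: {ranks} MPI ranks per node x {threads} OpenMP threads per rank")

For the assumed 56-core socket, the sketch suggests trying 2 ranks x 56 threads, 4 ranks x 28 threads, and 8 ranks x 14 threads per node; the last case corresponds to the four-dies-per-socket layout discussed above.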

For best performance, use a local problem size that is large enough to utilize the available cores well, yet small enough that all MPI processes fit in the available memory. Because last level cache (LLC) sizes per socket have grown considerably on modern CPUs, complying with current HPCG benchmark requirements also means choosing the local problem size (nx x ny x nz) large enough that the combined size of one vector per MPI process on the socket (each vector occupies nx*ny*nz*sizeof(double) bytes) does not fit in the LLC.
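
The following Python sketch illustrates this check. It is not part of the benchmark, and the per-socket LLC size in the example is a placeholder that you would replace with the actual value for your CPU.

    # Hypothetical sizing check (not part of the benchmark).
    def vector_bytes(nx, ny, nz):
        # One HPCG vector for a local nx x ny x nz grid: nx*ny*nz*sizeof(double).
        return nx * ny * nz * 8

    def exceeds_llc(nx, ny, nz, ranks_per_socket, llc_bytes_per_socket):
        # True if the combined size of one vector per MPI process on the socket
        # does not fit in the LLC, as required above.
        return ranks_per_socket * vector_bytes(nx, ny, nz) > llc_bytes_per_socket

    # Example: 4 ranks per socket and an assumed 300 MiB LLC per socket.
    print(exceeds_llc(192, 192, 192, 4, 300 * 2**20))   # False: grid too small
    print(exceeds_llc(256, 256, 256, 4, 300 * 2**20))   # True: overflows the LLC

A grid that passes this check should still be validated against the available memory per socket, since the full HPCG working set (matrix, multigrid levels, and work vectors) is much larger than a single vector.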

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201