Developer Guide for Intel® oneAPI Math Kernel Library for Linux*

ID 766690
Date 3/22/2024

Getting Started with Intel® CPU Optimized HPCG

To start working with the benchmark:

  1. On a cluster file system, unpack the Intel® CPU Optimized HPCG package to a directory accessible by all nodes (see the example command after this list). Read and accept the license as indicated in the readme.txt file included in the package.

  2. Change the directory to hpcg/hpcg_cpu/bin.

  3. Determine which prebuilt version of the benchmark is best for your system, or follow the QUICKSTART instructions to build a version for your MPI implementation. One way to check the instruction set extensions your processors support is shown after this list.

  4. Ensure that the Intel® oneAPI Math Kernel Library (oneMKL), Intel® C/C++ Compiler, and MPI runtime environments have been set up properly. You can do this with the vars.sh scripts included in those distributions; an example of sourcing them appears after this list.

  5. Run the chosen version of the benchmark.

    For Intel® Xeon® processor families up through the 3rd generation Intel® Xeon® Scalable processors (formerly code-named Ice Lake), the Intel® AVX2 and Intel® AVX-512 optimized versions perform best with one MPI process per socket and one OpenMP* thread per physical core, skipping the simultaneous multithreading (SMT) threads. We recommend setting the affinity to KMP_AFFINITY=granularity=fine,compact,1,0 when hyperthreading is enabled on the system. For example, on a 128-node cluster with two 40-core Intel® Xeon® Platinum 8380 processors per node, this means two MPI processes per node with 40 OpenMP threads each; run the executable as follows:

    #> export OMP_NUM_THREADS=40; export KMP_AFFINITY=granularity=fine,compact,1,0; mpiexec.hydra --genvall -n 256 --ppn 2 -f ${nodefile} ./bin/xhpcg_avx512 --nx=192 --ny=192 --nz=192 --run-real-ref=1

    For 4th generation Intel® Xeon® Scalable processors and later, the Intel® AVX-512 optimized versions often perform best with one MPI process per NUMA node (or per die on each processor socket) and two OpenMP* threads per core, using the simultaneous multithreading (SMT) threads. We recommend setting the affinity to KMP_AFFINITY=granularity=fine,compact. For example, on a 128-node cluster with two 56-core Intel® Xeon® Platinum 8480+ processors per node, eight MPI processes per node give each process 14 cores (28 SMT threads); run the executable as follows:

    #> export OMP_NUM_THREADS=28; export KMP_AFFINITY=granularity=fine,compact; mpiexec.hydra --genvall -n 1024 --ppn 8 -f ${nodefile} ./bin/xhpcg_avx512 --nx=192 --ny=192 --nz=192 --run-real-ref=1

  6. When the benchmark completes execution, which usually takes a few minutes, find the YAML file with the official results in the current directory; a quick way to extract the rating is shown after this list. The performance rating of the benchmarked system is reported in the last section of the file:

    HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
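
For step 1, a minimal sketch of unpacking the package onto a shared file system. The archive name and target directory below are placeholders; substitute the actual file name of your download and a path that every compute node can reach.

    #> tar -xzf <hpcg_package>.tgz -C /shared/benchmarks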
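
For step 3, one way to check which instruction set extensions your processors report before choosing a prebuilt binary, assuming a Linux node (xhpcg_avx512 is the name used in the examples above; other file names under bin/ may differ in your package):

    #> grep -o -E 'avx512f|avx2' /proc/cpuinfo | sort -u
    #> ls ./bin/xhpcg_*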
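
For step 4, a sketch of setting up the oneMKL, compiler, and MPI runtime environments, assuming a default oneAPI installation under /opt/intel/oneapi. The top-level setvars.sh script sources the component vars.sh scripts for you:

    #> source /opt/intel/oneapi/setvars.sh

or, component by component:

    #> source /opt/intel/oneapi/mkl/latest/env/vars.sh
    #> source /opt/intel/oneapi/compiler/latest/env/vars.sh
    #> source /opt/intel/oneapi/mpi/latest/env/vars.sh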
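
For step 6, a quick way to pull the rating out of the results file once the run finishes, assuming the file is written to the current directory with a .yaml extension:

    #> grep 'GFLOP/s rating' ./*.yaml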

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Notice revision #20201201