Getting Started with Intel Optimized HPCG
- On a cluster file system, unpack the Intel Optimized HPCG package to a directory accessible by all nodes. Read and accept the license as indicated in the readme.txt file included in the package.
- Change the directory to hpcg/bin.
- Determine the prebuilt version of the benchmark that is best for your system, or follow the QUICKSTART instructions to build a version of the benchmark for your MPI implementation.
- Ensure that the Intel® oneAPI Math Kernel Library, Intel C/C++ Compiler, and MPI run-time environments have been set up properly. You can do this using the scripts vars.sh, compilervars.sh, and mpivars.sh that are included in those distributions.
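A minimal sketch of that environment setup, assuming a default installation layout (the install prefixes below are assumptions; adjust them to match your system):

```shell
# Source the environment scripts before running the benchmark.
# Paths are illustrative defaults, not guaranteed locations.
source /opt/intel/mkl/bin/vars.sh              # Intel oneAPI Math Kernel Library
source /opt/intel/bin/compilervars.sh intel64  # Intel C/C++ Compiler
source /opt/intel/impi/intel64/bin/mpivars.sh  # Intel MPI run-time
```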
- Run the chosen version of the benchmark.
- The Intel AVX and Intel AVX2 optimized versions perform best with one MPI process per socket and one OpenMP* thread per core, skipping simultaneous multithreading (SMT) threads: set the affinity as KMP_AFFINITY=granularity=fine,compact,1,0. Specifically, for a 128-node cluster with two Intel® Xeon® Processor E5-2697 v4 CPUs per node, run the executable as follows:

  #> mpiexec.hydra -n 256 -ppn 2 env OMP_NUM_THREADS=18 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_avx2 -n192
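The launch parameters above follow directly from the cluster topology. A sketch of the arithmetic, using the figures stated in this step (128 nodes, 2 sockets per node, 18 cores per socket on the E5-2697 v4):

```shell
# Derive the mpiexec.hydra parameters for the AVX2 example.
NODES=128
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=18

RANKS=$((NODES * SOCKETS_PER_NODE))   # one MPI process per socket -> -n 256
PPN=$SOCKETS_PER_NODE                 # processes per node        -> -ppn 2
THREADS=$CORES_PER_SOCKET             # one thread per core, SMT skipped -> OMP_NUM_THREADS=18

echo "mpiexec.hydra -n $RANKS -ppn $PPN env OMP_NUM_THREADS=$THREADS ..."
```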
- The Intel® Xeon® Phi processor optimized version performs best with four MPI processes per processor and two threads for each processor core, with SMT turned on. Specifically, for a 128-node cluster with one Intel® Xeon® Phi processor 7250 per node, run the executable in this manner:

  #> mpiexec.hydra -n 512 -ppn 4 env OMP_NUM_THREADS=34 KMP_AFFINITY=granularity=fine,compact,1,0 ./bin/xhpcg_knl -n160
- When the benchmark completes execution, which usually takes a few minutes, find the YAML file with the official results in the current directory. The performance rating of the benchmarked system is reported in the last section of the file:

  HPCG result is VALID with a GFLOP/s rating of: [GFLOP/s]
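To pull just the rating out of the result file, the final line can be filtered with standard tools. A hedged sketch (the sample line below stands in for the real file; in practice, grep the generated YAML file instead):

```shell
# Extract the numeric rating from the result line quoted above.
# On a real run, replace the echo with: grep 'HPCG result is VALID' <result>.yaml
LINE="HPCG result is VALID with a GFLOP/s rating of: 1234.56"
RATING=$(echo "$LINE" | sed 's/.*rating of: //')
echo "$RATING"
```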