
Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 3/22/2024
Public

Overview of the Intel Optimized HPCG

The Intel® Optimized High Performance Conjugate Gradient Benchmark (Intel® Optimized HPCG) provides CPU- and GPU-optimized implementations of the HPCG benchmark (http://hpcg-benchmark.org). The CPU version is optimized for Intel® Xeon® processors with Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) support. The GPU version is optimized for the Intel® Data Center GPU Max Series.

The HPCG Benchmark is intended to complement the High Performance LINPACK benchmark used in the TOP500 (http://www.top500.org) system ranking by providing a metric that better aligns with a broader set of important cluster applications.

The HPCG benchmark implementation is based on a regular 27-point discretization of an elliptic partial differential equation in three dimensions (3D). The 3D domain is scaled to fill a 3D virtual process grid across all available MPI ranks. The HPCG benchmark uses the preconditioned conjugate gradient (CG) method to solve the intermediate systems of equations and incorporates a local symmetric Gauss-Seidel preconditioning step that requires a triangular forward solve and a backward solve. A synthetic multigrid V-cycle is applied in each preconditioning step to make the benchmark more representative of real-world applications. Sparse matrix-vector multiplication is performed locally, preceded by a halo exchange between neighboring processes. The benchmark exhibits the irregular memory accesses and fine-grain recursive computations that dominate many scientific workloads.
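The solver structure described above can be illustrated with a minimal single-process sketch in plain Python. It assumes the stencil values used by the HPCG reference code (26 on the diagonal, -1 for each neighbor) and a right-hand side built from an exact solution of all ones. The multigrid V-cycle, the MPI process grid, and the halo exchange are left out, and the function names `build_27pt`, `spmv`, `symgs`, and `pcg` are illustrative, not taken from the benchmark sources.

```python
import math

def build_27pt(nx, ny, nz):
    """Rows of (column, value) pairs for a 27-point stencil on an nx*ny*nz grid."""
    def idx(i, j, k):
        return (k * ny + j) * nx + i
    rows = []
    for k in range(nz):
        for j in range(ny):
            for i in range(nx):
                row = []
                for dk in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for di in (-1, 0, 1):
                            ii, jj, kk = i + di, j + dj, k + dk
                            if 0 <= ii < nx and 0 <= jj < ny and 0 <= kk < nz:
                                center = di == 0 and dj == 0 and dk == 0
                                row.append((idx(ii, jj, kk), 26.0 if center else -1.0))
                rows.append(row)
    return rows

def spmv(rows, x):
    """Sparse matrix-vector product y = A x."""
    return [sum(v * x[c] for c, v in row) for row in rows]

def symgs(rows, r):
    """One symmetric Gauss-Seidel step from a zero guess: forward, then backward sweep."""
    n = len(r)
    x = [0.0] * n
    for sweep in (range(n), range(n - 1, -1, -1)):
        for i in sweep:
            diag, s = 0.0, r[i]
            for c, v in rows[i]:
                if c == i:
                    diag = v
                else:
                    s -= v * x[c]
            x[i] = s / diag
    return x

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pcg(rows, b, tol=1e-8, max_iter=50):
    """Preconditioned CG with symgs as the preconditioner; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)
    z = symgs(rows, r)
    p = list(z)
    rz = dot(r, z)
    norm0 = math.sqrt(dot(r, r))
    for it in range(1, max_iter + 1):
        Ap = spmv(rows, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= tol * norm0:
            return x, it
        z = symgs(rows, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Small demonstration: exact solution of all ones on a 4 x 4 x 4 grid.
A = build_27pt(4, 4, 4)
b = spmv(A, [1.0] * len(A))
x, iters = pcg(A, b)
print("iterations:", iters, "max error:", max(abs(xi - 1.0) for xi in x))
```

The forward and backward sweeps in `symgs` are the two triangular solves mentioned above. Each sweep depends on values updated earlier in the same sweep, which is the fine-grain recursive dependency that makes this kernel hard to parallelize.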

Intel® CPU Optimized HPCG Benchmark

The Intel® Optimized HPCG for CPUs benchmark contains the source code of the HPCG v3.1 reference implementation, modified to include:

  • Intel® architecture optimizations

  • Prebuilt benchmark executables that link to Intel® oneAPI Math Kernel Library (oneMKL)

  • Inspector-executor Sparse BLAS kernels optimized for the Intel AVX2 and Intel AVX-512 instruction sets:

    • Sparse matrix-vector multiplication (SpMV)

    • Sparse triangular solve (SpTRSV)

    • Symmetric Gauss-Seidel smoother (SYMGS)

Use this package to evaluate the performance of distributed-memory systems based on any generation of the Intel® Xeon® processor E3, Intel® Xeon® processor E5, Intel® Xeon® processor E7, and Intel® Xeon® Scalable processor families.

The oneMKL Inspector-executor Sparse BLAS kernels SpMV, SpTRSV, and SYMGS follow an inspector-executor model. The inspection step chooses the best algorithm for the input matrix and converts the matrix to a special internal representation, so that the execution step can achieve high performance.
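The two-phase idea can be sketched with a toy class in plain Python. This is not the oneMKL API (in the C interface the corresponding steps are routines such as `mkl_sparse_optimize` and `mkl_sparse_d_mv`). The class name `ToySparseHandle`, the padding threshold, and the choice of a padded ELLPACK-style layout as the "internal representation" are all assumptions made for illustration; the actual internal formats used by oneMKL are not described here.

```python
class ToySparseHandle:
    """Hypothetical inspector-executor handle over a CSR matrix (illustration only)."""

    def __init__(self, row_ptr, col_idx, values):
        self.row_ptr, self.col_idx, self.values = row_ptr, col_idx, values
        self.ell = None  # filled in by inspect() when a padded layout looks worthwhile

    def inspect(self, max_padding_ratio=2.0):
        """Inspection step: examine the matrix once and pick a storage layout."""
        n = len(self.row_ptr) - 1
        lengths = [self.row_ptr[i + 1] - self.row_ptr[i] for i in range(n)]
        width = max(lengths, default=0)
        nnz = len(self.values)
        # Assumed heuristic: pad rows to equal width unless that wastes too much storage.
        if nnz == 0 or width * n > max_padding_ratio * nnz:
            self.ell = None
            return "csr"
        cols, vals = [], []
        for i in range(n):
            start, end = self.row_ptr[i], self.row_ptr[i + 1]
            pad = width - (end - start)
            cols.append(self.col_idx[start:end] + [0] * pad)    # padded entries point at column 0
            vals.append(self.values[start:end] + [0.0] * pad)   # and contribute zero
        self.ell = (width, cols, vals)
        return "ell"

    def execute_mv(self, x):
        """Execution step: y = A x using whichever layout inspect() selected."""
        n = len(self.row_ptr) - 1
        if self.ell is None:
            return [
                sum(self.values[k] * x[self.col_idx[k]]
                    for k in range(self.row_ptr[i], self.row_ptr[i + 1]))
                for i in range(n)
            ]
        width, cols, vals = self.ell
        return [sum(vals[i][k] * x[cols[i][k]] for k in range(width)) for i in range(n)]


# Tridiagonal 3 x 3 example: [[4, -1, 0], [-1, 4, -1], [0, -1, 4]]
handle = ToySparseHandle([0, 2, 5, 7],
                         [0, 1, 0, 1, 2, 1, 2],
                         [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0])
layout = handle.inspect()                  # paid once
y = handle.execute_mv([1.0, 2.0, 3.0])     # repeated many times in an iterative solver
print(layout, y)
```

The design point is that the inspection cost is paid once, while an iterative solver such as CG calls the execution step in every iteration, so the up-front analysis is amortized over many calls.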

Intel® GPU Optimized HPCG Benchmark

The Intel® GPU Optimized HPCG benchmark contains the source code of the HPCG v3.1 reference implementation, modified to include:

  • Use of the SYCL and C++ languages for efficient host and device scheduling of kernels

  • Intel® GPU architecture optimizations

  • A symmetric permutation of the sparse matrix to enable more task parallelism in some of the key computation kernels, such as the Symmetric Gauss-Seidel smoother

  • Conversion of the sparse matrix to the ELLPACK Sparse Block (ESB) matrix format for more efficient vectorizable loads on the GPU hardware

  • Core computation kernels written in SYCL, using the "Explicit SIMD" (ESIMD) SYCL extension for lower-level Intel GPU programming:

    • Sparse matrix-vector multiplication (SpMV)

    • Sparse triangular solve (SpTRSV)

    • Symmetric Gauss-Seidel smoother (SYMGS)

Use this package to evaluate the performance of distributed-memory systems based on the Intel® Data Center GPU Max Series family.
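To see why a symmetric permutation can expose parallelism in a Gauss-Seidel sweep, consider the following plain-Python sketch. It assumes a greedy graph coloring as the source of the permutation, which is one common technique; this guide does not state which permutation the Intel implementation uses. The helper names `greedy_coloring`, `color_permutation`, and `permute_symmetric` are hypothetical.

```python
def greedy_coloring(adj):
    """Give each row the smallest color not already used by a coupled row."""
    colors = [-1] * len(adj)
    for v in range(len(adj)):
        used = {colors[u] for u in adj[v] if u != v and colors[u] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def color_permutation(colors):
    """perm[new] = old, listing all rows of color 0 first, then color 1, and so on."""
    return sorted(range(len(colors)), key=lambda v: colors[v])

def permute_symmetric(rows, perm):
    """Apply the same permutation to rows and columns (P A P^T); rows are {col: val} dicts."""
    inv = {old: new for new, old in enumerate(perm)}
    return [{inv[c]: v for c, v in rows[old].items()} for old in perm]


# Tridiagonal 6 x 6 example: each row is coupled only to its immediate neighbors.
n = 6
rows = [{j: (2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < n}
        for i in range(n)]
adj = [[j for j in row if j != i] for i, row in enumerate(rows)]

colors = greedy_coloring(adj)
perm = color_permutation(colors)
permuted = permute_symmetric(rows, perm)
print(colors, perm)
```

Rows that share a color have no couplings among themselves, so after the permutation each color forms a block whose rows can be updated concurrently within a sweep. Only the blocks themselves must be processed in order.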
