Choosing the Best HPCG Configuration for GPUs

Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690

Date 3/22/2024

Choosing the Best HPCG Configuration for GPUs

The performance of the Intel GPU Optimized HPCG depends on many system parameters, including (but not limited to) the hardware configuration of the host node, the device or devices attached to that node, and the MPI implementation used. To get the best performance for a specific system configuration, choose a combination of these parameters:

The number of MPI processes per host node (defining work on the host + an attached device)
The number of OpenMP* threads per MPI process for the reference code used to validate the benchmark
The local problem size

On Intel® Data Center GPU Max Series GPUs, we recommend using one MPI process per tile with a large local problem size. On modern GPUs, the last level cache (LLC) per tile can be either extremely large or quite small, and the device memory can be quite limited. To comply with current HPCG benchmark requirements, the local problem size (nx × ny × nz) should therefore be large enough that a benchmark vector (each vector occupies nx*ny*nz*sizeof(double) bytes) does not fit completely in the LLC of the device, but not so large that the full benchmark system no longer fits in device memory.
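
The two bounds on the local problem size can be checked with simple arithmetic before launching a run. The following Python sketch is only an illustration and is not part of the benchmark: the function names, the LLC and device memory sizes, and the per-row memory estimate (`bytes_per_row_estimate`) are all hypothetical placeholders. Replace them with values measured or documented for your own system. The sketch assumes sizeof(double) is 8 bytes.

```python
SIZEOF_DOUBLE = 8  # bytes; assumed size of a C double


def vector_bytes(nx, ny, nz):
    """Size in bytes of one benchmark vector for a local nx x ny x nz grid."""
    return nx * ny * nz * SIZEOF_DOUBLE


def check_local_problem(nx, ny, nz, llc_bytes, device_mem_bytes,
                        bytes_per_row_estimate):
    """Check both bounds on the local problem size.

    bytes_per_row_estimate is a hypothetical, user-supplied estimate of the
    total device memory the benchmark needs per grid point (matrix, vectors,
    and coarse levels). It is not a value taken from the benchmark itself.
    """
    rows = nx * ny * nz
    vec = vector_bytes(nx, ny, nz)
    return {
        "vector_bytes": vec,
        # Lower bound: one vector must not fit completely in the LLC.
        "exceeds_llc": vec > llc_bytes,
        # Upper bound: the whole benchmark system must fit in device memory.
        "fits_device_memory": rows * bytes_per_row_estimate <= device_mem_bytes,
    }


if __name__ == "__main__":
    # Hypothetical per-tile figures, chosen only to exercise both bounds.
    llc = 200 * 2**20        # 200 MiB of last level cache
    device_mem = 64 * 2**30  # 64 GiB of device memory
    per_row = 1000           # assumed bytes of benchmark data per grid point

    for n in (256, 384, 512):
        print(n, check_local_problem(n, n, n, llc, device_mem, per_row))
```

With these placeholder figures, a 256-cubed local problem yields a vector that still fits in the LLC, a 512-cubed problem exceeds the assumed device memory, and a 384-cubed problem satisfies both bounds. Actual limits depend on the device and on the memory footprint of the benchmark build you run.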

Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Notice revision #20201201
