Improve Performance and Stability with Intel® MPI Library on InfiniBand*

ID 659585
Updated 8/14/2020
Version Latest
Public

Overview

Intel® MPI Library 2019 uses libfabric exclusively for communication. The libfabric infrastructure is built around providers, each of which implements message transfer for a particular class of hardware. The MLX provider implements message transfer over InfiniBand* hardware through the Mellanox UCX* framework.

Rationale

Stability and performance on InfiniBand* were suboptimal in the initial and early update releases of Intel® MPI Library 2019. The MLX provider in libfabric addresses these concerns.

Availability

The MLX provider is available in Intel® MPI Library 2019 Update 5 for Linux* as a technical preview, and as a full feature in Intel® MPI Library 2019 Update 6 for Linux*.

Requirements

  • Intel® MPI Library 2019 Update 5 or higher
  • Mellanox UCX* Framework v1.4 or higher
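
To confirm the installed UCX* version, you can query the ucx_info utility (this assumes UCX is already installed and on your PATH):

    $ ucx_info -v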

Basic Usage

Ensure you are using the libfabric version provided with Intel® MPI Library. In Intel® MPI Library 2019 Update 5, the MLX provider is a technical preview and is not selected by default. To enable it, set FI_PROVIDER=mlx.
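
For example, a minimal sketch of enabling the provider and launching a run under Update 5, assuming the Intel MPI Library environment is already sourced and using a hypothetical application binary ./my_app:

    $ export FI_PROVIDER=mlx
    $ mpirun -n 4 ./my_app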

Intel® MPI Library 2019 Update 6 and later selects the MLX provider by default when InfiniBand* hardware is detected at runtime.
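
To confirm which provider was selected at runtime, one option is to raise the debug level; the exact wording varies between releases, but the startup output typically names the libfabric provider. The output line shown here is illustrative, and ./my_app is a placeholder binary:

    $ I_MPI_DEBUG=1 mpirun -n 2 ./my_app
    [0] MPI startup(): libfabric provider: mlx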

Performance Tuning Options

  • I_MPI_COLL_EXTERNAL: set to 1 to enable external collective operations (HCOLL). Reference: I_MPI_ADJUST Family Environment Variables.
  • Autotuner: automatically tunes the application at the beginning of the run. Reference: Autotuning.
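
As an illustration, both options can be set as environment variables before launch. The autotuner is typically switched on with I_MPI_TUNING_MODE=auto; check the Autotuning reference above for the variant that matches your release. The binary name ./my_app is a placeholder:

    $ export I_MPI_COLL_EXTERNAL=1     # enable HCOLL-based external collectives
    $ export I_MPI_TUNING_MODE=auto    # enable the autotuner for this run
    $ mpirun -n 4 ./my_app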

Limitations

  • Dynamic process management is not yet supported as of Intel® MPI Library 2019 Update 6. Support will be implemented in a future release.
  • Older InfiniBand* hardware does not support all of the expected transports. To check which transports are available:
    $ ucx_info -d | grep Transport

    Output should include the dc, rc, and ud transports. On older hardware, the dc transport will likely be missing. As a workaround, set the following (see the combined sketch after this list):

    UCX_TLS=rc,ud,sm,self

    If none of the required transports are present, this usually points to a driver misconfiguration, missing libraries, or other fabric software problems. Verify your InfiniBand* software and hardware configuration with one of the following commands:

    $ ibv_devinfo
    $ lspci | grep Mellanox
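
Putting the workaround together, a possible sequence on older hardware (again with ./my_app as a placeholder binary) is:

    $ ucx_info -d | grep Transport     # confirm which transports are available
    $ export UCX_TLS=rc,ud,sm,self     # exclude dc on hardware that lacks it
    $ mpirun -n 4 ./my_app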