Intel® MPI Library Developer Guide for Linux* OS

ID 768728
Date 12/16/2022
Public

Job Schedulers Support

The Intel® MPI Library supports the majority of commonly used job schedulers in the HPC field.

The following job schedulers are supported on Linux* OS:

  • Altair* PBS Pro*
  • Torque*
  • OpenPBS*
  • IBM* Platform LSF*
  • Parallelnavi* NQS*
  • SLURM*
  • Univa* Grid Engine*

The Hydra process manager detects job schedulers automatically by checking specific environment variables. It uses these variables to determine how many nodes were allocated, which nodes they are, and the number of processes per task.
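As a rough illustration of this kind of environment-based detection, the following sketch reports which scheduler environment the current shell appears to be in. It checks only the variables named in the scheduler sections of this guide. The function name detect_scheduler, the short labels it prints, and the order of the checks are assumptions made for illustration; this is not Hydra's actual detection logic.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Intel MPI Library): guess the job
# scheduler from the environment variables named in this guide.
# The order of the checks is an assumption.
detect_scheduler() {
    if [ "${PBS_ENVIRONMENT:-}" = "PBS_BATCH" ] ||
       [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ]; then
        echo "pbs"
    elif [ -n "${LSB_MCPU_HOSTS:-}" ] && [ -n "${LSF_BINDIR:-}" ]; then
        echo "lsf"
    elif [ -n "${SLURM_JOBID:-}" ]; then
        echo "slurm"
    elif [ -n "${PE_HOSTFILE:-}" ]; then
        echo "sge"
    elif [ -n "${ENVIRONMENT:-}" ] && [ -n "${QSUB_REQID:-}" ] &&
         [ -n "${QSUB_NODEINF:-}" ]; then
        echo "nqs"
    else
        echo "none"
    fi
}

echo "Detected scheduler: $(detect_scheduler)"
```

Running this inside a job allocation can help confirm that the variables the process manager looks for are actually present before you start debugging a launch problem.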

Altair PBS Pro*, Torque*, and OpenPBS*

If you use one of these job schedulers and $PBS_ENVIRONMENT is set to PBS_BATCH or PBS_INTERACTIVE, mpirun uses $PBS_NODEFILE as the machine file. You do not need to specify the -machinefile option explicitly.

The following is an example of a batch job script:

#PBS -l nodes=4:ppn=4
#PBS -q queue_name
cd $PBS_O_WORKDIR
mpirun -n 16 ./myprog
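The process count in a job script like this one does not have to be hard-coded. The sketch below derives it from the node file instead. It assumes that the node file contains one line per process slot, which depends on the scheduler and on how resources were requested; the helper name count_slots is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count the non-empty lines in a PBS node file.
# Assumption: the node file lists one line per process slot.
count_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Inside a batch job, this could replace the hard-coded process count:
#   mpirun -n "$(count_slots "$PBS_NODEFILE")" ./myprog
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "${PBS_NODEFILE:-}" ]; then
    echo "Slots in node file: $(count_slots "$PBS_NODEFILE")"
fi
```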

IBM Platform LSF*

The IBM Platform LSF* job scheduler is detected automatically if the $LSB_MCPU_HOSTS and $LSF_BINDIR environment variables are set.

The Hydra process manager uses these variables to determine how many nodes were allocated, which nodes they are, and the number of processes per task. To run processes on the remote nodes, the Hydra process manager uses the blaunch utility by default. This utility is provided by IBM Platform LSF.

The number of processes, the number of processes per node, and node names may be overridden by the usual Hydra options (-n, -ppn, -hosts).

Examples:

bsub -n 16 mpirun ./myprog
bsub -n 16 mpirun -n 2 -ppn 1 ./myprog
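To see what an allocation provides before overriding it, you can inspect $LSB_MCPU_HOSTS inside the job. The sketch below sums the slot counts, assuming the variable holds alternating host names and slot counts (for example, "hostA 8 hostB 8"). The helper name lsf_total_slots is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sum the slot counts in an LSB_MCPU_HOSTS-style string.
# Assumption: the string alternates host names and slot counts,
# for example "hostA 8 hostB 8".
lsf_total_slots() {
    # Left unquoted on purpose so the string splits into separate words.
    set -- $1
    total=0
    while [ "$#" -ge 2 ]; do
        total=$((total + $2))
        shift 2
    done
    echo "$total"
}

echo "Slots in this allocation: $(lsf_total_slots "${LSB_MCPU_HOSTS:-}")"
```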

Parallelnavi NQS*

If you use the Parallelnavi NQS job scheduler and the $ENVIRONMENT, $QSUB_REQID, and $QSUB_NODEINF environment variables are set, the $QSUB_NODEINF file is used as the machine file for mpirun. Also, the process manager uses /usr/bin/plesh as the remote shell during startup.

Slurm*

The Slurm job scheduler can be detected automatically by mpirun and mpiexec. Job scheduler detection is enabled in mpirun by default and enabled in mpiexec if hostnames are not specified. The only prerequisite is setting I_MPI_PIN_RESPECT_CPUSET=0.

For autodetection, the Hydra process manager uses these environment variables:

  • SLURM_JOBID
  • SLURM_NODELIST
  • SLURM_NNODES
  • SLURM_NTASKS_PER_NODE or SLURM_NTASKS
  • SLURM_CPUS_PER_TASK

Using these variables, Hydra can determine which nodes are available, how many nodes were allocated, the number of MPI processes per node, and the domain size per MPI process. SLURM_NTASKS_PER_NODE, or alternatively SLURM_NTASKS/SLURM_NNODES, implicitly specifies I_MPI_PERHOST. The value of SLURM_CPUS_PER_TASK implicitly defines I_MPI_PIN_DOMAIN and overrides the "auto" default. If some of the Slurm variables are not defined, the corresponding Intel MPI defaults are used. With this environment detection, the following simple command line is sufficient under Slurm:

export I_MPI_PIN_RESPECT_CPUSET=0; mpirun ./myprog
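The mapping from Slurm variables to implicit settings described above can be sketched as two small shell functions. This is only an illustration of the rules stated in this section, not the library's implementation. The function names are hypothetical, and the use of integer division for SLURM_NTASKS/SLURM_NNODES is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers that mirror the rules described in this section.

# Per-host process count implied by the Slurm environment.
# Assumption: SLURM_NTASKS / SLURM_NNODES uses integer division.
slurm_implied_perhost() {
    if [ -n "${SLURM_NTASKS_PER_NODE:-}" ]; then
        echo "$SLURM_NTASKS_PER_NODE"
    elif [ -n "${SLURM_NTASKS:-}" ] && [ -n "${SLURM_NNODES:-}" ] &&
         [ "${SLURM_NNODES}" -gt 0 ]; then
        echo $(( ${SLURM_NTASKS} / ${SLURM_NNODES} ))
    else
        echo "unset"   # the Intel MPI default applies
    fi
}

# Pin domain implied by the Slurm environment ("auto" when nothing is set).
slurm_implied_pin_domain() {
    echo "${SLURM_CPUS_PER_TASK:-auto}"
}

echo "Implied per-host count: $(slurm_implied_perhost)"
echo "Implied pin domain:     $(slurm_implied_pin_domain)"
```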

This approach works in standard situations with simple Slurm pinning (for example, when only the Slurm flag --cpus-per-task is used). If a Slurm job requires a more complicated pinning setup (using the Slurm flag --cpu-bind), the process pinning may be incorrect. In this case, you can gain full pinning control by launching the MPI run with srun, or you can enable Intel MPI Library pinning by setting the I_MPI_PIN_RESPECT_CPUSET=0 environment variable (see the Developer Reference, “Process Pinning” and “Environmental Variables for Process Pinning”). When using mpirun, the required pinning has to be explicitly replicated using I_MPI_PIN_DOMAIN.

If the Slurm job scheduler is not detected automatically, you can set I_MPI_HYDRA_RMK=slurm or I_MPI_HYDRA_BOOTSTRAP=slurm (see the Developer Reference, “Hydra Environment Variables”).

To run processes on the remote nodes, Hydra uses the srun utility. These environment variables control which utility is used in this case (see the Developer Reference, “Hydra Environment Variables”):

  • I_MPI_HYDRA_BOOTSTRAP
  • I_MPI_HYDRA_BOOTSTRAP_EXEC
  • I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS

You can also launch applications with the srun utility without Hydra by setting the I_MPI_PMI_LIBRARY environment variable (see the Developer Reference, “Other Environment Variables”).

PMI versions currently supported are PMI-1 and PMI-2.

By default, the Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.

Examples:

# Allocate nodes.
salloc --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>

# Run your application using Hydra.
mpiexec ./myprog
# or
mpirun ./myprog

# Run your application using srun with the PMI-1 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi.so>/libpmi.so srun ./myprog

# Run your application using srun with the PMI-2 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so srun --mpi=pmi2 ./myprog

# Change per-host process placement.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 ./myprog

# Change per-host process placement and hostnames and use srun utility for remote launch.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 -hosts host3,host1 -bootstrap=slurm ./myprog

# Use Intel MPI Library pinning.
I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

# Use the --cpus-per-task Slurm option in Intel MPI Library pinning.
# salloc executes the words after its options as a command, so the variable is set through env.
salloc --cpus-per-task=<cpus-per-task> --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node> env I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
# or
I_MPI_PIN_DOMAIN=${SLURM_CPUS_PER_TASK} I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

Univa Grid Engine*

If you use the Univa Grid Engine job scheduler and the $PE_HOSTFILE environment variable is set, two files are generated: /tmp/sge_hostfile_${username}_$$ and /tmp/sge_machifile_${username}_$$. The latter is used as the machine file for mpirun. These files are removed when the job is completed.
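To get a feel for what such a machine file could contain, the sketch below converts a $PE_HOSTFILE into host:slots lines. It assumes that each line of the file starts with a host name followed by a slot count, and that the machine file uses the host:slots form. The files that mpirun actually generates may differ, and the helper name pe_hostfile_to_machinefile is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "host:slots" lines from a Grid Engine PE host file.
# Assumption: each line begins with "<hostname> <slots>", followed by
# further fields that are ignored here.
pe_hostfile_to_machinefile() {
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

if [ -n "${PE_HOSTFILE:-}" ] && [ -r "${PE_HOSTFILE:-}" ]; then
    pe_hostfile_to_machinefile "$PE_HOSTFILE"
fi
```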

Intercepting SIGINT and SIGTERM Signals

If resources allocated to a job exceed the limit, most job schedulers terminate the job by sending a signal to all processes.

For example, Torque* sends SIGTERM to a job three times; if the job is still alive, it sends SIGKILL to terminate it.

For Univa Grid Engine, the default signal to terminate a job is SIGKILL. The Intel MPI Library is unable to process or catch that signal, causing mpirun to kill the entire job. You can change the termination signal through the following queue configuration steps:

  1. Use the following command to see available queues:
    $ qconf -sql
  2. Execute the following command to modify the queue settings:
    $ qconf -mq <queue_name>
  3. Find terminate_method and change the signal to SIGTERM.
  4. Save the queue configuration.
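After saving, you may want to confirm the new setting. The sketch below extracts the terminate_method value from a queue configuration listing read on standard input. It assumes that the listing (for example, from qconf -sq) contains one "name value" pair per line; the helper name queue_terminate_method is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the terminate_method value from a queue
# configuration listing supplied on standard input.
# Assumption: the listing has one "name value" pair per line.
queue_terminate_method() {
    awk '$1 == "terminate_method" { print $2 }'
}

# Possible usage on a Grid Engine host (not executed here):
#   qconf -sq <queue_name> | queue_terminate_method
```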

Controlling Per-Host Process Placement

When using a job scheduler, by default the Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), use the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT environment variable:

$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off