I_MPI_ADJUST Family Environment Variables
I_MPI_ADJUST_<opname>
Control collective operation algorithm selection.
Syntax
I_MPI_ADJUST_<opname>="<presetid>[:<conditions>][;<presetid>:<conditions>[...]]"
Arguments
<presetid> | Preset identifier
---|---
>= 0 | Set a number to select the desired algorithm. The value 0 uses the basic logic of collective algorithm selection.
<conditions> | A comma-separated list of conditions. An empty list selects all message sizes and process combinations.
<l> | Messages of size <l>
<l>-<m> | Messages of size from <l> to <m>
<l>@<p> | Messages of size <l> and number of processes <p>
<l>-<m>@<p>-<q> | Messages of size from <l> to <m> and number of processes from <p> to <q>
Description
Set this environment variable to select the desired algorithm(s) for the collective operation
under particular conditions. Each collective operation has its own environment variable and algorithms.
Environment Variable | Collective Operation | Algorithms
---|---|---
I_MPI_ADJUST_ALLGATHER | MPI_Allgather |
I_MPI_ADJUST_ALLGATHERV | MPI_Allgatherv |
I_MPI_ADJUST_ALLREDUCE | MPI_Allreduce |
I_MPI_ADJUST_ALLTOALL | MPI_Alltoall |
I_MPI_ADJUST_ALLTOALLV | MPI_Alltoallv |
I_MPI_ADJUST_ALLTOALLW | MPI_Alltoallw | Isend/Irecv + Waitall
I_MPI_ADJUST_BARRIER | MPI_Barrier |
I_MPI_ADJUST_BCAST | MPI_Bcast |
I_MPI_ADJUST_EXSCAN | MPI_Exscan |
I_MPI_ADJUST_GATHER | MPI_Gather |
I_MPI_ADJUST_GATHERV | MPI_Gatherv |
I_MPI_ADJUST_REDUCE_SCATTER | MPI_Reduce_scatter |
I_MPI_ADJUST_REDUCE | MPI_Reduce |
I_MPI_ADJUST_SCAN | MPI_Scan |
I_MPI_ADJUST_SCATTER | MPI_Scatter |
I_MPI_ADJUST_SCATTERV | MPI_Scatterv |
I_MPI_ADJUST_SENDRECV_REPLACE | MPI_Sendrecv_replace | 1. Generic 2. Uniform (with restrictions)
I_MPI_ADJUST_IALLGATHER | MPI_Iallgather |
I_MPI_ADJUST_IALLGATHERV | MPI_Iallgatherv |
I_MPI_ADJUST_IALLREDUCE | MPI_Iallreduce |
I_MPI_ADJUST_IALLTOALL | MPI_Ialltoall |
I_MPI_ADJUST_IALLTOALLV | MPI_Ialltoallv | Isend/Irecv + Waitall
I_MPI_ADJUST_IALLTOALLW | MPI_Ialltoallw | Isend/Irecv + Waitall
I_MPI_ADJUST_IBARRIER | MPI_Ibarrier | Dissemination
I_MPI_ADJUST_IBCAST | MPI_Ibcast |
I_MPI_ADJUST_IEXSCAN | MPI_Iexscan |
I_MPI_ADJUST_IGATHER | MPI_Igather |
I_MPI_ADJUST_IGATHERV | MPI_Igatherv |
I_MPI_ADJUST_IREDUCE_SCATTER | MPI_Ireduce_scatter |
I_MPI_ADJUST_IREDUCE | MPI_Ireduce |
I_MPI_ADJUST_ISCAN | MPI_Iscan |
I_MPI_ADJUST_ISCATTER | MPI_Iscatter |
I_MPI_ADJUST_ISCATTERV | MPI_Iscatterv | Linear
The message size calculation rules for the collective operations are described in the table below. In that table, "n/a" means that the corresponding interval <l>-<m> should be omitted.
The I_MPI_ADJUST_SENDRECV_REPLACE=2 ("Uniform") algorithm can be used only when the datatype and the object count are the same across all ranks.
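For example, when ranks pass different datatypes or counts to MPI_Sendrecv_replace, you can select the Generic algorithm (preset 1 in the table above) explicitly:
I_MPI_ADJUST_SENDRECV_REPLACE=1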
To get the maximum number (range) of presets available for each collective operation, use the impi_info command:
$ impi_info -v I_MPI_ADJUST_ALLREDUCE
I_MPI_ADJUST_ALLREDUCE
MPI Datatype: MPI_CHAR
Description: Control selection of MPI_Allreduce algorithm presets.
Arguments
<presetid> - Preset identifier
range: 0-27
Collective Function | Message Size Formula
---|---
MPI_Allgather | recv_count*recv_type_size
MPI_Allgatherv | total_recv_count*recv_type_size
MPI_Allreduce | count*type_size
MPI_Alltoall | send_count*send_type_size
MPI_Alltoallv | n/a
MPI_Alltoallw | n/a
MPI_Barrier | n/a
MPI_Bcast | count*type_size
MPI_Exscan | count*type_size
MPI_Gather | recv_count*recv_type_size if MPI_IN_PLACE is used, otherwise send_count*send_type_size
MPI_Gatherv | n/a
MPI_Reduce_scatter | total_recv_count*type_size
MPI_Reduce | count*type_size
MPI_Scan | count*type_size
MPI_Scatter | send_count*send_type_size if MPI_IN_PLACE is used, otherwise recv_count*recv_type_size
MPI_Scatterv | n/a
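As an illustration of these formulas (the numbers are only an example), an MPI_Allreduce call with count=1000 and an 8-byte datatype such as MPI_DOUBLE corresponds to a message size of 1000*8 = 8000 bytes, so a condition range of 0-8192 would cover it:
I_MPI_ADJUST_ALLREDUCE="2:0-8192"
Preset 2 is used here purely for illustration; use impi_info to check the preset range actually available for MPI_Allreduce.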
Examples
Use the following setting to select the second algorithm for the MPI_Reduce operation:
I_MPI_ADJUST_REDUCE=2
Use the following setting to define the algorithms for the MPI_Reduce_scatter operation:
I_MPI_ADJUST_REDUCE_SCATTER="4:0-100,5001-10000;1:101-3200;2:3201-5000;3"
In this case, algorithm 4 is used for message sizes from 0 through 100 bytes and from 5001 through 10000 bytes, algorithm 1 is used for message sizes from 101 through 3200 bytes, algorithm 2 is used for message sizes from 3201 through 5000 bytes, and algorithm 3 is used for all other messages.
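Conditions can also restrict the number of processes with the @ notation. For example (the preset numbers below are illustrative only):
I_MPI_ADJUST_BCAST="1:0-1024@2-16;2:0-1024@17-64"
With this setting, preset 1 handles MPI_Bcast messages of up to 1024 bytes on 2-16 processes, preset 2 handles the same message sizes on 17-64 processes, and the default selection logic applies otherwise.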
I_MPI_ADJUST_<opname>_LIST
Syntax
I_MPI_ADJUST_<opname>_LIST=<presetid1>[-<presetid2>][,<presetid3>][,<presetid4>-<presetid5>]
Description
Set this environment variable to specify the set of algorithms to be considered by the Intel MPI runtime for a specified <opname>. This variable is useful in autotuning scenarios, as well as tuning scenarios where users would like to select a certain subset of algorithms.
Setting an empty string disables autotuning for the <opname> collective.
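For example, the following setting (the preset identifiers are illustrative) limits algorithm selection for MPI_Allreduce to presets 1 through 4 and preset 6:
I_MPI_ADJUST_ALLREDUCE_LIST=1-4,6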
I_MPI_COLL_INTRANODE
Syntax
I_MPI_COLL_INTRANODE=<mode>
Arguments
<mode> | Intranode collectives type
---|---
pt2pt | Use only point-to-point communication-based collectives.
shm | Enable shared memory collectives. This is the default value.
Description
Set this environment variable to switch the intranode communication type for collective operations. If there is a large set of communicators, you can switch off the SHM-collectives to avoid memory overconsumption.
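For example, to fall back to point-to-point based intranode collectives and reduce shared memory consumption:
I_MPI_COLL_INTRANODE=pt2pt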
I_MPI_COLL_INTRANODE_SHM_THRESHOLD
Syntax
I_MPI_COLL_INTRANODE_SHM_THRESHOLD=<nbytes>
Arguments
<nbytes> | Define the maximum data block size processed by shared memory collectives
---|---
> 0 | Use the specified size. The default value is 16384 bytes.
Description
Set this environment variable to define the size of the shared memory area available to each rank for data placement. Messages larger than this value are not processed by the SHM-based collective operation, but are processed by the point-to-point based collective operation instead. The value must be a multiple of 4096.
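For example, to let SHM-based collectives handle data blocks of up to 32768 bytes (a multiple of 4096, twice the default):
I_MPI_COLL_INTRANODE_SHM_THRESHOLD=32768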
I_MPI_COLL_EXTERNAL
Syntax
I_MPI_COLL_EXTERNAL=<arg>
Arguments
<arg> | Description
---|---
enable | yes | on | 1 | Enable the external collective operations functionality using available collectives libraries.
disable | no | off | 0 | Disable the external collective operations functionality. This is the default value.
hcoll | Enable the external collective operations functionality using the HCOLL library.
Description
Set this environment variable to enable external collective operations. To reach better performance, run the autotuner after enabling I_MPI_COLL_EXTERNAL; this process obtains the optimal collectives settings.
To force external collective operations usage, use the following I_MPI_ADJUST_<opname> values: I_MPI_ADJUST_ALLREDUCE=24, I_MPI_ADJUST_BARRIER=11, I_MPI_ADJUST_BCAST=16, I_MPI_ADJUST_REDUCE=13, I_MPI_ADJUST_ALLGATHER=6, I_MPI_ADJUST_ALLTOALL=5, I_MPI_ADJUST_ALLTOALLV=5, I_MPI_ADJUST_SCAN=3, I_MPI_ADJUST_EXSCAN=3, I_MPI_ADJUST_GATHER=5, I_MPI_ADJUST_GATHERV=4, I_MPI_ADJUST_SCATTER=5, I_MPI_ADJUST_SCATTERV=4, I_MPI_ADJUST_ALLGATHERV=5, I_MPI_ADJUST_ALLTOALLW=2, I_MPI_ADJUST_REDUCE_SCATTER=6, I_MPI_ADJUST_REDUCE_SCATTER_BLOCK=4, I_MPI_ADJUST_IALLGATHER=5, I_MPI_ADJUST_IALLGATHERV=5, I_MPI_ADJUST_IGATHERV=3, I_MPI_ADJUST_IALLREDUCE=9, I_MPI_ADJUST_IALLTOALLV=2, I_MPI_ADJUST_IBARRIER=2, I_MPI_ADJUST_IBCAST=5, I_MPI_ADJUST_IREDUCE=4.
For more information on HCOLL tuning, refer to NVIDIA* documentation.
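For example, to enable HCOLL-based collectives and force their use for MPI_Allreduce (the preset value 24 is taken from the list above):
I_MPI_COLL_EXTERNAL=hcoll
I_MPI_ADJUST_ALLREDUCE=24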
I_MPI_COLL_DIRECT
Syntax
I_MPI_COLL_DIRECT=<arg>
Arguments
<arg> | Description
---|---
on | Enable direct collectives. This is the default value.
off | Disable direct collectives.
Description
Set this environment variable to control direct collectives usage. Disable this variable to eliminate OFI* usage for intranode communication when the shm:ofi fabric is used.
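For example, to avoid OFI usage for intranode communication with the shm:ofi fabric (I_MPI_FABRICS is shown only to make the fabric selection explicit):
I_MPI_FABRICS=shm:ofi
I_MPI_COLL_DIRECT=off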
I_MPI_CBWR
Control the reproducibility of floating-point operation results across different platforms, networks, and topologies when the number of processes is the same.
Syntax
I_MPI_CBWR=<arg>
Arguments
<arg> | CBWR compatibility mode | Description
---|---|---
0 | None | Do not use CBWR in a library-wide mode. CNR-safe communicators may be created with MPI_Comm_dup_with_info explicitly. This is the default value.
1 | Weak mode | Disable topology-aware collectives. The result of a collective operation does not depend on the rank placement. The mode guarantees result reproducibility across different runs on the same cluster (independent of the rank placement).
2 | Strict mode | Disable topology-aware collectives and ignore the CPU architecture and interconnect during algorithm selection. The mode guarantees result reproducibility across different runs on different clusters (independent of the rank placement, CPU architecture, and interconnect).
Description
Conditional Numerical Reproducibility (CNR) provides controls for obtaining reproducible floating-point results in collective operations. With this feature, Intel MPI collective operations are designed to return the same floating-point results from run to run as long as the number of MPI ranks is the same.
Control this feature with the I_MPI_CBWR environment variable in a library-wide manner, where all collectives on all communicators are guaranteed to have reproducible results. To control floating-point reproducibility in a more precise, per-communicator way, pass the {"I_MPI_CBWR", "yes"} key-value pair to the MPI_Comm_dup_with_info call.
Setting I_MPI_CBWR in a library-wide mode using the environment variable leads to a performance penalty.
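For example, to request weak-mode reproducibility library-wide for a run (the application name below is a placeholder):
$ export I_MPI_CBWR=1
$ mpirun -n 16 ./your_app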
CNR-safe communicators created using MPI_Comm_dup_with_info always work in the strict mode. For example:
MPI_Info hint;
MPI_Comm cbwr_safe_world, cbwr_safe_copy;
MPI_Info_create(&hint);
MPI_Info_set(hint, "I_MPI_CBWR", "yes");
MPI_Comm_dup_with_info(MPI_COMM_WORLD, hint, &cbwr_safe_world);
MPI_Comm_dup(cbwr_safe_world, &cbwr_safe_copy);
In the example above, both cbwr_safe_world and cbwr_safe_copy are CNR-safe. Use cbwr_safe_world and its duplicates to get reproducible results for critical operations.
Note that MPI_COMM_WORLD itself may be used for performance-critical operations without reproducibility limitations.