Developer Guide for Intel® oneAPI Math Kernel Library Linux*

ID 766690
Date 3/31/2023
Public

Managing Performance of the Cluster Fourier Transform Functions

Performance of Intel® oneAPI Math Kernel Library Cluster FFT (CFFT) in different applications mainly depends on the cluster configuration, performance of message-passing interface (MPI) communications, and configuration of the run. Note that MPI communications usually take approximately 70% of the overall CFFT compute time. For more flexible control over the time-consuming aspects of CFFT algorithms, Intel® oneAPI Math Kernel Library provides the MKL_CDFT environment variable, which sets special values that affect CFFT performance. To improve the performance of an application that calls CFFT intensively, use this environment variable to set optimal values for your cluster, application, MPI, and so on.

The MKL_CDFT environment variable has the following syntax, explained in the table below:

MKL_CDFT=option1[=value1],option2[=value2],…,optionN[=valueN]
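As a minimal sketch of the syntax above, the helper below joins individual options with commas to form the variable's value. The function name build_mkl_cdft is hypothetical and not part of the library; only the comma-separated format comes from this guide.

```shell
# Hypothetical helper: join its arguments with commas, following the
# MKL_CDFT=option1[=value1],option2[=value2],... syntax.
# The body runs in a subshell, so the IFS change does not leak out.
build_mkl_cdft() (
    IFS=,
    printf '%s\n' "$*"
)

MKL_CDFT=$(build_mkl_cdft "alltoallv=1" "wo_omatcopy=0")
export MKL_CDFT
echo "MKL_CDFT=${MKL_CDFT}"
```

Flags that take no value, such as enable_soi, are passed as bare names.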

IMPORTANT:

While this table explains the settings that usually improve performance under certain conditions, the actual performance highly depends on the configuration of your cluster. Therefore, experiment with the listed values to speed up your computations.
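One way to run such an experiment is to launch the application once per combination of values and compare run times. The sketch below is only an illustration: the function names are hypothetical, and the timing relies on date +%s, which has one-second resolution, so for short runs the application's own timers are likely a better measure. The enable_soi flag is left out of the sweep because it also changes numerical precision.

```shell
# Hypothetical helper: print every combination of the documented
# alltoallv and wo_omatcopy values, one setting per line.
list_mkl_cdft_settings() {
    for a in 0 1 4; do
        for w in -1 0 1; do
            echo "alltoallv=${a},wo_omatcopy=${w}"
        done
    done
}

# Hypothetical helper: run the given command once per setting and print
# the setting with the elapsed wall-clock seconds. The command's stdout
# is discarded so that only the report lines are printed.
sweep_mkl_cdft() {
    list_mkl_cdft_settings | while read -r setting; do
        start=$(date +%s)
        MKL_CDFT="$setting" "$@" </dev/null >/dev/null
        end=$(date +%s)
        echo "${setting} $((end - start))s"
    done
}

# Example invocation (assumes the application from this guide's example):
# sweep_mkl_cdft mpirun -ppn 2 -n 16 ./mkl_cdft_app
```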

| Option | Possible Values | Description |
|---|---|---|
| alltoallv | 0 (default) | Configures CFFT to use the standard MPI_Alltoallv function to perform global transpositions. |
| alltoallv | 1 | Configures CFFT to use a series of calls to MPI_Isend and MPI_Irecv instead of the MPI_Alltoallv function. |
| alltoallv | 4 | Configures CFFT to merge global transposition with data movements in the local memory. CFFT performs global transpositions by calling MPI_Isend and MPI_Irecv in this case. Use this value in a hybrid case (MPI + OpenMP), especially when the number of processes per node equals one. |
| wo_omatcopy | 0 | Configures CFFT to perform local FFT and local transpositions separately. CFFT usually performs faster with this value than with wo_omatcopy = 1 if the configuration parameter DFTI_TRANSPOSE has the value of DFTI_ALLOW. See the Intel® oneAPI Math Kernel Library Developer Reference for details. |
| wo_omatcopy | 1 | Configures CFFT to merge local FFT calls with local transpositions. CFFT usually performs faster with this value than with wo_omatcopy = 0 if DFTI_TRANSPOSE has the value of DFTI_NONE. |
| wo_omatcopy | -1 (default) | Enables CFFT to decide which of the two above values to use depending on the value of DFTI_TRANSPOSE. |
| enable_soi | Not applicable | A flag that enables the low-communication Segment Of Interest FFT (SOI FFT) algorithm for one-dimensional complex-to-complex CFFT, which requires fewer MPI communications than the standard nine-step (or six-step) algorithm. |

CAUTION:

While using fewer MPI communications, the SOI FFT algorithm incurs a minor loss of precision (about one decimal digit).
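Because the value is a free-form string, a typo in an option name is easy to miss. The sketch below checks a candidate value against the options and values listed in the table. It assumes the table is exhaustive, which this guide does not state, and the function name validate_mkl_cdft is hypothetical.

```shell
# Hypothetical helper: return 0 if every comma-separated option in $1
# appears in the table above, otherwise print a message and return 1.
# The body runs in a subshell, so IFS and 'set -f' do not leak out.
validate_mkl_cdft() (
    IFS=,
    set -f   # disable globbing while splitting the value on commas
    for opt in $1; do
        case "$opt" in
            alltoallv=0|alltoallv=1|alltoallv=4) ;;
            wo_omatcopy=-1|wo_omatcopy=0|wo_omatcopy=1) ;;
            enable_soi) ;;
            *)
                echo "unrecognized MKL_CDFT option: ${opt}" >&2
                exit 1
                ;;
        esac
    done
)

if validate_mkl_cdft "wo_omatcopy=1,alltoallv=4,enable_soi"; then
    echo "value accepted"
fi
```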

The following example illustrates usage of the environment variable assuming the bash shell:

export MKL_CDFT=wo_omatcopy=1,alltoallv=4,enable_soi
mpirun -ppn 2 -n 16 ./mkl_cdft_app
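For the hybrid case (MPI + OpenMP) with one process per node, for which the table recommends alltoallv=4, a setup might look like the sketch below. The node count and thread count are hypothetical placeholders, and using OMP_NUM_THREADS to set the per-process thread count is an assumption about how the application is threaded. The script only prints the launch line; uncomment the last line to run it.

```shell
# Hypothetical cluster shape: adjust both values to your system.
NODES=8
THREADS_PER_PROCESS=16

# alltoallv=4 merges global transposition with local data movements.
export MKL_CDFT=alltoallv=4
export OMP_NUM_THREADS=$THREADS_PER_PROCESS

# One MPI process per node (-ppn 1), one process for each node (-n NODES).
LAUNCH="mpirun -ppn 1 -n ${NODES} ./mkl_cdft_app"

echo "MKL_CDFT=${MKL_CDFT} OMP_NUM_THREADS=${OMP_NUM_THREADS}"
echo "$LAUNCH"
# $LAUNCH
```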
