Introduction
Intel® oneAPI Math Kernel Library (oneMKL) is a highly optimized and extensively threaded math library suitable for computationally intensive applications. It has been optimized for a broad range of Intel processors, enabling users to evaluate the performance of their applications and oneMKL on those processors.
The general matrix multiplication (GEMM) function is a commonly used function in scientific, engineering, numerical computing, data analytics, and machine learning workloads. oneMKL provides single- and double-precision matrix multiplication functions (SGEMM and DGEMM), which have been vectorized and parallelized to take advantage of the latest Intel® architecture instruction sets and their numerous computational cores. Interest in the performance of the SGEMM function, for example, has grown significantly due to its ability to accelerate deep neural network (DNN) computations.
Factors such as the problem size, cache size, number of cores, and the underlying computer architecture influence the performance of almost all functions in oneMKL. In this article, we provide some tips on how to measure the SGEMM function performance on Intel® Xeon® processors. These same tips can be applied to measure the performance of many other functions in oneMKL.
To begin, for a detailed description of the SGEMM function and its parameters, review the Developer Reference for Intel® oneAPI Math Kernel Library for C. cblas_sgemm is the C interface to SGEMM; its syntax is shown below.
Computes a matrix-matrix product with general matrices
void cblas_sgemm (const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE transa, const
CBLAS_TRANSPOSE transb, const MKL_INT m, const MKL_INT n, const MKL_INT k, const float
alpha, const float *a, const MKL_INT lda, const float *b, const MKL_INT ldb, const
float beta, float *c, const MKL_INT ldc);
The ?gemm routines compute a scalar-matrix-matrix product and add the result to a scalar-matrix product, with general matrices. The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H.
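For illustration, the following is a minimal sketch of a cblas_sgemm call (the matrix sizes and values are chosen arbitrarily for this example):
#include <mkl.h>
#include <cstdio>
int main()
{
    const MKL_INT m = 2, n = 2, k = 3;
    float A[6] = {1, 2, 3, 4, 5, 6};   /* m x k, row-major */
    float B[6] = {1, 0, 0, 1, 1, 1};   /* k x n, row-major */
    float C[4] = {0, 0, 0, 0};         /* m x n */
    /* C := 1.0*A*B + 0.0*C */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, k, B, n, 0.0f, C, n);
    for (MKL_INT i = 0; i < m; ++i)
        printf("%8.1f %8.1f\n", C[i*n], C[i*n + 1]);
    return 0;
}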
Measure the Performance of SGEMM
Tip 1: Ignore the Time Required for the First Call
oneMKL is multithreaded and employs internal buffers for fast memory allocation. Typically, the first subroutine call initializes the threads and internal buffers, so the first function call may take more time than subsequent calls with the same arguments. Although the initialization time is usually insignificant compared to the runtime of SGEMM for large matrices, it can be substantial when timing SGEMM for small matrices. To remove the initialization time from the performance measurement, make a warm-up call to SGEMM with sufficiently large parameters (for example, M=N=K=100) and ignore the time required for that first call. Using a very small matrix for the first call will not initialize the threads, since oneMKL runs multithreaded code only for sufficiently large matrices.
Tip 2: Select the Appropriate Time Function and Measure the Average Performance
oneMKL provides the timing function dsecnd(), which measures the runtime in seconds. The runtime of a subroutine may vary from one call to another, and small problems are especially susceptible to time variations due to system artifacts. Therefore, for functions with small runtimes, it is common practice to measure the average performance by placing the function within a loop. The total elapsed time divided by the loop count gives the average time required for a single function call. The loop count should be large enough to get a representative average for small problems. On the other hand, if a large loop count is chosen, the runtime of the benchmark may be prohibitive for large problems. In practice, we choose a loop count so that the timed loop runs for at least one second.
Developers often use the system time function gettimeofday(&end_time, NULL), which returns the elapsed time in seconds and microseconds (μs). We use this time function in the later tests in this article.
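As a minimal sketch (not the article's exact test code), the elapsed time around a GEMM loop can be computed from two gettimeofday() calls as follows:
#include <sys/time.h>
#include <cstdio>
int main()
{
    struct timeval start_time, end_time;
    gettimeofday(&start_time, NULL);
    /* ... place the SGEMM loop to be timed here ... */
    gettimeofday(&end_time, NULL);
    /* combine the seconds and microseconds fields into elapsed seconds */
    double seconds = (double)(end_time.tv_sec  - start_time.tv_sec) +
                     (double)(end_time.tv_usec - start_time.tv_usec) * 1e-6;
    printf("Elapsed: %f sec\n", seconds);
    return 0;
}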
Tip 3: Measure Performance in GFLOPS
Floating point operations per second (FLOPS) is a useful metric to compare the performance of compute-bound subroutines like SGEMM with the theoretical peak performance of a machine. For the multiplication of an M×K A matrix and a K×N B matrix, 2K - 1 operations (K-1 additions and K multiplications) are required to compute each element of the result matrix. Since there are MN entries in the C matrix, MN(2K-1) operations are required for the multiplication of the two matrices. 2MN additional operations are required for adding scaled C to AB. Therefore, the total number of floating point operations for a typical SGEMM call is approximately 2MNK. Dividing the number of operations by the average time gives the average FLOPS rate for SGEMM. Typically, the performance is reported as gigaFLOPS (GFLOPS), which is equal to one billion (10^9) FLOPS. An example code that determines the time and GFLOPS for SGEMM is provided here.
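For example, for the test sizes used later in this article (M=2000, N=2048, K=2048), the operation count is approximately 2 x 2000 x 2048 x 2048 ≈ 16.78 GFLOP, which matches the GFLOP value reported in the benchmark output below.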
Theoretical Peak GFLOPS of the Test Machine
The theoretical peak GFLOPS for an Intel processor can be calculated as:
Peak GFLOPS = (total number of cores) x (CPU frequency in GHz) x (floating-point operations per cycle per core, fpc), where fpc = (number of single-precision floats in one SIMD register) x 2 (each fused multiply-add counts as one multiply and one add) x (number of FMA units per core). A small code sketch of this formula follows the examples below.
Some examples follow:
- Intel® Core™ i5-6300U CPU @ 2.40 GHz, 2 cores: a 6th generation Intel® Core™ i5 processor that supports Intel® Advanced Vector Extensions 2 (Intel® AVX2) and has 2 FMA units (each fused multiply-add performs 1 multiply and 1 add). For single-precision floating point (32-bit), fpc = 256/32 x 2 x 2 = 32 and peak GFLOPS = 2 x 2.4 x 32 = 153.6 GFLOPS.
- Intel® Xeon® Scalable processors: Intel® Xeon® Platinum 8180M CPU @ 2.50 GHz, 56 cores, Intel AVX-512, 2 FMA units. fpc = 512/32 x 2 x 2 = 64, and peak GFLOPS = 56 x 2.5 x 64 = 8,960 GFLOPS, or about 8.96 TFLOPS.
- Intel® Xeon® Silver 4110 CPU @ 2.10 GHz, Intel® Advanced Vector Extensions 512 (Intel® AVX-512) support, 32 logical CPUs: peak performance ≈ 4.3 TFLOPS. All test results in this article were obtained on this machine.
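As a small sketch of the formula above (the function name and inputs are illustrative; the SIMD width, FMA count, and frequency must be supplied for the specific processor):
#include <iostream>
/* single-precision peak GFLOPS = cores * GHz * fpc,
   where fpc = (floats per SIMD register) * 2 (multiply + add per FMA) * FMA units */
double peak_gflops_sp(int cores, double ghz, int simd_width_bits, int fma_units)
{
    int fpc = (simd_width_bits / 32) * 2 * fma_units;
    return cores * ghz * fpc;
}
int main()
{
    /* Intel Core i5-6300U: 2 cores, 2.4 GHz, AVX2 (256-bit), 2 FMA units */
    std::cout << peak_gflops_sp(2, 2.4, 256, 2) << " GFLOPS" << std::endl; /* 153.6 */
    return 0;
}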
The GFLOPS numbers for some Intel processors are reported here.
Example Code
This code measures the performance of SGEMM using the dsecnd() function in oneMKL. The return value of the first dsecnd() function may be slightly off; therefore, we recommend discarding the return value of the first dsecnd() function call.
/* mkl.h is required for dsecnd(), cblas_sgemm(), mkl_malloc(), and mkl_free() */
#include <iostream>
#include <mkl.h>
int main(int argc, char** argv)
{
    const MKL_INT m = 2000;
    const MKL_INT n = 2048;
    const MKL_INT k = 2048;
    const float alpha = 1.0f, beta = 1.0f;
    const int MKL_MEM_ALIGNMENT = 64;
    const std::size_t LOOP_COUNT = 100;

    /* allocate 64-byte-aligned buffers (see Tip 5) */
    float *A = (float *) mkl_malloc(sizeof(float)*m*k, MKL_MEM_ALIGNMENT);
    float *B = (float *) mkl_malloc(sizeof(float)*k*n, MKL_MEM_ALIGNMENT);
    float *C = (float *) mkl_malloc(sizeof(float)*m*n, MKL_MEM_ALIGNMENT);

    for (std::size_t i = 0; i < (std::size_t)(m*k); i++) A[i] = (float)(i+1);
    for (std::size_t i = 0; i < (std::size_t)(k*n); i++) B[i] = -(float)(i+1);
    for (std::size_t i = 0; i < (std::size_t)(m*n); i++) C[i] = 0.0f;

    /* the return value of the first dsecnd() call may be inaccurate; discard it */
    dsecnd();

    /* first call, which does the thread/buffer initialization (Tip 1) */
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, m, B, k, beta, C, m);

    /* start timing after the first GEMM call */
    double time_st = dsecnd();
    for (std::size_t i = 0; i < LOOP_COUNT; ++i)
    {
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, alpha, A, m, B, k, beta, C, m);
    }
    double time_end = dsecnd();
    double time_avg = (time_end - time_st)/LOOP_COUNT;
    double gflop = (2.0*m*n*k)*1E-9;

    std::cout << "LOOP COUNT: " << LOOP_COUNT << std::endl;
    std::cout << "m=" << m << ", n=" << n << ", k=" << k
              << ", cores=" << mkl_get_max_threads() << std::endl;
    std::cout << "Average time: " << time_avg << " sec" << std::endl;
    std::cout << "GFlop: " << gflop << std::endl;
    std::cout << "GFlop/sec: " << gflop/time_avg << std::endl;

    mkl_free(A);
    mkl_free(B);
    mkl_free(C);
    return 0;
}
Configuration Information of the Test Machine
Hardware: Intel Xeon Silver 4110 CPU, 2.10 GHz, 2 sockets x 8 cores per socket x 2 hyperthreading threads, total 32 logical CPUs, Intel AVX-512 support, 11264K Level 3 cache, 32 GB memory
Operating system: Ubuntu* 16.04.3 LTS, x86_64 GNU/Linux*
Versions: Compilers_and_libraries_2018.1.163, oneMKL 2018 update 1, GCC version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
Build the Example Code
Intel® Parallel Studio XE Composer Edition includes both Intel® C++ Compiler and oneMKL. Try Intel® C++ Compiler first.
Build the sample code on the Intel Xeon processor with Intel C++ Compiler:
> source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
> icc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_icc
- Run the application: > ./sgemm_icc
The test multiplies a 2000x2048 matrix A by a 2048x2048 matrix B and reports about 467.39 GFLOPS/sec:
$ ./sgemm_icc
LOOP COUNT: 100
m=2000, n=2048, k=2048, cores=32
Average time: 0.035895 secs
GFLOPS: 16.77722
GFLOPS/sec: 467.39181 GFLOPS
Build the sample code natively on the Intel Xeon processor with GNU C++ compiler:
> source /opt/intel/compilers_and_libraries/linux/mkl/bin/mklvars.sh intel64
> gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_gcc
>./sgemm_gcc
- Average time: 0.037357 secs
- GFLOPS: 16.77722
- GFLOPS/sec :449.10237 GFLOPS
Note: The Intel C++ Compiler and GNU GCC builds have similar performance. The GNU GCC build gives 449.10 GFLOPS, which appears slightly slower than the Intel C++ Compiler build. However, the results of sgemm_icc and sgemm_gcc varied from run to run on the test machine, from about 377.73991 GFLOPS to 498.21456 GFLOPS. Because oneMKL is linked as a prebuilt binary, both the Intel C++ Compiler and GNU GCC builds execute the same code inside the SGEMM loop, so their performance is similar.
To align with the general use case, we use the GNU C++ compiler and the system time function gettimeofday() in the following tests.
Tip 4: Set KMP_AFFINITY to Avoid Thread Migration
To get a more stable or better performance result, we need to consider the hardware architecture and operating system features. To achieve the best performance on systems with multicore processors, ensure that threads do not migrate from core to core. To do this, bind threads to CPU cores by setting an affinity mask. For the GEMM performance test, the KMP_AFFINITY environment variable is useful:
Intel® Hyper-Threading Technology Enabled:
Linux* and macOS*: export KMP_AFFINITY=compact,1,0,granularity=fine
Windows*: set KMP_AFFINITY=compact,1,0,granularity=fine
Intel Hyper-Threading Technology Disabled:
Linux and macOS: export KMP_AFFINITY=compact
Windows: set KMP_AFFINITY=compact
Test the performance again and evaluate the result: ~559.9 GFLOPS.
> export KMP_AFFINITY=compact,1,0,granularity=fine
Average time: 0.029963 secs
GFLOP: 16.77722
GFLOPS/sec: 559.93467 GFLOPS
Tip 5: Align the Data
On Intel processors, the cache line is usually 64 bytes and data reads and writes are aligned to the cache line. To improve the performance of an application that calls oneMKL, align the test arrays on 64-byte boundaries and use mkl_malloc and mkl_free to allocate and free the aligned memory.
A = (float*) mkl_malloc(sizeof(float)*lda*m, MKL_MEM_ALIGNMENT);
>gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_align -DMKL_ALIGN
Average time: 0.029767 secs
GFLOPS: 16.77722
GFLOPS/sec: 563.61550 GFLOPS
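The full benchmark source is not reproduced here; the following is a plausible sketch of how a -DMKL_ALIGN compile switch could toggle between plain malloc and 64-byte-aligned mkl_malloc (the helper names and structure are illustrative, not the article's exact code):
#include <mkl.h>
#include <cstddef>
#include <cstdlib>
/* allocate a single-precision matrix buffer of the given element count */
float* alloc_matrix(std::size_t elements)
{
#ifdef MKL_ALIGN
    /* 64-byte alignment matches the cache-line size on Intel processors */
    return (float *) mkl_malloc(sizeof(float) * elements, 64);
#else
    return (float *) malloc(sizeof(float) * elements);
#endif
}
void free_matrix(float *p)
{
#ifdef MKL_ALIGN
    mkl_free(p);
#else
    free(p);
#endif
}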
Tip 6: Avoid Leading Dimensions that are Multiples of 256
The leading dimension of an m x n matrix (the lda, ldb, and ldc parameters) is the distance, in elements, between successive columns (for column-major layout) or successive rows (for row-major layout) in memory.
Modern processors have multiple levels of cache, and the cache line for reads and writes is 64 bytes (16 single-precision elements). To avoid cache conflicts, leading dimensions should be a multiple of the cache line, but should not be a large power of two (that is, avoid multiples of 256, 128, and so on). For single precision, the cache-line multiple is 16 elements.
To improve the performance of an application that calls oneMKL, ensure that the leading dimensions of the arrays are divisible by 64/element_size, where element_size is the number of bytes for the matrix elements (4 for single-precision real, 8 for double-precision real and single-precision complex, and 16 for double-precision complex).
If the processor uses a cache-cascading structure (cache set > cache line), then to avoid cache stalls the leading dimensions should also avoid being multiples of 128. For example, if ldim % 128 == 0, add 16 to the leading dimension.
Generally, set the leading dimension to the following integer expression:
((( n * element_size + 511) / 512) * 512 + 64) / element_size , or #define FIX_LD(x) (((x + 255) & ~255) + 16)
where n is the matrix dimension along the leading dimension.
> gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_align -DMKL_ALIGN -DMKL_LD
Average time: 0.029493 secs
GFLOPS: 16.77722
GFLOPS/sec: 568.84858 GFLOPS
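The following sketch (assumptions: column-major layout and arbitrary matrix sizes) shows how the FIX_LD expression above could be applied when allocating the matrices and passing the leading dimensions to cblas_sgemm:
#include <mkl.h>
/* round the leading dimension up to a multiple of 256 elements, then add 16
   so it is a multiple of the cache line but not a power of two */
#define FIX_LD(x) (((x + 255) & ~255) + 16)
int main()
{
    const MKL_INT m = 2000, n = 2048, k = 2048;
    const MKL_INT lda = FIX_LD(m);   /* column-major: lda >= rows of A */
    const MKL_INT ldb = FIX_LD(k);   /* ldb >= rows of B */
    const MKL_INT ldc = FIX_LD(m);   /* ldc >= rows of C */
    float *A = (float *) mkl_malloc(sizeof(float) * lda * k, 64);
    float *B = (float *) mkl_malloc(sizeof(float) * ldb * n, 64);
    float *C = (float *) mkl_malloc(sizeof(float) * ldc * n, 64);
    /* matrix initialization omitted for brevity */
    const float alpha = 1.0f, beta = 0.0f;
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);
    return 0;
}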
Tip 7: Use Variations of GEMM
oneMKL provides parallel, high-performance GEMM implementations that use the concurrent threads supported by modern multicore architectures. This strategy works well when multiplying large matrices, because all cores are used efficiently. When multiplying small matrices or in special use cases, however, classic GEMM calls may not use all the cores optimally. For these cases, recent versions of oneMKL provide MKL_DIRECT_CALL, compact GEMM, batch, and packed APIs. To try them, see the Developer Reference.
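For instance, cblas_sgemm_batch groups many small, independent multiplications into a single call; the following is a minimal sketch with one group of identical problems (sizes chosen arbitrarily; check the Developer Reference for the exact parameter semantics):
#include <mkl.h>
#include <vector>
int main()
{
    const MKL_INT group_count = 1;
    const MKL_INT batch = 8;                 /* 8 small GEMMs in one group */
    const MKL_INT m = 16, n = 16, k = 16;
    /* contiguous storage for all matrices in the batch */
    std::vector<float> Abuf(batch * m * k, 1.0f);
    std::vector<float> Bbuf(batch * k * n, 1.0f);
    std::vector<float> Cbuf(batch * m * n, 0.0f);
    std::vector<const float*> A(batch), B(batch);
    std::vector<float*> C(batch);
    for (MKL_INT i = 0; i < batch; ++i) {
        A[i] = &Abuf[i * m * k];
        B[i] = &Bbuf[i * k * n];
        C[i] = &Cbuf[i * m * n];
    }
    /* per-group parameter arrays (a single group here) */
    CBLAS_TRANSPOSE ta[] = {CblasNoTrans}, tb[] = {CblasNoTrans};
    MKL_INT m_arr[] = {m}, n_arr[] = {n}, k_arr[] = {k};
    MKL_INT lda[] = {m}, ldb[] = {k}, ldc[] = {m};
    float alpha[] = {1.0f}, beta[] = {0.0f};
    MKL_INT group_size[] = {batch};
    cblas_sgemm_batch(CblasColMajor, ta, tb, m_arr, n_arr, k_arr,
                      alpha, A.data(), lda, B.data(), ldb,
                      beta, C.data(), ldc, group_count, group_size);
    return 0;
}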
Summary
It is important to measure the performance of GEMM for computationally intensive applications. oneMKL provides a highly optimized and extensively threaded GEMM function. In this article, we explained how to design a benchmark and measure the performance of oneMKL SGEMM, and outlined tips to help developers test the performance and quickly evaluate the FLOPS of GEMM on a specified processor.
References
Intel® Math Kernel Library - Introducing Vectorized Compact Routines