Intel® Math Kernel Library 9.0 for Windows*
Technical User Notes

Document Number: 310707-003US

Contents

Purpose
Compiler Support
Using Intel® MKL Parallelism
Memory Management
Performance
    Multi-Core Performance
Configuration File
Obtaining Version Information
Custom DLL Builder
FFTW Interface Support
Support for Removed FFT Interface
GMP* Functions
Technical Support
Disclaimer and Legal Information

 

Purpose

The Intel® Math Kernel Library (Intel® MKL) 9.0 for Windows* Technical User Notes describe in detail how to compile, link, and run with Intel® MKL 9.0 for Windows*. Use this document together with the Intel® MKL 9.0 for Windows* Release Notes and the Getting Started Guide for Intel® MKL 9.0 for Windows* when integrating Intel® MKL 9.0 for Windows* into your application.

 

Compiler Support

Intel supports Intel® MKL for use only with those compilers that are identified in the release notes. However, the library has been successfully used with other compilers.

Although the Compaq* Visual Fortran (CVF) compiler is no longer supported by Compaq, Intel MKL still preserves the CVF interface, which can be used with the Intel® Fortran Compiler with the /Gm key. In the following discussion, "stdcall" actually means the CVF compiler's default calling convention, which differs from true stdcall in the way strings are passed to the routine.

There are both cdecl (default Microsoft* Visual C (MSVC) interface) and stdcall (default CVF interface) versions of the library. The cdecl version is called mkl_c.lib and the stdcall version is called mkl_s.lib. Whether you choose to link with cdecl or stdcall depends on factors that only you can determine. Calling routines in mkl_s.lib from C requires that you use a statement like extern __stdcall name( <prototype variable1>, <prototype variable2>, .. );. However, because the default CVF format is not identical with stdcall, you will need to handle strings in the calling sequence specially. Dealing with this issue is complex and the user is advised to refer to sections on interfaces in the CVF documentation.

Calling routines in mkl_c.lib requires a similar declaration, such as <type> name( <prototype variable1>, <prototype variable2>, .. );.

Similarly, the CVF compiler will link with mkl_s.lib if routines are compiled with the default interface, but if the user compiles with the switch /iface=(cref,nomixed_str_len_arg), the appropriate library to use is mkl_c.lib.

When using the cblas interface, the header file mkl.h will simplify program development since it specifies enumerated values as well as prototypes of all the functions. The header determines if the program is being compiled with a C++ compiler, and if it is, the included file will be correct for use with C++ compilation.

There is currently no support for compilers that require OMF file format.


 

Using Intel® MKL Parallelism

Intel® MKL is threaded in a number of places: sparse solver, LAPACK (*GETRF, *POTRF, *GBTRF, *GEQRF, *ORMQR, *STEQR, *BDSQR routines), all Level 3 BLAS, Sparse BLAS matrix-vector and matrix-matrix multiply routines for the compressed sparse row and diagonal formats, and all DFTs (except 1D transformations when DFTI_NUMBER_OF_TRANSFORMS=1 and sizes are not power-of-two). The library uses OpenMP* threading software.

In some situations, conflicts in the execution environment make the use of threads in Intel® MKL problematic. They are listed here with recommendations for dealing with them. First, a brief discussion of why the problem exists is appropriate.

If the user threads the program using OpenMP* directives and uses the Intel compilers to compile the program, Intel® MKL and the user program will both use the same threading library. Intel® MKL tries to determine if it is in a parallel region in the program, and if it is, it does not spread its operations over multiple threads. But Intel® MKL can be aware that it is in a parallel region only if the threaded program and Intel® MKL are using the same threading library. If the user program is threaded by some other means, Intel® MKL may operate in multithreaded mode and the performance may suffer due to overuse of the resources. Here are several cases and our recommendations for the user:

  1. User threads the program using OS threads (pthreads on Linux*, Win32* threads on Windows*). If more than one thread calls the library, and the function being called is threaded, it is important that threading in Intel® MKL be turned off. Set OMP_NUM_THREADS=1 in the environment. This is the default in Intel® MKL 9.0 for everything except the sparse solver.
  2. User threads the program using OpenMP* directives and/or pragmas and compiles the program using a compiler other than a compiler from Intel. This is more problematic in that setting OMP_NUM_THREADS in the environment affects both the compiler's threading library and the threading library with Intel® MKL. At this time, the safe approach is to set MKL_SERIAL=YES (or MKL_SERIAL=yes), which forces the library to serial mode regardless of OMP_NUM_THREADS value.
  3. There are multiple programs running on a multiple-CPU system, as in the case of a parallelized program running using MPI for communication in which each processor is treated as a node. The threading software will see multiple processors on the system even though each processor has a separate process running on it. In this case OMP_NUM_THREADS should be set to 1.

Setting the number of threads. The OpenMP* software responds to the environment variable OMP_NUM_THREADS. You can set the number of threads either as an environment variable in the Environment panel of the System Properties box of the Control Panel on Microsoft* Windows*, or in the shell in which the program runs. To change the number of threads, enter the following in the command shell in which the program is going to run:

set OMP_NUM_THREADS=<number of threads to use>

Some other shells require the variable and its value to be exported:

export OMP_NUM_THREADS=<number of threads to use>

Setting the variable when running on Microsoft* Windows* 98 or Windows* ME is meaningless, since multiprocessing is not supported.

To force Intel® MKL into serial mode, set the environment variable MKL_SERIAL to YES (set MKL_SERIAL=YES). This works regardless of the OMP_NUM_THREADS value. MKL_SERIAL is not set by default.

If the variable OMP_NUM_THREADS is not set, Intel® MKL runs on one thread. We recommend always setting OMP_NUM_THREADS to the number of processors you wish to use in your application.

Note. Currently the default number of threads for the sparse solver is the number of processors in the system.
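
The rules in this section can be summarized in a short C sketch. It is an illustration only and not part of Intel® MKL: the function name expected_mkl_threads is hypothetical, the handling of unparsable values is an assumption, and the sketch covers only threaded routines other than the sparse solver.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not an MKL function: returns the number of threads
 * that, according to this section, MKL 9.0 uses for a threaded routine
 * (other than the sparse solver). Each argument is the value of the
 * corresponding environment variable, or NULL if the variable is not set. */
int expected_mkl_threads(const char *mkl_serial, const char *omp_num_threads)
{
    /* MKL_SERIAL=YES (or yes) forces serial mode regardless of OMP_NUM_THREADS. */
    if (mkl_serial != NULL &&
        (strcmp(mkl_serial, "YES") == 0 || strcmp(mkl_serial, "yes") == 0))
        return 1;

    /* OMP_NUM_THREADS not set: the library runs on one thread. */
    if (omp_num_threads == NULL)
        return 1;

    /* Otherwise use the requested count. Falling back to 1 for values that
     * are not positive integers is an assumption of this sketch. */
    int n = atoi(omp_num_threads);
    return n > 0 ? n : 1;
}
```

A program could call expected_mkl_threads(getenv("MKL_SERIAL"), getenv("OMP_NUM_THREADS")) at startup to log the mode it expects the library to run in.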

 

Memory Management

Intel® MKL has memory management software that controls memory buffers for use by the library functions. When a call is made to certain Intel® MKL functions (such as those in the Level 3 BLAS or DFTs), new buffers are allocated if there are no free ones (marked as free) currently available. These buffers are not deallocated until the program ends. If at some point the user's program needs to free memory, it may do so with a call to MKL_FreeBuffers(). If another call is made to a library function that needs a memory buffer, then the memory manager will again allocate the buffers and they will again remain allocated until either the end of the program or the program deallocates the memory.

This memory management software is turned on by default. To disable it, set the environment variable MKL_DISABLE_FAST_MM to any value, which will cause memory to be allocated and freed from call to call. Disabling this feature will negatively impact performance of routines such as the level 3 BLAS, especially for small problem sizes.

Memory management limits the number of buffers allocated in each thread. Currently this limit is 32. The maximum number of supported threads is 514. To avoid this restriction, disable memory management.

Memory Functions Renaming
Intel® MKL memory management uses standard C run-time memory functions to allocate or free memory. Starting with MKL 9.0, you can replace these functions with your own memory functions. The header file i_malloc.h and, for the dynamic case, the additional header file i_malloc_dll.h contain all declarations an application developer needs to replace the memory allocation functions. These header files also describe how memory allocation can be replaced in those Intel® libraries that support this feature.

 

Performance

To obtain the best performance with Intel® MKL, make sure the following conditions are met:

Note on the LAPACK packed routines performance:

The routines with the names that contain the letters HP, OP, PP, SP, TP, UP in the matrix type and storage position (the second and third letters respectively) operate on the matrices in the packed format (see "LAPACK Routine Naming Conventions" sections in the MKL manual). Their functionality is strictly equivalent to the functionality of the unpacked routines with the names containing the letters HE, OR, PO, SY, TR, UN in the corresponding positions, but the performance is significantly lower.
If the memory restriction is not too tight, use an unpacked routine for better performance. Note that in such a case you need to allocate N^2/2 more memory than the memory required by a respective packed routine, where N is the problem size (the number of equations).

For example, solving a symmetric eigenproblem with an expert driver can be sped up by using an unpacked routine:
call dsyevx(jobz, range, uplo, n, a, lda, vl, vu, il, iu, abstol, m, w, z, ldz, work, lwork, iwork, ifail, info),
where a is of dimension lda-by-n, which is at least N^2 elements, instead of
call dspevx(jobz, range, uplo, n, ap, vl, vu, il, iu, abstol, m, w, z, ldz, work, iwork, ifail, info),
where ap is of dimension N*(N+1)/2.
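
The memory trade-off between the two storage schemes can be worked out with a few lines of C. This is a sketch for illustration: the function names are hypothetical, lda is assumed to equal n, and the index formula is the usual column-by-column layout of upper packed storage (uplo = 'U') expressed with 0-based indices.

```c
#include <stddef.h>

/* Elements needed for a full (unpacked) n-by-n matrix, assuming lda == n. */
size_t full_elems(size_t n) { return n * n; }

/* Elements needed for packed storage of a symmetric n-by-n matrix. */
size_t packed_elems(size_t n) { return n * (n + 1) / 2; }

/* Extra elements needed by the unpacked routine: n*(n-1)/2, roughly n^2/2. */
size_t extra_elems(size_t n) { return full_elems(n) - packed_elems(n); }

/* 0-based position of element (i, j), with i <= j, in upper packed storage,
 * where the upper triangle is stored column after column. */
size_t packed_upper_index(size_t i, size_t j) { return i + j * (j + 1) / 2; }
```

For n = 1000, the packed array holds 500500 elements and the unpacked array needs 499500 more, which is about 3.8 MB for double precision data.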

There are additional conditions for the FFT functions:

On IA-32 based applications the addresses of the first elements of arrays and the leading dimension values, in bytes (n*element_size), of two-dimensional arrays should be divisible by cache line size (32 bytes for Pentium® III processor, 64 bytes for Pentium® 4 processor, and 128 bytes for Intel® EM64T processor).
On Itanium®-based applications the sufficient conditions are as follows:
- for the C-style FFT, the distance L between arrays that represent real and imaginary parts is not divisible by 64. The best case is when L=k*64 + 16.
- leading dimension values, in bytes (n*element_size), of two-dimensional arrays are not power-of-two.
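
The IA-32 conditions above can be checked before calling an FFT function. The following C sketch uses hypothetical helper names and is not part of Intel® MKL; it covers only the IA-32 recommendation (the Itanium® conditions differ, as listed above). The cache line size is passed in by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the address is divisible by the cache line size (in bytes). */
int is_line_aligned(const void *p, size_t line_bytes)
{
    return ((uintptr_t)p % line_bytes) == 0;
}

/* Nonzero if the leading dimension in bytes (n * elem_size) is divisible
 * by the cache line size. */
int ld_is_line_multiple(size_t n, size_t elem_size, size_t line_bytes)
{
    return (n * elem_size) % line_bytes == 0;
}

/* Smallest leading dimension (in elements) that is >= n and whose size in
 * bytes is divisible by the cache line size. */
size_t padded_ld(size_t n, size_t elem_size, size_t line_bytes)
{
    size_t ld = n;
    while ((ld * elem_size) % line_bytes != 0)
        ld++;
    return ld;
}
```

For example, a two-dimensional array of doubles with 100 elements per row does not meet the condition for a 64-byte cache line (800 bytes), while padding the leading dimension to 104 elements (832 bytes) does.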

 

Multi-Core Performance


An application that uses MKL may run faster when the number of threads equals the number of sockets rather than the number of cores, and the performance of a parallel application may be unstable on multi-core systems. Binding threads to CPU cores may improve or stabilize performance. This is done by setting an affinity mask for each thread. You can do this either with OpenMP facilities (recommended if available, for instance, via the KMP_AFFINITY environment variable with Intel OpenMP) or with a system routine (see the example below).

Example of setting an affinity mask by operating system means using the Intel C compiler.
Suppose the system has two sockets with two cores each, and the performance of a 4-thread parallel application (using MKL LAPACK) is unstable. Put the following fragment into your code before the LAPACK call to bind the threads to the appropriate cores and prevent thread migration.

// Set affinity mask
#include <windows.h>
#include <omp.h>
#pragma omp parallel default(shared)
{
   // mask is declared inside the parallel region, so each thread has its own copy
   DWORD_PTR mask = ((DWORD_PTR)1 << omp_get_thread_num());
   SetThreadAffinityMask( GetCurrentThread(), mask );
}

// Call MKL LAPACK routine
...

Then build your application and run it with 4 threads:

set OMP_NUM_THREADS=4
test_application.exe

Please refer to http://msdn.microsoft.com for the restrictions on the use of Windows API routines.

 

Configuration File

The Intel® MKL configuration file lets you customize several features of the library, namely the names of the dynamic libraries, the serial or parallel operation mode, and the checking of input parameters.

The configuration file is named mkl.cfg by default. The file contains several variables that can be changed. Below is an example of the configuration file that contains all possible variables with their default values:

//
// Default values for mkl.cfg file
//
// DLL names for IA-32
MKL_X87dll = mkl_def.dll
MKL_SSE1dll = mkl_p3.dll
MKL_SSE2dll = mkl_p4.dll
MKL_SSE3dll = mkl_p4p.dll
MKL_VML_X87dll = mkl_vml_def.dll
MKL_VML_SSE1dll = mkl_vml_p3.dll
MKL_VML_SSE2dll = mkl_vml_p4.dll
MKL_VML_SSE3dll = mkl_vml_p4p.dll
// DLL names for Intel(R) EM64T
MKL_EM64TDEFdll = mkl_def.dll
MKL_EM64TSSE3dll = mkl_p4n.dll
MKL_VML_EM64TDEFdll = mkl_vml_def.dll
MKL_VML_EM64TSSE3dll = mkl_vml_p4n.dll
// DLL names for Intel(R) Itanium(R) processor family
MKL_I2Pdll = mkl_i2p.dll
MKL_VML_I2Pdll = mkl_vml_i2p.dll
// DLL names for LAPACK libraries
MKL_LAPACK32dll = mkl_lapack32.dll
MKL_LAPACK64dll = mkl_lapack64.dll
// Serial or parallel mode
//     YES - single threaded
//     NO - multi threaded
//     OMP - control by OMP_NUM_THREADS
MKL_SERIAL = YES
// Input parameters check
//     ON - checkers are used (default)
//     OFF - checkers are not used
MKL_INPUT_CHECK = ON

The first time any MKL function is called, Intel® MKL checks whether the configuration file exists and, if so, operates with the variables specified in it. The path to the configuration file is specified by the environment variable MKL_CFG_FILE. If this variable is not defined, the current directory is searched first, and then the directories specified in the PATH environment variable. If the MKL configuration file does not exist, the library operates with the default values of the variables (standard names of libraries, checkers on, non-threaded operation mode).
If a variable is not specified in the configuration file, or is specified incorrectly, the default value is used.
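
To make the file layout concrete, here is a minimal C sketch that splits one line of the form NAME = value as it appears in the example above. It is not the parser used inside Intel® MKL; the function name parse_cfg_line and its exact rules (skipping blank lines and lines that start with //) are assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical parser for one "NAME = value" line in the mkl.cfg layout.
 * Returns 1 and fills name and value on success. Returns 0 for blank lines,
 * comment lines starting with "//", lines without '=', and lines whose name
 * or value is empty or does not fit into the supplied buffers. */
int parse_cfg_line(const char *line, char *name, size_t name_sz,
                   char *value, size_t value_sz)
{
    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || (line[0] == '/' && line[1] == '/'))
        return 0;

    const char *eq = strchr(line, '=');
    if (eq == NULL)
        return 0;

    /* Name: text before '=', with trailing blanks trimmed. */
    const char *end = eq;
    while (end > line && isspace((unsigned char)end[-1]))
        end--;
    size_t len = (size_t)(end - line);
    if (len == 0 || len >= name_sz)
        return 0;
    memcpy(name, line, len);
    name[len] = '\0';

    /* Value: text after '=', with blanks trimmed on both sides. */
    const char *v = eq + 1;
    while (isspace((unsigned char)*v))
        v++;
    end = v + strlen(v);
    while (end > v && isspace((unsigned char)end[-1]))
        end--;
    len = (size_t)(end - v);
    if (len == 0 || len >= value_sz)
        return 0;
    memcpy(value, v, len);
    value[len] = '\0';
    return 1;
}
```

A caller that follows the rule stated above would keep the default value of a variable whenever this function returns 0 or the parsed value is not one of the accepted settings.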

Below is an example of the configuration file that only redefines the library names:

// DLL redefinition
MKL_X87dll = matlab_x87.dll
MKL_SSE1dll = matlab_sse1.dll
MKL_SSE2dll = matlab_sse2.dll
MKL_SSE3dll = matlab_sse2.dll
MKL_ITPdll = matlab_ipt.dll
MKL_I2Pdll = matlab_i2p.dll

 

Obtaining Version Information

Intel® MKL provides a facility by which you can obtain information about the library (e.g., the version number). Two methods are provided for extracting this information. First, you may extract a version string using the function MKLGetVersionString. Or, alternatively, you can use the MKLGetVersion function to obtain an MKLVersion structure that contains the version information. Example programs for extracting this information are provided in the examples\versionquery directory. A makefile is also provided to automatically build the examples and output summary files containing the version information for the current library.

 

Custom DLL Builder

The custom DLL builder creates a dynamic library that contains only selected functions. It is located in the tools/builder folder. The builder contains a makefile and a definition file with the list of functions. The makefile has three targets: "ia32", "ipf", and "em64t". The ia32 target is used for IA-32, ipf for the Intel® Itanium® processor family, and em64t for the Intel® Xeon® processor with Intel® EM64T.
The makefile accepts several macros (parameters):

interface = cdecl/stdcall
defines the interface, for IA-32 only. The default value is cdecl.
export = functions_list
specifies the name of the file that contains the list of entry-point functions to be included in the DLL. This file is used to create the definition file and then the export table. The default name is functions_list.
name = mkl_custom
specifies the name of the created DLL and interface library. By default, the libraries mkl_custom.dll and mkl_custom.lib are built.
xerbla = user_xerbla.obj
specifies the name of the object file that contains the user’s error handler. This error handler is added to the library and is used instead of the standard MKL error handler xerbla. By default, that is, when this parameter is not specified, the standard MKL xerbla is used. Note that for IA-32 the object file must use the interface (cdecl or stdcall) selected by the interface macro.

None of the parameters is mandatory. In the simplest case, the command line could be nmake ia32, and the remaining parameters take their default values. As a result, the mkl_custom.dll and mkl_custom.lib libraries for IA-32 will be created with the cdecl interface, the functions list will be taken from the functions_list file, and the standard MKL error handler xerbla will be used.

Another example for a more complex case:
nmake ia32 interface=stdcall export=my_func_list.txt name=mkl_small xerbla=my_xerbla.obj
In this case the mkl_small.dll and mkl_small.lib libraries for IA-32 will be created with the stdcall interface, the functions list will be taken from the my_func_list.txt file, and the user’s error handler my_xerbla.obj will be used.

Entry points in the functions_list file must match the interface. For example, cdecl entry points could be listed as:

DGEMM
DTRSM
DDOT
DGETRF
DGETRS
cblas_dgemm
cblas_ddot

An example of entry points for the stdcall interface:

_DGEMM@60
_DDOT@20
_DGETRF@24
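
The numbers after @ in the stdcall entry points are the total size of the arguments in bytes. The following C sketch reproduces the names listed above under two assumptions that are not stated in this document: every argument is passed by reference as a 4-byte pointer on IA-32, and each character argument adds a hidden 4-byte length argument, as in the CVF default convention. The helper name stdcall_entry_point is hypothetical, and the argument counts come from the standard BLAS and LAPACK interfaces.

```c
#include <stdio.h>

/* Hypothetical helper: builds the IA-32 stdcall-decorated name "_NAME@bytes"
 * for a Fortran-style routine. Assumes 4-byte by-reference arguments and one
 * hidden 4-byte length per character argument. Returns the number of
 * characters written, or -1 if the output buffer is too small. */
int stdcall_entry_point(const char *name, int n_args, int n_char_args,
                        char *out, size_t out_sz)
{
    int bytes = 4 * (n_args + n_char_args);
    int written = snprintf(out, out_sz, "_%s@%d", name, bytes);
    return (written < 0 || (size_t)written >= out_sz) ? -1 : written;
}
```

For example, DGEMM has 13 arguments, two of which (transa and transb) are character arguments, so the helper yields _DGEMM@60. DDOT (5 arguments) and DGETRF (6 arguments) have no character arguments and yield _DDOT@20 and _DGETRF@24.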

MKL static library for IA-32 contains several versions targeted for different processors. These specific versions differ by prefix:

If interface entry points are used in the functions_list file, the custom DLL will contain all processor-specific versions of the functions, managed by the dispatcher. You can also create a library that contains only processor-specific versions of functions by listing processor-specific entry points in the functions_list file. For example, the following selects Pentium® 4 processor versions of several BLAS functions for the custom DLL:

DGEMM=_MKL_BLAS_p4_dgemm
DDOT=_MKL_BLAS_p4_ddot
DTRSM=_MKL_BLAS_p4_dtrsm

There are no processor specific versions for LAPACK, PARDISO functions and cblas interface functions.

Note for Intel® EM64T and Itanium®2-based applications users

SDKs starting with 1289 include an additional library, bufferoverflowu.lib, to resolve the external references to _security_cookie. The makefiles reference this library through the "BUF_LIB=bufferoverflowu.lib" macro. For older SDKs, leave this macro empty ("BUF_LIB=") or remove it from the linkage string.

 

FFTW Interface Support

Intel MKL offers two collections of wrappers that allow the FFTW interface to call the Intel MKL Fourier transforms. These collections correspond to the FFTW versions 2.x and 3.x, respectively, and the Intel MKL versions 7.0 and later.
The purpose of these wrappers is to enable developers whose programs currently use FFTW to achieve the performance of the Intel MKL Fourier transforms without changing the program source code. See FFTW to Intel® MKL Wrappers Technical User Notes for FFTW 2.x (fftw2xmkl_notes.htm) for details on the use of the FFTW 2.x wrappers and FFTW to Intel® MKL Wrappers Technical User Notes for FFTW 3.x (fftw3xmkl_notes.htm) for details on the use of the FFTW 3.x wrappers.

 

Support for Removed FFT Interface

Intel MKL offers a collection of FFT to DFTI wrappers. They allow developers whose programs use the Intel Fast Fourier Transform (FFT) interface, which is no longer supplied, to continue using the Intel MKL Fourier transforms through the DFTI interface without changing the program source code. For details, see Intel® Math Kernel Library FFT to DFTI Wrappers Technical User Notes.

 

GMP* Functions

Intel MKL implementation of GMP* arithmetic functions includes arbitrary precision arithmetic operations on integer numbers. The interfaces of such functions fully match the GNU Multiple Precision* (GMP) Arithmetic Library.

If you currently use the GMP* library, change the include statements in your programs to reference mkl_gmp.h.

 

Technical Support

Please see the Intel® MKL support website at http://www.intel.com/support/performancetools/libraries/mkl/.

 


Disclaimer and Legal Information

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL(R) PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice.

The software described in this document may contain software defects which may cause the product to deviate from published specifications. Current characterized software defects are available on request.

This document as well as the software described in it is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.

Except as permitted by such license, no part of this document may be reproduced, stored in a retrieval system, or transmitted in any form or by any means without the express written consent of Intel Corporation.

Developers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer’s software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.

 

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino logo, Chips, Core Inside, Dialogic, EtherExpress, ETOX, FlashFile, i386, i486, i960, iCOMP, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel XScale, IPLink, Itanium, Itanium Inside, MCS, MMX, MMX logo, Optimizer logo, OverDrive, Paragon, PDCharm, Pentium, Pentium II Xeon, Pentium III Xeon, Performance at Your Command, Pentium Inside, skoool, Sound Mark, The Computer Inside., The Journey Inside, VTune, Xeon, Xeon Inside and Xircom are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
 
* Other names and brands may be claimed as the property of others.
 
Copyright (C) 2000-2006, Intel Corporation.