# Accelerating LU Factorization Using Fortran, Intel® oneAPI Math Kernel Library, and OpenMP*

Standard, Open Approaches to Accelerate Common Math Operations

BASIC, Pascal, and C were the first programming languages that I learned, but I did my first serious coding (i.e., outside of a classroom) in Fortran. I used it almost exclusively for 12 years before switching to C and C++ when I joined Intel. I’ve lamented in some of my writings that I haven’t written a line of Fortran in 20 years. I was even starting to wonder if it was still relevant until my old pal Jim Cownie asked, “Is Fortran ‘A Dead Language?’”, and then answered convincingly that it isn’t. So, when a colleague asked me to try out the OpenMP* accelerator offload support in our latest Fortran compiler, I jumped at the chance.

You see, I like OpenMP as much as I like Fortran. It has always been an easy way to parallelize code, and the OpenMP specification has come a long way since it exploded onto the parallel computing scene in 1997. Today, it even supports accelerator offload. So instead of using SYCL, my usual language for accelerator offload, I’m going to return to my parallel programming roots to demonstrate how to offload computations to an accelerator using Fortran, OpenMP, and Intel® oneAPI Math Kernel Library (oneMKL).

If you’ve read some of my previous articles, you know that I like the Fourier correlation algorithm (article, webinar, code samples). I’ve used it in my biomolecular docking research and more recently for 2D and 3D image alignment. Since I’ve done this algorithm to death, I decided to do something different. Let’s see how easy it is to offload linear algebra computations (specifically, LU factorization) to an accelerator. An in-depth explanation of LU factorization is not the goal here. Suffice to say that it’s used to solve linear systems. If you want more details, the Intel® Math Kernel Library Cookbook has some good examples: Solving a System of Linear Equations with an LU-Factored Block Tridiagonal Coefficient Matrix and Factoring General Block Tridiagonal Matrices. I’m going to walk through a Fortran, oneMKL, and OpenMP implementation to illustrate the offload syntax and to highlight host-device data transfer in heterogeneous parallelism. I’ll start with the dgetri_oop_batch_strided.f90 example that’s included with oneMKL (see Using Code Examples), then modify it to improve efficiency.
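As a one-equation refresher (my own shorthand, not drawn from the cookbook recipes above), LU factorization with partial pivoting rewrites a matrix so that solving a linear system reduces to two triangular solves:

```latex
PA = LU
\qquad\Longrightarrow\qquad
Ax = b \;\text{ is solved via }\; Ly = Pb,\quad Ux = y
```

Here, $P$ is a permutation matrix recording the row pivots (the `ipiv` array in the code below), $L$ is unit lower triangular, and $U$ is upper triangular. Factor once, and you can cheaply solve for many right-hand sides.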

The program begins by including the Fortran interfaces for OpenMP offload support, then selects the 32- or 64-bit integer type (see Using the ILP64 Interface vs. LP64 Interface for more information) (Figure 1). I’ll make this decision at compile time, as shown in the compiler commands at the end of this article. To complete the setup, I set the number of matrices to factor and their sizes, set the dimension and stride parameters required by the oneMKL subroutines, and allocate the work arrays. We’re now ready to begin the calculation.

```fortran
include "mkl_omp_offload.f90"

program sgetri_oop_batch_strided_example

#if defined(MKL_ILP64)
use onemkl_lapack_omp_offload_ilp64   ! 64-bit integer type
#else
use onemkl_lapack_omp_offload_lp64    ! 32-bit integer type
#endif

integer,   parameter :: n = 10, batch_size = 4
integer              :: lda, ldainv, stride_a, stride_ainv, stride_ipiv
real,    allocatable :: a(:,:), ainv(:,:)
integer, allocatable :: ipiv(:,:), info(:)

lda = n
ldainv = n
stride_a = lda * n
stride_ainv = ldainv * n
stride_ipiv = n

allocate(a(stride_a, batch_size), ainv(stride_ainv, batch_size), &
         ipiv(stride_ipiv, batch_size), info(batch_size))
```
Figure 1. Setting up the example program. The oneMKL headers and modules that provide OpenMP target offload support are highlighted in blue. Note that the hardcoded matrix dimension and batch size parameters are fine for testing, but too small to merit acceleration.

The GPU that I have handy as I write this article only supports single-precision computations, so I’ll be using the Fortran REAL type and the appropriate LAPACK subroutines to compute the LU factorization (sgetrf_batch_strided) and out-of-place inversions of the LU-factored matrices (sgetri_oop_batch_strided). (Converting back to double precision is a simple matter of switching to the Fortran DOUBLE PRECISION type and using the LAPACK dgetrf_batch_strided and dgetri_oop_batch_strided subroutines.) The original example uses two OpenMP target data regions (Figure 2). The target construct transfers control to the device. The first region transfers the input matrices [map(tofrom:a)] to the device memory; dispatches the in-place LU factorization (sgetrf_batch_strided) to the device; and retrieves the LU-factored matrices (now stored in a), pivot indices [map(from:ipiv)], and status [map(from:info)] from device memory. If factorization succeeded, the program proceeds to the next OpenMP region. The LU-factored matrices and pivot indices are transferred back to the device memory [map(to:a) and map(to:ipiv)], the out-of-place inverses are computed on the device (sgetri_oop_batch_strided), and the inverses [map(from:ainv)] and status [map(from:info)] are retrieved from the device memory.

```fortran
! Compute LU factorization via OpenMP offload.
! On entry, A contains the input matrices.
! On exit, it contains the LU factorizations.
!$omp target data map(tofrom:a) map(from:ipiv) map(from:info)

!$omp dispatch
call sgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, &
                          batch_size, info)

!$omp end target data

if (any(info .ne. 0)) then
   print *, 'Error: sgetrf_batch_strided returned with errors.'
else
   ! Compute matrix inverse via OpenMP offload.
   ! On exit, the inverse is in Ainv.
   !$omp target data map(to:a) map(to:ipiv) map(from:ainv) map(from:info)

   !$omp dispatch
   call sgetri_oop_batch_strided(n, a, lda, stride_a, ipiv, stride_ipiv, &
                                 ainv, ldainv, stride_ainv, batch_size, info)

   !$omp end target data
endif

if (any(info .ne. 0)) then
   print *, 'Error: sgetri_oop_batch_strided returned with errors.'
else
   print '("The calculations executed successfully.")'
endif
```

Figure 2. Offloading the LAPACK calculations to an accelerator using two OpenMP target regions, one for each LAPACK subroutine. OpenMP directives are highlighted in blue. LAPACK calls are highlighted in green.
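As noted above, the double-precision conversion is mechanical. Here’s a minimal sketch of just the pieces that change, assuming the rest of the Figure 1 setup stays the same:

```fortran
! Double-precision variant: only the array type and routine names change.
double precision, allocatable :: a(:,:), ainv(:,:)

! ... same allocation, dimension, and stride setup as in Figure 1 ...

!$omp target data map(tofrom:a) map(from:ipiv) map(from:info)
!$omp dispatch
call dgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, &
                          batch_size, info)
!$omp end target data
```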

The author of this example program may have used two OpenMP target regions to check the results of factorization before proceeding to inversion. The programmer’s caution is admirable, but it causes extra host-device data transfer, which is something we really want to avoid in heterogeneous parallel programming. Every arrowhead in Figure 3 represents data movement between disjoint memories. For example, the input matrices are transferred to the device, then the factors are transferred back to the host [map(tofrom:a)]. The same factors are then transferred back to the device [map(to:a)] in the next OpenMP target data region. Similarly, the pivot indices are retrieved from the device [map(from:ipiv)] after sgetrf_batch_strided returns, then transferred back to the device [map(to:ipiv)] for input to sgetri_oop_batch_strided. These matrices are presumably large if you’re going through the hassle of heterogeneous computing, so transferring them from host to device and vice versa could get expensive in terms of compute and power overhead. The one-dimensional status array is not that large, so I’m not as bothered by the map(from:info) transfer.

Figure 3. Host-device data transfer for the two OpenMP target offload regions shown in Figure 2. Each arrowhead indicates data movement between the host and device memories.

Ideally, the OpenMP runtime will check if the necessary data is already available in the device memory. However, there’s no guarantee that it won’t be flushed or overwritten between the OpenMP target regions. We also need to consider the overhead of creating two OpenMP offload regions. It’s better to do the entire computation in one offload region (Figure 4).

```fortran
!$omp target data map(to:a) map(from:ainv) map(from:info_rf) map(from:info_ri)

!$omp dispatch
call sgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, &
                          batch_size, info_rf)

!$omp dispatch
call sgetri_oop_batch_strided(n, a, lda, stride_a, ipiv, stride_ipiv, &
                              ainv, ldainv, stride_ainv, batch_size, info_ri)

!$omp end target data

if (any(info_rf .ne. 0)) then
   print *, 'Error: sgetrf_batch_strided returned with errors.'
elseif (any(info_ri .ne. 0)) then
   print *, 'Error: sgetri_oop_batch_strided returned with errors.'
else
   print '("Computation executed successfully.")'
endif
```

Figure 4. Offloading the LAPACK calculations to an accelerator using a single OpenMP target region. OpenMP directives are highlighted in blue. LAPACK calls are highlighted in green.

We can see that there’s less host-device data transfer when we fuse the two OpenMP target regions into one (Figure 5). The input matrices (a) are transferred to the device and factored in place. The LU factors and pivots (a, ipiv) are computed by sgetrf_batch_strided on the device, so they’re already in device memory when sgetri_oop_batch_strided is called. We don’t need to transfer them back to the host unless we plan to use them later to solve a linear system. If that’s the case, however, we could just call sgetrs_batch_strided in the same OpenMP target data region while the necessary input data is already in device memory. I’ll leave that as an exercise for the reader.

Figure 5. Host-device data transfer for the OpenMP target offload region shown in Figure 4. Each arrowhead (except the one inside the accelerator device) indicates data movement between the host and device memories.
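If you do want to try the sgetrs_batch_strided exercise, the shape of it is below. Treat this as a sketch rather than tested code: I’m assuming nrhs, b, ldb, stride_b, and info_rs are declared and initialized following the same pattern as the existing variables, and you should check the oneMKL Fortran reference for the exact sgetrs_batch_strided argument list.

```fortran
! Sketch: factor and solve A*x = b for a batch, all in one target data
! region. The factors (a) and pivots (ipiv) stay resident on the device
! between the two calls, so no extra host-device transfer is needed.
!$omp target data map(to:a) map(tofrom:b) map(from:info_rf) map(from:info_rs)

!$omp dispatch
call sgetrf_batch_strided(n, n, a, lda, stride_a, ipiv, stride_ipiv, &
                          batch_size, info_rf)

!$omp dispatch
call sgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, ipiv, &
                          stride_ipiv, b, ldb, stride_b, batch_size, info_rs)

!$omp end target data
```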

## Try it Yourself

The example programs described in this article are available at this repository. They contain the code to initialize the input matrices and display the output from each step of the computation, which I did not include in this article. The commands to build the example programs using the Intel® Fortran Compiler and oneMKL, with and without OpenMP, are as follows:

```shell
$ ifx -i8 -fpp -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl \
    lu_fact_omp_offload_example.f90 -o lu_fact_omp_offload_example.exe \
    -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread \
    -lmkl_core -liomp5 -lpthread -ldl

$ ifx -i8 -fpp -DMKL_ILP64 \
    lu_fact_omp_offload_example.f90 -o lu_fact_example.exe \
    -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread \
    -lmkl_core -liomp5 -lpthread -ldl
```