How to Transition to Sparse BLAS Inspector-Executor API: Sparse Matrix-Matrix Multiplication

In this article we show how to translate calls to deprecated Sparse BLAS API (NIST style) into the calls to Inspector-Executor Sparse BLAS for one of the commonly used sparse functionality, sparse matrix-matrix multiplication. For simplicity, we consider double precision data type.

Specifically, we compare mkl_dcsrmultcsr from Sparse BLAS API with mkl_sparse_spmm and mkl_sparse_sp2m routines from IE SpBLAS and show when and how mkl_dcsrmultcsr calls can be replaced by mkl_sparse_spmm and mkl_sparse_sp2m.

First, let’s have a look at the APIs:

  • Sparse BLAS NIST style API for mkl_dcsrmultcsr 
    /* Computes product of two sparse matrices stored in the CSR format (3-array variation) with one-based indexing (deprecated) */
    void mkl_dcsrmultcsr(const char *transa,
                         const MKL_INT *request,
                         const MKL_INT *sort,
                         const MKL_INT *m,
                         const MKL_INT *n,
                         const MKL_INT *k,
                         double *a,
                         MKL_INT *ja,
                         MKL_INT *ia,
                         double *b,
                         MKL_INT *jb,
                         MKL_INT *ib,
                         double *c,
                         MKL_INT *jc,
                         MKL_INT *ic,
                         const MKL_INT *nnzmax,
                   MKL_INT *ierr);
    
  • Sparse BLAS Inspector-Executor API for mkl_sparse_spmm and mkl_sparse_sp2m

    /* Computes product of sparse matrices: C = op(A) * B, result is sparse */
    sparse_status_t mkl_sparse_spmm(const sparse_operation_t operation,
                                    const sparse_matrix_t A,
                                    const sparse_matrix_t B,
                                    sparse_matrix_t *C);
    
    /* Computes product of sparse matrices: C = opA(A) * opB(B), result is sparse */
    sparse_status_t mkl_sparse_sp2m(const sparse_operation_t transA,
                                    const struct matrix_descr descrA,
                                    const sparse_matrix_t A,
                                    const sparse_operation_t transB,
                                    const struct matrix_descr descrB,
                                    const sparse_matrix_t B,
                                    const sparse_request_t request,
                                    sparse_matrix_t *C);
    

Second, we clarify the difference in how the APIs handle memory. 

  • For mkl_dcsrmultcsr the user must allocate memory for the output matrix (ic, jc and c) on the side of the application. To understand how much memory is needed for the output, multiple calls can be done with different values of “request” parameter.
  • For mkl_sparse_spmm and mkl_sparse_sp2m, oneMKL allocates memory for the output matrix (ic jc and c). In order to use a custom memory allocator, the user can use custom i_malloc/i_free routines

Note: Using i_malloc has a global effect on how memory is allocated by all oneMKL functionality.

After the computations are finished, the user can access the output arrays by means of calling mkl_sparse_?_export_csr which will let the user access the pointers to the output arrays ic, jc and c.

The output arrays will be “owned” by the matrix handle and deallocated at the time when mkl_sparse_matrix_destroy will be called for C.

For both APIs, temporary memory buffers are allocated by oneMKL (some are deallocated after the execution, others can be deallocated using mkl_free_buffers routine).

Key Takeaway: “new” APIs can only replace the old APIs provided that the user is willing to allow IE SpBLAS routines to allocate memory for the output data.

Now we describe how to convert input arguments for mkl_dcsrmultcsr into the IE SpBLAS parameters.

  • transa (flag which defines operation on matrix A):
    • is included as a sparse_operation_t parameter of mkl_sparse_spmm and mkl_sparse_sp2m.
      char *transa sparse_operation_t operation
      “N” SPARSE_OPERATION_NON_TRANSPOSE
      “T” SPARSE_OPERATION_TRANSPOSE
      “C” SPARSE_OPERATION_CONJUGATE_TRANSPOSE
  • request (flag which defines the behavior of the routine):
    • is described below in a separate section.
  • sort:
    • mkl_sparse_spmm and mkl_sparse_sp2m require indices of input matrices to be sorted and produce unsorted indices for the output matrix. If input matrices are unsorted or output needs to be sorted, mkl_sparse_order should be called appropriately before/after the call to spmm or sp2m.
  • m, n, k:
    • are encoded in the input sparse_matrix_t A and B.
  • ia, ja, a:
    • are stored inside the sparse_matrix_t A. For creating a matrix handle of type sparse_matrix_t from CSR data, mkl_sparse_?_create_csr should be used.
  • ib, jb, b:
    • are stored inside the sparse_matrix_t B. For creating a matrix handle of type sparse_matrix_t from CSR data, mkl_sparse_?_create_csr should be used.
  • ic, jc, c:
    • will be computed as a part of the output matrix handle C.
  • nzmax:
    • can be obtained as ic[nrowsC] – ic[0] where ic can be retrieved from the output matrix sparse_matrix_t C by calling mkl_sparse_?_export_csr after the computations are done by spmm or sp2m routine.
  • info:
    • is returned as sparse_status_t return value of spmm or sp2m routine.

Now we describe how to translate the values of the “request” parameter. For each possible case request = 0, 1 or 2 we provide a pseudocode as an example.
Explanation of “request” parameter in mkl_dcsrmultcsr form Sparse BLAS NIST style API:

  • If request=0, the routine performs multiplication, the memory for the output arrays ic, jc, c must be allocated beforehand.
  • If request=1, the routine computes only values of the array ic of length m + 1, the memory for this array must be allocated beforehand. On exit the value ic[m] - 1 is the actual number of the elements in the arrays c and jc.
  • If request=2, the routine has been called previously with the parameter request=1, the output arrays jc and c are allocated in the calling program and they are of the length ic[m] - 1 at least.

Mapping of each case:

  1. request = 0: all output is computed at once
    Short answer: mkl_sparse_spmm can be called, or mkl_sparse_sp2m with sparse_request_t SPARSE_STAGE_FULL_MULT.
    Example:
    /* ia, ja, a and ib, jb, b are ready
    ic, jc, c must be allocated beforehand with at least nzmax elements
    reserved for jc and c */
    mkl_dcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
    check(info);
    should be translated into:
    /* ia, ja, a are put into the matrix handle A
    ib, jb, b are put into the matrix handle B
    e.g. mkl_sparse_?_create_csr can be used for that */
    sparse_matrix_t C = NULL;
    status = mkl_sparse_spmm(operation, A, B, &C);
    check(status);
    
    MKL_INT *ic = NULL, *c_rows_end = NULL, *jc = NULL;
    double *c = NULL;
    
    status = mkl_sparse_d_export_csr(C, &indexing, &nrows, &ncols, &ic, &c_rows_end, &jc, &c);
    check(status);
    
    /* Note: mkl_sparse_?_export_csr does not perform a memory copy, so the returned pointers are a part of the matrix handle C */
    
    /* <once ic, jc and c are not needed anymore> */
    mkl_sparse_matrix_destroy(C); // will deallocate ic, jc and c
  2. request = 1: only ic needs to be computed.
    Example:
    /* ia, ja, a and ib, jb, b are ready
    ic must be allocated beforehand with at least (nrowsC + 1) elements */
    mkl_dcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
    check(info);
    should be translated into:
    /* ia, ja, a are put into the matrix handle A
    ib, jb, b are put into the matrix handle B
    e.g. mkl_sparse_?_create_csr can be used for that */
    
    sparse_matrix_t C = NULL;
    status = mkl_sparse_sp2m(transA, descrA, A, transB, descrB, B, SPARSE_STAGE_NNZ_COUNT, &C);
    check(status);
    
    MKL_INT *ic = NULL, *c_rows_end = NULL, *jc = NULL;
    double *c = NULL;
    status = mkl_sparse_d_export_csr(C, &indexing, &nrows, &ncols, &ic, &c_rows_end, &jc, &c);
    check(status);
    int nzmax = ic[nrows] – ic[0]; // if needed 
    
    /* Note: in this case, ic[0] = indexing; also, ic[nrows] = c_rows_end[nrows-1] */
    After the call to mkl_sparse_sp2m, the matrix handle C will have a correctly computed ic (memory for ic will be allocated by mkl_malloc inside sp2m call) but will not have any memory buffers associated with jc or c.
  3. request = 2: array ic was allocated and computed before with request = 1 (or was known from somewhere); arrays jc and c are computed.
    Short answer: mkl_sparse_sp2m needs to be called with sparse_request_t SPARSE_STAGE_FINALIZE_MULT, and prior to this stage sp2m must have been called with sparse_request_t SPARSE_STAGE_NNZ_COUNT (see request=1 case above).
    Example:
    /* ia, ja, a and ib, jb, b are ready
    ic  is already available (e.g., from a prior call to mkl_dcsrmultcsr with request = 1).
    Arrays jc and c must be allocated beforehand with at least 
    (ic[nrowsC] - 1) elements 
    Note: mkl_dcsrmultcsr only supports 1-based indexing hence the 1 above */
    
    mkl_dcsrmultcsr(&trans, &request, &sort, &m, &n, &k, a, ja, ia, b, jb, ib, c, jc, ic, &nzmax, &info);
    check(info);
    should be translated into:
    /* ia, ja, a are put into the matrix handle A
    ib, jb, b are put into the matrix handle B
    e.g. mkl_sparse_?_create_csr can be used for that.
    mkl_sparse_sp2m with request=SPARSE_STAGE_NNZ_COUNT must have been 
    called before this call */
    
    status = mkl_sparse_sp2m(transA, descrA, A, transB, descrB, B, SPARSE_STAGE_FINALIZE_MULT, &C);
    check(status);
    
    /* if raw pointers are needed */
    MKL_INT *ic = NULL, *c_rows_end = NULL, *jc = NULL;
    double *c = NULL;
    status = mkl_sparse_d_export_csr(C, &indexing, &nrows, &ncols, &ic, &c_rows_end, &jc, &c);
    check(status);
    After the call to the new API, the matrix handle C will have all three arrays (ic allocated and computed during the prior call with sparse_request_t SPARSE_STAGE_NNZ_COUNT and jc and c allocated and computed during the last call with sparse_request_t SPARSE_STAGE_FINALIZE_MULT.

Note: Although it is not required, we encourage the users to always check for errors (either the info parameter for mkl_dcsrmultcsr or the return value of mkl_sparse_spmm and mkl_sparse_sp2m). In the pseudocode above we indicated this by calling some “check” routine after the calls to Sparse BLAS functionality.

Authors

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.