Application Notes for oneMKL Summary Statistics

ID 772991
Date 3/31/2023
Public

Processing Data in Blocks

Summary Statistics enables block-based data analysis that can help you:

  1. compute statistical estimates for out-of-memory datasets, splitting them into blocks

  2. analyze in-memory data arrays that become available block by block

  3. tune your applications for out-of-memory data support

To compute statistical estimates for out-of-memory datasets, do the following:

  1. Initialize the estimates of interest to zero or to any other meaningful value:

    for( i = 0; i < p; i++ )
    {
         Xmean[i] = 0.0;
         Raw2Mom[i] = 0.0;
         Central2Mom[i] = 0.0;
         for(j = 0; j < p; j++)
         {
             Cov[i][j] = 0.0;
         }
    }

  2. Initialize array W of size 2 with zero values.

    This array holds accumulated weights that are important for correct computation of the estimates:

    W[0] = 0.0; W[1] = 0.0;

  3. Get the first portion of the dataset into array X, and the corresponding weights into array weights:

    GetNextDataChunk( X, weights );

  4. Follow the common usage model of the Summary Statistics algorithms:

    /* Create a task */
    xstorage = VSL_SS_MATRIX_STORAGE_COLS;
    errcode = vsldSSNewTask( &task, &p, &nblock,
                             &xstorage, X, weights, indices );
     
    /* Edit the task parameters */
    errcode = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );
    errcode = vsldSSEditTask( task, VSL_SS_ED_VARIATION, Variation );
    errcode = vsldSSEditMoments( task, Xmean, Raw2Mom, 0, 0, Central2Mom, 0, 0 );
     
    covstorage = VSL_SS_MATRIX_STORAGE_FULL;
    errcode = vsldSSEditCovCor( task, Xmean, (double*)Cov, &covstorage, 0, 0 );
     
    /* Compute the estimates for the dataset split into chunks */
    estimates = VSL_SS_MEAN | VSL_SS_2C_MOM | VSL_SS_COV | VSL_SS_VARIATION;
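    for( nchunk = 0; ; nchunk++ )
    {
         /* Update the estimates with the block currently held in X */
         errcode = vsldSSCompute( task, estimates, VSL_SS_METHOD_1PASS );
         if ( nchunk >= N ) break;
         /* Load the next block into the same arrays */
         GetNextDataChunk( X, weights );
    }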
     
    /* Deallocate task resources */
    errcode = vslSSDeleteTask( &task );
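
To illustrate why the estimates and the accumulated weights in W must persist between blocks, here is a minimal plain-C sketch of a block-wise weighted mean and variance. It is not the library's implementation: the names block_stats, block_stats_init, block_stats_update, and block_stats_variance are hypothetical, the update rule is a standard incremental weighted formula, and the normalization by W[0] - W[1]/W[0] is an assumption chosen for the example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical accumulator for illustration only; not a oneMKL type. */
typedef struct {
    double W[2];  /* W[0]: sum of weights, W[1]: sum of squared weights */
    double mean;  /* running weighted mean */
    double m2;    /* running weighted sum of squared deviations from the mean */
} block_stats;

/* Step 1 and step 2 of the procedure: zero the estimates and the weights. */
static void block_stats_init(block_stats *s)
{
    s->W[0] = 0.0;
    s->W[1] = 0.0;
    s->mean = 0.0;
    s->m2 = 0.0;
}

/* Fold one block of n observations x[] with weights w[] into the running
   state. Observations with non-positive weight are skipped. */
static void block_stats_update(block_stats *s, const double *x,
                               const double *w, size_t n)
{
    assert(s && x && w);
    for (size_t i = 0; i < n; i++) {
        if (w[i] <= 0.0) continue;
        double w_new = s->W[0] + w[i];
        double delta = x[i] - s->mean;
        s->mean += (w[i] / w_new) * delta;
        s->m2 += w[i] * delta * (x[i] - s->mean);
        s->W[0] = w_new;
        s->W[1] += w[i] * w[i];
    }
}

/* Variance estimate normalized by W[0] - W[1]/W[0] (assumed convention);
   with unit weights this reduces to division by n - 1. */
static double block_stats_variance(const block_stats *s)
{
    if (s->W[0] <= 0.0) return 0.0;
    double denom = s->W[0] - s->W[1] / s->W[0];
    return denom > 0.0 ? s->m2 / denom : 0.0;
}
```

Calling block_stats_update once per block gives the same mean and variance as a single pass over the whole dataset, provided the state is not reset between calls. Resetting W between blocks would make each block look like the start of a new dataset.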

The Summary Statistics domain also lets you read the next data block into a different array. The overall computation scheme remains the same; you only need to pass the address of the new data block to the library:

double* NextXChunk[N];
estimates = VSL_SS_MEAN | VSL_SS_2C_MOM | VSL_SS_COV | VSL_SS_VARIATION;
for( nchunk = 0; ; nchunk++ )
{
     /* Update the estimates with the block currently registered in the task */
     errcode = vsldSSCompute( task, estimates, VSL_SS_METHOD_1PASS );
     if ( nchunk >= N ) break;
     /* Read the next block into its own array and register its address */
     GetNextDataChunk( NextXChunk[nchunk], weights );
     errcode = vsldSSEditTask( task, VSL_SS_ED_OBSERV, NextXChunk[nchunk] );
}
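
The pointer-switching pattern above can be sketched in plain C without the library. The toy_task type and the toy_new_task, toy_edit_observ, toy_compute, and toy_mean_over_blocks functions below are hypothetical stand-ins that only mimic the shape of the task, edit, and compute calls; they compute a running mean and are not oneMKL APIs.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for a task: it remembers where the current block
   lives and keeps the running state between compute calls. */
typedef struct {
    const double *x;  /* address of the current block */
    size_t n;         /* number of observations per block */
    double count;     /* accumulated number of observations */
    double mean;      /* running mean over all blocks seen so far */
} toy_task;

static void toy_new_task(toy_task *t, const double *x, size_t n)
{
    t->x = x;
    t->n = n;
    t->count = 0.0;
    t->mean = 0.0;
}

/* Analogous to registering a new observation address: only the pointer
   changes, the accumulated state is untouched. */
static void toy_edit_observ(toy_task *t, const double *x)
{
    t->x = x;
}

/* Fold the currently registered block into the running mean. */
static void toy_compute(toy_task *t)
{
    for (size_t i = 0; i < t->n; i++) {
        t->count += 1.0;
        t->mean += (t->x[i] - t->mean) / t->count;
    }
}

/* Process nblocks separately stored blocks, switching the pointer
   between compute calls. */
static double toy_mean_over_blocks(const double *const *blocks,
                                   size_t nblocks, size_t n)
{
    assert(nblocks > 0);
    toy_task t;
    toy_new_task(&t, blocks[0], n);
    for (size_t k = 0; ; k++) {
        toy_compute(&t);
        if (k + 1 >= nblocks) break;
        toy_edit_observ(&t, blocks[k + 1]);
    }
    return t.mean;
}
```

Because the running state lives in the task rather than in the data array, the blocks do not need to be contiguous or to reuse the same buffer. Each block only has to stay valid until the compute call that consumes it returns.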

For the list of estimators that support processing datasets in blocks, see the table "VS Summary Statistics Estimates Obtained with Compute Routine" in the Summary Statistics section of [MKLMan].
