Application Notes for oneMKL Summary Statistics

ID 772991
Date 3/31/2023
Public
Document Table of Contents

Basic Components of the MI Method

Summary Statistics MI method comprises two components:

  1. Expectation Maximization (EM) algorithm that computes the start point for the Data Augmentation (DA) algorithm

  2. Simulation-based DA function that uses oneMKL random number generators

The parameters of the MI method are packed into the params array. See Table Structure of the Array of MI Parameters in the Summary Statistics section of [MKLMan] for the description of the parameters and their location in the array.

Beside the array of MI parameters, you can pass initial estimates of the mean and variance-covariance matrix (start point) into the EM algorithm. The array of means and variance-covariance matrix are packed as a one-dimensional array init_estimates. The size of the array should be at least p+p(p+1)/2. The package format is as follows:

  1. For i=0,..,p-1, init_estimates[i] contains the start estimate of means.

  2. The remaining positions of the array store the upper-triangular part of the variance-covariance matrix.

If you do not provide the start point for the EM algorithm, it uses the default start point, which is the vector of zero means, and the unitary matrix as a variance-covariance matrix.

You can pass into the library prior parameters for μ and Σ: μ0, τ, m, and Λ-1. As the DA function uses inverted matrix Λ, the MI algorithm expects the inverse of Λ. These parameters are packed as a one-dimensional array prior. The size of the array should be at least (p2 + 3p +4)/2 to hold all the parameters. The storage format is as follows:

  1. prior[0],..., prior[p-1] contain elements of vector μ0.

  2. prior[p] contains parameter τ.

  3. prior[p+1] contains parameter m.

  4. The remaining positions contain the upper-triangular part of the inverted matrix Λ-1.

If the prior parameters are not provided, the algorithm uses their default values:

  1. μ0 is set to an array of p zeros.

  2. τ is set to 0.

  3. m is set to p.

  4.  for the initial approximate of Λ-1, the zero matrix is used.

Proper processing of the default parameter values ensures correct computation.

The MI algorithm returns m sets of imputed values and/or a sequence of parameter estimates drawn during the DA procedure. The imputed values are returned as a single array simul_missing_vals. The size of the array should be sufficient to hold m sets, each of size missing_values_num, that is, m * missing_values_num in total. Imputed values are packed one by one in the order of their appearance in the matrix of observations.

Example:

Consider a task of dimension 4 with the total number of observations n=10, where the second vector misses values for the first and the second variables, and the seventh observation misses the first point. The number of sets to impute is m=2. Thus, simul_missing_vals[0] and simul_missing_vals[1] contain the first and the second points for the second observation vector, simul_missing_vals[2] holds the first point for the seventh observation. Similarly, positions 3, 4, and 5 are reserved for the second set of simulated values.

To estimate convergence of the DA algorithm and choose a proper value for DA iterations, you may generate a sequence of parameter estimates produced during the DA procedure. Elements of the sequence can then be analyzed to estimate convergence of the algorithm. For example, see [Schafer1998].

The sequence of the parameters is returned as a single array. The size of the array should be m*da_iter_num*(p + (p2 + p)/2)

where

  1. m is the number of sets of values to impute

  2. da_iter_num is the number of DA iterations

  3. p + (p2 + p)/2 is the size of the memory to hold one set of the parameter estimates.

Each set of parameters is packed as follows:

  1. The vector of means occupies the first p positions.

  2. The upper-triangular part of the variance-covariance matrix occupies the remaining (p2 + p)/2 positions.

Before the call to the MI algorithm, the dataset to be analyzed should be pre-processed and all missing observations should be marked with a predefined parameter, as follows:

  1. VSL_SS_DNAN, if the dataset is stored in double precision floating-point arithmetic

  2. VSL_SS_SNAN, if the dataset is stored in single precision floating-point arithmetic

Upon successful generation of m sets of imputed values, you can place them in cells of the data matrix with missing values and use other Summary Statistics algorithms to analyze and compute estimates for each of the m complete datasets.

NOTE:

oneMKL implementation of the MI algorithm rewrites cells of the dataset that are marked with the VSL_SS_DNAN/VSL_SS_SNAN values.

If the combination of VSL_SS_MISSING_VALS and any other estimate parameter are passed into the Compute routine, only the algorithm for processing missing values is called.

The example below shows a typical MI usage scenario:

#include "mkl_vsl.h"
 
#define DIM    3                     /* dimension of the task */
#define N   10000                    /* number of observations */
#define M    VSL_SS_MI_PARAMS_SIZE  /* number of MI parameters */
#define M_VALUE    9                 /* total number of missing values */
#define M_COPIES   5                 /* number of sets of imputed values */
int main()
{
int status;
VSLSSTaskPtr task;
MKL_INT i;
double x[DIM * N]; /* matrix of observations */
double W[2];
double mean[DIM], r2m[DIM], c2m[DIM];
MKL_INT p, n, xstorage;
double em_iter_num, da_iter_num, em_accuracy, copy_num, missing_value_num; 
double params[M], simul_missing_vals[M_VALUE * M_COPIES];
MKL_INT nparams = M, simul_missing_val_n = M_VALUE * M_COPIES;
 
 /* Pre-process the dataset and mark entries of missing values with VSL_SS_DNAN */
...
 
/* Parameters of the task and initialization */
p = DIM;
n = N;
xstorage = VSL_SS_MATRIX_STORAGE_ROWS;
 
/* Parameters of the MI algorithm */
em_iter_num       = 100;
da_iter_num       = 10;
em_accuracy       = 0.001;
copy_num          = M_COPIES;
missing_value_num = M_VALUE;
params[0] = em_iter_num;
params[1] = da_iter_num;        
params[2] = em_accuracy;       
params[3] = copy_num;          
params[4] = missing_value_num;
 
/* Create a task */
status = vsldSSNewTask( &task, &p, &n, &xstorage, x, 0, 0 );
 
/* Initialize the task parameters */
status = vsldSSEditMissingValues( task, &nparams, params, 0, 0,
                                  0, 0, &simul_missing_val_n,
                                  simul_missing_vals, 0, 0 );
 
/* Generate m_values copies of missing value sets */
status = vsldSSCompute(task, VSL_SS_MISSING_VALS, VSL_SS_METHOD_MI );
 
/* Use the task to analyze the complete datasets */
W[0] = 0.0; W[1] = 0.0;
for ( i = 0; i < p; i++ )
{
mean[i] = 0.0; r2m[i] = 0; c2m[i] = 0.0;
}
status = vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );
status = vsldSSEditMoments( task, mean, r2m, 0,0, c2m, 0, 0 );
for ( i = 0; i < M_COPIES; i++ )
 
{
/* Perform imputation of  the next set of simulated values into x */
...
 
/* Compute the mean and the variance using the fast method */
errcode = vsldSSCompute(task, VSL_SS_MEAN|VSL_SS_2C_MOM,
                              VSL_SS_METHOD_FAST ); 
 
/* Analyze the computed estimates */
...
}
 
/* Deallocate the task resources */
status = vslSSDeleteTask( &task );
 
return 0;
}