## Computing Quantiles for Streaming Data

Summary Statistics provides a method for computing quantiles that supports out-of-memory data.

In a nutshell, this method permits getting an *e*-approximate quantile in one pass over the dataset, without knowing the total size of the dataset in advance. The *e*-approximate quantile is an element in the dataset with the rank within the interval *[r-**e**n, r+**e**n]* for a user-provided error *e*, size of dataset *n* and any rank *r=1,...,n.*

[Zhang2007] describes the theory and properties of this algorithm. For details on computational aspects, requirements, and usage model of the algorithm, see Using `VSL_SS_METHOD_SQUANTS_ZW` for Quantiles Computation chapter of this document.

**Example:**

Consider a simple application that uses this algorithm. The dimension of the task is set to *p*=1. The total number of observations is set to *n=*10,000,000. The dataset is returned in blocks of 10,000 elements each. The goal is to compute deciles with a pre-defined error *e* <= 0.00001, that is, the array elements with ranks deviating from the rank of accurate deciles by not more than 100 positions.

The library contains a special editor for this algorithm:

status = vsldSSEditStreamQuantiles ( task, &quant_order_n, quant_order, quant, &n_params, ¶ms );

where

`quant_order_n`is the total number of quantiles, set to 9 in this example.`quant_order`is an array initialized with quantile orders 0.1, 0.2,..., 0.9.`¶ms`is an array of parameters. The only element in the array is the user-defined error*e*, set to 0.00001 in this example.

The computation results are placed into the `quants` array.

To initialize the size of the array that contains parameters of the algorithm, you can use the macro defined in the library:

n_params=VSL_SS_SQUANTS_ZW_PARAMS_N

The loop for computing the deciles is as follows:

for ( block_index = 0; block_index < max_block_num; block_index++ ) { // Get the next data block of size block_n ... status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS, VSL_SS_METHOD_SQUANTS_ZW ); // Process computation results ... }

Intermediate estimates of deciles are obtained immediately after processing the next block. As the dataset contains Gaussian numbers with the mean equal to 0 and the variance equal to 1, the sequence of the estimates is as follows:

Block Streaming deciles: index D1 D2 D3 D4 D5 D6 D7 D8 D9 1 -1.2671 -0.8442 -0.5257 -0.2667 -0.0115 0.2524 0.5391 0.8496 1.2695 2 -1.2880 -0.8478 -0.5374 -0.2766 -0.0192 0.2400 0.5131 0.8327 1.2690 3 -1.2848 -0.8386 -0.5261 -0.2656 -0.0110 0.2428 0.5163 0.8366 1.2704 ... ... 998 -1.2815 -0.8414 -0.5241 -0.2531 0.0009 0.2536 0.5248 0.8412 1.2814 999 -1.2815 -0.8414 -0.5241 -0.2531 0.0009 0.2536 0.5248 0.8413 1.2814 1000 -1.2815 -0.8414 -0.5241 -0.2531 0.0008 0.2536 0.5248 0.8412 1.2813

If you need to compute the estimate for the whole dataset only, you can use the fast version of the same method. It permits you to update the internal data structure in the library without actual computation of the intermediate estimates.

for ( block_index = 0; block_index < max_block_num; block_index++ ) { // Get the next data block of size block_n ... status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS, VSL_SS_METHOD_SQUANTS_ZW_FAST ); }

To get the estimate, set the `block_n` variable to zero, make sure it is registered in the library, and call the `Compute` routine:

block_n = 0; status = vsldSSCompute( task, VSL_SS_STREAM_QUANTS, VSL_SS_METHOD_SQUANTS_ZW );

The output of this application is identical to the last line of the previous table:

Streaming deciles: D1 D2 D3 D4 D5 D6 D7 D8 D9 -1.28154 -0.84141 -0.52418 -0.25312 0.00088 0.25367 0.52483 0.84129 1.28139

To check the estimates, the in-memory version of the quantile algorithm calculates accurate deciles for the same dataset. This algorithm returns the following output:

"Accurate" deciles: D1 D2 D3 D4 D5 D6 D7 D8 D9 -1.28155 -0.84142 -0.52417 -0.25311 0.00089 0.25368 0.52484 0.84130 1.28140

The maximal difference between ranks of in-memory and out-of-memory deciles does not exceed 100, which fully aligns with the theory:

Rank difference D1 D2 D3 D4 D5 D6 D7 D8 D9 4 5 3 1 3 7 8 7 4

Product and Performance Information |
---|

= = = = = = = = = = Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Notice revision #20201201 = = = = = = = = = = |