Distributed Processing
This mode assumes that the data set is split into nblocks blocks across computation nodes.

Parameters
Centroid initialization for K-Means clustering in the distributed processing mode has the following parameters:

Parameter  Method  Default Value  Description

computeStep  any  Not applicable  The parameter required to initialize the algorithm. Can be one of the identifiers of the steps described below: step1Local, step2Master, step2Local, step3Master, step4Local, or step5Master.
algorithmFPType  any  float  The floating-point type that the algorithm uses for intermediate computations. Can be float or double.
method  Not applicable  defaultDense  Available initialization methods for K-Means clustering: defaultDense, randomDense, plusPlusDense, parallelPlusDense, deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR. For more details, see the algorithm description.
nClusters  any  Not applicable  The number of centroids. Required.
nRowsTotal  any  0  The total number of rows in all input data sets on all nodes. Required in the distributed processing mode in the first step.
offset  any  Not applicable  Offset in the total data set specifying the start of a block stored on a given local node. Required.
oversamplingFactor  parallelPlusDense, parallelPlusCSR  0.5  A fraction of nClusters in each of nRounds of parallel K-Means++: nClusters * oversamplingFactor points are sampled in a round. For details, see [Bahmani2012], section 3.3.
nRounds  parallelPlusDense, parallelPlusCSR  5  The number of rounds for parallel K-Means++. nClusters * oversamplingFactor * nRounds must be greater than nClusters. For details, see [Bahmani2012], section 3.3.
firstIteration  plusPlusDense, plusPlusCSR, parallelPlusDense, parallelPlusCSR  false  Set to true if step2Local is called for the first time.
outputForStep5Required  parallelPlusDense, parallelPlusCSR  false  Set to true to have Step 2 calculate outputOfStep2ForStep5.
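The relationship between the block layout and the nRowsTotal, offset, oversamplingFactor, and nRounds parameters can be illustrated with a small sketch. It is plain Python rather than the library API: the helper names block_layout, points_per_round, and enough_candidates are hypothetical, and the check in enough_candidates assumes that the constraint on nRounds applies to the total number of points sampled over all rounds.

```python
def block_layout(block_sizes):
    """Return (nRowsTotal, offsets) for blocks stored back to back."""
    offsets = []
    start = 0
    for size in block_sizes:
        offsets.append(start)  # global index of the first row of the block
        start += size
    return start, offsets


def points_per_round(n_clusters, oversampling_factor=0.5):
    """Number of points sampled in one round of parallel K-Means++."""
    return n_clusters * oversampling_factor


def enough_candidates(n_clusters, oversampling_factor=0.5, n_rounds=5):
    """Assumed check: points sampled over all rounds must outnumber nClusters."""
    return points_per_round(n_clusters, oversampling_factor) * n_rounds > n_clusters


# Three nodes holding 1000, 1500, and 500 rows.
n_rows_total, offsets = block_layout([1000, 1500, 500])
print(n_rows_total, offsets)  # 3000 [0, 1000, 2500]
```

Each node would then receive the same nRowsTotal and its own entry of offsets.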
Centroid initialization for K-Means clustering follows the general schema described in Algorithms.

Step 1 on Local Nodes (deterministic, random, plusPlus, and parallelPlus methods)
In this step, centroid initialization for K-Means clustering accepts the input described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

data  Pointer to the numeric table that represents the i-th data block on the local node. While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

partialCentroids  Pointer to the numeric table with the centroids computed on the local node. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
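To show how offset and nRowsTotal let each node work out which of its own rows become centroids, here is a minimal sketch for the deterministic and random methods. It is not the library implementation. It assumes that the deterministic method takes the first nClusters rows of the whole data set and that, for the random method, every node draws the same global row indices from a shared seed. The names chosen_global_rows and step1_local are hypothetical.

```python
import random


def chosen_global_rows(method, n_rows_total, n_clusters, seed=0):
    """Global row indices of the initial centroids (illustrative rule)."""
    if method == "deterministic":
        return list(range(n_clusters))  # assumed: the first nClusters rows
    if method == "random":
        # Every node uses the same seed, so all nodes draw the same indices.
        rng = random.Random(seed)
        return sorted(rng.sample(range(n_rows_total), n_clusters))
    raise ValueError("this sketch covers deterministic and random only")


def step1_local(block, offset, method, n_rows_total, n_clusters, seed=0):
    """Return the rows of this block that were chosen as centroids."""
    rows = chosen_global_rows(method, n_rows_total, n_clusters, seed)
    return [block[g - offset] for g in rows if offset <= g < offset + len(block)]


# Ten rows split into two blocks that start at offsets 0 and 4.
data = [[float(i)] for i in range(10)]
blocks = [(data[:4], 0), (data[4:], 4)]
partial = [step1_local(b, off, "random", 10, 3, seed=7) for b, off in blocks]
print(partial)
```

A node whose block contains none of the chosen rows returns an empty list, so the partial results from all nodes together hold exactly nClusters rows.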
Step 2 on Master Node (deterministic and random methods)
This step is applicable for deterministic and random methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

partialResults  A collection that contains results computed in Step 1 on local nodes (two numeric tables from each local node).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

centroids  Pointer to the numeric table with centroids. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
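The merge performed on the master node can be pictured with the following sketch. It is a simplification under a stated assumption: the two numeric tables sent by each local node are modeled as a list of candidate rows plus the number of valid rows in it. That pairing and the name step2_master are hypothetical and are not taken from the library.

```python
def step2_master(partial_results, n_clusters):
    """Merge per-node partial results into one list of centroids."""
    centroids = []
    for partial_centroids, n_found in partial_results:
        # Keep only the rows that the node reported as valid.
        centroids.extend(partial_centroids[:n_found])
    if len(centroids) != n_clusters:
        raise ValueError(
            "expected %d centroids, got %d" % (n_clusters, len(centroids))
        )
    return centroids


# Node 0 found two centroids, node 1 found one.
merged = step2_master([([[0.0], [1.0]], 2), ([[7.0]], 1)], 3)
print(merged)  # [[0.0], [1.0], [7.0]]
```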
Step 2 on Local Nodes (plusPlus and parallelPlus methods)
This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

data  Pointer to the numeric table that represents the i-th data block on the local node. While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.
inputOfStep2  The value of m is defined by the method and iteration of the algorithm. This input can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.
internalInput  Pointer to the DataCollection object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4. The DataCollection is created in Step 2 when firstIteration is set to true. After that, set the DataCollection from the partial result as the input for the next local steps (Step 2 and Step 4).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

outputOfStep2ForStep3  Pointer to the numeric table that contains the overall error accumulated on the node. For a description of the overall error, see K-Means Clustering Details.
outputOfStep2ForStep5  Applicable for parallelPlus methods only and calculated when outputForStep5Required is set to true. Pointer to the numeric table with the ratings of centroid candidates computed on the previous steps. For a description of ratings, see K-Means Clustering Details.
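The overall error accumulated on a node can be sketched as follows. The sketch assumes the usual K-Means++ definition, that is, the sum over the local rows of the squared distance to the nearest centroid chosen so far. The cached list of distances plays the role of the internal data that is created on the first iteration and reused afterwards. The function names are hypothetical and the code is not the library implementation.

```python
import math


def sq_dist(x, c):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def step2_local(block, new_centroids, min_sq_dist=None):
    """Update cached nearest-centroid distances and return the overall error.

    new_centroids must contain at least one row.
    """
    if min_sq_dist is None:  # corresponds to firstIteration set to true
        min_sq_dist = [math.inf] * len(block)
    updated = [
        min(d, min(sq_dist(x, c) for c in new_centroids))
        for x, d in zip(block, min_sq_dist)
    ]
    return updated, sum(updated)


block = [[0, 0], [3, 4]]
cache, error = step2_local(block, [[0, 0]])
print(cache, error)  # [0, 25] 25
cache, error = step2_local(block, [[3, 4]], cache)
print(cache, error)  # [0, 0] 0
```

Only the newly added centroids are compared on later calls, because the cache already holds the minimum over the earlier ones.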
By default, these results are objects of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.

Step 3 on Master Node (plusPlus and parallelPlus methods)
This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

inputOfStep3FromStep2  A key-value data collection that maps parts of the accumulated error to the local nodes: the i-th element of this collection is a numeric table that contains the overall error accumulated on the i-th node.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

outputOfStep3ForStep4  Note that Step 3 may produce no input for Step 4 on some local nodes, which means the collection may not contain the i-th node entry. The numeric table for the i-th node contains a single element, which is defined in terms of the overall error calculated on that node. For a description of the overall error, see K-Means Clustering Details. This value defines the probability to sample a new centroid on the i-th node.
outputOfStep3ForStep5  Applicable for parallelPlus methods only. Pointer to the service data to be used in Step 5.
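One way to picture how the master node turns the per-node errors into input for Step 4 is the sketch below. It draws a single new centroid, selects a node with probability proportional to its overall error, and hands that node a value between zero and its error. This is an assumed scheme for illustration only, and the name step3_master and the dictionary layout are hypothetical. It does show why some nodes may receive no entry.

```python
import random


def step3_master(node_errors, rng):
    """Pick one node with probability proportional to its overall error.

    Returns {node_index: value}, where 0 <= value <= node_errors[node_index].
    """
    total = sum(node_errors)
    if total <= 0:
        raise ValueError("every point already coincides with a centroid")
    u = rng.random() * total
    for node, err in enumerate(node_errors):
        if u < err:
            return {node: u}  # only this node gets input for Step 4
        u -= err
    # Guard against floating-point round-off in the subtraction above.
    last = max(i for i, e in enumerate(node_errors) if e > 0)
    return {last: node_errors[last]}


out = step3_master([0.0, 5.0, 0.0], random.Random(1))
print(out)  # a single entry for node 1, the only node with a nonzero error
```

Nodes with zero error are never selected, and all nodes other than the selected one are absent from the returned collection.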
Step 4 on Local Nodes (plusPlus and parallelPlus methods)
This step is applicable for plusPlus and parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

data  Pointer to the numeric table that represents the i-th data block on the local node. While the input for defaultDense, randomDense, plusPlusDense, and parallelPlusDense methods can be an object of any class derived from NumericTable, the input for deterministicCSR, randomCSR, plusPlusCSR, and parallelPlusCSR methods can only be an object of the CSRNumericTable class.
inputOfStep4FromStep3  Pointer to the numeric table with the values calculated in Step 3. The value of m is defined by the method of the algorithm. This input can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.
internalInput  Pointer to the DataCollection object with the internal data of the distributed algorithm used by its local nodes in Step 2 and Step 4. The DataCollection is created in Step 2 when firstIteration is set to true. After that, set the DataCollection from the partial result as the input for the next local steps (Step 2 and Step 4).
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

outputOfStep4  Pointer to the numeric table that contains centroids computed on this local node, where m equals the one in inputOfStep4FromStep3. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.
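How a local node might turn a value received from Step 3 into a new centroid can be sketched as below. The sketch assumes that the node keeps, for each local row, the squared distance to the nearest centroid chosen so far, and that the received value is used as a threshold on the running sum of those distances. Under that assumption a row is selected with probability proportional to its squared distance. The name step4_local is hypothetical and the code is not the library implementation.

```python
def step4_local(block, min_sq_dist, value):
    """Return the row at which the running sum of distances exceeds value."""
    acc = 0.0
    for x, d in zip(block, min_sq_dist):
        acc += d
        if value < acc:
            return x
    # value reached the total (round-off): take the last row that is not
    # already a centroid, that is, the last one with a positive distance.
    for x, d in zip(reversed(block), reversed(min_sq_dist)):
        if d > 0:
            return x
    raise ValueError("no row with a positive distance in this block")


block = [[0.0], [1.0], [2.0]]
min_sq_dist = [0.0, 1.0, 3.0]  # the first row is already a centroid
print(step4_local(block, min_sq_dist, 0.5))  # [1.0]
print(step4_local(block, min_sq_dist, 2.0))  # [2.0]
```

A row with zero distance adds nothing to the running sum, so a row that is already a centroid is never returned.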
Step 5 on Master Node (parallelPlus methods)
This step is applicable for parallelPlus methods only. Centroid initialization for K-Means clustering accepts the input from each local node described below. Pass the Input ID as a parameter to the methods that provide input for your algorithm. For more details, see Algorithms.

Input ID  Input

inputCentroids  parallelPlus method: Each numeric table can be an object of any class derived from NumericTable, except CSRNumericTable, PackedTriangularMatrix, and PackedSymmetricMatrix.
inputOfStep5FromStep2  A data collection with the items calculated in Step 2 on local nodes. For a detailed definition, see outputOfStep2ForStep5 above.
inputOfStep5FromStep3  Pointer to the service data generated as the output of Step 3 on master node. For a detailed definition, see outputOfStep3ForStep5 above.
In this step, centroid initialization for K-Means clustering calculates the results described below. Pass the Result ID as a parameter to the methods that access the results of your algorithm. For more details, see Algorithms.

Result ID  Result

centroids  Pointer to the numeric table with centroids. By default, this result is an object of the HomogenNumericTable class, but you can define the result as an object of any class derived from NumericTable except PackedTriangularMatrix, PackedSymmetricMatrix, and CSRNumericTable.
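The final reduction on the master node can be pictured with the sketch below. It follows the general idea of parallel K-Means++ in [Bahmani2012]: the candidates collected over all rounds are weighted by their ratings and then reduced to nClusters centroids with a weighted K-Means++ pass. It assumes that the ratings of one candidate are summed across nodes and act as sampling weights. The name step5_master and the data layout are hypothetical, and the library may implement this step differently.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two rows."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def step5_master(candidates, ratings_per_node, n_clusters, rng):
    """Reduce weighted candidates to n_clusters centroids (weighted K-Means++).

    Needs at least n_clusters distinct candidates with positive ratings.
    """
    # Assumed aggregation: add up the ratings that each node reported.
    ratings = [sum(column) for column in zip(*ratings_per_node)]
    indices = range(len(candidates))
    chosen = [rng.choices(indices, weights=ratings)[0]]
    while len(chosen) < n_clusters:
        # Weight = rating times squared distance to the nearest chosen row,
        # so rows that are already chosen get weight zero.
        weights = [
            r * min(sq_dist(x, candidates[j]) for j in chosen)
            for x, r in zip(candidates, ratings)
        ]
        chosen.append(rng.choices(indices, weights=weights)[0])
    return [candidates[i] for i in chosen]


candidates = [[0.0], [1.0], [10.0], [11.0]]
ratings_per_node = [[3, 1, 2, 0], [1, 1, 2, 4]]  # one list per local node
centroids = step5_master(candidates, ratings_per_node, 2, random.Random(3))
print(centroids)
```

Candidates with a high rating that lie far from the centroids already chosen are the most likely to be picked next.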