K-Means

Intel® oneAPI Data Analytics Library Developer Guide and Reference

Download PDF

ID 772611

Date 12/16/2022

Version

Public

A newer version of this document is available. Customers should click here to go to the newest version.

Visible to Intel only — GUID: GUID-D829A562-90D3-46CC-9A1E-2DE35E04A6CC

View Details

K-Means

The K-Means algorithm solves clustering problem by partitioning n feature vectors into k clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

Operation	Computational methods	Programming Interface
Training	Lloyd’s	train(…)	train_input	train_result
Inference	Lloyd’s	infer(…)	infer_input	infer_result

Mathematical formulation

Training

Given the training set of p-dimensional feature vectors and a positive integer k, the problem is to find a set of p-dimensional centroids that minimize the objective function

where is the squared Euclidean distance from to the closest centroid in C,

Expression denotes norm.

NOTE:

In the general case, d may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

Training method: Lloyd’s

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where t denotes a index of the current iteration, e.g., is the set of centroids at the t-th iteration. The method requires the initial centroids to be specified at the beginning of the algorithm ().

(1) Assignment step: Assign each feature vector to the nearest centroid. denotes the assigned label (cluster index) to the feature vector .

Each feature vector from the training set X is assigned to exactly one centroid so that X is partitioned to k disjoint sets (clusters)

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

The steps (1) and (2) are performed until the following stop condition,

is satisfied or number of iterations exceeds the maximal value T defined by the user.

Inference

Given the inference set of p-dimensional feature vectors and the set of centroids produced at the training stage, the problem is to predict the index , , of the centroid in accordance with a method-defined rule.

Inference method: Lloyd’s

Lloyd’s inference method computes the as an index of the centroid closest to the feature vector ,

Programming Interface

Refer to API Reference: K-Means.

Usage example

Training

kmeans::model<> run_training(const table& data,
                           const table& initial_centroids) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(10)
      .set_max_iteration_count(50)
      .set_accuracy_threshold(1e-4);

   const auto result = train(kmeans_desc, data, initial_centroids);

   print_table("labels", result.get_labels());
   print_table("centroids", result.get_model().get_centroids());
   print_value("objective", result.get_objective_function_value());

   return result.get_model();
}

Inference

table run_inference(const kmeans::model<>& model,
                  const table& new_data) {
   const auto kmeans_desc = kmeans::descriptor<float>{}
      .set_cluster_count(model.get_cluster_count());

   const auto result = infer(kmeans_desc, model, new_data);

   print_table("labels", result.get_labels());
}

Examples

oneAPI DPC++

Batch Processing:

dpc_kmeans_lloyd_dense_batch.cpp

oneAPI C++

Batch Processing:

cpp_kmeans_lloyd_dense_batch.cpp

Python* with DPC++ support

Batch Processing:

kmeans_batch.py

Select Your Language

Using Intel.com Search

Quick Links

Recent Searches

Advanced Search

Only search in

Intel® oneAPI Data Analytics Library Developer Guide and Reference

K-Means

Mathematical formulation

Programming Interface

Usage example

Examples