Assign Data to Clusters

Article
05/06/2019

Important

Support for Machine Learning Studio (classic) will end on 31 August 2024. We recommend you transition to Azure Machine Learning by that date.

Beginning 1 December 2021, you will not be able to create new Machine Learning Studio (classic) resources. Through 31 August 2024, you can continue to use the existing Machine Learning Studio (classic) resources.

See information on moving machine learning projects from ML Studio (classic) to Azure Machine Learning.
Learn more about Azure Machine Learning.

ML Studio (classic) documentation is being retired and may not be updated in the future.

Assigns data to clusters using an existing trained clustering model.

Category: Score

Note

Applies to: Machine Learning Studio (classic) only

Similar drag-and-drop modules are available in Azure Machine Learning designer.

Module overview

This article describes how to use the Assign Data to Clusters module in Machine Learning Studio (classic) to generate predictions from a clustering model that was trained with the K-Means clustering algorithm.

The module returns a dataset that contains the probable assignments for each new data point. It also creates a PCA (Principal Component Analysis) graph to help you visualize the dimensionality of the clusters.

Warning

This module replaces the Assign to Clusters (deprecated) module, which is available only for support of older experiments.

How to use Assign Data to Clusters

1. In Machine Learning Studio (classic), locate a previously trained clustering model. You can create and train a clustering model by using either of these methods:
   - Configure the K-means algorithm using the K-Means Clustering module, and then train the model using a dataset and the Train Clustering Model module.
   - Configure a range of options for the K-means algorithm using K-Means Clustering, and then train the model using the Sweep Clustering module.

   You can also add an existing trained clustering model from the Saved Models group in your workspace.
2. Attach the trained model to the left input port of Assign Data to Clusters.
3. Attach a new dataset as input. In this dataset, labels are optional. Because clustering is generally an unsupervised learning method, you are not expected to know the categories in advance.

   However, the input columns must be the same as the columns that were used in training the clustering model, or an error occurs.

   Tip

   To reduce the number of columns output from cluster predictions, use Select Columns in Dataset, and select a subset of the columns.
4. Leave the option Check for Append or Uncheck for Result Only selected if you want the output to contain the full input dataset, together with a column indicating the results (cluster assignments).

   If you deselect this option, you get back just the results. This might be useful when creating predictions as part of a web service.
5. Run the experiment.
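
The steps above can be approximated outside Studio (classic) to see what the module does conceptually. The following is a minimal sketch, not the module's implementation: it assumes the trained model can be represented as a plain list of centroids, that distance is Euclidean, and that rows are dictionaries. The function name `assign_to_clusters` and the `append` flag (which stands in for the Check for Append or Uncheck for Result Only option) are hypothetical.

```python
import math

def assign_to_clusters(rows, centroids, feature_columns, append=True):
    """Assign each row (a dict) to its nearest centroid.

    feature_columns lists the columns used in training, in the same order
    as the centroid coordinates. All names here are illustrative only.
    """
    results = []
    for row in rows:
        missing = [c for c in feature_columns if c not in row]
        if missing:
            # Mirrors the requirement that input columns match training columns.
            raise ValueError(f"input is missing training columns: {missing}")
        point = [float(row[c]) for c in feature_columns]
        distances = [math.dist(point, center) for center in centroids]
        result = {"Assignments": distances.index(min(distances))}  # 0-based
        # append=True keeps the input columns; append=False returns results only.
        results.append({**row, **result} if append else result)
    return results

centroids = [[0.0, 0.0], [10.0, 10.0]]  # stand-in for a trained model
new_data = [{"x": 1.0, "y": 2.0}, {"x": 9.0, "y": 8.5}]

appended = assign_to_clusters(new_data, centroids, ["x", "y"], append=True)
results_only = assign_to_clusters(new_data, centroids, ["x", "y"], append=False)
print(appended)
print(results_only)
```

The results-only form is the smaller payload, which is why it can suit a web service that only needs the cluster index.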

Results

The Assign Data to Clusters module returns two types of results on the Results dataset output:

To see the separation of clusters in the model, click the output of the module and select Visualize.

This command displays a Principal Component Analysis (PCA) graph that maps the collection of values in each cluster to two component axes.
- The first component axis is the combined set of features that captures the most variance in the model. It is plotted on the x-axis (Principal Component 1).
- The next component axis represents some combined set of features that is orthogonal to the first component and that adds the next most information to the chart. It is plotted on the y-axis (Principal Component 2).
From the graph, you can see the separation between the clusters, and how the clusters are distributed along the axes that represent the principal components.

To view the table of results for each case in the input data, attach the Convert to Dataset module, and visualize the results in Studio (classic).

This dataset contains the cluster assignment for each case, together with distance values that indicate how close the case is to the center of its assigned cluster and to the centers of the other clusters.

Output column name	Description
Assignments	A 0-based index that indicates which cluster the data point was assigned to.
DistancesToClusterCenter no. n	For each data point, this value indicates the distance from the data point to the center of cluster n. The output therefore includes the distance to the center of the assigned cluster as well as the distances to the other clusters. The metric used to calculate distance is determined when you configure the K-means clustering model.
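
As an illustration of how these output columns relate to each other, the sketch below builds one result row for a single data point. It assumes centroids are available as plain lists and offers two example metrics; the function names, the exact column label format, and the metric implementations are assumptions made for this example and are not taken from the module.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def cosine_distance(a, b):
    # 1 - cosine similarity; undefined for zero-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def score_point(point, centroids, metric=euclidean):
    """Build one output row: the assignment plus a distance per cluster."""
    distances = [metric(point, center) for center in centroids]
    row = {"Assignments": distances.index(min(distances))}  # 0-based index
    for n, distance in enumerate(distances):
        # Column label format is assumed for illustration.
        row[f"DistancesToClusterCenter no.{n}"] = distance
    return row

row = score_point([3.0, 0.0], [[0.0, 0.0], [3.0, 4.0]])
print(row)

cosine_row = score_point([2.0, 0.1], [[1.0, 0.0], [0.0, 1.0]],
                         metric=cosine_distance)
print(cosine_row["Assignments"])
```

The assigned cluster is always the one with the smallest value among the distance columns, whichever metric the model was configured with.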

Expected inputs

Name	Type	Description
Trained model	ICluster interface	Trained clustering model
Dataset	Data Table	Input data source

Module parameters

Name	Type	Range	Optional	Default	Description
Append or Result Only			Required	TRUE	Indicate whether the output dataset should contain the input dataset as well as the results, or the results only
Specify parameter sweeping mode	Sweep Methods	List: Entire grid | Random sweep	Required	Random sweep	Sweep the entire grid of the parameter space, or sweep using a limited number of sample runs

Outputs

Name	Type	Description
Results dataset	Data Table	Input dataset with an appended column of cluster assignments, or the assignments column only

Exceptions

Exception	Description
Error 0003	Exception occurs if one or more inputs are null or empty.