Assign Data to Clusters
Assigns data to clusters using an existing trained clustering model
Applies to: Machine Learning Studio (classic)
This content pertains only to Studio (classic). Similar drag and drop modules have been added to Azure Machine Learning designer (preview). Learn more in this article comparing the two versions.
This article describes how to use the Assign Data to Clusters module in Azure Machine Learning Studio (classic) to generate predictions from a clustering model that was trained with the K-Means clustering algorithm.
The module returns a dataset that contains the probable cluster assignment for each new data point. It also creates a PCA (Principal Component Analysis) graph to help you visualize how the clusters are separated.
This module replaces the Assign to Clusters (deprecated) module, which is available only for support of older experiments.
How to use Assign Data to Clusters
In Azure Machine Learning Studio (classic), locate a previously trained clustering model. You can create and train a clustering model by using either of these methods:
You can also add an existing trained clustering model from the Saved Models group in your workspace.
Attach the trained model to the left input port of Assign Data to Clusters.
Attach a new dataset as input. Labels are optional in this dataset. Because clustering is generally an unsupervised learning method, you are not expected to know the categories in advance.
However, the input columns must be the same as the columns that were used in training the clustering model, or an error occurs.
To reduce the number of columns output from cluster predictions, use Select Columns in Dataset, and select a subset of the columns.
Leave the option Check for Append or Uncheck for Result Only selected if you want the results to contain the full input dataset, together with a column indicating the results (cluster assignments).
If you deselect this option, you get back just the results. This might be useful when creating predictions as part of a web service.
Run the experiment.
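The steps above can be pictured with a small sketch in plain Python. It is not the Studio (classic) implementation: the function name `assign_to_clusters`, the `centroids` list standing in for a trained model, and the Euclidean nearest-center rule are all assumptions made for illustration. The sketch shows two behaviors described in the steps: the input must contain the columns used in training, and the `append` flag decides whether the input columns are returned together with the assignments.

```python
import math

def assign_to_clusters(centroids, feature_columns, rows, append=True):
    """Assign each input row (a dict) to its nearest cluster center.

    centroids: one coordinate list per cluster (hypothetical trained model).
    feature_columns: training column names, in centroid coordinate order.
    append: True returns input columns plus "Assignments";
            False returns the results only.
    """
    output = []
    for row in rows:
        # The input must provide every column that was used in training.
        missing = [name for name in feature_columns if name not in row]
        if missing:
            raise ValueError(f"input is missing training columns: {missing}")
        point = [float(row[name]) for name in feature_columns]
        # Assumption: Euclidean distance to each cluster center.
        distances = [math.dist(point, center) for center in centroids]
        result = {"Assignments": distances.index(min(distances))}  # 0-based
        output.append({**row, **result} if append else result)
    return output

# Hypothetical model with two cluster centers, trained on columns x and y.
centroids = [[0.0, 0.0], [10.0, 10.0]]
feature_columns = ["x", "y"]
new_rows = [
    {"x": 1.0, "y": 2.0, "label": "a"},   # the label column is optional
    {"x": 9.0, "y": 8.5, "label": "b"},
]

appended = assign_to_clusters(centroids, feature_columns, new_rows, append=True)
results_only = assign_to_clusters(centroids, feature_columns, new_rows, append=False)
```

With `append=True` each output row keeps `x`, `y`, and `label` and gains an `Assignments` value. With `append=False` only `Assignments` is returned, which is the smaller payload you might prefer behind a web service.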
The Assign Data to Clusters module returns two types of results on the Results dataset output:
To see the separation of clusters in the model, click the output of the module and select Visualize.
This command displays a Principal Component Analysis (PCA) graph that maps the collection of values in each cluster to two component axes.
- The first component axis is the combined set of features that captures the most variance in the model. It is plotted on the x-axis (Principal Component 1).
- The next component axis represents some combined set of features that is orthogonal to the first component and that adds the next most information to the chart. It is plotted on the y-axis (Principal Component 2).
From the graph, you can see the separation between the clusters, and how the clusters are distributed along the axes that represent the principal components.
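To make the two component axes concrete, here is a minimal sketch of a generic two-component PCA projection in plain Python. How Studio (classic) computes its graph is not described in this article, so the approach here (covariance matrix, power iteration, and deflation) and all function names are assumptions chosen for illustration only.

```python
import math

def center_rows(data):
    """Subtract each column's mean so the data is centered at the origin."""
    n = len(data)
    means = [sum(column) / n for column in zip(*data)]
    return [[value - mean for value, mean in zip(row, means)] for row in data]

def covariance_matrix(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(matrix_row, vector)) for matrix_row in matrix]

def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: eigenvalue and unit eigenvector of largest magnitude."""
    start = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    length = math.sqrt(sum(s * s for s in start))
    vector = [s / length for s in start]
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:  # zero matrix: no variance left to capture
            break
        vector = [p / norm for p in product]
    eigenvalue = sum(v * p for v, p in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector

def project_to_two_components(data):
    """Return (x, y) coordinates on the first two principal components."""
    centered = center_rows(data)
    cov = covariance_matrix(centered)
    d = len(cov)
    value1, pc1 = dominant_eigenpair(cov)
    # Deflation removes the variance explained by pc1, so the next dominant
    # direction is orthogonal to it.
    deflated = [[cov[i][j] - value1 * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    _, pc2 = dominant_eigenpair(deflated)
    projected = [(sum(a * b for a, b in zip(row, pc1)),
                  sum(a * b for a, b in zip(row, pc2))) for row in centered]
    return projected, pc1, pc2

# Hypothetical data: three points near one center, three near another.
points = [[1.0, 1.0, 0.5], [1.5, 0.8, 0.4], [0.8, 1.2, 0.6],
          [8.0, 8.2, 4.0], [8.5, 7.9, 4.2], [7.8, 8.4, 3.9]]
projected, pc1, pc2 = project_to_two_components(points)
```

Plotting the first value of each pair on the x-axis and the second on the y-axis gives a chart of the kind described above. In this sample, the two groups fall at opposite ends of the first component. The sign of each component is arbitrary, so a chart may appear mirrored.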
To view the table of results for each case in the input data, attach the Convert to Dataset module, and visualize the results in Studio (classic).
This dataset contains the cluster assignments for each case, and a distance metric that gives you some indication of how close this particular case is to the center of the cluster.
|Output column name|Description|
|---|---|
|Assignments|A 0-based index that indicates which cluster the data point was assigned to.|
|DistancesToClusterCenter no. n|For each data point, this value indicates the distance from the data point to the center of the assigned cluster, and the distance to other clusters.|
The metric used to calculate distance is determined when you configure the K-means clustering model.
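As an illustration of how the output columns relate to the distance metric, the following sketch builds one result row in plain Python. It is not Studio code: the helper names, the exact column label format, and the definition of cosine distance as one minus cosine similarity are assumptions. Euclidean and cosine distance are shown only as examples of metrics a K-means model might be configured with.

```python
import math

def euclidean_distance(a, b):
    return math.dist(a, b)

def cosine_distance(a, b):
    # Assumption: cosine distance = 1 - cosine similarity.
    # Not defined for zero-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def result_row(point, centroids, metric=euclidean_distance):
    """Build the assignment plus one distance column per cluster center."""
    distances = [metric(point, center) for center in centroids]
    row = {"Assignments": distances.index(min(distances))}  # 0-based index
    for k, distance in enumerate(distances):
        # Column label format is assumed for this sketch.
        row[f"DistancesToClusterCenter no.{k}"] = distance
    return row

# Hypothetical cluster centers.
centroids = [[0.0, 0.0], [3.0, 4.0]]
row = result_row([0.0, 0.0], centroids)
```

Here the point sits exactly on the first center, so `row["Assignments"]` is 0, the distance to center 0 is 0.0, and the distance to center 1 is 5.0. Passing `metric=cosine_distance` can change which center is nearest, because cosine distance compares direction rather than magnitude.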
Expected inputs
|Name|Type|Description|
|---|---|---|
|Trained model|ICluster interface|Trained clustering model|
|Dataset|Data Table|Input data source|
Module parameters
|Name|Type|Range|Optional|Default|Description|
|---|---|---|---|---|---|
|Append or Result Only| | |Required|TRUE|Indicate whether the output dataset should contain the input dataset as well as the results, or the results only|
|Specify parameter sweeping mode|Sweep Methods|List: Entire grid, Random sweep|Required|Random sweep|Sweep the entire grid on the parameter space, or sweep using a limited number of sample runs|
Outputs
|Name|Type|Description|
|---|---|---|
|Results dataset|Data Table|Input dataset with an appended assignments column, or the assignments column only|
Exceptions
|Exception|Description|
|---|---|
|Error 0003|Exception occurs if one or more of the inputs are null or empty.|