AML - How do I get the data I need from Assigning Data to Clusters

Question

I have a Data to Clusters model that is being fed 4000 points in training and 4000 items to be compared against those points. It provides its best guess in the Assignments column, which is mostly good, but I need to determine how confident it is by seeing the score for each item.

What is the recommended way to obtain that score, along with the next 4 scores and their respective assignments?

In my situation the training data will never cover all the data, and I am trying to figure out when the model is just assigning a best guess that is wrong.

Answer

Hi @Drummond, Joshua (AA)

Thank you for using the Microsoft Q&A forum.

Based on your description, you have a Data to Clusters model that is being fed 4000 points during training and another 4000 items to be compared against the points. While it is providing satisfactory assignments, you're interested in determining the confidence level of these assignments by accessing the scores associated with each item.

To achieve this, follow these steps:

Utilize the "Assign Data to Clusters" Component:
- Locate an existing trained clustering model, or train a new one with the K-means clustering algorithm, in Azure Machine Learning Designer.
Configure the Component:
- Attach the trained model to the left input port of the "Assign Data to Clusters" component.
- Provide a new dataset as input, ensuring that the input columns match those used in training the clustering model.
Retrieve Results:
- The "Assign Data to Clusters" component returns a dataset containing the cluster assignment for each new data point.
- This dataset includes the Assignments column, indicating the cluster to which each row is assigned, along with one column per cluster giving the distance from the point to that cluster's center.
Select Top 5 Scores:
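- The distance columns serve as the scores: the smaller the distance, the closer the item is to that cluster, and the Assignments column corresponds to the smallest distance in the row. To obtain the top 5 scores and their respective assignments for each item, rank the distance columns within each row and keep the 5 smallest. You can do this with the "Execute Python Script" component connected to the output of "Assign Data to Clusters".
- Here is an example Python script for your reference. It assumes the distance columns share the prefix DistancesToClusterCenter; please check the column names in your output and make any necessary modifications for your use case.
```
import numpy as np
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Assumption: the distance columns share the prefix below; check the
    # component's output and adjust the prefix if your column names differ.
    dist_cols = [c for c in dataframe1.columns if c.startswith('DistancesToClusterCenter')]
    distances = dataframe1[dist_cols].to_numpy()

    # For each row, order the clusters from nearest to farthest and keep up to 5
    k = min(5, len(dist_cols))
    order = np.argsort(distances, axis=1)[:, :k]
    nearest = np.take_along_axis(distances, order, axis=1)

    # Add the rank-n cluster (identified by its column name) and its distance
    result = dataframe1.copy()
    for rank in range(k):
        result[f'Rank{rank + 1}_Cluster'] = [dist_cols[i] for i in order[:, rank]]
        result[f'Rank{rank + 1}_Distance'] = nearest[:, rank]

    # Return the augmented dataset as a one-element tuple
    return result,
```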
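
Since K-means reports distances rather than probabilities, one possible way to spot a wrong best guess is to look at how far the item is from its nearest center and how close the runner-up cluster is. The sketch below is only an illustration under assumptions: the function name flag_uncertain, the column prefix DistancesToClusterCenter, the added column names, and both thresholds are hypothetical and would need tuning against your own data.

```python
import numpy as np
import pandas as pd

def flag_uncertain(df, prefix="DistancesToClusterCenter",
                   max_distance=1.0, max_ratio=0.8):
    """Add simple confidence indicators derived from per-cluster distances.

    prefix, max_distance and max_ratio are illustrative assumptions,
    not values taken from the component or its documentation.
    """
    dist_cols = [c for c in df.columns if c.startswith(prefix)]
    if len(dist_cols) < 2:
        raise ValueError("need at least two distance columns")

    # Sort each row's distances ascending: index 0 is the nearest cluster
    ordered = np.sort(df[dist_cols].to_numpy(dtype=float), axis=1)
    nearest, runner_up = ordered[:, 0], ordered[:, 1]

    # A ratio near 1 means the two closest clusters are almost equally far;
    # rows where the runner-up distance is 0 keep the default ratio of 1
    ratio = np.divide(nearest, runner_up,
                      out=np.ones_like(nearest), where=runner_up > 0)

    out = df.copy()
    out["NearestDistance"] = nearest
    out["NearestToRunnerUpRatio"] = ratio
    # Flag rows that are far from every center or nearly tied between two
    out["LowConfidence"] = (nearest > max_distance) | (ratio > max_ratio)
    return out

# Small made-up example with three clusters
sample = pd.DataFrame({
    "DistancesToClusterCenter no.0": [0.2, 0.9, 3.0],
    "DistancesToClusterCenter no.1": [2.5, 1.0, 3.5],
    "DistancesToClusterCenter no.2": [3.1, 4.0, 9.0],
})
flagged = flag_uncertain(sample, max_distance=1.5, max_ratio=0.8)
print(flagged[["NearestDistance", "NearestToRunnerUpRatio", "LowConfidence"]])
```

An item that is far from every center (a large NearestDistance) may belong to a group that is missing from the training data, while a ratio close to 1 means two clusters are nearly tied for that item.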

For more detail on this approach, you can refer to the "Component: K-Means Clustering" documentation. Additionally, you can explore the step-by-step Microsoft documentation on clustering. I hope this information helps you improve your solution. Thank you.
