Tutorial: Categorize iris flowers using k-means clustering with ML.NET

This tutorial illustrates how to use ML.NET to build a clustering model for the iris flower data set.

In this tutorial, you learn how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare the data
  • Load and transform the data
  • Choose a learning algorithm
  • Train the model
  • Use the model for predictions

Prerequisites

Understand the problem

This problem is about dividing a set of iris flowers into different groups based on the flower features. Those features are the length and width of a sepal and the length and width of a petal. For this tutorial, assume that the type of each flower is unknown. You want to learn the structure of the data set from the features and predict how a data instance fits this structure.

Select the appropriate machine learning task

Because you don't know which group each flower belongs to, you choose an unsupervised machine learning task. To divide a data set into groups in such a way that elements in the same group are more similar to each other than to those in other groups, use a clustering machine learning task.

Create a console application

  1. Open Visual Studio. Select File > New > Project from the menu bar. In the New Project dialog, select the Visual C# node followed by the .NET Core node. Then select the Console App (.NET Core) project template. In the Name text box, type "IrisFlowerClustering" and then select the OK button.

  2. Create a directory named Data in your project to store the data set and model files:

    In Solution Explorer, right-click the project and select Add > New Folder. Type "Data" and press Enter.

  3. Install the Microsoft.ML NuGet package:

    In Solution Explorer, right-click the project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, select the v1.0.0 package in the list, and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.

Prepare the data

  1. Download the iris.data data set and save it to the Data folder you created in the previous step. For more information about the iris data set, see the Iris flower data set Wikipedia page and the Iris Data Set page, which is the source of the data set.

  2. In Solution Explorer, right-click the iris.data file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

The iris.data file contains five columns that represent:

  • sepal length in centimetres
  • sepal width in centimetres
  • petal length in centimetres
  • petal width in centimetres
  • type of iris flower

For the sake of the clustering example, this tutorial ignores the last column.

Create data classes

Create classes for the input data and the predictions:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to IrisData.cs. Then, select the Add button.

  3. Add the following using directive to the new file:

    using Microsoft.ML.Data;
    

Remove the existing class definition and add the following code, which defines the classes IrisData and ClusterPrediction, to the IrisData.cs file:

public class IrisData
{
    [LoadColumn(0)]
    public float SepalLength;

    [LoadColumn(1)]
    public float SepalWidth;

    [LoadColumn(2)]
    public float PetalLength;

    [LoadColumn(3)]
    public float PetalWidth;
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances;
}

IrisData is the input data class and has a definition for each feature in the data set. Use the LoadColumn attribute to specify the indices of the source columns in the data set file.

The ClusterPrediction class represents the output of the clustering model applied to an IrisData instance. Use the ColumnName attribute to bind the PredictedClusterId and Distances fields to the PredictedLabel and Score columns respectively. For the clustering task, those columns have the following meanings:

  • The PredictedLabel column contains the ID of the predicted cluster.
  • The Score column contains an array of squared Euclidean distances to the cluster centroids. The array length is equal to the number of clusters.
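To make the meaning of these two columns concrete, here is a minimal plain C# sketch that computes squared Euclidean distances from one point to a set of centroids and picks the nearest one. It does not use ML.NET and is not how the library computes its output; the ClusterMath class and its method names are hypothetical, and the 1-based cluster ID is an assumption that matches the sample output shown later in this tutorial.

```csharp
using System;

// Hypothetical helper class for illustration only; not part of ML.NET.
public static class ClusterMath
{
    // Squared Euclidean distance between two points of equal dimension.
    public static float SquaredDistance(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Points must have the same dimension.");

        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // One distance per centroid, analogous to the Score array.
    public static float[] SquaredDistances(float[] point, float[][] centroids)
    {
        var distances = new float[centroids.Length];
        for (int k = 0; k < centroids.Length; k++)
            distances[k] = SquaredDistance(point, centroids[k]);
        return distances;
    }

    // ID of the nearest centroid, analogous to PredictedLabel.
    // Assumption: IDs are 1-based, so array index 0 maps to cluster 1.
    public static uint NearestClusterId(float[] distances)
    {
        int best = 0;
        for (int k = 1; k < distances.Length; k++)
        {
            if (distances[k] < distances[best])
                best = k;
        }
        return (uint)(best + 1);
    }
}
```

With three centroids, SquaredDistances returns a three-element array, and NearestClusterId returns the position of its smallest element counted from one.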

Note

Use the float type to represent floating-point values in the input and prediction data classes.

Define data and model paths

Go back to the Program.cs file and add two fields to hold the paths to the data set file and to the file where the model is saved:

  • _dataPath contains the path to the file with the data set used to train the model.
  • _modelPath contains the path to the file where the trained model is stored.

Add the following code right above the Main method to specify those paths:

static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "iris.data");
static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "IrisClusteringModel.zip");

To make the preceding code compile, add the following using directives at the top of the Program.cs file:

using System;
using System.IO;

Create ML context

Add the following additional using directives to the top of the Program.cs file:

using Microsoft.ML;
using Microsoft.ML.Data;

In the Main method, replace the Console.WriteLine("Hello World!"); line with the following code:

var mlContext = new MLContext(seed: 0);

The Microsoft.ML.MLContext class represents the machine learning environment and provides mechanisms for logging and entry points for data loading, model training, prediction, and other tasks. This is comparable conceptually to using DbContext in Entity Framework.

Set up data loading

Add the following code to the Main method to set up data loading:

IDataView dataView = mlContext.Data.LoadFromTextFile<IrisData>(_dataPath, hasHeader: false, separatorChar: ',');

The generic MLContext.Data.LoadFromTextFile extension method infers the data set schema from the provided IrisData type and returns an IDataView, which can be used as input for transformers.
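As a rough illustration of the column mapping that the LoadColumn indices describe, the following plain C# sketch parses one comma-separated line of iris.data into four float fields and ignores the fifth column. It is not how ML.NET loads data; IrisRow and IrisLineParser are hypothetical names introduced here so that the sketch compiles without the Microsoft.ML package.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-in for IrisData, without ML.NET attributes.
public class IrisRow
{
    public float SepalLength;
    public float SepalWidth;
    public float PetalLength;
    public float PetalWidth;
}

// Hypothetical parser for illustration only; not part of ML.NET.
public static class IrisLineParser
{
    public static IrisRow Parse(string line)
    {
        // The file has no header row and uses ',' as the separator.
        string[] parts = line.Split(',');
        if (parts.Length < 4)
            throw new FormatException("Expected at least four comma-separated values.");

        // Column indices 0 to 3 hold the features; index 4 (the flower type) is ignored.
        return new IrisRow
        {
            SepalLength = float.Parse(parts[0], CultureInfo.InvariantCulture),
            SepalWidth = float.Parse(parts[1], CultureInfo.InvariantCulture),
            PetalLength = float.Parse(parts[2], CultureInfo.InvariantCulture),
            PetalWidth = float.Parse(parts[3], CultureInfo.InvariantCulture)
        };
    }
}
```

Parsing with CultureInfo.InvariantCulture keeps the decimal point interpretation independent of the machine's regional settings.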

Create a learning pipeline

For this tutorial, the learning pipeline of the clustering task comprises the following two steps:

  • Concatenate the loaded columns into one Features column, which is used by the clustering trainer.
  • Use a KMeansTrainer trainer to train the model using the k-means++ clustering algorithm.

Add the following code to the Main method:

string featuresColumnName = "Features";
var pipeline = mlContext.Transforms
    .Concatenate(featuresColumnName, "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
    .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 3));

The code specifies that the data set should be split into three clusters.
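For intuition about what a k-means trainer does with the Features column, here is a minimal sketch of one textbook k-means iteration in plain C#: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. This is not ML.NET's implementation, it omits the k-means++ initialization, and the KMeansSketch class and its method names are hypothetical.

```csharp
using System;

// Hypothetical illustration of one k-means iteration; not part of ML.NET.
public static class KMeansSketch
{
    // Assignment step: 0-based index of the nearest centroid for each point,
    // measured by squared Euclidean distance.
    public static int[] Assign(float[][] points, float[][] centroids)
    {
        var labels = new int[points.Length];
        for (int i = 0; i < points.Length; i++)
        {
            float best = float.MaxValue;
            for (int k = 0; k < centroids.Length; k++)
            {
                float dist = 0f;
                for (int d = 0; d < points[i].Length; d++)
                {
                    float diff = points[i][d] - centroids[k][d];
                    dist += diff * diff;
                }
                if (dist < best)
                {
                    best = dist;
                    labels[i] = k;
                }
            }
        }
        return labels;
    }

    // Update step: each centroid becomes the mean of its assigned points.
    // A centroid with no assigned points keeps its previous position.
    public static float[][] Update(float[][] points, int[] labels, float[][] centroids)
    {
        int dims = centroids[0].Length;
        var sums = new float[centroids.Length][];
        var counts = new int[centroids.Length];
        for (int k = 0; k < centroids.Length; k++)
            sums[k] = new float[dims];

        for (int i = 0; i < points.Length; i++)
        {
            counts[labels[i]]++;
            for (int d = 0; d < dims; d++)
                sums[labels[i]][d] += points[i][d];
        }

        var updated = new float[centroids.Length][];
        for (int k = 0; k < centroids.Length; k++)
        {
            updated[k] = new float[dims];
            for (int d = 0; d < dims; d++)
                updated[k][d] = counts[k] > 0 ? sums[k][d] / counts[k] : centroids[k][d];
        }
        return updated;
    }
}
```

Repeating Assign and Update until the labels stop changing gives the basic algorithm; the number of centroids passed in plays the role of numberOfClusters.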

Train the model

The steps added in the preceding sections prepared the pipeline for training, but none of them have been executed yet. Add the following line to the Main method to perform data loading and model training:

var model = pipeline.Fit(dataView);

Save the model

At this point, you have a model that can be integrated into any of your existing or new .NET applications. To save your model to a .zip file, add the following code to the Main method:

using (var fileStream = new FileStream(_modelPath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
    mlContext.Model.Save(model, dataView.Schema, fileStream);
}

Use the model for predictions

To make predictions, use the PredictionEngine<TSrc,TDst> class, which takes instances of the input type through the transformer pipeline and produces instances of the output type. Add the following line to the Main method to create an instance of that class:

var predictor = mlContext.Model.CreatePredictionEngine<IrisData, ClusterPrediction>(model);

Create the TestIrisData class to house test data instances:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to TestIrisData.cs. Then, select the Add button.

  3. Modify the class to be static, as in the following example:

    static class TestIrisData
    

This tutorial introduces one iris data instance within this class. You can add other scenarios to experiment with the model. Add the following code to the TestIrisData class:

internal static readonly IrisData Setosa = new IrisData
{
    SepalLength = 5.1f,
    SepalWidth = 3.5f,
    PetalLength = 1.4f,
    PetalWidth = 0.2f
};

To find out which cluster the specified item belongs to, go back to the Program.cs file and add the following code to the Main method:

var prediction = predictor.Predict(TestIrisData.Setosa);
Console.WriteLine($"Cluster: {prediction.PredictedClusterId}");
Console.WriteLine($"Distances: {string.Join(" ", prediction.Distances)}");

Run the program to see which cluster contains the specified data instance and the squared distances from that instance to the cluster centroids. Your results should be similar to the following:

Cluster: 2
Distances: 11.69127 0.02159119 25.59896

Congratulations! You've now successfully built a machine learning model for iris clustering and used it to make predictions. You can find the source code for this tutorial at the dotnet/samples GitHub repository.

Next steps

In this tutorial, you learned how to:

  • Understand the problem
  • Select the appropriate machine learning task
  • Prepare the data
  • Load and transform the data
  • Choose a learning algorithm
  • Train the model
  • Use the model for predictions

Check out our GitHub repository to continue learning and find more samples.