Tutorial: Classify the severity of restaurant health violations with Model Builder

Learn how to build a multiclass classification model using Model Builder to categorize the risk level of restaurant violations found during health inspections.

In this tutorial, you learn how to:

  • Prepare and understand the data
  • Choose a scenario
  • Load data from a database
  • Train the model
  • Evaluate the model
  • Use the model for predictions

Note

Model Builder is currently in Preview.

Prerequisites

For a list of prerequisites and installation instructions, visit the Model Builder installation guide.

Model Builder multiclass classification overview

This sample creates a C# .NET Core console application that categorizes the risk of health violations using a machine learning model built with Model Builder. You can find the source code for this tutorial in the dotnet/machinelearning-samples GitHub repository.

Create a console application

  1. Create a C# .NET Core console application called "RestaurantViolations". Make sure "Place solution and project in the same directory" is unchecked (VS 2019), or "Create directory for solution" is checked (VS 2017).

Prepare and understand the data

The dataset used to train and evaluate the machine learning model is originally from the San Francisco Department of Public Health Restaurant Safety Scores. For convenience, the dataset has been condensed to include only the columns relevant to training the model and making predictions. Visit the following website to learn more about the dataset.

Download the Restaurant Safety Scores dataset and unzip it.

Each row in the dataset contains information about violations observed during a Health Department inspection, along with a risk assessment of the threat those violations pose to public health and safety.

| InspectionType | ViolationDescription | RiskCategory |
|---|---|---|
| Routine - Unscheduled | Inadequately cleaned or sanitized food contact surfaces | Moderate Risk |
| New Ownership | High risk vermin infestation | High Risk |
| Routine - Unscheduled | Wiping cloths not clean or properly stored or inadequate sanitizer | Low Risk |

  • InspectionType: the type of inspection. This can be a first-time inspection for a new establishment, a routine inspection, a complaint inspection, or one of many other types.
  • ViolationDescription: a description of the violation found during inspection.
  • RiskCategory: the severity of the risk a violation poses to public health and safety.

The label is the column you want to predict. When performing a classification task, the goal is to assign a category (text or numerical). In this classification scenario, the severity of the violation is assigned the value of low, moderate, or high risk. Therefore, RiskCategory is the label. The features are the inputs you give the model to predict the label. In this case, InspectionType and ViolationDescription are used as features to predict RiskCategory.
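As a rough illustration of the label and feature split described above, the sketch below models one dataset row as a plain C# class and prints the three sample rows from the table. The `ViolationRow` type and the `LabelAndFeaturesSketch` class are hypothetical names introduced only for this example; Model Builder generates its own input class later in the tutorial.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row type for illustration only (not generated by Model Builder).
public class ViolationRow
{
    public string InspectionType { get; set; } = "";        // feature
    public string ViolationDescription { get; set; } = "";  // feature
    public string RiskCategory { get; set; } = "";          // label (the column to predict)
}

public static class LabelAndFeaturesSketch
{
    public static void Main()
    {
        // The three sample rows shown in the table above.
        var rows = new List<ViolationRow>
        {
            new ViolationRow
            {
                InspectionType = "Routine - Unscheduled",
                ViolationDescription = "Inadequately cleaned or sanitized food contact surfaces",
                RiskCategory = "Moderate Risk"
            },
            new ViolationRow
            {
                InspectionType = "New Ownership",
                ViolationDescription = "High risk vermin infestation",
                RiskCategory = "High Risk"
            },
            new ViolationRow
            {
                InspectionType = "Routine - Unscheduled",
                ViolationDescription = "Wiping cloths not clean or properly stored or inadequate sanitizer",
                RiskCategory = "Low Risk"
            }
        };

        // Print each row as "features -> label" to make the roles explicit.
        foreach (var row in rows)
        {
            Console.WriteLine(
                $"features: [{row.InspectionType}; {row.ViolationDescription}] -> label: {row.RiskCategory}");
        }
    }
}
```

During training the model sees both the features and the label; at prediction time it receives only the features and produces the label.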

Choose a scenario

[Image: Model Builder wizard in Visual Studio]

To train your model, select from the list of available machine learning scenarios provided by Model Builder. In this case, the scenario is Issue Classification.

  1. In Solution Explorer, right-click the RestaurantViolations project, and select Add > Machine Learning.
  2. For this sample, the machine learning task is multiclass classification. In the Scenario step of Model Builder, select the Issue Classification scenario.

Load the data

Model Builder accepts data from a SQL Server database or from a local file in csv or tsv format.

  1. In the data step of the Model Builder tool, select SQL Server from the data source dropdown.
  2. Select the button next to the Connect to SQL Server database text box.
    1. In the Choose Data dialog, select Microsoft SQL Server Database File.
    2. Uncheck the Always use this selection checkbox and select Continue.
    3. In the Connection Properties dialog, select Browse and select the downloaded RestaurantScores.mdf file.
    4. Select OK.
  3. Choose Violations from the Table Name dropdown.
  4. Choose RiskCategory in the Column to Predict (Label) dropdown.
  5. Leave the default column selections, InspectionType and ViolationDescription, checked in the Input Columns (Features) dropdown.
  6. Select the Train link to move to the next step in Model Builder.

Train the model

The machine learning task used to train the issue classification model in this tutorial is multiclass classification. During the model training process, Model Builder trains separate models using different multiclass classification algorithms and settings to find the best performing model for your dataset.

The time required for the model to train is proportional to the amount of data. Model Builder automatically selects a default value for Time to train (seconds) based on the size of your data source.

  1. Although Model Builder sets the value of Time to train (seconds) to 10 seconds, increase it to 30 seconds. Training for a longer period of time allows Model Builder to explore a larger number of algorithms and parameter combinations in search of the best model.

  2. Select Start Training.

    Throughout the training process, progress data is displayed in the Progress section of the train step.

    • Status displays the completion status of the training process.
    • Best accuracy displays the accuracy of the best performing model found by Model Builder so far. Higher accuracy means the model predicted more correctly on test data.
    • Best algorithm displays the name of the best performing algorithm found by Model Builder so far.
    • Last algorithm displays the name of the algorithm most recently used by Model Builder to train the model.
  3. Once training is complete, select the Evaluate link to move to the next step.
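To make the "Best accuracy" figure above more concrete, here is a minimal sketch of accuracy computed as the fraction of test rows whose predicted category matches the actual one. It assumes the simplest definition of accuracy; the exact metric Model Builder reports may be computed differently. The `AccuracySketch` class and the sample labels are made up for illustration.

```csharp
using System;
using System.Linq;

public static class AccuracySketch
{
    // Fraction of positions where the predicted label equals the actual label.
    public static double Accuracy(string[] actual, string[] predicted)
    {
        if (actual.Length != predicted.Length)
            throw new ArgumentException("Both arrays must have the same length.");
        if (actual.Length == 0)
            throw new ArgumentException("At least one example is required.");

        int correct = actual.Zip(predicted, (a, p) => a == p).Count(isMatch => isMatch);
        return (double)correct / actual.Length;
    }

    public static void Main()
    {
        // Made-up test labels and predictions, for illustration only.
        string[] actual = { "High Risk", "Low Risk", "Moderate Risk", "Low Risk" };
        string[] predicted = { "High Risk", "Low Risk", "Low Risk", "Low Risk" };

        // Three of the four predictions match, so the fraction is 3/4.
        Console.WriteLine($"Accuracy: {Accuracy(actual, predicted)}");
    }
}
```

A model that always guessed the most common category would also score above zero on this measure, which is one reason to look at the summary table of explored models rather than a single number.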

Evaluate the model

The result of the training step is the single model that performed best. In the evaluate step of Model Builder, the output section shows the algorithm used by the best performing model in the Best Model entry, along with its metrics in Best Model Accuracy. Additionally, a summary table is displayed that contains up to five of the models that were explored and their metrics.

If you're not satisfied with your accuracy metrics, two easy ways to try to improve model accuracy are to increase the training time or to use more data. Otherwise, select the Code link to move to the final step in Model Builder.

Add the code to make predictions

Two projects are created as a result of the training process.

  • RestaurantViolationsML.ConsoleApp: A C# .NET Core console application that contains the model training and sample consumption code.
  • RestaurantViolationsML.Model: A .NET Standard class library containing the data models that define the schema of input and output model data, the saved version of the best performing model during training, and a helper class called ConsumeModel to make predictions.

  1. In the code step of Model Builder, select Add Projects to add the autogenerated projects to the solution.

  2. Open the Program.cs file in the RestaurantViolations project.

  3. Add the following using statement to reference the RestaurantViolationsML.Model project:

    using RestaurantViolationsML.Model;
    
  4. To make a prediction on new data using the model, create a new instance of the ModelInput class inside the Main method of your application. Notice that the risk category is not set on the input, because it is the value the model predicts.

    ModelInput input = new ModelInput
    {
        InspectionType = "Complaint",
        ViolationDescription = "Inadequate sewage or wastewater disposal"
    };
    
  5. Use the Predict method from the ConsumeModel class. The Predict method loads the trained model, creates a PredictionEngine for the model, and uses it to make predictions on new data.

    // Make prediction
    ModelOutput result = ConsumeModel.Predict(input);
    
    // Print Prediction
    Console.WriteLine($"Inspection type: {input.InspectionType}");
    Console.WriteLine($"Violation description: {input.ViolationDescription}");
    Console.WriteLine($"Predicted risk category: {result.Prediction}");
    Console.ReadKey();
    
  6. Run the application.

    The output generated by the program should look similar to the snippet below:

    Inspection type: Complaint
    Violation description: Inadequate sewage or wastewater disposal
    Predicted risk category: Moderate Risk
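
The Predict helper used in step 5 is described as loading the trained model, creating a PredictionEngine, and running the prediction. A minimal sketch of that sequence with the ML.NET API might look like the following. It is not the generated ConsumeModel code: the `PredictSketch` class, the `MLModel.zip` path, and the shapes of `ModelInput` and `ModelOutput` shown here are assumptions for illustration, and the example requires the Microsoft.ML NuGet package plus a model file trained on matching columns.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Assumed input schema; the class generated by Model Builder may differ.
public class ModelInput
{
    [ColumnName("InspectionType")]
    public string InspectionType { get; set; }

    [ColumnName("ViolationDescription")]
    public string ViolationDescription { get; set; }

    // Label column; left unset when asking for a prediction.
    [ColumnName("RiskCategory")]
    public string RiskCategory { get; set; }
}

// Assumed output schema: the predicted category is read from "PredictedLabel".
public class ModelOutput
{
    [ColumnName("PredictedLabel")]
    public string Prediction { get; set; }
}

public static class PredictSketch
{
    public static ModelOutput Predict(string modelPath, ModelInput input)
    {
        var mlContext = new MLContext();

        // Load the saved model together with the schema it expects as input.
        ITransformer model = mlContext.Model.Load(modelPath, out DataViewSchema inputSchema);

        // Create an engine that maps one ModelInput to one ModelOutput.
        var engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);

        return engine.Predict(input);
    }

    public static void Main()
    {
        var input = new ModelInput
        {
            InspectionType = "Complaint",
            ViolationDescription = "Inadequate sewage or wastewater disposal"
        };

        // "MLModel.zip" is a hypothetical path to the saved model file.
        ModelOutput result = Predict("MLModel.zip", input);
        Console.WriteLine($"Predicted risk category: {result.Prediction}");
    }
}
```

Loading the model and building the engine on every call keeps the sketch short; an application that predicts repeatedly would usually create them once and reuse them.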
    

If you need to reference the generated projects later from another solution, you can find them in the C:\Users\%USERNAME%\AppData\Local\Temp\MLVSTools directory.

Congratulations! You've successfully built a machine learning model to categorize the risk of health violations using Model Builder. You can find the source code for this tutorial in the dotnet/machinelearning-samples GitHub repository.

Additional resources

To learn more about topics mentioned in this tutorial, visit the following resources: