Tutorial: Predict automobile price with the visual interface

In this two-part tutorial, you learn how to use the Azure Machine Learning service visual interface to develop and deploy a predictive analytics solution that predicts the price of any car.

In part one, you'll set up your environment, drag and drop datasets and analysis modules onto an interactive canvas, and connect them to create an experiment.

In part one of the tutorial, you learn how to:

  • Create a new experiment
  • Import data
  • Prepare data
  • Train a machine learning model
  • Evaluate a machine learning model

In part two of the tutorial, you'll learn how to deploy your predictive model as an Azure web service so you can use it to predict the price of any car based on the technical specifications you send it.

A completed version of this tutorial is available as a sample experiment.

To find it, from the Experiments page, select Add New, then select the Sample 1 - Regression: Automobile Price Prediction (Basic) experiment.

Create a new experiment

To create a visual interface experiment, you first need an Azure Machine Learning service workspace. In this section, you learn how to create both a workspace and an experiment.

Create a new workspace

If you already have an Azure Machine Learning service workspace, skip to the next section.

  1. Sign in to the Azure portal by using the credentials for the Azure subscription you use.

  2. In the upper-left corner of the Azure portal, select + Create a resource.

    Screenshot: Create a new resource

  3. Use the search bar to find Machine Learning service workspace.

  4. Select Machine Learning service workspace.

  5. In the Machine Learning service workspace pane, select Create to begin.

  6. Provide the following information to configure your new workspace:

    Field              Description
    Workspace name     Enter a unique name that identifies your workspace. In this example, we use docs-ws. Names must be unique across the resource group. Use a name that's easy to recall and to differentiate from workspaces created by others.
    Subscription       Select the Azure subscription that you want to use.
    Resource group     Use an existing resource group in your subscription or enter a name to create a new resource group. A resource group holds related resources for an Azure solution. In this example, we use docs-aml.
    Location           Select the location closest to your users and the data resources to create your workspace.
  7. After you finish configuring the workspace, select Create.

    Warning

    It can take several minutes to create your workspace in the cloud.

    When the process is finished, a deployment success message appears.

  8. To view the new workspace, select Go to resource.

Create an experiment

  1. Open your workspace in the Azure portal.

  2. In your workspace, select Visual interface. Then select Launch visual interface.

    Screenshot of the Azure portal showing how to access the visual interface from a Machine Learning service workspace

  3. Create a new experiment by selecting +New at the bottom of the visual interface window.

  4. Select Blank Experiment.

  5. Select the default experiment name "Experiment created on ..." at the top of the canvas and rename it to something meaningful, for example, "Automobile price prediction". The name doesn't need to be unique.

Import data

Machine learning depends on data. The visual interface includes several sample datasets that you can experiment with. For this tutorial, use the sample dataset Automobile price data (Raw).

  1. To the left of the experiment canvas is a palette of datasets and modules. Select Saved Datasets, then select Samples to view the available sample datasets.

  2. Select the dataset Automobile price data (Raw) and drag it onto the canvas.

    Screenshot: Drag data onto the canvas

  3. Select which columns of data to work with. Type Select in the Search box at the top of the palette to find the Select Columns in Dataset module.

  4. Click and drag the Select Columns in Dataset module onto the canvas. Drop the module below the dataset module.

  5. Connect the dataset you added earlier to the Select Columns in Dataset module by clicking and dragging. Drag from the dataset's output port, which is the small circle at the bottom of the dataset on the canvas, all the way to the input port of Select Columns in Dataset, which is the small circle at the top of the module.

    Tip

    You create a flow of data through your experiment when you connect the output port of one module to an input port of another.

    Screenshot: Connect modules

    The red exclamation mark indicates that you haven't set the properties for the module yet.

  6. Select the Select Columns in Dataset module.

  7. In the Properties pane to the right of the canvas, select Edit columns.

    In the Select columns dialog, select ALL COLUMNS and include all features. The dialog should look like this:

    Screenshot: Column selector

  8. On the lower right, select OK to close the column selector.

Run the experiment

At any time, click the output port of a dataset or module to see what the data looks like at that point in the data flow. If the Visualize option is disabled, you first need to run the experiment.

An experiment runs on a compute target, which is a compute resource attached to your workspace. Once you create a compute target, you can reuse it for future runs.

  1. Select Run at the bottom to run the experiment.

  2. When the Setup Compute Targets dialog appears, if your workspace already has a compute resource, you can select it now. Otherwise, select Create new.

    Note

    The visual interface can only run experiments on Machine Learning Compute targets. Other compute targets will not be shown.

  3. Provide a name for the compute resource.

  4. Select Run.

    Screenshot: Set up compute target

    The compute resource will now be created. View the status in the top-right corner of the experiment.

    Note

    It takes approximately 5 minutes to create a compute resource. After the resource is created, you can reuse it and skip this wait time for future runs.

    The compute resource autoscales to 0 nodes when it is idle to save cost. When you use it again after a delay, you may again wait approximately 5 minutes while it scales back up.

After the compute target is available, the experiment runs. When the run is complete, a green check mark appears on each module.

Visualize the data

Now that you have run your initial experiment, you can visualize the data to understand more about the dataset you have.

  1. Select the output port at the bottom of the Select Columns in Dataset module, then select Visualize.

  2. Click different columns in the data window to view information about each column.

    In this dataset, each row represents an automobile, and the variables associated with each automobile appear as columns. There are 205 rows and 26 columns in this dataset.

    Each time you click a column of data, the Statistics information and Visualization image of that column appear on the left.

    Screenshot: Preview the data

  3. Click each column to understand more about your dataset, and think about whether these columns will be useful to predict the price of an automobile.

Prepare data

Typically, a dataset requires some preprocessing before it can be analyzed. You might have noticed some missing values when visualizing the dataset. These missing values need to be cleaned so the model can analyze the data correctly. You'll remove any rows that have missing values. Also, the normalized-losses column has a large proportion of missing values, so you'll exclude that column from the model altogether.

Tip

Cleaning the missing values from input data is a prerequisite for using most of the modules.

Remove column

First, remove the normalized-losses column completely.

  1. Select the Select Columns in Dataset module.

  2. In the Properties pane to the right of the canvas, select Edit columns.

    • Leave With rules and ALL COLUMNS selected.

    • From the drop-downs, select Exclude and column names, and then click inside the text box. Type normalized-losses.

    • On the lower right, select OK to close the column selector.

    Screenshot: Exclude a column

    Now the properties pane for Select Columns in Dataset indicates that it will pass through all columns from the dataset except normalized-losses.

    Screenshot: The properties pane shows that the normalized-losses column is excluded.

  3. Double-click the Select Columns in Dataset module and type the comment "Exclude normalized losses."

    After you type the comment, click outside the module. A down-arrow appears to show that the module contains a comment.

  4. Click the down-arrow to display the comment.

    The module now shows an up-arrow that you can click to hide the comment.

    Screenshot: Comments

Clean missing data

When you train a model, you have to do something about the data that is missing. In this case, you'll add a module to remove any remaining row that has missing data.

  1. Type Clean in the Search box to find the Clean Missing Data module.

  2. Drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in Dataset module.

  3. In the Properties pane, select Remove entire row under Cleaning mode.

  4. Double-click the module and type the comment "Remove missing value rows."

    Your experiment should now look something like this:

    Screenshot: Select columns
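
The two preparation steps above (exclude a column, then remove rows with missing values) can be sketched in plain Python. This is only an illustration of the logic, not how the visual interface modules are implemented. The sample rows, their values, and the helper names exclude_columns and remove_rows_with_missing are hypothetical.

```python
# Hypothetical rows standing in for the automobile dataset.
# None marks a missing value; the numbers are made up for illustration.
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
    {"make": "mazda", "normalized-losses": None, "horsepower": None, "price": 5195},
    {"make": "toyota", "normalized-losses": 87, "horsepower": 62, "price": None},
]


def exclude_columns(rows, excluded):
    """Drop the named columns, like an Exclude rule in Select Columns in Dataset."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def remove_rows_with_missing(rows):
    """Keep only complete rows, like the 'Remove entire row' cleaning mode."""
    return [row for row in rows if all(v is not None for v in row.values())]


selected = exclude_columns(rows, {"normalized-losses"})
cleaned = remove_rows_with_missing(selected)
print(len(rows), "rows before cleaning,", len(cleaned), "rows after")
```

The order matters in this sketch: dropping the sparse normalized-losses column first means a row that is missing only that value (the "bmw" row here) is kept rather than discarded.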

Train a machine learning model

Now that the data is ready, you can construct a predictive model. You'll use your data to train the model. Then you'll test the model to see how closely it's able to predict prices.

Select an algorithm

Classification and regression are two types of supervised machine learning algorithms. Classification predicts an answer from a defined set of categories, such as a color (red, blue, or green). Regression is used to predict a number.

Because you want to predict price, which is a number, you can use a regression algorithm. For this example, you'll use a linear regression model.

Split the data

Use your data for both training the model and testing it by splitting the data into separate training and testing datasets.

  1. Type split data in the search box to find the Split Data module and connect it to the left port of the Clean Missing Data module.

  2. Select the Split Data module. In the Properties pane, set the Fraction of rows in the first output dataset to 0.7. This way, you'll use 70 percent of the data to train the model and hold back 30 percent for testing.

  3. Double-click the Split Data module and type the comment "Split the dataset into training set(0.7) and test set(0.3)."
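
As a rough sketch of what a 0.7 split does, the following Python splits a list of rows into a training part and a test part. It assumes the rows are shuffled before splitting. The function name split_rows, the seed parameter, and the stand-in rows are hypothetical and are not part of the visual interface.

```python
import random


def split_rows(rows, fraction=0.7, seed=0):
    """Return (first, second), where first holds about `fraction` of the rows."""
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))                     # stand-in for 10 cleaned data rows
train, test = split_rows(rows, fraction=0.7)
print(len(train), "training rows,", len(test), "test rows")
```

Every row lands in exactly one of the two outputs, so the test rows are never seen during training.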

Train the model

Train the model by giving it a set of data that includes the price. The model scans the data and looks for correlations between a car's features and its price.

  1. To select the learning algorithm, clear your module palette search box.

  2. Expand Machine Learning, then expand Initialize Model. This displays several categories of modules that can be used to initialize machine learning algorithms.

  3. For this experiment, select Regression > Linear Regression and drag it to the experiment canvas.

  4. Find and drag the Train Model module to the experiment canvas. Connect the output of the Linear Regression module to the left input of the Train Model module, and connect the training data output (left port) of the Split Data module to the right input of the Train Model module.

    Screenshot showing the correct configuration of the Train Model module.

  5. Select the Train Model module. In the Properties pane, select Launch column selector and then type price next to Include column names. Price is the value that your model is going to predict.

    Screenshot showing the correct configuration of the column selector module.

    Your experiment should look like this:

    Screenshot showing the correct configuration of the experiment after adding the Train Model module.
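
To give a feel for what training a linear regression means, here is a minimal one-feature least-squares fit in plain Python. It is a simplified sketch, not the Linear Regression module itself, which works with many feature columns and whose solver is not described in this tutorial. The functions fit_line and predict and the horsepower and price values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new feature value."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training data in which price is exactly 100 * horsepower + 1000.
horsepower = [60, 80, 100, 120]
price = [7000, 9000, 11000, 13000]

model = fit_line(horsepower, price)
print(model, predict(model, 90))
```

In the experiment, price plays the role of ys (the label column you chose in the column selector) and the remaining columns play the role of the features.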

Evaluate a machine learning model

Now that you've trained the model using 70 percent of your data, you can use it to score the other 30 percent of the data to see how well your model functions.

  1. Type score model in the search box to find the Score Model module and drag the module to the experiment canvas. Connect the output of the Train Model module to the left input port of Score Model. Connect the test data output (right port) of the Split Data module to the right input port of Score Model.

  2. Type evaluate in the search box to find the Evaluate Model module and drag it to the experiment canvas. Connect the output of the Score Model module to the left input of Evaluate Model. The final experiment should look something like this:

    Screenshot showing the final correct configuration of the experiment.

  3. Run the experiment using the compute resource you created earlier.

  4. View the output from the Score Model module by selecting the output port of Score Model and then selecting Visualize. The output shows the predicted values for price and the known values from the test data.

    Screenshot of the output visualization, highlighting the Scored Labels column

  5. To view the output from the Evaluate Model module, select the output port, and then select Visualize.

    Screenshot showing the evaluation results for the final experiment.

The following statistics are shown for your model:

  • Mean Absolute Error (MAE): The average of absolute errors (an error is the difference between the predicted value and the actual value).
  • Root Mean Squared Error (RMSE): The square root of the average of squared errors of predictions made on the test dataset.
  • Relative Absolute Error: The average of absolute errors relative to the absolute difference between actual values and the average of all actual values.
  • Relative Squared Error: The average of squared errors relative to the squared difference between the actual values and the average of all actual values.
  • Coefficient of Determination: Also known as the R squared value, this is a statistical metric indicating how well a model fits the data.

For each of the error statistics, smaller is better. A smaller value indicates that the predictions more closely match the actual values. For Coefficient of Determination, the closer its value is to one (1.0), the better the predictions.
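
The statistics listed above can be sketched with their common textbook definitions. This assumes the Evaluate Model module follows those definitions, which this tutorial does not spell out. The function regression_metrics and the actual and predicted prices are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute the five statistics using their usual textbook definitions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Spread of the actual values around their own mean.
    abs_dev = sum(abs(a - mean_actual) for a in actual)
    sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    rse = sum(sq_errors) / sq_dev
    return {
        "mae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(sq_errors) / n),
        "rae": sum(abs_errors) / abs_dev,
        "rse": rse,
        "r2": 1 - rse,  # coefficient of determination
    }


# Made-up known prices and model predictions for four test rows.
actual = [10000, 12000, 14000, 16000]
predicted = [11000, 12000, 13000, 16000]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 4))
```

With these definitions, a model that always predicts the mean of the actual values scores a relative squared error of 1 and a coefficient of determination of 0, which is why values of the coefficient near 1.0 indicate a good fit.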

Clean up resources

Important

You can use the resources that you created as prerequisites for other Azure Machine Learning service tutorials and how-to articles.

Delete everything

If you don't plan to use anything that you created, delete the entire resource group so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the left side of the window.

    Screenshot: Delete resource group in the Azure portal

  2. In the list, select the resource group that you created.

  3. On the right side of the window, select the ellipsis button (...).

  4. Select Delete resource group.

Deleting the resource group also deletes all resources that you created in the visual interface.

Delete only the compute target

The compute target that you created here automatically scales down to zero nodes when it's not being used, which minimizes charges. If you want to delete the compute target, take these steps:

  1. In the Azure portal, open your workspace.

    Screenshot: Delete compute target

  2. In the Compute section of your workspace, select the resource.

  3. Select Delete.

Delete individual assets

In the visual interface where you created your experiment, delete individual assets by selecting them and then selecting the Delete button.

Screenshot: Delete experiment

Next steps

In part one of this tutorial, you completed these steps:

  • Created an experiment
  • Prepared the data
  • Trained the model
  • Scored and evaluated the model

In part two, you'll learn how to deploy your model as an Azure web service.