
Tutorial: Use automated machine learning to build your regression model

This tutorial is part two of a two-part tutorial series. In the previous tutorial, you prepared the NYC taxi data for regression modeling.

Now you're ready to start building your model with Azure Machine Learning service. In this part of the tutorial, you use the prepared data to automatically generate a regression model that predicts taxi fare prices. By using the automated machine learning capabilities of the service, you define your machine learning goals and constraints, launch the automated machine learning process, and then let algorithm selection and hyperparameter tuning happen for you. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

Flow diagram

In this tutorial, you learn the following tasks:

  • Set up a Python environment and import the SDK packages.
  • Configure an Azure Machine Learning service workspace.
  • Autotrain a regression model.
  • Run the model locally with custom parameters.
  • Explore the results.

If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning service today.

Note

Code in this article was tested with Azure Machine Learning SDK version 1.0.39.

Prerequisites

Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, you will need:

  • Run the data preparation tutorial.
  • A Python 3.6 notebook server with the following installed:
    • The Azure Machine Learning SDK for Python with the automl and notebooks extras
    • matplotlib
  • The tutorial notebook
  • A machine learning workspace
  • The configuration file for the workspace in the same directory as the notebook

Get all these prerequisites from either of the sections below.

Use a cloud notebook server in your workspace

It's easy to get started with your own cloud-based notebook server. The Azure Machine Learning SDK for Python is already installed and configured for you when you create this cloud resource.

  • After you launch the notebook webpage, run the tutorials/regression-part2-automated-ml.ipynb notebook.

Use your own Jupyter notebook server

Use these steps to create a local Jupyter Notebook server on your computer. Make sure that you install matplotlib and the automl and notebooks extras in your environment.

  1. Use the instructions at Create an Azure Machine Learning service workspace to do the following (a scripted sketch of the workspace steps appears after this section):

    • Create a Miniconda environment
    • Install the Azure Machine Learning SDK for Python
    • Create a workspace
    • Write a workspace configuration file (aml_config/config.json)
  2. Clone the GitHub repository.

    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
  3. Start the notebook server from your cloned directory.

    jupyter notebook
    

After you complete these steps, run the tutorials/regression-part2-automated-ml.ipynb notebook.
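If you prefer to script the workspace creation and configuration-file steps from step 1, the following is a minimal sketch. The workspace name, resource group, subscription ID, and region shown here are placeholders; substitute your own values.

from azureml.core import Workspace

# placeholder values -- replace with your own subscription, resource group, and region
ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')

# write the workspace configuration file (aml_config/config.json in this SDK version),
# so that later code can load the workspace with Workspace.from_config()
ws.write_config()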

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Install the SDK
  • Import Python packages
  • Configure your workspace

Install and import packages

If you're following the tutorial in your own Python environment, use the following command to install the necessary packages.

pip install azureml-sdk[automl,notebooks] matplotlib

Import the Python packages you need in this tutorial:

import azureml.core
import pandas as pd
from azureml.core.workspace import Workspace
import logging
import os

Configure workspace

Create a workspace object from the existing workspace. A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs.

Workspace.from_config() reads the file config.json and loads the details into an object named ws, which is used throughout the rest of the code in this tutorial.

After you have a workspace object, specify a name for the experiment. Then create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment and in the Azure portal.

ws = Workspace.from_config()
# choose a name for the run history container in the workspace
experiment_name = 'automated-ml-regression'
# project folder
project_folder = './automated-ml-regression'

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T

Explore data

Use the data flow object created in the previous tutorial. To summarize, part 1 of this tutorial cleaned the NYC Taxi data so it could be used in a machine learning model. Now, you use various features from the data set and allow an automated model to build relationships between the features and the price of a taxi trip. Open and run the data flow, and review the results:

import azureml.dataprep as dprep

file_path = os.path.join(os.getcwd(), "dflows.dprep")

dflow_prepared = dprep.Dataflow.open(file_path)
dflow_prepared.get_profile()
| | Type | Min | Max | Count | Missing count | Not missing count | Percent missing | Error count | Empty count | 0.1% quantile | 1% quantile | 5% quantile | 25% quantile | 50% quantile | 75% quantile | 95% quantile | 99% quantile | 99.9% quantile | Mean | Standard deviation | Variance | Skewness | Kurtosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vendor | FieldType.STRING | 1 | VTS | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| pickup_weekday | FieldType.STRING | Friday | Wednesday | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| pickup_hour | FieldType.DECIMAL | 0 | 23 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 2.90047 | 2.69355 | 9.72889 | 16 | 19.3713 | 22.6974 | 23 | 23 | 14.2731 | 6.59242 | 43.46 | -0.693723 | -0.570403 |
| pickup_minute | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 4.99701 | 4.95833 | 14.1528 | 29.3832 | 44.6825 | 56.4444 | 58.9909 | 59 | 29.427 | 17.4333 | 303.921 | 0.0120999 | -1.20981 |
| pickup_second | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.28131 | 5 | 14.7832 | 29.9293 | 44.725 | 56.7573 | 59 | 59 | 29.7443 | 17.3595 | 301.351 | -0.0252399 | -1.19616 |
| dropoff_weekday | FieldType.STRING | Friday | Wednesday | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| dropoff_hour | FieldType.DECIMAL | 0 | 23 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 2.57153 | 2 | 9.58795 | 15.9994 | 19.6184 | 22.8317 | 23 | 23 | 14.2105 | 6.71093 | 45.0365 | -0.687292 | -0.61951 |
| dropoff_minute | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.44383 | 4.84694 | 14.1036 | 28.8365 | 44.3102 | 56.6892 | 59 | 59 | 29.2907 | 17.4108 | 303.136 | 0.0222514 | -1.2181 |
| dropoff_second | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.07801 | 5 | 14.5751 | 29.5972 | 45.4649 | 56.2729 | 59 | 59 | 29.772 | 17.5337 | 307.429 | -0.0212575 | -1.226 |
| store_forward | FieldType.STRING | N | Y | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| pickup_longitude | FieldType.DECIMAL | -74.0781 | -73.7459 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | -74.0578 | -73.9639 | -73.9656 | -73.9508 | -73.9255 | -73.8529 | -73.8302 | -73.8238 | -73.7697 | -73.9123 | 0.0503757 | 0.00253771 | 0.352172 | -0.923743 |
| pickup_latitude | FieldType.DECIMAL | 40.5755 | 40.8799 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 40.632 | 40.7117 | 40.7115 | 40.7213 | 40.7565 | 40.8058 | 40.8478 | 40.8676 | 40.8778 | 40.7649 | 0.0494674 | 0.00244702 | 0.205972 | -0.777945 |
| dropoff_longitude | FieldType.DECIMAL | -74.0857 | -73.7209 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | -74.0775 | -73.9875 | -73.9882 | -73.9638 | -73.935 | -73.8755 | -73.8125 | -73.7759 | -73.7327 | -73.9202 | 0.0584627 | 0.00341789 | 0.623622 | -0.262603 |
| dropoff_latitude | FieldType.DECIMAL | 40.5835 | 40.8797 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 40.5973 | 40.6928 | 40.6911 | 40.7226 | 40.7567 | 40.7918 | 40.8495 | 40.868 | 40.8787 | 40.7583 | 0.0517399 | 0.00267701 | 0.0390404 | -0.203525 |
| passengers | FieldType.DECIMAL | 1 | 6 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 | 5 | 5 | 6 | 6 | 2.39249 | 1.83197 | 3.3561 | 0.763144 | -1.23467 |
| distance | FieldType.DECIMAL | 0.01 | 32.34 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0.0108744 | 0.743898 | 0.738194 | 1.243 | 2.40168 | 4.74478 | 10.5136 | 14.9011 | 21.8035 | 3.5447 | 3.2943 | 10.8524 | 1.91556 | 4.99898 |
| cost | FieldType.DECIMAL | 0.1 | 88 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 2.33837 | 5.00491 | 5 | 6.93129 | 10.524 | 17.4811 | 33.2343 | 50.0093 | 63.1753 | 13.6843 | 9.66571 | 93.426 | 1.78518 | 4.13972 |

You prepare the data for the experiment by adding columns to dflow_X to be features for model creation. You define dflow_y to be the prediction value, cost:

dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])
dflow_y = dflow_prepared.keep_columns('cost')

Split the data into train and test sets

Now you split the data into training and test sets by using the train_test_split function in the sklearn library. This function splits the x (features) data and the y (values to predict) data into training sets for model building and test sets for evaluation. The test_size parameter determines the percentage of data to allocate to testing. The random_state parameter sets a seed for the random generator, so that your train-test splits are always deterministic:

from sklearn.model_selection import train_test_split

x_df = dflow_X.to_pandas_dataframe()
y_df = dflow_y.to_pandas_dataframe()

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)
# preview y_train flattened to a 1d array (the same flatten is applied when the data is passed to AutoMLConfig below)
y_train.values.flatten()

The purpose of this step is to have data points that test the finished model but that haven't been used to train it, in order to measure true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have the necessary packages and data ready for autotraining your model.

Automatically train a model

To automatically train a model, take the following steps:

  1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
  2. Submit the experiment for model tuning. After you submit the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

Define settings for autogeneration and tuning

Define the experiment parameters and model settings for autogeneration and tuning. View the full list of settings. Submitting the experiment with these default settings takes approximately 10-15 minutes, but if you want a shorter run time, reduce either iterations or iteration_timeout_minutes.

| Property | Value in this tutorial | Description |
| --- | --- | --- |
| iteration_timeout_minutes | 10 | Time limit in minutes for each iteration. Reduce this value to decrease total runtime. |
| iterations | 30 | Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time. |
| primary_metric | spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric. |
| preprocess | True | By using True, the experiment can preprocess the input data (handling missing data, converting text to numeric, and so on). |
| verbosity | logging.INFO | Controls the level of logging. |
| n_cross_validations | 5 | Number of cross-validation splits to perform when validation data is not specified. |
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 30,
    "primary_metric" : 'spearman_correlation',
    "preprocess" : True,
    "verbosity" : logging.INFO,
    "n_cross_validations": 5
}

Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.

from azureml.train.automl import AutoMLConfig

# local compute
automated_ml_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automated_ml_errors.log',
                             path = project_folder,
                             X = x_train.values,
                             y = y_train.values.flatten(),
                             **automl_settings)

Train the automatic regression model

Start the experiment to run locally. Pass the defined automated_ml_config object to the experiment. Set the output to True to view progress during the experiment:

from azureml.core.experiment import Experiment
experiment=Experiment(ws, experiment_name)
local_run = experiment.submit(automated_ml_config, show_output=True)

The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.

Parent Run ID: AutoML_02778de3-3696-46e9-a71b-521c8fca0651
*******************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
*******************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler ExtremeRandomTrees                0:00:08       0.9447    0.9447
         1   StandardScalerWrapper GradientBoosting         0:00:09       0.9536    0.9536
         2   StandardScalerWrapper ExtremeRandomTrees       0:00:09       0.8580    0.9536
         3   StandardScalerWrapper RandomForest             0:00:08       0.9147    0.9536
         4   StandardScalerWrapper ExtremeRandomTrees       0:00:45       0.9398    0.9536
         5   MaxAbsScaler LightGBM                          0:00:08       0.9562    0.9562
         6   StandardScalerWrapper ExtremeRandomTrees       0:00:27       0.8282    0.9562
         7   StandardScalerWrapper LightGBM                 0:00:07       0.9421    0.9562
         8   MaxAbsScaler DecisionTree                      0:00:08       0.9526    0.9562
         9   MaxAbsScaler RandomForest                      0:00:09       0.9355    0.9562
        10   MaxAbsScaler SGD                               0:00:09       0.9602    0.9602
        11   MaxAbsScaler LightGBM                          0:00:09       0.9553    0.9602
        12   MaxAbsScaler DecisionTree                      0:00:07       0.9484    0.9602
        13   MaxAbsScaler LightGBM                          0:00:08       0.9540    0.9602
        14   MaxAbsScaler RandomForest                      0:00:10       0.9365    0.9602
        15   MaxAbsScaler SGD                               0:00:09       0.9602    0.9602
        16   StandardScalerWrapper ExtremeRandomTrees       0:00:49       0.9171    0.9602
        17   SparseNormalizer LightGBM                      0:00:08       0.9191    0.9602
        18   MaxAbsScaler DecisionTree                      0:00:08       0.9402    0.9602
        19   StandardScalerWrapper ElasticNet               0:00:08       0.9603    0.9603
        20   MaxAbsScaler DecisionTree                      0:00:08       0.9513    0.9603
        21   MaxAbsScaler SGD                               0:00:08       0.9603    0.9603
        22   MaxAbsScaler SGD                               0:00:10       0.9602    0.9603
        23   StandardScalerWrapper ElasticNet               0:00:09       0.9603    0.9603
        24   StandardScalerWrapper ElasticNet               0:00:09       0.9603    0.9603
        25   MaxAbsScaler SGD                               0:00:09       0.9603    0.9603
        26   TruncatedSVDWrapper ElasticNet                 0:00:09       0.9602    0.9603
        27   MaxAbsScaler SGD                               0:00:12       0.9413    0.9603
        28   StandardScalerWrapper ElasticNet               0:00:07       0.9603    0.9603
        29    Ensemble                                      0:00:38       0.9622    0.9622

Explore the results

Explore the results of automatic training with a Jupyter widget or by examining the experiment history.

Option 1: Add a Jupyter widget to see results

If you use a Jupyter notebook, use this Jupyter notebook widget to see a graph and a table of all results:

from azureml.widgets import RunDetails
RunDetails(local_run).show()

Jupyter widget run details; Jupyter widget plot

Option 2: Get and examine all run iterations in Python

You can also retrieve the history of each experiment and explore the individual metrics for each iteration run. By examining the RMSE (root_mean_squared_error) for each individual model run, you see that most iterations predict the taxi fare cost within a reasonable margin ($3-4).

children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| explained_variance | 0.811037 | 0.880553 | 0.398582 | 0.776040 | 0.663869 | 0.875911 | 0.115632 | 0.586905 | 0.851911 | 0.793964 | ... | 0.850023 | 0.883603 | 0.883704 | 0.880797 | 0.881564 | 0.883708 | 0.881826 | 0.585377 | 0.883123 | 0.886817 |
| mean_absolute_error | 2.189444 | 1.500412 | 5.480531 | 2.626316 | 2.973026 | 1.550199 | 6.383868 | 4.414241 | 1.743328 | 2.294601 | ... | 1.797402 | 1.415815 | 1.418167 | 1.578617 | 1.559427 | 1.413042 | 1.551698 | 4.069196 | 1.505795 | 1.430957 |
| median_absolute_error | 1.438417 | 0.850899 | 4.579662 | 1.765210 | 1.594600 | 0.869883 | 4.266450 | 3.627355 | 0.954992 | 1.361014 | ... | 0.973634 | 0.774814 | 0.797269 | 1.147234 | 1.116424 | 0.783958 | 1.098464 | 2.709027 | 1.003728 | 0.851724 |
| normalized_mean_absolute_error | 0.024908 | 0.017070 | 0.062350 | 0.029878 | 0.033823 | 0.017636 | 0.072626 | 0.050219 | 0.019833 | 0.026105 | ... | 0.020448 | 0.016107 | 0.016134 | 0.017959 | 0.017741 | 0.016076 | 0.017653 | 0.046293 | 0.017131 | 0.016279 |
| normalized_median_absolute_error | 0.016364 | 0.009680 | 0.052101 | 0.020082 | 0.018141 | 0.009896 | 0.048538 | 0.041267 | 0.010865 | 0.015484 | ... | 0.011077 | 0.008815 | 0.009070 | 0.013052 | 0.012701 | 0.008919 | 0.012497 | 0.030819 | 0.011419 | 0.009690 |
| normalized_root_mean_squared_error | 0.047968 | 0.037882 | 0.085572 | 0.052282 | 0.065809 | 0.038664 | 0.109401 | 0.071104 | 0.042294 | 0.049967 | ... | 0.042565 | 0.037685 | 0.037557 | 0.037643 | 0.037513 | 0.037560 | 0.037465 | 0.072077 | 0.037249 | 0.036716 |
| normalized_root_mean_squared_log_error | 0.055353 | 0.045000 | 0.110219 | 0.065633 | 0.063589 | 0.044412 | 0.123433 | 0.092312 | 0.046130 | 0.055243 | ... | 0.046540 | 0.041804 | 0.041771 | 0.045175 | 0.044628 | 0.041617 | 0.044405 | 0.079651 | 0.042799 | 0.041530 |
| r2_score | 0.810900 | 0.880328 | 0.398076 | 0.775957 | 0.642812 | 0.875719 | 0.021603 | 0.586514 | 0.851767 | 0.793671 | ... | 0.849809 | 0.880142 | 0.880952 | 0.880586 | 0.881347 | 0.880887 | 0.881613 | 0.548121 | 0.882883 | 0.886321 |
| root_mean_squared_error | 4.216362 | 3.329810 | 7.521765 | 4.595604 | 5.784601 | 3.398540 | 9.616354 | 6.250011 | 3.717661 | 4.392072 | ... | 3.741447 | 3.312533 | 3.301242 | 3.308795 | 3.297389 | 3.301485 | 3.293182 | 6.335581 | 3.274209 | 3.227365 |
| root_mean_squared_log_error | 0.243184 | 0.197702 | 0.484227 | 0.288349 | 0.279367 | 0.195116 | 0.542281 | 0.405559 | 0.202666 | 0.242702 | ... | 0.204464 | 0.183658 | 0.183514 | 0.198468 | 0.196067 | 0.182836 | 0.195087 | 0.349935 | 0.188031 | 0.182455 |
| spearman_correlation | 0.944743 | 0.953618 | 0.857965 | 0.914703 | 0.939846 | 0.956159 | 0.828187 | 0.942069 | 0.952581 | 0.935477 | ... | 0.951287 | 0.960335 | 0.960195 | 0.960279 | 0.960288 | 0.960323 | 0.960161 | 0.941254 | 0.960293 | 0.962158 |
| spearman_correlation_max | 0.944743 | 0.953618 | 0.953618 | 0.953618 | 0.953618 | 0.956159 | 0.956159 | 0.956159 | 0.956159 | 0.956159 | ... | 0.960303 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.962158 |

12 rows × 30 columns
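To focus on a single metric across all iterations, for example RMSE, you can select its row from the rundata DataFrame built above:

# view root_mean_squared_error for every iteration
rundata.loc['root_mean_squared_error']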

Retrieve the best model

Select the best pipeline from your iterations. The get_output method on local_run returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or for a particular iteration, as shown after the following code:

best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
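For example, the following sketch uses those overloads to retrieve the model from a specific iteration and the best run according to another logged metric; adjust the iteration number and metric name for your own run:

# retrieve the run and fitted model from a specific iteration, for example iteration 5
iteration_run, iteration_model = local_run.get_output(iteration=5)

# retrieve the best run and fitted model according to another logged metric
rmse_run, rmse_model = local_run.get_output(metric='root_mean_squared_error')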

Test the best model accuracy

Use the best model to run predictions on the test dataset to predict taxi fares. The predict function uses the best model to predict the values of y, trip cost, from the x_test dataset. Print the first 10 predicted cost values from y_predict:

y_predict = fitted_model.predict(x_test.values)
print(y_predict[:10])

Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the distance feature as the x-axis and trip cost as the y-axis. To compare the variance of predicted cost at each trip distance value, the first 100 predicted and actual cost values are created as separate series. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance.

%matplotlib inline

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14, 10))
ax1 = fig.add_subplot(111)

distance_vals = [x[4] for x in x_test.values]  # the distance feature is assumed to be the fifth column of x_test
y_actual = y_test.values.flatten().tolist()

ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker="s", label='Predicted')
ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker="o", label='Actual')

ax1.set_xlabel('distance (mi)')
ax1.set_title('Predicted and Actual Cost/Distance')
ax1.set_ylabel('Cost ($)')

plt.legend(loc='upper left', prop={'size': 12})
plt.rcParams.update({'font.size': 14})
plt.show()

Prediction scatter plot

Calculate the root mean squared error of the results. Use the y_test dataframe, and convert it to a list to compare to the predicted values. The function mean_squared_error takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares:

from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse
3.2204936862688798

Run the following code to calculate the mean absolute percent error (MAPE) by using the full y_actual and y_predict datasets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values:

sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    # absolute error for this prediction
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

# MAPE: total absolute error as a fraction of the total actual cost
mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
Model MAPE:
0.10545153869569586

Model Accuracy:
0.8945484613043041

From the final prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within +/- $3.00. The traditional machine learning model development process is highly resource-intensive, and it requires significant domain knowledge and time to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

Clean up resources

Important

The resources that you created can be used as prerequisites for other Azure Machine Learning service tutorials and how-to articles.

If you don't plan to use the resources you created, delete them so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    Delete in the Azure portal

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

Next steps

In this automated machine learning tutorial, you did the following tasks:

  • Configured a workspace and prepared data for an experiment.
  • Trained by using an automated regression model locally with custom parameters.
  • Explored and reviewed training results.

Deploy your model with Azure Machine Learning.