
Tutorial: Use automated machine learning to predict taxi fares

APPLIES TO: Basic edition, Enterprise edition (Upgrade to Enterprise edition)

In this tutorial, you use automated machine learning in Azure Machine Learning to create a regression model to predict NYC taxi fare prices. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

Flow diagram

In this tutorial, you learn the following tasks:

  • Download, transform, and clean data using Azure Open Datasets
  • Train an automated machine learning regression model
  • Calculate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.

Prerequisites

  • Complete the setup tutorial if you don't already have an Azure Machine Learning workspace or notebook virtual machine.
  • After you complete the setup tutorial, open the tutorials/regression-automl-nyc-taxi-data/regression-automated-ml.ipynb notebook using the same notebook server.

This tutorial is also available on GitHub if you wish to run it in your own local environment. Run pip install azureml-sdk[automl] azureml-opendatasets azureml-widgets to get the required packages.

Download and prepare data

Import the necessary packages. The Open Datasets package contains a class representing each data source (NycTlcGreen for example) to easily filter date parameters before downloading.

from azureml.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. When working in a non-Spark environment, Open Datasets only allows downloading one month of data at a time with certain classes to avoid MemoryError with large datasets.

To download taxi data, iteratively fetch one month at a time, and before appending it to green_taxi_df, randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data.

green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2015","%m/%d/%Y")
end = datetime.strptime("1/31/2015","%m/%d/%Y")

for sample_month in range(12):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    green_taxi_df = green_taxi_df.append(temp_df_green.sample(2000))

green_taxi_df.head(10)
         vendorID  lpepPickupDatetime   lpepDropoffDatetime  passengerCount  tripDistance  puLocationId  doLocationId  pickupLongitude  pickupLatitude  dropoffLongitude  ...  paymentType  fareAmount  extra  mtaTax  improvementSurcharge  tipAmount  tollsAmount  ehailFee  totalAmount  tripType
131969   2  2015-01-11 05:34:44  2015-01-11 05:45:03  3  4.84  None  None  -73.88  40.84  -73.94  ...  2  15.00  0.50  0.50  0.3  0.00  0.00  nan  16.30  1.00
1129817  2  2015-01-20 16:26:29  2015-01-20 16:30:26  1  0.69  None  None  -73.96  40.81  -73.96  ...  2  4.50   1.00  0.50  0.3  0.00  0.00  nan  6.30   1.00
1278620  2  2015-01-01 05:58:10  2015-01-01 06:00:55  1  0.45  None  None  -73.92  40.76  -73.91  ...  2  4.00   0.00  0.50  0.3  0.00  0.00  nan  4.80   1.00
348430   2  2015-01-17 02:20:50  2015-01-17 02:41:38  1  0.00  None  None  -73.81  40.70  -73.82  ...  2  12.50  0.50  0.50  0.3  0.00  0.00  nan  13.80  1.00
1269627  1  2015-01-01 05:04:10  2015-01-01 05:06:23  1  0.50  None  None  -73.92  40.76  -73.92  ...  2  4.00   0.50  0.50  0    0.00  0.00  nan  5.00   1.00
811755   1  2015-01-04 19:57:51  2015-01-04 20:05:45  2  1.10  None  None  -73.96  40.72  -73.95  ...  2  6.50   0.50  0.50  0.3  0.00  0.00  nan  7.80   1.00
737281   1  2015-01-03 12:27:31  2015-01-03 12:33:52  1  0.90  None  None  -73.88  40.76  -73.87  ...  2  6.00   0.00  0.50  0.3  0.00  0.00  nan  6.80   1.00
113951   1  2015-01-09 23:25:51  2015-01-09 23:39:52  1  3.30  None  None  -73.96  40.72  -73.91  ...  2  12.50  0.50  0.50  0.3  0.00  0.00  nan  13.80  1.00
150436   2  2015-01-11 17:15:14  2015-01-11 17:22:57  1  1.19  None  None  -73.94  40.71  -73.95  ...  1  7.00   0.00  0.50  0.3  1.75  0.00  nan  9.55   1.00
432136   2  2015-01-22 23:16:33  2015-01-22 23:20:13  1  0.65  None  None  -73.94  40.71  -73.94  ...  2  5.00   0.50  0.50  0.3  0.00  0.00  nan  6.30   1.00

10 rows × 23 columns

Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. Use the apply() function on the dataframe to iteratively apply the build_time_features() function to each row in the taxi data.

def build_time_features(vector):
    pickup_datetime = vector[0]
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour

    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)
         vendorID  lpepPickupDatetime   lpepDropoffDatetime  passengerCount  tripDistance  puLocationId  doLocationId  pickupLongitude  pickupLatitude  dropoffLongitude  ...  improvementSurcharge  tipAmount  tollsAmount  ehailFee  totalAmount  tripType  month_num  day_of_month  day_of_week  hour_of_day
131969   2  2015-01-11 05:34:44  2015-01-11 05:45:03  3  4.84  None  None  -73.88  40.84  -73.94  ...  0.3  0.00  0.00  nan  16.30  1.00  1  11  6  5
1129817  2  2015-01-20 16:26:29  2015-01-20 16:30:26  1  0.69  None  None  -73.96  40.81  -73.96  ...  0.3  0.00  0.00  nan  6.30   1.00  1  20  1  16
1278620  2  2015-01-01 05:58:10  2015-01-01 06:00:55  1  0.45  None  None  -73.92  40.76  -73.91  ...  0.3  0.00  0.00  nan  4.80   1.00  1  1   3  5
348430   2  2015-01-17 02:20:50  2015-01-17 02:41:38  1  0.00  None  None  -73.81  40.70  -73.82  ...  0.3  0.00  0.00  nan  13.80  1.00  1  17  5  2
1269627  1  2015-01-01 05:04:10  2015-01-01 05:06:23  1  0.50  None  None  -73.92  40.76  -73.92  ...  0    0.00  0.00  nan  5.00   1.00  1  1   3  5
811755   1  2015-01-04 19:57:51  2015-01-04 20:05:45  2  1.10  None  None  -73.96  40.72  -73.95  ...  0.3  0.00  0.00  nan  7.80   1.00  1  4   6  19
737281   1  2015-01-03 12:27:31  2015-01-03 12:33:52  1  0.90  None  None  -73.88  40.76  -73.87  ...  0.3  0.00  0.00  nan  6.80   1.00  1  3   5  12
113951   1  2015-01-09 23:25:51  2015-01-09 23:39:52  1  3.30  None  None  -73.96  40.72  -73.91  ...  0.3  0.00  0.00  nan  13.80  1.00  1  9   4  23
150436   2  2015-01-11 17:15:14  2015-01-11 17:22:57  1  1.19  None  None  -73.94  40.71  -73.95  ...  0.3  1.75  0.00  nan  9.55   1.00  1  11  6  17
432136   2  2015-01-22 23:16:33  2015-01-22 23:20:13  1  0.65  None  None  -73.94  40.71  -73.94  ...  0.3  0.00  0.00  nan  6.30   1.00  1  22  3  23

10 rows × 27 columns

Remove some of the columns that you won't need for training or additional feature building.

columns_to_remove = ["lpepPickupDatetime", "lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID",
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)

green_taxi_df.head(5)

Cleanse data

Run the describe() function on the new dataframe to see summary statistics for each field.

green_taxi_df.describe()
       vendorID  passengerCount  tripDistance  pickupLongitude  pickupLatitude  dropoffLongitude  dropoffLatitude  totalAmount  month_num  day_of_month  day_of_week  hour_of_day
count  48000.00  48000.00        48000.00      48000.00         48000.00        48000.00          48000.00         48000.00     48000.00   48000.00      48000.00     48000.00
mean   1.78      1.37            2.87          -73.83           40.69           -73.84            40.70            14.75        6.50       15.13         3.27         13.52
std    0.41      1.04            2.93          2.76             1.52            2.61              1.44             12.08        3.45       8.45          1.95         6.83
min    1.00      0.00            0.00          -74.66           0.00            -74.66            0.00             -300.00      1.00       1.00          0.00         0.00
25%    2.00      1.00            1.06          -73.96           40.70           -73.97            40.70            7.80         3.75       8.00          2.00         9.00
50%    2.00      1.00            1.90          -73.94           40.75           -73.94            40.75            11.30        6.50       15.00         3.00         15.00
75%    2.00      1.00            3.60          -73.92           40.80           -73.91            40.79            17.80        9.25       22.00         5.00         19.00
max    2.00      9.00            97.57         0.00             41.93           0.00              41.94            450.00       12.00      30.00         6.00         23.00

From the summary statistics, you see that there are several fields that have outliers or values that will reduce model accuracy. First filter the lat/long fields to be within the bounds of the Manhattan area. This will filter out longer taxi trips or trips that are outliers with respect to their relationship with other features.

Additionally, filter the tripDistance field to be greater than zero but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.

Lastly, the totalAmount field has negative values for the taxi fares, which don't make sense in the context of our model, and the passengerCount field has bad data with the minimum values being zero.

Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training.

final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)

Call describe() again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi data to use for machine learning model training.

final_df.describe()

Configure workspace

Create a workspace object from the existing workspace. A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. Workspace.from_config() reads the file config.json and loads the authentication details into an object named ws. ws is used throughout the rest of the code in this tutorial.

from azureml.core.workspace import Workspace
ws = Workspace.from_config()
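
If you're running outside the notebook server and don't have a config.json file, you can connect to the workspace explicitly instead. The following is a minimal sketch; the workspace name, subscription ID, and resource group are placeholders to replace with your own values.

from azureml.core.workspace import Workspace

# Placeholder values below -- substitute your own workspace details.
ws = Workspace.get(name="<workspace-name>",
                   subscription_id="<subscription-id>",
                   resource_group="<resource-group>")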

Split the data into train and test sets

Split the data into training and test sets by using the train_test_split function in the scikit-learn library. This function segregates the data into a training data set used to fit the model and a test data set held back to evaluate it.

The test_size parameter determines the percentage of data to allocate to testing. The random_state parameter sets a seed for the random number generator, so that your train-test splits are deterministic.

from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(final_df, test_size=0.2, random_state=223)

The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy.

In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model.

Automatically train a model

To automatically train a model, take the following steps:

  1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
  2. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

Define training settings

Define the experiment parameter and model settings for training. View the full list of settings. Submitting the experiment with these default settings will take approximately 5-20 minutes, but if you want a shorter run time, reduce the experiment_timeout_hours parameter.

Property                   Value in this tutorial  Description
iteration_timeout_minutes  2                       Time limit in minutes for each iteration. Reduce this value to decrease total runtime.
experiment_timeout_hours   0.3                     Maximum amount of time in hours that all iterations combined can take before the experiment terminates.
enable_early_stopping      True                    Flag to enable early termination if the score is not improving in the short term.
primary_metric             spearman_correlation    Metric that you want to optimize. The best-fit model will be chosen based on this metric.
featurization              auto                    By using auto, the experiment can preprocess the input data (handling missing data, converting text to numeric, and so on).
verbosity                  logging.INFO            Controls the level of logging.
n_cross_validations        5                       Number of cross-validation splits to perform when validation data is not specified.

import logging

automl_settings = {
    "iteration_timeout_minutes": 2,
    "experiment_timeout_hours": 0.3,
    "enable_early_stopping": True,
    "primary_metric": 'spearman_correlation',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}

Use your defined training settings as a **kwargs parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             debug_log='automated_ml_errors.log',
                             training_data=x_train,
                             label_column_name="totalAmount",
                             **automl_settings)

Note

Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

Train the automatic regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined automl_config object to the experiment, and set the output to True to view progress during the run.

After starting the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.

from azureml.core.experiment import Experiment
experiment = Experiment(ws, "taxi-experiment")
local_run = experiment.submit(automl_config, show_output=True)
Running on local machine
Parent Run ID: AutoML_1766cdf7-56cf-4b28-a340-c4aeee15b12b
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper RandomForest             0:00:16       0.8746    0.8746
         1   MinMaxScaler RandomForest                      0:00:15       0.9468    0.9468
         2   StandardScalerWrapper ExtremeRandomTrees       0:00:09       0.9303    0.9468
         3   StandardScalerWrapper LightGBM                 0:00:10       0.9424    0.9468
         4   RobustScaler DecisionTree                      0:00:09       0.9449    0.9468
         5   StandardScalerWrapper LassoLars                0:00:09       0.9440    0.9468
         6   StandardScalerWrapper LightGBM                 0:00:10       0.9282    0.9468
         7   StandardScalerWrapper RandomForest             0:00:12       0.8946    0.9468
         8   StandardScalerWrapper LassoLars                0:00:16       0.9439    0.9468
         9   MinMaxScaler ExtremeRandomTrees                0:00:35       0.9199    0.9468
        10   RobustScaler ExtremeRandomTrees                0:00:19       0.9411    0.9468
        11   StandardScalerWrapper ExtremeRandomTrees       0:00:13       0.9077    0.9468
        12   StandardScalerWrapper LassoLars                0:00:15       0.9433    0.9468
        13   MinMaxScaler ExtremeRandomTrees                0:00:14       0.9186    0.9468
        14   RobustScaler RandomForest                      0:00:10       0.8810    0.9468
        15   StandardScalerWrapper LassoLars                0:00:55       0.9433    0.9468
        16   StandardScalerWrapper ExtremeRandomTrees       0:00:13       0.9026    0.9468
        17   StandardScalerWrapper RandomForest             0:00:13       0.9140    0.9468
        18   VotingEnsemble                                 0:00:23       0.9471    0.9471
        19   StackEnsemble                                  0:00:27       0.9463    0.9471

Explore the results

Explore the results of automatic training with a Jupyter widget. The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector.

from azureml.widgets import RunDetails
RunDetails(local_run).show()
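
If you prefer to inspect the iterations programmatically rather than through the widget, one option (a sketch, not part of the original tutorial) is to walk the child runs of local_run and read their logged metrics:

# Sketch: print each child iteration with its primary-metric value.
# Assumes the experiment has completed; the metric name matches the
# primary_metric setting defined earlier.
for child_run in local_run.get_children():
    metrics = child_run.get_metrics()
    print(child_run.id, metrics.get("spearman_correlation"))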

Jupyter widget run details; Jupyter widget plot

Retrieve the best model

Select the best model from your iterations. The get_output function returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
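
The overloads let you pull out the best run for a different logged metric, or the run produced by a specific iteration. A brief sketch, assuming the metric and iteration shown were actually logged during this experiment:

# Best run and model as judged by another logged metric (assumed available).
metric_run, metric_model = local_run.get_output(metric="r2_score")

# Run and model produced by a particular iteration, for example iteration 3.
iter_run, iter_model = local_run.get_output(iteration=3)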

Test the best model accuracy

Use the best model to run predictions on the test data set to predict taxi fares. The function predict uses the best model and predicts the values of y, trip cost, from the x_test data set. Print the first 10 predicted cost values from y_predict.

y_test = x_test.pop("totalAmount")

y_predict = fitted_model.predict(x_test)
print(y_predict[:10])

Calculate the root mean squared error of the results. Convert the y_test dataframe to a list to compare to the predicted values. The function mean_squared_error takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares.

from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse

Run the following code to calculate mean absolute percent error (MAPE) by using the full y_actual and y_predict data sets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values.

sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
Model MAPE:
0.14353867606052823

Model Accuracy:
0.8564613239394718

From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, with approximately 15% error, typically within +- $4.00 of the actual fare.

The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

Clean up resources

Do not complete this section if you plan on running other Azure Machine Learning tutorials.

Stop the compute instance

If you used a compute instance or notebook VM, stop the VM when you are not using it to reduce cost.

  1. In your workspace, select Compute.

  2. From the list, select the VM.

  3. Select Stop.

  4. When you're ready to use the server again, select Start.
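
You can also stop the instance from the SDK instead of the studio UI. A hedged sketch; "my-compute-instance" is a placeholder for your own instance name:

from azureml.core.compute import ComputeInstance

# "my-compute-instance" is a placeholder -- use the name of your instance.
instance = ComputeInstance(workspace=ws, name="my-compute-instance")
instance.stop(wait_for_completion=True, show_output=True)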

Delete everything

If you don't plan to use the resources you created, delete them, so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the far left.
  2. From the list, select the resource group you created.
  3. Select Delete resource group.
  4. Enter the resource group name. Then select Delete.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select Delete.
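
If you'd rather delete the workspace from the SDK, a minimal sketch follows (this permanently removes the workspace, so use it with care):

# Deletes the workspace; delete_dependent_resources=True also removes the
# associated storage account, key vault, and other dependent resources.
ws.delete(delete_dependent_resources=True, no_wait=False)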

Next steps

In this automated machine learning tutorial, you did the following tasks:

  • Configured a workspace and prepared data for an experiment.
  • Trained by using an automated regression model locally with custom parameters.
  • Explored and reviewed training results.

Deploy your model with Azure Machine Learning.