Tutorial: Use automated machine learning to build your regression model

This tutorial is part two of a two-part tutorial series. In the previous tutorial, you prepared the NYC taxi data for regression modeling.

Now you're ready to start building your model with Azure Machine Learning service. In this part of the tutorial, you use the prepared data and automatically generate a regression model to predict taxi fare prices. By using the automated machine learning capabilities of the service, you define your machine learning goals and constraints. You then launch the automated machine learning process and let it handle algorithm selection and hyperparameter tuning for you. The automated machine learning technique iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

(Figure: flow diagram)

In this tutorial, you learn the following tasks:

  • Set up a Python environment and import the SDK packages.
  • Configure an Azure Machine Learning service workspace.
  • Autotrain a regression model.
  • Run the model locally with custom parameters.
  • Explore the results.

If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning service today.

Note

Code in this article was tested with Azure Machine Learning SDK version 1.0.39.

Prerequisites

Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, you need:

  • Run the data preparation tutorial.
  • A Python 3.6 notebook server with the following installed:
    • The Azure Machine Learning SDK for Python with the automl and notebooks extras
    • matplotlib
  • The tutorial notebook
  • A machine learning workspace
  • The configuration file for the workspace in the same directory as the notebook

Get all these prerequisites from either of the sections below.

Use a cloud notebook server in your workspace

It's easy to get started with your own cloud-based notebook server. The Azure Machine Learning SDK for Python is already installed and configured for you once you create this cloud resource.

  • After you launch the notebook webpage, run the tutorials/regression-part2-automated-ml.ipynb notebook.

Use your own Jupyter notebook server

Use these steps to create a local Jupyter Notebook server on your computer. Make sure that you install matplotlib and the automl and notebooks extras in your environment.

  1. Use the instructions at Create an Azure Machine Learning service workspace to do the following:

    • Create a Miniconda environment
    • Install the Azure Machine Learning SDK for Python
    • Create a workspace
    • Write a workspace configuration file (aml_config/config.json).
  2. Clone the GitHub repository.

    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
  3. Start the notebook server from your cloned directory.

    jupyter notebook
    

After you complete the steps, run the tutorials/regression-part2-automated-ml.ipynb notebook.

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Install the SDK
  • Import Python packages
  • Configure your workspace

Install and import packages

If you are following the tutorial in your own Python environment, use the following command to install the necessary packages.

pip install azureml-sdk[automl,notebooks] matplotlib

Import the Python packages you need in this tutorial:

import azureml.core
import pandas as pd
from azureml.core.workspace import Workspace
import logging
import os

Configure workspace

Create a workspace object from the existing workspace. A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs.

Workspace.from_config() reads the file config.json and loads the details into an object named ws. ws is used throughout the rest of the code in this tutorial.

After you have a workspace object, specify a name for the experiment. Create and register a local directory with the workspace. The history of all runs is recorded under the specified experiment and in the Azure portal.

ws = Workspace.from_config()
# choose a name for the run history container in the workspace
experiment_name = 'automated-ml-regression'
# project folder
project_folder = './automated-ml-regression'

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
pd.set_option('display.max_colwidth', -1)
pd.DataFrame(data=output, index=['']).T
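As a rough illustration of what Workspace.from_config() reads, the sketch below writes and reloads a config.json file with the standard library only. It assumes the file holds three fields (subscription_id, resource_group, workspace_name); the placeholder values and the helper names write_config and read_config are hypothetical and not part of the SDK.

```python
import json
import os
import tempfile

# Placeholder values (assumptions); replace them with your own details.
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def write_config(folder, config):
    """Write config.json under folder/aml_config and return its path."""
    config_dir = os.path.join(folder, "aml_config")
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "config.json")
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return path


def read_config(path):
    """Load the workspace details back from the JSON file."""
    with open(path) as handle:
        return json.load(handle)


# Use a temporary folder so the sketch never overwrites a real config file.
demo_path = write_config(tempfile.mkdtemp(), workspace_config)
print(read_config(demo_path))
```

In the tutorial itself you don't need these helpers: the workspace creation steps produce the file, and Workspace.from_config() loads it.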

Explore data

Use the data flow object created in the previous tutorial. To summarize, part 1 of this tutorial cleaned the NYC Taxi data so it could be used in a machine learning model. Now, you use various features from the data set and allow an automated model to build relationships between the features and the price of a taxi trip. Open and run the data flow and review the results:

import azureml.dataprep as dprep

file_path = os.path.join(os.getcwd(), "dflows.dprep")

dflow_prepared = dprep.Dataflow.open(file_path)
dflow_prepared.get_profile()
Column | Type | Min | Max | Count | Missing count | Not missing count | Percent missing | Error count | Empty count | 0.1% quantile | 1% quantile | 5% quantile | 25% quantile | 50% quantile | 75% quantile | 95% quantile | 99% quantile | 99.9% quantile | Mean | Standard deviation | Variance | Skewness | Kurtosis
vendor | FieldType.STRING | 1 | VTS | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0
pickup_weekday | FieldType.STRING | Friday | Wednesday | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0
pickup_hour | FieldType.DECIMAL | 0 | 23 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 2.90047 | 2.69355 | 9.72889 | 16 | 19.3713 | 22.6974 | 23 | 23 | 14.2731 | 6.59242 | 43.46 | -0.693723 | -0.570403
pickup_minute | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 4.99701 | 4.95833 | 14.1528 | 29.3832 | 44.6825 | 56.4444 | 58.9909 | 59 | 29.427 | 17.4333 | 303.921 | 0.0120999 | -1.20981
pickup_second | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.28131 | 5 | 14.7832 | 29.9293 | 44.725 | 56.7573 | 59 | 59 | 29.7443 | 17.3595 | 301.351 | -0.0252399 | -1.19616
dropoff_weekday | FieldType.STRING | Friday | Wednesday | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0
dropoff_hour | FieldType.DECIMAL | 0 | 23 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 2.57153 | 2 | 9.58795 | 15.9994 | 19.6184 | 22.8317 | 23 | 23 | 14.2105 | 6.71093 | 45.0365 | -0.687292 | -0.61951
dropoff_minute | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.44383 | 4.84694 | 14.1036 | 28.8365 | 44.3102 | 56.6892 | 59 | 59 | 29.2907 | 17.4108 | 303.136 | 0.0222514 | -1.2181
dropoff_second | FieldType.DECIMAL | 0 | 59 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0 | 5.07801 | 5 | 14.5751 | 29.5972 | 45.4649 | 56.2729 | 59 | 59 | 29.772 | 17.5337 | 307.429 | -0.0212575 | -1.226
store_forward | FieldType.STRING | N | Y | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0
pickup_longitude | FieldType.DECIMAL | -74.0781 | -73.7459 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | -74.0578 | -73.9639 | -73.9656 | -73.9508 | -73.9255 | -73.8529 | -73.8302 | -73.8238 | -73.7697 | -73.9123 | 0.0503757 | 0.00253771 | 0.352172 | -0.923743
pickup_latitude | FieldType.DECIMAL | 40.5755 | 40.8799 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 40.632 | 40.7117 | 40.7115 | 40.7213 | 40.7565 | 40.8058 | 40.8478 | 40.8676 | 40.8778 | 40.7649 | 0.0494674 | 0.00244702 | 0.205972 | -0.777945
dropoff_longitude | FieldType.DECIMAL | -74.0857 | -73.7209 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | -74.0775 | -73.9875 | -73.9882 | -73.9638 | -73.935 | -73.8755 | -73.8125 | -73.7759 | -73.7327 | -73.9202 | 0.0584627 | 0.00341789 | 0.623622 | -0.262603
dropoff_latitude | FieldType.DECIMAL | 40.5835 | 40.8797 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 40.5973 | 40.6928 | 40.6911 | 40.7226 | 40.7567 | 40.7918 | 40.8495 | 40.868 | 40.8787 | 40.7583 | 0.0517399 | 0.00267701 | 0.0390404 | -0.203525
passengers | FieldType.DECIMAL | 1 | 6 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 1 | 5 | 5 | 6 | 6 | 2.39249 | 1.83197 | 3.3561 | 0.763144 | -1.23467
distance | FieldType.DECIMAL | 0.01 | 32.34 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 0.0108744 | 0.743898 | 0.738194 | 1.243 | 2.40168 | 4.74478 | 10.5136 | 14.9011 | 21.8035 | 3.5447 | 3.2943 | 10.8524 | 1.91556 | 4.99898
cost | FieldType.DECIMAL | 0.1 | 88 | 6148.0 | 0.0 | 6148.0 | 0.0 | 0.0 | 0.0 | 2.33837 | 5.00491 | 5 | 6.93129 | 10.524 | 17.4811 | 33.2343 | 50.0093 | 63.1753 | 13.6843 | 9.66571 | 93.426 | 1.78518 | 4.13972

You prepare the data for the experiment by keeping the columns in dflow_X that serve as features for model creation. You define dflow_y to be the prediction value, cost:

dflow_X = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance','passengers', 'vendor'])
dflow_y = dflow_prepared.keep_columns('cost')

Split the data into train and test sets

Now you split the data into training and test sets by using the train_test_split function in the sklearn library. This function splits both the features (x) and the values to predict (y) into a training portion, used to fit the model, and a test portion, held back for evaluation. The test_size parameter determines the fraction of data to allocate to testing (0.2 here, or 20 percent). The random_state parameter sets a seed for the random generator, so that your train-test splits are always deterministic:

from sklearn.model_selection import train_test_split

x_df = dflow_X.to_pandas_dataframe()
y_df = dflow_y.to_pandas_dataframe()

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)
# flatten y_train to 1d array
y_train.values.flatten()

The purpose of this step is to hold back data points that haven't been used to train the model, so that you can test the finished model against them and measure its true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have the necessary packages and data ready for autotraining your model.
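To make the role of the seed concrete, here is a minimal standard-library sketch of a seeded split. It is not sklearn's implementation (the shuffling and rounding details differ); the function name seeded_train_test_split is hypothetical. It only shows why a fixed seed yields the same split on every run.

```python
import random


def seeded_train_test_split(rows, test_size=0.2, seed=223):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train_a, test_a = seeded_train_test_split(rows)
train_b, test_b = seeded_train_test_split(rows)
print(len(train_a), len(test_a))  # sizes of the training and test portions
print(test_a == test_b)           # the same seed gives the same split
```

Changing the seed argument gives a different, but still repeatable, partition of the same rows.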

Automatically train a model

To automatically train a model, take the following steps:

  1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
  2. Submit the experiment for model tuning. After you submit the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.
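The two steps above can be sketched as a toy search loop. This is only an illustration of the idea, not how the service works internally: the candidate "models" are simple linear fare formulas, the data is made up, and the names search, base_fees, and per_mile_rates are hypothetical.

```python
from itertools import product

# Made-up trips: distance in miles and the fare that was charged.
distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.0, 7.5, 10.0, 12.5]


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def search(base_fees, per_mile_rates):
    """Try every parameter combination and keep the one with the lowest error."""
    best_params, best_error = None, float("inf")
    for base, rate in product(base_fees, per_mile_rates):
        predictions = [base + rate * d for d in distances]
        error = mean_absolute_error(fares, predictions)
        if error < best_error:
            best_params, best_error = (base, rate), error
    return best_params, best_error


best_params, best_error = search([2.0, 2.5, 3.0], [2.0, 2.5, 3.0])
print(best_params, best_error)
```

Automated machine learning applies the same pattern at a larger scale: each candidate is a full pipeline (preprocessing plus algorithm plus hyperparameters), and the score is the primary metric you choose.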

Define settings for autogeneration and tuning

Define the experiment parameter and model settings for autogeneration and tuning. View the full list of settings. Submitting the experiment with these default settings takes approximately 10-15 minutes, but if you want a shorter run time, reduce either iterations or iteration_timeout_minutes.

Property | Value in this tutorial | Description
iteration_timeout_minutes | 10 | Time limit in minutes for each iteration. Reduce this value to decrease total runtime.
iterations | 30 | Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.
primary_metric | spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.
preprocess | True | By using True, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)
verbosity | logging.INFO | Controls the level of logging.
n_cross_validations | 5 | Number of cross-validation splits to perform when validation data is not specified.
automl_settings = {
    "iteration_timeout_minutes" : 10,
    "iterations" : 30,
    "primary_metric" : 'spearman_correlation',
    "preprocess" : True,
    "verbosity" : logging.INFO,
    "n_cross_validations": 5
}
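The primary metric chosen above, spearman_correlation, measures how well the predicted values preserve the ordering of the actual values. A minimal standard-library sketch of the textbook definition (Pearson correlation computed on ranks, with tied values sharing their average rank) follows. The service's own computation may differ in details, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_correlation(actual, predicted):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Predictions that are numerically off but keep the fares in the same order.
actual_fares = [5.0, 12.0, 7.5, 30.0]
predicted_fares = [6.0, 40.0, 9.0, 100.0]
print(spearman_correlation(actual_fares, predicted_fares))
```

Because the metric depends only on ordering, a model can score close to 1 here while still being several dollars off, which is why the tutorial later checks RMSE as well.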

Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.

from azureml.train.automl import AutoMLConfig

# local compute
automated_ml_config = AutoMLConfig(task = 'regression',
                             debug_log = 'automated_ml_errors.log',
                             path = project_folder,
                             X = x_train.values,
                             y = y_train.values.flatten(),
                             **automl_settings)
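The n_cross_validations setting passed in above asks for five cross-validation splits. As a sketch of what a k-fold split means, the standard-library code below partitions row indices into contiguous folds and uses each fold once for validation. How the service actually shuffles or assigns rows is not described in this tutorial, so treat the details (and the name k_fold_indices) as assumptions.

```python
def k_fold_indices(n_rows, n_splits=5):
    """Yield (train_indices, validation_indices) for each fold."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // n_splits] * n_splits
    for i in range(n_rows % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, n_splits=5))
for train, validation in folds:
    print(validation, len(train))
```

Each row is used for validation exactly once, so the reported metric is an average over five models, each trained on the remaining rows.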

Train the automatic regression model

Start the experiment to run locally. Pass the defined automated_ml_config object to the experiment. Set show_output to True to view progress during the experiment:

from azureml.core.experiment import Experiment
experiment=Experiment(ws, experiment_name)
local_run = experiment.submit(automated_ml_config, show_output=True)

The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.

Parent Run ID: AutoML_02778de3-3696-46e9-a71b-521c8fca0651
*******************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
*******************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MaxAbsScaler ExtremeRandomTrees                0:00:08       0.9447    0.9447
         1   StandardScalerWrapper GradientBoosting         0:00:09       0.9536    0.9536
         2   StandardScalerWrapper ExtremeRandomTrees       0:00:09       0.8580    0.9536
         3   StandardScalerWrapper RandomForest             0:00:08       0.9147    0.9536
         4   StandardScalerWrapper ExtremeRandomTrees       0:00:45       0.9398    0.9536
         5   MaxAbsScaler LightGBM                          0:00:08       0.9562    0.9562
         6   StandardScalerWrapper ExtremeRandomTrees       0:00:27       0.8282    0.9562
         7   StandardScalerWrapper LightGBM                 0:00:07       0.9421    0.9562
         8   MaxAbsScaler DecisionTree                      0:00:08       0.9526    0.9562
         9   MaxAbsScaler RandomForest                      0:00:09       0.9355    0.9562
        10   MaxAbsScaler SGD                               0:00:09       0.9602    0.9602
        11   MaxAbsScaler LightGBM                          0:00:09       0.9553    0.9602
        12   MaxAbsScaler DecisionTree                      0:00:07       0.9484    0.9602
        13   MaxAbsScaler LightGBM                          0:00:08       0.9540    0.9602
        14   MaxAbsScaler RandomForest                      0:00:10       0.9365    0.9602
        15   MaxAbsScaler SGD                               0:00:09       0.9602    0.9602
        16   StandardScalerWrapper ExtremeRandomTrees       0:00:49       0.9171    0.9602
        17   SparseNormalizer LightGBM                      0:00:08       0.9191    0.9602
        18   MaxAbsScaler DecisionTree                      0:00:08       0.9402    0.9602
        19   StandardScalerWrapper ElasticNet               0:00:08       0.9603    0.9603
        20   MaxAbsScaler DecisionTree                      0:00:08       0.9513    0.9603
        21   MaxAbsScaler SGD                               0:00:08       0.9603    0.9603
        22   MaxAbsScaler SGD                               0:00:10       0.9602    0.9603
        23   StandardScalerWrapper ElasticNet               0:00:09       0.9603    0.9603
        24   StandardScalerWrapper ElasticNet               0:00:09       0.9603    0.9603
        25   MaxAbsScaler SGD                               0:00:09       0.9603    0.9603
        26   TruncatedSVDWrapper ElasticNet                 0:00:09       0.9602    0.9603
        27   MaxAbsScaler SGD                               0:00:12       0.9413    0.9603
        28   StandardScalerWrapper ElasticNet               0:00:07       0.9603    0.9603
        29    Ensemble                                      0:00:38       0.9622    0.9622
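The BEST column in the output above is a running maximum of the METRIC column, because a higher spearman_correlation is better. A short sketch using the first six METRIC values from that output:

```python
from itertools import accumulate

# METRIC values for iterations 0-5, copied from the run output above.
metric_by_iteration = [0.9447, 0.9536, 0.8580, 0.9147, 0.9398, 0.9562]

# accumulate with max keeps the best score observed so far at each step.
best_so_far = list(accumulate(metric_by_iteration, max))

for iteration, (metric, best) in enumerate(zip(metric_by_iteration, best_so_far)):
    print(f"{iteration:>9}   {metric:.4f}   {best:.4f}")
```

For a metric where lower is better, such as an error measure, the same idea applies with min in place of max.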

Explore the results

Explore the results of automatic training with a Jupyter widget or by examining the experiment history.

Option 1: Add a Jupyter widget to see results

If you use a Jupyter notebook, use this Jupyter notebook widget to see a graph and a table of all results:

from azureml.widgets import RunDetails
RunDetails(local_run).show()

(Figures: Jupyter widget run details; Jupyter widget plot)

Option 2: Get and examine all run iterations in Python

You can also retrieve the history of each experiment and explore the individual metrics for each iteration run. By examining RMSE (root_mean_squared_error) for each individual model run, you see that most iterations predict the taxi fare cost within a reasonable margin ($3-4).

children = list(local_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# sort the columns by iteration number (axis=1 sorts along the column axis)
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
metric | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29
explained_variance | 0.811037 | 0.880553 | 0.398582 | 0.776040 | 0.663869 | 0.875911 | 0.115632 | 0.586905 | 0.851911 | 0.793964 | ... | 0.850023 | 0.883603 | 0.883704 | 0.880797 | 0.881564 | 0.883708 | 0.881826 | 0.585377 | 0.883123 | 0.886817
mean_absolute_error | 2.189444 | 1.500412 | 5.480531 | 2.626316 | 2.973026 | 1.550199 | 6.383868 | 4.414241 | 1.743328 | 2.294601 | ... | 1.797402 | 1.415815 | 1.418167 | 1.578617 | 1.559427 | 1.413042 | 1.551698 | 4.069196 | 1.505795 | 1.430957
median_absolute_error | 1.438417 | 0.850899 | 4.579662 | 1.765210 | 1.594600 | 0.869883 | 4.266450 | 3.627355 | 0.954992 | 1.361014 | ... | 0.973634 | 0.774814 | 0.797269 | 1.147234 | 1.116424 | 0.783958 | 1.098464 | 2.709027 | 1.003728 | 0.851724
normalized_mean_absolute_error | 0.024908 | 0.017070 | 0.062350 | 0.029878 | 0.033823 | 0.017636 | 0.072626 | 0.050219 | 0.019833 | 0.026105 | ... | 0.020448 | 0.016107 | 0.016134 | 0.017959 | 0.017741 | 0.016076 | 0.017653 | 0.046293 | 0.017131 | 0.016279
normalized_median_absolute_error | 0.016364 | 0.009680 | 0.052101 | 0.020082 | 0.018141 | 0.009896 | 0.048538 | 0.041267 | 0.010865 | 0.015484 | ... | 0.011077 | 0.008815 | 0.009070 | 0.013052 | 0.012701 | 0.008919 | 0.012497 | 0.030819 | 0.011419 | 0.009690
normalized_root_mean_squared_error | 0.047968 | 0.037882 | 0.085572 | 0.052282 | 0.065809 | 0.038664 | 0.109401 | 0.071104 | 0.042294 | 0.049967 | ... | 0.042565 | 0.037685 | 0.037557 | 0.037643 | 0.037513 | 0.037560 | 0.037465 | 0.072077 | 0.037249 | 0.036716
normalized_root_mean_squared_log_error | 0.055353 | 0.045000 | 0.110219 | 0.065633 | 0.063589 | 0.044412 | 0.123433 | 0.092312 | 0.046130 | 0.055243 | ... | 0.046540 | 0.041804 | 0.041771 | 0.045175 | 0.044628 | 0.041617 | 0.044405 | 0.079651 | 0.042799 | 0.041530
r2_score | 0.810900 | 0.880328 | 0.398076 | 0.775957 | 0.642812 | 0.875719 | 0.021603 | 0.586514 | 0.851767 | 0.793671 | ... | 0.849809 | 0.880142 | 0.880952 | 0.880586 | 0.881347 | 0.880887 | 0.881613 | 0.548121 | 0.882883 | 0.886321
root_mean_squared_error | 4.216362 | 3.329810 | 7.521765 | 4.595604 | 5.784601 | 3.398540 | 9.616354 | 6.250011 | 3.717661 | 4.392072 | ... | 3.741447 | 3.312533 | 3.301242 | 3.308795 | 3.297389 | 3.301485 | 3.293182 | 6.335581 | 3.274209 | 3.227365
root_mean_squared_log_error | 0.243184 | 0.197702 | 0.484227 | 0.288349 | 0.279367 | 0.195116 | 0.542281 | 0.405559 | 0.202666 | 0.242702 | ... | 0.204464 | 0.183658 | 0.183514 | 0.198468 | 0.196067 | 0.182836 | 0.195087 | 0.349935 | 0.188031 | 0.182455
spearman_correlation | 0.944743 | 0.953618 | 0.857965 | 0.914703 | 0.939846 | 0.956159 | 0.828187 | 0.942069 | 0.952581 | 0.935477 | ... | 0.951287 | 0.960335 | 0.960195 | 0.960279 | 0.960288 | 0.960323 | 0.960161 | 0.941254 | 0.960293 | 0.962158
spearman_correlation_max | 0.944743 | 0.953618 | 0.953618 | 0.953618 | 0.953618 | 0.956159 | 0.956159 | 0.956159 | 0.956159 | 0.956159 | ... | 0.960303 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.960335 | 0.962158

12 rows × 30 columns
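Once the per-iteration metrics are in a dictionary, picking the iteration with the best value of any one metric is a one-liner. The sketch below uses a handful of root_mean_squared_error values copied from the table above. The helper best_iteration is hypothetical, not an SDK function.

```python
# root_mean_squared_error for a few iterations, copied from the table above.
rmse_by_iteration = {
    0: 4.216362,
    1: 3.329810,
    2: 7.521765,
    6: 9.616354,
    28: 3.274209,
    29: 3.227365,
}


def best_iteration(metric_by_iteration, lower_is_better=True):
    """Return the iteration number whose metric value is best."""
    pick = min if lower_is_better else max
    return pick(metric_by_iteration, key=metric_by_iteration.get)


print(best_iteration(rmse_by_iteration))                         # lowest RMSE
print(best_iteration(rmse_by_iteration, lower_is_better=False))  # highest RMSE
```

For error metrics such as RMSE, lower is better; for metrics such as r2_score or spearman_correlation, pass lower_is_better=False.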

Retrieve the best model

Select the best pipeline from the iterations. The get_output method on local_run returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or a particular iteration:

best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Test the best model accuracy

Use the best model to run predictions on the test dataset to predict taxi fares. The function predict uses the best model and predicts the values of y, trip cost, from the x_test dataset. Print the first 10 predicted cost values from y_predict:

y_predict = fitted_model.predict(x_test.values)
print(y_predict[:10])

Create a scatter plot to visualize the predicted cost values compared to the actual cost values. The following code uses the distance feature as the x-axis and trip cost as the y-axis. To compare the variance of predicted cost at each trip distance value, the first 100 predicted and actual cost values are plotted as separate series. Examining the plot shows that the distance/cost relationship is nearly linear, and the predicted cost values are in most cases very close to the actual cost values for the same trip distance.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(14, 10))
ax1 = fig.add_subplot(111)

# select the distance feature by column name rather than by position
distance_vals = x_test['distance'].tolist()
y_actual = y_test.values.flatten().tolist()

ax1.scatter(distance_vals[:100], y_predict[:100], s=18, c='b', marker="s", label='Predicted')
ax1.scatter(distance_vals[:100], y_actual[:100], s=18, c='r', marker="o", label='Actual')

ax1.set_xlabel('distance (mi)')
ax1.set_title('Predicted and Actual Cost/Distance')
ax1.set_ylabel('Cost ($)')

plt.legend(loc='upper left', prop={'size': 12})
plt.rcParams.update({'font.size': 14})
plt.show()

(Figure: prediction scatter plot)

Calculate the root mean squared error of the results. Use the y_test dataframe, converted to a list (y_actual), to compare with the predicted values. The function mean_squared_error takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares:

from sklearn.metrics import mean_squared_error
from math import sqrt

rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse
3.2204936862688798
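To see what the calculation above does step by step, here is a standard-library sketch of root mean squared error on four made-up fares. The values are illustrative only and are not taken from the tutorial's data.

```python
from math import sqrt


def root_mean_squared_error(actual, predicted):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares in dollars: the errors are 2, 2, 3, and 3.
actual_fares = [10.0, 20.0, 30.0, 40.0]
predicted_fares = [12.0, 18.0, 33.0, 37.0]

# Squared errors are 4, 4, 9, 9, so the mean is 6.5 and the RMSE is sqrt(6.5).
print(root_mean_squared_error(actual_fares, predicted_fares))
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones, which is why RMSE is usually larger than the mean absolute error on the same data.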

Run the following code to calculate mean absolute percent error (MAPE) by using the full y_actual and y_predict datasets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values:

sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = abs(actual_val - predict_val)

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
Model MAPE:
0.10545153869569586

Model Accuracy:
0.8945484613043041
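The calculation above divides the total absolute error by the total of the actual values. Another common definition of MAPE averages the percent error of each row instead. The sketch below, on made-up fares with hypothetical function names, shows that the two can give different numbers for the same predictions.

```python
def aggregate_percent_error(actual, predicted):
    """Total absolute error as a fraction of total actual value (as computed above)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(actual)


def per_row_mape(actual, predicted):
    """Mean of each row's absolute percent error; actual values must be nonzero."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Made-up fares in dollars: absolute errors are 2 and 3.
actual_fares = [10.0, 20.0]
predicted_fares = [12.0, 17.0]

print(aggregate_percent_error(actual_fares, predicted_fares))  # 5 / 30
print(per_row_mape(actual_fares, predicted_fares))             # (0.20 + 0.15) / 2
```

The aggregate form weights expensive trips more heavily and stays stable when some fares are close to zero, while the per-row form treats every trip equally.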

From the final prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within ±$3.00. The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

Clean up resources

Important

The resources you created can be used as prerequisites to other Azure Machine Learning service tutorials and how-to articles.

If you don't plan to use the resources you created, delete them, so you don't incur any charges:

  1. In the Azure portal, select Resource groups on the far left.

    (Figure: Delete in the Azure portal)

  2. From the list, select the resource group you created.

  3. Select Delete resource group.

  4. Enter the resource group name. Then select Delete.

Next steps

In this automated machine learning tutorial, you did the following tasks:

  • Configured a workspace and prepared data for an experiment.
  • Trained an automated regression model locally with custom parameters.
  • Explored and reviewed training results.

Deploy your model with Azure Machine Learning.