Tutorial: Use automated machine learning to predict taxi fares

In this tutorial, you use automated machine learning in Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

[Flow diagram]

In this tutorial, you learn the following tasks:

  • Download, transform, and clean data by using Azure Open Datasets
  • Train an automated machine learning regression model
  • Calculate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning service today.

Prerequisites

  • Complete the setup tutorial if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.
  • After you complete the setup tutorial, open the tutorials/regression-automated-ml.ipynb notebook by using the same notebook server.

This tutorial is also available on GitHub if you want to run it in your own local environment. Run pip install azureml-sdk[automl] azureml-opendatasets azureml-widgets to get the required packages.

Download and prepare data

Import the necessary packages. The Open Datasets package contains a class representing each data source (NycTlcGreen, for example) to easily filter date parameters before downloading.

from azureml.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. When you work in a non-Spark environment, Open Datasets allows downloading only one month of data at a time with certain classes to avoid MemoryError with large datasets.

To download the taxi data, iteratively fetch one month at a time, and before appending it to green_taxi_df, randomly sample 2,000 records from each month to avoid bloating the dataframe. Then preview the data.

green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2015","%m/%d/%Y")
end = datetime.strptime("1/31/2015","%m/%d/%Y")

for sample_month in range(12):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # Sample 2,000 records from each month and accumulate them in green_taxi_df.
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(2000)])

green_taxi_df.head(10)

| | vendorID | lpepPickupDatetime | lpepDropoffDatetime | passengerCount | tripDistance | puLocationId | doLocationId | pickupLongitude | pickupLatitude | dropoffLongitude | ... | paymentType | fareAmount | extra | mtaTax | improvementSurcharge | tipAmount | tollsAmount | ehailFee | totalAmount | tripType |
| 131969 | 2 | 2015-01-11 05:34:44 | 2015-01-11 05:45:03 | 3 | 4.84 | None | None | -73.88 | 40.84 | -73.94 | ... | 2 | 15.00 | 0.50 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 16.30 | 1.00 |
| 1129817 | 2 | 2015-01-20 16:26:29 | 2015-01-20 16:30:26 | 1 | 0.69 | None | None | -73.96 | 40.81 | -73.96 | ... | 2 | 4.50 | 1.00 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 6.30 | 1.00 |
| 1278620 | 2 | 2015-01-01 05:58:10 | 2015-01-01 06:00:55 | 1 | 0.45 | None | None | -73.92 | 40.76 | -73.91 | ... | 2 | 4.00 | 0.00 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 4.80 | 1.00 |
| 348430 | 2 | 2015-01-17 02:20:50 | 2015-01-17 02:41:38 | 1 | 0.00 | None | None | -73.81 | 40.70 | -73.82 | ... | 2 | 12.50 | 0.50 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 13.80 | 1.00 |
| 1269627 | 1 | 2015-01-01 05:04:10 | 2015-01-01 05:06:23 | 1 | 0.50 | None | None | -73.92 | 40.76 | -73.92 | ... | 2 | 4.00 | 0.50 | 0.50 | 0 | 0.00 | 0.00 | nan | 5.00 | 1.00 |
| 811755 | 1 | 2015-01-04 19:57:51 | 2015-01-04 20:05:45 | 2 | 1.10 | None | None | -73.96 | 40.72 | -73.95 | ... | 2 | 6.50 | 0.50 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 7.80 | 1.00 |
| 737281 | 1 | 2015-01-03 12:27:31 | 2015-01-03 12:33:52 | 1 | 0.90 | None | None | -73.88 | 40.76 | -73.87 | ... | 2 | 6.00 | 0.00 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 6.80 | 1.00 |
| 113951 | 1 | 2015-01-09 23:25:51 | 2015-01-09 23:39:52 | 1 | 3.30 | None | None | -73.96 | 40.72 | -73.91 | ... | 2 | 12.50 | 0.50 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 13.80 | 1.00 |
| 150436 | 2 | 2015-01-11 17:15:14 | 2015-01-11 17:22:57 | 1 | 1.19 | None | None | -73.94 | 40.71 | -73.95 | ... | 1 | 7.00 | 0.00 | 0.50 | 0.3 | 1.75 | 0.00 | nan | 9.55 | 1.00 |
| 432136 | 2 | 2015-01-22 23:16:33 | 2015-01-22 23:20:13 | 1 | 0.65 | None | None | -73.94 | 40.71 | -73.94 | ... | 2 | 5.00 | 0.50 | 0.50 | 0.3 | 0.00 | 0.00 | nan | 6.30 | 1.00 |

10 rows × 23 columns

Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This creates new fields for the month number, day of month, day of week, and hour of day, and allows the model to factor in time-based seasonality. Use the apply() function on the dataframe to iteratively apply the build_time_features() function to each row in the taxi data.

def build_time_features(vector):
    pickup_datetime = vector[0]
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour

    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)

| | vendorID | lpepPickupDatetime | lpepDropoffDatetime | passengerCount | tripDistance | puLocationId | doLocationId | pickupLongitude | pickupLatitude | dropoffLongitude | ... | improvementSurcharge | tipAmount | tollsAmount | ehailFee | totalAmount | tripType | month_num | day_of_month | day_of_week | hour_of_day |
| 131969 | 2 | 2015-01-11 05:34:44 | 2015-01-11 05:45:03 | 3 | 4.84 | None | None | -73.88 | 40.84 | -73.94 | ... | 0.3 | 0.00 | 0.00 | nan | 16.30 | 1.00 | 1 | 11 | 6 | 5 |
| 1129817 | 2 | 2015-01-20 16:26:29 | 2015-01-20 16:30:26 | 1 | 0.69 | None | None | -73.96 | 40.81 | -73.96 | ... | 0.3 | 0.00 | 0.00 | nan | 6.30 | 1.00 | 1 | 20 | 1 | 16 |
| 1278620 | 2 | 2015-01-01 05:58:10 | 2015-01-01 06:00:55 | 1 | 0.45 | None | None | -73.92 | 40.76 | -73.91 | ... | 0.3 | 0.00 | 0.00 | nan | 4.80 | 1.00 | 1 | 1 | 3 | 5 |
| 348430 | 2 | 2015-01-17 02:20:50 | 2015-01-17 02:41:38 | 1 | 0.00 | None | None | -73.81 | 40.70 | -73.82 | ... | 0.3 | 0.00 | 0.00 | nan | 13.80 | 1.00 | 1 | 17 | 5 | 2 |
| 1269627 | 1 | 2015-01-01 05:04:10 | 2015-01-01 05:06:23 | 1 | 0.50 | None | None | -73.92 | 40.76 | -73.92 | ... | 0 | 0.00 | 0.00 | nan | 5.00 | 1.00 | 1 | 1 | 3 | 5 |
| 811755 | 1 | 2015-01-04 19:57:51 | 2015-01-04 20:05:45 | 2 | 1.10 | None | None | -73.96 | 40.72 | -73.95 | ... | 0.3 | 0.00 | 0.00 | nan | 7.80 | 1.00 | 1 | 4 | 6 | 19 |
| 737281 | 1 | 2015-01-03 12:27:31 | 2015-01-03 12:33:52 | 1 | 0.90 | None | None | -73.88 | 40.76 | -73.87 | ... | 0.3 | 0.00 | 0.00 | nan | 6.80 | 1.00 | 1 | 3 | 5 | 12 |
| 113951 | 1 | 2015-01-09 23:25:51 | 2015-01-09 23:39:52 | 1 | 3.30 | None | None | -73.96 | 40.72 | -73.91 | ... | 0.3 | 0.00 | 0.00 | nan | 13.80 | 1.00 | 1 | 9 | 4 | 23 |
| 150436 | 2 | 2015-01-11 17:15:14 | 2015-01-11 17:22:57 | 1 | 1.19 | None | None | -73.94 | 40.71 | -73.95 | ... | 0.3 | 1.75 | 0.00 | nan | 9.55 | 1.00 | 1 | 11 | 6 | 17 |
| 432136 | 2 | 2015-01-22 23:16:33 | 2015-01-22 23:20:13 | 1 | 0.65 | None | None | -73.94 | 40.71 | -73.94 | ... | 0.3 | 0.00 | 0.00 | nan | 6.30 | 1.00 | 1 | 22 | 3 | 23 |

10 rows × 27 columns

Remove some of the columns that you won't need for training or additional feature building.

columns_to_remove = ["lpepPickupDatetime", "lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID",
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)

green_taxi_df.head(5)

Cleanse data

Run the describe() function on the new dataframe to see summary statistics for each field.

green_taxi_df.describe()
| | vendorID | passengerCount | tripDistance | pickupLongitude | pickupLatitude | dropoffLongitude | dropoffLatitude | totalAmount | month_num | day_of_month | day_of_week | hour_of_day |
| count | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 | 48000.00 |
| mean | 1.78 | 1.37 | 2.87 | -73.83 | 40.69 | -73.84 | 40.70 | 14.75 | 6.50 | 15.13 | 3.27 | 13.52 |
| std | 0.41 | 1.04 | 2.93 | 2.76 | 1.52 | 2.61 | 1.44 | 12.08 | 3.45 | 8.45 | 1.95 | 6.83 |
| min | 1.00 | 0.00 | 0.00 | -74.66 | 0.00 | -74.66 | 0.00 | -300.00 | 1.00 | 1.00 | 0.00 | 0.00 |
| 25% | 2.00 | 1.00 | 1.06 | -73.96 | 40.70 | -73.97 | 40.70 | 7.80 | 3.75 | 8.00 | 2.00 | 9.00 |
| 50% | 2.00 | 1.00 | 1.90 | -73.94 | 40.75 | -73.94 | 40.75 | 11.30 | 6.50 | 15.00 | 3.00 | 15.00 |
| 75% | 2.00 | 1.00 | 3.60 | -73.92 | 40.80 | -73.91 | 40.79 | 17.80 | 9.25 | 22.00 | 5.00 | 19.00 |
| max | 2.00 | 9.00 | 97.57 | 0.00 | 41.93 | 0.00 | 41.94 | 450.00 | 12.00 | 30.00 | 6.00 | 23.00 |

From the summary statistics, you see that several fields have outliers or values that will reduce model accuracy. First, filter the latitude/longitude fields to be within the bounds of the Manhattan area. This filters out longer taxi trips or trips that are outliers with respect to their relationship with other features.

Additionally, filter the tripDistance field to be greater than zero but less than 31 miles (the haversine distance between the two latitude/longitude pairs). This eliminates long outlier trips that have inconsistent trip cost.
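
The 31-mile cutoff corresponds roughly to the haversine (great-circle) distance across the pickup bounding box applied in the next cell. The following sketch is not part of the tutorial code; haversine_miles is a hypothetical helper, and the Earth radius is an approximation:

from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (latitude, longitude) points, in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3956 * asin(sqrt(a))  # 3956 mi is an approximate Earth radius

# Diagonal of the pickup bounding box used below -- roughly 31 miles.
print(haversine_miles(40.53, -74.09, 40.88, -73.72))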

Lastly, the totalAmount field has negative values for the taxi fares, which don't make sense in the context of the model, and the passengerCount field has bad data with the minimum values being zero.

Filter out these anomalies by using query functions, and then remove the last few columns that are unnecessary for training.

final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)

Call describe() again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi data to use for machine learning model training.

final_df.describe()

Configure workspace

Create a workspace object from the existing workspace. A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. Workspace.from_config() reads the file config.json and loads the authentication details into an object named ws. ws is used throughout the rest of the code in this tutorial.

from azureml.core.workspace import Workspace
ws = Workspace.from_config()
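
As a quick, optional sanity check (not part of the tutorial code), you can print a few standard attributes of the loaded workspace to confirm that config.json resolved to the workspace you expect:

# Print basic details of the workspace loaded from config.json.
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep="\n")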

Split the data into train and test sets

Split the data into training and test sets by using the train_test_split function in the scikit-learn library. This function separates the data into the x (features) data set for model training and the y (values to predict) data set for testing.

The test_size parameter determines the percentage of data to allocate to testing. The random_state parameter sets a seed for the random generator, so that your train-test splits are deterministic.

from sklearn.model_selection import train_test_split

y_df = final_df.pop("totalAmount")
x_df = final_df

x_train, x_test, y_train, y_test = train_test_split(x_df, y_df, test_size=0.2, random_state=223)

The purpose of this step is to have data points for testing the finished model that haven't been used to train the model, in order to measure true accuracy.

In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model.

Automatically train a model

To automatically train a model, take the following steps:

  1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
  2. Submit the experiment for model tuning. After you submit the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

Define training settings

Define the experiment parameters and model settings for training. View the full list of settings. Submitting the experiment with these default settings takes approximately 5-10 minutes, but if you want a shorter run time, reduce the iterations parameter.

| Property | Value in this tutorial | Description |
| iteration_timeout_minutes | 2 | Time limit in minutes for each iteration. Reduce this value to decrease total runtime. |
| iterations | 20 | Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time. |
| primary_metric | spearman_correlation | Metric that you want to optimize. The best-fit model is chosen based on this metric. |
| preprocess | True | By using True, the experiment can preprocess the input data (handling missing data, converting text to numeric values, and so on). |
| verbosity | logging.INFO | Controls the level of logging. |
| n_cross_validations | 5 | Number of cross-validation splits to perform when validation data is not specified. |

import logging

automl_settings = {
    "iteration_timeout_minutes": 2,
    "iterations": 20,
    "primary_metric": 'spearman_correlation',
    "preprocess": True,
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}

Use your defined training settings as a **kwargs parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case.

from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             debug_log='automated_ml_errors.log',
                             X=x_train.values,
                             y=y_train.values.flatten(),
                             **automl_settings)

Note

Automated machine learning preprocessing steps (feature normalization, handling missing data, converting text to numeric values, and so on) become part of the underlying model. When you use the model for predictions, the same preprocessing steps applied during training are applied to your input data automatically.

Train the automatic regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined automl_config object to the experiment, and set the output to True to view progress during the run.

After you start the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.

from azureml.core.experiment import Experiment
experiment = Experiment(ws, "taxi-experiment")
local_run = experiment.submit(automl_config, show_output=True)
Running on local machine
Parent Run ID: AutoML_1766cdf7-56cf-4b28-a340-c4aeee15b12b
Current status: DatasetFeaturization. Beginning to featurize the dataset.
Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   StandardScalerWrapper RandomForest             0:00:16       0.8746    0.8746
         1   MinMaxScaler RandomForest                      0:00:15       0.9468    0.9468
         2   StandardScalerWrapper ExtremeRandomTrees       0:00:09       0.9303    0.9468
         3   StandardScalerWrapper LightGBM                 0:00:10       0.9424    0.9468
         4   RobustScaler DecisionTree                      0:00:09       0.9449    0.9468
         5   StandardScalerWrapper LassoLars                0:00:09       0.9440    0.9468
         6   StandardScalerWrapper LightGBM                 0:00:10       0.9282    0.9468
         7   StandardScalerWrapper RandomForest             0:00:12       0.8946    0.9468
         8   StandardScalerWrapper LassoLars                0:00:16       0.9439    0.9468
         9   MinMaxScaler ExtremeRandomTrees                0:00:35       0.9199    0.9468
        10   RobustScaler ExtremeRandomTrees                0:00:19       0.9411    0.9468
        11   StandardScalerWrapper ExtremeRandomTrees       0:00:13       0.9077    0.9468
        12   StandardScalerWrapper LassoLars                0:00:15       0.9433    0.9468
        13   MinMaxScaler ExtremeRandomTrees                0:00:14       0.9186    0.9468
        14   RobustScaler RandomForest                      0:00:10       0.8810    0.9468
        15   StandardScalerWrapper LassoLars                0:00:55       0.9433    0.9468
        16   StandardScalerWrapper ExtremeRandomTrees       0:00:13       0.9026    0.9468
        17   StandardScalerWrapper RandomForest             0:00:13       0.9140    0.9468
        18   VotingEnsemble                                 0:00:23       0.9471    0.9471
        19   StackEnsemble                                  0:00:27       0.9463    0.9471

Explore the results

Explore the results of automatic training with a Jupyter widget. The widget lets you see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on accuracy metrics other than your primary metric by using the dropdown selector.

from azureml.widgets import RunDetails
RunDetails(local_run).show()

[Screenshots: Jupyter widget run details; Jupyter widget plot]

Retrieve the best model

Select the best model from your iterations. The get_output function returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
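
As a sketch of the overloads mentioned above, get_output also accepts an iteration or metric argument; the argument names below follow the AutoMLRun documentation, so treat them as assumptions if your SDK version differs:

# Best run and fitted model for a specific iteration.
third_run, third_model = local_run.get_output(iteration=3)

# Best run and fitted model according to a different logged metric.
best_r2_run, best_r2_model = local_run.get_output(metric="r2_score")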

Test the best model accuracy

Use the best model to run predictions on the test data set to predict taxi fares. The function predict uses the best model and predicts the values of y, trip cost, from the x_test data set. Print the first 10 predicted cost values from y_predict.

y_predict = fitted_model.predict(x_test.values)
print(y_predict[:10])

Calculate the root mean squared error of the results. Convert the y_test dataframe to a list to compare to the predicted values. The function mean_squared_error takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares.

from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse

Run the following code to calculate mean absolute percent error (MAPE) by using the full y_actual and y_predict data sets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percentage of the total of the actual values.

sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
Model MAPE:
0.14353867606052823

Model Accuracy:
0.8564613239394718
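
For reference, the same calculation can be written in vectorized form with NumPy; this is a sketch, assuming numpy is installed in the environment:

import numpy as np

# Sum of absolute errors divided by the sum of actual values -- the same
# mean absolute percent error computed by the loop above.
abs_errors = np.abs(np.array(y_actual) - np.array(y_predict))
mean_abs_percent_error = abs_errors.sum() / np.sum(y_actual)
print("Model MAPE:", mean_abs_percent_error)
print("Model Accuracy:", 1 - mean_abs_percent_error)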

From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features, typically within +- $4.00 and with approximately 15% error.

The traditional machine learning model development process is highly resource-intensive, and it requires significant domain knowledge and time to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

Clean up resources

Do not complete this section if you plan to run other Azure Machine Learning service tutorials.

Stop the notebook VM

If you used a cloud notebook server, stop the VM when you're not using it to reduce cost.

  1. In your workspace, select Notebook VMs.
  2. From the list, select the VM.
  3. Select Stop.
  4. When you're ready to use the server again, select Start.

Delete everything

If you don't plan to use the resources you created, delete them so you don't incur any charges.

  1. In the Azure portal, select Resource groups on the far left.
  2. From the list, select the resource group you created.
  3. Select Delete resource group.
  4. Enter the resource group name, and then select Delete.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select Delete.

Next steps

In this automated machine learning tutorial, you did the following tasks:

  • Configured a workspace and prepared data for an experiment.
  • Trained an automated regression model locally with custom parameters.
  • Explored and reviewed training results.

Deploy your model with Azure Machine Learning service.