Tutorial: Prepare data for regression modeling

In this tutorial, you learn how to prepare data for regression modeling by using the data prep package from the Azure Machine Learning SDK. You run various transformations to filter and combine two different NYC taxi data sets.

This tutorial is part one of a two-part tutorial series. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.

In this tutorial, you:

  • Set up a Python environment and import packages.
  • Load two datasets with different field names.
  • Cleanse data to remove anomalies.
  • Transform data by using intelligent transforms to create new features.
  • Save your dataflow object to use in a regression model.

Prerequisites

Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, you will need:

Use a cloud notebook server in your workspace

It's easy to get started with your own cloud-based notebook server. The Azure Machine Learning SDK for Python is already installed and configured for you once you create this cloud resource.

  • After you launch the notebook webpage, run the tutorials/regression-part1-data-prep.ipynb notebook.

Use your own Jupyter notebook server

Use these steps to create a local Jupyter Notebook server on your computer. After you complete the steps, run the tutorials/regression-part1-data-prep.ipynb notebook.

  1. Complete the installation steps in the Azure Machine Learning Python quickstart to create a Miniconda environment and install the SDK. You can skip the Create a workspace section if you wish, but you will need a workspace for part 2 of this tutorial series.

  2. The azureml-dataprep package is automatically installed when you install the SDK.

  3. Clone the GitHub repository.

    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
  4. Start the notebook server from your cloned directory.

    jupyter notebook
    

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Install the SDK
  • Import Python packages

Install and import packages

Use the following command to install the necessary packages if you don't already have them.

pip install "azureml-dataprep[pandas]>=1.1.0,<1.2.0"

Import the package.

import azureml.dataprep as dprep

Important

Ensure that you install the latest version of the azureml.dataprep package. This tutorial will not work with a version number lower than 1.1.0.
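
If you're unsure which version is installed, a quick check like the following (a sketch using the standard pkg_resources utility, not part of the original notebook) confirms it before you continue.

# Optional: verify that the installed azureml-dataprep version is at least 1.1.0.
import pkg_resources

dataprep_version = pkg_resources.get_distribution("azureml-dataprep").version
print(dataprep_version)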

Load data

Download two different NYC taxi data sets into dataflow objects. The datasets have slightly different fields. The auto_read_file() method automatically recognizes the input file type.

from IPython.display import display
dataset_root = "https://dprepdata.blob.core.windows.net/demo"

green_path = "/".join([dataset_root, "green-small/*"])
yellow_path = "/".join([dataset_root, "yellow-small/*"])

green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)
# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.
yellow_df_raw = dprep.auto_read_file(path=yellow_path)

display(green_df_raw.head(5))
display(yellow_df_raw.head(5))

A Dataflow object is similar to a dataframe, and represents a series of lazily-evaluated, immutable operations on data. Operations can be added by invoking the different transformation and filtering methods available. The result of adding an operation to a Dataflow is always a new Dataflow object.
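
As a brief illustration of this behavior (a sketch, not part of the original notebook): every transform returns a new Dataflow and leaves its source unchanged, and no data is actually read until you materialize results, for example with head() or get_profile().

# keep_columns returns a new Dataflow; green_df_raw itself is not modified.
subset_df = green_df_raw.keep_columns(columns=["VendorID"])
print(subset_df is green_df_raw)  # False: transforms never mutate in place
subset_df.head(5)  # execution happens only at materialization calls like head()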

Cleanse data

Now you populate some variables with shortcut transforms to apply to all dataflows. The drop_if_all_null variable is used to delete records where all fields are null. The useful_columns variable holds an array of column descriptions that are kept in each dataflow.

all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
    "passengers", "pickup_datetime", "pickup_latitude", "pickup_longitude", "store_forward", "vendor"
]

You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the replace_na(), drop_nulls(), and keep_columns() functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in the useful_columns variable.

green_df = (green_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "lpep_pickup_datetime": "pickup_datetime",
        "Lpep_dropoff_datetime": "dropoff_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "Store_and_fwd_flag": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Pickup_longitude": "pickup_longitude",
        "Pickup_latitude": "pickup_latitude",
        "Dropoff_longitude": "dropoff_longitude",
        "Dropoff_latitude": "dropoff_latitude",
        "Passenger_count": "passengers",
        "Fare_amount": "cost",
        "Trip_distance": "distance"
     })
    .keep_columns(columns=useful_columns))
green_df.head(5)
| | vendor | pickup_datetime | dropoff_datetime | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 08:14:37 | 2013-08-01 09:09:06 | N | 0 | 0 | 0 | 0 | 1 | .00 | 21.25 |
| 1 | 2 | 2013-08-01 09:13:00 | 2013-08-01 11:38:00 | N | 0 | 0 | 0 | 0 | 2 | .00 | 74.5 |
| 2 | 2 | 2013-08-01 09:48:00 | 2013-08-01 09:49:00 | N | 0 | 0 | 0 | 0 | 1 | .00 | 1 |
| 3 | 2 | 2013-08-01 10:38:35 | 2013-08-01 10:38:51 | N | 0 | 0 | 0 | 0 | 1 | .00 | 3.25 |
| 4 | 2 | 2013-08-01 11:51:45 | 2013-08-01 12:03:52 | N | 0 | 0 | 0 | 0 | 1 | .00 | 8.5 |

Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which will help increase machine learning model accuracy.

yellow_df = (yellow_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "vendor_name": "vendor",
        "VendorID": "vendor",
        "vendor_id": "vendor",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "tpep_pickup_datetime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "store_and_forward": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Passenger_Count": "passengers",
        "passenger_count": "passengers",
        "Fare_Amt": "cost",
        "fare_amount": "cost",
        "Trip_Distance": "distance",
        "trip_distance": "distance"
    })
    .keep_columns(columns=useful_columns))
yellow_df.head(5)

Call the append_rows() function on the green taxi data to append the yellow taxi data. A new combined dataflow is created.

combined_df = green_df.append_rows([yellow_df])
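
If you want to verify the append (an optional check that isn't part of the original notebook), you can materialize each dataflow to pandas and compare row counts; these sample datasets are small enough for this to be cheap.

# Optional sanity check: the combined dataflow should contain every green and yellow record.
green_count = len(green_df.to_pandas_dataframe())
yellow_count = len(yellow_df.to_pandas_dataframe())
combined_count = len(combined_df.to_pandas_dataframe())
assert combined_count == green_count + yellow_count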

Convert types and filter

Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First, define a TypeConverter object to change the latitude and longitude fields to decimal type. Next, call the keep_columns() function to restrict output to only the latitude and longitude fields, and then call the get_profile() function. These function calls create a condensed view of the dataflow to just show the lat/long fields, which makes it easier to evaluate missing or out-of-scope coordinates.

decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)
combined_df = combined_df.set_column_types(type_conversions={
    "pickup_longitude": decimal_type,
    "pickup_latitude": decimal_type,
    "dropoff_longitude": decimal_type,
    "dropoff_latitude": decimal_type
})
combined_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pickup_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -88.114046 | -73.961840 | -73.961964 | -73.947693 | -73.922097 | -73.846670 | 0.000000 | 0.000000 | 0.000000 | 18.792672 | -68.833579 |
| pickup_latitude | FieldType.DECIMAL | 0.000000 | 40.919121 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.682889 | 40.675541 | 40.721075 | 40.756159 | 40.803909 | 40.849406 | 40.870681 | 40.891244 | 10.345967 | 37.936742 |
| dropoff_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -87.699611 | -73.984734 | -73.985777 | -73.956250 | -73.928948 | -73.866208 | 0.000000 | 0.000000 | 0.000000 | 18.696526 | -68.896978 |
| dropoff_latitude | FieldType.DECIMAL | 0.000000 | 41.008934 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.662763 | 40.654851 | 40.717821 | 40.756534 | 40.784688 | 40.852437 | 40.879289 | 40.937291 | 10.290780 | 37.963774 |

From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City (this is determined from subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the filter() function and define the minimum and maximum bounds for each field. Then call the get_profile() function again to verify the transformation.

latlong_filtered_df = (combined_df
    .drop_nulls(
        columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],
        column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)
    )
    .filter(dprep.f_and(
        dprep.col("pickup_longitude") <= -73.72,
        dprep.col("pickup_longitude") >= -74.09,
        dprep.col("pickup_latitude") <= 40.88,
        dprep.col("pickup_latitude") >= 40.53,
        dprep.col("dropoff_longitude") <= -73.72,
        dprep.col("dropoff_longitude") >= -74.09,
        dprep.col("dropoff_latitude") <= 40.88,
        dprep.col("dropoff_latitude") >= 40.53
    )))
latlong_filtered_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pickup_longitude | FieldType.DECIMAL | -74.078156 | -73.736481 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.076314 | -73.962542 | -73.962893 | -73.948975 | -73.927856 | -73.866662 | -73.830438 | -73.823160 | -73.769750 | 0.048711 | -73.913865 |
| pickup_latitude | FieldType.DECIMAL | 40.575485 | 40.879852 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.632884 | 40.713105 | 40.711600 | 40.721403 | 40.758142 | 40.805145 | 40.848855 | 40.867567 | 40.877690 | 0.048348 | 40.765226 |
| dropoff_longitude | FieldType.DECIMAL | -74.085747 | -73.720871 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.078828 | -73.985650 | -73.985813 | -73.959041 | -73.936681 | -73.884846 | -73.815507 | -73.776697 | -73.733471 | 0.055961 | -73.920718 |
| dropoff_latitude | FieldType.DECIMAL | 40.583530 | 40.879734 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.597741 | 40.695376 | 40.695115 | 40.727549 | 40.758160 | 40.788378 | 40.850372 | 40.867968 | 40.878586 | 0.050462 | 40.759487 |

Split and rename columns

Look at the data profile for the store_forward column. This field is a boolean flag that is Y when the taxi did not have a connection to the server after the trip, and thus had to store the trip data in memory, and later forward it to the server when connected.

latlong_filtered_df.keep_columns(columns='store_forward').get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count |
|---|---|---|---|---|---|---|---|---|---|
| store_forward | FieldType.STRING | N | Y | 7059.0 | 99.0 | 6960.0 | 0.014025 | 0.0 | 0.0 |

Notice that the data profile output in the store_forward column shows that the data is inconsistent and there are missing or null values. Use the replace() and fill_nulls() functions to replace these values with the string "N":

replaced_stfor_vals_df = latlong_filtered_df.replace(columns="store_forward", find="0", replace_with="N").fill_nulls("store_forward", "N")
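
To confirm the replacement (an optional verification step), you can re-profile the column the same way as before; the missing count should now be zero, with N and Y as the only values.

replaced_stfor_vals_df.keep_columns(columns='store_forward').get_profile()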

distance 欄位上執行 replace 函式。Execute the replace function on the distance field. 此函式會將錯誤標示為 .00 的距離值重新格式化,並在所有 Null 中填入零。The function reformats distance values that are incorrectly labeled as .00, and fills any nulls with zeros. distance 欄位轉換為數值格式。Convert the distance field to numerical format. 不正確的資料點很可能是計程車資料收集系統中的異常。These incorrect data points are likely anomalies in the data collection system on the taxi cabs.

replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns="distance", find=".00", replace_with=0).fill_nulls("distance", 0)
replaced_distance_vals_df = replaced_distance_vals_df.to_number(["distance"])
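
Again optionally, a quick profile of the distance column confirms that it now parses as a decimal field with no nulls.

replaced_distance_vals_df.keep_columns(columns='distance').get_profile()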

Split the pickup and dropoff datetime values into the respective date and time columns. Use the split_column_by_example() function to make the split. In this case, the optional example parameter of the split_column_by_example() function is omitted. Therefore, the function automatically determines where to split based on the data.

time_split_df = (replaced_distance_vals_df
    .split_column_by_example(source_column="pickup_datetime")
    .split_column_by_example(source_column="dropoff_datetime"))
time_split_df.head(5)
| | vendor | pickup_datetime | pickup_datetime_1 | pickup_datetime_2 | dropoff_datetime | dropoff_datetime_1 | dropoff_datetime_2 | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | 2013-08-01 | 17:24:00 | 2013-08-01 17:25:00 | 2013-08-01 | 17:25:00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | 2013-08-06 | 06:51:19 | 2013-08-06 06:51:36 | 2013-08-06 | 06:51:36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | 2013-08-06 | 13:26:34 | 2013-08-06 13:26:57 | 2013-08-06 | 13:26:57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | 2013-08-06 | 13:27:53 | 2013-08-06 13:28:08 | 2013-08-06 | 13:28:08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

split_column_by_example() 函式產生的資料行重新命名,以使用有意義的名稱。Rename the columns generated by the split_column_by_example() function to use meaningful names.

renamed_col_df = (time_split_df
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time",
        "dropoff_datetime_1": "dropoff_date",
        "dropoff_datetime_2": "dropoff_time"
    }))
renamed_col_df.head(5)

Call the get_profile() function to see the full summary statistics after all cleansing steps.

renamed_col_df.get_profile()

Transform data

Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data, and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the split_column_by_example() function with no example parameter.

After you generate the new features, use the drop_columns() function to delete the original fields as the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.

Transforming the data in this way to create new time-based features will improve machine learning model accuracy. For example, generating a new feature for the weekday will help establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to high demand.

transformed_features_df = (renamed_col_df
    .derive_column_by_example(
        source_columns="pickup_date",
        new_column_name="pickup_weekday",
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    .derive_column_by_example(
        source_columns="dropoff_date",
        new_column_name="dropoff_weekday",
        example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]
    )

    .split_column_by_example(source_column="pickup_time")
    .split_column_by_example(source_column="dropoff_time")
    # The following two calls to split_column_by_example reference the column names generated from the previous two calls.
    .split_column_by_example(source_column="pickup_time_1")
    .split_column_by_example(source_column="dropoff_time_1")
    .drop_columns(columns=[
        "pickup_date", "pickup_time", "dropoff_date", "dropoff_time",
        "pickup_date_1", "dropoff_date_1", "pickup_time_1", "dropoff_time_1"
    ])

    .rename_columns(column_pairs={
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1_1": "pickup_hour",
        "pickup_time_1_2": "pickup_minute",
        "pickup_time_2": "pickup_second",
        "dropoff_date_2": "dropoff_month",
        "dropoff_date_3": "dropoff_monthday",
        "dropoff_time_1_1": "dropoff_hour",
        "dropoff_time_1_2": "dropoff_minute",
        "dropoff_time_2": "dropoff_second"
    }))

transformed_features_df.head(5)
| | vendor | pickup_datetime | pickup_weekday | pickup_hour | pickup_minute | pickup_second | dropoff_datetime | dropoff_weekday | dropoff_hour | dropoff_minute | dropoff_second | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | Thursday | 17 | 24 | 00 | 2013-08-01 17:25:00 | Thursday | 17 | 25 | 00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | Tuesday | 06 | 51 | 19 | 2013-08-06 06:51:36 | Tuesday | 06 | 51 | 36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | Tuesday | 13 | 26 | 34 | 2013-08-06 13:26:57 | Tuesday | 13 | 26 | 57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | Tuesday | 13 | 27 | 53 | 2013-08-06 13:28:08 | Tuesday | 13 | 28 | 08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

Notice that the data shows that the pickup and dropoff date and time components produced from the derived transformations are correct. Drop the pickup_datetime and dropoff_datetime columns because they're no longer needed (granular time features like hour, minute, and second are more useful for model training).

processed_df = transformed_features_df.drop_columns(columns=["pickup_datetime", "dropoff_datetime"])

Use the type inference functionality to automatically check the data type of each field, and display the inference results.

type_infer = processed_df.builders.set_column_types()
type_infer.learn()
type_infer

The resulting output of type_infer is as follows.

Column types conversion candidates:
'pickup_weekday': [FieldType.STRING],
'pickup_hour': [FieldType.DECIMAL],
'pickup_minute': [FieldType.DECIMAL],
'pickup_second': [FieldType.DECIMAL],
'dropoff_hour': [FieldType.DECIMAL],
'dropoff_minute': [FieldType.DECIMAL],
'dropoff_second': [FieldType.DECIMAL],
'store_forward': [FieldType.STRING],
'pickup_longitude': [FieldType.DECIMAL],
'dropoff_longitude': [FieldType.DECIMAL],
'passengers': [FieldType.DECIMAL],
'distance': [FieldType.DECIMAL],
'vendor': [FieldType.STRING],
'dropoff_weekday': [FieldType.STRING],
'pickup_latitude': [FieldType.DECIMAL],
'dropoff_latitude': [FieldType.DECIMAL],
'cost': [FieldType.DECIMAL]

The inference results look correct based on the data. Now apply the type conversions to the dataflow.

type_converted_df = type_infer.to_dataflow()
type_converted_df.get_profile()

Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow on records where both the cost and distance variable values are greater than zero. This step will significantly improve machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy.

final_df = type_converted_df.filter(dprep.col("distance") > 0)
final_df = final_df.filter(dprep.col("cost") > 0)
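
To double-check the filters (optional), profiling these two columns should now show minimum values greater than zero.

final_df.keep_columns(columns=["cost", "distance"]).get_profile()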

You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as shown in the following code.

import os

file_path = os.path.join(os.getcwd(), "dflows.dprep")
final_df.save(file_path)
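
Part two of this series opens the saved file again. As an optional round-trip check (assuming the Dataflow.open() method that part two also uses), you can reload the serialized dataflow and inspect it.

# Optional: reload the serialized dataflow to confirm it round-trips correctly.
reloaded_df = dprep.Dataflow.open(file_path)
reloaded_df.head(5)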

Clean up resources

To continue with part two of the tutorial, you need the dflows.dprep file in the current directory.

If you don't plan to continue to part two, delete the dflows.dprep file in your current directory. Delete this file whether you're running locally or in Azure Notebooks.

Next steps

In part one of this tutorial, you:

  • Set up your development environment.
  • Loaded and cleansed data sets.
  • Used smart transforms to predict your logic based on an example.
  • Merged and packaged datasets for machine learning training.

You're ready to use the training data in part two of the tutorial.