
Tutorial: Prepare data for regression modeling

In this tutorial, you learn how to prepare data for regression modeling by using the data prep package from the Azure Machine Learning SDK. You run various transformations to filter and combine two different NYC taxi data sets.

This tutorial is part one of a two-part tutorial series. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.

In this tutorial, you:

  • Set up a Python environment and import packages.
  • Load two datasets with different field names.
  • Cleanse data to remove anomalies.
  • Transform data by using intelligent transforms to create new features.
  • Save your dataflow object to use in a regression model.

Prerequisites

Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, use one of the following two options:

Use a cloud notebook server in your workspace

It's easy to get started with your own cloud-based notebook server. The Azure Machine Learning SDK for Python is already installed and configured for you once you create this cloud resource.

  • After you launch the notebook webpage, run the tutorials/regression-part1-data-prep.ipynb notebook.

Use your own Jupyter notebook server

Use these steps to create a local Jupyter Notebook server on your computer. After you complete the steps, run the tutorials/regression-part1-data-prep.ipynb notebook.

  1. Complete the installation steps in the Azure Machine Learning Python quickstart to create a Miniconda environment and install the SDK. Feel free to skip the Create a workspace section if you wish, but you will need a workspace for part two of this tutorial series.

  2. The azureml-dataprep package is automatically installed when you install the SDK.

  3. Clone the GitHub repository.

    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
  4. Start the notebook server from your cloned directory.

    jupyter notebook
    

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Install the SDK
  • Import Python packages

Install and import packages

Use the following command to install the necessary packages if you don't already have them.

pip install "azureml-dataprep[pandas]>=1.1.0,<1.2.0"

Import the package.

import azureml.dataprep as dprep

Important

Ensure that you install the latest azureml-dataprep package version. This tutorial won't work with a version earlier than 1.1.0.
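
If you're not sure which version is installed, one way to check (a quick sketch using the standard pkg_resources utility, not a step from the original notebook) is:

import pkg_resources

# Print the installed azureml-dataprep version; it should be 1.1.0 or later
print(pkg_resources.get_distribution("azureml-dataprep").version)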

Load data

Download two different NYC taxi data sets into dataflow objects. These data sets have slightly different fields. The auto_read_file() method automatically recognizes the input file type.

from IPython.display import display
dataset_root = "https://dprepdata.blob.core.windows.net/demo"

green_path = "/".join([dataset_root, "green-small/*"])
yellow_path = "/".join([dataset_root, "yellow-small/*"])

green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)
# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.
yellow_df_raw = dprep.auto_read_file(path=yellow_path)

display(green_df_raw.head(5))
display(yellow_df_raw.head(5))

Note

The URL in this example is not a complete URL. Instead, it refers to the demo folder in the blob container. The full URL to some of the data is https://dprepdata.blob.core.windows.net/demo/green-small/green_tripdata_2013-08.csv.
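
As an aside, if you ever want to work with one file rather than the whole folder, you can pass that full URL directly; a minimal sketch reusing the read_csv() call from above (the variable name single_file_df is illustrative):

# Read one specific month of green taxi data instead of the wildcard path
single_file_df = dprep.read_csv(
    path="https://dprepdata.blob.core.windows.net/demo/green-small/green_tripdata_2013-08.csv")
single_file_df.head(5)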

A Dataflow object is similar to a dataframe, and represents a series of lazily evaluated, immutable operations on data. You can add operations by calling the different transformation and filtering methods available. The result of adding an operation to a Dataflow is always a new Dataflow object.
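
A minimal sketch of this behavior, using only methods that appear in this tutorial (the variable name vendor_only is illustrative):

# keep_columns returns a new Dataflow; green_df_raw itself is unchanged
vendor_only = green_df_raw.keep_columns(columns=["VendorID"])

# No data is read or transformed until you materialize it, for example with head()
vendor_only.head(5)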

Cleanse data

Now you populate some variables with shortcut transforms to apply to all dataflows. The drop_if_all_null variable is used to delete records where all fields are null. The useful_columns variable holds an array of column descriptions that are kept in each dataflow.

all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
    "passengers", "pickup_datetime", "pickup_latitude", "pickup_longitude", "store_forward", "vendor"
]

You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the replace_na(), drop_nulls(), and keep_columns() functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataframe to match the names in the useful_columns variable.

green_df = (green_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "lpep_pickup_datetime": "pickup_datetime",
        "Lpep_dropoff_datetime": "dropoff_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "Store_and_fwd_flag": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Pickup_longitude": "pickup_longitude",
        "Pickup_latitude": "pickup_latitude",
        "Dropoff_longitude": "dropoff_longitude",
        "Dropoff_latitude": "dropoff_latitude",
        "Passenger_count": "passengers",
        "Fare_amount": "cost",
        "Trip_distance": "distance"
     })
    .keep_columns(columns=useful_columns))
green_df.head(5)
| | vendor | pickup_datetime | dropoff_datetime | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 08:14:37 | 2013-08-01 09:09:06 | N | 0 | 0 | 0 | 0 | 1 | .00 | 21.25 |
| 1 | 2 | 2013-08-01 09:13:00 | 2013-08-01 11:38:00 | N | 0 | 0 | 0 | 0 | 2 | .00 | 74.5 |
| 2 | 2 | 2013-08-01 09:48:00 | 2013-08-01 09:49:00 | N | 0 | 0 | 0 | 0 | 1 | .00 | 1 |
| 3 | 2 | 2013-08-01 10:38:35 | 2013-08-01 10:38:51 | N | 0 | 0 | 0 | 0 | 1 | .00 | 3.25 |
| 4 | 2 | 2013-08-01 11:51:45 | 2013-08-01 12:03:52 | N | 0 | 0 | 0 | 0 | 1 | .00 | 8.5 |

Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which helps increase machine learning model accuracy.

yellow_df = (yellow_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "vendor_name": "vendor",
        "VendorID": "vendor",
        "vendor_id": "vendor",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "tpep_pickup_datetime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "store_and_forward": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Passenger_Count": "passengers",
        "passenger_count": "passengers",
        "Fare_Amt": "cost",
        "fare_amount": "cost",
        "Trip_Distance": "distance",
        "trip_distance": "distance"
    })
    .keep_columns(columns=useful_columns))
yellow_df.head(5)

Call the append_rows() function on the green taxi data to append the yellow taxi data. A new combined dataframe is created.

combined_df = green_df.append_rows([yellow_df])
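
If you want to sanity-check the append before moving on, you can preview the combined dataflow the same way as before:

# Preview the first rows of the combined green and yellow taxi data
display(combined_df.head(5))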

Convert types and filter

Examine the pickup and drop-off coordinate summary statistics to see how the data is distributed. First, define a TypeConverter object to change the latitude and longitude fields to decimal type. Next, call the keep_columns() function to restrict output to only the latitude and longitude fields, and then call the get_profile() function. These function calls create a condensed view of the dataflow that shows just the lat/long fields, which makes it easier to evaluate missing or out-of-scope coordinates.

decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)
combined_df = combined_df.set_column_types(type_conversions={
    "pickup_longitude": decimal_type,
    "pickup_latitude": decimal_type,
    "dropoff_longitude": decimal_type,
    "dropoff_latitude": decimal_type
})
combined_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pickup_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -88.114046 | -73.961840 | -73.961964 | -73.947693 | -73.922097 | -73.846670 | 0.000000 | 0.000000 | 0.000000 | 18.792672 | -68.833579 |
| pickup_latitude | FieldType.DECIMAL | 0.000000 | 40.919121 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.682889 | 40.675541 | 40.721075 | 40.756159 | 40.803909 | 40.849406 | 40.870681 | 40.891244 | 10.345967 | 37.936742 |
| dropoff_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -87.699611 | -73.984734 | -73.985777 | -73.956250 | -73.928948 | -73.866208 | 0.000000 | 0.000000 | 0.000000 | 18.696526 | -68.896978 |
| dropoff_latitude | FieldType.DECIMAL | 0.000000 | 41.008934 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.662763 | 40.654851 | 40.717821 | 40.756534 | 40.784688 | 40.852437 | 40.879289 | 40.937291 | 10.290780 | 37.963774 |

From the summary statistics output, you see that there are missing coordinates and coordinates that aren't in New York City (determined through subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the filter() function and define the minimum and maximum bounds for each field. Then call the get_profile() function again to verify the transformation.

latlong_filtered_df = (combined_df
    .drop_nulls(
        columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],
        column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)
    )
    .filter(dprep.f_and(
        dprep.col("pickup_longitude") <= -73.72,
        dprep.col("pickup_longitude") >= -74.09,
        dprep.col("pickup_latitude") <= 40.88,
        dprep.col("pickup_latitude") >= 40.53,
        dprep.col("dropoff_longitude") <= -73.72,
        dprep.col("dropoff_longitude") >= -74.09,
        dprep.col("dropoff_latitude") <= 40.88,
        dprep.col("dropoff_latitude") >= 40.53
    )))
latlong_filtered_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pickup_longitude | FieldType.DECIMAL | -74.078156 | -73.736481 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.076314 | -73.962542 | -73.962893 | -73.948975 | -73.927856 | -73.866662 | -73.830438 | -73.823160 | -73.769750 | 0.048711 | -73.913865 |
| pickup_latitude | FieldType.DECIMAL | 40.575485 | 40.879852 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.632884 | 40.713105 | 40.711600 | 40.721403 | 40.758142 | 40.805145 | 40.848855 | 40.867567 | 40.877690 | 0.048348 | 40.765226 |
| dropoff_longitude | FieldType.DECIMAL | -74.085747 | -73.720871 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.078828 | -73.985650 | -73.985813 | -73.959041 | -73.936681 | -73.884846 | -73.815507 | -73.776697 | -73.733471 | 0.055961 | -73.920718 |
| dropoff_latitude | FieldType.DECIMAL | 40.583530 | 40.879734 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.597741 | 40.695376 | 40.695115 | 40.727549 | 40.758160 | 40.788378 | 40.850372 | 40.867968 | 40.878586 | 0.050462 | 40.759487 |

Split and rename columns

Look at the data profile for the store_forward column. This field is a boolean flag that is Y when the taxi did not have a connection to the server after the trip, and thus had to store the trip data in memory and later forward it to the server when connected.

latlong_filtered_df.keep_columns(columns='store_forward').get_profile()
| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count |
|---|---|---|---|---|---|---|---|---|---|
| store_forward | FieldType.STRING | N | Y | 7059.0 | 99.0 | 6960.0 | 0.014025 | 0.0 | 0.0 |

(The quantile, standard deviation, and mean columns are empty for this string field.)

Notice that the data profile output in the store_forward column shows that the data is inconsistent and there are missing or null values. Use the replace() and fill_nulls() functions to replace these values with the string "N":

replaced_stfor_vals_df = latlong_filtered_df.replace(columns="store_forward", find="0", replace_with="N").fill_nulls("store_forward", "N")
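
To confirm the replacement worked, you can profile the column again (the same pattern used above); Min and Max should now both be valid flag values, with no missing entries:

replaced_stfor_vals_df.keep_columns(columns="store_forward").get_profile()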

Execute the replace function on the distance field. The function reformats distance values that are incorrectly labeled as .00, and fills any nulls with zeros. Convert the distance field to numerical format. These incorrect data points are likely anomalies in the data collection system on the taxi cabs.

replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns="distance", find=".00", replace_with=0).fill_nulls("distance", 0)
replaced_distance_vals_df = replaced_distance_vals_df.to_number(["distance"])
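
Similarly, you can verify the distance cleanup by re-profiling the column; after to_number(), the field should report a decimal type:

replaced_distance_vals_df.keep_columns(columns="distance").get_profile()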

Split the pickup and dropoff datetime values into the respective date and time columns. Use the split_column_by_example() function to make the split. In this case, the optional example parameter of the split_column_by_example() function is omitted. Therefore, the function automatically determines where to split based on the data.

time_split_df = (replaced_distance_vals_df
    .split_column_by_example(source_column="pickup_datetime")
    .split_column_by_example(source_column="dropoff_datetime"))
time_split_df.head(5)
| | vendor | pickup_datetime | pickup_datetime_1 | pickup_datetime_2 | dropoff_datetime | dropoff_datetime_1 | dropoff_datetime_2 | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | 2013-08-01 | 17:24:00 | 2013-08-01 17:25:00 | 2013-08-01 | 17:25:00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | 2013-08-06 | 06:51:19 | 2013-08-06 06:51:36 | 2013-08-06 | 06:51:36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | 2013-08-06 | 13:26:34 | 2013-08-06 13:26:57 | 2013-08-06 | 13:26:57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | 2013-08-06 | 13:27:53 | 2013-08-06 13:28:08 | 2013-08-06 | 13:28:08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

split_column_by_example() 函数生成的列重命名,使用有意义的名称。Rename the columns generated by the split_column_by_example() function to use meaningful names.

renamed_col_df = (time_split_df
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time",
        "dropoff_datetime_1": "dropoff_date",
        "dropoff_datetime_2": "dropoff_time"
    }))
renamed_col_df.head(5)

Call the get_profile() function to see the full summary statistics after all cleansing steps.

renamed_col_df.get_profile()

Transform data

Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the split_column_by_example() function with no example parameter.

After you generate the new features, use the drop_columns() function to delete the original fields, because the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.

Transforming the data in this way to create new time-based features will improve machine learning model accuracy. For example, generating a new feature for the weekday helps establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to high demand.

transformed_features_df = (renamed_col_df
    .derive_column_by_example(
        source_columns="pickup_date",
        new_column_name="pickup_weekday",
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    .derive_column_by_example(
        source_columns="dropoff_date",
        new_column_name="dropoff_weekday",
        example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]
    )

    .split_column_by_example(source_column="pickup_time")
    .split_column_by_example(source_column="dropoff_time")
    # The following two calls to split_column_by_example reference the column names generated from the previous two calls.
    .split_column_by_example(source_column="pickup_time_1")
    .split_column_by_example(source_column="dropoff_time_1")
    .drop_columns(columns=[
        "pickup_date", "pickup_time", "dropoff_date", "dropoff_time",
        "pickup_date_1", "dropoff_date_1", "pickup_time_1", "dropoff_time_1"
    ])

    .rename_columns(column_pairs={
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1_1": "pickup_hour",
        "pickup_time_1_2": "pickup_minute",
        "pickup_time_2": "pickup_second",
        "dropoff_date_2": "dropoff_month",
        "dropoff_date_3": "dropoff_monthday",
        "dropoff_time_1_1": "dropoff_hour",
        "dropoff_time_1_2": "dropoff_minute",
        "dropoff_time_2": "dropoff_second"
    }))

transformed_features_df.head(5)
| | vendor | pickup_datetime | pickup_weekday | pickup_hour | pickup_minute | pickup_second | dropoff_datetime | dropoff_weekday | dropoff_hour | dropoff_minute | dropoff_second | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | Thursday | 17 | 24 | 00 | 2013-08-01 17:25:00 | Thursday | 17 | 25 | 00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | Tuesday | 06 | 51 | 19 | 2013-08-06 06:51:36 | Tuesday | 06 | 51 | 36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | Tuesday | 13 | 26 | 34 | 2013-08-06 13:26:57 | Tuesday | 13 | 26 | 57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | Tuesday | 13 | 27 | 53 | 2013-08-06 13:28:08 | Tuesday | 13 | 28 | 08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

Notice that the data shows that the pickup and dropoff date and time components produced from the derived transformations are correct. Drop the pickup_datetime and dropoff_datetime columns because they're no longer needed (granular time features like hour, minute, and second are more useful for model training).

processed_df = transformed_features_df.drop_columns(columns=["pickup_datetime", "dropoff_datetime"])

Use the type inference functionality to automatically check the data type of each field, and display the inference results.

type_infer = processed_df.builders.set_column_types()
type_infer.learn()
type_infer

The resulting output of type_infer is as follows.

Column types conversion candidates:
'pickup_weekday': [FieldType.STRING],
'pickup_hour': [FieldType.DECIMAL],
'pickup_minute': [FieldType.DECIMAL],
'pickup_second': [FieldType.DECIMAL],
'dropoff_hour': [FieldType.DECIMAL],
'dropoff_minute': [FieldType.DECIMAL],
'dropoff_second': [FieldType.DECIMAL],
'store_forward': [FieldType.STRING],
'pickup_longitude': [FieldType.DECIMAL],
'dropoff_longitude': [FieldType.DECIMAL],
'passengers': [FieldType.DECIMAL],
'distance': [FieldType.DECIMAL],
'vendor': [FieldType.STRING],
'dropoff_weekday': [FieldType.STRING],
'pickup_latitude': [FieldType.DECIMAL],
'dropoff_latitude': [FieldType.DECIMAL],
'cost': [FieldType.DECIMAL]

The inference results look correct based on the data. Now apply the type conversions to the dataflow.

type_converted_df = type_infer.to_dataflow()
type_converted_df.get_profile()

Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow on records where both the cost and distance variable values are greater than zero. This step significantly improves machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy.

final_df = type_converted_df.filter(dprep.col("distance") > 0)
final_df = final_df.filter(dprep.col("cost") > 0)
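
If you want to double-check the filters, one way (reusing the profiling pattern from earlier steps) is to confirm that the Min values of both columns are now greater than zero:

# Both Min values in the resulting profile should be positive
final_df.keep_columns(columns=["cost", "distance"]).get_profile()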

You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as shown in the following code.

import os

file_path = os.path.join(os.getcwd(), "dflows.dprep")
final_df.save(file_path)

Clean up resources

To continue with part two of the tutorial, you need the dflows.dprep file in the current directory.
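
As a sketch of how part two picks this file back up (assuming the same azureml-dataprep version), the saved dataflow can be reopened with Dataflow.open():

import os
import azureml.dataprep as dprep

# Reload the serialized dataflow from the file saved above
file_path = os.path.join(os.getcwd(), "dflows.dprep")
dflow_prepared = dprep.Dataflow.open(file_path)
dflow_prepared.get_profile()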

If you don't plan to continue to part two, delete the dflows.dprep file in your current directory. Delete this file whether you're running the execution locally or in Azure Notebooks.

Next steps

In part one of this tutorial, you:

  • Set up your development environment.
  • Loaded and cleansed data sets.
  • Used smart transforms to predict your logic based on an example.
  • Merged and packaged datasets for machine learning training.

You're ready to use the training data in part two of the tutorial: