Tutorial: Prepare data for regression modeling

In this tutorial, you learn how to prepare data for regression modeling by using the Azure Machine Learning Data Prep SDK. You run various transformations to filter and combine two different NYC taxi data sets.

This tutorial is part one of a two-part tutorial series. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.

In this tutorial, you:

  • Set up a Python environment and import packages.
  • Load two datasets with different field names.
  • Cleanse data to remove anomalies.
  • Transform data by using intelligent transforms to create new features.
  • Save your dataflow object to use in a regression model.

You can prepare your data in Python by using the Azure Machine Learning Data Prep SDK.

Get the notebook

For your convenience, this tutorial is available as a Jupyter notebook. Run the regression-part1-data-prep.ipynb notebook either in Azure Notebooks or in your own Jupyter Notebook server.

To learn how to run notebooks, see the article Use Jupyter notebooks to explore Azure Machine Learning service.

Import packages

You begin by importing the SDK.

import azureml.dataprep as dprep
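
If the SDK isn't already installed in your environment, it can be installed from PyPI before you run the import. The package name below corresponds to the azureml.dataprep module imported above; treat this install step as a convenience not covered in the original notebook.

# If the import fails, install the Data Prep SDK first (PyPI package for azureml.dataprep).
# In a Jupyter notebook, this can be run directly in a cell:
!pip install azureml-dataprep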

Load data

Download two different NYC taxi data sets into dataflow objects. The data sets have slightly different fields. The green taxi data is read with the read_csv() method, while the yellow taxi data uses the auto_read_file() method, which automatically recognizes and parses the input file type.

dataset_root = "https://dprepdata.blob.core.windows.net/demo"

green_path = "/".join([dataset_root, "green-small/*"])
yellow_path = "/".join([dataset_root, "yellow-small/*"])

green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)
# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.
yellow_df = dprep.auto_read_file(path=yellow_path)

display(green_df.head(5))
display(yellow_df.head(5))
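
Because the two data sets use different field names, it can help to list the raw column names side by side before cleansing. The sketch below assumes, as the display() calls above suggest, that head() returns a pandas DataFrame whose columns attribute holds the field names.

# List the raw field names of each dataflow; head() returns a pandas DataFrame,
# so its .columns attribute exposes the column names.
print(list(green_df.head(1).columns))
print(list(yellow_df.head(1).columns))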

Cleanse data

Now you populate some variables with shortcut transforms to apply to all dataflows. The drop_if_all_null variable is used to delete records where all fields are null. The useful_columns variable holds a list of the column names that are kept in each dataflow.

all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
    "passengers", "pickup_datetime", "pickup_latitude", "pickup_longitude", "store_forward", "vendor"
]
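
In the next step, the drop_if_all_null list is unpacked with *, so its two elements are passed positionally to drop_nulls(). For illustration only, the equivalent keyword form looks like the sketch below, matching how drop_nulls() is called later in this tutorial.

# Equivalent keyword form of drop_nulls(*drop_if_all_null): rows are dropped only
# when ALL of the selected columns are null. (Illustration only; dataflow operations
# are lazy, so this line doesn't change green_df.)
green_df.drop_nulls(
    columns=all_columns,
    column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)
)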

You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow named tmp_df. Call the replace_na(), drop_nulls(), and keep_columns() functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataflow to match the names in the useful_columns variable.

tmp_df = (green_df
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "lpep_pickup_datetime": "pickup_datetime",
        "Lpep_dropoff_datetime": "dropoff_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "Store_and_fwd_flag": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Pickup_longitude": "pickup_longitude",
        "Pickup_latitude": "pickup_latitude",
        "Dropoff_longitude": "dropoff_longitude",
        "Dropoff_latitude": "dropoff_latitude",
        "Passenger_count": "passengers",
        "Fare_amount": "cost",
        "Trip_distance": "distance"
     })
    .keep_columns(columns=useful_columns))
tmp_df.head(5)
vendor pickup_datetime dropoff_datetime store_forward pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passengers distance cost
0 2 2013-08-01 08:14:37 2013-08-01 09:09:06 N 0 0 0 0 1 .00 21.25
1 2 2013-08-01 09:13:00 2013-08-01 11:38:00 N 0 0 0 0 2 .00 74.5
2 2 2013-08-01 09:48:00 2013-08-01 09:49:00 N 0 0 0 0 1 .00 1
3 2 2013-08-01 10:38:35 2013-08-01 10:38:51 N 0 0 0 0 1 .00 3.25
4 2 2013-08-01 11:51:45 2013-08-01 12:03:52 N 0 0 0 0 1 .00 8.5

Overwrite the green_df variable with the transforms run on the tmp_df dataflow in the previous step.

green_df = tmp_df

Run the same transformation steps on the yellow taxi data.

tmp_df = (yellow_df
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "vendor_name": "vendor",
        "VendorID": "vendor",
        "vendor_id": "vendor",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "tpep_pickup_datetime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "store_and_forward": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Passenger_Count": "passengers",
        "passenger_count": "passengers",
        "Fare_Amt": "cost",
        "fare_amount": "cost",
        "Trip_Distance": "distance",
        "trip_distance": "distance"
    })
    .keep_columns(columns=useful_columns))
tmp_df.head(5)

Again, overwrite the yellow_df dataflow with the tmp_df dataflow. Then call the append_rows() function on the green taxi data to append the yellow taxi data. A new combined dataflow is created.

yellow_df = tmp_df
combined_df = green_df.append_rows([yellow_df])
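
Optionally, preview the combined dataflow to confirm that the two sources now share a single set of columns. This reuses the same head() pattern as the earlier steps.

# Preview the combined dataflow; both sources should now map to the same column names.
combined_df.head(5)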

Convert types and filter

Examine the summary statistics for the pickup and drop-off coordinates to see how the data is distributed. First, define a TypeConverter object to change the latitude and longitude fields to decimal type. Next, call the keep_columns() function to restrict output to only the latitude and longitude fields, and then call the get_profile() function.

decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)
combined_df = combined_df.set_column_types(type_conversions={
    "pickup_longitude": decimal_type,
    "pickup_latitude": decimal_type,
    "dropoff_longitude": decimal_type,
    "dropoff_latitude": decimal_type
})
combined_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude", 
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
Type Min Max Count Missing Count Not Missing Count Percent missing Error Count Empty count 0.1% Quantile 1% Quantile 5% Quantile 25% Quantile 50% Quantile 75% Quantile 95% Quantile 99% Quantile 99.9% Quantile Standard Deviation Mean
pickup_longitude FieldType.DECIMAL -115.179337 0.000000 7722.0 0.0 7722.0 0.0 0.0 0.0 -88.114046 -73.961840 -73.961964 -73.947693 -73.922097 -73.846670 0.000000 0.000000 0.000000 18.792672 -68.833579
pickup_latitude FieldType.DECIMAL 0.000000 40.919121 7722.0 0.0 7722.0 0.0 0.0 0.0 0.000000 40.682889 40.675541 40.721075 40.756159 40.803909 40.849406 40.870681 40.891244 10.345967 37.936742
dropoff_longitude FieldType.DECIMAL -115.179337 0.000000 7722.0 0.0 7722.0 0.0 0.0 0.0 -87.699611 -73.984734 -73.985777 -73.956250 -73.928948 -73.866208 0.000000 0.000000 0.000000 18.696526 -68.896978
dropoff_latitude FieldType.DECIMAL 0.000000 41.008934 7722.0 0.0 7722.0 0.0 0.0 0.0 0.000000 40.662763 40.654851 40.717821 40.756534 40.784688 40.852437 40.879289 40.937291 10.290780 37.963774

From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City. Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the filter() function and define the minimum and maximum bounds for each field. Then call the get_profile() function again to verify the transformation.

tmp_df = (combined_df
    .drop_nulls(
        columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],
        column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)
    ) 
    .filter(dprep.f_and(
        dprep.col("pickup_longitude") <= -73.72,
        dprep.col("pickup_longitude") >= -74.09,
        dprep.col("pickup_latitude") <= 40.88,
        dprep.col("pickup_latitude") >= 40.53,
        dprep.col("dropoff_longitude") <= -73.72,
        dprep.col("dropoff_longitude") >= -74.09,
        dprep.col("dropoff_latitude") <= 40.88,
        dprep.col("dropoff_latitude") >= 40.53
    )))
tmp_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude", 
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
Type Min Max Count Missing Count Not Missing Count Percent missing Error Count Empty count 0.1% Quantile 1% Quantile 5% Quantile 25% Quantile 50% Quantile 75% Quantile 95% Quantile 99% Quantile 99.9% Quantile Standard Deviation Mean
pickup_longitude FieldType.DECIMAL -74.078156 -73.736481 7059.0 0.0 7059.0 0.0 0.0 0.0 -74.076314 -73.962542 -73.962893 -73.948975 -73.927856 -73.866662 -73.830438 -73.823160 -73.769750 0.048711 -73.913865
pickup_latitude FieldType.DECIMAL 40.575485 40.879852 7059.0 0.0 7059.0 0.0 0.0 0.0 40.632884 40.713105 40.711600 40.721403 40.758142 40.805145 40.848855 40.867567 40.877690 0.048348 40.765226
dropoff_longitude FieldType.DECIMAL -74.085747 -73.720871 7059.0 0.0 7059.0 0.0 0.0 0.0 -74.078828 -73.985650 -73.985813 -73.959041 -73.936681 -73.884846 -73.815507 -73.776697 -73.733471 0.055961 -73.920718
dropoff_latitude FieldType.DECIMAL 40.583530 40.879734 7059.0 0.0 7059.0 0.0 0.0 0.0 40.597741 40.695376 40.695115 40.727549 40.758160 40.788378 40.850372 40.867968 40.878586 0.050462 40.759487
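
The same bounding box is repeated for the pickup and drop-off coordinates. As an aside, the condition could also be built programmatically so the limits are defined only once. The in_nyc() helper below is hypothetical and not part of the tutorial; it simply combines dprep.col() expressions with dprep.f_and() in the same way as the filter above.

# Hypothetical helper: build the NYC bounding-box condition for a given column prefix.
def in_nyc(prefix):
    return dprep.f_and(
        dprep.col(prefix + "_longitude") <= -73.72,
        dprep.col(prefix + "_longitude") >= -74.09,
        dprep.col(prefix + "_latitude") <= 40.88,
        dprep.col(prefix + "_latitude") >= 40.53
    )

# Equivalent filter to the one above, expressed with the helper (hypothetical variable name).
tmp_df_alt = combined_df.filter(dprep.f_and(in_nyc("pickup"), in_nyc("dropoff")))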

Overwrite the combined_df dataflow with the transformations that you made to the tmp_df dataflow.

combined_df = tmp_df

Split and rename columns

Look at the data profile for the store_forward column.

combined_df.keep_columns(columns='store_forward').get_profile()
Type Min Max Count Missing Count Not Missing Count Percent missing Error Count Empty count 0.1% Quantile 1% Quantile 5% Quantile 25% Quantile 50% Quantile 75% Quantile 95% Quantile 99% Quantile 99.9% Quantile Standard Deviation Mean
store_forward FieldType.STRING N Y 7059.0 99.0 6960.0 0.014025 0.0 0.0

The data profile output for the store_forward column shows that the data is inconsistent and that there are missing or null values. Use the replace() and fill_nulls() functions to replace the inconsistent "0" values and the nulls with the string "N":

combined_df = combined_df.replace(columns="store_forward", find="0", replace_with="N").fill_nulls("store_forward", "N")

Run the replace() function on the distance field to reformat distance values that are incorrectly labeled as .00, and use fill_nulls() to fill any nulls with zeros. Then convert the distance field to numerical format by using the to_number() function.

combined_df = combined_df.replace(columns="distance", find=".00", replace_with=0).fill_nulls("distance", 0)
combined_df = combined_df.to_number(["distance"])

Split the pickup and dropoff datetime values into the respective date and time columns. Use the split_column_by_example() function to make the split. In this case, the optional example parameter of the split_column_by_example() function is omitted. Therefore, the function automatically determines where to split based on the data.

tmp_df = (combined_df
    .split_column_by_example(source_column="pickup_datetime")
    .split_column_by_example(source_column="dropoff_datetime"))
tmp_df.head(5)
vendor pickup_datetime pickup_datetime_1 pickup_datetime_2 dropoff_datetime dropoff_datetime_1 dropoff_datetime_2 store_forward pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passengers distance cost
0 2 2013-08-01 17:22:00 2013-08-01 17:22:00 2013-08-01 17:22:00 2013-08-01 17:22:00 N -73.937767 40.758480 -73.937767 40.758480 1 0.0 2.5
1 2 2013-08-01 17:24:00 2013-08-01 17:24:00 2013-08-01 17:25:00 2013-08-01 17:25:00 N -73.937927 40.757843 -73.937927 40.757843 1 0.0 2.5
2 2 2013-08-06 06:51:19 2013-08-06 06:51:19 2013-08-06 06:51:36 2013-08-06 06:51:36 N -73.937721 40.758404 -73.937721 40.758369 1 0.0 3.3
3 2 2013-08-06 13:26:34 2013-08-06 13:26:34 2013-08-06 13:26:57 2013-08-06 13:26:57 N -73.937691 40.758419 -73.937790 40.758358 1 0.0 3.3
4 2 2013-08-06 13:27:53 2013-08-06 13:27:53 2013-08-06 13:28:08 2013-08-06 13:28:08 N -73.937805 40.758396 -73.937775 40.758450 1 0.0 3.3

Rename the columns generated by the split_column_by_example() function to use meaningful names.

tmp_df_renamed = (tmp_df
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time",
        "dropoff_datetime_1": "dropoff_date",
        "dropoff_datetime_2": "dropoff_time"
    }))
tmp_df_renamed.head(5)

Overwrite the combined_df dataflow with the executed transformations. Then call the get_profile() function to see the full summary statistics after all transformations.

combined_df = tmp_df_renamed
combined_df.get_profile()

Transform data

Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data, and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the split_column_by_example() function with no example parameter.

After you generate the new features, use the drop_columns() function to delete the original fields as the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.

tmp_df = (combined_df
    .derive_column_by_example(
        source_columns="pickup_date", 
        new_column_name="pickup_weekday", 
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    .derive_column_by_example(
        source_columns="dropoff_date",
        new_column_name="dropoff_weekday",
        example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]
    )

    .split_column_by_example(source_column="pickup_time")
    .split_column_by_example(source_column="dropoff_time")
    # The following two calls to split_column_by_example reference the column names generated from the previous two calls.
    .split_column_by_example(source_column="pickup_time_1")
    .split_column_by_example(source_column="dropoff_time_1")
    .drop_columns(columns=[
        "pickup_date", "pickup_time", "dropoff_date", "dropoff_time", 
        "pickup_date_1", "dropoff_date_1", "pickup_time_1", "dropoff_time_1"
    ])

    .rename_columns(column_pairs={
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1_1": "pickup_hour",
        "pickup_time_1_2": "pickup_minute",
        "pickup_time_2": "pickup_second",
        "dropoff_date_2": "dropoff_month",
        "dropoff_date_3": "dropoff_monthday",
        "dropoff_time_1_1": "dropoff_hour",
        "dropoff_time_1_2": "dropoff_minute",
        "dropoff_time_2": "dropoff_second"
    }))

tmp_df.head(5)
vendor pickup_datetime pickup_weekday pickup_hour pickup_minute pickup_second dropoff_datetime dropoff_weekday dropoff_hour dropoff_minute dropoff_second store_forward pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passengers distance cost
0 2 2013-08-01 17:22:00 Thursday 17 22 00 2013-08-01 17:22:00 Thursday 17 22 00 N -73.937767 40.758480 -73.937767 40.758480 1 0.0 2.5
1 2 2013-08-01 17:24:00 Thursday 17 24 00 2013-08-01 17:25:00 Thursday 17 25 00 N -73.937927 40.757843 -73.937927 40.757843 1 0.0 2.5
2 2 2013-08-06 06:51:19 Tuesday 06 51 19 2013-08-06 06:51:36 Tuesday 06 51 36 N -73.937721 40.758404 -73.937721 40.758369 1 0.0 3.3
3 2 2013-08-06 13:26:34 Tuesday 13 26 34 2013-08-06 13:26:57 Tuesday 13 26 57 N -73.937691 40.758419 -73.937790 40.758358 1 0.0 3.3
4 2 2013-08-06 13:27:53 Tuesday 13 27 53 2013-08-06 13:28:08 Tuesday 13 28 08 N -73.937805 40.758396 -73.937775 40.758450 1 0.0 3.3

The data shows that the pickup and dropoff date and time components produced by the derived transformations are correct. Drop the pickup_datetime and dropoff_datetime columns because they're no longer needed.

tmp_df = tmp_df.drop_columns(columns=["pickup_datetime", "dropoff_datetime"])

Use the type inference functionality to automatically check the data type of each field, and display the inference results.

type_infer = tmp_df.builders.set_column_types()
type_infer.learn()
type_infer
Column types conversion candidates:
'pickup_weekday': [FieldType.STRING],
'pickup_hour': [FieldType.DECIMAL],
'pickup_minute': [FieldType.DECIMAL],
'pickup_second': [FieldType.DECIMAL],
'dropoff_hour': [FieldType.DECIMAL],
'dropoff_minute': [FieldType.DECIMAL],
'dropoff_second': [FieldType.DECIMAL],
'store_forward': [FieldType.STRING],
'pickup_longitude': [FieldType.DECIMAL],
'dropoff_longitude': [FieldType.DECIMAL],
'passengers': [FieldType.DECIMAL],
'distance': [FieldType.DECIMAL],
'vendor': [FieldType.STRING],
'dropoff_weekday': [FieldType.STRING],
'pickup_latitude': [FieldType.DECIMAL],
'dropoff_latitude': [FieldType.DECIMAL],
'cost': [FieldType.DECIMAL]

The inference results look correct based on the data. Now apply the type conversions to the dataflow.

tmp_df = type_infer.to_dataflow()
tmp_df.get_profile()

Before you package the dataflow, run two final filters on the data set. To eliminate incorrect data points, filter the dataflow to keep only records where both the cost and distance values are greater than zero.

tmp_df = tmp_df.filter(dprep.col("distance") > 0)
tmp_df = tmp_df.filter(dprep.col("cost") > 0)

You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which you use to package and save the dataflow as shown in the following snippet.

import os
file_path = os.path.join(os.getcwd(), "dflows.dprep")

dflow_prepared = tmp_df
package = dprep.Package([dflow_prepared])
package.save(file_path)
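
In part two of the tutorial, the saved package is opened again and the prepared dataflow is retrieved from it. A minimal sketch of that step, using the same Package API, looks roughly like this:

# Sketch of how part two reopens the package and retrieves the prepared dataflow.
package_saved = dprep.Package.open(file_path)
dflow_prepared = package_saved.dataflows[0]
dflow_prepared.get_profile()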

Clean up resources

To continue with part two of the tutorial, you need the dflows.dprep file in the current directory.

If you don't plan to continue to part two, delete the dflows.dprep file in your current directory. Delete this file whether you're running the execution locally or in Azure Notebooks.

Next steps

In part one of this tutorial, you:

  • Set up your development environment.
  • Loaded and cleansed data sets.
  • Used intelligent transforms to predict your logic based on an example.
  • Merged and packaged datasets for machine learning training.

You're ready to use this training data in part two of the tutorial, where you train a regression model.