Tutorial: Prepare data for regression modeling

In this tutorial, you learn how to prepare data for regression modeling by using the data prep package from the Azure Machine Learning SDK. You run various transformations to filter and combine two different NYC taxi data sets.

This tutorial is part one of a two-part tutorial series. After you complete the tutorial series, you can predict the cost of a taxi trip by training a model on data features. These features include the pickup day and time, the number of passengers, and the pickup location.

In this tutorial, you:

  • Set up a Python environment and import packages.
  • Load two datasets with different field names.
  • Cleanse data to remove anomalies.
  • Transform data by using intelligent transforms to create new features.
  • Save your dataflow object to use in a regression model.

Prerequisites

Skip to Set up your development environment to read through the notebook steps, or use the instructions below to get the notebook and run it on Azure Notebooks or your own notebook server. To run the notebook, use one of the following environments.

Use a cloud notebook server in your workspace

It's easy to get started with your own cloud-based notebook server. The Azure Machine Learning SDK for Python is already installed and configured for you once you create this cloud resource.

  • After you launch the notebook webpage, run the tutorials/regression-part1-data-prep.ipynb notebook.

Use your own Jupyter notebook server

Use these steps to create a local Jupyter Notebook server on your computer. After you complete the steps, run the tutorials/regression-part1-data-prep.ipynb notebook.

  1. Complete the installation steps in the Azure Machine Learning Python quickstart to create a Miniconda environment and install the SDK. You can skip the Create a workspace section, but you will need a workspace for part two of this tutorial series.

  2. The azureml-dataprep package is automatically installed when you install the SDK.

  3. Clone the GitHub repository.

    git clone https://github.com/Azure/MachineLearningNotebooks.git
    
  4. Start the notebook server from your cloned directory.

    jupyter notebook
    

Set up your development environment

All the setup for your development work can be accomplished in a Python notebook. Setup includes the following actions:

  • Install the SDK
  • Import Python packages

Install and import packages

Use the following command to install the necessary packages if you don't already have them.

pip install "azureml-dataprep[pandas]>=1.1.0,<1.2.0"

Import the package.

import azureml.dataprep as dprep

Important

Ensure that you install the latest azureml-dataprep package version. This tutorial won't work with a version lower than 1.1.0.
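
If you're not sure which version you have, you can check it before you continue. The following is a minimal check that uses the standard pkg_resources module; it isn't part of the original tutorial steps.

# Optional: confirm that the installed azureml-dataprep version is at least 1.1.0.
import pkg_resources
print(pkg_resources.get_distribution("azureml-dataprep").version)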

Load data

Download two different NYC taxi data sets into dataflow objects. The datasets have slightly different fields. The auto_read_file() method automatically recognizes the input file type.

from IPython.display import display
dataset_root = "https://dprepdata.blob.core.windows.net/demo"

green_path = "/".join([dataset_root, "green-small/*"])
yellow_path = "/".join([dataset_root, "yellow-small/*"])

green_df_raw = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)
# auto_read_file automatically identifies and parses the file type, which is useful when you don't know the file type.
yellow_df_raw = dprep.auto_read_file(path=yellow_path)

display(green_df_raw.head(5))
display(yellow_df_raw.head(5))

Note

The URL in this example is not a complete URL. Instead, it refers to the demo folder in the blob container. The full URL to one of the data files is https://dprepdata.blob.core.windows.net/demo/green-small/green_tripdata_2013-08.csv

A Dataflow object is similar to a dataframe, and represents a series of lazily evaluated, immutable operations on data. Operations can be added by invoking the different transformation and filtering methods available. The result of adding an operation to a Dataflow is always a new Dataflow object.
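
To make the lazy, immutable behavior concrete, here's a minimal sketch that uses the green_df_raw dataflow loaded above; it isn't part of the tutorial steps themselves.

# Each transformation returns a new Dataflow; the original object is unchanged.
subset_df = green_df_raw.keep_columns(columns=["lpep_pickup_datetime"])
print(subset_df is green_df_raw)  # False: keep_columns produced a new Dataflow

# No data is read until you request results, for example with head() or get_profile().
subset_df.head(5)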

Cleanse data

Now you populate some variables with shortcut transforms to apply to all dataflows. The drop_if_all_null variable is used to delete records where all fields are null. The useful_columns variable holds a list of column names to keep in each dataflow.

all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
    "passengers", "pickup_datetime", "pickup_latitude", "pickup_longitude", "store_forward", "vendor"
]

You first work with the green taxi data to get it into a valid shape that can be combined with the yellow taxi data. Call the replace_na(), drop_nulls(), and keep_columns() functions by using the shortcut transform variables you created. Additionally, rename all the columns in the dataflow to match the names in the useful_columns variable.

green_df = (green_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "lpep_pickup_datetime": "pickup_datetime",
        "Lpep_dropoff_datetime": "dropoff_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "Store_and_fwd_flag": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Pickup_longitude": "pickup_longitude",
        "Pickup_latitude": "pickup_latitude",
        "Dropoff_longitude": "dropoff_longitude",
        "Dropoff_latitude": "dropoff_latitude",
        "Passenger_count": "passengers",
        "Fare_amount": "cost",
        "Trip_distance": "distance"
     })
    .keep_columns(columns=useful_columns))
green_df.head(5)

| | vendor | pickup_datetime | dropoff_datetime | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2013-08-01 08:14:37 | 2013-08-01 09:09:06 | N | 0 | 0 | 0 | 0 | 1 | .00 | 21.25 |
| 1 | 2 | 2013-08-01 09:13:00 | 2013-08-01 11:38:00 | N | 0 | 0 | 0 | 0 | 2 | .00 | 74.5 |
| 2 | 2 | 2013-08-01 09:48:00 | 2013-08-01 09:49:00 | N | 0 | 0 | 0 | 0 | 1 | .00 | 1 |
| 3 | 2 | 2013-08-01 10:38:35 | 2013-08-01 10:38:51 | N | 0 | 0 | 0 | 0 | 1 | .00 | 3.25 |
| 4 | 2 | 2013-08-01 11:51:45 | 2013-08-01 12:03:52 | N | 0 | 0 | 0 | 0 | 1 | .00 | 8.5 |

Run the same transformation steps on the yellow taxi data. These functions ensure that null data is removed from the data set, which will help increase machine learning model accuracy.

yellow_df = (yellow_df_raw
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "vendor_name": "vendor",
        "VendorID": "vendor",
        "vendor_id": "vendor",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "tpep_pickup_datetime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "store_and_forward": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Passenger_Count": "passengers",
        "passenger_count": "passengers",
        "Fare_Amt": "cost",
        "fare_amount": "cost",
        "Trip_Distance": "distance",
        "trip_distance": "distance"
    })
    .keep_columns(columns=useful_columns))
yellow_df.head(5)

Call the append_rows() function on the green taxi data to append the yellow taxi data. A new combined dataflow is created.

combined_df = green_df.append_rows([yellow_df])
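
To sanity-check the append, one option (not part of the original steps) is to pull each dataflow into pandas and compare row counts. This assumes the data sets are small enough to load into local memory.

# Optional check: the combined dataflow should contain the rows from both sources.
# to_pandas_dataframe() executes the dataflow and loads the results into memory.
green_count = green_df.to_pandas_dataframe().shape[0]
yellow_count = yellow_df.to_pandas_dataframe().shape[0]
combined_count = combined_df.to_pandas_dataframe().shape[0]
print(green_count, yellow_count, combined_count)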

Convert types and filter

Examine the pickup and drop-off coordinates summary statistics to see how the data is distributed. First, define a TypeConverter object to change the latitude and longitude fields to decimal type. Next, call the keep_columns() function to restrict output to only the latitude and longitude fields, and then call the get_profile() function. These function calls create a condensed view of the dataflow to just show the lat/long fields, which makes it easier to evaluate missing or out-of-scope coordinates.

decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)
combined_df = combined_df.set_column_types(type_conversions={
    "pickup_longitude": decimal_type,
    "pickup_latitude": decimal_type,
    "dropoff_longitude": decimal_type,
    "dropoff_latitude": decimal_type
})
combined_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()

| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pickup_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -88.114046 | -73.961840 | -73.961964 | -73.947693 | -73.922097 | -73.846670 | 0.000000 | 0.000000 | 0.000000 | 18.792672 | -68.833579 |
| pickup_latitude | FieldType.DECIMAL | 0.000000 | 40.919121 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.682889 | 40.675541 | 40.721075 | 40.756159 | 40.803909 | 40.849406 | 40.870681 | 40.891244 | 10.345967 | 37.936742 |
| dropoff_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -87.699611 | -73.984734 | -73.985777 | -73.956250 | -73.928948 | -73.866208 | 0.000000 | 0.000000 | 0.000000 | 18.696526 | -68.896978 |
| dropoff_latitude | FieldType.DECIMAL | 0.000000 | 41.008934 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.662763 | 40.654851 | 40.717821 | 40.756534 | 40.784688 | 40.852437 | 40.879289 | 40.937291 | 10.290780 | 37.963774 |

From the summary statistics output, you see there are missing coordinates and coordinates that aren't in New York City (this is determined from subjective analysis). Filter out coordinates for locations that are outside the city border. Chain the column filter commands within the filter() function and define the minimum and maximum bounds for each field. Then call the get_profile() function again to verify the transformation.

latlong_filtered_df = (combined_df
    .drop_nulls(
        columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],
        column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)
    )
    .filter(dprep.f_and(
        dprep.col("pickup_longitude") <= -73.72,
        dprep.col("pickup_longitude") >= -74.09,
        dprep.col("pickup_latitude") <= 40.88,
        dprep.col("pickup_latitude") >= 40.53,
        dprep.col("dropoff_longitude") <= -73.72,
        dprep.col("dropoff_longitude") >= -74.09,
        dprep.col("dropoff_latitude") <= 40.88,
        dprep.col("dropoff_latitude") >= 40.53
    )))
latlong_filtered_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()

| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pickup_longitude | FieldType.DECIMAL | -74.078156 | -73.736481 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.076314 | -73.962542 | -73.962893 | -73.948975 | -73.927856 | -73.866662 | -73.830438 | -73.823160 | -73.769750 | 0.048711 | -73.913865 |
| pickup_latitude | FieldType.DECIMAL | 40.575485 | 40.879852 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.632884 | 40.713105 | 40.711600 | 40.721403 | 40.758142 | 40.805145 | 40.848855 | 40.867567 | 40.877690 | 0.048348 | 40.765226 |
| dropoff_longitude | FieldType.DECIMAL | -74.085747 | -73.720871 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.078828 | -73.985650 | -73.985813 | -73.959041 | -73.936681 | -73.884846 | -73.815507 | -73.776697 | -73.733471 | 0.055961 | -73.920718 |
| dropoff_latitude | FieldType.DECIMAL | 40.583530 | 40.879734 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.597741 | 40.695376 | 40.695115 | 40.727549 | 40.758160 | 40.788378 | 40.850372 | 40.867968 | 40.878586 | 0.050462 | 40.759487 |

Split and rename columns

Look at the data profile for the store_forward column. This field is a Boolean flag that is Y when the taxi didn't have a connection to the server after the trip, so it had to store the trip data in memory and forward it to the server later, when connected.

latlong_filtered_df.keep_columns(columns='store_forward').get_profile()

| | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| store_forward | FieldType.STRING | N | Y | 7059.0 | 99.0 | 6960.0 | 0.014025 | 0.0 | 0.0 |  |  |  |  |  |  |  |  |  |  |  |

Notice that the data profile output for the store_forward column shows that the data is inconsistent and that there are missing or null values. Use the replace() and fill_nulls() functions to replace these values with the string "N":

replaced_stfor_vals_df = latlong_filtered_df.replace(columns="store_forward", find="0", replace_with="N").fill_nulls("store_forward", "N")

Execute the replace() and fill_nulls() functions on the distance field to reformat distance values that are incorrectly labeled as .00 and to fill any nulls with zeros. Then convert the distance field to numeric format with the to_number() function. These incorrect data points are likely anomalies in the data collection system on the taxi cabs.

replaced_distance_vals_df = replaced_stfor_vals_df.replace(columns="distance", find=".00", replace_with=0).fill_nulls("distance", 0)
replaced_distance_vals_df = replaced_distance_vals_df.to_number(["distance"])

Split the pickup and dropoff datetime values into the respective date and time columns. Use the split_column_by_example() function to make the split. In this case, the optional example parameter of the split_column_by_example() function is omitted. Therefore, the function automatically determines where to split based on the data.

time_split_df = (replaced_distance_vals_df
    .split_column_by_example(source_column="pickup_datetime")
    .split_column_by_example(source_column="dropoff_datetime"))
time_split_df.head(5)

| | vendor | pickup_datetime | pickup_datetime_1 | pickup_datetime_2 | dropoff_datetime | dropoff_datetime_1 | dropoff_datetime_2 | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | 2013-08-01 | 17:24:00 | 2013-08-01 17:25:00 | 2013-08-01 | 17:25:00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | 2013-08-06 | 06:51:19 | 2013-08-06 06:51:36 | 2013-08-06 | 06:51:36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | 2013-08-06 | 13:26:34 | 2013-08-06 13:26:57 | 2013-08-06 | 13:26:57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | 2013-08-06 | 13:27:53 | 2013-08-06 13:28:08 | 2013-08-06 | 13:28:08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

Rename the columns generated by the split_column_by_example() function to use meaningful names.

renamed_col_df = (time_split_df
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time",
        "dropoff_datetime_1": "dropoff_date",
        "dropoff_datetime_2": "dropoff_time"
    }))
renamed_col_df.head(5)

Call the get_profile() function to see the full summary statistics after all cleansing steps.

renamed_col_df.get_profile()

Transform data

Split the pickup and dropoff date further into the day of the week, day of the month, and month values. To get the day of the week value, use the derive_column_by_example() function. The function takes an array parameter of example objects that define the input data and the preferred output. The function automatically determines your preferred transformation. For the pickup and dropoff time columns, split the time into the hour, minute, and second by using the split_column_by_example() function with no example parameter.

After you generate the new features, use the drop_columns() function to delete the original fields as the newly generated features are preferred. Rename the rest of the fields to use meaningful descriptions.

Transforming the data in this way to create new time-based features can improve machine learning model accuracy. For example, generating a new feature for the weekday helps establish a relationship between the day of the week and the taxi fare price, which is often more expensive on certain days of the week due to higher demand.

transformed_features_df = (renamed_col_df
    .derive_column_by_example(
        source_columns="pickup_date",
        new_column_name="pickup_weekday",
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    .derive_column_by_example(
        source_columns="dropoff_date",
        new_column_name="dropoff_weekday",
        example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]
    )

    .split_column_by_example(source_column="pickup_time")
    .split_column_by_example(source_column="dropoff_time")
    # The following two calls to split_column_by_example reference the column names generated from the previous two calls.
    .split_column_by_example(source_column="pickup_time_1")
    .split_column_by_example(source_column="dropoff_time_1")
    .drop_columns(columns=[
        "pickup_date", "pickup_time", "dropoff_date", "dropoff_time",
        "pickup_date_1", "dropoff_date_1", "pickup_time_1", "dropoff_time_1"
    ])

    .rename_columns(column_pairs={
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1_1": "pickup_hour",
        "pickup_time_1_2": "pickup_minute",
        "pickup_time_2": "pickup_second",
        "dropoff_date_2": "dropoff_month",
        "dropoff_date_3": "dropoff_monthday",
        "dropoff_time_1_1": "dropoff_hour",
        "dropoff_time_1_2": "dropoff_minute",
        "dropoff_time_2": "dropoff_second"
    }))

transformed_features_df.head(5)

| | vendor | pickup_datetime | pickup_weekday | pickup_hour | pickup_minute | pickup_second | dropoff_datetime | dropoff_weekday | dropoff_hour | dropoff_minute | dropoff_second | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5 |
| 1 | 2 | 2013-08-01 17:24:00 | Thursday | 17 | 24 | 00 | 2013-08-01 17:25:00 | Thursday | 17 | 25 | 00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5 |
| 2 | 2 | 2013-08-06 06:51:19 | Tuesday | 06 | 51 | 19 | 2013-08-06 06:51:36 | Tuesday | 06 | 51 | 36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3 |
| 3 | 2 | 2013-08-06 13:26:34 | Tuesday | 13 | 26 | 34 | 2013-08-06 13:26:57 | Tuesday | 13 | 26 | 57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3 |
| 4 | 2 | 2013-08-06 13:27:53 | Tuesday | 13 | 27 | 53 | 2013-08-06 13:28:08 | Tuesday | 13 | 28 | 08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3 |

Notice that the pickup and dropoff date and time components produced by the derived transformations are correct. Drop the pickup_datetime and dropoff_datetime columns because they're no longer needed (granular time features like hour, minute, and second are more useful for model training).

processed_df = transformed_features_df.drop_columns(columns=["pickup_datetime", "dropoff_datetime"])

Use the type inference functionality to automatically check the data type of each field, and display the inference results.

type_infer = processed_df.builders.set_column_types()
type_infer.learn()
type_infer

The resulting output of type_infer is as follows.

Column types conversion candidates:
'pickup_weekday': [FieldType.STRING],
'pickup_hour': [FieldType.DECIMAL],
'pickup_minute': [FieldType.DECIMAL],
'pickup_second': [FieldType.DECIMAL],
'dropoff_hour': [FieldType.DECIMAL],
'dropoff_minute': [FieldType.DECIMAL],
'dropoff_second': [FieldType.DECIMAL],
'store_forward': [FieldType.STRING],
'pickup_longitude': [FieldType.DECIMAL],
'dropoff_longitude': [FieldType.DECIMAL],
'passengers': [FieldType.DECIMAL],
'distance': [FieldType.DECIMAL],
'vendor': [FieldType.STRING],
'dropoff_weekday': [FieldType.STRING],
'pickup_latitude': [FieldType.DECIMAL],
'dropoff_latitude': [FieldType.DECIMAL],
'cost': [FieldType.DECIMAL]

The inference results look correct based on the data. Now apply the type conversions to the dataflow.

type_converted_df = type_infer.to_dataflow()
type_converted_df.get_profile()

Before you package the dataflow, run two final filters on the data set. To eliminate incorrectly captured data points, filter the dataflow on records where both the cost and distance variable values are greater than zero. This step will significantly improve machine learning model accuracy, because data points with a zero cost or distance represent major outliers that throw off prediction accuracy.

final_df = type_converted_df.filter(dprep.col("distance") > 0)
final_df = final_df.filter(dprep.col("cost") > 0)
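
To confirm that the filters behaved as expected, you can reuse the keep_columns() and get_profile() pattern from earlier in this tutorial. This check is optional and isn't part of the original steps.

# Optional: profile the cost and distance columns; the Min of each should now be greater than zero.
final_df.keep_columns(columns=["cost", "distance"]).get_profile()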

You now have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as shown in the following code.

import os

file_path = os.path.join(os.getcwd(), "dflows.dprep")
final_df.save(file_path)
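
If you want to verify the saved file, you can reopen it and preview a few rows. The Dataflow.open() call shown here is assumed to be the counterpart of save() in this version of the azureml-dataprep package; adjust it if your version differs.

# Optional: reload the saved dataflow and preview it.
# Dataflow.open() is assumed to be the counterpart of save() here.
retrieved_df = dprep.Dataflow.open(file_path)
retrieved_df.head(5)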

Clean up resources

To continue with part two of the tutorial, you need the dflows.dprep file in the current directory.

If you don't plan to continue to part two, delete the dflows.dprep file in your current directory. Delete this file whether you ran the notebook locally or in Azure Notebooks.

Next steps

In part one of this tutorial, you:

  • Set up your development environment.
  • Loaded and cleansed data sets.
  • Used intelligent transforms to create new features based on examples.
  • Merged and packaged datasets for machine learning training.

You're ready to use the training data in part two of the tutorial.