Tutorial (part 1): Prepare data for regression modeling

In this tutorial, you learn how to prepare data for regression modeling by using the Azure Machine Learning Data Prep SDK. You perform various transformations to filter and combine two different NYC Taxi data sets. The end goal of this tutorial set is to predict the cost of a taxi trip by training a model on data features including pickup hour, day of the week, number of passengers, and coordinates. This tutorial is part one of a two-part tutorial series.

In this tutorial, you:

  • Set up a Python environment and import packages
  • Load two datasets with different field names
  • Cleanse data to remove anomalies
  • Transform data using intelligent transforms to create new features
  • Save your dataflow object to use in a regression model

You can prepare your data in Python using the Azure Machine Learning Data Prep SDK.

Get the notebook

For your convenience, this tutorial is available as a Jupyter notebook. Run the regression-part1-data-prep.ipynb notebook either in Azure Notebooks or in your own Jupyter notebook server.

Learn how to run notebooks by following the article, Use Jupyter notebooks to explore this service.

Import packages

Begin by importing the SDK.

import azureml.dataprep as dprep
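
The import assumes the SDK is already installed in your environment. If it isn't, here is a minimal setup sketch, assuming the SDK's PyPI package name is azureml-dataprep:

# Install the SDK first if the import above fails
# (assumption: the PyPI package name is "azureml-dataprep"):
#
#     pip install azureml-dataprep

import pkg_resources
# Print the installed version to confirm the environment is ready
print(pkg_resources.get_distribution("azureml-dataprep").version)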

Load data

Download two different NYC Taxi data sets into dataflow objects. These datasets contain slightly different fields. The method auto_read_file() automatically recognizes the input file type.

dataset_root = "https://dprepdata.blob.core.windows.net/demo"

green_path = "/".join([dataset_root, "green-small/*"])
yellow_path = "/".join([dataset_root, "yellow-small/*"])

green_df = dprep.read_csv(path=green_path, header=dprep.PromoteHeadersMode.GROUPED)
# auto_read_file will automatically identify and parse the file type, and is useful if you don't know the file type
yellow_df = dprep.auto_read_file(path=yellow_path)

display(green_df.head(5))
display(yellow_df.head(5))
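
The two data sets use different column names for the same fields, which is why a later step renames them to a shared schema. To compare the raw schemas side by side, you can list the column names directly; a small sketch, assuming head() returns a pandas DataFrame (as the display() calls above suggest):

# head() returns a pandas DataFrame, so standard pandas introspection applies
print(green_df.head(1).columns.tolist())   # expect names like VendorID, lpep_pickup_datetime, ...
print(yellow_df.head(1).columns.tolist())  # expect names like vendor_name, Trip_Pickup_DateTime, ...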

Cleanse data

Now populate some variables with shortcut transforms that you'll apply to all dataflows. The drop_if_all_null variable is used to delete records where all fields are null. The useful_columns variable holds an array of column names that are kept in each dataflow.

all_columns = dprep.ColumnSelector(term=".*", use_regex=True)
drop_if_all_null = [all_columns, dprep.ColumnRelationship(dprep.ColumnRelationship.ALL)]
useful_columns = [
    "cost", "distance", "dropoff_datetime", "dropoff_latitude", "dropoff_longitude",
    "passengers", "pickup_datetime", "pickup_latitude", "pickup_longitude", "store_forward", "vendor"
]

You first work with the green taxi data and get it into a valid shape that can be combined with the yellow taxi data. Create a temporary dataflow tmp_df. Call the replace_na(), drop_nulls(), and keep_columns() functions using the shortcut transform variables you created. Additionally, rename all the columns in the dataflow to match the names in useful_columns.

tmp_df = (green_df
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "VendorID": "vendor",
        "lpep_pickup_datetime": "pickup_datetime",
        "Lpep_dropoff_datetime": "dropoff_datetime",
        "lpep_dropoff_datetime": "dropoff_datetime",
        "Store_and_fwd_flag": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Pickup_longitude": "pickup_longitude",
        "Pickup_latitude": "pickup_latitude",
        "Dropoff_longitude": "dropoff_longitude",
        "Dropoff_latitude": "dropoff_latitude",
        "Passenger_count": "passengers",
        "Fare_amount": "cost",
        "Trip_distance": "distance"
     })
    .keep_columns(columns=useful_columns))
tmp_df.head(5)
  | vendor | pickup_datetime | dropoff_datetime | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost
0 | 2 | 2013-08-01 08:14:37 | 2013-08-01 09:09:06 | N | 0 | 0 | 0 | 0 | 1 | .00 | 21.25
1 | 2 | 2013-08-01 09:13:00 | 2013-08-01 11:38:00 | N | 0 | 0 | 0 | 0 | 2 | .00 | 74.5
2 | 2 | 2013-08-01 09:48:00 | 2013-08-01 09:49:00 | N | 0 | 0 | 0 | 0 | 1 | .00 | 1
3 | 2 | 2013-08-01 10:38:35 | 2013-08-01 10:38:51 | N | 0 | 0 | 0 | 0 | 1 | .00 | 3.25
4 | 2 | 2013-08-01 11:51:45 | 2013-08-01 12:03:52 | N | 0 | 0 | 0 | 0 | 1 | .00 | 8.5

Overwrite the green_df variable with the transforms performed on tmp_df in the previous step.

green_df = tmp_df

Perform the same transformation steps to the yellow taxi data.

tmp_df = (yellow_df
    .replace_na(columns=all_columns)
    .drop_nulls(*drop_if_all_null)
    .rename_columns(column_pairs={
        "vendor_name": "vendor",
        "VendorID": "vendor",
        "vendor_id": "vendor",
        "Trip_Pickup_DateTime": "pickup_datetime",
        "tpep_pickup_datetime": "pickup_datetime",
        "Trip_Dropoff_DateTime": "dropoff_datetime",
        "tpep_dropoff_datetime": "dropoff_datetime",
        "store_and_forward": "store_forward",
        "store_and_fwd_flag": "store_forward",
        "Start_Lon": "pickup_longitude",
        "Start_Lat": "pickup_latitude",
        "End_Lon": "dropoff_longitude",
        "End_Lat": "dropoff_latitude",
        "Passenger_Count": "passengers",
        "passenger_count": "passengers",
        "Fare_Amt": "cost",
        "fare_amount": "cost",
        "Trip_Distance": "distance",
        "trip_distance": "distance"
    })
    .keep_columns(columns=useful_columns))
tmp_df.head(5)

Again, overwrite yellow_df with tmp_df, and then call the append_rows() function on the green taxi data to append the yellow taxi data, creating a new combined dataflow.

yellow_df = tmp_df
combined_df = green_df.append_rows([yellow_df])
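
To sanity-check the append, compare row counts before and after combining. A sketch, assuming to_pandas_dataframe() materializes the dataflow locally (so keep it to small data sets like these):

# The combined dataflow should contain every green and yellow record
n_green = len(green_df.to_pandas_dataframe())
n_yellow = len(yellow_df.to_pandas_dataframe())
n_combined = len(combined_df.to_pandas_dataframe())
assert n_combined == n_green + n_yellow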

Convert types and filter

Examine summary statistics for the pickup and drop-off coordinates to see how the data is distributed. First, define a TypeConverter object to change the latitude and longitude fields to decimal type. Next, call the keep_columns() function to restrict output to only the latitude and longitude fields, and then call get_profile().

decimal_type = dprep.TypeConverter(data_type=dprep.FieldType.DECIMAL)
combined_df = combined_df.set_column_types(type_conversions={
    "pickup_longitude": decimal_type,
    "pickup_latitude": decimal_type,
    "dropoff_longitude": decimal_type,
    "dropoff_latitude": decimal_type
})
combined_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude", 
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
Field | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean
pickup_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -88.114046 | -73.961840 | -73.961964 | -73.947693 | -73.922097 | -73.846670 | 0.000000 | 0.000000 | 0.000000 | 18.792672 | -68.833579
pickup_latitude | FieldType.DECIMAL | 0.000000 | 40.919121 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.682889 | 40.675541 | 40.721075 | 40.756159 | 40.803909 | 40.849406 | 40.870681 | 40.891244 | 10.345967 | 37.936742
dropoff_longitude | FieldType.DECIMAL | -115.179337 | 0.000000 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | -87.699611 | -73.984734 | -73.985777 | -73.956250 | -73.928948 | -73.866208 | 0.000000 | 0.000000 | 0.000000 | 18.696526 | -68.896978
dropoff_latitude | FieldType.DECIMAL | 0.000000 | 41.008934 | 7722.0 | 0.0 | 7722.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 40.662763 | 40.654851 | 40.717821 | 40.756534 | 40.784688 | 40.852437 | 40.879289 | 40.937291 | 10.290780 | 37.963774

From the summary statistics output, you see that some coordinates are missing and some are not in New York City at all. Filter out coordinates outside the city limits by chaining column filter commands within the filter() function and defining minimum and maximum bounds for each field. Then call get_profile() again to verify the transformation.

tmp_df = (combined_df
    .drop_nulls(
        columns=["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"],
        column_relationship=dprep.ColumnRelationship(dprep.ColumnRelationship.ANY)
    ) 
    .filter(dprep.f_and(
        dprep.col("pickup_longitude") <= -73.72,
        dprep.col("pickup_longitude") >= -74.09,
        dprep.col("pickup_latitude") <= 40.88,
        dprep.col("pickup_latitude") >= 40.53,
        dprep.col("dropoff_longitude") <= -73.72,
        dprep.col("dropoff_longitude") >= -74.09,
        dprep.col("dropoff_latitude") <= 40.88,
        dprep.col("dropoff_latitude") >= 40.53
    )))
tmp_df.keep_columns(columns=[
    "pickup_longitude", "pickup_latitude", 
    "dropoff_longitude", "dropoff_latitude"
]).get_profile()
Field | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count | 0.1% Quantile | 1% Quantile | 5% Quantile | 25% Quantile | 50% Quantile | 75% Quantile | 95% Quantile | 99% Quantile | 99.9% Quantile | Standard Deviation | Mean
pickup_longitude | FieldType.DECIMAL | -74.078156 | -73.736481 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.076314 | -73.962542 | -73.962893 | -73.948975 | -73.927856 | -73.866662 | -73.830438 | -73.823160 | -73.769750 | 0.048711 | -73.913865
pickup_latitude | FieldType.DECIMAL | 40.575485 | 40.879852 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.632884 | 40.713105 | 40.711600 | 40.721403 | 40.758142 | 40.805145 | 40.848855 | 40.867567 | 40.877690 | 0.048348 | 40.765226
dropoff_longitude | FieldType.DECIMAL | -74.085747 | -73.720871 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | -74.078828 | -73.985650 | -73.985813 | -73.959041 | -73.936681 | -73.884846 | -73.815507 | -73.776697 | -73.733471 | 0.055961 | -73.920718
dropoff_latitude | FieldType.DECIMAL | 40.583530 | 40.879734 | 7059.0 | 0.0 | 7059.0 | 0.0 | 0.0 | 0.0 | 40.597741 | 40.695376 | 40.695115 | 40.727549 | 40.758160 | 40.788378 | 40.850372 | 40.867968 | 40.878586 | 0.050462 | 40.759487

Overwrite combined_df with the transformations you made to tmp_df.

combined_df = tmp_df

Split and rename columns

Look at the data profile for the store_forward column.

combined_df.keep_columns(columns='store_forward').get_profile()
Field | Type | Min | Max | Count | Missing Count | Not Missing Count | Percent missing | Error Count | Empty count
store_forward | FieldType.STRING | N | Y | 7059.0 | 99.0 | 6960.0 | 0.014025 | 0.0 | 0.0

From the data profile output of store_forward, you see that the data is inconsistent and that there are missing or null values. Replace these values using the replace() and fill_nulls() functions, changing them to the string "N" in both cases.

combined_df = combined_df.replace(columns="store_forward", find="0", replace_with="N").fill_nulls("store_forward", "N")

Run another replace, this time on the distance field, to reformat distance values that are incorrectly labeled as .00 and to fill any nulls with zeros. Then convert the distance field to numeric format.

combined_df = combined_df.replace(columns="distance", find=".00", replace_with=0).fill_nulls("distance", 0)
combined_df = combined_df.to_number(["distance"])
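
You can verify the conversion the same way, by profiling only the distance column. After to_number(), the profile should report a numeric type rather than a string:

combined_df.keep_columns(columns='distance').get_profile()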

Split the pickup and drop-off datetime values into their respective date and time columns. Use split_column_by_example() to perform the split. In this case, the optional example parameter of split_column_by_example() is omitted, so the function automatically determines where to split based on the data.

tmp_df = (combined_df
    .split_column_by_example(source_column="pickup_datetime")
    .split_column_by_example(source_column="dropoff_datetime"))
tmp_df.head(5)
  | vendor | pickup_datetime | pickup_datetime_1 | pickup_datetime_2 | dropoff_datetime | dropoff_datetime_1 | dropoff_datetime_2 | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost
0 | 2 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | 2013-08-01 17:22:00 | 2013-08-01 | 17:22:00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5
1 | 2 | 2013-08-01 17:24:00 | 2013-08-01 | 17:24:00 | 2013-08-01 17:25:00 | 2013-08-01 | 17:25:00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5
2 | 2 | 2013-08-06 06:51:19 | 2013-08-06 | 06:51:19 | 2013-08-06 06:51:36 | 2013-08-06 | 06:51:36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3
3 | 2 | 2013-08-06 13:26:34 | 2013-08-06 | 13:26:34 | 2013-08-06 13:26:57 | 2013-08-06 | 13:26:57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3
4 | 2 | 2013-08-06 13:27:53 | 2013-08-06 | 13:27:53 | 2013-08-06 13:28:08 | 2013-08-06 | 13:28:08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3

Rename the columns generated by split_column_by_example() to meaningful names.

tmp_df_renamed = (tmp_df
    .rename_columns(column_pairs={
        "pickup_datetime_1": "pickup_date",
        "pickup_datetime_2": "pickup_time",
        "dropoff_datetime_1": "dropoff_date",
        "dropoff_datetime_2": "dropoff_time"
    }))
tmp_df_renamed.head(5)

Overwrite combined_df with the executed transformations, and then call get_profile() to see full summary statistics after all transformations.

combined_df = tmp_df_renamed
combined_df.get_profile()

Transform data

Split the pickup and drop-off dates further into day of the week, day of the month, and month. To get the day of the week, use the derive_column_by_example() function. This function takes an array of example tuples that pair sample input data with the desired output, and from these examples it infers your desired transformation. For the pickup and drop-off time columns, split them into hour, minute, and second by using the split_column_by_example() function with no example parameter.

After you generate these new features, delete the original fields in favor of the newly generated features by using drop_columns(). Rename all the remaining fields to use accurate descriptions.

tmp_df = (combined_df
    .derive_column_by_example(
        source_columns="pickup_date", 
        new_column_name="pickup_weekday", 
        example_data=[("2009-01-04", "Sunday"), ("2013-08-22", "Thursday")]
    )
    .derive_column_by_example(
        source_columns="dropoff_date",
        new_column_name="dropoff_weekday",
        example_data=[("2013-08-22", "Thursday"), ("2013-11-03", "Sunday")]
    )

    .split_column_by_example(source_column="pickup_time")
    .split_column_by_example(source_column="dropoff_time")
    # the following two split_column_by_example calls reference the generated column names from the above two calls
    .split_column_by_example(source_column="pickup_time_1")
    .split_column_by_example(source_column="dropoff_time_1")
    .drop_columns(columns=[
        "pickup_date", "pickup_time", "dropoff_date", "dropoff_time", 
        "pickup_date_1", "dropoff_date_1", "pickup_time_1", "dropoff_time_1"
    ])

    .rename_columns(column_pairs={
        "pickup_date_2": "pickup_month",
        "pickup_date_3": "pickup_monthday",
        "pickup_time_1_1": "pickup_hour",
        "pickup_time_1_2": "pickup_minute",
        "pickup_time_2": "pickup_second",
        "dropoff_date_2": "dropoff_month",
        "dropoff_date_3": "dropoff_monthday",
        "dropoff_time_1_1": "dropoff_hour",
        "dropoff_time_1_2": "dropoff_minute",
        "dropoff_time_2": "dropoff_second"
    }))

tmp_df.head(5)
  | vendor | pickup_datetime | pickup_weekday | pickup_hour | pickup_minute | pickup_second | dropoff_datetime | dropoff_weekday | dropoff_hour | dropoff_minute | dropoff_second | store_forward | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passengers | distance | cost
0 | 2 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | 2013-08-01 17:22:00 | Thursday | 17 | 22 | 00 | N | -73.937767 | 40.758480 | -73.937767 | 40.758480 | 1 | 0.0 | 2.5
1 | 2 | 2013-08-01 17:24:00 | Thursday | 17 | 24 | 00 | 2013-08-01 17:25:00 | Thursday | 17 | 25 | 00 | N | -73.937927 | 40.757843 | -73.937927 | 40.757843 | 1 | 0.0 | 2.5
2 | 2 | 2013-08-06 06:51:19 | Tuesday | 06 | 51 | 19 | 2013-08-06 06:51:36 | Tuesday | 06 | 51 | 36 | N | -73.937721 | 40.758404 | -73.937721 | 40.758369 | 1 | 0.0 | 3.3
3 | 2 | 2013-08-06 13:26:34 | Tuesday | 13 | 26 | 34 | 2013-08-06 13:26:57 | Tuesday | 13 | 26 | 57 | N | -73.937691 | 40.758419 | -73.937790 | 40.758358 | 1 | 0.0 | 3.3
4 | 2 | 2013-08-06 13:27:53 | Tuesday | 13 | 27 | 53 | 2013-08-06 13:28:08 | Tuesday | 13 | 28 | 08 | N | -73.937805 | 40.758396 | -73.937775 | 40.758450 | 1 | 0.0 | 3.3

From the data above, you see that the pickup and drop-off date and time components produced by the transformations are correct. Drop the pickup_datetime and dropoff_datetime columns because they're no longer needed.

tmp_df = tmp_df.drop_columns(columns=["pickup_datetime", "dropoff_datetime"])

Use the type inference functionality to automatically detect the data type of each field, and display the inference results.

type_infer = tmp_df.builders.set_column_types()
type_infer.learn()
type_infer
Column types conversion candidates:
'pickup_weekday': [FieldType.STRING],
'pickup_hour': [FieldType.DECIMAL],
'pickup_minute': [FieldType.DECIMAL],
'pickup_second': [FieldType.DECIMAL],
'dropoff_hour': [FieldType.DECIMAL],
'dropoff_minute': [FieldType.DECIMAL],
'dropoff_second': [FieldType.DECIMAL],
'store_forward': [FieldType.STRING],
'pickup_longitude': [FieldType.DECIMAL],
'dropoff_longitude': [FieldType.DECIMAL],
'passengers': [FieldType.DECIMAL],
'distance': [FieldType.DECIMAL],
'vendor': [FieldType.STRING],
'dropoff_weekday': [FieldType.STRING],
'pickup_latitude': [FieldType.DECIMAL],
'dropoff_latitude': [FieldType.DECIMAL],
'cost': [FieldType.DECIMAL]

The inference results look correct based on the data. Now apply the type conversions to the dataflow.

tmp_df = type_infer.to_dataflow()
tmp_df.get_profile()

Before packaging the dataflow, run two final filters on the data set. To eliminate incorrect data points, filter the dataflow to records where both the cost and distance values are greater than zero.

tmp_df = tmp_df.filter(dprep.col("distance") > 0)
tmp_df = tmp_df.filter(dprep.col("cost") > 0)

At this point, you have a fully transformed and prepared dataflow object to use in a machine learning model. The SDK includes object serialization functionality, which is used as follows.

import os
file_path = os.path.join(os.getcwd(), "dflows.dprep")

dflow_prepared = tmp_df
package = dprep.Package([dflow_prepared])
package.save(file_path)
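
Part two of the tutorial opens this file again. As a quick round-trip check that the save succeeded, you can reload the package with Package.open(), the counterpart to Package.save():

# Reload the saved package and confirm the dataflow round-trips
package_saved = dprep.Package.open(file_path)
dflow_restored = package_saved.dataflows[0]
dflow_restored.head(5)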

Clean up resources

If you don't plan to continue with part two of the tutorial, delete the dflows.dprep file in your current directory (whether you're running locally or in Azure Notebooks). If you do continue to part two, you need the dflows.dprep file in the current directory.

Next steps

In part one of this tutorial, you:

  • Set up your development environment
  • Loaded and cleansed data sets
  • Used smart transforms that infer your logic from examples
  • Merged and packaged datasets for machine learning training

You're ready to use this training data in part two of the tutorial series, where you train a regression model.