What is the Azure Machine Learning Data Prep SDK for Python?

The Azure Machine Learning Data Prep SDK helps data scientists explore, cleanse and transform data for machine learning workflows in any Python environment.

This Python SDK includes functionality to load data from many file types, transform data into a more usable structure, and write that data to a location accessible to your models.

How does it differ?

The Azure Machine Learning Data Prep SDK offers an intelligent and scalable experience for essential data preparation scenarios, while maintaining interoperability with common data analysis libraries.

Key benefits to the SDK:

  • Cross-platform functionality. You can interact with the SDK in any Python environment alongside familiar libraries. Write with a single SDK and run it on Windows, macOS, or Linux.

  • Intelligent transformations powered by AI, including grouping similar values to their canonical form and deriving columns by examples without custom code.

  • Capability to work with multiple large files of different schemas.

  • Scalability on a single machine by streaming data during processing rather than loading into memory.

  • Seamless integration with other Azure Machine Learning services. You can simply pass your prepared data into an [AutoMLConfig] object for automated machine learning training; a hypothetical hand-off is sketched after this list.
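
The hand-off might look like the following minimal sketch. It is illustrative only: it assumes the automated ML components of the Azure Machine Learning SDK are installed, that AutoMLConfig is imported from azureml.train.automl, and that the file path and label column name are placeholders you replace with your own.

import azureml.dataprep as dprep
from azureml.train.automl import AutoMLConfig  # assumes the automated ML SDK components are installed

# Prepare the data with the Data Prep SDK, then materialize it for training.
dflow = dprep.auto_read_file(path="<your-file-path>")
df = dflow.to_pandas_dataframe()

# Hypothetical configuration; the task, metric, and label column are placeholders.
automl_config = AutoMLConfig(task="regression",
                             X=df.drop(columns=["<label-column>"]),
                             y=df["<label-column>"],
                             iterations=10,
                             primary_metric="r2_score")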

Install SDK & import

The Data Prep SDK requires a 64-bit Python environment.

To install the SDK, use the following command:

pip install --upgrade azureml-dataprep

If you intend to use pandas or read Parquet files, use the following command instead:

pip install --upgrade azureml-dataprep[pandas,parquet]

After installing the SDK, import the package in your Python code:

import azureml.dataprep as dprep

File type detection

Use the auto_read_file() function to load your data without having to specify the file type. This function automatically recognizes and parses the file type.

dflow = dprep.auto_read_file(path="<your-file-path>")
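
If you installed the pandas extra shown earlier, you can also materialize the resulting dataflow as a pandas DataFrame for further exploration. A minimal sketch, assuming a placeholder file path:

import azureml.dataprep as dprep

# Automatically detect and parse the file type.
dflow = dprep.auto_read_file(path="<your-file-path>")

# Materialize the dataflow as a pandas DataFrame (requires the pandas extra).
df = dflow.to_pandas_dataframe()
print(df.head())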

Summary statistics

Generate quick summary statistics for a dataflow with one line of code by using the get_profile() method.

dflow.get_profile()

Calling this function on a dataflow object results in output like the following table.

(Image: summary statistics profile showing per-column statistics such as Type, Min, Max, Count, Missing Count, Error Count, and Empty Count.)

Intelligent transforms

Use the SDK to split and derive columns by both example and inference to automate feature engineering. For example, assume you have a field called From with a value of "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>.

   From
0  "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
1  "PRINCE OBONG ELEME" <obong_715@epatra.com>
2  "Maryam Abacha" <m_abacha03@www.com>

To automatically split the From field, call the following function.

builder = dflow.builders.split_column_by_example(source_column="From")

builder.preview()

Because the example parameter is not defined, the function automatically splits the From field into two new fields, From_1 and From_2.

   From                                              From_1              From_2
0  "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>  MR. JAMES NGOLA.    james_ngola2002@maktoob.com
1  "PRINCE OBONG ELEME" <obong_715@epatra.com>       PRINCE OBONG ELEME  obong_715@epatra.com
2  "Maryam Abacha" <m_abacha03@www.com>              Maryam Abacha       m_abacha03@www.com

It's also possible to provide an example pattern, and the SDK will predict and execute your intended transformation. Using the same From column, the following code creates a new column, Sender email, containing the sender's email address based on the provided examples.

new_dflow = dflow.derive_column_by_example(
        source_columns="From",
        new_column_name="Sender email",
        example_data=[('"MR. JAMES NGOLA." <james_ngola2002@maktoob.com>', "james_ngola2002@maktoob.com"),
                      ('"PRINCE OBONG ELEME" <obong_715@epatra.com>', "obong_715@epatra.com")])

new_dflow.preview()

   From                                              Sender email
0  "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>  james_ngola2002@maktoob.com
1  "PRINCE OBONG ELEME" <obong_715@epatra.com>       obong_715@epatra.com
2  "Maryam Abacha" <m_abacha03@www.com>              m_abacha03@www.com

Assertions

Use assertions to verify that assumptions about the data values in specified columns remain continuously accurate.

For example, say you have a data set that contains the fields Latitude and Longitude. By definition, Latitude and Longitude values are constrained to specific ranges. To verify that this is the case in your data set, use assert_value().

from azureml.dataprep import value

dflow = dflow.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')
dflow = dflow.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')

dflow.get_profile()

In the preceding code, any assertion failure produces an error value with the error code InvalidLatitude or InvalidLongitude in the resulting data set.

           Type               Min         Max         Count  Missing Count  Not Missing Count  Percent missing  Error Count  Empty count  ...
Latitude   FieldType.DECIMAL  41.679311   42.008124   10.0   0.0            10.0               0.0              1.0          0.0          ...
Longitude  FieldType.DECIMAL  -87.800175  -87.644545  10.0   0.0            10.0               0.0              1.0          0.0          ...

From the profile, you see that the Error Count for both of these columns is 1. The following code filters the data set to retrieve the error and see which value caused the assertion to fail. From here, you can adjust your code and cleanse your data accordingly.

from azureml.dataprep import col

dflow_error = dflow.filter(col('Latitude').is_error())
error = dflow_error.head(10)['Latitude'][0]

print(error.originalValue)

Caching

A Dataflow can be cached as a file on your disk during a local run by calling dflow_cached = dflow.cache(directory_path). Running this code executes all the steps in the Dataflow dflow and saves the cached data to the specified directory_path. The returned Dataflow, dflow_cached, has a Caching Step added at the end. Any subsequent run on the Dataflow dflow_cached reuses the cached data, and the steps before the Caching Step are not run again.
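
For example, the following minimal sketch (the file path and cache directory are placeholders) caches a Dataflow locally and reuses the cached results in later operations:

import azureml.dataprep as dprep

dflow = dprep.auto_read_file(path="<your-file-path>")

# Run all steps in the Dataflow and save the results to a local directory.
dflow_cached = dflow.cache(directory_path="<your-cache-directory>")

# Subsequent operations on dflow_cached reuse the cached data instead of
# re-running the steps before the Caching Step.
dflow_cached.get_profile()
df = dflow_cached.to_pandas_dataframe()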

Caching avoids running transforms multiple times, which can make local runs more efficient. Common places to use caching include:

  • after reading data from a remote source
  • after expensive transforms, such as Sort
  • after transforms that change the shape of the data, such as Sampling, Filter, and Summarize

The Caching Step is ignored during a scale-out run invoked by to_spark_dataframe().

Get support

To get help or ask questions, please email: askamldataprep@microsoft.com

Next steps

To see detailed examples and code for each preparation step, follow these how-to guides:

  1. Data prep tutorial: prepare data for regression modeling using NYC taxi data and use automated machine learning to build the model
  2. How to load data, which can be in various formats
  3. How to transform data into a more usable structure
  4. How to write data to a location accessible to your models
  5. Explore the SDK using these sample Jupyter notebooks

Use the table of contents to the left to find reference documentation for SDK classes and modules.