What is the Azure Machine Learning Data Prep SDK for Python?

The Azure Machine Learning Data Prep SDK is used to load, transform, and write data for machine learning workflows. You can interact with the SDK in any Python environment, including Jupyter Notebooks or your favorite Python IDE.

This Python SDK includes functionality to help prepare your data for modeling, described in the sections that follow.

How does it differ?

The Azure Machine Learning Data Prep SDK is designed to feel familiar to users of other common data prep libraries, while offering advantages for key scenarios and maintaining interoperability with those libraries. Other packages are typically either convenient for smaller data sets but memory constrained, or tailored to large data sets but burdened with overhead that makes them slow on small data sets.

The data prep SDK is similar in core functionality, but also offers:

  • Practicality and convenience not only when working with small data sets

  • Added scalability for modern big-data applications

  • A single API that works on small data locally and on large data in the cloud with few-to-no code changes (see the sketch after this list)

  • Ability to scale more effectively on a single machine by streaming data during processing rather than loading it into memory
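
For example, the same read call works against a local file and a file in cloud storage. The following is a minimal sketch; the file names and the blob URL are hypothetical placeholders.

import azureml.dataprep as dprep

# Read a CSV file from the local file system (hypothetical path).
local_dataflow = dprep.read_csv(path="./data/sales.csv")

# The same call reads directly from cloud storage by URL (hypothetical URL).
remote_dataflow = dprep.read_csv(path="https://<your-account>.blob.core.windows.net/data/sales.csv")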

Install SDK & import

To install the SDK, use the following command:

pip install --upgrade azureml-dataprep

To import the package in your Python code, use:

import azureml.dataprep as dprep

File type detection

Use the auto_read_file() function to load your data without having to specify the file type. This function automatically detects the file type and parses the file accordingly.

dataflow = dprep.auto_read_file(path="<your-file-path>")
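
After loading, you can sanity-check the result by pulling a small preview into pandas, for example:

# Preview the first five parsed records as a pandas DataFrame.
dataflow.head(5)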

Intelligent transforms

Use the SDK to split and derive columns by both example and inference to automate feature engineering. Assume you have a field in your dataflow object called datetime with a value of 2018-12-15 14:30:00.

To automatically split the datetime field, call the following function.

new_dataflow = dataflow.split_column_by_example(source_column="datetime")

If you don't define the example parameter, the function automatically splits the datetime field into two new fields, datetime_1 and datetime_2, with the resulting values 2018-12-15 and 14:30:00, respectively. It's also possible to provide an example pattern, and the SDK will predict and execute your intended transformation. Using the same datetime object, the following code creates a new column datetime_weekday containing the weekday, based on the provided examples.

new_dataflow = dataflow.derive_column_by_example(
    source_columns="datetime",
    new_column_name="datetime_weekday",
    example_data=[("2009-01-04 10:12:00", "Sunday"), ("2013-08-22 17:00:00", "Thursday")]
)
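
To confirm that the derived column matches your intent, you can execute the dataflow and inspect the output, for example:

# Execute the dataflow locally and spot-check the new weekday column.
df = new_dataflow.to_pandas_dataframe()
print(df["datetime_weekday"].head())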

Summary statistics

You can generate quick summary statistics for a dataflow with one line of code. The get_profile() method offers a convenient way to understand your data and how it's distributed.

dataflow.get_profile()

Calling this method on a dataflow object returns a profile containing summary statistics for each column.
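
You can also work with the returned profile programmatically. The sketch below assumes the profile exposes per-column statistics through a columns mapping with attributes such as min, max, and missing_count; check the reference documentation for the exact names.

profile = dataflow.get_profile()

# Look up statistics for a single column. The attribute names here are
# assumptions about the profile object; see the reference docs for details.
column_profile = profile.columns["datetime"]
print(column_profile.min, column_profile.max, column_profile.missing_count)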

Cross-environment compatibility

The SDK also allows dataflow objects to be serialized and opened in any Python environment, which can be different from the environment where they were saved. This functionality allows for easy transfer between Python environments and quick integration with Azure Machine Learning models.

Use the following code to save your dataflow objects using dprep.Package:

package = dprep.Package([dataflow_1, dataflow_2])
package.save("<your-local-path>")

Use the following code to reopen your package in any environment and retrieve a list of dataflow objects.

package = dprep.Package.open("<your-local-path>")
dataflow_list = package.dataflows
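
Because dataflows are fully serialized, a reopened dataflow can be executed directly in the new environment. A minimal sketch continuing from the snippet above:

# Retrieve the first dataflow from the package and run it in this environment.
dataflow_1 = package.dataflows[0]
df = dataflow_1.to_pandas_dataframe()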

Get support

To get help or ask questions, please email: askamldataprep@microsoft.com

Next steps

To see detailed examples and code for each preparation step, follow these how-to guides:

  1. Data prep tutorial: prepare data for regression modeling using NYC taxi data and use automated machine learning to build the model
  2. How to load data in various formats
  3. How to transform data into a more usable structure
  4. How to write that data to a location accessible to your models
  5. Explore the SDK using these sample Jupyter notebooks

Use the table of contents to the left to find reference documentation for SDK classes and modules.