What is the Azure Machine Learning Data Prep SDK for Python?
The Azure Machine Learning Data Prep SDK is used to load, transform, and write data for machine learning workflows. You can interact with the SDK in any Python environment, including Jupyter Notebooks or your favorite Python IDE.
This Python SDK includes the following functionality to help prepare your data for modeling:
Automatic file type detection. The SDK can automatically detect any of the supported file types. You don’t need to use special file readers for CSV, text, Excel, etc., or to specify delimiter, header, or encoding parameters.
Cross-platform functionality with a single code artifact. The SDK also allows for dataflow objects to be serialized and opened in any Python environment. Write your code once against a single SDK and run it on Windows, macOS, Linux, or Spark in a scale-up or scale-out manner. When running scale-up, the engine attempts to use all available hardware threads; when running scale-out, it allows the distributed scheduler to optimize execution.
Quick summary statistics. Summary statistics can be generated for a dataflow with a single line of code.
Scale through streaming. Instead of loading all the data into memory, the SDK engine streams data for better scale and performance on large datasets.
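For example, because evaluation is lazy, previewing a dataflow streams only the records it needs. A minimal sketch (auto_read_file is covered below; head() is the dataflow's preview method, and the path is a placeholder):

import azureml.dataprep as dprep

# Build a dataflow; the engine detects the file type automatically.
dataflow = dprep.auto_read_file(path="<your-file-path>")
# Records stream through the engine on demand; head() materializes
# only the first five rows as a pandas DataFrame.
preview = dataflow.head(5)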
How does it differ?
The Azure Machine Learning Data Prep SDK is designed to be familiar to users of other common data prep libraries, while offering advantages and flexibility for key scenarios and maintaining interoperability with those libraries. Other packages typically either work well on smaller data sets but are memory constrained, or are tailored for large data sets but carry overhead that makes them slow on small data sets.
The data prep SDK is similar in core functionality but offers:
Practicality and convenience not only when working with small data sets
Added scalability for modern big-data applications
A single API that works on small data locally and on large data in the cloud with few-to-no code changes
Ability to scale more effectively on a single machine by streaming data during processing rather than loading it into memory
Install SDK & import
To install the SDK, use the following command:
pip install --upgrade azureml-dataprep
To import the package in your Python code, use:
import azureml.dataprep as dprep
File type detection
Use the auto_read_file() function to load your data without having to specify the file type. This function automatically recognizes and parses the file type.
dataflow = dprep.auto_read_file(path="<your-file-path>")
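If you do want explicit control over parsing, the SDK also provides the format-specific readers mentioned above. A minimal sketch using read_csv (delimiter, header, and encoding parameters are left at their defaults here):

# Read a CSV file explicitly instead of relying on auto-detection.
dataflow = dprep.read_csv(path="<your-file-path>")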
Split and derive columns by example
Use the SDK to split and derive columns by both example and inference to automate feature engineering. Assume you have a field in your dataflow object called datetime with a value of 2018-09-15 14:30:00. To automatically split the datetime field, call the following function.
new_dataflow = dataflow.split_column_by_example(source_column="datetime")
By not defining the example parameter, the function automatically splits the datetime field into two new fields, datetime_1 and datetime_2. The resulting values are 2018-09-15 and 14:30:00, respectively. It's also possible to provide an example pattern, and the SDK will predict and execute your intended transformation. Using the same datetime field, the following code creates a new column, datetime_weekday, for the weekday based on the provided examples.
new_dataflow = dataflow.derive_column_by_example(
    source_columns="datetime",
    new_column_name="datetime_weekday",
    example_data=[("2009-01-04 10:12:00", "Sunday"), ("2013-08-22 17:00:00", "Thursday")])
Summary statistics
You can generate quick summary statistics for a dataflow with one line of code. The get_profile() method of a dataflow offers a convenient way to understand your data and how it's distributed. Calling this method on a dataflow object returns a table of per-column statistics such as count, min, max, and mean.
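A minimal example (get_profile() is the profiling method named above; it assumes the dataflow from the earlier auto_read_file call):

# Compute per-column summary statistics for the dataflow.
profile = dataflow.get_profile()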
Cross environment compatible
The SDK also allows for dataflow objects to be serialized and opened in any Python environment. The environment where it's opened can be different than the environment where it's saved. This functionality allows for easy transfer between Python environments and quick integration with Azure Machine Learning models.
Use the following code to save your dataflow objects using a Package.
package = dprep.Package([dataflow_1, dataflow_2])
package.save("<your-local-path>")
Use the following code to reopen your package in any environment and retrieve a list of dataflow objects.
package = dprep.Package.open("<your-local-path>")
dataflow_list = package.dataflows
To get help or ask questions, please email: firstname.lastname@example.org
To see detailed examples and code for each preparation step, follow these how-to guides:
- Data prep tutorial: prepare data for regression modeling using NYC taxi data and use automated machine learning to build the model
- How to load data, which can be in various formats
- How to transform data into a more usable structure
- How to write that data to a location accessible to your models
- Explore the SDK using these sample Jupyter notebooks
Use the table of contents to the left to find reference documentation for SDK classes and modules.