What is the Azure Machine Learning Data Prep SDK for Python?
The Azure Machine Learning Data Prep SDK helps data scientists explore, cleanse and transform data for machine learning workflows in any Python environment.
This Python SDK includes the following functionality:
Automatic file type detection. The SDK can automatically detect whether your data is in any of the supported file types. You don’t need to use special file readers for formats like CSV, text, Excel, etc., or to specify delimiter, header, or encoding parameters.
Summary statistics can be generated quickly for a dataflow with a single line of code.
Assertion. Create assertion rules to ensure that values in the specified columns satisfy the provided expression.
How does it differ?
The Azure Machine Learning Data Prep SDK offers an intelligent and scalable experience for essential data preparation scenarios, while maintaining interoperability with common data analysis libraries.
Key benefits to the SDK:
Cross-platform functionality. You can interact with the SDK in any Python environment alongside familiar libraries. Write with a single SDK and run it on Windows, macOS, or Linux.
Intelligent transformations powered by AI, including grouping similar values to their canonical form and deriving columns by examples without custom code.
Capability to work with multiple large files of different schemas.
Scalability on a single machine by streaming data during processing rather than loading into memory.
Seamless integration with other Azure Machine Learning services. You can simply pass your prepared data file into an [AutoMLConfig] object for automated machine learning training.
Install SDK & import
The Data Prep SDK requires a 64-bit Python environment.
To install the SDK, use the following command:
pip install --upgrade azureml-dataprep
If you intend to use pandas or read Parquet files, use the following command instead:
pip install --upgrade azureml-dataprep[pandas,parquet]
After installing the SDK, import the package in your Python code:
import azureml.dataprep as dprep
File type detection
Use the auto_read_file() function to load your data without having to specify the file type. This function automatically recognizes and parses the file type.
dflow = dprep.auto_read_file(path="<your-file-path>")
Generate quick summary statistics for a dataflow with one line of code using the get_profile() function. Calling this function on a dataflow object results in output like the following table.
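As a purely illustrative sketch (plain Python, not the SDK), the kinds of per-column statistics such a profile reports can be computed like this; the `column_stats` helper is hypothetical:

```python
# Illustrative only: the kinds of statistics a column profile reports,
# computed for a single column of values with plain Python.
def column_stats(values):
    present = [v for v in values if v is not None]  # non-missing values
    return {
        "count": len(values),
        "missing_count": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
    }

stats = column_stats([3.0, None, 7.5, 1.2])
print(stats["min"], stats["max"], stats["missing_count"])  # 1.2 7.5 1
```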
Use the SDK to split and derive columns by both example and inference to automate feature engineering. For example, assume you have a field called From with a value of "MR. JAMES NGOLA." <firstname.lastname@example.org>.
|   | From |
|---|------|
| 0 | "MR. JAMES NGOLA." email@example.com |
| 1 | "PRINCE OBONG ELEME" firstname.lastname@example.org |
| 2 | "Maryam Abacha" email@example.com |
To automatically split the From field, call the following function.
new_dflow = dflow.builders.split_column_by_example(source_column="From")
new_dflow.preview()
If you don't define the example parameter, the function automatically splits the From field into two new fields.
|   | From | From_1 | From_2 |
|---|------|--------|--------|
| 0 | "MR. JAMES NGOLA." firstname.lastname@example.org | MR. JAMES NGOLA. | email@example.com |
| 1 | "PRINCE OBONG ELEME" firstname.lastname@example.org | PRINCE OBONG ELEME | email@example.com |
| 2 | "Maryam Abacha" firstname.lastname@example.org | Maryam Abacha | email@example.com |
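The split the SDK infers here is conceptually equivalent to separating the quoted display name from the trailing address. A purely illustrative plain-Python version (using the standard re module; the `split_from` helper is hypothetical, not SDK code):

```python
import re

# Illustrative only: split a From value like '"NAME" address' into the
# quoted display name and the trailing address, mirroring the two
# derived columns shown above.
def split_from(value):
    match = re.match(r'"(?P<name>[^"]*)"\s*(?P<address>\S+)', value)
    if match is None:
        return None, None
    return match.group("name"), match.group("address")

name, address = split_from('"MR. JAMES NGOLA." firstname.lastname@example.org')
print(name)     # MR. JAMES NGOLA.
print(address)  # firstname.lastname@example.org
```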
It's also possible to provide an example pattern, and the SDK will predict and execute your intended transformation. Using the same From field, the following code creates a new column, Sender email, containing the sender's email address based on the provided examples.
new_dflow = dflow.derive_column_by_example(
    source_columns="From",
    new_column_name="Sender email",
    example_data=[("MR. JAMES NGOLA <firstname.lastname@example.org>", "email@example.com"),
                  ("PRINCE OBONG ELEME <firstname.lastname@example.org>", "email@example.com")])
|   | From | Sender email |
|---|------|--------------|
| 0 | "MR. JAMES NGOLA." firstname.lastname@example.org | email@example.com |
| 1 | "PRINCE OBONG ELEME" firstname.lastname@example.org | email@example.com |
| 2 | "Maryam Abacha" firstname.lastname@example.org | email@example.com |
Assertions ensure that assumptions about the data values in specified columns stay accurate.
For example, say you have a data set that contains the fields Latitude and Longitude. By definition, latitude and longitude values are constrained to specific ranges. To verify that this is the case in your data set, use the assert_value() function:
from azureml.dataprep import value

dflow = dflow.assert_value('Latitude', (value <= 90) & (value >= -90), error_code='InvalidLatitude')
dflow = dflow.assert_value('Longitude', (value <= 180) & (value >= -180), error_code='InvalidLongitude')
dflow.get_profile()
In the preceding code, any assertion failure produces an error value with the corresponding error code (InvalidLatitude or InvalidLongitude) in the resulting data set.
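As a plain-Python illustration of the range rule the Latitude assertion expresses (the `is_valid_latitude` helper is hypothetical, not part of the SDK):

```python
# Illustrative only: the range constraint the Latitude assertion enforces.
# Values outside [-90, 90] would be flagged with the InvalidLatitude error code.
def is_valid_latitude(value):
    return -90 <= value <= 90

print(is_valid_latitude(45.2))   # True
print(is_valid_latitude(242.7))  # False -> assertion failure
```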
|Type|Min|Max|Count|Missing Count|Not Missing Count|Percent missing|Error Count|Empty count|...|
|---|---|---|---|---|---|---|---|---|---|
From the profile, you see that the Error Count for both of these columns is 1. The following code filters the data set to retrieve the error and see what value causes the assertion to fail. From here you can adjust your code and cleanse your data accordingly.
from azureml.dataprep import col

dflow_error = dflow.filter(col('Latitude').is_error())
error = dflow_error.head(10)['Latitude']
print(error.originalValue)
A Dataflow can be cached as a file on your disk during a local run by calling
dflow_cached = dflow.cache(directory_path). This code runs all the steps in the Dataflow, dflow, and saves the cached data to the specified directory_path. The returned Dataflow, dflow_cached, has a Caching Step added at the end. Any subsequent runs on the Dataflow dflow_cached reuse the cached data, and the steps before the Caching Step won't run again.
Caching avoids running transforms multiple times, which can make local runs more efficient. Here are common places to use Caching:
- after reading data from remote
- after expensive transforms, such as Sort
- after transforms that change the shape of data, such as Sampling, Filter and Summarize
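The compute-once, reuse-later pattern a Caching Step provides can be pictured with this generic disk-cache sketch (plain Python, not the SDK; `cached_result` is a hypothetical helper):

```python
import json
import os
import tempfile

# Illustrative only: run an expensive computation on a cache miss and
# persist the result; later calls reuse the cached file instead of
# re-running the earlier steps.
def cached_result(cache_path, compute):
    if os.path.exists(cache_path):      # cache hit: skip the computation
        with open(cache_path) as f:
            return json.load(f)
    result = compute()                  # cache miss: run the steps once
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result

path = os.path.join(tempfile.mkdtemp(), "dflow_cache.json")
first = cached_result(path, lambda: sorted([3, 1, 2]))  # computes
second = cached_result(path, lambda: [9, 9, 9])         # reuses cache
print(first, second)  # [1, 2, 3] [1, 2, 3]
```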
Note that the Caching Step is ignored during scale-out runs.
To get help or ask questions, please email: firstname.lastname@example.org
To see detailed examples and code for each preparation step, follow these how-to guides:
- Data prep tutorial: prepare data for regression modeling using NYC taxi data and use automated machine learning to build the model
- How to load data, which can be in various formats
- How to transform data into a more usable structure
- How to write data to a location accessible to your models
- Explore the SDK using these sample Jupyter notebooks
Use the table of contents to the left to find reference documentation for SDK classes and modules.