Prepare data for modeling with Azure Machine Learning
In this article, you learn about the use cases and unique features of the Azure Machine Learning Data Prep SDK. Data preparation is the most important part of a machine learning workflow. Real-world data is often broken, inconsistent, or unable to be used as training data without significant cleansing and transformation. Correcting errors and anomalies in raw data, and building new features that are relevant to the problem you're trying to solve, will increase model accuracy.
You can prepare your data in Python using the Azure Machine Learning Data Prep SDK.
Azure Machine Learning Data Prep SDK
The Azure Machine Learning Data Prep SDK is a Python library that includes many common data preprocessing tools. It also adds advanced functionality like automated feature engineering and transformations derived from examples. The SDK is similar in core-functionality to popular libraries such as Pandas and PySpark, yet offers more flexibility. Pandas is typically most useful on smaller data sets (< 2-5 GB) before memory capacity-constraints affect performance. In contrast, PySpark is generally for big-data applications but carries an overhead that makes working with small data sets much slower.
The SDK offers:
- Practicality and convenience when working with small data sets
- Scalability for modern big-data applications
- The ability to use and scale the same code for both use-cases
The following examples highlight some of the unique functionality of the SDK.
Install the SDK
Install the SDK in your Python environment using the following command.
pip install azureml-dataprep
Use the following code to import the package.
import azureml.dataprep as dprep
Automatic file type detection
smart_read_file() function to load your data without having to specify the file type. This function automatically recognizes and parses the file type.
dataflow = dprep.smart_read_file(path="<your-file-path>")
Automated feature engineering
Use the SDK to split and derive columns by both example and inference to automate feature engineering. Assume you have a field in your dataflow object called
datetime with a value of
To automatically split the
datetime field, call the following function.
new_dataflow = dataflow.split_column_by_example(source_column="datetime")
By not defining the example parameter, the function will automatically split the
datetime field into two new fields
datetime_2. The resulting values are
14:30:00, respectively. It's also possible to provide an example pattern, and the SDK will predict and execute your intended transformation. Using the same
datetime object, the following code will create a new column
datetime_weekday for the weekday based on the provided example.
new_dataflow = dataflow.derive_column_by_example( source_columns="datetime", new_column_name="datetime_weekday", example_data=[("2009-01-04 10:12:00", "Sunday"), ("2013-08-22 17:00:00", "Thursday")] )
You can generate quick summary statistics for a dataflow with one line of code. This method offers a convenient way to understand your data and how it's distributed.
Calling this function on a dataflow object will result in output like the following table.
Multiple environment compatibilities
The SDK also allows for dataflow objects to be serialized and opened in any Python environment. The environment where it's opened can be different than the environment where it's saved. This functionality allows for easy transfer between Python environments and quick integration with Azure Machine Learning models.
Use the following code to save your dataflow objects.
package = dprep.Package([dataflow_1, dataflow_2]) package.save("<your-local-path>")
Use the following code to reopen your package in any environment and retrieve a list of dataflow objects.
package = dprep.Package.open("<your-local-path>") dataflow_list = package.dataflows
Data preparation pipeline
To see detailed examples and code for each preparation step, use the following how-to guides:
- Load data, which can be in various formats
- Transform it into a more usable structure
- Write that data to a location accessible to your models
Review an example notebook of data preparation using the Azure Machine Learning Data Prep SDK.
Azure Machine Learning Data Prep SDK reference documentation.