Data Transformation - Manipulation
This article describes the modules in Azure Machine Learning Studio that you can use for basic data manipulation.
Applies to: Machine Learning Studio
This content pertains only to Studio. Similar drag-and-drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.
Machine Learning Studio supports tasks that are specific to machine learning, such as normalization or feature selection. The modules in this category are intended for more general tasks.
You can use Azure Machine Learning Workbench to perform more sophisticated data cleanup and preparation tasks by using "learn by example" functions. For examples, see the Microsoft Machine Learning team blog post Data transformations “by example” in Machine Learning Workbench.
Data manipulation tasks
The modules in this category support core data management tasks that you might need to perform in Machine Learning Studio. The following tasks are examples of core data management tasks:
- Combine two datasets, either by using joins, or by merging columns or rows.
- Create new categories to use in grouping data.
- Modify column headings, change column data types, or flag columns as features or labels.
- Check for missing values, and then replace them with appropriate values.
- Perform sampling or divide a dataset into training and testing sets: Use the Data Transformation - Sample and Split modules.
- Scale numbers, normalize data, or put numerical values into bins: Use the Data Transformation - Scale and Reduce modules.
- Perform calculations on numeric data fields or generate commonly used statistics: Use the tools in Statistical Functions.
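To make two of the tasks above concrete, the following minimal sketch joins two small datasets on a shared key and then replaces a missing value with the column mean. It uses plain Python rather than Studio modules. The sample data, the column names, and the helper functions `inner_join` and `fill_missing_with_mean` are hypothetical and only illustrate the idea; they are not how Studio implements these operations.

```python
# Hypothetical sample data; the column names are illustrative only.
customers = [
    {"id": 1, "region": "North"},
    {"id": 2, "region": "South"},
    {"id": 3, "region": "North"},
]
orders = [
    {"id": 1, "amount": 20.0},
    {"id": 2, "amount": None},  # missing value
    {"id": 3, "amount": 40.0},
]


def inner_join(left, right, key):
    """Combine rows from two lists of dicts that share the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


def fill_missing_with_mean(rows, column):
    """Replace None in a numeric column with the mean of the present values."""
    present = [r[column] for r in rows if r[column] is not None]
    if not present:
        # Nothing to average; return unchanged copies of the rows.
        return [dict(r) for r in rows]
    mean = sum(present) / len(present)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]


joined = inner_join(customers, orders, "id")
cleaned = fill_missing_with_mean(joined, "amount")
```

Replacing missing values with the mean is only one possible strategy. Depending on the data, a median, a constant, or removing the affected rows might be more appropriate.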
For examples of how to work with complex data in machine learning experiments, see these samples in the Azure AI Gallery:
- Data Processing and Analysis: Demonstrates key tools and processes.
- Breast cancer detection: Illustrates how to partition datasets, and then apply special processing to each partition.
Modules in this category
The Data Transformation - Manipulation category includes the following modules:
- Add Columns: Adds a set of columns from one dataset to another.
- Add Rows: Appends a set of rows from an input dataset to the end of another dataset.
- Apply SQL Transformation: Runs a SQLite query on input datasets to transform the data.
- Clean Missing Data: Specifies how to handle values that are missing from a dataset. This module replaces Missing Values Scrubber, which has been deprecated.
- Convert to Indicator Values: Converts categorical values in columns to indicator values.
- Edit Metadata: Edits metadata that's associated with columns in a dataset.
- Group Categorical Values: Groups data from multiple categories into a new category.
- Join Data: Joins two datasets.
- Remove Duplicate Rows: Removes duplicate rows from a dataset.
- Select Columns in Dataset: Selects columns to include in or exclude from a dataset in an operation.
- Select Columns Transform: Creates a transformation that selects the same subset of columns as in a specified dataset.
- SMOTE: Increases the number of low-incidence examples in a dataset by using synthetic minority oversampling.
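Because Apply SQL Transformation runs a SQLite query, you can get a rough feel for that kind of transformation outside Studio with Python's built-in `sqlite3` module. The following sketch joins two small tables and removes duplicate rows in a single query, which loosely parallels what Join Data and Remove Duplicate Rows do. The table names `t1` and `t2`, the columns, and the data are assumptions made for illustration; the sketch does not reproduce the module's actual input handling.

```python
import sqlite3

# In-memory database standing in for two input datasets (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, category TEXT)")
conn.execute("CREATE TABLE t2 (id INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "a"), (3, "a")],  # the last row is a duplicate
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(1, 0.5), (2, 1.5), (3, 2.5)],
)

# Join the two tables on id and drop duplicate result rows.
query = """
SELECT DISTINCT t1.id, t1.category, t2.score
FROM t1
JOIN t2 ON t1.id = t2.id
ORDER BY t1.id
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Trying a query against a small in-memory database like this can be a quick way to check its logic before you use it in an experiment.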