Split Data using Regular Expression
This article describes how to use the Regular Expression Split option in the Split Data module of Azure Machine Learning Studio (classic). This option is useful when you need to apply a filter criterion to a text column. For example, you might divide your dataset by whether a particular product is mentioned.
Applies to: Machine Learning Studio (classic)
This content pertains only to Studio (classic). Similar drag-and-drop modules have been added to Azure Machine Learning designer (preview). Learn more in this article comparing the two versions.
You can use a regular expression split on a single text column. You define a regular expression that includes the text column name, and then set conditions that apply to the column, such as "begins with", "contains", or "does not contain".
Other options in the Split Data module:
Split data using relative expressions: Apply an expression to numeric data.
Split recommender datasets: Divide datasets that are used in recommendation models. The dataset should have three columns: items, users, and ratings.
Use a regular expression to divide a dataset
Add the Split Data module to your experiment, and connect the dataset you want to split to the module's input.
For Splitting mode, select Regular expression split.
In the Regular expression box, type a valid regular expression. Some examples are provided here.
The regular expression is applied only to the specified column, which must be a string data type.
For help composing regular expressions, see the Regular Expression Language - Quick Reference.
Run the experiment, or right-click the module and select Run selected.
Based on the regular expression you provide, the dataset is divided into two sets of rows: rows with values that match the expression and all remaining rows.
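The two-way division described above can be sketched outside Studio in plain Python. This is only an illustration of the behavior, not the module's implementation: the function name regex_split, the sample rows, and the column name review are all hypothetical, and Python's re module stands in for the regular expression engine that Studio uses.

```python
import re

def regex_split(rows, column, pattern):
    """Partition rows into (matched, remaining) by searching for
    `pattern` in the string value of row[column]."""
    compiled = re.compile(pattern)
    matched, remaining = [], []
    for row in rows:
        # Only the chosen column is tested; it is expected to hold strings.
        if compiled.search(row[column]):
            matched.append(row)
        else:
            remaining.append(row)
    return matched, remaining

# Hypothetical sample data: product reviews.
rows = [
    {"id": 1, "review": "The Contoso kettle works well"},
    {"id": 2, "review": "Shipping was slow"},
    {"id": 3, "review": "Returned my Contoso toaster"},
]

# First output: rows that mention the product. Second output: all other rows.
first_output, second_output = regex_split(rows, "review", r"Contoso")
```

Every input row lands in exactly one of the two outputs, which mirrors the "matching rows" and "all remaining rows" outputs of the module.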
The following examples demonstrate how to divide a dataset using the Regular Expression option.
Single whole word
This example puts all rows that contain the text Gryphon in the column Text into the first dataset, and puts the other rows into the second output of Split Data.
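As a rough equivalent of this example, the following Python sketch routes rows by whether the Text column contains Gryphon. The sample rows are invented for illustration, and the sketch uses Python's re module rather than the Studio expression syntax, so treat it as an approximation of the behavior.

```python
import re

# Hypothetical rows; only the "Text" column is tested.
rows = [
    {"Text": "The Gryphon sat up and rubbed its eyes"},
    {"Text": "Alice said nothing"},
    {"Text": "said the Gryphon"},
]

pattern = re.compile(r"Gryphon")

# search() looks for the word at any position in the value.
first_output = [r for r in rows if pattern.search(r["Text"])]
second_output = [r for r in rows if not pattern.search(r["Text"])]
```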
This example looks for the specified string in any position within the second column of the dataset, denoted here by the index value of 1. The match is case-sensitive.
The first result dataset contains all rows where the index column begins with one of these characters:
f. All other rows are directed to the second output.
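A comparable split on a column identified by position can be sketched in Python as follows. The sample rows are hypothetical, and the character class [a-f] is an assumption made for illustration, since only the character f is visible in the list above. The ^ anchor restricts the match to the start of the value, and matching is case-sensitive by default in Python's re module.

```python
import re

# Hypothetical rows as tuples; position 1 is the second column.
rows = [
    (101, "apple"),
    (102, "Fig"),
    (103, "fennel"),
    (104, "grape"),
]

# Assumed character class; anchored to the start of the value.
pattern = re.compile(r"^[a-f]")

first_output = [r for r in rows if pattern.search(r[1])]
second_output = [r for r in rows if not pattern.search(r[1])]
```

In this sketch "Fig" goes to the second output because the uppercase F falls outside the lowercase class, which illustrates the case-sensitive behavior.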
String match on IP addresses
This example divides some server log data into two categories for analysis: connections behind the firewall and connections with IP addresses outside the firewall. The regular expression is applied to the IP_Address field (a string data type).
The first output contains all addresses that begin with