Split Data using Regular Expression

This article describes how to use the Regular Expression Split option in the Split Data module of Azure Machine Learning Studio (classic). This option is useful when you need to apply filter criteria to a text column. For example, you might divide your dataset by whether a particular product is mentioned.


Applies to: Machine Learning Studio (classic)

This content pertains only to Studio (classic). Similar drag and drop modules have been added to Azure Machine Learning designer (preview). Learn more in this article comparing the two versions.

You can use a regular expression split on a single text column. You define a regular expression that includes the text column name, and then set conditions that apply to the column, such as "begins with", "contains", or "does not contain".

For general information about data partitioning for machine learning experiments, see Split Data and Partition and Sample.

Use a regular expression to divide a dataset

  1. Add the Split Data module to your experiment, and connect the dataset you want to split to the module's input.

  2. For Splitting mode, select Regular expression split.

  3. In the Regular expression box, type a valid regular expression. Examples are provided later in this article.

    The regular expression is applied only to the specified column, which must be a string data type.

    For help composing regular expressions, see the Regular Expression Language - Quick Reference.

  4. Run the experiment, or right-click the module and select Run selected.

    Based on the regular expression you provide, the dataset is divided into two sets of rows: rows with values that match the expression and all remaining rows.
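
The two-way division described in the steps above can be sketched outside Studio. The following is a minimal illustration only: it uses Python's `re` module as a stand-in for the regular expression engine in Studio (classic), and the function name `split_by_regex`, the sample rows, and the use of "match anywhere in the value" semantics are all assumptions made for the example, not part of the module.

```python
import re


def split_by_regex(rows, column, pattern):
    """Return (matched, remaining) for rows whose `column` value matches `pattern`.

    Hypothetical helper: mimics the two outputs of a regular expression split.
    """
    compiled = re.compile(pattern)
    matched, remaining = [], []
    for row in rows:
        # re.search looks for the pattern anywhere in the value (assumed
        # semantics); an anchor such as ^ restricts it to the start.
        if compiled.search(row[column]):
            matched.append(row)
        else:
            remaining.append(row)
    return matched, remaining


# Made-up sample data with a single string column named Text.
rows = [
    {"Text": "The Gryphon sat up"},
    {"Text": "Alice said nothing"},
    {"Text": "said the Gryphon"},
]

matched, remaining = split_by_regex(rows, "Text", "Gryphon")
print(len(matched), len(remaining))
```

Every row lands in exactly one of the two lists, which mirrors the module's behavior of sending non-matching rows to the second output.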


The following examples demonstrate how to divide a dataset using the Regular Expression option.

Single whole word

This example puts into the first dataset all rows that contain the text Gryphon in the column Text, and puts other rows into the second output of Split Data:

    \"Text" Gryphon  

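In standard regular expression syntax, a bare literal such as Gryphon also matches inside longer words, and a word boundary (`\b`) is needed to restrict the match to the whole word. The sketch below shows the difference using Python's `re` module; whether Studio (classic) treats the expression above as a substring or a whole-word match is not stated here, so treat this as an illustration of general regex behavior on made-up values.

```python
import re

# Made-up sample values for a text column.
values = ["The Gryphon sat up", "Two Gryphons", "gryphon", "Mock Turtle"]

# Bare literal: matches wherever the characters appear, including inside "Gryphons".
substring = [v for v in values if re.search(r"Gryphon", v)]

# Word boundaries: matches only when Gryphon stands alone as a word.
whole_word = [v for v in values if re.search(r"\bGryphon\b", v)]

print(substring)
print(whole_word)
```

Neither pattern matches the lowercase value, because matching is case-sensitive by default.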

This example looks for the specified characters at the start of each value in the second column of the dataset, denoted here by the index value of 1. The match is case-sensitive.

    (\1) ^[a-f]

The first result dataset contains all rows where the value in that column begins with one of these characters: a, b, c, d, e, f. All other rows are directed to the second output.
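
To see how the anchor and the case-sensitive character class behave, the pattern can be tried on a few values. This is a sketch using Python's `re` module in place of Studio's engine, with made-up values; only the pattern `^[a-f]` comes from the example above.

```python
import re

# Made-up sample values standing in for the second column.
values = ["apple", "Fig", "banana", "grape", "elderberry", ""]

pattern = re.compile(r"^[a-f]")  # ^ anchors the match to the start of the value

first_output = [v for v in values if pattern.search(v)]
second_output = [v for v in values if not pattern.search(v)]

print(first_output)
print(second_output)
```

"Fig" goes to the second output because the class lists only lowercase letters, and the empty string goes there because it has no first character to match.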

String match on IP addresses

This example divides some server log data into two categories for analysis: connections behind the firewall and connections with IP addresses outside the firewall. The regular expression is applied to the IP_Address field (a string data type).

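    (\IP_Address) ^10\.

The first output contains all addresses whose first octet is 10, such as 10.0.0.5. All other addresses are directed to the second output. The literal prefix 10\. is used rather than the character class [10], which matches a single character, either 1 or 0.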
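
When matching a prefix such as 10, it helps to check the difference between a character class and a literal prefix. The sketch below uses Python's `re` module and made-up addresses; it illustrates general regex behavior, not the Studio engine itself.

```python
import re

# Made-up sample addresses.
addresses = [
    "10.0.0.5",
    "192.168.1.20",
    "100.64.3.1",
    "172.16.0.9",
    "8.8.8.8",
    "0.0.0.0",
]

# [10] is a character class: it matches one character, either 1 or 0.
char_class = [a for a in addresses if re.search(r"^[10]", a)]

# 10\. is a literal prefix: the digits 1 and 0 followed by a period.
literal_prefix = [a for a in addresses if re.search(r"^10\.", a)]

print(char_class)
print(literal_prefix)
```

The character class accepts every address that starts with 1 or 0, while the literal prefix keeps only addresses whose first octet is exactly 10.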

See also

Sample and Split
Partition and Sample