Split column by example transformation
This transform predictively splits the content of a column on meaningful boundaries without requiring user input. The Split algorithm selects the boundaries after analyzing the content of the column. These boundaries could be defined by
- A fixed delimiter,
- Multiple, arbitrary delimiters appearing in particular contexts, or,
- Data patterns or certain entity types
Users can also control the splitting behavior in the Advanced mode where they can specify the delimiters or by provide examples of desired splitting.
In theory, Split operations can also be performed in the Workbench using a series of Derive Column by Example transforms. However, if there are several columns, deriving each of those individually even using by-example approach can be very time consuming. Predictive split enables easy splitting without the user needing to provide examples for each of the columns.
How to perform this transformation
- Select the column that you want to split.
- Select Split Column by Example from the Transforms menu. Or, Right-click on the header of the selected column and select Split Column by Example from the context menu. The Transform Editor opens and new columns are added next to the selected column. At this point, the Workbench analyzes the input column, determines split boundaries, and synthesizes a program to split the column as displayed in the grid. The synthesized program is executed against all the rows in the column. Delimiters, if any, are excluded from the final result.
- You can click on the Advanced mode for finer control over the split transformation.
- Review the output and Click OK to accept the transform.
The transform aims to produce the same number of resultant columns for all the rows. If any row cannot be split on the determined boundaries, it produces null for all the columns by default. This behavior can be changed in the Advanced Mode.
Transform editor: advanced mode
Advanced Mode provides a richer experience for Splitting columns.
Selecting Keep Delimiter Columns includes the Delimiters in the final result. Delimiters are excluded by default.
Specifying Delimiters overrides the automatic delimiter selection logic. Multiple delimiters, one in each line, can be specified as Delimiters. All those characters are used as delimiters to split the column.
Sometimes, splitting a value on determined boundaries produces different number of columns than the majority of others. In those cases, Fill Direction is used to decide the order in which the columns should be filled.
Clicking on Show suggested examples displays the representative rows for which user should provide an example of split. User can click on the Up arrow to the right of the suggested row to promote the row as an example.
User can Delete Column or Insert new Columns by Right-clicking on the header of the Examples Table.
User can Copy and Paste values from one Cell to another in order to provide an example of split.
User can switch between the Basic Mode and the Advanced Mode by clicking the links in the Transform Editor.
Editing an existing transformation
A user can edit an existing Split Column By Example transform by selecting Edit option of the Transformation Step. Clicking on Edit opens the Transform Editor in Advanced Mode, and all the examples that were provided during creation of the transform are shown.
Examples of splitting on a fixed, single-character delimiter
It is common for data fields to be separated by a single fixed delimiter such as a comma in a CSV format. The Split transform attempts to infer these delimiters automatically. For example, in the following scenario it automatically infers the "." as a delimiter.
Splitting IP addresses
The values in the first column are predictively split into four columns.
Examples of splitting on multiple delimiters within particular contexts
The user's data may include many different delimiters separating different fields. Moreover, only some occurrences of a delimiting string may be a delimiter but not all. For example, in the following case the set of delimiters required is "-", "," and ":" to produce the desired output. However, not every occurrence of the ":" should be a split point, since we do not want to split the time but keep it in a single column. The Split transform infers delimiters within the contexts in which they occur in the input data rather than any possible occurrence of the delimiter. The transform is also aware of common data types such as dates and times.
Splitting store opening timings
The values in the following Timings column get predictively split into nine columns shown in the table under it.
|Monday - Friday: 7:00 am - 6:00 pm,Saturday: 9:00 am - 5:00 pm,Sunday: Closed|
|Monday - Friday: 9:00 am - 6:00 pm,Saturday: 4:00 am - 4:00 pm,Sunday: Closed|
|Monday - Friday: 8:30 am - 7:00 pm,Saturday: 3:00 am - 2:30 pm,Sunday: Closed|
|Monday - Friday: 8:00 am - 6:00 pm,Saturday: 2:00 am - 3:00 pm,Sunday: Closed|
|Monday - Friday: 4:00 am - 7:00 pm,Saturday: 9:00 am - 4:00 pm,Sunday: Closed|
|Monday - Friday: 8:30 am - 4:30 pm,Saturday: 9:00 am - 5:00 pm,Sunday: Closed|
|Monday - Friday: 5:30 am - 6:30 pm,Saturday: 5:00 am - 4:00 pm,Sunday: Closed|
|Monday - Friday: 8:30 am - 8:30 pm,Saturday: 6:00 am - 5:00 pm,Sunday: Closed|
|Monday - Friday: 8:00 am - 9:00 pm,Saturday: 9:00 am - 8:00 pm,Sunday: Closed|
|Monday - Friday: 10:00 am - 9:30 pm,Saturday: 9:30 am - 3:00 pm,Sunday: Closed|
|Monday||Friday||7:00 am||6:00 pm||Saturday||9:00 am||5:00 pm||Sunday||Closed|
|Monday||Friday||9:00 am||6:00 pm||Saturday||4:00 am||4:00 pm||Sunday||Closed|
|Monday||Friday||8:30 am||7:00 pm||Saturday||3:00 am||2:30 pm||Sunday||Closed|
|Monday||Friday||8:00 am||6:00 pm||Saturday||2:00 am||3:00 pm||Sunday||Closed|
|Monday||Friday||4:00 am||7:00 pm||Saturday||9:00 am||4:00 pm||Sunday||Closed|
|Monday||Friday||8:30 am||4:30 pm||Saturday||9:00 am||5:00 pm||Sunday||Closed|
|Monday||Friday||5:30 am||6:30 pm||Saturday||5:00 am||4:00 pm||Sunday||Closed|
|Monday||Friday||8:30 am||8:30 pm||Saturday||6:00 am||5:00 pm||Sunday||Closed|
|Monday||Friday||8:00 am||9:00 pm||Saturday||9:00 am||8:00 pm||Sunday||Closed|
|Monday||Friday||10:00 am||9:30 pm||Saturday||9:30 am||3:00 pm||Sunday||Closed|
Splitting IIS log
Here is another example of multiple arbitrary delimiters. This example also includes a contextual delimiter "/", which must not be split inside the URLs or file paths. It is tedious to perform this splitting using many Derive Column by Example transforms and giving examples for each field. Using the Split transform we can perform the predictive splitting without giving any examples.
|188.8.131.52 - - [16/Oct/2016 16:22:33 -0200] "GET /images/picture.gif HTTP/1.1" 234 343 www.yahoo.com "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 4)" "-"|
|10.128.72.213 - - [17/Oct/2016 12:43:12 +0300] "GET /news/stuff.html HTTP/1.1" 200 6233 www.aol.com "http://www.sample.com/" "Mozilla/5.0 (MSIE)" "-"|
|184.108.40.206 - - [12/Nov/2016 14:22:44 -0500] "GET /sample.ico HTTP/1.1" 342 7342 www.facebook.com "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)" "-"|
|10.166.64.165 - - [23/Nov/2016 01:52:45 -0800] "GET /style.css HTTP/1.1" 200 2552 www.google.com "http://www.test.com/index.html" "Mozilla/5.0 (Windows)" "-"|
|220.127.116.11 - - [16/Jan/2017 22:34:56 +0200] "GET /js/ads.js HTTP/1.1" 200 23462 www.microsoft.com "http://www.illustration.com/index.html" "Mozilla/5.0 (Windows)" "-"|
|18.104.22.168 - - [28/Jan/2017 26:36:16 +0800] "GET /search.php HTTP/1.1" 400 1777 www.bing.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" "-"|
|22.214.171.124 - - [23/Mar/2017 01:55:25 -0800] "GET /style.css HTTP/1.1" 200 2552 www.google.com "http://www.test.com/index.html" "Mozilla/5.0 (Windows)" "-"|
|126.96.36.199 - - [16/Apr/2017 11:34:36 +0200] "GET /js/ads.js HTTP/1.1" 200 23462 www.microsoft.com "http://www.illustration.com/index.html" "Mozilla/5.0 (Windows)" "-"|
Gets split into:
|188.8.131.52||16/Oct/2016||16:22:33||-0200||GET||images/picture.gif||HTTP||1.1||234||343||www.yahoo.com||http://www.example.com/||Mozilla||4.0||compatible; MSIE 4|
|184.108.40.206||12/Nov/2016||14:22:44||-0500||GET||sample.ico||HTTP||1.1||342||7342||www.facebook.com||-||Mozilla||5.0||Windows; U; Windows NT 5.1; rv:1.7.3|
|220.127.116.11||28/Jan/2017||26:36:16||+0800||GET||search.php||HTTP||1.1||400||1777||www.bing.com||-||Mozilla||4.0||compatible; MSIE 6.0; Windows NT 5.1|
Examples of splitting without delimiters
In some cases, there are no actual delimiters, and data fields may occur contiguously next to each other. In this case, the Split transform automatically detects patterns in the data to infer probably split points. For example, in the following scenario we want to separate the amount from the currency type, and Split automatically infers the boundary between the numeric and non-numeric data as the split point.
Splitting amount with currency symbol
In the following example, we would like to separate the weight values from the units of measure. Again the Split inference detects the meaningful boundary automatically and prefers it over other possible delimiters such as the "." character.
Splitting weights with units
The Split transform feature is based on the Predictive Program Synthesis technique. In this technique, data transformation programs are learned automatically based on the input data. The programs are synthesized in a domain-specific language. The DSL is based on delimiters and fields that occur in particular regular expression contexts. More information about this technology can be found in a recent publication on this topic.