Apply SQL Transformation
Runs a SQLite query on input datasets to transform the data
Category: Data Transformation / Manipulation
Applies to: Machine Learning Studio
This content pertains only to Studio. Similar drag and drop modules have been added to the visual interface in Machine Learning service. Learn more in this article comparing the two versions.
This article describes how to use the Apply SQL Transformation module in Azure Machine Learning Studio, to specify a SQL query on an input dataset or datasets.
SQL is handy when you need to modify your data in complex ways, or persist the data for use in other environments. For example, using the Apply SQL Transformation module, you can:
Create tables for results and save the datasets in a portable database.
Perform custom transformations on data types, or create aggregates.
Execute SQL query statements to filter or alter data and return the query results as a data table.
What is SQLite?
SQLite is a public domain relational database management system that is contained in a C programming library. SQLite is a popular choice as an embedded database for local storage in web browsers.
SQLite was originally designed in 2000 for the U.S. Navy, to support serverless transactions. It is a self-contained database engine that has no management system and hence requires no configuration or administration.
How to configure Apply SQL Transformation
The module can take up to three datasets as inputs. When you reference the datasets connected to each input port, you must use the names
t3. The table number indicates the index of the input port.
The remaining parameter is a SQL query, which uses the SQLite syntax. This module supports all standard statements of the SQLite syntax. For a list of unsupported statements, see the Technical Notes section.
General syntax and usage
When typing multiple lines in the SQL Script text box, use a semi-colon to terminate each statement. Otherwise, line breaks are converted to spaces.
For example, the following statements are equivalent:
SELECT * from t1;
SELECT * from t1;
You can add comments by using either
--at the beginning of each line, or by enclosing text using
For example, this statement is valid:
SELECT * from t1 /*WHERE ItemID BETWEEN 1 AND 100*/;
If a column name duplicates the name of a reserved keyword, syntax highlighting is applied to the text inside the SQL Script text box. To avoid confusion, you should enclose column names with square brackets (to follow the Transact-SQL convention) or backticks or double quotation marks (the ANSI SQL convention).
For example, in the following query on the Blood Donation dataset, Time is a valid column name but is also a reserved keyword.
SELECT Recency, Frequency, Monetary, Time, Class FROM t1 WHERE Time between 3 and 20;
If you run the query as is, the query might return the correct results, but depending on the dataset, it might return an error. Here are some examples of how to avoid the issue:
-- Transact-SQL SELECT [Recency], [Frequency], [Monetary], [Time], [Class] FROM t1 WHERE [Time] between 3 and 20; -- ANSI SQL SELECT "Recency", "Frequency", "Monetary", "Time", "Class" FROM t1 WHERE `Time` between 3 and 20;
Syntax highlighting remains on the keyword even after it is enclosed in quotes or brackets.
SQLite is case insensitive, except for a few commands that have case-sensitive variants with different meanings (GLOB vs. glob).
SELECT statement, column names that include spaces or other characters prohibited in identifiers must be enclosed in double quotation marks, square brackets, or backtick characters (`).
For example, this query references the Two-Class Iris dataset on
t1, but one column name contains a prohibited character, so the column name is enclosed in quotation marks.
SELECT class, "sepal-length" FROM t1;
You can add a
WHERE clause to filter values in the dataset.
SELECT class, "sepal-length" FROM t1 WHERE "sepal-length" >5.0;
The SQLite syntax does not support the
TOP keyword, which is used in Transact-SQL. Instead, you can use the
LIMIT keyword, or a
For example, compare these queries on the Bike Rental dataset.
-- unsupported in SQLite SELECT TOP 100 [dteday] FROM t1 ; ORDER BY [dteday] DESC; -- Returns top 100 SELECT [dteday] FROM t1 LIMIT 100 ; ORDER BY [dteday] DESC; -- Returns top 100. Note that FETCH is on a new line. SELECT [dteday] FROM t1 - ; FETCH FIRST 100 rows ONLY; ORDER BY [dteday] DESC;
The following examples use the Restaurant Ratings dataset on the input port corresponding to
t1, and the Restaurant Features dataset on the input port corresponding to
The following statement joins the two tables to create a dataset that combines the specified restaurant features with average ratings for each restaurant.
SELECT DISTINCT(t2.placeid), t2.name, t2.city, t2.state, t2.price, t2.alcohol, AVG(rating) AS 'AvgRating' FROM t1 JOIN t2 ON t1.placeID = t2.placeID GROUP BY t2.placeid;
This section provides basic examples of some common SQL aggregate functions, using SQLite.
Aggregate functions currently supported are:
The following query returns a dataset containing the restaurant ID, along with the average rating for the restaurant.
SELECT DISTINCT placeid, AVG(rating) AS ‘AvgRating’, FROM t1 GROUP BY placeid
Working with strings
SQLite supports the double pipe operator for concatenating strings.
The following statement creates a new column by concatenating two text columns.
SELECT placeID, name, (city || '-' || state) AS 'Target Region', FROM t1
The Transact-SQL string concatenation operator is not supported: + (String Concatenation). For example, the expression
('city + '-' + state) AS 'Target Region'in the example query would return 0 for all values.
However, even though the operator is not supported for this data type, no error is raised in Azure Machine Learning. Be sure to verify the results of Apply SQL Transformation before using the resulting dataset in an experiment.
COALESCE and CASE
COALESCE evaluates multiple arguments, in order, and returns the value of the first expression that does not evaluate to NULL.
For example, this query on the Steel Annealing Multi-Class dataset returns the first non-null flag from a list of columns assumed to have mutually exclusive values. If no flag is found, the string “none” is returned.
SELECT classes, family, [product-type], COALESCE(bt,bc,bf,[bw/me],bl, "none") AS TemperType FROM t1;
CASE statement is useful for testing values and returning a new value based on the evaluated results. SQLite supports the following syntax for
CASE WHEN [condition] THEN [expression] ELSE [expression] END
CASE [expression] WHEN [value] THEN [expression] ELSE [expression] END
For example, suppose you had previously used the Convert to Indicator Values module to create a set feature columns containing true-false values. The following query collapses the values in multiple feature columns into a single multivalued column.
SELECT userID, [smoker-0], [smoker-1], CASE WHEN [smoker-0]= '1' THEN 'smoker' WHEN [smoker-1]= '1' THEN 'nonsmoker' ELSE 'unknown' END AS newLabel FROM t1;
For an example of how this module might be used in machine learning experiments, see this sample in the Azure AI Gallery:
- Apply SQL Transformation: Uses the Restaurant Ratings, Restaurant Features, and Restaurant Customers dataset to illustrate simple joins, select statements, and aggregate functions.
This section contains implementation details, tips, and answers to frequently asked questions.
An input is always required on port 1.
If the input dataset has column names, the columns in the output dataset will use the column names from the input dataset.
If the input dataset does not have column names, the column names in the table are automatically created by using the following naming convention: T1COL1, T1COL2, T1COL3, and so on, where the numbers indicate the index of each column in the input dataset.
For column identifiers that contain a space or other special characters, always enclose the column identifier in square brackets or double quotation marks when referring to the column in the
Although SQLite supports much of the ANSI SQL standard, it does not include many features supported by commercial relational database systems. For more information, see SQL as Understood by SQLite. Also, be aware of the following restrictions when creating SQL statements:
SQLite uses dynamic typing for values, rather than assigning a type to a column as in most relational database systems. It is weakly typed, and allows implicit type conversion.
LEFT OUTER JOINis implemented, but not
RIGHT OUTER JOINor
FULL OUTER JOIN.
You can use
ADD COLUMNstatements with the
ALTER TABLEcommand, but other clauses are not supported, including
ALTER COLUMN, and
You can create a VIEW within SQLite, but thereafter views are read-only. You cannot execute a
UPDATEstatement on a view. However, you can create a trigger that fires on an attempt to
UPDATEon a view and perform other operations in the body of the trigger.
In addition to the list of non-supported functions provided on the official SQLite site, the following wiki provides a list of other unsupported features: SQLite - Unsupported SQL
|Table1||Data Table||Input dataset1|
|Table2||Data Table||Input dataset2|
|Table3||Data Table||Input dataset3|
|SQL Query Script||any||StreamReader||SQL query statement|
|Results dataset||Data Table||Output dataset|
|Error 0001||An exception occurs if one or more specified columns of the dataset couldn't be found.|
|Error 0003||An exception occurs if one or more of the input datasets is null or empty.|
|Error 0069||SQL logic error or missing database|
For a list of errors specific to Studio modules, see Machine Learning Error codes.
For a list of API exceptions, see Machine Learning REST API Error Codes.