EXCEPT Expression (U-SQL)

Article
10/17/2017

Summary

EXCEPT returns the rows from the left query expression that are not in the right query expression, thus subtracting the right rowset from the left.

Syntax

Except_Expression :=                                                                                     
    Query_Expression 'EXCEPT' [Set_Operator_Option] [By_Name] 
    Query_Expression.

Set_Operator_Option :=  
    'DISTINCT'  
|   'ALL'.

By_Name :=
    'BY' 'NAME' ['ON' '(' (Identifier_List [',' '*'] | '*') ')'.

Remarks

Query_Expression
Specifies the two input rowsets. The basic rules for combining the result sets of the two query expressions are:
- The number and the order of the columns must be the same in all queries.
- The data types must be compatible and comparable.
If the number of columns do not match, an error is raised. If the columns are out of order, an error may be raised if the column data types are incompatible. Otherwise an unexpected result may be returned. If the column data types of the two query expressions are incompatible, an error is raised.

Note

The column names do not have to be the same between the two rowsets. The column names of the first rowset is chosen for the result.
Set_Operator_Option
The optional set operator option indicates the handling of duplicate result rows:
DISTINCT
Removes duplicate rows from the result (default if the option is left away).
ALL
Preserves duplicate rows in the result.

Tip

Unless duplicate elimination is required, specifying ALL leads to more efficient execution.
By_Name
The optional BY NAME clause indicates that the exception is matching up values not based on position but by name of the columns. If the BY NAME clause is not specified, the matching is done positionally.

If there is no ON clause, the counts of columns on the two sides must be the same and all columns on the left side must have matching columns with the same name on the right side. The schema of the result is the same as the schema of the left argument, i.e., the left argument determines the name and order of the output columns.

If there is an ON clause, it specifies the set of matching columns on both sides. There must be no duplicates in the list or an error is raised. All matching columns must be present in both rowset arguments or an error is raised. Moreover, the specified set of matching columns must be exactly the set of all columns in the two rowsets that have matching names. If there is an extra match that is not listed in the ON clause, an error is raised unless a “*” is specified in the list. The two rowsets may have other, non-matching columns, that are ignored by the set operator. The resulting schema is composed only of the matching columns in the order they are present in the left argument.

If the ON clause includes the “*” symbol (it may be specified as the last or the only member of the list), then extra name matches beyond those in the ON clause are allowed, and the result’s columns include all matching columns in the order they are present in the left argument.

Examples

The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
The examples below are based on the datasets defined below. Ensure your execution includes the rowset variables.

Dataset

@left = 
    SELECT * FROM 
        ( VALUES
        (1,	"Smith", 20),
        (1,	"Smith", 20),
        (1,	"Smith", 20),
        (2,	"Brown", 30),
        (3,	"Case", 40),
        (4, (string)null, 45),
        (5, (string)null, 50)
        ) AS T(id, name, age);

@right = 
    SELECT * FROM 
        ( VALUES
        (1, 20000,	"Smith"),
        (1, 20000,	"Smith"),
        (2, 30000,	"Brown"),
        (2, 30000,	"Brown"),
        (4, 50000,	"Dey"),
        (4, 50000,	"Dey"),
        (5, 60000, (string)null)
        ) AS T(id, salary, name);

Return distinct values by position
All distinct id, name values from @left that do not exist in @right.

@result =    
    SELECT id, name FROM @left
    EXCEPT DISTINCT    // Using DISTINCT is optional as it is the default value
    SELECT id, name FROM @right;

OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/except_distinct_position.txt" 
USING Outputters.Csv();

Return distinct values by name
All distinct id, name values from @left that do not exist in @right.

@result = 
    SELECT * FROM @left
    EXCEPT DISTINCT BY NAME ON (*)
    SELECT * FROM @right;  

OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/except_distinct_name.txt" 
USING Outputters.Csv();

Return values with duplicates by position
All id, name values, including duplicates, from @left that do not exist in @right.

@result =    
    SELECT id, name FROM @left
    EXCEPT ALL   // ALL preserves duplicates
    SELECT id, name FROM @right;

OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/except_all_position.txt" 
USING Outputters.Csv();

Return values with duplicates by name
All id, name values, including duplicates, from @left that do not exist in @right.

@result =    
    SELECT * FROM @left
    EXCEPT ALL BY NAME ON (id, *)   
    SELECT * FROM @right;

OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/except_all_name.txt" 
USING Outputters.Csv();

EXCEPT with ORDER BY and FETCH
The ORDER BY clause with FETCH allows the selection of a limited number of rows based on the specified order.

// Data sets
@Product = 
    SELECT * FROM 
        ( VALUES
        (1, "Adjustable Race", "AR-5381", (string)null),
        (2, "Bearing Ball", "BA-8327", (string)null),
        (3, "BB Ball Bearing", "BE-2349", (string)null),
        (4, "Headset Ball Bearings", "BE-2908", (string)null),
        (316, "Blade", "BL-2036", (string)null),
        (317, "LL Crankarm", "CA-5965", "Black"),
        (318, "ML Crankarm", "CA-6738", "Black"),
        (319, "HL Crankarm", "CA-7457", "Black"),
        (320, "Chainring Bolts", "CB-2903", "Silver"),
        (324, "Chain Stays", "CS-2812", (string)null)
        ) AS T(ProductID, Name, ProductNumber, Color);

@WorkOrder = 
    SELECT * FROM 
        ( VALUES
        (88, 3, 4600, new DateTime(2017, 2, 14)),
        (89, 324, 1148, new DateTime(2018, 6, 14)),
        (129, 3, 40, new DateTime(2017, 6, 15)),
        (130, 324, 8, new DateTime(2018, 6, 15)),
        (142, 316, 8, new DateTime(2017, 3, 15)),
        (170, 3, 50, new DateTime(2017, 6, 16)),
        (184, 316, 10, new DateTime(2017, 6, 16)),
        (201, 3, 20, new DateTime(2017, 8, 17)),
        (213, 316, 4, new DateTime(2017, 6, 17)),
        (312, 2, 30, new DateTime(2016, 7, 20))
        ) AS T(WorkOrderID, ProductID, OrderQty, DueDate);


// All products without orders
@result =    
    SELECT ProductID FROM @Product
    EXCEPT ALL 
    SELECT ProductID FROM @WorkOrder;

OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/Products_wo_Orders.txt" 
USING Outputters.Csv();


// Pull top 5 most recent work orders, thus excluding older orders
@WorkOrder_subset = 
    SELECT * FROM @WorkOrder ORDER BY DueDate DESC FETCH 5 ROWS;

// All products without orders from `@WorkOrder_subset`.
@result =    
    SELECT ProductID FROM @Product 
    EXCEPT ALL
    SELECT ProductID FROM @WorkOrder_subset;
    
// Note that ProductID 2 is now included in the output
OUTPUT @result 
TO "/ReferenceGuide/QSE/Set/EXCEPT/Products_wo_Orders_subset.txt" 
USING Outputters.Csv();