U-SQL SELECT Selecting from Joins

Article
03/10/2017

Summary

U-SQL provides the following ways of joining two rowsets:

The syntax follows the ANSI SQL join syntax. The comma separated implicit cross join syntax with a join predicate in a WHERE (U-SQL) clause is not supported in U-SQL. Either use an explicit CROSS JOIN or preferably use an INNER or OUTER JOIN instead.

Syntax

Join_Expression :=                                                                                       
     Rowset_Source Cross_Join_Operator Rowset_Source  
|    Rowset_Source Join_Operator Rowset_Source Join_On_Clause.

Remarks

Rowset_Source
Specifies the rowsets that are being joined. For more details on the rowset source see U-SQL SELECT FROM Clause.
Cross_Join_Operator
The cross join operator joins the two rowsets by combining each row on the left side with each row on the right side.

Syntax

  Cross_Join_Operator :=                                                                              
       'CROSS' [Parallel_Join_Hint] 'JOIN'.

It allows the optional specification of a join hint. See Join Hints for more details.

Join_Operator
Specifies the join operation.

Syntax

  Join_Operator :=                                                                                    
       [Join_Prefix] Join_Hints 'JOIN'
  |    ['LEFT' | 'RIGHT'] Join_Hints Semijoin_Operator.

  Join_Prefix :=
       'INNER'
  |    [['LEFT' | 'RIGHT' | 'FULL'] 'OUTER'].

  Semijoin_Operator :=
        'SEMIJOIN' | 'ANTISEMIJOIN'.

The join operators can either be a variety of joins or semijoins.

[Join_Prefix] 'JOIN'
Specifies the type of join. JOIN defaults to INNER JOIN, OUTER JOIN defaults to LEFT OUTER JOIN. See the sections on joins for more details on each of the join expressions.
['LEFT' | 'RIGHT'] Semijoin_Operator
Specifies the type of semijoin. SEMIJOIN and ANTISEMIJOIN both default to LEFT SEMIJOIN and LEFT ANTISEMIJOIN respectively. See the sections on semijoins for more details on each of the semijoin expressions.
Join_Hints
The optional specification of join hints. See Join Hints for more details.

It is highly recommended to always fully specify the join operator.

Join_On_Clause
Specifies the join comparison that is being used to identify which of the rows from each rowset will be joined.

Syntax

  Join_On_Clause :=                                                                                   
       'ON' Join_Comparison_Expression.

Join_Comparison_Expression
Provides the join comparison. See Join Comparisons for more details.

Join Comparisons

U-SQL, like most scaled out Big Data Query languages that support joins, restricts the join comparison to equality comparisons between columns in the rowsets to be joined. If the column references are not unique across all the rowsets in the query, they have to be fully qualified with the rowset variable name, rowset name or specified alias.

Join Syntax

Join_Comparison_Expression :=                                                                            
     Join_Equality {'AND' Join_Equality}.

Join_Equality :=  
     Column_Identifier '==' Column_Identifier.

Column_Identifier :=  
     [(Rowset_Variable | Quoted_or_Unquoted_Identifier) '.']  
     Quoted_or_Unquoted_Identifier.

There are however ways to still apply other join predicates. If one has a non-equality comparison or a more complex expression (such as a method invocation) in the comparison, one can move the comparison to the SELECT's WHERE clause. Or the more complex expression can be placed in an earlier SELECT statement's column and then that alias can be referred to in the join comparison.

Tip

U-SQL decided to limit the ON clause to simple equality only and not perform the described rewrites to make it clear to the user that the more complex join predicates will negatively impact performance of the joins, since they will hinder the use of indices and other join optimizations.

Examples

The examples are designed for execution in Visual Studio with the Azure Data Lake Tools plug-in.
The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.

The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable.

  @employees = 
      SELECT * FROM 
          ( VALUES
          (1, "Noah",   100, (int?)10000, new DateTime(2012,05,31)),
          (2, "Sophia", 100, (int?)15000, new DateTime(2012,03,19)),
          (3, "Liam",   100, (int?)30000, new DateTime(2014,09,14)),
          (4, "Amy",    100, (int?)35000, new DateTime(1999,02,27)),
          (5, "Justin", 100, (int?)15000, new DateTime(2015,01,12)),
          (6, "Emma",   200, (int?)8000,  new DateTime(2014,03,08)),
          (7, "Jacob",  200, (int?)8000,  new DateTime(2014,09,02)),
          (8, "Olivia", 200, (int?)8000,  new DateTime(2013,12,11)),
          (9, "Mason",  300, (int?)50000, new DateTime(2016,01,01)),
          (10, "Ava",   400, (int?)15000, new DateTime(2014,09,14)),
          (11, "Ethan", 400, (int?)9000,  new DateTime(2015,08,22)),
          (12, "David", 800, (int?)100,   new DateTime(2016,11,01)),
          (13, "Andrew", 100, (int?)null, new DateTime(1995,07,16)),
          (14, "Jennie", 100, (int?)34000, new DateTime(2000,02,12))
          ) AS T(EmpID, EmpName, DeptID, Salary, StartDate);

A. How to write U-SQL joins with equality and in-equality comparisons
The query first identifies a subset of data, DeptID and DeptMaxSalary, and stores it in a new rowset variable. Then both rowset variables are joined on DeptID to identify all employees that have a salary that is lower than the highest salary in his/her dept.

@employees2 =
    SELECT DeptID,
           MAX(Salary) AS DeptMaxSalary
    FROM @employees
    GROUP BY DeptID;

@result =
    SELECT a.EmpName, a.DeptID, a.Salary, b.DeptMaxSalary
    FROM @employees AS a
         JOIN
             @employees2 AS b
         ON a.DeptID == b.DeptID
    WHERE a.Salary < b.DeptMaxSalary;

OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleA.csv"
USING Outputters.Csv();

B. How to write U-SQL joins with non-equijoins.
Returns each employee, his/her salary, and the number of employees with a higher salary

@result =
    SELECT a.EmpName,
           a.Salary,
           COUNT( * ) - 1 AS NumberOfEmployeesWithHigherSalary
    FROM @employees AS a
         CROSS JOIN @employees AS b
    WHERE a.Salary <= b.Salary
    GROUP BY a.EmpName, a.Salary;

OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleB.csv"
ORDER BY Salary DESC
USING Outputters.Csv();

C. How to write U-SQL joins where join comparison includes more complex expressions
All employees that have a salary that is equal to the adjusted highest salary in his/her dept.

@employees2 =
    SELECT DeptID,
           MAX(Salary)-1000 AS DeptAdjustedMaxSalary
    FROM @employees
    GROUP BY DeptID;

@result =
    SELECT a.EmpName, a.DeptID, a.Salary, b.DeptAdjustedMaxSalary
    FROM @employees AS a
         JOIN
             @employees2 AS b
         ON a.DeptID == b.DeptID AND a.Salary == b.DeptAdjustedMaxSalary;

OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleC.csv"
USING Outputters.Csv();

D. How to write U-SQL joins where join comparison includes more complex expressions, additional example.
This query identifies departments that have more than three employees. The expression is then joined back to the rowset variable to identify the employees from departments which have more than three employees.

@result =
    SELECT a.DeptID,
           a.EmpName
    FROM @employees AS a
         JOIN
         (
         SELECT DeptID
         FROM @employees
         GROUP BY DeptID
         HAVING COUNT( * ) > 3
         ) AS b
         ON a.DeptID == b.DeptID;

OUTPUT @result
TO "/Output/ReferenceGuide/Joins/exampleD.csv"
USING Outputters.Csv();

U-SQL SELECT Selecting from Joins

Summary

Syntax

Remarks

Syntax

Syntax

Syntax

Join Comparisons

Join Syntax

Examples

See Also

Additional resources