Share via


PERCENTILE_DISC (U-SQL)

Summary

The PERCENTILE_DISC analytic function computes a specific percentile for sorted values in the specified window based on a discrete distribution of the column values. The result is equal to a specific value in the column.

For a given percentile value P, PERCENTILE_DISC returns the value with the smallest CUME_DIST value for the same sorted list of values that is greater than or equal to P.

For example, PERCENTILE_DISC(0.5) will compute the 50th percentile (that is, the median) of an expression.

PERCENTILE_DISC can only be used in the context of a windowing expression.

Syntax

PERCENTILE_DISC_Expression :=                                                                            
     'PERCENTILE_DISC' '(' double_literal ')'  
     'WITHIN' 'GROUP' '(' Simple_Order_By_Clause ')'.

Remarks

  • double_literal
    The percentile to compute. The value must range between 0.0 and 1.0 and be of type double.

  • WITHIN GROUP ( Simple_Order_By_Clause )
    Specifies a list of numeric values to sort and compute the percentile over.

Syntax

  Simple_Order_By_Clause :=                                                                           
       'ORDER' 'BY' Sort_Item_Expression.
Sort_Item_Expression := expression [Sort_Direction].
Sort_Direction := 'ASC' | 'DESC'.

Only one ORDER BY expression is allowed. The type of the expression has to be comparable. The default sort order is ascending.

Return Type

The return type is the type of the ORDER BY expression.

Usage in Windowing Expression

This analytic function can be used in a windowing expression with the following restrictions:

  • Only the PARTITION BY clause can be specified with the OVER operator.
  • The ORDER BY clause in the OVER operator is not allowed.
  • The ROWS clause in the OVER operator is not allowed.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.

  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.

  • The example below is based on the dataset defined below. Ensure your execution includes the rowset variable.

      @employees = 
          SELECT * FROM 
              ( VALUES
              (1, "Noah",   "Engineering", 100, 10000),
              (2, "Sophia", "Engineering", 100, 20000),
              (3, "Liam",   "Engineering", 100, 30000),
              (4, "Amy",    "Engineering", 100, 35000),
              (5, "Emma",   "HR",          200, 8000),
              (6, "Jacob",  "HR",          200, 8000),
              (7, "Olivia", "HR",          200, 8000),
              (8, "Mason",  "Executive",   300, 50000),
              (9, "Ava",    "Marketing",   400, 15000),
              (10, "Ethan", "Marketing",   400, 9000) 
              ) AS T(EmpID, EmpName, DeptName, DeptID, Salary);
    

Using PERCENTILE_DISC
The following example uses PERCENTILE_DISC to find the median employee salary in each department.

@result =
    SELECT DISTINCT DeptName,
                    PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY Salary) OVER(PARTITION BY DeptName) AS MedianDisc
    FROM @employees;

OUTPUT @result
TO "/Output/ReferenceGuide/percentile_disc/example.csv"
USING Outputters.Csv();

See Also