SAMPLE Expression (U-SQL)

Summary

USQL supports samplers natively. The sample expression is needed to access some types of samplers.

Given an input rowset and arguments that are specific to the types of sampler being invoked, the sample expression outputs a rowset that is a statistical sample of the input. Moreover, the sample expression takes as input an optional identifier (column) that records the probability with which the corresponding row has been output.

All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).

Syntax

Sample_Expression_Complex :=                                                                             
     'SAMPLE' Rowset_Source Sampler_Details [Weight_Col]
Sampler_Details := 'UNIFORM' '(' row_fraction ')' | 'ON' Identifier_List 'UNIVERSE' '(' row_fraction ')' | 'ON' Identifier_List 'DISTINCT' '(' row_fraction ',' min_row_count ')'
Weight_Col := 'WITH' 'WEIGHT' 'AS' Identifier.

Remarks

  • UNIFORM
    Rows are picked uniformly at random with probability equal to row_fraction. The weight column, if requested, is set to 1/(row_fraction) for all rows. The size of the output is governed by a binomial distribution. In expectation, the size of output is row_fraction * SIZE(input rowset).

  • row_fraction
    A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.

  • Identifier_List
    This option specifies the list of columns that define the groups.

Syntax

  Identifier_List :=                                                                                  
       Quoted_or_Unquoted_Identifier                                               
       {',' Quoted_or_Unquoted_Identifier }.
  
  • UNIVERSE
    All rows are passed where the group value (i.e., the value of the columns in Identifier_List) is in some randomly chosen row_fraction of the space of group values. UNIVERSE uses a cryptographically strong hash function to pick a random portion of the values. The weight column, if requested, is set to 1/(row_fraction) for all rows.

    UNIVERSE ensures sample-then-join is equivalent to join-then-sample: Using UNIVERSE before an equijoin with Identifier_List, on both inputs of the equijoin, containing exactly the equijoin columns is identical to sampling after the join.

    The size of the output is also in expectation row_fraction*SIZE(input rowset). However, especially if there are too few groups, the output size has more variance than with UNIFORM since all rows from a group are passed by the sampler or not.

  • DISTINCT
    Per group (a distinct value of the columns in Identifier_List), this sampler passes min_row_count rows and the rest of the rows are passed with probability row_fraction. The weight column, if requested, is set to 1/(row_fraction) if the row is passed in the probabilistic mode and 1 otherwise. DISTINCT facilitates sample-before-groupby: Using DISTINCT before a groupby, with Identifier_List containing at least the group in the group-by, guarantees that the sample will not miss any groups.

    DISTINCT may return fewer than min_row_count rows (but never less than 1) for some groups. A simple case is when a group has fewer than min_row_count rows in the input. A more complex case occurs depending on the degree of parallelism of the stage that runs the sampler and how the rows corresponding to the group are distributed among the input partitions.

  • min_row_count
    A positive integer for DISTINCT.

Tip

Ensure your sample probability, row_fraction, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
  • The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable, @data.

Dataset

@data = 
    SELECT * FROM 
        ( VALUES
        ("02143",   30,     100,  "Smith"),
        ("98052",   25,     50,   "Andersen"),
        ("02139",   30,     75,   "Todd"),
        ("02139",   25,     60,   "Roberts"),
        ("61801",   23,     80,   "Sanders")
        ) AS T(ZipCode, Age, Salary, LastName);

UNIFORM

@result = 
    SAMPLE @data UNIFORM (0.4) WITH WEIGHT AS _Weight;

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/uniform.txt"
USING Outputters.Tsv();

Possible output below. Note that the output need not contain every ZipCode or all rows that have a given Age value. The likelihood of seeing this output is 0.03456.

ZipCode Age Salary LastName Weight
02143 30 100 Smith 2.5
98052 25 50 Andersen 2.5

UNIVERSE
Samples the @data rowset by picking some randomly chosen fraction of the overall value-space of the ON columns and returning all rows whose value of the ON columns belongs in the chosen space. The query implicitly picks 10% of all the values of the Age columns and returns all rows whose value of Age was randomly chosen.

@result = 
    SAMPLE @data ON Age UNIVERSE (0.1);

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/universe.txt"
USING Outputters.Tsv();

Possible output below. Note that all rows with Age=30 will be picked if any of the Age=30 rows are picked. Similar all rows with Age=25 will either be picked or not be picked (as in this example).

The likelihood of seeing this output is 0.081; this is one of the three outcomes that have the same probability after no-output which occurs with a likelihood of 0.729 since there are only three unique values for Age. Note: An empty result set may be returned in this example due to the low probability and small dataset.

ZipCode Age Salary LastName
02143 30 100 Smith
02139 30 75 Todd

DISTINCT
Samples the @data rowset such that at least 3 rows per distinct value of the ZipCode column are included, and additional rows are added with the indicated probability (10%). The weight of each passing row is added to the resulting rowset in the weight column _Weight.

@result = 
    SAMPLE @data ON ZipCode DISTINCT (0.1, 3) WITH WEIGHT AS _Weight;

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/distinct.txt"
USING Outputters.Tsv();

Possible output below. Note that every ZipCode is represented in the output by up to 3 rows, if as many rows are available in the input. The likelihood of seeing this output is 1.

ZipCode Age Salary LastName Weight
02143 30 100 Smith 1
98052 25 50 Andersen 1
02139 30 75 Todd 1
02139 25 60 Roberts 1
61801 23 80 Sanders 1

See Also