SAMPLE (U-SQL)

Summary

U-SQL supports samplers natively. The SAMPLE clause enables convenient access from within a SELECT expression. All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).

Syntax

Sample_Expression_Simple :=                                                                              
     Query_Expression 'SAMPLE' ( 'ANY' '(' row_count ')' | 'UNIFORM' '(' row_fraction ')' ).

Remarks

  • ANY
    Any row_count rows from the input rowset. ANY does not return a randomly chosen subset of rows; do not use where randomness is needed.

  • row_count
    A positive integer.

  • UNIFORM
    Rows are picked uniformly at random with probability equal to row_fraction. The size of the output is governed by a binomial distribution. In expectation, the size of output is row_fraction * SIZE(input rowset).

  • row_fraction
    A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.

Tip

Ensure your sample probability, row_fraction, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.

Examples

  • The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
  • The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
  • The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable, @data.

Dataset

@data = 
    SELECT * FROM 
        ( VALUES
        ("02143",   30,     100,  "Smith"),
        ("98052",   25,     50,   "Andersen"),
        ("02139",   30,     75,   "Todd"),
        ("02139",   25,     60,   "Roberts"),
        ("61801",   23,     80,   "Sanders")
        ) AS T(ZipCode, Age, Salary, LastName);

ANY
Samples the @data rowset by selecting the specified number of rows (2) in a completely arbitrary way.

@result = 
    SELECT * FROM @data SAMPLE ANY (2);

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/any.txt"
USING Outputters.Tsv();

Possible output below.

ZipCode Age Salary LastName
02143 30 100 Smith
98052 25 50 Andersen

UNIFORM
Samples the @data rowset using a random uniform sampling with the provided probability (10%).

@result =
    SELECT * FROM @data SAMPLE UNIFORM (0.1);

OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/simple_uniform.txt"
USING Outputters.Tsv();

Possible output below. The likelihood of this output is 0.00729 (two rows are picked with a probability of 0.1 each and three rows not picked with a probability of 0.9 each). Note: An empty result set may be returned in this example due to the low probability and small dataset.

ZipCode Age Salary LastName
02143 30 100 Smith
98052 25 50 Andersen

See Also