U-SQL supports samplers natively. The
SAMPLE clause enables convenient access from within a
SELECT expression. All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).
Sample_Expression_Simple := Query_Expression 'SAMPLE' ( 'ANY' '(' row_count ')' | 'UNIFORM' '(' row_fraction ')' ).
Rows are picked uniformly at random with probability equal to
row_fraction. The size of the output is governed by a binomial distribution. In expectation, the size of output is
row_fraction* SIZE(input rowset).
Ensure your sample probability,
row_fraction, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.
- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable,
@data = SELECT * FROM ( VALUES ("02143", 30, 100, "Smith"), ("98052", 25, 50, "Andersen"), ("02139", 30, 75, "Todd"), ("02139", 25, 60, "Roberts"), ("61801", 23, 80, "Sanders") ) AS T(ZipCode, Age, Salary, LastName);
@data rowset by selecting the specified number of rows (2) in a completely arbitrary way.
@result = SELECT * FROM @data SAMPLE ANY (2); OUTPUT @result TO "/ReferenceGuide/QSE/Sample/any.txt" USING Outputters.Tsv();
Possible output below.
@data rowset using a random uniform sampling with the provided probability (10%).
@result = SELECT * FROM @data SAMPLE UNIFORM (0.1); OUTPUT @result TO "/ReferenceGuide/QSE/Sample/simple_uniform.txt" USING Outputters.Tsv();
Possible output below. The likelihood of this output is 0.00729 (two rows are picked with a probability of 0.1 each and three rows not picked with a probability of 0.9 each). Note: An empty result set may be returned in this example due to the low probability and small dataset.