# SAMPLE (U-SQL)

## Summary

U-SQL supports samplers natively. The `SAMPLE`

clause enables convenient access from within a `SELECT`

expression. All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).

## Syntax

Sample_Expression_Simple := Query_Expression 'SAMPLE' ( 'ANY' '(' row_count ')' | 'UNIFORM' '(' row_fraction ')' ).

## Remarks

`ANY`

*Any*`row_count`

rows from the input rowset.`ANY`

does not return a randomly chosen subset of rows; do not use where randomness is needed.`UNIFORM`

Rows are picked uniformly at random with probability equal to`row_fraction`

. The size of the output is governed by a binomial distribution. In expectation, the size of output is`row_fraction`

* SIZE(input rowset).`row_fraction`

A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.

Tip

Ensure your sample probability, `row_fraction`

, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.

### Examples

- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable,
`@data`

.

**Dataset**

```
@data =
SELECT * FROM
( VALUES
("02143", 30, 100, "Smith"),
("98052", 25, 50, "Andersen"),
("02139", 30, 75, "Todd"),
("02139", 25, 60, "Roberts"),
("61801", 23, 80, "Sanders")
) AS T(ZipCode, Age, Salary, LastName);
```

**ANY**

Samples the `@data`

rowset by selecting the specified number of rows (2) in a completely arbitrary way.

```
@result =
SELECT * FROM @data SAMPLE ANY (2);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/any.txt"
USING Outputters.Tsv();
```

Possible output below.

ZipCode | Age | Salary | LastName |
---|---|---|---|

02143 | 30 | 100 | Smith |

98052 | 25 | 50 | Andersen |

**UNIFORM**

Samples the `@data`

rowset using a random uniform sampling with the provided probability (10%).

```
@result =
SELECT * FROM @data SAMPLE UNIFORM (0.1);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/simple_uniform.txt"
USING Outputters.Tsv();
```

Possible output below. The likelihood of this output is 0.00729 (two rows are picked with a probability of 0.1 each and three rows not picked with a probability of 0.9 each). Note: An empty result set may be returned in this example due to the low probability and small dataset.

ZipCode | Age | Salary | LastName |
---|---|---|---|

02143 | 30 | 100 | Smith |

98052 | 25 | 50 | Andersen |