# SAMPLE Expression (U-SQL)

## Summary

USQL supports samplers natively. The sample expression is needed to access some types of samplers.

Given an input rowset and arguments that are specific to the types of sampler being invoked, the sample expression outputs a rowset that is a statistical sample of the input. Moreover, the sample expression takes as input an optional identifier (column) that records the probability with which the corresponding row has been output.

All samplers execute in one pass on data and in parallel on portions of the input; they do not require partitioning, shuffle etc. Samplers have been implemented in a single pass over data and have very small memory footprint; log( SIZE(input), SIZE(output) ).

## Syntax

Sample_Expression_Complex := 'SAMPLE' Rowset_Source Sampler_Details [Weight_Col]

Sampler_Details := 'UNIFORM' '(' row_fraction ')' | 'ON' Identifier_List 'UNIVERSE' '(' row_fraction ')' | 'ON' Identifier_List 'DISTINCT' '(' row_fraction ',' min_row_count ')'

Weight_Col := 'WITH' 'WEIGHT' 'AS' Identifier.

## Remarks

`UNIFORM`

Rows are picked uniformly at random with probability equal to`row_fraction`

. The weight column, if requested, is set to 1/(`row_fraction`

) for all rows. The size of the output is governed by a binomial distribution. In expectation, the size of output is`row_fraction`

* SIZE(input rowset).`row_fraction`

A double between 0 and 1 that indicates the probability with which a row in the input rowset will be passed.`Identifier_List`

This option specifies the list of columns that define the groups.

#### Syntax

Identifier_List := Quoted_or_Unquoted_Identifier {',' Quoted_or_Unquoted_Identifier }.

`UNIVERSE`

All rows are passed where the group value (i.e., the value of the columns in`Identifier_List`

) is in some randomly chosen`row_fraction`

of the space of group values.`UNIVERSE`

uses a cryptographically strong hash function to pick a random portion of the values. The weight column, if requested, is set to 1/(`row_fraction`

) for all rows.`UNIVERSE`

ensures sample-then-join is equivalent to join-then-sample: Using`UNIVERSE`

before an equijoin with`Identifier_List`

, on both inputs of the equijoin, containing exactly the equijoin columns is identical to sampling after the join.The size of the output is also in expectation

`row_fraction`

*SIZE(input rowset). However, especially if there are too few groups, the output size has more variance than with`UNIFORM`

since all rows from a group are passed by the sampler or not.`DISTINCT`

Per group (a distinct value of the columns in`Identifier_List`

), this sampler passes`min_row_count`

rows and the rest of the rows are passed with probability`row_fraction`

. The weight column, if requested, is set to 1/(`row_fraction`

) if the row is passed in the probabilistic mode and 1 otherwise.`DISTINCT`

facilitates sample-before-groupby: Using`DISTINCT`

before a groupby, with`Identifier_List`

containing at least the group in the group-by, guarantees that the sample will not miss any groups.`DISTINCT`

may return fewer than`min_row_count`

rows (but never less than 1) for some groups. A simple case is when a group has fewer than`min_row_count`

rows in the input. A more complex case occurs depending on the degree of parallelism of the stage that runs the sampler and how the rows corresponding to the group are distributed among the input partitions.

Tip

Ensure your sample probability, `row_fraction`

, is adequate for the size of your dataset to minimize the possibility of an empty result set being returned.

### Examples

- The examples can be executed in Visual Studio with the Azure Data Lake Tools plug-in.
- The scripts can be executed locally. An Azure subscription and Azure Data Lake Analytics account is not needed when executed locally.
- The examples below are based on the dataset defined below. Ensure your execution includes the rowset variable,
`@data`

.

**Dataset**

```
@data =
SELECT * FROM
( VALUES
("02143", 30, 100, "Smith"),
("98052", 25, 50, "Andersen"),
("02139", 30, 75, "Todd"),
("02139", 25, 60, "Roberts"),
("61801", 23, 80, "Sanders")
) AS T(ZipCode, Age, Salary, LastName);
```

**UNIFORM**

```
@result =
SAMPLE @data UNIFORM (0.4) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/uniform.txt"
USING Outputters.Tsv();
```

Possible output below. Note that the output need not contain every `ZipCode`

or all rows that have a given `Age`

value. The likelihood of seeing this output is 0.03456.

ZipCode | Age | Salary | LastName | Weight |
---|---|---|---|---|

02143 | 30 | 100 | Smith | 2.5 |

98052 | 25 | 50 | Andersen | 2.5 |

**UNIVERSE**

Samples the `@data`

rowset by picking some randomly chosen fraction of the overall value-space of the ON columns and returning all rows whose value of the ON columns belongs in the chosen space. The query implicitly picks 10% of all the values of the `Age`

columns and returns all rows whose value of `Age`

was randomly chosen.

```
@result =
SAMPLE @data ON Age UNIVERSE (0.1);
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/universe.txt"
USING Outputters.Tsv();
```

Possible output below. Note that all rows with `Age`

=30 will be picked if any of the `Age`

=30 rows are picked. Similar all rows with `Age`

=25 will either be picked or not be picked (as in this example).

The likelihood of seeing this output is 0.081; this is one of the three outcomes that have the same probability after no-output which occurs with a likelihood of 0.729 since there are only three unique values for `Age`

. Note: An empty result set may be returned in this example due to the low probability and small dataset.

ZipCode | Age | Salary | LastName |
---|---|---|---|

02143 | 30 | 100 | Smith |

02139 | 30 | 75 | Todd |

**DISTINCT**

Samples the `@data`

rowset such that at least 3 rows per distinct value of the `ZipCode`

column are included, and additional rows are added with the indicated probability (10%). The weight of each passing row is added to the resulting rowset in the weight column `_Weight`

.

```
@result =
SAMPLE @data ON ZipCode DISTINCT (0.1, 3) WITH WEIGHT AS _Weight;
OUTPUT @result
TO "/ReferenceGuide/QSE/Sample/distinct.txt"
USING Outputters.Tsv();
```

Possible output below. Note that every `ZipCode`

is represented in the output by up to 3 rows, if as many rows are available in the input. The likelihood of seeing this output is 1.

ZipCode | Age | Salary | LastName | Weight |
---|---|---|---|---|

02143 | 30 | 100 | Smith | 1 |

98052 | 25 | 50 | Andersen | 1 |

02139 | 30 | 75 | Todd | 1 |

02139 | 25 | 60 | Roberts | 1 |

61801 | 23 | 80 | Sanders | 1 |