Approximate Count Distinct enters Public Preview in Azure SQL Database

Kevin Farlee Principal Program Manager, SQL Database

We are pleased to announce the public preview of Approximate Count Distinct, the first of the Approximate Query Processing features.  Approximate Query Processing is a new family of features which are designed to provide aggregations across very large data sets where responsiveness is more critical than absolute precision.  An example might be calculating a COUNT(DISTINCT()) across 10 billion rows, for display on a dashboard.  In this case, absolute precision is not important, but responsiveness is critical.

The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group.

This function is designed for use in big data scenarios and is optimized for the following conditions:

  • Access of data sets that are millions of rows or higher AND

  • Aggregation of a column or columns that have a large number of distinct values

 

Assuming these conditions, the accuracy will be within 2% of the precise result for a majority of workloads.

Furthermore, in no case should APPROX_COUNT_DISTINCT be more than 20% off from the precise COUNT DISTINCT equivalent.

Advantages

APPROX_COUNT_DISTINCT uses far less memory to calculate the distinct count than an exhaustive precise COUNT DISTINCT would.  For this reason it is far more likely that the calculation can be done in memory without spilling to disk, even when there are billions of rows in the dataset.  In cases where COUNT DISTINCT would run out of memory and spill data to TempDB which causes very large performance penalties, APPROX_COUNT_DISTINCT would typically not spill.

Partly as a consequence of not spilling large datasets to TempDB, and partly as a consequence of the algorithms internally, for very large datasets APPROX_COUNT_DISTINCT will execute far faster than COUNT DISTINCT.

Questions and answers

Q. How much will APPROX_COUNT_DISTINCT speed up my queries?

A. It depends on the data set. If the exact count distinct query will fit entirely in memory, any speedup will be modest. If on the other hand the exact query runs out of memory and needs to spill data to TempDB, the speedup will be significant.

Q. Will APPROX_COUNT_DISTINCT speed up all queries over the precise COUNT(DISTINCT()) form?

A. No. If the exact query fits entirely in memory, APPROX_COUNT_DISTINCT will not speed the query up much, if at all.

Q. What is the worst-case error range I should expect if I’m not in the majority of workloads that have an error rate of under 2%?

A. With a data set in the target size that has also a large number of distinct values, the error should not be more than 20% of the precise value.

Q. How does APPROX_COUNT_DISTINCT affect the query plan?

A. The first query plan shown below is from a simple query using traditional COUNT(DISTINCT()). The second query plan shown below is using APPROX_COUNT_DISTINCT(). You can see that there is an extra sort operator in the Count Distinct plan, and that a Hash Match is replaced by a Stream Aggregate in the APPROX_COUNT_DISTINCT case.  Further, note that although both plans start with the same clustered index scan, in the traditional case it only counts for 95% of the time, where in the APPROX_COUNT_DISTINCT case, 98% of the total time is taken in the scan.  This means that there is more processing work outside of the scan for the traditional Count Distinct.

 

 

[caption id="attachment_7285" align="alignnone" width="879"] Original Query Plan[/caption]

 

[caption id="attachment_7295" align="alignnone" width="879"] APPROX_COUNT_DISTINCT Query Plan[/caption]

Also, if you look at the Query Plan XML, you'll see ScalarOperators containing the string APPROX_COUNT_DISTINCT, like:

 <ScalarOperator ScalarString="APPROX_COUNT_DISTINCT_CONVERT([globalagg1004])">