dcount() (aggregation function)

Returns an estimate for the number of distinct values that are taken by a scalar expression in the summary group.

Note

The dcount() aggregation function is primarily useful for estimating the cardinality of huge sets. It trades accuracy for performance, so its result is approximate and may vary between executions. The order of the inputs can also affect the output.

Syntax

... | summarize dcount(Expr[, Accuracy]) ...

Arguments

  • Expr: A scalar expression whose distinct values are to be counted.
  • Accuracy: An optional int literal that defines the requested estimation accuracy. See Estimation accuracy for the supported values. If unspecified, the default value of 1 is used.
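As a minimal sketch of passing both arguments, the query below requests accuracy level 2. The inline datatable is hypothetical data standing in for a real table; its column names simply mirror the PageViewLog example on this page.

```kusto
// Hypothetical inline data; not part of any real table.
datatable(continent: string, country: string)
[
    "Europe", "France",
    "Europe", "Spain",
    "Europe", "France",
    "Asia",   "Japan"
]
// Second argument is the optional Accuracy literal (2 here).
| summarize countries = dcount(country, 2) by continent
```

Omitting the second argument falls back to the default accuracy level of 1.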

Returns

Returns an estimate of the number of distinct values of Expr in the group.

Example

PageViewLog | summarize countries=dcount(country) by continent


To get an exact count of distinct values of V grouped by G, use:

T | summarize by V, G | summarize count() by G

This calculation requires a large amount of internal memory, because the number of distinct values of V is multiplied by the number of distinct values of G. It may result in memory errors or long execution times. dcount() provides a fast and reliable alternative:

T | summarize dcount(V) by G
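One way to judge how close the estimate is to the exact count is to compute both per group and compare them. The sketch below does this on synthetic data; the table T, the columns G and V, and the derived column names are all assumptions made for illustration.

```kusto
// Synthetic data: 1,000,000 rows in 10 groups, each with 100,000 distinct values of V.
let T = range i from 1 to 1000000 step 1
    | extend G = i % 10, V = i / 10;
T
| summarize estimate = dcount(V) by G
| join kind=inner (
    // Exact distinct count, using the two-step summarize shown above.
    T
    | summarize by V, G
    | summarize exact = count() by G
  ) on G
| project G, estimate, exact, error_pct = 100.0 * (estimate - exact) / exact
```

Because the estimate is stochastic, the error_pct values are not fixed and may differ between runs.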

Estimation accuracy

The dcount() aggregation function uses a variant of the HyperLogLog (HLL) algorithm, which produces a stochastic estimate of set cardinality. The algorithm provides a "knob" that balances accuracy against execution time and memory size:

Accuracy   Error (%)   Entry count
0          1.6         2^12
1          0.8         2^14
2          0.4         2^16
3          0.28        2^17
4          0.2         2^18

Note

The "entry count" column is the number of 1-byte counters in the HLL implementation.
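The error figures appear consistent with the textbook HyperLogLog result that the relative standard error is roughly 1.04 / sqrt(m) for m counters. The sketch below reads the entry counts as powers of two (2^12 through 2^18) and assumes the variant used by dcount() follows that textbook formula; this page does not confirm either point, and the column names are hypothetical.

```kusto
// Exponent is the assumed power of two behind each entry count.
datatable(Accuracy: int, Exponent: int)
[
    0, 12,
    1, 14,
    2, 16,
    3, 17,
    4, 18
]
| extend EntryCount = pow(2.0, Exponent)
// Textbook HLL relative standard error, as a percentage (an assumption for this variant).
| extend ApproxErrorPct = round(100 * 1.04 / sqrt(EntryCount), 2)
// One byte per counter, so the memory used by the counters is EntryCount bytes.
| extend CounterKB = EntryCount / 1024
```

Under these assumptions, each step up in accuracy buys a smaller error at the cost of more counter memory.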

The algorithm includes some provisions for returning a perfect count (zero error) if the set cardinality is small enough:

  • When the accuracy level is 1, the count is exact for sets of up to 1000 values
  • When the accuracy level is 2, the count is exact for sets of up to 8000 values
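A quick way to probe this provision is to run dcount() over a small set whose size is known. The sketch below uses 500 distinct values, which is under the 1000-value figure quoted for accuracy level 1; the column names are hypothetical.

```kusto
// 500 rows, each with a unique value of x.
range x from 1 to 500 step 1
// count() gives the true number of distinct values here, since every x is unique.
| summarize estimate = dcount(x, 1), exact = count()
```

If the provision applies as described, the two columns should agree for a set this small.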

The error bound is probabilistic, not a theoretical bound. The value in the table is the standard deviation (sigma) of the error distribution, and 99.7% of estimates will have a relative error below 3 × sigma.
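The 3 × sigma rule can be worked out directly from the error figures listed on this page. The sketch below tabulates the result for each accuracy level; the column names are hypothetical.

```kusto
// SigmaPct holds the error percentages listed for each accuracy level.
datatable(Accuracy: int, SigmaPct: real)
[
    0, 1.6,
    1, 0.8,
    2, 0.4,
    3, 0.28,
    4, 0.2
]
// Relative error that 99.7% of estimates are expected to stay under.
| extend ThreeSigmaPct = 3 * SigmaPct
```

For the default accuracy level of 1, for example, this gives 3 × 0.8% = 2.4%.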

The following image shows the probability distribution function of the relative estimation error, in percentages, for all supported accuracy settings:

[Figure: HLL error distribution]