percentile(), percentiles()

Returns an estimate for the specified nearest-rank percentile of the population defined by Expr. The accuracy depends on the density of population in the region of the percentile.

  • Can be used only in context of aggregation inside summarize

percentiles() is like percentile(), but calculates a number of percentile values (which is faster than calculating each percentile individually).

percentilesw() is like percentilew(), but calculates a number of percentile values (which is faster than calculating each percentile individually).

percentilew() and percentilesw() allows calculating weighted percentiles. Weighted percentiles calculate the given percentiles in a "weighted" way - threating each value as if it was repeated Weight times in the input.

Syntax

summarize percentile(Expr, Percentile)

summarize percentiles(Expr, Percentile1 [, Percentile2])

summarize percentiles_array(Expr, Percentile1 [, Percentile2])

summarize percentiles_array(Expr, Dynamic array)

summarize percentilew(Expr, WeightExpr, Percentile)

summarize percentilesw(Expr, WeightExpr, Percentile1 [, Percentile2])

summarize percentilesw_array(Expr, WeightExpr, Percentile1 [, Percentile2])

summarize percentilesw_array(Expr, WeightExpr, Dynamic array)

Arguments

  • Expr: Expression that will be used for aggregation calculation.
  • WeightExpr: Expression that will be used as the weight of values for aggregation calculation.
  • Percentile is a double constant that specifies the percentile.
  • Dynamic array: list of percentiles in a dynamic array of integer or floating point numbers

Returns

Returns an estimate for Expr of the specified percentiles in the group.

Examples

The value of Duration that is larger than 95% of the sample set and smaller than 5% of the sample set:

CallDetailRecords | summarize percentile(Duration, 95) by continent

Simultaneously calculate 5, 50 (median) and 95:

CallDetailRecords 
| summarize percentiles(Duration, 5, 50, 95) by continent

alt text

The results show that in Europe, 5% of calls are shorter than 11.55s, 50% of calls are shorter than 3 minutes 18.46 seconds, and 95% of calls are shorter than 40 minutes 48 seconds.

Calculate multiple statistics:

CallDetailRecords 
| summarize percentiles(Duration, 5, 50, 95), avg(Duration)

Weighted percentiles

Assume you measure the time (Duration) it takes an action to complete again and again, and then instead of recording each and every value of the measurement, you "condense" them by recording each value of Duration rounded to 100 msec and how many times that rounded value appeared (BucketSize).

You can use summarize percentilesw(Duration, BucketSize, ...) to calculates the given percentiles in a "weighted" way - treating each value of Duration as if it was repeated BucketSize times in the input, without actually needing to materialize those records.

Consider the following following example. A customer has a set of latency values in milliseconds: { 1, 1, 2, 2, 2, 5, 7, 7, 12, 12, 15, 15, 15, 18, 21, 22, 26, 35 }.

In order to reduce bandwidth and storage, the customer performs pre-aggregation to the following buckets: { 10, 20, 30, 40, 50, 100 }, and counts the number of events in each bucket, which gives the following Kusto table:

alt text

Which can be read as:

  • 8 events in the 10ms bucket (corresponding to subset { 1, 1, 2, 2, 2, 5, 7, 7 })
  • 6 events in the 20ms bucket (corresponding to subset { 12, 12, 15, 15, 15, 18 })
  • 3 events in the 30ms bucket (corresponding to subset { 21, 22, 26 })
  • 1 event in the 40ms bucket (corresponding to subset { 35 })

At this point, the original data is no longer available to us, and all we have is the number of events in each bucket. In order to compute percentiles from this data, we can use the percentilesw() function. For example, for the 50, 75 and 99.9 percentiles, we'll use the following query:

Table
| summarize percentilesw(LatencyBucket, ReqCount, 50, 75, 99.9) 

The result for the above query is:

alt text

Notice, that the above query corresponds to the function percentiles(LatencyBucket, 50, 75, 99.9) if the data was expended to the following form:

alt text

Getting multiple percentiles in an array

Multiple percentiles percentiles can be obtained as an array in a single dynamic column instead of multiple columns:

CallDetailRecords 
| summarize percentiles_array(Duration, 5, 25, 50, 75, 95), avg(Duration)

alt text

Similarily weighted percentiles can be returned as a dynamic array using percentilesw_array

Percentiles for percentiles_array and percentilesw_array can be specified in a dynamic array of integer or floating-point numbers. The array must be constant but does not have to be literal.

CallDetailRecords 
| summarize percentiles_array(Duration, dynamic([5, 25, 50, 75, 95])), avg(Duration)
CallDetailRecords 
| summarize percentiles_array(Duration, range(0, 100, 5)), avg(Duration)

Nearest-rank percentile

P-th percentile (0 < P <= 100) of a list of ordered values (sorted from least to greatest) is the smallest value in the list such that P percent of the data is less or equal to that value (from Wikipedia article on percentiles)

We also define 0-th percentiles to be the smallest member of the population.

Note

  • Given the approximating nature of the calculation the actual returned value may not be a member of the population
  • Nearest-rank definition means that P=50 does not conform to the interpolative definition of the median. When evaluating the significance of this discrepancy for the specific application the size of the population and an estimation error should be taken into account.

Estimation error in percentiles

The percentiles aggregate provides an approximate value using T-Digest.

A few important points:

  • The bounds on the estimation error vary with the value of the requested percentile. The best accuracy is at the ends of [0..100] scale, percentiles 0 and 100 are the exact minimum and maximum values of the distribution. The accuracy gradually decreases towards the middle of the scale. It is worst at the median and is capped at 1%.
  • Error bounds are observed on the rank, not on the value. Suppose percentile(X, 50) returned value of Xm. The estimation guarantees that at least 49% and at most 51% of the values of X are less or equal to Xm. There is no theoretical limit on the difference between Xm and actual median value of X.
  • The estimation may sometimes result in a precise value but there are no reliable conditions to define when it will be the case