Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality
A few days ago I was in the middle of writing up a quick post about a query performance problem I recently ran into. The writeup referenced predicate selectivity, and I found myself wondering whether everyone who came across the post would have a good intuitive understanding of what that referred to. Just in case, I thought I'd do a quick overview of some terms that you'll definitely run across if you're doing much query tuning. You can find this information in other forms elsewhere on the Internet, but these are practical descriptions that I've found to be useful when introducing people to query tuning concepts.
Density, when used to describe the data in a column, is a measure of how often duplicate values occur in that column. Another way to think of density is as a measure of the uniqueness of the data in a column: high density --> less unique data. Density values range from 0 to 1.0. There are different (but equivalent) ways to think of density.
Density = 1/[# of distinct values in a column]
Density = Avg. number of duplicates for a given value / Total row count
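Both forms of the calculation can be sketched directly in T-SQL. (The Customers table and CountryID column here are the hypothetical ones from the example that follows, standing in for whatever table you're examining.)

```sql
-- Density as 1 / (number of distinct values in the column).
select 1.0 / count(distinct CountryID) as density
from Customers;

-- Equivalently: average number of duplicates per value / total row count.
select (count(*) * 1.0 / count(distinct CountryID)) / count(*) as density
from Customers;
```

Both queries should return the same value for the same data.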
Consider a Customers table with a CountryID column and a RegionID column. Suppose that 100 different CountryID values are present in the table, and 1000 different RegionID values are present. The density of the CountryID column would be higher than the density of the RegionID column because a typical country is less unique than a typical region. Note that this is an overall characterization of a range of values. There may be some regions that occur more frequently than some countries -- a U.S.-based company might have more customers in California than in Estonia, for example. But CountryID still has higher density than RegionID simply because the average country is represented in more rows than the average region.
Density of the CountryID column:
Density = 1/[# of distinct values in the column]
Density = 1/100 = 0.01
Note that the density of 0.01, or 1%, corresponds to the percentage of the total rows that would be returned by a query for a single value. In other words, an average country makes up 1% of the table. If there were 10,000 rows in the table, there would be 100 rows (1% of 10,000) per country, on average.
Density of the RegionID column:
Density = 1/[# of distinct values in a column]
Density = 1/1000 = 0.001
The density of the CountryID column (0.01) is higher than the density of the RegionID column (0.001) because the CountryID column is less unique.
Q: What is the density of a column that contains only a single value repeated in every row?
A: 1.0 -- the highest possible density.
Q: What is the density of a unique column, such as an IDENTITY or primary key column?
A: 1/[row count]
Here is a script to create the scenario described above.
create table customers (customerid int identity, country varchar(10), region varchar(10))
set nocount on
declare @x int
set @x = 1
while (@x <= 10000)
begin
    insert into customers (country, region)
    values ('Country' + convert (varchar, @x % 100), 'Region' + convert (varchar, @x % 1000))
    set @x = @x + 1
end
while @@trancount > 0 commit tran
create statistics stats_country on customers (country) with fullscan
create statistics stats_region on customers (region) with fullscan
create statistics stats_country_region on customers (country, region) with fullscan
dbcc show_statistics (customers, stats_country)
dbcc show_statistics (customers, stats_region)
Run this and check out the DBCC SHOW_STATISTICS output for the stats on the [country] and [region] columns. In the SHOW_STATISTICS output you'll see a value called "All density". This should match the values predicted in the calculations described above: Country and Region densities of 0.01 and 0.001, respectively. (TIP: you should generally use "All density" and disregard the "Density" value in SHOW_STATISTICS's first resultset -- the value in the "Density" column is generally not used by the query optimizer.) If the statistics are multi-column as in the [stats_country_region] statistics created in this example, an All density value is also maintained for combinations of columns; this can help the optimizer make more accurate predictions when a query includes a filter on more than one column.
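If you only want the density information, DBCC SHOW_STATISTICS accepts a WITH DENSITY_VECTOR option that returns just that resultset, and you can cross-check it against a direct calculation on the sample table:

```sql
-- Return only the density vector (the "All density" resultset)
dbcc show_statistics (customers, stats_country_region) with density_vector

-- Cross-check against the formula: should return 0.01 and 0.001
select 1.0 / count(distinct country) as country_density,
       1.0 / count(distinct region)  as region_density
from customers
```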
| Name | Updated | Rows | Rows Sampled | Steps | Density | Average Key Length | String Index | Filter Expression | Unfiltered Rows |
|------|---------|------|--------------|-------|---------|--------------------|--------------|-------------------|-----------------|
| stats_country | Jan 25 2011 10:14AM | 10000 | 10000 | 55 | 0.01 | 8.9 | YES | NULL | 10000 |

| All density | Average Length | Columns |
|-------------|----------------|---------|
| 0.01 | 8.9 | country |
A predicate is an expression that evaluates to True or False. In SQL you encounter predicates in joins and in WHERE or HAVING clauses, where they are used to either filter out or qualify rows. Here's a query with two filter predicates:
WHERE column1 > 20 AND column2 IS NULL
And a query with one join predicate:
INNER JOIN table2 ON table1.column1 = table2.column1
Selectivity is also a measure of uniqueness, of sorts. High selectivity implies high uniqueness, or a small number of matching values. Low selectivity implies low uniqueness, or a high percentage of matching values.
Selectivity is most commonly used to describe a predicate. Returning to the example used to describe density, consider a Customers table with a Region column. Suppose that most of the company's customers are in RegionA, and only a few are in RegionB. The predicate "Region='RegionB'" would be described as more selective because it returns a smaller percentage of the rows in the table. (As an aside, this kind of uneven distribution of data within a column is an example of data skew.) The estimated selectivity of a predicate is essential when the optimizer is evaluating possible plans. It will affect things like join order (all other things equal, the most selective join operation is generally performed first) and whether SQL chooses to do a table scan or an index seek followed by a bookmark lookup.
Selectivity for a filter predicate against a base table can be calculated as "[# rows that pass the predicate]/[# rows in the table]". If the predicate passes all rows in the table, its selectivity is 1.0. If it disqualifies all rows, its selectivity is 0. (This can be confusing. Note that 0.000001 reflects a high selectivity even though the number is small, while 1.0 is low selectivity even though the number is higher.)
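You can measure the actual selectivity of a filter predicate yourself by dividing the matching rowcount by the table's rowcount. Against the sample customers table built above:

```sql
-- Actual selectivity = [# rows that pass the predicate] / [# rows in the table].
-- For the sample data this returns 0.01 (100 of 10,000 rows).
select
    sum(case when country = 'Country11' then 1 else 0 end) * 1.0
        / count(*) as actual_selectivity
from customers
```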
The optimizer often uses the histogram steps in column statistics to estimate the selectivity of a predicate. Re-run the sample script above, and note the third and final resultset in the "DBCC SHOW_STATISTICS(customers, stats_country)" output. A single row in this resultset represents a step in a histogram. Each histogram step summarizes the density of all rows in the table that have a [country] value that is equal to or less than that row's RANGE_HI_KEY value, but greater than the RANGE_HI_KEY of the preceding histogram step. The snippet below shows the first few rows.

| RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS |
|--------------|------------|---------|---------------------|----------------|
| Country0 | 0 | 100 | 0 | 1 |
| Country1 | 0 | 100 | 0 | 1 |
| Country11 | 100 | 100 | 1 | 100 |
Take the third row in the histogram as an example. This step summarizes all rows with a country value greater than 'Country1' (the RANGE_HI_KEY for the preceding step), but less than or equal to 'Country11' (this step's RANGE_HI_KEY). There are 100 rows with exactly the value 'Country11' (EQ_ROWS), and 100 rows with some other value in this range (RANGE_ROWS). The DISTINCT_RANGE_ROWS value of 1 tells us that those 100 other RANGE_ROWS are made up of exactly one other distinct [country] value. (These happen to be the rows with a country value of 'Country10', although that information isn't present in the statistics.) I've often thought that the DISTINCT_RANGE_ROWS column was poorly-named; a better name might be "DISTINCT_RANGE_VALUES" because it reports a count of distinct values, not rows. The AVG_RANGE_ROWS column reports the average number of rows (100) for each of the (1) distinct values in the range. Note that by dividing RANGE_ROWS by DISTINCT_RANGE_ROWS, you calculate step density, a measure of the uniqueness of an average value within the range of values described by that histogram step.
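To look at just the histogram, DBCC SHOW_STATISTICS supports a WITH HISTOGRAM option, and you can verify the step-density arithmetic by hand:

```sql
-- Return only the histogram resultset for the country statistics
dbcc show_statistics (customers, stats_country) with histogram

-- Step density for the step described above:
-- RANGE_ROWS / DISTINCT_RANGE_ROWS = 100 / 1 = 100 rows per distinct value
```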
Some simple examples might help illustrate how the optimizer uses this data to estimate selectivity. For the query below, the optimizer's task is to guess the selectivity of a predicate like "country='Country11'". It will look for the "Country11" value in the histogram on the country column, and will find that this value happens to be one of the RANGE_HI_KEY values (histogram step endpoints). As a result, it can use EQ_ROWS -- the number of rows observed with this exact value. The estimated selectivity of the predicate [country]='Country11' is 100/10,000 (EQ_ROWS / [table rowcount]), or 0.01. The actual selectivity is also 0.01, meaning that the optimizer's estimated selectivity turned out to be accurate.
| | Actual Selectivity | Est. Selectivity |
|---|---|---|
| ... WHERE [country]='Country11' | 0.01 | 0.01 |
For the query below the search value 'Country10' is not a RANGE_HI_KEY -- instead, the value falls somewhere in the middle of a range of values that is summarized by a histogram step. This means that the optimizer will need to use step density, which implies an assumption that the search value is typical of the values in its histogram step. The estimated number of rows with a given value in this histogram step is AVG_RANGE_ROWS, which is 100 here. That means that the estimated and actual selectivity are both 0.01 (100 / 10,000).
| | Actual Selectivity | Est. Selectivity |
|---|---|---|
| ... WHERE [country]='Country10' | 0.01 | 0.01 |
Here's a more interesting example. Note that the optimizer's guessed selectivity is wrong: it thought that 100 rows would be returned (selectivity of 0.01 * 10,000 rows in the table), but in reality no rows passed the filter predicate. The search value 'Country10xyz' falls within the same histogram step used in the prior query, so the optimizer estimated the number of rows that would be returned using the exact same calculation as in the prior example. This isn't a bug; it's just a reflection of the fact that statistics can only provide an imperfect summary of the data in a table.
| | Actual Selectivity | Est. Selectivity |
|---|---|---|
| ... WHERE [country]='Country10xyz' | 0 | 0.01 |
The information exposed by DBCC SHOW_STATISTICS can give you a better understanding of why the optimizer thinks that a particular plan will be cheaper than other plans. This is just a quick overview; for a much more detailed description of SHOW_STATISTICS output and SQL Server statistics in general, check out the whitepaper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2008" @ http://msdn.microsoft.com/en-us/library/dd535534(v=sql.100).aspx. (If you haven't ever read this paper, it's a very good read.)
All of this brings us finally to cardinality, which is kind of "where the rubber hits the road" when you are trying to understand why a query plan is slow.
For our purposes, cardinality can be thought of as the number of rows returned by a query operator. (A few examples of physical query operators that you will probably recognize: Index Seek, Nested Loops Join, Filter.) The cardinality of an operator like a Filter is determined by multiplying the selectivity of that operator -- that is, the % of its input rows that it will pass -- by the cardinality (rowcount) of the operator's child. If an operator receives 500 rows from its child operator and has a selectivity of 0.3 (30%), it has a cardinality of 500*0.3, or 150 rows.
Each operator in a query plan has an estimated cardinality (the number of rows the optimizer guessed that the operator would return) and an actual cardinality (the number of rows that the operator returned in the real world). You can see both by running a query with "SET STATISTICS PROFILE ON".
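For example, with the sample table created earlier you can compare the optimizer's guesses against reality by looking at the Rows (actual) and EstimateRows (estimated) columns in the output:

```sql
-- Show actual vs. estimated cardinality for each operator in the plan
set statistics profile on
select customerid from customers where country = 'Country11'
set statistics profile off
```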
| Rows | Executes | StmtText | EstimateRows | EstimateExecutions |
|------|----------|----------|--------------|--------------------|
| | | SELECT c1 FROM t1 WHERE c1 = 'xyz' | | |
| 2210 | 1 | \|--Filter([t1].[c1] = 'xyz') | 21.0 | 1.0 |
Sometimes the query optimizer cannot accurately predict the number of rows that a given operator will return. This can prevent SQL from estimating the cost of a query plan correctly, which can lead to the selection of a suboptimal plan. Cardinality estimation errors are one of the most common causes of slow query plans in SQL Server, so it is very important to know how to identify cardinality estimation problems in a query plan. (That's a little beyond the point of this post, though -- here I'm just defining some terms.) In the plan above, you can see that SQL made a cardinality estimation error: it guessed that the filter would be very selective and would only pass 21 of 2971 rows, when in reality almost all of the rows passed the filter.
The optimizer has a number of ways to estimate cardinality, none of which are completely foolproof.
- If the predicate is simple like "column=123" and if the search value happens to be a histogram endpoint (RANGE_HI_KEY), then EQ_ROWS can be used for a very accurate estimate.
- If the search value happens to fall between two step endpoints, then the average density of values in that particular histogram step is used to estimate predicate selectivity and operator cardinality.
- If the specific search value is not known at compile time, the next best option is to use average column density ("All density"), which can be used to calculate the number of rows that will match an average value in the column.
- In some cases none of the above are possible, and the optimizer has to resort to a "magic number"-based estimate. For example, it might make a totally blind guess that 10% of the rows will be returned, where the 10% value is hardcoded in the optimizer's code rather than being derived from statistics.
In most scenarios option #1 is the ideal, but because most values aren't histogram endpoints, option #2 is the most frequently used method for simple filter and join predicates. Option #4 usually provides the worst estimates and is only used when there is no other alternative. Thankfully, most of the time when a magic number is used it is a side effect of poor query design; by rewriting the query you can enable a more accurate cardinality estimate based on either average column density or a histogram lookup.
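As one illustration of the difference between options #2 and #3: a literal search value lets the optimizer look up the value in the histogram at compile time, while a local variable's value is unknown at compile time, so the optimizer has to fall back on average column density. (This sketch reuses the sample customers table from earlier.)

```sql
-- Literal value: the optimizer can look up 'Country11' in the histogram
select * from customers where country = 'Country11'

-- Local variable: the value is unknown when the plan is compiled, so the
-- estimate comes from average column density ("All density") instead
declare @c varchar(10)
set @c = 'Country11'
select * from customers where country = @c
```

For this particular data the two estimates happen to agree (every country has 100 rows), but with skewed data the density-based estimate can be badly wrong for an atypical value.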