Non-Unique Clustered Index and Duplicate Value limits...
I spent this past week with a customer giving a workshop covering Sql 2000 architecture and performance related information (interpreting query plans, index structures, etc., etc.), and during the course of the week a few good points came to light that I decided to blog about. This first one has to do with the creation of non-unique clustered indexes...
Many of you probably already realize that if you don't explicitly create a unique clustered index in Sql Server, the engine will unique-ify the clustered index for you anyhow, since it's used for pointers to data in nonclustered indexes. The engine will append a 4-byte value (when necessary) to any non-unique cluster key value as it's inserted into the cluster to unique-ify the key value within the cluster...there are a variety of places that cover this, including Books Online, and this isn't the topic of my blog post, but it's the basis for it, so I needed to ensure we covered that.
Now, in the workshop I gave this last week, I mentioned how I had once upon a time tried to produce a situation where a Duplicate Key error is generated on a non-unique clustered index by simply creating a table with a single column, then clustering the table on this column (in a non-unique fashion), and then inserting a whole-lot-o-records with the same value. I mentioned in the workshop that I was unsuccessful at producing this type of situation, and attributed it to the fact that the 4-byte value was probably allowed to make use of any character, not just numeric values, which would increase the amount of duplicate key values that could be sustained on a single cluster exponentially compared to the case where the value could only be numeric data. Well, if any of you guys from the workshop are reading along, I WAS WRONG!!!! You can in fact produce this type of scenario, and it's even easier than I originally thought before mistakenly thinking the 4-byte value could include characters in addition to numeric values...
I started thinking about it more on the flight back from the customer, and then pulled up the script I used originally to try and produce the situation, and found that I had a logic error in my data insertion (nice one huh?)...in addition to this, I noticed on an internal DL that someone mentioned that the 4-byte value that is used to unique-ify a nonunique cluster key is in fact integer based, and doesn't include character data.
Well, this should make it easier to reproduce then, assuming I can get the script right anyhow :-). After I produced the situtation, I was surprised to find that not only does the uniqueifier contain integer data only, but only positive integer data at that...so, given that we know a 4-byte integer can only contain values between -2^31 and 2^31-1 (-2,147,483,648 thru 2,147,483,647), which simple math tells us results in about 4,294,967,295 distinct values. I figured originally that I'd be able to produce a Duplicate Key error as soon as I tried inserting more than 4,294,967,295 records with the same key value - however, the Duplicate Key error is actually produced once you try and insert the 2,147,483,648th record, which indicates that the unique-ifier is actually made up of postive 4-byte integer values only...intersting huh?
So, to produce the scenario I mention is pretty simple...just create yourself a table with a single column, then create a nonunique clustered index on the column, then start inserting data into the table using the same value continuously over and over and over and over and over until you get to just about 2,147,483,647 records...then, as soon as you try inserting the next one, you'll get an error as follows:
Cannot insert duplicate key row in object 'dbo.ztblDupClusterKeyTest' with unique index 'ix_ztblDupClusterKeyTest_clust'.
So, if you ever plan to have a scenario where you are deploying an application that will make use of a clustered table that contains more than 2,147,483,647 records per distinct cluster key value, you'll have to figure out an alternative design...this could include any of the following:
1) Add additional keys to the cluster key
2) Physically partition the data into seperate tables
NOTE: Using partitioning with Sql 2005 wouldn't solve this problem, since the cluster is table-wide
3) Use a heap
Of course, this really isn't that much of a problem, since you probably won't see this scenario due to a variety of reasons, and not to mention that if it was a problem, we'd have known about it long ago from customers :-). But, nonetheless, something interesting...
I've included the script I used to produce this situation below...a single run of the entire test took about 2.75 hours on my machine...by the time I was producing the error, my test table had 2,147,483,647 records and was about 45,000 MB in total size...
On a side note, if anyone out there is working with record counts where you think this may be an issue, I'd be curious to hear from you...
Chad Boyd ~~~ This posting is provided "AS IS" with no warranties, and confers no rights. Use of any included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm.
------------------ CODE ONLY BELOW ------------------
-- 292,754,944 million rows to start with on my machine...
select 1 as clustId
from sys.columns a
cross join sys.columns b
cross join sys.columns c;
-- Cluster now...
create clustered index ix_ztblDupClusterKeyTest_clust on dbo.ztblDupClusterKeyTest (clustId);
set nocount on;
-- Ok...now keep adding until we break...
declare @i int;
while 1=1 begin
-- about 10.17 million rows on my machine per iteration...
insert dbo.ztblDupClusterKeyTest (clustId)
from sys.columns a
cross join sys.columns b
cross join (select top 23 column_id from sys.columns) c;
print 'ERROR: ' + error_message();
-- Just to introduce a bit of a delay between iterations...