Fragmentation (part 1): What are records?
This blogging thing sucks you in, doesn't it? Not content with having an ongoing series on disaster recovery and CHECKDB (with another 6 and 25 more posts planned respectively), I'm starting a new series on fragmentation. This will begin from first principles and work up, in approximately 18 posts over the next few months. The first few posts could be skipped by some people and will cover:
- What are records?
- What are pages?
- What are extents?
- What is a heap?
- What is a clustered index?
- What is a non-clustered index?
Bear with me as I build up the terminology. You could also check out Kalen's excellent book Inside SQL Server 2000 for details on these topics. (Her upcoming volume on the Storage Engine for Inside SQL Server 2005 should be out sometime this summer - buy it!).
The idea for this series came to mind last Friday at TechEd, when I spent about 3 hours repeating a deck on fragmentation I gave to the North Texas SSUG in April 2004 - people kept wandering by and stopping to listen. If that many people are interested in this stuff, it should make good blog material, especially as it seems to be frequently misunderstood. The level of sophistication ranged from not having any idea what fragmentation is, up to defragging or rebuilding all indexes every night. I didn't see anyone there who knew just when it's worth removing fragmentation, so I decided to explain here. The first 6 posts will also be useful as background for the CHECKDB internals series I'm doing.
So, what are records? At the simplest level, a record is the physical storage associated with a table or index row. Of course, it gets much more complicated than that...
Data records

Data records store table rows, either in a heap or in the leaf level of a clustered index, and are stored on data pages. They hold all the columns of the table row that will fit in a single 8KB page.
If any columns are LOB data types (text, ntext, image, and the new LOB types in SQL Server 2005 - varchar(max), nvarchar(max), varbinary(max), XML), then the data record stores a pointer that points to a text record on a different page (the root of a loose tree that stores the LOB value). The exception is when the schema has been set to store LOB columns 'in-row' when possible: if a LOB value is small enough to fit within the confines of the data page that holds the data record, it is stored in the data record itself. This is a performance benefit, as selecting the LOB column does not require an extra IO to read the text record.
In SQL Server 2005, non-LOB variable-length columns (e.g. varchar, sql_variant) may also be stored 'off-row' as part of the new capability (called row-overflow) of having table rows longer than 8060 bytes. In this case the storage format is the same as for LOB values - a pointer in the data record pointing to a text record.
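The in-row versus off-row decision described above can be sketched as follows. This is a deliberately simplified model in Python - the names, the pointer size, and the logic are all illustrative, not the storage engine's actual implementation:

```python
# Simplified model of the in-row vs. off-row decision described above.
# All names and sizes here are illustrative, not the real engine's.

PAGE_SIZE = 8192          # SQL Server pages are 8KB
LOB_POINTER_SIZE = 16     # an in-record pointer to a text record (illustrative)

def store_lob_value(value: bytes, free_space_on_page: int, text_in_row: bool):
    """Return ('in-row', value) if the value can live in the data record,
    or ('off-row', pointer_placeholder) if it must go to a text record."""
    if text_in_row and len(value) <= free_space_on_page:
        # Small enough to fit in the data record itself - saves an extra
        # IO when the column is selected.
        return ("in-row", value)
    # Otherwise the record stores a pointer to the root of the loose tree
    # of text records holding the value on other pages.
    return ("off-row", b"\x00" * LOB_POINTER_SIZE)
```

The same shape applies to row-overflow: a variable-length value that no longer fits in the record is replaced by a pointer to a text record.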
I'll go into how the columns are laid out differently between heaps and clustered indexes in parts 4 and 5. There are also some twists in the record types for heaps that I'll cover in part 4.
Index records

There are two types of index records (which differ only in what columns they store):
- Those that store non-clustered index rows in the leaf level of a non-clustered index
- Those that make up the non-leaf levels of the b-trees of clustered and non-clustered indexes
I'll explain more about the differences between these in parts 5 and 6 - it can be quite complicated (especially the differences between SQL Server 2000 and 2005) and is worth doing in separate posts.
Index records are stored on index pages.
Index records typically do not contain all the column values in a table (although some do - these indexes are called covering indexes).
In SQL Server 2005, non-clustered index records can include LOB values as included columns (with the storage details exactly the same as for data records) and also can have row-overflow data that is pushed off-row (again, in exactly the same way as for data records).
Text records

- There are various types of text records that make up the tree structure storing LOB values; they are stored on two types of text page.
- An in-depth explanation of the various types and how they fit together is beyond the scope of this post.
- Text records are also used to store variable-length column values that have been pushed out of data or index records as part of the row-overflow capability.
Ghost records

- These are records that have been logically deleted, but not yet physically deleted, from the leaf level of an index.
- The reasons for this are complicated, but basically having ghost records simplifies key-range locking and transaction rollback.
- The record is marked with a bit indicating that it's a ghost record, and it cannot be physically deleted until the transaction that caused it to be ghosted commits. Once that happens, it is either removed by an asynchronous background process (called the ghost-cleanup task) or converted back into a real record by an insert with the exact same set of keys.
- Ghost records will be mentioned later in the series when I discuss page compaction.
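The ghost-record lifecycle above can be sketched as a toy model. All the names here are invented for illustration, and the real mechanism is tied into key-range locking, logging, and transaction management, which this sketch omits entirely:

```python
# Toy model of the ghost-record lifecycle described above.
# Names are invented for illustration; commit tracking, locking, and
# logging - which drive the real mechanism - are omitted.

class LeafPage:
    def __init__(self):
        self.records = {}   # key -> {"value": ..., "ghost": bool}

    def delete(self, key):
        # Logical delete: mark the record as a ghost rather than removing it.
        self.records[key]["ghost"] = True

    def insert(self, key, value):
        rec = self.records.get(key)
        if rec is not None and rec["ghost"]:
            # An insert with the exact same key converts the ghost
            # back into a real record.
            rec["value"], rec["ghost"] = value, False
        else:
            self.records[key] = {"value": value, "ghost": False}

    def ghost_cleanup(self):
        # Background task: physically remove ghosts (in reality, only
        # once the deleting transaction has committed).
        self.records = {k: r for k, r in self.records.items()
                        if not r["ghost"]}
```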
Other record types
- There are also records that are used to store various allocation bitmaps, intermediate results of sort operations, and file and database metadata (e.g. in the per-file fileheader page and database boot page).
- These are not relevant to a discussion of fragmentation so won't be considered further here.
All records have the same structure, regardless of their type and use, but the number and type of columns will be different. For instance, a data record from a table with a complex schema may have hundreds of columns of various types whereas an allocation bitmap record will have a single column, filling up the whole page.
The record structure isn't relevant to a discussion on fragmentation but is for CHECKDB internals, so here it is:
- record header
- 4 bytes long
- two bytes of record metadata (record type)
- two bytes pointing forward in the record to the NULL bitmap
- fixed-length portion of the record, containing the columns storing data types that have fixed lengths (e.g. bigint, char(10), datetime)
- NULL bitmap
- two bytes for count of columns in the record
- variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different from, and simpler than, SQL Server 2000, which had one bit per nullable column only)
- this allows an optimization when reading columns that are NULL
- variable-length column offset array
- two bytes for the count of variable-length columns
- two bytes per variable-length column, giving the offset to the end of the column value (consecutive end offsets also give each value's start and length)
- versioning tag
- this is in SQL Server 2005 only
- this is a 14-byte structure that contains a timestamp plus a pointer into the version store in tempdb
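To make the layout concrete, here's a sketch that parses a record laid out as described above (Python, SQL Server 2005-style NULL bitmap with one bit per column). It follows the structure in the list, but real records have more subtleties - forwarding records, the versioning tag, and so on - that are omitted:

```python
import struct

# Sketch of the record layout described above. Simplified - real records
# have more subtleties (forwarding records, versioning tags, etc.).

def parse_record(data: bytes):
    # Record header: 4 bytes - two bytes of record metadata, then a
    # 2-byte offset pointing forward to the NULL bitmap.
    record_type, null_bitmap_offset = struct.unpack_from("<HH", data, 0)

    # Fixed-length column data lives between the header and the bitmap.
    fixed_data = data[4:null_bitmap_offset]

    # NULL bitmap: 2-byte count of columns, then one bit per column.
    (column_count,) = struct.unpack_from("<H", data, null_bitmap_offset)
    bitmap_bytes = (column_count + 7) // 8
    bitmap = data[null_bitmap_offset + 2 : null_bitmap_offset + 2 + bitmap_bytes]
    nulls = [bool(bitmap[i // 8] >> (i % 8) & 1) for i in range(column_count)]

    # Variable-length column offset array: 2-byte count of variable-length
    # columns, then a 2-byte offset per column marking where each value
    # ends (equivalently, where the next one begins).
    var_offset = null_bitmap_offset + 2 + bitmap_bytes
    (var_count,) = struct.unpack_from("<H", data, var_offset)
    ends = struct.unpack_from("<%dH" % var_count, data, var_offset + 2)
    start = var_offset + 2 + 2 * var_count
    var_values = []
    for end in ends:
        var_values.append(data[start:end])
        start = end
    return record_type, fixed_data, nulls, var_values
```

For example, a record with 8 bytes of fixed-length data, three columns (the second NULL), and two variable-length values parses back into exactly those pieces.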
If you have any questions on this stuff - put them in the comments or drop me an email.
Next time - what are pages?