How It Works: FileStream (RsFx) Garbage Collection
In a previous blog I outlined how file stream transactions work and retain the before and after images of the file to allow various forms of recovery. Reading that blog should be considered a prerequisite: http://blogs.msdn.com/b/psssql/archive/2008/01/15/how-it-works-file-stream-the-before-and-after-image-of-a-file.aspx
I have had several questions lately about how the files get cleaned up as it is possible to build up a large set of files in the file stream container (directory) because of the transactional activity.
File Streams uses a garbage collection process to clean up files that are not longer needed. A system task wakes up periodically (~10 seconds or so) and checks for file stream garbage collection needs.
If you execute the following example it shows the behavior. The table has a single row but the update runs a 100 times and builds up a set of files.
create database dbFS
on primary(name='FSData', filename = 'c:\temp\FS\FSData.mdf'),
filegroup RSFX CONTAINS FILESTREAM DEFAULT(name='Rsfx', filename='c:\temp\FS\RsFx')
log on(name='FSLog', filename='c:\temp\fs\FSData.ldf')
set nocount on
ALTER DATABASE dbFS set recovery full
backup database dbFS to disk = 'c:\temp\DelMe.bak' with init
CREATE TABLE RsFx
[Id] [uniqueidentifier] ROWGUIDCOL NOT NULL UNIQUE,
[Text] VARBINARY(MAX) FILESTREAM NULL
insert into RsFx values(NEWID(), 0x10101010)
update RsFx set Text = 0x20202020
The file names are the LSN values associated with the transaction log records. Since I have not backed up the log the files remain.
In the file stream container directory tree is the $FSLOG directory with a set of files as well. The file names have embedded information and are considered part of the database log chain. Part of the file name is an LSN that matches the LSN file name of the actual file storage. The File stream garbage collector uses these files to help determine the cleanup requirements.
DANGER: Deleting a file directly from the file stream container is considered database corruption and dbcc checkdb will report corruption errors. These files are linked to the database in a transactional way and the deletion of a file is the same as a damaged page within the database.
File streams GC runs on a periodic basis. It checks several requirements before deleting the physical file.
- What is the truncation point in the log? File stream GC can only remove files in the truncated portion of the log. Beyond the truncation point could mean the file is needed for backup, rollback, replication, … other data recovery situations.
- Oldest open transaction - this will help determine log truncation point
- Last checkpoint - this will help determine log truncation point
- Looks for valid recovery capabilities - backups have been taken, etc… The GC activity always ensures the file remains until it is clearly safe to remove the file.
The easiest way to kick start the File Stream GC activity is to backup the log and issue a checkpoint in the database. (Making sure you don't have any long running transactions.)
log dbFS to disk = 'c:\temp\DelMe.bak' with init
Shortly after taking completing this action file stream garbage collection will cleanup some files. This may not be all the files. It honors the log truncation point (sys.databases - reuse reason). It also may limit the number of files collected during one iteration of the system task. So you may see files collected, a slight delay, and then more files collected.
Why don't all the files get collected? - Even after this you may see you have a few files remaining but only a single row in the database. You issue the backup log and checkpoint several times but the files are not getting deleted. This is because of how SQL Server truncates the log. The truncation point can also account for things like the VLFs. Read http://support.microsoft.com/kb/907511 for more details on how the log file activity could impact the file stream GC behavior.
Using the Dedicated Admin Connection (DAC - admin:) you can see the file stream, tombstone tables. These tables hold the GC actions, LSN values and other details used by GC when it is executed.
Shown below is a set of queries against the tombstone in dbFS. I order the entries based on LSN sequence. Looking at the first row you can compare this to the log truncation point to help determine the GC behaviors.
select * from sys.objects where name like '%tombstone%'
select * from sys.filestream_tombstone_2073058421 order by oplsn_fseqno asc, oplsn_bOffset asc, oplsn_slotid asc
Notice that you can see the rowset GUID and Column GUID as well as the file stream value name (file) so you can determine versioned copies of the data. Using this table the GC will access it in sorted LSN order and cleanup the files as appropriate.
Bob Dorr - Principal SQL Server Escalation Engineer