Monitoring Database Replication Efficiency in Configuration Manager 2012
In this blog post I want to look at DRS replication within System Center Configuration Manager and discuss some impacts that can be seen in a busy site. If you are running a single site environment, then this article is not for you. If you run a CAS and multiple primary sites then you might just see some issues. Be aware however that we mean that the site is busy for a protracted period of time – for example if you migrate around 50,000 new clients into your site in 5 days then you might see the issue and the environment recovers quickly as it is supposed, but if this kind of activity goes on for some time you may experience issues and the simple checks in this blog will be of help.
So, to review DRS replication there is nothing really that we can add to Umair Kahn’s excellent discussion on DRS here http://blogs.technet.com/b/umairkhan/archive/2014/02/18/configmgr-2012-data-replication-service-drs-unleashed.aspx. Umair explains in his post that there are two main types of data:
Global data, which as its name suggests is replicated between all primary sites and CAS within the hierarchy. There is a derivation of this called Global Proxy data which is replicated from the primary site to any secondary sites under it.
Site data, which is unique to this primary site and then is replicated up to the CAS
We then read on to find that the data that we replicate is arranged into articles and these articles are then arranged into replication groups. The replication groups then replicate between the sites.
The transport mechanism for this data to be replicated is via a SQL Server feature called change tracking. Quoting from Umair’s blog post:
Change tracking is to identify and keep track of what changed in the Database like a row insertion, updation or deletion. We need this as we have to now only send the changes to the CAS and not the whole global data again as both the CAS and primary are now in page.
Question can come how Change Tracking works in background –
The answer is simple for every table that is enabled for change tracking we have one more internal table change_tracking_<object_id >. When a row gets inserted in the actual table then a corresponding entry (containing) only the primary key information gets added in the internal table change_tracking_<object_id >
So, whenever a row is changed within the local database then a second table is updated with that information. We then replicate that data to the other site.
Umair’s blog post then goes on to say:
For every successful transaction that is committed for a table we see a row in the sys.syscommittab table. The sys.syscommittab is a system table which can be referenced by the sys.dm_tran_commit_table view
So the sys.syscommittab table is what I want to spend a little time discussing. From the database we need to look at the sys.dm_tran_commit_table view which gives us something which looks like this:
So, the change tracking process looks into this table and picks up transactions which have not been committed, locates the un-replicated transactions and then when they are replicated, updates the table.
So, then what? Well, we keep the records in this table for 5 days (by default). If you have looked into DRS replication then you will have also seen that DRS can recover from an outage which is shorter than 5 days – more than that, then we will re-initialise the database. This is why we have the 5 days limit.
When a transaction is committed and it is more than 5 days old then it will be removed from sys.syscommittab. This occurs within a SQL process that runs every 62 seconds. This process however has a rowcount limit of 10,000 records so we are good to prune up to 14.4 million rows per day.
If you have a few days where the number of replicated transactions is greater than 14.4 million then this will not be a problem – sys.syscommittab can accommodate for this growth and the excess records will be pruned when the growth in this table is not so great (for example, over a weekend). However, what happens if there never really is a quiet time in your site?
Well, sys.syscommttab will do what it is supposed to do and contain the older records until it is able to prune them. Left unchecked then the growth in this table can adversely affect performance of replication, but this will likely be a slow-burn for several months before you see it. If you are in that situation then when you run the SQL below to see the busiest SQL statements then towards the top you will see not the usual stored procedures that you expect to see but SQL code including sys.syscommittab showing.
To determine the busiest processes running on the SQL Server
SELECT TOP 50
ObjectName = OBJECT_SCHEMA_NAME(qt.objectid,dbid) + '.' + OBJECT_NAME(qt.objectid, qt.dbid)
,TextData = qt.text
,DiskReads = qs.total_physical_reads -- The worst reads, disk reads
,MemoryReads = qs.total_logical_reads --Logical Reads are memory reads
,Executions = qs.execution_count
,TotalCPUTime = qs.total_worker_time
,AverageCPUTime = qs.total_worker_time/qs.execution_count
,DiskWaitAndCPUTime = qs.total_elapsed_time
,MemoryWrites = qs.max_logical_writes
,DateCached = qs.creation_time
,DatabaseName = DB_Name(qt.dbid)
,LastExecutionTime = qs.last_execution_time
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
ORDER BY qs.total_worker_time DESC
At this stage, while it’s nothing to panic about it would be best to log a support incident with Microsoft in order to run the correct SQL statements to bring things back into order.
Much better however is to simply monitor the growth of entries in sys.syscommittab and ensure that you are maintaining a healthy equilibrium. With my customer we simply added a command select count (*) from sys.dm_tran_commit_table and trend this in SCOM, with an alert if the value goes above 100 million rows (we based this on 14.4 million is within bounds for their 5 days retention plus some headroom for those occasional busy times). My colleague Rob York however took this and wrote a much more professional piece of code below:
/* Declare Variables */
declare @transaction_count int
declare @changetrackperiod int
declare @oldestentrydate datetime
declare @committimedelta int
declare @cleanupcap int
declare @dailytransactioncount int
declare @issuefound int
/* Set daily cleanup cap. 14.4m bases on 10,000 transactions per cycle once every 62 seconds */
set @cleanupcap = 14400000
set @issuefound = 0
/* get number of transactions */
select @transaction_count = count(*) from sys.dm_tran_commit_table
/* get change tracking retention period. 5 days is default. 14days is max allowed setting */
select @changetrackperiod = CTDB.retention_period from sys.change_tracking_databases as CTDB
join sys.databases as DB on DB.database_id = CTDB.database_id
where DB.name like 'CM_%'
/* get the age of the oldest entry in change tracking */
select @oldestentrydate = min(commit_time) from sys.dm_tran_commit_table
select @committimedelta = DATEDIFF(day,@oldestentrydate,getdate())
/* Print out informationals */
print 'Total Transactions: ' + RTRIM(CAST(@transaction_count as nvarchar))
print 'Change Tracking Period: ' + RTRIM(CAST(@changetrackperiod as nvarchar))
/* Check for signs of backlogs */
if (@transaction_count > (@changetrackperiod * @cleanupcap))
print 'Potential Backlog Based on Total Number of Transactions'
set @issuefound = 1
if (@committimedelta > @changetrackperiod)
print 'Potential Backlog Based on Stale Transactions'
set @issuefound = 1
declare @i int
set @i = @changetrackperiod
while @i > 0
select @dailytransactioncount = count (*) from sys.dm_tran_commit_table where datepart(dd,commit_time) = dateadd(day,-@i,datepart(dd,getdate()))
if (@dailytransactioncount > @cleanupcap)
print 'Potential Backlog Based on Daily Transaction Count for: '
+ RTRIM(CAST(convert(varchar(11),(dateadd(day,-@i,getdate())),103)as nvarchar))
+ ' (' + RTRIM(CAST(@dailytransactioncount as nvarchar)) + ')'
set @issuefound = 1
set @i = @i - 1
if (@issuefound = 0)
print 'Change Tracking Cleanup Appears Healthy'
Lastly, if you do see that these are an issue for you then logging a call with Microsoft Support will be the best way to bring your environment back into line.