Troubleshooting SQL Server Transactional Replication

I often see questions about transactional replication performance problems, especially around latency/delays between the publisher and subscriber(s) so I’ve put a few pointers below on what to investigate.  Latency between the publisher, distributor and subscriber(s) is, more often than not, the symptom of other causes for example, poor I/O capacity on subscribers, blocking/locking, hotspots on indexes, high number of virtual log files etc.

Troubleshooting tips:

  • Look at perfmon counters (disk reads and writes/sec, avg disk/sec read and avg disk/sec write) to ensure there is enough capacity and that the latency on the data and log drives are within our recommended boundaries. 
  • Look at waitstats (use DMVstats – highly recommended) to see what resources are waiting. This will give you a good indication where the bottleneck is. 
  • Look at the transactional replication performance monitor counters (pending Xacts, transactions/sec, latency etc.)
  • Check the number of VLF’s as this can have a negative impact on log scanning if there are a very high number of VLF’s, I tend to ensure this value is below 1000.
  • Use tracer tokens to check latency from publisher to distributor to subscriber
  • Use agent logging to external files, -outputverboselevel 2 –output <dir\file> to troubleshoot data issues
  • Look in mslogreader_history, msdistribution_history & msrepl_errors in distribution database
  • Consider external factors e.g. consult the network/SAN specialists to check external issues such as network bandwidth/array performance issues etc.
  • If blocking is suspected then use the blocked process trace definition.  I can highly recommend this as it provides incredibly valuable information about the blocked and blocking processes. 
  • If you are using database mirroring in conjunction with transaction replication then the log reader may be have to wait for the record to be hardened on the mirror.  This can be avoided by using trace flag 1448 on the publisher.

Optimisation tips:

  • Use agent profiles to optimise for workloads
  • Implement Read Committed Snapshot Isolation (RCSI) on subscribers to alleviate reader/writer blocking (when doing this consider the impact on tempdb as this is where the version store is located)
  • Ensure the distribution history clean-up job is correctly trimming the distribution database tables.
  • If there are data consistency issues, consider using tablediff to compare data in publisher/subscriber tables (warning: this may take a while with large volumes of data) however tablediff can in fact be used against a subset of the data using views.
  • Be careful about using –skiperrors to bypass consistency errors
  • Consider using –SubscriptionStreams on the distribution agent to use multiple threads to apply the data to the subscribers, read this and this
  • If initialising from a backup/copy of the database, don’t enforce integrity on the subscribers.  Drop the constraints or use the NOTFORREPLICATION option.