SQL Server 2008 DR, Testing, Waits and Queues
Here are some notes on “SQL Server 2008 DR, Testing, Waits” I took while attending an advanced class on SQL Server taught by Ron Talmage (from http://www.solidq.com/na/MentorDetail.aspx?Id=38).
Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without some pre-requisite knowledge and the right context. Reading the post from top to bottom will help.
- RTO (Recovery Time Objective), RPO (Recovery Point Objective), RTA (Recovery Time Actual)
- Backup Media (disk/tape) and Location (local, remote, offsite)
- Recovery Technology (Backup/Restore, Mirroring, Log Shipping, Clustering, Replication)
DR and HA
- What’s the difference between HA and DR?
- How would you prepare for a disaster?
- What happens if you lose the entire data center?
- Primary data center, standby data center.
- Consider risks, cost, time to invest.
- Big question: What are you trying to protect against?
- If you have an RPO that requires very small data loss, what to do?
- Small RPO? Large RPO? Small RTO?
- How far apart can the two nodes be?
- Synchronous or asynchronous replication?
- Consider: Clustering, Mirroring, SAN Replication, GeoClustering, Log Shipping
- Planned downtime: New hardware, Upgrade, Patches, Versions, Consolidation
- Unplanned downtime: Disasters
- How to flip a switch to the DR site?
- Virtualization support for SQL Server - http://support.microsoft.com/KB/956893
- Server Virtualization Validation Program (SVVP) - http://technet.microsoft.com/en-us/library/ms143506.aspx
- Windows Server 2008 Failover Cluster Configuration Program (FCCP) - See http://www.microsoft.com/windowsserver2008/en/us/failover-clustering-program-overview.aspx
- VDI - COM-based interface with SQL Server 7 and later
- VDI - Supports snapshot backups using SAN-based technologies
- VSS - http://blogs.technet.com/josebda/archive/2007/10/10/the-basics-of-the-volume-shadow-copy-service-vss.aspx
- DPM - www.microsoft.com/dpm
- Log Shipping - http://msdn.microsoft.com/en-us/library/ms187103.aspx
- Database Mirroring - http://msdn.microsoft.com/en-us/library/bb934127.aspx
Benchmarking and baselining
- Load-testing, benchmarking and baselining
- Meet performance requirements
- Verify correctness
- Determine system limits
- Determine impact of changes
- Baseline - Ground zero, known state, Point of comparison
- Benchmark - Measurement under specified conditions, goal, used to estabilish baseline
- TPC - http://www.tpc.org/, TPC-E for OLTP, TPC-H for DW/OLAP. Reports trans/sec and response time
- Load test - Application of a load, looking for a sufficient or expected load
- Stress test - Increase load. Done with concrete in construction.
- Saturation test - Add load until it fails.
- Endurance test – With specified load for a long time
- Tuning the Performance of Backup Compression in SQL Server 2008
- Looking at the white paper
- - Backup throughput to NUL with varying BUFFERCOUNT (what’s the baseline?)
- - Disk throughput and latency during backup compression (disk saturated, increasing queue)
- - Database snapshot creating time with varying workload (what’s the baseline? Load test.)
- Monitoring Resource Usage - http://msdn.microsoft.com/en-us/library/ms191246.aspx
- See http://blogs.msdn.com/jimmymay/archive/2008/10/15/perfmon-objects-counters-thresholds-utilities-for-sql-server.aspx
- See http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?culture=en-US&EventID=1032357640&CountryCode=US
- SQLIO, IOMeter, SQLIOSim, RML utilities, VSTS, 3rd party tools
- SQLIO – http://www.microsoft.com/downloads/details.aspx?familyid=9A8B005B-84E4-4F24-8D65-CB53442D9E19&displaylang=en
- IOMeter – http://www.iometer.org/
- SQLIOSim - http://blogs.msdn.com/sqlserverstorageengine/archive/2006/10/06/SQLIOSim-available-for-download.aspx
- SQLIOSim - correctness and stress tool, simulates data and log file activity, does not require SQL
- See http://sqlblog.com/blogs/kevin_kline/archive/2007/06/28/understanding-sqliosim-output.aspx
- RML utilities – OSTRESS, ReadTrace, Reporter - http://support.microsoft.com/kb/887057
- See http://blogs.msdn.com/psssql/archive/2007/12/18/rml-utilities-for-microsoft-sql-server-released.aspx
- See http://sqlcat.com/technicalnotes/archive/2008/02/01/precision-performance-for-microsoft-sql-server-using-rml-utilities-9-0.aspx
Waits and Queues
- Wait occurs when thread has to wait for a resource
- How waits reporting works on SQL Server – waiter queue and wait types
- SQL Server 2000 – 76 types of waits
- SQL Server 2005 – 201 types of waits
- SQL Server 2008 – 484 types of waits
- Query with sys.dm_os_wait_stats, reset with DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR)
- See http://msdn.microsoft.com/en-us/library/ms179984.aspx
- See http://blogs.msdn.com/sqlcat/archive/2005/09/05/461199.aspx
- SQL Server 2005 Waits and Queues - White Paper
From the white paper, regarding OLTP workloads
- Database Design
- High Frequency queries having a high number of table joins (>4)
- Frequently updated tables having # indexes (>3)
- Big IOs - Table Scans, Range Scans (>1)
- Unused Indexes
- Signal Waits (>25%)
- Plan reuse (<90%)
- Parallelism: Cxpacket waits (>5%)
- Page life expectancy (<300 sec)
- Page life expectancy (Drops by 50%)
- Memory Grants Pending (>1)
- SQL cache hit ratio (<90%)
- Average Disk sec/read (>20 ms) – Comment: this should be lower (>5, >8)
- Average Disk sec/write (>20 ms) -- Comment: this should be lower (>5, >8)
- Big IOs - Table Scans, Range Scans (>1)
- ASYNCH_IO_COMPLETION, IO_COMPLETION, LOGMGR, WRITELOG, PAGEIOLATCH_x (in top 2)
- Low bytes per sec
- Block percentage (>2%)
- Block process report (30 sec)
- Average Row Lock Waits (>100ms)
- LCK_M_% (in top 2)
- 5, High number of deadlocks (>5 per hour)
- High network latency + many round trips to DB (Output queue length >2)
- Network bandwidth used up (Packets Outbound Discarded, Packets Outbound Errors, Packets Received Discarded, Packets Received Errors)
Troubleshooting Performance Problems in SQL Server 2005
Diagnosing Transaction Log Performance Issues and Limits of the Log Manager