Describe recovery time objective and recovery point objective

Understanding recovery time and recovery point objectives is crucial to your high availability and disaster recovery (HADR) plan, as they're the foundation for any availability solution.

Recovery Time Objective

Recovery Time Objective (RTO) is the maximum amount of time available to bring resources online after an outage or problem. If that process takes longer than the RTO, there could be consequences such as financial penalties or work that can't be done. RTO can be specified for the whole solution, which includes all resources, as well as for individual components such as SQL Server instances and databases.

Recovery Point Objective

Recovery Point Objective (RPO) is the point in time to which a database should be recovered and equates to the maximum amount of data loss that the business is willing to accept. For example, suppose an IaaS VM containing SQL Server experiences an outage at 10:00 AM and the databases within the SQL Server instance have an RPO of 15 minutes. No matter what feature or technology is used to bring back that instance and its databases, the expectation is that there will be at most 15 minutes' worth of data lost. That means the database must be restorable to 9:45 AM or later to meet the stated RPO. Other factors, such as how often backups are taken, may determine whether that RPO is achievable.
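The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the function names and the calendar date are assumptions made for the sketch, and only the 10:00 AM outage and the 15-minute RPO come from the text.

```python
from datetime import datetime, timedelta


def earliest_acceptable_recovery_point(outage_time: datetime, rpo: timedelta) -> datetime:
    """Return the oldest point in time a restore may land on and still meet the RPO."""
    return outage_time - rpo


def meets_rpo(outage_time: datetime, recovered_to: datetime, rpo: timedelta) -> bool:
    """True if the data lost between recovered_to and the outage fits within the RPO."""
    return outage_time - recovered_to <= rpo


# Hypothetical date; the times match the example in the text.
outage = datetime(2024, 1, 1, 10, 0)
rpo = timedelta(minutes=15)

print(earliest_acceptable_recovery_point(outage, rpo).strftime("%H:%M"))  # 09:45
print(meets_rpo(outage, datetime(2024, 1, 1, 9, 50), rpo))  # True
print(meets_rpo(outage, datetime(2024, 1, 1, 9, 30), rpo))  # False
```

A restore to 9:50 AM loses 10 minutes of data and meets the RPO, while a restore to 9:30 AM loses 30 minutes and does not.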

Defining Recovery Time and Recovery Point Objectives

RTOs and RPOs are driven by business requirements but are also based on various technological and other factors, such as the skills and abilities of the administrators (not just DBAs). While the business may want no downtime or zero data loss, that may not be realistic or possible for a variety of reasons. Determining your solution’s RTO and RPO should be an open and honest discussion between all parties involved.

One of the aspects crucial for both RTO and RPO is knowing the cost of downtime. If you define that number and the overall effect that being down or unavailable has on the business, it's easier to define solutions. For example, if the business can lose 10,000 per hour or could be fined by a government agency if something could not be processed, that is a measurable way to help define RTO and RPO. Spending on the solution should be proportional to the amount, or the cost, of downtime. If your HADR solution costs $X, but you wind up only being affected for a few seconds instead of hours or days when a problem occurs, it has paid for itself.
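As a rough sketch of that reasoning, the comparison between the cost of a solution and the downtime it avoids can be written out as follows. The function names, the 8-hour outage, the 10-second failover, and the solution cost of 50,000 are all hypothetical values chosen for illustration; only the figure of 10,000 per hour comes from the text.

```python
def downtime_cost(hours_down: float, cost_per_hour: float) -> float:
    """Direct cost of an outage, ignoring fines and other indirect effects."""
    return hours_down * cost_per_hour


def solution_pays_for_itself(
    solution_cost: float,
    cost_per_hour: float,
    hours_down_without: float,
    hours_down_with: float,
) -> bool:
    """True if the downtime cost avoided by the solution covers its price."""
    avoided = downtime_cost(hours_down_without, cost_per_hour) - downtime_cost(
        hours_down_with, cost_per_hour
    )
    return avoided >= solution_cost


# Hypothetical scenario: one 8-hour outage reduced to a 10-second failover.
print(downtime_cost(8, 10_000))  # 80000
print(solution_pays_for_itself(50_000, 10_000, 8, 10 / 3600))  # True
```

A real assessment would also weigh how often outages are expected and any regulatory fines, which this sketch leaves out.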

From a nonbusiness standpoint, RTO should be defined at a component level (for example, SQL Server) as well as for the entire application architecture. The ability to recover from an outage is only as good as its weakest link. For example, if SQL Server and its databases can be brought online in five minutes but it takes application servers 20 minutes to do the same, the overall RTO would be 20 minutes, not five. The SQL Server environment could still have an RTO of five minutes, but that does not change the overall time to recover.
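The weakest-link idea can be expressed as taking the slowest component's recovery time. The sketch below assumes the components recover in parallel, which is what the 20-minute result in the example implies; if they had to be brought up one after another, the times would add instead. The function name is hypothetical.

```python
from typing import Mapping


def overall_recovery_minutes(component_minutes: Mapping[str, float]) -> float:
    """Overall recovery time when components recover in parallel: the slowest one wins."""
    return max(component_minutes.values())


components = {"SQL Server": 5, "application servers": 20}
print(overall_recovery_minutes(components))  # 20
```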

RPO deals specifically with data and directly influences the design of any HADR solution as well as administrative policies and procedures. The features used must support both the RTOs and RPOs that are defined. For example, if transaction log backups are scheduled every 30 minutes but there is a 15-minute RPO, a database could only be recovered to the last transaction log backup available, which in the worst case would be 30 minutes ago. This timing assumes no other issues and that the backups have been tested and are known to be good. While it is hard to test every backup generated for each database in your environment, backups are just files on a file system. Without doing at least periodic restores, there is no guarantee they're good. Running checks during the backup process can give you some degree of confidence.
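The mismatch between a backup schedule and an RPO can be checked with a simple comparison. This sketch makes a simplifying assumption: the worst-case data loss equals the full interval between transaction log backups, and it ignores how long a backup takes to run or copy. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(log_backup_interval: timedelta) -> timedelta:
    """A failure just before the next log backup loses close to one full interval."""
    return log_backup_interval


def schedule_supports_rpo(log_backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss under this schedule stays within the RPO."""
    return worst_case_data_loss(log_backup_interval) <= rpo


rpo = timedelta(minutes=15)
print(schedule_supports_rpo(timedelta(minutes=30), rpo))  # False
print(schedule_supports_rpo(timedelta(minutes=15), rpo))  # True
```

Even a schedule that passes this check only supports the RPO if the backups themselves restore successfully.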

The specific features used, such as an Always On Availability Group (AG) or an Always On Failover Cluster Instance (FCI), will also affect your RTOs and RPOs. Depending on how the features are configured, IaaS or PaaS solutions may or may not automatically fail over to another location, which could result in longer downtime. By defining RTO and RPO, the technical solution that supports those requirements can be designed knowing the allowances for time and data loss. If those wind up being unrealistic, RTOs and RPOs must be adjusted accordingly. For example, if there is a desired RTO of two hours but a backup will take three hours to copy to the destination server for restoring, the RTO is already missed. These types of factors must be accounted for when determining your RTOs and RPOs.
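One way to sanity-check a desired RTO is to add up the sequential recovery steps and compare the total to the objective. In the sketch below, the two-hour RTO and the three-hour copy come from the example in the text; the half-hour restore step and the function name are assumptions added for illustration.

```python
from typing import Mapping


def rto_is_achievable(rto_hours: float, step_hours: Mapping[str, float]) -> bool:
    """True if sequential recovery steps fit inside the RTO."""
    return sum(step_hours.values()) <= rto_hours


steps = {
    "copy backup to destination server": 3.0,  # from the example in the text
    "restore database": 0.5,                   # hypothetical duration
}
print(rto_is_achievable(2.0, steps))  # False
```

Because the copy alone exceeds the objective, either the RTO or the recovery approach would need to change.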

There should be RTOs and RPOs defined for both HA and DR. HA is considered a more localized event that can be recovered from more easily. One example of high availability would be an AG automatically failing over from one replica to another within an Azure region. That may take seconds, and at that point, you would need to ensure that the application can connect after the failover. SQL Server’s downtime would be minimal. A local RTO or RPO may potentially be measured in minutes depending on the critical nature of the solution or system.

DR would be akin to bringing up a whole new data center. There are lots of pieces to the puzzle; SQL Server is just one component. Getting everything online may take hours or longer. This is why the RTOs and RPOs are separate. Even if many of the technologies and features used for HA and DR are the same, the level of effort and time involved may not be.

All RTOs and RPOs should be formally documented and revised periodically or as needed. Once they're documented, you can then consider what technologies and features you may use for the architecture.