What is the minimum number of AlwaysOn nodes required to obtain both HA and DR?

While recently answering questions in the MSDN SQL Server forums, I came across a great question about the minimum number of nodes required to obtain both high availability (HA) and disaster recovery (DR) with AlwaysOn Availability Groups. As an on-site SQL Server engineer, this is a very typical type of question that I receive from my customers. This one in particular is interesting because licensing costs are a factor in the proposed design decisions, just as I see in many of my customer environments. The answers provided in the thread are very helpful and accurately provide the answer, but I had a few additional questions about the details. As such, I asked one of the world’s best AlwaysOn experts, Trayce Jordan, for some additional information. His reply was outstanding (as always) and I think that many others will benefit from his depth of his answers.

Q: My understanding is that this architecture will require Asynchronous replication to reduce network latency across long distance but at the risk of some potential data loss if failover occurs.

A: ASYNC versus SYNC does not reduce network latency – the network latency is what it is. ASYNC is preferred if there is high network latency because then the client does not have to way to for the SYNC acknowledgement. So there is “latency”, and the application will have to wait longer for the database engine to let the client know that SQL committed it, but the “latency” is due to the intrinsic distance involved between the sites and the “network latency”. It’s a technical distinction, I know.

Q: It also requires implementation of fileshare witness to achieve quorum which is generally considered inferior to disk witness.

A: It does not require fileshare witness, but having a witness of some sort is recommended (disk or file). Disk witnesses are not superior to file share witnesses as long as the hosting server is highly available. There are several ways to introduce HA for the fileshare host including clustering or using cloud infrastructures. There is a distinct advantage in using a FSW over a DW, in that DW are typically only available within the same datacenter. FSW can be in remote locations and accessible from multiple sites.

Q: I have a sense that there will be relatively heavy transaction workload, potentially many millions per day.

A: Millions / day translates to 277 /sec (10M in a 10hr day), which is not very high. AlwaysOn AGs can handle very high transaction loads and beginning with SQL 2016, the transport performance has been improved to handle much higher rates. There are tests showing up to 450MB/sec in log flush rates. Even with SQL 2012, I have seen systems that do 20,000 batches / sec and anywhere between 5K and 10K transactions/sec.

Q: It seems that in order to achieve both HA AND DR, it is recommended to have 2 synchronous replicated nodes in same DC and a third Passive replica in separate geo-located DC (for Disaster Recovery). However, I don't think management is going to go for the third node (due to cost).

A: There are other advantages, by having a local machine for HA, and still have a machine for DR  -- both using AGs. Locally, having a second server reduces downtime for patching. If you failover to a DR machine every time you have to patch, the application could experience more latency in accessing the database if the application servers remain in the local data center and the active database server is in the remote datacenter –due to the network latency. Another advantage would be able to use the local secondary for real-time reporting. Cost can certainly be an issue, but if the system is critical enough then having both HA and DR is critical. People tend to confuse HA with DR. Having to failover to a remote datacenter just for patching or because a single database or database server fails is not a good strategy. Better to have local failovers of some kind – which is for HA.

Q: If management wants to stick with just 2 nodes, I think I want to recommend putting both in the same DC as Active/Active with synchronous replication and implement disk witness quorum. The drawback of course is that if there is DC failure (power/network), then both nodes would be lost. But I want to make sure I have my facts straight before I make any recommendations. Am I understanding things correctly?

A: If you’re going to have “active/active”, then you are still paying for two SQL Servers at the local DC. If you have software assurance, the licensing for a truly passive node is free, so you could still have a node serve as the remote secondary – for AGs hosted on either server. So if I have server A & B in the primary DC, I can have Availability Groups that are primary on A as well as AGs that are primary on B, both replicating to node C, and would not need to license C.

 

Thanks again for another great discussion in the MSDN Forums that helps the entire community work through the technical details of these solutions and to Trayce for providing excellent detail in his reply.

Sam Lester (MSFT)