What You Should Know About Geographically Dispersed Exchange High Availability and FSW
Written by Jaroslav Zikmund, Microsoft Premier Field Engineer.
This article discusses the recommendations around the placement of the Microsoft Exchange Server File Share Witness (FSW) in Database Availability Groups (DAGs) whose members are placed in two or more physical locations. Microsoft’s recommendation is to place the FSW in the primary location, or in the location with the highest number of users.
But what happens if you already have a geographically dispersed High-Availability (HA) file server and host the FSW on it? The primary reason for this very complex design is to achieve automatic failover across datacenter boundaries. At this point I have to mention that the guidance on TechNet states that “datacenter failure is considered to be a disaster recovery event, and recovery must be manually performed”. Still, it looks as if moving the FSW to another HA cluster would leave the remaining datacenter with the majority of votes when an entire datacenter is lost. Unfortunately, Exchange is not designed to work this way, and this configuration will probably not meet your expectations. Let’s go into the details of why this usually doesn’t work from a quorum point of view. Note that this is a very broad topic, and we won’t touch on related details about CAS redirection, permissions problems or using Datacenter Activation Coordination (DAC) mode on a DAG when the FSW is placed on the geographically dispersed cluster.
A Geographically Dispersed Exchange Example
If we lose an entire datacenter (for example, DataCenter1 in the diagram above), we actually lose Mailbox Server 1, the network link between the datacenters, and the FC link, but in many cases not everything fails at the same time.
In this example, the network link between the datacenters fails first, and then DataCenter1 experiences a complete failure. Let’s also assume the FSW is hosted in DataCenter1 (otherwise we wouldn’t need an HA file server).
Here’s what DataCenter2 sees:
- Mailbox Server 2 has only 1 vote out of 3 (as the FSW is still in DataCenter1)
- All mailbox databases are dismounted.
Here’s what DataCenter1 sees:
- Mailbox Server 1 + FSW
- The datacenter has the majority of votes
- Databases are up and running
- Exchange Active Manager updates the cluster database, and the “PaxosTag” attribute on the FSW is updated as well.
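The vote counting behind these two views can be sketched in a few lines of Python. This is only an illustration of the majority rule for the three voters in this example (Mailbox Server 1, Mailbox Server 2 and the FSW); the function name `has_quorum` and the constants are made up for this sketch and are not part of Exchange or Windows Failover Clustering.

```python
def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """Return True when the visible votes form a strict majority of all votes."""
    return votes_visible > total_votes // 2


# Voters in this example: Mailbox Server 1, Mailbox Server 2 and the FSW.
TOTAL_VOTES = 3

# After the network link fails, each datacenter only counts what it can reach.
DATACENTER2_VOTES = 1  # Mailbox Server 2 only
DATACENTER1_VOTES = 2  # Mailbox Server 1 + FSW

print("DataCenter2 has quorum:", has_quorum(DATACENTER2_VOTES, TOTAL_VOTES))
print("DataCenter1 has quorum:", has_quorum(DATACENTER1_VOTES, TOTAL_VOTES))
```

With three voters a majority requires two votes, so only the side that can still reach the FSW keeps its databases mounted.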
Now DataCenter1 experiences a complete failure. The FSW is automatically moved to DataCenter2 by the HA file share cluster. Mailbox Server 2 in DataCenter2 tries to lock the FSW to achieve quorum, but the PaxosTag value on the FSW is higher than that of the cluster database on Mailbox Server 2. The cluster then reports error event 1561: “The cluster service has determined that this node does not have the latest copy of cluster configuration data. Therefore, the cluster service has prevented itself from starting on this node.” As a result, the mailbox databases will not be mounted on Mailbox Server 2.
The result of this scenario: nobody has a majority and all databases are dismounted. So, placing the FSW on a cluster doesn’t generally increase the availability of the solution, only the complexity.
DAG members can use the FSW to maintain quorum only in cases where the version of the cluster database on the DAG member is the same as or newer than the PaxosTag value on the FSW.
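To make this rule concrete, here is a small Python sketch that replays the timeline of the example. It is a simplified model, not how the cluster service is implemented: the PaxosTag is represented as a plain integer (an assumption made for readability; the real value has a richer format), and the names `can_use_witness` and `replay_scenario` are hypothetical.

```python
def can_use_witness(node_tag: int, witness_tag: int) -> bool:
    """A node may use the FSW only if its cluster database is the same or newer."""
    return node_tag >= witness_tag


def replay_scenario(initial_tag: int = 10) -> bool:
    """Replay the example and report whether Mailbox Server 2 can use the FSW."""
    # Before the failure every copy of the cluster configuration agrees.
    mailbox2_tag = initial_tag
    witness_tag = initial_tag

    # The network link fails. DataCenter1 still has the majority, so the
    # cluster database there and the tag on the FSW move forward.
    witness_tag += 1
    # Mailbox Server 2 is isolated and never sees this update.

    # DataCenter1 fails completely and the FSW moves to DataCenter2.
    return can_use_witness(mailbox2_tag, witness_tag)


print("Mailbox Server 2 can use the FSW:", replay_scenario())
```

In this model Mailbox Server 2 ends up with an older tag than the FSW, which corresponds to the situation reported by event 1561.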