High Availability in Exchange 2007

Messaging is now mission critical for most companies, and high availability is a major focus for any Exchange Server implementation. Exchange Server 2007 gives us some great options for achieving 'high availability', including LCR, CCR and, in SP1, SCR. But this is only part of the story.

These features focus only on data availability. In the vast majority of companies that I have worked in, the workforce accesses email predominantly via Outlook. Access to email via OWA or a handheld device is important but not necessarily mission critical. I don't think I have ever seen an SLA that specifically defines OWA or ActiveSync service availability.

This is changing, though. As devices improve in usability, and device access in particular becomes more trustworthy and therefore more critical, it is important that we revisit the notion of 'high availability'. Focusing only on CCR, for example, to provide high availability is a bit misleading. If I provide a solution that does not take into account access for OWA or ActiveSync clients, then I don't believe I can call it highly available. We need to consider all the roles in Exchange Server 2007 before we can claim high availability.

Of course, planning for high availability is a fruitless exercise unless we define exactly what high availability is. That definition will change according to what our business requires of its mission critical applications and, ultimately, what messaging SLAs we have to meet. Achieving high availability might mean providing full site resilience, where our SLA defines how long it should take to recover all messaging services to a second data centre and then restore full site resilience. High availability might also mean the ability to return service to Outlook users via dial-tone databases. For the purposes of this article I am going to assume that high availability means some form of site resilience, i.e. if we lose a data centre we can manually fail over to a second data centre and recover all messaging services with no loss in performance and, where possible, no loss of data. In the rest of this post I want to point out some of the main issues and decisions that you might come across in designing for high availability in Exchange 2007.

Planning for HA for your Client Access Servers

I have written a couple of times before on site resilience and the CAS role. These blogs are here and here. In a nutshell, the way to provide site resilience for your Outlook Anywhere, ActiveSync and OWA user communities depends on whether it makes sense for mailboxes in one data centre to be accessed by a different URL from mailboxes in the second data centre. If your data centres are a long way apart and the user community knows where their mailboxes are, then this might make sense. This is what we do at Microsoft, for example. If this is an acceptable option, then the preferred design is two AD sites, one sitting logically in each data centre. One set of users will access their mailboxes with mail.siteAexternalurl.companyA.com, for example, and the other with mail.siteBexternalurl.companyA.com.

If this is absolutely not workable in your environment and will be met with huge resistance (and I have to say a lot of companies that I know of will balk at this), then you need to consider a different AD site design. The simplest is a single AD site across your data centres. The less obvious is the CAS tier, where you have three AD sites: one in each data centre and one spanning the two. The tier provides the single namespace but means that all remote access will go via CAS-to-CAS proxying, which will slow things down.

An issue which might sway your decision one way or another is the number of servers you want to build and support. The simplest design, the single AD site, should mean you can deploy significantly fewer CAS servers.
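The choice between the three CAS designs can be sketched as a small decision function. This is only an illustration of the reasoning in this section, not a sizing tool; the function name and its two yes/no inputs are my own assumptions.

```python
def choose_cas_topology(separate_urls_acceptable: bool,
                        prefer_fewest_servers: bool) -> str:
    """Map the two questions from this section onto an AD site design.

    Both parameters are hypothetical inputs invented for this sketch.
    """
    if separate_urls_acceptable:
        # Users know which data centre hosts their mailbox.
        return "two AD sites, one per data centre, each with its own external URL"
    if prefer_fewest_servers:
        # Single namespace with the simplest design.
        return "single AD site stretched across both data centres"
    # Single namespace, at the cost of CAS-to-CAS proxying.
    return "CAS tier: three AD sites, one per data centre plus one spanning both"


if __name__ == "__main__":
    print(choose_cas_topology(separate_urls_acceptable=False,
                              prefer_fewest_servers=True))
```

In practice the decision will involve more than two inputs, but writing it down this way makes the trade-off between namespace, proxying and server count explicit.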

Planning for HA for your Hub Transport Servers

There are perhaps the fewest decisions to be made with HA and the Hub Transport role. The main factor is of course going to be your AD site design, as it is your site topology which dictates mail routing. There are a couple of main options when we look at site resilience. The first would seem the most obvious: have two logical AD sites which match your data centre sites. In this design the most logical routing decisions will be made for outbound and internal email, ultimately with the most efficient use of the WAN. For inbound email it is likely that 50% of this mail will travel across the WAN, but as far as I can see there are few options to get around this.

The second option, which might be dictated to you by other design decisions, would be a single AD site stretched across the two data centres. In this case you would need to set the -SubmissionServerOverrideList parameter (via Set-MailboxServer) on your mailbox role servers to override the default routing decisions and make the most efficient use of the WAN. (Of course you also need to make sure that administrators know to reset the lists if you add or remove HT servers in the future.)
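As a rough sketch of what maintaining those override lists involves, the snippet below generates one command per mailbox server, pointing it at the Hub Transport servers in its own data centre. All server and data centre names are hypothetical, and the command text assumes the Set-MailboxServer cmdlet with its -Identity and -SubmissionServerOverrideList parameters; check the output against your own environment before running anything.

```python
def build_override_commands(hubs_by_datacentre, mailbox_servers_by_datacentre):
    """Return one Exchange Management Shell command string per mailbox server.

    Each mailbox server is given the hub servers located in the same
    data centre, so mail submission stays off the WAN where possible.
    """
    commands = []
    for datacentre, mailbox_servers in sorted(mailbox_servers_by_datacentre.items()):
        local_hubs = ",".join(hubs_by_datacentre[datacentre])
        for mailbox_server in mailbox_servers:
            commands.append(
                f"Set-MailboxServer -Identity {mailbox_server} "
                f"-SubmissionServerOverrideList {local_hubs}"
            )
    return commands


if __name__ == "__main__":
    # Hypothetical inventory for two data centres.
    hubs = {"DC-A": ["HT-A1", "HT-A2"], "DC-B": ["HT-B1", "HT-B2"]}
    mailbox_servers = {"DC-A": ["MBX-A1"], "DC-B": ["MBX-B1"]}
    for command in build_override_commands(hubs, mailbox_servers):
        print(command)
```

Regenerating the commands from a single inventory like this is one way to make sure the lists are reset whenever HT servers are added or removed.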

One thing to mention here is that for one site to take over the full load in the event of a disaster, each site must have enough servers to carry that load on its own, i.e. you need to double up on HT role computers.
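The doubling-up rule is simple arithmetic, sketched below. The throughput figures are made-up placeholders, not Exchange sizing guidance; the point is only that each site is sized for the whole organisation's peak load rather than half of it.

```python
import math


def hub_transport_servers_per_site(peak_messages_per_sec: float,
                                   per_server_capacity: float) -> int:
    """Servers each site needs so that it can carry the full load alone.

    Both inputs are hypothetical figures you would measure or estimate.
    """
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    return math.ceil(peak_messages_per_sec / per_server_capacity)


if __name__ == "__main__":
    per_site = hub_transport_servers_per_site(900, 250)
    # Two data centres, each sized for the full load.
    print(f"{per_site} HT servers per site, {2 * per_site} in total")
```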

Planning for Mailbox Role Server HA

There are a number of decisions to be made when planning for site resilience and the mailbox role. Unless we can make use of some form of storage replication solution, we need to choose between CCR and/or SCR. Your decision will ultimately be determined by the following factors:

Stretched subnets - If you cannot stretch a subnet between the data centres then it will not be possible to stretch two nodes from the same cluster between them. Therefore you cannot just use CCR.

Windows OS - If you plan to use Windows Server 2003 and you cannot stretch a subnet, then again you cannot just use CCR to provide site resilience. However, if you are designing now and plan to make use of Windows Server 2008, then CCR might be the preferred option. As I understand it, Windows Server 2008 provides the ability to have two nodes of the same cluster in different subnets.

AD Site topology - Remember that the two nodes of a CCR cluster need to be in the same Active Directory site. The target in an SCR configuration does not. If you can stretch a subnet between your data centres then you can stretch an AD site, but this route introduces other complexity, particularly during the loss of a data centre (when all the other servers in that AD site that sat in the lost data centre are unavailable).
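The three factors above can be combined into a rough check, sketched here. The function and parameter names are my own, and the logic only encodes the constraints as described in this post: CCR nodes must share an AD site, and they need either a stretched subnet or Windows Server 2008 to sit in different data centres.

```python
def can_stretch_ccr(subnet_stretched: bool,
                    windows_server_2008: bool,
                    ad_site_stretched: bool) -> bool:
    """True if a CCR cluster could span both data centres."""
    # Either the subnet is stretched, or the OS allows nodes in different subnets.
    network_ok = subnet_stretched or windows_server_2008
    # Both CCR nodes must be in the same AD site.
    return network_ok and ad_site_stretched


def mailbox_replication_option(subnet_stretched: bool,
                               windows_server_2008: bool,
                               ad_site_stretched: bool) -> str:
    """Suggest a starting point for the mailbox role design."""
    if can_stretch_ccr(subnet_stretched, windows_server_2008, ad_site_stretched):
        return "CCR stretched across both data centres is possible"
    return "CCR within one data centre plus an SCR target in the other"


if __name__ == "__main__":
    print(mailbox_replication_option(subnet_stretched=False,
                                     windows_server_2008=False,
                                     ad_site_stretched=False))
```

A stretched CCR cluster being possible does not make it the right answer; the complexity noted under AD site topology still applies.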

For most companies I think the most logical choice is a combination of CCR and SCR: CCR running on an MNS cluster in one data centre and an SCR target (a standby MNS cluster) in the second data centre. Of course this leads me to my last factor, which is cost. Using CCR with a standby cluster as the SCR target means 4 physical servers for every 1 mailbox role Exchange Server. To reduce this, some administrators might consider recovering to a standalone Exchange Server and using database portability. But of course this brings us back to our SLA. If we need to recover local site resilience in the event of the loss of a data centre, then we need our target to be a standby MNS cluster.
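The cost factor can be made concrete with a small count, sketched below. It follows the figure in this post of 4 physical servers with a standby cluster; the count of 3 for a standalone target is my own inference (two CCR nodes plus one standalone server), and the names are invented for the example.

```python
CCR_NODES = 2  # active and passive node of the CCR cluster

# Physical servers in the second data centre for each kind of SCR target.
# "standalone" = 1 is an assumption inferred from the text, not a quoted figure.
TARGET_NODES = {"standby_cluster": 2, "standalone": 1}


def physical_servers_per_mailbox_server(target: str) -> int:
    """Physical servers needed for every one mailbox role Exchange Server."""
    return CCR_NODES + TARGET_NODES[target]


def total_physical_servers(mailbox_servers: int, target: str) -> int:
    """Scale the per-server count up to a whole deployment."""
    return mailbox_servers * physical_servers_per_mailbox_server(target)


if __name__ == "__main__":
    for target in TARGET_NODES:
        print(target, total_physical_servers(5, target))
```

The saving from a standalone target has to be weighed against the SLA: it does not give you local site resilience once you have failed over.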

This is a very basic overview of high availability in Exchange Server 2007, focusing on site resilience. I know that there are numerous other factors that need to be taken into account in any Exchange 2007 design. I have not really looked at network bandwidth, for example, which will be important if you are replicating a lot of data over large distances. The key to getting it right, I believe, is to understand exactly what your business requires before you start. If you know exactly what your user community and decision makers actually want from their messaging infrastructure, and what they mean by high availability, then making the right decisions should be a far less painful process.