High Availability and Site Resilience

Article
06/03/2009

[This is pre-release documentation and subject to change in future releases. This topic's current status is: Writing Not Started.]

Microsoft Exchange Server 2010 reduces the cost and complexity of deploying an e-mail solution that provides the highest levels of server availability and site resilience. Building on the native replication capabilities introduced in Exchange Server 2007, the new high availability architecture in Exchange 2010 provides a simplified, unified framework for both high availability and disaster recovery. Exchange 2010 integrates high availability into the core architecture of Exchange, enabling customers of all sizes and in all segments to be able to economically deploy a messaging continuity service in their organization.

Exchange 2007 decreased the costs of high availability and made site resilience much more economical by introducing new technologies such as local continuous replication (LCR) cluster continuous replication (CCR) and standby continuous replication (SCR). Still, some challenges remained:

Some administrators were intimidated by the complexity of Windows failover clustering.
Achieving a high level of uptime can require a high level of administrator intervention.
Each type of continuous replication was managed differently and separately.
Recovering from a failure of a single database on a large mailbox server could result in a temporary disruption of service to all users on the mailbox server.
Site resilience solutions were not seamless.
The transport dumpster feature of the Hub Transport server could only protect messages destined for mailboxes in an LCR or CCR environment. If a Hub Transport server fails while processing messages and cannot be recovered, it could result in data loss.

Exchange 2010 includes significant core changes that integrate high availability deep in its architecture, making it even less costly and easier to deploy and maintain than Exchange 2007 for all customers. Organizations can now deploy a fully-redundant Exchange organization with just two servers, and benefit from database-level failovers. Customers benefit from automatic, database-level failover capabilities without having to become experts in Windows failover clustering. Moreover, you can add site resilience to you existing high availability deployments with less complexity.

Exchange 2007 introduced many new architectural changes designed to make deploying high availability and site resiliency solutions for Exchange faster and simpler. These improvements included an integrated Setup experience, optimized out-of-box configuration settings, and the ability to manage most aspects of the high availability solution using native Exchange management tools.

Still, management of an Exchange 2007 high availability solution required administrators to master some clustering concepts, such as the concept of moving network identities and managing cluster resources. In addition, when troubleshooting issues related to a clustered mailbox server, administrators has to use Exchange tools and cluster tools to review and correlate logs and events from two different sources: one from Exchange and one from the cluster.

Two other limiting aspects of the Exchange 2007 architecture have also been re-evaluated and re-engineered based on customer feedback:

Clustered Exchange 2007 servers require dedicated hardware. Only the Mailbox server role could be installed on a node in the cluster. This meant that a minimum of four Exchange servers were required in order to achieve full redundancy of the primary components of a deployment, i.e., the core server roles (Mailbox, Hub Transport, and Client Access).
In Exchange 2007, failover of a clustered mailbox server occurs at the server level. As a result, if a single database failure occurred, the administrator had to failover the entire clustered mailbox server to another node in the cluster (which resulted in brief downtime for all users on the server, and not just those users with a mailbox on the affected database), or leave the users on the failed database offline (potentially for hours) while restoring the database from backup.

Exchange 2010 has been re-engineered around the concept of continuous availability, in which the architecture has changed so that automatic failover protection is now provided at the individual mailbox database level instead of at the server level. In Exchange 2010, this is known as database mobility. As a result of this and other database cache architectural changes, failover actions now complete much faster than in previous versions of Exchange. For example, failover of a clustered mailbox server in a CCR environment running Exchange 2007 with Service Pack 1 completes in about 2 minutes. By comparison, failover of a mailbox database in an Exchange 2010 environment completes in 30 seconds (measured from the time when the failure is detected to when a database copy is mounted, assuming the copy is healthy and up-to-date with log replay). The combination of database-level failovers and significant faster failover times dramatically improves an organization's overall uptime.

The continuous availability architecture built into Exchange 2010 provides new benefits for organizations and their messaging administrators:

Multiple server roles can co-exist on servers that provide high availability. This enables small organizations to deploy a two-server configuration provides full redundancy of mailbox data, while also providing redundant Client Access and Hub Transport services.
An administrator no longer needs to build a failover cluster in order to achieve high availability. Failover clusters are now created by Exchange 2010 in a way that is invisible to the administrator. Unlike previous versions of Exchange clusters which used an Exchange-provided cluster resource DLL named ExRes.dll, Exchange 2010 no longer needs or uses a cluster resource DLL. Exchange 2010 uses only a small portion of the failover cluster components, namely, its heartbeat capabilities and the cluster database, in order to provide database mobility.
Administrators can add high availability to their Exchange 2010 environment after Exchange has been deployed, without having to uninstall Exchange and then re-deploy in a highly availability configuration.
Exchange 2010 provides a view of the event stream that combines the events from the operating system with the events from Exchange.
Because storage group objects no longer exist in Exchange 2010, and because mailbox databases are portable across all Exchange 2010 Mailbox servers, it is very easy to move databases when needed.

For more information, see Understanding Mailbox Database Availability.

Changes to High Availability from Previous Versions of Exchange

Exchange 2010 includes many changes to its core architecture. Two prominent features from Exchange 2007, namely CCR and SCR, have been combined and evolved into a single framework called a database availability group (DAG). The DAG handles both on-site data replication and off-site data replication, and forms a platform that makes operating a highly available Exchange environment easier than ever before. Other new high availability concepts are introduced in Exchange 2010, such as database mobility, and incremental deployment. The concepts of a backup-less and RAID-less organization are also being introduced in Exchange 2010.

In a nutshell, the key aspects to data and service availability for the Mailbox server role and mailbox databases are:

Exchange 2010 uses an enhanced version of the same continuous replication technology introduced in Exchange 2007. See the section below entitled "Changes to Continuous Replication from Exchange Server 2007" for more information.
Storage groups no longer exist in Exchange 2010. Instead, there are simply mailbox databases and mailbox database copies, and public folder databases. The primary management interfaces for Exchange databases has moved within the Exchange Management Console from the Mailbox node under Server Configuration to the Mailbox node under Organization Configuration.
Some Windows Failover Clustering technology is used by Exchange 2010, but it is now completely managed under-the-hood by Exchange. Administrators do not need to install, build or configure any aspects of failover clustering when deploying highly available Mailbox servers.
Each Mailbox server can host as many as 100 databases. In this Beta release of Exchange 2010, each Mailbox server can host a maximum of 50 databases. The total number of databases equals the combined number of active and passive databases on a server.
Each mailbox database can have as many as 16 copies.
In addition to the transport dumpster feature, a new Hub Transport server feature named shadow redundancy has been added. Shadow redundancy provides redundancy for messages for the entire time they are in transit. The solution involves a technique similar to the transport dumpster. With shadow redundancy, the deletion of a message from the transport database is delayed until the transport server verifies that all of the next hops for that message have completed delivery. If any of the next hops fail before reporting back successful delivery, the message is resubmitted for delivery to that next hop. For more information about shadow redundancy, see Understanding Shadow Redundancy.

Database Availability Groups

A database availability group (DAG) is a set of up to 16 Mailbox servers that provide automatic database-level recovery from failures that affect individual databases. Any server in a DAG can host a copy of a mailbox database from any other server in the DAG. When a server is added to a DAG, it works with the other servers in the DAG to provide automatic recovery from failures that affect mailbox databases, such as a disk failure or server failure.

When an administrator creates a DAG, it is initially empty, and an object is created in Active Directory that represents the DAG. The directory object is used to store relevant information about the DAG, such as server membership information. When an administrator adds the first server to a DAG, a failover cluster is automatically created for the DAG. In addition, the infrastructure that monitors the servers for network or server failures is initiated. The failover cluster heartbeat mechanism and cluster database are then used to track and manage information about the DAG that can change quickly, such as database mount status, replication status, and last mounted location.

For more information about DAGs, see Managing Database Availability Groups.

Mailbox Database Copies

The high availability and site resilience features first introduced in Exchange 2007 are used in Exchange 2010 to create and maintain database copies, thereby enabling you to achieve your availability goals in Exchange 2010. Exchange 2010 also introduces the new concept of database mobility, which is Exchange-managed database-level failovers.

Database mobility disconnects databases from servers and adds support for up to 16 copies of a single database and it provides a native experience for adding database copies to a database. In Exchange 2007, a feature called database portability also enabled you to move a mailbox database between servers. A key distinction between database portability and database mobility, however, is that all copies of a database have the same GUID.

Other key characteristics of database mobility are:

Because storage groups have been removed from Exchange 2010, continuous replication now operates at the database level. In Exchange 2010, transaction logs are replicated to one or more other Mailbox servers, and replayed into a copy of a mailbox database that is stored on those servers.
A failover or switchover can occur at either the database level or at the server level. Tasks for server-level failover are not in the current build of Exchange 2010, but are expected to be functional in the final release of Exchange 2010.
Database names for Exchange 2010 must be unique within the Exchange organization.
When a mailbox database has been configured with one or more database copies, the full path for all database copies must be identical on all Mailbox servers that host a copy.
Any mailbox database copy (the active or any passive copy) can be backed up using an Exchange-aware VSS-based backup application.

For more information about mailbox database copies, see Managing Mailbox Database Copies.

Changes to Continuous Replication from Exchange Server 2007

The underlying continuous replication technology previously found in CCR and SCR remains in Exchange 2010, and it has been further evolved to support new high availability features such as database copies, database mobility, and database availability groups. Some of these new architectural changes are briefly described below:

Because storage groups have been removed from Exchange 2010, continuous replication now operates at the database level. Exchange 2010 still uses an Extensible Storage Engine (ESE) database that produces transaction logs which are replicated to one or more other locations and replayed into one or more copies of a mailbox database.
Log shipping and seeding no longer uses Server Message Block (SMB) for data transfer. Exchange 2010 continuous replication uses a single administrator-defined TCP port for data transfer. In addition, Exchange 2010 includes built-in options for network encryption and compression for the data stream.
Database copies are for mailbox databases only. For redundancy and high availability of public folder databases, we recommend that you use public folder replication. Unlike CCR, where multiple copies of a public folder database could not exist in the same cluster, you can use public folder replication to replicate public folder databases between servers in a DAG.

Several concepts used in Exchange 2007 continuous replication also remain in Exchange 2010. These include the concepts of failover management, divergence, the use of the auto database mount dial, and the use of public and private networks.

Incremental Deployment

In previous versions of Microsoft Exchange, service availability for the Mailbox server roles was achieved by deploying Exchange in a Windows failover cluster. To deploy Exchange in a cluster, you had to first build a failover cluster, and then install the Exchange program files. This process created a special mailbox server called a clustered mailbox server (or Exchange Virtual Server in really old versions of Microsoft Exchange). If you had already installed the Exchange program files on a non-clustered server and you decided you wanted a clustered mailbox server, you had to build a cluster using new hardware, or remove Exchange from the existing server, install failover clustering, and reinstall Exchange.

Exchange 2010 introduces the concept of incremental deployment, which enables you to deploy service and data availability for all Mailbox servers and databases after Exchange is installed. Service and data redundancy is achieved by using new features in Exchange 2010 such as database availability groups and database copies.

Backup-less Exchange Organization

There are several changes to the core architecture of Exchange 2010 that have a direct effect on the backup or restore of Exchange databases. One significant change is the removal of storage groups. In Exchange 2010, each database is associated with a single log stream, represented by a series of 1 megabyte (MB) log files. Each Mailbox server can host a maximum of 100 databases.

Note

In this Beta version, each Mailbox server can host a maximum of 50 databases.

Another significant change for Exchange 2010 is that databases are no longer closely tied to a specific Mailbox server. Database Mobility expands the system's use of continuous replication by replicating a database to multiple, different servers. This provides better protection of the database and increased availability. In the case of failures, the other servers that have copies of the database can mount the database.

Because you can have multiple copies of a database hosted on multiple servers, you can effectively have backup-less Exchange organization.

Incremental Resync

Exchange 2007 introduced the concepts of lost log resilience (LLR) and incremental reseed. LLR is an internal component of ESE that enables you to recover Exchange mailbox databases even if one or more of the most recently generated transaction log files have been lost or damaged. LLR enables a mailbox database to mount even when recently generated log files are unavailable. LLR works by delaying writes to the database until the specified number of log generations have been created. LLR delays recent updates to the database file for a short time. The length of time that writes are delayed depends on how quickly logs are being generated.

Note

Currently, LLR is hard-coded to 1 log; however, this number may change in future releases of Exchange 2010.

Incremental reseed provided the ability to correct divergences in the transaction log stream between a source and target storage group, by relying on the delayed replay capabilities of LLR. Incremental reseed did not provide a means to correct divergences in the passive copy of a database, once divergent logs had been replayed, which forced the need for a complete reseed.

In Exchange 2010, incremental resync is the new name for the feature that automatically corrects divergences in database copies under the following conditions:

After an automatic failover for all of the configured copies of a database
When a new copy is enabled and some database and log files already exist at the copy location
When replication is resumed following a suspension or restarting of the Microsoft Exchange Replication service.