High Availability Changes in Exchange Server 2013 Cumulative Update 1
Exchange Server 2013 Cumulative Update 1 (CU1) has been released and is now available for download! CU1 is the first release to use the servicing model introduced with Exchange 2013. CU1 includes new features, new functionality and bug fixes, including in the area of high availability. The announcement post on the Exchange Team Blog already has some great information on what’s new in CU1, but I wanted to augment that announcement with some additional details. Below is a list of some of the high availability-related changes in CU1. This is by no means an exhaustive list; just a list of some of the changes that we have made.
Witness Server Warning Message When Using Certain Database Availability Group Tasks
I first wrote about this issue back in June 2011. The system displays an incorrect warning message when you use a non-Exchange server as your witness server, even when you have configured things correctly. This issue was eventually fixed in Exchange 2010 Service Pack 2 RU5, but the fix didn’t make its way into Exchange 2013 RTM. Fortunately, it did make its way into CU1.
Exchange 2013 continues the innovation introduced in Exchange 2010 by including functionality that allows the system to self-recover from failures that affect resiliency or redundancy. In addition to the Exchange 2010 self-recovery behaviors, Exchange 2013 RTM includes additional behaviors for long I/O times, excessive memory consumption by the Microsoft Exchange Replication service (MSExchangeRepl.exe), and severe cases where threads can't be scheduled. For example, every 30 seconds, the Exchange Replication service heartbeats the crimson channel, as it is a required component for normal operations. If this heartbeat fails, an indication that the crimson channel is inaccessible for some reason, the Exchange Replication service self-recovers the server by forcibly rebooting the server, thereby triggering a server failover.
In addition to the behaviors in Exchange 2013 RTM, CU1 includes new behaviors:
- Bus Resets – Event 129 is logged in the System event log when a bus reset occurs. Bus resets, particularly the rare but possible bus reset storm, can often result in storage issues, such as hung I/O. Resolving these issues typically requires administrator intervention. To obviate the need for that intervention, CU1 includes new functionality that triggers a forcible reboot of the server when event 129 is detected in the System event log.
- Replication service endpoints not responding – Exchange verifies that the TCPListener component in the Exchange Replication service is responding to connection requests by periodically heartbeating the local instance of the Exchange Replication service. If the TCPListener does not respond, the system will automatically self-recover by forcibly rebooting the server.
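To make the endpoint heartbeat a little more concrete, here is a minimal Python sketch of the idea: probe a listener, and run a recovery action only when the probe gets no response. This is purely illustrative and is not how the Exchange Replication service is implemented. The names `tcp_probe` and `heartbeat_once` are hypothetical, and the `recover` callback is a stand-in for the forced reboot.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a probe that reports whether a TCP connection to host:port succeeds."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Refused, unreachable, or timed out: treat all as "not responding".
            return False
    return probe


def heartbeat_once(probe: Callable[[], bool], recover: Callable[[], None]) -> bool:
    """Run one heartbeat; invoke recover() only when the probe gets no response."""
    responding = probe()
    if not responding:
        recover()  # stand-in for the forced reboot; this sketch never reboots anything
    return responding
```

Keeping the probe and the recovery action as separate callables means the decision logic can be exercised with fakes, without opening a socket or touching a server.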
Automatic reseed, or AutoReseed, is a feature that replaces what is normally an administrator-driven action in response to a disk failure, database corruption event, or other issue that necessitates a reseed of a database copy. When properly configured, AutoReseed is designed to automatically restore database redundancy after a disk failure by using spare disks that have been provisioned on the system.
CU1 includes numerous fixes to AutoReseed, including fixes for issues around AutoReseed not detecting spare disks correctly and AutoReseed not using detected spare disks. In addition, the following enhancements have been made to AutoReseed:
- GetCopyStatus now has a new field 'ExchangeVolumeMountPoint', which shows the mount point of the database volume under C:\ExchangeVolumes (or a custom folder if you are not using the default setting of C:\ExchangeVolumes). This is useful information to know because in a configuration that uses multiple disks per volume, the LogicalDisk performance counters show up as the first mount point (which would be the one under C:\ExchangeVolumes) instead of as the database path like they used to in Exchange 2010 with a single disk per volume.
- We now have better internal tracking around mount paths and the ExchangeVolume path.
- The limits for AutoReseed have been increased from 4 databases per volume in Exchange 2013 RTM to 8 databases per volume in CU1.
- AutoReseed properties have been added to Active Directory that allow you to enable and disable automatic reseeding and the DiskReclaimer function (which formats Exchange volumes). The two new properties are:
- AutoDagAutoReseedEnabled - Setting AutoDagAutoReseedEnabled to false turns off AutoReseed (including automatic resume, sparing, and in-place reseeds).
- AutoDagDiskReclaimerEnabled - Setting AutoDagDiskReclaimerEnabled to false turns off the DiskReclaimer, which formats Exchange volumes. The default setting is true, and the DiskReclaimer only tries to format volumes mounted under C:\ExchangeVolumes.
- The unused AutoDagFailedVolumesRootFolderPath property was also removed from the DAG object.
As a result of these and other changes, the workflow for AutoReseed in CU1 has changed. The primary input condition for the AutoReseed workflow is still a database copy that is in a Failed and Suspended (F&S) state for 15 consecutive minutes. When that condition is detected, the following AutoReseed workflow is initiated:
- The system will first try to resume the database copy up to 3 times, with 5 minute sleeps in between each try. Sometimes, after an F&S database copy is resumed, the copy remains in a Failed state. This can happen for a variety of reasons, so this first step is designed to handle all such cases; AutoReseed will automatically suspend a database copy that has been Failed for 10 consecutive minutes to keep the workflow running. If the suspend and resume actions don’t result in a healthy database copy, the workflow continues.
- Next, AutoReseed will perform a variety of prerequisite checks. For example, it will verify that a spare disk is available, that the database and its log files are configured on the same volume, and that they are in the appropriate locations matching the required naming conventions. In a configuration that uses multiple databases per volume, AutoReseed will also verify that all database copies on the volume are in an F&S state.
- Next, AutoReseed will attempt to assign a spare volume up to 5 times, with 1 hour sleeps in between each try.
- Once a spare has been assigned, AutoReseed will perform an InPlaceSeed operation using the SafeDeleteExistingFiles seeding switch. If one or more database files exists, AutoReseed will wait for 2 days before in-place reseeding (based on the LastWriteTime of the database file). This provides an administrator with an opportunity to preserve data, if needed. AutoReseed will attempt a seeding operation up to 5 times, with 1 hour sleeps in between each try.
Once all retries are exhausted, the workflow stops. If, after 3 days, the database copy is still F&S, the workflow state is reset and it starts again from the first step. This reset/resume behavior is useful (and intentional) since it can take a few days to replace a failed disk, controller, etc.
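The retry counts and sleep intervals in the workflow above can be modeled in a few lines of Python. This is only a sketch of the sequencing as described in this post, not the actual AutoReseed code. The function names, the stage labels, and the behavior when a prerequisite check fails (stopping the pass) are my assumptions. The 2-day wait before an in-place reseed and the 3-day reset are left out to keep it short.

```python
from typing import Callable

MINUTE = 60
HOUR = 60 * MINUTE


def retry(action: Callable[[], bool], attempts: int, pause: int,
          sleep: Callable[[int], None]) -> bool:
    """Call action up to `attempts` times, sleeping `pause` seconds between tries."""
    for attempt in range(attempts):
        if action():
            return True
        if attempt < attempts - 1:
            sleep(pause)  # no sleep after the final failed attempt
    return False


def autoreseed_pass(resume: Callable[[], bool],
                    prereqs_ok: Callable[[], bool],
                    assign_spare: Callable[[], bool],
                    seed: Callable[[], bool],
                    sleep: Callable[[int], None]) -> str:
    """Run one pass of the modeled workflow and return the stage that ended it."""
    # Step 1: resume the copy up to 3 times, 5 minutes apart.
    if retry(resume, attempts=3, pause=5 * MINUTE, sleep=sleep):
        return "resumed"
    # Step 2: prerequisite checks (assumption: a failed check ends the pass).
    if not prereqs_ok():
        return "prerequisites-failed"
    # Step 3: assign a spare volume up to 5 times, 1 hour apart.
    if not retry(assign_spare, attempts=5, pause=1 * HOUR, sleep=sleep):
        return "no-spare"
    # Step 4: in-place seed up to 5 times, 1 hour apart.
    if retry(seed, attempts=5, pause=1 * HOUR, sleep=sleep):
        return "reseeded"
    return "seed-failed"
```

Passing `sleep` in as a parameter lets you replay the schedule instantly with a fake (for example, a list's `append` method) and see exactly how long each stage would wait.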
The Update-MailboxDatabaseCopy cmdlet includes some new parameters in CU1 that are designed to aid with automation of seeding operations. These parameters include:
- BeginSeed – This is useful for scripting reseeds, because with this parameter, the task asynchronously starts the seeding operation and then exits the cmdlet.
- MaximumSeedsInParallel – This is used with the Server parameter to specify the maximum number of parallel seeding operations that should occur across the specified server during a full server reseed operation. The default value is
- SafeDeleteExistingFiles – This is used to perform a seeding operation with a single copy redundancy pre-check prior to the seed. Because this parameter includes the redundancy safety check, it requires a lower level of permissions than the DeleteExistingFiles parameter, enabling a limited permission administrator to perform the seeding operation.
- Server – This is used as part of a full server reseed operation to reseed all database copies in a Failed and Suspended state. It can be used with the MaximumSeedsInParallel parameter to start reseeds of database copies in parallel across the specified server in batches of up to the value of the MaximumSeedsInParallel parameter copies at a time.
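To illustrate how the Server and MaximumSeedsInParallel parameters relate, here is a small Python sketch that picks out the Failed and Suspended copies on a server and groups them into batches. It models the batching idea only and does not call any Exchange cmdlet. The status string `"FailedAndSuspended"`, the function names, and the database names are hypothetical.

```python
from typing import Dict, Iterator, List, Sequence


def copies_to_reseed(status_by_copy: Dict[str, str]) -> List[str]:
    """Return the copies whose status is Failed and Suspended, in input order."""
    return [name for name, status in status_by_copy.items()
            if status == "FailedAndSuspended"]


def seed_batches(copies: Sequence[str], max_parallel: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most max_parallel copies."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    for start in range(0, len(copies), max_parallel):
        yield list(copies[start:start + max_parallel])
```

For example, five Failed and Suspended copies with a limit of 2 would be seeded as two batches of two followed by one batch of one.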
The Set-DatabaseAvailabilityGroup cmdlet includes a new parameter named SkipDagValidation. It is used to bypass the validation of the DAG's quorum model and the health check on the DAG's witness during certain DAG configuration operations. While this parameter has some usefulness for us in Exchange Online (and that is why it was introduced), and while it is enabled for on-premises use, it won’t be of much use to on-premises environments. I’m only pointing it out because, as I said, it is enabled for on-premises use.
Managed Availability: Get-ServerHealth and Get-HealthReport
The Get-ServerHealth and Get-HealthReport cmdlets are used to get and process raw health set data from Managed Availability, the new monitoring and recovery framework used by the various components within Exchange. Get-ServerHealth can be used to view the various health sets and their current status. In Exchange 2013 RTM, the Get-HealthReport cmdlet consumed results from Get-ServerHealth to produce a summary rollup of health. But the way in which it was implemented made it very slow and inefficient.
With CU1, you no longer need to pipe Get-ServerHealth to Get-HealthReport. Get-HealthReport is now capable of reporting the consolidated results on its own, and it now takes an Identity parameter that enables you to specify a server instead of InputObject/InputEntries. Get-HealthReport also includes a new HealthSet parameter, which is used to return the health state for a group of monitors. However, to use a rollup group, a list of names must be piped to Get-HealthReport. Unfortunately, Get-HealthReport -Identity does not support an array of names, so our recommended way to do this is to simply get the list of DAG members and pipe that to Get-HealthReport. For example, to display a rollup summary of transport health on members of a DAG, you would run:
(Get-DatabaseAvailabilityGroup DAG1).Servers | Get-HealthReport -RollupGroup -HealthSet HubTransport
There are a couple of changes for Get-ServerHealth in CU1; namely, two parameters have been added:
- HaImpactingOnly – This is used to display only monitors that have HaImpacting set.
- HealthSet – This is used to return the health state of a group of monitors.
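Conceptually, both parameters act as filters over the set of monitors. The Python sketch below shows that filtering on a hypothetical in-memory list. The `Monitor` record and its fields are assumptions made for illustration and do not reflect the objects that Get-ServerHealth actually returns.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Monitor:
    name: str           # hypothetical monitor name
    health_set: str     # the health set the monitor belongs to
    ha_impacting: bool  # True when the monitor is flagged as HA-impacting
    state: str          # for example "Healthy" or "Unhealthy"


def select_monitors(monitors: Iterable[Monitor],
                    health_set: Optional[str] = None,
                    ha_impacting_only: bool = False) -> List[Monitor]:
    """Keep monitors that match the optional health set and HA-impacting filters."""
    selected = []
    for monitor in monitors:
        if health_set is not None and monitor.health_set != health_set:
            continue  # wrong health set
        if ha_impacting_only and not monitor.ha_impacting:
            continue  # not flagged as HA-impacting
        selected.append(monitor)
    return selected
```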
Best Copy and Server Selection Changes
Best Copy and Server Selection (BCSS) is the algorithm used by Active Manager in Exchange 2013 to select the best database copy to activate in response to a failover or a target-less switchover. In CU1, a change was made so that the Primary Active Manager (PAM) now keeps track of the number of active databases per server, so that during BCSS it can honor the value of MaximumActiveDatabases, if configured. The server holding the PAM role now keeps an in-memory state that tracks the number of active databases per server. When the PAM role moves or when the Exchange Replication service is restarted on the PAM, this information is rebuilt from the cluster database.
This change allows Active Manager to exclude servers that are already hosting the maximum number of active databases when determining potential candidates for activation. Prior to this change, Active Manager would not evaluate whether a potential server candidate for activation was already at its configured active database limit. Thus, if such a server were selected for activation, the activation process would fail during the mount attempt, and a new server would have to be selected (if available). This scenario is now avoided as a result of this change.
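Here is a rough Python model of that idea: an in-memory count of active databases per server, rebuilt from a source of truth, and used to filter out candidates that are already at their limit. It is a sketch of the concept only and not the BCSS implementation. The class name, its methods, and the use of `None` to mean "no MaximumActiveDatabases configured" are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


class ActiveCountTracker:
    """Tracks active databases per server and filters activation candidates."""

    def __init__(self, limits: Dict[str, Optional[int]]) -> None:
        # limits maps server name -> configured maximum, or None if not configured
        self._limits = dict(limits)
        self._active: Counter = Counter()

    def rebuild(self, active_server_by_db: Dict[str, str]) -> None:
        """Recompute counts from a database -> active server mapping."""
        self._active = Counter(active_server_by_db.values())

    def record_activation(self, server: str) -> None:
        """Note that one more database is now active on the server."""
        self._active[server] += 1

    def eligible(self, candidates: Iterable[str]) -> List[str]:
        """Return candidates that are below their limit (or have no limit)."""
        result = []
        for server in candidates:
            limit = self._limits.get(server)
            if limit is None or self._active[server] < limit:
                result.append(server)
        return result
```

In this model, `rebuild` plays the part of reloading state after the PAM role moves or the service restarts, and `eligible` plays the part of the pre-activation check that avoids a failed mount attempt.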
Other CU1 Changes
Of course there are other changes in CU1 besides the above, so be sure to read the Release Notes and other appropriate documentation when everything is released.