Multiple Datacenter Deployment Guidance
Deploying an application to more than one datacenter can provide benefits such as increased availability and a better user experience across wider geographical areas. However, there are challenges that must be resolved, such as data synchronization and regulatory limitations.
Why Deploy to Multiple Datacenters?
Organizations typically begin by deploying their applications to a single datacenter. This may be the local on-premises datacenter, or a remote environment such as a traditional hosting provider or a cloud provider. In addition to advantages such as easier scaling, improved reach, and cost effectiveness, cloud hosting typically offers service level guarantees of availability and throughput that can help to make applications more robust and available. However, a single deployment of an application is still at risk of becoming unavailable due to failure.
Deploying an application to more than one datacenter, and even to more than one hosting provider, can reduce the chance that events outside of your control (such as a failure of a datacenter, or global Internet connectivity issues between countries or regions) will cause it to become unavailable. For vital commercial applications such as ecommerce sites, where even short periods of downtime or reduced availability can have a huge impact on profitability, deployment to multiple datacenters is a solution you might consider.
Therefore, you might choose to deploy to multiple datacenters for one or more reasons, including:
- Growing capacity over time. Applications often start out as a single deployment, either on-premises or in the cloud, and grow over time as demand increases. This growth may be provided for by expanding deployment to a sub-region or availability set, to multiple sub-regions or availability sets, and then to full multi-region deployment. In some cases, parts of the application may be split between on-premises and the cloud, following a hybrid approach, move to the cloud, and then expand to multiple regions.
- Providing global reach with minimum latency for users. You can maintain multiple running versions of the application in datacenters located near the majority of users, and route users to the one that provides the best performance and lowest connection latency.
Maintaining performance and availability. Deployment to more than one datacenter, and perhaps to more than one hosting provider, is a useful technique for reducing the risk that an application will become unavailable. Common scenarios are:
Providing additional instances for resiliency. Deploying to multiple datacenters, and choosing the number of instances based on demand at each datacenter, can improve resiliency and availability because there are additional alternative instances to which users can be re-routed if performance of one instance or datacenter is degraded, or if one instance should fail altogether.****
Deploying an application to multiple locations will not protect it from failure due to factors that are your fault, such as poor design or errors in the code.
Providing a facility for disaster recovery. If the primary deployment fails, or becomes unavailable for any reason, you can start up the alternative offline deployment that is located in a different datacenter and switch requests to it while you resolve issues with the primary deployment. The backup deployment might be located in your own on-premises datacenter, and might even be designed to provide only minimal services while the primary deployment is brought back into use. Ideally, the process of recovering from backup and recreating the production environment should require the minimum number of tasks, be fully automated to minimize startup time, and must be tested regularly.
- Providing a hot-swap standby capability. You can maintain a second running deployment of the application (or more than one additional deployment), and instantly switch requests to the standby deployment if the primary one fails or an event occurs in the datacenter where the primary deployment is located. To minimize cost the standby application could be scaled down, perhaps with a fewer number of instances, and then rapidly ramped up through scripting or autoscaling when required. This is sometimes referred to as a “warm standby” approach.
Routing Requests to Multiple Application Deployments
Deploying an application in more than one datacenter, and choosing datacenters that are local to the majority of users, can provide advantages in availability and user experience, but it does not provide a complete solution. For example, users that are accessing a US-hosted application will not be automatically redirected to an alternative deployment if the US deployment fails.
To resolve this issue you might decide to distribute requests using a round-robin approach, or by using a manual, custom, or third party mechanism to detect application failures and redirect traffic.
Round-robin routing is a traditional way to distribute requests to multiple deployments of an application or website, with each request being routed to the next deployment in a list and then starting again with the first item in the list. It can be implemented by using multiple entries for a domain name in a DNS server, each with an IP address that points to one deployment of the application. The DNS service will automatically hand out a different IP address to each DNS lookup request from users, working repeatedly though the list.
However, round robin routing is not an ideal solution in most cases because it will continue to hand out the IP address of failed deployments, and so some users will find the application is unavailable. Instead, you should consider a solution that routes requests only to deployments that are available, such as the solutions described in the following sections:
- Manual re-routing on application failure.
- Automated re-routing on application failure.
- Re-routing with Microsoft Azure Traffic Manager.
Manual Re-Routing On Application Failure
When using multiple deployments that act as disaster recovery or hot-swap services, you may choose the simplest approach of manually routing users to the appropriate deployment of the application. You can do this by changing the DNS entry for the application, or by using a redirection page. However, both of these approaches suffer from several issues:
- You must be able to quickly detect a failure (indicated by your application monitoring system) and perform the manual changeover. This might involve starting up a backup deployment in another datacenter and validating that it is operating correctly if you do not have a hot-swap deployment already running. If the failure occurs out of business hours, when nobody is on site, there may be additional delays in the changeover.
- If you reroute requests by changing the DNS entry it may take several hours, and even one or two days, before all of the global DNS servers pick up the change and start routing to the backup deployment. You can mitigate this to some extent by specifying a short expiration interval in the DNS records, though some global DNS caches may ignore this and apply their own expiration period. In addition, most client devices such as web browsers will cache the result of the DNS lookup for a specific period (typically 30 minutes) to reduce the number of DNS lookups required, and so users will be routed to the same deployment during that period.
- If you use a redirection page or mechanism to reroute requests to the backup copy, this becomes a single point of failure. If users cannot access the redirection mechanism they will not be able to access the backup application. For example, using a page in a separate website to redirect to the required deployment will not work if this website is unavailable for any reason.
- You must be prepared to manually change the routing back to the failed deployment after it has been recovered. This may take some time to propagate when using DNS routing, which will extend the period that it is unavailable.
Automated Re-Routing On Application Failure
To avoid delays when an application deployment becomes unavailable you might choose to use an automated mechanism that monitors each deployment and routes requests to the most suitable one that is operating. All requests for the application are initially routed to the automated mechanism, which then redirects the request to the appropriate deployment.
The way that you design or configure the mechanism depends on the reason you are using multiple datacenter deployments:
- For a disaster recovery scenario the automated mechanism must be able to start up the backup deployment and verify that it is operating, and then route users to it instead of the primary deployment.
- For a hot-swap scenario the automated mechanism needs only to verify that the backup deployment is working, and then route users to it.
- For a global reach scenario the automated mechanism may examine the request and route to the appropriate deployment. Requests from web browsers and many other client devices will include a language code that broadly indicates the country or region of the user, and so the mechanism can route to the datacenter nearest to the user. See Accept-Language used for locale setting on the W3C website for more information about browser language codes.
An advantage of a custom implementation is that you have complete flexibility in how you design the probe that checks for availability. For example, you can perform a range of tests on the application to ensure that it is performing correctly, perhaps by measuring the response and execution times of individual components and services.
For more information about checking the operation of an application see Heath Endpoint Monitoring pattern.
However, there are some issues to be aware of if you design a custom automated routing implementation:
- You risk the effects of a single point of failure. This can be partly mitigated by deploying multiple instances of the mechanism in different datacenters and using DNS round robin to distribute requests between them. However, managing these multiple deployments means considering how you will replicate configuration and communicate changes to the routing tables they use.
- You should include a delay in the rerouting mechanism to prevent transient connectivity issues from causing intermittent flipping between deployments. Good practice is to wait until more than one probe to the applications fails before initiating a changeover to an alternative deployment.
- If you are using the alternative deployment as a disaster recovery solution it will probably not be running, and so the mechanism will need to start the application and verify it is operating correctly before switching requests to it.
- You should consider how you will route requests back to the failed application after it has been recovered. Typically, you will need to incorporate a mechanism that detects availability and updates the routing as necessary.
The alternative is to hand off the problem to a commercial organization that offers a global IP routing solution. These services usually include detection of availability, optimization based on user location, and other features. They are typically distributed services designed to be highly resilient and available, and are usually a better choice than attempting to implement your own solution. Some examples of providers of these services are Akamai, SoftLayer, and Microsoft Azure Traffic Manager (which is described in the following section).
Re-Routing with Azure Traffic Manager
Azure Traffic Manager is an intelligent DNS service built into Azure that combines application failure detection with dynamic DNS routing. However, because it acts as a DNS server and has very short expiration times for its DNS records, it does not suffer the delays inherent in a custom solution.
Traffic Manager maintains a list of the typical response times of the network paths to each of the Azure global datacenters. You configure the service by specifying the location of your application deployments, and an endpoint in each one that will respond within ten seconds when Traffic Manager pings that endpoint. You also choose a policy that defines how Traffic Manager will behave.
The Round-Robin policy routes requests to the application deployment in each datacenter in turn. It detects failed application deployments and does not route to these in order to maintain availability, but this may not provide users with the best response times due to network latency. The main advantage of this policy is that it distributes load across all working deployments of the application.
The Failover policy allows you to configure a prioritized list of application deployments, and Traffic Manager will route requests to the first one in the list that it detects is responding to requests. If that application fails, Traffic Manager will route requests to the next application deployment in the list, and so on. Again this may not provide users with the best response times due to network latency, but it works well if you want all requests to go to a single deployment as long as it is responding. It is typically used for hot-swap scenarios, where the backup application is only accessed when the primary deployment is unavailable.
However, for the best combination of availability and performance you should use the aptly named Performance policy. This policy routes requests from users to the application deployment in the datacenter that provides the lowest network latency (it may not be the closest in purely geographical terms). It also detects failed applications and does not route to these, instead choosing the next closest working application deployment.
When a failed deployment comes back online, Traffic Manager will automatically detect this and include it in the routing table.
For more information see Reducing Network Latency for Accessing Cloud Applications with Microsoft Azure Traffic Manager in the p&p guide Building Hybrid Applications in the Cloud on Microsoft Azure and Traffic Manager on the Azure website.
Considerations for Multiple Datacenter Deployment
If you do decide to deploy your application to more than one datacenter, you must be aware of some additional issues that this raises. For example, you should consider the following:
- Datacenter location and domain names. Consider how you can specify the location where you will host the application and services it uses, and how this relates to domain names you choose.
- Cloud hosting providers usually allow you to specify a region for each deployment. In addition, you may be able to select a sub-region or an availability set, though the actual meaning of these terms varies. Typically, an availability set allows you to specify that deployments and services should be close together with low latency, but physically separated to enhance availability in case of issues that affect a datacenter. To enable efficient allocation of resources, providers often do not allow you to specify the actual datacenter—and in these cases using an availability set or a sub-region is the only way to specify the location for deployment.
- Each region or geographical deployment must have a unique domain name. A common approach is to use country-specific top level domain (TLD) names such as .co.us or .au. However, this requires multiple registrations of TLDs with domain registries around the world. It may be more cost effective to use subdomain names based on a single root domain names such as us.myapp.com, emea.my.com, asia.myapp.com and point each one to the appropriate deployment of the application. For multi-tenant applications you can extend this with a tenant identifier if required; for example, adatum.us.myapp.com and adatum.emea.my.com
- Regulatory or SLA restrictions. You must take into account service availability obligations and any legal restrictions that may apply. These might include:
- Local or international restrictions on the location of the application and the data it uses, or movement of this data. For example, in some countries or regions it is illegal to export data outside of a specific area, even just for processing. The location for storage of some types of data is also strictly mandated by law in some countries or regions, or may be specified in an SLA.
- Availability requirements may be defined in an SLA or guaranteed in service documentation. There may be mandatory requirements for the recovery point objective (RPO–the amount of data that may be lost during a failure) or the recovery time objective (RTO–the time taken to recover after a failure). Depending on the routing technology used it may take several minutes, or even hours, for the routing mechanism to discover a service has failed, perhaps retry after a period to confirm it was not a transient error, set up a new routing, and for the changes to reach the user (for example, the time taken for a DNS change to propagate). Take these factors into account before agreeing recovery objectives with clients.
- Data synchronization. Different deployments of your application are likely to use different local data stores, and users may be routed to a datacenter where data is not available. Examples include:
- Separate databases or data stores, where the data in each one may not be fully consistent with that in another datacenter if synchronization has not completed. In disaster recovery scenarios where the backup application is not running, it may be necessary to import or synchronize all of the data before the backup deployment can be used. If the application cannot function until all of the data is synchronized, this must be accounted for when agreeing the RTO. If some data may not have been replicated, and has been lost, this must be accounted for when agreeing the RPO. See the Data Replication and Synchronization Guidance for more information about data synchronization across application deployments.
- Separate caching services, such as local caches on each server and distributed cache services located in each datacenter. Your application must be able to refresh the cache if data it expects to find there is not available. For more information see the Cache-Aside pattern.
- Data and service availability. Depending on how you design and configure your application, some data and services may not be available in all of the datacenters. In some cases you may be able to design the application to downgrade its behavior or functionality automatically or through configuration. Examples of lack of availability include:
- If you partition your data in different locations, which can minimize data transfer costs and reduce the chances of conflicting updates, the data required by the application for a user rerouted from another datacenter may not be available.
- Some services you rely on require an absolute path, domain, or URL to be specified; examples are queues, authentication mechanisms, and external services. If there is only a single instance of these services, and the datacenter where they are located is unavailable, the application is likely to fail. You may need to configure multiple instances of these services in different datacenters and specify separate instances for each deployment of the application. This can make configuration and deployment more complex.
- Application versions and functionality. You may choose to deploy different versions of your application in each datacenter. Examples of this include:
- In the global reach scenario, you may decide to deploy localized versions in each datacenter. However, this may be an issue when users are rerouted to a more distant datacenter when one deployment fails. A better approach is to use a single version that has built-in locale and accept-language detection, and so is suitable for all users.
- You may decide to deploy a different number of instances of the application in each datacenter based on the average load under normal conditions. However, consider using an autoscaling solution so that each datacenter can cope with the extra load from rerouted users when one datacenter or deployment becomes unavailable. Alternatively, include the capability for the application to degrade functionality to manage the additional load for short periods; perhaps by temporarily disabling certain services or providing a simplified UI.
- For hot swap and disaster recovery scenarios it may be possible to deploy a reduced functionality version of the application in backup locations to minimize cost and complexity. However, you must consider the impact of this on both the organization and customers should the primary deployment be unavailable for long periods.
- Consider if you should partition the functionality of your application so that some non-essential services are available only in one or a few datacenters. This can simplify configuration and deployment. For example, the chapter Maximizing Availability, Scalability, and Elasticity in the p&p guide Developing Multi-tenant Applications for the Cloud demonstrates how an organization maximizes availability by using multiple deployments of the public website where users complete surveys, but only a single instance of the subscriber website where customers configure surveys. A short interruption to the availability of the subscriber website is less crucial than non-availability of the public website.
- Testing and deployment. When an application runs in multiple datacenters it is vital to perform thorough testing to ensure that the application will perform correctly in every location, and to plan how and when you will manage deployments, updates, and failures. For example:
- Consider how you will upgrade to new versions of the application across all datacenters. Updating the deployment in one datacenter at a time reduces the chance of a global failure through incorrect configuration or an error in the application, and allows each deployment to be validated and the performance checked in a live environment.
- It may be an advantage to schedule the update in datacenters in different time zones to coincide with the period when the application running there is less heavily loaded.
- Consider using automated mechanisms such as scripts or utilities that can be configured for deployment to multiple datacenters, and can ensure that the appropriate configuration is applied to each datacenter deployment.
- It may be necessary to roll back individual failed or underperforming updates after deployment, and restore the previous version. Consider how you can use the features of the hosting environment, and other tools and utilities, to achieve this.
- Be prepared to test resilience, autoscaling, rerouting, and failover features of the application by terminating services. Measure the impact on customer experience, the immediate and ongoing effects, and the time taken to recover. Fine tune the policies and deployments to minimize the impact on users.
- Customer experience. The routing solution you adopt should prevent regular or random flipping between datacenters, and only switch over when it can be definitely determined that the current deployment of the application is unavailable. However, the effects of routing to different deployments of your application may have side-effects you should plan to accommodate. These include:
- Session management. For example, a user rerouted from another datacenter may need to sign on again, and may lose information that was cached locally such as a shopping cart.
- Increased latency or reduced performance. For example, if users are rerouted to a very distant datacenter the additional delays might make the application less responsive and usable.
- Application instability or errors. If the applications in each datacenter use different instances of features such as queues or third party services, messages and state held by these services will not be available. This may cause the application to behave in unpredictable ways.
You may consider displaying a message indicating to the user that they have been rerouted to another version of the application when this occurs to help mitigate complaints, and to explain the reason for any errors or unexplained behavior of the application. You might achieve this by using a cookie to identify the datacenter that delivered the most recent output to the client, and checking this with each request to see if it is different from the current datacenter.
Related Patterns and Guidance
The following patterns and guidance may also be relevant to your scenario when deploying your application in more than one datacenter:
- Compute Partitioning Guidance. It is often useful to partition applications into smaller segments, or separate functional areas such as public and administrative websites, when deploying them to more than one datacenter. The Compute Partitioning guidance describes how to allocate the services and components in a cloud-hosted application in a way that helps to minimize running costs while maintaining the scalability, performance, availability, and security of the service.
- Data Replication and Synchronization Guidance. When applications are deployed to more than one datacenter, it is common to deploy multiple copies of the data that they use to maintain performance and availability. The Data Replication and Synchronization guidance summarizes issues related to replicating the data each instance of the application uses, and how you will synchronize changes to the data.
- Federated Identity Pattern. Applications often need to authenticate users, and this may be more complicated when applications are deployed to more than one datacenter. The Federated Identity pattern describes how an application can delegate authentication to an external identity provider in order to simplify development, minimize the requirement for user administration, and reduce the management overheads for the application.
- Accept-Language used for locale setting on the W3C website.
- Global IP routing service providers: Akamai, SoftLayer, and Microsoft Azure Traffic Manager.
- For more information about using Azure Traffic Manager, see:
- Recovery Point Objective (RPO) on Wikipedia.
- Recovery Time Objective (RTO) on Wikipedia.
- The chapter Maximizing Availability, Scalability, and Elasticity in the guide Developing Multi-tenant Applications for the Cloud.