Introducing Geo-replication for Windows Azure Storage

09/15/2011

We are excited to announce that we are now geo-replicating customers’ Windows Azure Blob and Table data, at no additional cost, between two locations hundreds of miles apart within the same region (i.e., between North and South US, between North and West Europe, and between East and Southeast Asia). Geo-replication is provided for additional data durability in case of a major data center disaster.

Storing Data in Two Locations for Durability

With geo-replication, Windows Azure Storage now keeps your data durable in two locations. In both locations, Windows Azure Storage constantly maintains multiple healthy replicas of your data.

The location where you read, create, update, or delete data is referred to as the primary location. The primary location exists in the region you choose at the time you create an account via the Azure Portal (e.g., North Central US). The location where your data is geo-replicated is referred to as the secondary location. The secondary location is automatically determined based on the location of the primary; it is in the other data center that is in the same region as the primary. In this example, the secondary would be located in South Central US (see the table below for the full listing).

The primary location is currently displayed in the Azure Portal. In the future, the Azure Portal will be updated to show both the primary and secondary locations. To view the primary location for your storage account in the Azure Portal, click on the account of interest; the primary region is displayed on the lower right side under Country/Region.

The following table shows the primary and secondary location pairings:

Primary	Secondary
North Central US	South Central US
South Central US	North Central US
East US	West US
West US	East US
North Europe	West Europe
West Europe	North Europe
South East Asia	East Asia
East Asia	South East Asia
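
As a rough illustration, the pairings in the table can be captured in a small lookup. This is only a sketch: the names GEO_PAIRS and secondary_for are hypothetical helpers, not part of any Windows Azure API, and the location strings are copied from the table as written.

```python
# Primary -> secondary pairings, transcribed from the table above.
# GEO_PAIRS and secondary_for are hypothetical names used for illustration.
GEO_PAIRS = {
    "North Central US": "South Central US",
    "South Central US": "North Central US",
    "East US": "West US",
    "West US": "East US",
    "North Europe": "West Europe",
    "West Europe": "North Europe",
    "South East Asia": "East Asia",
    "East Asia": "South East Asia",
}


def secondary_for(primary: str) -> str:
    """Return the secondary location paired with the given primary location."""
    try:
        return GEO_PAIRS[primary]
    except KeyError:
        raise ValueError(f"unknown primary location: {primary!r}") from None


# Every pairing in the table is symmetric: the secondary's secondary is the primary.
assert all(GEO_PAIRS[secondary] == primary for primary, secondary in GEO_PAIRS.items())

print(secondary_for("North Central US"))
```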

Geo-Replication Costs and Disabling Geo-Replication

Geo-replication is included in current pricing for Azure Storage. This is called Geo Redundant Storage.

If you do not want your data geo-replicated, you can disable geo-replication for your account. This is called Locally Redundant Storage (LRS), and it is priced 23% to 34% lower (depending on how much data is stored) than geo-replicated storage. See here for more details on Locally Redundant Storage.

When you turn geo-replication off, the data will be deleted from the secondary location. If you decide to turn geo-replication on again after you have turned it off, there is a re-bootstrap egress bandwidth charge (based on the data transfer rates) for copying your existing data from the primary to the secondary location to kick start geo-replication for the storage account. This charge will be applied only when you turn geo-replication on after you have turned it off. There is no additional charge for continuing geo-replication after the re-bootstrap is done.

Currently all storage accounts are bootstrapped and in geo-replication mode between primary and secondary storage locations.

How Geo-Replication Works

When you create, update, or delete data in your storage account, the transaction is fully replicated on three different storage nodes across three fault domains and upgrade domains inside the primary location, and only then is success returned to the client. Afterwards, in the background, the primary location asynchronously replicates the recently committed transaction to the secondary location. That transaction is then made durable by fully replicating it across three different storage nodes in different fault and upgrade domains at the secondary location. Because the updates are geo-replicated asynchronously, there is no change in existing performance for your storage account.
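
The write path described above can be sketched as a toy in-memory model. This is not how the service is implemented; the classes Location and GeoReplicatedAccount and the single pending queue are assumptions made purely to illustrate the order of events: replicate three ways at the primary, return success, then geo-replicate in the background.

```python
from collections import deque


class Location:
    """Toy model of a storage location that keeps three replicas of each transaction."""

    REPLICAS = 3

    def __init__(self, name):
        self.name = name
        self.replicas = [[] for _ in range(self.REPLICAS)]

    def commit(self, txn):
        # A transaction counts as durable here only once all three replicas hold it.
        for replica in self.replicas:
            replica.append(txn)

    def committed(self):
        return list(self.replicas[0])


class GeoReplicatedAccount:
    """Hypothetical storage account with a primary and a secondary location."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()  # committed at the primary, not yet geo-replicated

    def write(self, txn):
        self.primary.commit(txn)   # synchronous, inside the primary location
        self.pending.append(txn)   # queued for background geo-replication
        return "success"           # the client sees success before geo-replication

    def geo_replicate(self, max_txns=None):
        """Background step: copy pending transactions to the secondary, in order."""
        count = len(self.pending) if max_txns is None else min(max_txns, len(self.pending))
        for _ in range(count):
            self.secondary.commit(self.pending.popleft())
        return count


account = GeoReplicatedAccount(Location("North Central US"), Location("South Central US"))
account.write("create blob foo")
account.write("update blob foo")
print(account.secondary.committed())  # nothing yet: geo-replication has not run
account.geo_replicate()
print(account.secondary.committed())  # both transactions, in commit order
```

Anything still sitting in the pending queue when the primary is lost corresponds to the recent updates that asynchronous geo-replication cannot protect.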

Our goal is to keep the data durable at both the primary and secondary location. This means we keep enough replicas in both locations to ensure that each location can recover by itself from common failures (e.g., a disk, node, rack, or top-of-rack (TOR) switch failing), without having to talk to the other location. The two locations only have to talk to each other to geo-replicate the recent updates to storage accounts. They do not have to talk to each other to recover data after common failures. This is important, because it means that if we had to fail over a storage account from the primary to the secondary, then all the data that had been committed to the secondary location via geo-replication would already be durable there.

With this first release of geo-replication, we do not provide an SLA for how long it will take to asynchronously geo-replicate the data, though transactions are typically geo-replicated within a few minutes after they have been committed in the primary location.

How Geo-Failover Works

In the event of a major disaster that affects the primary location, we will first try to restore the primary location. Depending on the nature of the disaster and its impact, on rare occasions we may not be able to restore the primary location, and we would need to perform a geo-failover. When this happens, affected customers will be notified via their subscription contact information (we are investigating more programmatic ways to perform this notification). As part of the failover, the customer’s “account.service.core.windows.net” DNS entry would be updated to point to the secondary location instead of the primary location. Once this DNS change has propagated, the existing Blob and Table URIs will work. This means that you do not need to change your application’s URIs; all existing URIs will work the same before and after a geo-failover.

For example, if the primary location for a storage account “myaccount” was North Central US, then the DNS entry for myaccount.<service>.core.windows.net would direct traffic to North Central US. If a geo-failover became necessary, the DNS entry for myaccount.<service>.core.windows.net would be updated so that it would then direct all traffic for the storage account to South Central US.
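
To make the point about unchanged URIs concrete, here is a small simulation. It does not perform real DNS lookups: the dictionary dns_table stands in for DNS, and the functions blob_uri, resolve_location, and geo_failover, as well as the container and blob names, are hypothetical.

```python
from urllib.parse import urlparse


def blob_uri(account, container, blob):
    """Build a blob URI from the account name; no location appears in it."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob}"


def resolve_location(uri, dns_table):
    """Look up which location currently serves the host in the URI (simulated DNS)."""
    return dns_table[urlparse(uri).hostname]


def geo_failover(dns_table, host, new_primary):
    """Simulate the DNS update performed as part of a geo-failover."""
    dns_table[host] = new_primary


# Simulated DNS: the account's blob endpoint initially points at the primary.
dns_table = {"myaccount.blob.core.windows.net": "North Central US"}

uri = blob_uri("myaccount", "photos", "cat.jpg")
print(resolve_location(uri, dns_table))  # served from North Central US

geo_failover(dns_table, "myaccount.blob.core.windows.net", "South Central US")
print(resolve_location(uri, dns_table))  # same URI, now served from South Central US
```

The application holds on to the same URI string throughout; only the mapping behind the host name changes.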

After the failover occurs, the location that is accepting traffic is considered the new primary location for the storage account. This location will remain the primary location unless another geo-failover were to occur. Once the new primary is up and accepting traffic, we will bootstrap a new secondary, which will also be in the same region, for the failed-over storage accounts. In the future we plan to support the ability for customers to choose their secondary location (when we have more than two data centers in a given region), as well as the ability to swap their primary and secondary locations for a storage account.

Order of Geo-Replication and Transaction Consistency

Geo-replication ensures that all the data within a PartitionKey is committed in the same order at the secondary location as at the primary location. That said, there are no geo-replication ordering guarantees across partitions, which means that different partitions can be geo-replicating at different speeds. Once all the updates have been geo-replicated and committed at the secondary location, the secondary location will have exactly the same state as the primary location. However, because geo-replication is asynchronous, recent updates can be lost in the event of a major disaster.

For example, consider the case where we have two blobs, foo and bar, in our storage account (for blobs, the complete blob name is the PartitionKey). Now say we execute transactions A and B on blob foo, and then execute transactions X and Y against blob bar. It is guaranteed that transaction A will be geo-replicated before transaction B, and that transaction X will be geo-replicated before transaction Y. However, no other guarantees are made about the respective timings of geo-replication between the transactions against foo and the transactions against bar. If a disaster happened and caused recent transactions to not get geo-replicated, it would be possible for transactions A and X to be geo-replicated while transactions B and Y are lost. Or transactions A and B could have been geo-replicated while neither X nor Y made it. The same holds true for operations involving Tables, except that the partitions are determined by the application-defined PartitionKey of the entity instead of the blob name. For more information on partition keys, please see Windows Azure Storage Abstractions and their Scalability Targets.
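
The foo and bar example can be worked through mechanically. The sketch below enumerates every state the secondary could be in after a disaster, under the simplifying assumption that each transaction is geo-replicated as its own unit: within a partition only an in-order prefix of the transactions can have arrived, and the prefixes of different partitions are independent. The function name possible_secondary_states is hypothetical.

```python
from itertools import product


def possible_secondary_states(primary_log):
    """Yield every secondary state consistent with per-partition ordering.

    primary_log maps a PartitionKey to the ordered list of transactions
    committed at the primary. Each yielded state maps the PartitionKey to
    the prefix of that list that reached the secondary.
    """
    keys = sorted(primary_log)
    prefix_choices = [
        [tuple(primary_log[key][:n]) for n in range(len(primary_log[key]) + 1)]
        for key in keys
    ]
    for combo in product(*prefix_choices):
        yield dict(zip(keys, combo))


states = list(possible_secondary_states({"foo": ["A", "B"], "bar": ["X", "Y"]}))

# A and X replicated, B and Y lost: allowed.
assert {"foo": ("A",), "bar": ("X",)} in states
# A and B replicated, neither X nor Y: allowed.
assert {"foo": ("A", "B"), "bar": ()} in states
# B without A would violate in-partition ordering: never produced.
assert not any(state["foo"] == ("B",) for state in states)

print(len(states))
```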

Because of this, a best practice for getting the most out of geo-replication is to avoid cross-PartitionKey relationships whenever possible. This means you should try to restrict relationships for Tables to entities that have the same PartitionKey value. Since all transactions within a single partition are geo-replicated in order, this guarantees those relationships will be committed in order on the secondary.

The only multiple object transaction supported by Windows Azure Storage is Entity Group Transactions for Windows Azure Tables, which allow clients to commit a batch of entities together as a single atomic transaction. Geo-replication also treats this batch as an atomic operation. Therefore, the whole batch transaction is committed atomically on the secondary.
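
One way to apply this advice in application code is to group related entities by PartitionKey before submitting them, so that each batch stays within one partition. The sketch below is plain Python and does not call any storage client library; group_into_batches is a hypothetical helper, and the default limit of 100 entities per batch reflects the Table service's documented Entity Group Transaction limit rather than anything stated in this post.

```python
from collections import defaultdict


def group_into_batches(entities, max_batch_size=100):
    """Split entities into batches that each contain a single PartitionKey.

    entities is an iterable of dicts with at least a "PartitionKey" field.
    max_batch_size is an assumption based on the Table service limit for
    Entity Group Transactions; adjust it if your limit differs.
    """
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)

    batches = []
    for group in by_partition.values():
        for start in range(0, len(group), max_batch_size):
            batches.append(group[start:start + max_batch_size])
    return batches


entities = [
    {"PartitionKey": "customer-1", "RowKey": "order-1"},
    {"PartitionKey": "customer-2", "RowKey": "order-1"},
    {"PartitionKey": "customer-1", "RowKey": "order-2"},
]

for batch in group_into_batches(entities):
    # Each batch could be submitted as one Entity Group Transaction.
    print([(e["PartitionKey"], e["RowKey"]) for e in batch])
```

Entities that end up in different batches carry no ordering guarantee relative to each other on the secondary, which is why related entities are best kept under one PartitionKey.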

Summary

This is our first step in geo-replication, where we are now providing additional durability in case of a major data center disaster. The next steps involve developing features needed to help applications recover after a failover, which is an area we are investigating further.

Brad Calder and Monilee Atkinson