What to do if an Azure Storage outage occurs

At Microsoft, we work hard to make sure our services are always available. Sometimes, forces beyond our control impact us in ways that cause unplanned service outages in one or more regions. To help you handle these rare occurrences, we provide the following high-level guidance for Azure Storage services.

How to prepare

Every customer should prepare their own disaster recovery plan. Recovering from a storage outage typically involves both operations personnel and automated procedures to return your applications to a functioning state. Refer to the Azure documentation on disaster recovery to build your own plan.

How to detect

The recommended way to determine the Azure service status is to subscribe to the Azure Service Health Dashboard.

What to do if a Storage outage occurs

If one or more Storage services are temporarily unavailable in one or more regions, you have two options to consider. If you need immediate access to your data, consider Option 2.

Option 1: Wait for recovery

In this case, no action on your part is required. We are working diligently to restore Azure service availability. You can monitor the service status on the Azure Service Health Dashboard.

Option 2: Copy data from secondary

If you chose Read-access geo-redundant storage (RA-GRS) (recommended) for your storage accounts, you will have read access to your data from the secondary region. You can use tools such as AzCopy, Azure PowerShell, and the Azure Data Movement library to copy data from the secondary region into another storage account in an unimpacted region, and then point your applications to that storage account for both read and write availability.
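
As a rough illustration of where the secondary data is read from: for RA-GRS accounts, the secondary endpoint is the primary endpoint with "-secondary" appended to the account name. The sketch below derives that URL with the Python standard library only. The function name `secondary_endpoint` and the account name `myaccount` are hypothetical and used purely for illustration; the resulting URL is what you would then hand to a copy tool such as AzCopy.

```python
from urllib.parse import urlsplit, urlunsplit


def secondary_endpoint(primary_url: str) -> str:
    """Return the RA-GRS secondary URL for a primary storage endpoint URL.

    Hypothetical helper: it only rewrites the host name, it does not
    contact Azure or check that the account is actually RA-GRS.
    """
    parts = urlsplit(primary_url)
    host = parts.hostname or ""
    # The account name is the first DNS label, e.g. "myaccount" in
    # "myaccount.blob.core.windows.net".
    account, sep, rest = host.partition(".")
    if not sep or not account:
        raise ValueError(f"not a storage endpoint URL: {primary_url!r}")
    if account.endswith("-secondary"):
        return primary_url  # already points at the secondary
    new_host = f"{account}-secondary.{rest}"
    if parts.port:
        new_host += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )


# "myaccount" is a placeholder account name.
print(secondary_endpoint("https://myaccount.blob.core.windows.net/container/blob.txt"))
```

The secondary endpoint is read-only, so the destination of the copy must be a different storage account in an unimpacted region.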

What to expect if a Storage failover occurs

If you chose Geo-redundant storage (GRS) or Read-access geo-redundant storage (RA-GRS) (recommended), Azure Storage will keep your data durable in two regions (primary and secondary). In both regions, Azure Storage constantly maintains multiple replicas of your data.

When a regional disaster affects your primary region, we will first try to restore the service in that region. Depending on the nature of the disaster and its impact, on rare occasions we may not be able to restore the primary region. At that point, we will perform a geo-failover. Cross-region data replication is an asynchronous process that can involve a delay, so changes that have not yet been replicated to the secondary region may be lost. You can query the "Last Sync Time" of your storage account to get details on the replication status.
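
One way to use the "Last Sync Time" is to compare it with the time of each write your application made. The sketch below assumes that writes made before the Last Sync Time have reached the secondary, and that anything at or after it may not have; treat that interpretation as an assumption to confirm against the current Azure Storage documentation. The function name `may_be_lost_on_failover` is hypothetical, and the timestamps are made-up values; in practice the Last Sync Time would come from querying your storage account.

```python
from datetime import datetime, timezone


def may_be_lost_on_failover(written_at: datetime, last_sync_time: datetime) -> bool:
    """Return True if a write might not have been replicated yet.

    Assumption: only writes strictly before the Last Sync Time are known
    to be on the secondary, so ">=" is the conservative comparison.
    """
    if written_at.tzinfo is None or last_sync_time.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    return written_at >= last_sync_time


# Made-up example values.
last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
early_write = datetime(2024, 1, 1, 11, 55, tzinfo=timezone.utc)
late_write = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)

print(may_be_lost_on_failover(early_write, last_sync))
print(may_be_lost_on_failover(late_write, last_sync))
```

An application that logs the time of its own writes could use a check like this after a failover to decide which changes to verify or replay.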

A couple of points regarding the storage geo-failover experience:

  • Storage geo-failover will only be triggered by the Azure Storage team – there is no customer action required.
  • Your existing storage service endpoints for blobs, tables, queues, and files will remain the same after the failover; the Microsoft-supplied DNS entry will need to be updated to switch from the primary region to the secondary region. Microsoft will perform this update automatically as part of the geo-failover process.
  • Before and during the geo-failover, you won't have write access to your storage account due to the impact of the disaster, but you can still read from the secondary if your storage account has been configured as RA-GRS.
  • When the geo-failover has been completed and the DNS changes have propagated, read and write access to your storage account will resume; the account will then point to what used to be your secondary endpoint.
  • Note that after the failover you will have write access if the storage account is configured as either GRS or RA-GRS.
  • You can query "Last Geo Failover Time" of your storage account to get more details.
  • After the failover, your storage account will be fully functioning, but in a "degraded" status, because it is hosted in a standalone region with no geo-replication possible. To mitigate this risk, we will restore the original primary region and then do a geo-failback to restore the original state. If the original primary region is unrecoverable, we will allocate another secondary region. For more details on the infrastructure of Azure Storage geo-replication, refer to the article on the Storage team blog about Redundancy Options and RA-GRS.
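
The read behavior described in the list above (reads from the secondary remain possible on RA-GRS accounts while the primary is unavailable) can be sketched as a simple fallback. This is a minimal illustration, not an Azure SDK API: `read_with_fallback`, `read_primary`, and `read_secondary` are hypothetical names, and the two callables stand in for whatever client code your application uses to read from each endpoint. It also assumes an outage surfaces as a `ConnectionError`, which may not match the exceptions your client library raises.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_primary: Callable[[], T],
    read_secondary: Callable[[], T],
    retries: int = 2,
) -> Tuple[T, str]:
    """Try the primary a few times, then fall back to the secondary.

    Returns the value together with the name of the endpoint that served
    it, so callers know when the data may be stale.
    """
    for _ in range(retries):
        try:
            return read_primary(), "primary"
        except ConnectionError:
            continue  # primary unavailable, try again
    # Replication is asynchronous, so this value can lag the primary.
    return read_secondary(), "secondary"


def unavailable_primary() -> bytes:
    # Simulates the primary region being down.
    raise ConnectionError("primary unavailable")


value, source = read_with_fallback(unavailable_primary, lambda: b"replicated data")
print(source, value)
```

Returning the endpoint name alongside the value lets the caller flag results that came from the secondary, since they may not include the most recent writes.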

Best Practices for protecting your data

There are some recommended approaches to back up your storage data on a regular basis.

For information about creating applications that take full advantage of the RA-GRS feature, see Designing Highly Available Applications using RA-GRS Storage.