What to do if an Azure Storage outage occurs
At Microsoft, we work hard to make sure our services are always available. Sometimes, forces beyond our control impact us in ways that cause unplanned service outages in one or more regions. To help you handle these rare occurrences, we provide the following high-level guidance for Azure Storage services.
How to prepare
It is critical for every customer to prepare their own disaster recovery plan. The effort to recover from a storage outage typically involves both operations personnel and automated procedures in order to reactivate your applications in a functioning state. Please refer to the Azure documentation below to build your own disaster recovery plan:
- Availability checklist
- Designing resilient applications for Azure
- Azure Site Recovery service
- Azure Storage replication
- Azure Backup service
How to detect
The recommended way to determine the Azure service status is to subscribe to the Azure Service Health Dashboard.
What to do if a Storage outage occurs
If one or more Storage services are temporarily unavailable in one or more regions, there are two options for you to consider. If you need immediate access to your data, please consider Option 2.
Option 1: Wait for recovery
In this case, no action on your part is required. We are working diligently to restore Azure service availability. You can monitor the service status on the Azure Service Health Dashboard.
Option 2: Copy data from secondary
If you chose Read-access geo-redundant storage (RA-GRS) (recommended) for your storage accounts, you will have read access to your data from the secondary region. You can use tools such as AzCopy, Azure PowerShell, and the Azure Data Movement library to copy data from the secondary region into another storage account in an unimpacted region, and then point your applications to that storage account for both read and write availability.
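For RA-GRS accounts, the secondary read endpoint is the primary endpoint with `-secondary` appended to the account name. The following is a minimal sketch, not an official helper, that derives the secondary URL from a primary one so that a copy tool can be pointed at it. The function name `secondary_endpoint` and the example account `myaccount` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def secondary_endpoint(primary_url: str) -> str:
    """Return the RA-GRS secondary read endpoint for a primary service URL.

    Assumes the usual host layout <account>.<service>.core.windows.net.
    Hypothetical helper for illustration only.
    """
    parts = urlsplit(primary_url)
    host = parts.hostname or ""
    account, sep, rest = host.partition(".")
    if not sep or not account:
        raise ValueError(f"not a storage endpoint URL: {primary_url!r}")
    if account.endswith("-secondary"):
        # Already a secondary endpoint; account names cannot contain hyphens.
        return primary_url
    new_host = f"{account}-secondary.{rest}"
    if parts.port:
        new_host += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(secondary_endpoint(
        "https://myaccount.blob.core.windows.net/container/blob.txt"
    ))
```

The secondary endpoint is read-only, so it can serve only as the source of a copy; the destination must be a storage account in an unimpacted region.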
What to expect if a Storage failover occurs
If you chose Geo-redundant storage (GRS) or Read-access geo-redundant storage (RA-GRS) (recommended), Azure Storage will keep your data durable in two regions (primary and secondary). In both regions, Azure Storage constantly maintains multiple replicas of your data.
When a regional disaster affects your primary region, we will first try to restore the service in that region, because this provides the best combination of RTO and RPO. Depending on the nature of the disaster and its impact, on rare occasions we may not be able to restore the primary region. At that point, we will perform a geo-failover. Cross-region data replication is an asynchronous process that involves a delay, so changes that have not yet been replicated to the secondary region may be lost. You can query the "Last Sync Time" of your storage account to get details on the replication status.
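One way to reason about the "Last Sync Time" is as a cutoff: writes made to the primary before it have been replicated to the secondary, while later writes may or may not have been. The sketch below assumes you keep your own log of write timestamps and have already obtained the Last Sync Time as a timezone-aware `datetime`; the function name `writes_at_risk` and the sample values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, List


def writes_at_risk(last_sync_time: datetime,
                   write_times: Iterable[datetime]) -> List[datetime]:
    """Return the writes made after last_sync_time.

    These writes may not have reached the secondary region yet and could
    be lost in a geo-failover. Hypothetical helper for illustration only.
    """
    return [t for t in write_times if t > last_sync_time]


if __name__ == "__main__":
    # Sample values, invented for this example.
    last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    log = [
        datetime(2024, 1, 1, 11, 55, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
    ]
    for t in writes_at_risk(last_sync, log):
        print("possibly not replicated:", t.isoformat())
```

Comparing against the cutoff with a strict `>` is a design choice in this sketch; an application that wants to be conservative could also treat writes at exactly the Last Sync Time as at risk.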
A couple of points regarding the storage geo-failover experience:
- Storage geo-failover is triggered only by the Azure Storage team; no customer action is required. The team triggers the failover only after exhausting all options for restoring the data in the same region, because in-region recovery provides the best combination of RTO and RPO.
- Your existing storage service endpoints for blobs, tables, queues, and files will remain the same after the failover; the Microsoft-supplied DNS entry will need to be updated to switch from the primary region to the secondary region. Microsoft will perform this update automatically as part of the geo-failover process.
- Before and during the geo-failover, you won't have write access to your storage account due to the impact of the disaster, but you can still read from the secondary if your storage account has been configured as RA-GRS.
- When the geo-failover has been completed and the DNS changes have propagated, read and write access to your storage account will be resumed. Your storage account endpoints will then point to what used to be your secondary region.
- Note that you will have write access if you have GRS or RA-GRS configured for the storage account.
- You can query "Last Geo Failover Time" of your storage account to get more details.
- After the failover, your storage account will be fully functioning, but in a "degraded" state, as it is hosted in a standalone region with no geo-replication possible. To mitigate this risk, we will restore the original primary region and then do a geo-failback to restore the original state. If the original primary region is unrecoverable, we will allocate another secondary region. For more details on the infrastructure of Azure Storage geo replication, please refer to the article on the Storage team blog about Redundancy Options and RA-GRS.
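The read path described in the points above (reads remain possible from the secondary while the primary is unavailable, provided the account is RA-GRS) can be sketched as a simple fallback. This is a minimal illustration, not the behavior of any Azure client library: `read_with_fallback` and the two reader callables are hypothetical and stand in for whatever code your application uses to read from each endpoint.

```python
from typing import Callable, Optional, Tuple


def read_with_fallback(read_primary: Callable[[], bytes],
                       read_secondary: Callable[[], bytes],
                       attempts: int = 3) -> Tuple[bytes, str]:
    """Try the primary up to `attempts` times, then fall back to the secondary.

    Returns the data and the name of the endpoint that served it.
    Only OSError (including ConnectionError and TimeoutError) counts as
    an outage signal in this sketch.
    """
    last_error: Optional[OSError] = None
    for _ in range(attempts):
        try:
            return read_primary(), "primary"
        except OSError as exc:
            last_error = exc
    try:
        # Data read here may be stale: replication is asynchronous.
        return read_secondary(), "secondary"
    except OSError as exc:
        raise exc from last_error


if __name__ == "__main__":
    def failing_primary() -> bytes:
        raise ConnectionError("primary region unavailable")

    data, source = read_with_fallback(failing_primary, lambda: b"payload")
    print(source, data)
```

Because the secondary is read-only and may lag behind the primary, an application using this pattern should treat the returned data as potentially stale and defer writes until access is restored.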
Best practices for protecting your data
We recommend backing up your storage data on a regular basis, using the following approaches:
- VM Disks – Use the Azure Backup service to back up the VM disks used by your Azure virtual machines.
- Block blobs – Turn on soft delete to protect against object-level deletions and overwrites, or copy the blobs to another storage account in another region using AzCopy, Azure PowerShell, or the Azure Data Movement library.
- Tables – Use AzCopy to export the table data into another storage account in another region.
- Files – Use AzCopy or Azure PowerShell to copy your files to another storage account in another region.
For information about creating applications that take full advantage of the RA-GRS feature, see Designing Highly Available Applications using RA-GRS Storage.