XiangyuZhang-2194 asked AnuragSharma-MSFT answered

Write region outage behavior for multi-region account with single write region

Our team is considering enabling multi-region for our upcoming PROD environment. We have decided to enable multiple read regions, but we have a couple of questions about a single write region versus multiple write regions.

According to the documentation (https://docs.microsoft.com/en-us/azure/cosmos-db/high-availability#what-to-expect-during-a-cosmos-db-region-outage), if we have only a single write region, then in the event of a write region outage we would experience a loss of write availability until Cosmos DB automatically elects a new region as the write region. It appears to me that, with this mechanism, the only disadvantage compared to a multi-write-region setup is that we cannot do a manual failover. But what is the SLA for the automatic region switch? If the switch happens within a reasonable period of time, we could probably handle that.

azure-cosmos-db

Hi @XiangyuZhang-2194, just wanted to follow up on the thread in case you need any more details. If the answer helps, you can mark it 'Accept Answer'.

AnuragSharma-MSFT answered

Hi @XiangyuZhang-2194, thanks for your patience.

(Writing this as another answer because the 1600-character limit was exceeded.)

Azure Cosmos DB maintains multiple replicas within the same region through replication and redundancy. It is highly unlikely that all of these replicas fail at the same time, since the failure of an entire region is among the rarest of scenarios. So if the replica serving an Azure Cosmos DB account fails for any reason, the account is switched to another replica in the same region within a very short span of time, as no extra replication time is needed in this case.


However, if the entire region fails, the picture is different. Although the SLA states 99.99%, for a single-region account the RPO is up to 240 minutes and the RTO is up to 1 week, which is much longer than 1 minute. In these specific cases, when the SLA is not met, the bill is adjusted with a significant billing credit based on how much downtime occurred.

Consistency levels and data durability

Also, as per the article: "Single-region accounts may lose availability following a regional outage. It's always recommended to set up at least two regions (preferably, at least two write regions) with your Azure Cosmos account to ensure high availability at all times."


Please let me know if this helps or else we can continue the discussion.


Please don't forget to click the 'Accept Answer' button wherever the information provided helps you. This can be beneficial to other community members as well.










AnuragSharma-MSFT answered AnuragSharma-MSFT edited

Hi @XiangyuZhang-2194, welcome to the Microsoft Q&A forum.

A single write region offers a write SLA of 99.99% for Azure Cosmos DB, as per the article. You can check how much downtime that allows against your requirements.
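To make the downtime figure concrete, here is a minimal sketch of the arithmetic behind an availability percentage. The function name and the period constants are hypothetical and only for illustration. The period over which the SLA is actually measured is an assumption here, so please confirm it against the official SLA document. The 99.999% value is included only as a comparison point.

```python
# Rough downtime-budget arithmetic for an availability SLA.
# Assumption: the budget is simply (1 - availability) * period length.
# The measurement period used by the real SLA should be checked in the SLA document.

MINUTES_PER_WEEK = 7 * 24 * 60            # 10,080 minutes
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200 minutes


def downtime_budget_minutes(sla_percent: float, period_minutes: float) -> float:
    """Return the maximum downtime, in minutes, that still meets the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.99, 99.999):
        weekly = downtime_budget_minutes(sla, MINUTES_PER_WEEK)
        monthly = downtime_budget_minutes(sla, MINUTES_PER_30_DAY_MONTH)
        print(f"{sla}%: {weekly:.3f} min/week, {monthly:.3f} min per 30-day month")
```

With 99.99%, this works out to roughly 1 minute per week, or about 4.3 minutes over a 30-day month. Note that this is a budget for when the SLA is met, not a bound on how long a single outage can last.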

129503-image.png

This would limit the outage to less than an hour, as per the metrics provided below.

129493-image.png

Referenced Article: Identify dependencies

Please let me know if this helps or else we can discuss further on the same.


If the answer helps, please mark it 'Accept Answer'.






Hi @AnuragSharma-MSFT, thank you for your answer. We do have a follow-up question. According to the table you provided, a single-region account (both read and write) has an SLA of 99.99%, which means the average downtime per week is about a minute. We were wondering how this is implemented to achieve such high availability. We know that Cosmos DB keeps four data replicas in the same region (and by default that is the write region for a single-region account), but in the event of a failure in this region, does Cosmos DB need to replicate the data to another region to restore availability? If so, completing that in 1 minute seems too fast to be true. If not, does that mean Cosmos DB keeps multiple replicas in different regions regardless of the account's configuration?

Thanks.


Hi @XiangyuZhang-2194, thanks for getting back. I am checking on this internally and will get back to you at the earliest.
