Diagnose and troubleshoot the availability of Azure Cosmos SDKs in multiregional environments

APPLIES TO: SQL API

This article describes how the latest versions of the Azure Cosmos SDKs behave when there is a connectivity issue with a particular region or when a region failover occurs.

All the Azure Cosmos SDKs give you an option to customize the regional preference. The following properties are used in different SDKs:

When you set the regional preference, the client will connect to a region as mentioned in the following table:

Account type           | Reads            | Writes
Single write region    | Preferred region | Primary region
Multiple write regions | Preferred region | Preferred region

If you don't set a preferred region, the SDK client defaults to the primary region:

Account type           | Reads          | Writes
Single write region    | Primary region | Primary region
Multiple write regions | Primary region | Primary region

Note

Primary region refers to the first region in the Azure Cosmos account region list. If the values specified as the regional preference do not match any existing Azure region, they are ignored. If a value matches an existing region but the account is not replicated to it, the client connects to the next preferred region that matches, or to the primary region.
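The two tables and the note above can be condensed into a small lookup. The following is a minimal sketch in plain Python, not SDK code: the function name `resolve_regions` and its parameters are hypothetical and exist only for illustration. It assumes the account region list is ordered with the primary region first, as the note states.

```python
from typing import Sequence, Tuple


def resolve_regions(account_regions: Sequence[str],
                    preferred: Sequence[str],
                    multiple_write_regions: bool) -> Tuple[str, str]:
    """Return (read_region, write_region) per the tables above (illustrative only)."""
    # The primary region is the first entry in the account region list.
    primary = account_regions[0]
    # Preferences that the account is not replicated to (or that are not
    # real regions at all) are skipped; the remaining order is preserved.
    usable = [region for region in preferred if region in account_regions]
    read_region = usable[0] if usable else primary
    # Writes follow the preference only for accounts with multiple write regions.
    write_region = read_region if multiple_write_regions else primary
    return read_region, write_region
```

For example, with an account list of `["West US", "East US"]` and a preference of `["East US"]`, a single write region account would read from East US and write to West US under this model.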

Warning

The failover and availability logic described in this document can be disabled in the client configuration. Doing so is not advised unless the user application is going to handle availability errors itself. This can be achieved by:

Under normal circumstances, the SDK client connects to the preferred region (if a regional preference is set) or to the primary region (if no preference is set), and operations are limited to that region unless one of the scenarios described below occurs.

In these cases, the client using the Azure Cosmos SDK exposes logs and includes the retry information as part of the operation diagnostic information:

  • The RequestDiagnosticsString property on responses in .NET V2 SDK.
  • The Diagnostics property on responses and exceptions in .NET V3 SDK.
  • The getDiagnostics() method on responses and exceptions in Java V4 SDK.

When determining the next region in order of preference, the SDK client will use the account region list, prioritizing the preferred regions (if any).
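One way to picture this ordering is as a merged list: preferred regions that exist in the account come first, followed by the remaining account regions in account order. The sketch below is an illustrative model only; `region_order` is a hypothetical helper, not an SDK function, and the real SDKs may differ in detail.

```python
from typing import List, Sequence


def region_order(account_regions: Sequence[str],
                 preferred: Sequence[str]) -> List[str]:
    """Candidate regions, most preferred first (illustrative model)."""
    # Preferred regions the account actually has, in preference order.
    ordered = [region for region in preferred if region in account_regions]
    # Then every other account region, in account-list order.
    ordered += [region for region in account_regions if region not in ordered]
    return ordered
```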

For comprehensive details on the SLA guarantees during these events, see the SLAs for availability.

Removing a region from the account

When you remove a region from an Azure Cosmos account, any SDK client that actively uses the account detects the region removal through a backend response code. The client then marks the regional endpoint as unavailable, retries the current operation, and permanently routes all future operations to the next region in order of preference. If the preference list has only one entry (or is empty) but the account has other regions available, the client routes to the next region in the account list.
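The removal flow can be sketched as a small router that marks an endpoint unavailable and retries on the next candidate. This is a simplified model under stated assumptions: `RegionRemoved` is a hypothetical exception standing in for the backend response code, and `RegionRouter` is an illustrative class, not part of any SDK.

```python
class RegionRemoved(Exception):
    """Hypothetical stand-in for the backend response that signals removal."""


class RegionRouter:
    def __init__(self, account_regions, preferred=()):
        # Preferred regions first, then the rest of the account list.
        present = [r for r in preferred if r in account_regions]
        rest = [r for r in account_regions if r not in present]
        self.candidates = present + rest
        self.unavailable = set()

    def current(self):
        """First candidate region that is not marked unavailable."""
        for region in self.candidates:
            if region not in self.unavailable:
                return region
        raise RuntimeError("no available region")

    def execute(self, operation):
        """Run operation(region); on removal, mark the region and retry."""
        while True:
            region = self.current()  # raises once every region is marked
            try:
                return operation(region)
            except RegionRemoved:
                # Future operations skip this region as well.
                self.unavailable.add(region)
```

In this model, a client that prefers only East US on an account listing West US, East US, and North Europe would fall back to West US once East US is removed, because West US is next in the account list.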

Adding a region to an account

Every 5 minutes, the Azure Cosmos SDK client reads the account configuration and refreshes the regions that it's aware of.

If you remove a region and later add it back to the account, and the added region has a higher regional preference order in the SDK configuration than the currently connected region, the SDK switches back to that region permanently. After the added region is detected, all future requests are directed to it.

If you configure the client to prefer a region that the Azure Cosmos account does not have, the preferred region is ignored. If you add that region later, the client detects it and switches permanently to that region.
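The periodic refresh and the switch to a newly added preferred region can be modeled as follows. This is a sketch under assumptions: `RefreshingClient` and the `fetch_account_regions` callable are hypothetical, and time is passed in explicitly (in seconds) so the behavior is easy to follow. Only the 5-minute interval comes from the text above.

```python
REFRESH_INTERVAL_SECONDS = 5 * 60  # refresh interval stated in the article


class RefreshingClient:
    def __init__(self, preferred, fetch_account_regions):
        self.preferred = list(preferred)
        # Hypothetical callable that returns the account's current region list.
        self.fetch = fetch_account_regions
        self.known = list(fetch_account_regions())
        self.last_refresh = 0.0

    def maybe_refresh(self, now):
        """Re-read the account regions if the interval has elapsed."""
        if now - self.last_refresh >= REFRESH_INTERVAL_SECONDS:
            self.known = list(self.fetch())
            self.last_refresh = now

    def current_region(self, now):
        """Highest-preference region the client currently knows about."""
        self.maybe_refresh(now)
        for region in self.preferred:
            if region in self.known:
                return region
        return self.known[0]  # fall back to the primary region
```

A client that prefers East US on an account that initially has only West US would keep using West US in this model, and would move to East US on the first refresh after that region is added.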

Fail over the write region in a single write region account

If you initiate a failover of the current write region, the next write request fails with a known backend response. When the client detects this response, it queries the account to learn the new write region, retries the current operation, and permanently routes all future write operations to the new region.
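The write failover sequence (fail, rediscover, retry, stay on the new region) can be sketched as below. The names are assumptions made for illustration: `WriteRegionChanged` stands in for the known backend response, and `query_write_region` and `send` are hypothetical callables supplied by the caller, not SDK APIs.

```python
class WriteRegionChanged(Exception):
    """Hypothetical stand-in for the known backend response after a failover."""


class SingleWriteClient:
    def __init__(self, query_write_region, send):
        self.query_write_region = query_write_region  # asks the account
        self.send = send                              # send(region, item)
        self.write_region = query_write_region()

    def write(self, item):
        try:
            return self.send(self.write_region, item)
        except WriteRegionChanged:
            # Learn the new write region, then retry the same operation once.
            # The updated region is kept for all later writes.
            self.write_region = self.query_write_region()
            return self.send(self.write_region, item)
```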

Regional outage

If the account has a single write region and the regional outage occurs during a write operation, the behavior is similar to a manual failover. For read requests, or for accounts with multiple write regions, the behavior is similar to removing a region.

Session consistency guarantees

When using session consistency, the client needs to guarantee that it can read its own writes. In single write region accounts where the preferred read region is different from the write region, a user can issue a write and then read from a local region before that region has received the replicated data (a speed of light constraint). In such cases, the SDK detects the specific failure on the read operation and retries the read on the primary region to ensure session consistency.
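The read retry described above can be sketched as a single fallback step. This is an illustrative model only: `SessionNotAvailable` is a hypothetical exception representing the specific read failure, and `read` is a hypothetical caller-supplied function that raises it when a region has not yet caught up to the session token.

```python
class SessionNotAvailable(Exception):
    """Hypothetical stand-in for the read failure when a replica lags the session."""


def read_own_write(read, local_region, primary_region, key, session_token):
    """Read from the local region; fall back to the primary if it lags."""
    try:
        return read(local_region, key, session_token)
    except SessionNotAvailable:
        # The primary (write) region has the write, so the retry can
        # satisfy the session guarantee.
        return read(primary_region, key, session_token)
```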

Transient connectivity issues on TCP protocol

When the Azure Cosmos SDK client is configured to use the TCP protocol, network conditions can temporarily affect communication with a particular endpoint for a given request. These temporary conditions can surface as TCP timeouts and Service Unavailable (HTTP 503) errors. The client retries the request locally on the same endpoint for a few seconds before surfacing the error.

If the user has configured a preferred region list with more than one region, and either the Azure Cosmos account has multiple write regions or the account has a single write region and the operation is a read request, the client detects the local failure and retries that single operation in the next region from the preference list.
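The condition for a cross-region retry reads more easily as a boolean expression. The helper below is a hypothetical illustration of that condition as stated in the text, not an SDK function.

```python
def can_retry_in_next_region(preferred_regions, multiple_write_regions, is_read):
    """True when a failed operation may be retried in the next preferred region."""
    # More than one preferred region is required, and the operation must be
    # servable elsewhere: any operation on a multiple write region account,
    # or a read on a single write region account.
    return len(preferred_regions) > 1 and (multiple_write_regions or is_read)
```

Under this reading, a write on a single write region account is not retried in another region, because only the write region can accept it.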

Next steps