Diagnose and troubleshoot Azure Cosmos DB Java v4 SDK request timeout exceptions

APPLIES TO: NoSQL

An HTTP 408 error occurs when the SDK can't complete the request before the timeout limit elapses.

Troubleshooting steps

The following list contains known causes and solutions for request timeout exceptions.

End-to-end timeout policy

In some scenarios, 408 network timeout errors occur even when all the preemptive solutions described below have been implemented. A general best practice for reducing tail latency and improving availability in these scenarios is to implement an end-to-end timeout policy. Failing faster reduces tail latency, and stopping retries after the timeout reduces request unit and client-side compute costs. The timeout duration can be set on CosmosItemRequestOptions. The options can then be passed to any request sent to Azure Cosmos DB:

// Fail any operation that takes longer than 1 second end to end.
CosmosEndToEndOperationLatencyPolicyConfig endToEndOperationLatencyPolicyConfig =
    new CosmosEndToEndOperationLatencyPolicyConfigBuilder(Duration.ofSeconds(1)).build();
CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setCosmosEndToEndOperationLatencyPolicyConfig(endToEndOperationLatencyPolicyConfig);
// Pass the options with the request so the policy applies to this read.
container.readItem("id", new PartitionKey("pk"), options, TestObject.class);

Existing issues

If requests are getting stuck for longer durations or timing out more frequently, upgrade the Java v4 SDK to the latest version. NOTE: We strongly recommend using version 4.18.0 or later. Check out the Java v4 SDK release notes for more details.

High CPU utilization

High CPU utilization is the most common cause. For optimal latency, CPU usage should be roughly 40 percent. Use 10 seconds as the interval to monitor maximum (not average) CPU utilization. CPU spikes are more common with cross-partition queries, where the SDK might open multiple connections for a single query.

Solution:

The client application that uses the SDK should be scaled up or out.
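As a rough illustration of tracking maximum rather than average CPU over a 10-second window, the following sketch samples the system CPU load once per second and reports the peak. The class name CpuPeakSampler and the one-second sampling rate are assumptions made for this example, not part of the SDK. The sketch also assumes a HotSpot-based JVM that exposes com.sun.management.OperatingSystemMXBean; on newer JDKs getSystemCpuLoad() is deprecated in favor of getCpuLoad().

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of the Azure Cosmos DB SDK.
class CpuPeakSampler {

    // Returns the highest sample. Negative samples (load not available) are ignored.
    static double peak(double[] samples) {
        double max = 0.0;
        for (double sample : samples) {
            if (sample > max) {
                max = sample;
            }
        }
        return max;
    }

    // Samples system CPU load once per second for windowSeconds seconds
    // and returns the peak as a fraction between 0.0 and 1.0.
    static double samplePeak(int windowSeconds) throws InterruptedException {
        // Assumption: the JVM exposes the com.sun.management extension bean.
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        double[] samples = new double[windowSeconds];
        for (int i = 0; i < windowSeconds; i++) {
            samples[i] = os.getSystemCpuLoad();
            Thread.sleep(1000);
        }
        return peak(samples);
    }

    public static void main(String[] args) throws InterruptedException {
        double peak = samplePeak(10);
        System.out.printf("Peak CPU over 10 s: %.0f%%%n", peak * 100);
    }
}
```

A peak that sits well above the roughly 40 percent target while the average looks healthy suggests short CPU spikes, which an average-based monitor would hide.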

Connection throttling

Connection throttling can happen because of either a connection limit on a host machine or Azure SNAT (PAT) port exhaustion.

Connection limit on a host machine

Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too. Run the following command to check the current limits:

ulimit -a

Solution:

The maximum number of open files, identified as "nofile", needs to be at least 10,000. For more information, see the Azure Cosmos DB Java SDK v4 performance tips.
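The same limit can also be checked from inside the running JVM, which is useful when the application runs under a different user or service manager than your shell. The following is a minimal sketch, assuming a Unix-like host and a JVM that exposes com.sun.management.UnixOperatingSystemMXBean. The class name OpenFileLimitCheck and the helper meetsRecommendation are hypothetical names for this example.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration; not part of the Azure Cosmos DB SDK.
class OpenFileLimitCheck {

    // Minimum "nofile" value recommended in the text above.
    static final long RECOMMENDED_MIN = 10_000;

    static boolean meetsRecommendation(long maxOpenFiles) {
        return maxOpenFiles >= RECOMMENDED_MIN;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            long max = unix.getMaxFileDescriptorCount();   // limit for this process
            long open = unix.getOpenFileDescriptorCount(); // descriptors currently in use
            System.out.println("Open file descriptors: " + open + " of " + max);
            System.out.println("Meets recommended minimum: " + meetsRecommendation(max));
        } else {
            // The Unix-specific bean isn't available on this platform or JVM.
            System.out.println("File descriptor limits aren't exposed on this platform.");
        }
    }
}
```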

Socket or port availability might be low

When running in Azure, clients using the Java SDK can hit Azure SNAT (PAT) port exhaustion.

Solution 1:

If you're running on Azure VMs, follow the SNAT port exhaustion guide.

Solution 2:

If you're running on Azure App Service, follow the connection errors troubleshooting guide and use App Service diagnostics.

Solution 3:

If you're running on Azure Functions, verify you're following the Azure Functions recommendation of maintaining singleton or static clients for all of the involved services (including Azure Cosmos DB). Check the service limits based on the type and size of your Function App hosting.

Solution 4:

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK GatewayConnectionConfig. Otherwise, you'll face connection issues.

Create multiple client instances

Creating multiple client instances might lead to connection contention and timeout issues.

Solution 1:

Follow the performance tips, and use a single CosmosClient instance across an entire application.
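One common way to guarantee a single instance per JVM is the initialization-on-demand holder idiom, sketched below. To keep the example compilable without the SDK dependency, PlaceholderClient is a hypothetical stand-in for CosmosClient; in a real application the holder would instead build the client once with CosmosClientBuilder. The names ClientProvider and SingletonDemo are also assumptions for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for CosmosClient so the sketch compiles without the SDK.
class PlaceholderClient {
    // Counts constructions so the demo can show that only one instance exists.
    static final AtomicInteger INSTANCES = new AtomicInteger();

    PlaceholderClient() {
        INSTANCES.incrementAndGet();
    }
}

class ClientProvider {
    private ClientProvider() {
    }

    // The JVM initializes Holder lazily, exactly once, and in a thread-safe way
    // the first time get() is called.
    private static class Holder {
        static final PlaceholderClient CLIENT = new PlaceholderClient();
    }

    static PlaceholderClient get() {
        return Holder.CLIENT;
    }
}

class SingletonDemo {
    public static void main(String[] args) {
        PlaceholderClient first = ClientProvider.get();
        PlaceholderClient second = ClientProvider.get();
        System.out.println("Same instance: " + (first == second));
        System.out.println("Instances created: " + PlaceholderClient.INSTANCES.get());
    }
}
```

Every component that needs the client calls ClientProvider.get() instead of constructing its own, so the connections are established once and reused.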

Solution 2:

If a singleton CosmosClient isn't feasible in your application, we recommend sharing connections across multiple Azure Cosmos DB clients by calling connectionSharingAcrossClientsEnabled(true) on CosmosClientBuilder. When multiple Azure Cosmos DB client instances in the same JVM interact with multiple Azure Cosmos DB accounts, enabling this option allows the instances to share connections in Direct mode where possible. Note that when this option is set, the connection configuration (for example, socket timeout and idle timeout settings) of the first instantiated client is used for all other client instances.

Hot partition key

Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. When there's a hot partition, one or more logical partition keys on a physical partition are consuming all the physical partition's Request Units per second (RU/s). At the same time, the RU/s on other physical partitions are going unused. As a symptom, the total RU/s consumed will be less than the overall provisioned RU/s at the database or container, but you'll still see throttling (429s) on the requests against the hot logical partition key. Use the Normalized RU Consumption metric to see if the workload is encountering a hot partition.
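The following sketch works through the arithmetic with made-up numbers: 10,000 RU/s provisioned across 5 physical partitions, with one partition receiving almost all the traffic. The figures, the class name HotPartitionExample, and the simplified definition of normalized consumption (the highest per-partition utilization, capped at 100 percent) are assumptions for illustration only.

```java
// Hypothetical worked example; all numbers are illustrative.
class HotPartitionExample {

    // Provisioned throughput is split evenly across physical partitions.
    static double perPartitionRu(double provisionedRu, int physicalPartitions) {
        return provisionedRu / physicalPartitions;
    }

    // Simplified model: the highest per-partition utilization, as a percentage capped at 100.
    static double normalizedConsumption(double[] consumedPerPartition, double perPartitionRu) {
        double max = 0.0;
        for (double consumed : consumedPerPartition) {
            max = Math.max(max, consumed / perPartitionRu * 100.0);
        }
        return Math.min(max, 100.0);
    }

    public static void main(String[] args) {
        double provisioned = 10_000;
        double[] consumed = {2_000, 100, 100, 100, 100}; // partition 0 is hot

        double share = perPartitionRu(provisioned, consumed.length);
        double total = 0.0;
        for (double c : consumed) {
            total += c;
        }

        System.out.println("RU/s available per physical partition: " + share);
        System.out.println("Total RU/s consumed: " + total + " of " + provisioned);
        System.out.println("Normalized consumption (%): " + normalizedConsumption(consumed, share));
    }
}
```

In this example the workload uses less than a quarter of the provisioned throughput overall, yet the hot partition is saturated at its 2,000 RU/s share, so further requests against it are throttled.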

Solution:

Choose a good partition key that evenly distributes request volume and storage. Learn how to change your partition key.

High degree of concurrency

The application is issuing requests with a high degree of concurrency, which can lead to contention on the channel.

Solution:

The client application that uses the SDK should be scaled up or out.

Large requests or responses

Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.

Solution:

The client application that uses the SDK should be scaled up or out.

Failure rate is within the Azure Cosmos DB SLA

The application should be able to handle transient failures and retry when necessary. The SDK doesn't retry 408 exceptions because, on create paths, it's impossible to know whether the service created the item. Sending the same item again for create causes a conflict exception. A user application's business logic might include custom conflict handling, which would break on the ambiguity between a genuinely existing item and a conflict caused by a create retry.
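One way an application might apply this reasoning is to retry 408 failures only for operations it knows are safe to repeat, such as reads, and to surface them immediately for creates. The following is a minimal sketch under that assumption. StatusException is a hypothetical stand-in for an SDK exception that exposes an HTTP status code, and TimeoutRetry is a hypothetical helper; neither is part of the Azure Cosmos DB SDK. Backoff between attempts is omitted for brevity.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for an SDK exception that carries an HTTP status code.
class StatusException extends RuntimeException {
    final int statusCode;

    StatusException(int statusCode) {
        super("HTTP " + statusCode);
        this.statusCode = statusCode;
    }
}

// Hypothetical helper for illustration; not part of the Azure Cosmos DB SDK.
class TimeoutRetry {
    static final int REQUEST_TIMEOUT = 408;

    // Runs the operation, retrying on 408 only when the caller marks it idempotent.
    // Any other failure, or a 408 on a non-idempotent operation, is rethrown at once.
    static <T> T execute(Supplier<T> operation, boolean idempotent, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        StatusException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (StatusException e) {
                if (e.statusCode != REQUEST_TIMEOUT || !idempotent) {
                    throw e;
                }
                last = e; // remember the timeout and try again
            }
        }
        throw last; // every attempt timed out
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated read that times out twice before succeeding.
        String result = execute(() -> {
            if (++calls[0] < 3) {
                throw new StatusException(REQUEST_TIMEOUT);
            }
            return "item";
        }, true, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

For a create that times out, the application would instead decide how to resolve the ambiguity, for example by reading the item back before attempting the create again.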

Failure rate violates the Azure Cosmos DB SLA

Contact Azure Support.

Next steps