Troubleshoot issues when you use Azure Cosmos DB Java SDK v4 with SQL API accounts



This article covers troubleshooting for Azure Cosmos DB Java SDK v4 only. For more information, see the Azure Cosmos DB Java SDK v4 release notes, Maven repository, and performance tips. If you are currently using a version older than v4, see the Migrate to Azure Cosmos DB Java SDK v4 guide for help upgrading to v4.

This article covers common issues, workarounds, diagnostic steps, and tools when you use Azure Cosmos DB Java SDK v4 with Azure Cosmos DB SQL API accounts. Azure Cosmos DB Java SDK v4 provides a client-side logical representation to access the Azure Cosmos DB SQL API.

Start with this list:

  • Take a look at the Common issues and workarounds section in this article.
  • Look at the Java SDK in the Azure Cosmos DB central repo, which is available as open source on GitHub. It has an issues section that's actively monitored. Check whether a similar issue with a workaround is already filed. One helpful tip is to filter issues by the cosmos:v4-item tag.
  • Review the performance tips for Azure Cosmos DB Java SDK v4, and follow the suggested practices.
  • If you didn't find a solution, read the rest of this article. Then file a GitHub issue. If there is an option to add tags to your GitHub issue, add a cosmos:v4-item tag.

Retry logic

On any IO failure, the Azure Cosmos DB SDK attempts to retry the failed operation if a retry in the SDK is feasible. Having a retry in place for any failure is good practice, but handling and retrying write failures specifically is a must. Use the latest SDK, because its retry logic is continuously being improved.

  1. Read and query IO failures are retried by the SDK without surfacing them to the end user.
  2. Writes (Create, Upsert, Replace, Delete) are not idempotent, so the SDK cannot always blindly retry failed write operations. The application logic must handle the failure and retry.
  3. The troubleshooting SDK availability article explains retries for multi-region Azure Cosmos DB accounts.
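
Because write retries are left to the application, it can help to centralize them in one place. The following is a minimal sketch of a bounded retry wrapper, not the SDK's own retry implementation. `WriteRetry`, `execute`, and the `shouldRetry` predicate are hypothetical names introduced for illustration, and the sketch assumes the write surfaces its failure as an unchecked exception.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper (not part of the SDK): bounded retries for a write.
final class WriteRetry {
    private WriteRetry() {}

    /**
     * Runs the write up to maxAttempts times. A failure is retried only when
     * shouldRetry accepts it; otherwise, or on the last attempt, it is rethrown.
     */
    static <T> T execute(Supplier<T> write, int maxAttempts,
                         Predicate<RuntimeException> shouldRetry) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return write.get();
            } catch (RuntimeException e) {
                if (attempt == maxAttempts || !shouldRetry.test(e)) {
                    throw e;
                }
                // Retryable failure with attempts left: loop and try again.
            }
        }
    }
}
```

Since writes are not idempotent, the caller still has to decide what a retry means for each operation. For example, a retried create can come back as a conflict if the first attempt actually reached the service.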

Retry design

Design the application to retry on any exception unless it is a known issue where retrying will not help. For example, the application should retry on 408 request timeouts, because the timeout is possibly transient and a retry may succeed. The application should not retry on 400s, which typically mean that there is an issue with the request that must be resolved first; retrying will result in the same failure. The table below shows known failures and which ones to retry on.

Common error status codes

Status code | Retryable | Description
400 | No | Bad request (for example, invalid JSON, incorrect headers, or an incorrect partition key in the header)
401 | No | Not authorized
403 | No | Forbidden
404 | No | Resource is not found
408 | Yes | Request timed out
409 | No | Conflict. The ID provided for a resource on a write operation is already taken by an existing resource. Use another ID for the resource to resolve this issue, because the ID must be unique among all documents with the same partition key value.
410 | Yes | Gone exceptions (transient failure that should not violate SLA)
412 | No | Precondition failure. The operation specified an eTag that is different from the version available at the server. This is an optimistic concurrency error. Retry the request after reading the latest version of the resource and updating the eTag on the request.
413 | No | Request entity too large
429 | Yes | Too many requests. It is safe to retry on a 429. To avoid this error, see the Request rate too large section.
449 | Yes | Transient error that only occurs on write operations and is safe to retry. This can point to a design issue where too many concurrent operations are trying to update the same object in Azure Cosmos DB.
500 | Yes | The operation failed due to an unexpected service error. Contact support by filing an Azure support issue.
503 | Yes | Service unavailable
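
The retryable column of the table can be captured in a small lookup so the retry decision lives in one place. This is a sketch only. `RetryableStatusCodes` and `isRetryable` are hypothetical names, and the sketch assumes the application has already extracted the HTTP status code from the failure it caught.

```java
import java.util.Set;

// Hypothetical helper: mirrors the "Retryable" column of the status code table.
final class RetryableStatusCodes {
    // Codes marked "Yes" in the table: 408, 410, 429, 449, 500, 503.
    private static final Set<Integer> RETRYABLE = Set.of(408, 410, 429, 449, 500, 503);

    private RetryableStatusCodes() {}

    /** Returns true when the table marks the status code as safe to retry. */
    static boolean isRetryable(int statusCode) {
        return RETRYABLE.contains(statusCode);
    }
}
```

Codes that are absent from the table are treated as non-retryable here. That is a conservative choice of this sketch, not guidance from the table itself.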

Common issues and workarounds

Network issues, Netty read timeout failure, low throughput, high latency

General suggestions

For best performance:

  • Make sure the app is running in the same region as your Azure Cosmos DB account.
  • Check the CPU usage on the host where the app is running. If CPU usage is 50 percent or more, run your app on a more powerful host, or distribute the load across more machines.

Connection throttling

Connection throttling can happen because of either a connection limit on a host machine or Azure SNAT (PAT) port exhaustion.

Connection limit on a host machine

Some Linux systems, such as Red Hat, have an upper limit on the total number of open files. Sockets in Linux are implemented as files, so this number limits the total number of connections, too. Run the following command.

ulimit -a

The maximum number of allowed open files, identified as "nofile," needs to be at least double your connection pool size. For more information, see the Azure Cosmos DB Java SDK v4 performance tips.
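
The same check can be made from inside the JVM. The sketch below compares the process's open-file limit with twice the connection pool size. `FileDescriptorCheck` and its methods are hypothetical names. It relies on `com.sun.management.UnixOperatingSystemMXBean`, which is assumed to be available only on Unix-like platforms with a HotSpot/OpenJDK-style JVM, so the lookup returns -1 elsewhere.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: checks the "nofile" limit against the rule of thumb above.
final class FileDescriptorCheck {
    private FileDescriptorCheck() {}

    /** True when the open-file limit is at least double the connection pool size. */
    static boolean isSufficient(long maxOpenFiles, int connectionPoolSize) {
        return maxOpenFiles >= 2L * connectionPoolSize;
    }

    /** Returns this process's open-file limit, or -1 if the JVM does not expose it. */
    static long maxOpenFiles() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1L;
    }
}
```

A startup check could call `FileDescriptorCheck.isSufficient(FileDescriptorCheck.maxOpenFiles(), poolSize)` and log a warning when it returns false, where `poolSize` is whatever connection pool size your application configures.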

Azure SNAT (PAT) port exhaustion

If your app is deployed on Azure Virtual Machines without a public IP address, by default Azure SNAT ports establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the Azure SNAT configuration.

Azure SNAT ports are used only when your VM has a private IP address and a process from the VM tries to connect to a public IP address. There are two workarounds to avoid the Azure SNAT limitation:

  • Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see Azure Virtual Network service endpoints.

    When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using Virtual Network ACLs.

  • Assign a public IP to your Azure VM.

Can't reach the Service - firewall

ConnectTimeoutException indicates that the SDK cannot reach the service. You may get a failure similar to the following when using the direct mode:

GoneException{error=null, resourceAddress='', statusCode=410, message=Message: The requested resource is no longer available at the server., getCauseInfo=[class: class, message: connection timed out:]

If you have a firewall running on your app machine, open the port range 10,000 to 20,000, which is used by direct mode. Also follow the guidance in the Connection limit on a host machine section.
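
To confirm whether a firewall is blocking a given host and port, a plain TCP connection attempt with a timeout is often enough. The following is a diagnostic sketch, not an SDK feature. `PortProbe` and `canConnect` are hypothetical names, and the host and port to test are assumed to come from your own environment, for example from the address reported in the failure message.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper: tests raw TCP reachability of host:port.
final class PortProbe {
    private PortProbe() {}

    /**
     * Returns true if a TCP connection can be opened within the timeout.
     * Returns false on refusal, timeout, or an unresolvable host.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A false result only shows that the TCP handshake did not complete from this machine. It does not distinguish a local firewall from other network problems.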

HTTP proxy

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK ConnectionPolicy. Otherwise, you will face connection issues.

Invalid coding pattern: Blocking Netty IO thread

The SDK uses the Netty IO library to communicate with Azure Cosmos DB. The SDK has an Async API and uses non-blocking IO APIs of Netty. The SDK's IO work is performed on Netty IO threads. The number of Netty IO threads is configured to be the same as the number of CPU cores of the app machine.

The Netty IO threads are meant to be used only for non-blocking Netty IO work. The SDK returns the API invocation result on one of the Netty IO threads to the app's code. If the app performs a long-lasting operation after it receives results on the Netty thread, the SDK might not have enough IO threads to perform its internal IO work. Such app coding might result in low throughput, high latency, and io.netty.handler.timeout.ReadTimeoutException failures. The workaround is to switch the thread when you know the operation takes time.

For example, take a look at the following code snippet, which adds items to a container (setup of the database and container is not shown). If you perform long-lasting work that takes more than a few milliseconds on the Netty thread, you can eventually get into a state where no Netty IO thread is available to process IO work. As a result, you get a ReadTimeoutException failure.

Java SDK V4 (Maven) Async API

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

//Bad code with read timeout exception

int requestTimeoutInSeconds = 10;

/* ... */

AtomicInteger failureCount = new AtomicInteger();
// Max number of concurrent item inserts is # CPU cores + 1
Flux<Family> familyPub =
        Flux.just(Families.getAndersenFamilyItem(), Families.getAndersenFamilyItem(), Families.getJohnsonFamilyItem());
familyPub.flatMap(family -> {
    return container.createItem(family);
}).flatMap(r -> {
    try {
        // Time-consuming work is, for example,
        // writing to a file, computationally heavy work, or just sleep.
        // Basically, it's anything that takes more than a few milliseconds.
        // Doing such operations on the IO Netty thread
        // without a proper scheduler will cause problems.
        // The subscriber will get a ReadTimeoutException failure.
        TimeUnit.SECONDS.sleep(2 * requestTimeoutInSeconds);
    } catch (Exception e) {
        // Ignored: the sleep only simulates slow work.
    }
    return Mono.empty();
}).doOnError(Exception.class, exception -> {
    // Count each failure that reaches the subscriber.
    failureCount.incrementAndGet();
}).blockLast(); // Subscribes and waits; rethrows the failure if one occurred.
assert(failureCount.get() > 0);

The workaround is to change the thread on which you perform work that takes time. Define a singleton instance of the scheduler for your app.

Java SDK V4 (Maven) Async API

// Have a singleton instance of an executor and a scheduler.
ExecutorService ex = Executors.newFixedThreadPool(30);
Scheduler customScheduler = Schedulers.fromExecutor(ex);

You might need to do work that takes time, for example, computationally heavy work or blocking IO. In this case, switch the thread to a worker provided by your customScheduler by using the .publishOn(customScheduler) API.

Java SDK V4 (Maven) Async API

container.createItem(family)
        .publishOn(customScheduler) // Switches the thread.
        // ...
        .subscribe();

By using publishOn(customScheduler), you release the Netty IO thread and switch to your own custom thread provided by the custom scheduler. This modification solves the problem. You won't get an io.netty.handler.timeout.ReadTimeoutException failure anymore.

Request rate too large

This failure is a server-side failure. It indicates that you consumed your provisioned throughput. Retry later. If you get this failure often, consider increasing the collection throughput.

  • Implement backoff at getRetryAfterInMilliseconds intervals

    During performance testing, you should increase load until a small rate of requests gets throttled. If throttled, the client application should back off for the server-specified retry interval. Respecting the backoff ensures that you spend a minimal amount of time waiting between retries.
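
The backoff rule can be sketched as a small helper that prefers the server-specified interval and waits for it before the next attempt. `ThrottleBackoff`, `delayFor`, and `pause` are hypothetical names. The fallback interval used when no server value is available is an assumption of this sketch, and how you read the retry interval from the throttling failure depends on the SDK version you use.

```java
import java.time.Duration;

// Hypothetical helper: honors the server-specified retry interval on throttling.
final class ThrottleBackoff {
    private ThrottleBackoff() {}

    /**
     * Returns the server-specified interval when it is present and positive.
     * Otherwise returns the caller's fallback (an assumption of this sketch).
     */
    static Duration delayFor(Duration serverRetryAfter, Duration fallback) {
        if (serverRetryAfter != null
                && !serverRetryAfter.isNegative()
                && !serverRetryAfter.isZero()) {
            return serverRetryAfter;
        }
        return fallback;
    }

    /** Sleeps for the delay. Returns false if the thread was interrupted while waiting. */
    static boolean pause(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
            return false;
        }
    }
}
```

Sleeping blocks the calling thread, so in an async pipeline this kind of wait belongs on an application thread, not on a Netty IO thread.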

Failure connecting to Azure Cosmos DB Emulator

The Azure Cosmos DB Emulator HTTPS certificate is self-signed. For the SDK to work with the emulator, import the emulator certificate to a Java TrustStore. For more information, see Export Azure Cosmos DB Emulator certificates.

Dependency conflict issues

The Azure Cosmos DB Java SDK pulls in a number of dependencies. If your project dependency tree includes an older version of an artifact that the SDK depends on, your application may generate unexpected errors at run time. If you are debugging why your application unexpectedly throws an exception, double-check that your dependency tree is not accidentally pulling in an older version of one or more of the Azure Cosmos DB Java SDK dependencies.

The workaround for such an issue is to identify which of your project dependencies brings in the old version and exclude the transitive dependency on that older version, and allow Azure Cosmos DB Java SDK to bring in the newer version.

To identify which of your project dependencies brings in an older version of something that Azure Cosmos DB Java SDK depends on, run the following command against your project pom.xml file:

mvn dependency:tree

For more information, see the Maven dependency tree guide.

Once you know which dependency of your project depends on an older version, you can modify the dependency on that lib in your pom file and exclude the transitive dependency, following the example below (which assumes that reactor-core is the outdated dependency):
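
A sketch of such an exclusion in pom.xml follows. The outer groupId, artifactId, and version are hypothetical placeholders for whichever of your dependencies brings in the older reactor-core. Only the excluded coordinates, io.projectreactor:reactor-core, refer to a real artifact.

```xml
<dependency>
  <!-- Hypothetical placeholders: replace with the library that pulls in the old reactor-core. -->
  <groupId>com.example</groupId>
  <artifactId>library-with-old-reactor</artifactId>
  <version>1.0.0</version>
  <exclusions>
    <!-- Drop the transitive reactor-core so the version required by the SDK is used instead. -->
    <exclusion>
      <groupId>io.projectreactor</groupId>
      <artifactId>reactor-core</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After the change, running mvn dependency:tree again shows whether the older version is gone.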


For more information, see the exclude transitive dependency guide.

Enable client SDK logging

Azure Cosmos DB Java SDK v4 uses SLF4J as the logging facade, which supports logging into popular logging frameworks such as log4j and logback.

For example, if you want to use log4j as the logging framework, add the following libs in your Java classpath.
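
As a sketch, the Maven dependencies for this setup could look like the following. It assumes the log4j 1.x binding for SLF4J, which matches the log4j.properties format used in this section. The ${slf4j.version} and ${log4j.version} properties are placeholders that you define in your own pom.xml.

```xml
<!-- SLF4J binding that routes SLF4J calls to log4j 1.x -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>${slf4j.version}</version>
</dependency>
<!-- log4j 1.x itself -->
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>${log4j.version}</version>
</dependency>
```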


Also add a log4j config.

# this is a sample log4j configuration

# Set root logger level to INFO and its only appender to A1.
log4j.rootLogger=INFO, A1
# A1 is set to be a ConsoleAppender.
log4j.appender.A1=org.apache.log4j.ConsoleAppender

# A1 uses PatternLayout.
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d %5X{pid} [%t] %-5p %c - %m%n

For more information, see the SLF4J logging manual.

OS network statistics

Run the netstat command to get a sense of how many connections are in states such as ESTABLISHED and CLOSE_WAIT.

On Linux, you can run the following command.

netstat -nap

On Windows, you can run the same command with different argument flags:

netstat -abn

Filter the result to only connections to the Azure Cosmos DB endpoint.

The number of connections to the Azure Cosmos DB endpoint in the ESTABLISHED state can't be greater than your configured connection pool size.

Many connections to the Azure Cosmos DB endpoint might be in the CLOSE_WAIT state. There might be more than 1,000. A number that high indicates that connections are established and torn down quickly. This situation potentially causes problems. For more information, see the Common issues and workarounds section.
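
Counting connection states by hand gets tedious with long netstat output. The sketch below tallies states for lines that mention a given endpoint. `NetstatSummary` and `countStates` are hypothetical names, the list of state names covers common values only, and the addresses in the usage example are made-up documentation addresses. The input is assumed to be netstat output that you captured yourself and split into lines.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: counts TCP states in captured netstat output.
final class NetstatSummary {
    // Common state names only; extend this set for your platform if needed.
    private static final Set<String> STATES = Set.of(
            "ESTABLISHED", "CLOSE_WAIT", "TIME_WAIT", "SYN_SENT", "SYN_RECV",
            "FIN_WAIT1", "FIN_WAIT2", "LAST_ACK", "CLOSING", "LISTEN", "LISTENING");

    private NetstatSummary() {}

    /** Counts states for lines containing endpointFilter, for example "203.0.113.10:". */
    static Map<String, Integer> countStates(List<String> netstatLines, String endpointFilter) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : netstatLines) {
            if (!line.contains(endpointFilter)) {
                continue; // Not a connection to the endpoint of interest.
            }
            for (String token : line.trim().split("\\s+")) {
                if (STATES.contains(token)) {
                    counts.merge(token, 1, Integer::sum);
                    break; // One state per connection line.
                }
            }
        }
        return counts;
    }
}
```

For example, passing the lines of a netstat -nap capture together with the numeric address of your Azure Cosmos DB endpoint returns a map such as ESTABLISHED and CLOSE_WAIT counts, which you can compare with your configured connection pool size.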