Diagnose and troubleshoot issues when using Azure Cosmos DB .NET SDK

This article covers common issues, workarounds, diagnostic steps, and tools when you use the .NET SDK with Azure Cosmos DB SQL API accounts. The .NET SDK provides a client-side logical representation to access the Azure Cosmos DB SQL API. This article describes tools and approaches to help you if you run into any issues.

Checklist for troubleshooting issues

Consider the following checklist before you move your application to production. Using the checklist will prevent several common issues, and it will also help you diagnose more quickly when an issue does occur:

  • Use the latest SDK. Preview SDKs should not be used for production. This will prevent hitting known issues that are already fixed.
  • Review the performance tips, and follow the suggested practices. This will help prevent scaling, latency, and other performance issues.
  • Enable SDK logging to help you troubleshoot an issue. Enabling the logging may affect performance, so it's best to enable it only while troubleshooting. You can enable the following logs:
    • Log metrics by using the Azure portal. Portal metrics show the Azure Cosmos DB telemetry, which is helpful for determining whether the issue is on the Azure Cosmos DB side or the client side.
    • Log the diagnostics string in the V2 SDK or the diagnostics in the V3 SDK from the point operation responses (see the sketch after this list).
    • Log the SQL Query Metrics from all the query responses.
    • Follow the setup for SDK logging.
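
For example, here's a minimal sketch of logging the point operation diagnostics with the V3 SDK; the 100-millisecond threshold, the item id, and the partition key value are illustrative, and the container is assumed to already exist (in the V2 SDK, the equivalent is logging the RequestDiagnosticsString property from the response):

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DiagnosticsLogging
{
    // Logs the V3 diagnostics for a point read when it is slower than an
    // illustrative threshold, and for any failed request.
    public static async Task ReadItemWithDiagnosticsAsync(Container container)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        try
        {
            ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
                "item-id", new PartitionKey("partition-key-value"));
            stopwatch.Stop();

            if (stopwatch.Elapsed > TimeSpan.FromMilliseconds(100))
            {
                // CosmosDiagnostics.ToString() produces the diagnostics text to log.
                Console.WriteLine(response.Diagnostics.ToString());
            }
        }
        catch (CosmosException ex)
        {
            // Failed requests also carry diagnostics on the exception.
            Console.WriteLine(ex.Diagnostics.ToString());
            throw;
        }
    }
}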

Take a look at the Common issues and workarounds section in this article.

Check the GitHub issues section, which is actively monitored. Check to see whether any similar issue with a workaround is already filed. If you don't find a solution, file a GitHub issue. You can open a support ticket for urgent issues.

Common issues and workarounds

General suggestions

  • Run your app in the same Azure region as your Azure Cosmos DB account, whenever possible.
  • You may run into connectivity/availability issues due to lack of resources on your client machine. We recommend monitoring your CPU utilization on nodes running the Azure Cosmos DB client, and scaling up/out if they're running at high load.

Check the portal metrics

Checking the portal metrics helps determine whether the issue is on the client side or with the service. For example, if the metrics show a high rate of rate-limited requests (HTTP status code 429), which means requests are getting throttled, check the Request rate too large section.

Request timeouts

RequestTimeout usually happens when using Direct/TCP mode, but it can also happen in Gateway mode. The following are the common known causes and suggestions on how to fix the problem.

  • CPU utilization is high, which will cause latency and/or request timeouts. The customer can scale up the host machine to give it more resources, or the load can be distributed across more machines.
  • Socket / Port availability might be low. When running in Azure, clients using the .NET SDK can hit Azure SNAT (PAT) port exhaustion. To reduce the chance of hitting this issue, use the latest version 2.x or 3.x of the .NET SDK. This is an example of why it is recommended to always run the latest SDK version.
  • Creating multiple DocumentClient instances might lead to connection contention and timeout issues. Follow the performance tips, and use a single DocumentClient instance across an entire process (see the singleton sketch after this list).
  • Users sometimes see elevated latency or request timeouts because their collections are insufficiently provisioned, the back-end throttles requests, and the client retries internally. Check the portal metrics.
  • Azure Cosmos DB distributes the overall provisioned throughput evenly across physical partitions. Check portal metrics to see if the workload is encountering a hot partition key. This will cause the aggregate consumed throughput (RU/s) to appear to be under the provisioned RUs, while a single partition's consumed throughput (RU/s) exceeds the provisioned throughput.
  • Additionally, the 2.0 SDK adds channel semantics to direct/TCP connections. One TCP connection is used for multiple requests at the same time. This can lead to two issues in specific cases:
    • A high degree of concurrency can lead to contention on the channel.
    • Large requests or responses can lead to head-of-line blocking on the channel and exacerbate contention, even with a relatively low degree of concurrency.
    • If the case falls into either of these two categories (or if high CPU utilization is suspected), these are possible mitigations:
      • Try to scale the application up/out.
      • Additionally, SDK logs can be captured through a trace listener to get more details.
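
To avoid the connection contention described above, here's a minimal sketch of a single, process-wide client instance using the V3 SDK; the endpoint and key placeholders are illustrative, and the same pattern applies to DocumentClient in the V2 SDK:

using System;
using Microsoft.Azure.Cosmos;

public static class CosmosClientHolder
{
    // Create one CosmosClient for the whole process and reuse it everywhere;
    // creating a client per request causes connection contention and timeouts.
    private static readonly Lazy<CosmosClient> lazyClient = new Lazy<CosmosClient>(() =>
        new CosmosClient(
            "https://<your-account>.documents.azure.com:443/", // illustrative endpoint
            "<your-account-key>",                              // illustrative key
            new CosmosClientOptions
            {
                ConnectionMode = ConnectionMode.Direct // Direct/TCP, per the performance tips
            }));

    public static CosmosClient Client => lazyClient.Value;
}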

High network latency

High network latency can be identified by using the diagnostics string in the V2 SDK or diagnostics in V3 SDK.

If no timeouts are present and the diagnostics show single requests where the high latency is evident in the difference between ResponseTime and RequestStartTime, like so (more than 300 milliseconds in this example):

RequestStartTime: 2020-03-09T22:44:49.5373624Z, RequestEndTime: 2020-03-09T22:44:49.9279906Z,  Number of regions attempted:1
ResponseTime: 2020-03-09T22:44:49.9279906Z, StoreResult: StorePhysicalAddress: rntbd://..., ...

This latency can have multiple causes:

Azure SNAT (PAT) port exhaustion

If your app is deployed on Azure Virtual Machines without a public IP address, by default Azure SNAT ports establish connections to any endpoint outside of your VM. The number of connections allowed from the VM to the Azure Cosmos DB endpoint is limited by the Azure SNAT configuration. This situation can lead to connection throttling, connection closure, or the request timeouts mentioned earlier.

Azure SNAT ports are used only when your VM has a private IP address and is connecting to a public IP address. There are two workarounds to avoid the Azure SNAT limitation (provided you're already using a single client instance across the entire application):

  • Add your Azure Cosmos DB service endpoint to the subnet of your Azure Virtual Machines virtual network. For more information, see Azure Virtual Network service endpoints.

    When the service endpoint is enabled, the requests are no longer sent from a public IP to Azure Cosmos DB. Instead, the virtual network and subnet identity are sent. This change might result in firewall drops if only public IPs are allowed. If you use a firewall, when you enable the service endpoint, add a subnet to the firewall by using Virtual Network ACLs.

  • Assign a public IP to your Azure VM.

HTTP proxy

If you use an HTTP proxy, make sure it can support the number of connections configured in the SDK ConnectionPolicy. Otherwise, you'll face connection issues.
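
For example, with the V3 SDK the proxy and the gateway connection limit can be set through CosmosClientOptions. This is a minimal sketch; the proxy address, connection limit, endpoint, and key are illustrative values:

using System.Net;
using Microsoft.Azure.Cosmos;

// Gateway mode sends requests over HTTP, so the proxy must be able to handle
// the number of concurrent connections the client is configured to open.
CosmosClientOptions options = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Gateway,
    GatewayModeMaxConnectionLimit = 50,                // illustrative; match your proxy capacity
    WebProxy = new WebProxy("http://your-proxy:8080")  // illustrative proxy address
};

CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    "<your-account-key>",
    options);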

Request rate too large

'Request rate too large' or error code 429 indicates that your requests are being throttled, because the consumed throughput (RU/s) has exceeded the provisioned throughput. The SDK will automatically retry requests based on the specified retry policy. If you get this failure often, consider increasing the throughput on the collection. Check the portal's metrics to see if you are getting 429 errors. Review your partition key to ensure it results in an even distribution of storage and request volume.
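
Here's a minimal sketch of tuning the SDK's automatic retry behavior for rate-limited requests with the V3 SDK; the retry count and wait time are illustrative values, not recommendations:

using System;
using Microsoft.Azure.Cosmos;

// The SDK retries 429 responses automatically; these options control how many
// times and for how long it retries before the 429 surfaces to the application.
CosmosClientOptions options = new CosmosClientOptions
{
    MaxRetryAttemptsOnRateLimitedRequests = 9,                        // illustrative
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)  // illustrative
};

CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    "<your-account-key>",
    options);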

Slow query performance

The query metrics will help determine where the query is spending most of the time. From the query metrics, you can see how much of it is being spent on the back-end vs the client.

  • If the back-end query returns quickly but a large amount of time is spent on the client, check the load on the machine. It's likely that there are not enough resources and the SDK is waiting for resources to be available to handle the response.
  • If the back-end query is slow, try optimizing the query and looking at the current indexing policy. A sketch for capturing the query metrics follows this list.
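
Here's a minimal sketch of capturing the diagnostics, which include the back-end query metrics, for each page of a query with the V3 SDK; the query text is illustrative. In the V2 SDK, the equivalent is to set PopulateQueryMetrics on FeedOptions and read the QueryMetrics from the feed response:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class QueryDiagnosticsLogging
{
    // Runs a query and logs the diagnostics for every page so the time spent
    // on the back end can be compared with the time spent on the client.
    public static async Task QueryWithMetricsAsync(Container container)
    {
        QueryDefinition query = new QueryDefinition("SELECT * FROM c"); // illustrative query

        FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(query);
        while (iterator.HasMoreResults)
        {
            FeedResponse<dynamic> page = await iterator.ReadNextAsync();

            // The diagnostics string for each page contains the query metrics.
            Console.WriteLine($"RU charge: {page.RequestCharge}");
            Console.WriteLine(page.Diagnostics.ToString());
        }
    }
}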

HTTP 401: The MAC signature found in the HTTP request is not the same as the computed signature

If you receive the 401 error message "The MAC signature found in the HTTP request is not the same as the computed signature," it can be caused by the following scenarios.

  1. The key was rotated and did not follow the best practices. This is usually the case. Cosmos DB account key rotation can take anywhere from a few seconds to possibly days depending on the Cosmos DB account size.
    1. The 401 MAC signature error is seen shortly after a key rotation and eventually stops without any changes.
  2. The key is misconfigured on the application so the key does not match the account.
    1. The 401 MAC signature issue is consistent and happens for all calls.
  3. The application is using the read-only keys for write operations.
    1. The 401 MAC signature issue will only happen when the application is doing write requests; read requests will succeed.
  4. There is a race condition with container creation. An application instance is trying to access the container before container creation is complete. The most common scenario for this is when the container is deleted and recreated with the same name while the application is running. The SDK will attempt to use the new container, but the container creation is still in progress, so it does not have the keys.
    1. The 401 MAC signature issue is seen shortly after a container creation, and only occurs until the container creation is completed.

HTTP Error 400. The size of the request headers is too long.

The size of the header has grown too large and exceeds the maximum allowed size. It's always recommended to use the latest SDK. Make sure to use at least version 3.x or 2.x, which adds header size tracing to the exception message.

Causes:

  1. The session token has grown too large. The session token grows as the number of partitions in the container increases.
  2. The continuation token has grown too large. Different queries will have different continuation token sizes.
  3. It's caused by a combination of the session token and continuation token.

Solution:

  1. Follow the performance tips and convert the application to Direct + TCP connection mode. Direct + TCP does not have the header size restriction that HTTP does, which avoids this issue.
  2. If the session token is the cause, then a temporary mitigation is to restart the application. Restarting the application instance will reset the session token. If the exceptions stop after the restart, it confirms the session token is the cause. It will eventually grow back to a size that causes the exception.
  3. If the application cannot be converted to Direct + TCP and the session token is the cause, then a mitigation is to change the client consistency level. The session token is only used for session consistency, which is the default for Azure Cosmos DB. Any other consistency level will not use the session token.
  4. If the application cannot be converted to Direct + TCP and the continuation token is the cause, then try setting the ResponseContinuationTokenLimitInKb option. The option can be found in the FeedOptions for V2 or the QueryRequestOptions in V3. A sketch combining these mitigations follows this list.
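
Here's a minimal sketch combining these mitigations with the V3 SDK; the consistency level, the 1-KB continuation token limit, and the database and container names are illustrative choices, not recommendations:

using Microsoft.Azure.Cosmos;

// Solutions 1 and 3: Direct + TCP avoids the HTTP header size limit, and a
// consistency level other than Session stops the client from sending session tokens.
CosmosClientOptions clientOptions = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    ConsistencyLevel = ConsistencyLevel.Eventual // illustrative; choose what your app needs
};

CosmosClient client = new CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    "<your-account-key>",
    clientOptions);

// Solution 4: cap the continuation token size for queries
// (QueryRequestOptions in V3, FeedOptions in V2).
QueryRequestOptions queryOptions = new QueryRequestOptions
{
    ResponseContinuationTokenLimitInKb = 1 // illustrative limit in KB
};

FeedIterator<dynamic> iterator = client
    .GetContainer("MyDatabase", "MyContainer") // illustrative names
    .GetItemQueryIterator<dynamic>(new QueryDefinition("SELECT * FROM c"), requestOptions: queryOptions);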