Troubleshoot query issues when using the Azure Cosmos DB API for MongoDB

APPLIES TO: Azure Cosmos DB API for MongoDB

This article walks through a general recommended approach for troubleshooting queries in Azure Cosmos DB. Although you shouldn't consider the steps outlined in this article a complete defense against potential query issues, we've included the most common performance tips here. You should use this article as a starting place for troubleshooting slow or expensive queries in Azure Cosmos DB's API for MongoDB. If you are using the Azure Cosmos DB core (SQL) API, see the SQL API query troubleshooting guide article.

Query optimizations in Azure Cosmos DB are broadly categorized as follows:

  • Optimizations that reduce the Request Unit (RU) charge of the query
  • Optimizations that just reduce latency

If you reduce the RU charge of a query, you'll typically decrease latency as well.

This article provides examples that you can re-create by using the nutrition dataset.

Note

This article assumes you are using Azure Cosmos DB's API for MongoDB accounts with version 3.6 and higher. Some queries that perform poorly in version 3.2 have significant improvements in versions 3.6+. Upgrade to version 3.6 by filing a support request.

Use $explain command to get metrics

When you optimize a query in Azure Cosmos DB, the first step is always to obtain the RU charge for your query. As a rough guideline, you should explore ways to lower the RU charge for queries with charges greater than 50 RUs.

In addition to obtaining the RU charge, you should use the $explain command to obtain the query and index usage metrics. Here is an example that runs a query and uses the $explain command to show query and index usage metrics:

$explain command:

db.coll.find({foodGroup: "Baby Foods"}).explain({"executionStatistics": true })

Output:

{
    "stages" : [ 
        {
            "stage" : "$query",
            "timeInclusiveMS" : 905.2888,
            "timeExclusiveMS" : 905.2888,
            "in" : 362,
            "out" : 362,
            "details" : {
                "database" : "db-test",
                "collection" : "collection-test",
                "query" : {
                    "foodGroup" : {
                        "$eq" : "Baby Foods"
                    }
                },
                "pathsIndexed" : [],
                "pathsNotIndexed" : [ 
                    "foodGroup"
                ],
                "shardInformation" : [ 
                    {
                        "activityId" : "e68e6bdd-5e89-4ec5-b053-3dbbc2428140",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 788.5867,
                        "preemptions" : 1,
                        "outputDocumentCount" : 362,
                        "retrievedDocumentCount" : 8618
                    }
                ],
                "queryMetrics" : {
                    "retrievedDocumentCount" : 8618,
                    "retrievedDocumentSizeBytes" : 104963042,
                    "outputDocumentCount" : 362,
                    "outputDocumentSizeBytes" : 2553535,
                    "indexHitRatio" : 0.0016802042237178,
                    "totalQueryExecutionTimeMS" : 777.72,
                    "queryPreparationTimes" : {
                        "queryCompilationTimeMS" : 0.19,
                        "logicalPlanBuildTimeMS" : 0.14,
                        "physicalPlanBuildTimeMS" : 0.09,
                        "queryOptimizationTimeMS" : 0.03
                    },
                    "indexLookupTimeMS" : 0,
                    "documentLoadTimeMS" : 687.22,
                    "vmExecutionTimeMS" : 774.09,
                    "runtimeExecutionTimes" : {
                        "queryEngineExecutionTimeMS" : 37.45,
                        "systemFunctionExecutionTimeMS" : 10.82,
                        "userDefinedFunctionExecutionTimeMS" : 0
                    },
                    "documentWriteTimeMS" : 49.42
                }
            }
        }
    ],
    "estimatedDelayFromRateLimitingInMilliseconds" : 0.0,
    "continuation" : {
        "hasMore" : false
    },
    "ok" : 1.0
}

The $explain command output is lengthy and has detailed information about query execution. In general, though, there are a few sections where you should focus when optimizing query performance:

Metric Description
timeInclusiveMS Backend query latency
pathsIndexed Shows indexes that the query used
pathsNotIndexed Shows indexes that the query could have used, if available
shardInformation Summary of query performance for a particular physical partition
retrievedDocumentCount Number of documents loaded by the query engine
outputDocumentCount Number of documents returned in the query results
estimatedDelayFromRateLimitingInMilliseconds Estimated additional query latency due to rate limiting

After you get the query metrics, compare the retrievedDocumentCount with the outputDocumentCount for your query. Use this comparison to identify the relevant sections to review in this article. The retrievedDocumentCount is the number of documents that the query engine needs to load. The outputDocumentCount is the number of documents that were needed for the results of the query. If the retrievedDocumentCount is significantly higher than the outputDocumentCount, there was at least one part of your query that was unable to use an index and needed to do a scan.

Refer to the following sections to understand the relevant query optimizations for your scenario.

Query's RU charge is too high

Retrieved Document Count is significantly higher than Output Document Count

Retrieved Document Count is approximately equal to Output Document Count

Query's RU charge is acceptable but latency is still too high

Queries where retrieved document count exceeds output document count

The retrievedDocumentCount is the number of documents that the query engine needed to load. The outputDocumentCount is the number of documents returned by the query. If the retrievedDocumentCount is significantly higher than the outputDocumentCount, there was at least one part of your query that was unable to use an index and needed to do a scan.

Here's an example of scan query that wasn't entirely served by the index:

$explain command:

db.coll.find(
  {
    $and : [
            { "foodGroup" : "Cereal Grains and Pasta"}, 
            { "description" : "Oat bran, cooked"}
        ]
  }
).explain({"executionStatistics": true })

Output:

{
    "stages" : [ 
        {
            "stage" : "$query",
            "timeInclusiveMS" : 436.5716,
            "timeExclusiveMS" : 436.5716,
            "in" : 1,
            "out" : 1,
            "details" : {
                "database" : "db-test",
                "collection" : "indexing-test",
                "query" : {
                    "$and" : [ 
                        {
                            "foodGroup" : {
                                "$eq" : "Cereal Grains and Pasta"
                            }
                        }, 
                        {
                            "description" : {
                                "$eq" : "Oat bran, cooked"
                            }
                        }
                    ]
                },
                "pathsIndexed" : [],
                "pathsNotIndexed" : [ 
                    "foodGroup", 
                    "description"
                ],
                "shardInformation" : [ 
                    {
                        "activityId" : "13a5977e-a10a-4329-b68e-87e4f0081cac",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 435.4867,
                        "preemptions" : 1,
                        "outputDocumentCount" : 1,
                        "retrievedDocumentCount" : 8618
                    }
                ],
                "queryMetrics" : {
                    "retrievedDocumentCount" : 8618,
                    "retrievedDocumentSizeBytes" : 104963042,
                    "outputDocumentCount" : 1,
                    "outputDocumentSizeBytes" : 6064,
                    "indexHitRatio" : 0.0,
                    "totalQueryExecutionTimeMS" : 433.64,
                    "queryPreparationTimes" : {
                        "queryCompilationTimeMS" : 0.12,
                        "logicalPlanBuildTimeMS" : 0.09,
                        "physicalPlanBuildTimeMS" : 0.1,
                        "queryOptimizationTimeMS" : 0.02
                    },
                    "indexLookupTimeMS" : 0,
                    "documentLoadTimeMS" : 387.44,
                    "vmExecutionTimeMS" : 432.93,
                    "runtimeExecutionTimes" : {
                        "queryEngineExecutionTimeMS" : 45.36,
                        "systemFunctionExecutionTimeMS" : 16.86,
                        "userDefinedFunctionExecutionTimeMS" : 0
                    },
                    "documentWriteTimeMS" : 0.13
                }
            }
        }
    ],
    "estimatedDelayFromRateLimitingInMilliseconds" : 0.0,
    "continuation" : {
        "hasMore" : false
    },
    "ok" : 1.0
}

The retrievedDocumentCount (8618) is significantly higher than the outputDocumentCount (1), implying that this query required a document scan.

Include necessary indexes

You should check the pathsNotIndexed array and add these indexes. In this example, the paths foodGroup and description should be indexed.

"pathsNotIndexed" : [ 
                    "foodGroup", 
                    "description"
                ]

Indexing best practices in Azure Cosmos DB's API for MongoDB are different from MongoDB. In Azure Cosmos DB's API for MongoDB, compound indexes are only used in queries that need to efficiently sort by multiple properties. If you have queries with filters on multiple properties, you should create single field indexes for each of these properties. Query predicates can use multiple single field indexes.

Wildcard indexes can simplify indexing. Unlike in MongoDB, wildcard indexes can support multiple fields in query predicates. There will not be a difference in query performance if you use one single wildcard index instead of creating a separate index for each property. Adding a wildcard index for all properties is the easiest way to optimize all of your queries.

You can add new indexes at any time, with no effect on write or read availability. You can track index transformation progress.

Understand which aggregation operations use the index

In most cases, aggregation operations in Azure Cosmos DB's API for MongoDB will partially use indexes. Typically, the query engine will apply equality and range filters first and use indexes. After applying these filters, the query engine can evaluate additional filters and resort to loading remaining documents to compute the aggregate, if needed.

Here's an example:

db.coll.aggregate( [
   { $match: { foodGroup: 'Fruits and Fruit Juices' } },
   {
     $group: {
        _id: "$foodGroup",
        total: { $max: "$version" }
     }
   }
] )

In this case, indexes can optimize the $match stage. Adding an index for foodGroup will significantly improve query performance. Like in MongoDB, you should place $match as early in the aggregation pipeline as possible to maximize usage of indexes.

In Azure Cosmos DB's API for MongoDB, indexes are not used for the actual aggregation, which in this case is $max. Adding an index on version will not improve query performance.

Queries where retrieved document count is equal to Output Document Count

If the retrievedDocumentCount is approximately equal to the outputDocumentCount, the query engine didn't have to scan many unnecessary documents.

Minimize cross partition queries

Azure Cosmos DB uses partitioning to scale individual containers as Request Unit and data storage needs increase. Each physical partition has a separate and independent index. If your query has an equality filter that matches your container's partition key, you'll need to check only the relevant partition's index. This optimization reduces the total number of RUs that the query requires. Learn more about the differences between in-partition queries and cross-partition queries.

If you have a large number of provisioned RUs (more than 30,000) or a large amount of data stored (more than approximately 100 GB), you probably have a large enough container to see a significant reduction in query RU charges.

You can check the shardInformation array to understand the query metrics for each individual physical partition. The number of unique shardKeyRangeId values is the number of physical partitions where the query needed to be executed. In this example, the query was executed on four physical partitions. It's important to understand that execution is completely independent from index utilization. In other words, cross-partition queries can still use indexes.

  "shardInformation" : [ 
                    {
                        "activityId" : "42f670a8-a201-4c58-8023-363ac18d9e18",
                        "shardKeyRangeId" : "5",
                        "durationMS" : 24.3859,
                        "preemptions" : 1,
                        "outputDocumentCount" : 463,
                        "retrievedDocumentCount" : 463
                    }, 
                    {
                        "activityId" : "a8bf762a-37b9-4c07-8ed4-ae49961373c0",
                        "shardKeyRangeId" : "2",
                        "durationMS" : 35.8328,
                        "preemptions" : 1,
                        "outputDocumentCount" : 905,
                        "retrievedDocumentCount" : 905
                    }, 
                    {
                        "activityId" : "3754e36b-4258-49a6-8d4d-010555628395",
                        "shardKeyRangeId" : "1",
                        "durationMS" : 67.3969,
                        "preemptions" : 1,
                        "outputDocumentCount" : 1479,
                        "retrievedDocumentCount" : 1479
                    }, 
                    {
                        "activityId" : "a69a44ee-db97-4fe9-b489-3791f3d52878",
                        "shardKeyRangeId" : "0",
                        "durationMS" : 185.1523,
                        "preemptions" : 1,
                        "outputDocumentCount" : 867,
                        "retrievedDocumentCount" : 867
                    }
                ]

Optimizations that reduce query latency

In many cases, the RU charge might be acceptable when query latency is still too high. The following sections give an overview of tips for reducing query latency. If you run the same query multiple times on the same dataset, it will typically have the same RU charge each time. But query latency might vary between query executions.

Improve proximity

Queries that are run from a different region than the Azure Cosmos DB account will have higher latency than if they were run inside the same region. For example, if you're running code on your desktop computer, you should expect latency to be tens or hundreds of milliseconds higher (or more) than if the query came from a virtual machine within the same Azure region as Azure Cosmos DB. It's simple to globally distribute data in Azure Cosmos DB to ensure you can bring your data closer to your app.

Increase provisioned throughput

In Azure Cosmos DB, your provisioned throughput is measured in Request Units (RUs). Imagine you have a query that consumes 5 RUs of throughput. For example, if you provision 1,000 RUs, you would be able to run that query 200 times per second. If you tried to run the query when there wasn't enough throughput available, Azure Cosmos DB will rate limit the requests. Azure Cosmos DB's API for MongoDB will automatically retry this query after waiting for a short time. Throttled requests take longer, so increasing provisioned throughput can improve query latency.

The value estimatedDelayFromRateLimitingInMilliseconds gives a sense of the potential latency benefits if you increase throughput.

Next steps