Index large data sets in Azure Cognitive Search

Azure Cognitive Search supports two basic approaches for importing data into a search index: pushing your data into the index programmatically, or pointing an Azure Cognitive Search indexer at a supported data source to pull in the data.

As data volumes grow or processing needs change, you might find that simple or default indexing strategies are no longer practical. For Azure Cognitive Search, there are several approaches for accommodating larger data sets, ranging from how you structure a data upload request, to using a source-specific indexer for scheduled and distributed workloads.

The same techniques also apply to long-running processes. In particular, the steps outlined in parallel indexing are helpful for computationally intensive indexing, such as image analysis or natural language processing in an AI enrichment pipeline.

The following sections explain techniques for indexing large amounts of data using both the push API and indexers. You should also review Tips for improving performance for more best practices.

For a C# tutorial and code sample, see Tutorial: Optimize indexing speeds.

Indexing large data sets with the "push" API

When pushing large data volumes into an index using the Add Documents REST API or the IndexDocuments method (Azure SDK for .NET), batching documents and managing threads are two techniques that improve indexing speed.

Batch multiple documents per request

One of the simplest mechanisms for indexing a larger data set is to submit multiple documents or records in a single request. As long as the entire payload is under 16 MB, a request can handle up to 1,000 documents in a bulk upload operation. These limits apply whether you're using the Add Documents REST API or the IndexDocuments method in the .NET SDK. For either API, you would package 1,000 documents in the body of each request.
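
For illustration, here's a minimal C# sketch of a batched upload using the Azure.Search.Documents SDK. The service endpoint, index name, and Hotel document class are placeholders, not part of the original walkthrough:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Placeholder document class; replace with one that matches your index schema.
public class Hotel
{
    public string HotelId { get; set; }
    public string HotelName { get; set; }
}

public static class BulkUpload
{
    public static async Task UploadAsync(IReadOnlyList<Hotel> documents)
    {
        var client = new SearchClient(
            new Uri("https://<your-service>.search.windows.net"),
            "hotels-index",
            new AzureKeyCredential("<admin-api-key>"));

        const int batchSize = 1000; // service maximum per request

        for (int i = 0; i < documents.Count; i += batchSize)
        {
            // Package up to 1,000 documents into a single upload request.
            var batch = IndexDocumentsBatch.Upload(documents.Skip(i).Take(batchSize));
            await client.IndexDocumentsAsync(batch);
        }
    }
}
```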

Batching documents significantly improves indexing performance. Determining the optimal batch size for your data is a key component of optimizing indexing speed. The two primary factors influencing the optimal batch size are:

  • The schema of your index
  • The size of your data

Because the optimal batch size depends on your index and your data, the best approach is to test different batch sizes to determine which one results in the fastest indexing speeds for your scenario. Tutorial: Optimize indexing with the push API provides sample code for testing batch sizes using the .NET SDK.
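
One way to run that test, sketched under the same placeholder assumptions as above, is to time an upload at several candidate batch sizes and compare per-document throughput:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public static class BatchSizeTester
{
    // Times one upload per candidate batch size; run against representative data.
    public static async Task CompareAsync<T>(SearchClient client, IReadOnlyList<T> documents)
    {
        int[] candidateSizes = { 100, 500, 1000 };

        foreach (int size in candidateSizes.Where(s => s <= documents.Count))
        {
            var batch = IndexDocumentsBatch.Upload(documents.Take(size));

            var stopwatch = Stopwatch.StartNew();
            await client.IndexDocumentsAsync(batch);
            stopwatch.Stop();

            Console.WriteLine(
                $"Batch size {size}: {stopwatch.ElapsedMilliseconds} ms total, " +
                $"{(double)stopwatch.ElapsedMilliseconds / size:F2} ms per document");
        }
    }
}
```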

Add threads and a retry strategy

Indexers have built-in thread management, but when you're using the push APIs, your application code will have to manage threads. Make sure there are sufficient threads to make full use of the available capacity.

  1. Increase the number of threads in your client code. As you increase the tier of your search service or add partitions, you should also increase the number of concurrent threads so that you can take full advantage of the new capacity.

  2. As you ramp up the requests hitting the search service, you may encounter HTTP status codes indicating the request didn't fully succeed. During indexing, two common HTTP status codes are:

    • 503 Service Unavailable - This error means that the system is under heavy load and your request can't be processed at this time.

    • 207 Multi-Status - This response means that some documents succeeded, but at least one failed.

  3. To handle failures, requests should be retried using an exponential backoff retry strategy.

The Azure .NET SDK automatically retries 503s and other failed requests, but you'll need to implement your own logic to retry 207s. Open-source tools such as Polly can also be used to implement a retry strategy.
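
Putting those pieces together, the following sketch shows one way to combine client-side threading with exponential backoff. The keySelector parameter is a hypothetical addition for mapping a document back to its key, so that only the documents reported as failed in a 207 response are resubmitted:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public static class ParallelIndexer
{
    // Uploads batches concurrently, retrying failed documents with exponential backoff.
    public static async Task UploadAsync<T>(
        SearchClient client,
        IEnumerable<IReadOnlyList<T>> batches,
        Func<T, string> keySelector,   // hypothetical: maps a document to its key field
        int maxThreads = 8,
        int maxRetries = 5)
    {
        var throttle = new SemaphoreSlim(maxThreads);

        var tasks = batches.Select(async batch =>
        {
            await throttle.WaitAsync();
            try
            {
                IReadOnlyList<T> pending = batch;

                for (int attempt = 0; attempt < maxRetries && pending.Count > 0; attempt++)
                {
                    try
                    {
                        Response<IndexDocumentsResult> response =
                            await client.IndexDocumentsAsync(IndexDocumentsBatch.Upload(pending));

                        // On a 207, keep only the documents that failed and retry those.
                        var failedKeys = response.Value.Results
                            .Where(r => !r.Succeeded)
                            .Select(r => r.Key)
                            .ToHashSet();

                        pending = pending.Where(d => failedKeys.Contains(keySelector(d))).ToList();
                    }
                    catch (RequestFailedException ex) when (ex.Status == 503)
                    {
                        // Service is under heavy load; fall through to the backoff delay.
                    }

                    if (pending.Count > 0)
                    {
                        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
                    }
                }
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(tasks);
    }
}
```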

Indexing large data sets with indexers and the "pull" APIs

Indexers have built-in capabilities that are particularly useful for accommodating larger data sets:

  • Indexer schedules allow you to parcel out indexing at regular intervals so that you can spread it out over time.

  • Scheduled indexing can resume at the last known stopping point. If a data source isn't fully scanned within a 24-hour window, the indexer resumes indexing on day two from wherever it left off.

  • Partitioning data into smaller individual data sources enables parallel processing. You can break up source data into smaller components, such as into multiple containers in Azure Blob Storage, create a data source for each partition, and then run multiple indexers in parallel.

Check indexer batch size

As with the push API, indexers allow you to configure the number of items per batch. For indexers based on the Create Indexer REST API, you can set the batchSize argument to customize this setting to better match the characteristics of your data.

Default batch sizes are data source specific. Azure SQL Database and Azure Cosmos DB have a default batch size of 1000. In contrast, Azure Blob indexing sets batch size at 10 documents in recognition of the larger average document size.
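
In the .NET SDK, the counterpart of the REST batchSize parameter is IndexingParameters.BatchSize. A minimal sketch, with placeholder resource names:

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(
    new Uri("https://<your-service>.search.windows.net"),
    new AzureKeyCredential("<admin-api-key>"));

var indexer = new SearchIndexer(
    name: "blob-indexer",
    dataSourceName: "blob-datasource",
    targetIndexName: "hotels-index")
{
    // Override the source-specific default (10 for Azure Blob indexing).
    Parameters = new IndexingParameters { BatchSize = 50 }
};

await indexerClient.CreateOrUpdateIndexerAsync(indexer);
```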

Schedule indexers for long-running processes

Indexer scheduling is an important mechanism for processing large data sets and for handling slow-running processes like image analysis in a cognitive search pipeline. Indexer processing operates within a 24-hour window. If processing fails to finish within 24 hours, the behavior of indexer scheduling can work to your advantage.

By design, scheduled indexing starts at specific intervals, with a job typically completing before resuming at the next scheduled interval. However, if processing does not complete within the interval, the indexer stops (because it ran out of time). At the next interval, processing resumes where it last left off, with the system keeping track of where that occurs.

In practical terms, for index loads spanning several days, you can put the indexer on a 24-hour schedule. When indexing resumes for the next 24-hour cycle, it restarts at the last known good document. In this way, an indexer can work its way through a document backlog over a series of days until all unprocessed documents are processed. For more information about setting schedules in general, see Create Indexer REST API or How to schedule indexers for Azure Cognitive Search.
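
Continuing the placeholder sketch from the previous section, a 24-hour schedule in the .NET SDK is an IndexingSchedule with a one-day interval:

```csharp
using System;
using Azure.Search.Documents.Indexes.Models;

// Attach a daily schedule to the indexer defined earlier (names are placeholders).
indexer.Schedule = new IndexingSchedule(TimeSpan.FromDays(1))
{
    StartTime = DateTimeOffset.UtcNow // first run; subsequent runs follow the interval
};

await indexerClient.CreateOrUpdateIndexerAsync(indexer);
```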

Run indexers in parallel

If you partition your data, you can create multiple indexer-data-source combinations that pull from each data source and write to the same search index. Because each indexer is distinct, you can run them at the same time, populating a search index more quickly than if you ran them sequentially.

Make sure you have sufficient capacity. One search unit in your service can run one indexer at any given time. Creating multiple indexers is only useful if they can run in parallel.

The number of indexing jobs that can run simultaneously varies for text-based and skills-based indexing. For more information, see Indexer execution.

  1. Sign in to the Azure portal and check the number of search units used by your search service. Select Settings > Scale to view the number at the top of the page. The number of indexers that will run in parallel is approximately equal to the number of search units.

  2. Partition source data among multiple containers or multiple virtual folders inside the same container.

  3. Create multiple data sources, one for each partition, paired to its own indexer.

  4. Specify the same target search index in each indexer.

  5. Schedule the indexers.

  6. Review indexer status and execution history for confirmation.
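
As a rough end-to-end sketch of steps 2 through 6, assuming the partitions are Azure Blob Storage containers and all resource names are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

var indexerClient = new SearchIndexerClient(
    new Uri("https://<your-service>.search.windows.net"),
    new AzureKeyCredential("<admin-api-key>"));

// One blob container per partition; names and connection string are placeholders.
string[] partitions = { "docs-part1", "docs-part2", "docs-part3" };

foreach (string container in partitions)
{
    var dataSource = new SearchIndexerDataSourceConnection(
        name: $"{container}-ds",
        type: SearchIndexerDataSourceType.AzureBlob,
        connectionString: "<storage-connection-string>",
        container: new SearchIndexerDataContainer(container));

    await indexerClient.CreateOrUpdateDataSourceConnectionAsync(dataSource);

    // Each indexer reads its own partition but writes to the same target index.
    var indexer = new SearchIndexer(
        name: $"{container}-indexer",
        dataSourceName: dataSource.Name,
        targetIndexName: "large-index");

    await indexerClient.CreateOrUpdateIndexerAsync(indexer);

    // Run on demand here; a schedule like the one shown earlier covers step 5.
    await indexerClient.RunIndexerAsync(indexer.Name);
}

// Step 6: review status and execution history for each indexer.
foreach (string container in partitions)
{
    SearchIndexerStatus status =
        await indexerClient.GetIndexerStatusAsync($"{container}-indexer");
    Console.WriteLine($"{container}-indexer: {status.LastResult?.Status}");
}
```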

There are some risks associated with parallel indexing. First, recall that indexing does not run in the background, increasing the likelihood that queries will be throttled or dropped.

Second, Azure Cognitive Search does not lock the index for updates. Concurrent writes are managed, invoking a retry if a particular write does not succeed on first attempt, but you might notice an increase in indexing failures.

Although multiple indexer-data-source sets can target the same index, be careful of indexer runs that can overwrite existing values in the index. If a second indexer-data-source targets the same documents and fields, any values from the first run will be overwritten. Field values are replaced in full; an indexer can't merge values from multiple runs into the same field.

See also