Options to migrate your on-premises or cloud data to Azure Cosmos DB

You can load data from various data sources into Azure Cosmos DB. Because Azure Cosmos DB supports multiple APIs, the target can be any of them. To support migration paths from the various sources to the different Azure Cosmos DB APIs, several solutions provide specialized handling for each path. This document lists the available solutions and describes their advantages and limitations.

Factors affecting the choice of migration tool

The following factors determine the choice of the migration tool:

  • Online vs offline migration: Many migration tools provide a path to do a one-time migration only. This means that the applications accessing the database might experience a period of downtime. Some migration solutions provide a way to do a live migration where there is a replication pipeline set up between the source and the target.

  • Data source: The existing data can be in various data sources such as Oracle, DB2, DataStax Cassandra, Azure SQL Database, and PostgreSQL. The data can also be in an existing Azure Cosmos DB account, where the intent of the migration is to change the data model or to repartition the data in a container with a different partition key.

  • Azure Cosmos DB API: For the SQL API in Azure Cosmos DB, the Azure Cosmos DB team has developed a variety of tools that aid in the different migration scenarios. Each of the other APIs has its own specialized set of tools developed and maintained by the community. Because Azure Cosmos DB supports these APIs at the wire protocol level, these tools should work as-is when migrating data into Azure Cosmos DB. However, they might require custom handling for throttled requests, because throttling is specific to Azure Cosmos DB.

  • Size of data: Most migration tools work very well for smaller datasets. When the data set exceeds a few hundred gigabytes, the choices of migration tools are limited.

  • Expected migration duration: Migrations can be configured to take place at a slow, incremental pace that consumes less throughput or can consume the entire throughput provisioned on the target Azure Cosmos DB container and complete the migration in less time.
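The trade-off between pace and throughput in the last factor can be estimated with simple arithmetic. The following is a minimal sketch, not a sizing tool: the function name is hypothetical, and the document count, the request-unit (RU) cost per write, and the provisioned throughput are illustrative assumptions rather than measured values. Measure the actual RU charge of your own writes before relying on such an estimate.

```python
def estimate_migration_seconds(doc_count, ru_per_write,
                               provisioned_ru_per_sec,
                               throughput_fraction=1.0):
    """Rough lower bound on migration time, in seconds.

    throughput_fraction is the share of the provisioned throughput the
    migration is allowed to consume (1.0 = all of it).
    """
    if not 0 < throughput_fraction <= 1:
        raise ValueError("throughput_fraction must be in (0, 1]")
    usable_ru_per_sec = provisioned_ru_per_sec * throughput_fraction
    total_ru = doc_count * ru_per_write
    return total_ru / usable_ru_per_sec


# Assumed workload: 100 million documents at 10 RU per write,
# target container provisioned with 50,000 RU/s.
fast = estimate_migration_seconds(100_000_000, 10, 50_000)
slow = estimate_migration_seconds(100_000_000, 10, 50_000,
                                  throughput_fraction=0.25)
print(f"{fast / 3600:.1f} hours at full throughput")   # 5.6 hours
print(f"{slow / 3600:.1f} hours at 25% of throughput")  # 22.2 hours
```

The estimate ignores retries, network latency, and source read speed, so real migrations take longer than the figure it returns.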

Azure Cosmos DB SQL API

Migration type Solution Considerations
Offline Data Migration Tool • Easy to set up and supports multiple sources
• Not suitable for large datasets
Offline Azure Data Factory • Easy to set up and supports multiple sources
• Makes use of the Azure Cosmos DB bulk executor library
• Suitable for large datasets
• No checkpointing: if an issue occurs during the migration, you must restart the whole migration process
• No dead-letter queue: a few erroneous files can stop the entire migration process
Offline Azure Cosmos DB Spark connector • Makes use of the Azure Cosmos DB bulk executor library
• Suitable for large datasets
• Needs a custom Spark setup
• Spark is sensitive to schema inconsistencies and this can be a problem during migration
Offline Custom tool with Cosmos DB bulk executor library • Provides checkpointing and dead-lettering capabilities, which increase migration resiliency
• Suitable for very large datasets (10 TB+)
• Requires custom setup of this tool running as an App Service
Online Cosmos DB Functions + ChangeFeed API • Easy to set up
• Works only if the source is an Azure Cosmos DB container
• Not suitable for large datasets
• Does not capture deletes from the source container
Online Custom Migration Service using ChangeFeed • Provides progress tracking
• Works only if the source is an Azure Cosmos DB container
• Works for larger datasets as well
• Requires the user to set up an App Service to host the Change feed processor
• Does not capture deletes from the source container
Online Striim • Works with a large variety of sources like Oracle, DB2, SQL Server
• Easy to build ETL pipelines and provides a dashboard for monitoring
• Supports larger datasets
• Since this is a third-party tool, it needs to be purchased from the marketplace and installed in the user's environment
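Checkpointing and dead-lettering, which the table names as the difference between the custom tool and Azure Data Factory, can be illustrated with a small sketch. This is not the custom tool's implementation: `migrate_with_checkpoint`, the `write_doc` callback, the JSON checkpoint file, and the in-memory dead-letter list are all hypothetical stand-ins, and `ValueError` is assumed here to mean "this document is malformed".

```python
import json
from pathlib import Path


def migrate_with_checkpoint(batches, write_doc, checkpoint_path, dead_letters):
    """Write batches in order, resuming from the last completed batch.

    batches         -- list of lists of documents (dicts)
    write_doc       -- callable that writes one document to the target
    checkpoint_path -- file that records the next batch index to process
    dead_letters    -- list that collects documents that cannot be written
    Returns the batch index this run started from.
    """
    path = Path(checkpoint_path)
    start = json.loads(path.read_text())["next_batch"] if path.exists() else 0

    for index in range(start, len(batches)):
        for doc in batches[index]:
            try:
                write_doc(doc)
            except ValueError as exc:
                # Dead-lettering: park the bad document and keep going
                # instead of aborting the whole migration.
                dead_letters.append({"doc": doc, "error": str(exc)})
        # Checkpointing: record progress only after the batch is done.
        # Any other exception propagates before this line, so a rerun
        # repeats the interrupted batch; writes should be idempotent
        # (for example, upserts) to make that safe.
        path.write_text(json.dumps({"next_batch": index + 1}))
    return start
```

If a run stops on a transient failure, calling the function again with the same checkpoint file picks up at the interrupted batch rather than at the beginning. A real tool would also persist the dead-letter entries so they survive a restart.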

Azure Cosmos DB Mongo API

Migration type Solution Considerations
Offline Data Migration Tool • Easy to set up and supports multiple sources
• Not suitable for large datasets
Offline Azure Data Factory • Easy to set up and supports multiple sources
• Makes use of the Azure Cosmos DB bulk executor library
• Suitable for large datasets
• No checkpointing: if an issue occurs during the migration, you must restart the whole migration process
• No dead-letter queue: a few erroneous files can stop the entire migration process
• Needs custom code to increase read throughput for certain data sources
Offline Existing Mongo Tools (mongodump, mongorestore, Studio3T) • Easy to set up and integrate
• Needs custom handling for throttles
Online Azure Database Migration Service • Makes use of the Azure Cosmos DB bulk executor library
• Suitable for large datasets and takes care of replicating live changes
• Works only with other MongoDB sources

Azure Cosmos DB Cassandra API

Migration type Solution Considerations
Offline cqlsh COPY command • Easy to set up
• Not suitable for large datasets
• Works only when the source is a Cassandra table
Offline Copy table with Spark • Can make use of Spark capabilities to parallelize transformation and ingestion
• Needs configuration with a custom retry policy to handle throttles
Online Striim (from Oracle DB/Apache Cassandra) • Works with a large variety of sources like Oracle, DB2, SQL Server
• Easy to build ETL pipelines and provides a dashboard for monitoring
• Supports larger datasets
• Since this is a third-party tool, it needs to be purchased from the marketplace and installed in the user's environment
Online Blitzz (from Oracle DB/Apache Cassandra)
• Supports larger datasets
• Since this is a third-party tool, it needs to be purchased from the marketplace and installed in the user's environment

Other APIs

For APIs other than the SQL API, the Mongo API, and the Cassandra API, each API's existing ecosystem supports various tools.

  • Table API
  • Gremlin API

Next steps

  • Learn more by trying out the sample applications consuming the bulk executor library in .NET and Java.
  • The bulk executor library is integrated into the Cosmos DB Spark connector. To learn more, see the Azure Cosmos DB Spark connector article.
  • For additional help with large-scale migrations, contact the Azure Cosmos DB product team by opening a support ticket under the "General Advisory" problem type and "Large (TB+) migrations" problem subtype.