Copy data to or from Azure Cosmos DB by using Azure Data Factory

This article outlines how to use Copy Activity in Azure Data Factory to copy data from and to Azure Cosmos DB (SQL API). The article builds on Copy Activity in Azure Data Factory, which presents a general overview of Copy Activity.

Supported capabilities

You can copy data from Azure Cosmos DB to any supported sink data store, or copy data from any supported source data store to Azure Cosmos DB. For a list of data stores that Copy Activity supports as sources and sinks, see Supported data stores and formats.

You can use the Azure Cosmos DB connector to:

  • Copy data from and to the Azure Cosmos DB SQL API.
  • Write to Azure Cosmos DB as insert or upsert.
  • Import and export JSON documents as-is, or copy data from or to a tabular dataset. Examples include a SQL database and a CSV file. To copy documents as-is to or from JSON files or another Azure Cosmos DB collection, see Import or export JSON documents.

Data Factory integrates with the Azure Cosmos DB bulk executor library to provide the best performance when you write to Azure Cosmos DB.

Tip

The Data Migration video walks you through the steps of copying data from Azure Blob storage to Azure Cosmos DB. The video also describes performance-tuning considerations for ingesting data to Azure Cosmos DB in general.

Get started

You can run Copy Activity in a pipeline by using any of the Data Factory authoring tools or SDKs, such as the Copy Data tool, the Azure portal, the .NET and Python SDKs, Azure PowerShell, the REST API, and Azure Resource Manager templates.

The following sections provide details about properties you can use to define Data Factory entities that are specific to Azure Cosmos DB.

Linked service properties

The following properties are supported for the Azure Cosmos DB linked service:

Property: type
Description: The type property must be set to CosmosDb.
Required: Yes

Property: connectionString
Description: Specify the information that's required to connect to the Azure Cosmos DB database. Note: You must specify database information in the connection string as shown in the examples that follow. Mark this field as a SecureString type to store it securely in Data Factory. You can also reference a secret stored in Azure Key Vault.
Required: Yes

Property: connectVia
Description: The integration runtime to use to connect to the data store. You can use the Azure Integration Runtime or a self-hosted integration runtime (if your data store is located in a private network). If this property isn't specified, the default Azure Integration Runtime is used.
Required: No

Example

{
    "name": "CosmosDbLinkedService",
    "properties": {
        "type": "CosmosDb",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "AccountEndpoint=<EndpointUrl>;AccountKey=<AccessKey>;Database=<Database>"
            }
        },
        "connectVia": {
            "referenceName": "<name of Integration Runtime>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
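If you prefer to keep the account key out of the Data Factory definition, the connectionString can instead reference a secret in Azure Key Vault. The following is a sketch of that pattern; it assumes the secret holds the full connection string, and the linked service and secret names are placeholders:

```json
{
    "name": "CosmosDbLinkedService",
    "properties": {
        "type": "CosmosDb",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "<Azure Key Vault linked service name>",
                    "type": "LinkedServiceReference"
                },
                "secretName": "<secret name>"
            }
        }
    }
}
```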

Dataset properties

This section provides a list of properties that the Azure Cosmos DB dataset supports.

For a full list of sections and properties that are available for defining datasets, see Datasets and linked services.

To copy data from or to Azure Cosmos DB, set the type property of the dataset to DocumentDbCollection. The following properties are supported:

Property: type
Description: The type property of the dataset must be set to DocumentDbCollection.
Required: Yes

Property: collectionName
Description: The name of the Azure Cosmos DB document collection.
Required: Yes

Example

{
    "name": "CosmosDbDataset",
    "properties": {
        "type": "DocumentDbCollection",
        "linkedServiceName":{
            "referenceName": "<Azure Cosmos DB linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "collectionName": "<collection name>"
        }
    }
}

Schema inference by Data Factory

For schema-free data stores like Azure Cosmos DB, Copy Activity infers the schema in one of the ways described in the following list. Unless you want to import or export JSON documents as-is, the best practice is to specify the structure of data in the structure section.

  • If you specify the structure of data by using the structure property in the dataset definition, Data Factory honors this structure as the schema.

    If a row doesn't contain a value for a column, a null value is provided for the column value.

  • If you don't specify the structure of data by using the structure property in the dataset definition, the Data Factory service infers the schema by using the first row in the data.

    If the first row doesn't contain the full schema, some columns will be missing in the result of the copy operation.
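As an illustration of the first approach, a dataset that pins the schema up front might include a structure section like the following. This is a sketch; the column names and types shown here are placeholder assumptions, not values required by the connector:

```json
{
    "name": "CosmosDbDatasetWithSchema",
    "properties": {
        "type": "DocumentDbCollection",
        "linkedServiceName": {
            "referenceName": "<Azure Cosmos DB linked service name>",
            "type": "LinkedServiceReference"
        },
        "structure": [
            { "name": "id", "type": "String" },
            { "name": "FirstName", "type": "String" },
            { "name": "EmailPromotion", "type": "Int32" }
        ],
        "typeProperties": {
            "collectionName": "<collection name>"
        }
    }
}
```

With this structure in place, rows that lack a value for one of the declared columns get a null for that column instead of being dropped from the schema.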

Copy Activity properties

This section provides a list of properties that the Azure Cosmos DB source and sink support.

For a full list of sections and properties that are available for defining activities, see Pipelines.

Azure Cosmos DB as source

To copy data from Azure Cosmos DB, set the source type in Copy Activity to DocumentDbCollectionSource.

The following properties are supported in the Copy Activity source section:

Property: type
Description: The type property of the Copy Activity source must be set to DocumentDbCollectionSource.
Required: Yes

Property: query
Description: Specify the Azure Cosmos DB query to read data.

Example:
SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\"

If not specified, this SQL statement is executed: select <columns defined in structure> from mycollection
Required: No

Property: nestingSeparator
Description: A special character that indicates that the document is nested and how to flatten the result set.

For example, if an Azure Cosmos DB query returns the nested result "Name": {"First": "John"}, Copy Activity identifies the column name as Name.First, with the value "John", when the nestingSeparator value is . (dot).
Required: No (the default is . (dot))

Example

"activities":[
    {
        "name": "CopyFromCosmosDB",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<Document DB input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "DocumentDbCollectionSource",
                "query": "SELECT c.BusinessEntityID, c.Name.First AS FirstName, c.Name.Middle AS MiddleName, c.Name.Last AS LastName, c.Suffix, c.EmailPromotion FROM c WHERE c.ModifiedDate > \"2009-01-01T00:00:00\""
            },
            "sink": {
                "type": "<sink type>"
            }
        }
    }
]

Azure Cosmos DB as sink

To copy data to Azure Cosmos DB, set the sink type in Copy Activity to DocumentDbCollectionSink.

The following properties are supported in the Copy Activity sink section:

Property: type
Description: The type property of the Copy Activity sink must be set to DocumentDbCollectionSink.
Required: Yes

Property: writeBehavior
Description: Describes how to write data to Azure Cosmos DB. Allowed values: insert and upsert.

The behavior of upsert is to replace the document if a document with the same ID already exists; otherwise, to insert the document.

Note: Data Factory automatically generates an ID for a document if one isn't specified either in the original document or by column mapping. For upsert to work as expected, make sure your documents have an ID.
Required: No (the default is insert)

Property: writeBatchSize
Description: Data Factory uses the Azure Cosmos DB bulk executor library to write data to Azure Cosmos DB. The writeBatchSize property controls the number of documents that Data Factory provides to the library in each batch. You can try increasing writeBatchSize to improve performance.
Required: No (the default is 10,000)

Property: nestingSeparator
Description: A special character in the source column name that indicates that a nested document is needed.

For example, Name.First in the output dataset structure generates the following JSON structure in the Azure Cosmos DB document when the nestingSeparator is . (dot): "Name": {"First": "[value maps to this column from source]"}
Required: No (the default is . (dot))

Example

"activities":[
    {
        "name": "CopyToCosmosDB",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<input dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<Document DB output dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>"
            },
            "sink": {
                "type": "DocumentDbCollectionSink",
                "writeBehavior": "upsert"
            }
        }
    }
]

Import or export JSON documents

You can use this Azure Cosmos DB connector to easily:

  • Import JSON documents from various sources to Azure Cosmos DB, including from Azure Blob storage, Azure Data Lake Store, and other file-based stores that Azure Data Factory supports.
  • Export JSON documents from an Azure Cosmos DB collection to various file-based stores.
  • Copy documents between two Azure Cosmos DB collections as-is.

To achieve schema-agnostic copy:

  • When you use the Copy Data tool, select the Export as-is to JSON files or Cosmos DB collection option.
  • When you use activity authoring, don't specify the structure (also called schema) section in the Azure Cosmos DB dataset. Also, don't specify the nestingSeparator property in the Azure Cosmos DB source or sink in Copy Activity. When you import from or export to JSON files, in the corresponding file store dataset, specify the format type as JsonFormat and configure the filePattern as described in the JSON format section. Then, don't specify the structure section and skip the rest of the format settings.
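For the file-store side of an as-is copy, the dataset might look like the following sketch of a Blob storage dataset. The names and paths are placeholders; note that the format type is JsonFormat, a filePattern is set (setOfObjects here; arrayOfObjects is the other documented value), and no structure section is specified:

```json
{
    "name": "JsonFilesDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Blob storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "<container/folder>",
            "format": {
                "type": "JsonFormat",
                "filePattern": "setOfObjects"
            }
        }
    }
}
```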

Next steps

For a list of data stores that Copy Activity supports as sources and sinks in Azure Data Factory, see supported data stores.