Copy Activity in Azure Data Factory

Overview

In Azure Data Factory, you can use Copy Activity to copy data among data stores located on-premises and in the cloud. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption.

Role of Copy Activity

Copy Activity is executed on an Integration Runtime. Different data copy scenarios call for different flavors of Integration Runtime:

  • When copying data between data stores that are both publicly accessible, copy activity can run on the Azure Integration Runtime, which is secure, reliable, scalable, and globally available.
  • When copying data from or to data stores located on-premises or in a network with access control (for example, an Azure Virtual Network), you need to set up a self-hosted Integration Runtime to perform the data copy.

An Integration Runtime needs to be associated with each source and sink data store. Learn how copy activity determines which IR to use.
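In a linked service definition, this association is expressed through the connectVia property. As a minimal illustration (the Integration Runtime name is a placeholder), a linked service that should run on a self-hosted IR carries a fragment like the following:

"connectVia": {
    "referenceName": "<your self-hosted Integration Runtime name>",
    "type": "IntegrationRuntimeReference"
}

If connectVia is omitted, the default Azure Integration Runtime is used.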

Copy Activity goes through the following stages to copy data from a source to a sink. The service that powers Copy Activity:

  1. Reads data from a source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
  3. Writes data to the sink/destination data store.


Supported data stores and formats

The supported data stores are listed below, grouped by category. Whether each store is supported as a source, as a sink, and by the Azure and self-hosted Integration Runtimes is covered in the corresponding connector article.

Azure
  • Azure Blob Storage
  • Azure Cosmos DB
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2 (Preview)
  • Azure Database for MySQL
  • Azure Database for PostgreSQL
  • Azure File Storage
  • Azure SQL Database
  • Azure SQL Data Warehouse
  • Azure Search Index
  • Azure Table Storage

Database
  • Amazon Redshift
  • DB2
  • Drill (Preview)
  • Google BigQuery
  • Greenplum
  • HBase
  • Hive
  • Apache Impala (Preview)
  • Informix
  • MariaDB
  • Microsoft Access
  • MySQL
  • Netezza
  • Oracle
  • Phoenix
  • PostgreSQL
  • Presto (Preview)
  • SAP Business Warehouse
  • SAP HANA
  • Spark
  • SQL Server
  • Sybase
  • Teradata
  • Vertica

NoSQL
  • Cassandra
  • Couchbase (Preview)
  • MongoDB

File
  • Amazon S3
  • File System
  • FTP
  • HDFS
  • SFTP

Generic protocol
  • Generic HTTP
  • Generic OData
  • Generic ODBC

Services and apps
  • Amazon Marketplace Web Service (Preview)
  • Common Data Service for Apps
  • Concur (Preview)
  • Dynamics 365
  • Dynamics CRM
  • GE Historian
  • Google AdWords (Preview)
  • HubSpot (Preview)
  • Jira (Preview)
  • Magento (Preview)
  • Marketo (Preview)
  • Office 365 (Preview)
  • Oracle Eloqua (Preview)
  • Oracle Responsys (Preview)
  • Oracle Service Cloud (Preview)
  • Paypal (Preview)
  • QuickBooks (Preview)
  • Salesforce
  • Salesforce Service Cloud
  • Salesforce Marketing Cloud (Preview)
  • SAP Cloud for Customer (C4C)
  • SAP ECC
  • ServiceNow
  • Shopify (Preview)
  • Square (Preview)
  • Web Table (HTML table)
  • Xero (Preview)
  • Zoho (Preview)

Note

Any connector marked as Preview is available for you to try out and give us feedback on. If you want to take a dependency on a preview connector in your solution, contact Azure support.

Supported file formats

You can use Copy Activity to copy files as-is between two file-based data stores, in which case the data is copied efficiently without any serialization/deserialization.

Copy Activity also supports reading from and writing to files in specific formats: Text, JSON, Avro, ORC, and Parquet. The compression codecs GZip, Deflate, BZip2, and ZipDeflate are also supported. See Supported file and compression formats for details.

For example, you can perform the following copy operations (a dataset sketch for the last scenario follows this list):

  • Copy data from an on-premises SQL Server database and write it to Azure Data Lake Store in ORC format.
  • Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
  • Copy zipped files from an on-premises file system, decompress them, and land the extracted files in Azure Data Lake Store.
  • Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
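To illustrate the format and compression settings involved, here is a minimal dataset sketch for the last scenario, assuming the Azure Blob connector and placeholder names. The format and compression sections tell Copy Activity how to deserialize and decompress the files:

{
    "name": "GzippedCsvBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "<Azure Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "inputcontainer/rawdata",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}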

Supported regions

The service that powers Copy Activity is available globally in the regions and geographies listed in Azure Integration Runtime locations. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Services by region for availability of Data Factory and Data Movement in a region.

Configuration

To use copy activity in Azure Data Factory, you need to:

  1. Create linked services for the source data store and the sink data store. Refer to the connector article's "Linked service properties" section for how to configure them and the supported properties. You can find the list of supported connectors in the Supported data stores and formats section. A minimal linked service sketch follows this list.
  2. Create datasets for the source and the sink. Refer to the source and sink connector articles' "Dataset properties" section for how to configure them and the supported properties.
  3. Create a pipeline with a copy activity. The next section provides an example.
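As a minimal sketch of step 1, assuming Azure Blob Storage as the data store (the account name and key are placeholders), a linked service definition looks like the following. Refer to the Azure Blob Storage connector article for the authoritative property list:

{
    "name": "AzureStorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
            }
        }
    }
}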

Syntax

The following template of a copy activity contains an exhaustive list of supported properties. Specify the ones that fit your scenario.

"activities":[
    {
        "name": "CopyActivityTemplate",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<source dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>",
                <properties>
            },
            "sink": {
                "type": "<sink type>"
                <properties>
            },
            "translator":
            {
                "type": "TabularTranslator",
                "columnMappings": "<column mapping>"
            },
            "dataIntegrationUnits": <number>,
            "parallelCopies": <number>,
            "enableStaging": true/false,
            "stagingSettings": {
                <properties>
            },
            "enableSkipIncompatibleRow": true/false,
            "redirectIncompatibleRowSettings": {
                <properties>
            }
        }
    }
]

Syntax details

Property Description Required
type The type property of a copy activity must be set to: Copy Yes
inputs Specify the dataset you created which points to the source data. Copy activity supports only a single input. Yes
outputs Specify the dataset you created which points to the sink data. Copy activity supports only a single output. Yes
typeProperties A group of properties to configure copy activity. Yes
source Specify the copy source type and the corresponding properties on how to retrieve data.

Learn details from the "Copy activity properties" section in the connector article listed in Supported data stores and formats.
Yes
sink Specify the copy sink type and the corresponding properties on how to write data.

Learn details from the "Copy activity properties" section in the connector article listed in Supported data stores and formats.
Yes
translator Specify explicit column mappings from source to sink. Applies when the default copy behavior cannot fulfill your needs.

Learn details from Schema and data type mapping.
No
dataIntegrationUnits Specify a measure of the power of the Azure Integration Runtime used for the data copy. Formerly known as cloud Data Movement Units (DMUs).

Learn details from Data Integration Units.
No
parallelCopies Specify the parallelism that you want Copy Activity to use when reading data from the source and writing data to the sink.

Learn details from Parallel copy.
No
enableStaging
stagingSettings
Choose to stage the interim data in Blob storage instead of copying data directly from source to sink.

Learn about the useful scenarios and configuration details from Staged copy.
No
enableSkipIncompatibleRow
redirectIncompatibleRowSettings
Choose how to handle incompatible rows when copying data from source to sink.

Learn details from Fault tolerance.
No
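To make the template concrete, the following sketch copies GZip-compressed CSV files from Azure Blob storage to Azure SQL Database. The activity and dataset names are placeholders (the input reuses the Blob dataset sketched earlier); BlobSource and SqlSink are the source and sink types for those two connectors, and the tuning properties shown are optional:

"activities":[
    {
        "name": "CopyFromBlobToAzureSql",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "GzippedCsvBlobDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "AzureSqlOutputDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
                "recursive": true
            },
            "sink": {
                "type": "SqlSink",
                "writeBatchSize": 10000
            },
            "dataIntegrationUnits": 4,
            "parallelCopies": 2
        }
    }
]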

Monitoring

You can monitor the copy activity run in the Azure Data Factory "Author & Monitor" UI or programmatically. You can then compare the performance and configuration of your scenario to Copy Activity's performance reference from in-house testing.

Monitor visually

To visually monitor the copy activity run, go to your data factory -> Author & Monitor -> Monitor tab. You see a list of pipeline runs with a "View Activity Runs" link in the Actions column.


Click the link to see the list of activities in this pipeline run. In the Actions column, you have links to the copy activity's input, output, errors (if the copy activity run fails), and details.


Click the "Details" link under Actions to see copy activity's execution details and performance characteristics. It shows you information including volume/rows/files of data copied from source to sink, throughput, steps it goes through with corresponding duration and used configurations for your copy scenario.

Example: copy from Amazon S3 to Azure Data Lake Store

Example: copy from Azure SQL Database to Azure SQL Data Warehouse using staged copy

Monitor programmatically

Copy activity execution details and performance characteristics are also returned in the Copy Activity run result -> Output section. Below is an exhaustive list; only the properties applicable to your copy scenario show up. Learn how to monitor an activity run from the quickstart monitoring section.

Property name Description Unit
dataRead Data size read from source Int64 value in bytes
dataWritten Data size written to sink Int64 value in bytes
filesRead Number of files being copied when copying data from file storage. Int64 value (no unit)
filesWritten Number of files being copied when copying data to file storage. Int64 value (no unit)
rowsCopied Number of rows being copied (not applicable for binary copy). Int64 value (no unit)
rowsSkipped Number of incompatible rows that were skipped. You can turn on this feature by setting "enableSkipIncompatibleRow" to true. Int64 value (no unit)
throughput Rate at which data is transferred Floating point number in KB/s
copyDuration The duration of the copy Int32 value in seconds
sqlDwPolyBase If PolyBase is used when copying data into SQL Data Warehouse. Boolean
redshiftUnload If UNLOAD is used when copying data from Redshift. Boolean
hdfsDistcp If DistCp is used when copying data from HDFS. Boolean
effectiveIntegrationRuntime Shows which Integration Runtime(s) are used to run the activity, in the format of <IR name> (<region if it's an Azure IR>). Text (string)
usedDataIntegrationUnits The effective Data Integration Units during copy. Int32 value
usedParallelCopies The effective parallelCopies during copy. Int32 value
redirectRowPath Path to the log of skipped incompatible rows in the Blob storage you configure under "redirectIncompatibleRowSettings". See the example below. Text (string)
executionDetails More details on the stages copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change. Array
"output": {
    "dataRead": 107280845500,
    "dataWritten": 107280845500,
    "filesRead": 10,
    "filesWritten": 10,
    "copyDuration": 224,
    "throughput": 467707.344,
    "errors": [],
    "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US 2)",
    "usedDataIntegrationUnits": 32,
    "usedParallelCopies": 8,
    "executionDetails": [
        {
            "source": {
                "type": "AmazonS3"
            },
            "sink": {
                "type": "AzureDataLakeStore"
            },
            "status": "Succeeded",
            "start": "2018-01-17T15:13:00.3515165Z",
            "duration": 221,
            "usedDataIntegrationUnits": 32,
            "usedParallelCopies": 8,
            "detailedDurations": {
                "queuingDuration": 2,
                "transferDuration": 219
            }
        }
    ]
}

Schema and data type mapping

See Schema and data type mapping, which describes how Copy Activity maps your source data to your sink.
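For example, an explicit mapping is declared with the translator property. The following fragment uses hypothetical column names and the string form of columnMappings to map the source columns UserId and Group to the sink columns MyUserId and MyGroup:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Group: MyGroup"
}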

Fault tolerance

By default, copy activity stops copying data and returns a failure when it encounters data that is incompatible between source and sink. You can explicitly configure the activity to skip and log the incompatible rows and copy only the compatible data so that the copy succeeds. See Copy Activity fault tolerance for more details.
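As a configuration sketch (the storage linked service name and path are placeholders), the following typeProperties fragment skips incompatible rows and logs them to Blob storage:

"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
    "linkedServiceName": {
        "referenceName": "<Azure Storage linked service name>",
        "type": "LinkedServiceReference"
    },
    "path": "redirectcontainer/errorlog"
}

The path to the resulting log is then reported in the activity output as redirectRowPath.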

Performance and tuning

See the Copy Activity performance and tuning guide, which describes key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed performance during internal testing and discusses various ways to optimize the performance of Copy Activity.

Incremental copy

Data Factory supports scenarios for incrementally copying delta data from a source data store to a destination data store. See Tutorial: incrementally copy data.

Read and write partitioned data

In version 1, Azure Data Factory supported reading or writing partitioned data by using the SliceStart/SliceEnd/WindowStart/WindowEnd system variables. In the current version, you can achieve this behavior by using a pipeline parameter and the trigger's start time or scheduled time as the value of that parameter. For more information, see How to read or write partitioned data.
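As a sketch, assume a pipeline parameter named windowStart that the trigger populates with its scheduled time (all names here are hypothetical). A Blob dataset can then build a partitioned folder path from the parameter:

"folderPath": {
    "value": "@concat('rawdata/', formatDateTime(pipeline().parameters.windowStart, 'yyyy/MM/dd'))",
    "type": "Expression"
}

In the schedule trigger definition, the parameter is passed as:

"parameters": {
    "windowStart": "@trigger().scheduledTime"
}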

Next steps

See the following quickstarts, tutorials, and samples: