Copy Activity in Azure Data Factory

Overview

In Azure Data Factory, you can use Copy Activity to copy data among data stores located on-premises and in the cloud. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption.

Role of Copy Activity

Note

This article applies to version 2 of Data Factory, which is currently in preview. If you are using version 1 of the Data Factory service, which is generally available (GA), see Copy Activity in V1.

Copy Activity is executed on an Integration Runtime. Different data copy scenarios call for different types of Integration Runtime:

  • When copying data between data stores that are both publicly accessible, Copy Activity can run on the Azure Integration Runtime, which is secure, reliable, scalable, and globally available.
  • When copying data from/to data stores located on-premises or in a network with access control (for example, an Azure Virtual Network), you need to set up a self-hosted Integration Runtime to drive the data copy.

An Integration Runtime needs to be associated with each source and sink data store. Learn details about how Copy Activity determines which IR to use.
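
For example, a linked service for an on-premises SQL Server database can reference a self-hosted Integration Runtime through its connectVia property. The following is a minimal sketch; the IR name MySelfHostedIR and the connection placeholders are illustrative:

{
    "name": "SqlServerLinkedService",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "Data Source=<server>;Initial Catalog=<database>;User ID=<user>;Password=<password>;"
            }
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",
            "type": "IntegrationRuntimeReference"
        }
    }
}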

Copy Activity goes through the following stages to copy data from a source to a sink. The service that powers Copy Activity:

  1. Reads data from a source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, etc. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
  3. Writes data to the sink/destination data store.

(Figure: Copy Activity overview)

Supported data stores and formats

Copy Activity supports the following data stores. See each connector article for whether the store is supported as a source and/or a sink, and whether it's supported by the Azure IR and/or the self-hosted IR.

Category Data store
Azure Azure Blob Storage
  Azure Cosmos DB
  Azure Data Lake Store
  Azure Database for MySQL
  Azure Database for PostgreSQL
  Azure File Storage
  Azure SQL Database
  Azure SQL Data Warehouse
  Azure Search Index
  Azure Table Storage
Databases/Data Warehouses Amazon Redshift
  DB2
  Drill (Beta)
  Google BigQuery (Beta)
  Greenplum (Beta)
  HBase
  Hive
  Apache Impala (Beta)
  Informix
  MariaDB
  Microsoft Access
  MySQL
  Oracle
  Phoenix
  PostgreSQL
  Presto (Beta)
  SAP Business Warehouse
  SAP HANA
  Spark
  SQL Server
  Sybase
  Teradata
NoSQL Cassandra
  Couchbase (Beta)
  MongoDB
File Amazon S3
  File System
  FTP
  HDFS
  SFTP
Generic protocols Generic HTTP
  Generic OData
  Generic ODBC
Services and Others Amazon Marketplace Web Service (Beta)
  Concur (Beta)
  Dynamics 365
  Dynamics CRM
  GE Historian
  HubSpot (Beta)
  Jira (Beta)
  Magento (Beta)
  Oracle Eloqua (Beta)
  PayPal (Beta)
  QuickBooks (Beta)
  Salesforce
  Salesforce Service Cloud
  SAP Cloud for Customer (C4C)
  ServiceNow (Beta)
  Shopify (Beta)
  Square (Beta)
  Web Table (HTML table)
  Xero (Beta)
  Zoho (Beta)

Note

Any connector marked as Beta is available for you to try out and give feedback on. Do not use Beta connectors in production environments.

Supported file formats

You can use Copy Activity to copy files as-is between two file-based data stores, in which case the data is copied efficiently without any serialization/deserialization.

Copy Activity also supports reading from and writing to files in specified formats: Text, JSON, Avro, ORC, and Parquet. It can also compress and decompress files with the GZip, Deflate, BZip2, and ZipDeflate codecs. See Supported file and compression formats for details, and the dataset sketch after the following examples.

For example, you can use Copy Activity to:

  • Copy data from an on-premises SQL Server database and write it to Azure Data Lake Store in ORC format.
  • Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
  • Copy zipped files from an on-premises file system, decompress them on the fly, and land the extracted files in Azure Data Lake Store.
  • Copy data in GZip compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
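
As a minimal sketch of how format and compression are configured on a dataset (the dataset, linked service, container, and folder names are illustrative), an Azure Blob dataset that reads GZip-compressed CSV files might look like this:

{
    "name": "GzipCsvBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "inputcontainer/csvfolder",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}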

Supported regions

The service that powers Copy Activity is available globally in the regions and geographies listed in Azure Integration Runtime locations. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Services by region for availability of Data Factory and Data Movement in a region.

Configuration

To use Copy Activity in Azure Data Factory, you need to:

  1. Create linked services for the source data store and the sink data store. Refer to the connector article's "Linked service properties" section for configuration details and supported properties. You can find the list of supported connectors in the Supported data stores and formats section.
  2. Create datasets for the source and the sink. Refer to the source and sink connector articles' "Dataset properties" sections for configuration details and supported properties.
  3. Create a pipeline with a copy activity. The next section provides an example.

Syntax

The following template of a copy activity contains an exhaustive list of supported properties. Specify the ones that fit your scenario.

"activities":[
    {
        "name": "CopyActivityTemplate",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "<source dataset name>",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "<sink dataset name>",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "<source type>",
                <properties>
            },
            "sink": {
                "type": "<sink type>"
                <properties>
            },
            "translator":
            {
                "type": "TabularTranslator",
                "columnMappings": "<column mapping>"
            },
            "cloudDataMovementUnits": <number>,
            "parallelCopies": <number>,
            "enableStaging": true/false,
            "stagingSettings": {
                <properties>
            },
            "enableSkipIncompatibleRow": true/false,
            "redirectIncompatibleRowSettings": {
                <properties>
            }
        }
    }
]

Syntax details

Property Description Required
type The type property of a copy activity must be set to: Copy. Yes
inputs Specify the dataset you created that points to the source data. Copy Activity supports only a single input. Yes
outputs Specify the dataset you created that points to the sink data. Copy Activity supports only a single output. Yes
typeProperties A group of properties to configure Copy Activity. Yes
source Specify the copy source type and the corresponding properties for retrieving data. Learn details from the "Copy activity properties" section of the connector article listed in Supported data stores and formats. Yes
sink Specify the copy sink type and the corresponding properties for writing data. Learn details from the "Copy activity properties" section of the connector article listed in Supported data stores and formats. Yes
translator Specify explicit column mappings from source to sink. Applies when the default copy behavior cannot fulfill your needs. Learn details from Schema and data type mapping. No
cloudDataMovementUnits Specify how much of the Azure Integration Runtime's power to use for the data copy. Learn details from Cloud data movement units. No
parallelCopies Specify the parallelism that you want Copy Activity to use when reading data from the source and writing data to the sink. Learn details from Parallel copy. No
enableStaging, stagingSettings Choose whether to stage the interim data in a Blob storage instead of copying data directly from source to sink. Learn the useful scenarios and configuration details from Staged copy. No
enableSkipIncompatibleRow, redirectIncompatibleRowSettings Choose how to handle incompatible rows when copying data from source to sink. Learn details from Fault tolerance. No
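
For instance, a concrete copy activity that copies from Azure Blob storage to Azure SQL Database might look like the following minimal sketch (the activity and dataset names are illustrative, and the writeBatchSize value is an example rather than a recommendation):

"activities":[
    {
        "name": "CopyFromBlobToSql",
        "type": "Copy",
        "inputs": [
            {
                "referenceName": "GzipCsvBlobDataset",
                "type": "DatasetReference"
            }
        ],
        "outputs": [
            {
                "referenceName": "SqlOutputDataset",
                "type": "DatasetReference"
            }
        ],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "SqlSink",
                "writeBatchSize": 10000
            }
        }
    }
]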

Monitoring

Copy Activity execution details and performance characteristics are returned in the Output section of the Copy Activity run result. Below is an exhaustive list of the returned properties. Learn how to monitor activity runs from the quickstart's monitoring section. You can compare the performance and configuration of your scenario to Copy Activity's performance reference from in-house testing.

Property name Description Unit
dataRead Data size read from source Int64 value in bytes
dataWritten Data size written to sink Int64 value in bytes
rowsCopied Number of rows copied (not applicable for binary copy). Int64 value (no unit)
rowsSkipped Number of incompatible rows skipped. You can turn on this feature by setting "enableSkipIncompatibleRow" to true. Int64 value (no unit)
throughput Rate at which data is transferred. Floating point number in KB/s
copyDuration The duration of the copy Int32 value in seconds
sqlDwPolyBase If PolyBase is used when copying data into SQL Data Warehouse. Boolean
redshiftUnload If UNLOAD is used when copying data from Redshift. Boolean
hdfsDistcp If DistCp is used when copying data from HDFS. Boolean
effectiveIntegrationRuntime Shows the Integration Runtime(s) used to power the activity run, in the format <IR name> (<region if it's an Azure IR>). Text (string)
usedCloudDataMovementUnits The effective cloud data movement units during copy. Int32 value
redirectRowPath Path to the log of skipped incompatible rows in the Blob storage you configure under "redirectIncompatibleRowSettings". See the example below. Text (string)
billedDuration The duration being billed for data movement. Int32 value in seconds
"output": {
    "dataRead": 1024,
    "dataWritten": 2048,
    "rowsCopies": 100,
    "rowsSkipped": 2,
    "throughput": 1024.0,
    "copyDuration": 3600,
    "redirectRowPath": "https://<account>.blob.core.windows.net/<path>/<pipelineRunID>/",
    "redshiftUnload": true,
    "sqlDwPolyBase": true,
    "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (West US)",
    "usedCloudDataMovementUnits": 8,
    "billedDuration": 28800
}

Schema and data type mapping

See Schema and data type mapping, which describes how Copy Activity maps your source data to the sink.
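
As a minimal sketch of the translator property from the Syntax section (the column names here are hypothetical), an explicit source-to-sink column mapping might look like this:

"translator": {
    "type": "TabularTranslator",
    "columnMappings": "UserId: MyUserId, Group: MyGroup, Name: MyName"
}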

Fault tolerance

By default, Copy Activity stops copying data and returns a failure when it encounters incompatible data between source and sink. You can explicitly configure it to skip and log the incompatible rows, and copy only the compatible data, so that the copy succeeds. See Copy Activity fault tolerance for more details.
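
A minimal sketch of these settings inside the copy activity's typeProperties, assuming a Blob storage linked service named AzureStorageLinkedService and an illustrative log path:

"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
    "linkedServiceName": {
        "referenceName": "AzureStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "path": "redirectcontainer/errorlog"
}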

Performance and tuning

See the Copy Activity performance and tuning guide, which describes key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the observed performance during internal testing and discusses various ways to optimize the performance of Copy Activity.
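
For example, the tuning-related knobs from the Syntax section can be set inside the copy activity's typeProperties as in the following sketch (the values are illustrative, not recommendations, and the staging path is hypothetical):

"cloudDataMovementUnits": 8,
"parallelCopies": 16,
"enableStaging": true,
"stagingSettings": {
    "linkedServiceName": {
        "referenceName": "AzureStorageLinkedService",
        "type": "LinkedServiceReference"
    },
    "path": "stagingcontainer/path"
}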

Next steps

See the following quickstarts, tutorials, and samples: