
Move data by using Copy Activity

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Copy Activity in V2.

Overview

In Azure Data Factory, you can use Copy Activity to copy data between on-premises and cloud data stores. After the data is copied, it can be further transformed and analyzed. You can also use Copy Activity to publish transformation and analysis results for business intelligence (BI) and application consumption.

(Figure: The role of Copy Activity)

Copy Activity is powered by a secure, reliable, scalable, and globally available service. This article provides details on data movement in Data Factory and Copy Activity.

First, let's see how data migration occurs between two cloud data stores, and between an on-premises data store and a cloud data store.

Note

To learn about activities in general, see Understanding pipelines and activities.

Copy data between two cloud data stores

When both the source and sink data stores are in the cloud, Copy Activity goes through the following stages to copy data from the source to the sink. The service that powers Copy Activity:

  1. Reads data from the source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, and type conversion. It does these operations based on the configurations of the input dataset, output dataset, and Copy Activity.
  3. Writes data to the destination data store.

The service automatically chooses the optimal region to perform the data movement. This region is usually the one closest to the sink data store.

(Figure: Cloud-to-cloud copy)

Copy data between an on-premises data store and a cloud data store

To securely move data between an on-premises data store and a cloud data store, install Data Management Gateway on your on-premises machine. Data Management Gateway is an agent that enables hybrid data movement and processing. You can install it on the same machine as the data store itself, or on a separate machine that has access to the data store.

In this scenario, Data Management Gateway performs the serialization/deserialization, compression/decompression, column mapping, and type conversion. The data does not flow through the Azure Data Factory service. Instead, Data Management Gateway writes the data directly to the destination store.

(Figure: Copy between on-premises and cloud)

See Move data between on-premises and cloud data stores for an introduction and walkthrough. See Data Management Gateway for detailed information about this agent.

You can also move data from/to supported data stores that are hosted on Azure IaaS virtual machines (VMs) by using Data Management Gateway. In this case, you can install Data Management Gateway on the same VM as the data store itself, or on a separate VM that has access to the data store.

Supported data stores and formats

Copy Activity in Data Factory copies data from a source data store to a sink data store. Data Factory supports the following data stores. Data from any source can be written to any sink. Click a data store to learn how to copy data to and from that store.

Note

If you need to move data to/from a data store that Copy Activity doesn't support, use a custom activity in Data Factory with your own logic for copying or moving data. For details on creating and using a custom activity, see Use custom activities in an Azure Data Factory pipeline.

Category  Data store  Supported as a source  Supported as a sink
Azure  Azure Blob storage
  Azure Cosmos DB (SQL API)
  Azure Data Lake Storage Gen1
  Azure SQL Database
  Azure Synapse Analytics
  Azure Cognitive Search Index
  Azure Table storage
Databases  Amazon Redshift
  DB2*
  MySQL*
  Oracle*
  PostgreSQL*
  SAP Business Warehouse*
  SAP HANA*
  SQL Server*
  Sybase*
  Teradata*
NoSQL  Cassandra*
  MongoDB*
File  Amazon S3
  File System*
  FTP
  HDFS*
  SFTP
Others  Generic HTTP
  Generic OData
  Generic ODBC*
  Salesforce
  Web Table (table from HTML)

Note

Data stores with * can be on-premises or on Azure IaaS, and require you to install Data Management Gateway on an on-premises/Azure IaaS machine.

Supported file formats

You can use Copy Activity to copy files as-is between two file-based data stores; in that case, skip the format section in both the input and output dataset definitions. The data is copied efficiently without any serialization or deserialization.

Copy Activity can also read from and write to files in specified formats: Text, JSON, Avro, ORC, and Parquet. The compression codecs GZip, Deflate, BZip2, and ZipDeflate are also supported. For details, see Supported file and compression formats.

For example, you can do the following copy activities:

  • Copy data from a SQL Server database and write it to Azure Data Lake Store in ORC format.
  • Copy files in text (CSV) format from an on-premises file system and write them to Azure Blob storage in Avro format.
  • Copy zipped files from an on-premises file system, decompress them, and land them in Azure Data Lake Store.
  • Copy data in GZip-compressed text (CSV) format from Azure Blob storage and write it to Azure SQL Database.
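As a sketch of the last scenario, the input dataset's format and compression sections tell Copy Activity how to deserialize and decompress the blob data. The dataset name, linked service name, and folder path below are illustrative, not values from this article:

```json
{
  "name": "InputGzipCsvBlob",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "adfcontainer/inputdata",
      "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
      },
      "compression": {
        "type": "GZip",
        "level": "Optimal"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```

To copy files as-is instead, omit the format and compression sections from both dataset definitions.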

Globally available data movement

Azure Data Factory is available only in the West US, East US, and North Europe regions. However, the service that powers Copy Activity is available globally in the following regions and geographies. The globally available topology ensures efficient data movement that usually avoids cross-region hops. See Services by region for the availability of Data Factory and data movement in a region.

Copy data between cloud data stores

When both the source and sink data stores are in the cloud, Data Factory uses a service deployment in the region that is closest to the sink in the same geography to move the data. Refer to the following table for the mapping:

Geography of the destination data store  Region of the destination data store  Region used for data movement
United States  East US  East US
  East US 2  East US 2
  Central US  Central US
  North Central US  North Central US
  South Central US  South Central US
  West Central US  West Central US
  West US  West US
  West US 2  West US 2
Canada  Canada East  Canada Central
  Canada Central  Canada Central
Brazil  Brazil South  Brazil South
Europe  North Europe  North Europe
  West Europe  West Europe
United Kingdom  UK West  UK South
  UK South  UK South
Asia Pacific  Southeast Asia  Southeast Asia
  East Asia  Southeast Asia
Australia  Australia East  Australia East
  Australia Southeast  Australia Southeast
India  Central India  Central India
  West India  Central India
  South India  Central India
Japan  Japan East  Japan East
  Japan West  Japan East
Korea  Korea Central  Korea Central
  Korea South  Korea Central

Alternatively, you can explicitly indicate the region of the Data Factory service to be used to perform the copy by specifying the executionLocation property under the Copy Activity typeProperties. Supported values for this property are listed in the preceding "Region used for data movement" column. Note that your data goes through that region over the wire during the copy. For example, to copy between Azure stores in Korea, you can specify "executionLocation": "Japan East" to route through the Japan region (see the sample JSON later in this article for reference).

Note

If the region of the destination data store is not in the preceding list or is undetectable, Copy Activity fails by default instead of going through an alternative region, unless executionLocation is specified. The supported region list will be expanded over time.

Copy data between an on-premises data store and a cloud data store

When data is being copied between on-premises (or Azure virtual machines/IaaS) and cloud stores, Data Management Gateway performs the data movement on an on-premises machine or virtual machine. The data does not flow through the service in the cloud, unless you use the staged copy capability. In that case, data flows through a staging Azure Blob storage account before it is written into the sink data store.

Create a pipeline with Copy Activity

You can create a pipeline with Copy Activity in a couple of ways:

By using the Copy Wizard

The Data Factory Copy Wizard helps you create a pipeline with Copy Activity. This pipeline lets you copy data from supported sources to destinations without writing JSON definitions for linked services, datasets, and pipelines. For details about the wizard, see Data Factory Copy Wizard.

By using JSON scripts

You can use the Data Factory Editor, Visual Studio, or Azure PowerShell to create a JSON definition for a pipeline that uses Copy Activity. Then, you can deploy it to create the pipeline in Data Factory. For a tutorial with step-by-step instructions, see Tutorial: Use Copy Activity in an Azure Data Factory pipeline.

JSON properties (such as name, description, input and output tables, and policies) are available for all types of activities. The properties that are available in the typeProperties section of the activity vary with each activity type.

For Copy Activity, the typeProperties section varies depending on the types of sources and sinks. Click a source or sink in the Supported data stores table to learn about the type properties that Copy Activity supports for that data store.

Here's a sample JSON definition:

{
  "name": "ADFTutorialPipeline",
  "properties": {
    "description": "Copy data from Azure blob to Azure SQL table",
    "activities": [
      {
        "name": "CopyFromBlobToSQL",
        "type": "Copy",
        "inputs": [
          {
            "name": "InputBlobTable"
          }
        ],
        "outputs": [
          {
            "name": "OutputSQLTable"
          }
        ],
        "typeProperties": {
          "source": {
            "type": "BlobSource"
          },
          "sink": {
            "type": "SqlSink"
          },
          "executionLocation": "Japan East"          
        },
        "policy": {
          "concurrency": 1,
          "executionPriorityOrder": "NewestFirst",
          "retry": 0,
          "timeout": "01:00:00"
        }
      }
    ],
    "start": "2016-07-12T00:00:00Z",
    "end": "2016-07-13T00:00:00Z"
  }
}

The schedule that is defined in the output dataset determines when the activity runs (for example: daily, with frequency set to day and interval set to 1). The activity copies data from an input dataset (source) to an output dataset (sink).
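For instance, a daily schedule is expressed in the output dataset's availability section. A minimal sketch, reusing the OutputSQLTable name from the sample above (the linked service name and table name are illustrative):

```json
{
  "name": "OutputSQLTable",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": "AzureSqlLinkedService",
    "typeProperties": {
      "tableName": "MyOutputTable"
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```

With this availability setting, the copy activity produces one slice per day within the pipeline's start and end window.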

You can specify more than one input dataset for Copy Activity. The datasets are used to verify dependencies before the activity runs. However, only the data from the first dataset is copied to the destination dataset. For more information, see Scheduling and execution.

Performance and tuning

See the Copy Activity performance and tuning guide, which describes the key factors that affect the performance of data movement (Copy Activity) in Azure Data Factory. It also lists the performance observed during internal testing and discusses various ways to optimize the performance of Copy Activity.

Fault tolerance

By default, Copy Activity stops copying data and returns a failure when it encounters data that is incompatible between the source and sink. Alternatively, you can explicitly configure it to skip and log the incompatible rows and copy only the compatible data, so that the copy succeeds. For details, see Copy Activity fault tolerance.
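A minimal sketch of such a configuration, assuming the enableSkipIncompatibleRow and redirectIncompatibleRowSettings properties described in the fault tolerance article (the linked service name and path here are illustrative). These properties go under the Copy Activity typeProperties:

```json
"typeProperties": {
  "source": {
    "type": "BlobSource"
  },
  "sink": {
    "type": "SqlSink"
  },
  "enableSkipIncompatibleRow": true,
  "redirectIncompatibleRowSettings": {
    "linkedServiceName": "AzureStorageLinkedService",
    "path": "redirectcontainer/erroroutput"
  }
}
```

Skipped rows are logged to the specified blob path so you can inspect them after the run.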

Security considerations

See Security considerations, which describes the security infrastructure that data movement services in Azure Data Factory use to secure your data.

Scheduling and sequential copy

See Scheduling and execution for detailed information about how scheduling and execution work in Data Factory. You can run multiple copy operations one after another in a sequential, ordered manner. See the Copy sequentially section.

Type conversions

Different data stores have different native type systems. Copy Activity performs automatic type conversions from source types to sink types by using the following two-step approach:

  1. Convert from native source types to a .NET type.
  2. Convert from the .NET type to the native sink type.

The mapping from a native type system to .NET types for a data store is in the respective data store article. (Click the specific link in the Supported data stores table.) You can use these mappings to determine appropriate types while creating your tables, so that Copy Activity performs the right conversions.
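For example, declaring column types in a dataset's structure section tells the service which .NET types to convert through during the copy. A sketch with illustrative column names:

```json
"structure": [
  { "name": "userid", "type": "Int64" },
  { "name": "name", "type": "String" },
  { "name": "lastlogindate", "type": "Datetime" }
]
```

Copy Activity then converts each source column's native type to the declared .NET type, and from there to the corresponding native type of the sink.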

Next steps