
Copy Activity performance and tuning guide

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Blazing-fast data loading performance is key to ensuring you can focus on the core "big data" problem: building advanced analytics solutions and getting deep insights from all that data.

Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a highly optimized data loading experience that is easy to configure and set up. With just a single copy activity, you can achieve:

  • Loading data into Azure SQL Data Warehouse at 1.2 GBps.
  • Loading data into Azure Blob storage at 1.0 GBps.
  • Loading data into Azure Data Lake Store at 1.0 GBps.

This article describes:

Note

If you are not familiar with Copy Activity in general, see Copy Activity overview before reading this article.

Performance reference

As a reference, the following table shows the copy throughput in MBps for given source and sink pairs in a single copy activity run, based on in-house testing. For comparison, it also demonstrates how different settings of Data Integration Units or Self-hosted Integration Runtime scalability (multiple nodes) can improve copy performance.

(Performance matrix)

Important

When a copy activity is executed on an Azure Integration Runtime, the minimum allowed number of Data Integration Units (formerly known as Data Movement Units) is two. If not specified, see the default Data Integration Units used in Data Integration Units.

Points to note:

  • Throughput is calculated by using the following formula: [size of data read from source] / [copy activity run duration]. For example, reading 100 GB from the source over a 10-minute run amounts to roughly 170 MBps.
  • The performance reference numbers in the table were measured using a TPC-H dataset in a single copy activity run. Test files for file-based stores are multiple files of 10 GB each.
  • In Azure data stores, the source and sink are in the same Azure region.
  • For hybrid copy between on-premises and cloud data stores, each Self-hosted Integration Runtime node ran on a machine separate from the data store, with the specification below. When a single activity was running, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth.
    CPU: 32 cores, 2.20 GHz Intel Xeon E5-2660 v2
    Memory: 128 GB
    Network: Internet interface: 10 Gbps; intranet interface: 40 Gbps

Tip

You can achieve higher throughput by using more Data Integration Units (DIUs). For example, with 100 DIUs, you can copy data from Azure Blob storage into Azure Data Lake Store at 1.0 GBps. See the Data Integration Units section for details about this feature and the supported scenario.

Data Integration Units

A Data Integration Unit (DIU), formerly known as a Cloud Data Movement Unit (DMU), is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. DIUs apply only to the Azure Integration Runtime, not to the Self-hosted Integration Runtime.

The minimum number of Data Integration Units to empower a copy activity run is two. If not specified, the following table lists the default DIUs used in different copy scenarios:

Copy scenario | Default DIUs determined by the service
Copy data between file-based stores | Between 4 and 32, depending on the number and size of the files
All other copy scenarios | 4

To override this default, specify a value for the dataIntegrationUnits property as follows. The allowed value for the dataIntegrationUnits property is up to 256. The actual number of DIUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.

You can see the Data Integration Units actually used for each copy run in the copy activity output when you monitor an activity run. Learn more from Copy activity monitoring.

Note

A DIU setting larger than 4 currently applies only when you copy multiple files from Azure Storage/Data Lake Storage/Amazon S3/Google Cloud Storage/cloud FTP/cloud SFTP to any other cloud data stores.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "dataIntegrationUnits": 32
        }
    }
]

Data Integration Units billing impact

It's important to remember that you are charged based on the total time of the copy operation. The total duration you are billed for data movement is the sum of duration across DIUs. If a copy job used to take one hour with two cloud units (2 units × 60 minutes = 120 unit-minutes) and now takes 15 minutes with eight cloud units (8 units × 15 minutes = 120 unit-minutes), the overall bill remains almost the same.

Parallel copy

You can use the parallelCopies property to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that can read from your source or write to your sink data stores in parallel.

For each copy activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you are using:

Copy scenario | Default parallel copy count determined by the service
Copy data between file-based stores | Depends on the size of the files and the number of Data Integration Units (DIUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine
Copy data from any source data store to Azure Table storage | 4
All other copy scenarios | 1

Tip

When copying data between file-based stores, the default behavior (auto-determined) usually gives you the best throughput.

To control the load on machines that host your data stores, or to tune copy performance, you can override the default value and specify a value for the parallelCopies property. The value must be an integer greater than or equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the value that you set.

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "parallelCopies": 32
        }
    }
]

Points to note:

  • When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. Chunking within a single file happens underneath, automatically and transparently; it uses the best-suited chunk size for the given source data store type to load data in parallel, orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, the copy activity cannot take advantage of file-level parallelism.
  • When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores, and on the Self-hosted Integration Runtime if the copy activity is empowered by it (for example, for hybrid copy). This happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the Self-hosted Integration Runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
  • When you copy data from stores that are not file-based to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
  • parallelCopies is orthogonal to dataIntegrationUnits. The former is counted across all the Data Integration Units.

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  • You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. However, the source data must be in Blob storage or Azure Data Lake Store, and it must meet additional criteria. When you load data from a data store other than Blob storage or Azure Data Lake Store, you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse efficiently. For more information, see Use PolyBase to load data into Azure SQL Data Warehouse.
  • Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-premises data store to a cloud data store) over a slow network connection. To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move the data to the staging data store in the cloud, and then decompress the data in the staging store before loading it into the destination data store.
  • You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage of the Self-hosted Integration Runtime to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443, and then load the data into SQL Database or SQL Data Warehouse from Blob storage staging. In this flow, you don't need to enable port 1433.

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging Blob storage (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the staging storage after the data movement is complete.

(Staged copy diagram)

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before it is moved from the source data store to the interim or staging data store, and then decompressed before it is moved from the interim or staging data store to the sink data store.

Currently, you can't copy data between two on-premises data stores by using a staging store.

Configuration

Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or Storage shared access signature linked service for staging.

  • enableStaging — Specify whether you want to copy data via an interim staging store. Default value: False. Required: No.
  • linkedServiceName — Specify the name of an AzureStorage linked service, which refers to the instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase; you can use it in all other scenarios. Default value: N/A. Required: Yes, when enableStaging is set to TRUE.
  • path — Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or if you require temporary data to be in a specific location. Default value: N/A. Required: No.
  • enableCompression — Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. Default value: False. Required: No.

Note

If you use staged copy with compression enabled, service principal or MSI authentication for the staging Blob linked service is not supported.

Here's a sample definition of a copy activity with the properties that are described in the preceding table:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
            },
            "sink": {
                "type": "SqlSink"
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "MyStagingBlob",
                    "type": "LinkedServiceReference"
                },
                "path": "stagingcontainer/path",
                "enableCompression": true
            }
        }
    }
]

Staged copy billing impact

You are charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store, with both stages empowered by the Azure Integration Runtime), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store, with one stage empowered by the Self-hosted Integration Runtime), you are charged [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].

Performance tuning steps

We suggest that you take these steps to tune the performance of your Data Factory service with the copy activity:

  1. Establish a baseline. During the development phase, test your pipeline by using the copy activity against a representative data sample. Collect execution details and performance characteristics by following Copy activity monitoring.

  2. Diagnose and optimize performance. If the performance you observe doesn't meet your expectations, identify the performance bottlenecks. Then optimize performance to remove or reduce their effect.

    In some cases, when you execute a copy activity in ADF, you will see "Performance tuning tips" directly at the top of the copy activity monitoring page, as shown in the following example. It not only tells you the bottleneck identified for the given copy run, but also guides you on what to change to boost copy throughput. The performance tuning tips currently provide suggestions such as using PolyBase when copying data into Azure SQL Data Warehouse, increasing Azure Cosmos DB RUs or Azure SQL DB DTUs when the resource on the data store side is the bottleneck, and removing unnecessary staged copies. The performance tuning rules will be gradually enriched over time.

    Example: copy into Azure SQL DB with performance tuning tips

    In this sample, during the copy run, ADF noticed that the sink Azure SQL DB had reached high DTU utilization, which slowed down the write operations. The suggestion is therefore to increase the Azure SQL DB tier with more DTUs.

    (Copy monitoring with performance tuning tips)

    In addition, the following are some common considerations. A full description of performance diagnosis is beyond the scope of this article.

  3. Expand the configuration to your entire data set. When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire data set.

Considerations for the Self-hosted Integration Runtime

If your copy activity is executed on a Self-hosted Integration Runtime, note the following:

Setup: We recommend that you use a dedicated machine to host the Integration Runtime. See Considerations for using Self-hosted Integration Runtime.

Scale out: A single logical Self-hosted Integration Runtime with one or more nodes can serve multiple copy activity runs concurrently. If you have a heavy need for hybrid data movement, either with a large number of concurrent copy activity runs or with a large volume of data to copy, consider scaling out the Self-hosted Integration Runtime to provision more resources to empower the copy.

Considerations for the source

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, see the monitoring and tuning topics that are specific to each data store; they help you understand data store performance characteristics, minimize response times, and maximize throughput.

File-based data stores

  • Average file size and file count: The copy activity transfers data one file at a time. With the same amount of data to be moved, the overall throughput is lower if the data consists of many small files rather than a few large files, because of the bootstrap phase for each file. Therefore, if possible, combine small files into larger files to gain higher throughput.
  • File format and compression: For more ways to improve performance, see the Considerations for serialization and deserialization and Considerations for compression sections.

Relational data stores

  • Data pattern: Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size. The reason is that the database can more efficiently retrieve fewer batches of data that contain fewer rows.
  • Query or stored procedure: Optimize the logic of the query or stored procedure you specify in the copy activity source to fetch data more efficiently.

Considerations for the sink

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, refer to the monitoring and tuning topics that are specific to each data store. These topics can help you understand data store performance characteristics and how to minimize response times and maximize throughput.

File-based data stores

  • Copy behavior: If you copy data from a different file-based data store, the copy activity has three options via the copyBehavior property: it preserves hierarchy, flattens hierarchy, or merges files. Either preserving or flattening hierarchy has little or no performance overhead, but merging files increases performance overhead. See the sketch after this list for how the property is set.
  • File format and compression: See the Considerations for serialization and deserialization and Considerations for compression sections for more ways to improve performance.
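
To show where this option goes, here is a minimal sketch in the same style as the earlier activity examples in this article. It only illustrates setting copyBehavior on a file-based sink; the MergeFiles value shown is the option noted above as carrying extra overhead, and the activity name and source/sink types are placeholders.

"activities":[
    {
        "name": "Sample copy activity with copy behavior",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "BlobSink",
                "copyBehavior": "MergeFiles"
            }
        }
    }
]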

Relational data stores

  • Copy behavior and performance implications: There are different ways to write data into a SQL sink; learn more from Best practice for loading data into Azure SQL Database.

  • Data pattern and batch size:

    • Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size because the database can more efficiently commit fewer batches of data.
    • The copy activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property to a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize; a high value might lead to a copy failure caused by overloading the database. See the sketch after this list for where the property is set on a SQL sink.
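
As a rough illustration, the following minimal sketch follows the same activity pattern as the earlier examples. The writeBatchSize value of 100000 is only an assumed number for small-row data, and the activity name, source type, and dataset references are placeholders.

"activities":[
    {
        "name": "Sample copy activity with batch size",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "SqlSink",
                "writeBatchSize": 100000
            }
        }
    }
]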

NoSQL stores

  • For Table storage:
    • Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your source data by partition key so that the data is inserted efficiently into one partition after another, or adjust the logic to write the data to a single partition.

Considerations for serialization and deserialization

Serialization and deserialization can occur when your input or output data set is a file. See Supported file and compression formats for details on the file formats supported by the copy activity.

Copy behavior:

  • Copying files between file-based data stores:
    • When the input and output data sets both have the same file format settings, or no file format settings, the data movement service executes a binary copy without any serialization or deserialization. You see a higher throughput compared to the scenario in which the source and sink file format settings differ from each other.
    • When the input and output data sets are both in text format and only the encoding type differs, the data movement service only does encoding conversion. It doesn't do any serialization or deserialization, which causes some performance overhead compared to a binary copy.
    • When the input and output data sets have different file formats or different configurations, such as delimiters, the data movement service deserializes the source data to stream, transform, and then serialize it into the output format you indicated. This operation results in a much more significant performance overhead compared to the other scenarios.
  • When you copy files to or from a data store that is not file-based (for example, from a file-based store to a relational store), the serialization or deserialization step is required. This step results in significant performance overhead.

File format: The file format you choose might affect copy performance. For example, Avro is a compact binary format that stores metadata with the data, and it has broad support in the Hadoop ecosystem for processing and querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy throughput compared to text format. Make your choice of file format throughout the processing flow holistically. Start with the form in which the data is stored (in the source data stores, or to be extracted from external systems); the best format for storage, analytical processing, and querying; and the format in which the data should be exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for read and write performance might still be a good choice for the overall analytical process.

Considerations for compression

When your input or output data set is a file, you can set the copy activity to perform compression or decompression as it writes data to the destination. When you choose compression, you make a tradeoff between input/output (I/O) and CPU. Compressing the data costs extra compute resources, but in return it reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.

Codec: Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most balanced option, and it is used most often. Choose the codec that best suits your end-to-end scenario.

Level: You can choose from two options for each compression codec: fastest compressed and optimally compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file is not optimally compressed. The optimally compressed option spends more time on compression and yields a minimal amount of data. You can test both options to see which provides better overall performance in your case.
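
As a sketch of where the codec and level are chosen, the typeProperties of a Blob dataset can carry a compression block like the one below. The dataset name, linked service reference, folder path, and the GZip/Optimal combination are placeholders for illustration, not a recommendation.

{
    "name": "SampleOutputBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "outputcontainer/path",
            "format": {
                "type": "TextFormat"
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        }
    }
}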

A consideration: To copy a large amount of data between an on-premises store and the cloud, consider using staged copy with compression enabled. Using interim storage is helpful when the bandwidth between your corporate network and your Azure services is the limiting factor, and you want both the input data set and the output data set to be in uncompressed form.

Considerations for column mapping

You can set the columnMappings property in the copy activity to map all or a subset of the input columns to the output columns. After the data movement service reads the data from the source, it needs to perform column mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.

If your source data store is queryable, for example, if it's a relational store like SQL Database or SQL Server, or a NoSQL store like Table storage or Azure Cosmos DB, consider pushing the column filtering and reordering logic to the query property instead of using column mapping. This way, the projection occurs while the data movement service reads data from the source data store, which is much more efficient.
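
For example, here is a minimal sketch, in the same style as the earlier activity examples, of pushing the projection into the source query instead of using columnMappings. The table and column names in the query are purely illustrative.

"activities":[
    {
        "name": "Sample copy activity with source query",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT CustomerId, Name, Region FROM dbo.Customers"
            },
            "sink": {
                "type": "BlobSink"
            }
        }
    }
]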

复制活动架构映射中了解详细信息。Learn more from Copy Activity schema mapping.

Other considerations

If the size of the data you want to copy is large, you can adjust your business logic to further partition the data and schedule the copy activity to run more frequently, reducing the data size for each copy activity run.

Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy job internal retries, and in some cases, execution failures.

Sample scenario: Copy from an on-premises SQL Server to Blob storage

Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To make the copy job faster, the CSV files should be compressed into bzip2 format.

Test and analysis: The throughput of the copy activity is less than 2 MBps, which is much slower than the performance benchmark.

Performance analysis and tuning: To troubleshoot the performance issue, let's look at how the data is processed and moved.

  1. Read data: The Integration Runtime opens a connection to SQL Server and sends the query. SQL Server responds by sending the data stream to the Integration Runtime via the intranet.
  2. Serialize and compress data: The Integration Runtime serializes the data stream to CSV format and compresses the data into a bzip2 stream.
  3. Write data: The Integration Runtime uploads the bzip2 stream to Blob storage via the Internet.

As you can see, the data is being processed and moved in a streaming, sequential manner: SQL Server > LAN > Integration Runtime > WAN > Blob storage. The overall performance is gated by the minimum throughput across the pipeline.

(Data flow diagram)

One or more of the following factors might cause the performance bottleneck:

  • Source: SQL Server itself has low throughput because of heavy loads.
  • Self-hosted Integration Runtime:
    • LAN: The Integration Runtime is located far from the SQL Server machine and has a low-bandwidth connection.
    • Integration Runtime: The Integration Runtime has reached its load limitations to perform the following operations:
      • Serialization: Serializing the data stream to CSV format has slow throughput.
      • Compression: You chose a slow compression codec (for example, bzip2, which runs at 2.8 MBps with a Core i7).
    • WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1 = 1,544 kbps; T2 = 6,312 kbps).
  • Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of 60 MBps.)

In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip compression codec might ease this bottleneck.

Reference

Here are performance monitoring and tuning references for some of the supported data stores:

Next steps

See the other Copy Activity articles: