
Copy Activity performance and tuning guide

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see the Copy activity performance and tuning guide for Data Factory.

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Blazing-fast data loading performance is key to ensuring that you can focus on the core "big data" problem: building advanced analytics solutions and getting deep insights from all that data.

Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a highly optimized data loading experience that is easy to configure and set up. With just a single copy activity, you can load data at high throughput.

This article describes the copy performance you can expect and the main factors that affect it, along with ways to tune them.

Note

If you aren't familiar with Copy Activity in general, see Move data by using Copy Activity before you read this article.

Performance reference

As a reference, the table below shows the copy throughput in MBps for given source and sink pairs, based on in-house testing. For comparison, it also demonstrates how different settings for cloud data movement units or Data Management Gateway scalability (multiple gateway nodes) can help copy performance.

Performance matrix

Important

In Azure Data Factory version 1, the minimum number of cloud data movement units for cloud-to-cloud copy is two. If not specified, see Cloud data movement units for the default data movement units used.

Points to note:

  • Throughput is calculated by using the following formula: [size of data read from source] / [Copy Activity run duration].
  • The performance reference numbers in the table were measured by using a TPC-H data set in a single Copy Activity run.
  • In Azure data stores, the source and sink are in the same Azure region.
  • For hybrid copy between an on-premises data store and a cloud data store, each gateway node ran on a machine separate from the on-premises data store, with the specification below. When a single activity was running on the gateway, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth. Learn more from Considerations for Data Management Gateway.
    CPU: 32 cores, 2.20 GHz Intel Xeon E5-2660 v2
    Memory: 128 GB
    Network: Internet interface: 10 Gbps; intranet interface: 40 Gbps

Tip

You can achieve higher throughput by using more data movement units (DMUs) than the default maximum, which is 32 DMUs for a cloud-to-cloud Copy Activity run. For example, with 100 DMUs, you can copy data from Azure Blob storage into Azure Data Lake Store at 1.0 GBps. See the Cloud data movement units section for details about this feature and the supported scenarios. To request more DMUs, contact support.

Parallel copy

You can read data from the source or write data to the destination in parallel within a Copy Activity run. This feature enhances the throughput of a copy operation and reduces the time it takes to move data.

This setting is different from the concurrency property in the activity definition. The concurrency property determines the number of concurrent Copy Activity runs that process data from different activity windows (1 AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical load. The parallel copy capability applies to a single activity run.

Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed. Data Factory runs an instance of Copy Activity (an activity run) for each slice:

  • The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1
  • The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2
  • The data slice from the third activity window (3 AM to 4 AM) ==> Activity run 3

And so on.

In this example, when the concurrency value is set to 2, Activity run 1 and Activity run 2 copy data from two activity windows concurrently to improve data movement performance. However, if multiple files are associated with Activity run 1, the data movement service copies files from the source to the destination one file at a time.
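
For illustration, here is a minimal sketch of where the concurrency property sits in an activity definition; the dataset names, hourly schedule, and timeout value are placeholders rather than values taken from the scenario above:

"activities":[  
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "BlobSink" }
        },
        "scheduler": { "frequency": "Hour", "interval": 1 },
        "policy": {
            "concurrency": 2,
            "timeout": "01:00:00"
        }
    }
]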

Cloud data movement units

A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. A DMU applies to cloud-to-cloud copy operations, but not to a hybrid copy.

The minimum number of cloud data movement units that can power a Copy Activity run is two. If not specified, the following table lists the default DMUs used in different copy scenarios:

Copy scenario | Default DMUs determined by service
Copy data between file-based stores | Between 4 and 16, depending on the number and size of the files.
All other copy scenarios | 4

To override this default, specify a value for the cloudDataMovementUnits property as follows. The allowed values for the cloudDataMovementUnits property are 2, 4, 8, 16, and 32. The actual number of cloud DMUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.

"activities":[  
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "cloudDataMovementUnits": 32
        }
    }
]

Note

If you need more cloud DMUs for higher throughput, contact Azure support. Setting a value of 8 or higher currently works only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database.

parallelCopies

You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You can think of this property as the maximum number of threads within Copy Activity that can read from your source or write to your sink data stores in parallel.

For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you are using.

Source and sink | Default parallel copy count determined by service
Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an on-premises HDFS) | Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or on the physical configuration of the gateway machine used for a hybrid copy (to copy data to or from an on-premises data store).
Copy data from any source data store to Azure Table storage | 4
All other source and sink pairs | 1

Usually, the default behavior should give you the best throughput. However, to control the load on machines that host your data stores, or to tune copy performance, you may choose to override the default value and specify a value for the parallelCopies property. The value must be between 1 and 32 (both inclusive). At run time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set.

"activities":[  
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "parallelCopies": 8
        }
    }
]

Points to note:

  • When you copy data between file-based stores, the parallelCopies setting determines the parallelism at the file level. Chunking within a single file happens automatically and transparently underneath, and it is designed to use the best chunk size for a given source data store type to load data in parallel, orthogonal to parallelCopies. The actual number of parallel copies that the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, Copy Activity cannot take advantage of file-level parallelism.
  • When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores, and on the gateway if it is a hybrid copy. This matters especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the gateway is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
  • When you copy data from stores that are not file-based to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.

Note

You must use Data Management Gateway version 1.11 or later to use the parallelCopies feature when you do a hybrid copy.

To better use these two properties, and to enhance your data movement throughput, see the sample use cases. You don't need to configure parallelCopies to take advantage of the default behavior. If you do configure it and parallelCopies is too small, multiple cloud DMUs might not be fully utilized.

Billing impact

It's important to remember that you are charged based on the total time of the copy operation. If a copy job used to take one hour with one cloud unit and now it takes 15 minutes with four cloud units, the overall bill remains almost the same. For example, say you use four cloud units. The first cloud unit spends 10 minutes, the second one 10 minutes, the third one 5 minutes, and the fourth one 5 minutes, all in one Copy Activity run. You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using parallelCopies does not affect billing.

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  1. You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. However, the source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse. For more details, see Use PolyBase to load data into Azure SQL Data Warehouse. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory. A configuration sketch for this case follows this list.
  2. Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-premises data store and a cloud data store) over a slow network connection. To improve performance, you can compress the data on-premises so that it takes less time to move the data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
  3. You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then, load the data into SQL Database or SQL Data Warehouse from Blob storage staging. In this flow, you don't need to enable port 1433.
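
For case 1, here is a minimal sketch of an activity that stages data and then loads it into SQL Data Warehouse with PolyBase; the allowPolyBase flag and stagingSettings follow the pattern shown later in this article, and the dataset and linked-service names are placeholders:

"activities":[  
    {
        "name": "Sample copy to SQL Data Warehouse",
        "type": "Copy",
        "inputs": [{ "name": "OnPremisesSQLServerInput" }],
        "outputs": [{ "name": "AzureSQLDWOutput" }],
        "typeProperties": {
            "source": {
                "type": "SqlSource"
            },
            "sink": {
                "type": "SqlDWSink",
                "allowPolyBase": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": "MyStagingBlob"
            }
        }
    }
]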

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging data store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the staging storage after the data movement is complete.

In the cloud copy scenario (both source and sink data stores are in the cloud), the gateway is not used. The Data Factory service performs the copy operations.

Staged copy: cloud scenario

In the hybrid copy scenario (the source is on-premises and the sink is in the cloud), the gateway moves data from the source data store to a staging data store. The Data Factory service moves data from the staging data store to the sink data store. Copying data from a cloud data store to an on-premises data store via staging is also supported with the reversed flow.

Staged copy: hybrid scenario

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before you move it from the source data store to an interim or staging data store, and then decompressed before you move it from the interim or staging data store to the sink data store.

Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this option to be available soon.

Configuration

Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or Storage shared access signature linked service for staging.

Property | Description | Default value | Required
enableStaging | Specify whether you want to copy data via an interim staging store. | False | No
linkedServiceName | Specify the name of an AzureStorage or AzureStorageSas linked service, which refers to the instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase; you can use it in all other scenarios. | N/A | Yes, when enableStaging is set to TRUE
path | Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or if you require temporary data to be in a specific location. | N/A | No
enableCompression | Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. | False | No

Here's a sample definition of Copy Activity with the properties described in the preceding table:

"activities":[  
{
    "name": "Sample copy activity",
    "type": "Copy",
    "inputs": [{ "name": "OnpremisesSQLServerInput" }],
    "outputs": [{ "name": "AzureSQLDBOutput" }],
    "typeProperties": {
        "source": {
            "type": "SqlSource"
        },
        "sink": {
            "type": "SqlSink"
        },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": "MyStagingBlob",
            "path": "stagingcontainer/path",
            "enableCompression": true
        }
    }
}
]

Billing impact

You are charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store), you are charged [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store), you are charged [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].

Performance tuning steps

We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:

  1. Establish a baseline. During the development phase, test your pipeline by using Copy Activity against a representative data sample. You can use the Data Factory slicing model to limit the amount of data you work with.

    Collect execution time and performance characteristics by using the Monitoring and Management App. Choose Monitor & Manage on your Data Factory home page. In the tree view, choose the output dataset. In the Activity Windows list, choose the Copy Activity run. Activity Windows lists the Copy Activity duration and the size of the data that's copied. The throughput is listed in Activity Window Explorer. To learn more about the app, see Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management App.

    Activity run details

    Later in the article, you can compare the performance and configuration of your scenario to Copy Activity's performance reference from our tests.

  2. Diagnose and optimize performance. If the performance you observe doesn't meet your expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks. A full description of performance diagnosis is beyond the scope of this article, but common considerations are covered in the sections that follow.

  3. Expand the configuration to your entire data set. When you're satisfied with the execution results and performance, you can expand the definition and pipeline active period to cover your entire data set.

Considerations for Data Management Gateway

Gateway setup: We recommend that you use a dedicated machine to host Data Management Gateway. See Considerations for using Data Management Gateway.

Gateway monitoring and scale-up/out: A single logical gateway with one or more gateway nodes can serve multiple Copy Activity runs concurrently. You can view a near-real-time snapshot of resource utilization (CPU, memory, network in/out, and so on) on a gateway machine, as well as the number of concurrent jobs running versus the limit, in the Azure portal; see Monitor gateway in the portal. If you have a heavy need for hybrid data movement, either with a large number of concurrent Copy Activity runs or with a large volume of data to copy, consider scaling up or scaling out the gateway to better utilize your resources, or provision more resources to power the copy.

Considerations for the source

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, see the monitoring and tuning topics that are specific to those data stores; they help you understand data store performance characteristics, minimize response times, and maximize throughput.

If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.

File-based data stores

(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)

  • Average file size and file count: Copy Activity transfers data one file at a time. With the same amount of data to be moved, the overall throughput is lower if the data consists of many small files rather than a few large files, because of the bootstrap phase for each file. Therefore, if possible, combine small files into larger files to gain higher throughput.
  • File format and compression: For more ways to improve performance, see the Considerations for serialization and deserialization and Considerations for compression sections.
  • For the on-premises file system scenario, in which Data Management Gateway is required, see the Considerations for Data Management Gateway section.

Relational data stores

(Includes SQL Database; SQL Data Warehouse; Amazon Redshift; SQL Server databases; and Oracle, MySQL, DB2, Teradata, Sybase, and PostgreSQL databases)

  • Data pattern: Your table schema affects copy throughput. A large row size gives you better performance than a small row size for copying the same amount of data, because the database can more efficiently retrieve fewer batches of data that contain fewer rows.
  • Query or stored procedure: Optimize the logic of the query or stored procedure that you specify in the Copy Activity source to fetch data more efficiently.
  • For on-premises relational databases, such as SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.

Considerations for the sink

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, refer to the monitoring and tuning topics that are specific to those data stores. These topics can help you understand data store performance characteristics and how to minimize response times and maximize throughput.

If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.

File-based data stores

(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)

  • Copy behavior: If you copy data from a different file-based data store, Copy Activity has three options via the copyBehavior property: preserve hierarchy, flatten hierarchy, or merge files. Either preserving or flattening hierarchy has little or no performance overhead, but merging files causes performance overhead to increase.
  • File format and compression: See the Considerations for serialization and deserialization and Considerations for compression sections for more ways to improve performance.
  • Blob storage: Currently, Blob storage supports only block blobs for optimized data transfer and throughput.
  • For on-premises file system scenarios that require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.

Relational data stores

(Includes SQL Database, SQL Data Warehouse, SQL Server databases, and Oracle databases)

  • Copy behavior: Depending on the properties you've set for sqlSink, Copy Activity writes data to the destination database in different ways. A sketch of the relevant sink properties follows this list.
    • By default, the data movement service uses the Bulk Copy API to insert data in append mode, which provides the best performance.
    • If you configure a stored procedure in the sink, the database applies the data one row at a time instead of as a bulk load, and performance drops significantly. If your data set is large, when applicable, consider switching to the sqlWriterCleanupScript property.
    • If you configure the sqlWriterCleanupScript property, the service triggers the script for each Copy Activity run and then uses the Bulk Copy API to insert the data. For example, to overwrite the entire table with the latest data, you can specify a script that first deletes all records before the new data is bulk-loaded from the source.
  • Data pattern and batch size:
    • Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size because the database can more efficiently commit fewer batches of data.
    • Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property to a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize. A high value might lead to a copy failure caused by overloading the database.
  • For on-premises relational databases like SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.
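
To make these sink properties concrete, here is a minimal sketch of a SqlSink that clears the target table before bulk loading and uses a larger batch size; the cleanup script, batch size, and dataset names are placeholder values to adjust for your own workload:

"activities":[  
    {
        "name": "Sample copy to Azure SQL Database",
        "type": "Copy",
        "inputs": [{ "name": "InputBlobDataset" }],
        "outputs": [{ "name": "AzureSQLDBOutput" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "SqlSink",
                "sqlWriterCleanupScript": "DELETE FROM MyTargetTable",
                "writeBatchSize": 10000,
                "writeBatchTimeout": "00:30:00"
            }
        }
    }
]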

NoSQL stores

(Includes Table storage and Azure Cosmos DB)

  • For Table storage:
    • Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your source data by partition key so that the data is inserted efficiently into one partition after another, or adjust the logic to write the data to a single partition.
  • For Azure Cosmos DB:
    • Batch size: The writeBatchSize property sets the number of parallel requests to the Azure Cosmos DB service to create documents. You can expect better performance when you increase writeBatchSize because more parallel requests are sent to Azure Cosmos DB. However, watch for throttling when you write to Azure Cosmos DB (the error message is "Request rate is large"). Various factors can cause throttling, including document size, the number of terms in the documents, and the target collection's indexing policy. To achieve higher copy throughput, consider using a better collection, for example, S3. A sink sketch follows this list.
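
For illustration, a minimal sketch of an Azure Cosmos DB sink with an increased batch size might look like the following; the batch size of 100 and the dataset names are placeholders to tune against your collection's throughput and indexing policy:

"activities":[  
    {
        "name": "Sample copy to Azure Cosmos DB",
        "type": "Copy",
        "inputs": [{ "name": "InputBlobDataset" }],
        "outputs": [{ "name": "CosmosDbOutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource"
            },
            "sink": {
                "type": "DocumentDbCollectionSink",
                "writeBatchSize": 100
            }
        }
    }
]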

Considerations for serialization and deserialization

Serialization and deserialization can occur when your input data set or output data set is a file. See Supported file and compression formats for details on the file formats supported by Copy Activity.

Copy behavior:

  • Copying files between file-based data stores:
    • When the input and output data sets both have the same file format settings, or no file format settings, the data movement service executes a binary copy without any serialization or deserialization. You see a higher throughput compared to the scenario in which the source and sink file format settings differ from each other.
    • When the input and output data sets are both in text format and only the encoding type is different, the data movement service does only encoding conversion. It doesn't do any serialization or deserialization, which causes some performance overhead compared to a binary copy.
    • When the input and output data sets have different file formats or different configurations, like delimiters, the data movement service deserializes the source data to stream, transform, and then serialize it into the output format you indicated. This operation results in a much more significant performance overhead compared to the other scenarios.
  • When you copy files to or from a data store that is not file-based (for example, from a file-based store to a relational store), the serialization or deserialization step is required. This step results in significant performance overhead.

File format: The file format you choose might affect copy performance. For example, Avro is a compact binary format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy throughput compared to text format. Make your choice of file format throughout the processing flow holistically. Start with the form the data is stored in (in source data stores, or to be extracted from external systems); the best format for storage, analytical processing, and querying; and the format in which the data should be exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for read and write performance might be a good choice for the overall analytical process.

Considerations for compression

When your input or output data set is a file, you can set Copy Activity to perform compression or decompression as it writes data to the destination. When you choose compression, you make a tradeoff between input/output (I/O) and CPU. Compressing the data costs extra compute resources. But in return, it reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.

Codec: Copy Activity supports gzip, bzip2, and Deflate compression types. Azure HDInsight can consume all three types for processing. Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most balanced option, and it is used most often. Choose the codec that best suits your end-to-end scenario.

Level: You can choose from two options for each compression codec: fastest compressed and optimally compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file is not optimally compressed. The optimally compressed option spends more time on compression and yields a minimal amount of data. You can test both options to see which provides better overall performance in your case.
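
To show where the codec and level are set, here is a minimal sketch of a Blob dataset whose files are written as gzip-compressed text; the folder path, delimiter, and availability settings are placeholders:

{
    "name": "CompressedBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "AzureStorageLinkedService",
        "typeProperties": {
            "folderPath": "mycontainer/compressed/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            },
            "compression": {
                "type": "GZip",
                "level": "Optimal"
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        }
    }
}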

A consideration: To copy a large amount of data between an on-premises store and the cloud, consider using interim Blob storage with compression. Using interim storage is helpful when the bandwidth of your corporate network and your Azure services is the limiting factor, and you want both the input data set and the output data set to be in uncompressed form. More specifically, you can break a single copy activity into two copy activities. The first copy activity copies from the source to an interim or staging blob in compressed form. The second copy activity copies the compressed data from staging, and then decompresses it while it writes to the sink.

Considerations for column mapping

You can set the columnMappings property in Copy Activity to map all or a subset of the input columns to the output columns. After the data movement service reads the data from the source, it needs to perform column mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.

If your source data store is queryable (for example, if it's a relational store like SQL Database or SQL Server, or a NoSQL store like Table storage or Azure Cosmos DB), consider pushing the column filtering and reordering logic to the query property instead of using column mapping. This way, the projection occurs while the data movement service reads data from the source data store, which is much more efficient.
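
As a rough sketch of this recommendation, the first activity below maps columns through a TabularTranslator, while the second pushes the same projection into the source query instead; the table, column, and dataset names are hypothetical:

"activities":[  
    {
        "name": "Copy with column mapping",
        "type": "Copy",
        "inputs": [{ "name": "SqlInputDataset" }],
        "outputs": [{ "name": "AzureSQLDBOutput" }],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "SqlSink" },
            "translator": {
                "type": "TabularTranslator",
                "columnMappings": "CustomerId: Id, CustomerName: Name"
            }
        }
    },
    {
        "name": "Copy with the projection pushed into the query",
        "type": "Copy",
        "inputs": [{ "name": "SqlInputDataset" }],
        "outputs": [{ "name": "AzureSQLDBOutput" }],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "SELECT CustomerId AS Id, CustomerName AS Name FROM Customers"
            },
            "sink": { "type": "SqlSink" }
        }
    }
]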

Other considerations

If the size of the data you want to copy is large, you can adjust your business logic to further partition the data by using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run, as sketched below.
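
As a sketch of this pattern for a relational source, the activity below runs once per hour and uses the slice window variables so that each run copies only one hour of data; the table name, timestamp column, and dataset names are hypothetical:

"activities":[  
    {
        "name": "Hourly partitioned copy",
        "type": "Copy",
        "inputs": [{ "name": "OnPremisesSQLServerInput" }],
        "outputs": [{ "name": "BlobOutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
                "sqlReaderQuery": "$$Text.Format('SELECT * FROM MyTable WHERE LastModified >= \\'{0:yyyy-MM-dd HH:mm}\\' AND LastModified < \\'{1:yyyy-MM-dd HH:mm}\\'', WindowStart, WindowEnd)"
            },
            "sink": {
                "type": "BlobSink"
            }
        },
        "scheduler": { "frequency": "Hour", "interval": 1 }
    }
]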

Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy job internal retries, and in some cases, execution failures.

Sample scenario: Copy from an on-premises SQL Server to Blob storage

Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To make the copy job faster, the CSV files should be compressed into bzip2 format.

Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the performance benchmark.

Performance analysis and tuning: To troubleshoot the performance issue, let's look at how the data is processed and moved.

  1. Read data: The gateway opens a connection to SQL Server and sends the query. SQL Server responds by sending the data stream to the gateway via the intranet.
  2. Serialize and compress data: The gateway serializes the data stream to CSV format, and compresses the data to a bzip2 stream.
  3. Write data: The gateway uploads the bzip2 stream to Blob storage via the Internet.

As you can see, the data is processed and moved in a streaming, sequential manner: SQL Server > LAN > gateway > WAN > Blob storage. The overall performance is gated by the minimum throughput across the pipeline.

Data flow

One or more of the following factors might cause the performance bottleneck:

  • :SQL Server 本身由于负载过重而吞吐量低。Source: SQL Server itself has low throughput because of heavy loads.
  • 数据管理网关Data Management Gateway:
    • LAN:网关的位置离 SQL Server 计算机很远,且带宽连接低。LAN: Gateway is located far from the SQL Server machine and has a low-bandwidth connection.
    • 网关:网关已达到其执行以下操作的负载限制:Gateway: Gateway has reached its load limitations to perform the following operations:
      • 序列化:将数据流序列化为 CSV 格式时吞吐量缓慢。Serialization: Serializing the data stream to CSV format has slow throughput.
      • 压缩:选择慢速压缩编解码器(例如,bzip2,其采用 Core i7,速度为 2.8 MBps)。Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps with Core i7).
    • WAN:企业网络和 Azure 服务之间的带宽低(例如,T1 = 1,544 kbps;T2 = 6,312 kbps)。WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1 = 1,544 kbps; T2 = 6,312 kbps).
  • 接收器:Blob 存储吞吐量低。Sink: Blob storage has low throughput. (这种情况不太可能发生,因为其 SLA 保证至少有 60 MBps。)(This scenario is unlikely because its SLA guarantees a minimum of 60 MBps.)

在这种情况下,bzip2 数据压缩可能会拖慢整个管道的速度。In this case, bzip2 data compression might be slowing down the entire pipeline. 切换到 gzip 压缩编解码器可能会缓解此瓶颈。Switching to a gzip compression codec might ease this bottleneck.

Sample scenarios: Use parallel copy

Scenario I: Copy 1,000 1-MB files from an on-premises file system to Blob storage.

Analysis and performance tuning: If, for example, you have installed the gateway on a quad-core machine, Data Factory uses 16 parallel copies to move files from the file system to Blob storage concurrently. This parallel execution should result in high throughput. You can also explicitly specify the parallel copies count. When you copy many small files, parallel copies dramatically help throughput by using resources more effectively.

Scenario 1

Scenario II: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune performance.

Analysis and performance tuning: In this scenario, Data Factory copies the data from Blob storage to Data Lake Store by using a single copy (parallelCopies set to 1) and a single cloud data movement unit. The throughput you observe will be close to that described in the performance reference section.

Scenario 2

Scenario III: Individual file sizes are greater than dozens of MBs and the total volume is large.

Analysis and performance tuning: Increasing parallelCopies doesn't result in better copy performance because of the resource limitations of a single cloud DMU. Instead, specify more cloud DMUs to get more resources to perform the data movement. Do not specify a value for the parallelCopies property; Data Factory handles the parallelism for you. In this case, if you set cloudDataMovementUnits to 4, you get roughly four times the throughput.

Scenario 3

Reference

Here are performance monitoring and tuning references for some of the supported data stores: