Copy activity performance and scalability guide

Applies to: Azure Data Factory, Azure Synapse Analytics

Sometimes you want to perform a large-scale data migration from a data lake or an enterprise data warehouse (EDW) to Azure. Other times you want to ingest large amounts of data from different sources into Azure for big data analytics. In each case, it is critical to achieve optimal performance and scalability.

Azure Data Factory (ADF) provides a mechanism to ingest data. ADF has the following advantages:

  • Handles large amounts of data
  • Is highly performant
  • Is cost-effective

These advantages make ADF an excellent fit for data engineers who want to build scalable, high-performance data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

  • What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?
  • What steps should I take to tune the performance of the ADF copy activity?
  • What ADF performance optimization settings can I use to optimize a single copy activity run?
  • What factors outside ADF should I consider when optimizing copy performance?

Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF

ADF offers a serverless architecture that allows parallelism at different levels.

This architecture allows you to develop pipelines that maximize data movement throughput for your environment. These pipelines fully utilize the following resources:

  • Network bandwidth
  • Storage input/output operations per second (IOPS) and bandwidth

Because these resources are fully utilized, you can estimate the overall throughput as the minimum throughput available across the following resources:

  • Source data store
  • Destination data store
  • Network bandwidth between the source and destination data stores
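
The bottleneck estimate described above can be sketched in a few lines of Python. The function name and the sample figures are hypothetical and serve only as an illustration. Substitute throughput values measured in your own environment.

```python
def estimate_overall_throughput_mbps(source_mbps: float,
                                     sink_mbps: float,
                                     network_mbps: float) -> float:
    """Return the bottleneck throughput: the slowest of the three resources."""
    return min(source_mbps, sink_mbps, network_mbps)


# Hypothetical measurements, in megabits per second.
estimate = estimate_overall_throughput_mbps(
    source_mbps=800.0, sink_mbps=1200.0, network_mbps=500.0)
print(f"Estimated overall throughput: {estimate} Mbps")
```

In this made-up case the network is the slowest resource, so it caps the estimate regardless of how fast the source and destination stores are.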

The table below shows the estimated copy duration. The duration is based on the data size and the bandwidth limit of your environment.

 

Data size / bandwidth | 50 Mbps    | 100 Mbps   | 500 Mbps  | 1 Gbps   | 5 Gbps   | 10 Gbps  | 50 Gbps
1 GB                  | 2.7 min    | 1.4 min    | 0.3 min   | 0.1 min  | 0.03 min | 0.01 min | 0.0 min
10 GB                 | 27.3 min   | 13.7 min   | 2.7 min   | 1.3 min  | 0.3 min  | 0.1 min  | 0.03 min
100 GB                | 4.6 hrs    | 2.3 hrs    | 0.5 hrs   | 0.2 hrs  | 0.05 hrs | 0.02 hrs | 0.0 hrs
1 TB                  | 46.6 hrs   | 23.3 hrs   | 4.7 hrs   | 2.3 hrs  | 0.5 hrs  | 0.2 hrs  | 0.05 hrs
10 TB                 | 19.4 days  | 9.7 days   | 1.9 days  | 0.9 days | 0.2 days | 0.1 days | 0.02 days
100 TB                | 194.2 days | 97.1 days  | 19.4 days | 9.7 days | 1.9 days | 1 day    | 0.2 days
1 PB                  | 64.7 mo    | 32.4 mo    | 6.5 mo    | 3.2 mo   | 0.6 mo   | 0.3 mo   | 0.06 mo
10 PB                 | 647.3 mo   | 323.6 mo   | 64.7 mo   | 31.6 mo  | 6.5 mo   | 3.2 mo   | 0.6 mo
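
The arithmetic behind such a table can be sketched as follows. This is an illustration under an assumption, not the method used to produce the table: it treats 1 GB as 1024 MB and 1 MB as 8 megabits. That convention reproduces the 50 Mbps column, but the article does not state its unit conventions, so other cells may differ slightly. The function name is hypothetical.

```python
def copy_duration_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate copy time for size_gb gigabytes over a bandwidth_mbps link."""
    size_megabits = size_gb * 1024 * 8  # GB -> MB -> megabits (assumed units)
    return size_megabits / bandwidth_mbps


one_gb_minutes = copy_duration_seconds(1, 50) / 60
one_tb_hours = copy_duration_seconds(1024, 50) / 3600
print(f"1 GB at 50 Mbps: {one_gb_minutes:.1f} min")
print(f"1 TB at 50 Mbps: {one_tb_hours:.1f} hrs")
```

The estimate assumes the link is fully used for the whole copy. Real runs also include queueing and per-file overhead, so treat the result as a lower bound.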

ADF copy is scalable at different levels:

[Figure: How ADF copy scales]

  • ADF control flow can start multiple copy activities in parallel, for example by using a For Each loop.

  • A single copy activity can take advantage of scalable compute resources.

    • When using the Azure integration runtime (IR), you can specify up to 256 data integration units (DIUs) for each copy activity, in a serverless manner.
    • When using a self-hosted IR, you can take either of the following approaches:
      • Manually scale up the machine.
      • Scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads in parallel.
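
As a rough illustration of the control-flow level, the sketch below builds a pipeline definition in which a ForEach activity runs one copy activity per item in parallel. All names (pipeline, activity, dataset, and parameter names) and the `batchCount` value are hypothetical. The property names follow the ADF pipeline JSON format as commonly documented, so verify them against the current pipeline reference before use.

```python
import json

# Inner copy activity; dataset names are placeholders.
copy_activity = {
    "name": "CopyOnePartition",
    "type": "Copy",
    "inputs": [{"referenceName": "SourceDataset", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "SinkDataset", "type": "DatasetReference"}],
    "typeProperties": {
        "source": {"type": "BinarySource"},
        "sink": {"type": "BinarySink"},
    },
}

# ForEach activity that fans out over a pipeline parameter.
for_each = {
    "name": "CopyPartitionsInParallel",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": False,   # run iterations in parallel
        "batchCount": 4,         # assumed cap on concurrent iterations
        "items": {
            "value": "@pipeline().parameters.partitions",
            "type": "Expression",
        },
        "activities": [copy_activity],
    },
}

pipeline = {
    "name": "ParallelCopyPipeline",
    "properties": {
        "parameters": {"partitions": {"type": "Array"}},
        "activities": [for_each],
    },
}

print(json.dumps(pipeline, indent=2))
```

Each iteration still benefits from the other two levels: the compute assigned to the copy activity and the threads it uses internally.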

Performance tuning steps

Take the following steps to tune the performance of your Azure Data Factory service with the copy activity:

  1. Pick a test dataset and establish a baseline.

    During development, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns along the following attributes:

    • Folder structure
    • File pattern
    • Data schema

    Your dataset should also be big enough to evaluate copy performance. A good size is one that takes at least 10 minutes for the copy activity to complete. Collect execution details and performance characteristics by following copy activity monitoring.

  2. How to maximize performance of a single copy activity:

    We recommend that you first maximize performance using a single copy activity.

    • If the copy activity is being executed on an Azure integration runtime:

      Start with default values for the Data Integration Units (DIU) and parallel copy settings.

    • If the copy activity is being executed on a self-hosted integration runtime:

      We recommend that you use a dedicated machine to host the IR. The machine should be separate from the server hosting the data store. Start with default values for the parallel copy setting and use a single node for the self-hosted IR.

    Conduct a performance test run. Take note of the performance achieved. Include the actual values used, such as DIUs and parallel copies. Refer to copy activity monitoring for how to collect run results and the performance settings used. Learn how to troubleshoot copy activity performance to identify and resolve the bottleneck.

    Iterate to conduct additional performance test runs following the troubleshooting and tuning guidance. Once single copy activity runs cannot achieve better throughput, consider whether to maximize aggregate throughput by running multiple copies concurrently. This option is discussed in the next numbered step.

  3. How to maximize aggregate throughput by running multiple copies concurrently:

    By now you have maximized the performance of a single copy activity. If you have not yet reached the throughput upper limits of your environment, you can run multiple copy activities in parallel. You can run in parallel by using ADF control flow constructs. One such construct is the For Each loop. For more information, see the following articles about solution templates:

  4. Expand the configuration to your entire dataset.

    When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
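
The baseline and iteration steps above can be tracked with a small helper like the one below. It is only a sketch. The function names, the sample runs, and the 5% improvement threshold are assumptions made for illustration. The 10-minute minimum comes from step 1.

```python
MIN_BASELINE_SECONDS = 10 * 60  # step 1 suggests runs of at least 10 minutes


def throughput_mb_per_s(data_mb: float, duration_s: float) -> float:
    """Throughput of one test run in megabytes per second."""
    return data_mb / duration_s


def is_valid_baseline(duration_s: float) -> bool:
    """A run is long enough to evaluate if it meets the suggested minimum."""
    return duration_s >= MIN_BASELINE_SECONDS


def has_plateaued(history: list[float], min_gain: float = 0.05) -> bool:
    """True when the latest run beat the previous best by less than min_gain."""
    if len(history) < 2:
        return False
    previous_best = max(history[:-1])
    return history[-1] < previous_best * (1 + min_gain)


# Hypothetical test runs: (data copied in MB, duration in seconds).
runs = [(60_000, 900), (60_000, 700), (60_000, 690)]
history = [throughput_mb_per_s(mb, s) for mb, s in runs]

for (mb, seconds), rate in zip(runs, history):
    print(f"{mb} MB in {seconds} s -> {rate:.1f} MB/s, "
          f"valid baseline: {is_valid_baseline(seconds)}")
print("Plateau reached:", has_plateaued(history))
```

When the helper reports a plateau, that is the cue from step 2 to consider running multiple copy activities concurrently, as described in step 3.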

Troubleshoot copy activity performance

Follow the performance tuning steps to plan and conduct performance tests for your scenario. To learn how to troubleshoot the performance of an individual copy activity run in Azure Data Factory, see Troubleshoot copy activity performance.

Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

Data Integration Units

A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory. Power is a combination of CPU, memory, and network resource allocation. DIUs apply only to the Azure integration runtime. They do not apply to the self-hosted integration runtime. Learn more here.
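
A minimal sketch of setting DIUs on a copy activity follows. It assumes the setting is the `dataIntegrationUnits` property under the activity's `typeProperties`, which matches the ADF copy activity JSON as commonly documented. The helper name and the source and sink types are hypothetical, and the upper bound of 256 is the limit quoted earlier in this guide.

```python
import json

MAX_DIUS = 256  # per-copy-activity limit quoted earlier in this guide


def set_dius(type_properties: dict, dius: int) -> dict:
    """Return a copy of the copy activity typeProperties with DIUs set."""
    if not 0 < dius <= MAX_DIUS:
        raise ValueError(f"DIUs must be between 1 and {MAX_DIUS}, got {dius}")
    return {**type_properties, "dataIntegrationUnits": dius}


# Placeholder source and sink types for illustration.
base = {"source": {"type": "BinarySource"}, "sink": {"type": "BinarySink"}}
tuned = set_dius(base, 32)
print(json.dumps(tuned, indent=2))
```

Leaving the property out keeps the service default, which is the recommended starting point for a baseline run.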

Self-hosted integration runtime scalability

You might want to host an increasing concurrent workload. Or you might want to achieve higher performance at your present workload level. You can enhance the scale of processing with the following approaches:

  • You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node.
    Scaling up works only if the processor and memory of the node are less than fully utilized.
  • You can scale out the self-hosted IR by adding more nodes (machines).
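
The choice between the two approaches can be expressed as a simple rule of thumb. The sketch below is illustrative only. The function name and the 80% utilization threshold are assumptions, and the 4-node limit is the figure quoted earlier in this guide.

```python
def suggest_scaling(cpu_utilization: float,
                    memory_utilization: float,
                    node_count: int,
                    busy_threshold: float = 0.8,
                    max_nodes: int = 4) -> str:
    """Suggest how to scale a self-hosted IR from node utilization (0.0-1.0)."""
    has_headroom = (cpu_utilization < busy_threshold
                    and memory_utilization < busy_threshold)
    if has_headroom:
        # The node is not fully utilized: allow more concurrent jobs on it.
        return "scale up"
    if node_count < max_nodes:
        # The node is saturated: add another machine.
        return "scale out"
    return "at node limit"


print(suggest_scaling(0.40, 0.50, node_count=1))
print(suggest_scaling(0.95, 0.50, node_count=1))
print(suggest_scaling(0.95, 0.90, node_count=4))
```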

For more information, see:

Parallel copy

You can set the parallelCopies property to indicate the degree of parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity. The threads operate in parallel, and each one either reads from your source or writes to your sink data store. Learn more.
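
The "maximum number of threads" idea can be illustrated with a thread pool whose size plays the role of parallelCopies. This is an analogy rather than a description of ADF internals. `copy_chunk` and `copy_all` are hypothetical stand-ins, and the placement of `parallelCopies` under the copy activity's `typeProperties` should be checked against the copy activity reference.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_chunk(chunk: str) -> int:
    """Stand-in for reading one chunk from the source and writing it to the sink."""
    return len(chunk)  # pretend the return value is the number of bytes copied


def copy_all(chunks: list[str], parallel_copies: int) -> int:
    """Copy all chunks using at most parallel_copies worker threads."""
    with ThreadPoolExecutor(max_workers=parallel_copies) as pool:
        return sum(pool.map(copy_chunk, chunks))


total = copy_all(["ab", "cde", "f"], parallel_copies=2)
print("Bytes copied:", total)

# How the setting might appear in a copy activity definition (placeholder types).
type_properties = {
    "source": {"type": "BinarySource"},
    "sink": {"type": "BinarySink"},
    "parallelCopies": 8,
}
```

As with the pool size here, the property is an upper bound: fewer threads may be used when there is less work to share out.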

Staged copy

A data copy operation can send the data directly to the sink data store. Alternatively, you can choose to use Blob storage as an interim staging store. Learn more.
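
A hedged sketch of turning on staging for a copy activity is shown below. It assumes the `enableStaging` and `stagingSettings` properties of the copy activity JSON as commonly documented. The helper name, the linked service name, and the staging path are placeholders.

```python
import json


def enable_staging(type_properties: dict,
                   linked_service: str,
                   path: str) -> dict:
    """Return a copy of typeProperties that routes data through a staging store."""
    return {
        **type_properties,
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": linked_service,  # Blob storage linked service
                "type": "LinkedServiceReference",
            },
            "path": path,  # container and folder used for interim data
        },
    }


direct = {"source": {"type": "BinarySource"}, "sink": {"type": "BinarySink"}}
staged = enable_staging(direct, "StagingBlobStorage", "staging/interim")
print(json.dumps(staged, indent=2))
```

Without these two properties the copy goes directly from source to sink, which is the first option described above.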

Next steps

See the other copy activity articles: