Copy activity performance and scalability guide
Azure 数据工厂
Azure Synapse Analytics
Sometimes you want to perform a large-scale data migration from a data lake or enterprise data warehouse (EDW) to Azure. Other times you want to ingest large amounts of data from different sources into Azure for big data analytics. In each case, it is critical to achieve optimal performance and scalability.

Azure Data Factory (ADF) provides a mechanism to ingest data. ADF has the following advantages:

- Handles large amounts of data
- Is highly performant
- Is cost-effective

These advantages make ADF an excellent fit for data engineers who want to build scalable, highly performant data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

- What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?
- What steps should I take to tune the performance of the ADF copy activity?
- What ADF performance optimization knobs can I utilize to optimize performance for a single copy activity run?
- What other factors outside ADF should I consider when optimizing copy performance?
Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.
Copy performance and scalability achievable using ADF
ADF offers a serverless architecture that allows parallelism at different levels. This architecture allows you to develop pipelines that maximize data movement throughput for your environment. These pipelines fully utilize the following resources:

- Network bandwidth
- Storage input/output operations per second (IOPS) and bandwidth

This full utilization means you can estimate the overall throughput by measuring the minimum throughput available from the following resources:

- Source data store
- Destination data store
- Network bandwidth between the source and destination data stores
The table below calculates the copy duration based on data size and the bandwidth limit for your environment.
| Data size / bandwidth | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps |
|---|---|---|---|---|---|---|---|
| 1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min |
| 10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min |
| 100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs |
| 1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs |
| 10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days |
| 100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days |
| 1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo |
| 10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo |
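The figures in the table can be reproduced with a simple back-of-the-envelope calculation. The sketch below assumes binary unit prefixes (1 GB = 2^30 bytes, 1 Mbps = 2^20 bits/s), which matches how the table above appears to be computed; the function name is ours, not part of ADF:

```python
def copy_duration_hours(data_size_gb: float, bandwidth_mbps: float) -> float:
    """Estimate copy duration in hours, assuming the link is fully utilized.

    data_size_gb   -- data size in gigabytes (binary: 1 GB = 2^30 bytes)
    bandwidth_mbps -- available bandwidth in megabits/s (binary: 1 Mb = 2^20 bits)
    """
    size_megabits = data_size_gb * 8 * 1024  # GB -> megabits
    seconds = size_megabits / bandwidth_mbps
    return seconds / 3600

# 1 TB (1024 GB) over a fully utilized 50 Mbps link:
print(round(copy_duration_hours(1024, 50), 1))  # → 46.6, matching the table
```

Real-world copies rarely sustain the full link rate, so treat these numbers as a lower bound on duration.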
ADF copy is scalable at different levels:

ADF control flow can start multiple copy activities in parallel, for example by using a ForEach loop.

A single copy activity can take advantage of scalable compute resources.

- When using the Azure integration runtime (IR), you can specify up to 256 data integration units (DIUs) for each copy activity, in a serverless manner.
- When using the self-hosted IR, you can take either of the following approaches:
  - Manually scale up the machine.
  - Scale out to multiple machines (up to 4 nodes); a single copy activity will partition its file set across all nodes.

A single copy activity reads from and writes to the data store using multiple threads in parallel.
Performance tuning steps

Take the following steps to tune the performance of your Azure Data Factory service with the copy activity:

Pick a test dataset and establish a baseline.

During development, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns along the following attributes:

- Folder structure
- File pattern
- Data schema

Your dataset should also be big enough to evaluate copy performance; a good size takes at least 10 minutes for the copy activity to complete. Collect execution details and performance characteristics following copy activity monitoring.
How to maximize the performance of a single copy activity:

We recommend that you first maximize performance using a single copy activity.

If the copy activity is being executed on an Azure integration runtime: start with the default values for the Data Integration Units (DIU) and parallel copy settings.

If the copy activity is being executed on a self-hosted integration runtime: we recommend that you use a dedicated machine to host the IR, separate from the server hosting the data store. Start with the default value for the parallel copy setting and use a single node for the self-hosted IR.

Conduct a performance test run and take note of the performance achieved, including the actual values used, such as DIUs and parallel copies. Refer to copy activity monitoring for how to collect run results and the performance settings used. Learn how to troubleshoot copy activity performance to identify and resolve bottlenecks.

Iterate with additional performance test runs, following the troubleshooting and tuning guidance. Once single copy activity runs can no longer achieve better throughput, consider whether to maximize aggregate throughput by running multiple copies concurrently. This option is discussed in the next bullet.
How to maximize aggregate throughput by running multiple copies concurrently:

By now you have maximized the performance of a single copy activity. If you have not yet reached the throughput upper limits of your environment, you can run multiple copy activities in parallel by using ADF control flow constructs, such as the ForEach loop. For more information, see the following articles about solution templates:
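As a sketch of that pattern, a ForEach activity with `isSequential` set to `false` runs its inner copy activity for up to `batchCount` items at a time. The pipeline parameter name `partitionList` and the activity names here are hypothetical placeholders, and the inner Copy activity's source, sink, inputs, and outputs are omitted for brevity:

```json
{
  "name": "CopyPartitionsInParallel",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 20,
    "items": {
      "value": "@pipeline().parameters.partitionList",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "CopyOnePartition",
        "type": "Copy"
      }
    ]
  }
}
```

Each iteration copies one partition (for example, one folder or one table), so aggregate throughput scales with `batchCount` until a source, sink, or network limit is reached.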
Expand the configuration to your entire dataset.

When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
Troubleshoot copy activity performance

Follow the performance tuning steps to plan and conduct a performance test for your scenario. And learn how to troubleshoot each copy activity run's performance issues in Azure Data Factory from Troubleshoot copy activity performance.
Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

- Data Integration Units
- Self-hosted integration runtime scalability
- Parallel copy
- Staged copy
Data Integration Units

A Data Integration Unit (DIU) is a measure that represents the power of a single unit in Azure Data Factory, where power is a combination of CPU, memory, and network resource allocation. DIU applies only to the Azure integration runtime, not to the self-hosted integration runtime. Learn more here.
Self-hosted integration runtime scalability

You might want to host an increasing concurrent workload, or you might want to achieve higher performance at your present workload level. You can enhance the scale of processing with the following approaches:

- You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node. Scaling up works only if the processor and memory of the node are less than fully utilized.
- You can scale out the self-hosted IR by adding more nodes (machines).

For more information, see:

- Copy activity performance optimization features: Self-hosted integration runtime scalability
- Create and configure a self-hosted integration runtime: Scale considerations
Parallel copy

You can set the parallelCopies property to indicate the parallelism you want the copy activity to use. Think of this property as the maximum number of threads within the copy activity that read from your source or write to your sink data stores in parallel. Learn more.
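For illustration, both knobs described above can be set in the copy activity's typeProperties. This is a minimal sketch, not a complete pipeline: the activity and dataset names are hypothetical, and the source/sink types shown are just one possible combination:

```json
{
  "name": "CopyBlobToSql",
  "type": "Copy",
  "inputs": [ { "referenceName": "SourceBlobDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SinkSqlDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlSink" },
    "dataIntegrationUnits": 32,
    "parallelCopies": 16
  }
}
```

As the tuning steps above recommend, omit both properties at first so the service picks defaults, then raise them only if a test run shows the copy is not bound by the source, sink, or network.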
Staged copy

A data copy operation can send the data directly to the sink data store. Alternatively, you can choose to use Blob storage as an interim staging store. Learn more.
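As a sketch, staging is enabled in the copy activity's typeProperties by setting enableStaging and pointing stagingSettings at a Blob storage linked service. The linked service name and staging path here are hypothetical:

```json
{
  "typeProperties": {
    "source": { "type": "SqlSource" },
    "sink": { "type": "SqlDWSink" },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "MyStagingBlobStorage",
        "type": "LinkedServiceReference"
      },
      "path": "stagingcontainer/path"
    }
  }
}
```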
Next steps

See the other copy activity articles:

- Copy activity overview
- Troubleshoot copy activity performance
- Copy activity performance optimization features
- Use Azure Data Factory to migrate data from your data lake or data warehouse to Azure
- Migrate data from Amazon S3 to Azure Storage