Copy activity performance and scalability guide

Whether you want to perform a large-scale data migration from a data lake or an enterprise data warehouse (EDW) to Azure, or you want to ingest data at scale from different sources into Azure for big data analytics, it is critical to achieve optimal performance and scalability. Azure Data Factory provides a performant, resilient, and cost-effective mechanism to ingest data at scale, making it a great fit for data engineers looking to build highly performant and scalable data ingestion pipelines.

After reading this article, you will be able to answer the following questions:

  • What level of performance and scalability can I achieve using the ADF copy activity for data migration and data ingestion scenarios?

  • What steps should I take to tune the performance of the ADF copy activity?

  • What ADF performance optimization knobs can I utilize to optimize the performance of a single copy activity run?

  • What other factors outside of ADF should I consider when optimizing copy performance?

Note

If you aren't familiar with the copy activity in general, see the copy activity overview before you read this article.

Copy performance and scalability achievable using ADF

ADF offers a serverless architecture that allows parallelism at different levels, which lets developers build pipelines that fully utilize your network bandwidth, as well as storage IOPS and bandwidth, to maximize data movement throughput for your environment. This means the throughput you can achieve can be estimated by measuring the minimum throughput offered by the source data store, the destination data store, and the network bandwidth in between. The table below calculates the copy duration based on the data size and the bandwidth limit for your environment.

| Data size / bandwidth | 50 Mbps | 100 Mbps | 500 Mbps | 1 Gbps | 5 Gbps | 10 Gbps | 50 Gbps |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 GB | 2.7 min | 1.4 min | 0.3 min | 0.1 min | 0.03 min | 0.01 min | 0.0 min |
| 10 GB | 27.3 min | 13.7 min | 2.7 min | 1.3 min | 0.3 min | 0.1 min | 0.03 min |
| 100 GB | 4.6 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs | 0.02 hrs | 0.0 hrs |
| 1 TB | 46.6 hrs | 23.3 hrs | 4.7 hrs | 2.3 hrs | 0.5 hrs | 0.2 hrs | 0.05 hrs |
| 10 TB | 19.4 days | 9.7 days | 1.9 days | 0.9 days | 0.2 days | 0.1 days | 0.02 days |
| 100 TB | 194.2 days | 97.1 days | 19.4 days | 9.7 days | 1.9 days | 1 day | 0.2 days |
| 1 PB | 64.7 mo | 32.4 mo | 6.5 mo | 3.2 mo | 0.6 mo | 0.3 mo | 0.06 mo |
| 10 PB | 647.3 mo | 323.6 mo | 64.7 mo | 31.6 mo | 6.5 mo | 3.2 mo | 0.6 mo |
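
As a rough back-of-the-envelope check on how the table values are derived (assuming the link is fully utilized and ignoring protocol overhead), copying 10 TB over a 100 Mbps connection takes approximately:

10 TB ≈ 8 × 10^13 bits; 8 × 10^13 bits ÷ 10^8 bits/s = 8 × 10^5 s ≈ 9.3 days

which is in line with the roughly 9.7 days shown for that cell in the table.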

ADF copy is scalable at different levels:

(Diagram: how ADF copy scales)

  • ADF control flow can start multiple copy activities in parallel, for example using a ForEach loop (see the sketch after this list).
  • A single copy activity can take advantage of scalable compute resources: when using the Azure Integration Runtime, you can specify up to 256 DIUs for each copy activity in a serverless manner; when using the self-hosted Integration Runtime, you can manually scale up the machine or scale out to multiple machines (up to 4 nodes), and a single copy activity will partition its file set across all nodes.
  • A single copy activity reads from and writes to the data store using multiple threads in parallel.
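
As a hedged illustration of the first point, the following minimal sketch shows a ForEach activity that runs the same copy activity over a list of folders in parallel. The pipeline parameter folderList, the activity names, and the source/sink types are assumptions chosen for this example, not values from this article:

"activities":[
    {
        "name": "CopyFoldersInParallel",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@pipeline().parameters.folderList",
                "type": "Expression"
            },
            "isSequential": false,
            "batchCount": 8,
            "activities": [
                {
                    "name": "CopyOneFolder",
                    "type": "Copy",
                    "inputs": [...],
                    "outputs": [...],
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "AzureDataLakeStoreSink" }
                    }
                }
            ]
        }
    }
]

With isSequential set to false, up to batchCount iterations (eight here) run concurrently, each as an independent copy activity run.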

Performance tuning steps

Take these steps to tune the performance of your Azure Data Factory service with the copy activity.

  1. Pick a test dataset and establish a baseline. During the development phase, test your pipeline by using the copy activity against a representative data sample. The dataset you choose should represent your typical data patterns (folder structure, file pattern, data schema, and so on), and should be big enough to evaluate copy performance; for example, it should take 10 minutes or longer for the copy activity to complete. Collect execution details and performance characteristics following copy activity monitoring.

  2. How to maximize the performance of a single copy activity:

    To start with, we recommend that you first maximize performance using a single copy activity.

    If the copy activity is being executed on an Azure Integration Runtime:

    Start with the default values for the Data Integration Units (DIU) and parallel copy settings. Perform a performance test run, and take note of the performance achieved as well as the actual values used for DIUs and parallel copies. Refer to copy activity monitoring on how to collect the run results and performance settings used; an abridged sample of that monitoring output is shown after these steps.

    Now conduct additional performance test runs, doubling the value of the DIU setting each time. Alternatively, if you think the performance achieved using the default setting is far below your expectation, you can increase the DIU setting more drastically in the subsequent test run.

    Copy activity should scale almost perfectly linearly as you increase the DIU setting. If you are not seeing the throughput double when you double the DIU setting, two things could be happening:

    • The specific copy pattern you are running does not benefit from adding more DIUs. Even though you specified a larger DIU value, the actual DIUs used remained the same, and therefore you are getting the same throughput as before. If this is the case, maximize aggregate throughput by running multiple copies concurrently, as described in step 3.
    • By adding more DIUs (more horsepower) and thereby driving a higher rate of data extraction, transfer, and loading, either the source data store, the network in between, or the destination data store has reached its bottleneck and is possibly being throttled. If this is the case, try contacting your data store administrator or your network administrator to raise the upper limit, or alternatively, reduce the DIU setting until throttling stops occurring.

    If the copy activity is being executed on a self-hosted Integration Runtime:

    We recommend that you use a dedicated machine, separate from the server hosting the data store, to host the integration runtime.

    Start with the default values for the parallel copy setting and use a single node for the self-hosted IR. Perform a performance test run and take note of the performance achieved.

    If you would like to achieve higher throughput, you can either scale up or scale out the self-hosted IR:

    • If the CPU and available memory on the self-hosted IR node are not fully utilized, but the execution of concurrent jobs is reaching the limit, you should scale up by increasing the number of concurrent jobs that can run on a node. See here for instructions.
    • If, on the other hand, the CPU is high on the self-hosted IR node or available memory is low, you can add a new node to help scale out the load across multiple nodes. See here for instructions.

    As you scale up or scale out the capacity of the self-hosted IR, repeat the performance test run to see if you are getting increasingly better throughput. If throughput stops improving, most likely either the source data store, the network in between, or the destination data store has reached its bottleneck and is starting to get throttled. If this is the case, try contacting your data store administrator or your network administrator to raise the upper limit, or alternatively, go back to your previous scaling setting for the self-hosted IR.

  3. How to maximize aggregate throughput by running multiple copies concurrently:

    Now that you have maximized the performance of a single copy activity, if you have not yet reached the throughput upper limits of your environment (network, source data store, and destination data store), you can run multiple copy activities in parallel using ADF control flow constructs such as the ForEach loop, as sketched earlier in this article.

  4. Performance tuning tips and optimization features. In some cases, when you run a copy activity in Azure Data Factory, you see a "Performance tuning tips" message on top of the copy activity monitoring, as shown in the following example. The message tells you the bottleneck that was identified for the given copy run. It also guides you on what to change to boost copy throughput. The performance tuning tips currently provide suggestions such as:

    • Use PolyBase when you copy data into Azure SQL Data Warehouse.
    • Increase Azure Cosmos DB Request Units or Azure SQL Database DTUs (Database Throughput Units) when the resource on the data store side is the bottleneck.
    • Remove the unnecessary staged copy.

    The performance tuning rules will be gradually enriched as well.

    Example: Copy into Azure SQL Database with performance tuning tips

    In this sample, during a copy run, Azure Data Factory notices that the sink Azure SQL Database reaches high DTU utilization, which slows down the write operations. The suggestion is to increase the Azure SQL Database tier with more DTUs.

    (Screenshot: copy monitoring and performance tuning tips)

    In addition, there are some performance optimization features you should be aware of; these are described in the "Copy performance optimization features" section later in this article.

  5. Expand the configuration to your entire dataset. When you're satisfied with the execution results and performance, you can expand the definition and pipeline to cover your entire dataset.
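
For reference during steps 1 and 2 above, here is a hedged, abridged sketch of the kind of copy activity monitoring output you collect after a test run. The field names follow the copy activity output described in Copy activity monitoring, and all values shown are purely illustrative:

"output": {
    "dataRead": 107374182400,
    "dataWritten": 107374182400,
    "filesRead": 10000,
    "filesWritten": 10000,
    "copyDuration": 1380,
    "throughput": 75983,
    "usedDataIntegrationUnits": 32,
    "usedParallelCopies": 64
}

In particular, usedDataIntegrationUnits and usedParallelCopies tell you the actual DIU and parallel copy values applied to the run, while throughput (in KBps) together with copyDuration (in seconds) gives you the baseline to compare subsequent test runs against.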

Copy performance optimization features

Azure Data Factory provides the following performance optimization features:

Data Integration Units

A Data Integration Unit is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Azure Data Factory. A Data Integration Unit only applies to the Azure integration runtime, not to the self-hosted integration runtime.

You will be charged # of used DIUs * copy duration * unit price/DIU-hour. See the current prices here. Local currency and separate discounting may apply per subscription type.

The allowed DIUs to empower a copy activity run range between 2 and 256. If not specified, or if you choose "Auto" on the UI, Data Factory dynamically applies the optimal DIU setting based on your source-sink pair and data pattern. The following table lists the default DIUs used in different copy scenarios:

| Copy scenario | Default DIUs determined by service |
| --- | --- |
| Copy data between file-based stores | Between 4 and 32, depending on the number and size of the files |
| Copy data to Azure SQL Database or Azure Cosmos DB | Between 4 and 16, depending on the sink Azure SQL Database's or Cosmos DB's tier (number of DTUs/RUs) |
| All other copy scenarios | 4 |

To override this default, specify a value for the dataIntegrationUnits property as follows. The actual number of DIUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern.

You can see the DIUs used for each copy run in the copy activity output when you monitor an activity run. For more information, see Copy activity monitoring.

Note

A setting of more than four DIUs currently applies only when you copy multiple files from Azure Storage, Azure Data Lake Storage, Amazon S3, Google Cloud Storage, cloud FTP, or cloud SFTP to any other cloud data stores.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "dataIntegrationUnits": 32
        }
    }
]

Parallel copy

You can use the parallelCopies property to indicate the parallelism that you want the copy activity to use. You can think of this property as the maximum number of threads within the copy activity that can read from your source or write to your sink data stores in parallel.

For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you use.

| Copy scenario | Default parallel copy count determined by service |
| --- | --- |
| Copy data between file-based stores | Depends on the size of the files and the number of DIUs used to copy data between two cloud data stores, or the physical configuration of the self-hosted integration runtime machine. |
| Copy from a relational data store with the partition option enabled (including Oracle, Netezza, Teradata, SAP Table, and SAP Open Hub) | 4 |
| Copy data from any source store to Azure Table storage | 4 |
| All other copy scenarios | 1 |

Tip

When you copy data between file-based stores, the default behavior usually gives you the best throughput. The default behavior is auto-determined based on your source file pattern.

To control the load on machines that host your data stores, or to tune copy performance, you can override the default value and specify a value for the parallelCopies property. The value must be an integer greater than or equal to 1. At run time, for the best performance, the copy activity uses a value that is less than or equal to the value that you set.

Points to note:

  • When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. The chunking within a single file happens underneath automatically and transparently. It's designed to use the best suitable chunk size for a given source data store type to load data in parallel, and it is orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, the copy activity can't take advantage of file-level parallelism.
  • When you copy data from stores that are not file-based (except for the Oracle, Netezza, Teradata, SAP Table, and SAP Open Hub connectors as source with data partitioning enabled) to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.
  • The parallelCopies property is orthogonal to dataIntegrationUnits. The former is counted across all the Data Integration Units.
  • When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores. Also consider the load increase to the self-hosted integration runtime if the copy activity is empowered by it, for example, for a hybrid copy. This load increase happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the self-hosted integration runtime is overwhelmed with the load, decrease the parallelCopies value to relieve the load.

Example:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "parallelCopies": 32
        }
    }
]

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  • You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. The source data must be in Blob storage or Azure Data Lake Store, and it must meet additional criteria. When you load data from a data store other than Blob storage or Azure Data Lake Store, you can activate data copying via an interim staging Blob storage. In that case, Azure Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load the data into SQL Data Warehouse efficiently. For more information, see Use PolyBase to load data into Azure SQL Data Warehouse.
  • Sometimes it takes a while to perform a hybrid data movement (that is, to copy from an on-premises data store to a cloud data store) over a slow network connection. To improve performance, you can use staged copy to compress the data on-premises so that it takes less time to move the data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
  • You don't want to open ports other than port 80 and port 443 in your firewall because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, staged copy can take advantage of the self-hosted integration runtime to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then it can load the data into SQL Database or SQL Data Warehouse from the Blob storage staging area. In this flow, you don't need to enable port 1433.

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging Blob storage (bring your own). Next, the data is copied from the staging data store to the sink data store. Azure Data Factory automatically manages the two-stage flow for you. Azure Data Factory also cleans up the temporary data from the staging storage after the data movement is complete.

(Diagram: staged copy)

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before you move it from the source data store to the interim or staging data store, and then decompressed before you move it from the interim or staging data store to the sink data store.

Currently, you can't copy data between two data stores that are connected via different self-hosted IRs, with or without staged copy. For such a scenario, you can configure two explicitly chained copy activities to copy from the source to staging and then from staging to the sink, as sketched below.
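
The following minimal sketch shows what such chaining could look like: two copy activities in one pipeline, where the second activity depends on the successful completion of the first. The activity names, the staging location, and the source/sink types are hypothetical placeholders chosen for this illustration, not values from this article:

"activities":[
    {
        "name": "Copy source to staging",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "BlobSink" }
        }
    },
    {
        "name": "Copy staging to sink",
        "type": "Copy",
        "dependsOn": [
            {
                "activity": "Copy source to staging",
                "dependencyConditions": [ "Succeeded" ]
            }
        ],
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "SqlSink" }
        }
    }
]

In this pattern, the first copy runs on the self-hosted IR that can reach the source, the second runs on the IR that can reach the sink, and the intermediate Blob storage must be reachable from both.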

Configuration

Configure the enableStaging setting in the copy activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the following table. You also need to create an Azure Storage or Storage shared access signature linked service for staging if you don't have one.

| Property | Description | Default value | Required |
| --- | --- | --- | --- |
| enableStaging | Specify whether you want to copy data via an interim staging store. | False | No |
| linkedServiceName | Specify the name of an AzureStorage linked service, which refers to the instance of Storage that you use as an interim staging store. You can't use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase. You can use it in all other scenarios. | N/A | Yes, when enableStaging is set to TRUE |
| path | Specify the Blob storage path that you want to contain the staged data. If you don't provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. | N/A | No |
| enableCompression | Specifies whether data should be compressed before it's copied to the destination. This setting reduces the volume of data being transferred. | False | No |

Note

If you use staged copy with compression enabled, the service principal or MSI authentication for the staging blob linked service isn't supported.

Here's a sample definition of a copy activity with the properties that are described in the preceding table:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [...],
        "outputs": [...],
        "typeProperties": {
            "source": {
                "type": "SqlSource",
            },
            "sink": {
                "type": "SqlSink"
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "MyStagingBlob",
                    "type": "LinkedServiceReference"
                },
                "path": "stagingcontainer/path",
                "enableCompression": true
            }
        }
    }
]

Staged copy billing impact

You're charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy, which is copying data from a cloud data store to another cloud data store, with both stages empowered by the Azure integration runtime, you're charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy, which is copying data from an on-premises data store to a cloud data store, with one stage empowered by a self-hosted integration runtime, you're charged [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price]. A worked illustration appears after this list.
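
As a purely illustrative example of the hybrid formula above (the unit prices are hypothetical placeholders, not actual ADF prices; see the pricing page for real rates): if the on-premises-to-staging stage takes 2 hours and the staging-to-sink cloud stage takes 1 hour, then with a hybrid copy unit price of $0.10/hour and a cloud copy unit price of $0.25/hour, the run would be billed roughly

2 hours x $0.10/hour + 1 hour x $0.25/hour = $0.45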

References

Here are performance monitoring and tuning references for some of the supported data stores:

Next steps

See the other copy activity articles: