Copy Activity performance and tuning guide

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Copy activity performance and tuning guide for Data Factory.

Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Blazing-fast data loading performance is key to ensure you can focus on the core "big data" problem: building advanced analytics solutions and getting deep insights from all that data.

Azure provides a set of enterprise-grade data storage and data warehouse solutions, and Copy Activity offers a highly optimized data loading experience that is easy to configure and set up. With just a single copy activity, you can achieve high-throughput data loading into these stores.

This article describes the copy performance reference, the key factors that affect the performance of data movement through Copy Activity, and the steps you can take to tune it.

Note

If you are not familiar with Copy Activity in general, see Move data by using Copy Activity before reading this article.

Performance reference

As a reference, the table below shows the copy throughput number in MBps for the given source and sink pairs, based on in-house testing. For comparison, it also demonstrates how different settings of cloud data movement units or Data Management Gateway scalability (multiple gateway nodes) can help copy performance.

Performance matrix

Important

In Azure Data Factory version 1, the minimum number of cloud data movement units for a cloud-to-cloud copy is two. If the value is not specified, see the default data movement units listed in the Cloud data movement units section.

Points to note:

  • Throughput is calculated by using the following formula: [size of data read from source] / [Copy Activity run duration].
  • The performance reference numbers in the table were measured using a TPC-H data set in a single Copy Activity run.
  • In Azure data stores, the source and sink are in the same Azure region.
  • For hybrid copy between on-premises and cloud data stores, each gateway node was running on a machine separate from the on-premises data store, with the specification below. When a single activity was running on the gateway, the copy operation consumed only a small portion of the test machine's CPU, memory, or network bandwidth. Learn more from Considerations for Data Management Gateway.
    CPU: 32 cores, 2.20 GHz Intel Xeon E5-2660 v2
    Memory: 128 GB
    Network: Internet interface 10 Gbps; intranet interface 40 Gbps

Tip

You can achieve higher throughput by using more data movement units (DMUs) than the default maximum, which is 32 for a cloud-to-cloud copy activity run. For example, with 100 DMUs, you can copy data from Azure Blob storage into Azure Data Lake Store at 1.0 GBps. See the Cloud data movement units section for details about this feature and the supported scenario. Contact Azure support to request more DMUs.

Parallel copy

You can read data from the source or write data to the destination in parallel within a Copy Activity run. This feature enhances the throughput of a copy operation and reduces the time it takes to move data.

This setting is different from the concurrency property in the activity definition. The concurrency property determines the number of concurrent Copy Activity runs to process data from different activity windows (1 AM to 2 AM, 2 AM to 3 AM, 3 AM to 4 AM, and so on). This capability is helpful when you perform a historical load. The parallel copy capability applies to a single activity run.
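As a rough sketch of the difference, concurrency is set in the activity's policy section, whereas the parallel copy setting discussed later (parallelCopies) lives in typeProperties. The dataset names below are placeholders:

"activities":[
    {
        "name": "Sample copy activity",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": { "type": "AzureDataLakeStoreSink" }
        },
        "policy": {
            "concurrency": 2
        }
    }
]

With concurrency set to 2, up to two activity windows can be processed by concurrent activity runs.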

Let's look at a sample scenario. In the following example, multiple slices from the past need to be processed. Data Factory runs an instance of Copy Activity (an activity run) for each slice:

  • The data slice from the first activity window (1 AM to 2 AM) ==> Activity run 1
  • The data slice from the second activity window (2 AM to 3 AM) ==> Activity run 2
  • The data slice from the third activity window (3 AM to 4 AM) ==> Activity run 3

And so on.

In this example, when the concurrency value is set to 2, Activity run 1 and Activity run 2 copy data from two activity windows concurrently to improve data movement performance. However, if multiple files are associated with Activity run 1, the data movement service copies files from the source to the destination one file at a time.

Cloud data movement units

A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. A DMU applies to cloud-to-cloud copy operations, but not to a hybrid copy.

The minimum number of cloud data movement units to empower a Copy Activity run is two. If not specified, the following table lists the default DMUs used in different copy scenarios:

Copy scenario | Default DMUs determined by service
Copy data between file-based stores | Between 4 and 16, depending on the number and size of the files.
All other copy scenarios | 4

To override this default, specify a value for the cloudDataMovementUnits property as follows. The allowed values for the cloudDataMovementUnits property are 2, 4, 8, 16, and 32. The actual number of cloud DMUs that the copy operation uses at run time is equal to or less than the configured value, depending on your data pattern. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.

"activities":[
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "cloudDataMovementUnits": 32
        }
    }
]

Note

If you need more cloud DMUs for a higher throughput, contact Azure support. A setting of 8 and above currently works only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database.

parallelCopies

You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. You can think of this property as the maximum number of threads within Copy Activity that can read from your source or write to your sink data stores in parallel.

For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store. The default number of parallel copies that it uses depends on the type of source and sink that you are using.

Source and sink | Default parallel copy count determined by service
Copy data between file-based stores (Blob storage; Data Lake Store; Amazon S3; an on-premises file system; an on-premises HDFS) | Between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the gateway machine used for a hybrid copy (to copy data to or from an on-premises data store).
Copy data from any source data store to Azure Table storage | 4
All other source and sink pairs | 1

Usually, the default behavior should give you the best throughput. However, to control the load on machines that host your data stores, or to tune copy performance, you may choose to override the default value and specify a value for the parallelCopies property. The value must be between 1 and 32 (both inclusive). At run time, for the best performance, Copy Activity uses a value that is less than or equal to the value that you set.

"activities":[
    {
        "name": "Sample copy activity",
        "description": "",
        "type": "Copy",
        "inputs": [{ "name": "InputDataset" }],
        "outputs": [{ "name": "OutputDataset" }],
        "typeProperties": {
            "source": {
                "type": "BlobSource",
            },
            "sink": {
                "type": "AzureDataLakeStoreSink"
            },
            "parallelCopies": 8
        }
    }
]

Points to note:

  • When you copy data between file-based stores, parallelCopies determines the parallelism at the file level. The chunking within a single file happens automatically and transparently underneath; it's designed to use the best suitable chunk size for a given source data store type to load data in parallel, and it is orthogonal to parallelCopies. The actual number of parallel copies the data movement service uses for the copy operation at run time is no more than the number of files you have. If the copy behavior is mergeFile, Copy Activity cannot take advantage of file-level parallelism.
  • When you specify a value for the parallelCopies property, consider the load increase on your source and sink data stores, and on the gateway if it is a hybrid copy. This happens especially when you have multiple activities or concurrent runs of the same activities that run against the same data store. If you notice that either the data store or the gateway is overwhelmed with the load, decrease the parallelCopies value to relieve the load.
  • When you copy data from stores that are not file-based to stores that are file-based, the data movement service ignores the parallelCopies property. Even if parallelism is specified, it's not applied in this case.

Note

You must use Data Management Gateway version 1.11 or later to use the parallelCopies feature when you do a hybrid copy.

To better use these two properties, and to enhance your data movement throughput, see the sample use cases. You don't need to configure parallelCopies to take advantage of the default behavior. If you do configure it and parallelCopies is too small, multiple cloud DMUs might not be fully utilized.

Billing impact

It's important to remember that you are charged based on the total time of the copy operation. If a copy job used to take one hour with one cloud unit and now takes 15 minutes with four cloud units, the overall bill remains almost the same. For example, suppose you use four cloud units. The first cloud unit spends 10 minutes, the second one 10 minutes, the third one 5 minutes, and the fourth one 5 minutes, all in one Copy Activity run. You are charged for the total copy (data movement) time, which is 10 + 10 + 5 + 5 = 30 minutes. Using parallelCopies does not affect billing.

Staged copy

When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store. Staging is especially useful in the following cases:

  1. You want to ingest data from various data stores into SQL Data Warehouse via PolyBase. SQL Data Warehouse uses PolyBase as a high-throughput mechanism to load a large amount of data into SQL Data Warehouse. However, the source data must be in Blob storage, and it must meet additional criteria. When you load data from a data store other than Blob storage, you can activate data copying via interim staging Blob storage. In that case, Data Factory performs the required data transformations to ensure that the data meets the requirements of PolyBase. Then it uses PolyBase to load data into SQL Data Warehouse. For more details, see Use PolyBase to load data into Azure SQL Data Warehouse. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.
  2. Sometimes it takes a while to perform a hybrid data movement (that is, to copy between an on-premises data store and a cloud data store) over a slow network connection. To improve performance, you can compress the data on-premises so that it takes less time to move data to the staging data store in the cloud. Then you can decompress the data in the staging store before you load it into the destination data store.
  3. You don't want to open ports other than port 80 and port 443 in your firewall, because of corporate IT policies. For example, when you copy data from an on-premises data store to an Azure SQL Database sink or an Azure SQL Data Warehouse sink, you need to activate outbound TCP communication on port 1433 for both the Windows firewall and your corporate firewall. In this scenario, take advantage of the gateway to first copy data to a Blob storage staging instance over HTTP or HTTPS on port 443. Then, load the data into SQL Database or SQL Data Warehouse from Blob storage staging. In this flow, you don't need to enable port 1433.

How staged copy works

When you activate the staging feature, the data is first copied from the source data store to the staging data store (bring your own). Next, the data is copied from the staging data store to the sink data store. Data Factory automatically manages the two-stage flow for you. Data Factory also cleans up temporary data from the staging storage after the data movement is complete.

In the cloud copy scenario (both source and sink data stores are in the cloud), the gateway is not used. The Data Factory service performs the copy operations.

Staged copy: Cloud scenario

In the hybrid copy scenario (the source is on-premises and the sink is in the cloud), the gateway moves data from the source data store to a staging data store. The Data Factory service moves data from the staging data store to the sink data store. Copying data from a cloud data store to an on-premises data store via staging also is supported with the reversed flow.

Staged copy: Hybrid scenario

When you activate data movement by using a staging store, you can specify whether you want the data to be compressed before moving data from the source data store to an interim or staging data store, and then decompressed before moving data from the interim or staging data store to the sink data store.

Currently, you can't copy data between two on-premises data stores by using a staging store. We expect this option to be available soon.

Configuration

Configure the enableStaging setting in Copy Activity to specify whether you want the data to be staged in Blob storage before you load it into a destination data store. When you set enableStaging to TRUE, specify the additional properties listed in the next table. If you don't have one, you also need to create an Azure Storage or Storage shared access signature linked service for staging.

Property | Description | Default value | Required
enableStaging | Specify whether you want to copy data via an interim staging store. | False | No
linkedServiceName | Specify the name of an AzureStorage or AzureStorageSas linked service, which refers to the instance of Storage that you use as an interim staging store. You cannot use Storage with a shared access signature to load data into SQL Data Warehouse via PolyBase. You can use it in all other scenarios. | N/A | Yes, when enableStaging is set to TRUE
path | Specify the Blob storage path that you want to contain the staged data. If you do not provide a path, the service creates a container to store temporary data. Specify a path only if you use Storage with a shared access signature, or you require temporary data to be in a specific location. | N/A | No
enableCompression | Specifies whether data should be compressed before it is copied to the destination. This setting reduces the volume of data being transferred. | False | No

Here's a sample definition of Copy Activity with the properties that are described in the preceding table:

"activities":[
{
    "name": "Sample copy activity",
    "type": "Copy",
    "inputs": [{ "name": "OnpremisesSQLServerInput" }],
    "outputs": [{ "name": "AzureSQLDBOutput" }],
    "typeProperties": {
        "source": {
            "type": "SqlSource",
        },
        "sink": {
            "type": "SqlSink"
        },
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": "MyStagingBlob",
            "path": "stagingcontainer/path",
            "enableCompression": true
        }
    }
}
]

Billing impact

You are charged based on two steps: copy duration and copy type.

  • When you use staging during a cloud copy (copying data from a cloud data store to another cloud data store), you are charged the [sum of copy duration for step 1 and step 2] x [cloud copy unit price].
  • When you use staging during a hybrid copy (copying data from an on-premises data store to a cloud data store), you are charged for [hybrid copy duration] x [hybrid copy unit price] + [cloud copy duration] x [cloud copy unit price].

Performance tuning steps

We suggest that you take these steps to tune the performance of your Data Factory service with Copy Activity:

  1. Establish a baseline. During the development phase, test your pipeline by using Copy Activity against a representative data sample. You can use the Data Factory slicing model to limit the amount of data you work with.

    Collect execution time and performance characteristics by using the Monitoring and Management App. Choose Monitor & Manage on your Data Factory home page. In the tree view, choose the output dataset. In the Activity Windows list, choose the Copy Activity run. Activity Windows lists the Copy Activity duration and the size of the data that's copied. The throughput is listed in Activity Window Explorer. To learn more about the app, see Monitor and manage Azure Data Factory pipelines by using the Monitoring and Management App.

    Activity run details

    Later in the article, you can compare the performance and configuration of your scenario to Copy Activity's performance reference from our tests.

  2. Diagnose and optimize performance. If the performance you observe doesn't meet your expectations, you need to identify performance bottlenecks. Then, optimize performance to remove or reduce the effect of bottlenecks. A full description of performance diagnosis is beyond the scope of this article, but some common considerations are covered in the sections that follow, including the source, the sink, serialization and deserialization, compression, column mapping, and Data Management Gateway.

  3. Expand the configuration to your entire data set. When you're satisfied with the execution results and performance, you can expand the definition and pipeline active period to cover your entire data set.

Considerations for Data Management Gateway

Gateway setup: We recommend that you use a dedicated machine to host Data Management Gateway. See Considerations for using Data Management Gateway.

Gateway monitoring and scale-up/out: A single logical gateway with one or more gateway nodes can serve multiple Copy Activity runs concurrently. In the Azure portal, you can view a near-real-time snapshot of resource utilization (CPU, memory, network in/out, and so on) on a gateway machine, as well as the number of concurrent jobs running versus the limit; see Monitor gateway in the portal. If you have a heavy need for hybrid data movement, either with a large number of concurrent Copy Activity runs or with a large volume of data to copy, consider scaling up or scaling out the gateway to better utilize your resources, or provision more resources to empower the copy.

Considerations for the source

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, see the monitoring and tuning topics that are specific to each data store. These topics help you understand data store performance characteristics, minimize response times, and maximize throughput.

If you copy data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.

File-based data stores

(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)

  • Average file size and file count: Copy Activity transfers data one file at a time. With the same amount of data to be moved, the overall throughput is lower if the data consists of many small files rather than a few large files, because of the bootstrap phase for each file. Therefore, if possible, combine small files into larger files to gain higher throughput.
  • File format and compression: For more ways to improve performance, see the Considerations for serialization and deserialization and Considerations for compression sections.
  • For the on-premises file system scenario, in which Data Management Gateway is required, see the Considerations for Data Management Gateway section.

Relational data stores

(Includes SQL Database; SQL Data Warehouse; Amazon Redshift; SQL Server databases; and Oracle, MySQL, DB2, Teradata, Sybase, and PostgreSQL databases)

  • Data pattern: Your table schema affects copy throughput. A large row size gives you better performance than a small row size for copying the same amount of data, because the database can more efficiently retrieve fewer batches of data.
  • Query or stored procedure: Optimize the logic of the query or stored procedure you specify in the Copy Activity source to fetch data more efficiently, as shown in the sketch after this list.
  • For on-premises relational databases, such as SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.
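For illustration, a minimal sketch of a source that pushes filtering into the query rather than reading the whole table (the table and column names are hypothetical):

"source": {
    "type": "SqlSource",
    "sqlReaderQuery": "SELECT CustomerId, OrderDate, Amount FROM dbo.Orders WHERE OrderDate >= '2016-01-01'"
}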

Considerations for the sink

General

Be sure that the underlying data store is not overwhelmed by other workloads that are running on or against it.

For Microsoft data stores, refer to monitoring and tuning topics that are specific to data stores. These topics can help you understand data store performance characteristics and how to minimize response times and maximize throughput.

If you are copying data from Blob storage to SQL Data Warehouse, consider using PolyBase to boost performance. See Use PolyBase to load data into Azure SQL Data Warehouse for details. For a walkthrough with a use case, see Load 1 TB into Azure SQL Data Warehouse under 15 minutes with Azure Data Factory.

File-based data stores

(Includes Blob storage, Data Lake Store, Amazon S3, on-premises file systems, and on-premises HDFS)

  • Copy behavior: If you copy data from a different file-based data store, Copy Activity has three options via the copyBehavior property: it preserves hierarchy, flattens hierarchy, or merges files. Either preserving or flattening hierarchy has little or no performance overhead, but merging files causes performance overhead to increase. See the sketch after this list.
  • File format and compression: See the Considerations for serialization and deserialization and Considerations for compression sections for more ways to improve performance.
  • Blob storage: Currently, Blob storage supports only block blobs for optimized data transfer and throughput.
  • For on-premises file system scenarios that require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.
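As a sketch, copyBehavior is set on the file-based sink in typeProperties; for example, a BlobSink that merges the source files into one blob (and therefore gives up file-level parallelism):

"sink": {
    "type": "BlobSink",
    "copyBehavior": "MergeFiles"
}

The other accepted values are PreserveHierarchy and FlattenHierarchy.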

Relational data stores

(Includes SQL Database, SQL Data Warehouse, SQL Server databases, and Oracle databases)

  • Copy behavior: Depending on the properties you've set for sqlSink, Copy Activity writes data to the destination database in different ways.
    • By default, the data movement service uses the Bulk Copy API to insert data in append mode, which provides the best performance.
    • If you configure a stored procedure in the sink, the database applies the data one row at a time instead of as a bulk load, so performance drops significantly. If your data set is large, when applicable, consider switching to the sqlWriterCleanupScript property instead.
    • If you configure the sqlWriterCleanupScript property for each Copy Activity run, the service triggers the script and then uses the Bulk Copy API to insert the data. For example, to overwrite the entire table with the latest data, you can specify a script to first delete all records before bulk-loading the new data from the source. A sample sink definition is sketched after this list.
  • Data pattern and batch size:
    • Your table schema affects copy throughput. To copy the same amount of data, a large row size gives you better performance than a small row size because the database can more efficiently commit fewer batches of data.
    • Copy Activity inserts data in a series of batches. You can set the number of rows in a batch by using the writeBatchSize property. If your data has small rows, you can set the writeBatchSize property to a higher value to benefit from lower batch overhead and higher throughput. If the row size of your data is large, be careful when you increase writeBatchSize. A high value might lead to a copy failure caused by overloading the database.
  • For on-premises relational databases like SQL Server and Oracle, which require the use of Data Management Gateway, see the Considerations for Data Management Gateway section.
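A minimal sketch of a SqlSink that clears the target table with a cleanup script and then relies on the default bulk-copy insert, with a tuned batch size (the script, table name, and batch size are only illustrative):

"sink": {
    "type": "SqlSink",
    "sqlWriterCleanupScript": "DELETE FROM dbo.TargetTable",
    "writeBatchSize": 10000
}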

NoSQL stores

(Includes Table storage and Azure Cosmos DB)

  • For Table storage:
    • Partition: Writing data to interleaved partitions dramatically degrades performance. Sort your source data by partition key so that the data is inserted efficiently into one partition after another, or adjust the logic to write the data to a single partition.
  • For Azure Cosmos DB:
    • Batch size: The writeBatchSize property sets the number of parallel requests to the Azure Cosmos DB service to create documents. You can expect better performance when you increase writeBatchSize because more parallel requests are sent to Azure Cosmos DB. However, watch for throttling when you write to Azure Cosmos DB (the error message is "Request rate is large"). Various factors can cause throttling, including document size, the number of terms in the documents, and the target collection's indexing policy. To achieve higher copy throughput, consider using a better collection, for example, S3. A sample sink definition is sketched after this list.
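For illustration, a hedged sketch of an Azure Cosmos DB (DocumentDB) sink with an increased batch size; the value shown is only an example and should be tuned against the collection's throughput:

"sink": {
    "type": "DocumentDbCollectionSink",
    "writeBatchSize": 100
}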

Considerations for serialization and deserialization

Serialization and deserialization can occur when your input data set or output data set is a file. See Supported file and compression formats for details on the file formats supported by Copy Activity.

Copy behavior:

  • Copying files between file-based data stores:
    • When input and output data sets both have the same or no file format settings, the data movement service executes a binary copy without any serialization or deserialization. You see a higher throughput compared to the scenario in which the source and sink file format settings differ from each other.
    • When input and output data sets both are in text format and only the encoding type is different, the data movement service only does encoding conversion. It doesn't do any serialization and deserialization, which causes some performance overhead compared to a binary copy.
    • When input and output data sets both have different file formats or different configurations, like delimiters, the data movement service deserializes the source data to stream, transform, and then serialize it into the output format you indicated. This operation results in a much more significant performance overhead compared to other scenarios.
  • When you copy files to/from a data store that is not file-based (for example, from a file-based store to a relational store), the serialization or deserialization step is required. This step results in significant performance overhead.

File format: The file format you choose might affect copy performance. For example, Avro is a compact binary format that stores metadata with data. It has broad support in the Hadoop ecosystem for processing and querying. However, Avro is more expensive for serialization and deserialization, which results in lower copy throughput compared to text format. Make your choice of file format throughout the processing flow holistically. Start with what form the data is stored in, source data stores or to be extracted from external systems; the best format for storage, analytical processing, and querying; and in what format the data should be exported into data marts for reporting and visualization tools. Sometimes a file format that is suboptimal for read and write performance might be a good choice when you consider the overall analytical process.
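To make the trade-off concrete, the file format is declared on the dataset's typeProperties; a hypothetical Blob dataset in delimited text format might look like the following, and switching the format type to AvroFormat is what introduces the extra serialization cost described above:

"typeProperties": {
    "folderPath": "mycontainer/myfolder/",
    "format": {
        "type": "TextFormat",
        "columnDelimiter": ","
    }
}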

Considerations for compression

When your input or output data set is a file, you can set Copy Activity to perform compression or decompression as it writes data to the destination. When you choose compression, you make a tradeoff between input/output (I/O) and CPU. Compressing the data costs extra in compute resources. But in return, it reduces network I/O and storage. Depending on your data, you may see a boost in overall copy throughput.

Codec: Copy Activity supports gzip, bzip2, and Deflate compression types. Azure HDInsight can consume all three types for processing. Each compression codec has advantages. For example, bzip2 has the lowest copy throughput, but you get the best Hive query performance with bzip2 because you can split it for processing. Gzip is the most balanced option, and it is used the most often. Choose the codec that best suits your end-to-end scenario.

Level: You can choose from two options for each compression codec: fastest compressed and optimally compressed. The fastest compressed option compresses the data as quickly as possible, even if the resulting file is not optimally compressed. The optimally compressed option spends more time on compression and yields a minimal amount of data. You can test both options to see which provides better overall performance in your case.
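A minimal sketch of enabling compression on a file-based dataset; the codec and level shown are just one possible choice:

"typeProperties": {
    "folderPath": "mycontainer/compressed/",
    "format": { "type": "TextFormat" },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}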

A consideration: To copy a large amount of data between an on-premises store and the cloud, consider using interim Blob storage with compression. Using interim storage is helpful when the bandwidth of your corporate network and your Azure services is the limiting factor, and you want the input data set and output data set both to be in uncompressed form. More specifically, you can break a single copy activity into two copy activities. The first copy activity copies from the source to an interim or staging blob in compressed form. The second copy activity copies the compressed data from staging, and then decompresses while it writes to the sink.

Considerations for column mapping

You can set the columnMappings property in Copy Activity to map all or a subset of the input columns to the output columns. After the data movement service reads the data from the source, it needs to perform column mapping on the data before it writes the data to the sink. This extra processing reduces copy throughput.

If your source data store is queryable, for example, if it's a relational store like SQL Database or SQL Server, or if it's a NoSQL store like Table storage or Azure Cosmos DB, consider pushing the column filtering and reordering logic to the query property instead of using column mapping. This way, the projection occurs while the data movement service reads data from the source data store, where it is much more efficient. A sketch of column mapping follows.
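For illustration, a hedged sketch of explicit column mapping via the translator (the column names are hypothetical); when the source is queryable, the same projection can usually be moved into the source query and the translator omitted:

"typeProperties": {
    "source": { "type": "SqlSource" },
    "sink": { "type": "BlobSink" },
    "translator": {
        "type": "TabularTranslator",
        "columnMappings": "Name: CustomerName, City: CustomerCity"
    }
}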

Other considerations

If the size of data you want to copy is large, you can adjust your business logic to further partition the data using the slicing mechanism in Data Factory. Then, schedule Copy Activity to run more frequently to reduce the data size for each Copy Activity run, as shown in the sketch below.
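As a rough sketch, the slice size is driven by the dataset availability (together with the matching activity scheduler); for example, hourly slices make each Copy Activity run move a smaller chunk of data than daily slices:

"availability": {
    "frequency": "Hour",
    "interval": 1
}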

Be cautious about the number of data sets and copy activities that require Data Factory to connect to the same data store at the same time. Many concurrent copy jobs might throttle a data store and lead to degraded performance, copy job internal retries, and in some cases, execution failures.

Sample scenario: Copy from an on-premises SQL Server to Blob storage

Scenario: A pipeline is built to copy data from an on-premises SQL Server to Blob storage in CSV format. To make the copy job faster, the CSV files should be compressed into bzip2 format.

Test and analysis: The throughput of Copy Activity is less than 2 MBps, which is much slower than the performance benchmark.

Performance analysis and tuning: To troubleshoot the performance issue, let's look at how the data is processed and moved.

  1. Read data: Gateway opens a connection to SQL Server and sends the query. SQL Server responds by sending the data stream to Gateway via the intranet.
  2. Serialize and compress data: Gateway serializes the data stream to CSV format, and compresses the data to a bzip2 stream.
  3. Write data: Gateway uploads the bzip2 stream to Blob storage via the Internet.

As you can see, the data is being processed and moved in a streaming sequential manner: SQL Server > LAN > Gateway > WAN > Blob storage. The overall performance is gated by the minimum throughput across the pipeline.

Data flow

One or more of the following factors might cause the performance bottleneck:

  • Source: SQL Server itself has low throughput because of heavy loads.
  • Data Management Gateway:
    • LAN: Gateway is located far from the SQL Server machine and has a low-bandwidth connection.
    • Gateway: Gateway has reached its load limitations to perform the following operations:
      • Serialization: Serializing the data stream to CSV format has slow throughput.
      • Compression: You chose a slow compression codec (for example, bzip2, which is 2.8 MBps with Core i7).
    • WAN: The bandwidth between the corporate network and your Azure services is low (for example, T1 = 1,544 kbps; T2 = 6,312 kbps).
  • Sink: Blob storage has low throughput. (This scenario is unlikely because its SLA guarantees a minimum of 60 MBps.)

In this case, bzip2 data compression might be slowing down the entire pipeline. Switching to a gzip compression codec might ease this bottleneck.

Sample scenarios: Use parallel copy

Scenario I: Copy 1,000 1-MB files from the on-premises file system to Blob storage.

Analysis and performance tuning: For example, if you have installed the gateway on a quad-core machine, Data Factory uses 16 parallel copies to move files from the file system to Blob storage concurrently. This parallel execution should result in high throughput. You also can explicitly specify the parallel copies count. When you copy many small files, parallel copies dramatically help throughput by using resources more effectively.

Scenario 1

Scenario II: Copy 20 blobs of 500 MB each from Blob storage to Data Lake Store, and then tune performance.

Analysis and performance tuning: In this scenario, Data Factory copies the data from Blob storage to Data Lake Store by using a single copy (parallelCopies set to 1) and a single cloud data movement unit. The throughput you observe will be close to that described in the performance reference section.

Scenario 2

Scenario III: Individual file size is greater than dozens of MBs and total volume is large.

Analysis and performance tuning: Increasing parallelCopies doesn't result in better copy performance because of the resource limitations of a single cloud DMU. Instead, you should specify more cloud DMUs to get more resources to perform the data movement. Do not specify a value for the parallelCopies property; Data Factory handles the parallelism for you. In this case, if you set cloudDataMovementUnits to 4, the throughput will be about four times higher.

Scenario 3

Reference

Here are performance monitoring and tuning references for some of the supported data stores: