
Load 1 TB into Azure Synapse Analytics in under 15 minutes with Data Factory

Note

This article applies to version 1 of Data Factory. If you are using the current version of the Data Factory service, see Copy data to or from Azure Synapse Analytics by using Data Factory.

Azure Synapse Analytics is a cloud-based, scale-out database capable of processing massive volumes of data, both relational and non-relational. Built on a massively parallel processing (MPP) architecture, Azure Synapse Analytics is optimized for enterprise data warehouse workloads. It offers cloud elasticity with the flexibility to scale storage and compute independently.

Getting started with Azure Synapse Analytics is now easier than ever with Azure Data Factory. Azure Data Factory is a fully managed, cloud-based data integration service that can be used to populate Azure Synapse Analytics with data from your existing systems, saving you valuable time while evaluating Azure Synapse Analytics and building your analytics solutions. Here are the key benefits of loading data into Azure Synapse Analytics using Azure Data Factory:

  • Easy to set up: 5-step intuitive wizard with no scripting required.
  • Rich data store support: built-in support for a rich set of on-premises and cloud-based data stores.
  • Secure and compliant: data is transferred over HTTPS or ExpressRoute, and global service presence ensures your data never leaves the geographical boundary.
  • Unparalleled performance by using PolyBase: PolyBase is the most efficient way to move data into Azure Synapse Analytics. With the staging blob feature, you can achieve high load speeds from all types of data stores in addition to Azure Blob storage, which PolyBase supports by default.

This article shows you how to use the Data Factory Copy Wizard to load 1 TB of data from Azure Blob Storage into Azure Synapse Analytics in under 15 minutes, at over 1.2 GBps throughput.

This article provides step-by-step instructions for moving data into Azure Synapse Analytics by using the Copy Wizard.

Note

For general information about the capabilities of Data Factory for moving data to and from Azure Synapse Analytics, see the article Move data to and from Azure Synapse Analytics using Azure Data Factory.

You can also build pipelines using Visual Studio, PowerShell, and so on. For a quick walkthrough with step-by-step instructions for using the Copy Activity in Azure Data Factory, see Tutorial: Copy data from Azure Blob to Azure SQL Database.

Prerequisites

  • Azure Blob Storage: this experiment uses Azure Blob Storage (GRS) to store the TPC-H testing dataset. If you do not have an Azure storage account, learn how to create a storage account.

  • TPC-H data: we are going to use TPC-H as the testing dataset. To do that, you need dbgen from the TPC-H toolkit, which helps you generate the dataset. You can either download the source code for dbgen from TPC Tools and compile it yourself, or download a compiled binary from GitHub. Run dbgen.exe with the following commands to generate 1 TB of flat files for the lineitem table, spread across 10 files.
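
    A minimal sketch of the invocations, assuming the standard dbgen options (-s sets the scale factor, where 1,000 corresponds to roughly 1 TB; -C sets the total number of chunks; -S selects which chunk to build; -T L restricts generation to the lineitem table; -v prints verbose output):

    dbgen -s 1000 -S 1 -C 10 -T L -v
    dbgen -s 1000 -S 2 -C 10 -T L -v

    Repeat the command, incrementing -S through 10, to produce all 10 lineitem files.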

  • Azure Synapse Analytics: this experiment loads data into an Azure Synapse Analytics instance created with 6,000 DWUs.

    Refer to Create an Azure Synapse Analytics for detailed instructions on how to create an Azure Synapse Analytics database. To get the best possible load performance into Azure Synapse Analytics using PolyBase, we choose the maximum number of Data Warehouse Units (DWUs) allowed in the Performance setting, which is 6,000 DWUs.

    Note

    When loading from Azure Blob, the data loading performance is directly proportional to the number of DWUs you configure for Azure Synapse Analytics:

    Loading 1 TB into 1,000 DWU Azure Synapse Analytics takes 87 minutes (~200 MBps throughput)
    Loading 1 TB into 2,000 DWU Azure Synapse Analytics takes 46 minutes (~380 MBps throughput)
    Loading 1 TB into 6,000 DWU Azure Synapse Analytics takes 14 minutes (~1.2 GBps throughput)

    To create a dedicated SQL pool with 6,000 DWUs, move the Performance slider all the way to the right:

    Performance slider

    For an existing database that is not configured with 6,000 DWUs, you can scale it up using the Azure portal. Navigate to the database in the Azure portal; there is a Scale button on the Overview panel, as shown in the following image:

    Scale button

    Click the Scale button to open the following panel, move the slider to the maximum value, and click the Save button.

    Scale dialog
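
    If you prefer scripting the change, a dedicated SQL pool can also be scaled with T-SQL. A minimal sketch, assuming a hypothetical pool named MySqlDw (run against the master database of the logical server):

    -- DW6000 is the maximum Gen1 service objective, matching the slider's rightmost position.
    ALTER DATABASE [MySqlDw] MODIFY (SERVICE_OBJECTIVE = 'DW6000');

    Scale operations cancel in-flight queries, so run this before starting the load.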

    This experiment loads data into Azure Synapse Analytics using the xlargerc resource class.

    To achieve the best possible throughput, the copy needs to be performed by an Azure Synapse Analytics user belonging to the xlargerc resource class. Learn how to do that by following the Change a user resource class example; a sketch also follows below.
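
    For reference, a minimal sketch of putting a dedicated loading user into xlargerc; the user name and password are hypothetical placeholders:

    -- Run on the master database: create the login.
    CREATE LOGIN LoadUser WITH PASSWORD = '<strong password here>';

    -- Run on the data warehouse database: create the user and assign the resource class,
    -- which is implemented as a database role.
    CREATE USER LoadUser FOR LOGIN LoadUser;
    EXEC sp_addrolemember 'xlargerc', 'LoadUser';

    Larger resource classes grant each query more memory, which improves columnstore index quality during loads.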

  • Create the destination table schema in your Azure Synapse Analytics database by running the following DDL statement:

    CREATE TABLE [dbo].[lineitem]
    (
        [L_ORDERKEY] [bigint] NOT NULL,
        [L_PARTKEY] [bigint] NOT NULL,
        [L_SUPPKEY] [bigint] NOT NULL,
        [L_LINENUMBER] [int] NOT NULL,
        [L_QUANTITY] [decimal](15, 2) NULL,
        [L_EXTENDEDPRICE] [decimal](15, 2) NULL,
        [L_DISCOUNT] [decimal](15, 2) NULL,
        [L_TAX] [decimal](15, 2) NULL,
        [L_RETURNFLAG] [char](1) NULL,
        [L_LINESTATUS] [char](1) NULL,
        [L_SHIPDATE] [date] NULL,
        [L_COMMITDATE] [date] NULL,
        [L_RECEIPTDATE] [date] NULL,
        [L_SHIPINSTRUCT] [char](25) NULL,
        [L_SHIPMODE] [char](10) NULL,
        [L_COMMENT] [varchar](44) NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        CLUSTERED COLUMNSTORE INDEX
    )
    

    With the prerequisite steps completed, we are now ready to configure the copy activity using the Copy Wizard.

Launch the Copy Wizard

  1. Log in to the Azure portal.

  2. Click Create a resource in the top-left corner, click Intelligence + analytics, and then click Data Factory.

  3. In the New data factory pane:

    1. Enter LoadIntoSQLDWDataFactory for the name. The name of the Azure data factory must be globally unique. If you receive the error "Data factory name LoadIntoSQLDWDataFactory is not available", change the name of the data factory (for example, yournameLoadIntoSQLDWDataFactory) and try creating it again. See the Data Factory - Naming Rules topic for naming rules for Data Factory artifacts.
    2. Select your Azure subscription.
    3. For Resource Group, do one of the following steps:
      1. Select Use existing to select an existing resource group.
      2. Select Create new to enter a name for a resource group.
    4. Select a location for the data factory.
    5. Select the Pin to dashboard check box at the bottom of the blade.
    6. Click Create.
  4. After the creation is complete, you see the Data Factory blade as shown in the following image:

    Data factory home page

  5. On the Data Factory home page, click the Copy data tile to launch the Copy Wizard.

    Note

    If the web browser is stuck at "Authorizing...", disable or uncheck the Block third-party cookies and site data setting, or keep it enabled and create an exception for login.microsoftonline.com, and then try launching the wizard again.

Step 1: Configure the data loading schedule

The first step is to configure the data loading schedule.

On the Properties page:

  1. Enter CopyFromBlobToAzureSqlDataWarehouse for Task name.

  2. Select the Run once now option.

  3. Click Next.

    Copy Wizard - Properties page

Step 2: Configure the source

This section shows you the steps to configure the source: the Azure Blob storage containing the 1 TB of TPC-H line item files.

  1. Select Azure Blob Storage as the data store and click Next.

    Copy Wizard - Select source page

  2. Fill in the connection information for the Azure Blob storage account, and click Next.

    Copy Wizard - Source connection information

  3. Choose the folder containing the TPC-H line item files and click Next.

    Copy Wizard - Select input folder

  4. When you click Next, the file format settings are detected automatically. Check to make sure that the column delimiter is '|' instead of the default comma ','. Click Next after you have previewed the data.

    Copy Wizard - File format settings

Step 3: Configure the destination

This section shows you how to configure the destination: the lineitem table in the Azure Synapse Analytics database.

  1. Choose Azure Synapse Analytics as the destination store and click Next.

    Copy Wizard - Select destination data store

  2. Fill in the connection information for Azure Synapse Analytics. Make sure you specify a user that is a member of the xlargerc resource class (see the Prerequisites section for detailed instructions), and click Next.

    Copy Wizard - Destination connection information

  3. Choose the destination table and click Next.

    Copy Wizard - Table mapping page

  4. On the Schema mapping page, leave the Apply column mapping option unchecked and click Next.

Step 4: Performance settings

Allow PolyBase is checked by default. Click Next.

Copy Wizard - Schema mapping page

Step 5: Deploy and monitor load results

  1. Click the Finish button to deploy.

    Copy Wizard - Summary page 1

  2. After the deployment is complete, click Click here to monitor copy pipeline to monitor the copy run progress. Select the copy pipeline you created in the Activity Windows list.

    Copy Wizard - Summary page 2

    You can view the copy run details in the Activity Window Explorer in the right panel, including the data volume read from the source and written into the destination, the duration, and the average throughput of the run.

    As you can see from the following screenshot, copying 1 TB from Azure Blob Storage into Azure Synapse Analytics took 14 minutes, effectively achieving 1.22 GBps throughput!

    Copy Wizard - Succeeded dialog
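
    Once the run succeeds, you can optionally confirm the load from the Azure Synapse Analytics side with a quick row count; at scale factor 1,000, the TPC-H lineitem table holds roughly 6 billion rows:

    -- Row-count check on the destination table.
    SELECT COUNT_BIG(*) AS row_count FROM [dbo].[lineitem];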

Best practices

Here are a few best practices for running your Azure Synapse Analytics database:

  • Use a larger resource class when loading into a clustered columnstore index.
  • For more efficient joins, consider using hash distribution on a selected column instead of the default round-robin distribution (see the sketch after this list).
  • For faster load speeds, consider using a heap for transient data.
  • Create statistics after you finish loading into Azure Synapse Analytics.
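
The following T-SQL sketch illustrates the distribution, heap, and statistics recommendations above; the table and statistics names are hypothetical:

    -- Re-create the fact table hash-distributed on a frequently joined column via CTAS.
    CREATE TABLE [dbo].[lineitem_hash]
    WITH
    (
        DISTRIBUTION = HASH([L_ORDERKEY]),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM [dbo].[lineitem];

    -- Use a round-robin heap for transient staging data to speed up loads.
    CREATE TABLE [dbo].[lineitem_staging]
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        HEAP
    )
    AS SELECT * FROM [dbo].[lineitem] WHERE 1 = 0;

    -- After loading, create statistics on columns used in joins, filters, and GROUP BY clauses.
    CREATE STATISTICS [stat_lineitem_orderkey] ON [dbo].[lineitem] ([L_ORDERKEY]);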

For details, see Best practices for Azure Synapse Analytics.

Next steps