
Extract, transform, and load (ETL) at scale

Extract, transform, and load (ETL) is the process by which data is acquired from various sources, collected in a standard location, cleaned and processed, and ultimately loaded into a datastore from which it can be queried. Legacy ETL processes import data, clean it in place, and then store it in a relational data engine. With HDInsight, a wide variety of Apache Hadoop ecosystem components support performing ETL at scale.

The use of HDInsight in the ETL process can be summarized by this pipeline:

HDInsight ETL at scale overview

The following sections explore each of the ETL phases and their associated components.

Orchestration

Orchestration spans all phases of the ETL pipeline. ETL jobs in HDInsight often involve several different products working in conjunction with each other. You might use Hive to clean some portion of the data, while Pig cleans another portion. You might use Azure Data Factory to load data into Azure SQL Database from Azure Data Lake Store.

Orchestration is needed to run the appropriate job at the appropriate time.

Apache Oozie

Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie runs within an HDInsight cluster and is integrated with the Hadoop stack. Oozie supports Hadoop jobs for Apache Hadoop MapReduce, Apache Pig, Apache Hive, and Apache Sqoop. Oozie can also be used to schedule jobs that are specific to a system, such as Java programs or shell scripts.

For more information, see Use Apache Oozie with Apache Hadoop to define and run a workflow on HDInsight. For a deep dive showing how to use Oozie to drive an end-to-end pipeline, see Operationalize the Data Pipeline.

Azure Data Factory

Azure Data Factory provides orchestration capabilities in the form of platform as a service (PaaS). It's a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

Using Azure Data Factory, you can:

  1. Create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
  2. Process and transform the data using compute services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, Azure Batch, and Azure Machine Learning.
  3. Publish output data to data stores such as Azure SQL Data Warehouse for business intelligence (BI) applications to consume.

For more information on Azure Data Factory, see the documentation.
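As a concrete illustration of the orchestration role that Data Factory plays, the following minimal Python sketch triggers and monitors a run of an existing pipeline through the azure-mgmt-datafactory SDK. The subscription, resource group, factory, credentials, and pipeline name are hypothetical placeholders, and the pipeline itself (its linked services, datasets, and activities) is assumed to be defined already.

```python
from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# All identifiers below are hypothetical placeholders -- substitute your own values.
credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
adf = DataFactoryManagementClient(credential, "<subscription-id>")

# Trigger an existing pipeline (for example, one that copies raw files into storage
# and then runs an HDInsight activity), then check the status of the run.
run = adf.pipelines.create_run("<resource-group>", "<data-factory-name>", "IngestAndCleanPipeline")
status = adf.pipeline_runs.get("<resource-group>", "<data-factory-name>", run.run_id).status
print(status)  # For example: "InProgress", "Succeeded", or "Failed"
```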

Ingest file storage and result storage

Source data files are typically loaded into a location in Azure Storage or Azure Data Lake Storage. Files can be in any format, but typically they're flat files like CSVs.

Azure Storage

Azure Storage has specific scalability targets. For more information, see Scalability and performance targets for Blob storage. For most analytic nodes, Azure Storage scales best when dealing with many smaller files. Azure Storage guarantees the same performance no matter how many files you have or how large they are, as long as you stay within your limits. This means that you can store terabytes of data and still get consistent performance, whether you're using a subset of the data or all of it.

Azure Storage has several different types of blobs. An append blob is a great option for storing web logs or sensor data.
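For example, because an append blob only supports adding blocks to the end of the blob, it maps naturally onto log and sensor feeds. The sketch below, assuming the azure-storage-blob Python SDK and hypothetical container and blob names, creates an append blob once and then appends records as they arrive.

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string, container, and blob names -- substitute your own.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="logs", blob="sensors/2024-01-01.log")

# Create the append blob once, then keep appending records to its end as they arrive.
if not blob.exists():
    blob.create_append_blob()
blob.append_block(b'{"deviceId": "sensor-01", "temperature": 21.4}\n')
```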

Multiple blobs can be distributed across many servers to scale out access to them, but a single blob can only be served by a single server. While blobs can be logically grouped in blob containers, there are no partitioning implications from this grouping.

Azure Storage also has a WebHDFS API layer for the blob storage. All the services in HDInsight can access files in Azure Blob storage for data cleaning and data processing, similarly to how those services would use the Hadoop Distributed File System (HDFS).

Data is typically ingested into Azure Storage by using PowerShell, the Azure Storage SDK, or AzCopy.
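A minimal sketch of SDK-based ingestion, assuming the azure-storage-blob Python package and hypothetical container, path, and file names:

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and names -- substitute your own.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")

# Upload a local flat file as a block blob; overwrite=True keeps re-runs idempotent.
with open("sales_2024.csv", "rb") as data:
    container.upload_blob(name="ingest/sales_2024.csv", data=data, overwrite=True)
```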

Azure Data Lake Storage

Azure Data Lake Storage (ADLS) is a managed, hyperscale repository for analytics data that is compatible with HDFS. ADLS uses a design paradigm that is similar to HDFS, and it offers unlimited scalability in terms of total capacity and the size of individual files. ADLS works well with large files, because a large file can be stored across multiple nodes. Partitioning data in ADLS is done behind the scenes. You get massive throughput to run analytic jobs with thousands of concurrent executors that efficiently read and write hundreds of terabytes of data.

Data is typically ingested into ADLS by using Azure Data Factory, the ADLS SDKs, the AdlCopy service, Apache DistCp, or Apache Sqoop. Which of these services to use largely depends on where the data is. If the data is currently in an existing Hadoop cluster, you might use Apache DistCp, the AdlCopy service, or Azure Data Factory. If it's in Azure Blob storage, you might use the Azure Data Lake Storage .NET SDK, Azure PowerShell, or Azure Data Factory.
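For example, when the data is already reachable from the machine running the script, an SDK upload can be as simple as the following sketch. It assumes the azure-storage-file-datalake Python package (the Data Lake Storage Gen2 SDK) and hypothetical account, file system, and path names; the Gen1 service uses a different SDK.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account, credential, and paths -- substitute your own values.
service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)
file_system = service.get_file_system_client("raw")

# Upload a local file into a folder hierarchy; the service handles how the
# data is distributed and partitioned behind the scenes.
file_client = file_system.get_file_client("logs/2024/01/01/web.log")
with open("web.log", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```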

ADLS is also optimized for event ingestion through Azure Event Hubs or Apache Storm.

Considerations for both storage options

For uploading datasets in the terabyte range, network latency can be a major problem, particularly if the data is coming from an on-premises location. In such cases, you can use the options below:

  • Azure ExpressRoute: Azure ExpressRoute lets you create private connections between Azure datacenters and your on-premises infrastructure. These connections provide a reliable option for transferring large amounts of data. For more information, see the Azure ExpressRoute documentation.

  • "Offline" upload of data: You can use the Azure Import/Export service to ship hard disk drives with your data to an Azure datacenter. Your data is first uploaded to Azure Storage blobs. You can then use Azure Data Factory or the AdlCopy tool to copy the data from Azure Storage blobs to Data Lake Storage.

Azure SQL Data Warehouse

Azure SQL DW is a great choice to store cleaned and prepared results for future analytics. Azure HDInsight can be used to perform those services for Azure SQL DW.

Azure SQL Data Warehouse (SQL DW) is a relational database store optimized for analytic workloads. Azure SQL DW scales based on partitioned tables. Tables can be partitioned across multiple nodes. Azure SQL DW nodes are selected at the time of creation. They can scale after the fact, but that's an active process that might require data movement. For more information, see SQL Data Warehouse - Manage Compute.
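To illustrate how tables spread across the warehouse, the sketch below creates a hash-distributed table over ODBC from Python. The connection details, table, and distribution column are hypothetical; it assumes the pyodbc package and an ODBC driver for SQL Server are installed.

```python
import pyodbc

# Hypothetical server, database, and credentials -- substitute your own.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<warehouse>;UID=<user>;PWD=<password>"
)

# Hash-distributing on a join key spreads rows across the warehouse's distributions,
# so large scans and joins run in parallel across compute nodes.
conn.execute("""
    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        Amount     DECIMAL(18,2) NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX);
""")
conn.commit()
```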

Apache HBase

Apache HBase is a key-value store available in Azure HDInsight. Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable. HBase provides performant random access and strong consistency for large amounts of unstructured and semi-structured data in a schemaless database organized by column families.

Data is stored in the rows of a table, and data within a row is grouped by column family. HBase is a schemaless database in the sense that neither the columns nor the type of data stored in them need to be defined before using them. The open-source code scales linearly to handle petabytes of data on thousands of nodes. HBase can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

HBase is an excellent destination for sensor and log data for future analysis.

HBase scalability is dependent on the number of nodes in the HDInsight cluster.
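As a sketch of the data model described above, the following Python example writes and reads a sensor reading through the happybase client. The host, table, and column-family names are hypothetical, and it assumes an HBase Thrift gateway is reachable from the client (HDInsight clusters are often accessed through the HBase REST API instead).

```python
import happybase

# Hypothetical Thrift gateway host -- substitute your own.
connection = happybase.Connection("hbase-thrift.example.com")

# The table and its column families ("d" for readings, "m" for metadata) must already exist;
# individual columns within a family are created on first write.
table = connection.table("sensor_readings")

# A row key that encodes device and timestamp lets a range scan return one device's history.
table.put(b"sensor-01#2024-01-01T00:00:00Z", {
    b"d:temperature": b"21.4",
    b"d:humidity": b"40",
    b"m:firmware": b"1.2.0",
})

row = table.row(b"sensor-01#2024-01-01T00:00:00Z")
print(row[b"d:temperature"])
```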

Azure SQL Database and Azure Database

Azure offers three different relational databases as platform as a service (PaaS): Azure SQL Database, Azure Database for MySQL, and Azure Database for PostgreSQL.

These products scale up, which means that they're scaled by adding more CPU and memory. You can also choose to use premium disks with these products for better I/O performance.

Azure Analysis Services

Azure Analysis Services (AAS) is an analytical data engine used in decision support and business analytics, providing the analytical data for business reports and client applications such as Power BI, Excel, Reporting Services reports, and other data visualization tools.

Analysis cubes can scale by changing tiers for each individual cube. For more information, see Azure Analysis Services pricing.

Extract and load

Once the data exists in Azure, you can use many services to extract and load it into other products. HDInsight supports Sqoop and Flume.

Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources.

Sqoop uses MapReduce to import and export the data, which provides parallel operation and fault tolerance.
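For example, a Sqoop import of a relational table into cluster storage can be scripted as in the sketch below. The JDBC URL, credentials, table, and target directory are hypothetical, and the command must run where the sqoop client is installed (such as a cluster head node).

```python
import subprocess

# Hypothetical connection details -- substitute your own; run where `sqoop` is on the PATH.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>",
        "--username", "<user>",
        "--password", "<password>",
        "--table", "Sales",
        # Write the imported rows to cluster storage as delimited files.
        "--target-dir", "/ingest/sales",
        # Each mapper imports a slice of the table, which provides the parallelism noted above.
        "--num-mappers", "4",
    ],
    check=True,
)
```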

Apache Flume

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Flume has a simple and flexible architecture based on streaming data flows. Flume is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. Flume uses a simple, extensible data model that allows for online analytic applications.

Apache Flume can't be used with Azure HDInsight. An on-premises Hadoop installation can use Flume to send data to either Azure Storage blobs or Azure Data Lake Storage. For more information, see Using Apache Flume with HDInsight.

Transform

Once data exists in the chosen location, you need to clean it, combine it, or prepare it for a specific usage pattern. Hive, Pig, and Spark SQL are all good choices for that kind of work. They're all supported on HDInsight.
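For example, a minimal Spark SQL clean-up step on HDInsight might look like the following sketch; the storage paths, container names, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-sales").getOrCreate()

# Read the raw flat file from cluster storage and expose it to SQL (paths are hypothetical).
raw = spark.read.csv(
    "wasbs://raw@<account>.blob.core.windows.net/ingest/sales_2024.csv",
    header=True, inferSchema=True,
)
raw.createOrReplaceTempView("raw_sales")

# Deduplicate, drop malformed rows, and normalize types, then write the result back to storage.
cleaned = spark.sql("""
    SELECT DISTINCT SaleId, CustomerId, CAST(Amount AS DECIMAL(18,2)) AS Amount
    FROM raw_sales
    WHERE Amount IS NOT NULL
""")
cleaned.write.mode("overwrite").parquet("wasbs://curated@<account>.blob.core.windows.net/sales/")
```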

Next steps