
Use Apache Hive as an Extract, Transform, and Load (ETL) tool

You typically need to clean and transform incoming data before loading it into a destination suitable for analytics. Extract, Transform, and Load (ETL) operations are used to prepare data and load it into a data destination. Apache Hive on HDInsight can read in unstructured data, process the data as needed, and then load the data into a relational data warehouse for decision support systems. In this approach, data is extracted from the source and stored in scalable storage, such as Azure Storage blobs or Azure Data Lake Storage. The data is then transformed using a sequence of Hive queries and is finally staged within Hive in preparation for bulk loading into the destination data store.

Use case and model overview

The following figure shows an overview of the use case and model for ETL automation. Input data is transformed to generate the appropriate output. During that transformation, the data can change shape, data type, and even language. ETL processes can convert Imperial units to metric, change time zones, and improve precision to properly align with existing data in the destination. ETL processes can also combine new data with existing data to keep reporting up to date, or to provide further insight into existing data. Applications such as reporting tools and services can then consume this data in the desired format.

(Figure: Apache Hive as an ETL architecture)
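As a minimal HiveQL sketch of this kind of transformation, consider converting units, time zones, and precision in one pass. All table and column names here (raw_readings, conformed_readings, tempf, and so on) are hypothetical, not part of the HDInsight sample data:

    -- Convert Fahrenheit to Celsius, normalize local timestamps to UTC,
    -- and round to a consistent precision while copying rows into a
    -- conformed table. All names in this example are hypothetical.
    INSERT OVERWRITE TABLE conformed_readings
    SELECT
        sensorid,
        ROUND((tempf - 32.0) * 5.0 / 9.0, 1) AS tempc,                        -- Imperial to metric
        to_utc_timestamp(reading_time, 'America/Los_Angeles') AS reading_utc, -- time zone change
        ROUND(humidity, 2) AS humidity                                        -- align precision
    FROM raw_readings;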

Hadoop is typically used in ETL processes that import either a massive number of text files (like CSVs) or a smaller but frequently changing number of text files, or both. Hive is a great tool for preparing the data before loading it into the data destination. Hive allows you to create a schema over the CSV and use a SQL-like language to generate MapReduce programs that interact with the data.

The typical steps for using Hive to perform ETL are as follows:

  1. Load data into Azure Data Lake Storage or Azure Blob Storage.

  2. Create a metadata store database (using Azure SQL Database) for Hive to use when storing your schemas.

  3. Create an HDInsight cluster and connect the data store.

  4. Define the schema to apply at read time over data in the data store:

    DROP TABLE IF EXISTS hvac;
    
    -- create the hvac table over comma-separated sensor data stored in Azure Storage blobs
    CREATE EXTERNAL TABLE hvac(
        `date` STRING,
        time STRING,
        targettemp BIGINT,
        actualtemp BIGINT,
        system BIGINT,
        systemage BIGINT,
        buildingid BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE LOCATION 'wasbs://{container}@{storageaccount}.blob.core.windows.net/HdiSamples/SensorSampleData/hvac/';
    
  5. Transform the data and load it into the destination. There are several ways to use Hive during the transformation and loading:

    • Query and prepare data using Hive and save it as a CSV in Azure Data Lake Storage or Azure Blob storage (see the sketch after this list). Then use a tool like SQL Server Integration Services (SSIS) to acquire those CSVs and load the data into a destination relational database such as SQL Server.
    • Query the data directly from Excel or C# using the Hive ODBC driver.
    • Use Apache Sqoop to read the prepared flat CSV files and load them into the destination relational database.
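A minimal sketch of the first option, using the hvac table defined in step 4. The staging table name and the derived column are illustrative, not part of the sample:

    -- Stage transformed data as comma-delimited text in cluster storage,
    -- ready for bulk loading by a tool such as SSIS or Sqoop.
    CREATE TABLE hvac_staging
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    AS
    SELECT
        `date`,
        time,
        buildingid,
        targettemp - actualtemp AS tempdiff   -- derived column produced during the transform
    FROM hvac
    WHERE actualtemp IS NOT NULL;             -- drop rows with missing readings

Because the files backing hvac_staging are plain comma-delimited text, SSIS or Sqoop can read them directly from the table's storage directory.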

Data sources

Data sources are typically external data that can be matched to existing data in your data store, for example:

  • Social media data, log files, sensors, and applications that generate data files.
  • Datasets obtained from data providers, such as weather statistics or vendor sales numbers.
  • Streaming data captured, filtered, and processed through a suitable tool or framework.

Output targets

You can use Hive to output data to a variety of targets, including:

  • A relational database, such as SQL Server or Azure SQL Database.
  • A data warehouse, such as Azure SQL Data Warehouse.
  • Excel.
  • Azure Table and Blob storage.
  • Applications or services that require data to be processed into specific formats, or as files that contain specific types of information structure.
  • A JSON document store, such as Azure Cosmos DB.

Considerations

The ETL model is typically used when you want to:

  • Load streaming data or large volumes of semi-structured or unstructured data from external sources into an existing database or information system.
  • Clean, transform, and validate the data before loading it, perhaps by using more than one transformation pass through the cluster (see the sketch after this list).
  • Generate reports and visualizations that are regularly updated. For example, if a report takes too long to generate during the day, you can schedule it to run at night. To automatically run a Hive query, you can use Azure Logic Apps and PowerShell.
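A minimal sketch of a multi-pass transformation over the hvac table from earlier. The intermediate and output table names, and the validation thresholds, are illustrative:

    -- Pass 1: basic cleaning and validation into an intermediate table.
    CREATE TABLE hvac_clean AS
    SELECT *
    FROM hvac
    WHERE buildingid IS NOT NULL
      AND actualtemp BETWEEN -40 AND 150;   -- discard implausible sensor values

    -- Pass 2: aggregate the cleaned data for reporting.
    CREATE TABLE hvac_daily AS
    SELECT `date`, buildingid, AVG(actualtemp) AS avgtemp
    FROM hvac_clean
    GROUP BY `date`, buildingid;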

If the target for the data isn't a database, you can generate a file in the appropriate format, such as a CSV, within the query. This file can then be imported into Excel or Power BI.
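For example, a minimal sketch that writes query results as comma-delimited files; the output path is illustrative:

    -- Write query results as comma-delimited text files under the given
    -- directory in cluster storage. The resulting files can then be
    -- downloaded and opened in Excel or imported into Power BI.
    INSERT OVERWRITE DIRECTORY '/example/output/hvac-report'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT `date`, buildingid, targettemp, actualtemp
    FROM hvac;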

If you need to execute several operations on the data as part of the ETL process, consider how you manage them. If the operations are controlled by an external program, rather than as a workflow within the solution, you need to decide whether some operations can be executed in parallel, and to detect when each job completes. Using a workflow mechanism such as Oozie within Hadoop may be easier than trying to orchestrate a sequence of operations using external scripts or custom programs. For more information about Oozie, see Workflow and job orchestration.

Next steps