
Tutorial: Extract, transform, and load data by using Apache Hive on Azure HDInsight

In this tutorial, you perform an ETL operation: extract, transform, and load data. You take a raw CSV data file, import it into an Azure HDInsight cluster, transform it with Apache Hive, and load it into an Azure SQL database with Apache Sqoop.

In this tutorial, you learn how to:

  • Extract and upload the data to an HDInsight cluster.
  • Transform the data by using Apache Hive.
  • Load the data into an Azure SQL database by using Sqoop.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

Download the flight data

  1. Browse to Research and Innovative Technology Administration, Bureau of Transportation Statistics.

  2. On the page, select the following values:

    Name           Value
    Filter Year    2013
    Filter Period  January
    Fields         Year, FlightDate, Reporting_Airline, IATA_CODE_Reporting_Airline, Flight_Number_Reporting_Airline, OriginAirportID, Origin, OriginCityName, OriginState, DestAirportID, Dest, DestCityName, DestState, DepDelayMinutes, ArrDelay, ArrDelayMinutes, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay

    Clear all other fields.

  3. Select Download. You get a .zip file with the data fields you selected.

Extract and upload the data

In this section, you upload the data to your HDInsight cluster and then copy it to your Data Lake Storage Gen2 account.

  1. Open a command prompt and use the following secure copy (scp) command to upload the .zip file to the HDInsight cluster head node:

    scp <file-name>.zip <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net:<file-name.zip>
    
    • Replace the <file-name> placeholder with the name of the .zip file.
    • Replace the <ssh-user-name> placeholder with the SSH login for the HDInsight cluster.
    • Replace the <cluster-name> placeholder with the name of the HDInsight cluster.

    If you use a password to authenticate your SSH login, you're prompted for the password.

    If you use a public key, you might need to use the -i parameter and specify the path to the matching private key. For example: scp -i ~/.ssh/id_rsa <file-name>.zip <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net:.

  2. After the upload has finished, connect to the cluster by using SSH. At the command prompt, enter the following command:

    ssh <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net
    
  3. Use the following command to unzip the .zip file:

    unzip <file-name>.zip
    

    The command extracts a .csv file.

  4. Use the following command to create the Data Lake Storage Gen2 file system:

    hadoop fs -D "fs.azure.createRemoteFileSystemDuringInitialization=true" -ls abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/
    

    Replace the <file-system-name> placeholder with the name that you want to give your file system.

    Replace the <storage-account-name> placeholder with the name of your storage account.

  5. Use the following command to create a directory:

    hdfs dfs -mkdir -p abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data
    
  6. Use the following command to copy the .csv file to the directory:

    hdfs dfs -put "<file-name>.csv" abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data/
    

    Use quotes around the file name if it contains spaces or special characters.
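The abfs:// URIs in steps 4 through 6 all follow the same pattern: a file system (container) name, an @ sign, the storage account's Data Lake Storage Gen2 endpoint, and a path. A quick sketch with hypothetical names (flightsfs and mystorageacct are placeholders, not values from this tutorial) shows how the pieces fit together:

```shell
# Hypothetical names for illustration only
file_system=flightsfs
storage_account=mystorageacct

# Assemble the abfs URI in the shape the hadoop/hdfs commands above expect
data_dir="abfs://${file_system}@${storage_account}.dfs.core.windows.net/tutorials/flightdelays/data"
echo "$data_dir"
# → abfs://flightsfs@mystorageacct.dfs.core.windows.net/tutorials/flightdelays/data
```

Building the URI in a variable like this also makes the later mkdir and put commands easier to keep consistent.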

Transform the data

In this section, you use Beeline to run an Apache Hive job.

As part of the Apache Hive job, you import the data from the .csv file into an Apache Hive table named delays.
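The raw BTS export wraps text fields in double quotes, which is why the Hive script that follows calls substring(col, 2, length(col) - 1) on those columns. A local sketch of the same trimming on a hypothetical field value, using shell parameter expansion:

```shell
# A hypothetical quoted field as it might appear in the raw CSV
field='"JFK"'

# Strip the leading and trailing double quotes, mirroring the effect of
# substring(field, 2, length(field) - 1) in the Hive script below
trimmed=${field#\"}
trimmed=${trimmed%\"}
echo "$trimmed"   # → JFK
```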

  1. From the SSH prompt that you already have for the HDInsight cluster, use the following command to create and edit a new file named flightdelays.hql:

    nano flightdelays.hql
    
  2. Modify the following text by replacing the <file-system-name> and <storage-account-name> placeholders with your file system and storage account names. Then copy and paste the text into the nano console by pressing the SHIFT key while right-clicking.

    DROP TABLE delays_raw;
    -- Creates an external table over the csv file
    CREATE EXTERNAL TABLE delays_raw (
        YEAR string,
        FL_DATE string,
        UNIQUE_CARRIER string,
        CARRIER string,
        FL_NUM string,
        ORIGIN_AIRPORT_ID string,
        ORIGIN string,
        ORIGIN_CITY_NAME string,
        ORIGIN_CITY_NAME_TEMP string,
        ORIGIN_STATE_ABR string,
        DEST_AIRPORT_ID string,
        DEST string,
        DEST_CITY_NAME string,
        DEST_CITY_NAME_TEMP string,
        DEST_STATE_ABR string,
        DEP_DELAY_NEW float,
        ARR_DELAY_NEW float,
        CARRIER_DELAY float,
        WEATHER_DELAY float,
        NAS_DELAY float,
        SECURITY_DELAY float,
        LATE_AIRCRAFT_DELAY float)
    -- The following lines describe the format and location of the file
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION 'abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/data';
    
    -- Drop the delays table if it exists
    DROP TABLE delays;
    -- Create the delays table and populate it with data
    -- pulled in from the CSV file (via the external table defined previously)
    CREATE TABLE delays
    LOCATION 'abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/processed'
    AS
    SELECT YEAR AS year,
        FL_DATE AS flight_date,
        substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
        substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
        substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
        ORIGIN_AIRPORT_ID AS origin_airport_id,
        substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
        substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
        substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1)  AS origin_state_abr,
        DEST_AIRPORT_ID AS dest_airport_id,
        substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
        substring(DEST_CITY_NAME,2) AS dest_city_name,
        substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
        DEP_DELAY_NEW AS dep_delay_new,
        ARR_DELAY_NEW AS arr_delay_new,
        CARRIER_DELAY AS carrier_delay,
        WEATHER_DELAY AS weather_delay,
        NAS_DELAY AS nas_delay,
        SECURITY_DELAY AS security_delay,
        LATE_AIRCRAFT_DELAY AS late_aircraft_delay
    FROM delays_raw;
    
  3. Save the file by using CTRL+X, and then enter Y when prompted.

  4. To start Hive and run the flightdelays.hql file, use the following command:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f flightdelays.hql
    
  5. After the flightdelays.hql script finishes running, use the following command to open an interactive Beeline session:

    beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http'
    
  6. When you receive the jdbc:hive2://localhost:10001/> prompt, use the following query to retrieve data from the imported flight delay data:

    INSERT OVERWRITE DIRECTORY '/tutorials/flightdelays/output'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    SELECT regexp_replace(origin_city_name, '''', ''),
        avg(weather_delay)
    FROM delays
    WHERE weather_delay IS NOT NULL
    GROUP BY origin_city_name;
    

    This query retrieves a list of cities that experienced weather delays, along with the average delay time, and saves it to abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/output. Later, Sqoop reads the data from this location and exports it to Azure SQL Database.

  7. To exit Beeline, enter !quit at the prompt.
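The query in step 6 groups rows by origin city and averages the weather delay, writing tab-separated output. The same aggregation can be sketched locally with awk over a few hypothetical rows in the shape the query emits (city, a tab, then the delay in minutes):

```shell
# Hypothetical sample rows in the same tab-separated shape the query emits
printf 'Chicago, IL\t10\nChicago, IL\t20\nDenver, CO\t5\n' |
awk -F'\t' '{ sum[$1] += $2; count[$1]++ }
            END { for (city in sum) printf "%s\t%g\n", city, sum[city]/count[city] }' |
sort
```

Each output line holds a city name and its average delay, tab-separated — the same layout Sqoop later splits on with --fields-terminated-by '\t'.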

Create a SQL database table

You need the server name from your SQL database for this operation. Complete these steps to find your server name.

  1. Go to the Azure portal.

  2. Select SQL Databases.

  3. Filter on the name of the database that you want to use.

  4. Note the server name, which is listed in the Server name column.

    There are many ways to connect to SQL Database and create a table. The following steps use FreeTDS from the HDInsight cluster.

  5. To install FreeTDS, use the following command from an SSH connection to the cluster:

    sudo apt-get --assume-yes install freetds-dev freetds-bin
    
  6. After the installation completes, use the following command to connect to the SQL Database server:

    TDSVER=8.0 tsql -H '<server-name>.database.windows.net' -U '<admin-login>' -p 1433 -D '<database-name>'
    
    • Replace the <server-name> placeholder with the SQL Database server name.

    • Replace the <admin-login> placeholder with the admin login for SQL Database.

    • Replace the <database-name> placeholder with the database name.

    When you're prompted, enter the password for the SQL Database admin login.

    You receive output similar to the following text:

    locale is "en_US.UTF-8"
    locale charset is "UTF-8"
    using default charset "UTF-8"
    Default database being set to sqooptest
    1>
    
  7. At the 1> prompt, enter the following statements:

    CREATE TABLE [dbo].[delays](
    [OriginCityName] [nvarchar](50) NOT NULL,
    [WeatherDelay] float,
    CONSTRAINT [PK_delays] PRIMARY KEY CLUSTERED
    ([OriginCityName] ASC))
    GO
    
  8. When the GO statement is entered, the previous statements are evaluated.

    The query creates a table named delays with a clustered index.

  9. Use the following query to verify that the table was created:

    SELECT * FROM information_schema.tables
    GO
    

    The output is similar to the following text:

    TABLE_CATALOG   TABLE_SCHEMA    TABLE_NAME      TABLE_TYPE
    databaseName       dbo             delays        BASE TABLE
    
  10. Enter exit at the 1> prompt to exit the tsql utility.

Export and load the data

In the previous sections, you copied the transformed data to abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/output. In this section, you use Sqoop to export the data from that location to the table you created in the Azure SQL database.
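Both Sqoop commands below take a JDBC connection string for Azure SQL Database. Its shape, sketched with hypothetical server and database names (myserver and sqooptest are placeholders for your own values):

```shell
# Hypothetical values for illustration only
server_name=myserver
database_name=sqooptest

# The connect string the sqoop export command expects:
# SQL Server JDBC scheme, the server's public endpoint on port 1433,
# and the target database as a semicolon-delimited property
connect="jdbc:sqlserver://${server_name}.database.windows.net:1433;database=${database_name}"
echo "$connect"
# → jdbc:sqlserver://myserver.database.windows.net:1433;database=sqooptest
```

Because the string contains a semicolon, keep it single-quoted on the sqoop command line so the shell doesn't treat it as a command separator.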

  1. Use the following command to verify that Sqoop can see your SQL database:

    sqoop list-databases --connect jdbc:sqlserver://<SERVER_NAME>.database.windows.net:1433 --username <ADMIN_LOGIN> --password <ADMIN_PASSWORD>
    

    The command returns a list of databases, including the database in which you created the delays table.

  2. Use the following command to export the data to the delays table:

    sqoop export --connect 'jdbc:sqlserver://<SERVER_NAME>.database.windows.net:1433;database=<DATABASE_NAME>' --username <ADMIN_LOGIN> --password <ADMIN_PASSWORD> --table 'delays' --export-dir 'abfs://<file-system-name>@<storage-account-name>.dfs.core.windows.net/tutorials/flightdelays/output' --fields-terminated-by '\t' -m 1
    

    Sqoop connects to the database that contains the delays table and exports data from the /tutorials/flightdelays/output directory to the delays table.

  3. After the sqoop command finishes, use the tsql utility to connect to the database:

    TDSVER=8.0 tsql -H <SERVER_NAME>.database.windows.net -U <ADMIN_LOGIN> -P <ADMIN_PASSWORD> -p 1433 -D <DATABASE_NAME>
    
  4. Use the following statements to verify that the data was exported to the delays table:

    SELECT * FROM delays
    GO
    

    You should see a listing of data in the table. The table includes the city name and the average flight delay time for that city.

  5. Enter exit to exit the tsql utility.

Clean up resources

All resources used in this tutorial are preexisting. No cleanup is necessary.

Next steps

To learn more ways to work with data in HDInsight, see the following article: