教程:在 SQL Server 大数据群集中查询 HDFSTutorial: Query HDFS in a SQL Server big data cluster

适用于:Applies to: 是SQL Server 2019 (15.x)SQL Server 2019 (15.x)yesSQL Server 2019 (15.x)SQL Server 2019 (15.x)适用于:Applies to: 是SQL Server 2019 (15.x)SQL Server 2019 (15.x)yesSQL Server 2019 (15.x)SQL Server 2019 (15.x)

本教程演示如何在 SQL Server 2019 大数据群集SQL Server 2019 Big Data Clusters 中查询 HDFS 数据。This tutorial demonstrates how to Query HDFS data in a SQL Server 2019 大数据群集SQL Server 2019 Big Data Clusters.

在本教程中,你将了解如何执行以下操作:In this tutorial, you learn how to:

  • 创建指向大数据群集中 HDFS 数据的外部表。Create an external table pointing to HDFS data in a big data cluster.
  • 将此数据与主实例中的高值数据联接起来。Join this data with high-value data in the master instance.

提示

如果需要,可以下载并运行本教程中的命令脚本。If you prefer, you can download and run a script for the commands in this tutorial. 有关说明,请参阅 GitHub 上的数据虚拟化示例For instructions, see the Data virtualization samples on GitHub.

这段 7 分钟的视频将引导你逐步了解在大数据集群中查询 HDFS 数据的步骤:This 7-minute video walks you through querying HDFS data in a big data cluster:

先决条件Prerequisites

为 HDFS 创建外部表Create an external table to HDFS

存储池包含存储在 HDFS 中的 CSV 文件中的 Web 点击流数据。The storage pool contains web clickstream data in a CSV file stored in HDFS. 使用以下步骤定义可访问该文件中的数据的外部表。Use the following steps to define an external table that can access the data in that file.

  1. 在 Azure Data Studio 中,连接到大数据群集的 SQL Server 主实例。In Azure Data Studio, connect to the SQL Server master instance of your big data cluster. 有关详细信息,请参阅连接到 SQL Server 主实例For more information, see Connect to the SQL Server master instance.

  2. 双击“服务器”窗口中的连接,以显示 SQL Server 主实例的服务器仪表板。Double-click on the connection in the Servers window to show the server dashboard for the SQL Server master instance. 选择“新建查询”。Select New Query.

    SQL Server 主实例查询

  3. 运行以下 Transact-SQL 命令,将上下文更改为主实例中的 Sales 数据库。Run the following Transact-SQL command to change the context to the Sales database in the master instance.

    USE Sales
    GO
    
  4. 定义从 HDFS 读取的 CSV 文件格式。Define the format of the CSV file to read from HDFS. 按 F5 运行本语句。Press F5 to run the statement.

    CREATE EXTERNAL FILE FORMAT csv_file
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS(
            FIELD_TERMINATOR = ',',
            STRING_DELIMITER = '"',
            FIRST_ROW = 2,
            USE_TYPE_DEFAULT = TRUE)
    );
    
  5. 如果尚未创建存储池的外部数据源,请创建该数据源。Create an external data source to the storage pool if it does not already exist.

    IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    BEGIN
      CREATE EXTERNAL DATA SOURCE SqlStoragePool
      WITH (LOCATION = 'sqlhdfs://controller-svc/default');
    END
    
  6. 创建可以从存储池读取 /clickstream_data 的外部表。Create an external table that can read the /clickstream_data from the storage pool. 可以从大数据群集的主实例访问 SqlStoragePool。The SqlStoragePool is accessible from the master instance of a big data cluster.

    CREATE EXTERNAL TABLE [web_clickstreams_hdfs]
    ("wcs_click_date_sk" BIGINT , "wcs_click_time_sk" BIGINT , "wcs_sales_sk" BIGINT , "wcs_item_sk" BIGINT , "wcs_web_page_sk" BIGINT , "wcs_user_sk" BIGINT)
    WITH
    (
        DATA_SOURCE = SqlStoragePool,
        LOCATION = '/clickstream_data',
        FILE_FORMAT = csv_file
    );
    GO
    

查询数据Query the data

运行以下查询,将 web_clickstream_hdfs 外部表中的 HDFS 数据与本地 Sales 数据库中的关系数据联接起来。Run the following query to join the HDFS data in the web_clickstream_hdfs external table with the relational data in the local Sales database.

SELECT  
    wcs_user_sk,
    SUM( CASE WHEN i_category = 'Books' THEN 1 ELSE 0 END) AS book_category_clicks,
    SUM( CASE WHEN i_category_id = 1 THEN 1 ELSE 0 END) AS [Home & Kitchen],
    SUM( CASE WHEN i_category_id = 2 THEN 1 ELSE 0 END) AS [Music],
    SUM( CASE WHEN i_category_id = 3 THEN 1 ELSE 0 END) AS [Books],
    SUM( CASE WHEN i_category_id = 4 THEN 1 ELSE 0 END) AS [Clothing & Accessories],
    SUM( CASE WHEN i_category_id = 5 THEN 1 ELSE 0 END) AS [Electronics],
    SUM( CASE WHEN i_category_id = 6 THEN 1 ELSE 0 END) AS [Tools & Home Improvement],
    SUM( CASE WHEN i_category_id = 7 THEN 1 ELSE 0 END) AS [Toys & Games],
    SUM( CASE WHEN i_category_id = 8 THEN 1 ELSE 0 END) AS [Movies & TV],
    SUM( CASE WHEN i_category_id = 9 THEN 1 ELSE 0 END) AS [Sports & Outdoors]
  FROM [dbo].[web_clickstreams_hdfs]
  INNER JOIN item it ON (wcs_item_sk = i_item_sk
                        AND wcs_user_sk IS NOT NULL)
GROUP BY  wcs_user_sk;
GO

清除Clean up

使用以下命令删除本教程中使用的外部表。Use the following command to remove the external table used in this tutorial.

DROP EXTERNAL TABLE [dbo].[web_clickstreams_hdfs];
GO

后续步骤Next steps

请继续学习下一篇文章,了解如何从大数据群集查询 Oracle。Advance to the next article to learn how to query Oracle from a big data cluster.