教程:在 SQL Server 大数据群集中查询 HDFSTutorial: Query HDFS in a SQL Server big data cluster
适用于:Applies to: SQL Server 2019 (15.x)SQL Server 2019 (15.x)
SQL Server 2019 (15.x)SQL Server 2019 (15.x)
SQL Server 2019 (15.x)SQL Server 2019 (15.x)
SQL Server 2019 (15.x)SQL Server 2019 (15.x)
本教程演示如何在 SQL Server 2019 大数据群集SQL Server 2019 Big Data Clusters 中查询 HDFS 数据。This tutorial demonstrates how to Query HDFS data in a SQL Server 2019 大数据群集SQL Server 2019 Big Data Clusters.
在本教程中,你将了解如何执行以下操作:In this tutorial, you learn how to:
- 创建指向大数据群集中 HDFS 数据的外部表。Create an external table pointing to HDFS data in a big data cluster.
- 将此数据与主实例中的高值数据联接起来。Join this data with high-value data in the master instance.
提示
如果需要,可以下载并运行本教程中的命令脚本。If you prefer, you can download and run a script for the commands in this tutorial. 有关说明,请参阅 GitHub 上的数据虚拟化示例。For instructions, see the Data virtualization samples on GitHub.
这段 7 分钟的视频将引导你逐步了解在大数据集群中查询 HDFS 数据的步骤:This 7-minute video walks you through querying HDFS data in a big data cluster:
先决条件Prerequisites
- 大数据工具Big data tools
- kubectlkubectl
- Azure Data StudioAzure Data Studio
- SQL Server 2019 扩展SQL Server 2019 extension
- 将示例数据加载到大数据群集中Load sample data into your big data cluster
为 HDFS 创建外部表Create an external table to HDFS
存储池包含存储在 HDFS 中的 CSV 文件中的 Web 点击流数据。The storage pool contains web clickstream data in a CSV file stored in HDFS. 使用以下步骤定义可访问该文件中的数据的外部表。Use the following steps to define an external table that can access the data in that file.
在 Azure Data Studio 中,连接到大数据群集的 SQL Server 主实例。In Azure Data Studio, connect to the SQL Server master instance of your big data cluster. 有关详细信息,请参阅连接到 SQL Server 主实例。For more information, see Connect to the SQL Server master instance.
双击“服务器”窗口中的连接,以显示 SQL Server 主实例的服务器仪表板。Double-click on the connection in the Servers window to show the server dashboard for the SQL Server master instance. 选择“新建查询”。Select New Query.
运行以下 Transact-SQL 命令,将上下文更改为主实例中的 Sales 数据库。Run the following Transact-SQL command to change the context to the Sales database in the master instance.
USE Sales GO
定义从 HDFS 读取的 CSV 文件格式。Define the format of the CSV file to read from HDFS. 按 F5 运行本语句。Press F5 to run the statement.
CREATE EXTERNAL FILE FORMAT csv_file WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2, USE_TYPE_DEFAULT = TRUE) );
如果尚未创建存储池的外部数据源,请创建该数据源。Create an external data source to the storage pool if it does not already exist.
IF NOT EXISTS(SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool') BEGIN CREATE EXTERNAL DATA SOURCE SqlStoragePool WITH (LOCATION = 'sqlhdfs://controller-svc/default'); END
创建可以从存储池读取
/clickstream_data
的外部表。Create an external table that can read the/clickstream_data
from the storage pool. 可以从大数据群集的主实例访问 SqlStoragePool。The SqlStoragePool is accessible from the master instance of a big data cluster.CREATE EXTERNAL TABLE [web_clickstreams_hdfs] ("wcs_click_date_sk" BIGINT , "wcs_click_time_sk" BIGINT , "wcs_sales_sk" BIGINT , "wcs_item_sk" BIGINT , "wcs_web_page_sk" BIGINT , "wcs_user_sk" BIGINT) WITH ( DATA_SOURCE = SqlStoragePool, LOCATION = '/clickstream_data', FILE_FORMAT = csv_file ); GO
查询数据Query the data
运行以下查询,将 web_clickstream_hdfs
外部表中的 HDFS 数据与本地 Sales
数据库中的关系数据联接起来。Run the following query to join the HDFS data in the web_clickstream_hdfs
external table with the relational data in the local Sales
database.
SELECT
wcs_user_sk,
SUM( CASE WHEN i_category = 'Books' THEN 1 ELSE 0 END) AS book_category_clicks,
SUM( CASE WHEN i_category_id = 1 THEN 1 ELSE 0 END) AS [Home & Kitchen],
SUM( CASE WHEN i_category_id = 2 THEN 1 ELSE 0 END) AS [Music],
SUM( CASE WHEN i_category_id = 3 THEN 1 ELSE 0 END) AS [Books],
SUM( CASE WHEN i_category_id = 4 THEN 1 ELSE 0 END) AS [Clothing & Accessories],
SUM( CASE WHEN i_category_id = 5 THEN 1 ELSE 0 END) AS [Electronics],
SUM( CASE WHEN i_category_id = 6 THEN 1 ELSE 0 END) AS [Tools & Home Improvement],
SUM( CASE WHEN i_category_id = 7 THEN 1 ELSE 0 END) AS [Toys & Games],
SUM( CASE WHEN i_category_id = 8 THEN 1 ELSE 0 END) AS [Movies & TV],
SUM( CASE WHEN i_category_id = 9 THEN 1 ELSE 0 END) AS [Sports & Outdoors]
FROM [dbo].[web_clickstreams_hdfs]
INNER JOIN item it ON (wcs_item_sk = i_item_sk
AND wcs_user_sk IS NOT NULL)
GROUP BY wcs_user_sk;
GO
清除Clean up
使用以下命令删除本教程中使用的外部表。Use the following command to remove the external table used in this tutorial.
DROP EXTERNAL TABLE [dbo].[web_clickstreams_hdfs];
GO
后续步骤Next steps
请继续学习下一篇文章,了解如何从大数据群集查询 Oracle。Advance to the next article to learn how to query Oracle from a big data cluster.