什么是 PolyBase?What is PolyBase?

适用于: SQL Server Azure SQL 数据库 Azure Synapse Analytics (SQL DW) Parallel 数据仓库

借助 PolyBase,SQL Server 2016 实例可处理从 Hadoop 中读取数据的 Transact-SQL 查询。PolyBase enables your SQL Server 2016 instance to process Transact-SQL queries that read data from Hadoop. 同一查询还可以访问 SQL Server 中的关系表。The same query can also access relational tables in your SQL Server. 借助 PolyBase,同一查询还可以联接 Hadoop 和 SQL Server 中的数据。PolyBase enables the same query to also join the data from Hadoop and SQL Server. 在 SQL Server 中,外部表外部数据源提供对 Hadoop 的连接。In SQL Server, an external table or external data source provides the connection to Hadoop.

PolyBase 逻辑PolyBase logical

PolyBase 将一些计算推送到 Hadoop 节点,以优化总体查询。PolyBase pushes some computations to the Hadoop node to optimize the overall query. 不过,PolyBase 外部访问不仅限于 Hadoop。However, PolyBase external access is not limited to Hadoop. 其他未结构化的非关系表也受支持,如带分隔符的文本文件。Other unstructured non-relational tables are also supported, such as delimited text files.

提示

SQL Server 2019 为 PolyBase 引入了新的连接器,包括 SQL Server、Oracle、Teradata 和 MongoDB。SQL Server 2019 introduces new connectors for PolyBase, including SQL Server, Oracle, Teradata, and MongoDB. 有关详细信息,请参阅 SQL Server 2019 的 PolyBase 文档For more information, see the PolyBase documentation for SQL Server 2019

借助 PolyBase,SQL Server 实例可处理从外部数据源中读取数据的 Transact-SQL 查询。PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. SQL Server 2016 及更高版本可以访问 Hadoop 和 Azure Blob 存储中的外部数据。SQL Server 2016 and higher can access external data in Hadoop and Azure Blob Storage. 从 SQL Server 2019 开始,现在可以使用 PolyBase 访问 SQL ServerOracleTeradataMongoDB 中的外部数据。Starting in SQL Server 2019, you can now use PolyBase to access external data in SQL Server, Oracle, Teradata, and MongoDB.

访问外部数据的相同查询还可以定位 SQL Server 实例中的关系表。The same queries that access external data can also target relational tables in your SQL Server instance. 这样可以将外部源中的数据与数据库中的高价值关系数据合并。This allows you to combine data from external sources with high-value relational data in your database. 在 SQL Server 中,外部表外部数据源提供对 Hadoop 的连接。In SQL Server, an external table or external data source provides the connection to Hadoop.

PolyBase 将一些计算推送到 Hadoop 节点,以优化总体查询。PolyBase pushes some computations to the Hadoop node to optimize the overall query. 不过,PolyBase 外部访问不仅限于 Hadoop。However, PolyBase external access is not limited to Hadoop. 其他未结构化的非关系表也受支持,如带分隔符的文本文件。Other unstructured non-relational tables are also supported, such as delimited text files.

支持的 SQL 产品和服务Supported SQL products and services

PolyBase 对以下 Microsoft SQL 产品提供这些相同功能:PolyBase provides these same functionalities for the following SQL products from Microsoft:

  • SQL Server 2016 及更高版本(仅限 Windows)SQL Server 2016 and later versions (Windows only)
  • 分析平台系统(旧称为“并行数据仓库”)Analytics Platform System (formerly Parallel Data Warehouse)
  • Azure Synapse AnalyticsAzure Synapse Analytics

Azure 集成Azure integration

借助 PolyBase 的基础帮助,T-SQL 查询还可以将数据导入和导出 Azure Blob 存储。With the underlying help of PolyBase, T-SQL queries can also import and export data from Azure Blob Storage. 此外,借助 PolyBase,Azure Synapse Analytics 还可以将数据导入和导出 Azure Data Lake Store 和 Azure Blob 存储。Further, PolyBase enables Azure Synapse Analytics to import and export data from Azure Data Lake Store, and from Azure Blob Storage.

为什么要用 PolyBase?Why use PolyBase?

过去联接 SQL Server 数据与外部数据的难度更大。In the past it was more difficult to join your SQL Server data with external data. 具体有下列两种不方便的方法:You had the two following unpleasant options:

  • 传输一半数据,这样所有数据都采用一种格式或其他格式。Transfer half your data so that all your data was in one format or the other.
  • 查询两个数据源,然后编写自定义查询逻辑,以在客户端一级联接和集成数据。Query both sources of data, then write custom query logic to join and integrate the data at the client level.

PolyBase 使用 T-SQL 来联接数据,因此可避免使用这两种不方便的方法。PolyBase avoids those unpleasant options by using T-SQL to join the data.

为了简单起见,PolyBase 不要求在 Hadoop 环境中安装其他软件。To keep things simple, PolyBase does not require you to install additional software to your Hadoop environment. 查询外部数据所用的 T-SQL 语法也是用于查询数据库表的语法。You query external data by using the same T-SQL syntax used to query a database table. PolyBase 实现的所有支持操作全都以透明方式发生。The support actions implemented by PolyBase all happen transparently. 查询作者无需对 Hadoop 有任何了解。The query author does not need any knowledge about Hadoop.

PolyBase 用法PolyBase uses

PolyBase 支持在 SQL Server 中使用以下方案:PolyBase enables the following scenarios in SQL Server:

  • 通过 SQL Server 或 PDW 查询 Hadoop 中存储的数据。Query data stored in Hadoop from SQL Server or PDW. 用户将数据存储在经济高效的分布式、可扩展系统中,例如 Hadoop。Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. PolyBase 使得使用 T-SQL 查询数据更加容易。PolyBase makes it easy to query the data by using T-SQL.

  • 查询存储在 Azure Blob 存储中的数据。Query data stored in Azure Blob Storage. Azure blob 存储是一个方便存储供 Azure 服务使用的数据的位置。Azure blob storage is a convenient place to store data for use by Azure services. PolyBase 使得使用 T-SQL 访问数据变得更加容易。PolyBase makes it easy to access the data by using T-SQL.

  • 导入 Hadoop、Azure Blob 存储或 Azure Data Lake Store 中的数据。Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store. 通过将 Hadoop、Azure Blob 存储或 Azure Data Lake Store 中的数据导入到关系表中,利用 Microsoft SQL 的列存储技术和分析功能的速度优势。Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop, Azure Blob Storage, or Azure Data Lake Store into relational tables. 不需要单独的 ETL 或导入工具。There is no need for a separate ETL or import tool.

  • 将数据导出到 Hadoop、Azure Blob 存储或 Azure Data Lake Store。Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store. 将数据存档到 Hadoop、Azure Blob 存储或 Azure Data Lake Store,以获得经济高效的存储,并使数据保持联机以便于访问。Archive data to Hadoop, Azure Blob Storage, or Azure Data Lake Store to achieve cost-effective storage and keep it online for easy access.

  • 与 BI 工具集成Integrate with BI tools. 结合使用 PolyBase 和 Microsoft 的商业智能和分析堆栈,或使用任何与 SQL Server 兼容的第三方工具。Use PolyBase with Microsoft's business intelligence and analysis stack, or use any third party tools that are compatible with SQL Server.

性能Performance

  • 将计算推送到 Hadoop。Push computation to Hadoop. 查询优化器制定基于成本的决策,以在执行此操作将提升查询性能时将计算推送到 Hadoop。The query optimizer makes a cost-based decision to push computation to Hadoop, if that will improve query performance. 查询优化器使用外部表上的统计来制定基于成本的决策。The query optimizer uses statistics on external tables to make the cost-based decision. 推送计算会创建 MapReduce 作业并利用 Hadoop 的分布计算资源。Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.

  • 缩放计算资源。Scale compute resources. 若要提高查询性能,可以使用 SQL Server PolyBase 横向扩展组To improve query performance, you can use SQL Server PolyBase scale-out groups. 这使并行数据可以在 SQL Server 实例和 Hadoop 节点之间传输,并为处理外部数据添加计算资源。This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data.

后续步骤Next steps

在使用 PolyBase 之前,必须安装 PolyBase 功能Before using PolyBase, you must install the PolyBase feature. 然后,请参阅以下配置指南,具体取决于你的数据源:Then see the following configuration guides depending on your data source:

后续步骤Next steps

在使用 PolyBase 之前,必须安装 PolyBase 功能Before using PolyBase, you must install the PolyBase feature. 然后,请参阅以下配置指南,具体取决于你的数据源:Then see the following configuration guides depending on your data source: