什么是 PolyBase?What is PolyBase?

适用于: SQL Server Azure SQL 数据库 Azure Synapse Analytics 并行数据仓库

借助 PolyBase,SQL Server 实例可处理从外部数据源中读取数据的 Transact-SQL 查询。PolyBase enables your SQL Server instance to process Transact-SQL queries that read data from external data sources. 同一查询还可以访问 SQL Server 实例中的关系表。The same query can also access relational tables in your instance of SQL Server. 借助 PolyBase,同一查询还可以联接外部源和 SQL Server 中的数据。PolyBase enables the same query to also join the data from external sources and SQL Server.

若要在 SQL Server 实例中使用 PolyBase,请执行以下操作:To use PolyBase, in an instance of SQL Server:

  1. 在 Windows 上安装 PolyBaseInstall PolyBase on Windows
  2. 创建外部数据源Create an external data source
  3. 创建外部表Create an external table

它们一起提供与外部数据源的连接。Together, these provide the connection to the external data source.

SQL Server 2016 引入了 PolyBase,支持连接到 Hadoop 和 Azure Blob 存储。SQL Server 2016 introduces PolyBase with support for connections to Hadoop and Azure Blob Storage.

SQL Server 2019 引入了其他连接器,包括 SQL Server、Oracle、Teradata 和 MongoDB。SQL Server 2019 introduces additional connectors, including SQL Server, Oracle, Teradata, and MongoDB.

PolyBase 逻辑PolyBase logical

PolyBase 将一些计算推送到外部源,以优化总体查询。PolyBase pushes some computations to the external source to optimize the overall query. PolyBase 外部访问不仅限于 Hadoop。PolyBase external access is not limited to Hadoop. 其他未结构化的非关系表也受支持,如带分隔符的文本文件。Other unstructured non-relational tables are also supported, such as delimited text files.

外部连接器的示例包括:Examples of external connectors include:

支持的 SQL 产品和服务Supported SQL products and services

PolyBase 对以下 Microsoft SQL 产品提供这些相同功能:PolyBase provides these same functionalities for the following SQL products from Microsoft:

  • SQL Server 2016 及更高版本(仅限 Windows)SQL Server 2016 and later versions (Windows only)
  • 分析平台系统(旧称为“并行数据仓库”)Analytics Platform System (formerly Parallel Data Warehouse)
  • Azure Synapse AnalyticsAzure Synapse Analytics

Azure 集成Azure integration

借助 PolyBase 的基础帮助,T-SQL 查询还可以将数据导入和导出 Azure Blob 存储。With the underlying help of PolyBase, T-SQL queries can also import and export data from Azure Blob Storage. 此外,借助 PolyBase,Azure Synapse Analytics 还可以将数据导入和导出 Azure Data Lake Store 和 Azure Blob 存储。Further, PolyBase enables Azure Synapse Analytics to import and export data from Azure Data Lake Store, and from Azure Blob Storage.

为什么要用 PolyBase?Why use PolyBase?

PolyBase 允许你将来自 SQL Server 实例的数据与外部数据连接起来。PolyBase allows you to join data from a SQL Server instance with external data. 在 PolyBase 将数据连接到外部数据源之前,你可以:Prior to PolyBase to join data to external data sources you could either:

  • 传输一半数据,这样所有数据都在一个位置。Transfer half your data so that all the data was in one location.
  • 查询两个数据源,然后编写自定义查询逻辑,以在客户端一级联接和集成数据。Query both sources of data, then write custom query logic to join and integrate the data at the client level.

PolyBase 允许你简单地使用 Transact-SQL 来联接数据。PolyBase allows you to simply use Transact-SQL to join the data.

PolyBase 不要求向 Hadoop 环境安装其他软件。PolyBase does not require you to install additional software to your Hadoop environment. 查询外部数据所用的 T-SQL 语法也是用于查询数据库表的语法。You query external data by using the same T-SQL syntax used to query a database table. PolyBase 实现的所有支持操作全都以透明方式发生。The support actions implemented by PolyBase all happen transparently. 查询作者无需对外部源有任何了解。The query author does not need any knowledge about the external source.

PolyBase 用法PolyBase uses

PolyBase 支持在 SQL Server 中使用以下方案:PolyBase enables the following scenarios in SQL Server:

  • 通过 SQL Server 实例或 PDW 查询 Hadoop 中存储的数据。Query data stored in Hadoop from a SQL Server instance or PDW. 用户将数据存储在经济高效的分布式、可扩展系统中,例如 Hadoop。Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. PolyBase 使得使用 T-SQL 查询数据更加容易。PolyBase makes it easy to query the data by using T-SQL.

  • 查询存储在 Azure Blob 存储中的数据。Query data stored in Azure Blob Storage. Azure blob 存储是一个方便存储供 Azure 服务使用的数据的位置。Azure blob storage is a convenient place to store data for use by Azure services. PolyBase 使得使用 T-SQL 访问数据变得更加容易。PolyBase makes it easy to access the data by using T-SQL.

  • 导入 Hadoop、Azure Blob 存储或 Azure Data Lake Store 中的数据。Import data from Hadoop, Azure Blob Storage, or Azure Data Lake Store. 通过将 Hadoop、Azure Blob 存储或 Azure Data Lake Store 中的数据导入到关系表中,利用 Microsoft SQL 的列存储技术和分析功能的速度优势。Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop, Azure Blob Storage, or Azure Data Lake Store into relational tables. 不需要单独的 ETL 或导入工具。There is no need for a separate ETL or import tool.

  • 将数据导出到 Hadoop、Azure Blob 存储或 Azure Data Lake Store。Export data to Hadoop, Azure Blob Storage, or Azure Data Lake Store. 将数据存档到 Hadoop、Azure Blob 存储或 Azure Data Lake Store,以获得经济高效的存储,并使数据保持联机以便于访问。Archive data to Hadoop, Azure Blob Storage, or Azure Data Lake Store to achieve cost-effective storage and keep it online for easy access.

  • 与 BI 工具集成Integrate with BI tools. 结合使用 PolyBase 和 Microsoft 的商业智能和分析堆栈,或使用任何与 SQL Server 兼容的第三方工具。Use PolyBase with Microsoft's business intelligence and analysis stack, or use any third party tools that are compatible with SQL Server.

性能Performance

  • 将计算推送到 Hadoop。Push computation to Hadoop. 查询优化器制定基于成本的决策,以在执行此操作将提升查询性能时将计算推送到 Hadoop。The query optimizer makes a cost-based decision to push computation to Hadoop, if that will improve query performance. 查询优化器使用外部表上的统计来制定基于成本的决策。The query optimizer uses statistics on external tables to make the cost-based decision. 推送计算会创建 MapReduce 作业并利用 Hadoop 的分布计算资源。Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources.

  • 缩放计算资源。Scale compute resources. 若要提高查询性能,可以使用 SQL Server PolyBase 横向扩展组To improve query performance, you can use SQL Server PolyBase scale-out groups. 这使并行数据可以在 SQL Server 实例和 Hadoop 节点之间传输,并为处理外部数据添加计算资源。This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data.

后续步骤Next steps

在使用 PolyBase 之前,必须安装 PolyBase 功能Before using PolyBase, you must install the PolyBase feature. 然后,请参阅以下配置指南,具体取决于你的数据源:Then see the following configuration guides depending on your data source: