Introducing data virtualization with PolyBase

APPLIES TO: SQL Server Azure SQL Database Azure Synapse Analytics Parallel Data Warehouse

PolyBase is a data virtualization feature for SQL Server.

What is PolyBase?

PolyBase enables your SQL Server instance to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB, Hadoop clusters, Cosmos DB without separately installing client connection software. You can also use the generic ODBC connector to connect to additional providers using third-party ODBC drivers. PolyBase allows T-SQL queries to join the data from external sources to relational tables in an instance of SQL Server.

A key use case for data virtualization with the PolyBase feature is to allow the data to stay in its original location and format. You can virtualize the external data through the SQL Server instance, so that it can be queried in place like any other table in SQL Server. This process minimizes the need for ETL processes for data movement. This data virtualization scenario is possible with the use of PolyBase connectors.

Note

Some functionality of the PolyBase feature is in private preview for Azure SQL managed instances, including the ability to query external data (Parquet files) in Azure Data Lake Storage (ADLS) Gen2. Private preview includes access to client libraries and documentation for testing purposes that are not yet available publicly. If you are interested and ready to invest some time in trying out the functionalities and sharing your feedback and questions, please review the Azure SQL Managed Instance PolyBase Private Preview Guide.

Supported SQL products and services

PolyBase provides these same functionalities for the following SQL products from Microsoft:

  • SQL Server 2016 (13.x) and later versions (Windows only)
  • SQL Server 2019 (15.x) and later versions (Linux)
  • SQL Server Analytics Platform System (PDW) (PDW), hosted in the Analytics Platform System (APS)
  • Azure Synapse Analytics

PolyBase connectors

The PolyBase feature provides connectivity to the following external data sources:

External data sources SQL Server with PolyBase APS PDW Azure Synapse Analytics
Oracle, MongoDB, Teradata Read No No
Generic ODBC Read (Windows Only) No No
Azure Storage Read/Write Read/Write Read/Write
Hadoop Read/Write Read/Write No
SQL Server Read No No
  • SQL Server 2016 (13.x) introduced PolyBase with support for connections to Hadoop and Azure blob storage.
  • SQL Server 2019 (15.x) introduced additional connectors, including SQL Server, Oracle, Teradata, and MongoDB.

Examples of external connectors include:

* PolyBase supports two Hadoop providers, Hortonworks Data Platform (HDP) and Cloudera Distributed Hadoop (CDH).

To use PolyBase in an instance of SQL Server:

  1. Install PolyBase on Windows or Install PolyBase on Linux.
  2. Starting with SQL Server 2019 (15.x), enable PolyBase in sp_configure, if necessary.
  3. Create an external data source.
  4. Create an external table.

Azure integration

With the underlying help of PolyBase, T-SQL queries can also import and export data from Azure blob storage. Further, PolyBase enables Azure Synapse Analytics to import and export data from Azure Data Lake Store, and from Azure blob storage.

Why use PolyBase?

PolyBase allows you to join data from a SQL Server instance with external data. Prior to PolyBase to join data to external data sources you could either:

  • Transfer half your data so that all the data was in one location.
  • Query both sources of data, then write custom query logic to join and integrate the data at the client level.

PolyBase allows you to simply use Transact-SQL to join the data.

PolyBase does not require you to install additional software to your Hadoop environment. You query external data by using the same T-SQL syntax used to query a database table. The support actions implemented by PolyBase all happen transparently. The query author does not need any knowledge about the external source.

PolyBase uses

PolyBase enables the following scenarios in SQL Server:

  • Query data stored in Hadoop from a SQL Server instance or PDW. Users are storing data in cost-effective distributed and scalable systems, such as Hadoop. PolyBase makes it easy to query the data by using T-SQL.

  • Query data stored in Azure blob storage. Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.

  • Import data from Hadoop, Azure blob storage, or Azure Data Lake Store. Leverage the speed of Microsoft SQL's columnstore technology and analysis capabilities by importing data from Hadoop, Azure blob storage, or Azure Data Lake Store into relational tables. There is no need for a separate ETL or import tool.

  • Export data to Hadoop, Azure blob storage, or Azure Data Lake Store. Archive data to Hadoop, Azure blob storage, or Azure Data Lake Store to achieve cost-effective storage and keep it online for easy access.

  • Integrate with BI tools. Use PolyBase with Microsoft's business intelligence and analysis stack, or use any third-party tools that are compatible with SQL Server.

Performance

  • Push computation to Hadoop. PolyBase pushes some computations to the external source to optimize the overall query. The query optimizer makes a cost-based decision to push computation to Hadoop, if that will improve query performance. The query optimizer uses statistics on external tables to make the cost-based decision. Pushing computation creates MapReduce jobs and leverages Hadoop's distributed computational resources. For more information, see Pushdown computations in PolyBase.

  • Scale compute resources. To improve query performance, you can use SQL Server PolyBase scale-out groups. This enables parallel data transfer between SQL Server instances and Hadoop nodes, and it adds compute resources for operating on the external data.

Next steps

Before using PolyBase, you must install PolyBase on Windows or install PolyBase on Linux, and enable PolyBase in sp_configure if necessary. Then see the following configuration guides depending on your data source: