Guide for using PolyBase in SQL Data Warehouse

This guide gives practical information for using PolyBase in SQL Data Warehouse.

To get started, see the Load data with PolyBase tutorial.

Rotating storage keys

From time to time you will want to change the access key to your blob storage for security reasons.

The most elegant way to perform this task is to follow a process known as "rotating the keys". You may have noticed that you have two storage keys for your blob storage account. This is so that you can transition

Rotating your Azure storage account keys is a simple three step process

  1. Create second database scoped credential based on the secondary storage access key
  2. Create second external data source based off this new credential
  3. Drop and create the external table(s) pointing to the new external data source

When you have migrated all your external tables to the new external data source then you can perform the clean up tasks:

  1. Drop first external data source
  2. Drop first database scoped credential based on the primary storage access key
  3. Log into Azure and regenerate the primary access key ready for the next time

Load data with External Tables

This example loads data from Azure blob storage to SQL Data Warehouse database.

Storing data directly removes the data transfer time for queries. Storing data with a columnstore index improves query performance for analysis queries by up to 10x.

This example uses the CREATE TABLE AS SELECT statement to load data. The new table inherits the columns named in the query. It inherits the data types of those columns from the external table definition.

CREATE TABLE AS SELECT is a highly performant Transact-SQL statement that loads the data in parallel to all the compute nodes of your SQL Data Warehouse. It was originally developed for the massively parallel processing (MPP) engine in Analytics Platform System and is now in SQL Data Warehouse.

-- Load data from Azure blob storage to SQL Data Warehouse

CREATE TABLE [dbo].[Customer_Speed]
WITH
(   
    CLUSTERED COLUMNSTORE INDEX
,    DISTRIBUTION = HASH([CarSensor_Data].[CustomerKey])
)
AS
SELECT *
FROM   [ext].[CarSensor_Data]
;

See CREATE TABLE AS SELECT (Transact-SQL).

Note

A load using an external table can fail with the error "Query aborted-- the maximum reject threshold was reached while reading from an external source". This indicates that your external data contains dirty records. A data record is considered 'dirty' if the actual data types/number of columns do not match the column definitions of the external table or if the data doesn't conform to the specified external file format. To fix this, ensure that your external table and external file format definitions are correct and your external data conforms to these definitions. In case a subset of external data records are dirty, you can choose to reject these records for your queries by using the reject options in CREATE EXTERNAL TABLE DDL.

Create Statistics on newly loaded data

Azure SQL Data Warehouse does not yet support auto create or auto update statistics. In order to get the best performance from your queries, it's important that statistics be created on all columns of all tables after the first load or any substantial changes occur in the data. For a detailed explanation of statistics, see the Statistics topic in the Develop group of topics. Below is a quick example of how to create statistics on the tabled loaded in this example.

create statistics [SensorKey] on [Customer_Speed] ([SensorKey]);
create statistics [CustomerKey] on [Customer_Speed] ([CustomerKey]);
create statistics [GeographyKey] on [Customer_Speed] ([GeographyKey]);
create statistics [Speed] on [Customer_Speed] ([Speed]);
create statistics [YearMeasured] on [Customer_Speed] ([YearMeasured]);

Export data with External Tables

This section shows how to export data from SQL Data Warehouse to Azure blob storage using external tables. This example uses CREATE EXTERNAL TABLE AS SELECT which is a highly performant Transact-SQL statement to export the data in parallel from all the compute nodes.

The following example creates an external table Weblogs2014 using column definitions and data from dbo.Weblogs table. The external table definition is stored in SQL Data Warehouse and the results of the SELECT statement are exported to the "/archive/log2014/" directory under the blob container specified by the data source. The data is exported in the specified text file format.

CREATE EXTERNAL TABLE Weblogs2014 WITH
(
    LOCATION='/archive/log2014/',
    DATA_SOURCE=azure_storage,
    FILE_FORMAT=text_file_format
)
AS
SELECT
    Uri,
    DateRequested
FROM
    dbo.Weblogs
WHERE
    1=1
    AND DateRequested > '12/31/2013'
    AND DateRequested < '01/01/2015';

Isolate Loading Users

There is often a need to have multiple users that can load data into a SQL DW. Because the CREATE TABLE AS SELECT (Transact-SQL) requires CONTROL permissions of the database, you will end up with multiple users with control access over all schemas. To limit this, you can use the DENY CONTROL statement.

Example: Consider database schemas schema_A for dept A, and schema_B for dept B Let database users user_A and user_B be users for PolyBase loading in dept A and B, respectively. They both have been granted CONTROL database permissions. The creators of schema A and B now lock down their schemas using DENY:

   DENY CONTROL ON SCHEMA :: schema_A TO user_B;
   DENY CONTROL ON SCHEMA :: schema_B TO user_A;

With this, user_A and user_B should now be locked out from the other dept’s schema.

PolyBase performance optimizations

To achieve optimal loading performance with PolyBase we suggest the following:

  • Split large compressed files into smaller compressed files. The compression types supported today are not splittable. As a result, performance will be impacted by loading a single large file.
  • For fastest loading speed, load into a round_robin, heap staging table. This will be the most efficient way to move the data from storage layer to the data warehouse.
  • All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8 and UTF-16 performance is minimal.
  • Co-locate your storage layer and your data warehouse to minimize latency
  • Scale up your data warehouse if you expect a large loading job.

PolyBase Limitations

PolyBase in SQL DW has the following limitations that need to be taken into consideration when designing a loading job:

  • A single row cannot be wider than 1,000,000 bytes. This is true regardless of the table schema defined.
  • When exporting data into an ORC File Format from SQL Server or Azure SQL Data Warehouse text heavy columns can be limited to as few as 50 columns due to java out of memory errors. To work around this, export only a subset of the columns.

Next steps

To learn more about moving data to SQL Data Warehouse, see the data migration overview.