Load data from Azure Data Lake Storage to SQL Data Warehouse
Use PolyBase external tables to load data from Azure Data Lake Storage into Azure SQL Data Warehouse. Although you can run adhoc queries on data stored in Data Lake Storage, we recommend importing the data into the SQL Data Warehouse for best performance.
- Create database objects required to load from Data Lake Storage.
- Connect to a Data Lake Storage directory.
- Load data into Azure SQL Data Warehouse.
If you don't have an Azure subscription, create a free account before you begin.
Before you begin
Before you begin this tutorial, download and install the newest version of SQL Server Management Studio (SSMS).
To run this tutorial, you need:
Azure Active Directory Application to use for Service-to-Service authentication. To create, follow Active directory authentication
An Azure SQL Data Warehouse. See Create and query and Azure SQL Data Warehouse.
A Data Lake Storage account. See Get started with Azure Data Lake Storage.
Create a credential
To access your Data Lake Storage account, you will need to create a Database Master Key to encrypt your credential secret used in the next step. You then create a Database Scoped Credential. When authenticating using service principals, the Database Scoped Credential stores the service principal credentials set up in AAD. You can also use the storage account key in the Database Scoped Credential for Gen2.
To connect to Data Lake Storage using service principals, you must first create an Azure Active Directory Application, create an access key, and grant the application access to the Data Lake Storage account. For instructions, see Authenticate to Azure Data Lake Storage Using Active Directory.
-- A: Create a Database Master Key. -- Only necessary if one does not already exist. -- Required to encrypt the credential secret in the next step. -- For more information on Master Key: https://msdn.microsoft.com/library/ms174382.aspx?f=255&MSPPError=-2147217396 CREATE MASTER KEY; -- B (for service principal authentication): Create a database scoped credential -- IDENTITY: Pass the client id and OAuth 2.0 Token Endpoint taken from your Azure Active Directory Application -- SECRET: Provide your AAD Application Service Principal key. -- For more information on Create Database Scoped Credential: https://msdn.microsoft.com/library/mt270260.aspx CREATE DATABASE SCOPED CREDENTIAL ADLSCredential WITH -- Always use the OAuth 2.0 authorization endpoint (v1) IDENTITY = '<client_id>@<OAuth_2.0_Token_EndPoint>', SECRET = '<key>' ; -- B (for Gen2 storage key authentication): Create a database scoped credential -- IDENTITY: Provide any string, it is not used for authentication to Azure storage. -- SECRET: Provide your Azure storage account key. CREATE DATABASE SCOPED CREDENTIAL ADLSCredential WITH IDENTITY = 'user', SECRET = '<azure_storage_account_key>' ; -- It should look something like this when authenticating using service principals: CREATE DATABASE SCOPED CREDENTIAL ADLSCredential WITH IDENTITY = '536540b4-4239-45fe-b9a3-629f97591c0c@https://login.microsoftonline.com/42f988bf-85f1-41af-91ab-2d2cd011da47/oauth2/token', SECRET = 'BjdIlmtKp4Fpyh9hIvr8HJlUida/seM5kQ3EpLAmeDI=' ;
Create the external data source
Use this CREATE EXTERNAL DATA SOURCE command to store the location of the data.
-- C (for Gen1): Create an external data source -- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure Data Lake Storage. -- LOCATION: Provide Data Lake Storage Gen1 account name and URI -- CREDENTIAL: Provide the credential created in the previous step. CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage WITH ( TYPE = HADOOP, LOCATION = 'adl://<datalakestoregen1accountname>.azuredatalakestore.net', CREDENTIAL = ADLSCredential ); -- C (for Gen2): Create an external data source -- TYPE: HADOOP - PolyBase uses Hadoop APIs to access data in Azure Data Lake Storage. -- LOCATION: Provide Data Lake Storage Gen2 account name and URI -- CREDENTIAL: Provide the credential created in the previous step. CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage WITH ( TYPE = HADOOP, LOCATION='abfs[s]://<container>@<AzureDataLake account_name>.dfs.core.windows.net', -- Please note the abfss endpoint for when your account has secure transfer enabled CREDENTIAL = ADLSCredential );
Configure data format
To import the data from Data Lake Storage, you need to specify the External File Format. This object defines how the files are written in Data Lake Storage. For the complete list, look at our T-SQL documentation CREATE EXTERNAL FILE FORMAT
-- D: Create an external file format -- FIELD_TERMINATOR: Marks the end of each field (column) in a delimited text file -- STRING_DELIMITER: Specifies the field terminator for data of type string in the text-delimited file. -- DATE_FORMAT: Specifies a custom format for all date and time data that might appear in a delimited text file. -- Use_Type_Default: Store missing values as default for datatype. CREATE EXTERNAL FILE FORMAT TextFileFormat WITH ( FORMAT_TYPE = DELIMITEDTEXT , FORMAT_OPTIONS ( FIELD_TERMINATOR = '|' , STRING_DELIMITER = '' , DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff' , USE_TYPE_DEFAULT = FALSE ) );
Create the External Tables
Now that you have specified the data source and file format, you are ready to create the external tables. External tables are how you interact with external data. The location parameter can specify a file or a directory. If it specifies a directory, all files within the directory will be loaded.
-- D: Create an External Table -- LOCATION: Folder under the Data Lake Storage root folder. -- DATA_SOURCE: Specifies which Data Source Object to use. -- FILE_FORMAT: Specifies which File Format Object to use -- REJECT_TYPE: Specifies how you want to deal with rejected rows. Either Value or percentage of the total -- REJECT_VALUE: Sets the Reject value based on the reject type. -- DimProduct CREATE EXTERNAL TABLE [dbo].[DimProduct_external] ( [ProductKey] [int] NOT NULL, [ProductLabel] [nvarchar](255) NULL, [ProductName] [nvarchar](500) NULL ) WITH ( LOCATION='/DimProduct/' , DATA_SOURCE = AzureDataLakeStorage , FILE_FORMAT = TextFileFormat , REJECT_TYPE = VALUE , REJECT_VALUE = 0 ) ;
External Table Considerations
Creating an external table is easy, but there are some nuances that need to be discussed.
External Tables are strongly typed. This means that each row of the data being ingested must satisfy the table schema definition. If a row does not match the schema definition, the row is rejected from the load.
The REJECT_TYPE and REJECT_VALUE options allow you to define how many rows or what percentage of the data must be present in the final table. During load, if the reject value is reached, the load fails. The most common cause of rejected rows is a schema definition mismatch. For example, if a column is incorrectly given the schema of int when the data in the file is a string, every row will fail to load.
Data Lake Storage Gen1 uses Role Based Access Control (RBAC) to control access to the data. This means that the Service Principal must have read permissions to the directories defined in the location parameter and to the children of the final directory and files. This enables PolyBase to authenticate and load that data.
Load the data
To load data from Data Lake Storage use the CREATE TABLE AS SELECT (Transact-SQL) statement.
CTAS creates a new table and populates it with the results of a select statement. CTAS defines the new table to have the same columns and data types as the results of the select statement. If you select all the columns from an external table, the new table is a replica of the columns and data types in the external table.
In this example, we are creating a hash distributed table called DimProduct from our External Table DimProduct_external.
CREATE TABLE [dbo].[DimProduct] WITH (DISTRIBUTION = HASH([ProductKey] ) ) AS SELECT * FROM [dbo].[DimProduct_external] OPTION (LABEL = 'CTAS : Load [dbo].[DimProduct]');
Optimize columnstore compression
By default, SQL Data Warehouse stores the table as a clustered columnstore index. After a load completes, some of the data rows might not be compressed into the columnstore. There's a variety of reasons why this can happen. To learn more, see manage columnstore indexes.
To optimize query performance and columnstore compression after a load, rebuild the table to force the columnstore index to compress all the rows.
ALTER INDEX ALL ON [dbo].[DimProduct] REBUILD;
It is best to create single-column statistics immediately after a load. There are some choices for statistics. For example, if you create single-column statistics on every column it might take a long time to rebuild all the statistics. If you know certain columns are not going to be in query predicates, you can skip creating statistics on those columns.
If you decide to create single-column statistics on every column of every table, you can use the stored procedure code sample
prc_sqldw_create_stats in the statistics article.
The following example is a good starting point for creating statistics. It creates single-column statistics on each column in the dimension table, and on each joining column in the fact tables. You can always add single or multi-column statistics to other fact table columns later on.
You have successfully loaded data into Azure SQL Data Warehouse. Great job!
In this tutorial, you created external tables to define the structure for data stored in Data Lake Storage Gen1, and then used the PolyBase CREATE TABLE AS SELECT statement to load data into your data warehouse.
You did these things:
- Created database objects required to load from Data Lake Storage Gen1.
- Connected to a Data Lake Storage Gen1 directory.
- Loaded data into Azure SQL Data Warehouse.
Loading data is the first step to developing a data warehouse solution using SQL Data Warehouse. Check out our development resources.