Load data warehouse tables

Completed

At a basic level, loading a data warehouse is typically achieved by adding new data from files in a data lake into tables in the data warehouse. The COPY statement is an effective way to accomplish this task, as shown in the following example:

COPY INTO dbo.StageProducts
    (ProductID, ProductName, ProductCategory, Color, Size, ListPrice, Discontinued)
FROM 'https://mydatalake.blob.core.windows.net/data/stagedfiles/products/*.parquet'
WITH
(
    FILE_TYPE = 'PARQUET',
    MAXERRORS = 0,
    IDENTITY_INSERT = 'OFF'
);

Considerations for designing a data warehouse load process

One of the most common patterns for loading a data warehouse is to transfer data from source systems to files in a data lake, ingest the file data into staging tables, and then use SQL statements to load the data from the staging tables into the dimension and fact tables. Usually data loading is performed as a periodic batch process in which inserts and updates to the data warehouse are coordinated to occur at a regular interval (for example, daily, weekly, or monthly).

In most cases, you should implement a data warehouse load process that performs tasks in the following order:

  1. Ingest the new data to be loaded into a data lake, applying pre-load cleansing or transformations as required.
  2. Load the data from files into staging tables in the relational data warehouse.
  3. Load the dimension tables from the dimension data in the staging tables, updating existing rows or inserting new rows and generating surrogate key values as necessary.
  4. Load the fact tables from the fact data in the staging tables, looking up the appropriate surrogate keys for related dimensions.
  5. Perform post-load optimization by updating indexes and table distribution statistics.

After using the COPY statement to load data into staging tables, you can use a combination of INSERT, UPDATE, MERGE, and CREATE TABLE AS SELECT (CTAS) statements to load the staged data into dimension and fact tables.

Note

Implementing an effective data warehouse loading solution requires careful consideration of how to manage surrogate keys, slowly changing dimensions, and other complexities inherent in a relational data warehouse schema. To learn more about techniques for loading a data warehouse, consider completing the Load data into a relational data warehouse module.