Assigning surrogate keys to early arriving facts using Integration Services
In data warehouses, it is quite common that fact records arrive with a source system key that has not yet been loaded in the dimension tables. This phenomena is known as “late arriving dimensions” or “early arriving facts” in Kimball terminology.
When you see a fact record that cannot be resolved to a dimension surrogate key, the typical solution is this:
- Create a dummy member in the dimension table using the source system key
- Assign a surrogate key to this dummy member
- Use the newly create surrogate key and assign it to the fact record
If you use T-SQL to load the data warehouse, it means you have to pass over the input fact rows twice. First, you have to discover which keys are not present in the dimension (and create surrogates for them). Second, you will have to look at the input data again and use the newly generated surrogate keys to load the the fact table.
Using Integration Services, early arriving facts can be populated with just one pass over the source rows, which means less read I/O operations. Nice!
In project Project REAL, a script component is used to achieve this effect. If many of your dimension have early arriving facts, this creates a lot of copy/paste code. There is a cleaner solution that does not use script components.
There is a way handle early arriving facts without relying on script components. It is best illustrated with an example. Let us create these three tables:
/* The input table */ CREATE TABLE Stage_Fact
NK_A CHAR(10) NOT NULL /* The late arriving source system key */
/* The late arriving dimension table */ CREATE TABLE Dim_A
SK_A INT PRIMARY KEY IDENTITY(1,1) /* The surrogate key*/
, NK_A CHAR(10) NOT NULL /* The natural, source system key */
/* The final destination table */
CREATE TABLE Fact
SK_A INT NOT NULL /* Surrogate key from dimension */
Now, use this script to generate 16M rows in the input table and create a 9000 row dimension table:
/* Create some staging data */ INSERT Stage_Fact WITH (TABLOCK)
SELECT RIGHT(REPLICATE('0', 10) + CAST(K AS VARCHAR(10)), 10) AS NK_A
FROM (SELECT ABS(binary_checksum(*) % 10000) AS K
FROM sys.trace_event_bindings eb1
CROSS JOIN sys.trace_event_bindings eb2) AS stuff
/* Populate Dim_A with 90% of the keys from the fact table */ INSERT Dim_A WITH (TABLOCK) (NK_A)
SELECT DISTINCT NK_A FROM Stage_Fact
WHERE NK_A < '0000009000'
With this data, there will be 1000 late arriving dimension keys in Stage_Fact in around 1.8M rows. When we see a non-matched key in Stage_Fact, we want to generate a new surrogate key in Dim_A. But here is the catch: We only want to generate the surrogate once, and we do NOT want to do a roundtrip to the database the second time we see the same key.
Project Real uses a .NET hash table to track the generated keys and perform quick lookups the next time we see the key. But, we already have a fine hash table available without using script components: the lookup transformation. Let us see how we solve the early arriving fact problem with Integration Services, au natural:
The non-matched rows from Lookup SK_A go into the second lookup (New SK_A Cache). New SK_A Cache is where we want to handle the early arriving facts.
First, configure New SK_A Cache as a partial cache:
Now, we play a clever trick: Whenever a partial lookup cache first receives a non-matched row, it will call a SQL statement and fetch data to populate the lookup cache. The default is a SELECT statement, but it does not have to be a SELECT statement. We could replace it with a stored procedure that returns the same result as the SELECT. Actually, let us do exactly that:
Now, the FIRST time the partial lookup cache sees a early arriving fact, it will call Generate_SK_A. I have mapped the NK_A (the source system, natural key) column to the input parameter. To finish the trick, we just have to create a simple stored procedure that uses NK_A to lookup SK_A (the Surrogate Key), and if not found, create a new key:
CREATE PROCEDURE Generate_SK_A
@NK_A CHAR(10) /* The key to find a surrogate for */ AS
SET NOCOUNT ON
/* Prevent race conditions */ SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
/* Check if we already have the key (procedure is idempotent) */ DECLARE @SK_A INT
SELECT @SK_A = SK_A
WHERE NK_A = @NK_A
/* The natural key was not found, generate a new one */ IF @SK_A IS NULL BEGIN
INSERT Dim_A (NK_A) VALUES (@NK_A)
SET @SK_A = SCOPE_IDENTITY()
/* Return the result.
IMPORTANT: must return same format is the SELECT statement we replaced */
SELECT @SK_A AS SK_A, @NK_A AS NK_A
Simple isn’t it?... No need to use any .NET script components here. Have a look at the attached files to study the technique further and you will be handling early arriving facts elegantly in no time.