Burritos and SSIS 2005 Data Flow Tasks...
How are these two topics possibly related you ask? Well, I'm in Mexico City, I'm eating a burrito, and I'm bogging about data flow tasks in SQL Server 2005 Integration Services. So there you go.
I'm going to start to use some Business Intelligence / Data Mart / Data Warehousing terminology in my posts. If you aren't a member of the BI elite, don't worry about it. Remember, its just a bunch of databases, queries and front ends. Anybody who is smart enough to learn .NET, XML, Web Services, ASP.NET and the like can figure this stuff out. Don't let the terminology scare you away. Also remember that the terminology I'm using represents only one school of thought.
The first set of terminology I'm going to throw out there is a data mart. A data mart is nothing more than a relational database used for reporting and analytics that follows a particular design pattern known as a dimensional model. Think of it as an OLTP schema that gets re-organized to make it easier to do reporting and analysis.
Typically a data mart consists of some tables known as "fact tables" that contain stuff you want to measure (price, quantity, date, count) and some related tables known as "dimension" tables that are used to analyze your facts (product, customer, time). Dimension tables also represent useful hierarchies. For example, a product dimension would define not only products, but product categories that the products belong to. Together these fact tables and dimension tables form what is known as a star schema. Tools like SSIS help you keep your fact and dimension tables up-to-date with the latest information represented in your OLTP database schema.
Okay, so now that you know everything about data marts, let's get back to SSIS. In my last post I talked about how I was to use the XML Source Adapter to read some data I extracted from SharePoint. In this post I'm going to talk about what I did with the source data in my data flow tasks. In its simplest form, a data flow task reads data from a source, runs it through a series of on-the-fly transformations, and writes it out to a destination. I'm going to pick up at the point were the XML Source Adapter has produced a result set that I now need to transform and load into SQL Server. I'll talk about two different transforms that are useful in constructing data flow tasks:
- multicast transforms
- derived column transforms.
Now that I had a result set containing my source data, I wanted to persist a copy of that result set in a table before doing anything with it. This is a good design-time best practice because it allows you to compare your source data unmolested with whatever you wind up with at the end of your data flow task. To accomplish this, I simply connected my XML Source Adapter to a Multicast transform with a data flow. Then I routed the data flow outputs from my multicast transform to both to an OLE DB Destination Adapter targeting SQL Server to persist the original source data, and to subsequent transformations in my data flow task.
I needed a target table to use as a destination for my raw data. Now normally I would switch out of the SSIS design environment and go write some scripts. But one nice design-time feature of the OLE DB Destination Adapter (and other destination adapters for that matter), is that you can create the target on-the-fly at design time. It simply examines the metadata for the input, and generates DDL for a compatible table structure. It shows the CREATE TABLE command in an editor dialog so you can make changes, then you just click OK and it builds a compatible destination table for you on the target. Very sweet, especially for tables with a lot of columns. I did have to go back later and auto-generate my scripts in SQL Server 2005 Management Studio so I could rebuild my data mart, but having this at design time made me more productive.
So next I tested my dataflow, and it didn't work. Problem is that my multicast didn't work due to a bug in the Feb CTP. It manifested itself by throwing unknown exceptions further down in my data flow on the other branch of my multicast. So I just left my multicast in as a stub, and I'll add the OLE DB Destination back in later to persist my raw data after the bug is fixed. Again, the joys of dogfood.
At this point I need to digress. Warning, here comes more elite BI babble. Again, don't be afraid, this is simple stuff.
If you are familiar with reporting and analytics, you may be familiar with the concept of a time dimension. A time dimension is a fancy word for a lookup table that provides improved sorting and grouping performance on dates. A typical time dimension has one row per day, and one column for each date part. For example, for 3/21/05 my row in my time dimension would look like this:
TimeKey: 20050305 (int primary key)
Date: 3/21/05 (datetime)
Year: 2005 (int)
Month: 3 (int)
Day: 5 (int)
Quarter: 1 (int)
Fiscal Year: 2005 (int)
Fiscal Quarter: 3 (int)
Why persist this data you ask? Because query processors can perform sorting, grouping and aggregation much faster against this persisted data than against datetime expressions. I needed a time dimension for my data mart, and I needed it fast. After a few google searches I was unable to turn up a quick time dimension generator, so I wrote my own in TSQL. Here it is:
CREATE TABLE [dbo].[timeDim](
[timeKey] int PRIMARY KEY,
[timeValue] datetime NOT NULL,
[calendarYear] smallint NOT NULL,
[calendarQuarter] tinyint NOT NULL,
[calendarMonth] tinyint NOT NULL,
[calendarDay] tinyint NOT NULL,
[fiscalYear] smallint NOT NULL,
[fiscalQuarter] tinyint NOT NULL)
CREATE PROCEDURE [buildTimeDim](
@yearSpan smallint = 10, -- # of years to generate
@fiscalStartMonth tinyint = 7)
IF @yearSpan <= 0
RAISERROR('Invalid parameter value. @yearSpan must be greater than or equal to 0.', 16, 0)
SET @today = '1/1/' + CAST(DATEPART(year,GETDATE()) AS char(4))
SET @startDate = DATEADD(year, (ROUND(@yearSpan / 2, 0) * -1),@today)
SET @stopDate = DATEADD(year, ROUND(@yearSpan / 2, 0),@today)
SET @curDate = @startDate
WHILE @curDate <= @stopDate
SET @fiscalDate = DATEADD(month, @fiscalStartMonth -1, @curDate)
((DATEPART(year,@curDate) * 10000 ) + (DATEPART(month, @curDate) * 100) + DATEPART(day, @curDate)
SET @curDate = DATEADD(day,1,@curDate)
Using this stored procedure I was able to quickly generate my time dimension. So now that you know everything about time dimensions, let's get back to SSIS.
Now we will proceed down the data flow branch where the actual transformations occur. The next transform following the multicast in my data flow tasks is a derived column transformation. I use this in all my data flow tasks to do an initial scrub of the input and add any derived columns that I need.
This transform allows you to replace input columns with new values, or to add new columns to the output results on the fly. The transform, and SSIS as a whole, supports an expression syntax as well as its own scalar type system. Both will be very familiar to C++ programmers. Calculating values involves writing an expression using this syntax and setting the datatype to one of the SSIS scalar data types.
So all of my data pump tasks have a derived column transformation early in the dataflow where I do as much scrubbing as I can in a single pass. The key is to avoid doing this more than once, especially for big data sets. The editor has a grid that contains one row per derived column operation. You just add a row for each derived column operation you want to perform.
My transforms perform a number of different types of operations including:
- String cleanup, e.g. trim, changing keys to upper case / lower case, etc
- Null handling
- Generating new output columns from multi-valued input columns
- Transforming data from one format to another
I'll cover a few of the more interesting transformations I was able to do. The first example involves my time dimension that I talked about previously. My source data for my fact tables contains some datetime values for things like event dates. I used the derived column transform to convert these datetimes into the timeKey values from my time dimension. The expression syntax I used for generating a timeKey from a base datetime value looks like this:
(DATEPART("year",startDate) * 10000) + (DATEPART("month",startDate) * 100) + DATEPART("day",startDate)
Since the datatype generated by this expression (int) is different than the original datetime value, I chose the "add as new column" option rather than to replace the original datetime value. The column gets added to the result set on the fly and I'll use it later to persist to my destination.
Another interesting expression I used was to handle a multi-valued field I got from SharePoint. Basically SharePoint surveys use a cool column type called a Rating Scale. The idea is to provide the respondent with a way to rate multiple related things on a given scale. Each rating scale column can contain multiple ratings. Here's an example of what a Rating Scale XML value looks like in the data I extracted from SharePoint:
<Taking_x0020_into_x0020_account_>Taking into account all aspects of this course (courseware, instructor, and facilities), this was a valuable overall training experience.;#+5#The enrollment process was easy to navigate and complete.;#+5#The delivery method was appropriate for the course.;#+5#Microsoft should offer more training of this type on a variety of products and technologies;#+5#The selection of non-Microsoft experts to develop and deliver the training made the course seem balanced and credible;#+5#This course and the program supporting it is an improvement over previous Microsoft training experiences;#+3#</Taking_x0020_into_x0020_account_>
So I needed to extract the ratings (delimited by # symbols) and create derived columns for them so I could persist them separately in my destination for analysis. So for this particular rating scale value I needed to create five new derived columns using these expressions:
FINDSTRING(overallTrainingExperience,";",1) > 0 ? (DT_R4)SUBSTRING(overallTrainingExperience,FINDSTRING(overallTrainingExperience,";",1) + 3,1) : NULL(DT_R4)
FINDSTRING(overallTrainingExperience,";",2) > 0 ? (DT_R4)SUBSTRING(overallTrainingExperience,FINDSTRING(overallTrainingExperience,";",2) + 3,1) : NULL(DT_R4)
FINDSTRING(overallTrainingExperience,";",3) > 0 ? (DT_R4)SUBSTRING(overallTrainingExperience,FINDSTRING(overallTrainingExperience,";",3) + 3,1) : NULL(DT_R4)
FINDSTRING(overallTrainingExperience,";",4) > 0 ? (DT_R4)SUBSTRING(overallTrainingExperience,FINDSTRING(overallTrainingExperience,";",4) + 3,1) : NULL(DT_R4)
FINDSTRING(overallTrainingExperience,";",5) > 0 ? (DT_R4)SUBSTRING(overallTrainingExperience,FINDSTRING(overallTrainingExperience,";",5) + 3,1) : NULL(DT_R4)
Basically these use the SSIS conditional expression to determine if the n'th occurrence of the search pattern occurs in the value which would indicate a rating value exists. If the pattern does occur, I use the substring operator to extract the n'th rating value, and then cast it to a 4-byte real. Cool huh? This solved a huge problem for the users of this system who were previously unable to do analysis on the rating values because they were buried in these multi-valued fields.
Well I think its lunch time. Hmmm, what will it be today? A chalupa maybe? In my next post I'll dive into some other cool transforms in SSIS 2005.