Wednesday, June 1, 2011

Type 2 Slowly Changing Dimension Loads

I was talking with a coworker about how I loaded my dimensions and thought it would make an interesting addition to the blog.

Type 2 slowly changing dimensions

Functionally, a type 2 dimension is one in which the records are never updated and any change to a record is reflected by an insert of a new record with a new id. What this means to the users is that they will always be able to see the true state of the data for a given date.

If I have a customer record and my zip code is 43567, I'll load that record once in the customer dimension and it will remain there until the zip changes. If the zip comes in on May 15th as 23456, I'll add a new record with a new id. Since this id will be associated with the effective date in the fact table, users can now see that the customer had the old zip up until May 15 and the new one after.
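To make that concrete, here's a sketch of the rows involved (the column names here are hypothetical, not my actual schema):

--Customer dimension: same customer, two rows, two ids
-- dimension_id | customer_number | zip
--            1 | C1001           | 43567
--            2 | C1001           | 23456

--Fact table: each fact row carries whichever id was current on its date
-- fact_date  | dimension_id | amount
-- 2011-05-14 |            1 | 100.00
-- 2011-05-15 |            2 | 250.00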

In contrast, a type 1 dimension would have updated the original customer record, and the history of the zip code would have been lost. For some applications this is OK or even desirable, but I tend to go to type 2 as my standard dimension and only use type 1 or type 3 when requirements dictate.

I think type 2 dimensions are also the easiest to load. You simply stage your current day's data, insert new dimension records by comparing the dimension to the stage table, then update stage with the dimension keys. Got that? Don't worry, here are the details:

Process

I'm not going to go into all the details of staging your data since every stage is going to be a little different, but here's my basic routine:

Step 1: Create a staging table

drop table Staging;

--create an empty copy of the dimension table
--I know everyone says "never select *" but in this case
-- you have my permission.
select *
into Staging
from Dimension
where 1 = 2;

--my identity column isn't going to play nice with the remainder
-- of the load, so let's hamstring it. select into copies the
-- identity property and alter column can't strip it, so drop
-- the column and re-add it as a plain nullable int.
alter table Staging drop column dimension_id;
alter table Staging add dimension_id int null;

--Now insert your data into the staging table
--(I'm simplifying here)
insert into Staging (field-list)
select field-list
from Source;

I wish my dimension loads were as simple as that last step. I'm regularly joining multiple tables, bringing in a lookup table, and then doing lots of little updates to add additional value to the results.
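For example, a slightly more realistic staging insert might look something like this (the source and lookup table names are made up for illustration):

--join the source to a lookup table for a descriptive field
insert into Staging (customer_number, zip, region_name)
select
c.customer_number,
c.zip,
r.region_name
from SourceCustomer c
left join RegionLookup r
on (r.zip = c.zip);

--one of those little value-adding updates
update Staging
set region_name = 'Unknown'
where region_name is null;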

Step 2: Hash each staging record

Now that I've got my data staged, here's where the magic happens. I'm going to use SQL Server's hashbytes function to hash every record of my staging table. Since hashbytes takes a single string input, I'll need to convert non-string fields to varchar and concatenate everything. I'll also need to cover any null values with an isnull, because concatenating a null makes the entire input null. Finally, I delimit the fields with a pipe ("|") so that an unusual combination of fields can't appear identical to the hash function; without the delimiter, ('ab', 'c') and ('a', 'bc') would concatenate to the same string and produce the same hash.

I have learned to avoid nulls in my dimensions and make every effort to store an 'NA' or 'Unknown' or at the very least an empty string (''). This makes the hashing less complex and any rehashing less painful.

update Staging
set unique_hash = hashbytes('SHA1',
isnull(varchar_field, 'NA') + '|' +
isnull(cast(int_field as varchar(20)), 'NA') + '|' +
isnull(convert(varchar(10), date_field, 101), 'NA')
);

Step 3: Insert new records to the Dimension Table

This step is pretty straight-forward. Add any records to the dimension table that aren't already there by comparing the staging hashes to the existing dimension hashes. Make sure you store each record's hash in the dimension as well, so that future loads can compare against it without rehashing the whole dimension.

insert into Dimension (
field-list,
unique_hash
)
select
field-list,
unique_hash
from Staging
where unique_hash not in (select distinct unique_hash from Dimension)
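If that not in subquery ever gives the optimizer trouble on a large dimension, an equivalent not exists form is worth trying (same field-list placeholder as above):

insert into Dimension (
field-list,
unique_hash
)
select
field-list,
unique_hash
from Staging s
where not exists (
select 1
from Dimension d
where d.unique_hash = s.unique_hash
);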

Step 4: Update staging with Dimension keys

Now that we know there's a record in the dimension with the same hash as every record in the stage table, we can go get the id values and store them in the staging table's id column (which has been null for the duration of the load thanks to the column surgery we pulled off above).

update Staging
set dimension_id = d.dimension_id
from Staging s
inner join Dimension d
on (s.unique_hash = d.unique_hash)
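Before declaring victory, a quick sanity check is cheap insurance; if this returns anything but zero, something upstream went wrong:

--every staging row should have picked up an id by now
select count(*)
from Staging
where dimension_id is null;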

and we're done.

Hope that was painless and clear!

Now we have all of today's data for this dimension staged, we have the target dimension synchronized with it and we have all of the dimension_id values updated in the stage table. We'll use those when we load the fact table.
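Just as a preview, that fact load can pick the ids straight off the staging table with a join on the natural key. Here's a rough sketch (FactSource, FactSales, and the join column are hypothetical):

insert into FactSales (sale_date, dimension_id, sale_amount)
select
f.sale_date,
s.dimension_id,  --the id we stored in step 4
f.sale_amount
from FactSource f
inner join Staging s
on (s.customer_number = f.customer_number);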



1 comment:

  1. Have you tried @pragmaticworks Task Factory which has Kimball SCD Task in it?
