Data Juggler

Wednesday, June 1, 2011

Type 2 Slowly Changing Dimension Loads

I was talking with a coworker about how I loaded my dimensions and thought it would make an interesting addition to the blog.

Type 2 slowly changing dimensions

Functionally, a type 2 dimension is one in which the records are never updated and any change to a record is reflected by an insert of a new record with a new id. What this means to the users is that they will always be able to see the true state of the data for a given date.

If I have a customer record and my zip code is 43567, I'll load that record once in the customer dimension and it will remain there until the zip changes. If the zip comes in on May 15th as 23456, I'll add a new record with a new id. Since this id will be associate with the effective date in the fact table, users can now see that a customer had the old zip up until May 15 and the new one after.

In contrast, a type one dimension would have updated the original customer record and the history of the zip code will have been lost. For some applications, this is OK or even desirable but I tend to go to type 2 as my standard dimension and only use type 1 or type 3 when requirements dictate.

I think Type 2 dimensions are also the easiest to load. You simply stage your current day's data, insert new dimension records by comparing it to stage, then update stage with the dimension keys. Got that? Don't worry, here's the details:

Process

I'm not going to go into all the details for staging your data since every stage is going to be a little different but here's my basic routine:

Step 1: Create a staging table

drop table Staging;

--create an empty copy of the dimension table

--I know everyone says "never select *" but in this case

-- you have my permission.

select *

into Staging

from Dimension

where 1=2

--my identity column isn't going to play nice with the remainder

-- of the load so let's hamstring it

alter table Staging

alter column dimension_id int null

--Now insert your data into the staging table

--(I'm simplifying here)

insert into Staging (field-list)

select field-list

from Source

I wish my dimension loads were as simple as that last step. I regularly am joining multiple tables, bringing in a lookup table and then doing lots of little updates to add additional value to the results.

Step 2: Hash each staging record

Now that I've got my data staged, here's where the magic happens. I'm going to use sql server's hashbytes function to hash every record of my staging table. Since the input for hashbytes is a single varchar, I'll need to convert non string fields and concatenate them. I'll also need to cover any null values with an isnull statement. I also delimit my fields with a pipe "|" to make sure that an unusual combination of fields don't appear to the hash function as identical.

I have learned to avoid nulls in my dimension an make every effort to store an 'NA' or 'Unknown' or at the very least an empty string (''). This will make your hashing less complex and any rehashing less painful.

update Staging

set unique_hash = hashbytes('SHA1',

isnull(varchar_field,'NA')+'|'+

isnull(cast(int_field as varchar),'NA')+'|'+

isnull(convert(varchar,date_field,101),'NA')

)

Step 3: Insert new records to the Dimension Table

This step is pretty straight-forward. Add any records to the dimension table that aren't already there by comparing to existing hashes. Make sure you store the hash of the record in the dimension as well so that you can compare it without rehashing it.

insert into Dimension (

field-list,

unique_hash

)

select

field-list,

unique_hash

from Staging

where unique_hash not in (select distinct unique_hash from Dimension)

Step 4: Update staging with Dimension keys

Now that we know that there's a record in the dimension with the same hash as every record in the Stage table, we can go get the id values and store them in the staging table id (which has been null for the duration of the load thanks to the alter column we pulled off above.)

update Staging

set dimension_id = d.dimension_id

from Staging s

inner join Dimension d

on (s.unique_hash = d.unique_hash)

and we're done.

Hope that was painless and clear!

Now we have all of today's data for this dimension staged, we have the target dimension synchronized with it and we have all of the dimension_id values updated in the stage table. We'll use those when we load the fact table.

Monday, May 30, 2011

Normalization and How to Know When You Are Done…

I enjoyed reading Louis Davidson article on normalization over at SQLBlog.com. Nothing really new for me there but it got me thinking about my own learning process. I'm reposting my comments here:

An excellent introduction though nothing beats examples for teaching. It would be valuable to show a poorly designed database, show why it's poor and show how to correct it.

The light bulb moment is an amazing thing too. Everyone (who's successful) has one. It's fun to think about how the brain works and why these sudden leaps of comprehension take place.

Most of my work now is with data warehouses so I do the opposite -- denormalization. As I was learning and implementing the denormalization process, it was interesting how much my understanding of normalization practices increased. I had to do things counter to my instincts and I made some mistakes along the way. This forced me to think more specifically about the decisions I was making while putting data into a star and, in contrast, I better understood the normalization process.

A couple years back, I was asked to handle the database design and implementation for a small IT-internal project used to track something (can't actually remember what for). I worked with a co-worker to flesh out the requirements. He would be designing the website and his first ideas about the database was to have just a couple of Big Tables and handle all of the record management in code. (As an aside, I see this tendency among a lot of decent programmers - mostly because they never learned better ways.) I steered him away from that and started breaking things up. He agreed we'd handle CRUD operations through sprocs and views allowing him to focus on the function of the website instead of wrangling data.

I made the mistake of letting him talk me into leaving one table less normalized than my gut was telling me since it only served a small piece of the solution and normalization ended up adding a few xref tables that he felt made things too complex. As I started writing insert and update sprocs for the subsystem, I saw it turning into an awful, broken mess requiring far more effort to manage than if it had been normalized. I jumped on correcting it right away despite his protests (though he trusted me enough not to protest too much). That taught me to rely on my instincts but it also made me understand the mindset of people who haven't learned to normalize. I think they are afraid of the perceived complexity behind it when the fact is, proper normalization makes the pieces of a project work together more cleanly and removes a great deal of complexity in the use and function of the database. This is especially true when you encapsulate tasks in sprocs (like p_AddUser or p_RemoveItem).

Wednesday, May 18, 2011

Back to Programming -- Sort of

This is my first blog post but I'm going to skip the formalities and get to the point.

I did some non-database programming today and it was wonderful. OK, it was VBA in Excel -- WAIT COME BACK! It gets better.

I'm currently working to hack together some automated Credit Policy reporting. These reports basically consume one person's time for an entire month. She can't take vacations without a computer by her side. And these reports are the foundation of much of our regulatory reporting. They go to the "C"s, they go to the Board and they go to the Feds. So much of what she is doing is manual: Spreadsheet from here, query from there, cobble it together, add a bunch of manual exception entries.

I was brought in to leverage the data warehouse I've developed (SQL Server backroom, Business Objects WebIntelligence in the front, Kimball methodology with a bit of Ingram thrown in where needed.)

I brought data into our core warehouse from 4 additional sources (more on that later, I suppose). There's still some tweaks being made to the way business rules are applied to some of these but the heavy lifting is done.

Today, however, was the day I used to create a way for the users to write manual exceptions to the warehouse. There are all sorts of exceptions that needed to be addressed -- payment checks that didn't post on time, reclassifications of risk following month end, Market assignments changing, etc etc. It basically boils down to a process failure somewhere along the line which those who report on this data are expected to disguise, despite what a very costly and complex source system insists on reporting.

My original recommendation for exception entry was a .NET app deployed on our intranet. Security could be managed through AD and there would be no need for anything beyond a browser on the desktop. I was prepared to provide the use case, do the entire database design (simple, single table with some lookups) and all the CRUD stored procs. I believe I could have even coded the front end -- but, since I'm not on the "developer list" -- not even in IT, actually -- those tools are reserved for the very competent developers.

So our acting PM went to the projects team with what should have been a simple 2-3 day task and was told that they didn't have the cycles to do it and would need to bring in a consultant. A consultant in this role would take at least week to get them network credentials and many more hours of haggling with IT bureaucracy to implement this solution that any "very competent programmer" should take to do. And so . . . the project sponsor said "Can't we just do something in Excel?"

And here's where my faulty honesty kicks in. "Of course! I've done this plenty of times!" I didn't elaborate on the long history of every company I've worked for wanting to get by on the cheap with a sub par Excel solution. The maintenance issues and code rot involved with it. Worst of all, the terrible error handling. I did manage to let them know that, if we went down this path, we'd be developing it "New Jersey Style". A term I may or may not have coined but which is certainly apt and I let that imagery sink in while wondering if I had just insulted anyone who might be from New Jersey (seriously, your beaches are lovely and I barely notice the smell). However, and don't tell anybody this, I have a soft spot for Excel -- the golden hammer of the luckless. I've been using Excel macros to create monsterpieces since the mid 90s. It was my first real IT job to manage an awful rats nest of reporting automations -- in Excel 3.0 mind you! I've learned a tremendous amount since then about what good code looks like. The current incarnations of VBA provide an ample assortment of solid tools to work with. And you can actually produce a solid piece of work with some elegant touches if you stay away from recording macros and have the confidence to hand code your work.

And so, I became inspired by the horrible dearth of BI blogs that aren't trying to sell you something to start blogging about my experience in the industry. And here I am kicking things off telling you how much I love programming -- even when it's with a sub par tool like Excel VBA.

There is something absolutely magical and intangibly satisfying about making something work and doing it in the simplest and most straight-forward way you can contrive. In poetry, they speak of the "economy of language" making every word add something to the piece. Using just the right words and no more than are needed to express what you intend.

A good bit of code does just that.

Moreover, it is the building of the thing starting with the decomposition of the problem -- forming a mental picture of the problem space and slowly stripping away the complexity until you understand the component parts that encapsulate the discrete functions of the thing. This is then followed by creating the facade that couples the functions together to create procedures. Then the writing of the functions that do their one thing well and briefly. And the bricks are snapped together and they just work. When it's done well, there is nothing that can be added or subtracted to improve upon what you've created.

Of course, Excel VBA fails to provide a framework that allows for such sublime master pieces but I feel like wrenched the most out of Excel that I could today and I feel good about it. I think if I had to program every day, I would get bored of it but it's moments like today that make me love my job.