Cookie Cutter SSIS, Part 2 – Standard Databases

by Ellen 30. June 2010 18:50

One of the things that may be different for us here at Perkins Consulting compared to your own work environment is that we're developing data marts for a number of different clients. Consequently, we benefit from standardization - names, objects, code, etc. Please keep that in mind as you read this series - some of the naming conventions we use, for instance (and the reasons we use them), may not fit the your company's model. Feel free to modify!

That's the whole point of cookie cutter SSIS - develop once, copy many.

In this second post of the series, I want to describe the standard databases we deploy for our clients and how they fit into the cookie cutter methodology.

For each new data mart project, we deploy a minimum of the following databases:

database_names

(A note on naming: Because we like to keep all the associated databases listed together, we use naming conventions to group them. The SQL Server database names  here are the current "generic" names that are easily transportable from one installation to the next. In the past, we've also used a prefix to denote the individual client - whatever works to keep them together in SSMS. I should also mention that I always use lower case and separate key words with underscores. This means that I won't have to think too hard when I'm working on a client site with a case-sensitive server on the same day that I'm working on another site that's not case-sensitive.)

The Data Mart database is, I hope, self-explanatory. This is the final resting place of the tables - facts, dimensions, aggregates, bridge tables - that comprise the data mart. The tables in this database will differ from client to client, of course, but will always include at least one date dimension and a table that contains reporting control parameters.

The Audit database contains a set of standard tables and other objects (user-defined functions, stored procedures, etc.) that support the auditing subsystem we use to track and support ETL activity. I'll describe the audit database in more detail in a later post.

The Staging database is volatile. By that I mean that the data in the database is not expected to persist, even though the table structures will. It is a repository for interim data, and yes, sometimes you need that interim storage. Even though SSIS is very good at managing an ETL data stream from source to final target, sometimes the interim step grants you better performance or more flexibility in processing.

The Source database is the landing spot for extracted data. The data in these tables is not expected to persist either. Each table in the Source database matches a data source object somewhere in the client's system, and is truncated and repopulated with each ETL run. With our clients, we try to use the concept of "Get In, Get Out and Get On With It" with respect to their source systems. The scheduled extract of the data is usually constrained by a number of things, including:

  • Status changes during the day that would affect reporting in an unexpected way
  • Source system batch processing that must complete in order to present the data correctly
  • Other system activity that consumes resources or locks source tables

We grab the data out of the source systems and park in in the Source database, after which we can work with it without worrying about the impact on the client's operations.

The Error database is the location for storing redirected data that would otherwise cause an insert process to fail. The tables in this database may be truncated and repopulated with each run (for data that will continue to fail until it's repaired at the source, for instance, and will continue to show up as an error until it's fixed). The data may also be allowed to persist in the Error database and be repaired and reloaded from the error tables (this is necessary for source data that cannot be recaptured in the same state from one ETL run to the next - inventory snapshot data, as an example).

Nearly every client installation requires the maintenance of some supplementary data that is not contained anywhere in the ERP system. (Can anyone say "Excel spreadsheets"?) The Lookup database is where we store persistent supplementary data. The data may be updated in its own SSIS package, but it's essentially reference data that's manually maintained by the client, in order to provide additional reporting richness to data mart tables.

The Archive database may or may not be used at any given client installation, depending on need. For instance, at one client site, we pass data back and forth from the data mart to a third-party service using the third party's defined fixed-width flat files. Troubleshooting issues with the files is much easier when we load the contents into database tables. The Archive database is used for non-data mart data only.

In the next post in this series, I'll go into more detail on the Audit database and its structures.

Tags: , , , ,

Data Warehousing

Cookie Cutter SSIS, Part 1 - “Data Mart in a Day”

by Ellen 29. June 2010 01:27

I've been working with SSIS since just before the RTM of SQL Server 2005. Over time, I've been able to steal tips and tricks from a number of sources (sessions at PASS Summits, notably those presented by Rushabh Mehta and Erik Veerman; Ralph Kimball, Joy Mundy and Warren Thornethwaite of the Kimball Group; SQLIS; Brian Knight and others that I apologize ahead of time for not mentioning). From all these sources, distilled by the practical work we've done with our own clients, I've evolved a standard set of patterns and practices that allow quick, efficient development for moderately sized data marts.cookie_cutters

I call this "Cookie Cutter SSIS."

The most extreme example of the use of these concepts is what I refer to as "Data Mart in a Day." Back in 2007, we had a prospective client who was interested in the Cognos (now IBM Cognos) business intelligence toolset, but wasn't certain how the tools would work for their organization. As a result, Perkins Consulting engaged to do a limited proof-of-concept project for this client, that started with the creation of a modest data mart (four dimensions, three base facts and an aggregate fact all with a relatively small number of rows) against which to deploy the Cognos reporting tools. Because the focus of the project was on the reports, we needed to get the data mart built and populated (along with ongoing maintenance) as quickly as possible.

Okay, okay. So the complete data mart was not literally finished in a day (we did some prep work prior to the onsite development day, added one of the facts and the aggregate after the first reporting pass and completed the data validation phase afterwards.) Nevertheless, this was PDQ - and the POC data mart remained in production with minimal downtime at this client site for at least six months.

I'd like to share some of these shortcuts that I continue to use to quickly deploy data mart objects. I'll say up front that the scope of the series is strictly mechanics - how to use the "cookie cutter" method of standard objects and templates to speed data mart ETL.

We're assuming that the data discovery and design phases have already occurred. We've got our data model; we know our data source options. Now we're ready to create our databases, build our tables and use SSIS to populate our data mart.

This methodology is predicated on several things:

  • A standard set of supporting databases, in addition to the data mart database itself
  • Uniform handling of data mart object metadata using SQL Server extended properties
  • Use of configuration files to enable data-driven dynamic Connection Managers
  • SSIS package templates, pre-configured with standard variables, containers and objects
  • Utility tables, functions and stored procedures
  • Custom SSIS components from community resources that extend SSIS functionality
  • A standard date dimension design and data source (Excel spreadsheet) that can be customized for specific client needs

In the next post, we'll talk about the standard databases that are the first step in the process.

Tags: , , , ,

Data Warehousing

Powered by BlogEngine.NET 1.5.0.7
Theme by Perkins Consulting Content Copyright 2009 Perkins Consulting, LLC All rights reserved.