Cookie Cutter SSIS, Part 3 – The Audit Database

by Ellen 12. August 2010 23:52

I think it's safe to say that most (if not all) data warehousing experts recommend keeping track of ETL execution results in some way. I started working with custom audit tables back in the DTS days, starting with some code I stole from Professional SQL Server 2000 DTS by Chaffin, Knight and Robinson. Over time, I've morphed the audit database into a critical part of the Cookie Cutter SSIS methodology.

Our standard audit database contains the following schemas:

  • audit
  • metadata
  • etl
  • util
  • and the ever-popular dbo, of course

Our standard audit database contains the following tables:

audit.package_execution ETL execution information at the package grain. We capture the execution GUID to enable linking to the standard SSIS logging table, dbo.sysdtslog90.
audit.job ETL execution information at the table grain. Cookie Cutter SSIS is about table-centric processing. This table will probably cause most DBAs to run, screaming, from the room (I try to protect Serena, our ace database designer, from it as much as I can). It's set up to be as generic as possible - consequently, a number of the columns in any given row will likely be null if they don't apply to the particular table being processed.
audit.execution_event This table stores event-related data. A SQL task with the stored procedure that inserts the event record can be inserted in your package when an event handler fires.
audit.reject_rows This table works with the error database - it stores a daily summary of total error rows for every non-empty table in the error database. Originally we developed this for use in client audit dashboards to give a quick view of potential problem data.
audit.control_parameter Control parameters are the mechanism we use to expose changing, low-cardinality data outside of ETL code. We use the parameters to populate variables inside of the SSIS packages. By setting up a control parameter for a potentially changing data item, we avoid hard-coding values inside of SSIS packages (where they're harder to find) and allow our ETL to be data-driven and dynamic.
metadata.dim_role Data mart dimensions may play multiple roles in a given fact. The most notorious role-player is the date dimension (at one client site, the date dimension is used in over a hundred different ways.) This table is a way to keep track of dimension roles; however, it was originally developed to make IBM Cognos Framework Manager easier to update. Rick, our ace developer, wrote a tool to assist Kathy, our ace report designer, in keeping the Framework Manager model up to date with terminology for both technical and business users (the terminology is captured from data mart object extended properties). This table was created to facilitate renaming all the role-playing dimensions inside the Framework Manager model.
metadata.stage_extended_properties

I took the Kimball University class from Joy Mundy and Warren Thornethwaite based on their book, The Microsoft Data Warehouse Toolkit, back when the book was just published. One of the concepts they presented was using SQL Server extended properties to store metadata about data mart objects. I took this idea and ran with it, and it has been one of the most transformative concepts in all my data warehousing experience. (Thanks, Joy and Warren!) Extended properties will get their own post - this particular table is used in the process of applying extended properties to data mart objects.

…and the following stored procedures:

audit.pr_insert_audit_job Instantiate audit job record (job_key as int data type)
audit.pr_insert_audit_job_bigint Instantiate audit job record (when installation requires job_key of type bigint)
audit.pr_insert_execution_event Add an event handler record
audit.pr_insert_package_execution Instantiate package execution record (package_exec_key as int data type)
audit.pr_insert_package_execution_bigint Instantiate package execution record (when installation requires package_exec_key of type bigint)
audit.pr_update_audit_job Update audit job record (configured for int type job_key)
audit.pr_update_package_execution Update package execution record (configured for int type package_exec_key)
etl.pr_add_table_extended_properties Add extended property metadata to data mart objects (used with SSIS metadata update package)
etl.pr_get_row_count Get row count from a specified table
etl.pr_get_surrogate_key Get the next surrogate key from a specified table
etl.pr_truncate_table Truncate a specified table

I’ve attached a script to create the audit database and all of the above schemas, tables and procedures.  The script assumes an int type for the job_key and the package_exec_key, and you’ll need to amend the code for the database create statements to point to the correct path in your environment:

In the next post in this series, I’ll discuss the other standard databases and some utility functions.

Tags: , ,

Data Warehousing | Business Intelligence

Cookie Cutter SSIS, Part 2 – Standard Databases

by Ellen 30. June 2010 18:50

One of the things that may be different for us here at Perkins Consulting compared to your own work environment is that we're developing data marts for a number of different clients. Consequently, we benefit from standardization - names, objects, code, etc. Please keep that in mind as you read this series - some of the naming conventions we use, for instance (and the reasons we use them), may not fit the your company's model. Feel free to modify!

That's the whole point of cookie cutter SSIS - develop once, copy many.

In this second post of the series, I want to describe the standard databases we deploy for our clients and how they fit into the cookie cutter methodology.

For each new data mart project, we deploy a minimum of the following databases:

database_names

(A note on naming: Because we like to keep all the associated databases listed together, we use naming conventions to group them. The SQL Server database names  here are the current "generic" names that are easily transportable from one installation to the next. In the past, we've also used a prefix to denote the individual client - whatever works to keep them together in SSMS. I should also mention that I always use lower case and separate key words with underscores. This means that I won't have to think too hard when I'm working on a client site with a case-sensitive server on the same day that I'm working on another site that's not case-sensitive.)

The Data Mart database is, I hope, self-explanatory. This is the final resting place of the tables - facts, dimensions, aggregates, bridge tables - that comprise the data mart. The tables in this database will differ from client to client, of course, but will always include at least one date dimension and a table that contains reporting control parameters.

The Audit database contains a set of standard tables and other objects (user-defined functions, stored procedures, etc.) that support the auditing subsystem we use to track and support ETL activity. I'll describe the audit database in more detail in a later post.

The Staging database is volatile. By that I mean that the data in the database is not expected to persist, even though the table structures will. It is a repository for interim data, and yes, sometimes you need that interim storage. Even though SSIS is very good at managing an ETL data stream from source to final target, sometimes the interim step grants you better performance or more flexibility in processing.

The Source database is the landing spot for extracted data. The data in these tables is not expected to persist either. Each table in the Source database matches a data source object somewhere in the client's system, and is truncated and repopulated with each ETL run. With our clients, we try to use the concept of "Get In, Get Out and Get On With It" with respect to their source systems. The scheduled extract of the data is usually constrained by a number of things, including:

  • Status changes during the day that would affect reporting in an unexpected way
  • Source system batch processing that must complete in order to present the data correctly
  • Other system activity that consumes resources or locks source tables

We grab the data out of the source systems and park in in the Source database, after which we can work with it without worrying about the impact on the client's operations.

The Error database is the location for storing redirected data that would otherwise cause an insert process to fail. The tables in this database may be truncated and repopulated with each run (for data that will continue to fail until it's repaired at the source, for instance, and will continue to show up as an error until it's fixed). The data may also be allowed to persist in the Error database and be repaired and reloaded from the error tables (this is necessary for source data that cannot be recaptured in the same state from one ETL run to the next - inventory snapshot data, as an example).

Nearly every client installation requires the maintenance of some supplementary data that is not contained anywhere in the ERP system. (Can anyone say "Excel spreadsheets"?) The Lookup database is where we store persistent supplementary data. The data may be updated in its own SSIS package, but it's essentially reference data that's manually maintained by the client, in order to provide additional reporting richness to data mart tables.

The Archive database may or may not be used at any given client installation, depending on need. For instance, at one client site, we pass data back and forth from the data mart to a third-party service using the third party's defined fixed-width flat files. Troubleshooting issues with the files is much easier when we load the contents into database tables. The Archive database is used for non-data mart data only.

In the next post in this series, I'll go into more detail on the Audit database and its structures.

Tags: , , , ,

Data Warehousing

Cookie Cutter SSIS, Part 1 - “Data Mart in a Day”

by Ellen 29. June 2010 01:27

I've been working with SSIS since just before the RTM of SQL Server 2005. Over time, I've been able to steal tips and tricks from a number of sources (sessions at PASS Summits, notably those presented by Rushabh Mehta and Erik Veerman; Ralph Kimball, Joy Mundy and Warren Thornethwaite of the Kimball Group; SQLIS; Brian Knight and others that I apologize ahead of time for not mentioning). From all these sources, distilled by the practical work we've done with our own clients, I've evolved a standard set of patterns and practices that allow quick, efficient development for moderately sized data marts.cookie_cutters

I call this "Cookie Cutter SSIS."

The most extreme example of the use of these concepts is what I refer to as "Data Mart in a Day." Back in 2007, we had a prospective client who was interested in the Cognos (now IBM Cognos) business intelligence toolset, but wasn't certain how the tools would work for their organization. As a result, Perkins Consulting engaged to do a limited proof-of-concept project for this client, that started with the creation of a modest data mart (four dimensions, three base facts and an aggregate fact all with a relatively small number of rows) against which to deploy the Cognos reporting tools. Because the focus of the project was on the reports, we needed to get the data mart built and populated (along with ongoing maintenance) as quickly as possible.

Okay, okay. So the complete data mart was not literally finished in a day (we did some prep work prior to the onsite development day, added one of the facts and the aggregate after the first reporting pass and completed the data validation phase afterwards.) Nevertheless, this was PDQ - and the POC data mart remained in production with minimal downtime at this client site for at least six months.

I'd like to share some of these shortcuts that I continue to use to quickly deploy data mart objects. I'll say up front that the scope of the series is strictly mechanics - how to use the "cookie cutter" method of standard objects and templates to speed data mart ETL.

We're assuming that the data discovery and design phases have already occurred. We've got our data model; we know our data source options. Now we're ready to create our databases, build our tables and use SSIS to populate our data mart.

This methodology is predicated on several things:

  • A standard set of supporting databases, in addition to the data mart database itself
  • Uniform handling of data mart object metadata using SQL Server extended properties
  • Use of configuration files to enable data-driven dynamic Connection Managers
  • SSIS package templates, pre-configured with standard variables, containers and objects
  • Utility tables, functions and stored procedures
  • Custom SSIS components from community resources that extend SSIS functionality
  • A standard date dimension design and data source (Excel spreadsheet) that can be customized for specific client needs

In the next post, we'll talk about the standard databases that are the first step in the process.

Tags: , , , ,

Data Warehousing

Don’t Forget the Parameter

by Ellen 14. May 2010 17:00

This is just a quick post to remind myself as much as anyone else who’s interested…

I am a firm believer in “Develop Once…Copy Many” since as far as I’m concerned, the fun in development is in the problem-solving, not necessarily in multiple implementations of a solution (no matter how cool it happens to be). 

Over time, I’ve evolved a templated “cookie-cutter” approach to ETL that allows me to copy and reuse SSIS containers for data mart object processing.  Normally, this works quite well, although BIDS has some interesting ideas about how to arrange nested containers and objects, which necessitates some fancy footwork at times. 

One of the instances (other than display) in which this fails is in copying a Data Flow task with an OLE DB Source adapter that includes a parameter in a SQL command.  When copying the the data flow task, the placeholder for the parameter is persisted, but the variable mapping for the parameter is not:

data_source_prameter

So, if you’re seeking the path of least resistance in your SSIS development and choose to copy and re-use Data Flow tasks that include parameters mapped in the SQL statement of an OLE DB Source adapter, don’t forget to remap the variable.

Tags: , ,

Data Warehousing

Using SSIS Package Explorer to Eliminate Undisplayed Package Objects

by Ellen 4. December 2009 20:11

Last night I saw upon the stair

A little man who wasn’t there

He wasn’t there again today

Oh, how I wish he’d go away

--from Antigonish by Hughes Mearns

 

Recently, I’ve been assisting a client in moving their data mart databases and SSIS packages to a new server.  Prior to decommissioning the old server, I was comparing package execution results, and came across an anomaly as exposed by our standard audit infrastructure tables.  In reviewing the package execution table, it appeared that several of the packages were being logged more than once (same package guid and execution guid but different execution keys).

package_exec_listing 

I could tell by reviewing the SQL Agent job history that each of these packages only executed once, and the rows in the package execution table were not completely identical – the columns that are normally updated at the end of the execution were null for one of the rows in each duplcate set.  Hmmmmm…. Obviously, something goofy was happening in the package, related to the instantiation and update of the package execution record.

When I opened the packages in question, I saw that the Control Flow pane included the three normal audit infrastructure sequence containers that are part of all our packages.  However, I noticed something.  Two of the infrastructure sequence containers had the numeral one appended to the container name.

 

exposed_sequence_containers

Usually, this means that the SSIS developer has copied and pasted an object with a name that already exists in the package context.  BIDS appends sequential numerals to objects in these cases, to make sure the object names are unique.  I searched the Control Flow pane, but could find no duplicates for the containers.  Sure, it was possible that the developer who created the package had copied from an open package and hadn’t renamed the objects in his destination package, but I had the evidence of the package execution table screaming at me that the code was firing more than once.  Also, I wasn’t able to remove the trailing numerals from the container names without BIDS throwing an error, so SSIS still believed that the containers existed within the package. 

Invisible code.  Oh, yeah.  We hate that.

I was reduced to observing what was right in front of my nose – the Package Explorer tab.  I’m embarrassed to say that in over five years of working with SSIS, I have never looked at that tab before.  However, the Package Explorer tab does not hold a grudge – even though I’d ignored it all this time, it helped me out anyway.

When I expanded the Executables node, I could see that the two infrastructure containers were indeed duplicated.  By right-clicking on the ones that were not displayed on the Control Flow tab (the items without the trailing numeral), I was able to delete them.  I was then able to rename the visible containers, removing the trailing numeral without error.

 

package_explorer_listing

Tags: , ,

Data Warehousing | Business Intelligence

64-bit SQL Server and iSeries ODBC

by Ellen 12. November 2009 00:31

This week, I started work on a data mart for a client whose source data resides on an AS400. The data access for the source system is handled via the iSeries ODBC driver from IBM. Getting the DSNs set up correctly to work with the client's 64-bit SQL Server installation was a little irritating, so I thought I'd share some lessons learned.

As you may already know, even though the database engine for 64-bit SQL Server is a 64-bit application (with a 64-bit dtexec executable for SSIS packages), the BIDS designer is a 32-bit application. This presents a challenge in package development and deployment when data connections either do not have a 64-bit driver (as in the case of Jet for Excel or Access) or when the 64-bit and 32-bit drivers exist, but require different executable versions (as with ODBC).

In this case, I needed to connect to the AS400 with the iSeries ODBC driver from within SSIS. The client had created the DSN for me and believed he had created a 32-bit DSN (since the driver description within the ODBC Administrator was labeled as 32-bit). However, the DSN did not appear in the DSN selection drop-down in the ODBC connection manager in SSIS.

The problem was that, regardless of what the driver was labeled, the DSN had to be created with the 32-bit ODBC Administrator in order for it to exist as a 32-bit DSN and become accessible to SSIS. Furthermore, in order for the package to run both in debug mode from within the designer (32-bit) and as a scheduled job in SQL Server Agent using the Integration Services step type (64-bit), you have to create DSNs with the identical name in both the 32-bit ODBC Administrator and the 64-bit Administrator.

So, in the Program menu for the iSeries access, you'll see a 64-bit and 32-bit ODBC Administrator:

 

iseries_menu

You need to create identical system DSNs using both Administration modules.  The DSN defined in the 32-bit Administrator will be exposed and utilized from with BIDS; the DSN defined in the 64-bit Administrator will be utilized by SQL Server Agent and 64-bit dtexec.

To add a little more excitement to your day, SSIS has no ODBC source adapter for the Data Flow task, and in SQL Server 2008, the Data Reader source adapter has been retired and replaced with the ADO.NET source adapter.  In order to use the iSeries DSN, you need to create your connection manager as an ADO.NET provider (.Net Providers\Odbc Data Provider) rather than as an ODBC connection manager type.

Tags: , , , ,

Business Intelligence | Data Warehousing

Using the Trash Destination Adapter for SSIS Housekeeping

by Ellen 23. October 2009 19:23

In an earlier post about the custom Row Number transformation, I mentioned the resources available from Konesans (www.konesans.com) via SQLIS (www.sqlis.com). The Trash Destination adapter is another one of my can't-live-without custom components, although I may use it a little bit differently than is described on its SQLIS page (http://www.sqlis.com/post/Trash-Destination-Adapter.aspx).

The description of the adapter on that page presents it as something you would use only in development and testing, as an option for breaking out the data flow to evaluate performance bottlenecks, for instance, or for creating a path to associate with a Data Viewer. The description even states:

"It is also obvious that this is for development or diagnostic purposes, and is clearly not a part of the functional design of the package."

Actually, I use this component in the majority of my SSIS packages.  In the ETL logic flow, it’s frequently necessary to evaluate data  and determine whether a particular row should continue in pipeline or be discarded.  The Trash Destination is a low-maintenance way to ensure that every row of data in the pipeline has a definite endpoint – even if that endpoint is the garbage.  It’s easily configurable.  Well, no configuration really required.  Can’t get much easier than that.

As an example, here’s a picture of a section of a data flow task that processes a type 1 slowly changing dimension: 

type_1_split_trash_destination

 

The Conditional Split evaluates whether the type 1 attributes have changed based on an upstream Checksum transformation.  If the data has sustained a change, the row is diverted to the type 1 change staging table.  If the row is unchanged, it’s discarded (diverted to the Trash Destination).  This appeals to my sense of closure – all those rows of data in the pipeline have a place to go.  But by inserting the Row Count Plus transformation (another freebie from Konesans/SQLIS), I can count the unchanged rows and store that information in the audit data for the ETL run, so I can squeeze a bit more use out of the Trash Destination branch.  It’s all good.

Oracle OLE DB Provider and 64-bit SQL Server 2005

by Ellen 6. October 2009 22:48

Back in fall of 2006, Perkins Consulting embarked on a data warehousing project with a client whose source data resided in Oracle and whose SQL Server 2005 environment was established on a 64-bit server. This was our first experience with this particular combination of variables, and we ran into some entertaining issues.

I was attempting to set up a Connection Manager for the Oracle source database, using the Native OLE DB\Oracle Provider for OLE DB. When testing the connection, I received the following error:

oracle_provider_error

Entertaining?  Oh, yeah.  A laugh riot.

After extensive Google searches, I finally found a post regarding this BIDS error in reference to Analysis Services.  Apparently the issue was on the Oracle side.  The Oracle provider had an issue with special characters in path names.  The default installation folder for SQL Server 2005 32-bit executables (including BIDS) on a 64-bit server includes parentheses (i.e., c:\Program Files (x86)\…).  Oracle was not pleased with this situation in the slightest.

The post I found recommended installing the 32-bit executables in a folder without special characters, but that wasn’t an option at our client site.  The post provided a work-around using a batch file to launch BIDS and to explicitly set the executable path without using the special characters, placing the Oracle client first:

Set path= c:\oracle\product\10.2.0\client_1\bin;%path% "C:\Progra~2\Microsoft Visual Studio 8\Common7\IDE\devenv.exe"

The error only arises when calling the Oracle provider using BIDS (a 32-bit application).  When executing the completed package, SQL Agent invokes the 64-bit dtexec executable, which is located in a path that TNS does not find objectionable. 

It’s entirely possible that this issue has been resolved in later versions of the Oracle client or the native Oracle OLE DB provider, but because of the configuration at this particular client site, I still launch BIDS there using this batch file.

Using the Row Number Transform to Assign Surrogate Keys in the Pipeline

by Ellen 30. September 2009 01:28

I suspect that most of us who have been using SQL Server 2005/2008 SSIS for any length of time know what a great resource SQLIS is (if you don't know, check it out - www.sqlis.com). This site provides downloads of several custom SSIS components that are now integral parts of any Pcon SSIS project.

One of these can't-live-without components is the Row Number transform. This component, like my other must-haves, was developed by Konesans (www.konesans.com).

When I first started using SSIS, I used a Script transform, as described in Donald Farmer's The Rational Guide to Extending SSIS 2005 with Script, to assign surrogate keys to data mart dimensions and facts in the data pipeline. Although I was grateful to have this model at the time, I don't really like to use script components if I can avoid them. I've run into too many compile issues, especially with 64-bit environments. The Row Number transform delivers a much simpler way of assigning surrogate keys (simpler because I don't have to write any code…always a plus for me!).

The Row Number transform…numbers rows. Yeah, sounds obvious, right? The transform UI allows you to assign a seed value, an increment value, a target pipeline column to store the sequential numbers and presents an option to store the final number in a target variable:

row_number_ui 

 

The Number Column can be a pipeline column that already exists (I always create the surrogate_key column in a Derived Column transform), or you can choose to let the Row Number transform create a new column for you. 

In my standard table processing methodology, I retrieve the next surrogate key value for the table prior to launching the data flow task and store it in a variable.  I then use SSIS Expressions to assign this seed value to the Seed property of the Row Number transform.

As with all data flow transforms, you set properties for the Row Number transform at the data flow task grain.  Just click anywhere on the Data Flow design surface to be able to access the properties for the transform:

row_number_properties

Since I haven’t found a way yet to insert an expression without popping open the Property Expression Editor window (by clicking the ellipsis next to Expressions – but you knew that already…), I open the editor and locate the Seed property for the Row Number transform (I’ve named it rnAssignSurrogateKeys in the screen shots):

row_number_expression_zero

There’s an issue with the way the custom transform interacts with the Property Expression Editor that sometimes results in an error when you try to directly assign the Seed property to a variable – regardless of the data type of your variable, the editor chokes and thinks you’re trying to assign a decimal value to a property that only accepts integers.  On the SQLIS page for the Row Number transform, the Troubleshooting section describes this as a known issue, and shows how to manually edit the .dtsx code to fix the issue.  Since I don’t want to mess with the code directly (no coding for me – see above), I prefer to fool the editor.  By assigning a zero to the Seed property as shown, you can successfully exit the editor.  Then, in the Properties pane, you can change the expression value to your variable as shown in the Properties screen shot above.

This only appears to be an issue the first time you try to assign a value to the property.  Once you’ve changed the value to a variable in the Properties pane, you can open and close the editor with no further errors – it will present the variable as the expression value:

row_number_expression_w_variable

Downstream of the Row Number transform, you can then assign the output column to the surrogate key column for your table.  Piece of cake.

64-bit SQL Server 2005 SSIS and UNC paths, Part 2

by Ellen 11. September 2009 22:58

"You can't get there from here."

Part 2

In Part 1 of my post on this topic, I described how to configure a package containing UNC paths on flat file destinations to execute successfully in BIDS debug mode.

At the risk of being repetitive, here's the system background:

  • The SQL Server data mart resides on a remote server running 64-bit SQL Server 2005 Enterprise Edition.
  • Access to the file system on the server is not allowed.
  • SSIS development occurs on a virtual server with only the SQL Server client tools installed.
  • The destination for the output files is on yet a third machine, only accessible to the server and the development machine via UNC paths.

Now that I had successfully executed the package in debug mode by forcing BIDS to use the 32-bit runtime to recognize the UNC paths, I was ready to try to schedule the package via a SQL Agent job.

I'm used to using the CmdExec step type in SQL Agent to force execution of a package in 32-bit mode (I have to do this any time the package includes Excel file sources, for instance). However, when I used the usual method for this particular package, I got yet another error message:

Error: 2009-08-31 16:01:55.99 Code: 0xC001401E Source: stage_reports Connection manager "Extract File Cards" Description: The file name "\\target1\Dest-Data\Cards.csv" specified in the connection was not valid.

Well, rats.

Since I wasn't able to access the file system or view the context of the server in the network, I had to find a way to work around this - i.e., schedule the package in 32-bit mode without using SQL Agent.

To do this, I wrote a batch file to execute the package using 32-bit dtexec from the context of the development machine (I used dtexecui to get the command line for the package):

"C:\Program Files\Microsoft SQL Server\90\DTS\Binn\dtexec.exe" /FILE "\\vm-dev1\ReportStaging\SSIS\report_staging\stage_reports.dtsx"

/MAXCONCURRENT " -1 " /CHECKPOINTING OFF /REPORTING EW

By scheduling the batch file as a Windows task and bypassing SQL Agent entirely, I was finally able to get the UNC paths to work on the calling machine, the server and the destination machine.

Whew!

Tags: , , , ,

Data Warehousing

Powered by BlogEngine.NET 1.5.0.7
Theme by Perkins Consulting Content Copyright 2009 Perkins Consulting, LLC All rights reserved.