Best Practices in Data Transformation for Data Warehouses

Posted by Michele Morrill

Mar 19, 2015 11:52:10 AM


The foundation of a sustainable business intelligence system is a good data warehouse. The function of the data warehouse is to consolidate data from various sources and supply that data to data marts that, in turn, supply the business user with easy-to-access, quality, integrated information. Business Intelligence gives an enterprise the ability to look at past behavior patterns and current activity to understand where the enterprise has been and to facilitate business planning.

  • A data warehouse is not and should not be a transactional system.
  • A data warehouse is an ETLP system: Extract, Transform, Load and Present.  
  • A data warehouse needs to be viable, sustainable and most importantly...flexible. 
  • A data warehouse needs to fit the business needs – it is not a cookie cutter one solution fits all.

NOT a Transactional System

A data warehouse is not and should not be a transactional system. Minimally logged processing is a good fit for a data warehouse environment. Consider using these whenever possible:

  • Bulk inserts to move data quickly into stage tables
  • TRUNCATE TABLE, a minimally logged operation, to clear stage tables before each load
  • Simple recovery mode for the databases, which keeps log size down

ETLP system

A data warehouse is an ETLP system: Extract, Transform, Load and Present.  Without a Presentation layer, a data warehouse is useless. Data flows in but people cannot get it out as information. Plan the Presentation layer along with the rest of the system/process structure.

Extraction tips

  • The best practice for source data extraction is to pull from your multiple heterogeneous data sources into a common text file format that is then bulk loaded (a minimal export sketch follows this list). This lets the data warehouse consolidate the data along common dimensions and filters.
  • This approach gives you the most control over the loads and makes it quick to add new sources. It also allows bulk inserts, which are minimally logged.
  • Business requirements and regulations such as HIPAA or SOX may prohibit clear-text transfers and make a direct ODBC or OLE DB transfer the preferred solution. Cloud-based or hosted data warehouses may also require encryption.
  • Staging databases are used to hold raw data.
  • Avoid transforming data while extracting or loading the stage environment. Doing so puts an unnecessary burden on the source system, which is often a transactional system, extends the duration of the extraction, and ties the validity of your logic to a source system that may change.
  • Loading raw data into stage tables also lets you use TRUNCATE TABLE before each load, a minimally logged operation.
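
If a source happens to be another SQL Server instance, one way to produce that common text file is the bcp command-line utility. This is a minimal sketch only; the server, database, table and output path are placeholders for your own environment.

bcp SourceDB.dbo.SourceTable out "E:\data_share\SourceTable.txt" -c -t"\t" -r"\n" -S SourceServer -T

The -c, -t and -r switches produce a character-mode, tab-delimited file with newline row terminators, which matches the BULK INSERT example later in this post; -T uses a trusted (Windows) connection.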

Viable, sustainable and flexible

A data warehouse needs to be viable, sustainable and most importantly flexible. Modularity is the key to maintaining flexibility. As with many consulting projects: you can have it good, you can have it fast, you can have it cheap...pick 2.

A balance must be struck between being complete, being fast and being correct. Transactional systems will have different ways of handling data, and I have yet to find a transactional system without test records and dates that are out of range of the business operations. These kinds of records must be handled in transformations or error filtering.

Needs to Fit the Business Needs

A data warehouse needs to fit the business needs. It is not a cookie-cutter, one-solution-fits-all product, but there are already-solved transformations available on the web that can be repurposed and reorganized to fit the business needs. There are no brownie points for reinventing the wheel.

The rest of this blog will focus on providing a selection of T-SQL transformation tips that can be referenced and repurposed for transforming your source data into actionable information.

BULK Insert of Text File

This is a minimal clear-and-load script for a text file using tab field delimiters and newline row terminators. Bulk inserts can become a lot more complicated, with separate file layout (format) scripts and record-block loading. See the Microsoft library to build your particular solution. TRUNCATE and BULK INSERT are minimally logged, which is exactly why they belong in a data warehouse load.

TRUNCATE TABLE DWtempfortime.dbo.TimeLoad

BULK INSERT DWtempfortime.dbo.TimeLoad

   FROM 'E:\data_share\Calendar for editing.txt'

   WITH

      (

         FIELDTERMINATOR ='\t',

         ROWTERMINATOR ='\n'

      );

 http://technet.microsoft.com/en-us/library/ms188365(v=sql.110).aspx
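
For larger or messier files, BULK INSERT accepts further options. The sketch below reuses the same staging table purely as an illustration: FIRSTROW skips a header row, BATCHSIZE commits the load in blocks, and TABLOCK takes a table-level lock, which helps the load qualify for minimal logging. A separate layout file can be supplied with the FORMATFILE option in place of the terminators.

BULK INSERT DWtempfortime.dbo.TimeLoad
   FROM 'E:\data_share\Calendar for editing.txt'
   WITH
      (
         FIRSTROW = 2,
         FIELDTERMINATOR = '\t',
         ROWTERMINATOR = '\n',
         BATCHSIZE = 100000,
         TABLOCK
      );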

 

Other Practices to Reduce Logging

Set the database to Simple recovery mode
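
A minimal example, using one of the mart databases that appears later in this post; in simple recovery mode the log is truncated automatically at checkpoints instead of growing until a log backup runs.

ALTER DATABASE [DW_Mart1] SET RECOVERY SIMPLE;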

Periodic database backups run as part of transformation processing keep the warehouse recoverable if you don't have a fancy backup system. (Note that in simple recovery mode the log truncates itself at checkpoints; in full recovery mode it is log backups, not full database backups, that truncate the log.)

/*   T-SQL Backing Up Databases to clear logs */

BACKUP DATABASE [DW_Mart2_RT]

  TO DISK = 'D:\BACKUPS\DW_Mart2_RT\DW_Mart2_RT.bak' WITH INIT

   ;

BACKUP DATABASE [DW_Mart1]

  TO DISK = 'D:\BACKUPS\DW_Mart1\DW_Mart1.bak' WITH INIT

   ;

/*   Now do Database Checks to get everything in good shape */

DBCC CHECKDB ([DW_Mart1], NOINDEX) WITH NO_INFOMSGS;

DBCC CHECKDB ([DW_Mart2_RT], NOINDEX) WITH NO_INFOMSGS;

 http://msdn.microsoft.com/en-us/library/ms188365.aspx

 

Identity Columns, Indexes and Check Sums

Three tricks to add to your bag of tools for data warehouse creation and maintenance:

  1. Identity Columns
  2. Indexes
  3. Check Sums

1. Identity columns

Identity columns can be used to make each record unique and to give the warehouse a system-generated key that is independent of the data:

CREATE TABLE new_employees
(
   id_num int IDENTITY(1,1),
   fname varchar(20),
   minit char(1),
   lname varchar(30)
);

  • Do not try to use Identity keys as data warehouse join keys. This is not considered good practice, because the time needed to exchange and replace keys in the fact tables outweighs any join-time savings from a numeric key.
  • Use Identity keys on dimension tables, and create primary keys on the dimension tables' unique columns so that manual and programmatic updates can coexist (a minimal sketch follows the reference link below).

http://msdn.microsoft.com/en-us/library/ms186775.aspx
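
A minimal sketch of that pattern, using a hypothetical payor dimension: the identity column is the warehouse's own surrogate key, while the primary key on the natural business key keeps manual and programmatic updates from creating duplicates.

CREATE TABLE dbo.dim_payor
(
   payor_key   int IDENTITY(1,1) NOT NULL,   -- system-generated key, independent of the data
   payor_code  varchar(20) NOT NULL,         -- natural business key from the source system
   payor_name  varchar(100) NULL,
   CONSTRAINT PK_dim_payor PRIMARY KEY (payor_code)
);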

2. Indexes

Indexes are used on dimension tables to speed up the lookups that assign missing values in fact tables. Indexes can also be programmatically created and dropped on fact tables (a sketch follows), but testing is always recommended: if you are constantly rebuilding your fact table, the overhead of indexing time is not a good trade-off. Indexing is not an all-or-nothing proposition – revisit it periodically.
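
A minimal sketch of the drop-and-recreate pattern around a fact load. The index name is hypothetical and the table is the ledger fact used in the examples later in this post; measure both approaches before adopting this one.

-- Drop the index so rows insert without index maintenance during the load
DROP INDEX IX_ledger_fact_guid ON [DW_Mart2].[dbo].[ledger_fact];

-- ... bulk load / transformation steps run here ...

-- Recreate the index once the fact table is reloaded
CREATE INDEX IX_ledger_fact_guid ON [DW_Mart2].[dbo].[ledger_fact] ([guid]);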

3. Check sums

CHECKSUM returns the checksum value computed over a row of a table or over a list of expressions. It is intended for use in building hash indexes, which can then support equality searches over the columns.

ALTER TABLE Production.Product
   ADD cs_Pname AS CHECKSUM(Name);
GO

CREATE INDEX Pname_index ON Production.Product (cs_Pname);
GO

The checksum index is intended to be used as a hash index, which improves seek performance when the columns to be indexed are long character columns. Just as with other indexing, testing is required to determine whether the time to update the index and statistics outweighs the benefit. Indexes must be maintained to be useful. Do your homework and your testing.

http://msdn.microsoft.com/en-us/library/ms189788.aspx
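
A short example of the equality search over that hash column, using the same Production.Product table as above; note that the query repeats the original Name predicate, because CHECKSUM values can collide and the hash match alone is not sufficient.

SELECT Name, ProductID
FROM Production.Product
WHERE cs_Pname = CHECKSUM(N'Bearing Ball')
  AND Name = N'Bearing Ball';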

 

Data Transformation Code Examples for Easy Reference

Example Text Cleaner:

CREATE FUNCTION RemoveNonAlpha ( @nstring nvarchar(255) )
RETURNS varchar(255)
AS
BEGIN
   DECLARE @Result varchar(255) = '';
   DECLARE @nchar nvarchar(1);
   DECLARE @position int = 1;

   WHILE @position <= LEN(@nstring)
   BEGIN
      SET @nchar = SUBSTRING(@nstring, @position, 1);
      -- Unicode and ASCII are the same from 1 to 255; only Unicode goes beyond 255.
      -- 0 to 31 are non-printable characters.
      -- 65-90 inclusive are the uppercase alphabet; 97-122 are the lowercase alphabet.
      IF UNICODE(@nchar) BETWEEN 65 AND 90 OR UNICODE(@nchar) BETWEEN 97 AND 122
         SET @Result = @Result + @nchar;
      SET @position = @position + 1;
   END
   RETURN @Result;
END
GO

Then call your new function in other transformation code:

UPDATE [dbo].[mytable]

SET [mycolumn] = [DW_MART2].[dbo].[RemoveNonAlpha]([mycolumn])

 

USE of NULLIF to set fields to NULL:

NULLIF checks for 0 and returns NULL only when the value is zero; the WHERE condition limits the number of records reviewed to rows that actually contain zeros. Use this to set measure columns to NULL for OLAP math: in T-SQL, NULL + number = NULL, while in MS Analysis Services NULL + number = number, and zeros show up in Reporting Services reports whereas NULLs filter out.

UPDATE MyTable

SET Col1 = NULLIF(Col1, 0),

        Col2 = NULLIF(Col2, 0),

        Col3 = NULLIF(Col3, 0)

WHERE (Col1 = 0

       OR Col2 = 0

       OR Col3 = 0)

 

UPDATE based on Join Conditions:

This is an example of updating two columns from another table using a join, further limited by WHERE conditions, which gives the best speed and avoids a full table scan of the target.

UPDATE [DW_Mart2].[dbo].[ledger_fact]

SET  [Units] = [adunittot], [fill1]='Monday'

FROM  [DW_Mart4].[dbo].[visit_fact] a

JOIN [DW_Mart2].[dbo].[ledger_fact] b

        ON a.[guid] = b.[guid]

WHERE  (b.[code] ='TICK' or b.[code] = 'VIST'  )

AND a.[seq] = 1 and b.VISIT_code = 'I'  

AND b.[fill1]<>'Ledger'

 

FIND longest field value:

Use this to check whether a field is ever populated, or to decide how to size a column in a transformation.

SELECT MAX(LEN(group_)) FROM [DW_Stage_All].[dbo].[payor_Stage]

 

Find Duplicates:

Invariably, at some point you are going to have to figure out whether you have managed to load duplicate data.

SELECT totals.word_id, totals.num

FROM (SELECT word_id, COUNT(*) AS num FROM sentence_word

GROUP BY word_id) AS totals

WHERE num > 1;

 

Find Unmatched examples:

If the values in both tables are non-nullable, the two methods below are semantically identical. Be careful with the SELECT inside your NOT EXISTS if you are using a join to limit the records checked for an INSERT: the join must list the non-nullable target table first, not the table that contains the records to be inserted (a sketch of that INSERT pattern follows the two queries below).

SELECT l.id, l.value
FROM [DW_Stage_All].[dbo].table_left l
WHERE l.value NOT IN
(
   SELECT r.value
   FROM [DW_Stage_All].[dbo].table_right r
)

SELECT l.id, l.value
FROM [DW_Stage_All].[dbo].table_left l
WHERE NOT EXISTS
(
   SELECT NULL
   FROM [DW_Stage_All].[dbo].table_right r
   WHERE r.value = l.value
)
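
A minimal sketch of the INSERT variant referred to above, using the same hypothetical staging tables (and assuming table_right also carries an id column): only rows with no match in the target are inserted.

INSERT INTO [DW_Stage_All].[dbo].table_left (id, value)
SELECT r.id, r.value
FROM [DW_Stage_All].[dbo].table_right r
WHERE NOT EXISTS
(
   SELECT NULL
   FROM [DW_Stage_All].[dbo].table_left l
   WHERE l.value = r.value
);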

 

Group by Investigation:

SELECT COUNT(code) AS [count],
       SUM(payment) AS [payment_total],
       [code],
       [entry],
       [stat]
FROM [DW_Mart2].[dbo].[ledger_fact]
WHERE [entry] = 'A' AND [stat] = 'O' AND [payment] <> 0
GROUP BY [code], [entry], [stat]

 

Calculate Two Months Ago:

declare @mon varchar(2)
declare @day varchar(2)
declare @yr varchar(4)
declare @prior varchar(10)
declare @startdate date

-- Wrap January and February back into the prior year (months 11 and 12)
set @mon = CASE WHEN Month(GETDATE()) <= 2 THEN Month(GETDATE()) + 10
                ELSE Month(GETDATE()) - 2 END
set @day = '01'
set @yr  = CASE WHEN Month(GETDATE()) <= 2 THEN YEAR(GETDATE()) - 1
                ELSE YEAR(GETDATE()) END
set @prior = @mon + '/' + @day + '/' + @yr
set @startdate = CAST(@prior as date)

Select @mon, @day, @yr, @prior, @startdate;
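
For comparison, a minimal sketch of the same calculation using date functions instead of string assembly (DATEFROMPARTS assumes SQL Server 2012 or later):

-- First day of the month two months before the current month
DECLARE @startdate date =
   DATEFROMPARTS(YEAR(DATEADD(MONTH, -2, GETDATE())),
                 MONTH(DATEADD(MONTH, -2, GETDATE())), 1);
SELECT @startdate;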

 

Example join through link (intermediate) table:

SELECT v.[apptstatus]
      ,v.[aptype]
      ,l.[pn]
      ,l.[date]
      ,e.[enum]
      ,e.[guid]
      ,v.[_timestamp_]
      ,c.[emrdoc]
      ,c.[emrloc]
      ,t.[cat]
      ,t.[encdate]
FROM [dbo].[visit] v
JOIN [dbo].[emr] e ON v.[_pn] = e.[guid]
JOIN [dbo].[ledger] l ON e.[guid] = l.[guid]
JOIN [dbo].[emr] t ON t.[pat] = e.[pat] AND t.[encdate] = e.[encdate]
JOIN [dbo].[case] c ON t.[guid] = c.[_pn]
WHERE v.[_timestamp_] > GETDATE() - 6

I hope these examples will help you examine your data quality and make informed decisions about data warehouse transformations, an important foundation for the accurate business intelligence that supports business planning and financial consolidation. Contact Strafford Technology for more information.
