This post was inspired by a session I held yesterday, presenting new BI features in SQL ‘08. One of the features which stirred some interest was the Data Profiling component in SSIS. Now this little feature hasn’t been given much airplay on the SQL radio waves, with features like Data Compression and Report Builder 2.0 taking center stage, so I thought I’d do my bit for the data profiler and unpack this incredibly useful feature…
A typical first step in any BI project involves a Business Analyst taking a look at a client’s source data. This step is crucial to validate the client’s assumptions about data cleanliness and the underlying data relationships (and normally those assumptions are a far cry from the truth!)
This is normally a tedious manual process, with a serious number of SQL GROUP BY statements and plenty of manual Excel manipulation used to understand exactly what’s in a given source table.
This is exactly where the Data Profiling task in SSIS comes into play. It is a mechanism whereby an analyst can profile an underlying table, and discover a number of key elements about that table.
The task itself runs in SSIS, with the results being pushed into an external XML file. This file can then be opened using the Data Profile Viewer (a new client tool in SQL ‘08).
Full how-to details are available on www.microsoft.com/sql, with a nice video available at http://technet.microsoft.com/en-us/library/cc952923.aspx
Something which seems to be missing, though, is a concise summary of the types of profiles that can be run. So, here is my list (in my personal order of usefulness…)
1. Candidate Key
This profile looks at the uniqueness of each column in the table. So, if a column is 100% unique within the table, there’s a high chance that this field may be the primary key. The profile can be tweaked to analyze combinations of a number of fields (in cases where a composite key is actually used).
An example from my own consulting career: a client was convinced that a Product Code field made each product unique, when it turned out that we actually needed a combination of several fields.
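For anyone without the profiler to hand, a rough manual equivalent looks something like this (the dbo.Product table and its column names are hypothetical):

```sql
-- If DistinctCodes equals TotalRows, ProductCode is a candidate key
SELECT COUNT(*)                    AS TotalRows,
       COUNT(DISTINCT ProductCode) AS DistinctCodes
FROM dbo.Product;

-- Testing a composite key: compare distinct combinations to the row count
SELECT COUNT(*) AS DistinctCombinations
FROM (SELECT DISTINCT ProductCode, SupplierCode
      FROM dbo.Product) AS Combos;
```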
2. Column Length Distribution
Does just that. Gives a nice bar graph with field length distribution. How many times have we heard… “It’ll never be more than 8 characters long” 🙂
An absolute necessity for determining field lengths in the target data warehouse. Nothing worse than planning for VARCHAR(20) and then receiving a 21-character input.
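For comparison, the manual GROUP BY version of this profile is a one-liner (dbo.Customer and Surname are hypothetical names; note that LEN ignores trailing spaces):

```sql
-- Distribution of field lengths for a single column
SELECT LEN(Surname) AS FieldLength,
       COUNT(*)     AS Occurrences
FROM dbo.Customer
GROUP BY LEN(Surname)
ORDER BY FieldLength;
```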
3. Column NULL Ratio
Similar to above, returns the percentage of NULL values per field. Useful for those cases where a client wants to create a report based on a certain field, when meanwhile 98% of the rows have NULL in that field.
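The hand-rolled version of this check, for one column at a time (table and column names are hypothetical):

```sql
-- Percentage of NULLs in a single column
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN MiddleName IS NULL THEN 1 ELSE 0 END) AS NullRows,
       100.0 * SUM(CASE WHEN MiddleName IS NULL THEN 1 ELSE 0 END)
             / COUNT(*) AS NullPercent
FROM dbo.Customer;
```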
4. Functional Dependency
Used to look for relationships within a given table, i.e. whether the value in one column determines the value in another. Can also be tweaked to look at a given number of fields.
Useful for determining hierarchical relationships within data. Again, similar to 1. in that it brings out the truth about the underlying data that the client is often not aware of.
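To check a suspected dependency by hand, you can look for determinant values that map to more than one dependent value (dbo.Address and its columns are hypothetical):

```sql
-- Region is functionally dependent on City only if every City maps to
-- exactly one Region; any rows returned here are violations
SELECT City,
       COUNT(DISTINCT Region) AS RegionsPerCity
FROM dbo.Address
GROUP BY City
HAVING COUNT(DISTINCT Region) > 1;
```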
5. Column Value Distribution
Actually returns the unique values that exist in individual fields, as well as the distribution of each of these values.
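This is the classic GROUP BY profile, with the distribution worked out via a windowed aggregate (dbo.Product and Colour are hypothetical names):

```sql
-- Distinct values and their share of the table
SELECT Colour,
       COUNT(*) AS Occurrences,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS Pct
FROM dbo.Product
GROUP BY Colour
ORDER BY Occurrences DESC;
```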
6. Column Pattern
Similar to above, yet it doesn’t return the actual values; rather it analyses the field values and returns regular expression values for each field. Nice way of generalizing / applying a filter to the types of values that exist in a certain field.
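The profile itself produces regular expressions; plain T-SQL has no regex support out of the box, but a LIKE character-class pattern can at least test values against one expected shape (PhoneNumber and the pattern are hypothetical):

```sql
-- Count rows that don't match an expected NNN-NNNN pattern
SELECT COUNT(*) AS NonMatchingRows
FROM dbo.Customer
WHERE PhoneNumber NOT LIKE '[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]';
```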
7. Column Statistics
This one is at the bottom of the list, more due to my lack of statistical understanding 🙂 I’m quite sure a statistician would derive serious value from this output. Types of stats that are produced include
– Minimum
– Maximum
– Mean
– Standard Deviation
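The manual equivalent for a numeric column is a handful of built-in aggregates (dbo.Sales and Amount are hypothetical names):

```sql
-- Basic statistics for a single numeric column
SELECT MIN(Amount)   AS MinValue,
       MAX(Amount)   AS MaxValue,
       AVG(Amount)   AS MeanValue,
       STDEV(Amount) AS StdDeviation
FROM dbo.Sales;
```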
So, that’s my short summary of a tool that could literally save an analyst hours when exploring new data sources. Happy analyzing!