You too love it, right?? Yes I know, SSIS is such a great tool that’s pretty easy and fast to learn. And, XML’s been a revolution in the way data is being stored, transmitted and consumed. Ultimately, SSIS and XML make a cool combo, don’t they??
In this post, I would like to explore how SSIS’ XML task component could be used to compute Null Ratio Percentage as part of data source profiling and cleansing. I chose this topic because I had seen a lot of blogs out there that were using script task instead to do this job, because of some constraints with namespaces, I guess. I do not say that this is wrong, but just that SSIS being such a popular tool, any developer who is not into VB/ VC++ should be able to use it. I myself being an Object Oriented Guy (Object Oriented Programming + Programming Guy = Object Oriented Guy), find it a bit difficult to straight away write VC++ code. So just to give those ‘other developers’ a seamless experience, here is my attempt. So, Ready, let’s dive in…
Here is what my demonstration package looks like
Yes, I set up a Null Ratio Data profiling task for Addressline2 column in Adventure Work’s Person table. This as you know creates a .xml ouput, which looks like
<DataProfile xmlns:xsd=”http://www.w3.org/2001/XMLSchema” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” xmlns=”http://schemas.microsoft.com/sqlserver/2008/DataDebugger/”>
<DataSources> … </DataSources>
<DataProfileInput> .. </DataProfileInput>
<ColumnNullRatioProfile ProfileRequestID=”NullRatioReq” IsExact=”true”>
<Column Name=”AddressLine2″ SqlDbType=”NVarChar” MaxLength=”60″ Precision=”255″ Scale=”255″ LCID=”1033″ CodePage=”0″ IsNullable=”true” StringCompareOptions=”0″ />
Our interest is the total RowCount; 19614 and the NullCount which is 19252. Based on these results, I would like to compute the Null Ratio as 19252/19614 which is 98%+, yup, pretty good.
Next comes the XML Task, and shown below is a detailed task screen grab
- Read the output .xml of the data profiling task
- Execute the following XPath query against the profile output to get Null Count value
Note: If you had noticed, the *[local-name()=..] is the trick to read nodes using default namespaces.
- Populate the Null Count value into a package variable for later use
Yup, the third is a script task, and I know, I know, I myself am using a script task here, but just to show you guys what the variable’s value is after the execution, and here it is
Cool, so we have what we need. Similarly we could even read the Row Count value and compute the Null Ratio Percentage. However, I would leave to it to you further, to compute the ratio. By the way, here is the XPath expression to read Row Count though,
So, how do you plan to do the Null Ratio Percentage further?? Write to me.. Till I catch you again, Adios..