Jeff Prom's SQL Server Blog

Sharing knowledge and tips on SQL Server, Business Intelligence and Data Management.

Cleansing Data Through SSIS with DQS

Posted by jeffprom on January 13, 2015

While Data Quality Services (DQS) takes some time to build up an effective knowledge base through manual data review, there are times when it sure would be handy to automate cleansing through SSIS. This is only recommended when you need quick results and are feeling pretty good about your knowledge base, or are perhaps using an external knowledge base through reference data. Should you decide to cleanse data through SSIS using DQS, it’s important to note that the transform can export quality control metrics such as the Status and Confidence score along with the final data, which can later be used to review the results and make final decisions. So let’s take a look at using the DQS Cleansing transform in SSIS.

In this example, I created a basic SSIS package with one Data Flow. I selected an Excel Source and pointed it to a file that has some data to be cleansed. In my example I am going to send in some bad addresses and attempt to cleanse them using Melissa Data. To learn more about the Melissa Data Address Check add-in and reference data in general, view my other blog post here.

Create a new package, add a Data Flow Task, and then add a DQS Cleansing transform object.

dqs cleansing transform setup

Click on the New button next to the Data quality connection manager to setup a new connection to an instance of DQS and then select a knowledge base.

dqs cleansing transform setup with connection

Click on the mappings tab. Here you will map input columns from your data source to the domains on the knowledge base.

dqs cleansing transform with mappings

If you would like to see additional columns on the final results such as the Confidence score and the Reason then click on the Advanced tab.

dqs cleansing transform advanced tab

When finished, click OK and test it out. In my example I am just using a Union All object as my destination for demo purposes, but I can still see the resulting cleansed data through the Data Viewer.

dqs cleansing results from transform

As you can see from the results, two addresses were corrected, two had suggestions, and one failed to be cleansed. We could get fancy and split up the results based on the Status values. Corrected ones may be considered ready to ship. Auto Suggest records may want to be reviewed, and the New ones would need to be reviewed since they were not found in the knowledge base.
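For example, if we landed the transform’s output in a SQL Server staging table, a quick T-SQL pass could route the rows for follow-up. This is just a sketch; the table name and the DQS output column names below are assumptions and will vary based on your own mappings:

-- Route cleansed rows by their DQS status (dbo.AddressCleanseStage is a placeholder name).
SELECT *
FROM dbo.AddressCleanseStage
WHERE [Record Status] = N'Corrected';   -- considered ready to ship

SELECT *
FROM dbo.AddressCleanseStage
WHERE [Record Status] IN (N'Auto Suggest', N'New');   -- queue these for manual review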

Using the DQS Cleansing transform in SSIS is useful but doesn’t eliminate the need to review the final cleansed results. Cleansing data through SSIS is dependent on the quality of the underlying knowledge base but does provide a quick and efficient way to do an initial data cleanse. The resulting cleansed data can then be evaluated using the Status, Reason and Confidence score values. Once the data has been processed and reviewed, only then should it be considered cleansed and ‘ready for use’.

Posted in DQS, SSIS | Leave a Comment »

Cleansing Address Data With Data Quality Services (DQS) and Melissa Data

Posted by jeffprom on January 9, 2015

In this post I will show how to cleanse address data using Microsoft Data Quality Services (DQS) and Melissa Data as a reference data service. First, you need to make sure you have a working installation of DQS. If you don’t, you can view my other post here that walks you through installing DQS. Once DQS has been installed, you will need to visit the Microsoft Azure Marketplace website and subscribe to the Melissa Data Address Check service. When I wrote this post, you could get 1,000 addresses cleansed per month for free. This is useful for testing or if you have a small set of data to be cleansed. Select the desired service level/amount from the list and click the Sign-Up button. You will now have a subscription added to your account.

melissa data service on the azure marketplace

Next, we need to tell DQS about our Azure Marketplace account. On the Microsoft Azure Marketplace website, click on ‘My Account’ and look for the Primary Account Key value.

azure marketplace ID

Highlight and copy the Key value. Open the Data Quality Client tool and click on Configuration on the right-hand side. Paste your Key value into the DataMarket Account ID entry box and click on the validate button.

dqs validate account ID

Once your ID has been validated you are ready to set up a knowledge base in DQS that will use Melissa Data. On the DQS client, click on the Close button on the Configuration screen to get back to the main menu. On the top left, click on New Knowledge Base. Add 4 new domains (Address, City, State, Zip). Next, add 1 composite domain (FullAddress) and add to it the 4 domains you just created (Address, City, State, Zip).

setting up the knowledge base domains

With the FullAddress composite domain selected, click on the Reference Data tab. Click on Browse and check the Melissa Data service.

On the Schema mapping table, create the following mappings between the RDS Schema and Domains:
1. AddressLine (M) = Address
2. City = City
3. State = State
4. Zip = Zip

adding melissa data as reference data

You can adjust several of the Provider Settings such as Auto Correction Threshold, Suggested Candidates, and Min Confidence. Once finished, click the Finish button and select Publish.

save knowledge base with reference data

Next, we will cleanse a set of data using our new knowledge base. On the main screen of the DQS client, click on New Data Quality Project. Give it a name, and be sure to select the knowledge base we just finished setting up that uses the Melissa Data reference data.

new data quality project

Click Next and select a data source that points to address records you would like to cleanse. Map the 4 Source Columns to the appropriate Domains. If you have mapped all 4 correctly, you should be able to click on the ‘View/Select Composite Domains’ button which tells you that the composite domain will now be used.

data quality project - map domains

Click Next, and Start. DQS will analyze the data and provide a summary of results.

cleansing profile results

Click Next to see the cleansing suggestions and to ultimately work through corrections. In my example I am using 5 addresses which are Melissa Data office locations that I found on their website. I altered some aspects of the addresses to see how well it can cleanse the data. Let’s take a look at the results.

record results before adjustments

As you can see here, DQS suggested 2 corrections (Road to Rd, Drive to Dr). The interesting thing is that both of these corrections don’t match what is on their website. (They have Road and Dr.) Perhaps they need to cleanse their own addresses. 🙂 It also listed one entry as Invalid even though all I did was add a 1 to the address value. This record would need to be manually reviewed/corrected or we could adjust the threshold settings to see if we get better results. DQS successfully corrected 2 records with a good confidence score. It wasn’t fooled by me changing the City name on one and Zip on the other.

approved results

Work through the results by making adjustments, approving or rejecting records, and then finally exporting the records to their final destination (SQL Server, CSV, or Excel).
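If you export to SQL Server and include the cleansing info, the status, reason, and confidence columns come along with the data, so a simple review query can surface anything that still deserves a second look. A sketch, assuming the results landed in a table named dbo.AddressCleanseResults (the table and column names are placeholders and will depend on your domain names and export options):

-- Show everything DQS did not mark as Correct, lowest confidence first.
SELECT FullAddress_Output, FullAddress_Reason, FullAddress_Confidence, Record_Status
FROM dbo.AddressCleanseResults
WHERE Record_Status <> N'Correct'
ORDER BY FullAddress_Confidence;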

Address cleansing can be an important yet difficult task. However, with DQS we now have the ability to tap into external knowledge bases such as Melissa Data to help. By leveraging reference data within DQS we can quickly and effectively cleanse address information.

Posted in DQS | 3 Comments »

SQL Saturday #350 – Winnipeg

Posted by jeffprom on October 28, 2014

sqlsat350_web

On Saturday, November 22nd I will be at the Winnipeg SQL Saturday #350 event and will be giving a presentation on SQL Injections. The event will be held at the Red River Community College – downtown, 160 Princess St., Winnipeg, MB R3B 1K9, Canada. This is a great opportunity to learn more about SQL Server and I highly encourage everyone to attend.

Here is the topic I will be presenting on:

SQL Injections and How To Stop Them

SQL injections are a technique where malicious users can inject SQL commands into a SQL statement through web page input. If left unprotected, SQL injections can cause havoc with web applications by compromising security, manipulating data, hoarding system resources, retrieving sensitive data, removing databases and database objects, or taking databases offline! This session will show how SQL injections work, how to identify if you are being attacked, and various prevention techniques and their effectiveness.
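As a tiny illustration of the problem (the dbo.Users table and LoginName column are hypothetical), compare a query built by string concatenation with the same query run through sp_executesql:

-- Vulnerable: the user's input becomes part of the statement text.
DECLARE @UserInput nvarchar(100) = N''' OR 1=1 --';   -- a classic injection payload
DECLARE @Sql nvarchar(max) =
    N'SELECT * FROM dbo.Users WHERE LoginName = ''' + @UserInput + N'''';
EXEC (@Sql);   -- returns every row instead of just the one requested

-- Safer: the input is passed as a parameter and treated strictly as a value.
EXEC sp_executesql
    N'SELECT * FROM dbo.Users WHERE LoginName = @LoginName',
    N'@LoginName nvarchar(100)',
    @LoginName = @UserInput;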

To register for this event and to find additional information, you can visit the SQL Saturday Winnipeg page here:

https://www.sqlsaturday.com/350/eventhome.aspx

 

Posted in Events, Security | Leave a Comment »

PASS Summit 2014 Here I Come!

Posted by jeffprom on October 16, 2014

PASS_14_Attending_250x250

I am fortunate again this year to have the opportunity to attend the PASS Summit conference. Big huge thanks and shout out to my current employer (Silver Bay Realty Trust Corp) for sending me. This will be my 3rd time attending and coincidentally all three of my visits have been to Seattle, WA which is great because Seattle is a very fun city to visit. If it’s anything like my previous visits, it should be a great conference filled with a lot of great learning and networking opportunities. Hope to see you there!

 

Posted in Events | Leave a Comment »

SQL Saturday #332 – Minnesota 2014

Posted by jeffprom on September 5, 2014

sqlsat332_web-white

This year I have been selected to present at SQL Saturday #332 Minnesota on what I consider to be a very fascinating topic: SQL Injections. If left unprotected, your web applications can be severely compromised by SQL Injection attacks. To learn more about SQL Injections, I invite you to attend my session on October 25th, 2014.

SQL Injections and How To Stop Them

SQL injections are a technique where malicious users can inject SQL commands into a SQL statement through web page input. If left unprotected, SQL injections can cause havoc with web applications by compromising security, manipulating database data, hoarding system resources, retrieving sensitive data, and even removing database objects or the database itself! This session will show how SQL injections work, how to identify if you are being attacked, and various prevention techniques and their effectiveness.

There is a fantastic list of presenters at this event! Secure your spot today by signing up on the SQL Saturday website.

http://sqlsaturday.com/332/eventhome.aspx

 

Posted in Events, Security | Leave a Comment »

SQLSaturday #291 – Chicago 2014

Posted by jeffprom on March 19, 2014

I am happy to announce that I was selected to speak at SQLSaturday Chicago! This event will be held April 26, 2014 at 1221 N Swift Road, Addison, IL 60101. I will be presenting on Master Data Services.

Master Data Services
Does your company have similar data in multiple locations? Not really sure which set of data is accurate? With MDS, Microsoft gives you the ability to have one accurate source of non-transactional data to use with your applications or send to other databases and ETL processes. Leverage data stewards to help manage the data repository to ensure it is up-to-date and accurate. In this session we will cover MDS from start to finish including installation, creating models/entities, working with data, using the Excel add-in, security, hierarchies, views, versions, and business rules. After this session, you should have a good understanding of what MDS does, how to use it, and how to get started with your own MDS project.

Additional information can be found here:
http://sqlsaturday.com/291/eventhome.aspx

Be sure to register for this event today!

Posted in Events, MDS | 2 Comments »

Generating Dynamic CRUD Queries Within Microsoft Excel

Posted by jeffprom on February 21, 2014

Last year I learned a neat trick from Ross McNeely. We kept getting ad-hoc Excel documents sent to us and we needed to update the databases with these values. Traditionally you might go through the import/export data wizard, or create an SSIS package to import the data. This usually leads to data conversion issues, possible temp tables, writing extra queries, and sometimes all-around headaches.

This method shows how to write CRUD SQL statements within Excel that can then update the database. This is a really quick way to take a handful of fields and update data with little effort. If you have a lot of fields you may want to consider going the more traditional route, because it could take some time to write long query statements in Excel. That said, I have used this technique probably 100 times so I decided to post about it. Let’s take a look at how it works.

I have two examples here that will update two tables in the Adventure Works DW database. Let’s pretend that someone has just sent us an Excel file and needs product information updated in the DimProduct table. This is what the data looks like.

First, find an open cell off to the right (F) and begin writing the SQL query as shown here. Begin with =" and place single quotes where the SQL statement will require them. In order to reference other data from the row, place a double quote, an ampersand, and then the cell location. To continue the rest of the statement simply place another ampersand and double quote.

Continue until the rest of the statement is complete. Here is what it should look like:
="update DimProduct set EnglishProductName='"&B2&"', StandardCost='"&C2&"', FinishedGoodsFlag="&D2&", Color='"&E2&"' where ProductAlternateKey='"&A2&"'"

Once finished, click off of the query statement. If it worked correctly you should now see the query populated with values. If you received some errors, go back and double check the statement. Syntax issues are the most common mistakes.
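For example, with some made-up sample values in row 2, the populated cell would read something like:

update DimProduct set EnglishProductName='Mountain-100 Silver, 38', StandardCost='1912.15', FinishedGoodsFlag=1, Color='Silver' where ProductAlternateKey='BK-M82S-38'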

Now that we have one statement that looks correct, let’s quickly fill in all the other rows. Single click the cell with the query (F2), and then double click the green square on the lower right. This will populate the rest of the rows with the same query template.

After the rest of the rows are populated, you should now have a complete set of queries.

One last trick to note is when using dates. You will need to use the text() function as shown here and specify a date format.
="update FactInternetSales set ShipDate='"&TEXT(C2,"yyyy-mm-dd")&"' where SalesOrderNumber='"&A2&"' and SalesOrderLineNumber="&B2

The final step is to copy the query statements from Excel and paste them into SQL Server Management Studio where you can simply execute them to update the database. In the previous screenshot you would click on column D, copy, and then paste into Management Studio, and execute.
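For a large batch, it can be worth wrapping the pasted statements in a transaction so you can spot check the changes before making them permanent. A minimal sketch (the values shown are just samples):

BEGIN TRANSACTION;

-- Paste the generated statements from Excel here, for example:
UPDATE DimProduct
SET Color = 'Silver'
WHERE ProductAlternateKey = 'BK-M82S-38';

-- Spot check the affected rows, then:
COMMIT TRANSACTION;   -- or ROLLBACK TRANSACTION if something looks wrong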

And there you have it, writing queries directly within Excel provides a quick way to take data from an Excel spreadsheet and manage data within your database with very little effort.

Posted in Excel, T-SQL | Leave a Comment »

Using Microsoft Access to Import and Directly Manage Data within SQL Server

Posted by jeffprom on November 7, 2013

In this blog post I’m going to cover a couple of taboo subjects in the SQL Server realm.
1. Using Microsoft Access.
2. Giving end users the ability to import data directly into SQL Server tables and allow them to directly manage data within these tables.

Ok calm down, sit back down and let me explain how these are useful from a SQL perspective. Picture this: one source of an ETL process takes an external file and imports it into the data warehouse. Sound familiar? Absolutely. Now this file may fluctuate a bit. The data often changes and the formatting can cause issues during the import process. Also, the data values are often updated. Wouldn’t it be handy if we could instead just give the end user the ability to directly import data into a staging table? Why not give the end user the ability to update these staged values until they are ready to be imported into the data warehouse?

Now that we understand the use case, let’s take a look at a table based solution so our end users can quickly import and manage their data. Unfortunately, Excel only offers a one-way option for working with data in tables. You can import data into Excel from SQL Server, but you cannot make updates and send it back. Let’s instead turn to Microsoft Access where we can create linked tables and import data from flat files directly into SQL Server. We also get a nice ‘spreadsheet’ view of the data within the table and can update values directly on the server. Scary huh. Let’s see how we can do this.

First, we need to create an ODBC connection to the SQL Server database. On your workstation, go to the Control Panel, Administrative Tools, and select Data Sources (ODBC).

Select the ‘User DSN’ tab, and click on the ‘Add’ button.

Walk through the wizard and give your connection a relevant name and point it to the database you wish to connect to.

Next, navigate to Access. In my examples I will be using Microsoft Access 2013. From the main screen select ‘Blank desktop database’. Select a file name and location, and click ‘Create’.

You will now be presented with an empty looking table. In order to import data to an existing SQL Server table, we need to first create a linked table connection to SQL Server. From within Access, go to the ‘External Data’ tab and click the ODBC Database icon.

Choose the second option which is ‘Link to the data source by creating a linked table’.

Next, click on the ‘Machine Data Source’ tab and choose the ODBC connection that you previously created.

Finally, select the Table you want to connect to, click ‘Next’, and then ‘OK’.

You will now have a linked table icon on the left side.

It is worth noting here that from within Access you cannot create a new table on SQL Server or modify fields. There is a design view, but it will warn that you cannot save design changes for linked tables. An admin or developer will need to create the tables beforehand for the end users.
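A minimal sketch of what that might look like on the SQL Server side (the table, columns, and the DataStewards role are hypothetical; adjust to your own environment). Note that Access generally treats a linked table without a primary key or unique index as read-only, so be sure to include one:

-- Staging table the end users will link to from Access.
CREATE TABLE dbo.ProductImportStage (
    StageID       int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- lets Access update rows
    ProductCode   varchar(25)   NOT NULL,
    ProductName   nvarchar(100) NULL,
    StandardCost  decimal(19,4) NULL,
    LoadedBy      sysname       NOT NULL DEFAULT SUSER_SNAME(),
    LoadedOn      datetime2(0)  NOT NULL DEFAULT SYSDATETIME()
);

-- Limit the end users to just this staging table.
GRANT SELECT, INSERT, UPDATE, DELETE ON dbo.ProductImportStage TO DataStewards;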

Now that we have a destination table defined, let’s import some data. Click on the ‘External Data’ tab and click the Excel button.

From this screen, select the data source file location and select one of the import options. We already have a table so we will select the ‘Append’ option. While this screen looks like it has options to create a new table on SQL Server, it does not. The first option will create a new local Access table, and the third option will create a link to the source Excel file. Click ‘Ok’, and the next screen will show a preview of the data. If all looks correct, click ‘Next’ and then ‘Finish’.

The last step is to look at the data, and update where necessary. To do this simply double click the linked table from the left hand side. This will now show data in a table view where you can alter values. Note: Data updated here will directly update data within the SQL Server database.

Normally SQL Server admins and developers do all of the data preparation, ETL work, and data management. By leveraging Microsoft Access, we can give end users the ability to import data directly into SQL Server staging tables and also allow them to manage data within a spreadsheet view. With proper security in place, this can help expedite ETL processing. It can also provide a cost-effective data quality solution by giving end users the ability to review and cleanse data before it is imported into a data warehouse.

Posted in Access | 3 Comments »

Installing Data Quality Services on SQL Server 2012

Posted by jeffprom on November 1, 2013

Data Quality Services is the new data cleansing tool provided in SQL Server 2012 Enterprise and BI Editions. DQS is a knowledge-driven solution that enables you to do data cleansing and matching through the DQS client, or automated cleansing through the DQS Cleansing transform in SSIS.

In this blog post I will show how to install DQS. When you first install SQL Server, there are two DQS options to select if you would like to use DQS. As shown here, the top option (Instance Features\Data Quality Services) will install the DQS Server files. The second option (Shared Features\Data Quality Client) will install the DQS client.

DQS Install File Options

Once you have the DQS files available from the SQL Server install, the next step is to run the DQS Installer. This can be found under the Microsoft SQL Server 2012 menu under Data Quality Services. Click on the Data Quality Server Installer.

DQS Menu Installer

This will launch a command prompt window and start the install. As seen here, you will need to enter in a Database Master Key password.

DQS Install command prompt

The install will continue and take on average about 7 minutes to finish. Once it finishes, you will be presented with a final message stating that the DQS Installer finished successfully; press any key and the window will close.

DQS Installer command prompt finished

Congratulations, you now have the DQS server installed!

After doing a refresh on the database list in management studio, we can see there are now three new DQS databases that have been installed.

DQS_MAIN – Includes DQS stored procedures, the DQS engine, and published knowledge bases.
DQS_PROJECTS – Includes data that is required for knowledge base management and DQS project activities.
DQS_STAGING_DATA – An intermediate staging area where source data is copied for DQS operations and from which processed data can be exported; similar in spirit to tempdb.

DQS Installed Databases
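As a quick sanity check after the install, you can confirm the three databases exist with a simple query:

SELECT name, create_date
FROM sys.databases
WHERE name LIKE N'DQS%'
ORDER BY name;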

While knowing about these databases is helpful, all of the DQS work will primarily be done using the DQS client tool. This tool can be found in the same menu location where the DQS server install was located. When you launch the Data Quality Client tool you will be presented with a prompt to connect to the DQS server.

DQS Connect to server

Enter the instance location where DQS was installed and click Connect. You are now logged into the DQS client and will see the main screen.

DQS Client

On the left side is where you work with Knowledge Bases. The middle section is for creating projects to do data cleansing and data matching. The right side is for Administration tasks. These will each be discussed further in future posts, but you should now at least have DQS installed and ready to go.

Posted in DQS | 2 Comments »

SQL Saturday #238 – Minneapolis, MN

Posted by jeffprom on September 19, 2013

SQL Saturday #238

I’m happy to announce that I was selected to speak at the SQL Saturday #238, Minnesota event! This year I thought I would try something different. I recently changed jobs and am working with Ross McNeely. Ross mentioned that he once presented on F# which got me thinking… I know a bit about using PHP with SQL Server. So that is my session this year: ‘When scripting worlds collide: PHP + SQL Server’. This should be a fun presentation. I’ve actually used PHP longer than I have SQL Server. Somewhere in the neighborhood of 12 years. It wasn’t until just a few years ago that PHP was able to connect with SQL Server, and it makes for a very powerful combination!

Abstract:
PHP is a popular, and free, scripting language. It can be used to create websites, or simply perform a variety of tasks when run from a command prompt. Combine PHP with SQL Server and you have a powerful way to create data driven websites. In this session learn how to install, configure, and use PHP with SQL Server. Learn the basics of connecting to databases, writing queries, and using data to create dynamic, data driven websites.

I look forward to seeing everyone at the event.
http://sqlsaturday.com/238/eventhome.aspx

Posted in Events | Leave a Comment »