Jeff Prom's SQL Server Blog

Sharing knowledge and tips on SQL Server, Business Intelligence and Data Management.

Cleansing Address Data With Data Quality Services (DQS) and Melissa Data

Posted by jeffprom on January 9, 2015

In this post I will show how to cleanse address data using Microsoft Data Quality Services (DQS) and Melissa Data as a reference data service. First, you need to make sure you have a working installation of DQS. If you don’t, you can view my other post here that walks you through installing DQS. Once DQS has been installed, you will need to visit the Microsoft Azure Marketplace website and subscribe to the Melissa Data Address Check service. When I wrote this post, you could get 1,000 addresses cleansed per month for free. This is useful for testing or if you have a small set of data to be cleansed. Select the desired service level/amount from the list and click the Sign-Up button. You will now have a subscription added to your account.

melissa data service on the azure marketplace

Next, we need to tell DQS about our Azure Marketplace account. On the Microsoft Azure Marketplace website, click on ‘My Account’ and look for the Primary Account Key value.

azure marketplace ID

Highlight and copy the Key value. Open the Data Quality Client tool and click on Configuration on the right-hand side. Paste your Key value into the DataMarket Account ID entry box and click on the validate button.

dqs validate account ID

Once your ID has been validated you are then ready to setup a knowledge base in DQS that will use Melissa Data. On the DQS client, click on the Close button on the Configuration screen to get back to the main menu. On the top left, click on New Knowledge Base. Add 4 new domains (Address, City, State, Zip). Next, add 1 composite domain (FullAddress). Add the first 4 domains you just created (Address, City, State, Zip).

setting up the knowledge base domains

With the FullAddress composite domain selected, click on the Reference Data tab. Click on Browse and check the Melissa Data service.

On the Schema mapping table, create the following mappings between the RDS Schema and Domains:
1. AddressLine (M) = Address
2. City = City
3. State = State
4. Zip = Zip

adding melissa data as reference data

You can adjust several of the Providers Settings such as Auto Correction Threshold, Suggested Candidates, and Min Confidence. Once finished, click the Finish button and select Publish.

save knowledge base with reference data

Next, we will cleanse a set of data using our new knowledge base. On the main screen of the DQS client, click on New Data Quality Project. Give it a name, and be sure to select the knowledge base we just finished setting up that uses the Melissa Data reference data.

new data quality project

Click Next and select a data source that points to address records you would like to cleanse. Map the 4 Source Columns to the appropriate Domains. If you have mapped all 4 correctly, you should be able to click on the ‘View/Select Composite Domains’ button which tells you that the composite domain will now be used.

data quality project - map domains

Click Next, and Start. DQS will analyze the data and provide a summary of results.

cleansing profile results

Click Next to see the cleansing suggestions and to ultimately work through corrections. In my example I am using 5 addresses which are Melissa Data office locations that I found on their website. I altered some aspects of the addresses to see how well it can cleanse the data. Let’s take a look at the results.

record results before adjustments

As you can see here, DQS suggested 2 corrections (Road to Rd, Drive to Dr). The interesting thing is that both of these corrections don’t match what is on their website. (They have Road and Dr.) Perhaps they need to cleanse their own addresses. 🙂 It also listed one entry as Invalid even though all I did was add a 1 to the address value. This record would need to be manually reviewed/corrected or we could adjust the threshold settings to see if we get better results. DQS successfully corrected 2 records with a good confidence score. It wasn’t fooled by me changing the City name on one and Zip on the other.

approved results

Work through the results by making adjustments, approving or rejecting and then finally exporting the records to it’s final destination (SQL Server, CSV, or Excel).

Address cleansing can be an important yet difficult task. However, with DQS we now have the ability to tap into external Knowledge Bases such as Melissa Data to help. By leveraging reference data within DQS we can quickly and effective cleanse address information.

Advertisements

3 Responses to “Cleansing Address Data With Data Quality Services (DQS) and Melissa Data”

  1. […] Cleansing Address Data With Data Quality Services (DQS) and Melissa Data […]

    • Brian said

      Jeff..when you export the results from a Data Quality project that utilizes the Melissa Data Address Check service, the result set includes many extra fields to enrich the original composite domain schema, like latitude, longitude, etc.; however, the field names in the result set are “Co_1”, “Co_2”, “Co_3”, etc; therefore, it is hard to know what the result set actual contains without proper metadata, or a cross-reference map to what these generic field names mean. I have reached out to Melissa Data, they indicate this is proprietary to Microsoft’s version of the result set; my initial contact at Microsoft didn’t have the answer, leaving me with opening a ticket with Microsoft as the next step. Let me know if you have found such metadata.

  2. Brian said

    Jeff..when you export the results from a Data Quality project that utilizes the Melissa Data Address Check service, the result set includes many extra fields to enrich the original composite domain schema, like latitude, longitude, etc.; however, the field names in the result set are “Co_1”, “Co_2”, “Co_3”, etc; therefore, it is hard to know what the result set actual contains without proper metadata, or a cross-reference map to what these generic field names mean. I have reached out to Melissa Data, they indicate this is proprietary to Microsoft’s version of the result set; my initial contact at Microsoft didn’t have the answer, leaving me with opening a ticket with Microsoft as the next step. Let me know if you have found such metadata.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: