Share via


Exercise 1: Creating a Knowledge Base

In this exercise, you will commence by exploring a small data extract of new customers that needs to be loaded into the data warehouse. To support the cleansing of the data, you will then create the AdventureWorksBI knowledge base consisting of two domains to cleanse US state code and email address data.

Task 1 – Exploring the New Customer Data

In this task, you will open and explore the new US customer data that must be cleansed and loaded in the AdventureWorksDW2012 database’s DimCustomer table.

  1. To open Excel, click the Start button, and then select All Programs | Microsoft Office | Microsoft Excel 2010.
  2. On the File ribbon tab, click Open.
  3. In the Open window, navigate to the \Assets folder located in the Source folder for this lab, select the NewCustomers.xls file, and then click Open.
  4. Notice the table of data representing 20 new customers to be loaded into the data warehouse’s DimCustomer table. In particular, notice the cells highlighted in red that represent data quality problems. The two email addresses are invalid, and the state values are required to be the two character US state postal code, not the state abbreviation.

    Figure 1

    Reviewing the New Customer Data

  5. To close Excel, on the File ribbon tab, click Exit.

Task 2 – Creating the Knowledge Base

In this task, you will you will create a knowledge base and commence its creation with the knowledge discovery activity.

  1. To open the Data Quality Client, click the Start button, and then select All Programs | Microsoft SQL Server 2012 | Data Quality Services | Data Quality Client.
  2. In the Connect to Server window, in the Server Name dropdown list, enter the name of the server where Data Quality Services has been installed.
  3. Click Connect.
  4. To create a new knowledge base, in the Data Quality Client, in the Knowledge Base Management section, click New Knowledge Base.

    Figure 2

    Creating a New Knowledge Base

  5. In the Name box, enter AdventureWorksBI.
  6. In the Select Activity section, select Knowledge Discovery.

    Figure 3

    Selecting the Knowledge Discovery Activity

  7. Click Next.

Task 3 – Configuring DQS Permissions for the AdventureWorksDW2012 Database

In this task, you will you will open and execute a script to provide privileges to Data Quality Services to read data from and write data to the data warehouse. Knowledge will be retrieved from the database in the next task.

  1. To open SQL Server Management Studio, click the Start button, and then select All Programs | SQL Server 2012 | SQL Server Management Studio.
  2. If the Connect to Server window does not open, in Object Explorer, click Connect, and then select Database Engine.
  3. In the Connect to Server window, ensure the Server Type dropdown list is set to Database Engine.
  4. In the Server Name box, enter the name of the instance that hosts the AdventureWorksDW2012 database.
  5. Click Connect.
  6. To open the script file, on the File menu, select Open | File.
  7. In the Open File window, navigate to the Assets folder located in the Source folder for this lab, select the GrantDqsPrivileges.sql file, and then click Open.

    Note:
    This script grants privileges to the Data Quality Services logins (created at installation time) to read data from and write data to the AdventureWorksDW2012 database.

  8. Review the script, and then on the toolbar click Execute.

Task 4 – Performing Knowledge Discovery

In this task, you will you will perform knowledge discovery for the StateCode domain by retrieving the existing US state codes used in the data warehouse.

  1. Switch to the Data Quality Client.
  2. Notice the progression of steps that you will work through to discover knowledge.

    Figure 4

    Knowledge Discovery Steps

  3. In the Map step, in the Database dropdown list, select the AdventureWorksDW2012 database.
  4. In the Table/View dropdown list, select the vUSStateCode view (located at the bottom of the list).

    Note:
    This view retrieves all US state codes used in the DimGeography table.

  5. In the Mappings table, in the Source Column column, select StateCode (nvarchar).
  6. Click Create a Domain.

    Note:
    You can hover over the buttons to reveal a tooltip that describes their functionality.

    Figure 5

    Creating a Domain

  7. In the Create Domain window, in the Domain Name box, enter StateCode.

    Figure 6

    Creating the StateCode Domain

  8. Click OK.
  9. Ensure the Mappings table looks like the following.

    Figure 7

    Verifying the Mapping

  10. Click Next.
  11. In the Discover step, click Start.
  12. When the discovery process has completed, in the Profiler pane, notice that 36 unique values (US states) have been retrieved.
  13. Click Next.
  14. In the Manage Domain Values step, review the 36 state codes retrieved from the vUSStateCode view.

    Note:
    The star next to each state code indicates that it is a new value, and its type is defaulted to Correct (the green check). It is possible to add additional values, remove values, and to configure different types (Error and Invalid). You will add Error values in the next task.

  15. To complete the knowledge discovery process, click Finish.
  16. When prompted to publish the knowledge base, click No.

    Note:
    Once published, the knowledge base is available for use. In the next two tasks, the StateCode domain will be modified, and an additional domain will be added. The knowledge base will remain exclusively locked by you until the domains are defined. Then it can be published.

Task 5 – Managing the StateCode Domain

In this task, you will you will manage the StateCode domain by adding two corrections. Note that the Tex. value (as has been used in the NewCustomers.xls workbook) will not be corrected. This will allow you to see how to manage invalid data in an ETL process in the final exercise of this lab.

  1. In the Data Quality Client, in the Knowledge Base Management section, in the Recent Knowledge Base list, click the arrow to the right of the AdventureWorksBI domain, and then select Domain Management.

    Figure 8

    Performing Domain Management

  2. Notice that the StateCode domain is selected.
  3. Select the Domain Values tab, and notice the values loaded during the knowledge discovery process.

    Note:
    In addition, the DQS_NULL value is included to support possible NULL values.

  4. To add a new value, click Add New Domain Value.

    Figure 9

    Adding a New Domain Value

  5. In the new row, in the Value column, enter Calif. (include the period).
  6. In the corresponding Type dropdown list, select the cross.
  7. In the corresponding Correct To box, enter the two character state code CA.
  8. Verify that the new value looks like the following.

    Figure 10

    Verifying the New Value

  9. Press Enter to commit the new value.
  10. Repeat the steps in this task to add a second new value for Fla. that will correct to two character state code FL.
  11. Scroll to the top of the domain values list, and notice the association of the new domain values to their corrected domain values.

    Figure 11

    Reviewing the CA Value

Task 6 – Creating the EmailAddress Domain

In this task, you will create a new domain to test the validity of email addresses. The domain will be based on a rule using a regular expression.

  1. Click Create a Domain.

    Figure 12

    Creating a New Domain

  2. In the Create Domain window, in the Domain Name box, enter EmailAddress, and then click OK.
  3. Select the new EmailAddress domain, and then select the Domain Rules tab.
  4. Click Add a New Domain Rule.

    Figure 13

    Creating a New Domain Rule

  5. In the Name box, enter Valid EmailAddress.
  6. In the Build a Rule section, in the dropdown list, select Value matches regular expression.
  7. In the rule box, enter the following regular expression.

    Note:
    For convenience, the regular expression can be copied from the \Assets\Snippets.txt file located in the Source folder for this lab.

    Regular Expression

    \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

  8. Verify that the rule looks like the following.

    Figure 14

    Verifying the Domain Rule

  9. To test the rule, click Run the Selected Domain Rule On Test Data.

    Figure 15

    Testing the Domain Rule

  10. In the Test Domain Rule window, click Add a New Testing Term For the Domain Rule.

    Figure 16

    Adding a New Testing Term

  11. In the EmaillAddress box, enter john@hotmail.com (a valid email address).
  12. Repeat the last two steps to add the following two invalid email addresses:

    EmailAddress

    john**@hotmail.com

    john@hotmail

  13. Click Test the Domain Rule On All the Terms.

    Figure 17

    Testing the Domain Rule

  14. Review the icon in the Validity column for each test EmailAddress value.
  15. Click Close.
  16. Click Finish.
  17. When prompted to publish the knowledge base, click Publish.
  18. When prompted to acknowledge that the knowledge base was published, click OK.