Share via


Exercise 2: Creating a Data Quality Project

In this exercise, you will create a Data Quality Project to cleanse and review the data in the NewCustomers.xls file.

Task 1 – Creating the Data Quality Project

In this task, you will create a Data Quality Project to cleanse and review the data in the NewCustomers.xls file.

  1. In the Data Quality Client, in the Data Quality Projects section, select New Data Quality Project.

    Figure 19

    Creating a New Data Quality Project

  2. In the Name box, enter NewCustomers.xls.
  3. In the Use Knowledge Base dropdown list, select AdventureWorksBI.
  4. In the Select Activity section, notice that Cleansing is selected.
  5. Click Next.
  6. Notice the progression of steps that you will work through to create the data quality project.

    Figure 20

    Data Quality Project Steps

  7. In the Map step, in the Data Source dropdown list, select Excel File.
  8. To the right of the Excel File box, click Browse.
  9. In the Select an Excel File window, navigate to the \Assets folder located in the Source folder for this lab, select the NewCustomers.xls file, and then click Open.
  10. In the Worksheet dropdown list, select NewCustomers$.
  11. In the Mappings table, in the Source Column column, select EmailAddress (String).
  12. In the corresponding Domain column, select the EmailAddress domain.
  13. Repeat the last two steps to map the State (String) source column to the StateCode domain.
  14. Verify that the mappings look like the following.

    Figure 21

    Verifying the Mappings

  15. Click Next.
  16. In the Cleanse step, click Start.
  17. When the discovery process has completed, in the Profiler pane, to the left, notice the source statistics.

    Figure 22

    Reviewing the Source Statistics

  18. In the grid, notice the statistics associated with each mapped domain.

    The EmailAddress field contained no corrected or suggested values, which is expected because the domain includes only a rule and no domain values or corrections.

    The State field included four corrections for the customers in California (CA) and Florida (FL).

  19. Scroll to the very right of the grid to reveal the Accuracy column. Hover over each accuracy bar to reveal statistics about the correct, corrected, suggested, new or invalid results.

    Figure 23

    Reviewing the Accuracy Statistics for the State Field

  20. Click Next.
  21. In the Manage and View Results step, notice that the EmailAddress domain is selected.
  22. Notice the tabs and the number of records associated with each classification.

    Figure 24

    Reviewing the EmailAddress Tabs

  23. Select the New tab, and then scroll to review the valid email addresses.

    Note:
    These are considered new values because there are no domain values defined.

  24. Select the Invalid tab, and then review the invalid email addresses. Notice that the Reject radio buttons for both values are selected. It is possible to approve them if it made sense to do so.
  25. Select the StateCode domain.
  26. Select the New tab, and notice the one value for Tex.. The discovery of this new value would prompt the data steward to manage the domain to translate this value to the state code TX. You will not update the domain in this instance.
  27. Select the Corrected tab, and notice the two values and their corresponding corrected values.
  28. To review the individual Californian records, select the Calif. row. In the grid below, notice the three records for this state, and that they have a confidence of 100% and have been approved.
  29. Click Next.
  30. In the Export step, notice the left grid that displays the cleansed records.
  31. Scroll to the right of the grid and notice the additional statistical columns that describe reason, confidence and status for the columns mapped to domains.
  32. On the right, notice that the cleansed results can be output to a SQL Server table or a CSV file, by choosing Destination Type as SQL Server or CSV File.
  33. Click Finish.