Share via


Exercise 3: Using Integration Services to Cleanse Data

In this exercise, you will review the LoadDimCustomer Integration Services package, and then configure the DQS Cleansing component to use the AdventureWorksBI knowledge base. You will then execute the package and review the logged invalid values.

Task 1 – Reviewing the LoadDimCustomer Package

In this task, you will review the LoadDimCustomer package.

  1. To open SQL Server Data Tools, click the Start button, and then select All Programs | Microsoft SQL Server 2012 | SQL Server Data Tools.
  2. To open the existing solution, on the File menu, select Open | Project/Solution.
  3. In the Open Project window, navigate to the \Assets\AdventureWorksBI folder located in the Source folder for this lab, select the AdventureWorksBI.sln file, and then click Open.

    Note:
    This project represents the continued development of lab SQL11UPD05-HOL-02. Specifically it includes an additional package named LoadDimCustomer.

  4. If the AdventureWorksDW2012 database in your environment exists on an instance other than localhost, in Solution Explorer, in the Connection Managers folder, right-click the AdventureWorksDW2012.conmgr connection manager, and then select Open. Modify the Server Name property, and then click OK.
  5. If Data Quality Services in your environment exists on an instance other than localhost, in Solution Explorer, in the Connection Managers folder, right-click the DQS.conmgr connection manager, and then select Open. Modify the Server Name property, and then click OK.
  6. In Solution Explorer, right click the Populate DW project, and then select Properties.
  7. Expand Configuration Properties and select Debugging.
  8. In the Debug Options, set the Run64BitRuntime property to False.

    Note:
    By setting the Run64BitRuntime property to false, the runtime finds and uses the 32-bit provider used by the Connection Manager.

    Figure 26

    Changing the Run64Bit Runtime property

  9. Click OK.
  10. In Solution Explorer, right-click the LoadDimCustomer.dtsx package, and then select Open.
  11. To update the package connection manager’s reference to the Excel workbook, in the Connection Managers tray, right-click the NewCustomers.xls connection manager, and then select Edit.

    Figure 27

    Locating the NewCustomers.xls Connection Manager

  12. In the Excel Connection Manager window, click Browse.
  13. In the Open window, navigate to the \Assets folder located in the Source folder for this lab, select the NewCustomers.xls file, and then click Open.
  14. In the Excel Connection Manager window, click OK.
  15. Right-click the NewCustomers.xls connection manager, and then uncheck the Work Offline option if checked.

    Note:
    By unchecking the Work Offline option, you are turning on the connection to the package. By default, the connections that are slow or unavailable are set to Work Offline to help reduce the delay in validating the package data flow.

  16. Read the comments on the left side of the control flow design, and then read the comments related to each control flow task.
  17. Notice that the Load DimCustomer Data Flow Task includes an error icon. You will address this in the next task.

Task 2 – Updating the Package Data Flow

In this task, you will update the package data flow to use the AdventureWorksBI knowledge base and map the input columns to the two domains.

  1. Right-click the Load DimCustomer Data Flow Task, and then select Edit.
  2. Read the comments related to each data flow component. You can also open the components to fully appreciate the package design, but do not modify any properties.
  3. Right-click the DQS Cleansing component, and then select Edit.
  4. In the DQS Cleansing Transformation Editor window, notice that the Data Quality Connection Manager dropdown list has already been configured to use the DQS project connection manager.
  5. In the Data Quality Knowledge Base dropdown list, select the AdventureWorksBI knowledge base.
  6. Select the Mapping tab.
  7. In the Available Input Columns grid, scroll to the bottom of the grid, and then check EmailAddress.
  8. In the mapping grid, in the Domain dropdown list, select EmailAddress.
  9. Repeat the last two steps to map the StateProvince input column to the StateCode domain.
  10. Verify that your mappings look like the following.

    Figure 28

    Reviewing the Mappings

  11. Click OK.
  12. To review the Conditional Split components, right-click the Test Customers component, and then select Edit.

    Figure 29

    Locating the Conditional Split Components

  13. Notice the condition used to isolate valid customers, and then click Cancel.
  14. Repeat the last two steps to review the Test EmailAddress and Test StateProvince components to understand the logic used to isolate the invalid records.

Task 3 – Executing the Package

In this task, you will execute the LoadDimCustomer package and review the execution statistics.

  1. To execute the LoadDimCustomer package, in Solution Explorer, right-click the LoadDimCustomer.dtsx package, and then select Execute Package.
  2. When the package execution has completed, notice the row count passed after the Test EmailAddress component. Test components show a row count only if some of them were invalid.

    Figure 30

    Reviewing the Row Counts

  3. To return to design mode, on the Debug menu, select Stop Debugging.

Task 4 – Reviewing the Logged Data

In this task, you will review the logged invalid data. Appreciate that a data steward could be notified of records inserted into these tables, and that would then prompt them to either fix the data, or update the knowledge base to learn from these rows.

  1. Switch to SQL Server Management Studio.
  2. In Object Explorer, expand Databases | AdventureWorksDW2012 | Tables.
  3. Right-click the dbo.Log_DimCustomer_InvalidEmailAddress table, and then select Select Top 1000 Rows.
  4. Review the two rows returned.
  5. Repeat the last two steps for the dbo.Log_DimCustomer_InvalidStateCode table.

Task 5 – Finishing Up

In this task, you will close all open applications.

  1. To close SQL Server Management Studio, on the File menu, select Exit.
  2. If prompted to save changes, click No.
  3. To close SQL Server Data Tools, on the File menu, select Exit.
  4. If prompted to save changes, click No.
  5. To close the Data Quality Client, in the top right corner, click the X button.