資料比對Data Matching

本主題適用於: 是SQL Server (僅限 Windows)否Azure SQL Database否Azure SQL 資料倉儲否平行處理資料倉儲 THIS TOPIC APPLIES TO: yesSQL Server (Windows only)noAzure SQL DatabasenoAzure SQL Data WarehousenoParallel Data Warehouse

Data Quality ServicesData Quality Services (DQS) 資料比對程序可讓您減少資料重複,並提高資料來源中的資料精確度。The Data Quality ServicesData Quality Services (DQS) data matching process enables you to reduce data duplication and improve data accuracy in a data source. 比對會分析單一資料來源之所有記錄中的重複程度,傳回每一組比較之記錄之間相符的加權機率。Matching analyzes the degree of duplication in all records of a single data source, returning weighted probabilities of a match between each set of records compared. 然後您可以決定哪些記錄相符,並針對來源資料採取適當的動作。You can then decide which records are matches and take the appropriate action on the source data.

DQS 比對程序的優點如下:The DQS matching process has the following benefits:

  • 比對可讓您消除應該相等之資料值之間的差異,以判斷正確值並減少資料差異可能造成的錯誤。Matching enables you to eliminate differences between data values that should be equal, determining the correct value and reducing the errors that data differences can cause. 例如,名稱和地址經常是資料來源的識別資料,特別是客戶資料,但是資料可能會變更,並隨著時間而變質。For example, names and addresses are often the identifying data for a data source, particularly customer data, but the data can become dirty and deteriorate over time. 執行比對來識別及更正這些錯誤可讓資料的使用及維護變得更輕鬆。Performing matching to identify and correct these errors can make data use and maintenance much easier.

  • 比對可讓您確保相等但是以不同格式或樣式輸入的值會以一致的方式轉譯。Matching enables you to ensure that values that are equivalent, but were entered in a different format or style, are rendered uniform.

  • 比對會識別完全相符和大約符合,好讓您在定義資料時移除重複的資料。Matching identifies exact and approximate matches, enabling you to remove duplicate data as you define it. 您會定義大約相符實際上是相符的點。You define the point at which an approximate match is in fact a match. 您會定義已評估哪些欄位用於比對以及未評估的欄位。You define which fields are assessed for matching, and which are not.

  • DQS 可讓您使用電腦輔助的程序建立比對原則、根據比對結果以互動方式加以修改,並將它加入至可重複使用的知識庫。DQS enables you to create a matching policy using a computer-assisted process, modify it interactively based upon matching results, and add it to a knowledge base that is reusable.

  • 您可以針對從來源複製到暫存資料表的資料重新建立索引或是不重新建立索引 (根據比對原則和來源資料的狀態)。You can re-index data copied from the source to the staging table, or not re-index, depending on the state of the matching policy and the source data. 不重新建立索引可以提高效能。Not re-indexing can improve performance.

    您可以搭配其他資料清理程序來執行比對程序,以改善整體資料品質。You can perform the matching process in conjunction with other data cleansing processes to improve overall data quality. 您也可以使用內建在 Master Data Services 中的 DQS 功能來執行資料刪除重複作業。You can also perform data de-duplication using DQS functionality built into Master Data Services. 如需詳細資訊,請參閱 Master Data Services 概觀 (MDS)For more information, see Master Data Services Overview (MDS).

    下圖顯示如何在 DQS 中進行資料比對:The following illustration displays how data matching is done in DQS:

    DQS 中的比對程序Matching Process in DQS

如何執行資料比對How to Perform Data Matching

如果是 DQS 中的其他資料品質程序,執行比對的方式是建立知識庫,並使用以下步驟在資料品質專案中執行比對活動:As with other data quality processes in DQS, you perform matching by building a knowledge base and executing a matching activity in a data quality project in the following steps:

  1. 在知識庫中建立比對原則Create a matching policy in the knowledge base

  2. 在屬於資料品質專案之一部分的比對活動中執行刪除重複作業程序。Perform a de-duplication process in a matching activity that is part of a data quality project.

建立比對原則Building a Matching Policy

準備知識庫來執行比對的方式,是在知識庫中建立比對原則,以定義 DQS 如何指派比對機率。You prepare the knowledge base for performing matching by creating a matching policy in the knowledge base to define how DQS assigns matching probability. 比對原則是由一個或多個比對規則所組成,這些規則會在 DQS 評估某一筆記錄與另一筆記錄的相符程度時識別將使用哪些定義域,並指定每一個定義域值在比對評估中所佔的加權比重。A matching policy consists of one or more matching rules that identify which domains will be used when DQS assesses how well one record matches to another, and specify the weight that each domain value carries in the matching assessment. 您會在規則中指定定義域值是否必須完全相符或者可以類似,以及相似度的程度。You specify in the rule whether domain values have to be an exact match or can just be similar, and to what degree of similarity. 您也會指定定義域比對是否為必要條件。You also specify whether a domain match is a prerequisite.

[知識庫管理] 精靈中的比對原則活動會分析取樣資料,方法是套用每一個比對規則,一次比較整個記錄範圍中的兩筆記錄。The matching policy activity in the Knowledge Base Management wizard analyzes sample data by applying each matching rule to compare two records at a time throughout the range of records. 符合分數大於指定之最小值的記錄會分組在比對結果的叢集中。Records whose matching scores are greater than a specified minimum are grouped in clusters in the matching results. 這些比對結果不會加入至知識庫;您可以使用這些結果來微調比對規則。These matching results are not added to the knowledge base; you use them to tune the matching rules. 建立比對原則可以是反覆性的程序,在此程序中您會根據比對結果或分析統計資料來修改比對規則。Creating a matching policy can be an iterative process in which you modify matching rules based on the matching results or profiling statistics.

您可以為定義域指定,當您將資料來源中的資料載入定義域時,資料字串將會正規化。You can specify for a domain that data strings will be normalized when you load data from the data source into the domain. 這個程序包含使用 null 或空格取代特殊字元,這樣通常會移除兩個字串之間的差異。This process consists of replacing special characters with a null or a space, which often removes the difference between two strings. 這樣會提高比對精確度,而且通常可以讓比對結果超出最低比對臨界值 (在沒有正規化時,將不會超過)。This can increase matching accuracy, and can often enable a matching result to surpass the minimum matching threshold, when without normalization it would not pass.

注意

兩筆記錄之對應欄位中的 Null 值將會被視為相符。Null values in the corresponding fields of two records will be considered a match.

比對原則是在對應至取樣資料的定義域上執行。The matching policy is run on domains mapped to the sample data. 當您執行比對原則時,您可以指定資料是否從資料來源複製到暫存資料表,並重新建立索引。You can specify whether data is copied from the data source into the staging table and re-indexed when you run the matching policy, or not. 當建立知識庫及執行比對專案時,您也可以這樣做。You can do so both when building the knowledge base and when running the matching project. 不重新建立索引可能會提高效能。Not re-indexing could result in improved performance. 當下列情況成立時,不需要重新建立索引:比對原則尚未變更,而且您尚未更新資料來源、重新對應此原則、選取新的資料來源,或是對應一個或多個新的定義域。Re-indexing is not necessary if the following is true: the matching policy has not changed, and you have not updated the data source, remapped the policy, selected a new data source, or mapped one or more new domains.

在建立每一個比對規則時,該規則都會儲存在知識庫中。Each matching rule is saved in the knowledge base when it is created. 但是,只有在發行知識庫時,才可以在資料品質專案中使用知識庫。However, a knowledge base is available for use in a data quality project only when it is published. 此外在發行知識庫之前,建立者以外的使用者將無法變更其中的比對規則。In addition, until the knowledge base is published, the matching rules in it cannot be changed by a user other than the person who created it.

執行比對專案Running a Matching Project

DQS 執行資料刪除重複作業的方式,是使用知識庫中所定義的比對原則來比較來源資料中的每一個資料列與每一個其他資料列,並產生資料列相符的機率。DQS performs data de-duplication by comparing each row in the source data to every other row, using the matching policy defined in the knowledge base, and producing a probability that the rows are a match. 這會在比對類型的資料品質專案中執行。This is done in a data quality project with a type of Matching. 比對是資料品質專案中的其中一個主要步驟。Matching is one of the major steps in a data quality project. 最好是在資料清理之後執行比對,好讓要比對的資料沒有錯誤。It is best performed after data cleansing, so that the data to be matched is free from error. 在執行比對程序之前,您可以將清理專案的結果匯出到資料表或 .csv 檔案中,然後建立比對專案,您會在比對專案中將清理結果對應到定義域。Before running a matching process, you can export the results of the cleansing project into a data table or .csv file, and then create a matching project in which you map the cleansing results to domains in the matching project.

資料比對專案是由電腦輔助的程序與互動式程序所組成。A data matching project consists of a computer-assisted process and an interactive process. 比對專案會將比對原則中的比對規則套用到要評估的資料來源。The matching project applies the matching rules in the matching policy to the data source to be assessed. 這個程序會評估任何兩個資料列在符合分數中相符的可能性。This process assesses the likelihood that any two rows are matches in a matching score. 只有當記錄的符合機率大於資料監管在比對原則中所設定的值時,才會被視為相符的記錄。Only those records with a probability of a match greater than a value set by the data steward in the matching policy will be considered a match.

當 DQS 執行比對分析時,它會建立 DQS 視為相符的記錄叢集。When DQS performs the matching analysis, it creates clusters of records that DQS considers matches. DQS 會將每一個叢集中的其中一筆記錄隨機識別為樞紐記錄或前置記錄。DQS randomly identifies one of the records in each cluster as the pivot, or leading, record. 資料監管會驗證比對結果,並拒絕不是適當叢集相符項目的任何記錄。The data steward verifies the matching results, and rejects any record that is not an appropriate match for a cluster. 然後資料監管會選取存活規則,DQS 將使用此規則來判斷將存活於比對程序中並取代比對記錄的記錄。The data steward then selects a survivorship rule that DQS will use to determine the record that will survive the matching process and replace the matching records. 存活規則可以是「樞紐記錄」(預設值)、「最完整且最長的記錄」、「最完整的記錄」或「最長的記錄」。The survivorship rule can be “Pivot record” (the default), “most complete and longest record”, “most complete record”, or “longest record”. DQS 會根據哪一筆記錄最符合存活規則中的準則,以判斷每一個叢集中的存活者 (前置) 記錄。DQS determines the survivor (leading) record in each cluster based upon which record most closely matches the criteria or criterion in the survivorship rule. 如果給定叢集中的多筆記錄符合存活規則,DQS 會隨機選取其中一筆記錄。If multiple records in a given cluster comply with the survivorship rule, DQS selects one of those records randomly. DQS 可讓您選擇將擁有共同記錄的叢集顯示為單一叢集,方法是選取 [顯示非重疊的叢集]。DQS gives you the choice of displaying clusters that have records in common as a single cluster by selecting “show non-overlapping clusters”. 您必須執行比對程序,才能根據這項設定顯示結果。You must execute the matching process in order to display the results according to this setting.

您可以將比對程序的結果匯出到 SQL Server 資料表或 .csv 檔案。You can export the results of the matching process either to a SQL Server table or a .csv file. 您可使用兩種形式匯出比對結果:首先是相符記錄和不相符記錄,然後是只包含叢集存活者記錄和不相符結果的存活記錄。You can export matching results in two forms: first, the matched records and the unmatched records, or second, survivorship records that include only the survivor record for a cluster and the unmatched results. 在存活記錄中,如果將相同記錄識別為多個叢集的存活者,該記錄只會匯出一次。In the survivorship records, if the same record is identified as the survivor for multiple clusters, that record will only be exported once.

本節內容In This Section

您可以在 DQS 中執行與比對相關的以下工作:You can perform the following tasks related to matching in DQS:

在比對原則中建立及測試比對規則Create and test matching rules in a matching policy 建立比對原則Create a Matching Policy
在資料品質專案中執行比對Run matching in a data quality project 執行比對專案Run a Matching Project