February 2011

Volume 26 Number 02

Dynamic Data - Pattern Matching Database Records with F#

By Ambar Ray | February 2011

The data used by your applications doesn’t just appear out of thin air. To end up with a database full of useful data, you’re going to have to jump through a few hoops. To start, you probably perform an extract, transform and load (ETL) process on some collection of dimensional data. This typically includes cleansing, pruning and standardizing the data. This is just to get the data into a form that works in your database.

After the ETL step, you’ll want to go through your records and make sure they’re useful and consistent. This typically means implementing a matching and deduplication process. Next, you’ll do some name and address parsing. With this information you can begin a matching process and start identifying duplicates.

There are four common matching algorithms used for attribute deduplication processes: absolute match, partial match, Soundex and lookup match. These algorithms can be run against the data and, once the percentage match score is computed, you can decide whether to discard or store the data.

As an exercise, I’ve implemented these four matching algorithms using F# pattern matching and asynchronous programming features to quickly calculate the aggregate match score. In this article, I’ll show you my implementation of the matching algorithms and how I create the aggregate match score. The ultimate objective of this article is to showcase some of the features of F# such as functional programming, imperative programming and implicit type inference system, and demonstrate how you can use F# for accomplishing some significant data management tasks quickly and easily.

Preparing the Data

I’ll start by loading the dimensional data from various transactional systems in a staging table. The data source could be a relational database, a flat file, an Excel file or an XML file. The ETL process often uses a tool like SQL Server Integration Services (SSIS) 2008 for cleansing, pruning and standardizing the incoming data and subsequently loading the staging tables. The staging tables and the master database are kept in a master data hub.

As mentioned earlier, I use a separate application built with F# and Windows Presentation Foundation (WPF) to take care of the matching and deduplication process. In actual projects, you’d first generate ADO.NET Entity Framework models in the data-access layer. These models model the master data tables along with all the lookup tables present in the master data hub.

Next, the layer above should expose the underlying model through Windows Communication Foundation (WCF) services. The business logic layer above would implement the matching and deduplication routines as well as the parsing methods. Finally, the presentation layer would present various GUIs to the data manager. Figure 1 depicts the architecture of the application.

image: Master Data Management Application Architecture

Figure 1 Master Data Management Application Architecture

I’m not going to demonstrate the entire application here, only the implementation of the matching and parsing algorithms. The subsequent sections will explain the implementation of the matching and parsing algorithms in the business logic layer.

Getting Started

For the purposes of this scenario, let’s assume that the row data are retrieved from the database via the data-access layer and WCF Data Services is subsequently stored in List<T> objects for further processing. These List<T> objects will be populated with test data for implementing the pertinent algorithms.

The master database contains all the source tables as well as staging and history tables. As an example, Figure 2 shows the composition of a Customer data entity.

Figure 2 Customer Entity


The matching algorithm can be explained by means of the flowchart in Figure 3.

image: The Matching Process

Figure 3 The Matching Process

Step 1 is not really part of the matching process, but rather a precursor. Data is cleansed and standardized to a preferred format before it’s sent for parsing.

Name Parsing

The parsing process in Step 2 is also a precursor to the matching process. In this step, individual names are parsed out of the data. Assuming that a salutation may be entered along with the first name, it should be broken down (parsed) into individual elements: salutation and first name. Thus, if Mr. John is provided in the first name field, with Mr. being the salutation and John being the actual first name, then the algorithm will work as follows:

  1. Pick the first word terminated with a space (position A) as W1.
  2. Identify whether W1 is a salutation by matching with a lookup list. If so, ignore the salutation and proceed to step 4.
  3. If W1 is not a salutation, consider it the first name. Do not consider the string snippet for further parsing.
  4. If W1 is identified as a Salutation, identify the next occurrence of a space or an end of string (Position B). Consider the word enclosed within position A and position B as First Name. Do not consider the string snippet for further parsing.

For example, the string “Mr. John” would be parsed as shown in Figure 4.

Figure 4 Name-Parsing Example

Name M r .   J o h n
Position 1 2 3 4 5 6 7 8

Position 1 to position 3 would constitute the salutation. Salutation is optional, and would be populated only if the first word matches with an exhaustive list of possible values. (I do make an assumption here that each part of the name, including salutation, has to be followed by a space. An exception is made for the last name field.) In Figure 4, position 5 to position 8 would constitute the first name.

The implementation of this parsing algorithm in F# looks like this:

let ParseName(strName : string)=
  let input = strName.Trim()
  let splitNames = input.Split([|" "|], StringSplitOptions.None)
  match splitNames.[0] with
    | "Mr" | "Mr." | "Mrs" | "Mrs." | "Dr" | "Dr." | "Kum" | "Kum." 
      -> splitNames.[1]
    | _-> splitNames.[0]

In this code I’ve done pattern matching using the “match” and “with” keywords of F#, followed by pattern matching rules, each followed by the “->” symbol.

Address Parsing

The next step is address line parsing. Address lines 1 and 2 (concatenated) should be broken down (parsed) into a house or building number, street name, street type, apartment type (optional, constituting Apt, Flat or Suite) and apartment number (optional). If city, state and country information are provided in address lines 1 and 2, then those need to be omitted.

A combined address should be parsed into a house number, street name, street type and, if applicable, an apartment type and apartment number. To illustrate, let’s look at an example:

421, East Drachman St, Suite 5A, Tucson, Arizona, beside McDonald’s

In this case the address elements are parsed as shown in Figure 5.

Figure 5 Address-Parsing Example

House No. Street Name Street Type Street Type Apartment Number Location Descriptor
421 East Drachman Street Suite 5A beside McDonald’s

Note that city and state information are parsed separately as {city, state, country, zip} = {‘ Tucson’, ‘AZ’, ‘USA’, ‘85705’} and are omitted from the address line.

Let’s look at the steps needed to parse the address line:

  1. Define an exhaustive lookup for street type and apartment type.
  2. Define an exhaustive country or region-specific lookup for state. State lookup should be defined as such that AZ and Arizona are identified as the same (but incorrect spelling is not considered).
  3. Concatenate Address 1 and Address 2 into a single address Line.
  4. Search the address line string from the right-hand side for valid city, state name or ZIP code with respect to corresponding lookups. If true, then proceed to Step 5; otherwise proceed to Step 6.
  5. Remove city, state or ZIP code information if found in the address line string.
  6. Identify street and apartment tokens using the structure Street {[Number] [Name] [Type]}, Apartment {[Type] [Number]}. Token identification is based on searching 
[Type] with respect to values in street type and apartment type lookup, respectively.
  7. Consider any remaining string snippet after Step 6 to be part of a location descriptor.

The implementation of this algorithm is shown in Figure 6. In this code I’ve used while loops to scan through the state, city and ZIP lookup tables for finding a match. The main function ParseAddress uses this to eliminate city, state and ZIP information from the address line.

Figure 6 Parsing Address Lines

let MatchCityStateZip (addressPart : string) = 
  // Match with a state
  let stateMatchFound = stateNameTable |> 
    List.exists (fun (part1,part2) -> 
    part1 = addressPart || part2 = addressPart) 
  // Match with a city
  let cityMatchFound = cities |> List.exists (fun city -> 
    city = addressPart)
  // Match with a ZIP code
  let zipMatchFound = zipCodes |> List.exists (fun zipCode -> 
    zipCode = addressPart)
  stateMatchFound || cityMatchFound || zipMatchFound
// The main parsing address algorithm is as follows:
let ParseAddress (strAddress : string) = 
  let mutable finalAddress = strAddress
  // Split the address
  let addressParts = strAddress.Split([|","|], 
  // Remove city, state and ZIP information from the address
  for i = 0 to addressParts.Length - 1 do
    let currPart = addressParts.[i].Trim()
    if MatchCityStateZip currPart then
      // Index of current address part in original string
      let currIndex = finalAddress.IndexOf currPart
      // Remove city, state, ZIP information along with the 
      // following whitespace or comma
      finalAddress <- finalAddress.Remove(currIndex, currPart.Length)
  let finalAddress = finalAddress.Replace(", ,",",")
  let finalAddress = finalAddress.TrimEnd([|','; ' '|])

Matching and Deduplication

The process of matching and deduplication in Step 3 (see Figure 3) starts with the identification of the following three types of attributes:

  • Identity attributes, including Cust_ID, name (First_Name, Last_Name), address (House_no., Street_Name, City, State, Zip_Code, Country), Mob_Phone and so on.
  • Discriminating attributes such as date of birth (DOB).
  • Record-qualifying attributes such as country, region or postcode.

The application will prompt the user to select these attributes. At least one each of identity, discriminating and record-qualifying attributes need to be selected. Note that if any records are flagged due to inconsistencies or errors in the identity and discriminating attributes in the cleansing and standardization process, these records should not be considered for matching purposes.

As mentioned previously, matching will be performed based on four routines: absolute match, partial match, Soundex and lookup match. The program would prompt the user to select which routines will be used for each attribute. At least one routine should be selected for each attribute, though all four can be selected for an attribute.

Appropriate weights need to be assigned to each identity attribute depending on their importance to the business process—for example, identifying a customer (Step 4). Similarly, weights need to be assigned to each routine for each attribute. The program should prompt the user to define the weights on a scale of 1 to 5.

Finally, we get to Step 5 and start to perform matching based on discriminating attributes to obtain the match windows. Thus, if DOB is defined as the discriminating attribute, then separate windows need to be formed based on the DOB. This means that records within the same window would have the same value of DOB. In the subsequent steps, the matching process will be performed within the individual windows and not within records across different windows.

Absolute Match

In Step 6 I perform an absolute match for all attributes. This routine compares two fields and looks for an exact match only. A score of 100 is assigned for an exact match. Any other result is scored 0.

For example, if Field 1 contains “John” and Field 2 contains “John,” it’s an exact match and is given a score of 100. If Field 1 contains “Lisa” and Field 2 contains “Laura,” it’s not an exact match and is given a score of 0.

The implementation of this absolute match algorithm is:

let AbsoluteMatch (attrOfFirstRec : string) (attrOfSecondRec : string) =
  let attrRec01 = attrOfFirstRec.Trim().ToLower()
  let attrRec02 = attrOfSecondRec.Trim().ToLower()
  match attrRec01, attrRec02 with
    | "", "" -> 100
    | _, _ -> if attrRec01 = attrRec02 then 100 else 0

Partial Match

Next, I can perform a partial match routine for all attributes. The partial match is used to determine the relationship to a blank value. This is useful when some records have, for example, the customer first name but not the customer last name, or vice versa. Sometimes a field is blank in both records. The partial match algorithm takes care of matching those records where one of the important fields might have been left blank.

Blanks and zeros are considered to be the same value. As for the absolute match, an exact match is scored 100. A blank field value versus a non-blank field value is scored 75, while a blank field value versus a blank field value is scored 65. Any other result is scored 0.

The implementation of the partial match algorithm is:

let PartialMatch (attrOfFirstRec : string) (attrOfSecondRec : string) =
  let attrRec01 = attrOfFirstRec.Trim().ToLower()
  let attrRec02 = attrOfSecondRec.Trim().ToLower()
  match attrRec01, attrRec02 with
    | "", "" -> 65
    | _, "" | "", _-> 75
    | _, _ -> if attrRec01 = attrRec02 then 100 else 0

Notice that this code is similar to the absolute match.

Soundex Match

Now I perform the Soundex algorithm for the attributes first name, last name, street name, city, state and country individually. The Soundex algorithm detects similar-sounding words using the following algorithm:

  1. Capitalize all letters in the string.
  2. Retain the first letter of the string.
  3. After the first position, convert all occurrences of the following letters to blank: A, E, I, O, U, W, Y.
  4. Change letters from the predetermined sets to corresponding number as shown in Figure 7.
  5. Remove all consecutive pairs of duplicate digits and blanks from the string that resulted after Step 4.
  6. Return the first four characters of the string, padded with trailing zeros if needed.

Figure 7 Soundex Letter Conversion

Letter Corresponding Number
B, F, P, V 1
C, G, J, K, Q, S, X, Z 2
D, T 3
L 4
M, N 5
R 6

Scoring values for the Soundex routine are a bit different. As before, if both strings are equal they’re scored 100. If one string is blank and the other is non-blank, then they’re scored 50. If both strings are blank they’re scored 0. And if neither string is blank and they’re not equal, they’re scored 0.

The implementation of this algorithm in F# is shown in Figure 8.

Figure 8 Soundex Match

let SoundexMatch (attr1:string, attr2:string) = 
  let conv c = 
    match c with
    | 'A' | 'E' | 'I' | 'O' | 'U' | 'W' | 'Y' -> ' '
    | 'B' | 'F' | 'P' | 'V' -> '1'
    | 'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' -> '2'
    | 'D' | 'T' -> '3'
    | 'L' -> '4'
    | 'M' | 'N' -> '5'
    | 'R' -> '6'
    | _ -> c
  let convertSoundex (inp:string) = 
    // Capitalize all letters in the string
    let chars = inp.ToUpper().ToCharArray() 
    let chars = 
      [| // Retain the first letter of the string
      yield chars.[0] 
      // Keep the first character, and remove pairwise-duplicates
      // Change letters from the predetermined sets to 
      // corresponding number 
      for (c1,c2) in Seq.pairwise (Seq.map conv chars) do
        // Remove all consecutive pairs of duplicate digits 
        // and blanks from the string 
        if c1 <> c2 && c2 <> ' ' then yield c2 |]
    // Convert back to a string
    String chars
  // Retain first four characters of resultant strings padded 
  // with trailing zeros if needed; leave unchanged if any 
  // string is blank
  let adjustResult (result:string) = 
    match result.Length with 
    | n when n >= 4 -> result.Substring(0, 4)
    | 0 -> result
    | n -> result + String.replicate (4 - n) "0"
  let attr1Result = attr1 |> convertSoundex |> adjustResult
  let attr2Result = attr2 |> convertSoundex |> adjustResult
  match attr1Result, attr2Result with
  | "", "" -> 0
  | "", _ | _, "" -> 50
  | _, _ -> if (attr1Result = attr2Result) then 100 else 0

In this code the Soundex conversion is done using pattern matching, keeping the first character intact. The following two for loops find consecutive duplicate characters and replace the second such character with a blank. The next two for loops discard the blanks, effectively removing any duplicates. Thus the four for loops discard juxtaposed duplicate characters.

The following two if statements extract the first four characters and, if needed, pad with zeros to make it at least four characters. The final pattern matching implements the scoring for the Soundex routine.

Lookup Matching

Lastly, I perform a lookup match for the street type attribute. A street lookup table will be referenced to standardize the street type, like so:

let LookupMatch (streetName : string) =
  let mutable streetMatchFound = false
  let mutable i = 0
  while ((no streetMatchFound) && (i < streetNames.Length)) do
    if (streetName = streetNames.[i]) then
      streetMatchFound <- true
  match streetMatchFound with
  | true -> 100
  | false -> 0

This code scans through the street lookup table to find a street name match using a while loop and then returns the scores if a match is found.

Scoring the Results

In Step 7 of the matching process, I retrieve the scores of the matching processes for each attribute. Thus, for first name, if there is no match for the absolute match routine, but there is a match for the Soundex routine, the match routine scores for first name would be 0 and 100, respectively.

In Step 8, the weighted match score for each attribute is determined, giving the attribute a score from 1 to 5. For example, if the absolute matching score of 0 is weighted as 5, and the Soundex score of 100 is weighted as 4, then the aggregate score for first name would be:

[(0x5)+(100x4)]/(5+4) ~ 44%

Assuming that all the match algorithms are selected, the implementation of this weighted scoring is shown in Figure 9.

Figure 9 Aggregated Scoring

let WeightedAverage results = 
  let cumulativeWeight = results |> 
    Array.sumBy (fun (r, weight) -> r * weight) 
  let totalWeight = results |> 
    Array.sumBy (fun (r, weight) -> weight) 
  cumulativeWeight / totalWeight
// Aggregate match score
// Calling the match routine which in turn calls absolute match, 
// Partial Match and Soundex Match routines in parallel
let FindAggregateMatchScore row = 
  let resultsWithWeights = 
    Async.Parallel [ async { return AbsoluteMatch row, 5 } 
                     async { return PartialMatch row, 5 } 
                     async { return SoundexMatch row, 4} ]
    |> Async.RunSynchronously
  WeightedAverage resultsWithWeights

This code assumes that all three match routines are called for each attribute (though not for the street attribute, for which the lookup match should be performed). First, Func delegates are declared for each match routine. Then the delegates are invoked in an asynchronous fashion using the BeginInvoke method. After waiting for the tasks to complete via WaitHandle.WaitAll, the results are collected using EndInvoke methods that take IAsyncResult parameters returned during BeginInvoke calls. Finally, the aggregate match score is returned as the last expression in the function.

In Step 9, the aggregate match scores of each attribute are multiplied by individual attribute weights, then added and expressed as a percent match score between the two records (see Figure 10).

Figure 10 Weighted Scoring

// Percentage match score
let FindPercentageMatchScore(rows : seq<string * string>) = 
  let results = 
    Async.Parallel [ for row in rows -> 
    async { return FindAggregateMatchScore row } ] 
    |> Async.RunSynchronously
  // Apply a weight of 5 for the first attribute and a weight 
  // of 4 for second and a weight of 3 for all other attributes
  let resultsWithWeights = results |> 
    Array.mapi (fun i r -> 
    r, (match i with 0 -> 5 | 1 -> 4 | _ -> 3)) 
  WeightedAverage resultsWithWeights

The Task.Factory.StartNew method from the Task Parallel Library for F# is used to call the aggregate match score for each pair of attributes of the two rows of data being compared, followed by a for loop that calculates the cumulative result. Finally, the percentage match score is returned.

Match thresholds—the upper threshold and lower threshold for scores—are user-defined. The system will ask the user for defining the upper and lower thresholds. A record match score above the upper threshold is considered an automatic match, while a record match score below the lower threshold is rejected and is considered a new customer record. A score between and inclusive of these two thresholds should be flagged for review.

With that, you’ve completed the record-matching and deduplication process. Obvious duplications can be deleted, and you can pass on the records flagged for review to either a real human or a more extensive programmatic review process.      

Ambar Ray is a solution architect working on Microsoft technologies. Ray has nearly 12 years of experience in the IT industry.

Thanks to the following technical experts for reviewing this article: Don Syme, Technical Excellence Group of TechMahindra Ltd.