Correlating Active Directory accounts with their corresponding HR records in the absence of unique identifiers
This posting is provided "AS IS" with no warranties, and confers no rights. Use of included script samples are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm
The question of how the records stored in various repositories could be linked together inevitably comes up in most identity management related projects. For a variety of reasons (see bullets below) this question often becomes the crux of a project.
· Active Directory accounts are being created on an ad-hoc basis, without supplying HR related information within the properties of a user object
· HR records are being created well after employee’s start date, hence making it impossible for IT to specify HR related information at the time of the account creation
This document is intended to provide guidance on how to approach an environment where unique attributes allowing for the linkage from one system to another are not present.
It is important to note upfront that the problem we are trying to solve is not technical in nature, but rather is related to the deficiencies around provisioning processes. Hence this document should not be treated as an alternative to establishing proper HR driven provisioning practices, though the procedures described here may assist with overcoming the challenges in moving to a centralized provisioning model. It is also important to clearly understand the limitations inherent in solving this problem:
· Since 100% accuracy of matching cannot be delivered, do not utilize the produced results for the projects related to authentication and access control, without first going through an attestation process to validate the produced linkages
· The framework is designed to ease the effort involved in establishing the correlation between AD and HR, but do expect and plan for manual intervention to validate and attest the matches
· The proposed solution relies heavily on Microsoft SQL Server Integration Services (SSIS), specifically the Fuzzy Lookup component (SQL Server Enterprise Edition only). This MSDN article should provide you with the necessary background information on this component. Pay special attention to the concepts of confidence and similarity, since they are critical in proper interpretation of the results.
· For extracting data from Active Directory and converting it into a relational format we will leverage SSIS Active Directory Domain Source component. More information on this component and the instruction on installation can be found here.
· The UI specifically developed for the purposes of assisting in the joining process is developed using Microsoft Visual Studio LightSwitch. See these links for details on how to deploy and secure Lightswitch applications:
On a conceptual level the process of correlating records could be subdivided into the following stages:
1. Extract information from HR and AD into SQL Server tables
2. Utilize SSIS Fuzzy Grouping component to group similar records within both HR and AD
3. Utilize SSIS Fuzzy Lookup to suggest matches between HR and AD name groupings
4. Separate the inconclusive (low quality matches) from the high quality matches
5. Route inconclusive matches to an Identity Management Administrator for manual validation
6. Once an inconclusive match is resolved it will be added to the list of the matched records
Reasoning for collapsing similar records into name grouping representations
Before we can answer this question we need to understand how SSIS Fuzzy Lookup (the component which we utilize to link records) makes matching decisions. Fuzzy Lookup, in addition to the input data, is also configured with a reference table, which is consulted for finding matching candidates for each incoming row.
Now imagine this scenario; there are 2 John Smiths in your HR database and one John Smith in Active Directory. In this case Fuzzy Lookup would provide a fairly high similarity and confidence score by matching an HR John Smith to the AD John Smith, since there is only one John Smith in AD and the names are identical, but the question is which HR John Smith. We can’t control this! To state this in different terms, the reference table needs to contain unique values in order to provide predictable results. For these reason in most SSIS cleansing jobs you will find Fuzzy Grouping transformation performing the de-duplication prior to Fuzzy Lookup matching operation. Of course in our case we can’t simply de-duplicate the records, since this would lead to the loss of data (it is highly possible that we may have 2 or more legitimate AD accounts with first and last names set to John Smith). For this reason when performing Fuzzy Lookup we need to abstract from the HR and AD records by adding another higher level of mapping, which is based purely on the unique naming combinations (name groupings), tracking at the same time the relationship from the name grouping to the actual records in HR and AD.
Now that we collapsed the HR data into the Naming Groupings records we can leverage this data as a reference in the Fuzzy Lookup process. Of course, the same process would need to be followed with AD data, so that matching is performed against the groupings.
Thinking in terms of Groupings
Of course, our final goal is to join HR records to AD, eventually we will need to descend to the level of the groupings members to create a join.
There are three scenarios here:
· A good quality match is found between HR name grouping and AD name grouping, and both groupings contain only one member. This case is a potential for an automatic join. To describe this in other terms, this case represent a scenario where there is a unique first and last name combination within and across both systems.
· A good quality match is found between HR and AD name groupings, but one or both groupings contain more than one member. Such ambiguous cases would have to be resolved manually.
· No quality match was found between HR and AD groupings, which probably suggests that either an HR record is not represented in AD or vice versa
Probably the best way to explain the process is by going through an exercise of matching on a small data set.
The sample datasets are composed on the assumption that only the last and first names could be utilized in the linking process. In other words, information like department, manager, location, etc., is either not available or is not reliable. It is highly recommended to conduct a data profiling exercise in order to determine if any other fields could be reliable utilized in the matching process in order to reduce the number of ambiguous matches.
1. Unique combinations of last and first names in both datasets.
a. Chris Daniel is a sufficiently unique combination within and across both datasets (i.e. there is only one Chris Daniel in AD and HR), hence HR record (employeeID #3) should automatically join to AD account (sAMAccountName cdaniel)
b. No other record should join automatically since they are not sufficiently distinct either within or outside their respective datasets.
2. Ambiguous last and first name combinations within and across the datasets.
a. Subcase 1. Multiple identical first and last name combinations. Grouping of HR records 11 and 12 should be related to the grouping in AD of accounts sromanof and sromanof1. This relationship should be presented in the matching UI and resolved manually by an Identity Management administrator.
b. Subcase 2. Records of high degree of similarity. Grouping of HR records 1, 2 and 8 should be related to the grouping in AD of accounts alextc and alextc2. This relationship should be presented in the matching UI and resolved manually by an Identity Management administrator.
3. Name grouping is not represented in one of the systems
a. Tim Harrison naming grouping is not represented in HR, hence no attempts should be made to linking. It should be possible to query all AD unmatched accounts.
Matching Process Flow
Importing AD and HR data into SQL tables
The process of matching begins by importing the data from AD and HR into a tabular format (SQL Server tables).
Generate HR Name Groupings
The goal of this data flow task is to identify similar first and last name combinations within the HR records. Once similar records are identified they form a grouping, where each grouping is identified by a unique ID and all grouping members are linked to the grouping. The grouping becomes the representation of the similar rows. The relationship between the grouping and the grouping members could be visualized by creating a view which links groupings and the corresponding grouping members, we can also think of this view is the end goal of this task
In this example HR records with IDs 1, 2, and 8 formed a grouping with ID of 1. Note that HRGroupingID is created by SSIS during the execution of the task, hence is only meaningful within the context of a specific job run.
The process of “collapsing” similar records into groupings allows us to abstract from the individual records and work with the unique name combinations. Such unique name combinations could later be compared with the unique name groupings in AD, of course when visualizing the relationship between the name groupings the grouping members will also be exposed in the matching UI.
Let’s walk through the logic flow of this task.
1. The source of the task is a SQL view which is based on the table containing all HR records; this table was populated with data in the LoadHRToSQL task. The goal of the view is to filter-out previously matched records, which are stored in the tblMatches.
2. The main component of this task is the Fuzzy Grouping transformation, which forms grouping of similar records.
Let’s examine the grouping of records identified by the HRGroupingID #3. Fuzzy Grouping component determined based on the closeness of first and last names that HR records: 11 and 12, should form a group. One of the rows out of the two is designated by SSIS as the grouping representation (grouping row) and the remaining members of the grouping point to the grouping row via the HRGroupingMemberID. You can spot the grouping row based on the fact that its HRGroupingID equals HRGroupingMemberID.
3. The remainder of the task activities is focused on splitting the output of the Fuzzy Grouping transformation into the name groupings and name grouping members.
The Conditional Split transformation separates the grouping rows from the grouping member rows.
The Union All and the Multicast transformations are utilized in order to bring the “primary” grouping row into the GroupingMembers table. Despite the fact that this “primary” row plays the role of the grouping representation, it still points to an HR record and needs to be considered in the matching process.
In principal the “primary” row and the grouping members could be separate via a self-joint view, but for the reasons of coding convenience I decided to separate these entities into their own tables.
Generate AD Groupings
The process of generating AD groupings is identical to the process we just covered for HR records (here we use objectGUID instead of employeeIDs to identify grouping members), hence I will only provide here the final output of the task.
Relate HR to AD
Conceptually this task could be subdivided into the following stages:
· Perform Fuzzy Lookup of AD Name Groupings by using HR Name Groupings as a reference table
· Use Conditional Split transformation to create two data flows: AutoMatchQualityMatches and HintQualityMatches
The Conditional Split uses two variable sets which determine whether or not a quality of match is sufficient to be considered for an automated join, or if the quality of a match is worth visualizing in the joiner UI for an Identity Manager Administrator
· The “high” quality matches of are now put through 2 check to see if the groupings in question contain more than 1 member. Remember we only want to auto-join groupings where there is no ambiguity of about the join candidates (i.e. grouping consists of a single member on both sides of the join). Groupings which did not pass this test are directed into the Hints table and will be visualized in the Joiner UI.
The check of whether a grouping has more than one member is performed by conducting a look-up against views (one for AD and one for HR) which contain only groupings with a single member
· Matches which did not pass the auto-match quality threshold but passed the hint quality threshold, plus the matches which failed the ambiguity test, are directed into the Hints table.
Translate HR and AD Groupings IDs to employeeID and objectGUID for the auto-matched records
Since the HR and AD groupings IDs are only relevant within the context of a specific SSIS job, this task will convert grouping IDs into the corresponding employeeID and objectGUID identifiers for the automatically joined records and deposit the “translated” match records into tblMatches. The task leverages the SSIS Merge Join transformation to build-out the relationship from the AutoMatchedGroupings to ADGroupMembers and HRGroupMembers, and in doing so translate from the matched grouping IDs into the unique HR and AD identifiers.
Appendix 1 Hints Visualization
Appendix 2 Matching database diagram
Components download links
All custom components referenced in this document including source code can be found