DISCLAIMER : THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Fuzzy Matching in T-SQL
Normally we will use like ‘LIKE’, ‘IN’, ‘BETWEEN’ and other boolean operators to have more flexible, "fuzzier" filters when querying data.
At the very least, knowing these keywords will save you from having to write a tedious number of conditional statements just to get variations of a data value.
For example, instead of this:
SELECT * from person.person
WHERE FirstName ='Michele' OR FirstName ='Michael'
--OR
SELECT * from person.person
WHERE FirstName IN ('Michele','Michael')
We have:
SELECT * from person.person
WHERE FirstName LIKE '%MICH%'
--OR
SELECT * from person.person
WHERE FirstName LIKE 'MICH_E_'
But in some situations, we could not achieve our requirement when we use ‘LIKE’, ‘IN’, ‘BETWEEN’ and so on.
For example, we would like to compare two values ( for example, ‘Hector’ and ‘Hacktere’) who have the similar spelling or pronunciation but ‘LIKE’ is not working.
To do a fuzzy matching in this situation, we could try with one function 'Jaro-Winkler Stored Procedure' mentioned in
Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL and it could be a good solution for some specific situations.
Following is one example:
DECLARE @SOURCE TABLE
(
SRCNAME VARCHAR(100)
)
INSERT @SOURCE
SELECT 'Stephan sBaum' UNION ALL
SELECT 'Ida debSou' UNION ALL
SELECT 'Hector nicoCarrasco' UNION ALL
SELECT 'LUKE RUDD' UNION ALL
SELECT 'S.Caljouw' UNION ALL
SELECT 'Christelle Bregnauld' UNION ALL
SELECT 'Mike'
DECLARE @TARGET TABLE
(
TGTNAME VARCHAR(100)
)
INSERT @TARGET
SELECT 'Stephen' UNION ALL
SELECT 'Hacktere' UNION ALL
SELECT 'Hacktery' UNION ALL
SELECT 'Stephan' UNION ALL
SELECT 'luky rodd' UNION ALL
SELECT 'Christ' union all
SELECT 'Mike'
;with cte as (
select a.SRCNAME,b.TGTNAME,
[dbo].[fn_calculateJaroWinkler](a.SRCNAME,b.TGTNAME) similar
from @SOURCE a
cross apply @TARGET b)
select a.SRCNAME [SRC Name], b.TGTNAME [Targert Name]
from @SOURCE a
left join cte b on a.SRCNAME=b.SRCNAME
where similar>0.6
and left(a.SRCNAME,1)=LEFT(TGTNAME,1)
Output:
SRC Name Targert Name
Stephan sBaum Stephen
Stephan sBaum Stephan
Hector nicoCarrasco Hacktere
Hector nicoCarrasco Hacktery
LUKE RUDD luky rodd
Christelle Bregnauld Christ
Mike Mike
We could define the similarity by ourselves while calling this function. In above example, I chose 0.6 as the lowest value and we could get the expected output.
In some other situations, we need to compare the values in the same group which means Fuzzy grouping in TSQL.
Then we could try with SOUNDEX() function which returns a four-character (SOUNDEX) code to evaluate the similarity of two strings.
Let’s see another example as below:
DECLARE @SOURCE TABLE
(
ClassName VARCHAR(100),
SRCNAME VARCHAR(100)
)
INSERT @SOURCE
SELECT 'Class1','Stephan' UNION ALL
SELECT 'Class1','Stephen' UNION ALL
SELECT 'Class2','Hector' UNION ALL
SELECT 'Class2','Hacktere' UNION ALL
SELECT 'Class2','Hacktery' UNION ALL
SELECT 'Class3','Christelle' UNION ALL
SELECT 'Class3','Christ' UNION ALL
SELECT 'Class4','Mike' UNION ALL
SELECT 'Class4','Mike1' UNION ALL
SELECT 'Class4','Mike12' UNION ALL
SELECT 'Class4','Mike1234' UNION ALL
SELECT 'Class5','123' UNION ALL
SELECT 'Class5','456'
select t1.SRCNAME,
t2.SRCNAME
from @SOURCE t1
inner join
(
select SRCNAME , snd, rn
from
(
select SRCNAME,soundex(SRCNAME) snd,
row_number() over(partition by ClassName
order by soundex(SRCNAME)) rn
from @SOURCE
) d
where rn = 1
) t2
on soundex(t1.SRCNAME) = t2.snd;
Output:
SRCNAME SRCNAME
Stephan Stephan
Stephen Stephan
Hector Hector
Hacktere Hector
Hacktery Hector
Christelle Christelle
Christ Christelle
Mike Mike
Mike1 Mike
Mike12 Mike
Mike1234 Mike
123 123
456 123
In above example, we could see that all the similar values in the same group have the same values on the right side. Then we could find all potential duplicates in sub groups and proceed with next steps.
Of course, we could find some other fuzzy search related good functions and choose a more suitable one in our real cases.