“Fuzzy logic” search in SQL Server
I was looking hard for a good, robust, yet possibly simple and fast to implement method of “fuzzy” logic search for an application based on SQL Server 2005. “Fuzzy” meaning that a search would return not only exact (or wildcard) matches but also phonetically similar words. This is especially useful for any database applications with operators using last names, component names or any name actually heard over the phone for example.
SQL offers two methods: DIFFERENCE and SOUNDEX. Unfortunately both are quite inefficient unfortunately.
Books online give some examples how they work, but the fact is that the algorithm is a well-known and simple one: http://en.wikipedia.org/wiki/Soundex.
The first obstacle is that it retains the first letter of string, so “goat” and “boat” are according to SOUNDEX not phonetically similar…
What we did is take SOUNDEX of two compared words and then run them through DIFFERENCE. A difference of 3 would then mean similar words. This makes gets rid of the first letter “obstacle”, but then produces a too far-off comparison for words beginning with the same letter.
Unfortunately SQL does not have anything more built-in. The only way is to improve the algorithm yourself which can be quite complex – increased performance hit, and time for development. An example is a 600+ line stored procedure:
The most well-known improvements are DoubleMetaphone and adding distance algorithms, e.g. "Levenshtein Distance".
It might be interesting to compare how those features are implemented in other systems like Oracle, SAP, etc.