query for a term with high similarity to any of the terms(words) in large table column containing addresses

SeHor 66 Reputation points
2021-10-20T21:40:12.89+00:00

In SQL 2014, I have product names in one of the columns of a large 30 GB product table. Based on a search term, I need to find all records for which one of the terms in the product name contains or is very similar with this query term. so I was thinking at adding full text on this column to tokenize each term of this table column. But not sure if I can search for term similarity using the full text.

Eg: looking for "Richmond", where row 3 is very similar (missing the last 'd') and row 4 is also similar (but missing the 'h' in the middle)

rows :
-123 Richmond street
-North Richmond Ln
-East Richmon Crescent
-South Ricmond Drive, 876

Can you recommend a solution/link for this.

SQL Server
SQL Server
A family of Microsoft relational database management and analysis systems for e-commerce, line-of-business, and data warehousing solutions.
12,489 questions
{count} votes

3 answers

Sort by: Most helpful
  1. AmeliaGu-MSFT 13,961 Reputation points Microsoft Vendor
    2021-10-21T03:13:27.447+00:00

    Hi SeHor-8331,
    Please try to use CONTAINS or CONTAINSTABLE to search for words or phrases with a specified prefix.

    For Example:

    SELECT * FROM table  
    WHERE CONTAINS(column,'"Ric*"')  
    

    Please refer to this doc which might be helpful.

    Best Regards,
    Amelia


    If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".
    Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


  2. Viorel 110.8K Reputation points
    2021-10-21T03:57:34.96+00:00

    For this particular example, DIFFERENCE('Richmond', 'Richmon') and DIFFERENCE('Richmond', 'Ricmond') give 4, which indicates a strong similarity. The SOUNDEX function gives the same code for these words. Maybe you can use this technique by splitting the address into words (keeping the SOUNDEX codes in a separate indexed table).


  3. Erland Sommarskog 100.1K Reputation points MVP
    2021-10-21T21:15:12.263+00:00

    This is not very easy to do with SQL Server. soundex/difference that Viroel mentioned are very rudimentary. Nor is fulltext of much use here.

    The best fit in the SQL Server family is Data Quality Services which is an optional install. It is aimed to handles situations like this, but it has not gotten much love since it was introduced in SQL 2012, and the algorithms it uses have awful performance.

    There are better tools from other vendors. Unfortunately, I don't have any names, as I have not worked in this space myself. I have only attended a presentations where the presenter discussed a couple of solutions - and was very clear on how poor DQS is.

    0 comments No comments