question

SeHor-8331 avatar image
0 Votes"
SeHor-8331 asked AmeliaGu-msft commented

query for a term with high similarity to any of the terms(words) in large table column containing addresses

In SQL 2014, I have product names in one of the columns of a large 30 GB product table. Based on a search term, I need to find all records for which one of the terms in the product name contains or is very similar with this query term. so I was thinking at adding full text on this column to tokenize each term of this table column. But not sure if I can search for term similarity using the full text.

Eg: looking for "Richmond", where row 3 is very similar (missing the last 'd') and row 4 is also similar (but missing the 'h' in the middle)

rows :
-123 Richmond street
-North Richmond Ln
-East Richmon Crescent
-South Ricmond Drive, 876

Can you recommend a solution/link for this.

sql-server-general
· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi SeHor-8331,

Did the answers help you?
Please feel free to let us know if you have any other question.
If you find any post in the thread is helpful, you could kindly accept it as answer.

Best Regards,
Amelia

0 Votes 0 ·
AmeliaGu-msft avatar image
0 Votes"
AmeliaGu-msft answered AmeliaGu-msft commented

Hi SeHor-8331,
Please try to use CONTAINS or CONTAINSTABLE to search for words or phrases with a specified prefix.

For Example:

 SELECT * FROM table
 WHERE CONTAINS(column,'"Ric*"')

Please refer to this doc which might be helpful.

Best Regards,
Amelia


If the answer is the right solution, please click "Accept Answer" and kindly upvote it. If you have extra questions about this answer, please click "Comment".
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thank you Amelia, for my case it might be that the term in the column is very similar with the searched term ex: search for "Richmond" and address is "25 Rihmond" so no c as the 3-rd character. So i am looking for similarity as well.

0 Votes 0 ·

Hi SeHor-8331,

Thanks for your reply.
Full-text search is not very suitable for fuzzy or misspelled words.
You also can try to run SELECT * FROM tableWHERE CONTAINS(column,'"Ri*"'), but it may return all text that matches the word beginning with “Ri”.
You may use the LIKE clause to approach the effect for such matching.

Best Regards,
Amelia

0 Votes 0 ·
Viorel-1 avatar image
0 Votes"
Viorel-1 answered SeHor-8331 commented

For this particular example, DIFFERENCE('Richmond', 'Richmon') and DIFFERENCE('Richmond', 'Ricmond') give 4, which indicates a strong similarity. The SOUNDEX function gives the same code for these words. Maybe you can use this technique by splitting the address into words (keeping the SOUNDEX codes in a separate indexed table).


· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thanks Viorel, create a full text index on the table and copy the PK and this full text to a separate table? and for this table apply Soundex/Difference functions to determine similar values? OR, create new table with PK and parse the split the address column by blanks and create new table with PK and all terms in the address column? and then apply Soundex/Difference for this new table?
As i mentioned to Amelia, for my case it might be that the term in the column is very similar with the searched term ex: search for "Richmond" and address is "25 Rihmond" so no c as the 3-rd character. So i am looking for similarity as well.

0 Votes 0 ·
ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered

This is not very easy to do with SQL Server. soundex/difference that Viroel mentioned are very rudimentary. Nor is fulltext of much use here.

The best fit in the SQL Server family is Data Quality Services which is an optional install. It is aimed to handles situations like this, but it has not gotten much love since it was introduced in SQL 2012, and the algorithms it uses have awful performance.

There are better tools from other vendors. Unfortunately, I don't have any names, as I have not worked in this space myself. I have only attended a presentations where the presenter discussed a couple of solutions - and was very clear on how poor DQS is.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.