question

SQL9-3412 avatar image
0 Votes"
SQL9-3412 asked SQL9-3412 commented

Need T-sql User Defined Function to remove specified words or chars from a table field.

Hi All,

I am trying to replace below words or chars from a given name field(company name or person's name) and I want a generic SQL user defined function which should remove all these given words or chars that are passed to the UDF function.

Basically I am storing all these chars or words in a Title table and in future if I add any new word or char to the Titles table then the function should remove the given word or char from the FullName field.

My Source field FullName values looks like below..

FullName:
John Doe Corporate
Mary Jones llp
Doctor Whu ltd
Leia & Organa MD
A&O Office
Dr. Ray Wed
James Adams Jr.
William Duke iii
Mr.John King Sr.
First Step, Inc.
A-Z Needs, Inc.
ABC 100 Sales, ltd


I need a UDF function to show output in below format. I am using SQL server 2016 & 2019 version

Note: 1) I have millions of records to process, so need something which is faster to process(not cursor functionality) the records.
2) It should remove numeric values (reg exp) if any.

FullName NewFullName
John Doe Corporate John Doe
Mary Jones llp Mary Jones
Doctor Whu ltd Doctor Whu
Leia & Organa MD Leia Organa
A&O Office AO Office
Dr.Ray Wed Ray Wed
James Adams Jr. James Adams
William Duke iii William Duke
Mr.John King Sr. John King
First Step, Inc. First Step
A-Z Needs, Inc. AZ Needs
ABC 100 Sales, ltd ABC Sales


CREATE TABLE #titles (title nvarchar(50));
INSERT INTO #titles (title)
VALUES (' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-')


CREATE TABLE #Data (fullname nvarchar(100));
INSERT INTO #Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')

Thanks in advance,
RH




sql-server-transact-sql
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@SQL9-3412,

Did you have a chance to try the proposed solution?

0 Votes 0 ·

Do you have any update?

0 Votes 0 ·

Please also remember to accept the answers if they helped.
Your action would be helpful to other users who encounter the same issue and read this thread.

0 Votes 0 ·
YitzhakKhabinsky-0887 avatar image
0 Votes"
YitzhakKhabinsky-0887 answered SQL9-3412 commented

Hi @SQL9-3412,

Please try the following solution.
It is pure T-SQL and will work starting from SQL Server 2017 onwards.
It is up to you to package it as a user defined function.

I separated the entire process into two steps:

  1. words cleansing.

  2. chars cleansing.

SQL

 -- DDL and sample data population, start
 DECLARE @titles TABLE  (ID INT IDENTITY PRIMARY KEY, title NVARCHAR(50));
 INSERT INTO @titles (title) VALUES 
 (' & '),('&'),('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'Dr.'),
 ('corporate'),( 'llp'),( 'llc'),( 'ltd'),('Inc.'),( 'limited'),( 'the'),( 'dr'),( 'md'),
 ('rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),( '3rd'),( 'aprn'),
 ('pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),
 ('cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' ),('.'),(','),('-');
    
 DECLARE @badChars NVARCHAR(100) = '0123456789&.,-';
 -- DDL and sample data population, end
    
 DECLARE @inputParameter NVARCHAR(100) = N'A-Z Needs, Inc.';
 DECLARE @output NVARCHAR(100);
    
 -- Step #1, cleanse words
 SELECT @output = STRING_AGG(cleansedWords, SPACE(1))
 FROM (SELECT value FROM STRING_SPLIT(@inputParameter, SPACE(1))
  WHERE value NOT IN (
  SELECT title FROM @titles)) AS t(cleansedWords);
    
 -- test, cleansed words
 SELECT @inputParameter, @output;
    
 -- Step #2, cleanse chars
 SELECT @output = TRIM((SELECT STRING_AGG(REPLACE(TRANSLATE(value, @badChars, SPACE(LEN(@badChars))),SPACE(1),''), SPACE(1))
  FROM STRING_SPLIT(@output, SPACE(1))
  ));
    
 -- test, cleansed chars
 SELECT @output AS Result;
· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

On the sample points that SQL9 has is Dr. Ray Wed, which your code does not cope with. Possibly you could fix that by first swapping the two steps. But my experience tells me that this sort of problem requires regular expression.

There is also an issue with names like Limited Resources Inc.. limited is on SQL9's list, but in this fictional example you want to retain that word.

0 Votes 0 ·

@ErlandSommarskog,

Thanks. Good catch.
I saw it too. My expectations were that the OP will add a full list of the forbidden words to the titles table.

0 Votes 0 ·

Hi YitzhakKhabinsky-0887 - Right now this solution worked for my current scenario as expected. I appreciate your help working on this.
I spoke to my client and they are satisfied with it. They are not expecting 100% accurate results when we use this function because like ErlandSommarskog mentioned there could be lot of other scenarios where we could miss some words or chars but that is acceptable in my current scenario.

Thanks,
RH

0 Votes 0 ·
ErlandSommarskog avatar image
0 Votes"
ErlandSommarskog answered ErlandSommarskog commented

Let's get this clear from the start: there is no good T-SQL-only solution for this.

You can write a user-defined function where you loop over the Titles table and run a replace for each entry. If you say:

UPDATE tbl SET fullname = dbo.StripJunk(fullname)

SQL Server will essentially implement this as a cursor under the hood. You may have heard of inlining of scalar functions in SQL 2019, but because of the loop in the function it will not happen here.

You could isntead build a dynamic SQL statement with an insane amount of nested replace() calls. I would expect this to run faster.

However, both these solution has the problem that replace() does not understand word boundaries, so if you have Faultdata Inc., it will become Fauata . with your Titles table. Trying to handle the word boundaries in T-SQL? Loop over each character in the string? How efficient will that be? You don't want to go there.

It still possible to run this from inside SQL Server, though. There are two options:
1. Using the RegEx classes in .NET through a CLR function.
2. Calling an external language with sp_execute_external_script.

Using the CLR might be the most efficient solution. It also has the advantage that it works on all SQL Server versions from SQL 2005 and up. The disadvantage is that it comes with quite a bit of red tape, and more so in recent versions.

The second option became available in SQL 2016. In SQL 2016, the only available language is R. SQL 2017 added Python and SQL 2019 added Java. (And also exposed the mechanisms to add your own language.)

R is a specialised language for statistics, and I don't know whether there is support for regular expressions. Below is a solution that uses Python, so you cannot use it on SQL 2016. In the solution, I am only considering words, and handling punctuation characters are left as an exercise to the reader.

DROP TABLE IF EXISTS Titles, Data

CREATE TABLE Titles (title nvarchar(50));
INSERT INTO Titles (title)
VALUES ('inc'),( 'incorp'),( 'incorporated'),( 'corporation'),( 'corp'),( 'corporate'),( 'llp'),( 'llc'),
( 'ltd'),( 'limited'),( 'the'),( 'dr'),( 'md'),( 'rn'),( 'phd'),( 'jr'),( 'sr'),( 'ii'),( 'iii'),( 'iv'),
( '3rd'),( 'aprn'),( 'pa-c'),( 'dds'),( 'dmd'),( 'dpm'),( 'dc'),( 'rnp'),( 'crnp'),( 'arnp'),( 'mpt'),( 'cfnp'),( 'canp'),( 'lnp'),( 'fnp'),( 'mba' )


CREATE TABLE Data (id int IDENTITY PRIMARY KEY, fullname nvarchar(100));
INSERT INTO Data (fullname) VALUES
('John Doe Corporate'),
('Mary Jones llp'),
('Doctor Whu ltd'),
('Leia & Organa MD'),
('A&O Office'),
('Dr.Ray Wed'),
('James Adams Jr.'),
('William Duke iii'),
('Mr.John King Sr.'),
('First Step, Inc.'),
('A-Z Needs, Inc.'),
('ABC 100 Sales, ltd')

DECLARE @pattern nvarchar(MAX)
SELECT @pattern = '\b(' + string_agg(title, '|') + '|\d+)\b(?i)'
FROM   Titles

DROP TABLE IF EXISTS #result
CREATE TABLE #result (id     int NOT NULL PRIMARY KEY,
                      result varchar(100) NULL)

INSERT #result(id, result)
 EXEC sp_execute_external_script @language = N'Python', 
     @input_data_1 = N'SELECT id, fullname FROM Data', 
     @output_data_1_name = N'Ret',
     @script = N'import re, pandas
Ret = InputDataSet
Ret["fullname"] = pandas.Series([re.sub(Pattern, "", i) for i in Ret["fullname"]])
', 
     @params = N'@Pattern nvarchar(MAX)',
     @Pattern = @pattern

UPDATE Data
SET    fullname = isnull(r.result, '')
FROM   Data d
JOIN   #result r ON d.id = r.id
WHERE  d.fullname <> isnull(r.result, '')

SELECT * FROM Data


In the end, you may prefer to read the data to a client-side program to do the processing there and write back.


· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

I should add that for the solution above to work, three things need to be fulfilled:
1. The Python support must be available in the instance. It's an optional component when you install SQL Server.
2. The server configuration parameter external scripts enabled must be 1.
3. The user must have the permission EXECUTE EXTERNAL SCRIPT.


0 Votes 0 ·

Thank you ErlandSommarskog for taking time and reply back with solutions. Unfortunately client would not go with python or R script because nobody knows about how to code them and support down the line. They need a regular t-sql as a solution. We already have cursor which is taking long time just to return clean full name. I appreciate your help.

0 Votes 0 ·

That's understandable, and I agree with not adding technology which is new and unknown only for the sake of it. Or for that matter even if the solution makes perfect sense in this case. If no one understands the setup and configuration, it will be too brittle.

So that leaves you with a single option: Write a client-side program that fits in with the current infrastructure and mindset. But don't do it T-SQL. T-SQL is a good hammer for processing data in general. But this problem requires a screwdriver.

0 Votes 0 ·
EchoLiu-msft avatar image
0 Votes"
EchoLiu-msft answered YitzhakKhabinsky-0887 edited

Hi @SQL9-3412,

As ErlandSommarskog said, tsql alone seems to be difficult to achieve your needs. Nested replacement may be possible, but that efficiency is not necessarily higher than your existing cursor.


Regards
Echo

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

@EchoLiu-msft,

I hope you didn't miss the proposed solution.
It will work starting from SQL Server 2017 onwards.

0 Votes 0 ·
TomPhillips-1744 avatar image
0 Votes"
TomPhillips-1744 answered TomPhillips-1744 edited

With only the information you have given, it is going to be extremely difficult to do in any language.

If in fact you have other attributes, like address, you are likely better off using something like a white pages or USPS database to "standardize" your information. Something like https://www.smartystreets.com/. (I have used SmartyStreets in the past, but have no affiliation to them).

Without knowing more about how you would use this information, it is impossible to guess if that will work for your situation or not.

5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.