question

FarhanSabir-9947 avatar image
0 Votes"
FarhanSabir-9947 asked EchoLiu-msft commented

Find duplicate records based on combination of attributes

Hi

I am doing a cleanup of a customer table. The customer table receives data from 4 different sources and there are many duplicate records. I am using a combination of rules to identify similar customer records. The rule I have are:

1) they have same firstname,lastname, email and mobile
2) they have similar lastname, mobile and email
3) they have similar firstname, lastname and email.

I've so far been unable to write a query or use SSIS to identify the records based on the above rules. Can anyone help me.

Thanks

Expected output:
126908-image.png

Code:

CREATE TABLE dbo.customer(
customer_id int,
first_name varchar(100),
last_name varchar(100),
email varchar(100),
mobile varchar(20),
unique_id int)
GO

INSERT INTO dbo.customer(customer_id, first_name, last_name, email, mobile)
values (5475872,'','SMITH','john.smith@fakeemail.com','123456'),
(623073,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
(1286681,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
(1303103,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
(7063902,'JOHN','SMITH','john.smith@fakeemail.com','987654'),
(1808155,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
(2151985,'','SMITH','john.smith@fakeemail.com','123456'),
(7221764,'JOHN','SMITH','john.smith@fakeemail.com','987654'),
(1870704,'JOHN','','john.smith@fakeemail.com','123456'),
(887842,'BILL','GATES','bill@microsoft.com','555555'),
(736363,'','GATES','bill@microsoft.com','555555')
GO


sql-server-generalsql-server-transact-sqlsql-server-integration-services
image.png (28.7 KiB)
· 6
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Good day and welcome the forum

You cannot execute query on a table if you do not have the table and same go for us.

Please provide queries to create the relevant table and to insert some sample data. In addition please inform us what is the server version and describe the expected result SET according the sample data you provided.

0 Votes 0 ·

I am trying to attach photo of the output and code but for some reasons I can't submit it

0 Votes 0 ·
 Here is the code
    
 CREATE TABLE dbo.customer(
 customer_id int,
 first_name varchar(100),
 last_name varchar(100),
 email varchar(100),
 mobile varchar(20),
 unique_id int)
 GO
    
 INSERT INTO dbo.customer(customer_id, first_name, last_name, email, mobile)
 values (5475872,'','SMITH','john.smith@fakeemail.com','123456'),
 (623073,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
 (1286681,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
 (1303103,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
 (7063902,'JOHN','SMITH','john.smith@fakeemail.com','987654'),
 (1808155,'JOHN','SMITH','john.smith@fakeemail.com','123456'),
 (2151985,'','SMITH','john.smith@fakeemail.com','123456'),
 (7221764,'JOHN','SMITH','john.smith@fakeemail.com','987654'),
 (1870704,'JOHN','','john.smith@fakeemail.com','123456'),
 (887842,'BILL','GATES','bill@microsoft.com','555555'),
 (736363,'','GATES','bill@microsoft.com','555555')
 GO
1 Vote 1 ·

Hi @FarhanSabir-9947
You can try to submit again. Q&A forums sometimes fail to upload screenshots. I often encounter such problems.

But we recommend that you provide CREATE statement and INSERT statement so that we can copy the code directly for testing. Screenshots will make it difficult for us to copy the code.

Regards
Echo

0 Votes 0 ·
Show more comments

Could you please validate all the answers so far and provide any update?
If all of them are not working or helpful, please provide more sample data or details about your issue.
Please remember to accept the answers if they helped. Your action would be helpful to other users who encounter the same issue and read this thread.

Thank you for understanding!

0 Votes 0 ·
pituach avatar image
1 Vote"
pituach answered FarhanSabir-9947 commented

Hi :-)

I am not sure what you want to do with the duplicate rows so I will present two options

Note! Rule 1 is not necessary since if it is implemented then rules 2 and 3 are also implemented

Note! I assume that customer_id is the primary key(uniqueness)

Option 1: Show all duplicates rows

 -- Show all duplicates
 ;With MyCTE as (
  SELECT c.customer_id, c.email, c.first_name, c.last_name, c.mobile, c.unique_id
  ,C1 = COUNT(*) OVER (PARTITION BY c.last_name, c.mobile,c.email ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  ,C2 = COUNT(*) OVER (PARTITION BY c.first_name, c.last_name,c.email ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  FROM customer c
 )
 SELECT customer_id, email, first_name, last_name, mobile, unique_id
 FROM MyCTE
 WHERE C1 > 1 or C2 > 1
 GO


Option 2: I will delete duplicate rows according to the condition and leave one of each duplicate rows (the one with the smaller customer_id)

 -- This will remove duplictae customers and will leave the one with the smaller customer_id
 ;With MyCTE as (
  SELECT c.customer_id, c.email, c.first_name, c.last_name, c.mobile, c.unique_id
  ,RN1 = ROW_NUMBER() OVER (PARTITION BY c.last_name, c.mobile,c.email ORDER BY c.customer_id)
  ,RN2 = ROW_NUMBER() OVER (PARTITION BY c.first_name, c.last_name,c.email ORDER BY c.customer_id)
  FROM customer c
 )
 DELETE MyCTE
 WHERE customer_id IN (
  SELECT customer_id
  FROM MyCTE
  WHERE RN1 > 1 or RN2 > 1
 )
 GO


Please check if this fit your needs

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Thank you, I will try your solution and will let you know if it works

0 Votes 0 ·
FarhanSabir-9947 avatar image
0 Votes"
FarhanSabir-9947 answered Viorel-1 commented

Hi pituach

Thanks for the reply. I tried your scripts and option 2 is the one that I exactly need. The only issue is, it is returning 2 records instead of 1 for JOHN SMITH.

Is there anyway to restrict it to return just one record ? (order by customer_id)

· 1
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.


The results probably corresponds to your rules. The records cannot be merged.

0 Votes 0 ·