Skip to content Skip to sidebar Skip to footer

Querying Database To Find Potential Duplicates Based On Multiple Columns

I have a table in my database where I have about 100k clients and with a erroneous system that was in place for quite some time we are just coming to find out that we might potenti

Solution 1:

You would use window functions:

select t
from (select t.*,
             count(*) over (partitionbyleft(fname, 2), left(lname, 2), dob, sysemid) as cnt
      from t
     ) t
where cnt >1orderby sysemid, dob, fname, lname;

This may also be one of those rare situations where soundex() is actually useful. After all, it was designed with strings in mind. So:

select t
from (select t.*,
             count(*) over (partitionby soundex(fname), soundex(lname), dob, sysemid) as cnt
      from t
     ) t
where cnt >1orderby sysemid, dob, fname, lname;

Solution 2:

I guess you have sovled the issue for new data but for old data what you can do is the following:

1) Make exact comparison that will get you the very exact duplicates. This is the easy part.

2) For Similar strings, for example Josh and Joshh you need either to use some code that you need to write to check for distance between two strings or a simpler way if you like coding in SQL only is to use DIFFERENCE function. this will give you a value of 0-4 where 4 means the strings are very similar. The bad thing about this is you need to give them a quick look by eye to make sure everything is good. For more info https://technet.microsoft.com/en-us/library/ms189282(v=sql.105).aspx

3) I prefer you go, open Visual studio or your preferred IDE, connect to your database , Select the whole data in the table, bring it in memory and implement a good string distance function.

Solution 3:

Like this:

SELECT LEFT(FName,2) AS FN, LEFT(LName,2) AS LN, DOB, COUNT(*)
FROM YourTable
GROUP BY LEFT(FName,2), LEFT(LName,2), DOB
HAVING COUNT(*) > 1
;

Solution 4:

There is no magic bullet built into SQL Server. You will have to decide on your business rules and code them.

If you decide to look for matches in the first 3 letters of a name, for instance, use the LEFT() function to get the first 3 letters. To look at individual parts of the DOB, you can use the DATEPART() function.

Post a Comment for "Querying Database To Find Potential Duplicates Based On Multiple Columns"