Finding duplicate records using cascading criteria, then combining into one record - tsql

I am using MS SQL Server 2012, and have done simple querying and data loading, but not looping or case statements, or nested selects. I am looking for some assistance to get me started on the approach.
We are in a project where we are combining the customer listing from multiple legacy systems. I have a raw customer table in a staging database that contains records from those multiple sources. We need to do the following before writing the final table to a data mart. I would think this is quite a common scenario in the data cleansing/golden record world, but after much searching, I have not been able to locate a similar post.
First, we need to find records that represent the same customer. These records are coming from multiple sources so there could be more than 2 records that represent the same customer. Each source uses a similar model. The criteria that determines whether the record(s) represent the same customer changes in a cascading hierarchy depending on the values available. The first criteria we want to use for a record is the DOB and SSN. But if the SSN is missing, then the criteria for that row becomes the Last Name, First Name and DOB. If both the SSN and the DOB are missing, then the duplicate test changes to last name + first name + another criteria field. There are other criteria even after this if one of the names is missing. And since records that represent the same customer may have different fields available, we would have to use the test that both records can use. There may not be duplicate records if it turns out that a given customer only exists in one system.
Once duplicated records have been identified, we wish to then combine those records that represent a customer, so that we end up with 1 record representing the customer written to a new table, using the same fields. Combining is done by comparing values of like fields. If the SSN is missing from one source, but is available in another, then that SSN is used. If there are more than 2 records that represent a customer, and more than 1 has an SSN, and those SSN numbers are different, there is a hierarchy based on which system the record came from, and we want to write the SSN value from the system highest in the hierarchy. This kind of logic would be applied to each field we need to examine.
I think the piece that is hardest for me to conceptualize is how do you store values of one record so that you can compare against one or more other records in the same table, do the actual compare logic, then write the "winning" value to a new table? If I can get some help with that, it would be greatly appreciated.

The basic requirements that you have outlined are fulfilled by this query. Note that DENSE_RANK requires an ORDER BY inside the OVER clause; ordering by the criteria columns means every record sharing those values gets the same rank, which acts as a group number:
SELECT a.ID,
DENSE_RANK() OVER( ORDER BY DOB, SSN ) AS Match1,
DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], DOB ) AS Match2,
DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], [Another criteria] ) AS Match3
INTO #Matchmaking
FROM tCustStaging AS a
What you will likely find though is that you will need to "prepare" (cleanse) your data first, that is ensure that it is all in the same format and remove "rubbish". A common problem may be phone numbers where various formats can be used e.g. '02 1234 1234', '0212341234', '+212341234' etc. Names may also have variations in spelling especially for Compound Names.
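As an illustration of that kind of preparation, here is a minimal phone-number normaliser sketched in Python; the default country code and the folding rules are assumptions for the example, not a general-purpose solution:

```python
import re

def normalize_phone(raw: str, default_cc: str = "61") -> str:
    """Strip punctuation and spaces, then fold a leading '+CC' or a
    trunk '0' into one canonical digit string so formats compare equal.
    The default country code '61' is an assumption for this sketch."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if raw.strip().startswith("+"):
        return digits                        # already carries a country code
    if digits.startswith("0"):
        digits = default_cc + digits[1:]     # replace the trunk prefix
    return digits

# Several surface formats of the same number collapse to one key:
print({normalize_phone(p) for p in ["02 1234 1234", "0212341234", "+61212341234"]})
```

With a function like this applied before matching, the window-function comparisons see one canonical value per phone number instead of several formats.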
Another way to do matching is to calculate matches on all fields individually:
SELECT a.ID,
DENSE_RANK() OVER( ORDER BY SSN ) AS SSNMatch,
DENSE_RANK() OVER( ORDER BY DOB ) AS DOBMatch,
DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 10 ) ) AS LNMatch10,
DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 9 ) ) AS LNMatch9,
DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 8 ) ) AS LNMatch8,
-- etc. down to
DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 3 ) ) AS LNMatch3,
DENSE_RANK() OVER( ORDER BY LEFT( [First Name], 10 ) ) AS FNMatch10,
-- etc.
DENSE_RANK() OVER( ORDER BY [Other criteria1] ) AS OC1,
DENSE_RANK() OVER( ORDER BY [Other criteria2] ) AS OC2
INTO #Matchmaking
FROM tCustStaging AS a
You then start with the strongest match (SSN + DOB). You can also experiment with various combinations of fields to see what you get.
-- You can play around with various combinations to see what results you get
SELECT c.*
FROM #Matchmaking AS a
INNER JOIN #Matchmaking AS b ON a.SSNMatch = b.SSNMatch AND a.DOBMatch = b.DOBMatch AND a.LNMatch10 = b.LNMatch10
    AND a.ID <> b.ID -- exclude each record matching itself
INNER JOIN tCustStaging AS c ON a.ID = c.ID
After each iteration of matching you save the results.
You then keep relaxing the matching criteria, while carefully checking for false matches, until the criteria are so weak that they no longer produce useful results.
You will eventually end up with a set of results based on different strengths of matching criteria.
In the end, the number of "questionable matches" (where you are not sure whether two records represent the same customer) will depend on the initial quality of the data and its quality after "preparation". You will likely still have to analyse some data manually.
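As a rough, runnable sketch of the ranking idea, here is the same kind of matching against an in-memory SQLite database (SQLite 3.25 or later is needed for window functions); the table, columns, and sample customers are all invented for illustration:

```python
import sqlite3

# Toy demo: DENSE_RANK over the criteria columns gives every record that
# shares those values the same rank, which acts as a match-group number.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cust (id INTEGER, last TEXT, first TEXT, dob TEXT, ssn TEXT);
INSERT INTO cust VALUES
 (1,'Smith','Ann','1980-01-01','111'),
 (2,'Smith','Ann','1980-01-01','111'),  -- same SSN+DOB as id 1
 (3,'Smith','Ann','1980-01-01',NULL),   -- SSN missing, matches on name+DOB
 (4,'Jones','Bob','1975-05-05','222');
""")
rows = con.execute("""
SELECT id,
       DENSE_RANK() OVER (ORDER BY dob, ssn)         AS match1,
       DENSE_RANK() OVER (ORDER BY last, first, dob) AS match2
FROM cust
ORDER BY id
""").fetchall()
for r in rows:
    print(r)
```

Records 1 and 2 share both match numbers (strong match); record 3 shares only match2, the weaker name-plus-DOB criterion.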

T-SQL "partition by" results not as expected

What I'm trying to do is get a total count of EmailAddresses using partitioning logic. As you can see in the result-set spreadsheet, the first record is correct: this particular email address exists 109 times. But the second record, for the same email address, shows 108 in the numberOfEmailAddresses column, and so on, decreasing by 1 on each row for the same email address. Clearly I'm not writing this SQL right, and I was hoping to get some feedback as to what I might be doing wrong.
What I would like to see is the number 109 consistently down the column numberOfEmailAddresses for this particular email address. What might I be doing wrong?
Here's my code:
select
Q1.SubscriberKey,
Q1.EmailAddress,
Q1.numberOfEmailAddresses
from
(select
sub.SubscriberKey as SubscriberKey,
sub.EmailAddress as EmailAddress,
count(*) over (partition by sub.EmailAddress order by sub.SubscriberKey asc) as numberOfEmailAddresses
from
ent._Subscribers sub) Q1
And here's my result set, ordered by "numberOfEmailAddresses":
select distinct
Q1.SubscriberKey,
Q1.EmailAddress,
(select count(*) from ent._Subscribers sub where sub.EmailAddress = Q1.EmailAddress) as numberOfEmailAddress
from ent._Subscribers Q1
will get you what you want. The inclusion of the ORDER BY in your OVER clause is what is causing the changing count: once a window has an ORDER BY, its default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so COUNT(*) becomes a running count up to the current row rather than a total for the whole partition.
select
Q1.SubscriberKey,
Q1.EmailAddress,
Q1.numberOfEmailAddresses
from
(select
sub.SubscriberKey as SubscriberKey,
sub.EmailAddress as EmailAddress,
count(*) over (partition by sub.EmailAddress) as numberOfEmailAddresses
from
ent._Subscribers sub) Q1
should also work: without the ORDER BY, the window frame covers the entire partition, so every row with the same EmailAddress gets the full total.
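The difference between the two forms can be checked end-to-end with an in-memory SQLite database (SQLite >= 3.25 for window functions); the table and email values below are invented sample data:

```python
import sqlite3

# ORDER BY inside OVER makes COUNT(*) a running count, because the
# default frame is RANGE UNBOUNDED PRECEDING .. CURRENT ROW; dropping
# the ORDER BY counts the whole partition on every row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE subs (sk INTEGER, email TEXT)")
con.executemany("INSERT INTO subs VALUES (?, ?)",
                [(1, 'a@x'), (2, 'a@x'), (3, 'a@x'), (4, 'b@x')])
running = con.execute("""
SELECT sk, COUNT(*) OVER (PARTITION BY email ORDER BY sk)
FROM subs ORDER BY sk
""").fetchall()
total = con.execute("""
SELECT sk, COUNT(*) OVER (PARTITION BY email)
FROM subs ORDER BY sk
""").fetchall()
print(running)  # counts climb 1, 2, 3 within 'a@x'
print(total)    # every 'a@x' row shows the full count of 3
```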

Updating Group Number based on swap records in postgresql

I have a two-column table like the one below:
Ref Comp
A B
B A
The data in the two rows is swapped. I need to assign the same group number to both records, in a separate column, as shown below; in our case both rows represent the same pair, so both should get the same number. Please suggest a solution.
GROUP REF COMP
1 A B
1 B A
You can use the window function dense_rank; in the over(...), use just the ORDER BY clause:
select dense_rank() over( order by least(ref,comp), greatest(ref,comp) ) as "Group"
, ref
, comp
from <your_table>
order by "Group", least(ref,comp);
For the demo, I added a couple of additional data rows; I seldom trust a test against a result set with only one basic item, in this case one "Group".
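A runnable sketch of the same trick in an in-memory SQLite database: SQLite has no least()/greatest(), but its two-argument min()/max() scalar functions behave the same way for canonicalising the pair:

```python
import sqlite3

# Sorting each (ref, comp) pair into a canonical (smaller, larger) order
# makes the swapped rows identical, so dense_rank assigns one group number.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pairs (ref TEXT, comp TEXT)")
con.executemany("INSERT INTO pairs VALUES (?, ?)",
                [('A', 'B'), ('B', 'A'), ('C', 'D')])
rows = con.execute("""
SELECT DENSE_RANK() OVER (ORDER BY min(ref, comp), max(ref, comp)) AS grp,
       ref, comp
FROM pairs
ORDER BY grp, ref
""").fetchall()
print(rows)  # ('A','B') and ('B','A') share group 1; ('C','D') is group 2
```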

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 from Germany, 1 from France, 1 from the UK). Is there a simple way to do that?
Normally, a simple GROUP BY would suffice for this type of solution, however as you have specified that you want to include ALL of the columns in the result, then we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window, please update that (or the question) with any other field you would like, the PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
SELECT *
, ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
) AS t  -- PostgreSQL requires an alias on the derived table
WHERE _rn = 1
The PARTITION BY restarts the ROW_NUMBER at 1 for each distinct Country value, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The WHERE clause could have been in the outer query if you really want to, but ROW_NUMBER() can only be specified in the SELECT or ORDER BY clauses of the query, so to use it as a filter criteria we are forced to wrap the results in some way.
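The pattern can be exercised against an in-memory SQLite database (window functions need SQLite 3.25+); the customer names below are invented sample data:

```python
import sqlite3

# ROW_NUMBER restarts at 1 per country, so filtering on rn = 1 keeps
# exactly one (alphabetically first) customer per country.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, country TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
    [('Anna', 'Germany'), ('Karl', 'Germany'),
     ('Marie', 'France'), ('Tom', 'UK')])
rows = con.execute("""
SELECT name, country FROM (
  SELECT name, country,
         ROW_NUMBER() OVER (PARTITION BY country ORDER BY name) AS rn
  FROM customers
  WHERE country IN ('Germany', 'France', 'UK')
) AS t
WHERE rn = 1
ORDER BY country
""").fetchall()
print(rows)  # one row per country
```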

Understanding a simple DISTINCT ON in postgresql

I am having a small difficulty understanding the below simple DISTINCT ON query:
SELECT DISTINCT
ON (bcolor) bcolor,
fcolor
FROM
t1
ORDER BY
bcolor,
fcolor;
I have this table here:
What is the order of execution of the above query, and why am I getting the following result:
As I understand it, since ORDER BY is used it will display both columns in alphabetical order, and since ON is used it will return the first row of each set of duplicates, but I am still confused about how the resulting table is produced.
Can somebody take me through how exactly this query is executed?
This is an odd one since you would think that the SELECT would happen first, then the ORDER BY like any normal RDBMS, but the DISTINCT ON is special. It needs to know the order of the records in order to properly determine which records should be dropped.
So, in this case, it orders first by the bcolor, then by the fcolor. Then it determines distinct bcolors, and drops any but the first record for each distinct group.
In short, it does ORDER BY then applies the DISTINCT ON to drop the appropriate records. I think it would be most helpful to think of 'DISTINCT ON' as being special functionality that differs greatly from DISTINCT.
Added after initial post:
This could be done using window functions and a subquery as well:
SELECT
bcolor,
fcolor
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY bcolor ORDER BY fcolor ASC) as rownumber,
bcolor,
fcolor
FROM t1
) t2
WHERE rownumber = 1
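Since SQLite has no DISTINCT ON, the ROW_NUMBER() form above can serve as a runnable check; the colour rows are invented sample data:

```python
import sqlite3

# Mimics Postgres DISTINCT ON (bcolor) ... ORDER BY bcolor, fcolor:
# within each bcolor, the alphabetically first fcolor survives.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (bcolor TEXT, fcolor TEXT)")
con.executemany("INSERT INTO t1 VALUES (?, ?)",
    [('blue', 'green'), ('blue', 'red'),
     ('green', 'blue'), ('green', 'green')])
rows = con.execute("""
SELECT bcolor, fcolor FROM (
  SELECT bcolor, fcolor,
         ROW_NUMBER() OVER (PARTITION BY bcolor ORDER BY fcolor ASC) AS rownumber
  FROM t1
) t2
WHERE rownumber = 1
ORDER BY bcolor
""").fetchall()
print(rows)  # first fcolor (alphabetically) kept per bcolor
```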

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way to calculate the mode across joined tables in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, a candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions (CTEs), indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming a record is quantity one).
WITH s as
(
SELECT itemID, saleDate, flavor, count(*) as tally
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, tally,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets of their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column holds the RANK() of the tally value on each summary row from the first section "s", within the window of itemID and saleDate. Tally is ordered descending because we want the largest value first, which gets a RANK() of 1. Then in the main SELECT we simply pick those summary records which were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID and saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but that would arbitrarily pick one of the tied flavors as the winner, and this may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
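As a small runnable check of the CTE-plus-RANK() approach, the same WITH/RANK() structure runs unchanged against an in-memory SQLite database; the sales rows below are invented sample data:

```python
import sqlite3

# CTE "s" counts rows per (item, date, flavor); CTE "r" ranks the counts
# within each (item, date); the final SELECT keeps the top-ranked flavor.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (itemID INTEGER, saleDate TEXT, flavor TEXT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
    [(1, '2024-01-01', 'cherry'), (1, '2024-01-01', 'cherry'),
     (1, '2024-01-01', 'lime'),   (2, '2024-01-01', 'grape')])
rows = con.execute("""
WITH s AS (
  SELECT itemID, saleDate, flavor, COUNT(*) AS tally
  FROM sales GROUP BY itemID, saleDate, flavor
), r AS (
  SELECT itemID, saleDate, flavor, tally,
         RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally DESC) AS pri
  FROM s
)
SELECT itemID, flavor, tally FROM r WHERE pri = 1 ORDER BY itemID
""").fetchall()
print(rows)  # the most frequent flavor per item and date
```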