T-SQL "partition by" results not as expected - tsql

What I'm trying to do is get a total count of "EmailAddresses" using partitioning logic. As you can see in the result set spreadsheet, the first record is correct - this particular email address exists 109 times. But for the second record, the same email address, the numberOfEmailAddresses column shows 108. And so on - it just keeps decreasing by 1 for the same email address. Clearly, I'm not writing this SQL right, and I was hoping to get some feedback as to what I might be doing wrong.
What I would like to see is the number 109 consistently down the column numberOfEmailAddresses for this particular email address. What might I be doing wrong?
Here's my code:
select
    Q1.SubscriberKey,
    Q1.EmailAddress,
    Q1.numberOfEmailAddresses
from
    (select
        sub.SubscriberKey as SubscriberKey,
        sub.EmailAddress as EmailAddress,
        count(*) over (partition by sub.EmailAddress order by sub.SubscriberKey asc) as numberOfEmailAddresses
    from
        ent._Subscribers sub) Q1
And here's my result set, ordered by "numberOfEmailAddresses":

select distinct
    Q1.SubscriberKey,
    Q1.EmailAddress,
    (select count(*) from ent._Subscribers sub where sub.EmailAddress = Q1.EmailAddress) as numberOfEmailAddresses
from ent._Subscribers Q1
will get you what you want. The inclusion of the ORDER BY in your OVER clause is what causes the varying count: once an ORDER BY is present, COUNT(*) uses the default window frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so it becomes a running count within the partition rather than the total for the partition.
select
    Q1.SubscriberKey,
    Q1.EmailAddress,
    Q1.numberOfEmailAddresses
from
    (select
        sub.SubscriberKey as SubscriberKey,
        sub.EmailAddress as EmailAddress,
        count(*) over (partition by sub.EmailAddress) as numberOfEmailAddresses
    from
        ent._Subscribers sub) Q1
May also work but I can't find a suitable dataset to test.
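If it helps, here is a minimal sketch with a throwaway temp table (names and values made up purely for illustration) that should show the difference between the two forms:
-- Throwaway test data (made up for illustration)
create table #SubscribersTest (SubscriberKey int, EmailAddress varchar(100));
insert into #SubscribersTest values
    (1, 'a@example.com'),
    (2, 'a@example.com'),
    (3, 'b@example.com');

-- Without ORDER BY, the count is the full partition total on every row
select
    SubscriberKey,
    EmailAddress,
    count(*) over (partition by EmailAddress) as numberOfEmailAddresses
from #SubscribersTest;
-- Expected: 2, 2, 1

-- With ORDER BY, the same expression becomes a running count
select
    SubscriberKey,
    EmailAddress,
    count(*) over (partition by EmailAddress order by SubscriberKey) as runningCount
from #SubscribersTest;
-- Expected: 1, 2, 1

drop table #SubscribersTest;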

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 German, 1 French, 1 UK). Is there a simple way to do that?
Normally a simple GROUP BY would suffice for this type of problem; however, since you have specified that you want to include ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule, it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window; please update that (or the question) with any other field you would like. The PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
    SELECT *
    , ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
    FROM Customers
    WHERE Country IN ('Germany', 'France', 'UK')
) AS t
WHERE _rn = 1
The PARTITION BY makes ROW_NUMBER() restart at 1 for each distinct Country value, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The Country filter could have been in the outer query if you really want, but ROW_NUMBER() can only appear in the SELECT or ORDER BY clauses of a query, so to use it as a filter criterion we are forced to wrap the results in some way.
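If you prefer, the same wrapping can be written as a CTE; this is just a sketch of the identical logic against the same assumed Customers table:
WITH ranked AS
(
    SELECT *
    , ROW_NUMBER() OVER(PARTITION BY Country ORDER BY Name) AS _rn
    FROM Customers
    WHERE Country IN ('Germany', 'France', 'UK')
)
SELECT *
FROM ranked
WHERE _rn = 1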

How to write proper/efficient query

I have a question about the right way to write this query.
I have an employees table; let's say there are 4 columns: employee_id, department, salary, email.
Some records have no email address. I'd like to find the most efficient way to write a SQL query, using a window function, that returns the sum of salary per department divided by the total salary of all of those without an email address.
I have 2 solutions; of course only one is efficient. Can anyone give any advice about it?
select department, sum(salary) as total
from employees
where email is null
group by 1
option 1
select a.department , a.total/(select sum(salary) from employees where email is null)
from (
select department, sum(salary) as total
from employees
where email is null
group by 1
) a
option 2
select a.department , a.total/sum(a.total) over()
from (
select department, sum(salary) as total
from employees
where email is null
group by 1
) a
I guess that query 2 is more efficient, but is it the right way? And is it valid to leave the OVER clause empty?
Just started using PostgreSQL instead of MySQL 5.6.
Your second query is better.
The first query has to scan employees twice, while the second only scans the (hopefully smaller) result set of the subquery to calculate the sum.
It is perfectly valid to leave the OVER clause empty, that just means that all result rows will get the same value (which is what you want).
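As a small sketch with made-up numbers showing what the empty OVER() does:
-- Throwaway rows (values made up for illustration)
select department,
       total,
       sum(total) over () as grand_total
from (values ('sales', 100), ('hr', 300), ('it', 600)) as a (department, total);
-- grand_total is 1000 on every row, so dividing each department's total by it
-- (cast to numeric if you need to avoid integer division) gives its share.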

Finding duplicate records using cascading criteria, then combining into one record

I am using MS SQL Server 2012, and have done simple querying and data loading, but not looping or case statements, or nested selects. I am looking for some assistance to get me started on the approach.
We are in a project where we are combining the customer listings from multiple legacy systems. I have a raw customer table in a staging database that contains records from those multiple sources. We need to do the following before writing the final table to a data mart. I would think this is a quite common scenario in the data cleansing/golden record world, but after much searching I have not been able to locate a similar post.
First, we need to find records that represent the same customer. These records are coming from multiple sources so there could be more than 2 records that represent the same customer. Each source uses a similar model. The criteria that determines whether the record(s) represent the same customer changes in a cascading hierarchy depending on the values available. The first criteria we want to use for a record is the DOB and SSN. But if the SSN is missing, then the criteria for that row becomes the Last Name, First Name and DOB. If both the SSN and the DOB are missing, then the duplicate test changes to last name + first name + another criteria field. There are other criteria even after this if one of the names is missing. And since records that represent the same customer may have different fields available, we would have to use the test that both records can use. There may not be duplicate records if it turns out that a given customer only exists in one system.
Once duplicate records have been identified, we then wish to combine the records that represent a customer, so that we end up with 1 record per customer written to a new table, using the same fields. Combining is done by comparing values of like fields. If the SSN is missing from one source but is available in another, then that SSN is used. If there are more than 2 records that represent a customer, more than 1 has an SSN, and those SSN values are different, there is a hierarchy based on which system the record came from, and we want to write the SSN value from the system highest in the hierarchy. This kind of logic would be applied to each field we need to examine.
I think the piece that is hardest for me to conceptualize is how do you store values of one record so that you can compare against one or more other records in the same table, do the actual compare logic, then write the "winning" value to a new table? If I can get some help with that, it would be greatly appreciated.
The basic requirements that you have outlined are fulfilled by this query:
SELECT a.ID,
    -- DENSE_RANK needs an ORDER BY; ordering by the match columns gives every
    -- row sharing the same values the same group number
    DENSE_RANK() OVER( ORDER BY DOB, SSN ) AS Match1,
    DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], DOB ) AS Match2,
    DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], [Another criteria] ) AS Match3
INTO #Matchmaking
FROM tCustStaging AS a
What you will likely find, though, is that you will need to "prepare" (cleanse) your data first, that is, ensure that it is all in the same format and remove "rubbish". A common problem is phone numbers, where various formats can be used, e.g. '02 1234 1234', '0212341234', '+212341234' etc. Names may also have variations in spelling, especially for compound names.
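As a rough sketch of that kind of preparation (the Phone column and the exact characters stripped are assumptions, not taken from your schema):
-- Strip common formatting characters from phone numbers before matching
UPDATE tCustStaging
SET Phone = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Phone, ' ', ''), '-', ''), '(', ''), ')', ''), '+', '');

-- Trim and upper-case names so case and stray spaces don't break matches
UPDATE tCustStaging
SET [Last Name]  = UPPER(LTRIM(RTRIM([Last Name]))),
    [First Name] = UPPER(LTRIM(RTRIM([First Name])));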
Another way to do matching is to calculate matches on each field individually:
SELECT a.ID,
    DENSE_RANK() OVER( ORDER BY SSN ) AS SSNMatch,
    DENSE_RANK() OVER( ORDER BY DOB ) AS DOBMatch,
    DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 10 ) ) AS LNMatch10,
    DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 9 ) ) AS LNMatch9,
    DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 8 ) ) AS LNMatch8,
    -- etc.
    DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 3 ) ) AS LNMatch3,
    DENSE_RANK() OVER( ORDER BY LEFT( [First Name], 10 ) ) AS FNMatch10,
    -- etc.
    DENSE_RANK() OVER( ORDER BY [Other criteria1] ) AS OC1,
    DENSE_RANK() OVER( ORDER BY [Other criteria2] ) AS OC2
INTO #Matchmaking
FROM tCustStaging AS a
You then create the strongest match (SSN, DOB). You can also experiment with various combinations of fields to see what you get.
-- You can play around with various combinations to see what results you get
SELECT DISTINCT c.*
FROM #Matchmaking AS a
INNER JOIN #Matchmaking AS b ON a.SSNMatch = b.SSNMatch AND a.DOBMatch = b.DOBMatch AND a.LNMatch10 = b.LNMatch10
    AND a.ID <> b.ID -- don't let a record match itself
INNER JOIN tCustStaging AS c ON a.ID = c.ID
After each iteration of matching you save the results.
You then keep relaxing the matching criteria, while carefully checking for false matches, until the criteria are so weak that you no longer get useful results.
You will eventually end up with a set of results based on different strengths of matching criteria.
In the end, the number of "questionable matches" (where you are not sure whether two customers are the same or not) will depend on the initial quality of the data and its quality after "preparation". You will likely still have to analyse some data manually.
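The queries above cover the matching step; for the final "combine into one record" step, a rough sketch of one approach follows. It assumes a SourcePriority column on tCustStaging (lower = more trusted system) and uses Match1 as the customer group; both are assumptions you would adapt to your actual hierarchy:
-- For each customer group, take each field from the most trusted source that has it
SELECT m.Match1 AS CustomerGroup,
       ( SELECT TOP (1) c.SSN
         FROM #Matchmaking AS mm
         JOIN tCustStaging AS c ON c.ID = mm.ID
         WHERE mm.Match1 = m.Match1 AND c.SSN IS NOT NULL
         ORDER BY c.SourcePriority ) AS SSN,
       ( SELECT TOP (1) c.DOB
         FROM #Matchmaking AS mm
         JOIN tCustStaging AS c ON c.ID = mm.ID
         WHERE mm.Match1 = m.Match1 AND c.DOB IS NOT NULL
         ORDER BY c.SourcePriority ) AS DOB
INTO #GoldenRecords
FROM #Matchmaking AS m
GROUP BY m.Match1;
The same pattern repeats for each field you need to carry across; the winning value is simply the first non-null one in source-priority order within the group.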

Group output of single column postgresql

First off I'm a total SQL noob - Thanks in advance for any assistance you can offer.
I have a FortiAnalyzer that uses a Postgres DB to store firewall logs. The Analyzer is then used to report on usage etc.
Basically I need to write a custom query that can show the Top 10 Users by bandwidth used for the top 10 Websites/destinations per user.
I can get all of the relevant information out of the unit, but I cannot get the output formatted correctly.
I would be happy with the output showing a username 10 times with the top 10 sites next to the username. First prize, however, would be to show the username in column A only once, then in columns B and C the destination address and bandwidth used respectively.
Here is the query I have so far:
select coalesce(nullifna(`user`), `src`) as user_src,
coalesce(hostname, dstname, 'unknown') as web_site,
sum(rcvd + sent)/1024 as bandwidth from $log
where $filter and user is not null and status in ('passthrough', 'filtered')
group by `user_src` , web_site order by user_src desc
Once the query is linked to a report chart, I then have options to limit output by x value. I could, for example, limit the user_src column to 100 (i.e. 10 users with 10 outputs each).
I hope this is clear to you... If not, I will do my best to answer any questions.
I would start with a table aggregated at the (web_site, user_src) level. Then it is not difficult to get the top X websites for the top Y users. You will need to use window functions to get the desired result.
Sample data:
create table test (web_site varchar, user_src varchar, bandwidth numeric);
insert into test values
('a','s1',18),
('b','s1',12),
('c','s1',13),
('d','s2',14),
('e','s2',15),
('f','s2',16),
('g','s3',17),
('h','s3',18),
('i','s3',19)
;
Get top X websites for top Y users:
with cte as (
select
user_src,
web_site,
bandwidth,
dense_rank() over(order by site_bandwidth desc) as user_rank,
dense_rank() over(partition by user_src order by bandwidth desc) as website_rank
from
test
join (select user_src, sum(bandwidth) site_bandwidth from test group by user_src) a using (user_src)
)
select
*
from
cte
where
user_rank <= 2
and website_rank <=2
order by
user_rank,
website_rank

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way of calculating the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, a candidate key for the joined table) on a given date.
The idea is to get the most common value from a table which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions [CTEs], indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells most frequently on each date (assuming each record represents a quantity of one).
WITH s as
(
SELECT itemID, saleDate, flavor, count(*) as tally
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, tally,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets of their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column will hold the RANK() of the tally value for each summary row from the first section "s", within the window of itemID and saleDate. Tally is ordered descending because we want the largest value first, which will get a RANK() of 1. Then in the main SELECT we simply pick those summary records which were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID and saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but that would arbitrarily pick one of the tied flavors as the winner, and that may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
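A sketch of that variation, assuming the quantity column is called salesqty; everything else stays the same:
WITH s as
(
SELECT itemID, saleDate, flavor, SUM(salesqty) as tally
FROM sales
GROUP BY itemID, saleDate, flavor
), r as
(
SELECT itemID, saleDate, flavor, tally,
RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally desc) as pri
FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1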