Understanding this Window Function Query in PostgreSQL

I had a spreadsheet that looked like a prior "group by" had left many rows blank where I needed them to be filled with the data above it (see example picture below). I needed each account number to fill all the cells beneath it until the start of the next account number (i.e., A1234 needs to be in all the cells up to B4325, B4325 needs to be in all the cells up to C3452 and so on).
I found this code in a Stack Exchange answer by Benjamin Berhault and tailored it to my problem:
SELECT rn, acct, FIRST_VALUE(acct) OVER (PARTITION BY grp)
FROM (
    SELECT rn, acct,
           SUM(CASE WHEN acct <> '' THEN 1 END) OVER (ORDER BY rn) AS grp
    FROM (
        SELECT ROW_NUMBER() OVER () AS rn, acct
        FROM dataset AS d
    ) AS sub1
) AS sub2;
What I don't understand about this query is the ORDER BY clause in this part
SUM(CASE WHEN acct <> '' THEN 1 END) OVER (ORDER BY rn) AS grp
This line successfully creates a new grp column that is all 1's for the first account, all 2's for the second account, and so on. From there, the main query can use FIRST_VALUE(acct) OVER (PARTITION BY grp) to get the result I am looking for. What I do not understand is why ORDER BY rn causes the column to sum in that manner. I would have thought a PARTITION BY would be needed there, but that does not work.
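A minimal sketch of what the ORDER BY is doing, run through Python's sqlite3 module as a stand-in for PostgreSQL (the sample rows are made up, and SQLite 3.25 or newer is assumed for window function support). When a window has an ORDER BY but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so SUM becomes a running total that only increases on rows where acct is not blank. Without the ORDER BY, the frame is the whole partition and every row gets the same grand total.

```python
import sqlite3

# Hypothetical sample data: blank strings mark the rows that need filling.
rows = [(1, "A1234"), (2, ""), (3, ""), (4, "B4325"), (5, ""), (6, "C3452")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (rn INTEGER, acct TEXT)")
conn.executemany("INSERT INTO dataset VALUES (?, ?)", rows)

# ORDER BY rn with no explicit frame: SUM runs from the first row up to the current row.
running = conn.execute(
    """
    SELECT rn, acct,
           SUM(CASE WHEN acct <> '' THEN 1 END) OVER (ORDER BY rn) AS grp
    FROM dataset
    ORDER BY rn
    """
).fetchall()

# No ORDER BY: the frame is the whole table, so every row sees the grand total.
total = conn.execute(
    """
    SELECT rn, SUM(CASE WHEN acct <> '' THEN 1 END) OVER () AS grp
    FROM dataset
    ORDER BY rn
    """
).fetchall()

# The running total is then usable as a group key for the fill-down step.
filled = conn.execute(
    """
    SELECT rn, FIRST_VALUE(acct) OVER (PARTITION BY grp ORDER BY rn) AS acct_filled
    FROM (
        SELECT rn, acct,
               SUM(CASE WHEN acct <> '' THEN 1 END) OVER (ORDER BY rn) AS grp
        FROM dataset
    ) AS sub
    ORDER BY rn
    """
).fetchall()

for row in running:
    print(row)
```

A PARTITION BY acct would not help here, because the blank rows all share the same (empty) acct value and would land in one partition regardless of which account they sit under.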

Related

Rank() window function on Redshift over multiple columns, not 1 or 2

I want to use the rank() window function on a Redshift database to rank over specific, multiple columns. The code should check those columns for each row and assign the same rank to rows that have identical values in ALL of those columns.
Example image found in link below:
https://ibb.co/GJv1xQL
There are 18 distinct rows, however the rank shows 3 distinct rows, according to the ranking I wish to apply.
I tried:
select tbl.*
, dense_rank() over (partition by secondary_id order by created_on, type1, type2, money, amount nulls last ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But the ranks assigned were wrong, and then I tried:
select tbl.*
, dense_rank() over (partition by secondary_id,created_on, type1, type2, money, amount ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But this assigned rank=1 everywhere, in every row.
I found how to solve this.
The reason that ordering by all the columns of interest failed is that the timestamp column contained values that differed in the milliseconds, which was not obvious when viewing the data. So I only took the timestamp into account up to the seconds, and it worked. I converted the created_on column to date_trunc('s', created_on):
select tbl.*
, dense_rank() over (partition by secondary_id order by date_trunc('s',created_on), type1, type2, money, amount nulls last ) as rank
from table tbl
where secondary_id='92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
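To illustrate the effect, here is a small sketch using Python's sqlite3 module instead of Redshift. The table name events, its columns, and the sample rows are hypothetical, and since SQLite has no date_trunc, substr(created_on, 1, 19) is used as a stand-in that cuts a text timestamp off after the seconds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (secondary_id TEXT, created_on TEXT, type1 TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("x", "2021-03-01 10:00:00.120", "a", 5),
        ("x", "2021-03-01 10:00:00.870", "a", 5),  # same second as the row above
        ("x", "2021-03-01 10:00:05.300", "a", 5),
        ("x", "2021-03-01 10:00:05.900", "b", 7),  # same second, different type1/amount
    ],
)

# Ordering by the raw timestamp: the milliseconds make every row unique.
raw = [r[0] for r in conn.execute(
    """
    SELECT dense_rank() OVER (PARTITION BY secondary_id
                              ORDER BY created_on, type1, amount) AS rnk
    FROM events
    ORDER BY created_on
    """
)]

# Ordering by the timestamp truncated to seconds: rows that agree on all
# remaining columns now tie and share a rank.
truncated = [r[0] for r in conn.execute(
    """
    SELECT dense_rank() OVER (PARTITION BY secondary_id
                              ORDER BY substr(created_on, 1, 19), type1, amount) AS rnk
    FROM events
    ORDER BY created_on
    """
)]

print(raw)
print(truncated)
```

This also shows why the second attempt gave rank 1 everywhere: with every column in PARTITION BY and no ORDER BY, all rows within a partition are peers, so each gets rank 1.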

Mixing CROSS JOIN with LEFT JOIN on Redshift

I have two tables: accounts and opportunities. One account has 0-n opportunities, but only 0 or 1 opportunities at any point of time (within the contract_from/contract_to range).
For each of the past 4 months, I want to report which account had which opportunity in that month.
I came up with this query:
WITH numbers AS (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4)
SELECT * FROM
(
(SELECT id, name FROM accounts WHERE is_active) AS acc(acct_id, name)
CROSS JOIN
(SELECT dateadd(MONTH, -n,
date_trunc('month', current_date))::date AS start,
dateadd(DAY, -1, dateadd(MONTH, -n + 1,
date_trunc('month', current_date)))::date AS stop
FROM numbers) AS period(start, stop)
)
LEFT OUTER JOIN
(SELECT acct_id, subscription_type, contract_from, contract_to
FROM opportunities) AS opp(acct_id, subscription, start, stop)
ON (acc.acct_id = opp.acct_id AND
opp.start <= period.start AND
(opp.stop ISNULL OR
opp.stop > period.stop))
My problem is that some of the accounts have only two resulting rows, even though I did a left join. I expected every account to always have four rows, with the months without an active opportunity showing null values in the columns subscription, start and stop.
Is mixing these joins not supported in Redshift?
After some more iterations on my query I found out that the left join does indeed work, but the order gets mixed up: the rows with the nulls end up further down. This is probably because Redshift first does the left join and then "fills" in the rows which don't have a corresponding right match. In any case, row order is not guaranteed without an explicit ORDER BY.
Also: OUTER JOIN is the wrong choice here, because if there is more than one opportunity at a given date, the additional opportunities cause extra result rows.
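A small sketch of the same join shape, run with Python's sqlite3 module rather than Redshift. The periods table with fixed month boundaries and all sample rows are assumptions made for illustration (the original builds the periods from current_date). With an explicit ORDER BY, the four rows per account come back together, nulls included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER, name TEXT);
    CREATE TABLE periods (period_start TEXT, period_stop TEXT);
    CREATE TABLE opportunities (
        acct_id INTEGER, subscription_type TEXT, contract_from TEXT, contract_to TEXT
    );
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Beta');
    INSERT INTO periods VALUES
        ('2023-01-01', '2023-01-31'), ('2023-02-01', '2023-02-28'),
        ('2023-03-01', '2023-03-31'), ('2023-04-01', '2023-04-30');
    -- Account 1 has one contract covering January and February; account 2 has none.
    INSERT INTO opportunities VALUES (1, 'gold', '2022-12-15', '2023-03-15');
    """
)

result = conn.execute(
    """
    SELECT a.id, p.period_start, o.subscription_type
    FROM accounts AS a
    CROSS JOIN periods AS p          -- every account paired with every month
    LEFT JOIN opportunities AS o     -- keep the pair even when nothing matches
           ON o.acct_id = a.id
          AND o.contract_from <= p.period_start
          AND (o.contract_to IS NULL OR o.contract_to > p.period_stop)
    ORDER BY a.id, p.period_start    -- without this, row order is unspecified
    """
).fetchall()

for row in result:
    print(row)
```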

T-SQL import data and remove duplicate records from prior month

Every month we receive a roster which we run queries on and then generate data that gets uploaded into a table for an outside source to retrieve. My question is: what would be the easiest way to remove the duplicate data from the prior month's upload, bearing in mind that not all data is duplicated and that if a person does not appear on the new roster, their prior month's data needs to remain? The data is timestamped when it gets uploaded.
Thank you
You can use a CTE and Row_Number() to identify and remove dupes:
;with cte as (
Select *
,RN = Row_Number() over (Partition By SomeKeyField(s) Order By SomeDate Desc)
From YourTable
)
Select * -- << Remove if Satisfied
-- Delete -- << Remove Comment if Satisfied
From cte
Where RN>1
Without seeing your data structure, take a hard look at the Partition By and Order By within the OVER clause of Row_Number().
A short and effective way is to delete via a derived table.
delete from f
from (
select *, row_number() over (partition by col order by (select 0)) rn
from tbl) f
where rn > 1
But the most effective way is to remove duplicates on input and prevent them (for example, with a unique constraint).
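As a rough illustration of the Row_Number() approach, the sketch below uses Python's sqlite3 module instead of SQL Server. The roster table, its columns, and the rows are hypothetical, and because SQLite cannot delete through a CTE or derived table the way T-SQL can, the delete goes through a subquery on the primary key. The newest upload per person is kept, so someone who is missing from the new roster keeps their prior-month row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE roster (id INTEGER PRIMARY KEY, person_id TEXT, uploaded_at TEXT)"
)
conn.executemany(
    "INSERT INTO roster VALUES (?, ?, ?)",
    [
        (1, "p1", "2023-01-05"),  # prior month, superseded by row 2
        (2, "p1", "2023-02-05"),
        (3, "p2", "2023-01-05"),  # not on the new roster, must stay
        (4, "p3", "2023-02-05"),
    ],
)

# Number each person's rows from newest to oldest, then delete everything but the newest.
conn.execute(
    """
    DELETE FROM roster
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY person_id
                                      ORDER BY uploaded_at DESC) AS rn
            FROM roster
        ) AS numbered
        WHERE rn > 1
    )
    """
)

remaining = conn.execute("SELECT id, person_id, uploaded_at FROM roster ORDER BY id").fetchall()
print(remaining)
```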

How to get the average row number per rank in PostgreSQL?

Is there any option to get the average of the same values using the RANK() function in PostgreSQL? Here is the example of what I want to do:
This query will do the trick for you
SELECT
test_score,
row_number() OVER (ORDER BY test_score) AS rank,
rank() OVER (ORDER BY test_score)
+ (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS "rank (with tied)"
FROM scores
Remarks:
What you believe is the "rank" is really the row_number() (i.e. a consecutive series of positive integers with no gaps and no duplicates).
That rank "with tied" that you're looking for can be calculated from the real rank() (rank with gaps) + the number of other elements of the same rank divided by two. This is a faster shortcut to calculate the average row_number() given your specific requirements.
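To check the shortcut against a direct average of the row numbers, here is a small sketch using Python's sqlite3 module in place of PostgreSQL, with made-up scores. For a group of k tied rows starting at rank r, the row numbers are r, r+1, ..., r+k-1, and their mean is r + (k - 1) / 2, which is exactly what the expression computes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (test_score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?)", [(10,), (20,), (20,), (30,), (30,), (30,)]
)

# Shortcut: rank with gaps plus half the number of *other* rows in the tie.
shortcut = conn.execute(
    """
    SELECT test_score,
           rank() OVER (ORDER BY test_score)
           + (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS tied_rank
    FROM scores
    ORDER BY test_score
    """
).fetchall()

# Direct version: number the rows first, then average the numbers per score.
averaged = conn.execute(
    """
    SELECT test_score, avg(rn) OVER (PARTITION BY test_score) AS avg_rn
    FROM (
        SELECT test_score, row_number() OVER (ORDER BY test_score) AS rn
        FROM scores
    ) AS sub
    ORDER BY test_score
    """
).fetchall()

print(shortcut)
print(averaged)
```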
I'm pretty sure you want row_number(), not rank(). Rank will not give repeated values in the way you presented. To get the answer you're looking for:
with rwn as (
select
test_score
,row_number() over (order by test_score) rwn
from
score
)
select
test_score
,avg(rwn) average_rank
from
rwn
group by
test_score;
@Lukas and @jeremy already explained the difference between rank() and row_number() you seemed to be missing.
You can also compute the row number (rn), and the average over rn (avg_rn) per rank (= per group of same values) in the next step:
SELECT test_score, rn, avg(rn) OVER (PARTITION BY test_score) AS avg_rn
FROM (SELECT test_score, row_number() OVER (ORDER BY test_score) AS rn FROM tbl) sub;
You need a subquery because window functions cannot be nested on the same query level.
You need another window function (not an aggregate function, as has been suggested) to preserve all original rows.
The result is ordered by rn by default (for this simple query), but this is just an implementation detail. To guarantee an ordered result, add an explicit ORDER BY (for practically no cost):
...
ORDER BY rn;

How to use Row_Number() Function in SQL Server 2012?

I am trying to generate a result set similar to the following table, but I could not achieve the goal. I want to number each row of the table as shown in the 'I want' column of the following table.
The following SQL generated the 'RowNbr' column. Any suggestion would be appreciated. Thank you
SELECT Date, Nbr, status, ROW_NUMBER () over (partition by Date,status order by date asc) as RowNbr
This is a classic "gaps and islands" problem, in case you are searching for similar solutions in the future. Basically you want the counter to reset every time you hit a new status for a given Nbr, ordered by date.
This general overall technique was developed, I believe, by Itzik Ben-Gan, and he has tons of articles and book chapters about it.
;WITH cte AS
(
SELECT [Date], Nbr, [Status],
rn = ROW_NUMBER() OVER (PARTITION BY Nbr ORDER BY [Date])
- ROW_NUMBER() OVER (PARTITION BY Nbr,[Status] ORDER BY [Date])
FROM dbo.your_table_name
)
SELECT [Date], Nbr, [Status],
[I want] = ROW_NUMBER() OVER (PARTITION BY Nbr,[Status],rn ORDER BY [Date])
FROM cte
ORDER BY Nbr, [Date];
On SQL Server 2012, you may be able to achieve something similar using LAG and LEAD; I made a few honest attempts but couldn't get anywhere that would end up being anything less complex than the above.
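For anyone who wants to watch the row-number difference at work, here is a sketch run with Python's sqlite3 module instead of SQL Server. The readings table and its rows are invented for the example. Note one assumption that goes beyond the partition list in the final SELECT of the query as posted: this sketch also partitions the final ROW_NUMBER() by status, because the difference alone can coincide for neighbouring islands of different statuses (here the B island and the second A island both get a difference of 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (dt TEXT, nbr INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2023-01-01", 1, "A"),
        ("2023-01-02", 1, "A"),
        ("2023-01-03", 1, "B"),
        ("2023-01-04", 1, "B"),
        ("2023-01-05", 1, "A"),
        ("2023-01-06", 1, "A"),
    ],
)

result = conn.execute(
    """
    WITH cte AS (
        SELECT dt, nbr, status,
               -- Overall position minus position within the same status:
               -- constant inside one unbroken run, so it labels the island.
               ROW_NUMBER() OVER (PARTITION BY nbr ORDER BY dt)
             - ROW_NUMBER() OVER (PARTITION BY nbr, status ORDER BY dt) AS grp
        FROM readings
    )
    SELECT dt, status, grp,
           ROW_NUMBER() OVER (PARTITION BY nbr, status, grp ORDER BY dt) AS counter
    FROM cte
    ORDER BY nbr, dt
    """
).fetchall()

for row in result:
    print(row)
```

The counter restarts at 1 each time the status changes, which is the behaviour described for the 'I want' column.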