How to get the average row number per rank in PostgreSQL?

Is there any way to get the average rank of tied values using the RANK() function in PostgreSQL? Here is an example of what I want to do:

This query will do the trick for you:
SELECT
  test_score,
  row_number() OVER (ORDER BY test_score) AS rank,
  rank() OVER (ORDER BY test_score)
    + (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS "rank (with tied)"
FROM scores;
Remarks:
What you believe is the "rank" is really the row_number() (i.e. a consecutive series of positive integers with no gaps and no duplicates).
The rank "with tied" that you're looking for can be calculated from the real rank() (rank with gaps) plus the number of other rows with the same value, divided by two. This is a faster shortcut to calculate the average row_number() given your specific requirements.
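To see the shortcut in action, here is a minimal runnable sketch using Python's built-in sqlite3 module (SQLite ≥ 3.25 supports the same window functions as PostgreSQL here; the table name and sample scores are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE scores (test_score INT);
INSERT INTO scores VALUES (50), (60), (60), (60), (70);
""")

# rank() + (ties - 1) / 2.0 reproduces the average row_number() of each tie group.
rows = con.execute("""
  SELECT test_score,
         row_number() OVER (ORDER BY test_score) AS rn,
         rank() OVER (ORDER BY test_score)
           + (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS tied_rank
  FROM scores
  ORDER BY rn
""").fetchall()
print(rows)
# The three 60s occupy row numbers 2, 3, 4, whose average is 3.0.
```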

I'm pretty sure you want row_number(), not rank(). Rank will not give repeated values in the way you presented. To get the answer you're looking for:
with rwn as (
  select
    test_score,
    row_number() over (order by test_score) as rwn
  from score
)
select
  test_score,
  avg(rwn) as average_rank
from rwn
group by test_score;

@Lukas and @jeremy already explained the difference between rank() and row_number() that you seemed to be missing.
You can also compute the row number (rn), and the average over rn (avg_rn) per rank (= per group of same values) in the next step:
SELECT test_score, rn, avg(rn) OVER (PARTITION BY test_score) AS avg_rn
FROM (SELECT test_score, row_number() OVER (ORDER BY test_score) AS rn FROM tbl) sub;
You need a subquery because window functions cannot be nested on the same query level.
You need another window function (not an aggregate function, as has been suggested) to preserve all original rows.
The result is ordered by rn by default (for this simple query), but this is just an implementation detail. To guarantee an ordered result, add an explicit ORDER BY (for practically no cost):
...
ORDER BY rn;

Related

Rank() window function on Redshift over multiple columns, not 1 or 2

I want to use the rank() window function on a Redshift database to rank over specific, multiple columns. The code should check those columns for each row and assign the same rank to rows that have identical values in ALL of those columns.
Example image found in link below:
https://ibb.co/GJv1xQL
There are 18 distinct rows; however, the rank should show only 3 distinct values according to the ranking I wish to apply.
I tried:
select tbl.*,
  dense_rank() over (partition by secondary_id
                     order by created_on, type1, type2, money, amount nulls last) as rank
from table tbl
where secondary_id = '92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But the ranks assigned were wrong, and then I tried:
select tbl.*,
  dense_rank() over (partition by secondary_id, created_on, type1, type2, money, amount) as rank
from table tbl
where secondary_id = '92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
But this assigned rank = 1 to every row.
I found how to solve this.
The reason ordering by all the columns of interest was failing is that the timestamp column contained different millisecond values, which was not obvious from viewing the data. So I only took the timestamp into account down to seconds, and it worked: I wrapped the created_on column in date_trunc('s', created_on).
select tbl.*,
  dense_rank() over (partition by secondary_id
                     order by date_trunc('s', created_on), type1, type2, money, amount nulls last) as rank
from table tbl
where secondary_id = '92d30f87-b2da-45c0-bdf7-c5ca96fe5ea6'
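The effect of hidden sub-second differences can be reproduced with a small sketch in Python's sqlite3 (SQLite's strftime stands in for Redshift's date_trunc('s', ...); the table and sample timestamps are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl (created_on TEXT, type1 TEXT, amount INT);
INSERT INTO tbl VALUES
  ('2023-05-01 10:00:00.123', 'A', 5),
  ('2023-05-01 10:00:00.456', 'A', 5),
  ('2023-05-01 10:00:01.000', 'B', 7);
""")

# Ranking on the raw timestamp treats the first two rows (same second,
# different milliseconds) as distinct.
raw = con.execute("""
  SELECT dense_rank() OVER (ORDER BY created_on, type1, amount)
  FROM tbl ORDER BY created_on
""").fetchall()
print([r[0] for r in raw])

# Truncating to whole seconds makes them tie, as intended.
trunc = con.execute("""
  SELECT dense_rank() OVER (
    ORDER BY strftime('%Y-%m-%d %H:%M:%S', created_on), type1, amount)
  FROM tbl ORDER BY created_on
""").fetchall()
print([r[0] for r in trunc])
```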

ROW_NUMBER() OVER PARTITION optimization

I have the following query:
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Price ASC) AS RowNum
  FROM Offers
) r
WHERE RowNum = 1
The Offers table contains about 10 million records, but there are only ~4000 distinct codes. So I need to get the row with the lowest price for each code, and there will be only 4000 rows in the result.
I have an index on the (Code, Price) columns with all other columns in the INCLUDE clause.
The query runs for 2 minutes, and the execution plan shows an index scan with 10M actual rows. So I guess it scans the whole index to get the needed values.
Why does SQL Server do a full index scan? Is it because the subquery needs all the data? How can I avoid this scan? Is there a hint to process only the first row of each partition?
Is there another way to optimize such a query?
After trying multiple different solutions, I found the fastest query uses a CROSS APPLY:
SELECT C.*
FROM (SELECT DISTINCT Code FROM Offers) A
CROSS APPLY (SELECT TOP 1 *
             FROM Offers B
             WHERE A.Code = B.Code
             ORDER BY Price) C
It takes ~1 second to run.
Try creating an index on (Code, Price) without including the other columns, and then (assuming that there is a unique Id column):
select L.*
from Offers as L
  inner join (select Id,
                Row_Number() over (partition by Code order by Price) as RN
              from Offers) as R
    on R.Id = L.Id and R.RN = 1
An index scan on a smaller index ought to help.
A second guess would be to get the Id of the row with the lowest Price for each Code explicitly: get the distinct Code values, get the Id of the TOP 1 row (to avoid problems with duplicate prices) ordered by Price for each Code, then join with Offers to get the complete rows. Again, the more compact index should help.
Not sure if you'll get any significant performance gains, but you may want to try the WITH TIES clause
Example
Select Top 1 with Ties *
From Offers
Order By Row_Number() over (Partition By Code Order By Price)
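For reference, the "lowest-priced row per code" pattern from the question can be verified on a small invented dataset with Python's sqlite3 (SQLite has no CROSS APPLY, so this sketch uses the ROW_NUMBER form; column names and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Offers (Code TEXT, Price REAL, Seller TEXT);
INSERT INTO Offers VALUES
  ('A', 10.0, 's1'), ('A', 7.5, 's2'),
  ('B', 3.0, 's3'),  ('B', 4.0, 's4'), ('B', 2.5, 's5');
""")

# One row per Code: the one with the lowest Price (RowNum = 1 in its partition).
rows = con.execute("""
  SELECT Code, Price, Seller
  FROM (SELECT *,
          ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Price) AS RowNum
        FROM Offers)
  WHERE RowNum = 1
  ORDER BY Code
""").fetchall()
print(rows)
```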

Partitioning in window functions with same value in different chunk of rows

In the picture below you can see example data. I would like to get the first occurrence of batch_start for each batch. As you can see (green highlight), batch 1522049 occurs in 2 chunks: one has 2 rows and the second has 1 row.
SELECT FIRST_VALUE(batch_start) OVER (PARTITION BY batch ORDER BY batch_start)
does not solve the problem, since it merges both chunks into one, and the result is '2013-01-29 10:27:23' for both of them.
Any idea how to distinguish these rows and get the batch_start of each chunk of data?
This seems to me a simple gaps-and-islands problem: you just need to calculate a value which is the same for every run of consecutive rows with the same batch value, which will be
row_number() over (order by batch_start) - row_number() over (partition by batch order by batch_start)
From there, the solution depends on what you want to do with these "batch groups". For example, here is a variant which aggregates them to find the first batch_start of each:
select batch, min(batch_start)
from (select *,
        row_number() over (order by batch_start) -
        row_number() over (partition by batch order by batch_start) as batch_number
      from batches) b
group by batch, batch_number
http://rextester.com/XLX80303
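The row_number-difference trick can be checked end to end with Python's sqlite3 (the batches table and timestamps below are invented to mimic the question's scenario of one batch split into two chunks):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE batches (batch INT, batch_start TEXT);
INSERT INTO batches VALUES
  (1522049, '2013-01-29 10:27:23'),
  (1522049, '2013-01-29 10:28:00'),
  (1522111, '2013-01-29 10:29:00'),
  (1522049, '2013-01-29 10:30:00');
""")

# The difference of the two row_number() streams is constant within each
# chunk ("island"), so grouping by (batch, grp) keeps the chunks separate.
rows = con.execute("""
  SELECT batch, min(batch_start)
  FROM (SELECT *,
          row_number() OVER (ORDER BY batch_start) -
          row_number() OVER (PARTITION BY batch ORDER BY batch_start) AS grp
        FROM batches)
  GROUP BY batch, grp
  ORDER BY min(batch_start)
""").fetchall()
print(rows)
# Batch 1522049 appears twice: once per chunk, each with its own first batch_start.
```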
Maybe:
select batch, min(batch_start) firstOccurance, max(batch_start) lastOccurance
from yourTable
group by batch
or try (keeping your part of the query):
SELECT FIRST_VALUE(a.batch_start) OVER (PARTITION BY a.batch ORDER BY a.batch_start)
from yourTable a
join (select batch, min(batch_start) firstOccurance, max(batch_end) lastOccurance
      from yourTable group by batch) b
  on a.batch = b.batch

PostgreSQL RANK() function over aggregated column

I'm constructing a quite complex query where I try to load users with their aggregated points along with their rank. I found the RANK() function, which could help me achieve this, but I can't get it working.
Here's the query that is working without RANK:
SELECT users.*, SUM(received_points.count) AS pts
FROM users
LEFT JOIN received_points ON received_points.user_id = users.id AND ...other joining conditions...
GROUP BY users.id
ORDER BY pts DESC NULLS LAST
Now I would also like to select the rank, but it's not working this way with the RANK() function:
SELECT users.*, SUM(received_points.count) AS pts,
RANK() OVER (ORDER BY pts DESC NULLS LAST) AS position
FROM users
LEFT JOIN received_points ON received_points.user_id = users.id AND ...other joining conditions...
GROUP BY users.id
ORDER BY pts DESC NULLS LAST
It fails with: PG::UndefinedColumn: ERROR: column "pts" does not exist
I guess I've got the whole concept of window functions wrong. How can I select the rank of a user sorted by an aggregated value like pts in the example above?
I know I can assign ranks manually afterwards, but what if I also want to filter the rows by users.name in the query and still get each user's rank in the general (unfiltered) leaderboard? Don't know if I'm clear...
As Marth suggested in his comment:
You can't use pts here as the alias doesn't exist yet (you can't reference an alias in the same SELECT it's defined). RANK() OVER (ORDER BY SUM(received_points.count) DESC NULLS LAST) should work fine.
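Marth's fix (putting the aggregate itself inside the OVER clause) can be sketched with Python's sqlite3, which also evaluates window functions after GROUP BY. The tables and point values are invented; note SQLite's DESC already sorts NULLs last, so the PostgreSQL-specific NULLS LAST is omitted here:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (id INT PRIMARY KEY, name TEXT);
CREATE TABLE received_points (user_id INT, count INT);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
INSERT INTO received_points VALUES (1, 5), (1, 3), (2, 10);
""")

# RANK() ranks over the aggregate SUM(...) directly, since the alias pts
# is not yet visible inside the same SELECT.
rows = con.execute("""
  SELECT users.name, SUM(received_points.count) AS pts,
         RANK() OVER (ORDER BY SUM(received_points.count) DESC) AS position
  FROM users
  LEFT JOIN received_points ON received_points.user_id = users.id
  GROUP BY users.id
  ORDER BY pts DESC
""").fetchall()
print(rows)
# cid has no points at all, so pts is NULL and they rank last.
```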

How to use Row_Number() Function in SQL Server 2012?

I am trying to generate a result set similar to the one in the following table, but could not achieve the goal. I want to number each row of the table as shown in the 'I want' column.
The following SQL generated the 'RowNbr' column. Any suggestion would be appreciated. Thank you.
SELECT Date, Nbr, status, ROW_NUMBER() over (partition by Date, status order by Date asc) as RowNbr
This is a classic "gaps and islands" problem, in case you are searching for similar solutions in the future. Basically you want the counter to reset every time you hit a new status for a given Nbr, ordered by date.
This general overall technique was developed, I believe, by Itzik Ben-Gan, and he has tons of articles and book chapters about it.
;WITH cte AS
(
  SELECT [Date], Nbr, [Status],
         rn = ROW_NUMBER() OVER (PARTITION BY Nbr ORDER BY [Date])
            - ROW_NUMBER() OVER (PARTITION BY Nbr, [Status] ORDER BY [Date])
  FROM dbo.your_table_name
)
SELECT [Date], Nbr, [Status],
       [I want] = ROW_NUMBER() OVER (PARTITION BY Nbr, rn ORDER BY [Date])
FROM cte
ORDER BY Nbr, [Date];
On SQL Server 2012, you may be able to achieve something similar using LAG and LEAD; I made a few honest attempts but couldn't get anywhere that would end up being any less complex than the above.
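The reset-per-run behavior of the CTE above can be demonstrated with Python's sqlite3 (column names shortened and the dates/statuses invented; the logic is the same two-row_number difference):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (d TEXT, nbr INT, status TEXT);
INSERT INTO t VALUES
  ('2020-01-01', 1, 'A'),
  ('2020-01-02', 1, 'A'),
  ('2020-01-03', 1, 'B'),
  ('2020-01-04', 1, 'A');
""")

# rn is constant within each consecutive run of the same status, so
# numbering within (nbr, rn) restarts the counter at every status change.
rows = con.execute("""
  WITH cte AS (
    SELECT d, nbr, status,
           ROW_NUMBER() OVER (PARTITION BY nbr ORDER BY d) -
           ROW_NUMBER() OVER (PARTITION BY nbr, status ORDER BY d) AS rn
    FROM t
  )
  SELECT d, status,
         ROW_NUMBER() OVER (PARTITION BY nbr, rn ORDER BY d) AS i_want
  FROM cte
  ORDER BY d
""").fetchall()
print(rows)
# The second 'A' run (after the 'B' row) starts counting again from 1.
```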