Partitioning in window functions with same value in different chunk of rows - postgresql

In the picture below you can see example data. I would like to get the first occurrence of batch_start for each batch. As you can see (green highlight), batch 1522049 occurs in 2 chunks: one has 2 rows and the second has 1 row.
SELECT FIRST_VALUE(batch_start) OVER (PARTITION BY batch ORDER BY batch_start)
does not solve the problem, since it merges both chunks into one and the result is '2013-01-29 10:27:23' for both of them.
Any idea how to distinguish these rows and get batch_start of each chunk of data?

This looks like a simple gaps-and-islands problem: you just need to calculate a value that is the same for every consecutive row with the same batch value, which is
row_number() over (order by batch_start) - row_number() over (partition by batch order by batch_start)
From here, the solution depends on what you want to do with these "batch groups". For example, here is a variant that aggregates them to find the first batch_start of each group:
select batch, min(batch_start)
from (select *, row_number() over (order by batch_start) -
row_number() over (partition by batch order by batch_start) batch_number
from batches) b
group by batch, batch_number
http://rextester.com/XLX80303
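To see why the difference of the two row numbers separates the chunks, here is a minimal sketch that runs the same query through Python's sqlite3 module. SQLite (3.25 or newer, for window functions) stands in for PostgreSQL here, and all rows except the first timestamp are made-up sample data, not the data from the question.

```python
import sqlite3

# Hypothetical sample rows: batch 1522049 appears in two separate chunks,
# interrupted by batch 1522050.
ROWS = [
    (1522049, "2013-01-29 10:27:23"),
    (1522049, "2013-01-29 10:30:00"),
    (1522050, "2013-01-29 11:00:00"),
    (1522049, "2013-01-29 12:00:00"),
]

# The difference of the two row numbers is constant within one chunk
# and changes whenever another batch interrupts the sequence.
QUERY = """
SELECT batch, MIN(batch_start) AS first_start
FROM (
    SELECT batch, batch_start,
           ROW_NUMBER() OVER (ORDER BY batch_start)
         - ROW_NUMBER() OVER (PARTITION BY batch ORDER BY batch_start)
           AS batch_number
    FROM batches
) AS b
GROUP BY batch, batch_number
ORDER BY first_start
"""


def first_start_per_chunk(rows):
    """Load rows into an in-memory table and return one row per chunk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE batches (batch INTEGER, batch_start TEXT)")
    conn.executemany("INSERT INTO batches VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


chunks = first_start_per_chunk(ROWS)
```

With this sample data, batch 1522049 comes back twice, once per chunk, each with its own earliest batch_start.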

Maybe: select batch, min(batch_start) firstOccurance, max(batch_start) lastOccurance from yourTable group by batch
or try (keeping your part of query):
SELECT FIRST_VALUE(a.batch_start) OVER (PARTITION BY a.batch ORDER BY a.batch_start) from yourTable a
join (select batch, min(batch_start) firstOccurance, max(batch_end) lastOccurance from yourTable group by batch) b on a.batch = b.batch

Related

Window function without ORDER BY

There is a window function without ORDER BY in OVER () clause. Is there a guarantee that the rows will be processed in the order specified by the ORDER BY expression in SELECT itself?
For example:
SELECT tt.*
, row_number() OVER (PARTITION BY tt."group") AS npp --without ORDER BY
FROM
(
SELECT SUBSTRING(random() :: text, 3, 1) AS "group"
, random() :: text AS "data"
FROM generate_series(1, 100) t(ser)
ORDER BY "group", "data"
) tt
ORDER BY tt."group", npp;
In this example the subquery returns the data sorted in ascending order in each group. The window function handles the rows in the same order, and so the line numbers go in ascending order of the data. Can I rely on this?
Good question!
No, you cannot rely on that.
Window functions are processed before the query's ORDER BY clause, and without an ORDER BY in the window definition, the rows will be processed in the order in which they happen to come from the subselect.
If you add an ORDER BY on the column you want the numbering to follow to your OVER clause, e.g.
row_number() OVER (PARTITION BY tt."group" ORDER BY tt."data")
you should get the order you want.
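As a small illustration of putting the ordering inside the window definition instead of relying on the subquery's order, the sketch below uses Python's sqlite3 module. SQLite stands in for PostgreSQL, and the table name items, the column grp, and the sample rows are assumptions made for the example.

```python
import sqlite3

# The window has its own ORDER BY, so the numbering no longer depends
# on the order in which rows happen to arrive.
QUERY = """
SELECT grp, data,
       ROW_NUMBER() OVER (PARTITION BY grp ORDER BY data) AS npp
FROM items
ORDER BY grp, npp
"""


def numbered(rows):
    """Insert rows in the given (arbitrary) order and number them per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (grp TEXT, data TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Rows are deliberately inserted out of order.
numbered_rows = numbered([("b", "9"), ("a", "7"), ("a", "3"), ("b", "1")])
```

Within each group the row numbers follow the data column, regardless of insertion order.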

ROW_NUMBER() OVER PARTITION optimization

I have the following query:
SELECT *
FROM
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY Code ORDER BY Price ASC) as RowNum
from Offers) r
where RowNum = 1
Offers table contains about 10 million records. But there are only ~4000 distinct codes there. So I need to get the row with the lowest price for each code and there will be only 4000 rows in the result.
I have an Index on (Code, Price) columns with all other columns in INCLUDE statement.
The query runs for 2 minutes. And if I look at the execution plan, I see Index scan with 10M actual rows. So, I guess it scans the whole index to get needed values.
Why does MSSQL do the whole index scan? Is it because the subquery needs the whole data? How can I avoid this scan? Is there a SQL hint to process only the first row in each partition?
Is there another way to optimize such query?
After trying multiple different solutions, I've found the fastest query with CROSS APPLY statement:
SELECT C.*
FROM (SELECT DISTINCT Code from Offers) A
CROSS APPLY (SELECT TOP 1 *
FROM Offers B
WHERE A.Code = B.Code
ORDER by Price) C
It takes ~1 second to run.
Try creating an index on ( Code, Price ) without including the other columns and then (assuming that there is a unique Id column):
select L.*
from Offers as L inner join
( select Id,
Row_Number() over ( partition by Code order by Price ) as RN
from Offers ) as R on R.Id = L.Id and R.RN = 1
An index scan on a smaller index ought to help.
Second guess would be to get the Id of the row with the lowest Price for each Code explicitly: Get distinct Code values, get Id of top 1 (to avoid problems with duplicate prices) Min( Price ) row for that Code, join with Offers to get complete rows. Again, the more compact index should help.
Not sure if you'll get any significant performance gains, but you may want to try the WITH TIES clause
Example
Select Top 1 with Ties *
From Offers
Order By Row_Number() over (Partition By Code Order By Price)
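The two main shapes discussed above (number every row and filter, versus look up one row per distinct code) can be compared on a toy table. This is only a sketch in Python's sqlite3 module: SQLite has neither CROSS APPLY nor TOP, so a correlated subquery with LIMIT 1 is used as a stand-in, and the Id column and sample rows are assumptions. It shows that both shapes return the same rows and says nothing about SQL Server timings.

```python
import sqlite3

# Shape 1: number all rows per code, then keep the first.
ROW_NUMBER_QUERY = """
SELECT Code, Price
FROM (
    SELECT Code, Price,
           ROW_NUMBER() OVER (PARTITION BY Code ORDER BY Price ASC) AS RowNum
    FROM Offers
) AS r
WHERE RowNum = 1
ORDER BY Code
"""

# Shape 2: one lookup per distinct code (stand-in for CROSS APPLY ... TOP 1).
PER_CODE_QUERY = """
SELECT o.Code, o.Price
FROM (SELECT DISTINCT Code FROM Offers) AS a
JOIN Offers AS o
  ON o.Id = (SELECT b.Id
             FROM Offers AS b
             WHERE b.Code = a.Code
             ORDER BY b.Price, b.Id
             LIMIT 1)
ORDER BY o.Code
"""


def cheapest_offers(rows):
    """Return the result of both query shapes for the given (Code, Price) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Offers (Id INTEGER PRIMARY KEY, Code TEXT, Price REAL)"
    )
    conn.execute("CREATE INDEX ix_offers_code_price ON Offers (Code, Price)")
    conn.executemany("INSERT INTO Offers (Code, Price) VALUES (?, ?)", rows)
    by_row_number = conn.execute(ROW_NUMBER_QUERY).fetchall()
    by_lookup = conn.execute(PER_CODE_QUERY).fetchall()
    conn.close()
    return by_row_number, by_lookup


by_row_number, by_lookup = cheapest_offers(
    [("A", 5.0), ("A", 3.0), ("B", 7.0), ("B", 9.0), ("A", 4.0)]
)
```

The per-code lookup only pays off when there are few distinct codes relative to the number of rows, which matches the ~4000 codes over 10 million rows described in the question.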

T-SQL import data and remove duplicate records from prior month

Every month we receive a roster which we run queries on and then generate data that gets uploaded into a table for an outside source to retrieve. My question is: what would be the easiest way to remove the duplicate data from the prior month's upload, bearing in mind that not all data is duplicated and that if a person does not appear on the new roster, their prior month's data needs to remain? The data is timestamped when it gets uploaded.
Thank you
You can use a cte and Row_Number() to identify and remove dupes
;with cte as (
Select *
,RN = Row_Number() over (Partition By SomeKeyField(s) Order By SomeDate Desc)
From YourTable
)
Select * -- << Remove if Satisfied
-- Delete -- << Remove Comment if Satisfied
From cte
Where RN>1
Without seeing your data structure, take a hard look at the Partition By and Order By within the OVER clause of Row_Number().
A short and effective way is to delete via a derived table:
delete from f
from (
select *, row_number() over (partition by col order by (select 0)) rn
from tbl) f
where rn > 1
But the most effective way is to remove duplicates on input and prevent them (for example, with a unique constraint).
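To make the "keep the newest row per person" idea concrete, here is a minimal sketch using Python's sqlite3 module. SQLite stands in for SQL Server (it cannot delete through a CTE or derived table, so the row numbers are matched back by id), and the table roster with columns id, person_id and uploaded_at is hypothetical.

```python
import sqlite3

# Rows with rn > 1 are older uploads for a person who also has a newer one.
DELETE_OLDER_DUPLICATES = """
DELETE FROM roster
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY person_id
                                  ORDER BY uploaded_at DESC) AS rn
        FROM roster
    ) AS ranked
    WHERE rn > 1
)
"""


def dedupe(rows):
    """Load (id, person_id, uploaded_at) rows, delete older duplicates, return the rest."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE roster (id INTEGER PRIMARY KEY, person_id INTEGER, uploaded_at TEXT)"
    )
    conn.executemany("INSERT INTO roster VALUES (?, ?, ?)", rows)
    conn.execute(DELETE_OLDER_DUPLICATES)
    remaining = conn.execute("SELECT * FROM roster ORDER BY id").fetchall()
    conn.close()
    return remaining


# Person 1 is on both rosters; person 2 only appears on the prior month's roster.
remaining = dedupe([
    (1, 1, "2024-01-01"),
    (2, 2, "2024-01-01"),
    (3, 1, "2024-02-01"),
])
```

Person 2 keeps their prior-month row because it is the only row in its partition, while person 1's older row is removed.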

How to get the average row number per rank in PostgreSQL?

Is there any option to get the average of the same values using the RANK() function in PostgreSQL? Here is the example of what I want to do:
This query will do the trick for you
SELECT
test_score,
row_number() OVER (ORDER BY test_score) AS rank,
rank() OVER (ORDER BY test_score)
+ (count(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS "rank (with tied)"
FROM scores
SQLFiddle
Remarks:
What you believe is the "rank" is really the row_number() (i.e. a consecutive series of positive integers with no gaps and no duplicates).
That rank "with tied" that you're looking for can be calculated from the real rank() (rank with gaps) + the number of other elements of the same rank divided by two. This is a faster shortcut to calculate the average row_number() given your specific requirements.
I'm pretty sure you want row_number(), not rank(). Rank will not give repeated values in the way you presented. To get the answer you're looking for:
with rwn as (
select
test_score
,row_number() over (order by test_score) rwn
from
score
)
select
test_score
,avg(rwn) average_rank
from
rwn
group by
test_score;
Here's a SQLFiddle.
@Lukas and @jeremy already explained the difference between rank() and row_number() that you seemed to be missing.
You can also compute the row number (rn), and the average over rn (avg_rn) per rank (= per group of same values) in the next step:
SELECT test_score, rn, avg(rn) OVER (PARTITION BY test_score) AS avg_rn
FROM (SELECT test_score, row_number() OVER (ORDER BY test_score) AS rn FROM tbl) sub;
You need a subquery because window functions cannot be nested on the same query level.
You need another window function (not an aggregate function, as has been suggested) to preserve all original rows.
The result is ordered by rn by default (for this simple query), but this is just an implementation detail. To guarantee an ordered result, add an explicit ORDER BY (for practically no cost):
...
ORDER BY rn;
SQL Fiddle.
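The rank() shortcut and the avg(rn) window approach should agree. The sketch below checks that on a tiny made-up data set using Python's sqlite3 module. SQLite stands in for PostgreSQL, and the table scores with the sample values 50, 60, 60, 70 is an assumption for illustration.

```python
import sqlite3

# Shortcut: rank with gaps plus half the number of other tied rows.
SHORTCUT_QUERY = """
SELECT test_score,
       RANK() OVER (ORDER BY test_score)
         + (COUNT(*) OVER (PARTITION BY test_score) - 1) / 2.0 AS avg_rank
FROM scores
ORDER BY test_score
"""

# Two-step version: compute row numbers first, then average them per score.
AVG_RN_QUERY = """
SELECT test_score,
       AVG(rn) OVER (PARTITION BY test_score) AS avg_rn
FROM (
    SELECT test_score,
           ROW_NUMBER() OVER (ORDER BY test_score) AS rn
    FROM scores
) AS sub
ORDER BY test_score
"""


def average_ranks(values):
    """Return the (test_score, average rank) rows from both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (test_score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?)", [(v,) for v in values])
    shortcut = conn.execute(SHORTCUT_QUERY).fetchall()
    two_step = conn.execute(AVG_RN_QUERY).fetchall()
    conn.close()
    return shortcut, two_step


shortcut, two_step = average_ranks([50, 60, 60, 70])
```

The two tied scores occupy row numbers 2 and 3, so both queries report 2.5 for them.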

Understanding a simple DISTINCT ON in postgresql

I am having a small difficulty understanding the below simple DISTINCT ON query:
SELECT DISTINCT
ON (bcolor) bcolor,
fcolor
FROM
t1
ORDER BY
bcolor,
fcolor;
I have this table here:
What is the order of execution of the above query, and why am I getting the following result:
As I understand since ORDER BY is used it will display the table columns (both of them), in alphabetical order and since ON is used it will return the 1st matched duplicate, but I am still confused about how the resulting table is displayed.
Can somebody take me through how exactly this query is executed ?
This is an odd one since you would think that the SELECT would happen first, then the ORDER BY like any normal RDBMS, but the DISTINCT ON is special. It needs to know the order of the records in order to properly determine which records should be dropped.
So, in this case, it orders first by the bcolor, then by the fcolor. Then it determines distinct bcolors, and drops any but the first record for each distinct group.
In short, it does ORDER BY then applies the DISTINCT ON to drop the appropriate records. I think it would be most helpful to think of 'DISTINCT ON' as being special functionality that differs greatly from DISTINCT.
Added after initial post:
This could be done using window functions and a subquery as well:
SELECT
bcolor,
fcolor
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY bcolor ORDER BY fcolor ASC) as rownumber,
bcolor,
fcolor
FROM t1
) t2
WHERE rownumber = 1
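The "sort first, then keep the first row of each group" behaviour described in this answer can be mimicked in a few lines of plain Python. This is only a mental model, not how PostgreSQL executes the query: the sample rows are invented (the table from the question is not reproduced here), and NULL handling is ignored.

```python
from itertools import groupby


def distinct_on_first(rows):
    """Mimic DISTINCT ON (bcolor) ... ORDER BY bcolor, fcolor on (bcolor, fcolor) tuples."""
    # Step 1: ORDER BY bcolor, fcolor (tuples compare element by element).
    ordered = sorted(rows)
    # Step 2: for each run of equal bcolor, keep only the first row.
    return [next(group) for _, group in groupby(ordered, key=lambda row: row[0])]


# Hypothetical (bcolor, fcolor) rows.
sample = [
    ("red", "red"),
    ("red", "blue"),
    ("green", "red"),
    ("blue", "green"),
    ("green", "blue"),
]

kept = distinct_on_first(sample)
```

For each bcolor the surviving row is the one with the alphabetically smallest fcolor, which is what the ORDER BY in the query decides.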