Difference of top two values while GROUP BY

Suppose I have the following SQL Table:
id | score
------------
1 | 4433
1 | 678
1 | 1230
1 | 414
5 | 8899
5 | 123
6 | 2345
6 | 567
6 | 2323
Now I want to do a GROUP BY id operation in which the score column is replaced by the absolute difference between the two highest scores for each id.
For example, the response for the above query should be:
id | score
------------
1 | 3203
5 | 8776
6 | 22
How can I perform this query in PostgreSQL?

Using ROW_NUMBER along with pivoting logic we can try:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY score DESC) rn
    FROM yourTable
)
SELECT id,
       ABS(MAX(score) FILTER (WHERE rn = 1) -
           MAX(score) FILTER (WHERE rn = 2)) AS score
FROM cte
GROUP BY id;
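To sanity-check the expected output by hand, the same logic can be mirrored outside the database. The following is only a Python sketch of what the query computes, not a replacement for it; the function name top_two_difference is made up for illustration, and the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows copied from the question: (id, score)
rows = [
    (1, 4433), (1, 678), (1, 1230), (1, 414),
    (5, 8899), (5, 123),
    (6, 2345), (6, 567), (6, 2323),
]

def top_two_difference(rows):
    """Map each id to |highest score - second highest score| (hypothetical helper)."""
    scores_by_id = defaultdict(list)
    for id_, score in rows:
        scores_by_id[id_].append(score)

    result = {}
    for id_, scores in scores_by_id.items():
        # Same ordering as ROW_NUMBER() OVER (PARTITION BY id ORDER BY score DESC)
        ranked = sorted(scores, reverse=True)
        # A group with a single row has no rn = 2, so the SQL expression yields NULL
        result[id_] = abs(ranked[0] - ranked[1]) if len(ranked) >= 2 else None
    return result

print(top_two_difference(rows))  # {1: 3203, 5: 8776, 6: 22}
```

Note the single-row case: an id with only one score produces NULL in the query (None here), which may or may not be what you want.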

Related

PostgreSQL how to generate a partition row_number() with certain numbers overridden

I have an unusual problem I'm trying to solve with SQL where I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, while not breaking the sequence (unless the override causes a number to be used greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows where I don't, then unioning them together and somehow using coalesce to get the desired dynamically generated sequence value, or maybe there's some way I can use a recursive query.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24

SELECT
    *,
    COALESCE(                                     -- 3
        first_value(override_as_number) OVER w    -- 2
        , 1
    )
    + row_number() OVER w - 1                     -- 4, 5
FROM (
    SELECT
        *,
        SUM(                                      -- 1
            CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
        ) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
            as grouped
    FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
1. Create a new subpartition within your partitions: this cumulative sum creates a unique group id for every run of records that starts with a non-NULL override_as_number followed by NULL records. So, for instance, your (AAA, d) to (AAA, f) belong to the same subpartition/group.
2. first_value() gives the first value of such a subpartition.
3. The COALESCE ensures a non-NULL result from the first_value() function if your partition starts with a NULL record.
4. row_number() - 1 creates a row count within a subpartition, starting with 0.
5. Adding the first_value() of a subpartition to the row count creates your result: beginning with the one non-NULL record of a subpartition (adding the 0 row count), the first following NULL record results in the value +1, and so forth.
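The steps above can be traced with a small Python sketch. It is an illustration of the idea, not the query itself: the function name dynamic_numbers is hypothetical, and the rows are assumed to match the sample shown elsewhere on this page (grouped_by, secondary_order_by, override_as_number).

```python
from itertools import groupby

# Assumed sample rows: (grouped_by, secondary_order_by, override_as_number)
sample = [
    ("AAA", "a", 1), ("AAA", "b", None), ("AAA", "c", 3), ("AAA", "d", 3),
    ("AAA", "e", None), ("AAA", "f", None), ("AAA", "g", 999),
    ("XYZ", "a", None), ("ZZZ", "a", None), ("ZZZ", "b", None),
]

def dynamic_numbers(rows):
    """Return (grouped_by, secondary_order_by, number) for every row (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    out = []
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        # start plays the role of COALESCE(first_value(...), 1),
        # offset plays the role of row_number() - 1 within the subpartition
        start, offset = 1, 0
        for grouped_by, order_key, override in partition:
            if override is not None:
                # A non-NULL override opens a new subpartition (step 1)
                start, offset = override, 0
            out.append((grouped_by, order_key, start + offset))
            offset += 1
    return out

for row in dynamic_numbers(sample):
    print(row)
```

For the AAA partition this yields 1, 2, 3, 3, 4, 5, 999, which matches the desired_dynamic_number column of the sample.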

The query below gives the exact result for the sample data, but you need to verify it against all combinations:
select c.*, COALESCE(c.override_as_number, c.act) as final
from (
    select b.*,
           dense_rank() over (partition by grouped_by order by grouped_by, actual) as act
    from (
        select a.*, COALESCE(override_as_number, row_num) as actual
        from (
            select grouped_by, secondary_order_by,
                   dense_rank() over (partition by grouped_by order by grouped_by, secondary_order_by) as row_num,
                   override_as_number, desired_dynamic_number
            from fiddle
        ) a
    ) b
) c;
The "final" column is the result:
 grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
 AAA        | a                  |       1 |                  1 |                      1 |      1 |   1 |     1
 AAA        | b                  |       2 |                    |                      2 |      2 |   2 |     2
 AAA        | c                  |       3 |                  3 |                      3 |      3 |   3 |     3
 AAA        | d                  |       4 |                  3 |                      3 |      3 |   3 |     3
 AAA        | e                  |       5 |                    |                      4 |      5 |   4 |     4
 AAA        | f                  |       6 |                    |                      5 |      6 |   5 |     5
 AAA        | g                  |       7 |                999 |                    999 |    999 |   6 |   999
 XYZ        | a                  |       1 |                    |                      1 |      1 |   1 |     1
 ZZZ        | a                  |       1 |                    |                      1 |      1 |   1 |     1
 ZZZ        | b                  |       2 |                    |                      2 |      2 |   2 |     2
(10 rows)
Hope this helps!

The real world problem I was trying to solve did not have a nicely ordered secondary_order_by column, instead it would be something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve, a colleague solved this problem using a cartesian join, and I'm posting their solution below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
    select row_number() over (order by seq4()) as gen_list
    from table(generator(rowcount => 1000))
),
base as (
    select *,
           IFF(override_as_number IS NULL,
               row_number() OVER (PARTITION BY grouped_by, override_as_number order by random),
               override_as_number) as rownum
    from "SANDPIT"."TEST"."SAMPLEDATA"
    order by grouped_by, override_as_number, random
) --select * from base order by grouped_by,random;
,
cart_product as (
    select *
    from tally_table
    cross join (Select distinct grouped_by from base) as distinct_grouped_by
) --select * from cart_product;
,
filter_product as (
    select *,
           row_number() OVER (partition by cart_product.grouped_by
                              order by cart_product.grouped_by, gen_list) as seq_order
    from cart_product
    where CONCAT(grouped_by, '~', gen_list) NOT IN (
        select concat(grouped_by, '~', override_as_number)
        from base
        where override_as_number is not null
    )
) --select * from try2 order by 2,3 ;
select base.grouped_by,
       base.random,
       base.override_as_number,
       base.answer, -- This is hard coded as test data
       IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base
inner join filter_product
    on base.rownum = filter_product.seq_order
   and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
         random;
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.

Get different LIMIT on each group on postgresql rank

To get 2 rows from each group I can use ROW_NUMBER() with a final condition of <= 2, but what if I want a different limit for each group, e.g. 3 rows for section_id 1, 1 row for section_id 2 and 1 row for section_id 3?
Given the following table:
db=# SELECT * FROM xxx;
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
3 | 1 | C
4 | 1 | D
5 | 2 | E
6 | 2 | F
7 | 3 | G
8 | 2 | H
(8 rows)
I get the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:
id | section_id | name
----+------------+------
1 | 1 | A
2 | 1 | B
5 | 2 | E
6 | 2 | F
7 | 3 | G
(5 rows)
Current Query:
SELECT
    *
FROM (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
        t.*
    FROM
        xxx t) x
WHERE
    x.r <= 2;

Create a table to contain the section limits, then join. The big advantage is that, as new sections are required or limits change, maintenance is reduced to a single table update and comes at very little cost. See example.
select s.section_id, s.name
from (select section_id, name
           , row_number() over (partition by section_id order by name) rn
      from sections
     ) s
left join section_limits sl on (sl.section_id = s.section_id)
where
    s.rn <= coalesce(sl.limit_to, 2);
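To make the effect of the left join and the coalesce default concrete, here is a rough Python sketch of the same filtering. The limits dictionary stands in for the section_limits table and its contents are an assumption (3 rows for section 1, 1 row for section 2, no entry for section 3); limited_rows is a made-up name.

```python
from collections import defaultdict

# Rows from the question: (id, section_id, name)
rows = [
    (1, 1, "A"), (2, 1, "B"), (3, 1, "C"), (4, 1, "D"),
    (5, 2, "E"), (6, 2, "F"), (7, 3, "G"), (8, 2, "H"),
]

# Stand-in for the section_limits table (assumed contents).
# Section 3 has no entry, so it falls back to the default, like coalesce(sl.limit_to, 2).
section_limits = {1: 3, 2: 1}

def limited_rows(rows, limits, default_limit=2):
    """Keep the first N rows per section, ordered by name (hypothetical helper)."""
    kept = []
    row_number = defaultdict(int)
    for row in sorted(rows, key=lambda r: (r[1], r[2])):
        section_id = row[1]
        row_number[section_id] += 1  # row_number() over (partition by section_id order by name)
        if row_number[section_id] <= limits.get(section_id, default_limit):
            kept.append(row)
    return kept

for row in limited_rows(rows, section_limits):
    print(row)
```

Changing a limit then means editing one entry in the lookup rather than rewriting the WHERE clause.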

Just fix up your where clause:
with numbered as (
    select row_number() over (partition by section_id
                              order by name) as r,
           t.*
    from xxx t
)
select *
from numbered
where (section_id = 1 and r <= 3)
   or (section_id = 2 and r <= 1)
   or (section_id = 3 and r <= 1);

Finding the length of a series in postgres

A tricky query for postgres. Imagine, I have a set of rows with a boolean column called (for example) success. Like this:
id | success
9 | false
8 | false
7 | true
6 | true
5 | true
4 | false
3 | false
2 | true
1 | false
And I need to calculate the length of the latest successful (or unsuccessful) series. E.g. in this case it would be "3" for successful and "2" for unsuccessful. Or, using window functions, something like:
id | success | length
9 | false | 2
8 | false | 2
7 | true | 3
6 | true | 3
5 | true | 3
4 | false | 2
3 | false | 2
2 | true | 1
1 | false | 1
(note that I generally need a length of only the latest series, not all of those)
The closest answer I've found so far was this article:
https://jaxenter.com/10-sql-tricks-that-you-didnt-think-were-possible-125934.html
(See #5)
However, postgres doesn't support "IGNORE NULLS" option so the query doesn't work. Without "IGNORE NULLS" it simply returns me nulls in length column.
Here is the closest I was able to get:
WITH
trx1(id, success, rn) AS (
    SELECT id, success, row_number() OVER (ORDER BY id desc)
    FROM results
),
trx2(id, success, rn, lo, hi) AS (
    SELECT trx1.*,
        CASE WHEN coalesce(lag(success) OVER (ORDER BY id DESC), FALSE) != success THEN rn END,
        CASE WHEN coalesce(lead(success) OVER (ORDER BY id DESC), FALSE) != success THEN rn END
    FROM trx1
)
SELECT trx2.*, 1
    - last_value (lo) IGNORE nulls OVER (ORDER BY id DESC ROWS BETWEEN
        UNBOUNDED PRECEDING AND CURRENT ROW)
    + first_value(hi) OVER (ORDER BY id DESC ROWS BETWEEN CURRENT ROW
        AND UNBOUNDED FOLLOWING)
    AS length FROM trx2;
Do you have any ideas of such a query?

You can use the window function row_number() to designate series:
select max(id) as max_id, success, count(*) as length
from (
    select *, row_number() over wa - row_number() over wp as grp
    from my_table
    window
        wp as (partition by success order by id desc),
        wa as (order by id desc)
) s
group by success, grp
order by 1 desc
max_id | success | length
--------+---------+--------
9 | f | 2
7 | t | 3
4 | f | 2
2 | t | 1
1 | f | 1
(5 rows)
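Why the difference of the two row numbers identifies a series can be seen by computing it by hand. The Python sketch below mirrors the query on the question's data; series_lengths is a hypothetical name and the code is only meant to illustrate the grouping trick.

```python
from collections import defaultdict

# Rows from the question: (id, success)
rows = [
    (9, False), (8, False), (7, True), (6, True), (5, True),
    (4, False), (3, False), (2, True), (1, False),
]

def series_lengths(rows):
    """Return (max_id, success, length) per consecutive series, newest first (hypothetical helper)."""
    ordered = sorted(rows, key=lambda r: r[0], reverse=True)
    per_value_rn = defaultdict(int)  # row_number() over (partition by success order by id desc)
    groups = {}                      # (success, grp) -> [max_id, length]
    for overall_rn, (id_, success) in enumerate(ordered, start=1):
        per_value_rn[success] += 1
        # Within one unbroken series both counters advance together,
        # so their difference stays constant and can serve as a group key.
        grp = overall_rn - per_value_rn[success]
        key = (success, grp)
        if key not in groups:
            groups[key] = [id_, 0]   # ids are visited in descending order, so the first one is the max
        groups[key][1] += 1
    return sorted(
        ((max_id, success, length) for (success, _), (max_id, length) in groups.items()),
        reverse=True,
    )

for series in series_lengths(rows):
    print(series)
```

The first tuple returned describes the latest series, which is the only one the question actually needs.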

Even though answer by Klin is totally correct, I'd like to post another solution my friend suggested:
with last_success as (
select max(id) id from my_table where success
)
select count(mt.id) last_fails_count
from my_table mt, last_success lt
where mt.id > lt.id;
--------------------
| last_fails_count |
--------------------
| 2 |
--------------------
It is about twice as fast when I only need the last failing or successful series.
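For comparison, the same shortcut written as a small Python sketch (last_fails_count is just an illustrative name). It also makes one edge case visible: when there is no successful row at all, max(id) is NULL in the query, the comparison never holds, and the count is 0 rather than the total number of rows.

```python
# Rows from the question: (id, success)
rows = [
    (9, False), (8, False), (7, True), (6, True), (5, True),
    (4, False), (3, False), (2, True), (1, False),
]

def last_fails_count(rows):
    """Count rows newer than the most recent successful row (hypothetical helper)."""
    success_ids = [id_ for id_, success in rows if success]
    if not success_ids:
        # Mirrors the SQL behaviour: id > NULL is never true, so nothing is counted
        return 0
    last_success = max(success_ids)
    return sum(1 for id_, _ in rows if id_ > last_success)

print(last_fails_count(rows))  # 2
```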

Grouped LIMIT 10 in Postgresql

I have a query:
select
    a.kli,
    b.term_desc,
    count(distinct(a.adic)) as count,
    a.partner_id
from
    ad_delivery.sgmt_kli_adic a
    join wand.wandterms b on a.kli = b.term_code
    join wand.wandterms c on b.term_desc = c.term_desc
    join dwh.sgmt_clients e on a.partner_id::varchar = e.partner_id
    join dwh.schema_names f on e.partner_id::integer = f.partner_id::integer
where
    a.partner_id::integer in (f.partner_id)
    and c.class_code = 969
group by a.partner_id, b.term_desc, a.kli
order by partner_id, count desc;
which brings back counts for certain terms per partner_id. I want to show the top 10 terms for each of the ~40 partner_id values, ordered by count descending.
The query results look like:
db=# SELECT * FROM xxx;
pid | term_desc | count
----+------------+------
4 | termdesc1 | 3434
4 | termdesc2 | 235
4 | termdesc3 | 367
4 | termdesc4 | 4533
5 | termdesc1 | 235
5 | termdesc2 | 567
5 | termdesc3 | 344
5 | termdesc4 | 56
(10k+ rows)

You could add a rank column and then filter the result by the rank:
select *
from (
    select
        a.kli,
        b.term_desc,
        count(distinct(a.adic)) as count,
        a.partner_id,
        -- rank the grouped rows within each partner, highest count first
        RANK() OVER (PARTITION BY a.partner_id
                     ORDER BY count(distinct(a.adic)) DESC) AS r
    from
        ad_delivery.sgmt_kli_adic a
        join wand.wandterms b on a.kli = b.term_code
        join wand.wandterms c on b.term_desc = c.term_desc
        join dwh.sgmt_clients e on a.partner_id::varchar = e.partner_id
        join dwh.schema_names f on e.partner_id::integer = f.partner_id::integer
    where
        a.partner_id::integer in (f.partner_id)
        and c.class_code = 969
    group by a.partner_id, b.term_desc, a.kli
) ranked
where r < 11 -- the rank can only be filtered in an outer query, not in HAVING
order by partner_id, count desc;
I have not tested the code. The trick is to rank each row produced by the GROUP BY within its partner_id by count, then filter on that rank in an outer query (a window function result cannot be referenced in HAVING or WHERE at the same query level), keeping only rows with a rank lower than 11. You get 10 rows per group, or more if counts tie at the cutoff; use ROW_NUMBER() instead of RANK() if you need exactly 10.

Selecting rows ordered by some column and distinct on another

Related to - PostgreSQL DISTINCT ON with different ORDER BY
I have table purchases (product_id, purchased_at, address_id)
Sample data:
| id | product_id | purchased_at | address_id |
| 1 | 2 | 20 Mar 2012 21:01 | 1 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 3 | 2 | 20 Mar 2012 21:39 | 2 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
The result I expect is the most recently purchased product (full row) for each address_id, and that result must be sorted in descending order by the purchased_at field:
| id | product_id | purchased_at | address_id |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
Using query:
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM "purchases"
WHERE "purchases"."product_id" = 2
ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
I'm getting:
| id | product_id | purchased_at | address_id |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
So the rows are the same, but the order is wrong. Is there any way to fix it?

Quite a clear question :)
SELECT t1.* FROM purchases t1
LEFT JOIN purchases t2
    ON t1.address_id = t2.address_id AND t1.purchased_at < t2.purchased_at
WHERE t2.purchased_at IS NULL
ORDER BY t1.purchased_at DESC
And most likely a faster approach:
SELECT t1.* FROM purchases t1
JOIN (
    SELECT address_id, max(purchased_at) max_purchased_at
    FROM purchases
    GROUP BY address_id
) t2
    ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC

Your ORDER BY is used by DISTINCT ON for picking which row for each distinct address_id to produce. If you then want to order the resulting records, make the DISTINCT ON a subselect and order its results:
SELECT * FROM
(
    SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
    FROM "purchases"
    WHERE "purchases"."product_id" = 2
    ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
) distinct_addrs
ORDER BY distinct_addrs.purchased_at DESC
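The two ORDER BY clauses do different jobs: the inner one decides which row survives per address_id, the outer one sorts the survivors. A Python sketch of those two steps, using the sample rows from the question (latest_per_address is a name invented for this illustration):

```python
from datetime import datetime

# Sample rows from the question: (id, product_id, purchased_at, address_id)
purchases = [
    (1, 2, datetime(2012, 3, 20, 21, 1), 1),
    (2, 2, datetime(2012, 3, 20, 21, 33), 1),
    (3, 2, datetime(2012, 3, 20, 21, 39), 2),
    (4, 2, datetime(2012, 3, 20, 21, 48), 2),
]

def latest_per_address(rows, product_id):
    """Pick the newest row per address, then sort the picks newest first (hypothetical helper)."""
    latest = {}
    for row in rows:
        _, prod, purchased_at, address_id = row
        if prod != product_id:
            continue  # WHERE product_id = 2
        # Step 1 (inner query): keep only the most recent row for each address_id
        if address_id not in latest or purchased_at > latest[address_id][2]:
            latest[address_id] = row
    # Step 2 (outer query): order the surviving rows by purchased_at descending
    return sorted(latest.values(), key=lambda r: r[2], reverse=True)

print([row[0] for row in latest_per_address(purchases, 2)])  # [4, 2]
```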

This query is trickier to rephrase properly than it looks.
The currently accepted, join-based answer doesn’t correctly handle the case where two candidate rows have the same given purchased_at value: it will return both rows.
You can get the right behaviour this way:
SELECT * FROM purchases AS given
WHERE product_id = 2
  AND NOT EXISTS (
    SELECT NULL FROM purchases AS other
    WHERE given.address_id = other.address_id
      -- the id comparison applies only when the purchased_at values tie
      AND (given.purchased_at < other.purchased_at
           OR (given.purchased_at = other.purchased_at AND given.id < other.id))
  )
ORDER BY purchased_at DESC
Note how it has a fallback of comparing id values to disambiguate the case in which the purchased_at values match. This ensures that the condition can only ever be true for a single row among those that have the same address_id value.
The original query using DISTINCT ON handles this case automatically!
Also note the way that you are forced to encode the fact that you want “the latest for each address_id” twice, both in the given.purchased_at < other.purchased_at condition and the ORDER BY purchased_at DESC clause, and you have to make sure they match. I had to spend a few extra minutes to convince myself that this query is really positively correct.
It’s much easier to write this query correctly and understandably by using DISTINCT ON together with an outer subquery, as suggested by dbenhur.
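The tie problem is easy to reproduce on a small example. In the Python sketch below, the rows are made up so that two purchases at address 1 share the same timestamp; join_on_max imitates the max(purchased_at) join and latest_with_tiebreak imitates a comparison that falls back to id. Both function names are hypothetical.

```python
from datetime import datetime

# Hypothetical rows: (id, product_id, purchased_at, address_id).
# Ids 1 and 2 deliberately share the same purchased_at value.
tied = [
    (1, 2, datetime(2012, 3, 20, 21, 33), 1),
    (2, 2, datetime(2012, 3, 20, 21, 33), 1),
    (3, 2, datetime(2012, 3, 20, 21, 48), 2),
]

def join_on_max(rows):
    """Ids of rows whose purchased_at equals the maximum for their address."""
    max_at = {}
    for _, _, purchased_at, address_id in rows:
        max_at[address_id] = max(purchased_at, max_at.get(address_id, purchased_at))
    return [r[0] for r in rows if r[2] == max_at[r[3]]]

def latest_with_tiebreak(rows):
    """Ids of exactly one row per address: newest purchased_at, then highest id."""
    best = {}
    for row in rows:
        address_id = row[3]
        # Comparing (purchased_at, id) pairs makes id matter only when timestamps tie
        if address_id not in best or (row[2], row[0]) > (best[address_id][2], best[address_id][0]):
            best[address_id] = row
    return sorted(r[0] for r in best.values())

print(join_on_max(tied))           # [1, 2, 3]
print(latest_with_tiebreak(tied))  # [2, 3]
```

The first result contains both tied rows for address 1, while the second keeps a single row per address.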

Try this !
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 2
ORDER BY address_id, purchased_at DESC
