PostgreSQL: Grouping with limit on group size using window functions

Is there a way in PostgreSQL to write a query which groups rows based on a column, with a limit on the group size, but without discarding the additional rows?
Say I've got a table with three columns id, color, score, containing the following rows:
 id | color | score
----+-------+-------
  1 | red   |  10.0
  2 | red   |   7.0
  3 | red   |   3.0
  4 | blue  |   5.0
  5 | green |   4.0
  6 | blue  |   2.0
  7 | blue  |   1.0
I can get a grouping based on color with window functions using the following query:
SELECT * FROM (
    SELECT id, color, score,
           rank() OVER (PARTITION BY color ORDER BY score DESC)
    FROM grouping_test
) AS foo WHERE rank <= 2;
with the result
 id | color | score | rank
----+-------+-------+------
  4 | blue  |   5.0 |    1
  6 | blue  |   2.0 |    2
  5 | green |   4.0 |    1
  1 | red   |  10.0 |    1
  2 | red   |   7.0 |    2
which discards the items with rank > 2. However, what I need is a result like
 id | color | score
----+-------+-------
  1 | red   |  10.0
  2 | red   |   7.0
  4 | blue  |   5.0
  6 | blue  |   2.0
  5 | green |   4.0
  3 | red   |   3.0
  7 | blue  |   1.0
With no discarded rows.
Edit:
To be more precise about the logic I need:
1. Get me the row with the highest score.
2. The next row with the same color and the highest possible score.
3. The item with the highest score among the remaining items.
4. Same as 2., but for the row from 3.
...
Continue as long as pairs with the same color can be found, then order what's left by descending score.
The import statements for a test table can be found here.
Thanks for your help.

It can be done using two nested window functions:
SELECT
    id
FROM (
    SELECT
        id,
        color,
        score,
        ((rank() OVER color_window) - 1) / 2 AS rank_window_id
    FROM grouping_test
    WINDOW color_window AS (PARTITION BY color ORDER BY score DESC)
) AS foo
WINDOW rank_window AS (PARTITION BY color, rank_window_id)
ORDER BY
    (max(score) OVER rank_window) DESC,
    color;
Here 2 (the divisor in the integer division) is the group-size parameter.
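To see why the integer division forms groups of two, it helps to run the inner query on its own (a sketch against the same grouping_test table, with the raw per-color rank exposed): for red's ranks 1, 2, 3 the expression (rank - 1) / 2 yields 0, 0, 1, so the two best red rows share a rank_window_id and the third starts a new group.
-- Inner query only: expose both the per-color rank and the derived group id.
SELECT
    id,
    color,
    score,
    rank() OVER color_window AS color_rank,
    ((rank() OVER color_window) - 1) / 2 AS rank_window_id  -- integer division
FROM grouping_test
WINDOW color_window AS (PARTITION BY color ORDER BY score DESC);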

You can do ORDER BY (rank <= 2) DESC to get the rows with rank <= 2 above all else (a boolean sorts with true after false, so DESC puts the matching rows first):
SELECT id, color, score FROM (
    SELECT id, color, score,
           rank() OVER (PARTITION BY color ORDER BY score DESC),
           max(score) OVER (PARTITION BY color) AS mx
    FROM grouping_test
) AS foo
ORDER BY
    (rank <= 2) DESC,
    CASE WHEN rank <= 2 THEN mx ELSE NULL END DESC,
    id;
http://sqlfiddle.com/#!12/bbcfc/109

Related

Retain only 3 highest positive and negative records in a table

I am new to databases and Postgres as such.
I have a table called names which has 2 columns, name and value, and which gets updated every x seconds with new name/value pairs. My requirement is to retain only 3 positive and 3 negative values at any point of time, deleting the rest of the rows during each table update.
I use the following query to delete the old rows and retain the 3 positive and 3 negative values ordered by value:
delete from names
using (select *,
              row_number() over (partition by value > 0, value < 0
                                 order by value desc) as rn
       from names) w
where w.rn >= 3
I am skeptical about using a condition like value > 0 in a partition clause. Is this approach correct?
For example,
A table like this prior to the delete:
 name  | value
-------+-------
 test  |    10
 test1 |    11
 test1 |    12
 test1 |    13
 test4 |    -1
 test4 |    -2
My table after the delete should look like:
 name  | value
-------+-------
 test1 |    13
 test1 |    12
 test1 |    11
 test4 |    -1
 test4 |    -2
This works generally as expected: value > 0 splits the rows into the numbers > 0 and the numbers <= 0, and ORDER BY value DESC orders the rows within each of these two groups as expected.
So, the only thing I would change:
row_number() over (partition by value >= 0 order by value desc)
Remove , value < 0 (because: why should you split the positive values into negative and other? You don't have any negative numbers in your positive group and vice versa).
Change value > 0 to value >= 0 to ignore the 0 as long as possible.
For the deleting, if you want to keep the top 3 values of each direction:
You should change w.rn >= 3 into w.rn > 3 (so that the 3rd element is kept as well).
You need to connect the subquery with the table's rows. In real cases you should use an id column for that; in your example you could take the value column: where n.value = w.value AND w.rn > 3
So, finally:
delete from names n
using (select *,
              row_number() over (partition by value >= 0
                                 order by value desc) as rn
       from names) w
where n.value = w.value AND w.rn > 3
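Note that if the table can hold duplicate values, joining on value deletes all of the duplicates at once. A sketch that avoids this without an id column, using PostgreSQL's system column ctid as the row identity (my assumption, not part of the original answer):
-- ctid identifies the physical row, so duplicate values are told apart.
delete from names n
using (select ctid,
              row_number() over (partition by value >= 0
                                 order by value desc) as rn
       from names) w
where n.ctid = w.ctid
  and w.rn > 3;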
If it's not a hard requirement to delete the other rows, you could instead select only the rows you're interested in:
WITH largest AS (
    SELECT name, value
    FROM names
    ORDER BY value DESC
    LIMIT 3
), smallest AS (
    SELECT name, value
    FROM names
    ORDER BY value ASC
    LIMIT 3
)
SELECT * FROM largest
UNION
SELECT * FROM smallest
ORDER BY value DESC

How to find median by attribute with Postgres window functions?

I use PostgreSQL and have records like this on groups of people:
  name   | people | indicator
---------+--------+-----------
 group 1 |   1000 |         1
 group 2 |    100 |         2
 group 3 |   2000 |         3
I need to find the indicator for the median person. The result should be:
 group 3 |   2000 |         3
If I do something like
select median(name) over (order by indicator) from table1
it will be group 2.
Not sure if I can select this with a window function.
Generating 1000/2000 rows per record seems impractical, because I have millions of people in the records.
Find the first row whose cumulative sum of people exceeds half of the total sum:
with the_data(name, people, indicator) as (
    values
        ('group 1', 1000, 1),
        ('group 2', 100, 2),
        ('group 3', 2000, 3)
)
select name, people, indicator
from (
    select *, sum(people) over (order by name)
    from the_data
    cross join (select sum(people)/2 as median from the_data) s
) s
where sum > median
order by name
limit 1;
  name   | people | indicator
---------+--------+-----------
 group 3 |   2000 |         3
(1 row)
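A variant of the same idea, assuming the rows should be walked in indicator order rather than by group name (the two orders happen to coincide in this sample data):
with the_data(name, people, indicator) as (
    values
        ('group 1', 1000, 1),
        ('group 2', 100, 2),
        ('group 3', 2000, 3)
)
select name, people, indicator
from (
    -- running is the cumulative population, walked in indicator order
    select *, sum(people) over (order by indicator) as running
    from the_data
    cross join (select sum(people)/2 as median from the_data) s
) s
where running > median
order by indicator
limit 1;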

Calculate length of a series of line segments

I have a table like the following:
 X | Y | Z | node
---+---+---+------
 1 | 2 | 3 |  100
 2 | 2 | 3 |
 2 | 2 | 4 |
 2 | 2 | 5 |  200
 3 | 2 | 5 |
 4 | 2 | 5 |
 5 | 2 | 5 |  300
X, Y, Z are 3D space coordinates of some points; a curve passes through all the corresponding points from the first row to the last row. I need to calculate the curve length between each pair of adjacent points whose "node" column isn't null.
It would be great if I could directly insert the result into another table that has three columns: "first_node", "second_node", "curve_length".
I don't need to interpolate extra points into the curve, just to accumulate the lengths of all the straight lines. For example, to calculate the curve length between node 100 and node 200, I need to sum the lengths of 3 straight lines: (1,2,3)<->(2,2,3), (2,2,3)<->(2,2,4), (2,2,4)<->(2,2,5).
EDIT
The table has an ID column, which is in increasing order from the first row to the last row.
To get a previous value in SQL, use the lag window function, e.g.
SELECT
    x,
    lag(x) OVER (ORDER BY id) AS prev_x, ...
FROM ...
ORDER BY id;
That lets you get the previous and next points in 3-D space for a given segment. From there you can trivially calculate the line segment length using regular geometric maths.
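For example, a sketch of the per-segment length (assuming the question's table is called Table1 and ordered by its id column):
SELECT
    id,
    sqrt(
        (x - lag(x) OVER w) ^ 2 +
        (y - lag(y) OVER w) ^ 2 +
        (z - lag(z) OVER w) ^ 2
    ) AS seg_length  -- NULL for the first row, which has no previous point
FROM Table1
WINDOW w AS (ORDER BY id);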
You'll now have the lengths of each segment (sqlfiddle query). You can use this as input to other queries, using SELECT ... FROM (SELECT ...) subqueries or a CTE (WITH ...) term.
It turns out to be pretty awkward to go from the per-segment lengths to node-to-node lengths. You need to create a grouping that spans the null entries, using a recursive CTE or a window function.
I ended up with this monstrosity:
SELECT
    array_agg(from_id) AS seg_ids,
    -- 'max' is used here like 'coalesce' for an aggregate,
    -- since non-null is greater than null
    max(from_node) AS from_node,
    max(to_node) AS to_node,
    sum(seg_length) AS seg_length
FROM (
    -- lengths of all sub-segments, with the null last segment
    -- removed and a partition counter added
    SELECT
        *,
        -- A running counter that increments whenever a row has a
        -- non-null node. Allows us to group by series of
        -- sub-segments in the outer query.
        sum(CASE WHEN from_node IS NULL THEN 0 ELSE 1 END)
            OVER (ORDER BY from_id) AS partition_id
    FROM (
        -- lengths of all sub-segments
        SELECT
            id AS from_id,
            lead(id, 1) OVER (ORDER BY id) AS to_id,
            -- length of sub-segment
            sqrt(
                (x - lead(x, 1) OVER (ORDER BY id)) ^ 2 +
                (y - lead(y, 1) OVER (ORDER BY id)) ^ 2 +
                (z - lead(z, 1) OVER (ORDER BY id)) ^ 2
            ) AS seg_length,
            node AS from_node,
            lead(node, 1) OVER (ORDER BY id) AS to_node
        FROM
            Table1
    ) sub
    -- filter out the last row
    WHERE to_id IS NOT NULL
) seglengths
-- Group into series of sub-segments between two nodes
GROUP BY partition_id;
Credit to "How do I efficiently select the previous non-null value?" for the partition trick.
Result:
 seg_ids | from_node | to_node | seg_length
---------+-----------+---------+------------
 {1,2,3} |       100 |     200 |          3
 {4,5,6} |       200 |     300 |          3
(2 rows)
To insert directly into another table, use INSERT INTO ... SELECT ....
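A minimal sketch of that last step, assuming a hypothetical target table node_lengths(first_node, second_node, curve_length); the placeholder subquery stands in for the full grouped query above:
INSERT INTO node_lengths (first_node, second_node, curve_length)
SELECT from_node, to_node, seg_length
FROM (
    -- placeholder: substitute the full GROUP BY partition_id query here
    SELECT 100 AS from_node, 200 AS to_node, 3.0 AS seg_length
) AS node_to_node;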

PostgreSQL difference between rows

My data:
 id | value
----+-------
  1 |    10
  1 |    20
  1 |    60
  2 |    10
  3 |    10
  3 |    30
How to compute column 'change'?
 id | value | change | how to compute
----+-------+--------+----------------------------------------------
  1 |    10 |     10 | 20 - 10
  1 |    20 |     40 | 60 - 20
  1 |    60 |     40 | default_value - 60; in this example default_value = 100
  2 |    10 |     90 | default_value - 10
  3 |    10 |     20 | 30 - 10
  3 |    30 |     70 | default_value - 30
In other words: if the row is the last one for its id, compute 100 - value; otherwise compute next_value - current_value.
You can access the value of the "next" (or "previous") row using a window function. The concept of a "next" row only makes sense if you have a column to define an order on the rows. You said you have a date column on which you can order the result. I used the column name your_date_column for this. You need to replace that with the actual column name of course.
select id,
       value,
       lead(value, 1, 100) over (partition by id order by your_date_column) - value as change
from the_table
order by id, your_date_column
lead(value, 1, 100) says: take the column value of the "next" row (that's the 1). If there is no such row, use the default value 100 instead.
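A quick way to see the default in action, as a standalone sketch using the question's sample data as a VALUES list (the your_date_column values are made up purely for ordering):
with the_table(id, value, your_date_column) as (
    values (1, 10, 1), (1, 20, 2), (1, 60, 3),
           (2, 10, 1),
           (3, 10, 1), (3, 30, 2)
)
select id,
       value,
       lead(value, 1, 100) over (partition by id order by your_date_column) - value as change
from the_table
order by id, your_date_column;
This returns exactly the change column from the question: 10, 40, 40, 90, 20, 70.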
Join on a subquery and use ROW_NUMBER to find the last value per group:
WITH cte AS (
    SELECT id, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn,
           LEAD(value) OVER (PARTITION BY id ORDER BY date) - value AS change
    FROM t
)
SELECT cte.id, cte.value,
       CASE WHEN cte.change IS NULL THEN 100 - cte.value ELSE cte.change END AS change
FROM cte
LEFT JOIN (SELECT id, MAX(rn) AS mrn
           FROM cte
           GROUP BY id) AS x
    ON x.mrn = cte.rn AND cte.id = x.id

iReport: How to give my choice of custom labels to Pie chart for a computed count?

I am new to JasperReports and I am trying to generate a pie chart using iReport 5.1.0.
I have counts of days taken, which should determine the percentages of the 3 slices, but what should I give in the Key Expression and Label Expression?
I am trying to customize the 3 slice labels as Within 5 days, More than 5 days, and Tested but not referred.
I am getting the counts through this query:
SELECT SUM(subSet.days_taken <= 5) AS within_5_days,
       SUM(subSet.days_taken > 5) AS more_than_5,
       SUM(subSet.date_referred IS NULL) AS not_yet_referred
FROM (select p.patient_id,
             (CASE
                  WHEN st.smear_result <> 'NEGATIVE' OR st.gxp_result = 'MTB+'
                      THEN DATEDIFF(r.date_referred, MIN(st.date_smear_tested))
                  ELSE (CASE
                            WHEN st.smear_result = 'NEGATIVE' OR st.gxp_result = 'MTB-'
                                THEN DATEDIFF(r.date_referred, MAX(st.date_smear_tested))
                        END)
              END) as days_taken,
             r.date_referred as date_referred
      from patient as p
           left outer join sputum_test as st on p.patient_id = st.patient_id
           left outer join referral as r on r.patient_id = st.patient_id
      where p.suspected_by is not null
        and (p.patient_status = 'SUSPECT' or p.patient_status = 'CONFIRMED')
      group by p.patient_id) as subSet
This is also the DataSet run I am using.
Your help will be really appreciated.
What you do now is produce three columns in a single row, so you probably get something similar to the following:
--------------------------------------------------
| within_5_days | more_than_5 | not_yet_referred |
--------------------------------------------------
|             4 |           5 |                8 |
--------------------------------------------------
However, the pie chart won't accept it in that format. Instead you want this:
---------------------------
| Type             | Summ |
---------------------------
| within_5_days    |    4 |
| more_than_5      |    5 |
| not_yet_referred |    8 |
---------------------------
With that you can have "Type" as your label expression and "Summ" as your value expression. So you would have to change the query to something like this:
select CASE
           WHEN subSet.days_taken <= 5 THEN 'within_5_days'
           WHEN subSet.days_taken > 5 THEN 'more_than_5'
           WHEN subSet.date_referred IS NULL THEN 'not_yet_referred'
       END AS Type,
       1 AS Summ ...
Then you can group by Type and sum the Summ column.
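A sketch of the full reshaped query (my assumption, reusing the subSet derived table from the question and aggregating per label so that each slice arrives as one row; it keeps the question's MySQL-style DATEDIFF):
select CASE
           WHEN subSet.days_taken <= 5 THEN 'within_5_days'
           WHEN subSet.days_taken > 5 THEN 'more_than_5'
           WHEN subSet.date_referred IS NULL THEN 'not_yet_referred'
       END AS Type,
       COUNT(*) AS Summ  -- rows matching none of the conditions form a NULL group
from (select p.patient_id,
             (CASE
                  WHEN st.smear_result <> 'NEGATIVE' OR st.gxp_result = 'MTB+'
                      THEN DATEDIFF(r.date_referred, MIN(st.date_smear_tested))
                  ELSE (CASE
                            WHEN st.smear_result = 'NEGATIVE' OR st.gxp_result = 'MTB-'
                                THEN DATEDIFF(r.date_referred, MAX(st.date_smear_tested))
                        END)
              END) as days_taken,
             r.date_referred as date_referred
      from patient as p
           left outer join sputum_test as st on p.patient_id = st.patient_id
           left outer join referral as r on r.patient_id = st.patient_id
      where p.suspected_by is not null
        and (p.patient_status = 'SUSPECT' or p.patient_status = 'CONFIRMED')
      group by p.patient_id) as subSet
group by 1;
Type then goes into the Key/Label Expression and Summ into the Value Expression of the pie chart.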