I use PostgreSQL and have records like this on groups of people:
name    | people | indicator
--------+--------+-----------
group 1 |   1000 |         1
group 2 |    100 |         2
group 3 |   2000 |         3
I need to find the indicator for the median person. The result should be
group 3 | 2000 | 3
If I do
select median(name) over (order by indicator) from table1
It will be group 2.
Not sure if I can select this with a window function.
Generating 1000/2000 rows per record seems impractical, because I have millions of people in the records.
Find the first row whose cumulative sum of people exceeds half of the total sum:
with the_data(name, people, indicator) as (
values
('group 1', 1000, 1),
('group 2', 100, 2),
('group 3', 2000, 3)
)
select name, people, indicator
from (
select *, sum(people) over (order by name)
from the_data
cross join (select sum(people)/2 median from the_data) s
) s
where sum > median
order by name
limit 1;
name | people | indicator
---------+--------+-----------
group 3 | 2000 | 3
(1 row)
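Applied to the real table (assuming it is named table1, as in the question), the same pattern would look roughly like this:
select name, people, indicator
from (
    select *,
           sum(people) over (order by name) as running_total
    from table1
    cross join (select sum(people) / 2 as half_total from table1) s
) s
where running_total > half_total
order by name
limit 1;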
I have an unusual problem I'm trying to solve with SQL where I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, while not breaking the sequence (unless the override causes a number to be used greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows where I don't, unioning them together, and somehow using coalesce to get the desired dynamically generated sequence value, or perhaps by using a recursive CTE.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24
step-by-step demo: db<>fiddle
SELECT
*,
COALESCE( -- 3
first_value(override_as_number) OVER w -- 2
, 1
)
+ row_number() OVER w - 1 -- 4, 5
FROM (
SELECT
*,
SUM( -- 1
CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
as grouped
FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
Create a new subpartition within your partitions: this cumulative sum creates a unique group id for every group of records that starts with a non-NULL override_as_number followed by NULL records. So, for instance, your (AAA, d) to (AAA, f) rows belong to the same subpartition/group.
first_value() gives the first value of such a subpartition.
The COALESCE ensures a non-NULL result from the first_value() function if your partition starts with a NULL record.
row_number() - 1 creates a row count within a subpartition, starting with 0.
Adding the row count to the first_value() of a subpartition gives your result: the non-NULL record at the start of a subpartition keeps its value (row count 0), the first NULL record that follows gets that value + 1, and so on, as the worked example below shows.
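Worked through for the AAA rows of the sample data (values taken from the result table shown further below):
secondary_order_by | override_as_number | grouped | first_value | row_number | result
-------------------+--------------------+---------+-------------+------------+--------
a                  | 1                  | 1       | 1           | 1          | 1
b                  |                    | 1       | 1           | 2          | 2
c                  | 3                  | 2       | 3           | 1          | 3
d                  | 3                  | 3       | 3           | 1          | 3
e                  |                    | 3       | 3           | 2          | 4
f                  |                    | 3       | 3           | 3          | 5
g                  | 999                | 4       | 999         | 1          | 999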
The query below gives the exact result, but you should verify it with all combinations:
select c.*, coalesce(c.override_as_number, c.act) as final
from (
    select b.*,
           dense_rank() over (partition by grouped_by order by grouped_by, actual) as act
    from (
        select a.*, coalesce(override_as_number, row_num) as actual
        from (
            select grouped_by,
                   secondary_order_by,
                   dense_rank() over (partition by grouped_by order by grouped_by, secondary_order_by) as row_num,
                   override_as_number,
                   desired_dynamic_number
            from fiddle
        ) a
    ) b
) c;
column "final" is the result
grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
AAA | a | 1 | 1 | 1 | 1 | 1 | 1
AAA | b | 2 | | 2 | 2 | 2 | 2
AAA | c | 3 | 3 | 3 | 3 | 3 | 3
AAA | d | 4 | 3 | 3 | 3 | 3 | 3
AAA | e | 5 | | 4 | 5 | 4 | 4
AAA | f | 6 | | 5 | 6 | 5 | 5
AAA | g | 7 | 999 | 999 | 999 | 6 | 999
XYZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | b | 2 | | 2 | 2 | 2 | 2
(10 rows)
Hope this helps!
The real-world problem I was trying to solve did not have a nicely ordered secondary_order_by column; instead it was something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve, a colleague solved it using a cartesian join, whose solution I'm posting below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in from table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
select row_number() over (order by seq4()) as gen_list
from table(generator(rowcount => 1000))
),
base as (
select *,
IFF(override_as_number IS NULL, row_number() OVER(PARTITION BY grouped_by, override_as_number order by random),override_as_number) as rownum
from "SANDPIT"."TEST"."SAMPLEDATA" order by grouped_by,override_as_number,random
) --select * from base order by grouped_by,random;
,
cart_product as (
select *
from tally_table cross join (Select distinct grouped_by from base ) as distinct_grouped_by
) --select * from cart_product;
,
filter_product as (
select *,
row_number() OVER(partition by cart_product.grouped_by order by cart_product.grouped_by,gen_list) as seq_order
from cart_product
where CONCAT(grouped_by,'~',gen_list) NOT IN (select concat(grouped_by,'~',override_as_number) from base where override_as_number is not null)
) --select * from try2 order by 2,3 ;
select base.grouped_by,
base.random,
base.override_as_number,
base.answer, -- This is hard coded as test data
IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base inner join filter_product on base.rownum = filter_product.seq_order and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
random;
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.
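For reference, a rough PL/pgSQL sketch of such a temp-table-and-cursor approach, using the fiddle's sample table and column names; this is only an illustration of the idea, not the actual code used:
-- Illustration only: assumes the fiddle's table sample(grouped_by,
-- secondary_order_by, override_as_number) and that (grouped_by,
-- secondary_order_by) identifies a row uniquely.
DO $$
DECLARE
    rec       record;
    next_num  integer;
    cur_group text := NULL;
BEGIN
    CREATE TEMP TABLE numbered AS
    SELECT grouped_by, secondary_order_by, override_as_number,
           NULL::integer AS dynamic_number
    FROM sample;
    FOR rec IN SELECT * FROM numbered ORDER BY grouped_by, secondary_order_by LOOP
        -- reset the counter whenever a new group starts
        IF cur_group IS DISTINCT FROM rec.grouped_by THEN
            cur_group := rec.grouped_by;
            next_num  := 1;
        END IF;
        IF rec.override_as_number IS NOT NULL THEN
            -- overridden rows keep their injected number
            UPDATE numbered
            SET dynamic_number = rec.override_as_number
            WHERE grouped_by = rec.grouped_by
              AND secondary_order_by = rec.secondary_order_by;
        ELSE
            -- skip numbers already taken by an override in this group
            WHILE EXISTS (SELECT 1 FROM numbered
                          WHERE grouped_by = rec.grouped_by
                            AND override_as_number = next_num) LOOP
                next_num := next_num + 1;
            END LOOP;
            UPDATE numbered
            SET dynamic_number = next_num
            WHERE grouped_by = rec.grouped_by
              AND secondary_order_by = rec.secondary_order_by;
            next_num := next_num + 1;
        END IF;
    END LOOP;
END $$;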
I have been given a question. The table looks like this:
STATE          | year1 | ... | year10
---------------+-------+-----+-------
AP             |   100 | ... |    120
assam          |    13 | ... |     42
madhya pradesh |   214 | ... |    421
Now, I need to get the top 3 states for each year.
I have tried everything possible, but I am not able to filter the results per column.
You have a design problem. Enumerated columns are almost always a sign of bad design; a normalized layout (sketched below) would keep one row per state and year.
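A hypothetical normalized design (table and column names are illustrative):
create table state_yearly_value (
    state text not null,
    year  int  not null,
    value int  not null,
    primary key (state, year)
);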
For now you can unpivot using unnest and then use the window function row_number to get the top 3 states per year:
with unpivoted as (
select state,
unnest(array[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) as year,
unnest(array[
year_1, year_2, year_3,
year_4, year_5, year_6,
year_7, year_8, year_9,
year_10
]) as value
from your_table
)
select *
from (
select t.*,
row_number() over (
partition by year
order by value desc
) as seqnum
from unpivoted t
) t
where seqnum <= 3;
Demo
I have a table column like this:
Documents
------------
1 2 3
1 2
2 3
1
4 5 1 3
Data under the Documents column indicates the type of document; e.g.
1 indicates passport,
2 indicates school ID, and so on.
I want the column data as separate values with a percentage calculated for each, like this:
Documents percentage
-------------------------
1 10%
2 2%
3 1%
4 25%
5 30%
I want to show the data as separate values with their percentages. Can we build a query in Postgres to achieve this?
You should convert the strings to arrays and unnest them, then you can calculate total and percentages:
create table test (documents text);
insert into test values
('1 2 3'),
('1 2'),
('2 3'),
('1'),
('4 5 1 3');
with docs as (
select doc
from test, unnest(string_to_array(documents, ' ')) doc
),
total as (
select count(*) as total
from docs
)
select doc, count(doc), count(doc)* 100/ total as percentage
from docs, total
group by doc, total
order by 1;
doc | count | percentage
-----+-------+------------
1 | 4 | 33
2 | 3 | 25
3 | 3 | 25
4 | 1 | 8
5 | 1 | 8
(5 rows)
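Note that count(doc) * 100 / total is integer division, so the percentages above are truncated. A small variation with a numeric constant (same test table as above) keeps the fractional part:
with docs as (
  select doc
  from test, unnest(string_to_array(documents, ' ')) doc
),
total as (
  select count(*) as total
  from docs
)
select doc, count(doc), round(count(doc) * 100.0 / total, 1) as percentage
from docs, total
group by doc, total
order by 1;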
with t(s) as ( values
('1 2 3'),('1 2'),('2 3'),('1'),('4 5 1 3')
)
select distinct s,
count(*) over(partition by s) * 100.0 /
count(*) over() as percentage
from (
select regexp_split_to_table(s, ' ') as s
from t
) t
;
s | percentage
---+---------------------
5 | 8.3333333333333333
4 | 8.3333333333333333
3 | 25.0000000000000000
1 | 33.3333333333333333
2 | 25.0000000000000000
I have a table like the following:
X | Y | Z | node
----------------
1 | 2 | 3 | 100
2 | 2 | 3 |
2 | 2 | 4 |
2 | 2 | 5 | 200
3 | 2 | 5 |
4 | 2 | 5 |
5 | 2 | 5 | 300
X, Y, Z are 3D space coordinates of some points; a curve passes through all the corresponding points from the first row to the last row. I need to calculate the curve length between two adjacent points whose "node" columns aren't null.
It would be great if I could directly insert the result into another table that has three columns: "first_node", "second_node", "curve_length".
I don't need to interpolate extra points into the curve; I just need to accumulate the lengths of all the straight lines. For example, to calculate the curve length between nodes 100 and 200, I need to sum the lengths of 3 straight lines: (1,2,3)<->(2,2,3), (2,2,3)<->(2,2,4), (2,2,4)<->(2,2,5)
EDIT
The table has an ID column, which is in increasing order from the first row to the last row.
To get a previous value in SQL, use the lag window function, e.g.
SELECT
x,
lag(x) OVER (ORDER BY id) as prev_x, ...
FROM ...
ORDER BY id;
That lets you get the previous and next points in 3-D space for a given segment. From there you can trivially calculate the line segment length using regular geometric maths.
You'll now have the lengths of each segment (sqlfiddle query). You can use this as input into other queries, using SELECT ... FROM (SELECT ...) subqueries or a CTE (WITH ....) term.
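A rough sketch of that per-segment query (roughly what the linked sqlfiddle does), assuming the question's table is named table1 and has the id column mentioned in the edit:
SELECT
    id,
    node,
    sqrt(
        (lead(x) OVER w - x) ^ 2 +
        (lead(y) OVER w - y) ^ 2 +
        (lead(z) OVER w - z) ^ 2
    ) AS seg_length
FROM table1
WINDOW w AS (ORDER BY id);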
It turns out to be pretty awkward to go from the node segment lengths to node-to-node lengths. You need to create a table that spans the null entries, using a recursive CTE or with a window function.
I landed up with this monstrosity:
SELECT
array_agg(from_id) AS seg_ids,
-- 'max' is used here like 'coalesce' for an aggregate,
-- since non-null is greater than null
max(from_node) AS from_node,
max(to_node) AS to_node,
sum(seg_length) AS seg_length
FROM (
-- lengths of all sub-segments with the null last segment
-- removed and a partition counter added
SELECT
*,
-- A running counter that increments when the
-- node ID changes. Allows us to group by series
-- of nodes in the outer query.
sum(CASE WHEN from_node IS NULL THEN 0 ELSE 1 END) OVER (ORDER BY from_id) AS partition_id
FROM
(
-- lengths of all sub-segments
SELECT
id AS from_id,
lead(id, 1) OVER (ORDER BY id) AS to_id,
-- length of sub-segment
sqrt(
(x - lead(x, 1) OVER (ORDER BY id)) ^ 2 +
(y - lead(y, 1) OVER (ORDER BY id)) ^ 2 +
(z - lead(z, 1) OVER (ORDER BY id)) ^ 2
) AS seg_length,
node AS from_node,
lead(node, 1) OVER (ORDER BY id) AS to_node
FROM
Table1
) sub
-- filter out the last row
WHERE to_id IS NOT NULL
) seglengths
-- Group into series of sub-segments between two nodes
GROUP BY partition_id;
Credit to How do I efficiently select the previous non-null value? for the partition trick.
Result:
 seg_ids | from_node | to_node | seg_length
---------+-----------+---------+------------
 {1,2,3} |       100 |     200 |          3
 {4,5,6} |       200 |     300 |          3
(2 rows)
To insert directly into another table, use INSERT INTO ... SELECT ....
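For example, with a hypothetical target table named node_lengths (its columns come from the question); paste the full grouped query above in place of the placeholder SELECT:
CREATE TABLE node_lengths (
    first_node   integer,
    second_node  integer,
    curve_length double precision
);
WITH segs AS (
    -- placeholder; use the grouped query shown above here
    SELECT 100 AS from_node, 200 AS to_node, 3::double precision AS seg_length
)
INSERT INTO node_lengths (first_node, second_node, curve_length)
SELECT from_node, to_node, seg_length
FROM segs;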
I have written a query in which one column is a month. From that I have to get min month, max month, and median month. Below is my query.
select ext.employee,
pl.fromdate,
ext.FULL_INC as full_inc,
prevExt.FULL_INC as prevInc,
(extract(year from age (pl.fromdate))*12 +extract(month from age (pl.fromdate))) as month,
case
when prevExt.FULL_INC is not null then (ext.FULL_INC -coalesce(prevExt.FULL_INC,0))
else 0
end as difference,
(case when prevExt.FULL_INC is not null then (ext.FULL_INC - prevExt.FULL_INC) / prevExt.FULL_INC*100 else 0 end) as percent
from pl_payroll pl
inner join pl_extpayfile ext
on pl.cid = ext.payrollid
and ext.FULL_INC is not null
left outer join pl_extpayfile prevExt
on prevExt.employee = ext.employee
and prevExt.cid = (select max (cid) from pl_extpayfile
where employee = prevExt.employee
and payrollid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = prevExt.employee
and p.fromdate < pl.fromdate
))
and coalesce(prevExt.FULL_INC, 0) > 0
where ext.employee = 17
and (exists (
select employee
from pl_extpayfile preext
where preext.employee = ext.employee
and preext.FULL_INC <> ext.FULL_INC
and payrollid in (
select cid
from pl_payroll
where cid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = preext.employee
and p.fromdate < pl.fromdate
)
)
)
or not exists (
select employee
from pl_extpayfile fext,
pl_payroll p
where fext.employee = ext.employee
and p.cid = fext.payrollid
and p.fromdate < pl.fromdate
and fext.FULL_INC > 0
)
)
order by employee,
ext.payrollid desc
If it is not possible, then is it possible to get max month and min month?
To calculate the median in PostgreSQL, simply take the 50% percentile (no need to add extra functions or anything):
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY x) FROM t;
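So for the question's case, min, max and the median month can all come from one pass; a sketch against a hypothetical table t with a month column:
SELECT min(month) AS min_month,
       max(month) AS max_month,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY month) AS median_month
FROM t;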
You want the aggregate functions named min and max. See the PostgreSQL documentation and tutorial:
http://www.postgresql.org/docs/current/static/tutorial-agg.html
http://www.postgresql.org/docs/current/static/functions-aggregate.html
There's no built-in median aggregate function in PostgreSQL (though percentile_cont, shown above, now covers this); however, one has been implemented and contributed to the wiki:
http://wiki.postgresql.org/wiki/Aggregate_Median
It's used the same way as min and max once you've loaded it. Being written in PL/PgSQL it'll be a fair bit slower, but there's even a C version there that you could adapt if speed was vital.
UPDATE After comment:
It sounds like you want to show the statistical aggregates alongside the individual results. You can't do this with a plain aggregate function because you can't reference columns not in the GROUP BY in the result list.
You will need to fetch the stats from subqueries, or use your aggregates as window functions.
Given dummy data:
CREATE TABLE dummystats ( depname text, empno integer, salary integer );
INSERT INTO dummystats(depname,empno,salary) VALUES
('develop',11,5200),
('develop',7,4200),
('personell',2,5555),
('mgmt',1,9999999);
... and after adding the median aggregate from the PG wiki:
You can do this with an ordinary aggregate:
regress=# SELECT min(salary), max(salary), median(salary) FROM dummystats;
min | max | median
------+---------+----------------------
4200 | 9999999 | 5377.5000000000000000
(1 row)
but not this:
regress=# SELECT depname, empno, min(salary), max(salary), median(salary)
regress-# FROM dummystats;
ERROR: column "dummystats.depname" must appear in the GROUP BY clause or be used in an aggregate function
because it doesn't make sense in the aggregation model to show the aggregates alongside individual values. You can show groups:
regress=# SELECT depname, min(salary), max(salary), median(salary)
regress-# FROM dummystats GROUP BY depname;
depname | min | max | median
-----------+---------+---------+-----------------------
personell | 5555 | 5555 | 5555.0000000000000000
develop | 4200 | 5200 | 4700.0000000000000000
mgmt | 9999999 | 9999999 | 9999999.000000000000
(3 rows)
... but it sounds like you want the individual values. For that, you must use a window, a feature new in PostgreSQL 8.4.
regress=# SELECT depname, empno,
min(salary) OVER (),
max(salary) OVER (),
median(salary) OVER ()
FROM dummystats;
depname | empno | min | max | median
-----------+-------+------+---------+-----------------------
develop | 11 | 4200 | 9999999 | 5377.5000000000000000
develop | 7 | 4200 | 9999999 | 5377.5000000000000000
personell | 2 | 4200 | 9999999 | 5377.5000000000000000
mgmt | 1 | 4200 | 9999999 | 5377.5000000000000000
(4 rows)
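For comparison, the subquery alternative mentioned above would look roughly like this, reusing the same dummy data and the wiki's median aggregate:
SELECT d.depname, d.empno, s.min, s.max, s.median
FROM dummystats d
CROSS JOIN (
    SELECT min(salary), max(salary), median(salary)
    FROM dummystats
) s;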
See also:
http://www.postgresql.org/docs/current/static/tutorial-window.html
http://www.postgresql.org/docs/current/static/functions-window.html
One more option for median:
SELECT x
FROM table
ORDER BY x
LIMIT 1 OFFSET (SELECT count(*) FROM table) / 2
To find Median:
For instance, consider a table with 6000 rows. First we need to take half of the rows from the original table (because the median is always the middle value); half of 6000 is 3000 (take 3001 to get the two exact middle values).
SELECT *
FROM (SELECT column_name
      FROM Table_name
      ORDER BY column_name
      LIMIT 3001) AS Table1
ORDER BY column_name DESC -- DESC (Z-A) puts the last of those 3001 rows first
LIMIT 2;                  -- i.e. the 3000th and 3001st rows, the two middle values