Designate groups of consecutive equal values in an ordered dataset - postgresql

I'm trying to get the output to look like below. The problem is that I can't do a first_value or RANK because when I partition by event and order by time, then it doesn't break them up in that order. I need them to order by time first and then partition each time.

One of the known solutions
Use lag() to mark rows when event changes and cumulative sum() to designate groups, e.g.:
with my_table(event, time) as (
values
('A', '12:01'),
('A', '12:02'),
('B', '12:03'),
('A', '12:04'),
('A', '12:05'),
('B', '12:06'),
('B', '12:07'),
('A', '12:08')
)
select
event,
time,
sum(change) over (order by time) as "desired row number"
from (
select
event,
time,
(event is distinct from lag(event) over (order by time))::int as change
from my_table
) s
event | time | desired row number
-------+-------+--------------------
A | 12:01 | 1
A | 12:02 | 1
B | 12:03 | 2
A | 12:04 | 3
A | 12:05 | 3
B | 12:06 | 4
B | 12:07 | 4
A | 12:08 | 5
(8 rows)
Custom aggregate
It would be nice to have the function:
select *, group_number(event) over (order by time)
from my_table;
This can be done with the custom aggregate:
create type group_number_internal as (number int, lag text);
create or replace function group_number_transition(group_number_internal, anyelement)
returns group_number_internal language sql strict as $$
select
case
when $2::text is distinct from $1.lag then $1.number+ 1
else $1.number
end,
$2::text
$$;
create or replace function group_number_final(group_number_internal)
returns int language sql as $$
select $1.number
$$;
create aggregate group_number(anyelement) (
sfunc = group_number_transition,
stype = group_number_internal,
finalfunc = group_number_final,
initcond = '(0, null)'
);
Test it in rextester.

Related

combine different sorting result in one query

tasktime
id | name | start_date | end_date ...
1 | a | 2016-12-22 | 2017-01-01
2 | b | 2016-05-01 | 2016-05-31
3 | c | 2016-06-01 | 2016-12-25
should I use group or ..
I tried below query get result: 1 2 3
even if I change start_date asc or end_date desc nothing happen,
SELECT
tt.*
FROM tasktime tt
ORDER BY tt.name asc NULLS LAST
, tt.start_date desc NULLS LAST
, tt.end_date asc NULLS LAST
UPDATE
I want combine different sorting result
SELECT
tt.*
FROM tasktime tt
ORDER BY tt.end_date asc NULLS LAST
then use above result
ORDER BY tt.start_date desc NULLS LAST
then use above result
ORDER BY tt.name asc NULLS LAST
please close this question ... I realised what I want , and this question is totally wrong
Like here?
t=# create table tasktime (id int, name text, start_date date, end_date date);
CREATE TABLE
t=# insert into tasktime values (1,'a','2016-12-22', '2017-01-01'), (2, 'b', '2016-05-01', '2016-05-31'), (3, 'c', '2016-06-01','2016-12-25');
INSERT 0 3
t=# SELECT
t-# tt.*
t-# FROM tasktime tt
t-# order by tt.end_date asc NULLS LAST
t-# , tt.start_date desc NULLS LAST
t-# , tt.name asc NULLS LAST;
id | name | start_date | end_date
----+------+------------+------------
2 | b | 2016-05-01 | 2016-05-31
3 | c | 2016-06-01 | 2016-12-25
1 | a | 2016-12-22 | 2017-01-01
(3 rows)

Coalesce overlapping time ranges in PostgreSQL

I have a PostgreSQL (9.4) table that contains time stamp ranges and user IDs, and I need to collapse any overlapping ranges (with the same user ID) into a single record.
I've tried a complicated set of CTEs to accomplish this, but there are some edge cases in our (40,000+ rows) real table that complicate matters. I've come to the conclusion that I probably need a recursive CTE, but I haven't had any luck writing it.
Here's some code to create a test table and populate it with data. This isn't the exact layout of our table, but it's close enough for an example.
CREATE TABLE public.test
(
id serial,
sessionrange tstzrange,
fk_user_id integer
);
insert into test (sessionrange, fk_user_id)
values
('[2016-01-14 11:57:01-05,2016-01-14 12:06:59-05]', 1)
,('[2016-01-14 12:06:53-05,2016-01-14 12:17:28-05]', 1)
,('[2016-01-14 12:17:24-05,2016-01-14 12:21:56-05]', 1)
,('[2016-01-14 18:18:00-05,2016-01-14 18:42:09-05]', 2)
,('[2016-01-14 18:18:08-05,2016-01-14 18:18:15-05]', 1)
,('[2016-01-14 18:38:12-05,2016-01-14 18:48:20-05]', 1)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 1)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 1)
,('[2016-01-14 18:18:12-05,2016-01-14 18:18:20-05]', 3)
,('[2016-01-14 19:32:12-05,2016-01-14 23:18:20-05]', 3)
,('[2016-01-14 18:18:16-05,2016-01-14 18:18:26-05]', 4)
,('[2016-01-14 18:18:24-05,2016-01-14 18:18:31-05]', 2);
I have found that I can do this to get the sessions sorted by the time they started:
select * from test order by fk_user_id, sessionrange
I could use this to determine whether an individual record overlaps with the previous, using window functions:
SELECT *, sessionrange && lag(sessionrange) OVER (PARTITION BY fk_user_id ORDER BY sessionrange)
FROM test
ORDER BY fk_user_id, sessionrange
But this only detects whether the single previous record overlaps the current one (see the record where id = 6). I need to detect all the way back to the beginning of the partition.
After that, I'd need to group any records that overlap together, to find the beginning of the earliest session and the end of the last session to terminate.
I'm sure there's a way to do this that I'm overlooking. How can I collapse these overlapping records?
It is relatively easy to merge overlapping ranges as elements of an array. For simplicity the following function returns set of tstzrange:
create or replace function merge_ranges(tstzrange[])
returns setof tstzrange language plpgsql as $$
declare
t tstzrange;
r tstzrange;
begin
foreach t in array $1 loop
if r && t then r:= r + t;
else
if r notnull then return next r;
end if;
r:= t;
end if;
end loop;
if r notnull then return next r;
end if;
end $$;
Just aggregate the ranges for a user and use the function:
select fk_user_id, merge_ranges(array_agg(sessionrange))
from test
group by 1
order by 1, 2
fk_user_id | merge_ranges
------------+-----------------------------------------------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"]
1 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"]
1 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"]
1 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"]
2 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"]
3 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"]
3 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"]
4 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"]
(8 rows)
Alternatively, the algorithm can be applied to the entire table in one function loop. I'm not sure but for a large dataset this method should be faster.
create or replace function merge_ranges_in_test()
returns setof test language plpgsql as $$
declare
curr test;
prev test;
begin
for curr in
select *
from test
order by fk_user_id, sessionrange
loop
if prev notnull and prev.fk_user_id <> curr.fk_user_id then
return next prev;
prev:= null;
end if;
if prev.sessionrange && curr.sessionrange then
prev.sessionrange:= prev.sessionrange + curr.sessionrange;
else
if prev notnull then
return next prev;
end if;
prev:= curr;
end if;
end loop;
return next prev;
end $$;
Results:
select *
from merge_ranges_in_test();
id | sessionrange | fk_user_id
----+-----------------------------------------------------+------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] | 1
5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] | 1
7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] | 1
6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] | 1
4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] | 2
9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] | 3
10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] | 3
11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] | 4
(8 rows)
The problem is very interesting. I've tried to find a recursive solution but it seems the procedural attempt is most natural and efficient.
I have finally found a recursive solution. The query deletes overlapping rows and inserts their compacted equivalent:
with recursive cte (user_id, ids, range) as (
select t1.fk_user_id, array[t1.id, t2.id], t1.sessionrange + t2.sessionrange
from test t1
join test t2
on t1.fk_user_id = t2.fk_user_id
and t1.id < t2.id
and t1.sessionrange && t2.sessionrange
union all
select user_id, ids || t.id, range + sessionrange
from cte
join test t
on user_id = t.fk_user_id
and ids[cardinality(ids)] < t.id
and range && t.sessionrange
),
list as (
select distinct on(id) id, range, user_id
from cte, unnest(ids) id
order by id, upper(range)- lower(range) desc
),
deleted as (
delete from test
where id in (select id from list)
)
insert into test
select distinct on (range) id, range, user_id
from list
order by range, id;
Results:
select *
from test
order by 3, 2;
id | sessionrange | fk_user_id
----+-----------------------------------------------------+------------
1 | ["2016-01-14 17:57:01+01","2016-01-14 18:21:56+01"] | 1
5 | ["2016-01-15 00:18:08+01","2016-01-15 00:18:15+01"] | 1
7 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:31+01"] | 1
6 | ["2016-01-15 00:38:12+01","2016-01-15 00:48:20+01"] | 1
4 | ["2016-01-15 00:18:00+01","2016-01-15 00:42:09+01"] | 2
9 | ["2016-01-15 00:18:12+01","2016-01-15 00:18:20+01"] | 3
10 | ["2016-01-15 01:32:12+01","2016-01-15 05:18:20+01"] | 3
11 | ["2016-01-15 00:18:16+01","2016-01-15 00:18:26+01"] | 4
(8 rows)

Can window function LAG reference the column which value is being calculated?

I need to calculate value of some column X based on some other columns of the current record and the value of X for the previous record (using some partition and order). Basically I need to implement query in the form
SELECT <some fields>,
<some expression using LAG(X) OVER(PARTITION BY ... ORDER BY ...) AS X
FROM <table>
This is not possible because only existing columns can be used in window function so I'm looking way how to overcome this.
Here is an example. I have a table with events. Each event has type and time_stamp.
create table event (id serial, type integer, time_stamp integer);
I wan't to find "duplicate" events (to skip them). By duplicate I mean the following. Let's order all events for given type by time_stamp ascending. Then
the first event is not a duplicate
all events that follow non duplicate and are within some time frame after it (that is their time_stamp is not greater then time_stamp of the previous non duplicate plus some constant TIMEFRAME) are duplicates
the next event which time_stamp is greater than previous non duplicate by more than TIMEFRAME is not duplicate
and so on
For this data
insert into event (type, time_stamp)
values
(1, 1), (1, 2), (2, 2), (1,3), (1, 10), (2,10),
(1,15), (1, 21), (2,13),
(1, 40);
and TIMEFRAME=10 result should be
time_stamp | type | duplicate
-----------------------------
1 | 1 | false
2 | 1 | true
3 | 1 | true
10 | 1 | true
15 | 1 | false
21 | 1 | true
40 | 1 | false
2 | 2 | false
10 | 2 | true
13 | 2 | false
I could calculate the value of duplicate field based on current time_stamp and time_stamp of the previous non-duplicate event like this:
WITH evt AS (
SELECT
time_stamp,
CASE WHEN
time_stamp - LAG(current_non_dupl_time_stamp) OVER w >= TIMEFRAME
THEN
time_stamp
ELSE
LAG(current_non_dupl_time_stamp) OVER w
END AS current_non_dupl_time_stamp
FROM event
WINDOW w AS (PARTITION BY type ORDER BY time_stamp ASC)
)
SELECT time_stamp, time_stamp != current_non_dupl_time_stamp AS duplicate
But this does not work because the field which is calculated cannot be referenced in LAG:
ERROR: column "current_non_dupl_time_stamp" does not exist.
So the question: can I rewrite this query to achieve the effect I need?
Naive recursive chain knitter:
-- temp view to avoid nested CTE
CREATE TEMP VIEW drag AS
SELECT e.type,e.time_stamp
, ROW_NUMBER() OVER www as rn -- number the records
, FIRST_VALUE(e.time_stamp) OVER www as fst -- the "group leader"
, EXISTS (SELECT * FROM event x
WHERE x.type = e.type
AND x.time_stamp < e.time_stamp) AS is_dup
FROM event e
WINDOW www AS (PARTITION BY type ORDER BY time_stamp)
;
WITH RECURSIVE ttt AS (
SELECT d0.*
FROM drag d0 WHERE d0.is_dup = False -- only the "group leaders"
UNION ALL
SELECT d1.type, d1.time_stamp, d1.rn
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN d1.time_stamp
ELSE ttt.fst END AS fst -- new "group leader"
, CASE WHEN d1.time_stamp - ttt.fst > 20 THEN False
ELSE True END AS is_dup
FROM drag d1
JOIN ttt ON d1.type = ttt.type AND d1.rn = ttt.rn+1
)
SELECT * FROM ttt
ORDER BY type, time_stamp
;
Results:
CREATE TABLE
INSERT 0 10
CREATE VIEW
type | time_stamp | rn | fst | is_dup
------+------------+----+-----+--------
1 | 1 | 1 | 1 | f
1 | 2 | 2 | 1 | t
1 | 3 | 3 | 1 | t
1 | 10 | 4 | 1 | t
1 | 15 | 5 | 1 | t
1 | 21 | 6 | 1 | t
1 | 40 | 7 | 40 | f
2 | 2 | 1 | 2 | f
2 | 10 | 2 | 2 | t
2 | 13 | 3 | 2 | t
(10 rows)
An alternative to a recursive approach is a custom aggregate. Once you master the technique of writing your own aggregates, creating transition and final functions is easy and logical.
State transition function:
create or replace function is_duplicate(st int[], time_stamp int, timeframe int)
returns int[] language plpgsql as $$
begin
if st is null or st[1] + timeframe <= time_stamp
then
st[1] := time_stamp;
end if;
st[2] := time_stamp;
return st;
end $$;
Final function:
create or replace function is_duplicate_final(st int[])
returns boolean language sql as $$
select st[1] <> st[2];
$$;
Aggregate:
create aggregate is_duplicate_agg(time_stamp int, timeframe int)
(
sfunc = is_duplicate,
stype = int[],
finalfunc = is_duplicate_final
);
Query:
select *, is_duplicate_agg(time_stamp, 10) over w
from event
window w as (partition by type order by time_stamp asc)
order by type, time_stamp;
id | type | time_stamp | is_duplicate_agg
----+------+------------+------------------
1 | 1 | 1 | f
2 | 1 | 2 | t
4 | 1 | 3 | t
5 | 1 | 10 | t
7 | 1 | 15 | f
8 | 1 | 21 | t
10 | 1 | 40 | f
3 | 2 | 2 | f
6 | 2 | 10 | t
9 | 2 | 13 | f
(10 rows)
Read in the documentation: 37.10. User-defined Aggregates and CREATE AGGREGATE.
This feels more like a recursive problem than windowing function. The following query obtained the desired results:
WITH RECURSIVE base(type, time_stamp) AS (
-- 3. base of recursive query
SELECT x.type, x.time_stamp, y.next_time_stamp
FROM
-- 1. start with the initial records of each type
( SELECT type, min(time_stamp) AS time_stamp
FROM event
GROUP BY type
) x
LEFT JOIN LATERAL
-- 2. for each of the initial records, find the next TIMEFRAME (10) in the future
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = x.type
AND time_stamp > (x.time_stamp + 10)
) y ON true
UNION ALL
-- 4. recursive join, same logic as base
SELECT e.type, e.time_stamp, z.next_time_stamp
FROM event e
JOIN base b ON (e.type = b.type AND e.time_stamp = b.next_time_stamp)
LEFT JOIN LATERAL
( SELECT MIN(time_stamp) next_time_stamp
FROM event
WHERE type = e.type
AND time_stamp > (e.time_stamp + 10)
) z ON true
)
-- The actual query:
-- 5a. All records from base are not duplicates
SELECT time_stamp, type, false
FROM base
UNION
-- 5b. All records from event that are not in base are duplicates
SELECT time_stamp, type, true
FROM event
WHERE (type, time_stamp) NOT IN (SELECT type, time_stamp FROM base)
ORDER BY type, time_stamp
There are a lot of caveats with this. It assumes no duplicate time_stamp for a given type. Really the joins should be based on a unique id rather than type and time_stamp. I didn't test this much, but it may at least suggest an approach.
This is my first time to try a LATERAL join. So there may be a way to simplify that moe. Really what I wanted to do was a recursive CTE with the recursive part using MIN(time_stamp) based on time_stamp > (x.time_stamp + 10), but aggregate functions are not allowed in CTEs in that manner. But it seems the lateral join can be used in the CTE.

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates. Namely, 354 records for 2002 (instead of 365). For my calculations, I need to have the missing data in the table with Null values
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
you see that 2002-05-08 is missing. I want my final table to be like:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as identifier, but that doesn't make it a good idea. I use thedate as column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as #Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what #a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like #Clodoaldo demonstrated - but his query suffers from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps. This will not reorder the IDs:
insert into t (rainfall, "date") values
select null, "date"
from (
select d::date as "date"
from (
t
right join
generate_series(
(select date_trunc('year', min("date")) from t)::timestamp,
(select max("date") from t),
'1 day'
) s(d) on t."date" = s.d::date
where t."date" is null
) q
) s
You have to fully re-create your table as indexes haves to change.
The better way to do it is to use your prefered dbi language, make a loop ignoring ID and putting values in a new table with new serialized IDs.
for day in (whole needed calendar)
value = select rainfall from oldbrokentable where date = day
insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! Just conceptual to be adapted to your prefered scripting language)

How do I get min, median and max from my query in postgresql?

I have written a query in which one column is a month. From that I have to get min month, max month, and median month. Below is my query.
select ext.employee,
pl.fromdate,
ext.FULL_INC as full_inc,
prevExt.FULL_INC as prevInc,
(extract(year from age (pl.fromdate))*12 +extract(month from age (pl.fromdate))) as month,
case
when prevExt.FULL_INC is not null then (ext.FULL_INC -coalesce(prevExt.FULL_INC,0))
else 0
end as difference,
(case when prevExt.FULL_INC is not null then (ext.FULL_INC - prevExt.FULL_INC) / prevExt.FULL_INC*100 else 0 end) as percent
from pl_payroll pl
inner join pl_extpayfile ext
on pl.cid = ext.payrollid
and ext.FULL_INC is not null
left outer join pl_extpayfile prevExt
on prevExt.employee = ext.employee
and prevExt.cid = (select max (cid) from pl_extpayfile
where employee = prevExt.employee
and payrollid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = prevExt.employee
and p.fromdate < pl.fromdate
))
and coalesce(prevExt.FULL_INC, 0) > 0
where ext.employee = 17
and (exists (
select employee
from pl_extpayfile preext
where preext.employee = ext.employee
and preext.FULL_INC <> ext.FULL_INC
and payrollid in (
select cid
from pl_payroll
where cid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = preext.employee
and p.fromdate < pl.fromdate
)
)
)
or not exists (
select employee
from pl_extpayfile fext,
pl_payroll p
where fext.employee = ext.employee
and p.cid = fext.payrollid
and p.fromdate < pl.fromdate
and fext.FULL_INC > 0
)
)
order by employee,
ext.payrollid desc
If it is not possible, than is it possible to get max month and min month?
To calculate the median in PostgreSQL, simply take the 50% percentile (no need to add extra functions or anything):
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY x) FROM t;
You want the aggregate functions named min and max. See the PostgreSQL documentation and tutorial:
http://www.postgresql.org/docs/current/static/tutorial-agg.html
http://www.postgresql.org/docs/current/static/functions-aggregate.html
There's no built-in median in PostgreSQL, however one has been implemented and contributed to the wiki:
http://wiki.postgresql.org/wiki/Aggregate_Median
It's used the same way as min and max once you've loaded it. Being written in PL/PgSQL it'll be a fair bit slower, but there's even a C version there that you could adapt if speed was vital.
UPDATE After comment:
It sounds like you want to show the statistical aggregates alongside the individual results. You can't do this with a plain aggregate function because you can't reference columns not in the GROUP BY in the result list.
You will need to fetch the stats from subqueries, or use your aggregates as window functions.
Given dummy data:
CREATE TABLE dummystats ( depname text, empno integer, salary integer );
INSERT INTO dummystats(depname,empno,salary) VALUES
('develop',11,5200),
('develop',7,4200),
('personell',2,5555),
('mgmt',1,9999999);
... and after adding the median aggregate from the PG wiki:
You can do this with an ordinary aggregate:
regress=# SELECT min(salary), max(salary), median(salary) FROM dummystats;
min | max | median
------+---------+----------------------
4200 | 9999999 | 5377.5000000000000000
(1 row)
but not this:
regress=# SELECT depname, empno, min(salary), max(salary), median(salary)
regress-# FROM dummystats;
ERROR: column "dummystats.depname" must appear in the GROUP BY clause or be used in an aggregate function
because it doesn't make sense in the aggregation model to show the averages alongside individual values. You can show groups:
regress=# SELECT depname, min(salary), max(salary), median(salary)
regress-# FROM dummystats GROUP BY depname;
depname | min | max | median
-----------+---------+---------+-----------------------
personell | 5555 | 5555 | 5555.0000000000000000
develop | 4200 | 5200 | 4700.0000000000000000
mgmt | 9999999 | 9999999 | 9999999.000000000000
(3 rows)
... but it sounds like you want the individual values. For that, you must use a window, a feature new in PostgreSQL 8.4.
regress=# SELECT depname, empno,
min(salary) OVER (),
max(salary) OVER (),
median(salary) OVER ()
FROM dummystats;
depname | empno | min | max | median
-----------+-------+------+---------+-----------------------
develop | 11 | 4200 | 9999999 | 5377.5000000000000000
develop | 7 | 4200 | 9999999 | 5377.5000000000000000
personell | 2 | 4200 | 9999999 | 5377.5000000000000000
mgmt | 1 | 4200 | 9999999 | 5377.5000000000000000
(4 rows)
See also:
http://www.postgresql.org/docs/current/static/tutorial-window.html
http://www.postgresql.org/docs/current/static/functions-window.html
One more option for median:
SELECT x
FROM table
ORDER BY x
LIMIT 1 offset (select count(*) from x)/2
To find Median:
for instance consider that we have 6000 rows present in the table.First we need to take half rows from the original Table (because we know that median is always the middle value) so here half of 6000 is 3000(Take 3001 for getting exact two middle value).
SELECT *
FROM (SELECT column_name
FROM Table_name
ORDER BY column_name
LIMIT 3001)As Table1
ORDER BY column_name DESC ---->Look here we used DESC(Z-A)it will display the last
-- two values(using LIMIT 2) i.e (3000th row and 3001th row) from 6000
-- rows
LIMIT 2;