Window functions and more "local" aggregation - postgresql

Suppose I have this table:
select * from window_test;
k | v
---+---
a | 1
a | 2
b | 3
a | 4
Ultimately I want to get:
k | min_v | max_v
---+-------+-------
a | 1 | 2
b | 3 | 3
a | 4 | 4
But I would be just as happy to get this (since I can easily filter it with distinct):
k | min_v | max_v
---+-------+-------
a | 1 | 2
a | 1 | 2
b | 3 | 3
a | 4 | 4
Is it possible to achieve this with PostgreSQL 9.1+ window functions? I'm trying to understand whether I can get it to use a separate partition for the first and the last run of k=a in this sample (ordered by v).

This returns your desired result with the sample data. Not sure if it will work for real world data:
select k,
min(v) over (partition by group_nr) as min_v,
max(v) over (partition by group_nr) as max_v
from (
select *,
sum(group_flag) over (order by v,k) as group_nr
from (
select *,
case
when lag(k) over (order by v) = k then null
else 1
end as group_flag
from window_test
) t1
) t2
order by min_v;
I left out the DISTINCT though.
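To make the flag-and-running-sum idea easier to follow outside SQL, here is a minimal Python sketch of the same steps. The function name `island_ranges` is made up for illustration, and the rows are the sample data from the question; it is a sketch of the technique, not a replacement for the query.

```python
from itertools import accumulate

# Sample rows from the question, as (k, v) pairs.
rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]


def island_ranges(rows):
    """Return (k, min_v, max_v) for each run of equal k, ordered by v."""
    ordered = sorted(rows, key=lambda r: r[1])
    # group_flag: 1 when k differs from the previous row's k (the lag(k) test), else 0.
    flags = [
        1 if i == 0 or ordered[i - 1][0] != k else 0
        for i, (k, _v) in enumerate(ordered)
    ]
    # group_nr: running sum of the flags, like sum(group_flag) over (order by v).
    group_nrs = list(accumulate(flags))
    result = {}
    for (k, v), nr in zip(ordered, group_nrs):
        _k, lo, hi = result.get(nr, (k, v, v))
        result[nr] = (k, min(lo, v), max(hi, v))
    return [result[nr] for nr in sorted(result)]


print(island_ranges(rows))  # [('a', 1, 2), ('b', 3, 3), ('a', 4, 4)]
```

Each row where k changes starts a new group number, so the two runs of k=a end up in different groups even though they share the same key.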

EDIT: I've come up with the following query, which uses no window functions at all:
WITH RECURSIVE tree AS (
SELECT k, v, ''::text as next_k, 0 as next_v, 0 AS level FROM window_test
UNION ALL
SELECT c.k, c.v, t.k, t.v + level, t.level + 1
FROM tree t JOIN window_test c ON c.k = t.k AND c.v + 1 = t.v),
partitions AS (
SELECT t.k, t.v, t.next_k,
coalesce(nullif(t.next_v, 0), t.v) AS next_v, t.level
FROM tree t
WHERE NOT EXISTS (SELECT 1 FROM tree WHERE next_k = t.k AND next_v = t.v))
SELECT min(k) AS k, v AS min_v, max(next_v) AS max_v
FROM partitions p
GROUP BY v
ORDER BY 2;
I've provided 2 working queries now; I hope one of them will suit you.
SQL Fiddle for this variant.
Another way to achieve this is to use a support sequence.
Create a support sequence:
CREATE SEQUENCE wt_rank START WITH 1;
The query:
WITH source AS (
SELECT k, v,
coalesce(lag(k) OVER (ORDER BY v), k) AS prev_k
FROM window_test
CROSS JOIN (SELECT setval('wt_rank', 1)) AS ri),
ranking AS (
SELECT k, v, prev_k,
CASE WHEN k = prev_k THEN currval('wt_rank')
ELSE nextval('wt_rank') END AS rank
FROM source)
SELECT r.k, min(s.v) AS min_v, max(s.v) AS max_v
FROM ranking r
JOIN source s ON r.v = s.v
GROUP BY r.rank, r.k
ORDER BY 2;

Would this not do the job for you, without the need for windows, partitions or coalescing? It just uses a traditional SQL trick for finding nearest tuples via a self join, and a min on the difference. The middle level groups by k and v only, so lim is the v of the nearest later row with a different k:
SELECT k, min(v), max(v) FROM (
SELECT k, v, v + min(d) lim FROM (
SELECT x.*, y.v - x.v d FROM window_test x
LEFT JOIN window_test y ON x.k <> y.k AND y.v - x.v > 0)
z GROUP BY k, v)
w GROUP BY k, lim ORDER BY 2;
I think this is probably a more 'relational' solution, but I'm not sure about its efficiency.
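As a rough check of the self-join idea, the sketch below runs it against an in-memory SQLite database from Python. Note the assumptions: the sample data adds a third key `c`, which is not in the question, and the middle query groups by k and v only, so that each row gets a single lim value even when several different keys follow it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE window_test (k TEXT, v INTEGER)")
# Hypothetical data: the question's pattern extended with a third key.
conn.executemany(
    "INSERT INTO window_test VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)],
)

islands = conn.execute(
    """
    SELECT k, MIN(v) AS min_v, MAX(v) AS max_v
    FROM (
        -- lim = v of the nearest later row whose k differs (NULL for the last run)
        SELECT k, v, v + MIN(d) AS lim
        FROM (
            SELECT x.k, x.v, y.v - x.v AS d
            FROM window_test AS x
            LEFT JOIN window_test AS y
                   ON x.k <> y.k AND y.v - x.v > 0
        ) AS z
        GROUP BY k, v
    ) AS w
    GROUP BY k, lim
    ORDER BY min_v
    """
).fetchall()

print(islands)  # [('a', 1, 2), ('b', 3, 3), ('c', 4, 4), ('a', 5, 5)]
```

The self join compares every row with every later row, so the work grows roughly quadratically with the table size, which is the efficiency concern raised above.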

Related

calculate rank without using rank or rownums function by using single column

Do not use any functions like rank or rownums.
Hint: Formulate matrix operation using sql. A rank of an item indicates how many items are less than or equal to it.
A matrix can be simulated by cross join and rank can be derived by
counting items smaller than the current item.
Table A:-
x
----
d
b
a
g
c
k
k
g
Expected output:
x1 | rank
----+------
a | 1
b | 2
d | 3
g | 4
c | 5
k | 6
select x as x1, count(x) as rank
from (select DISTINCT x from A order by x) as sub
Your current query is on the right track, using a distinct subquery. For a working version, use a correlated subquery in the select clause which takes counts:
SELECT
x AS x1,
(SELECT COUNT(DISTINCT x) FROM A t WHERE t.x <= sub.x) rank
FROM (SELECT DISTINCT x FROM A) AS sub
ORDER BY
x;
Demo
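For readers who want to try the counting approach without a PostgreSQL instance, here is a small sketch that runs the same correlated subquery in SQLite from Python. The alias `rnk` is used instead of `rank` purely as a precaution in this sketch; the table contents are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (x TEXT)")
# Same values as Table A in the question, duplicates included.
conn.executemany("INSERT INTO A (x) VALUES (?)", [(c,) for c in "dbagckkg"])

ranked = conn.execute(
    """
    SELECT sub.x AS x1,
           -- rank = number of distinct values less than or equal to this one
           (SELECT COUNT(DISTINCT t.x) FROM A AS t WHERE t.x <= sub.x) AS rnk
    FROM (SELECT DISTINCT x FROM A) AS sub
    ORDER BY sub.x
    """
).fetchall()

print(ranked)  # [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('g', 5), ('k', 6)]
```

Counting distinct values gives a dense rank, so the duplicated `g` and `k` rows do not leave gaps in the numbering.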

How can I get only one row if only one column is different

I have a table with patient IDs, contact dates and action codes. I want to retrieve all rows with an actioncode equal to EPS or D; however, I want to keep only one row if both action codes exist on the same contactdate.
For example this is part of my table, journal:
PatientID Contactdate Actioncode
1 2010-5-6 EPS
1 2010-5-6 D
1 2012-3-4 CNT
1 2013-7-8 D
2 2010-1-4 EPS
2 2010-5-6 D
This is the code I have now to retrieve all rows where actioncode is either EPS or D:
select * from journal j where j.actioncode in ('EPS', 'D')
I tried group by contactdate, but then I miss the rows where the patients are different. The same effect occurs with distinct(contactdate). What can I use here to return only one row when the date and the patientid are the same and the actioncode is either D or EPS?
Preferred outcome:
PatientID Contactdate Actioncode
1 2010-5-6 D
1 2013-7-8 D
2 2010-1-4 EPS
2 2010-5-6 D
We can try using ROW_NUMBER here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY PatientID, Contactdate
ORDER BY Actioncode) rn
FROM journal
WHERE Actioncode in ('EPS', 'D')
)
SELECT PatientID, Contactdate, Actioncode
FROM cte
WHERE rn = 1;
Because 'D' sorts before 'EPS', this always retains the Actioncode = 'D' record when both action codes appear for the same patient and date. If you would rather retain the EPS record, change the call to ROW_NUMBER to use ORDER BY Actioncode DESC.
What you want is a GROUP BY two columns: PatientID and Contactdate. You can use MAX() or MIN() to select one of the rows.
select
j.PatientID,
j.Contactdate,
MIN(j.actionCode)
from
journal j
where j.actioncode in ('EPS', 'D')
group by j.PatientID, j.Contactdate
To match your preferred outcome, use MIN().
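The GROUP BY approach can be checked quickly with SQLite from Python. This is only a sketch: it assumes the dates are stored as ISO-formatted text, and the ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal (PatientID INTEGER, Contactdate TEXT, Actioncode TEXT)"
)
# Sample rows from the question, with dates written in ISO format (an assumption).
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?)",
    [
        (1, "2010-05-06", "EPS"),
        (1, "2010-05-06", "D"),
        (1, "2012-03-04", "CNT"),
        (1, "2013-07-08", "D"),
        (2, "2010-01-04", "EPS"),
        (2, "2010-05-06", "D"),
    ],
)

kept = conn.execute(
    """
    SELECT j.PatientID, j.Contactdate, MIN(j.Actioncode) AS Actioncode
    FROM journal AS j
    WHERE j.Actioncode IN ('EPS', 'D')
    GROUP BY j.PatientID, j.Contactdate
    ORDER BY j.PatientID, j.Contactdate
    """
).fetchall()

for row in kept:
    print(row)
```

MIN() picks 'D' over 'EPS' because 'D' sorts first, so the pair on 2010-05-06 for patient 1 collapses to a single D row.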
You can do it with UNION ALL for the 2 cases:
select * from journal where actioncode = 'D'
union all
select * from journal j where j.actioncode = 'EPS'
and not exists (
select 1 from journal
where PatientID = j.PatientID and Contactdate = j.Contactdate and actioncode = 'D'
)
The 2nd query only returns an EPS row when no row with actioncode = 'D' exists for the same PatientID and Contactdate.
See the demo.
Results:
> patientid | contactdate | actioncode
> --------: | :---------- | :---------
> 1 | 2010-05-06 | D
> 1 | 2013-07-08 | D
> 2 | 2010-05-06 | D
> 2 | 2010-01-04 | EPS

Creating clusters of related columns

I have a table named Stores with columns:
StoreCode NVARCHAR(10),
OldStoreCode NVARCHAR(10)
Here is a sample of my data:
| StoreCode | OldStoreCode |
|-----------|--------------|
| A | B |
| B | A |
| D | E |
| E | F |
| M | K |
| J | K |
| K | L |
|-----------|--------------|
I want to create clusters of related Stores. Related store means there is a one way relation between StoreCodes and OldStoreCodes.
Expected result table:
| StoreCode | ClusterId |
|-----------|-----------|
| A | 1 |
| B | 1 |
| D | 2 |
| E | 2 |
| F | 2 |
| M | 3 |
| K | 3 |
| J | 3 |
| L | 3 |
|-----------|-----------|
There is no maximum number of hops. There may be a StoreCode A which has an OldStoreCode B, which has an OldStoreCode C, which has an OldStoreCode D, etc.
How can I cluster stores like this?
Try it like this:
EDIT: With changes by OP taken from comment
DECLARE @tbl TABLE(ID INT IDENTITY, StoreCode VARCHAR(100),OldStoreCode VARCHAR(100));
INSERT INTO @tbl VALUES
('A','B'),('B','A'),('D','E'),('E','F'),('M','K'),('J','K'),('K','L');
WITH Related AS
(
SELECT DISTINCT t1.ID,Val
FROM @tbl AS t1
INNER JOIN @tbl AS t2 ON t1.StoreCode=t2.StoreCode
OR t1.OldStoreCode=t2.OldStoreCode
OR t1.OldStoreCode=t2.StoreCode
OR t1.StoreCode=t2.OldStoreCode
CROSS APPLY(SELECT DISTINCT Val
FROM
(VALUES(t1.StoreCode),(t2.StoreCode),(t1.OldStoreCode),(t2.OldStoreCode)) AS A(Val)
) AS valsInCols
)
,ClusterKeys AS
(
SELECT r1.ID
,(
SELECT r2.Val AS [*]
FROM Related AS r2
WHERE r2.ID=r1.ID
ORDER BY r2.Val
FOR XML PATH('')
) AS ClusterKey
FROM Related AS r1
GROUP BY r1.ID
)
,ClusterIds AS
(
SELECT ClusterKey
,MIN(ID) AS ID
FROM ClusterKeys
GROUP BY ClusterKey
)
SELECT r.ID
,r.Val
FROM ClusterIds c
INNER JOIN Related r ON c.ID = r.ID
The result
ID Val
1 A
1 B
3 D
3 E
3 F
5 J
5 K
5 L
5 M
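Since the question allows any number of hops, the clustering is in effect a connected-components problem. Outside the database, one common way to handle that is a union-find structure. The Python sketch below is an illustration only, not a translation of the SQL above; `cluster_stores` is a hypothetical helper, and numbering clusters by their smallest store code is an arbitrary choice.

```python
def cluster_stores(pairs):
    """Map each store code to a cluster id, treating each pair as an undirected link."""
    parent = {}

    def find(code):
        # Register unseen codes as their own root, then walk up to the root.
        parent.setdefault(code, code)
        while parent[code] != code:
            parent[code] = parent[parent[code]]  # path halving keeps trees shallow
            code = parent[code]
        return code

    for store, old_store in pairs:
        root_a, root_b = find(store), find(old_store)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    # Collect members per root, then number clusters by their smallest code.
    members = {}
    for code in parent:
        members.setdefault(find(code), []).append(code)
    ordered = sorted(members.values(), key=min)
    return {code: cid for cid, group in enumerate(ordered, start=1) for code in group}


pairs = [("A", "B"), ("B", "A"), ("D", "E"), ("E", "F"),
         ("M", "K"), ("J", "K"), ("K", "L")]
clusters = cluster_stores(pairs)
print(sorted(clusters.items()))
```

With the sample data this puts A and B in cluster 1, D, E and F in cluster 2, and J, K, L and M in cluster 3, which matches the expected result table in the question.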
This should do it:
SAMPLE DATA:
IF OBJECT_ID('tempdb..#Temp1') IS NOT NULL
BEGIN
DROP TABLE #Temp1;
END;
CREATE TABLE #Temp1(StoreCode NVARCHAR(10)
, OldStoreCode NVARCHAR(10));
INSERT INTO #Temp1(StoreCode
, OldStoreCode)
VALUES
('A'
, 'B'),
('B'
, 'A'),
('D'
, 'E'),
('E'
, 'F'),
('M'
, 'K'),
('J'
, 'K'),
('K'
, 'L');
QUERY:
;WITH A -- get all distinct new and old storecodes
AS (
SELECT StoreCode
FROM #Temp1
UNION
SELECT OldStoreCode
FROM #Temp1),
B -- give a unique number id to each store code
AS (SELECT rn = RANK() OVER(ORDER BY StoreCode)
, StoreCode
FROM A),
C -- combine the store codes and the unique number id's in one table
AS (SELECT b2.rn AS StoreCodeID
, t.StoreCode
, b1.rn AS OldStoreCodeId
, t.OldStoreCode
FROM #Temp1 AS t
LEFT OUTER JOIN B AS b1 ON t.OldStoreCode = b1.StoreCode
LEFT OUTER JOIN B AS b2 ON t.StoreCode = b2.StoreCode),
D -- assign a row number for each entry in the data set
AS (SELECT rn = RANK() OVER(ORDER BY StoreCode)
, *
FROM C),
E -- derive first and last store in the path
AS (SELECT FirstStore = d2.StoreCode
, LastStore = d1.OldStoreCode
, GroupID = d1.OldStoreCodeId
FROM D AS d1
RIGHT OUTER JOIN D AS d2 ON d1.StoreCodeID = d2.OldStoreCodeId
AND d1.rn - 1 = d2.rn
WHERE d1.OldStoreCode IS NOT NULL) ,
F -- get the stores which led to the last store with one hop
AS (SELECT C.StoreCode
, E.GroupID
FROM E
INNER JOIN C ON E.LastStore = C.OldStoreCode)
-- combine to get the full grouping
SELECT A.StoreCode, ClusterID = DENSE_RANK() OVER (ORDER BY A.GroupID) FROM (
SELECT C.StoreCode,F.GroupID FROM C INNER JOIN F ON C.OldStoreCode = F.StoreCode
UNION
SELECT * FROM F
UNION
SELECT E.LastStore,E.GroupID FROM E) AS A ORDER BY StoreCode, ClusterID
RESULTS:

Percentile calculation with a window function

I know you can get the average, total, min, and max over a subset of the data using a window function. But is it possible to get, say, the median, or the 25th percentile instead of the average with the window function?
Put another way, how do I rewrite this to get the id and the 25th or 50th percentile sales numbers within each district rather than the average?
SELECT id, avg(sales)
OVER (PARTITION BY district) AS district_average
FROM t
You can write this as an aggregation function using percentile_cont() or percentile_disc():
select district, percentile_cont(0.25) within group (order by sales)
from t
group by district;
Unfortunately, Postgres doesn't currently support these as window functions, so the following does not work:
select id, percentile_cont(0.25) within group (order by sales) over (partition by district)
from t;
So, you can use a join:
select t.*, p_25, p_75
from t join
(select district,
percentile_cont(0.25) within group (order by sales) as p_25,
percentile_cont(0.75) within group (order by sales) as p_75
from t
group by district
) td
on t.district = td.district
Another way to do this without joining, as in Gordon's solution, is to exploit the array_agg function, which can be used as a window function:
create function pg_temp.percentile_cont_window
(c double precision[], p double precision)
returns double precision
language sql
as
$$
with t1 as (select unnest(c) as x)
select percentile_cont(p) WITHIN GROUP (ORDER BY x) from t1;
$$
;
-- -- -- -- -- -- -- -- --
-- Usage examples:
create temporary table t1 as (
select 1 as g, 1 as x
union select 1 as g, 2 as x
union select 2 as g, 3 as x
);
-- Built-in function raises an error if used as a window function (with OVER):
-- Error: OVER is not supported for ordered-set aggregate percentile_cont
select *, percentile_cont(.1) within group (order by x) over() from t1;
-- Built-in function with grouping
select g, percentile_cont(.1) within group (order by x) from t1 group by g;
-- | g | percentile_cont |
-- |----:|------------------:|
-- | 1 | 1.1 |
-- | 2 | 3 |
-- Custom function basic usage (note that this is without grouping)
select t1.*, pg_temp.percentile_cont_window(array_agg(x) over(), .1) from t1;
-- | g | x | percentile_cont_window |
-- |----:|----:|-------------------------:|
-- | 1 | 2 | 1.2 |
-- | 1 | 1 | 1.2 |
-- | 2 | 3 | 1.2 |
-- Custom function usage with grouping is the same as using the built-in percentile_cont function
select t1.g, pg_temp.percentile_cont_window(array_agg(x), .1) from t1 group by g;
-- | g | percentile_cont_window |
-- |----:|-------------------------:|
-- | 2 | 3 |
-- | 1 | 1.1 |
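To see where values such as 1.1 and 1.2 in the examples above come from, here is a small Python sketch of the linear interpolation that percentile_cont performs. The Python function is my own illustration, not PostgreSQL code, and the last step mimics the per-partition broadcast a window version would give.

```python
import math


def percentile_cont(values, fraction):
    """Continuous percentile: interpolate linearly between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Same rows as the temporary table t1 above: (g, x).
rows = [(1, 1), (1, 2), (2, 3)]

# "Window" usage: attach each partition's percentile to every row of that partition.
by_group = {}
for g, x in rows:
    by_group.setdefault(g, []).append(x)
windowed = [(g, x, percentile_cont(by_group[g], 0.1)) for g, x in rows]

print(percentile_cont([1, 2, 3], 0.1))
print(windowed)
```

With a fraction of 0.1, the whole set [1, 2, 3] gives about 1.2, the partition [1, 2] gives about 1.1, and the single-row partition [3] gives 3.0, in line with the tables above.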

Select bitmask in Postgresql

I have a table with columns "one" and "two":
a | x
a | y
a | z
b | x
b | z
c | y
I want to write a query to complement it with missing nested values
b | null | y
c | null | x
c | null | z
Then I will select it with array_agg(two) group by one, such that
a {1 1 1}
b {1 0 1}
c {0 1 0}
Eventually I will export it to a CSV file with a COPY query.
What query should I write for the first step?
You can use a CROSS JOIN to build all the possible pairs of elements then a LEFT JOIN to check if each pair of elements exists:
SELECT
T1.one,
T2.two,
CASE WHEN your_table.one IS NULL THEN 0 ELSE 1 END AS is_present
FROM (SELECT DISTINCT one FROM your_table) T1
CROSS JOIN (SELECT DISTINCT two FROM your_table) T2
LEFT JOIN your_table
ON T1.one = your_table.one AND T2.two = your_table.two
You can then add a GROUP BY T1.one and an ARRAY_AGG(...) to this query.
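Here is a small sketch of the whole pipeline, run against SQLite from Python. SQLite has no ARRAY_AGG, so this sketch orders the joined rows in SQL and assembles the per-key lists in Python instead; the alias `yt` is introduced only for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (one TEXT, two TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x"), ("b", "z"), ("c", "y")],
)

rows = conn.execute(
    """
    SELECT T1.one, T2.two,
           CASE WHEN yt.one IS NULL THEN 0 ELSE 1 END AS is_present
    FROM (SELECT DISTINCT one FROM your_table) AS T1
    CROSS JOIN (SELECT DISTINCT two FROM your_table) AS T2
    LEFT JOIN your_table AS yt
           ON T1.one = yt.one AND T2.two = yt.two
    ORDER BY T1.one, T2.two
    """
).fetchall()

# Stand-in for ARRAY_AGG(is_present ORDER BY two) ... GROUP BY one.
masks = {}
for one, _two, is_present in rows:
    masks.setdefault(one, []).append(is_present)

print(masks)  # {'a': [1, 1, 1], 'b': [1, 0, 1], 'c': [0, 1, 0]}
```

The explicit ordering matters: without a fixed order on "two", the positions inside each mask would not line up between rows.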