Suppose I have this table:
select * from window_test;
k | v
---+---
a | 1
a | 2
b | 3
a | 4
Ultimately I want to get:
k | min_v | max_v
---+-------+-------
a | 1 | 2
b | 3 | 3
a | 4 | 4
But I would be just as happy to get this (since I can easily filter it with distinct):
k | min_v | max_v
---+-------+-------
a | 1 | 2
a | 1 | 2
b | 3 | 3
a | 4 | 4
Is it possible to achieve this with PostgreSQL 9.1+ window functions? I'm trying to understand whether I can get it to use a separate partition for the first and last runs of k=a in this sample (ordered by v).
This returns your desired result with the sample data. I'm not sure whether it will work for real-world data:
select k,
min(v) over (partition by group_nr) as min_v,
max(v) over (partition by group_nr) as max_v
from (
select *,
sum(group_flag) over (order by v,k) as group_nr
from (
select *,
case
when lag(k) over (order by v) = k then null
else 1
end as group_flag
from window_test
) t1
) t2
order by min_v;
I left out the DISTINCT though.
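To make the logic of the query above easier to follow, here is a minimal Python sketch of the same idea: flag each row whose k differs from the previous row's k (ordered by v), keep a running count of those flags as the group number, and take min/max per group. The function name min_max_per_run is made up for illustration and is not part of the original answer.

```python
# Rows of window_test as (k, v) pairs, taken from the question's sample data.
rows = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]


def min_max_per_run(rows):
    """Return (k, min_v, max_v) for each run of consecutive equal k, ordered by v."""
    group_nr = 0    # running sum of group_flag
    prev_k = None   # plays the role of lag(k) over (order by v)
    groups = {}     # group_nr -> [k, min_v, max_v]
    for k, v in sorted(rows, key=lambda r: r[1]):
        if k != prev_k:
            # group_flag = 1: a new run starts on this row
            group_nr += 1
            groups[group_nr] = [k, v, v]
        else:
            # same run as the previous row: widen its min/max
            g = groups[group_nr]
            g[1] = min(g[1], v)
            g[2] = max(g[2], v)
        prev_k = k
    return [tuple(g) for g in groups.values()]


print(min_max_per_run(rows))
```

Unlike the SQL version, this sketch returns one row per run, so no DISTINCT step is needed afterwards.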
EDIT: I've come up with the following query, without window functions at all:
WITH RECURSIVE tree AS (
SELECT k, v, ''::text as next_k, 0 as next_v, 0 AS level FROM window_test
UNION ALL
SELECT c.k, c.v, t.k, t.v + level, t.level + 1
FROM tree t JOIN window_test c ON c.k = t.k AND c.v + 1 = t.v),
partitions AS (
SELECT t.k, t.v, t.next_k,
coalesce(nullif(t.next_v, 0), t.v) AS next_v, t.level
FROM tree t
WHERE NOT EXISTS (SELECT 1 FROM tree WHERE next_k = t.k AND next_v = t.v))
SELECT min(k) AS k, v AS min_v, max(next_v) AS max_v
FROM partitions p
GROUP BY v
ORDER BY 2;
I've provided two working queries now; I hope one of them will suit you.
Another way to achieve this is to use a helper sequence.
Create a support sequence:
CREATE SEQUENCE wt_rank START WITH 1;
The query:
WITH source AS (
SELECT k, v,
coalesce(lag(k) OVER (ORDER BY v), k) AS prev_k
FROM window_test
CROSS JOIN (SELECT setval('wt_rank', 1)) AS ri),
ranking AS (
SELECT k, v, prev_k,
CASE WHEN k = prev_k THEN currval('wt_rank')
ELSE nextval('wt_rank') END AS rank
FROM source)
SELECT r.k, min(s.v) AS min_v, max(s.v) AS max_v
FROM ranking r
JOIN source s ON r.v = s.v
GROUP BY r.rank, r.k
ORDER BY 2;
Would this not do the job for you, without the need for windows, partitions or coalescing? It just uses a traditional SQL trick for finding nearest tuples via a self join, and a min on the difference:
SELECT k, min(v), max(v) FROM (
SELECT k, v, v + min(d) lim FROM (
SELECT x.*, y.k n, y.v - x.v d FROM window_test x
LEFT JOIN window_test y ON x.k <> y.k AND y.v - x.v > 0)
z GROUP BY k, v, n)
w GROUP BY k, lim ORDER BY 2;
I think this is probably a more 'relational' solution, but I'm not sure about its efficiency.
Do not use any functions like rank or rownums.
Hint: Formulate a matrix operation using SQL. The rank of an item indicates how many items are less than or equal to it.
A matrix can be simulated by a cross join, and the rank can be derived by counting the items smaller than the current item.
Table A:
x
----
d
b
a
g
c
k
k
g
Expected output:
x1 | rank
----+------
a | 1
b | 2
d | 3
g | 4
c | 5
k | 6
select x as x1, count(x) as rank
from (select DISTINCT x from A order by x) as sub
Your current query is on the right track, using a distinct subquery. For a working version, use a correlated subquery in the select clause which takes counts:
SELECT
x AS x1,
(SELECT COUNT(DISTINCT x) FROM A t WHERE t.x <= sub.x) rank
FROM (SELECT DISTINCT x FROM A) AS sub
ORDER BY
x;
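As a rough illustration of what the correlated subquery computes, the Python sketch below counts, for each distinct value, how many distinct values are less than or equal to it. The function name ranks_by_counting is an invented helper; the data is the question's table A.

```python
# Values of column x in table A, from the question.
items = ["d", "b", "a", "g", "c", "k", "k", "g"]


def ranks_by_counting(items):
    """Rank each distinct item by counting the distinct items <= it."""
    distinct = set(items)  # SELECT DISTINCT x FROM A
    # For each x, COUNT(DISTINCT t.x) WHERE t.x <= x, then ORDER BY x.
    return sorted((x, sum(1 for y in distinct if y <= x)) for x in distinct)


for x1, rank in ranks_by_counting(items):
    print(x1, rank)
```

The comparison of every item against every other item is the "matrix" the hint refers to; it is quadratic in the number of distinct values.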
I have a table with patient IDs, contact dates and action codes. I want to retrieve all rows with an actioncode equal to EPS or D; however, I want to keep only one row if both action codes exist on the same contactdate.
For example this is part of my table, journal:
PatientID Contactdate Actioncode
1 2010-5-6 EPS
1 2010-5-6 D
1 2012-3-4 CNT
1 2013-7-8 D
2 2010-1-4 EPS
2 2010-5-6 D
This is the code I have now to retrieve all rows where actioncode is either EPS or D:
select * from journal j where j.actioncode in ('EPS', 'D')
I tried group by contactdate, but then I miss the rows where the patients are different. The same effect occurs with distinct(contactdate). What can I use here to return only one row when the date and the patientid are the same and the actioncode is either D or EPS?
Preferred outcome:
PatientID Contactdate Actioncode
1 2010-5-6 D
1 2013-7-8 D
2 2010-1-4 EPS
2 2010-5-6 D
We can try using ROW_NUMBER here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY PatientID, Contactdate
ORDER BY Actioncode) rn
FROM journal
WHERE Actioncode in ('EPS', 'D')
)
SELECT PatientID, Contactdate, Actioncode
FROM cte
WHERE rn = 1;
Because 'D' sorts before 'EPS', this always retains the Actioncode = 'D' record when both action codes appear for the same patient and date. If you would rather retain the EPS record, change the call to ROW_NUMBER to use ORDER BY Actioncode DESC.
What you want is a GROUP BY two columns: PatientID and Contactdate. You can use MAX() or MIN() to select one of the rows.
select
j.PatientID,
j.Contactdate,
MIN(j.actionCode) AS Actioncode
from
journal j
where j.actioncode in ('EPS', 'D')
group by j.PatientID, j.Contactdate
To match your preferred outcome, use MIN().
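For a quick check of the GROUP BY approach, here is a small sketch that runs it against the question's sample rows. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the real server, which is an assumption; the ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# In-memory SQLite database standing in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE journal (PatientID INTEGER, Contactdate TEXT, Actioncode TEXT)"
)
conn.executemany(
    "INSERT INTO journal VALUES (?, ?, ?)",
    [
        (1, "2010-5-6", "EPS"),
        (1, "2010-5-6", "D"),
        (1, "2012-3-4", "CNT"),
        (1, "2013-7-8", "D"),
        (2, "2010-1-4", "EPS"),
        (2, "2010-5-6", "D"),
    ],
)

# One row per (PatientID, Contactdate); MIN picks 'D' over 'EPS'.
result = conn.execute(
    """
    SELECT PatientID, Contactdate, MIN(Actioncode) AS Actioncode
    FROM journal
    WHERE Actioncode IN ('EPS', 'D')
    GROUP BY PatientID, Contactdate
    ORDER BY PatientID, Contactdate
    """
).fetchall()

for row in result:
    print(row)
```

Swapping MIN for MAX would keep the EPS row instead when both codes share a date.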
You can do it with UNION ALL for the 2 cases:
select * from journal where actioncode = 'D'
union all
select * from journal j where j.actioncode = 'EPS'
and not exists (
select 1 from journal
where PatientID = j.PatientID and Contactdate = j.Contactdate and actioncode = 'D'
)
The second query only fetches an EPS row when no row with actioncode = 'D' exists for the same PatientID and Contactdate.
Results:
> patientid | contactdate | actioncode
> --------: | :---------- | :---------
> 1 | 2010-05-06 | D
> 1 | 2013-07-08 | D
> 2 | 2010-05-06 | D
> 2 | 2010-01-04 | EPS
I have a table named Stores with columns:
StoreCode NVARCHAR(10),
OldStoreCode NVARCHAR(10)
Here is a sample of my data:
| StoreCode | OldStoreCode |
|-----------|--------------|
| A | B |
| B | A |
| D | E |
| E | F |
| M | K |
| J | K |
| K | L |
|-----------|--------------|
I want to create clusters of related stores. Stores are related when there is a relation in at least one direction between a StoreCode and an OldStoreCode.
Expected result table:
| StoreCode | ClusterId |
|-----------|-----------|
| A | 1 |
| B | 1 |
| D | 2 |
| E | 2 |
| F | 2 |
| M | 3 |
| K | 3 |
| J | 3 |
| L | 3 |
|-----------|-----------|
There is no maximum number of hops. There may be a StoreCode A which has an OldStoreCode B, which has an OldStoreCode C, which has an OldStoreCode D, etc.
How can I cluster stores like this?
Try it like this:
EDIT: With changes by OP taken from comment
DECLARE @tbl TABLE(ID INT IDENTITY, StoreCode VARCHAR(100),OldStoreCode VARCHAR(100));
INSERT INTO @tbl VALUES
('A','B'),('B','A'),('D','E'),('E','F'),('M','K'),('J','K'),('K','L');
WITH Related AS
(
SELECT DISTINCT t1.ID,Val
FROM @tbl AS t1
INNER JOIN @tbl AS t2 ON t1.StoreCode=t2.StoreCode
OR t1.OldStoreCode=t2.OldStoreCode
OR t1.OldStoreCode=t2.StoreCode
OR t1.StoreCode=t2.OldStoreCode
CROSS APPLY(SELECT DISTINCT Val
FROM
(VALUES(t1.StoreCode),(t2.StoreCode),(t1.OldStoreCode),(t2.OldStoreCode)) AS A(Val)
) AS valsInCols
)
,ClusterKeys AS
(
SELECT r1.ID
,(
SELECT r2.Val AS [*]
FROM Related AS r2
WHERE r2.ID=r1.ID
ORDER BY r2.Val
FOR XML PATH('')
) AS ClusterKey
FROM Related AS r1
GROUP BY r1.ID
)
,ClusterIds AS
(
SELECT ClusterKey
,MIN(ID) AS ID
FROM ClusterKeys
GROUP BY ClusterKey
)
SELECT r.ID
,r.Val
FROM ClusterIds c
INNER JOIN Related r ON c.ID = r.ID
The result
ID Val
1 A
1 B
3 D
3 E
3 F
5 J
5 K
5 L
5 M
This should do it:
SAMPLE DATA:
IF OBJECT_ID('tempdb..#Temp1') IS NOT NULL
BEGIN
DROP TABLE #Temp1;
END;
CREATE TABLE #Temp1(StoreCode NVARCHAR(10)
, OldStoreCode NVARCHAR(10));
INSERT INTO #Temp1(StoreCode
, OldStoreCode)
VALUES
('A'
, 'B'),
('B'
, 'A'),
('D'
, 'E'),
('E'
, 'F'),
('M'
, 'K'),
('J'
, 'K'),
('K'
, 'L');
QUERY:
;WITH A -- get all distinct new and old storecodes
AS (
SELECT StoreCode
FROM #Temp1
UNION
SELECT OldStoreCode
FROM #Temp1),
B -- give a unique number id to each store code
AS (SELECT rn = RANK() OVER(ORDER BY StoreCode)
, StoreCode
FROM A),
C -- combine the store codes and the unique number id's in one table
AS (SELECT b2.rn AS StoreCodeID
, t.StoreCode
, b1.rn AS OldStoreCodeId
, t.OldStoreCode
FROM #Temp1 AS t
LEFT OUTER JOIN B AS b1 ON t.OldStoreCode = b1.StoreCode
LEFT OUTER JOIN B AS b2 ON t.StoreCode = b2.StoreCode),
D -- assign a row number for each entry in the data set
AS (SELECT rn = RANK() OVER(ORDER BY StoreCode)
, *
FROM C),
E -- derive first and last store in the path
AS (SELECT FirstStore = d2.StoreCode
, LastStore = d1.OldStoreCode
, GroupID = d1.OldStoreCodeId
FROM D AS d1
RIGHT OUTER JOIN D AS d2 ON d1.StoreCodeID = d2.OldStoreCodeId
AND d1.rn - 1 = d2.rn
WHERE d1.OldStoreCode IS NOT NULL) ,
F -- get the stores which led to the last store with one hop
AS (SELECT C.StoreCode
, E.GroupID
FROM E
INNER JOIN C ON E.LastStore = C.OldStoreCode)
-- combine to get the full grouping
SELECT A.StoreCode, ClusterID = DENSE_RANK() OVER (ORDER BY A.GroupID) FROM (
SELECT C.StoreCode,F.GroupID FROM C INNER JOIN F ON C.OldStoreCode = F.StoreCode
UNION
SELECT * FROM F
UNION
SELECT E.LastStore,E.GroupID FROM E) AS A ORDER BY StoreCode, ClusterID
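Because the number of hops is unbounded, the clustering can be viewed as finding connected components in an undirected graph whose edges are the (StoreCode, OldStoreCode) pairs. The Python sketch below does this outside the database with a union-find structure. It is a different technique from the SQL answers above, offered only as an illustration, and the function name cluster_stores is invented.

```python
# (StoreCode, OldStoreCode) pairs from the question's sample data.
pairs = [("A", "B"), ("B", "A"), ("D", "E"), ("E", "F"),
         ("M", "K"), ("J", "K"), ("K", "L")]


def cluster_stores(pairs):
    """Map each store code to a cluster id using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two stores of every pair, regardless of direction.
    for store, old_store in pairs:
        root_a = find(store)
        root_b = find(old_store)
        parent[root_a] = root_b

    # Number the clusters in order of first appearance in the input.
    cluster_ids = {}
    result = {}
    for store in list(parent):
        root = find(store)
        cluster_ids.setdefault(root, len(cluster_ids) + 1)
        result[store] = cluster_ids[root]
    return result


for store, cluster_id in cluster_stores(pairs).items():
    print(store, cluster_id)
```

Chains of any length end up in the same cluster, since each union merges whole components rather than single hops.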
I know you can get the average, total, min, and max over a subset of the data using a window function. But is it possible to get, say, the median, or the 25th percentile instead of the average with the window function?
Put another way, how do I rewrite this to get the id and the 25th or 50th percentile sales numbers within each district rather than the average?
SELECT id, avg(sales)
OVER (PARTITION BY district) AS district_average
FROM t
You can compute this with an aggregate function, percentile_cont() or percentile_disc():
select district, percentile_cont(0.25) within group (order by sales)
from t
group by district;
Unfortunately, Postgres doesn't currently support these as window functions, so the following raises an error:
select id, percentile_cont(0.25) within group (order by sales) over (partition by district)
from t;
So, you can use a join:
select t.*, p_25, p_75
from t join
(select district,
percentile_cont(0.25) within group (order by sales) as p_25,
percentile_cont(0.75) within group (order by sales) as p_75
from t
group by district
) td
on t.district = td.district
Another way to do this without joining as in Gordon's solution is by exploiting the array_agg function, which can be used as a window function:
create function pg_temp.percentile_cont_window
(c double precision[], p double precision)
returns double precision
language sql
as
$$
with t1 as (select unnest(c) as x)
select percentile_cont(p) WITHIN GROUP (ORDER BY x) from t1;
$$
;
-- -- -- -- -- -- -- -- --
-- Usage examples:
create temporary table t1 as (
select 1 as g, 1 as x
union select 1 as g, 2 as x
union select 2 as g, 3 as x
);
-- Built-in function raises an error if used with OVER:
-- Error: OVER is not supported for ordered-set aggregate percentile_cont
select *, percentile_cont(.1) within group (order by x) over() from t1;
-- Built-in function with grouping
select g, percentile_cont(.1) within group (order by x) from t1 group by g;
-- | g | percentile_cont |
-- |----:|------------------:|
-- | 1 | 1.1 |
-- | 2 | 3 |
-- Custom function basic usage (note that this is without grouping)
select t1.*, pg_temp.percentile_cont_window(array_agg(x) over(), .1) from t1;
-- | g | x | percentile_cont_window |
-- |----:|----:|-------------------------:|
-- | 1 | 2 | 1.2 |
-- | 1 | 1 | 1.2 |
-- | 2 | 3 | 1.2 |
-- Custom function usage with grouping is the same as using the built-in percentile_cont function
select t1.g, pg_temp.percentile_cont_window(array_agg(x), .1) from t1 group by g;
-- | g | percentile_cont_window |
-- |----:|-------------------------:|
-- | 2 | 3 |
-- | 1 | 1.1 |
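To show where values such as 1.1 and 1.2 in the outputs above come from, here is a small Python sketch of a linearly interpolated percentile, broadcast to every row of its group the way a window function would. It assumes percentile_cont interpolates linearly between adjacent sorted values, which is consistent with the results shown; the function names are invented for this example.

```python
import math
from collections import defaultdict


def percentile_cont(values, p):
    """Continuous percentile with linear interpolation (values must be non-empty)."""
    xs = sorted(values)
    pos = p * (len(xs) - 1)   # fractional index into the sorted values
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def with_group_percentile(rows, p):
    """Attach the group's percentile to every (g, x) row, like OVER (PARTITION BY g)."""
    groups = defaultdict(list)
    for g, x in rows:
        groups[g].append(x)
    pct = {g: percentile_cont(xs, p) for g, xs in groups.items()}
    return [(g, x, pct[g]) for g, x in rows]


rows = [(1, 1), (1, 2), (2, 3)]   # same data as the temporary table t1
for row in with_group_percentile(rows, 0.1):
    print(row)
```

With p = 0.1, the group containing 1 and 2 gives roughly 1.1, and all three values together give roughly 1.2, matching the grouped and ungrouped outputs above.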
I have a table with columns "one" and "two":
a | x
a | y
a | z
b | x
b | z
c | y
I want to write a query to complement it with missing nested values
b | null | y
c | null | x
c | null | z
Then I will select it with array_agg(two) group by one, such that
a {1 1 1}
b {1 0 1}
c {0 1 0}
And eventually export it to a CSV file with a COPY query.
What query should I write for the first step?
You can use a CROSS JOIN to build all the possible pairs of elements then a LEFT JOIN to check if each pair of elements exists:
SELECT
T1.one,
T2.two,
CASE WHEN your_table.one IS NULL THEN 0 ELSE 1 END AS is_present
FROM (SELECT DISTINCT one FROM your_table) T1
CROSS JOIN (SELECT DISTINCT two FROM your_table) T2
LEFT JOIN your_table
ON T1.one = your_table.one AND T2.two = your_table.two
You can then add a GROUP BY T1.one and an ARRAY_AGG(...) to this query.
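The CROSS JOIN plus LEFT JOIN idea can be sketched in Python to see the expected shape of the final arrays. Here itertools.product plays the role of the CROSS JOIN and a set membership test replaces the LEFT JOIN with its CASE expression. The function name presence_matrix is made up, and the columns are assumed to be ordered alphabetically.

```python
from itertools import product

# (one, two) pairs from the question's table.
pairs = {("a", "x"), ("a", "y"), ("a", "z"),
         ("b", "x"), ("b", "z"),
         ("c", "y")}


def presence_matrix(pairs):
    """For each value of `one`, list 1/0 flags for every distinct value of `two`."""
    ones = sorted({one for one, _ in pairs})   # SELECT DISTINCT one
    twos = sorted({two for _, two in pairs})   # SELECT DISTINCT two
    matrix = {one: [] for one in ones}
    for one, two in product(ones, twos):       # CROSS JOIN
        # LEFT JOIN + CASE: 1 if the pair exists in the table, else 0
        matrix[one].append(1 if (one, two) in pairs else 0)
    return matrix


for one, flags in presence_matrix(pairs).items():
    print(one, flags)
```

In SQL the order of elements inside ARRAY_AGG would need an explicit ORDER BY to get the same stable column order.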