How to rewrite SQL joins into window functions? - postgresql

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last day, what is the maximum amount charged on each of those cards in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
The results are correct:
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match. How can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1

I can't try PostgreSQL, but in Vertica you can apply the ANSI-standard OLAP window functions.
You'll need to nest two queries, though: the window function only returns sensible results if all the rows that need to be evaluated are in its input, while you only want the row from '2017-07-06' to be displayed.
So you have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
      1|     30

As far as I know, PostgreSQL's window functions (as of 9.x) don't support a bounded RANGE frame with an interval, so range between '90 days' preceding won't work. They do support bounded ROWS frames such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following so that the window function can operate on one row per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
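For completeness, one way the pieces could fit together (a sketch, untested; it assumes at most one transaction per card per day, as in the sample data, and the exact frame size depends on how you count the boundary day): apply a ROWS frame over the daily series, then keep only the cards seen on '2017-07-06'.
SELECT card_id, the_max AS max
FROM (
  SELECT c.card_id, g.d AS d_series,
         -- one row per card per day; MAX ignores the NULLs from days without transactions
         MAX(t.amount) OVER (
           PARTITION BY c.card_id
           ORDER BY g.d
           ROWS BETWEEN 90 PRECEDING AND CURRENT ROW
         ) AS the_max
  FROM generate_series(
         '2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
       ) g(d)
  CROSS JOIN (SELECT DISTINCT card_id FROM test) c
  LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
) x
WHERE d_series = '2017-07-06'
  AND card_id IN (SELECT card_id FROM test WHERE tran_dt >= '2017-07-06')
ORDER BY card_id;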
For what you need (based on your question description), I would stick to using group by.

Related

Merge overlapping date intervals into big intervals providing uniqueness of some id inside merged group

I have some date intervals; each interval is characterized by a known "prop_id". My goal is to merge overlapping intervals into big intervals while keeping the uniqueness of "prop_id" inside each merged group. I have some code that helps me to get the big intervals, but I've no idea how to keep the uniqueness condition. Thanks in advance for any assistance.
________1 ________1
___________2
________1 |________1
_________|__2
[1,2]_________|________[1,2]
For SQLFiddle:
CREATE SEQUENCE ido_seq;
create table slots (
ido integer NOT NULL default nextval('ido_seq'),
begin_at date,
end_at date,
prop_id integer);
ALTER SEQUENCE ido_seq owned by slots.ido;
INSERT INTO slots (ido, begin_at, end_at, prop_id) VALUES
(0, '2014-10-05', '2014-10-10', 1),
(1, '2014-10-08', '2014-10-15', 2),
(2, '2014-10-13', '2014-10-20', 1),
(3, '2014-10-21', '2014-10-30', 2);
-- desired output:
-- start, end, props
-- 2014-10-05, 2014-10-12, [1,2] --! the whole group is (2014-10-05, 2014-10-20, [1,2,1]), but props should be unique
-- 2014-10-13, 2014-10-20, [1,2] --so, we obtain 2 ranges instead of 1, each one with 2 generating prop_id
-- 2014-10-21, 2014-10-30 [2]
How do we get it:
if two date intervals overlap, we merge them. The first ['2014-10-05', '2014-10-10'] and the second ['2014-10-08', '2014-10-15'] have the part ['2014-10-08', '2014-10-10'] in common, so we can merge them into ['2014-10-05', '2014-10-15']. The generating props are unique - OK. The next one, ['2014-10-13', '2014-10-20'], overlaps with the previously calculated ['2014-10-05', '2014-10-15'], but we can't merge them without breaking the uniqueness condition. So we have to split the big interval ['2014-10-05', '2014-10-20'] into 2 smaller ones using the begin date of the repeating prop ('2014-10-13'), keeping the condition, and we receive ['2014-10-05', '2014-10-12'] (as '2014-10-13' minus 1 day) and ['2014-10-13', '2014-10-20'], both generated by props 1 and 2.
My attempt to get merged intervals (not keeping the uniqueness condition):
SELECT min(begin_at), max(enddate), array_agg(prop_id) AS props
FROM (
SELECT *,
count(nextstart > enddate OR NULL) OVER (ORDER BY begin_at DESC, end_at DESC) AS grp
FROM (
SELECT
prop_id
, begin_at
, end_at
, end_at AS enddate
, lead(begin_at) OVER (ORDER BY begin_at, end_at) AS nextstart
FROM slots
) a
)b
GROUP BY grp
ORDER BY 1;
The right solution here is probably to use a recursive CTE to find the large intervals no matter how many smaller intervals need to be combined, and then to remove the intervals that we do not need.
with recursive intervals(idos, begin_at,end_at,prop_ids) as(
select array[ido], begin_at, end_at, array[prop_id]
from slots
union
select i.idos || s.ido
, least(s.begin_at, i.begin_at)
, greatest(s.end_at, i.end_at)
, i.prop_ids || s.prop_id
from intervals i
join slots s
on (s.begin_at, s.end_at) overlaps (i.begin_at, i.end_at)
and not (i.prop_ids && array[s.prop_id]) --check that the prop id is not already in the large interval
where s.begin_at < i.begin_at --to avoid having double intervals
)
select * from intervals i
--finally, remove the intervals that are a subinterval of an included interval
where not exists(select 1 from intervals i2 where i2.idos #> i.idos
and i2.idos <> i.idos);

lead and lag on a large table (1 billion rows)

I have a table T as follows with 1 billion records. Currently, this table has no primary key or indexes.
create table T(
day_c date,
str_c varchar2(20),
comm_c varchar2(20),
src_c varchar2(20)
);
some sample data:
insert into T
select to_date('20171011','yyyymmdd') day_c,'st1' str_c,'c1' comm_c,'s1' src_c from dual
union
select to_date('20171012','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171013','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171014','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171015','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171016','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171017','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171018','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171019','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171020','yyyymmdd'),'st1','c1','s1' from dual;
The expected result is to generate the date ranges for the changes in column src_c.
I have the following code snippet which provides the desired result. However, it is slow, as the cost of running LAG and LEAD over this table is quite high.
WITH EndsMarked AS (
SELECT
day_c,str_c,comm_c,src_c,
CASE WHEN src_c= LAG(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_START,
CASE WHEN src_c= LEAD(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_END
FROM T
), GroupsNumbered AS (
SELECT
day_c,str_c,comm_c,
src_c,
IS_START,
IS_END,
COUNT(CASE WHEN IS_START = 1 THEN 1 END)
OVER (ORDER BY day_c) AS GroupNum
FROM EndsMarked
WHERE IS_START=1 OR IS_END=1
)
SELECT
str_c,comm_c,src_c,
MIN(day_c) AS GROUP_START,
MAX(day_c) AS GROUP_END
FROM GroupsNumbered
GROUP BY str_c,comm_c, src_c,GroupNum
ORDER BY groupnum;
Output :
STR_C COMM_C SRC_C GROUP_START GROUP_END
st1 c1 s1 11-OCT-17 13-OCT-17
st1 c1 s2 14-OCT-17 16-OCT-17
st1 c1 s1 17-OCT-17 20-OCT-17
Any suggestions to speed this up?
Oracle database: 12c
SGA memory: 20 GB
Total CPUs: 22
Do you order by day_c only, or do you need to partition by str_c and comm_c first? It seems you do - in which case I am not sure your query is correct, and Sentinel's solution will need to be adjusted accordingly.
Then:
For some reason (which escapes me), it appears that the match_recognize clause (available only since Oracle 12.1) is faster than analytic functions, even when the work done seems to be the same.
In your problem, (1) you must read 1 billion rows from disk, which can't be done faster than the hardware allows (do you REALLY need to do this on all 1 billion rows, or should you archive a large portion of your table, perhaps after performing this identification of GROUP_START and GROUP_END)? (2) you must order the data by day_c no matter what method you use, and that is time consuming.
With that said, the tabibitosan method (see Sentinel's answer) will be faster than the start-of-group method (which is close to, but simpler than what you currently have).
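For reference, the "start of group" method mentioned above might look roughly like this (a hypothetical sketch, not taken from the answer; the CTE names flagged and numbered and the grp column are illustrative): flag the rows where src_c changes, turn the flags into group numbers with a running SUM, then aggregate.
WITH flagged AS (
  SELECT day_c, str_c, comm_c, src_c,
         -- 1 when src_c differs from the previous row in the partition (or there is no previous row)
         CASE WHEN src_c = LAG(src_c) OVER (PARTITION BY str_c, comm_c ORDER BY day_c)
              THEN 0 ELSE 1 END AS is_start
  FROM t
), numbered AS (
  SELECT day_c, str_c, comm_c, src_c,
         -- running count of group starts = group number
         SUM(is_start) OVER (PARTITION BY str_c, comm_c ORDER BY day_c) AS grp
  FROM flagged
)
SELECT str_c, comm_c, src_c,
       MIN(day_c) AS group_start,
       MAX(day_c) AS group_end
FROM numbered
GROUP BY str_c, comm_c, src_c, grp
ORDER BY group_start;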
The match_recognize solution, which will probably be faster than any solution based on analytic functions, looks like this:
select str_c, comm_c, src_c, group_start, group_end
from t
match_recognize(
partition by str_c, comm_c
order by day_c
measures x.src_c as src_c,
first(day_c) as group_start,
last(day_c) as group_end
pattern ( x y* )
define y as src_c = x.src_c
)
-- Add ORDER BY clause here, if needed
;
Here is a quick explanation of how this works; for developers who are not familiar with match_recognize, I provided links to a few good tutorials in a Comment below this Answer.
The match_recognize clause partitions the input rows by str_c and comm_c and orders them by day_c. So far this is exactly the same work that analytic functions do.
Then in the PATTERN and DEFINE clauses I declare and define two "classes" of rows, which will be flagged as X and Y, respectively. X is any row (there are no restrictions on it in the DEFINE clause). However, Y is restricted: it must have the same src_c as the last X row preceding it.
So, in each partition, reading from the earliest row to the latest within that partition, I am looking for any number of matches, where a match consists of an arbitrary row (marked X) followed by as many Y rows as possible, where Y means "same src_c as the first row in this match". So this will identify sequences of rows where the src_c did not change.
For each match that is found, the clause will output the src_c value from the X row (which is the same, really, for all the rows in that match), and the first and the last value in the day_c column for that match. That is what we need to put in the SELECT clause of the overall query.
You can eliminate one CTE by using the Tabibito-san (Traveler) method:
with Groups as (
select t.*
, row_number() over (order by day_c)
- row_number() over (partition by str_c
, comm_c
, src_c
order by day_c) GroupNum
from t
)
select str_c
, comm_c
, src_c
, min(day_c) GROUP_START
, max(day_c) GROUP_END
from Groups
group by str_c
, comm_c
, src_c
, GroupNum

Using a WHERE clause to not select rows with timestamps within 50ms either side of another row?

I have part of a table like this:
timestamp | Source
----------------------------+----------
2017-07-28 14:20:28.757464 | Stream
2017-07-28 14:20:28.775248 | Poll
2017-07-28 14:20:29.777678 | Poll
2017-07-28 14:21:28.582532 | Stream
I want to achieve this:
timestamp | Source
----------------------------+----------
2017-07-28 14:20:28.757464 | Stream
2017-07-28 14:20:29.777678 | Poll
2017-07-28 14:21:28.582532 | Stream
Here the 2nd row in the original table has been removed, because it's within 50ms of a timestamp before or after it. Importantly, this should only remove rows where Source = 'Poll'.
I'm not sure how this can be achieved - with a WHERE clause maybe?
Thanks in advance for any help.
Whatever we do, we can limit it to the 'Poll' rows, then union those rows with the 'Stream' rows.
with
streams as (
select *
from test
where Source = 'Stream'
),
pools as (
...
)
(select * from pools) union (select * from streams) order by timestamp
To get pools, there are different options:
Correlated subquery
For each row we run an extra query to get the previous row with the same source, then select only those rows where there is no previous timestamp (the first row) or where the previous timestamp is more than 50ms older.
with
...
pools_with_prev as (
-- use correlated subquery
select
timestamp, Source,
timestamp - interval '00:00:00.05'
as timestamp_prev_limit,
(select max(t2.timestamp)from test as t2
where t2.timestamp < test.timestamp and
t2.Source = test.Source)
as timestamp_prev
from test
),
pools as (
select timestamp, Source
from pools_with_prev
-- then select rows which are >50ms apart
where timestamp_prev is NULL or
timestamp_prev < timestamp_prev_limit
)
...
https://www.db-fiddle.com/f/iVgSkvTVpqjNZ5F5RZVSd2/2
Join two sliding tables
Instead of running a subquery for each row, we can create a copy of our table and slide it so that each Poll row joins with the previous row of the same source type.
with
...
pools_rn as (
-- add extra row number column
-- rows: 1, 2, 3
select *,
row_number() over (order by timestamp) as rn
from test
where Source = 'Poll'
),
pools_rn_prev as (
-- add extra row number column increased by one
-- like sliding a copy of the table one row down
-- rows: 2, 3, 4
select timestamp as timestamp_prev,
row_number() over (order by timestamp)+1 as rn
from test
where Source = 'Poll'
),
pools as (
-- now join prev two tables on this column
-- each row will join with its predecessor
select timestamp, source
from pools_rn
left outer join pools_rn_prev
on pools_rn.rn = pools_rn_prev.rn
where
-- then select rows which are >50ms apart
timestamp_prev is null or
timestamp - interval '00:00:00.05' > timestamp_prev
)
...
https://www.db-fiddle.com/f/gXmSxbqkrxpvksE8Q4ogEU/2
Sliding window
Modern SQL can do something similar, partitioning by source and then using a window function (lag) to fetch the previous row's timestamp.
with
...
pools_with_prev as (
-- use sliding window to join prev timestamp
select *,
timestamp - interval '00:00:00.05'
as timestamp_prev_limit,
lag(timestamp) over(
partition by Source order by timestamp
) as timestamp_prev
from test
),
pools as (
select timestamp, Source
from pools_with_prev
-- then select rows which are >50ms apart
where timestamp_prev is NULL or
timestamp_prev < timestamp_prev_limit
)
...
https://www.db-fiddle.com/f/8KfTyqRBU62SFSoiZfpu6Q/1
I believe this last approach is the most efficient.
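For reference, here is how the fragments above might be assembled into one runnable statement, using the lag() variant (a sketch, untested; the explicit Source = 'Poll' filter is added here just to make the "only remove Poll rows" requirement visible - the Stream rows come back in via the streams CTE either way):
with
streams as (
  select timestamp, Source
  from test
  where Source = 'Stream'
),
pools_with_prev as (
  select *,
    timestamp - interval '00:00:00.05' as timestamp_prev_limit,
    lag(timestamp) over (partition by Source order by timestamp) as timestamp_prev
  from test
),
pools as (
  select timestamp, Source
  from pools_with_prev
  where Source = 'Poll'
    -- keep the row when there is no previous Poll row, or it is more than 50ms older
    and (timestamp_prev is null or timestamp_prev < timestamp_prev_limit)
)
(select * from pools) union (select * from streams) order by timestamp;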

SQL Time Series Completion Script

Version: SQL Server 2014
Objective: Create a complete time series with existing date range records.
Initial Data Setup:
IF OBJECT_ID('tempdb..#DataSet') IS NOT NULL
DROP TABLE #DataSet;
CREATE TABLE #DataSet (
RowID INT
,StartDt DATETIME
,EndDt DATETIME
,Col1 FLOAT);
INSERT INTO #DataSet (
RowID
,StartDt
,EndDt
,Col1)
VALUES
(1234,'1/1/2016','12/31/2999',100)
,(1234,'7/23/2016','7/27/2016',90)
,(1234,'7/26/2016','7/31/2016',80)
,(1234,'10/1/2016','12/31/2999',75);
Desired Results:
RowID, StartDt, EndDt, Col1
1234, '01/01/2016', '07/22/2016', 100
1234, '07/23/2016', '07/26/2016', 90
1234, '07/26/2016', '07/31/2016', 80
1234, '08/01/2016', '09/30/2016', 100
1234, '10/01/2016', '12/31/2999', 75
Not an easy task, I will admit. If anyone has a suggestion on how to tackle this utilizing SQL alone (Microsoft SQL Server 2014 T-SQL), I would greatly appreciate it. Please keep in mind it is SQL and we want to avoid cursors at all costs because of performance on large data sets.
Thanks in Advance.
Also, as an FYI, I was able to achieve half of this by utilizing a LEAD window function to set the end date of the current record to the start date minus 1 day of the next. The other half, filling gaps back in from previous records, still eludes me.
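For reference, that LEAD step might look roughly like this (a hypothetical sketch, not the asker's actual code; it only trims EndDt back to the day before the next StartDt when rows overlap, and does not yet fill the gaps):
SELECT RowID,
       StartDt,
       -- if the next row starts before this one ends, end this row the day before the next start
       CASE WHEN LEAD(StartDt) OVER (ORDER BY StartDt, EndDt) <= EndDt
            THEN DATEADD(day, -1, LEAD(StartDt) OVER (ORDER BY StartDt, EndDt))
            ELSE EndDt
       END AS EndDt,
       Col1
FROM #DataSet;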
Updated for the 9/31 to 9/30 date.
The following query does essentially what you are asking. You can tweak it to fit your requirements. Note that when checking the results of my query, your desired results originally contained 09/31/2016, which is not a valid date.
WITH
RankedData AS
(
SELECT RowID, StartDt, EndDt, Col1,
DATEADD(day, -1, StartDt) AS PrevEndDt,
RANK() OVER(ORDER BY StartDt, EndDt, RowID) AS rank_no
FROM #DataSet
),
HasGapsData AS
(
SELECT a.RowID, a.StartDt,
CASE WHEN b.PrevEndDt <= a.EndDt THEN b.PrevEndDt ELSE a.EndDt END AS EndDt,
a.Col1, a.rank_no
FROM RankedData a
LEFT JOIN RankedData b ON a.rank_no = b.rank_no - 1
)
SELECT RowID, StartDt, EndDt, Col1
FROM HasGapsData
UNION ALL
SELECT a.RowID,
DATEADD(day, 1, a.EndDt) AS StartDt,
DATEADD(day, -1, b.StartDt) AS EndDt,
a.Col1
FROM HasGapsData a
INNER JOIN HasGapsData b ON a.rank_no = b.rank_no - 1
WHERE DATEDIFF(day, a.EndDt, b.StartDt) > 1
ORDER BY StartDt, EndDt;

postgresql: find preceding and following timestamps to an arbitrary timestamp

Given an arbitrary timestamp such as 2014-06-01 12:04:55-04, I find in sometable the timestamps just before and just after it, and then calculate the elapsed number of seconds between those two with the following query:
SELECT EXTRACT (EPOCH FROM (
(SELECT time AS t0
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1) -
(SELECT time AS t1
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1)
)) as elapsedNegative;
It works, but I was wondering if there is a more elegant or astute way to achieve the same result. I am using 9.3. Here is a toy database.
CREATE TABLE sometable (
id serial,
time timestamp
);
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 11:59:37-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:02:22-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:04:49-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:07:35-04');
INSERT INTO sometable (id, time) VALUES (1, '2014-06-01 12:09:53-04');
Thanks for any tips...
Update: thanks to both @Joe Love and @Clément Prévost for interesting alternatives. I learned a lot along the way!
Your original query can't really be made more efficient: given that the sometable.time column is indexed, your execution plan should show only 2 index scans, which is very efficient (index-only scans if you have pg 9.2 and above).
Here is a more readable way to write it:
WITH previous_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time < '2014-06-01 12:04:55-04'
ORDER BY time DESC LIMIT 1
),
next_timestamp AS (
SELECT time AS time
FROM sometable
WHERE time > '2014-06-01 12:04:55-04'
ORDER BY time ASC LIMIT 1
)
SELECT EXTRACT (EPOCH FROM (
(SELECT * FROM next_timestamp)
- (SELECT * FROM previous_timestamp)
)) AS elapsedNegative;
Using CTEs allows you to give meaning to a subquery by naming it. Explicit naming is a well-known and recognised coding best practice (use explicit names, don't abbreviate, and don't use overly generic names like "data" or "value").
Be warned that CTEs are optimisation "fences" and can sometimes get in the way of planner optimisation.
Here is the SQLFiddle.
Edit: Moved the extract from the CTE to the final query so that PostgreSQL can use an index-only scan.
This solution will likely perform better if the timestamp column does not have an index. When 9.4 comes out we can do it a little shorter by using aggregate filters.
This should be a bit faster, as it's running 1 full table scan instead of 2; however, it may perform worse if your timestamp column is indexed and you have a large dataset.
Here's the example without the epoch conversion, to make it easier to read.
select
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
),
max(
case when t1.start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)
from my_table as t1
And here's the example including the math and epoch extraction:
select
extract (EPOCH FROM (
min(
case when start_timestamp > current_timestamp
then
start_timestamp
else 'infinity'::timestamp
end
)-
max(
case when start_timestamp < current_timestamp
then
start_timestamp
else '-infinity'::timestamp
end
)))
from snap.offering_event
Please let me know if you need further details; I'd recommend trying my code vs yours and seeing how it performs.
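For what it's worth, the 9.4 aggregate FILTER syntax mentioned above could shorten this to roughly the following (a sketch against the question's sometable, untested; note it yields a positive elapsed time, unlike the original elapsedNegative):
SELECT EXTRACT(EPOCH FROM (
         -- earliest timestamp after the reference point, minus latest timestamp before it
         MIN(time) FILTER (WHERE time > '2014-06-01 12:04:55-04')
       - MAX(time) FILTER (WHERE time < '2014-06-01 12:04:55-04')
       )) AS elapsed
FROM sometable;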