I have a time_entries table, where I use Postgres 15's multirange feature to merge overlapping and adjacent time ranges:
CREATE TABLE time_entries (
id bigint NOT NULL,
contract_id bigint,
"from" timestamp(6) without time zone,
"to" timestamp(6) without time zone,
range tsrange GENERATED ALWAYS AS (tsrange("from", "to")) STORED
);
INSERT INTO time_entries VALUES (1, 1, '2022-12-07T09:00', '2022-12-07T09:45');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T09:45', '2022-12-07T10:00');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T09:55', '2022-12-07T10:15');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T10:20', '2022-12-07T10:30');
INSERT INTO time_entries VALUES (2, 1, '2022-12-07T10:45', '2022-12-07T11:00');
SELECT contract_id, unnest(range_agg(range)) AS range FROM time_entries GROUP BY contract_id;
The current result is:
contract_id | range
------------+-----------------------------------------------
          1 | ["2022-12-07 09:00:00","2022-12-07 10:15:00")
          1 | ["2022-12-07 10:20:00","2022-12-07 10:30:00")
          1 | ["2022-12-07 10:45:00","2022-12-07 11:00:00")
Now, when two ranges are only up to 5 minutes apart, I'd like to merge them as well. So the desired result would be:
contract_id | range
------------+-----------------------------------------------
          1 | ["2022-12-07 09:00:00","2022-12-07 10:30:00")
          1 | ["2022-12-07 10:45:00","2022-12-07 11:00:00")
Working dbfiddle here:
https://dbfiddle.uk/owHkVaZ5
Is this achievable with SQL alone, or do I need some kind of custom Postgres function for this? I've heard of custom aggregate functions, but have never used one.
Richard Huxton's idea of adding 5 minutes before you aggregate is good. Here is a simple implementation:
SELECT contract_id,
tsrange(lower(u.r), upper(u.r) - INTERVAL '5 minutes')
FROM (SELECT contract_id,
range_agg(tsrange("from", "to" + INTERVAL '5 minutes')) AS mr
FROM time_entries
GROUP BY contract_id) AS agg
CROSS JOIN LATERAL unnest(agg.mr) AS u(r);
You need the CROSS JOIN because you want to join each group with all the multirange elements that belong to it. The "belongs to" part is expressed by LATERAL, which means that a later FROM list entry can refer to columns from the preceding ones. Essentially, that construct and the subquery are needed to get the unnested ranges into the FROM list, so that they can be used in the SELECT list.
AS u(r) is a table alias, that is, an alias for the table and the name of its column at the same time.
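For readers who prefer CTEs, the same query can be spelled out step by step. This is only a restatement of the answer above, not a different technique; against the sample data it should yield the two desired rows for contract_id 1.
WITH padded AS (
    -- widen every entry by the allowed 5-minute gap before aggregating
    SELECT contract_id,
           range_agg(tsrange("from", "to" + INTERVAL '5 minutes')) AS mr
    FROM time_entries
    GROUP BY contract_id
)
SELECT contract_id,
       -- shrink the upper bound again to undo the padding
       tsrange(lower(u.r), upper(u.r) - INTERVAL '5 minutes') AS range
FROM padded
CROSS JOIN LATERAL unnest(padded.mr) AS u(r);
-- Expected for contract_id 1 (the desired result from the question):
--   ["2022-12-07 09:00:00","2022-12-07 10:30:00")
--   ["2022-12-07 10:45:00","2022-12-07 11:00:00")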
I have a very simple table like the one below:
Events:
Event_name | Event_time
-----------+-----------
A          | 2022-02-10
B          | 2022-05-11
C          | 2022-07-17
D          | 2022-10-20
New events keep being added to this table, but we always take the events from the last X days (for example, 30 days), so the query result for this table changes over time.
I would like to transform the above table into this:
A          | B          | C          | D
-----------+------------+------------+-----------
2022-02-10 | 2022-05-11 | 2022-07-17 | 2022-10-20
In general, the number of columns won't be constant. But if that's not possible, we can add a limit on the number of columns, for example, 10 columns.
I tried crosstab, but I had to add the column names manually, which is not what I want, and it doesn't work with the CTE query:
WITH CTE AS (
SELECT DISTINCT
1 AS "Id",
event_time,
event_name,
ROW_NUMBER() OVER(ORDER BY event_time) AS nr
FROM
events
WHERE
event_time >= CURRENT_DATE - INTERVAL '31 days')
SELECT *
FROM
crosstab (
'SELECT id, event_name, event_time
FROM
CTE
WHERE
nr <= 10
ORDER BY
nr') AS ct(id int,
event_name text,
EventTime1 timestamp,
EventTime2 timestamp,
EventTime3 timestamp,
EventTime4 timestamp,
EventTime5 timestamp,
EventTime6 timestamp,
EventTime7 timestamp,
EventTime8 timestamp,
EventTime9 timestamp,
EventTime10 timestamp)
This query will be used as the data source in Tableau (data visualization and analysis software), so it would be great if it could be a single query (without temp tables, new functions, etc.).
Thanks!
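As an aside on why the CTE variant fails: crosstab() receives its source query as a plain text string and executes it separately, so a CTE defined outside that string is not visible inside it. A minimal, untested sketch that inlines the query into the string (the output column names are just illustrative):
-- crosstab() comes from the tablefunc extension: CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab($$
    SELECT 1 AS id,
           ROW_NUMBER() OVER (ORDER BY event_time) AS nr,   -- category column
           event_time                                       -- value column
    FROM events
    WHERE event_time >= CURRENT_DATE - INTERVAL '31 days'
    ORDER BY 1, 2
$$) AS ct(id int,
          event_time_1 timestamp,
          event_time_2 timestamp,
          event_time_3 timestamp,
          event_time_4 timestamp,
          event_time_5 timestamp,
          event_time_6 timestamp,
          event_time_7 timestamp,
          event_time_8 timestamp,
          event_time_9 timestamp,
          event_time_10 timestamp);
-- With the single-argument form of crosstab(), extra input rows beyond the
-- declared output columns are skipped, and missing ones are filled with NULLs.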
Sample data below.
I want to clean up data based on the next non-null value for the same id, ordered by row (actually a timestamp).
I can't use lag, because in some cases there are consecutive nulls.
I can't use coalesce(a.col_a, (select min(b.col_a) from table b where a.id = b.id)) because it can return an "outdated" value (e.g. NYC instead of SF in col_a row 4). (I can use that approach later, once I've accounted for everything else, for the cases where there is no next non-null value, like col_b rows 9/10, to just fill in the last value.)
The only thing I can think of is to do
table_x as (select id, col_x from table where col_a is not null)
for each column, and then join taking the minimum where id = id and table_x.row > table.row. But I have a handful of columns and that feels cumbersome and inefficient.
Appreciate any help!
row | id | col_a | col_a_desired | col_b | col_b_desired
----+----+-------+---------------+-------+--------------
  0 |  1 | -     | NYC           | red   | red
  1 |  1 | NYC   | NYC           | red   | red
  2 |  1 | SF    | SF            | -     | blue
  3 |  1 | -     | SF            | -     | blue
  4 |  1 | SF    | SF            | blue  | blue
  5 |  2 | PAR   | PAR           | red   | red
  6 |  2 | LON   | LON           | -     | blue
  7 |  2 | LON   | LON           | -     | blue
  8 |  2 | -     | LON           | blue  | blue
  9 |  2 | LON   | LON           | -     | blue
 10 |  2 | -     | LON           | -     | blue
Can you try this query?
WITH samp AS (
SELECT 0 row_id, 1 id, null col_a, 'red' col_b UNION ALL
SELECT 1, 1, 'NYC', 'red' UNION ALL
SELECT 2, 1, 'SF', NULL UNION ALL
SELECT 3, 1, NULL, NULL UNION ALL
SELECT 4, 1, 'SF', 'blue' UNION ALL
SELECT 5, 2, 'PAR', 'red' UNION ALL
SELECT 6, 2, 'LON', NULL UNION ALL
SELECT 7, 2, 'LON', NULL UNION ALL
SELECT 8, 2, NULL, 'blue' UNION ALL
SELECT 9, 2, 'LON', NULL UNION ALL
SELECT 10, 2, NULL, NULL
)
SELECT
row_id,
id,
IFNULL(FIRST_VALUE(col_a IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
FIRST_VALUE(col_a IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id desc
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_a,
IFNULL(FIRST_VALUE(col_b IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING),
FIRST_VALUE(col_b IGNORE NULLS)
OVER (PARTITION BY id ORDER BY row_id desc
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) AS col_b
from samp order by id, row_id
Output:
References:
https://cloud.google.com/bigquery/docs/reference/standard-sql/navigation_functions#first_value
https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls
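PostgreSQL itself has no IGNORE NULLS option for window functions, so a rough Postgres-flavored equivalent of the approach above (an untested sketch, assuming the sample rows are available as a table or CTE named samp) is to build a reverse running count of non-null values and take MAX within each resulting group. The trailing nulls with no following non-null value (col_b rows 9/10) would still need the separate "fill with the last value" pass mentioned in the question.
WITH grouped AS (
    SELECT row_id, id, col_a, col_b,
           -- count of non-null values at or after this row (reverse order);
           -- rows sharing a count belong to the same "next non-null" group
           COUNT(col_a) OVER (PARTITION BY id ORDER BY row_id DESC) AS grp_a,
           COUNT(col_b) OVER (PARTITION BY id ORDER BY row_id DESC) AS grp_b
    FROM samp
)
SELECT row_id, id,
       MAX(col_a) OVER (PARTITION BY id, grp_a) AS col_a_filled,
       MAX(col_b) OVER (PARTITION BY id, grp_b) AS col_b_filled
FROM grouped
ORDER BY id, row_id;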
"I want to clean up data based on the next non-null value."
So if you reverse the order, that's the last non-null value.
If you have multiple columns and the logic is too cumbersome to write in SQL, you can write it in plpgsql instead, or even use the scripting language of your choice (but that will be slower).
The idea is to open a cursor FOR UPDATE, with an ORDER BY in the reverse of the order mentioned in the question. The plpgsql code then stores the last non-null values in variables and, where needed, issues an UPDATE ... WHERE CURRENT OF cursor to replace the NULLs in the table with the desired values.
This may take a while, and the numerous updates will take a lot of locks. It looks like your data can be processed in independent chunks using the "id" column as chunk identifier, so it would be a good idea to use that.
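A minimal plpgsql sketch of that cursor approach (untested), assuming the sample rows live in a real table samp(row_id, id, col_a, col_b); the table name is borrowed from the other answer and the two columns are just illustrative, so extend the pattern for more columns.
DO $$
DECLARE
    -- reverse order: scanning backwards turns "next non-null" into "last seen non-null"
    cur CURSOR FOR
        SELECT id, col_a, col_b
        FROM samp
        ORDER BY id, row_id DESC
        FOR UPDATE;
    rec        record;
    last_id    samp.id%TYPE;
    last_col_a samp.col_a%TYPE;
    last_col_b samp.col_b%TYPE;
BEGIN
    OPEN cur;
    LOOP
        FETCH cur INTO rec;
        EXIT WHEN NOT FOUND;

        -- reset the remembered values when we cross into a new id (chunk)
        IF rec.id IS DISTINCT FROM last_id THEN
            last_id    := rec.id;
            last_col_a := NULL;
            last_col_b := NULL;
        END IF;

        -- remember the most recent non-null values (the "next" ones in ascending order)
        IF rec.col_a IS NOT NULL THEN
            last_col_a := rec.col_a;
        END IF;
        IF rec.col_b IS NOT NULL THEN
            last_col_b := rec.col_b;
        END IF;

        -- fill nulls in place, only when something is actually missing
        IF rec.col_a IS NULL OR rec.col_b IS NULL THEN
            UPDATE samp
            SET col_a = COALESCE(col_a, last_col_a),
                col_b = COALESCE(col_b, last_col_b)
            WHERE CURRENT OF cur;
        END IF;
    END LOOP;
    CLOSE cur;
END;
$$;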
I have a large table with hundreds of millions of rows. Because it is so big, it is partitioned by date range first, and then each of those partitions is sub-partitioned by period_id.
CREATE TABLE research.ranks
(
security_id integer NOT NULL,
period_id smallint NOT NULL,
classificationtype_id smallint NOT NULL,
dtz timestamp with time zone NOT NULL,
create_dt timestamp with time zone NOT NULL DEFAULT now(),
update_dt timestamp with time zone NOT NULL DEFAULT now(),
rank_1 smallint,
rank_2 smallint,
rank_3 smallint
) PARTITION BY RANGE (dtz);
CREATE TABLE zpart.ranks_y1990 PARTITION OF research.ranks
FOR VALUES FROM ('1990-01-01 00:00:00+00') TO ('1991-01-01 00:00:00+00')
PARTITION BY LIST (period_id);
CREATE TABLE zpart.ranks_y1990p1 PARTITION OF zpart.ranks_y1990
FOR VALUES IN ('1');
Every year has a partition, and there are another dozen sub-partitions for each year.
I needed to see the result for security_ids side by side for different period_ids.
So the join I initially used was one like this:
select c1.security_id, c1.dtz, c1.rank_2 as rank_2_1, c9.rank_2 as rank_2_9
from research.ranks c1
left join research.ranks c9 on c1.dtz = c9.dtz and c1.security_id = c9.security_id and c9.period_id = 9
where c1.period_id = 1 and c1.dtz > now() - interval '10 years';
which was slow, but acceptable. I'll call this the JOIN version.
Then, we wanted to show two more period_ids and extended the above to add additional joins on the new period_ids.
This slowed down the join enough for us to look at a different solution.
We found that the following type of query runs about 6 or 7 times faster:
select c1.security_id, c1.dtz
,sum(case when c1.period_id=1 then c1.rank_2 end) as rank_2_1
,sum(case when c1.period_id=9 then c1.rank_2 end) as rank_2_9
,sum(case when c1.period_id=11 then c1.rank_2 end) as rank_2_11
,sum(case when c1.period_id=14 then c1.rank_2 end) as rank_2_14
from research.ranks c1
where c1.period_id in (1, 11, 14, 9) and c1.dtz > now() - interval '10 years'
group by c1.security_id, c1.dtz;
We can use the sum because the table has unique indexes so we know there will only ever be one record that is being "summed". I'll call this the SUM version.
The speed is so much better that I'm questioning half of the code I have written previously! Two questions:
Should I be trying to use the SUM version rather than the JOIN version everywhere or is the efficiency likely to be a factor of the specific structure and not likely to be as useful in other circumstances?
Is there a problem with the logic of the SUM version in cases that I haven't considered?
To be honest, I don't think your "join" version was ever a good idea anyway. You only have one (partitioned) table so there never was a need for any join.
SUM() is the way to go, but I would use SUM(...) FILTER (WHERE ...) instead of a CASE expression:
SELECT
security_id,
dtz,
SUM(rank_2) FILTER (WHERE period_id = 1) AS rank_2_1,
SUM(rank_2) FILTER (WHERE period_id = 9) AS rank_2_9,
SUM(rank_2) FILTER (WHERE period_id = 11) AS rank_2_11,
SUM(rank_2) FILTER (WHERE period_id = 14) AS rank_2_14
FROM
research.ranks
WHERE
period_id IN ( 1, 11, 14, 9 )
AND dtz > now( ) - INTERVAL '10 years'
GROUP BY
security_id,
dtz;
I have some date intervals, each characterized by a known prop_id. My goal is to merge overlapping intervals into big intervals, while keeping the prop_ids unique inside each merged group. I have some code that gets me the big intervals, but I have no idea how to enforce the uniqueness condition. Thanks in advance for any assistance.
________1 ________1
___________2
________1 |________1
_________|__2
[1,2]_________|________[1,2]
For SQLFiddle:
CREATE SEQUENCE ido_seq;
create table slots (
ido integer NOT NULL default nextval('ido_seq'),
begin_at date,
end_at date,
prop_id integer);
ALTER SEQUENCE ido_seq owned by slots.ido;
INSERT INTO slots (ido, begin_at, end_at, prop_id) VALUES
(0, '2014-10-05', '2014-10-10', 1),
(1, '2014-10-08', '2014-10-15', 2),
(2, '2014-10-13', '2014-10-20', 1),
(3, '2014-10-21', '2014-10-30', 2);
-- disired output:
-- start, end, props
-- 2014-10-05, 2014-10-12, [1,2] --! the whole group is (2014-10-05, 2014-10-20, [1,2,1]), but props should be unique
-- 2014-10-13, 2014-10-20, [1,2] --so, we obtain 2 ranges instead of 1, each one with 2 generating prop_id
-- 2014-10-21, 2014-10-30 [2]
How do we get it:
If two date intervals overlap, we merge them. The first ['2014-10-05', '2014-10-10'] and the second ['2014-10-08', '2014-10-15'] have the part ['2014-10-08', '2014-10-10'] in common, so we can merge them into ['2014-10-05', '2014-10-15']. The generating prop_ids are unique, so that's OK. The next interval ['2014-10-13', '2014-10-20'] overlaps the previously calculated ['2014-10-05', '2014-10-15'], but we can't merge them without breaking the uniqueness condition. So we have to split the big interval ['2014-10-05', '2014-10-20'] into two smaller ones at the begin date of the repeated prop ('2014-10-13'), keeping the condition, and we get ['2014-10-05', '2014-10-12'] (as '2014-10-13' minus 1 day) and ['2014-10-13', '2014-10-20'], both generated by props 1 and 2.
My attempt to get merged intervals (not keeping the uniqueness condition):
SELECT min(begin_at), max(enddate), array_agg(prop_id) AS props
FROM (
SELECT *,
count(nextstart > enddate OR NULL) OVER (ORDER BY begin_at DESC, end_at DESC) AS grp
FROM (
SELECT
prop_id
, begin_at
, end_at
, end_at AS enddate
, lead(begin_at) OVER (ORDER BY begin_at, end_at) AS nextstart
FROM slots
) a
)b
GROUP BY grp
ORDER BY 1;
The right solution here is probably to use a recursive CTE to find the large intervals no matter how many smaller intervals need to be combined, and then to remove the intervals that we do not need.
with recursive intervals(idos, begin_at,end_at,prop_ids) as(
select array[ido], begin_at, end_at, array[prop_id]
from slots
union
select i.idos || s.ido
, least(s.begin_at, i.begin_at)
, greatest(s.end_at, i.end_at)
, i.prop_ids || s.prop_id
from intervals i
join slots s
on (s.begin_at, s.end_at) overlaps (i.begin_at, i.end_at)
and not (i.prop_ids && array[s.prop_id]) --check that the prop id is not already in the large interval
where s.begin_at < i.begin_at --to avoid having double intervals
)
select * from intervals i
--finally, remove the intervals that are a subinterval of an included interval
where not exists(select 1 from intervals i2 where i2.idos #> i.idos
and i2.idos <> i.idos);
Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on each of those cards in the last 90 days?
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match; how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id | the_max
--------+--------
      1 |      30
As far as I know, PostgreSQL's window functions don't support a bounded RANGE ... PRECEDING frame, so RANGE BETWEEN '90 days' PRECEDING won't work. They do support bounded ROWS frames, such as ROWS BETWEEN 90 PRECEDING, but then you would need to assemble a time-series query similar to the following, so that the window function operates on one row per day:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.
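If you do want to follow the window-function route, here is a hedged, untested sketch of how that generate_series scaffold could feed a row-based frame: with exactly one row per card and day (as in the sample data), "90 days preceding" becomes roughly "90 rows preceding". Names follow the question's test table; the exact frame boundary may need adjusting by a day depending on how "90 days" is counted.
WITH series AS (
    -- one row per card and per day, with the amount where a transaction exists
    SELECT c.card_id, g.d::date AS d_series, t.amount
    FROM generate_series('2017-04-06'::timestamp,
                         '2017-07-06'::timestamp,
                         '1 day'::interval) g(d)
    CROSS JOIN (SELECT DISTINCT card_id FROM test) c
    LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
), windowed AS (
    SELECT card_id, d_series,
           MAX(amount) OVER (
               PARTITION BY card_id
               ORDER BY d_series
               ROWS BETWEEN 90 PRECEDING AND CURRENT ROW
           ) AS max_amount_90d
    FROM series
)
SELECT card_id, max_amount_90d
FROM windowed
WHERE d_series = DATE '2017-07-06'                       -- keep only the last day
  AND card_id IN (SELECT card_id FROM test
                  WHERE tran_dt >= DATE '2017-07-06')    -- cards used in the last 1 day
ORDER BY card_id;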