I have a table in Postgres which looks like below:
CREATE TABLE my_features
(
id uuid NOT NULL,
feature_id uuid NOT NULL,
begin_time timestamptz NOT NULL,
duration integer NOT NULL
)
For each feature_id there may be multiple rows with time ranges specified by begin_time .. (begin_time + duration). duration is in milliseconds. They may overlap. I'm looking for a fast way to find all feature_ids that have any overlaps.
I have referred to this question, Query Overlapping time range, which is similar but works with a fixed end time.
I have tried the below query but it is throwing an error.
Query:
select c1.*
from my_features c1
where exists (select 1
from my_features c2
where tsrange(c2.begin_time, c2.begin_time + '30 minutes'::INTERVAL, '[]') && tsrange(c1.begin_time, c1.begin_time + '30 minutes'::INTERVAL, '[]')
and c2.feature_id = c1.feature_id
and c2.id <> c1.id);
Error:
ERROR: function tsrange(timestamp with time zone, timestamp with time zone, unknown) does not exist
LINE 5: where tsrange(c2.begin_time, c2.begin_time...
I have used a default time interval here because I did not understand how to convert the time into minutes and substitute it with 'n minutes'.
If you need a solution faster than O(n²), you can use an exclusion constraint on ranges with the btree_gist extension, possibly on a temporary table:
CREATE TEMPORARY TABLE my_features_ranges (
id uuid NOT NULL,
feature_id uuid NOT NULL,
range tstzrange NOT NULL,
EXCLUDE USING GIST (feature_id WITH =, range WITH &&)
);
INSERT INTO my_features_ranges (id, feature_id, range)
select id, feature_id, tstzrange(begin_time, begin_time+duration*'1ms'::interval)
from my_features
on conflict do nothing;
select id from my_features except select id from my_features_ranges;
Using OVERLAPS predicate:
SELECT * -- DISTINCT f1.*
FROM my_features f1
JOIN my_features f2
ON f1.feature_id = f2.feature_id
AND f1.id <> f2.id
AND (f1.begin_time, f1.begin_time + '30 minutes'::INTERVAL)
OVERLAPS (f2.begin_time, f2.begin_time + '30 minutes'::INTERVAL);
Or try this (the cast to date lets tsrange accept the arguments, but it also discards the time of day):
select c1.*
from jak.my_features c1
where exists (select 1
from jak.my_features c2
where tsrange(c2.begin_time::date, c2.begin_time::date + '30 minutes'::INTERVAL, '[]') && tsrange(c1.begin_time::date, c1.begin_time::date + '30 minutes'::INTERVAL, '[]') and
c2.feature_id = c1.feature_id
and c2.id <> c1.id);
The problem was that I was using tsrange on a timestamp with time zone column. For timestamp with time zone there is a separate function called tstzrange.
Below worked for me:
EDIT: Added changes suggested by @a_horse_with_no_name
select c1.*
from my_features c1
where exists (select 1
from my_features c2
where tstzrange(c2.begin_time, c2.begin_time + make_interval(secs => c2.duration / 1000.0), '[]') && tstzrange(c1.begin_time, c1.begin_time + make_interval(secs => c1.duration / 1000.0), '[]')
and c2.feature_id = c1.feature_id
and c2.id <> c1.id);
However, the part of calculating interval dynamically is still pending
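For the still-pending part, one possible way to derive the range from the duration column is to multiply it by a one-millisecond interval. The sketch below is self-contained: the inline sample rows, the integer ids and the text feature_id are made-up simplifications (the real columns are uuid), and it assumes the default half-open '[)' bounds, so ranges that merely touch do not count as overlapping.

```sql
-- Sample rows stand in for the real my_features table (hypothetical data).
WITH my_features (id, feature_id, begin_time, duration) AS (
    VALUES (1, 'a', timestamptz '2024-01-01 10:00:00+00', 60000),
           (2, 'a', timestamptz '2024-01-01 10:00:30+00', 60000),
           (3, 'b', timestamptz '2024-01-01 10:00:00+00', 1500),
           (4, 'b', timestamptz '2024-01-01 10:00:02+00', 1500)
)
SELECT DISTINCT f1.feature_id
FROM my_features AS f1
WHERE EXISTS (
    SELECT 1
    FROM my_features AS f2
    WHERE f2.feature_id = f1.feature_id
      AND f2.id <> f1.id
      -- duration is in milliseconds, so scale a 1 ms interval by it
      AND tstzrange(f1.begin_time, f1.begin_time + f1.duration * interval '1 millisecond')
       && tstzrange(f2.begin_time, f2.begin_time + f2.duration * interval '1 millisecond')
);
```

With these sample rows only feature 'a' has overlapping ranges; the two rows of feature 'b' are half a second apart.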
I have a very simple table like below
Events:
Event_name | Event_time
A | 2022-02-10
B | 2022-05-11
C | 2022-07-17
D | 2022-10-20
New events are added to this table, but we always take only the events from the last X days (for example, 30 days), so the query result for this table changes over time.
I would like to transform the above table into this:
A | B | C | D
2022-02-10 | 2022-05-11 | 2022-07-17 | 2022-10-20
In general, the number of columns won't be constant. If that's not possible, we can limit the number of columns, for example to 10.
I tried crosstab, but I had to add the column names manually, which is not what I want, and it doesn't work with the CTE query:
WITH CTE AS (
SELECT DISTINCT
1 AS "Id",
event_time,
event_name,
ROW_NUMBER() OVER(ORDER BY event_time) AS nr
FROM
events
WHERE
event_time >= CURRENT_DATE - INTERVAL '31 days')
SELECT *
FROM
crosstab (
'SELECT id, event_name, event_time
FROM
CTE
WHERE
nr <= 10
ORDER BY
nr') AS ct(id int,
event_name text,
EventTime1 timestamp,
EventTime2 timestamp,
EventTime3 timestamp,
EventTime4 timestamp,
EventTime5 timestamp,
EventTime6 timestamp,
EventTime7 timestamp,
EventTime8 timestamp,
EventTime9 timestamp,
EventTime10 timestamp)
This query will be used as the data source in Tableau (data visualization and analysis software), so it would be great if it could be a single query (without temp tables, new functions, etc.).
Thanks!
I've got a remedial question about pulling results out of a CTE in a later part of the query. For the example code, below are the relevant, stripped down tables:
CREATE TABLE print_job (
created_dts timestamp not null default now(),
status text not null
);
CREATE TABLE calendar_day (
date_actual date not null
);
In the current setup, there are gaps in the dates in the print_job data, and we would like to have a gapless result. For example, there are 87 days from the first to last date in the table, and only 77 days in there have data. We've already got a calendar_day dimension table to join with to get the 87 rows for the 87-day range. It's easy enough to figure out the min and max dates in the data with a subquery or in a CTE, but I don't know how to use those values from a CTE. I've got a full query below, but here are the relevant fragments with comments:
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- This CTE does not work because it doesn't know what date_range is.
complete_date_series_using_cte AS (
select date_actual
from calendar_day
where date_actual >= date_range.start_date
and date_actual <= date_range.end_date
),
-- Subqueries are fine, because the FROM is specified in the subquery condition directly.
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
I run into this regularly, and finally figured I'd ask. I've hunted around already for an answer, but I'm not clear how to summarize it well. And while there's nothing wrong with the subqueries in this case, I've got other situations where a CTE is nicer/more readable.
If it helps, I've listed the complete query below.
-- Get some counts and give them names.
WITH
daily_status AS (
select created_dts::date as created_date,
count(*) AS daily_total,
count(*) FILTER (where status = 'Error') AS status_error,
count(*) FILTER (where status = 'Processing') AS status_processing,
count(*) FILTER (where status = 'Aborted') AS status_aborted,
count(*) FILTER (where status = 'Done') AS status_done
from print_job
group by created_dts::date
),
-- Get the date range from the data.
date_range AS (
select min(created_dts::date) AS start_date,
max(created_dts::date) AS end_date
from print_job),
-- There are gaps in the data, and we want a row for dates with no results.
-- Could use generate_series on a timestamp & convert that to dates. But,
-- in our case, we've already got dimension tables for days. All that's needed
-- here is the actual date.
-- This CTE does not work because it doesn't know what date_range is.
-- complete_date_series_using_cte AS (
-- select date_actual
-- from calendar_day
-- where date_actual >= date_range.start_date
-- and date_actual <= date_range.end_date
-- ),
complete_date_series_using_subquery AS (
select date_actual
from calendar_day
where date_actual >= (select min(created_dts::date) from print_job)
and date_actual <= (select max(created_dts::date) from print_job)
)
-- The final query joins the complete date series with whatever data is in the print_job table daily summaries.
select date_actual,
coalesce(daily_total,0) AS total,
coalesce(status_error,0) AS errors,
coalesce(status_processing,0) AS processing,
coalesce(status_aborted,0) AS aborted,
coalesce(status_done,0) AS done
from complete_date_series_using_subquery
left join daily_status
on daily_status.created_date =
complete_date_series_using_subquery.date_actual
order by date_actual
I said it was a remedial question... I remembered where I'd seen this done before:
https://tapoueh.org/manual-post/2014/02/postgresql-histogram/
In my example, I need to list the CTE in the table list. That's obvious in retrospect, and I realize that I automatically don't think to do that as I'm habitually avoiding CROSS JOIN. The fragment below shows the slight change needed:
WITH
date_range AS (
select min(created_dts)::date as start_date,
max(created_dts)::date as end_date
from print_job
),
complete_date_series AS (
select date_actual
from calendar_day, date_range
where date_actual >= date_range.start_date
and date_actual <= date_range.end_date
),
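The same idea can also be written with an explicit JOIN instead of the comma-separated table list. The following is a minimal, self-contained sketch: the inline print_job rows and the generated calendar_day rows are made-up stand-ins for the real tables, and the aliases c and r are arbitrary.

```sql
WITH
-- Hypothetical sample data in place of the real tables.
print_job (created_dts, status) AS (
    VALUES (timestamp '2024-01-01 09:00', 'Done'),
           (timestamp '2024-01-04 17:30', 'Error')
),
calendar_day (date_actual) AS (
    SELECT d::date
    FROM generate_series(date '2023-12-25', date '2024-01-10', interval '1 day') AS d
),
date_range AS (
    SELECT min(created_dts)::date AS start_date,
           max(created_dts)::date AS end_date
    FROM print_job
),
complete_date_series AS (
    SELECT c.date_actual
    FROM calendar_day AS c
    -- date_range has exactly one row, so this join only filters calendar_day
    JOIN date_range AS r
      ON c.date_actual BETWEEN r.start_date AND r.end_date
)
SELECT date_actual
FROM complete_date_series
ORDER BY date_actual;
```

With the sample data this yields one row per day from 2024-01-01 through 2024-01-04, including the two days that have no print jobs.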
I have the following query.
SELECT *
FROM (SELECT temp.*, ROWNUM AS rn
FROM ( SELECT (id) M_ID,
CREATION_DATE,
RECIPIENT_STATUS,
PARENT_OR_CHILD,
CHILD_COUNT,
IS_PICKABLE,
IS_GOLDEN,
trxn_id,
id AS id,
MASTER_ID,
request_wf_state,
TITLE,
FIRST_NAME,
MIDDLE,
LAST_NAME,
FULL_NAME_LNF,
FULL_NAME_FNF,
NAME_OF_ORGANIZATION,
ADDRESS,
CITY,
STATE,
COUNTRY,
HCP_TYPE,
HCP_SUBTYPE,
is_edit_locked,
record_type rec_type,
DATA_SOURCE_NAME,
DEA_DATA,
NPI_DATA,
STATE_DATA,
RPPS,
SIREN_NUMBER,
FINESS,
ROW_NUMBER ()
OVER (PARTITION BY id ORDER BY full_name_fnf)
AS rp
FROM V_RECIPIENT_TRANS_SCRN_OP
WHERE 1 = 1
AND creation_date >=
to_date( '01-Sep-2015', 'DD-MON-YYYY') AND creation_date <=
to_date( '09-Sep-2015', 'DD-MON-YYYY')
ORDER BY CREATION_DATE DESC) temp
WHERE rp = 1)
WHERE rn > 0 AND rn < 10;
The issue is that the above query does not return data which has creation_date as '09-Sep-2015'.
NLS_DATE_FORMAT of my database is 'DD-MON-RR'.
Datatype of the column creation_date is date and the date format in which date is stored is MM/DD/YYYY.
Since your column creation_date has values with non-zero time components, and the result of to_date( '09-Sep-2015', 'DD-MON-YYYY') has a zero time component, the predicate creation_date <= to_date( '09-Sep-2015', 'DD-MON-YYYY') matches only rows whose time is exactly midnight on that date. As an example, "9/9/2015 1:07:45 AM" is clearly greater than "9/9/2015 12:00:00 AM", which is returned by your to_date() call.
You will need to take into account the time component of the Oracle DATE data type.
One option is to use the trunc() function, as you did, to remove the time component from values of creation_date. However, this may prevent the use of an index on creation_date, if one exists.
A better alternative, in my view, would be to reformulate your predicate as creation_date < to_date( '10-Sep-2015', 'DD-MON-YYYY'), which would match any time values on the date of 09-Sep-2015.
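A small Oracle sketch of that half-open range, assuming made-up sample rows selected from dual in place of the real view (sample_rows is a hypothetical name). Adding 1 to a DATE moves it forward by one day, so the upper bound becomes midnight of 10-Sep-2015.

```sql
-- Oracle syntax; sample_rows is illustrative data, not the real view.
WITH sample_rows AS (
    SELECT TO_DATE('2015-09-08 23:59:59', 'YYYY-MM-DD HH24:MI:SS') AS creation_date FROM dual
    UNION ALL
    SELECT TO_DATE('2015-09-09 01:07:45', 'YYYY-MM-DD HH24:MI:SS') FROM dual
    UNION ALL
    SELECT TO_DATE('2015-09-10 00:00:00', 'YYYY-MM-DD HH24:MI:SS') FROM dual
)
SELECT creation_date
FROM   sample_rows
WHERE  creation_date >= DATE '2015-09-01'
AND    creation_date <  DATE '2015-09-09' + 1;  -- strictly before 10-Sep-2015 00:00:00
```

The first two sample rows match; the row at exactly midnight on 10-Sep-2015 does not. Because the column itself is not wrapped in a function, an ordinary index on creation_date remains usable.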
I'm getting interval times via:
SELECT time_col - lag(time_col) OVER (ORDER BY whatever)
FROM myTable where conditions
This returns a table like this:
00:00:38
00:05:10
00:02:05
...
I want to have the time in seconds like this:
38
310
125
...
Here is my approach:
SELECT EXTRACT(epoch from dt) from (SELECT time_col - lag(time_col) OVER (ORDER BY whatever) FROM myTable where conditions) as dt
dt should be the table with the difference times (intervals). However I get the following error:
Error: Function pg_catalog.date_part(unknown, record) does not exist
So I have to cast record (the table 'dt') to interval? How do I do that? Or is this completely wrong? Sorry, I'm new to database queries.
Either give the computed column an alias and extract from that column:
SELECT EXTRACT(epoch from dt)
from (
SELECT time_col - lag(time_col) OVER (ORDER BY whatever) dt
FROM myTable
where conditions
) as diffs
Or this
SELECT
extract(epoch from time_col - lag(time_col) OVER (ORDER BY whatever))
FROM myTable
where conditions
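A self-contained version of the second form, using made-up timestamps in a CTE named my_table (hypothetical data and column alias) so it can be run as-is:

```sql
WITH my_table (time_col) AS (
    VALUES (timestamp '2024-01-01 10:00:00'),
           (timestamp '2024-01-01 10:00:38'),
           (timestamp '2024-01-01 10:05:48')
)
SELECT time_col,
       -- interval between this row and the previous one, expressed in seconds
       EXTRACT(epoch FROM time_col - lag(time_col) OVER (ORDER BY time_col)) AS seconds_since_previous
FROM my_table
ORDER BY time_col;
```

The first row has no predecessor, so its value is NULL; the other two rows are 38 and 310 seconds after their predecessors. The exact number formatting depends on the PostgreSQL version.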
I have my measurement data stored into the following structure:
CREATE TABLE measurements(
measured_at TIMESTAMPTZ,
val INTEGER
);
I already know that using
(a) date_trunc('hour', measured_at)
and
(b) generate_series
I would be able to aggregate my data by microseconds, milliseconds, and so on.
But is it possible to aggregate the data by 5 minutes or let's say an arbitrary amount of seconds? Is it possible to aggregate measured data by an arbitrary multiple of seconds?
I need the data aggregated by different time resolutions to feed them into a FFT or an AR-Model in order to see possible seasonalities.
You can generate a table of "buckets" by adding intervals created by generate_series(). This SQL statement will generate a table of five-minute buckets for the first day (the value of min(measured_at)) in your data.
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n
Wrap that statement in a common table expression, and you can join and group on it as if it were a base table.
with five_min_intervals as (
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n
)
select f.start_time, f.end_time, avg(m.val) avg_val
from measurements m
right join five_min_intervals f
on m.measured_at >= f.start_time and m.measured_at < f.end_time
group by f.start_time, f.end_time
order by f.start_time
Grouping by an arbitrary number of seconds is similar--use date_trunc().
A more general use of generate_series() lets you avoid guessing the upper limit for five-minute buckets. In practice, you'd probably build this as a view or a function. You might get better performance from a base table.
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, ((select max(measured_at)::date - min(measured_at)::date from measurements) + 1)*24*60, 5) n;
Catcall has a great answer. My example of using it demonstrates having fixed buckets - in this case 30 minute intervals starting at midnight. It also shows that there can be one extra bucket generated in Catcall's first version and how to eliminate it. I wanted exactly 48 buckets in a day. In my problem, observations have separate date and time columns and I want to average the observations within a 30 minute period across the month for a number of different services.
with intervals as (
select
(n||' minutes')::interval as start_time,
((n+30)|| ' minutes')::interval as end_time
from generate_series(0, (23*60+30), 30) n
)
select i.start_time, o.service, avg(o.o)
from
observations o right join intervals i
on o.time >= i.start_time and o.time < i.end_time
where o.date between '2013-01-01' and '2013-01-31'
group by i.start_time, i.end_time, o.service
order by i.start_time
How about
SELECT MIN(val),
floor(EXTRACT(epoch FROM measured_at) / EXTRACT(epoch FROM INTERVAL '5 min')) AS bucket
FROM measurements
GROUP BY bucket
where '5 min' can be any expression supported by INTERVAL
The following will give you buckets of any size, even if they don't align well with a nice minute/hour/whatever boundary. The value "300" is for a 5 minute grouping, but any value can be substituted:
select measured_at,
val,
(date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300 + timestamptz 'epoch') as aligned_measured_at
from measurements;
You can then use whatever aggregate you need around "val", and use "group by aligned_measured_at" as required.
This is based on Mike Sherrill's answer, except that it uses timestamp intervals instead of separate start/end columns.
with intervals as (
select tstzrange(s, s + '5 minutes') das_interval
from (select generate_series(min(lower(time_range)), max(upper(time_range)), '5 minutes') s
from your_table) x)
select das_interval, your_table.*
from your_table
right join intervals on time_range && das_interval
order by das_interval;
From PostgreSQL v14 on, you can use the date_bin function for that:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
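To try date_bin without the real table, here is a minimal sketch with made-up measurements in a CTE. It assumes PostgreSQL 14 or later, and the origin is written with an explicit UTC offset so the bucket boundaries do not depend on the session time zone.

```sql
-- Requires PostgreSQL 14+ (date_bin). Sample rows are illustrative only.
WITH measurements (measured_at, val) AS (
    VALUES (timestamptz '2024-01-01 00:01:10+00', 10),
           (timestamptz '2024-01-01 00:04:59+00', 20),
           (timestamptz '2024-01-01 00:05:00+00', 5)
)
SELECT date_bin(interval '5 minutes',
                measured_at,
                timestamptz '2000-01-01 00:00:00+00') AS bucket_start,
       sum(val) AS total
FROM measurements
GROUP BY bucket_start
ORDER BY bucket_start;
```

The first two rows fall into the bucket starting at 00:00 (total 30) and the third into the bucket starting at 00:05 (total 5). Buckets are aligned to the origin, so shifting the origin shifts every boundary.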
I wanted to look at the past 24 hours of data and count things in hourly increments. I started with Catcall's solution, which is pretty slick. It's bound to the data, though, rather than just what's happened in the past 24 hours. So I refactored and ended up with something pretty close to Julian's solution, but with more CTEs. So it's sort of a marriage of the two answers.
WITH interval_query AS (
SELECT (ts ||' hour')::INTERVAL AS hour_interval
FROM generate_series(0,23) AS ts
), time_series AS (
SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
FROM interval_query
), time_intervals AS (
SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
FROM time_series ORDER BY start_time
), reading_counts AS (
SELECT f.start_time, f.end_time, br.minor, count(br.id) readings
FROM beacon_readings br
RIGHT JOIN time_intervals f
ON br.reading_timestamp >= f.start_time AND br.reading_timestamp < f.end_time AND br.major = 4
GROUP BY f.start_time, f.end_time, br.minor
ORDER BY f.start_time, br.minor
)
SELECT * FROM reading_counts
Note that any additional limiting I wanted in the final query needed to be done in the RIGHT JOIN. I'm not suggesting this is necessarily the best (or even a good) approach, but it is something I'm running with (at least at the moment) in a dashboard.
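As a possible simplification, the hourly buckets can also be generated directly as timestamps. This sketch truncates to the current hour instead of rounding to the nearest hour as the query above does, so it is a variation under that assumption rather than an exact equivalent.

```sql
WITH time_intervals AS (
    SELECT h AS start_time,
           h + interval '1 hour' AS end_time
    -- 24 hourly steps ending with the hour that contains now()
    FROM generate_series(date_trunc('hour', now()) - interval '23 hours',
                         date_trunc('hour', now()),
                         interval '1 hour') AS h
)
SELECT start_time, end_time
FROM time_intervals
ORDER BY start_time;
```

This returns 24 rows, which can take the place of the first three CTEs before the join against the readings table.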
The Timescale extension for PostgreSQL gives the ability to group by arbitrary time intervals. The function is called time_bucket() and has a similar syntax to the date_trunc() function, but takes an interval instead of a time precision as its first parameter. Its API docs describe the details. This is an example:
SELECT
time_bucket('5 minutes', observation_time) as bucket,
device_id,
avg(metric) as metric_avg,
max(metric) - min(metric) as metric_spread
FROM
device_readings
GROUP BY bucket, device_id;
You may also take a look at continuous aggregate views if you want the 'grouped by an interval' views to be updated automatically as new data is ingested and you query these views frequently. This can save you a lot of resources and make your queries a lot faster.
I've synthesized all of the above to try to come up with something slightly easier to use:
create or replace function interval_generator(start_ts timestamp with TIME ZONE, end_ts timestamp with TIME ZONE, round_interval INTERVAL)
returns TABLE(start_time timestamp with TIME ZONE, end_time timestamp with TIME ZONE) as $$
BEGIN
return query
SELECT
(n) start_time,
(n + round_interval) end_time
FROM generate_series(date_trunc('minute', start_ts), end_ts, round_interval) n;
END
$$
LANGUAGE plpgsql;
This function is a timestamp abstraction of Mike's answer, which (IMO) makes things a little cleaner, especially if you're generating queries on the client end.
Also using an inner join gets rid of the sea of NULLs that appeared previously.
with intervals as (select * from interval_generator(NOW() - INTERVAL '24 hours' , NOW(), '30 seconds'::INTERVAL))
select f.start_time, m.session_id, m.metric, min(m.value) min_val, avg(m.value) avg_val, max(m.value) max_val
from ts_combined as m
inner JOIN intervals f
on m.time >= f.start_time and m.time < f.end_time
GROUP BY f.start_time, f.end_time, m.metric, m.session_id
ORDER BY f.start_time desc
(Also for my purposes I added in a few more aggregation fields)
Perhaps, you can extract(epoch from measured_at) and go from that?
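One way that suggestion could be fleshed out, as a sketch rather than a definitive answer: divide the epoch by the bucket width in seconds, floor it, and multiply back. The 90-second width and the sample rows below are arbitrary assumptions chosen for illustration.

```sql
WITH measurements (measured_at, val) AS (
    VALUES (timestamptz '2024-01-01 00:00:10+00', 1),
           (timestamptz '2024-01-01 00:01:29+00', 2),
           (timestamptz '2024-01-01 00:01:30+00', 3),
           (timestamptz '2024-01-01 00:03:05+00', 4)
)
SELECT to_timestamp(floor(extract(epoch FROM measured_at) / 90) * 90) AS bucket_start,
       count(*) AS n,
       avg(val) AS avg_val
FROM measurements
GROUP BY bucket_start   -- PostgreSQL allows grouping by the output column name
ORDER BY bucket_start;
```

With these rows there are three buckets, starting at 00:00:00, 00:01:30 and 00:03:00 UTC; the first one holds two measurements. Buckets are aligned to the Unix epoch, so widths that do not divide a day evenly will not line up with midnight.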