Finding the timeslot with the maximum decrease in count of nearby points - PostgreSQL

For each entry in the loc_of_interest table, I want to find the 15-minute timeslot (from the data in the other CTE) with the maximum decrease in the count of nearby points. I do not know how to proceed beyond the 'pseudocode' part, and indeed I am uncertain whether I am going in the right direction with the existing code as well.
Here is my code:
-- I have two CTEs already made
subset_cr -- many rows of data
(device_id, points_geom, time_created)
loc_of_interest -- 2 rows of data
(loc_id, points_geom)
-- here is how I wish to proceed:
with temp as (
    SELECT loi.loc_id,
           routes.fifteen_min_slot,
           routes.count_of_near_points
    FROM loc_of_interest AS loi
    CROSS JOIN LATERAL (
        SELECT date_trunc('hour', cr.time_created)
               + date_part('minute', cr.time_created)::int / 15 * interval '15 min' AS fifteen_min_slot,
               -- count only rows actually within 100 m; a bare count(ST_DWithin(...))
               -- would count every row, because count() counts all non-null values
               count(*) FILTER (WHERE ST_DWithin(
                   loi.points_geom::geography,
                   ST_Transform(cr.points_geom, 4326)::geography,
                   100)) AS count_of_near_points
        FROM subset_cr AS cr
        GROUP BY 1   -- aggregate inside the lateral, one row per 15-minute slot
    ) routes
)
--pseudocode below
for each loc_id
select fifteen_min_slot
from temp
where difference in count_of_near_points is max
Code update:
I have added the following code for the pseudocode I wrote earlier:
tempy as (
    select loc_id,
           fifteen_min_slot,
           count_of_near_points
             - lag(count_of_near_points) over (partition by loc_id order by fifteen_min_slot) as lagging_diff
    from temp
)
select loc_id, fifteen_min_slot
from tempy
where lagging_diff = (select max(lagging_diff) from tempy)
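Note that select max(lagging_diff) picks the single largest difference across all loc_ids, and a positive lagging_diff is an increase, not a decrease. If the goal is the biggest drop per loc_id, a DISTINCT ON sketch (keeping the tempy CTE above) could replace the final select; this assumes "maximum decrease" means the most negative difference:
select distinct on (loc_id) loc_id, fifteen_min_slot
from tempy
where lagging_diff is not null      -- the first slot of each loc_id has no predecessor
order by loc_id, lagging_diff asc;  -- ascending puts the most negative (biggest drop) first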

Related

PostgreSQL: Use a specified value instead of first_value

I am trying to replace first_value with a specified value to use in an equation, because the alphabetical sort is causing an issue. I want to use the value from the row where the segment column equals 'control' in the direct_mail_test table. So I need to find a way to reference just that value ('control') to then use in the equation. New to PostgreSQL; any help would be greatly appreciated.
Here is my current code
select segment,
count(*) as total,
count(b.c_guid) as bookings,
100.0 * count(b.c_guid)/count(*) as percent,
100.0 * count(b.c_guid) / first_value(count(b.c_guid)) over ( order by segment asc ) as comp
from mailing_tests
left join (
select distinct g.contact_guid as c_guid
from guest g
inner join booking
on booking.guid = g.booking_guid
where booking.book_date >= {{date_start}}
[[and booking.book_date < {{date_end}}]]
and booking.status in ('Booked')
) b
on mailing_tests.guid = b.c_guid
where {{project}}
group by segment
order by segment asc
Here is my output:
segment      total    bookings   percent   comp
catalog      4,091    30         0.73      100
control      30,611   359        1.17      1,196.67
direct_mail  30,611   393        1.28      1,310
online_ads   30,611   371        1.21      1,236.67
As of now it's taking 'catalog' as the baseline measure, and I need it to take 'control' as the baseline measure.
Just to put some more context on the code: I am using Metabase, so
{{date_start}}
[[and peak15_booking.book_date < {{date_end}}]]
{{project}}
are all variable placeholders used in Metabase.
I tried to use nth_value, the FETCH clause, and many others, but I'm not sure if I was even using them properly. I have been unsuccessful in finding any answer to this.
I found a way to make this work, but I'm not sure if it's the cleanest way to do this or if I will run into future issues. I still haven't figured out how to replace first_value, but I did add order by segment='control' desc twice: once in the window definition in the SELECT statement and once in the final ORDER BY clause. Again, not sure if this is the correct way to do this, but I thought I would show that I did end up figuring it out. Not sure if I should have answered my own question or deleted it, but I thought this might help someone else.
select segment,
count(*) as total,
count(b.c_guid) as bookings,
100.0 * count(b.c_guid) / count(*) as percent,
100.0 * count(b.c_guid) / first_value(count(b.c_guid)) over ( order by segment='control' desc, segment asc ) as comp
from mailing_tests
left join (
select distinct g.contact_guid as c_guid
from guest g
inner join booking
on booking.guid = g.booking_guid
where booking.book_date >= {{date_start}}
[[and booking.book_date < {{date_end}}]]
and booking.status in ('Booked')
) b
on mailing_tests.guid = b.c_guid
where {{project}}
group by segment
order by segment='control' desc, segment asc
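A cleaner alternative might be to pick up the control group's bookings with a window aggregate plus a FILTER clause instead of the ordering trick. This is a sketch, assuming PostgreSQL 9.4 or later (for FILTER) and the same tables and Metabase variables as above:
select segment,
       count(*) as total,
       count(b.c_guid) as bookings,
       100.0 * count(b.c_guid) / count(*) as percent,
       -- FILTER keeps only the 'control' group's count; the empty OVER ()
       -- broadcasts that single value to every output row
       100.0 * count(b.c_guid)
             / max(count(b.c_guid)) filter (where segment = 'control') over () as comp
from mailing_tests
left join (
    select distinct g.contact_guid as c_guid
    from guest g
    inner join booking on booking.guid = g.booking_guid
    where booking.book_date >= {{date_start}}
    [[and booking.book_date < {{date_end}}]]
    and booking.status in ('Booked')
) b on mailing_tests.guid = b.c_guid
where {{project}}
group by segment
order by segment asc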

Finding the percentage (%) range of average value in SQL

I want to return the values that lie within 20% of the average value of the Duration column in my database.
I want to build on the code below, but instead of returning rows where Duration is less than the average duration, I want to return all rows whose Duration lies within 20% of the AVG(Duration) value.
Select * From table
Where Duration < (Select AVG(Duration) from table)
Here is one way...
Select * From table
Where Duration between (Select AVG(Duration)*0.8 from table)
and (Select AVG(Duration)*1.2 from table)
Perhaps this, to avoid repeated scans:
with cte as ( Select AVG(Duration) as AvgDuration from table )
Select * From table
Where Duration between (Select AvgDuration*0.8 from cte)
and (Select AvgDuration*1.2 from cte)
or
Select table.* From table
cross join ( Select AVG(Duration) as AvgDuration from table ) cj
Where Duration between cj.AvgDuration*0.8 and cj.AvgDuration*1.2
or using a window function:
Select d.*
from (
SELECT table.*
, AVG(Duration) OVER() as AvgDuration
From table
) d
Where d.Duration between d.AvgDuration*0.8 and d.AvgDuration*1.2
The last one might be the most efficient method.

Grouping consecutive dates in PostgreSQL

I have two tables which I need to combine, as sometimes some dates are found in table A and not in table B, and vice versa. My desired result is that overlapping or consecutive date ranges be combined.
I'm using PostgreSQL.
Table A
id startdate enddate
--------------------------
101 12/28/2013 12/31/2013
Table B
id startdate enddate
--------------------------
101 12/15/2013 12/15/2013
101 12/16/2013 12/16/2013
101 12/28/2013 12/28/2013
101 12/29/2013 12/31/2013
Desired Result
id startdate enddate
-------------------------
101 12/15/2013 12/16/2013
101 12/28/2013 12/31/2013
Right. I have a query that I think works. It certainly works on the sample records you provided. It uses a recursive CTE.
First, you need to merge the two tables. Next, use a recursive CTE to get the sequences of overlapping dates. Finally, get the start and end dates, and join back to the "merged" table to get the id.
with recursive allrecords as (
    -- merge the input tables, adding a unique row identifier
    select *, row_number() over (order by startdate) as rowid
    from (select * from table1
          union
          select * from table2) a
),
path as (
    -- the recursive CTE: builds the sequences of overlapping/adjacent ranges
    select rowid as parent, rowid, startdate, enddate
    from allrecords a
    union
    select p.parent, b.rowid, b.startdate, b.enddate
    from allrecords b
    join path p
      on (p.enddate + interval '1 day') >= b.startdate
     and p.startdate <= b.startdate
)
-- outer query to get the id
select id, g.startdate, g.enddate
from
    -- inner query: the start and end of each sequence
    (select parent, min(startdate) as startdate, max(enddate) as enddate
     from (select *,
                  row_number() over (partition by rowid order by parent, startdate) as rn
           from path) a
     where rn = 1   -- we only want the first occurrence of each record
     group by parent) g
inner join allrecords a on a.rowid = g.parent
The fragment below does what you intend (but it will probably be very slow). The problem is that detecting (non-)overlapping date ranges is impossible with the standard range operators alone, since a range could be split into two parts.
So, my code does the following:
split the date ranges from table_A into atomic records, with one date per record
do the same for table_B
full outer join these two tables (we are only interested in A_not_in_B and B_not_in_A), remembering which wing of the outer join each row came from
re-aggregate the resulting records into date ranges.
-- EXPLAIN ANALYZE
--
WITH RECURSIVE ranges AS (
-- Chop up the a-table into atomic date units
WITH ar AS (
SELECT generate_series(a.startdate,a.enddate , '1day'::interval)::date AS thedate
, 'A'::text AS which
, a.id
FROM a
)
-- Same for the b-table
, br AS (
SELECT generate_series(b.startdate,b.enddate, '1day'::interval)::date AS thedate
, 'B'::text AS which
, b.id
FROM b
)
-- combine the two sets, retaining a_not_in_b plus b_not_in_a
, moments AS (
SELECT COALESCE(ar.id,br.id) AS id
, COALESCE(ar.which, br.which) AS which
, COALESCE(ar.thedate, br.thedate) AS thedate
FROM ar
FULL JOIN br ON br.id = ar.id AND br.thedate = ar.thedate
WHERE ar.id IS NULL OR br.id IS NULL
)
-- use a recursive CTE to re-aggregate the atomic moments into ranges
SELECT m0.id, m0.which
, m0.thedate AS startdate
, m0.thedate AS enddate
FROM moments m0
WHERE NOT EXISTS ( SELECT * FROM moments nx WHERE nx.id = m0.id AND nx.which = m0.which
AND nx.thedate = m0.thedate -1
)
UNION ALL
SELECT rr.id, rr.which
, rr.startdate AS startdate
, m1.thedate AS enddate
FROM ranges rr
JOIN moments m1 ON m1.id = rr.id AND m1.which = rr.which AND m1.thedate = rr.enddate +1
)
SELECT * FROM ranges ra
WHERE NOT EXISTS (SELECT * FROM ranges nx
-- suppress partial subassemblies
WHERE nx.id = ra.id AND nx.which = ra.which
AND nx.startdate = ra.startdate
AND nx.enddate > ra.enddate
)
;
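For comparison, the same result can be had without recursion using a gaps-and-islands approach with window functions. This is a sketch, assuming the inputs are called table_a and table_b with date-typed startdate/enddate columns:
WITH merged AS (
    SELECT id, startdate, enddate FROM table_a
    UNION
    SELECT id, startdate, enddate FROM table_b
), flagged AS (
    -- a range starts a new group unless it overlaps or touches the
    -- furthest enddate seen so far within the same id
    SELECT id, startdate, enddate,
           CASE WHEN startdate <= max(enddate) OVER (
                        PARTITION BY id ORDER BY startdate
                        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + 1
                THEN 0 ELSE 1 END AS new_grp
    FROM merged
), grouped AS (
    SELECT id, startdate, enddate,
           sum(new_grp) OVER (PARTITION BY id ORDER BY startdate) AS grp
    FROM flagged
)
SELECT id, min(startdate) AS startdate, max(enddate) AS enddate
FROM grouped
GROUP BY id, grp
ORDER BY id, startdate;
On the sample data this returns the two desired rows for id 101 (12/15-12/16 and 12/28-12/31).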

PostgreSQL: GROUP BY time interval with arbitrary accuracy (down to milliseconds)

I have my measurement data stored into the following structure:
CREATE TABLE measurements(
measured_at TIMESTAMPTZ,
val INTEGER
);
I already know that using
(a) date_trunc('hour', measured_at)
and
(b) generate_series
I would be able to aggregate my data by microseconds, milliseconds, and the other named precisions.
But is it possible to aggregate the data by 5 minutes, or by an arbitrary multiple of seconds?
I need the data aggregated at different time resolutions to feed into an FFT or an AR model, in order to look for possible seasonalities.
You can generate a table of "buckets" by adding intervals created by generate_series(). This SQL statement will generate a table of five-minute buckets for the first day (the value of min(measured_at)) in your data.
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n
Wrap that statement in a common table expression, and you can join and group on it as if it were a base table.
with five_min_intervals as (
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, (24*60), 5) n
)
select f.start_time, f.end_time, avg(m.val) avg_val
from measurements m
right join five_min_intervals f
on m.measured_at >= f.start_time and m.measured_at < f.end_time
group by f.start_time, f.end_time
order by f.start_time
Grouping by an arbitrary number of seconds is similar: use date_trunc().
A more general use of generate_series() lets you avoid guessing the upper limit for five-minute buckets. In practice, you'd probably build this as a view or a function. You might get better performance from a base table.
select
(select min(measured_at)::date from measurements) + ( n || ' minutes')::interval start_time,
(select min(measured_at)::date from measurements) + ((n+5) || ' minutes')::interval end_time
from generate_series(0, ((select max(measured_at)::date - min(measured_at)::date from measurements) + 1)*24*60, 5) n;
Catcall has a great answer. My example of using it demonstrates fixed buckets, in this case 30-minute intervals starting at midnight. It also shows that one extra bucket can be generated in Catcall's first version, and how to eliminate it. I wanted exactly 48 buckets in a day. In my problem, observations have separate date and time columns, and I want to average the observations within each 30-minute period across the month for a number of different services.
with intervals as (
select
(n||' minutes')::interval as start_time,
((n+30)|| ' minutes')::interval as end_time
from generate_series(0, (23*60+30), 30) n
)
select i.start_time, o.service, avg(o.o)
from
observations o right join intervals i
on o.time >= i.start_time and o.time < i.end_time
where o.date between '2013-01-01' and '2013-01-31'
group by i.start_time, i.end_time, o.service
order by i.start_time
How about
SELECT MIN(val),
       floor(EXTRACT(epoch FROM measured_at) / EXTRACT(epoch FROM INTERVAL '5 min')) AS int
FROM measurements
GROUP BY int
where '5 min' can be any expression supported by INTERVAL. (The floor() is needed so that all timestamps in a bucket share the same group key; without it, each distinct timestamp would land in its own group.)
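If you want the bucket back as a timestamp rather than a bare bucket number, you can multiply back and convert. A sketch for 5-minute (300-second) buckets:
SELECT to_timestamp(floor(EXTRACT(epoch FROM measured_at) / 300) * 300) AS bucket_start,
       MIN(val)
FROM measurements
GROUP BY 1
ORDER BY 1;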
The following will give you buckets of any size, even if they don't align well with a nice minute/hour/whatever boundary. The value 300 is for a 5-minute grouping, but any value can be substituted:
select measured_at,
val,
(date_trunc('seconds', (measured_at - timestamptz 'epoch') / 300) * 300 + timestamptz 'epoch') as aligned_measured_at
from measurements;
You can then use whatever aggregate you need around "val", and use "group by aligned_measured_at" as required.
This is based on Mike Sherrill's answer, except that it uses timestamp ranges instead of separate start/end columns.
with intervals as (
select tstzrange(s, s + '5 minutes') das_interval
from (select generate_series(min(lower(time_range)), max(upper(time_range)), '5 minutes') s
from your_table) x)
select das_interval, your_table.*
from your_table
right join intervals on time_range && das_interval
order by das_interval;
From PostgreSQL v14 on, you can use the date_bin function for that:
SELECT date_bin(
INTERVAL '5 minutes',
measured_at,
TIMESTAMPTZ '2000-01-01'
),
sum(val)
FROM measurements
GROUP BY 1;
I wanted to look at the past 24 hours of data and count things in hourly increments. I started with Cat Recall's solution, which is pretty slick, but it's bound to the data rather than to just what happened in the past 24 hours. So I refactored and ended up with something pretty close to Julian's solution, but with more CTEs. It's sort of the marriage of the two answers.
WITH interval_query AS (
SELECT (ts ||' hour')::INTERVAL AS hour_interval
FROM generate_series(0,23) AS ts
), time_series AS (
SELECT date_trunc('hour', now()) + INTERVAL '60 min' * ROUND(date_part('minute', now()) / 60.0) - interval_query.hour_interval AS start_time
FROM interval_query
), time_intervals AS (
SELECT start_time, start_time + '1 hour'::INTERVAL AS end_time
FROM time_series ORDER BY start_time
), reading_counts AS (
SELECT f.start_time, f.end_time, br.minor, count(br.id) readings
FROM beacon_readings br
RIGHT JOIN time_intervals f
ON br.reading_timestamp >= f.start_time AND br.reading_timestamp < f.end_time AND br.major = 4
GROUP BY f.start_time, f.end_time, br.minor
ORDER BY f.start_time, br.minor
)
SELECT * FROM reading_counts
Note that any additional limiting I wanted in the final query needed to be done in the RIGHT JOIN. I'm not suggesting this is necessarily the best (or even a good) approach, but it is something I'm running with (at least at the moment) in a dashboard.
The Timescale extension for PostgreSQL gives the ability to group by arbitrary time intervals. The function is called time_bucket() and has the same syntax as the date_trunc() function but takes an interval instead of a time precision as first parameter. Here you can find its API Docs. This is an example:
SELECT
time_bucket('5 minutes', observation_time) as bucket,
device_id,
avg(metric) as metric_avg,
max(metric) - min(metric) as metric_spread
FROM
device_readings
GROUP BY bucket, device_id;
You may also take a look at continuous aggregate views if you want the 'grouped by an interval' views to be updated automatically with newly ingested data, and if you query these views frequently. This can save you a lot of resources and make your queries a lot faster.
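For example, a continuous aggregate over the same query could look like the sketch below; it assumes device_readings is a TimescaleDB hypertable and TimescaleDB 2.x syntax:
CREATE MATERIALIZED VIEW device_readings_5min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', observation_time) AS bucket,
    device_id,
    avg(metric) AS metric_avg,
    max(metric) - min(metric) AS metric_spread
FROM device_readings
GROUP BY bucket, device_id;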
I've taken a synthesis of all the above to try and come up with something slightly easier to use;
create or replace function interval_generator(start_ts timestamp with TIME ZONE, end_ts timestamp with TIME ZONE, round_interval INTERVAL)
returns TABLE(start_time timestamp with TIME ZONE, end_time timestamp with TIME ZONE) as $$
BEGIN
return query
SELECT
(n) start_time,
(n + round_interval) end_time
FROM generate_series(date_trunc('minute', start_ts), end_ts, round_interval) n;
END
$$
LANGUAGE 'plpgsql';
This function is a timestamp abstraction of Mike's answer, which (IMO) makes things a little cleaner, especially if you're generating queries on the client end.
Also, using an inner join gets rid of the sea of NULLs that appeared previously.
with intervals as (select * from interval_generator(NOW() - INTERVAL '24 hours' , NOW(), '30 seconds'::INTERVAL))
select f.start_time, m.session_id, m.metric, min(m.value) min_val, avg(m.value) avg_val, max(m.value) max_val
from ts_combined as m
inner JOIN intervals f
on m.time >= f.start_time and m.time < f.end_time
GROUP BY f.start_time, f.end_time, m.metric, m.session_id
ORDER BY f.start_time desc
(For my purposes I also added a few more aggregation fields.)
Perhaps you can extract(epoch from measured_at) and go from there?

T-SQL: if a value exists use it, otherwise use the value before

I have the following table
Account#   Period   Balance
--------   ------   -------
12345      200901   $11554
12345      200902   $4353
12345      201004   $34
12345      201005   $44
12345      201006   $1454
45677      200901   $14454
45677      200902   $1478
45677      201004   $116776
45677      201005   $996
56789      201006   $1567
56789      200901   $7894
56789      200902   $123
56789      201003   $543345
56789      201005   $114
56789      201006   $54
I want to select the account#s that have a period of 201005.
This is fairly easy using the code below. The problem is that if a user enters 201003, which doesn't exist for most accounts, I want the query to select the previous value instead. NOTE: there is one account# that does have a 201003 period, and I still want to select it too.
I tried CASE, IF/ELSE, and IN, but I was unsuccessful.
PS: I cannot create temp tables due to a system limitation of 5,000 rows.
Thank you.
DECLARE @INPUTPERIOD INT
SET @INPUTPERIOD = 201005

SELECT ACCOUNT#, PERIOD, BALANCE
FROM TABLE1
WHERE PERIOD = @INPUTPERIOD
SELECT t.ACCOUNT#, t.PERIOD, t.BALANCE
FROM (SELECT ACCOUNT#, MAX(PERIOD) AS MaxPeriod
FROM TABLE1
WHERE PERIOD <= @INPUTPERIOD
GROUP BY ACCOUNT#) q
INNER JOIN TABLE1 t
ON q.ACCOUNT# = t.ACCOUNT#
AND q.MaxPeriod = t.PERIOD
select top 1 account#, period, balance
from table1
where period <= @inputperiod
order by period desc
; WITH Base AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE Period <= 201003
)
SELECT * FROM Base WHERE RN = 1
Using a CTE and ROW_NUMBER(): we take all the rows with Period <= the selected period and keep the top one (the one with ROW_NUMBER() = 1).
; WITH Base AS
(
SELECT *, 1 AS RN FROM #MyTable WHERE Period = 201003
)
, Alternative AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Period DESC) RN FROM #MyTable WHERE NOT EXISTS(SELECT 1 FROM Base) AND Period < 201003
)
, Final AS
(
SELECT * FROM Base
UNION ALL
SELECT * FROM Alternative WHERE RN = 1
)
SELECT * FROM Final
This one is a lot more complex but does nearly the same thing. It is more "imperative-like": it first tries to find a row with the exact Period, and if that doesn't exist it does the same thing as before. At the end it unites the two result sets (one of the two is always empty). I would always use the first one, unless profiling showed the SQL engine couldn't comprehend what I'm trying to do; then I would try the second one.
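Both ROW_NUMBER() variants above return a single row overall, while the sample data contains several account#s. If one row per account is wanted, here is a sketch (assuming the same TABLE1 and an @INPUTPERIOD variable) using the TOP (1) WITH TIES trick to keep the first row of every partition:
SELECT TOP (1) WITH TIES ACCOUNT#, PERIOD, BALANCE
FROM TABLE1
WHERE PERIOD <= @INPUTPERIOD
ORDER BY ROW_NUMBER() OVER (PARTITION BY ACCOUNT# ORDER BY PERIOD DESC)
For each account this keeps its latest period at or before the input, matching the GROUP BY/MAX answer at the top.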