Run a SQL query against ten-minute time intervals - postgresql

I have a postgresql table with this schema:
id SERIAL PRIMARY KEY,
traveltime INT,
departuredate TIMESTAMPTZ,
departurehour TIMETZ
Here is a bit of data (edited):
id | traveltime | departuredate | departurehour
----+------------+------------------------+---------------
1 | 73 | 2019-12-24 00:00:03+01 | 00:00:03+01
2 | 73 | 2019-12-24 00:12:16+01 | 00:12:16+01
53 | 115 | 2019-12-24 07:53:44+01 | 07:53:44+01
54 | 116 | 2019-12-24 07:58:45+01 | 07:58:45+01
55 | 119 | 2019-12-24 08:03:46+01 | 08:03:46+01
56 | 120 | 2019-12-24 08:08:47+01 | 08:08:47+01
57 | 121 | 2019-12-24 08:13:48+01 | 08:13:48+01
58 | 121 | 2019-12-24 08:18:48+01 | 08:18:48+01
542 | 112 | 2019-12-26 07:52:41+01 | 07:52:41+01
543 | 114 | 2019-12-26 07:57:42+01 | 07:57:42+01
544 | 116 | 2019-12-26 08:02:43+01 | 08:02:43+01
545 | 116 | 2019-12-26 08:07:44+01 | 08:07:44+01
546 | 117 | 2019-12-26 08:12:45+01 | 08:12:45+01
547 | 118 | 2019-12-26 08:17:46+01 | 08:17:46+01
548 | 118 | 2019-12-26 08:22:48+01 | 08:22:48+01
1031 | 80 | 2019-12-28 07:50:33+01 | 07:50:33+01
1032 | 81 | 2019-12-28 07:55:34+01 | 07:55:34+01
1033 | 81 | 2019-12-28 08:00:35+01 | 08:00:35+01
1034 | 82 | 2019-12-28 08:05:36+01 | 08:05:36+01
1035 | 82 | 2019-12-28 08:10:37+01 | 08:10:37+01
1036 | 83 | 2019-12-28 08:15:38+01 | 08:15:38+01
1037 | 83 | 2019-12-28 08:20:39+01 | 08:20:39+01
I'd like to get the average of all the traveltime values collected for each 10-minute interval over several weeks.
Expected result for the data sample: for the 10-minute interval between 8h00 and 8h10, the rows included in the avg are those with id 55, 56, 544, 545, 1033 and 1034, and so on.
I can get the average for a specific interval:
select avg(traveltime) from belt where departurehour >= '10:40:00+01' and departurehour < '10:50:00+01';
To avoid creating a query for each interval, I used this query to get all the 10-minute intervals for the complete period encoded:
select i from generate_series('2019-11-23', '2020-01-18', '10 minutes'::interval) i;
What I miss is a way to apply my AVG query to each of these generated intervals. Any direction would be helpful!

It turns out that generate_series does not actually apply here, regardless of the date range. The critical part is the 144 10-minute intervals per day. Unfortunately, Postgres does not provide a built-in range type for minutes. (Perhaps creating one would be a useful exercise.) But all is not lost: you can simulate the same thing with BETWEEN; you just need to play with the end of each range.
The following generates this simulation using a recursive CTE, and then, as before, joins it to your table.
set timezone to '+1'; -- necessary to keep my local offset from affecting results.
-- create table an insert data here
-- additional data added outside of the date range, so it should not be included
with recursive min_intervals as
(select '00:00:00'::timetz start_10Min -- start of 1st 10Min interval
, '00:09:59.999999'::timetz end_10Min -- last microsecond in 10Min interval
, 1 interval_no
union all
select start_10Min + interval '10 min'
, end_10Min + interval '10 min'
, interval_no + 1
from Min_intervals
where interval_no < 144 -- 6 10Min intervals/hr * 24 Hr/day = No of 10Min intervals in any day
) -- select * from min_intervals;
select start_10Min, end_10Min, avg(traveltime) average_travel_time
from min_intervals
join belt
on departuredate::time between start_10Min and end_10Min
where departuredate::date between date '2019-11-23' and date '2020-01-18'
group by start_10Min, end_10Min
order by start_10Min;
-- test: the result for the ids 'specified' in the question. Note the added rows fall within the time frame 08:00 to 08:10,
-- but they should be excluded, so the avg for that period should be the same for both queries.
select avg(traveltime) from belt where id in (55, 56, 544, 545, 1033, 1034);
My issue with the above is that the date range is essentially hard coded (yes, substitution parameters are available), which is OK for psql or an IDE but not good for a production environment. If this is to be used in that environment, I'd use the following function to return a virtual table with the same results.
create or replace function travel_average_per_10Min_interval(
start_date_in date
, end_date_in date
)
returns table (Start_10Min timetz
,end_10Min timetz
,avg_travel_time numeric
)
language sql
as $$
with recursive min_intervals as
(select '00:00:00'::timetz start_10Min -- start of 1st 10Min interval
, '00:09:59.999999'::timetz end_10Min -- last microsecond in 10Min interval
, 1 interval_no
union all
select start_10Min + interval '10 min'
, end_10Min + interval '10 min'
, interval_no + 1
from Min_intervals
where interval_no < 144 -- 6 10Min intervals/hr * 24 Hr/day = No of 10Min intervals in any day
) -- select * from min_intervals;
select start_10Min, end_10Min, avg(traveltime) average_travel_time
from min_intervals
join belt
on departuredate::time between start_10Min and end_10Min
where departuredate::date between start_date_in and end_date_in
group by start_10Min, end_10Min
order by start_10Min;
$$;
-- test
select * from travel_average_per_10Min_interval(date '2019-11-23', date '2020-01-18');
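For completeness, a different sketch of the same idea that skips generating intervals entirely: compute each row's 10-minute slot arithmetically and group on it. This is only an alternative worth considering, not the method used above; it assumes the belt table from the question, and make_time, which exists from PostgreSQL 9.4 on.
select make_time(extract(hour from departuredate)::int,
                 (extract(minute from departuredate)::int / 10) * 10,
                 0) as slot_start,  -- start of the row's 10-minute slot
       avg(traveltime) as average_travel_time
from belt
where departuredate::date between date '2019-11-23' and date '2020-01-18'
group by slot_start
order by slot_start;
-- hour/minute extraction follows the session time zone, the same caveat as the
-- set timezone at the top of this answer.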

Related

kdb+ equivalent of SQL's rank() and dense_rank()

Has anyone ever had to simulate the result of SQL's rank(), dense_rank(), and row_number() in kdb+? Here is some SQL to demonstrate the features. If anyone has a specific solution below, perhaps I could work on generalising it to support multiple partition and order by columns, and post back on this site.
CREATE TABLE student(course VARCHAR(10), mark int, name varchar(10));
INSERT INTO student VALUES
('Maths', 60, 'Thulile'),
('Maths', 60, 'Pritha'),
('Maths', 70, 'Voitto'),
('Maths', 55, 'Chun'),
('Biology', 60, 'Bilal'),
('Biology', 70, 'Roger');
SELECT
RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY course ORDER BY mark DESC) AS row_num,
course, mark, name
FROM student ORDER BY course, mark DESC;
+------+------------+---------+---------+------+---------+
| rank | dense_rank | row_num | course | mark | name |
+------+------------+---------+---------+------+---------+
| 1 | 1 | 1 | Biology | 70 | Roger |
| 2 | 2 | 2 | Biology | 60 | Bilal |
| 1 | 1 | 1 | Maths | 70 | Voitto |
| 2 | 2 | 2 | Maths | 60 | Thulile |
| 2 | 2 | 3 | Maths | 60 | Pritha |
| 4 | 3 | 4 | Maths | 55 | Chun |
+------+------------+---------+---------+------+---------+
Here is some kdb+ to generate the equivalent student table:
student:([] course:`Maths`Maths`Maths`Maths`Biology`Biology;
mark:60 60 70 55 60 70;
name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
Thank you!
If you sort the table initially by course and mark:
student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
course mark name
--------------------
Biology 70 Roger
Biology 60 Bilal
Maths 70 Voitto
Maths 60 Thulile
Maths 60 Pritha
Maths 55 Chun
Then you can use something like the below to achieve your output:
update rank_sql:first row_num by course,mark from update dense_rank:1+where count each (where differ mark)cut mark,row_num:1+rank i by course from student
course mark name dense_rank row_num rank_sql
------------------------------------------------
Biology 70 Roger 1 1 1
Biology 60 Bilal 2 2 2
Maths 70 Voitto 1 1 1
Maths 60 Thulile 2 2 2
Maths 60 Pritha 2 3 2
Maths 55 Chun 3 4 4
This solution uses rank and the virtual index column i, if you would like to read up further on these.
For table ordered by target columns:
q) dense_sql:{sums differ x}
q) rank_sql:{raze #'[(1_deltas b),1;b:1+where differ x]}
q) row_sql:{1+til count x}
q) student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
q)update row_num:row_sql mark,rank_s:rank_sql mark,dense_s:dense_sql mark by course from student
This is what I can think of as of now:
Note: the rank function in kdb+ works on an ascending list, so I created the functions below.
I would not xdesc the table, as I can just use the vector column and desc it:
q)denseF
{((desc distinct x)?x)+1}
q)rankF
{((desc x)?x)+1}
q)update dense_rank:denseF mark,rank_rank:rankF mark,row_num:1+rank i by course from student
course  mark name    dense_rank rank_rank row_num
-------------------------------------------------
Maths   60   Thulile 2          2         1
Maths   60   Pritha  2          2         2
Maths   70   Voitto  1          1         3
Maths   55   Chun    3          4         4
Biology 60   Bilal   2          2         1
Biology 70   Roger   1          1         2

Check condition in date interval between now and next month

I have a table in PostgreSQL 10. The table has the following structure
| date | entity | col1 | col2 |
|------+--------+------+------|
Every row represents an event that happens to an entity on a given date. The event has attributes represented by col1 and col2.
I want to add a new column that indicates whether, with respect to the current row, there are events in which the column col2 fulfills a given condition (in the following example the condition is col2 > 20) within a given interval (say 1 month).
| date | entity | col1 | col2 | fulfill |
|------+--------+------+------+---------|
| t1 | A | a1 | 10 | F |
| t1 | B | b | 9 | F |
| t2 | A | a2 | 10 | T |
| t3 | A | a3 | 25 | F |
| t3 | B | b2 | 8 | F |
t3 is a date inside t2 + interval 1 month.
What is the most efficient way to accomplish this?
I am not sure if I got your problem correctly. My case is 'T if there is a value >= 10 between now and the next month'.
I have the following data:
val event_date
--- ----------
22 2016-12-31 -- should be T because val >= 10
8 2017-03-20 -- should be F because in [event_date, event_date + 1 month] no val >= 10
6 2017-03-22 -- F
42 2017-12-31 -- T because there are 2 values >= 10 in next month
25 2018-01-24 -- T val >= 10
9 2018-02-11 -- F
1 2018-03-01 -- T because in month there is 1 val >= 10
2 2018-03-10 -- T same
20 2018-04-01 -- T
7 2018-04-01 -- T because on the same day a val >= 10
1 2018-07-24 -- F
22 2019-01-01 -- T
4 2020-10-22 -- T
123 2020-11-04 -- T
The query:
SELECT DISTINCT
e1.val,
e1.event_date,
CASE
WHEN MAX(e2.val) over (partition BY e1.event_date) >= 10
THEN 'T'
ELSE 'F'
END AS fulfilled
FROM
testdata.events e1
JOIN
testdata.events e2
ON
e1.event_date <= e2.event_date
AND e2.event_date <=(e1.event_date + interval '1 month') ::DATE
ORDER BY
e1.event_date
The result:
val event_date fulfilled
--- ---------- ---------
22 2016-12-31 T
8 2017-03-20 F
6 2017-03-22 F
42 2017-12-31 T
25 2018-01-24 T
9 2018-02-11 F
1 2018-03-01 T
2 2018-03-10 T
20 2018-04-01 T
7 2018-04-01 T
1 2018-07-24 F
22 2019-01-01 T
4 2020-10-22 T
123 2020-11-04 T
Currently I am not finding a solution that avoids joining the table to itself, which does not seem very stylish to me.
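For what it's worth, a correlated EXISTS is a sketch of the same check that drops the DISTINCT and the MAX window; it still touches the table twice, but each row only needs to find one qualifying event. It assumes the same testdata.events table and the val >= 10 condition used above.
SELECT e1.val,
       e1.event_date,
       CASE WHEN EXISTS (SELECT 1
                         FROM testdata.events e2
                         WHERE e2.event_date >= e1.event_date
                           AND e2.event_date <= (e1.event_date + interval '1 month')::date
                           AND e2.val >= 10)
            THEN 'T'
            ELSE 'F'
       END AS fulfilled
FROM testdata.events e1
ORDER BY e1.event_date;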

Returning null individual values with postgres tablefunc crosstab()

I am trying to incorporate the null values within the returned lists, such that:
batch_id |test_name |test_value
-----------------------------------
10 | pH | 4.7
10 | Temp | 154
11 | pH | 4.8
11 | Temp | 152
12 | pH | 4.5
13 | Temp | 155
14 | pH | 4.9
14 | Temp | 152
15 | Temp | 149
16 | pH | 4.7
16 | Temp | 150
would return:
batch_id | pH |Temp
---------------------------------------
10 | 4.7 | 154
11 | 4.8 | 152
12 | 4.5 | <null>
13 | <null> | 155
14 | 4.9 | 152
15 | <null> | 149
16 | 4.7 | 150
However, it currently returns this:
batch_id | pH |Temp
---------------------------------------
10 | 4.7 | 154
11 | 4.8 | 152
12 | 4.5 | <null>
13 | 155 | <null>
14 | 4.9 | 152
15 | 149 | <null>
16 | 4.7 | 150
This is an extension of a prior question -
Can the categories in the postgres tablefunc crosstab() function be integers? - which led to this current query:
SELECT *
FROM crosstab('SELECT lab_tests_results.batch_id, lab_tests.test_name, lab_tests_results.test_result::FLOAT
FROM lab_tests_results, lab_tests
WHERE lab_tests.id=lab_tests_results.lab_test AND (lab_tests.test_name LIKE ''Test Name 1'' OR lab_tests.test_name LIKE ''Test Name 2'')
ORDER BY 1,2'
) AS final_result(batch_id VARCHAR, test_name_1 FLOAT, test_name_2 FLOAT);
I also know that I am not the first to ask this question generally, but I have yet to find a solution that works for these circumstances. For example, this one - How to include null values in `tablefunc` query in postgresql? - assumes the same Batch IDs each time. I do not want to specify the Batch IDs, but rather all that are available.
This leads into the other set of solutions I've found out there, which address a null list result from specified categories. Since I'm just taking what's already there, however, this isn't an issue. It's the null individual values causing the problem and resulting in a pivot table with values shifted to the left.
Any suggestions are much appreciated!
Edit: With Klin's help, got it sorted out. Something to note is that the VALUES section must match the actual lab_tests.test_name values you're after, such that:
SELECT *
FROM crosstab(
$$
SELECT lab_tests_results.batch_id, lab_tests.test_name, lab_tests_results.test_result::FLOAT
FROM lab_tests_results, lab_tests
WHERE lab_tests.id = lab_tests_results.lab_test
AND (
lab_tests_results.lab_test = 1
OR lab_tests_results.lab_test = 2
OR lab_tests_results.lab_test = 3
OR lab_tests_results.lab_test = 4
OR lab_tests_results.lab_test = 5
OR lab_tests_results.lab_test = 50 )
ORDER BY 1 DESC, 2
$$,
$$
VALUES('Mash pH'),
('Sparge pH'),
('Final Lauter pH'),
('Wort pH'),
('Wort FAN'),
('Original Gravity'),
('Mash Temperature')
$$
) AS final_result(batch_id VARCHAR,
ph_mash FLOAT,
ph_sparge FLOAT,
ph_final_lauter FLOAT,
ph_wort FLOAT,
FAN_wort FLOAT,
original_gravity FLOAT,
mash_temperature FLOAT)
Thanks for the help!
Use the second form of the function:
crosstab(text source_sql, text category_sql) - Produces a “pivot table” with the value columns specified by a second query.
E.g.:
SELECT *
FROM crosstab(
$$
SELECT lab_tests_results.batch_id, lab_tests.test_name, lab_tests_results.test_result::FLOAT
FROM lab_tests_results, lab_tests
WHERE lab_tests.id=lab_tests_results.lab_test
AND (
lab_tests.test_name LIKE 'Test Name 1'
OR lab_tests.test_name LIKE 'Test Name 2')
ORDER BY 1,2
$$,
$$
VALUES('pH'), ('Temp')
$$
) AS final_result(batch_id VARCHAR, "pH" FLOAT, "Temp" FLOAT);
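One practical note: crosstab comes from the tablefunc extension, so it has to be installed once per database before either query above will run:
CREATE EXTENSION IF NOT EXISTS tablefunc;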

Ignore the first and last row in a query result

I'm trying to write a query that ignores the first and the last row of the result.
My result query is retrieving the sum of all mediums in the last hour grouped by 5 minutes.
To ignore the first record I'm using OFFSET 1, and to ignore the last I was trying to use a LIMIT on my id field, ordering by timestamp descending.
My query:
ws_controller_hist=>
SELECT to_timestamp(floor((extract('epoch' FROM TIMESTAMP) / 300)) * 300)
AS timestamp_min,
TYPE,
floor(sum(medium[1]))
FROM default_dataset
WHERE TYPE LIKE 'ap_clients.wlan0'
AND TIMESTAMP > CURRENT_TIMESTAMP - interval '85 minutes'
AND organization_id = '9fc02db4-c3df-4890-93ac-8dd575ca5638'
AND id NOT IN
(SELECT id
FROM default_dataset
ORDER BY TIMESTAMP DESC
LIMIT 1)
GROUP BY timestamp_min,
TYPE
ORDER BY timestamp_min ASC
OFFSET 1;
timestamp_min | type | floor
------------------------+------------------+-------
2017-12-19 14:20:00+00 | ap_clients.wlan0 | 38
2017-12-19 14:25:00+00 | ap_clients.wlan0 | 37
2017-12-19 14:30:00+00 | ap_clients.wlan0 | 39
2017-12-19 14:35:00+00 | ap_clients.wlan0 | 42
2017-12-19 14:40:00+00 | ap_clients.wlan0 | 43
2017-12-19 14:45:00+00 | ap_clients.wlan0 | 44
2017-12-19 14:50:00+00 | ap_clients.wlan0 | 45
2017-12-19 14:55:00+00 | ap_clients.wlan0 | 45
2017-12-19 15:00:00+00 | ap_clients.wlan0 | 43
2017-12-19 15:05:00+00 | ap_clients.wlan0 | 43
2017-12-19 15:10:00+00 | ap_clients.wlan0 | 50
2017-12-19 15:15:00+00 | ap_clients.wlan0 | 52
2017-12-19 15:20:00+00 | ap_clients.wlan0 | 50
2017-12-19 15:25:00+00 | ap_clients.wlan0 | 53
2017-12-19 15:30:00+00 | ap_clients.wlan0 | 49
2017-12-19 15:35:00+00 | ap_clients.wlan0 | 39
2017-12-19 15:40:00+00 | ap_clients.wlan0 | 16
This is not ignoring the last record, because I get the same records when not using the subquery "and id not in (select id from default_dataset order by timestamp desc limit 1)".
Wrap your query in an outer query and use lag and OFFSET to do the trick.
SELECT lag(timestamp_min) OVER (ORDER BY timestamp_min) AS timestamp_min,
lag(type) OVER (ORDER BY timestamp_min) AS type,
lag(sum_first_medium) OVER (ORDER BY timestamp_min) AS sum_first_medium
FROM (SELECT to_timestamp(
floor(
(extract('epoch' FROM TIMESTAMP) / 300)
) * 300
) AS timestamp_min,
type,
floor(sum(medium[1])) AS sum_first_medium
FROM default_dataset
WHERE type = 'ap_clients.wlan0'
AND timestamp > current_timestamp - INTERVAL '85 minutes'
AND organization_id = '9fc02db4-c3df-4890-93ac-8dd575ca5638'
GROUP BY timestamp_min, type) lagme
OFFSET 2;
This is probably a bit long, but it will do exactly what you requested:
SELECT z.*
FROM
(SELECT y.*, min(row_number) OVER(), max(row_number) OVER()
FROM
(SELECT x.*, row_number() OVER(ORDER BY timestamp_min)
FROM
(SELECT to_timestamp(floor((extract('epoch' FROM TIMESTAMP) / 300)) * 300)
AS timestamp_min,
TYPE,
floor(sum(medium[1]))
FROM default_dataset
WHERE TYPE LIKE 'ap_clients.wlan0'
AND TIMESTAMP > CURRENT_TIMESTAMP - interval '85 minutes'
AND organization_id = '9fc02db4-c3df-4890-93ac-8dd575ca5638'
AND id NOT IN
(SELECT id
FROM default_dataset
ORDER BY TIMESTAMP DESC
LIMIT 1)
GROUP BY timestamp_min,
TYPE
ORDER BY timestamp_min ASC
) AS x
) AS y
) AS z WHERE row_number NOT IN (min, max)
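A shorter variant of the same idea, as a sketch built directly on the question's query (untested against the real table; the clients alias is just a name picked here for the unnamed floor(sum(...)) column): number the rows once, count them once, and keep everything strictly between the two.
SELECT timestamp_min, type, clients
FROM (SELECT x.*,
             row_number() OVER (ORDER BY timestamp_min) AS rn,
             count(*) OVER () AS total
      FROM (SELECT to_timestamp(floor((extract('epoch' FROM timestamp) / 300)) * 300) AS timestamp_min,
                   type,
                   floor(sum(medium[1])) AS clients
            FROM default_dataset
            WHERE type LIKE 'ap_clients.wlan0'
              AND timestamp > current_timestamp - interval '85 minutes'
              AND organization_id = '9fc02db4-c3df-4890-93ac-8dd575ca5638'
            GROUP BY timestamp_min, type) AS x) AS y
WHERE rn > 1 AND rn < total
ORDER BY timestamp_min;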

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
somehow I want the result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How it comes:
First, there should be a window function which executes a de-aggregate function partitioned by id.
The column val is a cumulative value, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S. Greenplum is based on PostgreSQL v8.2.
You can just use the LAG function:
SELECT id,
val - lag(val, 1, 0) over (partition BY id ORDER BY val) as step
FROM yourTable
Note carefully that lag() has three parameters. The first is the column for which to find the lag, the second indicates to look at the previous record, and the third will cause lag to return a default value of zero.
Here is a table showing the intermediate and final values this query would generate:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: This answer assumes that you want to compute your rolling difference in order of val ascending. If you want a different order you can change the ORDER BY clause of the partition.
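If you want to sanity-check the lag approach locally, here is a minimal self-contained run on stock PostgreSQL (yourTable is just the placeholder name used in the query above):
CREATE TABLE yourTable (id int, val int);
INSERT INTO yourTable VALUES
  (12, 12), (12, 23), (12, 34),
  (13, 23), (13, 34), (13, 45);

SELECT id,
       val,
       val - lag(val, 1, 0) OVER (PARTITION BY id ORDER BY val) AS step
FROM yourTable
ORDER BY id, val;
-- expected step values: 12, 11, 11 for id 12 and 23, 11, 11 for id 13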
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g., by using the lag function. Just note you'll have to treat the first value in each group specially, as lag will return null:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;