Aggregation on time range - postgresql

I have a data set that contains a date (yyyy/mm/dd), a time (h,m,s) and a temperature (float) as individual columns.
I want to aggregate the temperature values for each day using the average function.
The problem is that I don't know how to query the time attribute so that it aggregates, for example, {h,m, (0-5)s}, {h,m, (5-10)s}, {h,m, (10-15)s} and so on, automatically.

select
day,
to_char(date_trunc('minute', "time"), 'HH24:MI') as "minute",
extract(second from "time")::integer / 5 as "range",
avg(temperature) as average
from (
select d::date as day, d::time as "time", random() * 100 as temperature
from generate_series('2012-01-01', '2012-01-03', '1 second'::interval) s(d)
) d
group by 1, 2, 3
order by 1, 2, 3
;
If you want the average for all days:
select
to_char(date_trunc('minute', "time"), 'HH24:MI') as "minute",
extract(second from "time")::integer / 5 as "range",
avg(temperature) as average
from (
select d::time as "time", random() * 100 as temperature
from generate_series('2012-01-01', '2012-01-03', '1 second'::interval) s(d)
) d
group by 1, 2
order by 1, 2
;
I think the important part for your question is to group by the integer result of dividing the seconds by the range size.
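To see how that works, here is a quick illustration of the bucketing (5-second ranges, as above):

select s as "second", s / 5 as "range"
from generate_series(0, 14) s;
-- second | range
--    0-4 | 0
--    5-9 | 1
--  10-14 | 2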

Related

Extract() from Postgres to calculate minutes between 2 columns

I want to calculate the minutes between two columns, started_at and ended_at (both timestamp without time zone), for two customer types, and then average the result for each.
I tried to use extract() in the following statement, but can't get the right result:
select avg(duration_minutes)
from (
select started_at,ended_at, extract('minute' from (started_at - ended_at)) as duration_minutes
from my_data
where customer_type = 'member'
) avg_duration;
Result:
avg
0.000
This runs successfully in BigQuery using the following:
select avg(duration_minutes) from
(
select started_at,ended_at,
datetime_diff(ended_at,started_at, minute) as duration_minutes
from my_table
where customer_type = "member"
) avg_duration
Result:
f0_
21.46
Wondering what might be failing in Postgres?
extract(minute from ...) extracts the minutes field from the interval. So if the interval is "1 hour 26 minutes and 45 seconds", the result is 26, not 86.
To convert an interval to the equivalent number of minutes, extract the total number of seconds using extract(epoch from ...) and divide that by 60.
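A quick check of the difference:

select extract(minute from interval '1 hour 26 minutes 45 seconds') as minute_field,
       extract(epoch from interval '1 hour 26 minutes 45 seconds') / 60 as total_minutes;
-- minute_field | total_minutes
--           26 |         86.75

So the corrected query is: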
select avg(duration_minutes)
from (
select started_at,
ended_at,
extract(epoch from (ended_at - started_at)) / 60 as duration_minutes
from my_data
where customer_type = 'member'
) avg_duration;
Note that you can calculate the average of an interval without the need to convert it to minutes:
select avg(duration)
from (
select started_at,
ended_at,
ended_at - started_at as duration
from my_data
where customer_type = 'member'
) avg_duration;
Depending on how you use the result, returning an interval might be more useful. You can also convert that to minutes using:
extract(epoch from avg(duration)) / 60 as average_minutes
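Combined with the query above, that looks like:

select extract(epoch from avg(duration)) / 60 as average_minutes
from (
    select ended_at - started_at as duration
    from my_data
    where customer_type = 'member'
) avg_duration;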

How to date trunc in HANA

I have a query to get the count of buses that travel less than 100 km per day. In PostgreSQL I use this query:
select day,count(*)as bus_count from(
SELECT date_trunc('hour',start_time)::timestamp::date as day,bus_id,sum(distance_two_points) as distance
FROM public.datatable where start_time >= '2015-09-05 00:00:00' and start_time <= '2015-09-05 23:59:59'
group by day,bus_id
) as A where distance<=250000 group by day
The inner query returns this result:
day bus_id distance
___ ________ _________
"2015-09-05 00:00:00" 1 523247
"2015-09-05 00:00:00" 2 135114
"2015-09-05 00:00:00" 3 178560
"2015-09-05 00:00:00" 4 400071
"2015-09-05 00:00:00" 5 312832
"2015-09-05 00:00:00" 6 237075
So now I want to use this same query (achieving the same results) in SAP HANA, but there is no date_trunc function. I also tried
select day, count(*) as bus_count from (
SELECT EXTRACT (DAY FROM TO_DATE (START_TIME, 'YYYY-MM-DD')) as day,
bus_id, sum(distance_two_points) as distance
FROM public.datatable
where start_time >= '2015-09-05 00:00:00' and start_time <= '2015-09-05 23:59:59'
group by day,bus_id
) as A where distance<=250000 group by day
Any help is appreciated.
SELECT SERIES_ROUND('2013-05-24', 'INTERVAL 1 YEAR', ROUND_DOWN) "result" FROM DUMMY;
SELECT SERIES_ROUND('04:25:01', 'INTERVAL 10 MINUTE') "result" FROM DUMMY;
SERIES_ROUND in SAP HANA provides functionality similar to DATE_TRUNC in other databases.
https://help.sap.com/docs/SAP_HANA_PLATFORM/4fe29514fd584807ac9f2a04f6754767/435ec476ab494ad6b8409f22abec13fe.html?version=2.0.00
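Applied to the bus query from the question, the day bucket might look like this (a sketch, untested; it assumes start_time is a datetime type that SERIES_ROUND accepts):

SELECT day, count(*) as bus_count FROM (
    SELECT SERIES_ROUND(start_time, 'INTERVAL 1 DAY', ROUND_DOWN) as day,
           bus_id, sum(distance_two_points) as distance
    FROM datatable
    WHERE start_time >= '2015-09-05 00:00:00' AND start_time <= '2015-09-05 23:59:59'
    GROUP BY SERIES_ROUND(start_time, 'INTERVAL 1 DAY', ROUND_DOWN), bus_id
) AS A WHERE distance <= 250000 GROUP BY day;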
Converting to a non-datetime data type is usually not a good idea (additional parsing, encoding, semantics...).
Instead use a less granular datetime data type: daydate in this case.
create column table datatab (start_time seconddate, bus_id int, distance_two_points decimal (10, 2));
insert into datatab values (to_seconddate('05.09.2015 13:12:00'), 1, 50.2);
insert into datatab values (to_seconddate('05.09.2015 13:22:00'), 1, 1.2);
insert into datatab values (to_seconddate('05.09.2015 15:32:00'), 1, 24);
insert into datatab values (to_seconddate('05.09.2015 13:12:00'), 1, 50.2);
insert into datatab values (to_seconddate('05.09.2015 14:22:00'), 2, 1.2);
insert into datatab values (to_seconddate('05.09.2015 16:32:00'), 2, 24);
select to_seconddate(day) as day,count(*) as bus_count from(
SELECT to_date(start_time) as day, bus_id, sum(distance_two_points) as distance
FROM datatab
where start_time between '2015-09-05 00:00:00' and '2015-09-05 23:59:59'
group by to_date(start_time),bus_id
) as A
where distance<=250000
group by day;
The inner query gives you:
DAY BUS_ID DISTANCE
2015-09-05 1 75.40
2015-09-05 2 25.20
So, your seconddate "start_time" is now aggregated as daydate and then converted back to 'seconddate'.
What I prefer is using the seconds_between() or nano100_between() function.
select now(),
add_seconds( to_date('1970.01.01', 'YYYY.MM.DD'),
round(
SECONDS_BETWEEN(
to_date('1970.01.01', 'YYYY.MM.DD'),
now()
)/3600
)*3600
)
from dummy;
This looks a bit ugly, but given that to_date() is calculated just once (not for each row) and that the seconds arithmetic is close to how HANA stores the values internally, it should be the most efficient of the lot.
It is also the most flexible: round by second, minute, hour, day, ... everything below year is fine.
PS: round() supports all round and truncate options.
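For example, to truncate to 15-minute buckets with the same pattern (a sketch; it relies on the rounding-mode argument of round() mentioned in the PS):

select add_seconds(
           to_date('1970.01.01', 'YYYY.MM.DD'),
           round(
               seconds_between(
                   to_date('1970.01.01', 'YYYY.MM.DD'),
                   now()
               ) / 900.0,        -- 900 seconds = 15 minutes
               0, ROUND_DOWN     -- truncate instead of rounding to nearest
           ) * 900
       ) as bucket_start
from dummy;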
Assuming your start_time is of some date/time type (e.g. SECONDDATE) you could use
...TO_NVARCHAR(START_TIME, 'YYYY-MM-DD') AS DAY...
Instead of date_trunc... in PostgreSQL
Why don't you use CAST() conversion function?
select
cast( now() as date ) myDate
from dummy;

Postgres calculate growth rate over two months

I would like to calculate the growth rate of customers for the following data.
month | customers
-------------------------
01-2015 | 1
02-2015 | 10
03-2015 | 10
06-2015 | 15
I have used the following formula to calculate the growth rate. It works only for one-month intervals, and it cannot give the expected output because of the gap between the 3rd and the 6th month shown in the table above:
select
month, total,
(total::float / lag(total) over (order by month) - 1) * 100 growth
from (
select to_char(created, 'yyyy-mm') as month, count(id) total
from customers
group by month
) s
order by month;
I think this can be done by creating a date range and grouping by that range.
I expect two main outputs, separately:
1) The growth rate with an exact one-month difference.
2) The growth rate over two-month intervals instead of single months; for the data above, sum the results of every two months and group by those two months.
Still not sure about the second part. Here's the growth from your adapted query, plus the two-month grouping column:
select
    month, total,
    (total::float / lag(total) over (order by m) - 1) * 100 as growth,
    m, m2
from (
    select created,
           (sum(customers) over (order by m))::float as total,
           customers, m, m2,
           to_char(created, 'yyyy-mm') as month
    from customers c
    right outer join (
        select generate_series('2015-01-01', '2015-06-01', '1 month'::interval) m
    ) m1 on m = c.created
    left outer join (
        select generate_series('2015-01-01', '2015-06-01', '2 month'::interval) m2
    ) m2 on m2 = m
    order by m
) s
order by m;
Basically, the answer is to use generate_series.
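For part 1, a cleaner sketch of the generate_series idea (assuming a customers table with created and id columns, as in the question): generate one row per month, left join the monthly counts so gap months show up with zero customers, then apply lag:

select to_char(m, 'yyyy-mm') as month,
       coalesce(total, 0) as total,
       -- growth vs. previous month; NULL when the previous month had no customers
       (coalesce(total, 0)::float / nullif(lag(total) over (order by m), 0) - 1) * 100 as growth
from generate_series('2015-01-01', '2015-06-01', '1 month'::interval) m
left join (
    select date_trunc('month', created) as month, count(id) as total
    from customers
    group by 1
) s on s.month = m
order by m;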

Rolling sum per time interval per group

Table, data and task as follows.
See the SQL Fiddle link for demo data and expected results.
create table "data"
(
"item" int
, "timestamp" date
, "balance" float
, "rollingSum" float
)
insert into "data" ( "item", "timestamp", "balance", "rollingSum" ) values
( 1, '2014-02-10', -10, -10 )
, ( 1, '2014-02-15', 5, -5 )
, ( 1, '2014-02-20', 2, -3 )
, ( 1, '2014-02-25', 13, 10 )
, ( 2, '2014-02-13', 15, 15 )
, ( 2, '2014-02-16', 15, 30 )
, ( 2, '2014-03-01', 15, 45 )
I need to get all rows in a defined time interval. The table above doesn't hold a record per item for each possible date; only dates on which changes were applied are recorded (there may be n rows per timestamp per item).
If the given interval does not fit exactly on stored timestamps, the latest timestamp before the start date (the nearest smaller neighbour) should be used as the starting balance/rolling sum.
Expected results (time interval: startdate = '2014-02-13', enddate = '2014-02-20'):
"item", "timestamp" , "balance", "rollingSum"
1 , '2014-02-13' , -10 , -10
1 , '2014-02-15' , 5 , -5
1 , '2014-02-20' , 2 , -3
2 , '2014-02-13' , 15 , 15
2 , '2014-02-16' , 15 , 30
I checked similar questions and googled a lot, but didn't find a solution yet.
I don't think it's a good idea to extend the "data" table with one row per missing date per item, since the complete interval (smallest date to latest date per item) may span several years.
Thanks in advance!
select sum(balance)
from table
where timestamp >= (select max(timestamp) from table where timestamp <= 'startdate')
and timestamp <= 'enddate'
Don't know what you mean by rolling-sum.
Here is an attempt. It seems to give the right result, though it is not very pretty. It would have been easier in SQL Server 2012+:
declare @from date = '2014-02-13'
declare @to date = '2014-02-20'
;with x as
(
select
item, timestamp, balance, row_number() over (partition by item order by timestamp, balance) rn
from (select item, timestamp, balance from data
union all
select distinct item, @from, null from data) z
where timestamp <= @to
)
, y as
(
select item,
timestamp,
coalesce(balance, rollingsum) balance ,
a.rollingsum,
rn
from x d
cross apply
(select sum(balance) rollingsum from x where rn <= d.rn and d.item = item) a
where timestamp between @from and @to
)
select item, timestamp, balance, rollingsum from y
where rollingsum is not null
order by item, rn, timestamp
Result:
item timestamp balance rollingsum
1 2014-02-13 -10,00 -10,00
1 2014-02-15 5,00 -5,00
1 2014-02-20 2,00 -3,00
2 2014-02-13 15,00 15,00
2 2014-02-16 15,00 30,00
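For reference, on SQL Server 2012+ (and likewise in PostgreSQL) the running sum itself collapses into a windowed SUM. A minimal sketch; note it only drops rows before the start date, so snapping the last pre-start row to the start date still needs the UNION trick above:

select item, timestamp, balance, rollingSum
from (
    select item, timestamp, balance,
           sum(balance) over (partition by item
                              order by timestamp, balance
                              rows unbounded preceding) as rollingSum
    from data
    where timestamp <= '2014-02-20'
) x
where timestamp >= '2014-02-13'
order by item, timestamp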

Checking for the minimum variability of a temporal database in postgresql

I have a table like this:
+------------+------------------+
|temperature |Date_time_of_data |
+------------+------------------+
| 4.5 |9/15/2007 12:12:12|
| 4.56 |9/15/2007 12:14:16|
| 4.44 |9/15/2007 12:16:02|
| 4.62 |9/15/2007 12:18:23|
| 4.89 |9/15/2007 12:21:01|
+------------+------------------+
The data set contains more than 1000 records and I want to check it for minimum variability.
For every 30 minutes, if the variance of the temperature doesn't exceed 0.2, I want all the temperature values of that half hour replaced by NULL.
Here is a SELECT to get the start of a period for every record:
SELECT temperature,
Date_time_of_data,
date_trunc('hour', Date_time_of_data)+
CASE WHEN date_part('minute', Date_time_of_data) >= 30
THEN interval '30 minutes'
ELSE interval '0 minutes'
END as start_of_period
FROM your_table
It truncates the timestamp to the hour (9/15/2007 12:12:12 becomes 9/15/2007 12:00:00) and then adds 30 minutes if the timestamp's minutes were 30 or more.
Next - use start_of_period to group the results and get the min and max temperature for every group:
SELECT temperature,
Date_time_of_data,
max(temperature) OVER (PARTITION BY start_of_period) as max_temp,
min(temperature) OVER (PARTITION BY start_of_period) as min_temp
FROM (previous_select_here)
Next - keep only the records where the spread (max_temp - min_temp) is within 0.2:
SELECT temperature,
Date_time_of_data
FROM (previous_select_here)
WHERE (max_temp - min_temp) <=0.2
And finally update your table
UPDATE your_table
SET temperature = NULL
WHERE Date_time_of_data IN (previous_select_here)
You may need to correct some mistakes in these queries before they work; I haven't tested them.
And you can simplify them if you need to.
P.S. If you need to filter out the data with variance less than 0.2, you can simply create a VIEW from the third SELECT with
WHERE (max_temp - min_temp) > 0.2
and use the VIEW instead of the table.
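A sketch of that view, with the three SELECTs above combined (the view name is made up):

CREATE VIEW high_variability_temps AS
SELECT temperature, Date_time_of_data
FROM (
    SELECT temperature, Date_time_of_data,
           max(temperature) OVER (PARTITION BY start_of_period) as max_temp,
           min(temperature) OVER (PARTITION BY start_of_period) as min_temp
    FROM (
        SELECT temperature, Date_time_of_data,
               date_trunc('hour', Date_time_of_data) +
               CASE WHEN date_part('minute', Date_time_of_data) >= 30
                    THEN interval '30 minutes'
                    ELSE interval '0 minutes'
               END as start_of_period
        FROM your_table
    ) p
) g
WHERE (max_temp - min_temp) > 0.2;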
This query should do the job:
with intervals as (
select
date_trunc('hour', Date_time_of_data) + interval '30 min' * round(date_part('minute', Date_time_of_data) / 30.0) as valid_interval
from T
group by 1
having var_samp(temperature) > 0.2
)
select * from T
where
date_trunc('hour', Date_time_of_data) + interval '30 min' * round(date_part('minute', Date_time_of_data) / 30.0) in (select valid_interval from intervals)
The inner query (labeled intervals) returns the half-hour buckets in which the variance exceeds 0.2 (having var_samp(temperature) > 0.2). The date_trunc ... expression rounds Date_time_of_data to the nearest half-hour boundary.
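For the sample data, the rounding expression buckets the timestamps like this:

select Date_time_of_data,
       date_trunc('hour', Date_time_of_data)
           + interval '30 min' * round(date_part('minute', Date_time_of_data) / 30.0) as bucket
from T;
-- 12:12:12 and 12:14:16 fall into the 12:00 bucket;
-- 12:16:02, 12:18:23 and 12:21:01 fall into the 12:30 bucket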
The query returns nothing on the provided data set, because neither half-hour group's variance exceeds 0.2:
create table T (temperature float8, Date_time_of_data timestamp without time zone);
insert into T values
(4.5, '2007-9-15 12:12:12'),
(4.56, '2007-9-15 12:14:16'),
(4.44, '2007-9-15 12:16:02'),
(4.62, '2007-9-15 12:18:23'),
(4.89, '2007-9-15 12:21:01')
;