I have timeseries data with up to millisecond accuracy. Some timestamps can coincide exactly, in which case a database id column can be used to determine which row is the latest.
I am trying to use Timescale to get the latest values per second.
Here is an example of the data I'm looking at:
time                    | db_id | value
2020-01-01 08:39:23.293 | 4460  | 136.01
2020-01-01 08:39:23.393 | 4461  | 197.95
2020-01-01 08:40:38.973 | 4462  | 57.95
2020-01-01 08:43:01.223 | 4463  | 156
2020-01-01 08:43:26.577 | 4464  | 253.43
2020-01-01 08:43:26.577 | 4465  | 53.68
2020-01-01 08:43:26.577 | 4466  | 160.00
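For reference, here is the sample as a runnable snippet. The table name timeseries is taken from the query further down; the column types and the hypertable step are assumptions:
CREATE TABLE timeseries (
    time  TIMESTAMPTZ NOT NULL,
    db_id BIGINT      NOT NULL,
    value NUMERIC     NOT NULL
);
-- Optional, only if the TimescaleDB extension is installed:
SELECT create_hypertable('timeseries', 'time');
INSERT INTO timeseries (time, db_id, value) VALUES
('2020-01-01 08:39:23.293', 4460, 136.01),
('2020-01-01 08:39:23.393', 4461, 197.95),
('2020-01-01 08:40:38.973', 4462, 57.95),
('2020-01-01 08:43:01.223', 4463, 156),
('2020-01-01 08:43:26.577', 4464, 253.43),
('2020-01-01 08:43:26.577', 4465, 53.68),
('2020-01-01 08:43:26.577', 4466, 160.00);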
When obtaining the latest price per second, my results should look like this:
time                | value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:39:24 | 197.95
...
2020-01-01 08:40:37 | 197.95
2020-01-01 08:40:38 | 57.95
2020-01-01 08:40:39 | 57.95
...
2020-01-01 08:43:25 | 57.95
2020-01-01 08:43:26 | 160.00
2020-01-01 08:43:27 | 160.00
...
I've successfully obtained the latest values per second using the Timescale time_bucket function:
SELECT last(value, db_id), time_bucket('1 second', time) AS per_second FROM timeseries GROUP BY per_second ORDER BY per_second DESC;
but it leaves holes in the time column.
time                | value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 | 57.95
2020-01-01 08:43:26 | 160.00
The solution I thought of is to create a table of per-second timestamps with null values, migrate the data from the previous result into it, and then replace the null values with the last occurring value, but that seems like a lot of intermediate steps.
I'd like to know if there is a better approach to this issue of finding the "latest value" per second, minute, hour, etc. I originally tried solving the issue in Python, as it seemed simple enough, but it took a lot of computing time.
Found a nice working solution to my problem.
It involves four main steps:
1. Getting the latest values
select
time_bucket('1 second', time + '1 second') as interval,
last(value, db_id) as last_value
from timeseries
where time > <date_start> and time < <date_end>
group by interval
order by interval;
This will produce a table that has the latest value per bucket. last takes a second column (here db_id) as a tie-breaker in case another level of ordering is required.
e.g.
time                | last_value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 | 57.95
2020-01-01 08:43:26 | 160.00
Note that I shift the time by one second with + '1 second' since I only want data before a particular second - without this it will consider on-the-second data as part of the last price.
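To see what the shift does in isolation, here is a quick standalone check using one of the coinciding timestamps from the sample (a sketch, not part of the original solution):
select
time_bucket('1 second', ts) as bucket_unshifted,                    -- 2020-01-01 08:43:26
time_bucket('1 second', ts + interval '1 second') as bucket_shifted -- 2020-01-01 08:43:27
from (values (timestamp '2020-01-01 08:43:26.577')) as t(ts);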
2. Creating a table with per-second timestamps
select
time_bucket_gapfill('1 second', time) as per_second
from timeseries
where time > <date_start> and time < <date_end>
group by per_second
order by per_second;
Here I produce a table where each row has per second timestamps.
e.g.
per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000
3. Joining them together and adding a value_partition column
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(value, db_id) as last_value
from timeseries
where time > <date_start> and time < <date_end>
group by interval, time
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from timeseries
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second
Inspired by this answer, the goal is to have a counter (value_partition) that increments only if the value is not null.
e.g.
per_second latest_value value_partition
2020-01-01 00:00:00.000 NULL 0
2020-01-01 00:00:01.000 15.82 1
2020-01-01 00:00:02.000 NULL 1
2020-01-01 00:00:03.000 NULL 1
2020-01-01 00:00:04.000 NULL 1
2020-01-01 00:00:05.000 NULL 1
2020-01-01 00:00:06.000 NULL 1
2020-01-01 00:00:07.000 NULL 1
2020-01-01 00:00:08.000 NULL 1
2020-01-01 00:00:09.000 NULL 1
2020-01-01 00:00:10.000 15.72 2
2020-01-01 00:00:10.000 14.67 3
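The windowed sum can also be checked in isolation on a small values list (a toy example, not the post's data):
select ts, v,
sum(case when v is null then 0 else 1 end) over (order by ts) as value_partition
from (values (1, null::numeric), (2, 15.82), (3, null), (4, null), (5, 15.72)) as t(ts, v)
order by ts;
-- ts | v     | value_partition
--  1 | NULL  | 0
--  2 | 15.82 | 1
--  3 | NULL  | 1
--  4 | NULL  | 1
--  5 | 15.72 | 2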
4. Filling in the null values
select
per_second,
first_value(last_value) over (partition by value_partition order by per_second) as latest_value
from
(
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(value, db_id) as last_value
from timeseries
where time > <date_start> and time < <date_end>
group by interval
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from timeseries
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second
) as q
This final step brings everything together. It takes advantage of the value_partition column: within each partition group, first_value picks the value at the start of the group (the non-null row that started it) and overwrites the trailing nulls accordingly.
e.g.
per_second latest_value
2020-01-01 00:00:00.000 NULL
2020-01-01 00:00:01.000 15.82
2020-01-01 00:00:02.000 15.82
2020-01-01 00:00:03.000 15.82
2020-01-01 00:00:04.000 15.82
2020-01-01 00:00:05.000 15.82
2020-01-01 00:00:06.000 15.82
2020-01-01 00:00:07.000 15.82
2020-01-01 00:00:08.000 15.82
2020-01-01 00:00:09.000 15.82
2020-01-01 00:00:10.000 15.72
2020-01-01 00:00:10.000 14.67
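Worth noting as a possible shortcut: if your TimescaleDB version ships the gap-filling helpers, locf() (last observation carried forward) combined with time_bucket_gapfill() can replace steps 2-4 with a single query. A minimal sketch against the question's timeseries table (the bounds are arbitrary example values, and the one-second shift is kept from step 1):
select
time_bucket_gapfill('1 second', time + interval '1 second',
                    '2020-01-01 08:00:00', '2020-01-01 09:00:00') as per_second,
locf(last(value, db_id)) as latest_value
from timeseries
where time > '2020-01-01 08:00:00' and time < '2020-01-01 09:00:00'
group by per_second
order by per_second;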
Related
create table your_table(type text,compdate date,amount numeric);
insert into your_table values
('A','2022-01-01',50),
('A','2022-02-01',76),
('A','2022-03-01',300),
('A','2022-04-01',234),
('A','2022-05-01',14),
('A','2022-06-01',9),
('B','2022-01-01',201),
('B','2022-02-01',33),
('B','2022-03-01',90),
('B','2022-04-01',41),
('B','2022-05-01',11),
('B','2022-06-01',5),
('C','2022-01-01',573),
('C','2022-02-01',77),
('C','2022-03-01',109),
('C','2022-04-01',137),
('C','2022-05-01',405),
('C','2022-06-01',621);
I am trying to calculate the percentage change in $ from 6 months prior to today's date for each type. For example:
Type A decreased -82% over six months.
Type B decreased -97.5%
Type C increased +8.4%.
How do I write this in PostgreSQL, mixed in with other statements?
It looks like you are comparing against 5, not 6, months prior, and 2022-06-01 isn't today's date.
Join the table with itself on matching type and the desired time difference. Demo:
select
b.type,
b.compdate,
a.compdate "6 months earlier",
b.amount "amount 6 months back",
round(-(100-b.amount/a.amount*100),2) "change"
from your_table a
inner join your_table b
on a.type=b.type
and a.compdate = b.compdate - '5 months'::interval;
-- type | compdate | 6 months earlier | amount 6 months back | change
--------+------------+------------------+----------------------+--------
-- A | 2022-06-01 | 2022-01-01 | 9 | -82.00
-- B | 2022-06-01 | 2022-01-01 | 5 | -97.51
-- C | 2022-06-01 | 2022-01-01 | 621 | 8.38
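As a design alternative to the self-join, a window function can look back five rows within each type. A sketch that assumes exactly one row per type per month with no gaps (same sample data as above):
select type, compdate, change
from (
  select type, compdate,
         round(amount / lag(amount, 5) over (partition by type order by compdate) * 100 - 100, 2) as change
  from your_table
) s
where change is not null;
-- type |  compdate  | change
-- -----+------------+--------
--  A   | 2022-06-01 | -82.00
--  B   | 2022-06-01 | -97.51
--  C   | 2022-06-01 |   8.38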
I have a table pqdf which has an Effective_Date column. First I will take the distinct values of Effective_Date; from each date I then want to generate 6 months of dates.
If my start date is 2022-01-01 then the last row should be 2022-06-30, for a total of around 181 rows.
+----------------+
| Effective_Date |
+----------------+
| 2022-01-01 |
| 2022-01-01 |
| 2022-01-01 |
+----------------+
Please help. I tried the query below but it's not working.
select explode (sequence( first_value(to_date('Effective_Date'))), to_date(DATEADD(month, 6, Effective_Date)), interval 1 day) as date from pqdf
See if this works. If it doesn't, can you please also provide the error message that you are seeing?
WITH pqdf AS (
SELECT "2022-01-01" AS Effective_Date
)
SELECT
EXPLODE(SEQUENCE(
DATE(Effective_Date),
TO_DATE(DATEADD(MONTH, 6, DATE(Effective_Date))),
INTERVAL 1 DAY)
) AS date
FROM
pqdf
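If that runs, the same expression can be pointed at the real table; the "distinct of Effective_Date" step from the question becomes a subquery (a sketch reusing only the functions shown above):
SELECT
  EXPLODE(SEQUENCE(
    DATE(Effective_Date),
    TO_DATE(DATEADD(MONTH, 6, DATE(Effective_Date))),
    INTERVAL 1 DAY)
  ) AS date
FROM
  (SELECT DISTINCT Effective_Date FROM pqdf) AS d
Note that the sequence runs through the date exactly six months later (2022-07-01 for a 2022-01-01 start); if it should stop at 2022-06-30, subtract one day from the upper bound, e.g. with DATE_SUB.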
This is T-SQL and I'm trying to calculate the repeat purchase rate for the last 12 months. This is achieved by looking at the number of customers who have bought more than once in the last 12 months and the total number of customers in the last 12 months.
The SQL code below gives me just that, but I would like to do this dynamically for each of the last 12 months. This is the part where I'm stuck and not sure how best to achieve it.
Each month should include data going back 12 months, i.e. June should hold data between June 2018 and June 2019, May should hold data from May 2018 till May 2019.
[Order Date] is a normal datefield (yyyy-mm-dd hh:mm:ss)
DECLARE @startdate1 DATETIME
DECLARE @enddate1 DATETIME
SET @enddate1 = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE())-1, 0) -- Ending June 2019
SET @startdate1 = DATEADD(mm,DATEDIFF(mm,0,GETDATE())-13,0) -- Starting June 2018
;
with dataset as (
select [Phone No_] as who_identifier,
count(distinct([Order No_])) as mycount
from [MyCompany$Sales Invoice Header]
where [Order Date] between @startdate1 and @enddate1
group by [Phone No_]
),
frequentbuyers as (
select who_identifier, sum(mycount) as frequentbuyerscount
from dataset
where mycount > 1
group by who_identifier),
allpurchases as (
select who_identifier, sum(mycount) as allpurchasescount
from dataset
group by who_identifier
)
select sum(frequentbuyerscount) as frequentbuyercount, (select sum(allpurchasescount) from allpurchases) as allpurchasecount
from frequentbuyers
I'm hoping to achieve end result looking something like this:
...Dec, Jan, Feb, March, April, May, June, with each month holding values for both frequentbuyercount and allpurchasescount.
Here is the code. I made a little modification for frequentbuyerscount and allpurchasescount: if you use a SUMIF-like expression you don't need a second CTE.
if object_id('tempdb.dbo.#tmpMonths') is not null drop table #tmpMonths
create table #tmpMonths ( MonthID datetime, StartDate datetime, EndDate datetime)
declare @MonthCount int = 12
declare @Month datetime = DATEADD(MONTH, DATEDIFF(MONTH, 0, GETDATE()), 0)
while @MonthCount > 0 begin
insert into #tmpMonths( MonthID, StartDate, EndDate )
select @Month, dateadd(month, -12, @Month), @Month
set @Month = dateadd(month, -1, @Month)
set @MonthCount = @MonthCount - 1
end
;with dataset as (
select m.MonthID as MonthID, [Phone No_] as who_identifier,
count(distinct([Order No_])) as mycount
from [MyCompany$Sales Invoice Header]
inner join #tmpMonths m on [Order Date] between m.StartDate and m.EndDate
group by m.MonthID, [Phone No_]
),
buyers as (
select MonthID, who_identifier
, sum(iif(mycount > 1, mycount, 0)) as frequentbuyerscount --sum only if count > 1
, sum(mycount) as allpurchasescount
from dataset
group by MonthID, who_identifier
)
select
b.MonthID
, max(tm.StartDate) StartDate, max(tm.EndDate) EndDate
, sum(b.frequentbuyerscount) as frequentbuyercount
, sum(b.allpurchasescount) as allpurchasecount
from buyers b inner join #tmpMonths tm on tm.MonthID = b.MonthID
group by b.MonthID
Be aware that the code was only tested for syntax.
After the test data, this is the result:
MonthID | StartDate | EndDate | frequentbuyercount | allpurchasecount
-----------------------------------------------------------------------------
2018-08-01 | 2017-08-01 | 2018-08-01 | 340 | 3702
2018-09-01 | 2017-09-01 | 2018-09-01 | 340 | 3702
2018-10-01 | 2017-10-01 | 2018-10-01 | 340 | 3702
2018-11-01 | 2017-11-01 | 2018-11-01 | 340 | 3702
2018-12-01 | 2017-12-01 | 2018-12-01 | 340 | 3703
2019-01-01 | 2018-01-01 | 2019-01-01 | 340 | 3703
2019-02-01 | 2018-02-01 | 2019-02-01 | 2 | 8
2019-03-01 | 2018-03-01 | 2019-03-01 | 2 | 3
2019-04-01 | 2018-04-01 | 2019-04-01 | 2 | 3
2019-05-01 | 2018-05-01 | 2019-05-01 | 2 | 3
2019-06-01 | 2018-06-01 | 2019-06-01 | 2 | 3
2019-07-01 | 2018-07-01 | 2019-07-01 | 2 | 3
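If you prefer to avoid the WHILE loop, the same #tmpMonths rows can be produced set-based from a small VALUES list. A sketch (assumes SQL Server 2008 or later) that you could INSERT into #tmpMonths or use directly as a CTE:
;with months as (
    select dateadd(month, -v.n, dateadd(month, datediff(month, 0, getdate()), 0)) as MonthID
    from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11)) as v(n)
)
select MonthID,
       dateadd(month, -12, MonthID) as StartDate,
       MonthID as EndDate
from months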
I suspect I require some sort of windowing function to do this. I have the following item data as an example:
count | date
------+-----------
3 | 2017-09-15
9 | 2017-09-18
2 | 2017-09-19
6 | 2017-09-20
3 | 2017-09-21
First off, there are gaps in my data, and I have another query here:
select until_date, until_date - (lag(until_date) over ()) as delta_days from ranges
Which I have generated the following data:
until_date | delta_days
-----------+-----------
2017-09-08 |
2017-09-11 | 3
2017-09-13 | 2
2017-09-18 | 5
2017-09-21 | 3
2017-09-22 | 1
So I'd like my final query to produce this result:
start_date | ending_date | total_items
-----------+-------------+--------------
2017-09-08 | 2017-09-10 | 0
2017-09-11 | 2017-09-12 | 0
2017-09-13 | 2017-09-17 | 3
2017-09-18 | 2017-09-20 | 15
2017-09-21 | 2017-09-22 | 3
Which tells me the total count of items from the first table, per day, based on the custom ranges from the second table.
In this particular example, I would be summing up total_items BETWEEN start AND end (since there would be overlap on the dates, I'd subtract 1 from the end date to not count duplicates)
Anyone know how to do this?
Thanks!
Use the daterange type. Note that you do not have to calculate delta_days; just convert the ranges to dateranges and use the operator <# (element is contained by).
with counts(count, date) as (
values
(3, '2017-09-15'::date),
(9, '2017-09-18'),
(2, '2017-09-19'),
(6, '2017-09-20'),
(3, '2017-09-21')
),
ranges (until_date) as (
values
('2017-09-08'::date),
('2017-09-11'),
('2017-09-13'),
('2017-09-18'),
('2017-09-21'),
('2017-09-22')
)
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <# daterange
where not lower_inf(daterange)
group by 1
order by 1;
daterange | total_items
-------------------------+-------------
[2017-09-08,2017-09-11) | 0
[2017-09-11,2017-09-13) | 0
[2017-09-13,2017-09-18) | 3
[2017-09-18,2017-09-21) | 17
[2017-09-21,2017-09-22) | 3
(5 rows)
Note that in the dateranges above the lower bounds are inclusive while the upper bounds are exclusive.
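A quick way to verify those bound semantics (@> tests whether a range contains an element):
select daterange('2017-09-08', '2017-09-11') @> date '2017-09-08' as includes_lower,  -- true
       daterange('2017-09-08', '2017-09-11') @> date '2017-09-11' as includes_upper;  -- false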
If you want to calculate items per day in the dateranges:
select
daterange, total_items,
round(total_items::dec/(upper(daterange)- lower(daterange)), 2) as items_per_day
from (
select daterange, coalesce(sum(count), 0) as total_items
from (
select daterange(lag(until_date) over (order by until_date), until_date)
from ranges
) s
left join counts on date <# daterange
where not lower_inf(daterange)
group by 1
) s
order by 1
daterange | total_items | items_per_day
-------------------------+-------------+---------------
[2017-09-08,2017-09-11) | 0 | 0.00
[2017-09-11,2017-09-13) | 0 | 0.00
[2017-09-13,2017-09-18) | 3 | 0.60
[2017-09-18,2017-09-21) | 17 | 5.67
[2017-09-21,2017-09-22) | 3 | 3.00
(5 rows)
I'm having problems querying when lead() values are not within the range of the current row; rows on the range's edge return null lead() values.
Let’s say I have a simple table to keep track of continuous counters
create table anytable
( wseller integer NOT NULL,
wday date NOT NULL,
wshift smallint NOT NULL,
wcounter numeric(9,1) );
with the following values
wseller wday wshift wcounter
1 2016-11-30 1 100.5
1 2017-01-03 1 102.5
1 2017-01-25 2 103.2
1 2017-02-05 2 106.1
2 2015-05-05 2 81.1
2 2017-01-01 1 92.1
2 2017-01-01 2 93.1
3 2016-12-01 1 45.2
3 2017-01-05 1 50.1
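For completeness, the sample rows as runnable inserts (values copied from the table above):
insert into anytable (wseller, wday, wshift, wcounter) values
(1, '2016-11-30', 1, 100.5),
(1, '2017-01-03', 1, 102.5),
(1, '2017-01-25', 2, 103.2),
(1, '2017-02-05', 2, 106.1),
(2, '2015-05-05', 2, 81.1),
(2, '2017-01-01', 1, 92.1),
(2, '2017-01-01', 2, 93.1),
(3, '2016-12-01', 1, 45.2),
(3, '2017-01-05', 1, 50.1);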
I want the net units for the current year:
wseller wday wshift units
1 2017-01-03 1 2
1 2017-01-25 2 0.7
1 2017-02-05 2 2.9
2 2017-01-01 1 11
2 2017-01-01 2 1
3 2017-01-05 1 4.9
If I use
select wseller, wday, wshift, wcounter-lead(wcounter) over (partition by wseller order by wseller, wday desc, wshift desc)
from anytable
where wday>='2017-01-01'
it gives me nulls on the first row of each wseller partition. I'm using this query within a large CTE.
What am I doing wrong?
Window functions are evaluated after the WHERE clause has filtered the rows, so rows excluded by the filter are not visible to lead(). Move the condition to the outer query:
select *
from (
select
wseller, wday, wshift,
wcounter- lead(wcounter) over (partition by wseller order by wday desc, wshift desc)
from anytable
) s
where wday >= '2017-01-01'
order by wseller, wday, wshift
wseller | wday | wshift | ?column?
---------+------------+--------+----------
1 | 2017-01-03 | 1 | 2.0
1 | 2017-01-25 | 2 | 0.7
1 | 2017-02-05 | 2 | 2.9
2 | 2017-01-01 | 1 | 11.0
2 | 2017-01-01 | 2 | 1.0
3 | 2017-01-05 | 1 | 4.9
(6 rows)