Table:
worker_id | created_at | state_id
----------+------------+---------
        1 | 14-6-2021  |       12
        2 | 14-6-2021  |       12
        3 | 13-6-2021  |       12
        4 | 12-6-2021  |       12
        3 | 10-6-2021  |        4
        2 | 9-6-2021   |        4
        4 | 8-6-2021   |       12
        4 | 1-6-2021   |        4
        1 | 1-6-2021   |       12
What I want:

worker_id | created_at | state_id
----------+------------+---------
        2 | 14-6-2021  |       12
        3 | 13-6-2021  |       12
I need to obtain the worker_id of the workers that have state_id = 12 and whose previous state_id was 4. I have made multiple attempts, but none of them work.
First of all, please write your post in English.
To get the result you want, you can use window functions and a CTE:
with w as (
    select *,
           lag(state_id) over (partition by worker_id order by created_at) as previous_state
    from workers
)
select * from w where state_id = 12 and previous_state = 4;
Result here
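Note that with the sample data this also returns worker 4, whose 8-6-2021 row (state 12) follows a state-4 row as well. If the requirement is that each worker's most recent state is 12 and the state just before it was 4 (which is what the expected output shows), you can rank the rows per worker first. A sketch, assuming the same workers table:

with w as (
    select *,
           lag(state_id) over (partition by worker_id order by created_at) as previous_state,
           row_number() over (partition by worker_id order by created_at desc) as rn
    from workers
)
select worker_id, created_at, state_id
from w
where rn = 1              -- latest row per worker
  and state_id = 12
  and previous_state = 4;

Here rn = 1 keeps only each worker's latest row, so only workers 2 and 3 survive.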
I'm struggling to emulate a LEAD function to calculate the difference (next row's date - current row's date).
I'm currently using MySQL 5.7 to accomplish this. I have tried looking at various sources on Stack Overflow, but I'm not sure how to get the result.
This is what I want:
What I currently have now is the same thing, without the days column.
I would also like to know how to get a column of dates that holds the date coming after the current row's date.
This seems to work (except for row id=4, where the expected result is unclear):
DROP TABLE IF EXISTS table4;
CREATE TABLE table4 (id integer, user_id integer, product varchar(10), `date` date);
INSERT INTO table4 VALUES
(1,1,'item1','2020-01-01'),
(2,1,'item2','2020-01-01'),
(3,1,'item3','2020-01-02'),
(4,1,'item4','2020-01-02'),
(5,2,'item5','2020-01-06'),
(6,2,'item6','2020-01-09'),
(7,2,'item7','2020-01-09'),
(8,2,'item8','2020-01-10');
SELECT
  id,
  user_id,
  product,
  `date`,
  -- ORDER BY added so that LIMIT 1 deterministically picks the next id
  (SELECT `date` FROM table4 t4 WHERE t4.id > t1.id ORDER BY t4.id LIMIT 1) AS x,
  COALESCE(DATEDIFF((SELECT `date` FROM table4 t4 WHERE t4.id > t1.id ORDER BY t4.id LIMIT 1), `date`), 0) AS days
FROM table4 t1;
output:
+----+---------+---------+------------+------------+------+
| id | user_id | product | date       | x          | days |
+----+---------+---------+------------+------------+------+
|  1 |       1 | item1   | 2020-01-01 | 2020-01-01 |    0 |
|  2 |       1 | item2   | 2020-01-01 | 2020-01-02 |    1 |
|  3 |       1 | item3   | 2020-01-02 | 2020-01-02 |    0 |
|  4 |       1 | item4   | 2020-01-02 | 2020-01-06 |    4 |
|  5 |       2 | item5   | 2020-01-06 | 2020-01-09 |    3 |
|  6 |       2 | item6   | 2020-01-09 | 2020-01-09 |    0 |
|  7 |       2 | item7   | 2020-01-09 | 2020-01-10 |    1 |
|  8 |       2 | item8   | 2020-01-10 |            |    0 |
+----+---------+---------+------------+------------+------+
The column x is only here to show which date is returned from the subquery; it is not really needed for the final result.
DBFIDDLE
EDIT: when there are no "gaps" in the numbering of id, you could do this to get a solution which should perform better:
SELECT
t1.id,
t1.user_id,
t1.product,
t1.date,
COALESCE(DATEDIFF(t2.date,t1.date),0) as days
FROM table4 t1
LEFT JOIN table4 t2 on t2.id = t1.id+1
I added this to the DBFIDDLE
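For reference, MySQL 8.0+ (and MariaDB 10.2+) support LEAD() directly, so on a newer server the same result needs no correlated subquery. A sketch against the same table4 (not applicable to the asker's 5.7 setup):

SELECT
  id,
  user_id,
  product,
  `date`,
  -- LEAD() fetches the date of the next row in id order
  COALESCE(DATEDIFF(LEAD(`date`) OVER (ORDER BY id), `date`), 0) AS days
FROM table4;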
I have a table in Netezza that looks like this
Date Stock Return
2015-01-01 A xxx
2015-01-02 A xxx
2015-01-03 A 0
2015-01-04 A 0
2015-01-05 A xxx
2015-01-06 A xxx
2015-01-07 A xxx
2015-01-08 A xxx
2015-01-09 A xxx
2015-01-10 A 0
2015-01-11 A 0
2015-01-12 A xxx
2015-01-13 A xxx
2015-01-14 A xxx
2015-01-15 A xxx
2015-01-16 A xxx
2015-01-17 A 0
2015-01-18 A 0
2015-01-19 A xxx
2015-01-20 A xxx
The data represents stock returns for various stocks and dates. What I need to do is group the data by a given interval, and by day within that interval. Another difficulty is that the weekends (the 0s) have to be discounted (ignoring public holidays). And the start date of the first interval should be an arbitrary date.
For example, my output should look something like this:
Interval Q01 Q02 Q03 Q04 Q05
1 xxx xxx xxx xxx xxx
2 xxx xxx xxx xxx xxx
3 xxx xxx xxx xxx xxx
4 xxx xxx xxx xxx xxx
This output would represent intervals of 5 working days each, with averaged returns as the results. In terms of the raw data from above, with a start date of 1 Jan: the 1st interval includes the 1st/2nd/5th/6th/7th (the 3rd and 4th fall on a weekend and are ignored); Q01 would be the 1st, Q02 the 2nd, Q03 the 5th, etc. The second interval goes over the 8th/9th/12th/13th/14th.
What I tried unsuccessfully is using
CEIL(CAST(EXTRACT(DOY FROM DATE) AS FLOAT) / CAST (10 AS FLOAT)) AS interval
EXTRACT(DAY FROM DATE) % 10 AS DAYinInterval
I also tried playing around with rolling counters and, for variable starting dates, setting my DOY to zero with something like this:
CEIL(CAST(EXTRACT(DOY FROM DATE) - EXTRACT(DOY FROM 'start-date' AS FLOAT) / CAST (10 AS FLOAT)) AS Interval
The one thing that came closest to what I would expect is this
SUM(Number) OVER(PARTITION BY STOCK ORDER BY DATE ASC rows 10 preceding) AS Counter
Unfortunately it goes from 1 to 10 followed by 11s, where it should start again from 1 to 10.
I would love to see how this can be implemented in an elegant way. Thanks!
I'm not entirely sure I understand the question, but I think I might, so I'm going to take a swing at this with some windowed aggregates and subqueries.
Here's the sample data, plugging in some random non-zero data for weekdays.
DATE | STOCK | RETURN
------------+-------+--------
2015-01-01 | A | 16
2015-01-02 | A | 80
2015-01-03 | A | 0
2015-01-04 | A | 0
2015-01-05 | A | 60
2015-01-06 | A | 25
2015-01-07 | A | 12
2015-01-08 | A | 1
2015-01-09 | A | 81
2015-01-10 | A | 0
2015-01-11 | A | 0
2015-01-12 | A | 35
2015-01-13 | A | 20
2015-01-14 | A | 69
2015-01-15 | A | 72
2015-01-16 | A | 89
2015-01-17 | A | 0
2015-01-18 | A | 0
2015-01-19 | A | 100
2015-01-20 | A | 67
(20 rows)
Here's my swing at it, with embedded comments.
select avg(return),
       date_period,
       day_period
from (
    -- use row_number to generate a sequential value for each DOW,
    -- with a WHERE to filter out the weekends
    select date,
           stock,
           return,
           date_period,
           row_number() over (partition by date_period order by date asc) day_period
    from (
        -- bin out the entries by date_period, using the first_value of the
        -- entire set as the starting point, modulo 7
        select date,
               stock,
               return,
               date + (first_value(date) over (order by date asc) - date) % 7 date_period
        from stocks
        where date >= '2015-01-01'
        -- setting the starting period date here
    ) foo
    where extract(dow from date) not in (1,7)
) foo
group by date_period, day_period
order by date_period asc;
The results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
16.000000 | 2015-01-01 | 1
80.000000 | 2015-01-01 | 2
60.000000 | 2015-01-01 | 3
25.000000 | 2015-01-01 | 4
12.000000 | 2015-01-01 | 5
1.000000 | 2015-01-08 | 1
81.000000 | 2015-01-08 | 2
35.000000 | 2015-01-08 | 3
20.000000 | 2015-01-08 | 4
69.000000 | 2015-01-08 | 5
72.000000 | 2015-01-15 | 1
89.000000 | 2015-01-15 | 2
100.000000 | 2015-01-15 | 3
67.000000 | 2015-01-15 | 4
(14 rows)
Changing the starting date to '2015-01-03' to see if it adjusts properly:
...
from stocks
where date >= '2015-01-03'
...
And the results:
AVG | DATE_PERIOD | DAY_PERIOD
------------+-------------+------------
60.000000 | 2015-01-03 | 1
25.000000 | 2015-01-03 | 2
12.000000 | 2015-01-03 | 3
1.000000 | 2015-01-03 | 4
81.000000 | 2015-01-03 | 5
35.000000 | 2015-01-10 | 1
20.000000 | 2015-01-10 | 2
69.000000 | 2015-01-10 | 3
72.000000 | 2015-01-10 | 4
89.000000 | 2015-01-10 | 5
100.000000 | 2015-01-17 | 1
67.000000 | 2015-01-17 | 2
(12 rows)
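To make the binning step concrete, here is the modulo arithmetic from the inner subquery evaluated in isolation. This is a sketch in Postgres-style date arithmetic (which Netezza largely shares), with the literals taken from the example above:

-- (start - date) is a negative day count; taking it modulo 7 and adding it
-- back pulls each date to the first day of its 7-day bin. For 2015-01-09:
-- (2015-01-01 - 2015-01-09) = -8 days; -8 % 7 = -1; 2015-01-09 + (-1) = 2015-01-08,
-- which is exactly the date_period it lands in above.
SELECT DATE '2015-01-09'
     + (DATE '2015-01-01' - DATE '2015-01-09') % 7 AS date_period;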
I use Postgres, and I have a large number of rows with values and a date per station.
(Dates can be separated by several days.)
id | value | idstation | udate
---+-------+-----------+---------------------
 1 |     5 |        12 | 1984-02-11 00:00:00
 2 |     7 |        12 | 1984-02-17 00:00:00
 3 |     8 |        12 | 1984-02-21 00:00:00
 4 |     9 |        12 | 1984-02-23 00:00:00
 5 |     4 |        12 | 1984-02-24 00:00:00
 6 |     8 |        12 | 1984-02-28 00:00:00
 7 |     9 |        14 | 1984-02-21 00:00:00
 8 |    15 |        15 | 1984-02-21 00:00:00
 9 |    14 |        18 | 1984-02-21 00:00:00
10 |   200 |        19 | 1984-02-21 00:00:00
Forgive what may be a silly question, but I'm not much of a database guru.
Is it possible to directly enter a SQL query that will calculate a linear regression per station for each date, knowing that the regression must be calculated using only the current row's date, the previous row's date, and the next row's date?
For example, the linear regression for id 2 must be calculated with the values 7 (current), 5 (previous) and 8 (next), for the dates 1984-02-17, 1984-02-11 and 1984-02-21.
Edit: I have to use regr_intercept(value, udate), but I really don't know how to do this if I have to use only the current, previous and next value/date for each row.
Edit 2: 3 rows added to idstation 12; the ids and dates have changed.
Hope you can help me, thank you!
This is the combination of Joop's statistics and Denis's window functions:
WITH num AS (
SELECT id, idstation
, (udate - '1984-01-01'::date) as idate -- count in days since 1 Jan 1984
, value AS value
FROM thedata
)
-- id + the ids of the {prev,next} records
-- within the same idstation group
, drag AS (
SELECT id AS center
, LAG(id) OVER www AS prev
, LEAD(id) OVER www AS next
FROM thedata
WINDOW www AS (partition by idstation ORDER BY id)
)
-- junction CTE between ID and its three feeders
, tri AS (
SELECT center AS this, center AS that FROM drag
UNION ALL SELECT center AS this , prev AS that FROM drag
UNION ALL SELECT center AS this , next AS that FROM drag
)
SELECT t.this, n.idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM num n
JOIN tri t ON t.that = n.id
GROUP BY t.this, n.idstation
;
Results:
this | idstation | intercept | slope | rsq | avgx | avgy
------+-----------+-------------------+-------------------+-------------------+------------------+------------------
1 | 12 | -46 | 1 | 1 | 52 | 6
2 | 12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
3 | 12 | -10.6666666666667 | 0.333333333333333 | 1 | 54.5 | 7.5
4 | 14 | | | | 51 | 9
5 | 15 | | | | 51 | 15
6 | 18 | | | | 51 | 14
7 | 19 | | | | 51 | 200
(7 rows)
The clustering of the group-of-three can probably be done more elegantly using a rank() or row_number() function, which would also allow larger sliding windows to be used.
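For illustration, a sketch of that row_number() variant; the CTE and column names follow the tables above, but this query is my addition, not part of the original answer. The BETWEEN bounds are the knob for wider sliding windows:

WITH num AS (
    SELECT id, idstation
         , (udate - DATE '1984-01-01') AS idate   -- days since 1 Jan 1984
         , value
         , row_number() OVER (PARTITION BY idstation ORDER BY udate, id) AS rn
    FROM thedata
)
SELECT c.id AS this, c.idstation
     , regr_intercept(n.value, n.idate) AS intercept
     , regr_slope(n.value, n.idate)     AS slope
     , regr_r2(n.value, n.idate)        AS rsq
FROM num c
JOIN num n ON n.idstation = c.idstation
          AND n.rn BETWEEN c.rn - 1 AND c.rn + 1  -- widen to +/- 2, 3, ... for larger windows
GROUP BY c.id, c.idstation
ORDER BY c.idstation, c.id;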
DROP SCHEMA zzz CASCADE;
CREATE SCHEMA zzz ;
SET search_path=zzz;
CREATE TABLE thedata
( id INTEGER NOT NULL PRIMARY KEY
, value INTEGER NOT NULL
, idstation INTEGER NOT NULL
, udate DATE NOT NULL
);
INSERT INTO thedata(id,value,idstation,udate) VALUES
(1 ,5 ,12 ,'1984-02-21' )
,(2 ,7 ,12 ,'1984-02-23' )
,(3 ,8 ,12 ,'1984-02-26' )
,(4 ,9 ,14 ,'1984-02-21' )
,(5 ,15 ,15 ,'1984-02-21' )
,(6 ,14 ,18 ,'1984-02-21' )
,(7 ,200 ,19 ,'1984-02-21' )
;
WITH a AS (
SELECT idstation
, (udate - '1984-01-01'::date) as idate -- count in days since 1 Jan 1984
, value AS value
FROM thedata
)
SELECT idstation
, regr_intercept(value,idate) AS intercept
, regr_slope(value,idate) AS slope
, regr_r2(value,idate) AS rsq
, regr_avgx(value,idate) AS avgx
, regr_avgy(value,idate) AS avgy
FROM a
GROUP BY idstation
;
output:
idstation | intercept | slope | rsq | avgx | avgy
-----------+-------------------+-------------------+-------------------+------------------+------------------
15 | | | | 51 | 15
14 | | | | 51 | 9
19 | | | | 51 | 200
12 | -24.2105263157895 | 0.578947368421053 | 0.909774436090226 | 53.3333333333333 | 6.66666666666667
18 | | | | 51 | 14
(5 rows)
Note: if you want a spline-like regression you should also use the lag() and lead() window functions, like in Denis's answer.
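A minimal sketch of that lag()/lead() combination (my own illustration, not from either answer): collect each row's neighbors with the window functions, then unpivot the up-to-three (value, date) pairs with a LATERAL VALUES list and feed them to the regr_* aggregates:

WITH n AS (
    SELECT id, idstation, value
         , (udate - DATE '1984-01-01')             AS idate
         , LAG(value)  OVER w                      AS prev_val
         , LAG(udate - DATE '1984-01-01')  OVER w  AS prev_idate
         , LEAD(value) OVER w                      AS next_val
         , LEAD(udate - DATE '1984-01-01') OVER w  AS next_idate
    FROM thedata
    WINDOW w AS (PARTITION BY idstation ORDER BY udate)
)
SELECT n.id, n.idstation
     , regr_intercept(v.val, v.d) AS intercept
     , regr_slope(v.val, v.d)     AS slope
FROM n
CROSS JOIN LATERAL (VALUES (n.value,    n.idate)
                         , (n.prev_val, n.prev_idate)
                         , (n.next_val, n.next_idate)) AS v(val, d)
WHERE v.val IS NOT NULL      -- drop missing neighbors at the edges of each station
GROUP BY n.id, n.idstation;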
If the average is OK for you, you could use the built-in avg()... Something like
SELECT avg("value") FROM "my_table" WHERE "idstation" = 3;
should do. For more complicated things you will need to write some PL/pgSQL function, I'm afraid, or check for a PostgreSQL add-on.
Look into window functions. If I get your question correctly, lead() and lag() will likely give you precisely what you want. Example usage:
select idstation as idstation,
id as curr_id,
udate as curr_date,
lag(id) over w as prev_id,
lag(udate) over w as prev_date,
lead(id) over w as next_id,
lead(udate) over w as next_date
from dates
window w as (
partition by idstation order by udate, id
)
order by idstation, udate, id
http://www.postgresql.org/docs/current/static/tutorial-window.html
I am working on a Java implementation for temporal aggregation using a PostgreSQL database.
My table looks like this
Value | Start | Stop
(int) | (Date) | (Date)
-------------------------------
1 | 2004-01-01 | 2010-01-01
4 | 2000-01-01 | 2008-01-01
So to visualize these periods:
                    ------------------------------
----------------------------------------
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[        4         ][       5=4+1      ][   1    ]
My algorithm now calculates temporal aggregations of the data, e.g. SUM():
Value | Start | Stop
-------------------------------
4 | 2000-01-01 | 2004-01-01
5 | 2004-01-01 | 2008-01-01
1 | 2008-01-01 | 2010-01-01
In order to test the gained results, I now would like to query the data directly using PostgreSQL. I know that there is no easy way to solve this problem yet. However, there surely is a way to get the same results. The aggregations Count, Max, Min, Sum and Average should be supported. I do not mind a bad or slow solution; it just has to work.
A query I found so far which should work similarly is the following:
select count(*), ts, te
from ( checkout a normalize checkout b using() ) checkoutNorm
group by ts, te;
My adoption looks like this:
select count(*), start, stop
from ( myTable a normalize myTable b using() ) myTableNorm
group by start, stop;
However, an error was reported: ERROR: syntax error at or near "normalize" -- LINE 2: from ( ndbs_10 a normalize ndbs_10 b using() ) ndbsNorm.
Does anyone have a solution to this problem? It does not have to be based on the above query, as long as it works. Thanks a lot.
Your question was really hard to understand. But I think I figured it out.
You want a running sum over value. Values are only applicable between the start and stop of a time period, so they have to be added at the beginning of that period and deducted at the end.
In addition you want the begin and end of the resulting period the sum is valid for.
That should do it:
-- DROP SCHEMA x CASCADE;
CREATE SCHEMA x;
CREATE TABLE x.tbl(val int, start date, stop date);
INSERT INTO x.tbl VALUES
(4 ,'2000-01-01' ,'2008-01-01')
,(7 ,'2001-01-01' ,'2009-01-01')
,(1 ,'2004-01-01' ,'2010-01-01')
,(2 ,'2005-01-01' ,'2006-01-01');
WITH a AS (
SELECT start as ts, val FROM x.tbl
UNION ALL
SELECT stop, val * (-1) FROM x.tbl
ORDER BY 1, 2)
SELECT sum(val) OVER w AS val_sum
,ts AS start
,lead(ts) OVER w AS stop
FROM a
WINDOW w AS (ORDER BY ts)
ORDER BY ts;
 val_sum | start      | stop
---------+------------+------------
       4 | 2000-01-01 | 2001-01-01
      11 | 2001-01-01 | 2004-01-01
      12 | 2004-01-01 | 2005-01-01
      14 | 2005-01-01 | 2006-01-01
      12 | 2006-01-01 | 2008-01-01
       8 | 2008-01-01 | 2009-01-01
       1 | 2009-01-01 | 2010-01-01
       0 | 2010-01-01 |
Edit after request
For all requested aggregate functions:
SELECT period
,val_sum
,val_count
,val_sum::float /val_count AS val_avg
,(SELECT min(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_min
,(SELECT max(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_max
,start
,stop
FROM (
WITH a AS (
SELECT start as ts, val, 1 AS c FROM x.tbl
UNION ALL
SELECT stop, val, -1 FROM x.tbl
ORDER BY 1, 2)
SELECT count(*) OVER w AS period
,sum(val*c) OVER w AS val_sum
,sum(c) OVER w AS val_count
,ts AS start
,lead(ts) OVER w AS stop
FROM a
WINDOW w AS (ORDER BY ts)
ORDER BY ts
) y
WHERE stop IS NOT NULL;
period | val_sum | val_count | val_avg | val_min | val_max | start | stop
--------+---------+-----------+---------+---------+---------+------------+------------
1 | 4 | 1 | 4 | 4 | 4 | 2000-01-01 | 2001-01-01
2 | 11 | 2 | 5.5 | 4 | 7 | 2001-01-01 | 2004-01-01
3 | 12 | 3 | 4 | 1 | 7 | 2004-01-01 | 2005-01-01
4 | 14 | 4 | 3.5 | 1 | 7 | 2005-01-01 | 2006-01-01
5 | 12 | 3 | 4 | 1 | 7 | 2006-01-01 | 2008-01-01
6 | 8 | 2 | 4 | 1 | 7 | 2008-01-01 | 2009-01-01
7 | 1 | 1 | 1 | 1 | 1 | 2009-01-01 | 2010-01-01
min() and max() could possibly be optimized, but that should be good enough.
CTEs (WITH clauses) and subqueries are interchangeable, as you can see.