Cross join Redshift with sequence of dates - amazon-redshift

I'd like to do the following Athena query with Redshift, but so far it's been impossible to reproduce it. The query should be called in a CREATE TABLE AS () statement, so generate_series() ideas might not work. Any ideas?
Athena query:
SELECT *
FROM table_one t1
CROSS JOIN UNNEST(
    slice(sequence(t1.effective_date, t1.expiration_date, INTERVAL '1' MONTH), 1, 12)
) AS t (sequence_date)
As requested, I add an example to show what I'm trying to do. Basically I have a record with a validity interval (year units 1, 2, 3...) and I'd like to replicate it N times such that each new record is assigned the date YYYY-MM-DD plus interval*12/N months (see the examples below).
Original record:
Date       | variables
2021-05-06 | values
To be (N=12 and interval of 1 year):
Date       | variables
2021-05-06 | values/12
2021-06-06 | values/12
2021-07-06 | values/12
2021-08-06 | values/12
2021-09-06 | values/12
2021-10-06 | values/12
2021-11-06 | values/12
2021-12-06 | values/12
2022-01-06 | values/12
2022-02-06 | values/12
2022-03-06 | values/12
2022-04-06 | values/12
To be (N=4 and interval of two years):
Date       | variables
2021-05-06 | values/2
2021-11-06 | values/2
2022-05-06 | values/2
2022-11-06 | values/2
Thanks for the help

Likely the best way to do this is with a recursive CTE - https://docs.aws.amazon.com/redshift/latest/dg/r_WITH_clause.html
Example - Generate rows with incrementing dates based on just a starting date in Redshift
What you seem to be doing is a little more complex than this example. If you can't get it working, post some sample data that the experts here can use to create a sample query for you.
================================================
With the new info and the above recursive CTE process I came up with this:
drop table if exists table_one;
create table table_one (
    dt date,
    info varchar(32),
    n int,
    y int);
insert into table_one values ('2021-05-06', 'record info here', 12, 1);
commit;
with recursive dates(dt, info, n, y, end_dt) as (
    -- seed: the original record plus its computed expiration date (dt + y years)
    select dt::date, info, n, y, dateadd(year, y, dt)::date as end_dt
    from table_one
    union all
    -- step: advance by 12 * y / n months until one step short of end_dt
    select dateadd(month, 12 * y / n, dt)::date as dt, info, n, y, end_dt::date
    from dates d
    where d.dt < dateadd(month, -12 * y / n, end_dt)
)
select dt, info from dates;
I'm not sure this is how you want to get N and year into the process, but hopefully you can modify from here. Just change the values of N and year in the table_one insert statement and rerun the whole process to get your 2nd result, as in the sketch below.
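As a quick check against your second expected result (N=4, interval of two years), reuse the same table and query with different values; the step becomes 12 * 2 / 4 = 6 months:
delete from table_one;
insert into table_one values ('2021-05-06', 'record info here', 4, 2);
commit;
-- rerunning the recursive CTE above should now return four rows:
-- 2021-05-06, 2021-11-06, 2022-05-06, 2022-11-06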

Related

PostgreSQL grouping timestamp by day

I have a table x(x_id, ts), where ts is a timestamp.
And I have a second table y(y_id, day, month, year), which is supposed to have its values from x(ts).
(Both x_id and y_id are serial)
For example:
x                                   y
_x_id_|__________ts__________      _y_id_|_day_|_month_|__year__
   1  | '2019-10-17 09:10:08'         1  |  17 |   10  |  2019
   2  | '2019-01-26 11:12:02'         2  |  26 |    1  |  2019
However, if on x I have 2 timestamps on the same day but at a different hour, this is how both tables should look:
x                                   y
_x_id_|__________ts__________      _y_id_|_day_|_month_|__year__
   1  | '2019-10-17 09:10:08'         1  |  17 |   10  |  2019
   2  | '2019-10-17 11:12:02'
Meaning y can't have 2 rows with the same day, month and year.
Currently, the way I'm doing this is:
INSERT INTO y(day, month, year)
SELECT
EXTRACT(day FROM ts) AS day,
EXTRACT(month FROM ts) AS month,
EXTRACT(year FROM ts) AS year
FROM x
ORDER BY year, month, day;
However, as you probably know, this doesn't check if the timestamps share the same date, so how can I do that?
Thank you for your time!
Assuming you build the unique index as recommended in the other answer below, change your insert to:
insert into y(day, month, year)
select extract(day from ts)   as day
     , extract(month from ts) as month
     , extract(year from ts)  as year
from x
on conflict do nothing;
I hope your table X is not very large, as the above insert (like your original) will attempt to insert a row into Y for every row in X on every execution - there is NO WHERE clause.
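If rescanning all of X on every run becomes a problem, here is a minimal sketch of a narrowing WHERE clause; the filter is my addition, not part of the answer above, and it assumes x.ts only ever grows and PostgreSQL 9.4+ for make_date():
insert into y(day, month, year)
select extract(day from ts)   as day
     , extract(month from ts) as month
     , extract(year from ts)  as year
from x
-- only consider timestamps on or after the newest date already present in y
where ts >= (select coalesce(max(make_date(year::int, month::int, day::int)), '-infinity')
             from y)
on conflict do nothing;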
Add a UNIQUE constraint on table y to prevent adding the same date twice.
CREATE UNIQUE INDEX CONCURRENTLY y_date
ON y (year, month, day);
Then add it to y:
ALTER TABLE y
ADD CONSTRAINT y_unique_date
UNIQUE USING INDEX y_date;
Note that you'll get an SQL error when the constraint is violated. If you don't want that and would rather just ignore the INSERT, use a BEFORE INSERT trigger, returning NULL when you detect that the "date" already exists, or just use ON CONFLICT DO NOTHING in your INSERT statement, as hinted by @Belayer.
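For completeness, a minimal sketch of that trigger variant (the function and trigger names here are made up):
CREATE OR REPLACE FUNCTION y_skip_duplicate_date() RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM y
               WHERE year = NEW.year AND month = NEW.month AND day = NEW.day) THEN
        RETURN NULL;  -- returning NULL silently discards the inserted row
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER y_skip_duplicates
BEFORE INSERT ON y
FOR EACH ROW EXECUTE PROCEDURE y_skip_duplicate_date();
ON CONFLICT DO NOTHING is the simpler choice, though; the trigger only pays off if you need extra logic at insert time.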

Postgres find where dates are NOT overlapping between two tables

I have two tables and I am trying to find data gaps in them where the dates do not overlap.
Item Table:
id | unique | start_date | end_date   | data
1  | a      | 2019-01-01 | 2019-01-31 | X
2  | a      | 2019-02-01 | 2019-02-28 | Y
3  | b      | 2019-01-01 | 2019-06-30 | Y
Plan Table:
id | item_unique | start_date | end_date
1  | a           | 2019-01-01 | 2019-01-10
2  | a           | 2019-01-15 | 'infinity'
I am trying to find a way to produce the following
Missing:
item_unique | from       | to
a           | 2019-01-11 | 2019-01-14
b           | 2019-01-01 | 2019-06-30
step-by-step demo: db<>fiddle
WITH excepts AS (
SELECT
item,
generate_series(start_date, end_date, interval '1 day') gs
FROM items
EXCEPT
SELECT
item,
generate_series(start_date, CASE WHEN end_date = 'infinity' THEN ( SELECT MAX(end_date) as max_date FROM items) ELSE end_date END, interval '1 day')
FROM plan
)
SELECT
item,
MIN(gs::date) AS start_date,
MAX(gs::date) AS end_date
FROM (
SELECT
*,
SUM(same_day) OVER (PARTITION BY item ORDER BY gs) AS sum
FROM (
SELECT
item,
gs,
COALESCE((gs - LAG(gs) OVER (PARTITION BY item ORDER BY gs) >= interval '2 days')::int, 0) as same_day
FROM excepts
) s
) s
GROUP BY item, sum
ORDER BY 1,2
Finding the missing days is quite simple. This is done within the WITH clause:
Generate all days of the date range and subtract this result from the expanded list of the second table. All dates that do not occur in the second table are kept. The infinity end is a little bit tricky, so I replaced the infinity occurrence with the max date of the first table. This avoids expanding an infinite list of dates.
The more interesting part is to reaggregate this list again, which is the part outside the WITH clause:
The lag() window function takes the previous date. If the previous date in the list is more than one day back, the check yields true (a time-change issue occurred here: this is why I am not asking for a one-day difference but a two-day difference; between 2019-03-31 and 2019-04-01 there are only 23 hours because of daylight saving time).
These 0 and 1 values are aggregated cumulatively. Every gap greater than one day starts a new interval (the days in between are covered).
This results in a groupable column which can be used to aggregate and find the max and min date of each interval
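To see that reaggregation step in isolation, here is a minimal sketch over three literal leftover days (same logic as above, with same_day renamed to is_gap for readability); the 2-day gap before 2019-01-14 starts a new group:
WITH excepts(item, gs) AS (
    VALUES ('a', '2019-01-11'::timestamp),
           ('a', '2019-01-12'),
           ('a', '2019-01-14')
)
SELECT item, MIN(gs::date) AS start_date, MAX(gs::date) AS end_date
FROM (
    SELECT *,
           SUM(is_gap) OVER (PARTITION BY item ORDER BY gs) AS grp
    FROM (
        SELECT item, gs,
               COALESCE((gs - LAG(gs) OVER (PARTITION BY item ORDER BY gs)
                         >= interval '2 days')::int, 0) AS is_gap
        FROM excepts
    ) s
) s
GROUP BY item, grp
ORDER BY 1, 2
-- returns: a | 2019-01-11 | 2019-01-12
--          a | 2019-01-14 | 2019-01-14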
I tried something with date ranges, which seems to be a better way, especially for avoiding the expansion of long date lists, but didn't come up with a proper solution. Maybe someone else will?

Trouble joining generate_series timestamp without time zone on a field that's timestamp without timezone

I am trying to figure out a way to report how many people are in a location at the same time, down to the second.
I have a table with the id for the person, the date they entered, the time they entered, the date they left and the time they left.
example:
select unique_id, start_date, start_time, end_date, end_time
from My_Table
where start_date between '09/01/2019' and '09/02/2019'
limit 3
"unique_id" "start_date" "start_time" "end_date" "end_time"
989179 "2019-09-01" "06:03:13" "2019-09-01" "06:03:55"
995203 "2019-09-01" "11:29:27" "2019-09-01" "11:30:13"
917637 "2019-09-01" "11:06:46" "2019-09-01" "11:06:59"
I've concatenated start_date & start_time, as well as end_date & end_time, so they are 2 fields:
select unique_id, ((start_date + start_time)::timestamp without time zone) as start_date,
((end_date + end_time)::timestamp without time zone) as end_date
result example:
"start_date"
"2019-09-01 09:28:54"
So I'm making that a CTE, then using a second CTE that uses generate_series between dates down to the second.
The goal being that the generate_series will have a row for every second between the two dates. Then when I join my data sets, I can count how many records exist in my_table where the start_date (plus time) is equal or greater than the generate_series date_time field, and the end_date (plus time) is less than or equal to the generate_series date_time field.
I feel that was harder to explain than it needed to be.
In theory, if a person was in the room from 2019-09-01 00:01:01 and left at 2019-09-01 00:01:03, I would count that record in the generate_series rows 2019-09-01 00:01:01, 2019-09-01 00:01:02 & 2019-09-01 00:01:03.
When I look at the data I can see that I should be returning hundreds of people in the room at specific peak periods, but the query returns all 0's.
Is this possibly a field formatting issue I need to adjust?
Here is the query:
with CTE as (
select unique_id, ((start_date+start_time)::timestamp without time zone) as start_date,
((end_date+end_time)::timestamp without time zone) as end_date
from My_table
where start_date between '09/01/2019' and '09/02/2019'
),
time_series as (
select generate_series( (date '2019-09-01')::timestamp, (date '2019-09-02')::timestamp, interval '1 second') as date_time
)
/*FINAL SELECT*/
select date_time, count(B.unique_id) as NumPpl
FROM (
select A.date_time
FROM time_series a
)x
left join CTE b on b.start_date >= x.date_time AND b.end_date <= x.date_time
GROUP BY 1
ORDER BY 1
(partial) result screenshot
Thank you in advance
I should also add that I have read-only access to this database, so I'm not able to create functions.
Simple version: b.start_date >= x.date_time AND b.end_date <= x.date_time will never be true assuming end_date is always after start_date.
Longer version: You also do not need a CTE for the generate_series(), and there is no reason to select all columns and all rows of this CTE in a subquery. I would also drop the CTE for your original data and just join it to the seconds. (NOTE: this does somewhat change the query, since you might now take entries into account where start_date is earlier than 2019-09-01. If you do not want this, you can add your condition back into the join condition, as sketched after the query below. But I guess this is what you really wanted.) I also removed some casts which were not needed. Try this:
SELECT gs.second, COUNT(my.unique_id)
FROM generate_series('2019-09-01'::timestamp, '2019-09-02'::timestamp, interval '1 second') gs (second)
LEFT JOIN my_table my ON (my.start_date + my.start_time) <= gs.second
AND (my.end_date + my.end_time) >= gs.second
GROUP BY 1
ORDER BY 1
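If you do want to keep the original start_date restriction, here is a sketch of the same query with the filter folded into the join condition, so the LEFT JOIN still produces a row for every second:
SELECT gs.second, COUNT(my.unique_id)
FROM generate_series('2019-09-01'::timestamp, '2019-09-02'::timestamp, interval '1 second') gs (second)
LEFT JOIN my_table my ON (my.start_date + my.start_time) <= gs.second
                     AND (my.end_date + my.end_time) >= gs.second
                     AND my.start_date BETWEEN '2019-09-01' AND '2019-09-02'
GROUP BY 1
ORDER BY 1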

CROSSTAB PostgreSQL - Alternative for PIVOT in Oracle

I'm migrating a query of Oracle pivot to PostgreSQL crosstab.
create table x_c (cntry numeric, week numeric, year numeric, days text, day text);
insert into x_c values(1,15,2015,'DAY1','MON');
...
insert into x_c values(1,15,2015,'DAY7','SUN');
insert into x_c values(2,15,2015,'DAY1','MON');
...
insert into x_c values(4,15,2015,'DAY7','SUN');
I have 4 weeks with 28 rows like this in a table. My Oracle query looks like this:
SELECT * FROM(select * from x_c)
PIVOT (MIN(DAY) FOR (DAYS) IN
('DAY1' AS DAY1 ,'DAY2' DAY2,'DAY3' DAY3,'DAY4' DAY4,'DAY5' DAY5,'DAY6' DAY6,'DAY7' DAY7 ));
Result:
cntry | week | year | day1 | day2 | day3 | day4 | day5 | day6 | day7
---------------------------------------------------------------------
  1   |  15  | 2015 | MON  | TUE  | WED  | THU  | FRI  | SAT  | SUN
...
  4   |  18  | 2015 | MON  | ...
Now I have written a Postgres crosstab query like this:
select *
from crosstab('select cntry,week,year,days,min(day) as day
from x_c
group by cntry,week,year,days'
,'select distinct days from x_c order by 1'
) as (cntry numeric,week numeric,year numeric
,day1 text,day2 text,day3 text,day4 text, day5 text,day6 text,day7 text);
I'm getting only one row as output:
1|17|2015|MON|TUE| ... -- only this row is coming
Where am I doing wrong?
ORDER BY was missing in your original query. The manual:
In practice the SQL query should always specify ORDER BY 1,2 to ensure that the input rows are properly ordered, that is, values with the same row_name are brought together and correctly ordered within the row.
More importantly (and more tricky), crosstab() requires exactly one row_name column. Detailed explanation in this closely related answer:
Crosstab splitting results due to presence of unrelated field
The solution you found is to nest multiple columns in an array and later unnest again. That's needlessly expensive, error prone and limited (only works for columns with identical data types or you need to cast and possibly lose proper sort order).
Instead, generate a surrogate row_name column with rank() or dense_rank() (rnk in my example):
SELECT cntry, week, year, day1, day2, day3, day4, day5, day6, day7
FROM crosstab (
'SELECT dense_rank() OVER (ORDER BY cntry, week, year)::int AS rnk
, cntry, week, year, days, day
FROM x_c
ORDER BY rnk, days'
, $$SELECT unnest('{DAY1,DAY2,DAY3,DAY4,DAY5,DAY6,DAY7}'::text[])$$
) AS ct (rnk int, cntry int, week int, year int
, day1 text, day2 text, day3 text, day4 text, day5 text, day6 text, day7 text)
ORDER BY rnk;
I use the data type integer for the output columns cntry, week, year because that seems to be the (cheaper) appropriate type. You can also use numeric like you had it.
Basics for crosstab queries here:
PostgreSQL Crosstab Query
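One prerequisite worth spelling out: crosstab() is not built into core PostgreSQL; it comes from the additional tablefunc module, which has to be installed once per database:
CREATE EXTENSION IF NOT EXISTS tablefunc;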
I got this figured out from http://www.postgresonline.com/journal/categories/24-tablefunc
select year_wk_cntry.t[1], year_wk_cntry.t[2], year_wk_cntry.t[3],
       day1, day2, day3, day4, day5, day6, day7
from crosstab('select ARRAY[cntry::numeric, week, year] as t, days, min(day) as day
               from x_c group by cntry, week, year, days order by 1,2',
              'select distinct days from x_c order by 1')
as year_wk_cntry (t numeric[], day1 text, day2 text, day3 text,
                  day4 text, day5 text, day6 text, day7 text);
thanks!!

t-sql get all dates between 2 dates [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Getting Dates between a range of dates
Let's say I have 2 dates (date part only, no time) and I want to get all dates between these 2 dates inclusive and insert them in a table. Is there an easy way to do it with a SQL statement (i.e without looping)?
Ex:
Date1: 2010-12-01
Date2: 2010-12-04
Table should have following dates:
2010-12-01, 2010-12-02, 2010-12-03, 2010-12-04
Assuming SQL Server 2005+, use a recursive query:
WITH sample AS (
SELECT CAST('2010-12-01' AS DATETIME) AS dt
UNION ALL
SELECT DATEADD(dd, 1, dt)
FROM sample s
WHERE DATEADD(dd, 1, dt) <= CAST('2010-12-04' AS DATETIME))
SELECT *
FROM sample
Returns:
dt
---------
2010-12-01 00:00:00.000
2010-12-02 00:00:00.000
2010-12-03 00:00:00.000
2010-12-04 00:00:00.000
Use CAST/CONVERT to format as you like.
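For example, here is a sketch of the same query rendering just the date part (style 120 is the ODBC canonical format 'yyyy-mm-dd hh:mi:ss'; CHAR(10) keeps only the date):
WITH sample AS (
SELECT CAST('2010-12-01' AS DATETIME) AS dt
UNION ALL
SELECT DATEADD(dd, 1, dt)
FROM sample s
WHERE DATEADD(dd, 1, dt) <= CAST('2010-12-04' AS DATETIME))
SELECT CONVERT(CHAR(10), dt, 120) AS dt -- '2010-12-01', '2010-12-02', ...
FROM sample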
Using parameters for start & end:
WITH sample AS (
SELECT @start_date AS dt
UNION ALL
SELECT DATEADD(dd, 1, dt)
FROM sample s
WHERE DATEADD(dd, 1, dt) <= @end_date)
INSERT INTO dbo.YOUR_TABLE
(datetime_column)
SELECT s.dt
FROM sample s
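One caveat not in the original answer: SQL Server recursive CTEs stop after 100 recursion levels by default, so for ranges longer than roughly 100 days the statement needs a query hint (the last answer below does the same with a fixed value):
WITH sample AS (
SELECT @start_date AS dt
UNION ALL
SELECT DATEADD(dd, 1, dt)
FROM sample s
WHERE DATEADD(dd, 1, dt) <= @end_date)
INSERT INTO dbo.YOUR_TABLE
(datetime_column)
SELECT s.dt
FROM sample s
OPTION (MAXRECURSION 0) -- 0 lifts the default limit of 100 recursions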
You need a numbers table. If you don't have a permanent one, this is a more efficient way of generating one than using a recursive CTE. A permanent one will be more efficient still, as long as it is read from the buffer cache.
DECLARE @D1 DATE = '2010-12-01'
DECLARE @D2 DATE = '2010-12-04'
;WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS i FROM L4)
SELECT DATEADD(day, i-1, @D1)
FROM Nums WHERE i <= 1 + DATEDIFF(day, @D1, @D2)
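To actually land these rows in a table, as the question asks, the same Nums CTE can feed an INSERT directly (dbo.YOUR_TABLE and datetime_column are placeholder names borrowed from the first answer):
DECLARE @D1 DATE = '2010-12-01'
DECLARE @D2 DATE = '2010-12-04'
;WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS i FROM L4)
INSERT INTO dbo.YOUR_TABLE (datetime_column)
SELECT DATEADD(day, i-1, @D1)
FROM Nums WHERE i <= 1 + DATEDIFF(day, @D1, @D2)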
I just did something like this:
declare @dt datetime = '2010-12-01'
declare @dtEnd datetime = '2010-12-04'
WHILE (@dt <= @dtEnd) BEGIN  -- <= keeps the end date inclusive
insert into [table](datefield)
values(@dt)
SET @dt = DATEADD(day, 1, @dt)
END
Repeated Question
Getting Dates between a range of dates
DECLARE @DateFrom smalldatetime, @DateTo smalldatetime;
SET @DateFrom='20000101';
SET @DateTo='20081231';
-------------------------------
WITH T(date)
AS
(
SELECT @DateFrom
UNION ALL
SELECT DateAdd(day,1,T.date) FROM T WHERE T.date < @DateTo
)
SELECT date FROM T OPTION (MAXRECURSION 32767);