Count how many date ranges cover each date from a series - amazon-redshift

I have a table with a series of date ranges like this:
id, start_date, end_date,
(1, '2022-01-01', '2022-01-31'),
(2, '2022-01-01', '2022-01-05'),
(3, '2022-01-03', '2022-01-06'),
(4, '2022-01-01', '2022-01-01')
I would like to be able to look at each date (from a series) and see how many of these ranges fall into each date. The date series would ideally start from the smallest value in a starting range. But starting from say 1 year ago to now would also be sufficient.
The outcome looking like this:
| Date | Count |
| ---------- | ----- |
| 10/07/2022 | 6 |
| 9/07/2022 | 8 |
| 8/07/2022 | 12 |
| 7/07/2022 | 5 |
We cannot use the generate_series function in Amazon Redshift. So I have generated a series like so which goes back 365 days:
SELECT DATEADD('day', 1-n, (DATE_TRUNC('day', CURRENT_DATE))) AS generated_date
FROM (SELECT ROW_NUMBER() OVER () AS n FROM stl_scan LIMIT 365) n
ORDER BY generated_date DESC
Struggling to implement this generated series into my query.

I usually create date sequences in Redshift with a recursive CTE like this - trying to create a date table in Redshift
I am wondering why you need a date sequence for this query. I can imagine queries that use one and queries that don't. Can you describe the query you are expecting to use?

Related

Postgres aggregate sum conditional on row comparison

So, I have data that looks something like this
User_Object | filesize | created_date | deleted_date
row 1 | 40 | May 10 | Aug 20
row 2 | 10 | June 3 | Null
row 3 | 20 | Nov 8 | Null
I'm building statistics to record user data usage to graph based on time based datapoints. However, I'm having difficulty developing a query to take the sum for each row of all queries before it, but only for the rows that existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:
SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
However, I want to alter this somehow so that there's a conditional for the the window function to only get the sum of any row created before this User Object when that row doesn't have a deleted date also before this User Object.
This incorrect syntax illustrates what I want to do:
SELECT User_Object.id, User_Object.created,
SUM(CASE WHEN NOT window_function_row.deleted
OR window_function_row.deleted > User_Object.created
THEN filesize ELSE 0)
OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
When this function runs on the data that I have, it should output something like
id | created | sum_data_used|
1 | May 10 | 40
2 | June 3 | 50
3 | Nov 8 | 30
Something along these lines may work for you:
SELECT a.user_id
,MIN(a.created_date) AS created_date
,SUM(b.filesize) AS sum_data_used
FROM user_object a
JOIN user_object b ON (b.user_id <= a.user_id
AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
GROUP BY a.user_id
ORDER BY a.user_id
For each row, self-join, match id lower or equal, and with date overlap. It will be expensive because each row needs to look through the entire table to calculate the files size result. There is no cumulative operation taking place here. But I'm not sure there is a way that.
Example table definition:
create table user_object(user_id int, filesize int, created_date date, deleted_date date);
Data:
1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>
Result:
1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30

Grouping based on every N days in postgresql

I have a table that includes ID, date, values (temperature) and some other stuff. My table looks like this:
+-----+--------------+------------+
| ID | temperature | Date |
+-----+--------------+------------+
| 1 | 26.3 | 2012-02-05 |
| 2 | 27.8 | 2012-02-06 |
| 3 | 24.6 | 2012-02-07 |
| 4 | 29.6 | 2012-02-08 |
+-----+--------------+------------+
I want to perform aggregation queries like sum and mean for every 10 days.
I was wondering if it is possible in psql or not?
SQL Fiddle
select
"date",
temperature,
avg(temperature) over(order by "date" rows 10 preceding) mean
from t
order by "date"
select id,
temperature,
sum(temperature) over (order by "date" rows between 10 preceding and current row)
from the_table;
It might not exactly be what you want, as it will do a moving sum over the last 10 rows, which is not necessarily the same as the last 10 days.
Since Postgres 11, you can now use a range based on an interval
select id,
temperature,
avg(temperature) over (order by "date"
range between interval '10 days' preceding and current row)
from the_table;

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates. Namely, 354 records for 2002 (instead of 365). For my calculations, I need to have the missing data in the table with Null values
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
you see that 2002-05-08 is missing. I want my final table to be like:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as identifier, but that doesn't make it a good idea. I use thedate as column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as #Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what #a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like #Clodoaldo demonstrated - but his query suffers from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps. This will not reorder the IDs:
insert into t (rainfall, "date") values
select null, "date"
from (
select d::date as "date"
from (
t
right join
generate_series(
(select date_trunc('year', min("date")) from t)::timestamp,
(select max("date") from t),
'1 day'
) s(d) on t."date" = s.d::date
where t."date" is null
) q
) s
You have to fully re-create your table as indexes haves to change.
The better way to do it is to use your prefered dbi language, make a loop ignoring ID and putting values in a new table with new serialized IDs.
for day in (whole needed calendar)
value = select rainfall from oldbrokentable where date = day
insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! Just conceptual to be adapted to your prefered scripting language)

sum every 3 rows of a table

I have the following query to count all data every minute.
$sql= "SELECT COUNT(*) AS count, date_trunc('minute', date) AS momento
FROM p WHERE fk_id_b=$id_b GROUP BY date_trunc('minute', date)
ORDER BY momento ASC";
What I need to do is get the sum of the count for each row with the count of the 2 past minutes.
For example with the result of the $sql query above
|-------date---------|----count----|
|2012-06-21 05:20:00 | 12 |
|2012-06-21 05:21:00 | 14 |
|2012-06-21 05:22:00 | 10 |
|2012-06-21 05:23:00 | 20 |
|2012-06-21 05:24:00 | 25 |
|2012-06-21 05:25:00 | 30 |
|2012-06-21 05:26:00 | 10 |
I want this result:
|-------date---------|----count----|
|2012-06-21 05:20:00 | 12 |
|2012-06-21 05:21:00 | 26 | 12+14
|2012-06-21 05:22:00 | 36 | 12+14+10
|2012-06-21 05:23:00 | 44 | 14+10+20
|2012-06-21 05:24:00 | 55 | 10+20+25
|2012-06-21 05:25:00 | 75 | 20+25+30
|2012-06-21 05:26:00 | 65 | 25+30+10
Here's a more general solution for the sum of values from current and N previous rows (N=2 in your case).
SELECT "date",
sum("count") OVER (order by "date" ROWS BETWEEN 2 preceding AND current row)
FROM t
ORDER BY "date";
You can change N between 0 and "Unbounded". This approach gives you a chance to have a parameter in your app "count of the N past minutes". Also, no need for handling default values if out of bounds.
You can find more on this in PostgreSQL docs (4.2.8. Window Function Calls)
This is not so tricky with lag() window function (also on SQL Fiddle):
CREATE TABLE t ("date" timestamptz, "count" int4);
INSERT INTO t VALUES
('2012-06-21 05:20:00',12),
('2012-06-21 05:21:00',14),
('2012-06-21 05:22:00',10),
('2012-06-21 05:23:00',20),
('2012-06-21 05:24:00',25),
('2012-06-21 05:25:00',30),
('2012-06-21 05:26:00',10);
SELECT *,
"count"
+ coalesce(lag("count", 1) OVER (ORDER BY "date"), 0)
+ coalesce(lag("count", 2) OVER (ORDER BY "date"), 0) AS "total"
FROM t;
I've double-quoted date and count columns, as these are reserved words;
lag(field, distance) gives me the value of the field column distance rows away from the current one, thus first function gives previous row's value and second call gives the value from the one before;
coalesce() is required to avoid NULL result from lag() function (for the first row in your query there's no “previous” one, thus it's NULL), otherwise the total will also be NULL.
#vyegorov's answer covers it mostly. But I have more gripes than fit into a comment.
Don't use reserved words like date and count as identifiers at all. PostgreSQL allows those two particular key words as identifier - other than every SQL standard. But it's still bad practice. The fact that you can use anything inside double-quotes as identifier, even "; DELETE FROM tbl;" does not make it a good idea. The name "date" for a timestamp is misleading on top of that.
Wrong data type. Example displays timestamp, not timestamptz. Does not make a difference here, but still misleading.
You don't need COALESCE(). With the window functions lag() and lead() you can can provide a default value as 3rd parameter:
Building on this setup:
CREATE TABLE tbl (ts timestamp, ct int4);
INSERT INTO tbl VALUES
('2012-06-21 05:20:00', 12)
, ('2012-06-21 05:21:00', 14)
, ('2012-06-21 05:22:00', 10)
, ('2012-06-21 05:23:00', 20)
, ('2012-06-21 05:24:00', 25)
, ('2012-06-21 05:25:00', 30)
, ('2012-06-21 05:26:00', 10);
Query:
SELECT ts, ct + lag(ct, 1, 0) OVER (ORDER BY ts)
+ lag(ct, 2, 0) OVER (ORDER BY ts) AS total
FROM tbl;
Or better yet: use a single sum() as window aggregate function with a custom window frame:
SELECT ts, sum(ct) OVER (ORDER BY ts ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM tbl;
Same result.
Related:
Group by end of period instead of start date

SQL query to convert date ranges to per day records

Requirements
I have data table that saves data in date ranges.
Each record is allowed to overlap previous record(s) (record has a CreatedOn datetime column).
New record can define it's own date range if it needs to hence can overlap several older records.
Each new overlapping record overrides settings of older records that it overlaps.
Result set
What I need to get is get per day data for any date range that uses record overlapping. It should return a record per day with corresponding data for that particular day.
To convert ranges to days I was thinking of numbers/dates table and user defined function (UDF) to get data for each day in the range but I wonder whether there's any other (as in better* or even faster) way of doing this since I'm using the latest SQL Server 2008 R2.
Stored data
Imagine my stored data looks like this
ID | RangeFrom | RangeTo | Starts | Ends | CreatedOn (not providing data)
---|-----------|----------|--------|-------|-----------
1 | 20110101 | 20110331 | 07:00 | 15:00
2 | 20110401 | 20110531 | 08:00 | 16:00
3 | 20110301 | 20110430 | 06:00 | 14:00 <- overrides both partially
Results
If I wanted to get data from 1st January 2011 to 31st May 2001 resulting table should look like the following (omitted obvious rows):
DayDate | Starts | Ends
--------|--------|------
20110101| 07:00 | 15:00 <- defined by record ID = 1
20110102| 07:00 | 15:00 <- defined by record ID = 1
... many rows omitted for obvious reasons
20110301| 06:00 | 14:00 <- defined by record ID = 3
20110302| 06:00 | 14:00 <- defined by record ID = 3
... many rows omitted for obvious reasons
20110501| 08:00 | 16:00 <- defined by record ID = 2
20110502| 08:00 | 16:00 <- defined by record ID = 2
... many rows omitted for obvious reasons
20110531| 08:00 | 16:00 <- defined by record ID = 2
Actually, since you are working with dates, a Calendar table would be more helpful.
Declare #StartDate date
Declare #EndDate date
;With Calendar As
(
Select #StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < #EndDate
)
Select ...
From Calendar
Left Join MyTable
On Calendar.[Date] Between MyTable.Start And MyTable.End
Option ( Maxrecursion 0 );
Addition
Missed the part about the trumping rule in your original post:
Set DateFormat MDY;
Declare #StartDate date = '20110101';
Declare #EndDate date = '20110501';
-- This first CTE is obviously to represent
-- the source table
With SampleData As
(
Select 1 As Id
, Cast('20110101' As date) As RangeFrom
, Cast('20110331' As date) As RangeTo
, Cast('07:00' As time) As Starts
, Cast('15:00' As time) As Ends
, CURRENT_TIMESTAMP As CreatedOn
Union All Select 2, '20110401', '20110531', '08:00', '16:00', DateAdd(s,1,CURRENT_TIMESTAMP )
Union All Select 3, '20110301', '20110430', '06:00', '14:00', DateAdd(s,2,CURRENT_TIMESTAMP )
)
, Calendar As
(
Select #StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < #EndDate
)
, RankedData As
(
Select C.[Date]
, S.Id
, S.RangeFrom, S.RangeTo, S.Starts, S.Ends
, Row_Number() Over( Partition By C.[Date] Order By S.CreatedOn Desc ) As Num
From Calendar As C
Join SampleData As S
On C.[Date] Between S.RangeFrom And S.RangeTo
)
Select [Date], Id, RangeFrom, RangeTo, Starts, Ends
From RankedData
Where Num = 1
Option ( Maxrecursion 0 );
In short, I rank all the sample data preferring the newer rows that overlap the same date.
Why do it all in DB when you can do it better in memory
This is the solution (I eventually used) that seemed most reasonable in terms of data transferred, speed and resources.
get actual range definitions from DB to mid tier (smaller amount of data)
generate in memory calendar of a certain date range (faster than in DB)
put those DB definitions in (much easier and faster than DB)
And that's it. I realised that complicating certain things in DB is not not worth it when you have executable in memory code that can do the same manipulation faster and more efficient.