How can 'brand new, never before seen' IDs be counted per month in redshift? - amazon-redshift

A fair amount of material is available detailing methods utilising dense_rank() and the like to count distinct somethings per month. However, I've been unable to find anything that counts distinct values per month while also removing/discounting any IDs that have already been seen in prior month groups.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to produce a new table with one aggregated month per row and a count of how many new IDs occur within that month, i.e. IDs that have not been seen at all before.
The real case allows devices to be seen more than once in a month, but this shouldn't impact the count. The IDs are stored as integers (both positive and negative), and the time periods are true timestamps with second precision. The data set is also of significant size.
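For reference, the example data can be reproduced with something like the following (using the my_table name and observed_time column that appear in my attempt below):

CREATE TABLE my_table (
    id            int8,
    observed_time timestamp
);

-- sample rows from the example above
INSERT INTO my_table (id, observed_time) VALUES
    (1, '2017-01-01'), (2, '2017-01-02'), (1, '2017-01-02'),
    (1, '2017-02-02'), (2, '2017-02-03'), (3, '2017-02-04'),
    (1, '2017-03-01'), (3, '2017-03-01'), (4, '2017-03-01'), (5, '2017-03-02');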
My initial attempt is along the lines of:
WITH records_months AS (
    SELECT *,
           date_trunc('month', observed_time) AS month_group
    FROM my_table
    WHERE observed_time > '2017-01-01'),
id_months AS (
    SELECT DISTINCT
           month_group,
           id
    FROM records_months
    GROUP BY month_group, id)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new IDs that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which one, or how to apply it.

First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
       -- Null out the id if the observed_month that we're grouping by
       -- is NOT the earliest month that the id was seen.
       -- Then count distinct id
       count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
    select t.id,
           date_trunc('month', t.observed_time) as observed_month,
           earliest.earliest_month
    from my_table t
    join (
        -- What's the earliest month an id was seen?
        select id,
               date_trunc('month', min(observed_time)) as earliest_month
        from my_table
        group by 1
    ) earliest
      on t.id = earliest.id
) monthly_ids
group by 1
order by 1;
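If you do want to try the window-function route instead of the join, here's a rough sketch along the same lines (min() over a partition by id to find each id's first month, against the same my_table; I haven't benchmarked it against the join version):

-- For each row, work out the first month its id was ever seen,
-- then only count an id in that first month.
with first_seen as (
    select id,
           date_trunc('month', observed_time) as month_group,
           min(date_trunc('month', observed_time))
               over (partition by id) as earliest_month
    from my_table
)
select month_group as observed_month,
       count(distinct case when month_group = earliest_month then id end) as num_new_ids
from first_seen
group by 1
order by 1;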

Related

Postgresql : Average over a limit of Date with group by

I have a table like this
item_id date number
1 2000-01-01 100
1 2003-03-08 50
1 2004-04-21 10
1 2004-12-11 10
1 2010-03-03 10
2 2000-06-29 1
2 2002-05-22 2
2 2002-07-06 3
2 2008-10-20 4
I'm trying to get the average for each unique item_id over its last 3 dates.
It's difficult because there are missing dates in between, so a range of hardcoded dates doesn't always work.
I expect a result like :
item_id MyAverage
1 10
2 3
I don't really know how to do this. Currently I manage to do it for one item, but I have trouble extending it to multiple items:
SELECT AVG(MyAverage.number) FROM (
SELECT date,number
FROM item_list
where item_id = 1
ORDER BY date DESC limit 3
) as MyAverage;
My main problem is with generalising the "DESC limit 3" over a group by id.
attempt :
SELECT item_id,AVG(MyAverage.number)
FROM (
SELECT item_id,date,number
FROM item_list
ORDER BY date DESC limit 3) as MyAverage
GROUP BY item_id;
The limit is messing things up there.
I have made it "work" using BETWEEN date AND date, but it's not working as I want because I need a limit, not a hardcoded date.
Can anybody help?
You can use row_number() to assign 1 to 3 to the records with the latest dates for each ID and then filter for that.
SELECT x.item_id,
avg(x.number)
FROM (SELECT il.item_id,
il.number,
row_number() OVER (PARTITION BY il.item_id
ORDER BY il.date DESC) rn
FROM item_list il) x
WHERE x.rn BETWEEN 1 AND 3
GROUP BY x.item_id;
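If you're on PostgreSQL 9.3 or later, another option is to generalise your single-item query directly with a LATERAL join (a sketch, assuming the same item_list table):

SELECT i.item_id,
       AVG(last3.number) AS MyAverage
FROM (SELECT DISTINCT item_id FROM item_list) i
CROSS JOIN LATERAL (
    -- your original per-item query, re-run for each item_id
    SELECT il.number
    FROM item_list il
    WHERE il.item_id = i.item_id
    ORDER BY il.date DESC
    LIMIT 3
) last3
GROUP BY i.item_id
ORDER BY i.item_id;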

SQL - how to sum groups of 15 rows and find the max sum

The purpose of this question is to optimize some SQL by using set-based operations vs iterative (looping, like I'm doing below):
Some Explanation -
I have a CTE whose result is inserted into a temp table #dataForPeak. Each row represents a minute, with a respective value retrieved.
For every row, my code uses a while loop to add 15 rows at a time (the current row + the next 14 rows). These sums are inserted into another temp table #PeakDemandIntervals, which is my workaround for then finding the max sum of these groups of 15.
Finding that max sum is my end goal. My code achieves it, but takes about 12 seconds for 26k rows. I'll be looking at much more data, so I know this is not fast enough for my use case.
My question is,
can anyone help me find a fast alternative to this loop?
It can include more tables, CTEs, nested queries, whatever. The while loop might not even be the issue; it's probably the inner code.
insert into #dataForPeak
select timestamp, value
from cte
order by timestamp;
while @@ROWCOUNT <> 0
begin
    declare @timestamp datetime = (select top 1 timestamp from #dataForPeak);

    insert into #PeakDemandIntervals
    select @timestamp, sum(interval.value) as peak
    from (select *
          from #dataForPeak base
          where base.timestamp >= @timestamp
            and base.timestamp <= DATEADD(minute, 14, @timestamp)
         ) interval;

    delete from #dataForPeak where timestamp = @timestamp;
end
select max(peak)
from #PeakDemandIntervals;
Edit
Here's an example of my goal, using groups of 3min instead of 15min.
Given the data:
Time | Value
1:50 | 2
1:51 | 4
1:52 | 6
1:53 | 8
1:54 | 6
1:55 | 4
1:56 | 2
the max sum (peak) I'm looking for is 20, because the group
1:52 | 6
1:53 | 8
1:54 | 6
has the highest sum.
Let me know if I need to clarify more than that.
Based on the example given, it seems like you are trying to get the maximum value of a rolling sum. You can calculate the 15-minute rolling sum very easily as follows:
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
Note the key here is the ROWS 14 PRECEDING clause. It effectively states that SQL Server should sum the preceding 14 records together with the current record, which gives you your 15-minute interval.
Now you can simply max the result of the rolling sum. The full query will look as follow:
;WITH CTE_RollingSum
AS
(
SELECT [Time]
,[Value]
,SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 14 PRECEDING) [RollingSum]
FROM #dataForPeak
)
SELECT MAX([RollingSum]) AS Peak
FROM CTE_RollingSum
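To sanity-check against your 3-minute example, the same pattern with ROWS 2 PRECEDING over those seven rows gives rolling sums 2, 6, 12, 18, 20, 18, 12, so MAX returns the expected peak of 20 (the dates below are made up, since the example only lists times):

;WITH SampleData AS
(
    SELECT CAST('2017-01-01 01:50' AS datetime) AS [Time], 2 AS [Value]
    UNION ALL SELECT '2017-01-01 01:51', 4
    UNION ALL SELECT '2017-01-01 01:52', 6
    UNION ALL SELECT '2017-01-01 01:53', 8
    UNION ALL SELECT '2017-01-01 01:54', 6
    UNION ALL SELECT '2017-01-01 01:55', 4
    UNION ALL SELECT '2017-01-01 01:56', 2
)
SELECT MAX(x.RollingSum) AS Peak   -- returns 20
FROM (
    SELECT SUM([Value]) OVER (ORDER BY [Time] ASC ROWS 2 PRECEDING) AS RollingSum
    FROM SampleData
) x;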

Postgres aggregate sum conditional on row comparison

So, I have data that looks something like this
User_Object | filesize | created_date | deleted_date
row 1 | 40 | May 10 | Aug 20
row 2 | 10 | June 3 | Null
row 3 | 20 | Nov 8 | Null
I'm building statistics to record user data usage, to be graphed as time-based datapoints. However, I'm having difficulty developing a query that, for each row, takes the sum of all rows before it, but only the rows that still existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:
SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
However, I want to alter this somehow so that there's a conditional for the window function: only sum the rows created before this User_Object whose deleted date is not also before this User_Object's creation.
This incorrect syntax illustrates what I want to do:
SELECT User_Object.id, User_Object.created,
SUM(CASE WHEN NOT window_function_row.deleted
OR window_function_row.deleted > User_Object.created
THEN filesize ELSE 0)
OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
When this query runs on the data that I have, it should output something like:
id | created | sum_data_used|
1 | May 10 | 40
2 | June 3 | 50
3 | Nov 8 | 30
Something along these lines may work for you:
SELECT a.user_id
,MIN(a.created_date) AS created_date
,SUM(b.filesize) AS sum_data_used
FROM user_object a
JOIN user_object b ON (b.user_id <= a.user_id
AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
GROUP BY a.user_id
ORDER BY a.user_id
For each row, self-join, matching rows with a lower or equal id and an overlapping date range. It will be expensive, because each row needs to look through the entire table to calculate its filesize result; there is no cumulative operation taking place here. But I'm not sure there is a way around that.
Example table definition:
create table user_object(user_id int, filesize int, created_date date, deleted_date date);
Data:
1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>
Result:
1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30
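A sketch of an equivalent way to write it with a correlated subquery keyed on created_date rather than user_id (same user_object table as above, and the same quadratic cost, since every row still scans the whole table):

-- For each row, sum the filesize of every row created on or before it
-- that had not yet been deleted at that point in time.
SELECT a.user_id,
       a.created_date,
       (SELECT COALESCE(SUM(b.filesize), 0)
        FROM user_object b
        WHERE b.created_date <= a.created_date
          AND (b.deleted_date IS NULL OR b.deleted_date >= a.created_date)
       ) AS sum_data_used
FROM user_object a
ORDER BY a.created_date;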

SELECT record based upon dates

Assuming data such as the following:
ID EffDate Rate
1 12/12/2011 100
1 01/01/2012 110
1 02/01/2012 120
2 01/01/2012 40
2 02/01/2012 50
3 01/01/2012 25
3 03/01/2012 30
3 05/01/2012 35
How would I find the rate for ID 2 as of 1/15/2012?
Or, the rate for ID 1 for 1/15/2012?
In other words, how do I write a query that finds the correct rate when the date falls between the EffDates of two records? (The rate should come from the record dated prior to the selected date.)
Thanks,
John
How about this:
SELECT Rate
FROM Table1
WHERE ID = 1 AND EffDate = (
    SELECT MAX(EffDate)
    FROM Table1
    WHERE ID = 1 AND EffDate <= '2012-01-15');
Here's an SQL Fiddle to play with. I assume here that the 'ID/EffDate' pair is unique across the whole table (at least the opposite doesn't make sense).
SELECT TOP 1 Rate FROM the_table
WHERE ID=whatever AND EffDate <='whatever'
ORDER BY EffDate DESC
if I read you right.
(edited to suit my idea of ms-sql which I have no idea about).
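For example, plugged into the question's first case (the rate for ID 2 as of 1/15/2012), and borrowing the Table1 name from the other answer:

-- Only the 01/01/2012 row for ID 2 is on or before 2012-01-15, so this returns 40
SELECT TOP 1 Rate
FROM Table1
WHERE ID = 2 AND EffDate <= '2012-01-15'
ORDER BY EffDate DESC;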

SQL query to convert date ranges to per day records

Requirements
I have a data table that saves data in date ranges.
Each record is allowed to overlap previous record(s) (each record has a CreatedOn datetime column).
A new record can define its own date range if it needs to, and hence can overlap several older records.
Each new overlapping record overrides the settings of the older records that it overlaps.
Result set
What I need to get is per-day data for any date range, applying the record overlapping above. It should return a record per day with the corresponding data for that particular day.
To convert ranges to days I was thinking of a numbers/dates table and a user defined function (UDF) to get data for each day in the range, but I wonder whether there's any other (as in better, or even faster) way of doing this, since I'm using the latest SQL Server 2008 R2.
Stored data
Imagine my stored data looks like this
ID | RangeFrom | RangeTo | Starts | Ends | CreatedOn (not providing data)
---|-----------|----------|--------|-------|-----------
1 | 20110101 | 20110331 | 07:00 | 15:00
2 | 20110401 | 20110531 | 08:00 | 16:00
3 | 20110301 | 20110430 | 06:00 | 14:00 <- overrides both partially
Results
If I wanted to get data from 1st January 2011 to 31st May 2011, the resulting table should look like the following (obvious rows omitted):
DayDate | Starts | Ends
--------|--------|------
20110101| 07:00 | 15:00 <- defined by record ID = 1
20110102| 07:00 | 15:00 <- defined by record ID = 1
... many rows omitted for obvious reasons
20110301| 06:00 | 14:00 <- defined by record ID = 3
20110302| 06:00 | 14:00 <- defined by record ID = 3
... many rows omitted for obvious reasons
20110501| 08:00 | 16:00 <- defined by record ID = 2
20110502| 08:00 | 16:00 <- defined by record ID = 2
... many rows omitted for obvious reasons
20110531| 08:00 | 16:00 <- defined by record ID = 2
Actually, since you are working with dates, a Calendar table would be more helpful.
Declare @StartDate date
Declare @EndDate date

;With Calendar As
(
    Select @StartDate As [Date]
    Union All
    Select DateAdd(d, 1, [Date])
    From Calendar
    Where [Date] < @EndDate
)
Select ...
From Calendar
    Left Join MyTable
        On Calendar.[Date] Between MyTable.Start And MyTable.End
Option ( Maxrecursion 0 );
Addition
Missed the part about the trumping rule in your original post:
Set DateFormat MDY;
Declare @StartDate date = '20110101';
Declare @EndDate date = '20110501';
-- This first CTE is obviously to represent
-- the source table
With SampleData As
(
Select 1 As Id
, Cast('20110101' As date) As RangeFrom
, Cast('20110331' As date) As RangeTo
, Cast('07:00' As time) As Starts
, Cast('15:00' As time) As Ends
, CURRENT_TIMESTAMP As CreatedOn
Union All Select 2, '20110401', '20110531', '08:00', '16:00', DateAdd(s,1,CURRENT_TIMESTAMP )
Union All Select 3, '20110301', '20110430', '06:00', '14:00', DateAdd(s,2,CURRENT_TIMESTAMP )
)
, Calendar As
(
Select @StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < @EndDate
)
, RankedData As
(
Select C.[Date]
, S.Id
, S.RangeFrom, S.RangeTo, S.Starts, S.Ends
, Row_Number() Over( Partition By C.[Date] Order By S.CreatedOn Desc ) As Num
From Calendar As C
Join SampleData As S
On C.[Date] Between S.RangeFrom And S.RangeTo
)
Select [Date], Id, RangeFrom, RangeTo, Starts, Ends
From RankedData
Where Num = 1
Option ( Maxrecursion 0 );
In short, I rank all the sample data preferring the newer rows that overlap the same date.
Why do it all in the DB when you can do it better in memory
This is the solution I eventually used, and it seemed the most reasonable in terms of data transferred, speed and resources:
get the actual range definitions from the DB to the mid tier (a smaller amount of data)
generate an in-memory calendar for the required date range (faster than in the DB)
apply those DB definitions to it (much easier and faster than doing it in the DB)
And that's it. I realised that complicating certain things in the DB is not worth it when you have executable in-memory code that can do the same manipulation faster and more efficiently.