Increment date within while loop using postgresql on Redshift table - postgresql

MY SITUATION:
I have written a piece of code that returns a dataset containing a web user's aggregated activity for the previous 90 days and returns a score, subsequent to some calculation. Essentially, like RFV.
A (VERY) simplified version of the code can be seen below:
WITH start_data AS (
SELECT user_id
,COUNT(web_visits) AS count_web_visits
,COUNT(button_clicks) AS count_button_clicks
,COUNT(login) AS count_log_in
,SUM(time_on_site) AS total_time_on_site
,CURRENT_DATE AS run_date
FROM web.table
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
AND some_flag = 1
AND some_other_flag = 2
GROUP BY user_id
ORDER BY user_id DESC
)
The output might look something like the below:
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
This output is then stored in its own AWS/Redshift table and will form the base table for the task.
SELECT *
into myschema.base_table
FROM start_data
DESIRED OUTPUT:
What I need to be able to do is iteratively run this code so that, for every day, I append the previous 90 days' aggregation to myschema.base_table.
The way I see it, I can either go forwards or backwards, it doesn't matter.
That is to say, I can either:
Starting from today, run the code, everyday, for the preceding 90 days, going BACK to the (first date in the table + 90 days)
OR
Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, everyday, going FORWARD to today.
Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 412 | 339 |180 | 3600 | 20-01-20 |
| 2391823 | 417 | 6253 |863 | 2400 | 20-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 20-01-20 |
| 5561296 | 281 | 679 |262 | 4200 | 20-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 331 | 204 |83 | 3200 | 21-01-20 |
| 2391823 | 652 | 1222 |409 | 7200 | 21-01-20 |
| 3729128 | 71 | 248 |71 | 720 | 21-01-20 |
| 5561296 | 366 | 722 |519 | 3600 | 21-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 213 | 808 |57 | 3600 | 22-01-20 |
| 2391823 | 817 | 4265 |476 | 1200 | 22-01-20 |
| 3729128 | 33 | 128 |62 | 120 | 22-01-20 |
| 5561296 | 623 | 411 |283 | 2400 | 22-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
WHAT I HAVE TRIED:
I have successfully created a WHILE loop to sequentially increment the date as follows:
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
start_date DATE := '2020-11-20' ;
BEGIN
WHILE CURRENT_DATE > start_date
LOOP
RAISE INFO 'Date: %', start_date;
start_date = start_date + 1;
END LOOP;
RAISE INFO 'Loop Statment Executed Successfully';
END;
$$;
CALL retrospective_data();
Thus producing the dates as follows:
INFO: Date: 2020-11-20
INFO: Date: 2020-11-21
INFO: Date: 2020-11-22
INFO: Date: 2020-11-23
INFO: Date: 2020-11-24
INFO: Date: 2020-11-25
INFO: Date: 2020-11-26
INFO: Loop Statment Executed Successfully
Query 1 OK: CALL
WHAT I NEED HELP WITH:
I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date
But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
Any help appreciated.

It is fairly clear that you come from a procedural programming background, and my first recommendation is to stop thinking in terms of loops. Databases are giant and powerful data filtering machines, and thinking in terms of 'do step 1, then step 2' often leads to missing out on all this power.
You want to look into window functions which allow you to look over ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.
Also you shouldn't cast a date to a string just to compare it to other dates (WHERE clause). This is just extra casting and defeats Redshift's table scan optimizations. Redshift uses block metadata that optimizes what data is needed to be read from disk but this cannot work if the column is being cast to another data type.
Now to your code (an off-the-cuff rewrite, for just the first column). Be aware that GROUP BY clauses run BEFORE window functions, and that I'm assuming not all users have a visit every day. Since Redshift doesn't support RANGE in window functions, you will need to make sure all dates are represented for all user_ids. This is done by UNIONing with a set of rows that covers the date range. You may have a table like this, or may want to create one, but I'll just generate something on the fly to show the process (this assumes the dense date range contains fewer dates than the table has rows: likely, but not ironclad).
SELECT user_id
,COUNT(web_visits) AS count_web_visits_by_day
-- GROUP BY runs first, so the window function sums the per-day counts
,SUM(COUNT(web_visits)) OVER (PARTITION BY user_id ORDER BY visit_date ROWS BETWEEN 90 PRECEDING AND CURRENT ROW) AS count_web_visits
...
,visit_date
FROM (
SELECT visit_date, user_id, web_visits, ...
FROM web.table
WHERE some_flag = 1 AND some_other_flag = 2
UNION ALL -- pad with one NULL row per user_id per date so every date is represented
SELECT d.visit_date, u.user_id, NULL AS web_visits, ...
FROM (SELECT DISTINCT user_id FROM web.table) u
CROSS JOIN (
-- one row per source row, counting back from today; cast so date minus integer applies
SELECT CURRENT_DATE + 1 - CAST(ROW_NUMBER() OVER (ORDER BY visit_date) AS INT) AS visit_date
FROM web.table
) d
) padded
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC ;
The idea here is to set up your data to ensure that you have at least one row for each user_id for each date. Then the window functions can operate on the "grouped by date and user_id" information to sum and count over the past 90 rows (which is the same as the past 90 days). You now have all the information you want for all dates, where each is looking back over 90 days. One query to give you all the information, no while loop, no stored procedures.
Untested but should give you the pattern. You may want to massage the output to give you the range you are looking for and clean up NULL result rows.
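To make the padding idea concrete, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. It is an illustration under assumptions, not Redshift code: the visits table, its sample rows, the recursive dates CTE and the 3-day window (2 PRECEDING) are all made up for the demo, and it needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (user_id INTEGER, visit_date TEXT, web_visits INTEGER);
INSERT INTO visits VALUES
    (1, '2020-01-01', 1), (1, '2020-01-01', 1), (1, '2020-01-04', 1),
    (2, '2020-01-02', 1);
""")

ROLLING_SQL = """
WITH RECURSIVE dates(visit_date) AS (
    -- hypothetical dense calendar covering the demo range
    SELECT '2020-01-01'
    UNION ALL
    SELECT date(visit_date, '+1 day') FROM dates WHERE visit_date < '2020-01-04'
),
padded AS (
    -- real rows plus one NULL row per user per date
    SELECT visit_date, user_id, web_visits FROM visits
    UNION ALL
    SELECT d.visit_date, u.user_id, NULL
    FROM (SELECT DISTINCT user_id FROM visits) AS u CROSS JOIN dates AS d
),
daily AS (
    -- COUNT ignores the NULL padding rows, so empty days count as 0
    SELECT user_id, visit_date, COUNT(web_visits) AS visits_by_day
    FROM padded
    GROUP BY user_id, visit_date
)
SELECT user_id, visit_date,
       SUM(visits_by_day) OVER (
           PARTITION BY user_id ORDER BY visit_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_visits
FROM daily
ORDER BY user_id, visit_date
"""

rows = conn.execute(ROLLING_SQL).fetchall()
for row in rows:
    print(row)
```

Because every user has exactly one row per date after padding, "2 preceding rows" and "2 preceding days" mean the same thing, which is what lets a ROWS frame stand in for a date range.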

Related

Update column with correct daterange using generate_series

I have a column with incorrect dateranges (a day is missing). The code
to generate these dateranges was written by a previous employee and
cannot be found.
The dateranges look like this, notice the missing day:
+-------+--------+-------------------------+
| id | client | date_range |
+-------+--------+-------------------------+
| 12885 | 30 | [2016-01-07,2016-01-13) |
| 12886 | 30 | [2016-01-14,2016-01-20) |
| 12887 | 30 | [2016-01-21,2016-01-27) |
| 12888 | 30 | [2016-01-28,2016-02-03) |
| 12889 | 30 | [2016-02-04,2016-02-10) |
| 12890 | 30 | [2016-02-11,2016-02-17) |
| 12891 | 30 | [2016-02-18,2016-02-24) |
+-------+--------+-------------------------+
And should look like this:
+-------------------------+
| range |
+-------------------------+
| [2016-01-07,2016-01-14) |
| [2016-01-14,2016-01-21) |
| [2016-01-21,2016-01-28) |
| [2016-01-28,2016-02-04) |
| [2016-02-04,2016-02-11) |
| [2016-02-11,2016-02-18) |
| [2016-02-18,2016-02-25) |
| [2016-02-25,2016-03-03) |
+-------------------------+
The code I've written to generate correct dateranges looks like this:
create or replace function generate_date_series(startsOn date, endsOn date, frequency interval)
returns setof date as $$
select (startsOn + (frequency * count))::date
from (
select (row_number() over ()) - 1 as count
from generate_series(startsOn, endsOn, frequency)
) series
$$ language sql immutable;
select DATERANGE(
generate_date_series(
'2016-01-07'::date, '2024-11-07'::date, interval '7days'
)::date,
generate_date_series(
'2016-01-14'::date, '2024-11-13'::date, interval '7days'
)::date
) as range;
However, I'm having trouble trying to update the column with the
correct dateranges. I initially executed this UPDATE query on a test
database I created:
update factored_daterange set date_range = dt.range from (
select daterange(
generate_date_series(
'2016-01-07'::date, '2024-11-07'::date, interval '7days'
)::date,
generate_date_series(
'2016-01-14'::date, '2024-11-14'::date, interval '7days'
)::date ) as range ) dt where client_id=30;
But that is not correct, it simply assigns the first generated
daterange to each row. I want to essentially update the dateranges
row-by-row since there is no other join or condition I can match the
dates up to. Any assistance in this matter is greatly appreciated.
You're working too hard. Just update the upper range value.
update your_table_name
set date_range = daterange(lower(date_range),(upper(date_range) + interval '1 day')::date) ;
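As a rough illustration of what that update does to each half-open range, here is a small Python sketch. The tuple representation and the extend_upper helper are assumptions made for the demo; they are not part of PostgreSQL.

```python
from datetime import date, timedelta

def extend_upper(rng):
    """Return the half-open range [lower, upper) with upper moved forward one day."""
    lower, upper = rng
    return (lower, upper + timedelta(days=1))

# First two ranges from the question, as (lower, upper) pairs
ranges = [
    (date(2016, 1, 7), date(2016, 1, 13)),
    (date(2016, 1, 14), date(2016, 1, 20)),
]
fixed = [extend_upper(r) for r in ranges]
print(fixed)
```

After the shift, each range's upper bound equals the next range's lower bound, so the weeks become contiguous with no missing day.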

asof (aj) join strictly less than in KDB/Q

I have a quote table and trade table, and would like to list the quotes table and join in the trades table matching on timestamps strictly less than the timestamp of the trade.
For example:
q:([]time:10:00:00 10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm`ibm;qty:100 200 300 400)
t:([]time:10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm;px:10 20 25)
aj[`time;q;t]
returns
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 200 | 20 |
| 10:01:00 | ibm | 300 | 20 |
| 10:01:02 | ibm | 400 | 25 |
+------------+-----+-----+----+
But I'm trying to get a result like:
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 100 | 10 |
| 10:01:00 | ibm | 100 | 20 |
| 10:01:02 | ibm | 300 | 25 |
+------------+-----+-----+----+
Is there a join function that can match on timestamps that are strictly less than the time, instead of up to and including it?
I think if you do some variation of aj[`time;q;t] then you won't be able to modify the qty column as table t does not contain it. Instead you may need to use the more "traditional" aj[`time;t;q]:
q)@[;`time;+;00:00:01]aj[`time;@[t;`time;-;00:00:01];q]
time sym px qty
-------------------
10:01:00 ibm 10 100
10:01:00 ibm 20 100
10:01:02 ibm 25 300
This shifts the times to avoid matching where they are equal but does not contain a row for each quote you had in the beginning.
I think if you wish to join trades to quotes rather than quotes to trades as I have done you may need to think of some method of differentiating between 2 trades that occur at the same time as in your example. One method to do this may be to use the order they arrive, i.e. match first quote to first trade.
One "hacky" way I can think of is to shift all trades by the minimum time unit, do the aj, and then shift back.
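The "strictly less than" matching rule itself can be sketched outside q. The Python below is only an illustration of the logic, not a kdb+ API: it assumes the quotes are sorted by time and uses bisect_left, which returns the position before any equal timestamps, so the matched quote is always strictly earlier than the trade.

```python
from bisect import bisect_left
from datetime import time

# Quotes from the question, sorted by time
quote_times = [time(10, 0, 0), time(10, 1, 0), time(10, 1, 0), time(10, 1, 2)]
quote_qty = [100, 200, 300, 400]
trades = [(time(10, 1, 0), 10), (time(10, 1, 0), 20), (time(10, 1, 2), 25)]

def last_qty_before(trade_time):
    """Quantity of the last quote strictly before trade_time, or None if there is none."""
    i = bisect_left(quote_times, trade_time)
    return quote_qty[i - 1] if i > 0 else None

joined = [(t, px, last_qty_before(t)) for t, px in trades]
for row in joined:
    print(row)
```

This reproduces the qty values 100, 100 and 300 from the shifted aj output above; swapping bisect_left for bisect_right would give the usual "up to and including" behaviour.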

postgres lag when data is missing

I have data on baseball players annual salaries, with some years missing. What I would like to do is calculate the min, max, average change in salary from the prior year for all players in a year.
For example data looks like below from the table 'salaries':
| playerid | yearid | salary |
| a | 2016 | 10000 |
| b | 2016 | 5000 |
| a | 2015 | 9000 |
| b | 2015 | 3000 |
| a | 2014 | 3000 |
| b | 2014 | 15000 |
| a | 2010 | 1000 |
As you can see, player A has yearly changes of 1k and 6k, and player B has yearly changes of 2k and -12k. So I would like a select statement that returns:
| yearid | min change | max change | avg change |
| 2016 | 1k | 2k | 1.5k |
| 2015 | -12k | 6k | -3k |
Is there a way to do this?
My lag function has unfortunately captured the difference between 2014 and 2010 for playerid a and that is obviously wrong. I couldn't figure out how to use the lag function only if the previous row's yearid was 1 less than the current rows yearid.
Any suggestions would be greatly appreciated.
Just use the previous year for the filtering:
select yearid, min(salary - prev_salary), max(salary - prev_salary),
avg(salary - prev_salary)
from (select s.*,
lag(s.salary) over (partition by s.playerid order by s.yearid) as prev_salary,
lag(s.yearid) over (partition by s.playerid order by s.yearid) as prev_yearid
from salaries s
) s
where prev_yearid = yearid - 1
group by yearid;
Or, you can just use a join:
select s.yearid, . . .
from salaries s join
salaries sp
on sp.playerid = s.playerid and sp.yearid = s.yearid - 1
group by s.yearid;
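For a quick check of the join approach, here is a sketch that loads the sample rows into an in-memory SQLite database from Python. The select list (min, max and avg of the salary difference) is my assumption about what belongs in the elided part, mirroring the lag version; SQLite stands in for PostgreSQL purely for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (playerid TEXT, yearid INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [("a", 2016, 10000), ("b", 2016, 5000),
     ("a", 2015, 9000), ("b", 2015, 3000),
     ("a", 2014, 3000), ("b", 2014, 15000),
     ("a", 2010, 1000)],
)

CHANGE_SQL = """
SELECT s.yearid,
       MIN(s.salary - sp.salary) AS min_change,
       MAX(s.salary - sp.salary) AS max_change,
       AVG(s.salary - sp.salary) AS avg_change
FROM salaries s
JOIN salaries sp
  ON sp.playerid = s.playerid AND sp.yearid = s.yearid - 1  -- only the immediately prior year
GROUP BY s.yearid
ORDER BY s.yearid
"""

changes = conn.execute(CHANGE_SQL).fetchall()
for row in changes:
    print(row)
```

The inner join drops player a's 2014 row because there is no 2013 row to pair it with, so the 2010 to 2014 gap never enters the aggregates.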

Order by created_date if less than 1 month old, else sort by updated_date

SQL Fiddle: http://sqlfiddle.com/#!15/1da00/5
I have a table that looks something like this:
products
+-----------+-------+--------------+--------------+
| name | price | created_date | updated_date |
+-----------+-------+--------------+--------------+
| chair | 50 | 10/12/2016 | 1/4/2017 |
| desk | 100 | 11/4/2016 | 12/27/2016 |
| TV | 500 | 12/1/2016 | 1/2/2017 |
| computer | 1000 | 12/28/2016 | 1/1/2017 |
| microwave | 100 | 1/3/2017 | 1/4/2017 |
| toaster | 20 | 1/9/2017 | 1/9/2017 |
+-----------+-------+--------------+--------------+
I want to order this table so that products created less than 30 days ago show first (ordered by the updated date). Products created 30 or more days ago should show after them (ordered by updated date within that group).
This is what the result should look like:
products - desired results
+-----------+-------+--------------+--------------+
| name | price | created_date | updated_date |
+-----------+-------+--------------+--------------+
| toaster | 20 | 1/9/2017 | 1/9/2017 |
| microwave | 100 | 1/3/2017 | 1/4/2017 |
| computer | 1000 | 12/28/2016 | 1/1/2017 |
| chair | 50 | 10/12/2016 | 1/4/2017 |
| TV | 500 | 12/1/2016 | 1/2/2017 |
| desk | 100 | 11/4/2016 | 12/27/2016 |
+-----------+-------+--------------+--------------+
I've started writing this query:
SELECT *,
CASE
WHEN created_date > NOW() - INTERVAL '30 days' THEN 0
ELSE 1
END AS order_index
FROM products
ORDER BY order_index, created_date DESC
but that only brings the rows with created_date less than 30 days old to the top, and then orders by created_date. I also want to sort the rows where order_index = 1 by updated_date.
Unfortunately, in version 9.3 an output column name such as order_index can be used in ORDER BY only on its own, not inside an expression; expressions there can reference only input (table) columns. So order_index is not available to CASE at all, and its position is not well defined because it comes after * in the column list.
This will work.
order by
created_date <= ( current_date - 30 ) , case
when created_date > ( current_date - 30 ) then created_date
else updated_date end desc
Alternatively a common table expression can be used to wrap the result and then that can be ordered by any column.
WITH q AS(
SELECT *,
CASE
WHEN created_date > NOW() - INTERVAL '30 days' THEN 0
ELSE 1
END AS order_index
FROM products
)
SELECT * FROM q
ORDER BY
order_index ,
CASE order_index
WHEN 0 THEN created_date
WHEN 1 THEN updated_date
END DESC;
A third approach is to exploit nulls.
order by
case
when created_date > ( current_date - 30 ) then created_date
end desc nulls last,
updated_date desc;
This approach can be useful when the ordering columns are of different types.
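To see the first ORDER BY variant produce the desired ordering, here is a sketch using an in-memory SQLite database from Python. It assumes "today" is 2017-01-10 (so the 30-day cutoff is 2016-12-11) and stores dates as ISO strings; both are choices made for the demo, and SQLite only stands in for PostgreSQL here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, created_date TEXT, updated_date TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("chair", "2016-10-12", "2017-01-04"),
     ("desk", "2016-11-04", "2016-12-27"),
     ("TV", "2016-12-01", "2017-01-02"),
     ("computer", "2016-12-28", "2017-01-01"),
     ("microwave", "2017-01-03", "2017-01-04"),
     ("toaster", "2017-01-09", "2017-01-09")],
)

CUTOFF = "2016-12-11"  # assumed current date 2017-01-10 minus 30 days

ORDER_SQL = """
SELECT name FROM products
ORDER BY created_date <= :cutoff,              -- false (recent) sorts before true (older)
         CASE WHEN created_date > :cutoff THEN created_date
              ELSE updated_date END DESC
"""

names = [row[0] for row in conn.execute(ORDER_SQL, {"cutoff": CUTOFF})]
print(names)
```

The boolean expression splits the rows into the recent and older groups, and the CASE picks which date column orders each group.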

query count of rows where id is less than a series of values in Redshift

I have a table etl_control which stores the latest_id of the x_data table every day. Now I have a requirement to get the number of rows for each day.
My idea is to run a query that, for each day, counts the rows matching the condition x_data.id <= etl_control.latest_id.
The table structures are as follows.
etl_control:
record_date | latest_id |
---------------------------------
2016-11-01 | 55 |
2016-11-02 | 125 |
2016-11-03 | 154 |
2016-11-04 | 190 |
2016-11-05 | 201 |
2016-11-06 | 225 |
2016-11-07 | 287 |
x_data:
id | value |
---------------------------------
10 | xyz |
11 | xyz |
21 | xyz |
55 | xyz |
101 | xyz |
108 | xyz |
125 | xyz |
142 | xyz |
154 | xyz |
160 | xyz |
166 | xyz |
178 | xyz |
190 | xyz |
191 | xyz |
The end result should have the number of rows in x_data for each day. I tried a number of variations using JOIN, WITH and COUNT(*) OVER. But the biggest hurdle is to iteratively compare x_data.id with etl_control.latest_id.
Really sorry folks. Got the answer myself after posting the question.
The query is really simple.
WITH data AS (
SELECT e.latest_id
FROM x_data AS x, etl_control AS e
WHERE x.id <= e.latest_id)
SELECT latest_id, count(*) FROM data GROUP BY latest_id;
This basically creates a temp table with latest_id repeated for each row. The latest_id is always greater than or equal to the id from x_data.
A simple group by on this temp table would give the expected result.
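Here is a sketch of the same idea against the sample data, run in an in-memory SQLite database from Python. It uses an explicit JOIN ... ON instead of the comma join and also returns record_date; those are small variations of mine, and SQLite is only a stand-in for Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_control (record_date TEXT, latest_id INTEGER)")
conn.execute("CREATE TABLE x_data (id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO etl_control VALUES (?, ?)",
    [("2016-11-01", 55), ("2016-11-02", 125), ("2016-11-03", 154),
     ("2016-11-04", 190), ("2016-11-05", 201), ("2016-11-06", 225),
     ("2016-11-07", 287)],
)
conn.executemany(
    "INSERT INTO x_data VALUES (?, 'xyz')",
    [(i,) for i in (10, 11, 21, 55, 101, 108, 125, 142, 154, 160, 166, 178, 190, 191)],
)

COUNT_SQL = """
SELECT e.record_date, e.latest_id, COUNT(*) AS row_count
FROM etl_control AS e
JOIN x_data AS x ON x.id <= e.latest_id   -- every x_data row at or below that day's latest_id
GROUP BY e.record_date, e.latest_id
ORDER BY e.record_date
"""

counts = conn.execute(COUNT_SQL).fetchall()
for row in counts:
    print(row)
```

Note that the inequality join pairs every control row with every qualifying data row, so on large tables the intermediate result can grow quickly.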