Historical aggregation of a column up to a specified time in each row of another table - PostgreSQL

I have two tables, login_attempts and checkouts, in Amazon Redshift. A user can have multiple (un)successful login attempts and multiple (un)successful checkouts, as shown in this example:
login_attempts
login_id | user_id |        login        | success
---------+---------+---------------------+---------
       1 |       1 | 2021-07-01 14:00:00 |       0
       2 |       1 | 2021-07-01 16:00:00 |       1
       3 |       2 | 2021-07-02 05:01:01 |       1
       4 |       1 | 2021-07-04 03:25:34 |       0
       5 |       2 | 2021-07-05 11:20:50 |       0
       6 |       2 | 2021-07-07 12:34:56 |       1
and
checkouts
checkout_id |    checkout_time    | user_id | success
------------+---------------------+---------+---------
          1 | 2021-07-01 18:00:00 |       1 |       0
          2 | 2021-07-02 06:54:32 |       2 |       1
          3 | 2021-07-04 13:00:01 |       1 |       1
          4 | 2021-07-08 09:05:00 |       2 |       1
Given this information, how can I get the following table with historical performance included for each checkout AS OF THAT TIME?
checkout_id |    checkout_time    | user_id |    lastGoodLogin    |   lastFailedLogin   |  lastGoodCheckout   | lastFailedCheckout
------------+---------------------+---------+---------------------+---------------------+---------------------+---------------------
          1 | 2021-07-01 18:00:00 |       1 | 2021-07-01 16:00:00 | 2021-07-01 14:00:00 | NULL                | NULL
          2 | 2021-07-02 06:54:32 |       2 | 2021-07-02 05:01:01 | NULL                | NULL                | NULL
          3 | 2021-07-04 13:00:01 |       1 | 2021-07-01 16:00:00 | 2021-07-04 03:25:34 | NULL                | 2021-07-01 18:00:00
          4 | 2021-07-08 09:05:00 |       2 | 2021-07-07 12:34:56 | 2021-07-05 11:20:50 | 2021-07-02 06:54:32 | NULL
Update: I was able to get lastFailedCheckout & lastGoodCheckout because that only requires window operations on the same table (checkouts), but I am failing to understand how best to join it with the login_attempts table to get the last[Good|Failed]Login fields. (sqlfiddle)
P.S.: I am open to PostgreSQL suggestions as well.
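The fiddle isn't shown, but based on the update, the checkout-only columns were presumably computed with something like this self-join sketch (column names taken from the tables above; this is a guess at the shape, and the answer below points out a better way):
select c.checkout_id, c.checkout_time, c.user_id,
       max(case when p.success = 1 then p.checkout_time end) as lastGoodCheckout,
       max(case when p.success = 0 then p.checkout_time end) as lastFailedCheckout
from checkouts c
left join checkouts p
  on p.user_id = c.user_id
 and p.checkout_time < c.checkout_time
group by c.checkout_id, c.checkout_time, c.user_id
order by c.checkout_id;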

Good start! A couple of things in your SQL: 1) You should really try to avoid inequality joins, as these can lead to data explosions and aren't needed in this case; just put a CASE expression inside your window function so it only considers the type of checkout (or login) you want. 2) You can use the frame clause to avoid including the current row when finding previous checkouts.
Once you have this pattern you can use it to find the other two columns you are looking for. The first step is to UNION ALL the tables together, not JOIN them. This means adding a few extra columns so the two row shapes can live together, but that is easy. Now you have the user id and the time each event happened in the same data set, and you just need two more window functions to pull out the login information. Lastly, strip out the non-checkout rows with an outer SELECT and a WHERE clause.
Like this:
create table login_attempts (
  loginid smallint,
  userid smallint,
  login timestamp,
  success smallint
);
create table checkouts (
  checkoutid smallint,
  userid smallint,
  checkout_time timestamp,
  success smallint
);
insert into login_attempts values
(1, 1, '2021-07-01 14:00:00', 0),
(2, 1, '2021-07-01 16:00:00', 1),
(3, 2, '2021-07-02 05:01:01', 1),
(4, 1, '2021-07-04 03:25:34', 0),
(5, 2, '2021-07-05 11:20:50', 0),
(6, 2, '2021-07-07 12:34:56', 1)
;
insert into checkouts values
(1, 1, '2021-07-01 18:00:00', 0),
(2, 2, '2021-07-02 06:54:32', 1),
(3, 1, '2021-07-04 13:00:01', 1),
(4, 2, '2021-07-08 09:05:00', 1)
;
SQL:
select *
from (
  select
    c.checkoutid,
    c.userid,
    c.checkout_time,
    max(case success when 0 then checkout_time end) over (
      partition by userid
      order by event_time
      rows between unbounded preceding and 1 preceding
    ) as lastFailedCheckout,
    max(case success when 1 then checkout_time end) over (
      partition by userid
      order by event_time
      rows between unbounded preceding and 1 preceding
    ) as lastGoodCheckout,
    max(case lsuccess when 0 then login end) over (
      partition by userid
      order by event_time
      rows between unbounded preceding and 1 preceding
    ) as lastFailedLogin,
    max(case lsuccess when 1 then login end) over (
      partition by userid
      order by event_time
      rows between unbounded preceding and 1 preceding
    ) as lastGoodLogin
  from (
    select checkout_time as event_time, checkoutid, userid,
           checkout_time, success,
           NULL as login, NULL as lsuccess
    from checkouts
    UNION ALL
    select login as event_time, NULL as checkoutid, userid,
           NULL as checkout_time, NULL as success,
           login, success as lsuccess
    from login_attempts
  ) c
) o
where o.checkoutid is not null
order by o.checkoutid;
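Run against the sample data above, the outer WHERE strips out the login rows, and the four checkout rows come back with last[Good|Failed]Login and last[Good|Failed]Checkout matching the desired table in the question, NULLs included.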


PostgreSQL unique index preventing overlapping

My table permission looks like:
id serial,
person_id integer,
permission_id integer,
valid_from date,
valid_to date
I'd like to prevent creating permissions whose [valid_from, valid_to] ranges overlap for the same person and permission, e.g.:
1 | 1 | 1 | 2010-10-01 | 2999-12-31
2 | 1 | 2 | 2010-10-01 | 2020-12-31
3 | 2 | 1 | 2015-10-01 | 2999-12-31
these can be added:
4 | 1 | 3 | 2011-10-01 | 2999-12-31 - because there is no such permission yet
5 | 2 | 1 | 2011-10-10 | 2999-12-31 - because there is no such person
6 | 1 | 2 | 2021-01-01 | 2999-12-31 - because it doesn't overlap id:2
but these can't:
7 | 1 | 1 | 2009-10-01 | 2010-02-01 - because it overlaps id:1
8 | 1 | 2 | 2019-01-01 | 2022-12-31 - because it overlaps id:2
9 | 2 | 1 | 2010-01-01 | 2016-12-31 - because it overlaps id:3
I can do the checking outside the database, but I wonder if it's possible to do it in the database itself.
A unique constraint is based on an equality operator and cannot be used in this case, but you can use an exclusion constraint instead. The constraint below combines the btree equality operator = (on person_id and permission_id) with the range overlap operator &&; to use btree operators in a GiST index you have to install the btree_gist extension.
create extension if not exists btree_gist;
create table permission (
  id serial,
  person_id integer,
  permission_id integer,
  valid_from date,
  valid_to date,
  exclude using gist (
    person_id with =,
    permission_id with =,
    daterange(valid_from, valid_to) with &&)
);
These inserts are successful:
insert into permission values
(1, 1, 1, '2010-10-01', '2999-12-31'),
(2, 1, 2, '2010-10-01', '2020-12-31'),
(3, 2, 1, '2015-10-01', '2999-12-31'),
(4, 1, 3, '2011-10-01', '2999-12-31'),
(5, 3, 1, '2011-10-10', '2999-12-31'), -- you meant person_id = 3 I suppose
(6, 1, 2, '2021-01-01', '2999-12-31'),
(7, 1, 1, '2009-10-01', '2010-02-01'); -- ranges do not overlap!
but this one is not:
insert into permission values
(8, 1, 2, '2019-01-01', '2022-12-31');
ERROR: conflicting key value violates exclusion constraint "permission_person_id_permission_id_daterange_excl"
DETAIL: Key (person_id, permission_id, daterange(valid_from, valid_to))=(1, 2, [2019-01-01,2022-12-31)) conflicts with existing key (person_id, permission_id, daterange(valid_from, valid_to))=(1, 2, [2010-10-01,2020-12-31)).
Try it in db<>fiddle.
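One detail to check against your data model: daterange(valid_from, valid_to) builds a range with an inclusive lower bound and an exclusive upper bound, so a permission ending 2020-12-31 would not conflict with one starting 2020-12-31. If valid_to is meant to be the last day the permission holds, pass the bounds explicitly; a variation of the constraint above (the rest of the table unchanged):
exclude using gist (
  person_id with =,
  permission_id with =,
  daterange(valid_from, valid_to, '[]') with &&)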

Aggregate column by condition

I have to write an uptime report for a homepage. In the table that is available to me I have two columns: the first one is the status of the homepage (0 is offline and 1 is online) and the second one is the duration of this status in seconds. An example table might look like this:
| Status | Duration |
|--------|----------|
| 0      | 50       |
| 1      | 10       |
| 1      | 20       |
| 1      | 50       |
| 0      | 50       |
| 0      | 50       |
| 1      | 20       |
This does not look that nice in my report; consecutive rows with the same status should be aggregated into a single row rather than shown separately, like this:
| Status | Duration |
|--------|----------|
| 0      | 50       |
| 1      | 80       |
| 0      | 100      |
| 1      | 20       |
Is there a way to achieve this with PostgreSQL?
As said already, you need an id/datetime column to track the row order. Only then can you use the LEAD/LAG functions or the Tabibitosan (row-number difference) method for this scenario.
SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE TABLE t
(id INT, Status INT, Duration INT);

INSERT INTO t (id, Status, Duration) VALUES
(1, 0, 50),
(2, 1, 10),
(3, 1, 20),
(4, 1, 50),
(5, 0, 50),
(6, 0, 50),
(7, 1, 20);
Query 1 (the difference between the two row numbers is constant within each run of equal statuses, so grouping by status and that difference merges consecutive rows):
SELECT status,
       SUM(duration)
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (ORDER BY id)
       - ROW_NUMBER() OVER (PARTITION BY status ORDER BY id) AS seq
  FROM t
) s
GROUP BY status, seq
ORDER BY MAX(id)
Results:
| status | sum |
|--------|-----|
| 0 | 50 |
| 1 | 80 |
| 0 | 100 |
| 1 | 20 |
This aggregation can also be achieved with window functions and grouping:
select max(status) as status, sum(duration) as duration
from (
  select status, duration,
         sum(case when status <> par then 1 else 0 end) over (order by id) as wf
  from (
    select id, status, duration, lag(status) over (order by id) as par
    from test
  ) a
) a
group by wf
order by wf;
The key is to set the ordering properly in the window functions (note the order by id in both).
Test data:
create table test (status int, duration int, id bigserial primary key);
insert into test (status, duration) values
(0, 50), (1, 10), (1, 20), (1, 50), (0, 50), (0, 50), (1, 20);
The output is exactly as you wanted.

How to count rows using a variable date range provided by a table in PostgreSQL

I suspect I require some sort of windowing function to do this. I have the following item data as an example:
count | date
------+-----------
3 | 2017-09-15
9 | 2017-09-18
2 | 2017-09-19
6 | 2017-09-20
3 | 2017-09-21
So there are gaps in my data, first off. I also have another query,
select until_date,
       until_date - lag(until_date) over (order by until_date) as delta_days
from ranges
which generates the following data:
until_date | delta_days
-----------+-----------
2017-09-08 |
2017-09-11 | 3
2017-09-13 | 2
2017-09-18 | 5
2017-09-21 | 3
2017-09-22 | 1
So I'd like my final query to produce this result:
start_date | ending_date | total_items
-----------+-------------+--------------
2017-09-08 | 2017-09-10 | 0
2017-09-11 | 2017-09-12 | 0
2017-09-13 | 2017-09-17 | 3
2017-09-18 | 2017-09-20 | 15
2017-09-21 | 2017-09-22 | 3
which tells me the total count of items from the first table, per range, based on the custom ranges from the second table.
In this particular example I would be summing up the counts BETWEEN start_date AND ending_date (since adjacent ranges share a boundary date, I subtract 1 day from the end date so nothing is counted twice).
Anyone know how to do this?
Thanks!
Use the daterange type. Note that you do not have to calculate delta_days at all: just convert the rows of ranges to dateranges and use the operator <# (element is contained by).
with counts (count, date) as (
  values
    (3, '2017-09-15'::date),
    (9, '2017-09-18'),
    (2, '2017-09-19'),
    (6, '2017-09-20'),
    (3, '2017-09-21')
),
ranges (until_date) as (
  values
    ('2017-09-08'::date),
    ('2017-09-11'),
    ('2017-09-13'),
    ('2017-09-18'),
    ('2017-09-21'),
    ('2017-09-22')
)
select daterange, coalesce(sum(count), 0) as total_items
from (
  select daterange(lag(until_date) over (order by until_date), until_date)
  from ranges
) s
left join counts on date <# daterange
where not lower_inf(daterange)
group by 1
order by 1;
daterange | total_items
-------------------------+-------------
[2017-09-08,2017-09-11) | 0
[2017-09-11,2017-09-13) | 0
[2017-09-13,2017-09-18) | 3
[2017-09-18,2017-09-21) | 17
[2017-09-21,2017-09-22) | 3
(5 rows)
Note that in the dateranges above the lower bounds are inclusive while the upper bounds are exclusive. That is also why [2017-09-18,2017-09-21) totals 17 (9 + 2 + 6) rather than the 15 in the question's expected output: 2017-09-19, with its count of 2, also falls inside that range.
If you want to calculate items per day in the dateranges:
select daterange, total_items,
       round(total_items::dec / (upper(daterange) - lower(daterange)), 2) as items_per_day
from (
  select daterange, coalesce(sum(count), 0) as total_items
  from (
    select daterange(lag(until_date) over (order by until_date), until_date)
    from ranges
  ) s
  left join counts on date <# daterange
  where not lower_inf(daterange)
  group by 1
) s
order by 1;
daterange | total_items | items_per_day
-------------------------+-------------+---------------
[2017-09-08,2017-09-11) | 0 | 0.00
[2017-09-11,2017-09-13) | 0 | 0.00
[2017-09-13,2017-09-18) | 3 | 0.60
[2017-09-18,2017-09-21) | 17 | 5.67
[2017-09-21,2017-09-22) | 3 | 3.00
(5 rows)

PostgreSQL time series for each record

I'm having issues trying to wrap my head around how to extract some time series stats from my Postgres DB.
For example, I have several stores. I record how many sales each store made each day in a table that looks like:
+------------+----------+-------+
| Date | Store ID | Count |
+------------+----------+-------+
| 2017-02-01 | 1 | 10 |
| 2017-02-01 | 2 | 20 |
| 2017-02-03 | 1 | 11 |
| 2017-02-03 | 2 | 21 |
| 2017-02-04 | 3 | 30 |
+------------+----------+-------+
I'm trying to display this data on a bar/line graph with different lines per Store and the blank dates filled in with 0.
I have been successful in getting it to show the sum per day (combining all the stores into one sum) using generate_series, but I can't figure out how to separate it out so each store has a value for each day... the result being something like:
["Store ID 1", 10, 0, 11, 0]
["Store ID 2", 20, 0, 21, 0]
["Store ID 3", 0, 0, 0, 30]
It is necessary to build a cross join of dates × stores:
select store_id, array_agg(total order by date) as total
from (
  select store_id, date, coalesce(sum(total), 0) as total
  from t
  right join (
    generate_series(
      (select min(date) from t),
      (select max(date) from t),
      interval '1 day'
    ) gs (date)
    cross join
    (select distinct store_id from t) s
  ) using (date, store_id)
  group by 1, 2
) s
group by 1
order by 1;
store_id | total
----------+-------------
1 | {10,0,11,0}
2 | {20,0,21,0}
3 | {0,0,0,30}
Sample data:
create table t (date date, store_id int, total int);
insert into t (date, store_id, total) values
('2017-02-01',1,10),
('2017-02-01',2,20),
('2017-02-03',1,11),
('2017-02-03',2,21),
('2017-02-04',3,30);
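A small type note on the query above, not a fix: generate_series has no date overload, so the series comes back as timestamps, and the USING join compares date against timestamp through an implicit cast. If whatever consumes the result cares about the type, cast explicitly, e.g.:
-- generate_series(date, date, interval) yields timestamps
select d::date as date
from generate_series('2017-02-01'::date, '2017-02-04', interval '1 day') gs(d);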

PostgreSQL, two windowing functions at once

I have a typical table with data, say mytemptable.
DROP TABLE IF EXISTS mytemptable;
CREATE TEMP TABLE mytemptable
(mydate date, somedoc text, inqty int, outqty int);

INSERT INTO mytemptable (mydate, somedoc, inqty, outqty)
VALUES ('01.01.2016.', '123-13-24', 3, 0),
       ('04.01.2016.', '15-19-44', 2, 0),
       ('06.02.2016.', '15-25-21', 0, 1),
       ('04.01.2016.', '21-133-12', 0, 1),
       ('04.01.2016.', '215-11-51', 0, 2),
       ('05.01.2016.', '11-181-01', 0, 1),
       ('05.02.2016.', '151-80-8', 4, 0),
       ('04.01.2016.', '215-11-51', 0, 2),
       ('07.02.2016.', '34-02-02', 0, 2);

SELECT row_number() OVER (ORDER BY mydate) AS rn,
       mydate, somedoc, inqty, outqty,
       SUM(inqty - outqty) OVER (ORDER BY mydate) AS csum
FROM mytemptable
ORDER BY mydate;
In my SELECT query I try to order the result by date and add row numbers (rn) and a running sum (csum), of course unsuccessfully. I believe this is because the two window functions in the query conflict in some way.
How can I make this query fast and well ordered, and get the proper result in the csum column (3, 5, 4, 2, 0, -1, 3, 2, 0)?
Since there is an ordering tie at 2016-01-04, the running sum for those rows will be the total accumulated over the whole tie group. If you want it to be different, add tie-breaking columns to the ORDER BY.
From the manual:
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition
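A minimal self-contained illustration of that peer-row behavior (not part of the question's data): with the default RANGE frame, all rows tied on the ORDER BY value show the sum through the end of their tie group.
select x, sum(x) over (order by x) as csum
from (values (1), (2), (2), (3)) v(x);

 x | csum
---+------
 1 |    1
 2 |    5
 2 |    5
 3 |    8
Both x = 2 rows show 5 (1 + 2 + 2), not 3 and then 5.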
Without a natural tie-breaking column you can generate a row number in a subquery and use it in the window's ORDER BY:
set datestyle = 'dmy';
with mytemptable (mydate, somedoc, inqty, outqty) as (
  values
    ('01-01-2016'::date, '123-13-24', 3, 0),
    ('04-01-2016', '15-19-44', 2, 0),
    ('06-02-2016', '15-25-21', 0, 1),
    ('04-01-2016', '21-133-12', 0, 1),
    ('04-01-2016', '215-11-51', 0, 2),
    ('05-01-2016', '11-181-01', 0, 1),
    ('05-02-2016', '151-80-8', 4, 0),
    ('04-01-2016', '215-11-51', 0, 2),
    ('07-02-2016', '34-02-02', 0, 2)
)
select *, sum(inqty - outqty) over (order by mydate, rn) as csum
from (
  select row_number() over (order by mydate) as rn,
         mydate, somedoc, inqty, outqty
  from mytemptable
) s
order by mydate;
 rn |   mydate   |  somedoc  | inqty | outqty | csum
----+------------+-----------+-------+--------+------
  1 | 2016-01-01 | 123-13-24 |     3 |      0 |    3
  2 | 2016-01-04 | 15-19-44  |     2 |      0 |    5
  3 | 2016-01-04 | 21-133-12 |     0 |      1 |    4
  4 | 2016-01-04 | 215-11-51 |     0 |      2 |    2
  5 | 2016-01-04 | 215-11-51 |     0 |      2 |    0
  6 | 2016-01-05 | 11-181-01 |     0 |      1 |   -1
  7 | 2016-02-05 | 151-80-8  |     4 |      0 |    3
  8 | 2016-02-06 | 15-25-21  |     0 |      1 |    2
  9 | 2016-02-07 | 34-02-02  |     0 |      2 |    0