PostgreSQL, two windowing functions at once

I have typical table with data, say mytemptable.
DROP TABLE IF EXISTS mytemptable;
CREATE TEMP TABLE mytemptable
(mydate date, somedoc text, inqty int, outqty int);
INSERT INTO mytemptable (mydate, somedoc, inqty, outqty)
VALUES ('01.01.2016.', '123-13-24', 3, 0),
('04.01.2016.', '15-19-44', 2, 0),
('06.02.2016.', '15-25-21', 0, 1),
('04.01.2016.', '21-133-12', 0, 1),
('04.01.2016.', '215-11-51', 0, 2),
('05.01.2016.', '11-181-01', 0, 1),
('05.02.2016.', '151-80-8', 4, 0),
('04.01.2016.', '215-11-51', 0, 2),
('07.02.2016.', '34-02-02', 0, 2);
SELECT row_number() OVER(ORDER BY mydate) AS rn,
mydate, somedoc, inqty, outqty,
SUM(inqty-outqty) OVER(ORDER BY mydate) AS csum
FROM mytemptable
ORDER BY mydate;
In my SELECT query I try to order the result by date and to add a row number 'rn' and a cumulative (running) sum 'csum'. Of course unsuccessfully.
I believe this is because I use two window functions in the query, and they conflict in some way.
How can I make this query fast and well ordered, and get the proper result in the 'csum' column (3, 5, 4, 2, 0, -1, 3, 2, 0)?

Since there is an ordering tie at 2016-04-01, the result for those rows will be the same: the accumulated sum over all the tied rows. If you want it to be different, add a tiebreaker column to the ORDER BY.
From the manual:
There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Many (but not all) window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.
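As an illustration of that default frame, explicitly switching from RANGE (the default with ORDER BY) to ROWS makes the running sum advance row by row even across tied dates; which tied row gets which intermediate value is then arbitrary unless you add a tiebreaker. A minimal sketch against the same table:
SELECT row_number() OVER (ORDER BY mydate) AS rn,
       mydate, somedoc, inqty, outqty,
       SUM(inqty - outqty) OVER (ORDER BY mydate
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS csum
FROM mytemptable
ORDER BY mydate;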
Alternatively, without a separate tiebreaker column, you can use the generated row number as the tiebreaker in an outer query:
set datestyle = 'dmy';
with mytemptable (mydate, somedoc, inqty, outqty) as (
values
('01-01-2016'::date, '123-13-24', 3, 0),
('04-01-2016', '15-19-44', 2, 0),
('06-02-2016', '15-25-21', 0, 1),
('04-01-2016', '21-133-12', 0, 1),
('04-01-2016', '215-11-51', 0, 2),
('05-01-2016', '11-181-01', 0, 1),
('05-02-2016', '151-80-8', 4, 0),
('04-01-2016', '215-11-51', 0, 2),
('07-02-2016', '34-02-02', 0, 2)
)
select *, sum(inqty - outqty) over (order by mydate, rn) as csum
from (
    select
        row_number() over (order by mydate) as rn,
        mydate, somedoc, inqty, outqty
    from mytemptable
) s
order by mydate;
 rn |   mydate   |  somedoc  | inqty | outqty | csum
----+------------+-----------+-------+--------+------
  1 | 2016-01-01 | 123-13-24 |     3 |      0 |    3
  2 | 2016-04-01 | 15-19-44  |     2 |      0 |    5
  3 | 2016-04-01 | 21-133-12 |     0 |      1 |    4
  4 | 2016-04-01 | 215-11-51 |     0 |      2 |    2
  5 | 2016-04-01 | 215-11-51 |     0 |      2 |    0
  6 | 2016-05-01 | 11-181-01 |     0 |      1 |   -1
  7 | 2016-05-02 | 151-80-8  |     4 |      0 |    3
  8 | 2016-06-02 | 15-25-21  |     0 |      1 |    2
  9 | 2016-07-02 | 34-02-02  |     0 |      2 |    0

Related

historical aggregation of a column up until a specified time in each row in another column

I have two tables, login_attempts and checkouts, in Amazon Redshift. A user can have multiple (un)successful login attempts and multiple (un)successful checkouts, as shown in this example:
login_attempts
 login_id | user_id |        login        | success
----------+---------+---------------------+---------
        1 |       1 | 2021-07-01 14:00:00 |       0
        2 |       1 | 2021-07-01 16:00:00 |       1
        3 |       2 | 2021-07-02 05:01:01 |       1
        4 |       1 | 2021-07-04 03:25:34 |       0
        5 |       2 | 2021-07-05 11:20:50 |       0
        6 |       2 | 2021-07-07 12:34:56 |       1
and
checkouts
 checkout_id |    checkout_time    | user_id | success
-------------+---------------------+---------+---------
           1 | 2021-07-01 18:00:00 |       1 |       0
           2 | 2021-07-02 06:54:32 |       2 |       1
           3 | 2021-07-04 13:00:01 |       1 |       1
           4 | 2021-07-08 09:05:00 |       2 |       1
Given this information, how can I get the following table with historical performance included for each checkout AS OF THAT TIME?
 checkout_id |      checkout       | user_id |    lastGoodLogin    |   lastFailedLogin   |  lastGoodCheckout   | lastFailedCheckout
-------------+---------------------+---------+---------------------+---------------------+---------------------+---------------------
           1 | 2021-07-01 18:00:00 |       1 | 2021-07-01 16:00:00 | 2021-07-01 14:00:00 | NULL                | NULL
           2 | 2021-07-02 06:54:32 |       2 | 2021-07-02 05:01:01 | NULL                | NULL                | NULL
           3 | 2021-07-04 13:00:01 |       1 | 2021-07-01 16:00:00 | 2021-07-04 03:25:34 | NULL                | 2021-07-01 18:00:00
           4 | 2021-07-08 09:05:00 |       2 | 2021-07-07 12:34:56 | 2021-07-05 11:20:50 | 2021-07-02 06:54:32 | NULL
Update: I was able to get lastFailedCheckout and lastGoodCheckout because that only involves window operations on the same table (checkouts), but I am failing to understand how best to join it with the login_attempts table to get the last[Good|Failed]Login fields. (sqlfiddle)
P.S.: I am open to PostgreSQL suggestions as well.
Good start! A couple of things in your SQL: 1) You should really try to avoid inequality joins, as these can lead to data explosions and aren't needed in this case. Just put a CASE statement inside your window function to use only the type of checkout (or login) you want. 2) You can use the frame clause to avoid selecting the current row itself when finding previous checkouts.
Once you have this pattern you can use it to find the other two columns of data you are looking for. The first step is to UNION the tables together, not JOIN them. This means adding a few more columns so the data can live together, but that is easy. Now you have the user id and the time the "thing" happened all in the same data. You just need to window two more times to pull the info you want. Lastly, you strip out the non-checkout rows with an outer SELECT and a WHERE clause.
Like this:
create table login_attempts(
loginid smallint,
userid smallint,
login timestamp,
success smallint
);
create table checkouts(
checkoutid smallint,
userid smallint,
checkout_time timestamp,
success smallint
);
insert into login_attempts values
(1, 1, '2021-07-01 14:00:00', 0),
(2, 1, '2021-07-01 16:00:00', 1),
(3, 2, '2021-07-02 05:01:01', 1),
(4, 1, '2021-07-04 03:25:34', 0),
(5, 2, '2021-07-05 11:20:50', 0),
(6, 2, '2021-07-07 12:34:56', 1)
;
insert into checkouts values
(1, 1, '2021-07-01 18:00:00', 0),
(2, 2, '2021-07-02 06:54:32', 1),
(3, 1, '2021-07-04 13:00:01', 1),
(4, 2, '2021-07-08 09:05:00', 1)
;
SQL:
select *
from (
    select
        c.checkoutid,
        c.userid,
        c.checkout_time,
        max(case success when 0 then checkout_time end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastFailedCheckout,
        max(case success when 1 then checkout_time end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastGoodCheckout,
        max(case lsuccess when 0 then login end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastFailedLogin,
        max(case lsuccess when 1 then login end) over (
            partition by userid
            order by event_time
            rows between unbounded preceding and 1 preceding
        ) as lastGoodLogin
    from (
        select checkout_time as event_time, checkoutid, userid,
               checkout_time, success,
               NULL as login, NULL as lsuccess
        from checkouts
        UNION ALL
        select login as event_time, NULL as checkoutid, userid,
               NULL as checkout_time, NULL as success,
               login, success as lsuccess
        from login_attempts
    ) c
) o
where o.checkoutid is not null
order by o.checkoutid;
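Since PostgreSQL was mentioned as an option: the query above should run there largely unchanged, and in PostgreSQL the CASE expressions can be replaced with a FILTER clause on the aggregate, with the repeated window definition factored into a WINDOW clause. A rough sketch of the same idea, assuming the same tables and columns as above:
select *
from (
    select
        checkoutid, userid, checkout_time,
        max(checkout_time) filter (where success = 0)  over w as lastFailedCheckout,
        max(checkout_time) filter (where success = 1)  over w as lastGoodCheckout,
        max(login)         filter (where lsuccess = 0) over w as lastFailedLogin,
        max(login)         filter (where lsuccess = 1) over w as lastGoodLogin
    from (
        select checkout_time as event_time, checkoutid, userid,
               checkout_time, success,
               null::timestamp as login, null::smallint as lsuccess
        from checkouts
        union all
        select login, null, userid,
               null, null,
               login, success
        from login_attempts
    ) e
    window w as (partition by userid order by event_time
                 rows between unbounded preceding and 1 preceding)
) o
where checkoutid is not null
order by checkoutid;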

Sum of consecutive values in column of a Spark dataframe

Hi, I have a dataframe as below:
+-------+--------+
|id |level |
+-------+--------+
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 1 |
| 7 | 1 |
| 8 | 0 |
| 9 | 1 |
| 10 | 0 |
+-------+--------+
and I need the sums of the consecutive 1's, so the output should be 3, 2, 1. However, the constraint in this scenario is that I cannot use a UDF. Is there any built-in Scala/Spark function that can do this trick?
You could use row_number and count (SQL/DataFrame API) to count the number of consecutive repeated values in a column.
The trick is to group rows by the offset between the row number over all rows and the row number within the level partition: every row in one consecutive run of the same value shares the same offset.
Scala
var df = spark.createDataFrame(Seq((0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0))).toDF("id","level")
df.createOrReplaceTempView("DT")
var df_cnt = spark.sql("select level, count(*) from (select *, (row_number() over (order by id) - row_number() over (partition by level order by id) ) as grp from DT order by id) as t where level !=0 group by grp, level ")
df_cnt.show()
The sequence of id must be maintained otherwise it will produce the wrong result.
Pyspark
df = spark.createDataFrame([(0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0)]).toDF("id","level")
df.createOrReplaceTempView('DT')
# same as before with spark.sql(...)
SQL
select level, count(*) from
(select *,
(row_number() over (order by id) -
row_number() over (partition by level order by id)
) as grp
from DT order by id) as t
where level !=0
group by grp, level
Intermediate SQL computation detail (row offset and grouping):
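A sketch of what that intermediate step looks like for the sample data above, running just the inner part of the query against the DT view with the two row numbers broken out:
select id, level,
       row_number() over (order by id) as rn,
       row_number() over (partition by level order by id) as rn_level,
       row_number() over (order by id)
         - row_number() over (partition by level order by id) as grp
from DT
order by id
 id | level | rn | rn_level | grp
----+-------+----+----------+-----
  0 |     0 |  1 |        1 |   0
  1 |     0 |  2 |        2 |   0
  2 |     1 |  3 |        1 |   2
  3 |     1 |  4 |        2 |   2
  4 |     1 |  5 |        3 |   2
  5 |     0 |  6 |        3 |   3
  6 |     1 |  7 |        4 |   3
  7 |     1 |  8 |        5 |   3
  8 |     0 |  9 |        4 |   5
  9 |     1 | 10 |        6 |   4
 10 |     0 | 11 |        5 |   6
Grouping by (grp, level) and keeping only level != 0 then yields the counts 3, 2 and 1.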
You could do something like this:
val seq = Seq(0,0,1,1,1,0,1,1,0,1,0)
val seq1s = seq.foldLeft("")(_ + _).split("0")
seq1s.map(_.sliding(1).count(_ == "1"))
res: Array[Int] = Array(0, 0, 3, 2, 1)
If you don't want the 0s there you could just filter them out using this instead:
seq1s.map(_.sliding(1).count(_ == "1")).filterNot(_ == 0)
res: Array[Int] = Array(3, 2, 1)

Postgresql Time Series for each Record

I'm having issues trying to wrap my head around how to extract some time series stats from my Postgres DB.
For example, I have several stores. I record how many sales each store made each day in a table that looks like:
+------------+----------+-------+
| Date | Store ID | Count |
+------------+----------+-------+
| 2017-02-01 | 1 | 10 |
| 2017-02-01 | 2 | 20 |
| 2017-02-03 | 1 | 11 |
| 2017-02-03 | 2 | 21 |
| 2017-02-04 | 3 | 30 |
+------------+----------+-------+
I'm trying to display this data on a bar/line graph with different lines per Store and the blank dates filled in with 0.
I have been successful in getting it to show the sum per day (combining all the stores into one sum) using generate_series, but I can't figure out how to separate it out so each store has a value for each day... the result being something like:
["Store ID 1", 10, 0, 11, 0]
["Store ID 2", 20, 0, 21, 0]
["Store ID 3", 0, 0, 0, 30]
It is necessary to build a cross join of dates × stores:
select store_id, array_agg(total order by date) as total
from (
    select store_id, date, coalesce(sum(total), 0) as total
    from
        t
        right join (
            generate_series(
                (select min(date) from t),
                (select max(date) from t),
                '1 day'
            ) gs (date)
            cross join
            (select distinct store_id from t) s
        ) using (date, store_id)
    group by 1, 2
) s
group by 1
order by 1;
 store_id |    total
----------+-------------
        1 | {10,0,11,0}
        2 | {20,0,21,0}
        3 | {0,0,0,30}
Sample data:
create table t (date date, store_id int, total int);
insert into t (date, store_id, total) values
('2017-02-01',1,10),
('2017-02-01',2,20),
('2017-02-03',1,11),
('2017-02-03',2,21),
('2017-02-04',3,30);
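If the chart also needs the matching x-axis labels, the same generate_series bounds can be aggregated once more; a small sketch:
select array_agg(gs.date::date order by gs.date) as dates
from generate_series(
       (select min(date) from t),
       (select max(date) from t),
       '1 day'
     ) gs (date);
This should return {2017-02-01,2017-02-02,2017-02-03,2017-02-04}, in the same order the per-store totals were aggregated.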

how to number distinct values while respecting their original ordering?

Here's my input data:
CREATE TEMP TABLE test AS SELECT * FROM (VALUES
(1, 12),
(2, 7),
(3, 8),
(4, 8),
(5, 7)
) AS rows (position, value);
I want to, in a single query (no subqueries or CTEs), assign a unique number for each distinct value. However, I also want those numbers to ascend according to the associated position -- i.e., a distinct value's number should be assigned according to its lowest position.
Assumptions:
each row will always have a unique position
value is not guaranteed unique per row
the number of a distinct value is only for ordinal purposes, e.g. it doesn't matter whether distinct_values goes 1-2-3 or 3-8-14
The desired output is:
 position | value | distinct_value
----------+-------+----------------
        1 |    12 |              1
        2 |     7 |              2
        3 |     8 |              3
        4 |     8 |              3
        5 |     7 |              2
I can get close using DENSE_RANK to number distinct values:
SELECT
position,
value,
DENSE_RANK() OVER (ORDER BY value) AS distinct_value
FROM test ORDER BY position;
The result obviously ignores position:
 position | value | distinct_value
----------+-------+----------------
        1 |    12 |              3
        2 |     7 |              1
        3 |     8 |              2
        4 |     8 |              2
        5 |     7 |              1
Is there a better window function for this?
with
t(x,y) as (values
(1, 12),
(2, 7),
(3, 8),
(4, 8),
(5, 7)),
pos(i,y) as (select min(x), y from t group by y),
ind(i,y) as (select row_number() over(order by i), y from pos)
select * from ind join t using(y) order by x;
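If a subquery is acceptable after all, the same numbering can also be expressed with window functions alone, by ranking each value by its lowest position. A sketch against the test table above:
SELECT position, value,
       DENSE_RANK() OVER (ORDER BY first_pos) AS distinct_value
FROM (
    SELECT *, MIN(position) OVER (PARTITION BY value) AS first_pos
    FROM test
) s
ORDER BY position;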

Select statement with join, or subquery limit

For a few days now I've been trying to solve this problem.
I have the tables group_user and group_name.
What I want to do is select a user's groups, then the description of each group (from group_name), and 10 other users from each group.
The first two are not a problem. The problem is that I have nowhere to apply the limit on users.
I can select the user's groups and the other users in those groups; I just don't know how to limit them per group.
Using:
SELECT a.g_id,b.group,b.userid
FROM group_user AS a
RIGHT JOIN
(SELECT g_id as group, u_id as userid FROM group_user) AS b ON a.g_id=b.group
WHERE u_id=112
It shows me my user's groups and the users in those groups. But when I try to apply a LIMIT in the subquery, it limits everything, not each particular group.
I tried selecting users with IN over my user's groups, without luck.
I was thinking maybe GROUP BY and HAVING would help, but I can't see how to use them.
So my question is: how can I limit a subquery result in MySQL when the subquery is built on the result of another query?
I think I'm overcomplicating this and maybe missing something.
UPDATE: to show what I really want to accomplish, here's another piece of code.
SELECT g_id FROM group_user WHERE user_id = 112
So I get all the groups that the user is in. Say each of those results is a variable extra_group; then the second query would be
SELECT u_id FROM group_user WHERE group_id = extra_group LIMIT 10
I need to do the same as above in one query.
Another UPDATE after Mike's post:
I should add that a user can be in more than one group. So I think the real problem is that I don't have any clue how to select those groups and, in the same query, select 10 users for each selected group, so the result could be
 g_id | u_id
------+------
    1 |    2
    1 |    3
    1 |    4
    3 |    3
    3 |    8
where g_id is one of the user's groups from this query:
SELECT g_id FROM group_user WHERE user_id = 112
Create sample tables and add data:
CREATE TABLE `group_user` (
`u_id` int(11) DEFAULT NULL,
`g_id` int(11) DEFAULT NULL,
`apply_date` date DEFAULT NULL
);
CREATE TABLE `group_name` (
`g_id` int(11) DEFAULT NULL,
`g_name` varchar(255) DEFAULT NULL
);
INSERT INTO `group_name` VALUES
(1, 'Group 1'), (2, 'Group 2'), (3, 'Group 3'), (4, 'Group 4'), (5, 'Group 5');
INSERT INTO `group_user` VALUES
(1, 1, '2010-12-01'), (1, 2, '2010-12-01'), (1, 3, '2010-12-01'), (1, 4, '2010-12-01'), (1, 5, '2010-12-01'),
(2, 1, '2010-12-02'), (2, 2, '2010-12-02'),
(3, 1, '2010-12-03'), (3, 2, '2010-12-03'), (3, 3, '2010-12-03'), (3, 4, '2010-12-03'),
(4, 1, '2010-12-04'), (4, 2, '2010-12-04'),
(5, 1, '2010-12-05'), (5, 2, '2010-12-05'),
(6, 1, '2010-12-06'), (6, 2, '2010-12-06'),
(7, 1, '2010-12-07'), (7, 2, '2010-12-07'), (7, 3, '2010-12-07'), (7, 4, '2010-12-07'), (7, 5, '2010-12-07'),
(8, 1, '2010-12-08'), (8, 2, '2010-12-08'),
(9, 1, '2010-12-09'), (9, 2, '2010-12-09'), (9, 3, '2010-12-09'), (9, 4, '2010-12-09'), (9, 5, '2010-12-09');
Select the groups of which user u_id == 1 is a member. Then for each group select a maximum of 4 members (excluding user u_id == 1), ordered by descending apply_date:
SELECT u3.g_id, g.g_name, u3.u_id, u3.apply_date
FROM (
SELECT
u1.g_id,
u1.u_id,
u1.apply_date,
IF( @prev_gid <> u1.g_id, @user_index := 1, @user_index := @user_index + 1 ) AS user_index,
@prev_gid := u1.g_id AS prev_gid
FROM group_user AS u1
JOIN (SELECT @prev_gid := 0, @user_index := NULL) AS vars
JOIN group_user AS u2
ON u2.g_id = u1.g_id
AND u2.u_id = 1
AND u1.u_id <> 1
ORDER BY u1.g_id, u1.apply_date DESC, u1.u_id
) AS u3
JOIN group_name AS g ON g.g_id = u3.g_id
WHERE u3.user_index <= 4
ORDER BY u3.g_id, u3.apply_date DESC, u3.u_id;
+------+---------+------+------------+
| g_id | g_name  | u_id | apply_date |
+------+---------+------+------------+
|    1 | Group 1 |    5 | 2010-12-05 |
|    1 | Group 1 |    4 | 2010-12-04 |
|    1 | Group 1 |    3 | 2010-12-03 |
|    1 | Group 1 |    2 | 2010-12-02 |
|    2 | Group 2 |    5 | 2010-12-05 |
|    2 | Group 2 |    4 | 2010-12-04 |
|    2 | Group 2 |    3 | 2010-12-03 |
|    2 | Group 2 |    2 | 2010-12-02 |
|    3 | Group 3 |    9 | 2010-12-09 |
|    3 | Group 3 |    7 | 2010-12-07 |
|    3 | Group 3 |    3 | 2010-12-03 |
|    4 | Group 4 |    9 | 2010-12-09 |
|    4 | Group 4 |    7 | 2010-12-07 |
|    4 | Group 4 |    3 | 2010-12-03 |
|    5 | Group 5 |    9 | 2010-12-09 |
|    5 | Group 5 |    7 | 2010-12-07 |
+------+---------+------+------------+
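For what it's worth, on MySQL 8+ (or PostgreSQL) the user-variable bookkeeping is usually replaced with ROW_NUMBER(). A rough sketch against the same sample tables, keeping at most 4 other members per group ranked strictly by descending apply_date (so it keeps the most recently joined members, which may differ from the rows shown above):
SELECT g_id, g_name, u_id, apply_date
FROM (
    SELECT u1.g_id, g.g_name, u1.u_id, u1.apply_date,
           ROW_NUMBER() OVER (PARTITION BY u1.g_id
                              ORDER BY u1.apply_date DESC, u1.u_id) AS user_index
    FROM group_user AS u1
    JOIN group_user AS u2
      ON u2.g_id = u1.g_id
     AND u2.u_id = 1
     AND u1.u_id <> 1
    JOIN group_name AS g
      ON g.g_id = u1.g_id
) AS ranked
WHERE user_index <= 4
ORDER BY g_id, apply_date DESC, u_id;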