How to use Window function in Redshift SQL - amazon-redshift

I have a table like this:
Ans_cnt | Workloadid | Alias
10 | 1 | A
10 | 1 | B
10 | 1 | C
20 | 2 | D
20 | 2 | E
20 | 2 | F
create temp table test
(ans_cnt int, workloadid int, alias varchar(2));
insert into test values
(10, 1, 'A');
insert into test values
(10, 1, 'B');
insert into test values
(10, 1, 'C');
I want to get a result like this:
Ans_cnt | workloadid
10 | 1
20 | 2
That is, the total ans_cnt for workloadid 1 is still 10 and for workloadid 2 it is still 20; multiple aliases are simply assigned to the same workload. I hope that makes sense.
I tried a sum partitioned by workloadid, but it's not working:
select sum(ans_cnt) over (partition by workloadid) as ans_cnt from test
Please help.

What happens if the same workload has different ans_cnt values?
For example, in this case:
Ans_cnt | Workloadid | Alias
10 | 1 | A
10 | 1 | B
20 | 1 | C
10 | 2 | D
20 | 2 | E
30 | 2 | F
My guess is that you want the highest ans_cnt per workload.
If so, you simply need this SQL:
select workloadid, max(ans_cnt) as ans_cnt from test
group by workloadid;
Which will give this as output:
Workloadid | Ans_cnt
1 | 20
2 | 30
Or, if you want the latest ans_cnt and your aliases are assigned alphabetically, you need this SQL:
select ans_cnt, workloadid
from (
select ans_cnt, workloadid
, row_number() over (partition by workloadid order by alias desc) as rnk
from test
) as t
where rnk=1
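To compare the approaches side by side, here is a minimal sketch that runs them against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (needed for window functions) as a stand-in for Redshift; the function name run_examples is only illustrative.

```python
import sqlite3

# Rows from the answer's example: (ans_cnt, workloadid, alias).
ROWS = [(10, 1, "A"), (10, 1, "B"), (20, 1, "C"),
        (10, 2, "D"), (20, 2, "E"), (30, 2, "F")]

def run_examples():
    con = sqlite3.connect(":memory:")
    con.execute("create table test (ans_cnt int, workloadid int, alias varchar(2))")
    con.executemany("insert into test values (?, ?, ?)", ROWS)
    # GROUP BY collapses each workload to a single row.
    highest = con.execute(
        "select workloadid, max(ans_cnt) from test "
        "group by workloadid order by workloadid"
    ).fetchall()
    # A window function keeps every input row and repeats the partition total.
    windowed = con.execute(
        "select sum(ans_cnt) over (partition by workloadid) from test"
    ).fetchall()
    # row_number() picks the row with the alphabetically last alias per workload.
    latest = con.execute("""
        select workloadid, ans_cnt from (
            select ans_cnt, workloadid,
                   row_number() over (partition by workloadid order by alias desc) as rnk
            from test) as t
        where rnk = 1
        order by workloadid
    """).fetchall()
    con.close()
    return highest, windowed, latest

highest, windowed, latest = run_examples()
```

The windowed query returns six rows, one per input row, which is why the original attempt did not produce one row per workload.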

Related

Recursive CTE in PostgreSQL for knapsack problem

I have a dataset with 3 columns:
Item_id | Sourced_from | Cost
1 | Local | 15
2 | Local | 10
3 | Local | 20
4 | International | 60
I am trying to write a PostgreSQL query that returns how many local and how many international items a customer can buy within a cash limit. For a cash limit of 50, this is the output I am expecting:
Local | International
3 | 0
I have only basic knowledge of PostgreSQL. After some googling, it seems this could be solved with a recursive CTE, but I am unable to figure out how to select the seed/anchor point in this scenario.
Any ideas on how I should approach this?
Not with a recursive CTE, but still works:
DDL/DML:
create table T
(
id integer primary key generated by default AS IDENTITY,
kind text not null,
cost integer not null
);
insert into T(kind, cost)
values ('local', 15),
('local', 10),
('local', 20),
('international', 60);
-- 4. This outer CTE and the following self-join is only necessary in order to display the rows that have a count() of 0
with sub as
(
-- 3. find the total cost of buying this row + all previous rows, grouped by its kind
select X.kind, sum(X.cost) as cost, X.rn
from (
with cte as (
-- 1. assign an increasing row number on each row from the table ordered by its cost
select *, row_number() over (order by T.cost asc, T.kind) as rn
from T
)
-- 2. self-join the CTE on each row with the same kind, but join it only with the rows that have a row number less than or equal to the current row number
select A.id, A.kind, A.cost, B.rn
from cte as A
join cte as B on A.kind = B.kind and A.rn <= B.rn
) as X
group by X.kind, X.rn
)
select M.kind, count(N.*)
from sub as M -- 5. count only the number of goods that fit in our budget (i.e. 50)
left outer join sub as N on M.rn = N.rn and N.cost <= 50
group by M.kind
;
Output (db-fiddle):
+-------------+-----+
|kind |count|
+-------------+-----+
|local |3 |
|international|0 |
+-------------+-----+
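The idea behind the query (sort by cost, take a running total per kind, count the totals that stay within the budget) can be sketched outside SQL as well. This is only an illustration of that logic in Python; the name affordable_counts is hypothetical, and it assumes all costs are positive so the running total never decreases.

```python
from itertools import accumulate

# (kind, cost) rows from the DDL/DML above.
ITEMS = [("local", 15), ("local", 10), ("local", 20), ("international", 60)]

def affordable_counts(items, budget):
    counts = {}
    # dict.fromkeys keeps the kinds in first-seen order.
    for kind in dict.fromkeys(k for k, _ in items):
        # Cheapest first, like row_number() over (order by cost).
        costs = sorted(c for k, c in items if k == kind)
        # Running total, like the self-join plus sum().
        running_totals = accumulate(costs)
        # A kind with no affordable item still gets a 0, like the left outer join.
        counts[kind] = sum(1 for total in running_totals if total <= budget)
    return counts

counts = affordable_counts(ITEMS, 50)
```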
I made a recursive CTE example to solve the problem.
I recreated your case with:
create table kp (item_id int, sourced_from varchar, cost int);
insert into kp values (1,'local',15);
insert into kp values (2,'local',10);
insert into kp values (3,'local',20);
insert into kp values (4,'international',60);
The anchor part of the following query:
- selects from kp only the items with a cost less than 50
- puts the item_id in list_of_items
The recursive part:
- joins with kp, checking that sourced_from is the same and that kp.item_id is not already contained in list_of_items (to avoid adding the same item multiple times)
- computes the total cost (total_cost)
- appends the new item_id to list_of_items
WITH RECURSIVE items (item_id, next_item_id, sourced_from, total_cost, nr_items, list_of_items) AS (
SELECT
item_id,
item_id as next_item_id,
sourced_from,
cost as total_cost,
1 as nr_items,
ARRAY[item_id] list_of_items
from kp where cost < 50
UNION ALL
SELECT
kp.item_id,
items.item_id as next_item_id,
items.sourced_from,
items.total_cost + kp.cost total_cost,
items.nr_items + 1 as nr_items,
items.list_of_items || kp.item_id as list_of_items
FROM kp join items
on items.sourced_from=kp.sourced_from
and items.list_of_items::int[] @> ARRAY[kp.item_id] = false
WHERE kp.cost + items.total_cost < 50
)
SELECT * FROM items;
If you run this against the dataset above, you'll end up with the detailed result:
item_id | next_item_id | sourced_from | total_cost | nr_items | list_of_items
---------+--------------+--------------+------------+----------+---------------
1 | 1 | local | 15 | 1 | {1}
2 | 2 | local | 10 | 1 | {2}
3 | 3 | local | 20 | 1 | {3}
1 | 2 | local | 25 | 2 | {2,1}
1 | 3 | local | 35 | 2 | {3,1}
2 | 1 | local | 25 | 2 | {1,2}
2 | 3 | local | 30 | 2 | {3,2}
3 | 1 | local | 35 | 2 | {1,3}
3 | 2 | local | 30 | 2 | {2,3}
1 | 2 | local | 45 | 3 | {3,2,1}
1 | 3 | local | 45 | 3 | {2,3,1}
2 | 1 | local | 45 | 3 | {3,1,2}
2 | 3 | local | 45 | 3 | {1,3,2}
3 | 1 | local | 45 | 3 | {2,1,3}
3 | 2 | local | 45 | 3 | {1,2,3}
(15 rows)
which shows all the permutations of the 3 local items.
Now if you substitute the last SELECT section with
SELECT * FROM items order by nr_items desc, total_cost desc, list_of_items asc limit 1;
This also lets you pick the combination with the maximum number of items and the cost closest to the budget (I also added an ascending ordering on list_of_items so that ties always return the same result), which in the case above gives:
item_id | next_item_id | sourced_from | total_cost | nr_items | list_of_items
---------+--------------+--------------+------------+----------+---------------
3 | 2 | local | 45 | 3 | {1,2,3}
(1 row)
If you are just interested in the maximum by sourced_from then the last SELECT becomes
select sourced_from, max(nr_items) nr_items from items group by sourced_from;
with the expected result being
sourced_from | nr_items
--------------+----------
local | 3
(1 row)
Edit: to speed up the query and avoid producing multiple permutations of the same items (e.g. {1,2,3} and {3,2,1}), we can force the next item_id to be greater than the current one. Full query:
WITH RECURSIVE items (item_id, next_item_id, sourced_from, total_cost, nr_items, list_of_items) AS (
SELECT
item_id,
item_id as next_item_id,
sourced_from,
cost as total_cost,
1 as nr_items,
ARRAY[item_id] list_of_items
from kp where cost < 50
UNION ALL
SELECT
kp.item_id,
items.item_id as next_item_id,
items.sourced_from,
items.total_cost + kp.cost total_cost,
items.nr_items + 1 as nr_items,
items.list_of_items || kp.item_id as list_of_items
FROM kp join items
on items.sourced_from=kp.sourced_from
and items.list_of_items::int[] @> ARRAY[kp.item_id] = false
and items.item_id < kp.item_id
WHERE kp.cost + items.total_cost < 50
)
select * from items;
result
item_id | next_item_id | sourced_from | total_cost | nr_items | list_of_items
---------+--------------+--------------+------------+----------+---------------
1 | 1 | local | 15 | 1 | {1}
2 | 2 | local | 10 | 1 | {2}
3 | 3 | local | 20 | 1 | {3}
2 | 1 | local | 25 | 2 | {1,2}
3 | 1 | local | 35 | 2 | {1,3}
3 | 2 | local | 30 | 2 | {2,3}
3 | 2 | local | 45 | 3 | {1,2,3}
(7 rows)
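As a cross-check of the 7 rows above, the same enumeration can be sketched in Python with itertools.combinations, which yields each set of items once, in ascending id order, much like the item_id ordering constraint does. The function name affordable_sets is hypothetical, and the strict "less than" comparison mirrors the query; this brute force is only sensible for small inputs.

```python
from itertools import combinations

# (item_id, sourced_from, cost) rows from the kp table above.
KP = [(1, "local", 15), (2, "local", 10), (3, "local", 20), (4, "international", 60)]

def affordable_sets(rows, budget):
    by_source = {}
    for item_id, source, cost in rows:
        by_source.setdefault(source, []).append((item_id, cost))
    result = []
    for source, items in by_source.items():
        for size in range(1, len(items) + 1):
            # Each combination appears once, never as a reordered duplicate.
            for combo in combinations(items, size):
                total = sum(cost for _, cost in combo)
                if total < budget:  # strict, as in the recursive query
                    result.append((source, [i for i, _ in combo], total))
    return result

sets = affordable_sets(KP, 50)
# Largest affordable set per source, like max(nr_items) ... group by sourced_from.
max_items = {}
for source, ids, _ in sets:
    max_items[source] = max(max_items.get(source, 0), len(ids))
```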

How to compute frequency/count of concurrent events by combination in postgresql?

I am looking for a way to identify event names that co-occur, i.e., to correlate event names with the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap is not a feature of this database, which makes this criterion a bit simpler to satisfy).
Toy data:
+------------------+
|name startts endts|
| A 02:20 02:23 |
| A 02:23 02:25 |
| A 02:27 02:28 |
| B 02:20 02:23 |
| B 02:23 02:25 |
| B 02:25 02:27 |
| C 02:27 02:28 |
| D 02:27 02:28 |
| D 02:28 02:31 |
| E 02:27 02:28 |
| E 02:29 02:31 |
+------------------+
Ideal output:
+-------------+-------+
| combination | count |
+-------------+-------+
| AB          | 2     |
| AC          | 1     |
| AE          | 1     |
| AD          | 1     |
| BC          | 0     |
| BD          | 0     |
| BE          | 0     |
| CE          | 0     |
+-------------+-------+
Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.
What I've tried is generating a temporary table by selecting for distinct name and startts and endts combinations and then doing a left join on the table itself (selecting name).
User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even on a 10-minute time window, the query never completes). For context, there are about 300-400 unique names, so up to roughly 80,000 pairs (if my math checks out). Order within a pair does not matter.
@GMB's attempt:
I understand this as a self-join, aggregation, and a conditional count of matching intervals:
select t1.name name1, t2.name name2,
sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) cnt
from mytable t1
inner join mytable t2 on t2.name > t1.name
group by t1.name, t2.name
order by t1.name, t2.name
Demo on DB Fiddle:
name1 | name2 | cnt
:---- | :---- | --:
A | B | 2
A | C | 1
A | D | 1
A | E | 1
B | C | 0
B | D | 0
B | E | 0
C | D | 1
C | E | 1
D | E | 1
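@GMB notes that, if you are looking for a count of overlapping intervals, all you have to do is change the sum() to the following (summing a boolean is MySQL syntax; in PostgreSQL, wrap the condition in a CASE expression as above):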
sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097
Thank you.
Consider the following in MySQL (which is what your DB Fiddle uses):
SELECT name, COUNT(*)
FROM (
SELECT group_concat(name ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
Equivalent in PostgreSQL (string_agg requires an explicit delimiter):
SELECT name, COUNT(*)
FROM (
SELECT string_agg(name, ',' ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
First, you create a list of concurrent events (in the subquery), and then you count them.
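If per-pair counts are what is needed, the same "group by interval first" idea extends naturally: only names that share an interval are paired, so the full cross join of all names is avoided. Here is a minimal Python sketch of that logic on the toy data; the name pair_counts is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (name, startts, endts) rows from the toy data above.
EVENTS = [
    ("A", "02:20", "02:23"), ("A", "02:23", "02:25"), ("A", "02:27", "02:28"),
    ("B", "02:20", "02:23"), ("B", "02:23", "02:25"), ("B", "02:25", "02:27"),
    ("C", "02:27", "02:28"), ("D", "02:27", "02:28"), ("D", "02:28", "02:31"),
    ("E", "02:27", "02:28"), ("E", "02:29", "02:31"),
]

def pair_counts(events):
    # Step 1: collect the names sharing each exact (start, end) interval.
    names_by_interval = defaultdict(set)
    for name, start, end in events:
        names_by_interval[(start, end)].add(name)
    # Step 2: count every unordered pair within each interval.
    counts = Counter()
    for names in names_by_interval.values():
        counts.update(combinations(sorted(names), 2))
    return counts

counts = pair_counts(EVENTS)
```

Pairs that never co-occur are simply absent from the result (a Counter reports them as 0), so zero rows would need to be filled in separately if they are required.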

Postgres Insert N Rows in a Loop for All Values in a Selected Column

Suppose I have users stored as
select * from users_t where user_name like 'ABC%';
id user_name
1 ABC1
2 ABC2
.. ..
Now I need to loop through all the user_names and make that number of INSERTs into a different table, RECALLS_T. All the other columns are hard-coded constants that I define.
Assume the following table, with a Sequence called RECALLS_T_ID_SEQ on the ID:
id created_by_user_name field1 field2
1 ABC1 Const1 Const2
2 ABC2 Const1 Const2
.. .. .. ..
How do I insert these in a Postgres loop?
ADDITIONAL QUESTION: What if I need to insert X (say 5) recalls for each user entry? Suppose it's not a 1:1 mapping but 5:1, where 5 is a hard-coded loop count.
You can use the select in the insert statement:
insert into recalls_t (created_by_user_name, field1, field2)
select user_name, 'Const1', 'Const2'
from users_t
where user_name like 'ABC%';
Use the function generate_series() to insert more than one row for each entry from users_t (3 per user here; use generate_series(1, 5) for 5). I have added the column step to illustrate this:
insert into recalls_t (created_by_user_name, field1, field2, step)
select user_name, 'Const1', 'Const2', step
from users_t
cross join generate_series(1, 3) as step
where user_name like 'ABC%'
returning *
id | created_by_user_name | field1 | field2 | step
----+----------------------+--------+--------+------
1 | ABC1 | Const1 | Const2 | 1
2 | ABC2 | Const1 | Const2 | 1
3 | ABC1 | Const1 | Const2 | 2
4 | ABC2 | Const1 | Const2 | 2
5 | ABC1 | Const1 | Const2 | 3
6 | ABC2 | Const1 | Const2 | 3
(6 rows)
Live demo in Db<>fiddle.
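The cross join with generate_series() is a Cartesian product of users and step numbers. As a rough illustration of which rows it produces, here is the same product in Python; build_recalls is a hypothetical helper, the running id stands in for the sequence, and the step-major row order matches the output shown above but is not guaranteed by SQL.

```python
from itertools import product

def build_recalls(user_names, copies, field1="Const1", field2="Const2"):
    rows = []
    # Every step is paired with every user, like CROSS JOIN generate_series(1, copies).
    for step, user_name in product(range(1, copies + 1), user_names):
        next_id = len(rows) + 1  # stand-in for the RECALLS_T_ID_SEQ sequence
        rows.append((next_id, user_name, field1, field2, step))
    return rows

recalls = build_recalls(["ABC1", "ABC2"], 3)
```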

Generate a histogram of values grouped by a column

I have the following data in a reviews table for certain set of items, using a score system that ranges from 0 to 100
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1 | 1 | 90 |
+-----------+---------+-------+
| 2 | 1 | 40 |
+-----------+---------+-------+
| 3 | 1 | 10 |
+-----------+---------+-------+
| 4 | 2 | 90 |
+-----------+---------+-------+
| 5 | 2 | 90 |
+-----------+---------+-------+
| 6 | 2 | 70 |
+-----------+---------+-------+
| 7 | 3 | 80 |
+-----------+---------+-------+
| 8 | 3 | 80 |
+-----------+---------+-------+
| 9 | 3 | 80 |
+-----------+---------+-------+
| 10 | 3 | 80 |
+-----------+---------+-------+
| 11 | 4 | 10 |
+-----------+---------+-------+
| 12 | 4 | 30 |
+-----------+---------+-------+
| 13 | 4 | 50 |
+-----------+---------+-------+
| 14 | 4 | 80 |
+-----------+---------+-------+
I am trying to create a histogram of the score values with five bins. My goal is to generate a histogram per item. To create a histogram of the entire table, it is possible to use the width_bucket function. This can also be tuned to operate on a per-item basis:
SELECT item_id, g.n as bucket, COUNT(m.score) as count
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1 | 5 | 1 |
+---------+--------+-------+
| 1 | 3 | 1 |
+---------+--------+-------+
| 1 | 1 | 1 |
+---------+--------+-------+
| 2 | 5 | 2 |
+---------+--------+-------+
| 2 | 4 | 2 |
+---------+--------+-------+
| 3 | 4 | 4 |
+---------+--------+-------+
| 4 | 1 | 1 |
+---------+--------+-------+
| 4 | 2 | 1 |
+---------+--------+-------+
| 4 | 3 | 1 |
+---------+--------+-------+
| 4 | 4 | 1 |
+---------+--------+-------+
That is, bins with no entries are not included. While I do not find this a bad solution, I would rather have all buckets, with 0 for those with no entries. Even better would be this structure:
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1 | 1 | 0 | 1 | 0 | 1 |
+---------+----------+----------+----------+----------+----------+
| 2 | 0 | 0 | 0 | 2 | 2 |
+---------+----------+----------+----------+----------+----------+
| 3 | 0 | 0 | 0 | 4 | 0 |
+---------+----------+----------+----------+----------+----------+
| 4 | 1 | 1 | 1 | 1 | 0 |
+---------+----------+----------+----------+----------+----------+
I prefer this solution as it uses a row per item (instead of 5n), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:
select item_id,
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review
group by item_id;
Even though this query satisfies my requirements, I am curious whether there is a more elegant approach. So many CASE expressions are not easy to read, and changes in the bin criteria might require updating every sum. I am also curious about potential performance concerns with this query.
The second query can be rewritten to use ranges to make editing and writing the query a bit easier:
with buckets (b1, b2, b3, b4, b5) as (
values (
int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 100)
)
)
select item_id,
count(*) filter (where b1 @> score) as bucket_1,
count(*) filter (where b2 @> score) as bucket_2,
count(*) filter (where b3 @> score) as bucket_3,
count(*) filter (where b4 @> score) as bucket_4,
count(*) filter (where b5 @> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;
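A range constructed with int4range(0,20) includes the lower end and excludes the upper end, so int4range(80, 100) does not contain a score of 100; use int4range(80, 100, '[]') for the last bucket if a score of 100 must be counted.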
The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
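To check the bucket boundaries independently of the database, here is a small Python sketch that computes the same one-row-per-item layout. It follows the CASE boundaries from the question (0-19, 20-39, 40-59, 60-79, 80-100) and assumes scores lie between low and high; histogram_per_item is a hypothetical name.

```python
from collections import defaultdict

# (item_id, score) pairs from the reviews table in the question.
REVIEWS = [(1, 90), (1, 40), (1, 10), (2, 90), (2, 90), (2, 70),
           (3, 80), (3, 80), (3, 80), (3, 80),
           (4, 10), (4, 30), (4, 50), (4, 80)]

def histogram_per_item(reviews, bins=5, low=0, high=100):
    width = (high - low) / bins
    table = defaultdict(lambda: [0] * bins)
    for item_id, score in reviews:
        # The top edge (score == high) is clamped into the last bucket.
        index = min(int((score - low) // width), bins - 1)
        table[item_id][index] += 1
    return dict(table)

histogram = histogram_per_item(REVIEWS)
```

Each value is a list of five counts, bucket_1 through bucket_5, with zeros for empty buckets.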
I found this post useful
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
WITH
source AS (
SELECT * FROM %s
),
min_max AS (
SELECT min(%s) AS min, max(%s) AS max FROM source
),
temp_histogram AS (
SELECT
width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
count(%s) AS freq
FROM source, min_max
WHERE %s IS NOT NULL
GROUP BY bucket
ORDER BY bucket
)
SELECT
bucket,
"range",
freq::bigint,
repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
FROM temp_histogram',
table_name_or_subquery,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name
);
END
$func$ LANGUAGE plpgsql;
Adjust the number of buckets (100 in the script above) to suit your data.
Invoke it like this:
SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');

SQL - group by - limit clause - postgresql

I have a table which has two columns C1 and C2.
C1 has an integer data type and C2 has text.
Table looks like this.
---C1--- ---C2---
1 | a |
1 | b |
1 | c |
1 | d |
1 | e |
1 | f |
1 | g |
2 | h |
2 | i |
2 | j |
2 | k |
2 | l |
2 | m |
2 | n |
------------------
My question: I want an SQL query that groups by column C1, but in groups of at most 3 rows.
The result should look like this:
------------------
1 | a,b,c |
1 | d,e,f |
1 | g |
2 | h,i,j |
2 | k,l,m |
2 | n |
------------------
Is this possible in plain SQL?
Note: I do not want to write a stored procedure or function.
You can use a common table expression to partition the results into rows, and then use STRING_AGG to join them into comma-separated lists:
WITH cte AS (
SELECT *, (ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2)-1)/3 rn
FROM mytable
)
SELECT C1, STRING_AGG(C2, ',' ORDER BY C2) ALL_C2
FROM cte
GROUP BY C1, rn
ORDER BY C1, rn
An SQLfiddle to test with.
A short explanation of the common table expression:
ROW_NUMBER() OVER (...) will number the results from 1 to n for each value of C1. We then subtract 1 and divide by 3 to get the sequence 0,0,0,1,1,1,2,2,2... and group by that value in the outer query to get 3 results per row.
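The (row_number - 1) / 3 trick can be tried out in a few lines of Python. This is only a sketch of the numbering logic, with group_in_chunks as a hypothetical name; enumerate() plays the role of ROW_NUMBER() - 1 and integer division produces the 0,0,0,1,1,1,... chunk keys.

```python
from itertools import groupby

# (C1, C2) rows from the question.
ROWS = [(1, c) for c in "abcdefg"] + [(2, c) for c in "hijklmn"]

def group_in_chunks(rows, size=3):
    result = []
    # PARTITION BY C1 ORDER BY C2: sort, then group on the first column.
    for c1, members in groupby(sorted(rows), key=lambda row: row[0]):
        numbered = enumerate(c2 for _, c2 in members)  # ROW_NUMBER() - 1
        # Integer division assigns the same key to each run of `size` rows.
        for _, chunk in groupby(numbered, key=lambda pair: pair[0] // size):
            result.append((c1, ",".join(c2 for _, c2 in chunk)))
    return result

chunks = group_in_chunks(ROWS)
```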
Apart from Joachim Isaksson's answer, you can also try this method:
SELECT C1, string_agg(C2, ',') as c2
FROM (
SELECT *, (ROW_NUMBER() OVER (PARTITION BY C1 ORDER BY C2)-1)/3 as row_num
FROM atable) t
GROUP BY C1,row_num
ORDER BY c2