Managing overflows in LISTAGG on Amazon Redshift

Using the example from this post: https://blogs.oracle.com/datawarehousing/entry/managing_overflows_in_listagg
The following statement:
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;
will generate the following output:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES
Let's assume that the above statement does not run, and that each value returned by our LISTAGG call is limited to 15 characters. (On Amazon Redshift the actual limit is 65535.)
We would want the following to be returned in this case:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLER
20 SMITH;JONES
20 SCOTT;ADAMS
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
30 TURNER;JAMES
What would be the best way to recreate this result in Amazon Redshift, avoiding any data loss and taking speed into consideration?

It's possible to achieve this with two subqueries:
First:
SELECT id, field,
sum(length(field) + 1) over
(partition by id order by RANDOM() rows unbounded preceding) as total_length_now
FROM my_schema.my_table
Initially we want to calculate how many characters we have accumulated for each id in our table. We can use a window function to compute this incrementally for each row. In the ORDER BY clause you can use any unique field that you have. If you don't have one, you can use RANDOM() or a hash function, but the ordering field must be unique; if it is not, the function will not work as we want.
The '+1' added to the length accounts for the semicolon separator that we will use in the LISTAGG function.
Second:
SELECT id, field, total_length_now / 65535 as sub_id
FROM (sub_query_1)
Now we create a sub_id from the running length we calculated before. Because both operands are integers, the division truncates: every time total_length_now crosses a multiple of the limit (65535 here), the quotient, and therefore the sub_id, increases by one. For example, with a limit of 10, a running length of 11 gives 11 / 10 = 1, so that row starts a new sub-list.
Last:
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (sub_query_2)
GROUP BY id, sub_id
ORDER BY id, sub_id
Now we can simply call the listagg function grouping by id and sub_id, since each group cannot exceed the size limit.
Complete query
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (
  SELECT id, field, total_length_now / 65535 as sub_id
  FROM (SELECT id,
               field,
               sum(length(field) + 1) over
                 (partition by id order by field rows unbounded preceding) as total_length_now
        FROM support.test) t1
) t2
GROUP BY id, sub_id
ORDER BY id, sub_id
Example with your data (with size limit = 10)
First and second query output:
id, field, total_length_now, sub_id
10,KING,5,0
10,CLARK,11,1
10,MILLER,18,1
20,ADAMS,6,0
20,SMITH,12,1
20,JONES,18,1
20,FORD,23,2
20,SCOTT,29,2
30,JAMES,6,0
30,BLAKE,12,1
30,WARD,17,1
30,MARTIN,24,2
30,TURNER,31,3
30,ALLEN,37,3
Final query output:
id,sub_id,namelist
10,0,KING
10,1,CLARK;MILLER
20,0,ADAMS
20,1,SMITH;JONES
20,2,FORD;SCOTT
30,0,JAMES
30,1,BLAKE;WARD
30,2,MARTIN
30,3,TURNER;ALLEN

It is possible to create a partial list and emit the remaining values as separate rows in one pass, but if the number of rows is unconstrained you really need a loop: build one list, take the rows that remain, build the next list, and so on.
So this is really a task for Apache Spark (or any other map-reduce technology).
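For illustration, here is a minimal Spark SQL sketch of that route (an assumption-laden example, not code from the thread: it presumes the emp data has been registered as a temp view named emp, uses collect_list plus concat_ws in place of LISTAGG, and sorts alphabetically via sort_array for determinism where the original ordered by empno):
-- Spark SQL: no VARCHAR(65535) ceiling on the aggregated string,
-- so a plain group-and-concatenate replaces the LISTAGG workaround
SELECT deptno,
       concat_ws(';', sort_array(collect_list(ename))) AS namelist
FROM emp
GROUP BY deptno;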

Related

PostgreSQL, first date when cumulative sum reaches mark

I have the following sample table
And the output should be the first date (for each id) when cum_rev reaches the 100 mark.
I tried the following, because I thought that with the GROUP BY trick and the WHERE condition I would only get the first occurrence of a value higher than 100.
SELECT id
,pd
,cum_rev
FROM (
SELECT id
,pd
,rev
,SUM(rev) OVER (
PARTITION BY id
ORDER BY pd
) AS cum_rev
FROM tab1
)
WHERE cum_rev >= 100
GROUP BY id
But it is not working, and I get the following error. Adding an alias does not help either.
ERROR:  subquery in FROM must have an alias
LINE 4: FROM (
             ^
HINT:  For example, FROM (SELECT ...) [AS] foo.
So the desired output is:
2 2015-04-02 135.70
3 2015-07-03 102.36
Do I need another approach? Can anyone help?
Thanks
Demo: db<>fiddle
SELECT
id, total
FROM (
SELECT
*,
SUM(rev) OVER (PARTITION BY id ORDER BY pd) - rev as prev_total,
SUM(rev) OVER (PARTITION BY id ORDER BY pd) as total
FROM tab1
) s
WHERE total >= 100 AND prev_total < 100
You can use the cumulative SUM() window function within each id group (partition). To find the first row that crosses the threshold, check that the previous cumulative value is still under the threshold while the current one meets it.
PS: You got the error because your subquery was missing an alias. In my example it's just s.
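The same filter can also be written with LAG() instead of the total - rev subtraction; a sketch under the same assumptions about tab1 (two nesting levels, because window functions cannot be nested directly):
SELECT id, pd, total
FROM (
  SELECT id, pd, total,
         -- previous row's cumulative value; 0 for the first row of each id
         COALESCE(LAG(total) OVER (PARTITION BY id ORDER BY pd), 0) AS prev_total
  FROM (
    SELECT id, pd, SUM(rev) OVER (PARTITION BY id ORDER BY pd) AS total
    FROM tab1
  ) t
) s
WHERE total >= 100 AND prev_total < 100;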

postgresql: How to grab an existing id at random from a table with non-consecutive ids

Postgresql version 9.4
I have a table with an integer column, which has a number of integers with some gaps, like the sample below; I'm trying to get an existing id from the column at random with the following query, but it returns NULL occasionally:
CREATE TABLE
IF NOT EXISTS test_tbl(
id INTEGER);
INSERT INTO test_tbl
VALUES (10),
(13),
(14),
(16),
(18),
(20);
-------------------------------
SELECT * FROM test_tbl;
-------------------------------
SELECT COALESCE(tmp.id, 20) AS classification_id
FROM (
SELECT tt.id,
row_number() over(
ORDER BY tt.id) AS row_num
FROM test_tbl tt
) tmp
WHERE tmp.row_num = floor(random() * 10);
Please let me know what I'm doing wrong.
but it returns NULL occasionally
And I must add that it sometimes returns more than one row, right?
In your sample data there are 6 rows, so the column row_num will have a value from 1 to 6.
This:
floor(random() * 10)
creates a random integer from 0 to 9, because random() returns a value from 0 up to 0.9999...
You should use:
floor(random() * 6 + 1)::int
to get a random integer from 1 to 6.
But this alone would not solve the problem, because the WHERE clause evaluates random() anew for each row. So there are cases where row_num never matches the generated number and nothing is returned, and cases where it matches more than once and several rows are returned.
The proper (although sometimes not the most efficient) way to get a random row is:
SELECT id FROM test_tbl ORDER BY random() LIMIT 1
Also check other links from SO, like:
quick random row selection in Postgres
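If you would rather keep the row_number() approach, the random index has to be generated once rather than per row; one way, sketched against the sample table above, is an uncorrelated scalar subquery, which Postgres evaluates a single time:
SELECT tmp.id AS classification_id
FROM (
  SELECT tt.id,
         row_number() over (ORDER BY tt.id) AS row_num
  FROM test_tbl tt
) tmp
-- the scalar subquery runs once, so exactly one row_num matches
WHERE tmp.row_num = (SELECT floor(random() * count(*) + 1)::int FROM test_tbl);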
You could select one row and order by random(); this way you are ensured to hit an existing row:
select id
from test_tbl
order by random()
LIMIT 1;

Postgres: Need distinct records count

I have a table with duplicate entries and the objective is to get the distinct entries based on the latest time stamp.
In my case 'serial_no' will have duplicate entries but I select unique entries based on the latest time stamp.
Below query is giving me the unique results with the latest time stamp.
But my concern is I need to get the total of unique entries.
For example assume my table has 40 entries overall. With the below query I am able to get 20 unique rows based on the serial number.
But the 'total' is returned as 40 instead of 20.
Any help on this, please?
SELECT
*
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp,
COUNT(*) OVER() as total
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC OFFSET 0
LIMIT
10
) AS my_info
ORDER BY
serial_no asc
The product_info table initially has this data:
serial_no name timestamp
11212 pulp12 2018-06-01 20:00:01
11213 mango 2018-06-01 17:00:01
11214 grapes 2018-06-02 04:00:01
11215 orange 2018-06-02 07:05:30
11212 pulp12 2018-06-03 14:00:01
11213 mango 2018-06-03 13:00:00
After the DISTINCT ON query I got all unique results based on the latest timestamp:
serial_no name timestamp total
11212 pulp12 2018-06-03 14:00:01 6
11213 mango 2018-06-03 13:00:00 6
11214 grapes 2018-06-02 04:00:01 6
11215 orange 2018-06-02 07:05:30 6
But total is appearing as 6. I wanted the total to be 4, since there are only 4 unique entries. I am not sure how to modify my existing query to get this desired result.
Postgres supports COUNT(DISTINCT column_name), so if I have understood your request, using that instead of COUNT(*) will work, and you can drop the OVER.
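Concretely, that would look something like this (a sketch reusing the question's join and filter; it returns only the distinct total, not the detail rows):
SELECT COUNT(DISTINCT serial_no) AS total
FROM product_info
INNER JOIN my.account ON id = accountid
WHERE lower(name) = 'hello';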
What you could do is move the window function to a higher-level SELECT statement. This is because the window function is evaluated before the DISTINCT ON and LIMIT clauses are applied. Also, you cannot include the DISTINCT keyword within window functions; it has not been implemented yet (as of Postgres 9.6).
SELECT
*,
COUNT(*) OVER() as total -- here
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC
LIMIT
10
) AS my_info
Additionally, the OFFSET is not required there, and the extra outer sort was also superfluous; I've removed both.
Another way would be to include a computed column in the SELECT clause, but this would not be as fast, since it requires one more scan of the table. This obviously assumes that your total is strictly connected to your result set, not to what lies beyond it in the table and gets filtered out.
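That computed-column variant could look like the following sketch (untested against your schema; the scalar subquery repeats the join and filter, which is the extra scan mentioned above):
SELECT my_info.*,
       (SELECT COUNT(DISTINCT serial_no)
        FROM product_info
        INNER JOIN my.account ON id = accountid
        WHERE lower(name) = 'hello') AS total
FROM (
  SELECT DISTINCT ON (serial_no) id, serial_no, name, timestamp
  FROM product_info
  INNER JOIN my.account ON id = accountid
  WHERE lower(name) = 'hello'
  ORDER BY serial_no, timestamp DESC
  LIMIT 10
) AS my_info
ORDER BY serial_no;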
select count(*), serial_no from product_info group by serial_no
will give you the number of duplicates for each serial number
The most mindless way of incorporating that information would be to join in a subquery:
SELECT
*
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp,
COUNT(*) OVER() as total
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC OFFSET 0
LIMIT
10
) AS my_info
join (select count(*) as counts, serial_no from product_info group by serial_no) as X
on X.serial_no = my_info.serial_no
ORDER BY
serial_no asc

How to reference output rows with window functions?

Suppose I have a table with a quantity column.
CREATE TABLE transfers (
user_id integer,
quantity integer,
created timestamp default now()
);
I'd like to iteratively go through a partition using window functions, but access the output rows, not the input table rows.
To access the input table rows I could do something like this:
SELECT LAG(quantity, 1, 0)
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
I need to access the previous output row to calculate the next output row. How can I access the lagged row in the output? Something like:
CREATE VIEW balance AS
SELECT LAG(balance.total, 1, 0) + quantity AS total
OVER (PARTITION BY user_id ORDER BY created)
FROM transfers;
Edit
This is a minimal example to support the question of how to access the previous output row within a window partition. I don't actually want a sum.
It seems you are attempting to calculate a running sum. Luckily, that's just what the SUM() window function does:
WITH transfers AS(
SELECT i, random()-0.3 AS quantity FROM generate_series(1,100) as i
)
SELECT i, quantity, sum(quantity) OVER (ORDER BY i) from transfers;
I guess, looking at the question, that the only thing you need is to calculate a cumulative sum.
To calculate a cumulative sum, use this query:
SELECT *,
SUM( CASE WHEN quantity IS NULL THEN 0 ELSE quantity END)
OVER ( PARTITION BY user_id ORDER BY created
ROWS BETWEEN unbounded preceding AND current row
) As cumulative_sum
FROM transfers
ORDER BY user_id, created
;
But if you want more complex calculations, especially ones containing conditions (decisions) that depend on a result from the previous row, then you need a recursive approach.
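For illustration, a minimal recursive sketch (an assumption-laden example rather than the questioner's real logic: rows are first numbered consecutively per user, and the CASE is a placeholder decision that resets the running total after it goes negative, exactly the kind of rule a plain window function cannot express because it depends on the previous output row):
WITH RECURSIVE ordered AS (
  SELECT user_id, quantity, created,
         row_number() OVER (PARTITION BY user_id ORDER BY created) AS rn
  FROM transfers
), walk AS (
  -- anchor: each user's first transfer starts the running total
  SELECT user_id, quantity, created, rn, quantity AS total
  FROM ordered
  WHERE rn = 1
  UNION ALL
  -- recursive step: the new value is computed from the previous OUTPUT row
  SELECT o.user_id, o.quantity, o.created, o.rn,
         CASE WHEN w.total < 0 THEN o.quantity  -- placeholder rule: reset after a negative balance
              ELSE w.total + o.quantity
         END
  FROM walk w
  JOIN ordered o ON o.user_id = w.user_id AND o.rn = w.rn + 1
)
SELECT user_id, created, quantity, total
FROM walk
ORDER BY user_id, rn;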

Retrieving Representative Records for Unique Values of Single Column

For Postgresql 8.x, I have an answers table containing (id, user_id, question_id, choice) where choice is a string value. I need a query that will return a set of records (all columns returned) for all unique choice values. What I'm looking for is a single representative record for each unique choice. I also want to have an aggregate votes column that is a count() of the number of records matching each unique choice accompanying each record. I want to force choice to lowercase for this comparison to be made (HeLLo and Hello should be considered equal). I can't GROUP BY lower(choice) because I want all columns in the result-set. Grouping by all columns causes all records to return, including all duplicates.
1. Closest I've gotten
select lower(choice), count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
The issue with this is it will not return all columns.
lower | votes
-----------------------------------------------+-------
dancing in the moonlight | 8
pumped up kicks | 7
party rock anthem | 6
sexy and i know it | 5
moves like jagger | 4
2. Trying with all columns
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice) order by votes desc;
Because I am not specifying every column from the SELECT in my GROUP BY, this throws an error telling me to do so.
3. Specifying all columns in the GROUP BY
select *, count(choice) as votes from answers where question_id = 21 group by lower(choice), id, user_id, question_id, choice order by votes desc;
This simply dumps the table with votes column as 1 for all records.
How can I get the vote count and unique representative records from 1., but with all columns from the table returned?
Join the grouped results back to the primary table, then show only one row for each (question, choice) combination.
Similar to this:
WITH top5 AS (
  select question_id, lower(choice) as choice, count(*) as votes
  from answers
  where question_id = 21
  group by question_id, lower(choice)
  order by count(*) desc
  limit 5
)
SELECT DISTINCT ON (top5.question_id, top5.choice) *
FROM top5
JOIN answers ON answers.question_id = top5.question_id
            AND lower(answers.choice) = top5.choice
ORDER BY top5.question_id, top5.choice, answers.id;
Here's what I ended up with:
SELECT answers.*, cc.votes AS votes
FROM answers
JOIN (
  -- one representative row per normalized choice: the one with the highest id
  select max(id) as id, count(id) as votes
  from answers
  group by trim(lower(choice))
) cc ON answers.id = cc.id
ORDER BY votes desc, lower(choice) asc;