Interleaving array_agg in postgres

I have a postgres query in which I want to interleave my array_agg statements:
SELECT client_user_id,
(array_agg(question), array_agg(client_intake_question_id), array_agg(answer)) as answer
FROM client_intake_answer
LEFT OUTER JOIN client_intake_question
ON client_intake_question.id = client_user_id
GROUP BY client_user_id
This gives me the following:
5 | ("{""Have you ever received counselling?"",""Have you ever received counselling or mental health support in the past?""}","{1,2}","{yes,no}")
I would like the results to be:
5 | ("{""Have you ever received counselling?", 1, "yes"",""Have you ever received counselling or mental health support in the past?", 2, "no""}"
How do I do this?

I've set up a small example similar to yours:
create table answers(user_id int, question_id int, answer varchar(20));
create table questions(question_id int, question varchar(20));
insert into questions values
(1, 'question 1'),
(2, 'question 2');
insert into answers values
(1, 1, 'yes'),
(1, 2, 'no'),
(2, 1, 'no'),
(2, 2, 'yes');
select user_id, array_agg(concat(questions.question, ',', questions.question_id::text, ',', answers.answer))
from questions
inner join answers
on questions.question_id = answers.question_id
group by answers.user_id
 user_id | array_agg
---------+----------------------------------------
       1 | {"question 1,1,yes","question 2,2,no"}
       2 | {"question 1,1,no","question 2,2,yes"}
dbfiddle here

To interleave or splice together multiple array_agg's you can do the following:
SELECT client_user_id,
array_agg('[' || client_intake_question_id || ',' || question || ',' || answer || ']') as answer
FROM client_intake_answer
LEFT OUTER JOIN
client_intake_question ON client_intake_question.id = client_user_id
GROUP BY client_user_id
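If structured output works for you, another option is to aggregate JSON arrays instead of bracketed strings. This is only a sketch, reusing the question's join as-is and assuming Postgres 9.5+ for jsonb_agg:
SELECT client_user_id,
jsonb_agg(jsonb_build_array(question, client_intake_question_id, answer)) as answer
FROM client_intake_answer
LEFT OUTER JOIN client_intake_question
ON client_intake_question.id = client_user_id
GROUP BY client_user_id
Each element is then a [question, id, answer] triple, so the values stay individually addressable instead of needing string parsing later.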

Eliminate duplicate values in a csv string

I have values in a table that should be treated as duplicates when the name is the same; the corresponding ids are then collected into a csv string column as below:
Original table:
create table #original(id int, unique_id varchar(500), name varchar(200))
insert into #original
values
( 1, '12345', 'A'), ( 2, '12345', 'A'), ( 3, null, 'B'), ( 4, '45678', 'B'),
( 5, '900', 'C'), ( 6, '901', 'C'), ( 7, null, 'D'), ( 8, null, 'D'),
( 9, null, 'E'), (10, '1000', 'E'), (11, null, 'E'), (12, '1100', 'F'),
(13, '1101', 'F'), (14, '1102', 'F')
, (15, '9999', 'G'), (16, '9998', 'G'), (17, '', 'G')
, (18, '1111', 'H')
, (19, '1010', 'I'), (20, '1010', 'I'), (21, '', 'I')
The people named A and B are duplicates, but C isn't, because the unique ids differ for C.
A record is a duplicate if the name is the same AND the unique id matches or is null. When the name is the same but the unique ids differ, they aren't the same people.
I'm selecting the data as below:
;with cte as
(select name
from #original
group by name
having count(*) > 1)
I need to get the data as below:
Id        unique_id  Name
1,2       12345      A
3,4       45678      B
7         null       D
8,10,11   1000       E
19,20,21  1010       I
C and F should be excluded as their unique_ids differ even though the names are the same. H should be excluded because it's not a duplicate. G should be excluded because its unique ids differ. I should be selected because, where a unique id is present for duplicates by name, it is the same for all of them.
Thanks
I think you are looking for something like:
SELECT
STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) id,
(SELECT top 1 unique_id FROM original o3 WHERE o3.name = o1.name AND o3.unique_id IS NOT NULL) unique_id,
name
FROM original o1
WHERE NOT EXISTS
(SELECT 1 FROM original o2 WHERE o2.name = o1.name AND o2.unique_id <> o1.unique_id)
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY name
The NOT EXISTS condition eliminates names C and F (you could use an IN clause if you prefer, but I don't think it's any prettier in this case).
The GROUP BY name combined with the aggregate STRING_AGG gets the comma-separated list of ids for each name, and the HAVING COUNT(*) > 1 keeps only actual duplicates (which excludes H).
This uses a subquery with top 1 to get a non-null unique_id. You could use max(unique_id) instead which certainly looks better, but you will get a warning. If you're comfortable ignoring the warning and don't think it will be confusing, I would use the max version.
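For reference, the max version is the same query with the subquery swapped out; a sketch (this is the variant that can raise the "null value is eliminated by an aggregate" warning):
SELECT
STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) id,
MAX(unique_id) unique_id,
name
FROM original o1
WHERE NOT EXISTS
(SELECT 1 FROM original o2 WHERE o2.name = o1.name AND o2.unique_id <> o1.unique_id)
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY name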
You can see both versions working in this Fiddle.
Edited to add: To address the new requirement in the comments, please see this Fiddle.
There will be multiple ways of doing this, but the condition...
(SELECT COUNT(DISTINCT unique_id) FROM original o2 WHERE o2.name = o1.name and COALESCE(unique_id, '') <> '') <= 1
... will count the number of non-null, non-empty unique_ids to ensure it is 0 (to allow cases like D) or 1 (to allow the other cases, like I).
Note that this also adds an ORDER BY to the TOP 1 subquery version in order to prefer unique_ids with values over both nulls and empty strings.
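Putting those pieces together, the count-based version might look like the following sketch (the linked Fiddle remains the authoritative version; NULLIF is used here to get the same prefer-values-over-empty-strings behaviour as that ORDER BY):
SELECT
STRING_AGG(id, ',') WITHIN GROUP (ORDER BY id) id,
MAX(NULLIF(unique_id, '')) unique_id, -- empty strings become NULL and are ignored by MAX
name
FROM original o1
WHERE (SELECT COUNT(DISTINCT unique_id) FROM original o2 WHERE o2.name = o1.name AND COALESCE(unique_id, '') <> '') <= 1
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY name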

(Postgres) Query in a tree table in ascending and descending mode

I'm having some issues with two queries to search in a "tree" table.
My table is represented by the following code, and it stores each relation in only one direction. However, I need to traverse the data in both directions, ascending and descending.
create table graph_examle (input int null, output int );
insert into graph_examle (input, output) values
(null, 1),
(1, 2),
(2, 3 ),
(3, 4 ),
(null, 7 ),
(7,8),
(8, 4 ),
(null, 10 ),
(10, 11 ),
(11, 4),
(3, 15),
(25, 15),
(26, 15),
(15, 4 );
The ascending query has some issues. If I search by id 1, I'm expecting to see the relations
1, 1->2, 1->2->3, 1->2->3->4, but the results differ. This is the query:
WITH recursive cte (initial_id, level, path, loop, input, output) AS
(
SELECT input, 1, ':' ||input || ':' , 0, input, output
FROM graph_examle WHERE input = 1
UNION ALL
SELECT
c.initial_id,
c.level + 1,
c.path ||ur.input|| ':' ,
CASE WHEN c.path LIKE '%:' ||ur.input || ':%' THEN 1 ELSE 0 END,
ur.*
FROM graph_examle ur
INNER JOIN cte c ON c.output = ur.input AND c.loop = 0
)
SELECT *
FROM cte
ORDER BY initial_id, level;
The descending query does not work as expected. If I search by id 4, I'm expecting to see the relations:
4, 4->3, 4->3->2, 4->3->2->1
4->8, 4->8->7
4->11, 4->11->10
4->15, (...)
But I only get part of that. This is the query:
WITH RECURSIVE cte (input, output, level, real_parent_id, path) AS
(
SELECT
ur.input, ur.input, 1, output, ( ur.input|| ' -> ' || ur.output)
FROM graph_examle ur
WHERE ur.output = 4
UNION ALL
SELECT
ur_cte.input, ur.input, level + 1, ur.output, (ur_cte.path || '->' || ur.output)
FROM cte ur_cte
INNER JOIN graph_examle ur on ur.input = ur_cte.real_parent_id
)
SELECT *
FROM cte
ORDER BY path
Note that in my queries I'm also trying to handle circular dependencies.
The ascending query looks good ... maybe you just need to concatenate the path and output columns.
For the descending query, you can try this :
WITH RECURSIVE cte (input, output, level, path, loop) AS
(
SELECT
ur.input, ur.output, 1, ( ur.output|| ' -> ' || ur.input), 0
FROM graph_examle ur
WHERE ur.output = 4
UNION ALL
SELECT
ur.input, ur_cte.output, level + 1, (ur_cte.path || '->' || ur.input),
CASE WHEN ur_cte.path LIKE '%->' || ur.input THEN 1 ELSE 0 END
FROM cte ur_cte
INNER JOIN graph_examle ur on ur.output = ur_cte.input
WHERE ur_cte.loop = 0
AND ur.input IS NOT NULL
)
SELECT *
FROM cte
ORDER BY path
see dbfiddle
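As an aside, a common Postgres idiom for the loop check is to carry the visited nodes in an integer array rather than a delimited string. A sketch against the same table (untested beyond the sample data):
WITH RECURSIVE cte AS
(
SELECT ur.input, ur.output, 1 AS level,
ARRAY[ur.output, ur.input] AS path -- visited nodes as an int array
FROM graph_examle ur
WHERE ur.output = 4
UNION ALL
SELECT ur.input, ur_cte.output, ur_cte.level + 1,
ur_cte.path || ur.input -- append the new node
FROM cte ur_cte
INNER JOIN graph_examle ur ON ur.output = ur_cte.input
WHERE ur.input IS NOT NULL
AND NOT ur.input = ANY(ur_cte.path) -- stop when the node was already visited
)
SELECT input, output, level, array_to_string(path, ' -> ') AS path
FROM cte
ORDER BY path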

Divide table raw into chunks in Postgres with st_dwithin limit

I have a table with linestrings that I want to divide into chunks, where each chunk has a list of ids no longer than a provided number and keeps only lines that are within a certain distance of each other.
For example, I have a table with 14 rows:
create table lines ( id integer primary key, geom geometry(linestring) );
insert into lines (id, geom) values ( 1, 'LINESTRING(0 0, 0 1)');
insert into lines (id, geom) values ( 2, 'LINESTRING(0 1, 1 1)');
insert into lines (id, geom) values ( 3, 'LINESTRING(1 1, 1 2)');
insert into lines (id, geom) values ( 4, 'LINESTRING(1 2, 2 2)');
insert into lines (id, geom) values ( 11, 'LINESTRING(2 2, 2 3)');
insert into lines (id, geom) values ( 12, 'LINESTRING(2 3, 3 3)');
insert into lines (id, geom) values ( 13, 'LINESTRING(3 3, 3 4)');
insert into lines (id, geom) values ( 14, 'LINESTRING(3 4, 4 4)');
create index lines_gix on lines using gist(geom);
I want to split it into chunks of 3 ids each, where the lines are within 2 meters of each other or of the first one.
The result I am trying to get from this example is:
| Chunk No.| Id chunk list |
|----------|----------------|
| 1 | 1, 2, 3 |
| 2 | 4, 5, 6 |
| 3 | 7, 8, 9 |
| 4 | 10, 11, 12 |
| 5 | 13, 14 |
I tried to use ST_ClusterWithin, but when lines are close to each other it returns all of them in one cluster instead of splitting them into chunks.
I also tried to use some WITH RECURSIVE magic like the one from the answer provided by Paul Ramsey here. But I don't know how to modify the query to return a limited, grouped id list.
I am not sure if this is the best possible answer, so if anyone has a better method or knows how to improve it, feel free to update it. With a little modification of Paul's answer, I've managed to create the following queries, which do what I asked for.
-- Create function for easier interaction
CREATE OR REPLACE FUNCTION find_connected(integer, double precision, integer, integer[])
returns integer[] AS
$$
WITH RECURSIVE lines_r AS -- recursion lets us keep appending to the result and reusing it within the query
(SELECT ARRAY[id] AS idlist,
geom, id
FROM lines
WHERE id = $1
UNION ALL
SELECT array_append(lines_r.idlist, lines.id) AS idlist, -- append id list to array
lines.geom AS geom, -- keep geometry
lines.id AS id -- keep source table id
FROM (SELECT * FROM lines WHERE NOT $4 #> array[id]) lines, lines_r -- from source table and recursive table
WHERE ST_DWITHIN(lines.geom, lines_r.geom, $2) -- where lines are within 2 meters
AND NOT lines_r.idlist #> ARRAY[lines.id] -- recursive id list array not contain lines array
AND array_length(idlist, 1) <= $3
)
SELECT idlist
FROM lines_r WHERE array_length(idlist, 1) <= $3 ORDER BY array_length(idlist, 1) DESC LIMIT 1;
$$
LANGUAGE sql;
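For example, to collect up to 3 connected line ids within 2 meters, starting from line 1 (the ARRAY[1] seeds the exclusion list with the start id, matching how the chunk query below calls it):
SELECT find_connected(1, 2, 3, ARRAY[1]);
-- returns something like {1,2,3}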
-- Create id chunks
WITH RECURSIVE groups_r AS (
(SELECT find_connected(id, 2, 3, ARRAY[id]) AS idlist, find_connected(id, 2, 3, ARRAY[id]) AS grouplist, id
FROM lines WHERE id = 1)
UNION ALL
(SELECT array_cat(groups_r.idlist, find_connected(lines.id, 2, 3, groups_r.idlist)) AS idlist,
find_connected(lines.id, 2, 3, groups_r.idlist) AS grouplist,
lines.id
FROM lines,
groups_r
WHERE NOT groups_r.idlist #> ARRAY[lines.id]
LIMIT 1))
SELECT
-- (SELECT array_agg(DISTINCT x) FROM unnest(idlist) t (x)) idlist, -- left for better understanding what is happening
row_number() OVER () chunk_id,
(SELECT array_agg(DISTINCT x) FROM unnest(grouplist) t (x)) grouplist,
id input_line_id
FROM groups_r;
The only problem is that performance gets quite poor as the number of ids per chunk increases. For a table with 300 rows and 20 ids per chunk, execution time is around 15 minutes, even with indexes on the geometry and id columns.

Copy value from one row to another row in PostgreSQL

I have a table like this:
 id | product | amount
----+---------+--------
  1 | A       |      6
  1 | A       |      8
  1 | A       |
  1 | B       |      1
  1 | B       |
  2 | C       |      2
  2 | C       |
  2 | C       |      4
  2 | C       |
  2 | C       |
and I need to make it like this:
 id | product | amount
----+---------+--------
  1 | A       |      6
  1 | A       |      8
  1 | A       |      8
  1 | B       |      1
  1 | B       |      1
  2 | C       |      2
  2 | C       |      2
  2 | C       |      4
  2 | C       |      4
  2 | C       |      4
Each missing amount should be copied from the previous non-missing value.
I tried to use the lag() function; however, the window function lag() is not allowed in UPDATE:
update tableA set amount = lag(amount);
What can I do using PostgreSQL?
You can SELECT what you want to UPDATE, but there is no (easy) way to actually do the UPDATE, because the table fox does not have a primary key (yet).
CREATE TABLE fox (
id integer NOT NULL,
product text NOT NULL,
amount integer
);
To populate the fox with some data.
INSERT INTO fox VALUES
(1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL),
(3, 'What does the fox say?', 5);
The query.
WITH ranks (rank, id, product, amount) AS (
SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
SELECT r.id, r.product,
(SELECT amount FROM ranks
WHERE id = r.id AND product = r.product
AND rank < r.rank AND amount IS NOT NULL
ORDER BY rank DESC LIMIT 1
)
FROM ranks r WHERE r.amount IS NULL ORDER BY 1, 2, 3;
Yields the rows which previously had a NULL and now have the appropriate amount.
id | product | amount
----+---------+--------
1 | A | 8
1 | B | 1
2 | C | 2
2 | C | 4
2 | C | 4
But you cannot use this data to update, because rows are still not uniquely identified by (id, product) - which means you cannot write a WHERE condition identifying your rows uniquely. How would the WHERE clause know whether to change the amount to 2 or 4 in the UPDATE? The multiple rows with (id, product) = (2, 'C') are indistinguishable in the WHERE of the UPDATE.
Let's give the fox a primary key.
ALTER TABLE fox ADD COLUMN IF NOT EXISTS pkey serial ;
ALTER TABLE fox ADD PRIMARY KEY (pkey) ;
Now we can identify the rows by the PRIMARY KEY pkey.
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
)
SELECT pkey,
id, product, -- you can leave these out in your UPDATE: pkey is UNIQUE
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n ORDER BY 1, 2, 3, 4;
to display the changes to be made
pkey | id | product | amount
------+----+---------+--------
3 | 1 | A | 8
5 | 1 | B | 1
7 | 2 | C | 2
9 | 2 | C | 4
10 | 2 | C | 4
And we can use pkey in the UPDATE.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
WITH nulls AS (
SELECT pkey, id, product
FROM fox
WHERE amount IS NULL
), changes AS (
SELECT pkey,
(SELECT amount FROM fox
WHERE id = n.id AND product = n.product
AND n.pkey > pkey AND amount IS NOT NULL
ORDER BY pkey DESC LIMIT 1)
FROM nulls n
) UPDATE fox f SET amount = c.amount FROM changes c WHERE f.pkey = c.pkey ;
Check the result is okay:
SELECT * FROM fox ORDER BY 1, 2, 3, 4;
And accept using COMMIT or ROLLBACK accordingly.
Alternative to adding a PRIMARY KEY
Every table should always have a primary key.
If you insist on not having one, then you could compute the rows with their then-not-NULL amount and, instead of UPDATEing them, INSERT them into your table and then DELETE FROM fox WHERE amount IS NULL to remove the rows which had no amount. This way you get around adding a primary key. Of course the INSERT and DELETE have to be packaged into a TRANSACTION so as not to interfere with other transactions running concurrently: if another transaction added rows with NULL amount after you calculated the data to be INSERTed and before you DELETEd all NULL amounts, you would silently drop that row (data loss due to concurrency; think ACID).
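A minimal sketch of that alternative, reusing the ranks idea from above (this assumes the original three-column fox, without the pkey):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
WITH ranks (rank, id, product, amount) AS (
SELECT ROW_NUMBER() OVER (), id, product, amount FROM fox
)
INSERT INTO fox (id, product, amount)
SELECT r.id, r.product,
(SELECT amount FROM ranks
WHERE id = r.id AND product = r.product
AND rank < r.rank AND amount IS NOT NULL
ORDER BY rank DESC LIMIT 1) -- previous non-NULL amount
FROM ranks r WHERE r.amount IS NULL ;
DELETE FROM fox WHERE amount IS NULL ;
COMMIT ;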
But a missing primary key will probably bite you later on, anyway.
Without knowing what defines "previous rows", all of this is a guess. But you can use an anonymous block to do what you want; just adjust it to your needs:
CREATE TEMPORARY TABLE test_lag AS
SELECT column1 AS id, column2 AS product, column3 AS amount FROM (
VALUES (1, 'A', 6),
(1, 'A', 8),
(1, 'A', NULL),
(1, 'B', 1),
(1, 'B', NULL),
(2, 'C', 2),
(2, 'C', NULL),
(2, 'C', 4),
(2, 'C', NULL),
(2, 'C', NULL)) AS tmp;
DO $$
BEGIN
--Loop until all null amounts are updated
--Why do we need this? Because PostgreSQL doesn't support the IGNORE NULLS clause on lag()
LOOP
WITH tmp AS (
SELECT ctid, lag(amount) OVER (ORDER BY id, product) AS last_amount FROM test_lag -- you MUST change this ORDER BY to the right columns (what is the previous row?)
)
UPDATE test_lag SET amount = tmp.last_amount FROM tmp WHERE test_lag.ctid = tmp.ctid AND amount IS NULL;
IF NOT FOUND THEN
EXIT;
END IF;
END LOOP;
END $$;
SELECT * FROM test_lag ORDER BY id, product, amount;

How to query the data in a join table by two sets of joined records?

I've got three tables: users, courses, and grades, the latter of which joins users and courses with some metadata like the user's score for the course. I've created a SQLFiddle, though the site doesn't appear to be working at the moment. The schema looks like this:
CREATE TABLE users(
id INT,
name VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO users VALUES
(1, 'Beth'),
(2, 'Alice'),
(3, 'Charles'),
(4, 'Dave');
CREATE TABLE courses(
id INT,
title VARCHAR,
PRIMARY KEY (ID)
);
INSERT INTO courses VALUES
(1, 'Biology'),
(2, 'Algebra'),
(3, 'Chemistry'),
(4, 'Data Science');
CREATE TABLE grades(
id INT,
user_id INT,
course_id INT,
score INT,
PRIMARY KEY (ID)
);
INSERT INTO grades VALUES
(1, 2, 2, 89),
(2, 2, 1, 92),
(3, 1, 1, 93),
(4, 1, 3, 88);
I'd like to know how (if possible) to construct a query which specifies some users.id values (1, 2, 3) and courses.id values (1, 2, 3) and returns those users' grades.score values for those courses:
| name | Algebra | Biology | Chemistry |
|---------|---------|---------|-----------|
| Alice | 89 | 92 | |
| Beth | | 93 | 88 |
| Charles | | | |
In my application logic, I'll be receiving an array of user_ids and course_ids, so the query needs to select those users and courses dynamically by primary key. (The actual data set contains millions of users and tens of thousands of courses—the examples above are just a sample to work with.)
Ideally, the query would:
use the course titles as dynamic attributes/column headers for the users' score data
sort the row and column headers alphabetically
include empty/NULL cells if the user-course pair has no grades relationship
I suspect I may need some combination of JOINs and Postgresql's crosstab, but I can't quite wrap my head around it.
Update: learning that the terminology for this is "dynamic pivot", I found this SO answer which appears to be trying to solve a related problem in Postgres with crosstab()
I think a simple pivot query should work here, since you only have 4 courses in your data set to pivot.
SELECT t1.name,
MAX(CASE WHEN t3.title = 'Biology' THEN t2.score ELSE NULL END) AS Biology,
MAX(CASE WHEN t3.title = 'Algebra' THEN t2.score ELSE NULL END) AS Algebra,
MAX(CASE WHEN t3.title = 'Chemistry' THEN t2.score ELSE NULL END) AS Chemistry,
MAX(CASE WHEN t3.title = 'Data Science' THEN t2.score ELSE NULL END) AS Data_Science
FROM users t1
LEFT JOIN grades t2
ON t1.id = t2.user_id
LEFT JOIN courses t3
ON t2.course_id = t3.id
GROUP BY t1.name
Follow the link below for a running demo. I used MySQL because, as you have noticed, SQLFiddle seems to be perpetually busted for the other databases.
SQLFiddle
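Since the question is about Postgres specifically, here is a sketch of the crosstab() approach the asker suspected. It assumes the tablefunc extension is available, and the hard-coded IN lists stand in for the dynamic id arrays coming from the application:
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT *
FROM crosstab(
$$ SELECT u.name, c.title, g.score
FROM users u
CROSS JOIN courses c
LEFT JOIN grades g ON g.user_id = u.id AND g.course_id = c.id
WHERE u.id IN (1, 2, 3) AND c.id IN (1, 2, 3)
ORDER BY 1, 2 $$,
$$ SELECT title FROM courses WHERE id IN (1, 2, 3) ORDER BY title $$
) AS ct(name VARCHAR, "Algebra" INT, "Biology" INT, "Chemistry" INT);
The two-argument form of crosstab() fills missing (user, course) pairs with NULL, which gives the empty cells, and the CROSS JOIN keeps users like Charles who have no grades at all. The column list in the AS clause still has to be spelled out, so a fully dynamic pivot needs the query text to be built in application code or PL/pgSQL.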