Postgres GROUP BY an array column - postgresql

I have a list of students and parents and would like to group them into families using the student IDs. Parents who share a student ID can be considered a family, and likewise students who share a parent ID can be considered a family. This is a sample table:
p_id | parent_name | s_id | student_name |
------------------------------------------|
1 | John Doe | 100 | Mike Doe |
3 | Jane Doe | 100 | Mike Doe |
3 | Jane Doe | 105 | Lisa Doe |
5 | Will Willy | 108 | William Son |
I'd like to end up with something like:
parents | students |
-------------------|------------------------|
John Doe, Jane Doe | Mike Doe, Lisa Doe |
Will Willy | William Son |
To achieve this I'm currently using:
SELECT array_agg(parents) AS parents FROM (
SELECT array_agg(p_id) AS par_ids, array_agg(parent_name) AS parents, student_name, s_id
FROM (
/* sub query */
)b
GROUP BY s_id, student_name
ORDER BY parents ASC
)c
GROUP BY unnest(par_ids)
ORDER BY parents ASC
But I get an error: ERROR: cannot accumulate arrays of different dimensionality. SQL state: 2202E
How can I attain the desired results?
The inner query from the above statement returns:
| par_ids | parents | student_name | s_id |
--------------------------------|------------------------|
| {1,3} | {John Doe, Jane Doe}| Mike Doe | 100 |
| {3} | {Jane Doe} | Lisa Doe | 105 |
| {5} | {Will Willy} | William Son | 108 |
Grouping these students now to the parents is where I'm stuck.

I did something similar (but a bit more complex) already here: https://stackoverflow.com/a/53129510/3984221
Step-by-step demo: db<>fiddle
SELECT
array_agg(parent_name) as parents, -- 4
array_agg(student_name) as students
FROM (
SELECT DISTINCT ON (t.s_id) -- 3
*
FROM (
SELECT
s_id,
array_agg(p_id) as parents -- 1
FROM mytable
GROUP BY s_id
) s JOIN mytable t ON t.p_id = ANY(s.parents) -- 2
ORDER BY t.s_id, CARDINALITY(parents) DESC -- 3
) s
GROUP BY parents
Aggregate the p_id values into an array:
s_id | parents |
-----|---------|
108  | {5}     |
105  | {3}     |
100  | {1,3}   |
Self-join the original table on this array:
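s_id | parents | p_id | parent_name | s_id | student_name |
-----|---------|------|-------------|------|--------------|
100  | {1,3}   | 1    | John Doe    | 100  | Mike Doe     |
105  | {3}     | 3    | Jane Doe    | 100  | Mike Doe     |
100  | {1,3}   | 3    | Jane Doe    | 100  | Mike Doe     |
105  | {3}     | 3    | Jane Doe    | 105  | Lisa Doe     |
100  | {1,3}   | 3    | Jane Doe    | 105  | Lisa Doe     |
108  | {5}     | 5    | Will Willy  | 108  | William Son  |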
Remove all duplicate student records. The remaining ones should be the records with the most complete p_id array. This can be done using DISTINCT ON(s_id) on a descending order by the array length:
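s_id | parents | p_id | parent_name | s_id | student_name |
-----|---------|------|-------------|------|--------------|
100  | {1,3}   | 1    | John Doe    | 100  | Mike Doe     |
100  | {1,3}   | 3    | Jane Doe    | 105  | Lisa Doe     |
108  | {5}     | 5    | Will Willy  | 108  | William Son  |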
Finally you can group by the p_id array and aggregate the two name columns:
parents                 | students                |
------------------------|-------------------------|
{"John Doe","Jane Doe"} | {"Mike Doe","Lisa Doe"} |
{"Will Willy"}          | {"William Son"}         |
If you don't want an array but a string list, you can use string_agg(name_column, ',') instead of array_agg(name_column).
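The four steps above can also be traced outside the database. The following is a minimal Python sketch of the same idea, not the answer's SQL: the function name group_families and the in-memory rows list are assumptions made for illustration. Like the query, it does a single merge pass (each student adopts the largest parent set that overlaps its own), so it is not a full transitive closure over long chains of shared parents.

```python
from collections import defaultdict

# Sample rows from the question: (p_id, parent_name, s_id, student_name)
rows = [
    (1, "John Doe", 100, "Mike Doe"),
    (3, "Jane Doe", 100, "Mike Doe"),
    (3, "Jane Doe", 105, "Lisa Doe"),
    (5, "Will Willy", 108, "William Son"),
]


def group_families(rows):
    """Hypothetical helper mirroring steps 1-4 of the answer."""
    # Step 1: aggregate the parent ids of every student.
    parents_of = defaultdict(set)
    for p_id, _, s_id, _ in rows:
        parents_of[s_id].add(p_id)

    # Steps 2 and 3: for each student, look at every parent set that shares
    # at least one parent with the student's own set and keep the largest.
    best = {}
    for s_id, own in parents_of.items():
        candidates = [ps for ps in parents_of.values() if ps & own]
        best[s_id] = frozenset(max(candidates, key=len))

    # Step 4: group the students by the chosen parent set.
    parent_name = {p_id: name for p_id, name, _, _ in rows}
    student_name = {s_id: name for _, _, s_id, name in rows}
    families = defaultdict(list)
    for s_id in sorted(best):
        families[best[s_id]].append(student_name[s_id])

    return [
        ([parent_name[p] for p in sorted(parent_ids)], students)
        for parent_ids, students in families.items()
    ]


for parents, students in group_families(rows):
    print(", ".join(parents), "|", ", ".join(students))
```

Because the parent names are looked up from the chosen id set rather than aggregated from whichever joined row survives, the sketch does not depend on how ties between equally long arrays are broken.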

Related

How to select rows based on properties of another row?

Had a question..
| a_id | name | r_id | message | date
_____________________________________________
| 1 | bob | 77 | bob here | 1-jan
| 1 | bob | 77 | bob here again | 2-jan
| 2 | jack | 77 | jack here. | 2-jan
| 1 | bob | 79 | in another room| 3-feb
| 3 | gill | 79 | gill here | 4-feb
These are basically accounts (a_id) chatting inside different rooms (r_id)
I'm trying to find the last chat message for every room that jack (a_id = 2) is chatting in.
What I've tried so far is using DISTINCT ON (r_id) ... ORDER BY r_id, date DESC.
But this incorrectly gives me the last message in every room instead of only the last message in every room that jack belongs to.
| 2 | jack | 77 | jack here. | 2-jan
| 3 | gill | 79 | gill here | 4-feb
Is this a partition problem instead of a DISTINCT ON problem?
I would suggest:
to group the rows by r_id with a GROUP BY clause
to select only the groups that include a_id = 2 with a HAVING clause which aggregates the a_id values of each group: HAVING array_agg(a_id) @> array[2]
to select the latest message of each selected group by aggregating its rows in an array with ORDER BY date DESC and selecting the first element of the array: (array_agg(t.*))[1]
to convert the selected rows into json objects and then display the expected result by using the json_populate_record function
The full query is :
SELECT (json_populate_record(null :: my_table, (array_agg(to_json(t.*)))[1])).*
FROM my_table AS t
GROUP BY r_id
HAVING array_agg(a_id) @> array[2]
and the result is :
a_id | name | r_id | message  | date       |
-----|------|------|----------|------------|
1    | bob  | 77   | bob here | 2022-01-01 |
see dbfiddle
For last message in every chat room simply would be:
select a_id, name, r_id, to_char(max(date),'dd-mon') from chats
where a_id =2
group by r_id, a_id,name;
Fiddle https://www.db-fiddle.com/f/keCReoaXg2eScrhFetEq1b/0
Or seeing messages
with last_message as (
select a_id, name, r_id, to_char(max(date),'dd-mon') date from chats
where a_id =1
group by r_id, a_id,name
)
select l.*, c.message
from last_message l
join chats c on (c.a_id= l.a_id and l.r_id=c.r_id and l.date=to_char(c.date,'dd-mon'));
Fiddle https://www.db-fiddle.com/f/keCReoaXg2eScrhFetEq1b/1
Though all this complication could be avoided with a primary key on your table.
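To make the requirement concrete (only rooms that jack is in, and the latest message in each of them regardless of who wrote it), here is a small sketch that runs plain SQL through Python's sqlite3 module. It is an illustration under assumptions, not one of the answers above: the table name chats is borrowed from the second answer, the year 2022 is assumed so that the dates sort as ISO strings, and last_messages_for is a hypothetical helper. Rows that tie on the latest date are all returned, which is exactly where a primary key or a full timestamp would help.

```python
import sqlite3

# Sample rows from the question; the year is an assumption.
CHATS = [
    (1, "bob", 77, "bob here", "2022-01-01"),
    (1, "bob", 77, "bob here again", "2022-01-02"),
    (2, "jack", 77, "jack here.", "2022-01-02"),
    (1, "bob", 79, "in another room", "2022-02-03"),
    (3, "gill", 79, "gill here", "2022-02-04"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chats (a_id INTEGER, name TEXT, r_id INTEGER, "
    "message TEXT, date TEXT)"
)
conn.executemany("INSERT INTO chats VALUES (?, ?, ?, ?, ?)", CHATS)


def last_messages_for(conn, account_id):
    """Latest message(s) of every room the given account has posted in."""
    sql = """
        SELECT c.a_id, c.name, c.r_id, c.message, c.date
        FROM chats AS c
        WHERE c.r_id IN (SELECT m.r_id FROM chats AS m WHERE m.a_id = ?)
          AND c.date = (SELECT max(l.date) FROM chats AS l
                        WHERE l.r_id = c.r_id)
        ORDER BY c.r_id, c.a_id
    """
    return conn.execute(sql, (account_id,)).fetchall()


result = last_messages_for(conn, 2)
for row in result:
    print(row)
```

With the sample data, room 79 is filtered out because jack never posted there, and room 77 yields two rows because bob and jack both posted on the same latest date.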

Return rows which have the same values in two columns, but different values in another

I have a table that looks like this:
id | name | address | code
-----------+--------------------------+--------------------+----------
101 | joe smith | 1 long road | SC1
102 | joe smith | 6 long road | SC1
103 | amy hughes | 5 hillside lane | SC5
104 | amy hughes | 5 hillside lane | SC5
I want to return the rows that are duplications based on name and code but have different address fields.
I had something like this originally (which looked for duplications across the name, address and code columns):
SELECT name, address, code, count(*)
FROM table_name
GROUP BY 1,2,3
HAVING count(*) >1;
Is there a way I can expand on the above to only return rows that have the same name and code but different address fields?
In my example data above, I would only want to return:
id | name | address | code
-----------+--------------------------+--------------------+----------
101 | joe smith | 1 long road | SC1
102 | joe smith | 6 long road | SC1
Remove address from the select list and GROUP BY and use count(DISTINCT):
SELECT name, code, count(DISTINCT address)
FROM table_name
GROUP BY name, code
HAVING count(DISTINCT address) > 1;
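The query above returns one row per (name, code) group rather than the individual rows with their id values. One possible way to get the original rows back is to join the table to that grouped result. The sketch below runs this through Python's sqlite3 module purely as an illustration; the join-back step and the helper name conflicting_rows are assumptions, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name (id INTEGER, name TEXT, address TEXT, code TEXT)"
)
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?, ?)",
    [
        (101, "joe smith", "1 long road", "SC1"),
        (102, "joe smith", "6 long road", "SC1"),
        (103, "amy hughes", "5 hillside lane", "SC5"),
        (104, "amy hughes", "5 hillside lane", "SC5"),
    ],
)


def conflicting_rows(conn):
    """Rows sharing name and code with another row but not its address."""
    sql = """
        SELECT t.id, t.name, t.address, t.code
        FROM table_name AS t
        JOIN (
            -- groups with more than one distinct address
            SELECT name, code
            FROM table_name
            GROUP BY name, code
            HAVING count(DISTINCT address) > 1
        ) AS d ON d.name = t.name AND d.code = t.code
        ORDER BY t.id
    """
    return conn.execute(sql).fetchall()


result = conflicting_rows(conn)
for row in result:
    print(row)
```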

Postgres join when only one row is equal

I have two tables and I am wanting to do an inner join between table_1 and table_2 but only when there is one row in table_2 that meets the join criteria.
For example:
table_1
id | name | age |
-----------------+------------------+--------------+
1 | john jones | 10 |
2 | pete smith | 15 |
3 | mary lewis | 12 |
4 | amy roberts | 13 |
table_2
id | name | age | hair | height |
-----------------+------------------+--------------+--------------+--------------+
1 | john jones | 10 | brown | 100 |
2 | john jones | 10 | blonde | 132 |
3 | mary lewis | 12 | brown | 146 |
4 | pete smith | 15 | black | 171 |
So I want to do a join when name is equal, but only when there is one corresponding matching name in table_2
So my results would look like this:
id | name | age | hair |
-----------------+------------------+--------------+--------------+
2 | pete smith | 15 | black |
3 | mary lewis | 12 | brown |
As you can see, John Jones isn't in the results as there are two corresponding rows in table_2.
My initial code looks like this:
select tb.id,tb.name,tb.age,sc.hair
from table_1 tb
inner join table_2 sc
on tb.name = sc.name and tb.age = sc.age
Can I apply a clause within the join so that it only joins on rows which are unique matches?
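Group by the table_1 columns and apply having count(*) = 1. Because every remaining group holds exactly one table_2 row, min(sc.hair) simply returns that row's hair value:
select tb.id, tb.name, tb.age, min(sc.hair) as hair
from table_1 tb
join table_2 sc
on tb.name = sc.name and tb.age = sc.age
group by tb.id, tb.name, tb.age
having count(*) = 1
Note that sc.hair must stay out of the GROUP BY; otherwise the brown and blonde rows for John Jones would form two groups of one row each and both would pass the filter. The interesting thing to note is that you don't need the aggregate expression (in this case count(*)) in the select clause.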
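As a quick check against the sample tables, the following sketch runs a grouped join through Python's sqlite3 module. It groups by the table_1 columns only and wraps hair in min(), which is an assumption about the intended behaviour (grouping by hair as well would let both John Jones rows through, since each would be a group of one). The helper name unique_matches is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER, name TEXT, age INTEGER)")
conn.execute(
    "CREATE TABLE table_2 (id INTEGER, name TEXT, age INTEGER, "
    "hair TEXT, height INTEGER)"
)
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?)",
    [
        (1, "john jones", 10),
        (2, "pete smith", 15),
        (3, "mary lewis", 12),
        (4, "amy roberts", 13),
    ],
)
conn.executemany(
    "INSERT INTO table_2 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "john jones", 10, "brown", 100),
        (2, "john jones", 10, "blonde", 132),
        (3, "mary lewis", 12, "brown", 146),
        (4, "pete smith", 15, "black", 171),
    ],
)


def unique_matches(conn):
    """table_1 rows that match exactly one table_2 row on name and age."""
    sql = """
        SELECT tb.id, tb.name, tb.age, min(sc.hair) AS hair
        FROM table_1 AS tb
        JOIN table_2 AS sc ON tb.name = sc.name AND tb.age = sc.age
        GROUP BY tb.id, tb.name, tb.age
        HAVING count(*) = 1
        ORDER BY tb.id
    """
    return conn.execute(sql).fetchall()


result = unique_matches(conn)
for row in result:
    print(row)
```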

PostgreSQL COUNT DISTINCT on one column while checking duplicates of another column

I have a query that results in such a table:
guardian_id | child_id | guardian_name | relation | child_name |
------------|----------|---------------|----------|------------|
1 | 1 | John Doe | father | Doe Son |
2 | 1 | Jane Doe | mother | Doe Son |
3 | 2 | Peter Pan | father | Pan Dghter |
4 | 2 | Pet Pan | mother | Pan Dghter |
1 | 3 | John Doe | father | Doe Dghter |
2 | 3 | Jane Doe | mother | Doe Dghter |
So from these results, I need to count the families, that is, distinct groups of children with the same guardians. In the results above there are 3 children but 2 families. How can I achieve this?
If I do:
SELECT COUNT(DISTINCT child_id) as families FROM (
-- larger query
)a
I'll get 3 which is not correct.
Alternatively, how can I incorporate a WHERE clause that checks DISTINCT guardian_id's? Any other approaches?
Also note that there are instances where a child may have one guardian only.
To get the distinct family you can try the following approach.
select distinct array_agg(distinct guardian_id)
from family
group by child_id;
The above query will return the list of unique families.
eg.
{1,2}
{3,4}
Now you can apply the count on top of it.
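The counting step can be pictured as "collect each child's set of guardians, then count the distinct sets". Here is a minimal Python sketch of that idea using the sample data; the function name count_families and the (guardian_id, child_id) pair list are assumptions for illustration, and the sketch treats two children as one family only when their guardian sets are identical, as the query above does.

```python
from collections import defaultdict

# (guardian_id, child_id) pairs taken from the sample result set
pairs = [(1, 1), (2, 1), (3, 2), (4, 2), (1, 3), (2, 3)]


def count_families(pairs):
    """Count distinct guardian sets, one set per child."""
    guardians_of = defaultdict(set)
    for guardian_id, child_id in pairs:
        guardians_of[child_id].add(guardian_id)
    # Children 1 and 3 both map to {1, 2}, so they collapse into one family.
    return len({frozenset(ids) for ids in guardians_of.values()})


print(count_families(pairs))
```

A child with a single guardian simply contributes a one-element set, so that case needs no special handling.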

MS Access Group By breaks when using a date

For some reason using a date/time field in a select query with Group By in Access 2010 breaks (records are not properly "grouped by" the text field first, showing the same "aTextField" value multiple times). I am able to replicate the issue in a simple, one table query. Ex:
SELECT aTextField, SUM(aIntField) AS SumOfaIntField
FROM simpleTable
GROUP BY aTextField, aDateField
HAVING aDateField >= Date()
ORDER BY aTextField;
Once you remove "aDateField" from the query (the Group By and Having lines), it works properly. I can even remove just the HAVING line and it still breaks, leading me to believe that the problem lies with the Group By.
Any feedback would be great. Thanks!
EDIT More details
**simpleTable**
--------------------------------------------
| ID | aTextField | aIntField | aDateField |
============================================
| 1 | John Doe | 1 | 3/14/2013 |
| 2 | John Doe | | 3/15/2013 |
| 3 | Jane Doe | 1 | 3/15/2013 |
| 4 | John Doe | 2 | 3/18/2013 |
| 5 | Jane Doe | 1 | 3/19/2013 |
| 6 | John Doe | | 3/20/2013 |
| 7 | John Doe | 3 | 3/21/2013 |
| 8 | Jane Doe | 1 | 3/19/2013 |
| 9 | John Doe | | 3/22/2013 |
| 10 | Jane Doe | 2 | 3/20/2013 |
| 11 | Jane Doe | | 3/21/2013 |
| 12 | Jane Doe | | 3/22/2013 |
--------------------------------------------
**Expected Result**
-------------------------------
| aTextField | SumOfaIntField |
===============================
| Jane Doe | 4 |
| John Doe | 3 |
-------------------------------
**Actual Result**
-------------------------------
| aTextField | SumOfaIntField |
===============================
| Jane Doe | 2 |
| Jane Doe | 2 |
| Jane Doe | |
| Jane Doe | |
| John Doe | |
| John Doe | 3 |
| John Doe | |
-------------------------------
So what appears to be happening is that there is a separate row for each date as well. I just need to filter by the date, not group by it. However, Access will not accept the query without grouping by it. Options?
You're grouping by aTextField and aDateField. Perhaps simpleTable includes rows where the date is the same, but the time of day is different. In that case your grouping would produce a row for each date/time combination.
Whether or not that was the explanation, you should check what the db engine actually evaluates by including aDateField in the SELECT list.
SELECT aTextField, aDateField, SUM(aIntField)
FROM simpleTable
GROUP BY aTextField, aDateField
HAVING aDateField >= Date()
ORDER BY aTextField;
Also consider using a WHERE instead of HAVING clause:
WHERE aDateField >= Date()
Based on your sample data, I suspect you want ...
SELECT aTextField, SUM(aIntField)
FROM simpleTable
WHERE aDateField >= Date()
GROUP BY aTextField
ORDER BY aTextField;
You should be able to use the following:
SELECT aTextField, SUM(aIntField) AS SumOfaIntField
FROM simpleTable
WHERE aDateField >= Date()
GROUP BY aTextField
ORDER BY aTextField;
You will notice that I removed the GROUP BY on the aDateField column. Since you want the total for each aTextField, then you do not need to group by the date. Grouping by date will result in a separate row for each distinct date.
Note: this query was tested in MS Access 2010 and generated your desired result.
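The difference between the two forms can be reproduced outside Access. The sketch below loads the sample rows into SQLite through Python's sqlite3 module and runs both queries; it is only an illustration, since SQLite is not the Access engine. A fixed cutoff of 2013-03-19 stands in for Date(), an assumption chosen because it is the date for which the sample data yields the expected totals.

```python
import sqlite3

ROWS = [
    (1, "John Doe", 1, "2013-03-14"),
    (2, "John Doe", None, "2013-03-15"),
    (3, "Jane Doe", 1, "2013-03-15"),
    (4, "John Doe", 2, "2013-03-18"),
    (5, "Jane Doe", 1, "2013-03-19"),
    (6, "John Doe", None, "2013-03-20"),
    (7, "John Doe", 3, "2013-03-21"),
    (8, "Jane Doe", 1, "2013-03-19"),
    (9, "John Doe", None, "2013-03-22"),
    (10, "Jane Doe", 2, "2013-03-20"),
    (11, "Jane Doe", None, "2013-03-21"),
    (12, "Jane Doe", None, "2013-03-22"),
]
CUTOFF = "2013-03-19"  # assumed stand-in for Date()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simpleTable (ID INTEGER, aTextField TEXT, "
    "aIntField INTEGER, aDateField TEXT)"
)
conn.executemany("INSERT INTO simpleTable VALUES (?, ?, ?, ?)", ROWS)

# Grouping by the date as well: one row per name/date combination.
by_text_and_date = conn.execute(
    """
    SELECT aTextField, SUM(aIntField)
    FROM simpleTable
    GROUP BY aTextField, aDateField
    HAVING aDateField >= ?
    ORDER BY aTextField, aDateField
    """,
    (CUTOFF,),
).fetchall()

# Filtering with WHERE first, then grouping by the text field only.
by_text = conn.execute(
    """
    SELECT aTextField, SUM(aIntField)
    FROM simpleTable
    WHERE aDateField >= ?
    GROUP BY aTextField
    ORDER BY aTextField
    """,
    (CUTOFF,),
).fetchall()

print(len(by_text_and_date), by_text)
```

WHERE filters rows before they are grouped, so the date never has to appear in the GROUP BY; HAVING filters after grouping, which is why the first form needs the date as a grouping column.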
I think you are misunderstanding how GROUP BY works. You should be seeing each aTextField value once for every unique textfield/datetime combination.
Sample
a 2012-01-01
a 2012-01-01
b 2012-01-01
b 2012-01-02
b 2012-01-02
group by aTextField, aDateField
a 2012-01-01
b 2012-01-01
b 2012-01-02
group by aTextField
a
b