Group rows by the largest grouping columns - postgresql

I have a table like this:
Item | title | image     | hash
-------------------------------
A    | cat   | cat_x.jpg | x-123
B    | cat   | cat_y.jpg | y-123
C    | dog1  | dog.jpg   | xyx
D    | dog2  | dog.jpg   | xyx
I can group items by the same title, image or hash column and get an array of the grouped items.
For example, A and B could be grouped by title
SELECT
title,
array_agg(item) AS items
FROM products
GROUP BY 1
title | items
-------------
cat   | {A,B}
dog1  | {C}
dog2  | {D}
but this is not good for C and D which should be grouped by image (dog.jpg)
SELECT
image,
array_agg(item) AS items
FROM products
GROUP BY 1
image     | items
-----------------
cat_x.jpg | {A}
cat_y.jpg | {B}
dog.jpg   | {C,D}
and the same for "hash" column.
So if two rows have the same title, image or hash (or all of them), they should be merged into the same group. If two groups have at least one item in common, they need to be merged into the same group.
If X and Y are grouped by title, and Y and Z are grouped by image, then X, Y and Z should be in the same array.
In the end, I don't want two groups containing the same item.
Each Item should belong to only one group.
An idea could be:
Group items by title
Then group the result by image
Then group the result by hash
Products table contains about 100,000 records.
My progress so far
At the moment I have tried the following, but for 100,000 records it times out, even though I am using temp tables and indexes.
https://www.db-fiddle.com/f/v4RsA4jRdYz3JHmph3Mykj/1
create temp table titles AS (
SELECT
title,
array_agg(item) AS items
FROM products
GROUP BY 1
);
create temp table images AS (
SELECT
image,
array_agg(item) AS items
FROM products
GROUP BY 1
);
create temp table hashes AS (
SELECT
hash,
array_agg(item) AS items
FROM products
GROUP BY 1
);
CREATE INDEX idx_items_titles on titles USING GIN ("items");
CREATE INDEX idx_items_images on images USING GIN ("items");
CREATE INDEX idx_items_hashes on hashes USING GIN ("items");
SELECT
ARRAY( SELECT DISTINCT e FROM unnest(items || title_items || hash_items) AS a(e) )
FROM images,
LATERAL (
SELECT
titles.items AS title_items
FROM titles
WHERE titles.items && images.items
) x,
LATERAL (
SELECT
hashes.items AS hash_items
FROM hashes
WHERE hashes.items && images.items
) y
GROUP BY 1
Desired output
Groups
{B,A}
{D,C}

Those lateral joins and the array columns look very expensive.
I would instead use a different approach where you add an extra column for the representative element of the respective group, like in a disjoint-set data structure.
A simple definition of a canonical representation for each row might be the smallest id of any other row that shares either the same image, title or hash. Using a window function, we can easily (and efficiently) compute that:
SELECT
array_agg(item) AS items,
array_agg(DISTINCT title) AS titles,
array_agg(DISTINCT image) AS images,
array_agg(DISTINCT hash) AS hashes
FROM (
SELECT *, LEAST (
MIN(item) OVER (PARTITION BY title),
MIN(item) OVER (PARTITION BY image),
MIN(item) OVER (PARTITION BY hash)
) AS group_id
FROM products
) AS tmp
GROUP BY tmp.group_id;
(online demo on your sample data)
Unfortunately, it is also wrong, since it doesn't handle transitive equivalence. You can check this example, where A and D contain no shared values, but they do both share some values (their titles) with other rows that have the same value for another column (their images).
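To make the failure concrete, here is a small Python sketch of the same one-pass rule (take the smallest item that shares a title, image or hash). The four rows are made-up data, not the rows from the linked example: A and B share a title, B and C share an image, and C and D share a title, so all four should end up together.

```python
# Hypothetical rows (item, title, image, hash); not the data from the question.
rows = [
    ("A", "t1", "i1", "h1"),
    ("B", "t1", "i2", "h2"),
    ("C", "t2", "i2", "h3"),
    ("D", "t2", "i3", "h4"),
]


def one_pass_group_ids(rows):
    """Mimic LEAST(MIN(item) OVER (PARTITION BY title), ... image, ... hash)."""
    smallest = {}  # (column index, value) -> smallest item having that value
    for row in rows:
        for col in (1, 2, 3):
            key = (col, row[col])
            smallest[key] = min(smallest.get(key, row[0]), row[0])
    # Each row takes the smallest representative among its three partitions.
    return {
        row[0]: min(smallest[(col, row[col])] for col in (1, 2, 3))
        for row in rows
    }


group_ids = one_pass_group_ids(rows)
print(group_ids)  # {'A': 'A', 'B': 'A', 'C': 'B', 'D': 'C'}
```

The rows are all connected through shared values, yet the single pass produces three different group ids, because C only learns about B and never about A.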
To fix this issue, one will need to actually run the union-find algorithm, repeatedly merging the groups until you end up with the desired equivalence classes. This is not exactly trivial, but can be done with a recursive common table expression:
WITH RECURSIVE eqiv AS (
SELECT id, MIN(id) OVER (PARTITION BY title) AS group_id
FROM products
UNION
SELECT id, MIN(id) OVER (PARTITION BY image) AS group_id
FROM products
UNION
SELECT id, MIN(id) OVER (PARTITION BY hash) AS group_id
FROM products
),
rel AS (
SELECT id, id AS group_id
FROM products
UNION
SELECT eqiv.id, rel.group_id
FROM rel
JOIN eqiv ON (rel.id = eqiv.group_id)
)
SELECT
array_agg(id) AS items,
array_agg(DISTINCT title) AS titles,
array_agg(DISTINCT image) AS images,
array_agg(DISTINCT hash) AS hashes
FROM products
GROUP BY (SELECT MIN(group_id) FROM rel WHERE rel.id = products.id);
(online demo)
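If it helps to see the idea outside of SQL, below is a minimal union-find sketch in Python. It is not a translation of the query above, just an illustration of the disjoint-set approach; the function name `group_items` and the choice to skip `None` values are my own assumptions (note that `PARTITION BY` in SQL would put NULLs into one partition instead).

```python
from collections import defaultdict


def group_items(rows):
    """Group items that share a title, image or hash, directly or transitively."""
    parent = {item: item for item, *_ in rows}

    def find(x):
        # Walk up to the representative, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller item as the representative of the merged set.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # (column index, value) -> first item seen with that value
    for item, *values in rows:
        for col, value in enumerate(values):
            if value is None:
                continue  # assumption: missing values never link two rows
            union(item, first_seen.setdefault((col, value), item))

    groups = defaultdict(list)
    for item, *_ in rows:
        groups[find(item)].append(item)
    return sorted(sorted(group) for group in groups.values())


# Sample data from the question: (item, title, image, hash)
products = [
    ("A", "cat", "cat_x.jpg", "x-123"),
    ("B", "cat", "cat_y.jpg", "y-123"),
    ("C", "dog1", "dog.jpg", "xyx"),
    ("D", "dog2", "dog.jpg", "xyx"),
]
print(group_items(products))  # [['A', 'B'], ['C', 'D']]
```

Each row is visited once and each shared value triggers at most one merge, so this stays close to linear in the number of rows, which is why doing the grouping in application code can be a reasonable fallback for a table of this size.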

Related

Postgres count total matches per group

Input data
I have the following association table:
AssociationTable
- Item ID: Integer
- Tag ID: Integer
Referring to the following example data
Item Tag
1 1
1 2
1 3
2 1
and some input list of tags T (e.g. [1, 2])
What I want
For each item, I would like to know how many of its tags were not provided in the input list T.
With our sample data, we'd get:
Item Num missing
1 1
2 0
My thoughts
The best I've done so far is: select "ItemId", count("TagId") as "Num missing" from "AssociationTab" where "TagId" not in (1, 2) group by "ItemId";
The problem here is that items where all tags match will not be included in the output.
You could use a calendar table with anti-join approach:
WITH cte AS (
SELECT t1.Item, t2.Tag
FROM (SELECT DISTINCT Item FROM AssociationTable) t1
CROSS JOIN (SELECT 1 AS Tag UNION ALL SELECT 2) t2
)
SELECT
t1.Item,
COUNT(*) FILTER (WHERE t2.Item IS NULL) AS num_missing
FROM cte t1
LEFT JOIN AssociationTable t2
ON t1.Item = t2.Item AND
t1.Tag = t2.Tag AND
t2.Tag IN (1, 2)
GROUP BY
t1.Item;
Demo
The strategy here is to build a calendar/reference table in the first CTE which contains all combinations of items and tags. Then, we left join this CTE to your association table, aggregate by item, and then detect how many tags are missing for each item.
Simplest solution is
SELECT
ItemId,
count(*) FILTER (WHERE TagId NOT IN (1,2))
FROM AssociationTab
GROUP BY ItemId
Alternatively, if you already have an Items table with the item list, you could do this:
SELECT
i.ItemId,
count(a.TagId)
FROM Items i
LEFT JOIN AssociationTab a ON a.ItemId = i.ItemId AND a.TagId NOT IN (1,2)
GROUP BY i.ItemId
The key is that LEFT JOIN does not remove the Items row if no tags match.
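As a cross-check of the expected output in the question, here is a small Python sketch of the same counting rule: for every item, count its tags that are not in T, and keep items whose count is zero. The function name `count_unlisted_tags` is just an illustrative choice.

```python
def count_unlisted_tags(associations, provided):
    """Return {item: number of the item's tags that are not in `provided`}."""
    provided = set(provided)
    counts = {}
    for item, tag in associations:
        # Register the item first, so items with no unlisted tags still show up as 0.
        counts.setdefault(item, 0)
        if tag not in provided:
            counts[item] += 1
    return counts


# Sample (Item, Tag) rows from the question, with T = [1, 2]
associations = [(1, 1), (1, 2), (1, 3), (2, 1)]
print(count_unlisted_tags(associations, [1, 2]))  # {1: 1, 2: 0}
```

Registering the item before testing the tag plays the same role as moving the condition out of WHERE and into FILTER (or into the LEFT JOIN's ON clause): the row is kept even when nothing is counted.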

How to use OPENJSON on multiple rows

I have a temp table with multiple rows in it, and each row has a column called Categories, which contains a very simple JSON array of ids for categories in a different table.
A few example rows of the temp table:
Id Name Categories
---------------------------------------------------------------------------------------------
'539f7e28-143e-41bb-8814-a7b93b846007' Test 1 ["category1Id", "category2Id", "category3Id"]
'f29e2ecf-6e37-4aa9-aa56-4a351d298bfc' Test 2 ["category1Id", "category2Id"]
'34e41a0a-ad92-4cd7-bf5c-8df6bfd6ed5c' Test 3 NULL
Now what I would like to do is to select all of the category ids from all of the rows in the temp table.
What I have is the following, but it's not working; it gives me this error:
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
SELECT
c.Id
,c.[Name]
,c.Color
FROM
dbo.Category as c
WHERE
c.Id in (SELECT [value] FROM OPENJSON((SELECT Categories FROM #TempTable)))
and c.IsDeleted = 0
I guess it makes sense that it fails, because the subquery returns multiple rows and I need to parse each row's category ids JSON separately. I'm just not sure what to change to get the results I want. Thank you in advance for any help.
You'd need to use CROSS APPLY like so:
SELECT id ,
name ,
t.Value AS category_id
FROM #temp
CROSS APPLY OPENJSON(categories, '$') t;
And then, you can JOIN to your Categories table using the category_id column, something like this:
SELECT id ,
name ,
t.Value AS category_id,
c.*
FROM #temp
CROSS APPLY OPENJSON(categories, '$') t
LEFT JOIN Categories c ON c.Id = t.Value
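To picture what the CROSS APPLY does, here is a rough Python analogy: each row's JSON array is parsed on its own and produces one output row per element. The helper name `explode_categories` is made up for illustration, and skipping rows whose array is NULL reflects my understanding that OPENJSON yields no rows for a NULL input, so CROSS APPLY drops them.

```python
import json


def explode_categories(rows):
    """Yield (id, name, category_id) for every element of each row's JSON array."""
    for row_id, name, categories in rows:
        if categories is None:
            continue  # no array, so no output rows for this input row
        for category_id in json.loads(categories):
            yield row_id, name, category_id


# Example rows from the question: (Id, Name, Categories)
temp_rows = [
    ("539f7e28-143e-41bb-8814-a7b93b846007", "Test 1",
     '["category1Id", "category2Id", "category3Id"]'),
    ("f29e2ecf-6e37-4aa9-aa56-4a351d298bfc", "Test 2",
     '["category1Id", "category2Id"]'),
    ("34e41a0a-ad92-4cd7-bf5c-8df6bfd6ed5c", "Test 3", None),
]

exploded = list(explode_categories(temp_rows))
distinct_ids = sorted({category_id for _, _, category_id in exploded})
print(len(exploded), distinct_ids)
```

The distinct ids collected at the end correspond to the set you would then match against the Id column of the category table.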

Returning rows with distinct column value with data jpa named query

Assuming I have a table with 3 columns (ID, Name, City) and I want to use a named query to return rows with a unique city, can it be done?
Are you asking whether it is possible to write a query that will return the cities that appear in exactly one row, in a table that has ID/Name/City triplets where there could be multiple rows for the same city but with different names?
If so, it would depend on the database engine behind the scenes - but you could try things like:
with candidates (city, num) as (
select city, count(*) from table
group by city
)
select city from candidates where num = 1
Or
select t1.city from table t1
where not exists (
select * from table t2
where t2.city = t1.city and t2.id <> t1.id
)
where table is your table with these triplets.
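Both queries implement the same rule, which can be sketched in a few lines of Python to check the expected result on sample data. The rows below are invented for illustration, and this assumes the "exactly one row per city" reading of the question.

```python
from collections import Counter


def cities_appearing_once(rows):
    """Return the cities that occur in exactly one (id, name, city) row."""
    counts = Counter(city for _id, _name, city in rows)
    return [city for city, n in counts.items() if n == 1]


# Hypothetical (ID, Name, City) rows
people = [
    (1, "Ann", "Oslo"),
    (2, "Bob", "Paris"),
    (3, "Cid", "Oslo"),
    (4, "Dee", "Rome"),
]
print(cities_appearing_once(people))  # ['Paris', 'Rome']
```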

Union which excludes values from the first table

The original problem I am attempting to solve is that I need to show all rows from a specific "joined" table. However, these are sometimes blank with no totals and normally would not show (think categories and counts for each).
So what I am attempting to do is union to a "0 value" data set to show all categories. However when I do the union it shows a 0 value row, as well as the normal data. Here is an example..
SELECT category_name, COUNT(files_number)
FROM files
LEFT JOIN categories ON categories.category_id = files.category_id
UNION
SELECT category_name, 0
FROM categories
This will give me a result set that looks similar to this:
category_name | value
----------------------
open file | 0
open file | 23
closed file | 0
Is there any way to remove duplicate zero value entries? Please note there is also a complex WHERE clause in the actual query, so avoiding duplication of it is preferred.
I'm not sure why you need both the LEFT JOIN and the UNION.
To remove the duplicates, you can wrap your query in a CTE and group by the category name, summing the counts:
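;with cte
as
(
-- counts for categories that have files
SELECT category_name, COUNT(files_number) AS aggcol
FROM files
LEFT JOIN categories ON categories.category_id = files.category_id
GROUP BY category_name
UNION
-- a zero row for every category
SELECT category_name, 0
FROM categories
)
select category_name, sum(aggcol) AS value
from cte
group by
category_name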
One way is to select all categories from the categories table, and LEFT JOIN onto the file counts (grouped by category_id).
SELECT c.category_name, ISNULL(fc.FileCount, 0) AS FileCount
FROM categories c
LEFT JOIN (
SELECT category_id, COUNT(files_number) AS FileCount
FROM files
GROUP BY category_id
) fc ON c.category_id = fc.category_id
Edit
If you want to reverse the query, you could do it something like this, using a RIGHT OUTER JOIN - so every category from categories table is returned, regardless of if there are any files for it:
SELECT c.category_name, COUNT(f.category_id) AS FileCount
FROM files f
RIGHT JOIN categories c ON c.category_id = f.category_id
GROUP BY c.category_name
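The outer-join idea boils down to "start from the full category list and look up a count, defaulting to zero". A minimal Python sketch of that, with made-up sample rows chosen to match the output shown in the question:

```python
from collections import Counter


def files_per_category(categories, files):
    """Return (category_name, file count) for every category, including empty ones."""
    # Like COUNT(files_number): rows whose files_number is missing are not counted.
    counts = Counter(
        category_id for category_id, files_number in files if files_number is not None
    )
    # Driving the loop from categories is what the outer join achieves in SQL.
    return [(name, counts.get(category_id, 0)) for category_id, name in categories]


# Hypothetical data: (category_id, category_name) and (category_id, files_number)
categories = [(1, "open file"), (2, "closed file")]
files = [(1, number) for number in range(23)]
print(files_per_category(categories, files))  # [('open file', 23), ('closed file', 0)]
```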

SQL - finding multiple occurrence of one attribute with another in same table

Consider the following tables
Inventory(storeid, itemid, qty)
Items(itemid, description, size, color)
Here is my task: retrieve the id of stores that meet the following criterion: for every item description that is held in its inventory, the store holds a corresponding itemID in all possible sizes for that description.
This is how the response should look:
3667
3706
3742
3842
Where I am at:
with s as (
select *
from inventory
inner join items using (itemID)
),
m as (
select count(distinct size), description
from items
group by description
),
sizes as (
select distinct size
from items
)
select distinct s1.storeID
from s s1
inner join m m1
on s1.description = m1.description
group by s1.storeID;
This just returns storeid's with items that match any of the descriptions...which is every storeid. Having trouble finding a way to grab a description and ensure it has all three sizes (small, medium, large).
http://sqlfiddle.com/#!2/2a743e
It doesn't say three sizes but all possible sizes and I like arrays so:
WITH sizes AS (
SELECT description, array_agg(DISTINCT size) AS sizes
FROM items
GROUP BY description
)
,store_items AS(
SELECT s.storeID, it.description, array_agg(DISTINCT it.size) AS sizes
FROM stores AS s
JOIN inventory AS i
ON s.storeID = i.storeID
JOIN items AS it
ON i.itemID = it.itemID
GROUP BY s.storeID, it.description
)
SELECT s.storeID
FROM stores AS s
WHERE s.storeID NOT IN(
SELECT storeID
FROM store_items AS si
JOIN sizes z
ON z.description = si.description
AND si.sizes<>z.sizes)
fiddle
Using the having clause we can find the descriptions that have a count of size = 3.
We then count the number of those descriptions and compare that to the total number of descriptions in the store.
WITH countofdescirptions
AS (SELECT i.storeid,
Count(DISTINCT it.description) AS k
FROM inventory i
INNER JOIN items it
ON i.itemid = it.itemid
GROUP BY i.storeid),
has3
AS (SELECT i.storeid,
it.description
FROM inventory i
INNER JOIN items it
ON i.itemid = it.itemid
GROUP BY i.storeid,
it.description
HAVING Count(DISTINCT size) = 3)
SELECT *
FROM countofdescirptions
INNER JOIN (SELECT storeid,
Count(description) AS k
FROM has3
GROUP BY storeid) has3Count
ON countofdescirptions.storeid = has3Count.storeid
AND countofdescirptions.k = has3Count.k
DEMO
I'm fairly certain that there's a solution using COUNT() OVER
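Whichever SQL formulation you pick, it can be handy to cross-check it against a direct statement of the criterion. Here is a small Python sketch of that check; the sample inventory and items are invented, the function name is hypothetical, and it assumes any inventory row counts as "held" regardless of qty.

```python
from collections import defaultdict


def stores_with_all_sizes(inventory, items):
    """Return store ids that hold every existing size of each description they stock."""
    by_item = {itemid: (description, size) for itemid, description, size, _color in items}

    # All sizes that exist for each description across the whole Items table.
    all_sizes = defaultdict(set)
    for description, size in by_item.values():
        all_sizes[description].add(size)

    # Sizes each store actually holds, per description.
    held = defaultdict(lambda: defaultdict(set))
    for storeid, itemid, _qty in inventory:
        description, size = by_item[itemid]
        held[storeid][description].add(size)

    return sorted(
        storeid
        for storeid, per_description in held.items()
        if all(sizes == all_sizes[d] for d, sizes in per_description.items())
    )


# Hypothetical Items(itemid, description, size, color) and Inventory(storeid, itemid, qty)
items = [
    (1, "shirt", "S", "red"), (2, "shirt", "M", "red"), (3, "shirt", "L", "red"),
    (4, "hat", "S", "blue"), (5, "hat", "L", "blue"),
]
inventory = [
    (10, 1, 5), (10, 2, 1), (10, 3, 2),   # all shirt sizes
    (20, 1, 4), (20, 4, 1), (20, 5, 1),   # only one shirt size
    (30, 4, 2), (30, 5, 7),               # all hat sizes
]
print(stores_with_all_sizes(inventory, items))  # [10, 30]
```

Note that "hat" only exists in two sizes here, so store 30 qualifies; a check hard-coded to three sizes would reject it.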