I have 3 PostgreSQL tables: documents, keywords, and a join table.
I have a query that returns documents.id and documents.document_date if certain keywords are related to that document. That works fine, like so:
SELECT
documents.id, documents.document_date
FROM
documents
INNER JOIN
documents_keywords ON documents_keywords.document_id = documents.id
INNER JOIN
keywords ON keywords.id = documents_keywords.keyword_id
WHERE
keywords.keyword IN ('bread' , 'cake')
GROUP BY documents.id
This returns:
id | document_date
----+-----------
4 | 1200
12 | 1280
(2 rows)
I also want to exclude keywords. I thought I could do NOT IN like so:
SELECT
documents.id, documents.document_date
FROM
documents
INNER JOIN
documents_keywords ON documents_keywords.document_id = documents.id
INNER JOIN
keywords ON keywords.id = documents_keywords.keyword_id
WHERE
keywords.keyword NOT IN ('cranberries')
GROUP BY documents.id
But that always returns an empty result, whatever keyword I put:
id | document_date
----+-----------
(0 rows)
This is incorrect. I expected:
id | document_date
----+-----------
4 | 1200
(1 row)
You might want to use an array expression, like this:
WHERE keyword = any(array['bread', 'cake'])
when you want to include a row.
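Applied to your original query, that would just mean swapping the IN list for = ANY (a sketch of your own query, otherwise unchanged):
SELECT
documents.id, documents.document_date
FROM
documents
INNER JOIN
documents_keywords ON documents_keywords.document_id = documents.id
INNER JOIN
keywords ON keywords.id = documents_keywords.keyword_id
WHERE
keywords.keyword = ANY(array['bread', 'cake'])   -- array expression instead of IN
GROUP BY documents.id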
If you want to exclude something, you have to do the NOT IN over a subselect of the inverse condition, e.g.
SELECT ... WHERE document_id NOT IN
(SELECT document_id FROM ...joins... WHERE keyword = ANY(array['cranberry']))
Here is an example I put together:
WITH documents(d_id, date) AS (
VALUES(1,'1000'),(2,'2000'),(3,'3000'),(4,'4000')
),
keywords(k_id, keyword) AS (
VALUES(1, 'cake'), (2, 'bread'), (3, 'cranberry')
),
documents_keywords (d_id, k_id) AS (
VALUES(1,1),(1,2),(2,2),(2,3),(3,3)
)
SELECT * FROM documents where d_id NOT IN (
SELECT d_id FROM
documents
JOIN documents_keywords USING(d_id)
JOIN keywords USING(k_id)
WHERE keyword = ANY(array['cranberry'])
)
Also, I am not sure why you are using GROUP BY; I don't think you need it.
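For example, combining the include and exclude conditions against your original tables could look like this; a sketch assuming the same table and column names as in your question, and no GROUP BY is needed because documents is only scanned once:
SELECT d.id, d.document_date
FROM documents d
WHERE d.id IN (
    SELECT dk.document_id                           -- documents having a wanted keyword
    FROM documents_keywords dk
    JOIN keywords k ON k.id = dk.keyword_id
    WHERE k.keyword = ANY(array['bread', 'cake'])
)
AND d.id NOT IN (
    SELECT dk.document_id                           -- documents having an unwanted keyword
    FROM documents_keywords dk
    JOIN keywords k ON k.id = dk.keyword_id
    WHERE k.keyword = ANY(array['cranberries'])
)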
I have posts table with the following structure:
| id | score | title | tags |
-------------------------------------------------
| 1 | 42 | Travel | <uk><travel><passport> |
For each blog post I want to find relevant posts, tagged with any of the tags corresponding to the current page, in my case: <uk>, <travel> or <passport>. Then, order results by score, limit it to 5 items and display it to the user.
This is the code I came up with so far, but it seems to only get results for the first tag in the query, <uk>.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
(tags ~ tag)::int as match_found
) m
where m.match_found > 0
) t
order by t.score desc
limit 5;
EDIT
After Mike Organek's comment I changed the query to this, and it's working as I initially expected.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
position(tag in tags) > 0 as match_found
) m
where m.match_found and tag <> ''
) t
order by t.score desc
limit 5;
I would convert the tags into an array then use array operators to find the relevant posts:
select id, title, score, tags
from posts
where string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',') && array['uk', 'travel', 'passport']
order by score
limit 5
In the long run, storing the tags as an array or a jsonb array is probably a lot more efficient.
If you do that a lot, things might get a bit easier if you create a function for this:
create function tags_array(p_input text)
returns text[]
as
$$
select string_to_array(trim(both '<>' from replace(p_input, '><', ',')), ',');
$$
language sql
immutable;
Then the query is a bit easier to read:
select id, title, score, tags
from posts
where tags_array(tags) && array['uk', 'travel', 'passport']
order by score
limit 5
You can even create an index for that if you want:
create index on posts using gin ( (tags_array(tags)) );
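If you do store the tags natively as an array, as suggested above, the whole thing gets simpler still. A minimal sketch (the posts_arr table and tags_arr column names are made up for illustration):
-- hypothetical table with a text[] column instead of a '<uk><travel><passport>' string
CREATE TABLE posts_arr (
    id       integer PRIMARY KEY,
    score    integer,
    title    text,
    tags_arr text[]                 -- e.g. '{uk,travel,passport}'
);

CREATE INDEX ON posts_arr USING gin (tags_arr);

-- posts tagged with any of the given tags, best scores first
SELECT id, title, score, tags_arr
FROM posts_arr
WHERE tags_arr && ARRAY['uk', 'travel', 'passport']
ORDER BY score DESC
LIMIT 5;
The && (overlap) operator matches "any of these tags" and is supported by the GIN index.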
I found some results/answers regarding searching in an array, but:
WHERE ... = ANY works with a column, but not with a subquery that returns one record containing an array as the result; it triggers an error
WHERE ... IN also triggers the same error
I also tested unnest() on the subquery, with a similar error to the first 2
I don't want to check whether a value is in an array, but rather run the query for the values in the array, like WHERE IN (1,2,3,4), not like in the other questions/answers
The error:
No operator matches the given name and argument types. You might need
to add explicit type casts.
or
No operator matches int = int[]
path is an int[] array type.
Structure:
id | name  | slug  | path    | parent_id
---+-------+-------+---------+----------
1  | name1 | slug1 | {1}     | null
2  | name2 | slug2 | {1,2}   | 1
3  | name3 | slug3 | {1,2,3} | 2
4  | nam4  | slug4 | {4}     | null
What I tried as a base:
SELECT t.id, t.name, t.slug FROM types AS t
WHERE t.id in (SELECT t.path FROM types AS t WHERE t.id = 24)
ORDER BY depth ASC
Basically, path is like a breadcrumb: {grandparent, parent, type}
Here's one using IN and unnest()
SELECT t1.id,
t1."name",
t1.slug
FROM types t1
WHERE t1.id IN (SELECT un.e
FROM types t2
CROSS JOIN LATERAL unnest(t2.path) un (e)
WHERE t2.id = 2)
ORDER BY array_length(t1.path, 1);
And another one using the array "is contained by" operator <@.
SELECT t1.id,
t1."name",
t1.slug
FROM types t1
WHERE ARRAY[t1.id] <@ (SELECT t2.path
FROM types t2
WHERE t2.id = 2)
ORDER BY array_length(t1.path, 1);
And one using = ANY.
SELECT t1.id,
t1."name",
t1.slug
FROM types t1
WHERE t1.id = ANY ((SELECT t2.path
FROM types t2
WHERE t2.id = 2)::integer[])
ORDER BY array_length(t1.path, 1);
db<>fiddle
You didn't include depth in your sample data, so I replaced it with array_length(t1.path, 1), which is probably what it is.
I'm trying to find all IDs in TableA that are mentioned by a set of records in TableB, where that set is defined in TableC. I've gotten to the point where a set of INNER JOINs provides me with the following result:
TableA.ID | TableB.Code
-----------------------
1 | A
1 | B
2 | A
3 | B
I want to select only the IDs where, in this case, there is an entry for both A and B, but where the values A and B are based on another query.
I figured this should be possible with a GROUP BY TableA.ID and HAVING = ALL(Subquery on table C).
But that is returning no values.
Since you did not post your original query, I will assume it is inside a CTE. Assuming this, the query you want is something along these lines:
SELECT ID
FROM cte
WHERE Code IN ('A', 'B')
GROUP BY ID
HAVING COUNT(DISTINCT Code) = 2;
It's an extremely poor question, but you probably need to compare distinct counts against table C:
SELECT a.ID
FROM TableA a
GROUP BY a.ID
HAVING COUNT(DISTINCT a.Code) = (SELECT COUNT(*) FROM TableC)
We're guessing though.
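Combining both guesses: assuming the codes come from a join to TableB and the required set of codes lives in TableC, a sketch might be (the b.A_ID and c.Code column names are assumed; adjust to your schema):
-- keep only IDs that have every code defined in TableC
SELECT a.ID
FROM TableA a
JOIN TableB b ON b.A_ID = a.ID
WHERE b.Code IN (SELECT c.Code FROM TableC c)
GROUP BY a.ID
HAVING COUNT(DISTINCT b.Code) = (SELECT COUNT(*) FROM TableC);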
I'm returning a unique list of IDs from the users table, where specific columns in a related table (positions) contain a matching string.
The related table may have multiple records for each user record.
The query is taking a really, really long time (it's not scalable), so I'm wondering if I'm structuring the query wrong in some fundamental way.
Users Table:
id | name
-----------
1 | frank
2 | kim
3 | jane
Positions Table:
id | user_id | title | company | description
--------------------------------------------------
1 | 1 | manager | apple | 'Managed a team of...'
2 | 1 | assistant | apple | 'Assisted the...'
3 | 2 | developer | huawei | 'Build a feature that...'
For example: I want to return the user's id if a related positions record contains "apple" in either the title, company or description columns.
Query:
select
distinct on (users.id) users.id,
users.name,
...
from users
where (
select
string_agg(distinct positions.description, ', ') ||
string_agg(distinct positions.title, ', ') ||
string_agg(distinct positions.company, ', ')
from positions
where positions.users_id::int = users.id
group by positions.users_id::int) like '%apple%'
UPDATE
I like the idea of moving this into a join clause. But what I'm looking to do is filter users conditionally on the two points below, and I'm not sure how to do both in a join.
1) finding the keyword in title, company, description
or
2) finding the keyword with full-text search in an associated string version of a document in another table.
(select
to_tsvector(string_agg(distinct documents.content, ', '))
from documents
where users.id = documents.user_id
group by documents.user_id) @@ to_tsquery('apple')
So I was originally thinking it might look like,
select
distinct on (users.id) users.id,
users.name,
...
from users
where (
(select
string_agg(distinct positions.description, ', ') ||
string_agg(distinct positions.title, ', ') ||
string_agg(distinct positions.company, ', ')
from positions
where positions.users_id::int = users.id
group by positions.users_id::int) like '%apple%'
or
(select
to_tsvector(string_agg(distinct documents.content, ', '))
from documents
where users.id = documents.user_id
group by documents.user_id) @@ to_tsquery('apple'))
But then it was really slow - I can confirm the slowness is from the first condition, not the full-text search.
Might not be the best solution, but a quick option is:
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
JOIN positions p ON (
p.user_id = u.id
AND ( description || title || company )
LIKE '%apple%'
);
Basically, I got rid of the subquery, the unnecessary string_agg usage, the grouping on the positions table, etc.
What it does is a conditional join, and removing duplicates is covered by DISTINCT ON.
PS! I used table aliases u and p to shorten the example
EDIT: also adding a WHERE example, as requested
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
JOIN positions p ON ( p.user_id = u.id )
WHERE ( p.description || p.title || p.company ) LIKE '%apple%'
OR ...your other conditions...;
EDIT 2: new details revealed new requirements beyond the original question, so here is a new example for the updated ask:
Since you are doing lookups in 2 different tables (positions and uploads) with an OR condition, a simple JOIN won't work.
But both lookups are verification-type lookups (you only check whether %apple% exists), so you do not need to aggregate, group by, or convert the data.
Using EXISTS, which returns TRUE for the first match found, is what you seem to need anyway. So removing all the unnecessary parts and using LIMIT 1 to return a positive value if a match is found, and nothing if not (the latter makes EXISTS become FALSE), will give you the same result.
So here is how you could solve it:
SELECT DISTINCT ON ( u.id ) u.id,
u.name
FROM users u
WHERE EXISTS (
SELECT 1
FROM positions p
WHERE p.users_id = u.id::int
AND ( description || title || company ) LIKE '%apple%'
LIMIT 1
)
OR EXISTS (
SELECT 1
FROM uploads up
WHERE up.user_id = u.id::int -- you had a reference to the table 'documents' here, but it doesn't exist in your example query, so I just added a relation to the 'uploads' table as you have in FROM, assuming a 'content' column exists there
AND up.content LIKE '%apple%'
LIMIT 1
);
NB! Your example queries contain references to tables/aliases like documents that don't appear anywhere in the FROM part. So either you cut down your real query with the wrong naming, or you made some other typo; that is something you need to verify, and then adjust my example query accordingly.
Is there a simpler way to perform this query?
I'm actually using this as part of a larger query.
I would rather not use EXCEPT, UNION, INTERSECT.
As part of the larger query the optimizer can get stupid on the derived table and EXCEPT.
It is not of value to post the larger query, as it is dynamic.
The PK on docSVsys is sID
The PK on docMVenum1 is sID, enumID, valueID
select sID from docSVsys
EXCEPT
select sID
from docMVenum1
where enumID = 140
and valueID in (1,2)
group by sID
having count(*) = 2
select docSVsys.sID from docSVsys
left outer join
( select sID
from docMVenum1
where enumID = 140
and valueID in (1,2)
group by sID
having count(*) = 2 ) as joinTable
on docSVsys.sID = joinTable.sID
where joinTable.sID is null
I know the two queries are equivalent.
I am looking for a 3rd, simpler one.
I believe the IN operator might cause an inefficiency. Try this:
select sID
from docSVsys
EXCEPT
select sID
from (
select d1.sID
from docMVenum1 d1
join docMVenum1 d2
on d1.sID = d2.sID
where d1.enumID = 140 and d1.valueID = 1
and d2.enumID = 140 and d2.valueID = 2
) T
Do you absolutely need the aggregation? Can't you just do the grouping and aggregation on the data as created tables? That would be a performance booster.
CREATE TABLE new_table
AS (select sID
from docMVenum1
where enumID = 140
and valueID in (1,2));
The only query you would then have to run is the one below:
select sID from new_table
group by sID
having count(*) = 2;