Indexes to support OR condition over a JOIN - postgresql

I'm wondering if Postgres has any support for optimizing the following fundamental problem.
I want to search against two columns on different tables joined via a foreign key. I have created an index for each column. If I run my join query with a WHERE condition on either one column or the other, the respective index is used to filter the result and query performance is great. If I use two WHERE conditions combined by an OR, one for a field on each table, the query gets very slow and no indexes are used. Presumably this is because the optimizer sees no other way than doing a full table join and scan to resolve it. The query looks something like this:
select table1.id
from table1
left join table2 on table1.fk = table2.id
where table1.haystack ilike '%needle%' or table2.haystack ilike '%needle%'
The operation (ilike) isn't the issue and is interchangeable; I have a working trigram index setup. I just want to find out if there is any other way to make this type of query performant besides denormalizing all searched fields into one table.
I would be very grateful for any ideas.

No, there is no special support in the database to optimize this. Do it yourself:
SELECT table1.id
FROM table1
JOIN table2 ON table1.fk = table2.id
WHERE table1.haystack ILIKE '%needle%'
UNION
SELECT table1.id
FROM table1
JOIN table2 ON table1.fk = table2.id
WHERE table2.haystack ILIKE '%needle%'
Provided both conditions are selective and indexed with a trigram index, and you have indexes on the join condition, that will be faster.
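The rewrite depends on a trigram index on each haystack column and on an index that supports the join. A minimal sketch of that setup follows, assuming the pg_trgm extension is available, that table1.id and table2.id are primary keys, and with index names made up for illustration. Because the question used a LEFT JOIN, the first branch can skip the join entirely if table1 rows without a matching table2 row should still be returned:

```sql
-- Assumption: pg_trgm can be installed on this server; index names are hypothetical.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX table1_haystack_trgm_idx ON table1 USING gin (haystack gin_trgm_ops);
CREATE INDEX table2_haystack_trgm_idx ON table2 USING gin (haystack gin_trgm_ops);
-- Lets the second branch find table1 rows starting from matching table2 rows.
CREATE INDEX table1_fk_idx ON table1 (fk);

-- Variant that keeps the LEFT JOIN semantics of the question's query:
-- the first branch needs no join because it only filters on table1.
SELECT table1.id
FROM table1
WHERE table1.haystack ILIKE '%needle%'
UNION
SELECT table1.id
FROM table1
JOIN table2 ON table1.fk = table2.id
WHERE table2.haystack ILIKE '%needle%';
```

Running both the OR version and this version under EXPLAIN (ANALYZE) shows whether the trigram indexes are actually chosen for your data.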

Related

If two PostgreSQL tables have a GiST index, can their union be considered an indexed table?

I have three tables, table_a, table_b, table_c. All of them have a GiST index.
I would like to perform a left join between table_c and the UNION of table_a and table_b.
Can the UNION be considered "indexed"? I assume it would be better to create a new table as the UNION, but these tables are huge, so I am trying to avoid this kind of redundancy.
In terms of SQL, my question:
Is this
SELECT * FROM myschema.table_c AS a
LEFT JOIN
(SELECT col_1,col_2,the_geom FROM myschema.table_a
UNION
SELECT col_1,col_2,the_geom FROM myschema.table_b) AS b
ON ST_Intersects(a.the_geom,b.the_geom);
equal to this?
CREATE TABLE myschema.table_d AS
SELECT col_1,col_2,the_geom FROM myschema.table_a
UNION
SELECT col_1,col_2,the_geom FROM myschema.table_b;
CREATE INDEX idx_table_d_the_geom
ON myschema.table_d USING gist
(the_geom)
TABLESPACE mydb;
SELECT * FROM myschema.table_c AS a
LEFT JOIN myschema.table_d AS b
ON ST_Intersects(a.the_geom,b.the_geom);
You can look at the execution plan with EXPLAIN, but I doubt that it will use the indexes.
Rather than performing a left join between one table and the union of the two other tables, you should perform the union of the left joins between the one table and each of the two tables in turn. That will be a longer statement, but PostgreSQL will be able to use the index if that can speed up the left joins.
Be sure to use UNION ALL rather than UNION unless you really have to remove duplicates.
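As a sketch of that suggestion applied to the tables from the question (the column alias other_geom is made up for illustration):

```sql
-- One LEFT JOIN per table, combined with UNION ALL.
SELECT c.*, a.col_1, a.col_2, a.the_geom AS other_geom
FROM myschema.table_c AS c
LEFT JOIN myschema.table_a AS a
       ON ST_Intersects(c.the_geom, a.the_geom)
UNION ALL
SELECT c.*, b.col_1, b.col_2, b.the_geom AS other_geom
FROM myschema.table_c AS c
LEFT JOIN myschema.table_b AS b
       ON ST_Intersects(c.the_geom, b.the_geom);
```

Note that this is not strictly identical to the left join against the union: a table_c row without a match in one branch appears there with NULLs even if it has matches in the other branch, and UNION ALL keeps rows that occur in both table_a and table_b. If either matters, filter or deduplicate the result accordingly.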

Refactoring query using DISTINCT and JOINing table with a lot of records

I am using PostgreSQL v 11.6. I've read a lot of questions about how to optimize queries that use DISTINCT. Mine is not that different, but unlike the other questions, where people usually want to keep the rest of the query and just somehow make DISTINCT ON faster, I am willing to rewrite the query with the sole purpose of making it as performant as possible. The current query is this:
SELECT DISTINCT s.name FROM app.source AS s
INNER JOIN app.index_value iv ON iv.source_id = s.id
INNER JOIN app.index i ON i.id = iv.index_id
INNER JOIN app.namespace AS ns ON i.namespace_id=ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
ORDER BY s.name;
The app.source table contains about 800 records. The other tables have under 5,000 records at most, but app.index_value contains 35,420,354 records (about 35 million), which I guess causes the overall slow execution of the query.
The EXPLAIN ANALYZE returns this:
I think that all relevant indexes are in place (maybe some small optimizations can still be made), but I think that in order to get a significant improvement in execution time I need better logic for the query.
The current execution time on a decent machine is 35~38 seconds.
Your query is not using DISTINCT ON. It is merely using DISTINCT which is quite a different thing.
SELECT DISTINCT is indeed often an indicator of a poorly written query, because DISTINCT is used to remove duplicates and it is often the case that the query creates those duplicates itself. The same is true for your query. You simply want all names where certain entries exist. So, use EXISTS (or IN for that matter).
EXISTS
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
SELECT NULL
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE iv.source_id = s.id
AND (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
IN
SELECT s.name
FROM app.source AS s
WHERE s.id IN
(
SELECT iv.source_id
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
Thus we avoid creating an unnecessarily large intermediate result.
Update 1
From the database side we can support queries with appropriate indexes. The only criterion in your query that limits the selected rows is the array lookup, though. This is probably slow, because the DBMS cannot use database indexes here as far as I know. And depending on the array content we can end up with zero app.namespace rows, few rows, many rows or even all rows. The DBMS cannot even make proper assumptions about how many. From there we'll retrieve the related index and index_value rows. Again, these can be all or none. The DBMS could use indexes here or not. If it used indexes, this would be very fast on small sets of rows and extremely slow on large data sets. And if it used full table scans and joined these via hash joins, for instance, this would be the fastest approach for many rows and rather slow for few rows.
You can create indexes and see whether they get used or not. I suggest:
create index idx1 on app.index (namespace_id, id);
create index idx2 on app.index_value (index_id, source_id);
create index idx3 on app.source (id, name);
Update 2
I am not well versed in arrays. But it looks like you want to check if a matching condition exists. So again EXISTS might be a tad more appropriate:
WHERE EXISTS
(
SELECT NULL
FROM UNNEST(Array['Default']::CITEXT[]) AS nss
WHERE ns.name ILIKE nss
)
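A more compact way to express the same check, offered as an untested sketch: PostgreSQL accepts `ILIKE ANY (array)`, which is true when the pattern match succeeds for at least one array element. This assumes ns.name can be matched against citext patterns in the same way as in the subquery above.

```sql
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
  SELECT NULL
  FROM app.index_value iv
  JOIN app.index i ON i.id = iv.index_id
  JOIN app.namespace AS ns ON i.namespace_id = ns.id
  WHERE iv.source_id = s.id
  -- true if ns.name matches at least one pattern in the array
  AND ns.name ILIKE ANY (ARRAY['Default']::citext[])
)
ORDER BY s.name;
```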
Update 3
One more idea (I feel stupid now to have missed that): For each source we just look up whether there is at least one match. So maybe the DBMS starts with the source table and goes from that table to the next. For this we'd use the following indexes:
create index idx4 on app.index_value (source_id, index_id);
create index idx5 on app.index (id, namespace_id);
create index idx6 on app.namespace (id, name);
Just add them to your database and see what happens. You can always drop indexes again when you see the DBMS doesn't use them.
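One way to check whether the new indexes get used, sketched here with the index names from this answer: inspect the plan with EXPLAIN, then look at the scan counters in the pg_stat_user_indexes statistics view after the query has run a few times.

```sql
-- Show the plan actually chosen (ANALYZE executes the query for real).
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
  SELECT NULL
  FROM app.index_value iv
  JOIN app.index i ON i.id = iv.index_id
  JOIN app.namespace AS ns ON i.namespace_id = ns.id
  WHERE iv.source_id = s.id
  AND (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;

-- idx_scan counts how often each index has been scanned so far.
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname IN ('idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6')
ORDER BY indexrelname;
```

An index whose idx_scan stays at 0 under a realistic workload is a candidate for DROP INDEX.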

Should I do ORDER BY twice when selecting from subquery?

I have an SQL query (code below) which selects some rows from a subquery. In the subquery I perform ORDER BY.
The question is: will the order of the subquery be preserved in the parent query?
Is there some spec/document or something which proves that?
SELECT sub.id, sub.name, ot.field
FROM (SELECT t.id, t.name
FROM table t
WHERE t.something > 10
ORDER BY t.id
LIMIT 25
) sub
LEFT JOIN other_table ot ON ot.table_id = sub.id
/** order by id? **/
"Will order of subquery be preserved in parent query?"
It might happen, but you cannot rely on that.
For example, if the optimizer decides to use a hash join between your derived table and other_table then the order of the derived table will not be preserved.
If you want a guaranteed sort order, then you have to use an order by in the outer query as well.
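Applied to the query from the question, that looks roughly as follows. The table name some_table is a stand-in, since table itself is a reserved word:

```sql
SELECT sub.id, sub.name, ot.field
FROM (SELECT t.id, t.name
      FROM some_table t          -- hypothetical name for the question's "table"
      WHERE t.something > 10
      ORDER BY t.id              -- still needed so that LIMIT keeps the 25 lowest ids
      LIMIT 25
     ) sub
LEFT JOIN other_table ot ON ot.table_id = sub.id
ORDER BY sub.id;                 -- guarantees the order of the final result
```

The inner ORDER BY decides which 25 rows survive the LIMIT; the outer ORDER BY decides the order in which the joined rows are returned.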

Does SQL execute subqueries fully?

Imagine I have this SQL query and the table2 is HUGE.
select product_id, count(product_id)
from table1
where table2_ptr_id in (select id
from table2
where author is not null)
Will SQL first execute the subquery and load all of table2 into memory? For example, if table1 has 10 rows and table2 has 10 million rows, would it be better to join first and then filter? Or is the DB smart enough to optimize this query as it is written?
You have to EXPLAIN the query to know what it is doing.
However, your query will likely perform better in PostgreSQL if you rewrite it to
SELECT product_id
FROM table1
WHERE EXISTS (SELECT 1
FROM table2
WHERE table2.id = table1.table2_ptr_id
AND table2.author IS NOT NULL);
Then PostgreSQL can use a semi-join, which will probably perform much better with a huge table2.
Remark: the count in your query doesn't make any sense to me.
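If the count was meant to be a count per product, a GROUP BY is needed. A sketch under that assumption, prefixed with EXPLAIN so the chosen plan can be inspected:

```sql
EXPLAIN
SELECT table1.product_id, count(table1.product_id) AS product_count
FROM table1
WHERE EXISTS (SELECT 1
              FROM table2
              WHERE table2.id = table1.table2_ptr_id
              AND table2.author IS NOT NULL)
GROUP BY table1.product_id;
```

In the output, a node such as "Hash Semi Join" or "Nested Loop Semi Join" would indicate that the subquery was turned into a join instead of being evaluated once per row; the planner may also pick a different strategy depending on table sizes and statistics.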

PostgreSQL 9.4.5: Limit number of results on INNER JOIN

I'm trying to implement a many-to-many relationship using PostgreSQL's Array type, because it scales better for my use case than a join table would. I have two tables: table1 and table2. table1 is the parent in the relationship, having the column child_ids bigint[] default array[]::bigint[]. A single row in table1 can have upwards of tens of thousands of references to table2 in the table1.child_ids column, therefore I want to try to limit the amount returned by my query to a maximum of 10. How would I structure this query?
My query to dereference the child ids is SELECT *, json_agg(table2.*) as children FROM table1 INNER JOIN table2 ON table2.id = ANY(table1.child_ids). I don't see a way I could set a limit without limiting the entire response as a whole. Is there a way to either limit this INNER JOIN, or at least utilize a subquery so that I can use LIMIT to restrict the amount of results from table2?
This would have been dead simple with properly normalized tables, but here goes with arrays:
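SELECT *
FROM table1 t1, LATERAL (
SELECT json_agg(c) AS children
FROM (
SELECT *
FROM table2
WHERE id = ANY (t1.child_ids)
LIMIT 10) c  -- limit the rows first, then aggregate them
) t2;
The LIMIT has to be applied before the aggregation, because json_agg collapses everything into a single row. Of course, without an ORDER BY in the inner subquery you have no influence over which 10 rows of table2 are selected for each row of table1.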