Inner join versus doing a where in clause - postgresql

Say I have about 10-30 rows returned from my query:
select * from users where location=10;
Is there any difference between the following:
select *
from users u
inner join location l on u.location = l.id
where location=10;
versus:
select * from users where location=10; # say this returns 1,2,3,4,5
select * from location where id IN (1,2,3,4,5)
Basically I want to know if there are any performance differences between doing an inner join versus doing a WHERE IN clause.

Is there a difference between issuing one query and issuing two queries? Well, I certainly hope so. The SQL engine is doing work, and it does twice as much work (from a certain perspective) for two queries.
In general, parsing a single query is going to be faster than parsing one query, returning an intermediate result set, and then feeding it back to another query. There is overhead in query compilation and in passing data back and forth.
For this query:
select *
from users u inner join
location l
on u.location = l.id
where u.location = 10;
You want an index on users(location) and location(id).
I do want to point something else out. The queries are not equivalent. The real comparison query is:
select l.*
from location l
where l.id = 10;
You are using the same column for the where and the on. Hence, this would be the most efficient version and you want an index on location(id).

One way to compare the performance of different queries is to use postgresql's EXPLAIN command, for instance:
EXPLAIN select *
from users u inner join
location l
on u.location = l.id
where u.location = 10;
Which will tell you how the database will get the data. Watch for things like sequential scans, which indicate you could benefit from an index. It also gives estimates on the cost of each operation and how many rows it might be operating on. Some operations can yield a lot more rows than you would expect, which the database then reduces to the set it returns to you.
You can also use EXPLAIN ANALYZE [query], which will actually run the query and give you timing information.
See the postgresql documentation for more information.

Related

Refactoring query using DISTINCT and JOINing table with a lot of records

I am using PostgreSQL v 11.6. I've read a lot of questions asking about how to optimize queries which are using DISTINCT. Mine is not that different, but despite the other questions where the people usually want's to keep the other part of the query and just somehow make DISTINCT ON faster, I am willing to rewrite the query with the sole purpose to make it as performent as possible. The current query is this:
SELECT DISTINCT s.name FROM app.source AS s
INNER JOIN app.index_value iv ON iv.source_id = s.id
INNER JOIN app.index i ON i.id = iv.index_id
INNER JOIN app.namespace AS ns ON i.namespace_id=ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
ORDER BY s.name;
The app.source table contains about 800 records. The other tables are under 5000 recrods tops, but the app.index_value contains 35_420_354 (about 35 million records) which I guess causes the overall slow execution of the query.
The EXPLAIN ANALYZE returns this:
I think that all relevent indexes are in place (maybe there can be made some small optimization) but I think that in order to get significant improvements in the time execution I need a better logic for the query.
The current execution time on a decent machine is 35~38 seconds.
Your query is not using DISTINCT ON. It is merely using DISTINCT which is quite a different thing.
SELECT DISTINCT is indeed often an indicator for a oorly written query, because DISTINCT is used to remove duplicates and it is often the case tat the query creates those duplicates itself. The same is true for your query. You simply want all names where certain entries exist. So, use EXISTS (or IN for that matter).
EXISTS
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
SELECT NULL
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE iv.source_id = s.id
AND (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
IN
SELECT s.name
FROM app.source AS s
WHERE s.id IN
(
SELECT iv.source_id
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
Thus we avoid creating an unnecessarily large intermediate result.
Update 1
From the database side we can support queries with appropriate indexes. The only criteria used in your query that limits selected rows is the array lookup, though. This is probably slow, because the DBMS cannot use database indexes here as far as I know. And depending on the array content we can end up with zero app.namespace rows, few rows, many rows or even all rows. The DBMS cannot even make proper assumptions on know how many. From there we'll retrieve the related index and index_value rows. Again, these can be all or none. The DBMS could use indexes here or not. If it used indexes this would be very fast on small sets of rows and extremely slow on large data sets. And if it used full table scans and joined these via hash joins for instance, this would be the fastest approach for many rows and rather slow on few rows.
You can create indexes and see whether they get used or not. I suggest:
create index idx1 on app.index (namespace_id, id);
create index idx2 on app.index_value (index_id, source_id);
create index idx3 on app.source (id, name);
Update 2
I am not versed with arrays. But t looks like you want to check if a matching condition exists. So again EXISTS might be a tad more appropriate:
WHERE EXISTS
(
SELECT NULL
FROM UNNEST(Array['Default']::CITEXT[]) AS nss
WHERE ns.name ILIKE nss
)
Update 3
One more idea (I feel stupid now to have missed that): For each source we just look up whether there is at least one match. So maybe the DBMS starts with the source table and goes from that table to the next. For this we'd use the following indexes:
create index idx4 on index_value (source_id, index_id);
create index idx5 on index (id, namespace_id);
create index idx6 on namespace (id, name);
Just add them to your database and see what happens. You can always drop indexes again when you see the DBMS doesn't use them.

What is the execution order of a query with sub queries?

Consider this query
select *
from documents d
where exists (select 1 as [1]
from (
select *
from (
select *
from ProductMediaDocuments
where d.id = MediaDocuments_Id
) as [dummy1]
) as [s2]
where exists(
select *
from ProductSkus psk
where psk.Product_Id = s2.MediaProducts_Id
)
)
Could someone tell me how this is being processed by SQL Server? When statements appears in parentheses, this means it will execute first. But does this also apply for the above statement? In this case I don't think so, because the sub queries needs values of outer queries. So, how does this works under the hood?
That's completely up to the database engine.
Since SQL is a declarative language, you specify WHAT you want, but the HOW part is up to the DB Engine and it really depends on many factors like indexes presence, type, fragmentation; row cardinality, statistics.
That's just to mention few, because the list can goes on.
Of course you can look to the execution plan but the point is that you can't know HOW it will be executed just reading the query.
The execution plan will tell you what the engine actually does. That is, the physical processing order. AFAIK, the query planner will rewrite your query if it finds a better way to express it to itself or the engine. If your question is, "Why is my query not working the way I think it should." then that is where you should start.
The doc says the logical processing order is:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
It also has this note:
The [preceding] steps show the logical processing order, or binding order, for a SELECT statement. This order determines when the objects defined in one step are made available to the clauses in subsequent steps. For example, if the query processor can bind to (access) the tables or views defined in the FROM clause, these objects and their columns are made available to all subsequent steps. Conversely, because the SELECT clause is step 8, any column aliases or derived columns defined in that clause cannot be referenced by preceding clauses. However, they can be referenced by subsequent clauses such as the ORDER BY clause. Note that the actual physical execution of the statement is determined by the query processor and the order may vary from this list.
FROM would include inline views (subqueries) or CTE aliases. Each time it finds a subquery, it should start over from the beginning and evaluate that query.
I simplified your code a bit.
SELECT *
FROM documents d
WHERE EXISTS ( SELECT 1
FROM ProductMediaDocuments s2
WHERE d.id = MediaDocuments_Id
AND EXISTS (
SELECT *
FROM ProductSkus psk
WHERE psk.Product_Id = s2.MediaProducts_Id
)
)
I think this code is clearer don't you??
SELECT d.*
FROM documents d
JOIN ProductMediaDocuments s2 ON d.id = MediaDocuments_Id
JOIN ProductSkus psk ON psk.Product_Id = s2.MediaProducts_Id

optimize sql "With ismember Select From query"

I have a query which run extremely slow when "checking is_member" in comparison to just loading the whole dataset. This view acts as a security check, it checks if you are a member of a particular group - ie group 1, then the next column will state what access it has - ie division 2.
This view then is joined with the Fact table, so that it will only retrieve division 2 rows.
The question is, does the is_member execute for each line of Fact data? Just my theory because it runs 1000 times faster without this view. And if anyone can suggest an alternative structure?
WITH group_security AS (SELECT DISTINCT division_cod FROM dbo.dim_group_security_division AS gsd
WHERE (IS_MEMBER(group_name) = 1))
SELECT dbo.dim_division.dim_division_key, dbo.dim_division.division_ID, dbo.dim_division.division_code, dbo.dim_division.division_name
FROM dbo.dim_division INNER JOIN
group_security ON dbo.dim_division.division_code = group_security.division_code OR group_security.division_code = 'ALL'
Since you JOIN on dbo.dim_division.division_code, do you have an index on this column?
Alternatively you could give this a try:
SELECT dim.dim_division_key,
dim.division_ID,
dim.division_code,
dim.division_name
FROM dim_division dim
WHERE EXISTS ( SELECT *
FROM dbo.dim_group_security_division gsd
WHERE gsd.division_code IN ('ALL', dim.division_code)
AND IS_MEMBER(gsd.group_name) = 1 )
This way the system can stop at the first 'match' in dim_group_security_division instead of having to find all and then aggregate the result because of the DISTINCT.
In this case, it might also be useful to have an index on gsd.division_code to speed things up a bit.

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in next tables (posts_types) and connected via next tables (posts_types_assignment).
COUNTing in PostgreSQL is really slow (i have more than 500k records in that table) and i need to get the number of posts categorized by any combination of type/tag/lang.
If i would solve it through triggers, it would be full of many multi-level loops, which really doesn't look like nice and is hard to maintenance.
Is there any other solution how to effectively get actual number of posts categorized in any type/tag/language?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assigment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assigment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot issue, the decision has been made.
My more useful note is that I believe that PostgreSQL has a similar query optimizer to Oracle. In which case to limit the combinatorial explosion of possible query plans that it has to consider, it only considers plans that start with some table, and then repeatedly joins on one more data set at a time. However no such query plan will work here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this the case, there is no way to tell the query optimizer that this once you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
JOIN t1
ON t1.post_id = pta.post_id
WHERE pt.type = 'language'
AND pt.name = 'English';
SELECT COUNT(*)
FROM posts p
JOIN t1
ON p.id = t1.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help - it seems (at first sight, at least) that, for example, to have three tables posts_tag(post_id,tag) post_lang(post_id,lang) post_type(post_id,type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible countings, with columns (lang,type,tag,nposts). Of course, to compute this in full would be VERY slow, but (apart from the first time) it can be done either in full "in background", at some intervals (if the data does not vary much, and if you don't require exact counts), or eagerly with triggers.
See for example here

Understanding the execution plan of a query

I have this SQL:
SELECT
*
FROM
Requisicao r
join convenio c on c.idconvenio = r.idconvenio
join empresa e on e.idempresa = c.idempresa
When I execute it I get this plan of execution:
PLAN JOIN (C NATURAL,E INDEX (INTEG_160),R INDEX (INTEG_318))
What means that Convenio's index was not used (every table has its indexes)
I would like to understand it a little better so I can improve some performance issues I'm having with this system.
Thanks.
What seems wrong for you? Because you don't have any conditions (WHERE clause) server will read one table naturally, i.e. from the very first row to the last one. Taking into consideration index's selectivity served decided that it would be better to read from c and to join records from e and r.
I agree with Andrei. If convenio.idconvenio has low selectivity, the plan is fine.