Postgres - Is "not exists" slower than join? - postgresql

I'm trying to locate the cause of a slow query that hits 3 tables, with row counts ranging from a few hundred thousand to several million:
tango - 6166101
kilo_golf - 822805
three_romeo - 535782
Version
PostgreSQL 11.10
Current query
select count(*) as aggregate
from "tango"
where "lima" = juliet
and not exists(select 1
from "three_romeo" quebec_seven_oscar
where quebec_seven_oscar.six_two = tango.six_two
and quebec_seven_oscar."romeo" >= six_seven
and quebec_seven_oscar."three_seven" in
('partial_survey', 'survey_completed', 'wrong_number', 'moved'))
and ("mike" <= '2021-02-03 13:26:22' or "mike" is null)
and not exists(select 1
from "kilo_golf" as "delta"
where "delta"."to" = tango.six_two
and "two" = november
and "delta"."romeo" >= '2021-02-05 13:49:15')
and not exists(select 1
from "three_romeo" as "four"
where "four".foxtrot = tango.quebec_seven_victor
and "four"."three_seven" in ('deceased', 'block_calls', 'block_all'))
and "tango"."yankee" is null;
This is the analysis of the query in its current state - https://explain.depesz.com/s/Di51
It feels like the problematic area is in the tango table
tango.lima is equal to 'juliet' in the majority of records (low cardinality), and we don't currently have an index on it.
The long filter makes me wonder whether I should create some sort of composite index.
After reading another post (https://stackoverflow.com/a/50148594/682754) I tried removing the or "mike" is null condition, and this helped quite a lot:
https://explain.depesz.com/s/XgmB
Should I try and remove the not exists in favour of using joins?
Thanks

I don't think that using explicit joins will help you, since PostgreSQL converts NOT EXISTS into an anti-join anyway.
But you spotted the problem: it is the OR. I would recommend that you use a dynamic query: add the condition only if mike is not NULL, rather than having a static query with OR.
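One possible reading of "dynamic query" is sketched below in Python: the application appends the cutoff predicate only when a cutoff value is actually supplied, so the planner never sees an OR. The function name build_count_query, the %s placeholder style (as used by drivers such as psycopg2) and the trimmed-down WHERE clause are assumptions made for illustration; only the table and column names come from the question. Note that, unlike the original OR, the variant with a cutoff does not count rows whose "mike" column is NULL.

```python
from typing import List, Optional, Tuple


def build_count_query(lima: str, cutoff: Optional[str]) -> Tuple[str, List[str]]:
    """Return (sql, params); the "mike" predicate is added only when a cutoff is given."""
    # Predicates that are always present (the NOT EXISTS parts are left out for brevity).
    clauses = ['"lima" = %s', '"tango"."yankee" is null']
    params = [lima]

    # Hypothetical rule: no cutoff supplied means no predicate on "mike" at all,
    # instead of a static ("mike" <= ... or "mike" is null).
    if cutoff is not None:
        clauses.append('"mike" <= %s')
        params.append(cutoff)

    sql = 'select count(*) as aggregate from "tango" where ' + " and ".join(clauses)
    return sql, params
```

The returned pair can be passed to a driver's execute call, for example cursor.execute(sql, params), so the values stay parameterized.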

You are counting about 6 million rows, and that will take some time. The reason that removing or "mike" is null can help so much is that the query no longer needs to count the rows where mike is null, which is the vast majority of them.
But this is of no use to you if you actually do need to count those rows. So, do you? I'm having a hard time picturing a situation in which you need an exact count of 6 million rows often enough that waiting 4 seconds for it is a problem.

Related

where column in (single value) performance

I am writing dynamic SQL code, and it would be easier to use a generic where column in (<comma-separated values>) clause, even when the clause might have 1 term (it will never have 0).
So, does this query:
select * from table where column in (value1)
have any different performance than
select * from table where column=value1
?
All my tests result in the same execution plans, but if there is some knowledge/documentation that sets it in stone, it would be helpful.

This might not hold true for each and every RDBMS, or for each and every query with its specific circumstances.
The engine will translate WHERE id IN(1,2,3) to WHERE id=1 OR id=2 OR id=3.
So your two ways to articulate the predicate will (probably) lead to exactly the same interpretation.
As always: We should not really bother about the way the engine "thinks". This was done pretty well by the developers :-) We tell - through a statement - what we want to get and not how we want to get this.
Some more details here, especially the first part.

I think this will depend on the platform you are using (the optimizer of the given SQL engine).
I did a little test using MySQL Server:
When I query select * from table where id = 1; I get: 1 total, Query took 0.0043 seconds
When I query select * from table where id IN (1); I get: 1 total, Query took 0.0039 seconds
I know this depends on the server, the PC and so on, but the results are very close.
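One caveat that is sometimes raised is that IN is non-sargable (not usable as a search argument) and therefore cannot use an index, while = can. For a list of constants this is generally not the case: MySQL and the other major engines can use an index for both forms, so compare the execution plans if in doubt.
If you want to know which one is best, you should test both in your environment, because both work well.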
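As a rough illustration of "test them in your environment", the Python sketch below compares the two forms of the predicate on an in-memory SQLite database. SQLite is only a stand-in chosen because it ships with Python; the table t, the index t_id_idx and the helper plan_and_rows are made up for this example, and the plans of your own RDBMS are what actually matter.

```python
import sqlite3


def plan_and_rows(conn, sql, params=()):
    """Return (plan detail strings, result rows) for one statement."""
    # The last column of EXPLAIN QUERY PLAN output is the human-readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
    rows = conn.execute(sql, params).fetchall()
    return plan, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.execute("CREATE INDEX t_id_idx ON t (id)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, "v%d" % i) for i in range(1000)])

# Same lookup written with = and with a single-element IN list.
eq_plan, eq_rows = plan_and_rows(conn, "SELECT * FROM t WHERE id = ?", (1,))
in_plan, in_rows = plan_and_rows(conn, "SELECT * FROM t WHERE id IN (?)", (1,))

print(eq_plan)
print(in_plan)
```

Printing both plans side by side makes it easy to see whether the engine treats the two forms differently.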

Improve speed by moving to NoSQL

Hello and thank you for reading my question!
Currently, we use PostgreSQL v.10 on 3 nodes through stolon (https://github.com/sorintlab/stolon)
We have 3 tables (I want to make my question simple):
Invoice (150 000 000 records)
User (35 000 000 records)
User_Address (20 000 000 records)
The main query looks like this (the original query is large, uses a temporary table and has a lot of where conditions, but this sample shows my problem):
select
i.*
from invoice as i
inner join get_similar_name('Jon') as s on i.name ilike s.name
left join user_address as a on i.user_id = a.user_id
where
a.state = 'OH'
and
i.last_name = 'Smith'
and
i.date between '2016-01-01'::date and '2018-12-31'::date;
The function get_similar_name returns similar names (for example, get_similar_name('Jon') will return John, Jonny, Jonathan, etc.), on average 200 - 1000 names. I must use the function :\
The query takes a long time to execute, around 30 - 120 seconds,
but if I exclude the function get_similar_name from the query, the execution time is no more than 1 second.
I have already tuned PostgreSQL, and the server works pretty well. I also created indexes, and my query doesn't use sequential scans.
We don't have the option of partitioning the tables, because there are many columns to partition on; we can't divide a table by just one of them.
I am thinking about migrating my warehouse to MongoDB.
My questions are:
Am I right about moving to MongoDB?
Will performance increase if I move the warehouse from PostgreSQL to 20-40 nodes under MongoDB control?
Is it possible to have the function get_similar_name, or a similar solution, on MongoDB? If yes, how?
Do you have good experience with using full-text search in MongoDB?
Is it right to use MongoDB in production?
Can you please suggest a "google-vector" toward what is, in your opinion, the right solution?

I don't know if moving to MongoDB will solve a text search problem, but Postgres has excellent features like full-text search (tsvector) and trigram matching (pg_trgm). Have you tried any of these?
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
https://www.postgresql.org/docs/9.6/pgtrgm.html
On my previous project, we used pg_trgm and were pretty happy with its performance.
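To give a feel for what trigram matching does, here is a rough Python sketch of the similarity measure as the pg_trgm documentation describes it: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character pieces, and two strings are compared by the share of trigrams they have in common. This is not the extension's actual implementation, and the helper names trigrams and similarity are chosen for this example only.

```python
import re


def trigrams(text):
    """Set of trigrams: lowercase, split into alphanumeric words, pad each word
    with two leading spaces and one trailing space."""
    result = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

In Postgres itself the comparison runs inside the database and can be backed by a GIN or GiST index with the operator classes that pg_trgm provides, which is what makes it attractive for large tables.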

Firebird index not used when using JOIN, why?

I'm using FB 2.5.5 and I'm trying to understand why a very simple query does not use an index and thus takes forever to execute. I've read a lot of articles about why existing indices might be ignored by the query optimizer, but I don't understand how that can happen in my case. I recomputed the selectivity for all my indices within IB Expert, and I've also done a backup/restore of the database to be sure I wasn't missing something.
The index selectivity, as displayed by IB Expert, is approx 0,000024 - which is far from 1 :
CREATE INDEX TVERSIONS_IDX_LASTMODDATE ON TVERSIONS (LASTMODDATE)
The table I'm querying contains approx. 2M records :
SELECT COUNT(ID) FROM TVERSIONS
2479518
I'm trying to fetch all records based on the LASTMODDATE field (TIMESTAMP, indexed by TVERSIONS_IDX_LASTMODDATE). An oversimplified version of the query would be :
SELECT COUNT(ID) FROM TVERSIONS WHERE LASTMODDATE > :TheDate
In this case, the execution plan shows that the index is actually used :
Plan
PLAN (TVERSIONS INDEX (TVERSIONS_IDX_LASTMODDATE))
...and records matching the condition are fetched very quickly :
------ Performance info ------
Prepare time = 172ms
Execute time = 16ms <----
Avg fetch time = 16,00 ms
Current memory = 2 714 672
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 57
Writes from cache to disk = 0
Fetches from cache = 387
Now, the "real" query fetches the same fields using the same condition on LASTMODDATE but adds a JOIN over 3 tables :
SELECT COUNT(ID) FROM TVERSIONS
JOIN TFILES ON TFILES.ID = TVERSIONS.FILEID
JOIN TROOTS ON TROOTS.ID = TFILES.ROOTID
JOIN TUSERSBACKUPS ON TROOTS.BACKUPID = TUSERSBACKUPS.BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
Now the query plan does not use the index anymore :
Plan
PLAN JOIN (TUSERSBACKUPS INDEX (RDB$FOREIGN4), TROOTS INDEX (RDB$FOREIGN3), TFILES INDEX (RDB$FOREIGN2), TVERSIONS INDEX (RDB$FOREIGN6))
Unsurprisingly, execution is far slower (approx. 1 minute):
------ Performance info ------
Prepare time = 329ms
Execute time = 53s 593ms <---
Avg fetch time = 53 593,00 ms
Current memory = 3 044 736
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 55 732
Writes from cache to disk = 0
Fetches from cache = 6 952 648
In other words, searching the WHOLE table is orders of magnitude faster than searching within the subset of rows returned by the JOIN.
I can't understand why the index on the LASTMODDATE field is no longer used just because I'm adding the join clause. The selectivity of the index is good and the query is very simple. What am I missing?

It seems Firebird decided to start with the condition TUSERSBACKUPS.USERID=:UserID using index RDB$FOREIGN4. This probably happens because that condition is an equality, whereas TVERSIONS.LASTMODDATE >:TheDate is an inequality, which could potentially match a much larger set of records (for example, if TheDate is a date 200 years ago, it will include the whole table).
To force Firebird to use a plan that you (rather than its optimizer) prefer, use the PLAN clause; see http://www.firebirdfaq.org/faq224/

I think I've understood what happened, and... I guess it was my fault.
I forgot that the table I'm querying has been "denormalized" to avoid such long JOINs. The problematic query can indeed be rewritten in a much shorter way :
SELECT COUNT(TVERSIONS.ID) FROM TVERSIONS
JOIN TUSERSBACKUPS ON TUSERSBACKUPS.BACKUPID = TVERSIONS.RD_BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
This one properly uses the indices I set before and has a very short execution time.
I have the impression that when Firebird detects you're deliberately using a sub-optimal path to access records in a table, it does not even try to use your indices and lets you shoot yourself in the foot...
Anyway, the problem is solved. Thank you all for your suggestions.

Kaminari : page count performance for search query

We use Kaminari for paginating records. We have hacked the total_count method because computing a total count is very slow beyond 2 million records.
def total_count
  # Memoize the planner's approximate row count for the table (pg_class.reltuples)
  @_hacked_total_count ||= self.connection.execute("SELECT (reltuples)::integer FROM pg_class r WHERE relkind = 'r' AND relname ='#{table_name}'").first["reltuples"].to_i
end
This returns approximate count of total records in the table which is fine.
However, that number is totally irrelevant when there is a search query.
I am looking for a fast way to return some number for search queries as well. It doesn't need to be totally exact, but it needs to be meaningful (i.e. somewhat pertaining to the search query).
Please suggest if our approach to get total count of records as mentioned above is not correct.
P.S - We are using Postgres as our database provider and Rails as web development framework.

After struggling with this for a couple of days, I solved it by running an EXPLAIN query and extracting the estimated row count from the plan.
Here is the code snippet I wrote:
query_plan = ActiveRecord::Base.connection.execute("EXPLAIN <query>").first["QUERY PLAN"]
query_plan.match(/rows=(\d+)/)[1].to_i # returns the rows count
With this I can safely remove the reltuples query.
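For readers who want to try the extraction step outside Rails, here is the same idea as a small Python sketch. The function name estimated_rows is made up for this example; the input is assumed to be the first line of plain-text EXPLAIN output, which carries the planner's estimate for the top node of the plan.

```python
import re
from typing import Optional

ROWS_RE = re.compile(r"rows=(\d+)")


def estimated_rows(plan_first_line: str) -> Optional[int]:
    """Extract the planner's row estimate from the top line of EXPLAIN output."""
    match = ROWS_RE.search(plan_first_line)
    # Return None instead of raising when the line carries no estimate.
    return int(match.group(1)) if match else None
```

Two caveats apply to the approach in general. The number is only as good as the table statistics, so it drifts when the tables have not been analyzed recently. And if the query carries a LIMIT, as paginated queries usually do, the top node is the Limit node and its estimate is capped by the limit, so the EXPLAIN should be run on the query without LIMIT/OFFSET.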

T-SQL speed comparison between LEFT() vs. LIKE operator

I'm creating result paging based on the first letter of a certain nvarchar column, rather than the usual kind that pages on the number of results.
And I'm now faced with the question of whether to filter results using the LIKE operator or the equality (=) operator.
select *
from table
where name like @firstletter + '%'
vs.
select *
from table
where left(name, 1) = @firstletter
I've tried searching the net for speed comparison between the two, but it's hard to find any results, since most search results are related to LEFT JOINs and not LEFT function.

"Left" vs "Like" -- one should always use "Like" when possible where indexes are implemented because "Like" is not a function and therefore can utilize any indexes you may have on the data.
"Left", on the other hand, is a function, and therefore cannot make use of indexes. This web page describes the usage differences with some examples. What this means is that SQL Server has to evaluate the function for every record that's returned.
"Substring" and other similar functions are also culprits.

Your best bet would be to measure the performance on real production data rather than trying to guess (or ask us). That's because performance can sometimes depend on the data you're processing, although in this case it seems unlikely (but I don't know that, hence why you should check).
If this is a query you will be doing a lot, you should consider another (indexed) column which contains the lowercased first letter of name and have it set by an insert/update trigger.
This will, at the cost of a minimal storage increase, make this query blindingly fast:
select * from table where name_first_char_lower = @firstletter
That's because most databases are read far more often than written, and this will amortise the cost of the calculation (done only for writes) across all reads.
It introduces redundant data but it's okay to do that for performance as long as you understand (and mitigate, as in this suggestion) the consequences and need the extra performance.
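The trigger-maintained column can be sketched end to end as follows. The sketch uses Python with an in-memory SQLite database purely as a runnable stand-in, so the trigger syntax differs from T-SQL; the table people, the trigger and index names and the helper names_starting_with are all made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    name_first_char_lower TEXT
);
CREATE INDEX people_first_char_idx ON people (name_first_char_lower);

-- Keep the derived column in sync on insert and whenever name changes.
CREATE TRIGGER people_first_char_ins AFTER INSERT ON people
BEGIN
    UPDATE people SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
    WHERE id = NEW.id;
END;

CREATE TRIGGER people_first_char_upd AFTER UPDATE OF name ON people
BEGIN
    UPDATE people SET name_first_char_lower = lower(substr(NEW.name, 1, 1))
    WHERE id = NEW.id;
END;
""")

conn.executemany("INSERT INTO people (name) VALUES (?)", [("Alice",), ("adam",), ("Bob",)])
conn.execute("UPDATE people SET name = 'Carol' WHERE name = 'Bob'")


def names_starting_with(letter):
    """Plain equality on the indexed derived column; no function call on name."""
    rows = conn.execute(
        "SELECT name FROM people WHERE name_first_char_lower = ? ORDER BY name",
        (letter.lower(),),
    )
    return [row[0] for row in rows]
```

The work of computing the first letter happens once per write, and every read becomes a simple indexed equality lookup.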

I had a similar question, and ran tests on both. Here is my code.
where (VOUCHER like 'PCNSF%'
or voucher like 'PCLTF%'
or VOUCHER like 'PCACH%'
or VOUCHER like 'PCWP%'
or voucher like 'PCINT%')
Returned 1434 rows in 1 min 51 seconds.
vs
where (LEFT(VOUCHER,5) = 'PCNSF'
or LEFT(VOUCHER,5)='PCLTF'
or LEFT(VOUCHER,5) = 'PCACH'
or LEFT(VOUCHER,4)='PCWP'
or LEFT (VOUCHER,5) ='PCINT')
Returned 1434 rows in 1 min 27 seconds
My data is faster with the left 5. As an aside my overall query does hit some indexes.

I would always suggest using the LIKE operator when the search column is indexed. I tested the above query in my production environment with select count(column_name) from table_name where left(column_name,3)='AAA' OR left(column_name,3)= 'ABA' OR ... up to 9 OR clauses. The count returned 7301477 records in 4 seconds with LEFT and in 1 second with LIKE, i.e. where column_name like 'AAA%' OR Column_Name like 'ABA%' or ... up to 9 LIKE clauses.
Calling a function in the WHERE clause is not a best practice. Refer to http://blog.sqlauthority.com/2013/03/12/sql-server-avoid-using-function-in-where-clause-scan-to-seek/

Entity Framework Core users
You can use EF.Functions.Like(columnName, searchString + "%") instead of columnName.startsWith(...) and you'll get just a LIKE operator in the generated SQL instead of all this 'LEFT' craziness!
Depending upon your needs you will probably need to preprocess searchString.
See also https://github.com/aspnet/EntityFrameworkCore/issues/7429
This function isn't present in Entity Framework (non core) EntityFunctions so I'm not sure how to do it for EF6.