I'm using FB 2.5.5 and I'm trying to understand why a very simple query does not use an index and thus takes forever to execute. I've read a lot of articles about why existing indices might be ignored by the query optimizer, but I don't understand how it can happen in my case. I recomputed the selectivity for all my indices within IB Expert, and I've also done a backup/restore of the database to be sure I wasn't missing something.
The index selectivity, as displayed by IB Expert, is approx. 0.000024, which is far from 1:
CREATE INDEX TVERSIONS_IDX_LASTMODDATE ON TVERSIONS (LASTMODDATE)
The table I'm querying contains approx. 2M records:
SELECT COUNT(ID) FROM TVERSIONS
2479518
I'm trying to fetch all records based on the LASTMODDATE field (a TIMESTAMP, indexed by TVERSIONS_IDX_LASTMODDATE). An oversimplified version of the query would be:
SELECT COUNT(ID) FROM TVERSIONS WHERE LASTMODDATE > :TheDate
In this case, the execution plan shows that the index is actually used:
Plan
PLAN (TVERSIONS INDEX (TVERSIONS_IDX_LASTMODDATE))
...and records matching the condition are fetched very quickly :
------ Performance info ------
Prepare time = 172ms
Execute time = 16ms <----
Avg fetch time = 16,00 ms
Current memory = 2 714 672
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 57
Writes from cache to disk = 0
Fetches from cache = 387
Now, the "real" query fetches the same fields using the same condition on LASTMODDATE but adds a JOIN over 3 tables:
SELECT COUNT(ID) FROM TVERSIONS
JOIN TFILES ON TFILES.ID = TVERSIONS.FILEID
JOIN TROOTS ON TROOTS.ID = TFILES.ROOTID
JOIN TUSERSBACKUPS ON TROOTS.BACKUPID = TUSERSBACKUPS.BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
Now the query plan does not use the index anymore:
Plan
PLAN JOIN (TUSERSBACKUPS INDEX (RDB$FOREIGN4), TROOTS INDEX (RDB$FOREIGN3), TFILES INDEX (RDB$FOREIGN2), TVERSIONS INDEX (RDB$FOREIGN6))
Unsurprisingly, the execution time is far slower (approx. 1 minute):
------ Performance info ------
Prepare time = 329ms
Execute time = 53s 593ms <---
Avg fetch time = 53 593,00 ms
Current memory = 3 044 736
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 55 732
Writes from cache to disk = 0
Fetches from cache = 6 952 648
In other words, searching the WHOLE table is orders of magnitude faster than searching the subset of rows returned by the JOIN.
I can't understand why the index on the LASTMODDATE field is no longer used just because I'm adding the join clause. The selectivity of the index is good and the query is very simple. What am I missing?
It seems Firebird decided to start with the condition TUSERSBACKUPS.USERID = :UserID using index RDB$FOREIGN4. This probably happens because that condition is an equality, while TVERSIONS.LASTMODDATE > :TheDate is an inequality, which could match a potentially much larger set of records (for example, if TheDate is a date 200 years ago, it will include the whole table).
To force Firebird to use the plan you prefer (rather than the one its optimizer picks), use the PLAN clause, see http://www.firebirdfaq.org/faq224/
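For illustration, a hand-written plan that drives the join from TVERSIONS could look roughly like this. It is only a sketch: PK_TFILES, PK_TROOTS and IDX_TUSERSBACKUPS_BACKUPID are placeholder names, to be replaced with the real primary/foreign key index names from your schema.
SELECT COUNT(TVERSIONS.ID) FROM TVERSIONS
JOIN TFILES ON TFILES.ID = TVERSIONS.FILEID
JOIN TROOTS ON TROOTS.ID = TFILES.ROOTID
JOIN TUSERSBACKUPS ON TROOTS.BACKUPID = TUSERSBACKUPS.BACKUPID
WHERE TUSERSBACKUPS.USERID = :UserID
AND TVERSIONS.LASTMODDATE > :TheDate
-- all index names below except TVERSIONS_IDX_LASTMODDATE are placeholders
PLAN JOIN (TVERSIONS INDEX (TVERSIONS_IDX_LASTMODDATE),
TFILES INDEX (PK_TFILES),
TROOTS INDEX (PK_TROOTS),
TUSERSBACKUPS INDEX (IDX_TUSERSBACKUPS_BACKUPID))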
I think I've understood what happened, and... I guess it was my fault.
I forgot that the table I'm querying has been "denormalized" to avoid such long JOINs. The problematic query can indeed be rewritten in a much shorter way:
SELECT COUNT(TVERSIONS.ID) FROM TVERSIONS
JOIN TUSERSBACKUPS ON TUSERSBACKUPS.BACKUPID = TVERSIONS.RD_BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
This one properly uses the indices I set before and has a very short execution time.
I have the impression that when Firebird detects you're deliberately using a sub-optimal path to access records in a table, it does not even try to use your indices and lets you shoot yourself in the foot...
Anyway, the problem is solved. Thank you all for your suggestions.
I'm learning DB2, and I came across this clause: OPTIMIZE FOR 1 ROW right after FETCH FIRST 100 ROWS ONLY.
I understand that FETCH FIRST 100 ROWS ONLY would give me the first 100 rows that qualified. But I don't understand what OPTIMIZE FOR 1 ROW is really doing here. I read this DB2 documentation, which says
Use OPTIMIZE FOR 1 ROW clause to influence the access path. OPTIMIZE FOR 1 ROW tells Db2 to select an access path that returns the first qualifying row quickly.
and this DB2 documentation, which says
In general, if you are retrieving only a few rows, specify OPTIMIZE FOR 1 ROW to influence the access path that Db2 selects.
But I'm still confused. Would using OPTIMIZE FOR n ROWS make a query more efficient?
I also found this post on SO and it seems like OPTIMIZE FOR n ROWS is equivalent to FETCH FIRST n ROWS ONLY per the accepted answer.
But when I experimented myself, using OPTIMIZE FOR n ROWS instead of FETCH FIRST n ROWS ONLY, the result set was not the same. With OPTIMIZE FOR n ROWS, the query returns all qualifying rows.
Could someone please explain to me what OPTIMIZE FOR n ROWS really does? Thanks!
Would using OPTIMIZE FOR n ROWS make a query more efficient?
Not necessarily. However, it might cause your application to start receiving rows earlier than it otherwise would, if there is an access plan alternative that can find the first row matching the query criteria faster, even though the entire query will run longer as a result.
There's this bit in the Db2 for LUW docs that gives some examples specific to that platform:
Try specifying OPTIMIZE FOR n ROWS along with FETCH FIRST n ROWS ONLY, to encourage query access plans that return rows directly from the referenced tables, without first performing a buffering operation such as inserting into a temporary table, sorting, or inserting into a hash join hash table.
Applications that specify OPTIMIZE FOR n ROWS to encourage query access plans that avoid buffering operations, yet retrieve the entire result set, might experience poor performance. This is because the query access plan that returns the first n rows fastest might not be the best query access plan if the entire result set is being retrieved.
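For example (the table and column names here are made up), a statement combining the two clauses could look like this; FETCH FIRST actually limits the result set, while OPTIMIZE FOR only influences the chosen access path:
SELECT ORDER_ID, ORDER_DATE
FROM ORDERS
WHERE CUSTOMER_ID = 12345
ORDER BY ORDER_DATE DESC
FETCH FIRST 100 ROWS ONLY -- at most 100 rows are returned
OPTIMIZE FOR 1 ROW        -- hint only: prefer a path that delivers the first row quickly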
I'm trying to locate the cause of a slow query that hits 3 tables, with record counts ranging from a few hundred thousand to several million:
tango - 6166101
kilo_golf - 822805
three_romeo - 535782
Version
PostgreSQL 11.10
Current query
select count(*) as aggregate
from "tango"
where "lima" = juliet
and not exists(select 1
from "three_romeo" quebec_seven_oscar
where quebec_seven_oscar.six_two = tango.six_two
and quebec_seven_oscar."romeo" >= six_seven
and quebec_seven_oscar."three_seven" in
('partial_survey', 'survey_completed', 'wrong_number', 'moved'))
and ("mike" <= '2021-02-03 13:26:22' or "mike" is null)
and not exists(select 1
from "kilo_golf" as "delta"
where "delta"."to" = tango.six_two
and "two" = november
and "delta"."romeo" >= '2021-02-05 13:49:15')
and not exists(select 1
from "three_romeo" as "four"
where "four".foxtrot = tango.quebec_seven_victor
and "four"."three_seven" in ('deceased', 'block_calls', 'block_all'))
and "tango"."yankee" is null;
This is the analysis of the query in its current state - https://explain.depesz.com/s/Di51
It feels like the problematic area is in the tango table:
tango.lima is equal to 'juliet' in the majority of records (low cardinality); we don't currently have an index on this
The long filter makes me wonder whether I should create some sort of composite index
After reading another post (https://stackoverflow.com/a/50148594/682754) I tried removing the or "mike" is null and this helped quite a lot:
https://explain.depesz.com/s/XgmB
Should I try and remove the not exists in favour of using joins?
Thanks
I don't think that using explicit joins will help you, since PostgreSQL converts NOT EXISTS into an anti-join anyway.
But you spotted the problem: it is the OR. I would recommend that you use a dynamic query: add the condition only if mike is not NULL, rather than having a static query with OR.
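A rough sketch of that idea (the :TheDate parameter name is invented, and the three NOT EXISTS conditions are left out for brevity): the application assembles the statement and includes the "mike" predicate only when a cutoff actually applies, so the OR never reaches the planner.
-- variant sent when a cutoff date is supplied
select count(*) as aggregate
from "tango"
where "lima" = juliet
and "mike" <= :TheDate
-- ...the three not exists conditions from the original query go here, unchanged...
and "tango"."yankee" is null;

-- variant sent when the date filter does not apply: the predicate (and with it the OR) simply disappears
select count(*) as aggregate
from "tango"
where "lima" = juliet
-- ...the three not exists conditions from the original query go here, unchanged...
and "tango"."yankee" is null;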
You are counting about 6 million rows, and that will take some time. The reason that removing or "mike" is null can help so much is that it no longer needs to count the rows where mike is null, which are the vast majority of them.
But this is of no use to you if you actually do need to count those rows. So, do you? I'm having a hard time picturing a situation in which you need an exact count of 6 million rows often enough that waiting 4 seconds for it is a problem.
Question
I would like to know: How can I rewrite/alter my search query/strategy to get an acceptable performance for my end users?
The search
I'm implementing a search for our users; they can search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
select
c.id,
c.ts_description,
c.latitude,
c.longitude,
g.group
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions in order of impact which were benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they seem to take 71s, which is the full cost of the 3 clauses in isolation. My expectation was that, when putting all 3 clauses together, they would work sequentially, i.e. each on the subset of data left by the previous clause, so the timing should drop dramatically - but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline (see the sketch after this list)
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
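For reference, the indexes described in the list above would typically be defined along these lines (the index names are invented; ll_to_earth() comes from the cube/earthdistance extensions):
-- GIN index backing the full text clause
CREATE INDEX candidates_ts_description_idx ON entities.candidates USING gin (ts_description);
-- expression GiST index on ll_to_earth(); note the planner uses it for earth_box() containment tests rather than a bare earth_distance() comparison
CREATE INDEX candidates_earth_idx ON entities.candidates USING gist (ll_to_earth(latitude, longitude));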
It seems you do not need the subquery, and it is also good practice to filter on numeric fields; so, instead of filtering with where v.group = 'medical', for example, create a dictionary table and just filter with where v.group = 1:
select distinct c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g.group = 1
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to check your execution plan. These quick tips should help you improve it considerably.
There were some best-practice measures that I had not considered; I have since implemented these and gained a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that more complicated full text search queries simply had an immense amount of work to do. I reduced this to 10,000 characters, which has made a big difference, and for my use case this is an acceptable trade-off.
Create a Materialised View
I created a materialised view that contains the join, which offloads a little bit of the work. Additionally, I built my indexes on it and run a concurrent refresh on a 2-hour interval. This gives me a pretty stable table to work with.
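As a rough sketch of that setup (the view and index names are invented, and it assumes the join yields one row per candidate id so the unique index required by REFRESH ... CONCURRENTLY can be created):
CREATE MATERIALIZED VIEW candidate_search AS
SELECT c.id, c.ts_description, c.latitude, c.longitude, g.group
FROM entities.candidates c
JOIN entities.candidates_connections cc ON cc.candidates_id = c.id
JOIN system.groups g ON cc.systems_id = g.id;

CREATE UNIQUE INDEX candidate_search_id_idx ON candidate_search (id); -- required for REFRESH ... CONCURRENTLY
CREATE INDEX candidate_search_ts_idx ON candidate_search USING gin (ts_description);

-- run on a schedule (e.g. every 2 hours) so readers are never blocked
REFRESH MATERIALIZED VIEW CONCURRENTLY candidate_search;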
Even though my search yields 10k records, I end up paginating on the front-end, so I only ever bring back up to 100 results to the screen; this allows me to join onto the original table for only the 100 records I'm going to send back.
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to hold my materialised view, then ran pg_prewarm on it. Keeping it in memory yielded the biggest performance increase for me, bringing a 2-minute query down to 3s.
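pg_prewarm is a contrib extension, so loading the materialised view (using the invented name from the sketch above) into the buffer cache is just:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('candidate_search'); -- reads the whole relation into shared_buffers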
I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob', the query is fast on a table of 10M records (<0.5 seconds).
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different from this post, as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need that there. I've tried re-writing it (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1
Examine the execution plan to see which index is being used. In this case, the separate indexes on name and address are not enough. You should create a combined index on name, then address, for this query.
Think of an index as a system-maintained copy of certain columns, stored in a different order from the original table. In this case, you want to first find matches by name, then tie-break on address, then take rows until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first; address then serves as the tie-breaker.
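A sketch of the suggested index (the index name is made up):
CREATE INDEX foo_name_address_idx ON foo (name, address DESC); -- combined index on name, then address
Declaring address DESC matches the ORDER BY exactly; a plain (name, address) index usually works as well, since most engines can also scan an index backwards.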
Under the original indexes, if the address index is the one chosen, then the query's speed will vary based on how quickly it can find matches.
The plan (in English) would be: proceed through all of the rows, which happen to already be sorted by address, discard any that do not match the name, and keep going until we have enough.
So if you never get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: proceed through all of the name-matching rows, which happen to already be sorted by address; start with the first one and take them until you have enough; if you run out, stop.
Since a query without the ORDER BY is much faster than the one with the ORDER BY clause, I'd make 2 queries, as sketched below:
- One without the ORDER BY, with LIMIT 1, just to know whether you have at least one matching record.
- If there is at least one, it's safe to run the query with the ORDER BY.
- If there's no record, there's no need to run the second query.
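A sketch of that two-step approach, using the query from the question:
-- probe: is there at least one matching row? (fast, no sort needed)
SELECT 1 FROM foo WHERE name = 'Susan' LIMIT 1;

-- only issued if the probe returned a row
SELECT * FROM foo WHERE name = 'Susan' ORDER BY address DESC LIMIT 25 OFFSET 1;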
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it would be useful to know which indices you have created. Without that information it is very hard to give you a proper solution!
We have used Kaminari for paginating records. We have hacked the total_count method because computing a total count is very slow beyond 2M+ records.
def total_count
  @_hacked_total_count || (@_hacked_total_count = self.connection.execute("SELECT (reltuples)::integer FROM pg_class r WHERE relkind = 'r' AND relname = '#{table_name}'").first["reltuples"].to_i)
end
This returns the approximate count of total records in the table, which is fine.
However, that number is totally irrelevant when there is a search query.
I am looking for a fast way to return some number for search queries as well. (It doesn't need to be totally exact, but it needs to be meaningful, i.e. somewhat pertaining to the search query.)
Please also suggest if our approach to getting the total count of records, as shown above, is not correct.
P.S. - We are using Postgres as our database and Rails as our web development framework.
After stumbling over this for a couple of days, I solved it by using an EXPLAIN query and then extracting the estimated row count from its output.
Here is the code snippet I wrote:
query_plan = ActiveRecord::Base.connection.execute("EXPLAIN <query>").first["QUERY PLAN"]
query_plan.match(/rows=(\d+)/)[1].to_i # returns the rows count
With this I can safely remove the reltuples query.
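For context, the rows= figure that the regular expression picks up is the planner's estimate printed in the first line of the EXPLAIN output; for example (the table name and the numbers here are purely illustrative):
EXPLAIN SELECT * FROM users WHERE name ILIKE '%smith%';
-- Seq Scan on users  (cost=0.00..2334.00 rows=487 width=120)
--   Filter: (name ~~* '%smith%'::text)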