My queries get very slow when I add a limit 1.
I have a table object_values with timestamped values for objects:
timestamp | objectID | value
--------------------------------
2014-01-27| 234 | ksghdf
Per object I want to get the latest value:
SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC LIMIT 1;
(I cancelled the query after more than 10 minutes)
This query is very slow when there are no values for a given objectID (it is fast if there are results).
If I remove the limit, it tells me nearly instantaneously that there are no results:
SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC;
...
Time: 0.463 ms
An EXPLAIN shows me that the query without the limit uses the index, whereas the query with limit 1 does not make use of it:
Slow query:
explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC limit 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..2350.44 rows=1 width=126)
-> Index Scan Backward using object_values_timestamp on object_values (cost=0.00..3995743.59 rows=1700 width=126)
Filter: (objectID = 53708)
Fast query:
explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Sort (cost=6540.86..6545.11 rows=1700 width=126)
Sort Key: timestamp
-> Index Scan using object_values_objectID on working_hours_t (cost=0.00..6449.65 rows=1700 width=126)
Index Cond: (objectID = 53708)
The table contains 44,884,559 rows and 66,762 distinct objectIDs.
I have separate indexes on both fields: timestamp and objectID.
I have done a vacuum analyze on the table and I have reindexed the table.
Additionally the slow query becomes fast when I set the limit to 3 or higher:
explain SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp DESC limit 3;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=6471.62..6471.63 rows=3 width=126)
-> Sort (cost=6471.62..6475.87 rows=1700 width=126)
Sort Key: timestamp
-> Index Scan using object_values_objectID on object_values (cost=0.00..6449.65 rows=1700 width=126)
Index Cond: (objectID = 53708)
In general I assume it has to do with the planner making wrong assumptions about the execution costs and therefore choosing a slower execution plan.
Is this the real reason? Is there a solution for this?
You can avoid this issue by adding an extra, unneeded column to the ORDER BY clause of the query.
SELECT * FROM object_values WHERE (objectID = 53708) ORDER BY timestamp, objectID DESC limit 1;
You're running into an issue which relates, I think, to the lack of statistics on row correlations. Consider reporting it to pgsql-bugs for reference if you are using the latest version of Postgres.
The interpretation I'd suggest for your plans is:
limit 1 makes Postgres look for a single row, and in doing so it assumes that your object_id is common enough that it'll show up reasonably quickly in an index scan.
Based on the stats you gave, it probably thinks it will need to read ~70 rows on average to find one row that fits; it just doesn't realize that object_id and timestamp correlate to the point where it's actually going to read a large portion of the table.
limit 3, in contrast, makes it realize that the value is uncommon enough, so it seriously considers (and ends up) top-N sorting an expected 1700 rows with the object_id you want, on the grounds that doing so is likely cheaper.
For instance, with better correlation statistics it might know that the distribution of these rows is such that they're all packed in the same area of the disk.
no limit clause means it'll fetch all 1700 rows anyway, so it goes straight for the index on object_id.
Solution, btw: add an index on (object_id, timestamp) or (object_id, timestamp desc).
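For example, something like this (the index name is just illustrative; column names as in the question):
CREATE INDEX object_values_objectid_ts ON object_values (objectID, timestamp DESC);
With that index the planner can jump straight to the newest row for a given objectID, so the "WHERE objectID = ? ORDER BY timestamp DESC LIMIT 1" query stays fast even when no rows match.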
I started having similar symptoms on an update-heavy table, and what was needed in my case was
analyze $table_name;
In this case the statistics needed to be refreshed, which then fixed the slow query plans that were occurring.
Supporting docs: https://www.postgresql.org/docs/current/sql-analyze.html
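If you want to check whether stale statistics are a plausible culprit before re-analyzing, the last (auto)analyze times are exposed in the standard pg_stat_user_tables view (table name here taken from the first question):
SELECT relname, last_analyze, last_autoanalyze FROM pg_stat_user_tables WHERE relname = 'object_values';
Old or NULL timestamps there suggest the planner is working from outdated row-count estimates.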
Not a fix, but sure enough, switching from limit 1 to limit 50 (for me) and returning only the first result row is way faster... Postgres 9.x in this instance. Just thought I'd mention it as a workaround, along the lines of what the OP observed.
Related
I have the following basic table in PostgreSQL 12.13:
Record:
database_id: FK to database table
updated_at: timestamp
I've created separate indexes on the database_id and updated_at fields.
I have a query that fetches the most recent 100 records for a given database id:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
ORDER BY record.updated_at DESC
LIMIT 100;
This query is EXTREMELY slow (recently took about 6 min to run). Here is the query plan:
Limit (cost=0.09..1033.82 rows=100 width=486)
-> Index Scan Backward using record_updated_at on record (cost=0.09..8149369.05 rows=788343 width=486)
Filter: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
If I change ORDER BY DESC to ORDER BY ASC then the query takes milliseconds, even though the query plan looks about the same:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
ORDER BY record.updated_at
LIMIT 100;
Limit (cost=0.09..1033.86 rows=100 width=486)
-> Index Scan using record_updated_at on record (cost=0.09..8149892.78 rows=788361 width=486)
Filter: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
If I remove the ORDER BY completely then the query is also fast:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
LIMIT 100;
Limit (cost=0.11..164.75 rows=100 width=486)
-> Index Scan using record_database_id on record (cost=0.11..1297917.10 rows=788366 width=486)
Index Cond: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
A few questions:
Why is the first query so much slower than the other two? I understand why the last one is faster, but I don't understand why changing the ORDER BY from DESC to ASC makes such a difference. Am I missing an index?
How can I speed up the initial query?
The plan follows the index to read records in order of updated_at DESC, testing each for the desired database_id, and then stops as soon as it finds 100 which have the desired database_id. But that specific desired value of database_id is much more common on one end of the index than the other. There is nothing magical about the DESC here; presumably there is some other value of database_id for which it works the opposite way, finding 100 of them when descending much faster than when ascending. If you had done EXPLAIN (ANALYZE), this would have been immediately clear based on the different reported values for "Rows Removed by Filter".
If you have a multicolumn index on (database_id,updated_at), then it can immediately jump to the desired spot of the index and read the first 100 rows it finds (after "filtering" them for visibility), and that would work fast no matter which direction you want the ORDER BY to go.
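A sketch of that index (the name is made up; the DESC is optional, since Postgres can scan an index in either direction):
CREATE INDEX record_database_id_updated_at_idx ON record (database_id, updated_at DESC);
With it, the plan should become an index scan with an Index Cond on database_id that already delivers rows in updated_at order, so LIMIT 100 stops after reading about 100 index entries whether you sort ASC or DESC.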
In Postgres, some queries are a whole lot slower when adding a LIMIT:
The queries:
SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4; -- 51 sec
SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4; -- 0.020s
SELECT * FROM review WHERE clicker_id=28 LIMIT 4; -- 0.007s
SELECT * FROM review WHERE clicker_id=28 ORDER BY id; -- 0.007s
As you can see, I need to add a dummy id to the ORDER BY in order for things to go fast. I'm trying to understand why.
Running EXPLAIN on them:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id;
gives this:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4
Limit (cost=0.44..249.76 rows=4 width=56)
-> Index Scan using review_done on review (cost=0.44..913081.13 rows=14649 width=56)
Filter: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4
Limit (cost=11970.75..11970.76 rows=4 width=56)
-> Sort (cost=11970.75..12007.37 rows=14649 width=56)
Sort Key: id, done DESC
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4
Limit (cost=0.44..3.65 rows=4 width=56)
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id
Sort (cost=12764.61..12801.24 rows=14649 width=56)
Sort Key: id
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
I'm no SQL expert, but I take it Postgres expected the query to be faster than it actually is, and so used a way to fetch the data that's actually inappropriate, correct?
The database:
The review table:
Contains 22+ million rows.
A given user will get 7,066 rows tops.
The one in the test (id 28) has 288 at the time.
Has this structure:
id: bigint Auto Increment [nextval('public.review_id_seq')]
type: review_type NULL
iteration: smallint NULL
repetitions: smallint NULL
due: timestamptz NULL
done: timestamptz NULL
added: timestamptz NULL
clicker_id: bigint NULL
monologue_id: bigint NULL
Has these indexes:
UNIQUE type, clicker_id, monologue_id, iteration
INDEX clicker_id
INDEX done, due, monologue_id
INDEX id
INDEX done DESC
INDEX type
Additional details:
Environment:
The queries were run in development with Postgres 9.6.14.
Running the queries in production (Heroku Postgres, version 9.6.16), the difference is less dramatic, but still not great: the slow queries might take 600 ms.
Variable speed:
Sometimes, the same queries (be it the exact same query, or one for a different clicker_id) run a lot faster (under 1 sec), but I don't understand why. I need them to be consistently fast.
If I use LIMIT 288 for a user that has 288 rows, then it's so much faster (< 1 sec), but if I do the same for a user with, say, 7,066 rows, then it's back to super slow.
Before I figured out the use of a dummy ORDER BY, I tried these:
Re-importing the database.
analyze review;
Setting the index for done to DESC (it used to be the default, ASC). [The challenge then was that there's no proper way to check if/when the index is done rebuilding.]
None helped.
The question:
My issue in itself is solved, but I'm dissatisfied with it:
Is there a name for this "pattern" that consists of adding a dummy column to the ORDER BY to speed things up?
How can I spot such issues in the future? (This took ages to figure out.) Unless I missed something, the EXPLAIN is not that useful:
For the slow query, the cost is misleadingly low, while for the fast variant it's misleadingly high.
Alternative: is there another way to handle this? (Because this solution feels like a hack.)
Thanks!
Similar questions:
PostgreSQL query very slow with limit 1 is almost the same question, except his queries were slow with LIMIT 1 and fine with LIMIT 3 or higher. And then of course the data is not the same.
The underlying issue here is what's called an abort-early query plan. Here's a thread from pgsql-hackers describing something similar:
https://www.postgresql.org/message-id/541A2335.3060100%40agliodbs.com
Quoting from there, this is why the planner is using the often-extremely-slow index scan when the ORDER BY done DESC is present:
As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
chapters.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio). This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
In your case, the planner thinks that if it just starts going through rows ordered by done desc (IOW, using the review_done index), it will find 4 rows with clicker_id=28 quickly. Since the rows need to be returned in "done" descending order, it thinks this will save a sort step and be faster than retrieving all rows for clicker 28 and then sorting. Given the real-world distribution of rows, this can often turn out not to be the case, requiring it to skip a huge number of rows before finding 4 with clicker=28.
A more general way of handling it is to use a CTE (which, in 9.6, is still an optimization fence - this changes in PG 12, FYI) to fetch the rows without an ORDER BY, then add the ORDER BY in the outer query. Given that fetching all rows for a user, sorting them, and returning however many you need is completely reasonable for your dataset (even the 7k-rows clicker shouldn't be an issue), you can prevent the planner from believing an abort-early plan will be fastest by not having an ORDER BY or a LIMIT in the CTE, giving you a query of something like:
WITH clicker_rows as (SELECT * FROM review WHERE clicker_id=28)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;
This should be reliably fast while still respecting the ORDER BY and the LIMIT you want. I'm not sure if there's a name for this pattern, though.
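One caveat if you ever upgrade: from PostgreSQL 12 on, a CTE that is referenced only once may be inlined, so to keep the fence behaviour you would spell it out explicitly (same query, one added keyword):
WITH clicker_rows AS MATERIALIZED (SELECT * FROM review WHERE clicker_id=28)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;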
I'm having trouble with a SELECT query. It's a simple query to select the most recent row from a table, but the query planner is doing something weird.
SELECT id FROM plots WHERE fk_id='73a711a5-cb31-545d-b8d6-75c2a0e3ba9d'
ORDER BY created_at DESC LIMIT 1
Limit (cost=0.43..155.97 rows=1 width=24)
-> Index Scan Backward using idx_plot_created_at on plots
(cost=0.43..304694.48 rows=1959 width=24)
Filter: (fk_id = '73a711a5-cb31-545d-b8d6-75c2a0e3ba9d'::uuid)
And it takes about 2.8 seconds to execute. But when I remove the ORDER BY and LIMIT, it uses the index on the fk_id field and returns 6 results in 102 ms. Shouldn't PostgreSQL just read the 6 results using the fk_id index and order them by the created_at field? Instead it seems to scan the table using the index on the field that is used for the ORDER BY clause and then check every row against the Filter condition. Why does it do that?
The SQL is very simple.
There is an index on express: "orders_express_idx" btree (express).
This works well, because rows with express = 'a' exist:
select * from orders where express = 'a' order by id desc limit 1;
Limit (cost=0.43..1.29 rows=1 width=119)
-> Index Scan Backward using orders_pkey on orders (cost=0.43..4085057.23 rows=4793692 width=119)
Filter: ((express)::text = 'a'::text)
This is slow; the data is nonexistent, and I use LIMIT:
select * from orders where express = 'b' order by id desc limit 1;
Limit (cost=0.43..648.86 rows=1 width=119)
-> Index Scan Backward using orders_pkey on orders (cost=0.43..4085061.83 rows=6300 width=119)
Filter: ((express)::text = 'b'::text)
This works well; the data is nonexistent, but I didn't use LIMIT:
select * from orders where express = 'b' order by id desc;
Sort (cost=24230.91..24246.66 rows=6300 width=119)
Sort Key: id
-> Index Scan using orders_express_idx on orders (cost=0.56..23833.35 rows=6300 width=119)
Index Cond: ((express)::text = 'b'::text)
https://www.postgresql.org/docs/9.6/static/using-explain.html
go to the section with
Here is an example showing the effects of LIMIT:
and further:
This is the same query as above, but we added a LIMIT so that not all
the rows need be retrieved, and the planner changed its mind about
what to do. Notice that the total cost and row count of the Index Scan
node are shown as if it were run to completion. However, the Limit
node is expected to stop after retrieving only a fifth of those rows,
so its total cost is only a fifth as much, and that's the actual
estimated cost of the query. This plan is preferred over adding a
Limit node to the previous plan because the Limit could not avoid
paying the startup cost of the bitmap scan, so the total cost would be
something over 25 units with that approach.
So basically, yes: adding LIMIT changes the plan, and thus it can become more effective for a smaller data set (as expected), but the impact can also be negative, depending on statistics and settings (scan cost, effective_cache_size, and so on)...
Given the execution plans for the queries above, we can explain exactly what happens. But basically this is documented behaviour: LIMIT changes the plan, and thus the execution time.
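Applying the proportional-cost rule from that quote to the plans in the question (a rough reading of the estimates, not exact planner math): for express = 'a' the planner expects 4,793,692 matching rows, so finding one via the backward scan is charged about 4,085,057 / 4,793,692 ≈ 0.85 cost units, giving the Limit node's 0.43..1.29. For express = 'b' it expects 6,300 matches, so it charges about 4,085,062 / 6,300 ≈ 648 units, giving 0.43..648.86 - still apparently cheaper than the roughly 24,000 units it estimates for scanning orders_express_idx and sorting (the plan of the query without LIMIT). But since no 'b' rows actually exist, the backward scan ends up reading the whole orders_pkey index before the LIMIT gives up, which is why the real runtime is so much worse than the estimate.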
I've been looking for a straight, clean answer to this question. Let's say I have a photo table.
Now this table has 1,000,000 rows. Let's do the following query:
SELECT * FROM photos ORDER BY creation_time LIMIT 10;
Will this query grab all 1,000,000 rows and then give me 10, or does it just grab the latest 10? I'm quite curious as to how this works, because if it does grab 1,000,000 (mind you, this table is constantly growing) then it's a wasteful query. You're basically throwing 999,990 rows away. Is there a more efficient way to do this?
Whether the database has to scan the whole table or not depends on a number of factors - in the case you describe the main factors are whether there is an ORDER BY clause and whether there is an index on the sort field(s).
All is revealed by looking at the query plan, and the cost approximations on each of the operations. Consider the case where there is no ordering clause:
testdb=> explain select * from bigtable limit 10;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=0.00..0.22 rows=10 width=39)
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(2 rows)
The planner has decided that a sequential scan is the way to go. The expected cost already gives us a clue. It is expressed as a range, 0.00..6943.06. The first number (0.00) is the amount of work the database expects to have to do before it can deliver any rows, while the second number is an estimate of the work required to deliver the whole scan.
Thus, the input to the 'Limit' clause is going to start straight away, and it will not have to process the full output of the sequential scan (since the total cost is only 0.22, not 6943.06). So it definitely will not have to read the whole table and discard most of it.
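As a rough check of where the Limit node's 0.22 comes from (just the proportional rule described in the documentation, not exact planner internals): the seq scan's total cost of 6943.06 is scaled by the 10 rows wanted out of the 314,406 estimated, and 6943.06 * 10 / 314406 ≈ 0.22, which is the upper cost shown on the Limit node.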
Now let's see what happens if you add an ORDER BY clause, using a column that is not indexed.
testdb=> explain select * from bigtable ORDER BY title limit 10;
QUERY PLAN
---------------------------------------------------------------------------------
Limit (cost=13737.26..13737.29 rows=10 width=39)
-> Sort (cost=13737.26..14523.28 rows=314406 width=39)
Sort Key: title
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(4 rows)
We have a similar plan, but there is a 'Sort' operation in between the seq scan and the limit. It has to scan the complete table, sort the full content of it, and only then can it start delivering rows to the Limit clause. It makes sense when you think about it - LIMIT is supposed to apply after ORDER BY, so it would have to be sure to have found the top 10 rows in the whole table.
Now what happens when an index is used? Suppose we have a 'time' column which is indexed:
testdb=> explain select * from bigtable ORDER BY time limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.35 rows=10 width=39)
-> Index Scan using bigtable_time_idx on bigtable (cost=0.00..10854.96 rows=314406 width=39)
(2 rows)
An index scan, using the time index, is able to start delivering rows in already-sorted order (cost starts at 0.00). The LIMIT can cut the query short after only 10 rows, so the overall cost is very small.
The moral of the story is to carefully choose which columns or combinations of columns you will index. You can't add them indiscriminately, because adding an index has a cost of its own: it makes it more expensive to insert, update, or delete records.
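For the photos example at the top of this question, that means something like the following (assuming the column from the query and that no such index exists yet; the index name is arbitrary):
CREATE INDEX photos_creation_time_idx ON photos (creation_time);
With that index in place, "ORDER BY creation_time LIMIT 10" can walk the index and stop after 10 rows instead of sorting all 1,000,000 rows first.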