Hello and thanks in advance to anyone who will take some time to answer me.
This is a question to learn how to use indexes efficiently on count queries.
PostgreSQL version: 12.11
I have a huge table, lots (~15M rows), which should have been partitioned by status (an integer column whose actual values are between 0 and 3), but was not.
Here is how the data is distributed on it:
-- lots by status
SELECT status, count(*), ROUND((count(*) * 100 / SUM(count(*)) OVER ()), 1) AS "%"
FROM lots
GROUP BY status
ORDER BY count(*) DESC;
status | count  | %
-------+--------+-----
2      | ~13.3M | 90%
0      | ~1.5M  | 10%
1      | ~6K    | ~0%
NULL   | ~0.5K  | ~0%
I also have those indexes on it:
tablename | indexname              | num_rows      | table_size | index_size | unique | number_of_scans | tuples_read | tuples_fetched
----------+------------------------+---------------+------------+------------+--------+-----------------+-------------+---------------
lots      | index_lots_on_status   | 1.4742644e+07 | 5024 MB    | 499 MB     | N      | 3451            | 7060928281  | 134328966
lots      | pidx_active_lots_on_id | 1.4742644e+07 | 5024 MB    | 38 MB      | Y      | 23491795        | 1496103827  | 2680228
where pidx_active_lots_on_id is a partial index defined as follows:
CREATE UNIQUE INDEX CONCURRENTLY "pidx_active_lots_on_id" ON "lots" ("id" DESC) WHERE status = 0;
As you can see, the partial index on lots with status = 0 is "only" 38MB (against the 0.5GB of the full status index).
I've introduced the latter index to try to optimise this query:
SELECT count(*) FROM lots WHERE status = 0;
because the count of the lots on status 0 is the most common count case for that table, but for some reason the index seems to be ignored.
I also tried to perform a more specific query:
SELECT count(id) FROM lots WHERE status = 0;
With this second query, the index is used, but with worse results.
NOTE: I also ran ANALYZE lots; after introducing the index.
My questions are:
Why is the partial index ignored in the first count case (count(*))?
Why does the second query perform worse?
Detail on plan:
EXPLAIN(ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT COUNT(*) FROM lots WHERE lots.status = 0
Aggregate (cost=539867.77..539867.77 rows=1 width=8) (actual time=16517.790..16517.791 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=79181 read=287729 dirtied=16606 written=7844
I/O Timings: read=14040.416 write=58.453
-> Index Only Scan using index_lots_on_status on public.lots (cost=0.11..539125.83 rows=1483881 width=0) (actual time=0.498..16238.580 rows=1501060 loops=1)
Output: status
Index Cond: (lots.status = 0)
Heap Fetches: 1545139
Buffers: shared hit=79181 read=287729 dirtied=16606 written=7844
I/O Timings: read=14040.416 write=58.453
Planning Time: 1.856 ms
JIT:
Functions: 3
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.466 ms, Inlining 80.076 ms, Optimization 15.797 ms, Emission 12.393 ms, Total 108.733 ms
Execution Time: 16568.670 ms
EXPLAIN(ANALYZE, COSTS, VERBOSE, BUFFERS)
SELECT COUNT(id) FROM lots WHERE lots.status = 0
Aggregate (cost=660337.71..660337.72 rows=1 width=8) (actual time=32127.686..32127.687 rows=1 loops=1)
Output: count(id)
Buffers: shared hit=80426 read=334949 dirtied=3 written=75
I/O Timings: read=11365.273 write=22.365
-> Bitmap Heap Scan on public.lots (cost=11304.17..659595.77 rows=1483887 width=4) (actual time=3783.122..30680.836 rows=1501176 loops=1)
Output: id, url, title, ... *(list of all of the 32 columns)*
Recheck Cond: (lots.status = 0)
Heap Blocks: exact=402865
Buffers: shared hit=80426 read=334949 dirtied=3 written=75
I/O Timings: read=11365.273 write=22.365
-> Bitmap Index Scan on pidx_active_lots_on_id (cost=0.00..11229.97 rows=1483887 width=0) (actual time=2534.845..2534.845 rows=1614888 loops=1)
Buffers: shared hit=4866
Planning Time: 0.248 ms
JIT:
Functions: 5
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 1.170 ms, Inlining 56.485 ms, Optimization 474.508 ms, Emission 205.882 ms, Total 738.045 ms
Execution Time: 32169.349 ms
The partial index may be smaller, but the part of the index that actually needs to be read is about the same size for both indexes. Skipping the parts of the complete index that cover the wrong "status" values is very efficient.
Your table is obviously not very well vacuumed, based on both the heap fetches and the number of buffers dirtied and written. Try vacuuming the table and then repeating the queries.
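A minimal sketch of that, assuming you can afford the extra I/O on the live table:
VACUUM (VERBOSE, ANALYZE) lots;

-- then compare the plan again
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM lots WHERE status = 0;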
I have one table that currently holds about 100K rows, and it keeps growing.
My queries have a bad response time, and that hurts my front-end user experience.
I want to ask for your help to improve the response time from the DB.
Today PostgreSQL runs on my local machine, a MacBook Pro 13 (2019) with 16 GB RAM and an Intel Core i5.
In the future I will run this DB in Docker on a better server.
What can I do to improve it for now?
Table Structure:
CREATE TABLE dots
(
    dot_id         INT,
    site_id        INT,
    latitude       float(6),
    longitude      float(6),
    rsrp           float(6),
    dist           INT,
    project_id     INT,
    dist_from_site INT,
    geom           geometry,
    dist_from_ref  INT,
    file_name      VARCHAR
);
The dot_id resets after each bulk insert, i.e. once per "file_name".
Table dots (sample-data screenshot omitted):
The queries:
Query #1:
await db.query(
`select MAX(rsrp) FROM dots where site_id=$1 and ${table}=$2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=4) (actual time=198.416..201.762 rows=1 loops=1)
Buffers: shared hit=16165 read=16031
-> Gather (cost=37159.66..37159.87 rows=2 width=4) (actual time=198.299..201.752 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16165 read=16031
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=4) (actual time=179.009..179.010 rows=1 loops=3)
Buffers: shared hit=16165 read=16031
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=4) (actual time=122.889..178.817 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16165 read=16031
Planning Time: 0.290 ms
Execution Time: 201.879 ms
(14 rows)
Query #2:
await db.query(
`SELECT DISTINCT (${table}) FROM dots where site_id=$1 and project_id = $2 and file_name ilike $3 order by ${table}`,
[site_id, project_id, filename]
);
Time for response: 1100ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=41322.12..41322.31 rows=77 width=4) (actual time=1176.071..1176.077 rows=66 loops=1)
Sort Key: dist_from_ref
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=16175 read=16021
-> HashAggregate (cost=41318.94..41319.71 rows=77 width=4) (actual time=1176.024..1176.042 rows=66 loops=1)
Group Key: dist_from_ref
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=16175 read=16021
-> Seq Scan on dots (cost=0.00..40499.42 rows=327807 width=4) (actual time=0.423..1066.316 rows=326668 loops=1)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (project_id = 1))
Rows Removed by Filter: 147813
Buffers: shared hit=16175 read=16021
Planning:
Buffers: shared hit=5 dirtied=1
Planning Time: 0.242 ms
Execution Time: 1176.125 ms
(16 rows)
Query #3:
await db.query(
`SELECT count(*) FROM dots WHERE site_id = $1 AND ${table} = $2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=8) (actual time=198.725..202.335 rows=1 loops=1)
Buffers: shared hit=16160 read=16036
-> Gather (cost=37159.66..37159.87 rows=2 width=8) (actual time=198.613..202.328 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16160 read=16036
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=8) (actual time=179.182..179.183 rows=1 loops=3)
Buffers: shared hit=16160 read=16036
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=0) (actual time=119.340..179.020 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16160 read=16036
Planning Time: 0.109 ms
Execution Time: 202.377 ms
(14 rows)
The tables do not have any indexes.
I added an index and it helped a bit... create index idx1 on dots (site_id, project_id, file_name, dist_from_site, dist_from_ref)
OK, that's a bit too much.
An index on columns (a,b) is useful for "where a=..." and also for "where a=... and b=..." but it is not really useful for "where b=...". Creating an index with many columns uses more disk space and is slower than with fewer columns, which is a waste if the extra columns in the index don't make your queries faster. Both dist_ columns in the index will probably not be used.
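As a compact illustration (the table and index names here are hypothetical):
CREATE TABLE t (a int, b int);        -- hypothetical example table
CREATE INDEX t_a_b_idx ON t (a, b);
-- the index can serve:     WHERE a = 1
-- and:                     WHERE a = 1 AND b = 2
-- but not efficiently:     WHERE b = 2 alone (no leading-column match)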
Indices are a compromise: if your table has small rows, like two integer columns, and you create an index on those two columns, it will be as big as the table itself, so you'd better be sure you need it. But in your case your rows are pretty large, at around 5 kB, and the number of rows is small, so adding an index (or several) on small int columns costs very little overhead.
So, since you very often use WHERE conditions on both site_id and project_id you can create an index on site_id,project_id. This will also work for a WHERE condition on site_id alone. If you sometimes use project_id alone, you can swap the order of the columns so it appears first, or even create another index.
You say the cardinality of these columns is about 30, so selecting on one value of site_id or project_id should hit 1/30 or 3.3% of the table, and selecting on both columns should hit 0.1% of the table, if they are uncorrelated and evenly distributed. This should already result in a substantial speedup.
You could also add an index on dist_from_site, and another one on dist_from_ref, if they have good selectivity (i.e., high cardinality in those columns). Postgres can combine indices with a bitmap index scan. But if, say, 50% of the rows in the table have the same value for dist_from_site, then an index will be useless for that value, due to not having enough selectivity.
You could also replace the previous 2-column index with 2 indices on site_id,project_id,dist_from_site and site_id,project_id,dist_from_ref. You can try it, see if it is worth the extra resources.
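A sketch of those suggestions (the index names are purely illustrative):
-- the basic two-column index from the paragraph above
CREATE INDEX idx_dots_site_project ON dots (site_id, project_id);

-- or, instead of it, the two wider variants that also cover the dist_ filters
CREATE INDEX idx_dots_site_project_dsite ON dots (site_id, project_id, dist_from_site);
CREATE INDEX idx_dots_site_project_dref  ON dots (site_id, project_id, dist_from_ref);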
Also, there's the file_name column and ILIKE. ILIKE can't use a plain btree index, which means it's slow. One solution is to use an expression index:
CREATE INDEX dots_filename ON dots( lower(file_name) );
and replace your where condition with:
lower(file_name) like lower($4)
This will use the index unless the parameter $4 starts with a "%" (and for prefix patterns the index needs to be built with text_pattern_ops or the C collation). If you never use the LIKE '%' wildcards and you're just using ILIKE for case-insensitive comparison, you can replace LIKE with the = operator: lower(a) = lower(b) is simply a case-insensitive comparison.
Likewise, you could replace the previous 2-column index with an index on site_id, project_id, lower(file_name) if you often use these three together in the WHERE condition. But as said above, it won't optimize searches on file_name alone.
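A sketch of that combined index with a matching query shape (the index name is illustrative, and this assumes case-insensitive equality on file_name is all you need):
CREATE INDEX idx_dots_site_project_fname ON dots (site_id, project_id, lower(file_name));

-- matching query shape; dist_from_ref is filtered after the index lookup
SELECT MAX(rsrp)
FROM dots
WHERE site_id = $1
  AND project_id = $2
  AND lower(file_name) = lower($3)
  AND dist_from_ref = $4;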
Since your rows are huge, even adding 1 kB of index per row will only add 20% overhead to your table, so you can overdo it without too much trouble. So go ahead and experiment, you'll see what works best.
I have a table (over 100 million records) on PostgreSQL 13.1:
CREATE TABLE report
(
id serial primary key,
license_plate_id integer,
datetime timestamp
);
Indexes (for testing, I created both of them):
create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);
So, my question is: why is a query like
select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100
so slow (~10 sec), while the same query without the ORDER BY clause runs in milliseconds?
Explain:
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
Buffers: shared hit=103
-> Index Scan using report_lp_id_idx on report r (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
Buffers: shared hit=11455 read=671
-> Sort (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
Sort Key: datetime DESC
Sort Method: top-N heapsort Memory: 128kB
Buffers: shared hit=11455 read=671
-> Bitmap Heap Scan on report r (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Heap Blocks: exact=12063
Buffers: shared hit=11455 read=671
-> Bitmap Index Scan on report_lp_id_idx (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms
You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.
The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:
CLUSTER report USING report_lp_id_idx;
Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.
But if you need this query to be fast no matter what, CLUSTER is the way to go.
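If you go that route, one rough way to see how far the physical ordering has decayed since the last CLUSTER is the correlation statistic in pg_stats (values near 1 or -1 mean the table is still well ordered on that column):
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'report'
  AND attname IN ('license_plate_id', 'datetime');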
Your two indices do exactly the same thing, so you can remove the second one; it's useless.
To optimize your query, the order of the fields inside the index must be reversed (pick a new name, since report_lp_datetime_index already exists):
create index report_datetime_lp_index on report (datetime, license_plate_id);
BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;
Limit (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
-> Index Only Scan Backward using foo_d_i on foo (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9016
Heap Fetches: 0
Planning Time: 0.339 ms
Execution Time: 9.387 ms
Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to read row references ordered by date DESC, so the ORDER BY can be satisfied by an index-only scan and no sort is needed. By adding column i (the stand-in for license_plate_id) to the index, the condition on i can be checked from the index alone, without hitting the table for every row. Since the LIMIT is low, it does not need to scan the whole index; it scans in date DESC order only until it has found enough rows satisfying the WHERE condition to return the result.
It will be faster if you create the index in date DESC order; this could also be useful if you use ORDER BY date DESC + LIMIT in other queries.
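Applied to the report table, that could look like the following (the index name is illustrative; drop the redundant earlier indexes separately):
CREATE INDEX report_datetime_desc_lp_idx ON report (datetime DESC, license_plate_id);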
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
Easy to work around. The optimum way to run this query would be an index-only scan to filter on the WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason, when "select *" is used, Postgres takes the i column from the table instead of from the index, which results in lots of unnecessary heap fetches for rows whose i is rejected by the WHERE condition.
So let's do it manually. I've also added another bogus column to make sure the SELECT * has to hit the table.
EXPLAIN (ANALYZE,buffers) SELECT * FROM foo
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i)
ORDER BY d DESC LIMIT 100;
Limit (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
Buffers: shared hit=453
-> Nested Loop (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
Buffers: shared hit=453
-> Limit (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
Buffers: shared hit=53
-> Index Only Scan using foo_d_i on foo foo_1 (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=53
-> Index Scan using foo_d_i on foo (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
Buffers: shared hit=400
Execution Time: 3.663 ms
Another option is to just add the primary key to the date,license_plate index.
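The DDL for this variant isn't shown; presumably something along these lines (the bogus column, the surrogate key, and the index name are all assumptions):
ALTER TABLE foo ADD COLUMN bar text DEFAULT 'bogus';   -- hypothetical extra column so SELECT * must visit the heap
ALTER TABLE foo ADD COLUMN id serial PRIMARY KEY;      -- provides foo_pkey
CREATE INDEX foo_d_i_id ON foo (d DESC, i, id);        -- carries the key so the subquery stays index-only
VACUUM ANALYZE foo;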
SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;
Limit (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
Buffers: shared hit=473
-> Sort (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
Sort Key: foo.d DESC
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=473
-> Nested Loop (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
Buffers: shared hit=473
-> Limit (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
Buffers: shared hit=73
-> Index Only Scan using foo_d_i_id on foo foo_1 (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=73
-> Index Scan using foo_pkey on foo (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
Index Cond: (id = foo_1.id)
Buffers: shared hit=400
Execution Time: 3.972 ms
Edit
After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:
EXPLAIN (ANALYZE,BUFFERS)
SELECT * FROM foo
JOIN (SELECT d,i FROM
(VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist
CROSS JOIN LATERAL
(SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2
ORDER BY d DESC LIMIT 100
) f3 USING (d,i)
ORDER BY d DESC LIMIT 100;
It's even faster: 2 ms, and it uses the index on (license_plate_id, date) instead of the other way around. Also, and this is important, each subquery in the LATERAL join hits only the index pages that contain rows that will actually be selected, while the previous queries hit many more index pages. So you save on RAM buffers.
If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.
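Translated back to the original report table, the same LATERAL pattern could look roughly like this (a sketch, assuming the (license_plate_id, datetime) index and the id primary key):
SELECT r.*
FROM report r
JOIN (
    SELECT t.id, t.datetime
    FROM (VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) AS ids(lp)
    CROSS JOIN LATERAL (
        SELECT id, datetime
        FROM report
        WHERE license_plate_id = ids.lp
        ORDER BY datetime DESC
        LIMIT 100
    ) t
    ORDER BY t.datetime DESC
    LIMIT 100
) f ON f.id = r.id
ORDER BY r.datetime DESC
LIMIT 100;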
Please post results for the winning query 🔥
I have a query that is very fast with a large date filter:
EXPLAIN ANALYZE
SELECT "advertisings"."id",
"advertisings"."page_id",
"advertisings"."page_name",
"advertisings"."created_at",
"posts"."image_url",
"posts"."thumbnail_url",
"posts"."post_content",
"posts"."like_count"
FROM "advertisings"
INNER JOIN "posts" ON "advertisings"."post_id" = "posts"."id"
WHERE "advertisings"."created_at" >= '2020-01-01T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
ORDER BY "like_count" DESC LIMIT 20
And the query plan is:
Limit (cost=0.85..20.13 rows=20 width=552) (actual time=0.026..0.173 rows=20 loops=1)
-> Nested Loop (cost=0.85..951662.55 rows=987279 width=552) (actual time=0.025..0.169 rows=20 loops=1)
-> Index Scan using posts_like_count_idx on posts (cost=0.43..378991.65 rows=1053015 width=504) (actual time=0.013..0.039 rows=20 loops=1)
-> Index Scan using advertisings_post_id_index on advertisings (cost=0.43..0.53 rows=1 width=52) (actual time=0.005..0.006 rows=1 loops=20)
Index Cond: (post_id = posts.id)
Filter: ((created_at >= '2020-01-01 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
Planning Time: 0.365 ms
Execution Time: 0.199 ms
However, when I narrow the filter (changing it to "created_at" >= '2020-11-25T00:00:00Z'), which returns 9 records (fewer than the limit of 20), the query is very slow:
EXPLAIN ANALYZE
SELECT "advertisings"."id",
"advertisings"."page_id",
"advertisings"."page_name",
"advertisings"."created_at",
"posts"."image_url",
"posts"."thumbnail_url",
"posts"."post_content",
"posts"."like_count"
FROM "advertisings"
INNER JOIN "posts" ON "advertisings"."post_id" = "posts"."id"
WHERE "advertisings"."created_at" >= '2020-11-25T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
ORDER BY "like_count" DESC LIMIT 20
Query plan:
Limit (cost=1000.88..8051.73 rows=20 width=552) (actual time=218.485..4155.336 rows=9 loops=1)
-> Gather Merge (cost=1000.88..612662.09 rows=1735 width=552) (actual time=218.483..4155.328 rows=9 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.85..611461.80 rows=723 width=552) (actual time=118.170..3786.176 rows=3 loops=3)
-> Parallel Index Scan using posts_like_count_idx on posts (cost=0.43..372849.07 rows=438756 width=504) (actual time=0.024..1542.094 rows=351005 loops=3)
-> Index Scan using advertisings_post_id_index on advertisings (cost=0.43..0.53 rows=1 width=52) (actual time=0.006..0.006 rows=0 loops=1053015)
Index Cond: (post_id = posts.id)
Filter: ((created_at >= '2020-11-25 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
Rows Removed by Filter: 1
Planning Time: 0.394 ms
Execution Time: 4155.379 ms
I spent hours googling but couldn't find the right solution. Any help would be greatly appreciated.
Updated
When I continue narrowing the filter to
WHERE "advertisings"."created_at" >= '2020-11-27T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
which also returns the same 9 records as the slow query above. However, this time the query is fast again:
Limit (cost=8082.99..8083.04 rows=20 width=552) (actual time=0.062..0.065 rows=9 loops=1)
-> Sort (cost=8082.99..8085.40 rows=962 width=552) (actual time=0.061..0.062 rows=9 loops=1)
Sort Key: posts.like_count DESC
Sort Method: quicksort Memory: 32kB
-> Nested Loop (cost=0.85..8057.39 rows=962 width=552) (actual time=0.019..0.047 rows=9 loops=1)
-> Index Scan using advertisings_created_at_index on advertisings (cost=0.43..501.30 rows=962 width=52) (actual time=0.008..0.012 rows=9 loops=1)
Index Cond: ((created_at >= '2020-11-27 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
-> Index Scan using posts_pkey on posts (cost=0.43..7.85 rows=1 width=504) (actual time=0.003..0.003 rows=1 loops=9)
Index Cond: (id = advertisings.post_id)
Planning Time: 0.540 ms
Execution Time: 0.096 ms
I have no idea what is happening.
PostgreSQL follows two different strategies in the first two and the last query:
If there are many matching advertisings rows, it uses a nested loop join to fetch the rows in the order of the ORDER BY clause and discards rows that don't match the condition until it has found 20.
If there are few matching advertisings rows, it fetches those few rows, then the matching rows in posts, then sorts and takes the first 20 rows.
The second execution is slow because PostgreSQL overestimates the rows in advertisings that match the condition. See how it estimates 962 instead of 9 in the third query?
The solution is to improve PostgreSQL's estimate:
if running
ANALYZE advertisings;
is enough to make the slow query fast, tell PostgreSQL to collect statistics more often:
ALTER TABLE advertisings SET (autovacuum_analyze_scale_factor = 0.05);
if that is not enough, try collecting more detailed statistics:
SET default_statistics_target = 1000;
ANALYZE advertisings;
You can experiment with values up to 10000. Once you have found a value that works, persist it:
ALTER TABLE advertisings ALTER created_at SET STATISTICS 1000;
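The per-column setting only takes effect at the next statistics collection, so re-analyze afterwards:
ANALYZE advertisings;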