We have two tables - event_deltas and deltas_to_retrieve - which both have BTREE indexes on the same two columns:
CREATE TABLE event_deltas
(
event_id UUID REFERENCES events(id) NOT NULL,
version INT NOT NULL,
json_patch JSONB NOT NULL,
PRIMARY KEY (event_id, version)
);
CREATE TABLE deltas_to_retrieve(event_id UUID NOT NULL, version INT NOT NULL);
CREATE UNIQUE INDEX event_id_version ON deltas_to_retrieve (event_id, version);
In terms of table size, deltas_to_retrieve is a tiny lookup table of ~500 rows. The event_deltas table contains ~7,000,000 rows. Due to the size of the latter table, we want to limit how much we retrieve at once. Therefore, the tables are queried as follows:
SELECT ed.event_id, ed.version
FROM deltas_to_retrieve zz, event_deltas ed
WHERE zz.event_id = ed.event_id
AND ed.version > zz.version
ORDER BY ed.event_id, ed.version
LIMIT 5000;
Without the LIMIT, for the example I'm looking at the query returns ~30,000 rows.
What's odd about this query is the impact of the ORDER BY. Due to the existing indexes, the data comes back in the order we want with or without it. I would rather keep the explicit ORDER BY there so we're future-proofed against future changes, as well as for readability etc. However, as things stand it has a significant negative impact on performance.
According to the docs:
An important special case is ORDER BY in combination with LIMIT n: an explicit sort will have to process all the data to identify the first n rows, but if there is an index matching the ORDER BY, the first n rows can be retrieved directly, without scanning the remainder at all.
This makes me think that, given the indexes we already have in place, the ORDER BY should not slow down the query at all. However, in practice I'm seeing execution times of ~10s with the ORDER BY and <1s without. I've included the plans outputted by EXPLAIN below:
Without ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.56..20033.38 rows=5000 width=20)
-> Nested Loop (cost=0.56..331980.39 rows=82859 width=20)
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..9.37 rows=537 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..616.66 rows=154 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.56..20039.35 rows=5000 width=20) (actual time=3.675..2083.063 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Nested Loop (cost=0.56..1055082.88 rows=263260 width=20) (actual time=3.673..2080.745 rows=5000 loops=1)
" Buffers: shared hit=1450 read=4783, local hit=2"
-> Seq Scan on deltas_to_retrieve zz (cost=0.00..27.00 rows=1700 width=20) (actual time=0.022..0.307 rows=241 loops=1)
Buffers: local hit=2
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..619.07 rows=155 width=20) (actual time=1.317..8.612 rows=21 loops=241)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Heap Fetches: 5000
Buffers: shared hit=1450 read=4783
Planning Time: 1.150 ms
Execution Time: 2084.647 ms
With ORDER BY
Just EXPLAIN:
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20)
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20)
-> Materialize (cost=0.28..6178.03 rows=1700 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20)
More detailed EXPLAIN (ANALYZE, BUFFERS):
QUERY PLAN
Limit (cost=0.84..929199.06 rows=5000 width=20) (actual time=4457.770..506706.443 rows=5000 loops=1)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Merge Join (cost=0.84..48924145.53 rows=263260 width=20) (actual time=4457.768..506704.815 rows=5000 loops=1)
Merge Cond: (ed.event_id = zz.event_id)
Join Filter: (ed.version > zz.version)
" Buffers: shared hit=78806 read=1071004 dirtied=148, local hit=63"
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..48873353.76 rows=12318733 width=20) (actual time=4.566..505443.407 rows=1813438 loops=1)
Heap Fetches: 1814767
Buffers: shared hit=78806 read=1071004 dirtied=148
-> Materialize (cost=0.28..6178.03 rows=1700 width=20) (actual time=0.063..2.524 rows=5000 loops=1)
Buffers: local hit=63
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..6173.78 rows=1700 width=20) (actual time=0.056..0.663 rows=78 loops=1)
Heap Fetches: 78
Buffers: local hit=63
Planning Time: 1.088 ms
Execution Time: 506709.819 ms
I'm not very experienced at reading these plans, but it's obviously thinking that it needs to retrieve everything, sort it and then return TOP N, rather than just grabbing the first N using the index. It's doing a Seq Scan on the smaller deltas_to_retrieve table rather than an Index Only Scan - is that the problem? That table is v. small (~500 rows), so I wonder if it's just not bothering to use the index because of that?
Postgres version: 11.12
Upgrading to Postgres 13 fixed this for us, with the introduction of incremental sort. From some docs on the feature:
Incremental sorting: Sorting is a performance-intensive task, so every improvement in this area can make a difference. Now PostgreSQL 13 introduces incremental sorting, which leverages early-stage sorts of a query and sorts only the incremental unsorted fields, increasing the chances the sorted block will fit in memory and by that, improving performance.
The new query plan from EXPLAIN is as follows, with the query now completing in <500ms consistently:
QUERY PLAN
Limit (cost=71.06..820.32 rows=5000 width=20)
-> Incremental Sort (cost=71.06..15461.82 rows=102706 width=20)
" Sort Key: ed.event_id, ed.version"
Presorted Key: ed.event_id
-> Nested Loop (cost=0.84..6659.05 rows=102706 width=20)
-> Index Only Scan using event_id_version on deltas_to_retrieve zz (cost=0.28..1116.39 rows=541 width=20)
-> Index Only Scan using event_deltas_pkey on event_deltas ed (cost=0.56..8.35 rows=190 width=20)
Index Cond: ((event_id = zz.event_id) AND (version > zz.version))
Note:
[Start by running VACUUM ANALYZE on both tables]
since deltas_to_retrieve only needs to contain the lowest versions, it could be unique on event_id
you can simplify the query to:
SELECT event_id, version
FROM event_deltas ed
WHERE EXISTS (
SELECT * FROM deltas_to_retrieve zz
WHERE zz.event_id = ed.event_id
AND zz.version < ed.version
)
ORDER BY event_id, version
LIMIT 5000;
I have table (over 100 millions records) on PostgreSQL 13.1
CREATE TABLE report
(
id serial primary key,
license_plate_id integer,
datetime timestamp
);
Indexes (for test I create both of them):
create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);
So, my question is why query like
select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100
Is very slow (~10sec). But query without order statement is fast (milliseconds).
Explain:
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
Buffers: shared hit=103
-> Index Scan using report_lp_id_idx on report r (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
Buffers: shared hit=11455 read=671
-> Sort (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
Sort Key: datetime DESC
Sort Method: top-N heapsort Memory: 128kB
Buffers: shared hit=11455 read=671
-> Bitmap Heap Scan on report r (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Heap Blocks: exact=12063
Buffers: shared hit=11455 read=671
-> Bitmap Index Scan on report_lp_id_idx (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms
You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.
The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:
CLUSTER report_lp_id_idx USING report_lp_id_idx;
Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.
But if you need this query to be fast no matter what, CLUSTER is the way to go.
Your two indices do exactly the same thing, so you can remove the second one, it's useless.
To optimize your query, the order of the fields inside the index must be reversed:
create index report_lp_datetime_index on report (datetime,license_plate_id);
BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;
Limit (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
-> Index Only Scan Backward using foo_d_i on foo (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9016
Heap Fetches: 0
Planning Time: 0.339 ms
Execution Time: 9.387 ms
Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column id to the index, an index-only scan can be performed to test the condition on id, without hitting the table for every row. Since there is a low LIMIT value it does not need to scan the whole index, it only scans it in date DESC order until it finds enough rows satisfying the WHERE condition to return the result.
It will be faster if you create the index in date DESC order, this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.
Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.
EXPLAIN (ANALYZE,buffers) SELECT * FROM foo
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i)
ORDER BY d DESC LIMIT 100;
Limit (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
Buffers: shared hit=453
-> Nested Loop (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
Buffers: shared hit=453
-> Limit (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
Buffers: shared hit=53
-> Index Only Scan using foo_d_i on foo foo_1 (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=53
-> Index Scan using foo_d_i on foo (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
Buffers: shared hit=400
Execution Time: 3.663 ms
Another option is to just add the primary key to the date,license_plate index.
SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;
Limit (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
Buffers: shared hit=473
-> Sort (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
Sort Key: foo.d DESC
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=473
-> Nested Loop (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
Buffers: shared hit=473
-> Limit (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
Buffers: shared hit=73
-> Index Only Scan using foo_d_i_id on foo foo_1 (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=73
-> Index Scan using foo_pkey on foo (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
Index Cond: (id = foo_1.id)
Buffers: shared hit=400
Execution Time: 3.972 ms
Edit
After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:
EXPLAIN (ANALYZE,BUFFERS)
SELECT * FROM foo
JOIN (SELECT d,i FROM
(VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist
CROSS JOIN LATERAL
(SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2
ORDER BY d DESC LIMIT 100
) f3 USING (d,i)
ORDER BY d DESC LIMIT 100;
It's even faster: 2ms, and it uses the index on (license_plate_id,date) instead of the other way around. Also, and this is important, since each subquery in the lateral hits only the index pages that contain rows that will actually be selected, while the previous queries hit much more index pages. So you save on RAM buffers.
If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.
Please post results for the winning query 🔥
I have a query that is very fast for large date filter
EXPLAIN ANALYZE
SELECT "advertisings"."id",
"advertisings"."page_id",
"advertisings"."page_name",
"advertisings"."created_at",
"posts"."image_url",
"posts"."thumbnail_url",
"posts"."post_content",
"posts"."like_count"
FROM "advertisings"
INNER JOIN "posts" ON "advertisings"."post_id" = "posts"."id"
WHERE "advertisings"."created_at" >= '2020-01-01T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
ORDER BY "like_count" DESC LIMIT 20
And the query plan is:
Limit (cost=0.85..20.13 rows=20 width=552) (actual time=0.026..0.173 rows=20 loops=1)
-> Nested Loop (cost=0.85..951662.55 rows=987279 width=552) (actual time=0.025..0.169 rows=20 loops=1)
-> Index Scan using posts_like_count_idx on posts (cost=0.43..378991.65 rows=1053015 width=504) (actual time=0.013..0.039 rows=20 loops=1)
-> Index Scan using advertisings_post_id_index on advertisings (cost=0.43..0.53 rows=1 width=52) (actual time=0.005..0.006 rows=1 loops=20)
Index Cond: (post_id = posts.id)
Filter: ((created_at >= '2020-01-01 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
Planning Time: 0.365 ms
Execution Time: 0.199 ms
However, when I narrow the filter (change "created_at" >= '2020-11-25T00:00:00Z') which returns 9 records (which is less than the limit 20), the query is very slow
EXPLAIN ANALYZE
SELECT "advertisings"."id",
"advertisings"."page_id",
"advertisings"."page_name",
"advertisings"."created_at",
"posts"."image_url",
"posts"."thumbnail_url",
"posts"."post_content",
"posts"."like_count"
FROM "advertisings"
INNER JOIN "posts" ON "advertisings"."post_id" = "posts"."id"
WHERE "advertisings"."created_at" >= '2020-11-25T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
ORDER BY "like_count" DESC LIMIT 20
Query plan:
Limit (cost=1000.88..8051.73 rows=20 width=552) (actual time=218.485..4155.336 rows=9 loops=1)
-> Gather Merge (cost=1000.88..612662.09 rows=1735 width=552) (actual time=218.483..4155.328 rows=9 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.85..611461.80 rows=723 width=552) (actual time=118.170..3786.176 rows=3 loops=3)
-> Parallel Index Scan using posts_like_count_idx on posts (cost=0.43..372849.07 rows=438756 width=504) (actual time=0.024..1542.094 rows=351005 loops=3)
-> Index Scan using advertisings_post_id_index on advertisings (cost=0.43..0.53 rows=1 width=52) (actual time=0.006..0.006 rows=0 loops=1053015)
Index Cond: (post_id = posts.id)
Filter: ((created_at >= '2020-11-25 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
Rows Removed by Filter: 1
Planning Time: 0.394 ms
Execution Time: 4155.379 ms
I spent hours googling but couldn't find the right solution. And help would be greatly appreciated.
Updated
When I continue narrowing the filter to
WHERE "advertisings"."created_at" >= '2020-11-27T00:00:00Z'
AND "advertisings"."created_at" < '2020-12-02T23:59:59Z'
which also returns the 9 records as the slow above query. However, this time, the query is really fast again.
Limit (cost=8082.99..8083.04 rows=20 width=552) (actual time=0.062..0.065 rows=9 loops=1)
-> Sort (cost=8082.99..8085.40 rows=962 width=552) (actual time=0.061..0.062 rows=9 loops=1)
Sort Key: posts.like_count DESC
Sort Method: quicksort Memory: 32kB
-> Nested Loop (cost=0.85..8057.39 rows=962 width=552) (actual time=0.019..0.047 rows=9 loops=1)
-> Index Scan using advertisings_created_at_index on advertisings (cost=0.43..501.30 rows=962 width=52) (actual time=0.008..0.012 rows=9 loops=1)
Index Cond: ((created_at >= '2020-11-27 00:00:00'::timestamp without time zone) AND (created_at < '2020-12-02 23:59:59'::timestamp without time zone))
-> Index Scan using posts_pkey on posts (cost=0.43..7.85 rows=1 width=504) (actual time=0.003..0.003 rows=1 loops=9)
Index Cond: (id = advertisings.post_id)
Planning Time: 0.540 ms
Execution Time: 0.096 ms
I have no idea what happens
PostgreSQL follows two different strategies in the first two and the last query:
If there are many matching advertisings rows, it uses a nested loop join to fetch the rows in the order of the ORDER BY clause and discards rows that don't match the condition until it has found 20.
If there are few matching advertisings rows, it fetches those few rows, then the matching rows in posts, then sorts and takes the first 20 rows.
The second execution is slow because PostgreSQL overestimates the rows in advertisings that match the condition. See how it estimates 962 instead of 9 in the third query?
The solution is to improve PostgreSQL's estimate:
if running
ANALYZE advertisings;
is enough to make the slow query fast, tell PostgreSQL to collect statistics more often:
ALTER TABLE advertisings SET (autovacuum_analyze_scale_factor = 0.05);
if that is not enough, try collecting more detailed statistics:
SET default_statistics_target = 1000;
ANALYZE advertisings;
You can experiment with values up to 10000. Once you found the value that works, persist it:
ALTER TABLE advertisings ALTER created_at SET STATISTICS 1000;
Postgres is using a much heavier Seq Scan on table tracking when an index is available. The first query was the original attempt, which uses a Seq Scan and therefore has a slow query. I attempted to force an Index Scan with an Inner Select, but postgres converted it back to effectively the same query with nearly the same runtime. I finally copied the list from the Inner Select of query two to make the third query. Finally postgres used the Index Scan, which dramatically decreased the runtime. The third query is not viable in a production environment. What will cause postgres to use the last query plan?
(vacuum was used on both tables)
Tables
tracking (worker_id, localdatetime) total records: 118664105
project_worker (id, project_id) total records: 12935
INDEX
CREATE INDEX tracking_worker_id_localdatetime_idx ON public.tracking USING btree (worker_id, localdatetime)
Queries
SELECT worker_id, localdatetime FROM tracking t JOIN project_worker pw ON t.worker_id = pw.id WHERE project_id = 68475018
Hash Join (cost=29185.80..2638162.26 rows=19294218 width=16) (actual time=16.912..18376.032 rows=177681 loops=1)
Hash Cond: (t.worker_id = pw.id)
-> Seq Scan on tracking t (cost=0.00..2297293.86 rows=118716186 width=16) (actual time=0.004..8242.891 rows=118674660 loops=1)
-> Hash (cost=29134.80..29134.80 rows=4080 width=8) (actual time=16.855..16.855 rows=2102 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 115kB
-> Seq Scan on project_worker pw (cost=0.00..29134.80 rows=4080 width=8) (actual time=0.004..16.596 rows=2102 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 10833
Planning Time: 0.192 ms
Execution Time: 18382.698 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (SELECT id FROM project_worker WHERE project_id = 68475018 LIMIT 500)
Hash Semi Join (cost=6905.32..2923969.14 rows=27733254 width=24) (actual time=19.715..20191.517 rows=20530 loops=1)
Hash Cond: (t.worker_id = project_worker.id)
-> Seq Scan on tracking t (cost=0.00..2296948.27 rows=118698327 width=24) (actual time=0.005..9184.676 rows=118657026 loops=1)
-> Hash (cost=6899.07..6899.07 rows=500 width=8) (actual time=1.103..1.103 rows=500 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Limit (cost=0.00..6894.07 rows=500 width=8) (actual time=0.006..1.011 rows=500 loops=1)
-> Seq Scan on project_worker (cost=0.00..28982.65 rows=2102 width=8) (actual time=0.005..0.968 rows=500 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 4493
Planning Time: 0.224 ms
Execution Time: 20192.421 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (322016383,316007840,...,285702579)
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4766798.31 rows=21877360 width=24) (actual time=0.079..29.756 rows=22112 loops=1)
" Index Cond: (worker_id = ANY ('{322016383,316007840,...,285702579}'::bigint[]))"
Planning Time: 1.162 ms
Execution Time: 30.884 ms
... is in place of the 500 id entries used in the query
Same query ran on another set of 500 id's
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4776714.91 rows=21900980 width=24) (actual time=0.105..5528.109 rows=117838 loops=1)
" Index Cond: (worker_id = ANY ('{286237712,286237844,...,216724213}'::bigint[]))"
Planning Time: 2.105 ms
Execution Time: 5534.948 ms
The distribution of "worker_id" within "tracking" seems very skewed. For one thing, the number of rows in one of your instances of query 3 returns over 5 times as many rows as the other instance of it. For another, the estimated number of rows is 100 to 1000 times higher than the actual number. This can certainly lead to bad plans (although it is unlikely to be the complete picture).
What is the actual number of distinct values for worker_id within tracking: select count(distinct worker_id) from tracking? What does the planner think this value is: select n_distinct from pg_stats where tablename='tracking' and attname='worker_id'? If those values are far apart and you force the planner to use a more reasonable value with alter table tracking alter column worker_id set (n_distinct = <real value>); analyze tracking; does that change the plans?
If you want to nudge PostgreSQL towards a nested loop join, try the following:
Create an index on tracking that can be used for an index-only scan:
CREATE INDEX ON tracking (worker_id) INCLUDE (localdatetime);
Make sure that tracking is VACUUMed often, so that an index-only scan is effective.
Reduce random_page_cost and increase effective_cache_size so that the optimizer prices index scans lower (but don't use insane values).
Make sure that you have good estimates on project_worker:
ALTER TABLE project_worker ALTER project_id SET STATISTICS 1000;
ANALYZE project_worker;