postgres text field performance

postgres text field performance - postgresql

I am using postgres 9.2.4.
We have a background job that imports user's emails into our system and stores them in a postgres database table.
Below is the table:
CREATE TABLE emails
(
id serial NOT NULL,
subject text,
body text,
personal boolean,
sent_at timestamp without time zone NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
account_id integer NOT NULL,
sender_user_id integer,
sender_contact_id integer,
html text,
folder text,
draft boolean DEFAULT false,
check_for_response timestamp without time zone,
send_time timestamp without time zone,
CONSTRAINT emails_pkey PRIMARY KEY (id),
CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
REFERENCES accounts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
REFERENCES contacts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE emails
OWNER TO paulcowan;
-- Index: emails_account_id_index
-- DROP INDEX emails_account_id_index;
CREATE INDEX emails_account_id_index
ON emails
USING btree
(account_id);
-- Index: emails_sender_contact_id_index
-- DROP INDEX emails_sender_contact_id_index;
CREATE INDEX emails_sender_contact_id_index
ON emails
USING btree
(sender_contact_id);
-- Index: emails_sender_user_id_index
-- DROP INDEX emails_sender_user_id_index;
CREATE INDEX emails_sender_user_id_index
ON emails
USING btree
(sender_user_id);
The query is further complicated because I have a view on this table where I pull in other data:
CREATE OR REPLACE VIEW email_graphs AS
SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html,
emails.personal, emails.draft, emails.created_at, emails.updated_at,
emails.sent_at, emails.sender_contact_id, emails.sender_user_id,
emails.addresses, emails.read_by, emails.check_for_response,
emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
FROM emails
LEFT JOIN ( SELECT todos.reference_email_id AS email_id,
array_to_json(array_agg(todos.id)) AS ids
FROM todos
GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
LEFT JOIN ( SELECT calls.reference_email_id AS email_id,
array_to_json(array_agg(calls.id)) AS ids
FROM calls
GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
LEFT JOIN ( SELECT deals.reference_email_id AS email_id,
array_to_json(array_agg(deals.id)) AS ids
FROM deals
GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
LEFT JOIN ( SELECT meetings.reference_email_id AS email_id,
array_to_json(array_agg(meetings.id)) AS ids
FROM meetings
GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
LEFT JOIN ( SELECT comments.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
FROM comments
WHERE comments.email_id IS NOT NULL
GROUP BY comments.email_id) c ON c.email_id = emails.id
LEFT JOIN ( SELECT email_participants.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
FROM email_participants
GROUP BY email_participants.email_id) p ON p.email_id = emails.id
LEFT JOIN ( SELECT attachments.reference_email_id AS email_id,
array_to_json(array_agg(attachments.id)) AS ids
FROM attachments
GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;
ALTER TABLE email_graphs
OWNER TO paulcowan;
We then run paginated queries against this view e.g.
SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0
As the table has grown, queries on this table have dramatically slowed down.
If I run the paginated query with EXPLAIN ANALYZE
EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;
I get this result
-> Seq Scan on deals (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
-> Sort (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
Sort Key: ms.email_id
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on ms (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
-> HashAggregate (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
-> Seq Scan on meetings (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
-> Sort (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
Sort Key: c.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on c (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
-> HashAggregate (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
-> Seq Scan on comments (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (email_id IS NOT NULL)
Rows Removed by Filter: 2
SubPlan 1
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=56) (never executed)
-> Materialize (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
-> GroupAggregate (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
-> Sort (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
Sort Key: public.email_participants.email_id
Sort Method: external sort Disk: 1760kB
-> Seq Scan on email_participants (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
SubPlan 2
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
-> Sort (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
Sort Key: atts.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on atts (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
-> HashAggregate (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
-> Seq Scan on attachments (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
-> Index Only Scan using email_participants_email_id_user_id_index on email_participants (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
Index Cond: (user_id = 75)
Heap Fetches: 43
Total runtime: 1642.157 ms
(75 rows)

I am definitely not looking for fixes :) or a refactored query. Any sort of hight level advice would be most welcome.
Per my comment, the gist of the issue lies in aggregates that get joined with one another. This prevents the use of indexes, and yields a bunch of merge joins (and a materialize) in your query plan…
Put another way, think of it as a plan so convoluted that Postgres proceeds by materializing temporary tables in memory, and then sorting them repeatedly until they're all merge-joined as appropriate. From where I'm standing, the full hogwash seems to amount to selecting all rows from all tables, and all of their possible relationships. Once it has been figured out and conjured up into existence, Postgres proceeds to sorting the mess in order to extract the top-n rows.
Anyway, you want rewrite the query so it can actually use indexes to begin with.
Part of this is simple. This, for instance, is a big no no:
select …,
ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
Fetch the emails in one query. Fetch the related objects in separate queries with a bite-sized email_id in (…) clause. Merely doing that should speed things up quite a bit.
For the rest, it may or may not be simple or involve some re-engineering of your schema. I only scanned through the incomprehensible monster and its gruesome query plan, so I cannot comment for sure.

I think the big view is not likely to ever perform well and that you should break it into more manageable components, but still here are two specific bits of advice that come to mind:
Schema change
Move the text and html bodies out of the main table.
Although large contents get automatically stored in TOAST space, mail parts will often be smaller than the TOAST threshold (~2000 bytes), especially for plain text, so it won't happen systematically.
Each non-TOASTED content inflates the table in a way that is detrimental to I/O performance and caching, if you consider that the primary purpose of the table is to contain the header fields like sender, recipients, date, subject...
I can test this with contents I happen to have in a mail database. On a sample of the 55k mails in my inbox:
average text/plain size: 1511 bytes.
average text/html size: 11895 bytes (but 42395 messages have no html at all)
Size of the mail table without the bodies: 14Mb (no TOAST)
If adding the bodies as 2 more TEXT columns like you have: 59Mb in main storage, 61Mb in TOAST.
Despite TOAST, the main storage appears to be 4 times larger. So when scanning the table without need for the TEXT columns, 80% of the I/Os are wasted. Future row updates are likely to make this worse with the fragmentation effect.
The effect in terms of block reads can be spotted through the pg_statio_all_tables view (compare heap_blks_read + heap_blks_hit before and after the query)
Tuning
This part of the EXPLAIN
Sort Method: external sort Disk: 1760kB
suggests that you work_mem is too small. You don't want to hit the disk for such small sorts. Make it as least 10Mb unless you're low on free memory. While you're at it, set shared_buffers to a reasonable value if it's still the default. See http://wiki.postgresql.org/wiki/Performance_Optimization for more.

Related

Improve PostgreSQL query response

I have one table that includes about 100K rows and their growth and growth.
My query response has a bad response time and it affects my front-end user experience.
I want to ask for your help to improve my response time from the DB.
Today the PostgreSQL runs on my local machine, Macbook pro 13 2019 16 RAM and I5 Core.
I the future I will load this DB on a docker and run it on a Better server.
What can I do to improve it for now?
Table Structure:
CREATE TABLE dots
(
dot_id INT,
site_id INT,
latitude float ( 6 ),
longitude float ( 6 ),
rsrp float ( 6 ),
dist INT,
project_id INT,
dist_from_site INT,
geom geometry,
dist_from_ref INT,
file_name VARCHAR
);
The dot_id resets after inserting the bulk of data and each for "file_name" column.
Table Dots:
The queries:
Query #1:
await db.query(
`select MAX(rsrp) FROM dots where site_id=$1 and ${table}=$2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=4) (actual time=198.416..201.762 rows=1 loops=1)
Buffers: shared hit=16165 read=16031
-> Gather (cost=37159.66..37159.87 rows=2 width=4) (actual time=198.299..201.752 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16165 read=16031
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=4) (actual time=179.009..179.010 rows=1 loops=3)
Buffers: shared hit=16165 read=16031
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=4) (actual time=122.889..178.817 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16165 read=16031
Planning Time: 0.290 ms
Execution Time: 201.879 ms
(14 rows)
Query #2:
await db.query(
`SELECT DISTINCT (${table}) FROM dots where site_id=$1 and project_id = $2 and file_name ilike $3 order by ${table}`,
[site_id, project_id, filename]
);
Time for response: 1100ms
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=41322.12..41322.31 rows=77 width=4) (actual time=1176.071..1176.077 rows=66 loops=1)
Sort Key: dist_from_ref
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=16175 read=16021
-> HashAggregate (cost=41318.94..41319.71 rows=77 width=4) (actual time=1176.024..1176.042 rows=66 loops=1)
Group Key: dist_from_ref
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=16175 read=16021
-> Seq Scan on dots (cost=0.00..40499.42 rows=327807 width=4) (actual time=0.423..1066.316 rows=326668 loops=1)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (project_id = 1))
Rows Removed by Filter: 147813
Buffers: shared hit=16175 read=16021
Planning:
Buffers: shared hit=5 dirtied=1
Planning Time: 0.242 ms
Execution Time: 1176.125 ms
(16 rows)
Query #3:
await db.query(
`SELECT count(*) FROM dots WHERE site_id = $1 AND ${table} = $2 and project_id = $3 and file_name ilike $4`,
[site_id, dist, project_id, filename]
);
Time for response: 200ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=37159.88..37159.89 rows=1 width=8) (actual time=198.725..202.335 rows=1 loops=1)
Buffers: shared hit=16160 read=16036
-> Gather (cost=37159.66..37159.87 rows=2 width=8) (actual time=198.613..202.328 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=16160 read=16036
-> Partial Aggregate (cost=36159.66..36159.67 rows=1 width=8) (actual time=179.182..179.183 rows=1 loops=3)
Buffers: shared hit=16160 read=16036
-> Parallel Seq Scan on dots (cost=0.00..36150.01 rows=3861 width=0) (actual time=119.340..179.020 rows=1088 loops=3)
Filter: (((file_name)::text ~~* 'BigFile'::text) AND (site_id = 42047) AND (dist_from_ref = 500) AND (project_id = 1))
Rows Removed by Filter: 157073
Buffers: shared hit=16160 read=16036
Planning Time: 0.109 ms
Execution Time: 202.377 ms
(14 rows)
Tables do no have any indexes.

I added an index and it helped a bit... create index idx1 on dots (site_id, project_id, file_name, dist_from_site,dist_from_ref)
OK, that's a bit too much.
An index on columns (a,b) is useful for "where a=..." and also for "where a=... and b=..." but it is not really useful for "where b=...". Creating an index with many columns uses more disk space and is slower than with fewer columns, which is a waste if the extra columns in the index don't make your queries faster. Both dist_ columns in the index will probably not be used.
Indices are a compromise : if your table has small rows, like two integer columns, and you create an index on these two columns, then it will be as big as the table, so you better be sure you need it. But in your case, your rows are pretty large at around 5kB and the number of rows is small, so adding an index or several on small int columns costs very little overhead.
So, since you very often use WHERE conditions on both site_id and project_id you can create an index on site_id,project_id. This will also work for a WHERE condition on site_id alone. If you sometimes use project_id alone, you can swap the order of the columns so it appears first, or even create another index.
You say the cardinality of these columns is about 30, so selecting on one value of site_id or project_id should hit 1/30 or 3.3% of the table, and selecting on both columns should hit 0.1% of the table, if they are uncorrelated and evenly distributed. This should already result in a substantial speedup.
You could also add an index on dist_from_site, and another one on dist_on_ref, if they have good selectivity (ie, high cardinality in those columns). Postgres can combine indices with bitmap index scan. But, if say 50% of the rows in the table have the same value for dist_from_site, then an index will be useless for this value, due to not having enough selectivity.
You could also replace the previous 2-column index with 2 indices on site_id,project_id,dist_from_site and site_id,project_id,dist_from_ref. You can try it, see if it is worth the extra resources.
Also there's the filename column and ILIKE. ILIKE can't use an index, which means it's slow. One solution is to use an expression index
CREATE INDEX dots_filename ON dots( lower(file_name) );
and replace your where condition with:
lower(file_name) like lower($4)
This will use the index unless the parameter $4 starts with a "%". And if you never use the LIKE '%' wildcards, and you're just using ILIKE for case-insensitive comparison, then you can replace LIKE with the = operator. Basically lower(a) = lower(b) is a case-insensitive comparison.
Likewise you could replace the previous 2-column index with an index on site_id,project_id,lower(filename) if you often use these three together in the WHERE condition. But as said above, it won't optimize searches on filename alone.
Since your rows are huge, even adding 1 kB of index per row will only add 20% overhead to your table, so you can overdo it without too much trouble. So go ahead and experiment, you'll see what works best.

PostgreSQL slow order

I have table (over 100 millions records) on PostgreSQL 13.1
CREATE TABLE report
(
id serial primary key,
license_plate_id integer,
datetime timestamp
);
Indexes (for test I create both of them):
create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);
So, my question is why query like
select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100
Is very slow (~10sec). But query without order statement is fast (milliseconds).
Explain:
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
Buffers: shared hit=103
-> Index Scan using report_lp_id_idx on report r (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms
explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
Buffers: shared hit=11455 read=671
-> Sort (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
Sort Key: datetime DESC
Sort Method: top-N heapsort Memory: 128kB
Buffers: shared hit=11455 read=671
-> Bitmap Heap Scan on report r (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Heap Blocks: exact=12063
Buffers: shared hit=11455 read=671
-> Bitmap Index Scan on report_lp_id_idx (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms

You seem to have rather slow storage, if reading 671 8kB-blocks from disk takes a couple of seconds.
The way to speed this up is to reorder the table in the same way as the index, so that you can find the required rows in the same or adjacent table blocks:
CLUSTER report_lp_id_idx USING report_lp_id_idx;
Be warned that rewriting the table in this way causes downtime – the table will not be available while it is being rewritten. Moreover, PostgreSQL does not maintain the table order, so subsequent data modifications will cause performance to gradually deteriorate, so that after a while you will have to run CLUSTER again.
But if you need this query to be fast no matter what, CLUSTER is the way to go.

Your two indices do exactly the same thing, so you can remove the second one, it's useless.
To optimize your query, the order of the fields inside the index must be reversed:
create index report_lp_datetime_index on report (datetime,license_plate_id);
BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;
Limit (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
-> Index Only Scan Backward using foo_d_i on foo (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9016
Heap Fetches: 0
Planning Time: 0.339 ms
Execution Time: 9.387 ms
Note the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can do an index-only scan and avoid sorting. By adding column id to the index, an index-only scan can be performed to test the condition on id, without hitting the table for every row. Since there is a low LIMIT value it does not need to scan the whole index, it only scans it in date DESC order until it finds enough rows satisfying the WHERE condition to return the result.
It will be faster if you create the index in date DESC order, this could be useful if you use ORDER BY date DESC + LIMIT in other queries too.
You forget that OP's table has a third column, and he is using SELECT *. So that wouldn't be an index-only scan.
Easy to work around. The optimum way to do this query would be an index-only scan to filter on WHERE conditions, then LIMIT, then hit the table to get the rows. For some reason if "select *" is used postgres takes the id column from the table instead of taking it from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.
Easy to work around, by doing it manually. I've also added another bogus column to make sure the SELECT * hits the table.
EXPLAIN (ANALYZE,buffers) SELECT * FROM foo
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i)
ORDER BY d DESC LIMIT 100;
Limit (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
Buffers: shared hit=453
-> Nested Loop (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
Buffers: shared hit=453
-> Limit (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
Buffers: shared hit=53
-> Index Only Scan using foo_d_i on foo foo_1 (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=53
-> Index Scan using foo_d_i on foo (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
Buffers: shared hit=400
Execution Time: 3.663 ms
Another option is to just add the primary key to the date,license_plate index.
SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;
Limit (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
Buffers: shared hit=473
-> Sort (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
Sort Key: foo.d DESC
Sort Method: quicksort Memory: 32kB
Buffers: shared hit=473
-> Nested Loop (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
Buffers: shared hit=473
-> Limit (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
Buffers: shared hit=73
-> Index Only Scan using foo_d_i_id on foo foo_1 (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
Rows Removed by Filter: 9010
Heap Fetches: 0
Buffers: shared hit=73
-> Index Scan using foo_pkey on foo (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
Index Cond: (id = foo_1.id)
Buffers: shared hit=400
Execution Time: 3.972 ms
Edit
After thinking about it... since the LIMIT restricts the output to 100 rows ordered by date desc, wouldn't it be nice if we could get the 100 most recent rows for each license_plate_id, put all that into a top-n sort, and only keep the best 100 for all license_plate_ids? That would avoid reading and throwing away a lot of rows from the index. Even if that's much faster than hitting the table, it will still load up these index pages in RAM and clog up your buffers with stuff you don't actually need to keep in cache. Let's use LATERAL JOIN:
EXPLAIN (ANALYZE,BUFFERS)
SELECT * FROM foo
JOIN (SELECT d,i FROM
(VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist
CROSS JOIN LATERAL
(SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2
ORDER BY d DESC LIMIT 100
) f3 USING (d,i)
ORDER BY d DESC LIMIT 100;
It's even faster: 2ms, and it uses the index on (license_plate_id,date) instead of the other way around. Also, and this is important, since each subquery in the lateral hits only the index pages that contain rows that will actually be selected, while the previous queries hit much more index pages. So you save on RAM buffers.
If you don't need the index on (date,license_plate_id) and don't want to keep a useless index, that could be interesting since this query doesn't use it. On the other hand, if you need the index on (date,license_plate_id) for something else and want to keep it, then... maybe not.
Please post results for the winning query 🔥

Postgres uses Hash Join with Seq Scan when Inner Select Index Cond is faster

Postgres is using a much heavier Seq Scan on table tracking when an index is available. The first query was the original attempt, which uses a Seq Scan and therefore has a slow query. I attempted to force an Index Scan with an Inner Select, but postgres converted it back to effectively the same query with nearly the same runtime. I finally copied the list from the Inner Select of query two to make the third query. Finally postgres used the Index Scan, which dramatically decreased the runtime. The third query is not viable in a production environment. What will cause postgres to use the last query plan?
(vacuum was used on both tables)
Tables
tracking (worker_id, localdatetime) total records: 118664105
project_worker (id, project_id) total records: 12935
INDEX
CREATE INDEX tracking_worker_id_localdatetime_idx ON public.tracking USING btree (worker_id, localdatetime)
Queries
SELECT worker_id, localdatetime FROM tracking t JOIN project_worker pw ON t.worker_id = pw.id WHERE project_id = 68475018
Hash Join (cost=29185.80..2638162.26 rows=19294218 width=16) (actual time=16.912..18376.032 rows=177681 loops=1)
Hash Cond: (t.worker_id = pw.id)
-> Seq Scan on tracking t (cost=0.00..2297293.86 rows=118716186 width=16) (actual time=0.004..8242.891 rows=118674660 loops=1)
-> Hash (cost=29134.80..29134.80 rows=4080 width=8) (actual time=16.855..16.855 rows=2102 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 115kB
-> Seq Scan on project_worker pw (cost=0.00..29134.80 rows=4080 width=8) (actual time=0.004..16.596 rows=2102 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 10833
Planning Time: 0.192 ms
Execution Time: 18382.698 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (SELECT id FROM project_worker WHERE project_id = 68475018 LIMIT 500)
Hash Semi Join (cost=6905.32..2923969.14 rows=27733254 width=24) (actual time=19.715..20191.517 rows=20530 loops=1)
Hash Cond: (t.worker_id = project_worker.id)
-> Seq Scan on tracking t (cost=0.00..2296948.27 rows=118698327 width=24) (actual time=0.005..9184.676 rows=118657026 loops=1)
-> Hash (cost=6899.07..6899.07 rows=500 width=8) (actual time=1.103..1.103 rows=500 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 28kB
-> Limit (cost=0.00..6894.07 rows=500 width=8) (actual time=0.006..1.011 rows=500 loops=1)
-> Seq Scan on project_worker (cost=0.00..28982.65 rows=2102 width=8) (actual time=0.005..0.968 rows=500 loops=1)
Filter: (project_id = 68475018)
Rows Removed by Filter: 4493
Planning Time: 0.224 ms
Execution Time: 20192.421 ms
SELECT worker_id, localdatetime FROM tracking t WHERE worker_id IN (322016383,316007840,...,285702579)
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4766798.31 rows=21877360 width=24) (actual time=0.079..29.756 rows=22112 loops=1)
" Index Cond: (worker_id = ANY ('{322016383,316007840,...,285702579}'::bigint[]))"
Planning Time: 1.162 ms
Execution Time: 30.884 ms
... is in place of the 500 id entries used in the query
Same query ran on another set of 500 id's
Index Scan using tracking_worker_id_localdatetime_idx on tracking t (cost=0.57..4776714.91 rows=21900980 width=24) (actual time=0.105..5528.109 rows=117838 loops=1)
" Index Cond: (worker_id = ANY ('{286237712,286237844,...,216724213}'::bigint[]))"
Planning Time: 2.105 ms
Execution Time: 5534.948 ms

The distribution of "worker_id" within "tracking" seems very skewed. For one thing, the number of rows in one of your instances of query 3 returns over 5 times as many rows as the other instance of it. For another, the estimated number of rows is 100 to 1000 times higher than the actual number. This can certainly lead to bad plans (although it is unlikely to be the complete picture).
What is the actual number of distinct values for worker_id within tracking: select count(distinct worker_id) from tracking? What does the planner think this value is: select n_distinct from pg_stats where tablename='tracking' and attname='worker_id'? If those values are far apart and you force the planner to use a more reasonable value with alter table tracking alter column worker_id set (n_distinct = <real value>); analyze tracking; does that change the plans?

If you want to nudge PostgreSQL towards a nested loop join, try the following:
Create an index on tracking that can be used for an index-only scan:
CREATE INDEX ON tracking (worker_id) INCLUDE (localdatetime);
Make sure that tracking is VACUUMed often, so that an index-only scan is effective.
Reduce random_page_cost and increase effective_cache_size so that the optimizer prices index scans lower (but don't use insane values).
Make sure that you have good estimates on project_worker:
ALTER TABLE project_worker ALTER project_id SET STATISTICS 1000;
ANALYZE project_worker;

PostgreSQL is not using a straight forward index

I have a PostgreSQL 10.6 database on Amazon RDS. My table is like this:
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country varchar(100) NOT NULL,
product_group varchar(100) NOT NULL,
distribution_type varchar(100) NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
category varchar(100) NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
data_status bpchar(1) NOT NULL,
panel_market_units numeric(38,6) NOT NULL,
panel_market_sales_value_eur numeric(38,6) NOT NULL,
panel_market_sales_value_usd numeric(38,6) NOT NULL,
panel_market_sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_dpretailer_dfo_by_quarter PRIMARY KEY (release_key, country, category, product_group, distribution_type, year, quarter),
CONSTRAINT fk_dpretailer_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id)
);
I understand Primary Key implies a unique index
If I simply ask how many rows I have when filtering on non existing data (release_key = 1 returns nothing), I can see it uses the index
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 1
Aggregate (cost=6.32..6.33 rows=1 width=8)
-> Index Only Scan using pk_dpretailer_dfo_by_quarter on dfo_by_quarter (cost=0.55..6.32 rows=1 width=0)
Index Cond: (release_key = 1)
But if I run the same query on a value that returns data, it scans the table, which is bound to be more expensive...
EXPLAIN
SELECT COUNT(*)
FROM dpretailer.dfo_by_quarter
WHERE release_key = 2
Finalize Aggregate (cost=47611.07..47611.08 rows=1 width=8)
-> Gather (cost=47610.86..47611.07 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=46610.86..46610.87 rows=1 width=8)
-> Parallel Seq Scan on dfo_by_quarter (cost=0.00..46307.29 rows=121428 width=0)
Filter: (release_key = 2)
I get it that using the index when there is no data makes sense and is driven by the stats on the table (I ran ANALYSE before the tests)
But why not using my index if there is data?
Surely, it must be quicker to scan part of an index (because release_key is the first column) rather than scanning an entire table???
I must be missing something...?
Update 2019-03-07
Thank You for your comments, which are very useful.
This simple query was just me trying to understand why the index was not used...
But I should have known better (I am new to postgresql but have MANY years experience with SQL Server) and it makes sense that it is not, as you commented about.
bad selectivity because my criteria only filters about 20% of the rows
bad table design (too fat, which we knew and are now addressing)
index not "covering" the query, etc...
So let me change "slightly" my question if I may...
Our table will be normalised in facts/dimensions (no more varchars in the wrong place).
We do only inserts, never updates and so few deletes that we can ignore it.
The table size will not be huge (tens of million of rows order).
Our queries will ALWAYS specify an exact release_key value.
Our new version of the table would look like this
CREATE TABLE dfo_by_quarter (
release_key int4 NOT NULL,
country_key int2 NOT NULL,
product_group_key int2 NOT NULL,
distribution_type_key int2 NOT NULL,
category_key int2 NOT NULL,
"year" int2 NOT NULL,
"date" date NULL,
quarter int2 NOT NULL,
units numeric(38,6) NOT NULL,
sales_value_eur numeric(38,6) NOT NULL,
sales_value_usd numeric(38,6) NOT NULL,
sales_value_local numeric(38,6) NOT NULL,
CONSTRAINT pk_milly_dfo_by_quarter PRIMARY KEY (release_key, country_key, category_key, product_group_key, distribution_type_key, year, quarter),
CONSTRAINT fk_milly_dfo_by_quarter_release FOREIGN KEY (release_key) REFERENCES dpretailer.dfo_release(release_id),
CONSTRAINT fk_milly_dim_dfo_category FOREIGN KEY (category_key) REFERENCES milly.dim_dfo_category(category_key),
CONSTRAINT fk_milly_dim_dfo_country FOREIGN KEY (country_key) REFERENCES milly.dim_dfo_country(country_key),
CONSTRAINT fk_milly_dim_dfo_distribution_type FOREIGN KEY (distribution_type_key) REFERENCES milly.dim_dfo_distribution_type(distribution_type_key),
CONSTRAINT fk_milly_dim_dfo_product_group FOREIGN KEY (product_group_key) REFERENCES milly.dim_dfo_product_group(product_group_key)
);
With that in mind, in a SQL Server environment, I could solve this by having a "Clustered" primary key (the entire table being sorted), or having an index on the primary key with INCLUDE option for the other columns required to cover the queries (Units, Values, etc).
Question 1)
In postgresql, is there an equivalent to the SQL Server Clustered index? A way to actually sort the entire table? I suppose it might be difficult because postgresql does not do updates "in place", hence it might make sorting expensive...
Or, is there a way to create something like a SQL Server Index WITH INCLUDE(units, values)?
update: I came across the SQL CLUSTER command, which is the closest thing I suppose.
It would be suitable for us
Question 2
With the query below
EXPLAIN (ANALYZE, BUFFERS)
WITH "rank_query" AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY "year" ORDER BY SUM("main"."units") DESC) AS "rank_by",
"year",
"main"."product_group_key" AS "productgroupkey",
SUM("main"."units") AS "salesunits",
SUM("main"."sales_value_eur") AS "salesvalue",
SUM("sales_value_eur")/SUM("units") AS "asp"
FROM "milly"."dfo_by_quarter" AS "main"
WHERE
"release_key" = 17 AND
"main"."year" >= 2010
GROUP BY
"year",
"main"."product_group_key"
)
,BeforeLookup
AS (
SELECT
"year" AS date,
SUM("salesunits") AS "salesunits",
SUM("salesvalue") AS "salesvalue",
SUM("salesvalue")/SUM("salesunits") AS "asp",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END AS "productgroupkey"
FROM
"rank_query"
GROUP BY
"year",
CASE WHEN "rank_by" <= 50 THEN "productgroupkey" ELSE -1 END
)
SELECT BL.date, BL.salesunits, BL.salesvalue, BL.asp
FROM BeforeLookup AS BL
INNER JOIN milly.dim_dfo_product_group PG ON PG.product_group_key = BL.productgroupkey;
I get this
Hash Join (cost=40883.82..40896.46 rows=558 width=98) (actual time=676.565..678.308 rows=663 loops=1)
Hash Cond: (bl.productgroupkey = pg.product_group_key)
Buffers: shared hit=483 read=22719
CTE rank_query
-> WindowAgg (cost=40507.15..40632.63 rows=5577 width=108) (actual time=660.076..668.272 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> Sort (cost=40507.15..40521.09 rows=5577 width=68) (actual time=660.062..661.226 rows=5418 loops=1)
Sort Key: main.year, (sum(main.units)) DESC
Sort Method: quicksort Memory: 616kB
Buffers: shared hit=480 read=22719
-> Finalize HashAggregate (cost=40076.46..40160.11 rows=5577 width=68) (actual time=648.762..653.227 rows=5418 loops=1)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Gather (cost=38710.09..39909.15 rows=11154 width=68) (actual time=597.878..622.379 rows=11938 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=480 read=22719
-> Partial HashAggregate (cost=37710.09..37793.75 rows=5577 width=68) (actual time=594.044..600.494 rows=3979 loops=3)
Group Key: main.year, main.product_group_key
Buffers: shared hit=480 read=22719
-> Parallel Seq Scan on dfo_by_quarter main (cost=0.00..36019.74 rows=169035 width=22) (actual time=106.916..357.071 rows=137171 loops=3)
Filter: ((year >= 2010) AND (release_key = 17))
Rows Removed by Filter: 546602
Buffers: shared hit=480 read=22719
CTE beforelookup
-> HashAggregate (cost=223.08..238.43 rows=558 width=102) (actual time=676.293..677.167 rows=663 loops=1)
Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
Buffers: shared hit=480 read=22719
-> CTE Scan on rank_query (cost=0.00..139.43 rows=5577 width=70) (actual time=660.079..672.978 rows=5418 loops=1)
Buffers: shared hit=480 read=22719
-> CTE Scan on beforelookup bl (cost=0.00..11.16 rows=558 width=102) (actual time=676.296..677.665 rows=663 loops=1)
Buffers: shared hit=480 read=22719
-> Hash (cost=7.34..7.34 rows=434 width=4) (actual time=0.253..0.253 rows=435 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 24kB
Buffers: shared hit=3
-> Seq Scan on dim_dfo_product_group pg (cost=0.00..7.34 rows=434 width=4) (actual time=0.017..0.121 rows=435 loops=1)
Buffers: shared hit=3
Planning time: 0.319 ms
Execution time: 678.714 ms
Does anything spring to mind?
If I read it properly, it means my biggest cost by far is the initial scanof the table... but I don't manage to make it use an index...
I had created an index I hoped would help but it got ignored...
CREATE INDEX eric_silly_index ON milly.dfo_by_quarter(release_key, YEAR, date, product_group_key, units, sales_value_eur);
ANALYZE milly.dfo_by_quarter;
I also tried to cluster the table but no visible effect either
CLUSTER milly.dfo_by_quarter USING pk_milly_dfo_by_quarter; -- took 30 seconds (uidev)
ANALYZE milly.dfo_by_quarter;
Many thanks
Eric

Because release_key isn't actually a unique column, it's not possible from the information you've provided to know whether or not the index should be used. If a high percentage of rows have release_key = 2 or even a smaller percentage of rows match on a large table, it may not be efficient to use the index.
In part this is because Postgres indexes are indirect -- that is the index actually contains a pointer to the location on disk in the heap where the real tuple lives. So looping through an index requires reading an entry from the index, reading the tuple from the heap, and repeating. For a large number of tuples it's often more valuable to scan the heap directly and avoid the indirect disk access penalty.
Edit:
You generally don't want to be using CLUSTER in PostgreSQL; it's not how indexes are maintained, and it's rare to see that in the wild for that reason.
Your updated query with no data gives this plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on beforelookup bl (cost=8.33..8.35 rows=1 width=98) (actual time=0.143..0.143 rows=0 loops=1)
Buffers: shared hit=4
CTE rank_query
-> WindowAgg (cost=8.24..8.26 rows=1 width=108) (actual time=0.126..0.126 rows=0 loops=1)
Buffers: shared hit=4
-> Sort (cost=8.24..8.24 rows=1 width=68) (actual time=0.060..0.061 rows=0 loops=1)
Sort Key: main.year, (sum(main.units)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4
-> GroupAggregate (cost=8.19..8.23 rows=1 width=68) (actual time=0.011..0.011 rows=0 loops=1)
Group Key: main.year, main.product_group_key
Buffers: shared hit=1
-> Sort (cost=8.19..8.19 rows=1 width=64) (actual time=0.011..0.011 rows=0 loops=1)
Sort Key: main.year, main.product_group_key
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1
-> Index Scan using pk_milly_dfo_by_quarter on dfo_by_quarter main (cost=0.15..8.18 rows=1 width=64) (actual time=0.003..0.003 rows=0 loops=1)
Index Cond: ((release_key = 17) AND (year >= 2010))
Buffers: shared hit=1
CTE beforelookup
-> HashAggregate (cost=0.04..0.07 rows=1 width=102) (actual time=0.128..0.128 rows=0 loops=1)
Group Key: rank_query.year, CASE WHEN (rank_query.rank_by <= 50) THEN (rank_query.productgroupkey)::integer ELSE '-1'::integer END
Buffers: shared hit=4
-> CTE Scan on rank_query (cost=0.00..0.03 rows=1 width=70) (actual time=0.127..0.127 rows=0 loops=1)
Buffers: shared hit=4
Planning Time: 0.723 ms
Execution Time: 0.485 ms
(27 rows)
So PostgreSQL is entirely capable of using the index for your query, but the planner is deciding that it's not worth it (i.e., the costing for using the index directly is higher than the costing for using the parallel sequence scan).
If you set enable_indexscan = off; with no data, you get a bitmap index scan (as I'd expect). If you set enable_bitmapscan = off; with no data you get an (non-parallel) sequence scan.
You should see the plan change back (with large amounts of data) if you set max_parallel_workers = 0;.
But looking at your query's explain results, I'd very much expect using the index to be more expensive and take longer than using the parallel sequence scan. In your updated query you're still scanning a very high percentage of the table and a large number of rows, and you're also forcing accessing the heap by accessing fields not in the index. Postgres 11 (I believe) adds covering indexes which would theoretically allow you to make this query be driven by the index alone, but I'm not at all convinced in this example it would actually be worth it.

Generally, while possible, a PK spanning 7 columns, several of which being varchar(100) is not optimized for performance, to say the least.
Such an index is large to begin with and tends to bloat quickly, if you have updates on involved columns.
I would operate with a surrogate PK, a serial (or bigserial if you have that many rows). Or IDENTITY. See:
Auto increment table column
And a UNIQUE constraint on all 7 to enforce uniqueness (all are NOT NULL anyway).
If you have lots of counting queries with the only predicate on release_key consider an additional plain btree index on just that column.
The data type varchar(100) for so many columns may not be optimal. Some normalization might help.
More advise depends on missing information ...

The answer to my initial question: why is postgresql not using my index on something like SELECT (*)... can be found in the documentation...
Introduction to VACUUM, ANALYZE, EXPLAIN, and COUNT
In particular: This means that every time a row is read from an index, the engine has to also read the actual row in the table to ensure that the row hasn't been deleted.
This explains a lot why I don't manage to get postgresql to use my indexes when, from a SQL Server perspective, it obviously "should".

Is there a way to use pg_trgm like operator with btree indexes on PostgreSQL?

I have two tables:
table_1 with ~1 million lines, with columns id_t1: integer, c1_t1: varchar, etc.
table_2 with ~50 million lines, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.
ref_id_t1 is filled with id_t1 values , however they are not linked by a foreign key as table_2 doesn't know about table_1.
I need to do a request on both table like the following:
SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');
Without any change or with basic indexes the request takes about a minute to complete as a sequencial scan is peformed on table_2. To prevent this I created a GIN idex with gin_trgm_ops option:
CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2, gin_trgm_ops);
However this does not solve the problem as the inner request still takes a very long time.
EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%'
Gives the following
Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
Rows Removed by Index Recheck: 49069703
Heap Blocks: exact=611548
-> Bitmap Index Scan on gin_trg (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms
The Bitmap Index Scan is fast, but as we need t2.ref_id_t1 PostgreSQL needs to perform an Bitmap Heap Scan which is not quick on 65000 lines of data.
The solution to avoid the Bitmap Heap Scan would be to perform an Index Only Scan. This is possible using multiple column with btree indexes, see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
If I change the request like to search the begining of c1_t2, even with the inner request returning 90000 lines, and if I create a btree index on c1_t2 and ref_id_t1 the request takes just over a second.
CREATE INDEX c1_t2_ref_id_t1_index
ON table_2 USING btree
(c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST)
EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');
Hash Join (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
Hash Cond: (t1.id_t1 = t2.ref_id_t1)
-> Seq Scan on table_1 t1 (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
Filter: (c1_t1 = 'A')
Rows Removed by Filter: 1083798
-> Hash (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 3472kB
-> HashAggregate (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
Group Key: t2.ref_id_t1
-> Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2 (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
Index Cond: ((c1_t2 ~>=~ 'aaa'::text) AND (c1_t2 ~<~ 'chb'::text))
Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms
However this is not possible with gin indexes, as these indexes don't store all data in the key.
Is there a way to use pg_trmg like extension with btree index so we can have index only scan with LIKE '%abc%' requests?

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse