I cannot convince PostgreSQL to use my BRIN index. I have tested on PostgreSQL 14.2 and 11.1. Here is my initial setup:
SET max_parallel_workers_per_gather = 0;
DROP TABLE IF EXISTS Measure;
CREATE TABLE Measure (
id int,
sensor_id int,
temperature int
);
INSERT INTO Measure(id, sensor_id, temperature)
VALUES (generate_series(1, 1000000),
round(random()*100000)::int,
round(random()*100)::int);
DROP INDEX IF EXISTS idxbrin_measure_sensor_id;
CREATE INDEX idxbrin_measure_sensor_id ON Measure USING brin(sensor_id);
When I run a simple query returning around 10 rows such as
EXPLAIN SELECT * FROM Measure WHERE sensor_id = 10;
the BRIN index is not used:
Seq Scan on measure (cost=0.00..17906.00 rows=10 width=12)
Filter: (sensor_id = 10)
What am I doing wrong?
A BRIN index can only be used if the logical ordering of the indexed value correlates almost perfectly with the physical ordering in the table. So it would work for the id column (if you never delete or update any rows), but not for sensor_id. Note that for a small table like that, a BRIN index is not very useful.
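A quick way to check that correlation is the pg_stats system view (a value close to +1 or -1 means the physical row order follows the logical column order):
ANALYZE Measure;
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'measure';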
As pointed out by Laurenz Albe, the BRIN index key has to correspond to the physical ordering of the data. Therefore, I ordered the table by sensor_id and created the BRIN index, and PG used it.
CREATE INDEX idx_measure_sensor_id ON MEASURE(sensor_id);
CLUSTER MEASURE USING idx_measure_sensor_id;
DROP INDEX IF EXISTS idx_measure_sensor_id;
CREATE INDEX idxbrin_measure_sensor_id ON Measure USING brin(sensor_id);
EXPLAIN SELECT * FROM Measure WHERE sensor_id = 10;
In this case the resulting plan is:
Bitmap Heap Scan on measure (cost=20.39..148083.72 rows=2 width=1016)
Recheck Cond: (sensor_id = 10)
-> Bitmap Index Scan on idxbrin_measure_sensor_id (cost=0.00..20.38 rows=416427 width=0)
Index Cond: (sensor_id = 10)
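One caveat: CLUSTER is a one-off operation, so rows inserted or updated later are not kept in sensor_id order, and the correlation (and with it the BRIN index) degrades over time. Since the b-tree used for clustering was dropped above, restoring the order means repeating the whole dance:
CREATE INDEX idx_measure_sensor_id ON Measure(sensor_id);
CLUSTER Measure USING idx_measure_sensor_id;
DROP INDEX idx_measure_sensor_id;
ANALYZE Measure;  -- refresh statistics so the planner sees the restored correlation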
Related
I migrated my database from MySQL to PostgreSQL with pgloader. Overall it is much more efficient, but one query with a LIKE condition is much slower on PostgreSQL:
MySQL: ~1 ms
PostgreSQL: ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
The query is:
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL)
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
pgloader created indexes on ratemax3v3 and name like this:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name:
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with PostgreSQL at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine
To support a LIKE query like that (left anchored) you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Or maybe even a partial ("filtered") index, if e.g. blacklisted = 0 is a static condition and relatively few rows match that condition:
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows have blacklisted = 0, that won't really help (and adding the column to the index wouldn't help either). In that case an index on just (region, name varchar_pattern_ops) is probably more efficient, as shown below.
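In SQL, with the same operator class as above:
CREATE INDEX ON wow.personnages (region, name varchar_pattern_ops);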
If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
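A minimal sketch, assuming the pg_trgm extension is already installed (reusing the index name from the question):
DROP INDEX IF EXISTS wow.trgm_idx_name;
CREATE INDEX trgm_idx_name ON wow.personnages USING GIN (name gin_trgm_ops);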
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform badly.
Say you have a table with some indices:
create table mail
(
identifier serial primary key,
member text,
read boolean
);
create index on mail(member);
create index on mail(read);
If you now query on multiple columns which have separate indices, will it ever use both indices?
select * from mail where member = 'Jess' and read = false;
That is, can PostgreSQL decide to first use the index on member to fetch all mails for Jess and then use the index on read to fetch all unread mails and then intersect both results to construct the output set?
I know you can have an index with multiple columns (on (member, read) in this case), but what happens if you have two separate indices? Will PostgreSQL pick just one or can it use both in some cases?
This is not a question about a specific query. It is a generic question to understand the internals.
The Postgres documentation about combining multiple indexes addresses this. It says the system will create an abstract representation of where each index applies and then combine the results:
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of table rows that
are reported as matching that index's conditions. The bitmaps are then
ANDed and ORed together as needed by the query. Finally, the actual
table rows are visited and returned.
CREATE TABLE fifteen
(one serial PRIMARY KEY
, three integer not NULL
, five integer not NULL
);
INSERT INTO fifteen(three,five)
SELECT gs%33+5,gs%55+11
FROM generate_series(1,60000) gs
;
CREATE INDEX ON fifteen(three);
CREATE INDEX ON fifteen(five);
ANALYZE fifteen;
EXPLAIN ANALYZE
SELECT *
FROM fifteen
WHERE three = 7
AND five = 13
;
Result:
CREATE TABLE
INSERT 0 60000
CREATE INDEX
CREATE INDEX
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fifteen (cost=19.24..51.58 rows=31 width=12) (actual time=0.391..0.761 rows=364 loops=1)
Recheck Cond: ((five = 13) AND (three = 7))
Heap Blocks: exact=324
-> BitmapAnd (cost=19.24..19.24 rows=31 width=0) (actual time=0.355..0.355 rows=0 loops=1)
-> Bitmap Index Scan on fifteen_five_idx (cost=0.00..7.15 rows=1050 width=0) (actual time=0.136..0.136 rows=1091 loops=1)
Index Cond: (five = 13)
-> Bitmap Index Scan on fifteen_three_idx (cost=0.00..11.93 rows=1788 width=0) (actual time=0.194..0.194 rows=1819 loops=1)
Index Cond: (three = 7)
Planning time: 0.259 ms
Execution time: 0.796 ms
(10 rows)
Changing {33,55} to {3,5} will yield an index scan over only one index, plus an additional filter condition (probably because the cost savings from the second index would be too small).
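For comparison (a suggestion to test, not part of the demo above): a single multicolumn index usually beats ANDing two single-column bitmaps, because only one index has to be scanned and no BitmapAnd step is needed.
CREATE INDEX ON fifteen (three, five);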
I have a very simple SQL:
select * from email.email_task where acquire_time < now() and state IN ('CREATED', 'RELEASED') order by creation_time asc limit 1;
I have 2 indexes created:
index1: (state)
index2: (state, acquire_time, creation_time)
Ideally I think Postgres should pick the 2nd one, since it matches every column required in this SQL. However, the execution plan shows otherwise; it uses neither of the indexes:
Limit (cost=187404.36..187404.36 rows=1 width=743)
-> Sort (cost=187404.36..190753.58 rows=1339690 width=743)
Sort Key: creation_time
-> Seq Scan on email_task (cost=0.00..180705.91 rows=1339690 width=743)
Filter: (((state)::text = 'CREATED'::text) AND (acquire_time < now()))
I understand that if the number of rows returned reaches something like 10% of the total, the planner picks a Seq Scan over an Index Scan (as explained at Why does PostgreSQL perform sequential scan on indexed column?). So that's why index1 is not picked.
What I don't understand is why index2 is not picked, since it matches all the columns.
Then I tried a 3rd index:
index3: (creation_time, acquire_time, state)
And this time it uses index3 (I added the index on another, smaller database, perf_1, because the original one has 2 million rows and creating the index takes too much time):
Limit (cost=0.29..0.36 rows=1 width=75) (actual time=0.043..0.043 rows=1 loops=1)
-> Index Scan using perf_1 on email_task (cost=0.29..763.76 rows=9998 width=75) (actual time=0.042..0.042 rows=1 loops=1)
Index Cond: (acquire_time < now())
Filter: ((state)::text = ANY ('{CREATED,RELEASED}'::text[]))
It seems that the Postgres planner is handling the ORDER BY clause first and then the WHERE clause, which is a little counter-intuitive.
Is my understanding correct, or are there other factors that impact the Postgres planner?
Thanks in advance.
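One way to see why index3 wins: with ORDER BY creation_time LIMIT 1, an index whose leading column is creation_time lets the planner walk the index in sorted order and stop at the first row that passes the acquire_time and state checks, so no sort of 1.3 million rows is needed at all. A minimal sketch of that idea (the index name is illustrative):
CREATE INDEX idx_email_task_creation_time
    ON email.email_task (creation_time);
-- The planner can scan this index in creation_time order and return as soon
-- as one row also satisfies acquire_time < now() and the state filter.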
I have the following situation:
Data = around 400 million (string1, string2, score) tuples
Data size ~ 20 GB, doesn't fit in memory.
Data is stored in a file in CSV format, and not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples s.t. string1 = 'google'.
How do I design a system such that I can do this efficiently?
I have already tried PostgreSQL with a B-tree index and a GIN index, but they aren't fast enough (> 20-30 seconds per query).
Ideally, I need a solution which sorts the tuples by string1, stores them in sorted fashion and then runs a binary search, followed by a sequential scan for retrieval. But I don't know which database or system implements such functionality.
UPDATE:
Here are the Postgres details:
I bulk-loaded data into Postgres using the COPY command. Then I created two indexes on string1, one B-tree and one GIN. However, Postgres is not using either of them.
Create tables:
CREATE TABLE mytable (
    string1   varchar PRIMARY KEY,
    string2   varchar,
    source_id integer REFERENCES sources(id),
    score     real
);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
CREATE INDEX string1_index ON mytable(lower(string1));
Query plan:
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.mytable (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
Output: string1, string2, source_id, score
Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
Rows Removed by Index Recheck: 34573
-> Bitmap Index Scan on string1_gin_index (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
Index Cond: ((mytable.string1)::text ~~* 'google'::text)
Total runtime: 25568.209 ms
(7 rows)
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 = 'google';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.mytable (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
Output: string1, string2, source_id, score
Filter: ((mytable.string1)::text = 'google'::text)
Rows Removed by Filter: 124417194
Total runtime: 139403.950 ms
(5 rows)
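Note that the second plan cannot use string1_index at all: that index is built on the expression lower(string1), and PostgreSQL only considers an expression index when the query uses the same expression. A hedged sketch combining that fix with the sorted-storage idea from above (CLUSTER rewrites the table in index order as a one-off operation):
SELECT * FROM mytable WHERE lower(string1) = lower('google');  -- can use string1_index
CLUSTER mytable USING string1_index;  -- store the rows sorted by lower(string1)
ANALYZE mytable;                      -- refresh statistics after the rewrite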
I've got a query with an ORDER BY and a LIMIT to support a paginated interface:
SELECT segment_members.id AS t0_r0,
segment_members.segment_id AS t0_r1,
segment_members.account_id AS t0_r2,
segment_members.score AS t0_r3,
segment_members.created_at AS t0_r4,
segment_members.updated_at AS t0_r5,
segment_members.posts_count AS t0_r6,
accounts.id AS t1_r0,
accounts.platform AS t1_r1,
accounts.username AS t1_r2,
accounts.created_at AS t1_r3,
accounts.updated_at AS t1_r4,
accounts.remote_id AS t1_r5,
accounts.name AS t1_r6,
accounts.language AS t1_r7,
accounts.description AS t1_r8,
accounts.timezone AS t1_r9,
accounts.profile_image_url AS t1_r10,
accounts.post_count AS t1_r11,
accounts.follower_count AS t1_r12,
accounts.following_count AS t1_r13,
accounts.uri AS t1_r14,
accounts.location AS t1_r15,
accounts.favorite_count AS t1_r16,
accounts.raw AS t1_r17,
accounts.followers_completed_at AS t1_r18,
accounts.followings_completed_at AS t1_r19,
accounts.followers_started_at AS t1_r20,
accounts.followings_started_at AS t1_r21,
accounts.profile_fetched_at AS t1_r22,
accounts.managed_source_id AS t1_r23
FROM segment_members
INNER JOIN accounts ON accounts.id = segment_members.account_id
WHERE segment_members.segment_id = 1
ORDER BY accounts.follower_count ASC LIMIT 20
OFFSET 0;
Here are the indexes on the tables:
accounts
"accounts_pkey" PRIMARY KEY, btree (id)
"index_accounts_on_remote_id_and_platform" UNIQUE, btree (remote_id, platform)
"index_accounts_on_description" btree (description)
"index_accounts_on_favorite_count" btree (favorite_count)
"index_accounts_on_follower_count" btree (follower_count)
"index_accounts_on_following_count" btree (following_count)
"index_accounts_on_lower_username_and_platform" btree (lower(username::text), platform)
"index_accounts_on_post_count" btree (post_count)
"index_accounts_on_profile_fetched_at_and_platform" btree (profile_fetched_at, platform)
"index_accounts_on_username" btree (username)
segment_members
"segment_members_pkey" PRIMARY KEY, btree (id)
"index_segment_members_on_segment_id_and_account_id" UNIQUE, btree (segment_id, account_id)
"index_segment_members_on_account_id" btree (account_id)
"index_segment_members_on_segment_id" btree (segment_id)
In my development and staging databases, the query plan looks like the following, and the query executes very quickly.
Limit (cost=4802.15..4802.20 rows=20 width=2086)
-> Sort (cost=4802.15..4803.20 rows=421 width=2086)
Sort Key: accounts.follower_count
-> Nested Loop (cost=20.12..4790.95 rows=421 width=2086)
-> Bitmap Heap Scan on segment_members (cost=19.69..1244.24 rows=421 width=38)
Recheck Cond: (segment_id = 1)
              -> Bitmap Index Scan on index_segment_members_on_segment_id_and_account_id (cost=0.00..19.58 rows=421 width=0)
Index Cond: (segment_id = 1)
-> Index Scan using accounts_pkey on accounts (cost=0.43..8.41 rows=1 width=2048)
Index Cond: (id = segment_members.account_id)
In production, however, the query plan is the following, and the query takes forever (several minutes until it hits the statement timeout).
Limit (cost=0.86..25120.72 rows=20 width=2130)
-> Nested Loop (cost=0.86..4614518.64 rows=3674 width=2130)
      -> Index Scan using index_accounts_on_follower_count on accounts (cost=0.43..2779897.53 rows=3434917 width=2092)
      -> Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members (cost=0.43..0.52 rows=1 width=38)
Index Cond: ((segment_id = 1) AND (account_id = accounts.id))
accounts has about 6M rows in staging and 3M in production. segment_members has about 300K rows in staging and 4M in production. Is it the difference in table sizes that is causing the difference in query plan selection? Is there any way I can get Postgres to use the faster query plan in production?
Update:
Here's the EXPLAIN ANALYZE from the slow production server:
Limit (cost=0.86..22525.66 rows=20 width=2127) (actual time=173.148..187568.247 rows=20 loops=1)
-> Nested Loop (cost=0.86..4654749.92 rows=4133 width=2127) (actual time=173.141..187568.193 rows=20 loops=1)
-> Index Scan using index_accounts_on_follower_count on accounts (cost=0.43..2839731.81 rows=3390197 width=2089) (actual time=0.110..180374.279 rows=1401278 loops=1)
-> Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members (cost=0.43..0.53 rows=1 width=38) (actual time=0.003..0.003 rows=0 loops=1401278)
Index Cond: ((segment_id = 1) AND (account_id = accounts.id))
Total runtime: 187568.318 ms
(6 rows)
Either your table statistics are not up to date or the two queries you present are very different. The second one is estimated to retrieve 3.5M rows (rows=3434917). With ORDER BY / LIMIT 20, Postgres is forced to sort all 3.5 million rows to find the top 20, which is going to be extremely expensive - unless you have a matching index.
The first query plan expects to sort 421 rows. Not even close. Different query plans are no surprise.
It would be interesting to see the output of EXPLAIN ANALYZE, not just EXPLAIN. (Expensive for the second query!)
It very much depends on how many account_ids there are for each segment_id. If segment_id is not selective, the query cannot be fast. Your only other option is a MATERIALIZED VIEW with the top n rows per segment_id and an appropriate regime to keep it up to date, as sketched below.
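A minimal sketch of such a materialized view (names are illustrative), keeping the top 20 accounts per segment by follower_count:
CREATE MATERIALIZED VIEW segment_top_members AS
SELECT segment_id, account_id, follower_count
FROM  (SELECT sm.segment_id, sm.account_id, a.follower_count
            , row_number() OVER (PARTITION BY sm.segment_id
                                 ORDER BY a.follower_count) AS rn
       FROM   segment_members sm
       JOIN   accounts a ON a.id = sm.account_id) sub
WHERE  rn <= 20;

CREATE INDEX ON segment_top_members (segment_id, follower_count);
REFRESH MATERIALIZED VIEW segment_top_members;  -- re-run on a schedule that fits your data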
If your statistics are not up to date, just run ANALYZE on both tables and retry.
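In its simplest form:
ANALYZE segment_members;
ANALYZE accounts;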
It might help to increase the statistics target for selected columns:
ALTER TABLE segment_members ALTER segment_id SET STATISTICS 1000;
ALTER TABLE segment_members ALTER account_id SET STATISTICS 1000;
ALTER TABLE accounts ALTER id SET STATISTICS 1000;
ALTER TABLE accounts ALTER follower_count SET STATISTICS 1000;
ANALYZE segment_members(segment_id, account_id);
ANALYZE accounts (id, follower_count);
Details:
Check statistics targets in PostgreSQL
Keep PostgreSQL from sometimes choosing a bad query plan
Better indexes
In addition to your existing UNIQUE constraint index_segment_members_on_segment_id_and_account_id on segment_members, I suggest a multicolumn index on accounts:
CREATE INDEX index_accounts_on_id_and_follower_count
    ON accounts (id, follower_count);  -- new name: index_accounts_on_follower_count already exists
Again, run ANALYZE after creating the index.
Some indexes useless?
All other indexes in your question are irrelevant for this query. They may be useful for other purposes or useless.
This index is 100% dead freight, drop it. (Detailed explanation here.)
"index_segment_members_on_segment_id" btree (segment_id)
This one may be useless:
"index_accounts_on_description" btree (description)
A "description" is typically free text that is hardly ever used to order rows or in a WHERE condition with a suitable operator - but that's just an educated guess.
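If in doubt, the pg_stat_user_indexes view shows how often each index has actually been scanned; an index with idx_scan = 0 after a representative workload is a candidate for dropping:
SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'accounts'
ORDER  BY idx_scan;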