When running a query locally, EXPLAIN shows an index being used, but in production, it's not being used.
The index in question is on a JSONB column.
I've rebuilt the indexes in production to see if that was the issue, but it didn't make a difference.
How can I debug (or fix) production not using the indexes properly?
Here's the DDL of the table and its indices.
-- Table Definition ----------------------------------------------
CREATE TABLE tracks (
id BIGSERIAL PRIMARY KEY,
album_id bigint REFERENCES albums(id),
artist_id bigint REFERENCES artists(id),
duration integer,
explicit boolean,
spotify_id text,
link text,
name text,
popularity integer,
preview_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
audio_features jsonb,
lyrics text,
lyrics_last_checked_at timestamp without time zone,
audio_features_last_checked timestamp without time zone
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX tracks_pkey ON tracks(id int8_ops);
CREATE INDEX index_tracks_on_album_id ON tracks(album_id int8_ops);
CREATE INDEX index_tracks_on_artist_id ON tracks(artist_id int8_ops);
CREATE INDEX index_tracks_on_spotify_id ON tracks(spotify_id text_ops);
CREATE INDEX index_tracks_on_explicit ON tracks(explicit bool_ops);
CREATE INDEX index_tracks_on_lyrics_last_checked_at ON tracks(lyrics_last_checked_at timestamp_ops);
CREATE INDEX index_tracks_on_audio_features ON tracks USING GIN (audio_features jsonb_ops);
CREATE INDEX index_tracks_on_audio_features_energy ON tracks(((audio_features ->> 'energy'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_tempo ON tracks(((audio_features ->> 'tempo'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_key ON tracks(((audio_features ->> 'key'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_danceability ON tracks(((audio_features ->> 'danceability'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_acousticness ON tracks(((audio_features ->> 'acousticness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_speechiness ON tracks(((audio_features ->> 'speechiness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_instrumentalness ON tracks(((audio_features ->> 'instrumentalness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_valence ON tracks(((audio_features ->> 'valence'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_last_checked ON tracks(audio_features_last_checked timestamp_ops);
Here's the query I'm running.
EXPLAIN ANALYZE SELECT "tracks".* FROM "tracks" WHERE ((audio_features ->> 'speechiness')::numeric between 0.1 and 1.0)
Here's the output locally.
Bitmap Heap Scan on tracks (cost=209.20..2622.49 rows=5943 width=616) (actual time=23.510..179.007 rows=5811 loops=1)
Recheck Cond: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Heap Blocks: exact=1844
-> Bitmap Index Scan on index_tracks_on_audio_features_speechiness (cost=0.00..207.72 rows=5943 width=0) (actual time=21.463..21.463 rows=5999 loops=1)
Index Cond: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Planning Time: 10.248 ms
Execution Time: 179.460 ms
And here's the output in production.
Gather (cost=1000.00..866366.71 rows=607073 width=1318) (actual time=0.486..100252.108 rows=606680 loops=1)
Workers Planned: 4
Workers Launched: 3
-> Parallel Seq Scan on tracks (cost=0.00..804659.41 rows=151768 width=1318) (actual time=0.211..99938.152 rows=151670 loops=4)
Filter: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Rows Removed by Filter: 731546
Planning Time: 3.029 ms
Execution Time: 100292.766 ms
The data are quite different in both databases, so it is not surprising if the same query is executed differently.
The production query returns more than 17% of the rows in the table, which is enough that PostgreSQL estimates a sequential scan to be cheaper than going through the index.
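As a rough check of that percentage, the figures from the production plan can be plugged into a one-line query. This is only a sketch: it assumes that "Rows Removed by Filter" is reported as a per-loop average (loops=4 in the plan), so the totals are approximate.

```sql
-- Numbers copied from the production plan above.
-- 606680 rows returned; about 731546 rows removed per loop, 4 loops.
SELECT round(100.0 * 606680 / (606680 + 4 * 731546), 1) AS pct_of_table_returned;
-- comes out at roughly 17 percent of the table
```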
For test purposes, you can set enable_seqscan to off in your session, run the query again and check if PostgreSQL was right.
If you didn't select all columns from the table, but only columns that are indexed, and the table has been vacuumed recently, you might get an index only scan which would be much faster.
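A minimal sketch of the enable_seqscan experiment, using the query from the question. SET LOCAL keeps the change inside the transaction, so nothing leaks into other sessions; the BUFFERS option is an addition here to make the I/O difference visible.

```sql
BEGIN;
-- Discourage sequential scans for this transaction only.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT "tracks".*
FROM "tracks"
WHERE (audio_features ->> 'speechiness')::numeric BETWEEN 0.1 AND 1.0;

-- Discard the setting again.
ROLLBACK;
```

If the execution time with the index turns out worse than the roughly 100 seconds of the parallel sequential scan, the planner's choice was right for this data.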
I am currently working on a PostgreSQL DB where we implemented a multi-tenant infrastructure.
Here's how it works:
We have several tables (table1, table2, ...) to which we added a tenant column. This column is used to filter rows based on the DB user. We have several users (tenant1, tenant2, ...) and a superuser (no tenant applied to it).
I want to optimize the following simple query:
SELECT id
FROM table
WHERE UPPER("table"."column"::text) = UPPER('blablabla')
So I created a function index:
CREATE INDEX "upper_idx" ON "table" (upper("column") );
If I connect to the DB as superuser and execute my SELECT query, it runs smoothly and fast.
Bitmap Heap Scan on table (cost=71.66..9225.47 rows=2998 width=4)
Recheck Cond: (upper((column)::text) = 'blablabla'::text)
-> Bitmap Index Scan on upper_idx (cost=0.00..70.91 rows=2998 width=0)
Index Cond: (upper((column)::text) = 'blablabla'::text)
However, when I connect as tenant1, the index is not picked up and the DB runs a sequential scan instead:
Gather (cost=1000.00..44767.19 rows=15 width=4)
Workers Planned: 2
-> Parallel Seq Scan on table (cost=0.00..43765.69 rows=6 width=4)
Filter: (((tenant)::text = (CURRENT_USER)::text) AND (upper((column)::text) = 'blablabla'::text))
Do you know how to make it work in this case?
EDIT - added EXPLAIN (ANALYZE, BUFFERS)
Gather (cost=1000.00..44767.19 rows=15 width=4) (actual time=502.601..503.466 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=781 read=36160
-> Parallel Seq Scan on table (cost=0.00..43765.69 rows=6 width=4) (actual time=498.978..498.978 rows=0 loops=3)
Filter: (((tenant)::text = (CURRENT_USER)::text) AND (upper((column)::text) = 'blablabla'::text))
Rows Removed by Filter: 199846
Buffers: shared hit=781 read=36160
Planning Time: 1.650 ms
Execution Time: 503.510 ms
EDIT 2 - added the (truncated) CREATE TABLE statement
-- public.table definition
-- Drop table
-- DROP TABLE public.table;
CREATE TABLE public.table (
id serial NOT NULL,
created timestamptz NOT NULL,
modified timestamptz NOT NULL,
...
column varchar(100) NULL,
"tenant" tenant NOT NULL,
...
);
...
CREATE INDEX upper_idx ON public.table USING btree (upper((column)::text), tenant);
CREATE INDEX table_column_91bdd18f ON public.table USING btree (column);
CREATE INDEX table_column_91bdd18f_like ON public.table USING btree (column varchar_pattern_ops);
...
-- Table Triggers
create trigger archive_deleted_rows after
delete
on
public.table for each row execute procedure archive.archive('{id}');
create trigger set_created_modified before
insert
or
update
on
public.table for each row execute procedure set_created_modified();
create trigger set_tenant before
insert
or
update
on
public.table for each row execute procedure set_tenant();
-- public.table foreign keys
...
EDIT 3 - dump of \d table
Table "public.table"
Column | Type | Collation | Nullable | Default
------------------------------------------+--------------------------+-----------+----------+--------------------------------------------
id | integer | | not null | nextval('table_id_seq'::regclass)
........
column | character varying(100) | | |
tenant | tenant | | not null |
........
Indexes:
"table_pkey" PRIMARY KEY, btree (id)
..........
"table__column_upper_idx" btree (upper(column::text), tenant)
"table_column_91bdd18f" btree (column)
"table_column_91bdd18f_like" btree (column varchar_pattern_ops)
.........
Check constraints:
.........
Foreign-key constraints:
.........
Referenced by:
.........
Policies:
POLICY "tenant_policy"
TO tenant1,tenant2
USING (((tenant)::text = (CURRENT_USER)::text))
Triggers:
........
set_tenant BEFORE INSERT OR UPDATE ON table FOR EACH ROW EXECUTE PROCEDURE set_tenant()
EDIT 4 - Added tenant data type
CREATE TYPE tenant AS ENUM (
'tenant1',
'tenant2');
You should add a multi-column index like this:
CREATE INDEX "upper_column_for_tenant_idx" ON "table" (upper("column") , tenant);
In order to have only one index you should place upper("column") first and then tenant.
PostgreSQL docs state:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
I've tried to recreate your setup in db<>fiddle. You can see that the plan for the query
EXPLAIN ANALYZE
SELECT * FROM public.table WHERE upper(("column")::text) = '50' and tenant='5'
is:
QUERY PLAN
Bitmap Heap Scan on "table" (cost=14.20..2333.45 rows=954 width=24) (actual time=0.093..0.831 rows=1000 loops=1)
Recheck Cond: ((upper(("column")::text) = '50'::text) AND ((tenant)::text = '5'::text))
Heap Blocks: exact=74
-> Bitmap Index Scan on upper_idx (cost=0.00..13.97 rows=954 width=0) (actual time=0.065..0.065 rows=1000 loops=1)
Index Cond: ((upper(("column")::text) = '50'::text) AND ((tenant)::text = '5'::text))
Planning Time: 0.361 ms
Execution Time: 0.997 ms
You should create an index on both columns. Since tenant is an enum data type and you compare it to a function result of type name, both sides are cast to the "common denominator" text. So use this index:
CREATE INDEX ON "table" (upper("column"), (tenant::text));
Then calculate new statistics for the table:
ANALYZE "table";
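To check whether the tenant's plan changes, the policy can be exercised without logging in again. This is a sketch that assumes the current login is allowed to SET ROLE to tenant1 (a superuser or a member of that role); the query is the one from the question.

```sql
-- Become the tenant so that CURRENT_USER and the row-level security policy apply.
SET ROLE tenant1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM "table"
WHERE upper("column"::text) = upper('blablabla');

-- Return to the original role.
RESET ROLE;
```

The Filter or Index Cond line of the output shows the policy expression exactly as the planner sees it, which is the expression an index has to match.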
I migrated my database from MySQL to PostgreSQL with pgloader. Overall it's much more efficient, but a query with a LIKE condition is much slower on PostgreSQL.
MySQL : ~1ms
PostgreSQL : ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
Query is :
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL)
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
pgloader created indexes on ratemax3v3 and name like this:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name :
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with postgresql at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine
To support a LIKE query like that (left anchored) you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Or maybe even a filtered (partial) index, if e.g. blacklisted = 0 is a static condition and there are relatively few rows matching that condition.
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows have blacklisted = 0, that won't really help (and adding the column to the index wouldn't help either). In that case just an index on (region, name varchar_pattern_ops) is probably more efficient.
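That last variant, written out as a sketch. The index name is hypothetical; the table, columns and query are the ones from the question.

```sql
-- Equality column first, then the prefix-searchable column.
CREATE INDEX personnages_region_name_pattern_idx
    ON wow.personnages (region, name varchar_pattern_ops);

-- Refresh statistics, then look at the plan again.
ANALYZE wow.personnages;

EXPLAIN ANALYZE
SELECT name, realm, region, class, id
FROM wow.personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC
LIMIT 5;
```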
If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform badly.
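For comparison, the GIN variant of the trigram index could be created like this. It is a sketch with a hypothetical index name; it relies on the pg_trgm extension, which must already be installed since the GiST trigram index from the question exists.

```sql
-- Same column as trgm_idx_name, but with the GIN operator class.
CREATE INDEX trgm_gin_idx_name
    ON wow.personnages USING gin (name gin_trgm_ops);

-- Compare the timing of the Bitmap Index Scan node with the GiST plan.
EXPLAIN ANALYZE
SELECT name, realm, region, class, id
FROM wow.personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC
LIMIT 5;
```

The planner may keep choosing the GiST index while both exist, so dropping or renaming trgm_idx_name on a test copy may be needed for a fair comparison.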
I'm trying to index a large JSONB column based on a text field (with an ISO date string). This index works fine using = but is ignored if I use a > condition.
create table test_table (
id text NOT null primary key,
data jsonb,
text_test text
);
Then I add a bunch of data to the jsonb column. To make sure my JSON is valid, I also copy the value I'm interested in from the JSONB column into a separate text column to test against.
update test_table set text_test = (data->>'dueDate');
A quick sample shows it's good ISO formatted date strings:
select text_test, (data->>'dueDate') from test_table limit 1;
-- 2020-08-07T11:59:00 2020-08-07T11:59:00
I add btree indexes to both the JSONB and the text_test copy column. I tried adding one with explicit '::text' casting, as well as one with 'text_pattern_ops'.
create index test_table_duedate_iso on test_table using btree(text_test);
create index test_table_duedate_iso_jsonb on test_table using btree((data->>'dueDate'));
create index test_table_duedate_iso_jsonb_cast on test_table using btree(((data->>'dueDate')::text));
create index test_table_duedate_iso_jsonb_cast_pattern on test_table using btree(((data->>'dueDate')::text) text_pattern_ops);
Now if I query an exact value, explain shows it using the 'cast' version of the index. Good.
explain select * from test_table where (data->>'dueDate') = '2020-08-07T11:59:00';
"-> Bitmap Index Scan on test_table_duedate_iso_jsonb_cast (cost=0.00..10.37 rows=261 width=0)"
But if I try it with a >, it does a very slow full scan.
explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';
--Aggregate (cost=10037.94..10037.95 rows=1 width=8) (actual time=1070.808..1070.813 rows=1 loops=1)
-- -> Seq Scan on test_table (cost=0.00..9994.42 rows=17409 width=0) (actual time=0.069..1057.258 rows=2930 loops=1)
-- Filter: ((data ->> 'dueDate'::text) > '2020-04-14'::text)
-- Rows Removed by Filter: 49298
--Planning Time: 0.252 ms
--Execution Time: 1070.874 ms
So just to check my sanity, I run the same query against the text_test column, and it uses its index as desired:
explain analyze select count(*) from test_table where text_test > '2020-04-14';
--Aggregate (cost=6037.02..6037.03 rows=1 width=8) (actual time=19.979..19.984 rows=1 loops=1)
-- -> Bitmap Heap Scan on test_table (cost=77.76..6030.14 rows=2754 width=0) (actual time=1.354..11.007 rows=2930 loops=1)
-- Recheck Cond: (text_test > '2020-04-14'::text)
-- Heap Blocks: exact=455
-- -> Bitmap Index Scan on test_table_duedate_iso (cost=0.00..77.07 rows=2754 width=0) (actual time=1.215..1.217 rows=2930 loops=1)
-- Index Cond: (text_test > '2020-04-14'::text)
--Planning Time: 0.145 ms
--Execution Time: 20.041 ms
I have also tested indexing a numerical field within the JSON, and that works properly: its index is used for range queries. So it's something about the text field, or something I'm doing wrong with it.
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14), 64-bit
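One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the sequential scan plan estimates 17409 rows, which is one third of the table, and one third is the default guess the planner uses for an inequality when it has no statistics on the expression. ANALYZE also collects statistics for index expressions, so running it after the expression indexes were created could change the estimate.

```sql
-- 2930 matching rows + 49298 filtered rows = 52228 rows in the table;
-- a third of that is about 17409, the estimate shown in the plan.
SELECT round((2930 + 49298) / 3.0) AS default_inequality_estimate;

-- Gather statistics, including those for the expression indexes.
ANALYZE test_table;

-- Re-check the plan and the row estimate.
EXPLAIN ANALYZE
SELECT count(*) FROM test_table WHERE (data->>'dueDate') > '2020-04-14';
```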
I have two Postgres indexes on my table cache, both on fields (Date and Condition) of a jsonb column.
The first one uses an immutable function, which takes the text field and converts it to the date type.
The second one is created on the text only.
When I tried the second one, PostgreSQL used a bitmap index scan instead of a plain index scan on my btree index, and it is somehow slower than the first one, which takes another two steps to work but uses only an index scan.
I have two questions: why and how?
Why does the first one use only an index scan, while the second for some reason uses a bitmap scan? And how can I force PostgreSQL to use a plain index scan instead of the bitmap scan with the second index? I don't want to use the function.
If there is another solution, please give me hints, because I don't have permission to install packages on the server.
Function index:
create index cache_ymd_index on cache (
to_yyyymmdd_date(((data -> 'Info'::text) ->> 'Date'::text)::character varying),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
Text index:
create index cache_data_index on cache (
((data -> 'Info'::text) ->> 'Date'::text),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
The function itself:
create or replace function to_yyyymmdd_date(the_date character varying) returns date
immutable language sql
as
$$
select to_date(the_date, 'YYYY-MM-DD')
$$;
EXPLAIN ANALYZE output for the function index:
Index Scan using cache_ymd_index on cache (cost=0.29..1422.43 rows=364 width=585) (actual time=0.065..66.842 rows=71634 loops=1)
Index Cond: ((to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) >= '2018-01-01'::date) AND (to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) <= '2020-12-01'::date))
Planning Time: 0.917 ms
Execution Time: 70.464 ms
EXPLAIN ANALYZE output for the text index:
Bitmap Heap Scan on cache (cost=12.15..1387.51 rows=364 width=585) (actual time=53.794..87.802 rows=71634 loops=1)
Recheck Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text) AND (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text))
Heap Blocks: exact=16465
-> Bitmap Index Scan on cache_data_index (cost=0.00..12.06 rows=364 width=0) (actual time=51.216..51.216 rows=71634 loops=1)
Index Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text))
Planning Time: 0.247 ms
Execution Time: 90.586 ms
A “bitmap index scan” is also an index scan. It is what PostgreSQL typically chooses if a bigger percentage of the table blocks have to be visited, because it is more efficient in that case.
For an index range scan like in your case, there are two possible explanations for this:
1. ANALYZE has run between the creation of the two indexes, so that PostgreSQL knows about the distribution of the indexed values in the one case, but not in the other.
To figure out if that was the case, run
ANALYZE cache;
and then try the two statements again. Maybe the plans are more similar now.
2. The statements were run on two different tables, which contain the same data but are physically arranged in a different way, so that the correlation is good on the one, but bad on the other. If the correlation is close to 1 or -1, an index scan becomes cheaper. Otherwise, a bitmap index scan is the best way.
Since you indicate that it is the same table in both cases, this explanation can be ruled out.
The second column of your index is superfluous; you should just omit it.
Otherwise, your two indexes should work about the same.
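A sketch of the text index without the superfluous second column. The index name is hypothetical; the expressions are copied from the question. Because the partial index condition already pins Condition to '3', keeping that value as a key column adds nothing.

```sql
create index cache_data_date_index on cache (
    ((data -> 'Info'::text) ->> 'Date'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);

-- Collect statistics for the new index expression.
analyze cache;
```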
Of course all that would work much better if the table was defined with a date column in the first place...
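What a real date column could look like, as a sketch under assumptions: the column and index names are hypothetical, and the backfill assumes every Date value parses with the YYYY-MM-DD format used by the function in the question.

```sql
-- Hypothetical new column holding the parsed date.
ALTER TABLE cache ADD COLUMN info_date date;

-- One-off backfill from the JSON text field.
UPDATE cache
SET info_date = to_date(data -> 'Info' ->> 'Date', 'YYYY-MM-DD');

-- Partial index mirroring the condition of the original indexes.
CREATE INDEX cache_info_date_idx ON cache (info_date)
WHERE ((data -> 'Info') ->> 'Condition') = '3';

ANALYZE cache;

-- Range predicates then compare dates directly, without the wrapper function.
EXPLAIN ANALYZE
SELECT *
FROM cache
WHERE info_date BETWEEN DATE '2018-01-01' AND DATE '2020-12-01'
  AND ((data -> 'Info') ->> 'Condition') = '3';
```

New and updated rows would need the column kept in sync by the application or by a trigger; that part is left out here.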
I have the following situation:
Data = around 400 million (string1, string2, score) tuples
Data size ~ 20 GB, doesn't fit in memory.
Data is stored in a file in CSV format, and not sorted by any field.
I need to efficiently retrieve all tuples with a particular string, e.g. all tuples s.t. string1 = 'google'.
How do I design a system such that I can do this efficiently?
I have already tried PostgreSQL with a B-tree index and a GIN index, but they aren't fast enough (more than 20-30 seconds per query).
Ideally, I need a solution which sorts the tuples by string1, stores them in sorted fashion and then runs a binary search, followed by a sequential scan for retrieval. But I don't know which database or system implements such functionality.
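In PostgreSQL terms, "store sorted, then search and read sequentially" is close to what a B-tree index combined with CLUSTER gives. The following is only a sketch with a hypothetical table and index name, not a claim that it meets the timing goal.

```sql
-- Hypothetical table holding the (string1, string2, score) tuples.
CREATE TABLE tuples (
    string1 varchar,
    string2 varchar,
    score   real
);

-- B-tree index used for the lookup (the "binary search" part).
CREATE INDEX tuples_string1_idx ON tuples (string1);

-- Physically rewrite the table in index order, so rows with the same
-- string1 end up in adjacent blocks (the "sequential read" part).
CLUSTER tuples USING tuples_string1_idx;
ANALYZE tuples;

EXPLAIN ANALYZE
SELECT * FROM tuples WHERE string1 = 'google';
```

CLUSTER is a one-time operation that locks the table while it runs and needs free disk space for a full copy; rows loaded later are not kept in order automatically.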
UPDATE:
Here's the postgres details:
I bulk-loaded the data into Postgres using the COPY command. Then I created two indices on string1, one B-tree and one GIN. However, Postgres is not using either of them.
Create tables:
CREATE TABLE mytable(
string1 varchar primary key, string2 varchar, source_id integer REFERENCES sources(id), score real);
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX string1_gin_index ON mytable USING gin (string1 gin_trgm_ops);
CREATE INDEX string1_index ON mytable(lower(string1));
Query plan:
isa=# EXPLAIN ANALYZE VERBOSE select * from mytable where string1 ilike 'google';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.mytable (cost=235.88..41872.81 rows=11340 width=29) (actual time=20234.765..25566.128 rows=30971 loops=1)
Output: hyponym, string2, source_id, score
Recheck Cond: ((mytable.string1)::text ~~* 'google'::text)
Rows Removed by Index Recheck: 34573
-> Bitmap Index Scan on string1_gin_index (cost=0.00..233.05 rows=11340 width=0) (actual time=20218.263..20218.263 rows=65544 loops=1)
Index Cond: ((mytable.string1)::text ~~* 'google'::text)
Total runtime: 25568.209 ms
(7 rows)
isa=# EXPLAIN ANALYZE VERBOSE select * from isa where string1 = 'google';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.mytable (cost=0.00..2546373.30 rows=3425 width=29) (actual time=11692.606..139401.099 rows=30511 loops=1)
Output: string1, string2, source_id, score
Filter: ((mytable.string1)::text = 'google'::text)
Rows Removed by Filter: 124417194
Total runtime: 139403.950 ms
(5 rows)
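One observation on the second plan, as a sketch: the B-tree index shown above is built on lower(string1), and an expression index is only considered when the query uses the same expression. A predicate on plain string1 therefore cannot use it. The second index name below is hypothetical, and the sketch assumes the real table has no other B-tree index on string1.

```sql
-- Matches the expression of string1_index, so that index becomes usable.
EXPLAIN ANALYZE
SELECT * FROM mytable WHERE lower(string1) = 'google';

-- Alternatively, for case-sensitive lookups, index the bare column.
CREATE INDEX string1_plain_index ON mytable (string1);
ANALYZE mytable;

EXPLAIN ANALYZE
SELECT * FROM mytable WHERE string1 = 'google';
```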