I have two Postgres indexes on my table cache, both built on a jsonb column over the fields Date and Condition.
The first one works through an immutable function, which takes the text field and transforms it into the date type.
The second one is created on the text value directly.
When I try the second one, PostgreSQL turns the btree index scan into a bitmap scan, and it somehow runs slower than the first one, which takes two extra steps (the function calls) but uses a plain index scan.
I have two questions: why and how?
Why does the first one use only the index, while the second one for some reason uses a bitmap? And how can I force PostgreSQL to use only the index and not the bitmap on the second index, because I don't want to use the function?
If there is another solution, please give me hints, because I don't have permission to install packages on the server.
Function index:
create index cache_ymd_index on cache (
to_yyyymmdd_date(((data -> 'Info'::text) ->> 'Date'::text)::character varying),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
Text index:
create index cache_data_index on cache (
((data -> 'Info'::text) ->> 'Date'::text),
((data -> 'Info'::text) ->> 'Condition'::text)
) where (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text);
The function itself:
create or replace function to_yyyymmdd_date(the_date character varying) returns date
immutable language sql
as
$$
select to_date(the_date, 'YYYY-MM-DD')
$$;
EXPLAIN ANALYZE output for the function index:
Index Scan using cache_ymd_index on cache (cost=0.29..1422.43 rows=364 width=585) (actual time=0.065..66.842 rows=71634 loops=1)
Index Cond: ((to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) >= '2018-01-01'::date) AND (to_yyyymmdd_date((((data -> 'Info'::text) ->> 'Date'::text))::character varying) <= '2020-12-01'::date))
Planning Time: 0.917 ms
Execution Time: 70.464 ms
EXPLAIN ANALYZE output for the text index:
Bitmap Heap Scan on cache (cost=12.15..1387.51 rows=364 width=585) (actual time=53.794..87.802 rows=71634 loops=1)
Recheck Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text) AND (((data -> 'Info'::text) ->> 'Condition'::text) = '3'::text))
Heap Blocks: exact=16465
-> Bitmap Index Scan on cache_data_index (cost=0.00..12.06 rows=364 width=0) (actual time=51.216..51.216 rows=71634 loops=1)
Index Cond: ((((data -> 'Info'::text) ->> 'Date'::text) >= '2018-01-01'::text) AND (((data -> 'Info'::text) ->> 'Date'::text) <= '2020-12-01'::text))
Planning Time: 0.247 ms
Execution Time: 90.586 ms
A “bitmap index scan” is also an index scan. It is what PostgreSQL typically chooses if a bigger percentage of the table blocks has to be visited, because it is more efficient in that case.
For an index range scan like in your case, there are two possible explanations for this:
ANALYZE ran between the creation of the two indexes, so that PostgreSQL knows about the distribution of the indexed values in the one case, but not in the other.
To figure out if that was the case, run
ANALYZE cache;
and then try the two statements again. Maybe the plans are more similar now.
The statements were run on two different tables, which contain the same data but are physically arranged differently, so that the correlation is good on the one but bad on the other. If the correlation is close to 1 or -1, an index scan becomes cheaper. Otherwise, a bitmap index scan is the better choice.
Since you indicate that it is the same table in both cases, this explanation can be ruled out.
The second column of your index is superfluous: the partial index's WHERE clause already fixes its value to '3', so every index entry would store the same constant. You should just omit it.
Otherwise, your two indexes should work about the same.
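A sketch of the slimmed-down function index (same partial predicate, single key column), based on the DDL above:

```sql
-- Sketch: the partial predicate already fixes Condition = '3',
-- so the second key column stores a constant and can be dropped
CREATE INDEX cache_ymd_index ON cache (
    to_yyyymmdd_date(((data -> 'Info') ->> 'Date')::varchar)
) WHERE ((data -> 'Info') ->> 'Condition') = '3';
```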
Of course all that would work much better if the table was defined with a date column in the first place...
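One hedged way to get close to that without changing the application's JSON, assuming PostgreSQL 12 or later, is a stored generated column. Generated columns require an immutable expression, so the sketch reuses the to_yyyymmdd_date wrapper from the question; the column and index names here are made up:

```sql
-- Sketch, assuming PostgreSQL 12+: materialize the date once per row,
-- then index the plain date column with an ordinary btree
ALTER TABLE cache
    ADD COLUMN info_date date
    GENERATED ALWAYS AS (to_yyyymmdd_date(((data -> 'Info') ->> 'Date')::varchar)) STORED;

CREATE INDEX cache_info_date_idx ON cache (info_date);
```

Queries could then filter on info_date directly with ordinary date comparisons.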
I have the following table:
CREATE TABLE m2m_entries_n_elements(
    entry_id UUID,
    element_id UUID,
    value JSONB
);
value is a jsonb object in the following format: {<type>: <value>}
And I want to create GIN index only for number values:
CREATE INDEX IF NOT EXISTS idx_element_value_number
ON m2m_entries_n_elements
USING GIN (element_id, CAST(value ->> 'number' AS INT))
WHERE value ? 'number';
But when I use EXPLAIN ANALYZE, I see that the index is not used:
EXPLAIN ANALYZE
SELECT *
FROM m2m_entries_n_elements WHERE CAST(value ->> 'number' AS INT) = 2;
Seq Scan on m2m_entries_n_elements (cost=0.00..349.02 rows=50 width=89) (actual time=0.013..2.087 rows=1663 loops=1)
Filter: (((value ->> 'number'::text))::integer = 2)
Rows Removed by Filter: 8338
Planning Time: 0.042 ms
Execution Time: 2.150 ms
But if I remove WHERE value ? 'number' from creating the index, it starts working:
Bitmap Heap Scan on m2m_entries_n_elements (cost=6.39..70.29 rows=50 width=89) (actual time=0.284..0.819 rows=1663 loops=1)
Recheck Cond: (((value ->> 'number'::text))::integer = 2)
Heap Blocks: exact=149
-> Bitmap Index Scan on idx_elements (cost=0.00..6.38 rows=50 width=0) (actual time=0.257..0.258 rows=1663 loops=1)
Index Cond: (((value ->> 'number'::text))::integer = 2)
Planning Time: 0.207 ms
Execution Time: 0.922 ms
PostgreSQL does not have a general theorem prover. Maybe you intuit that value ->> 'number' being defined implies that value ? 'number' is true, but PostgreSQL doesn't know that. You would need to explicitly include the ? condition in your query to get use of the index.
But PostgreSQL is smart enough to know that CAST(value ->> 'number' AS INT) = 2 does imply that the LHS can't be null, so if you create the partial index WHERE value ->> 'number' IS NOT NULL then it will get used with no change to your query.
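A sketch of that rewritten partial index, using the table and index names from the question (as in the original, a GIN index over a scalar key assumes the btree_gin extension is available):

```sql
-- Sketch: same index, but with a predicate PostgreSQL can prove
-- from CAST(value ->> 'number' AS INT) = 2 alone
DROP INDEX IF EXISTS idx_element_value_number;
CREATE INDEX idx_element_value_number
    ON m2m_entries_n_elements
    USING GIN (element_id, CAST(value ->> 'number' AS INT))
    WHERE value ->> 'number' IS NOT NULL;
```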
I have a simple query like this, on a very large table:
(select "table_a"."id",
"table_a"."b_id",
"table_a"."timestamp"
from "table_a"
left outer join "table_b"
on "table_b"."b_id" = "table_a"."b_id"
where ((cast("table_b"."total" ->> 'bar' as int) - coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)), 0)) > 0 and coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)),
0) > 0)
order by "table_a"."timestamp" desc fetch next 25 rows only)
The problem is that it takes quite some time:
Limit (cost=0.84..160.44 rows=25 width=41) (actual time=2267.067..2267.069 rows=0 loops=1)
-> Nested Loop (cost=0.84..124849.43 rows=19557 width=41) (actual time=2267.065..2267.066 rows=0 loops=1)
-> Index Scan using table_a_timestamp_index on table_a (cost=0.42..10523.32 rows=188976 width=33) (actual time=0.011..57.550 rows=188976 loops=1)
-> Index Scan using table_b_b_id_key on table_b (cost=0.42..0.60 rows=1 width=103) (actual time=0.011..0.011 rows=0 loops=188976)
Index Cond: ((b_id)::text = (table_a.b_id)::text)
Filter: ((COALESCE((((ok ->> 'bar'::text))::integer + ((ko ->> 'bar'::text))::integer), 0) > 0) AND ((((total ->> 'bar'::text))::integer - COALESCE((((ok ->> 'bar'::text))::integer + ((ko ->> 'bar'::text))::integer), 0)) > 0))
Rows Removed by Filter: 1
Planning Time: 0.411 ms
Execution Time: 2267.135 ms
I tried adding indexes:
create index table_b_bar_total ON "table_b" using BTREE (coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)),
0));
create index table_b_bar_remaining ON "table_b" using BTREE
((cast("table_b"."total" ->> 'bar' as int) - coalesce(
(cast("table_b"."ok" ->> 'bar' as int) +
cast("table_b"."ko" ->> 'bar' as int)), 0)));
But it doesn't change anything. How can I make this query run faster?
Ordinary column indexes don't have their own statistics, as the table's statistics are sufficient for the indexes to be assessed during planning. Expression indexes, however, do have their own statistics collected (on the expression results) whenever the table is analyzed. The problem is that creating an expression index does not trigger an autoanalyze run on the table, so those needed stats can stay uncollected for a long time. It is therefore a good idea to manually run ANALYZE after creating an expression index.
Since your expressions are always compared to zero, it might be better to create one index on the larger expression (including the >0 comparisons and the ANDing of them as part of the indexed expression), rather than two indexes which need to be bitmap-ANDed. Since that expression is a boolean, it might be tempting to create a partial index with it, but I think that that would be a mistake. Unlike expressional indexes, partial indexes do not have statistics collected, and so do not help inform the planner on how many rows will be found.
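A sketch of that single combined index (the index name table_b_bar_flag is made up; the query would then have to repeat exactly the same expression for the planner to match it):

```sql
-- Sketch: index the whole boolean condition as one expression,
-- then ANALYZE so statistics on the expression are collected
CREATE INDEX table_b_bar_flag ON "table_b" ((
        (cast("table_b"."total" ->> 'bar' as int)
         - coalesce(cast("table_b"."ok" ->> 'bar' as int)
                  + cast("table_b"."ko" ->> 'bar' as int), 0)) > 0
    AND coalesce(cast("table_b"."ok" ->> 'bar' as int)
               + cast("table_b"."ko" ->> 'bar' as int), 0) > 0
));
ANALYZE "table_b";
```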
When running a query locally, EXPLAIN shows an index being used, but in production, it's not being used.
The index in question is on a JSONB column.
I've rebuilt the indexes in production to see if that was the issue, but it didn't make a difference.
How can I debug (or fix) production not using the indexes properly?
Here's the DDL of the table and its indices.
-- Table Definition ----------------------------------------------
CREATE TABLE tracks (
id BIGSERIAL PRIMARY KEY,
album_id bigint REFERENCES albums(id),
artist_id bigint REFERENCES artists(id),
duration integer,
explicit boolean,
spotify_id text,
link text,
name text,
popularity integer,
preview_url text,
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
audio_features jsonb,
lyrics text,
lyrics_last_checked_at timestamp without time zone,
audio_features_last_checked timestamp without time zone
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX tracks_pkey ON tracks(id int8_ops);
CREATE INDEX index_tracks_on_album_id ON tracks(album_id int8_ops);
CREATE INDEX index_tracks_on_artist_id ON tracks(artist_id int8_ops);
CREATE INDEX index_tracks_on_spotify_id ON tracks(spotify_id text_ops);
CREATE INDEX index_tracks_on_explicit ON tracks(explicit bool_ops);
CREATE INDEX index_tracks_on_lyrics_last_checked_at ON tracks(lyrics_last_checked_at timestamp_ops);
CREATE INDEX index_tracks_on_audio_features ON tracks USING GIN (audio_features jsonb_ops);
CREATE INDEX index_tracks_on_audio_features_energy ON tracks(((audio_features ->> 'energy'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_tempo ON tracks(((audio_features ->> 'tempo'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_key ON tracks(((audio_features ->> 'key'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_danceability ON tracks(((audio_features ->> 'danceability'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_acousticness ON tracks(((audio_features ->> 'acousticness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_speechiness ON tracks(((audio_features ->> 'speechiness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_instrumentalness ON tracks(((audio_features ->> 'instrumentalness'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_valence ON tracks(((audio_features ->> 'valence'::text)::numeric) numeric_ops);
CREATE INDEX index_tracks_on_audio_features_last_checked ON tracks(audio_features_last_checked timestamp_ops);
Here's the query I'm running.
EXPLAIN ANALYZE SELECT "tracks".* FROM "tracks" WHERE ((audio_features ->> 'speechiness')::numeric between 0.1 and 1.0)
Here's the output locally.
Bitmap Heap Scan on tracks (cost=209.20..2622.49 rows=5943 width=616) (actual time=23.510..179.007 rows=5811 loops=1)
Recheck Cond: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Heap Blocks: exact=1844
-> Bitmap Index Scan on index_tracks_on_audio_features_speechiness (cost=0.00..207.72 rows=5943 width=0) (actual time=21.463..21.463 rows=5999 loops=1)
Index Cond: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Planning Time: 10.248 ms
Execution Time: 179.460 ms
And here's the output in production.
Gather (cost=1000.00..866366.71 rows=607073 width=1318) (actual time=0.486..100252.108 rows=606680 loops=1)
Workers Planned: 4
Workers Launched: 3
-> Parallel Seq Scan on tracks (cost=0.00..804659.41 rows=151768 width=1318) (actual time=0.211..99938.152 rows=151670 loops=4)
Filter: ((((audio_features ->> 'speechiness'::text))::numeric >= 0.1) AND (((audio_features ->> 'speechiness'::text))::numeric <= 1.0))
Rows Removed by Filter: 731546
Planning Time: 3.029 ms
Execution Time: 100292.766 ms
The data are quite different in both databases, so it is not surprising if the same query is executed differently.
The production query returns more than 17% of the rows in the table, which is obviously so much that PostgreSQL thinks it cheaper to perform a sequential scan.
For test purposes, you can set enable_seqscan to off in your session, run the query again and check if PostgreSQL was right.
If you didn't select all columns from the table, but only columns that are indexed, and the table has been vacuumed recently, you might get an index only scan which would be much faster.
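A minimal sketch of that test, session-local so it does not affect other connections; the query is the one from the question:

```sql
-- Sketch: temporarily forbid sequential scans to see what plan and
-- runtime the index-based alternative would give, then restore the default
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT "tracks".* FROM "tracks"
WHERE ((audio_features ->> 'speechiness')::numeric BETWEEN 0.1 AND 1.0);
RESET enable_seqscan;
```

If the forced index plan is slower than the sequential scan, the planner was right to avoid the index.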
I have a Postgres 9.6 database running in production. It has a table that has around 98,000,000 rows and is growing.
It has a column file_path that stores the relative path to a file. Example: /directory1/123456/CT1_1111_111_111-CT2_2222_222_222-fail.xml. The values of the CTx_xxx keep changing.
Currently there are no indexes on this column since we didn't really do a search using it. However, the need has arisen to fetch using this column with no other supporting indexed columns. What makes my problem harder is that the search needs to support wild card search, where file_path like '%CT1_1111%'.
Running this in a query takes forever as expected. I need to index this column, but can't seem to find a solution for this.
A simple b-tree index obviously didn't work since it wont support LIKE.
Then I tried text_pattern_ops too. That won't work either, due to the leading wildcard.
I tried the gin_trgm_ops index too, but that search was super slow as well. This table has a cardinality of 1.14793103E10
I expect to have a query to be able to return the result in - say - 2-3 secs. My problem is that this is a very old database structure with a lot of rows. I would want to avoid restructuring the db for the same reason.
There will most likely be no guarantee of a 2-3 second response time. At least not as long as disk I/O is involved and you're not running on the latest SSDs (or even better: NVMe drives) with high IOPS and low latency. Enough RAM is a requirement here as well. Please consider this before deciding on an indexing strategy.
If neither your data nor the indexes fit into memory, you have to reduce the number of disk I/Os per query as much as possible, or at least let PostgreSQL use strategies that help mitigate random I/O (which is what bitmap index scans were built for).
Text search using LIKE in a contains-substring manner is not going to perform well on any big table.
An alternative (will only work if queries are searching for the same parts in file_path) could be (for your example searching for CTX_XXXX):
-- create a function to extract the specific file_path substring
CREATE OR REPLACE FUNCTION get_filename_part(file_path text, idx int)
RETURNS text
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT regexp_replace(file_path, '.*/(CT.{6}).*-(CT.{6}).*', E'\\' || idx);
$$;
-- create a helper function for querying...
CREATE OR REPLACE FUNCTION check_filename_parts(file_path text, c_value text)
RETURNS boolean
LANGUAGE SQL
IMMUTABLE
AS $$
SELECT get_filename_part(file_path, 1) = c_value OR get_filename_part(file_path, 2) = c_value;
$$;
-- create indexes...
CREATE INDEX idx_filename_ct_first ON text_search (get_filename_part(file_path, 1));
CREATE INDEX idx_filename_ct_second ON text_search (get_filename_part(file_path, 2));
...and use a query such as:
SELECT *
FROM text_search
WHERE check_filename_parts(file_path, 'CT1_1111');
Explained with test data
Please note that the following tests were made using 8-year-old consumer-grade hardware (but at least using an SSD).
Create test data (8,000,000 rows - pretty much random):
CREATE TABLE text_search (id serial PRIMARY KEY, file_path text);
INSERT INTO text_search (file_path)
SELECT '/directory1/123456/CT' || (random() * 8 + 1)::int || '_' || (random() * 8999 + 1000)::int || '_' || (random() * 899 + 100)::int || '_' || (random() * 899 + 100)::int || '-CT' || (random() * 8 + 1)::int || '_' || (random() * 8999 + 1000)::int || '_' || (random() * 899 + 100)::int || '_' || (random() * 899 + 100)::int || '-fail.xml'
FROM generate_series(1, 8000000);
--- and analyze...
ANALYZE text_search;
The SELECT query from above, explained (after a server restart):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on text_search (cost=5.49..409.93 rows=203 width=66) (actual time=0.092..0.882 rows=110 loops=1)
Recheck Cond: ((get_filename_part(file_path, 1) = 'CT1_1111'::text) OR (get_filename_part(file_path, 2) = 'CT1_1111'::text))
Heap Blocks: exact=110
Buffers: shared read=116
I/O Timings: read=0.576
-> BitmapOr (cost=5.49..5.49 rows=203 width=0) (actual time=0.071..0.072 rows=0 loops=1)
Buffers: shared read=6
I/O Timings: read=0.036
-> Bitmap Index Scan on idx_filename_ct_first (cost=0.00..2.70 rows=102 width=0) (actual time=0.038..0.038 rows=48 loops=1)
Index Cond: (get_filename_part(file_path, 1) = 'CT1_1111'::text)
Buffers: shared read=3
I/O Timings: read=0.017
-> Bitmap Index Scan on idx_filename_ct_second (cost=0.00..2.69 rows=101 width=0) (actual time=0.032..0.032 rows=62 loops=1)
Index Cond: (get_filename_part(file_path, 2) = 'CT1_1111'::text)
Buffers: shared read=3
I/O Timings: read=0.019
Planning Time: 4.996 ms
Execution Time: 0.922 ms
(18 rows)
Generic filter using gin_trgm_ops
...compared to a generic LIKE query using a gin_trgm_ops index (after 3 runs - data in cache):
-- create index...
CREATE INDEX idx_filename ON text_search USING gin (file_path gin_trgm_ops);
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM text_search
WHERE file_path LIKE '%CT1_1111%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on text_search (cost=94.70..1264.40 rows=800 width=66) (actual time=20.699..27.775 rows=110 loops=1)
Recheck Cond: (file_path ~~ '%CT1_1111%'::text)
Rows Removed by Index Recheck: 8207
Heap Blocks: exact=7978
Buffers: shared hit=8277
-> Bitmap Index Scan on idx_filename (cost=0.00..94.50 rows=800 width=0) (actual time=19.328..19.328 rows=8317 loops=1)
Index Cond: (file_path ~~ '%CT1_1111%'::text)
Buffers: shared hit=299
Planning Time: 0.722 ms
Execution Time: 27.912 ms
(10 rows)
TL;DR
If possible by any means, invest in a little infrastructure to get the best possible performance by using = comparisons internally. This will save a lot of I/O and CPU compared to any other approach. But also keep an eye on write performance degradation as the indexes grow. You might end up with a trade-off.
If your pattern should match the start of the file name (without the path), then an alternative would be to index the last element of the path which then can be searched for using a right anchored pattern, e.g. CT1_111%:
create index idx_last_element on your_table
(((string_to_array(file_path,'/'))[cardinality(string_to_array(file_path,'/'))]) text_pattern_ops);
You then need to use that expression in your SQL query:
select *
from your_table
where (string_to_array(file_path,'/'))[cardinality(string_to_array(file_path,'/'))] like 'CT1_111%';
This would use the above index.
You can simplify your query by wrapping that expression in a function:
create or replace function extract_file_name(p_path text)
returns text
as
$$
select elements[cardinality(elements)]
from (select string_to_array(p_path,'/') elements ) t;
$$
language sql
immutable;
And use that function to create the index:
create index idx_file_name on your_table( (extract_file_name(file_path)) text_pattern_ops);
Then use that function in the query:
select *
from your_table
where extract_file_name(file_path) like 'CT1_111%';
On my Windows laptop using Postgres 11 with 2 million rows, this results in the following execution plan:
Index Scan using last_element on public.file_paths (cost=0.43..2.69 rows=200 width=82) (actual time=0.193..0.437 rows=36 loops=1)
Output: id, created_at, path
Index Cond: ((extract_file_name(file_paths.path) ~>=~ 'CT1_111'::text) AND (extract_file_name(file_paths.path) ~<~ 'a504'::text))
Filter: (extract_file_name(file_paths.path) ~~ 'CT1_111%'::text)
Buffers: shared hit=36 read=3
I/O Timings: read=0.066
Planning Time: 0.918 ms
Execution Time: 0.459 ms
Trigram indexes are your only hope.
You used a GIN index and not a GiST index, right?
Make sure you enforce a minimum length for the search string, so that the result set stays limited and the search remains reasonably fast.
Say you have a table with some indices:
create table mail
(
identifier serial primary key,
member text,
read boolean
);
create index on mail(member);
create index on mail(read);
If you now query on multiple columns which have separate indices, will it ever use both indices?
select * from mail where member = 'Jess' and read = false;
That is, can PostgreSQL decide to first use the index on member to fetch all mails for Jess and then use the index on read to fetch all unread mails and then intersect both results to construct the output set?
I know you can have an index with multiple columns (on (member, read) in this case), but what happens if you have two separate indices? Will PostgreSQL pick just one or can it use both in some cases?
This is not a question about a specific query. It is a generic question to understand the internals.
Postgres Documentation about multiple query indexes
The documentation says the system creates an in-memory bitmap of where each index's conditions apply, then combines the results.
To combine multiple indexes, the system scans each needed index and
prepares a bitmap in memory giving the locations of table rows that
are reported as matching that index's conditions. The bitmaps are then
ANDed and ORed together as needed by the query. Finally, the actual
table rows are visited and returned.
CREATE TABLE fifteen
(one serial PRIMARY KEY
, three integer not NULL
, five integer not NULL
);
INSERT INTO fifteen(three, five)
SELECT gs % 33 + 5, gs % 55 + 11
FROM generate_series(1, 60000) gs;
CREATE INDEX ON fifteen(three);
CREATE INDEX ON fifteen(five);
ANALYZE fifteen;
EXPLAIN ANALYZE
SELECT *
FROM fifteen
WHERE three = 7
  AND five = 13;
Result:
CREATE TABLE
INSERT 0 60000
CREATE INDEX
CREATE INDEX
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fifteen (cost=19.24..51.58 rows=31 width=12) (actual time=0.391..0.761 rows=364 loops=1)
Recheck Cond: ((five = 13) AND (three = 7))
Heap Blocks: exact=324
-> BitmapAnd (cost=19.24..19.24 rows=31 width=0) (actual time=0.355..0.355 rows=0 loops=1)
-> Bitmap Index Scan on fifteen_five_idx (cost=0.00..7.15 rows=1050 width=0) (actual time=0.136..0.136 rows=1091 loops=1)
Index Cond: (five = 13)
-> Bitmap Index Scan on fifteen_three_idx (cost=0.00..11.93 rows=1788 width=0) (actual time=0.194..0.194 rows=1819 loops=1)
Index Cond: (three = 7)
Planning time: 0.259 ms
Execution time: 0.796 ms
(10 rows)
Changing {33,55} to {3,5} will yield an index scan over only one index, plus an additional filter condition (probably because the cost savings of combining both indexes would be too small).
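For comparison, the multicolumn index mentioned in the question would let the planner satisfy both conditions with a single ordinary index scan instead of a BitmapAnd of two separate indexes:

```sql
-- Sketch: one btree over both columns; the planner can then look up
-- (three = 7, five = 13) directly, with no bitmap combination step
CREATE INDEX ON fifteen (three, five);
```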