Postgres similarity function not appropriately using trigram index - postgresql

I have a simple person table with a last_name column that I've added a GIST index with
CREATE INDEX last_name_idx ON person USING gist (last_name gist_trgm_ops);
According to the docs at https://www.postgresql.org/docs/10/pgtrgm.html, the <-> operator should utilize this index. However, when I actually try to use this difference operator using this query:
explain verbose select * from person where last_name <-> 'foobar' > 0.5;
I get this back:
Seq Scan on public.person (cost=0.00..290.82 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)
And it doesn't look like the index is being used. However, if I use the % operator with this command:
explain verbose select * from person where last_name % 'foobar';
It seems to use the index:
Bitmap Heap Scan on public.person (cost=4.25..41.51 rows=13 width=233)
Output: person_id, first_name, last_name
Recheck Cond: (person.last_name % 'foobar'::text)
-> Bitmap Index Scan on last_name_idx (cost=0.00..4.25 rows=13 width=0)
Index Cond: (person.last_name % 'foobar'::text)
I also noticed that if I move the operator to the select portion of the query, the index gets ignored again:
explain verbose select last_name % 'foobar' from person;
Seq Scan on public.person (cost=0.00..257.19 rows=13455 width=1)
Output: (last_name % 'foobar'::text)
Am I missing something obvious about how the similarity function uses the trigram index?
I am using Postgres 10.5 on OSX.
EDIT 1
As per Laurenz's suggestion, I tried setting enable_seqscan = off but unfortunately, the query with the <-> operator still seems to ignore the index.
show enable_seqscan;
enable_seqscan
----------------
off
explain verbose select * from person where last_name <-> 'foobar' < 0.5;
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.person (cost=10000000000.00..10000000290.83 rows=4485 width=233)
Output: person_id, first_name, last_name
Filter: ((person.last_name <-> 'foobar'::text) < '0.5'::double precision)

This behavior is normal for all kinds of indexes.
The first query is not in a form that can use the index. For that, a condition would have to be of the form
<indexed expression> <operator supported by the index> <quasi-constant>
where the last expressions remains constant for the duration of the index scan and the operator returns a boolean value. Your expression ´last_name <-> 'foobar' > 0.5` is not of that form.
The <-> operator has to be used in an ORDER BY clause to be able to use the index.
The third query doesn't use the index because the query affects all rows of the table. An index does not speed up the evaluation of an expression, it is only useful to quickly identify a subset of the table (or to get rows in a certain sort order).

Related

Is this the right way to create a partial index in Postgres?

We have a table with 4 million records, and for a particular frequently used use-case we are only interested in records with a particular salesforce userType of 'Standard' which are only about 10,000 out of 4 million. The other usertype's that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a usertype of 'Standard' and prevent the request from the web from getting timed out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would of great help.
Postgres can use that but it does so in a way that is (slightly) less efficient than an index specifying where user_type = 'Standard'.
I created a small test table with 4 million rows, 10.000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
id serial primary key,
some_date date not null,
user_type text not null,
some_ts timestamp not null,
some_number integer not null,
some_data text,
some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
case (random() * 4 + 1)::int
when 1 then 'PowerPartner'
when 2 then 'CSPLitePortal'
when 3 then 'CustomerSuccess'
when 4 then 'PowerCustomerSuccess'
when 5 then 'CsnOnly'
end,
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,4e6 - 10000) as t(i)
union all
select current_date,
'Standard',
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,10000) as t(i);
(I create tables that have more than just a few columns as the planner's choices are also driven by the size and width of the tables)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see the planner uses a Bitmap Index Scan which is a "lossy" scan that needs an extra step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times, and the average for the ix_standard index was slightly below 2ms and the average of the ix_not_in index was slightly above 2ms - so not a real performance difference.
But in general the Index Scan will scale better with increasing table sizes than the Bitmap Index Scan will do. This is basically due to the "Recheck Condition" - especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
For the index to be used, the WHERE condition must be used in the query as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.

Postgres: Sorting by an immutable function index doesn't use index

I have a simple table.
CREATE TABLE posts
(
id uuid NOT NULL,
vote_up_count integer,
vote_down_count integer,
CONSTRAINT post_pkey PRIMARY KEY(id)
);
I have an IMMUTABLE function that does simple (but could be complex) arithmetic.
CREATE OR REPLACE FUNCTION score(
ups integer,
downs integer)
RETURNS integer AS
$BODY$
select $1 - $2
$BODY$
LANGUAGE sql IMMUTABLE
COST 100;
ALTER FUNCTION score(integer, integer)
OWNER TO postgres;
I create an index on the posts table that uses my function.
CREATE INDEX posts_score_index ON posts(score(vote_up_count, vote_down_count), date_created);
When I EXPLAIN the following query, it doesn't seem to be using the index.
SELECT * FROM posts ORDER BY score(vote_up_count, vote_down_count), date_created
Sort (cost=1.02..1.03 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_of_com (...)"
Sort Key: ((posts.vote_up_count - posts.vote_down_count)), posts.date_created
-> Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
Output: id, date_created, last_edit_date, slug, sub_id, user_id, user_ip, type, title, content, url, domain, send_replies, vote_up_count, vote_down_count, verdict, approved_by, removed_by, verdict_message, number_of_reports, ignore_reports, number_ (...)
How do I get my ORDER BY to use an index from an IMMUTABLE function that could have some very complex arithmetic?
EDIT: Based off of #Егор-Рогов's suggestions, I change the query a bit to see if I can get it to use an index. Still no luck.
set enable_seqscan=off;
EXPLAIN VERBOSE select date_created from posts ORDER BY (hot(vote_up_count, vote_down_count, date_created),date_created);
Here is the output.
Sort (cost=10000000001.06..10000000001.06 rows=1 width=16)
Output: date_created, (ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision) / 4 (...)
Sort Key: (ROW(round((((log((GREATEST(abs((posts.vote_up_count - posts.vote_down_count)), 1))::double precision) * sign(((posts.vote_up_count - posts.vote_down_count))::double precision)) + ((date_part('epoch'::text, posts.date_created) - 1134028003::dou (...)
-> Seq Scan on public.posts (cost=10000000000.00..10000000001.05 rows=1 width=16)
Output: date_created, ROW(round((((log((GREATEST(abs((vote_up_count - vote_down_count)), 1))::double precision) * sign(((vote_up_count - vote_down_count))::double precision)) + ((date_part('epoch'::text, date_created) - 1134028003::double precision (...)
EDIT2: It seems that I was not using the index because of a second order by with date_created.
I can see a couple of points that discourages the planner from using the index.
1.
Look at this line in the explain output:
Seq Scan on public.posts (cost=0.00..1.01 rows=1 width=310)
It says that the planner believes there is only one row in the table. In this case it makes no sense to use index scan, for sequential scan is faster.
Try to add more rows to the table, do analyze and try again. You can also test it by temporarily disabling sequential scans by set enable_seqscan=off;.
2.
You use your the function to sort the results. So the planner may decide to use the index in order to get tuple ids in the correct order. But then it needs to fetch each tuple from the table to get values of all columns (because of select *).
You can make the index more attractive to the planner by adding all necessary columns to it, which make possible to avoid table scan. This is called index-only scan.
CREATE INDEX posts_score_index ON posts(
score(vote_up_count, vote_down_count),
date_created,
id, -- do you actually need it in result set?
vote_up_count, -- do you actually need it in result set?
vote_down_count -- do you actually need it in result set?
);
And make sure you run vacuum after inserting/updating/deleting rows to update the visibility map.
The downside is the increased index size, of course.

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2sec which is about what I would expect. The SELECT * ... returns 0 results in 37.75sec. That's a huge difference for the same result and I don't understand why?
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non count much slower? Surely Redshift knows the row doesn't exist?
The reason is: in many RDBMS's the answer on count(*) question usually come without actual data scan: just from index or table statistics. Redshift stores minimal and maximal value for a block that used to give exist or not exists answers for example like in describer case. In case requested value inside of min/max block boundaries the scan will be performed only on filtering fields data. In case requested value is lower or upper block boundaries the answer will be given much faster on basis of the stored statistics. In case of "select * " question RedShift actually scans all columns data as asked in query: "*" but filter only by columns in "where " clause.

PostgreSQL daterange not using index correctly

I have a simple table which has a user_birthday field with a type of date (which can be
NULL value)
CREATE TABLE users
(
user_id bigserial NOT NULL,
user_email text NOT NULL,
user_password text,
user_first_name text NOT NULL,
user_middle_name text,
user_last_name text NOT NULL,
user_birthday date,
CONSTRAINT pk_users PRIMARY KEY (user_id)
)
There's an index (btree) defined on that field, with the rule of NOT
user_birthday IS NULL.
CREATE INDEX ix_users_birthday
ON users
USING btree
(user_birthday)
WHERE NOT user_birthday IS NULL;
Trying to follow up on another idea, I've added the extension btree_gist and created the following index:
CREATE INDEX ix_users_birthday_gist
ON glances.users
USING gist
(user_birthday)
WHERE NOT user_birthday IS NULL;
But it had no affect either, as from what I could read it is not used for range checking.
The PostgreSQL version is 9.3.4.0 (22) Postgres.app
and issue also exists in 9.3.3.0 (21) Postgres.app
I've been intrigued by the following queries:
Query #1:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01)')
Query #2:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
which, at first glance both should have the same execution plan, but for some
reason, here are the results:
Query #1:
"Seq Scan on users (cost=0.00..52314.25 rows=11101 width=241) (actual
time=0.014..478.983 rows=208886 loops=1)"
" Filter: (user_birthday <# '[1978-07-15,1983-03-01)'::daterange)"
" Rows Removed by Filter: 901214"
"Total runtime: 489.584 ms"
Query #2:
"Bitmap Heap Scan on users (cost=4468.01..46060.53 rows=210301 width=241)
(actual time=57.104..489.785 rows=209019 loops=1)"
" Recheck Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday
<= '1983-03-01'::date))"
" Rows Removed by Index Recheck: 611375"
" -> Bitmap Index Scan on ix_users_birthday (cost=0.00..4415.44
rows=210301 width=0) (actual time=54.621..54.621 rows=209019 loops=1)"
" Index Cond: ((user_birthday >= '1978-07-15'::date) AND
(user_birthday <= '1983-03-01'::date))"
"Total runtime: 500.983 ms"
As you can see, the <# daterange is not utilizing the existing index, while
BETWEEN does.
Important to note that the actual use case for this rule is in a more complex query,
which doesn't result in the Recheck Cond and Bitmap Heap scan.
In the application complex query, the difference between the two methods (with 1.2 million records) is massive:
Query #1 at 415ms
Query #2 at 84ms.
Is this a bug with daterange?
Am I doing something wrong? or datarange <# is performing as designed?
There's also a discussion in the pgsql-bugs mailing list
BETWEEN includes upper and lower border. Your condition
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
matches
WHERE user_birthday <# daterange('[1978-07-15,1983-03-01]')
I see you mention a btree index. For that use simple comparison operators.
Detailed manual page on which index is good for which operators.
The range type operators <# or #> would work with GiST indexes.
Example:
Perform this hours of operation query in PostgreSQL

Postgresql - full-text ordering of sorted out set

I have a Postgresql table with something like 200k tuples, so not that much.
What I try to do is filter out some rows and then order them using full-text matching:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND regexp_replace(d.description, '(...)', '') !~* '...'
AND regexp_replace(d.description, '...', '') !~* '...'
AND d.id != ?
ORDER BY ts_rank_cd(to_tsvector('english', name), plainto_tsquery('english', 'my search words')) DESC LIMIT 5 OFFSET 0';
There is a GIN index on description field.
Now this query works well only when there is less then 4000 or so records in the category. When its more like 5k or 6k then the query gets extremely slow.
I was trying different variations of this query. What I noticed is when I remove either WHERE clause or ORDER BY clause then I get big speed up. (Of course then I get irrelevant results)
What can I do to speedup this combination? Any way of optimizing or should I look for a solution outside Postgresql?
Additional question:
I'm experimenting further and for example this is the simplest query that I think runs too slow. Can I tell from explain analyze when it uses gist index and when doesn't?
SELECT d.*, d.description <-> 'banana' as dist FROM descriptions as d ORDER BY dist DESC LIMIT 5
"Limit (cost=16046.88..16046.89 rows=5 width=2425) (actual time=998.811..998.813 rows=5 loops=1)"
" -> Sort (cost=16046.88..16561.90 rows=206010 width=2425) (actual time=998.810..998.810 rows=5 loops=1)"
" Sort Key: (((description)::text <-> 'banana'::text))"
" Sort Method: top-N heapsort Memory: 27kB"
" -> Seq Scan on products d (cost=0.00..12625.12 rows=206010 width=2425) (actual time=0.033..901.260 rows=206010 loops=1)"
"Total runtime: 998.866 ms"`
Answered (kgrittn): DESC keyword is not correct for KNN-GiST and it's actually not wanted here. Removing it fixes the problem and gives right results.
An output of explain analyze of your query would be helpful. But I guess that this regexp_replace lines are your problem. Postgres planner just cannot know how many rows will match this two lines, so it is guessing and planning a query based on this flawed quess.
I'd recommend to create a function like this:
create function good_description(text) returns boolean as $$
select
regexp_replace($1, '(...)', '') !~* '...'
and
regexp_replace($1, '...', '') !~* '...'
$$ language sql immutable strict;
And creating a partial index on expression using this function:
create index descriptions_good_description_idx
on good_description(description)
where description != '';
And then querying in a way that allows Postgres to use this index:
SELECT * FROM descriptions as d
WHERE d.category_id = ?
AND d.description != ''
AND good_description(d.description)
AND d.id != ?
ORDER BY ts_rank_cd(
to_tsvector('english', name),
plainto_tsquery('english', 'my search words')
) DESC
LIMIT 5 OFFSET 0;
For this type of application, we have been moving from the tsearch feature to the trigram feature; when you want to pick a small number of best matches, it is much faster. People here often prefer the semantics of the trigram similarity matching over the text-search ranking, anyway.
http://www.postgresql.org/docs/current/interactive/pgtrgm.html
"Borrowing" the later query from the edited question, formatting it, and including the index creation statement, to make the answer self-contained without a raft of comments:
CREATE INDEX descriptions_description_trgm
ON descriptions
USING gist (description gist_trgm_ops);
SELECT d.*, d.description <-> 'banana' as dist
FROM descriptions as d
ORDER BY dist LIMIT 5;
This should return rows from the GiST index in "distance" sequence until it hits the LIMIT.