PostgreSQL 9.4: index not working in a pattern search

I have a table called "doctors" and a field called "fullname" which will store names with accents.
What I need to do is an "accent insensitive + case insensitive" search, something like:
SELECT *
FROM doctors
WHERE unaccent_t(fullname) ~* 'unaccented_and_lowercase_string';
where the value to search will come unaccented+lowercase and unaccent_t is a function defined as:
CREATE FUNCTION unaccent_t(text, lowercase boolean DEFAULT false)
RETURNS text AS
$BODY$
SELECT CASE
WHEN $2 THEN unaccent('unaccent', lower(trim($1)))
ELSE unaccent('unaccent', trim($1))
END;
$BODY$ LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;
(I already installed 'unaccent' extension).
So, I went ahead and created the index for "fullname" field:
CREATE INDEX doctors_fullname ON doctors (unaccent_t(fullname) text_pattern_ops);
(I also tried varchar_pattern_ops, and also without specifying an operator class at all.)
In the doctors table, I have around 15K rows.
The query works and I get the expected results, but when I add the explain analyze to the query, I don't see that the index is used:
Seq Scan on doctors (cost=0.00..4201.76 rows=5 width=395) (actual time=0.282..182.025 rows=15000 loops=1)
Filter: (unaccent_t((fullname)::text, false) ~* 'garcia'::text)
Rows Removed by Filter: 1
Planning time: 0.207 ms
Execution time: 183.387 ms
I also tried removing the optional parameter from unaccent_t but I got the same results.
In a scenario like this, how should I define the index so it gets used in a query like the one above?

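B-tree indexes can speed up pattern matching only when the pattern is left-anchored (for example 'garcia%' with LIKE, or '^garcia' with ~). The case-insensitive operators ILIKE and ~* additionally require the pattern to start with a non-alphabetic character, so the query above cannot use a B-tree index.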
Starting from PostgreSQL 9.3 you can speed up generic regular expression searches using a GIN or GiST index with the operator classes provided by the pg_trgm contrib module.
You can read more about it on the PostgreSQL manual at http://www.postgresql.org/docs/9.4/static/pgtrgm.html#AEN163078
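A minimal sketch of the trigram approach for the table in the question, assuming the pg_trgm extension can be installed and reusing the unaccent_t function defined above. The index name doctors_fullname_trgm is made up for illustration.

```sql
-- Assumption: pg_trgm is available on the server (it is one of the contrib modules).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name. The indexed expression must match the
-- expression used in the WHERE clause.
CREATE INDEX doctors_fullname_trgm
    ON doctors USING gin (unaccent_t(fullname) gin_trgm_ops);

-- Same predicate as in the question; the pattern is not anchored.
EXPLAIN ANALYZE
SELECT *
FROM doctors
WHERE unaccent_t(fullname) ~* 'garcia';
```

Trigram indexes tend to help most when the pattern contains at least three consecutive literal characters and matches few rows. In the plan shown in the question the filter kept 15000 rows and removed only 1, so the planner may still prefer a sequential scan for that particular test data.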

Related

PostgreSQL function having parameters is not using existing index

I have a PostgreSQL plpgsql function and noticed that it does not use an existing index on the MATERIALIZED VIEW (my_view), it is querying.
The special thing about this function is that it invokes another function (check_user), passing its parameter v_username. The result is a boolean, and depending on its value the function decides in the CASE construct whether the user gets all data from the given view (my_view) or a version joined with another table.
CREATE OR REPLACE function my_view_secured (v_username text)
RETURNS setof my_view
LANGUAGE plpgsql stable
AS
$function$
declare
v_show_all boolean := check_user(v_username);
begin
CASE
WHEN v_show_all then return query select * from my_view;
WHEN v_show_all = false then return query select st.* from my_view st join other_table st2 on st2.id = st.id;
end case;
end;
$function$
;
When executing both queries defined in the CASE-WHEN part directly, without the function, PostgreSQL uses an existing index and the query returns data quite fast (50ms).
When invoking this wrapper function (my_view_secured), I assume the index is not used because it takes about 10-20 seconds to return.
select * from my_view --takes some ms
vs
select * from my_view_secured('RETURNS_TRUE') -- takes 10-20 secs, although same underlying query as the one above
(Maybe it is my DBeaver settings which falsely gives me the impression, that query 1 takes just some ms)
I have read here (PostgreSQL is not using index when passed by param) that this happens because the function parameter prevents PostgreSQL from optimizing the query. My first attempt was to rewrite this PL/pgSQL function as a pure SQL function, but I struggle to find the correct logic for it.
From client perspective, I simply want to call the function with
SELECT .. FROM my_view_secured ('someusername')
and the function takes care of the data to be returned (either all or joined, depending on the return value of the check_user function call). The index itself is set on an id field which is used by the join.
Does anyone have an idea how to solve this issue?
Some additional information:
PGSQL-Version: 13.6
my_view (MV): 19 million rows (index on id col)
other_table: 60k rows (unique index on id col)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1;
SET auto_explain.log_nested_statements = ON;
SET client_min_messages TO log;
1) Query MV directly:
explain (analyze, verbose, buffers) select * from my_view
Seq Scan on my_view (cost=0.00..598633.52 rows=18902952 width=185) (actual time=0.807..15754.467 rows=18902952 loops=1)
Output: <removed>
Buffers: shared read=409604
Planning:
Buffers: shared hit=67 read=8
Planning Time: 2.870 ms
Execution Time: 16662.400 ms
2a) Query MV via wrapper function (which returns all data / no join):
explain (analyze, verbose, buffers) select * from my_view_secured ('some_username_returning_all_data')
Function Scan on my_view_secured (cost=0.25..10.25 rows=1000 width=3462) (actual time=9006.965..11887.518 rows=18902952 loops=1)
Output: <removed>
Function Call: my_view_secured('some_username_returning_to_all_data'::text)
Buffers: shared hit=174 read=409590, temp read=353030 written=353030
Planning Time: 0.052 ms
Execution Time: 13091.509 ms
2b) Query MV via wrapper function (which returns joined data):
explain (analyze, verbose, buffers) select * from my_view_secured ('some_username_triggering_join')
Function Scan on my_view_secured (cost=0.25..10.25 rows=1000 width=3462) (actual time=10183.590..11756.417 rows=8624367 loops=1)
Output: <removed>
Function Call: my_view_secured('some_username_triggering_join'::text)
Buffers: shared hit=126 read=409792, temp read=161138 written=161138
Planning Time: 0.050 ms
Execution Time: 12434.169 ms
I just recreated your scenario and I get index scans for the nested queries as expected. Postgres absolutely can use indexes here.
PL/pgSQL handles nested SQL DML statements like this: every statement reached by control is parsed, planned and executed. Since neither of the two SELECT statements involves any parameters, those plans are saved immediately and reused on repeated execution. Either way, if a plain select * from my_view; "uses indexes", exactly the same should be the case for the nested statement.
There must be something going on that is not reflected in your question.
A couple of notes:
You misunderstood the linked answer. Your case is different as neither query involves parameters to begin with.
About ...
does not use an existing index on the views (my_view), it is querying.
Maybe just phrased ambiguously, but to be clear: there are no indexes on views. Tables (incl. MATERIALIZED VIEWs) can have indexes. A VIEW is basically just a stored query with some added secondary settings attached to an empty staging table with rewrite rules. Underlying tables may be indexed.
How do you know the nested queries do "not use an existing index" to begin with? It's not trivial to inspect query plans for nested statements like that. See:
Postgres query plan of a function invocation written in plpgsql
It would seem you are barking up the wrong tree.
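If a pure SQL function is still wanted, as the question mentions, one possible shape is sketched below. This is not what the asker ran. The name my_view_secured_sql is made up, the sketch assumes check_user is declared STABLE, and it behaves differently when check_user returns NULL (the PL/pgSQL CASE without ELSE raises an error in that case, while this version returns no rows).

```sql
-- Hypothetical SQL-language variant of my_view_secured.
-- Assumption: check_user(text) returns boolean and is declared STABLE.
CREATE OR REPLACE FUNCTION my_view_secured_sql(v_username text)
RETURNS SETOF my_view
LANGUAGE sql STABLE
AS
$function$
    -- Branch 1: user may see everything.
    SELECT mv.*
    FROM my_view mv
    WHERE check_user(v_username)
    UNION ALL
    -- Branch 2: user only sees rows that also exist in other_table.
    SELECT st.*
    FROM my_view st
    JOIN other_table st2 ON st2.id = st.id
    WHERE NOT check_user(v_username);
$function$;

-- Called the same way as the original wrapper:
SELECT * FROM my_view_secured_sql('someusername');
```

The Function Scan plans in the question show temp read/written buffers, which suggests the PL/pgSQL function collects its whole result before returning it. A simple SQL function may instead be inlined into the calling query by the planner, which can avoid that step; whether this happens here would need to be checked with EXPLAIN.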

Why does postgres use index scan over sequential scan even with a mismatching data type on the indexed column and query condition

I have the following PostgreSQL table:
CREATE TABLE staff (
id integer primary key,
full_name VARCHAR(100) NOT NULL,
department VARCHAR(100) NULL,
tier bigint
);
Filled random data into this table using following block:
do $$
declare
begin
FOR counter IN 1 .. 100000 LOOP
INSERT INTO staff (id, full_name, department, tier)
VALUES (nextval('staff_sequence'),
random_string(10),
get_department(),
floor(random() * 5 + 1)::bigint);
end LOOP;
end; $$
After the data is populated, I created an index on this table on the tier column:
create index staff_tier_idx on staff(tier);
Although I created this index, when I execute a query using this column, I want this index NOT to be used. To accomplish this, I tried to execute this query:
select count(*) from staff where tier=1::numeric;
Due to mismatching data types on the indexed column and the query condition, I thought the index will not be used & instead a sequential scan will be executed. However, when I run EXPLAIN ANALYZE on the above query I get the following output:
Aggregate (cost=2349.54..2349.55 rows=1 width=8) (actual time=17.078..17.079 rows=1 loops=1)
-> Index Only Scan using staff_tier_idx on staff (cost=0.29..2348.29 rows=500 width=0) (actual time=0.022..15.925 rows=19942 loops=1)
Filter: ((tier)::numeric = '1'::numeric)
Rows Removed by Filter: 80058
Heap Fetches: 0
Planning Time: 0.305 ms
Execution Time: 17.130 ms
Showing that the index has indeed been used.
How do I change this so that the query uses a sequential scan instead of the index? This is purely for a testing/learning purposes.
If it's of any importance, I am running this on an Amazon RDS database instance.
From the "Filter" rows of the plan like
Rows Removed by Filter: 80058
you can see that the index is not being used as a real index, but just as a skinny table, testing the cast condition for each row. This appears favorable because the index is less than 1/4 the size of the table, while the default ratio of random_page_cost/seq_page_cost = 4.
In addition to just outright disabling index scans as Adrian already suggested, you could also discourage this "skinny table" usage by just increasing random_page_cost, since pages of indexes are assumed to be read in random order.
Another method would be to change the query so it can't use the index-only scan. For example, just using count(full_name) would do that, as PostgreSQL then needs to visit the table to make sure full_name is not NULL (even though it has a constraint asserting that already--sometimes it is not very clever)
Which method is better depends on what it is you are wanting to test/learn.
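The three approaches above could be tried roughly as follows. This is a sketch against the staff table from the question; the value 40 for random_page_cost is arbitrary, and SET LOCAL keeps each change confined to its transaction.

```sql
-- Option 1: discourage index-based plans for this transaction only.
BEGIN;
SET LOCAL enable_indexonlyscan = off;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;
EXPLAIN ANALYZE SELECT count(*) FROM staff WHERE tier = 1::numeric;
ROLLBACK;

-- Option 2: make random page reads look more expensive to the planner.
-- 40 is an arbitrary illustrative value (the default is 4).
BEGIN;
SET LOCAL random_page_cost = 40;
EXPLAIN ANALYZE SELECT count(*) FROM staff WHERE tier = 1::numeric;
ROLLBACK;

-- Option 3: reference a column that is not stored in staff_tier_idx,
-- so an index-only scan is no longer possible.
EXPLAIN ANALYZE SELECT count(full_name) FROM staff WHERE tier = 1::numeric;
```

The enable_* settings do not forbid a plan type outright; they only make it look very expensive, so comparing the EXPLAIN output before and after is the way to confirm the effect.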

PostgreSQL doesn't use index with unaccent function

I have the following table:
CREATE TABLE products (
id bigserial NOT NULL PRIMARY KEY,
name varchar(2048)
-- Many other columns
);
I want to make a case and diacritics insensitive LIKE query on name.
For that I have created the following function :
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE OR REPLACE FUNCTION immutable_unaccent(varchar)
RETURNS text AS $$
SELECT unaccent($1)
$$ LANGUAGE sql IMMUTABLE;
And then created an index on name using this function:
CREATE INDEX products_search_name_key ON products(immutable_unaccent(name));
However, when I make a query, the query is very slow (about 2.5s for 300k rows). I'm pretty sure PostgreSQL is not using the index
-- Slow (~2.5s for 300k rows)
SELECT products.* FROM products
WHERE immutable_unaccent(products.name) LIKE immutable_unaccent('%Hello world%')
-- Fast (~60ms for 300k rows), and there is no index
SELECT products.* FROM products
WHERE products.name LIKE '%Hello world%'
I have tried creating a separate column with a case and diacritics insensitive copy of the name like so, and in that case the query is fast:
ALTER TABLE products ADD search_name varchar(2048);
UPDATE products
SET search_name = immutable_unaccent(name);
-- Fast (~60ms for 300k rows), and there is no index
SELECT products.* FROM products
WHERE products.search_name LIKE immutable_unaccent('%Hello world%')
What am I doing wrong ? Why doesn't my index approach work ?
Edit: Execution plan for the slow query
explain analyze SELECT products.* FROM products
WHERE immutable_unaccent(products.name) LIKE immutable_unaccent('%Hello world%')
Seq Scan on products (cost=0.00..79568.32 rows=28 width=2020) (actual time=1896.131..1896.131 rows=0 loops=1)
Filter: (immutable_unaccent(name) ~~ '%Hello world%'::text)
Rows Removed by Filter: 277986
Planning time: 1.014 ms
Execution time: 1896.220 ms
If you're wanting to do a like '%hello world%' type query, you must find another way to index it.
(You may first have to install the pg_trgm contrib module. To do so, log in as a database superuser and issue the following commands. Only pg_trgm is needed for the index below; fuzzystrmatch is optional.)
Prerequisite:
CREATE EXTENSION pg_trgm;
CREATE EXTENSION fuzzystrmatch;
Try the following:
create index on products using gist (immutable_unaccent(name) gist_trgm_ops);
It should use an index with your query at that point.
select * from products
where immutable_unaccent(name) like '%Hello world%';
Note: this index could get large, depending on how long the names actually are (the column allows up to 2048 characters).
You could also use full text search, but that's a lot more complicated.
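What the above does is index the "trigrams" of the name, i.e., each run of three consecutive characters within a word. pg_trgm splits the text into words and pads each word with two spaces in front and one behind, so for a product called "hello world" it would index "  h", " he", hel, ell, llo, "lo ", "  w", " wo", wor, orl, rld and "ld ".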
Then it can use that index against your search term in a more efficient way. You can use either a gist or a gin index type if you like.
Basically
GiST will be slightly slower to query, but faster to update.
GIN is the opposite.
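For comparison, a sketch of the GIN variant of the same index, together with show_trgm, a pg_trgm function that lists the trigrams extracted from a string. The index name products_name_trgm_gin is made up; the table and the immutable_unaccent function are the ones from the question.

```sql
-- Inspect which trigrams pg_trgm extracts from a string.
SELECT show_trgm('hello world');

-- GIN alternative to the GiST index (hypothetical index name).
CREATE INDEX products_name_trgm_gin
    ON products USING gin (immutable_unaccent(name) gin_trgm_ops);

-- The asker's original query; the indexed expression appears
-- unchanged on the left side of LIKE.
EXPLAIN ANALYZE
SELECT products.*
FROM products
WHERE immutable_unaccent(products.name) LIKE immutable_unaccent('%Hello world%');
```

Both operator classes also support ILIKE, which may be useful here since the question asks for a case-insensitive match and the function only removes accents.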

Suggest appropriate index on my database

I have a table product
Product(id BIGINT,
... Some more columns here
expired DATE);
I want to create an index on the expired field for faster retrieval.
Most of the time my WHERE clause is
...
WHERE (expired IS NULL OR expired > now());
Please can you suggest which index is more suitable for me.
When I execute explain analyze for the above query
EXPLAIN ANALYZE
SELECT 1
FROM product
WHERE (expired IS NULL) OR (expired > now());
it gave me the following result, in which the index I created is not used.
Seq Scan on product (cost=0.00..190711.22 rows=5711449 width=0) (actual time=0.009..8653.380 rows=7163105 loops=1)
Filter: ((expired IS NULL) OR (expired > now()))
Rows Removed by Filter: 43043
Planning time: 0.117 ms
Execution time: 15679.478 ms
(5 rows)
I guess that is because of the "OR" condition. I tried to create a function-based index, but it gave me the following error:
ERROR: functions in index expression must be marked IMMUTABLE
Is there an alternative way to do this?
You should change the WHERE clause to look like this:
... WHERE COALESCE(expired, DATE 'infinity') > current_date;
This is equivalent to your query, but now you can use the following index:
CREATE INDEX ON product (COALESCE(expired, DATE 'infinity'));
Probably the default B-tree index is the most appropriate one for you; hash indexes only handle "equals" comparisons, and the GiST and GIN indexes are for more complex data types than what you are using:
https://www.postgresql.org/docs/current/static/indexes-types.html
In fact the B-tree is the default, so all you need to do is something like:
CREATE INDEX product_expired_idx ON product (expired);
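Whichever index is chosen, it may help to check how selective the predicate is first. In the plan from the question the filter kept 7163105 rows and removed only 43043, and for a predicate that matches nearly every row a sequential scan is usually the cheapest plan regardless of the index. A sketch using the table and column from the question (the output column names are made up, and the FILTER clause needs PostgreSQL 9.4 or later):

```sql
-- Compare the number of rows matching the predicate with the table size.
SELECT count(*) FILTER (WHERE expired IS NULL OR expired > now()) AS not_expired,
       count(*) AS total
FROM product;
```

If not_expired is close to total, an index on expired is more likely to pay off for the opposite predicate, i.e. for queries that look for the few rows that have already expired.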

Can PostgreSQL index array columns?

I can't find a definite answer to this question in the documentation. If a column is an array type, will all the entered values be individually indexed?
I created a simple table with one int[] column, and put a unique index on it. I noticed that I couldn't add the same array of ints, which leads me to believe the index is a composite of the array items, not an index of each item.
INSERT INTO "Test"."Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test"."Test" VALUES ('{10, 20, 30}');
SELECT * FROM "Test"."Test" WHERE 20 = ANY ("Column1");
Is the index helping this query?
Yes you can index an array, but you have to use the array operators and the GIN-index type.
Example:
CREATE TABLE "Test"("Column1" int[]);
INSERT INTO "Test" VALUES ('{10, 15, 20}');
INSERT INTO "Test" VALUES ('{10, 20, 30}');
CREATE INDEX idx_test on "Test" USING GIN ("Column1");
-- To enforce index usage because we have only 2 records for this test...
SET enable_seqscan TO off;
EXPLAIN ANALYZE
SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];
Result:
Bitmap Heap Scan on "Test" (cost=4.26..8.27 rows=1 width=32) (actual time=0.014..0.015 rows=2 loops=1)
Recheck Cond: ("Column1" @> '{20}'::integer[])
-> Bitmap Index Scan on idx_test (cost=0.00..4.26 rows=1 width=0) (actual time=0.009..0.009 rows=2 loops=1)
Index Cond: ("Column1" @> '{20}'::integer[])
Total runtime: 0.062 ms
Note
it appears that in many cases the gin__int_ops option is required
create index <index_name> on <table_name> using GIN (<column> gin__int_ops)
I have not yet seen a case where it would work with the && and @> operators without the gin__int_ops option
@Tregoreg raised a question in the comment to his offered bounty:
I didn't find the current answers working. Using GIN index on
array-typed column does not increase the performance of ANY()
operator. Is there really no solution?
@Frank's accepted answer tells you to use array operators, which is still correct for Postgres 11. The manual:
... the standard distribution of PostgreSQL includes a GIN operator
class for arrays, which supports indexed queries using these
operators:
<@
@>
=
&&
The complete list of built-in operator classes for GIN indexes in the standard distribution is here.
In Postgres indexes are bound to operators (which are implemented for certain types), not data types alone or functions or anything else. That's a heritage from the original Berkeley design of Postgres and very hard to change now. And it's generally working just fine. Here is a thread on pgsql-bugs with Tom Lane commenting on this.
Some PostGIS functions (like ST_DWithin()) seem to violate this principle, but that is not so. Those functions are rewritten internally to use respective operators.
The indexed expression must be to the left of the operator. For most operators (including all of the above) the query planner can achieve this by flipping operands if you place the indexed expression to the right - given that a COMMUTATOR has been defined. The ANY construct can be used in combination with various operators and is not an operator itself. When used as constant = ANY (array_expression) only indexes supporting the = operator on array elements would qualify and we would need a commutator for = ANY(). GIN indexes are out.
Postgres is not currently smart enough to derive a GIN-indexable expression from it. For starters, constant = ANY (array_expression) is not completely equivalent to array_expression @> ARRAY[constant]. Array operators return an error if any NULL elements are involved, while the ANY construct can deal with NULL on either side. And there are different results for data type mismatches.
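The difference between the two forms can be seen by comparing their plans. A sketch reusing the "Test" table and the idx_test GIN index from the earlier answer; the exact plan output depends on version and data, so none is shown here.

```sql
-- Only needed because the test table is tiny.
SET enable_seqscan TO off;

-- Array containment operator: supported by the GIN array operator class,
-- so idx_test is a candidate for this query.
EXPLAIN
SELECT * FROM "Test" WHERE "Column1" @> ARRAY[20];

-- ANY construct: not an array operator, so the GIN index does not apply
-- and a sequential scan is expected even with enable_seqscan off.
EXPLAIN
SELECT * FROM "Test" WHERE 20 = ANY ("Column1");

RESET enable_seqscan;
```

When rewriting a query from = ANY to @>, keep in mind the caveats above about NULL elements and data type mismatches, since the two forms are not fully interchangeable.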
Related answers:
Check if value exists in Postgres array
Index for finding an element in a JSON array
SQLAlchemy: how to filter on PgArray column types?
Can IS DISTINCT FROM be combined with ANY or ALL somehow?
Asides
While working with integer arrays (int4, not int2 or int8) without NULL values (like your example implies) consider the additional module intarray, that provides specialized, faster operators and index support. See:
How to create an index for elements of an array in PostgreSQL?
Compare arrays for equality, ignoring order of elements
As for the UNIQUE constraint in your question that went unanswered: That's implemented with a btree index on the whole array value (like you suspected) and does not help with the search for elements at all. Details:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
It's now possible to index the individual array elements. For example:
CREATE TABLE test (foo int[]);
INSERT INTO test VALUES ('{1,2,3}');
INSERT INTO test VALUES ('{4,5,6}');
CREATE INDEX test_index on test ((foo[1]));
SET enable_seqscan TO off;
EXPLAIN ANALYZE SELECT * from test WHERE foo[1]=1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using test_index on test (cost=0.00..8.27 rows=1 width=32) (actual time=0.070..0.071 rows=1 loops=1)
Index Cond: (foo[1] = 1)
Total runtime: 0.112 ms
(3 rows)
This works on at least Postgres 9.2.1. Note that you need to build a separate index for each array index, in my example I only indexed the first element.