PostgreSQL accent + case insensitive search

I'm looking for a way to support case insensitive + accent insensitive search with good performance. Until now we had no issue with this on MS SQL Server; on Oracle we had to use Oracle Text, and now we need it on PostgreSQL.
I've found this post about it, but we need to combine it with case insensitive search. We also need to use indexes, otherwise performance could suffer.
Any real experience with the best approach for large databases?

If you need to "combine with case insensitive", there are a number of options, depending on your exact requirements.
Maybe simplest: make the expression index case insensitive.
Building on the function f_unaccent() laid out in the referenced answer:
Does PostgreSQL support "accent insensitive" collations?
CREATE INDEX users_lower_unaccent_name_idx ON users(lower(f_unaccent(name)));
Then:
SELECT *
FROM users
WHERE lower(f_unaccent(name)) = lower(f_unaccent('João'));
Or you could build the lower() into the function f_unaccent(), to derive something like f_lower_unaccent().
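A minimal sketch of such a combined function, assuming f_unaccent() wraps unaccent() with a fixed dictionary as in the linked answer (declaring it IMMUTABLE, which the index requires, follows the same trick; PARALLEL SAFE needs Postgres 9.6+):
-- Sketch: fold case and accents in one indexable wrapper function
CREATE OR REPLACE FUNCTION f_lower_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT lower(unaccent('unaccent', $1))
$func$;
CREATE INDEX users_f_lower_unaccent_name_idx ON users (f_lower_unaccent(name));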
Or (especially if you need to do fuzzy pattern matching anyway) you can use a trigram index provided by the additional module pg_trgm, building on the above function, which also supports ILIKE. Details:
LOWER LIKE vs iLIKE
I added a note to the referenced answer.
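For illustration, a sketch of that trigram variant (the index name is a placeholder; assumes pg_trgm and the f_unaccent() function from above):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Trigram index over the unaccented name; ILIKE folds case at query time
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
-- Matches regardless of case and accents, and can use the index:
SELECT *
FROM users
WHERE f_unaccent(name) ILIKE ('%' || f_unaccent('João') || '%');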
Or you could use the additional module citext (but I'd rather avoid it):
Deferrable, case-insensitive unique constraint

Full-Text Search dictionary that unaccents, case-insensitively
FTS is case-insensitive by default:
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English).
And you can define your own dictionary using unaccent,
CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, simple;
Which you can then index with a functional index,
-- Just some sample data...
CREATE TABLE myTable ( myCol )
AS VALUES ('fóó bar baz'),('qux quz');
-- An index is not required for this to work, but it speeds up searches
CREATE INDEX ON myTable
USING GIST (to_tsvector('mydict', myCol));
You can now query it very simply
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar';
mycol
-------------
fóó bar baz
(1 row)
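To show that the configuration folds both case and accents, the same query written with an explicit to_tsquery() and a differently cased, accented search term matches too (a sketch building on the sample table above):
-- Same row matches despite different case and accents in the query
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ to_tsquery('mydict', 'FÓÓ & BAR');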
See also
Creating a case-insensitive and accent/diacritics insensitive search on a field

Related

Fuzzy finding through a database with Prisma

I am trying to build a storage manager where users can store their lab samples/data. Unfortunately, this means that the tables will end up being quite dynamic, as each sample might have different data associated with it. I will still require users to define a schema, so I can display the data properly, however, I think this schema will have to be represented as a JSON field in the underlying database.
I was wondering: in Prisma, is there a way to fuzzy search through collections? Could I type something like help and then return all rows that match this expression ANYWHERE in their columns (including the JSON fields)? Could I do something like this at all with PostgreSQL? Or with MongoDB?
thank you
You can easily do that with jsonb in PostgreSQL.
If you have a table defined like
CREATE TABLE userdata (
    id bigint PRIMARY KEY,
    important_col1 text,
    important_col2 integer,
    other_cols jsonb
);
You can create an index like this
CREATE INDEX ON userdata USING gin (other_cols);
and search efficiently with
SELECT id FROM userdata WHERE other_cols @> '{"attribute": "value"}';
Here, @> is the JSON containment operator in PostgreSQL.
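A quick usage sketch with made-up data (the values are invented):
INSERT INTO userdata
VALUES (1, 'sample A', 42, '{"attribute": "value", "note": "help wanted"}');
-- Containment matches on any top-level key/value pair, using the GIN index
SELECT id FROM userdata WHERE other_cols @> '{"note": "help wanted"}';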
Yes, you can certainly do this in PostgreSQL. It's quite straightforward. Here is an example.
Let your table be called the_table, aliased as tht. Cast an entire table row to text with tht::text and use the case insensitive regular expression match operator ~* to find rows that contain help anywhere in that text. You can use more elaborate and powerful regular expressions for searching, too.
Please note that since the ~* operator will defeat any index here, this query will result in a sequential scan.
select * -- or whatever list of expressions you need
from the_table as tht
where tht::text ~* 'help';
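If you need an index for this kind of catch-all search, one workaround is a trigram index over a concatenation of the searchable columns (a sketch; the column names are hypothetical, and a whole-row cast like tht::text generally cannot be used in an index expression because it is not guaranteed immutable):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- coalesce() guards against a NULL column swallowing the whole string
CREATE INDEX the_table_search_trgm_idx ON the_table
USING gin ((coalesce(col1, '') || ' ' || coalesce(other_cols::text, '')) gin_trgm_ops);
-- Repeating the same expression in WHERE lets the planner use the index
SELECT *
FROM the_table
WHERE coalesce(col1, '') || ' ' || coalesce(other_cols::text, '') ILIKE '%help%';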

Efficient postgres index type for LIKE operator (fixed ending)

There is a postgres query using like statement:
email LIKE '%@%domain.com'
What is the most appropriate index type that I can use?
I've found the pg_trgm module, which must be enabled:
CREATE EXTENSION pg_trgm;
The pg_trgm module provides functions and operators for determining the similarity of ASCII alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.
And then you can
CREATE INDEX <index name> ON <table name> USING gin (<column> gin_trgm_ops);
Is there a better option?
gin_trgm_ops is described here: https://niallburkley.com/blog/index-columns-for-like-in-postgres/
The trigram index might be fine, and it allows you to write the query naturally. But more efficient would be a reversed-string index:
create index on foobar (reverse(email) text_pattern_ops);
select * from foobar where reverse(email) LIKE reverse('%@%domain.com');
If your default collation is "C", then you don't need to specify text_pattern_ops.
If the search parameter contains any escaped (literal) characters, then you will have to do something more complicated than simply reversing it.

to_tsquery query for phone numbers

We have a Postgres database with a good amount of indexing using tsvector for all the text-search attributes we might need, and ts_query usage is highly performant. But one possible search condition is phone numbers, where we have to support every possible format the user might search for.
Let's say the phone number in the tsvector is stored as "12985345885" and the user searches for 2985345885. How do I handle that in ts_query?
Basically:
select
  ('12985345885')::tsvector @@ ('12985345885:*')::tsquery
is true
and
select
  ('12985345885')::tsvector @@ ('2985345885:*')::tsquery
is false. It appears that Postgres tsquery doesn't support a wildcard at the start of a term?
Full-text search is not the right tool for that, because it is designed to search for words.
I would go for a substring search and a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON mytab USING gin (phone gin_trgm_ops);
SELECT * FROM mytab WHERE phone LIKE '%2985345885%';
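If the user's input is always a suffix of the stored number, as in the example, a reversed-string index is a possible alternative (a sketch, reusing mytab and phone from above):
-- Reversing the stored number turns the suffix search into a prefix search
CREATE INDEX ON mytab (reverse(phone) text_pattern_ops);
SELECT * FROM mytab
WHERE reverse(phone) LIKE reverse('2985345885') || '%';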

Index Types for Exact Match and ILIKE Search

(Postgres 12)
I am implementing a text search that allows for both exact match and fuzzy (ILIKE) match:
attributes->>'ID' = 'some-id'
-- OR
attributes->>'ID' ILIKE '%some-%'
(The user declares whether the search will be exact or not, so only one of the above is ever included in the query)
I am putting indexes on the most commonly searched attributes, ID and Name. When I use a GIN w/ gin_trgm_ops, the fuzzy match is much faster. With a BTREE index, the exact match is much faster.
I can have both BTREE and GIN indexes, but I am wondering if that is strictly necessary. Is there a way to nudge postgres into using the GIN index for the exact match search?
Starting in v14, pg_trgm will automatically handle equality. It will not be as efficient as a btree index would be, but it might be good enough that it is not worthwhile having both indexes.
Until then, the best solution is probably to just use LIKE without adding % before and after the search term (and indeed escaping any % or _ that happen to exist in the search term) when you want the exact match.
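A sketch of that escaping (the replace() chain is one way to do it; escape the backslash first so later escapes aren't doubled; the table name is hypothetical):
-- The escaped term contains no wildcards, so LIKE behaves like equality,
-- and the trigram GIN index can still serve the query
SELECT *
FROM mytab
WHERE attributes->>'ID' LIKE
      replace(replace(replace('some-id', '\', '\\'), '%', '\%'), '_', '\_');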

Why does PostgreSQL not use my index to do text prefix search under certain collations?

Consider:
create table tab (foo text not null);
create index tab_ix_foo on tab(foo);
select * from tab where foo like 'pre%';
Postgres doesn't use the index to do that search. When using the collation "POSIX", Postgres uses the index: http://sqlfiddle.com/#!12/ed1cc/1
When using the collation "en_US", Postgres uses a sequential scan: http://sqlfiddle.com/#!12/efb18/1
Why the difference?
When using locales other than C (i.e., POSIX), you need to create your indexes for LIKE and ~ prefix text searches using the text_pattern_ops operator class. See the docs on operator classes and index types. I'm sure there's a better reference than that docs page, but I can't seem to find it at the moment.
If I alter your SQLFiddle to use text_pattern_ops for the en_US index you'll see that it's able to use the index:
create index tab_ix_foo on tab using btree (foo collate "en_US" text_pattern_ops);
-- note the added text_pattern_ops opclass
It is quite likely that you'll need to create different indexes for different collations if you're using the COLLATE option in 9.2+, since pretty much by definition different collations imply different orderings of strings and therefore different b-tree organisation. It appears that you are already doing this in your tests.
It's also possible that your data is just too small for index use to be particularly useful. Try testing with a more useful amount of data.
This post may be useful, as may the docs on Collation support.
For why you can't just use the same b-tree index for different collations, consider that b-trees require a stable and consistent ordering, but:
regress=> SELECT ' Bill''s' > ('bills' COLLATE "POSIX");
 ?column?
----------
 f
(1 row)

regress=> SELECT ' Bill''s' > ('bills' COLLATE "en_US");
 ?column?
----------
 t
(1 row)
As you can see, the collation changes the sort ordering. That's what a collation does, pretty much by definition. Trying to use the same index for different collations is like trying to use the same functional index for different functions; it makes absolutely no sense.