to_tsquery query for phone numbers - postgresql

We have a Postgres database with a good amount of indexing: a tsvector covers all the text attributes we might need for search, and to_tsquery is highly performant against it. But one possible search condition is phone numbers, where we have to support every format the user might search for.
Let's say my phone number on the tsvector is stored like "12985345885" and the user searches for 2985345885, how do I handle that in ts_query?
Basically:
select
('12985345885')::tsvector @@ ('12985345885:*')::tsquery
is true
and
select
('12985345885')::tsvector @@ ('2985345885:*')::tsquery
is false. It appears that Postgres tsquery doesn't support a leading wildcard?

Full-text search is not the right tool for that, because it is designed to search for words.
I would go for a substring search and a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON mytab USING gin (phone gin_trgm_ops);
SELECT * FROM mytab WHERE phone LIKE '%2985345885%';

Related

Full-text with partitioning in PostgreSQL

I have a table that I want to search in.
Table:
user_id: integer
text: text
deleted_at: datetime (nullable)
Index:
CREATE INDEX CONCURRENTLY "index_notifications_full_text" ON "notifications"
USING "gist" (to_tsvector('simple'::REGCONFIG, COALESCE(("text")::TEXT, ''::TEXT))) WHERE "deleted_at" IS NULL;
I need to implement a full-text search for users (only inside their messages that are not deleted).
How can I implement an index that indexes both user_id and text?
Using the btree_gin and/or btree_gist extensions, you can include user_id directly into a multicolumn FTS index. You can try it on each type in turn, as it can be hard to predict which one will be better in a given situation.
Alternatively, you could partition the table by user_id using declarative partitioning and keep the single-column index (although in that case GIN is likely better than GiST).
If you want more detailed advice, you need to give us more details: how many user_ids there are, how many notifications per user_id, how many tokens per notification, and an example of a plausible query you hope to support efficiently.
You can add a new column, e.g. document_with_idx, of type tsvector to your notifications table:
ALTER TABLE notifications ADD COLUMN document_with_idx tsvector;
Then populate that column from the user_id and text columns (using an explicit configuration so the result doesn't depend on the session's default_text_search_config):
update notifications set document_with_idx = to_tsvector('simple', user_id::text || ' ' || coalesce(text, ''));
Finally, create an index, e.g. document_idx, on that column:
CREATE INDEX document_idx
ON notifications
USING GIN (document_with_idx);
You can now run a full-text search over both user_id and the text column through that document_with_idx column:
select user_id, text from notifications
where document_with_idx @@ to_tsquery('your search string goes here');
See more: https://www.postgresql.org/docs/9.5/textsearch-tables.html

How to index a PostgreSQL JSONB flat text array for fuzzy and right-anchored searches?

PostgreSQL version: 9.6.
The events table has a visitors JSONB column:
CREATE TABLE events (name VARCHAR(256), visitors JSONB);
The visitors column contains a "flat" JSON array:
["John Doe","Frédéric Martin","Daniel Smith",...].
The events table contains 10 million of rows, each row has between 1 and 20 visitors.
Is it possible to index the values of the array to perform efficient pattern-matching searches:
left anchored: select events whose visitors match 'John%'
right anchored: select events whose visitors match '%Doe'
unaccented: select events whose visitors match 'Frederic%'
case-insensitive: select events whose visitors match 'john%'
I am aware of the Postgres trigram extension's gin_trgm_ops operator class, which makes it possible to create indexes for case-insensitive and right-anchored searches, but I can't figure out how to create trigram indexes for the content of "flat" JSON arrays.
I read Pattern matching on jsonb key/value and Index for finding an element in a JSON array but the solutions provided do not seem to apply to my use case.
You should cast the jsonb to text and create a trigram index on it:
CREATE EXTENSION pg_trgm;
CREATE INDEX ON events USING gin
((visitors::text) gin_trgm_ops);
Then use regular expression searches on the column. For example, to search for John Doe, you can use:
SELECT ...
FROM events
WHERE visitors::text ~ '\mJohn Doe\M';
The trigram index will support this query.

PostgreSQL accent + case insensitive search

I'm looking for a way to support case-insensitive plus accent-insensitive search with good performance. Until now we had no issue with this on MS SQL Server; on Oracle we had to use Oracle Text, and now we need it on PostgreSQL.
I've found this post about it, but we need to combine it with case insensitivity. We also need to use indexes, otherwise performance would suffer.
Any real-world experience with the best approach for large databases?
If you need to "combine with case insensitive", there are a number of options, depending on your exact requirements.
Maybe simplest: make the expression index case-insensitive.
Building on the function f_unaccent() laid out in the referenced answer:
Does PostgreSQL support "accent insensitive" collations?
CREATE INDEX users_lower_unaccent_name_idx ON users(lower(f_unaccent(name)));
Then:
SELECT *
FROM users
WHERE lower(f_unaccent(name)) = lower(f_unaccent('João'));
Or you could build the lower() into the function f_unaccent(), to derive something like f_lower_unaccent().
Or (especially if you need to do fuzzy pattern matching anyways) you can use a trigram index provided by the additional module pg_trgm building on above function, which also supports ILIKE. Details:
LOWER LIKE vs iLIKE
I added a note to the referenced answer.
Or you could use the additional module citext (but I rather avoid it):
Deferrable, case-insensitive unique constraint
Full-Text-Search Dictionary that Unaccent case-insensitive
FTS is case-insensitive by default:
Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s or es in English).
And you can define your own dictionary using unaccent,
CREATE EXTENSION unaccent;
CREATE TEXT SEARCH CONFIGURATION mydict ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION mydict
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
Which you can then index with a functional index,
-- Just some sample data...
CREATE TABLE myTable ( myCol )
AS VALUES ('fóó bar baz'),('qux quz');
-- No index required, but feel free to create one
CREATE INDEX ON myTable
USING GIST (to_tsvector('mydict', myCol));
You can now query it very simply:
SELECT *
FROM myTable
WHERE to_tsvector('mydict', myCol) @@ 'foo & bar'
mycol
-------------
fóó bar baz
(1 row)
See also
Creating a case-insensitive and accent/diacritics insensitive search on a field

Indexing strategy for full text search in a multi-tenant PostgreSQL database

I have a PostgreSQL database that stores a contact information table (first, last names) for multiple user accounts. Each contact row has a user id column. What would be the most performant way to set up indexes so that users could search for the first few letters of the first or last name of their contacts?
I'm aware of conventional b-tree indexing and PG-specific GIN and GiST, but I'm just not sure how they could (or could not) work together such that a user with just a few contacts doesn't have to search all of the contacts before filtering by user_id.
You should add the account identifier as the first column of any index you create. This will in effect first narrow down the search to rows belonging to that account. For GiST or GIN full-text indexes you will need to install the btree_gist or btree_gin extension.
If you only need to search for the first letters, the simplest and probably fastest would be to use a regular btree that supports text operations for both columns and do 2 lookups. You'll need to use the text_pattern_ops opclass to support text prefix queries and lower() the fields to ensure case insensitivity:
CREATE INDEX contacts_firstname_idx ON contacts(aid, lower(firstname) text_pattern_ops);
CREATE INDEX contacts_lastname_idx ON contacts(aid, lower(lastname) text_pattern_ops);
The query will then look something like this:
SELECT * FROM contacts WHERE aid = 123 AND
(lower(firstname) LIKE 'an%' OR lower(lastname) LIKE 'an%')

Postgres Indexing?

I am a newbie in Postgres. I have a column named host (string varchar2) in a table with around 20 million rows. How do I use indexing to optimize searches for a particular host? Also, this column is updated daily; do I need to write a trigger to re-index at some interval? If yes, how do I do that? (For the record, I am using Ruby and Rails 3.)
Assuming you're doing exact matches, you should just be able to create the index and leave it:
CREATE INDEX host_index ON table_name (host)
The query optimizer should just use that automatically.
You may wish to specify other options such as the collation to use.
See the PostgreSQL docs for CREATE INDEX for more information.
I'd suggest using a BRIN index (available since PostgreSQL 9.5) rather than a conventional btree index.
For text search, it is recommended that you use GIN or GiST index types.
https://www.postgresql.org/docs/9.5/static/textsearch-indexes.html
Another possibility: if you only perform exact matching on the host column, i.e. no inequality comparisons (>, <) and no partial matching (LIKE, wildcards), you may consider converting host to a hash integer to speed up the search significantly.