PostgreSQL: to_tsquery starts with and ends with search - postgresql

Recently, I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it. This table has over 200 million rows and querying using to_tsquery worked pretty well for the column of type tsvector.
Now I need to hit the following queries but reading the documentation I couldn't find how to do it (or it's there and I didn't understand because full-text search is something new to me yet)
Starts with
Ends with
How can I make the query below return true only if the query is "The cat" (starts with) and "the book" (ends with), if it's possible in full-text search.
select to_tsvector('The cat is on the book') ## to_tsquery('Cat')

I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it.
How did you do that? FTS doesn't apply for LIKE queries. It applies for FTS queries, such as ##.
You can't directly look for strings starting and ending with certain words. You can use the index to filter on cat and book, then refilter those rows for ones having them in the right place.
select * from whatever where tsv_col ## to_tsquery('cat & book') and text_col LIKE 'The cat % the book';
Unless you want to match something like 'The cathe book' then you would have to do something else, with two different LIKE.

Related

Use Postgresql full text search to fuzzy match all search terms

I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) ## to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If did you use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is lot more fuzzy than what you are currently doing, so I don't understand how you will lose that. pg_trgm also provides the '<->' operator which is even fuzzier, and might be your best bet. It can deal with typos fairly well when embedded in long strings, but in short strings they can really be a problem.
In your case, to_tsquery() need to indicate that all words are required, you can use to_tsquery('english', 'off & bmt') and indicates a particular dictionary containing the 'off' word, listed in the link 4, below.
Some tips to use tsvector:
Create a field on your table that contains all fields with terms that you want to search, this field should be the type tsvector
Your search should use tsquery as you mentioned in your answer. In search, you can make some good tricks, like as follow:
2.a. Create a rank, with ts_rank(), indicating the search priority, this indicates the priority and how much the tsquery approximates with original terms
2.b. If you have specific words (like my case, search of chemical terms), you can create a dictionary with the commonly words used, this words can be used to extract radical or parts to compare the similarity.
2.c. About the performance: The tsquery works very well with gin and gist indexes. I have used full text search in a table with +200k registers and the search returns in < 0.4secs.
If you need more fuzzy search in words, you can also use the fuzzy match. I used with tsquery, the levenshtein_less_equal search, using a distance of 3. The function searches words with 3 or minus letters differing from the search, for unique words is a good way to search.
tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

Using full text search in PostgreSQL, how can I make certain words worth less to match?

I am trying to use Postgres full-text search to search an index of company names. There are lots of duplicates, typos, etc. When matching company names, things like LLC and Inc are not quite stop-words (as in, I want them to count for something) but they are not nearly as important as most other words. Is there a way to query such that some words count more than other words when matching?
(I'm doing this all through Django, but if I can figure out the SQL to use I can probably get the rest of the way there...)
You can use the 3-argument form of "setweight" to de-weight specific lexemes. You would do this in the tsvector, not in the tsquery.
select setweight(setweight(to_tsvector('The DBA LLC'),'A'),'D','{llc}');
setweight
-------------------
'dba':2A 'llc':3D

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table name table and a string column named column, I want to search for the word word in that column in the following way: exact matches be on top, followed by prefix matches and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is which one of the two solutions are more efficient or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the Where, is not needed as it is covered by ; it might confuse the DB to do 2 checks instead of one.
But the biggest problem is the third one as this has no way to be optimized by an index.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend you to explore fuzzy string matching and full text search; looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you definitely should add the extension "pgtrgm", as it will enable you to add a GIN index on the column that will speedup LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look to FTS. It does provide ranking. If your requirements are strict to what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And even I wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

PostgreSQL: Full text search multitenant site, plus only parts of site

I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:
1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.
I wonder what indexes should I add?
Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).
I can think of three approaches:
Create three indexes. 1) One that indexes everything on a per site basis.
And 2) one that indexes everything on a per-site plus site-section basis.
And 3) one that indexes everything on a per-site and page-id basis.
Create one single index, and insert into [the text to index] magic words like:
"site_<site-id>"
and "section_<section-id>" and "page_<page-id>", and then when I search
for section XX in site YYY I could prefix the search query like so:
"site_XX AND section_YYY AND ...".
Dynamically add database indexes when a new site or site section is created:
create index dw1_posts__search_site_YYY
on dw1_posts using gin(to_tsvector('english', approved_text))
where site_id = 'YYY';
Does any of these three approaches above make sense? Are there better alternatives?
(Details: However, perhaps approach 1 is impossible? Attempting to index-a-column and also index-for-full-text-searching at the same time, results in syntax errors:
> create index dw1_posts__search_site
on dw1_posts (site_id)
using gin(to_tsvector('english', approved_text));
ERROR: syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
^
> create index dw1_posts__search_site
on dw1_posts
using gin(to_tsvector('english', approved_text))
(site_id);
ERROR: syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);
(If approach 1 was possible, then I could do queries like:
select ... from ... where site_id = ... and <full-text-search-column> ## <query>;
and have PostgreSQL first check site_id and then the full-text-search column, using one single index.)
)
/ End details.)
Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists, for faceted search, with relational databases / PostgreSQL. And integrating with ElasticSearch seems to be roughly as simple as implementing and testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words: "section" and "NNN" and thus index words that doesn't exist on the page! Tricky to fix such small annoying issues.)
The normal approach would be to create:
one full text index:
CREATE INDEX idx1
ON dw1_posts USING gin(to_tsvector('english', approved_text));
a simple index on the site_id:
CREATE INDEX idx2
on dw1_posts(page_id);
another simple index on the page_id:
CREATE INDEX idx3
on dw1_posts(site_id);
Then it's the SQL planner's business to decide which ones to use if any, and in what order depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.
Another alternative, which is similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" alternative, should be to prefix the text to index with:
SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>
And then use prefix matching (i.e. :*) when searching:
select ... from .. where .. ## 'SiteSectionPage_NN_MMM:* AND (the search phrase)'
where NN is the site ID and MMM is the section ID.
But this won't work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... will be split into: Sit, ite, teS, eSe, which makes no sense.

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector ## plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % is looking for similarily between the two entire strings, not looking for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') ## to_tsquery('cats'); --true
This returns true becauase tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this word require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at Synonym Dictorionaries