PostgreSQL Full Text Search and Trigram Confusion - postgresql

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector ## plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?

They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % is looking for similarily between the two entire strings, not looking for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') ## to_tsquery('cats'); --true
This returns true becauase tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this word require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at Synonym Dictorionaries

Related

Postgres fuzzy array intersection

I'm using Postgresql 13 and my problem was easily solved with #> operator like this:
select id from documents where keywords #> '{"winter", "report", "2020"}';
meaning that keywords array should contain all these elements. Also I've created a GIN index on this column.
Is it possible to achieve similar behavior even if I provide my request like '{"re", "202", "w"}' ? I heard that ngrams have semantics like this, but "intersection" capabilities of arrays are crucial for me.
In your example, the matches are all prefixes. Is that the general rule here? If so, you would probably went to use the match feature of full text search, not trigrams. It would require you reformat your data, or at least your query.
select * from
(values (to_tsvector('simple','winter report 2020'))) f(x)
where x## 're:* & 202:* & w:*'::tsquery;
If the strings can contain punctuation which you want preserved, you would need to take pains to properly format them into a quoted tsvector yourself rather than just letting to_tsvector deal with it. Using 'simple' config gets rid of the stemming and stop word removal features, which would interfere with what you want to do.

Use Postgresql full text search to fuzzy match all search terms

I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) ## to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If did you use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is lot more fuzzy than what you are currently doing, so I don't understand how you will lose that. pg_trgm also provides the '<->' operator which is even fuzzier, and might be your best bet. It can deal with typos fairly well when embedded in long strings, but in short strings they can really be a problem.
In your case, to_tsquery() need to indicate that all words are required, you can use to_tsquery('english', 'off & bmt') and indicates a particular dictionary containing the 'off' word, listed in the link 4, below.
Some tips to use tsvector:
Create a field on your table that contains all fields with terms that you want to search, this field should be the type tsvector
Your search should use tsquery as you mentioned in your answer. In search, you can make some good tricks, like as follow:
2.a. Create a rank, with ts_rank(), indicating the search priority, this indicates the priority and how much the tsquery approximates with original terms
2.b. If you have specific words (like my case, search of chemical terms), you can create a dictionary with the commonly words used, this words can be used to extract radical or parts to compare the similarity.
2.c. About the performance: The tsquery works very well with gin and gist indexes. I have used full text search in a table with +200k registers and the search returns in < 0.4secs.
If you need more fuzzy search in words, you can also use the fuzzy match. I used with tsquery, the levenshtein_less_equal search, using a distance of 3. The function searches words with 3 or minus letters differing from the search, for unique words is a good way to search.
tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

Using full text search in PostgreSQL, how can I make certain words worth less to match?

I am trying to use Postgres full-text search to search an index of company names. There are lots of duplicates, typos, etc. When matching company names, things like LLC and Inc are not quite stop-words (as in, I want them to count for something) but they are not nearly as important as most other words. Is there a way to query such that some words count more than other words when matching?
(I'm doing this all through Django, but if I can figure out the SQL to use I can probably get the rest of the way there...)
You can use the 3-argument form of "setweight" to de-weight specific lexemes. You would do this in the tsvector, not in the tsquery.
select setweight(setweight(to_tsvector('The DBA LLC'),'A'),'D','{llc}');
setweight
-------------------
'dba':2A 'llc':3D

How to make a tsquery perform a partial match?

I have the following situation. In our database, our user has the ability to search part numbers as 'keywords'. Part numbers are attached as 'footnotes' which get attached to certain items. An example of a footnote of this nature would have a description of:
Part Number: 09C888
Our keyword search searches multiple tables through an incredibly fun set of LEFT JOINs eventually forming a ts_vector which then is used against a tsquery. Our current issue is that this methodology seems to only accept exact matches. Example:
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888:*');
?column?
---------
t
Using the full version of the part number as the search criteria works fine. However...
select to_tsvector('Part Number: 09C888') ## to_tsquery('9C888:*');
?column?
----------
f
Is there a way to modify the above tsquery item to match against 09C888 with values of 09C888 AND 9C888? Normally, I could do something similar with the LIKE construct, but we're currently using full text search for efficiency on large amounts of data. From perusing the postgresql documentation, I cannot figure out an easy way to do this. I am also hesitant to change the overall query since it's doing... well, its doing a lot of stuff of which the text matching is only one part of. (Obviously a potential place for improvement.)
EDIT:
I've actually figured out how to do this using a modified query
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888|9C888:*');
Is there a better way to determine match than what I've listed above? Mostly because the solution in incredibly specific, but essentially these part numbers may or may not have leading 0s.
Have you considered storing the part number with leading zeroes removed in a separate column and search against that?
+---------------------+-------+
| Part Number: 09C888 | 9C888 |
+---------------------+-------+
CREATE INDEX footnote_part_number_txt_idx
ON footnotes (stripped_part_number text_pattern_ops);
then you can query (using the index)
SELECT footnote_str
FROM footnotes
WHERE stripped_part_number LIKE '9C88%'
See: http://petereisentraut.blogspot.se/2009/10/rethink-your-text-column-indexing-with.html

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) ## to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_tgrm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigam in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
Catch is, trigrams don't work at all for patterns that under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) #> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
For full text search in a multi-language environment you need to store the language each datum is in along side the text its self. You can then use the language-specified flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language,location_name) ## to_tsquery('english','cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucerne / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what #willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx
USING gist(location_name gist_trgm_ops);
The gist_trgm_ops operator class ignore case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges, for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful, if you only want results of a specific language per query and would be very fast in this case.