How to make a tsquery perform a partial match? - postgresql

I have the following situation. In our database, our user has the ability to search part numbers as 'keywords'. Part numbers are attached as 'footnotes' which get attached to certain items. An example of a footnote of this nature would have a description of:
Part Number: 09C888
Our keyword search searches multiple tables through an incredibly fun set of LEFT JOINs eventually forming a ts_vector which then is used against a tsquery. Our current issue is that this methodology seems to only accept exact matches. Example:
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888:*');
?column?
---------
t
Using the full version of the part number as the search criteria works fine. However...
select to_tsvector('Part Number: 09C888') ## to_tsquery('9C888:*');
?column?
----------
f
Is there a way to modify the above tsquery item to match against 09C888 with values of 09C888 AND 9C888? Normally, I could do something similar with the LIKE construct, but we're currently using full text search for efficiency on large amounts of data. From perusing the postgresql documentation, I cannot figure out an easy way to do this. I am also hesitant to change the overall query since it's doing... well, its doing a lot of stuff of which the text matching is only one part of. (Obviously a potential place for improvement.)
EDIT:
I've actually figured out how to do this using a modified query
select to_tsvector('Part Number: 09C888') ## to_tsquery('09C888|9C888:*');
Is there a better way to determine match than what I've listed above? Mostly because the solution in incredibly specific, but essentially these part numbers may or may not have leading 0s.

Have you considered storing the part number with leading zeroes removed in a separate column and search against that?
+---------------------+-------+
| Part Number: 09C888 | 9C888 |
+---------------------+-------+
CREATE INDEX footnote_part_number_txt_idx
ON footnotes (stripped_part_number text_pattern_ops);
then you can query (using the index)
SELECT footnote_str
FROM footnotes
WHERE stripped_part_number LIKE '9C88%'
See: http://petereisentraut.blogspot.se/2009/10/rethink-your-text-column-indexing-with.html

Related

Search engine like full text search in PostgreSQL

I have a list of titles and descriptions in a table which are indexed in a tsvector column. How can I implement Google Search like full text search functionality in Postgres for these fields. I tried various functions offered by standard Postgres like
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
plainto_tsquery('apple orange') -- apple & orange
This function requires all of the terms in the query. But I want results including both apple and orange first but can still have results including even one of these terms just later in the results.
phraseto_tsquery('apple orange') -- apple <> orange
This function only matches orange followed by apple but not vice versa. But for me orange <> apple is also still relevant.
I also tried websearch_to_tsquery() but it behaves very similar to above functions.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
to_tsquery('apple | orange') -- apple | orange
This function returns rows as long as it has one of these terms so it doesn't produce highly relevant results at top which should have both of the terms.
Unless you tell it how to order the rows, rows of a single query are returned in arbitrary order. There is no "top" without an ORDER BY, there is just something which happens to be seen first.
How can I ask Postgres to list highly relevant rows first which contains most of the terms in the search query no matter the order of the terms and then followed by rows with less number of terms?
Use the | operator, then rank those rows using ts_rank, ts_rank_cd, or a custom ranking function you write yourself. For performance, you might want to use the & operator first, then revert to | if you don't get enough rows.
The built in ranking functions don't care about order, but also don't care about proximity. So they might not do what you want. But writing your own won't be particularly easy, so I'd at least try them out first.
It would be nice if the introduction of websearch_to_tsquery or phraseto_tsquery had also introduced some corresponding ranking functions. But since they invented only ordered proximity, not proximity without order, it is unlikely they would do you want if they did exist.

Use Postgresql full text search to fuzzy match all search terms

I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) ## to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If did you use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is lot more fuzzy than what you are currently doing, so I don't understand how you will lose that. pg_trgm also provides the '<->' operator which is even fuzzier, and might be your best bet. It can deal with typos fairly well when embedded in long strings, but in short strings they can really be a problem.
In your case, to_tsquery() need to indicate that all words are required, you can use to_tsquery('english', 'off & bmt') and indicates a particular dictionary containing the 'off' word, listed in the link 4, below.
Some tips to use tsvector:
Create a field on your table that contains all fields with terms that you want to search, this field should be the type tsvector
Your search should use tsquery as you mentioned in your answer. In search, you can make some good tricks, like as follow:
2.a. Create a rank, with ts_rank(), indicating the search priority, this indicates the priority and how much the tsquery approximates with original terms
2.b. If you have specific words (like my case, search of chemical terms), you can create a dictionary with the commonly words used, this words can be used to extract radical or parts to compare the similarity.
2.c. About the performance: The tsquery works very well with gin and gist indexes. I have used full text search in a table with +200k registers and the search returns in < 0.4secs.
If you need more fuzzy search in words, you can also use the fuzzy match. I used with tsquery, the levenshtein_less_equal search, using a distance of 3. The function searches words with 3 or minus letters differing from the search, for unique words is a good way to search.
tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table name table and a string column named column, I want to search for the word word in that column in the following way: exact matches be on top, followed by prefix matches and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is which one of the two solutions are more efficient or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the Where, is not needed as it is covered by ; it might confuse the DB to do 2 checks instead of one.
But the biggest problem is the third one as this has no way to be optimized by an index.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend you to explore fuzzy string matching and full text search; looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you definitely should add the extension "pgtrgm", as it will enable you to add a GIN index on the column that will speedup LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look to FTS. It does provide ranking. If your requirements are strict to what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And even I wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

PostgreSQL Full Text Search and Trigram Confusion

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:
SELECT * FROM articles
WHERE search_vector ## plainto_tsquery('english', 'cat, bat, rat');
The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:
Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?
They serve very different purposes.
Full Text Search is used to return documents that match a search query of stemmed words.
Trigrams give you a method for comparing two strings and determining how similar they look.
Consider the following examples:
SELECT 'cat' % 'cats'; --true
The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).
SELECT 'there is a cat with a dog' % 'cats'; --false
The above returns false because % is looking for similarily between the two entire strings, not looking for the word cats within the string.
SELECT to_tsvector('there is a cat with a dog') ## to_tsquery('cats'); --true
This returns true becauase tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.
It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this word require maintaining a separate table containing all the words used in your search field.
If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at Synonym Dictorionaries

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) ## to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_tgrm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigam in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
Catch is, trigrams don't work at all for patterns that under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, regexp_matches the entire CJK Unicode ranges, and returns an array of any such characters or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) #> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
For full text search in a multi-language environment you need to store the language each datum is in along side the text its self. You can then use the language-specified flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language,location_name) ## to_tsquery('english','cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucerne / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what #willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx
USING gist(location_name gist_trgm_ops);
The gist_trgm_ops operator class ignore case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges, for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
Very fast.
Retains capability to find partial matches.
Adds capability for fuzzy search.
Check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful, if you only want results of a specific language per query and would be very fast in this case.