Misspellings in PostgreSQL Full Text Search

I'm using PostgreSQL to perform full text search, and I'm finding that users get no results if their query contains a misspelling.
What is the best way to handle misspelled words in Postgres full text search?

Take a look at the pg_similarity extension, which adds a large set of similarity operators and functions to PostgreSQL. That makes it fairly easy to add some forgiveness for misspellings into your queries.
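As a minimal sketch of the idea, here is edit-distance matching using levenshtein() from the fuzzystrmatch contrib module that ships with Postgres (pg_similarity offers many more measures; the words table is hypothetical):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- find stored words within edit distance 2 of the misspelled term
SELECT word
FROM words
WHERE levenshtein(word, 'postgers') <= 2;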

Typing "spelling correction postgresql fts" into Google, the top result is a page that links to just such a topic.
It suggests keeping a separate table of all the valid words in your database and running search terms against it to suggest corrections. Trigram matching lets you measure how "similar" the real words in your table are to the search terms supplied, as in the sketch below.
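A sketch of that approach with pg_trgm (the search_words table is assumed to hold the valid words):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- suggest the valid words closest to a misspelled search term
SELECT word, similarity(word, 'laptpo') AS sml
FROM search_words
WHERE word % 'laptpo'   -- % keeps matches above pg_trgm.similarity_threshold
ORDER BY sml DESC
LIMIT 5;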

Related

PostgreSQL Text Search Performance

I have been looking into text search (without tsvector) on a varchar field (roughly 10 to 400 chars) that has the following format:
field,field_a,field_b,field_c,...,field_n
The query I am planning to run is probably similar to:
select * from information_table where fields like '%field_x%'
As there are no spaces in fields, I wonder whether there will be performance issues if I run this search across 500k+ rows.
Any insights into this?
Any documentation on the performance of varchar, and perhaps of varchar indexes?
I am not sure whether tsvector will work on a full string without spaces. What do you think of this approach? Do you see other solutions that could help improve the performance?
In general the text search parser will treat commas and spaces the same, so if you want to use FTS, the structure with commas does not pose a problem. pg_trgm also treats commas and spaces the same, so if you want to use that method instead it will also not have a problem due to the commas.
The performance is going to depend on how popular or rare the tokens in the query are in the body of text. It is hard to generalize that based on one example row and one example query, neither of which looks very realistic. The best way to figure it out is to run some real queries with real (or at least realistic) data under EXPLAIN (ANALYZE, BUFFERS) and with track_io_timing turned on, as in the sketch below.
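For example, using the (hypothetical) table and query from the question:
-- ask the server to time I/O, so EXPLAIN reports read/write time
-- (changing this setting may require superuser privileges)
SET track_io_timing = on;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM information_table WHERE fields LIKE '%field_x%';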

PostgreSQL (Full Text Search) vs Elasticsearch

Hi, I am doing some research before I implement a search feature in my service.
I'm currently using PostgreSQL as my main storage. I could definitely use PostgreSQL's built-in full-text search, but the problem is that my data is scattered across several tables.
My service is an e-commerce website. So if a customer searches for "good apple laptop", I need to join the brand table, post table and review table (one post is a combination of several reviews plus a short summary) to fully search all posts. If I were to use Elasticsearch, I could insert complete posts by preprocessing them.
From my research, some people say PostgreSQL's FTS and Elasticsearch have similar performance, and some say Elasticsearch is faster. Which would be the better solution for my case?
If PostgreSQL is already in your stack, the best option for you is to use PostgreSQL's full-text search.
Why full-text search (FTS) in PostgreSQL?
Because otherwise you have to feed database content to external search engines.
External search engines (e.g. Elasticsearch) are fast, BUT:
They can't index all documents - some documents may be totally virtual
They don't have access to attributes - no complex queries
They have to be maintained - a headache for the DBA
Sometimes they need to be certified
They don't provide instant search (they need time to download new data and reindex)
They don't provide consistency - search results may already have been deleted from the database
If you want to read more about FTS in PostgreSQL, there's a great presentation by Oleg Bartunov (I extracted the list above from it): "Do you need a Full-Text Search in PostgreSQL?"
Here is a short example of how you can create a "document" (see the text search documentation) from more than one table in SQL:
-- coalesce() guards against NULL columns, which would otherwise
-- make the whole concatenated document NULL
SELECT to_tsvector(coalesce(posts.summary, '') || ' ' || coalesce(brands.name, ''))
FROM posts
INNER JOIN brands ON (posts.brand_id = brands.id);
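You could then match that combined document against a search; a minimal sketch using the same (hypothetical) schema:
SELECT posts.id, posts.summary
FROM posts
INNER JOIN brands ON (posts.brand_id = brands.id)
-- plainto_tsquery turns raw user input into a tsquery
WHERE to_tsvector(coalesce(posts.summary, '') || ' ' || coalesce(brands.name, ''))
      @@ plainto_tsquery('good apple laptop');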
If you are using Django for your e-commerce website, you can also read an article I wrote on "Full-Text Search in Django with PostgreSQL".
I've found a piece of research from 2021 with some benchmarks (a PostgreSQL vs Elasticsearch performance graph) and a useful conclusion:
With each new version of PostgreSQL, the search response time is improving, and it is moving toward an apples-to-apples comparison with Elasticsearch. So, if the project is not going to have millions of records or large-scale data, PostgreSQL full-text search would be the best option to opt for.
Short Answer: Elasticsearch is better
Explanation:
PostgreSQL and Elasticsearch are two different types of databases. Elasticsearch is built for document searching, while PostgreSQL is a traditional RDBMS. No matter how well PostgreSQL does on its full-text searches, Elasticsearch is designed to search enormous bodies of text and documents (or records), and the larger the corpus you want to search, the more Elasticsearch outperforms PostgreSQL. Additionally, you can get many benefits and great performance if you pre-process posts into several well-chosen fields and indexes before storing them in Elasticsearch.
If you definitely need the full-text feature inside the RDBMS, you may consider MSSQL, which may do better than PostgreSQL.
Reply to comments: the comparison comes down to the basic properties of these different types of DBs, and the OP didn't say how much data is stored. For a small amount of searchable data, either Postgres or ES is fine. However, if transactions and the data repository grow larger in the future, ES will provide more benefits.
You could check this site to see the current ranking of each type of DB, and choose the best one for your requirements, architecture and the future data growth of your applications.

Pattern matching performance issue in Postgres

I found that a query like the one below takes a long time; the pattern matching is what hurts the performance of my batch job.
Query:
select a.id, b.code
from table a
left join table b
on a.desc_01 like '%'||b.desc_02||'%';
I have tried the LEFT and STRPOS functions to improve the performance, but I end up losing some rows when I apply them.
Any other suggestions, please?
It's not entirely clear what your data (or structure) really looks like, but your search is performing a contains comparison. That's not the simplest thing to optimize, because a standard index, and many matching algorithms, are biased towards the start of the string. When the pattern leads with %, a B-tree can't be used efficiently, since it splits/branches based on the front of the string.
Depending on how you really want to search, have you considered trigram indexes? They're pretty great. Your string gets split into three-letter chunks, which overcomes a lot of the problems with left-anchored text comparison. The reason why is simple: now every character is the start of a short, left-anchored chunk. There are traditionally two methods of generating trigrams (n-grams), one with leading padding and one without; Postgres uses padding, which is the better default. I got help with a related question recently that may be relevant to you:
Searching on expression indexes
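You can see the padded trigrams Postgres generates with show_trgm() from the pg_trgm extension:
-- every character starts a chunk thanks to the leading padding
SELECT show_trgm('cat');
-- {"  c"," ca","at ","cat"}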
If you want something more like a keyword match, then full text search might be of help. I hadn't been using it much because I've got a data set where converting words to "lexemes" doesn't make sense. It turns out that you can tell the parser to use the "simple" dictionary instead, and that gets you a unique word list without any stemming transformations. Here's a recent question on that:
https://dba.stackexchange.com/questions/251177/postgres-full-text-search-on-words-not-lexemes/251185#251185
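For illustration, here is the difference between a stemming configuration and the "simple" one (output shown as comments):
SELECT to_tsvector('english', 'The quick foxes');
-- 'fox':3 'quick':2            (stemmed; stop word "the" removed)
SELECT to_tsvector('simple', 'The quick foxes');
-- 'foxes':3 'quick':2 'the':1  (every word kept verbatim)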
If that sounds more like what you need, you might also want to get rid of stop/skip/noise words. Here's a thread that I think is a bit clearer than the docs regarding how to set this up (it's not hard):
https://dba.stackexchange.com/questions/145016/finding-the-most-commonly-used-non-stop-words-in-a-column/186754#186754
The long term answer is to clean up and re-organize your data so you don't need to do this.
Using a pg_trgm index might be the short term answer.
-- pg_trgm provides trigram operator classes for LIKE/ILIKE searches
create extension pg_trgm;
-- GIN trigram index on the column searched with a leading wildcard
create index on a using gin (desc_01 gin_trgm_ops);
How fast this will be is going to depend on what is in b.desc_02.

Fuzzy search on large table

I have a very large PostgreSQL table with 12M names, and I would like to show an autocomplete. Previously I used an ILIKE 'someth%' clause, but I'm not really satisfied with it: for example, it doesn't sort by similarity, and any spelling error causes wrong or no results. The field is a string, usually one or two words (in any language). I need a fast response because suggestions are shown live to the user while they are typing (i.e. autocomplete). I cannot restrict the fuzzy match to a subset because all names are equally important. I can also say that most names are different.
I have tried pg_trgm, but even with a GIN index it is very slow. A search for a name similar to 'html' takes a few milliseconds, but - don't ask me why - other searches like 'htm' take many seconds, e.g. 25 seconds. Other people have also reported performance issues with pg_trgm on large tables.
Is there anything I can do to efficiently show an autocomplete on that field?
Would a full text search engine (e.g. Lucene, Solr) be an appropriate solution? Or would I encounter the same inefficiency?
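For reference, a minimal version of the pg_trgm setup described above (the names table is hypothetical; with a GIN index the % filter can use the index, while the <-> ordering is computed on the matching rows):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON names USING gin (name gin_trgm_ops);

-- % keeps rows above pg_trgm.similarity_threshold; <-> is trigram distance
SELECT name
FROM names
WHERE name % 'htm'
ORDER BY name <-> 'htm'
LIMIT 10;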

How can you emulate a Solr "more like this query" with Postgresql full text search?

I'd like to emulate this type of Solr query:
http://wiki.apache.org/solr/MoreLikeThis
with PostgreSQL using its full text search facility.
Is there a way to do something like a "more like this" query with pure postgres?
Not out of the box, I'm afraid. It might be possible to compare two tsvectors to determine whether they are similar enough, or to pull the top n most similar tsvectors, but there is no out-of-the-box functionality for this. The good news is that since tsvectors support GIN indexing, the complicated part is done for you.
What I think you'd need to do is create a function in C which determines the intersection of two tsvectors. From there you could create a function which determines if they overlap and an operator which addresses this. From there it shouldn't be too hard to create a ranking based on largest overlap.
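Short of writing that C function, a rough pure-SQL approximation is possible: turn the source row's tsvector back into an OR-ed tsquery and rank every other row against it. A sketch only (the posts table, its tsv column and the id are hypothetical, and it assumes the stored lexemes contain no tsquery special characters):
SELECT p2.id, ts_rank(p2.tsv, q) AS score
FROM posts p1
-- rebuild the source document's lexemes as one big OR query
CROSS JOIN LATERAL to_tsquery('simple',
    array_to_string(tsvector_to_array(p1.tsv), ' | ')) AS q
JOIN posts p2 ON p2.tsv @@ q AND p2.id <> p1.id
WHERE p1.id = 42
ORDER BY score DESC
LIMIT 10;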
Of course, I suspect that this will be easiest to do in a language like C, but you could probably use other procedural languages as well if you need to.
The wonderful thing about PostgreSQL is that anything is possible. Of course, the downside is that the further you move from core functionality, the more you get to do yourself.