PostgreSQL Full Text Search case study?

Is there any case study available of any project that uses PostgreSQL 8.3+ Full Text Search on a large amount of data?

Not sure what your definition of a large amount is.
There's some data about search.postgresql.org, which uses it, available here.

Related

PostgreSQL (Full Text Search) vs ElasticSearch

Hi, I am doing some research before I implement a search feature in my service.
I'm currently using PostgreSQL as my main storage. I could definitely use PostgreSQL's built-in full-text search, but the problem is that my data is scattered across several tables.
My service is an e-commerce website. So if a customer searches for "good apple laptop", I need to join the brand table, post table, and review table (one post is a combination of several reviews plus a short summary) to fully search all posts. If I were to use Elasticsearch, I could insert complete posts by preprocessing them.
From my research, some people say PostgreSQL's FTS and Elasticsearch have similar performance, and some say Elasticsearch is faster. Which would be the better solution for my case?
Thanks in advance
If PostgreSQL is already in your stack, the best option for you is to use PostgreSQL's full-text search.
Why full-text search (FTS) in PostgreSQL?
Because otherwise you have to feed database content to external search engines.
External search engines (e.g. Elasticsearch) are fast, BUT:
They can't index all documents (some content may be purely virtual)
They don't have access to document attributes, so no complex queries
They have to be maintained separately, which is a headache for the DBA
Sometimes they need to be certified
They don't provide instant search (they need time to pull new data and reindex)
They don't provide consistency: search results may already have been deleted from the database
If you want to read more about FTS in PostgreSQL, there's a great presentation by Oleg Bartunov (the list above is extracted from it): "Do you need a Full-Text Search in PostgreSQL?"
Here is a short example of how you can create a "document" (see the text search documentation) from more than one table in SQL:
-- coalesce guards against NULLs, which would otherwise make the whole tsvector NULL
SELECT to_tsvector(coalesce(posts.summary, '') || ' ' || coalesce(brands.name, ''))
FROM posts
INNER JOIN brands ON (posts.brand_id = brands.id);
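To show how that document would actually be queried, here is a minimal sketch building on the example above; plainto_tsquery, ts_rank, and the @@ operator are standard PostgreSQL FTS features, the search string is taken from the question, and everything else is assumed from the example:

SELECT posts.id,
       ts_rank(to_tsvector(coalesce(posts.summary, '') || ' ' ||
                           coalesce(brands.name, '')), query) AS rank
FROM posts
INNER JOIN brands ON (posts.brand_id = brands.id)
CROSS JOIN plainto_tsquery('good apple laptop') AS query
-- @@ tests whether the document vector matches the query
WHERE to_tsvector(coalesce(posts.summary, '') || ' ' ||
                  coalesce(brands.name, '')) @@ query
ORDER BY rank DESC
LIMIT 10;

In practice you would precompute the tsvector into an indexed column (or use a GIN expression index) rather than rebuilding it for every row at query time.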
If you are using Django for your e-commerce website, you can also read this article I wrote on "Full-Text Search in Django with PostgreSQL".
I've found a 2021 study with some benchmarks
(graph: PostgreSQL vs Elasticsearch query performance)
and a useful conclusion:
With each new version of PostgreSQL, search response time is improving, and it is approaching an apples-to-apples comparison with Elasticsearch. So, if the project is not going to have millions of records or large-scale data, PostgreSQL full-text search would be the best option to opt for.
Short Answer: Elasticsearch is better
Explanation:
PostgreSQL and Elasticsearch are two different types of databases. Elasticsearch is powerful for document searching, while PostgreSQL is a traditional RDBMS. No matter how well PostgreSQL does on its full-text searches, Elasticsearch is designed to search enormous volumes of text and documents (or records). The larger the corpus you need to search, the more Elasticsearch outperforms PostgreSQL. Additionally, you can get many benefits and great performance if you preprocess the posts into well-designed fields and indexes before storing them in Elasticsearch.
If you definitely need full-text features, you may also consider MSSQL, which may do better than PostgreSQL.
Reply to comments: The comparison of properties between these different types of DBs should be common sense, and the OP didn't say what amount and size of data are stored. For a small amount of searchable data, either Postgres or ES is fine. However, if transactions and the data repository grow larger in the future, ES will provide benefits.
You could check this site for the current ranking of each type of DB, and choose the best one for your requirements, architecture, and the future data growth of your applications.

PostgreSQL Misspelling in Full Text Search

I'm using postgresql to perform Full Text Search and I am finding that users will not receive results if there are misspellings.
What is the best way to handle misspelt words in Postgres full text search?
Take a look at the pg_similarity extension, which adds a lot of similarity operators and functions to PostgreSQL. It will allow you to build some forgiveness into your queries fairly easily.
Typing "spelling correction postgresql fts" into Google, the top result is a page that links to just such a topic.
It suggests using a separate table of all the valid words in your database and running search terms against it to suggest corrections. Trigram matching lets you measure how "similar" the real words in your table are to the supplied search terms.
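A minimal sketch of that suggestion using the built-in pg_trgm extension (the documents table, body column, and the misspelled term are hypothetical; ts_stat is just one way to harvest the word list):

-- pg_trgm provides the % similarity operator and the similarity() function
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Collect every distinct word from the searchable text into a lookup table
CREATE TABLE search_words AS
SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', body) FROM documents');

-- A trigram index makes the % operator fast
CREATE INDEX search_words_trgm_idx ON search_words USING gin (word gin_trgm_ops);

-- Suggest corrections for a misspelled term such as 'exampel'
SELECT word, similarity(word, 'exampel') AS sml
FROM search_words
WHERE word % 'exampel'
ORDER BY sml DESC
LIMIT 5;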

Clustering structured (numeric) and text data simultaneously

Folks,
I have a bunch of documents (approx. 200k) that have a title and an abstract. There is other metadata available for each document, for example category (exactly one of cooking, health, exercise, etc.) and genre (exactly one of humour, action, anger, etc.). The metadata is well structured, and all of it is available in a MySQL DB.
I need to show the user related documents while she is reading one of these documents on our site. I also need to give the product managers weightings for title, abstract, and metadata to experiment with this service.
I am planning to run clustering on top of this data, but I am hampered by the fact that all the Mahout clustering examples use either DenseVectors built from numbers or Lucene-based text vectorization.
The examples cover either numeric data only or text data only. Has anyone solved this kind of problem before? I have been reading the Mahout in Action book and the Mahout wiki, without much success.
I could do this from first principles: extract all titles and abstracts into a DB, calculate TF-IDF and LLR, treat each word as a dimension, and go about the experiment with a lot of code writing. That seems like a long way to the solution.
That, in a nutshell, is where I am stuck: am I doomed to first principles, or is there a tool or methodology that I somehow missed? I would love to hear from folks out there who have solved a similar problem.
Thanks in advance
You have a text similarity problem here, and I think you're thinking about it correctly. Just follow any example concerning text. Is it really a lot of code? Once you count the words in the docs you're mostly done; then feed that into whatever clusterer you want. Term extraction is not something you do with Mahout, though there are certainly libraries and tools that are good at it.
I'm actually working on something similar, but without the need to distinguish between numeric and text fields.
I have decided to go with the semanticvectors package, which handles the TF-IDF part, the building of the semantic vector space, and the similarity search. It uses a Lucene index.
Please note that you can also use the s-space package if semanticvectors doesn't suit you (if you go down that road, of course).
The only caveat I'm facing with this approach is that indexing can't be incremental: I have to reindex everything every time a new document is added or an old document is modified. People using semanticvectors say they have very good indexing times, but I don't know how large their corpora are. I'm going to test these issues with the Wikipedia dump to see how fast it can be.

Is there a way to get around space usage issues when using long field names in MongoDB?

It looks like having descriptive field names (the ones I like the most) can take up a lot of space in big collections, since MongoDB stores the field names in every document. I don't like the idea of giving fields short, cryptic names to save memory, nor do I like the idea of translating field names to shortened ones somewhere in the application.
Is there a way to tell mongo not to store every field name as text?
For now the only thing you can do is vote and wait for SERVER-863 to be resolved. After almost a year of discussion, the status of this issue has been changed to "planned but not scheduled"...
The workaround is to use a document mapping library like Spring Data Document or Morphia (in the Java world) and work with nicely named objects. The underlying database field names are still cryptic, though.
If you are using an "object-document mapper" library to access MongoDB, many of them provide facilities for using descriptive names within your application code, but storing short names in the database. If your application has a data access layer, it may be possible for you to implement this logic in your application code, as well.
Since you haven't said what language you're using, or whether you're using an ODM at all, I can't provide any more guidance on which ODMs might fit your needs.

Full Text Searching in Apple's Core Data Framework

I would like to implement full-text search in an iPhone application. I have data stored in an SQLite database that I access via the Core Data framework. Just using predicates and ORing a bunch of "contains[cd]" phrases for every search word and column does not work well at all.
What have you done that seems to work well?
We have FTS3 working very nicely on 150,000+ records. We are getting subsecond query times returning over 200 results on a single keyword query.
Presently the only way to get SQLite FTS3 working on the iPhone is to compile your own binary and link it into your project. To my knowledge, a binary included in your own project will not work with Core Data. Perhaps Apple will turn on the FTS3 compiler option in a future release?
You can still link in your own SQLite FTS3 binary and use it just for full-text searches. This would be very similar to the way Sphinx or Lucene are used in web-app environments. Note that you will still have to update the search index at some point to keep it in sync with the Core Data stores.
Good luck!
I assume that by "does not work well" you mean 'performs badly'. Full-text search is always relatively slow, especially in memory or space constrained environments. You may be able to speed things up by making sure the attributes you're searching against are indexed and using BEGINSWITH[cd] instead of CONTAINS[cd]. My recollection (can't find the cocoa-dev post at this time) is that SQLite will use the index for prefix matching, but falls back to linear search for infix searches.
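At the SQL level, the distinction described above looks roughly like this (the documents table is hypothetical, and SQLite only applies its LIKE index optimization under certain case-sensitivity and collation settings):

-- Prefix match: SQLite can satisfy this from an index on title
SELECT id FROM documents WHERE title LIKE 'appl%';

-- Infix match (the CONTAINS equivalent): forces a scan of every row
SELECT id FROM documents WHERE title LIKE '%appl%';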
I use contains[cd] in my predicate and this works fine. Perhaps you could post your predicate and we could see if there's an obvious fault.
SQLite has its own full-text indexing module: http://sqlite.org/fts3.html
You have to have full control of the SQL you send to the DB (I don't know how Core Data works in this respect), but using the full-text indexing module is key to speed of execution and to simplicity in the SQL SELECT statements that do full-text searching.
Using CONTAINS is fine if you don't need fast execution, but selects made with it can't make use of regular indexes, so they are destined to be slow, and the larger the database, the slower they will be. Using real full-text indexing allows the same sort of searches as CONTAINS, but everything is indexed for fast results even with large DBs.
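For reference, a minimal FTS3 sketch (table and column names are hypothetical; the syntax follows the SQLite FTS3 documentation linked above):

-- Build the full-text index as a virtual table
CREATE VIRTUAL TABLE docs_fts USING fts3(title, body);

-- Populate it from the source table, preserving the row ids
INSERT INTO docs_fts(docid, title, body)
SELECT id, title, body FROM documents;

-- MATCH uses the full-text index instead of scanning every row
SELECT docid FROM docs_fts WHERE docs_fts MATCH 'apple laptop';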
I've been working on this same problem and just got around to following up on my post about it from a few weeks ago. Instead of using CONTAINS, I created a separate entity with an instance for each canonicalized word. I added an index on the words (in the Xcode model builder) and can then use a BEGINSWITH operator to exploit the index. Nevertheless, as I just posted a few minutes ago, query time is still very slow even for small data sets.
There must be a better way! After all, we see this sort of full text search in lots of apps!