PostgreSQL (Full Text Search) vs ElasticSearch

Hi, I am doing some research before I implement a search feature in my service.
I'm currently using PostgreSQL as my main storage. I could definitely use PostgreSQL's built-in full-text search, but the problem is that my data is scattered across several tables.
My service is an e-commerce website. So if a customer searches for "good apple laptop", I need to join the brand table, the post table and the review table (one post is a combination of several reviews plus a short summary) to fully search all posts. If I were to use Elasticsearch, I could preprocess the data and insert complete posts.
From my research, some people say PostgreSQL's FTS and Elasticsearch have similar performance, and some say Elasticsearch is faster. Which would be the better solution for my case?
Thanks in advance

If PostgreSQL is already in your stack, the best option for you is to use PostgreSQL full-text search.
Why full-text search (FTS) in PostgreSQL?
Because otherwise you have to feed database content to external search engines.
External search engines (e.g. elasticsearch) are fast BUT:
They can't index all documents - could be totally virtual
They don't have access to attributes - no complex queries
They have to be maintained — headache for DBA
Sometimes they need to be certified
They don't provide instant search (need time to download new data and reindex)
They don't provide consistency — search results can be already deleted from database
If you want to read more about FTS in PostgreSQL there's a great presentation by Oleg Bartunov (I extracted the list above from here): "Do you need a Full-Text Search in PostgreSQL ?"
This is a short example of how you can create a "document" (read the text search documentation) from more than one table in SQL:
SELECT to_tsvector(posts.summary || ' ' || brands.name)
FROM posts
INNER JOIN brands ON (brand_id = brands.id);
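To search such a document rather than just build it, the tsvector is usually matched against a tsquery with the @@ operator. Here is a minimal sketch in Python that only assembles the SQL text and its parameters. It reuses the table and column names from the example above; the posts.id column and the %s placeholder style (as used by drivers such as psycopg) are assumptions, and nothing is executed against a database.

```python
def build_post_search(user_query, config="english"):
    """Return (sql, params) for a full-text search over posts joined with brands.

    Assumptions: a posts.id column exists and the driver uses %s placeholders.
    coalesce() keeps a NULL summary or brand name from turning the whole
    concatenated document into NULL.
    """
    sql = (
        "SELECT posts.id "
        "FROM posts "
        "INNER JOIN brands ON posts.brand_id = brands.id "
        "WHERE to_tsvector(%s::regconfig, "
        "coalesce(posts.summary, '') || ' ' || coalesce(brands.name, '')) "
        "@@ plainto_tsquery(%s::regconfig, %s)"
    )
    return sql, (config, config, user_query)


sql, params = build_post_search("good apple laptop")
print(sql)
print(params)
```

The search text travels as a parameter and is never concatenated into the SQL string, so the customer's input cannot change the shape of the query.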
If you are using Django for your e-commerce website you can also read this article I wrote on "Full-Text Search in Django with PostgreSQL"

I've found a 2021 study with some benchmarks:
[Figure: PostgreSQL vs ElasticSearch performance graph]
and a useful conclusion:
With each new version of PostgreSQL, the search response time is improving, and it is proceeding toward an apple to apple comparison when compared with ElasticSearch. So, if the project is not going to have millions of records or large-scale data, Postgresql Full-Text Search would be the best option to opt for.

Short Answer: Elasticsearch is better
Explanation:
PostgreSQL and Elasticsearch are two different types of databases. Elasticsearch is built for document search, while PostgreSQL is a traditional RDBMS. No matter how well PostgreSQL does at full-text search, Elasticsearch is designed to search enormous volumes of text and documents (or records), and the more data you need to search, the larger its performance advantage over PostgreSQL becomes. Additionally, you can get further benefits and great performance if you pre-process the posts into several well-indexed fields before storing them in Elasticsearch.
If you definitely need a full-text feature inside a relational database, you may consider MSSQL, which may do better than PostgreSQL.
Reply to comments: comparing the properties of these different types of databases should be common sense. The OP didn't say how much data is stored. If the data to search is small, either Postgres or ES is OK. However, if the transactions and the data repository grow larger in the future, ES will provide benefits.
You could check this site to see the current ranking of each type of DB, and choose the best one for your requirements, architecture and the future data growth of your applications.

Related

Replace PostgreSQL with MongoDB?

My client has an existing PostgreSQL database with around 100 tables and most every table has one or more relationships to other tables. He's got around a thousand customers who use an app that hits that database.
Recently he hired a new frontend web developer, and that person is trying to tell him that we should throw out the PostgreSQL database and replace it with a MongoDB solution. That seems odd to me, but I don't have experience with MongoDB.
Are there any clear reasons why he should, or should not, make the change? Obviously I'm arguing against it and the other guy for it, but I would like to remove the "I like this one better" from the argument and really hear from the community on their experience with such things.
1) Performance
In recent years, there have been several benchmarks comparing Postgres and Mongo.
Here you can find the most recent performance benchmark (Yahoo): https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional (start with slide #58, where an overview of past benchmarks is given).
Note that MongoDB traditionally published benchmarks in which write-ahead logging was not turned on, or fsync was even turned off, so those benchmarks were unfair: in that state the database system doesn't wait for the filesystem, so TPS is high but the probability of losing data is also very high.
2) Flexibility – JSON
Postgres has had non-structured and semi-structured data types since 2003 (hstore, XML, array data types). It now has very strong JSON support with indexing (the jsonb data type): you can create partial indexes and functional indexes, index only part of a JSON document, or index whole documents in different ways (you can tweak an index to reduce its size and improve its speed).
More interestingly, with Postgres, you can combine relational approach and non-relational JSON data – see this talk again https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional for details. This gives you a lot of flexibility and power (I wouldn't keep money-related or basic accounts-related data in JSON format).
3) Standards and costs of support
SQL is experiencing a rebirth now: NoSQL products have started to add SQL dialects, a lot of people do big data analysis with SQL, and you can even run machine learning algorithms inside an RDBMS (see the MADlib project, http://madlib.incubator.apache.org).
When you need to work with data, SQL was, is and will long remain the best language. So much is built into it that all other languages lag far behind. I recommend http://modern-sql.com/ to learn modern SQL features and https://use-the-index-luke.com (from the same author) to learn how to reach the best performance using SQL.
When Mongo needed to create "BI connector", they also needed to speak SQL, so guess what they chose? https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes
SQL isn't going anywhere. It is now being extended with SQL/JSON, which means that Postgres is an excellent choice for the future.
4) Scalability
If your data size is up to several terabytes, it's easy to live with a "single master, multiple replicas" architecture, either on your own installation or in the cloud (Amazon RDS, Heroku, Google Cloud Platform and, more recently, Azure all support Postgres). There is an increasing number of solutions that help you work with a microservice architecture, get automatic failover, and/or shard your data. Here are only a few of them that are actively developed and supported, in no particular order:
https://wiki.postgresql.org/wiki/PL/Proxy
https://github.com/zalando/spilo and https://github.com/zalando/patroni
https://github.com/dalibo/PAF
https://github.com/postgrespro/postgres_cluster
https://www.2ndquadrant.com/en/resources/bdr/
https://www.postgresql.org/docs/10/static/postgres-fdw.html
5) Extensibility
There are many more additional projects built to work with Postgres than with Mongo. You can work with literally any data type (including but not limited to time ranges, geospatial data, JSON, XML and arrays), have index support and ACID guarantees for it, and manipulate it using standard SQL. You can develop your own functions, data types, operators, index structures and much more!
If your data is relational (and it appears that it is), it makes no sense whatsoever to use a non-relational db (like mongodb). You can't underestimate the power and expressiveness of standard SQL queries.
On top of that, postgres has full ACID. And it can handle free-form JSON reasonably well, if that is that guy's primary motivation.

MongoDB + Elasticsearch or only Elasticsearch?

We have a new project that needs to index a large amount of data and serve it in real time. I also have complex searches with facets, full text, geospatial queries...
The first prototype indexes into MongoDB and then into Elasticsearch, because I had read that Elasticsearch does not apply a checksum to stored files and that the index can't be fully trusted.
But recent versions (since version 1.5) do have a checksum, so I'm wondering whether we can use Elasticsearch as the primary data store. And what is the benefit of using MongoDB in addition to Elasticsearch?
I can't find an up-to-date answer about those features in Elasticsearch.
Thanks a lot
Talking about arguments to use Mongo instead of/together with ES:
User/role management.
Built into MongoDB. It may not fit all your needs and may be clumsy in places, but it exists and was implemented a long time ago.
The only thing for security in ES is Shield, but it ships only with a Gold/Platinum subscription for production use.
Schema
ES is schemaless, but it's built on top of Lucene and written in Java. The core idea of the tool is to index and search documents, and working this way requires index consistency. On the back end, all documents have to fit into a flat Lucene index, which requires some understanding of how ES deals with your nested documents and values, and of how you should organize your indexes to balance speed against data completeness/consistency. Working with ES requires you to keep the schema in mind constantly. For example, you can index almost anything into ES without defining a mapping in advance, and ES will "guess" the mapping on the fly, but sometimes it guesses wrong, and an implicit mapping can be evil because once it is set, it can't be changed without reindexing the whole index. So it's better not to treat ES as a schemaless store, because sooner or later you will step on a rake (and it will hurt :) ), but rather to treat it as schema-intensive, at least when you work with documents that can be sliced into concrete fields.
Mongo, on the other hand, can "chew and leave no crumbs" out of almost anything you put in it. And most of your queries will work fine, as long as you remember how Mongo deals with your data from a JavaScript perspective. And since JS is weakly typed, you can have a truly schemaless workflow (if you need one, of course).
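One way to avoid the implicit-mapping trap described above is to declare the mapping before indexing anything. This is a minimal sketch in Python that only builds the request body for creating an index. The field names (title, price, created_at) are hypothetical, and the layout assumes Elasticsearch 7 or later, where mappings no longer carry a type name.

```python
import json


def explicit_mapping_body():
    """Build the JSON body for creating an index with an explicit mapping."""
    return {
        "mappings": {
            # "strict" rejects documents that contain undeclared fields
            # instead of letting Elasticsearch guess a type for them.
            "dynamic": "strict",
            "properties": {
                "title": {"type": "text"},        # hypothetical field
                "price": {"type": "float"},       # hypothetical field
                "created_at": {"type": "date"},   # hypothetical field
            },
        }
    }


print(json.dumps(explicit_mapping_body(), indent=2))
```

The body would be sent when the index is created; changing a field's type later still means reindexing, which is exactly why it pays to decide the types up front.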
Handling non-table-like data.
ES has limited support for handling data without putting it into the search index. That is good enough when you need to store and retrieve some extra data (in addition to the data you want to search against).
MongoDB supports GridFS. This gives you the ability to handle large chunks of data behind the same interface, i.e. you can store binary data in Mongo and retrieve it through the same interface from your code's perspective.
Well, choose the right tool for the right job. If you require searching capabilities such as full text search, faceting etc, then nothing can beat a full fledged search engine. ElasticSearch(ES) or Solr is just a matter of choice.
You can actually feed(index) documents into ES for searching and then fetch the complete details for a particular entry from MongoDB or any other database.
I can make your task easier, do take a look at my open source work that's using MongoDB, ES, Redis and RabbitMQ, all integrated at one place, here on github
Please note that the application is built in .Net C#.
After having used Elasticsearch in production, I can add a few notes to this thread:
We secured our Elasticsearch cluster via a reverse proxy which checks client certificate authenticity at request time before letting the query in, which shows that there are multiple ways to add authentication anyway. (If you need finer-grained security, such as roles, there are a few plugins that can be added to manage permissions.)
Elasticsearch mapping and settings (tuning) are really important concepts to fully understand before going to production with it, and it's not that easy to quickly grasp how everything works.
Clustering and horizontal scaling are very flexible and easy to set up.
The suite of tools (Kibana, Beats, etc.) is a very convenient way to gather logs, expose key data, etc.
Search features are extremely advanced; you can really do amazing things once you understand a bit of how full-text search works (fuzziness, boosting, scoring, stemming, tokenizers, analyzers, and so on).
The APIs are a bit scattered and there is no single way to achieve something. And some APIs are really WTF to use, like the bulk insert API: you need to pass binary data in JSON format (and of course don't forget the end-of-line characters), repeating some fields multiple times. This is very verbose, and I guess it's legacy code like we all have in our projects ;).
Last thing: if you develop a Java project, do not use Hibernate Search to duplicate data from a datasource to your ES cluster. We had so many issues with Hibernate Search that if we had to do it again, we'd do it manually.
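For readers who have not seen the bulk API mentioned in the notes above: the body is newline-delimited JSON, one action line followed by one source line per document, and the "binary" part usually just means sending the body unmodified (for example with curl's --data-binary) so the newlines survive. The sketch below only builds such a body in Python. The index name posts and the document fields are hypothetical, and the typeless action line assumes a recent Elasticsearch version.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize docs into the newline-delimited JSON that _bulk expects.

    docs is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        # Action line: what to do and where. This is the part that gets
        # repeated for every single document.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, serialized on one line.
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body(
    "posts",  # hypothetical index name
    [("1", {"title": "Good Apple laptop"}), ("2", {"title": "Budget laptop"})],
)
print(body, end="")
```

The body is then posted to the _bulk endpoint with the Content-Type application/x-ndjson.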
Now about the real question :
To my mind, using only Elasticsearch is sufficient and may reduce the complexity of having multiple NoSQL storage systems.
I think the combination is worthwhile when you pair a relational, transactional database with a NoSQL search engine, but having two systems which roughly serve the same purpose is a bit overkill.
I recently developed a feature at my company:
we wanted to perform some searches and rank the results by relevance according to multiple factors and conditions.
In my application, we were already using MongoDB as the database,
so in the Elasticsearch index I exported the MongoDB fields that I wanted to search and filter on.
Then, according to the required conditions, I prepared both a Mongo query and an Elasticsearch query and performed the search. Finally, I filtered and sorted the results according to my needs.
The whole flow was designed in such a way that
even if there is an error from ES, Mongo will still fetch the records.
If I get a result from ES, then the Mongo result depends on the ES result.
This is how I used Mongo and ES in combination.
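The flow described above can be sketched roughly as follows. This is not the answerer's actual code: the three callables are hypothetical stand-ins for an Elasticsearch query, a MongoDB lookup by id and a plain MongoDB search, so the sketch runs without either system installed.

```python
def search_with_fallback(query, search_es, fetch_by_ids, search_mongo):
    """Return records for query, preferring the Elasticsearch ranking.

    search_es(query)    -> list of ids ordered by relevance (may raise)
    fetch_by_ids(ids)   -> list of records (dicts with an "id" key), any order
    search_mongo(query) -> list of records, used only when ES fails
    """
    try:
        ids = search_es(query)
    except Exception:
        # ES is unavailable or returned an error: fall back to the database.
        return search_mongo(query)

    records = {rec["id"]: rec for rec in fetch_by_ids(ids)}
    # Keep the ES ordering and skip ids that no longer exist in the
    # database (deleted after the index was last updated).
    return [records[i] for i in ids if i in records]


# Tiny in-memory stand-ins to show both paths.
db = {1: {"id": 1, "name": "pear"}, 2: {"id": 2, "name": "apple"}}
ranked = search_with_fallback(
    "apple",
    search_es=lambda q: [2, 3, 1],  # id 3 was deleted from the database
    fetch_by_ids=lambda ids: [db[i] for i in ids if i in db],
    search_mongo=lambda q: [db[2]],
)
print(ranked)
```

Skipping ids that are missing from the database also covers the stale-index case, where ES still returns a document that has already been deleted.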
Also, don't forget to properly handle all updates, deletes and new record insertions.
And just so you know, the results for me were really good.

full text search in java web application [duplicate]

I'm building a Django site and I am looking for a search engine.
A few candidates:
Lucene/Lucene with Compass/Solr
Sphinx
Postgresql built-in full text search
MySQL built-in full text search
Selection criteria:
result relevance and ranking
searching and indexing speed
ease of use and ease of integration with Django
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
scalability
extra features such as "did you mean?", related searches, etc
Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.
EDIT: As for indexing needs, as users keep entering data into the site, those data would need to be indexed continuously. It doesn't have to be real time, but ideally new data would show up in index with no more than 15 - 30 minutes delay
Good to see someone's chimed in about Lucene - because I've no idea about that.
Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.
Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.
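The main-plus-delta idea can be illustrated with a toy in-memory index. This is not Sphinx's API or configuration, just a sketch of the concept in Python with hypothetical names: a large index that is rebuilt rarely, a small one that is rebuilt after every change, and a search that consults both.

```python
from collections import defaultdict


class MainPlusDelta:
    """Toy main + delta index: a big index rebuilt rarely, a small one rebuilt often."""

    def __init__(self):
        self.main = defaultdict(set)   # term -> doc ids, rebuilt e.g. nightly
        self.delta = defaultdict(set)  # term -> doc ids added since then

    @staticmethod
    def _build(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def rebuild_main(self, all_docs):
        # Expensive full re-index; afterwards the delta starts empty again.
        self.main = self._build(all_docs)
        self.delta = defaultdict(set)

    def rebuild_delta(self, recent_docs):
        # Cheap: covers only the documents added since the last main rebuild.
        self.delta = self._build(recent_docs)

    def search(self, term):
        term = term.lower()
        return self.main.get(term, set()) | self.delta.get(term, set())


idx = MainPlusDelta()
idx.rebuild_main({1: "red apple", 2: "green pear"})
idx.rebuild_delta({3: "Apple laptop"})
print(sorted(idx.search("apple")))
```

This toy only handles additions. A real setup also has to keep stale main-index entries from showing up for documents that were edited or deleted since the last full rebuild.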
I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.
I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.
I don't know Sphinx, but as for Lucene vs a database full-text search, I think that Lucene performance is unmatched. You should be able to do almost any search in less than 10 ms, no matter how many records you have to search, provided that you have set up your Lucene index correctly.
Here comes the biggest hurdle though: personally, I think integrating Lucene in your project is not easy. Sure, it is not too hard to set it up so you can do some basic search, but if you want to get the most out of it, with optimal performance, then you definitely need a good book about Lucene.
As for CPU & RAM requirements, performing a search in Lucene doesn't tax your CPU too much. Indexing your data does, but you don't do that too often (maybe once or twice a day), so that isn't much of a hurdle.
It doesn't answer all of your questions but in short, if you have a lot of data to search, and you want great performance, then I think Lucene is definitely the way to go. If you're not going to have that much data to search, then you might as well go for a database full-text search. Setting up a MySQL full-text search is definitely easier in my book.
Apache Solr
Apart from answering the OP's queries, let me share some insights on Apache Solr, from a simple introduction to detailed installation and implementation.
Simple Introduction
Anyone who has had experience with the search engines above, or other engines not in the list -- I would love to hear your opinions.
Solr shouldn't be used to solve real-time problems. As a search engine, though, Solr is well up to the job and works flawlessly.
Solr works fine on High Traffic web-applications (I read somewhere that it is not suited for this, but I am backing up that statement). It utilizes the RAM, not the CPU.
result relevance and ranking
Boosting helps you rank certain results so that they show up on top. Say you're searching for the name john in the fields firstname and lastname, and you want to give more relevance to the firstname field; then you need to boost the firstname field as shown.
http://localhost:8983/solr/collection1/select?q=firstname:john^2+OR+lastname:john
As you can see, the firstname field is boosted by a factor of 2.
More on SolrRelevancy
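When a boosted query like the one above is built in code, both clauses have to stay inside the single q parameter and the special characters need URL-encoding; a bare & would start a new URL parameter instead of adding a clause. This is a small Python sketch using only the standard library. The helper name solr_select_url is hypothetical, and the host and core are the ones from the example.

```python
from urllib.parse import urlencode


def solr_select_url(base, clauses, operator="OR", **params):
    """Build a /select URL whose q parameter combines several field clauses."""
    query = f" {operator} ".join(clauses)
    # urlencode escapes ':', '^' and spaces, so the whole expression stays
    # inside q instead of being split into separate URL parameters.
    return f"{base}/select?{urlencode({'q': query, **params})}"


url = solr_select_url(
    "http://localhost:8983/solr/collection1",
    ["firstname:john^2", "lastname:john"],
    wt="json",
)
print(url)
```

The same helper covers the other query URLs in this answer by passing a single clause and extra parameters such as wt or indent.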
searching and indexing speed
The speed is unbelievably fast, with no compromise there. That's the reason I moved to Solr.
Regarding indexing speed, Solr can also handle JOINs from your database tables. More numerous and more complex JOINs do affect the indexing speed; however, a generous RAM configuration can easily handle this situation.
The more RAM, the faster Solr's indexing speed.
ease of use and ease of integration with Django
I've never attempted to integrate Solr and Django; however, you can do that with Haystack. I found an interesting article on this, and here's the GitHub repository for it.
resource requirements - site will be hosted on a VPS, so ideally the search engine wouldn't require a lot of RAM and CPU
Solr thrives on RAM, so if you have plenty of RAM, you don't have to worry about Solr.
Solr's RAM usage shoots up during a full index if you have billions of records; you could make smart use of delta imports to handle this situation. As explained, Solr is only a near-real-time solution.
scalability
Solr is highly scalable. Have a look on SolrCloud.
Some key features of it.
Shards (sharding is the concept of distributing the index among multiple machines, say if your index has grown too large)
Load balancing (if SolrJ is used with SolrCloud, it automatically takes care of load balancing using its round-robin mechanism)
Distributed Search
High Availability
extra features such as "did you mean?", related searches, etc
For the above scenario, you could use the SpellCheckComponent that ships with Solr. There are a lot of other features: for example, the SnowballPorterFilterFactory helps retrieve records so that if you type books instead of book, you will still be presented with results related to book.
This answer broadly focuses on Apache Solr & MySQL. Django is out of scope.
Assuming that you are in a Linux environment, you can proceed with the rest of this article (mine was Ubuntu 14.04).
Detailed Installation
Getting Started
Download Apache Solr from here. The version used here is 4.8.1. You could download a newer version; I found this one stable.
After downloading the archive, extract it to a folder of your choice.
Say Downloads or whatever, so it will look like Downloads/solr-4.8.1/
At your prompt, navigate into the directory:
shankar#shankar-lenovo: cd Downloads/solr-4.8.1
So now you are here ..
shankar#shankar-lenovo: ~/Downloads/solr-4.8.1$
Start the Jetty Application Server
Jetty is available inside the example folder of the solr-4.8.1 directory, so navigate into it and start the Jetty application server.
shankar#shankar-lenovo:~/Downloads/solr-4.8.1/example$ java -jar start.jar
Now, do not close the terminal; minimize it and leave it running.
(TIP: Use & after start.jar to make the Jetty server run in the background.)
To check if Apache Solr runs successfully, visit this URL on the browser. http://localhost:8983/solr
Running Jetty on custom Port
It runs on port 8983 by default. You can change the port either on the command line, as here, or directly inside the jetty.xml file.
java -Djetty.port=9091 -jar start.jar
Download the JConnector
This JAR file acts as a bridge between MySQL and JDBC. Download the platform-independent version here.
After downloading it, extract the folder, copy mysql-connector-java-5.1.31-bin.jar and paste it into the lib directory.
shankar#shankar-lenovo:~/Downloads/solr-4.8.1/contrib/dataimporthandler/lib
Creating the MySQL table to be linked to Apache Solr
To put Solr to use, you need some tables and data to search. For that, we will use MySQL to create a table and insert some random names, and then use Solr to connect to MySQL and index that table and its entries.
1.Table Structure
CREATE TABLE test_solr_mysql
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NULL,
created TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id)
);
2.Populate the above table
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jean');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jack');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jason');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Vego');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Grunt');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jasper');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Fred');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Jenna');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Rebecca');
INSERT INTO `test_solr_mysql` (`name`) VALUES ('Roland');
Getting inside the core and adding the lib directives
1.Navigate to
shankar#shankar-lenovo: ~/Downloads/solr-4.8.1/example/solr/collection1/conf
2.Modifying the solrconfig.xml
Add these two directives to this file..
<lib dir="../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" />
Now add the DIH (Data Import Handler)
<requestHandler name="/dataimport"
class="org.apache.solr.handler.dataimport.DataImportHandler" >
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
3.Create the db-data-config.xml file
If the file already exists, skip creating it and just add these lines to it. As you can see in the first line, you need to provide the credentials of your MySQL database: the database name, username and password.
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/yourdbname" user="dbuser" password="dbpass"/>
<document>
<entity name="test_solr" query="select CONCAT('test_solr-',id) as rid,name from test_solr_mysql WHERE '${dataimporter.request.clean}' != 'false'
OR `created` > '${dataimporter.last_index_time}'" >
<field name="id" column="rid" />
<field name="solr_name" column="name" />
</entity>
</document>
</dataConfig>
(TIP: You can have any number of entities, but watch out for the id field: if ids are the same, indexing will be skipped.)
4.Modify the schema.xml file
Add this to your schema.xml as shown..
<uniqueKey>id</uniqueKey>
<field name="solr_name" type="string" indexed="true" stored="true" />
Implementation
Indexing
This is where the real deal is. You need to index the data from MySQL into Solr in order to make use of Solr queries.
Step 1: Go to Solr Admin Panel
Open the URL http://localhost:8983/solr in your browser. The admin screen opens.
Go to Logging in order to check whether any of the above configuration has led to errors.
Step 2: Check your Logs
OK, so now you are here. As you can see, there are a lot of yellow messages (WARNINGS). Make sure you don't have any error messages marked in red. Earlier, in our configuration, we added a select query to db-data-config.xml; if there were any errors in that query, they would show up here.
Fine, no errors. We are good to go. Let's choose collection1 from the list as depicted and select Dataimport
Step 3: DIH (Data Import Handler)
Using the DIH, you connect to MySQL from the Solr interface through the configuration file db-data-config.xml and retrieve the 10 records from the database, which get indexed into Solr.
To do that, Choose full-import , and check the options Clean and Commit. Now click Execute as shown.
Alternatively, you could use a direct full-import query like this too..
http://localhost:8983/solr/collection1/dataimport?command=full-import&commit=true
After you click Execute, Solr begins to index the records. If there are any errors, it will say Indexing Failed and you have to go back to the Logging section to see what went wrong.
Assuming there are no errors with this configuration and the indexing completes successfully, you will get a success notification.
Step 4: Running Solr Queries
It seems everything went well, so now you can use Solr queries to query the data that was indexed. Click Query on the left and then press the Execute button at the bottom.
You will see the indexed records as shown.
The corresponding Solr query for listing all the records is
http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true
Well, there go all 10 indexed records. Say we need only the names starting with Ja; in that case you need to target the field solr_name, so your query goes like this.
http://localhost:8983/solr/collection1/select?q=solr_name:Ja*&wt=json&indent=true
That's how you write Solr Queries. To read more about it, Check this beautiful article.
I am surprised that there isn't more information posted about Solr. Solr is quite similar to Sphinx but has more advanced features (AFAIK as I haven't used Sphinx -- only read about it).
The answer at the link below details a few things about Sphinx which also applies to Solr.
Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Solr also provides the following additional features:
1. Supports replication
2. Multiple cores (think of these as separate databases with their own configuration and own indexes)
3. Boolean searches
4. Highlighting of keywords (fairly easy to do in application code if you have regex-fu; however, why not let a specialized tool do a better job for you)
5. Update index via XML or delimited file
6. Communicate with the search server via HTTP (it can even return Json, Native PHP/Ruby/Python)
7. PDF, Word document indexing
8. Dynamic fields
9. Facets
10. Aggregate fields
11. Stop words, synonyms, etc.
12. More Like this...
13. Index directly from the database with custom queries
14. Auto-suggest
15. Cache Autowarming
16. Fast indexing (compare to MySQL full-text search indexing times) -- Lucene uses a binary inverted index format.
17. Boosting (custom rules for increasing relevance of a particular keyword or phrase, etc.)
18. Fielded searches (if a search user knows the field he/she wants to search, they narrow down their search by typing the field, then the value, and ONLY that field is searched rather than everything -- much better user experience)
BTW, there are tons more features; however, I've listed just the features that I have actually used in production. BTW, out of the box, MySQL supports #1, #3, and #11 (limited) on the list above. For the features you are looking for, a relational database isn't going to cut it. I'd eliminate those straight away.
Also, another benefit is that Solr (well, Lucene actually) is a document database (e.g. NoSQL) so many of the benefits of any other document database can be realized with Solr. In other words, you can use it for more than just search (i.e. Performance). Get creative with it :)
I'm looking at PostgreSQL full-text search right now, and it has all the right features of a modern search engine, really good extended character and multilingual support, nice tight integration with text fields in the database.
But it doesn't have user-friendly search operators like + or AND (it uses & | !), and I'm not thrilled with how it works on their documentation site. While it bolds the matched terms in the result snippets, the default algorithm for choosing which terms to highlight is not great. Also, if you want to index RTF, PDF or MS Office files, you have to find and integrate a file format converter.
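The operator gap mentioned above is often bridged in application code by translating what the user typed into tsquery syntax. The function below is a toy Python sketch of that idea, not a complete parser: the function name is hypothetical, operators are matched case-insensitively, and there is no support for parentheses or phrases. Newer PostgreSQL releases (11 and later) also ship websearch_to_tsquery, which accepts a web-style syntax directly.

```python
def to_tsquery_syntax(user_query):
    """Translate 'a AND b OR c NOT d' style input into tsquery operators.

    Toy translation: adjacent terms are joined with &, and terms are reduced
    to alphanumerics so they cannot smuggle in operators of their own.
    """
    operators = {"AND": "&", "OR": "|"}
    out = []
    negate = False
    for raw in user_query.split():
        word = raw.upper()
        if word in operators:
            # Ignore a leading or doubled operator.
            if out and out[-1] not in operators.values():
                out.append(operators[word])
            continue
        if word == "NOT":
            negate = True
            continue
        term = "".join(ch for ch in raw if ch.isalnum()).lower()
        if not term:
            continue
        if out and out[-1] not in operators.values():
            out.append("&")  # implicit AND between adjacent terms
        out.append(("!" if negate else "") + term)
        negate = False
    # Drop a dangling operator such as a trailing AND.
    if out and out[-1] in operators.values():
        out.pop()
    return " ".join(out)


print(to_tsquery_syntax("good AND apple laptop NOT cheap"))
```

The resulting string is meant to be passed to to_tsquery as a bound parameter, never concatenated into the SQL text.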
OTOH, it's way better than the MySQL text search, which doesn't even index words of three letters or fewer. It's the default for the MediaWiki search, and I really think it's no good for end-users: http://www.searchtools.com/analysis/mediawiki-search/
In all cases I've seen, Lucene/Solr and Sphinx are really great. They're solid code and have evolved with significant improvements in usability, so the tools are all there to make search that satisfies almost everyone.
for SHAILI - SOLR includes the Lucene search code library and has the components to be a nice stand-alone search engine.
Just my two cents to this very old question. I would highly recommend taking a look at ElasticSearch.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
The advantages over other FTS (full text search) Engines are:
RESTful interface
Better scalability
Large community
Built by Lucene developers
Extensive documentation
There are many open source libraries available (including Django)
We are using this search engine in our project and are very happy with it.
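As a rough illustration of the RESTful JSON interface, the Python sketch below builds (but does not send) a search request using only the standard library. The host, the index name `posts`, and the field name `summary` are assumptions made up for the example; `_search` and the `match` query are part of Elasticsearch's query API.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text):
    """Build a POST request for a full-text 'match' query on one field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local node, index, and field.
request = build_search_request(
    "http://localhost:9200", "posts", "summary", "good apple laptop"
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running node; the response is a JSON document as well.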
SearchTools-Avi said "MySQL text search, which doesn't even index words of three letters or fewer."
FYI, the MySQL full-text minimum word length has been adjustable since at least MySQL 5.0. Google 'mysql fulltext min length' for simple instructions.
That said, MySQL fulltext has limitations: for one, it gets slow to update once you reach a million records or so, ...
I would add mnoGoSearch to the list. It is an extremely performant and flexible solution that works like Google: an indexer fetches data from multiple sites, and you can use basic criteria or write your own hooks to maximize search quality. It can also fetch data directly from the database.
The solution is not well known today, but it covers most needs. You can compile and install it either on a standalone server or even on your main server; it doesn't need as many resources as Solr, because it's written in C and runs perfectly even on small servers.
At first you need to compile it yourself, so it requires some knowledge. I made a tiny script for Debian, which could help. Any adjustments are welcome.
As you are using the Django framework, you could either use a PHP client in the middle or find a solution in Python; I saw some articles.
And, of course mnoGoSearch is open source, GNU GPL.
We just switched from Elasticsearch to Postgres Full Text. Since we have already used Postgres, we now save ourselves the hassle of keeping the index up to date.
But this only affects the full text search. There are, however, use cases where Elasticsearch is significantly better. Maybe facets or something like that.

Log viewing utility database choice

I will be implementing a log viewing utility soon, but I am stuck on the DB choice. My requirements are as follows:
Store 5 GB data daily
Total size of 5 TB data
Search in this log data in less than 10 sec
I know that PostgreSQL will work if I fragment the tables. But will I be able to get the performance described above? As I understand it, NoSQL is a better choice for log storage, since logs are not very structured. I saw an example like the one below, and using hadoop-hbase-lucene seems promising:
http://blog.mgm-tp.com/2010/03/hadoop-log-management-part1/
But before deciding, I wanted to ask whether anybody has made a choice like this before and could give me an idea. Which DBMS will fit this task best?
My logs are very structured :)
I would say you don't need a database, you need a search engine:
Solr: based on Lucene, and it packages everything you need together
ElasticSearch: another Lucene-based search engine
Sphinx: the nice thing is that you can use multiple sources per search index, so you can enrich your raw logs with other events
Scribe: the Facebook way to search and collect logs
Update for #JustBob:
Most of the mentioned solutions can work with flat files without affecting performance. All of them need an inverted index, which is the hardest part to build or maintain. You can update the index in batch mode or on-line. The index can be stored in an RDBMS, NoSQL, or a custom "flat file" storage format (custom meaning maintained by the search engine application).
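To show why the inverted index matters here: scanning 5 TB in 10 seconds would need on the order of 500 GB/s of raw read throughput, whereas an index turns a search into a few lookups. The Python sketch below is a toy in-memory version with both batch and on-line updates; the class name, the whitespace tokenizer, and the sample log lines are illustrative assumptions, not how Solr or Sphinx store their indexes.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the ids of the lines containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of line ids
        self.lines = []                   # line id -> original log line

    def add(self, line):
        """On-line update: index a single line as it arrives."""
        line_id = len(self.lines)
        self.lines.append(line)
        for term in set(line.lower().split()):
            self.postings[term].add(line_id)
        return line_id

    def add_batch(self, lines):
        """Batch update: index many lines in one go."""
        for line in lines:
            self.add(line)

    def search(self, query):
        """Return lines containing every query term, in insertion order."""
        terms = query.lower().split()
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.lines[i] for i in sorted(ids)]


index = InvertedIndex()
index.add_batch([
    "2010-03-01 ERROR disk full",
    "2010-03-01 INFO login ok",
])
index.add("2010-03-02 ERROR login failed")
results = index.search("error login")
```

A real engine adds tokenization, compression, and on-disk segments on top of this idea, which is the part that is hard to build and maintain yourself.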
You can find a lot of information here:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
See which fits your needs.
Anyway for such a task NoSQL is the right choice.
You should also consider the learning curve: MongoDB and CouchDB, even though they don't perform as well as Cassandra or Hadoop, are easier to learn.
MongoDB is used by Craigslist to store old archives: http://www.10gen.com/presentations/mongodb-craigslist-one-year-later

NoSQL good solution for daily data?

I'm developing a restaurant application, and there will be daily orders for particular guests. Because of this daily volume of data, I thought of using a NoSQL database, e.g. MongoDB, to avoid a lot of joins in a relational database (e.g. meal of an order for a particular day of a particular guest). Other data, like guest data (first name, last name, ...), would be stored in a relational database.
What do you think? Is a NoSQL Database a good solution for that type of problem?
thanks
I would stick with a traditional RDBMS - unless this is a project to learn/understand MongoDB/other, a normal RDBMS is going to help you achieve what you want much more easily.
Databases in the style of Mongo offer a number of advantages over traditional RDBMSs, but these advantages are only really in areas such as:
handling/processing immense (web-scale?) quantities of not-particularly-structured data
providing very very quick performance on cheaper hardware
providing easy clustering for maximum uptime
The application you describe on the other hand is unlikely to need near-bulletproof uptime, and is also unlikely to need to process/store massive quantities of data quickly.
Your data sounds very structured with clearly-defined relationships, and even a very busy restaurant is not going to produce the amounts of data that would justify sharding/clustering in the MongoDB style of things.
So, unless you are looking for a project to help you learn MongoDB, I would recommend sticking with a traditional database.
This is too weak a description of your requirements to give any hint.
If your data model fits the options of MongoDB (no JOIN, embedded documents, database references) then give it a go.
http://www.mongodb.org/display/DOCS/Schema+Design
In addition, google for "Mongodb Schema Design"; lots of useful slides and blogs come up.
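To illustrate what "embedded documents" could look like for the restaurant case, here is a small Python sketch in which a plain dict stands in for a MongoDB document. The field names and values are made up for the example and are not a recommended schema.

```python
# One order document: guest details and meals are embedded,
# so reading a guest's order for a given day needs no JOIN.
order = {
    "guest": {"first_name": "Ada", "last_name": "Example"},
    "date": "2011-05-01",
    "meals": [
        {"name": "soup", "price": 4.5},
        {"name": "pasta", "price": 9.0},
    ],
}


def order_total(doc):
    """Sum the prices of the meals embedded in one order document."""
    return sum(meal["price"] for meal in doc["meals"])


total = order_total(order)
```

The trade-off is that guest data is duplicated in every order, so this only fits if that duplication is acceptable for your model.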
I'm using both MySQL and MongoDB for my app. Let's face it: as hard as I tried, I still needed some type of join queries. In Mongo, that meant making two calls to the DB, but because it was so quick, I didn't take a performance hit. I store my user session information in MySQL and use Mongo to store user information. What I love about it is the geo-spatial feature.
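The "two calls instead of a join" pattern looks roughly like the Python sketch below. The in-memory `USERS` and `ORDERS` structures are stand-ins for two collections, and all names and sample records are hypothetical; in a real app each `find_*` function would be one query against the database.

```python
# In-memory stand-ins for two collections.
USERS = {10: {"_id": 10, "name": "guest-10"}}
ORDERS = [
    {"_id": 1, "user_id": 10, "meal": "soup"},
    {"_id": 2, "user_id": 10, "meal": "pasta"},
    {"_id": 3, "user_id": 11, "meal": "salad"},
]


def find_user(user_id):
    """First call: fetch the user document by id."""
    return USERS.get(user_id)


def find_orders(user_id):
    """Second call: fetch all orders that reference the user."""
    return [o for o in ORDERS if o["user_id"] == user_id]


def orders_with_user(user_id):
    """Join the two result sets in application code."""
    user = find_user(user_id)
    if user is None:
        return []
    return [{**order, "user_name": user["name"]} for order in find_orders(user_id)]


joined = orders_with_user(10)
```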