NoSQL (MongoDB) vs Lucene (or Solr) as your database [closed] - mongodb

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
With the NoSQL movement growing based on document-based databases, I've looked at MongoDB lately. I have noticed a striking similarity with how to treat items as "Documents", just like Lucene does (and users of Solr).
So, the question: Why would you want to use NoSQL (MongoDB, Cassandra, CouchDB, etc) over Lucene (or Solr) as your "database"?
What I am (and I am sure others are) looking for in an answer is some deep-dive comparisons of them. Let's skip over relational database discussions all together, as they serve a different purpose.
Lucene gives some serious advantages, such as powerful searching and weight systems. Not to mention facets in Solr (which Solr is being integrated into Lucene soon, yay!). You can use Lucene documents to store IDs, and access the documents as such just like MongoDB. Mix it with Solr, and you now get a WebService-based, load balanced solution.
You can even throw in a comparison of out-of-proc cache providers such as Velocity or MemCached when talking about similar data storing and scalability of MongoDB.
The restrictions around MongoDB reminds me of using MemCached, but I can use Microsoft's Velocity and have more grouping and list collection power over MongoDB (I think). Can't get any faster or scalable than caching data in memory. Even Lucene has a memory provider.
MongoDB (and others) do have some advantages, such as the ease of use of their API. New up a document, create an id, and store it. Done. Nice and easy.

This is a great question, something I have pondered over quite a bit. I will summarize my lessons learned:
You can easily use Lucene/Solr in lieu of MongoDB for pretty much all situations, but not vice versa. Grant Ingersoll's post sums it up here.
MongoDB etc. seem to serve a purpose where there is no requirement of searching and/or faceting. It appears to be a simpler and arguably easier transition for programmers detoxing from the RDBMS world. Unless one's used to it Lucene & Solr have a steeper learning curve.
There aren't many examples of using Lucene/Solr as a datastore, but Guardian has made some headway and summarize this in an excellent slide-deck, but they too are non-committal on totally jumping on Solr bandwagon and "investigating" combining Solr with CouchDB.
Finally, I will offer our experience, unfortunately cannot reveal much about the business-case. We work on the scale of several TB of data, a near real-time application. After investigating various combinations, decided to stick with Solr. No regrets thus far (6-months & counting) and see no reason to switch to some other.
Summary: if you do not have a search requirement, Mongo offers a simple & powerful approach. However if search is key to your offering, you are likely better off sticking to one tech (Solr/Lucene) and optimizing the heck out of it - fewer moving parts.
My 2 cents, hope that helped.

You can't partially update a document in solr. You have to re-post all of the fields in order to update a document.
And performance matters. If you do not commit, your change to solr does not take effect, if you commit every time, performance suffers.
There is no transaction in solr.
As solr has these disadvantages, some times NoSQL is a better choice.
UPDATE: Solr 4+ Started supporting commit and soft-commits. Refer to the latest document https://lucene.apache.org/solr/guide/8_5/

We use MongoDB and Solr together and they perform well. You can find my blog post here where i described how we use this technologies together. Here's an excerpt:
[...] However we observe that query performance of Solr decreases when index
size increases. We realized that the best solution is to use both Solr
and Mongo DB together. Then, we integrate Solr with MongoDB by storing
contents into the MongoDB and creating index using Solr for full-text
search. We only store the unique id for each document in Solr index
and retrieve actual content from MongoDB after searching on Solr.
Getting documents from MongoDB is faster than Solr because there is no
analyzers, scoring etc. [...]

Also please note that some people have integrated Solr/Lucene into Mongo by having all indexes be stored in Solr and also monitoring oplog operations and cascading relevant updates into Solr.
With this hybrid approach you can really have the best of both worlds with capabilities such as full text search and fast reads with a reliable datastore that can also have blazing write speed.
It's a bit technical to setup but there are lots of oplog tailers that can integrate into solr. Check out what rangespan did in this article.
http://denormalised.com/home/mongodb-pub-sub-using-the-replication-oplog.html

From my experience with both, Mongo is great for simple, straight-forward usage. The main Mongo disadvantage we've suffered is the poor performance on unanticipated queries (you cannot created mongo indexes for all the possible filter/sort combinations, you simple can't).
And here where Lucene/Solr prevails big time, especially with the FilterQuery caching, Performance is outstanding.

Since no one else mentioned it, let me add that MongoDB is schema-less, whereas Solr enforces a schema. So, if the fields of your documents are likely to change, that's one reason to choose MongoDB over Solr.

#mauricio-scheffer mentioned Solr 4 - for those interested in that, LucidWorks is describing Solr 4 as "the NoSQL Search Server" and there's a video at http://www.lucidworks.com/webinar-solr-4-the-nosql-search-server/ where they go into detail on the NoSQL(ish) features. (The -ish is for their version of schemaless actually being a dynamic schema.)

If you just want to store data using key-value format, Lucene is not recommended because its inverted index will waste too much disk spaces. And with the data saving in disk, its performance is much slower than NoSQL databases such as redis because redis save data in RAM. The most advantage for Lucene is it supports much of queries, so fuzzy queries can be supported.

MongoDB Atlas will have a lucene-based search engine soon. The big announcement was made at this week's MongoDB World 2019 conference. This is a great way to encourage more usage of their high revenue MongoDB Atlas product.
I was hoping to see it rolled into the MongoDB Enterprise version 4.2 but there's been no news of bringing it to their on-prem product line.
More info here: https://www.mongodb.com/atlas/full-text-search

The third party solutions, like a mongo op-log tail are attractive. Some thoughts or questions remain about whether the solutions could be tightly integrated, assuming a development/architecture perspective. I don't expect to see a tightly integrated solution for these features for a few reasons (somewhat speculative and subject to clarification and not up to date with development efforts):
mongo is c++, lucene/solr are java
maybe lucene could use some mongo libs
maybe mongo could rewrite some lucene algorithms, see also:
http://clucene.sourceforge.net/
http://lucy.apache.org/
lucene supports various doc formats
mongo is focused on JSON (BSON)
lucene uses immutable documents
single field updates are an issue, if they are available
lucene indexes are immutable with complex merge ops
mongo queries are javascript
mongo has no text analyzers / tokenizers (AFAIK)
mongo doc sizes are limited, that might go against the grain for lucene
mongo aggregation ops may have no place in lucene
lucene has options to store fields across docs, but that's not the same thing
solr somehow provides aggregation/stats and SQL/graph queries

Related

ElasticSearch & Mongo

Very newbie question I assume.. I started playing around with ES and MongoDB and I'm trying to move data out a SQL DB as an exercise.
I can't help but wonder, what data would I store in Mongo and what in ES? Can I store everything in ES? Assume big data load, as in price trends.
To begin with, MongoDB is so-called a document store. Key feature of such concept is that is stores schema-dynamic documents:
Each record in a document collection can have a different structure
Types of each records can be different
Document properties (columns) can have nested structures
It's not schema-free, it's schema-dynamic (or flexible schema). To get into the concept, you can find a great tutorial here: https://docs.mongodb.org/manual/data-modeling/
MongoDB is the most widely used document store - please, see http://db-engines.com/en/system/MongoDB.
It has "drivers" for most programming languages, enabling rapid development. You can dive into Mongo quite quickly, there are a lot of tutorials and official Mongo University - a great course for developers and DBAs.
In short terms it supports indexing, aggregations, filters, load balancing, sharding, replications (replica sets) etc. Data is stored and transferred in a BSON format (http://bsonspec.org/).
A good comparison of MongoDB vs RDBMS concepts can be found in this official reference: https://docs.mongodb.org/manual/reference/sql-comparison/
What is it good for? It enables agile development, where schema can change over a period of time, especially form based data, user generated content, location based data, user profiles and more. It also enables storing large documents (up to 16MB each).
Now, Elasticsearch is not a database. It is a search engine with some great aggregation capabilities (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html - make sure you check out Metrics, Buckets and Pipeline aggregations).
Typical RDBSM is not designed for full-text searches or loosely structured data. Queries in ES can return results much faster than any database (e.g. seconds in RDBMS compared to milliseconds in ES). You need to remember that a key is to design indexes well, and that they will take your disk space.
There is a very detailed article comparing both in regards to performance, you may find it useful: http://blog.quarkslab.com/mongodb-vs-elasticsearch-the-quest-of-the-holy-performances.html.
You can actually use both successfully - MongoDB will store your data, where ES will be used as serving layer (search, aggregations etc.).
There is a big difference between mongodb and ES.
MongoDB is a database which was design in order to store data in it and query thats, while elasticsearch is an lucene base indexer in which you should only index data for searches and should not trust elastisearch. even though you can use store:true in elastic search, it is not recommended and i wouldn't rely on that for important data.

MongoDB integration with Solr

I am beginner with mongodb and its integraiton with Solr. From different posts I got an idea about the integration steps. But need info on the below
I have the data in mongodb, for faster retrieval we are integrating it with Solr.
Solr indexes all mongodb entries. Is this indexing one time activity after integration or Do we need to periodically update Solr to index the entries which got inserted after the integration ?
If we need to periodically update solr, it becomes an extra overhead to maintain it in Solr as well along with mongodb. Best approaches on overcoming it.
As far as I know you do not have official(supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas/direction.
For me the best approach is when it is possible to modify the application and add to the persistence layer the fact that you have all writes operations done in MongoDB and Solr in the "same" time. Like that you can control exactly what you want to send to the Database and what you want to index for a full text operation. But as I said this means that you have to change your application code. (You will have anyway to change it to be able to query Solr when needed). And yes you have to index all the existing documents the first time
You can use a "connector" approach where MongoDB and Solr are kind of connected together, this could be done in various ways.
You can use for example the MongoDB Connector available here : https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr has also a connector for MongoDB, documented here : http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it so cannot comment, but it is also an approach)
You point #2 is true, you have to manage two clusters and be sure the data are in sync, and sometimes pay the price of inconsistency between the Solr index and the document just updated in MongoDB... So you need to see if the best approach for your application is to use MongoDB alone or MongoDB with Solr (see comment below)
Just a small comment in addition to this answer:
You are talking about "faster retrieval", not sure it should be the reason, if you write correct queries with correct indexes in MongoDB you should be able to do it without Solr. If you requirement is really oriented towards the power of solr meaning: full text index (with all related features it makes sense)
How large is your data? MongoDB has a few good indexing mechanism of its own.
There is a powerful geo-api and for full text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify if your need fits into MongoDB or you need to spill over to SOLR.
About the indexing part. How often if your data updated? If you can afford to have infrequent updates, then a batch job with once a day re-indexing may work for you. Ideally SOLR would work well for some form of master data.

Full text search options for MongoDB setup

We are planning to store millions of documents in MongoDB and full text search is very much required. I read Elasticsearch and Solr are the best available solutions for full text search.
Is Elastic search is mature enough to be used for Mongodb full text search? We also be sharding the collections. Does Elasticsearch works with Sharded collections?
What are the advantages and disadvantages of using Elasticsearch or Solr?
Is MongoDB capable of doing full text search?
There are some search capabilities in MongoDB but it is not as feature-rich as search engines.
http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo
We use Mongo with Solr to make content searchable. We prefer Solr because
It is easy to configure and customize
It has large community (This is really helpful if you are working with opensource tools)
Since we didn't work with ES i could not say much about it. You can found some discussions about Solr vs ES on the links below.
Solr vs ES 1
Solr vs ES 2
Solr vs ES 3
I have a professional experience with both Solr/MySQL and ElasticSearch/MongoDB.
If you are going to query a lot your search engine, you already shard your MongoDB (I mean, if you want to shard too your search engine): you should use ElasticSearch, unless what you want to do can't be done with ElasticSearch. And you should use it even if you are not going to shard.
ElasticSearch is a new project on top of Lucene that brings the sharding mechanism, from someone who is used to distributed environments and search (Shay Bannon made Compass and worked for Gigaspaces, the datagrid editor).
ElasticSearch is as easy as MongoDB to shard, I think it is even simpler and the default works great for most cases.
I don't like Solr so much.
The query langage is not structured at all (but it's the case of plugins and Lucene, and I think you can use this unstructured query langage with ES too)
I don't think there is a proper Solr client. Solr java client sucks, and I hearh PHP guys also complaining, while ElasticSearch Java client is very nice, much more typesafe and offers async support (nice if you use Netty for exemple). With Solr, you will do a LOT of string concatenation.
Less easy to scale
Not so new project, I felt the technical dept it has. ElasticSearch is born from Compass, so I guess all the technical dept has been dropped to have a fresh new approach.
Concerning data importing, I have experience with both Solr DataImportHandler and ElasticSearch rivers (CouchDB and MongoDB). What I can tell you is:
Solr permits to do more things, but in a very unstructured XML way, and the documentation doesn't help you so much to understand what is really happing once you are out of the hello world and try to use some advanced features.
ElasticSearch approach is more simple and also limited but has out of the box support for some technologies while DataImportHandler seems more complex-SQL friendly
With my Solr project I had to use manual indexation for some documents, but it was mostly because of the impossibility to denormalize the needed data into a document (the Solr project uses MySQL).
There is also a new MongoDB connector for both Solr and ElasticSearch which I need to test asap :)
http://blog.mongodb.org/post/29127828146/introducing-mongo-connector
So in the end, I'll definitly choose ElasticSearch, because:
It now has a great community
Many people I know with experience with Solr like ElasticSearch
The client side is safer and structured, and provides async with Java Futures
Both can probably import data from MongoDB easily with the new connector
As far as I know, it permits to do almost everything Solr does (in my experience but I'm not a search engine expert)
It adds sharding out of the box
It adds percolation which can help to built realtime scalable applications (but you'll probably need an additional messaging technology)
The source code I read has nearly no technical dept compared to Solr (at least on the client side), and it seems easy to create plugins.
In terms of MongoDB natively, no it doesn't have full text search support. You can see that it is a popular feature request:
https://jira.mongodb.org/browse/SERVER-380
From what I know of the ES river plugin for MongoDB, it tails the oplog for it's functionality. Since a sharded setup would have multiple oplogs and there would be no way to easily alter that code to connect via a mongos.
Similarly for Solr, the examples I have seen usually involve similar behavior to the ES plugin. Some more solid info here:
http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html
I have not got any experience using one but others have made comparisons before, take a look here:
Solr vs. ElasticSearch
ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
MongoDB can't do efficient full text search. You can do wildcard searches on fields, but i don't think these use indexes efficiently.
I would recommend using the river functionality of ElasticSearch to automatically push the documents from MongoDB to ElasticSearch.
elasticsearch-river-mongodb is a MongoDB to Elasticsearch river that when a document changes in MongoDB, ElasticSearch will monitoring the oplog and then automatically update its index.
This minimises the problem of keeping the two datastores in sync, as ElasticSearch is just monitoring the replication tables of Mongo.
Mongo is not at al good for fulltext search.
Obviously you need to index you fields for fast searching, and indexing fields containing BIG data (long long strings) will be failed in mongo. it has a limit of 1k for index, if you have content more thn 1k, it will be ignored by index and will not be displayed in your search results. obviously if you are trying to perform a full text search for your articles, mongo is not at al a good choice.
Currently, in MongoDB 2.4.6, there now IS a full-text search in MongoDB and it is more feature rich, then in previous versions. On http://docs.mongodb.org/manual/core/text-search/ are described the capabilities of the new functionality.
Worth mentioning:
tokenizes and stems the search term(s) during both the index creation and the text command execution. assigns a score to each document that
contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
However, in this answer (from September 2013) https://stackoverflow.com/a/18631775/1920149 you can see, that mongo still warns from using this functionality in production. This functionality is still in beta stage.
Full text search become possible in product environment with Mongodb since the version 2.6 by creating text index on the required fields.
indexe text in mongodb

NoSQL engines that support dynamic queries?

What "NoSQL" database engines support dynamic / advanced queries in a similar fashion to MongoDB (http://www.mongodb.org/display/DOCS/Advanced+Queries) ?
Specifically interested in options that support ad-hoc querying from a shell or within client languages.
None just use MongoDB ;)
Honestly, it really depends on what type of querying you plan to do. For Key/Value style queries where you plan to just pull up one document at a time, then basically all of the NoSQL DBs are good for this.
When it comes to pulling back "sets" of data or using alternate keys, then MongoDB is probably your best "crossover" here. Many NoSQL DBs have limited querying functions, especially on non-key fields. Of course, that's kind of the point of "Key-Value stores", so Mongo is kind of a mutant here.
The last I checked with Cassandra, there was definitely some "hoop-jumping" involved to really support ad-hoc non-key queries. And CouchDB seems to point to "just Map / Reduce".
That stated, I believe that there is motion from several NoSQL dbs to support such ad-hoc querying mechanism. So this answer could be completely wrong in 2 months :)

Which NoSQL solution for a dating search site?

I am highly interested in new NoSQL solutions to implement a search engine for a dating site. However because of having a lot of possibilities, I am little bid confused. My requirements,
1) 10 million people
2) More than 8 index (gender, online, city, name etc...)
3) Scalability
Thanks
You wanna go for either mangoDB or CouchDB.
CouchDB scales a little better while mangoDB syntax is a little more familiar.
also it depends what framework/language u use to create the dating site.
i personally would choose couchdb. (u should know javascript...a lot)
Apache Solr is a data store and fulltext search engine that might be useful to you. Solr is rarely mentioned as a NoSQL technology, but it shares many characteristics with document-oriented databases.
Keep in mind that you have to know what type of queries you're going to run before you can choose a NoSQL solution or design your database.
That's in contrast to a relational database, where you can design a general-purpose database based on the data relationships.
With that large of a dataset you would probably be well advised to look at search as separate from data store. As someone suggested, SOLR will index your data for you to search independently of your database. You have 2 problems, data store and search.
ElasticSearch http://www.elasticsearch.org/overview/
Can handle age difference, geographic location, tastes and dislikes, etc. Or a leaderboard system that depends on many variables.
You'd want something that has sophisticated search and aggregation support.
Elasticsearch is a good candidate. In addition to its ability to perform fuzzy, proximity searches (which is something you'd likely want), you'd also want to integrate some machine learning pipeline to constantly improve your matching 'accuracy'.