MongoDB Schema design for Events Management System with Full Text Search

I'm designing a solution for an Events Management System with full-text search functionality (preferably Solr).
Following are the 4 major entities/document types in this system:
A Venue is a place where something can occur.
A Title is a description of something that can occur.
An Event is a particular title at a particular venue, starting on a particular date and ending on a particular date.
Within an event there will be one or more EventTimes, which are the individual showing/opening times within the event.
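For concreteness, a minimal sketch of what these four document types might look like (all field names and values below are illustrative assumptions, not a fixed schema):

    # Illustrative documents only; field names are assumptions.
    venue = {"_id": "venue_1", "name": "Royal Albert Hall",
             "location": {"type": "Point", "coordinates": [-0.1774, 51.5010]}}

    title = {"_id": "title_1", "name": "Swan Lake",
             "description": "A ballet in four acts."}

    event = {"_id": "event_1", "venue_id": "venue_1", "title_id": "title_1",
             "start_date": "2014-06-01", "end_date": "2014-06-30"}

    # One EventTime per individual showing/opening time within the event.
    event_time = {"_id": "eventtime_1", "event_id": "event_1",
                  "starts_at": "2014-06-01T19:30:00Z"}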
The system has 2 data sources:
Daily data feeds from a 3rd-party supplier.
User-generated content (UGC) from end users of the system.
I'm considering using MongoDB as the database for this system and Solr for full-text search support. I'm also considering Mongo Connector for keeping the data in sync between MongoDB and Solr. Mongo Connector requires a collection in MongoDB that maps directly onto the Solr document to be populated.
In my natural design I don't need a MongoDB collection holding all the attributes that need to be searched, so given this requirement from the connector, I'm not sure how to create this new collection.
Any advice will be much appreciated.
Solr was selected for the comprehensive search features it offers, e.g. faceted search, filtering, geospatial search with support for multiple points per document and geo polygons, powerful extensions to the Lucene query language, and many more.
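One way to meet the connector's one-collection-per-Solr-document requirement is to maintain a flattened "search" collection yourself, denormalising the entities into it whenever they change. A minimal sketch, assuming pymongo and the hypothetical collection and field names from the entity sketch above:

    # Build a flat "events_search" collection whose documents map one-to-one
    # onto Solr documents, so mongo-connector can replicate it directly.
    from pymongo import MongoClient

    db = MongoClient()["events"]

    for event in db.events.find():
        venue = db.venues.find_one({"_id": event["venue_id"]})
        title = db.titles.find_one({"_id": event["title_id"]})
        db.events_search.replace_one(
            {"_id": event["_id"]},
            {
                "title_name": title["name"],
                "title_description": title["description"],
                "venue_name": venue["name"],
                "venue_location": venue["location"],
                "start_date": event["start_date"],
                "end_date": event["end_date"],
            },
            upsert=True,
        )

The connector can then be pointed at the events_search namespace alone, and the rest of the schema stays in its natural shape.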

Related

Which search approach is feasible: Elasticsearch + MongoDB, or MongoDB text indexes?

In my project I have to implement text search and have to choose a feasible approach between these two:
Synchronising the MongoDB database with Elasticsearch.
MongoDB's own text indexes, which have Elasticsearch-like text-search capabilities.
I have gone through many articles that list the pros of each, but I haven't found any document that compares the two approaches, says which is better, or lays out the limitations of each.
Note: I am using Node.js with Express.js.
MongoDB is a general-purpose database; Elasticsearch is a distributed text search engine backed by Lucene. You can store data in MongoDB and use Elasticsearch exclusively for its full-text search capabilities. Depending on your use case, you can send only a subset of the Mongo data fields to Elasticsearch.
Synchronization of MongoDB with Elasticsearch can be done with mongoosastic. You can address the data-safety concern by persisting in Mongo and speed up your search by using Elasticsearch. You can also sync Mongo and Elasticsearch with a Python script using the elasticsearch package.
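As a rough illustration of the script approach, a minimal one-shot sync using pymongo and the official elasticsearch Python client (index and field names are illustrative; a real sync would also have to handle updates and deletes incrementally):

    from pymongo import MongoClient
    from elasticsearch import Elasticsearch, helpers

    mongo = MongoClient()["mydb"]
    es = Elasticsearch()

    def actions():
        # Send only the searchable subset of fields to Elasticsearch,
        # keeping MongoDB's _id so results can be joined back to Mongo.
        for doc in mongo.articles.find():
            yield {
                "_index": "articles",
                "_id": str(doc["_id"]),
                "_source": {"title": doc["title"], "body": doc["body"]},
            }

    helpers.bulk(es, actions())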
MongoDB's search is very slow compared to Elasticsearch. Indexing in MongoDB also takes more time and more resources.

MongoDB integration with Solr

I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need info on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
1. Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries inserted after the integration?
2. If we need to periodically update Solr, it becomes extra overhead to maintain the data in Solr as well as MongoDB. What are the best approaches to overcoming this?
As far as I know there is no official (supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas/directions.
For me the best approach, when it is possible to modify the application, is to build into the persistence layer the fact that all write operations go to MongoDB and Solr at the "same" time. That way you can control exactly what you send to the database and what you index for full-text operations. But as I said, this means you have to change your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
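A minimal sketch of that dual-write persistence layer, assuming pymongo and pysolr (all names are illustrative; a real implementation must also decide what to do when one of the two writes fails):

    import pysolr
    from pymongo import MongoClient

    db = MongoClient()["events"]
    solr = pysolr.Solr("http://localhost:8983/solr/events", timeout=10)

    def save_event(event):
        # Persist the full document in MongoDB...
        db.events.insert_one(event)
        # ...and index only the searchable fields in Solr on the same path.
        solr.add([{
            "id": str(event["_id"]),
            "title": event["title"],
            "description": event["description"],
        }])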
You can use a "connector" approach where MongoDB and Solr are kind of connected together, this could be done in various ways.
You can use for example the MongoDB Connector available here : https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr has also a connector for MongoDB, documented here : http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it so cannot comment, but it is also an approach)
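For illustration, a mongo-connector invocation might look roughly like this (the exact flags vary between versions, so treat this as an assumption and check the project README):

    # Hypothetical invocation: replicate one MongoDB namespace into a Solr
    # core via the Solr doc manager. A replica set is required, since the
    # connector tails the oplog.
    mongo-connector -m localhost:27017 \
                    -t http://localhost:8983/solr/events \
                    -d solr_doc_manager \
                    -n events.events_search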
Your point #2 is true: you have to manage two clusters, make sure the data are in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB. So you need to see whether the best approach for your application is MongoDB alone or MongoDB with Solr (see the comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval", but I'm not sure that should be the reason: if you write correct queries with correct indexes in MongoDB, you should be able to do it without Solr. If your requirement is really oriented towards the power of Solr, meaning full-text indexing (with all its related features), then it makes sense.
How large is your data? MongoDB has some good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify whether your need fits into MongoDB or you need to spill over to Solr.
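For example, with MongoDB 2.6+ a text index can be created and queried like this (a minimal pymongo sketch; collection and field names are assumptions):

    from pymongo import MongoClient, TEXT

    events = MongoClient()["events"]["events"]

    # A collection can have at most one text index, but it may cover
    # several fields.
    events.create_index([("title", TEXT), ("description", TEXT)])

    # $text performs tokenised, stemmed matching against the indexed fields.
    results = events.find({"$text": {"$search": "ballet lake"}})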
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job with once-a-day re-indexing may work for you. Ideally Solr works well for some form of master data.

Neo4j, storing text data in node properties, text analysis & full-text search a requirement

Is it ok to store text data in the graph nodes when text analysis will be a requirement?
I have an application in mind involving thousands of documents that are interlinked through subject, author, references, etc. I want to store the links between the documents but also be able to analyse the text of the documents using text-analysis techniques; this will also require analysing the text of documents across all nodes to arrive at word counts, etc.
At the moment I've researched a number of options, trying to arrive at the best/most practical:
1. Use a relational database with bridge tables to manage the relationship information (con: the SQL queries to "traverse" the relationships will be difficult).
2. Use a graph database to store the relationship and document information (cons: graph databases aren't optimal for text storage and retrieval; worried that trying to run full-text analysis across all nodes will be slow and difficult to use with text-analysis frameworks).
3. Use a graph database to store the relationships and another store such as CouchDB for the document information (cons: managing the two stores and keeping them in sync).
4. Use only a graph database to store the relationships, and store the documents on disk or in HDFS etc. for analysis.
5. Other?
Can anyone suggest if one or other of these is the best approach to implement?
Thanks,
Paul
Neo4j's default index provider (Lucene) can do some text analytics. If that is not enough, then option 3 or 4 is probably the best.
http://lucene.apache.org/

What is the typical usage of ElasticSearch in conjunction with other storage?

It is not recommended to use ElasticSearch as the only storage, for some obvious reasons like security, transactions, etc. So how is it usually used together with another database?
Say I want to store some documents in MongoDB and be able to search effectively by some of their properties. What I'd do is store the full document in Mongo as usual and then trigger an insertion into ElasticSearch, but insert only the searchable properties plus the MongoDB ObjectID there. Then I can search using ElasticSearch and, having found the ObjectID, go to Mongo and fetch the whole document.
Is this a correct usage of ElasticSearch? I don't want to duplicate the whole data set, as I already have it in Mongo.
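For reference, the workflow described in the question would look roughly like this (a sketch assuming pymongo and the official elasticsearch client, with each ES document indexed under the string form of its MongoDB ObjectId; all names are illustrative):

    from bson import ObjectId
    from elasticsearch import Elasticsearch
    from pymongo import MongoClient

    mongo = MongoClient()["mydb"]
    es = Elasticsearch()

    # Search only the lean, searchable copies held in Elasticsearch.
    hits = es.search(index="docs",
                     body={"query": {"match": {"title": "mongodb"}}})
    ids = [ObjectId(hit["_id"]) for hit in hits["hits"]["hits"]]

    # Fetch the full documents from MongoDB in one round trip.
    full_docs = list(mongo.docs.find({"_id": {"$in": ids}}))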
The best practice for now is to duplicate documents in ES.
The cool thing here is that when you search, you don't have to go back to your database to fetch the content, as ES provides it in one single call.
You have everything in the ES search response to display the results to your user.
My 2 cents.
You may like to use the MongoDB river; take a look at this post.
There are more issues than just the size of the data you store or index: you might like to have MongoDB as a backup with "near real time" queries for inserted data, and as a queue for the data to be indexed (you may like to run MongoDB as a cluster, with the write concern suited to your application).

Full text search options for MongoDB setup

We are planning to store millions of documents in MongoDB, and full-text search is very much required. I read that Elasticsearch and Solr are the best available solutions for full-text search.
Is Elasticsearch mature enough to be used for MongoDB full-text search? We will also be sharding the collections. Does Elasticsearch work with sharded collections?
What are the advantages and disadvantages of using Elasticsearch or Solr?
Is MongoDB capable of doing full text search?
There are some search capabilities in MongoDB, but they are not as feature-rich as dedicated search engines.
http://www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo
We use Mongo with Solr to make content searchable. We prefer Solr because
It is easy to configure and customize
It has a large community (this is really helpful if you are working with open-source tools)
Since we didn't work with ES, I can't say much about it. You can find some discussions about Solr vs. ES at the links below.
Solr vs ES 1
Solr vs ES 2
Solr vs ES 3
I have professional experience with both Solr/MySQL and ElasticSearch/MongoDB.
If you are going to query your search engine a lot, and you already shard your MongoDB (I mean, if you want to shard your search engine too), you should use ElasticSearch, unless what you want to do can't be done with ElasticSearch. And you should use it even if you are not going to shard.
ElasticSearch is a new project on top of Lucene that brings the sharding mechanism, from someone who is used to distributed environments and search (Shay Banon made Compass and worked for GigaSpaces, the data-grid vendor).
ElasticSearch is as easy as MongoDB to shard; I think it is even simpler, and the defaults work great for most cases.
I don't like Solr so much.
The query language is not structured at all (but that comes from Lucene and its plugins, and I think you can use this unstructured query language with ES too).
I don't think there is a proper Solr client. The Solr Java client sucks, and I hear PHP guys complaining too, while the ElasticSearch Java client is very nice, much more typesafe, and offers async support (nice if you use Netty, for example). With Solr, you will do a LOT of string concatenation.
Less easy to scale.
Not such a new project; I felt the technical debt it carries. ElasticSearch was born from the Compass experience, so I guess all the technical debt was dropped in favour of a fresh new approach.
Concerning data importing, I have experience with both Solr DataImportHandler and ElasticSearch rivers (CouchDB and MongoDB). What I can tell you is:
Solr lets you do more things, but in a very unstructured XML way, and the documentation doesn't help you much to understand what is really happening once you are past the hello world and try to use some advanced features.
The ElasticSearch approach is simpler and also more limited, but it has out-of-the-box support for some technologies, while the DataImportHandler seems friendlier to complex SQL.
With my Solr project I had to use manual indexation for some documents, but that was mostly because it was impossible to denormalize the needed data into a document (the Solr project uses MySQL).
There is also a new MongoDB connector for both Solr and ElasticSearch which I need to test asap :)
http://blog.mongodb.org/post/29127828146/introducing-mongo-connector
So in the end, I'd definitely choose ElasticSearch, because:
It now has a great community
Many people I know with experience with Solr like ElasticSearch
The client side is safer and structured, and provides async with Java Futures
Both can probably import data from MongoDB easily with the new connector
As far as I know, it can do almost everything Solr does (in my experience, but I'm not a search-engine expert)
It adds sharding out of the box
It adds percolation, which can help to build realtime scalable applications (but you'll probably need an additional messaging technology)
The source code I read has nearly no technical debt compared to Solr (at least on the client side), and it seems easy to create plugins.
In terms of MongoDB natively, no, it doesn't have full-text search support. You can see that it is a popular feature request:
https://jira.mongodb.org/browse/SERVER-380
From what I know of the ES river plugin for MongoDB, it tails the oplog for its functionality. Since a sharded setup would have multiple oplogs, there would be no way to easily alter that code to connect via a mongos.
Similarly for Solr, the examples I have seen usually involve similar behavior to the ES plugin. Some more solid info here:
http://blog.knuthaugen.no/2010/04/cooking-with-mongodb-and-solr.html
I have not got any experience using one but others have made comparisons before, take a look here:
Solr vs. ElasticSearch
ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?
MongoDB can't do efficient full-text search. You can do wildcard searches on fields, but I don't think these use indexes efficiently.
I would recommend using the river functionality of ElasticSearch to automatically push the documents from MongoDB to ElasticSearch.
elasticsearch-river-mongodb is a MongoDB-to-Elasticsearch river: ElasticSearch monitors the oplog and, when a document changes in MongoDB, automatically updates its index.
This minimises the problem of keeping the two datastores in sync, as ElasticSearch is just monitoring Mongo's replication log.
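For a sense of what the river does under the hood, this is roughly what tailing the oplog looks like (a sketch with pymongo; it assumes a replica set, since standalone servers have no oplog, and the namespace name is illustrative):

    import pymongo
    from pymongo import MongoClient

    oplog = MongoClient()["local"]["oplog.rs"]

    # A tailable cursor blocks and keeps yielding new entries as they arrive.
    cursor = oplog.find(cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        # entry["op"] is "i" (insert), "u" (update) or "d" (delete);
        # entry["ns"] is the "db.collection" namespace the change applies to.
        if entry["ns"] == "mydb.docs":
            print(entry["op"], entry.get("o"))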
Mongo is not at all good for full-text search.
Obviously you need to index your fields for fast searching, and indexing fields containing big data (very long strings) will fail in Mongo: there is a limit of about 1 KB per index key, and content beyond that limit is ignored by the index and will not appear in your search results. So if you are trying to perform full-text search on your articles, Mongo is not at all a good choice.
As of MongoDB 2.4.6, there now IS full-text search in MongoDB, and it is more feature-rich than in previous versions. The capabilities of the new functionality are described at http://docs.mongodb.org/manual/core/text-search/.
Worth mentioning, it:
tokenizes and stems the search term(s) during both index creation and execution of the text command;
assigns a score to each document that contains the search term in the indexed fields. The score determines the relevance of a document to a given search query.
However, in this answer (from September 2013) https://stackoverflow.com/a/18631775/1920149 you can see that Mongo still warns against using this functionality in production: it is still in beta stage.
Full-text search has been possible in production environments with MongoDB since version 2.6, by creating a text index on the required fields (see http://docs.mongodb.org/manual/core/index-text/).
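A minimal pymongo sketch of a relevance-ranked query against such a text index (collection and field names are assumptions, and a text index as shown earlier must already exist):

    from pymongo import MongoClient

    articles = MongoClient()["mydb"]["articles"]

    results = articles.find(
        {"$text": {"$search": "full text search"}},
        {"score": {"$meta": "textScore"}},        # project the relevance score
    ).sort([("score", {"$meta": "textScore"})])   # order by relevance

    for doc in results:
        print(doc["score"], doc.get("title"))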