We implemented SolrCloud with three shards and three replicas. We are importing data from MongoDB and indexing it in Solr. While adding data to Solr we found two possible solutions:
Solution 1: add live data; as a user registers and creates a profile, it gets indexed in Solr immediately.
Solution 2: use a cron job to pull data from MongoDB and index it in Solr.
Solution 2 is complex, as we need to maintain failure status and so on.
If we select solution 1, what would the actual problems be in a production environment? Are any benchmarks available?
High availability and performance are the main concerns.
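For context, a simplified sketch of what solution 1 would look like in a Python service (pymongo + pysolr; the collection URL and field names are illustrative, not our actual schema):

```python
# Sketch of solution 1: index the profile into SolrCloud in the same request
# that persists it to MongoDB. Names below are illustrative.
from pymongo import MongoClient
import pysolr

mongo = MongoClient("mongodb://localhost:27017")
profiles = mongo["appdb"]["profiles"]

# Point pysolr at the SolrCloud collection (via any node or a load balancer).
solr = pysolr.Solr("http://localhost:8983/solr/profiles", timeout=10)

def register_user(profile: dict) -> None:
    """Persist the profile, then index it in Solr right away."""
    result = profiles.insert_one(profile)
    try:
        solr.add([{
            "id": str(result.inserted_id),
            "name": profile.get("name"),
            "city": profile.get("city"),
        }], commit=False)  # rely on autoCommit/autoSoftCommit for visibility
    except pysolr.SolrError:
        # Open question for us: if Solr is unavailable the profile is stored
        # in MongoDB but never indexed unless we retry or queue the update.
        raise
```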
We are using .NET Core and Node.js microservices, some of them with MongoDB.
Currently we have the following DB structure:
Every customer gets their own database.
So if we have a microservice for invoices, every new customer adds one new DB for that microservice:
Invoice_customerA
Invoice_customerB
etc...
The collections in each such DB remain the same (usually we have 1-3 collections per DB).
In terms of logic, we choose the right DB from the request input at runtime.
I am now thinking about changing this a bit and separating by collection instead:
So, taking the same example, this time the Invoice service will have only one DB,
Invoice_allCustomers
and there will be one new collection in it for each customer (or more, if the service had more collections):
collection_customerA
collection_customerB
What I am trying to understand is whether there is any difference performance-wise,
or whether it is mostly a "cosmetic" change.
Or maybe there are some other considerations?
P.S.
If the change is mostly cosmetic, I am thinking the new solution is better for us, since we usually have only 1-2 collections per microservice,
and it will be easier to navigate when there are significantly fewer databases.
As far as I know, in microservices each service should have its own database. If it is not a different service, then you can use one database with different collections in it. It is more of a cosmetic change, but I should also warn you that MongoDB still has its limits, which you can find here. It really depends on the amount of data that will be stored and retrieved.
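To illustrate why it is largely a cosmetic change from the driver's point of view, here is a sketch with pymongo (the .NET and Node.js drivers look analogous); database, collection, and customer names are illustrative:

```python
# Both layouts resolve a collection handle from the request input at runtime;
# only the naming scheme differs. All names here are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def invoices_db_per_customer(customer_id: str):
    # Current layout: one database per customer, fixed collection name.
    return client[f"Invoice_{customer_id}"]["invoices"]

def invoices_collection_per_customer(customer_id: str):
    # Proposed layout: one shared database, one collection per customer.
    return client["Invoice_allCustomers"][f"invoices_{customer_id}"]

# Usage is identical either way:
# invoices_db_per_customer("customerA").insert_one({"total": 100})
# invoices_collection_per_customer("customerA").insert_one({"total": 100})
```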
In my project I have to implement text search and have to choose a feasible approach between the following two options:
Synchronizing the MongoDB database with Elasticsearch.
Using MongoDB's own text indexes, which offer Elasticsearch-like text search capabilities.
I have gone through many articles that list the pros of each approach, but I have not found any document that compares the two and explains which one is better, or what the limitations of each approach are.
Note:- I am using Node.js with express.js.
MongoDB is a general-purpose database; Elasticsearch is a distributed text search engine backed by Lucene. You can store data in MongoDB and use Elasticsearch exclusively for its full-text search capabilities. Depending on your use case, you can send only a subset of the MongoDB document fields to Elasticsearch.
Synchronization of MongoDB with Elasticsearch can be done with mongoosastic. You can address data-safety concerns by persisting in MongoDB and speed up your search by using Elasticsearch. You can also sync MongoDB and Elasticsearch with a Python script using the elasticsearch-py package.
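For example, a minimal one-shot sync sketch with pymongo and the official Python client (index and field names are illustrative; an incremental sync would additionally follow a MongoDB change stream or the oplog):

```python
# One-shot sync sketch: copy selected fields from MongoDB into Elasticsearch.
# Index, database, and field names are illustrative.
from pymongo import MongoClient
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch("http://localhost:9200")

def sync_articles():
    actions = (
        {
            "_index": "articles",
            "_id": str(doc["_id"]),
            # Send only the fields you actually search on.
            "_source": {"title": doc.get("title"), "body": doc.get("body")},
        }
        for doc in mongo["appdb"]["articles"].find()
    )
    bulk(es, actions)

if __name__ == "__main__":
    sync_articles()
```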
MongoDB's text search is very slow compared to Elasticsearch, and text indexing in MongoDB also takes more time and more resources.
I am a beginner with MongoDB and its integration with Solr. From different posts I got an idea of the integration steps, but I need information on the points below.
I have the data in MongoDB; for faster retrieval we are integrating it with Solr.
Solr indexes all MongoDB entries. Is this indexing a one-time activity after integration, or do we need to periodically update Solr to index the entries inserted after the integration?
If we need to periodically update Solr, it becomes extra overhead to maintain the data in Solr as well as in MongoDB. What are the best approaches to overcome this?
As far as I know there is no official (supported/complete) solution to integrate MongoDB and Solr, but let me give you some ideas/directions.
For me the best approach, when it is possible to modify the application, is to make the persistence layer perform all write operations in MongoDB and in Solr at the "same" time. That way you can control exactly what you want to send to the database and what you want to index for full-text operations. But as I said, this means you have to change your application code. (You will have to change it anyway to be able to query Solr when needed.) And yes, you have to index all the existing documents the first time.
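For the initial load mentioned above, a minimal backfill sketch (pymongo + pysolr; database, collection, and field names are illustrative):

```python
# One-time backfill: index every existing MongoDB document into Solr in batches.
# Database, collection, and field names are illustrative.
from pymongo import MongoClient
import pysolr

mongo = MongoClient("mongodb://localhost:27017")
solr = pysolr.Solr("http://localhost:8983/solr/products", timeout=30)

BATCH_SIZE = 1000
batch = []
for doc in mongo["shop"]["products"].find():
    batch.append({
        "id": str(doc["_id"]),
        "name": doc.get("name"),
        "description": doc.get("description"),
    })
    if len(batch) >= BATCH_SIZE:
        solr.add(batch)      # batched updates keep the backfill fast
        batch = []
if batch:
    solr.add(batch)
solr.commit()                # single hard commit at the end
```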
You can use a "connector" approach, where MongoDB and Solr are connected together; this can be done in various ways.
For example, you can use the MongoDB Connector available here: https://github.com/10gen-labs/mongo-connector
LucidWorks, the company behind Solr, also has a connector for MongoDB, documented here: http://docs.lucidworks.com/display/help/Create+a+New+MongoDB+Data+Source# (I have not used it, so I cannot comment, but it is also an approach).
Your point #2 is true: you have to manage two clusters and make sure the data stay in sync, and sometimes pay the price of inconsistency between the Solr index and a document just updated in MongoDB. So you need to decide whether the best approach for your application is MongoDB alone or MongoDB with Solr (see the comment below).
Just a small comment in addition to this answer:
You are talking about "faster retrieval"; I am not sure that should be the reason. If you write correct queries with correct indexes in MongoDB, you should be able to do it without Solr. If your requirement is really oriented towards the power of Solr, meaning a full-text index (with all its related features), then it makes sense.
How large is your data? MongoDB has a few good indexing mechanisms of its own.
There is a powerful geo API, and for full-text search there is http://docs.mongodb.org/manual/core/index-text/. So it would be ideal to identify whether your need fits into MongoDB or you need to spill over to Solr.
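For reference, using that text index looks roughly like this (shown with pymongo; collection and field names are illustrative):

```python
# MongoDB text-index sketch; collection and field names are illustrative.
from pymongo import MongoClient, TEXT

coll = MongoClient("mongodb://localhost:27017")["appdb"]["articles"]

# A collection can have one text index, which may cover several weighted fields.
coll.create_index([("title", TEXT), ("body", TEXT)],
                  weights={"title": 10, "body": 1},
                  name="article_text")

# $text queries use the index; textScore lets you sort by relevance.
cursor = coll.find(
    {"$text": {"$search": "mongodb solr"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})])
```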
About the indexing part: how often is your data updated? If you can afford infrequent updates, then a batch job that re-indexes once a day may work for you. Ideally, Solr works well for some form of master data.
I currently run my website on a single server with MongoDB. On my server I have two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user-personalization DB. I am moving to Amazon EC2 for auto-scaling, so that the web server can scale out and I can increase the number of servers as web traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize for:
Minimal changes to my code (the code is in perl)
The ability to seamlessly add/remove web servers without worrying about losing data in the DB
Low cost
In the short term, the DB will certainly be able to fit in memory across all machines, since it will be under 2 GB. The user-personalization DB can't be rebuilt, so it is more important to preserve it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options:
Use SimpleDB for the user-personalization store and MongoDB for the index. Have the index replicate across all machines; however, I don't know much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know much about SimpleDB or DynamoDB. Based on articles, it seems like DynamoDB would be a natural choice, but I'm not sure about good Perl support, or whether I can have all my columns, indexes, etc. Does anyone have experience or advice?
You could host Mongo on a single EC2 server that each of the boxes in the web farm connects to. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers, as we run a replica set, and when we reach the point where we need to scale horizontally with Mongo we'll spin up some new instances and shard the larger collections.
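For example, each web instance just uses the same replica-set connection URI (shown here with pymongo for brevity; the Perl MongoDB driver accepts an equivalent URI), so adding or removing web servers needs no extra database setup. Host names and the set name are illustrative:

```python
# Replica-set connection sketch; host names and the set name are illustrative.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017,db3.example.com:27017"
    "/?replicaSet=rs0",
    w="majority",        # acknowledge writes on a majority of members
)
db = client["mysite"]
# Every web instance uses this same URI; the driver handles failover.
```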
I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
The ability to seamlessly add/remove web servers without worrying about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key, one that allows queries to be evenly distributed across the shards.
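For illustration, enabling sharding and choosing a shard key looks roughly like this when run against a mongos router (shown with pymongo; database, collection, and key names are illustrative):

```python
# Sharding sketch, run against a mongos router.
# Database, collection, and key names are illustrative.
from pymongo import MongoClient

mongos = MongoClient("mongodb://mongos.example.com:27017")

# Enable sharding for the database, then choose the shard key carefully:
# only queries that include this key can be routed to a single shard.
mongos.admin.command("enableSharding", "crawler")
mongos.admin.command("shardCollection", "crawler.pages",
                     key={"site_id": "hashed"})
```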
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
Beware that at the moment EC2 does not have 64-bit small instances, which can make replication expensive. Because MongoDB memory-maps its files, a 32-bit OS is not advised.
I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
There is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 is the fastest solution versus rewriting-for/migrating-to DynamoDB.
Best of luck!
I have deployed 64-bit MongoDB 2.x on an AWS m1.large instance.
I am trying to find the best performance that Mongo can give us on AWS, in light of http://www.snailinaturtleneck.com/blog/tag/mongodb/ (and mongodb read/write performance and mongo hosting in the cloud).
I have created one DB with one collection, user, and inserted 100,000 records/JSON objects (each JSON object is 4 KB in size), using a random number as a suffix to "user-". I also created an index on the user id.
Further, I set the DB profiler to log slow queries taking 20 ms or more. I then ran a Java program with 10 threads; each thread generates a user id with a random number and looks it up in the user collection in an infinite loop. Under this load I observed query/read latency of up to 60 ms.
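For reference, a condensed Python sketch of the setup described above (the original test used a Java client; user-id format, thread count, and names are illustrative):

```python
# Condensed sketch of the described benchmark (Python instead of Java).
# User-id format, thread count, and names are illustrative.
import random
import threading
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["bench"]
users = db["user"]

users.create_index([("user_id", ASCENDING)])   # index on the user id field
db.command("profile", 1, slowms=20)            # log queries slower than 20 ms

def reader():
    while True:                                # infinite find loop, as in the test
        uid = "user-" + str(random.randint(0, 99999))
        users.find_one({"user_id": uid})

threads = [threading.Thread(target=reader) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                   # runs until the process is killed
```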
I also observed that when I run fewer threads, say 3 or 4 (putting a query load of about 5K finds per second on the user collection), I see no latency, or latency below 2 ms.
I fail to understand why increasing the load of user lookups on the collection causes this latency. I believe MongoDB can handle far more concurrent reads than I am attempting, without such an impact on performance.
One possibility I can think of is that Mongo has performance issues when a large number of queries hit a single collection, as in our case; I expect 10K to 20K queries per second on a single collection.
We would appreciate your thoughts/suggestions.
Some information is missing: what is your disk configuration? EBS may contribute to the latency if everything is persisted to disk.
Amazon has released a white paper with best practices on how to run Mongo on EC2: MongoDB on AWS. Here is its description:
This whitepaper provides an overview of general best practices that apply to all major NoSQL systems and highlights one of popular NoSQL systems - MongoDB - and discusses how to best run it on the AWS cloud. It further examines different MongoDB configurations so you can optimize it for performance, durability, and security.