Need some way to push data from clients database to central database.Basically, there are several instances of MongoDB running on remote machines [clients] , and need some method to periodically update central mongo database with newly added and modified documents in clients.it must replicate its records to the single central server
Eg:
If I have 3 mongo instances running on 3 machines each having data of 10GB then after the data migration 4th machine's mongoDB must have 30GB of data. And cenral mongoDB machine must get periodically updated with data of all those 3 machines. But these 3 machines not only get new documents but existing documents in them may get updated. I would like the central mongoDB machine also to get these updations.
Your desired replication strategy is not formally supported by MongoDB.
A MongoDB replica set consists of a single primary with asynchronous replication to one or more secondary servers in the same replica set. You cannot configure a replica set with multiple primaries or replication to a different replica set.
However, there are a few possible approaches for your use case depending on how actively you want to keep your central server up to date and the volume of data/updates you need to manage.
Some general caveats:
Merging data from multiple standalone servers can create unexpected conflicts. For example, unique indexes would not know about documents created on other servers.
Ideally the data you are consolidating will still be separated by a unique database name per origin server so you don't have strange crosstalk between disparate documents that happen to have the same namespace and _id shared by different origin servers.
Approach #1: use mongodump and mongorestore
If you just need to periodically sync content to your central server, one way to do so is using mongodump and mongorestore. You can schedule a periodic mongodump from each of your standalone instances and use mongorestore to import them into the central server.
Caveats:
There is a --db parameter for mongorestore that allows you to restore into a different database from the original name (if needed)
mongorestore only performs inserts into the existing database (i.e. does not perform updates or upserts). If existing data with the same _id already exists on the target database, mongorestore will not replace it.
You can use mongodump options such as --query to be more selective on data to export (for example, only select recent data rather than all)
If you want to limit the amount of data to dump & restore on each run (for example, only exporting "changed" data), you will need to work out how to handle updates and deletions on the central server.
Given the caveats, the simplest use of this approach would be to do a full dump & restore (i.e. using mongorestore --drop) to ensure all changes are copied.
Approach #2: use a tailable cursor with the MongoDB oplog.
If you need more realtime or incremental replication, a possible approach is creating tailable cursors on the MongoDB replication oplog.
This approach is basically "roll your own replication". You would have to write an application which tails the oplog on each of your MongoDB instances and looks for changes of interest to save to your central server. For example, you may only want to replicate changes for selective namespaces (databases or collections).
A related tool that may be of interest is the experimental Mongo Connector from 10gen labs. This is a Python module that provides an interface for tailing the replication oplog.
Caveats:
You have to implement your own code for this, and learn/understand how to work with the oplog documents
There may be an alternative product which better supports your desired replication model "out of the box".
You should be aware that there are only replica set for doing replication there a replicat set always means: one primary, multiple secondary. Write always go to the primary server. Appearently you want multi-master replication which is not supported by MongoDB. So you want to look into a different technology like CouchDB or CouchBase. MongoDB is barrel burst here.
There may be a way since MongoDB 3.6 to achieve your goal: Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.
There are some configuration options that affect whether you can use Change Streams or not, so please read about them.
Another option is Delayed Replica Set Members.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error. For example, a delayed member can make it possible to recover from unsuccessful application upgrades and operator errors including dropped databases and collections.
Hidden Replica Set Members may be another option to consider.
A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set.
Another option may be to configure a Priority 0 Replica Set Member.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error.
I am interested in these options myself, but I haven't decided what approach I will use.
Related
I want to setup PostgreSQL databases in a way that the one (primary) server receives data from multiple agents, does some post-processing and regularly syncs to read-only (archive) replica. The sync does not have to be real-time. The replica server is used to feed the processed data to other clients.
The idea is that the replica will keep a complete database in the historical sense but the primary will keep all data from all agents (before some aggregation etc).
The primary server would ideally keep only part of the database e.g. last month.
The more I think about it, the term replica is not correct here.
What I probably need is to setup a procedure (for the aggregation etc.) and as a last step send the resulted data the 'replica' and delete portion of the primary DB.
Is there a better term for my use case?
We have software that generates a large amount of data in a short period of time, and is stored in a single MongoDB database. To increase write performance we are looking into setting up a sharded cluster to handle the incoming data. Because this is all being done on amazon ec2 instances, we would prefer to consolidate our data from the sharded cluster to a single persistent server once the process is done to save on cost. Obviously we can write a python script that will port the data off the cluster when done, but I am hoping there is a cleaner, more automated method. Once the data has been written, the access is all read-only and a single server can handle the workload sufficiently. I was looking for some solution combining replica sets and sharding, but that doesn't seem to to be the way those work. Any suggestions for how to best implement this architecture?
One way to migrate a MongoDB with zero downtime is to create a replica-set consisting of the old and the new servers and removing the old ones as soon as the new have synced. But that doesn't work when the old database is sharded and the new one isn't, because shards are build from replica-sets, not the other way around. That means that you have to copy the database the old-fashioned way. There are two methods to do this:
The network method: Use the command db.copyDatabase(<remote_db_name>, <local_db_name>, <remote_host>, <remote_username>, <remote_password>)
on the destination to copy the database from the source via network.
The file method: Do a mongodump on the source to export the data to a file. Then do a mongorestore on the new server to import it.
I have one mongo instance running on amazon. There are 5M docs in a single collection. And 20docs/1sec come in data. No index. And my server just have 50G space, already used 22G.
Now I need to do some data analyse for those data, but because on index, I execute one query, the db is block and can't insert data until I restart the server.
And data keep come in, so I worry about the space is not enough.
What I'm trying to do is build another server, setup a new mongo instance, then copy the data into it. Then add index on the new one and do the analyse.
Waht is the best way, any suggestion?
Probably the best way is to just create an index in the background. This will not block anything and you can then just run the indexed query on your node. Creating an index in the background takes a bit longer but it does prevent the blocking:
db.collection.ensureIndex( { col: 1 }, { background: true } );
See also: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/
If you really want a secondary to do analysis, then you can create a replica set from your existing member. But for that you will have to take MongoDB down - and restart it with the replSet parameter. After starting it with that parameter, you can now add a new replica set member which will sync the data. This synching will also impact performance as lots of data will have to be copied. The primary will also need more disk space now because of the oplog that MongoDB needs to sync secondaries with.
mongodump and mongorestore can also be an option but then the data between the two nodes will not stay in sync. You would have to run the dump+restore each time you want to run analysis on the new data. In that case, a replica set might be better.
A replica set really wants 3 members though, to prevent a split brain in case a node goes down. This can be another data node, but in your case you would probably want to set-up an arbiter. If you don't want automatic failover (I don't think you'd need it in this case, as you're just doing analysis), then set up your replica set two nodes, but make the second (new) one hidden: http://docs.mongodb.org/manual/tutorial/configure-a-hidden-replica-set-member/
set up replica set from this existing member, and then add the index
on the secondary and do analysis.
Take a mongodump and restore to a new server and do the analysis
Background: I have a very specific use case where I have an existing MongoDB that I need to interact with via reads, but I have to ensure that the data can never be modified. However I also need to trigger some form of event when new data comes in so I can do post processing on it.
The current plan is to use replication to get the data onto a slave for the read processing. However for my purposes I only care about new data in various document stores. Part of the issue is that I can not modify the existing MongoDB and not all the data is timestamped, so there is no incremental way to handle this that I can think of.
Question: Is it possible to fire an event from a slave that would tell me I have new data and what it is? I will only have access to the slave DB, as the master will be locked.
I may have some limited ability to change the master DB, but I can not expect to change the document structure at all.
Instead of using a master/slave configuration you could instead use a replica set with a priority 0 secondary (so that it can never become primary).
You can tail the oplog on that secondary looking for insert operations.
We've have a production app running Mongo with a replica set on different box.
I'd like to start doing some BI on the data, possibly using Pentaho.
My question is: how should I setup my architecture such that I'm not doing BI directly on the production environment?
Should I create a separate BI instance and do an mongoexport to the BI instance or is there some other best-practice I should follow?
There are several options to consider depending on your data set, BI requirements, and MongoDB server version. If you just need to read data for reports, there are more options than if you want to write data as well (eg. for a map/reduce operation). MongoDB 2.2 also introduces some very useful features and improvements as noted below.
In general, use of a replica set configuration will be extremely helpful for administrative purposes as this enables a full copy of your data set to be available without disrupting your primary MongoDB server. For larger data sets and horizontal write scaling, MongoDB's sharding feature can also be used in conjunction with any of the suggestions below.
Before going down the path of separating your BI data, it would be worth determining what the actual impact is by testing in a staging environment.
The following approaches are roughly in order of isolation from your production environment:
With replica sets you can use read preferences to direct queries to appropriate servers. In versions of MongoDB prior to 2.2 the general read preferences were limited to reading from a primary or allowing reads from non-hidden secondaries with "slaveOK" (equivalent to "secondaryPreferred"). In MongoDB 2.2 there are some additional read preferences including "secondary" (read from secondary if available, otherwise error); "primary preferred" (read from primary if available .. otherwise a secondary); and "nearest" (read from nearest primary or secondary node based on network latency). The read preferences in MongoDB 2.2 can be used in conjunction with tag sets to provide even more granular control over directing queries to servers in a replica set or sharded cluster.
For MongoDB 1.8 and higher, you can use replica sets with a hidden secondary node. A hidden node will not be advertised to clients connecting to the replica set normally, but can be connected to directly for report generation. Note: the hidden node will be a read-only secondary, so this limits the use of some queries. For example, map/reduce requires write access to save to an output collection .. but you could use an inline map/reduce depending on your BI requirements.
MongoDB 2.2 has a database-level write lock (an improvement from prior versions that had a global write lock). If you need to write BI data, you can save this into a separate database to minimize lock contention. You still have to consider the overall resource effect .. for example, processing a significant number of older documents for BI purposes might compete with the caching of latest documents that are being queried by your production application.
If you want to completely separate BI data from the production environment, you could create a separate instance using one of the MongoDB backup strategies. If you have replication enabled, you can create a backup from a secondary in your replica set. Depending on the size of your data set, it will likely be faster to do a snapshot copy of the data (which includes indexes that are already built) rather than a full mongodump/mongorestore cycle.
Use a replica set and run the analysis on a secondary node (as long as only are involved).