Consolidating shard data into single persistent DB in MongoDB - mongodb

We have software that generates a large amount of data in a short period of time, which is stored in a single MongoDB database. To increase write performance we are looking into setting up a sharded cluster to handle the incoming data. Because this is all being done on Amazon EC2 instances, we would prefer to consolidate our data from the sharded cluster onto a single persistent server once the process is done, to save on cost. Obviously we can write a Python script that will port the data off the cluster when done, but I am hoping there is a cleaner, more automated method. Once the data has been written, the access is all read-only and a single server can handle the workload sufficiently. I was looking for some solution combining replica sets and sharding, but that doesn't seem to be the way those work. Any suggestions for how to best implement this architecture?

One way to migrate a MongoDB deployment with zero downtime is to create a replica set consisting of the old and the new servers and to remove the old ones as soon as the new ones have synced. But that doesn't work when the old database is sharded and the new one isn't, because shards are built from replica sets, not the other way around. That means that you have to copy the database the old-fashioned way. There are two methods to do this:
The network method: Use the command db.copyDatabase(<remote_db_name>, <local_db_name>, <remote_host>, <remote_username>, <remote_password>) on the destination to copy the database from the source via the network. Note that db.copyDatabase() was deprecated in MongoDB 4.0 and removed in 4.2, so this only works on older versions.
The file method: Do a mongodump on the source to export the data to a file. Then do a mongorestore on the new server to import it.
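The file method can be scripted so that it runs automatically once the write phase is over. The following is a minimal sketch, not a definitive implementation: it assumes mongodump and mongorestore are on the PATH, that the source URI points at a mongos router of the sharded cluster, and that both URIs (which are hypothetical placeholders) are reachable from the machine running the script. It streams the dump straight into the restore through an archive on stdout/stdin instead of writing a file.

```python
import subprocess
from typing import List


def dump_command(source_uri: str) -> List[str]:
    # --archive without a file name makes mongodump write the archive to stdout.
    return ["mongodump", f"--uri={source_uri}", "--archive", "--gzip"]


def restore_command(target_uri: str) -> List[str]:
    # --archive without a file name makes mongorestore read the archive from stdin.
    return ["mongorestore", f"--uri={target_uri}", "--archive", "--gzip"]


def copy_cluster(source_uri: str, target_uri: str) -> None:
    """Pipe a dump of the sharded cluster into the standalone server."""
    dump = subprocess.Popen(dump_command(source_uri), stdout=subprocess.PIPE)
    restore = subprocess.Popen(restore_command(target_uri), stdin=dump.stdout)
    dump.stdout.close()  # let mongodump see a broken pipe if mongorestore exits
    restore_rc = restore.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or restore_rc != 0:
        raise RuntimeError(f"copy failed: dump={dump_rc}, restore={restore_rc}")


if __name__ == "__main__":
    # Hypothetical hosts; replace with your mongos router and target server.
    copy_cluster("mongodb://mongos.example:27017", "mongodb://archive.example:27017")
```

Once the restore has been verified, the cluster instances can be shut down.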

Related

What is the best way to maintain a redacted replica of my MongoDB for analytical and investigation purposes?

I have a production dataset in my MongoDB which I use to run my application. I would like to give my devs access to the data in this database, but the database contains sensitive data which I don't want exposed to devs poking around in it. I would also prefer that the devs don't have direct access to the prod database, but rather have access to a replica of it stored somewhere else.
Ideally, I would prefer to use some tool to maintain a perfect replica of my MongoDB database in another MongoDB database, however, with the replica being redacted so no sensitive data is present.
As a plus, it would be nice if the data could also be transformed and aggregated in different ways before it lands in the second database.
What would be the best way to go about doing this?
Set up a change stream. In the change stream listener, redact the new/updated documents and write them to the analytics instance.
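A rough sketch of such a listener with PyMongo follows. The field names in SENSITIVE_FIELDS are hypothetical, the redaction only strips top-level fields, and the source deployment is assumed to be a replica set or sharded cluster, since change streams are not available on a standalone server. Treat it as a starting point rather than a complete solution.

```python
from typing import Any, Dict, Iterable

# Hypothetical list of top-level fields that must never reach the analytics copy.
SENSITIVE_FIELDS = ("email", "ssn")


def redact(doc: Dict[str, Any], sensitive: Iterable[str] = SENSITIVE_FIELDS) -> Dict[str, Any]:
    """Return a copy of doc without the sensitive top-level fields."""
    blocked = set(sensitive)
    return {key: value for key, value in doc.items() if key not in blocked}


def apply_change(change: Dict[str, Any], target) -> None:
    """Apply one change stream event to the redacted target collection."""
    operation = change["operationType"]
    key = change["documentKey"]  # {"_id": ...} of the affected document
    if operation in ("insert", "update", "replace"):
        full = change.get("fullDocument")
        if full is not None:  # can be missing if the document was deleted meanwhile
            target.replace_one(key, redact(full), upsert=True)
    elif operation == "delete":
        target.delete_one(key)


def mirror(source, target) -> None:
    """Block forever, copying redacted changes from source to target."""
    # updateLookup asks the server to attach the current full document to updates.
    with source.watch(full_document="updateLookup") as stream:
        for change in stream:
            apply_change(change, target)
```

Here source and target would be PyMongo collection objects on the production and analytics deployments. Any transformation or aggregation step could be added inside redact or next to it.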

Postgres replication between 2 databases on same server

I need to create a replica of an existing database that copies every change from master to slave, i.e. a mirror of sorts. I found a lot of examples on the web, but they all describe the case where master and slave are on different servers.
I would like to create a write replica on the same server where the master is located, without spinning up a second instance of Postgres.
Is it possible to do so, and could you point me to where I could find a solution?
Thank you.
P.S. I understand that replication on 2 servers is better, but I just need to do it on one common server.
If you want physical replication, you will need to run two instances of PostgreSQL. If they are on the same server machine, they will need to have different port numbers. The different port numbers is the only complexity, otherwise it is just like running on two different servers.
If you want logical replication, you can do that within a single instance, but you will need to jump through some hoops to create the subscription intra-instance, as described in the "Notes" section
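The hoops for the single-instance case are roughly these: the PostgreSQL documentation for CREATE SUBSCRIPTION notes that the command hangs if it has to create the replication slot in the same cluster, so the slot is created separately first and the subscription is told to reuse it. The sketch below only generates the SQL. All object names are hypothetical, and it assumes wal_level = logical and that the tables already exist with the same definition in the target database (logical replication does not copy DDL).

```python
from typing import Dict, List


def _check_identifier(name: str) -> str:
    # Keep the sketch safe: only plain identifiers are interpolated into SQL.
    if not name.isidentifier():
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def same_instance_replication_sql(
    source_db: str,
    publication: str = "mirror_pub",
    subscription: str = "mirror_sub",
    slot: str = "mirror_slot",
) -> Dict[str, List[str]]:
    """Return the statements to run in the source and in the target database."""
    for name in (source_db, publication, subscription, slot):
        _check_identifier(name)
    return {
        "source": [
            f"CREATE PUBLICATION {publication} FOR ALL TABLES;",
            # Create the slot by hand so CREATE SUBSCRIPTION does not have to.
            f"SELECT pg_create_logical_replication_slot('{slot}', 'pgoutput');",
        ],
        "target": [
            f"CREATE SUBSCRIPTION {subscription} "
            f"CONNECTION 'dbname={source_db}' "
            f"PUBLICATION {publication} "
            f"WITH (create_slot = false, slot_name = '{slot}');",
        ],
    }


if __name__ == "__main__":
    for database, statements in same_instance_replication_sql("app_db").items():
        print(f"-- run in the {database} database")
        print("\n".join(statements))
```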
You could consider using a simple trigger to insert/update/delete data on the other database as soon as the main one is modified.
A more "professional" way would be to use synchronous replication.

PostgreSQL: point-in-time recovery for individual database and not whole cluster

As per standard Postgres documentation
As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset.
From this, I understood that it is not possible to setup PITR for individual databases in a cluster (a.k.a. a database instance holding multiple databases).
If my understanding is incorrect, probably the next part of the question is not relevant, but if not, here it is:
I still do not get the problem in setting this up theoretically as each database is generating its own WAL archive.
The problem here is: I need to set up multiple Postgres clusters, and I only have 2 RHEL 7.6 machines to handle this. I am trying to reduce the number of clusters on these 2 machines to only 2. I am planning to create multiple databases rather than multiple instances to handle customer applications. But that means that I have to sacrifice PITR, as PITR can only be performed on the instance/cluster level and not on the database level (as per the official documentation).
Could someone please help clarify my misunderstanding?
You are correct, you can only do PITR on a PostgreSQL database cluster, not on an individual database.
There is only one WAL stream for the complete database cluster; WAL is not split up per database.
Don't hesitate to run several PostgreSQL clusters on a single machine if that is advantageous for you.
There is little overhead in running a second database cluster. The biggest resource that is hogged by a cluster is shared buffers, but you want that to be only a fraction of the available RAM anyway. Most of the memory should be left to the filesystem cache that is shared by all PostgreSQL clusters.
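To give an idea of how little is involved, here is a small sketch that builds the two commands needed for an additional cluster on the same machine. The data directory, port and log file are hypothetical, and the commands are assumed to be run as the operating system user that owns the PostgreSQL installation. The only requirement is that each cluster gets its own data directory and its own port.

```python
from typing import List


def init_command(data_dir: str) -> List[str]:
    # initdb creates a new, independent database cluster in data_dir.
    return ["initdb", "-D", data_dir]


def start_command(data_dir: str, port: int, log_file: str) -> List[str]:
    # -o passes options to the postgres server process; -p sets its port.
    return ["pg_ctl", "-D", data_dir, "-o", f"-p {port}", "-l", log_file, "start"]


if __name__ == "__main__":
    # Hypothetical second cluster for one customer, next to the default one on 5432.
    print(" ".join(init_command("/var/lib/pgsql/customer_b")))
    print(" ".join(start_command("/var/lib/pgsql/customer_b", 5433, "/var/log/pg_customer_b.log")))
```

Each cluster then has its own WAL stream and can be backed up and recovered to a point in time independently of the others.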

Mongodb backup or replication or clone with existing big data

I have one mongo instance running on Amazon. There are 5M docs in a single collection, and about 20 new docs arrive per second. There is no index. My server has just 50G of disk space, of which 22G is already used.
Now I need to do some data analysis on this data, but because there is no index, when I execute a query the db blocks and can't insert data until I restart the server.
And data keeps coming in, so I worry that the space will not be enough.
What I'm trying to do is build another server, set up a new mongo instance, and copy the data into it. Then I would add an index on the new one and do the analysis.
What is the best way? Any suggestions?
Probably the best way is to just create an index in the background. This will not block anything and you can then just run the indexed query on your node. Creating an index in the background takes a bit longer but it does prevent the blocking:
db.collection.ensureIndex( { col: 1 }, { background: true } );
See also: http://docs.mongodb.org/manual/reference/method/db.collection.ensureIndex/
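From a driver the same thing looks as follows. This is a sketch using PyMongo, where the collection object and the field name are placeholders. Note that in the shell ensureIndex has been a deprecated alias of createIndex since MongoDB 3.0, and that from MongoDB 4.2 on the background option is ignored because all index builds use a less blocking build process.

```python
def build_index_in_background(collection, field: str) -> str:
    """Create an ascending index on field without blocking other operations.

    collection is expected to be a pymongo Collection; the return value is
    the name of the index as reported by the driver.
    """
    # 1 means ascending order, the same as { col: 1 } in the shell.
    return collection.create_index([(field, 1)], background=True)
```

With a real connection this would be called as build_index_in_background(client.mydb.mycollection, "col"), where mydb and mycollection are hypothetical names.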
If you really want a secondary to do analysis, then you can create a replica set from your existing member. But for that you will have to take MongoDB down - and restart it with the replSet parameter. After starting it with that parameter, you can now add a new replica set member which will sync the data. This synching will also impact performance as lots of data will have to be copied. The primary will also need more disk space now because of the oplog that MongoDB needs to sync secondaries with.
mongodump and mongorestore can also be an option but then the data between the two nodes will not stay in sync. You would have to run the dump+restore each time you want to run analysis on the new data. In that case, a replica set might be better.
A replica set really wants 3 members though, to prevent a split brain in case a node goes down. This can be another data node, but in your case you would probably want to set up an arbiter. If you don't want automatic failover (I don't think you'd need it in this case, as you're just doing analysis), then set up your replica set with two nodes, but make the second (new) one hidden: http://docs.mongodb.org/manual/tutorial/configure-a-hidden-replica-set-member/
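As an illustration of the hidden-member variant, the sketch below builds the new replica set configuration as a plain dictionary. The host names are hypothetical. A hidden member must have priority 0; giving it votes = 0 as well is my own assumption here, chosen so that the primary does not step down when the analysis node is offline in a two-node set.

```python
import copy
from typing import Any, Dict


def add_hidden_member(config: Dict[str, Any], host: str, votes: int = 0) -> Dict[str, Any]:
    """Return a new replica set config with host added as a hidden member."""
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    next_id = max((member["_id"] for member in new_config["members"]), default=-1) + 1
    new_config["members"].append(
        {"_id": next_id, "host": host, "priority": 0, "hidden": True, "votes": votes}
    )
    new_config["version"] += 1  # a reconfig needs a higher version number
    return new_config
```

With PyMongo, the current configuration could be read with client.admin.command("replSetGetConfig")["config"] and the result of add_hidden_member applied with client.admin.command("replSetReconfig", new_config), both run against the primary.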
Set up a replica set from this existing member, and then add the index on the secondary and do the analysis.
Take a mongodump, restore it to a new server, and do the analysis.

MongoDB one way replication

I need some way to push data from client databases to a central database. Basically, there are several instances of MongoDB running on remote machines [clients], and I need some method to periodically update the central mongo database with newly added and modified documents from the clients. Each client must replicate its records to the single central server.
Eg:
If I have 3 mongo instances running on 3 machines, each having 10GB of data, then after the data migration the 4th machine's MongoDB must have 30GB of data. And the central MongoDB machine must get periodically updated with the data of all those 3 machines. But these 3 machines not only get new documents; existing documents in them may also get updated. I would like the central MongoDB machine to get these updates as well.
Your desired replication strategy is not formally supported by MongoDB.
A MongoDB replica set consists of a single primary with asynchronous replication to one or more secondary servers in the same replica set. You cannot configure a replica set with multiple primaries or replication to a different replica set.
However, there are a few possible approaches for your use case depending on how actively you want to keep your central server up to date and the volume of data/updates you need to manage.
Some general caveats:
Merging data from multiple standalone servers can create unexpected conflicts. For example, unique indexes would not know about documents created on other servers.
Ideally the data you are consolidating will still be separated by a unique database name per origin server so you don't have strange crosstalk between disparate documents that happen to have the same namespace and _id shared by different origin servers.
Approach #1: use mongodump and mongorestore
If you just need to periodically sync content to your central server, one way to do so is using mongodump and mongorestore. You can schedule a periodic mongodump from each of your standalone instances and use mongorestore to import them into the central server.
Caveats:
There is a --db parameter for mongorestore that allows you to restore into a different database from the original name (if needed)
mongorestore only performs inserts into the existing database (i.e. does not perform updates or upserts). If existing data with the same _id already exists on the target database, mongorestore will not replace it.
You can use mongodump options such as --query to be more selective on data to export (for example, only select recent data rather than all)
If you want to limit the amount of data to dump & restore on each run (for example, only exporting "changed" data), you will need to work out how to handle updates and deletions on the central server.
Given the caveats, the simplest use of this approach would be to do a full dump & restore (i.e. using mongorestore --drop) to ensure all changes are copied.
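A possible shape for such a scheduled full sync is sketched below. It builds one mongodump and one mongorestore command per origin and uses the --nsFrom/--nsTo options of mongorestore (an alternative to --db) to restore each origin into its own database on the central server, which keeps the origins separated as suggested in the general caveats. The origin names, URIs and database name are hypothetical, and the origin URI is assumed not to contain a database name, since that would conflict with --db.

```python
from typing import List, Tuple


def sync_commands(
    origin_name: str, origin_uri: str, central_uri: str, db_name: str
) -> Tuple[List[str], List[str]]:
    """Build the dump and restore commands for one origin server."""
    dump = ["mongodump", f"--uri={origin_uri}", f"--db={db_name}", "--archive"]
    restore = [
        "mongorestore",
        f"--uri={central_uri}",
        "--archive",
        "--drop",  # drop each target collection first so updates and deletes carry over
        f"--nsFrom={db_name}.*",
        f"--nsTo={origin_name}_{db_name}.*",  # e.g. app.users -> client1_app.users
    ]
    return dump, restore


if __name__ == "__main__":
    origins = {
        "client1": "mongodb://client1.example:27017",
        "client2": "mongodb://client2.example:27017",
    }
    for name, uri in origins.items():
        dump, restore = sync_commands(name, uri, "mongodb://central.example:27017", "app")
        # The dump is meant to be piped into the restore, e.g. from a cron job.
        print(" ".join(dump), "|", " ".join(restore))
```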
Approach #2: use a tailable cursor with the MongoDB oplog.
If you need more realtime or incremental replication, a possible approach is creating tailable cursors on the MongoDB replication oplog.
This approach is basically "roll your own replication". You would have to write an application which tails the oplog on each of your MongoDB instances and looks for changes of interest to save to your central server. For example, you may only want to replicate changes for selective namespaces (databases or collections).
A related tool that may be of interest is the experimental Mongo Connector from 10gen labs. This is a Python module that provides an interface for tailing the replication oplog.
Caveats:
You have to implement your own code for this, and learn/understand how to work with the oplog documents
There may be an alternative product which better supports your desired replication model "out of the box".
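To make approach #2 more concrete, here is a minimal sketch of an oplog tailer with PyMongo. It assumes each client instance runs as a (possibly single-member) replica set, because a standalone server has no oplog. The handle callback and the set of namespaces are placeholders for your own logic, and applying the entries to the central server is left out.

```python
import time
from typing import Any, Callable, Dict, Set


def wanted(entry: Dict[str, Any], namespaces: Set[str]) -> bool:
    """Keep only inserts, updates and deletes on the namespaces of interest."""
    return entry.get("op") in ("i", "u", "d") and entry.get("ns") in namespaces


def tail_oplog(client, namespaces: Set[str], handle: Callable[[Dict[str, Any]], None]) -> None:
    """Follow the oplog of one instance forever and pass matching entries to handle."""
    from pymongo import CursorType  # imported here so wanted() works without pymongo

    oplog = client.local["oplog.rs"]
    # Start from the newest entry that exists right now.
    last = oplog.find().sort("$natural", -1).limit(1).next()
    ts = last["ts"]
    while True:
        cursor = oplog.find({"ts": {"$gt": ts}}, cursor_type=CursorType.TAILABLE_AWAIT)
        while cursor.alive:
            for entry in cursor:
                ts = entry["ts"]  # remember the position in case the cursor dies
                if wanted(entry, namespaces):
                    handle(entry)
            time.sleep(1)  # nothing new yet; wait before polling again
```

In each entry, op is the operation type, ns the namespace (database.collection), o the document or update description, and o2 (for updates) the selector of the updated document.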
You should be aware that replica sets are the only replication mechanism in MongoDB, and a replica set always means one primary and multiple secondaries. Writes always go to the primary server. Apparently you want multi-master replication, which is not supported by MongoDB. So you may want to look into a different technology like CouchDB or Couchbase. MongoDB is not a good fit here.
There may be a way since MongoDB 3.6 to achieve your goal: Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.
There are some configuration options that affect whether you can use Change Streams or not, so please read about them.
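For the one-way push to a central server, a change stream listener per client could look roughly like this. It is a sketch with PyMongo under a few assumptions: the client instance is a replica set member (change streams do not work on a standalone server), the origin label is a hypothetical name you choose per client, and deletes are not forwarded. The _id on the central server is wrapped together with the origin so that documents from different clients cannot collide.

```python
from typing import Any, Dict, Iterable, List


def change_pipeline(operation_types: Iterable[str]) -> List[Dict[str, Any]]:
    """Aggregation pipeline that lets only the given operation types through."""
    return [{"$match": {"operationType": {"$in": list(operation_types)}}}]


def tag_with_origin(doc: Dict[str, Any], origin: str) -> Dict[str, Any]:
    """Copy doc and replace its _id with a compound _id that includes the origin."""
    tagged = dict(doc)
    tagged["_id"] = {"origin": origin, "id": doc["_id"]}
    return tagged


def forward(source_collection, central_collection, origin: str) -> None:
    """Block forever, upserting new and changed documents into the central server."""
    pipeline = change_pipeline(["insert", "update", "replace"])
    with source_collection.watch(pipeline, full_document="updateLookup") as stream:
        for change in stream:
            doc = change.get("fullDocument")
            if doc is not None:
                tagged = tag_with_origin(doc, origin)
                central_collection.replace_one({"_id": tagged["_id"]}, tagged, upsert=True)
```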
Another option is Delayed Replica Set Members.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error. For example, a delayed member can make it possible to recover from unsuccessful application upgrades and operator errors including dropped databases and collections.
Hidden Replica Set Members may be another option to consider.
A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set.
Another option may be to configure a Priority 0 Replica Set Member.
A priority 0 member is a secondary that cannot become primary and cannot trigger elections. Apart from that, it maintains a copy of the data set, accepts reads, and votes in elections like any other secondary.
I am interested in these options myself, but I haven't decided what approach I will use.