What is the best way to maintain a redacted replica of my MongoDB for analytical and investigation purposes? - mongodb

I have a production dataset in MongoDB which I use to run my application. I would like to give my devs access to the data in this database, but it contains sensitive data which I don't want exposed to devs poking around. I would also prefer that the devs not have direct access to the prod database, but rather to a replica of it stored somewhere else.
Ideally, I would use some tool to maintain a perfect replica of my MongoDB database in another MongoDB database, with the replica redacted so that no sensitive data is present.
As a plus, it would be nice if the data could also be transformed and aggregated in different ways before it lands in the second database.
What would be the best way to go about doing this?

Set up a change stream. In the change stream listener, redact the new/updated documents and write them to the analytics instance.
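For concreteness, here is a minimal sketch of that approach using pymongo. The connection strings, database/collection names, and SENSITIVE_FIELDS list are placeholders, and note that change streams require the source to be a replica set or sharded cluster:

    # A minimal sketch: tail a change stream on the prod collection, strip
    # sensitive fields, and mirror the result into the analytics instance.
    from pymongo import MongoClient

    SENSITIVE_FIELDS = {"email", "ssn", "payment_token"}  # hypothetical field names

    prod = MongoClient("mongodb://prod-host:27017")["app"]["customers"]
    analytics = MongoClient("mongodb://analytics-host:27017")["app_redacted"]["customers"]

    def redact(doc):
        """Drop sensitive fields before the document leaves the prod side."""
        return {k: v for k, v in doc.items() if k not in SENSITIVE_FIELDS}

    # full_document="updateLookup" delivers the whole post-update document,
    # not just the changed fields, so the replica copy can be overwritten.
    with prod.watch(full_document="updateLookup") as stream:
        for change in stream:
            op = change["operationType"]
            key = change["documentKey"]  # {"_id": ...}
            if op == "delete":
                analytics.delete_one(key)
            elif op in ("insert", "update", "replace"):
                analytics.replace_one(key, redact(change["fullDocument"]), upsert=True)

Any extra transformation or aggregation can happen inside the listener before the write, and change stream resume tokens let the listener pick up where it left off after a restart.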

Related

MongoDB to DynamoDB

I have a database currently in Mongo running on an EC2 instance and would like to migrate the data to DynamoDB. Is this possible, and what is the most cost-effective way to achieve this?
When you ask for a "cost effective way" to migrate data, I assume you are looking for existing technologies that can ease your life. If so, you could do the following:
Export your MongoDB data to a text file, say in CSV format, using mongoexport.
Upload that file somewhere in S3.
Import this data, in S3, to DynamoDB using AWS Data Pipeline.
Of course, you should design & finalize your DynamoDB table schema before doing all this.
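If you go this route, the first two steps can be scripted; a minimal sketch is below, assuming mongoexport is on PATH and boto3 is configured with AWS credentials, with hypothetical database, collection, field, bucket, and key names:

    # A minimal sketch of steps 1 and 2: export a collection with mongoexport,
    # then push the file to S3 for Data Pipeline to pick up.
    import subprocess
    import boto3

    subprocess.run([
        "mongoexport",
        "--db", "mydb",
        "--collection", "users",
        "--type", "csv",
        "--fields", "name,email,created_at",  # CSV export needs an explicit field list
        "--out", "users.csv",
    ], check=True)

    boto3.client("s3").upload_file("users.csv", "my-migration-bucket", "exports/users.csv")

The Data Pipeline import (step 3) is configured separately, via the AWS console or a pipeline definition, rather than in this script.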
Whenever you are changing databases, you have to be very careful about the way you migrate data. Certain data formats maintain type consistency, while others do not.
Then there are data formats that simply cannot handle your schema. For example, CSV is great at handling data when it is one row per entry, but how do you render an embedded array in CSV? It really isn't possible. JSON handles this well, but JSON has its own problems.
The easiest example of this is JSON and DateTime. JSON has no specification for storing DateTime values: they can end up as ISO 8601 strings, UNIX epoch timestamps, or really anything a developer can dream up. What about Longs, Doubles, and Ints? JSON doesn't discriminate; it has a single number type, which can cause loss of precision if values aren't deserialized carefully.
This makes it very important that you choose the appropriate translation medium. This generally means you have to roll your own solution: load up the drivers for both databases, read an entry from one, translate it, and write it to the other. This is the best way to be absolutely sure errors are handled properly for your environment, that types are kept consistent, and that the code properly translates the schema from source to destination (if necessary).
What does this all mean for you? It means a lot of legwork. It is possible somebody has already rolled something that is broad enough for your case, but I have found in the past that it is best to do it yourself.
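To make the "roll your own" option concrete, here is a minimal sketch of the read-translate-write loop using pymongo and boto3. The database, collection, and table names are hypothetical, and the type mapping is only an example of the decisions you would have to make for your own schema:

    # A minimal sketch: read each document from MongoDB, translate BSON types
    # into types DynamoDB accepts, and batch-write the result.
    import datetime
    from decimal import Decimal

    import boto3
    from bson import ObjectId
    from pymongo import MongoClient

    source = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]
    # Assumes a DynamoDB table named "orders" whose partition key is "_id".
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("orders")

    def translate(value):
        """Map BSON/Python types onto types the DynamoDB SDK accepts."""
        if isinstance(value, ObjectId):
            return str(value)            # ObjectId -> string key
        if isinstance(value, datetime.datetime):
            return value.isoformat()     # pick one DateTime convention and stick to it
        if isinstance(value, float):
            return Decimal(str(value))   # DynamoDB numbers must be Decimal, not float
        if isinstance(value, dict):
            return {k: translate(v) for k, v in value.items()}
        if isinstance(value, list):
            return [translate(v) for v in value]
        return value

    with table.batch_writer() as batch:
        for doc in source.find():
            batch.put_item(Item={k: translate(v) for k, v in doc.items()})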
I know this post is old, but Amazon has since made this possible with AWS DMS; check this document:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
Some relevant parts:
Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service
You can use AWS DMS to migrate data to an Amazon DynamoDB table. Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. AWS DMS supports using a relational database or MongoDB as a source.

Meteor app as front end to externally updated mongo database

I'm trying to set up an app that will act as a front end to an externally updated mongo database. The data will be pushed into the database by another process.
So far I have the app connecting to the external Mongo instance and pulling data out with no issues, but it's not reactive (it isn't seeing any of the new data going into the Mongo database).
I've done some digging and so far can only find that I might need to set up a replica set and use the oplog. Is there a way to do this without going to replica sets (or is that the best way anyway)?
The code so far is really simple: a single collection, a single publication (pulling out the last 10 records from the database), and a single template just displaying that data.
No deps that I've written (not sure if that's what I'm missing).
Thanks.
Any reason not to use the oplog? From what I've read it is the recommended approach even if your DB isn't updated by an external process, and a must if it is.
Nevertheless, even without the oplog your app should see the changes made to the DB by the external process. It just takes longer (up to 10 seconds, because Meteor falls back to polling the database), but it should update.
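For context, Meteor's oplog tailing amounts to a tailable cursor on the replica set's local.oplog.rs collection; Meteor itself only needs the MONGO_OPLOG_URL environment variable pointed at that local database. A minimal sketch of the underlying idea in pymongo, with a hypothetical connection string:

    # A minimal sketch of oplog tailing: open a tailable cursor on local.oplog.rs
    # and print each operation as it is appended. Requires a replica-set member.
    from pymongo import CursorType, MongoClient

    oplog = MongoClient("mongodb://localhost:27017")["local"]["oplog.rs"]

    # Start from the newest entry and then wait for new ones.
    last = oplog.find().sort("$natural", -1).limit(1).next()
    cursor = oplog.find({"ts": {"$gt": last["ts"]}}, cursor_type=CursorType.TAILABLE_AWAIT)

    while cursor.alive:
        for entry in cursor:
            print(entry["op"], entry["ns"])  # e.g. "i mydb.items" for an insert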

Consolidating shard data into single persistent DB in MongoDB

We have software that generates a large amount of data in a short period of time, which is stored in a single MongoDB database. To increase write performance we are looking into setting up a sharded cluster to handle the incoming data. Because this is all being done on Amazon EC2 instances, we would prefer to consolidate our data from the sharded cluster onto a single persistent server once the process is done, to save on cost. Obviously we can write a Python script that will port the data off the cluster when done, but I am hoping there is a cleaner, more automated method. Once the data has been written, the access is all read-only and a single server can handle the workload sufficiently. I was looking for some solution combining replica sets and sharding, but that doesn't seem to be the way those work. Any suggestions for how to best implement this architecture?
One way to migrate a MongoDB with zero downtime is to create a replica set consisting of the old and the new servers and remove the old ones as soon as the new ones have synced. But that doesn't work when the old database is sharded and the new one isn't, because shards are built from replica sets, not the other way around. That means you have to copy the database the old-fashioned way. There are two methods to do this:
The network method: run the command db.copyDatabase(<remote_db_name>, <local_db_name>, <remote_host>, <remote_username>, <remote_password>) on the destination to copy the database from the source over the network.
The file method: Do a mongodump on the source to export the data to a file. Then do a mongorestore on the new server to import it.
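For the file method, a minimal sketch of the two commands wrapped in a script follows; the hosts, database name, and dump directory are hypothetical, and a dump of a sharded cluster should be taken through a mongos:

    # A minimal sketch of the file method: dump from the sharded cluster via a
    # mongos, then restore into the single persistent server.
    import subprocess

    subprocess.run([
        "mongodump",
        "--host", "mongos.example.com", "--port", "27017",
        "--db", "mydb",
        "--out", "/tmp/dump",
    ], check=True)

    subprocess.run([
        "mongorestore",
        "--host", "archive.example.com", "--port", "27017",
        "--db", "mydb",
        "/tmp/dump/mydb",
    ], check=True)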

In MongoDB is there a performance reason to separate collections into more than one database?

I have an application using a MongoDB database and am about to add custom analytics. Is there a reason to use a separate database for the analytics or am I better off to build it in the application database?
Here are the reasons I can think of:
Name collisions between production collections and analytics collections
You need a different replica set configuration for analytics
You need a different sharding configuration for analytics
You want different physical media for some data (production data on fast disks, analytics on slow disk, for example)
Starting in Mongo 2.2, the "global write lock" will be a per-database lock, so different databases will isolate analytics traffic from production traffic a bit more.
Unless something on this list applies to you, you don't need to split them out. Also, it's much easier to move a collection across DBs in MongoDB than in an RDBMS (as you don't have foreign keys to cause trouble), so IMO it's a relatively easy decision to delay.
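If you do split them, not much changes in the application code; a minimal sketch with hypothetical database and collection names, keeping analytics in its own database on the same deployment:

    # A minimal sketch: production data and analytics live in separate databases,
    # addressed through the same client (or separate clients if the analytics
    # database later moves to different hardware).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    orders = client["app"]["orders"]                     # production collection
    page_views = client["app_analytics"]["page_views"]   # analytics collection, separate DB

    page_views.insert_one({"path": "/checkout", "user_id": 42})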

MongoDB - Single Database or Multiple Databases for SaaS Offering

We have decided to use MongoDB for a SaaS offering we are creating. Each company that signs up gets their own URL (mycompany.domain.com) and their own private set of users, projects, etc. Since we are using a NoSQL solution and wouldn't have to manage pushing out schema updates to every database like we would with MySQL, I am wondering if it would be better to have one huge database containing all the data, or to have one database per client.
Since MongoDB can shard the database across multiple servers, I'm thinking there wouldn't be a huge performance hit if we had a giant database, but I also think backups and exporting data would be much easier if there was one database per client. Any thoughts?
Go with one but make sure to take advantage of some sort of replication for backup purposes!
Look into sharding, or look into replica sets.
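To give the single-database route a concrete shape, here is a minimal sketch of sharding a multi-tenant collection on a tenant key; the database, collection, and key names are hypothetical, and it assumes you connect through a mongos:

    # A minimal sketch: one shared database, every document tagged with its
    # tenant, and the collection sharded on that tenant key.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.com:27017")  # connect via mongos

    client.admin.command("enableSharding", "saasdb")
    client.admin.command("shardCollection", "saasdb.projects",
                         key={"company_id": 1, "_id": 1})

    # Per-client backup or export then becomes a simple filter on company_id.
    client["saasdb"]["projects"].insert_one({"company_id": "acme", "name": "Launch"})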