We have a production app running MongoDB with a replica set on a different box.
I'd like to start doing some BI on the data, possibly using Pentaho.
My question is: how should I set up my architecture so that I'm not doing BI directly against the production environment?
Should I create a separate BI instance and do a mongoexport to the BI instance, or is there some other best practice I should follow?
There are several options to consider depending on your data set, BI requirements, and MongoDB server version. If you just need to read data for reports, there are more options than if you want to write data as well (e.g. for a map/reduce operation). MongoDB 2.2 also introduces some very useful features and improvements as noted below.
In general, use of a replica set configuration will be extremely helpful for administrative purposes as this enables a full copy of your data set to be available without disrupting your primary MongoDB server. For larger data sets and horizontal write scaling, MongoDB's sharding feature can also be used in conjunction with any of the suggestions below.
Before going down the path of separating your BI data, it would be worth determining what the actual impact is by testing in a staging environment.
The following approaches are roughly in order of isolation from your production environment:
With replica sets you can use read preferences to direct queries to appropriate servers. In versions of MongoDB prior to 2.2 the general read preferences were limited to reading from the primary, or allowing reads from non-hidden secondaries with "slaveOK" (equivalent to "secondaryPreferred"). MongoDB 2.2 adds further read preferences including "secondary" (read from a secondary if available, otherwise error), "primaryPreferred" (read from the primary if available, otherwise a secondary), and "nearest" (read from the nearest primary or secondary node based on network latency). The read preferences in MongoDB 2.2 can be used in conjunction with tag sets to provide even more granular control over directing queries to servers in a replica set or sharded cluster.
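For illustration, a minimal PyMongo sketch (the host names, replica set name and the "use: reporting" tag are placeholder assumptions; you would need to assign such tags to members in your replica set configuration, and 2.2-era drivers spelled some of these options differently):

```python
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

# Connect to the replica set; reads default to secondaries, falling back to
# the primary if no secondary is available.
client = MongoClient(
    "mongodb://db1.example.com:27017,db2.example.com:27017/?replicaSet=rs0"
    "&readPreference=secondaryPreferred"
)

# Tag sets give finer control: only read from secondaries tagged for reporting
# (raises an error if no such member is available).
reporting_db = client.get_database(
    "production",
    read_preference=Secondary(tag_sets=[{"use": "reporting"}]),
)
recent_orders = reporting_db.orders.find({"status": "complete"}).limit(10)
```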
For MongoDB 1.8 and higher, you can use replica sets with a hidden secondary node. A hidden node will not be advertised to clients connecting to the replica set normally, but can be connected to directly for report generation. Note: the hidden node will be a read-only secondary, so this limits the use of some queries. For example, map/reduce requires write access to save to an output collection, but you could use an inline map/reduce depending on your BI requirements.
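A hedged PyMongo sketch of connecting straight to such a hidden member for read-only reporting (the host name is a placeholder; older drivers achieved the same by simply connecting to that single host without a replicaSet option):

```python
from pymongo import MongoClient

# Connect directly to the hidden member (it is not advertised by the replica
# set, so we bypass server discovery) and allow reads from a secondary.
hidden = MongoClient(
    "mongodb://reporting-hidden.example.com:27017/"
    "?directConnection=true&readPreference=secondaryPreferred"
)

# Read-only workloads such as queries, aggregation or inline map/reduce work;
# anything that writes (e.g. map/reduce with an output collection) will fail.
pipeline = [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]
for row in hidden.production.orders.aggregate(pipeline):
    print(row)
```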
MongoDB 2.2 has a database-level write lock (an improvement over prior versions, which had a global write lock). If you need to write BI data, you can save it into a separate database to minimize lock contention. You still have to consider the overall resource effect: for example, processing a significant number of older documents for BI purposes might compete with the caching of the latest documents that are being queried by your production application.
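As a hedged illustration, BI output can simply be written into its own database so its writes take a different database-level lock than your production collections (the names below are placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")

production = client["production"]
bi = client["bi"]  # separate database => separate database-level write lock

# Example: summarise orders by status and store the result in the BI database
# instead of alongside the production collections.
summary = production.orders.aggregate(
    [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]
)
docs = [{"status": d["_id"], "count": d["count"]} for d in summary]
if docs:
    bi.order_counts.insert_many(docs)
```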
If you want to completely separate BI data from the production environment, you could create a separate instance using one of the MongoDB backup strategies. If you have replication enabled, you can create a backup from a secondary in your replica set. Depending on the size of your data set, it will likely be faster to do a snapshot copy of the data (which includes indexes that are already built) rather than a full mongodump/mongorestore cycle.
Use a replica set and run the analysis on a secondary node (as long as only reads are involved).
As per the standard Postgres documentation:
As with the plain file-system-backup technique, this method can only support restoration of an entire database cluster, not a subset.
From this, I understood that it is not possible to set up PITR for individual databases in a cluster (i.e. a database instance holding multiple databases).
If my understanding is incorrect, the next part of the question is probably not relevant; but if it is correct, here it is:
I still do not see the problem in setting this up, at least in theory, as each database is generating its own WAL archive.
The problem here is: I need to set up multiple Postgres clusters, and I have only 2 RHEL 7.6 machines to handle this. I am trying to reduce the number of clusters on these 2 machines to only 2. I am planning to create multiple databases rather than multiple instances to handle customer applications. But that means I have to sacrifice PITR, as it can only be performed at the instance/cluster level and not at the database level (as per the official documentation).
Could someone please help clarify my misunderstanding?
You are correct, you can only do PITR on a PostgreSQL database cluster, not on an individual database.
There is only one WAL stream for the complete database cluster; WAL is not split up per database.
Don't hesitate to run several PostgreSQL clusters on a single machine if that is advantageous for you.
There is little overhead in running a second database cluster. The biggest resource that is hogged by a cluster is shared buffers, but you want that to be only a fraction of the available RAM anyway. Most of the memory should be left to the filesystem cache that is shared by all PostgreSQL clusters.
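If it helps to make that concrete, here is a hedged Python sketch (the host, ports and credentials are placeholders, and pg_current_wal_lsn() needs PostgreSQL 10+) showing that two clusters on one machine are fully independent servers, each with its own data directory and its own WAL stream, which is exactly why PITR works per cluster rather than per database:

```python
import psycopg2

for port in (5432, 5433):  # one port per cluster on the same machine
    conn = psycopg2.connect(host="localhost", port=port,
                            dbname="postgres", user="postgres")
    with conn, conn.cursor() as cur:
        cur.execute("SHOW data_directory")
        data_dir = cur.fetchone()[0]
        cur.execute("SELECT pg_current_wal_lsn()")  # each cluster has its own WAL
        lsn = cur.fetchone()[0]
        print(f"cluster on port {port}: data_directory={data_dir}, WAL position={lsn}")
    conn.close()
```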
We have software that generates a large amount of data in a short period of time, and it is stored in a single MongoDB database. To increase write performance we are looking into setting up a sharded cluster to handle the incoming data. Because this is all being done on Amazon EC2 instances, we would prefer to consolidate our data from the sharded cluster to a single persistent server once the process is done, to save on cost. Obviously we can write a Python script that will port the data off the cluster when done, but I am hoping there is a cleaner, more automated method. Once the data has been written, access is all read-only and a single server can handle the workload sufficiently. I was looking for some solution combining replica sets and sharding, but that doesn't seem to be the way those work. Any suggestions for how to best implement this architecture?
One way to migrate a MongoDB deployment with zero downtime is to create a replica set consisting of the old and the new servers and remove the old ones as soon as the new ones have synced. But that doesn't work when the old database is sharded and the new one isn't, because shards are built from replica sets, not the other way around. That means you have to copy the database the old-fashioned way. There are two methods to do this:
The network method: Use the command db.copyDatabase(<remote_db_name>, <local_db_name>, <remote_host>, <remote_username>, <remote_password>)
on the destination to copy the database from the source over the network.
The file method: Do a mongodump on the source to export the data to a file. Then do a mongorestore on the new server to import it.
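If it helps, a hedged PyMongo equivalent of the network method above (the host and database names are placeholders; note that the copydb command behind db.copyDatabase was deprecated in MongoDB 4.0 and removed in 4.2, so on recent versions the file method is the way to go):

```python
from pymongo import MongoClient

# Run the "copydb" admin command on the destination server, pulling the
# database over the network from the source host.
destination = MongoClient("mongodb://new-server.example.com:27017")
destination.admin.command(
    "copydb",
    fromdb="mydata",                            # database name on the source
    todb="mydata",                              # database name on the destination
    fromhost="old-cluster.example.com:27017",   # source host (placeholder)
)
```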
I need some way to push data from client databases to a central database. Basically, there are several instances of MongoDB running on remote machines [clients], and I need some method to periodically update a central MongoDB database with newly added and modified documents from the clients. Each client must replicate its records to the single central server.
For example:
If I have 3 Mongo instances running on 3 machines, each holding 10 GB of data, then after the data migration the 4th machine's MongoDB must have 30 GB of data. The central MongoDB machine must be periodically updated with the data of all those 3 machines. But these 3 machines not only receive new documents; existing documents in them may also get updated. I would like the central MongoDB machine to receive these updates as well.
Your desired replication strategy is not formally supported by MongoDB.
A MongoDB replica set consists of a single primary with asynchronous replication to one or more secondary servers in the same replica set. You cannot configure a replica set with multiple primaries or replication to a different replica set.
However, there are a few possible approaches for your use case depending on how actively you want to keep your central server up to date and the volume of data/updates you need to manage.
Some general caveats:
Merging data from multiple standalone servers can create unexpected conflicts. For example, unique indexes would not know about documents created on other servers.
Ideally the data you are consolidating will still be separated by a unique database name per origin server, so you don't get strange crosstalk between unrelated documents from different origin servers that happen to share the same namespace and _id.
Approach #1: use mongodump and mongorestore
If you just need to periodically sync content to your central server, one way to do so is using mongodump and mongorestore. You can schedule a periodic mongodump from each of your standalone instances and use mongorestore to import them into the central server (a rough sketch follows the caveats below).
Caveats:
There is a --db parameter for mongorestore that allows you to restore into a different database from the original name (if needed)
mongorestore only performs inserts into the existing database (i.e. does not perform updates or upserts). If existing data with the same _id already exists on the target database, mongorestore will not replace it.
You can use mongodump options such as --query to be more selective on data to export (for example, only select recent data rather than all)
If you want to limit the amount of data to dump & restore on each run (for example, only exporting "changed" data), you will need to work out how to handle updates and deletions on the central server.
Given the caveats, the simplest use of this approach would be to do a full dump & restore (i.e. using mongorestore --drop) to ensure all changes are copied.
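Here is a rough sketch of that full dump & restore approach in Python, under the assumption that the mongodump/mongorestore binaries are on the PATH and that the host names, database names and per-origin target databases shown are placeholders you would adapt. You could run it from cron or any scheduler:

```python
import subprocess
import tempfile

SOURCES = {
    "client_a": "clientA.example.com:27017",
    "client_b": "clientB.example.com:27017",
}
CENTRAL = "central.example.com:27017"

def sync(target_db: str, source_host: str) -> None:
    with tempfile.TemporaryDirectory() as dump_dir:
        # Dump a single database from the origin server.
        subprocess.run(
            ["mongodump", "--host", source_host, "--db", "appdata",
             "--out", dump_dir],
            check=True,
        )
        # Restore into a per-origin database on the central server, dropping
        # the previous copy so updates and deletions are reflected too.
        subprocess.run(
            ["mongorestore", "--host", CENTRAL, "--drop",
             "--db", target_db, f"{dump_dir}/appdata"],
            check=True,
        )

for name, host in SOURCES.items():
    sync(name, host)
```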
Approach #2: use a tailable cursor with the MongoDB oplog.
If you need more realtime or incremental replication, a possible approach is creating tailable cursors on the MongoDB replication oplog.
This approach is basically "roll your own replication". You would have to write an application which tails the oplog on each of your MongoDB instances and looks for changes of interest to save to your central server. For example, you may only want to replicate changes for selective namespaces (databases or collections). A rough sketch follows the caveats below.
A related tool that may be of interest is the experimental Mongo Connector from 10gen labs. This is a Python module that provides an interface for tailing the replication oplog.
Caveats:
You have to implement your own code for this, and learn/understand how to work with the oplog documents
There may be an alternative product which better supports your desired replication model "out of the box".
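For illustration, a hedged sketch of tailing the oplog with PyMongo. The host names and namespaces are placeholders, the source must be running as a replica set (even a single-member one) so that local.oplog.rs exists, and a real implementation would persist the last processed timestamp so it can resume after restarts:

```python
import time
from pymongo import MongoClient, CursorType

source = MongoClient("mongodb://clientA.example.com:27017")
central = MongoClient("mongodb://central.example.com:27017")

oplog = source.local.oplog.rs
# Start from the newest entry; persist this timestamp durably in a real setup.
last_ts = oplog.find().sort("$natural", -1).limit(1).next()["ts"]

while True:
    cursor = oplog.find(
        {"ts": {"$gt": last_ts}, "ns": "appdata.orders"},  # only this namespace
        cursor_type=CursorType.TAILABLE_AWAIT,
    )
    for entry in cursor:
        last_ts = entry["ts"]
        target = central.client_a.orders  # per-origin database on the central server
        if entry["op"] == "i":            # insert: copy the new document
            target.replace_one({"_id": entry["o"]["_id"]}, entry["o"], upsert=True)
        elif entry["op"] == "u":          # update: re-read the current document and upsert it
            doc = source.appdata.orders.find_one(entry["o2"]["_id"])
            if doc is not None:
                target.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        elif entry["op"] == "d":          # delete
            target.delete_one(entry["o"])
    time.sleep(1)  # the tailable cursor died (no data / rollover); retry
```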
You should be aware that MongoDB only offers replica sets for replication, and a replica set always means one primary and multiple secondaries. Writes always go to the primary server. Apparently you want multi-master replication, which is not supported by MongoDB. So you may want to look into a different technology like CouchDB or Couchbase. MongoDB is the wrong fit here.
There may be a way since MongoDB 3.6 to achieve your goal: Change Streams.
Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a single collection, a database, or an entire deployment, and immediately react to them. Because change streams use the aggregation framework, applications can also filter for specific changes or transform the notifications at will.
There are some configuration options that affect whether you can use Change Streams or not, so please read about them.
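For example, a hedged sketch of a change stream consumer with PyMongo (the host names and namespaces are placeholders; change streams need MongoDB 3.6+ and a replica set or sharded cluster, and a production consumer should persist the resume token so it can pick up where it left off):

```python
from pymongo import MongoClient

source = MongoClient("mongodb://clientA.example.com:27017/?replicaSet=rs0")
central = MongoClient("mongodb://central.example.com:27017")

# full_document="updateLookup" makes the server include the current version of
# an updated document, so we can simply upsert it on the central server.
with source.appdata.orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        op = change["operationType"]
        if op in ("insert", "update", "replace"):
            doc = change.get("fullDocument")
            if doc is not None:  # may be None if the doc was deleted meanwhile
                central.client_a.orders.replace_one(
                    {"_id": doc["_id"]}, doc, upsert=True)
        elif op == "delete":
            central.client_a.orders.delete_one(change["documentKey"])
```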
Another option is Delayed Replica Set Members.
Because delayed members are a "rolling backup" or a running "historical" snapshot of the data set, they may help you recover from various kinds of human error. For example, a delayed member can make it possible to recover from unsuccessful application upgrades and operator errors including dropped databases and collections.
Hidden Replica Set Members may be another option to consider.
A hidden member maintains a copy of the primary's data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set.
Another option may be to configure a Priority 0 Replica Set Member.
A priority 0 member maintains a copy of the data set but cannot become primary and cannot trigger elections, so it can serve reads and act as a standby copy without ever being promoted.
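If it is useful, here is a hedged PyMongo sketch of turning one member into a hidden, delayed, priority 0 reporting/backup node via the replSetReconfig command (the host names are placeholders; this mirrors what rs.reconfig() does in the mongo shell, and the delay field is called slaveDelay before MongoDB 5.0 and secondaryDelaySecs from 5.0 on):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]
for member in config["members"]:
    if member["host"] == "reporting.example.com:27017":
        member["priority"] = 0        # can never become primary
        member["hidden"] = True       # invisible to normal clients
        member["slaveDelay"] = 3600   # lag one hour behind the primary (pre-5.0 field name)
config["version"] += 1                # a reconfig requires a bumped config version
client.admin.command("replSetReconfig", config)
```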
I am interested in these options myself, but I haven't decided what approach I will use.
I have an application that cannot afford to lose data, so PostgreSQL is my choice for the database (ACID).
However, speed and query advantages of MongoDB are very attractive, but based on what I've read so far, MongoDB can report a successful write which may not have gone to disk, so I can't make it my mission critical db (I'll also need transactions)
I've seen references to people using MySQL and MongoDB together, one for the transactions and the other for queries. Please note that I'm not talking about keeping some data in one DB and the rest in another. I want to use PostgreSQL as a gateway for data entry, and MongoDB for reads.
Are there any resources that offer an architecture/guide for Postgresql + MongoDB usage in this way? I can remember seeing this topic in Postgresql conference agenda, but I could not find the link.
I don't think you'll get much speed using MongoDB just as a cache. Its strengths are replication and horizontal scalability. On one computer you'd make Mongo and Postgres compete for memory, IO bandwidth and processor time.
As you cannot afford to lose transactions, you'll be better off with Postgres only. It has efficient caching, a sophisticated query planner, prepared queries and wide indexing support, so read-only queries will be very fast - really comparable to MongoDB on a single computer.
Postgres can even scale horizontally now using asynchronous, or, from version 9.1, synchronous replication.
One way to achieve this would be to set up a master-slave replication with the PostgreSQL database as master, and the MongoDB database as slave. You would then do all reads from MongoDB, and all writes to PostgreSQL.
This post discusses such a setup using a tool called Bucardo:
http://blog.endpoint.com/2011/06/mongodb-replication-from-postgres-using.html
You may also be able to do it with Tungsten Replicator, although it seems designed to be used with MySQL:
http://code.google.com/p/tungsten-replicator/wiki/TRCHeterogeneousReplication
I can remember seeing this topic in a PostgreSQL conference agenda, but I could not find the link.
Maybe, you are talking about this: https://www.postgresqlconference.org/content/hybrid-applications-using-mongodb-and-postgres
Depending on how important transactions are to you, one option is to use the MongoDB driver's safe mode and drop PostgreSQL.
http://www.mongodb.org/display/DOCS/getLastError+Command
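That page describes the old getLastError API; in current drivers the same "safe mode" knobs are expressed as a write concern. A minimal PyMongo sketch, with a placeholder host name:

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://db1.example.com:27017/?replicaSet=rs0")

# w="majority" waits for replication to a majority of the replica set and
# j=True waits for the journal - what safe mode / getLastError options controlled.
orders = client.appdata.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", j=True),
)

result = orders.insert_one({"item": "widget", "qty": 5})
print(result.acknowledged)  # True once the write has met the requested concern
```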
How can you expect transactional consistency from Postgres but trust MongoDB for reads? How would you support rollbacks in this scenario? How do you detect when they've gotten out of sync?
I think you're better off going with memcache and implementing a higher level object cache. Alternatively, you could consider a replication slave for reads. If you have performance needs beyond what a dedicated read slave can provide, consider denormalizing your tables on your slave system.
Make sure that any of this is actually needed. For thin tables with PK lookups most modern database engines like Postgres or InnoDB are going to generally keep up with NoSQL solutions. Don't fall into the ROFLSCALE trap
http://www.youtube.com/watch?v=b2F-DItXtZs
I think you can run a Mongo replica set: say, 3 slaves and 1 master. Then in your app you would run all write transactions on PostgreSQL and then on the Mongo replica set. After that you can serve read operations from the Mongo replica set.
But synchronizing the two will be a problem; you would have to work on that.
You may find some replacement for Mongo here or here that is safer and fast as well.
But I advise simplifying your solution instead of making a complicated design.
Visual Guide to NoSQL Systems
In MongoDB we can set the writeConcern property to require that a write goes to the journal and/or is acknowledged by a given number of instances before confirmation is sent, and I think MongoDB even has the concept of transactions now. Not sure why we need Postgres behind it.
I have an application using a MongoDB database and am about to add custom analytics. Is there a reason to use a separate database for the analytics or am I better off to build it in the application database?
Here are the reasons I can think of:
Name collisions between production collections and analytics collections
You need a different replica set configuration for analytics
You need a different sharding configuration for analytics
You want different physical media for some data (production data on fast disks, analytics on slow disk, for example)
Starting in Mongo 2.2, the "global write lock" will be a per-database lock, so different databases will isolate analytics traffic from production traffic a bit more.
Unless something on this list applies to you, you don't need to split them out. Also, it's much easier to move a collection across DBs in MongoDB than in an RDBMS (as you don't have foreign keys to cause trouble), so IMO it's a relatively easy decision to delay.
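For what it's worth, a hedged PyMongo sketch of moving a collection into a separate analytics database later (the database and collection names are placeholders). renameCollection can move an unsharded collection between databases on the same server; it copies the documents and rebuilds indexes, so expect it to take a while on large collections:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1.example.com:27017")

# Move myapp.page_views into a dedicated analytics database.
client.admin.command(
    "renameCollection", "myapp.page_views",
    to="myapp_analytics.page_views",
)
```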