How to scale MongoDB? - mongodb

I know that MongoDB can scale vertically. What about if I am running out of disk?
I am currently using EC2 with EBS. As you know, I have to assign EBS for a fixed size.
What if the MongoDB growth bigger than the EBS size? Do I have to create a larger EBS and Copy & Paste the files?
Or shall we start more MongoDB instance and each connect to different EBS disk? In such case, I could connect to a different instance for different databases.

If you're running out of disk, you obviously need to get a bigger disk.
There are several ways to migrate your data, it really depends on the type of up-time you need. First steps of course involve bundling the machine and creating the new volume.
These tips go from easiest to hardest.
Can you take the database completely off-line for several minutes?
If so, do this (migration by copy):
Mount new EBS on the server.
Stop your app from connecting to Mongo.
Shut down mongod and wait for everything to write (check the logs)
Copy all of the data files (and probably the logs) to the new EBS volume.
While the copy is happening, update your mongod start script (or config file) to point to the new volume.
Start mongod and check connection
Restart your app.
Can you take the database off-line for just a few minutes?
If so, do this (slaving and switch):
Start up a new instance and mount the new EBS on that server.
Install / start mongod as a --slave pointing at the current database. (you may need to re-start the current as --master)
The slave will do a fresh synchronization. Once the slave is up-to-date, you'll do a "switch" (next steps).
Turn off writes from the system.
Shut down the original mongod process.
Re-start the "new" mongod as a master instead of the slave.
Re-activate system writes pointing at the new master.
Done correctly those last three steps can happen in minutes or even seconds.
Can you not afford any down-time?
If so, do this (master-master):
Start up a new instance and mount the new EBS on that server.
Install / start mongod as a master and a slave against the current database. (may need to re-start current as master, minimal down-time?)
The new computer should do a fresh synchronization.
Once the new computer is up-to-date, switch the system to point at the new server.
I know it seems like this last version is actually the best, but it can be a little dicey (as of this writing). The reason is simply that I've honestly had a lot of issues with "Master-Master" replication, especially if you don't start with both active.
If you plan on using this method, I highly suggest a smaller practice run first. If something bombs here, Mongo might simply wipe all of your data files which will have the effect of taking more stuff down.
If you get a good version of this please post the commands, I'd like to see it in action.

Doesn't the E in EBS stand for elastic meaning something like resizing on the fly?
Currently the MongoDB team is working on finishining sharding which will allow you horizontal scaling by partitioning data separately on different servers. Give it a month or two and it will work fine. The developers are quite good at keeping their promises.
http://api.mongodb.org/wiki/current/Sharding%20Introduction.html
http://api.mongodb.org/wiki/current/Sharding%20Limits.html

You could slave the bigger disk off the smaller until it's caught up
or
fsync+lock and take a file system snapshot and copy it onto the bigger disk.

well, I am using Mongo DB now. I am pretty amazed the performance it generated, especially on some simple sorting.
I believe it's a good tool for simple web application logic. The remaining concern for is how to scale and backup. I will continue to explore.
The only disadvantage I have is that I didn't have any good tools to reveal the data stored inside. For example, I want to put my logging from MYSQL into Mongo as well. However, it's pretty difficult for me to view the log. Previously, i can use MYSQL query to fetch what I want easily.
Anyway, it's a good tool and I will continue to use it.

Related

Do I gain anything by using "proper" replicas for a read-only MongoDB database?

I have a web-app that depends on a read-only MongoDB database. Through trial and error, I discovered that by far the fastest way to run the ETL pipeline that populates the database is to run a local copy of MongoDB, populate the database, stop the database, and tarball the state directory.
To deploy a high-availability "cluster," I create multiple instances (or containers) running the app, each with access to a copy of the state in locally mounted storage. Putting these behind a load balancer with regular health checks and autoscaling (or in a Kubernetes cluster as a ReplicaSet), I get isolation, redundancy, easy rollbacks (using versioned storage), and easy setup in virtually any environment.
The key idea here is that because the database is read-only, it is in a sense a "stateless" application. Thus, I can treat it like any other static provider of information
There are many apparent advantages to this setup. Nevertheless, I have always had a nagging feeling that I was missing something. Given a read-only context, is there still some reason why it might be better to run a "proper" MongoDB cluster?
If you don't mind outages when the single node goes down and you don't mind taking the system down during upgrades then this is probably an ok deployment. You might get a safer dump and restore using mongodump and mongorestore rather than tar but apart from that this setup should work for a read-only deployment.

How to quickly mirror a Mongo database?

What's a quick and efficient way to transfer a large Mongo database?
I want to transfer a 10GB production Mongo 3.4 database to a staging environment for testing. I used the mongodump/mongorestore tools to test this transfer to my localhost, but it took over 8 hours and consumed a massive amount of CPU and memory, which is something I'd like to avoid in the future. The database doesn't have any indexes, so the mongodump option to exclude indexes doesn't increase performance.
My staging environment will mostly be read-only, but it will still need to write occasionally, so it can't be setup as a permanent read replica of production.
I've read about [replication sets][1], but they seem very complicated to setup and designed for permanent mirroring of a primary to two or more secondaries. I've read some posts about people hacking this to be temporary, so they can do a one-time mirroring, but I can't find any reliable documentation since this isn't the intended usage of the feature. All the guides I've read also say you need at least 3 servers, which seems unintuitive since I only have 2 (production and staging) and don't want to create a third.
Several options exist today (2020-05-06).
Copy Data Directory
If you can take the system offline you can copy the data directory from one host to another then set the configuration to point to this directory and start up the new mongod.
Mongomirror
Mongomirror (https://docs.atlas.mongodb.com/import/mongomirror/) is intended to be a tool to migrate from on-premises to Atlas, but this tool can be leveraged to copy data to another on-premises host. Beware, this connection requires SSL configurations on source and target to transfer.
Replicaset
MongoDB has built-in High Availability features using a replica set model (https://docs.mongodb.com/manual/tutorial/deploy-replica-set/). It is not overly complicated and works very well. This option allows the original system to stay online while replication does its magic. Once the replication completes reconfigure the replica set to be a single node replica set referring only to the new host and shut down the original host. This configuration is referred to as a single-node replica set. Having a single node replica set offers benefits over a stand-alone installation in that the replica set underpinnings (oplog) are the basis for other features such as change streams (https://docs.mongodb.com/manual/changeStreams/)
Backup and Restore
As you mentioned you can use mongodump/mongorestore. There is a point in time where the backup must be restored. During this time it is expected the original system is offline and not accepting any additional writes. This method is robust but has downtime associated with it. You could use mongoexport/mongoimport to use a JSON file as an intermediate step but this is not recommended as BSON data types could be lost in translation.
Per Mongo documentation, you should be able to cp/rsync files for creating a backup (if you are able to halt write ops temporarily on your production setup - or if you do this during a maintenance window)
https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files
Back Up with cp or rsync
If your storage system does not support snapshots, you can copy the files >directly using cp, rsync, or a similar tool. Since copying multiple files is not >an atomic operation, you must stop all writes to the mongod before copying the >files. Otherwise, you will copy the files in an invalid state.
Backups produced by copying the underlying data do not support point in time >recovery for replica sets and are difficult to manage for larger sharded >clusters. Additionally, these backups are larger because they include the >indexes and duplicate underlying storage padding and fragmentation. mongodump, >by contrast, creates smaller backups.
FYI - for replica sets, the third "server" is an arbiter which exists to break the tie when electing a new primary. It does not consume as many resources as the primary/secondaries. Since you are looking to creating a staging environment, i would not recommend creating a replica set that includes production and staging env. Your primary instance could switch over to the staging instance and clients who are meant to access production instance will end up reading/writing from staging instance.

Data Recovery: PostgreSQL showing base volume under postgres pg_default tablespace, but does not recognize separate databases

I had an instance of Postgres (v 9.2), running locally on Windows 7. I have yet to isolate the cause, but PG became corrupted in such a way that the server abruptly stopped, and the service would shut down immediately when I attempted to restart it. I reinstalled 9.2, and that fixed the problem with the service not starting. However, now pgAdmin does not show any of the databases were there previously (yet the files are still there in the data\base directory). Oddly, the size of the pg_default tablespace shows 11GB, the correct size, but does not show any of the databases or objects under the dependencies. The backups I have are a few days old, so I would like to restore the databases directly from the files. How do I get PG to recognize the database files that are in the data/base directory?
In general, every data recovery job is unique. You aren't going to find a simple answer, and these require a lot of hands-on troubleshooting. If you are going to do this yourself, I have some pointers below for getting started. If the data is important, hire an expert (2ndQuadrant, PgExperts, etc).
A few general rules:
Work on a copy of the files (i.e. back up your data directory and all tablespaces, and work on that, on another computer). Better yet, create a validated copy and work on a copy of that.
After having made and verified copies (ideally with hashes of data), run hardware diagnostics on the corrupted system to see what went wrong.
Now to get started, you are probably want to look over the PostgreSQL architecture docs and source code relating to on-disk layout. You will probably need a hex editor. You will certainly want to look at the system tables to see why the relations are not showing up. If you don't have a good understanding of memory and disk alignment issues on your platform you need to brush up on that as well.

MongoDB single server production setup

I am developing a server to a customer who has only one machine for his production deployment.
It's a CentOS 64bit with 8Gb of memory.
I am using Mongo and the question is, do I still need to deploy a replica set even though it's a single machine?
Will I get the advantages of a replica set or since it's a single machine it really does not matter and journaling is enough?
You definitely have to enable journaling (It will ensure consistent state even in cases of HW failure scenarios, you will not have to run costy repair command after a crash). You should enable RAID under the data directrory (Anyway this is in general recommended), while here will be crucial not to lose data due to a disk failure (You do not have copy on an other box or so). There is no option for HA within one box it is quite straightforward, however it is not harmful, and in some cases useful to configure 1 node (1 mongod) replicaset (Than you will have oplog). This will help for example when you likely to have MMS backup, or just for enable point in time backup feature of mongodump. Later if you will likely to scale out for HA this way you will only have to add the new nodes to your initially established replicaset.
Make no sense to run several replicas inside one box, while they will race on HW resources and will bring nothing as an advantage.

Mongo db partial back ups

We have a 5 node replication set up on our development server. We are looking for a way to allow developers to back up a subset of data in a mongo db and restore this to their local development enviroments.
We have looked into the clonedb and the mongodump utils, but both only allow for a backup/dump of the complete database. Due to the possible size of the database, we need an option that allows us to limit the data being backed up or restored.
Do any know of a util or way to achieve this?
I just stumbled upon this question again and decided to add a description of our backup strategy we opt in for:
Current back up strategy for our mongo db this server consist of 2 setups; backup via delayed passive secondarynode and daily backup using mongodump (takes journalling and oplog into play).
Besides our normal production nodes, we have setup another secondary node with a priority of 0 (this can either be on its own server or piggy backing off another mongo server but using a seperate port), hidden as true and a delay of 7200 seconds (2hours). This slave is there for "butter fingers", when some one accidentally drops a database or clears a collection, we have 2 hours before these changes replicate to this passive secondary. The passive secondary can NOT be used for READING or WRITING. It's role is simply a back up node. We also use this node for nightly backup to prevent unnecessary overhead on any of the other nodes.
The nightly backup is set to run every night at 23:00 via a cron tab. The command simply executes a script setup in /opt/auto-mongo-backup. This script can be found at https://github.com/jaconel/automongobackup (originally found it at https://github.com/micahwedemeyer/automongobackup). This script allows for a single nightly cron to cover weekly backups and monthly backups. Back ups are saved at /var/backups/mongodb.
Hope this helps some one out.