Deploying large data on mongodb replicaset - mongodb

Can I deploy large database by copying its files (eg. testing database with files: testing.0,testing.1,testing.ns found on mongodb dbpath) from another server to the target servers (replica set) to avoid usage of communication bandwidth for replication (in case it is only deployed to the primary)? So basically I want to avoid the slow process of replication.
If journaling is enabled, what is the effect on the process?

Yes you can, this is a perfectly valid way of solving having to do tedious and time consuming replication between members of a distanced or latenced network.
If journaling is enabled nothing really happens, copying via the file system goes around MongoDB.

Related

Mongo DB - difference between standalone & 1-node replica set

I needed to use Mongo DB transactions, and recently I understood that transactions don't work for Mongo standalone mode, but only for replica sets
(Mongo DB with C# - document added regardless of transaction).
Also, I read that standalone mode is not recommended for production.
So I found out that simply defining a replica set name in the mongod.cfg is enough to run Mongo DB as a replica set instead of standalone.
After changing that, Mongo transactions started working.
However, it feels a bit strange using it as replica-set although I'm not really using the replication functionality, and I want to make sure I'm using a valid configuration.
So my questions are:
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalable functionality? (as said I need it to allow transactions)
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
I've read that standalone mode is not recommended for production, although it sounds like it's the most basic configuration. I understand that this configuration is not used in most scenarios, but sometimes you may want to use it as a standard DB on a local machine. So why is standalone mode not recommended? Is it not stable enough, or other reasons?
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalable functionality?
You don't have high availability afforded by a proper replica set. Thus it's not recommended for a production deployment. This is fine for development though.
Note that a replica set's function is primarily about high availability instead of scaling.
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
A single-node replica set would have the oplog. This means that you'll use more disk space to store the oplog, and also any insert/update operation would be written to the oplog as well (write amplification).
So why is standalone mode not recommended? Is it not stable enough, or other reasons?
MongoDB in production was designed with a replica set deployment in mind, for:
High availability in the face of node failures
Rolling maintenance/upgrades with no downtime
Possibility to scale-out reads
Possibility to have a replica of data in a special-purpose node that is not part of the high availability nodes
In short, MongoDB was designed to be a fault-tolerant distributed database (scales horizontally) instead of the typical SQL monolithic database (scales vertically). The idea is, if you lose one node of your replica set, the others will immediately take over. Most of the time your application don't even know there's a failure in the database side. In contrast, a failure in a monolithic database server would immediately disrupt your application.
I think kevinadi answered well, but I still want to add it.
A standalone is an instance of mongod that runs on a single server but is not part of a replica set. Standalone instances used for testing and development, but always recomended to use replica sets in production.
A single-node replica set would have the oplog which records all changes to its data sets . This means that you'll use more disk space to store the oplog, and also any insert/update operation would be written to the oplog as well (write amplification). It also supports point in time recovery.
Please follow Convert a Standalone to a Replica Set if you would like to convert the standalone database to replicaset.
Transactions have been introduced in MongoDB version 4.0. Starting in version 4.0, for situations that require atomicity for updates to multiple documents or consistency between reads to multiple documents, MongoDB provides multi-document transactions for replica sets. The transaction is not available in standalone because it requires oplog to maintain strong consistency within a cluster.

How to quickly mirror a Mongo database?

What's a quick and efficient way to transfer a large Mongo database?
I want to transfer a 10GB production Mongo 3.4 database to a staging environment for testing. I used the mongodump/mongorestore tools to test this transfer to my localhost, but it took over 8 hours and consumed a massive amount of CPU and memory, which is something I'd like to avoid in the future. The database doesn't have any indexes, so the mongodump option to exclude indexes doesn't increase performance.
My staging environment will mostly be read-only, but it will still need to write occasionally, so it can't be setup as a permanent read replica of production.
I've read about [replication sets][1], but they seem very complicated to setup and designed for permanent mirroring of a primary to two or more secondaries. I've read some posts about people hacking this to be temporary, so they can do a one-time mirroring, but I can't find any reliable documentation since this isn't the intended usage of the feature. All the guides I've read also say you need at least 3 servers, which seems unintuitive since I only have 2 (production and staging) and don't want to create a third.
Several options exist today (2020-05-06).
Copy Data Directory
If you can take the system offline you can copy the data directory from one host to another then set the configuration to point to this directory and start up the new mongod.
Mongomirror
Mongomirror (https://docs.atlas.mongodb.com/import/mongomirror/) is intended to be a tool to migrate from on-premises to Atlas, but this tool can be leveraged to copy data to another on-premises host. Beware, this connection requires SSL configurations on source and target to transfer.
Replicaset
MongoDB has built-in High Availability features using a replica set model (https://docs.mongodb.com/manual/tutorial/deploy-replica-set/). It is not overly complicated and works very well. This option allows the original system to stay online while replication does its magic. Once the replication completes reconfigure the replica set to be a single node replica set referring only to the new host and shut down the original host. This configuration is referred to as a single-node replica set. Having a single node replica set offers benefits over a stand-alone installation in that the replica set underpinnings (oplog) are the basis for other features such as change streams (https://docs.mongodb.com/manual/changeStreams/)
Backup and Restore
As you mentioned you can use mongodump/mongorestore. There is a point in time where the backup must be restored. During this time it is expected the original system is offline and not accepting any additional writes. This method is robust but has downtime associated with it. You could use mongoexport/mongoimport to use a JSON file as an intermediate step but this is not recommended as BSON data types could be lost in translation.
Per Mongo documentation, you should be able to cp/rsync files for creating a backup (if you are able to halt write ops temporarily on your production setup - or if you do this during a maintenance window)
https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files
Back Up with cp or rsync
If your storage system does not support snapshots, you can copy the files >directly using cp, rsync, or a similar tool. Since copying multiple files is not >an atomic operation, you must stop all writes to the mongod before copying the >files. Otherwise, you will copy the files in an invalid state.
Backups produced by copying the underlying data do not support point in time >recovery for replica sets and are difficult to manage for larger sharded >clusters. Additionally, these backups are larger because they include the >indexes and duplicate underlying storage padding and fragmentation. mongodump, >by contrast, creates smaller backups.
FYI - for replica sets, the third "server" is an arbiter which exists to break the tie when electing a new primary. It does not consume as many resources as the primary/secondaries. Since you are looking to creating a staging environment, i would not recommend creating a replica set that includes production and staging env. Your primary instance could switch over to the staging instance and clients who are meant to access production instance will end up reading/writing from staging instance.

Hosting Replicaset and Config servers MongoDB on other servers?

Should I keep the replicasets and config servers on separate servers? Or have one replicaset and one config server on one server? Can I have all replicasets on one server and all config servers on another one server? (Does this defeat the purpose of sharding?)
The purpose of sharding is distributing load on multiple servers. The purpose of replication is (mostly) redundancy by allowing one server to take the place of another when that server goes offline for some reason. Obviously, it does not make much sense in either case to run multiple instances on the same server. So yes, it would defeat the purpose of sharding.
However, when you only have two servers and have to choose between replication and sharding, you can get the best of both worlds by creating two shards where each shard has a secondary which runs on the server of the primary of the other shard. That way you have both the performance-improvement when everything is OK but don't lose access to half your data when one server goes down.
Regarding the config servers: MongoDB recommends to make them a separate replica-set which runs on separate servers. But when you are on a budget, it should technically be possible to put that replica-set on the same hardware which runs the actual database. The config servers are only required when a mongos process (re-)starts or when a chunk migration happens and are relatively idle the rest of the time. Unfortunately a chunk migration is also a phase where the involved shards are very busy, so running the config servers on the same hardware will make performance drops during chunk migrations even worse.

MongoDB on Amazon SSD-backed EC2

We have mongodb sharded cluster currently deployed on EC2 instances in Amazon. These shards are also replica sets. The instances used are using EBS with IOPS provisioned.
We have about 30 million documents in a collection. Our queries count the whole collection that matches the filters. We have indexes on almost all of the query-able fields. This results to the RAM reaching 100% usage. Our working set exceeds the size of the RAM. We think that the slow response of our queries are caused by EBS being slow so we are thinking of migrating to the new SSD-backed instances.
C3 is available
http://aws.typepad.com/aws/2013/11/a-generation-of-ec2-instances-for-compute-intensive-workloads.html
I2 is coming soon
http://aws.typepad.com/aws/2013/11/coming-soon-the-i2-instance-type-high-io-performance-via-ssd.html
Our only concern is that SSD is ephemeral, meaning the data will be gone once the instance stops, terminates, or fails. How can we address this? How do we automate backups. Is it a good idea to migrate to SSD to improve the performance of our queries? Do we still need to set-up a sharded cluster?
Working with the ephemeral disks is a risk but if you have your replication setup correctly it shouldn't be a huge concern. I'm assuming you've setup a three node replica set correct? Also you have three nodes for your config servers?
I can speak of this from experience as the company I'm at has been setup this way. To help mitigate risk I'm moving towards a backup strategy that involved a hidden replica. With this setup I can shutdown the hidden replica set and one of the config servers (first having stopped balancing) and take a complete copy of the data files (replica and config server) and have a valid backup. If AWS went down on my availability zone I'd still have a daily backup available on S3 to restore from.
Hope this helps.

MongoDB single server production setup

I am developing a server to a customer who has only one machine for his production deployment.
It's a CentOS 64bit with 8Gb of memory.
I am using Mongo and the question is, do I still need to deploy a replica set even though it's a single machine?
Will I get the advantages of a replica set or since it's a single machine it really does not matter and journaling is enough?
You definitely have to enable journaling (It will ensure consistent state even in cases of HW failure scenarios, you will not have to run costy repair command after a crash). You should enable RAID under the data directrory (Anyway this is in general recommended), while here will be crucial not to lose data due to a disk failure (You do not have copy on an other box or so). There is no option for HA within one box it is quite straightforward, however it is not harmful, and in some cases useful to configure 1 node (1 mongod) replicaset (Than you will have oplog). This will help for example when you likely to have MMS backup, or just for enable point in time backup feature of mongodump. Later if you will likely to scale out for HA this way you will only have to add the new nodes to your initially established replicaset.
Make no sense to run several replicas inside one box, while they will race on HW resources and will bring nothing as an advantage.