In MongoDB, can I run the compact command without shutting down each instance?

Our replica set consists of a primary, a secondary, and an arbiter, each running on a separate physical server.
The MongoDB version is 4.2.3.
Because too many documents had accumulated in a specific collection, we deleted some of them, oldest first.
However, deleting the documents did not release any disk space.
Upon checking, I found that MongoDB's storage engine retains the space freed by deleted documents as reusable bytes.
I also found that this unneeded disk space can be released with the compact command on the WiredTiger engine.
Currently, all clients connect to the DB using the arbiter's IP and port.
Since the DB uses only replication, not sharding, I expect that if the compact command is executed on each instance independently, the arbiter will route queries to whichever instances are currently available, even while an instance is locked.
Is this possible?
Or should I shut down each instance, run it standalone, run the compact command, and then reconfigure the PSA (primary-secondary-arbiter) set?

You may upgrade your MongoDB to the latest version, 4.4. From the documentation of compact:
Blocking
Changed in version 4.4.
Starting in v4.4, on WiredTiger, compact only blocks the following
metadata operations:
db.collection.drop
db.collection.createIndex and db.collection.createIndexes
db.collection.dropIndex and db.collection.dropIndexes
compact does not block MongoDB CRUD Operations for the database it is
currently operating on.
Before v4.4, compact blocked all operations for the database it was
compacting, including MongoDB CRUD Operations, and was therefore
recommended for use only during scheduled maintenance periods.
Starting in v4.4, the compact command is appropriate for use at any
time.
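For reference, invoking compact from the mongo shell looks like this; "myCollection" is just a placeholder name:
db.runCommand({ compact: "myCollection" })
On 4.4 this can be issued without blocking CRUD on the database, but note the replica set caveat in the answer below.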

To anyone looking for the answer on 4.4: please see this bug report and the documentation entry, as the compact routine still forces the node into the RECOVERING state if you are running a replica set (and I assume this is the default use case for most projects).

Related

How to locally test MongoDB multicollection transactions when standalone mode does not support them

I am just starting out with MongoDB and am using the docker mongo instance for local development and testing.
My code has to update 2 collections in the same transaction so that the data is logically consistent:
using (var session = _client.StartSession())
{
    session.StartTransaction();
    // each operation must receive the session to take part in the transaction
    ec.InsertOne(session, evt);
    sc.InsertMany(session, snapshot.Selections.Select(ms => new SelectionEntity(snapshot.Id, ms)));
    session.CommitTransaction();
}
This is failing with the error:
"Standalone servers do not support transactions"
The error is obvious: my standalone docker container does not support transactions. I am confused, though, as this means it's impossible to test code such as the above unless I have a replica set running. This doesn't appear to be listed as a requirement in the documentation, which refers to the fact that transactions can be multi-document or distributed:
For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports multi-document transactions. With distributed transactions, transactions can be used across multiple operations, collections, databases, documents, and shards.
It's not clear to me how to create a multi-document transaction that does not require a replica-set server to exist, or how to properly test code locally that may not have a Mongo replica cluster to work against.
How do people handle this?
For testing purposes, you could set up a local replica set using docker-compose. There are various blog posts on the topic available, e.g. Create a replica set in MongoDB with docker-compose.
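If you'd rather avoid docker-compose, a single mongod started with a replica set name is already enough for transactions. A minimal sketch, assuming the default port and an example set name rs0 (start mongod with --replSet rs0, then run this once in the mongo shell):
rs.initiate({ _id: "rs0", members: [{ _id: 0, host: "localhost:27017" }] })
Once rs.initiate() succeeds, the node elects itself primary and multi-document transactions become available.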
Another option is to use a cluster on MongoDB Atlas. There is a free tier available so you can test this without any extra cost.
In addition, you could change your code so that transactions can be disabled depending on the configuration. This way, you can test the code without transactions locally and enable them on staging or production.

Mongo DB - difference between standalone & 1-node replica set

I needed to use Mongo DB transactions, and recently I understood that transactions don't work for Mongo standalone mode, but only for replica sets
(Mongo DB with C# - document added regardless of transaction).
Also, I read that standalone mode is not recommended for production.
So I found out that simply defining a replica set name in the mongod.cfg is enough to run Mongo DB as a replica set instead of standalone.
After changing that, Mongo transactions started working.
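As a note for anyone reproducing this: on a fresh setup, defining the name usually also requires running rs.initiate() once. A quick way to confirm the node really is a 1-node replica set is to check the member state in the mongo shell:
rs.status().members[0].stateStr
This prints "PRIMARY" on a working 1-node replica set, while rs.status() returns an error on a plain standalone.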
However, it feels a bit strange to use it as a replica set although I'm not really using the replication functionality, and I want to make sure I'm using a valid configuration.
So my questions are:
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalable functionality? (as said I need it to allow transactions)
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
I've read that standalone mode is not recommended for production, although it sounds like it's the most basic configuration. I understand that this configuration is not used in most scenarios, but sometimes you may want to use it as a standard DB on a local machine. So why is standalone mode not recommended? Is it not stable enough, or other reasons?
Is there any problem/disadvantage with running Mongo as a 1-node replica set, assuming I don't really need the replication, load balancing or any other scalable functionality?
You don't have high availability afforded by a proper replica set. Thus it's not recommended for a production deployment. This is fine for development though.
Note that a replica set's function is primarily about high availability instead of scaling.
What are the functionality and performance differences, if any, between running as standalone vs. running as a 1-node replica set?
A single-node replica set would have the oplog. This means that you'll use more disk space to store the oplog, and also any insert/update operation would be written to the oplog as well (write amplification).
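To see that overhead, the configured oplog size and the time window it currently covers can be inspected from the mongo shell:
rs.printReplicationInfo()
This prints the configured oplog size along with the timestamps of the first and last oplog entries.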
So why is standalone mode not recommended? Is it not stable enough, or other reasons?
MongoDB in production was designed with a replica set deployment in mind, for:
High availability in the face of node failures
Rolling maintenance/upgrades with no downtime
Possibility to scale-out reads
Possibility to have a replica of data in a special-purpose node that is not part of the high availability nodes
In short, MongoDB was designed to be a fault-tolerant distributed database (scales horizontally) instead of the typical SQL monolithic database (scales vertically). The idea is, if you lose one node of your replica set, the others will immediately take over. Most of the time your application won't even know there's a failure on the database side. In contrast, a failure in a monolithic database server would immediately disrupt your application.
I think kevinadi answered well, but I still want to add to it.
A standalone is an instance of mongod that runs on a single server but is not part of a replica set. Standalone instances are used for testing and development, but it is always recommended to use replica sets in production.
A single-node replica set would have the oplog, which records all changes to its data sets. This means that you'll use more disk space to store the oplog, and also any insert/update operation will be written to the oplog as well (write amplification). It also supports point-in-time recovery.
Please follow Convert a Standalone to a Replica Set if you would like to convert the standalone database to replicaset.
Transactions were introduced in MongoDB version 4.0. Starting in version 4.0, for situations that require atomicity for updates to multiple documents or consistency between reads of multiple documents, MongoDB provides multi-document transactions for replica sets. Transactions are not available on a standalone because they require the oplog to maintain strong consistency within a cluster.
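As an illustration of what the oplog enables, here is roughly what a multi-document transaction looks like in the mongo shell once a replica set is running; the database and collection names are just examples:
// works on a replica set (4.0+); fails on a standalone server
session = db.getMongo().startSession()
session.startTransaction()
coll = session.getDatabase("test").orders
coll.insertOne({ _id: 1, status: "new" })
coll.insertOne({ _id: 2, status: "new" })
session.commitTransaction()
session.endSession()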

How to make MongoDB `repairDatabase` and `compact` commands work with replica-set? Downtime is ok

We need to free some of our MongoDB space, and we identified 100+ GB worth of documents that can be safely removed from a collection.
So we removed them from our test environment which has this setting:
mongodb version 3.0.1
no sharding
1 node, no replica
wiredtiger engine
When done, we found out that the space on disk was still in use and needed to be reclaimed. We found this post and it helped us: after running both
db.runCommand({ repairDatabase: 1 })
and
db.runCommand({ compact: "collection-name" })
we freed 100+ GB.
We then proceeded in production, forgetting that the setting was different since we had 1 replica node:
mongodb version 3.0.1
no sharding
1 primary node, 1 replica node
wiredtiger engine
After removing the documents, we run
db.runCommand({repairDatabase: 1})
and got the OK message (after a while, 10+ min). We tried running
db.runCommand({ compact: "collection-name" })
and got this error:
will not run compact on an active replica set primary as this is a
slow blocking operation. use force:true to force
So we run
db.runCommand({ compact: "collection-name", force: true })
and got the OK message (almost instantly), but the space on disk is still in use; it wasn't freed.
We searched for solutions for running the repairDatabase and compact commands with a replica set, but the advice was focused on avoiding downtime, as if that were the only issue. However, we can schedule downtime; our problem is rather that the commands don't work as expected, since the space is not actually reclaimed.
What did we do wrong?
For replica set configurations, the best and safest method to recover space is to perform an initial sync. If you need to recover space from all nodes in the set, you can perform a rolling initial sync. That is, perform an initial sync on each of the secondaries before finally stepping down the primary and performing an initial sync on it.
Note that a rolling initial sync is only possible if your deployment is a replica set of at least three nodes (for reasons I will describe below).
The rolling initial sync method is the safest way to perform replica set maintenance, and as a bonus it involves no downtime.
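For each node, the procedure is roughly: shut the node down, empty its data directory, restart it, and let it perform an initial sync from the rest of the set. The only shell step worth sketching is handing off the primary before its own turn; the timeout value here is just an example:
rs.stepDown(120)
This makes the primary step down and refuse re-election for 120 seconds, giving the set time to elect another node.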
Having said that, there are some things that are worth mentioning:
Regarding compact:
The compact command on WiredTiger on MongoDB 3.0.x was affected by this bug: SERVER-21833 which was fixed in MongoDB 3.2.3. Prior to this version, compact on WiredTiger could silently fail.
Regarding repairDatabase:
Please don't run repairDatabase on replica set nodes. It is strongly discouraged, as mentioned on the repairDatabase page. The name repairDatabase is a bit misleading, since the command doesn't attempt to repair anything. The command was intended to be used when there's disk corruption, which could lead to corrupt documents.
The repairDatabase command could be more accurately described as "salvage database". That is, it recreates the databases by discarding corrupt documents, in an attempt to get the database into a state where you can start it and salvage intact document from it.
In a replica set, MongoDB expects all nodes in the set to contain identical data. If you run repairDatabase on a replica set node, there is a chance that the node contains undetected corruption, and repairDatabase will dutifully remove the corrupt documents for you. Predictably, this leaves that node with a different dataset from the rest of the set. If an update then happens to hit one of those documents, the whole set could crash. To make matters worse, it is entirely possible for this situation to stay dormant for a long time, only to strike suddenly with no apparent reason.
Regarding your setup:
I noticed that in your production environment you created a replica set with two nodes. This setup is not recommended, since the failure of a single node will leave the remaining node as a secondary, thus disallowing writes to the set.
Due to the way MongoDB high availability works (see Replica Set Election), it's strongly recommended to deploy three data-bearing nodes at a minimum, or at least add an arbiter node (see Replica Set Members) so that your replica set contains an odd number of members.
Having only two members in a replica set also makes rolling upgrades/initial sync/maintenance much harder or even impossible in some cases.
MongoDB 3.0.1 was released on March 17, 2015, more than two years ago as of this writing. If you're forced to stay on the MongoDB 3.0 series, please consider moving to 3.0.15. Or better yet, to 3.4.7 (the latest as of Aug 10, 2017), which contains massive improvements over 3.0.1.

How to quickly mirror a Mongo database?

What's a quick and efficient way to transfer a large Mongo database?
I want to transfer a 10GB production Mongo 3.4 database to a staging environment for testing. I used the mongodump/mongorestore tools to test this transfer to my localhost, but it took over 8 hours and consumed a massive amount of CPU and memory, which is something I'd like to avoid in the future. The database doesn't have any indexes, so the mongodump option to exclude indexes doesn't increase performance.
My staging environment will mostly be read-only, but it will still need to write occasionally, so it can't be set up as a permanent read replica of production.
I've read about replica sets, but they seem very complicated to set up and are designed for permanent mirroring of a primary to two or more secondaries. I've read some posts about people hacking this to be temporary so they can do a one-time mirror, but I can't find any reliable documentation, since this isn't the intended usage of the feature. All the guides I've read also say you need at least 3 servers, which seems unintuitive since I only have 2 (production and staging) and don't want to create a third.
Several options exist today (2020-05-06).
Copy Data Directory
If you can take the system offline you can copy the data directory from one host to another then set the configuration to point to this directory and start up the new mongod.
Mongomirror
Mongomirror (https://docs.atlas.mongodb.com/import/mongomirror/) is intended as a tool to migrate from on-premises deployments to Atlas, but it can be leveraged to copy data to another on-premises host. Beware: this connection requires SSL configuration on both source and target.
Replicaset
MongoDB has built-in high availability features using a replica set model (https://docs.mongodb.com/manual/tutorial/deploy-replica-set/). It is not overly complicated and works very well. This option allows the original system to stay online while replication does its magic. Once the replication completes, reconfigure the replica set to refer only to the new host and shut down the original host. This configuration is referred to as a single-node replica set. Having a single-node replica set offers benefits over a stand-alone installation in that the replica set underpinnings (the oplog) are the basis for other features such as change streams (https://docs.mongodb.com/manual/changeStreams/).
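A sketch of the final cutover in the mongo shell, with placeholder hostnames:
// on the new primary (if the old host is still primary, run rs.stepDown() on it first):
rs.remove("old-production-host:27017")
This leaves a single-node replica set referring only to the new host.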
Backup and Restore
As you mentioned, you can use mongodump/mongorestore. There is a point in time at which the backup must be restored, and during this window the original system is expected to be offline and not accepting any additional writes. This method is robust but has downtime associated with it. You could use mongoexport/mongoimport with a JSON file as an intermediate step, but this is not recommended as BSON data types could be lost in translation.
Per the Mongo documentation, you should be able to cp/rsync files to create a backup (if you are able to halt write ops temporarily on your production setup, or if you do this during a maintenance window):
https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files
Back Up with cp or rsync
If your storage system does not support snapshots, you can copy the files directly using cp, rsync, or a similar tool. Since copying multiple files is not an atomic operation, you must stop all writes to the mongod before copying the files. Otherwise, you will copy the files in an invalid state.
Backups produced by copying the underlying data do not support point in time recovery for replica sets and are difficult to manage for larger sharded clusters. Additionally, these backups are larger because they include the indexes and duplicate underlying storage padding and fragmentation. mongodump, by contrast, creates smaller backups.
FYI: for replica sets, the third "server" is an arbiter, which exists to break the tie when electing a new primary. It does not consume as many resources as the primary/secondaries. Since you are looking to create a staging environment, I would not recommend creating a replica set that spans the production and staging environments. Your primary could fail over to the staging instance, and clients who are meant to access the production instance would end up reading from and writing to the staging instance.

Local MongoDB instance with index in remote server

One of our clients has a server running a MongoDB instance, and we have to build an analytical application using the data stored in their MongoDB database, which changes frequently.
The client's requirements are:
that we do not connect to their MongoDB instance directly or run another instance of MongoDB on their server, but instead somehow run our own MongoDB instance on a machine in our office, using their MongoDB database directory with read-only access, remotely.
We've suggested deploying a REST application or getting a copy of their database dump, but they did not want that. They just want us to run our own MongoDB instance hooked up to their MongoDB instance's data directory. Is this even possible?
I've been searching for a solution for the past two days, and we have to submit a solution by Monday. I really need some help.
I think this is a normal request, because analytical queries could put too much load on the production server. It is pretty normal to separate production and analytical databases.
The easiest option is to use MongoDB replication. Set up a MongoDB replica set with the production database instance as primary and the analytical database instance as secondary; also configure the analytical instance to never become primary.
If it is not possible to use replication - for example, the client doesn't want it, or the servers cannot connect directly to each other... - there is another option. You can read the oplog from the remote database and apply its operations to your own database instance. This is exactly the low-level mechanism by which a replica set works, but you can do it manually too. For example, MMS (MongoDB Monitoring Service) Backup uses oplog reading for online backups of MongoDB.
Update: mongooplog could be the right tool for real-time application of a replication oplog pulled from a remote server onto a local server.
I don't think that running two databases that point to the same database files is possible, or even recommended.
You could use mongorestore to restore from their data files directly, but this will only work if their mongod instance is not running (because mongorestore will need to lock the directory).
Another solution will be to do file system snapshots and then restore to your local database.
The downside to these backup/restore solutions is that your data will not be synced all the time.
Probably the best solution will be to use replica sets with hidden members.
You can create a replica set with just two members:
Primary - this will be the client server.
Secondary - hidden, with votes and priority set to 0. This will be your local instance.
Their server will always be primary (because hidden members cannot become primaries). Clients cannot see hidden members, so for all intents and purposes your server will be read-only.
Another upside to this is that MongoDB replication will do all the "heavy" work of syncing the data between the servers, and your instance will always have the latest data.
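A sketch of setting this up from the mongo shell on the client's primary, with a placeholder hostname for the office machine:
// add the analytics machine as a hidden, non-voting, priority-0 member
rs.add({ host: "analytics-office.example.com:27017", priority: 0, votes: 0, hidden: true })
Hidden members still replicate everything and can be queried over a direct connection, which gives exactly the read-only analytical copy described above.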