Does mongod run in a single thread when performing a mongorestore? - mongodb

When monitoring a mongodb restore, I'm tracking two processes...
CPU COMMAND
100% mongod --config /etc/mongo/mongod.conf
0% mongorestore /data/dump
MongoDB is 4.4.14 and mongorestore is version 100.5.3. I'm running it inside a Docker container.
I never see mongod go past 100%. Is there a way to allow it to use more than a single core when performing a mongorestore?

By default mongorestore will restore at most 4 collections in parallel, with a single insertion worker per collection (since you see only two processes running in parallel, perhaps you have only 2 collections in your database...), but you can increase these parameters if you have more collections and would like to restore faster. Check the official docs for details:
--numParallelCollections (default: 4)
Number of collections mongorestore should restore in parallel.
--numInsertionWorkersPerCollection (default: 1)
Specifies the number of insertion workers to run concurrently per collection.
For large imports, increasing the number of insertion workers may increase the speed of the import.
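For illustration, a tuned invocation might look like the sketch below; the numbers are arbitrary examples rather than recommendations, so adjust them to your collection count and hardware:
# restore 8 collections at a time, 4 insertion workers each (illustrative values)
mongorestore --numParallelCollections=8 --numInsertionWorkersPerCollection=4 /data/dump
Note that past some point the bottleneck usually shifts from CPU to disk I/O or the WiredTiger cache on the mongod side.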

Related

faster mongoimport, in parallel in airflow?

tl;dr: there seems to be a limit on how fast data is inserted into our mongodb atlas cluster. Inserting data in parallel does not speed this up. How can we speed this up? Is our only option to get a larger mongodb atlas cluster with more Write IOPS? What even are write IOPS?
We replace and re-insert more than 10GB of data daily into our mongodb cluster with atlas. We have the following 2 bash commands, wrapped in python functions to help parameterize the commands, that we use with BashOperator in airflow:
upload single JSON to mongo cluster
def mongoimport_file(mongo_table, file_name):
    # upload single file from /tmp directory into Mongo cluster
    # cleanup: remove .json in /tmp at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && mongoimport --uri "{uri}" --collection {mongo_table} --drop --file /tmp/{file_name}.json \
    && echo AND REMOVE LOCAL FILES... \
    && rm /tmp/{file_name}.json
    """
upload directory of JSONs to mongo cluster
def mongoimport_dir(mongo_table, dir_name):
    # upload directory of JSONs into mongo cluster
    # cleanup: remove directory at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && cat /tmp/{dir_name}/*.json | mongoimport --uri "{uri}" --collection {mongo_table} --drop \
    && echo AND REMOVE LOCAL FILES... \
    && rm -rf /tmp/{dir_name}
    """
These are called in Airflow using the BashOperator:
import_to_mongo = BashOperator(
    task_id=f'mongo_import_v0__{this_table}',
    bash_command=mongoimport_file(mongo_table='tname', file_name='fname')
)
Both of these work, although with varying performance:
mongoimport_file with 1 5GB file: takes ~30 minutes to mongoimport
mongoimport_dir with 100 50MB files: takes ~1 hour to mongoimport
There is currently no parallelization with mongoimport_dir, and in fact it is slower than importing just a single file.
Within airflow, is it possible to parallelize the mongoimport of our directory of 100 JSONs, to achieve a major speedup? If there's a parallel solution using python's pymongo that cannot be done with mongoimport, we're happy to switch (although we'd strongly prefer to avoid loading these JSONs into memory).
What is the current bottleneck with importing to mongo? Is it (a) CPUs in our server / docker container, or (b) something with our mongo cluster configuration (cluster RAM, or cluster vCPU, or cluster max connections, or cluster read / write IOPS (what are these even?)). For reference, here is our mongo config. I assume we can speed up our import by getting a much bigger cluster but mongodb atlas becomes very expensive very fast. 0.5 vCPUs doesn't sound like much, but this already runs us $150 / month...
First of all, regarding "What is the current bottleneck with importing to mongo?" and "Is it (a) CPUs in our server / docker container" - don't believe anyone who will tell you the answer from the screenshot you provided.
Atlas has monitoring tools that will tell you whether the bottleneck is CPU, RAM, disk or network, or any combination of those, on the db side.
On the client side (Airflow), use the system monitor of your host OS to answer the question. Test disk I/O inside Docker; some combinations of host OS and Docker storage drivers have performed quite poorly in the past.
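A quick way to sanity-check write throughput inside the container is a direct-I/O dd run; a rough sketch only (the container name and size are placeholders, and dd measures sequential rather than random I/O):
# write 1GB inside the Airflow container, bypassing the page cache
docker exec <airflow-container> dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
docker exec <airflow-container> rm /tmp/ddtest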
Next, "What even are write IOPS" - random
write operations per second
https://cloud.google.com/compute/docs/disks/performance
IOPS calculation differs depending on the cloud provider. Try AWS and Azure to compare cost vs speed. An M10 on AWS gives you 2 vCPUs, though again I doubt you can compare them 1:1 between vendors. The good thing is it's on-demand and will cost you less than a cup of coffee to test and then delete the cluster.
Finally, "If there's a parallel solution using python's pymongo" - I doubt so. mongoimport uses batches of 100,000 documents, so essentially it sends it as fast as the stream is consumed on the receiver. The limitations on the client side could be: network, disk, CPU. If it is network or disk, parallel import won't improve a thing. Multi-core systems could benefit from parallel import if mongoimport was using a single CPU and it was the limiting factor. By default mongoimport uses all CPUs available: https://github.com/mongodb/mongo-tools/blob/cac1bfbae193d6ba68abb764e613b08285c6f62d/common/options/options.go#L302. You can hardly beat it with pymongo.

Mongorestore seems to run out of memory and kills the mongo process

In the current setup there are two Mongo Docker containers, running on hosts A and B, with Mongo version 3.4, running in a replica set. I would like to upgrade them to 3.6 and add a member so the containers would run on hosts A, B and C. The containers have an 8GB memory limit and no swap allocated (currently), and are administrated in Rancher. So my plan was to boot up three new containers, initialize a replica set for them, take a dump from the 3.4 container, and restore it to the new replica set master.
Taking the dump went fine, and its size was about 16GB. When I tried to restore it to the new 3.6 master, restoring starts fine, but after it has restored roughly 5GB of the data, the mongo process seems to be killed by the OS/Rancher, and while the container itself doesn't restart, the MongoDB process just crashes and reloads itself back up again. If I run mongorestore against the same database again, it reports a unique key error for all the already inserted entries and then continues where it left off, only to do the same again after 5GB or so. So it seems that mongorestore loads all the entries it restores into memory.
So I've got to get some solution to this, and:
Every time it crashes, just run the mongorestore command so it continues where it left off. It probably should work, but I feel a bit uneasy doing it.
Restore the database one collection at a time, but the largest collection is bigger than 5GB so it wouldn't work properly either.
Add swap or physical memory (temporarily) to the container so the process doesn't get killed after the process runs out of physical memory.
Something else, hopefully a better solution?
Increasing the swap size, as the other answer pointed out, worked for me. Also, the --numParallelCollections option controls the number of collections mongodump/mongorestore should dump/restore in parallel. The default is 4, which may consume a lot of memory.
Since mongorestore successfully continues where it left off, it doesn't sound like you're running out of disk space, so focusing on memory issues is the correct response. You're definitely running out of memory during the mongorestore process.
I would highly recommend going with the swap space, as this is the simplest, most reliable, least hacky, and arguably the most officially supported way to handle this problem.
Alternatively, if you're for some reason completely opposed to using swap space, you could temporarily use a node with a larger amount of memory, perform the mongorestore on this node, allow it to replicate, then take the node down and replace it with a node that has fewer resources allocated to it. This option should work, but could become quite difficult with larger data sets and is pretty overkill for something like this.
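If you do go with swap, a minimal sketch for adding it temporarily on a Linux host looks like this (the 8GB size and path are illustrative; swap has to be configured on the host, since containers cannot add their own):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# later, to remove it again: sudo swapoff /swapfile && sudo rm /swapfile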
I solved the OOM problem by using the --wiredTigerCacheSizeGB parameter of mongod. Excerpt from my docker-compose.yaml below:
version: '3.6'
services:
  db:
    container_name: db
    image: mongo:3.2
    volumes:
      - ./vol/db/:/data/db
    restart: always
    # use 1.5GB for cache instead of the default (Total RAM - 1GB)/2:
    command: mongod --wiredTigerCacheSizeGB 1.5
Just documenting here my experience in 2020 using mongodb 4.4:
I ran into this problem restoring a 5GB collection on a machine with 4GB of memory. I added 4GB of swap, which seemed to work; I was no longer seeing the Killed message.
However, a while later I noticed I was missing a lot of data! It turns out that if mongorestore runs out of memory during the final step (at 100%) it will not show Killed, BUT IT HAS NOT IMPORTED YOUR DATA.
You want to make sure you see these final lines:
[########################] cranlike.files.chunks 5.00GB/5.00GB (100.0%)
restoring indexes for collection cranlike.files.chunks from metadata
finished restoring cranlike.files.chunks (23674 documents, 0 failures)
34632 document(s) restored successfully. 0 document(s) failed to restore.
In my case I needed 4GB of memory + 8GB of swap to import a 5GB GridFS collection.
Rather than starting up a new replica set, it's possible to do the entire expansion and upgrade without even going offline.
Start MongoDB 3.6 on host C
On the primary (currently A or B), add node C into the replica set
Node C will do an initial sync of the data; this may take some time
Once that is finished, take down node B; your replica set has two working nodes still (A and C) so will continue uninterrupted
Replace v3.4 on node B with v3.6 and start back up again
When node B is ready, take down node A
Replace v3.4 on node A with v3.6 and start back up again
You'll be left with the same replica set running as before, but now with three nodes, all running v3.6.
PS Be sure to check out the documentation on Upgrade a Replica Set to 3.6 before you start.
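The shell side of adding node C and stepping down the primary might look roughly like this (host names and ports are placeholders; run the commands against the current primary):
# add the new 3.6 node to the replica set
mongo --host hostA --port 27017 --eval 'rs.add("hostC:27017")'
# watch the initial sync; wait until hostC reports SECONDARY
mongo --host hostA --port 27017 --eval 'rs.status().members.forEach(function(m){ print(m.name + " " + m.stateStr); })'
# before taking down the node that is currently primary, ask it to step down first
mongo --host hostA --port 27017 --eval 'rs.stepDown()'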
I ran into a similar issue running 3 nodes on a single machine (8GB RAM total) as part of testing a replica set. The default storage cache size is 0.5 * (Total RAM - 1GB). The mongorestore caused each node to use the full cache size on restore and consume all available RAM.
I am using ansible to template this part of mongod.conf, but you can set your cacheSizeGB to any reasonable amount so multiple instances do not consume the RAM.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: {{ ansible_memtotal_mb / 1024 * 0.2 }}
My scenario is similar to @qwertz's, but to be able to upload all collections to my database I created the following script to handle partial uploads; uploading every collection one by one instead of trying to send the whole database at once was the only way to properly populate it.
populate.sh:
#!/bin/bash
backupFiles=`ls ./backup/${DB_NAME}/*.bson.gz`
for file in $backupFiles
do
  file="${file:1}"                 # strip the leading "." -> /backup/<db>/<collection>.bson.gz
  collection=(${file//./ })        # split on "." -> path, "bson", "gz"
  collection=(${collection//// })  # split the path on "/" -> "backup", db name, collection name
  collection=${collection[2]}
  mongorestore \
    $file \
    --gzip \
    --db=$DB_NAME \
    --collection=$collection \
    --drop \
    --uri="${DB_URI}://${DB_USERNAME}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/"
done
Dockerfile.mongorestore:
FROM alpine:3.12.4
RUN [ "apk", "add", "--no-cache", "bash", "mongodb-tools" ]
COPY populate.sh .
ENTRYPOINT [ "./populate.sh" ]
docker-compose.yml:
...
  mongorestore:
    build:
      context: .
      dockerfile: Dockerfile.mongorestore
    restart: on-failure
    environment:
      - DB_URI=${DB_URI}
      - DB_NAME=${DB_NAME}
      - DB_USERNAME=${DB_USERNAME}
      - DB_PASSWORD=${DB_PASSWORD}
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
    volumes:
      - ${PWD}/backup:/backup
...

Mongodump while writing

Is it safe to run mongodump against a running server with many writes per second? Is it possible to get a corrupted dump this way?
From here:
Use --oplog to capture incoming write operations during the mongodump operation to ensure that the backups reflect a consistent data state.
Does it mean that the dump will be consistent no matter how many writes happen in the database?
If I run mongodump --oplog at 1 AM and it finishes at 2 AM, and then I run mongorestore --oplogReplay, what state will I get?
From here:
However, the use of mongodump and mongorestore as a backup strategy can be problematic for sharded clusters and replica sets.
but why? I had a replica set of 1 primary and 2 secondaries. What is the problem with running mongodump against one of the secondaries? It should be the same as the primary (except for the replication lag).
The docs are quite clear about it:
--oplog
Creates a file named oplog.bson as part of the mongodump output. The oplog.bson file, located in the top level of the output directory, contains oplog entries that occur during the mongodump operation. This file provides an effective point-in-time snapshot of the state of a mongod instance. To restore to a specific point-in-time backup, use the output created with this option in conjunction with mongorestore --oplogReplay.
Without --oplog, if there are write operations during the dump operation, the dump will not reflect a single moment in time. Changes made to the database during the update process can affect the output of the backup.
--oplog has no effect when running mongodump against a mongos instance to dump the entire contents of a sharded cluster. However, you can use --oplog to dump individual shards.
Without --oplog you still get a valid dump, just a bit inconsistent - some of the writes done between 1 AM and 2 AM will be missing.
With --oplog you also have the oplog entries captured up to 2 AM, when the dump finished. The dump itself remains inconsistent, but replaying the oplog on restore fixes this and brings you to the state at the time the dump finished.
The problems with dumping sharded clusters deserve a dedicated page in the docs, essentially because of the complexity of synchronising backups of all the nodes:
To create backups of a sharded cluster, you will stop the cluster balancer, take a backup of the config database, and then take backups of each shard in the cluster using mongodump to capture the backup data. To capture a more exact moment-in-time snapshot of the system, you will need to stop all application writes before taking the filesystem snapshots; otherwise the snapshot will only approximate a moment in time.
There are no problems with dumping a replica set.
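To make the workflow concrete, a minimal sketch (host names and the backup path are placeholders):
DUMP_DIR=/backup/dump-$(date +%F)
# dump from a secondary, capturing oplog entries made while the dump runs
mongodump --host secondary1.example.net --oplog --out "$DUMP_DIR"
# restore and replay the captured oplog to reach the state at the end of the dump
mongorestore --host restored.example.net --oplogReplay "$DUMP_DIR"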

Mongod backup without replicate sets or db.fsyncLock()

I need to make daily backups of MongoDB database(s). I am running Mongo 3.4 on a 64-bit system, so journaling is enabled by default.
I don't have replica sets for this setup, so I want to manually (actually, via a scheduled cron job) copy or make an image of the Mongo dbs from disk. I don't want to use the mongodump command.
Questions
Because of journaling, am I able to simply do a cp -R /data/db/* /backup? Or do I need to do a db.fsyncLock()? I'm not concerned about losing some small amount of data that is still only in memory when I do the cp; I care more that the copy will work if I need to use it in the future, i.e. it won't contain errors etc.
If I do need to do a db.fsyncLock() before copying the databases, what happens to reads and writes that continue to occur? If they block, will they proceed as normal once db.fsyncUnlock() is called?
If I do need to use mongodump, is it true that using _ids other than normal ObjectIds will cause issues? One of my databases doesn't use normal ObjectIds.
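For reference, the lock-copy-unlock sequence described above would look roughly like this; a sketch only (the backup path is illustrative), not a statement on whether the lock is strictly required:
# flush pending writes to disk and block new writes while the files are copied
mongo --eval 'db.fsyncLock()'
cp -R /data/db/ /backup/mongo-$(date +%F)/
# release the lock; blocked operations resume once this returns
mongo --eval 'db.fsyncUnlock()'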

Auto compact the deleted space in mongodb?

The MongoDB documentation says that
To compact this space, run db.repairDatabase() from the mongo shell (note this operation will block and is slow).
in http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
I wonder how to make MongoDB free the deleted disk space automatically?
P.S. We store many download tasks in MongoDB, up to 20GB, and they finish within half an hour.
In general if you don't need to shrink your datafiles you shouldn't shrink them at all. This is because "growing" your datafiles on disk is a fairly expensive operation and the more space that MongoDB can allocate in datafiles the less fragmentation you will have.
So, you should try to provide as much disk-space as possible for the database.
However if you must shrink the database you should keep two things in mind.
MongoDB grows its data files by doubling, so the datafiles may be 64MB, then 128MB, etc., up to 2GB (at which point it stops doubling and keeps files at 2GB).
As with most any database, to do operations like shrinking you'll need to schedule a separate job; there is no "autoshrink" in MongoDB. In fact, of the major NoSQL databases (hate that name) only Riak will autoshrink. So, you'll need to create a job using your OS's scheduler to run a shrink. You could use a bash script, or have a job run a PHP script, etc.
Server-side JavaScript
You can use server-side JavaScript to do the shrink and run that JS via the mongo shell on a regular basis via a job (like cron or the Windows scheduling service)...
Assuming a collection called foo, you would save the JavaScript below into a file called bar.js and run...
$ mongo foo bar.js
The javascript file would look something like ...
// Get the current collection sizes.
var storage = db.foo.storageSize();
var total = db.foo.totalSize();
print('Storage Size: ' + tojson(storage));
print('TotalSize: ' + tojson(total));
print('-----------------------');
print('Running db.repairDatabase()');
print('-----------------------');
// Run repair
db.repairDatabase()
// Get new collection sizes.
var storage_a = db.foo.storageSize();
var total_a = db.foo.totalSize();
print('Storage Size: ' + tojson(storage_a));
print('TotalSize: ' + tojson(total_a));
This will run and return something like ...
MongoDB shell version: 1.6.4
connecting to: foo
Storage Size: 51351
TotalSize: 79152
-----------------------
Running db.repairDatabase()
-----------------------
Storage Size: 40960
TotalSize: 65153
Run this on a schedule (during non-peak hours) and you are good to go.
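For example, a crontab entry along these lines would schedule it weekly (the time, database name and paths are illustrative):
# run the repair script against database foo every Sunday at 03:00
0 3 * * 0 /usr/bin/mongo foo /path/to/bar.js >> /var/log/mongo-repair.log 2>&1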
Capped Collections
However there is one other option, capped collections.
Capped collections are fixed-size collections that have a very high performance auto-FIFO age-out feature (age-out is based on insertion order). They are a bit like the "RRD" concept, if you are familiar with that.
In addition, capped collections automatically, with high performance, maintain insertion order for the objects in the collection; this is very powerful for certain use cases such as logging.
Basically you can limit the size of (or number of documents in) a collection to, say, 20GB, and once that limit is reached MongoDB will start to throw out the oldest records and replace them with newer entries as they come in.
This is a great way to keep a large amount of data, discarding the older data as time goes by and keeping the same amount of disk-space used.
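Creating one is a single shell command; a sketch, with the collection name and 20GB size purely illustrative:
# create a 20GB capped collection named "tasks" in database "foo"
mongo foo --eval 'db.createCollection("tasks", { capped: true, size: 20 * 1024 * 1024 * 1024 })'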
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master.
The process is time-consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().
Also, this cannot really be automated. Well, it could, but I don't think I'm willing to try.
Running db.repairDatabase() will require that you have space equal to the current size of the database available on the file system. This can be bothersome when you know that the collections left or data you need to retain in the database would currently use much less space than what is allocated and you do not have enough space to make the repair.
As an alternative if you have few collections you actually need to retain or only want a subset of the data, then you can move the data you need to keep into a new database and drop the old one. If you need the same database name you can then move them back into a fresh db by the same name. Just make sure you recreate any indexes.
use cleanup_database
db.dropDatabase();
use oversize_database
db.collection.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("cleanup_database");
    db.collection_subset.insert(doc);
});
use oversize_database
db.dropDatabase();
use cleanup_database
db.collection_subset.find({},{}).forEach(function(doc){
    db = db.getSiblingDB("oversize_database");
    db.collection.insert(doc);
});
use oversize_database
<add indexes>
db.collection.ensureIndex({field:1});
use cleanup_database
db.dropDatabase();
An export/drop/import operation for databases with many collections would likely achieve the same result but I have not tested.
Also, as a policy you can keep permanent collections in a separate database from your transient/processing data and simply drop the processing database once your jobs complete. Since MongoDB is schema-less, nothing except the indexes would be lost, and your db and collections will be recreated when the inserts for the processes run next. Just make sure your jobs include creating any necessary indexes at an appropriate time.
If you are using replica sets, which were not available when this question was originally written, then you can set up a process to automatically reclaim space without incurring significant disruption or performance issues.
To do so, you take advantage of the automatic initial sync capabilities of a secondary in a replica set. To explain: if you shut down a secondary, wipe its data files and restart it, the secondary will re-sync from scratch from one of the other nodes in the set (by default it picks the node closest to it by looking at ping response times). When this resync occurs, all data is rewritten from scratch (including indexes), effectively doing the same thing as a repair, and disk space is reclaimed.
By running this on secondaries (and then stepping down the primary and repeating the process) you can effectively reclaim disk space on the whole set with minimal disruption. You do need to be careful if you are reading from secondaries, since this will take a secondary out of rotation for a potentially long time. You also want to make sure your oplog window is sufficient to do a successful resync, but that is generally something you would want to make sure of whether you do this or not.
To automate this process you would simply need to have a script run to perform this action on separate days (or similar) for each member of your set, preferably during your quiet time or maintenance window. A very naive version of this script would look like this in bash:
NOTE: THIS IS BASICALLY PSEUDO CODE - FOR ILLUSTRATIVE PURPOSES ONLY - DO NOT USE FOR PRODUCTION SYSTEMS WITHOUT SIGNIFICANT CHANGES
#!/bin/bash
# First arg is host MongoDB is running on, second arg is the MongoDB port
MONGO=/path/to/mongo
MONGOHOST=$1
MONGOPORT=$2
DBPATH=/path/to/dbpath
# make sure the node we are connecting to is not the primary
while [ "$($MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'db.isMaster().ismaster')" = "true" ]
do
    $MONGO --quiet --host $MONGOHOST --port $MONGOPORT --eval 'rs.stepDown()'
    sleep 2
done
echo "Node is no longer primary!"
# Now shut down that server
# something like (assuming user is set up for key based auth and has password-less sudo access a la ec2-user in EC2)
ssh -t user@$MONGOHOST sudo service mongodb stop
# Wipe the data files for that server
ssh -t user@$MONGOHOST sudo rm -rf $DBPATH
ssh -t user@$MONGOHOST sudo mkdir $DBPATH
ssh -t user@$MONGOHOST sudo chown mongodb:mongodb $DBPATH
# Start up server again
# similar to shutdown, something like
ssh -t user@$MONGOHOST sudo service mongodb start