Performance issue while mongodump is running - mongodb

We operate a server for our customer with a single mongo instance, gradle, postgres and nginx running on it. The problem is that we have massive performance problems while mongodump is running: the mongo queue grows and no data can be queried. The other problem is that the customer does not want to invest in a replica set or a software update (mongod 3.x).
Does anybody have an idea how I could improve the performance?
command to create the dump:
mongodump -u ${MONGO_USER} -p ${MONGO_PASSWORD} -o ${MONGO_DUMP_DIR} -d ${MONGO_DATABASE} --authenticationDatabase ${MONGO_DATABASE} > /backup/logs/mongobackup.log
tar cjf ${ZIPPED_FILENAME} ${MONGO_DUMP_DIR}
System:
6 Cores
36 GB RAM
1TB SATA HDD
+ 2TB (backup NAS)
MongoDB 2.6.7
Thanks
Best regards,
Markus

As you have heavy load, adding a replica set is a good solution, as the backup could be taken on a secondary node. But be aware that a replica set needs at least three servers (you can have a primary/secondary/arbiter setup, where the arbiter needs only a small amount of resources).
mongodump takes a general query lock, which will have an impact if there are a lot of writes to the dumped database.
Hint: try to run the backup when the load on the system is light.
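For example, a minimal sketch of scheduling the dump in the quietest window at low CPU and I/O priority, assuming a Linux host with cron, nice and ionice available (the 03:00 schedule, script path and variables are placeholders taken from the question, and ionice only takes effect with the CFQ/BFQ I/O schedulers):
#!/bin/bash
# /usr/local/bin/mongo-backup.sh - run the dump at lowest CPU and best-effort I/O priority
ionice -c2 -n7 nice -n19 \
  mongodump -u ${MONGO_USER} -p ${MONGO_PASSWORD} -o ${MONGO_DUMP_DIR} \
    -d ${MONGO_DATABASE} --authenticationDatabase ${MONGO_DATABASE} \
    > /backup/logs/mongobackup.log 2>&1
tar cjf ${ZIPPED_FILENAME} ${MONGO_DUMP_DIR}
# crontab entry, assuming 03:00 is the lightest-load window:
# 0 3 * * * /usr/local/bin/mongo-backup.sh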

Try volume snapshots. Check with your cloud provider which options are available for taking snapshots. It is very fast, and cheaper if you compare it with the actual cost of taking a regular backup (the RAM and CPU used, and with an HDD also the disk transactions, even if that cost is small).
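As an illustration, if the data disk happened to be an AWS EBS volume, a snapshot could be triggered with a single CLI call (the volume ID and description are placeholders; GCP and Azure have equivalent commands, and writes should be flushed or locked first, as discussed in the snapshot answer further down):
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly mongodb backup"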

Related

Why is pg_restore that slow and PostgreSQL almost not even using the CPU?

I just had to use pg_restore with a small dump of 30MB and it took on average 5 minutes! On my colleagues' computers it is ultra fast, like a dozen seconds. The difference between the two is the CPU usage: while for the others the database uses quite a lot of CPU (60-70%) during the restore operation, on my machine it stays at only a few percent (0-3%), as if it were not active at all.
The exact command was: pg_restore -h 127.0.0.1 --username XXX --dbname test --no-comments test_dump.sql
The originating command to produce this dump was: pg_dump --dbname=XXX --user=XXX --no-owner --no-privileges --verbose --format=custom --file=/sql/test_dump.sql
Look at the screenshot taken in the middle of the restore operation:
Here is the corresponding vmstat 1 result running the command:
I've searched the web for a solution for a few hours, but this under-usage of the CPU remains quite mysterious. Any ideas will be appreciated.
For the stack, I am on Ubuntu 20.04 and Postgres version 13.6 is running in a Docker container. I have decent hardware, neither bad nor great.
EDIT: This very same command worked in the past on my machine with the same common HDD, but now it is terribly slow. The only difference I saw with the others (for whom it is blazing fast) was really the CPU usage, from my point of view (even if they have an SSD, which shouldn't be the limiting factor at all, especially with a 30 MB dump).
EDIT 2: For those who proposed that the problem was IO-boundness and maybe a slow disk, I just tried, without much conviction, to run my command on an SSD partition I had just created, and nothing changed.
The vmstat output shows that you are I/O bound. Get faster storage, and performance will improve.
PostgreSQL, by default, is tuned for data durability. Usually transactions are flushed to disk at each and every commit, forcing write-through of any disk write cache, so the restore ends up very IO-bound.
When restoring a database from a dump file, it may make sense to lower these durability settings, especially if the restore is done while your application is offline, and even more so in non-production environments.
I temporarily run postgres with these options: -c fsync=off -c synchronous_commit=off -c full_page_writes=off -c checkpoint_flush_after=256 -c autovacuum=off -c max_wal_senders=0
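Since the question's Postgres runs in Docker, one way to apply these settings temporarily is to pass them as server arguments when starting a throwaway container; a minimal sketch, with the container name, password, port mapping and image tag as placeholders:
# start a temporary Postgres 13 with relaxed durability settings
docker run --name pg-fast-restore -e POSTGRES_PASSWORD=secret -p 5433:5432 -d postgres:13 \
  -c fsync=off -c synchronous_commit=off -c full_page_writes=off \
  -c checkpoint_flush_after=256 -c autovacuum=off -c max_wal_senders=0
# then restore into it (create the target database first if it does not exist;
# pg_restore will prompt for the password set above)
pg_restore -h 127.0.0.1 -p 5433 --username postgres --dbname test --no-comments test_dump.sql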
Refer to these documentation sections for more information:
14.4.9. Some Notes about pg_dump
14.5. Non-Durable Settings.
Also this article:
Settings for a fast pg_restore

faster mongoimport, in parallel in airflow?

tl;dr: there seems to be a limit on how fast data is inserted into our mongodb atlas cluster. Inserting data in parallel does not speed this up. How can we speed this up? Is our only option to get a larger mongodb atlas cluster with more Write IOPS? What even are write IOPS?
We replace and re-insert more than 10 GB of data daily into our mongodb cluster with atlas. We have the following 2 bash commands, wrapped in python functions to help parameterize them, that we use with BashOperator in airflow:
upload single JSON to mongo cluster
def mongoimport_file(mongo_table, file_name):
    # upload single file from /tmp directory into Mongo cluster
    # cleanup: remove .json in /tmp at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && mongoimport --uri "{uri}" --collection {mongo_table} --drop --file /tmp/{file_name}.json \
    && echo AND REMOVE LOCAL FILEs... \
    && rm /tmp/{file_name}.json
    """
upload directory of JSONs to mongo cluster
def mongoimport_dir(mongo_table, dir_name):
    # upload directory of JSONs into mongo cluster
    # cleanup: remove directory at the end
    uri = 'mongodb+srv://<user>:<pass>@our-cluster.dwxnd.gcp.mongodb.net/ourdb'
    return f"""
    echo INSERT \
    && cat /tmp/{dir_name}/*.json | mongoimport --uri "{uri}" --collection {mongo_table} --drop \
    && echo AND REMOVE LOCAL FILEs... \
    && rm -rf /tmp/{dir_name}
    """
These are called in airflow using the BashOperator:
import_to_mongo = BashOperator(
    task_id=f'mongo_import_v0__{this_table}',
    bash_command=mongoimport_file(mongo_table='tname', file_name='fname')
)
Both of these work, although with varying performance:
mongoimport_file with 1 5GB file: takes ~30 minutes to mongoimport
mongoimport_dir with 100 50MB files: takes ~1 hour to mongoimport
There is currently no parallelization with mongoimport_dir, and in fact it is slower than importing just a single file.
Within airflow, is it possible to parallelize the mongoimport of our directory of 100 JSONs, to achieve a major speedup? If there's a parallel solution using python's pymongo that cannot be done with mongoimport, we're happy to switch (although we'd strongly prefer to avoid loading these JSONs into memory).
What is the current bottleneck with importing to mongo? Is it (a) CPUs in our server / docker container, or (b) something with our mongo cluster configuration (cluster RAM, or cluster vCPU, or cluster max connections, or cluster read / write IOPS (what are these even?)). For reference, here is our mongo config. I assume we can speed up our import by getting a much bigger cluster but mongodb atlas becomes very expensive very fast. 0.5 vCPUs doesn't sound like much, but this already runs us $150 / month...
First of all, regarding "What is the current bottleneck with importing to mongo?" and "Is it (a) CPUs in our server / docker container": don't believe anyone who claims to tell you the answer from the screenshot you provided.
Atlas has monitoring tools that will tell you whether the bottleneck is CPU, RAM, disk or network, or any combination of those, on the db side.
On the client side (airflow), please use the system monitor of your host OS to answer the question. Also test disk I/O inside docker; some combinations of host OS and docker storage drivers performed quite poorly in the past.
Next, "What even are write IOPS" - random
write operations per second
https://cloud.google.com/compute/docs/disks/performance
IOPS calculation differs depending on the cloud provider. Try AWS and Azure to compare cost vs speed. An M10 on AWS gives you 2 vCPUs, though again I doubt you can compare them 1:1 between vendors. The good thing is that it's on-demand and will cost you less than a cup of coffee to test and then delete the cluster.
Finally, "If there's a parallel solution using python's pymongo" - I doubt it. mongoimport uses batches of 100,000 documents, so essentially it sends them as fast as the stream is consumed on the receiver. The limitations on the client side could be network, disk or CPU. If it is network or disk, parallel import won't improve a thing. Multi-core systems could benefit from parallel import if mongoimport were using a single CPU and that were the limiting factor, but by default mongoimport uses all available CPUs: https://github.com/mongodb/mongo-tools/blob/cac1bfbae193d6ba68abb764e613b08285c6f62d/common/options/options.go#L302. You can hardly beat it with pymongo.

How to limit pg_dump's memory usage?

I have a ~140 GB Postgres DB on Heroku / AWS. I want to create a dump of this on a Windows Azure - Windows Server 2012 R2 virtual machine, as I need to move the DB into the Azure environment.
The DB has a couple of smaller tables, but mainly consists of a single table taking ~130 GB, including indexes. It has ~500 million rows.
I've tried to use pg_dump for this, with:
./pg_dump -Fc --no-acl --no-owner --host * --port 5432 -U * -d * > F:/051418.dump
I've tried various Azure virtual machine sizes, including some fairly large ones (D12_V2: 28 GB RAM, 4 vCPUs, 12000 max IOPS, etc.). But in all cases pg_dump stalls completely due to memory swapping.
On the above machine it's currently using all available memory and has spent the past 12 hrs swapping memory to disk. I don't expect it to complete, due to the swapping.
From other posts I've understood it could be an issue with the network speed being much faster than the disk IO speed, causing pg_dump to suck up all available memory and more, so I've tried using the Azure machine with the most IOPS. This hasn't helped.
So is there another way I can force pg_dump to cap its memory usage, or to wait before pulling more data until what it has already read has been written to disk and memory is freed?
Looking forward to your help!
Krgds.
Christian

mongodump a db on archlinux

I'm trying to back up my local mongodb. I use archlinux and installed mongodb-tools in order to use mongodump.
I tried:
mongodump --host localhost --port 27017
mongodump --host localhost --port 27017 --db mydb
Every time I get the same response:
Failed: error connecting to db server: no reachable servers
I'm however able to connect to the database using
mongo --host localhost --port 27017
or just
mongo
My mongodb version is 3.0.7.
I did not set any username/password
How can I properly use mongodump to back up my local database?
This appears to be a bug in the mongodump tool; see this JIRA ticket for more detail. You should be able to use mongodump if you explicitly specify the IP address:
mongodump --host 127.0.0.1 --port 27017
"Properly" is a highly subjective term in this context. To give you an impression:
mongodump and mongorestore aren't incredibly fast. In sharded environments, they can take days (note the plural!) for reasonably sized databases, which in turn means that in a worst-case scenario you can lose days' worth of data. Furthermore, during the backup the data may change quite a bit, so the state of your backup may be inconsistent. It is better to think of mongodump as "mongodumb" in this respect.
Your application has to be able to deal with the lack of consistency gracefully, which can be quite a pain in the neck to develop. Furthermore, long restore times cost money and (sometimes even more importantly) reputation.
I personally use mongodump only in two scenarios: for backing up a sharded cluster's metadata (which is only a couple of MB in size) and for (relatively) cheap data, which is easy to re-obtain by other means.
For doing a MongoDB backup properly, imho, there are only three choices:
MongoDB Inc's cloud backup,
MongoDB Ops Manager
Filesystem snapshots
Cloud backup
It has several advantages. You can do point in time recoveries, guaranteeing to have the database in a consistent state as it was at the chosen point in time. It is extremely easy to set up and maintain.
However, you guessed it, it comes with a price tag based on data volatility and overall size, which, imho, is reasonable for small to medium sized data with low to moderate volatility.
MongoDB Ops Manager
Being an on-premises version of the cloud backup (it has quite a few other features that are out of the scope of this answer, too), it offers the same benefits. It is better suited for upper-scale medium to large databases, or for databases with disproportionately high volatility (as indicated by a high "OplogGb/h" value in comparison to the data size).
Filesystem snapshots
Well, it is sort of cheap. Just make a snapshot, mount it, copy it to some backup space, unmount and destroy the snapshot, optionally compress the copied data and you are done. There are some caveats, though.
Synchronization
To get a backup of consistent data, you need to synchronize your snapshots on a sharded cluster, especially since the sharded cluster's metadata needs to be consistent with the backups, too, if you want a halfway fast recovery. That can become a bit tricky. To make sure your data is consistent, you'd need to disconnect all mongos, stop the balancer, fsync the data to the files on each node, make the snapshot, start the balancer again and restart all mongos. To have this properly synced, you need a maintenance window of some minutes every time you make a backup.
Note that for a simple replica set, synchronization is not required and backups work flawlessly.
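For that simple replica-set case, a minimal sketch of the procedure on a secondary whose dbpath sits on an LVM logical volume (the volume group, volume names, snapshot size and target paths are placeholders, and the volume group needs free space for the snapshot; the fsyncLock/fsyncUnlock pair is the conservative route to a clean on-disk state):
# flush writes and block them while the snapshot is taken
mongo --eval "db.fsyncLock()"
lvcreate --size 10G --snapshot --name mdb-snap /dev/vg0/mongodb
mongo --eval "db.fsyncUnlock()"
# copy the frozen state to backup space, then drop the snapshot
mkdir -p /mnt/mdb-snap
mount /dev/vg0/mdb-snap /mnt/mdb-snap
tar czf /backup/mongodb-$(date +%F).tar.gz -C /mnt/mdb-snap .
umount /mnt/mdb-snap
lvremove -f /dev/vg0/mdb-snap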
Overprovisioning
Filesystem snapshots work with what is called "Copy-On-Write" (CoW). A bit simplified: When you make a snapshot and a file is modified, it is instead copied and the changes are applied to the newly copied file. The snapshot however, points to the old file. It is obvious that in order to be able to make a snapshot, as per CoW, you need some additional disk space so that MongoDB can work while you deal with the snapshot. Let us assume a worst case scenario in which all the data is changed – you'd need to overprovision your partition for MongoDB by at least 100% of your data size or, to put it in other terms, your critical disk utilization would be 50% minus some threshold for the time you need to scale up. Of course, this is a bit exaggerated, but you get the picture.
Conclusion
IMHO, proper backups should be done this way:
mongodump/mongorestore for cheap data with little concern for consistency
Filesystem snapshots for replica sets
Cloud Backups for small to medium sized sharded databases with low to moderate volatility
Ops Manager Backups for large databases or small to medium ones with disproportionate high volatility
As said: "properly" is a highly subjective term when it comes to backups. ;)

How to Backup from mongoDB without locking tables

There is a replica set (primary, secondary, arbiter) with 300 GB of data. I want to make a daily backup without locking. The replica set runs on Windows 2008 R2, so it seems not possible to use LVM tools.
If I want to copy the data folder on the secondary, I need to shut down mongod first (because it's not possible to copy mongod.lock while mongod is running).
What is the best solution to make the fastest daily backup?
I don't know if it is feasible for you, but you can add another member to the replica set. This member would be hidden, so it would not be used for queries or write operations. You can stop this server every day to make your database backups.
Because it is a replica cluster, I use mongodump with the --oplog option. It runs pretty quickly on Linux, and I think it may have some advantages on a multi-tenant server over tar or snapshots. The disadvantage is that the indexes are rebuilt when you do the mongorestore.
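A hedged sketch of that approach, reading from a secondary so the primary is not loaded (the replica set name, host names and output path are placeholders; --oplog captures writes that happen during the dump so mongorestore can replay them to a consistent point in time):
mongodump --host rs0/mongo1.example.com:27017,mongo2.example.com:27017 \
  --readPreference=secondary --oplog --out /backup/dump-$(date +%F)
# restoring replays the oplog slice captured during the dump
mongorestore --oplogReplay /backup/dump-$(date +%F)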