I've got an 8 GB MySQL database dump of InnoDB tables created with mysqldump. I import the data with:
mysql -uroot -p my_db < dump.sql
A 5 GB dump of the DB was imported within an hour. The 8 GB dump has been running for 50 hours and counting. When I inspected the process list with
SHOW PROCESSLIST;
most of the time there was a single INSERT query visible with the state "freeing items".
Is there a way, besides copying the raw files, to speed up the import process?
The trick, really, is to ensure that the biggest single table fits in the InnoDB buffer pool. If it does not, then inserts (and the import, of course) will be extremely slow.
What matters is not the size of the whole database, but the size of the biggest single table.
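A quick sanity check along those lines (the schema name below is just a placeholder; on MySQL 5.7+ the pool can be resized online, older versions need a my.cnf change and a restart):

-- find the biggest tables (data + indexes) in the target schema
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'my_db'
ORDER BY (data_length + index_length) DESC
LIMIT 5;

-- current buffer pool size, in bytes
SELECT @@innodb_buffer_pool_size;

-- MySQL 5.7+ only: grow the pool online, here to 12 GB (assuming the box has the RAM to spare)
SET GLOBAL innodb_buffer_pool_size = 12884901888;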
For significantly larger databases, you might want to consider alternative methods of transferring the database, such as filesystem snapshots. This of course works best if your machines are running the same version of the database, OS and architecture.
How much memory does the machine have? My first guess would be that the machine has 6 GB or 8 GB of memory, and MySQL was able to keep the first dump completely in memory but is somehow swapping hard on the second import. Can you run vmstat 5 for a few iterations while doing the import and see how heavily the system is swapping?
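Something like this, for instance (the columns are standard vmstat output):

# print memory and swap statistics every 5 seconds while the import runs
vmstat 5
# watch the "si" and "so" columns (pages swapped in / out per second);
# sustained non-zero values there mean the box is actively swapping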
Related
I have a database with:
on-disk size 19.032 GB (using the show dbs command)
data size 56 GB (using the db.collectionName.stats(1024*1024*1024).size command)
When taking a dump with the mongodump command, we can set the --gzip parameter. These are the observations I have with and without this flag.
command       | time taken to dump | size of dump | restoration time | observation
with gzip     | 30 min             | 7.5 GB       | 20 min           | in mongostat the insert rate ranged from 30k to 80k per sec
without gzip  | 10 min             | 57 GB        | 50 min           | in mongostat the insert rate was very erratic, ranging from 8k to 20k per sec
The dump was taken from a machine with 8 cores and 40 GB RAM (Machine B) onto a 12-core, 48 GB RAM machine (Machine A), and restored from Machine A onto another 12-core, 48 GB RAM machine (Machine C), to make sure there was no resource contention between the mongo, mongorestore and mongodump processes. Mongo version 4.2.0.
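For reference, the two invocations being compared look roughly like this (the database name and output paths are just placeholders):

# dump with compression
mongodump --db=my_db --gzip --out=/backups/dump_gzip
# dump without compression
mongodump --db=my_db --out=/backups/dump_plain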
I have a few questions:
What is the functional difference between the 2 dumps?
Can the BSON dump be zipped afterwards to compress it?
How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during inserts?)
Is there a way to make the overall process faster? From these results it looks like we have to choose between dump speed and restore speed.
Will having a bigger machine (more RAM) which reads the dump and restores it expedite the overall process?
Will a smaller dump help with the overall time?
Update:
2. Can the BSON dump be zipped afterwards to compress it?
Yes:
% ./mongodump -d=test
2022-11-16T21:02:24.100+0530 writing test.test to dump/test/test.bson
2022-11-16T21:02:24.119+0530 done dumping test.test (10000 documents)
% gzip dump/test/test.bson
% ./mongorestore --db=test8 --gzip dump/test/test.bson.gz
2022-11-16T21:02:51.076+0530 The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
2022-11-16T21:02:51.077+0530 checking for collection data in dump/test/test.bson.gz
2022-11-16T21:02:51.184+0530 restoring test8.test from dump/test/test.bson.gz
2022-11-16T21:02:51.337+0530 finished restoring test8.test (10000 documents, 0 failures)
2022-11-16T21:02:51.337+0530 10000 document(s) restored successfully. 0 document(s) failed to restore.
I am no MongoDB expert, but I have good experience working with MongoDB backup and restore activities and will answer to the best of my knowledge.
What is the functional difference between the 2 dumps?
The mongodump command without the --gzip option saves each and every document to a file in BSON format.
This significantly reduces the time taken for the backup and restore operations, since it just reads the BSON file and inserts the documents; the compromise is the size of the .bson dump file.
However, when we pass the --gzip option, the BSON data is compressed as it is dumped to a file. This significantly increases the time taken for mongodump and mongorestore, but the backup file will be much smaller due to the compression.
Can the BSON dump be zipped afterwards to compress it?
Yes, it can be zipped further. But you will be spending additional time, since you have to compress the already compressed file and extract it again before the restore operation, increasing the overall time taken. Do it only if the resulting file is much smaller than the plain gzip output.
EDIT:
As @best wishes pointed out, I completely misread this question.
The gzip performed by mongodump is just gzip applied on the mongodump side. It is literally the same as compressing the original BSON file manually on our end.
For instance, if you extract the .bson.gz file with any decompression tool, you will get the actual BSON backup file.
Note that zip and gzip are not the same (in terms of compression), since they use different compression algorithms, even though they both compress files. So you will get different file sizes when comparing mongodump's gzip output with a manual zip of the files.
How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during inserts?)
Whenever you take a dump, the mongodump tool creates a <Collection-Name>.metadata.json file. This basically contains all the indexes, followed by the collection name, uuid, colmod, dbUsersAndRoles and so on.
The number and type of indexes in the collection will not have an impact during the mongodump operation. However, after the data has been restored with the mongorestore command, it will go through all the indexes in the metadata file and try to recreate them.
The time taken by this operation depends on the number of indexes and the number of documents in your collection; in short, (No. of indexes * No. of documents). The type of the index (even if it's unique) doesn't have a major impact on performance. If the indexes were applied to the original collection using the background: true option, it's going to take even more time to rebuild them while restoring.
You can skip the indexing operation during mongorestore by passing the --noIndexRestore option on the command line. You can build the indexes later when required.
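A minimal sketch of that workflow (the database, collection and field names are made up):

# restore the data only, skipping all index builds
mongorestore --nsInclude='my_db.*' --noIndexRestore dump/
# later, build the indexes you need from the shell
mongo my_db --eval 'db.my_collection.createIndex({ userId: 1 }, { unique: true })'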
In the Production backup environment of my company, indexing of keys takes more time compared to the restoration of data.
Is there a way to make the overall process faster? From these results it looks like we have to choose between dump speed and restore speed.
The solution depends...
If network bandwidth is not an issue (for example, moving data between two instances running in the cloud), don't use any compression, since skipping it will save you time (see the sketch after this list).
If the data in the newly moved instance won't be accessed immediately, perform the restoration process with the --noIndexRestore flag.
If the backup is for cold storage or saving data for later use, apply gzip compression, or manual zip compression, or both (whatever works best for you).
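For the first scenario, one common pattern is to stream the dump straight into the target instance without ever touching disk (the host names and database name are placeholders):

# pipe an uncompressed archive from the source directly into the target
mongodump --host=source-host --db=my_db --archive | mongorestore --host=target-host --archive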
Choose whichever scenario works best for you; you primarily have to find the right balance between time and space, and secondarily decide whether to apply indexes or not.
In my company, we usually take a non-compressed backup and restore for P-1, use gzip compression for prod backups that are weeks old, and compress further manually for backups that are months old.
You have one more option, and I DON'T RECOMMEND THIS METHOD: you can directly move the data path your MongoDB instance points to and change the DB path in the MongoDB instance of the migrated machine. Again, I don't recommend this method, as there are many things that could go wrong, although I had no issues with this methodology on my end. But I can't guarantee the same for you. Do this at your own risk if you decide to.
Will having a bigger machine (more RAM) which reads the dump and restores it expedite the overall process?
I don't think so. I am not sure about this, but I have 16 GB of RAM and I restored a 40 GB mongodump backup to my local machine without facing any bottleneck due to RAM. I could be wrong, though, as I am not sure. Please let me know if you come to know the answer yourself.
Will a smaller dump help with the overall time?
If by a smaller dump you mean limiting the data to be dumped using the --query flag, it certainly will, since there is much less data to back up and restore. Remember the No. of indexes * No. of documents rule.
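For example (the collection, field and date below are made up; --query must be combined with a single --collection and takes an extended-JSON filter):

# dump only the documents matching the filter
mongodump --db=my_db --collection=events \
  --query='{ "createdAt": { "$gte": { "$date": "2022-01-01T00:00:00Z" } } }' \
  --out=/backups/partial_dump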
Hope this helped answer your questions. Let me know:
if you have any further questions
if I made any mistakes
if you found a better solution
what you decided on in the end
Here are my two cents:
In my experience, using --gzip trades time for storage space ("space for time", so to speak); both mongodump and mongorestore will have overhead.
In addition, I also use the parallel settings:
--numParallelCollections=n1
--numInsertionWorkersPerCollection=n2
which may increase performance a little, around 10%; n1 and n2 depend on the number of CPUs on the server.
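As a concrete sketch (the dump path and worker counts are just examples for a machine with plenty of spare cores):

# restore 4 collections at a time, with 4 insertion workers per collection
mongorestore --numParallelCollections=4 --numInsertionWorkersPerCollection=4 --gzip dump/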
The restore process also rebuilds indexes, which depends on how many indexes are in your databases. Generally speaking, rebuilding the indexes is faster than restoring the data.
Hope these help!
I have a Postgres DB on AWS RDS and the swap usage is constantly rising.
Why is it rising? I tried rebooting, but it does not go back down. AWS writes that high swap usage is "indicative of performance issues".
I am writing data to this DB. CPU and memory do look healthy:
To be precise, I have a db.t2.micro instance with, at the moment, ~30/100 GB of data in 5 tables on General Purpose SSD, using the default postgresql.conf.
The swap-graph looks as follows:
Swap Usage warning:
Well, it seems that your queries are using more memory than you have available. So you should look at your queries' execution plans and find the largest loads: the queries that exceed the memory available to PostgreSQL. Excessive joining (i.e. a database structure that might be better denormalized where applicable), lots of nested queries, or queries with IN clauses are the typical suspects. I guess Amazon delivered as much as possible in postgresql.conf, and those default values are quite good for this tiny machine.
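For example, you could run something like this against a suspect query (the query itself is just a placeholder) and look for large sequential scans, hash joins spilling to disk, and high "read" counts in the Buffers lines:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE id IN (SELECT ref_id FROM other_table);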
But once again, as long as your swap usage does not exceed your available memory and you are on an SSD, there is not that much harm in it.
Check the output of
select * from pg_stat_activity;
and see which processes are taking long and how many processes are sleeping; try to change your RDS DB parameters according to your needs.
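A slightly more focused variant of that check (the columns used here exist on PostgreSQL 9.2 and later):

-- longest-running statements first, plus which backends are idle
SELECT pid, state, now() - query_start AS runtime, query
FROM pg_stat_activity
ORDER BY runtime DESC NULLS LAST;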
Obviously you ran out of memory. db.t2.micro has only 1 GB of RAM. You should look at htop output to see which processes take the most memory and try to optimize memory usage. There is also a nice utility called pgtop (http://ptop.projects.pgfoundry.org/) which shows current queries, number of rows read, etc. You can use it to view your Postgres state in real time. By the way, if you cannot install pgtop, you can get much the same information from Postgres's internal tools - check out the documentation of the Postgres stats collector: https://www.postgresql.org/docs/9.6/static/monitoring-stats.html
Actually, it is difficult to say exactly what the problem is, but db.t2.micro is a very limited instance. You should consider moving to a bigger instance, especially if you are using Postgres in production.
I have a big DB (a Nominatim DB, for reverse address geocoding); it is about 408 GB.
Now, to provide an estimate to the customer, I would like to know how long the export/reimport procedure will take and how big the .sql dump file will be.
My PostgreSQL version is 9.4, installed on a CentOS 6.7 virtual machine with 16 GB of RAM and 500 GB of disk space.
Can you help me?
Thank you all for your answers. Anyway, to restore the dumped DB I don't use the pg_restore command but psql -d newdb -f dump.sql (I read about this approach in the official docs). This is because I have to set up this DB on another machine to avoid the Nominatim DB indexing procedure! I don't know if anyone knows Nominatim (it is an OpenStreetMap open-source product), but the DB indexing process for the European map (15.8 GB), on a CentOS 6.7 machine with 16 GB of RAM, took me 32 days...
So another possible question would be: is pg_restore equivalent to psql -d -f? Which is faster?
Thanks again
As @a_horse_with_no_name says, nobody will be able to give you exact answers for your environment. But this is the procedure I would use to get some estimates.
I have generally found that a compressed backup of my data is 1/10th or less the size of the live database. You can also usually deduct the on-disk size of the indexes from the backup size as well. Examine the size of things in-database to get a better idea. You can also try forming a subset of the database you have which is much smaller and compare the live size to the compressed backup; this may give you a ratio that should be in the ballpark. SQL files are gassy and compress well; the on-disk representation Postgres uses seems to be even gassier though. Price of performance probably.
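To look at those in-database sizes, queries along these lines work (the database name is a placeholder):

-- total on-disk size of the database
SELECT pg_size_pretty(pg_database_size('my_db'));

-- table data vs. index size for the biggest relations
SELECT relname,
       pg_size_pretty(pg_table_size(oid))   AS table_size,
       pg_size_pretty(pg_indexes_size(oid)) AS index_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;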
The best way to estimate time is just to do some exploratory runs. In my experience this usually takes longer than you expect. I have a ~1 TB database that I'm fairly sure would take about a month to restore, but it's also aggressively indexed. I have several ~20 GB databases that backup/restore in about 15 minutes. So it's pretty variable, but indexes add time. If you can set up a similar server, you can try the backup-restore procedure and see how long it will take. I would recommend doing this anyway, just to build confidence and suss out any lingering issues before you pull the trigger.
I would also recommend you try out pg_dump's "custom format" (pg_dump -Fc) which makes compressed archives that are easy for pg_restore to use.
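A minimal sketch of that workflow (the database names are placeholders):

# compressed custom-format dump
pg_dump -Fc my_db > my_db.dump
# restore it, using several parallel jobs to speed up data load and index builds
pg_restore -d new_db -j 4 my_db.dump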
I have a server with 64 GB RAM and PostgreSQL 9.2. On it are one small database "A" of only 4 GB, which is queried only once an hour or so, and one big database "B" of about 60 GB, which gets queried 40-50 times per second!
As expected, Linux and PostgreSQL fill the RAM with the bigger database's data as it is more often accessed.
My problem now is that the queries to the small database "A" are critical and have to run in <500 ms. The log file shows a couple of queries per day that took >3 s, though. If I execute them by hand, they too take only 10 ms, so my indexes are fine.
So I guess that those long runners happen when PostgreSQL has to load chunks of the small database's indexes from disk.
I already have some kind of "cache warmer" script that repeats "SELECT * FROM x ORDER BY y" queries to the small database every second but it wastes a lot of CPU power and only improves the situation a little bit.
Any more ideas how to tell PostgreSQL that I really want that small database "sticky" in memory?
PostgreSQL doesn't offer a way to pin tables in memory, though the community would certainly welcome well thought out, tested and benchmarked proposals for allowing this from people who are willing to back those proposals with real code.
The best option you have with PostgreSQL at this time is to run a separate PostgreSQL instance for the response-time-critical database. Give this DB a big enough shared_buffers that the whole DB will reside in shared_buffers.
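A rough sketch of what that second instance's postgresql.conf might contain, assuming the critical database is around 4 GB (the values are illustrative only):

# postgresql.conf for the dedicated, response-time-critical instance
port = 5433                 # run alongside the main cluster on a different port
shared_buffers = 6GB        # large enough to hold the whole ~4 GB database
# keep the other memory settings modest so the main instance isn't starved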
Do NOT create a tablespace on a ramdisk or other non-durable storage and put the data that needs to be infrequently but rapidly accessed there. All tablespaces must be accessible or the whole system will stop; if you lose a tablespace, you effectively lose the whole database cluster.
After recently experimenting with MongoDB, I tried a few different methods of importing/inserting large amounts of data into collections. So far the most efficient method I've found is mongoimport. It works perfectly, but there is still overhead. Even after the import is complete, memory isn't made available unless I reboot my machine.
Example:
mongoimport -d flightdata -c trajectory_data --type csv --file trjdata.csv --headerline
where my headerline and data look like:
'FID','ACID','FLIGHT_INDEX','ORIG_INDEX','ORIG_TIME','CUR_LAT', ...
'20..','J5','79977','79977','20110116:15:53:11','1967', ...
With 5.3 million rows by 20 columns, about 900MB, I end up like this:
This won't work for me in the long run; I may not always be able to reboot, or will eventually run out of memory. What would be a more effective way of importing into MongoDB? I've read about periodic RAM flushing; how could I implement something like that with the example above?
Update:
I don't think my case would benefit much from adjusting fsync, syncdelay, or journaling. I'm just curious as to when that would be a good idea, and best practice, even if I were running on high-RAM servers.
I'm guessing that the memory is being used by MongoDB itself, not mongoimport. MongoDB by design tries to keep all of its data in memory and relies on the OS to swap the memory-mapped files out when there's not enough room. So I'd give you two pieces of advice:
Don't worry too much about what your OS is telling you about how much memory is "free" -- a modern well-running OS will generally use every bit of RAM available for something.
If you can't abide by #1, don't run mongodb on your laptop.