Fastest way to restore sql dump to RDS - mysql-workbench

I am trying to restore a large *.sql dump (~4 GB) to one of my databases on RDS. Last time I restored it using Workbench, and the whole process took more than 24 hours to complete.
I wonder if there is a quicker way to do this. Please help and share your thoughts.
EDIT: The SQL dump is on my local computer, by the way.
At the moment I have two options in mind:
Follow this link
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.Procedural.Importing.NonRDSRepl.html (with low confidence)
Dump the DB and compress it, upload the compressed dump to one of my EC2 instances, then SSH into that instance and run
mysql> source backup.sql;
I prefer the second approach (simply because I have more confidence in it), and it should also shorten the upload time, since the entire dump is uploaded in compressed form, then uncompressed, and finally restored.
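The compress-then-upload step of the second option could be sketched as follows. This is a minimal sketch using only Python's standard library; the function names and the 1 MiB chunk size are illustrative assumptions, not part of the original question. It streams the dump in chunks, so a 4 GB file never has to fit in memory.

```python
import gzip
import shutil
from pathlib import Path


def compress_dump(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> float:
    """Gzip src into dst chunk by chunk; return compressed/original size ratio."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    original = src.stat().st_size
    return dst.stat().st_size / original if original else 1.0


def decompress_dump(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> None:
    """Reverse of compress_dump, to be run on the machine doing the restore."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
```

SQL dumps are highly repetitive text, so the ratio is usually well below 1, which is what shortens the upload.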

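My suggestion is to back up the large tables one table at a time and restore them with indexes disabled. Records are then inserted much faster (at more than double the speed), and you simply re-enable the indexes after the restore completes. Note that DISABLE KEYS only affects non-unique indexes on MyISAM tables; InnoDB ignores it.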
Before restore command:
ALTER TABLE `table_name` DISABLE KEYS;
After restore completes:
ALTER TABLE `table_name` ENABLE KEYS;
Also add these extra commands at the top of the file to avoid a great deal of disk access:
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET AUTOCOMMIT = 0;
And add these at the end:
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
COMMIT;
I hope this will work, thank you.
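Adding those statements by hand to a multi-gigabyte file is awkward in most editors, so the wrapping could be scripted. The following is only a sketch: the function name and the file handling are assumptions, while the SET and COMMIT statements are the ones listed above.

```python
import shutil
from pathlib import Path

# Statements placed before the dump contents.
PROLOGUE = (
    "SET FOREIGN_KEY_CHECKS = 0;\n"
    "SET UNIQUE_CHECKS = 0;\n"
    "SET AUTOCOMMIT = 0;\n"
)
# Statements placed after the dump contents.
EPILOGUE = (
    "SET UNIQUE_CHECKS = 1;\n"
    "SET FOREIGN_KEY_CHECKS = 1;\n"
    "COMMIT;\n"
)


def wrap_dump(src: Path, dst: Path) -> None:
    """Copy src to dst with the speed-up statements added around it."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(PROLOGUE.encode("ascii"))
        shutil.copyfileobj(fin, fout)  # streams; the dump is never fully in memory
        # Leading newline guards against a dump that lacks a trailing one.
        fout.write(b"\n" + EPILOGUE.encode("ascii"))
```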

Your intuition about using an EC2 intermediary is correct, but I really think the biggest thing that will benefit you is actually being inside AWS. I occasionally perform a similar procedure to move the contents of a 6GB database from one RDS DB instance to another.
My configuration:
Source DB instance and Target DB instance are in the same region and availability zones. Both are db.m3.large.
EC2 instance is in the same region and availability zone as the DB instances.
The EC2 instance is compute optimized. (I use c3.xlarge but would recommend the c4 family if I were to start again from scratch).
From here, I just use the EC2 instance to perform a very simple mysqldump from the source instance, and then a very simple restore (with the mysql client) to the target instance.
Being in the same region and availability zone really makes a big difference, because it reduces network latency during the procedure. The data transfer is also free or near-free in this situation. The instance class you choose for both your EC2 and RDS instances is also important -- if you want this to be a fast operation, you want to be able to maximize I/O.
External factors like network latency and CPU can (and in my experience, do) bottleneck the operation if you don't provide enough resources. In my example with the db.m3.large instance, the mysqldump takes less than 5 minutes and the restore takes about 15 minutes. I've attempted to restore to a db.m3.medium RDS instance before, and the restore took a little over an hour because the CPU bottlenecked -- it wasn't able to keep up with the EC2 instance. Back when I was restoring from my local machine, being outside of their network caused the whole process to take over 4 hours.
This EC2 intermediary shouldn't need to be available 24/7. Turn it off or terminate it when you're done with it. I only have to pay for an hour of c3.xlarge time. Also remember that you can scale your RDS instance class up/down temporarily to increase resources available during your dump. Try to get your local machine out of the equation, if you can.
If you're interested in this strategy, AWS themselves has provided some documentation on the matter.
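The two steps run from the EC2 instance could look roughly like the command lines built below. This is a sketch under assumptions: the host names, user, database name, and dump path are hypothetical placeholders, and the exact options the answer's author used are not known. `--password` without a value makes each client prompt for the password.

```python
import shlex


def _join(args):
    """Quote each argument so the line can be pasted into a POSIX shell."""
    return " ".join(shlex.quote(a) for a in args)


def dump_command(source_host, user, database, dump_path):
    """Build the mysqldump line that writes the source database to a file."""
    args = ["mysqldump", f"--host={source_host}", f"--user={user}",
            "--password", database]
    return f"{_join(args)} > {shlex.quote(dump_path)}"


def restore_command(target_host, user, database, dump_path):
    """Build the mysql client line that replays the dump on the target."""
    args = ["mysql", f"--host={target_host}", f"--user={user}",
            "--password", database]
    return f"{_join(args)} < {shlex.quote(dump_path)}"


# Hypothetical endpoints, for illustration only.
print(dump_command("source-db.example.com", "admin", "app_db", "/tmp/app_db.sql"))
print(restore_command("target-db.example.com", "admin", "app_db", "/tmp/app_db.sql"))
```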

The command line is always faster than Workbench.
Try using these commands to restore your database.
To restore:
mysql -u root -p YOUR_DB_NAME < D:\your\file\location\dump.sql
To take a dump:
mysqldump -u root -p YOUR_DB_NAME > D:\your\file\location\dump.sql

wal-e/wal-g any benefit for simple backup and restore via S3

I'm using AWS RDS and have a need to replicate "database_a" in an RDS instance to "database_a" in a different RDS instance. The replication only needs to be once every 24 hours.
I'm currently solving this with pg_dump and pg_restore but am wondering if there is a better (ie faster/more efficient) way I can go about things.
Using wal-e/g and RDS, is it at all possible for my use case to simply push the latest changes from, say, the last 24 hours? The two RDS instances cannot talk to each other, so all transfer would go through S3. I'm not clear what the docs mean by 'When uploading backups to S3, the user should pass in the path containing the backup started by Postgres:' - does this mean I can create a pg backup on my EC2 instance and then point wal-g at this backup?
Finally, is it at all possible to just use wal-e/g for complete backups (i.e. non-incremental), just as I am doing now with pg_dump/pg_restore, and would I see a speed improvement by switching?
Thanks in advance,
In a word, yes.
On a system using dump/restore, you're consuming a lot more CPU and network resources (and therefore money), which you could reduce notably by using the WALs for incremental backups and only taking a full image perhaps once a week. This is especially true if your database is mostly data that doesn't change. It might not hold if your database is not growing but is made of records that are updated many times per 24 hours (e.g. stock prices).
Once you are publishing WALs to S3 frequently, you'll have a far more up-to-date backup than nightly backups give you.
When publishing WALs, you can recover to any point in time.
WAL-E and WAL-G both have built-in encryption.
There is also differential backup support, but it's not something I've played with.
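The trade-off described above can be made concrete with a little arithmetic. The numbers below are hypothetical, chosen only to illustrate both cases: a mostly static database, and a small but heavily updated one where the WAL volume exceeds the cost of daily full dumps.

```python
def weekly_transfer_full_dumps(db_size_gb, days=7):
    """Data moved per week when a full dump is shipped every day."""
    return db_size_gb * days


def weekly_transfer_wal(db_size_gb, daily_wal_gb, days=7):
    """Data moved per week with one weekly base backup plus daily WAL."""
    return db_size_gb + daily_wal_gb * days


# Mostly static data: 100 GB database, 2 GB of WAL per day (assumed figures).
print(weekly_transfer_full_dumps(100))   # 700
print(weekly_transfer_wal(100, 2))       # 114

# Same size, but rows rewritten constantly: 150 GB of WAL per day (assumed).
print(weekly_transfer_wal(100, 150))     # 1150
```

In the second case the WAL approach moves more data than the daily dumps would, which is the "stock prices" caveat.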

RDS instance unusably slow after restoring from snapshot

Details:
Database: Postgres.
Version: 9.6
Host: Amazon RDS
Problem: After restoring from snapshot, the database is unusably slow.
Why: Something AWS calls the "first touch penalty". When a newly restored instance becomes available, the EBS volume attachment is complete but not all the data has been migrated to the attached EBS volume from S3. Only after initially "touching" the data will RDS realize the data isn't on the EBS volume and it needs to pull it from S3. This completely destroys our performance. We also cannot use dd or fio to pre-touch the data because RDS does not allow access to the mounted EBS volumes.
What I've done: Contact AWS support. They acknowledged that it's a problem, that they are working on it and that the only solution is to select * from all tables.
Why I still need help: The select * strategy did speed things up (I selected everything from the public schema), but not as much as is needed. So I read up on how postgres stores data to disk. There's a heck of a lot on disk that wouldn't be "touched" by a simple select from user-defined tables.
My question: Being limited to only SQL queries/functions and not having direct access to the underlying disk, what are the best sql statements I can use to "touch" as much as possible on the disk in order to get it loaded on the EBS volume from S3?
My suggestion would be to manually trigger a VACUUM ANALYZE. This does a full table scan of each table within scope to update the planner with fresh statistics. You can scope this fairly easily to only certain schemas; limiting it to the database in question and the Postgres schema, for example, could help keep the total time down if you have multiple databases on the one host.
The operation is rather time-consuming and I'm not aware of a good way to parallelize it. There is also the vacuumdb utility, but this just runs a query with a VACUUM statement in it.
Source: I asked RDS support this very question a few days ago.
[1] https://www.postgresql.org/docs/9.5/static/sql-vacuum.html
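One way to scope the vacuum to a single schema is to generate one statement per table. The sketch below assumes the table list comes from the `pg_tables` catalog view; the helper names are hypothetical, and the generated statements would still have to be run through psql or a driver of your choice.

```python
# Query to run first; it returns (schemaname, tablename) pairs for one schema.
LIST_TABLES_SQL = (
    "SELECT schemaname, tablename FROM pg_tables "
    "WHERE schemaname = 'public' ORDER BY tablename;"
)


def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def vacuum_statements(tables):
    """Build one VACUUM ANALYZE statement per (schema, table) pair."""
    return [
        f"VACUUM ANALYZE {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
    ]


for stmt in vacuum_statements([("public", "users"), ("public", "orders")]):
    print(stmt)
```

Because each table gets its own statement, the list can also be split across several connections, though whether that helps with the first-touch penalty is untested here.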

Is it possible to run Postgres on a write-protected file system? Or a shared file system?

I'm trying to set up a distributed processing environment, with all of the data sitting in a single shared network drive. I'm not going to write anything to it, and just be reading from it, so we're considering write-protecting the network drive as well.
I remember when I was working with MSSQL, I could back up databases to a DVD and load it directly as a read-only database. If I can do something like that in Postgres, I should be able to give it an abstraction like a read-only DVD, and all will be good.
Is something like this possible in Postgres, if not, any alternatives? (MySQL? sqlite even?) Or if that's not possible is there some way to specify a shared file system? (Make it know that other processes are reading from it as well?)
For various reasons, using a parallel dbms is not possible, and I need two DB processes running parallel...
Any help is greatly appreciated.
Thanks!!
Write-protecting the data directory will cause PostgreSQL to fail to start, as it needs to be able to write postmaster.pid. PostgreSQL also needs to be able to write temporary files and tablespaces, set hint bits, manage the visibility map, and more.
In theory it might be possible to modify the PostgreSQL server to support running on a read-only database, but right now AFAIK this is not supported. Don't expect it to work. You'll need to clone the data directory for each instance.
If you want to run multiple PostgreSQL instances for performance reasons, having them fighting over shared storage would be counter-productive anyway. If the DB is small enough to fit in RAM it'd be OK ... but in that case it's also easy to just clone it to each machine. If the DB isn't big enough to be cached in RAM then both DB instances would be I/O bottlenecked and unlikely to perform any better than (probably slightly worse than) a single DB not subject to storage contention.
There's some chance that you could get it to work by:
Moving the constant data into a new tablespace onto read-only shared storage
Taking a basebackup of the database, minus the newly separated tablespace for shared data
Copying the basebackup of the DB to read/write private storage on each host that'll run a DB
Mounting the shared storage and linking the tablespace in place where Pg expects it
Starting pg
... at least if you force hint-bit setting and VACUUM FREEZE everything in the shared tablespace first. It isn't supported, it isn't tested, it probably won't work, there's no benefit over running private instances, and I sure as hell wouldn't do it, but if you really insist you could try it. Crashes, wrong query results, and other bizarre behaviour are not unlikely.
I've never tried it, but it may be possible to run postgres with a data dir which is mostly on a RO file system if all your use is indeed read-only. You will need to be sure to disable autovacuum. I think even read activity may generate xlog mutation, so you will probably have to symlink the pg_xlog directory onto a writeable file system. Sometimes read queries will spill to disk for large sorts or other temp requirements, so you should also link base/pgsql_tmp to a writeable disk area.
As Richard points out there are visibility hint bits in the data heap. May want to try VACUUM FULL FREEZE ANALYZE on the db before putting it on the RO file system.
"Is something like this possible in Postgres, if not, any alternatives? (MySQL? sqlite even?)"
I'm trying to figure out if I can do this with postgres as well, to port over a system from sqlite. I can confirm that this works just fine with sqlite3 database files on a read-only NFS share. Sqlite does work nicely for this purpose.
When done with sqlite, we cut over to a new directory with new sqlite files whenever there are updates. We don't ever insert into the in-use database. I'm not sure if inserts would pose any problems (with either database). Caching read-only data at the OS level could be an issue if another database instance mounted the dir read-write. This is something I would ideally like to be able to do.

is it possible to fork a mysqldump of data?

I am restoring a mysql database with perl on a remote server with about 30 million records. It's taking > 2 days & looking at my network connections I am not fully utilizing my uplink bandwidth. I will need to do this at least 1x per week. Is there a way to fork a mysqldump (I'm using perl) so that I can take full advantage of my bandwidth (I don't mind if I'm choked off for a bit...I just need to get this done faster).
Can't you upload the whole dump to the remote server and start the restore there?
A restore of a mysqldump is just the execution of a long series of commands that would restore your database from scratch. If the execution path for that is: 1) send command, 2) remote system executes command, 3) remote system replies that the command is complete, 4) send next command, then you are spending most of your time waiting on network latency.
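A rough estimate shows how quickly that waiting adds up. This sketch assumes one network round trip per statement and one statement per record, which is a simplification (a dump with multi-row INSERTs needs far fewer statements); the round-trip times are hypothetical.

```python
def round_trip_wait_hours(statements, rtt_ms):
    """Hours spent purely waiting on the network, one round trip per statement."""
    return statements * rtt_ms / 1000 / 3600


# 30 million statements over a link with an assumed 5 ms round trip.
print(round(round_trip_wait_hours(30_000_000, 5), 1))    # 41.7

# The same statements run on the server itself, assumed 0.2 ms per round trip.
print(round(round_trip_wait_hours(30_000_000, 0.2), 1))  # 1.7
```

None of this time depends on bandwidth, which is consistent with the uplink sitting mostly idle during the restore.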
I do know that most SQL hosts will allow you to upload a dump file specifically to avoid the kinds of restore time that you're talking about. The company that takes my money each month even has a web-based form that you can use to restore a database from a file that has been uploaded via sftp. Poke around your hosting service's documentation. They should have something similar. If nothing else (and you're comfortable on the command line) you can upload it directly to your account and do it from a shell there.
mk-parallel-dump and mk-parallel-restore are designed to do what you want, but in my testing mk-parallel-dump was actually slower than plain old mysqldump. Your mileage may vary.
(I would guess the biggest factor would be the number of spindles your data files reside on, which in my case, 1, was not especially conducive to parallelization.)
First caveat: mk-parallel-* writes a bunch of files, and figuring out when it's safe to start sending them (and when you're done receiving them) may be a little tricky. I believe that's left as an exercise for the reader, sorry.
Second caveat: mk-parallel-dump is specifically advertised as not being for backups. Because "At the time of this release there is a bug that prevents --lock-tables from working correctly," it's really only useful for databases that you know will not change, e.g., a slave that you can STOP SLAVE on with no repercussions, and then START SLAVE once mk-parallel-dump is done.
I think a better solution than parallelizing a dump may be this:
If you're doing your mysqldump on a weekly basis, you can just do it once (dumping with --single-transaction (which you should be doing anyway) and --master-data=n) and then start a slave that connects over an ssh tunnel to the remote master, so the slave is continually updated. The disadvantage is that if you want to clone a local copy (perhaps to make a backup) you will need enough disk to keep an extra copy around. The advantage is that a week's worth of (query-based) replication log is probably quite a bit smaller than resending the data, and also it arrives gradually so you don't clog your pipe.
How big is your database in total? What kind of tables are you using?
A big risk with backups using mysqldump has to do with table locking, and updates to tables during the backup process.
The mysqldump backup process basically works as follows:
For each table {
    Lock table as Read-Only
    Dump table to disk
    Unlock table
}
The danger is that if you run an INSERT/UPDATE/DELETE query that affects multiple tables while your backup is running, your backup may not capture the results of your query properly. This is a very real risk when your backup takes hours to complete and you're dealing with an active database. Imagine - your code runs a series of queries that update tables A,B, and C. The backup process currently has table B locked.
The update to A will not be captured, as this table was already backed up.
The update to B will not be captured, as the table is currently locked for writing.
The update to C will be captured, because the backup has not reached C yet.
This is an easy way to destroy referential integrity in your database.
Your backup process needs to be atomic, and transactional. If you can't shut down the entire database to writes during the backup process, you're risking disaster.
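The A/B/C scenario above can be reproduced with a toy simulation. This is only an illustration of the table-by-table model described in this answer, using in-memory lists rather than a real database; the function and its parameters are hypothetical.

```python
import copy


def table_by_table_backup(db, order, update_after, updates):
    """Copy tables one at a time, applying `updates` once `update_after` is copied.

    Applying the writes right after that table is copied models a multi-table
    change arriving while it is locked: the write to the locked table only
    lands once the lock is released, after its copy has been taken.
    """
    backup = {}
    for name in order:
        backup[name] = copy.deepcopy(db[name])
        if name == update_after:
            for table, row in updates:
                db[table].append(row)
    return backup


db = {"A": ["a1"], "B": ["b1"], "C": ["c1"]}
backup = table_by_table_backup(
    db, ["A", "B", "C"], "B", [("A", "a2"), ("B", "b2"), ("C", "c2")]
)
print(backup)  # {'A': ['a1'], 'B': ['b1'], 'C': ['c1', 'c2']}
```

The backup ends up with the new row in C but not in A or B, a state the live database never had.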
Also - there must be something wrong here. At a previous company, we were running nightly backups of a 450G Mysql DB (largest table had 150M rows), and it took less than 6 hours for the backup to complete.
Two thoughts:
Do you have a slave database? Run the backup from there - Stop replication (preventing RW risk), run the backup, restart replication.
Are your tables using InnoDB? Consider investing in InnoDB Hot Backup, which solves this problem, as the backup process leverages the journaling that is part of the InnoDB storage engine.

Postgresql PITR backup: best practices to handle multiple databases?

Hi guys, I have a PostgreSQL 8.3 server with many databases.
Currently, I'm planning to back up those databases with a script that stores each backup in a folder with the same name as the database, for example:
/mypath/backup/my_database1/
/mypath/backup/my_database2/
/mypath/backup/foo_database/
Every day I make one dump every 2 hours, overwriting the previous day's files. For example, in the my_database1 folder I have:
my_database1.backup-00.sql //backup made every day at 00:00
my_database1.backup-02.sql //backup made every day at 02:00
my_database1.backup-04.sql //backup made every day at 04:00
my_database1.backup-06.sql //backup made every day at 06:00
my_database1.backup-08.sql //backup made every day at 08:00
my_database1.backup-10.sql //backup made every day at 10:00
[...and so on...]
This is how I currently make sure I can restore every database while losing at most 2 hours of data.
2 hours still looks like too much.
I've had a look at PostgreSQL PITR through the WAL files, but those files seem to contain the data for all my databases.
I'd need to separate those files, in the same way I separate the dump files.
How can I do that?
Otherwise, is there another easy-to-install backup procedure that would allow me to restore just one database to its state of 10 seconds earlier, without creating a dump file every 10 seconds?
It is not possible with one instance of PostgreSQL.
You can divide your 500 tables between several instances, each listening on different port, but it would mean that they will not use resources like memory effectively (memory reserved but unused in one instance can not be used by another).
Slony will also not work here, as it does not replicate DDL statements, like dropping a table.
I'd recommend doing both:
continue to do your pg_dump backups, but try to smooth them out: throttle pg_dump I/O bandwidth so it does not cripple the server, and run it continuously, so that when it finishes with the last database it immediately starts again with the first one;
additionally setup PITR.
This way you can restore a single database fast, but you can lose some data. If you decide that you cannot afford to lose that much data, then you can restore your PITR backup to a temporary location (with fsync=off and pg_xlog symlinked to a ramdisk for speed), pg_dump the affected database from there, and restore it to your main database.
Why do you want to separate the databases?
The way the PITR works, it is not possible to do since it works on the complete cluster.
What you can do in that case is to create a data directory and a separate cluster for each of those databases (not recommended though since it will require different ports, and postmaster instances).
I believe that the benefits of using PITR instead of regular dumps outweigh having separate backups for each database, so perhaps you can re-think the reasons for why you need to separate it.
Another way could be to set up some replication with Slony-I but that would require a separate machine (or instance) that receives the data. On the other hand, that way you would have a replicated system in near real-time.
Update for comment:
To recover from mistakes, like deleting a table, PITR would be perfect since you can replay to a specific time. However, for 500 databases I understand that can be a lot of overhead. Slony-I would probably not work, since it is replicating. Not sure how it handles table deletions.
I am not aware of any other ways you can go. What I would do would probably still be to go for PITR and just not make any mistakes ;). Jokes aside, depending on how frequently mistakes are made, this could be a solution:
Set it up for PITR
Have a second instance ready on standby.
When a mistake happens, replay the restore to the point in time on the second instance.
Do a pg_dump of the affected database from that instance.
Do a pg_restore on the production instance for that database.
However, it would require you to have a second instance ready, either on the same server or a different one (different is recommended). Also, the restore time would be a bit longer since it would require you to do one extra dump and restore.
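The last two steps of that procedure (dump the affected database from the standby, restore it on production) could be sketched as the command lines built below. The host names, ports, and dump path are hypothetical, and the point-in-time replay on the standby is assumed to have been done already; only the pg_dump and pg_restore invocations are shown.

```python
import shlex


def single_db_recovery_commands(database, standby_host, standby_port,
                                prod_host, prod_port, dump_path):
    """Build the pg_dump (from standby) and pg_restore (to production) lines."""
    dump = [
        "pg_dump", "--format=custom",
        f"--host={standby_host}", f"--port={standby_port}",
        f"--file={dump_path}", database,
    ]
    restore = [
        # --clean drops existing objects before recreating them.
        "pg_restore", "--clean",
        f"--host={prod_host}", f"--port={prod_port}",
        f"--dbname={database}", dump_path,
    ]
    return [" ".join(shlex.quote(a) for a in cmd) for cmd in (dump, restore)]


for line in single_db_recovery_commands(
    "my_database1", "standby.internal", 5433,
    "prod.internal", 5432, "/tmp/my_database1.dump",
):
    print(line)
```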
I think the way you are doing this is flawed. You should have one database with multiple schemas and roles. Then you can use PITR. However PITR is not a replacement for dumps.