Slow database restore from SQL file, considering binary replication to roll back test database - postgresql

When using Cucumber with Capybara, I have to load the test database data from an SQL dump.
Unfortunately this takes 10 seconds per scenario, which slows down the tests.
I have found something like: http://wiki.postgresql.org/wiki/Binary_Replication_Tutorial#How_to_Replicate
Do you think binary replication would be quicker than using SQL files?
Is there anything I can do to make the restore quicker (I restore just data, not structure)?
What approaches would you recommend trying?

You could try putting your test data into a "template" database (e.g. mydb_template).
To prepare a test scenario, you simply drop your database with DROP DATABASE mydb and recreate it based on the template: CREATE DATABASE mydb TEMPLATE = mydb_template;.
Of course you'll need to connect to e.g. template1 or the postgres database in order to be able to drop mydb.
I think this could be faster than importing a dump.
I recall a discussion on the PG mailing list about this approach and some performance problems with large templates that were fixed in 9.0.
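A minimal sketch of that cycle from the shell; mydb and mydb_template are placeholder names:
# connect to a maintenance database so mydb itself has no open connections
psql -d postgres -c "DROP DATABASE IF EXISTS mydb;"
psql -d postgres -c "CREATE DATABASE mydb TEMPLATE = mydb_template;"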

(I restore just data, not structure)
COPY is always fastest for importing just data. The other answer deals with restoring a whole database.
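For example, a minimal data-only load via psql, assuming a placeholder table my_table and file my_table.csv:
psql -d mydb -c "\copy my_table FROM 'my_table.csv' WITH (FORMAT csv)"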

Related

What is the easiest way to generate a script to drop and create all objects in a database?

I'm used to working with SQL Server, where SQL Server Management Studio has an option to automatically generate a script to drop and recreate everything in a database (tables/views/procedures/etc.). I find that when developing a new application and writing a bunch of junk into a local database for basic testing, it's very helpful to be able to just nuke the whole thing and recreate it from a clean slate, so I'm looking for similar functionality in Postgres/pgAdmin.
pgAdmin has an option to generate a create script for a specific table, but right-clicking each table would be very tedious, and I'm wondering if there's another way to do it.
To recreate a clean schema-only database you can use the pg_dump client included with a Postgres server install. The options to use are:
-c
--clean
Output commands to clean (drop) database objects prior to outputting the commands for creating them. (Unless --if-exists is also specified, restore might generate some harmless error messages, if any objects were not present in the destination database.)
This option is ignored when emitting an archive (non-text) output file. For the archive formats, you can specify the option when you call pg_restore.
and:
-s
--schema-only
Dump only the object definitions (schema), not data.
This option is the inverse of --data-only. It is similar to, but for historical reasons not identical to, specifying --section=pre-data --section=post-data.
(Do not confuse this with the --schema option, which uses the word “schema” in a different meaning.)
To exclude table data for only a subset of tables in the database, see --exclude-table-data.
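Putting the two together, a sketch that writes a drop-and-recreate script for the schema of a placeholder database mydb:
pg_dump --clean --if-exists --schema-only -f recreate_schema.sql mydb
# later, to nuke and rebuild the schema:
psql -d mydb -f recreate_schema.sql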
clean in Flyway
The database migration tool Flyway offers a clean command that drops all objects in the configured schemas.
To quote the documentation:
Clean is a great help in development and test. It will effectively give you a fresh start, by wiping your configured schemas completely clean. All objects (tables, views, procedures, …) will be dropped.
Needless to say: do not use against your production DB!
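A sketch of invoking it from the Flyway command line; the connection details are placeholders:
flyway -url=jdbc:postgresql://localhost:5432/mydb -user=myuser -password=mypassword clean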

Managing foreign keys when using pg_restore with multiple dumps

I have a bit of a weird issue. We were trying to create a database baseline for our local environment that has very specific data pre-seeded into it. Our hopes were to make sure that everyone was operating with the same data, making collaboration and reviewing code a bit simpler.
My idea for this was to run a command to dump the database whenever we run a migration or decide a new account is necessary for local dev. The issue is that the database dump is around 17 MB, and I'm trying to avoid adding a 17 MB file to GitHub every time we update the database.
So the best solution I could think of was to set up a script to dump each individual table in the database. That way, if a single table is updated, we'd only be pushing that backup to GitHub, and it would be closer to a ~200 KB file as opposed to 17 MB.
The main issue I'm running into with this is trying to restore the database. With a full dump, handling the foreign keys is relatively simple as it's all done in a single restore command. But with multiple restores, it gets a bit more complicated.
I'm looking to find a way to restore all tables to a database, ignoring triggers and constraints, and then enabling them again once the data has been populated. (or find a way to export the tables based on the order the foreign keys are defined). There are a lot of tables to work with, so doing this manually would be a bit of a task.
I'm also concerned about the relational integrity of the database if I disabled/re-enable constraints. Any help or advice would be appreciated.
Right now I'm running the following on every single table:
pg_dump postgres://user:password@pg:5432/database -t table_name -Fc -Z9 -f /data/www/database/data/table_name.bak
And then this loop to restore all the backups into the DB:
$data_command = "pg_restore --disable-triggers -d $dbUrl -Fc \"%s\"";
$backups = glob("$directory*.bak");
foreach ($backups as $data_file) {
    // glob() returns full paths, so compare against the basename
    if (basename($data_file) != 'data_roles.bak') {
        exec(sprintf($data_command, $data_file));
    }
}
This obviously doesn't work, as I hit a ton of "relation does not exist" errors. I guess I'm just looking for a better way to accomplish this.
I would separate the table data from the database metadata.
Create pre- and post-data scripts with:
pg_dump --section=pre-data -f pre.sql mydb
pg_dump --section=post-data -f post.sql mydb
Then dump just the data for each table:
pg_dump --section=data --table=tab1 -f tab1.sql mydb
To restore the database, first restore pre.sql, then all the table data, then post.sql.
The pre- and post-data will change often, but they are not large, so that shouldn't be a problem.
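A minimal restore sketch, assuming the dump files from above and a freshly created placeholder database mydb:
# schema first, then per-table data, then constraints and indexes
psql -d mydb -f pre.sql
for f in tab*.sql; do
    psql -d mydb -f "$f"
done
psql -d mydb -f post.sql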

Backing up redshift database

I want to back up an entire Redshift cluster in such a way that I can use the data in other systems like MySQL or Hadoop in the future.
I was looking it up, and creating a manual snapshot seems to be an option, but I guess that won't work across database technologies.
So what would be the detailed steps to back up the entire Redshift cluster?
Cluster backups can be done via the AWS console, but these can only be restored to another Redshift cluster.
Because Redshift differs from Postgres in many ways, it will be impossible or at least tricky to use standard tools like pg_dump and pg_restore.
I think that your best option is to:
1. Extract the DDL from the Redshift tables that you wish to create elsewhere; most IDEs have a simple way to do this.
2. Modify the DDL to work with your target database (e.g. Postgres will be easy, MySQL harder).
3. Copy the contents of the Redshift database to S3, one table at a time, using the UNLOAD command (see the sketch after this list).
4. Import the data that you unloaded in step 3 into your target tables.
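A hedged sketch of step 3, run through psql; the cluster endpoint, bucket, and IAM role are placeholders:
psql "host=mycluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=admin" \
     -c "UNLOAD ('SELECT * FROM my_table')
         TO 's3://my-bucket/my_table_'
         IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
         DELIMITER ',' ALLOWOVERWRITE;"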

Backup specific tables in AWS RDS Postgres Instance

I have two databases on Amazon RDS, both Postgres: Database 1 and Database 2.
I need to restore an instance from a snapshot of Database 1 for my Staging environment. (Database 2 is my current Staging DB).
However, I want the data from a few of the tables in Database 2 to overwrite the tables in the newly restored snapshot. What is the best way to do this?
When restoring RDS from a Snapshot, a new database instance is created. If you only wish to copy a portion of the snapshot:
Restore the snapshot to a new (temporary) database
Connect to the new database and dump the desired tables using pg_dump
Connect to your staging server and restore the tables using pg_restore (most probably deleting any matching existing tables first)
Delete the temporary database
pg_dump actually outputs SQL commands that are then used to recreate tables and restore data. Look at the content of a dump to understand how the restore process actually works.
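A sketch of steps 2 and 3; the endpoints, credentials, and table names are placeholders:
# dump the desired tables from the temporary (restored) instance
pg_dump -h temp-db.xxxx.us-east-1.rds.amazonaws.com -U myuser -d mydb \
        -t table1 -t table2 -Fc -f tables.dump
# restore into staging, dropping any matching existing tables first
pg_restore -h staging-db.xxxx.us-east-1.rds.amazonaws.com -U myuser -d mydb \
           --clean --if-exists tables.dump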
I hope this still helps someone else.
My team faced a similar issue: we also had two Postgres databases and only needed to back up some tables from db1 to db2.
What we did was use an AWS Lambda function written in Python that connected to both databases and checked whether db1.table1 had the same data as db2.table1; if not, the Lambda function wrote the missing data from db1.table1 into db2.table1. We chose Lambda because we wanted to automate the process, since the main database (db1) is constantly being updated. It also allowed us to back up only the tables we wanted (say, 3 tables out of 10) instead of the whole database.
Note: you may want to do these writes through temporary tables to avoid issues with any constraints on your tables.

Can PostgreSQL be used with an on-disk database?

Currently, I have an application that uses Firebird in embedded mode to connect to a relatively simple database stored as a file on my hard drive. I want to switch to using PostgreSQL to do the same thing (Yes, I know it's overkill). I know that PostgreSQL cannot operate in embedded mode and that is fine - I can leave the server process running and that's OK with me.
I'm trying to figure out a connection string that will achieve this, but have been unsuccessful. I've tried variations on the following:
jdbc:postgresql:C:\myDB.fdb
jdbc:postgresql://C:\myDB.fdb
jdbc:postgresql://localhost:[port]/C:\myDB.fdb
but nothing seems to work. PostgreSQL's directions don't include an example for this case. Is this even possible?
You can trick it. If you are running PostgreSQL on a UNIX-like system, you should be able to create a RAM disk and use that for the database storage. Here's a pretty good step-by-step guide for RAM disks on Linux.
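A sketch of the tmpfs approach on Linux; the mount point and size are placeholder choices:
sudo mkdir -p /mnt/pgram
sudo mount -t tmpfs -o size=512M tmpfs /mnt/pgram
sudo chown postgres:postgres /mnt/pgram
sudo -u postgres initdb -D /mnt/pgram/data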
In general though, for an SQL-database-in-RAM kind of application I would suggest using SQLite.
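For instance, the sqlite3 command-line shell can run a database entirely in memory:
sqlite3 :memory: "CREATE TABLE t(x INTEGER); INSERT INTO t VALUES (1); SELECT * FROM t;"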
Postgres databases are not a single file. There is at least one file for each table and each index in the data directory, inside a directory for the database. The files are named with internal object IDs (OIDs) of the database / table / index, not with their names.
The JDBC urls point to the database name, not any specific file:
jdbc:postgresql:foodb (localhost is implied)
If by "disk that behaves like memory" you mean that the db only exists for the lifetime of your program, there's no reason why you can't create a db at program start and drop it at program exit. Note that this is just DDL to create the database, not creating the data directory via the initdb program. You could connect to the default postgres db, create your db, and then connect to it.
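A minimal sketch of that lifecycle; scratchdb is a placeholder name:
# at program start: connect to the default db and create a scratch database
psql -d postgres -c "CREATE DATABASE scratchdb;"
# ... run the application against scratchdb ...
# at program exit: drop it again
psql -d postgres -c "DROP DATABASE scratchdb;"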
Firebird 2.1 onwards supports global temporary tables, which only exist for the duration of the database connection.
Syntax goes something like CREATE GLOBAL TEMPORARY TABLE ... ON COMMIT PRESERVE ROWS