Faster method for copying PostgreSQL databases - postgresql

I have 12 test databases on a test server and I generate the dump from the main server and copy it to the test server every day. I populate the test databases using:
zcat ${dump_address} | psql $db_name
It takes 45 minutes for each database. Would it be faster to do that for just one database and then use:
CREATE DATABASE newdb WITH TEMPLATE olddb;
for the rest? Are there any other methods I could try?

It depends on how much data there is and how many indexes you have. Some ideas to speed things up:
Make sure the dump does not use INSERT commands - they tend to be much slower than COPY.
If you have many indexes or constraints, use the custom or directory archive format and then call pg_restore -j $my_cpu_count. -j controls how many jobs create them concurrently, and index creation is often CPU-bound (see the sketch after this list).
Use faster disks.
Copy $PGDATA in its entirety (rsync or some fancy snapshots).
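For example, a minimal sketch of the custom-format dump plus parallel restore approach (the paths, database name, and job count here are placeholders, not taken from the question):
# On the main server: dump one database in the custom archive format
# (it is compressed by default, so no separate gzip step is needed).
pg_dump -Fc -f /backups/testdb.dump testdb
# On the test server: restore with several parallel jobs; the number of
# CPU cores is a reasonable starting point for -j.
pg_restore -c -j "$(nproc)" -d testdb /backups/testdb.dump
The remaining test databases could then be cloned locally with CREATE DATABASE ... WITH TEMPLATE as in the question, keeping in mind that nobody may be connected to the template database while the copy runs.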

Related

Backing a very large PostgreSQL DB "by parts"

I have a PostgreSQL DB that is about 6 TB. I want to transfer this database to another server using, for example, pg_dumpall. The problem is that I only have a 1 TB HD. How can I copy this database to the other new server, which has enough space? Let's suppose I cannot get another HD. Is it possible to create partial backup files, upload them to the new server, erase the HD, and get another batch of backup files until the transfer is complete?
This works here (proof of concept):
shell-command issued from the receiving side
the remote side dumps through the network connection
the local psql just accepts the commands from this connection
the data is never stored in a physical file
(for brevity, I only sent the table definitions, not the actual data: --schema-only)
You could have some problems with users and tablespaces (these are global for an installation in Postgres); pg_dumpall will dump and restore these too, IIRC.
#!/bin/bash
remote=10.224.60.103
dbname=myremotedbname
pg_dump -h ${remote} --schema-only -c -C ${dbname} | psql
#eof
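If you also want the data, the same pipe should work without --schema-only (an untested sketch, using the same placeholder host and database name as above):
# full schema + data over the network, never touching a local dump file
pg_dump -h ${remote} -c -C ${dbname} | psql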
As suggested above, if you have a fast network connection between source and destination, you can do it without any extra disk.
However, for a 6 TB DB (which I assume includes indexes), using the custom archive format (-Fc) could yield a database dump of less than 1 TB.
Regarding the "by parts" question: yes, it is possible using the table pattern (-t, --table):
pg_dump -t TABLE_NAME ...
You can also exclude tables using -T, --exclude-table:
pg_dump -T TABLE_NAME ...
The above options (-t, -T) can be specified multiple times and can even be combined.
They also support patterns for specifying the tables:
pg_dump -t 'employee_*' ...
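As a rough sketch of the "by parts" idea under the 1 TB disk constraint (the table names, hosts, and paths below are made up for illustration): dump a batch of tables, ship it, restore it on the new server, delete the local file, then repeat with the next batch.
#!/bin/bash
# Hypothetical batch transfer: keep the intermediate dump well below 1 TB.
batch="-t big_table_1 -t big_table_2"      # next run: a different set of tables
pg_dump -Fc $batch -f /tmp/batch.dump source_db
scp /tmp/batch.dump newserver:/tmp/batch.dump
ssh newserver "pg_restore -d target_db /tmp/batch.dump && rm /tmp/batch.dump"
rm /tmp/batch.dump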

Managing foreign keys when using pg_restore with multiple dumps

I have a bit of a weird issue. We were trying to create a database baseline for our local environment that has very specific data pre-seeded into it. Our hopes were to make sure that everyone was operating with the same data, making collaboration and reviewing code a bit simpler.
My idea for this was to run a command to dump the database whenever we run a migration or decide a new account is necessary for local dev. The issue with this is that the database dump is around 17 MB. I'm trying to avoid us having to add a 17 MB file to GitHub every time we update the database.
So the best solution I could think of was to set up a script to dump each individual table in the database. This way, if a single table is updated, we'd only be pushing that backup to GitHub, and it would be more along the lines of a ~200 KB file as opposed to 17 MB.
The main issue I'm running into with this is trying to restore the database. With a full dump, handling the foreign keys is relatively simple as it's all done in a single restore command. But with multiple restores, it gets a bit more complicated.
I'm looking to find a way to restore all tables to a database, ignoring triggers and constraints, and then enabling them again once the data has been populated. (or find a way to export the tables based on the order the foreign keys are defined). There are a lot of tables to work with, so doing this manually would be a bit of a task.
I'm also concerned about the relational integrity of the database if I disabled/re-enable constraints. Any help or advice would be appreciated.
Right now I'm running the following on every single table:
pg_dump postgres://user:password@pg:5432/database -t table_name -Fc -Z9 -f /data/www/database/data/table_name.bak
And then this command to restore all backups to the DB.
$data_command = "pg_restore --disable-triggers -d $dbUrl -Fc \"%s\"";
$backups = glob("$directory*.bak");
foreach ($backups as $data_file) {
    if ($data_file != 'data_roles.bak') {
        exec(sprintf($data_command, $data_file));
    }
}
This obviously doesn't work as I hit a ton of "Relationship doesn't exist" errors. I guess I'm just looking for a better way to accomplish this.
I would separate the table data and the database metadata.
Create a pre-data and a post-data script with
pg_dump --section=pre-data -f pre.sql mydb
pg_dump --section=post-data -f post.sql mydb
Then dump just the data for each table:
pg_dump --section=data --table=tab1 -f tab1.sql mydb
To restore the database, first restore pre.sql, then all the table data, then post.sql.
The pre- and post-data will change often, but they are not large, so that shouldn't be a problem.
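Put together, a dump/restore cycle along these lines might look like the sketch below (mydb, mydb_restored, and the table names are placeholders; a real script would loop over every table in the database):
#!/bin/bash
# Dump: metadata in two small files, data per table.
pg_dump --section=pre-data  -f pre.sql  mydb
pg_dump --section=post-data -f post.sql mydb
for tab in tab1 tab2 tab3; do
    pg_dump --section=data --table="$tab" -f "data_$tab.sql" mydb
done
# Restore: schema first, then the data, then indexes/constraints/triggers.
psql -d mydb_restored -f pre.sql
for f in data_*.sql; do
    psql -d mydb_restored -f "$f"
done
psql -d mydb_restored -f post.sql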

Best way to make PostgreSQL backups

I have a site that uses PostgreSQL. All content that I provide on my site is created in a development environment (this happens because it's web crawler content). The only information created in the production environment is information about the users.
I need to find a good way to update the data stored in production. Can I restore to production only the tables updated in the development environment, and will PostgreSQL update those records in production? Or would the best way be to back up the user information in production, insert it into development, and restore the whole database to production?
Thank you
You can use pg_dump to export the data just from the non-user tables in the development environment and pg_restore to bring that into prod.
The -t switch will let you pick specific tables.
pg_dump -d <database_name> -t <table_name>
https://www.postgresql.org/docs/current/static/app-pgdump.html
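For example (the table and database names are placeholders, not from the question), a content table could be moved from development to production like this:
# dump one crawled-content table from dev in the custom format
pg_dump -Fc -t crawled_content -f crawled_content.dump dev_db
# drop and recreate it in production from the dump
pg_restore --clean -d prod_db crawled_content.dump
Note that --clean drops the table before recreating it, so any foreign keys referencing it need to be taken into account.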
There are many tips around this subject here and here. I'd suggest taking a look at those links before anything else.
If your data is discarded on each update, then a plain dump will be enough. You can redirect the pg_dump output directly to a psql connected to production and avoid the pg_restore step, something like below:
# Of course you must drop the tables to load them again,
# so it's reasonable to make a full backup before this.
pg_dump -Fp -U user -h host_to_dev -T user your_db | psql -U user -h host_to_production your_db
You might be asking yourself, "Why is he saying to drop my tables?"
Bulk loading data on a fresh table is faster than deleting old data and inserting again. A quote from the docs:
Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
PS¹: If you can't connect to both environments at the same time, then you need to do the pg_restore step manually.
PS²: I don't recommend it, but you can append the --clean option to pg_dump to generate DROP statements automatically. Be extremely careful with this option to avoid dropping unexpected objects.

How to migrate data to remote server with PostgreSQL?

How can I dump my database schema and data in such a way that the usernames, database names and schema names of the dumped data match these variables on the servers I deploy to?
My current process entails moving the data in two steps. First, I dump the schema of the database (pg_dump --schema-only -C -c) then I dump out the data with pg_dump --data-only -C and restore these on the remote server in tandem using the psql command. But there has to be a better way than this.
We use the following to replicate databases.
pg_basebackup -x -P -D /var/lib/pgsql/9.2/data -h OTHER_DB_IP_ADDR -U postgres
It requires the "master" server at OTHER_DB_IP_ADDR to be running the replication service, and pg_hba.conf must allow replication connections. You do not have to run the "slave" service as a hot/warm standby in order to replicate. One downside of this method compared with a dump/restore is that the restore operation effectively vacuums, re-indexes, and resets EVERYTHING, while replication doesn't, so replicating can use a bit more disk space if your database has been heavily edited. On the other hand, replicating is MUCH faster (15 minutes vs 3 hours in our case) since indexes do not have to be rebuilt.
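For reference, the master-side setup amounts to something like the following (the network range and user are examples; the wal_level value was spelled hot_standby in the 9.2 era and replica in later releases):
# postgresql.conf on the master
wal_level = hot_standby        # 'replica' on PostgreSQL 9.6 and later
max_wal_senders = 3
# pg_hba.conf on the master: allow replication connections from the slave
host  replication  postgres  10.0.0.0/24  md5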
Some useful references:
http://opensourcedbms.com/dbms/setup-replication-with-postgres-9-2-on-centos-6redhat-el6fedora/
http://www.postgresql.org/docs/9.2/static/high-availability.html
http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgresql-streaming-replication-in-10-mins/

A faster way to copy a postgresql database (or the best way)

I did a pg_dump of a database and am now trying to install the resulting .sql file on to another server.
I'm using the following command.
psql -f databasedump.sql
I initiated the database install earlier today, and now, 7 hours later, the database is still being populated. I don't know if this is how long it is supposed to take, but I continue to monitor it; so far I've seen over 12 million inserts and counting. I suspect there's a faster way to do this.
Create your dumps with
pg_dump -Fc -Z 9 --file=file.dump myDb
-Fc
Output a custom archive suitable for input into pg_restore. This is the most flexible format in that it allows reordering of loading data as well as object definitions. This format is also compressed by default.
-Z 9: --compress=0..9
Specify the compression level to use. Zero means no compression. For the custom archive format, this specifies compression of individual table-data segments, and the default is to compress at a moderate level. For plain text output, setting a nonzero compression level causes the entire output file to be compressed, as though it had been fed through gzip; but the default is not to compress. The tar archive format currently does not support compression at all.
and restore it with
pg_restore -Fc -j 8 -d myDb file.dump
-j: --jobs=number-of-jobs
Run the most time-consuming parts of pg_restore — those which load data, create indexes, or create constraints — using multiple concurrent jobs. This option can dramatically reduce the time to restore a large database to a server running on a multiprocessor machine.
Each job is one process or one thread, depending on the operating system, and uses a separate connection to the server.
The optimal value for this option depends on the hardware setup of the server, of the client, and of the network. Factors include the number of CPU cores and the disk setup. A good place to start is the number of CPU cores on the server, but values larger than that can also lead to faster restore times in many cases. Of course, values that are too high will lead to decreased performance because of thrashing.
Only the custom and directory archive formats are supported with this option. The input must be a regular file or directory (not, for example, a pipe). This option is ignored when emitting a script rather than connecting directly to a database server. Also, multiple jobs cannot be used together with the option --single-transaction.
Links:
pg_dump
pg_restore
Improve pg dump&restore
PG_DUMP | always use the directory format with the -j option
time pg_dump -j 8 -Fd -f /tmp/newout.dir fsdcm_external
PG_RESTORE | always tune postgresql.conf and use the directory format with the -j option
work_mem = 32MB
shared_buffers = 4GB
maintenance_work_mem = 2GB
full_page_writes = off
autovacuum = off
wal_buffers = -1
time pg_restore -j 8 --format=d -C -d postgres /tmp/newout.dir/
For more info
https://gitlab.com/yanar/Tuning/wikis/improve-pg-dump&restore
Why are you producing a raw .sql dump? The opening description of pg_dump recommends the "custom" format -Fc.
Then you can use pg_restore which will restore your data (or selected parts of it). There is a "number of jobs" option -j which can use multiple cores (assuming your disks aren't already the limiting factor). In most cases, on a modern machine you can expect at least some gains from this.
Now you say "I don't know how long this is supposed to take". Well, until you've done a few restores you won't know. Do monitor what your system is doing and whether you are limited by CPU or disk I/O.
Finally, the configuration settings you want for restoring a database are not those you want to run it with. A couple of useful starters:
Increase maintenance_work_mem so you can build indexes in larger chunks
Turn off fsync during the restore. If your machine crashes, you'll start from scratch again anyway.
Do remember to reset them after the restore though.
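One way to apply just those two changes for the duration of the restore, assuming PostgreSQL 9.4 or later (for ALTER SYSTEM) and superuser access; the database name, dump file, and job count are placeholders:
# raise maintenance_work_mem and disable fsync for the restore only
psql -c "ALTER SYSTEM SET maintenance_work_mem = '2GB'"
psql -c "ALTER SYSTEM SET fsync = off"
psql -c "SELECT pg_reload_conf()"
pg_restore -j 8 -d mydb mydb.dump
# put the settings back as soon as the restore finishes
psql -c "ALTER SYSTEM RESET maintenance_work_mem"
psql -c "ALTER SYSTEM RESET fsync"
psql -c "SELECT pg_reload_conf()"
Both settings can be changed with a plain configuration reload, so no server restart is needed; fsync in particular should never stay off on a production server.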
pg_dump is generally recommended to be paired with pg_restore rather than psql. The load can then be split among cores to speed it up by passing the --jobs flag, like so:
$ pg_dump -Fc db > db.Fc.dump
$ pg_restore -d db --jobs=8 db.Fc.dump
Postgres themselves have a guide on bulk loading of data.
I would also recommend heavily tuning your postgresql.conf configuration file and setting appropriately high values for maintenance_work_mem and checkpoint_segments; higher values there may dramatically increase your write performance.
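As a hedged illustration (the values are examples, not recommendations), those two settings live in postgresql.conf; note that checkpoint_segments was removed in PostgreSQL 9.5, where max_wal_size plays the equivalent role:
maintenance_work_mem = 2GB
checkpoint_segments = 64       # pre-9.5 servers only
# max_wal_size = 4GB           # the 9.5+ replacement for checkpoint_segments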