Backing a very large PostgreSQL DB "by parts" - postgresql

I have a PostgreQL DB that is about 6TB. I want to transfer this database to another server using for example pg_dumpall. The problem I have is that I only have a 1TB HD. How can I do to copy this database to the other new server that has enough space? Let's suppose I can not get another HD. Is there the possibility to do partial backup files, upload them to the new server, erase the HD and get another batch of backup files until the transfer is complete?

This works here(proof of concept):
shell-command issued from the receiving side
remote side dumps through the network-connection
local side psql just accepts the commands from this connection
the data is never stored in a physical file
(for brevity, I only sent the table definitions, not the actual data: --schema-only)
you could have some problems with users and tablespaces (these are global for an installation in Postgres) pg_dumpall will dump+restore these, too, IIRC.
#!/bin/bash
remote=10.224.60.103
dbname=myremotedbname
pg_dump -h ${remote} --schema-only -c -C ${dbname} | psql
#eof

As suggested above if you have a fast network connection between source and destination you can do it without any extra disk.
However for a 6 TB DB (which includes indexes I assume) using the archive dump format (-Fc) could yield a database dump of less than 1 TB.
Regarding the "by parts" question: yes, it possible using the table pattern (-t, --table):
pg_dump -t TABLE_NAME ...
You can also exclude tables using -T, --exclude-table:
pg_dump -T TABLE_NAME ...
The above options (-t , -T) can be specified multiple times and can be even combined.
They also support patterns for specifying the tables:
pg_dump -t 'employee_*' ...

Related

Best way to make PostgreSQL backups

I have a site that uses PostgreSQL. All content that I provide in my site is created at a development environment (this happens because it's webcrawler content). The only information created at the production environment is information about the users.
I need to find a good way to update data stored at production. May I restore to production only the tables updated at development environment and PostgreSQL will update this records at production or the best way would be to backup the users information at production, insert them at development and restore the whole database at production?
Thank you
You can use pg_dump to export the data just from the non-user tables in the development environment and pg_restore to bring that into prod.
The -t switch will let you pick specific tables.
pg_dump -d <database_name> -t <table_name>
https://www.postgresql.org/docs/current/static/app-pgdump.html
There are many tips arounds this subject here and here.
I'd suggest you to take a look on these links before everything.
If your data is discarded at each update process then a plain dump will be enough. You can redirect pg_dump output directly to psql connected on production to avoid pg_restore step, something like below:
#Of course you must drop tables to load it again
#so it'll be reasonable to make a full backup before this
pg_dump -Fp -U user -h host_to_dev -T=user your_db | psql -U user -h host_to_production your_db
You might asking yourself "Why he's saying to drop my tables"?
Bulk loading data on a fresh table is faster than deleting old data and inserting again. A quote from the docs:
Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded.
Ps¹: If you can't connect on both environment at same time then you need to do pg_restore manually.
Ps²: I don't recommend it but you can append --clean option on pg_dump to generate DROP statements automatically. Be extreme careful with this option to avoid dropping unnexpected objects.

Where does pgdump do its compression?

Here is the command I am using
pgdump -h localhost -p 54321 -U example_user --format custom
which dumps a database on a remote server which I have connected to with a port forward on port 54321.
I know that the custom format does some compression by default.
Does this compression happen on the database server, or does everything get sent across to my local machine where the compression happens.
Compression is done on the client side so everything gets sent to your computer. What pg_dump does to the database is that it just executes ordinary queries to get the data.
PostgreSQL Documentation: 24.1. SQL Dump:
pg_dump is a regular PostgreSQL client application (albeit a particularly clever one).
PostgreSQL Documentation - II. PostgreSQL Client Applications - pg_dump:
pg_dump internally executes SELECT statements. If you have problems running pg_dump, make sure you are able to select information from the database using, for example, psql.
If you need more information about the inner workings of pg_dump I would suggest asking it from PostgreSQL mailing list or looking at the source code.

mysqldump by query, then use to update remote database

I have a database containing a very large table including binary data which I want to update on a remote machine, once a day. Rather than dumping the entire table, transferring and recreating it on the remote machine, I want to dump only the updates, then use that dump to update the remote machine.
I already understand that I can produce the dump file as such.
mysqldump -u user -p=pass --quick --result-file=dump_file \
--where "Updated >= one_day_ago" \
database_name table_name
1) Does the resulting "restore" on the remote machine
mysql -u user -p=pass database_name < dump_file
only update the necessary rows? Or are there other adverse effects?
2) Is this the best way to do this? (I am unable to pipe to the server directly and using --host option)
If you only dump rows where the WHERE clause is true, then you will only get a .sql file that contains the values you want to update. So you will never get duplicate values if you use the current export options. However, inserting these into a database will not work. You will have to use the commandline parameter --replace, otherwise, if you dump your database and a row with id 6 in table table1 and try to import this into your other database, you'll get an error on duplicates if a row already has id = 6. Using the --replace parameter, it will overwrite older values, which can only happen if there is a new one (according to your WHERE clause).
So to quickly answer:
Yes, this will restore on the remote machine, but only if you saved using --replace (this will restore the latest version of the file you have)
I am not entirely sure if you can pipe backups. According to this website, you can, but I have never tried it before.

How to migrate data to remote server with PostgreSQL?

How can I dump my database schema and data in such a way that the usernames, database names and the schema names of the dumped data matches these variables on the servers I deploy to?
My current process entails moving the data in two steps. First, I dump the schema of the database (pg_dump --schema-only -C -c) then I dump out the data with pg_dump --data-only -C and restore these on the remote server in tandem using the psql command. But there has to be a better way than this.
We use the following to replicate databases.
pg_basebackup -x -P -D /var/lib/pgsql/9.2/data -h OTHER_DB_IP_ADDR -U postgres
It requires the "master" server at OTHER_DB_IP_ADDR to be running the replication service and pg_hba.conf must allow replication connections. You do not have to run the "slave" service as a hot/warm stand by in order to replicate. One downside of this method compared with a dump/restore, the restore operation effectively vacuums and re-indexes and resets EVERYTHING, while the replication doesn't, so replicating can use a bit more disk space if your database has been heavily edited. On the other hand, replicating is MUCH faster (15 mins vs 3 hours in our case) since indexes do not have to be rebuilt.
Some useful references:
http://opensourcedbms.com/dbms/setup-replication-with-postgres-9-2-on-centos-6redhat-el6fedora/
http://www.postgresql.org/docs/9.2/static/high-availability.html
http://www.rassoc.com/gregr/weblog/2013/02/16/zero-to-postgresql-streaming-replication-in-10-mins/

Faster method for copying postgresql databases

I have 12 test databases on a test server and I generate the dump from the main server and copy it to the test server every day. I populate the test databases using:
zcat ${dump_address} | psql $db_name
It takes 45 minutes for each database. Is it faster if I do that for just one database then use:
CREATE DATABASE newdb WITH TEMPLATE olddb;
for the rest? Are there any other methods I could try?
It depends how much data and indexes there is. Some ideas to speed things up include:
Make sure the dump does not INSERT commands - they tend to be much slower than COPY.
If you have many indexes or constraints, use the custom or archive dump format and then call pg_restore -j $my_cpu_count. -j controls how many threads are creating these concurrently, and index creation is often CPU-bound.
Use faster disks.
Copy $PGDATA in its entirety (rsync or some fancy snapshots).