We have a system that stores its data in a PostgreSQL database. In some cases, the database has grown to several GB.
When this system is upgraded, the data in the database is backed up and later restored. Because of the volume of data, index creation during the restore takes a long time (~30 minutes), which delays the upgrade.
Is there a way to split the data copy and the indexing into two steps, so that the data is copied first to complete the upgrade, and the indexing is done later in the background?
Thanks!
There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.
There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.
It would be quite nice to be able to restore everything except secondary indexes not depended on by FK constraints, and then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists; you'd have to write it yourself.
You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.
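As a rough sketch of that scripting, the table of contents printed by `pg_restore -l` can be split into two list files, one without the INDEX entries and one with only them. The function name `split_toc` and the file names mentioned below are made up for illustration, and the parsing assumes the usual `id; catalog-oid object-oid TYPE schema name owner` entry layout. It does not check whether a foreign key depends on a deferred unique index, so treat it as a starting point only.

```python
def split_toc(toc_text):
    """Split `pg_restore -l` output into (main_list, index_list) texts.

    Hypothetical helper: entries whose type is INDEX go to the second
    list, everything else (including comments) stays in the first.
    """
    main, deferred = [], []
    for line in toc_text.splitlines():
        stripped = line.strip()
        # Blank lines and ';' comment lines carry no archive entry.
        if not stripped or stripped.startswith(";"):
            main.append(line)
            continue
        # Assumed layout: "<dump id>; <catalog oid> <object oid> <TYPE> ..."
        _, _, rest = stripped.partition(";")
        fields = rest.split()
        if len(fields) >= 3 and fields[2] == "INDEX":
            deferred.append(line)
        else:
            main.append(line)
    return "\n".join(main) + "\n", "\n".join(deferred) + "\n"
```

The two texts could be written to, say, `main.list` and `indexes.list`, then passed to `pg_restore -L main.list ...` during the upgrade and `pg_restore -L indexes.list ...` afterwards. pg_restore still issues plain CREATE INDEX for the second phase; getting CONCURRENTLY would mean writing that phase out as SQL and editing it.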
pg_dump has an option that separates the data from index creation: --section.
Here pre-data refers to the schema definitions, and post-data refers to indexes and triggers.
From the docs,
--section=sectionname
Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more than once to select multiple sections. The default is to dump all sections.
The data section contains actual table data, large-object contents, and sequence values. Post-data items include definitions of indexes, triggers, rules, and constraints other than validated check constraints. Pre-data items include all other data definition items.
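A minimal sketch of how the sections might be used for a two-phase restore, assuming a custom-format archive and that pg_restore's own --section option is used on the restore side. The function name, database names and the `db.dump` file name are placeholders, not anything from the question.

```python
import subprocess


def sectioned_commands(source_db, target_db, archive="db.dump"):
    """Build the argv lists for a dump plus a two-phase restore."""
    # One custom-format archive holds all three sections.
    dump = ["pg_dump", "-Fc", "-f", archive, source_db]
    # Phase 1: table definitions and data only, so the upgrade can finish.
    restore_data = ["pg_restore", "--section=pre-data", "--section=data",
                    "-d", target_db, archive]
    # Phase 2: indexes, triggers and the remaining constraints.
    restore_post = ["pg_restore", "--section=post-data",
                    "-d", target_db, archive]
    return dump, restore_data, restore_post


def run(cmd):
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

Keep in mind that post-data also contains primary keys and foreign keys, so until the second phase has run the database has data but no indexes or those constraints.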
Maybe this helps :)
Related
I'm trying to move postgresql between two servers. There's rsync connectivity between the two servers.
My tables are large, around 200GB in total with nearly 800 million rows across 15 tables. For this volume of data, I found that COPY command for the key tables was far faster than the usual pg_dump. However, this only dumps the data.
Is there a way to dump only data this way, but also then dump the database creation script -- which will create the tables, and separately indexes? I'm thinking of the following sequence:
1. COPY all tables into the file system. Just 15 files, therefore.
2. RSYNC these files to the new server.
3. On the new server, create a fresh PG database: tables, foreign keys etc. But no indexes yet.
4. In this fresh PG database, COPY FROM all the tables, one by one. Slightly painful but worth it.
5. Then create the indexes, all in one go.
I'm seeing ways to get some scripts for #3 and #5 dumped by PG on the older server. The complication in the PG world is the OIDs for tables etc. Will this affect the tables and data on the new server? The pg_dump reference is a bit cryptic in its help material.
For #3, just the creation of the "schema" and tables, I could do this:
pg_dump --schema-only mybigdb
Will this carry all the OIDs and other complications, thereby being a good way to complete step #3?
And for only #5, not sure what I'd do. Just the indexes etc. Will I have to look inside the "schema only" file and separate out the indexes?
Appreciate any pointers.
Funny, the sequence you are describing is a pretty good description of what pg_dump/pg_restore does (with some oversights: e.g., for performance reasons, you wouldn't define a foreign key before you restore the data).
So I think that you should use pg_dump instead of reinventing the wheel.
You can get better performance out of pg_dump as follows:
Use the directory format (-Fd) and parallelize the COPY commands with -j number-of-jobs.
Restore the dump with pg_restore and use -j number-of-jobs for several parallel workers for data restore and index creation.
The only drawback is that, with the directory format, you have to wait for pg_dump to finish before you can start pg_restore. If that is a killer, you could use the custom format (-Fc) and pipe the result into pg_restore. That rules out -j on both sides, though: pg_dump only runs parallel jobs with the directory format, and pg_restore -j needs a regular file or directory as input, not a pipe.
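The directory-format variant could be scripted roughly like this. It is a sketch only: the function name, the database names, the dump directory and the job count are placeholders, and it assumes the target database already exists on the new server.

```python
import subprocess


def parallel_dump_restore(source_db, target_db, dump_dir, jobs=4):
    """Build argv lists for a parallel directory-format dump and restore."""
    # -Fd writes one file per table, which lets -j dump tables in parallel.
    dump = ["pg_dump", "-Fd", "-j", str(jobs), "-f", dump_dir, source_db]
    # pg_restore -j parallelizes both data loading and index creation.
    restore = ["pg_restore", "-j", str(jobs), "-d", target_db, dump_dir]
    return dump, restore


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Between the two commands the dump directory would be copied to the new server, for example with the rsync connection mentioned in the question.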
Reindex or backup/restore to optimize database? Do indexes rebuild while restoring db from backup?
If practical, a full backup and restore is always better than a plain reindex, simply because you also get an extra backup file.
The restore process will (1) create tables, then (2) copy data in and finally (3) create indexes, apply constraints etc.
This is not the same as using CLUSTER of course, which physically re-orders a table based on one of its indexes. In some cases that can be useful.
If you are going to do this though, make sure you have good measurements before and after your "optimization" because many factors affect overall database performance and this may prove pointless.
I'm trying to migrate our database engine from MsSql to PostgreSQL. In our automated tests, we restore the database to a "clean" state at the start of every test. We do this by computing the "diff" between the working copy of the database and the clean copy (table by table), then copying over any records that have changed and deleting any records that have been added. So far this strategy seems to be the best approach for us because each test changes little data and the database is not very big.
Now I'm looking for a way to do essentially the same thing with PostgreSQL. Before doing so, I was wondering if anyone else has done something similar and what method you used to restore data in your automated tests.
On a side note - I considered using MsSql's snapshot or backup/restore strategy. The main problem with these methods is that I have to re-establish the db connection from the app after every test, which is not possible at the moment.
If you're okay with some extra storage, and if you (like me) are particularly not interested in re-inventing the wheel in terms of checking for diffs via your own code, you should try creating a new DB (per run) via templates feature of createdb command (or CREATE DATABASE statement) in PostgreSQL.
For example:
(from bash) createdb todayDB -T snapshotDB
or
(from psql) CREATE DATABASE todayDB TEMPLATE snapshotDB;
Pros:
In theory, always exact same DB by design (no custom logic)
Replication is a file-transfer (not DB restore). So far less time taken (i.e. doesn't run SQL again, doesn't recreate indexes / restore tables etc.)
Cons:
Takes 2x the disk space (although template could be on a low performance NFS etc)
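For a per-run reset, the statements might be generated along these lines. This is a sketch with a made-up helper name. It assumes the statements are sent in autocommit mode from a connection to some other database (such as postgres), and that nobody is connected to either the template or the database being dropped, since PostgreSQL refuses both operations otherwise.

```python
def reset_from_template(work_db, template_db):
    """Return the SQL statements that recreate work_db from template_db."""

    def quote_ident(name):
        # Double-quote the identifier so mixed-case names survive as typed.
        return '"' + name.replace('"', '""') + '"'

    return [
        f"DROP DATABASE IF EXISTS {quote_ident(work_db)}",
        f"CREATE DATABASE {quote_ident(work_db)} TEMPLATE {quote_ident(template_db)}",
    ]
```

Each statement has to be executed on its own, because DROP DATABASE and CREATE DATABASE cannot run inside a transaction block.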
For my specific situation, I decided to go back to the original solution, which is to compare the "working" copy of the database with the "clean" copy.
There are 3 types of changes (INSERT, UPDATE and DELETE), handled in two steps:
For INSERT records - find max(id) from clean table and delete any record on working table that has higher ID
For UPDATE or DELETE records - find all records in clean table EXCEPT records found in working table. Then UPSERT those records into working table.
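The two steps above could be expressed as generated SQL roughly as follows. Several things here are assumptions for illustration and not from my actual setup: the clean and working copies are reachable in one session as schemas named `clean` and `working`, every table has a unique numeric key column and at least one other column, identifiers are plain lowercase names, and the server supports ON CONFLICT (PostgreSQL 9.5 or later).

```python
def restore_statements(table, columns, key="id",
                       clean="clean", working="working"):
    """Return [delete_added, upsert_changed] SQL for one table."""
    cols = ", ".join(columns)
    # Overwrite every non-key column with the value from the clean copy.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    # Step 1: rows inserted by the test have ids above the clean maximum.
    # (An empty clean table would need separate handling.)
    delete_added = (
        f"DELETE FROM {working}.{table} "
        f"WHERE {key} > (SELECT MAX({key}) FROM {clean}.{table})"
    )
    # Step 2: clean rows that are missing or different in the working copy
    # are put back, covering both DELETEs and UPDATEs made by the test.
    upsert_changed = (
        f"INSERT INTO {working}.{table} ({cols}) "
        f"SELECT {cols} FROM {clean}.{table} "
        f"EXCEPT SELECT {cols} FROM {working}.{table} "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )
    return [delete_added, upsert_changed]
```

Running the two statements per table in one transaction keeps the existing connection open, which was the reason for not using backup/restore in the first place.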
I'm copying several tables (~1.5M records) from one data source to another, but it is taking a long time. I'm looking to speed up my use of DBD::Pg.
I'm currently using pg_getcopydata/pg_putcopydata, but I suppose that the indexes on the destination tables are slowing the process down.
I found that I can find some information on table's indexes using $dbh->statistics_info, but I'm curious if anyone has a programmatic way to dynamically drop/recreate indexes based on this information.
The programmatic way, I guess, is to submit the appropriate CREATE INDEX SQL statements via DBI that you would enter into psql.
Sometimes when copying a large table it's better to do it in this order:
create table without indexes
copy data
add indexes
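One possible way to make that dynamic is to read the index definitions from the pg_indexes system view before the load, drop the indexes, and replay the saved definitions afterwards. The sketch below only builds the statements. The helper names are made up, the `%s` placeholders follow the Python driver convention (DBI would use `?`), and indexes that back PRIMARY KEY or UNIQUE constraints are not handled, since those cannot be removed with DROP INDEX.

```python
# Fetch the definitions first; indexdef holds a ready-to-run CREATE INDEX.
INDEX_QUERY = (
    "SELECT schemaname, indexname, indexdef FROM pg_indexes "
    "WHERE schemaname = %s AND tablename = %s"
)


def quote_ident(name):
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def drop_and_recreate(rows):
    """Turn (schemaname, indexname, indexdef) rows into two statement lists.

    The first list is run before the bulk copy, the second after it.
    """
    drops = [
        f"DROP INDEX {quote_ident(schema)}.{quote_ident(name)}"
        for schema, name, _ in rows
    ]
    creates = [indexdef for _, _, indexdef in rows]
    return drops, creates
```

The saved definitions should be kept somewhere durable until they have been replayed, otherwise a failed copy leaves the destination table without its indexes.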
I've got a Postgres 9.0 database that I frequently take data dumps of.
This database has a lot of indexes, and every time I restore a dump postgres starts a background task, the vacuum cleaner (is that right?). That task consumes a lot of processing time and memory to recreate the indexes of the restored dump.
My question is:
Is there a way to dump the database data and the indexes of that database?
If there is a way, will it be worth the effort (I mean, will dumping the data together with the indexes perform better than the vacuum cleaner)?
Oracle has the "data pump" command, a faster way to imp and exp. Does postgres have something similar?
Thanks in advance,
Andre
If you use pg_dump twice, once with --schema-only, and once with --data-only, you can cut the schema-only output in two parts: the first with the bare table definitions and the final part with the constraints and indexes.
Something similar can probably be done with pg_restore.
Best Practice is probably to
restore the schema without indexes
and possibly without constraints,
load the data,
then create the constraints,
and create the indexes.
If an index exists, a bulk load will make PostgreSQL write to the database and to the index. And a bulk load will make your table statistics useless. But if you load data first, then create the index, the stats are automatically up to date.
We store scripts that create indexes and scripts that create tables in different files under version control. This is why.
In your case, changing autovacuum settings might help you. You might also consider disabling autovacuum for some tables or for all tables, but that might be a little extreme.