I'm trying to move postgresql between two servers. There's rsync connectivity between the two servers.
My tables are large, around 200GB in total with nearly 800 million rows across 15 tables. For this volume of data, I found that COPY command for the key tables was far faster than the usual pg_dump. However, this only dumps the data.
Is there a way to dump only data this way, but also then dump the database creation script -- which will create the tables, and separately indexes? I'm thinking of the following sequence:
COPY all tables into file system. Just 15 files, therefore.
RSYNC these files to the new server.
On the new server, Create a fresh PG database: tables, foreign keys etc. But no indexes yet.
In this fresh PG database, COPY FROM all the tables, one by one. Slightly painful but worth it.
Then create the indexes, all in one go.
I'm seeing ways to get some scripts for #3 and #5 dumped by PG on the older server. The complication in the PG world is the OIDs for tables etc. Will this affect the tables and data on the new server? The pg_dump reference is a bit cryptic in its help material.
For #3, jsut the creation of the "schema" and tables, I could do this:
pg_dump --schema-only mybigdb
Will this carry all the OIDs and other complications, thereby being a good way to complete step #3?
And for only #5, not sure what I'd do. Just the indexes etc. Will I have to look inside the "schema only" file and separate out the indexes?
Appreciate any pointers.
Funny, the sequence you are describing is a pretty good description of what pg_dump/pg_restore does (with some oversights: e.g., for performance reasons, you wouldn't define a foreign key before you restore the data).
So I think that you should use pg_dump instead of reinventing the wheel.
You can get better performance out of pg_dump as follows:
Use the directory format (-Fd) and parallelize the COPY commands with -j number-of-jobs.
Restore the dump with pg_restore and use -j number-of-jobs for several parallel workers for data restore and index creation.
The only drawback is that you have to wait for pg_dump to finish before you can start pg_restore if you use the directory format. If that is a killer, you could use the custom format (-Fc) and pipe the result into pg_restore. That won't allow you to use -j with pg_dump, but you can still parallelize index creation and such with pg_restore -j.
Related
We have a system which stores data in a postgres database. In some cases, the size of the database has grown to several GBs.
When this system is upgraded, the data in the said database is backed up, and finally it's restored in the database. Owing to the huge amounts of data, the indexing takes a long time to complete (~30 minutes) during restoration, thereby delaying the upgrade process.
Is there a way where the data copy and indexing can be split into two steps, where the data is copied first to complete the upgrade, followed by indexing which can be done at a later time in the background?
Thanks!
There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.
There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.
It would be quite nice to be able to restore everything except secondary indexes not depended on by FK constraints. Then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists, you'd have to write it yourself.
You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.
There is an option to separate the data and creating index in postgresql while taking pg_dump.
Here pre-data refers to Schema, post-data refers to index and triggers.
From the docs,
--section=sectionname Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more
than once to select multiple sections. The default is to dump all
sections.
The data section contains actual table data, large-object contents,
and sequence values. Post-data items include definitions of indexes,
triggers, rules, and constraints other than validated check
constraints. Pre-data items include all other data definition items.
May be this would help :)
When using
pg_dump --section post-data
I get a dump that contains the definitions of the indexes but not the index data. As the database I'm working on is really big and complex, recreating the indexes takes a lot of time.
Is there a way get the index data into my dump, so that I can restore an actually working database?
There is no way to include index data in a logical dump (pg_dump).
There's no way to extract index data from the SQL level, where pg_dump operates, nor any way to write it back. Indexes refer to the physical structure of the table (tuple IDs by page and offset) in a way that isn't preserved across a dump and reload anyway.
You can use a low-level disk copy using pg_basebackup if you want to copy the whole DB, indexes and all. Unlike a pg_dump you can't restore this to a different PostgreSQL version, you can't dump just one database, etc; it's all or nothing.
Installed postgres 9.1 in both the machine.
Initially the DB size is 7052 MB then i used the following command for copy to another server.
pg_dump -C dbname | bzip2 | ssh remoteuser#remotehost "bunzip2 | psql dbname"
After successfully copies, In destination machine i check size it shows 6653 MB.
then i checked for table count its same.
Has there been data loss? Is there missing data?
Note:
Two machines have same hardware and software configuration.
i used:
SELECT pg_size_pretty(pg_database_size('dbname'));
One of the PostgreSQL's most sophisticated features is so called Multi-Version Concurrency Control (MVCC), a standard technique for avoiding conflicts between reads and writes of the same object in database. MVCC guarantees that each transaction sees a consistent view of the database by reading non-current data for objects modified by concurrent transactions. Thanks to MVCC, PostgreSQL has great scalability, a robust hot backup tool and many other nice features comparable to the most advanced commercial databases.
Unfortunately, there is one downside to MVCC, the databases tend to grow over time and sometimes it can be a problem. In recent versions of PostgreSQL there is a separate server process called the autovacuum daemon (pg_autovacuum), whose purpose is to keep the database size reasonable. It does that by trying to recover reusable chunks of the database files. Still, there are many scenarios that will force the database to grow, even if the amount of the useful data in it doesn't really change. That happens typically if you have lots of UPDATE and/or DELETE statements in the applications that are using the database.
When you do a COPY, you recover extraneous space and so your copied DB appears smaller.
That looks normal. Databases are often smaller after restore, because a newly created b-tree index is more compact than one that's been progressively built by inserts. Additionally, UPDATEs and DELETEs leave empty space in the tables.
So you have nothing to worry about. You'll find that if you diff an SQL dump from the old DB and a dump taken from the just-restored DB, they'll be the same except for comments.
I've got a Postgres 9.0 database which frequently I took data dumps of it.
This database has a lot of indexes and everytime I restore a dump postgres starts background task vacuum cleaner (is that right?). That task consumes much processing time and memory to recreate indexes of the restored dump.
My question is:
Is there a way to dump the database data and the indexes of that database?
If there is a way, will worth the effort (I meant dumping the data with the indexes will perform better than vacuum cleaner)?
Oracle has some the "data pump" command a faster way to imp and exp. Does postgres have something similar?
Thanks in advance,
Andre
If you use pg_dump twice, once with --schema-only, and once with --data-only, you can cut the schema-only output in two parts: the first with the bare table definitions and the final part with the constraints and indexes.
Something similar can probably be done with pg_restore.
Best Practice is probably to
restore the schema without indexes
and possibly without constraints,
load the data,
then create the constraints,
and create the indexes.
If an index exists, a bulk load will make PostgreSQL write to the database and to the index. And a bulk load will make your table statistics useless. But if you load data first, then create the index, the stats are automatically up to date.
We store scripts that create indexes and scripts that create tables in different files under version control. This is why.
In your case, changing autovacuum settings might help you. You might also consider disabling autovacuum for some tables or for all tables, but that might be a little extreme.
When I make a backup in postgres 8 it only backs up the schemas and data, but not the indexes. How can i do this?
Sounds like you're making a backup using the pg_dump utility. That saves the information needed to recreate the database from scratch. You don't need to dump the information in the indexes for that to work. You have the schema, and the schema includes the index definitions. If you load this backup, the indexes will be rebuilt from the data, the same way they were created in the first place: built as new rows are added.
If you want to do a physical backup of the database blocks on disk, which will include the indexes, you need to do a PITR backup instead. That's a much more complicated procedure, but the resulting backup will be instantly usable. The pg_dump style backups can take quite some time to restore.
If I understand you correctly, you want a dump of the indexes as well as the original table data.
pg_dump will output CREATE INDEX statements at the end of the dump, which will recreate the indexes in the new database.
You can do a PITR backup as suggested by Greg Smith, or stop the database and just copy the binaries.