Postgres: pg_dump gives different create statement than the one generated by PgAdmin

Is it common to get differing create statements from pg_dump vs. PgAdmin?

pg_dump creates a bare table, then uses separate commands later in the dump to create the indexes, constraints, and so on, even when those could have been included in the CREATE TABLE statement. Bulk loading is more efficient before those objects exist, and this ordering also avoids dependency-order problems.
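A quick way to see this ordering for yourself (the table and database names are placeholders):
# Schema-only dump of a hypothetical table: the bare CREATE TABLE appears first,
# with CREATE INDEX / ALTER TABLE ... ADD CONSTRAINT statements emitted later in the file
pg_dump --schema-only --table=mytable mydb | grep -E '^(CREATE|ALTER)'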

Related

Delayed indexing in postgres

We have a system which stores data in a postgres database. In some cases, the size of the database has grown to several GBs.
When this system is upgraded, the data in that database is backed up and then restored into the upgraded database. Owing to the huge amount of data, indexing takes a long time to complete (~30 minutes) during restoration, thereby delaying the upgrade process.
Is there a way where the data copy and indexing can be split into two steps, where the data is copied first to complete the upgrade, followed by indexing which can be done at a later time in the background?
Thanks!
There's no built-in way to do it with pg_dump and pg_restore. But pg_restore's -j option helps a lot.
There is CREATE INDEX CONCURRENTLY. But pg_restore doesn't use it.
It would be quite nice to be able to restore everything except secondary indexes not depended on by FK constraints, then restore those as a separate phase using CREATE INDEX CONCURRENTLY. But no such support currently exists; you'd have to write it yourself.
You can, however, filter the table-of-contents used by pg_restore, so you could possibly do some hacky scripting to do the needed work.
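As a rough sketch of that scripting, assuming a custom-format dump file mydb.dump and a target database mydb:
# List the table of contents, drop the INDEX entries, and restore only the rest
pg_restore -l mydb.dump > full.list
grep -v ' INDEX ' full.list > no_indexes.list
pg_restore -j 4 -L no_indexes.list -d mydb mydb.dump
# The skipped indexes then have to be created by your own script afterwards,
# e.g. with CREATE INDEX CONCURRENTLY, once the system is back up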
There is an option in pg_dump to separate the data from the index creation while taking the dump.
Here pre-data refers to the schema, and post-data refers to indexes and triggers.
From the docs,
--section=sectionname
Only dump the named section. The section name can be pre-data, data, or post-data. This option can be specified more than once to select multiple sections. The default is to dump all sections.
The data section contains actual table data, large-object contents, and sequence values. Post-data items include definitions of indexes, triggers, rules, and constraints other than validated check constraints. Pre-data items include all other data definition items.
Maybe this will help :)
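A sketch of dumping and restoring those sections in separate phases (the database names are assumptions):
# Dump schema, data, and indexes/triggers/constraints into separate files
pg_dump --section=pre-data mybigdb > pre.sql
pg_dump --section=data mybigdb > data.sql
pg_dump --section=post-data mybigdb > post.sql
# Restore the first two to get the upgraded system running...
psql -d mybigdb_new -f pre.sql
psql -d mybigdb_new -f data.sql
# ...and build the indexes, triggers and constraints later, in the background
psql -d mybigdb_new -f post.sql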

SQLite to PostgreSQL data-only transfer (to maintain alembic functionality)

There are a few questions and answers already on PostgreSQL import (as well as the specific SQLite->PostgreSQL situation). This question is about a specific corner-case.
Background
I have an existing, in-production web-app written in python (pyramid) and using alembic for easy schema migration. Due to the database creaking with unexpectedly high write-load (probably due to the convoluted nature of my own code), I've decided to migrate to PostgreSQL.
Data migration
There are a few recommendations on data migration. The simplest one involved using
sqlite3 my.db .dump > sqlitedumpfile.sql
and then importing it with
psql -d newpostgresdb < sqlitedumpfile.sql
This required a bit of editing of sqlitedumpfile. In particular, removing some incompatible operations, changing values (sqlite represents booleans as 0/1) etc. It ended up being too complicated to do programmatically for my data, and too much work to handle manually (some tables had 20k rows or so).
A good tool for data migration, which I eventually settled on, was pgloader, which 'worked' immediately. However, as is typical for data migration of this sort, it exposed various data inconsistencies in my database that I had to solve at the source before migrating (in particular, removing foreign keys to non-unique columns, which had seemed a good idea at the time for convenient joins, and removing orphan rows that referenced rows in other tables which had since been deleted). After these were solved, I could just do
pgloader my.db postgresql:///newpostgresdb
And get all my data appropriately.
The problem?
pgloader worked really well for data but not so well for the table structure itself. This resulted in three problems:-
I had to create a new alembic revision with a ton of changes (mostly datatype related, but also some related to problem 2).
Constraint/index names were unreliable (unique numeric names were generated; there's actually an option to disable this). This was a problem because I needed a reliable upgrade path that was replicable in production without me having to manually tweak the alembic code.
Sequences/autoincrement just failed for most primary keys. This broke my webapp, as I was not able to add new rows to some (not all) tables.
In contrast, re-creating a blank database using alembic to maintain the schema works well without changing any of my webapp's code. However, pgloader defaults to overriding existing tables, so this would leave me nowhere, as the data is what really needs migrating.
How do I get proper data migration using a schema I've already defined (and which works)?
What eventually worked was, in summary:-
Create the appropriate database structure in postgresql:///newpostgresdb (I just used alembic upgrade head for this)
Use pgloader to move data over from sqlite to a different database in postgresql. As mentioned in the question, some data inconsistencies need to be solved before this step, but that's not relevant to this question itself.
createdb tempdb
pgloader my.db postgresql:///tempdb
Dump the data in tempdb using pg_dump
pg_dump -a -d tempdb > dumped_postgres_database
Edit the resulting dump to accomplish the following:-
Add SET session_replication_role = replica; because some of my rows are circular, referencing other rows in the same table
Delete the alembic_version table data, as we're starting a new branch for alembic.
Regenerate any sequences, with the equivalent of SELECT pg_catalog.setval('"table_colname_seq"', (select max(colname) from table));
Finally, psql can be used to load the data to your actual database
psql -d newpostgresdb < dumped_postgres_database
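Putting the whole sequence together as a sketch, using the names from above:
# 1. Build the target schema from the existing migrations
alembic upgrade head
# 2. Load the SQLite data into a throwaway database
createdb tempdb
pgloader my.db postgresql:///tempdb
# 3. Dump data only from the throwaway database
pg_dump -a -d tempdb > dumped_postgres_database
# 4. Hand-edit dumped_postgres_database:
#    - add "SET session_replication_role = replica;" near the top
#    - remove the COPY block for alembic_version
#    - append setval() calls, e.g.
#      SELECT pg_catalog.setval('"table_colname_seq"', (SELECT max(colname) FROM table));
# 5. Load the edited dump into the real database
psql -d newpostgresdb < dumped_postgres_database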

Moving Postgresql - download all data and table creation, but without indexes

I'm trying to move postgresql between two servers. There's rsync connectivity between the two servers.
My tables are large, around 200GB in total with nearly 800 million rows across 15 tables. For this volume of data, I found that the COPY command for the key tables was far faster than the usual pg_dump. However, this only dumps the data.
Is there a way to dump only the data this way, but then also dump the database creation script -- one part that creates the tables and, separately, the indexes? I'm thinking of the following sequence:
COPY all tables into file system. Just 15 files, therefore.
RSYNC these files to the new server.
On the new server, create a fresh PG database: tables, foreign keys, etc. But no indexes yet.
In this fresh PG database, COPY FROM all the tables, one by one. Slightly painful but worth it.
Then create the indexes, all in one go.
I'm seeking ways to get the scripts for #3 and #5 dumped by PG on the older server. The complication in the PG world is the OIDs for tables etc. Will this affect the tables and data on the new server? The pg_dump reference is a bit cryptic in its help material.
For #3, just the creation of the "schema" and tables, I could do this:
pg_dump --schema-only mybigdb
Will this carry all the OIDs and other complications, thereby being a good way to complete step #3?
And for #5 only, I'm not sure what I'd do. Just the indexes etc. Will I have to look inside the "schema only" file and separate out the indexes?
Appreciate any pointers.
Funny, the sequence you are describing is a pretty good description of what pg_dump/pg_restore does (with some oversights: e.g., for performance reasons, you wouldn't define a foreign key before you restore the data).
So I think that you should use pg_dump instead of reinventing the wheel.
You can get better performance out of pg_dump as follows:
Use the directory format (-Fd) and parallelize the COPY commands with -j number-of-jobs.
Restore the dump with pg_restore and use -j number-of-jobs for several parallel workers for data restore and index creation.
The only drawback is that you have to wait for pg_dump to finish before you can start pg_restore if you use the directory format. If that is a killer, you could use the custom format (-Fc) and pipe the result into pg_restore so the restore starts while the dump is still running. That won't allow you to use -j with pg_dump, and because parallel restore needs a seekable file rather than a pipe, you also can't use -j with pg_restore in that case; writing the custom-format dump to a file first keeps pg_restore -j (and its parallel index creation) available.
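A sketch of that workflow, with the paths and job count as assumptions:
# On the old server: parallel, directory-format dump (4 jobs is just an example)
pg_dump -Fd -j 4 -f /tmp/mybigdb.dump mybigdb
# Ship the dump directory over the existing rsync link
rsync -a /tmp/mybigdb.dump/ newserver:/tmp/mybigdb.dump/
# On the new server: parallel restore, which also parallelizes index creation
createdb mybigdb
pg_restore -j 4 -d mybigdb /tmp/mybigdb.dump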

How can I determine which stored procedures use a certain table in their queries?

I am doing some maintenance work for a postgres database, and one particular table has two status columns that I want to convert into one bitmask column. This database uses a lot of different stored procedures for data manipulation that are sadly not covered by unit tests.
Is there a way to tell whether this table is being used inside queries by any stored procedures? Well, aside from manually going through each procedure's body using the \df+ switch inside psql.
Sadly, there isn't. The body of a plpgsql function is just a string that is saved and executed upon call. When you create a function, only superficial syntax checks are run on it.
Sometimes this is a blessing. Other times, it's a curse.
What I do in such a case: dump the schema and search the dump with vim (or grep or the tool of your choice).
pg_dump $DB -p $PORT -s -f filename.pgsql
If all your functions reside in a particular schema (same word, different meaning!), add: -n $SCHEMA
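For example, assuming the table is called mytable; the second command is an alternative (not from the answer above) that searches the function source stored in the pg_proc catalog directly:
# Search the schema dump for the table name from the command line
grep -n -i 'mytable' filename.pgsql
# Alternative: search function bodies in the catalog
psql $DB -p $PORT -c "SELECT proname FROM pg_proc WHERE prosrc ILIKE '%mytable%'"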

libpq code to create, list and delete databases (C++/VC++, PostgreSQL)

I am new to the PostgreSQL database. What my Visual C++ application needs to do is create multiple tables and add/retrieve data from them.
Each session of my application should create a new and distinct database. I can use the current date and time for a unique database name.
There should also be an option to delete all the databases.
I have worked out how to connect to a database, create tables, and add data to tables. I am not sure how to make a new database for each run, or how to retrieve the number and names of the databases if the user wants to clear all of them.
Please help.
See the libpq examples in the documentation. The example program shows you how to list databases, and in general how to execute commands against the database. The example code there is trivial to adapt to creating and dropping databases.
Creating a database is a simple CREATE DATABASE SQL statement, same as any other libpq operation. You must be connected to some existing database (usually template1) to issue the CREATE DATABASE, then disconnect and make a new connection to the database you just created.
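For reference, these are the statements such libpq code would send (e.g. via PQexec), shown here through psql with a made-up per-session name:
# Create a per-session database while connected to template1
psql -d template1 -c "CREATE DATABASE session_20240101_1200"
# Connect to the new database and create tables in it
psql -d session_20240101_1200 -c "CREATE TABLE results(id serial PRIMARY KEY, value text)"
# List the per-session databases (e.g. for a "clear all" option)
psql -d template1 -c "SELECT datname FROM pg_database WHERE datname LIKE 'session_%'"
# Drop one; this cannot be run from a connection to that same database
psql -d template1 -c "DROP DATABASE session_20240101_1200"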
Rather than creating new databases, consider creating new schemas instead. Much less hassle: all you need to do is change the search_path or prefix your table references, and you don't have to disconnect and reconnect to change schemas. See the documentation on schemas.
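A sketch of the schema-per-session alternative, again with made-up names:
# One schema per session, all inside a single database
psql -d myappdb -c "CREATE SCHEMA session_20240101_1200"
psql -d myappdb -c "SET search_path TO session_20240101_1200; CREATE TABLE results(id serial PRIMARY KEY, value text)"
# Dropping the schema removes everything that session created
psql -d myappdb -c "DROP SCHEMA session_20240101_1200 CASCADE"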
I question the wisdom of your design, though. It is rarely a good idea for applications to be creating and dropping databases (or tables, except temporary tables) as a normal part of their operation. Maybe if you elaborated on why you want to do this, we can come up with solutions that may be easier and/or perform better than your current approach.