How to hash an entire table in postgres? - postgresql

I'd like to get a hash for data in an entire table. I need to compare two databases after migration to validate that the data migration was successful. Is it possible to reliably and reproducibly generate a hash for an entire table in a database?

You can do this from the command line (replacing of course my_database and my_table):
psql my_database -c 'copy my_table to stdout' |sha1sum
A plain COPY returns rows in no guaranteed order, so for a reproducible hash use a query with an explicit ORDER BY. A query also lets you limit the columns:
psql my_database -c 'copy (select * from my_table order by my_id_column) to stdout' |sha1sum
Note that this hashes only the column data: no schema information, constraints, indexes, metadata, permissions, etc.
Note also that sha1sum is an arbitrary choice of hashing program; you can pipe the output to any program that generates a hash. Some common alternatives are sha256sum and md5sum.
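For the migration check described in the question, the same pipeline can be wrapped in a small function and run against both databases. This is only a sketch: table_hash, tables_match, old_db, new_db and my_id_column are placeholder names, and it assumes both servers render values identically in COPY text format (same column order, same settings such as DateStyle).

```shell
# table_hash DB TABLE ORDER_COLUMN
# Prints the SHA-256 of the table's rows in COPY text format, ordered by
# ORDER_COLUMN so the result does not depend on physical row order.
# Hypothetical helper: arguments are pasted into the SQL unescaped,
# so only pass trusted names.
table_hash() {
  psql "$1" -c "copy (select * from $2 order by $3) to stdout" |
    sha256sum | cut -d' ' -f1
}

# tables_match OLD_DB NEW_DB TABLE ORDER_COLUMN
# Exit status 0 only if both databases produce the same hash.
tables_match() {
  [ "$(table_hash "$1" "$3" "$4")" = "$(table_hash "$2" "$3" "$4")" ]
}

# Usage (placeholder names):
#   tables_match old_db new_db my_table my_id_column && echo "my_table matches"
```

If psql fails on both sides, both hashes are the hash of empty input and will compare equal, so real use should also check that psql succeeded.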

Related

postgres: dump partial tables between databases while maintaining sequences

I want to copy select bits of several tables from one database to another while maintaining the sequences and schema. I first dump the schema using pg_dump -s but when it comes to copying the data I'm a little at a loss. Here's what I've tried so far:
pg_dump -t <table1> gives me sequences but includes the whole table
copy (SELECT bits from table1) gives me partial tables but doesn't keep the sequences up to date.
How can I keep my sequences up to date while only dumping parts of the tables?
Nothing built-in will dump part of a table for you, so doing a -s/--schema-only dump and writing your own COPY statements is the way to go.
As noted in the pg_dump docs, the -t/--table option will also take a sequence name. You can combine this with the -a/--data-only flag to output just the sequence's setval(...) command:
pg_dump --data-only -t <sequence_name>
Of course, if your sequences are associated with a SERIAL column, you usually don't know (or care) exactly what they're called. In that case, you can (probably) rely on the default <table>_<column>_seq naming convention to dump them all at once:
pg_dump --data-only -t '*_seq'
If you have non-standard sequence names, or if you're unfortunate enough to have a table name which ends in _seq, you might need to generate the sequence list programmatically. In bash, something like this would probably do it:
pg_dump --data-only -t $(psql -tAc "SELECT string_agg(oid::regclass::text, ' -t ') FROM pg_class WHERE relkind = 'S'")
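Putting the pieces together, the data step might look roughly like the sketch below: one function pipes a filtered COPY from the source database into the target, and another replays the sequence values afterwards. The function names, database names and the WHERE clause are all hypothetical, and the sketch assumes the schema has already been restored on the target and that the sequences follow the default *_seq naming.

```shell
# copy_partial SRC_DB DST_DB TABLE WHERE_CLAUSE
# Streams only the matching rows of TABLE from SRC_DB into DST_DB.
# Arguments are pasted into the SQL unescaped, so only pass trusted values.
copy_partial() {
  psql -d "$1" -c "copy (select * from $3 where $4) to stdout" |
    psql -d "$2" -c "copy $3 from stdin"
}

# sync_sequences SRC_DB DST_DB
# Dumps the setval(...) calls for every sequence matching *_seq in
# SRC_DB and replays them in DST_DB.
sync_sequences() {
  pg_dump --data-only -t '*_seq' "$1" | psql -d "$2"
}

# Usage (placeholder names):
#   copy_partial sourcedb targetdb table1 "created_at > '2020-01-01'"
#   sync_sequences sourcedb targetdb
```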

pg_dump for all metadata and only table data of selected tables

I want to create a script that will dump the whole schema and the data of only a few tables and write it to one file.
Use the --exclude-table-data option of pg_dump to define the tables whose data should be excluded from the dump.
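Since --exclude-table-data names the tables to leave out rather than the ones to keep, one way to script it is to list every table and emit a flag for each one that is not on a keep list. The sketch below assumes the tables live in the public schema; exclude_data_flags, my_database, dump.sql and the table names a and so are placeholders.

```shell
# exclude_data_flags KEEP...
# Reads table names (one per line) on stdin and prints an
# --exclude-table-data flag for every table NOT named in KEEP.
exclude_data_flags() {
  while IFS= read -r tbl; do
    [ -n "$tbl" ] || continue
    keep=no
    for k in "$@"; do
      [ "$tbl" = "$k" ] && keep=yes
    done
    [ "$keep" = yes ] || printf '%s\n' "--exclude-table-data=$tbl"
  done
}

# Usage (placeholder names): full schema, data only for tables a and so.
#   pg_dump $(psql -tAc "SELECT tablename FROM pg_tables WHERE schemaname = 'public'" my_database \
#             | exclude_data_flags a so) my_database > dump.sql
```

Table names containing spaces or other shell-special characters would need extra quoting that this sketch does not attempt.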
Multiple -t switches list the tables you want to back up, e.g.:
MacBook-Air:~ vao$ pg_dump -d t -t pg_database -t a -t so | grep 'CREATE TABLE'
CREATE TABLE pg_database (
CREATE TABLE a (
CREATE TABLE so (
This backs up the structure and data of the three tables listed. The grep is only there to hide the other rows while still giving an idea of the dump's contents.
https://www.postgresql.org/docs/current/static/app-pgdump.html
-t table
--table=table
Dump only tables with names matching table. For this purpose, “table”
includes views, materialized views, sequences, and foreign tables.
Multiple tables can be selected by writing multiple -t switches.

What is the quickest way to duplicate/clone a table in Postgres?

I know that I could do CREATE TABLE tbl_2 AS (select * from tbl_1)
But is there a better/faster/stronger way to do this? I am talking about performance more than anything else. The tables are all denormalised and I do not have any foreign key constraints to worry about.
EDIT
Maybe there isn't any better way? Ref: https://dba.stackexchange.com/questions/55661/how-to-duplicate-huge-postgres-table
A better way really depends on what exactly you're hoping to accomplish.
If you want to keep all the constraints and indexes from the original table you can use the LIKE clause in your CREATE TABLE statement like so:
CREATE TABLE tbl_2 (LIKE tbl_1 INCLUDING INDEXES INCLUDING CONSTRAINTS);
But that just creates an empty table. You would still have to copy in the data.
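A minimal sketch of the two steps together, generating the SQL from a shell function so it can be piped into psql. clone_table_sql is a hypothetical helper, and the database name test is a placeholder.

```shell
# clone_table_sql SRC DST
# Prints SQL that creates DST with SRC's columns, indexes and
# constraints, then copies every row across.
clone_table_sql() {
  printf 'CREATE TABLE %s (LIKE %s INCLUDING INDEXES INCLUDING CONSTRAINTS);\n' "$2" "$1"
  printf 'INSERT INTO %s SELECT * FROM %s;\n' "$2" "$1"
}

# Usage (placeholder database name); -1 runs both statements in one transaction:
#   clone_table_sql tbl_1 tbl_2 | psql -1 -d test -f -
```

Because the indexes exist before the rows are inserted, this is likely to load more slowly than CREATE TABLE ... AS followed by building the indexes afterwards.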
Alternatively you can use something like the following:
$ pg_dump -t tbl_1 | sed -e 's/^SET search_path = .*$/SET search_path = tmpschema, pg_catalog;/' > table.sql
$ psql -d test -c 'CREATE SCHEMA tmpschema'
$ psql -1 -d test -f table.sql
$ psql -d test -c 'ALTER TABLE tmpschema.tbl_1 RENAME TO tbl_2; ALTER TABLE tmpschema.tbl_2 SET SCHEMA public; DROP SCHEMA tmpschema'
Perhaps it is not faster than CREATE TABLE ... AS (SELECT ...), but it will copy all indexes and constraints as well.

Copy table data from one database to another

I have two databases on the same server and need to copy data from a table in the first db to a table in the second. A few caveats:
Both tables already exist (ie: I must not drop the 'copy-to' table first. I need to just add the data to the existing table)
The column names differ. So I need to specify exactly which columns to copy, and what their names are in the new table
After some digging I have only been able to find this:
pg_dump -t tablename dbname | psql otherdbname
But the above command doesn't take into account the two caveats I listed.
For a table t, with columns a and b in the source database, and x and y in the target:
psql -d sourcedb -c "copy t(a,b) to stdout" | psql -d targetdb -c "copy t(x,y) from stdin"
I'd use an ETL tool for this. There are free tools available, they can help you change column names and they are widely used and tested. Most tools allow external schedulers like the windows task scheduler or cron to run transformations based on whatever time schedule you need.
I have personally used Pentaho PDI for similar tasks in the past and it has always worked well for me. For your requirement I'd create a single transformation that first loads the table data from the source database, renames the columns in a "Select Values" step, and then inserts the values into the target table, using the "truncate" option to remove the existing rows from the target table. If your table is too big to be refilled each time, you'd need to work out a delta load procedure.

pg_dump only dumps table info, and very little table data?

pg_dump -U postgres mydb > mydb.bak.sql
From the docs, it doesn't seem like I need to pass any flag to include table data in a dump. Yet, the dump resulting from the above includes data for a strange, tiny subset of tables--aside from their create statements, most tables are only listed as
COPY <tablename> (vals) FROM stdin;
\.
Is there some circumstance where you have to explicitly tell pg_dump to include all table data?