Getting a clean backup of my schema from PostgreSQL - postgresql

I am backing up my schema and stored procedures by running pg_dump -s database. This works, but it seems to show every change I've made (an ALTER for every time I've altered something, each new declaration of stored procedures I've changed over time, etc.).
I'd like just the current schema so that I can restore things should something go wrong. Is this possible?

TL;DR: This doesn't happen. pg_dump produces output that is optimized to make imports faster, which can look like a history of changes but isn't one.
Are you sure about that? PostgreSQL does not store a history of schema changes at all, so it would not be possible for pg_dump to output one. Here are a couple of things that might have caused the confusion. Firstly, pg_dump typically breaks up CREATE TABLE statements into multiple statements. For example, consider this CREATE TABLE statement:
CREATE TABLE t (
    id integer primary key,
    q integer not null references q
);
pg_dump will convert it to
CREATE TABLE t (
    id integer NOT NULL,
    q integer NOT NULL
);
ALTER TABLE ONLY t
    ADD CONSTRAINT t_pkey PRIMARY KEY (id);
ALTER TABLE ONLY t
    ADD CONSTRAINT t_q_fkey FOREIGN KEY (q) REFERENCES q(id);
This is equivalent to the original. pg_dump typically creates tables in this order: (1) create the table without the constraints/indexes, (2) import the table data, and finally (3) create the constraints/indexes with ALTER TABLE / CREATE INDEX. It does it in that order because it is faster to import the table data without indexes and create the indexes afterwards. But this doesn't mean that PostgreSQL remembers the full history of changes to the table: if you add another column and run pg_dump afterwards, you will see the column in the resulting CREATE TABLE. Now, with the -s flag this breaking up might be unnecessary, but pg_dump does not change how it outputs the table-creation statements; it simply skips step (2) and performs steps (1) and (3).
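You can convince yourself of this with a quick experiment (a sketch; the exact dump formatting varies slightly by version):
-- Add a column to the example table above
ALTER TABLE t ADD COLUMN note text;

-- A fresh `pg_dump -s` then shows the column directly inside CREATE TABLE,
-- not as a separate ALTER, e.g. something like:
CREATE TABLE t (
    id integer NOT NULL,
    q integer NOT NULL,
    note text
);
-- ...followed by the same ALTER TABLE ... ADD CONSTRAINT statements as before.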
Finally, there is another issue that might cause confusion. Functions in PostgreSQL can be overloaded by providing multiple definitions with different argument types or different numbers of arguments. So if you do
CREATE OR REPLACE FUNCTION foo(x int) ...
and then later on you do
CREATE OR REPLACE FUNCTION foo(x text, y boolean) ...
then the second statement will not replace the function created by the first one, because the two are treated as different functions. So pg_dump will output them both. Again, this does not mean that PostgreSQL remembers functions you have deleted.
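If you want only the newer variant to survive (and appear in the dump), you have to drop the old overload yourself, naming its argument types. A sketch, using the hypothetical foo above:
-- In psql, \df foo lists all existing overloads of foo.
-- Drop the old single-argument variant; the (text, boolean) one is unaffected:
DROP FUNCTION foo(integer);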

Related

Postgres add column generated always stored ... but without locking the table

I would like to add a generated column to a very large table without locking it:
alter table customer add column foo text generated always as ((infos_json->'customerinfos'->>'redirect')) stored;
Is there a way to do this without locking? Maybe concurrently?

Postgres: difference between DEFAULT in CREATE TABLE and ALTER TABLE in database dump

In a database dump created with pg_dump, some tables have DEFAULTs in the CREATE TABLE statement, e.g.:
CREATE TABLE test (
    f1 integer DEFAULT nextval('test_f1_seq'::regclass) NOT NULL
);
But others have an additional ALTER statement:
ALTER TABLE ONLY test2 ALTER COLUMN f1 SET DEFAULT nextval('test2_f1_seq'::regclass);
What is the reason for this? All sequence-backed fields were created with type SERIAL, but in the dump they look different, and I can't work out any rule for it.
The difference must be that in the first case, the sequence is “owned” by the table column.
You can specify this dependency using the OWNED BY clause when you create a sequence. A sequence that is owned by a column will automatically be dropped when the column is.
If a sequence is implicitly created by using serial, it will be owned by the column.
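A sketch of how the two situations in the question typically arise (using the table names from the dump):
-- serial creates test_f1_seq implicitly and marks it OWNED BY test.f1:
CREATE TABLE test (f1 serial PRIMARY KEY);

-- A hand-made sequence is not owned by any column unless you say so:
CREATE SEQUENCE test2_f1_seq;
CREATE TABLE test2 (f1 integer NOT NULL DEFAULT nextval('test2_f1_seq'));

-- Ownership can be attached after the fact:
ALTER SEQUENCE test2_f1_seq OWNED BY test2.f1;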

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, and I need it to work the same as in Postgres. There is a table that generates values from a sequence, but the values can also be given explicitly.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
    hist_id   integer not null default nextval('hist_id_seq') primary key,
    h_c_id    integer,
    h_c_d_id  integer,
    h_c_w_id  integer,
    h_d_id    integer,
    h_w_id    integer,
    h_date    timestamp,
    h_amount  decimal(6,2),
    h_data    varchar(24)
);
(Note the sequence call in the hist_id column's default, which defines the value of the primary key.)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to GENERATED ALWAYS, it will throw errors because some values are provided explicitly. On the other hand, if I create the table with GENERATED BY DEFAULT, DB2 throws an error (SQL0803N) when it later tries to insert an already-used value, because the "internal sequence" does not take the explicitly inserted values into account and does not retry with the next value.
And I do not want to restart the sequence each time an explicit ID is inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 that Postgres (and apparently Oracle) provides?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally, autogenerated ids never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications need to manage some ids, often when setting up related entities (e.g., JPA seems to want to work this way).
However, if you add id values that you generated from a different source, the db won't be able to manage them. How could it? It isn't practical: attempting to do so would have to do one of the following.
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best not to have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this: SEQUENCE. It provides a db-managed, thread-safe, fast way to get ids. In PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow that. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id
    NO CASCADE BEFORE INSERT ON benchmarksql.history
    REFERENCING NEW AS Incoming_Entity
    FOR EACH ROW MODE DB2SQL
    WHEN (Incoming_Entity.hist_id IS NULL)
    SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
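For context, the DB2 side of the DDL could look roughly like this (equally untested; note that hist_id deliberately has no DEFAULT, since the trigger above fills it in):
CREATE SEQUENCE hist_id_seq AS INTEGER START WITH 1 INCREMENT BY 1;

CREATE TABLE benchmarksql.history (
    hist_id   INTEGER NOT NULL PRIMARY KEY,
    h_c_id    INTEGER,
    h_c_d_id  INTEGER,
    h_c_w_id  INTEGER,
    h_d_id    INTEGER,
    h_w_id    INTEGER,
    h_date    TIMESTAMP,
    h_amount  DECIMAL(6,2),
    h_data    VARCHAR(24)
);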
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as #mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.

PostgreSQL bulk insert with ActiveRecord

I have a lot of records that originally come from MySQL. I massaged the data so it will be successfully inserted into PostgreSQL using ActiveRecord. I can do this easily with row-by-row insertions, i.e. one row at a time, but that is very slow. I want to do a bulk insert, but it fails if any of the rows contains invalid data. Is there any way I can achieve a bulk insert with only the invalid rows failing instead of the whole batch?
COPY
When using SQL COPY for bulk insert (or its equivalent \copy in the psql client), failure is not an option. COPY cannot skip illegal lines. You have to match your input format to the table you import to.
If the data itself (not the surrounding format) violates your table definition, there are ways to make this a lot more tolerant, though. For instance: create a temporary staging table with all columns of type text. COPY to it, then fix offending rows with SQL commands before converting to the actual data types and inserting into the actual target table.
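A minimal sketch of that staging route (the file path, column names and the target_tbl table are all made up for illustration):
-- All-text staging table: COPY will accept any syntactically valid row
CREATE TEMP TABLE staging (id text, amount text);

COPY staging FROM '/path/to/data.csv' WITH (FORMAT csv);   -- or \copy from the psql client

-- Repair offending values while everything is still text
UPDATE staging SET amount = '0' WHERE amount IS NULL OR amount = '';

-- Cast and move into the real table, filtering rows that still won't fit
INSERT INTO target_tbl (id, amount)
SELECT id::int, amount::numeric
FROM   staging
WHERE  id ~ '^[0-9]+$';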
Consider this related answer:
How to bulk insert only new rows in PostgreSQL
Or this more advanced case:
"ERROR: extra data after last expected column" when using PostgreSQL COPY
If NULL values are offending, remove the NOT NULL constraint from your target table temporarily. Fix the rows after COPY, then reinstate the constraint. Or take the route with the staging table, if you cannot afford to soften your rules temporarily.
Sample code:
ALTER TABLE tbl ALTER COLUMN col DROP NOT NULL;
COPY ...
-- repair, like ..
-- UPDATE tbl SET col = 0 WHERE col IS NULL;
ALTER TABLE tbl ALTER COLUMN col SET NOT NULL;
Or you just fix the source file. COPY tells you the number of the offending line. Use an editor of your preference to fix it, then retry. I like to use vim for that.
INSERT
For an INSERT (as mentioned in the comments), the check for NULL values is trivial:
To skip a row with a NULL value:
INSERT INTO tbl (col1, ...
SELECT col1, ...
WHERE col1 IS NOT NULL
To insert something else instead of a NULL value (an empty string in my example):
INSERT INTO tbl (col1, ...
SELECT COALESCE(col1, ''), ...
A common work-around for this is to import the data into a TEMPORARY or UNLOGGED table with no constraints and, where data in the input is sufficiently bogus, text typed columns.
You can then do INSERT INTO ... SELECT queries against the data to populate the real table with a big query that cleans up the data during import. You can use a lot of CASE statements for this. The idea is to transform the data in one pass.
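For instance, the one-pass cleanup query could look roughly like this (the customers / customers_staging tables and the repair rules are invented for illustration):
INSERT INTO customers (name, status, signup_date)
SELECT NULLIF(trim(name), ''),                         -- empty strings become NULL
       CASE WHEN status IN ('active', 'inactive')
            THEN status ELSE 'unknown' END,            -- normalise unexpected codes
       CASE WHEN signup_date ~ '^\d{4}-\d{2}-\d{2}$'
            THEN signup_date::date END                 -- cast only well-formed dates
FROM   customers_staging;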
You might be able to do many of the fixes in Ruby as you read the data in, then push the data to PostgreSQL using COPY ... FROM STDIN. This is possible with Ruby's Pg gem, see eg https://bitbucket.org/ged/ruby-pg/src/tip/sample/copyfrom.rb .
For more complicated cases, look at Pentaho Kettle or Talend Studio ETL tools.

What is the difference between these two T-SQL statements?

In an SSIS package at work there are some SQL tasks that create staging tables for holding import data. All the statements take this form:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.tbNewTable') AND type IN (N'U'))
BEGIN
    TRUNCATE TABLE dbo.tbNewTable
END
ELSE
BEGIN
    CREATE TABLE dbo.tbNewTable (
        ColumnA VARCHAR(10) NULL,
        ColumnB VARCHAR(10) NULL,
        ColumnC INT NULL
    ) ON [PRIMARY]
END
In Itzik Ben-Gan's T-SQL Fundamentals I see a different form of statement for creating a table:
IF OBJECT_ID('dbo.tbNewTable', 'U') IS NOT NULL
BEGIN
    DROP TABLE dbo.tbNewTable
END

CREATE TABLE dbo.tbNewTable (
    ColumnA VARCHAR(10) NULL,
    ColumnB VARCHAR(10) NULL,
    ColumnC INT NULL
) ON [PRIMARY]
Each of these appears to do the same thing. After execution, there will be an empty table called tbNewTable in the dbo schema.
Are there any practical or theoretical differences between the two? What implications might they have?
The first one assumes that if the table exists, it has the same columns as those it would create. The second one does not make that assumption. So if a table with that name happened to exist and had a different set of columns, the two would have very different results.
The first will not actually DROP the table -- it merely TRUNCATES all the data in said table. Hence why the CREATE is guarded.
Thus the form with the DROP will allow the subsequent CREATE to change the schema (when the new table is created) even if tbNewTable previously existed.
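If you're unsure which definition you actually ended up with after the package ran, a quick catalog query shows the live column set (a sketch; it assumes dbo.tbNewTable exists):
SELECT c.name AS column_name, t.name AS type_name, c.max_length, c.is_nullable
FROM sys.columns AS c
JOIN sys.types AS t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID(N'dbo.tbNewTable')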
Because the DROP/CREATE alters the database schema, it may also not be allowed in all cases. For instance, a view created WITH SCHEMABINDING will prevent the table from being dropped. (This also holds true for more general FK relationships, should any exist.)
...when SCHEMABINDING is specified, the base table or tables cannot be modified in a way that would affect the view definition.
The TRUNCATE should be marginally faster, but only in a constant, "don't care" way: performance should not be a consideration in choosing one over the other.
There are also permission differences. TRUNCATE only requires the ALTER permission.
The minimum permission required is ALTER on table_name. TRUNCATE TABLE permissions default to the table owner...
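So if the package runs under a restricted account, the TRUNCATE form can get by with a narrower grant (etl_user is a hypothetical principal):
-- Enough for TRUNCATE TABLE; the DROP/CREATE form needs broader DDL rights
GRANT ALTER ON OBJECT::dbo.tbNewTable TO etl_user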
Happy coding.
These are very different.
The first does an equality check against the sys.objects catalog and looks to see if there is a matching table name. If so, it truncates the table, basically removing all rows while maintaining the table structure itself - i.e. the actual table is never dropped.
In the second, the check that the table exists is done implicitly via the OBJECT_ID() function. If the table does exist, it is dropped completely - rows and structure.
If you have primary and foreign key constraints on the table, you'll certainly have issues dropping it completely... and if you have other tables that are linked to the table you are trying to 'truncate', you'll have issues there too, unless you have cascading deletes turned on.
I tend to dislike either construction in an SSIS package. I create the tables in a deployment script, and I want the package to fail if one of the tables I use is missing later on, because then something has gone drastically wrong and I want to investigate what before I try putting data anywhere.