Make a column NOT NULL in a large table without locking issues? - postgresql

I want to change a column to NOT NULL:
ALTER TABLE "foos" ALTER "bar_id" SET NOT NULL
The "foos" table has almost 1 000 000 records. It does fairly low volumes of writes, but quite constantly. There are a lot of reads.
In my experience, changing a column in a big table to NOT NULL like this can cause downtime in the app, presumably because it leads to (b)locks.
I've yet to find a good explanation corroborating this, though.
And if it is true, what can I do to avoid it?
EDIT: The docs (via this comment) say:
Adding a column with a DEFAULT clause or changing the type of an existing column will require the entire table and its indexes to be rewritten.
I'm not sure if changing NULL counts as "changing the type of an existing column", but I believe I did have an index on the column the last time I saw this issue.
Perhaps removing the index, making the column NOT NULL, and then adding the index back would improve things?

I think you can do that using a check constraint rather than SET NOT NULL.
ALTER TABLE foos
add constraint id_not_null check (bar_id is not null) not valid;
This will still require an ACCESS EXCLUSIVE lock on the table, but it is very quick because Postgres doesn't validate the constraint (so it doesn't have to scan the entire table). This already ensures that new or updated rows cannot put a NULL value into that column.
Then (after committing the alter table!) you can do:
alter table foos validate constraint id_not_null;
This takes only a SHARE UPDATE EXCLUSIVE lock, not ACCESS EXCLUSIVE, so the table can still be read and written while the existing rows are checked.
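Putting the two steps together, here is a sketch using the table and constraint names from above. Steps 3 and 4 assume PostgreSQL 12 or newer: since version 12, SET NOT NULL can skip its full-table scan when a valid CHECK constraint already proves the column contains no NULLs, so the check constraint can be swapped for a real NOT NULL afterwards.

```sql
-- Step 1: brief ACCESS EXCLUSIVE lock, no table scan.
ALTER TABLE foos
    ADD CONSTRAINT id_not_null CHECK (bar_id IS NOT NULL) NOT VALID;

-- Step 2 (in a separate transaction, after step 1 has committed):
-- scans the existing rows without blocking normal reads and writes.
ALTER TABLE foos VALIDATE CONSTRAINT id_not_null;

-- Step 3 (assumes PostgreSQL 12+): the validated constraint lets
-- SET NOT NULL skip its own scan, so the lock is held only briefly.
ALTER TABLE foos ALTER COLUMN bar_id SET NOT NULL;

-- Step 4: the check constraint is now redundant and can be dropped.
ALTER TABLE foos DROP CONSTRAINT id_not_null;
```

On versions older than 12, stopping after step 2 and keeping the check constraint gives the same guarantee without the long scan under ACCESS EXCLUSIVE.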

Related

Unexpected creation of duplicate unique constraints in Postgres

I am writing an idempotent schema change script for a Postgres 12 database. However, I noticed that if I include IF NOT EXISTS in an ADD COLUMN statement, then even if the column already exists, it adds a duplicate index for the uniqueness constraint that already exists. Simple example:
-- set up base table
CREATE TABLE IF NOT EXISTS test_table
(id SERIAL PRIMARY KEY
);
-- statement intended to be idempotent
ALTER TABLE test_table
ADD COLUMN IF NOT EXISTS name varchar(50) UNIQUE;
Running this script creates a new index test_table_name_key[n] each time it is run. I can't find anything in the Postgres documentation and don't understand why this is allowed to happen. If I break it into two parts, e.g.:
ALTER TABLE test_table
ADD COLUMN IF NOT EXISTS name varchar(50);
ALTER TABLE test_table
ADD CONSTRAINT test_table_name_key UNIQUE (name);
Then the transaction fails because Postgres rejects the creation of a constraint which already exists (which I can then catch in a DO EXCEPTION block). As far as I can tell, this is because this approach forces me to give the constraint a name. This contrasts with ALTER COLUMN SET NOT NULL, which can be run multiple times without error or side effects as far as I can tell.
Question: why does it add a duplicate unique constraint and are there any problems with having multiple identical indexes on a table column? (I think this is a subtle 'error' and only spotted it by chance so am concerned it may arise in a production situation)
You can create multiple unique constraints on the same column as long as they have different names, simply because there is nothing in the PostgreSQL code that forbids that. Each unique constraint will create a unique index with the same name, because that is how unique constraints are implemented.
This can be a valid use case: for example, if the index is bloated, you could create a new constraint and then drop the old one.
But normally, it is useless and does harm, because each index will make data modifications on the table slower.
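To spot duplicates that have already been created, and to make the two-part version idempotent without relying on an exception handler, something along these lines may help. This is only a sketch: it reuses the test_table and test_table_name_key names from the question and looks the constraint up by name in the pg_constraint catalog.

```sql
-- List every index on the table; duplicates show up as
-- test_table_name_key, test_table_name_key1, ... with the same definition.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'test_table';

-- Add the column, then add the named constraint only if it is missing.
ALTER TABLE test_table
    ADD COLUMN IF NOT EXISTS name varchar(50);

DO $$
BEGIN
    IF NOT EXISTS (
        SELECT 1
        FROM pg_constraint
        WHERE conname = 'test_table_name_key'
          AND conrelid = 'test_table'::regclass
    ) THEN
        ALTER TABLE test_table
            ADD CONSTRAINT test_table_name_key UNIQUE (name);
    END IF;
END
$$;
```

Surplus constraints that already exist can be removed with ALTER TABLE test_table DROP CONSTRAINT, one per duplicate name, which also drops the backing index.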

Is it possible to access current column data on conflict

I want to get such behaviour on inserting data (conflict on id):
if there is no model with same id in db do INSERT
if there is entry with same id in db and that entry is newer (updated_at field) do NOT UPDATE
if there is entry with same id in db and that entry is older (updated_at field) do UPDATE
I'm using Ecto for that and want to work on constraints, however I cannot find an option to do so in documentation. Pseudo code of constraint could look like:
CHECK: NULL(current.updated_at) or incoming.updated_at > current.updated_at
Is such behaviour possible in Postgres?
PostgreSQL does not support CHECK constraints that reference table
data other than the new or updated row being checked. While a CHECK
constraint that violates this rule may appear to work in simple tests,
it cannot guarantee that the database will not reach a state in which
the constraint condition is false (due to subsequent changes of the
other row(s) involved). This would cause a database dump and reload to
fail. The reload could fail even when the complete database state is
consistent with the constraint, due to rows not being loaded in an
order that will satisfy the constraint. If possible, use UNIQUE,
EXCLUDE, or FOREIGN KEY constraints to express cross-row and
cross-table restrictions.
If what you desire is a one-time check against other rows at row
insertion, rather than a continuously-maintained consistency
guarantee, a custom trigger can be used to implement that. (This
approach avoids the dump/reload problem because pg_dump does not
reinstall triggers until after reloading data, so that the check will
not be enforced during a dump/reload.)
That should be simple using the WHERE clause of ON CONFLICT ... DO UPDATE:
INSERT INTO mytable (id, entry) VALUES (42, '2021-05-29 12:00:00')
ON CONFLICT (id)
DO UPDATE SET entry = EXCLUDED.entry
WHERE mytable.entry < EXCLUDED.entry;
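Adapted to the columns described in the question, the same pattern might look like the sketch below. The table name models and the payload column are hypothetical; it assumes id has a primary key or unique constraint, and it treats a NULL updated_at on the stored row as "older", as in the pseudo-constraint above.

```sql
INSERT INTO models (id, payload, updated_at)
VALUES (42, 'new payload', '2021-05-29 12:00:00')
ON CONFLICT (id)
DO UPDATE SET payload    = EXCLUDED.payload,
              updated_at = EXCLUDED.updated_at
-- Only overwrite when the stored row has no timestamp or is older
-- than the incoming row; otherwise the existing row is left untouched.
WHERE models.updated_at IS NULL
   OR models.updated_at < EXCLUDED.updated_at;
```

When the WHERE condition is false, the statement neither inserts nor updates and raises no error, which covers the "entry is newer, do NOT UPDATE" case.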

How to add a column to a table on production PostgreSQL with zero downtime?

Here
https://stackoverflow.com/a/53016193/10894456
is an answer provided for Oracle 11g,
My question is the same:
What is the best approach to add a not null column with default value
in production oracle database when that table contain one million
records and it is live. Does it create any locks if we do the column
creation , adding default value and making it as not null in a single
statement?
but for PostgreSQL ?
This prior answer essentially answers your query.
Cross-referencing the relevant PostgreSQL doc with the PostgreSQL source code for AlterTableGetLockLevel mentioned in the above answer shows that ALTER TABLE ... ADD COLUMN will always obtain an ACCESS EXCLUSIVE table lock, precluding any other transaction from accessing the table for the duration of the ADD COLUMN operation.
This same exclusive lock is obtained for any ADD COLUMN variation; i.e., it doesn't matter whether you add a NULL column (with or without DEFAULT) or a NOT NULL column with a default.
However, as mentioned in the linked answer above, adding a NULL column with no DEFAULT should be very quick as this operation simply updates the catalog.
In contrast, adding a column with a DEFAULT specifier necessitates a rewrite of the entire table in PostgreSQL 10 or older.
This operation is likely to take a considerable time on your 1M record table.
According to the linked answer, PostgreSQL >= 11 does not require such a rewrite for adding such a column, so should perform more similarly to the no-DEFAULT case.
I should add that for PostgreSQL 11 and above, the ALTER TABLE docs note that table rewrites are only avoided for non-volatile DEFAULT specifiers:
When a column is added with ADD COLUMN and a non-volatile DEFAULT is specified, the default is evaluated at the time of the statement and the result stored in the table's metadata. That value will be used for the column for all existing rows. If no DEFAULT is specified, NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT [...] will require the entire table and its indexes to be rewritten. [...] Table and/or index rebuilds may take a significant amount of time for a large table; and will temporarily require as much as double the disk space.
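To make the volatile versus non-volatile distinction concrete, here is a small sketch for PostgreSQL 11 or newer. The table name my_table and the column names are hypothetical; the volatility of a function can be looked up in pg_proc.

```sql
-- Constant default: stored in the table's metadata, no rewrite.
ALTER TABLE my_table ADD COLUMN status text NOT NULL DEFAULT 'new';

-- now() is STABLE, not VOLATILE: it is evaluated once when the ALTER runs,
-- every existing row gets that same timestamp, and no rewrite is needed.
ALTER TABLE my_table ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- clock_timestamp() is VOLATILE: each row needs its own value,
-- so the whole table and its indexes are rewritten.
ALTER TABLE my_table ADD COLUMN touched_at timestamptz DEFAULT clock_timestamp();

-- Check a function's volatility: 'i' = immutable, 's' = stable, 'v' = volatile.
SELECT proname, provolatile
FROM pg_proc
WHERE proname IN ('now', 'clock_timestamp');
```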

Does adding a null column to a postgres table cause a lock?

I think I read somewhere that running an ALTER TABLE foo ADD COLUMN baz text on a postgres database will not cause a read or write lock. Setting a default value causes locking, but allowing a null default prevents a lock.
I can't find this in the documentation, though. Can anyone point to a place that says, definitively, if this is true or not?
The different sorts of locks and when they're used are mentioned in the doc in
Table-level Locks. For instance, Postgres 11's ALTER TABLE may acquire a SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, or ACCESS EXCLUSIVE lock.
Postgres 9.1 through 9.3 claimed to support two of the above three but actually forced Access Exclusive for all variants of this command. This limitation was lifted in Postgres 9.4 but ADD COLUMN remains at ACCESS EXCLUSIVE by design.
It's easy to check in the source code because there's a function dedicated to establishing the lock level needed for this command in various cases: AlterTableGetLockLevel in src/backend/commands/tablecmds.c.
Concerning how much time the lock is held, once acquired:
When the column's default value is NULL, the column's addition should be very quick because it doesn't need a table rewrite: it's only an update in the catalog.
When the column has a non-NULL default value, it depends on PostgreSQL version: with version 11 or newer, there is no immediate rewriting of all the rows, so it should be as fast as the NULL case. But with version 10 or older, the table is entirely rewritten, so it may be quite expensive depending on the table's size.
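Besides reading the source, one way to observe the lock level yourself is to run the statement inside a transaction and look at pg_locks before rolling back. A minimal sketch, using the foo table and baz column from the question:

```sql
BEGIN;

ALTER TABLE foo ADD COLUMN baz text;

-- Locks held by this session on the table while the transaction is open.
SELECT locktype, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid()
  AND relation = 'foo'::regclass;

-- Undo the change; nothing is kept.
ROLLBACK;
```

For ADD COLUMN the query should report a relation lock in AccessExclusiveLock mode. Note that the lock is held until the transaction ends, so on a busy table this experiment is best run against a test copy.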
Adding a new nullable column without a default locks the table for a very short time, since there is no need to rewrite the data on disk. Adding a column with a default value (in PostgreSQL 10 and older) requires PostgreSQL to make new versions of all rows and store them on disk, and the table is locked for that whole time.
So when you need to add a column with a default value to a big table, it's recommended to add the column as nullable first and then update all rows in small portions. This way you avoid high load on the disk and allow autovacuum to do its job, so you don't end up doubling the table size.
http://www.postgresql.org/docs/current/static/sql-altertable.html#AEN57290
"Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten."
So the documentation only specifies when the table is rewritten.
There will always be a lock, but it will be very short in case the table is not to be rewritten.

ADD COLUMN with DEFAULT value to a huge table

I have a PostgreSQL DB and a table with almost a billion rows.
When I try to add a new column with a default value:
ALTER TABLE big_table
ADD COLUMN some_flag integer NOT NULL DEFAULT 0;
The transaction goes on for 30+ minutes and the DB logs start to show warnings.
Is there any way to optimize the query?
Besides doing it in batches (which will still take a while):
You could dump the table as COPY statements and write a script to edit the contents of the COPY statements to insert another column (COPY can be CSV IIRC).
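Then you just reload your altered COPY dump. In theory this can be faster than the ALTER, because COPY into a table created or truncated in the same transaction can skip WAL logging when wal_level is minimal.
The other option is to turn off fsync while you run the command; just remember to turn it back on, since a crash while fsync is off can corrupt the database.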
You can also do both of the above in batches.
Starting from PostgreSQL 11 this behaviour will change.
Waiting for PostgreSQL 11 – Fast ALTER TABLE ADD COLUMN with a non-NULL default:
So, for the longest time, when you did:
alter table x add column z text;
it was virtually instantaneous. Get a lock on table, add information about new column to system catalogs, and it's done.
But when you tried:
alter table x add column z text default 'some value';
then it took long time. How long it did depend on size of table.
This was because postgresql was actually rewriting the whole table, adding the column to each row, and filling it with default value.
"What happens if you want to set the column to NOT NULL also? Are we back to the slow version in that case or does this handle that as well?"
not null doesn't change anything. it is a constraint for new rows. so adding a column with "not null default 'xxx'" will be fast.
I'd consider creating the column without the default and manually updating the rows in batches with intermittent commits to apply the default.
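A rough sketch of that batched approach for the table in the question follows. It assumes big_table has an integer primary key named id (hypothetical, not stated in the question) and uses an arbitrary batch size of 100000 ids; each UPDATE is meant to run in its own transaction.

```sql
-- 1. Catalog-only change: nullable column, no default yet.
ALTER TABLE big_table ADD COLUMN some_flag integer;

-- 2. New rows get the default from now on; existing rows are untouched.
ALTER TABLE big_table ALTER COLUMN some_flag SET DEFAULT 0;

-- 3. Backfill one id range per transaction. Repeat with the next range
--    (100001 to 200001, and so on) until the highest id has been covered.
UPDATE big_table
SET some_flag = 0
WHERE id >= 1
  AND id < 100001
  AND some_flag IS NULL;

-- 4. Once no NULLs remain, enforce the constraint. A plain SET NOT NULL
--    scans the table under ACCESS EXCLUSIVE; a CHECK constraint added
--    NOT VALID and validated afterwards avoids holding that lock for the scan.
ALTER TABLE big_table ALTER COLUMN some_flag SET NOT NULL;
```

Keeping the batches small gives autovacuum a chance to reclaim the dead row versions between runs, so the table does not grow to double its size.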