ADD COLUMN with DEFAULT value to a huge table - postgresql

I have a PostgreSQL database with a table of almost a billion rows.
When I try to add a new column with a default value:
ALTER TABLE big_table
ADD COLUMN some_flag integer NOT NULL DEFAULT 0;
The transaction runs for 30+ minutes and the DB log starts to fill with warnings.
Is there any way to optimize the query?

Besides doing it in batches (which will still take a while):
You could dump the table as COPY statements and write a script that edits the COPY data to add another column (COPY can use CSV, IIRC).
Then you reload the altered COPY dump. In theory that should be faster than the ALTER, because COPY into a table created or truncated in the same transaction can skip WAL logging when wal_level is minimal.
The other option is to turn off fsync while you run the command. Just remember to turn it back on, because a crash while fsync is off can corrupt the database.
You can also do both of the above in batches.

Starting from PostgreSQL 11 this behaviour will change.
Waiting for PostgreSQL 11 – Fast ALTER TABLE ADD COLUMN with a non-NULL default:
So, for the longest time, when you did:
alter table x add column z text;
it was virtually instantaneous. Get a lock on table, add information about new column to system catalogs, and it's done.
But when you tried:
alter table x add column z text default 'some value';
then it took a long time. How long depended on the size of the table.
This was because PostgreSQL was actually rewriting the whole table, adding the column to each row, and filling it with the default value.
"What happens if you want to set the column to NOT NULL also? Are we back to the slow version in that case or does this handle that as well?"
NOT NULL doesn't change anything. It is a constraint for new rows, so adding a column with "not null default 'xxx'" will be fast.

I'd consider creating the column without the default and manually updating the rows in batches with intermittent commits to apply the default.
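A minimal sketch of that batching, using Python only to generate the SQL. It assumes big_table has an indexed integer key called id (hypothetical, the question does not name one) and that the column was first added with a plain ALTER TABLE big_table ADD COLUMN some_flag integer; each yielded statement is meant to be run and committed on its own.

```python
from typing import Iterator


def batched_updates(table: str, column: str, default_sql: str,
                    key: str, min_key: int, max_key: int,
                    batch_size: int) -> Iterator[str]:
    """Yield one UPDATE per key range so each can be committed separately.

    `key` is assumed to be an indexed integer column (hypothetical here).
    Ranges are half-open: [lo, hi).
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = min_key
    while lo <= max_key:
        hi = lo + batch_size
        # Only touch rows that are still NULL, so a rerun skips finished work.
        yield (
            f"UPDATE {table} SET {column} = {default_sql} "
            f"WHERE {key} >= {lo} AND {key} < {hi} AND {column} IS NULL;"
        )
        lo = hi


if __name__ == "__main__":
    for stmt in batched_updates("big_table", "some_flag", "0",
                                "id", 1, 250_000, 100_000):
        print(stmt)
```

Running ALTER TABLE big_table ALTER COLUMN some_flag SET DEFAULT 0 before the backfill keeps newly inserted rows from arriving as NULL, and SET NOT NULL can follow once every range is done (it still has to scan the table to verify).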

Serial column takes up disproportional amount of space in PostgreSQL

I would like to create an auto-incrementing id column that is not a primary key in a PostgreSQL table. The table is currently just over 200M rows and contains 14 columns.
SELECT pg_size_pretty(pg_total_relation_size('mytable'));
The above query reveals that mytable takes up 57 GB on disk. I currently have 30 GB of free space remaining on disk, according to df -h (on Ubuntu 20.04).
What I don't understand is why, after trying to create a SERIAL column, I completely run out of disk space - the query ends up never finishing. I run the following command:
ALTER TABLE mytable ADD COLUMN id SERIAL;
and then see how gradually, my disk space runs out until there is nothing left and the query fails. I am no database expert but it does not make sense. Why would a simple serialized column take up more than half of the space of the table itself, especially when it is not a primary key and therefore has no index? Is there a known workaround to creating such an auto-incrementing id column?
As a proof of concept:
create table id_test(pk_fld integer primary key generated always as identity);
--FYI, in Postgres 14+ the overriding system value won't be needed.
--That is a hack around a bug in 13-
insert into id_test overriding system value values (default), (default);
select * from id_test;
pk_fld
--------
1
2
alter table id_test add column id_fld integer ;
update id_test set id_fld = 0;
alter table id_test alter COLUMN id_fld set not null;
alter table id_test alter COLUMN id_fld add generated always as identity;
update id_test set id_fld = default;
select * from id_test;
pk_fld | id_fld
--------+--------
1 | 1
2 | 2
Basically this breaks the process down into steps. Obviously this is just a toy table and not representative of your setup. I would try it on a test table that is a subset of your actual table to see what happens to disk space consumption. It would not hurt to run VACUUM after the updates so the space held by dead row versions can be reused.
Adding a serial column is adding an integer column with a non-constant DEFAULT value. This will cause PostgreSQL to rewrite the table, because the new column value has to be added to all existing rows. So PostgreSQL writes a new copy of the table and discards the old one after it is done. This will require more than double the disk space of the original table temporarily, which explains why you run out of disk space.
You can split the operation into several steps:
ALTER TABLE mytable ADD id bigint;
CREATE SEQUENCE mytable_id_seq OWNED BY mytable.id;
ALTER TABLE mytable ALTER id SET DEFAULT nextval('mytable_id_seq');
This will not rewrite the table, and it will leave the existing rows untouched. The value of id for these rows will be NULL.
You probably want to update the existing rows to be NOT NULL, but be careful: if you update them all at once, you will run out of disk space as well, because in PostgreSQL an UPDATE writes a complete new version of the row to the table. You'd have to update the rows in batches and run VACUUM between these updates.
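One way to sketch that batch-and-vacuum loop, with Python used just to build the SQL. It assumes the three statements above have already been run, so id is NULL for the old rows and mytable_id_seq exists; the batch size and the ctid-based subquery are my choices for illustration, not something this answer prescribes.

```python
import math
from typing import List


def backfill_script(table: str, column: str, sequence: str,
                    row_count: int, batch_size: int) -> List[str]:
    """Return UPDATE/VACUUM pairs that fill NULL values a batch at a time.

    Each UPDATE picks up to `batch_size` rows that are still NULL by their
    physical row id (ctid), so no other key column is required.
    VACUUM cannot run inside a transaction block, so it stays a separate
    statement after each UPDATE has committed.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # row_count is only an estimate; a real loop would stop when an
    # UPDATE reports zero rows.
    batches = math.ceil(row_count / batch_size)
    update = (
        f"UPDATE {table} SET {column} = nextval('{sequence}') "
        f"WHERE ctid IN (SELECT ctid FROM {table} "
        f"WHERE {column} IS NULL LIMIT {batch_size});"
    )
    script: List[str] = []
    for _ in range(batches):
        script.append(update)
        script.append(f"VACUUM {table};")
    return script


if __name__ == "__main__":
    for stmt in backfill_script("mytable", "id", "mytable_id_seq",
                                2_500_000, 1_000_000):
        print(stmt)
```

Each batch has to search for rows that are still NULL, so if the table has a suitable indexed column, a key-range predicate would likely be cheaper than the ctid subquery.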
All in all, this is rather annoying and complicated. So do yourself a favor and increase the disk space. That is the simple and best solution.

How to add a column to a table on production PostgreSQL with zero downtime?

Here is an answer provided for Oracle 11g:
https://stackoverflow.com/a/53016193/10894456
My question is the same:
What is the best approach to add a not null column with default value
in production oracle database when that table contain one million
records and it is live. Does it create any locks if we do the column
creation , adding default value and making it as not null in a single
statement?
but for PostgreSQL?
This prior answer essentially answers your query.
Cross-referencing the relevant PostgreSQL doc with the PostgreSQL source code for AlterTableGetLockLevel mentioned in the above answer shows that ALTER TABLE ... ADD COLUMN will always obtain an ACCESS EXCLUSIVE table lock, precluding any other transaction from accessing the table for the duration of the ADD COLUMN operation.
This same exclusive lock is obtained for any ADD COLUMN variation; i.e. it doesn't matter whether you add a NULL column (with or without DEFAULT) or a NOT NULL column with a default.
However, as mentioned in the linked answer above, adding a NULL column with no DEFAULT should be very quick as this operation simply updates the catalog.
In contrast, adding a column with a DEFAULT specifier necessitates a rewrite of the entire table in PostgreSQL 10 or earlier.
This operation is likely to take a considerable time on your 1M record table.
According to the linked answer, PostgreSQL >= 11 does not require such a rewrite for adding such a column, so should perform more similarly to the no-DEFAULT case.
I should add that for PostgreSQL 11 and above, the ALTER TABLE docs note that table rewrites are only avoided for non-volatile DEFAULT specifiers:
When a column is added with ADD COLUMN and a non-volatile DEFAULT is specified, the default is evaluated at the time of the statement and the result stored in the table's metadata. That value will be used for the column for all existing rows. If no DEFAULT is specified, NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT [...] will require the entire table and its indexes to be rewritten. [...] Table and/or index rebuilds may take a significant amount of time for a large table; and will temporarily require as much as double the disk space.
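The rule in those two quoted paragraphs, together with the PostgreSQL 10 behaviour described above, can be written down as a small decision function. This is a Python sketch of the rule as stated in this answer, not code from PostgreSQL; the function and parameter names are made up for illustration, and it ignores any other reasons a rewrite might happen.

```python
def add_column_rewrites_table(pg_major: int, has_default: bool,
                              default_is_volatile: bool = False) -> bool:
    """Return True if ADD COLUMN is expected to rewrite the whole table.

    Encodes only the rule described above:
    - no DEFAULT: catalog-only change, no rewrite
    - DEFAULT on PostgreSQL 10 or earlier: rewrite
    - DEFAULT on PostgreSQL 11+: rewrite only if the default is volatile
    """
    if not has_default:
        return False
    if pg_major < 11:
        return True
    return default_is_volatile


if __name__ == "__main__":
    # DEFAULT 0 is a constant, so it is non-volatile.
    print(add_column_rewrites_table(10, True))        # pre-11 with a default
    print(add_column_rewrites_table(12, True))        # 11+ with a constant default
    # nextval(...) and random() are volatile functions.
    print(add_column_rewrites_table(12, True, True))
```

For orientation: a constant such as 0 is non-volatile, now() is only STABLE and so also avoids the rewrite, while nextval(), random() and clock_timestamp() are volatile.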

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table; you have to use temporary tables etc. if you want to insert a column somewhere in the middle.
Personally, I have found that rebuilding the table works best.
I do it in the following way:
Create a new table N_OLD_TABLE
Define the datatype/compression encoding in the new table
Insert data into N_OLD_TABLE(old_columns) select(old_columns) from OLD_TABLE
Rename OLD_TABLE to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. It doesn't block any table, and you always have a backup of the old table in case anything goes wrong.
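Those rebuild steps could be spelled out roughly like this, with a small Python helper that only assembles the statements in order. The column names and the CREATE TABLE text in the usage are placeholders I made up; the real DDL, with its data types and compression encodings, is whatever you define for the new table.

```python
from typing import List, Sequence


def rebuild_statements(old_table: str, new_table: str, backup_table: str,
                       new_table_ddl: str,
                       old_columns: Sequence[str]) -> List[str]:
    """Return the statements for the copy-and-rename rebuild, in order.

    `new_table_ddl` is the full CREATE TABLE for the new table, including
    the extra column and any encodings; it is passed through unchanged.
    """
    cols = ", ".join(old_columns)
    return [
        new_table_ddl,
        # Copy only the columns that exist in the old table.
        f"INSERT INTO {new_table} ({cols}) SELECT {cols} FROM {old_table};",
        # Keep the original around as a backup, then swap the new one in.
        f"ALTER TABLE {old_table} RENAME TO {backup_table};",
        f"ALTER TABLE {new_table} RENAME TO {old_table};",
    ]


if __name__ == "__main__":
    ddl = "CREATE TABLE n_old_table (a INT, b INT, c INT DEFAULT 0);"
    for stmt in rebuild_statements("old_table", "n_old_table",
                                   "old_table_bkp", ddl, ["a", "b"]):
        print(stmt)
```

Rows written to the old table after the INSERT ... SELECT has read it are not carried over, so writes need to be paused or replayed around the swap.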

Query rows by time of creation?

I have a table that contains no date or time related fields. Still I want to query that table based on when records/rows were created. Is there a way to do this in PostgreSQL?
I prefer an answer about doing it in PostgreSQL directly. But if that's not possible, can hibernate do it for PostgreSQL?
Basically: no. There is no automatic timestamp for rows in PostgreSQL.
I usually add a column like this to my tables (ignoring time zones):
ALTER TABLE tbl ADD COLUMN log_in timestamp DEFAULT localtimestamp NOT NULL;
As long as you don't manipulate the values in that column, you have your creation timestamp. You can add a trigger and / or restrict write privileges to avoid tampering with the values.
Second class options
If you have a serial column, you could at least tell with some probability in what order rows were entered. That's not 100% reliable, because the values can be changed by hand, and applications can get values from the sequence and INSERT out of order.
If you created your table WITH (OIDS=TRUE), then the OID column could be some indication, unless your database is heavily written and / or very old; in that case you may have gone through OID wrap-around and later rows can have a smaller OID. That's one of the reasons why this feature is hardly used any more.
The default depends on the setting of default_with_oids. I quote the manual:
The parameter is off by default; in PostgreSQL 8.0 and earlier, it was
on by default.
If you have not updated your rows, gone through a dump / restore cycle, or run VACUUM FULL, CLUSTER or similar, a plain SELECT * FROM tbl returns all rows in the order they were entered. But this is very unreliable and implementation-dependent. PostgreSQL (like any RDBMS) does not guarantee any order without an ORDER BY clause.

PostgreSQL v7.4 ALTER TABLE to change column

I have a need to change the length of CHAR columns in tables in a PostgreSQL v7.4 database. This version did not support the ability to directly change the column type or size using the ALTER TABLE statement. So, directly altering a column from a CHAR(10) to CHAR(20) for instance isn't possible (yeah, I know, "use varchars", but that's not an option in my current circumstance). Anyone have any advice/tricks on how to best accomplish this? My initial thoughts:
-- Save the table's data in a new "save" table.
CREATE TABLE save_data AS SELECT * FROM table_to_change;
-- Drop the columns from the first column to be changed on down.
ALTER TABLE table_to_change DROP column_name1; -- for each column starting with the first one that needs to be modified
ALTER TABLE table_to_change DROP column_name2;
...
-- Add the columns back, using the new size for the CHAR column
ALTER TABLE table_to_change ADD column_name1 CHAR(new_size); -- for each column dropped above
ALTER TABLE table_to_change ADD column_name2...
-- Copy the data back from the "save" table
UPDATE table_to_change
SET column_name1=save_data.column_name1, -- for each column dropped/readded above
column_name2=save_data.column_name2,
...
FROM save_data
WHERE table_to_change.primary_key=save_data.primary_key;
Yuck! Hopefully there's a better way? Any suggestions appreciated. Thanks!
Not PostgreSQL, but in Oracle I have changed a column's type by:
Add a new column with a temporary name (e.g. TMP_COL) and the new data type (e.g. CHAR(20))
Run an update query: UPDATE TBL SET TMP_COL = OLD_COL;
Drop OLD_COL
Rename TMP_COL to OLD_COL
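Translated to PostgreSQL syntax, those four steps might look like the statements this Python helper assembles. I believe plain ADD COLUMN, DROP COLUMN and RENAME COLUMN all exist in 7.4, but treat that as an assumption to check against the 7.4 docs; the table and column names in the usage are placeholders.

```python
from typing import List


def widen_column_statements(table: str, column: str, new_type: str,
                            tmp_column: str = "tmp_col") -> List[str]:
    """Return the add / copy / drop / rename statements, in order."""
    return [
        f"ALTER TABLE {table} ADD COLUMN {tmp_column} {new_type};",
        f"UPDATE {table} SET {tmp_column} = {column};",
        f"ALTER TABLE {table} DROP COLUMN {column};",
        f"ALTER TABLE {table} RENAME COLUMN {tmp_column} TO {column};",
    ]


if __name__ == "__main__":
    for stmt in widen_column_statements("table_to_change", "column_name1",
                                        "CHAR(20)"):
        print(stmt)
```

The column ends up last in the table's column order, and any indexes, constraints or defaults on the old column go away with the DROP and have to be recreated.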
I would dump the table contents to a flat file with COPY, drop the table, recreate it with the correct column setup, and then reload (with COPY again).
http://www.postgresql.org/docs/7.4/static/sql-copy.html
Is it acceptable to have downtime while performing this operation? Obviously what I've just described requires making the table unusable for a period of time; how long depends on the data size and hardware you're working with.
Edit: But COPY is quite a bit faster than INSERTs and UPDATEs. According to the docs you can make it even faster by using BINARY mode. BINARY makes it less compatible with other PGSQL installs but you won't care about that because you only want to load the data to the same instance that you dumped it from.
The best approach to your problem is to upgrade pg to something less archaic :)
Seriously. 7.4 is going to be removed from "supported versions" pretty soon, so I wouldn't wait for it to happen with 7.4 in production.