PostgreSQL: increasing a column's length in a very large table

Aurora PostgreSQL, version 10.4.
I have a table with several million rows. One of the columns is defined as character varying(255). Once upon a time, 255 was plenty of room, but now it's not, so I have to make it bigger.
I found this in the PG 9.1 release notes:
Allow ALTER TABLE ... SET DATA TYPE to avoid table rewrites in appropriate cases (Noah Misch, Robert Haas)
For example, converting a varchar column to text no longer requires a rewrite of the table. However, increasing the length constraint on a varchar column still requires a table rewrite.
This suggests that changing to a longer varchar is not practical (since rewriting a table of that size would lock it for an ungodly amount of time), but changing to text would work. Is this correct?
Any other things I should know about when making such a change? I want to avoid data loss, obviously, and I can't afford to make this table inaccessible for more than a short period.

You should have read the release notes a little further, because just one version later (PostgreSQL 9.2):
Increasing the length limit for a varchar or varbit column, or removing the limit altogether, no longer requires a table rewrite.
You can test that easily for yourself:
postgres=# select version();
version
------------------------------------------------------------
PostgreSQL 10.5, compiled by Visual C++ build 1800, 64-bit
(1 row)
postgres=# \timing on
Timing is on.
postgres=# create table alter_test (id serial, some_col varchar(255));
CREATE TABLE
Time: 22.331 ms
postgres=# insert into alter_test (some_col) select md5(random()::text) from generate_series(1,10e6);
INSERT 0 10000000
Time: 40894.275 ms (00:40.894)
postgres=# alter table alter_test alter column some_col type varchar(500);
ALTER TABLE
Time: 5.297 ms
postgres=#

Related

Serial column takes up disproportional amount of space in PostgreSQL

I would like to create an auto-incrementing id column that is not a primary key in a PostgreSQL table. The table currently has just over 200M rows and contains 14 columns.
SELECT pg_size_pretty(pg_total_relation_size('mytable'));
The above query reveals that mytable takes up 57 GB on disk. Checking with df -h (on Ubuntu 20.04) shows that I currently have 30 GB of free disk space left.
What I don't understand is why, after trying to create a SERIAL column, I completely run out of disk space - the query ends up never finishing. I run the following command:
ALTER TABLE mytable ADD COLUMN id SERIAL;
and then watch how my disk space gradually runs out until there is nothing left and the query fails. I am no database expert, but it does not make sense to me. Why would a simple serial column take up more than half the space of the table itself, especially when it is not a primary key and therefore has no index? Is there a known workaround for creating such an auto-incrementing id column?
As a proof of concept:
create table id_test(pk_fld integer primary key generated always as identity);
--FYI, in Postgres 14+ the OVERRIDING SYSTEM VALUE won't be needed;
--it is a workaround for a bug in 13 and earlier.
insert into id_test overriding system value values (default), (default);
select * from id_test;
pk_fld
--------
1
2
alter table id_test add column id_fld integer;
update id_test set id_fld = 0;
alter table id_test alter COLUMN id_fld set not null;
alter table id_test alter COLUMN id_fld add generated always as identity;
update id_test set id_fld = default;
select * from id_test;
pk_fld | id_fld
--------+--------
1 | 1
2 | 2
Basically this breaks the process down into steps. Obviously this is just a toy table and not representative of your setup. I would try it on a test table that is a subset of your actual table to see what happens to disk space consumption. It would not hurt to run VACUUM after the updates to return the space from the dead rows to the database.
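A minimal sketch of building such a subset, assuming your table is named mytable (TABLESAMPLE requires PostgreSQL 9.5+; the 10 percent sample is arbitrary):
-- Copy roughly 10% of the rows into a scratch table for the experiment.
CREATE TABLE mytable_test AS
SELECT * FROM mytable TABLESAMPLE SYSTEM (10);
-- Watch its size while replaying the steps above against it.
SELECT pg_size_pretty(pg_total_relation_size('mytable_test'));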
Adding a serial column is adding an integer column with a non-constant DEFAULT value. This will cause PostgreSQL to rewrite the table, because the new column value has to be added to all existing rows. So PostgreSQL writes a new copy of the table and discards the old one after it is done. This will require more than double the disk space of the original table temporarily, which explains why you run out of disk space.
You can split the operation into several steps:
ALTER TABLE mytable ADD id bigint;
CREATE SEQUENCE mytable_id_seq OWNED BY mytable.id;
ALTER TABLE mytable ALTER id SET DEFAULT nextval('mytable_id_seq');
This will not rewrite the table, and it will leave the existing rows untouched: the value of id for those rows will be NULL.
You will probably want to fill in the existing rows so that the column can eventually be set NOT NULL, but be careful: if you update them all at once, you will run out of disk space as well, because in PostgreSQL an UPDATE writes a complete new version of the row to the table. You'd have to update the rows in batches and run VACUUM between these updates, as sketched below.
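A minimal sketch of such a batched backfill (the batch size of 100000 is arbitrary, and each statement must run as its own transaction, since VACUUM cannot run inside a transaction block):
-- Fill one batch of rows that still have no id.
UPDATE mytable
SET id = nextval('mytable_id_seq')
WHERE ctid IN (SELECT ctid FROM mytable WHERE id IS NULL LIMIT 100000);
-- Reclaim the space of the old row versions before the next batch.
VACUUM mytable;
-- Repeat both statements until the UPDATE reports 0 rows, then:
ALTER TABLE mytable ALTER id SET NOT NULL;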
All in all, this is rather annoying and complicated. So do yourself a favor and increase the disk space. That is the simple and best solution.

Is there a method to do an ALTER COLUMN in Postgres 12 on a huge table without waiting a lifetime?

I am trying to convert a column from bigint to smallint:
ALTER TABLE huge ALTER COLUMN result_code TYPE SMALLINT;
It takes 28 hours; is there a smarter method? The table has sequences, keys and foreign keys.
The table has to be rewritten, and you have to wait.
If you have several columns whose data type you want to change, you can use several ALTER COLUMN clauses in a single ALTER TABLE statement and save time that way.
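For example (the second column name is hypothetical), one statement means one table rewrite instead of two:
ALTER TABLE huge
    ALTER COLUMN result_code TYPE smallint,
    ALTER COLUMN other_code TYPE smallint;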
An alternative idea would be to use logical replication: set up an empty copy of the database (pg_dump -s), where your large table is defined with smallint columns. Replicate your database to that database, and switch over as soon as replication has caught up.
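A minimal sketch of that approach (the publication, subscription, and connection parameters are made up, and wal_level = logical must be set on the source):
-- On the source database:
CREATE PUBLICATION huge_pub FOR TABLE huge;
-- On the copy created with pg_dump -s, where result_code is already smallint:
CREATE SUBSCRIPTION huge_sub
    CONNECTION 'host=source-host dbname=mydb user=replicator'
    PUBLICATION huge_pub;
-- Once the subscription has caught up, switch the application over.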

PostgreSQL ADD COLUMN DEFAULT NULL locks and performance

I have a table in my PostgreSQL 9.6 database with 3 million rows. This table already has a null bitmap (it has 2 other DEFAULT NULL fields). I want to add a new nullable boolean column to it. I am stuck on the difference between these two statements:
ALTER TABLE my_table ADD COLUMN my_column BOOLEAN;
ALTER TABLE my_table ADD COLUMN my_column BOOLEAN DEFAULT NULL;
I think that there is no difference between these statements, but:
I can't find any proof of it in the documentation. The documentation says that providing a DEFAULT value for the new column makes PostgreSQL rewrite all the tuples, but I don't think that's true in this case, because the default value is NULL.
I ran some tests on a copy of this table, and the first statement (without DEFAULT NULL) took a little more time than the second. I can't understand why.
My questions are:
Will PostgreSQL use the same lock type (ACCESS EXCLUSIVE) for those two statements?
Will PostgreSQL rewrite all tuples to add NULL value to every of them in case that I use DEFAULT NULL?
Is there any difference between those two statements?
There's an issue in the response of Vao Tsun in point 2.
If you use ALTER TABLE my_table ADD COLUMN my_column BOOLEAN; it won't rewrite all the tuples; it is just a change in the metadata.
But if you use ALTER TABLE my_table ADD COLUMN my_column BOOLEAN DEFAULT NULL, it will rewrite all the tuples, and that takes forever on large tables.
The documentation itself says this:
When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change and does not require any immediate update of the table's data; the added NULL values are supplied on readout, instead.
This tells us that if there is a DEFAULT clause, even if it is NULL, it will rewrite all the tuples.
This is done for performance reasons on later UPDATEs: updating a tuple that was never rewritten may force the new row version to another place on disk, which takes more time.
I tested this myself on PostgreSQL 9.6, when I had to add a column to a table with 300+ million tuples. Without the DEFAULT NULL it took 11 ms, and with the DEFAULT NULL it took more than 30 minutes.
https://www.postgresql.org/docs/current/static/sql-altertable.html
Yes - the same ACCESS EXCLUSIVE, with no exception for DEFAULT NULL or for no DEFAULT mentioned (statistics, "options", constraints, cluster would require a less strict lock I think, but not ADD COLUMN):
Note that the lock level required may differ for each subform. An ACCESS EXCLUSIVE lock is held unless explicitly noted. When multiple subcommands are listed, the lock held will be the strictest one required from any subcommand.
No - it will rather supply NULL in the result on SELECT:
When a column is added with ADD COLUMN, all existing rows in the table are initialized with the column's default value (NULL if no DEFAULT clause is specified). If there is no DEFAULT clause, this is merely a metadata change and does not require any immediate update of the table's data; the added NULL values are supplied on readout, instead.
No - no difference AFAIK. Just a metadata change in both cases (I believe it is the same case expressed with different syntax).
Edit - Demo:
db=# create table so(i int);
CREATE TABLE
Time: 9.498 ms
db=# insert into so select generate_series(1,10*1000*1000);
INSERT 0 10000000
Time: 13899.190 ms
db=# alter table so add column nd BOOLEAN;
ALTER TABLE
Time: 1025.178 ms
db=# alter table so add column dn BOOLEAN default null;
ALTER TABLE
Time: 13.849 ms
db=# alter table so add column dnn BOOLEAN default true;
ALTER TABLE
Time: 14988.450 ms
db=# select version();
version
----------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.1 on x86_64-apple-darwin15.6.0, compiled by Apple LLVM version 8.0.0 (clang-800.0.42.1), 64-bit
(1 row)
Lastly, to head off speculation that this is data-type specific:
db=# alter table so add column t text;
ALTER TABLE
Time: 25.831 ms
db=# alter table so add column tn text default null;
ALTER TABLE
Time: 13.798 ms
db=# alter table so add column tnn text default 'null';
ALTER TABLE
Time: 15440.318 ms

Attempts to alter a Postgresql column type from varchar to bytea hangs indefinitely

I've got a table with 4 rows in it in a non-production database used for development. There are 2 varchar columns that I want to convert to bytea. I don't care about the contents so I could of course drop the columns and then add them back, but I became confused when I tried to just change the type:
alter table whatever
alter column col_1 set data type bytea using null,
alter column col_2 set data type bytea using null;
When I try that, the psql client just hangs. By that I mean that it just sits there giving no feedback until I eventually hit ^C and it aborts. I've tried that with a little test table and it works fine, but for some reason it doesn't work on the real table (which, really, is also just a "little test table").
The using clause doesn't seem to make a difference one way or the other; I can leave it out or give other values, and the command does the same thing.
I don't get an error, I just don't get anything. Is that what I should expect?
I'm on 9.1 on Ubuntu 14.10, if it matters.
I don't care about the contents
In that case, this works on an empty table:
ALTER TABLE tablename
ALTER COLUMN colname TYPE bytea USING colname::bytea
;
Simple: the ALTER TABLE hangs because it needs an ACCESS EXCLUSIVE lock on the table, and some other session (an open, idle-in-transaction session, for example) already holds a conflicting lock on it.
Get the active locks from pg_locks:
SELECT t.relname, l.locktype, page, virtualtransaction, pid, mode, granted
FROM pg_locks l, pg_stat_all_tables t
WHERE l.relation = t.relid
ORDER BY relation ASC;
Copy the pid (e.g. 14210) from the above result and substitute it in the command below:
SELECT pg_terminate_backend(14210);

Query rows by time of creation?

I have a table that contains no date or time related fields. Still I want to query that table based on when records/rows were created. Is there a way to do this in PostgreSQL?
I prefer an answer about doing it in PostgreSQL directly. But if that's not possible, can hibernate do it for PostgreSQL?
Basically: no. There is no automatic timestamp for rows in PostgreSQL.
I usually add a column like this to my tables (ignoring time zones):
ALTER TABLE tbl ADD COLUMN log_in timestamp DEFAULT localtimestamp NOT NULL;
As long as you don't manipulate the values in that column, you have your creation timestamp. You can add a trigger and / or restrict write privileges to avoid tampering with the values; see the sketch below.
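A minimal sketch of such a protective trigger (the function and trigger names are made up; tbl and log_in are from the statement above):
CREATE FUNCTION keep_log_in() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
   NEW.log_in := OLD.log_in;  -- silently ignore attempts to change the value
   RETURN NEW;
END
$$;
CREATE TRIGGER tbl_keep_log_in
BEFORE UPDATE OF log_in ON tbl
FOR EACH ROW EXECUTE PROCEDURE keep_log_in();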
Second-class options
If you have a serial column, you could at least tell with some probability in what order rows were entered. That's not 100% reliable, because the values can be changed by hand, and applications can get values from the sequence and INSERT out of order.
If you created your table WITH (OIDS=TRUE), then the OID column could be some indication - unless your database is heavily written and / or very old, in which case you may have gone through OID wrap-around and later rows can have a smaller OID. That's one of the reasons why this feature is hardly used any more.
The default depends on the setting of default_with_oids. I quote the manual:
The parameter is off by default; in PostgreSQL 8.0 and earlier, it was on by default.
If you have not updated your rows, gone through a dump / restore cycle, run VACUUM FULL or CLUSTER, etc., a plain SELECT * FROM tbl returns all rows in the order they were entered. But this is very unreliable and implementation-dependent: PostgreSQL (like any RDBMS) does not guarantee any order without an ORDER BY clause.