Serial column takes up disproportional amount of space in PostgreSQL - postgresql

I would like to create an auto-incrementing id column that is not a primary key in a PostgreSQL table. The table is currently just over 200M rows and contains 14 columns.
SELECT pg_size_pretty(pg_total_relation_size('mytable'));
The above query reveals that mytable takes up 57 GB on disk. I currently have 30 GB free space remaining on disk after checking with df -h (on Ubuntu 20.04)
What I don't understand is why, after trying to create a SERIAL column, I completely run out of disk space - the query ends up never finishing. I run the following command:
ALTER TABLE mytable ADD COLUMN id SERIAL;
and then see how gradually, my disk space runs out until there is nothing left and the query fails. I am no database expert but it does not make sense. Why would a simple serialized column take up more than half of the space of the table itself, especially when it is not a primary key and therefore has no index? Is there a known workaround to creating such an auto-incrementing id column?

As a proof of concept:
create table id_test(pk_fld integer primary key generated always as identity);
--FYI, in Postgres 14+ the overriding system value won't be needed.
--That is a hack around a bug in 13-
insert into id_test overriding system value values (default), (default);
select * from id_test;
pk_fld
--------
1
2
alter table id_test add column id_fld integer ;
update id_test set id_fld = 0;
alter table id_test alter COLUMN id_fld set not null;
alter table id_test alter COLUMN id_fld add generated always as identity;
update id_test set id_fld = default;
select * from id_test;
pk_fld | id_fld
--------+--------
1 | 1
2 | 2
Basically this breaks the process down into steps. Obviously this is just a toy table and not representative of your setup. I would try it on test table that is a subset of you actual table to see what happens to disk space consumption. It would not hurt to use VACUUM after the updates to return rows to the database.

Adding a serial column is adding an integer column with a non-constant DEFAULT value. This will cause PostgreSQL to rewrite the table, because the new column value has to be added to all existing rows. So PostgreSQL writes a new copy of the table and discards the old one after it is done. This will require more than double the disk space of the original table temporarily, which explains why you run out of disk space.
You can split the operation into several steps:
ALTER TABLE mytable ADD id bigint;
CREATE SEQUENCE mytable_id_seq OWNED BY mytable.id;
ALTER TABLE mytable ALTER id SET DEFAULT nextval('mytable_id_seq');
This will not rewrite the table, and it will leave the existing rows untouched. The value of id for these columns will be NULL.
You probably want to update the existing rows to be NOT NULL, but be careful: if you update them all at once, you will run out of disk space as well, because in PostgreSQL an UPDATE writes a complete new version of the row to the table. You'd have to update the rows in batches and run VACUUM between these updates.
All in all, this is rather annoying and complicated. So do yourself a favor and increase the disk space. That is the simple and best solution.

Related

Is there a method to do an ALTER Column in postgres 12 on an huge table without waiting a lifetime?

Is there a method to do an ALTER COLUMN in postgres 12 on an huge table without waiting a lifetime?
I try to convert a field from bigint to smallint :
ALTER TABLE huge ALTER COLUMN result_code TYPE SMALLINT;
It takes 28 hours, is there a smarter method?
The table has sequences, keys and foreign keys
The table has to be rewritten, and you have to wait.
If you have several columns whose data type you want to change, you can use several ALTER COLUMN clauses in a single ALTER TABLE statement and save time that way.
An alternative idea would be to use logical replication: set up an empty copy of the database (pg_dump -s), where your large table is defined with smallint columns. Replicate your database to that database, and switch over as soon as replication has caught up.

PostgreSQL: increasing a column's length in a very large table

Aurora PostgreSQL, version 10.4.
I have a table with several million rows. One of the columns is defined as character varying(255). Once upon a time, 255 was plenty of room, but now it's not, so I have to make more.
I found this in the PG 9.1 release notes:
Allow ALTER TABLE ... SET DATA TYPE to avoid table rewrites in appropriate cases (Noah Misch, Robert Haas)
For example, converting a varchar column to text no longer requires a rewrite of the table. However, increasing the length constraint on a varchar column still requires a table rewrite.
This suggests that changing to a longer varchar is not practical (since rewriting a table of that size would lock it for an ungodly amount of time), but changing to text would work. Is this correct?
Any other things I should know about when making such a change? I want to avoid data loss, obviously, and I can't afford to make this table inaccessible for more than a short period.
You should have read all release notes.
Because just one version later
Increasing the length limit for a varchar or varbit column, or removing the limit altogether, no longer requires a table rewrite.
You can test that easily for yourself:
postgres=# select version();
version
------------------------------------------------------------
PostgreSQL 10.5, compiled by Visual C++ build 1800, 64-bit
(1 row)
postgres=# \timing on
Timing is on.
postgres=# create table alter_test (id serial, some_col varchar(255));
CREATE TABLE
Time: 22.331 ms
postgres=# insert into alter_test (some_col) select md5(random()::text) from generate_series(1,10e6);
INSERT 0 10000000
Time: 40894.275 ms (00:40.894)
postgres=# alter table alter_test alter column some_col type varchar(500);
ALTER TABLE
Time: 5.297 ms
postgres=#

PostgreSQL id column not defined

I am new in PostgreSQL and I am working with this database.
I got a file which I imported, and I am trying to get rows with a certain ID. But the ID is not defined, as you can see it in this picture:
so how do I access this ID? I want to use an SQL command like this:
SELECT * from table_name WHERE ID = 1;
If any order of rows is ok for you, just add a row number according to the current arbitrary sort order:
CREATE SEQUENCE tbl_tbl_id_seq;
ALTER TABLE tbl ADD COLUMN tbl_id integer DEFAULT nextval('tbl_tbl_id_seq');
The new default value is filled in automatically in the process. You might want to run VACUUM FULL ANALYZE tbl to remove bloat and update statistics for the query planner afterwards. And possibly make the column your new PRIMARY KEY ...
To make it a fully fledged serial column:
ALTER SEQUENCE tbl_tbl_id_seq OWNED BY tbl.tbl_id;
See:
Creating a PostgreSQL sequence to a field (which is not the ID of the record)
What you see are just row numbers that pgAdmin displays, they are not really stored in the database.
If you want an artificial numeric primary key for the table, you'll have to create it explicitly.
For example:
CREATE TABLE mydata (
id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
obec text NOT NULL,
datum timestamp with time zone NOT NULL,
...
);
Then to copy the data from a CSV file, you would run
COPY mydata (obec, datum, ...) FROM '/path/to/csvfile' (FORMAT 'csv');
Then the id column is automatically filled.

ADD COLUMN with DEFAULT value to a huge table

I have a postgresql DB and a table with almost billion of rows.
when I try to add a new column with default value:
ALTER TABLE big_table
ADD COLUMN some_flag integer NOT NULL DEFAULT 0;
The transaction goes on for 30+ min .. and the DB logs starts to shoots warnings.
Any way to optimize the query ?
Besides doing it in batches (which will still take a while):
You could dump the table as COPY statements and write a script to edit the contents of the COPY statements to insert another column (COPY can be CSV IIRC).
Then you just reload your altered COPY dump and it should in theory be faster than the ALTER because COPY will not log transactions.
The other option is to turn off fsync while you run the command... just remember to turn it back on.
You can also do both of the above in batches.
Starting from PostgreSQL 11 this behaviour will change.
Waiting for PostgreSQL 11 – Fast ALTER TABLE ADD COLUMN with a non-NULL default:
So, for the longest time, when you did:
alter table x add column z text;
it was virtually instantaneous. Get a lock on table, add information about new column to system catalogs, and it's done.
But when you tried:
alter table x add column z text default 'some value';
then it took long time. How long it did depend on size of table.
This was because postgresql was actually rewriting the whole table, adding the column to each row, and filling it with default value.
"What happens if you want to set the column to NOT NULL also? Are we back to the slow version in that case or does this handle that as well?"
not null doesn’t change anything. it is a constraint for new rows. so adding a column with “not null default ‘xxx'” will be fast.
I'd consider creating the column without the default and manually updating the rows in batches with intermittent commits to apply the default.

How to avoid fragmented database storage by very often updates?

When I have the following table:
CREATE TABLE test
(
"id" integer NOT NULL,
"myval" text NOT NULL,
CONSTRAINT "test-id-pkey" PRIMARY KEY ("id")
)
When doing a lot of queries like the following:
UPDATE "test" set "myval" = "myval" || 'foobar' where "id" = 12345
Then the row myval will get larger and larger over time.
What will postgresql do? Where will it get the space from?
Can I avoid that postgresql needs more than one seek to read a particular myval-column?
Will postgresql do this automatically?
I know that normally I should try to normalize the data much more. But I need to read the value with one seek. Myval will enlarge by about 20 bytes with each update (that adds data). Some colums will have 1-2 updates, some 1000 updates.
Normally I would just use one new row instead of an update. But then selecting is getting slow.
So I came to the idea of denormalizing.
Change the FILLFACTOR of the table to create space for future updates. This can also be HOT updates because the text field doesn't have an index, to make the update faster and autovacuum overhead lower because HOT updates use a microvacuum. The CREATE TABLE statement has some information about the FILLFACTOR.
ALTER TABLE test SET (fillfactor = 70);
-- do a table rebuild to blow some space in your current table:
VACUUM FULL ANALYZE test;
-- start testing
The value 70 is not the perfect setting, it depends on your unique situation. Maybe you're fine with 90, it could also be 40 or something else.
This is related to this question about TEXT in PostgreSQL, or at least the answer is similar. PostgreSQL stores large columns away from the main table storage:
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
So you can expect a TEXT (or BYTEA or large VARCHAR) column to always be stored away from the main table and something like SELECT id, myval FROM test WHERE id = 12345 will take two seeks to pull both columns off the disk (and more seeks to resolve their locations).
If your UPDATEs really are causing your SELECTs to slow down then perhaps you need to review your vacuuming strategy.