Database size doubles on update to single new column - postgresql

I have a fairly simple table used to drive tile processing for a web mapping application.
    Column     |           Type           |                        Modifiers
---------------+--------------------------+----------------------------------------------------------
 id            | integer                  | not null default nextval('wmts_tiles_id_seq'::regclass)
 tile_matrix   | integer                  |
 rowx          | integer                  |
 coly          | integer                  |
 geom          | geometry(Geometry,27700) |
 processed     | smallint                 |
Indexes:
    "wmts_tiles_pkey" PRIMARY KEY, btree (id)
    "wmts_tiles_wmts_tile_matrix_x_y" UNIQUE CONSTRAINT, btree (tile_matrix, rowx, coly)
    "ix_spatial_wmts_tiles" gist (geom)
    "ix_tile_matrix_processed" btree (tile_matrix, processed)
with various indexes (one spatial) and constraints as shown. This table has 240 million rows, and pg_relation_size and pg_total_relation_size indicate that it is 66 GB, of which half is indexes and half is data.
I added a single date column and then ran an update to populate it:
alter table wmts_tiles add column currency_date date;
update wmts_tiles set currency_date = '2014-05-01';
After this, the size went to 133 GB, i.e. it doubled. Once I ran VACUUM FULL on the table, the size shrank back to 67 GB, i.e. 1 GB larger than before, which is what you would expect after adding 240 million rows of a 4-byte field (date).
I understand that there will often be a reasonable percentage of dead rows in a table where a lot of inserts and deletes are happening, but why would a table double in size under one single update, and is there anything I can do to prevent this? Note: this update was the only transaction running, and the table had just been dumped and recreated, so the data pages and indexes were in a compact state before the update.
EDIT: I have seen this question, Why does my postgres table get much bigger under update?, and I understand that to support MVCC the update has to write a new version of every row rather than overwrite the old one in place; what I don't understand is why the table stays at twice the size until I explicitly run VACUUM FULL.

Most of this is covered by this prior question.
The reason it doesn't shrink is that PostgreSQL doesn't know you want it to. It's inefficient to allocate disk space (grow a table) and then release it (shrink the table) repeatedly, so PostgreSQL prefers to grow a table and then keep the disk space, marking it empty and ready for re-use.
Not only is the allocation and release cycle inefficient, but the OS also only permits release of space at the end of a file*. So PostgreSQL has to move all the rows from the end of the file, which was the only place it could write them when you did the update, into the now-free space at the start. It couldn't do this as it went, because it couldn't overwrite any of the old data until the update transaction committed.
If you know you won't need the space again, you can use VACUUM FULL to compact the table and release the space.
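For example, on the table from the question (the size check is the same pg_total_relation_size call mentioned above):
VACUUM FULL wmts_tiles;
SELECT pg_size_pretty(pg_total_relation_size('wmts_tiles'));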
There's no periodic vacuum full done by autovacuum, partly because it might be quite bad for performance if the table just has to be expanded again, partly because vacuum full is quite I/O intensive (so it'll slow everything else down) and partly because vacuum full requires an access exclusive lock that prevents any concurrent access to the table. PostgreSQL would need an incremental table repack command/feature, which it doesn't have yet, for this to be possible with autovacuum. Patches are welcome... though this part of the code is very complex and getting it right would not be a beginner job.
* Yes, I know you can mark large regions of zeroes within a file as sparse regions on some platforms. Feel free to submit a PostgreSQL patch for that. In practice you'll have to compact anyway, because you don't find large regions with free pages in the table normally. Plus you'd have to deal with page headers.

Related

Is estimated row count accurate when only inserts are done in a table?

We use PostgreSQL for analytics. Three typical operations we do on tables are:
Create table as select
Create table followed by insert in table
Drop table
We are not doing any UPDATE, DELETE etc.
For this situation, can we assume that the estimates would be accurate?
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';
With autovacuum running (which is the default), ANALYZE and VACUUM are fired up automatically, and both of them update reltuples. The basic configuration parameters for ANALYZE (which typically runs more often) are, quoting the manual:
autovacuum_analyze_threshold (integer)
Specifies the minimum number of inserted, updated or deleted tuples
needed to trigger an ANALYZE in any one table. The default is 50
tuples. This parameter can only be set in the postgresql.conf file
or on the server command line; but the setting can be overridden for
individual tables by changing table storage parameters.
autovacuum_analyze_scale_factor (floating point)
Specifies a fraction of the table size to add to
autovacuum_analyze_threshold when deciding whether to trigger an
ANALYZE. The default is 0.1 (10% of table size). This parameter can
only be set in the postgresql.conf file or on the server command
line; but the setting can be overridden for individual tables by
changing table storage parameters.
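As a rough worked example with those defaults: a table with 1,000,000 live rows is re-analyzed once about 50 + 0.1 * 1,000,000 = 100,050 rows have been inserted, updated or deleted since the last ANALYZE.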
Another quote from the manual gives insight into the details:
For efficiency reasons, reltuples and relpages are not updated
on-the-fly, and so they usually contain somewhat out-of-date values.
They are updated by VACUUM, ANALYZE, and a few DDL commands such
as CREATE INDEX. A VACUUM or ANALYZE operation that does not
scan the entire table (which is commonly the case) will incrementally
update the reltuples count on the basis of the part of the table it
did scan, resulting in an approximate value. In any case, the planner
will scale the values it finds in pg_class to match the current
physical table size, thus obtaining a closer approximation.
Estimates stay reasonably up to date accordingly. You can change autovacuum settings to be more aggressive, and you can even do this per table. See:
Aggressive Autovacuum on PostgreSQL
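A per-table override might look like this (the table name and values are placeholders; pick values to suit your workload):
ALTER TABLE mytable SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold    = 100
);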
On top of that, you can scale the estimate the way Postgres itself does. See:
Fast way to discover the row count of a table in PostgreSQL
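That approach boils down to scaling reltuples by the table's current size on disk. A sketch, assuming the default 8 kB block size and a table name of mytable (the relpages > 0 filter just skips tables that have never been vacuumed or analyzed):
SELECT (reltuples / relpages) * (pg_relation_size(oid) / 8192) AS estimated_rows
FROM   pg_class
WHERE  oid = 'mytable'::regclass
AND    relpages > 0;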
Note that VACUUM (of secondary relevance to your case) wasn't triggered by INSERT-only activity before Postgres 13. Quoting the release notes:
Allow inserts, not only updates and deletes, to trigger vacuuming
activity in autovacuum (Laurenz Albe, Darafei
Praliaskouski)
Previously, insert-only activity would trigger auto-analyze but not
auto-vacuum, on the grounds that there could not be any dead tuples to
remove. However, a vacuum scan has other useful side-effects such as
setting page-all-visible bits, which improves the efficiency of
index-only scans. Also, allowing an insert-only table to receive
periodic vacuuming helps to spread out the work of “freezing” old
tuples, so that there is not suddenly a large amount of freezing work
to do when the entire table reaches the anti-wraparound threshold all
at once.
If necessary, this behavior can be adjusted with the new parameters
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor, or the equivalent
table storage options.
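On Postgres 13 or later, the equivalent table storage option can be set like this (table name and threshold are placeholders):
ALTER TABLE mytable SET (autovacuum_vacuum_insert_threshold = 10000);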

AWS database: updating a single column adds an extreme amount of data

I'm retrieving data from an AWS database using PgAdmin. This works well. The problem is that I have one column that I set to True after I retrieve the corresponding row, where originally it is set to Null. Doing so adds an enormous amount of data to my database.
I have checked that this is not due to other processes: it only happens when my program is running.
I am certain no rows are being added, I have checked the number of rows before and after and they're the same.
Furthermore, it only does this when changing specific tables; when I update other tables in the same database with the same process, the database size stays the same. It also does not always increase the database size; only once every couple of changes does the total size increase.
How can changing a single boolean from Null to True add 0.1 MB to my database?
I'm using the following commands to check my database makeup:
To get table sizes:
SELECT
    relname AS "Table",
    pg_total_relation_size(relid) AS "Size",
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS "External Size"
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
To get number of rows:
SELECT schemaname,relname,n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
To get database size:
SELECT pg_database_size('mydatabasename')
If you have not changed it, the fillfactor on your table is 100%, since that is the default.
This means that every change to a row in your table marks the old row version as obsolete and creates a new, updated row version. The issue can be even worse if you have indexes on your table, since those have to be updated on every row change too. As you can imagine, this hurts UPDATE performance as well.
So, technically, if you read the whole table and then update even the smallest column of every row you read, the table size can double when your fillfactor is 100.
What you can do is ALTER your table to lower the fillfactor, then VACUUM it:
ALTER TABLE your_table SET (fillfactor = 90);
VACUUM FULL your_table;
Of course, with this step your table will be about 10% bigger, but Postgres will keep some space free for your updates, so the table won't keep growing under your process.
The reason autovacuum helps is that it cleans up the obsolete rows periodically and therefore keeps your table at roughly the same size, but it puts a lot of pressure on your database. If you know you'll be doing operations like the one described in the question, I would recommend tuning the fillfactor for your needs.
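To see how many obsolete row versions your process leaves behind, and whether autovacuum is keeping up, the standard statistics view gives a quick picture:
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;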
The problem is that (source):
"In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table"
Furthermore, we did not always close the cursor, which also increased the database size while the process was running.
One last problem is that we were running one huge query, not allowing the system to autovacuum properly. This problem is described in more detail here.
Our solution was to re-approach the problem so that the rows did not have to be updated. Another solution that we could think of, but have not tried, is to stop the process every once in a while to allow autovacuum to work properly.
What do you mean by "adds data"? To all the data files? To some specific files?
To get a precise answer you should supply more details, but generally speaking, any DB operation will add data to the transaction logs, and possibly to other files.

Cleaning up files from table without deleting rows in postgresql 9.6.3

I have a table with files and various relations to this table; the files are stored as bytea. I want to free up the space occupied by old files (selected by timestamp), but the rows should still be present in the table.
Is it enough to set the bytea field to NULL? Will the data actually be deleted from the table this way?
In PostgreSQL, updating a row creates a new tuple (row version), and the old one is left to be deleted by autovacuum.
Also, larger bytea attributes will be stored out-of-line in the TOAST table that belongs to the table.
When you set the bytea attribute to NULL (which is the right thing to do), two things will happen:
The main table will become bigger because of all the new tuples created by the UPDATE. Autovacuum will free the space, but not shrink the table (the empty space can be re-used by future data modifications).
Entries in the TOAST table will be deleted. Again, autovacuum will free the space, but the table won't shrink.
So what you will actually observe is that after the UPDATE, your table uses more space than before.
You can get rid of all that empty space by running VACUUM (FULL) on the table, but that will block concurrent access to the table for the duration of the operation, so be ready to schedule some down time (you'll probably do that for the UPDATE anyway).
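A minimal sketch of that, assuming a table called files with a bytea column content and a timestamp column created_at (the names and the cutoff are made up for illustration):
UPDATE files
SET    content = NULL
WHERE  created_at < '2016-01-01';

-- only if you need the space back at the operating system level
-- and can afford the ACCESS EXCLUSIVE lock:
VACUUM (FULL) files;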

Wrong (?) Size of PostgreSQL table

I have a table with columns and constraints:
height smallint,
length smallint,
diameter smallint,
volume integer,
idsensorfragments integer,
CONSTRAINT sensorstats_idsensorfragments_fkey FOREIGN KEY (idsensorfragments)
REFERENCES sensorfragments (idsensorfragments) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
(no primary key). There are currently 28,978,112 records in it, but the size of the table is way too large in my opinion.
Result of the query:
select pg_size_pretty(pg_total_relation_size('sensorstats')), pg_size_pretty(pg_relation_size('sensorstats'))
is:
"1849 MB";"1226 MB"
There is just one index, on the idsensorfragments column. Using simple math you can see that one record takes ~66.7 B (?!?!). Can anyone explain to me where this number comes from?
5 columns = 2 + 2 + 2 + 4 + 4 = 14 bytes. I have one index and no primary key. Where do the additional ~50 B per record come from?
P.S. Table was vacuumed, analyzed and reindexed.
You should take a look on how Database Physical Storage is organized, especially on the Page Layout.
PostgreSQL keeps a bunch of extra fields for each tuple (row) and also for each page. Tuples are stored in pages, as the page is the unit the database operates on, typically 8192 bytes in size. So the extra space usage comes from:
PageHeader, 24 bytes;
Tupleheader, 27 bytes;
“invisible” Tuple versions;
reserved free space, according to the Storage Parameters of the table;
NULL indicator array;
(might have missed something more).
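As a rough back-of-the-envelope check against the numbers in the question: the roughly 27 bytes of per-row overhead listed above (tuple header plus line pointer), plus the 14 bytes of actual data, comes to about 41 bytes, which alignment padding rounds up to roughly 44 bytes per heap row. 1226 MB divided by ~29 million rows is indeed about 44 B per row. The remaining ~620 MB is the index on idsensorfragments, at roughly 22 B per entry (4-byte key plus index tuple header, line pointer and alignment), which together gets you to the ~67 B per record observed.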
The layout of physical storage changes between major releases; that's the reason you have to perform a full dump/restore when upgrading. In recent versions, pg_upgrade is of great help in this process.
Did you do a VACUUM FULL or CLUSTER? If not, the unused space is still dedicated to this table and index. These statements rewrite the table; a VACUUM without FULL doesn't do the rewrite.

How to avoid fragmented database storage caused by very frequent updates?

When I have the following table:
CREATE TABLE test
(
"id" integer NOT NULL,
"myval" text NOT NULL,
CONSTRAINT "test-id-pkey" PRIMARY KEY ("id")
)
When doing a lot of queries like the following:
UPDATE "test" set "myval" = "myval" || 'foobar' where "id" = 12345
Then the myval value in that row will get larger and larger over time.
What will postgresql do? Where will it get the space from?
Can I avoid that postgresql needs more than one seek to read a particular myval-column?
Will postgresql do this automatically?
I know that normally I should normalize the data much more, but I need to read the value with one seek. myval will grow by about 20 bytes with each update (that adds data). Some columns will have 1-2 updates, some 1000 updates.
Normally I would just insert a new row instead of doing an update, but then selecting gets slow.
So I came to the idea of denormalizing.
Change the FILLFACTOR of the table to create space for future updates. These can then also be HOT updates, because the text field doesn't have an index; that makes the updates faster and lowers autovacuum overhead, because HOT updates use a micro-vacuum. The CREATE TABLE documentation has some information about FILLFACTOR.
ALTER TABLE test SET (fillfactor = 70);
-- rebuild the table to apply the new fillfactor and leave free space in the existing pages:
VACUUM FULL ANALYZE test;
-- start testing
The value 70 is not a perfect setting; it depends on your unique situation. Maybe you're fine with 90; it could also be 40 or something else.
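To check whether the updates actually end up as HOT updates once the fillfactor leaves room for them, the statistics view can be consulted (using the test table from the question):
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM   pg_stat_user_tables
WHERE  relname = 'test';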
This is related to this question about TEXT in PostgreSQL, or at least the answer is similar. PostgreSQL stores large columns away from the main table storage:
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
So you can expect a large TEXT (or BYTEA or large VARCHAR) value to be stored away from the main table, and something like SELECT id, myval FROM test WHERE id = 12345 will take two seeks to pull both columns off the disk (and more seeks to resolve their locations).
If your UPDATEs really are causing your SELECTs to slow down then perhaps you need to review your vacuuming strategy.