AWS database: updating a single column adds an enormous amount of data - PostgreSQL

I'm retrieving data from an AWS database using PgAdmin. This works well. The problem is that I have one column that I set to True after retrieving the corresponding row; originally it is Null. Doing so adds an enormous amount of data to my database.
I have checked that this is not due to other processes: it only happens when my program is running.
I am certain no rows are being added, I have checked the number of rows before and after and they're the same.
Furthermore, it only does this for specific tables: when I update other tables in the same database with the same process, the database size stays the same. It also does not always increase the database size; only once every couple of changes does the total size increase.
How can changing a single boolean from Null to True add 0.1 MB to my database?
I'm using the following commands to check my database makeup:
To get table sizes:
SELECT
    relname AS "Table",
    pg_total_relation_size(relid) AS "Size",
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS "External Size"
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
To get number of rows:
SELECT schemaname,relname,n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
To get database size:
SELECT pg_database_size('mydatabasename')
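The last query returns raw bytes; wrapping it in pg_size_pretty gives a human-readable figure (optional):
SELECT pg_size_pretty(pg_database_size('mydatabasename'));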

If you have not changed it, the fillfactor on the table is 100%, since that is the default.
This means that every change in your table marks the old row version as obsolete and writes a new copy of the updated row. The issue is even worse if you have indexes on the table, since those have to be updated on every row change as well. As you can imagine, this hurts UPDATE performance too.
So technically, if you read the whole table and update even the smallest column of every row you read, the table roughly doubles in size when the fillfactor is 100.
What you can do is ALTER the table to lower its fillfactor, then VACUUM it:
ALTER TABLE your_table SET (fillfactor = 90);
VACUUM FULL your_table;
Of course, with this step your table will be about 10% bigger, but Postgres will reserve that space for your updates, so the table won't keep growing with your process.
The reason autovacuum helps is that it cleans up the obsolete rows periodically and therefore keeps the table at roughly the same size. But it puts extra load on your database. If you know you will run operations like the one described in the question, I would recommend tuning the fillfactor to your needs.
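If you are not sure whether the fillfactor was ever changed, the current storage parameters are visible in pg_class (an empty reloptions means everything is at its default, i.e. fillfactor 100). A quick check, using the same placeholder table name as above:
-- Show per-table storage parameters such as fillfactor
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'your_table';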

The problem is that (source):
"In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table"
Furthermore, we did not always close the cursor, which also increased the database size while the program was running.
One last problem is that we were running one huge query, not allowing the system to autovacuum properly. This problem is described in more detail here.
Our solution was to re-approach the problem so that the rows did not have to be updated at all. Another option we considered but did not try is to pause the process every once in a while so that autovacuum can catch up (sketched below).
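A rough sketch of that batching idea; the table and column names here are hypothetical and the batch size is arbitrary:
-- Update a limited batch, then COMMIT so the old row versions become cleanable.
UPDATE my_table
SET    processed = TRUE
WHERE  id IN (SELECT id FROM my_table WHERE processed IS NULL LIMIT 10000);
-- Outside the transaction (VACUUM cannot run inside one), reclaim the dead tuples:
VACUUM my_table;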

What do you mean by "adds data"? To all the data files? To some specific files?
To get a precise answer you should supply more details, but generally speaking, any DB operation will add data to the transaction log (WAL), and possibly to other files.

Related

Is estimated row count accurate when only inserts are done in a table?

We use PostgreSQL for analytics. Three typical operations we do on tables are:
Create table as select
Create table followed by insert in table
Drop table
We are not doing any UPDATE, DELETE, etc.
For this situation, can we assume that the estimate is accurate?
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
With autovacuum running (which is the default), ANALYZE and VACUUM are fired up automatically; both of them update reltuples. Basic configuration parameters for ANALYZE, which typically runs more often (quoting the manual):
autovacuum_analyze_threshold (integer)
Specifies the minimum number of inserted, updated or deleted tuples
needed to trigger an ANALYZE in any one table. The default is 50
tuples. This parameter can only be set in the postgresql.conf file
or on the server command line; but the setting can be overridden for
individual tables by changing table storage parameters.
autovacuum_analyze_scale_factor (floating point)
Specifies a fraction of the table size to add to
autovacuum_analyze_threshold when deciding whether to trigger an
ANALYZE. The default is 0.1 (10% of table size). This parameter can
only be set in the postgresql.conf file or on the server command
line; but the setting can be overridden for individual tables by
changing table storage parameters.
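The per-table override mentioned in the quote could look like this; the table name matches the example above and the values are purely illustrative:
-- Make autoanalyze fire more often for this one table
ALTER TABLE mytable SET (
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_analyze_threshold    = 100
);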
Another quote gives insight to details:
For efficiency reasons, reltuples and relpages are not updated
on-the-fly, and so they usually contain somewhat out-of-date values.
They are updated by VACUUM, ANALYZE, and a few DDL commands such
as CREATE INDEX. A VACUUM or ANALYZE operation that does not
scan the entire table (which is commonly the case) will incrementally
update the reltuples count on the basis of the part of the table it
did scan, resulting in an approximate value. In any case, the planner
will scale the values it finds in pg_class to match the current
physical table size, thus obtaining a closer approximation.
Estimates stay reasonably up to date accordingly. You can change autovacuum settings to be more aggressive. You can even do this per table. See:
Aggressive Autovacuum on PostgreSQL
On top of that, you can scale estimates like Postgres itself does it. See:
Fast way to discover the row count of a table in PostgreSQL
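A sketch of that scaling, similar to what the planner does; GREATEST guards against division by zero, and the result is not meaningful for a table that has never been vacuumed or analyzed:
-- Rows per page (from pg_class) times the table's current number of pages
SELECT (reltuples / GREATEST(relpages, 1)
        * (pg_relation_size(oid) / current_setting('block_size')::int))::bigint AS estimate
FROM   pg_class
WHERE  oid = 'mytable'::regclass;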
Note that VACUUM (of secondary relevance to your case) wasn't triggered by only INSERTs before Postgres 13. Quoting the release notes:
Allow inserts, not only updates and deletes, to trigger vacuuming
activity in autovacuum (Laurenz Albe, Darafei
Praliaskouski)
Previously, insert-only activity would trigger auto-analyze but not
auto-vacuum, on the grounds that there could not be any dead tuples to
remove. However, a vacuum scan has other useful side-effects such as
setting page-all-visible bits, which improves the efficiency of
index-only scans. Also, allowing an insert-only table to receive
periodic vacuuming helps to spread out the work of “freezing” old
tuples, so that there is not suddenly a large amount of freezing work
to do when the entire table reaches the anti-wraparound threshold all
at once.
If necessary, this behavior can be adjusted with the new parameters
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor, or the equivalent
table storage options.
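On Postgres 13 or later, the per-table form of those new parameters could look like this (illustrative values):
-- Let inserts alone trigger autovacuum sooner on this table
ALTER TABLE mytable SET (
    autovacuum_vacuum_insert_threshold    = 10000,
    autovacuum_vacuum_insert_scale_factor = 0.05
);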

Cleaning up files from a table without deleting rows in PostgreSQL 9.6.3

I have a table with files, and various relations to this table; the files are stored as bytea. I want to free up the space occupied by old files (according to a timestamp), but the rows should still be present in the table.
Is it enough to set the bytea field to NULL? Will the data actually be deleted from the table this way?
In PostgreSQL, updating a row creates a new tuple (row version), and the old one is left to be deleted by autovacuum.
Also, larger bytea attributes will be stored out-of-line in the TOAST table that belongs to the table.
When you set the bytea attribute to NULL (which is the right thing to do), two things will happen:
The main table will become bigger because of all the new tuples created by the UPDATE. Autovacuum will free the space, but not shrink the table (the empty space can be re-used by future data modifications).
Entries in the TOAST table will be deleted. Again, autovacuum will free the space, but the table won't shrink.
So what you will actually observe is that after the UPDATE, your table uses more space than before.
You can get rid of all that empty space by running VACUUM (FULL) on the table, but that will block concurrent access to the table for the duration of the operation, so be ready to schedule some downtime (you'll probably do that for the UPDATE anyway).
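A minimal sketch of the whole operation; the table name, column names, and retention interval are hypothetical:
-- Free the TOASTed bytea data but keep the rows
UPDATE documents
SET    file_data = NULL
WHERE  uploaded_at < now() - interval '1 year';

-- Then compact the table; this takes an ACCESS EXCLUSIVE lock for its duration
VACUUM FULL documents;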

Postgres: parallel/efficient loading of a huge amount of data with psycopg

I want to load many rows from a CSV file.
The files contain data like "article_name, article_time, start_time, end_time".
There is a constraint on the table: for the same article name, I don't insert a new row if the new article_time falls within an existing range [start_time, end_time] for that article.
That is: don't insert row y if there exists a row x such that article_name_y = article_name_x and article_time_y lies inside the range [start_time_x, end_time_x].
I tried with psycopg by selecting the existing article names and checking manually for an overlap --> too slow.
I tried again with psycopg, this time by setting an EXCLUDE USING constraint and inserting with "on conflict do nothing" (so that it does not fail), but it was still too slow.
I tried the same thing but inserting many values per execute() call (psycopg): it got a little better (1M rows processed in almost 10 minutes), but still not fast enough for the amount of data I have (500M+ rows).
I tried to parallelize by running the same script several times on different files, but the timing didn't get any better, I guess because of the locks taken on the table each time we want to write something.
Is there any way to create a lock only on the rows that share the same article_name (and not a lock on the whole table)?
Could you please help with any idea to make this parallelizable and/or more time efficient?
Lots of thanks, folks.
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement (see the sketch after this list).
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.
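A sketch of that staging-table approach. The target table name, the staging column types, and the file path are guesses, and it assumes the exclusion constraint already exists on the target table; COPY here reads a server-side file, so from psycopg you would stream the data via COPY ... FROM STDIN instead:
-- Staging table with no constraints or indexes; UNLOGGED skips WAL for speed
CREATE UNLOGGED TABLE articles_staging (
    article_name text,
    article_time timestamptz,
    start_time   timestamptz,
    end_time     timestamptz
);

COPY articles_staging FROM '/path/to/file.csv' WITH (FORMAT csv);

-- The exclusion constraint on the target table acts as the arbiter:
-- rows that conflict with an existing range are silently skipped
INSERT INTO articles (article_name, article_time, start_time, end_time)
SELECT article_name, article_time, start_time, end_time
FROM articles_staging
ON CONFLICT DO NOTHING;

DROP TABLE articles_staging;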

Database size doubles on update to single new column

I have a fairly simple table used to drive tile processing for a web mapping application.
    Column     |           Type           |                        Modifiers
---------------+--------------------------+----------------------------------------------------------
 id            | integer                  | not null default nextval('wmts_tiles_id_seq'::regclass)
 tile_matrix   | integer                  |
 rowx          | integer                  |
 coly          | integer                  |
 geom          | geometry(Geometry,27700) |
 processed     | smallint                 |
Indexes:
    "wmts_tiles_pkey" PRIMARY KEY, btree (id)
    "wmts_tiles_wmts_tile_matrix_x_y" UNIQUE CONSTRAINT, btree (tile_matrix, rowx, coly)
    "ix_spatial_wmts_tiles" gist (geom)
    "ix_tile_matrix_processed" btree (tile_matrix, processed)
with various indexes (one spatial) and constraints as shown. This table has 240 million rows, and pg_relation_size and pg_total_relation_size indicate that it is 66 GB, of which half is the indexes and half the data.
I added a single date column and then ran an update to populate it,
alter table wmts_tiles add column currency_date date;
update wmts_tiles set currency_date = '2014-05-01';
After this, the size went to 133 GB, i.e., it doubled. Once I ran VACUUM FULL on the table, the size shrank back to 67 GB, i.e., 1 GB larger than before, which is what you would expect after adding 240 million rows of a 4-byte field (date).
I understand that there will often be a reasonable percentage of dead rows in a table where a lot of inserts and deletes are happening, but why would a table double in size under one single update, and is there anything I can do to prevent it? Note: this update was the only transaction running, and the table had just been dumped and recreated, so the data pages and indexes were in a compact state prior to the update.
EDIT: I have seen this question, Why does my postgres table get much bigger under update?, and I understand that to support MVCC the updated rows have to be written out again. What I don't understand is why the table stays twice the size until I explicitly run VACUUM FULL.
Most of this is covered by this prior question.
The reason it doesn't shrink is that PostgreSQL doesn't know you want it to. It's inefficient to allocate disk space (grow a table) and then release it (shrink the table) repeatedly. PostgreSQL prefers to grow a table then keep the disk space, marking it empty and ready for re-use.
Not only is the allocate-and-release cycle inefficient, but the OS also only permits releasing space at the end of a file*. So PostgreSQL has to move all the rows from the end of the file, which was the only place it could write them when you did the update, to the now-free space at the start. It couldn't do this as it went, because it couldn't overwrite any of the old data until the update transaction committed.
If you know you won't need the space again, you can use VACUUM FULL to compact the table and release the space.
There's no periodic vacuum full done by autovacuum, partly because it might be quite bad for performance if the table just has to be expanded again, partly because vacuum full is quite I/O intensive (so it'll slow everything else down) and partly because vacuum full requires an access exclusive lock that prevents any concurrent access to the table. PostgreSQL would need an incremental table repack command/feature, which it doesn't have yet, for this to be possible with autovacuum. Patches are welcome... though this part of the code is very complex and getting it right would not be a beginner job.
* Yes, I know you can mark large regions of zeroes within a file as sparse regions on some platforms. Feel free to submit a PostgreSQL patch for that. In practice you'll have to compact anyway, because you don't find large regions with free pages in the table normally. Plus you'd have to deal with page headers.
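If you want to see that empty-but-reserved space without running VACUUM FULL, the pgstattuple contrib extension reports it directly, assuming it is available on your server (note that it scans the whole table; pgstattuple_approx is a cheaper variant):
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- free_space / free_percent is the reusable space autovacuum has reclaimed
SELECT table_len, tuple_count, dead_tuple_count, free_space, free_percent
FROM   pgstattuple('wmts_tiles');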

postgres too slow

I'm doing massive tests on a Postgres database...
so basically I have 2 tables into which I inserted 40,000,000 records (let's say table1) and 80,000,000 records (table2).
After this I deleted all those records.
Now if I do SELECT * FROM table1 it takes 199,000 ms.
I can't understand what's happening.
Can anyone help me with this?
If you delete all the rows from a table, they are marked as deleted but not actually removed from disk immediately. To remove them you need a VACUUM operation, which should kick in automatically some time after such a big delete. Even so, that will just leave the pages empty but still taking up quite a bit of disk space unless you run VACUUM FULL.
If you regularly need to delete all the rows from a large table, consider using TRUNCATE instead, which simply removes the table's data files.
The tuples are logically deleted, not physically.
You should perform a VACUUM on the db.
More info here
If you are deleting all records, use TRUNCATE, not DELETE. Also, the first time you run the query the relation will not be cached (in the filesystem cache or shared buffers), so it will be slower than on subsequent runs.
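To make the advice above concrete, a small sketch using the table names from the question:
-- If the tables really need to be emptied completely, TRUNCATE removes the data
-- files immediately instead of leaving dead tuples behind:
TRUNCATE table1, table2;

-- If the rows were already removed with DELETE, reclaim the space explicitly
-- (this rewrites the table and locks it for the duration):
VACUUM FULL table1;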