PostgreSQL - large delete from 500 GB DB / auto-vacuum?

I need to delete around 80% of my 500 GB PostgreSQL database.
I have successfully run a DELETE command for around 50 GB of rows so far, and paused before proceeding. (This took a long time, perhaps an hour.)
I notice that after deleting around 50 GB of data, no extra disk space has been freed, but some memory-intensive postgres processes show up when I run 'htop'. Am I correct in assuming this is down to dead rows, which need to be vacuumed before the disk space is released?
The second part of this question: if I am not mistaken about the first part, am I better off deleting all the rows and then allowing auto-vacuum to take place? It appears auto-vacuum (or some other intensive background process) started by itself before I had a chance to continue my list of row deletion commands. Do I just proceed, or should I gracefully tell it to stop first?

After a big delete, autovacuum is sure to run. That is as designed and should not interfere with you deleting still more rows.
While autovacuum frees the dead space in the tables, it does not return the space to the operating system. Rather, it remains as free space in the table and can be reused for future inserts.
If you want to shrink the tables, run VACUUM (FULL) on them, but be warned that this rewrites the table, so it temporarily uses additional storage space and blocks all concurrent activity on the table.
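As a rough sketch of what that looks like in practice, you might compare the table size before and after the rewrite. The table name big_table is a placeholder, not something from the question.

```sql
-- Hypothetical table name; substitute your own.
-- Size of the table including indexes and TOAST, before the rewrite.
SELECT pg_size_pretty(pg_total_relation_size('big_table')) AS size_before;

-- Plain VACUUM: marks dead space as reusable inside the table's files.
VACUUM (VERBOSE) big_table;

-- VACUUM (FULL): rewrites the table into a new file and returns the
-- freed space to the operating system. It takes an ACCESS EXCLUSIVE lock
-- and needs enough free disk for the new copy while it runs.
VACUUM (FULL, VERBOSE) big_table;

SELECT pg_size_pretty(pg_total_relation_size('big_table')) AS size_after;
```

Running the plain VACUUM first is optional here; it is included only to show the difference between the two forms.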
If you have to do mass deletes like that regularly, consider partitioning the table. It makes bulk deletes painless.
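A minimal sketch of what partitioning could look like, assuming PostgreSQL 10 or later (declarative partitioning) and a table that is naturally split by date. The log_entries table, its columns, and the monthly ranges are all hypothetical.

```sql
-- Hypothetical schema: a table partitioned by month.
CREATE TABLE log_entries (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE log_entries_2024_01 PARTITION OF log_entries
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE log_entries_2024_02 PARTITION OF log_entries
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Removing a whole month drops the partition's files: no dead rows are
-- left behind and the disk space is released right away.
DROP TABLE log_entries_2024_01;

-- Alternatively, detach first to keep the data as a standalone table:
-- ALTER TABLE log_entries DETACH PARTITION log_entries_2024_02;
```

This only helps when the rows to be deleted line up with partition boundaries; deleting an arbitrary subset of rows inside a partition still leaves dead rows to vacuum.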

A better solution would be to TRUNCATE the table.
In my scenario I deleted a specific row that was taking up a lot of disk space, but it was too much data for VACUUM to clear in a reasonable amount of time.
I ended up duplicating the table (my_table stands in for the real table name):
CREATE TABLE dupe_table AS SELECT * FROM my_table;
Truncating the original table:
TRUNCATE my_table;
Finally moving the data back:
INSERT INTO my_table (column1, column2, column3)
SELECT column1, column2, column3
FROM dupe_table;
NOTE: you could lose data if a transaction writes to the original table between the creation of the duplicate table and the truncation of the original table.
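One way to narrow that window, sketched here under the assumption that a brief exclusive lock on the table is acceptable, is to do all three steps in a single transaction. TRUNCATE is transactional in PostgreSQL, so either everything commits or nothing does. The names my_table and keep_rows, the created_at column, and the WHERE condition are placeholders.

```sql
BEGIN;

-- Block concurrent reads and writes until COMMIT so no changes are lost.
LOCK TABLE my_table IN ACCESS EXCLUSIVE MODE;

-- Copy only the rows worth keeping; the condition is hypothetical.
CREATE TEMPORARY TABLE keep_rows ON COMMIT DROP AS
    SELECT * FROM my_table WHERE created_at >= DATE '2024-01-01';

-- The old data files are removed when the transaction commits.
TRUNCATE my_table;

INSERT INTO my_table SELECT * FROM keep_rows;

COMMIT;
```

TRUNCATE will refuse to run if other tables reference my_table through foreign keys, and the table is unavailable to other sessions for the duration, so this is only a fit when the kept portion is small enough to copy quickly.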

Related

Is it possible to run VACUUM FULL for a short while and get some benefit?

Is it possible to run PostgreSQL 11's VACUUM FULL for a short while and then get some benefit? Or does cancelling it midway cause all of its progress to be lost?
I've read about pg_repack (https://aws.amazon.com/blogs/database/remove-bloat-from-amazon-aurora-and-rds-for-postgresql-with-pg_repack/) but the way it works (creating new tables, copying data, etc.) sounds risky to me. Is that my paranoia or is it safe to use on a production database?
Backstory: I am working with a very large production database on AWS Aurora PostgreSQL 11. Many of the tables had tens of millions of records but have been pruned down significantly. The problem is that the table sizes on disk (and in the snapshots) have not decreased because DELETE and VACUUM (without FULL) do not shrink the files. These tables are in the hundreds of gigabytes range and I'm afraid running VACUUM FULL will take forever.
No. VACUUM FULL writes a new physical file for the table. Stopping it before it finishes voids the work done so far.
The manual:
VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on each table while it is being processed.
This is the main reason why community tools like pg_repack or pg_squeeze were created, which are more flexible, less blocking, and often faster, too. (I don't think pg_squeeze is available for Aurora, yet).
pg_repack might be a bit of overkill. You can instead just delete tuples from the end of the table and reinsert them towards the front of the table (reusing space already marked as free by an earlier VACUUM), at which point another ordinary VACUUM can truncate away the free space at the end of the table.
with d as (delete from mytable where ctid>='(50000,1)' returning *)
insert into mytable select * from d;
You can use pg_freespacemap to find a good starting point for the ctid criterion.
This might not behave well if you have triggers or FK constraints, and it might bloat indexes such that they would need to be rebuilt (but they probably do anyway). It will also lock a large number of rows at a time, for as long as the re-insert takes to run and commit.
Improvements made since v11 (PostgreSQL 14 added TID range scans) make the ctid scan more efficient in later versions than it is in v11.
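A possible way to pick that starting block, assuming the pg_freespacemap extension can be installed on your instance and reusing the mytable name from the snippet above: summarise free space per run of blocks and look for the point where the front of the table has enough room to absorb the rows at the back. The bucket size of 10000 blocks is arbitrary.

```sql
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- pg_freespace() returns one row per block: (blkno, avail),
-- where avail is the free space in bytes recorded in the free space map.
SELECT blkno / 10000              AS bucket,   -- group blocks in runs of 10000
       count(*)                   AS blocks,
       pg_size_pretty(sum(avail)) AS free_space
FROM pg_freespace('mytable')
GROUP BY bucket
ORDER BY bucket;
```

The free space map is maintained mainly by VACUUM, so the numbers are most meaningful right after a plain VACUUM of the table.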

best disk saving strategy for "replacement inserts"

Every day I delete hundreds of thousands of records from a large table, then I do some calculations (with new data) and replace every one of the records that I previously deleted. I thought doing a regular vacuum tbl would do the trick. I know it doesn't return disk space to the server, but (because of the pg docs) I thought that because I was inserting about as many records as I was deleting, I wouldn't lose any/much disk space. However, after moving the table to a different namespace (for an unrelated reason) the table went from 117GB to 44GB! So...
Is there a better strategy than this so that my table doesn't bloat:
delete from tbl where ...etc... -- hundreds of thousands of rows removed
insert into tbl (...etc...) values (...etc...) -- hundreds of thousands of rows added back (fresh calcs)
.. repeat the above about 10 times a day ...
vacuum tbl
https://www.postgresql.org/docs/9.6/static/sql-vacuum.html
PostgreSQL 9.6
What I actually did to reduce the table size is in my answer here:
integer out of range and remaining disk space too small to convert id to bigint and other solutions
Edit 1:
The drawbacks to vacuum full are too restrictive for me. I am processing stuff 24/7 so I can't have locks like that, and my available disk space is pretty limited at any point in time. Trying to go about this in a better way.
What you are looking for is "dead space equilibrium", as I like to call it. Say you've got 1M rows and you want to delete and replace 100k of them. You can do it in different ways. Let's suppose you delete 100k and insert 100k right away. The db won't have time to vacuum up those old dead rows, so now your 1M row table has 100k dead rows in it. Over the next 24 hours vacuum will kick in and mark their space as reusable, so the next time you delete / insert, you'll create 100k more dead rows while reusing (most of) the space left by the previous 100k. Your 1M row table now has ~100k dead rows again, which will get reused next time, and so on.
You want to reach a point where your deletes/inserts (or updates) and vacuum are creating / reclaiming dead tuples at an even rate.
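To check whether you are near that equilibrium, one option is to watch the dead-tuple counters and, if vacuum is falling behind, make autovacuum more eager for this one table. This is a sketch: tbl is the table name from the question, and the 0.01 value is an illustrative guess rather than a recommendation.

```sql
-- How many dead tuples are waiting, and when vacuum last ran on tbl.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'tbl';

-- Trigger autovacuum on this table once roughly 1% of its rows are dead,
-- instead of the default 20% (autovacuum_vacuum_scale_factor = 0.2).
ALTER TABLE tbl SET (autovacuum_vacuum_scale_factor = 0.01);
```

If n_dead_tup is still high when the next delete/insert cycle starts, the new rows cannot reuse that space and the table grows; a vacuum that completes between cycles keeps the table close to its steady-state size.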

PostgreSQL - Empty table

I have a table called EVENTS on my PostgreSQL DB schema.
It is empty, i.e. when I execute
SELECT * FROM EVENTS
I get an empty results set.
Nonetheless, the table occupies 5MB of disk space.
I'm executing
SELECT round(pg_total_relation_size('events') / 1024.0 / 1024.0, 2)
And I'm getting 5.13MB.
I tried to explicitly run VACUUM, but it didn't change anything.
Any ideas?
Truncate the table:
truncate events;
From the documentation:
TRUNCATE quickly removes all rows from a set of tables. It has the same effect as an unqualified DELETE on each table, but since it does not actually scan the tables it is faster. Furthermore, it reclaims disk space immediately, rather than requiring a subsequent VACUUM operation. This is most useful on large tables.
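To see where the 5MB sits and confirm that TRUNCATE releases it, something like the following could be run before and after. It is a sketch using the standard size functions; events is the table from the question.

```sql
-- Break the total down into the table itself (heap) and its indexes.
SELECT pg_size_pretty(pg_relation_size('events'))       AS heap,
       pg_size_pretty(pg_indexes_size('events'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('events')) AS total;

TRUNCATE events;

-- Re-run the SELECT above to compare. The heap should now be empty,
-- while each index keeps a small fixed overhead.
```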
If you want to immediately reclaim disk space while keeping the existing rows of a non-empty table, you can use vacuum full:
vacuum full events;
This takes an exclusive lock on the table and rewrites it (in fact, it creates a new copy and drops the old one). It is an expensive operation and generally not recommended on larger tables.
In an RDBMS, some redundant use of disk space is normal. If you have a properly configured autovacuum daemon, the unused space will be used when new rows are inserted.
If you have dead rows or bloat in your table, VACUUM will not actually return the disk space to the operating system; it makes the space reusable, and it is used the next time you insert data into the table.
To reclaim the disk space, try
VACUUM FULL events;

postgres 9.2 table size with pg_total_relation_size

I process a table with ~10^7 rows in the following way: take the last N rows, update them in some way, delete them, then vacuum the table. At the end of each iteration I query pg_total_relation_size. The loop repeats until the table is exhausted. Each iteration lasts several seconds. There are no other queries against this table apart from those mentioned above. The problem is that I keep getting the same result for the table size. It changes only about once every several hours.
So the question is: does postgres store the table size somewhere, or does it calculate it every time the function is invoked? I.e., does my table size really stay the same in spite of the processing?
Your table really does stay the same size on disk despite the DELETEs and VACUUMing you're doing. As per the documentation on VACUUM, ordinary VACUUM only releases space back to the OS if it can do so by truncating free space from the end of the file without rearranging live rows.
The space is still "free" in that PostgreSQL can re-use it for other new rows. It is much, much faster to re-use space that PostgreSQL hasn't given back to the OS than it is to extend a relation with new space, so this is often preferable.
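If you want to see that reusable space directly rather than infer it, the pgstattuple extension reports it. This is a sketch that assumes the extension is available on your server; mytable is a placeholder name.

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- pgstattuple() scans the whole table, so it can be slow on large relations.
SELECT pg_size_pretty(table_len)  AS size_on_disk,
       tuple_count                AS live_rows,
       dead_tuple_count           AS dead_rows,
       pg_size_pretty(free_space) AS reusable_space,
       free_percent
FROM pgstattuple('mytable');
```

In a loop like the one described, size_on_disk would be expected to stay flat while reusable_space grows after each VACUUM.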
The other reason Pg doesn't just give this space back is that it can only give space back to the OS when it's a contiguous chunk with no visible rows until the end of the file. This doesn't happen much so in practice Pg needs to move some rows around to compact the table and allow it to free space at the end, kind of like a defrag on a file system. This is an inefficient and slow process that can counter-intuitively make the table slower to access instead of faster, so it's not always a good idea.
If you have a relation that's mostly but not entirely empty it can be worth doing a VACUUM FULL (Pg 9.0 and above) or CLUSTER (all versions) to free the space. If you expect to refill the table this is usually counter-productive; it's actually better to leave it as-is.
(For what I mean by terms like "live" and "visible" see the documentation on MVCC which will help you understand Pg's table organisation.)
Personally I'd skip the manual VACUUM in your case. Turn autovacuum up if you need to. If you really need to you could consider partitioning your table, processing it partition by partition and TRUNCATE each partition when you're done processing it.

What is the effect on record size of reordering columns in PostgreSQL?

Since Postgres can only add columns at the end of tables, I end up re-ordering by adding new columns at the end of the table, setting them equal to existing columns, and then dropping the original columns.
So, what does PostgreSQL do with the memory that's freed by dropped columns? Does it automatically re-use the memory, so a single record consumes the same amount of space as it did before? But that would require a re-write of the whole table, so to avoid that, does it just keep a bunch of blank space around in each record?
The question is old, but since both answers are wrong or misleading, I'll add another one.
When updating a row, Postgres writes a new row version and the old one is eventually removed by VACUUM after no running transaction can see it any more.
Plain VACUUM does not return disk space from the physical file that contains the table to the system, unless it finds completely dead or empty blocks at the physical end of the table. You need to run VACUUM FULL or CLUSTER to aggressively compact the table and return excess space to the system. This is not typically desirable in normal operation. Postgres can re-use dead tuples to keep new row versions on the same data page, which benefits performance.
In your case, since you update every row, the size of the table is doubled (from its minimum size). It's advisable to run VACUUM FULL or CLUSTER to return the bloat to the system.
Both take an exclusive lock on the table. If that interferes with concurrent access, consider pg_repack, which can do the same without exclusive locks.
To clarify: Running CLUSTER reclaims the space completely. No VACUUM FULL is needed after CLUSTER (and vice versa).
More details:
PostgreSQL 9.2: vacuum returning disk space to operating system
From the docs:
The DROP COLUMN form does not physically remove the column, but simply makes it invisible to SQL operations. Subsequent insert and update operations in the table will store a null value for the column. Thus, dropping a column is quick but it will not immediately reduce the on-disk size of your table, as the space occupied by the dropped column is not reclaimed. The space will be reclaimed over time as existing rows are updated.
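The "invisible but still there" behaviour can be observed in the system catalog. A small sketch, using a throwaway table named demo that exists only for illustration:

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE demo (a integer, b text, c integer);
ALTER TABLE demo DROP COLUMN b;

-- The dropped column is still in the catalog, flagged attisdropped,
-- under a placeholder name such as "........pg.dropped.2........".
SELECT attnum, attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'demo'::regclass
  AND attnum > 0
ORDER BY attnum;
```

Column c keeps attnum 3, which shows that the remaining columns are not renumbered when a column is dropped.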
You'll need to run CLUSTER or VACUUM FULL to reclaim the space.
Why do you "reorder"? There is no order in SQL; it doesn't make sense. If you need a fixed order, tell your queries what order you need or use a view; that's what views are made for.
Disk space will be used again after vacuum; autovacuum will do the job, unless you disabled it.
Your current approach will kill overall performance (table locks), indexes have to be recreated, statistics go down the toilet, etc. And in the end, you end up with the same situation you already had. So why the effort?