PostgreSQL Long VACUUM

I am currently cleaning up a table with 2 indexes and 250 million active rows and approximately as many dead rows (or more). I issued the command VACUUM FULL ANALYZE from my client computer (laptop) to my server. It has been going about its business for the last 3-4 days or so; I am wondering if it will end anytime soon, for I have much work to do!
The server has a quad-core Xeon 2.66 GHz processor, 12 GB of RAM and a RAID controller attached to 2 x 10K rpm 146 GB SAS HDs in a RAID 1 configuration; it is running SUSE Linux. I am wondering...
Now, firstly, the VACUUM postmaster process seems to be making use of only one core. Secondly, I am not seeing a very high ratio of I/O writes to I/O idle time. Thirdly, from calling procinfo, I can extrapolate that the VACUUM process spends most of its time (88%) waiting for I/O.
So why isn't it utilizing more cores through threads in order to overload the RAID controller (get a high I/O writes to idle ratio)? Why is it waiting for I/O if the I/O load isn't high? Why is it not going faster with all this power/resources at its fingertips? It seems to me that VACUUM can and should be multithreaded, especially if it is working on a huge table and it is the only thing running!
Also, is there a way to configure postgresql.conf to let it multithread such VACUUMs? Can I kill it and still benefit from its partial clean-up? I need to work on that table.
[I am using PostgreSQL 8.1]
Thx again

You don't say what version of PostgreSQL you are using. Is it possible it is pre-8.0?
I had this exact same scenario. Your best bet:
kill the vacuum
back up the table with pg_dump -t option
drop the table
restore the table
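If you go that route, here is a minimal sketch of the first step, cancelling the running VACUUM from another session (on 8.1 the pg_stat_activity columns are procpid and current_query; on 9.2 and later they are pid and query; bigtable below is a placeholder for your table name). The dump/drop/restore steps are then done from the shell with pg_dump -t and psql.
SELECT pg_cancel_backend(procpid)
FROM pg_stat_activity
WHERE current_query LIKE 'VACUUM%';
-- then: pg_dump -t bigtable, DROP TABLE bigtable, and restore the dump with psql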
If you are using 8.x, look at the autovacuum options. VACUUM is single-threaded; there's nothing you can do to make it use multiple threads.

Some quick tips:
Run VACUUM FULL VERBOSE so you can see what is going on.
Drop all indexes before the VACUUM. It's faster to rebuild them than to vacuum them. You also need to rebuild them now and then because VACUUM FULL isn't good enough (especially on a PostgreSQL as old as 8.1).
Set maintenance_work_mem really high (a sketch follows this list).
Use a newer PostgreSQL. Btw, 8.4 will have a huge improvement in vacuuming.
An alternative to VACUUM is to dump and restore.
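Put together, a hedged sketch of the first three tips might look like the following; bigtable and its index definitions are placeholders.
SET maintenance_work_mem = 1048576;   -- in kB on 8.1; newer versions also accept values like '1GB'
DROP INDEX bigtable_idx1;
DROP INDEX bigtable_idx2;
VACUUM FULL VERBOSE bigtable;
CREATE INDEX bigtable_idx1 ON bigtable (col1);
CREATE INDEX bigtable_idx2 ON bigtable (col2);
ANALYZE bigtable;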
Edit: Since 9.0 VACUUM FULL rewrites the whole table. It's basically the same thing as doing a dump + restore, so running REINDEX is unnecessary.

Are you sure you don't have anything ongoing that could lock tables and prevent vacuum from running?
(Anyway, it's best to use vacuum_cost_delay so that vacuum is not disruptive to production.)
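For example, a throttled manual VACUUM might look like this (the values are illustrative; vacuum_cost_delay defaults to 0 for manual VACUUM, so cost-based throttling is off unless you set it):
SET vacuum_cost_delay = 20;    -- sleep 20 ms each time the cost limit is reached
SET vacuum_cost_limit = 200;   -- the default accumulated-cost limit
VACUUM VERBOSE bigtable;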

The old VACUUM FULL is a fossil. It's pretty slow too, and you have to REINDEX afterwards. Don't use it. If you really want to defrag a table, use CLUSTER, or this:
Let's say you have some disk space left; this is much faster than dump & reload:
CREATE TABLE newtable AS SELECT * FROM oldtable;
CREATE INDEX bla ON newtable( ... );
ALTER TABLE oldtable RENAME TO archive;
ALTER TABLE newtable RENAME TO oldtable;
Note this will not copy your constraints. You can use CREATE TABLE LIKE ... to copy them.
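On newer releases, a hedged variant of the same swap can copy defaults, CHECK constraints and indexes via LIKE ... INCLUDING ALL (the ALL shortcut needs 9.0 or later; the names are placeholders, and foreign keys still have to be re-added by hand):
CREATE TABLE newtable (LIKE oldtable INCLUDING ALL);
INSERT INTO newtable SELECT * FROM oldtable;
ALTER TABLE oldtable RENAME TO archive;
ALTER TABLE newtable RENAME TO oldtable;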
So why isn't it utilizing more cores through threads
pg doesn't support this.

Related

Is it possible to run VACUUM FULL for a short while and get some benefit?

Is it possible to run PostgreSQL 11's VACUUM FULL for a short while and then get some benefit? Or does cancelling it midway cause all of its progress to be lost?
I've read about pg_repack (https://aws.amazon.com/blogs/database/remove-bloat-from-amazon-aurora-and-rds-for-postgresql-with-pg_repack/) but the way it works (creating new tables, copying data, etc.) sounds risky to me. Is that my paranoia or is it safe to use on a production database?
Backstory: I am working with a very large production database on AWS Aurora PostgreSQL 11. Many of the tables had tens of millions of records but have been pruned down significantly. The problem is that the table sizes on disk (and in the snapshots) have not decreased because DELETE and VACUUM (without FULL) do not shrink the files. These tables are in the hundreds of gigabytes range and I'm afraid running VACUUM FULL will take forever.
No. VACUUM FULL writes a new physical file for the table. Stopping it before it finishes voids the work done so far.
The manual:
VACUUM FULL rewrites the entire contents of the table into a new disk file with no extra space, allowing unused space to be returned to the operating system. This form is much slower and requires an ACCESS EXCLUSIVE lock on each table while it is being processed.
This is the main reason why community tools like pg_repack or pg_squeeze were created, which are more flexible, less blocking, and often faster, too. (I don't think pg_squeeze is available for Aurora, yet).
pg_repack might be a bit of overkill. You can instead just delete tuples from the end of the table and reinsert them towards the front of the table (reusing space already marked as free by an earlier VACUUM), at which point another ordinary VACUUM can truncate away the free space at the end of the table.
with d as (delete from mytable where ctid>='(50000,1)' returning *)
insert into mytable select * from d;
You can use pg_freespacemap to figure out a good starting point for the ctid criterion.
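A hedged sketch of that lookup, assuming the pg_freespacemap extension is available and mytable is the bloated table from the statement above:
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT relpages FROM pg_class WHERE relname = 'mytable';   -- current table length in 8 kB pages
SELECT blkno, avail
FROM pg_freespace('mytable')
WHERE avail > 0
ORDER BY blkno
LIMIT 20;   -- where reusable space is; pick a ctid block number high enough that the rows past it fit into that space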
This delete-and-reinsert approach might not behave well if you have triggers or FK constraints, and it might bloat indexes such that they would need to be rebuilt (but they probably do anyway). It will also lock a large number of rows at a time, for the duration it takes for the re-insert to run and commit.
Improvements made after v11 make this ctid scan more efficient than it is in v11.

How to Remove Dead Row Versions in Postgres 9.2

I ran a vacuum on the tables of my database and it appears it is not helping me. For example, I have a huge table and when I run vacuum on it, it reports 87887889 dead row versions that cannot be removed.
My question is: how do I get rid of these dead rows?
You have two basic options if a routine vacuum is insufficient. Both require a full table lock.
VACUUM FULL. This requires no additional disk space but takes a while to complete.
CLUSTER. This rewrites a table in a physical order optimized for a given index. It requires additional space to do the rewrite, but is much faster.
In general I would recommend using CLUSTER during a maintenance window if disk space allows.
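As a hedged sketch, with mytable and its primary-key index as placeholders, the two options look like this:
-- option 1: rewrite the table in place (ACCESS EXCLUSIVE lock for the duration)
VACUUM FULL VERBOSE mytable;
-- option 2: rewrite the table in index order (also an exclusive lock, needs roughly the table's size in extra disk space)
CLUSTER mytable USING mytable_pkey;
ANALYZE mytable;   -- refresh planner statistics afterwards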

Free space after massive postgres delete

I have a 9 million row table. I figured out that a large amount of it (around 90%) can be freed up. What actions are needed after the cleanup? Vacuum, reindex etc.
If you want to free up space on the file system, either VACUUM FULL or CLUSTER can help you. You may also want to run ANALYZE after these, to make sure the planner has up-to-date statistics but this is not specifically required.
It is important to note using VACUUM FULL places an ACCESS EXCLUSIVE lock on your table(s) (blocking any operation, writes & reads), so you probably want to take your application offline for the duration.
In PostgreSQL 8.2 and earlier, VACUUM FULL is probably your best bet.
In PostgreSQL 8.3 and 8.4, the CLUSTER command was significantly improved, so VACUUM FULL is not recommended -- it's slow and it will bloat your indexes. CLUSTER will re-create indexes from scratch and without the bloat. In my experience, it's usually much faster too. CLUSTER will also sort the whole physical table using an index, so you must pick an index. If you don't know which, the primary key will work fine.
In PostgreSQL 9.0, VACUUM FULL was changed to work like CLUSTER, so both are good.
It's hard to make predictions, but on a properly tuned server with commodity hardware, 9 million rows shouldn't take longer than 20 minutes.
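A hedged before-and-after check, assuming 8.3 or later for the CLUSTER ... USING syntax and using placeholder names:
SELECT pg_size_pretty(pg_total_relation_size('mytable'));   -- size including indexes, before
CLUSTER mytable USING mytable_pkey;
ANALYZE mytable;
SELECT pg_size_pretty(pg_total_relation_size('mytable'));   -- after; with ~90% of the rows gone this should drop sharply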
See the documentation for CLUSTER.
PostgreSQL wiki about VACUUM FULL and recovering dead space
You definitely want to run a VACUUM, to free up that space for future inserts. If you want to actually reclaim that space on disk, making it available to the OS, you'll need to run VACUUM FULL. Keep in mind that VACUUM can run concurrently, but VACUUM FULL requires an exclusive lock on the table.
You will also want to REINDEX, since the indexes will remain bloated even after the VACUUM runs. If possible, a much faster way to do this is to drop the index and create it again from scratch.
You'll also want to ANALYZE, which you can just combine with the VACUUM.
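For example, a minimal sketch of that routine; mytable and the index definition are placeholders:
VACUUM VERBOSE ANALYZE mytable;   -- runs concurrently, marks the space reusable and updates statistics
-- to return the space to the OS and deal with index bloat (exclusive lock):
VACUUM FULL mytable;
REINDEX TABLE mytable;            -- or, often faster: DROP INDEX mytable_idx; CREATE INDEX mytable_idx ON mytable (col);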
See the documentation for more info.
Hi
Wouldn't it be more optimal to create a temporary table with the 10% of records you need, then drop the original table and rename the temporary one to the original?
I'm relatively new to the world of Postgres, but I understand VACUUM ANALYZE is recommended. I think there's also a sub-option which just frees up space. I found REINDEX useful as well when doing batch inserts or deletes. Yes, I've been working with tables with a similar number of rows, and the speed increase is very noticeable (Ubuntu, Core 2 Quad).

Why doesn't PostgreSQL performance get back to its maximum after a VACUUM FULL?

I have a table with a few million tuples.
I perform updates in most of them.
The first update takes about a minute. The second takes two minutes. The third takes four minutes.
After that, I execute a VACUUM FULL.
Then, I execute the update again, which takes two minutes.
If I dump the database and recreate it, the first update will take one minute.
Why doesn't PostgreSQL performance get back to its maximum after a VACUUM FULL?
VACUUM FULL does not compact the indexes. In fact, indexes can be in worse shape after performing a VACUUM FULL. After a VACUUM FULL, you should REINDEX the table.
However, VACUUM FULL+REINDEX is quite slow. You can achieve the same effect of compacting the table and the indexes using the CLUSTER command, which takes a fraction of the time. It has the added benefit that it will order your table based on the index you choose to CLUSTER on, which can improve query performance. The downside to CLUSTER compared to VACUUM FULL+REINDEX is that it requires approximately twice the disk space while running. Also, be very careful with this command if you are running a version older than 8.3. It is not MVCC safe and you can lose data.
Also, you can do a no-op ALTER TABLE ... ALTER COLUMN statement to get rid of the table and index bloat; this is the quickest solution.
Finally, any VACUUM FULL question should also address why you need to do this in the first place. It is almost always caused by inadequate vacuuming. You should be running autovacuum and tuning it properly so that you never have to run a VACUUM FULL.
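For example, a hedged per-table tuning sketch (table storage parameters for autovacuum need 8.4 or later; the thresholds are just illustrative values for a large, heavily updated table):
ALTER TABLE mytable SET (
    autovacuum_vacuum_scale_factor = 0.02,    -- vacuum after ~2% dead rows instead of the 20% default
    autovacuum_analyze_scale_factor = 0.01
);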
The order of the tuples might be different, and this results in different query plans. If you want a fixed order, use CLUSTER. Lower the FILLFACTOR as well and turn on autovacuum. And did you ANALYZE as well?
Use EXPLAIN to see how a query is executed.
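A hedged sketch of those suggestions (needs 8.3 or later for this syntax); the names and the fillfactor value are placeholders, and a lower fillfactor leaves room on each page so updated row versions can stay on the same page:
ALTER TABLE mytable SET (fillfactor = 70);   -- applies to pages written from now on
CLUSTER mytable USING mytable_pkey;          -- rewrites the table in index order and honours the new fillfactor
ANALYZE mytable;
EXPLAIN UPDATE mytable SET some_col = some_col;   -- shows the plan without executing the update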

PostgreSQL sequential scan on tiny table slow

I have a table in PostgreSQL that I need read into memory. It is a very small table, with only three columns and 200 rows, and I just do a select col1, col2, col3 from my_table on the whole thing.
On the development machine this is very fast (less than 1ms), even though that machine is a VirtualBox inside of a Mac OS FileVault.
But on the production server it consistently takes 600 ms. The production server may have lower specs, and the database version is older, too (7.3.x), but that alone cannot explain the huge difference, I think.
In both cases, I am running explain analyze on the db server itself, so it cannot be the network overhead. The query execution plan is in both cases a simple sequential full table scan. There was also nothing else going on on the production machine at the time, so contention is out, too.
How can I find out why this is so slow, and what can I do about it?
Sounds like perhaps you haven't been VACUUMing this database properly? 7.3 is way too old to have AutoVacuum, so this is something you must do manually (cron job is recommended). If you have had many updates (over time) to this table and not run VACUUM, it will be very slow to access.
It's clearly table bloat. Run vacuum analyze of the table in question. Also - upgrade. 7.3 is not even supported anymore.
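A hedged sketch of that check and clean-up on the table from the question (my_table); note that plain VACUUM only marks the space as reusable, while VACUUM FULL actually shrinks the file:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'my_table';   -- 200 narrow rows should fit in a few 8 kB pages; thousands of pages means bloat
VACUUM FULL VERBOSE ANALYZE my_table;
SELECT relpages, reltuples FROM pg_class WHERE relname = 'my_table';   -- re-check; relpages should now be tiny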
What happens if you run the query several times?
The first run should be slow, but the others should be faster, because the first execution puts the data in the cache.
BTW: if you do a SELECT ... FROM without any restriction, it's 100% normal that you get a seq scan; you have to scan the whole table to retrieve the values, and since you have no restrictions, there is no need for an index scan.
Don't hesitate to post the result of your Explain Analyze query.
PostgreSQL 7.3 is really old; is there no option to upgrade to a more modern version?