I am working with a PostgreSQL 8.4.13 database.
Recently I had around 86.5 million records in a table. I deleted almost all of them - only 5000 records are left. I ran REINDEX and VACUUM ANALYZE after deleting the rows, but I still see that the table is occupying a large amount of disk space:
jbossql=> SELECT pg_size_pretty(pg_total_relation_size('my_table'));
pg_size_pretty
----------------
7673 MB
Also, the index values of the remaining rows are still pretty high - in the millions range. I thought that after vacuuming and re-indexing, the index values of the remaining rows would start from 1.
I read the documentation and it's pretty clear that my understanding of re-indexing was skewed.
But nonetheless, my intention is to reduce the table size after the delete operation and bring down the index values, so that read operations (SELECT) on the table do not take that long - currently it takes around 40 seconds to retrieve just one record from my table.
Update
Thanks Erwin. I have corrected the pg version number. VACUUM FULL worked for me. I have one follow-up question here:
Restart primary key numbers of existing rows after deleting most of a big table
To actually return disk space to the OS, run VACUUM FULL.
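For example, a minimal sketch using the table name from the question (note that VACUUM FULL takes an exclusive lock on the table while it runs):

-- rewrite the table into a new, compacted file and return the freed space to the OS
VACUUM FULL my_table;

-- confirm the new size afterwards
SELECT pg_size_pretty(pg_total_relation_size('my_table'));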
Further reading:
VACUUM returning disk space to operating system
From the PostgreSQL 10.4 manual regarding a full vacuum:
Note that they also temporarily use extra disk space approximately equal to the size of the table, since the old copies of the table and indexes can't be released until the new ones are complete.
I've read this in many different places, phrased in a variety of ways. Some of them indicate that the space required is at most equal to the size of the vacuumed table, hinting that it may only require enough space to store the resulting vacuumed table, i.e. somewhere in the range [0, size_of_original_table], depending on how many dead rows are in the table.
My question is: Will doing a full vacuum of a table always require a space equal to the original table size or is it dependent on the number of live rows in the table?
The additional space required by VACUUM (FULL) depends on the number of live rows in the table.
What happens during VACUUM (FULL) is that a new copy of the table is written.
All live tuples (= row versions) and the dead tuples that cannot be removed yet will be written to this new copy.
When the transaction finishes, the old copy will be removed.
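As a rough way to gauge the extra space beforehand, you could compare the live and dead tuple counts with the current table size (a sketch; 'my_table' is a placeholder, and the counts are only as accurate as the statistics collector's last update):

-- the live tuples roughly determine the size of the new copy VACUUM (FULL) writes
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS current_total_size
FROM   pg_stat_user_tables
WHERE  relname = 'my_table';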
It is recommended to have free space at least equal to the size of the largest table in the database. I.e., if your database size is 10GB and the largest table in your database is 2GB, then you must have at least 2GB of extra space on your disk in order to complete the vacuum successfully.
This is because VACUUM FULL creates a new copy of the table, excluding the dead rows, and then removes the existing table files.
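To see what that means in practice, you could list the largest tables in the database (a sketch; the reported sizes include indexes and TOAST data):

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM   pg_class
WHERE  relkind = 'r'
ORDER  BY pg_total_relation_size(oid) DESC
LIMIT  5;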
Every day I delete hundreds of thousands of records from a large table, then I do some calculations (with new data) and replace every one of the records that I previously deleted. I thought doing a regular vacuum tbl would do the trick. I know it doesn't return disk space to the server, but (because of the pg docs) I thought that, because I was inserting about as many records as I was deleting, I wouldn't lose any/much disk space. However, after moving the table to a different namespace (for an unrelated reason) the table went from 117GB to 44GB! So...
Is there a better strategy than this so my table does not bloat:
delete from tbl where ...etc... -- hundreds of thousands of rows removed
insert into tbl (...etc...) values (...etc...) -- hundreds of thousands of rows added back (fresh calcs)
.. repeat the above about 10 times a day ...
vacuum tbl
https://www.postgresql.org/docs/9.6/static/sql-vacuum.html
PostgreSQL 9.6
What I actually did to reduce the table size is in my answer here:
integer out of range and remaining disk space too small to convert id to bigint and other solutions
Edit 1:
The drawbacks to vacuum full are too restrictive for me. I am processing stuff 24/7, so I can't have locks like that, and my available disk space is pretty limited at any point in time. Trying to go about this in a better way.
What you are looking for is "dead space equilibrium", as I like to call it. If you've got, say, 1M rows and you want to delete and replace 100k rows, then you can do it in different ways. Let's suppose you delete 100k and insert 100k right away. The db won't have time to vacuum up those old dead rows, so now your 1M row table has 100k dead rows in it. Over the next 24 hours vacuum will kick in and mark their space reusable, and the next time you delete / insert, you'll create 100k more dead rows, then reuse (most of) the space left by the previous 100k dead rows. Your 1M row table now has ~100k dead rows again, which will get reused next time and so on.
You want to reach a point where your deletes/inserts (or updates) and vacuum are creating / reclaiming dead tuples at an even rate.
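One way to nudge the table toward that equilibrium is to make autovacuum visit it aggressively enough to keep pace with the delete/insert batches, or to run a plain (non-FULL) VACUUM between batches, which does not block reads or writes. A sketch using the table name from the question, with purely illustrative numbers:

-- per-table autovacuum settings; the values are illustrative, not a recommendation
ALTER TABLE tbl SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 10000
);

-- or reclaim dead space explicitly between batches without an exclusive lock
VACUUM tbl;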
I have Postgres 9.4.7 and a big table with ~100M rows and 20 columns. The table gets 1.5k selects, 150 inserts and 300 updates per minute, no deletes though. Here is my autovacuum config:
autovacuum_analyze_scale_factor 0
autovacuum_analyze_threshold 5000
autovacuum_vacuum_scale_factor 0
autovacuum_vacuum_threshold 5000
autovacuum_max_workers 6
autovacuum_naptime 5s
In my case the database is almost always in a constant state of vacuuming. When one vacuuming session ends, another one begins.
So the main question:
Is there a common way to vacuum big tables?
Here are some other questions.
A standard vacuum does not scan the entire table and 'analyze' only scans 30k rows. So under the same load I should have a constant execution time; is that true?
Do I really need to analyze the table? Can frequent 'analyze' make any useful changes to query plans for a large table?
vacuum
VACUUM reclaims storage occupied by dead tuples.
So it changes only the affected pages, but it will scan the entire table.
That covers what you probably call "standard vacuum". Now if you are on 9.6, then
VACUUM will skip pages based on the visibility map
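As a rough indication of how much of the table VACUUM can skip, you could check what fraction of its pages is marked all-visible (a sketch; the table name is a placeholder):

SELECT relname,
       relpages,
       relallvisible,
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM   pg_class
WHERE  relname = 'my_big_table';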
analyze
The amount of data that ANALYZE scans depends on the table size and on default_statistics_target, which can be set per instance or per column - it is not 30K rows per se:
For large tables, ANALYZE takes a random sample of the table contents,
rather than examining every row... change slightly each time ANALYZE
is run, even if the actual table contents did not change. This might
result in small changes in the planner's estimated costs shown by
EXPLAIN.
So if you want more stable results for EXPLAIN, run something like
alter table ... alter COLUMN ... set STATISTICS 200;
or increase default_statistics_target; otherwise, an analyze that runs too often has more chances of changing the plan.
One more thing - you have a 5K threshold. In a table with 100M rows that is 0.005% - right? So the effective scale is 0.00005, while the default is 0.2 or 0.1... It makes me think that maybe your threshold is too low. Running vacuum more often is recommended indeed, but here it looks too often - a few thousand times more often than it would be by default...
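To check whether the thresholds really are that far off, it may help to look at how often autovacuum and autoanalyze have actually run on the table and how many dead tuples accumulate between runs (a sketch; the table name is a placeholder):

SELECT relname,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count,
       last_autoanalyze,
       autoanalyze_count
FROM   pg_stat_user_tables
WHERE  relname = 'my_big_table';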
We are using Postgres to store ~2,000,000,000 samples. This ends up in tables with ~500 million entries and a size of ~100GB each.
What I want to do:
E.g. update the table entries: UPDATE table SET flag = true;
After this, the table is twice as big, i.e. 200GB.
To get the space (stored on an SSD) back, we run: "VACUUM FULL table"
Unfortunately, this step again needs loads of space, which results in the vacuum failing due to too little space being left.
My Questions:
Does this mean that, in order to run this UPDATE query only once and to get the space back for other tables in this DB, we need at least 300-400GB of space for a 100GB table?
In your scenario, you won't get away without having at least twice as much space as the table data would require.
The cheapest solution is probably to define the table with a fillfactor of 50 so that half of each block is left empty, thereby doubling the table size. Then the updated rows can all be in the same block as the original rows, and the UPDATE won't increase the table size because PostgreSQL can use the heap only tuple (HOT) update feature. The old versions will be freed immediately if there are no long running transactions that can still see them.
NOTE: This will only work if the column you are updating is not indexed.
The downside of this approach is that the table is always twice the necessary size, and all sequential scans will take twice as long. It won't bother you if you don't use sequential scans of the table.
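A minimal sketch of how that could look (the table and column names are placeholders following the question's example; on an existing table a changed fillfactor only applies to pages written after the change, so existing data would need a one-off rewrite to be repacked):

-- leave half of every page free so an UPDATE can keep the new row version
-- on the same page (a HOT update), provided the updated column is not indexed
CREATE TABLE samples (
    id    bigint PRIMARY KEY,
    value double precision,
    flag  boolean
) WITH (fillfactor = 50);

-- for an existing table, the setting only affects newly written pages
ALTER TABLE samples SET (fillfactor = 50);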
I'm using a Dev level database on Heroku that was about 63GB and approaching about 9.9 million rows (close to the limit of 10 million for this tier). I ran a script that deleted about 5 million rows I didn't need, and now (a few days later) the Postgres control panel/pginfo:table-size shows roughly 4.7 million rows, but the table is still at 63GB. 64GB is the limit for the next tier, so I need to reduce the size.
I've tried vacuuming but pginfo:bloat said the bloat was only about 3GB. Any idea what's happening here?
If you have vacuumed the table, don't worry about the size on disk still remaining unchanged. The space has been marked as reusable for new rows. So you can easily add another 4.7 million rows and the size on disk won't grow.
The standard form of VACUUM removes dead row versions in tables and
indexes and marks the space available for future reuse. However, it
will not return the space to the operating system, except in the
special case where one or more pages at the end of a table become
entirely free and an exclusive table lock can be easily obtained. In
contrast, VACUUM FULL actively compacts tables by writing a complete
new version of the table file with no dead space. This minimizes the
size of the table, but can take a long time. It also requires extra
disk space for the new copy of the table, until the operation
completes.
If you want to shrink it on disk, you will need to run VACUUM FULL, which locks the tables and needs as much extra space as the size of the tables while the operation is in progress. So you will have to check your quota before you try this, and your site will be unresponsive while it runs.
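If you first want to confirm how much of those 63GB is really reusable free space rather than live data, and the pgstattuple extension is available on your plan (it may not be on Heroku's Dev tier), a quick check could look like this (the table name is a placeholder):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT pg_size_pretty(table_len)      AS table_size,
       pg_size_pretty(tuple_len)      AS live_data,
       pg_size_pretty(dead_tuple_len) AS dead_data,
       pg_size_pretty(free_space)     AS reusable_space
FROM   pgstattuple('my_big_table');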
Update:
You can get a rough idea about the size of your data on disk by querying the pg_class table like this:
SELECT SUM(relpages*8192) FROM pg_class; -- assumes the default 8kB block size
Another method is a query of this nature:
SELECT pg_database_size('yourdbname');
This link: https://www.postgresql.org/docs/9.5/static/disk-usage.html provides additional information on disk usage.