We need to remove old data from a huge table once a year. The rows of the table are rather compact (around 40 bytes), and the only index on the table is the clustered index.
The database is about 750 GB in total; the table being worked on is 640 GB in size and contains 8.7 billion rows before the cleanup. After the delete, only 3.7 billion rows remain, yet the size only drops to about 500 GB of data.
These numbers look odd, but they are expected: most pages had some rows removed. Some pages were emptied and dropped, some were untouched and are still 100% full, but most pages are now only partly filled, and there is a lot of unclaimed space on each page.
To reclaim this space, I need to rebuild the index. My question is: how do I defragment an index in a database that is barely larger than the index itself?
If I remember correctly, an index REBUILD requires free space of about 1.3x the size of the index it works on, as it writes the data out as a sorted copy. The database would grow by almost 1 TB, and this new space will not be needed once the rebuild finishes.
Shrinking back after the defrag is not helpful as it introduces new (heavy) fragmentation.
I am aware of the "SORT_IN_TEMPDB" option. Is there an estimate of how much free space in the database will still be required with this option?
As an alternative, I could drop & recreate the clustered index, but I am unsure what the space requirement for that operation is.
Reorganising the index does not (as far as I know) compact the space within each page, so that operation is also not what I want.
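For reference, the rebuild I have in mind would look roughly like this (the table and index names are placeholders for my real ones, and as far as I know ONLINE = ON needs Enterprise Edition):

ALTER INDEX PK_BigTable ON dbo.BigTable
    REBUILD WITH (SORT_IN_TEMPDB = ON, ONLINE = ON);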
Thanks for any ideas!
Ralf
We need to remove old data from a huge table once a year.
That is exactly the use case for partitioning. Partition per year, drop the old partition, done. Downtime? Milliseconds if you do it smartly.
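A rough sketch of what I mean, assuming SQL Server 2016+ and a date column to partition on (all names are made up):

CREATE PARTITION FUNCTION pfYearly (date)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01', '2025-01-01');
CREATE PARTITION SCHEME psYearly
    AS PARTITION pfYearly ALL TO ([PRIMARY]);
-- rebuild the clustered index on psYearly(sample_date); the yearly cleanup is then just:
TRUNCATE TABLE dbo.BigTable WITH (PARTITIONS (1));  -- fast, no huge sort or rewrite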
From the PostgreSQL 10.4 manual regarding a full vacuum:
Note that they also temporarily use extra disk space approximately equal to the size of the table, since the old copies of the table and indexes can't be released until the new ones are complete.
I've read this in many different places, phrased in a variety of ways. Some phrasings indicate that the space required is at most equal to the size of the vacuumed table, hinting that it may only require enough space to store the resulting vacuumed table, i.e. somewhere in the range [0, size_of_original_table], depending on how many dead rows are in the table.
My question is: will doing a full vacuum of a table always require space equal to the original table size, or does it depend on the number of live rows in the table?
The additional space required by VACUUM (FULL) depends on the number of live rows in the table.
What happens during VACUUM (FULL) is that a new copy of the table is written.
All live tuples (= row versions) and the dead tuples that cannot be removed yet will be written to this new copy.
When the transaction finishes, the old copy will be removed.
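To get a rough estimate of how much extra space that new copy will need, you can scale the current table size by the fraction of live tuples; a sketch, assuming the table is called my_table (the indexes are rebuilt too and come on top of this):

SELECT pg_size_pretty(pg_table_size('my_table')) AS current_size,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty((pg_table_size('my_table')::numeric
                       * n_live_tup
                       / GREATEST(n_live_tup + n_dead_tup, 1))::bigint) AS estimated_new_size
FROM pg_stat_user_tables
WHERE relname = 'my_table';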
It is recommended to have free space at least equal to the size of the largest table in the database.
For example, if your database is 10 GB and the largest table in it is 2 GB, then you need at least 2 GB of extra space on disk to complete the vacuum successfully.
That is because VACUUM FULL creates a new copy of the table, excluding the dead rows, and then removes the existing table files.
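To see whether you have that much headroom, you can list the largest tables first (a sketch; pg_total_relation_size includes indexes and TOAST):

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 5;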
We are using Postgres to store ~2 billion samples. This ends up in tables with ~500 million rows and ~100 GB in size each.
What I want to do:
E.g. update the table entries: UPDATE table SET flag = true;
After this, the table is twice as big, i.e. 200 GB.
To get the space back (the data is stored on an SSD), we run "VACUUM FULL table".
Unfortunately, this step again needs a lot of extra space, which makes the VACUUM fail because there is too little space left.
My Questions:
Does this mean that, in order to run this UPDATE just once and get the space back for the other tables in this DB, we need at least 300-400 GB of free space for a 100 GB table?
In your scenario, you won't get away without having at least twice as much space as the table data would require.
The cheapest solution is probably to define the table with a fillfactor of 50 so that half of each block is left empty, thereby doubling the table size. Then the updated rows can all go into the same block as the original rows, and the UPDATE won't increase the table size because PostgreSQL can use the heap-only tuple (HOT) update feature. The old versions will be freed immediately if there are no long-running transactions that can still see them.
NOTE: This will only work if the column you are updating is not indexed.
The downside of this approach is that the table is always twice the necessary size, and all sequential scans will take twice as long. It won't bother you if you don't use sequential scans of the table.
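A minimal sketch of that setup, assuming the table is called samples and the flag column is not indexed:

ALTER TABLE samples SET (fillfactor = 50);  -- affects only pages written from now on
VACUUM FULL samples;                        -- one-time rewrite so existing pages honour the fillfactor
UPDATE samples SET flag = true;             -- HOT updates can now reuse the free half of each page

Note that the one-time VACUUM FULL in this sketch still needs the extra disk space discussed above; the saving only applies to subsequent updates.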
I'm using a Dev level database on Heroku that was about 63 GB and approaching about 9.9 million rows (close to the limit of 10 million for this tier). I ran a script that deleted about 5 million rows I didn't need, and now (a few days later) the Postgres control panel / pginfo:table-size shows roughly 4.7 million rows, but it's still at 63 GB. 64 is the limit for the next tier, so I need to reduce the size.
I've tried vacuuming but pginfo:bloat said the bloat was only about 3GB. Any idea what's happening here?
If you have vacuumed the table, don't worry about the size on disk still remaining unchanged. The space has been marked as reusable for new rows, so you can easily add another 4.7 million rows and the size on disk won't grow.
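If you want to see that reusable space directly, the pgstattuple contrib extension reports it per table (a sketch, assuming you are allowed to install extensions on your plan and the table is called my_table):

CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT table_len, tuple_percent, dead_tuple_percent, free_percent
FROM pgstattuple('my_table');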
The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained. In contrast, VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes.
If you want to shrink it on disk, you will need to run VACUUM FULL, which locks the table and needs as much extra space as the size of the table while the operation is in progress. So check your quota before you try this, and expect your site to be unresponsive while it runs.
Update:
You can get a rough idea of the size of your data on disk (relpages is counted in blocks of 8 kB by default, hence the 8192) by querying the pg_class table like this:
SELECT SUM(relpages * 8192) FROM pg_class;
Another method is a query of this nature:
SELECT pg_database_size('yourdbname');
This link: https://www.postgresql.org/docs/9.5/static/disk-usage.html provides additional information on disk usage.
I am working with a PostgreSQL 8.4.13 database.
Recently I had around 86.5 million records in a table. I deleted almost all of them - only 5000 records are left. I ran
reindex
and
vacuum analyze
after deleting the rows. But I still see that the table is occupying a large disk space:
jbossql=> SELECT pg_size_pretty(pg_total_relation_size('my_table'));
pg_size_pretty
----------------
7673 MB
Also, the index values of the remaining rows are still pretty high - in the millions range. I thought that after vacuuming and re-indexing, the index values of the remaining rows would start from 1.
I read the documentation and it's pretty clear that my understanding of re-indexing was skewed.
But nonetheless, my intention is to reduce the table size after the delete operation and bring down the index values so that read operations (SELECT) from the table do not take that long - currently it's taking me around 40 seconds to retrieve just one record from my table.
Update
Thanks Erwin. I have corrected the pg version number.
vacuum full
worked for me. I have one follow up question here:
Restart primary key numbers of existing rows after deleting most of a big table
To actually return disk space to the OS, run VACUUM FULL.
Further reading:
VACUUM returning disk space to operating system
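On 8.4 that would be the old, unparenthesized syntax; with my_table standing in for the real table name, roughly:

VACUUM FULL VERBOSE ANALYZE my_table;
SELECT pg_size_pretty(pg_total_relation_size('my_table'));  -- confirm the new on-disk size

(On 8.4 it is also worth running REINDEX TABLE my_table afterwards, since the old-style VACUUM FULL tends to bloat indexes.)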
The whole question is in the title:
if we kill a CLUSTER query on a 100 million row table, will it be dangerous for the database?
The query has been running for 2 hours now, and I need to access the table tomorrow morning (12 hours left, hopefully).
I thought it would be far quicker; my database is running on RAID SSDs and a dual Xeon processor.
Thanks for your wise advice.
Sid
No, you can kill the CLUSTER operation without any risk. Until the operation is done, nothing has changed in the original table and index files; a sketch of how to cancel it cleanly follows the quote below. From the manual:
When an index scan is used, a temporary copy of the table is created that contains the table data in the index order. Temporary copies of each index on the table are created as well. Therefore, you need free space on disk at least equal to the sum of the table size and the index sizes.
When a sequential scan and sort is used, a temporary sort file is also created, so that the peak temporary space requirement is as much as double the table size, plus the index sizes.
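To cancel it cleanly from another session, something like this should do (a sketch; before 9.2 the pg_stat_activity columns are procpid and current_query instead of pid and query):

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'CLUSTER%';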
As @Frank points out, it is perfectly fine to do so.
Assuming you want to run this query in the future and assuming you have the luxury of a service window and can afford some downtime, I'd tweak some settings to boost the performance a bit.
In your configuration:
turn off fsync, for higher throughput to the file system
fsync stands for file system sync; with fsync on, the database waits for the file system to confirm that the data has reached disk on every flush.
maximize your maintenance_work_mem
It's OK to just take all the available memory, as it will not be allocated during production hours. I don't know how big your table and the index you are working on are, but things will run faster when they can be held entirely in main memory; roughly, the settings would look like the sketch below.
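With made-up values (ALTER SYSTEM needs 9.4+; on older versions edit postgresql.conf and reload instead):

ALTER SYSTEM SET fsync = off;             -- only for the maintenance window; unsafe on a crash
SELECT pg_reload_conf();
SET maintenance_work_mem = '4GB';         -- per session; size it to the RAM you can spare
CLUSTER VERBOSE my_table USING my_index;  -- the long-running operation itself (names made up)
ALTER SYSTEM SET fsync = on;              -- switch it back on immediately afterwards
SELECT pg_reload_conf();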