PostgreSQL vacuuming a big table

I am running Postgres 9.4.7 and have a big table of ~100M rows and 20 columns. The table gets about 1.5k selects, 150 inserts and 300 updates per minute, and no deletes. Here is my autovacuum config:
autovacuum_analyze_scale_factor 0
autovacuum_analyze_threshold 5000
autovacuum_vacuum_scale_factor 0
autovacuum_vacuum_threshold 5000
autovacuum_max_workers 6
autovacuum_naptime 5s
In my case the database is almost constantly vacuuming: when one vacuuming session ends, another one begins.
So the main question:
Is there a common way to vacuum big tables?
Here are some other questions.
A standard VACUUM does not scan the entire table, and ANALYZE only scans 30k rows. So under the same load I should see a roughly constant execution time, is that true?
Do I really need to analyze the table? Can a frequent ANALYZE make any useful difference to the query plans for a large table?

vacuum
VACUUM reclaims storage occupied by dead tuples.
So it changes only the affected pages, but it will scan the entire table.
That covers what you probably call "standard vacuum". Now, if you had 9.6, then
VACUUM will skip pages based on the visibility map
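To see how hard autovacuum is actually working on that table, the statistics collector already tracks it; a quick check might look like this (big_table is a placeholder for your table name):
-- dead-tuple counts and vacuum history for one table
SELECT n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       vacuum_count, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'big_table';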
analyze
The amount of data that ANALYZE scans depends on the table size and on default_statistics_target, which is set per instance or overridden per column - it is not 30K rows per se:
For large tables, ANALYZE takes a random sample of the table contents,
rather than examining every row... change slightly each time ANALYZE
is run, even if the actual table contents did not change. This might
result in small changes in the planner's estimated costs shown by
EXPLAIN.
So if you want more stable results from EXPLAIN, run something like
alter table ... alter COLUMN ... set STATISTICS 200;
or increase default_statistics_target; otherwise, an ANALYZE that runs too often has more chances of changing the plan.
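For example (a sketch; mydb and big_table are placeholder names), you can raise the target for the current session and rebuild the statistics right away, or make it the database default:
-- raise the target for this session and rebuild statistics with the larger sample
SET default_statistics_target = 200;
ANALYZE big_table;
-- or make it the default for the whole database (picked up by new sessions)
ALTER DATABASE mydb SET default_statistics_target = 200;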
One more thing - you have a 5K threshold. On a table of 100M rows that is 0.005% of the table - right? So the effective scale is 0.00005, while the default scale factors are 0.2 (vacuum) and 0.1 (analyze)... It makes me think that your threshold may be too low. Running vacuum more often than the default is indeed recommended, but here it looks far too often - several thousand times more often than it would be with the defaults...
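If that per-instance setting really is too aggressive for this one table, one option (a sketch with placeholder values; big_table stands in for your table name) is to override autovacuum per table through its storage parameters instead of changing the global config:
-- per-table override: trigger autovacuum after ~1% of rows change rather than after a flat 5000
ALTER TABLE big_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 10000,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold = 10000
);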

Related

How to decrease size of a large postgresql table

I have a postgresql table that is "frozen", i.e. no new data is coming into it. The table is strictly used for reading purposes. The table contains about 17M records. The table has 130 columns and can be queried in multiple different ways. To make the queries faster, I created indexes for all combinations of filters that can be used. So I have a total of about 265 indexes on the table. Each index is about 1.1 GB. This brings the total table size to around 265 GB. I have vacuumed the table as well.
Question
Is there a way to further bring down the disk usage of this table?
Is there a better way to handle queries for "frozen" tables that never get any data entered into them?
If your table or indexes are bloated, then VACUUM FULL tablename could shrink them. But if they aren't bloated then this won't do any good. This is not a benign operation: it will lock the table for a period of time (needing to rebuild hundreds of indexes, probably a long period of time) and generate large amounts of IO and of WAL, the last of which will be especially troublesome for replicas. So I would test it on a non-production clone to see whether it actually shrinks things, and to see roughly how long a maintenance window you will need to declare.
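Before rewriting anything, it can help to confirm where the 265 GB actually lives; a quick check along these lines (frozen_table is a placeholder name) separates heap from index usage:
-- how much of the space is the heap vs. the indexes
SELECT pg_size_pretty(pg_table_size('frozen_table'))          AS heap_size,
       pg_size_pretty(pg_indexes_size('frozen_table'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('frozen_table')) AS total_size;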
Other than that, be more judicious in your choice of indexes. How did you get the list of "all combinations of filters that can be used"? Was it by inspecting your source code, or just by tackling slow queries one by one until you ran out of slow queries? Maybe you can look at snapshots of pg_stat_user_indexes taken a few days apart to see if all of them are actually being used.
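A simpler single-snapshot variant of that check (the counters are cumulative since the last stats reset, so the snapshot-diff approach is more reliable) might look like:
-- indexes that have never been scanned since the stats were last reset, largest first
SELECT relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;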
Are these mostly two-column indexes?

best disk saving strategy for "replacement inserts"

Every day I delete hundreds of thousands of records from a large table, then I do some calculations (with new data) and replace every one of the records that I previously deleted. I thought doing a regular vacuum tbl would do the trick. I know it doesn't return disk space to the server, but (based on the pg docs) I thought that because I was inserting about as many records as I was deleting, I wouldn't lose any/much disk space. However, after moving the table to a different namespace (for an unrelated reason) the table went from 117GB to 44GB! So...
Is there a better strategy than this so my table does not bloat:
delete from tbl where ...etc... -- hundreds of thousands of rows removed
insert into tbl (...etc...) values (...etc...) -- hundreds of thousands of rows added back (fresh calcs)
.. repeat the above about 10 times a day ...
vacuum tbl
https://www.postgresql.org/docs/9.6/static/sql-vacuum.html
PostgreSQL 9.6
What I actually did to reduce the table size is in my answer here:
integer out of range and remaining disk space too small to convert id to bigint and other solutions
Edit 1:
The drawbacks of vacuum full are too restrictive for me. I am processing stuff 24/7 so I can't have locks like that, and my available disk space is pretty limited at any point in time. Trying to go about this in a better way.
What you are looking for is "dead space equilibrium", as I like to call it. If you've got say 1M rows and you want to delete and replace 100k rows, then you can do it in different ways. Let's suppose you delete 100k, and insert 100k right away. The db won't have time to vacuum up those old dead rows, so now your 1M row table has 100k dead rows in it. Over the next 24 hours vacuum will kick in and mark their space as reusable, and the next time you delete / insert, you'll create 100k more dead rows, then reuse (most of) the space from the previous 100k. Your 1M row table now has ~100k dead rows again, which will get reused next time, and so on.
You want to reach a point where your deletes/inserts (or updates) and vacuum are creating / reclaiming dead tuples at an even rate.
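One way to help reach that equilibrium sooner is to vacuum between the delete and the re-insert, so the insert reuses the space the delete just freed. A sketch of the pattern (the ...etc... placeholders are the same as in the question, not real column lists):
DELETE FROM tbl WHERE ...etc...;    -- remove the stale rows
VACUUM tbl;                         -- mark their space as reusable before reloading
INSERT INTO tbl (...etc...) VALUES (...etc...);  -- the fresh rows land in the freed space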

Postgres: Set fillfactor to 50?

I have a table of records that is populated sequentially once, but then every record is updated (the order in which they are updated and the timing of the updates are both random). The updates are not HOT updates. Is there any advantage to setting my fillfactor for this table to 50, or even less than 50, given these facts?
OK, as you mentioned in the comments to your question, you're making changes in your table using transactions that update 1-10k records each. This is the right approach, since it leaves autovacuum some chance to do its work. But the table's fillfactor is not the first thing I'd check or change. Fillfactor can help you speed up the process, but if autovacuum is not aggressive enough, you'll get a very bloated table and bad performance soon.
So, first, I'd suggest you monitor your table's bloat level. There are a number of queries which can help you:
https://wiki.postgresql.org/wiki/Show_database_bloat
http://blog.ioguix.net/postgresql/2014/09/10/Bloat-estimation-for-tables.html
https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat-82-84.sql
https://github.com/dataegret/pg-utils/blob/master/sql/table_bloat.sql
(and for indexes:
https://github.com/dataegret/pg-utils/blob/master/sql/index_bloat.sql;
these queries require pgstattuple extension)
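Since pgstattuple is needed for the index query anyway, you can also point it directly at the table and one of its indexes for exact (rather than estimated) numbers; a sketch with placeholder object names:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- exact dead-tuple and free-space percentages for the table
SELECT * FROM pgstattuple('mytable');
-- leaf density and fragmentation for one of its btree indexes
SELECT * FROM pgstatindex('mytable_pkey');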
Next, I'd tune autovacuum to a much more aggressive state than the default (this is usually a good idea even if you don't need to process the whole table in a short period of time), something like this:
log_autovacuum_min_duration = 0
autovacuum_vacuum_scale_factor = 0.01
autovacuum_analyze_scale_factor = 0.05
autovacuum_naptime = 60
autovacuum_vacuum_cost_delay = 20
After some significant number of transactions with UPDATEs, check the bloating level.
Finally, yes, I'd tune fillfactor, but probably to some higher (and more usual) value like 80 or 90 – here you need to make some predictions: what is the probability that 10% or more of the tuples inside a page will be updated by a single transaction? If the chances are very high, reduce fillfactor. But you've mentioned that the order of rows in UPDATEs is random, so I'd use 80-90%. Keep in mind that there is an obvious trade-off here: if you set fillfactor to 50, your table will need roughly 2x the disk space and all operations will naturally become slower. If you want to go deeper into this question, I suggest creating 21 tables with fillfactors from 50 to 100 with the same data and testing UPDATE TPS with pgbench.
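For reference, changing fillfactor is itself cheap, but it only affects pages written afterwards; a minimal sketch (mytable is a placeholder):
-- new pages will be filled to 90%, leaving headroom for updated row versions
ALTER TABLE mytable SET (fillfactor = 90);
-- existing pages keep their old layout until the table is rewritten,
-- e.g. by VACUUM FULL (which takes an exclusive lock)
VACUUM FULL mytable;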

Database table size did not decrease proportionately

I am working with a PostgreSQL 8.4.13 database.
Recently I had around 86.5 million records in a table. I deleted almost all of them - only 5000 records are left. I ran
reindex
and
vacuum analyze
after deleting the rows. But I still see that the table is occupying a lot of disk space:
jbossql=> SELECT pg_size_pretty(pg_total_relation_size('my_table'));
pg_size_pretty
----------------
7673 MB
Also, the index values of the remaining rows are still pretty high - like in the millions range. I thought that after vacuuming and re-indexing, the index of the remaining rows would start from 1.
I read the documentation and it's pretty clear that my understanding of re-indexing was skewed.
But nonetheless, my intention is to reduce the table size after the delete operation and bring down the index values so that read operations (SELECT) from the table do not take that long - currently it's taking me around 40 seconds to retrieve just one record from my table.
Update
Thanks Erwin. I have corrected the pg version number.
vacuum full
worked for me. I have one follow up question here:
Restart primary key numbers of existing rows after deleting most of a big table
To actually return disk space to the OS, run VACUUM FULL.
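A minimal way to verify the effect, reusing the size check from the question (my_table as above):
SELECT pg_size_pretty(pg_total_relation_size('my_table'));  -- before
VACUUM FULL my_table;  -- rewrites the table and its indexes; takes an exclusive lock
SELECT pg_size_pretty(pg_total_relation_size('my_table'));  -- after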
Further reading:
VACUUM returning disk space to operating system

Why is count(*) taking extremely long in one PostgreSQL database but not another?

I have two Postgres databases. In one I have two tables, each with about 8,000,000 rows, and a count on either of them takes about a second. In another database, also Postgres, there are tables with 1,000,000 rows where a count takes 10s, and one table that's about 6,000,000 rows where a count takes 3min to run. What factors determine how long this will take? They are on different machines, but the database that takes longer is on a faster machine.
I've read about how Postgres count is slow in general, but this seems odd to me. I can't really use a workaround, because I am using Django, and it does a count in the admin, which is taking forever and making it difficult to use.
Any information on this would be helpful.
Speed of counting depends not just on the number of rows in the table but on the time taken to read the data from disk. The time depends on many things:
Number of rows in the table - as you already mentioned.
The number of records per page (if each record takes more space you need to read more pages to read the same number of rows).
If pages are only partly full you have to read more pages.
If the table is already cached in memory (having more memory available helps here).
If the table is indexed with a small index (the index can be counted instead).
Hardware differences.
etc....
Indexes, caches, disk speed, for starters all have an impact.
Is the "slow table" properly vacuumed?
Do not use VACUUM FULL, it only creates table and index bloat. VACUUM is absolutely enough. VACUUM ANALYZE would be even better.
And make sure autovacuum is turned on and properly configured.
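If an exact count isn't actually required (e.g. just to render a paginator in the Django admin), the planner's estimate from the catalog is essentially free; a sketch, assuming the table has been analyzed recently (big_table is a placeholder):
-- approximate row count maintained by VACUUM and ANALYZE
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'big_table';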