Refreshing materialized view CONCURRENTLY causes table bloat - postgresql

In PostgreSQL 9.5 I created a materialized view "effects" and scheduled an hourly concurrent refresh, since I wanted it to be always available:
REFRESH MATERIALIZED VIEW CONCURRENTLY effects;
In the beginning everything worked well, my materialized view was refreshing and disk space usage remained more or less constant.
The Issue
After some time, though, disk usage started to grow linearly.
I concluded that the materialized view was the cause of this growth, and ran the query from this answer to get the following result:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+-------------+--------------+---------------
core_relation_size | 32224567296 | 30 GB | 21140
visibility_map | 991232 | 968 kB | 0
free_space_map | 7938048 | 7752 kB | 5
table_size_incl_toast | 32233504768 | 30 GB | 21146
indexes_size | 22975922176 | 21 GB | 15073
total_size_incl_toast_and_indexes | 55209426944 | 51 GB | 36220
live_rows_in_text_representation | 316152215 | 302 MB | 207
------------------------------ | | |
row_count | 1524278 | |
live_tuples | 676439 | |
dead_tuples | 1524208 | |
(11 rows)
Then, I found that the last time this table was autovacuumed was two days ago, by running:
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup desc;
I decided to manually call vacuum (VERBOSE) effects. It ran for about half an hour and produced the following output:
vacuum (VERBOSE) effects;
INFO: vacuuming "public.effects"
INFO: scanned index "effects_idx" to remove 129523454 row versions
DETAIL: CPU 12.16s/55.76u sec elapsed 119.87 sec
INFO: scanned index "effects_campaign_created_idx" to remove 129523454 row versions
DETAIL: CPU 19.11s/154.59u sec elapsed 337.91 sec
INFO: scanned index "effects_campaign_name_idx" to remove 129523454 row versions
DETAIL: CPU 28.51s/151.16u sec elapsed 315.51 sec
INFO: scanned index "effects_campaign_event_type_idx" to remove 129523454 row versions
DETAIL: CPU 38.60s/373.59u sec elapsed 601.73 sec
INFO: "effects": removed 129523454 row versions in 3865537 pages
DETAIL: CPU 59.02s/36.48u sec elapsed 326.43 sec
INFO: index "effects_idx" now contains 1524208 row versions in 472258 pages
DETAIL: 113679000 index row versions were removed.
463896 index pages have been deleted, 60386 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
INFO: index "effects_campaign_created_idx" now contains 1524208 row versions in 664910 pages
DETAIL: 121637488 index row versions were removed.
41014 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "effects_campaign_name_idx" now contains 1524208 row versions in 711391 pages
DETAIL: 125650677 index row versions were removed.
696221 index pages have been deleted, 28150 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "effects_campaign_event_type_idx" now contains 1524208 row versions in 956018 pages
DETAIL: 127659042 index row versions were removed.
934288 index pages have been deleted, 32105 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "effects": found 0 removable, 493 nonremovable row versions in 3880239 out of 3933663 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 666922 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 180.49s/788.60u sec elapsed 1799.42 sec.
INFO: vacuuming "pg_toast.pg_toast_1371723"
INFO: index "pg_toast_1371723_index" now contains 0 row versions in 1 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pg_toast_1371723": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
At this point I thought the problem was resolved and started thinking about what could interfere with autovacuum. To be sure, I reran the query to find the space used by that table, and to my surprise it hadn't changed.
Only after I called REFRESH MATERIALIZED VIEW effects; (not concurrently) did the size shrink. The output of the table size query was now:
what | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+-----------+--------------+---------------
core_relation_size | 374005760 | 357 MB | 245
visibility_map | 0 | 0 bytes | 0
free_space_map | 0 | 0 bytes | 0
table_size_incl_toast | 374013952 | 357 MB | 245
indexes_size | 213843968 | 204 MB | 140
total_size_incl_toast_and_indexes | 587857920 | 561 MB | 385
live_rows_in_text_representation | 316175512 | 302 MB | 207
------------------------------ | | |
row_count | 1524385 | |
live_tuples | 676439 | |
dead_tuples | 1524208 | |
(11 rows)
And everything went back to normal...
Questions
The problem is solved, but there's still a fair amount of confusion:
Could anyone please explain what was the issue I experienced?
How could I avoid this in the future?

First, let's explain the bloat
REFRESH MATERIALIZED VIEW CONCURRENTLY is implemented in src/backend/commands/matview.c, and the comment is enlightening:
/*
* refresh_by_match_merge
*
* Refresh a materialized view with transactional semantics, while allowing
* concurrent reads.
*
* This is called after a new version of the data has been created in a
* temporary table. It performs a full outer join against the old version of
* the data, producing "diff" results. This join cannot work if there are any
* duplicated rows in either the old or new versions, in the sense that every
* column would compare as equal between the two rows. It does work correctly
* in the face of rows which have at least one NULL value, with all non-NULL
* columns equal. The behavior of NULLs on equality tests and on UNIQUE
* indexes turns out to be quite convenient here; the tests we need to make
* are consistent with default behavior. If there is at least one UNIQUE
* index on the materialized view, we have exactly the guarantee we need.
*
* The temporary table used to hold the diff results contains just the TID of
* the old record (if matched) and the ROW from the new table as a single
* column of complex record type (if matched).
*
* Once we have the diff table, we perform set-based DELETE and INSERT
* operations against the materialized view, and discard both temporary
* tables.
*
* Everything from the generation of the new data to applying the differences
* takes place under cover of an ExclusiveLock, since it seems as though we
* would want to prohibit not only concurrent REFRESH operations, but also
* incremental maintenance. It also doesn't seem reasonable or safe to allow
* SELECT FOR UPDATE or SELECT FOR SHARE on rows being updated or deleted by
* this command.
*/
So the materialized view is refreshed by deleting rows and inserting new ones from a temporary table. This can of course lead to dead tuples and table bloat, which is confirmed by your VACUUM (VERBOSE) output.
In a way, that's the price you pay for CONCURRENTLY.
Second, let's debunk the myth that VACUUM cannot remove the dead tuples
VACUUM will remove the dead rows, but it cannot reduce the bloat (that can be done with VACUUM (FULL), but that would lock the view just like REFRESH MATERIALIZED VIEW without CONCURRENTLY).
I suspect that the query you use to determine the number of dead tuples is just an estimate that gets the number of dead tuples wrong.
An example that demonstrates all that
CREATE TABLE tab AS SELECT id, 'row ' || id AS val FROM generate_series(1, 100000) AS id;
-- make sure autovacuum doesn't spoil our demonstration
CREATE MATERIALIZED VIEW tab_v WITH (autovacuum_enabled = off)
AS SELECT * FROM tab;
-- required for CONCURRENTLY
CREATE UNIQUE INDEX ON tab_v (id);
Use the pgstattuple extension to accurately measure table bloat:
CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 100000
tuple_len | 3788895
tuple_percent | 85.49
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 16724
free_percent | 0.38
Now let's delete some rows in the table, refresh and measure again:
DELETE FROM tab WHERE id BETWEEN 40001 AND 80000;
REFRESH MATERIALIZED VIEW CONCURRENTLY tab_v;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 60000
tuple_len | 2268895
tuple_percent | 51.19
dead_tuple_count | 40000
dead_tuple_len | 1520000
dead_tuple_percent | 34.3
free_space | 16724
free_percent | 0.38
Lots of dead tuples. VACUUM gets rid of these:
VACUUM tab_v;
SELECT * FROM pgstattuple('tab_v');
-[ RECORD 1 ]------+--------
table_len | 4431872
tuple_count | 60000
tuple_len | 2268895
tuple_percent | 51.19
dead_tuple_count | 0
dead_tuple_len | 0
dead_tuple_percent | 0
free_space | 1616724
free_percent | 36.48
The dead tuples are gone, but now there is a lot of empty space.
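To actually return that space to the operating system, the view has to be rewritten. The following is a minimal sketch continuing the tab_v example; both rewriting statements block readers of the view while they run, and the numbers you measure will depend on your system:

```sql
-- Either rebuild the view from scratch (without CONCURRENTLY, so
-- readers are blocked until it finishes) ...
REFRESH MATERIALIZED VIEW tab_v;

-- ... or rewrite the existing rows without re-running the view query.
VACUUM (FULL) tab_v;

-- After either statement, table_len should be close to tuple_len again.
SELECT table_len, tuple_len, dead_tuple_count, free_space
FROM pgstattuple('tab_v');
```

If blocking readers is not acceptable, running a plain VACUUM tab_v; after each concurrent refresh should at least make the freed space available to the next refresh, so the file stops growing.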

Adding to Laurenz Albe's thorough answer above: there is another possible cause of the bloat. Consider the following scenario:
You have a view that rarely changes (say 1000000 records, of which about 100 change per refresh), and yet you still get 500000 dead tuples. The reason can be NULL values in the indexed columns.
As described in the answer above, when a view is refreshed concurrently, a new copy of the data is created and compared with the old one. The comparison uses the mandatory unique index, but what about NULLs? NULLs are never equal to each other in SQL. So if the columns of your unique index allow NULLs, the records that contain NULLs are always deleted and re-inserted, even if they are unchanged.
To avoid this bloat, add a column that coalesces the nullable column to some value that is never used (-1, to_timestamp(0), ...) and use that column in the unique index instead of the nullable one.
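A sketch of that workaround. The table campaign_runs, its columns and the view name report_v are hypothetical names chosen for illustration, and to_timestamp(0) is assumed never to occur as a real value:

```sql
-- finished_at is nullable, so it must not be part of the unique index.
-- finished_at_key carries the same information with NULL replaced by
-- a sentinel value that never occurs in real data.
CREATE MATERIALIZED VIEW report_v AS
SELECT campaign_id,
       finished_at,
       coalesce(finished_at, to_timestamp(0)) AS finished_at_key
FROM campaign_runs;

-- CONCURRENTLY needs a unique index on plain columns (no expressions,
-- no WHERE clause), hence the extra column instead of an expression index.
CREATE UNIQUE INDEX ON report_v (campaign_id, finished_at_key);

REFRESH MATERIALIZED VIEW CONCURRENTLY report_v;
```

To check whether this helps in your case, compare dead_tuple_count from pgstattuple('report_v') before and after a refresh in which the underlying data did not change.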

Related

PostgreSQL why table bloat ratio is higher than autovacuum_vacuum_scale_factor

I found that bloat ratio of feedback_entity is 48%
current_database | schemaname | tblname | real_size | extra_size | extra_ratio | fillfactor | bloat_size | bloat_ratio | is_na
stackdb | public | feedback_entity | 5743878144 | 2785599488 | 48.4968416488746 | 100 | 2785599488 | 48.4968416488746 | f
but when I check the autovacuum settings, the scale factor is 10%:
stackdb=> show autovacuum_vacuum_scale_factor;
autovacuum_vacuum_scale_factor
--------------------------------
0.1
(1 row)
stackdb=> show autovacuum_vacuum_threshold;
autovacuum_vacuum_threshold
-----------------------------
50
(1 row)
Also:
Autovacuum setting is on.
Autovacuum for the mentioned table runs regularly at the defined threshold.
My question is: when autovacuum runs at 10% dead tuples, why would the bloat grow to 48%? I have seen similar behaviour in hundreds of databases/tables. Why does table bloat keep increasing and not come down after each vacuum?
The query that you used to calculate the table bloat is unreliable. To determine the actual bloat, use the pgstattuple extension and query like this:
SELECT * FROM pgstattuple('public.feedback_entity');
But the table may really be bloated. There are two major reasons for that:
autovacuum runs and finishes in a reasonable time, but it cannot clean up the dead tuples. That may be because there is a long-running open transaction, an abandoned replication slot or a prepared transaction. See this article for details.
autovacuum runs too slow, so that dead rows are generated faster than it can clean them up. The symptoms are lots of dead tuples in pg_stat_user_tables and autovacuum processes that keep running forever. The straightforward solution is to use ALTER TABLE to increase autovacuum_vacuum_cost_limit or reduce autovacuum_vacuum_cost_delay for the afflicted table. An alternative approach, if possible, is to use HOT updates.
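The two cases can be checked roughly as follows. This is a sketch, not a complete diagnosis; the numbers in the ALTER TABLE statement are illustrative assumptions and need tuning for your hardware:

```sql
-- Case 1: something holds back the cleanup horizon.
-- Oldest open transactions:
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Replication slots (look for inactive ones with an old xmin):
SELECT slot_name, active, xmin, catalog_xmin
FROM pg_replication_slots;

-- Forgotten prepared transactions:
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;

-- Case 2: autovacuum is too slow for this table.
-- Make it work harder on this table only (example values).
ALTER TABLE public.feedback_entity
    SET (autovacuum_vacuum_cost_limit = 2000,
         autovacuum_vacuum_cost_delay = 2);
```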

PostgreSQL index bloat ratio more than table bloat ratio and autovacuum_vacuum_scale_factor

Index bloat is reaching 57%, while table bloat is only 9% and autovacuum_vacuum_scale_factor is only 10%.
What is more surprising is that even the primary key has a bloat of 57%. My understanding is that, since my primary key is a single auto-incrementing column, once 10% of the table consists of dead tuples, the primary key index should also have 10% dead tuples.
When autovacuum runs at 10% dead tuples, it cleans them up. The space they occupied then becomes bloat, which should be reused by new updates and inserts. But this isn't happening in my database; here the bloat size keeps increasing.
FYI:
Index Bloat:
current_database | schemaname | tblname | idxname | real_size | extra_size | extra_ratio | fillfactor | bloat_size | bloat_ratio | is_na
------------------+------------+----------------------+----------------------------------------------------------+------------+------------+------------------+------------+------------+-------------------+-------
stackdb | public | data_entity | data_entity_pkey | 2766848000 | 1704222720 | 61.5943745373797 | 90 | 1585192960 | 57.2923760177646
Table Bloat:
current_database | schemaname | tblname | real_size | extra_size | extra_ratio | fillfactor | bloat_size | bloat_ratio | is_na
stackdb | public | data_entity | 10106732544 | 1007288320 | 9.96650812332014 | 100 | 1007288320 | 9.96650812332014 | f
Autovacuum Settings:
stackdb=> show autovacuum_vacuum_scale_factor;
autovacuum_vacuum_scale_factor
--------------------------------
0.1
(1 row)
stackdb=> show autovacuum_vacuum_threshold;
autovacuum_vacuum_threshold
-----------------------------
50
(1 row)
Note:
autovacuum is on
autovacuum is running successfully at defined intervals.
PostgreSQL is running version 10.6. The same issue has been found with version 12.x.
First: an index bloat of 57% is totally healthy. Don't worry.
Indexes become more bloated than tables, because the empty space cannot be reused as freely as it can be in a table. The table, also known as the “heap”, has no predetermined ordering: if a new row is written as the result of an INSERT or UPDATE, it ends up in the first page that has enough free space, so it is easy to keep bloat low if VACUUM does its job.
B-tree indexes are different: their entries have a certain ordering, so the database is not free to choose where to put the new row. So you may have to put it into a page that is already full, causing a page split, while elsewhere in the index there are pages that are almost empty.
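If you want a measurement rather than an estimate, the pgstattuple extension can inspect the index directly. A small sketch using the index name from the question:

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- avg_leaf_density is the average fill percentage of the leaf pages;
-- compare it with the index fillfactor (90 by default for B-tree).
SELECT index_size, leaf_pages, empty_pages, deleted_pages,
       avg_leaf_density, leaf_fragmentation
FROM pgstatindex('public.data_entity_pkey');
```

Note that pgstatindex reads the whole index, so on a multi-gigabyte index it can take a while and cause noticeable I/O.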

Redshift latency, i.e. discrepancy between "execution time" and "total runtime"

I'm currently experimenting with Redshift and I've noticed that for a simple query like:
SELECT COUNT(*) FROM table WHERE column = 'value';
The execution time reported by Redshift is only 84ms, which is expected and pretty good with the table at ~33M rows. However, the total runtime, both observed on my local psql client as well as in Redshift's console UI is 5 seconds. I've tried the query on both a single node cluster and multi-node (2 nodes and 4 nodes) clusters.
In addition, when I try with more realistic, complicated queries, I can see similarly that the query execution itself is only ~500ms in a lot of cases, but the total runtime is ~7 seconds.
What causes this discrepancy? Is there any way to reduce this latency? Is there any internal table that lets me dive deeper into the time distribution covering the entire end-to-end runtime?
I read about the cold query performance improvements that Amazon recently introduced, but this latency seems to be there even on queries past the first cold one, as long as I alter the value in my WHERE clause. The latency is somewhat inconsistent, but it definitely still goes all the way up to 5 seconds.
-- Edited to give more details based on Bill Weiner's answer below --
There is no difference between doing SELECT COUNT(*) vs SELECT COUNT(column) (where column is a dist key to avoid skew).
There are absolutely zero other activities happening on the cluster because this is for exploration only. I'm the only one issuing queries and making connections to the DB, so there should be no queueing or locking delays.
The data resides in the Redshift database, with a normal schema and common-sense dist key and sort key. I have not added explicit compression to any columns, so everything is just AUTO right now.
Looks like compile time is the culprit!
STL_WLM_QUERY shows that for query 12599, this is the exec_start_time/exec_end_time:
-[ RECORD 1 ]------------+-----------------------------------------------------------------
userid | 100
xid | 14812605
task | 7289
query | 12599
service_class | 100
slot_count | 1
service_class_start_time | 2021-04-22 21:46:49.217
queue_start_time | 2021-04-22 21:46:49.21707
queue_end_time | 2021-04-22 21:46:49.21707
total_queue_time | 0
exec_start_time | 2021-04-22 21:46:49.217077
exec_end_time | 2021-04-22 21:46:53.762903
total_exec_time | 4545826
service_class_end_time | 2021-04-22 21:46:53.762903
final_state | Completed
est_peak_mem | 2097152
query_priority | Normal
service_class_name | Default queue
And from SVL_COMPILE, we have:
userid | xid | pid | query | segment | locus | starttime | endtime | compile
--------+----------+-------+-------+---------+-------+----------------------------+----------------------------+---------
100 | 14812605 | 30442 | 12599 | 0 | 1 | 2021-04-22 21:46:49.218872 | 2021-04-22 21:46:53.744529 | 1
100 | 14812605 | 30442 | 12599 | 2 | 2 | 2021-04-22 21:46:53.745711 | 2021-04-22 21:46:53.745728 | 0
100 | 14812605 | 30442 | 12599 | 3 | 2 | 2021-04-22 21:46:53.761989 | 2021-04-22 21:46:53.762015 | 0
100 | 14812605 | 30442 | 12599 | 1 | 1 | 2021-04-22 21:46:53.745476 | 2021-04-22 21:46:53.745503 | 0
(4 rows)
It shows that compile took from 2021-04-22 21:46:49.218872 to 2021-04-22 21:46:53.744529, i.e. the overwhelming majority of the 4545 ms total exec time.
There's a lot that could be taking up this time. Looking at more of the query and queuing statistics will help track down what is happening. Here are a few possibilities that I've seen be significant in the past:
Data return time. Your query is an open select and could be returning a meaningful amount of data; moving this over the network to the requesting computer takes time.
Queuing delays. What else is happening on your cluster? Does your query start right away or does it need to wait for a slot?
Locking delays. What else is happening on your cluster? Are data/tables changing? Is the data your query needs being committed elsewhere?
Compile time. Is this the first time this query is run?
Is the table external, i.e. in S3 as an external table? Or are you using the new RA3 instance type, where all the source data is in S3? (I'm guessing you are not on RA3 nodes, but it doesn't hurt to ask.)
A place to start is STL_WLM_QUERY to see where the query is spending this extra time.
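As a rough sketch, the two system views shown in the question can be queried like this for a single query ID (12599 is the ID from the question; substitute your own):

```sql
-- Queue time versus execution time; both totals are in microseconds.
SELECT query, total_queue_time, total_exec_time,
       exec_start_time, exec_end_time
FROM stl_wlm_query
WHERE query = 12599;

-- Per-segment compilation: compile = 1 means the segment was compiled
-- for this run, compile = 0 means a cached version was reused.
SELECT query, segment, locus, compile,
       datediff(ms, starttime, endtime) AS compile_ms
FROM svl_compile
WHERE query = 12599
ORDER BY segment;
```

If compile_ms for one segment accounts for most of total_exec_time, as in the output above, compilation rather than queuing or data transfer is the main cost.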

PostgreSQL11 space reuse under high delete/update rate

We are evaluating PostgreSQL 11.1 for our production.
Having a system with 4251 updates per second, ~1000 deletes per second and ~3221 inserts per second and 1 billion transactions per day, we face a challenge where PostgreSQL does not reuse its (delete/update) space, and tables constantly increase in size.
We configured aggressive autovacuum settings to avoid the wraparound situation. We also tried adding periodic executions of VACUUM ANALYZE and VACUUM, and still there is no space reuse. (Only VACUUM FULL or pg_repack release space to the operating system, but this is not reuse.)
Following are our vacuum settings:
autovacuum | on
vacuum_cost_limit | 6000
autovacuum_analyze_threshold | 50
autovacuum_vacuum_threshold | 50
autovacuum_vacuum_cost_delay | 5
autovacuum_max_workers | 32
autovacuum_freeze_max_age | 2000000
autovacuum_multixact_freeze_max_age | 2000000
vacuum_freeze_table_age | 20000
vacuum_multixact_freeze_table_age | 20000
vacuum_cost_page_dirty | 20
vacuum_freeze_min_age | 10000
vacuum_multixact_freeze_min_age | 10000
log_autovacuum_min_duration | 1000
autovacuum_naptime | 10
autovacuum_analyze_scale_factor | 0
autovacuum_vacuum_scale_factor | 0
vacuum_cleanup_index_scale_factor | 0
vacuum_cost_delay | 0
vacuum_defer_cleanup_age | 0
autovacuum_vacuum_cost_limit | -1
autovacuum_work_mem | -1
Your requirements are particularly hard for PostgreSQL.
You should set autovacuum_vacuum_cost_delay to 0 for that table.
Reset autovacuum_max_workers and autovacuum_naptime back to their default values.
Reset autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor to their default values or slightly lower values.
Your problem is not that autovacuum does not run often enough; rather, it is too slow to keep up.
Even with that you might only be able to handle this workload with HOT updates:
Make sure that the attributes that are updated a lot are not part of any index.
Create the table with a fillfactor below 100, say 70.
HOT update often avoids the need for VACUUM and the need to update indexes.
Check the n_tup_hot_upd column of pg_stat_user_tables to see if it works.
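Put together, the per-table part of this advice could look like the sketch below. busy_table is a hypothetical table name, and 70 is just the example fillfactor mentioned above:

```sql
-- Let autovacuum run unthrottled on this table and leave room in each
-- page for HOT updates. The new fillfactor only applies to pages
-- written from now on; existing pages stay as full as they are.
ALTER TABLE busy_table
    SET (autovacuum_vacuum_cost_delay = 0,
         fillfactor = 70);

-- Share of updates that were HOT since the statistics were last reset.
SELECT relname, n_tup_upd, n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE relname = 'busy_table';
```

A low hot_pct usually means that an updated column is indexed or that the pages have no free space left, so it is worth rechecking both points before tuning anything else.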

How to find out fragmented indexes and defragment them in PostgreSQL?

I've found how we can solve this problem in SQL Server here, but how can I do it in PostgreSQL?
Normally you don't have to worry about that at all.
However, if there has been a mass delete or update, or the sustained change rate was so high that autovacuum couldn't keep up, you may end up with a badly bloated index.
The tool to determine that is the pgstattuple extension:
CREATE EXTENSION pgstattuple;
Then you can examine index bloat like this:
SELECT * FROM pgstatindex('spatial_ref_sys_pkey');
-[ RECORD 1 ]------+-------
version | 2
tree_level | 1
index_size | 196608
root_block_no | 3
internal_pages | 1
leaf_pages | 22
empty_pages | 0
deleted_pages | 0
avg_leaf_density | 64.48
leaf_fragmentation | 13.64
This index is in excellent shape (never used): It has only 14% bloat.
Mind that indexes are by default created with a fillfactor of 90, that is, index blocks are not filled to more than 90% by INSERT.
It is hard to say when an index is bloated, but if leaf_fragmentation exceeds 50-60, it's not so pretty.
To reorganize an index, use REINDEX.
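For example, using the index from above. The CONCURRENTLY variant assumes PostgreSQL 12 or later:

```sql
-- Rebuilds the index from scratch. This blocks writes to the table
-- and queries that would use the index until it finishes.
REINDEX INDEX spatial_ref_sys_pkey;

-- PostgreSQL 12 and later: rebuild without blocking writes, at the
-- cost of more total work and a longer run time.
REINDEX INDEX CONCURRENTLY spatial_ref_sys_pkey;
```

Running pgstatindex again afterwards should show avg_leaf_density close to the index fillfactor.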
With PostgreSQL, index defragmentation should generally be handled automatically by the autovacuum daemon. If you don't use the autovacuum daemon, or if it isn't able to keep up, you can always reindex problematic indexes.
Determining which indexes may be badly fragmented isn't particularly straightforward; it's discussed at length in this blog post and in this PostgreSQL wiki article.