Postgresql "CLUSTER tablename USING indexname;" takes more disk space than table size plus all indexes combined - postgresql

I have a quite large table in the database. Size of table and its indexes are shown in tables below:
table_size
----------------
22 GB
schemaname | scan_count | tablename | indexname | index_size
------------+------------+-----------+-----------------------------------------------------------------+------------
public | 1352665306 | t1 | ind1 | 6686 MB
public | 1492127808 | t1 | ind2 | 6587 MB
public | 3492322 | t1 | ind3 | 4747 MB
public | 71810172 | t1 | ind4 | 4237 MB
public | 80547954 | t1 | cluster_ind | 4035 MB
public | 3628773 | t1 | ind6 | 3700 MB
As you can see, tables size is 22GB. It has 6 different indexes, which take 30GB of space combined.
According to postgresql docs:
CLUSTER can re-sort the table using either an index scan on the specified index, or (if the index is a b-tree) a sequential scan followed by sorting. It will attempt to choose the method that will be faster, based on planner cost parameters and available statistical information.
When an index scan is used, a temporary copy of the table is created that contains the table data in the index order. Temporary copies of each index on the table are created as well. Therefore, you need free space on disk at least equal to the sum of the table size and the index sizes.
However, with df -h showing that I have 75GB of free space on the partition containing my postgresql data, I still run out of disk space when I try to cluster the table over cluster_ind index using the command below:
cluster table using cluster_ind;
and I face this error:
ERROR: could not extend file "base/16385/2165933.3": wrote only 4096 of 8192 bytes at block 394731
HINT: Check free disk space.
Question: What else is using disk space during clustering other than table size and sum of sizes of its indexes? And how can I estimate this space required to run a CLUSTER command on a table using an index?

That could be the TOAST table.
Use pg_total_relation_size('table_name') to get the full size of the table.

Related

How to make big postgres db faster?

I have big Postgres database(around 75 GB) and queries are very slow. Is there any way to make them faster?
About database:
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
--------+-------------------+----------+----------+-------------+---------------+------------+-------------
public | fingerprints | table | postgres | permanent | heap | 35 GB |
public | songs | table | postgres | permanent | heap | 26 MB |
public | songs_song_id_seq | sequence | postgres | permanent | | 8192 bytes |
\d+ fingerprints
Table "public.fingerprints"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
hash | bytea | | not null | | extended | | |
song_id | integer | | not null | | plain | | |
offset | integer | | not null | | plain | | |
date_created | timestamp without time zone | | not null | now() | plain | | |
date_modified | timestamp without time zone | | not null | now() | plain | | |
Indexes:
"ix_fingerprints_hash" hash (hash)
"uq_fingerprints" UNIQUE CONSTRAINT, btree (song_id, "offset", hash)
Foreign-key constraints:
"fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap
\d+ songs
Table "public.songs"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
song_id | integer | | not null | nextval('songs_song_id_seq'::regclass) | plain | | |
song_name | character varying(250) | | not null | | extended | | |
fingerprinted | smallint | | | 0 | plain | | |
file_sha1 | bytea | | | | extended | | |
total_hashes | integer | | not null | 0 | plain | | |
date_created | timestamp without time zone | | not null | now() | plain | | |
date_modified | timestamp without time zone | | not null | now() | plain | | |
Indexes:
"pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap
DB Scheme
DB Amount
No need to write to database, only read. All queries are very simple:
SELECT song_id
WHERE hash in fingerpints = X
EXPLAIN(analyze, buffers, format text) SELECT "song_id", "offset" FROM "fingerprints" WHERE "hash" = decode('eeafdd7ce9130f9697','hex');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_fingerprints_hash on fingerprints (cost=0.00..288.28 rows=256 width=8) (actual time=0.553..234.257 rows=871 loops=1)
Index Cond: (hash = '\xeeafdd7ce9130f9697'::bytea)
Buffers: shared hit=118 read=749
Planning Time: 0.225 ms
Execution Time: 234.463 ms
(5 rows)
234 ms looks fine where it is one query. But in reality there 3000 query per time, that takes about 600 seconds. It is audio recognition application, so algoritm works like that.
About indexes:
CREATE INDEX "ix_fingerprints_hash" ON "fingerprints" USING hash ("hash");
For pooler I use Odyssey.
Little bit of info from config:
shared_buffers = 4GB
huge_pages = try
work_mem = 582kB
maintenance_work_mem = 2GB
effective_io_concurrency = 200
max_worker_processes = 24
max_parallel_workers_per_gather = 12
max_parallel_maintenance_workers = 4
max_parallel_workers = 24
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 16GB
min_wal_size = 4GB
random_page_cost = 1.1
effective_cache_size = 12GB
Info about hardware:
Xeon 12 core (24 threads)
RAM DDR4 16 GB ECC
NVME disk
Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)? And what parameters should I change to say to Postgres to store db in ram?
I read about several topics about pg_tune, etc. but experiments don't show any good results.
Increasing the RAM so that everything can stay in cache (perhaps after using pg_prewarm to get it into cache in the first place) would certainly work. But it is expensive and shouldn't be necessary.
Having a hash index on something which is already a hashed value is probably not very helpful. Have you tried just a default (btree) index instead?
If you CLUSTER the table on the index over the column named "hash" (which you can only do if it is a btree index) then rows with the same hash code should mostly share the same table page, which would greatly cut down on the number of different buffer reads needed to fetch them all.
If you could get it do a bitmap heap scan instead of an index scan, then it should be able to have a large number of read requests outstanding at a time, due to effective_io_concurrency. But the planner does not account for effective_io_concurrency when doing planning, which means it won't choose a bitmap heap scan specifically to get it that benefit. Normally an index read finding hundreds of rows on different pages would automatically choose a bitmap heap scan method, but in your case it is probably the low setting of random_page_cost which is inhibiting it from doing so. The low setting of random_page_cost is probably reasonable in itself, but it does have this unfortunate side effect. A problem with this strategy is that it doesn't reduce the overall amount of IO needed, it just allows them overlap and so make better use of multiple IO channels. But if many sessions are running many instances of this query at the same time, they will start filling up those channels and so start competing with each other. So the CLUSTER method is probably superior as it gets the same answer with less IO. If you want to play around with bitmap scans, you could temporarily increase random_page_cost or temporarily set enable_indexscan to off.
No need to write to database, only read.
So the DB is read-only.
And in comments:
db worked fine on small amount of data(few GB), but after i filled out database started to slowdown.
So indexes have been built up incrementally.
Indexes
UNIQUE CONSTRAINT on (song_id, "offset", hash)
I would replace that with:
ALTER TABLE fingerprints
DROP CONSTRAINT uq_fingerprints
, ADD CONSTRAINT uq_fingerprints UNIQUE(hash, song_id, "offset") WITH (FILLFACTOR = 100)
This enforces the same constraint, but the leading hash column in the underlying B-tree index now supports the filter on hash in your displayed query. And the fact that all needed columns are included in the index, further allows much faster index-only scans. The (smaller) index should also be more easily cached than the (bigger) table (plus index).
See:
Is a composite index also good for queries on the first field?
Also rewrites the index in pristine condition, and with FILLFACTOR 100 for the read-only DB. (Instead of the default 90 for a B-tree index.)
Hash index on (hash) and CLUSTER
The name of the column "hash" has nothing to do with the name of the index type, which also happens to be "hash". (The column should probably not be named "hash" to begin with.)
If (and only if) you also have other queries centered around one of few hash values, that cannot use index-only scans (and you actually see faster queries than without) keep the hash index, additionally. And optimize it. (Else drop it!)
ALTER INDEX ix_fingerprints_hash SET (FILLFACTOR = 100);
An incrementally grown index may end up with bloat or unbalanced overflow pages in case of a hash index. REINDEX should take care of that. While being at it, increase FILLFACTER to 100 (from the default 75 for a hash index) for your read-only (!) DB. You can REINDEX to make the change effective.
REINDEX INDEX ix_fingerprints_hash;
Or you can CLUSTER (like jjanes already suggested) on the rearranged B-tree index from above:
CLUSTER fingerprints USING uq_fingerprints;
Rewrites the table and all indexes; rows are physically sorted according to the given index, so "clustered" around the leading column(s). Effects are permanent for your read-only DB. But index-only scans do not benefit from this.
When done optimizing, run once:
VACUUM ANALYZE fingerprints;
work_mem
The tiny setting for work_mem stands out:
work_mem = 582kB
Even the (very conservative!) default is 4MB.
But after reading your question again, it would seem you only have tiny queries. So maybe that's ok after all.
Else, with 16GB RAM you can typically afford a 100 times as much. Depends on your work load of course.
Many small queries, many parallel workers --> keep small work_mem (like 4MB?)
Few big queries, few parallel workers --> go high (like 256MB? or more)
Large amounts of temporary files written in your database over time, and mentions of "disk" in the output of EXPLAIN ANALYZE would indicate the need for more work_mem.
Additional questions
Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)?
More RAM almost always helps until the whole DB can be cached in RAM and all processes can afford all the work_mem they desire.
And what parameters should I change to say to Postgres to store db in ram?
Everything that's read from the database is cached automatically in system cache and Postgres cache, up to the limit of available RAM. (Setting work_mem too high competes for that same resource.)

PostgreSQL index bloat ratio more than table bloat ratio and autovacuum_vacuum_scale_factor

Index bloats are reaching 57%, while table bloat is 9% only and autovacuum_vacuum_Scale_factor is 10% only.
what is more surprising is even primary key is having bloat of 57%. My understanding is since my primary key is auto incrementing and single column key only so after 10% of table dead tuples, primary key index should also have 10% dead tuples.
Now when autovacuum will run at 10% of dead tuples , it will clean dead tuples. The dead tuple space now becomes bloat and this should be reused by new updates, insert. But this isn't happening in my database, here bloat size keeps on increasing.
FYI:
Index Bloat:
current_database | schemaname | tblname | idxname | real_size | extra_size | extra_ratio | fillfactor | bloat_size | bloat_ratio
| is_na
------------------+------------+----------------------+----------------------------------------------------------+------------+------------+------------------+------------+------------+-------------------
+-------
stackdb | public | data_entity | data_entity_pkey | 2766848000 | 1704222720 | 61.5943745373797 | 90 | 1585192960 | 57.2923760177646
Table Bloat:
current_database | schemaname | tblname | real_size | extra_size | extra_ratio | fillfactor | bloat_size | bloat_ratio | is_na
stackdb | public | data_entity | 10106732544 | 1007288320 | 9.96650812332014 | 100 | 1007288320 | 9.96650812332014 | f
Autovacuum Settings:
stackdb=> show autovacuum_vacuum_scale_factor;
autovacuum_vacuum_scale_factor
--------------------------------
0.1
(1 row)
stackdb=> show autovacuum_vacuum_threshold;
autovacuum_vacuum_threshold
-----------------------------
50
(1 row)
Note:
autovacuum is on
autovacuum is running successfully at defined intervals.
postgreSQL is running version 10.6. Same issue has been found with version 12.x
First: an index bloat of 57% is totally healthy. Don't worry.
Indexes become more bloated than tables, because the empty space cannot be reused as freely as it can be in a table. The table, also known al “heap”, has no predetermined ordering: if a new row is written as the result of an INSERT or UPDATE, it ends up in the first page that has enough free space, so it is easy to keep bloat low if VACUUM does its job.
B-tree indexes are different: their entries have a certain ordering, so the database is not free to choose where to put the new row. So you may have to put it into a page that is already full, causing a page split, while elsewhere in the index there are pages that are almost empty.

Postgresql relations per database limitations

In the Postgresql docs Appendix K (https://www.postgresql.org/docs/12/limits.html) there is a list of the limitations:
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| Item | Upper Limit | Comment |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| database size | unlimited | |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| number of databases | 4,294,950,911 | |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| relations per database | 1,431,650,303 | |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| relation size | 32 TB | with the default BLCKSZ of 8192 bytes |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| rows per table | limited by the number of tuples that can fit onto 4,294,967,295 pages | |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| columns per table | 1600 | further limited by tuple size fitting on a single page; see note below |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| field size | 1 GB | |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| identifier length | 63 bytes | can be increased by recompiling PostgreSQL |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| indexes per table | unlimited | constrained by maximum relations per database |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| columns per index | 32 | can be increased by recompiling PostgreSQL |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
| partition keys | 32 | can be increased by recompiling PostgreSQL |
+------------------------+-----------------------------------------------------------------------+------------------------------------------------------------------------+
The line 'relations per database' mean the total number of relations for each row in the database, or the number of declared Foreign Keys ?
For exemple, 2 tables with a Foreign Key to join both count as 1 relation in the database ? Or as 1 * number rows (1.000 rows equals 1.000 relations) ?
Thanks a lot for your help :)
“Relation” is anything that is stored in the catalog table pg_class: tables, views, sequences, indexes, materialized views, partitioned tables and partitioned indexes.
That limit is mostly theoretical, in practice you will get in trouble if you have 100000 or so relations.
See here:
https://www.postgresql.org/docs/current/tutorial-concepts.html
Relation is essentially a mathematical term for table.

optimize a postgres query that updates a big table [duplicate]

I have two huge tables:
Table "public.tx_input1_new" (100,000,000 rows)
Column | Type | Modifiers
----------------|-----------------------------|----------
blk_hash | character varying(500) |
blk_time | timestamp without time zone |
tx_hash | character varying(500) |
input_tx_hash | character varying(100) |
input_tx_index | smallint |
input_addr | character varying(500) |
input_val | numeric |
Indexes:
"tx_input1_new_h" btree (input_tx_hash, input_tx_index)
Table "public.tx_output1_new" (100,000,000 rows)
Column | Type | Modifiers
--------------+------------------------+-----------
tx_hash | character varying(100) |
output_addr | character varying(500) |
output_index | smallint |
input_val | numeric |
Indexes:
"tx_output1_new_h" btree (tx_hash, output_index)
I want to update table1 by the other table:
UPDATE tx_input1 as i
SET
input_addr = o.output_addr,
input_val = o.output_val
FROM tx_output1 as o
WHERE
i.input_tx_hash = o.tx_hash
AND i.input_tx_index = o.output_index;
Before I execute this SQL command, I already created the index for this two table:
CREATE INDEX tx_input1_new_h ON tx_input1_new (input_tx_hash, input_tx_index);
CREATE INDEX tx_output1_new_h ON tx_output1_new (tx_hash, output_index);
I use EXPLAIN command to see the query plan, but it didn't use the index I created.
It took about 14-15 hours to complete this UPDATE.
What is the problem within it?
How can I shorten the execution time, or tune my database/table?
Thank you.
Since you are joining two large tables and there are no conditions that could filter out rows, the only efficient join strategy will be a hash join, and no index can help with that.
First there will be a sequential scan of one of the tables, from which a hash structure is built, then there will be a sequential scan over the other table, and the hash will be probed for each row found. How could any index help with that?
You can expect such an operation to take a long time, but there are some ways in which you could speed up the operation:
Remove all indexes and constraints on tx_input1 before you begin. Your query is one of the examples where an index does not help at all, but actually hurts performance, because the indexes have to be updated along with the table. Recreate the indexes and constraints after you are done with the UPDATE. Depending on the number of indexes on the table, you can expect a decent to massive performance gain.
Increase the work_mem parameter for this one operation with the SET command as high as you can. The more memory the hash operation can use, the faster it will be. With a table that big you'll probably still end up having temporary files, but you can still expect a decent performance gain.
Increase checkpoint_segments (or max_wal_size from version 9.6 on) to a high value so that there are fewer checkpoints during the UPDATE operation.
Make sure that the table statistics on both tables are accurate, so that PostgreSQL can come up with a good estimate for the number of hash buckets to create.
After the UPDATE, if it affects a big number of rows, you might consider to run VACUUM (FULL) on tx_input1 to get rid of the resulting table bloat. This will lock the table for a longer time, so do it during a maintenance window. It will reduce the size of the table and as a consequence speed up sequential scans.

What's the difference between pg_table_size, pg_relation_size & pg_total_relation_size? (PostgreSQL)

What's the difference between pg_table_size(), pg_relation_size() & pg_total_relation_size()?
I understand the basic differences explained in the documentation, but what does it imply in terms of how much space my table is actually using?
For a random table:
# select
pg_relation_size(20306, 'main') as main,
pg_relation_size(20306, 'fsm') as fsm,
pg_relation_size(20306, 'vm') as vm,
pg_relation_size(20306, 'init') as init,
pg_table_size(20306),
pg_indexes_size(20306) as indexes,
pg_total_relation_size(20306) as total;
main | fsm | vm | init | pg_table_size | indexes | total
--------+-------+------+------+---------------+---------+--------
253952 | 24576 | 8192 | 0 | 286720 | 196608 | 483328
(1 row)
From that, you can tell pg_table_size is the sum of all the return values of pg_relation_size. And pg_total_relation_size is the sum of pg_table_size and pg_indexes_size.
If you want to know how much space your tables are using, use pg_table_size and pg_total_relation_size to think about them -- one number is table-only, and one number is table + indexes.
Check the storage file layout for some info about what fsm, vm, and init mean, and how they're stored on disk.
I am pretty sure after looking into the below image you will get a good understanding of various size relationships. I have made this diagram after reading all the answers mentioned in this answer section & analyzing more in DB for the query.
For storage file layout fsm, vm, and init mean, you can get from this link like #jmelesky mentioned.
pg_table_size: Disk space used by the specified table, excluding indexes (but including TOAST, free space map, and visibility map)
pg_relation_size: The size of the main data fork of the relation
select
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pg_size_pretty(pg_relation_size(relid, 'main')) as relation_size_main,
pg_size_pretty(pg_relation_size(relid, 'fsm')) as relation_size_fsm,
pg_size_pretty(pg_relation_size(relid, 'vm')) as relation_size_vm,
pg_size_pretty(pg_relation_size(relid, 'init')) as relation_size_init,
pg_size_pretty(pg_table_size(relid)) as table_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as external_size
from
pg_catalog.pg_statio_user_tables
where
schemaname = 'XXXX'
and relname like 'XXXXXX';
total_size | 6946 MB
relation_size_main | 953 MB
relation_size_fsm | 256 kB
relation_size_vm | 32 kB
relation_size_init | 0 bytes
table_size | 6701 MB
external_size | 5994 MB
so pg_table_size is not only the sum of all the return values of pg_relation_size but you have to add toast size
toast_bytes | 5748 MB