Postgres Aurora 9.6 - Toast Vacuum Causing Table Lock

I have a table in Aurora Postgres 9.6 that is just:
create table myTable
(
    id    uuid  default extensions.uuid_generate_v4() not null,
    blobs jsonb not null
);
The blobs can get rather large at times, but are usually a few MB each, and they end up getting stored in the TOAST tables.
Under increased load, I started to see table locks on the Aurora replica (lock_relation, or AccessExclusiveLock in Postgres terms). Looking at the pg_locks table, it seems that the cause of the table lock is a system process.
We are not able to select any rows from that table. What I found is that once we kill the vacuum, the table lock is released and we can fetch rows again.
Locks
id,locktype,database,relation,page,tuple,virtualxid,transactionid,classid,objid,objsubid,virtualtransaction,pid,mode,granted,fastpath
1,relation,16394,142767200,,,,,,,,1/0,33244,AccessExclusiveLock,true,false
Relation Translation
142767200 -> pg_toast.pg_toast_142767196
pg_toast.pg_toast_142767196 -> mySchema.myTable
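For reference, a lookup along these lines (a sketch, not from the original post; the OID is the one reported above) can map a TOAST relation back to the table that owns it:
-- map a TOAST table's OID back to the table that owns it
SELECT n.nspname AS schema_name, c.relname AS table_name
FROM pg_class t
JOIN pg_class c ON c.reltoastrelid = t.oid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE t.oid = 142767200;  -- relation OID from pg_locks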
Vacuum Run
0 years 0 mons 0 days 1 hours 37 mins 47.688907 secs rdsadmin autovacuum: VACUUM pg_toast.pg_toast_142767196 active
Questions:
My understanding is that autovacuum shouldn't take locks that interfere with other operations, so why do I see these?
Is there some other system process connected to autovacuum that, once killed, unlocks the table? (The PIDs don't match up directly.)
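As a diagnostic sketch (not part of the original post), a query along these lines joins pg_locks to pg_stat_activity to show which session, including an autovacuum worker, holds or is waiting for the lock on the TOAST table (the table name is taken from the translation above):
-- correlate lock holders and waiters with the sessions behind them
SELECT l.pid, l.mode, l.granted, a.state, a.xact_start, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'pg_toast.pg_toast_142767196'::regclass;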

Related

PostgreSQL - Creating index on multiple partitioned tables

I am trying to create indexes on multiple (1000) partitioned tables. As I'm using Postgres 10.2, I would have to do this for each of the partitions separately, executing 1000 queries in total.
I have figured out how to do it, and it works in environments where the table sizes and the number of transactions are small. Below is the query to be executed for one of the tables (to be repeated for all the tables: user_2, user_3, etc.):
CREATE INDEX IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
where user_data is a jsonb column
This query does not work for large tables with a high number of transactions when I run it for all the tables at once. The error thrown:
ERROR: SQL State : 40P01
Error Code : 0
Message : ERROR: deadlock detected
Detail: Process 77999 waits for ShareLock on relation 1999264 of database 16311; blocked by process 77902.
Process 77902 waits for RowExclusiveLock on relation 1999077 of database 16311; blocked by process 77999
I am able to run it in small batches (of 25 each); I still encounter the issue at times, but it succeeds when I retry once or twice. The smaller the batch, the lower the chance of a deadlock.
I would think this happens because all the user tables (user_1, user_2, etc.) are linked to the parent table user. I don't want to lock the entire table for the index creation (since in theory only one table is being modified at a time). Why does this happen, and is there any way around it to ensure that the indexes are created without deadlocks?
This worked:
CREATE INDEX CONCURRENTLY IF NOT EXISTS user_1_idx_location_id
ON users.user_1 ( user_id, ( user_data->>'locationId') );
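To avoid typing out 1000 statements, one option (a sketch, assuming the partitions all live in the users schema and follow the user_N naming from the question) is to generate the CREATE INDEX CONCURRENTLY statements in psql and run them one at a time with \gexec:
-- generate one CREATE INDEX CONCURRENTLY per partition, then execute each with \gexec
SELECT format(
         'CREATE INDEX CONCURRENTLY IF NOT EXISTS %I ON users.%I (user_id, (user_data->>''locationId''))',
         c.relname || '_idx_location_id',
         c.relname
       )
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'users'
  AND c.relname LIKE 'user\_%'
ORDER BY c.relname
\gexec
Since \gexec runs each generated statement as a separate command outside an explicit transaction block, this plays nicely with CREATE INDEX CONCURRENTLY, which cannot run inside a transaction.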

When does PostgreSQL update its statistical information for the CBO?

I know the statistical information is updated by VACUUM ANALYZE and CREATE INDEX, but I'm not sure about some other situations:
insert new data into a table
let the database do nothing (and wait for autovacuum?)
delete some rows in a table
truncate a partition of a table
CREATE INDEX does not cause new statistics to be calculated.
The autovacuum daemon will run an ANALYZE for every table that has had more than 10% of its data changed (with the default configuration). These changes are INSERTs, UPDATEs and DELETEs. TRUNCATE will clear the statistics for a table.
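To see when a table was last analyzed and how many rows have changed since, a query along these lines against pg_stat_user_tables can help (my_table is a placeholder):
-- when was the table last analyzed, and how much has changed since?
SELECT relname,
       n_mod_since_analyze,  -- rows inserted/updated/deleted since the last ANALYZE
       last_analyze,         -- last manual ANALYZE
       last_autoanalyze      -- last ANALYZE run by the autovacuum daemon
FROM pg_stat_user_tables
WHERE relname = 'my_table';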

Postgres select query acquired AccessShareLock and blocked other running queries

Ideally, queries wait only for an ExclusiveLock on a table, but we saw some weird behaviour in our prod infra: SELECT queries were waiting for an AccessShareLock on a view in Postgres. Any reason why this could happen? Here is the output of the lock monitoring query against the pg_locks table:
https://docs.google.com/spreadsheets/d/1xt0sfYicrDiPEdd3QdVVEm--cdHI3QGHSKb2jP2ofjI/edit#gid=1512744594
The only lock that conflicts with an AccessShareLock is an AccessExclusiveLock, which is taken by statements like TRUNCATE, DROP TABLE or ALTER TABLE. So either one of these statements is currently running, or it is waiting for a transaction to finish (lock requests are queued).
Avoid long database transactions, and the problem will go away.
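To see which session is actually doing the blocking, a sketch using pg_blocking_pids() (available since PostgreSQL 9.6) looks like this:
-- list blocked sessions together with the PIDs that block them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,  -- sessions holding (or queued ahead for) the conflicting lock
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;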

Why does autovacuum not vacuum my table?

There is one table in my schema that does not get autovacuumed. If I run VACUUM posts; on the table the vacuum process finishes nicely, but the autovacuum daemon never vacuums the table for some reason.
Is there a way to find out why? What could be possible reasons for this?
The table is the only medium-sized one (3 million rows).
That is just fine, nothing to worry about.
Autovacuum will kick in once the number of dead tuples exceeds autovacuum_vacuum_scale_factor (default: 0.2) times the number of live tuples, i.e. once more than 20% of your table has been deleted or updated.
This is usually just fine, and I would not change it. But if you want to do it for some reason, you can do it like this:
ALTER TABLE posts SET (autovacuum_vacuum_scale_factor = 0.1);
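With the default scale factor, a 3-million-row table needs roughly 600,000 dead tuples (0.2 × 3,000,000, plus the small autovacuum_vacuum_threshold) before autovacuum runs. A quick way to check how close the table is, sketched against pg_stat_user_tables:
-- compare dead tuples against live tuples and see when vacuum last ran
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,      -- last manual VACUUM
       last_autovacuum   -- last run by the autovacuum daemon
FROM pg_stat_user_tables
WHERE relname = 'posts';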

PostgreSQL database size (tablespace size) much bigger than calculated total sum of relations

Hello all,
I see a very big difference between the actual database size (on the HDD, and as reported by the pg_database_size() call) and the size calculated by summing up the total relation sizes retrieved by pg_total_relation_size().
The former is 62 GB and the latter is 16 GB (the difference matches the amount of data deleted from the biggest table).
Here is a simplified query that can show that difference on my system:
select current_database(),
       pg_size_pretty( sum(total_relation_raw_size)::bigint ) as calculated_database_size,
       pg_size_pretty( pg_database_size(current_database()) ) as database_size
from (select pg_total_relation_size(relid) as total_relation_raw_size
      from pg_stat_all_tables -- this also includes system tables shared between databases
      where schemaname != 'pg_toast'
     ) as stats;
It seems like there is some dangling data there. This situation appeared after we dumped and VACUUM FULLed lots of unused data from that DB.
P.S.: I suppose it was database corruption of some sort... The only way to recover from this situation was to switch to the hot-standby database...
LOBs are a very valid concern as BobG writes, since they are not deleted when the rows of your application table (containing the OIDs) get deleted.
These will NOT be deleted by the VACUUM process automatically; they are only removed when you run vacuumlo.
Vacuumlo will delete all of the unreferenced LOBs from the database.
Example call:
vacuumlo -U postgres -W -v <database_name>
(I only included the -v to make vacuumlo a bit more verbose so that you see how many LOBs it removes)
After vacuumlo has deleted the LOBs, you can run VACUUM FULL (or let the auto-vacuum process run).
Do you have unused LOBs?
If you have something like this:
CREATE TABLE bigobjects (
id BIGINT NOT NULL PRIMARY KEY,
filename VARCHAR(255) NOT NULL,
filecontents OID NOT NULL
);
followed by:
\lo_import '/tmp/bigfile'
11357
INSERT INTO bigobjects VALUES (1, 'bigfile', 11357);
TRUNCATE TABLE bigobjects;
You'll still have the LOB (id 11357) in the database.
You can check the pg_catalog.pg_largeobject system catalog table for all the large objects in your database (recommend SELECT DISTINCT loid FROM pg_catalog.pg_largeobject unless you want to see all your LOB data as octal.)
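For example, a quick count and size check (a sketch; on recent PostgreSQL versions reading pg_largeobject directly may require superuser rights):
-- how many large objects exist, and how much space do they occupy?
SELECT count(DISTINCT loid) AS lob_count,
       pg_size_pretty(pg_total_relation_size('pg_catalog.pg_largeobject')) AS lob_storage
FROM pg_catalog.pg_largeobject;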
If you clean out all your unused LOBs and do a VACUUM FULL, you should see a hefty reduction in storage. I just tried this on a personal dev database I've been using and saw a reduction in size from 200MB down to 10MB (as reported by pg_database_size(current_database()).)
As for "this situation appeared after we dumped and VACUUM FULLed lots of unused data from that DB": I had a similar experience, a 3 GB DB with lots of dynamic data that grew to 20 GB over a month or so.
Manually deleting / vacuuming the problematic tables didn't seem to have any effect.
And then we just ran a final
VACUUM FULL ANALYZE;
on the WHOLE DB ... and it dropped the size by half.
It took 4 hours, so be careful with that.
Your query is specifically screening out pg_toast tables, which can be big. See if getting rid of that where schemaname != 'pg_toast' gets you a more accurate answer.
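To narrow down where the space actually goes, a per-table breakdown along these lines (heap versus TOAST plus indexes, biggest first) can help:
-- break each table's footprint into heap vs. TOAST + indexes, biggest first
SELECT relname,
       pg_size_pretty(pg_relation_size(relid)) AS heap_size,
       pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS toast_and_index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;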