Debugging AccessExclusiveLock in Postgres 9.6 - postgresql

We have an application, backed by Postgres, which briefly locked up. The Postgres logs showed a series of AccessExclusiveLock entries for pg_database:
[13-1] sql_error_code = 00000 LOG: process 7045 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.123 ms
[6-1] sql_error_code = 00000 LOG: process 7132 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.118 ms
[6-1] sql_error_code = 00000 LOG: process 8824 still waiting for AccessExclusiveLock on object 0 of class 1262 of database 0 after 1000.133 ms
[14-1] sql_error_code = 00000 LOG: process 7045 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 39265.319 ms
[7-1] sql_error_code = 00000 LOG: process 7132 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 12824.407 ms
[7-1] sql_error_code = 00000 LOG: process 8824 acquired AccessExclusiveLock on object 0 of class 1262 of database 0 after 6362.509 ms
1262 here refers to pg_database:
=> select 1262::regclass;
+-------------+
| regclass    |
+-------------+
| pg_database |
+-------------+
We are running Postgres 9.6.5 on Heroku.
From what I understand, an AEL will be taken for "heavy" operations such as DROP TABLE, TRUNCATE, REINDEX [1]... Our runtime operations consist of a number of stored procedures, each of which inserts/updates/deletes on multiple tables (deletes are rarer). We do not perform any of the operations listed above or in the linked documentation at runtime, and there were no releases or maintenance (by us) running at the time.
I haven't managed to find any documentation giving examples of when this lock could be taken during the "normal operation" outlined above. My questions are:
What could have caused the AELs (if I have given enough information to speculate), and are there any best practices for avoiding them in the future?
What else could I look at to help debug the cause?
[1]: Postgres docs - explicit locking
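If it happens again, a query along these lines should show which sessions hold or are waiting for the lock and what they are running. This is only a sketch: it joins pg_locks to pg_stat_activity and filters on classid 1262, the pg_database OID from the log entries above.
-- Sketch: sessions holding or waiting for object locks on pg_catalog.pg_database (classid 1262).
SELECT l.pid, l.mode, l.granted, a.state, a.query_start, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'object'
  AND l.classid = 1262
ORDER BY l.granted, a.query_start;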

Related

Why is the subscription status down when using pglogical in AWS Aurora?

I have two Aurora PostgreSQL clusters, 11.9 and 13.4, deployed in AWS. I followed the instructions at https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/ to set up replication from the 11.9 cluster to the 13.4 cluster.
The node, replica_set and subscription are all created successfully in both clusters. However, when I check the subscription status in the target database, the status is down:
=> select subscription_name, slot_name, status from pglogical.show_subscription_status();
subscription_name | slot_name | status
-------------------+-----------------------------------------------+--------
ams_subscription1 | pgl_____ngine_ams_mast2d01c59_ams_subsc050888 | down
(1 row)
In the target database log, I can see the following error:
2022-03-01 06:34:29 UTC::#:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 29403) exited with exit code 1
2022-03-01 06:35:49 UTC::#:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 1453) exited with exit code 1
2022-03-01 06:38:29 UTC::#:[16329]:LOG: background worker "pglogical apply 16400:3226503298" (PID 10318) exited with exit code 1
----------------------- END OF LOG ----------------------
In the source database log:
2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser#AMSEngine:[26786]:ERROR: replication origin "pgl_____ngine_ams_mast2d01c59_ams_subsc050888" does not exist
2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser#AMSEngine:[26786]:STATEMENT: SELECT pg_catalog.pg_replication_origin_session_setup('pgl_____ngine_ams_mast2d01c59_ams_subsc050888');
BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET session_replication_role = 'replica';
SET DATESTYLE = ISO;
SET INTERVALSTYLE = POSTGRES;
SET extra_float_digits TO 3;
SET statement_timeout = 0;
SET lock_timeout = 0;
2022-03-01 06:34:29 UTC:10.74.105.225(33688):amsMasterUser#AMSEngine:[26786]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33684):amsMasterUser#AMSEngine:[26784]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33684):amsMasterUser#AMSEngine:[26784]:LOG: unexpected EOF on client connection with an open transaction
2022-03-01 06:34:29 UTC:10.74.105.225(33686):amsMasterUser#AMSEngine:[26785]:LOG: could not receive data from client: Connection reset by peer
2022-03-01 06:34:29 UTC:10.74.105.225(33686):amsMasterUser#AMSEngine:[26785]:LOG: unexpected EOF on client connection with an open transaction
----------------------- END OF LOG ----------------------
I can see there are errors on both the source and target instances, but I can't figure out what the issue could be.
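For debugging, a couple of catalog queries like these can at least confirm whether the replication origin and slot that the apply worker asks for actually exist. This is just a sketch; the origin name is copied from the logs above, and the queries can be run on whichever instance reports the error.
-- Sketch: check for the replication origin named in the source log.
SELECT * FROM pg_replication_origin
WHERE roname = 'pgl_____ngine_ams_mast2d01c59_ams_subsc050888';
-- Sketch: list replication slots and whether they are active.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;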

Postgres is not accepting commands and Vacuum failed due to missing chunk number error

Version: 9.4.4
Exception while inserting a record in health_status.
org.postgresql.util.PSQLException: ERROR: database is not accepting commands to avoid wraparound data loss in database "db"
Hint: Stop the postmaster and vacuum that database in single-user mode.
As the hint suggests, I logged in to single-user mode and tried to run a full vacuum, but instead received the error below:
PostgreSQL stand-alone backend 9.4.4
backend> vacuum full;
< 2019-11-06 14:26:25.179 UTC > WARNING: database "db" must be vacuumed within 999999 transactions
< 2019-11-06 14:26:25.179 UTC > HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
< 2019-11-06 14:26:25.215 UTC > ERROR: missing chunk number 0 for toast value xxxx in pg_toast_1234
< 2019-11-06 14:26:25.215 UTC > STATEMENT: vacuum full;
I tried to run a plain vacuum, but that leads to another error indicating missing attributes for relid xxxxx:
backend> vacuum;
< 2019-11-06 14:27:47.556 UTC > ERROR: catalog is missing 3 attribute(s) for relid xxxxx
< 2019-11-06 14:27:47.556 UTC > STATEMENT: vacuum;
I tried to run a vacuum freeze for the entire database, but after some time it hits the catalog error again.
Furthermore, a vacuum freeze on a single table works fine, but when I vacuum all tables it presumably includes the corrupted one as well and ends up with the same error:
backend> vacuum full freeze
< 2019-11-07 08:54:25.958 UTC > WARNING: database "db" must be vacuumed within 999987 transactions
< 2019-11-07 08:54:25.958 UTC > HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions.
< 2019-11-07 08:54:26.618 UTC > ERROR: missing chunk number 0 for toast value xxxxx in pg_toast_xxxx
< 2019-11-07 08:54:26.618 UTC > STATEMENT: vacuum full freeze
Is there a way to identify the corrupted table and restore the integrity of the database so the application can access the rest of it?
P.S. I do not have a backup to restore from, so deleting the corrupted data, or somehow fixing it, would be the only solution here.
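For anyone debugging the same situation, queries along these lines can help narrow down the damage. This is a sketch only: pg_toast_1234 stands in for the TOAST table name from the error above, and the second query just lists the tables closest to wraparound.
-- Sketch: find the table that owns the TOAST table named in the error.
SELECT n.nspname, c.relname
FROM pg_class t
JOIN pg_class c ON c.reltoastrelid = t.oid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE t.relname = 'pg_toast_1234';
-- Sketch: tables with the oldest relfrozenxid, i.e. the ones most urgently needing a freeze.
SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 't', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 10;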

Unexpected termination and restart of postgresql 9.6

We have a PostgreSQL 9.6.14 server where we ran a query that caused the postgres process to terminate and the service to restart.
We don't know why it happened.
The query runs fine with another filter value, so I guess it has to do with the amount of data it is querying. But can that really cause a restart of the whole Postgres service? Could it be a memory problem?
postgresql.log
2019-07-12 17:54:13.487 CEST [6459]: [7-1] user=,db=,app=,client= LOG: server process (PID 11064) was terminated by signal 11: Segmentation fault
2019-07-12 17:54:13.487 CEST [6459]: [8-1] user=,db=,app=,client= DETAIL: Failed process was running:
2019-07-12 17:54:13.487 CEST [6459]: [9-1] user=,db=,app=,client= LOG: terminating any other active server processes
2019-07-12 17:54:13.488 CEST [11501]: [1-1] user=hg,db=test,app=[unknown],client=172.31.0.43 WARNING: terminating connection because of crash of another server process
2019-07-12 17:54:13.488 CEST [11501]: [2-1] user=hg,db=test,app=[unknown],client=172.31.0.43 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-07-12 17:54:13.488 CEST [11501]: [3-1] user=hg,db=test,app=[unknown],client=172.31.0.43 HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-07-12 17:54:13.488 CEST [8889]: [2-1] user=hg,db=_test,app=[unknown],client=172.31.0.46 WARNING: terminating connection because of crash of another server process
select stat.*,
(
Select
1
From
table1 a, table2 pg
Where
a.field_1::Text = stat.field_1::Text And
a.field_2::Text = stat.field_2::Text And
stat.field_3::Text = pg.field_3::Text And
a.field_4= pg.field_4
limit 1
)
from table3 stat
where field_1= 'xyz';
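If the crash is reproducible, capturing the plan for the failing filter value may help. The sketch below only plans the query (EXPLAIN without ANALYZE does not execute it), so it should be safe to run and to compare against the filter value that works:
-- Sketch: plan the crashing query without executing it.
EXPLAIN (VERBOSE, COSTS)
SELECT stat.*,
       (SELECT 1
        FROM table1 a, table2 pg
        WHERE a.field_1::text = stat.field_1::text
          AND a.field_2::text = stat.field_2::text
          AND stat.field_3::text = pg.field_3::text
          AND a.field_4 = pg.field_4
        LIMIT 1)
FROM table3 stat
WHERE field_1 = 'xyz';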

Postgres Replication with pglogical: ERROR: connection to other side has died

Got this error (on replica) while replicating between 2 Postgres instances:
ERROR: connection to other side has died
Here are the logs on the replica/subscriber:
2017-09-15 20:03:55 UTC [14335-3] LOG: apply worker [14335] at slot 7 generation 109 crashed
2017-09-15 20:03:55 UTC [2961-1732] LOG: worker process: pglogical apply 16384:3661733826 (PID 14335) exited with exit code 1
2017-09-15 20:03:59 UTC [14331-2] ERROR: connection to other side has died
2017-09-15 20:03:59 UTC [14331-3] LOG: apply worker [14331] at slot 2 generation 132 crashed
2017-09-15 20:03:59 UTC [2961-1733] LOG: worker process: pglogical apply 16384:3423246629 (PID 14331) exited with exit code 1
2017-09-15 20:04:02 UTC [14332-2] ERROR: connection to other side has died
2017-09-15 20:04:02 UTC [14332-3] LOG: apply worker [14332] at slot 4 generation 125 crashed
2017-09-15 20:04:02 UTC [2961-1734] LOG: worker process: pglogical apply 16384:2660030132 (PID 14332) exited with exit code 1
2017-09-15 20:04:02 UTC [14350-1] LOG: starting apply for subscription parking_sub
2017-09-15 20:04:05 UTC [14334-2] ERROR: connection to other side has died
2017-09-15 20:04:05 UTC [14334-3] LOG: apply worker [14334] at slot 6 generation 119 crashed
2017-09-15 20:04:05 UTC [2961-1735] LOG: worker process: pglogical apply 16384:394989729 (PID 14334) exited with exit code 1
2017-09-15 20:04:06 UTC [14333-2] ERROR: connection to other side has died
Logs on master/provider:
2017-09-15 23:22:43 UTC [22068-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:43 UTC [22068-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:44 UTC [22067-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:44 UTC [22067-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:48 UTC [22070-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:48 UTC [22070-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:49 UTC [22069-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:49 UTC [22069-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
Config on master/provider:
archive_mode = on
archive_command = 'cp %p /data/pgdata/wal_archives/%f'
max_wal_senders = 20
wal_level = logical
max_worker_processes = 100
max_replication_slots = 100
shared_preload_libraries = pglogical
max_wal_size = 20GB
Config on the replica/subscriber:
max_replication_slots = 100
shared_preload_libraries = pglogical
max_worker_processes = 100
max_wal_size = 20GB
I have a total of 18 subscriptions for 18 schemas. It seemed to work fine in the beginning, but it quickly deteriorated, and some subscriptions started to bounce between the down and replicating statuses with the error posted above.
Question
What could be the possible causes? Do I need to change my Pg configurations?
Also, I noticed that when replication is going on, the CPU usage on the master/provider is pretty high.
/# ps aux | sort -nrk 3,3 | head -n 5
postgres 18180 86.4 1.0 415168 162460 ? Rs 22:32 19:03 postgres: getaround getaround 10.240.0.7(64106) CREATE INDEX
postgres 20349 37.0 0.2 339428 38452 ? Rs 22:53 0:07 postgres: wal sender process repuser 10.240.0.7(49742) idle
postgres 20351 33.8 0.2 339296 36628 ? Rs 22:53 0:06 postgres: wal sender process repuser 10.240.0.7(49746) idle
postgres 20350 28.8 0.2 339016 44024 ? Rs 22:53 0:05 postgres: wal sender process repuser 10.240.0.7(49744) idle
postgres 20352 27.6 0.2 339420 36632 ? Rs 22:53 0:04 postgres: wal sender process repuser 10.240.0.7(49750) idle
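For reference, a query along these lines on the master/provider shows whether the slots are active and how much WAL is being retained for them (a sketch; the pg_xlog_* function names assume a pre-10 server, they were renamed to pg_current_wal_lsn/pg_wal_lsn_diff in v10):
-- Sketch: per-slot WAL retention on the provider.
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;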
Thanks in advance!
I had a similar problem, which was fixed by setting the wal_sender_timeout parameter on the master/provider to 5 minutes (the default is 1 minute). The sender drops the connection if it times out; increasing the timeout seems to have fixed the problem for me.
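For example, on a 9.4+ provider something like the following should apply it without a restart; alternatively, set wal_sender_timeout = 5min directly in postgresql.conf and reload (this is a sketch, not pglogical-specific advice):
-- Sketch: raise the sender timeout on the master/provider, then reload the config.
ALTER SYSTEM SET wal_sender_timeout = '5min';
SELECT pg_reload_conf();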

Postgresql Logger Process

I'm trying to determine if Postgres 9.3 still has a logger process. It isn't referenced anywhere in the "PostgreSQL 9.3.4 Documentation", and I can't find it in my cluster's process list (see below). Also, does anyone know of a good general overview of the memory structures in 9.3?
postgres 21397 1 0 20:51 pts/1 00:00:00 /opt/PostgreSQL/9.3/bin/postgres
postgres 21399 21397 0 20:51 ? 00:00:00 postgres: checkpointer process
postgres 21400 21397 0 20:51 ? 00:00:00 postgres: writer process
postgres 21401 21397 0 20:51 ? 00:00:00 postgres: wal writer process
postgres 21402 21397 0 20:51 ? 00:00:00 postgres: autovacuum launcher process
postgres 21403 21397 0 20:51 ? 00:00:00 postgres: archiver process last was 0001000004000092
postgres 21404 21397 0 20:51 ? 00:00:00 postgres: stats collector process
Thanks
Jim
Postgres has a logging collector process, which is controlled through the config parameter logging_collector.
So in your postgresql.conf file, you would make sure this is set:
logging_collector = on
The blurb on this parameter from the Postgres docs:
This parameter enables the logging collector, which is a background
process that captures log messages sent to stderr and redirects them
into log files. This approach is often more useful than logging to
syslog, since some types of messages might not appear in syslog
output. (One common example is dynamic-linker failure messages;
another is error messages produced by scripts such as
archive_command.) This parameter can only be set at server start.
It will show up in the process list with the following description:
postgres: logger process
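To double-check from psql, something like the following shows whether the collector is enabled and where the files would be written (a sketch; note that on 9.3 there is no ALTER SYSTEM, so changing logging_collector means editing postgresql.conf and restarting):
-- Sketch: inspect the current logging settings.
SHOW logging_collector;
SHOW log_destination;
SHOW log_directory;
SHOW log_filename;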
For more info: http://www.postgresql.org/docs/current/static/runtime-config-logging.html
Regarding the memory structures, I'm not sure offhand, but would recommend you post that as a separate question.