PostgreSQL master server hangs on replication flow

First of all, I'm not a data engineer, so I'll do my best to give you everything needed to resolve my problem :/
Context:
I'm trying to set up 2 PostgreSQL servers, 1 master and 1 slave.
psql (PostgreSQL) 10.9 (Ubuntu 10.9-0ubuntu0.18.04.1)
As far as I understand, it's not a good idea to use synchronous replication with only 2 servers, but I still need to understand what's going on here...
Problem:
The master server hangs when I try to execute a CREATE SCHEMA test;.
Yet the schema gets created on the master, and it exists on the slave too. The master hangs because it keeps waiting for the slave's commit status...
Configuration of Master:
/etc/postgresql/10/main/conf.d/master.conf
# Connection
listen_addresses = '127.0.0.1,slave-ip'
ssl = on
ssl_cert_file = '/etc/ssl/postgresql/certs/server.pem'
ssl_key_file = '/etc/ssl/postgresql/private/server.key'
ssl_ca_file = '/etc/ssl/postgresql/certs/server.pem'
password_encryption = scram-sha-256
# WAL
wal_level = replica
synchronous_commit = remote_apply #local works, remote_apply hangs
# Archive
archive_mode = on
archive_command = 'rsync -av %p postgres@lab-3:/var/lib/postgresql/wal_archive_lab_2/%f'
# Replication master
max_wal_senders = 2
wal_keep_segments = 100
synchronous_standby_names = 'ANY 1 ("lab-3")'
/etc/postgresql/10/main/pg_hba.conf
hostssl replication replicate slave-ip/32 scram-sha-256
Configuration of Slave:
/etc/postgresql/10/main/conf.d/standby.conf
# Connection
listen_addresses = '127.0.0.1,master-ip'
ssl = on
ssl_cert_file = '/etc/ssl/postgresql/certs/server.pem'
ssl_key_file = '/etc/ssl/postgresql/private/server.key'
ssl_ca_file = '/etc/ssl/postgresql/certs/server.pem'
password_encryption = scram-sha-256
# WAL
wal_level = replica
# Archive
archive_mode = on
archive_command = 'rsync -av %p postgres@lab-3:/var/lib/postgresql/wal_archive_lab_3/%f'
# Replication slave
max_wal_senders = 2
wal_keep_segments = 100
hot_standby = on
/var/lib/postgresql/10/main/recovery.conf
standby_mode = on
primary_conninfo = 'host=master-ip port=5432 user=replicate password=replicate_password sslmode=require application_name="lab-3"'
trigger_file = '/var/lib/postgresql/10/postgresql.trigger'
I get absolutely NOTHING in the log files when it hangs, just this warning when I press Ctrl+C to abort on the master instance:
WARNING: canceling wait for synchronous replication due to user request
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
Is there a way to check what happens, and why it stays stuck like this?
EDIT 1
The content of pg_stat_replication:
Before query
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------
54431 | 16384 | replicate | "lab-3" | slave-ip | | 47742 | 2019-08-06 07:56:48.105056+02 | | streaming | 0/110000D0 | 0/110000D0 | 0/110000D0 | 0/110000D0 | | | | 0 | async
(1 row)
While it hangs / after
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+-----------+------------------+--------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------------+-----------------+---------------+---------------+------------
54431 | 16384 | replicate | "lab-3" | slave-ip | | 47742 | 2019-08-06 07:56:48.105056+02 | | streaming | 0/11000C10 | 0/11000C10 | 0/11000C10 | 0/11000C10 | 00:00:00.000521 | 00:00:00.004421 | 00:00:00.0045 | 0 | async
(1 row)
Thanks!

As Laurenz Albe said, the problem was the quoting of the synchronous standby name.
The documentation explains that the name must be double-quoted in the synchronous_standby_names entry on the master if it contains a dash, but it must not be quoted in the application_name of primary_conninfo on the slave. (That is also why pg_stat_replication above shows sync_state = async: the standby registered itself as "lab-3", quotes included, which never matched the configured name.)
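Concretely, the working combination looks like this (same hosts and credentials as in the question):
# master: /etc/postgresql/10/main/conf.d/master.conf
# the dash in the name requires double quotes inside this setting
synchronous_standby_names = 'ANY 1 ("lab-3")'
# slave: /var/lib/postgresql/10/main/recovery.conf
# no quotes around the application_name here
primary_conninfo = 'host=master-ip port=5432 user=replicate password=replicate_password sslmode=require application_name=lab-3'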

Related

pg_wal has grown very big

I have a Postgres cluster with 3 nodes: ETCD+Patroni+Postgres13.
There is a problem with a constantly growing pg_wal folder; it now contains 5127 files. After searching the internet, I found an article advising me to pay attention to the following database parameters (their values at the time were):
archive_mode off;
wal_level replica;
max_wal_size 1G;
postgres=# SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]-------+------------
slot_name | db2
plugin |
slot_type | physical
datoid |
database |
temporary | f
active | t
active_pid | 2247228
xmin |
catalog_xmin |
restart_lsn | 2D/D0ADC308
confirmed_flush_lsn |
wal_status | reserved
safe_wal_size |
-[ RECORD 2 ]-------+------------
slot_name | db1
plugin |
slot_type | physical
datoid |
database |
temporary | f
active | t
active_pid | 2247227
xmin |
catalog_xmin |
restart_lsn | 2D/D0ADC308
confirmed_flush_lsn |
wal_status | reserved
safe_wal_size |
All other functionality of the Patroni cluster works (switchover, reinit, replication).
root@srvdb3:~# patronictl -c /etc/patroni/patroni.yml list
+ Cluster: mobile (7173650272103321745) --+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+------------+---------+---------+----+-----------+
| db1 | 10.01.1.01 | Replica | running | 17 | 0 |
| db2 | 10.01.1.02 | Replica | running | 17 | 0 |
| db3 | 10.01.1.03 | Leader | running | 17 | |
+--------+------------+---------+---------+----+-----------+
Patroni configuration (patronictl edit-config):
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  parameters:
    checkpoint_timeout: 30
    hot_standby: 'on'
    max_connections: '1100'
    max_replication_slots: 5
    max_wal_senders: 5
    shared_buffers: 2048MB
    wal_keep_segments: 5120
    wal_level: replica
  use_pg_rewind: true
  use_slots: true
retry_timeout: 10
ttl: 100
Please help: what could be the matter?
This is what I see in pg_stat_archiver:
postgres=# select * from pg_stat_archiver;
-[ RECORD 1 ]------+------------------------------
archived_count | 0
last_archived_wal |
last_archived_time |
failed_count | 0
last_failed_wal |
last_failed_time |
stats_reset | 2023-01-06 10:21:45.615312+00
If you have wal_keep_segments set to 5120, it is completely normal if you have 5127 WAL segments in pg_wal, because PostgreSQL will always retain at least 5120 old WAL segments. If that is too many for you, reduce the parameter. If you are using replication slots, the only disadvantage is that you might only be able to pg_rewind soon after a failover.
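On a Patroni-managed cluster the parameter should be changed through Patroni rather than ALTER SYSTEM, since Patroni controls these settings via the DCS. A sketch, assuming patronictl's edit-config subcommand (the --set form is an assumption; the interactive editor achieves the same):
# keep e.g. 100 segments (100 x 16 MB = 1.6 GB of WAL) instead of 5120
patronictl -c /etc/patroni/patroni.yml edit-config \
    --set postgresql.parameters.wal_keep_segments=100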

Issue with PostgreSQL HA mode switchover of the master node

I am new to PostgreSQL configuration. I am trying to configure PostgreSQL in HA mode with the help of pgpool and an Elastic IP. The full setup runs on AWS RHEL 8 servers.
pgpool version: 4.1.2
postgres version: 12
I followed these links during the configuration:
https://www.pgpool.net/docs/pgpool-II-4.1.2/en/html/example-cluster.html#EXAMPLE-CLUSTER-STRUCTURE
https://www.pgpool.net/docs/42/en/html/example-aws.html
https://www.enterprisedb.com/docs/pgpool/latest/03_configuring_connection_pooling/
Currently the postgres and pgpool services are up on all 3 nodes. But if I stop the master postgres service/server, the whole setup goes down and the standby node does not take the place of the master. Here is the status of the pool nodes when the master is down:
node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+--------------+------+--------+-----------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
0 | server1 | 5432 | down | 0.333333 | standby | 0 | false | 0 | | | 2022-10-12 12:10:13
1 | server2 | 5432 | up | 0.333333 | standby | 0 | true | 0 | | | 2022-10-13 09:16:07
2 | server3 | 5432 | up | 0.333333 | standby | 0 | false | 0 | | | 2022-10-13 09:16:07
Any help would be appreciated. Thanks in advance.
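Not a confirmed diagnosis, but note that no node shows the primary role in this output, and a common cause of that symptom is a missing or failing failover_command, so pgpool detaches the dead node without ever promoting a standby. The pgpool.conf entries below mirror the example-cluster document linked above; the script path and placeholder list come from that example, not from the poster's setup:
# pgpool.conf - run a promotion script when a backend node goes down
failover_command = '/etc/pgpool2/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
# treat a backend error as a node failure so failover is triggered
failover_on_backend_error = on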

WAL files keep growing even though archiving is working

I have archiving fully working, with failed_count = 0, and I successfully set up logical replication for 2 tables (23 million records and 200 columns in total).
FYI: the tables are not busy during the weekend (0 additional records), and even on weekdays the 2 tables only get a maximum of 1 - 10 records per day.
But when I run SELECT COUNT(*) FROM pg_ls_dir('pg_wal') WHERE pg_ls_dir ~ '^[0-9A-F]{24}', I can see that the WAL files keep growing at a rate of 1 - 2 files every 20 minutes.
I expected the WAL files to stop growing as soon as archiving takes place.
When I inherited this database, there were already 6800 WAL files, even though no logical replication was taking place yet.
Here is my configuration:
name |setting |unit|
----------------------------+------------------------------------------+----+
archive_command |test ! -f /archive/%f && cp %p /archive/%f| |
archive_mode |on | |
archive_timeout |2400 |s |
checkpoint_completion_target|0.9 | |
checkpoint_flush_after |32 |8kB |
checkpoint_timeout |300 |s |
checkpoint_warning |30 |s |
hot_standby |on | |
log_checkpoints |off | |
max_replication_slots |10 | |
max_wal_senders |5 | |
max_wal_size |8192 |MB |
min_wal_size |2048 |MB |
synchronous_standby_names |* | |
wal_compression |off | |
wal_level |logical | |
wal_log_hints |off | |
wal_segment_size |16777216 |B |
wal_sender_timeout |60000 |ms |
Questions:
Why do they keep growing? Is the server trying to send all the existing WAL files to another server involved in the logical replication?
How do I stop them from growing?
How do I start from zero again (i.e. an empty pg_wal)?
PostgreSQL 12.
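A hedged diagnostic, since wal_level = logical is in play: WAL is typically retained by replication slots whose consumers have not confirmed the changes yet, and pg_replication_slots shows how far behind each slot is. For PostgreSQL 12:
-- how much WAL each slot pins; an inactive slot or a large retained
-- value points at the consumer that prevents WAL recycling
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots;
-- an orphaned slot can then be removed with:
-- SELECT pg_drop_replication_slot('slot_name');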

CREATE DATABASE never ends

I cannot create a database with Postgres 9.6.12; looking at pg_activity, there are no blocking or waiting queries.
This is my session in pg_stat_activity:
-[ RECORD 1 ]----+------------------------------------
datid | 16390
datname | mydb
pid | 7275
usesysid | 10
usename | postgres96
application_name | pgAdmin III - Query Tool
client_addr | myip
client_hostname | mypc
client_port | 55202
backend_start | 2019-07-22 09:12:11.238705-04
xact_start | 2019-07-22 09:12:13.010278-04
query_start | 2019-07-22 09:12:13.010278-04
state_change | 2019-07-22 09:12:13.010282-04
wait_event_type |
wait_event |
state | active
backend_xid | 991367173
backend_xmin | 991367173
query | CREATE DATABASE mydb2\r +
| WITH OWNER = postgres96\r +
| ENCODING = 'UTF8'\r +
| TABLESPACE = system\r +
| LC_COLLATE = 'en_US.UTF-8'\r+
| LC_CTYPE = 'en_US.UTF-8'\r +
| CONNECTION LIMIT = -1;
Why is it taking so long?
Well... after dropping all pglogical subscriptions and restarting the service, I could create the database (a simple restart alone did not help).
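For anyone hitting the same thing, a sketch of the pglogical side (function names are from the pglogical extension; the subscription name is a placeholder):
-- list subscriptions and their status
SELECT * FROM pglogical.show_subscription_status();
-- drop the subscription(s) blocking CREATE DATABASE
SELECT pglogical.drop_subscription('my_subscription');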

pgpool-II 3.7.5 not caching PG connections

Shouldn't pgpool cache PG backend processes? After disconnecting and reconnecting, pool_backendpid changes.
Relevant parameters:
num_init_children = 1
max_pool = 1
child_life_time = 300
child_max_connections = 0
connection_life_time = 0
client_idle_limit = 0
connection_cache = on
Test:
postgres@node3:/etc/pgpool2$ psql -p 5433 -U postgres postgres
psql (9.6.10)
Type "help" for help.
postgres=# show pool_pools;
LOG: statement: show pool_pools;
pool_pid | start_time | pool_id | backend_id | database | username | create_time | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected
----------+---------------------+---------+------------+----------+----------+---------------------+--------------+--------------+--------------+-----------------+----------------
3569 | 2018-09-13 20:18:22 | 0 | 0 | postgres | postgres | 2018-09-13 20:25:04 | 3 | 0 | 1 | 3631 | 1
(1 row)
postgres=# \q
postgres@node3:/etc/pgpool2$ psql -p 5433 -U postgres postgres
psql (9.6.10)
Type "help" for help.
postgres=# show pool_pools;
LOG: statement: show pool_pools;
pool_pid | start_time | pool_id | backend_id | database | username | create_time | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected
----------+---------------------+---------+------------+----------+----------+---------------------+--------------+--------------+--------------+-----------------+----------------
3569 | 2018-09-13 20:18:22 | 0 | 0 | postgres | postgres | 2018-09-13 20:25:15 | 3 | 0 | 1 | 3640 | 1
(1 row)
Found out why:
connection_cache (boolean)
Caches connections to backends when set to on. Default is on. However,
connections to template0, template1, postgres and regression databases
are not cached even if connection_cache is on.
I was connecting to the postgres database.
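A quick way to verify: connect twice to any regular (non-template, non-postgres) database and compare pool_backendpid; with connection_cache = on it should stay the same. A hypothetical session, reusing the setup above:
postgres@node3:/etc/pgpool2$ createdb -p 5433 mydb
postgres@node3:/etc/pgpool2$ psql -p 5433 -U postgres mydb -c 'show pool_pools;'
postgres@node3:/etc/pgpool2$ psql -p 5433 -U postgres mydb -c 'show pool_pools;'
# pool_backendpid should now be identical in both outputs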