PostgreSQL WalReceiver process waits on connecting to master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. When the master fails over or fails temporarily, the standby loses its streaming replication connection, and when it retries, the connection attempt takes a long time to fail before the next retry.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to master and all the replicas connect to this service for replication). I've tried setting configs like connect_timeout and keepalive in primary_conninfo of recovery.conf and wal_receiver_timeout in postgresql.conf of standby but I could not make any progress with them.
First, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity, I found that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait event (state 2), but the timeout is far longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process while it is stuck (state 2). When I do, the process is started again, connects, and then streams normally (it jumps to state 4).
After checking netstat, I also found that the walreceiver process has a connection in SYN_SENT state to the old master (in the failover case).

connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
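As a rough illustration of why the stall in the question lasts about two minutes, the sketch below estimates how long an unanswered connection attempt takes to fail. It assumes the usual Linux behaviour of a 1-second initial retransmission timeout that doubles after every unanswered SYN; that is an assumption about the kernel, not something stated in the question, and the function name is hypothetical.

```python
def syn_connect_timeout(syn_retries, initial_rto=1.0):
    """Estimate the seconds before connect() gives up on unanswered SYNs.

    Assumes the kernel waits initial_rto after the first SYN and doubles
    the wait after each retransmission (exponential backoff).
    """
    # One wait after the initial SYN plus one after each retransmission.
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


if __name__ == "__main__":
    for retries in (6, 3):
        seconds = syn_connect_timeout(retries)
        print(f"tcp_syn_retries={retries}: about {seconds:.0f} s")
```

Under these assumptions, 6 retries (a common default) give 127 seconds, which is close to the delay seen in the logs above, and 3 retries give 15 seconds.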

Related

PSQL timeline conflict prevents start of master

We had an outage on one of our PostgreSQL 14 instances (managed by Zalando) because the k8s control plane was unreachable for 30 minutes.
The control plane is now OK, but the master PostgreSQL instance does not start:
LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432"
LOG,00000,"listening on IPv6 address ""::"", port 5432"
LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432"""
LOG,00000,"database system was shut down at 2023-01-30 02:51:10 UTC"
WARNING,01000,"specified neither primary_conninfo nor restore_command",,"The database server will regularly poll the pg_wal subdirectory to check for files placed there."
LOG,00000,"entering standby mode"
FATAL,XX000,"requested timeline 5 is not a child of this server's history","Latest checkpoint is at 2/82000028 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 0/530000A0."
LOG,00000,"startup process (PID 23007) exited with exit code 1"
LOG,00000,"aborting startup due to startup process failure"
LOG,00000,"database system is shut down"
In the archive_status folder we can see:
-rw-------. 1 postgres postgres 0 Jan 30 02:51 000000040000000200000081.ready
-rw-------. 1 postgres postgres 0 Jan 30 02:51 00000005.history.done
Would you know how we can recover safely from this?
I guess switching back to timeline 4 would be enough as timeline 5 was made after start of outage.
The server is started in standby mode. Remove standby.signal if you want to start the server as primary server.
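To read the archive_status listing in the question, it helps to decode the file names. The sketch below assumes the standard naming scheme, where a WAL segment name is three 8-digit hexadecimal fields (timeline, log number, segment number) and a timeline history file name starts with the 8-digit timeline ID. The function name is hypothetical.

```python
def parse_wal_name(name):
    """Split a WAL segment or timeline history file name into its parts."""
    base = name.split(".")[0]  # drop suffixes such as .ready, .done, .history
    if len(base) == 8:
        # Timeline history file: 8 hex digits of timeline ID.
        return {"kind": "history", "timeline": int(base, 16)}
    if len(base) == 24:
        # WAL segment: timeline, log number, segment number (8 hex digits each).
        return {
            "kind": "segment",
            "timeline": int(base[0:8], 16),
            "log": int(base[8:16], 16),
            "segment": int(base[16:24], 16),
        }
    raise ValueError(f"unrecognised WAL file name: {name}")


if __name__ == "__main__":
    for file_name in ("000000040000000200000081.ready", "00000005.history.done"):
        print(file_name, parse_wal_name(file_name))
```

Applied to the two files above, this shows a segment on timeline 4 waiting to be archived and an already archived history file for timeline 5, which matches the timelines named in the FATAL message.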

All members disappeared from a patroni cluster

I have 2 members in a Patroni cluster (1 master and 1 replica). In the logs I saw a problem after the master reconnected to a new etcd server:
ERROR: Request to server http://etcd2:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'etcd2\', port=2379): Max retries exceeded with url: /v2/keys/patroni/patroni-cluster/?recursive=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd2\', port=2379): Read timed out. (read timeout=3.333078201239308)"))')
INFO: Reconnection allowed, looking for another server.
INFO: Retrying on http://etcd1:2379
INFO: Selected new etcd server http://etcd1:2379
INFO: Lock owner: patroni2; I am patroni1
INFO: does not have lock
INFO: Reaped pid=3098484, exit status=0
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
After this, the replica node became the master:
INFO: Got response from patroni1 http://0.0.0.0:8008/patroni: {"state": "running", "postmaster_start_time": "2021-08-09 14:43:18.372 UTC", "role": "replica", "server_version": 120003, "cluster_unlocked": true, "xlog": {"received_location": 139045264096, "replayed_location": 139045264096, "replayed_timestamp": "2021-09-27 15:03:10.389 UTC", "paused": false}, "timeline": 30, "database_system_identifier": "6904244251638517787", "patroni": {"version": "1.6.5", "scope": "patroni-cluster"}}
WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
INFO: promoted self to leader by acquiring session lock
server promoting
LOG: received promote request
INFO: Lock owner: patroni2; I am patroni2
INFO: no action. i am the leader with the lock
ERROR: replication slot "patroni1" does not exist
ERROR: replication slot "patroni1" does not exist
INFO: acquired session lock as a leader
As you can see above, the new master can no longer see patroni1. After several attempts to recover WAL, patroni1 wrote the following logs:
INFO: establishing a new patroni connection to the postgres cluster
INFO: My wal position exceeds maximum replication lag
INFO: following a different leader because i am not the healthiest node
INFO: My wal position exceeds maximum replication lag
These log messages have not changed since then: patroni2 writes "acquired session lock as a leader" and patroni1 writes "My wal position exceeds maximum replication lag".
But I can't see either member in the Patroni cluster when I run the patronictl -c /patroni.yml list command.
What is the best way to bring them back into the cluster?

Postgres in recovery mode after failed delete queries from partitioned table (PG 12)

I have code that used to work on a simple table and stopped working when the same table was split into many sub-partitions.
In a distributed application (Spark) we have code that runs batch delete queries in parallel from different computers at the same time (deleting different records).
Most of the queries work, but then one of them fails with what seems to be a socket connection timeout:
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM my_table WHERE vessel_id='xxxxxx' AND day='2020-09-15 00:00:00+00'::timestamp was aborted: An I/O error occurred while sending to the backend. Call getNextException to see other errors in the batch.
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
When the code retries the task, the connection fails with
:FATAL: the database system is in recovery mode
In the database log I see:
2020-09-21 16:44:27 UTC::@:[26848]:DETAIL: Failed process was running: DELETE FROM my_table WHERE vessel_id=$1 AND day=$2
2020-09-21 16:44:27 UTC::@:[26848]:LOG: terminating any other active server processes
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres@postgres:[27705]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin@[unknown]:[26740]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC::@:[22480]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC::@:[22480]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC::@:[22480]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:127.0.0.1(31826):rdsadmin@rdsadmin:[27967]:FATAL: the database system is in recovery mode
Any ideas why the database fails when the table is partitioned?
Why are all the other connections from the other computers closed, and why does the database go into recovery mode?
After looking at the logs, I found that the problem was an out-of-memory condition.
This database instance is the main instance: it handles the writing, replicating and deleting, and it didn't have enough memory to do all of these tasks at the same time.
The fix was simply to add more memory.
Nothing fancy.

Postgres synchronous_standby_names var not accepting '-' in the hostname

I am trying to set up a Postgres cluster with 3 machines to get high availability with automatic failover.
postgres-01 --> master
postgres-02 --> sync replica
postgres-03 --> async replica
When I tried to use synchronous_standby_names='postgres-02' in postgresql.conf, Postgres failed to restart with the following error:
LOG: invalid value for parameter "synchronous_standby_names": "postgres-02"
DETAIL: syntax error at or near "-"
FATAL: configuration file "/pgsql/postgresql.conf" contains errors
postgresql-10.service: main process exited, code=exited, status=1/FAILURE
Failed to start PostgreSQL 10 database server.
-- Subject: Unit postgresql-10.service has failed
-- Defined-By: systemd
Removing the '-' from the hostname fixes the problem, but is this really required?
You'll have to quote the name:
synchronous_standby_names = '"postgres-02"'
You should have at least two synchronous standby servers, else your system will stop functioning if the single synchronous standby server goes down.
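If the setting is generated by a script, a small helper can apply the quoting automatically. The sketch below is only an illustration: it assumes that names consisting of letters, digits, underscores and "$" (and not starting with a digit) are accepted unquoted, that anything else must be wrapped in double quotes, and that the FIRST n (...) list syntax is available in the PostgreSQL version in use. The function names are hypothetical.

```python
import re

# Assumption: names matching this pattern are accepted without quotes.
_PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*\Z")


def quote_standby_name(name):
    """Return name unchanged if it is a plain identifier, else double-quoted."""
    if _PLAIN_NAME.match(name):
        return name
    # Double any embedded double quote, as in SQL identifiers.
    return '"' + name.replace('"', '""') + '"'


def sync_standby_setting(names, num_sync=1):
    """Build a postgresql.conf line listing the given standby names."""
    quoted = ", ".join(quote_standby_name(n) for n in names)
    return f"synchronous_standby_names = 'FIRST {num_sync} ({quoted})'"


if __name__ == "__main__":
    # Two candidate synchronous standbys, as the answer recommends.
    print(sync_standby_setting(["postgres-02", "postgres-03"]))
```

Listing both standbys with FIRST 1 keeps one of them synchronous at a time while letting the other take over if the first goes down.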

PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication is initialized with a streamed base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archives from a remote archive location with rsync.
Everything works as described in the documentation when the restore_command script fails with an exit code other than 255:
At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.
But when the restore_command script fails with exit code 255 (because the exit code from a failed rsync call is returned by the script) the server process dies with the following error:
2012-05-09 23:21:30 CEST - # LOG: database system was interrupted; last known up at 2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - # LOG: entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - # FATAL: could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - # LOG: startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - # LOG: aborting startup due to startup process failure
So my question is now: Is this a bug or is there a special meaning of exit code 255 which is missing in the otherwise excellent documentation or am I missing something else here?
On the primary server, you have WAL files sitting in the pg_xlog/ directory. While WAL files are there, PostgreSQL is able to deliver them to the standby should they be requested.
Typically, you also have a local archived WAL location. Once PostgreSQL has moved files there, they can no longer be delivered to the standby online, and the standby expects them to come from the archived WAL location via restore_command.
If the primary and the standby are set up with different locations for archived WAL, then there is no way for a WAL file to reach the standby, and you have a gap.
In your case this might mean that:
00000001000000000000003D had been archived by the primary PostgreSQL;
standby's restore_command doesn't see it from the configured source location.
You might consider manually copying the missing WAL files from the primary to the standby using scp or rsync. It might also be necessary to review your WAL locations and make sure both servers look in the same place.
EDIT:
Grepping for restore_command in the sources, only access/transam/xlog.c references it. In the function RestoreArchivedFile, almost at the end (around line 3115 in the 9.1.3 sources), there is a check for whether restore_command exited normally or received a signal.
In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and wasn't able to handle it properly, I guess), a FATAL error is reported. The same is true for all exit codes greater than 125.
I will not be able to tell you why though.
I recommend asking on the hackers list.
This looks like an rsync problem I encountered temporarily using NFS (with rpcbind/rstatd on port 837):
$ rsync -avz /var/backup/* backup@storage:/data/backups
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
This fixed it for me:
service rpcbind stop
I had the same issue creating a hot standby (Postgres 9.5). Streaming was working (I seeded the standby via pg_basebackup using the same credentials as would later be used in the standby's recovery.conf).
After taking the base backup, I set up the following recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'
Starting the server would yield:
2016-03-08 12:34:58.981 UTC (/)LOG: database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC (/)FATAL: could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC (/)LOG: startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC (/)LOG: aborting startup due to startup process failure
If I removed the restore_command line from recovery.conf, the standby started up fine and began streaming WALs from the master.
I eventually traced the problem to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I'd also forgotten to add the WAL archive host's server fingerprint to the known_hosts file of the standby postgres user.
These two mistakes were (I assume) causing the sftp restore_command to exit with code 255. As tscho says, the Postgres docs suggest that if the restore_command exits with ANY non-zero value, Postgres will simply move on to trying to stream from the master rather than refusing to start. In reality this doesn't seem to be the case if the exit code is higher than a certain number (maybe 125, as vyegorov's source code grepping suggests?).
Once I fixed the two SSH issues, the standby started fine with the restore_command present in recovery.conf.
Here is the comment describing why this behavior for high exit status from the command process was chosen, and the current code to implement it.
/*
* Remember, we rollforward UNTIL the restore fails so failure here is
* just part of the process... that makes it difficult to determine
* whether the restore failed because there isn't an archive to restore,
* or because the administrator has specified the restore program
* incorrectly. We have to assume the former.
*
* However, if the failure was due to any sort of signal, it's best to
* punt and abort recovery. (If we "return false" here, upper levels will
* assume that recovery is complete and start up the database!) It's
* essential to abort on child SIGINT and SIGQUIT, because per spec
* system() ignores SIGINT and SIGQUIT while waiting; if we see one of
* those it's a good bet we should have gotten it too.
*
* On SIGTERM, assume we have received a fast shutdown request, and exit
* cleanly. It's pure chance whether we receive the SIGTERM first, or the
* child process. If we receive it first, the signal handler will call
* proc_exit, otherwise we do it here. If we or the child process received
* SIGTERM for any other reason than a fast shutdown request, postmaster
* will perform an immediate shutdown when it sees us exiting
* unexpectedly.
*
* Per the Single Unix Spec, shells report exit status > 128 when a called
* command died on a signal. Also, 126 and 127 are used to report
* problems such as an unfindable command; treat those as fatal errors
* too.
*/
    if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
        proc_exit(1);

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: %s",
                    xlogfname, wait_result_to_str(rc))));
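The decision in the excerpt can be mirrored in a few lines to see how a raw wait status such as the 65280 from the 9.1 log is classified. This is only a sketch: it decodes the status by hand assuming the conventional Linux encoding (low 7 bits hold the terminating signal, bits 8 to 15 hold the exit code), and the function name and the "EXIT" label are hypothetical.

```python
import signal


def classify_restore_result(rc):
    """Mirror the severity decision in the excerpt for a raw wait status rc."""
    term_signal = rc & 0x7F        # 0 means the process exited normally
    exit_code = (rc >> 8) & 0xFF   # only meaningful for a normal exit
    # 0x7F marks a stopped process rather than a terminating signal.
    signaled = term_signal not in (0, 0x7F)
    if signaled and term_signal == signal.SIGTERM:
        return "EXIT"  # handled as a fast shutdown request (proc_exit)
    if signaled or exit_code > 125:
        return "FATAL"
    return "DEBUG2"


if __name__ == "__main__":
    for status in (65280, 1 << 8, 125 << 8, 126 << 8):
        print(status, classify_restore_result(status))
```

Under this encoding, 65280 is 255 << 8, that is, exit code 255, which is above 125 and therefore reported as FATAL, while an exit code of 1 is only logged at DEBUG2 and recovery carries on.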