All members disappeared from a Patroni cluster - PostgreSQL

I have 2 members in a Patroni cluster (1 master and 1 replica). In the logs I saw the problem start after the master reconnected to a new etcd server:
ERROR: Request to server http://etcd2:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'etcd2\', port=2379): Max retries exceeded with url: /v2/keys/patroni/patroni-cluster/?recursive=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd2\', port=2379): Read timed out. (read timeout=3.333078201239308)"))')
INFO: Reconnection allowed, looking for another server.
INFO: Retrying on http://etcd1:2379
INFO: Selected new etcd server http://etcd1:2379
INFO: Lock owner: patroni2; I am patroni1
INFO: does not have lock
INFO: Reaped pid=3098484, exit status=0
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
After this, the replica node became the master:
INFO: Got response from patroni1 http://0.0.0.0:8008/patroni: {"state": "running", "postmaster_start_time": "2021-08-09 14:43:18.372 UTC", "role": "replica", "server_version": 120003, "cluster_unlocked": true, "xlog": {"received_location": 139045264096, "replayed_location": 139045264096, "replayed_timestamp": "2021-09-27 15:03:10.389 UTC", "paused": false}, "timeline": 30, "database_system_identifier": "6904244251638517787", "patroni": {"version": "1.6.5", "scope": "patroni-cluster"}}
WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
INFO: promoted self to leader by acquiring session lock
server promoting
LOG: received promote request
INFO: Lock owner: patroni2; I am patroni2
INFO: no action. i am the leader with the lock
ERROR: replication slot "patroni1" does not exist
ERROR: replication slot "patroni1" does not exist
INFO: acquired session lock as a leader
As you can see above, the new master cannot see patroni1 now. After several attempts to recover WAL, patroni1 wrote the logs below:
INFO: establishing a new patroni connection to the postgres cluster
INFO: My wal position exceeds maximum replication lag
INFO: following a different leader because i am not the healthiest node
INFO: My wal position exceeds maximum replication lag
This log output does not change over time: patroni2 keeps writing "acquired session lock as a leader" and patroni1 keeps writing "My wal position exceeds maximum replication lag".
But I can't see either of them in the Patroni cluster when I use the patronictl -c /patroni.yml list command.
How should I bring them back into the cluster in a better way?
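For reference, here is a rough sketch of how the member state can be queried directly from each node's Patroni REST API (port 8008 as seen in the logs above; the host names and the timeout are placeholders):

import json
import urllib.request

# Host names are placeholders for the two Patroni members.
for host in ["patroni1", "patroni2"]:
    url = "http://%s:8008/patroni" % host  # same endpoint seen in the logs above
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = json.loads(resp.read().decode())
        print(host, status.get("role"), status.get("state"))
    except OSError as exc:
        print(host, "unreachable:", exc)

From what I've read, patronictl -c /patroni.yml reinit can rebuild a stuck replica from the leader, but I'm not sure that is the right approach here.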

Related

Postgres.exe crashes and tears down all apps, recovers and is running again

I'm running an application with about 20 processes connected to a Postgres DB (10.0) on Windows Server 2016.
For about a month now I have been getting unexpected crashes of postgres.exe.
To isolate the problem I extended the logging by setting log_min_duration_statement = 0
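In postgresql.conf that is just one line (a value of 0 logs every completed statement together with its duration):

# log every completed statement and its duration
log_min_duration_statement = 0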
This creates a more detailed logfile. What I can see is:
LOG: server process (PID xxxxx) was terminated by exception 0xFFFFFFFF
DETAIL: Failed process was running: COMMIT
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
Then it tears down all 20 processes like this:
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
Then the DB recovers:
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2021-06-11 18:17:18 CEST
DB enters recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: database system was not properly shut down; automatic recovery in progress
...
LOG: redo starts at 1B2/33319E58
FATAL: the database system is in recovery mode
LOG: invalid record length at 1B2/33D29930: wanted 24, got 0
LOG: redo done at 1B2/33D29908
LOG: last completed transaction was at log time 2021-06-11 18:21:39.830526+02
FATAL: the database system is in recovery mode
...
FATAL: the database system is in recovery mode
LOG: database system is ready to accept connections
Now it's running again as normal.
The crashed PID xxxxx can be traced to a postgres.exe serving one of the 20 application processes. It's not always the same one. This happens about every 5-10 days.
Can anybody give me some advice on how to track down the reason for this crash?
Extensions used:
oracle_fdw 2.0.0, PostgreSQL 10.0, Oracle client 11.2.0.3.0, Oracle server 11.2.0.2.0
Crashdump:
Followed the link:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
Although the postgres user has "full control" of the crashdump folder in the security tab, nothing is written there. The folder stays empty.
Follow-up on the comment by Laurenz Albe:
The COMMIT is not the reason for the crash. It is the last successfully executed command of the session, as explained in the following example:
The process gets a job and starts to do its work
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.061 ms statement: DISCARD ALL
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.012 ms statement: BEGIN
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.015 ms statement: SET TRANSACTION ISOLATION LEVEL READ COMMITTED
Now a lot of action is going on within session 25604, among others the Oracle foreign data wrapper
2021-06-15 16:28:13.792 CEST [25604] LOG: duration: 0.016 ms execute <unnamed>: FETCH ALL FROM "<unnamed portal 689>"
The action finishes successfully (the data of the transaction is in the database)
2021-06-15 16:28:13.823 CEST [25604] LOG: duration: 0.059 ms statement: COMMIT
A lot of action is going on in different sessions, among others the Oracle foreign data wrapper.
More than 7 minutes later the next job is requested, and now postgres.exe crashes:
2021-06-15 16:36:01.524 CEST [17904] LOG: server process (PID 25604) was terminated by exception 0xFFFFFFFF
The process does not even get to DISCARD ALL, BEGIN and SET TRANSACTION ISOLATION LEVEL READ COMMITTED; it crashes immediately.
My conclusion:
The "possibly corrupted shared memory" was caused by one of the processes earlier, i.e. between the last successful COMMIT and the new request.
That's a 7-minute time span in which the problem occurs.
Some feedback on this conclusion?

PSQL TimescaleDB, ERROR: the database system is in recovery mode

We have an application pipeline and Postgres 12 (TimescaleDB, managed through Patroni) on a separate server (a VM with Ubuntu 18.04 LTS).
We are facing an issue with the DB: it suddenly gets stuck in recovery mode, we can't access it from the psql client, and SELECT queries hang.
After an hour or so everything got back to normal (as my current pipeline terminated) and we were able to run queries against the DB server.
Master DB error details:
2020-11-03 18:35:08.612 IST [9773] [unknown]#[unknown] LOG: connection received: host=x.x.x.x port=58780
2020-11-03 18:35:08.612 IST [9773] FATAL: the database system is in recovery mode
2020-11-03 18:35:08.596 IST [18276] LOG: could not send data to client: Broken pipe
Replica server error details:
2020-11-03 18:34:55 IST [18316]: [85649-1] user=postgres,db=postgres,app=[unknown],client=x.x.x.x LOG: duration: 10.228 ms statement: SELECT * FROM pg_stat_bgwriter;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 18:35:08 IST [18322]: [2-1] user=,db=,app=,client= FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
2020-11-03 18:35:08 IST [20500]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
Pipeline error details:
Job aborted due to stage failure: Task 4 in stage 0.0 failed 3 times, most recent failure: Lost task 4.2 in stage 0.0 (TID 29, ip-x-x-x-x.ap-southeast-1.compute.internal, executor 19): org.postgresql.util.PSQLException: FATAL: the database system is in recovery mode at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:514) at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:141) at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192) at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49) at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195) at org.postgresql.Driver.makeConnection(Driver.java:454) at org.postgresql.Driver.connect(Driver.java:256) at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
Please, any advice on this issue?
What version of TimescaleDB are you running? In particular, there were some issues with 1.7.x if you try to query a read replica; we recommend upgrading to 1.7.4.
(Otherwise, there's not much information here to suggest what might have happened.)
https://github.com/timescale/timescaledb/releases/tag/1.7.4
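If it helps, here is a minimal psycopg2 sketch for checking the installed extension version before and after the upgrade (the connection parameters are placeholders; the ALTER EXTENSION step is commented out because it should only be run once the new library is installed on the server):

import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect(host="localhost", dbname="postgres", user="postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'timescaledb'")
    row = cur.fetchone()
    print("timescaledb version:", row[0] if row else "not installed")
    # After installing the 1.7.4 packages, update the catalog side with:
    # cur.execute("ALTER EXTENSION timescaledb UPDATE")
conn.close()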

Postgres in recovery mode after failed delete queries from partitioned table (PG 12)

I have code that used to work on a simple table and stopped working when the same table was partitioned into many sub-partitions.
In a distributed application (Spark) we have code that performs batch delete queries in parallel from different computers at the same time (deleting different records).
Most of the queries work, but then one of them fails on what seems to be a socket connection timeout:
java.sql.BatchUpdateException: Batch entry 0 DELETE FROM my_table WHERE vessel_id='xxxxxx' AND day='2020-09-15 00:00:00+00'::timestamp was aborted: An I/O error occurred while sending to the backend. Call getNextException to see other errors in the batch.
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
When the code retries the task, the connection fails with:
FATAL: the database system is in recovery mode
In the database log I see:
2020-09-21 16:44:27 UTC::#:[26848]:DETAIL: Failed process was running: DELETE FROM my_table WHERE vessel_id=$1 AND day=$2
2020-09-21 16:44:27 UTC::#:[26848]:LOG: terminating any other active server processes
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:172.31.4.110(59468):postgres#postgres:[27705]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC:10.3.1.138(57926):rdsrepladmin#[unknown]:[26740]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC::#:[22480]:WARNING: terminating connection because of crash of another server process
2020-09-21 16:44:27 UTC::#:[22480]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-09-21 16:44:27 UTC::#:[22480]:HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-09-21 16:44:27 UTC:127.0.0.1(31826):rdsadmin#rdsadmin:[27967]:FATAL: the database system is in recovery mode
Any ideas why the database fails when the table is partitioned?
Why are all the other connections on the other computers closed, and why does the database go into recovery mode?
After looking at the logs I found that the problem was out-of-memory.
This database instance is the main instance: it does the writing, replicating and deleting, and it didn't have enough memory to handle all these tasks at the same time.
The fix was simply to add more memory.
Nothing fancy.

PostgreSQL WalReceiver process waits on connecting to the master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In cases of master failover or temporary failures on the master, the standby loses the streaming replication connection, and when retrying it takes a long time until the attempt fails and is retried.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalives in the primary_conninfo of recovery.conf, and wal_receiver_timeout in postgresql.conf on the standby, but I could not make any progress with them.
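For reference, this is roughly what I tried on the standby (the user name and the exact values here are placeholders, not my real config):

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=2'

# postgresql.conf on the standby
wal_receiver_timeout = 30s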
At first, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity I found out that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is much longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when the stuck state occurs (state 2), and when I do, the process is started again, connects, and then streams normally (jumps to state 4).
After checking netstat, I also found that in the failover case the walreceiver process has a connection to the old master stuck in the SYN_SENT state.
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.

When inserting rows using Python psycopg2, the Docker Postgres process is terminated

The database is PostgreSQL 9.5.1 in Docker. My host machine has 3.75 GB of memory and runs Linux. In some methods I am inserting 490000 rows one after another using psycopg2 with the code below.
student_list = [(name, surname, explanation)]
args_str = ','.join(cur.mogrify("(%s,%s,%s)", x) for x in student_list)
cur.execute('INSERT INTO students (name, surname, explanation) VALUES ' + args_str)
This seems to fill up my database container's memory and gives these errors:
LOG: server process (PID 11219) was terminated by signal 9: Killed
DETAIL: Failed process was running
LOG: terminating any other active server processes
docker#test_db WARNING: terminating connection because of crash of another server process
docker#test_db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
docker#test_db HINT: In a moment you should be able to reconnect to the database and repeat your command.
docker#test_db WARNING: terminating connection because of crash of another server process
docker#test_db DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
...
docker#test_db FATAL: the database system is in recovery mode
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2017-06-06 09:39:40 UTC
LOG: database system was not properly shut down; automatic recovery in progress
docker#test_db FATAL: the database system is in recovery mode
docker#test_db FATAL: the database system is in recovery mode
docker#test_db FATAL: the database system is in recovery mode
LOG: autovacuum launcher started
The script gives this log:
Inner exception
SSL SYSCALL error: EOF detected
I tried putting some sleep time between consecutive queries but got the same result. Is there some limit that causes this?
I also tried connecting and disconnecting for each query but got the same result. These are my connect and disconnect methods.
def connect():
    conn = psycopg2.connect(database=database_name,
                            user=database_user,
                            host=database_host,
                            port=database_port)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    return conn, cur

def disconnect(conn, cur):
    cur.close()
    conn.close()
Here is what I did. My memory was actually filling up; that's why the Linux OS kept killing the PostgreSQL process. There were 1M values in every insert process. The trick was that I divided the data lists into chunks and inserted them 100k at a time. That works very well. Thanks for your help.
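For illustration, here is a minimal sketch of that chunking approach, using psycopg2.extras.execute_values instead of hand-building the VALUES string with mogrify (the table and column names are taken from the question; the chunk size is the 100k that worked for me):

from psycopg2.extras import execute_values

CHUNK_SIZE = 100000  # rows per chunk

def insert_students(conn, student_list):
    # Assumes the autocommit connection from connect() above, so every
    # chunk is committed as soon as it is sent.
    with conn.cursor() as cur:
        for start in range(0, len(student_list), CHUNK_SIZE):
            chunk = student_list[start:start + CHUNK_SIZE]
            # execute_values sends the rows in smaller batches internally
            # (page_size defaults to 100), so no single statement has to
            # build the whole VALUES list in memory.
            execute_values(
                cur,
                "INSERT INTO students (name, surname, explanation) VALUES %s",
                chunk,
            )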