PGPool Failover - postgresql

We have a highly available database architecture consisting of a PGPool cluster with 3 instances and, at the backend, a PostgreSQL database cluster with 2 instances.
Our PostgreSQL and PGPool versions are PostgreSQL 10.17 and pgpool-II version 4.1.9 (karasukiboshi), respectively.
We tested our PGPool cluster for failover. Our steps were:
Stopping PGPool service on the master node. -> Successful.
Killing PGPool service on the master node. -> Successful.
Rebooting PGPool master node server. -> Unsuccessful.
I found failover-related topics that cover my first 2 cases, but there is no topic about rebooting the PGPool master node server.
Has any of you had a similar problem? Is this not a capability of PGPool? We don't know what the problem is.
PS: If needed, I will update the topic with my configuration files.
The log entries when we reboot the master PGPool server:
pid 669752: LOG: authentication timeout
May 10 17:22:51 my_server_name pgpool[3536624]: 2022-05-10 17:22:51: pid 670080: LOG: authentication timeout
May 10 17:22:51 my_server_name pgpool[3536624]: 2022-05-10 17:22:51: pid 670090: LOG: authentication timeout
May 10 17:22:51 my_server_name pgpool[3536624]: 2022-05-10 17:22:51: pid 669677: LOG: authentication timeout
May 10 17:22:51 my_server_name pgpool[3536624]: 2022-05-10 17:22:51: pid 670092: LOG: authentication timeout
May 10 17:22:51 my_server_name pgpool[3536624]: 2022-05-10 17:22:51: pid 670094: LOG: authenticatio
Thanks!
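One thing worth checking during the reboot test is whether the pgpool watchdog on the surviving nodes has actually taken over the leader role. A minimal sketch using pcp_watchdog_info, where the node addresses, the PCP port 9898, and the pcpadmin user are placeholders for your own settings:
# run against each surviving pgpool node; the verbose output shows which node currently holds the LEADER/MASTER role
pcp_watchdog_info -h 192.0.2.11 -p 9898 -U pcpadmin -v
pcp_watchdog_info -h 192.0.2.12 -p 9898 -U pcpadmin -v
If neither surviving node reports itself as leader after the master is rebooted, the watchdog quorum and heartbeat settings are the first place to look.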

Related

Kafka stretched cluster stopped when the second DC went down

My Kafka version:
/opt/kafka/bin/kafka-topics.sh --version
2.4.1 (Commit:c57222ae8cd7866b)
My Kafka cluster configuration looks like this:
6-node Kafka cluster
6 x ZooKeeper, i.e. one installed on each node/broker
2 DCs, with 3 nodes in each DC
the rack-awareness feature is enabled on each node:
node1 DC1:
broker.id=1
broker.rack=dc1
node2 DC1:
broker.id=2
broker.rack=dc1
node3 DC1:
broker.id=3
broker.rack=dc1
node1 DC2:
broker.id=4
broker.rack=dc2
node2 DC2:
broker.id=5
broker.rack=dc2
node3 DC2:
broker.id=6
broker.rack=dc2
When the whole of DC2 went down, the Kafka cluster stopped and node1 in DC1 showed errors like this:
[2022-03-16 07:38:45,422] INFO Unable to read additional data from server sessionid 0x40000004f930002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:45,549] INFO Unable to read additional data from server sessionid 0x200ab15af610000, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:45,787] INFO Client successfully logged in. (org.apache.zookeeper.Login)
[2022-03-16 07:38:45,787] INFO Client will use DIGEST-MD5 as SASL mechanism. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2022-03-16 07:38:45,787] INFO Opening socket connection to server dc2kafkabr2/A.B.C.72:2181. Will attempt to SASL-authenticate using Login Context section 'Client' (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:45,788] INFO Socket error occurred: dc2kafkabr2/A.B.C.72:2181: Connection refused (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,503] INFO Client successfully logged in. (org.apache.zookeeper.Login)
[2022-03-16 07:38:46,503] INFO Client will use DIGEST-MD5 as SASL mechanism. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2022-03-16 07:38:46,503] INFO Opening socket connection to server dc1kafkabr1/A.B.C.68:2181. Will attempt to SASL-authenticate using Login Context section 'Client' (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,504] INFO Socket connection established, initiating session, client: /A.B.C.68:35796, server: dc1kafkabr1/A.B.C.68:2181 (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,505] INFO Unable to read additional data from server sessionid 0x40000004f930002, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,616] INFO Client successfully logged in. (org.apache.zookeeper.Login)
[2022-03-16 07:38:46,617] INFO Client will use DIGEST-MD5 as SASL mechanism. (org.apache.zookeeper.client.ZooKeeperSaslClient)
[2022-03-16 07:38:46,617] INFO Opening socket connection to server dc1kafkabr2/A.B.C.69:2181. Will attempt to SASL-authenticate using Login Context section 'Client' (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,617] INFO Socket connection established, initiating session, client: /A.B.C.68:38936, server: dc1kafkabr2/A.B.C.69:2181 (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,619] INFO Unable to read additional data from server sessionid 0x200ab15af610000, likely server has closed socket, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2022-03-16 07:38:46,896] INFO Client successfully logged in. (org.apache.zookeeper.Login)
However, when the Kafka nodes in DC2 are stopped gracefully via systemctl, the Kafka cluster keeps working properly on the nodes in DC1.
The question is: why does the Kafka cluster stop working when DC2 is turned off? How can this be prevented? Any ideas?
Best Regards,
Dan
Dear all,
After further tests I know that the problem is on the ZooKeeper side, because when I turn off two brokers in DC2 the Kafka cluster still works. After turning off kafka.service on the last broker in DC2, the Kafka cluster still works. But when I turn off zookeeper.service on the last broker in DC2, the cluster becomes unresponsive.
This is my ZooKeeper configuration:
cat zookeeper.properties
tickTime=2000
dataDir=/opt/zookeeper/data
#dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=5
syncLimit=3
############## HARDENING #################
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
requireClientAuthScheme=sasl
###########################################
server.1=A.B.C.68:2888:3888
server.2=A.B.C.69:2888:3888
server.3=A.B.C.70:2888:3888
server.4=A.B.C.71:2888:3888
server.5=A.B.C.72:2888:3888
server.6=A.B.C.73:2888:3888
Any idea what is wrong with this configuration?
Best Regards,
Dan
The ZooKeeper quorum is not ensured, and that is the reason: with 6 voting servers split 3/3 across the two DCs, losing one DC leaves only 3 of 6 servers, which is not a strict majority (at least 4 are needed), so the ensemble loses quorum and stops serving requests.
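A common way to avoid this is to keep an odd number of ZooKeeper voters, with a tie-breaker in a third location, so that the loss of either DC still leaves a majority. A rough sketch, assuming a hypothetical seventh ZooKeeper host in a third site at A.B.D.10 (the address is a placeholder); with 7 voters the quorum is 4, which survives the loss of either 3-node DC:
server.1=A.B.C.68:2888:3888
server.2=A.B.C.69:2888:3888
server.3=A.B.C.70:2888:3888
server.4=A.B.C.71:2888:3888
server.5=A.B.C.72:2888:3888
server.6=A.B.C.73:2888:3888
server.7=A.B.D.10:2888:3888
The key point is that no single site may hold half or more of the voting servers.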

Why does Kafka crash?

I have Kafka 2.5.0.
My Kafka service crashes sometimes.
kafka/logs/server.log
[09:25:23,316] WARN Unable to read additional data from client sessionid 0x1000001a8fd0012, likely client has closed socket (org.apache.zookeeper.server.NIOServerCnxn)
/var/log/messages
09:25:23 kafka1 systemd: kafka.service: main process exited, code=exited, status=1/FAILURE
09:25:23 kafka1 systemd: Unit kafka.service entered failed state.
09:25:23 kafka1 systemd: kafka.service failed.
How can I find out why this happens?
First check whether ZooKeeper is running.
If it is running, try changing these settings in zoo.cfg:
autopurge.snapRetainCount=15 (at least)
autopurge.purgeInterval=1 or 2 (the interval is in hours)
Some hints might be here:
zookeeper + Unable to read additional data from client session id
ZooKeeper keeps getting EndOfStreamException, causing a crash
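To narrow down the cause, it also helps to look at what the broker itself logged just before systemd saw it exit. A generic diagnostic sketch, assuming Kafka runs as the systemd unit kafka.service and logs under /opt/kafka/logs (adjust paths and the ZooKeeper address to your install; on ZooKeeper 3.5+ the ruok command must be whitelisted via 4lw.commands.whitelist):
# check that ZooKeeper answers on its client port
echo ruok | nc localhost 2181
# see why systemd recorded the exit
journalctl -u kafka.service -n 50
# find the fatal error that preceded the crash in the broker log
grep -E "ERROR|FATAL" /opt/kafka/logs/server.log | tail -n 20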

PostgreSQL connection issue after service restart

I edited my pg_hba.conf file, copied it to the server, and restarted the service with "sudo service postgresql restart", but after that the server does not accept connections.
It shows the error below. Your database returned: "Connection to 138.2xx.1xx.xx:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections."
The Jenkins job and data visualization tools that were working fine previously are now failing. What could be the reason?
I am getting this in the PostgreSQL log:
2019-10-23 07:21:25.829 CEST [11761] LOG: received fast shutdown request
2019-10-23 07:21:25.829 CEST [11761] LOG: aborting any active transactions
2019-10-23 07:21:25.829 CEST [11766] LOG: autovacuum launcher shutting down
2019-10-23 07:21:25.832 CEST [11763] LOG: shutting down
2019-10-23 07:21:25.919 CEST [11761] LOG: database system is shut down
2019-10-23 07:21:27.068 CEST [22633] LOG: database system was shut down at 2019-10-23 07:21:25 CEST
2019-10-23 07:21:27.073 CEST [22633] LOG: MultiXact member wraparound protections are now enabled
2019-10-23 07:21:27.075 CEST [22631] LOG: database system is ready to accept connections
2019-10-23 07:21:27.075 CEST [22637] LOG: autovacuum launcher started
2019-10-23 07:21:27.390 CEST [22639] [unknown]#[unknown] LOG: incomplete startup packet
The check below shows no response:
root@Ubuntu-1604-xenial-64-minimal ~ # pg_isready -h localhost -p 5432
localhost:5432 - no response
The setting below was already present in the postgresql.conf file.
listen_addresses = '*'
Do I need to restart the entire server?
Can anyone please help me resolve this?
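Since the log shows the server coming back up, a useful first step is to check whether the restarted postmaster is actually listening on the expected address and port, and which settings it is running with. A generic diagnostic sketch; adjust the port if yours differs:
# show which addresses and ports the postmaster is listening on
sudo ss -tlnp | grep 5432
# ask the running server (via the local socket) for its effective settings
sudo -u postgres psql -c "SELECT name, setting FROM pg_settings WHERE name IN ('listen_addresses','port');"
If nothing is listening on the expected address, the server is either not running or listen_addresses was not applied; if it is listening but remote clients are still refused, recheck the new pg_hba.conf entries and any firewall in between.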

PostgreSQL WalReceiver process waits on connecting to the master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In the case of a master failover or a temporary failure of the master, the standby loses the streaming replication connection, and when retrying it takes a long time until the attempt fails and is retried.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and the keepalive settings in primary_conninfo in recovery.conf, and wal_receiver_timeout in postgresql.conf on the standby, but I could not make any progress with them.
First of all, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity I found out that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is far longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process while it is stuck (state 2); when I do, the process is started again, connects, and then streams normally (jumps to state 4).
After checking netstat, I also found that in the failover case the walreceiver process holds a connection to the old master in SYN_SENT state.
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
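For reference, the standby-side settings the question mentions would typically be combined like this in recovery.conf on PostgreSQL 10 (a sketch only; the host name matches the service from the question, while the replication user name and the keepalive values are placeholders):
standby_mode = 'on'
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3'
The keepalive settings help detect an already established connection to a dead master, while the lower tcp_syn_retries from the answer above shortens the hang when a brand-new TCP connection cannot be established.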

pgpool HA + repmgr for PostgreSQL 9.6

I'm trying to configure pgpool in my PostgreSQL environment (2 PostgreSQL servers + 1 pgpool) to provide HA, while repmgr is responsible for the replication.
I'm getting the following messages in the log:
2017-12-03 19:27:07: pid 19033: DEBUG: pool_flush_it: flush size: 0
2017-12-03 19:27:07: pid 19033: DEBUG: pool_read: read 103 bytes from backend 1
2017-12-03 19:27:07: pid 19033: ERROR: failed to authenticate
2017-12-03 19:27:07: pid 19033: DETAIL: password authentication failed for user "nobody"
2017-12-03 19:27:07: pid 19033: DEBUG: find_primary_node: no primary node found
2017-12-03 19:27:08: pid 19033: LOG: find_primary_node: checking backend no 0
2017-12-03 19:27:08: pid 19033: DEBUG: SSL is requested but SSL support is not available
2017-12-03 19:34:27: pid 22132: ERROR: unable to read data from DB node 1
2017-12-03 19:34:27: pid 22132: DETAIL: EOF encountered with backend
2017-12-03 19:28:27: pid 19033: DEBUG: find_primary_node: no primary node found
The pool_hba.conf:
TYPE DATABASE USER CIDR-ADDRESS METHOD
local all all trust
host all all 127.0.0.1/32 trust
host all all ::1/128 trust
In the PostgreSQL pg_hba.conf I enabled connections from the pgpool server:
####pgpool####
host all all 172.22.13.170/32 trust
1. What can be the problem?
2. If repmgr is responsible for the replication, should I set the parameter backend_flag to 'DISALLOW_TO_FAILOVER'?
Thanks.
I'm just getting up to speed on repmgr and pgpool, but I think there are multiple issues here:
1) Your pgpool.conf has some default settings for alive checking, and the user for that is 'nobody', so to get that to work you need to create a PostgreSQL user with that name so that pgpool can query all hosts to find the current master.
2) pgpool executes scripts to change which node is the master etc., and that script would normally just run repmgr commands to promote a new primary at failover, so I don't think DISALLOW_TO_FAILOVER is needed.
If repmgr were to fail over on its own, then point 1 would let pgpool find the new master anyway, but in that case I would configure repmgr not to fail over automatically (since the two could fight over who should do what).
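The pgpool.conf parameters point 1 refers to are the streaming replication check and health check settings; they look roughly like this (a sketch with placeholder values; either create the default 'nobody' user or point these at an existing PostgreSQL user and its password):
# user pgpool uses to locate the current primary (find_primary_node)
sr_check_user = 'nobody'
sr_check_password = ''
# user and interval for the per-backend health check
health_check_period = 10
health_check_user = 'nobody'
health_check_password = ''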