Haproxy backend stays down and is never brought up again

It works fine until the moment the remote server becomes unavailable for some time, at which point the server is marked down in the logs and is never brought back up. The config is quite simple:
defaults
    retries 3
    timeout connect 5000
    timeout client 3600000
    timeout server 3600000
    log global
    option log-health-checks

listen amazon_ses
    bind 127.0.0.2:1234
    mode tcp
    no option http-server-close
    default_backend bk_amazon_ses

backend bk_amazon_ses
    mode tcp
    no option http-server-close
    server amazon email-smtp.us-west-2.amazonaws.com:587 check inter 30s fall 120 rise 1
Here are the logs when the problem occurs:
Jul 3 06:45:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30004ms, status: 119/120 UP.
Jul 3 06:46:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30003ms, status: 118/120 UP.
Jul 3 06:47:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30002ms, status: 117/120 UP.
...
Jul 3 08:44:36 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30000ms, status: 0/1 DOWN.
Jul 3 08:44:36 jupiter haproxy[40331]: Server bk_amazon_ses/amazon is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 3 08:44:36 jupiter haproxy[40331]: backend bk_amazon_ses has no server available!
And that's it. Nothing but operator intervention brings the server back to life. I also tried removing the check part and everything after it, but the same thing still happens. Can't HAProxy be configured to retry indefinitely and never mark the server DOWN? Thanks.
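(Reading the logs above: the check counter drops roughly once a minute, so with fall 120 it takes the ~2 hours between 06:45 and 08:44 before the server is finally marked DOWN.) One angle worth considering, and this is an assumption on my part rather than something confirmed in the question, is that email-smtp.us-west-2.amazonaws.com resolves to changing IP addresses, and by default HAProxy only resolves the hostname at startup. A minimal sketch of letting it re-resolve at runtime via a resolvers section (the nameserver address is a placeholder):

# Hypothetical sketch, not from the original post: re-resolve the SES
# hostname at runtime instead of pinning the IP resolved at startup.
resolvers mydns
    nameserver dns1 127.0.0.53:53   # placeholder: use your actual resolver
    resolve_retries 3
    timeout retry   1s
    hold valid      10s

backend bk_amazon_ses
    mode tcp
    server amazon email-smtp.us-west-2.amazonaws.com:587 check inter 30s fall 120 rise 1 resolvers mydns resolve-prefer ipv4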

Related

How to access mongodb behind haproxy - message msgLen 1347703880 is invalid. Min 16 Max: 48000000

I'm attempting to connect to MongoDB, which is behind HAProxy. However, I get the following error message when connecting.
mongo --host mongo.<redacted>.com:27017
MongoDB shell version v4.2.8
connecting to: mongodb://mongo.redacted.com:27017/?compressors=disabled&gssapiServiceName=mongodb
2020-07-31T08:52:58.942+0100 I NETWORK [js] recv(): message msgLen 1347703880 is invalid. Min 16 Max: 48000000
2020-07-31T08:52:58.942+0100 I NETWORK [js] DBClientConnection failed to receive message from mongo..com:27017 - ProtocolError: recv(): message msgLen 1347703880 is invalid. Min 16 Max: 48000000
2020-07-31T08:52:58.942+0100 E QUERY [js] Error: network error while attempting to run command 'isMaster' on host 'mongo.redacted.com:27017' :
connect#src/mongo/shell/mongo.js:341:17
#(connect):2:6
2020-07-31T08:52:58.946+0100 F - [main] exception: connect failed
2020-07-31T08:52:58.946+0100 E - [main] exiting with code 1
haproxy.cfg
global
    user root
    group root
    log stdout local0 debug

defaults
    log global
    mode http
    balance leastconn
    retries 3
    option http-server-close
    option httplog
    option dontlognull
    timeout connect 5s
    timeout check 5s
    timeout client 60s
    timeout server 60s
    option forwardfor

listen mongo
    bind *:27017
    mode tcp
    server prod-mongodb-svc prod-mongodb-svc.mongodb:27017
Both HAProxy and MongoDB services are deployed in a kubernetes cluster. The HAProxy service is deployed as a load balancer (aws-elb). I have created an A record mongo.<redacted>.com which points to this load balancer.
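One observation that isn't in the question itself: 1347703880 is 0x50545448, which is the ASCII bytes "HTTP" read as a little-endian 32-bit length, so the shell appears to be receiving an HTTP response where it expects the MongoDB wire protocol. A quick sanity check of that decoding (a sketch, assuming a little-endian machine):

# 'H' 'T' 'T' 'P' interpreted as a little-endian unsigned 32-bit integer
printf 'HTTP' | od -An -t u4    # prints 1347703880

If that reading is right, it would point at something in the path answering in HTTP rather than at the msgLen limit itself; since the listen mongo section above is already mode tcp, the HTTP reply may well be coming from in front of HAProxy (for example the ELB listener or an HTTP health check) rather than from this config.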

Haproxy SSL termination : Layer4 connection problem, info: "Connection refused"

I was trying to implement SSL termination with HAProxy.
This is what my haproxy.cfg looks like:
frontend Local_Server
    bind *:443 ssl crt /home/vagrant/ingress-certificate/k8s.pem
    mode tcp
    reqadd X-Forwarded-Proto:\ https
    default_backend k8s_server

backend k8s_server
    mode tcp
    balance roundrobin
    redirect scheme https if !{ ssl_fc }
    server web1 100.0.0.2:8080 check
I have generated the self-signed certificate k8s.pem.
My normal URL (without HTTPS) works perfectly fine, i.e. http://100.0.0.2/hello.
But when I try to access the same URL with HTTPS, i.e. https://100.0.0.2/hello, I get a 404, and when I check my HAProxy logs I can see the following message:
Jul 21 18:10:19 node1 haproxy[10813]: Server k8s_server/web1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 21 18:10:19 node1 haproxy[10813]: Server k8s_server/web1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Any suggestions that I can incorporate into my haproxy.cfg?
PS - The microservice I am trying to access is deployed in a Kubernetes cluster with the service exposed as a ClusterIP.
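Two hedged observations, not from the original thread: HTTP-level directives such as reqadd and redirect scheme only take effect in mode http, so in the tcp-mode sections above they do nothing; and the "Connection refused" health check suggests nothing is answering on 100.0.0.2:8080 at all, which with a ClusterIP service may simply mean that address is not reachable from the HAProxy host. A sketch of the same setup terminating TLS in HTTP mode (assuming the backend speaks plain HTTP on port 8080, which the question does not confirm):

# Hypothetical sketch, assuming a plain-HTTP service on 100.0.0.2:8080
frontend Local_Server
    bind *:443 ssl crt /home/vagrant/ingress-certificate/k8s.pem
    mode http
    http-request set-header X-Forwarded-Proto https
    default_backend k8s_server

backend k8s_server
    mode http
    balance roundrobin
    server web1 100.0.0.2:8080 check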

Postgresql WalReceiver process waits on connecting master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In the case of a master failover or a temporary master failure, the standby loses the streaming replication connection, and when it retries, the connection attempt takes a long time to fail before it tries again.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalive in primary_conninfo of recovery.conf and wal_receiver_timeout in postgresql.conf on the standby, but I could not make any progress with them.
First, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating the Postgres activity, I found out that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is way longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when it gets stuck (state 2); when I do, the process starts again, connects, and then streams normally (it jumps to state 4).
After checking netstat, I also found that the walreceiver process has a connection to the old master stuck in the SYN_SENT state (in the failover case).
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
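As a rough illustration of why the observed delay is about two minutes (my own back-of-the-envelope numbers, not part of the answer above): Linux retransmits the SYN with an exponential backoff starting at roughly one second, so the default of 6 retries adds up to about 1+2+4+8+16+32+64 ≈ 127 seconds before the connect fails, while 3 retries gives roughly 1+2+4+8 ≈ 15 seconds. A sketch of applying and persisting the setting:

# Apply immediately (assumes root); the default on most Linux kernels is 6.
sysctl -w net.ipv4.tcp_syn_retries=3

# Persist it across reboots, then reload the sysctl settings.
echo 'net.ipv4.tcp_syn_retries = 3' >> /etc/sysctl.conf
sysctl -p

# Verify the current value.
sysctl net.ipv4.tcp_syn_retries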

HAProxy external health-check is not working

backend geoserver
    balance roundrobin
    log global
    #option httpclose
    #option httplog
    #option forceclose
    mode http
    #option tcplog
    #monitor-uri /tiny
    #*************** health check ********************************
    option tcpka
    option external-check
    option log-health-checks
    external-check path "/usr/bin:/bin:/tmp"
    #external-check path "/usr/bin:/bin"
    external-check command /bin/true
    #external-check command /var/lib/haproxy/ping.sh
    timeout queue 60s
    timeout server 60s
    timeout connect 60s
    #************** cookiee *******************************
    cookie NV_HAPROXY insert indirect nocache
    server web1_Controller_ASHISH 10.10.60.15:9002 check cookie web1_Controller_ASHISH
    server web2_controller_jagjeet 10.10.60.15:7488 check cookie web2_Controller_jagjeet
Previously, the following errors were encountered:
backend geoserver has no server available!
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web1_Controller_ASHISH failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web1_Controller_ASHISH failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web1_Controller_ASHISH is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web1_Controller_ASHISH is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web2_controller_jagjeet failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web2_controller_jagjeet failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web2_controller_jagjeet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web2_controller_jagjeet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
After that I changed my chroot directory (/var/lib/haproxy) to /etc/haproxy, where all my configuration files are stored, but I am still getting the "No server is available to handle this request" error. Please let me know what the issue is.
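Not an answer from the original thread, but a few prerequisites for external checks worth double-checking (assumptions about this setup, not confirmed above): the external-check keyword must be present in the global section, HAProxy 2.x additionally needs insecure-fork-wanted before the worker process may fork the command, and when a chroot is configured the command has to be reachable from inside that chroot. A sketch:

# Hypothetical sketch, assuming HAProxy 2.x
global
    external-check           # allow external health-check commands at all
    insecure-fork-wanted     # permit the worker process to fork the command
    # if a chroot is kept (e.g. chroot /var/lib/haproxy), the check command
    # and anything it needs must exist inside that chroot

backend geoserver
    option external-check
    external-check path "/usr/bin:/bin"
    external-check command /bin/true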

How to validate zookeeper quorum

How do I verify that all the nodes in a ZooKeeper ensemble are part of the quorum and are healthy? The manual talks about "ruok", but that still doesn't say whether a ZooKeeper node is part of the quorum and in sync with the rest.
You can use the srvr command documented in The Four Letter Words to get more detailed status information about each ZooKeeper server in the ensemble. See below for sample output from a 3-node cluster, with hosts named ubuntu1, ubuntu2 and ubuntu3.
The Mode field will tell you if that particular server is the leader or a follower. The Zxid field refers to the ZooKeeper cluster's internal transaction ID used for tracking state changes to the tree of znodes. In a healthy cluster, you'll see one leader, multiple followers, and all nodes will generally be close to one another in the zxid value.
> for x in ubuntu1 ubuntu2 ubuntu3; do echo $x; echo srvr|nc $x 2181; echo; done
ubuntu1
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 3/9/21
Received: 9
Sent: 8
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
ubuntu2
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: leader
Node count: 6
ubuntu3
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
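One caveat not covered in the answer above (which used ZooKeeper 3.4.7): since 3.4.10/3.5.3 the four letter words are restricted by a whitelist that by default only allows srvr, so other commands such as ruok or mntr have to be enabled explicitly in zoo.cfg, for example:

# zoo.cfg (assuming ZooKeeper 3.4.10+ / 3.5.3+)
4lw.commands.whitelist=srvr,ruok,mntr

mntr prints similar information in key=value form (zk_server_state, and zk_synced_followers on the leader), which can be easier to script against.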