HAProxy external health check is not working

backend geoserver
    balance roundrobin
    log global
    #option httpclose
    #option httplog
    #option forceclose
    mode http
    #option tcplog
    #monitor-uri /tiny
    #*************** health check ********************************
    option tcpka
    option external-check
    option log-health-checks
    external-check path "/usr/bin:/bin:/tmp"
    #external-check path "/usr/bin:/bin"
    external-check command /bin/true
    #external-check command /var/lib/haproxy/ping.sh
    timeout queue 60s
    timeout server 60s
    timeout connect 60s
    #************** cookie *******************************
    cookie NV_HAPROXY insert indirect nocache
    server web1_Controller_ASHISH 10.10.60.15:9002 check cookie web1_Controller_ASHISH
    server web2_controller_jagjeet 10.10.60.15:7488 check cookie web2_Controller_jagjeet
Previously, the following errors were encountered:
backend geoserver has no server available!
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web1_Controller_ASHISH failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web1_Controller_ASHISH is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Health check for server geoserver/web2_controller_jagjeet failed, reason: External check error, code: 255, check duration: 0ms, status: 0/2 DOWN.
Aug 20 18:02:10 netstorm-ProLiant-ML10 haproxy[54168]: Server geoserver/web2_controller_jagjeet is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
After that, I changed my chroot directory from /var/lib/haproxy to /etc/haproxy, where all my configuration files are stored, but the "No server is available to handle this request" error still appears. Please let me know what the issue is.
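A likely culprit, offered as a guess: when HAProxy runs with a chroot directive, the external-check command is executed inside that chroot, so /bin/true must exist within the jail, or the command fails with exit code 255, which matches the "External check error, code: 255" in the logs above. HAProxy 2.0 and later additionally require insecure-fork-wanted in the global section before external checks will run. Below is a minimal sketch of a check script, assuming it is placed executable at /var/lib/haproxy/ping.sh inside the chroot along with a shell and nc; HAProxy passes <proxy_address> <proxy_port> <server_address> <server_port> as $1 through $4:

#!/bin/sh
# External health check sketch: $3 = server address, $4 = server port.
# Exit 0 marks the server UP; any non-zero exit marks it DOWN.
if nc -z -w 2 "$3" "$4"; then
    exit 0
else
    exit 1
fi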

Related

HAProxy SSL termination: Layer4 connection problem, info: "Connection refused"

I was trying to implement SSL termination with HAProxy.
This is what my haproxy.cfg looks like:
frontend Local_Server
    bind *:443 ssl crt /home/vagrant/ingress-certificate/k8s.pem
    mode tcp
    reqadd X-Forwarded-Proto:\ https
    default_backend k8s_server

backend k8s_server
    mode tcp
    balance roundrobin
    redirect scheme https if !{ ssl_fc }
    server web1 100.0.0.2:8080 check
I have generated the self-signed certificate, which is k8s.pem.
My normal URL (without HTTPS) works perfectly fine, i.e. http://100.0.0.2/hello
But when I try to access the same URL with HTTPS, i.e. https://100.0.0.2/hello, I get a 404, and when I check my HAProxy logs I can see the following message:
Jul 21 18:10:19 node1 haproxy[10813]: Server k8s_server/web1 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Any suggestions that I can incorporate into my haproxy.cfg?
PS - The microservice I am trying to access is deployed in a Kubernetes cluster, with the service exposed as ClusterIP.
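One thing worth checking, as a guess from the "Connection refused" in the logs: a ClusterIP service is only reachable from inside the cluster, so HAProxy's layer-4 check against 100.0.0.2:8080 will be refused unless something on that node actually listens on 8080. A sketch of one workaround, assuming the service is re-exposed as a NodePort (30080 here is a hypothetical port):

backend k8s_server
    mode tcp
    balance roundrobin
    server web1 100.0.0.2:30080 check

Note also that reqadd and redirect scheme are HTTP-level directives and have no effect in mode tcp; terminating TLS at the frontend usually implies mode http there.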

How should I limit the slow disconnect of thousands of tcp connections?

I have a Go program running on CentOS that usually has around 5k TCP clients connected. Every once in a while this number goes up to around 15k for about an hour, and everything is still fine.
The program has a slow-shutdown mode in which it stops accepting new clients and slowly kills all currently connected clients over the course of 20 mins. During these slow-shutdown periods, if the machine has 15k clients, I sometimes get:
[Wed Oct 31 21:28:23 2018] net_ratelimit: 482 callbacks suppressed
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
[Wed Oct 31 21:28:23 2018] TCP: too many orphaned sockets
I have tried adding:
echo "net.ipv4.tcp_max_syn_backlog=5000" >> /etc/sysctl.conf
echo "net.ipv4.tcp_fin_timeout=10" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_recycle=1" >> /etc/sysctl.conf
echo "net.ipv4.tcp_tw_reuse=1" >> /etc/sysctl.conf
sysctl -f /etc/sysctl.conf
These values are set; I can see them with their correct new values. A typical sockstat is:
cat /proc/net/sockstat
sockets: used 31682
TCP: inuse 17286 orphan 5 tw 3874 alloc 31453 mem 15731
UDP: inuse 8 mem 3
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
Any ideas how to stop the "too many orphaned sockets" error and crash? Should I increase the 20 min slow-shutdown period to 40 mins? Increase tcp_mem? Thanks!
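The "too many orphaned sockets" message is tied to net.ipv4.tcp_max_orphans rather than to the backlog or TIME_WAIT knobs above, so one avenue worth trying is checking and raising that limit. A sketch; the 65536 value is an assumption to be tuned against the 15k-client spikes:

# Check the current orphan limit and the live orphan count
cat /proc/sys/net/ipv4/tcp_max_orphans
grep orphan /proc/net/sockstat

# Raise the limit persistently and apply it (value is an assumption)
echo "net.ipv4.tcp_max_orphans=65536" >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf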

Haproxy backend stays down and is never brought up again

It works fine up until the moment the remote server becomes unavailable for some time, in which case the server goes DOWN in the logs and is never brought up again. The config is quite simple:
defaults
    retries 3
    timeout connect 5000
    timeout client 3600000
    timeout server 3600000
    log global
    option log-health-checks

listen amazon_ses
    bind 127.0.0.2:1234
    mode tcp
    no option http-server-close
    default_backend bk_amazon_ses

backend bk_amazon_ses
    mode tcp
    no option http-server-close
    server amazon email-smtp.us-west-2.amazonaws.com:587 check inter 30s fall 120 rise 1
Here are the logs when the problem occurs:
Jul 3 06:45:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30004ms, status: 119/120 UP.
Jul 3 06:46:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30003ms, status: 118/120 UP.
Jul 3 06:47:35 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30002ms, status: 117/120 UP.
...
Jul 3 08:44:36 jupiter haproxy[40331]: Health check for server bk_amazon_ses/amazon failed, reason: Layer4 timeout, check duration: 30000ms, status: 0/1 DOWN.
Jul 3 08:44:36 jupiter haproxy[40331]: Server bk_amazon_ses/amazon is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
Jul 3 08:44:36 jupiter haproxy[40331]: backend bk_amazon_ses has no server available!
And that's it. Nothing but operator intervention brings the server back to life. I also tried removing the check part and what follows it, but the same thing still happens. Can't HAProxy be configured to retry indefinitely and never mark a server DOWN? Thanks.
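One hedged explanation: without a resolvers setup, HAProxy resolves email-smtp.us-west-2.amazonaws.com only once at startup, and AWS rotates the addresses behind that name, so after an outage the cached IP may never answer again even though the hostname is fine. Since HAProxy 1.6, a resolvers section lets the server re-resolve at runtime; a sketch, where the nameserver address is an assumption to be replaced with your own:

resolvers mydns
    nameserver dns1 8.8.8.8:53
    hold valid 10s

backend bk_amazon_ses
    mode tcp
    server amazon email-smtp.us-west-2.amazonaws.com:587 check inter 30s fall 120 rise 1 resolvers mydns resolve-prefer ipv4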

How to validate zookeeper quorum

How do I verify that all the nodes in a ZooKeeper ensemble are part of a quorum and are healthy? The manual talks about "ruok", but that still doesn't say whether a ZooKeeper node is part of the quorum and in sync with the rest.
You can use the srvr command documented in The Four Letter Words to get more detailed status information about each ZooKeeper server in the ensemble. See below for sample output from a 3-node cluster, with hosts named ubuntu1, ubuntu2 and ubuntu3.
The Mode field will tell you if that particular server is the leader or a follower. The Zxid field refers to the ZooKeeper cluster's internal transaction ID used for tracking state changes to the tree of znodes. In a healthy cluster, you'll see one leader, multiple followers, and all nodes will generally be close to one another in the zxid value.
> for x in ubuntu1 ubuntu2 ubuntu3; do echo $x; echo srvr|nc $x 2181; echo; done
ubuntu1
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 3/9/21
Received: 9
Sent: 8
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
ubuntu2
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: leader
Node count: 6
ubuntu3
Zookeeper version: 3.4.7-1713338, built on 11/09/2015 04:32 GMT
Latency min/avg/max: 0/0/0
Received: 2
Sent: 1
Connections: 1
Outstanding: 0
Zxid: 0x100000004
Mode: follower
Node count: 6
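For the liveness half of the question, the ruok command mentioned above can be looped over the same hosts; each healthy server replies imok:

> for x in ubuntu1 ubuntu2 ubuntu3; do echo -n "$x: "; echo ruok|nc $x 2181; echo; done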

Kubernetes starts giving errors after few hours of uptime

I have installed K8S on OpenStack following this guide.
The installation went fine and I was able to run pods, but after some time my applications stop working. I can still create pods, but requests won't reach the services from outside the cluster, nor from within the pods. Basically, something in the networking gets messed up. iptables -L -vnt nat still shows the proper configuration, but things won't work.
To get things working again I have to rebuild the cluster; removing all services and replication controllers doesn't help.
I tried to look into the logs. Below is the journal for kube-proxy:
Dec 20 02:12:18 minion01.novalocal systemd[1]: Started Kubernetes Proxy.
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.269784 1030 proxier.go:487] Opened iptables from-containers public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.278952 1030 proxier.go:498] Opened iptables from-host public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 03:05:11 minion01.novalocal kube-proxy[1030]: W1220 03:05:11.806927 1030 api.go:224] Got error status on WatchEndpoints channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1433/544]) [2432] Reason: Details:<nil> Code:0}
Dec 20 03:06:08 minion01.novalocal kube-proxy[1030]: W1220 03:06:08.177225 1030 api.go:153] Got error status on WatchServices channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1476/207]) [2475] Reason: Details:<nil> Code:0}
..
..
..
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448570 1030 proxier.go:161] Failed to ensure iptables: error creating chain "KUBE-PORTALS-CONTAINER": fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.448749 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448868 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/kubernetes:"
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448906 1030 proxier.go:176] Failed to ensure portal for "default/kubernetes:": error checking rule: fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.449006 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.449133 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/repo-client:"
I found a few posts relating to "failed to install iptables", but they don't seem relevant, since everything works initially and only gets messed up after a few hours.
What version of Kubernetes is this? A long time ago (~1.0.4) we had a bug in the kube-proxy where it leaked sockets/file-descriptors.
If you aren't running a 1.1.3 binary, consider upgrading.
Also, you should be able to use lsof to figure out who has all of the files open.
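A sketch of that lsof check, using the kube-proxy PID 1030 that appears in the logs above:

# Count file descriptors currently held by kube-proxy
lsof -p 1030 | wc -l

# Compare against the per-process limit to see how close it is to exhaustion
grep "open files" /proc/1030/limits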