Kubernetes starts giving errors after a few hours of uptime

I have installed K8S on OpenStack following this guide.
The installation went fine and I was able to run pods, but after some time my applications stop working. I can still create pods, but requests won't reach the services, either from outside the cluster or from within the pods. Basically, something in the networking gets messed up. The output of iptables -L -vnt nat still shows the proper configuration, but things won't work.
To get things working again I have to rebuild the cluster; removing all services and replication controllers doesn't help.
I tried to look into the logs. Below is the journal for kube-proxy:
Dec 20 02:12:18 minion01.novalocal systemd[1]: Started Kubernetes Proxy.
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.269784 1030 proxier.go:487] Opened iptables from-containers public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 02:15:52 minion01.novalocal kube-proxy[1030]: I1220 02:15:52.278952 1030 proxier.go:498] Opened iptables from-host public port for service "default/opensips:sipt" on TCP port 5060
Dec 20 03:05:11 minion01.novalocal kube-proxy[1030]: W1220 03:05:11.806927 1030 api.go:224] Got error status on WatchEndpoints channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1433/544]) [2432] Reason: Details:<nil> Code:0}
Dec 20 03:06:08 minion01.novalocal kube-proxy[1030]: W1220 03:06:08.177225 1030 api.go:153] Got error status on WatchServices channel: &{TypeMeta:{Kind: APIVersion:} ListMeta:{SelfLink: ResourceVersion:} Status:Failure Message:401: The event in requested index is outdated and cleared (the requested history has been cleared [1476/207]) [2475] Reason: Details:<nil> Code:0}
..
..
..
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448570 1030 proxier.go:161] Failed to ensure iptables: error creating chain "KUBE-PORTALS-CONTAINER": fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.448749 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448868 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/kubernetes:"
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.448906 1030 proxier.go:176] Failed to ensure portal for "default/kubernetes:": error checking rule: fork/exec /usr/sbin/iptables: too many open files:
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: W1220 16:01:23.449006 1030 iptables.go:203] Error checking iptables version, assuming version at least 1.4.11: %vfork/exec /usr/sbin/iptables: too many open files
Dec 20 16:01:23 minion01.novalocal kube-proxy[1030]: E1220 16:01:23.449133 1030 proxier.go:409] Failed to install iptables KUBE-PORTALS-CONTAINER rule for service "default/repo-client:"
I found a few posts relating to "failed to install iptables", but they don't seem relevant, since initially everything works and only after a few hours does it get messed up.

What version of Kubernetes is this? A long time ago (~1.0.4) we had a bug in the kube-proxy where it leaked sockets/file-descriptors.
If you aren't running a 1.1.3 binary, consider upgrading.
Also, you should be able to use lsof to figure out who has all of the files open.
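For example, something along these lines (a sketch; the pgrep pattern assumes the binary is literally named kube-proxy) shows how many file descriptors the process holds and of what kind:
PID=$(pgrep -o kube-proxy)
sudo ls /proc/$PID/fd | wc -l                                     # total open descriptors
sudo lsof -p $PID | awk '{print $5}' | sort | uniq -c | sort -rn  # grouped by type (sock, REG, ...)
If the first number keeps climbing toward the limit shown by ulimit -n, you are looking at the leak.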

Related

Handshake failed when testing connectivity for OpenVPN

I am trying to set up OpenVPN on Ubuntu 20.04; I'm not experienced in this area. After setting up OpenVPN, I tested connectivity and received this handshake error:
Sun Jul 26 05:53:17 2020 TCP/UDP: Preserving recently used remote address: [AF_INET]68.228.217.219:1194
Sun Jul 26 05:53:17 2020 Socket Buffers: R=[212992->212992] S=[212992->212992]
Sun Jul 26 05:53:17 2020 UDP link local: (not bound)
Sun Jul 26 05:53:17 2020 UDP link remote: [AF_INET]My_Public_ISP_IP:1194
Sun Jul 26 05:54:17 2020 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Sun Jul 26 05:54:17 2020 TLS Error: TLS handshake failed
Sun Jul 26 05:54:17 2020 SIGUSR1[soft,tls-error] received, process restarting
Sun Jul 26 05:54:17 2020 Restart pause, 5 second(s)
Then I checked the log:
journalctl --identifier openvpn
I found two error messages that I believe explain why my OpenVPN cannot connect. This is one of them:
Could not determine IPv4/IPv6 protocol. Using AF_INET
I noticed it's using my old client .conf file:
[screenshot of the second error message]
My new config file is local.ovpn.
I tried removing the old client config with sudo rm -vf BigK and replacing it with local.ovpn, but it didn't work.
I need help figuring this issue out. I tried researching on my own but came up short.
UPDATE
After several hours of researching online, the closest post I found was https://unix.stackexchange.com/questions/385966/openvpn-error-status-2-and-cant-connect-to-internet-while-using, which didn't help.
I checked my client.conf:
client
dev tun
proto udp
remote Public_IP 1194
resolv-retry infinite
nobind
persist-key
persist-tun
remote-cert-tls server
auth SHA512
cipher AES-256-CBC
ignore-unknown-option block-outside-dns
block-outside-dns
verb 3
<ca>
Here is my server.conf
local IP
port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key
dh dh.pem
auth SHA512
tls-crypt tc.key
topology subnet
server 10.8.0.0 255.255.255.0
push "redirect-gateway def1 bypass-dhcp"
ifconfig-pool-persist ipp.txt
push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 8.8.4.4"
keepalive 10 120
cipher AES-256-CBC
user nobody
group nogroup
persist-key
persist-tun
status openvpn-status.log
verb 3
crl-verify crl.pem
explicit-exit-notify
Here is localvpn.ovpn
client
dev tun
proto udp
remote Public_IP 1194
resolv-retry infinite
nobind
persist-key
persist-tun
remote-cert-tls server
auth SHA512
cipher AES-256-CBC
ignore-unknown-option block-outside-dns
block-outside-dns
verb 3
I faced the same problem and didn't find a direct solution, but looking for another way to connect to the OpenVPN server helped me.
Ubuntu 20.04 has a default tool for using OpenVPN:
Settings -> Network
Click the + icon in the row titled VPN
Choose the Import from file... option and select your .ovpn config file in the popup window
Click the Add button and that's it
PS: I hope this helps somebody save a few hours
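If you prefer the command line, NetworkManager can import the same profile (a sketch; it assumes the network-manager-openvpn plugin is installed and that the file is the local.ovpn mentioned above):
sudo apt install network-manager-openvpn
nmcli connection import type openvpn file local.ovpn
nmcli connection up local
The connection name defaults to the file name without the extension, hence local in the last command.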

Kubeadm - no port 6443 after cluster creation

I'm trying to create a Kubernetes HA cluster using kubeadm.
Kubeadm version: v1.11.1
I'm following these instructions: kubeadm ha
Everything went OK except the final step: nodes can't see each other on port 6443.
sudo netstat -an | grep 6443
Shows nothing.
In journalctl -u kubelet I see the following error:
reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://<LB>:6443/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-19-111-200.ec2.internal&limit=500&resourceVersion=0: dial tcp 172.19.111.200:6443: connect: connection refused
List of docker containers running on the instance:
sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e3eabb527a92 0e4a34a3b0e6 "kube-scheduler --ad…" 19 hours ago Up 19 hours k8s_kube-scheduler_kube-scheduler-ip-172-19-111-200.ec2.internal_kube-system_31eabaff7d89a40d8f7e05dfc971cdbd_1
123e78fa73c7 55b70b420785 "kube-controller-man…" 19 hours ago Up 19 hours k8s_kube-controller-manager_kube-controller-manager-ip-172-19-111-200.ec2.internal_kube-system_85384ca66dd4dc0adddc63923e2425a8_1
e0aa05e74fb9 1d3d7afd77d1 "/usr/local/bin/kube…" 19 hours ago Up 19 hours k8s_kube-proxy_kube-proxy-xh5dg_kube-system_f6bc49bc-959e-11e8-be29-0eaa4481e274_0
f5eac0b8fe7b k8s.gcr.io/pause:3.1 "/pause" 19 hours ago Up 19 hours k8s_POD_kube-proxy-xh5dg_kube-system_f6bc49bc-959e-11e8-be29-0eaa4481e274_0
541011b3e83a k8s.gcr.io/pause:3.1 "/pause" 19 hours ago Up 19 hours k8s_POD_etcd-ip-172-19-111-200.ec2.internal_kube-system_84d934eebaace20c70e0f268eb100028_0
a5e203947686 k8s.gcr.io/pause:3.1 "/pause" 19 hours ago Up 19 hours k8s_POD_kube-scheduler-ip-172-19-111-200.ec2.internal_kube-system_31eabaff7d89a40d8f7e05dfc971cdbd_0
89dbcdda659c k8s.gcr.io/pause:3.1 "/pause" 19 hours ago Up 19 hours k8s_POD_kube-apiserver-ip-172-19-111-200.ec2.internal_kube-system_4202bb793950ae679b2a433ea8711d18_0
5948e629d90e k8s.gcr.io/pause:3.1 "/pause" 19 hours ago Up 19 hours k8s_POD_kube-controller-manager-ip-172-19-111-200.ec2.internal_kube-system_85384ca66dd4dc0adddc63923e2425a8_0
Forwarding is enabled in sysctl:
sudo sysctl -p
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.tcp_syncookies = 1
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.ip_forward = 1
Nodes can't see each other on port 6443.
It seems like your API server is not running. The error :6443: connect: connection refused points toward the API server being down.
This is further confirmed by your list of running docker containers on the instance: the API server container is missing. Note that you have the related "/pause" container, but there is no container running "kube-apiserver --...". Your scheduler and controller-manager appear to run correctly, but the API server does not.
Now you have to dig in and see what prevented your API server from starting properly. Check the kubelet logs on all control-plane nodes.
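A few commands that usually narrow this down (a sketch; container IDs and paths will differ on your hosts):
sudo docker ps -a | grep kube-apiserver          # -a also lists exited containers
sudo docker logs <container-id>                  # why the apiserver container died
sudo journalctl -u kubelet | grep -i apiserver   # kubelet's view of the static pod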
This can also happen if your Linux kernel is not configured to handle IPv4/IPv6 transparently: an IPv4 address configured while kube-apiserver listens on an IPv6 interface breaks the connection.
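A quick way to check that case (a sketch) is to see whether anything listens on 6443 at all, and on which address family:
sudo ss -ltnp | grep 6443
A listener shown as [::]:6443 (printed as :::6443 by netstat) is bound to the IPv6 wildcard; 0.0.0.0:6443 is IPv4-only.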

HAProxy not creating stats socket

I installed haproxy from the AUR on Arch Linux and modified the config file a bit:
global
    maxconn 20000
    log 127.0.0.1 local0
    user haproxy
    stats socket /run/haproxy/haproxy.sock mode 660 level admin
    stats timeout 30s
    chroot /usr/share/haproxy
    pidfile /run/haproxy.pid
    daemon

defaults
    mode http
    stats enable
    stats uri /stats
    stats realm Haproxy\ Statistics

frontend www-http
    bind 127.0.0.1:80
    default_backend www-backend

backend www-backend
    mode http
    balance roundrobin
    timeout connect 5s
    timeout server 30s
    timeout queue 30s
    server app1 127.0.0.1:5001 check
    server app2 127.0.0.1:5002 check
I have made sure that the directory /run/haproxy exists and has permissions for the user haproxy to write to it:
ツ ls -al /run/haproxy
total 0
drwxr-xr-x 2 haproxy root 40 May 13 21:37 .
drwxr-xr-x 27 root root 720 May 13 22:00 ..
When I launch haproxy using systemctl start haproxy.service, it loads fine. I can even go to the /stats page and view stats. However, socat reports the following error:
ツ sudo socat unix-connect:/run/haproxy/haproxy.sock stdio
2016/05/13 22:04:11 socat[24202] E connect(5, AF=1 "/run/haproxy/haproxy.sock", 27): No such file or directory
I am at my wits' end, unable to understand what is happening. This is what I get from journalctl -xe:
May 13 21:56:31 rohanarch.local systemd[1]: Starting HAProxy Load Balancer...
May 13 21:56:31 rohanarch.local systemd[1]: Started HAProxy Load Balancer.
May 13 21:56:31 rohanarch.local haproxy-systemd-wrapper[20454]: haproxy-systemd-wrapper: executing /usr/bin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds
May 13 21:56:31 rohanarch.local haproxy-systemd-wrapper[20454]: [WARNING] 133/215631 (20456) : config : missing timeouts for frontend 'www-http'.
May 13 21:56:31 rohanarch.local haproxy-systemd-wrapper[20454]: | While not properly invalid, you will certainly encounter various problems
May 13 21:56:31 rohanarch.local haproxy-systemd-wrapper[20454]: | with such a configuration. To fix this, please ensure that all following
May 13 21:56:31 rohanarch.local haproxy-systemd-wrapper[20454]: | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
Basically, there are no errors or warnings, not so much as an indication about the stats socket. Others who have faced a problem with the stats socket fail to get haproxy started at all; in my case, it starts up fine, but the socket just isn't being created.
You need to create the directory yourself. Ensure /run/haproxy exists; if it doesn't, first create it with:
sudo mkdir /run/haproxy
This should resolve your issue.
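Note that /run is a tmpfs, so a manually created directory disappears at reboot. One way to make it persistent (an assumption about your setup; adjust owner and mode as needed) is a systemd-tmpfiles rule:
# /etc/tmpfiles.d/haproxy.conf
d /run/haproxy 0755 haproxy haproxy -
Running sudo systemd-tmpfiles --create applies it immediately, after which the socket can be verified with:
echo "show info" | sudo socat stdio /run/haproxy/haproxy.sock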
Try making SELinux permissive and restart the HAProxy service, e.g. with:
sudo setenforce 0
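Separately, the journal warning about missing timeouts for frontend 'www-http' comes from the defaults section lacking client-side timeouts. Adding the three that the warning names silences it (the values here are only examples):
defaults
    mode http
    timeout client 30s
    timeout connect 5s
    timeout server 30s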

Mesos slaves are not connecting to the Mesos masters cluster

I have a setup with 3 Mesos masters and 3 Mesos slaves. After making all the required configurations, I can see the 3 Mesos masters are part of a cluster maintained by the ZooKeepers.
Now I have set up 3 Mesos slaves, and when I start the mesos-slave service I expect the slaves to show up on the Mesos masters' web UI page. But I cannot see any of them in the Slaves tab.
SELinux, firewall, and iptables are all disabled. I am able to SSH between nodes.
[cloud-user@slave1 ~]$ sudo systemctl status mesos-slave -l
mesos-slave.service - Mesos Slave
Loaded: loaded (/usr/lib/systemd/system/mesos-slave.service; enabled)
Active: active (running) since Sat 2016-01-16 16:11:55 UTC; 3s ago
Main PID: 2483 (mesos-slave)
CGroup: /system.slice/mesos-slave.service
├─2483 /usr/sbin/mesos-slave --master=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins
├─2493 logger -p user.info -t mesos-slave[2483]
└─2494 logger -p user.err -t mesos-slave[2483]
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628670 2497 detector.cpp:482] A new leading master (UPID=master@127.0.0.1:5050) is detected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628732 2497 slave.cpp:729] New master detected at master@127.0.0.1:5050
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628825 2497 slave.cpp:754] No credentials provided. Attempting to register without authentication
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628844 2497 slave.cpp:765] Detecting new master
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.628872 2497 status_update_manager.cpp:176] Pausing sending status updates
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.628922 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: W0116 16:11:55.629107 2502 slave.cpp:3218] Master disconnected! Waiting for a new master to be elected
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: E0116 16:11:55.983531 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Jan 16 16:11:57 slave1.novalocal mesos-slave[2494]: E0116 16:11:57.465049 2503 process.cpp:1911] Failed to shutdown socket with fd 11: Transport endpoint is not connected
So the problematic line is:
Jan 16 16:11:55 slave1.novalocal mesos-slave[2494]: I0116 16:11:55.629093 2502 slave.cpp:3215] master@127.0.0.1:5050 exited
Specifically, note that it's detecting the master as having the IP address 127.0.0.1. The Mesos agent[1] sees that IP address and tries to connect, which fails (the master isn't running on the same machine as the agent).
This happens because the master announces what it thinks its IP address is into ZooKeeper. In your case, the master thinks its IP is 127.0.0.1 and stores that into zk. Mesos has several configuration flags to control this behavior, mainly --hostname, --no-hostname_lookup, --ip, --ip_discovery_command, and the environment variable LIBPROCESS_IP. See http://mesos.apache.org/documentation/latest/configuration/ for details about them and what they do.
The best thing you can do to make sure things work out of the box is to ensure the machines have resolvable hostnames. Mesos does a reverse-DNS lookup of the box's hostname in order to figure out which IP others will use to contact it.
If you can't get the hostnames set up properly, I would recommend setting --hostname and --ip manually, which should cause Mesos to announce exactly what you want (sketched below).
[1] The Mesos slave has been renamed to agent; see: https://issues.apache.org/jira/browse/MESOS-1478
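For example, each master could be started with its advertised address pinned explicitly (a sketch; the IP, hostname, and quorum value are placeholders for your environment):
mesos-master --zk=zk://10.0.0.2:2181,10.0.0.6:2181,10.0.0.7:2181/mesos \
             --quorum=2 --ip=10.0.0.2 --hostname=master1.example.com
Exporting LIBPROCESS_IP=10.0.0.2 before starting the master has the same effect on the advertised IP.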

systemd restart service on watchdog does not terminate previous hung instance

I'm trying to set up a systemd service configuration that restarts the service on watchdog failure. If my application does not call sd_notify() in time, systemd spawns a new instance.
However, the previous instance is not killed. After some time, I have many instances of my application running.
$ systemctl status my-daemon.service
Loaded: loaded (/lib/systemd/system/my-daemon.service; disabled)
Active: active (running) since Tue, 26 Aug 2014 10:27:46 +0000; 7s ago
Main PID: 1433 (attendance-syst)
CGroup: name=systemd:/system/my-daemon.service
├ 1281 /usr/local/bin/my-daemon
├ 1384 /usr/local/bin/my-daemon
├ 1407 /usr/local/bin/my-daemon
└ 1433 /usr/local/bin/my-daemon
...
This is part of my service file:
[Service]
ExecStart=/usr/local/bin/my-daemon
TimeoutStopSec=5
WatchdogSec=10
Restart=on-failure
How can I configure systemd to kill instances that fail the watchdog?
I have already read the manual page, but it didn't help me.
I thought Restart=on-failure would restart the hung process by default...
It's a bug and it's already fixed in newer versions of systemd.
In systemd 208 (available for Debian jessie) it works correctly.
In systemd 204 (available for Debian wheezy via backports) it's still broken.
I haven't found the exact release where they fixed it.
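For reference, on a version where the bug is fixed, a unit along these lines is all that's needed (a sketch; my-daemon is assumed to send WATCHDOG=1 via sd_notify() more often than every WatchdogSec/2 seconds):
[Service]
ExecStart=/usr/local/bin/my-daemon
WatchdogSec=10
Restart=on-failure
TimeoutStopSec=5
On a watchdog timeout, systemd kills the hung main process (SIGABRT by default on recent versions) and, because of Restart=on-failure, starts exactly one replacement instead of leaking instances.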