All Kubernetes Pods go down simultaneously periodically

I've been running a Kubernetes cluster for a while now, but I haven't been able to keep it stable.
My cluster consists of four nodes, two masters and two workers. All nodes run on the same physical server, which in turn runs VMware vSphere 6.5. Each node runs CoreOS stable (1353.7.0), and I'm running Kubernetes/Hyperkube v1.6.4, using Calico for networking. I've followed the steps in this guide.
What happens is that for a few hours/days, the cluster will run without a hitch. Then, all of a sudden (for no discernible reason as far as I can tell) all my pods go to status "Pending" and stay that way. Any hosted services are then no longer reachable.
After a while (usually 5 to 10 minutes), it seems to restore itself, after which it starts recreating all my pods, and trying (but failing) to shut down all my running pods. Some of the newly created pods come up, but will initially have no connection to the internet.
For a couple of weeks now I've had this issue intermittently, and it's been preventing me from using Kubernetes in production. I'd really like to figure out what's been causing this!
Weirdly enough, when I try to diagnose the problem by inspecting the logs, I find that on both of my worker nodes the journald logs have become corrupted! On the master nodes, the log is still readable, but not very informative.
Even when running, kubelet is constantly emitting errors in its logs. On all the nodes, this is what's posted about once a minute:
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.012890 24228 cni.go:275] Error deleting network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014762 24228 remote_runtime.go:109] StopPodSandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014818 24228 kuberuntime_gc.go:138] Failed to stop sandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:38:07 kube-master1 kubelet-wrapper[24228]: I0526 09:38:07.422341 24228 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9a378211-3597-11e7-a7ec-000c2958a0d7-default-token-0p3gf" (spec.Name: "default-token-0p3gf") pod "9a378211-3597-11e7-a7ec-000c2958a0d7" (UID: "9a378211-3597-11e7-a7ec-000c2958a0d7").
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: W0526 09:38:14.037553 24228 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "logstash-s3498_default": Unexpected command output nsenter: cannot open : No such file or directory
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: with error: exit status 1
I've googled this error and found this issue, but it has been closed, and people there indicate that using v1.6.0 or later should resolve it. It definitely hasn't in my case...
Can anybody point me in the right direction?!
Thanks!

I've seen this as well. The problem seems to go away if you downgrade CoreOS to an older version with Docker 1.12.3.
Docker is a nightmare with regressions in every single version they release :(

Related

After uninstalling calico, new pods are stuck in container creating state

After uninstalling Calico with kubectl delete -f calico.yaml, I am not able to create new pods in the cluster. Any new pods in the cluster are stuck in the container creating state. kubectl describe shows the errors below:
Warning FailedCreatePodSandBox 2m kubelet, 10.0.12.2 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to set up pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "f15743177fd70c5eabf70c60be5b8b354e5346837d1b5d59bf99d1d1b5d6416c" network for pod "test-9465-768b57b5df-fv9d4": NetworkPlugin cni failed to teardown pod "test-9465-768b57b5df-fv9d4_policy-demo" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
The main issue is that Calico has an init container but does not have a cleanup container.
To undeploy Calico, we have to do the usual kubectl delete -f <yaml>, and then delete the Calico conf file in /etc/cni/net.d/ on each of the nodes. This configuration file, along with other binaries, is loaded onto the host by the init container.
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
From this link we can see that kubelet reads the configuration file from the default directory, and if there are multiple configuration files, it applies the CNI plugin from the config file that appears first in alphabetical order (why, oh god why??).
So, in our case, after uninstalling Calico, its admin privileges are removed, but the nodes still try to apply Calico rules based on the config file that kubelet picked up from the default directory. After removing the file, restart the node to get rid of the iptables rules.
Removing the file and restarting the node solves the issue and we get back to normal behavior. If you are on a managed Kubernetes cluster, another way to solve the same problem is to simply terminate the node. Since public cloud infrastructure automatically boots up another node to keep the same state, the replacement node no longer has the Calico configuration file.
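The selection rule described above can be sketched in a few lines of Python. This is only an illustration of "the config file that sorts first wins", not kubelet's actual code. The function names, the list of accepted extensions, and the file names in the usage example are all assumptions made for the sketch.

```python
import os

# Extensions treated as CNI config files in this sketch (an assumption).
CNI_EXTENSIONS = (".conf", ".conflist", ".json")


def pick_cni_config(filenames):
    """Return the config file name that sorts first, or None if there is none."""
    candidates = sorted(
        name for name in filenames if name.endswith(CNI_EXTENSIONS)
    )
    return candidates[0] if candidates else None


def pick_cni_config_from_dir(conf_dir="/etc/cni/net.d"):
    """Apply the same rule to a directory on the node (hypothetical helper)."""
    if not os.path.isdir(conf_dir):
        return None
    return pick_cni_config(os.listdir(conf_dir))


if __name__ == "__main__":
    # Hypothetical leftovers on a node after Calico was uninstalled.
    leftover = ["20-other-plugin.conflist", "10-calico.conflist", "README"]
    print(pick_cni_config(leftover))
```

With these hypothetical file names the stale Calico file sorts first, which is why it keeps being used until it is deleted from the node.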

fluentd daemon set container for papertrail failing to start in kubernetes cluster

I am trying to set up Fluentd in a Kubernetes cluster to aggregate logs in Papertrail, as per the documentation provided here.
The configuration file is fluentd-daemonset-papertrail.yaml
It basically creates a daemon set for the Fluentd container and a config map for the Fluentd configuration.
When I apply the configuration, the pod is assigned to a node and the container is created. However, it's either not completing the initialization or the pod gets killed immediately after it is started.
As the pods are getting killed, I am losing the logs too, so I couldn't investigate the cause of the issue.
Looking through the events for the kube-system namespace shows the errors below:
Error: failed to start container "fluentd": Error response from daemon: OCI runtime create failed: container_linux.go:338: creating new parent process caused "container_linux.go:1897: running lstat on namespace path \"/proc/75026/ns/ipc\" caused \"lstat /proc/75026/ns/ipc: no such file or directory\"": unknown
Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "9559643bf77e29d270c23bddbb17a9480ff126b0b6be10ba480b558a0733161c" network for pod "fluentd-papertrail-b9t5b": NetworkPlugin kubenet failed to set up pod "fluentd-papertrail-b9t5b_kube-system" network: Error adding container to network: failed to open netns "/proc/111610/ns/net": failed to Statfs "/proc/111610/ns/net": no such file or directory
I am not sure what's causing these errors. I'd appreciate any help to understand and troubleshoot them.
Also, is it possible to look at logs/events that could tell us why a pod is given a terminate signal?
Please ensure that /etc/cni/net.d and its /opt/cni/bin friend both exist and are correctly populated with the CNI configuration files and binaries on all Nodes.
Take a look: sandbox.
With help from the Papertrail support team, I was able to resolve the issue by removing the entry below from the manifest file.
kubernetes.io/cluster-service: "true"
The above annotation seems to have been deprecated.
Relevant github issues:
https://github.com/fluent/fluentd-kubernetes-daemonset/issues/296
https://github.com/kubernetes/kubernetes/issues/72757

Kubernetes Node NotReady: ContainerGCFailed / ImageGCFailed context deadline exceeded

Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Environment:
Ubuntu, 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is an issue about this on the Kubernetes GitHub (k8 git), which was closed on the grounds that it is related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (the root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
It is related to this open k8 issue: Ready/NotReady with PLEG issues
Checked node health on AWS with CloudWatch - everything seems to be fine.
journalctl -fu docker.service - checked Docker for errors/issues; the output doesn't show any errors related to that.
systemctl restart docker - after restarting Docker, the node gets into the "Ready" state, but in 3-5 minutes it becomes "NotReady" again.
It all seems to have started when I deployed more pods to the node (close to its resource capacity, but I don't think that is a direct dependency) or when I was stopping/starting instances (after a restart it is OK, but after some time the node is NotReady).
Questions:
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no established mitigation or monitoring for this. The best approach appears to be making sure your node is not overloaded with pods. I have seen that this does not always show up as disk or memory pressure on the Node; it is probably a case of not enough resources being available to Docker, so it fails to respond in time. The proposed solution is to set limits for your pods to prevent overloading the Node.
In the case of managed Kubernetes in GKE (I am not sure, but other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and redeploy the node(s).
If you already have requests and limits set, it seems like the best way to make sure this does not happen is to increase the memory requests for your pods. This will mean fewer pods per node, and the actual memory used on each node should be lower.
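To make the "larger requests mean fewer pods per node" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes scheduling is bounded by memory requests alone and ignores CPU, system pods, and any per-node pod cap. The function name and the numbers are hypothetical, not taken from the cluster in the question.

```python
def pods_that_fit(node_allocatable_mib, pod_request_mib):
    """Pods that fit on one node if only memory requests limit scheduling."""
    if pod_request_mib <= 0:
        raise ValueError("pod_request_mib must be positive")
    return node_allocatable_mib // pod_request_mib


if __name__ == "__main__":
    allocatable = 8192  # hypothetical node with 8 GiB allocatable memory
    for request in (256, 512, 1024):  # hypothetical per-pod requests in MiB
        print(request, pods_that_fit(allocatable, request))
```

Doubling the request halves the number of pods the scheduler will place on the node, which leaves more headroom for Docker and the kubelet.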
Another way of monitoring/recognizing this is to SSH into the node and check the memory, check the processes with ps, monitor the syslog, and run the command docker stats --all
I had the same issue. I cordoned the node and evicted the pods, then rebooted the server. The node came back into the Ready state automatically.

Minions can't rejoin cluster on reboot of AWS instance

The Kubernetes cluster (v1.3.4) starts a master and 2 minions.
The cluster starts fine, and pods can be started and controlled without issue.
As soon as one of the minions is rebooted, or any of the dependent services such as kubelet is restarted, the minions will not rejoin the cluster.
The error from the kubelet service is of the form:
Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309 911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
The only way we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it.
UPDATE:
I had a look at the controller manager log and got the following
W0815 13:36:39.087991 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045 1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
This is actually a CoreOS issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low-level OS host resolution code being called from the AWS Go layers, but that is purely a guess. Upgrading the CoreOS AMI to a later version solved many of the issues we were facing.

Errors relating to Kubernetes watches

I am seeing a lot of errors in my logs relating to watches. Here's a snippet from my apiserver log on one machine:
W0517 07:54:02.106535 1 reflector.go:289] pkg/storage/cacher.go:161: watch of *api.Service ended with: client: etcd cluster is unavailable or misconfigured
W0517 07:54:02.106553 1 reflector.go:289] pkg/storage/cacher.go:161: watch of *api.PersistentVolumeClaim ended with: client: etcd cluster is unavailable or misconfigured
E0517 07:54:02.120217 1 reflector.go:271] pkg/admission/resourcequota/admission.go:86: Failed to watch *api.ResourceQuota: too old resource version: 790115 (790254)
E0517 07:54:02.120390 1 reflector.go:271] pkg/admission/namespace/lifecycle/admission.go:126: Failed to watch *api.Namespace: too old resource version: 790115 (790254)
E0517 07:54:02.134209 1 reflector.go:271] pkg/admission/serviceaccount/admission.go:102: Failed to watch *api.ServiceAccount: too old resource version: 790115 (790254)
As you can see, there are two types of errors:
etcd cluster is unavailable or misconfigured
I am passing --etcd-servers=http://k8s-master-etcd-elb.eu-west-1.i.tst.nonprod-ffs.io:2379 to the apiserver (this is definitely reachable). Another question seems to suggest that this does not work, but --etcd-cluster is not a recognised option in the version I'm running (1.2.3)
too old resource version
I've seen various mentions of this (e.g. this issue) but nothing conclusive as to what causes it. I understand the default cache window is 1000, but the delta between the versions in the example above is less than 1000. Could it be that the error above is the cause of this?
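The delta mentioned above can be checked directly from the log lines. The sketch below is a small Python helper written for this purpose. Reading the first number as the requested resource version and the number in parentheses as the current one is an assumption about the log format, and the helper name is hypothetical.

```python
import re

# Matches e.g. "too old resource version: 790115 (790254)".
PATTERN = re.compile(r"too old resource version: (\d+) \((\d+)\)")


def version_delta(log_line):
    """Return (second number - first number) from the message, or None if absent."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    requested, current = int(match.group(1)), int(match.group(2))
    return current - requested


if __name__ == "__main__":
    line = (
        "E0517 07:54:02.120217 1 reflector.go:271] "
        "pkg/admission/resourcequota/admission.go:86: Failed to watch "
        "*api.ResourceQuota: too old resource version: 790115 (790254)"
    )
    print(version_delta(line))  # prints 139
```

For the lines quoted in the question the delta is 139, well under the 1000 window.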
I see that you are accessing etcd through an ELB proxy on AWS.
I have a similar setup, except that etcd is decoupled from the kube master server into its own 3-node cluster, hidden behind an internal ELB.
I can see the same errors from the kube-apiserver when it is configured to use the ELB. Without the ELB, configured as usual with a list of etcd endpoints, I don't see any errors.
Unfortunately, I don't know the root cause or why this is happening; I will investigate more.