I am seeing a lot of errors in my logs relating to watches. Here's a snippet from my apiserver log on one machine:
W0517 07:54:02.106535 1 reflector.go:289] pkg/storage/cacher.go:161: watch of *api.Service ended with: client: etcd cluster is unavailable or misconfigured
W0517 07:54:02.106553 1 reflector.go:289] pkg/storage/cacher.go:161: watch of *api.PersistentVolumeClaim ended with: client: etcd cluster is unavailable or misconfigured
E0517 07:54:02.120217 1 reflector.go:271] pkg/admission/resourcequota/admission.go:86: Failed to watch *api.ResourceQuota: too old resource version: 790115 (790254)
E0517 07:54:02.120390 1 reflector.go:271] pkg/admission/namespace/lifecycle/admission.go:126: Failed to watch *api.Namespace: too old resource version: 790115 (790254)
E0517 07:54:02.134209 1 reflector.go:271] pkg/admission/serviceaccount/admission.go:102: Failed to watch *api.ServiceAccount: too old resource version: 790115 (790254)
As you can see, there are two types of errors:
etcd cluster is unavailable or misconfigured
I am passing --etcd-servers=http://k8s-master-etcd-elb.eu-west-1.i.tst.nonprod-ffs.io:2379 to the apiserver (this is definitely reachable). Another question seems to suggest that this does not work, but --etcd-cluster is not a recognised option in the version I'm running (1.2.3)
too old resource version
I've seen various mentions of this (e.g. this issue) but nothing conclusive as to what causes it. I understand the default cache window is 1000, but the delta between the versions in the example above is less than 1000. Could the error above be the cause of this?
I see that you are accessing etcd through an ELB proxy on AWS.
I have a similar setup, except that etcd is decoupled from the master server into its own 3-node cluster, hidden behind an internal ELB.
I can see the same errors from the kube-apiserver when it is configured to use the ELB. Without the ELB, configured as usual with a list of etcd endpoints, I don't see any errors.
Unfortunately, I don't know the root cause or why this is happening; I will investigate further.
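For reference, the direct configuration is just a comma-separated list on the apiserver flag; a minimal sketch, assuming three hypothetical etcd members (the hostnames are placeholders, not from your setup):
# hypothetical etcd member hostnames; keep your other apiserver flags as they are
kube-apiserver \
  --etcd-servers=http://etcd-1.internal:2379,http://etcd-2.internal:2379,http://etcd-3.internal:2379
With the list form, the apiserver's etcd client can fail over between members itself instead of relying on the ELB to proxy a single endpoint.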
We have our k8s cluster set up with our app, including a neo4j DB deployment and other artifacts. Overnight, we started facing an issue in our GKE cluster when trying to exec into or otherwise interact with any pod running in the cluster. A sample of the error we get for any issued command:
error: unable to upgrade connection: Authorization error (user=kube-apiserver, verb=create, resource=nodes, subresource=proxy)
Our GKE cluster was created as a standard cluster (not Autopilot); the versions are shown in the 'Node pool details' and 'cluster basics' screenshots.
As said before, it was working fine regardless of the warning about the versions. However, we haven't yet been able to identify what could have changed between the last time it worked and now.
Any clue as to what authorization setup might have changed, making it incompatible now, is very welcome.
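For what it's worth, that particular message usually means the user the API server presents to the kubelet (here kube-apiserver) lacks RBAC permission on the nodes/proxy subresource, which exec/attach/port-forward go through. A minimal sketch of how to check and, if it is missing, grant it, assuming RBAC is the culprit and that the built-in system:kubelet-api-admin ClusterRole is acceptable for your setup (the binding name is made up):
# check whether the kube-apiserver user may create nodes/proxy
kubectl auth can-i create nodes/proxy --as=kube-apiserver
# if not, bind the built-in kubelet API role to that user (binding name is arbitrary)
kubectl create clusterrolebinding kube-apiserver-kubelet-api-admin \
  --clusterrole=system:kubelet-api-admin \
  --user=kube-apiserver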
Hope someone can help me.
To describe the situation in short, I have a self managed k8s cluster, running on 3 machines (1 master, 2 worker nodes). In order to make it HA, I attempted to add a second master to the cluster.
After some failed attempts, I found out that I needed to add a controlPlaneEndpoint setting to the kubeadm-config ConfigMap. So I did, with masternodeHostname:6443.
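Concretely, the edit was along these lines (a rough sketch; the surrounding ClusterConfiguration fields were left unchanged):
kubectl -n kube-system edit configmap kubeadm-config
# inside the embedded ClusterConfiguration document, add:
#   controlPlaneEndpoint: "masternodeHostname:6443"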
I generated the certificate and join command for the second master, and after running it on the second master machine, it failed with
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
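The join command was of roughly this form (placeholder values for the token, CA hash, and certificate key, which may differ from how you distributed the certificates):
kubeadm join masternodeHostname:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>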
Checking the first master now, I get connection refused for the IP on port 6443. So I cannot run any kubectl commands.
I tried recreating the .kube folder, with all the config copied there - no luck.
I restarted kubelet and Docker.
The containers running on the cluster seem ok, but I am locked out of any cluster configuration (dashboard is down, kubectl commands not working).
Is there any way to make it work again, without losing any of the configuration or the deployments already present?
Thanks! Sorry if it’s a noob question.
Cluster information:
Kubernetes version: 1.15.3
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: RHEL 7
CNI and version: weave 0.3.0
CRI and version: containerd 1.2.6
This is an old, known problem with Kubernetes 1.15 [1,2].
It is caused by a short etcd timeout period. As far as I'm aware it is a hard-coded value in the source and cannot be changed (a feature request to make it configurable is open for version 1.22).
Your best bet would be to upgrade to a newer version, and recreate your cluster.
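If you do go that route, the rebuild is roughly the following (a sketch only; kubeadm reset wipes local cluster state, so back up manifests and persistent data first, and LOAD_BALANCER_DNS is a placeholder for whatever stable endpoint you put in front of the control plane):
# on every node, after installing newer kubeadm/kubelet packages: wipe the old cluster state
sudo kubeadm reset
# on the first control-plane node of the new cluster
sudo kubeadm init --control-plane-endpoint "LOAD_BALANCER_DNS:6443" --upload-certs
# then join the remaining control-plane and worker nodes with the join commands it prints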
I have a Kubernetes cluster with some deployments and pods. I have experienced an issue with my deployments, with error messages like FailedToUpdateEndpoint and ReadinessProbeFailed.
These errors are unexpected and I have no idea what causes them. When we analyse our logs, it seems like someone is trying to hack our cluster (we are not sure about it).
Things I would like to clarify:
1. Is there any chance someone can illegally access our Kubernetes cluster without having the kubeconfig?
2. Is there any chance that, by using the frontend IP, someone could access our apps and make changes to the cluster configuration (i.e. hack the cluster services via the web URL)?
3. Even if the cluster is accessed illegally via the frontend URL, is there any chance of changing the configuration of the cluster?
4. Is there any mechanism to detect whether the Kubernetes cluster is in a healthy state or has been hacked by someone?
The questions above focus on whether there are any security-related issues with the Kubernetes engine. If not, then:
5. I am still working to find the reason for these errors; please provide more information on what may be causing them.
Error Messages:
FailedToUpdateEndpoint: Failed to update endpoint default/job-store: Operation cannot be fulfilled on endpoints "job-store": the object has been modified; please apply your changes to the latest version and try again
The same error happens for all the pods in our cluster.
Readiness probe failed: Error verifying datastore: Get https://API_SERVER: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver
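A few standard checks can help distinguish ordinary control-plane churn from unauthorized access; a sketch (the endpoint name job-store is taken from the error above):
# recent events often explain endpoint churn (rolling updates, failing probes)
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 30
# inspect the object named in the conflict error
kubectl get endpoints job-store -o yaml
# see what an unauthenticated caller would be allowed to do
kubectl auth can-i --list --as=system:anonymous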
Worker node is getting into "NotReady" state with an error in the output of kubectl describe node:
ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded
Environment:
Ubuntu 16.04 LTS
Kubernetes version: v1.13.3
Docker version: 18.06.1-ce
There is a closed issue on this on the Kubernetes GitHub (k8 git), which was closed on the grounds of being related to a Docker issue.
Steps done to troubleshoot the issue:
kubectl describe node - the error in question was found (the root cause isn't clear).
journalctl -u kubelet - shows this related message:
skipping pod synchronization - [container runtime status check may not have completed yet PLEG is not healthy: pleg has yet to be successful]
This is related to this open k8s issue: Ready/NotReady with PLEG issues.
Checked node health on AWS with CloudWatch - everything seems to be fine.
journalctl -fu docker.service - checked Docker for errors/issues; the output doesn't show any errors related to this.
systemctl restart docker - after restarting Docker, the node gets into "Ready" state, but in 3-5 minutes it becomes "NotReady" again.
It all seems to have started when I deployed more pods to the node (close to its resource capacity, though I don't think that is a direct dependency) or was stopping/starting instances (after a restart it is OK, but after some time the node is NotReady again).
Questions:
What is the root cause of the error?
How to monitor that kind of issue and make sure it doesn't happen?
Are there any workarounds to this problem?
What is the root cause of the error?
From what I was able to find, it seems like the error happens when there is an issue contacting Docker, either because it is overloaded or because it is unresponsive. This is based on my experience and on what has been mentioned in the GitHub issue you provided.
How to monitor that kind of issue and make sure it doesn't happen?
There seems to be no clear mitigation or monitoring for this, but the best approach would be to make sure your node will not be overloaded with pods. I have seen that it does not always show up as disk or memory pressure on the Node - it is more likely a matter of Docker not having enough resources and failing to respond in time. The proposed solution is to set requests and limits for your pods to prevent overloading the Node.
In the case of managed Kubernetes on GKE (other vendors probably have a similar feature) there is a feature called node auto-repair. It will not prevent node pressure or Docker-related issues, but when it detects an unhealthy node it can drain and redeploy it.
If you already have requests and limits set, the best way to make sure this does not happen is probably to increase the memory requests for your pods. This will mean fewer pods per node, and the actual used memory on each node should be lower.
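A minimal sketch of what that looks like in a pod spec (the name, image, and numbers are placeholders to illustrate the idea, not tuned values):
# placeholder pod illustrating requests/limits
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: nginx:1.15        # placeholder image
    resources:
      requests:
        memory: "256Mi"      # higher requests mean fewer pods scheduled per node
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
EOF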
Another way of monitoring/recognising this is to SSH into the node and check the memory, the processes with ps, the syslog, and the output of docker stats --all.
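For example (run on the node itself; the syslog path assumes Ubuntu, as in the environment above):
# quick picture of memory pressure and Docker load on the node
free -m
ps aux --sort=-%mem | head -n 15
tail -n 100 /var/log/syslog
docker stats --all --no-stream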
I got the same issue. I cordoned the node and evicted the pods, then rebooted the server; the node automatically came back into the Ready state.
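Roughly, the sequence was (the node name is a placeholder):
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-local-data
sudo reboot                    # on the node itself
kubectl uncordon <node-name>   # once it reports Ready again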
The Kubernetes cluster, using v1.3.4, starts a master and 2 minions.
The cluster starts fine, and pods can be started and controlled without issue.
As soon as one of the minions is rebooted, or any of the dependent services such as kubelet is restarted, the minion will not rejoin the cluster.
The error from the kubelet service is of the form:
Aug 08 08:21:15 ip-10-16-1-20 kubelet[911]: E0808 08:21:15.955309 911 kubelet.go:2875] Error updating node status, will retry: error getting node "ip-10-16-1-20.us-west-2.compute.internal": nodes "ip-10-16-1-20.us-west-2.compute.internal" not found
The only way that we can see to rectify this issue at the moment is to tear down the whole cluster and rebuild it.
UPDATE:
I had a look at the controller manager log and got the following
W0815 13:36:39.087991 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
W0815 13:37:39.123811 1 nodecontroller.go:433] Unable to find Node: ip-10-16-1-25.us-west-2.compute.internal, deleting all assigned Pods.
E0815 13:37:39.133045 1 nodecontroller.go:434] pods "kube-proxy-ip-10-16-1-25.us-west-2.compute.internal" not found
This is actually a CoreOS issue, although it is difficult to ascertain what the problem actually is. It is more than likely the low-level OS host-resolution code being called from the AWS Go layers, but that is purely a guess. Upgrading the CoreOS AMI to a later version solved many of the issues we were facing.
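One way to sanity-check the name-resolution angle before rebuilding; a sketch assuming the node can reach the AWS instance metadata service:
# the name the kubelet registers/updates vs. the nodes the API server knows about
curl -s http://169.254.169.254/latest/meta-data/local-hostname
kubectl get nodes -o wide
# CoreOS image version on the node (consider upgrading the AMI if it is old)
grep VERSION /etc/os-release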