I ran into an issue with Kubernetes after an OOM on the master node. The Kubernetes services looked OK, and there were no error or warning messages in the logs. But Kubernetes failed to process a new deployment, which was created after the OOM happened.
I restarted Kubernetes with systemctl restart kube-*, and that solved the issue: Kubernetes began working normally again.
I just wonder: is this expected behavior or a bug in Kubernetes?
It would be great if you could share the kube-controller log. When the API server crashes or is OOMKilled, there can be synchronization problems in early versions of Kubernetes (I remember we saw similar problems with DaemonSets, and I filed a bug with the Kubernetes community), but they are rare.
Meanwhile, we did a lot of work to make Kubernetes production ready: both tuning Kubernetes and crafting other microservices that need to talk to Kubernetes. I hope these blog entries help:
https://applatix.com/making-kubernetes-production-ready-part-2/ This is about 30+ knobs we used to tune kubernetes
https://applatix.com/making-kubernetes-production-ready-part-3/ This is about micro service behavior to ensure cluster stability
It seems the problem wasn't caused by the OOM. It was caused by kube-controller, regardless of whether an OOM happened or not.
If I restart kube-controller, Kubernetes begins processing deployments and pods normally.
Related
We have upgraded our AKS cluster to 1.24.3, and since then we have been having an issue with containers refusing connections.
There have been no changes to the deployed microservices as part of the AKS upgrade, and the issue is occurring at random intervals.
From what I can see the container is returning the error - The client closed the connection.
What I cannot seem to trace is the connections within AKS, and the issue occurs across all services.
Has anyone experienced anything similar and is able to provide any advice?
I hit a similar issue upgrading from 1.23.5 to 1.24.3. The issue was a configuration mismatch between the Kubernetes load balancer health probe path and the ingress-nginx probe endpoints.
Adding this annotation to my ingress-nginx helm install command corrected my problem: --set controller.service.annotations."service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path"=/healthz
Is anyone aware of this issue? I have a cluster of 3 nodes and am running pods in a StatefulSet. In total, 3 pods are running in order; assume pod-0 is running on node-1, pod-2 on node-2, and pod-3 on node-3. Normally the traffic flows properly and responses come back immediately. When we stop one node (e.g. node-2), the responses become intermittent and traffic is still routed to the stopped pod as well. Is there any solution or workaround for this issue?
When we stop one node (e.g. node-2), the responses become intermittent and traffic is still routed to the stopped pod as well. Is there any solution or workaround for this issue?
This seems to be a reported issue. However, Kubernetes is a distributed cloud-native system, and you should design for resilience by using request retries.
Improve availability and resilience of your Microservices using these seven cloud design patterns
How to Make Services Resilient in a Microservices Environment
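To make the retry advice concrete, here is a minimal sketch of a client-side retry with exponential backoff and jitter. It is only an illustration: `call_with_retries`, its parameters, and the choice to retry on `ConnectionError` are my own assumptions, not something taken from the linked articles, and a real client would plug in its own HTTP call and error types.

```python
import random
import time


def call_with_retries(request, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call request() and retry on ConnectionError with exponential backoff.

    request      -- zero-argument callable performing one attempt (hypothetical)
    max_attempts -- total number of attempts before giving up
    base_delay   -- delay before the first retry, in seconds
    sleep        -- injectable sleep function, handy for tests
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles on each retry (0.2s, 0.4s, 0.8s, ...); the random
            # jitter keeps many clients from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

A request that lands on the stopped pod then fails fast and is retried, so it has a chance of reaching one of the remaining pods. Retries are only safe for idempotent requests.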
My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason, or where I could look for possible causes, would be appreciated.
Screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet):
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with a node rebooting no longer occurred:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes.
As pointed out by user #aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
The comment above could be a good starting point for troubleshooting issues with the cluster.
Options for troubleshooting the cluster, as pointed out in the comment:
$ kubectl describe node, focusing on the output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
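The Conditions check from the list above can also be scripted. Below is a rough sketch that takes a node object, parsed from JSON shaped like the output of `kubectl get node <name> -o json`, and returns the conditions that look unhealthy. The function name `unhealthy_conditions` and the sample data are made up for illustration. The rule it applies (Ready should be "True", the pressure conditions should be "False") follows the standard node condition types.

```python
import json

# Condition types that signal trouble when their status is "True".
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}


def unhealthy_conditions(node):
    """Return (type, reason, message) for every node condition in a bad state."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        # Ready is healthy when "True"; pressure conditions are healthy when "False".
        bad = (ctype == "Ready" and status != "True") or (
            ctype in PRESSURE_CONDITIONS and status == "True"
        )
        if bad:
            problems.append((ctype, cond.get("reason", ""), cond.get("message", "")))
    return problems


# Hypothetical sample data, not taken from the cluster in the question.
sample_node = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "MemoryPressure", "status": "False",
       "reason": "KubeletHasSufficientMemory", "message": "kubelet has sufficient memory available"},
      {"type": "DiskPressure", "status": "True",
       "reason": "KubeletHasDiskPressure", "message": "kubelet has disk pressure"},
      {"type": "Ready", "status": "False",
       "reason": "KubeletNotReady", "message": "container runtime is down"}
    ]
  }
}
""")

for ctype, reason, message in unhealthy_conditions(sample_node):
    print(f"{ctype}: {reason} ({message})")
```

Running this on a schedule and logging the result can help catch what a node reported shortly before it rebooted.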
Adding to the above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official Istio docs state:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
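As a rough worked example of what that figure means for small nodes, the sketch below turns the quoted 0.5 vCPU per 1000 requests per second into a per-sidecar estimate. The linear scaling, the function names, and the 400 requests per second workload are assumptions made for illustration. The 1 vCPU node size comes from the question.

```python
def sidecar_vcpu(requests_per_second, vcpu_per_1000_rps=0.5):
    """Estimate sidecar proxy CPU, assuming cost scales linearly with request rate."""
    return requests_per_second / 1000 * vcpu_per_1000_rps


def share_of_node(requests_per_second, node_vcpu=1.0):
    """Fraction of one node's CPU that a single sidecar would use."""
    return sidecar_vcpu(requests_per_second) / node_vcpu


# Hypothetical workload: one pod serving 400 requests per second on a 1 vCPU node.
rps = 400
print(f"sidecar CPU: {sidecar_vcpu(rps):.2f} vCPU")
print(f"share of a 1 vCPU node: {share_of_node(rps):.0%}")
```

On nodes this small, every sidecar takes a visible slice of the CPU before the application itself gets any, which fits the recommendation above to use 2 vCPU machine types.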
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster
After deploying anything to minikube it seems as though the apiserver starts eating up all the CPU and makes the dashboard mostly unusable until the apiserver dies and gets restarted.
I've read through a bit of the references found in this post: kube-apiserver high CPU and requests
However, those seem to specifically target deployed k8s clusters on many machines, or at least where the master isn't on the same machine.
That's not how it would work with minikube since it's a single-node cluster. Not to mention it typically isn't given a ton of resources (neither CPU nor memory).
Is there a way to curb or eliminate this behavior? Perhaps I've missed some important configuration for running on minikube?
The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8S API command to issue a restart to the Kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
I try to always keep at least one version upgrade available: if you upgrade the master, it will restart and be working within a few minutes. Otherwise you will have to wait around 3 days for the Google team to reboot it. They won't help you by e-mail or phone, and unless you have paid support (switching to which takes a few days), they won't care.