How to restart an unresponsive Kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8S API command to issue a restart to the Kubernetes master?
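For reference, the scaling and upgrade attempts described above would look roughly like this with gcloud (cluster name and zone here are placeholders, not the actual cluster):
$ gcloud container clusters resize my-cluster --zone us-central1-a --num-nodes 0   # scale the node pool down
$ gcloud container clusters resize my-cluster --zone us-central1-a --num-nodes 3   # scale it back up
$ gcloud container clusters upgrade my-cluster --zone us-central1-a                # upgrades the nodes only, not the master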

There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior, because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
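(For context, bringing up a non-GKE cluster on GCE with the open-source scripts looks roughly like this; the exact environment variables shown are an assumption about a typical setup:)
$ export KUBERNETES_PROVIDER=gce
$ export KUBE_GCE_ZONE=us-central1-a
$ cluster/kube-up.sh   # creates the master VM and its persistent disk, then the nodes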

I try to keep at least one newer version available to upgrade to: if you upgrade the master, it will restart and be working again within a few minutes. Otherwise you may have to wait around 3 days for the Google team to reboot it. E-mail/phone won't help you, and unless you have paid support (transitioning to which takes a few days), they won't lift a finger.
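A minimal sketch of triggering that restart by upgrading the master (cluster name and zone are placeholders):
$ gcloud container clusters upgrade my-cluster --master --zone us-central1-a   # upgrades (and restarts) the master to the default cluster version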

Related

Prevent GCP maintenance from restarting GKE cluster

Seems like every week the GKE cluster gets restarted. Is there anything I could do to prevent that from happening? It does migrate pods to other nodes while it does maintenance on one of the nodes, but I'm not sure if there is downtime during the migration, and sometimes the pods get stuck in a CrashLoopBackOff or ErrImagePull state.
How does the migration happen during maintenance? Does it create a new pod, route the traffic to it, and then delete the old pod when the total number of replicas is just one? I just want to know if there is downtime. It's a new cluster and monitoring hasn't been set up, so I don't know whether users are experiencing downtime during maintenance.
Is there a way to prevent GCP from doing maintenance? I used Terraform to create the cluster, so if this can be prevented I need to do it via Terraform, since GKE nodes can't be edited from the GCP console.
You can configure your maintenance windows and enable/disable automatic node upgrades.
You can also decide which release channel you want to be on (rapid, regular, or stable).
Your Kubernetes control plane will have downtime if you have a zonal cluster. Only regional clusters replicate the control plane.
Your own applications should have zero downtime; GKE will automatically create new nodes and divert traffic once pods are ready to receive it.
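For completeness, the same settings can also be driven from the command line; a sketch with placeholder names (the Terraform equivalents live in the maintenance_policy and release_channel blocks of the google_container_cluster resource):
$ gcloud container clusters update my-cluster --zone us-central1-a --maintenance-window 03:00   # daily maintenance window starting at 03:00 UTC
$ gcloud container node-pools update default-pool --cluster my-cluster --zone us-central1-a --no-enable-autoupgrade   # turn off automatic node upgrades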

Kubernetes cluster recovery after linux host reboot

We are still in the design phase of moving away from a monolithic architecture towards microservices with Docker and Kubernetes. We did some basic research on Docker and Kubernetes and have some understanding. We still have a couple of open questions, considering we will be creating a K8s cluster with multiple Linux hosts (for various reasons we can't consider the cloud right now).
Consider a scenario where we have a K8s cluster spanning multiple Linux hosts (5+).
1) If one of the Linux worker nodes crashes and we later bring it back, is enabling the kubelet via systemctl in advance sufficient to bring up the required K8s components so that the node is detected by the master again?
2) I believe that once a worker node (running X pods) has crashed, the master will reschedule those X pods onto other healthy node(s) after the pod eviction timeout. Once the crashed node is back up, it won't redeploy those X pods (the master has already scheduled them elsewhere), but it will be ready to accept new work from the master.
Is this correct?
Yes, this should be the default behavior; check your cluster deployment tool.
Yes, Kubernetes handles these things automatically for Deployments. For StatefulSets (with local volumes) and DaemonSets things can be node-specific, and Kubernetes will wait for the node to come back.
It's better to create a test environment and test the failure scenarios yourself.
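A minimal sketch of point 1, assuming a kubeadm-style install where the kubelet runs as a systemd unit:
$ sudo systemctl enable kubelet   # on the worker: start the kubelet automatically on boot
$ sudo systemctl status kubelet   # after the reboot: verify the kubelet is running again
$ kubectl get nodes               # on the master: the node should return to Ready once the kubelet re-registers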

kubectl get nodes hangs when I delete a node externally

Been experimenting with Kubernetes/Rancher and encountered some unexpected behavior. Today I'm deliberately putting on my chaos monkey hat and learning how things behave when stuff fails.
Here's what I've done:
1) Using the Rancher UI, stand up a 3 node cluster on Digital Ocean
Success -- a few mins later I have a 3 node cluster, visible in Rancher.
2) Using the Rancher UI, I delete a node in a 'happy' scenario by pushing the appropriate node delete button in Rancher.
Some minutes later, I have a 2 node cluster. Great.
3) Using the Digital Ocean admin UI, I delete a node in an 'oops' scenario as if a sysadmin accidentally deleted a node.
Back on the ranch (sorry), I open the Rancher UI to view the state of the cluster.
Unfortunately, after three minutes I'm getting a gateway timeout; the Chrome network inspector shows the detailed timeouts.
Here's what kubectl says:
$ kubectl get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
So, question is, what happened here?
I was under the impression Kubernetes was 'self-healing' and that even if the node I deleted was the etcd leader, it would eventually recover. It's been around 2 hours -- do I just need to wait longer?
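If you want to check etcd quorum directly, a rough sketch for a Rancher/RKE-provisioned cluster (this assumes etcd runs in a Docker container named etcd on each etcd node, which is the RKE default; adjust for your setup):
$ docker exec etcd etcdctl member list   # on a surviving node: list etcd members and spot the missing one
$ docker logs --tail 100 etcd            # look for leader-election or lost-quorum errors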

Kubernetes Engine: Node keeps getting unhealthy and rebooted for no apparent reason

My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
I have screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet).
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with the node rebooting was no longer present:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes
As pointed out by user @aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
Above comment could be a good starting point for troubleshooting issues with the cluster.
Options to troubleshoot the cluster pointed out in the comment:
$ kubectl describe node, focusing on the output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
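As a quick command-line complement to Metrics Explorer (assuming the metrics-server add-on is available, as it is on GKE by default):
$ kubectl top nodes                   # current CPU/memory usage per node
$ kubectl top pods --all-namespaces   # per-pod usage, to spot unusually heavy consumers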
Adding to the above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official Istio docs state:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster

How can I protect my GKE cluster against master node failure?

In GKE every cluster has a single master endpoint, which is managed by Google Container Engine. Is this master node highly available?
I can deploy a beautiful cluster of redundant nodes with Kubernetes, but what happens if the master node goes down? How can I test this situation?
In Google Container Engine the master is managed for you and kept running by Google. According to the SLA for Google Container Engine the master should be available at least 99.5% of the time.
In addition to what Robert Bailey said about GKE keeping the master available for you, it's worth noting that Kubernetes / GKE clusters are designed (and tested) to continue operating properly in the presence of failures. If the master is unavailable, you temporarily lose the ability to change what's running in the cluster (i.e. schedule new work or modify existing resources), but everything that's already running will continue working properly.
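If you want to reduce exposure to a zonal control-plane outage, GKE also offers regional clusters, which replicate the control plane across zones (as noted in an earlier answer). A minimal sketch with a hypothetical cluster name:
$ gcloud container clusters create my-regional-cluster --region us-central1 --num-nodes 1   # --num-nodes is per zone, so this creates one node in each of the region's zones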