Prevent GCP maintenance from restarting GKE cluster

It seems like the GKE cluster gets restarted every week. Is there anything I can do to prevent that? It does migrate pods to other nodes while it performs maintenance on one node, but I'm not sure whether there is downtime during the migration, and sometimes the pods get stuck in a CrashLoopBackOff or ErrImagePull state.
How does the migration happen during maintenance? When the total number of replicas is just one, does it create a new pod, route traffic to it, and then delete the old pod? I just want to know if there is downtime. It's a new cluster and monitoring hasn't been set up yet, so I don't know if players are experiencing downtime during maintenance.
Is there a way to prevent GCP from doing maintenance? I used Terraform to create the cluster, so if I can prevent it I need to do it via Terraform, since GKE nodes can't be edited using the GCP console.

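You can configure your maintenance windows and enable or disable automatic node upgrades.
These options appear in the cluster's settings in the GCP console (the screenshot from the original answer is not reproduced here).
You can also decide which release channel you want to be on (rapid, regular or stable).
Your Kubernetes control plane will have downtime during upgrades if you have a zonal cluster. Only regional clusters replicate the control plane.
As for your own applications, they should see no downtime as long as they run more than one replica: GKE creates new nodes and diverts traffic once the replacement pods are ready to receive it. With a single replica, the pod is evicted when its node is drained and a replacement is started on another node, so expect a short gap until the new pod is ready.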
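Since the question asks for a Terraform route, here is a rough sketch of how the maintenance window and release channel could be expressed there. It uses Python only to emit Terraform's JSON syntax. The block names (`maintenance_policy`, `recurring_window`, `release_channel`) are the ones I recall from the Google provider's `google_container_cluster` resource, so please check them against the provider docs for your version. The resource name `primary` and the times are placeholders. Note that this confines maintenance to a window rather than preventing it.

```python
import json


def maintenance_override(resource_name: str, start_time: str, end_time: str,
                         recurrence: str = "FREQ=DAILY",
                         channel: str = "STABLE") -> dict:
    """Build a Terraform JSON fragment for a google_container_cluster.

    resource_name is hypothetical: it must match the name used in your own
    Terraform code. start_time/end_time are RFC 3339 timestamps that define
    the length of the window; recurrence is an RFC 5545 RRULE.
    """
    return {
        "resource": {
            "google_container_cluster": {
                resource_name: {
                    # Restrict automatic maintenance to a recurring window.
                    "maintenance_policy": {
                        "recurring_window": {
                            "start_time": start_time,
                            "end_time": end_time,
                            "recurrence": recurrence,
                        }
                    },
                    # Pick how quickly new GKE versions reach the cluster.
                    "release_channel": {"channel": channel},
                }
            }
        }
    }


if __name__ == "__main__":
    fragment = maintenance_override(
        "primary",                 # placeholder resource name
        "2024-01-01T02:00:00Z",    # placeholder: window opens 02:00 UTC
        "2024-01-01T06:00:00Z",    # placeholder: window closes 06:00 UTC
    )
    print(json.dumps(fragment, indent=2))
```

One option might be to save the output as an override file (for example `maintenance_override.tf.json`) next to the existing configuration, or simply to copy the equivalent blocks into the existing HCL resource by hand.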

Related

How to avoid downtime during scheduled maintenance window

I'm experiencing downtime whenever the GKE cluster gets upgraded during the maintenance window. My services (APIs) become unreachable for about 5 minutes.
The cluster Location type is set to "Zonal", and all my pods have 2 replicas. The only affected pods seem to be the ones using nginx ingress controller.
Is there anything I can do to prevent this? I read that using Regional clusters should prevent downtimes in the control plane, but I'm not sure if it's related to my case. Any hints would be appreciated!

You mention "downtime", but is this downtime for you using the control plane (i.e. kubectl stops working), or downtime in the sense that end users see the service stop working?
A GKE upgrade covers two parts of the cluster: the control plane (master nodes) and the worker nodes. These are two separate upgrades, although they can happen at the same time depending on how your cluster is configured.
Regional clusters can help with that. They cost more because you run more nodes, but the upside is that the cluster is more resilient.
Going back to the earlier point about control plane vs node upgrades: the control plane upgrade does NOT affect the end-user/customer perspective. The services will remain running.
The node upgrade WILL affect the customer, so you should consider various techniques to ensure high availability and resiliency for your services.
A common technique is to increase replicas and add pod anti-affinity. This ensures the pods are scheduled on different nodes, so a node upgrade cannot take out the entire service just because the cluster scheduled all the replicas on the same node.
You mention the nginx ingress controller in your question. If you are using Helm to install it into your cluster, then out of the box it is not set up to use anti-affinity, so it is liable to be taken out of service if all of its replicas get scheduled onto the same node and that node is then marked for upgrade or similar.
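To make the anti-affinity suggestion concrete, here is a minimal sketch that builds a Deployment manifest with a required pod anti-affinity rule keyed on the node hostname. It is written in Python and printed as JSON, which `kubectl apply -f -` accepts as well as YAML. The app name `my-api` and the image are placeholders, not anything from the question.

```python
import json


def deployment_with_anti_affinity(app: str, image: str, replicas: int = 2) -> dict:
    """Return a Deployment manifest whose replicas must land on different nodes."""
    labels = {"app": app}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": app},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # "required" means the scheduler refuses to place two
                            # pods with these labels on the same node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [{"name": app, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = deployment_with_anti_affinity("my-api", "nginx:stable", replicas=2)
    print(json.dumps(manifest, indent=2))
```

With the required form, a replica stays Pending if there are fewer schedulable nodes than replicas. If that is a concern, `preferredDuringSchedulingIgnoredDuringExecution` is the softer variant.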

Enabling RBAC on Existing GKE Cluster

We have been running a cluster on GKE for around three years. As such, legacy authorization is enabled.
The control plane has been getting updated automatically, and our node pools are running a mixture of 1.12 and 1.14.
We have an increasing number of services, and are planning on incrementally adopting istio.
We want to enable a minimal RBAC setup without causing errors and downtime of our services.
I haven't been able to find any guides for how to accomplish this. Some people say just to enable RBAC authorization on the GKE cluster, but I assume that would take down all of our services.
It has also been implied that k8s can run in a hybrid ABAC/RBAC mode, but we can't tell if it is or not!
Is there a good guide for migrating to RBAC for GKE?

If your cluster is regional, you won't have application downtime during an upgrade, but if your cluster is single-zonal or multi-zonal, the best approach is:
Add a new node pool
Cordon and drain the old node pool so the applications move to the new node pool
Delete the old node pool after all pods are migrated.
This is the safest way to update a (zonal) node pool without downtime. Please read the references below to understand every step in detail.
References:
https://kubernetes.io/docs/concepts/architecture/nodes/#reliability
https://kubernetes.io/docs/reference/kubectl/cheatsheet/#interacting-with-nodes-and-cluster

Change node machine type on GKE cluster

I have a GKE cluster I'm trying to switch the default node machine type on.
I have already tried:
Creating a new node pool with the machine type I want
Deleting the default-pool. GKE will process for a bit, then not remove the default-pool. I assume this is some undocumented behavior where you cannot delete the default-pool.
I'd prefer to not re-create the cluster and re-apply all of my deployments/secrets/configs/etc.
k8s version: 1.14.10-gke.24 (Stable channel)
Cluster Type: Regional

The best approach to change/increase/decrease your node pool specification would be with:
Migration
To migrate your workloads without incurring downtime, you need to:
Create a new node pool.
Mark the existing node pool as unschedulable.
Drain the workloads running on the existing node pool.
Check that the workload is running correctly on the new node pool.
Delete the existing node pool.
Your workload will be scheduled automatically onto the new node pool.
Kubernetes, the cluster orchestration system behind GKE, automatically reschedules the evicted Pods onto the new node pool as it drains the existing one.
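The steps above can be sketched as a command plan. The snippet below only builds and prints the `gcloud` and `kubectl` commands rather than running them, so it is safe to try. The cluster, pool, machine type and node names are placeholders. In practice you would get the node names from something like `kubectl get nodes -l cloud.google.com/gke-nodepool=<old pool> -o name`, and you may also need a `--zone` or `--region` flag on the `gcloud` commands.

```python
import shlex


def migration_commands(cluster: str, old_pool: str, new_pool: str,
                       machine_type: str, old_nodes: list) -> list:
    """Return the commands for a node pool migration, in execution order."""
    cmds = [
        # 1. Create the new node pool with the desired machine type.
        ["gcloud", "container", "node-pools", "create", new_pool,
         "--cluster", cluster, "--machine-type", machine_type],
    ]
    # 2. Mark every node of the old pool unschedulable first, so evicted
    #    pods cannot land on another old node.
    for node in old_nodes:
        cmds.append(["kubectl", "cordon", node])
    # 3. Drain the old nodes one by one. On older kubectl versions the last
    #    flag was spelled --delete-local-data.
    for node in old_nodes:
        cmds.append(["kubectl", "drain", node,
                     "--ignore-daemonsets", "--delete-emptydir-data"])
    # 4. Delete the old pool once the workload is verified on the new one.
    cmds.append(["gcloud", "container", "node-pools", "delete", old_pool,
                 "--cluster", cluster])
    return cmds


if __name__ == "__main__":
    # All names below are hypothetical.
    plan = migration_commands("my-cluster", "default-pool", "larger-pool",
                              "e2-standard-4", ["node-a", "node-b"])
    for cmd in plan:
        print(shlex.join(cmd))
```

Cordoning all old nodes before draining any of them is a deliberate choice here: it keeps evicted pods from being rescheduled onto a node that is about to be drained as well.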
There is official documentation about migrating your workload:
This tutorial demonstrates how to migrate workloads running on a GKE cluster to a new set of nodes within the same cluster without incurring downtime for your application. Such a migration can be useful if you want to migrate your workloads to nodes with a different machine type.
-- GKE: Migrating workloads to different machine types
Please take a look at the guide above and let me know if you have any questions on that topic.

Disable the default-pool's autoscaler and set the pool size to 0 nodes.
Wish there was a way I could just switch the machine type on the default-pool...

Kubernetes automatic shutdown after some idle time

Does Kubernetes or Helm support shutting down pods that have been idle for longer than a given threshold?
This would be very useful in a development environment, to free up resources for other processes and save cost.

Kubernetes can autoscale your application in a cluster: it starts additional pods when the load increases and terminates surplus pods when the load decreases.
It is possible to downscale the application to zero pods, but, in this case, you will have a delay serving the first request while the pod is starting.
This functionality relies on performance metrics provided by the Heapster application, which must be running in the cluster (newer Kubernetes releases use metrics-server for this instead). In practice, this means autoscaling doesn't happen instantly, because it takes some time for the metrics to reach the configured threshold.
This Kubernetes feature, the Horizontal Pod Autoscaler (HPA), is described in this document.
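To give a feel for how the HPA decides on a replica count, here is a small sketch of the scaling rule as I understand it from the Kubernetes documentation: the desired count is the current count multiplied by the ratio of the observed metric to the target, rounded up, and no change is made when the ratio is close to 1. The default bounds and the 0.1 tolerance below are illustrative assumptions, and the real controller does more than this (stabilization windows, handling of pods that are not ready, and so on).

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current size.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods at 90% CPU against a 60% target: scale up.
    print(desired_replicas(4, 90, 60))
    # 4 pods at 30% CPU against a 60% target: scale down.
    print(desired_replicas(4, 30, 60))
```

Because the metric has to move away from the target before the ratio changes the result, scaling lags behind the load, which matches the point above about autoscaling not being instant.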
In case you are running your cluster on GCP or GKE, you are able to go further and automatically start additional nodes for your cluster when you need more computing capacity and shut down nodes when they are not running application pods anymore.
More information about this functionality can be found following the link.
If you decide to give it a try, you might find this information useful:
Creating a Container cluster in GKE
70% cheaper Kubernetes cluster on AWS
How to build a Kubernetes Horizontal Pod Autoscaler using custom metrics

How can I protect my GKE cluster against master node failure?

In GKE every cluster has a single master endpoint, which is managed by Google Container Engine. Is this master node highly available?
I deploy a beautiful cluster of redundant nodes with Kubernetes, but what happens if the master node goes down? How can I test this situation?

In Google Container Engine the master is managed for you and kept running by Google. According to the SLA for Google Container Engine the master should be available at least 99.5% of the time.
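To put the 99.5% figure in perspective, a quick back-of-the-envelope calculation shows how much master unavailability that allows. The 30-day period is my own assumption for illustration. The SLA itself defines the exact measurement period.

```python
def allowed_downtime_hours(sla_percent: float, period_days: int = 30) -> float:
    """Hours of unavailability permitted by an availability target."""
    period_hours = period_days * 24
    return period_hours * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    hours = allowed_downtime_hours(99.5)
    print(f"99.5% over 30 days allows about {hours:.1f} hours of downtime")
```

So a 99.5% target still leaves room for a few hours per month in which the master may be unreachable.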

In addition to what Robert Bailey said about GKE keeping the master available for you, it's worth noting that Kubernetes / GKE clusters are designed (and tested) to continue operating properly in the presence of failures. If the master is unavailable, you temporarily lose the ability to change what's running in the cluster (i.e. schedule new work or modify existing resources), but everything that's already running will continue working properly.