Upgrading EKS node group from 1.17 to 1.18 failed due to AsgInstanceLaunchFailures - how do I fix it?

I have an EKS cluster that has gone through an upgrade from 1.17 to 1.18.
The cluster has 2 node groups (updated using the AWS console).
The EKS control plane and one of the node groups upgraded without problems.
The upgrade of the last node group is failing due to a health issue: AsgInstanceLaunchFailures - One or more target groups not found. Validating load balancer configuration failed. The node group is now marked as Degraded.
When I open the update ID I see the following error:
NodeCreationFailure - Couldn't proceed with upgrade process as new nodes are not joining node group {NODE_GROUP_NAME}
I tried accessing the ASG with that ID and I can see it has several load-balancing target groups attached to it.
I could not find any way to fix this in the AWS docs.
Any advice?

Issue resolved.
It appears an empty target group had been added manually to the cluster (there were 3 other target groups created automatically). Once the empty target group was deleted, the upgrade completed successfully.
I am still unclear as to how EKS chooses the proper target group to update when there is more than one.
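For anyone hitting the same error, here is a rough sketch of how the target groups attached to the ASG could be checked for one that is missing or empty. It assumes boto3-style clients are passed in; the function name `find_suspect_target_groups` is made up for illustration, and pagination is ignored for brevity.

```python
def find_suspect_target_groups(asg_name, autoscaling, elbv2):
    """Return (arn, reason) pairs for target groups attached to the ASG
    that cannot be looked up or have no registered targets."""
    attached = autoscaling.describe_load_balancer_target_groups(
        AutoScalingGroupName=asg_name
    )["LoadBalancerTargetGroups"]

    suspects = []
    for tg in attached:
        arn = tg["LoadBalancerTargetGroupARN"]
        try:
            health = elbv2.describe_target_health(TargetGroupArn=arn)
        except Exception as exc:  # e.g. the target group no longer exists
            suspects.append((arn, f"lookup failed: {exc}"))
            continue
        if not health["TargetHealthDescriptions"]:
            suspects.append((arn, "no registered targets"))
    return suspects


# Usage, assuming boto3 is installed and AWS credentials are configured
# ("my-asg" is a placeholder name):
#   import boto3
#   find_suspect_target_groups(
#       "my-asg", boto3.client("autoscaling"), boto3.client("elbv2"))
```

An empty target group is not necessarily wrong in general, so treat the result as a list of candidates to inspect rather than things to delete.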

Are you able to launch new nodes that come up in the Ready state and join the cluster? According to the public EKS documentation, the upgrade request succeeds only when the ASG can launch new instances that reach the Ready state in all the AZs of the node group.
To debug this further, you can trigger a new upgrade request and check the health of new nodes which EKS brings up in your cluster.
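One way to check the health of the new nodes is to look at the Ready condition in the output of `kubectl get nodes -o json`. A small sketch, assuming that JSON has already been parsed into a dict; the function name `not_ready_nodes` is hypothetical.

```python
def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True".

    node_list is the parsed output of `kubectl get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.append(node["metadata"]["name"])
    return names


# Usage (nodes.json is a placeholder file name):
#   kubectl get nodes -o json > nodes.json
#   import json
#   print(not_ready_nodes(json.load(open("nodes.json"))))
```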

Related

terraform: how to upgrade Azure AKS default nodepool VM size without replacement of the cluster

I'm trying to upgrade the VM size of my AKS cluster using this approach with Terraform. Basically I create a new nodepool with the required amount of nodes, then I cordon the old node to disallow scheduling of new pods. Next, I drain the old node to reschedule all the pods in the newly created node. Then, I proceed to upgrade the VM size.
The problem I am facing is that the azurerm_kubernetes_cluster resource allows for the creation of the default_node_pool, while a separate resource, azurerm_kubernetes_cluster_node_pool, allows me to create new nodepools with extra nodes.
Everything works until I create the new nodepool and cordon and drain the old one. However, when I change default_node_pool.vm_size and run terraform apply, it tells me that the whole resource has to be recreated, including the new nodepool I just created, because it is linked to the cluster ID, which will be replaced.
How should I manage this upgrade from the documentation with Terraform if upgrading the default node pool always forces replacement even if a new nodepool is in place?
terraform version
terraform v1.1.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/azurerm v2.82.0
+ provider registry.terraform.io/hashicorp/local v2.2.2

Kubernetes V1.16.8 doesn't support 'node-role' label using "--node-labels=node-role.kubernetes.io/master="

I am upgrading a kube-aws v1.15.5 cluster to the next version, 1.16.8.
Use case:
I want to keep the same node labels for master and worker nodes that I'm using in v1.15.
When I tried to upgrade the cluster to v1.16, I found that --node-labels is no longer allowed to set 'node-role' labels.
If I keep the node role as "node-role.kubernetes.io/master", the kubelet fails to start after the upgrade. If I remove the label, the kubectl get node output shows none as the role for the upgraded node.
How do I reproduce?
Before the upgrade I took a backup with 'cp /etc/sysconfig/kubelet /etc/sysconfig/kubelet-bkup' and removed "-role" from it. Once the upgrade completed, I put the edited file back with 'mv /etc/sysconfig/kubelet-bkup /etc/sysconfig/kubelet'. I can now see the node role as Master/Worker even after a kubelet service restart.
The problem I'm facing now:
I performed the upgrade on the existing cluster successfully. However, the cluster runs in AWS under the kube-aws model, so the ASG spins up a new node whenever the Cluster Autoscaler triggers it.
But the new node fails to join the cluster because the node label "node-role.kubernetes.io/master" still exists in the code base.
How can I add the node-role label dynamically when the ASG scales out? Any solution would be appreciated.
Note:
(Kubeadm, kubelet, kubectl )- v1.16.8

I have sorted out the issue. I wrote a Python script that watches node events. Whenever the ASG spins up a new node, the node joins the cluster with an empty role (""); the script then adds the appropriate label to the node dynamically.
I also built a Docker image around that Python script, and it runs as a pod. The pod is deployed into the cluster and does the job of labelling the new nodes.
See my solution on GitHub:
https://github.com/kubernetes/kubernetes/issues/91664
The Docker image is publicly available:
https://hub.docker.com/r/shaikjaffer/node-watcher
Thanks,
Jaffer
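For readers who want the general shape of such a watcher without pulling the image, here is a minimal sketch. It is not the script from the linked image. It assumes each node already carries a hypothetical label `example.com/role` (set through --node-labels, which is still allowed outside the kubernetes.io namespace) and copies it to a node-role.kubernetes.io/ label. The names `SOURCE_LABEL`, `role_patch` and `label_nodes` are made up for illustration.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"
SOURCE_LABEL = "example.com/role"  # hypothetical label set by the kubelet


def role_patch(node_labels):
    """Return a patch body that adds the node-role label, or None if no change is needed."""
    role = node_labels.get(SOURCE_LABEL)
    if not role:
        return None
    key = ROLE_PREFIX + role
    if key in node_labels:
        return None
    return {"metadata": {"labels": {key: ""}}}


def label_nodes(events, patch_node):
    """Patch nodes seen in ADDED/MODIFIED watch events; yield each patched node name."""
    for event in events:
        if event["type"] not in ("ADDED", "MODIFIED"):
            continue
        meta = event["raw_object"]["metadata"]
        patch = role_patch(meta.get("labels") or {})
        if patch is not None:
            patch_node(meta["name"], patch)
            yield meta["name"]


# Usage inside a pod, assuming the official `kubernetes` Python client is
# installed and the service account may watch and patch nodes:
#   from kubernetes import client, config, watch
#   config.load_incluster_config()
#   v1 = client.CoreV1Api()
#   for name in label_nodes(watch.Watch().stream(v1.list_node), v1.patch_node):
#       print("labelled", name)
```

Keeping the patch decision in a pure function makes the labelling rule easy to test without a cluster.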

GKE - Upgrading cluster master after cluster creation completes

When we increase load using a JMeter client, my deployed service is interrupted, and the GCP/GKE console says:
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade:
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service interruption? If the service is interrupted, then there is no benefit to this autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster:
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the k8s version, because it works fine with a smaller load, but as I increase the load the cluster starts an upgrade of the master. So it looks like the master is resizing itself for more nodes. After the upgrade I can see more nodes in the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group.
> gcloud compute instance-groups managed list
NAME                                  AUTOSCALED  LOCATION    SCOPE  ---
ajeet-gke-cluster-default-pool-4***0  no          us-east4-b  zone   ---
Workaround
Sorry, I forgot to update it here. I found a workaround: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool

To prevent this, you should always create your cluster with the cluster version pinned to the latest available version.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be updated to the latest version, which lets Google limit the number of versions it currently manages. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
As for why you have a service interruption during the update: you are in zonal mode with only one master. To prevent this, you should use a regional cluster with more than one master, which allows a clean rolling update.

The master won't resize the node pool unless the autoscaling feature is enabled on it.
As mentioned in the answer above, this is a feature at the node-pool level. From the description of the issue, it does seem that autoscaling is enabled on your node pool, and GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run (i.e. when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine, or to rely on the autoscaling status shown by the MIG.
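Since the MIG listing is not the place to look, the node pool settings themselves can be read from the output of `gcloud container clusters describe ... --format=json`. A small sketch, assuming that JSON has been parsed into a dict; the function name `node_pool_autoscaling` is hypothetical.

```python
def node_pool_autoscaling(cluster):
    """Map each node pool name to (min, max) node counts, or None when
    cluster autoscaling is not enabled for that pool.

    cluster is the parsed output of
    `gcloud container clusters describe NAME --zone ZONE --format=json`.
    """
    summary = {}
    for pool in cluster.get("nodePools", []):
        auto = pool.get("autoscaling") or {}
        if auto.get("enabled"):
            summary[pool["name"]] = (
                auto.get("minNodeCount", 0),
                auto.get("maxNodeCount", 0),
            )
        else:
            summary[pool["name"]] = None
    return summary


# Usage (cluster.json is a placeholder file name):
#   gcloud container clusters describe ajeet-gke --zone us-east4-b --format=json > cluster.json
#   import json
#   print(node_pool_autoscaling(json.load(open("cluster.json"))))
```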

RouteController failed to create a route on GKE

I have a cluster on GKE whose node pool I create when I want to use the cluster, and delete when I'm done with it.
It's a two-node cluster with the master in europe-west2-a and node zones europe-west2-a and europe-west2-b.
The most recent creation resulted in the node in zone B failing with NetworkUnavailable because RouteController failed to create a route. The reason given was: Could not create route xxx 10.244.1.0/24 for node xxx after 342.263706ms: instance not found.
Why would this be happening all of a sudden, and what can I do to fix it?!

You didn't mention which version of GKE you are using, so just for clarification:
Changes in access scopes
Beginning with Kubernetes version 1.10, gcloud and GCP Console no longer grant the compute-rw access scope on new clusters and new node pools by default. Furthermore, if --scopes is specified in gcloud container create, gcloud no longer silently adds compute-rw or storage-ro.
In any case, you can still revert to legacy access scopes, but this is not the recommended approach.
Hope this helps.

With GKE 1.13.6-gke.13, some of the default scopes were changed, including the removal of the compute-rw scope. I think that, due to the age of the cluster, this scope was necessary for a route to be created correctly between nodes in a node pool.
In the end, my gcloud creation command had these scopes:
--scopes https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw

Updating deployment in GCE leads to node restart

We have an odd issue happening with GCE.
We have 2 clusters, dev and prod, each consisting of 2 nodes.
Production nodes are n1-standard-2, dev - n1-standard-1.
Typically the dev cluster is busier, with more pods eating more resources.
We deploy updates mostly with deployments (a few projects still recreate RCs to update to the latest versions).
Normally, the process is: build project, build docker image, docker push, create new deployment config and kubectl apply new config.
What constantly happens on production is that after applying the new config, one or both nodes restart. The cluster does not seem to be starved of memory/CPU, and we could not find anything in the logs that would explain those restarts.
The same procedure on staging never causes nodes to restart.
What can we do to diagnose the issue? Are there any specific events or logs we should be looking at?
Many thanks for any pointers.
UPDATE:
This is still happening, and I found the following in Compute Engine - Operations:
repair-1481931126173-543cefa5b6d48-9b052332-dfbf44a1
Operation type: compute.instances.repair.recreateInstance
Status message : Instance Group Manager 'projects/.../zones/europe-west1-c/instanceGroupManagers/gke-...' initiated recreateInstance on instance 'projects/.../zones/europe-west1-c/instances/...'. Reason: instance's intent is RUNNING but instance's health status is TIMEOUT.
We still can't figure out why this is happening and it's having a negative effect on our production environment every time we deploy our code.