Migrating from the Docker runtime to the containerd runtime on a specific Kubernetes node

I would like to know whether it is possible to apply the parameter --image-type COS_CONTAINERD from the gcloud command line to a single node only.
I found the command below:
gcloud container clusters upgrade CLUSTER_NAME --image-type COS_CONTAINERD [--node-pool POOL_NAME]
But it will update my entire node pool, so it will disrupt my web services.
Every node in my pool will be updated at the same time (destroyed and re-created).
Or is there another way?
I would like to know if anyone has faced the same issue. It will help me during this migration.
Thanks for sharing your experience.

The command you mention updates all the nodes within the node pool.
I would suggest creating a new node pool in your cluster with the COS_CONTAINERD image type. Once the new node pool is up, migrate your workloads to its node(s) by following this process. This will also help you manage the downtime.
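The suggestion above can be sketched roughly as follows. This is a minimal sketch, not a tested migration script: the cluster, pool and node names are placeholders, the helper functions (run, create_containerd_pool, drain_nodes, delete_pool) are hypothetical names made up for illustration, and it assumes your default zone or region is already set in the gcloud config. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: move workloads to a new COS_CONTAINERD node pool, then retire the old pool.
# All names passed to these functions are placeholders, not values from the question.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: create the replacement pool with the containerd image.
create_containerd_pool() {
  local cluster="$1" new_pool="$2"
  run gcloud container node-pools create "$new_pool" \
    --cluster "$cluster" --image-type COS_CONTAINERD
}

# Step 2: cordon every old node first, so evicted pods cannot land on
# another old node, then drain the nodes one at a time.
drain_nodes() {
  local node
  for node in "$@"; do
    run kubectl cordon "$node"
  done
  for node in "$@"; do
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  done
}

# Step 3: delete the old pool once its workloads run elsewhere.
delete_pool() {
  local cluster="$1" old_pool="$2"
  run gcloud container node-pools delete "$old_pool" --cluster "$cluster" --quiet
}
```

The node names for drain_nodes can be listed with kubectl get nodes -l cloud.google.com/gke-nodepool=OLD_POOL -o jsonpath='{.items[*].metadata.name}'. Draining one node at a time, with enough replicas and a PodDisruptionBudget in place, is what keeps the disruption small.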

Related

terraform: how to upgrade Azure AKS default nodepool VM size without replacement of the cluster

I'm trying to upgrade the VM size of my AKS cluster using this approach with Terraform. Basically I create a new nodepool with the required amount of nodes, then I cordon the old node to disallow scheduling of new pods. Next, I drain the old node to reschedule all the pods in the newly created node. Then, I proceed to upgrade the VM size.
The problem I am facing is that the azurerm_kubernetes_cluster resource allows for the creation of the default_node_pool, and another resource, azurerm_kubernetes_cluster_node_pool, allows me to create new nodepools with extra nodes.
Everything works until I create the new nodepool, cordon and drain the old one. However, when I change default_node_pool.vm_size and run terraform apply, it tells me that the whole resource has to be recreated, including the new nodepool I just created, because it's linked to the cluster id, which will be replaced.
How should I manage this upgrade from the documentation with Terraform if upgrading the default node pool always forces replacement even if a new nodepool is in place?
terraform version
terraform v1.1.7
on linux_amd64
+ provider registry.terraform.io/hashicorp/azurerm v2.82.0
+ provider registry.terraform.io/hashicorp/local v2.2.2

Update the node size of a digital ocean kubernetes cluster without replacing the whole cluster

I successfully maintain a Kubernetes cluster in DigitalOcean through Terraform. The core cluster configuration is the following:
resource "digitalocean_kubernetes_cluster" "cluster" {
  name     = var.name
  region   = var.region
  version  = var.k8s_version
  vpc_uuid = digitalocean_vpc.network.id

  node_pool {
    name       = "workers"
    size       = var.k8s_worker_size
    node_count = var.k8s_worker_count
  }
}
The problem is, I now need to increase the node size (stored in the variable k8s_worker_size).
If I simply change the variable to a new string, the terraform plan results in a full replace of the kubernetes cluster:
digitalocean_kubernetes_cluster.cluster must be replaced
This is not doable in our production environment.
The correct procedure to perform this operation inside digital ocean is to:
Create a new node pool, with the required size
Use kubectl drain to remove our pods from the 'old' nodes
Remove the previous node pool.
Of course, by doing this manually inside the digital ocean console, the terraform state is completely out-of-sync and is therefore unusable.
Is there a way to perform that operation through terraform?
As an alternative option, is it possible to "manually" update the terraform state in order to sync it with the real cluster state after I perform the migration manually?
"Is there a way to perform that operation through terraform?"
There might be some edge cases where there is a solution to this. Since I am not familiar with kubernetes inside DigitalOcean I can't share a specific solution.
"As an alternative option, is it possible to "manually" update the terraform state in order to sync it with the real cluster state after I perform the migration manually?"
Yes! Do as you proposed manually and then remove the out-of-sync cluster with
terraform state rm digitalocean_kubernetes_cluster.cluster
from the state. Please visit the corresponding documentation for state rm and update the address if your cluster is in a module etc. Then use
terraform import digitalocean_kubernetes_cluster.cluster <id of your cluster>
to reimport the cluster. Please consult the documentation on importing the cluster for the details. The documentation mentions something about tagging the default node pool.
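Put together, the re-sync could look roughly like the sketch below. The functions run and resync_state are hypothetical helpers made up for illustration, the cluster ID is a placeholder, and the final terraform plan is my own addition to check whether any diff remains after the import. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: drop the out-of-sync cluster from the state and import it again.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# address:    resource address in the state (adjust it if the cluster is in a module)
# cluster_id: ID of the existing cluster (placeholder, look it up yourself)
resync_state() {
  local address="$1" cluster_id="$2"
  run terraform state rm "$address"
  run terraform import "$address" "$cluster_id"
  # Added assumption: a plan afterwards shows whether config and state still differ.
  run terraform plan
}
```

For example: resync_state digitalocean_kubernetes_cluster.cluster "<id of your cluster>". If the plan still wants to replace the cluster, the configuration (node pool name, size, count) probably does not match what now exists.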

Kubernetes V1.16.8 doesn't support 'node-role' label using "--node-labels=node-role.kubernetes.io/master="

Upgrade Kube-aws v1.15.5 cluster to the next version 1.16.8.
Use Case:
I want to keep the same node labels for Master and Worker nodes that I'm using in v1.15.
When I tried to upgrade the cluster to v1.16, I found that --node-labels is no longer allowed to use 'node-role'.
If I keep the node role as "node-role.kubernetes.io/master", the kubelet fails to start after the upgrade. If I remove the label, the kubectl get node output shows none as the role of the upgraded node.
How do I reproduce?
Before the upgrade I took a backup with 'cp /etc/sysconfig/kubelet /etc/sysconfig/kubelet-bkup' and removed "-role" from the kubelet sysconfig. Once the upgrade was completed, I replaced the edited file with the backup using 'mv /etc/sysconfig/kubelet-bkup /etc/sysconfig/kubelet'. Now I am able to see the node role as Master/Worker even after a kubelet service restart.
The Problem I'm facing now?
I performed the upgrade on the existing cluster successfully. However, the cluster is running in AWS with the Kube-aws model, so the ASG spins up a new node whenever Cluster-Autoscaler triggers it.
But the new node fails to join the cluster, since the node label "node-role.kubernetes.io/master" exists in the code base.
How can I add the node-role dynamically in the ASG scale-out process? Any solution would be appreciated.
Note:
(Kubeadm, kubelet, kubectl )- v1.16.8
I have sorted out the issue. I wrote Python code that watches the node events. Whenever the ASG spins up a new node, the node has the role "" after it joins the cluster; the Python code then adds the appropriate label to the node dynamically.
I have also built a Docker image around the node-label Python script, and it runs as a pod. The pod is deployed into the cluster and does the job of labelling the new nodes.
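For readers who only want the idea, here is a much simpler shell sketch of the labelling step. It is not the Python watcher described above: it does not watch events, it just labels the nodes you pass in, so you would have to run it periodically or from a hook. The functions run and label_nodes are hypothetical helpers, and the node names are placeholders. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: add a node-role label after the node has joined, since the kubelet
# in v1.16 rejects node-role labels passed through --node-labels.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# role:  "master" or "worker"
# rest:  names of the nodes to label
label_nodes() {
  local role="$1"
  shift
  local node
  for node in "$@"; do
    run kubectl label node "$node" "node-role.kubernetes.io/${role}=" --overwrite
  done
}
```

Nodes that do not yet carry the label can be listed with kubectl get nodes -l '!node-role.kubernetes.io/worker' -o jsonpath='{.items[*].metadata.name}'. How to tell masters from workers among the new nodes is left out here and depends on your setup.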
Ref my solution given in GitHub
https://github.com/kubernetes/kubernetes/issues/91664
I have created as a docker image and it is publicly available
https://hub.docker.com/r/shaikjaffer/node-watcher
Thanks,
Jaffer

GKE - Upgrading cluster master after cluster creation completes

Once we increase the load using a JMeter client, my deployed service is interrupted and the GCP/GKE console says -
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade -
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service interruption? If the service is interrupted, then there is no benefit to this autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster-
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the k8s version, because it works fine with a smaller load, but as I increase the load the cluster starts upgrading the master. So it looks like the master is resizing itself for more nodes. After the upgrade I can see more nodes in the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group.
> gcloud compute instance-groups managed list
NAME                                  AUTOSCALED  LOCATION    SCOPE  ---
ajeet-gke-cluster-default-pool-4***0  no          us-east4-b  zone   ---
Workaround
Sorry, I forgot to update this here. I found a workaround: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version pinned to the latest version available.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be updated to the latest version, which lets Google limit the number of versions it currently manages. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
Now, why do you have an interruption of service during the update? Because you are in zonal mode with only one master. To prevent this, you should use a regional cluster with more than one master, which allows for a clean rolling update.
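A rough sketch of what that could look like is below. The functions run and create_regional_cluster are hypothetical helpers, the cluster name, region and version in the usage are placeholders, and the machine type and node counts are simply copied from the question. By default it only prints the command (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: create a regional cluster (replicated control plane) with a pinned version.
# DRY_RUN defaults to 1, so the command is printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# name:    cluster name (placeholder)
# region:  e.g. us-east4; --region instead of --zone makes the cluster regional
# version: a version taken from `gcloud container get-server-config`
create_regional_cluster() {
  local name="$1" region="$2" version="$3"
  # In a regional cluster, --num-nodes, --min-nodes and --max-nodes apply per zone.
  run gcloud container clusters create "$name" \
    --region "$region" --cluster-version "$version" \
    --machine-type n1-standard-8 --num-nodes 1 \
    --enable-autoscaling --min-nodes 1 --max-nodes 10
}
```

The available versions for a region can be listed with gcloud container get-server-config --region REGION. Keep in mind that the node counts are per zone, so a three-zone region starts with three nodes here.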
The master won't resize the node pool unless the autoscaling feature is enabled on it.
As mentioned in the answer above, this is a feature at the node-pool level. From the description of the issue, it does seem that autoscaling is enabled on your node pool, so GKE's cluster autoscaler automatically resizes the cluster based on the demands of the workloads you want to run (i.e. when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine, and not to rely on the autoscaling status shown by the MIG.
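To see the autoscaling settings that GKE itself uses, you can query the node pool instead of the instance group. A minimal sketch, where run and show_autoscaling are hypothetical helpers and the cluster, pool and zone in the usage are taken from the question; by default it only prints the command (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: read the cluster autoscaler settings from the node pool, not from the MIG.
# DRY_RUN defaults to 1, so the command is printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Prints whether autoscaling is enabled, plus the min and max node counts.
show_autoscaling() {
  local cluster="$1" pool="$2" zone="$3"
  run gcloud container node-pools describe "$pool" \
    --cluster "$cluster" --zone "$zone" \
    --format "value(autoscaling.enabled,autoscaling.minNodeCount,autoscaling.maxNodeCount)"
}
```

For example: show_autoscaling ajeet-gke default-pool us-east4-b. The AUTOSCALED column of gcloud compute instance-groups managed list will still say "no", which is expected.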

RouteController failed to create a route on GKE

I have a cluster on GKE whose node pool I create when I want to use the cluster, and delete when I'm done with it.
It's a two-node cluster with the master in europe-west2-a and whose node zones are europe-west2-a and europe-west2-b.
The most recent creation resulted in the node in zone B failing with NetworkUnavailable because RouteController failed to create a route. The reason given was: Could not create route xxx 10.244.1.0/24 for node xxx after 342.263706ms: instance not found.
Why would this be happening all of a sudden, and what can I do to fix it?!
You didn't mention which version of GKE you are using so just for clarification:
Changes in access scopes
Beginning with Kubernetes version 1.10, gcloud and GCP Console no longer grants the compute-rw access scope on new clusters and new node pools by default. Furthermore, if --scopes is specified in gcloud container create, gcloud no longer silently adds compute-rw or storage-ro.
In any case, you can still revert to legacy access scopes, but this is not the recommended approach.
Hope this helps.
With gke 1.13.6-gke.13, some of the default scopes were changed, including the compute-rw scope being removed. I think that due to the age of the cluster, this scope was necessary for a route to be correctly created between nodes in a node pool.
In the end, my gcloud creation command had these scopes:
--scopes https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw
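For completeness, a sketch of how those scopes could be passed when the node pool is re-created, and how to check them afterwards. The functions run, create_pool_with_scopes and show_scopes are hypothetical helpers, the cluster and pool names are placeholders, and the rest of my real creation command is left out. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch: create a node pool with explicit scopes (including compute-rw) and inspect them.
# DRY_RUN defaults to 1, so commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Scope list copied from the command above.
SCOPES="https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw"

create_pool_with_scopes() {
  local cluster="$1" pool="$2"
  run gcloud container node-pools create "$pool" --cluster "$cluster" --scopes "$SCOPES"
}

# Lists the OAuth scopes the pool's nodes actually received.
show_scopes() {
  local cluster="$1" pool="$2"
  run gcloud container node-pools describe "$pool" --cluster "$cluster" \
    --format "value(config.oauthScopes)"
}
```

Scopes are fixed when a node pool is created, so an existing pool has to be re-created to change them.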