GKE nodes unexpectedly deleted and recreated

I created a cluster on Google Kubernetes Engine. The nodes get deleted/created very often (at least once a day). Even though new instances are created to replace them, and pods are moved to these new nodes, I would like to understand why the nodes disappear.
I checked the settings used to create the cluster and the node pool:
"Automatic node upgrade" is Disabled on the node pool.
"Pre-emptible nodes" is Disabled.
"Automatic node repair" is Enabled, but it doesn't look like there was a node repair, since I don't see anything in gcloud container operations list at the time my nodes were deleted.
I can see that the current nodes were all (re-)created at 21:00, while the cluster was created at 08:35:
➜ ~ gcloud container clusters describe my-cluster --format=json
{
"createTime": "2019-04-11T08:35:39+00:00",
...
"nodePools": [
{
...
"management": {
"autoRepair": true
},
"name": "default-pool",
}
],
"status": "RUNNING",
...
}
How can I trace the reason why the nodes were deleted?

I tried to reproduce your problem by creating a cluster, manually stopping the kubelet on a node (by running systemctl stop kubelet) to trigger repair and watching the node recover. In my case, I do see an operation for the auto node repair, but I can also see in the GCE operations log that the VM was deleted and recreated (by the GKE robot account).
If you run gcloud compute operations list (or check the cloud console page for operations) you should see what caused the VM to be deleted and recreated.
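As a rough aid for reading that log, here is a minimal Python sketch that filters the JSON output of gcloud compute operations list --format=json down to instance deletions, insertions, and repair recreations, oldest first. It assumes each operation object carries the operationType, targetLink, insertTime, user, and statusMessage fields of the Compute Engine Operation resource; the set of operation types worth flagging is an assumption you may need to extend.

```python
import json

# Operation types that indicate a VM was removed or rebuilt. The repair type is
# the one quoted in a GCE operations log elsewhere in this thread; "delete" and
# "insert" are the plain instance operation types. Extend as needed (assumption).
SUSPECT_TYPES = {"compute.instances.repair.recreateInstance", "delete", "insert"}


def vm_lifecycle_ops(operations_json: str) -> list[dict]:
    """Return instance delete/insert/recreate operations, oldest first."""
    hits = []
    for op in json.loads(operations_json):
        target = op.get("targetLink", "")
        if "/instances/" not in target:
            continue  # skip disks, firewalls, instance groups, etc.
        if op.get("operationType") in SUSPECT_TYPES:
            hits.append({
                "time": op.get("insertTime", ""),
                "type": op.get("operationType"),
                "instance": target.rsplit("/", 1)[-1],
                "user": op.get("user", ""),  # e.g. the GKE robot account
                "status": op.get("statusMessage", ""),
            })
    # ISO 8601 strings sort chronologically when they share one UTC offset (assumed).
    return sorted(hits, key=lambda h: h["time"])
```

For example, save the output with gcloud compute operations list --format=json > ops.json and pass the file's contents to vm_lifecycle_ops; the user field then shows which account issued each deletion or recreation.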

This just happened to me on Sunday, 13/10/2019.
All data on the stateful partition was also gone.

Related

Kubernetes V1.16.8 doesn't support 'node-role' label using "--node-labels=node-role.kubernetes.io/master="

Upgrading a kube-aws v1.15.5 cluster to the next version, 1.16.8.
Use Case:
I want to keep the same node labels for master and worker nodes that I'm using in v1.15.
When I tried to upgrade the cluster to v1.16, --node-labels was no longer allowed to set 'node-role' labels.
If I keep the node role as "node-role.kubernetes.io/master", the kubelet fails to start after the upgrade. If I remove the label, the kubectl get node output shows <none> as the role of the upgraded node.
What I did:
Before the upgrade, I took a backup ('cp /etc/sysconfig/kubelet /etc/sysconfig/kubelet-bkup') and removed "-role" from it. Once the upgrade completed, I replaced the kubelet sysconfig with the edited file ('mv /etc/sysconfig/kubelet-bkup /etc/sysconfig/kubelet'). Now I can see the node role as Master/Worker, even after a kubelet service restart.
The problem I'm facing now:
Although the upgrade of the existing cluster succeeded, the cluster runs in AWS using the kube-aws model, so the ASG spins up a new node whenever the Cluster Autoscaler triggers it.
But the new node fails to join the cluster because the node label "node-role.kubernetes.io/master" still exists in the code base.
How can I add the node role dynamically when the ASG scales? Any solution would be appreciated.
Note:
kubeadm, kubelet, kubectl: v1.16.8
I have sorted out the issue. I wrote Python code that watches node events: whenever the ASG spins up a new node and it joins the cluster, the node has an empty role (""), and the Python code then adds the appropriate label to the node dynamically.
I also built a Docker image based on the Python node-labelling script, which runs as a pod. The pod is deployed into the cluster and does the job of labelling the new nodes.
See my solution on GitHub:
https://github.com/kubernetes/kubernetes/issues/91664
The Docker image is publicly available:
https://hub.docker.com/r/shaikjaffer/node-watcher
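A minimal sketch of the labelling decision itself, in Python. It only computes the patch body; a real watcher would send it to the API server, for example with the official kubernetes Python client's CoreV1Api().patch_node(name, body) inside a watch.Watch().stream(...) loop over list_node. The answer does not say how masters are told apart from workers, so guess_role below uses a HYPOTHETICAL rule (a node-role.kubernetes.io/master taint); substitute whatever actually distinguishes your nodes.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"


def has_role(node: dict) -> bool:
    """True if any label already uses the node-role.kubernetes.io/ prefix."""
    labels = node.get("metadata", {}).get("labels") or {}
    return any(key.startswith(ROLE_PREFIX) for key in labels)


def guess_role(node: dict) -> str:
    """HYPOTHETICAL rule: a node tainted with the master key is a master.

    Replace with what really distinguishes your nodes (ASG name, tags, ...).
    """
    taints = node.get("spec", {}).get("taints") or []
    if any(t.get("key") == ROLE_PREFIX + "master" for t in taints):
        return "master"
    return "worker"


def role_patch(node: dict):
    """Return a patch body adding the role label, or None if already labelled."""
    if has_role(node):
        return None
    # kubectl get nodes derives the ROLES column from this label's key suffix.
    return {"metadata": {"labels": {ROLE_PREFIX + guess_role(node): ""}}}
```

Because the pod, not the kubelet, applies the label, the kubelet's --node-labels restriction does not apply; the pod's service account does need RBAC permission to patch nodes.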
Thanks,
Jaffer

GKE - Upgrading cluster master after cluster creation completes

When we increase the load using a JMeter client, my deployed service is interrupted, and the GCP/GKE console says:
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade:
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent the service interruption? If the service is interrupted, there is no benefit to autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster:
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the Kubernetes version: it works fine under a smaller load, but as I increase the load, the cluster starts upgrading the master. So it looks like the master is resizing itself to handle more nodes. After the upgrade, I can see more nodes in the GCP console. See also: https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group:
> gcloud compute instance-groups managed list
NAME                                  AUTOSCALED  LOCATION    SCOPE  ---
ajeet-gke-cluster-default-pool-4***0  no          us-east4-b  zone   ---
Workaround
Sorry, I forgot to update this here. I found a workaround: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version hardcoded to the latest available version.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be upgraded to the latest version, which allows Google to limit the number of versions it currently manages.
Now, why do you get a service interruption during the update? Because you are in zonal mode with only one master. To prevent this, you should use regional cluster mode, which has more than one master and allows a clean rolling update: https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
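Pinning the version as suggested above can be done without typing the version by hand. A minimal Python sketch, assuming the JSON printed by gcloud container get-server-config --format=json (run with your --zone) contains a validMasterVersions list, as the GKE server config does; the chosen value would then be passed to --cluster-version on gcloud container clusters create.

```python
import json
import re


def version_key(version: str) -> tuple:
    """Turn "1.12.7-gke.10" into (1, 12, 7, 10) so versions compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def latest_master_version(server_config_json: str) -> str:
    """Pick the newest entry of validMasterVersions from get-server-config output."""
    config = json.loads(server_config_json)
    return max(config["validMasterVersions"], key=version_key)
```

Comparing numeric tuples rather than raw strings matters here: as strings, "1.9.7" would sort after "1.12.7".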
The master won't add or remove nodes unless the autoscaling feature is enabled on the node pool.
As mentioned in the answer above, this is a feature at the node-pool level. Judging by the description of the issue, it does seem that autoscaling is enabled on your node pool, and GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run (i.e., when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine (nor to rely on the autoscaling status shown by the MIG).
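To make the scale-up trigger described in this answer concrete, here is a deliberately simplified Python model of how many nodes a burst of unschedulable pods might add. It is NOT the cluster autoscaler's real algorithm (which simulates the scheduler and also weighs memory, affinity, taints, and system overhead); it only bin-packs CPU requests, in millicores, first-fit-decreasing onto hypothetical new nodes and caps the result at --max-nodes.

```python
def nodes_to_add(pending_millicores, node_allocatable_millicores, current_nodes, max_nodes):
    """Estimate new nodes needed for pending pods' CPU requests, capped by max_nodes.

    Simplified first-fit-decreasing bin packing on CPU only (an assumption,
    not the autoscaler's actual logic).
    """
    free_per_new_node = []  # remaining CPU on each hypothetical new node
    for request in sorted(pending_millicores, reverse=True):
        if request > node_allocatable_millicores:
            raise ValueError(f"a pod requesting {request}m can never fit on one node")
        for i, free in enumerate(free_per_new_node):
            if request <= free:
                free_per_new_node[i] = free - request
                break
        else:
            free_per_new_node.append(node_allocatable_millicores - request)
    headroom = max(max_nodes - current_nodes, 0)
    return min(len(free_per_new_node), headroom)
```

With the question's --min-nodes 4 --max-nodes 16 and an illustrative 7900m of allocatable CPU per n1-standard-8 node, ten pending pods requesting 2000m each give nodes_to_add([2000] * 10, 7900, 4, 16) == 4: three such pods fit per node.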

Kubernetes Deployment update crashes ReplicaSet and creates too many Pods

Using Kubernetes, I deploy an app to Google Cloud Container Engine on a cluster with 3 small instances.
On a first-time deploy, all goes well using:
kubectl create -f deployment.yaml
And:
kubectl create -f service.yaml
Then I change the image in my deployment.yaml and update it like so:
kubectl apply -f deployment.yaml
After the update, a couple of things happen:
Kubernetes updates its Pods correctly, ending up with 3 updated instances.
Shortly after this, another ReplicaSet is created (?)
Also, double the number of Pods (2 * 3 = 6) are suddenly present, half of them with a status of Running and the other half Unknown.
So I inspected my Pods and came across this error:
FailedSync Error syncing pod, skipping: network is not ready: [Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Also, I can no longer use the dashboard via kubectl proxy. The page shows:
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "no endpoints available for service \"kubernetes-dashboard\"",
"reason": "ServiceUnavailable",
"code": 503
}
So I decided to delete all pods forcefully:
kubectl delete pod <pod-name> --grace-period=0 --force
Then, three Pods are triggered for creation, since this is defined in my service.yaml. But upon inspecting my Pods using kubectl describe pods/<pod-name>, I see:
no nodes available to schedule pods
I have no idea where this all went wrong. In essence, all I did was update the image of a deployment.
Any ideas?
I've run into similar issues on Kubernetes. According to your reply to my comment on your question:
I noticed that this happens only when I deploy to a micro instance on Google Cloud, which simply has insufficient resources to handle the deployment. Scaling up the initial resources (CPU, Memory) resolved my issue
It seems to me that what's happening here is that the Linux kernel's OOM killer ends up killing the kubelet, which in turn makes the node useless to the cluster (its status becomes "Unknown").
A real solution to this problem (one that prevents an entire node from dropping out of service) is to add resource limits. Make sure you're not just adding requests; add limits, because you want your services, rather than Kubernetes system services, to be killed so that they can be rescheduled appropriately (if possible).
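To audit a workload for this, here is a minimal Python sketch that lists containers without a memory limit and flags whether the memory limits on a node add up to more than its allocatable memory. It understands only a subset of Kubernetes quantity suffixes (Ki, Mi, Gi, K, M, G and plain bytes), and the container dicts mirror the containers[].resources shape of a pod spec; the allocatable value is something you would read from the node yourself.

```python
# Subset of Kubernetes quantity suffixes; others (Ti, E, m, exponents) are not handled.
MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "K": 10**3, "M": 10**6, "G": 10**9}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain bytes


def memory_report(containers: list, node_allocatable: str) -> dict:
    """Report containers lacking a memory limit and whether limits overcommit the node."""
    missing, total_limits = [], 0
    for c in containers:
        limit = c.get("resources", {}).get("limits", {}).get("memory")
        if limit is None:
            missing.append(c["name"])  # can grow until the node itself runs short
        else:
            total_limits += parse_memory(limit)
    return {
        "missing_limits": missing,
        "limits_bytes": total_limits,
        "overcommitted": total_limits > parse_memory(node_allocatable),
    }
```

An empty missing_limits list and overcommitted set to False is the state this answer is aiming for on small instances.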
Also, in the cluster settings (specifically in the node pool; select it from https://console.cloud.google.com/kubernetes/list), there is a box you can check for "Automatic Node Repair" that would at least partially remediate this problem rather than leaving you with an undefined amount of downtime.
If your intention is just to update the image, try using kubectl set image instead. That at least works for me.
Searching for kubectl apply turns up a lot of known issues. See this issue, for example, or this one.
You did not mention which version of Kubernetes you deployed, but if you can, try upgrading your cluster to the latest version to see whether the issue persists.

How to reduce nodes(vm) running in a Kubernetes cluster of GKE gracefully?

I'm wondering what the graceful way is to reduce the number of nodes in a Kubernetes cluster on GKE.
I have some nodes, each of which runs some pods that watch a shared job queue and execute jobs. I also have a script that monitors the length of the job queue and increases the number of instances when the length exceeds a threshold, by running the gcloud compute instance-groups managed resize command. That works fine.
But I don't know a graceful way to reduce the number of instances when the length falls below the threshold.
Is there a good way to stop the pods working on a terminating instance before the instance gets terminated? Or any other good practice?
Note
Each job takes roughly between 30 minutes and 1 hour.
It is acceptable if a job gets executed more than once (in the worst case...)
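One way to make the scale-down side of such a script gentler, sketched in Python under stated assumptions: use a lower threshold for scaling down than for scaling up (hysteresis, so the group does not flap), and only offer nodes with no running job for removal. The per-node running_jobs count is HYPOTHETICAL; the script would have to obtain it itself, for example from the queue or from the pods on each node. The chosen idle instances could then be removed with gcloud compute instance-groups managed delete-instances, which also lowers the group's target size, instead of a plain resize that lets the group pick which instances go.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_jobs: int  # HYPOTHETICAL metric: jobs currently executing on this node


def plan_resize(queue_length, nodes, scale_up_at, scale_down_at, min_nodes, max_nodes):
    """Return (target_size, names_of_nodes_to_remove) for one monitoring tick."""
    if scale_down_at >= scale_up_at:
        raise ValueError("scale_down_at must be below scale_up_at (hysteresis)")
    size = len(nodes)
    if queue_length > scale_up_at and size < max_nodes:
        return size + 1, []  # grow by one node per tick
    if queue_length < scale_down_at and size > min_nodes:
        # Only idle nodes are candidates, so no 30-60 minute job is cut short.
        idle = [n.name for n in nodes if n.running_jobs == 0]
        removable = idle[: size - min_nodes]
        return size - len(removable), removable
    return size, []  # between thresholds: leave the group alone
```

A job can still start on an "idle" node between this check and the deletion, so the worker pods on a node selected for removal should stop pulling from the queue first; since re-running a job is acceptable here, that race is a cost rather than a correctness problem.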
I think the best approach is to use the Kubernetes Job object instead of a plain pod to run your tasks. That way, when the task is completed, the Job terminates the container. You would only need a small pod that creates Kubernetes Jobs based on the queue.
The more Jobs that get created, the more resources are consumed, and the cluster autoscaler will see that it needs to add more nodes. A Job has to run to completion: if its pod gets terminated, it will be rescheduled so that the task still completes.
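A minimal Python sketch of the small dispatcher pod's core step: turning one queue item into a batch/v1 Job manifest. The naming scheme and the TASK_ID environment variable are HYPOTHETICAL. The optional cluster-autoscaler.kubernetes.io/safe-to-evict: "false" pod annotation, documented by the cluster autoscaler project, asks the autoscaler not to remove a node while that pod is running. The manifest could be submitted with the official Python client's BatchV1Api().create_namespaced_job(namespace, body).

```python
def job_for_task(task_id: str, image: str, namespace: str = "default") -> dict:
    """Build a batch/v1 Job manifest that processes one queue item.

    Assumes task ids are short alphanumerics; Job names must be valid
    lowercase DNS labels, so underscores are swapped for hyphens.
    """
    safe_id = task_id.lower().replace("_", "-")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"queue-task-{safe_id}", "namespace": namespace},
        "spec": {
            "backoffLimit": 3,  # failed pods tolerated before the Job gives up
            "template": {
                "metadata": {
                    # Optional: keep the autoscaler from draining this pod's node.
                    "annotations": {"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"},
                },
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "env": [{"name": "TASK_ID", "value": task_id}],  # HYPOTHETICAL
                    }],
                },
            },
        },
    }
```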
There is no direct information in the GKE docs about whether a downsize will happen while a Job is running on the node, but the rule seems to be that if a pod can easily be moved to another node and the node's resources are under-utilized, the autoscaler will drain the node.
References
https://cloud.google.com/container-engine/docs/cluster-autoscaler
http://kubernetes.io/docs/user-guide/kubectl/kubectl_drain/
http://kubernetes.io/docs/user-guide/jobs/
Before resizing the cluster, set the project context in Cloud Shell by running the following commands:
gcloud config set project [PROJECT_ID]
gcloud config set compute/zone [COMPUTE_ZONE]
gcloud config set compute/region [COMPUTE_REGION]
gcloud components update
Note: You can also pass the project, compute zone, and region to the command below with the --project, --zone, and --region flags.
gcloud container clusters resize [CLUSTER_NAME] --node-pool [POOL_NAME] --num-nodes [NUM_NODES]
Run the above command for each node pool. You can omit the --node-pool flag if you have only one node pool.
Reference: https://cloud.google.com/kubernetes-engine/docs/how-to/resizing-a-cluster
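To run that resize for several node pools from a script, here is a small Python sketch that builds one gcloud container clusters resize argument list per pool, which could then be executed in turn with subprocess.run(cmd, check=True). The cluster name, pool names, and sizes are hypothetical; --quiet is the global gcloud flag that skips the confirmation prompt.

```python
def resize_commands(cluster: str, pool_sizes: dict, zone: str = None) -> list:
    """Build one `gcloud container clusters resize` argv list per node pool."""
    commands = []
    for pool, size in sorted(pool_sizes.items()):  # stable order for review/logging
        cmd = ["gcloud", "container", "clusters", "resize", cluster,
               "--node-pool", pool, "--num-nodes", str(size), "--quiet"]
        if zone:
            cmd += ["--zone", zone]  # omit if compute/zone is already configured
        commands.append(cmd)
    return commands
```

Building argument lists rather than shell strings avoids quoting problems when pool names come from elsewhere.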

Updating deployment in GCE leads to node restart

We have an odd issue happening with GCE.
We have 2 clusters dev and prod each consisting of 2 nodes.
Production nodes are n1-standard-2, dev - n1-standard-1.
Typically dev cluster is busier with more pods eating more resources.
We deploy updates mostly with Deployments (a few projects still recreate RCs to update to the latest versions).
Normally, the process is: build project, build docker image, docker push, create new deployment config and kubectl apply new config.
What keeps happening in production is that after applying the new config, one or both nodes restart. The cluster does not seem to be starved of memory or CPU, and we could not find anything in the logs that would explain those restarts.
The same procedure on staging never causes nodes to restart.
What can we do to diagnose the issue? Are there any specific events or logs we should be looking at?
Many thanks for any pointers.
UPDATE:
This is still happening, and I found the following in Compute Engine > Operations:
repair-1481931126173-543cefa5b6d48-9b052332-dfbf44a1
Operation type: compute.instances.repair.recreateInstance
Status message : Instance Group Manager 'projects/.../zones/europe-west1-c/instanceGroupManagers/gke-...' initiated recreateInstance on instance 'projects/.../zones/europe-west1-c/instances/...'. Reason: instance's intent is RUNNING but instance's health status is TIMEOUT.
We still can't figure out why this is happening and it's having a negative effect on our production environment every time we deploy our code.
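To correlate such repairs with deploy times across many operations, a small Python sketch can pull the instance name and the stated reason out of each status message. The regular expression is an assumption based only on the single message quoted above; other repair messages may be worded differently and will simply return None.

```python
import re

# Pattern derived from the one recreateInstance message quoted above (assumption).
REPAIR_RE = re.compile(
    r"initiated recreateInstance on instance '(?P<instance>[^']+)'\. Reason: (?P<reason>.+)$"
)


def repair_reason(status_message: str):
    """Extract (instance_name, reason) from a recreateInstance status message, or None."""
    m = REPAIR_RE.search(status_message)
    if not m:
        return None
    instance = m.group("instance").rsplit("/", 1)[-1]  # keep the last path segment
    return instance, m.group("reason").rstrip(".")
```

Feeding it the statusMessage of each compute.instances.repair.recreateInstance operation and lining the results up against deploy timestamps would show whether every TIMEOUT repair follows a kubectl apply.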