How to reduce nodes(vm) running in a Kubernetes cluster of GKE gracefully? - kubernetes

I'm wondering the graceful way to reduce nodes in a Kubernetes cluster on GKE.
I have some nodes each of which has some pods watching a shared job queue and executing a job. I also have the script which monitors the length of the job queue and increase the number of instances when the length exceeds a threshold by executing gcloud compute instance-groups managed resize command and it works ok.
But I don't know the graceful way to reduce the number of instances when the length falls below the threshold.
Is there any good way to stop the pods working on the terminating instance before the instance gets terminated? or any other good practice?
Note
Each job can take around between 30m and 1h
It is acceptable if a job gets executed more than once (in the worst case...)

I think the best approach is instead of using a pod to run your tasks, use the kubernetes job object. That way when the task is completed the job terminates the container. You would only need a small pod that could initiate kubernetes jobs based on the queue.
The more kube jobs that get created, the more resources will be consumed and the cluster auto-scaler will see that it needs to add more nodes. A kube job will need to complete even if it gets terminated, it will get re-scheduled to complete.
There is no direct information in the GKE docs about whether a downsize will happen if a Job is running on the node, but the stipulation seems to be if a pod can be easily moved to another node and the resources are under-utilized it will drain the node.
Refrences
https://cloud.google.com/container-engine/docs/cluster-autoscaler
http://kubernetes.io/docs/user-guide/kubectl/kubectl_drain/
http://kubernetes.io/docs/user-guide/jobs/

Before resizing the cluster, let's set the project context in the cloud shell by running the below commands:
gcloud config set project [PROJECT_ID]
gcloud config set compute/zone [COMPUTE_ZONE]
gcloud config set compute/region [COMPUTE_REGION]
gcloud components update
Note: You can also set project, compute zone & region as flags in the below command using --project, --zone, and --region operational flags
gcloud container clusters resize [CLUSTER_NAME] --node-pool [POOL_NAME] --num-nodes [NUM_NODES]
Run the above command for each node pool. You can omit the --node-pool flag if you have only one node pool.
Reference: https://cloud.google.com/kubernetes-engine/docs/how-to/resizing-a-cluster

Related

Job still running even when deleting nodes

I created a two nodes clusters and I created a new job using the busybox image that sleeps for 300 secs. I checked on which node this job is running using
kubectl get pods -o wide
I deleted the node but surprisingly the job was still finishing to run on the same node. Any idea if this is a normal behavior? If not how can I fix it?
Jobs aren't scheduled or running on nodes. The role of a job is just to define a policy by making sure that a pod with certain specifications exists and ensure that it runs till the completion of the task whether it completed successfully or not.
When you create a job, you are declaring a policy that the built-in job-controller will see and will create a pod for. Then the built-in kube-scheduler will see this pod without a node and patch the pod to it with a node's identity. The kubelet will see a pod with a node matching it's own identity and hence a container will be started. As the container will be still running, the control-plane will know that the node and the pod still exist.
There are two ways of breaking a node, one with a drain and the second without a drain. The process of breaking a node without draining is identical to a network cut or a server crash. The api-server will keep the node resource for a while, but it 'll cease being Ready. The pods will be then terminated slowly. However, when you drain a node, it looks as if you are preventing new pods from scheduling on to the node and deleting the pods using kubectl delete pod.
In both ways, the pods will be deleted and you will be having a job that hasn't run to completion and doesn't have a pod, therefore job-controller will make a new pod for the job and the job's failed-attempts will be increased by 1, and the loop will start over again.

Designing K8 pod and proceses for initialization

I have a problem statement where in there is a Kubernetes cluster and I have some pods running on it.
Now, I want some functions/processes to run once per deployment, independent of number of replicas.
These processes use the same image like the image in deployment yaml.
I cannot use initcontainers and sidecars, because they will run along with main container on pod for each replica.
I tried to create a new image and then a pod out of it. But this pod keeps on running, which is not good for cluster resource, as it should be destroyed after it has done its job. Also, the main container depends on the completion on this process, in order to run the "command" part of K8 spec.
Looking for suggestions on how to tackle this?
Theoretically, You could write an admission controller webhook for intercepting create/update deployments and triggering your functions as you want. If your functions need to be checked, use ValidatingWebhookConfiguration for validating the process and then deny or accept commands.

How to restart master node in kubernetes

I have a kubernetes cluster with 3 masters and 3 workers, I want to restart one of the masters to update the system of the master machine.
So can I just reboot the machine directly on the console with reboot,
or some steps need to be done before the reboot to void the risk of out of service and data loss?
If you need to reboot a node (such as for a kernel upgrade, libc upgrade, hardware repair, etc.), and the downtime is brief, then when the Kubelet restarts, it will attempt to restart the pods scheduled to it. If the reboot takes longer (the default time is 5 minutes, controlled by --pod-eviction-timeout on the controller-manager), then the node controller will terminate the pods that are bound to the unavailable node. If there is a corresponding replica set (or replication controller), then a new copy of the pod will be started on a different node. So, in the case where all pods are replicated, upgrades can be done without special coordination, assuming that not all nodes will go down at the same time
If you want more control over the upgrading process, you may use the following workflow:
Use kubectl drain to gracefully terminate all pods on the node while marking the node as unschedulable:
kubectl drain $NODENAME
This keeps new pods from landing on the node while you are trying to get them off.
For pods with a replica set, the pod will be replaced by a new pod which will be scheduled to a new node. Additionally, if the pod is part of a service, then clients will automatically be redirected to the new pod.
For pods with no replica set, you need to bring up a new copy of the pod, and assuming it is not part of a service, redirect clients to it.
Perform maintenance work on the node.
Make the node schedulable again:
kubectl uncordon $NODENAME
Additionally if the node is hosting ETCD then you need to be extra careful in terms of rolling upgrade of ETCD and backing up the data
Take a backup of the ETCD if it's hosting the ETCD. You can use the in-built command to backup the data like
ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /tmp/snapshot-pre-boot.db
Now drain the node using
kubectl drain <master01>
Do the System update | patches and reboot.
Now uncordon the node back to the cluster
kubectl uncordon <master01>
Whenever you wish to reboot OS on the particular Node(Master, worker), K8s cluster engine does not aware for that action and it keeps all the cluster related events in ETCD key value storage, backing up the most recent data. As soon as you wish carefully prepare cluster Node reboot, you might have to adjust Maintenance job on this Node in order to drain it from scheduling and gracefully terminate all the existing Pods.
If you compose any relevant K8s resource within defined set of replicas, then ReplicationController guarantees that a specified number of pod replicas are running at any one time through each available Node. It simply re-spawns Pods if they failed health check, deleted or terminated, matching desired replicas. In case of Master nodes which host ETCDs you need to be extra careful in terms of rolling upgrade of ETCD and backing up the data.
1. Backup a single master
As mentioned previously, we need to backup etcd. In addition to that, we need the certificates and
optionally the kubeadm configuration file for easily restoring the
master. If you set up your cluster using kubeadm (with no special
configuration) you can do it similar to this:
Backup certificates:
$ sudo cp -r /etc/kubernetes/pki backup/
Make etcd snapshot:
$ sudo docker run --rm -v $(pwd)/backup:/backup \
--network host \
-v /etc/kubernetes/pki/etcd:/etc/kubernetes/pki/etcd \
--env ETCDCTL_API=3 \
k8s.gcr.io/etcd-amd64:3.2.18 \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
snapshot save /backup/etcd-snapshot-latest.db
Backup kubeadm-config:
$ sudo cp /etc/kubeadm/kubeadm-config.yaml backup/
Note that the contents of the backup folder should then be stored somewhere safe, where it can survive if the master is completely destroyed. You perhaps want to use e.g. AWS S3 (or similar) for this.
There are three commands in the example and all of them should be run on the master node. The first one copies the folder containing all the certificates that kubeadm creates. These certificates are used for secure communications between the various components in a Kubernetes cluster. The final command is optional and only relevant if you use a configuration file for kubeadm. Storing this file makes it easy to initialize the master with the exact same configuration as before when restoring it.
If master update went wrong you can then simply restore old version of master node.
You can also automate etcd backups.
Doing a single backup manually may be a good first step but you really need to make regular backups for them to be useful. The easiest way to do this is probably to take the commands from the example above, create a small script and a cron job that runs the script every now and then. But since we are running Kubernetes anyway, use a Kubernetes CronJob. This would allow you to keep track of the backup jobs inside Kubernetes just like you monitor your workloads.
More information you can find here: backups-kubernetes.
2. Next step is to mark a node unschedulable, run this command:
$ kubectl drain $NODENAME
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
3. Execute the system update or patch and reboot.
4. Finally uncordon the node back to the cluster, execute command below:
$ kubectl uncordon $NODENAME
On GCP there is option such auto-upgrading nodes which improve managing node updates.
About maintenance Kubernetes nodes's you can read here: node-maintenace.

GKE - Upgrading cluster master after cluster creation completes

Once we increase load by using JMeter client than my deployed service is interrupted and on GCP/GKE console it says that -
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throw this error during upgrade -
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service interruption ? If service will be intrupted than there is no benefit of this auto scaling. I am new to GKE, please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster-
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading k8s version. Because it works fine with smaller load but as I increase load than cluster starts upgrade of master. So it looks the master is resizing itself for more nodes. After upgrade I can see more nodes on GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
Below command says auto scaling is not enabled on instance group.
> gcloud compute instance-groups managed list
NAME AUTOSCALED LOCATION SCOPE ---
ajeet-gke-cluster- no us-east4-b zone ---
default-pool-4***0
Workaround
Sorry forget to update it here, I found a workaround to fix it - after splitting cluster creation command in to two steps cluster is auto scaling without restarting master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this you should always create your cluster with hardcoded cluster version to the last version available.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Goolge is managing the master, meaning that if your master is not up to date it will be updated to be in the last version and allow google to limit the number of version currently managed. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
Now why do you have an interruption of service during the update: because you are in zonal mode with only one master, to prevent this you should go in regional cluster mode with more than one master, allowing for clean rolling update.
The master won't resize the node, unless the autoscaling feature is enabled in it.
As mentioned in above answer, this is a feature at the node-pool level. By looking at description of the issue, it does seems like 'autoscaling' is enabled on your node-pool and eventually a GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run(ie when there are pods that are not able to be scheduled due to resource shortages such as CPU).
Additionaly, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore, highly recommended not use(or rely on the autoscaling status showed by MIG) Compute Engine's autoscaling feature on instance groups created by Kubernetes Engine.

Suspending a container in a kubernetes pod

I would like to suspend the main process in a docker container running in a kubernetes pod. I have attempted to do this by running
kubectl exec <pod-name> -c <container-name> kill -STOP 1
but the signal will not stop the container. Investigating other approaches, it looks like docker stop --signal=SIGSTOP or docker pause might work. However, as far as I know, kubectl exec always runs in the context of a container, and these commands would need to be run in the pod outside the context of the container. Does kubectl's interface allow for anything like this? Might I achieve this behavior through a call to the underlying kubernetes API?
You could set the replicaset to 0 which would set the number of working deployments to 0. This isn't quite a Pause but it does Stop the deployment until you set the number of deployments to >0.
kubectl scale --replicas=0 deployment/<pod name> --namespace=<namespace>
So kubernetes does not support suspending pods because it's a VM kinda behavior, and since starting a new one is cheaper it just schedules a new pod in case of failure. In effect your pods should be stateless. And any application that needs to store state, should have a persistent volume mounted inside the pod.
The simple mechanics(and general behavior) of Kubernetes is if the process inside the contaiener fails kuberentes will restart it by creating a new pod.
If you also comment what you are trying to achieve as an end goal I think I can help you better.