Parallel deletion of GKE nodes - kubernetes

I am trying to run a job on GKE for 5 minutes across 50 nodes. However, when I scale the instances down, it happens sequentially, which ends up costing me much more for a 4-5 minute job.
Is there any way to delete GKE instances in parallel?

A GKE cluster has an underlying Instance Group.
I was able to delete the nodes in parallel by directly changing the number of nodes in the Instance Group from 50 to 5.
All nodes were deleted within 30 seconds, and GKE automatically updated the cluster size with the new value.
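For reference, a minimal sketch of that direct resize with gcloud, assuming the managed instance group backing the node pool is named [INSTANCE_GROUP_NAME]:
$ gcloud compute instance-groups managed resize [INSTANCE_GROUP_NAME] --size 5 --zone [ZONE]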

If you are already using this command to scale down your cluster:
$ gcloud container clusters resize [CLUSTER_NAME] --node-pool [NODE_POOL] --size [SIZE]
I do not believe there are other options to speed up this process.
On the other hand, if you are relying on the autoscaler, you could check whether the resize is faster with that command and whether it lets you meet your requirements. However, keep in mind that Kubernetes was not designed as an infrastructure where you spin up and add 50 nodes extremely fast and then kill them when you are done.
Consider also this note from the documentation:
The kubectl drain command should only be issued to a single node at a time. However, you can run multiple kubectl drain commands for different nodes in parallel, in different terminals or in the background. Multiple drain commands running concurrently will still respect the PodDisruptionBudget you specify.
So, according to the documentation, each drain (an essential phase of removing a node) should target a single node at a time, which is why mass removal of nodes this way is discouraged.
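For completeness, if you do choose to drain several nodes concurrently as the quoted passage allows, a minimal shell sketch (the node names are placeholders) would be:
for node in gke-node-1 gke-node-2 gke-node-3; do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data &
done
wait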

Related

kubernetes - use-case to choose deployment over daemon-set

Normally, when we scale up an application, we do not deploy more than one pod of the same service on the same node. Using a DaemonSet, we can make sure we have our service on every node, which makes pods very easy to manage when scaling nodes up and down. If I use a Deployment instead, there can be trouble when scaling: there may be multiple pods on the same node, and a new node may have no pod at all.
I want to know the use-case where deployment will be more suitable than daemon-set.
Your cluster runs dozens of services, and therefore runs hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack into a single node, and you wouldn't be able to independently scale components.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
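As a rough sketch of that combination, a HorizontalPodAutoscaler targeting a hypothetical Deployment named web could look like this (the thresholds are assumptions, and the exact API version depends on your cluster version):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU usage exceeds 70% of requests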
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
There are two levels (or say layers) of scaling when using deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The deployment is scaled up to launch more pods as the traffic increases and scaled down later when traffic subsides. This is service/pod auto scaling.
To accommodate the increase in the pods more nodes are added to the cluster, later when there are less pods some nodes are shutdown and released. This is cluster auto scaling.
Unlike the above case, a DaemonSet has a 1-to-1 mapping to the nodes, and the N nodes = N pods kind of scaling is useful only when one pod fits exactly into one node's resources. This, however, is very unlikely in real-world scenarios.
Having a DaemonSet has the downside that if you need to scale the application, you also need to scale the number of nodes to add more pods. Also, if you only need a few pods of the application but have a large cluster, you might end up running a lot of unused pods that block resources for other applications.
Having a Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes by default. But this brings another problem: if your cluster is rather small and you have a small number of pods, they might all end up running on a few nodes. There is no good distribution over all available nodes, and if some of those nodes fail for some reason, you lose the majority of your application's pods.
You can solve this using PodAntiAffinity, so pods cannot run on a node where another specified pod is already running. That way you get behavior similar to a DaemonSet, but with far fewer pods and more flexibility regarding scaling and resource usage.
So a use case would be when you don't need one pod per node but still want the pods distributed over your nodes. Say you have 50 nodes and an application that needs 15 pods. Using a Deployment with PodAntiAffinity, you can run those 15 pods on 15 different nodes. When you suddenly need 20, you can scale up the application (not the nodes) so that 20 pods run on 20 different nodes. But you never run 50 pods by default when you only need 15 (or 20).
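A minimal sketch of such a Deployment, assuming a hypothetical app label my-app (the replica count and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 15
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # forbid two pods with the label app=my-app on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname
      containers:
      - name: my-app
        image: my-app:latest   # placeholder image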
You could achieve the same with a Daemonset using nodeSelector or taints/tolerations but that would be far more complicated and less flexible.

How to assign different number of pods to different nodes in Kubernetes for same deployment?

I am running a deployment on a cluster of 1 master and 4 worker nodes (two 32GB machines and two 4GB machines). I want to run a maximum of 10 pods on the 4GB machines and 50 pods on the 32GB machines.
Is there a way to assign different number of pods to different nodes in Kubernetes for same deployment?
I want to run a maximum of 10 pods on 4GB machines and 50 pods on 32GB machines.
This is possible by configuring the kubelet to limit the maximum pod count on the node:
// maxPods is the number of pods that can run on this Kubelet.
MaxPods int32 `json:"maxPods,omitempty"`
The GitHub source can be found here.
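For illustration, a minimal KubeletConfiguration that sets this field might look like the snippet below; if you are on GKE, you normally cannot edit the kubelet directly, and the analogous knob is the --max-pods-per-node flag when creating a node pool:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 10   # e.g. on the 4GB nodes; use a higher value on the 32GB nodes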
Is there a way to assign different number of pods to different nodes in Kubernetes for the same deployment?
Adding this to your request makes it not possible: there is no native mechanism in Kubernetes at this point that satisfies it, and this more or less follows the spirit of how Kubernetes works and its principles. Basically, you schedule your application and let the scheduler decide where it should go, unless a very specific resource is required, such as a GPU, which you can express with labels, affinity, etc.
If you look at the Kubernetes API, you will notice that there is no field that would make your request possible. The API functionality can be extended with custom resources, and this problem could be tackled by creating your own scheduler, but that is not the easy way to fix this.
You may also want to set appropriate memory requests. With requests in place, the scheduler will fit more pods onto the nodes that have more memory resources. It's not ideal, but it is something.
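As a sketch of that idea, a per-container memory request makes the scheduler pack pods according to each node's allocatable memory; the values below are assumptions and would need tuning against the real nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical name
spec:
  replicas: 60
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest        # placeholder image
        resources:
          requests:
            memory: "512Mi"         # roughly 50 such pods fit a 32GB node, far fewer fit a 4GB node
            cpu: "100m"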
Well, in general scheduling is done on the basis of algorithms like round-robin, least-used, etc.
We do have the option of adding node affinities via selectors, but that won't tackle the count either.
You may have to set this up manually across the worker nodes.
Say you run kubectl top nodes to get the available capacity once the deployment has been done, and kubectl get po -o wide to see which nodes the pods landed on.
Now, to force a pod to be spawned on a specific node, say one of the 32GB machines, you can temporarily mark the 4GB nodes as unschedulable by executing the following command:
kubectl cordon {node_name}
Then kill the pods that are running on the 4GB machines and that you want running on the 32GB machines. After being killed, they will automatically be spawned on one of the 32GB nodes.
Then you can execute
kubectl uncordon {node_name}
to mark the node as schedulable again.
This is a bit involved and will need some manual calculation as well.

Can I force Kubernetes not to run more than X replicas of a pod in the same node?

I have a tiny Kubernetes cluster consisting of just two nodes running on t3a.micro AWS EC2 instances (to save money).
I have a small web app that I am trying to run in this cluster. I have a single Deployment for this app. This deployment has spec.replicas set to 4.
When I run this Deployment, I noticed that Kubernetes scheduled 3 of its pods in one node and 1 pod in the other node.
Is it possible to force Kubernetes to schedule at most 2 pods of this Deployment per node? Having 3 instances on the same node puts me dangerously close to running out of memory on these tiny EC2 instances.
Thanks!
The correct solution for this would be to set memory requests and limits that correctly match your steady-state and burst RAM consumption on every pod; then the scheduler will do all this math for you.
But for the future, and for others, there is a new feature which kind of allows this: https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/. It's not an exact match, since you can't put a global cap; rather, you can require that pods be spread evenly over the cluster subject to a maximum skew.
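A minimal sketch for this case, assuming a hypothetical Deployment labelled app: web; with maxSkew: 1, four replicas over two nodes end up two per node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                          # pod counts per node may differ by at most 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx                        # placeholder image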

Is there a way to resize a GKE cluster to 0 nodes after a certain amount of idle time?

I have a GKE cluster that I want to have sitting at 0 nodes, scale up to 3 nodes to perform a task, and then after a certain amount of idle time, scale back down to 0. Is there a way to do this?
A GKE cluster can never scale down to 0 because of the system pods running in the cluster. The pods running in the kube-system namespace count against resource usage on your nodes, so the autoscaler will never decide to scale the entire cluster down to 0.
It is definitely possible to have individual node pools scale down to 0 though.
You may want to consider using 2 different node pools: 1 small one to hold all the system pods (minus daemonset pods) and another larger pool with autoscaling enabled from 0 to X. You can add a taint to this node pool to ensure system pods don't use it.
This will minimize your resource usage during down times, but there is no way to ensure that Kubernetes automatically resizes the whole cluster to 0.
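A hedged sketch of creating such a pool with gcloud (the pool name, taint, and size limits are placeholders; the autoscaler can later take the pool down to zero):
$ gcloud container node-pools create batch-pool \
    --cluster [CLUSTER_NAME] --zone [ZONE] \
    --num-nodes 1 --enable-autoscaling --min-nodes 0 --max-nodes 10 \
    --node-taints dedicated=batch:NoSchedule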
Alternatively, if you have a planned schedule for when the cluster should scale up or down, you can leverage Cloud Scheduler to launch a job that sends an API call to the Container API to resize your cluster.
Or you could configure a job in the cluster, or a preStop hook in your final job, to trigger a Cloud Function.
Yes, you can scale node pools down after some idle condition.
You can use the gcloud CLI to scale a node pool up and down.
Resize a GKE pool using a Cloud Function:
You can use a Cloud Function to trigger the API calls, and use Cloud Scheduler to trigger that function at a specific time.
OR
You can write a Cloud Function with Pub/Sub as a trigger and pass a message to Pub/Sub; based on that message, you can resize the GKE node pool as required.
An example, if anyone is looking for one: https://github.com/harsh4870/cloud-function-scale-gke
As we can read in the GKE documentation about the cluster autoscaler:
Autoscaling limits
When you autoscale clusters, node pool scaling limits are determined by zone availability.
For example, the following command creates an autoscaling multi-zone cluster with six nodes across three zones, with a minimum of one node per zone and a maximum of four nodes per zone:
gcloud container clusters create example-cluster \
--zone us-central1-a \
--node-locations us-central1-a,us-central1-b,us-central1-f \
--num-nodes 2 --enable-autoscaling --min-nodes 1 --max-nodes 4
The total size of this cluster is between three and twelve nodes, spread across three zones. If one of the zones fails, the total size of cluster becomes between two and eight nodes.
But there are limitations.
Occasionally, cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node. See I have a couple of nodes with low utilization, but they are not scaled down. Why?. To work around this limitation, you can configure a Pod disruption budget.
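Purely as an illustration of that workaround, a PodDisruptionBudget for kube-dns (a common blocker) might look like the following; the selector labels are the usual GKE ones but should be verified against your cluster, and on older clusters the API group is policy/v1beta1:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1        # allow the autoscaler to evict one kube-dns pod at a time
  selector:
    matchLabels:
      k8s-app: kube-dns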
Here are the reasons why the nodes might not be scaled down:
the node group already has the minimum size,
node has the scale-down disabled annotation (see How can I prevent Cluster Autoscaler from scaling down a particular node?)
node was unneeded for less than 10 minutes (configurable by --scale-down-unneeded-time flag),
there was a scale-up in the last 10 min (configurable by --scale-down-delay-after-add flag),
there was a failed scale-down for this group in the last 3 minutes (configurable by --scale-down-delay-after-failure flag),
there was a failed attempt to remove this particular node, in which case Cluster Autoscaler will wait for extra 5 minutes before considering it for removal again,
using large custom value for --scale-down-delay-after-delete or --scan-interval, which delays CA action.
The following command would resize the cluster to zero nodes:
gcloud container clusters resize [cluster-name] --size 0 --zone [zone]
Now, it is up to you how you want to increase or decrease the size of the cluster.
Suppose you have a few things to deploy and you know how much capacity they will need; increase the size of the cluster with the following command:
gcloud container clusters resize [cluster-name] --size 3 --zone [zone]
and once you are done with the task you wanted to perform, run the first command again to resize the cluster back to zero. You can write a shell script to automate this, provided you are certain about the time the cluster needs to perform your tasks.
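A minimal sketch of such a script (the cluster name, zone, manifest, and job name are placeholders):
#!/bin/bash
set -e
# scale the cluster up, run the workload, then scale back to zero
gcloud container clusters resize [cluster-name] --size 3 --zone [zone] --quiet
kubectl apply -f job.yaml                                           # placeholder manifest
kubectl wait --for=condition=complete job/my-job --timeout=30m      # hypothetical job name
gcloud container clusters resize [cluster-name] --size 0 --zone [zone] --quiet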

Kubernetes batch performance with activation of thousands of pods using jobs

I am writing a pipeline with kubernetes in google cloud.
I sometimes need to start a few pods within a second, where each task runs inside its own pod.
I plan to use kubectl run with a Kubernetes Job and wait for it to complete (polling all running pods every second), then activate the next step in the pipeline.
I will also monitor the cluster size to make sure I am not exceeding the max CPU/RAM usage.
I can run tens of thousands of jobs at the same time.
I am not using standard pipelines because I need to create a dynamic number of tasks in the pipeline.
I am running the batch operation so I can handle the delay.
Is this the best approach? How long does it take to create a pod in Kubernetes?
If you want to run tens of thousands of jobs at the same time, you will definitely need to plan resource allocation. You need to estimate the number of nodes that you need. After that, you may create all the nodes at once, or use the GKE cluster autoscaler to automatically add new nodes in response to resource demand. If you preallocate all the nodes at once, you will probably have a high bill at the end of the month, but pods can be created very quickly. If you create only a small number of nodes initially and rely on the cluster autoscaler, you will face large delays, because nodes take several minutes to start. You must decide which approach to take.
If you use the cluster autoscaler, do not forget to specify the maximum number of nodes in the cluster.
Another important thing: you should put your jobs into the Guaranteed quality of service class in Kubernetes. Otherwise, if you use BestEffort or Burstable pods, you will end up in an eviction nightmare, which is really terrible and uncontrolled.
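A minimal sketch of a Job whose pod lands in the Guaranteed QoS class, i.e. requests equal limits for every container (the name, image, and sizes are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-task
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: my-task:latest          # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "256Mi"
          limits:
            cpu: "500m"                # requests == limits => Guaranteed QoS
            memory: "256Mi"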