Kubernetes Autoscaler: how to always keep one node idle - kubernetes

I am currently working with GPUs, and since they are expensive I want them to scale up and down depending on the load. However, scaling up the cluster and preparing a node takes around 8 minutes, since it installs the drivers and does some other preparation.
To solve this problem, I want to keep one node in an idle state and autoscale the rest of the nodes. Is there any way to do that?
This way, when a request comes in, the idle node will take it and a new idle node will be created.
Thanks!

There are three different approaches:
1 - The first approach is entirely manual. It lets you keep a node in an idle state without incurring downtime for your application during the autoscaling process.
You would have to prevent one specific node (let's call it "node A") from autoscaling, create a new node, and make replicas of node A's pods on that new node.
Node A will keep running because it is not part of the autoscaling process.
Once the autoscaling process is complete and the boot has finished, you can safely drain that node. In steps:
a. Create a new node.
b. Prevent node A's pods from being evicted by adding the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" to them.
c. Make replicas of node A's pods on the new node.
d. Once the autoscaler has scaled all the nodes and the boot time has completed, you may safely drain node A and delete it.
2 - You could use a PodDisruptionBudget.
3 - If you would like to block node A from being deleted when the autoscaler scales down, you could set the annotation "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" on that particular node (a manifest sketch for all three approaches follows below). This only takes effect during scale-down.
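As a rough sketch of where those settings live (the gpu-workload name, node-a, and the container image are placeholders, not anything from the question; the PodDisruptionBudget apiVersion depends on your cluster version):

# Approach 1: pod-level annotation so the cluster autoscaler will not evict these pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-workload                 # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-workload
  template:
    metadata:
      labels:
        app: gpu-workload
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: worker
          image: my-gpu-image:latest # placeholder image
---
# Approach 2: a PodDisruptionBudget that keeps at least one replica up during voluntary disruptions.
apiVersion: policy/v1                # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: gpu-workload-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: gpu-workload

For approach 3, the annotation goes on the Node object itself, e.g. kubectl annotate node node-a cluster-autoscaler.kubernetes.io/scale-down-disabled=true (node-a being a placeholder for your node's name).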

Related

Is there a benefit to having spare nodes for handling EC2 spot interruption node termination event?

Suppose I have a 3 node EKS cluster made up of 3 spot instances (we'll call them Node A, B, and C), and each node has critical pods scheduled. The EKS cluster has the EKS Node Termination Handler running. Metadata gets posted saying that in 2 minutes Node A is going to be reclaimed by Amazon.
The Node Termination Handler cordons and drains the node being taken (Node A), and a new node spins up. The pods from Node A are then scheduled on the Node A replacement. If this completes within the two minutes, perfect.
Is there a benefit to having spare capacity around (Node D)? If Node A is taken back by Amazon, will my pods be rescheduled on Node D since it is already available?
In this architecture, it seems like a great idea to have a spare node or two around for pod rescheduling so I'm not at the mercy of the 2-minute window. Do I need to do anything special to make sure the pods are rescheduled in the most efficient way?
Is there a benefit to having spare capacity around (Node D)? If Node A is taken back by Amazon, will my pods be rescheduled on Node D since it is already available?
Yes, definitely: there is a good chance the pods will get scheduled on that node, as long as the deployment has no specific constraints attached such as a node selector, taints/tolerations, affinity, etc.
Do I need to do anything special to make sure the pods are rescheduled in the most efficient way?
Keeping spare capacity sounds like a good idea, but what if all 3 nodes get the termination signal at the same time? Can all the pods be rescheduled onto new nodes within 2 minutes?
Will 3 new nodes be available, or only the single Node D?
You need to take into account the total size of all the pods being rescheduled versus the nodes available, and configure readiness/liveness probes with fast settings so the pods come up as soon as possible and can handle the traffic (a probe sketch follows at the end of this answer).
If only your single Node D is running and all 3 spot instances get terminated, that can create issues. What about the pods of the Nginx ingress or the service mesh you will be running?
If the Nginx pods have to be rescheduled, they may sometimes take a few seconds to come up; if they are replaced via a rolling update, that's fine.
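A minimal sketch of such fast probe settings, assuming a plain HTTP service with a /healthz endpoint (the name, image, port, path, and timings are illustrative and would need tuning for your own app):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                        # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25        # placeholder image
          ports:
            - containerPort: 80
          readinessProbe:          # gate traffic until the pod can actually serve
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 2
            failureThreshold: 2
          livenessProbe:           # restart the container if it stops responding
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5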

kubernetes - use-case to choose deployment over daemon-set

Normally when we scale up the application we do not deploy more than one pod of the same service on the same node. Using a DaemonSet we can make sure we have our service on each node, which makes pods very easy to manage when nodes scale up and down. If I use a Deployment instead, there will be trouble when scaling: there may be multiple pods on the same node, and a new node may have no pod at all.
I want to know the use cases where a Deployment is more suitable than a DaemonSet.
Your cluster runs dozens of services, and therefore runs hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack onto every single node, and you wouldn't be able to independently scale components.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
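As a sketch of the Deployment-plus-HorizontalPodAutoscaler combination mentioned above (the target name and thresholds are illustrative; autoscaling/v2 assumes a reasonably recent cluster):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service            # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service          # placeholder Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU usage passes 70%

The HPA then adjusts the Deployment's replica count, and the cluster autoscaler adds or removes nodes to fit those replicas.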
There are two levels (or layers) of scaling when using Deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The Deployment is scaled up to launch more pods as the traffic increases and scaled down later when the traffic subsides. This is service/pod autoscaling.
To accommodate the increase in pods, more nodes are added to the cluster; later, when there are fewer pods, some nodes are shut down and released. This is cluster autoscaling.
Unlike the above case, a DaemonSet has a one-to-one mapping to the nodes, and the N nodes = N pods kind of scaling is useful only when one pod fits exactly into one node's resources. This, however, is very unlikely in real-world scenarios.
A DaemonSet has the downside that if you need to scale the application, you also need to scale the number of nodes to add more pods. And if you only need a few pods of the application but have a large cluster, you might end up running a lot of unused pods that block resources for other applications.
A Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes by default. But this brings another problem: if your cluster is rather small and you have a small number of pods, they might all end up running on a few nodes. There is no good distribution over all available nodes, and if some of those nodes fail for some reason, you lose the majority of your application pods.
You can solve this using PodAntiAffinity, so that a pod cannot run on a node where a specified other pod is already running. That gives you behavior similar to a DaemonSet, but with far fewer pods and more flexibility regarding scaling and resource usage.
So a use case would be when you don't need one pod per node but still want the pods distributed over your nodes. Say you have 50 nodes and an application of which you need 15 pods. Using a Deployment with PodAntiAffinity you can run those 15 pods on 15 different nodes. When you suddenly need 20, you can scale up the application (not the nodes) so 20 pods run on 20 different nodes. But you never have 50 pods by default when you only need 15 (or 20).
You could achieve the same with a DaemonSet using nodeSelector or taints/tolerations, but that would be far more complicated and less flexible. A sketch of the PodAntiAffinity variant follows below.
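A rough sketch of that PodAntiAffinity setup, reusing the 15-pod example above (the name, label, and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app               # placeholder
spec:
  replicas: 15
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Never put two my-app pods on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-app
          image: my-app:1.0  # placeholder image

With the required rule, scheduling fails once you ask for more replicas than you have nodes; preferredDuringSchedulingIgnoredDuringExecution relaxes that if you only want best-effort spreading.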

How to avoid the last pod being killed on automatic node scale down in AKS

We are using Azure AKS v1.17.9 with auto-scaling both for pods (using HorizontalPodAutoscaler) and for nodes. Overall it works well, but we have seen outages in some cases. We have some deployments where minReplicas=1 and maxReplicas=4. Most of the time there will only be one pod running for such a deployment. In some cases where the auto-scaler has decided to scale down a node, the last remaining pod has been killed. Later a new pod is started on another node, but this means an outage.
I would have expected the auto-scaler to first create a new pod on another node (bringing the number of replicas up to the allowed value of 2) and then scale down the old pod. That would have worked without downtime. As it is, it kills first and asks questions later.
Is there a way around this except the obvious alternative of setting minReplicas=2 (which increases the cost as all these pods are doubled, needing additional VMs)? And is this expected, or is it a bug?
In some cases where the auto-scaler has decided to scale down a node, the last remaining pod has been killed. Later a new pod is started on another node, but this means an outage.
For this reason, you should always have at least 2 replicas for a Deployment in a production environment. And you should use Pod Anti-Affinity so that those two pods are not scheduled to the same Availability Zone; e.g. if there are network problems in one Availability Zone, your app is still available.
It is common to have at least 3 replicas, one in each Availability Zone, since cloud providers typically have 3 Availability Zones in each Region; that way traffic can stay within a zone, which is typically cheaper than cross-zone traffic.
You can always use fewer replicas to save cost, but it is a trade-off and you get worse availability.
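A sketch of the zone-level anti-affinity suggested above; topology.kubernetes.io/zone is the standard well-known zone label, the rest is illustrative. A preferred rule is used so pods still schedule if one zone is full:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api               # placeholder
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      affinity:
        podAntiAffinity:
          # Prefer spreading the replicas across Availability Zones.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-api
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: my-api
          image: my-api:1.0  # placeholder image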

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kubernetes and would like to ask about some concepts related to Kubernetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replicas.
(1)
Assume that there are 4 nodes, each being a different physical server with different CPU and memory.
When the deployment is made, how would Kubernetes assign the pods to the nodes? Could there be a scenario where it puts multiple pods on the same server while another server gets no pod at all (due to resource considerations)?
(2)
Assume there are 4 nodes (on 4 identical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that one of the nodes now goes down. How would Kubernetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one has more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application's point of view is to set appropriate resource requests in your pod specifications. As long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
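A minimal sketch of the resource requests mentioned above (the name, image, and numbers are arbitrary examples; the 4Gi echoes the 4 GB example):

apiVersion: v1
kind: Pod
metadata:
  name: my-app               # placeholder
spec:
  containers:
    - name: my-app
      image: my-app:1.0      # placeholder image
      resources:
        requests:            # what the scheduler reserves when picking a node
          cpu: 500m
          memory: 4Gi
        limits:              # hard caps enforced at runtime
          cpu: "1"
          memory: 4Gi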
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency, based on the resource availability of the nodes. So if a node does not have enough capacity to host a pod, it is possible that more than one replica ends up scheduled onto the same node.
Kubernetes will reschedule pods from the failed node to other available nodes that have enough capacity to host them. Again, if there are not enough nodes that can host the replicas, more than one replica may be scheduled onto the same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler with node and pod affinity and anti-affinity, for example:
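A sketch of a node-affinity rule (the disktype=ssd label is a placeholder; you would use whatever labels your nodes actually carry):

apiVersion: v1
kind: Pod
metadata:
  name: my-pod                       # placeholder
spec:
  affinity:
    nodeAffinity:
      # Only schedule onto nodes carrying the example label.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype        # placeholder label key
                operator: In
                values:
                  - ssd              # placeholder label value
  containers:
    - name: app
      image: my-app:1.0              # placeholder image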

how to reduce the time to move pods to other nodes when a node is down

I have a problem: I can't find how to change the parameters that control when pods are moved to another node after k8s detects that a node is down.
I found the parameter --sync-synchrionizes, but I'm not sure that's it.
Does someone know how to do it?
You need to edit the kube-controller-manager manifest (you can find it in /etc/kubernetes/manifests) and update the following parameters (note that node-status-update-frequency is actually a kubelet flag, while the other three belong to kube-controller-manager):
node-status-update-frequency: 10s
node-monitor-period: 5s
node-monitor-grace-period: 40s
pod-eviction-timeout: 30s
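On a kubeadm-style cluster the controller-manager flags would go into the static pod manifest, roughly like this (a sketch only; the image tag is a placeholder, newer Kubernetes versions replace --pod-eviction-timeout with taint-based eviction settings, and node-status-update-frequency is configured on the kubelet rather than here):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
    - name: kube-controller-manager
      image: registry.k8s.io/kube-controller-manager:v1.x   # placeholder tag
      command:
        - kube-controller-manager
        - --node-monitor-period=5s
        - --node-monitor-grace-period=40s
        - --pod-eviction-timeout=30s
        # ...keep the rest of the existing flags unchanged...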
This is what happens when a node dies or goes offline:
The kubelet posts its status to the control plane every --node-status-update-frequency=10s.
Node goes offline
kube-controller-manager checks the node statuses every --node-monitor-period=5s.
kube-controller-manager sees that the node is unresponsive and gives it the grace period --node-monitor-grace-period=40s before considering it unhealthy. Note: this parameter should be N x node-status-update-frequency.
Once the node is marked unhealthy, kube-controller-manager removes its pods based on --pod-eviction-timeout=5m.
Now, if you tweak the parameter pod-eviction-timeout to, say, 30 seconds, it will still take a total of 70 seconds to evict the pods from the node, because node-monitor-grace-period (which itself accounts for node-status-update-frequency) is counted as well. You can tweak these variables too, to further lower your total node eviction time.
Once a pod is scheduled to a particular node, it is not moved or shifted to another node in any case; a new pod is created on an available node instead.
If you don't have a Deployment or ReplicationController to manage the state (number of pods) of your application, the pod will be lost forever. But if you are using a Deployment or another object responsible for maintaining the desired state, then when the node goes down it detects the change in the current state and creates a new pod on another node (depending on node capacity).
Absolutely agree with praful above. It is quite challenging to evict the pods from a failed node and move them to another available node in 5 seconds; practically, it is not possible. You need to monitor the node status, allow a grace period to confirm that the node is indeed down, mark its status as unhealthy, and finally move the pods to another active node.
You can tweak those node-monitor parameters to much lower values, but the downside is that control plane performance would take a hit, as more connections are made between the kubelet and the API server.
I suggest you run 2 replicas of each pod so that your app is still available to serve user requests.