How to configure Kubernetes cluster autoscaler to scale down only? - kubernetes

I'd like to run the kubernetes cluster autoscaler so that unneeded nodes will be removed automatically, but I don't want the autoscaler to add nodes automatically. I prefer to handle scaling up myself. Is this possible?
I found maxNodesTotal, but I worry the semantics of setting this to 0 might mean all my nodes will go away. I also found scaleDownEnabled, but no corresponding option for scaling up.

The Kubernetes Cluster Autoscaler (CA) attempts to scale up whenever it identifies pending pods that are waiting to be scheduled but request more resources (CPU/RAM) than any available node can serve.
You can use the max-nodes-total parameter to limit the maximum number of nodes CA is allowed to spin up.
For example, if you don't want your cluster to ever consist of more than 3 nodes during peak utilization, you would set max-nodes-total to 3.
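As a rough sketch of where these settings live (the node group name, the image tag, and all numbers below are made up; the flag names come from the cluster autoscaler's documentation), they are passed as command-line flags on the cluster-autoscaler container:

    # Fragment of a cluster-autoscaler Deployment spec (illustrative values only)
    containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2
      command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --nodes=1:3:my-nodegroup-asg     # min:max:ASG name (hypothetical node group)
      - --max-nodes-total=3              # hard cap across all node groups
      - --scale-down-enabled=true        # scale-down toggle; there is no scale-up counterpart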
There are different considerations that you should be aware of in terms of cost savings, performance and availability.
I would try to list some related to cost savings and efficient utilization as I suspect you might be more interested in that aspect.
Make sure you size your pods consistently with their actual utilization, because scale-up is triggered by the pods' resource requests, not by actual pod resource utilization.
Also, bigger pods are less likely to fit together on the same node, and in addition CA won't be able to scale down semi-utilized nodes, resulting in wasted spend.
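For illustration only (the numbers are hypothetical), requests sized close to observed usage look like this in the pod spec; remember that CA reacts to these requests, not to live utilization:

    # Illustrative container resources; keep requests close to what the container really uses
    resources:
      requests:
        cpu: "250m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"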

Since you tagged this question with EKS, I will assume you are on AWS. On AWS the ASG (Auto Scaling Group) for each NodeGroup has a Max setting that is honoured by the cluster autoscaler. You can set this to prevent scaling above the set number of nodes. If the Min and Max on the ASG are the same value, then the autoscaler will never scale up or down. If the Min and Max are different, then the autoscaler can scale both up and down within that range. This is not exactly "never scale up", but it limits the upper end.
If you have multiple NodeGroups (ASGs), then each one can have different Min and Max nodes values.
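For example, if you manage node groups with eksctl, the ASG bounds come straight from the nodegroup definition; a minimal sketch with placeholder names and sizes:

    # eksctl ClusterConfig fragment (illustrative)
    nodeGroups:
    - name: workers-fixed
      instanceType: m5.large
      minSize: 2
      maxSize: 2          # Min == Max: the autoscaler will neither add nor remove nodes here
      desiredCapacity: 2
    - name: workers-elastic
      instanceType: m5.large
      minSize: 1
      maxSize: 4          # different bounds per node group
      desiredCapacity: 2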
You can also configure the cluster autoscaler itself in different ways. For example, you can set the utilization threshold. If a node's utilization falls below this threshold, the cluster autoscaler considers the node for scale down. See the FAQ.
The FAQ entry above that one may also apply. You can add an annotation to any node you do not want considered for scale down by the cluster autoscaler. Set: kubectl annotate node <nodename> cluster-autoscaler.kubernetes.io/scale-down-disabled=true or annotate the nodes as they are created. You can do this with entries in your AWS node group setup.
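A hedged sketch of what those knobs look like as cluster-autoscaler flags (the values are arbitrary; check the FAQ for the defaults in your version):

    # Extra cluster-autoscaler flags (illustrative values)
    - --scale-down-utilization-threshold=0.5   # nodes below 50% requested/allocatable become scale-down candidates
    - --scale-down-unneeded-time=10m           # how long a node must stay unneeded before it is removed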

Related

kubernetes resource requests and limits adjustments

I have a k8s cluster in GKE with the node autoscaler turned on. I want to maximize resource utilization, and have applied all the suggestions on request/limit changes recommended by GKE. At the moment there are 4 nodes. They all use n2-standard-2, i.e. 4 GB of memory per vCPU.
The memory request-to-allocatable ratio is quite high compared to the CPU request-to-allocatable ratio.
Wondering if there is another machine type that better suits my case, or any other resource optimization recommendations?
In GKE you can select custom machine types (custom compute sizes).
We find most workloads work best with a 1:4 vCPU-to-memory ratio (hence the default), but it's possible to support other workload types. For your workload it looks like 1:2 vCPU-to-memory would be appropriate.
Also, it's hard to know exactly what sort of resource limits to set. You should look into generating some load for your cluster and using the Vertical Pod Autoscaler (VPA) to get recommendations from GKE so you can right-size the requests and limits.
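A minimal VPA sketch for getting recommendations without automatic pod restarts (the Deployment name is hypothetical):

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: myapp-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      updatePolicy:
        updateMode: "Off"   # recommendation-only; read the suggestions with kubectl describe vpa myapp-vpa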

Can VPA and HPA (autoscaling) in Kubernetes be used together?

Can the following be done:
VPA relies on a number of different measurements and is different from the HPA. We can therefore use VPA without interference in relation to the HPA. For truly efficient scaling, the HPA and VPA complement each other. HPA creates new replicas if the load rises. If the space for these replicas is not sufficient, VPA will provide some nodes, allowing HPA-made pods to run.
Can they use the same metrics? If we use the same metrics, will both of them execute, or do we need to define different metrics for both?
I would also like to clarify one thing:
If the space for these replicas is not sufficient, VPA will provide some nodes, allowing HPA-made pods to run
If the number of nodes provided changes, it is horizontal scaling. Vertical scaling would mean changing the resource capacity of a node like number of cpus or amount of memory.
As for VPA working with HPA:
No. According to this article:
Avoid using HPA and VPA in tandem
HPA and VPA are currently incompatible and a best practice is to avoid using both together for the same set of pods. VPA can however be used with HPA that is configured to use either external or custom metrics.
AFAIK, k8s is better suited for HPA; the Kubernetes documentation also has a dedicated HPA page.
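To make the custom-metrics caveat concrete, here is a minimal sketch of an autoscaling/v2 HPA driven by a per-pod custom metric instead of CPU/memory, so it does not act on the same resource metrics VPA adjusts. The Deployment and metric names are hypothetical and assume a metrics adapter exposes the metric:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: myapp-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second   # hypothetical custom metric
          target:
            type: AverageValue
            averageValue: "100"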

How does Databricks do autoscaling for a cluster?

I have a Databricks cluster set up with autoscale up to 12 nodes.
I have often observed Databricks scaling the cluster from 6 to 8, then 8 to 11, and then 11 to 14 nodes.
So my queries -
1. Why does it pick 2-3 nodes to be added at one go?
2. Why is autoscaling triggered when not many jobs are active and there is no heavy processing on the cluster? CPU usage is pretty low.
3. While autoscaling, why does it leave the notebook in a waiting state?
4. Why does it take up to 8-10 minutes to autoscale?
Thanks
I am trying to investigate why Databricks is autoscaling the cluster when it's not needed.
When you create a cluster, you can either provide a fixed number of workers for the cluster or provide a minimum and maximum number of workers for the cluster.
When you provide a fixed size cluster, Databricks ensures that your cluster has the specified number of workers. When you provide a range for the number of workers, Databricks chooses the appropriate number of workers required to run your job. This is referred to as autoscaling.
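For illustration, that min/max range is part of the cluster definition itself; a sketch of the relevant fragment (numbers made up, shown in YAML form; the Clusters API takes the equivalent JSON):

    # Autoscaling cluster: Databricks picks a worker count within this range
    autoscale:
      min_workers: 2
      max_workers: 12
    # Fixed-size cluster instead: num_workers: 8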
With autoscaling, Databricks dynamically reallocates workers to account for the characteristics of your job. Certain parts of your pipeline may be more computationally demanding than others, and Databricks automatically adds additional workers during these phases of your job (and removes them when they’re no longer needed).
Autoscaling makes it easier to achieve high cluster utilization, because you don’t need to provision the cluster to match a workload. This applies especially to workloads whose requirements change over time (like exploring a dataset during the course of a day), but it can also apply to a one-time shorter workload whose provisioning requirements are unknown. Autoscaling thus offers two advantages:
Workloads can run faster compared to a constant-sized under-provisioned cluster.
Autoscaling clusters can reduce overall costs compared to a statically-sized cluster.
Databricks offers two types of cluster node autoscaling: standard and optimized.
How autoscaling behaves
Autoscaling behaves differently depending on whether it is optimized or standard and whether applied to an interactive or a job cluster.
Optimized
Scales up from min to max in 2 steps.
Can scale down even if the cluster is not idle by looking at shuffle file state.
Scales down based on a percentage of current nodes.
On job clusters, scales down if the cluster is underutilized over the last 40 seconds.
On interactive clusters, scales down if the cluster is underutilized over the last 150 seconds.
Standard
Starts with adding 4 nodes. Thereafter, scales up exponentially, but can take many steps to reach the max.
Scales down only when the cluster is completely idle and it has been underutilized for the last 10 minutes.
Scales down exponentially, starting with 1 node.

Resource Allocation in Kubernetes: How are pods scheduled?

In Kubernetes, the role of the scheduler is to find a suitable node for each pod. After a pod is assigned to a node, there are different pods on that node competing for resources. For this competitive situation, how does Kubernetes allocate resources? Is there any source code in Kubernetes for computing resource allocation?
I suppose you can take a look at the below articles to see if that answers your query
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_algorithm.md#ranking-the-nodes
https://jvns.ca/blog/2017/07/27/how-does-the-kubernetes-scheduler-work/
The filtered nodes are considered suitable to host the Pod, and often more than one node remains. Kubernetes prioritizes the remaining nodes to find the "best" one for the Pod. The prioritization is performed by a set of priority functions. For each remaining node, a priority function gives a score from 0-10, with 10 representing "most preferred" and 0 "least preferred". Each priority function is weighted by a positive number and the final score of each node is calculated by adding up all the weighted scores. For example, suppose there are two priority functions, priorityFunc1 and priorityFunc2, with weighting factors weight1 and weight2 respectively; the final score of some NodeA is:
finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
After the scores of all nodes are calculated, the node with the highest score is chosen as the host of the Pod. If more than one node shares the highest score, a random one among them is chosen.
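For example, with weight1 = 1, weight2 = 2, and NodeA scoring 5 on priorityFunc1 and 8 on priorityFunc2 (numbers made up for illustration):

    finalScoreNodeA = (1 * 5) + (2 * 8) = 21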
Currently, Kubernetes scheduler provides some practical priority functions, including:
LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. A worked example follows this list.
CalculateNodeLabelPriority: Prefer nodes that have the specified label.
BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
CalculateSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
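A worked example of the LeastRequestedPriority fraction described above (capacities and requests are made up): a node with 4000m of CPU capacity, 2500m already requested by existing Pods, and an incoming Pod requesting 500m has a free CPU fraction of 0.25, which, scaled to the 0-10 range, gives a CPU score of 2.5; the memory fraction is computed the same way and the two are weighted equally.

    cpuScore = 10 * (4000m - 2500m - 500m) / 4000m = 2.5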

How does open-faas deployed on kubernetes determine when to scale a function up or down?

In Kubernetes, I am a little unclear of what criteria needs to be met for open-faas to scale a function's replicas up or down.
According to the documentation:
Auto-scaling in OpenFaaS allows a function to scale up or down depending on demand represented by different metrics.
It sounds like, by default, a reason for scaling would be requests/second increasing/decreasing.
OpenFaaS ships with a single auto-scaling rule defined in the mounted configuration file for AlertManager. AlertManager reads usage (requests per second) metrics from Prometheus in order to know when to fire an alert to the API Gateway.
And this "alert" sent to the API Gateway would cause a function's replica count to scale up.
I don't see, in the documentation or the AlertManager configuration, where the requests/second threshold that triggers scaling up/down is set.
My overall questions:
What is the default threshold of requests/second that would cause a scale up?
Is this threshold configurable? If so, how?
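For reference (this is not part of the question above, and it is version-dependent), the stock alerting rule bundled with OpenFaaS on Kubernetes looks roughly like the sketch below; the threshold is the "> 5" requests-per-second comparison over a 10 second window, and because the rule lives in a mounted configuration file, editing that file and redeploying is how it is changed:

    # Illustrative sketch of the bundled alert rule; check your faas-netes version for the exact contents
    groups:
    - name: openfaas
      rules:
      - alert: APIHighInvocationRate
        expr: sum(rate(gateway_function_invocation_total{code="200"}[10s])) BY (function_name) > 5
        for: 5s
        labels:
          service: gateway
          severity: major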