AWS EKS Cluster Autoscaler - Scale-In Policy - kubernetes

I've a CA (Cluster Autoscaler) deployed on EKS followed this post. What I'm wondering is CA automatically scales down the cluster whenever at least a single pod is deployed on that node i.e. if there are 3 nodes with the capacity of 8 pods, if 9th pod comes up, CA would provision 4th nodes to run that 9th pod. What I see is CA is continuously terminating & creating a node randomly chosen from within the cluster disturbing other pods & nodes.
How can I tell EKS (without defining minimal nodes or disabling scale-in policy in ASG) to not to kill the node having at least 1 pod running on it. Any suggestion would be appreciated.

You cannot use pod as unit. CA work with resources cpu and memory unit.
If the cluster does not have enough cpu or memory it add one new.
You have to play with your requests resources (in the pod definition) or redefine your node to take an instance type with more or less resources depending how many pod you want on each.
Or you can play with the param scale-down-utilization-threshold
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

Related

Added new node to eks, new pods still being scheduled on old nodes

I have a terraform-managed EKS cluster. It used to have 2 nodes on it. I doubled the number of nodes (4).
I have a kubernetes_deployment resource that automatically deploys a fixed number of pods to the cluster. It was set to 20 when I had 2 nodes, and seemed evenly distributed with 10 each. I doubled that number to 40.
All of the new pods for the kubernetes deployment are being scheduled on the first 2 (original) nodes. Now the two original nodes have 20 pods each, while the 2 new nodes have 0 pods. The new nodes are up and ready to go, but I cannot get kubernetes to schedule the new pods on those new nodes.
I am unsure where to even begin searching, as I am fairly new to k8s and ops in general.
A few beginner questions that may be related:
I'm reading about pod affinity, and it seems like I could tell k8s to have a pod ANTI affinity with itself within a deployment. However, I am having trouble setting up the anti-affinity rules. I see that the kubernetes_deployment resource has a scheduling argument, but I can't seem to get the syntax right.
Naively it seems that the issue may be that the deployment somehow isn't aware of the new nodes. If that is the case, how could I reboot the entire deployment (without taking down the already-running pods)?
Is there a cluster level scheduler that I need to set? I was under the impression that the default does round robin, which doesn't seem to be happening at the node level.
EDIT:
The EKS terraform module node_groups submodule has fields for desired/min/max_capacity. To increase my worker nodes, I just increased those numbers. The change is reflected in the aws eks console.
Check a couple of things:
Do your nodes show up correctly in the output of kubectl get nodes -o wide and do they have a state of ready?
Instead of pod affinity look into pod topology spread constraints. Anti affinity will not work with multiple pods.

kubernetes - use-case to choose deployment over daemon-set

Normally when we scale up the application we do not deploy more than 1 pod of the same service on the same node, using daemon-set we can make sure that we have our service on each nodes and would make it very easy to manage pod when scale-up and scale down node. If I use deployment instead, there will have trouble when scaling, there may have multiple pod on the same node, and new node may have no pod there.
I want to know the use-case where deployment will be more suitable than daemon-set.
Your cluster runs dozens of services, and therefore runs hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack into a single node, and you wouldn't be able to independently scale components.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
There are two levels (or say layers) of scaling when using deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The deployment is scaled up to launch more pods as the traffic increases and scaled down later when traffic subsides. This is service/pod auto scaling.
To accommodate the increase in the pods more nodes are added to the cluster, later when there are less pods some nodes are shutdown and released. This is cluster auto scaling.
Unlike the above case, a daemonset has a 1 to 1 mapping to the nodes. And the N nodes = N pods kind of scaling will be useful only when 1 pods fits exactly to 1 node resources. This however, is very unlikely in real world scenarios.
Having a Daemonset has the downside that you might need to scale the application and therefore need to scale the number of nodes to add more pods. Also if you only need a few pods of the application but have a large cluster you might end up running a lot of unused pods that block resources for other applications.
Having a Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes per default. But this brings another problem: If your cluster is rather small and you have a small number of pods, they might end up all running on a few nodes. There is no good distribution over all available nodes. If some of those nodes fail for some reason you loose the majority of your application pods.
You can solve this using PodAntiAffinity, so pods can not run on a node where a defined other pod is running. By that you can have a similar behavior as a Daemonset but with far less pods and more flexibility regarding scaling and resource usage.
So a use case would be, when you don't need one pod per node but still want them to be distrubuted over your nodes. Say you have 50 nodes and an application of which you need 15 pods. Using a Deployment with PodAntiAffinity you can run those 15 pods in a distributed way on different 15 nodes. When you suddently need 20 you can scale up the application (not the nodes) so 20 pods run on 20 different nodes. But you never have 50 pods per default, where you only need 15 (or 20).
You could achieve the same with a Daemonset using nodeSelector or taints/tolerations but that would be far more complicated and less flexible.

Can I force Kubernetes not to run more than X replicas of a pod in the same node?

I have a tiny Kubernetes cluster consisting of just two nodes running on t3a.micro AWS EC2 instances (to save money).
I have a small web app that I am trying to run in this cluster. I have a single Deployment for this app. This deployment has spec.replicas set to 4.
When I run this Deployment, I noticed that Kubernetes scheduled 3 of its pods in one node and 1 pod in the other node.
Is it possible to force Kubernetes to schedule at most 2 pods of this Deployment per node? Having 3 instances in the same pod puts me dangerously close to running out of memory in these tiny EC2 instances.
Thanks!
The correct solution for this would be to set memory requests and limits correctly matching your steady state and burst RAM consumption levels on every pod, then the scheduler will do all this math for you.
But for the future and for others, there is a new feature which kind of allows this https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/. It's not an exact match, you can't put a global cap, rather you can require pods be evenly spaced over the cluster subject to maximum skew caps.

In kubernetes, when does it make sense to have pods replicated in the same node?

I understand the logic behind replicating a pod across nodes, but is there any benefit to replicating pods in the same node? From my understanding this doesn't increase degree of replication, because one pod can take down a node.
Replication of pods within a node?
You can handle the replication condition/count using some metrics but you can't handle the spawning of the pods on the same node until unless there is some affinity set, as it's handled by kube-scheduler. However you can set various pod/node affinities, anti affinities and topologykey [if you wish to maintain a minimum number of pods on a particular node].
One pod can take a node down?
I highly doubt that, it can only happen if you don't have any requests/limits set for CPU/Memory or if in your HPA the maximum replicas are set to increase more than the capacity of the node with a condition that podaffinity is set so all the pods will be spawning on the same node.
When can such a scenario make sense?
The reason for it can be if your nodes are of irregular size. And you want to consume a particular node more than other nodes in the cluster.
In a best practices environment all the nodes are of regular size. HPA is set and limits/quotas are provided so that a pod doesn't crash a node by crossing all the limits of a node.
EDIT from comments:
So what if I want to run multiple instances of a nodejs app to utilize all the cores on my kube node? Lets say 8 cores, does it make sense to replicate the nodejs app pod 8 times in the same kube node or is it better to have 1 pod spin up 8 instances of the nodejs app?
A single-threaded application in your case, a pod will be considered as an instance. A pod to spin up 8 instances? It will be a multi-container pod with 8 containers of the same image, that would really be a bad practice and surely not even worth a test environment. However, having different replicas for the same Deployment is a do-able practice. But the question is how will you stop the Kubernetes service that if a pod is already serving some request and is in the lock state how will the request be routed to another pod. That's only possible by HPA if max request per pod is set to 1.
Why not use NodeJS Cluster module for kubernetes to utilise all the cores of the node in the cluster where App is deployed? -- Suggestion by #DanielKobe
node-js-scaling-out-on-kubernetes
Should-you-use-pm2-node-cluster-or-neither-in-kubernetes

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity