Can I force Kubernetes not to run more than X replicas of a pod in the same node? - kubernetes

I have a tiny Kubernetes cluster consisting of just two nodes running on t3a.micro AWS EC2 instances (to save money).
I have a small web app that I am trying to run in this cluster. I have a single Deployment for this app. This deployment has spec.replicas set to 4.
When I run this Deployment, I noticed that Kubernetes scheduled 3 of its pods in one node and 1 pod in the other node.
Is it possible to force Kubernetes to schedule at most 2 pods of this Deployment per node? Having 3 instances in the same pod puts me dangerously close to running out of memory in these tiny EC2 instances.
Thanks!

The correct solution for this would be to set memory requests and limits correctly matching your steady state and burst RAM consumption levels on every pod, then the scheduler will do all this math for you.
But for the future and for others, there is a new feature which kind of allows this https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/. It's not an exact match, you can't put a global cap, rather you can require pods be evenly spaced over the cluster subject to maximum skew caps.

Related

kubernetes - use-case to choose deployment over daemon-set

Normally when we scale up the application we do not deploy more than 1 pod of the same service on the same node, using daemon-set we can make sure that we have our service on each nodes and would make it very easy to manage pod when scale-up and scale down node. If I use deployment instead, there will have trouble when scaling, there may have multiple pod on the same node, and new node may have no pod there.
I want to know the use-case where deployment will be more suitable than daemon-set.
Your cluster runs dozens of services, and therefore runs hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack into a single node, and you wouldn't be able to independently scale components.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
There are two levels (or say layers) of scaling when using deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The deployment is scaled up to launch more pods as the traffic increases and scaled down later when traffic subsides. This is service/pod auto scaling.
To accommodate the increase in the pods more nodes are added to the cluster, later when there are less pods some nodes are shutdown and released. This is cluster auto scaling.
Unlike the above case, a daemonset has a 1 to 1 mapping to the nodes. And the N nodes = N pods kind of scaling will be useful only when 1 pods fits exactly to 1 node resources. This however, is very unlikely in real world scenarios.
Having a Daemonset has the downside that you might need to scale the application and therefore need to scale the number of nodes to add more pods. Also if you only need a few pods of the application but have a large cluster you might end up running a lot of unused pods that block resources for other applications.
Having a Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes per default. But this brings another problem: If your cluster is rather small and you have a small number of pods, they might end up all running on a few nodes. There is no good distribution over all available nodes. If some of those nodes fail for some reason you loose the majority of your application pods.
You can solve this using PodAntiAffinity, so pods can not run on a node where a defined other pod is running. By that you can have a similar behavior as a Daemonset but with far less pods and more flexibility regarding scaling and resource usage.
So a use case would be, when you don't need one pod per node but still want them to be distrubuted over your nodes. Say you have 50 nodes and an application of which you need 15 pods. Using a Deployment with PodAntiAffinity you can run those 15 pods in a distributed way on different 15 nodes. When you suddently need 20 you can scale up the application (not the nodes) so 20 pods run on 20 different nodes. But you never have 50 pods per default, where you only need 15 (or 20).
You could achieve the same with a Daemonset using nodeSelector or taints/tolerations but that would be far more complicated and less flexible.

How to assign different number of pods to different nodes in Kubernetes for same deployment?

I am running a deployment on a cluster of 1 master and 4 worker nodes (2-32GB and 2-4GB machine). I want to run a maximum of 10 pods on 4GB machines and 50 pods in 32GB machines.
Is there a way to assign different number of pods to different nodes in Kubernetes for same deployment?
I want to run a maximum of 10 pods on 4GB machines and 50 pods in 32GB
machines.
This is possible with configuring kubelet to limit the maximum pod count on the node:
// maxPods is the number of pods that can run on this Kubelet.
MaxPods int32 `json:"maxPods,omitempty"`
Github can be found here.
Is there a way to assign different number of pods to different nodes
in Kubernetes for same deployment?
Adding this to your request makes it not possible. There is no such native mechanism in Kubernetes at this point to suffice this. And this more or less goes in spirit of how Kubernetes works and its principles. Basically you schedule your application and let scheduler decides where it should go, unless there is very specific resource required like GPU. And this is possible with labels,affinity etc .
If you look at the Kubernetes-API you notice the there is no such field that will make your request possible. However, API functionality can be extended with custom resources and this problem can be tackled with creating your own scheduler. But this is not the easy way of fixing this.
You may want to also set appropriate memory requests. Higher requests will tell scheduler to deploy more pods into node which has more memory resources. It's not ideal but it is something.
Well in general the scheduling is done on basis of algorithms like round robin, least used etc.
And likely we have the independence of adding node affinities via selectors but that won't even tackle the count.
Maybe you have to manually reset this thing up along the worker nodes.
Say -
you did kubectl top nodes to get the available spaces, once the deployment has been done.
and kubectl get po -o wide will give you the nodes taken on by the pods.
Now to force the Pod to get spawned in a specific node, let's say the one with 32GB then you can temporarily mark the 4GB nodes as "Not ready" by executing following command
Kubectl cordon {node_name}
And now kill the pods those are running in 4GB machines and you want those to run in 32GB machines. After killing them, they will automatically get spawned in any of the 32GB nodes
then you can execute
Kubectl uncordon {node_name} to mark the node as "ready" again.
This is bit involved stuff and will need lots of calculations as well.

In kubernetes, when does it make sense to have pods replicated in the same node?

I understand the logic behind replicating a pod across nodes, but is there any benefit to replicating pods in the same node? From my understanding this doesn't increase degree of replication, because one pod can take down a node.
Replication of pods within a node?
You can handle the replication condition/count using some metrics but you can't handle the spawning of the pods on the same node until unless there is some affinity set, as it's handled by kube-scheduler. However you can set various pod/node affinities, anti affinities and topologykey [if you wish to maintain a minimum number of pods on a particular node].
One pod can take a node down?
I highly doubt that, it can only happen if you don't have any requests/limits set for CPU/Memory or if in your HPA the maximum replicas are set to increase more than the capacity of the node with a condition that podaffinity is set so all the pods will be spawning on the same node.
When can such a scenario make sense?
The reason for it can be if your nodes are of irregular size. And you want to consume a particular node more than other nodes in the cluster.
In a best practices environment all the nodes are of regular size. HPA is set and limits/quotas are provided so that a pod doesn't crash a node by crossing all the limits of a node.
EDIT from comments:
So what if I want to run multiple instances of a nodejs app to utilize all the cores on my kube node? Lets say 8 cores, does it make sense to replicate the nodejs app pod 8 times in the same kube node or is it better to have 1 pod spin up 8 instances of the nodejs app?
A single-threaded application in your case, a pod will be considered as an instance. A pod to spin up 8 instances? It will be a multi-container pod with 8 containers of the same image, that would really be a bad practice and surely not even worth a test environment. However, having different replicas for the same Deployment is a do-able practice. But the question is how will you stop the Kubernetes service that if a pod is already serving some request and is in the lock state how will the request be routed to another pod. That's only possible by HPA if max request per pod is set to 1.
Why not use NodeJS Cluster module for kubernetes to utilise all the cores of the node in the cluster where App is deployed? -- Suggestion by #DanielKobe
node-js-scaling-out-on-kubernetes
Should-you-use-pm2-node-cluster-or-neither-in-kubernetes

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

AWS EKS Cluster Autoscaler - Scale-In Policy

I've a CA (Cluster Autoscaler) deployed on EKS followed this post. What I'm wondering is CA automatically scales down the cluster whenever at least a single pod is deployed on that node i.e. if there are 3 nodes with the capacity of 8 pods, if 9th pod comes up, CA would provision 4th nodes to run that 9th pod. What I see is CA is continuously terminating & creating a node randomly chosen from within the cluster disturbing other pods & nodes.
How can I tell EKS (without defining minimal nodes or disabling scale-in policy in ASG) to not to kill the node having at least 1 pod running on it. Any suggestion would be appreciated.
You cannot use pod as unit. CA work with resources cpu and memory unit.
If the cluster does not have enough cpu or memory it add one new.
You have to play with your requests resources (in the pod definition) or redefine your node to take an instance type with more or less resources depending how many pod you want on each.
Or you can play with the param scale-down-utilization-threshold
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca