Question about concept on Kubernetes pod assignment to nodes - kubernetes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.

There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.

Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

Related

kubernetes - use-case to choose deployment over daemon-set

Normally when we scale up the application we do not deploy more than 1 pod of the same service on the same node, using daemon-set we can make sure that we have our service on each nodes and would make it very easy to manage pod when scale-up and scale down node. If I use deployment instead, there will have trouble when scaling, there may have multiple pod on the same node, and new node may have no pod there.
I want to know the use-case where deployment will be more suitable than daemon-set.
Your cluster runs dozens of services, and therefore runs hundreds of nodes, but for scale and reliability you only need a couple of copies of each service. Deployments make more sense here; if you ran everything as DaemonSets you'd have to be able to fit the entire stack into a single node, and you wouldn't be able to independently scale components.
I would almost always pick a Deployment over a DaemonSet, unless I was running some sort of management tool that must run on every node (a metric collector, log collector, etc.). You can combine that with a HorizontalPodAutoscaler to make the size of the Deployment react to the load of the system, and in turn combine that with the cluster autoscaler (if you're in a cloud environment) to make the size of the cluster react to the resource requirements of the running pods.
Cluster scale-up and scale-down isn't particularly a problem. If the cluster autoscaler removes a node, it will first move all of the existing pods off of it, so you'll keep the cluster-wide total replica count for your service. Similarly, it's not usually a problem if every node isn't running every service, so long as there are enough replicas of each service running somewhere in the cluster.
There are two levels (or say layers) of scaling when using deployments:
Let's say a website running on kubernetes has high traffic only on Fridays.
The deployment is scaled up to launch more pods as the traffic increases and scaled down later when traffic subsides. This is service/pod auto scaling.
To accommodate the increase in the pods more nodes are added to the cluster, later when there are less pods some nodes are shutdown and released. This is cluster auto scaling.
Unlike the above case, a daemonset has a 1 to 1 mapping to the nodes. And the N nodes = N pods kind of scaling will be useful only when 1 pods fits exactly to 1 node resources. This however, is very unlikely in real world scenarios.
Having a Daemonset has the downside that you might need to scale the application and therefore need to scale the number of nodes to add more pods. Also if you only need a few pods of the application but have a large cluster you might end up running a lot of unused pods that block resources for other applications.
Having a Deployment solves this problem, because two or more pods of the same application can run on one node and the number of pods is decoupled from the number of nodes per default. But this brings another problem: If your cluster is rather small and you have a small number of pods, they might end up all running on a few nodes. There is no good distribution over all available nodes. If some of those nodes fail for some reason you loose the majority of your application pods.
You can solve this using PodAntiAffinity, so pods can not run on a node where a defined other pod is running. By that you can have a similar behavior as a Daemonset but with far less pods and more flexibility regarding scaling and resource usage.
So a use case would be, when you don't need one pod per node but still want them to be distrubuted over your nodes. Say you have 50 nodes and an application of which you need 15 pods. Using a Deployment with PodAntiAffinity you can run those 15 pods in a distributed way on different 15 nodes. When you suddently need 20 you can scale up the application (not the nodes) so 20 pods run on 20 different nodes. But you never have 50 pods per default, where you only need 15 (or 20).
You could achieve the same with a Daemonset using nodeSelector or taints/tolerations but that would be far more complicated and less flexible.

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation: How does scale down work, I understand that when a node is used for a given time less than 50% of its capacity, then it is removed, together with all of its pods, which will be replicated in a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution, is to add an annotation to all of the pods containing applications here, but this clearly would not down scale the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is the Eviction ?:
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use Pod Disruption Budget to make sure minimum replicas are always working:
What is PDB?:
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use PDB for your deployment it should not get deleted all at once.
But please notice that if the node fails for some other reason (e.g hardware failure), you will still experience downtime. If you really care about High Availability consider using pod antiaffinity to make sure the pods are not scheduled all on one node.
Same document you referred to, has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

In kubernetes, when does it make sense to have pods replicated in the same node?

I understand the logic behind replicating a pod across nodes, but is there any benefit to replicating pods in the same node? From my understanding this doesn't increase degree of replication, because one pod can take down a node.
Replication of pods within a node?
You can handle the replication condition/count using some metrics but you can't handle the spawning of the pods on the same node until unless there is some affinity set, as it's handled by kube-scheduler. However you can set various pod/node affinities, anti affinities and topologykey [if you wish to maintain a minimum number of pods on a particular node].
One pod can take a node down?
I highly doubt that, it can only happen if you don't have any requests/limits set for CPU/Memory or if in your HPA the maximum replicas are set to increase more than the capacity of the node with a condition that podaffinity is set so all the pods will be spawning on the same node.
When can such a scenario make sense?
The reason for it can be if your nodes are of irregular size. And you want to consume a particular node more than other nodes in the cluster.
In a best practices environment all the nodes are of regular size. HPA is set and limits/quotas are provided so that a pod doesn't crash a node by crossing all the limits of a node.
EDIT from comments:
So what if I want to run multiple instances of a nodejs app to utilize all the cores on my kube node? Lets say 8 cores, does it make sense to replicate the nodejs app pod 8 times in the same kube node or is it better to have 1 pod spin up 8 instances of the nodejs app?
A single-threaded application in your case, a pod will be considered as an instance. A pod to spin up 8 instances? It will be a multi-container pod with 8 containers of the same image, that would really be a bad practice and surely not even worth a test environment. However, having different replicas for the same Deployment is a do-able practice. But the question is how will you stop the Kubernetes service that if a pod is already serving some request and is in the lock state how will the request be routed to another pod. That's only possible by HPA if max request per pod is set to 1.
Why not use NodeJS Cluster module for kubernetes to utilise all the cores of the node in the cluster where App is deployed? -- Suggestion by #DanielKobe
node-js-scaling-out-on-kubernetes
Should-you-use-pm2-node-cluster-or-neither-in-kubernetes

How does multiple replicas/pods scale Kubernetes?

From what I understand, using multiple replicas as well as auto-scaling is supposed to help in the case that lots of people visit your website and make calls to services provided by your Kubernetes cluster.
How do the replicas help with scaling?
Aren't these extra pods all just running on the same computer with constant resources? That would mean that they're all limited by a constant amount of CPU and memory.
Kubernetes has couple of scaling mechanisms. Horizontal Pod Autoscaler being the basic, but not the only one.
With HPA you can spin up additional PODs according to some metrics (most commonly cpu and memory). At some point you will hit a moment when your cluster nodes do not have enough resources to satisfy resource requirements of your pods (you will have pods in Pending state due to lack of nodes available for scheduling).
At that point a Cluster Autoscaler can kick in and ie. scale AWS ASG (or some other cloud-ish node pool) to add new node to the cluster and make space for the pending pod(s)

How do I schedule the same pod on different nodes using kubectl scale?

New to kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
Scaling a Deployment (or RC) tells controller-manager to create more pods, new pods are then subject to scheduling. K8S scheduler will attempt to find most reasonable placement to schedule your pods to. This does not guarantee that pods will launch on different nodes, but makes it a rather likely scenario, if you have the required resources. Unfortunately it also means that if all pods can fit on one node, there are situations where scheduler might actually do just that (ie. all other nodes in unschedulable state for some reason). If that happens, the pods will not reschedule when conditions change.
To have a solid guarantee that pods wil not get colocated on the same node you have two options:
legacy hack : define a hostPort in your pod template. As given host port is a resource that can be assigned only once per node, your pods will never exist more then once per node
alpha feature : you can look into Pod AntiAffinity, quite early and not really battle proven yet
First one has a dissadvantage - you can never have more then one pod of this type per node, so it ie. affects rolling deployments and limits your capacity for scaling (you can never have more active pods then number of nodes)