k8s - how scheduler assigns the nodes - kubernetes

I am just curious to know how k8s master/scheduler will handle this.
Lets consider I have a k8s master with 2 nodes. Assume that each node has 8GB RAM and each node running a pod which consumes 3GB RAM.
node A - 8GB
- pod A - 3GB
node B - 8GB
- pod B - 3GB
Now I would like to schedule another pod, say pod C, which requires 6GB RAM.
Question:
Will the k8s master shift pod A or B to other node to accommodate the pod C in the cluster or will the pod C be in the pending status?
If the pod C is going to be in pending status, how to use the resources efficiently with k8s?
Unfortunately I could not try this with my minikube. If you know how k8s scheduler assigns the nodes, please clarify.

Most of the Kubernetes components are split by responsibility and workload assignment is no different. We could define the workload assignment process as Scheduling and Execution.
The Scheduler as the name suggests will be responsible for the Scheduling step, The process can be briefly described as, "get a list of pods, if it is not scheduled to a node, assign it to one node with capacity to run the pod". There is a nice blog post from Julia Evan here explaining Schedulers.
And Kubelet is responsible for the Execution of pods scheduled to it's node. It will get a list of POD Definitions allocated to it's node, make sure they are running with the right configuration, if not running start then.
With that in mind, the scenario you described will have the behavior expected, the POD will not be scheduled, because you don't have a node with capacity available for the POD.
Resource Balancing is mainly decided at scheduling level, a nice way to see it is when you add a new node to the cluster, if there are no PODs pending allocation, the node will not receive any pods. A brief of the logic used to Resource balancing can be seen on this PR
The solutions,
Kubernetes ships with a default scheduler. If the default scheduler does not suit your needs you can implement your own scheduler as described here. The idea would be implement and extension for the Scheduler to ReSchedule PODs already running when the cluster has capacity but not well distributed to allocated the new load.
Another option is use tools created for scenarios like this, the Descheduler is one, it will monitor the cluster and evict pods from nodes to make the scheduler re-allocate the PODs with a better balance. There is a nice blog post here describing these scenarios.
PS:
Keep in mind that the total memory of a node is not allocatable, depending on which provider you use, the capacity allocatable will be much lower than the total, take a look on this SO: Cannot create a deployment that requests more than 2Gi memory

find below the answers
Will the k8s master shift pod A or B to other node to accommodate the pod C in the cluster or will the pod C be in the pending status?
No. pod A and pod B would still be running, pod C will not be scheduled.
If the pod C is going to be in pending status, how to use the resources efficiently with k8s?
Both the nodes cant meet the resource requirements needed to run pod C and hence it cant be scheduled.
You mentioned that the node capacity is 8 GB RAM. note that whole 8 GB RAM is not available to run the work loads. certain amount of RAM is reserved for kube-proxy, kubelet and other node management activities.

Related

How does k8s manage containers using more cpu than requested without limits?

I'm trying to understand what happens when a container is configured with a CPU request and without a limit, and it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources.
Will k8s keep the container throttled in its current node or will it be moved to another node with available resources? do we know how/when k8s decides to move the container when its throttled in such a case?
I would appreciate any extra resources to read on this matter, as I couldn't find anything that go into details for this specific scenario.
Q1) what happens when a container is configured with a CPU request and without a limit ?
ANS:
If you do not specify a CPU limit
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
If you specify a CPU limit but do not specify a CPU request
If you specify a CPU limit for a Container but do not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit. Similarly, if a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit.
Q2) it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources?
ANS:
The Kubernetes scheduler is a control plane process which assigns Pods to Nodes. The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources. The scheduler then ranks each valid Node and binds the Pod to a suitable Node. Multiple different schedulers may be used within a cluster; kube-scheduler is the reference implementation. See scheduling for more information about scheduling and the kube-scheduler component.
Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.
Q3) Will k8s keep the container throttled in its current node or will it be moved to another node with available resources?
ANS:
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or involuntarily.
Voluntary disruptions are started intentionally by application owners or cluster administrators. Involuntary disruptions are unintentional and can be triggered by unavoidable issues like Nodes running out of resources, or by accidental deletions.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application.
Examples are:
a hardware failure of the physical machine backing the node
cluster administrator deletes VM (instance) by mistake
cloud provider or hypervisor failure makes VM disappear
a kernel panic
the node disappears from the cluster due to cluster network partition
eviction of a pod due to the node being out-of-resources.
Suggestion:
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Command:
kubectl taint nodes node1 key1=value1:NoSchedule
Example:
kubectl taint nodes node1 key1=node.kubernetes.io/disk-pressure:NoSchedule

How to assign different number of pods to different nodes in Kubernetes for same deployment?

I am running a deployment on a cluster of 1 master and 4 worker nodes (2-32GB and 2-4GB machine). I want to run a maximum of 10 pods on 4GB machines and 50 pods in 32GB machines.
Is there a way to assign different number of pods to different nodes in Kubernetes for same deployment?
I want to run a maximum of 10 pods on 4GB machines and 50 pods in 32GB
machines.
This is possible with configuring kubelet to limit the maximum pod count on the node:
// maxPods is the number of pods that can run on this Kubelet.
MaxPods int32 `json:"maxPods,omitempty"`
Github can be found here.
Is there a way to assign different number of pods to different nodes
in Kubernetes for same deployment?
Adding this to your request makes it not possible. There is no such native mechanism in Kubernetes at this point to suffice this. And this more or less goes in spirit of how Kubernetes works and its principles. Basically you schedule your application and let scheduler decides where it should go, unless there is very specific resource required like GPU. And this is possible with labels,affinity etc .
If you look at the Kubernetes-API you notice the there is no such field that will make your request possible. However, API functionality can be extended with custom resources and this problem can be tackled with creating your own scheduler. But this is not the easy way of fixing this.
You may want to also set appropriate memory requests. Higher requests will tell scheduler to deploy more pods into node which has more memory resources. It's not ideal but it is something.
Well in general the scheduling is done on basis of algorithms like round robin, least used etc.
And likely we have the independence of adding node affinities via selectors but that won't even tackle the count.
Maybe you have to manually reset this thing up along the worker nodes.
Say -
you did kubectl top nodes to get the available spaces, once the deployment has been done.
and kubectl get po -o wide will give you the nodes taken on by the pods.
Now to force the Pod to get spawned in a specific node, let's say the one with 32GB then you can temporarily mark the 4GB nodes as "Not ready" by executing following command
Kubectl cordon {node_name}
And now kill the pods those are running in 4GB machines and you want those to run in 32GB machines. After killing them, they will automatically get spawned in any of the 32GB nodes
then you can execute
Kubectl uncordon {node_name} to mark the node as "ready" again.
This is bit involved stuff and will need lots of calculations as well.

Kubernetes scheduling ignores pod count per worker node

We have a kubernetes cluster with three worker nodes, which was built manually, borrowing from the 'Kubernetes, the hard way' Tutorial.
Everything on this cluster works as expected for one exception:
The scheduler does not - or seems not to - honor the 110 pod per worker node limit.
Example:
Worker Node 1: 60 pods
Worker Node 2: 100 pods
Worker Node 3: 110 pods
When I want to deploy a new pod, it often happens that the scheduler decides it would be best to schedule the new pod to 'Worker Node 3'. Kubelet refuses to do so, it does honor its 110 pod limitation. The scheduler tries again and again and never succeeds.
I do not understand why this is happening. I think I might be missing some detail about this problem.
From my understanding and what I have read about the scheduler itself, there is no resource or metric for 'amount of pods per node' which is considered while scheduling - or at least I haven't found anything that would suggest otherwise in the Kubernetes Scheduler documentation. Of course the scheduler considers CPU requests/limits, memory requests/limits, disk requests/limits - that's all fine and working. So I don't even know how the scheduler could ever consider the amount of pods used on a worker, but there has to be some kind of functionality doing that, right? Or am I mistaken?
Is my cluster broken? Is there some misconception I have about how scheduling should/does work?
Kubernetes binary versions: v1.17.2
Edit: Kubernetes version
Usually this means the other nodes are unsuitable. Either explicitly via taints, etc or more often things like resource request space.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

Kubernetes: do evicted pods with no resource requests get rescheduled successfully?

I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there's an eligible node that can satisfy the requested resources. So if you do not specify request the pod will pretty much get scheduled. Request and limits are totally different things. Request is a condition for a pod to be scheduled and limit is a condition for a running pod already scheduled.
If you overcommit the actual resources on a node you will run into typical issues - if you overcommit on memory it'll start to swap and CPU there will just be general slow down. Either way the node and pods on it will become unresponsive. It's difficult to deal with and tools like request and limits set up sane boundaries that will help you not take things quite this far where you'll simply see the pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to nd any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gave you some understanding on how it internally works. So answers for your questions:
At most 10 pods will be scheduled into your node.
If there no free memory in node evicted pods will be pending. Also k8s can simply evict pod if it exceeds limits when resources are needed for other pods and services.