Kubernetes pod scheduling with resource quotas

I have read about Kubernetes resource management, but things are still not very clear to me. Let's say we have two Kubernetes nodes, each with 22 MB of memory.
Let's say Pod A has a request of 10 MB and a limit of 15 MB (but its actual usage is 5 MB), so this pod is scheduled on node 1. So node 1 has 22 MB of memory, 5 MB is used by Pod A, and another 17 MB is available if Pod A needs more memory.
Pod B has a request of 10 MB and a limit of 15 MB (basically the same as Pod A), so this pod is scheduled on node 2.
So both nodes use 5 MB out of 22 MB. If Pod C has a request of 5 MB and a limit of 10 MB, will this pod be scheduled on either of the nodes? If yes, what would happen if Pod C needs 10 MB of memory and the other pod needs 15 MB?
What would happen if Pod C has a request of 13 MB and a limit of 15 MB? In this case 13 (request of Pod C) + 10 (request of Pod A) would be 23, which is more than 22.
Does Kubernetes try to make sure that the sum of requests of all pods < available memory && the sum of limits of all pods < available memory?

Answering the questions from the post:
Let's say Pod A has a request of 10 MB and a limit of 15 MB (but its actual usage is 5 MB), so this pod is scheduled on node 1. So node 1 has 22 MB of memory, 5 MB is used by Pod A, and another 17 MB is available if Pod A needs more memory. Pod B has a request of 10 MB and a limit of 15 MB (basically the same as Pod A), so this pod is scheduled on node 2.
This is not really a question, but I think this part needs some perspective on how Pods are scheduled onto nodes. The component responsible for telling Kubernetes where a Pod should run is the kube-scheduler. It could end up in the situation you describe:
Pod A, request: 10 MB, limit: 15 MB -> Node 1 (22 MB memory)
Pod B, request: 10 MB, limit: 15 MB -> Node 2 (22 MB memory)
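For reference, resource requests and limits like these are declared in the Pod spec. A minimal sketch (the name, image and sizes are illustrative placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: pod-a              # illustrative name
spec:
  containers:
  - name: app
    image: nginx           # placeholder image
    resources:
      requests:
        memory: "10Mi"     # what the scheduler reserves on the node
      limits:
        memory: "15Mi"     # hard cap enforced by the kubelet/kernel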
Citing the official documentation:
Node selection in kube-scheduler
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
-- Kubernetes.io: Docs: Concepts: Scheduling eviction: Kube scheduler: Implementation
So both nodes use 5 MB out of 22 MB. If Pod C has a request of 5 MB and a limit of 10 MB, will this pod be scheduled on either of the nodes? If yes, what would happen if Pod C needs 10 MB of memory and the other pod needs 15 MB?
In this particular example I would focus much more on the requests rather than the actual usage. Assuming no other factor denies the scheduling, Pod C should be spawned on one of the nodes (whichever the kube-scheduler chooses). Why is that?
resources.limits will not prevent the Pod from being scheduled (the sum of limits can exceed the node's memory; overcommit is allowed)
resources.requests will prevent the Pod from being scheduled if they don't fit (the sum of requests cannot exceed the node's allocatable memory)
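One way to see what the scheduler is working with on a live cluster (a sketch; the node name is a placeholder):

# Shows Capacity, Allocatable and the "Allocated resources" summary
# (the sum of requests/limits of the pods already on the node):
kubectl describe node <node-name>

# Allocatable memory only, via JSONPath:
kubectl get node <node-name> -o jsonpath='{.status.allocatable.memory}'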
I encourage you to check the following articles for more background:
Sysdig.com: Blog: Kubernetes limits requests
Cloud.google.com: Blog: Products: Containers Kubernetes: Kubernetes best practices (this is a GKE blog but it gives the baseline idea; see the "The lifecycle of a Kubernetes Pod" section)
What would happen if Pod C has a request of 13 MB and a limit of 15 MB? In this case 13 (request of Pod C) + 10 (request of Pod A) would be 23, which is more than 22.
In that example the Pod will not be scheduled, as the sum of requests would exceed the node's allocatable memory (assuming no Pod priority/preemption is involved). The Pod will stay in the Pending state.
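If that happens, the reason is visible in the Pod's events (a sketch; the pod name is a placeholder). The scheduler typically records a FailedScheduling event mentioning insufficient memory:

kubectl describe pod pod-c     # see the Events section at the bottom
kubectl get events --field-selector reason=FailedScheduling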
Additional resources:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Kube scheduler: What's next
Youtube.com: Deep Dive Into the Latest Kubernetes Scheduler Features
- from a CNCF (Cloud Native Computing Foundation) conference

Related

How to force Eviction on a Kubernetes Cluster (minikube)

I am relatively new to Kubernetes, and my current task at work is to debug pod evictions.
I'm trying to replicate the behaviour on a local k8s cluster in minikube.
So far I just cannot get any pods to be evicted.
Can you help me trigger this mechanism?
The eviction of pods is governed by QoS classes (Quality of Service).
There are 3 classes:
Guaranteed (limits equal requests for both CPU and memory) - evicted last, if at all
Burstable (requests set lower than limits) - evicted after BestEffort pods
BestEffort (no requests or limits) - evicted first
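The QoS class is not set directly; it is derived from the requests/limits in the container specs. A minimal sketch of the three cases (values are illustrative):

# Guaranteed: every container has limits equal to requests for CPU and memory
resources:
  requests:
    cpu: "500m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "128Mi"

# Burstable: requests set, limits higher (or only some of them set)
resources:
  requests:
    memory: "64Mi"
  limits:
    memory: "128Mi"

# BestEffort: no requests and no limits at all

You can check which class a pod was assigned with:
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'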
If you want to test this mechanism, you can run a pod that consumes a lot of memory or CPU; before that, launch your example pods with different requests and limits so you can observe the behaviour. This behaviour only applies to eviction, so your pods must already be running before you create the resource pressure.
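A sketch of a pod that creates memory pressure for such a test (the polinux/stress image and its arguments follow the official Kubernetes memory-assignment docs; adjust the size to your nodes):

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]  # allocate ~250M
    # no requests/limits -> BestEffort, so it is the first candidate for eviction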
If instead you want to test the scheduling mechanism at launch time, you can configure a priorityClassName to get a pod scheduled even if the cluster is full.
For example, if your cluster is full you cannot schedule a new pod, because your pod does not have sufficient priority.
If you want to schedule a pod anyway, you can set its priorityClassName to system-node-critical or create your own PriorityClass; one of the pods with lower priority will then be evicted (preempted) and your pod will be scheduled.
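A minimal sketch of a custom PriorityClass and a pod that uses it (names and the value are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000           # higher value = higher priority
globalDefault: false
description: "For pods that may preempt lower-priority pods."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx         # placeholder image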

kubectl drain and rolling update, downtime

Does kubectl drain first make sure that pods with replicas=1 are healthy on some other node?
Assuming the pod is controlled by a deployment, and the pods can indeed be moved to other nodes.
Currently, as I see it, it only evicts (deletes) the pods from the node, without scheduling replacements first.
In addition to Suresh Vishnoi's answer:
If no PodDisruptionBudget is specified and you have a Deployment with one replica, the pod will be terminated and then a new pod will be scheduled on another node.
To make sure your application stays available during the node draining process, you have to specify a PodDisruptionBudget and create more replicas. If you have 1 replica with minAvailable: 30%, the drain will be refused with the following error:
error when evicting pod "pod01" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
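A minimal sketch of such a PodDisruptionBudget (the selector has to match your Deployment's pod labels; names are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: "30%"
  selector:
    matchLabels:
      app: myapp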
Briefly, this is how the draining process works:
As explained in the documentation, the kubectl drain command "safely evicts all of your pods from a node before you perform maintenance on the node and allows the pod's containers to gracefully terminate and will respect the PodDisruptionBudgets you have specified".
Drain does two things:
cordons the node - the node is marked as unschedulable, so new pods cannot be scheduled on it. This makes sense: if we know the node will be under maintenance, there is no point in scheduling a pod there only to reschedule it on another node. From the Kubernetes perspective it adds a taint to the node: node.kubernetes.io/unschedulable:NoSchedule
evicts/deletes the pods - after the node is marked as unschedulable, drain tries to evict the pods that are running on the node. It uses the Eviction API, which takes PodDisruptionBudgets into account (if the API is not supported it falls back to deleting the pods). It issues a DELETE call to the Kubernetes API but honours the grace period (GracePeriodSeconds), so it lets a pod finish its processes.
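The same steps as commands (a sketch; the node name is a placeholder, and --delete-emptydir-data was called --delete-local-data on older kubectl versions):

kubectl cordon <node-name>      # step 1 only: mark the node unschedulable
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ... perform the maintenance ...
kubectl uncordon <node-name>    # make the node schedulable again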
New Pods are scheduled when the desired number of pods is not met (desired state != current state), whether because of draining or a node failure.
With the PodDisruptionBudget resource you can manage the disruption during the draining of the node.
You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget. maxUnavailable can only be used to control the eviction of pods that have an associated controller managing them. In the examples below, “desired replicas” is the scale of the controller managing the pods being selected by the PodDisruptionBudget. https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget’s selector.
Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.
Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.
Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
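For comparison with the minAvailable sketch above, the maxUnavailable form looks like this (illustrative names):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: "30%"    # at most 30% of desired replicas may be down
  selector:
    matchLabels:
      app: myapp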

Kubernetes - Scheduling pod replicas on nodes as per the resources available

I have a node with 64 cores and another one with just 8. I need multiple replicas of my Kubernetes pods (at least 6), and my 8-core node can only handle one instance. How can I ask Kubernetes to schedule the rest (5) on the more powerful node?
It would be good if I could scale up only on the required node; is that possible?
While Kubernetes is intelligent enough to spread pods across nodes with sufficient resources (CPU cores in this case), the following mechanisms can be used to fine-tune how pods are spread/load-balanced across the nodes in a cluster:
Adding labels to nodes and pods
Resource requests and limits for pods
nodeSelector, node affinity/anti-affinity, nodeName
Horizontal Pod Autoscaler
K8s Descheduler
In general, you should use resources.requests definitions in your workloads in order to let the scheduler know about the requirements of your application. This way the scheduler will take care of placing the pods where resources are available.
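A sketch of how this could look for the 6-replica case (names and the request size are illustrative; the idea is that the CPU request is large enough that the 8-core node fits only one replica, so the scheduler has to place the rest on the 64-core node):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:latest     # placeholder image
        resources:
          requests:
            cpu: "5"            # ~5 cores reserved per replica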

Which component in Kubernetes is responsible for resource allocation?

After the pods are scheduled onto a node in Kubernetes, which component is responsible for sharing resources among the pods on that node?
From https://kubernetes.io/docs/concepts/overview/components :
kube-scheduler - Component on the master that watches newly created pods
that have no node assigned, and selects a node for them to run on.
Factors taken into account for scheduling decisions include individual
and collective resource requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality,
inter-workload interference and deadlines.
After a pod is scheduled, the node's kubelet is responsible for enforcing the pod's requests and limits. Depending on the pod's quality of service class and the node's resource pressure, the pod can be evicted or restarted by the kubelet.
After scheduling
That will be the OS kernel.
You can reserve/limit pod resources: https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-resource-requests-and-limits.
Then it is passed from the kubelet down to the container runtime (e.g. Docker), then to cgroups, and finally to the kernel.
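You can observe this chain end-to-end: a memory limit set in the Pod spec ends up as a cgroup limit inside the container. A sketch, assuming a 128Mi limit (the file path depends on whether the node uses cgroup v1 or v2):

# cgroup v2:
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory.max
# cgroup v1:
kubectl exec <pod-name> -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# either should print 134217728 (= 128 * 1024 * 1024) for a 128Mi limit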

Does HorizontalPodAutoscaler make sense when there is only one Deployment on GKE (Google Container Engine) Kubernetes cluster?

I have a "homogeneous" Kubernetes setup. By this I mean that I am only running instances of a single type of pod (an http server) with a load balancer service distributing traffic to them.
By my reasoning, to get the most out of my cluster (edit: to be concrete -- getting the best average response times to http requests) I should have:
At least one pod running on every node: Not having a pod running on a node, means that I am paying for the node and not having it ready to serve a request.
At most one pod running on every node: The pods are threaded http servers so they can maximize utilization of a node, so running multiple pods on a node does not net me anything.
This means that I should have exactly one pod per node. I achieve this using a DaemonSet.
The alternative way is to configure a Deployment and apply a HorizontalPodAutoscaler to it and have Kubernetes handle the number of pods and pod to node mapping. Is there any disadvantage of my approach in comparison to this?
My evaluation is that the HorizontalPodAutoscaler is relevant mainly in heterogeneous situations, where one HorizontalPodAutoscaler can scale up a Deployment at the expense of another Deployment. But since I have only one type of pod, I would have only one Deployment and I would be scaling up that deployment at the expense of itself, which does not make sense.
HorizontalPodAutoscaler is actually a valid solution for your needs. To address your two concerns:
1. At least one pod running on every node
This isn't your real concern. The concern is underutilizing your cluster. However, you can be underutilizing your cluster even if you have a pod running on every node. Consider a three-node cluster:
Scenario A: pod running on each node, 10% CPU usage per node
Scenario B: pod running on only one node, 70% CPU usage
Even though Scenario A has a pod on each node, the cluster is actually less utilized than in Scenario B, where only one node has a pod.
2. At most one pod running on every node
The Kubernetes scheduler tries to spread pods around so that you don't end up with multiple pods of the same type on a single node. Since in your case the other nodes should be empty, the scheduler should have no problems starting the pods on the other nodes. Additionally, if you have the pod request resources equivalent to the node's resources, that will prevent the scheduler from scheduling a new pod on a node that already has one.
Now, you can achieve the same effect whether you go with a DaemonSet or an HPA, but I would personally go with the HPA since I think it fits your semantics better, and it would also work much better if you eventually decide to add other types of pods to your cluster.
Using a DaemonSet means that the pod has to run on every node (or some subset of nodes). This is a great fit for something like a logger or a metrics collector, which is per-node. But you really just want to use the available cluster resources to power your pod as needed, which matches the intent of the HPA better.
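A minimal sketch of an HPA for such a Deployment (autoscaling/v2; the names and thresholds are illustrative, and the pods need CPU requests set for utilization-based scaling to work):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-server          # the Deployment to scale
  minReplicas: 1
  maxReplicas: 10              # e.g. the number of nodes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale out above ~70% of requested CPU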
As an aside, I believe GKE supports cluster autoscaling, so you should never be paying for nodes that aren't needed.