CPU/Memory requests and limits per node - kubernetes

kubectl describe nodes gives information on the requests and limits for resources such as CPU and memory. However, the api endpoint api/v1/nodes doesn't provide this information.
Alternatively, I could also hit the api/v1/pods endpoint to get this information per pod which I can accumulate across nodes. But is there already a kubernetes API endpoint which provides the information pertaining to cpu/memory requests and limits per node?

From what I've found in the documentation, the component responsible for serving that information is the Kubernetes API server.
CPU and memory are each a resource type. A resource type has a base unit. CPU is specified in units of cores, and memory is specified in units of bytes.
CPU and memory are collectively referred to as compute resources, or just resources. Compute resources are measurable quantities that can be requested, allocated, and consumed. They are distinct from API resources. API resources, such as Pods and Services, are objects that can be read and modified through the Kubernetes API server.
Going further, on what a node actually is:
Unlike pods and services, a node is not inherently created by Kubernetes: it is created externally by cloud providers like Google Compute Engine, or exists in your pool of physical or virtual machines. What this means is that when Kubernetes creates a node, it is really just creating an object that represents the node. After creation, Kubernetes will check whether the node is valid or not.
[...]
Currently, there are three components that interact with the Kubernetes node interface: node controller, kubelet, and kubectl.
[...]
The capacity of the node (number of cpus and amount of memory) is part of the node object. Normally, nodes register themselves and report their capacity when creating the node object. If you are doing manual node administration, then you need to set node capacity when adding a node.
The Kubernetes scheduler ensures that there are enough resources for all the pods on a node. It checks that the sum of the requests of containers on the node is no greater than the node capacity. It includes all containers started by the kubelet, but not containers started directly by Docker, or processes not in containers.
Edit:
Alternatively, I could also hit the api/v1/pods endpoint to get this information per pod which I can accumulate across nodes.
That is, in fact, how it works under the hood.
But is there already a kubernetes API endpoint which provides the information pertaining to cpu/memory requests and limits per node?
The answer to this question is: no, there is not. Unfortunately, there is no endpoint that returns this information directly. kubectl makes several requests to build the output of describe nodes. When you run kubectl -v=8 describe nodes, you can see the GET calls it issues, in this order:
/api/v1/nodes?includeUninitialized=true
/api/v1/nodes/minikube
/api/v1/pods
/api/v1/events?fieldSelector
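Since there is no single endpoint, the per-node totals have to be computed client-side from the /api/v1/pods response, just as kubectl describe does. Below is a minimal sketch of that accumulation; the parse_cpu helper, the aggregate function and the sample payload are illustrative assumptions, not part of any official client library.

```python
# Sketch: accumulate per-node CPU requests/limits from a /api/v1/pods
# response body, since no endpoint returns them per node directly.
from collections import defaultdict

def parse_cpu(q):
    """Convert a CPU quantity string ("500m", "0.5", "2") to millicores."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def aggregate(pod_list):
    """Sum container CPU requests/limits (in millicores) per node."""
    per_node = defaultdict(lambda: {"requests": 0, "limits": 0})
    for pod in pod_list["items"]:
        node = pod["spec"].get("nodeName")
        if not node:  # pending pods are not assigned to a node yet
            continue
        for c in pod["spec"]["containers"]:
            res = c.get("resources", {})
            per_node[node]["requests"] += parse_cpu(res.get("requests", {}).get("cpu", "0"))
            per_node[node]["limits"] += parse_cpu(res.get("limits", {}).get("cpu", "0"))
    return dict(per_node)

# Hypothetical two-pod payload in the shape of /api/v1/pods:
pods = {"items": [
    {"spec": {"nodeName": "minikube", "containers": [
        {"resources": {"requests": {"cpu": "250m"}, "limits": {"cpu": "1"}}}]}},
    {"spec": {"nodeName": "minikube", "containers": [
        {"resources": {"requests": {"cpu": "0.5"}}}]}},
]}
print(aggregate(pods))  # {'minikube': {'requests': 750, 'limits': 1000}}
```

A real client would also need to handle memory quantities (Ki/Mi/Gi suffixes) and init containers, which are omitted here for brevity.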

Related

How does k8s manage containers using more cpu than requested without limits?

I'm trying to understand what happens when a container is configured with a CPU request but without a limit, and it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources.
Will k8s keep the container throttled on its current node, or will it be moved to another node with available resources? Do we know how/when k8s decides to move the container when it's throttled in such a case?
I would appreciate any extra resources to read on this matter, as I couldn't find anything that goes into detail on this specific scenario.
Q1) What happens when a container is configured with a CPU request and without a limit?
ANS:
If you do not specify a CPU limit
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
If you specify a CPU limit but do not specify a CPU request
If you specify a CPU limit for a Container but do not specify a CPU request, Kubernetes automatically assigns a CPU request that matches the limit. Similarly, if a Container specifies its own memory limit, but does not specify a memory request, Kubernetes automatically assigns a memory request that matches the limit.
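The defaulting rule quoted above can be sketched as a small function; apply_request_defaults is a hypothetical helper for illustration, not a real Kubernetes API call.

```python
# Sketch of the documented defaulting rule: if a container sets a limit
# but no request for a resource, the request defaults to the limit.
def apply_request_defaults(resources):
    limits = resources.get("limits", {})
    requests = dict(resources.get("requests", {}))
    for res, limit in limits.items():
        requests.setdefault(res, limit)  # request defaults to the limit
    return {"limits": limits, "requests": requests}

print(apply_request_defaults({"limits": {"cpu": "500m", "memory": "128Mi"}}))
# {'limits': {'cpu': '500m', 'memory': '128Mi'},
#  'requests': {'cpu': '500m', 'memory': '128Mi'}}
```

Note that an explicitly set request is never overwritten; the default only fills in a missing one.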
Q2) What happens when it tries to use more CPU than requested while the node is fully utilized, but there is another node with available resources?
ANS:
The Kubernetes scheduler is a control plane process which assigns Pods to Nodes. The scheduler determines which Nodes are valid placements for each Pod in the scheduling queue according to constraints and available resources. The scheduler then ranks each valid Node and binds the Pod to a suitable Node. Multiple different schedulers may be used within a cluster; kube-scheduler is the reference implementation. See scheduling for more information about scheduling and the kube-scheduler component.
Scheduling, Preemption and Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.
Q3) Will k8s keep the container throttled in its current node or will it be moved to another node with available resources?
ANS:
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or involuntarily.
Voluntary disruptions are started intentionally by application owners or cluster administrators. Involuntary disruptions are unintentional and can be triggered by unavoidable issues like Nodes running out of resources, or by accidental deletions.
Voluntary and involuntary disruptions
Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.
We call these unavoidable cases involuntary disruptions to an application.
Examples are:
a hardware failure of the physical machine backing the node
cluster administrator deletes VM (instance) by mistake
cloud provider or hypervisor failure makes VM disappear
a kernel panic
the node disappears from the cluster due to cluster network partition
eviction of a pod due to the node being out-of-resources.
Suggestion:
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
Command:
kubectl taint nodes node1 key1=value1:NoSchedule
Example (applying the disk-pressure taint key, which the node controller normally adds automatically):
kubectl taint nodes node1 node.kubernetes.io/disk-pressure:NoSchedule
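The taint/toleration matching that the scheduler performs (the PodToleratesNodeTaints check) can be sketched roughly as follows. This is a simplified illustration, assuming only the Equal and Exists operators and ignoring tolerationSeconds:

```python
# Illustrative sketch of taint/toleration matching: a pod is only
# schedulable onto a node if it tolerates all of the node's
# NoSchedule taints.
def tolerates(toleration, taint):
    # An empty key/effect on the toleration matches any key/effect.
    if toleration.get("key") not in (None, taint["key"]):
        return False
    if toleration.get("effect") not in (None, taint["effect"]):
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value") == taint.get("value")

def pod_fits(tolerations, taints):
    """A pod fits only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in taints if taint["effect"] == "NoSchedule"
    )

taints = [{"key": "key1", "value": "value1", "effect": "NoSchedule"}]
print(pod_fits([], taints))                                       # False
print(pod_fits([{"key": "key1", "operator": "Exists"}], taints))  # True
```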

What metrics Kubernetes scheduler depends on?

Does the Kubernetes scheduler place the pods on the nodes based only on their requested resources and the nodes' available resources at the current snapshot of the cluster, or does it also take into account the node's historical resource utilization?
In the official Kubernetes documentation we can find the process and the criteria used by kube-scheduler for choosing a node for a pod.
Basically, this is a 2-step process:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
Filtering step is responsible for getting list of nodes which actually are able to run a pod:
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
Scoring step is responsible for choosing the best node from the list generated by the filtering step:
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
When the node with the highest score is chosen, scheduler notifies the API server:
...picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that are taken into consideration for scheduling:
Individual and collective resource requirements
Hardware
Policy constraints
Affinity and anti-affinity specifications
Data locality
Inter-workload interference
Others...
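The 2-step flow above can be sketched in a few lines. This is a rough illustration only: the real kube-scheduler runs many filter and score plugins, while here there is just one PodFitsResources-style filter and one least-requested score, with all quantities in CPU millicores.

```python
# Sketch of the scheduler's filter-then-score flow for a single resource.
def schedule(pod_request, nodes):
    # Filtering: keep nodes whose free capacity covers the pod's request.
    feasible = [n for n in nodes if n["capacity"] - n["requested"] >= pod_request]
    if not feasible:
        return None  # no feasible node: the pod stays pending
    # Scoring: prefer the node with the most free capacity (least requested).
    return max(feasible, key=lambda n: n["capacity"] - n["requested"])["name"]

nodes = [
    {"name": "node-a", "capacity": 2000, "requested": 1800},
    {"name": "node-b", "capacity": 2000, "requested": 500},
]
print(schedule(500, nodes))   # node-b (node-a has only 200m free)
print(schedule(4000, nodes))  # None: nothing fits
```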
More detailed information about these parameters can be found here:
The following predicates implement filtering:
PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
MatchNodeSelector: Checks if a Pod's Node Selector matches the Node's label(s).
NoVolumeZoneConflict: Evaluate if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that's over a configured limit.
PodToleratesNodeTaints: checks if a Pod's tolerations can tolerate the Node's taints.
CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.
The following priorities implement scoring:
SelectorSpreadPriority: Spreads Pods across hosts, considering Pods that belong to the same Service, StatefulSet or ReplicaSet.
InterPodAffinityPriority: Implements preferred inter-pod affinity and anti-affinity.
LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
BalancedResourceAllocation: Favors nodes with balanced resource usage.
NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn't run on the same Node.
NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node's rank taking that list into account.
ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don't have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
EqualPriority: Gives an equal weight of one to all nodes.
EvenPodsSpreadPriority: Implements preferred pod topology spread constraints.
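To make LeastRequestedPriority and MostRequestedPriority concrete, the classic scoring idea scales the free (or used) fraction of a node's capacity to a 0-10 score. Treat the following as an illustration of the idea, not the exact plugin code:

```python
# Sketch of least-requested vs most-requested node scoring (0-10 scale).
def least_requested(requested, capacity):
    # More free capacity -> higher score -> spreads load across nodes.
    return (capacity - requested) * 10 // capacity

def most_requested(requested, capacity):
    # More used capacity -> higher score -> packs pods onto fewer nodes.
    return requested * 10 // capacity

# A half-loaded 2-core (2000m) node scores the same under both policies:
print(least_requested(1000, 2000))  # 5
print(most_requested(1000, 2000))   # 5
# A nearly empty node scores high for least-requested, low for most-requested:
print(least_requested(200, 2000))   # 9
print(most_requested(200, 2000))    # 1
```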
Answering your question:
Does it take into account the node's historical resource utilization?
As you can see, there are no parameters in the above list related to historical resource utilization. I also did some research and didn't find any information about it.

Kubernetes release requested cpu

We have a Java application distributed over multiple pods on Google Cloud Platform. We also set memory requests to give the pod a certain part of the memory available on the node for heap and non-heap space.
The application is very resource-intensive in terms of CPU while starting the pod, but barely uses the CPU after the pod is ready (only 0.5% is used). If we use container resource requests, the pod does not release these resources after startup has finished.
Does Kubernetes allow specifying that a pod may use (nearly) all the CPU power available during startup and release those resources afterwards? Due to rolling updates, we can ensure that no two pods are started at the same time.
Thanks for your help.
If you specify requests without a limit, the value will be used for scheduling the pod to an appropriate node that can satisfy the requested CPU bandwidth. The kernel scheduler will assume that the requests match the actual resource consumption, but will not prevent usage from exceeding them; the excess is effectively 'stolen' from other containers.
If you specify a limit as well, your container will get throttled if it tries to exceed that value. You can combine both to allow bursting CPU usage, exceeding the usual requests without claiming everything from the node and slowing down other processes.
"Does Kubernetes allow to specify that a pod is allowed to use
(nearly) all the cpu power available during start and release those
resources after that?"
A key word here is "available". The answer is "yes" and it can be achieved by using Burstable QoS (Quality of Service) class. Configure CPU request to a value you expect the container will need after starting up, and either:
configure CPU limit higher than the CPU request, or
don't configure a CPU limit, in which case either the namespace's default CPU limit will apply if defined, or the container "...could use all of the CPU resources available on the Node where it is running".
If there isn't CPU available on the Node for bursting, the container won't get any beyond the requested value and as result the starting of the application could be slower.
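How the Burstable class mentioned above falls out of the container spec can be sketched from the documented rules: Guaranteed requires CPU and memory requests equal to limits for every container, BestEffort means no requests or limits at all, and everything else is Burstable. The qos_class helper below is an illustrative simplification, not real Kubernetes code:

```python
# Sketch of deriving a pod's QoS class from its containers' resources.
def qos_class(containers):
    has_any = False
    guaranteed = True
    for c in containers:
        req = c.get("requests", {})
        lim = c.get("limits", {})
        if req or lim:
            has_any = True
        for res in ("cpu", "memory"):
            # Guaranteed needs request == limit, and a limit must be set.
            if lim.get(res) is None or req.get(res) != lim.get(res):
                guaranteed = False
    if not has_any:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# CPU request below the limit -> Burstable, so the container can burst:
print(qos_class([{"requests": {"cpu": "200m", "memory": "64Mi"},
                  "limits": {"cpu": "1", "memory": "64Mi"}}]))  # Burstable
```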
It is worth mentioning what the docs explain for Pods with multiple Containers:
The CPU request for a Pod is the sum of the CPU requests for all the
Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of
the CPU limits for all the Containers in the Pod.
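The summing rule quoted above is straightforward to express in code; parse_cpu and pod_cpu below are illustrative helpers, with quantities converted to millicores:

```python
# Sketch: a pod's effective CPU request/limit is the sum over its containers.
def parse_cpu(q):
    """Convert a CPU quantity string ("500m", "0.3", "1") to millicores."""
    return int(q[:-1]) if q.endswith("m") else int(float(q) * 1000)

def pod_cpu(containers, field):
    """Sum the given field ("requests" or "limits") across containers."""
    return sum(parse_cpu(c.get(field, {}).get("cpu", "0")) for c in containers)

containers = [
    {"requests": {"cpu": "200m"}, "limits": {"cpu": "500m"}},
    {"requests": {"cpu": "0.3"}, "limits": {"cpu": "1"}},
]
print(pod_cpu(containers, "requests"))  # 500
print(pod_cpu(containers, "limits"))    # 1500
```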
If running Kubernetes v1.12+ and have access to configure kubelet, the Node CPU Management Policies could be of interest.
One factor for scheduling pods onto nodes is resource availability, and the Kubernetes scheduler calculates used resources from the request value of each pod. If you do not assign any value to the request parameter, then for this deployment the request will be zero. The request parameter doesn't ensure that the pod will actually use that much CPU or RAM; you can get the current usage of resources with kubectl top pods / kubectl top nodes.
The request parameter reserves resources for a pod, whereas a limit puts a cap on a pod's resource usage.
You can get more information here: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/. This will give you a rough idea of requests and limits.

What does it mean OutOfcpu error in kubernetes?

I got an OutOfcpu error in Kubernetes on Google Cloud. What does it mean? My pods seem to be working now; however, there were pods in this same revision which got OutOfcpu.
It means that the kube-scheduler can't find any node with available CPU to schedule your pods:
kube-scheduler selects a node for the pod in a 2-step operation:
Filtering
Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod's specific resource requests.
[...]
PodFitsResources: Checks if the Node has free resources (eg, CPU and Memory) to meet the requirement of the Pod.
Also, as per Assigning Pods to Nodes:
If the named node does not have the resources to accommodate the pod, the pod will fail and its reason will indicate why, e.g. OutOfmemory or OutOfcpu.
In addition to how-kube-scheduler-schedules-pods, I think this will be helpful to understand why the OutOfcpu error showed up.
When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node. Note that although actual memory or CPU resource usage on nodes is very low, the scheduler still refuses to place a Pod on a node if the capacity check fails. This protects against a resource shortage on a node when resource usage later increases, for example, during a daily peak in request rate.
Ref: how-pods-with-resource-requests-are-scheduled
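The capacity check described above compares the sum of requests against capacity, ignoring actual usage, which is exactly the situation where OutOfcpu-style refusals occur. A minimal sketch, with all quantities in millicores and the fits helper an illustrative assumption:

```python
# Sketch: the scheduler's capacity check is based on requests, not usage.
def fits(node_capacity_m, scheduled_requests_m, new_request_m):
    """True if the new pod's CPU request fits within remaining capacity."""
    return sum(scheduled_requests_m) + new_request_m <= node_capacity_m

# A 2-core (2000m) node already has 1900m requested, even if real CPU
# usage on the node is near zero:
print(fits(2000, [1000, 900], 500))  # False: the pod is refused
print(fits(2000, [1000, 900], 100))  # True: 2000m total still fits
```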

how much percentage of resources are used by kube-master from kube-minion for spawning pod

I am trying to add a new minion to my existing k8s cluster. I am wondering what percentage of the added resources is actually used by the k8s master for spawning pods?
I think you are asking about the pod management overhead on individual nodes, i.e., how much cpu/memory it takes for the per-node agent/daemon to run N pods for you on that node. This is different from the "master node" which usually runs a few master components/pods.
The resource usage depends on many factors. To name a few: the container runtime (e.g., Docker/rkt) and its version/configuration, the number of pods, the type of pods, and so on. To give users a better picture of the Kubernetes per-node overhead, there is an ongoing effort to create a node benchmark to let you profile the performance and resource usage on your node.
For now, you can check the k8s perf dashboard and select "kubelet perf 100" from the drop down menu to see what's the resource usage of kubelet (the node agent) and docker for managing a 100 "do-nothing" pods on the node. It should be noted that this is the best case (i.e., steady state) where no pod operations (creation/deletion) are performed. You are expected to see cpu spikes when there are multiple concurrent operations (e.g., creating 30 pods).