As far as I understand from the VPA documentation, the Vertical Pod Autoscaler stops/restarts a pod based on the predicted lower/upper bounds and target of the request/limit.
In "Auto" mode it says the pod will be stopped and restarted. However, I don't see the point of making a prediction and restarting the pod while it is still working: although we know it might eventually run out of resources, it is still working, and we could wait and rescale it once it has actually run out of memory/CPU. Isn't it more efficient to just wait for the pod to run out of memory/CPU and then restart it with the new predicted request?
Is recovering from a dead container more costly than stopping and restarting the pod ourselves? If yes, in what ways?
Isn't it more efficient to just wait for the pod to go out of
memory/cpu and then restart it with the new predicted request?
In my opinion this is not the best solution. If a pod tries to use more CPU than its limit, the container's CPU usage is throttled; if a container tries to use more memory than its limit, Kubernetes OOM-kills the container. Because limits can be overcommitted (the sum of pod limits on a node can be higher than the node's capacity), this can lead to memory exhaustion on the node and cause the death of other workloads/pods.
Answering your question: VPA was designed to simplify exactly those scenarios:
Vertical Pod Autoscaler (VPA) frees the users from necessity of
setting up-to-date resource limits and requests for the containers in
their pods. When configured, it will set the requests automatically
based on usage and thus allow proper scheduling onto nodes so that
appropriate resource amount is available for each pod. It will also
maintain ratios between limits and requests that were specified in
initial containers configuration.
In addition, VPA is not only responsible for scaling up but also for scaling down:
it can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
Is recovering from a dead container more costly than stopping and
restarting the pod ourselves? If yes, in what ways?
As for the cost of recovering from a dead container: the main possible cost is requests that may get lost during the OOM-kill process, as per the official docs.
As per the official documentation, VPA operates in the following modes (an example object follows the list):
"Auto": VPA assigns resource requests on pod creation as well as
updates them on existing pods using the preferred update mechanism
Currently this is equivalent to "Recreate".
"Recreate": VPA assigns resource requests on pod creation as well as
updates them on existing pods by evicting them when the requested
resources differ significantly from the new recommendation (respecting
the Pod Disruption Budget, if defined).
"Initial": VPA only assigns resource requests on pod creation and
never changes them later.
"Off": VPA does not automatically change resource requirements of the
pods.
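For illustration, a minimal VerticalPodAutoscaler object running in "Auto" mode could look roughly like the sketch below; the Deployment name my-app and the min/max bounds are made-up placeholders, and the exact apiVersion depends on the VPA version installed in your cluster.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:                      # the workload whose pods VPA should manage
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                  # hypothetical Deployment name
  updatePolicy:
    updateMode: "Auto"            # or "Recreate", "Initial", "Off" as described above
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:                 # optional guard rails for the recommendations
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi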
NOTE:
VPA Limitations
VPA recommendations might exceed available resources, such as your cluster capacity or your team's quota. Insufficient available resources may cause pods to go Pending.
VPA in Auto or Recreate mode won’t evict pods with one replica as this would cause disruption.
Quick memory growth might cause the container to be OOM-killed. As OOM-killed pods aren't rescheduled, VPA won't apply new resources in that case.
Please also take a look at some of the VPA Known limitations:
Updating running pods is an experimental feature of VPA. Whenever VPA updates the pod resources the pod is recreated, which causes all
running containers to be restarted. The pod may be recreated on a
different node.
VPA does not evict pods which are not run under a controller. For such pods Auto mode is currently equivalent to Initial.
VPA reacts to most out-of-memory events, but not in all situations.
Additional resources:
VERTICAL POD AUTOSCALING: THE DEFINITIVE GUIDE
I'm running a pod on k8s that has a baseline memory usage of about 1GB. Sometimes, depending on user behaviour, the pod can consume much more memory (10-12GB) for a few minutes, and then drop down to the baseline level.
What I would like to configure in k8s is that the memory request would be quite low (1GB), but that the pod will run on a node with much higher memory capacity. This way, when needed, the pod will just grow and shrink back. The reason I don't configure the request to be higher is that I have multiple replicas of this pod, and ideally I would want all of them to be hosted on 1-2 nodes and let each one peak when needed, without spending too much.
It was counterintuitive to find out that the memory limit configuration does not affect node selection, meaning that if I configure the limit to be 12GB I can still get a 4GB node.
Is there any way to configure my pods to share some large nodes, so they will be able to extend their memory usage without crashing?
Resource Requests vs Limits
Memory limit doesn't affect node selection, but memory request does. This is because resource requests and resource limits serve different purposes.
A resource request makes sure that your pod has at least the requested amount of resources at all times.
A resource limit makes sure that your pod does not use more than the limit amount of resources at any time; e.g., not setting memory limits correctly leads to OOM (out-of-memory) kills on pods, and this happens very often.
Resource requests are the parameters used by the scheduler to determine how to schedule your pods onto nodes. Limits are not used for this purpose, and it is up to the application designer to set pod limits correctly.
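As a sketch, requests and limits are set per container in the pod spec; the numbers below simply mirror the scenario above and are not a recommendation, and the name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: web-service                        # hypothetical pod name
spec:
  containers:
  - name: app
    image: example.com/web-service:latest  # placeholder image
    resources:
      requests:
        memory: "1536Mi"                   # what the scheduler reserves on a node
      limits:
        memory: "12Gi"                     # hard cap; exceeding it gets the container OOM-killed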
So What Can You Do
You can set your pod request the same way you are doing currently. Say, 1.5GiB of memory. You can then, from your experience, check that your pod can consume as much as 12GiB of memory. And you are running 2 pods of this application. Therefore, if you want to schedule both these pods on a single node and not have any issues, you have to make sure your node has:
total_memory = 2 * pod_limit + overhead (system) OR
total_memory ~= 2 * pod_limit
Overhead is the memory overhead of your idle node, which accounts for usage by certain OS components and cluster system components such as the kubelet and kube-proxy binaries. This is generally a very small value.
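Plugging in the numbers from the question as a rough sketch (the 32GiB figure is only an illustrative node size):
total_memory ~= 2 * 12GiB + overhead ~= 24-25GiB
so any node with roughly 32GiB of allocatable memory would comfortably host both replicas even when both peak at the same time.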
This will allow you to select the right node for your pods.
Summary
You will have to ensure your pod can get a lot more memory in order to handle that spike (but not too much, so set a limit). You can read more about this here.
Note: Strictly speaking, pods don't shrink and grow; they have a fixed resource profile, which is determined by the resource requests and resource limits in the definition. Also note that a pod is always "occupying" the requested amount of resources, even if it is not using them.
Other Elements to Control Scheduling
Additionally, you can use node affinities to express scheduling rules for your pods. Note that the preferred (preferredDuringSchedulingIgnoredDuringExecution) form of an affinity is only a guideline, not a definite rule, so if a pod is otherwise in danger of not being scheduled, the scheduler may ignore such a preference; the required form is a hard constraint. You can also use anti-affinities to make sure certain pods don't get scheduled on the same node.
Another option is to use a nodeSelector on your pod, to make sure it lands on a node whose labels match the selector. If no node matches the selector, your pod will be stuck in the Pending state, meaning it cannot be scheduled anywhere. You can read about this here.
So, after you've decided which nodes you want your pod to be scheduled on, you can label them, and use a selector to ensure the pods are scheduled on the matching node.
It is also possible to set an explicit nodeName on your pod, to force it to schedule on a particular node. This option is rarely used, simply because nodes can be added to or deleted from your cluster, which would require you to change your pod definition every time that happens. Using a nodeSelector is better: you specify a general attribute such as a label, attach that label to any new node you add to the cluster, and neither the pod scheduling nor the pod definition is affected.
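Tying the nodeSelector and anti-affinity ideas together, a rough sketch could look like the following; the label memory-tier=high, the app name and the image are all placeholders you would adapt to your cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      nodeSelector:
        memory-tier: high                   # label you attach to your large-memory nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:   # soft rule: prefer spreading replicas
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web-service
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: example.com/web-service:latest   # placeholder image
        resources:
          requests:
            memory: "1536Mi"
          limits:
            memory: "12Gi"

You would label the chosen nodes with something like kubectl label nodes <node-name> memory-tier=high so the selector has nodes to match.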
Hope this helps.
Excuse me for asking something that has a lot of overlap with many specific questions about the same knowledge area. I am curious to know whether Kubernetes will scale up a pod in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we ran into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minCount to 2, effectively increasing the pod count just to accommodate future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
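As a sketch, the combination that blocks a drain is simply this pairing (names are placeholders; older clusters use policy/v1beta1 for the PDB):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1          # never allow a voluntary disruption to drop below 1 pod
  selector:
    matchLabels:
      app: my-app          # matches a Deployment running with replicas: 1

With only one replica there is never a moment when that pod can be evicted without violating minAvailable, so the drain stalls instead of triggering a scale-up.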
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.
I have a k8s cluster, and in our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction from happening, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we know well before a pod gets evicted, which helps us take measures before it happens.
The other reason for pod eviction is a node going into the NotReady state; kube-controller-manager then waits for pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes NotReady. After this alert we want to run some clean-up on the application side so the application can end gracefully, but this clean-up needs more than a few hours, while pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a value?
P.S.: I know that during this wait time, if the pod utilises more resources, the kubelet can itself evict the pod. I want to know what other impacts waiting for such a long time has.
As @coderanger said, your limits are incorrect and this should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, whatever the issue was, it will by default be rescheduled based on your configuration.
If you are having a problem with this then I would recommend redoing your architecture and rewriting the app to use Kubernetes how it's supposed to be used.
if you are getting problems with a pod still being sent requests while it's unresponsive, you should put a load balancer in front of it or queue the requests,
if you have a problem with IPs changing after pod restarts, fix it by using DNS and a Service instead of connecting directly to a pod (see the sketch after this list),
if your pod is being evicted, check why and set appropriate limits and requests.
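For the second point, a minimal Service sketch that gives the pods a stable DNS name might look like this (all names and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: my-app              # clients resolve my-app (or my-app.<namespace>.svc.cluster.local)
spec:
  selector:
    app: my-app             # matches the pod labels, whatever the pod IP is after a restart
  ports:
  - port: 80                # port clients connect to
    targetPort: 8080        # placeholder container port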
As for the node, there is a really nice blog post about Improving Kubernetes reliability: quicker detection of a Node down. It is the opposite of what you are thinking of doing, but it also explains why 340s is already too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to change default values to higher you can look into changing these:
kubelet: node-status-update-frequency=10s
controller-manager: node-monitor-period=5s
controller-manager: node-monitor-grace-period=40s
controller-manager: pod-eviction-timeout=5m
to higher ones.
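On a kubeadm-managed cluster, one way to pass such controller-manager flags is via extraArgs in the ClusterConfiguration; the sketch below is only illustrative, the exact apiVersion depends on your kubeadm version, the 300m value is the one from the question rather than a recommendation, and note that on newer Kubernetes versions taint-based eviction (tolerationSeconds on the pods' not-ready/unreachable tolerations) effectively controls this instead of the controller-manager flag:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-period: "5s"           # defaults shown for the first two flags
    node-monitor-grace-period: "40s"
    pod-eviction-timeout: "300m"        # the question's value; the default is 5m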
If you provide more details I'll try to help more.
I think I have a pretty simple scenario: I need to auto-scale on Google Kubernetes Engine with a pod that runs one per node and uses all available remaining resources on the node.
"Remaining" resources means that there are certain basic pod services running on each node such logging and metrics, which need their requested resources. But everything left should go to this particular pod, which is in fact the main web service for my cluster.
Also, these remaining resources should be available when the pod's container starts up, rather than through vertical autoscaling with pod restarts. The reason is that the container has certain constraints that make restarts sort of expensive: heavy disk caching, and issues with licensing of some 3rd party software I use. So although certainly the container/pod is restartable, I'd like to avoid except for rolling updates.
The cluster should scale nodes when CPU utilization gets too high (say, 70%). And I don't mean requested CPU utilization of a node's pods, but rather the actual utilization, which is mainly determined by the web service's load.
How should I configure the cluster for this scenario? I've seen there's cluster auto scaling, vertical pod autoscaling, and horizontal pod autoscaling. There's also Deployment vs DaemonSet, although it does not seem that DaemonSet is designed for pods that need to scale. So I think Deployment may be necessary, but in a way that limits one web service pod per node (pod anti affinity??).
How do I put all this together?
You could set up a Deployment with a resource request that equals a single node's allocatable resources (i.e., total resources minus auxiliary services as you mentioned). Then configure Horizontal Pod Autoscaling to scale up your deployment when CPU request utilization goes above 70%; this should do the trick as in this case request utilization rate is essentially the same as total node resource utilization rate, right? However if you do want to base scaling on actual node CPU utilization, there's always scaling by external metrics.
Technically the Deployment's resource request doesn't have to exactly equal the remaining resources; it's enough for the request to be large enough to prevent two pods from being run on the same node. As long as that's the case and there are no resource limits, the pod ends up being able to consume all the available node resources.
Finally, configure cluster autoscaling on your GKE node pool and we should be good to go. Vertical Pod Autoscaling doesn't really come into play here, as the pod resource request stays constant, and DaemonSets aren't applicable as they can't be scaled via an HPA, as mentioned.
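Putting the HPA piece into a sketch (the Deployment name web-service and the replica bounds are placeholders; older clusters use autoscaling/v2beta2):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-service            # hypothetical Deployment sized to roughly one node per pod
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # % of the pods' CPU request, which here approximates node utilization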
I've read as much Kubernetes documentation as I can find, but I'm still having trouble understanding a specific scenario I have in mind.
For the sake of example, let's say I have a single node with 1GB of memory. I also have a deployment that wants 100 pods with memory limits set to 100MB and memory requests unset. The pods only use 1MB most of the time, but can sometimes jump up to 99MB.
Question 1: Will all 100 pods be scheduled onto the node?
Now, let's say all the pods simultaneously start using 99MB of memory each and stay there. There isn't enough memory on the machine to handle that, but none of the pods have exceeded their memory limit. I'm assuming Kubernetes evicts some pods at this point.
Question 2: When Kubernetes tries to reschedule the evicted pods, does it succeed since there is no memory request set? What happens when the node immediately runs out of memory again? Does this eviction, rescheduling keep happening over and over? If so, is there some metric that I can use to detect that this is happening?
A pod will be scheduled as long as there is an eligible node that can satisfy the requested resources. So if you do not specify a request, the pod will pretty much always get scheduled. Requests and limits are totally different things: a request is a condition for a pod to be scheduled, and a limit is a condition for a pod that is already scheduled and running.
If you overcommit the actual resources on a node you will run into typical issues: if you overcommit on memory the node will start to swap, and with CPU there will just be a general slowdown. Either way the node and the pods on it will become unresponsive. This is difficult to deal with, and requests and limits set up sane boundaries that help you not take things quite that far, where instead you'll simply see a pod fail to schedule.
When the Kubernetes scheduler schedules a pod running on a node, it will always ensure that the total limits of the containers are less than the node capacity. If a node runs out of resources, Kubernetes will not schedule any new containers running on it. If no node is available when you launch a pod, the pod will remain pending, since the Kubernetes scheduler will be unable to find any node that could run your desired pod.
Kubernetes Cookbook
I think this excerpt gives you some understanding of how it works internally. So, answers to your questions:
At most 10 pods will be scheduled onto your node. Since the containers set a limit but no request, Kubernetes defaults each request to the limit (100MB), and the scheduler fits pods by their requests against the node's allocatable memory.
If there is no free memory on the node, evicted pods will stay Pending when rescheduled. Also, under node resource pressure, k8s can evict pods whose usage exceeds their requests in order to free resources for other pods and services.
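To make the scheduling math above concrete: when a container specifies a limit but no request, Kubernetes copies the limit into the request, so each pod below effectively requests 100Mi, and a node with roughly 1Gi of allocatable memory fits about ten of them (the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
  - name: app
    image: example.com/app:latest    # placeholder image
    resources:
      limits:
        memory: "100Mi"              # with no request set, the request defaults to this value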