Kubernetes eviction does not follow rolling update strategy

context:
Deployment/Service/Pod with one replica (because it uses an expensive shared resource; it would be a big rewrite to change this)
Rolling update strategy: maxSurge: 1, maxUnavailable: 0 (because short-term sharing of the expensive resource is acceptable; see the manifest sketch after the timelines below)
Unwanted downtime of the service when the pod is evicted for predictable reasons
I know the rolling update strategy is followed perfectly when I run kubectl rollout restart ..., but when a pod is killed due to predictable reasons (in our setup, typically the autoscaler downscaling the cluster), this doesn't happen.
Is there a functionality in k8s that I'm missing that can emulate the behaviour of a rolling update strategy?
The closest I'm getting is
a PodDisruptionBudget forcing at least one pod to be online
a rollout restart every X minutes
but I can't help feeling there should be some form of trigger or interaction that can be used to cleanly solve this.
for clarity:
wanted timeline:
the cluster autoscaler signals (via the Eviction API) that the pod needs to go
a new pod is created (double resource usage start)
the new pod is ready
the old pod is killed (double resource usage end)
no downtime, slight period of double resource usage
actual timeline:
the Eviction API is called
old pod is killed (downtime start)
new pod is created
new pod is ready (downtime end)
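For reference, a minimal sketch of the setup described above, assuming a plain Deployment plus a PodDisruptionBudget; the names and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: expensive-service            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: expensive-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                    # short-term double resource usage is acceptable
      maxUnavailable: 0              # never drop below one ready pod during a rollout
  template:
    metadata:
      labels:
        app: expensive-service
    spec:
      containers:
      - name: app
        image: example/app:latest    # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: expensive-service-pdb
spec:
  minAvailable: 1                    # blocks eviction, but does not trigger a surge pod first
  selector:
    matchLabels:
      app: expensive-service

Note that the strategy block only governs rollouts driven by the Deployment itself; the PodDisruptionBudget is what the Eviction API consults, which is exactly the gap described in the timelines above.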

Related

What is the use of the Vertical Pod Autoscaler "Auto" mode?

As far as I understand from the VPA documentation, the Vertical Pod Autoscaler stops/restarts the pod based on the predicted request/limit lower/upper bounds and target.
In "Auto" mode it says that the pod will be stopped and restarted. However, I don't get the point of making a prediction and restarting the pod while it is still working: although we know it might run out of resources eventually, it is still working, and we could wait to rescale it once it has really run out of memory/CPU. Isn't it more efficient to just wait for the pod to run out of memory/CPU and then restart it with the new predicted request?
Is recovering from a dead container more costly than stopping and restarting the pod ourselves? If yes, in what ways?
Isn't it more efficient to just wait for the pod to go out of
memory/cpu and then restart it with the new predicted request?
In my opinion this is not the best solution. If the pod tries to use more CPU than its limit allows, the container's CPU use is throttled; if the container tries to use more memory than its limit, Kubernetes OOM-kills the container. Also, due to limit overcommit, the sum of pod limits on a node can be higher than the node's capacity, so this can lead to memory exhaustion on the node and can cause the death of other workloads/pods.
Answering your question: VPA was designed to simplify these scenarios:
Vertical Pod Autoscaler (VPA) frees the users from necessity of
setting up-to-date resource limits and requests for the containers in
their pods. When configured, it will set the requests automatically
based on usage and thus allow proper scheduling onto nodes so that
appropriate resource amount is available for each pod. It will also
maintain ratios between limits and requests that were specified in
initial containers configuration.
In addition, VPA is not only responsible for scaling up but also for scaling down:
it can both down-scale pods that are over-requesting resources, and also up-scale pods that are under-requesting resources based on their usage over time.
Is recovering from a dead container more costly than stopping and
restarting the pod ourselves? If yes, in what ways?
Talking about the cost of recovering from a dead container: the main possible cost is requests that may get lost during the OOM-kill process, as per the official doc.
As per the official documentation, the VPA operates in these modes:
"Auto": VPA assigns resource requests on pod creation as well as
updates them on existing pods using the preferred update mechanism
Currently this is equivalent to "Recreate".
"Recreate": VPA assigns resource requests on pod creation as well as
updates them on existing pods by evicting them when the requested
resources differ significantly from the new recommendation (respecting
the Pod Disruption Budget, if defined).
"Initial": VPA only assigns resource requests on pod creation and
never changes them later.
"Off": VPA does not automatically change resource requirements of the
pods.
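For orientation, this is where the mode is set on a VerticalPodAutoscaler object; a minimal sketch, assuming the VPA CRDs (autoscaling.k8s.io/v1) are installed and using a placeholder target name:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # placeholder: the workload whose requests VPA manages
  updatePolicy:
    updateMode: "Auto"        # one of "Auto", "Recreate", "Initial", "Off" as listed above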
NOTE:
VPA Limitations
The VPA recommendation might exceed available resources, such as your cluster capacity or your team's quota. Not enough available resources may cause pods to go pending.
VPA in Auto or Recreate mode won’t evict pods with one replica as this would cause disruption.
Quick memory growth might cause the container to be out-of-memory killed. As out-of-memory-killed pods aren't rescheduled, VPA won't apply the new resources.
Please also take a look at some of the VPA Known limitations:
Updating running pods is an experimental feature of VPA. Whenever VPA updates the pod resources the pod is recreated, which causes all
running containers to be restarted. The pod may be recreated on a
different node.
VPA does not evict pods which are not run under a controller. For such pods Auto mode is currently equivalent to Initial.
VPA reacts to most out-of-memory events, but not in all situations.
Additional resources:
VERTICAL POD AUTOSCALING: THE DEFINITIVE GUIDE

Will k8s scale a pod within HPA range to evict it and meet disruption budget?

Excuse me for asking something that has a lot of overlap with many specific questions in the same knowledge area. I am curious to know whether Kubernetes will scale a pod up in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we run into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minAvailable to 2, effectively increasing pod amount only for future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
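For reference, a minimal sketch of the combination described in the question; the names are placeholders, and in the autoscaling/v2 API the bounds are called minReplicas/maxReplicas rather than minCount/maxCount (the CPU target below is just an illustrative value):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # illustrative target, not taken from the question
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app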
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
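As a concrete example, the drain in step 1 of that sequence is typically issued with something like the following (the node name is a placeholder):

# Cordons the node and evicts its pods via the Eviction API;
# evictions that would violate a PodDisruptionBudget are retried rather than forced.
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data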
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.

In Kubernetes, can draining a node temporarily scale up a deployment?

Does kubernetes ever create pods as the result of draining a node? I'm not sure whether this is a feature I'm missing, or whether I just haven't found the right docs for it.
So here's the problem: I've got services that want to be always-on, but typically want to be a single container (for various stupid reasons having to do with them being more stateful than they should be). It's ok to run two containers temporarily during deployment or maintenance activities, though. So in ECS, I would say "desired capacity 1, maximumPercent 200%, minimumHealthyPercent 100%". Then, if I need to replace a cluster node, ECS would automatically scale the service up, and once the new task was passing health checks, it would stop the old task and then the node could continue draining.
In kubernetes, the primitives seem to be less-flexible: I can set a pod disruption budget to prevent a pod from being evicted. But I don't see a way to get a deployment to temporarily scale up as a result of a node being drained. The pod disruption budget object in kubernetes, being mostly independent of a deployment or replica set, seems to mainly act as a blocker to draining a node, but not as a way to eagerly trigger scale-up.
In Kubernetes, deployments will only create new pods when current replicas are below desired replicas. In other words, the creation of a new Pod is triggered after the disruption.
By design, deployments do not observe disruption events (and it's probably not possible, as there are many kinds of voluntary action), nor the Eviction API directly. Hence deployments never scale up automatically.
You are probably looking for something like the Horizontal Pod Autoscaler. However, this only scales based on resource consumption.
I would deploy at least 2 replicas and use a PodDisruptionBudget, as your service (application) is critical and should run 24/7/365. Node maintenance is not the only concern; there are many other reasons (voluntary and involuntary) for which a pod can come down or be rescheduled.

Kubernetes: Proactively trigger nodes' upscaling before rollout

When performing a new rollout via the RollingUpdate strategy, I have found that the main bottleneck in terms of duration is the creation of the new nodes (by the Cluster Autoscaler) needed to accommodate the simultaneous presence of old and new pods.
Although tweaking a bit .spec.strategy.rollingUpdate.maxUnavailable and .spec.strategy.rollingUpdate.maxSurge values can mitigate the side effects a bit, I see that if I proactively (and manually for the time being) spin up new nodes, the time of the new rollout reduces dramatically.
Are there any off-the-shelf tools that perform this kind of task?
If not, any recommended strategy to go about this would be highly appreciated.
If the cluster autoscaler is just acting on "simple" resource requests, such as CPU and memory, you could launch a couple of "warm-up" Pods at the beginning of your CI process, which will give the autoscaler a heads-up that a new deployment is coming very soon.
Something like kubectl run expensive-sleep --image=busybox:latest --requests=memory=32Gi --restart=Never -- sleep 300 would poke the autoscaler; the placeholder Pod then exits, but the autoscaler usually does not scale Nodes down immediately, so you would have some freshly provisioned Nodes waiting for the actual rollout of your Deployment.
If the autoscaler is making more complicated decisions, involving GPUs, taints/tolerations, availability zones, and what have you, then the trick may require more than just a kubectl run, but I believe the underlying idea is plausible.
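Depending on your kubectl version, the --requests flag may no longer be available; in that case an equivalent throwaway Pod manifest does the same job. The name, image, and memory figure here are just placeholders sized to force a scale-up:

apiVersion: v1
kind: Pod
metadata:
  name: expensive-sleep          # same idea as the kubectl run example above
spec:
  restartPolicy: Never
  containers:
  - name: sleep
    image: busybox:latest
    command: ["sleep", "300"]    # exits after five minutes; the new node usually lingers longer
    resources:
      requests:
        memory: 32Gi             # large enough that the autoscaler must add a node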

Kubernetes PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction?

If I have the following Kubernetes objects:
Deployment with rollingUpdate.maxUnavailable set to 1.
PodDisruptionBudget with maxUnavailable set to 1.
HorizontalPodAutoscaler setup to allow auto scaling.
Cluster auto-scaling is enabled.
If the cluster was under load and is in the middle of scaling up, what happens:
During a rolling update? Do the new Pods added due to the scale-up use the new version of the Pod?
When a node needs to be restarted or replaced? Does the PodDisruptionBudget stop the restart completely? Does the HorizontalPodAutoscaler scale up the number of nodes before taking down another node?
When Pod anti-affinity is set to avoid placing two Pods from the same Deployment on the same node?
As in documentation:
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec.
So it partially depends on the controller configuration and implementation. I believe new pods added by the autoscaler will use the new version of the Pod, because that's the version present in the Deployment's definition at that point.
That depends on the way you execute the node restart. If you just cut the power, nothing can be done ;) If you execute a proper drain before shutting the node down, then the PodDisruptionBudget will be taken into account and the draining procedure won't violate it. The disruption budget is respected by the Eviction API, but can be violated with low-level operations like manual pod deletion. It is more like a suggestion that some APIs respect than a hard limit enforced by the whole of Kubernetes.
According to the official documentation, if the affinity is set to be a "soft" one, the pods will be scheduled on the same node anyway. If it's "hard", the deployment will get stuck, unable to schedule the required number of pods. A rolling update will still be possible, but the HPA won't be able to grow the pod pool any more.
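For illustration, the "soft" and "hard" variants map to preferredDuringSchedulingIgnoredDuringExecution and requiredDuringSchedulingIgnoredDuringExecution respectively; a sketch of the soft form inside a pod template (the label is a placeholder):

affinity:
  podAntiAffinity:
    # "Soft": the scheduler tries to spread the pods but will co-locate them if it has to.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app: my-app
    # "Hard" would use requiredDuringSchedulingIgnoredDuringExecution instead,
    # which can leave new pods Pending when no spare node exists.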