Kubernetes PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction?

If I have the following Kubernetes objects:
Deployment with rollingUpdate.maxUnavailable set to 1.
PodDisruptionBudget with maxUnavailable set to 1.
HorizontalPodAutoscaler setup to allow auto scaling.
Cluster auto-scaling is enabled.
If the cluster is under load and in the middle of scaling up, what happens:
During a rolling update? Do the new Pods added due to the scale-up use the new version of the Pod?
When a node needs to be restarted or replaced? Does the PodDisruptionBudget stop the restart completely? Does the HorizontalPodAutoscaler scale up the number of Pods (and hence the cluster autoscaler the number of nodes) before another node is taken down?
When Pod anti-affinity is set to avoid placing two Pods from the same Deployment on the same node?
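For concreteness, a minimal sketch of the objects described in the question (names, labels, image and scaling thresholds are placeholders, not taken from the original setup):

```yaml
# Deployment: at most 1 Pod may be unavailable during a rolling update.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.0     # placeholder image
---
# PodDisruptionBudget: at most 1 Pod may be down due to voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
---
# HorizontalPodAutoscaler: scales the Deployment between 3 and 10 replicas on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```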

As stated in the documentation:
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec.
So it partially depends on the controller configuration and implementation. I believe new pods added by the autoscaler will use the new version of the Pod, because that's the version present in the Deployment's definition at that point.
That depends on the way you execute the node restart. If you just cut the power, nothing can be done ;) If you execute a proper drain before shutting the node down, the PodDisruptionBudget will be taken into account and the draining procedure won't violate it. The disruption budget is respected by the Eviction API, but can be bypassed with low-level operations like manual pod deletion. It is more of a suggestion that some APIs respect than a hard limit enforced by Kubernetes as a whole.
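To make the distinction concrete, this is roughly the object that `kubectl drain` creates for each Pod through the Eviction API (pod name and namespace are placeholders). This request is checked against any matching PodDisruptionBudget, whereas a plain `kubectl delete pod` is not:

```yaml
# Posted to the pod's eviction subresource instead of deleting the pod directly;
# the API server refuses it (HTTP 429) if it would violate a PodDisruptionBudget.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: my-app-7d4b9c-xk2lp   # placeholder pod name
  namespace: default
```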
According to the official documentation, if the anti-affinity is a "soft" one, the pods may still be scheduled on the same node. If it's "hard", the Deployment will get stuck, unable to schedule the required number of pods. A rolling update will still be possible, but the HPA won't be able to grow the pod pool any further.
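As an illustration, both flavours of Pod anti-affinity as they would appear under the Deployment's Pod spec (labels and topology key are just examples; in practice you would pick one of the two):

```yaml
affinity:
  podAntiAffinity:
    # "Hard": a Pod will not be scheduled onto a node that already runs a Pod
    # with the label app=my-app; if no such node exists, it stays Pending.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname
    # "Soft": the scheduler only prefers to spread the Pods; if it cannot,
    # it will still place two of them on the same node.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname
```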

Related

kubernetes eviction does not follow rolling update strategy

context:
deployment/service/pod with one replica (because it uses an expensive shared resource, would be a big rewrite to change this)
rolling update strategy: maxSurge: 1, maxUnavailable: 0 (because short-term sharing of the expensive resource is acceptable)
unwanted downtime of the service when the pod is evicted because of predictable reasons
I know the rolling update strategy is followed perfectly when I run kubectl rollout restart ..., but when a pod is killed due to predictable reasons (in our setup, typically the autoscaler downscaling the cluster), this doesn't happen.
Is there a functionality in k8s that I'm missing that can emulate the behaviour of a rolling update strategy?
The closest I'm getting is:
a PodDisruptionBudget forcing at least one pod to be online (see the sketch after the timelines below)
a rollout restart every X minutes
but I can't help feeling there should be some form of trigger or interaction that can be used to cleanly solve this.
for clarity:
wanted timeline:
cluster autoscaler notifies (eviction api) that the pod needs to go
a new pod is created (double resource usage start)
the new pod is ready
the old pod is killed (double resource usage end)
no downtime, slight period of double resource usage
actual timeline:
eviction api
old pod is killed (downtime start)
new pod is created
new pod is ready (downtime end)
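A minimal sketch of the PDB + maxSurge combination mentioned above (names and image are placeholders). As the question points out, this makes `kubectl rollout restart` start the new Pod before the old one is stopped, while the PDB only blocks voluntary evictions of the single Pod; it does not turn an eviction into the same surge-first replacement:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: expensive-service            # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: expensive-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra Pod may be created during an update
      maxUnavailable: 0  # the old Pod is kept until the new one is ready
  template:
    metadata:
      labels:
        app: expensive-service
    spec:
      containers:
        - name: expensive-service
          image: expensive-service:1.0   # placeholder
---
# Blocks voluntary evictions (drain, cluster autoscaler) of the only replica.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: expensive-service-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: expensive-service
```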

Is there a way to use vertical and horizontal pod autoscaler without a controller?

I want to know if there is a way to use autoscalers in Kubernetes with pods created directly from "pod creation YAML files", not pods created as part of a higher-level controller like Deployments or ReplicaSets.
The short answer to your question is no.
The Horizontal Pod Autoscaler changes the number of replicas of a Deployment in reaction to changes in load. So you need a Deployment for it to work.
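For illustration, the relevant part of an HPA manifest: the scaleTargetRef has to point at something that exposes the scale subresource, which a bare Pod does not (names are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa              # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment             # must expose the scale subresource; a bare Pod does not
    name: example-deployment     # placeholder
  minReplicas: 1
  maxReplicas: 5
```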
Regarding the Vertical Pod Autoscaler, I think it should work with bare pods as well, but only at Pod creation time. In fact, I read the following statement in the Known limitations section of the README:
VPA does not evict pods which are not run under a controller. For such
pods Auto mode is currently equivalent to Initial.
That sentence makes me conclude that the VPA should work on Pods not backed by a controller, but in a limited way. In fact, the documentation about the Initial mode states:
VPA only assigns resource requests on pod creation and never changes
them later.
Making it basically useless.
I think it is not possible to use the Pod object as the target resource for an HPA.
The documentation describes the HPA as:
The Horizontal Pod Autoscaler automatically scales the number of Pods
in a replication controller, deployment, replica set or stateful set
based on observed CPU utilization (or, with custom metrics support, on
some other application-provided metrics). Note that Horizontal Pod
Autoscaling does not apply to objects that can't be scaled, for
example, DaemonSets.
The documentation also describes how the algorithm is implemented on the backend:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
and since the Pod resource does not have a replicas field in its spec, we can say that autoscaling with the HPA is not supported for bare Pods. (For example, with 2 current replicas, a current metric value of 90% and a target of 60%, the formula yields desiredReplicas = ceil[2 * (90 / 60)] = 3, which the controller has to write into the target's replicas field.)
The VPA, on the other hand, does seem to support working with the Pod object, but there is a limitation when using it with bare Pods:
VPA does not evict pods which are not run under a controller. For such
pods Auto mode is currently equivalent to Initial.
You can read about the different updatePolicy.updateModes in the docs.
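For reference, a sketch of a VerticalPodAutoscaler showing the updateMode field the limitation refers to (this assumes the standard autoscaling.k8s.io/v1 CRD; names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa               # placeholder
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment              # placeholder target
    name: example-deployment
  updatePolicy:
    # "Initial": requests are assigned only when a Pod is created, never updated later.
    # "Auto": the VPA may also evict running Pods to apply new requests, but per the
    # limitation quoted above, not for Pods that are not run under a controller.
    updateMode: "Initial"
```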

Will k8s scale a pod within HPA range to evict it and meet disruption budget?

Excuse me for asking something that overlaps with many specific questions in the same knowledge area. I am curious to know whether Kubernetes will scale a pod up in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we ran into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set the minimum replica count to 2, effectively increasing the pod count just to accommodate future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.

In Kubernetes, can draining a node temporarily scale up a deployment?

Does kubernetes ever create pods as the result of draining a node? I'm not sure whether this is a feature I'm missing, or whether I just haven't found the right docs for it.
So here's the problem: I've got services that want to be always-on, but typically want to be a single container (for various stupid reasons having to do with them being more stateful than they should be). It's ok to run two containers temporarily during deployment or maintenance activities, though. So in ECS, I would say "desired capacity 1, maximumPercent 200%, minimumHealthPercent 100%". Then, if I need to replace a cluster node, ECS would automatically scale the service up, and once the new task was passing health checks, it would stop the old task and then the node could continue draining.
In kubernetes, the primitives seem to be less-flexible: I can set a pod disruption budget to prevent a pod from being evicted. But I don't see a way to get a deployment to temporarily scale up as a result of a node being drained. The pod disruption budget object in kubernetes, being mostly independent of a deployment or replica set, seems to mainly act as a blocker to draining a node, but not as a way to eagerly trigger scale-up.
In Kubernetes, Deployments will only create new pods when the current number of replicas is below the desired number. In other words, the creation of a new Pod is triggered after the disruption.
By design, Deployments do not observe disruption events (and it's probably not possible, as there are many kinds of voluntary actions), nor do they watch the Eviction API directly. Hence a Deployment never scales up in anticipation of a disruption.
Probably you are looking for something like the Horizontal Pod Autoscaler. However, it only scales based on resource consumption, not on disruption events.
I would deploy at least 2 replicas and use a PodDisruptionBudget, as your service (application) is critical and should run 24/7/365. This is not only for node maintenance; there are many other reasons (voluntary and involuntary) why a pod can go down or be rescheduled.

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation (How does scale down work), I understand that when a node is used at less than 50% of its capacity for a given time, it is removed together with all of its pods, which will be recreated on a different node if needed.
But the following problem can happen: what if all the pods belonging to a specific deployment are running on the node that is being removed? That would mean users might experience downtime for that deployment's application.
Is there a way to prevent the scale-down from deleting a node whenever there is a deployment whose pods all run on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods hosting such applications, but this clearly would not let the cluster scale down in an optimal way.
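Presumably the annotation referred to is the cluster autoscaler's safe-to-evict marker; a sketch of where it would go, assuming that is indeed the annotation meant (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app                # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
      annotations:
        # Tells the Cluster Autoscaler not to evict this Pod, which in turn
        # prevents it from removing the node the Pod runs on.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: critical-app
          image: critical-app:1.0   # placeholder
```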
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use a PodDisruptionBudget to make sure a minimum number of replicas is always available:
What is a PDB?
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use a PDB for your deployment, its pods should not all get deleted at once.
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
The same document you referred to also has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.