Will k8s scale a pod within HPA range to evict it and meet disruption budget? - kubernetes

Excuse me for asking something that overlaps with many specific questions in the same knowledge area. I am curious to know whether Kubernetes will scale up a pod in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we ran into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minAvailable to 2, effectively increasing the pod count only for future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.

I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
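To make the arithmetic concrete, here is a minimal Python sketch of the budget check as I understand it. The function names are mine, not a Kubernetes API; the rule being modeled is that the number of allowed disruptions is the count of currently healthy pods minus minAvailable, and the HPA's range plays no part in it.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Voluntary evictions the budget permits right now (never negative)."""
    return max(0, current_healthy - min_available)


def eviction_permitted(current_healthy: int, min_available: int) -> bool:
    """An eviction goes through only if at least one disruption is allowed."""
    return disruptions_allowed(current_healthy, min_available) >= 1


# The situation from the question: one running pod, minAvailable: 1.
# The HPA range (1 to 2) is deliberately not an input here.
single_replica = eviction_permitted(current_healthy=1, min_available=1)
# With two healthy pods the same budget allows one eviction.
two_replicas = eviction_permitted(current_healthy=2, min_available=1)
print(single_replica, two_replicas)
```

With one healthy pod the budget leaves zero room, so the eviction is refused no matter what the HPA could do.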
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
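The sequence above can be sketched as a toy simulation. This is not how kubectl drain is implemented: Pod, drain, and the node names are hypothetical, replacement pods are assumed to become healthy instantly on a spare node, and a blocked eviction simply stops the drain instead of being retried.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str


def drain(node, pods, replicas, min_available, spare_node="node-b"):
    """Evict pods from `node` one at a time, honoring a minAvailable budget.

    Returns (drained, log). All pods are assumed to be healthy.
    """
    pods = list(pods)
    log = []
    counter = len(pods)
    while True:
        victims = [p for p in pods if p.node == node]
        if not victims:
            return True, log
        # Step 1: the budget is checked before the pod is removed.
        if len(pods) - 1 < min_available:
            log.append(f"eviction of {victims[0].name} blocked by PDB")
            return False, log
        pods.remove(victims[0])
        log.append(f"evicted {victims[0].name}")
        # Steps 3 and 4: only now does the controller notice the shortfall
        # and create a replacement on an in-service node.
        while len(pods) < replicas:
            counter += 1
            pods.append(Pod(f"pod-{counter}", spare_node))
            log.append(f"created pod-{counter} on {spare_node}")


blocked, blocked_log = drain("node-a", [Pod("pod-1", "node-a")],
                             replicas=1, min_available=1)
drained, drained_log = drain("node-a",
                             [Pod("pod-1", "node-a"), Pod("pod-2", "node-a")],
                             replicas=2, min_available=1)
print(blocked, blocked_log)
print(drained, drained_log)
```

In the single-replica case nothing ever creates the second pod, because pod creation only happens after an eviction has already succeeded.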
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.

Related

GKE won't scale down to a single node

I've seen other similar questions, but none that quite address our specific case, as far as I can tell.
We have a cluster where we run development environments. When we're not working, ideally, that cluster should go down to a single node. At the moment, no one is working, and I can see that there is one node where CPU/Mem/Disk are essentially at 0 percent, with only system pods on it. The other node has some stuff on it.
The cluster is set up to autoscale down to 1. Why won't it do so?
It will autoscale up to however many we need when we spin up new environments and down to 2 no problem. But down to 1? No dice. When I manually delete the node with only system pods, and basically 0 usage, the cluster spins up a new one. I can't understand why.
Update/Clarification:
I've messed around with the configuration, so I'm not sure exactly what system pods were running, but I'm almost certain they were all DaemonSet-controlled. So, even after manually destroying a node, having everything non-system rescheduled, a new node would still pop up with no workloads specifically triggering the scale-up to 2.
Just to make sure I wasn't making things up, I've re-organized things so that there's just a single node running with no autoscaling, and it has plenty of excess capacity with everything running nicely. As far as I can tell, nothing new got scheduled onto that single node.
It looks like you might not have checked the limitations section of the GKE cluster autoscaler documentation. Please read it; you may need to configure a PDB (Pod Disruption Budget).
Occasionally, the cluster autoscaler cannot scale down completely and
an extra node exists after scaling down. This can occur when required
system Pods are scheduled onto different nodes, because there is no
trigger for any of those Pods to be moved to a different node. See "I
have a couple of nodes with low utilization, but they are not scaled
down. Why?" To work around this limitation, you can configure a Pod disruption budget.
By default, kube-system pods prevent Cluster Autoscaler from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
You can read more at : https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
Don't forget to check out the limitations of GKE scaling: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#limitations
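If you prefer a manifest over the kubectl create command, the following Python sketch builds a roughly equivalent object and prints it as JSON, which kubectl apply -f also accepts. The PDB name and app label are placeholders, and policy/v1 assumes a reasonably recent cluster; older clusters used policy/v1beta1.

```python
import json


def kube_system_pdb(name: str, app_label: str, max_unavailable: int = 1) -> dict:
    """Build a PodDisruptionBudget for pods labeled app=<app_label> in kube-system."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": "kube-system"},
        "spec": {
            "maxUnavailable": max_unavailable,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# "example-pdb" and "example-app" are hypothetical; substitute your own values.
manifest = kube_system_pdb("example-pdb", "example-app")
print(json.dumps(manifest, indent=2))
```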

Kubernetes - ReplicaSet vs PodDisruptionBudget

I was wondering what added value the PodDisruptionBudget gives.
As far as I understand, PodDisruptionBudget promises that a certain amount of nodes will always remain in the cluster while there are 2 options to decide how: minAvailable / maxUnavailable.
Now, when I define ReplicaSet I define how many replicas I want. So if for example I define 2, there won't be less than 2 replicas. Then what gives the PodDisruptionBudget?
PodDisruptionBudget helps in ensuring zero downtime for an application which ReplicaSet can't guarantee.
The following post explains with an example how PodDisruptionBudget can be useful in achieving zero downtime for an application:
Quoting the post, the node upgrade is a normal scenario as described in:
Let’s consider a scenario, we need to upgrade version of node or
update the spec often. Cluster downscaling is also a normal condition.
In these cases, the pods running on the to-be-deleted nodes needs to
be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool which we cannot do it by
detaching instantly as that will lead to termination of all the pods
running in there which can get services down. First step before
detaching node is to make the node unscheduled.
Running kubectl get pods -w will show the pods running on the node entering the Terminating state, which leads to downtime:
If you quickly check the pods with kubectl get pods , it will
terminate all the running pods instantly which were scheduled on node1
. This could lead a downtime! If you are running few number of pods
and all of them are scheduled on same node, it will take some time for
the pods to be scheduled on other node.
A PodDisruptionBudget with minAvailable is useful in such scenarios to achieve zero downtime. A ReplicaSet will only ensure that the desired number of replica pods is created on other nodes during the process.
If you just have a Replicaset with one replica and no PodDisruptionBudget specified, the pod will be terminated and a new pod will be created on other nodes. This is where PDBs provide the added advantage over the Replicaset.
For the PodDisruptionBudget to work, there must be at least 2 pods
running for a label selector otherwise, the node cannot be drained
gracefully and it will be evicted forcefully when grace time ends.
Then what gives the PodDisruptionBudget?
A PodDisruptionBudget is useful if you have an application that needs high availability, e.g. one that takes time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. A PodDisruptionBudget can only limit the former, but both count against the budget.
An example of a voluntary disruption is when an employee on your platform team decides to upgrade the kernel on all your nodes. Sometimes you want to do this slowly, since all Pods on a node will be terminated and rescheduled to a different node.
There are also involuntary disruptions, e.g. a disk crash on one of your nodes.
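A small Python sketch of what "counts against the budget" means in practice. The function and the numbers are illustrative assumptions: three replicas with minAvailable: 2. A PDB cannot stop the disk crash, but the pod lost to it still uses up the room that a voluntary eviction would have needed.

```python
def disruptions_allowed(expected_pods: int, unhealthy_pods: int,
                        min_available: int) -> int:
    """Voluntary evictions still permitted, given pods that are already down."""
    current_healthy = expected_pods - unhealthy_pods
    return max(0, current_healthy - min_available)


# All three replicas healthy: one voluntary eviction is allowed.
before_crash = disruptions_allowed(expected_pods=3, unhealthy_pods=0, min_available=2)
# A disk crash takes one pod down: no voluntary eviction is allowed
# until the replacement pod is healthy again.
after_crash = disruptions_allowed(expected_pods=3, unhealthy_pods=1, min_available=2)
print(before_crash, after_crash)
```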
So if for example I define 2, there won't be less than 2 replicas. Then what gives the PodDisruptionBudget?
That would be 2 for minAvailable. Also, maxAvailable is the wrong name; the field is maxUnavailable.

In Kubernetes, can draining a node temporarily scale up a deployment?

Does kubernetes ever create pods as the result of draining a node? I'm not sure whether this is a feature I'm missing, or whether I just haven't found the right docs for it.
So here's the problem: I've got services that want to be always-on, but typically want to be a single container (for various stupid reasons having to do with them being more stateful than they should be). It's ok to run two containers temporarily during deployment or maintenance activities, though. So in ECS, I would say "desired capacity 1, maximumPercent 200%, minimumHealthPercent 100%". Then, if I need to replace a cluster node, ECS would automatically scale the service up, and once the new task was passing health checks, it would stop the old task and then the node could continue draining.
In kubernetes, the primitives seem to be less-flexible: I can set a pod disruption budget to prevent a pod from being evicted. But I don't see a way to get a deployment to temporarily scale up as a result of a node being drained. The pod disruption budget object in kubernetes, being mostly independent of a deployment or replica set, seems to mainly act as a blocker to draining a node, but not as a way to eagerly trigger scale-up.
In Kubernetes, deployments only create new pods when the current replica count is below the desired replica count. In other words, the creation of a new Pod is triggered after the disruption.
By design, deployments do not observe disruption events (and it's probably not possible, as there are many voluntary actions), nor the eviction API directly. Hence deployments never scale up automatically.
You are probably looking for something like the horizontal pod autoscaler. However, this only scales based on observed metrics such as resource consumption, not on disruptions.
I would deploy at least 2 replicas and use a pod disruption budget, as your service (application) is critical and should run 24/7/365. This is not only for node maintenance; there are many other reasons (voluntary and involuntary) a pod can go down or be rescheduled.
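The "only after the disruption" point can be shown with a tiny Python sketch. pods_to_create is a hypothetical stand-in for the controller's reconcile step, not real Kubernetes code, and the timeline is made up for a deployment with replicas: 1.

```python
def pods_to_create(desired_replicas: int, current_pods: int) -> int:
    """A controller only fills a shortfall; it never runs ahead of it."""
    return max(0, desired_replicas - current_pods)


# (event, number of pods that exist at that moment)
timeline = [
    ("node drain requested", 1),
    ("pod evicted", 0),
    ("replacement running", 1),
]
for event, current in timeline:
    print(f"{event}: create {pods_to_create(1, current)} pod(s)")
```

A pending drain is not an input to the calculation, so the replacement is requested only once the count has already dropped to zero.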

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation: How does scale down work, I understand that when a node is used for a given time less than 50% of its capacity, then it is removed, together with all of its pods, which will be replicated in a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution is to add an annotation to all of the pods containing applications here, but this clearly would not scale down the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is eviction?
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use Pod Disruption Budget to make sure minimum replicas are always working:
What is PDB?:
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use PDB for your deployment it should not get deleted all at once.
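As a rough illustration of how the two fields translate into a number of pods that must stay healthy, here is a Python sketch. The helper names are mine; the rounding follows what the Kubernetes documentation describes for percentages (rounded up), and only one of the two fields can be set on a given PDB.

```python
def scaled(value, total_pods: int) -> int:
    """Resolve an absolute number or a percentage string, rounding up."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return (percent * total_pods + 99) // 100  # integer ceiling
    return int(value)


def desired_healthy(total_pods: int, min_available=None, max_unavailable=None) -> int:
    """Pods that must remain healthy after an eviction."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        return scaled(min_available, total_pods)
    return max(0, total_pods - scaled(max_unavailable, total_pods))


print(desired_healthy(7, min_available="50%"))    # 4 of 7 must stay healthy
print(desired_healthy(7, max_unavailable="50%"))  # 4 may be down, so 3 must stay
print(desired_healthy(2, min_available=1))        # 1
```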
But please note that if the node fails for some other reason (e.g. hardware failure), you will still experience downtime. If you really care about high availability, consider using pod anti-affinity to make sure the pods are not all scheduled on one node.
The same document you referred to has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers?
Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kubernetes and would like to ask about some concepts related to Kubernetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replicas.
(1)
Assume that there are 4 nodes, each being a different physical server with different CPU and memory.
When the deployment is made, how would Kubernetes assign the pods to the nodes? Will there be a scenario where it puts multiple pods on the same server while another server has no pods assigned (due to resource considerations)?
(2)
Assume there are 4 nodes (on 4 identical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would Kubernetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one has more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application's point of view is to set appropriate resource requests in your pod specifications. As long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
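Here is a toy Python sketch of that first scenario. It is not the real scheduler: the node names and free-memory figures are invented, and "pick the feasible node with the most free memory" is a simplifying assumption standing in for the scheduler's actual scoring. The filtering step (a pod only fits where its request fits) is the part that matters.

```python
def schedule(free_gb: dict, request_gb: float, replicas: int) -> list:
    """Place replicas one by one; None means the pod would stay Pending."""
    free = dict(free_gb)
    placement = []
    for _ in range(replicas):
        # Filter: only nodes whose remaining memory covers the request.
        candidates = [node for node, gb in free.items() if gb >= request_gb]
        if not candidates:
            placement.append(None)
            continue
        # Score (assumption): prefer the node with the most free memory.
        best = max(candidates, key=lambda node: free[node])
        free[best] -= request_gb
        placement.append(best)
    return placement


# Free memory per node after system and daemon set pods (hypothetical numbers).
nodes = {"node-a": 10, "node-b": 4.5, "node-c": 3, "node-d": 2}
placement = schedule(nodes, request_gb=4, replicas=3)
print(placement)
```

With these numbers two replicas share node-a, one lands on node-b, and node-c and node-d stay empty because a 4 GB request does not fit on them.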
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
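A rough Python sketch of the 50% rule follows. As far as I understand the autoscaler FAQ, utilization is measured from the pods' resource requests relative to the node's allocatable capacity, not from live usage; taking the larger of the CPU and memory ratios is my simplification, and the real autoscaler applies further checks (how long the node has been underused, whether its pods can move, disruption budgets) before removing anything.

```python
def node_utilization(cpu_requested: float, cpu_allocatable: float,
                     mem_requested: float, mem_allocatable: float) -> float:
    """Larger of the CPU and memory request ratios (simplified)."""
    return max(cpu_requested / cpu_allocatable, mem_requested / mem_allocatable)


def is_scale_down_candidate(cpu_requested: float, cpu_allocatable: float,
                            mem_requested: float, mem_allocatable: float,
                            threshold: float = 0.5) -> bool:
    """True if the node is below the utilization threshold."""
    return node_utilization(cpu_requested, cpu_allocatable,
                            mem_requested, mem_allocatable) < threshold


# 1 of 4 CPUs and 2 of 16 GB requested: a candidate for removal.
lightly_used = is_scale_down_candidate(1.0, 4.0, 2.0, 16.0)
# Same CPU, but 12 of 16 GB requested: not a candidate.
memory_heavy = is_scale_down_candidate(1.0, 4.0, 12.0, 16.0)
print(lightly_used, memory_heavy)
```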
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This is based on the resource availability of the nodes, so if some node does not have enough capacity to host a pod, more than one replica may be scheduled onto the same node.
Kubernetes will reschedule pods from the failed node to another available node that has enough capacity to host them. Again, if there are not enough nodes that can host the replicas, more than one replica may be scheduled onto the same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler with node and pod affinity and anti-affinity.
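As an illustration of the anti-affinity option, this Python sketch builds the affinity fragment that would sit under spec.template.spec in a Deployment and prints it as JSON. The app label and its value are hypothetical; kubernetes.io/hostname is the standard per-node label. Note that a required rule like this leaves a replica Pending when no other node is available, so a preferred rule may suit small clusters better.

```python
import json


def spread_across_nodes(app_label: str) -> dict:
    """Pod spec fragment: never co-locate two pods labeled app=<app_label>."""
    return {
        "affinity": {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": {"app": app_label}},
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        }
    }


# "example-app" is a placeholder label value.
fragment = spread_across_nodes("example-app")
print(json.dumps(fragment, indent=2))
```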