In Kubernetes, can draining a node temporarily scale up a deployment? - kubernetes

Does kubernetes ever create pods as the result of draining a node? I'm not sure whether this is a feature I'm missing, or whether I just haven't found the right docs for it.
So here's the problem: I've got services that want to be always-on, but typically want to be a single container (for various stupid reasons having to do with them being more stateful than they should be). It's ok to run two containers temporarily during deployment or maintenance activities, though. So in ECS, I would say "desired capacity 1, maximumPercent 200%, minimumHealthPercent 100%". Then, if I need to replace a cluster node, ECS would automatically scale the service up, and once the new task was passing health checks, it would stop the old task and then the node could continue draining.
In kubernetes, the primitives seem to be less-flexible: I can set a pod disruption budget to prevent a pod from being evicted. But I don't see a way to get a deployment to temporarily scale up as a result of a node being drained. The pod disruption budget object in kubernetes, being mostly independent of a deployment or replica set, seems to mainly act as a blocker to draining a node, but not as a way to eagerly trigger scale-up.

In Kubernetes, deployments will only create new pods when current replicas are below desired replicas. In another word, the creation of a new Pod is triggered post disruption.
By design, deployments do not observe the disruption events(and probably it's not possible, as there are many voluntary actions) nither the eviction API directly. Hence the deployments never scale up automatically.
Probably you are looking for something like `horizontal pod autoscaler. However, this only scales based on resource consumption.
I would have deployed at least 2 replicas and use pod disruption budget as your service(application) is critical and should run 24/7/365. This is not only for nodes maintenance, but for many many other reasons(voluntary & involuntary) a pod can come down/rescheduled.

Related

Half of My Kubernetes Cluster Is Never Used, Due to the Way Pods Are Scheduled

I have a Kubernetes cluster with 4 nodes and, ideally, my application should have 4 replicas, evenly distributed to each node. However, when pods are scheduled, they almost always end up on only two of the nodes, or if I'm very lucky, on 3 of the 4. My app as quite a bit of traffic and I would really want to use all the resources that I pay for.
I suspect the reason why this happens is that Kubernetes tries to schedule the new pods on the nodes that have the most available resources, which is nice as a concept, but it would be even nicer if it would reschedule the pods once the old nodes become available again.
What options do I have? Thanks.
You have lots of options!
First and foremost: Pod Affinity and Anti-affinity to make sure your Pod prefer to be placed on a host that does not already have a Pod with the same label.
Second, you could set up Pod Topology Spread Constraints. This is newer and a bit more advanced, but usually a better solution that simple anti-affinity.
Thirdly, you can pin your Pods to a specific node using a NodeSelector.
Finally, you could write your own scheduler or modify the default scheduler settings, but that's a bit more advanced topic. Don't forget to always set your resource requests correctly, these should be set to a value that more or less encapsulates the usage during peak traffic, to make sure that a node has enough resources available to max out the Pod without interfering with other Pods.

Will k8s scale a pod within HPA range to evict it and meet disruption budget?

excuse me for asking something that has much overlap with many specific questions about the same knowledge area. I am curious to know if kubernetes will scale a pod in order to evict it.
Given are the following facts at the time of eviction:
The pod is running one instance.
The pod has an HPA controlling it, with the following params:
minCount: 1
maxCount: 2
It has a PDB with params:
minAvailable: 1
I would expect the k8s controller to have enough information to safely scale up to 2 instances to meet the PDB, and until recently I was assuming it would indeed do so.
Why am I asking this? (The question behind the question ;)
Well, we run into auto-upgrade problems on AKS because it won't evict pods as described above, and the Azure team told me to change the params. But if no scaling happens, this means we have to set minAvailable to 2, effectively increasing pod amount only for future evictions. I want to get to the bottom of this before I file a feature request with k8s or a bug with AKS.
I believe these two parts are independent; the pod disruption budget doesn't look at the autoscaling capability, or otherwise realize that a pod is running as part of a deployment that could be temporarily upscaled.
If you have a deployment with replicas: 1, and a corresponding PDB with minAvailable: 1, this will prevent the node the pod is running on from being taken out of service. (I see this behavior in the system I work on professionally, using a different Kubernetes environment.)
The way this works normally (see also the PodDisruptionBudget example in the Kubernetes documentation):
Some command like kubectl drain or the cluster autoscaler marks a node as going out of service.
The pods on that node are terminated.
The replication controller sees that some replica sets have too few pods, and creates new ones.
The new pods get scheduled on in-service nodes.
The pod disruption budget only affects the first part of this sequence; it would keep kubectl drain from actually draining a node until the disruption budget could be satisfied, or cause the cluster autoscaler to pick a different node. HPA isn't considered at all, nor is it considered that it's "normal" to run extra copies of a deployment-managed pod during upgrades. (That is, this is a very reasonable question, it just doesn't work that way right now.)
My default setup for most deployments tends to be to use 3 replicas and to have a pod disruption budget requiring at least 1 of them to be available. That definitely adds some cost to operating the service, but it makes you tolerant of an involuntary node failure and it does allow you to consciously rotate nodes out. For things that read from message queues (Kafka or RabbitMQ-based workers) it could make sense to run only 1 replica with no PDB since the worker will be able to tolerate an outage.

Kubernetes Autoscaler: no downtime for deployments when downscaling is possible?

In a project, I'm enabling the cluster autoscaler functionality from Kubernetes.
According to the documentation: How does scale down work, I understand that when a node is used for a given time less than 50% of its capacity, then it is removed, together with all of its pods, which will be replicated in a different node if needed.
But the following problem can happen: what if all the pods related to a specific deployment are contained in a node that is being removed? That would mean users might experience downtime for the application of this deployment.
Is there a way to avoid that the scale down deletes a node whenever there is a deployment which only contains pods running on that node?
I have checked the documentation, and one possible (but not good) solution, is to add an annotation to all of the pods containing applications here, but this clearly would not down scale the cluster in an optimal way.
In the same documentation:
What happens when a non-empty node is terminated? As mentioned above, all pods should be migrated elsewhere. Cluster Autoscaler does this by evicting them and tainting the node, so they aren't scheduled there again.
What is the Eviction ?:
The eviction subresource of a pod can be thought of as a kind of policy-controlled DELETE operation on the pod itself.
Ok, but what if all pods get evicted at the same time on the node?
You can use Pod Disruption Budget to make sure minimum replicas are always working:
What is PDB?:
A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions.
In k8s docs you can also read:
A PodDisruptionBudget has three fields:
A label selector .spec.selector to specify the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.
So if you use PDB for your deployment it should not get deleted all at once.
But please notice that if the node fails for some other reason (e.g hardware failure), you will still experience downtime. If you really care about High Availability consider using pod antiaffinity to make sure the pods are not scheduled all on one node.
Same document you referred to, has this:
How is Cluster Autoscaler different from CPU-usage-based node autoscalers? Cluster Autoscaler makes sure that all pods in the
cluster have a place to run, no matter if there is any CPU load or
not. Moreover, it tries to ensure that there are no unneeded nodes in
the cluster.
CPU-usage-based (or any metric-based) cluster/node group autoscalers
don't care about pods when scaling up and down. As a result, they may
add a node that will not have any pods, or remove a node that has some
system-critical pods on it, like kube-dns. Usage of these autoscalers
with Kubernetes is discouraged.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

Kubernetes PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction?

If I have the following Kubernetes objects:
Deployment with rollingUpdate.maxUnavailable set to 1.
PodDisruptionBudget with maxUnavailable set to 1.
HorizontalPodAutoscaler setup to allow auto scaling.
Cluster auto-scaling is enabled.
If the cluster was under load and is in the middle of scaling up, what happens:
During a rolling update? Do the new Pod's added due to the scale up use the new version of the Pod?
When a node needs to be restarted or replaced? Does the PodDisruptionBudget stop the restart completely? Does the HorizontalPodAutoscaler to scale up the number of nodes before taking down another node?
When the Pod affinity is set to avoid placing two Pod's from the same Deployment on the same node.
As in documentation:
Pods which are deleted or unavailable due to a rolling upgrade to an application do count against the disruption budget, but controllers (like deployment and stateful-set) are not limited by PDBs when doing rolling upgrades – the handling of failures during application updates is configured in the controller spec.
So it partially depends on the controller configuration and implementation. I believe new pods added by the autoscaler will use the new version of the Pod, because that's the version present in deployments definition at that point.
That depends on the way you execute the node restart. If you just cut the power, nothing can be done ;) If you execute proper drain before shutting the node down, then the PodDisruptionBudget will be taken into account and the draining procedure won't violate it. The disruption budget is respected by the Eviction API, but can be violated with low level operations like manual pod deletion. It is more like a suggestion, that some APIs respect, than a force limit that is enforced by whole Kubernetes.
According to the official documentation if the affinity is set to be a "soft" one, the pods will be scheduled on the same node anyway. If it's "hard", then the deployment will get stuck, not being able to schedule required amount of pods. Rolling update will still be possible, but the HPA won't be able to grow the pod pool anymore.