How to automatically do rebalancing of pods in aws eks (kubernetes ) across all nodes/workers - kubernetes

Suppose we have 4 nodes eks cluster in ec2-autoscaling min with 4 nodes.
A kubernetes application stack deployed on the same with one pod- one node. Now traffic increases HPA triggered on eks level.
Now total pods are 8 pods ,two pods - on one node. Also triggered auto-scaling. Now total nodes are 6 nodes.
Its observed all pods remain in current state. Post autscaling also.
Is there a direct and simpler way?
Some of already running pods should automatically launch on the additional nodes (detect it and reschedule itself on the recently added idle worker/nodes (i:e non-utilized - by using force eviction of pods)
Thanks in Advance.

One easy way is to delete all those pods by selector using below command and let the deployment recreate those pods in the cluster
kubectl delete po -l key=value
There could be other possibilities. would be glad to know from others

Take a look at the Descheduler. This project runs as a Kubernetes Job that aims at killing pods when it thinks the cluster is unbalanced.
The LowNodeUtilization strategy seems to fit your case:
This strategy finds nodes that are under utilized and evicts pods, if
possible, from other nodes in the hope that recreation of evicted pods
will be scheduled on these underutilized nodes.
Another option is to apply a little of chaos engineering manually, forcing a Rolling Update on your deployment, and hopefully, the scheduler will fix the balance problem when pods are recreated.
You can use the kubectl rollout restart my-deployment. It's way better than simply deleting the pods with kubectl delete pod, as the rollout will ensure availability during the "rebalancing" (although deleting the pods altogether increases your chances for a better rebalance).

Related

Added new node to eks, new pods still being scheduled on old nodes

I have a terraform-managed EKS cluster. It used to have 2 nodes on it. I doubled the number of nodes (4).
I have a kubernetes_deployment resource that automatically deploys a fixed number of pods to the cluster. It was set to 20 when I had 2 nodes, and seemed evenly distributed with 10 each. I doubled that number to 40.
All of the new pods for the kubernetes deployment are being scheduled on the first 2 (original) nodes. Now the two original nodes have 20 pods each, while the 2 new nodes have 0 pods. The new nodes are up and ready to go, but I cannot get kubernetes to schedule the new pods on those new nodes.
I am unsure where to even begin searching, as I am fairly new to k8s and ops in general.
A few beginner questions that may be related:
I'm reading about pod affinity, and it seems like I could tell k8s to have a pod ANTI affinity with itself within a deployment. However, I am having trouble setting up the anti-affinity rules. I see that the kubernetes_deployment resource has a scheduling argument, but I can't seem to get the syntax right.
Naively it seems that the issue may be that the deployment somehow isn't aware of the new nodes. If that is the case, how could I reboot the entire deployment (without taking down the already-running pods)?
Is there a cluster level scheduler that I need to set? I was under the impression that the default does round robin, which doesn't seem to be happening at the node level.
EDIT:
The EKS terraform module node_groups submodule has fields for desired/min/max_capacity. To increase my worker nodes, I just increased those numbers. The change is reflected in the aws eks console.
Check a couple of things:
Do your nodes show up correctly in the output of kubectl get nodes -o wide and do they have a state of ready?
Instead of pod affinity look into pod topology spread constraints. Anti affinity will not work with multiple pods.

GKE won't scale down to a single node

I've seen other similar questions, but none that quite address our specific case that as far as I can tell.
We have a cluster where we run development environments. When we're not working, ideally, that cluster should go down to a single node. At the moment, no one is working, and I can see that there is one node where CPU/Mem/Disk are essentially at 0 percent, with only system pods on it. The other node has some stuff on it.
The cluster is setup to autoscale down to 1. Why won't it do so?
It will autoscale up to however many we need when we spin up new environments and down to 2 no problem. But down to 1? No dice. When I manually delete the node with only system pods, and basically 0 usage, the cluster spins up a new one. I can't understand why.
Update/Clarification:
I've messed around with the configuration, so I'm not sure exactly what system pods were running, but I'm almost certain they were all DaemonSet-controlled. So, even after manually destroying a node, having everything non-system rescheduled, a new node would still pop up with no workloads specifically triggering the scale-up to 2.
Just to make sure I wasn't making things up, I've re-organized things so that there's just a single node running with no autoscaling, and it has plenty of excess capacity with everything running nicely. As far as I can tell, nothing new got scheduled onto that single node.
Looks like you might not have checked limitation of GKE scaling down section. No issues please check and read once you might need to change the PDB (Pod distribution budget) once.
Occasionally, the cluster autoscaler cannot scale down completely and
an extra node exists after scaling down. This can occur when required
system Pods are scheduled onto different nodes, because there is no
trigger for any of those Pods to be moved to a different node. See I
have a couple of nodes with low utilization, but they are not scaled
down. Why?. To work around this limitation, you can configure a Pod disruption budget.
By default, kube-system pods prevent Cluster Autoscaler from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
You can read more at : https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
Don't forget to checkout limitation of GKE scaling : https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#limitations

Kubernetes - ReplicaSet vs PodDisruptionBudget

I was wondering what added value gives the PodDisruptionBudget.
As far as I understand, PodDisruptionBudget promises that a certain amount of nodes will always remain in the cluster while there are 2 options to decide how: minAvailable / maxUnavailable.
Now, when I define ReplicaSet I define how many replicas I want. So if for example I define 2, there won't be less than 2 replicas. Then what gives the PodDisruptionBudget?
PodDisruptionBudget helps in ensuring zero downtime for an application which ReplicaSet can't guarantee.
The following post explains with an example how PodDisruptionBudget can be useful in achieving zero downtime for an application:
Quoting the post, the node upgrade is a normal scenario as described in:
Let’s consider a scenario, we need to upgrade version of node or
update the spec often. Cluster downscaling is also a normal condition.
In these cases, the pods running on the to-be-deleted nodes needs to
be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool which we cannot do it by
detaching instantly as that will lead to termination of all the pods
running in there which can get services down. First step before
detaching node is to make the node unscheduled.
Running kubectl get pods -w will show the pods running on the node get in termination state which leads to a downtime:
If you quickly check the pods with kubectl get pods , it will
terminate all the running pods instantly which were scheduled on node1
. This could lead a downtime! If you are running few number of pods
and all of them are scheduled on same node, it will take some time for
the pods to be scheduled on other node.
PodDisruptionBudget with minAvailable are useful in such scenarios to achieve zero downtime. Replicaset will only ensure that the replicas number of pods will be
created on other nodes during the process.
If you just have a Replicaset with one replica and no PodDisruptionBudget specified, the pod will be terminated and a new pod will be created on other nodes. This is where PDBs provide the added advantage over the Replicaset.
For the PodDisruptionBudget to work, there must be at least 2 pods
running for a label selector otherwise, the node cannot be drained
gracefully and it will be evicted forcefully when grace time ends.
Then what gives the PodDisruptionBudget?
If you have an application where you want high availability e.g. it may take time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. PodDisruptionBudget can limit the latter but both counts against the budget.
An example of voluntary disruption is when an employee of your platform team decide to upgrade the kernel for all your nodes - sometimes you want to do this slowly since all Pods on the node will be terminated and scheduled to a different node.
There is also involuntary disruptions e.g. a disk crash on one of your nodes.
So if for example I define 2, there won't be less than 2 replicas. Then what gives the PodDisruptionBudget?
It's 2 for minAvailable. And maxAvailable is a wrong name , it's maxUnavailable.

Question about concept on Kubernetes pod assignment to nodes

I am quite a beginner in Kuberenetes and would like to ask about some concepts related to kuberenetes pod assignment.
Suppose there is a deployment to be made with a requirement of 3 replica sets.
(1)
Assume that there are 4 nodes, where each of it being a different physical server with different CPU and memory.
When the deployment is made, how would kubernetes assgin the pods to the nodes? Will there be scenario where it will put multiple pods on the same server, while a server does not have pod assignment (due to resource considereation)?
(2)
Assume there are 4 nodes (on 4 indentical physical servers), and 1 pod is created on each of the 4 nodes.
Suppose that now one of the nodes goes down. How would kuberenetes handle this? Will it recreate the pod on one of the other 3 nodes, based on which one having more available resources?
Thank you for any advice in advance.
There's a brief discussion of the Kubernetes Scheduler in the Kubernetes documentation. Generally scheduling is fairly opaque, but you also tend to aim for fairly well-loaded nodes; the important thing from your application point of view is to set appropriate resource requests: in your pod specifications. Just so long as there's enough room on each node to meet the resource requests, it usually doesn't matter to you which node gets picked.
In the scenario you describe, (1) it is possible that two replicas will be placed on the same node and so two nodes will go unused. That's especially true if the nodes aren't identical and they have resource constraints: if your pods require 4 GB of RAM, but you have some nodes that have less than that (after accounting for system pods and daemon set pods), the pods can't get scheduled there.
If a node fails (2) Kubernetes will automatically reschedule the pods running on that node if possible. "Fail" is a broad case, and can include a node being intentionally stopped to be upgraded or replaced. In this latter case you have some control over the cluster's behavior; see Disruptions in the documentation.
Many environments will run a cluster autoscaler. This can cause nodes to come and go automatically: if you try to schedule a pod and it won't fit, the autoscaler will allocate a new node, and if a node is under 50% utilization, it will be removed (and its pods rescheduled). In your first scenario you might start with only one node, but when the pod replicas don't all fit, the autoscaler would create a new node and once it's available the excess pods could be scheduled there.
Kubernetes will try to deploy pods to multiple nodes for better availability and resiliency. This will be based on the resource availability of the nodes. So if any node is not having enough capacity to host a pod it's possible that more than one replica of a pod is scheduled into same node.
Kubernetes will reschedule pods from the failed node to other available node which has enough capacity to host the pod. In this process again if there is no enough node which can host the replicas then there is a possibility that more than one replica is scheduled on same node.
You can read more on the scheduling algorithm here.
You can influence the scheduler by node and pod affinity and antiaffinity

Redistribute load after a Worker node is returned to the cluster

I have a cluster of 5 worker nodes, then I shutdown one of the workers and the pods being executed in it are distributed to the other nodes.
Now I start again the worker node but the pods are not redistributed again so I have one node almost free.
Is there any way to force kubernetes to redistribute the load to the "new" worker node?
Thank you.
I don't believe there's a builtin mechanic to 'load balance' the cluster.
Over time, the system will return to normal by itself.
If you want to trigger a bit of redistribution, you could scale up the deployments that can be run in parallel, then scale them down again. Others you can re-release.
You can't move a pod from one node to another without killing it, So you can go ahead and kill the pods (if there is no impact). Most probably they will be be scheduled to the free node.
There are ways of configuring affinity, but if you shutdown a node again, you are going to find yourself in the same situation, so it won't help.
As #rln mentioned, scaling down and up again can trigger redistribution. These are the commands if you're using kubectl.
Scale down:
kubectl scale --replicas=<SIZE LESS THAN YOUR CURRENT DEPLOYMENT> deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
Wait for pods to terminate.
Scale back up: kubectl scale --replicas=<ORIGINAL DESIRED DEPLOYMENT SIZE> deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
Instead of a deployment you could use a DaemonSet. DaemonSets distribute the pods to all nodes of the cluster. This would automatically distribute your pods to a new node when added. This way you could add a new node before removing the old one and your pods would still be running.