Redistribute load after a Worker node is returned to the cluster - kubernetes

I have a cluster with 5 worker nodes. I shut down one of the workers, and the pods running on it were distributed to the other nodes.
Now I have started the worker node again, but the pods are not redistributed, so one node is almost empty.
Is there any way to force Kubernetes to redistribute the load onto the "new" worker node?
Thank you.

I don't believe there's a built-in mechanism to 'load balance' the cluster.
Over time, the system will return to normal by itself.
If you want to trigger a bit of redistribution, you could scale up the deployments that can run in parallel, then scale them down again. For the others, you can trigger a new rollout.

You can't move a pod from one node to another without killing it, so you can go ahead and kill the pods (if there is no impact). Most probably they will be scheduled onto the free node.
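For example, assuming your pods carry a hypothetical label such as app=my-app, deleting them by label lets the Deployment controller recreate them, and the scheduler will most likely place some of the replacements on the idle node:
kubectl delete pods -l app=my-app -n <NAMESPACE>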
There are ways of configuring affinity, but if you shut down a node again, you are going to find yourself in the same situation, so it won't help.

As @rln mentioned, scaling down and up again can trigger redistribution. These are the commands if you're using kubectl.
Scale down:
kubectl scale --replicas=<SIZE LESS THAN YOUR CURRENT DEPLOYMENT> deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
Wait for pods to terminate.
Scale back up:
kubectl scale --replicas=<ORIGINAL DESIRED DEPLOYMENT SIZE> deployment/<DEPLOYMENT_NAME> -n <NAMESPACE>
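As a concrete illustration (the deployment name my-app, the namespace default and the replica counts are placeholders, not from the question), if the deployment normally runs 6 replicas:
kubectl scale --replicas=3 deployment/my-app -n default
(wait for the extra pods to terminate)
kubectl scale --replicas=6 deployment/my-app -n default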

Instead of a Deployment you could use a DaemonSet. A DaemonSet runs one pod on every node of the cluster, so a pod is automatically started on any node that is added. This way you could add a new node before removing the old one and your pods would still be running.
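A minimal sketch of such a DaemonSet (the name, label and image are placeholders, not from the question):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:1.0   # placeholder image
Keep in mind that a DaemonSet runs exactly one pod per node, so this only fits workloads where the one-pod-per-node model makes sense.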

Related

GKE won't scale down to a single node

I've seen other similar questions, but none that quite address our specific case, as far as I can tell.
We have a cluster where we run development environments. When we're not working, ideally, that cluster should go down to a single node. At the moment, no one is working, and I can see that there is one node where CPU/Mem/Disk are essentially at 0 percent, with only system pods on it. The other node has some stuff on it.
The cluster is set up to autoscale down to 1. Why won't it do so?
It will autoscale up to however many we need when we spin up new environments and down to 2 no problem. But down to 1? No dice. When I manually delete the node with only system pods, and basically 0 usage, the cluster spins up a new one. I can't understand why.
Update/Clarification:
I've messed around with the configuration, so I'm not sure exactly which system pods were running, but I'm almost certain they were all DaemonSet-controlled. So, even after manually destroying a node and having everything non-system rescheduled, a new node would still pop up with no workloads specifically triggering the scale-up to 2.
Just to make sure I wasn't making things up, I've re-organized things so that there's just a single node running with no autoscaling, and it has plenty of excess capacity with everything running nicely. As far as I can tell, nothing new got scheduled onto that single node.
It looks like you might not have checked the limitations of GKE scale-down. Please read that section; you may need to configure a PDB (Pod Disruption Budget) for the system pods.
Occasionally, the cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node. See "I have a couple of nodes with low utilization, but they are not scaled down. Why?". To work around this limitation, you can configure a Pod disruption budget.
By default, kube-system pods prevent Cluster Autoscaler from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
You can read more at: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
Don't forget to check out the limitations of GKE scaling: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#limitations
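For reference, a rough sketch of the equivalent manifest (the placeholders mirror the command above; policy/v1 assumes a reasonably recent cluster, older ones use policy/v1beta1):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <pdb name>
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: <app name>   # label of the kube-system pod that can be rescheduled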

Kubernetes - ReplicaSet vs PodDisruptionBudget

I was wondering what added value the PodDisruptionBudget gives.
As far as I understand, a PodDisruptionBudget promises that a certain number of pods will always remain available, and there are 2 options to decide how: minAvailable / maxUnavailable.
Now, when I define a ReplicaSet I define how many replicas I want. So if for example I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget add?
A PodDisruptionBudget helps in ensuring zero downtime for an application, which a ReplicaSet alone can't guarantee.
The following post explains with an example how a PodDisruptionBudget can be useful in achieving zero downtime for an application:
Quoting the post, a node upgrade is a normal scenario:
Let's consider a scenario: we need to upgrade the node version or update the spec often. Cluster downscaling is also a normal condition. In these cases, the pods running on the to-be-deleted nodes need to be drained.
kubectl drain is performed on one of the nodes for the upgrade:
We need to remove node1 from the pool, which we cannot do by detaching it instantly, as that would terminate all the pods running on it and could bring services down. The first step before detaching a node is to make the node unschedulable.
Running kubectl get pods -w will show the pods running on the node going into the Terminating state, which leads to downtime:
If you quickly check the pods with kubectl get pods, you will see all the running pods that were scheduled on node1 terminating instantly. This could lead to downtime! If you are running a small number of pods and all of them are scheduled on the same node, it will take some time for the pods to be scheduled on another node.
A PodDisruptionBudget with minAvailable is useful in such scenarios to achieve zero downtime. The ReplicaSet will only ensure that the desired number of replicas gets created on other nodes during the process.
If you just have a ReplicaSet with one replica and no PodDisruptionBudget specified, the pod will be terminated and a new pod will be created on another node. This is where PDBs provide the added advantage over the ReplicaSet.
For the PodDisruptionBudget to work, there must be at least 2 pods running for a label selector; otherwise the node cannot be drained gracefully and it will be evicted forcefully when the grace period ends.
Then what does the PodDisruptionBudget add?
It helps if you have an application where you want high availability, e.g. one where it takes time to rebuild a cache after each crash.
There are both voluntary and involuntary disruptions. A PodDisruptionBudget can only limit the former, but both count against the budget.
An example of a voluntary disruption is when someone on your platform team decides to upgrade the kernel on all your nodes - sometimes you want to do this slowly, since all Pods on the drained node will be terminated and rescheduled to different nodes.
There are also involuntary disruptions, e.g. a disk crash on one of your nodes.
So if for example I define 2, there won't be fewer than 2 replicas. Then what does the PodDisruptionBudget add?
With minAvailable: 2 there will never be fewer than 2 pods available during a voluntary disruption. (Also note that the field is named maxUnavailable, not maxAvailable.)
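As an illustration (the name and the label app=my-app are placeholders, not from the posts), a minimal PodDisruptionBudget that keeps at least 2 pods available during voluntary disruptions could look like this:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app   # hypothetical label of the workload to protect
With this in place, kubectl drain will refuse to evict a matching pod if doing so would drop the number of available pods below 2.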

How to automatically do rebalancing of pods in aws eks (kubernetes ) across all nodes/workers

Suppose we have an EKS cluster of 4 nodes backed by an EC2 Auto Scaling group with a minimum of 4 nodes.
A Kubernetes application stack is deployed on it with one pod per node. Now traffic increases and the HPA is triggered.
Now there are 8 pods in total, two pods per node. Cluster autoscaling is also triggered, so there are now 6 nodes.
It is observed that all pods remain where they are, even after the autoscaling.
Is there a direct and simple way to get the following behaviour?
Some of the already-running pods should automatically move onto the additional nodes, i.e. be detected and rescheduled onto the recently added idle (non-utilized) workers, for example by force-evicting pods.
Thanks in advance.
One easy way is to delete all those pods by selector using the command below and let the Deployment recreate them in the cluster:
kubectl delete po -l key=value
There could be other possibilities; I would be glad to hear about them from others.
Take a look at the Descheduler. This project runs as a Kubernetes Job that aims at killing pods when it thinks the cluster is unbalanced.
The LowNodeUtilization strategy seems to fit your case:
This strategy finds nodes that are under utilized and evicts pods, if possible, from other nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes.
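A rough sketch of a descheduler policy enabling this strategy, using the descheduler's v1alpha1 policy format (the threshold values are illustrative assumptions, not from the question):
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:         # nodes below all of these are treated as underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:   # pods are evicted from nodes above these values
          "cpu": 50
          "memory": 50
          "pods": 50
Evicted pods are then recreated by their controllers and, ideally, scheduled onto the underutilized nodes.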
Another option is to apply a little chaos engineering manually by forcing a rolling update on your deployment; hopefully, the scheduler will fix the balance problem when the pods are recreated.
You can use kubectl rollout restart deployment/my-deployment. It's way better than simply deleting the pods with kubectl delete pod, as the rollout will ensure availability during the "rebalancing" (although deleting the pods altogether increases your chances of a better rebalance).

Kubernetes Deployment with Zero Down Time

As someone learning Kubernetes concepts, how they work, and how to deploy with them, I have a couple of cases I don't know how to achieve. I am looking for advice or some guidelines on how to achieve them.
I am using the Google Cloud Platform. The current flow is described below. A push to the Google Source Repository triggers Cloud Build, which creates a Docker image and pushes the image to the running cluster nodes.
Case 1: I want traffic to be routed to new pods only once they are up and running, and old pods to be killed only after each of them has completed its in-flight requests. Zero downtime is what I'm looking to achieve.
Case 2: What will happen if a running pod's disk usage reaches 100%, or, in the Debian case, the inode count reaches full capacity? Will Kubernetes create new pods to handle it?
Case 3: How do I manage pod-to-database connection limits?
Like the other answer says, use liveness and readiness probes. Basically, when a new pod is added to the Service pool it will only serve traffic after its readiness probe has passed. The old pod is removed from the Service pool, then drained, and then terminated. This happens in a rolling fashion, one pod at a time.
This really depends on the capacity of your cluster and the ability to schedule pods given the limits set for the containers in them. For more about setting up limits for containers, refer to the documentation. In terms of the inode limit, if you reach it on a node, the kubelet won't be able to run any more pods on that node. The kubelet eviction manager also has a mechanism where it evicts the pods using the most inodes. You can also configure your eviction thresholds on the kubelet.
This would be more of a limitation at the OS level combined with your stateful application's configuration. You can keep this configuration in a ConfigMap. For example, for MySQL the relevant option would be max_connections.
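A minimal sketch of such a ConfigMap (the name and value are illustrative assumptions, not from the answer):
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config   # hypothetical name
data:
  my.cnf: |
    [mysqld]
    max_connections = 200
The ConfigMap can then be mounted into the database pod as a configuration file.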
I can answer case 1 since I've done it myself.
Use Deployments with readinessProbes & livenessProbes.
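A rough sketch of what that can look like (image, port, probe paths and timings are placeholder assumptions, not from the question):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod before a new one is ready
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 30   # time for old pods to finish in-flight requests
      containers:
      - name: my-app
        image: my-registry/my-app:1.0   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /healthz   # placeholder health endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 10
With maxUnavailable: 0 the rollout only takes an old pod out of the Service after a new one has passed its readiness probe, and terminationGracePeriodSeconds gives the old pod time to finish its in-flight requests.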

How do I schedule the same pod on different nodes using kubectl scale?

New to kubernetes. Can I use kubectl scale --replicas=N and start pods on different nodes?
By default the scheduler attempts to spread pods across nodes, so that you don't have multiple pods of the same type on the same node. So there's nothing special required if you're just aiming for best-effort pod spreading.
If you want to express the requirement that the pod must not run on a node that already has a pod of that type on it you can use pod anti-affinity, which is currently an Alpha feature.
If you want to ensure that all nodes (or all nodes matching a certain selector) have that pod on them you can use a DaemonSet.
Scaling a Deployment (or RC) tells the controller-manager to create more pods; the new pods are then subject to scheduling. The Kubernetes scheduler will attempt to find the most reasonable placement for your pods. This does not guarantee that the pods will launch on different nodes, but makes it a rather likely scenario, if you have the required resources. Unfortunately it also means that if all pods can fit on one node, there are situations where the scheduler might actually do just that (e.g. all other nodes being in an unschedulable state for some reason). If that happens, the pods will not be rescheduled when conditions change.
To have a solid guarantee that pods will not get colocated on the same node you have two options:
legacy hack: define a hostPort in your pod template. As a given host port is a resource that can be assigned only once per node, your pods will never run more than once per node
alpha feature: you can look into Pod AntiAffinity, quite early and not really battle-proven yet (see the sketch after this list)
The first one has a disadvantage - you can never have more than one pod of this type per node, so it affects e.g. rolling deployments and limits your capacity for scaling (you can never have more active pods than nodes)
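A rough sketch of the anti-affinity option in its current requiredDuringSchedulingIgnoredDuringExecution form (the name, label and image are placeholders, not from the question):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname   # no two my-app pods on the same node
      containers:
      - name: my-app
        image: my-registry/my-app:1.0   # placeholder image
Note that with the required form, any replicas beyond the number of nodes will stay Pending; preferredDuringSchedulingIgnoredDuringExecution relaxes this to a soft preference.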