Prevent Kubernetes from scheduling everything to the same node - kubernetes

So I have 4 nodes currently, and Kubernetes, for some reason, decides to always schedule everything to the same node.
I'm not talking about replicas of the same deployment, so topologySpreadConstraints wouldn't apply there. In fact, when I scale up a deployment to several replicas, they get scheduled to different nodes. However, any new deployment and any new volume always go to the same node.
Affinity constraints also work: if I configure a pod to only schedule to a specific node (different from the usual one), it works fine. But anything else goes to the same node. Is this considered normal? The node is at 90% utilization, and even when it crashes completely, Kubernetes happily schedules everything to it again.

Okay, so this was a very specific issue, and I'm not sure whether I actually resolved it, but it seems to be working now.
This was a cluster deployed on Hetzner using the Hetzner cloud controller manager. I had been removing and adding nodes to the cluster and, as it turns out, I forgot to add the flag --cloud-provider=external to this one node's kubelet.
This issue is pretty well known. In my case it showed up as a "missing prefix" event, so I never thought it was related.
To solve it, adding the flag and restarting the kubelet was not enough for me, so I had to drain the node, remove it from the cluster, rebuild it from scratch and re-join it with the correct flag. This not only fixed the "missing prefix" issue, it also seems to have solved the scheduling issue, though I'm not sure why.
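In case it helps anyone: if the node is joined with kubeadm, the flag can be set via kubeletExtraArgs in the join configuration instead of editing the kubelet unit by hand. This is only a minimal sketch assuming a kubeadm-based setup (discovery settings omitted); your cluster may be built differently.

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Tell the kubelet that an external cloud controller manager
    # (the Hetzner CCM in this case) is responsible for initializing the node.
    cloud-provider: external
# discovery/bootstrap-token settings omitted for brevity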

Related

Cloud Composer 2: prevent eviction of worker pods

I am currently planning to upgrade our Cloud Composer environment from Composer 1 to Composer 2. However, I am quite concerned about disruptions that could occur in Cloud Composer 2 due to the new autoscaling behavior inherited from GKE Autopilot. In particular, since nodes now auto-scale based on demand, it seems like nodes with running workers could be killed off if GKE thinks those workers could be rescheduled elsewhere. This would be bad because my code isn't currently very tolerant of retries.
I think that this can be prevented by adding the following annotation to the worker pods: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
However, I don't know how to add annotations to worker pods created by Composer (I'm not creating them myself, after all). How can I do that?
EDIT: I think this issue is made more complex by the fact that it should still be possible for the cluster to evict a pod once it's finished processing all its Airflow tasks. If the annotation is added but doesn't go away once the pod is finished processing, I'm worried that could prevent the cluster from ever scaling down.
So a more dynamic solution may be needed, perhaps one that takes into account the actual tasks that Airflow is processing.
If I have understood your problem correctly, could you please try this solution:
In the Cloud Composer environment, navigate to the Kubernetes Engine --> Workloads page in the GCP Console.
Find the worker pod you want to add the annotation to and click on the name of the pod.
On the pod details page, click on the Edit button.
In the Pod template section, find the Annotations field and click on the pencil icon to edit it.
In the Edit annotations field, add the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
Click on the Save button to apply the change.
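For reference, this is what the annotation looks like once it is on a worker pod's manifest. It is only an illustrative fragment with made-up names; the real worker pods are created and managed by Composer, so treat this purely as an illustration of where the annotation lives.

apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-example          # hypothetical name; Composer generates the real one
  annotations:
    # Tells the cluster autoscaler not to evict this pod when scaling down.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: airflow-worker
      image: example/airflow-worker     # placeholder image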
Let me know if it works fine. Good luck.

Kubernetes CSI Driver: Mounting of volumes when pods run on different nodes

I am currently using the Hetzner CSI-Driver (https://github.com/hetznercloud/csi-driver) in Kubernetes, which works fine for the most part.
But sometimes I run into the issue that two pods using the same persistentVolumeClaim get scheduled onto different nodes. Since the persistentVolume is only mounted onto one node, all pods running on the other node fail with the error 'Unable to attach or mount volumes'.
That makes sense to me but I can't wrap my head around what the correct solution would be. I thought that CSI-Drivers which mount volumes told Kubernetes in some way "oh, this pod needs that volumeClaim? Then you need to schedule it onto that node because the mounted volume is currently in use there by another pod", so I don't understand why pods using the same claim even get scheduled onto different nodes.
Is my understanding of CSI drivers in general incorrect, or is there some way in which I can enforce that behaviour? Or am I using this wrong altogether and should I change the underlying configuration?
Any help is appreciated.
Currently I simply restart the pod until I get lucky and it is moved to the correct node and then everything works fine. But I assume that there is a more elegant solution.
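A workaround some people use while a ReadWriteOnce volume can only be attached to one node is to pin all pods that share the claim onto the same node with inter-pod affinity. The sketch below only illustrates that idea under assumed names (data-pvc, the uses-claim label and the images are made up); it is not the CSI-level answer to the question.

# First pod: mounts the claim and carries a label the other pods can target.
apiVersion: v1
kind: Pod
metadata:
  name: data-writer                    # hypothetical
  labels:
    uses-claim: data-pvc
spec:
  containers:
    - name: app
      image: example/app               # placeholder
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc
---
# Second pod: required pod affinity forces it onto the node where the first pod
# (and therefore the attached volume) already lives.
apiVersion: v1
kind: Pod
metadata:
  name: data-reader                    # hypothetical
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              uses-claim: data-pvc
          topologyKey: kubernetes.io/hostname
  containers:
    - name: app
      image: example/app               # placeholder
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-pvc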

GKE won't scale down to a single node

I've seen other similar questions, but none that quite address our specific case, as far as I can tell.
We have a cluster where we run development environments. When we're not working, ideally, that cluster should go down to a single node. At the moment, no one is working, and I can see that there is one node where CPU/Mem/Disk are essentially at 0 percent, with only system pods on it. The other node has some stuff on it.
The cluster is set up to autoscale down to 1. Why won't it do so?
It will autoscale up to however many nodes we need when we spin up new environments, and down to 2 with no problem. But down to 1? No dice. When I manually delete the node that has only system pods and basically 0% usage, the cluster spins up a new one. I can't understand why.
Update/Clarification:
I've messed around with the configuration, so I'm not sure exactly what system pods were running, but I'm almost certain they were all DaemonSet-controlled. So, even after manually destroying a node and having everything non-system rescheduled, a new node would still pop up, with no workloads specifically triggering the scale-up to 2.
Just to make sure I wasn't making things up, I've re-organized things so that there's just a single node running with no autoscaling, and it has plenty of excess capacity with everything running nicely. As far as I can tell, nothing new got scheduled onto that single node.
It looks like you might not have checked the scale-down limitations of the GKE cluster autoscaler. Please read through them; you may need to add a Pod Disruption Budget (PDB).
Occasionally, the cluster autoscaler cannot scale down completely and an extra node exists after scaling down. This can occur when required system Pods are scheduled onto different nodes, because there is no trigger for any of those Pods to be moved to a different node. See "I have a couple of nodes with low utilization, but they are not scaled down. Why?". To work around this limitation, you can configure a Pod disruption budget.
By default, kube-system pods prevent Cluster Autoscaler from removing nodes on which they are running. Users can manually add PDBs for the kube-system pods that can be safely rescheduled elsewhere:
kubectl create poddisruptionbudget <pdb name> --namespace=kube-system --selector app=<app name> --max-unavailable 1
You can read more at: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
Don't forget to check out the limitations of GKE cluster autoscaling: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler#limitations
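As a rough YAML equivalent of the kubectl command above, a PodDisruptionBudget for a kube-system workload could look something like this. The kube-dns name and the k8s-app label are assumptions for the example; check the actual labels of whichever kube-system pods are blocking scale-down in your cluster.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb                   # hypothetical name
  namespace: kube-system
spec:
  maxUnavailable: 1                    # let the autoscaler evict one replica at a time
  selector:
    matchLabels:
      k8s-app: kube-dns                # assumed label; verify with kubectl get pods -n kube-system --show-labels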

How to tell Kubernetes to not reschedule a pod unless it dies?

Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason is that rescheduling kills all user sessions and the users have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a tomcat/java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit in telling k8s to use only one pod. That is not the "spirit" of k8s. In this case it might be better to use a dedicated machine for your app.
But you can assign a pod to a specific node - Assigning Pods to Nodes. That should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only available on node xy).
k8s doesn't restart your pods for fun. It restarts them when there is a reason (node died, app died, ...), and I have never noticed a "random reschedule" in a cluster. It is hard to say, without any further information (like the deployment, logs, cluster), what exactly happened to you.
And for your comment: there are different types of recreation; one of them starts a fresh instance and only kills the old one once the new one has started successfully. Look here: Kubernetes deployment strategies
All points together:
Don't force your app onto a specific node - k8s will select the node "smartly".
There are normally no planned reschedules in k8s.
k8s will recreate pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or is someone/something deleting your pod?
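Since the app takes ~20 minutes to start and a liveness endpoint exists, it is also worth double-checking that the probes give it enough time before the kubelet starts killing it; an over-eager liveness probe is a common cause of pods being restarted "for no reason". A rough sketch of such probe settings (the /health path, port and timings are assumptions, not values from the question):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-app                     # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: legacy-app
  template:
    metadata:
      labels:
        app: legacy-app
    spec:
      containers:
        - name: app
          image: example/legacy-app    # placeholder image
          ports:
            - containerPort: 8080
          # Give the app up to ~25 minutes to come up before liveness takes over.
          startupProbe:
            httpGet:
              path: /health            # assumed liveness endpoint
              port: 8080
            periodSeconds: 30
            failureThreshold: 50       # 50 * 30s = 25 minutes
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 30
            failureThreshold: 3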

Ignite ReadinessProbe

Deploying an Ignite cluster within Kubernetes, I came across an issue that prevents cluster members from joining the group. If I use a readinessProbe and a livenessProbe, even with a delay as low as 10 seconds, the nodes never join each other. If I remove those probes, they find each other just fine.
So, my question is: can you use these probes to monitor node health, and if so, what are appropriate settings? On top of that, what would be good, fast health checks for Ignite anyway?
Update:
After posting on the Ignite mailing list, it looks like StatefulSets are the way to go. (Thanks Dmitry!)
I think I'm going to leave in the logic below to self-heal any segmentation issues, although hopefully it won't be triggered often.
Original answer:
We are having the same issue, and I think we have a workable solution. The Kubernetes discovery SPI lists services as they become ready.
This means that if there are no ready pods at startup time, the Ignite instances all think that they are the first and create their own grid.
The cluster should be able to self-heal if we have a deterministic way to fail pods when they aren't part of an 'authoritative' grid.
In order to do this, we keep a reference to the TcpDiscoveryKubernetesIpFinder and use it to periodically check the list of Ignite pods.
If the instance is part of a cluster that doesn't contain the alphabetically first IP in the list, we know we have a segmented topology. Killing the pods that get into that state should cause them to come up again, look at the service list and join the correct topology.
I am facing the same issue, using Ignite embedded within a Java Spring application.
As you said, the readinessProbe on the Kubernetes Deployment's spec.template.spec.containers has the side effect of preventing the Kubernetes Pods from being listed as Endpoints on the related Kubernetes Service.
Trying without any readinessProbe, it does indeed seem to work better (the Ignite nodes all join the same Ignite cluster).
Yet this has the undesired side effect of exposing the Kubernetes Pods before they are ready, as Spring has not yet fully started...
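For reference, this is roughly the kind of probe configuration being discussed, as a fragment of the Deployment's container spec. The image, path, port and timings are placeholders (finding a good, fast Ignite health check is exactly the open question above), so treat this purely as an illustration of where the probe sits and what it affects.

# Fragment of spec.template.spec from the Deployment, for illustration only.
spec:
  template:
    spec:
      containers:
        - name: ignite
          image: apacheignite/ignite   # placeholder image
          readinessProbe:
            httpGet:
              path: /health            # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 10    # the "delay as low as 10 seconds" from the question
            periodSeconds: 5
          # While this probe fails, the pod is NOT listed as a Service endpoint,
          # which is what hides it from the Kubernetes IP finder during startup.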