Disable auto rescheduling for a pod - kubernetes

On k8s cluster (GCP) during nodes auto-scaling, my pods are rescheduled automatically. The main problem that they perform computations and keep results in memory during auto-scaling. Because of rescheduling, pods lose all results and tasks.
I want to disable rescheduling for specified pods. I know a few possible solutions:
nodeSelector (not very flexible due to the dynamic nature of a cluster)
pod disruption budget PDB
I have tried PDB and set minAvailable = 1 but it didn't work. I found that you can also set maxUnavailable=0, will it more effective? I didn't understand exactly the behaviour if maxUnavailable when it's set to 0. Could you explain it more? Thank you!
Link for more details - https://github.com/dask/dask-kubernetes/issues/112

Setting max unavailable to 0 is a way to go and also, using nodepools can be a good workaround.
gcloud container node-pools create <nodepool> --node-taints=app=dask-scheduler:NoSchedule
gcloud container node-pools create <nodepool> --node-labels app=dask-scheduler
This will create the nodepool with the label app=dask-scheduler, after in the pod spec, you can do this:
nodeSelector:
app: dask-scheduler
And put the dask scheduler on a node-pool that doesn't autoscale.
There's an object called PDB where in its spec you can set maxUnavailable
in the example of maxUnavailable=1, this means if you had 100 pods defined, always make sure there is only one removed/drained/re-scheduled at a time
in the case of maxUnavailable, if you have 2 pods, and you set maxUnavailable to 0, it will never remove your pods. It being the scheduler
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: zk-pdb
spec:
maxUnavailable: 1
selector:
matchLabels:
app: zookeeper

Are you specifying resource requests and limits?

Related

How to install pods (based on pod name) to specific nodes in K8s cluster?

I understand that the nodeSelector will help to move the pods to specific nodes with labels. But, say if I know the names for pods in advance and based on these names, how do I move these pods to different nodes having specific labels.
I am unable to understand as to how to use nodeSelector, affinity, antiAffinity in this case.
What would an example values.yaml look like?
I have labelled three nodes. Then, when I launch the 6 pods, they are equally divided among the nodes. Each pod has an index value at the end of the pod name. mypod-0, mypod-1 until mypod-5.
I want to have mypod-0 and mypod-3 on node 1, mypod-1 and mypod-4 on node 2 and so on.
We can use Pod Topology Constraints to decide how the pods need to spread across your cluster.
For Example as per you question we can deploy 2 pods in 1 node and other 2 pods in different node. To make this possible we need to set the topologyKey for nodes and use them while deploying the pod. This sample yaml shows the syntax of pod topology constraints.
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: nginx
image: nginx
Since you are using labels for naming the pods you can use nodeAffinity and nodeSelectors along with pod topology constraints. For more information regarding the combination of topology and affinity refer to this official k8 document

How to make kubernetes cluster elastic?

Helo i am running a .NET application in Azure Kubernetes Services as a 3 pod cluster (1 pod per node).
I am trying to understand how can i make my cluster elastic depending on load ?
How can i configure the deployment.yaml so that after a certain % of the cpu utilization and/or % of memory per pod it spawns another pod? The same thing when load decreases, how do i shut down instances.
Is there any guide/tutorial to set this up based on percentage (ideally) ?
The basic feature you need to use is called HorizontalPodAutoscaler or for short HPA. There you can configure cpu or memory limits and if the limit is exceeded, the pod replica number will be increased. E.g. from this walkthrough:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: php-apache
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: php-apache
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50
This will scale out the php-apache deployment, as soon as the pods cpu utilization is greater than 50 %. Be aware that calculating the resource utilization and the resulting number of replicas is not as intuitive, as it might seam. Also see docs (the whole page should be quite interesting too). You can also combine criteria for scale out.
There are also addons that help you scale based on other parameters, like the number of messages in a queue. Check out keda, they provide different scalers, like RabbitMQ, Kafka, AWS CloudWatch, Azure Monitor, etc.
And since you wrote
1 pod per node
you might be running a DaemonSet. In that case your only option to scale out would be to add additional nodes, since with daemonsets there is always exactly one pod per node. If that's the case you could think about using a Deployment combined with a PodAntiAffinity instead, see docs. By that you can configure pods to preferably run on nodes where pods of the same deployment are not running yet, e.g.:
[...]
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
[...]
From docs:
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a node that is in the same zone as one or more Pods with the label security=S2. More precisely, the scheduler should try to avoid placing the Pod on a node that has the topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the Security=S2 Pod label.
That would make scale out more flexible as it is with a daemonset, yet you have a similar effect of pods being equally distributed through out the cluster.
If you want/need to stick to a daemonset you can check out the AKS Cluster Autoscaler, that can be used to automatically add/remove additional nodes from your cluster, based on resource consumption.

Cluster Autoscaler and Horizontal Pod Autoscaler working together

I have a cluster with Cluster Autoscaler activated and HPA for one of my deployments.
This is the HPA definition:
kind: HorizontalPodAutoscaler
metadata:
name: hpa-resource-metrics-cpu
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: ReplicationController
name: hello-hpa-cpu
minReplicas: 1
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
targetAverageUtilization: 50
Now in a situation where my cluster is being used very lightly, that means this deployment will only have 1 available replica.
And since the cluster is not under high usage, it could be the case that the node containing that replica is scheduled for deletion (downscaling).
In that case, it would make my deployment have a downtime (when the cluster node is deleted, the only replica for the deployment is deleted as well, so it needs to be rescheduled in a new pod). I don't want that to happen (the downtime).
From this issue: https://github.com/kubernetes/kubernetes/issues/48307, it seems that Pod Disruption Budgets are not applicable to deployments with only 1 replica.
So the only solution to my problem would be to have minReplicas set to 2?
Or is there something else I could do to prevent this downtime, and still let minReplicas as 1?
Kubernetes has the notion of a disruption. The cluster autoscaler (or an administrator) taking a node offline is a "voluntary" disruption (as distinct from, say, the node losing power) and so you have some control over it. If you create a pod disruption budget:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
name: hello-pdb
spec:
minAvailable: 1
selector:
matchLabels:
app: hello
You have specified that there shouldn't be fewer than one pod, with a label app: hello, when the cluster tries to perform a voluntary disruption.
Doing this can prevent the cluster autoscaler from actually deleting the node. The examples in the PDB documentation generally have multiple replicas and can tolerate some of them being offline, so it's possible to delete 1 replica of 3 and recreate it on a different node. There is an extended example where there's not capacity in the cluster to start a rescheduled pod, and this blocks destroying a node. You might set the HPA to minReplicas: 3 to avoid this case, even if it means your system will be overprovisioned at the quietest times.

Distribute pods for a deployment across different node pools

In my GKE Kubernetes cluster, I have 2 node pools; one with regular nodes and the other with pre-emptible nodes. I'd like some of the pods to be on pre-emptible nodes so I can save costs while I have at least 1 pod on a regular non-pre-emptible node to reduce the risk of downtime.
I'm aware of using podAntiAffinity to encourage pods to be scheduled on different nodes, but is there a way to have k8s schedule pods for a single deployment across both pools?
Yes 💡! You can use Pod Topology Spread Constraints, based on a label 🏷️ key on your nodes. For example, the label could be type and the values could be regular and preemptible. Then you can have something like this:
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: type
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: app
image: myimage
You can also identify a maxSkew which means the maximum differentiation of a number of pods that one label value (node type) can have.
You can also combine multiple 'Pod Topology Spread Constraints' and also together with PodAffinity/AntiAffinity and NodeAffinity. All depending on what best fits your use case.
Note: This feature is alpha in 1.16 and beta in 1.18. Beta features are enabled by default but with alpha features, you need an alpha cluster in GKE.
☮️✌️

Is it possible for running pods on kubernetes to share the same PVC

I've currently set up a PVC with the name minio-pvc and created a deployment based on the stable/minio chart with the values
mode: standalone
replicas: 1
persistence:
enabled: true
existingClaim: minio-pvc
What happens if I increase the number of replicas? Do i run the risk of corrupting data if more than one pod tries to write to the PVC at the same time?
Don't use deployment for stateful containers. Instead use StatefulSets.
StatefulSets are specifically designed for running stateful containers like databases. They are used to persist the state of the container.
Note that each pod is going to bind a separate persistent volume via pvc. There is no possibility of multiple instances of pods writing to same pv. Hope I answered your question.
In case you are sticking to Deployments instead of StatefulSets it won't be feasible for multiple replicas to write to the same PVC, since there is no guarantee that the different replicas are scheduled on the same node, and so you might have a pending pod waiting to establish a connection to the volume and fail. The solution is to choose a specific node and have all your replicas run on the same node.
Run the following and assign a label to one of your nodes:
kubectl label nodes <node-name> <label-key>=<label-value>
Say we choose label-key to be labelKey and label-value to be node1. Then you can go ahead and add the following to your YAML file and have the pods scheduled on the same node:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
labels:
app: my-app
spec:
replicas: 3
template:
spec:
nodeSelector:
labelKey: node1
containers:
...