Two pods force deploy to different ICP workers - kubernetes

There is a cluster Kubernetes and IBM Cloud Private with two workers.
I have one deployment which creates two pods. How can I force deployment to install its pods on two different workers? In this case if I lost one icp worker I always have other with need pod.

If you want pods to not schedule on the same node, the correct concept that you will want to use is inter-pod anti-affinity. https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity-beta-feature
Observe:
spec:
replicas: 2
selector:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- my-app
topologyKey: kubernetes.io/hostname

You can create your pods as kubernetes DaemonSet. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. You can access below link to see details.
https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

In addition to #Santiclause answer regarding scheduling policy in affinity mode there are two different mods of affinity:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution.
While using requiredDuringSchedulingIgnoredDuringExecution affinity scheduler we need to make sure that all rules are met for a pod to be scheduled.
If you will have i.e. not enough nodes to spawn all pods the scheduler will wait forever until there will be enough nodes available.
If you use preferredDuringSchedulingIgnoredDuringExecution affinity scheduler it will try to spawn all replicas based on the highest score the nodes gets from the combination of defined rules and their weight.
Weight is a parameter used along with a rule, each rule can have a different weight. In order to calculate a Score for a node we use following logic:
For every node, we iterate through rules defined in the configuration (i.e. resource request, requiredDuringScheduling, affinity expressions, etc.). In case the rule is matched we add the weight value to the score for that node. Once all rules for all nodes are processed we will have a list of all nodes with their final score. The node(s) with the highest score are the most preferred.
Just to summarize, higher weight value will increase importance of a rule and will help scheduler to decide which node to choose.

Related

Schedule Kubernetes pods in the same failure zone

We have a deployment with a large replicas number ( > 1 ) that we must deploy in the same zone.
We stumbled upon this documentation section: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#an-example-of-a-pod-that-uses-pod-affinity
which explains how to schedule pods in zones that already have other pods that match certain labels.
however, there are no other pods that our deployment depends upon. all other workloads are replicated and spread across multiple zones, and this is the first deployment that we would like to keep in a single zone.
also, we thought about explicitly setting the zone for this deployment, but in case of zone failure, it will become unavailable until we notice and explicitly set it to another zone. so setting the exact zone won't work here.
any insights here? and thanks!
Pod Affinity affects how the pod is scheduled based on the presence or absence of other pods within the node. That would probably not serve the purpose you're trying to achieve.
You're probably better off using node affinity (it's on the same link you provided)
That would allow you to force to a zone, because each GKE node will have a failure-domain label which you can get doing this and looking through the results:
kubectl get node {name-of-node} -o json | jq ".metadata.labels"
The labels will read something like this:
"failure-domain.beta.kubernetes.io/region": "europe-west2",
"failure-domain.beta.kubernetes.io/zone": "europe-west2-b",
You can then combine this with nodeAffinity in your deployment yaml (parts snipped for brevity):
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
...
annotations:
...
name: my-deployment
spec:
replicas: 1
strategy: {}
selector:
matchLabels:
...
template:
metadata:
annotations:
...
labels:
...
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-b
This will force the pods generated by the deployment to all go onto nodes sitting in europe-west2-b
I could change this and make it like this:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: failure-domain.beta.kubernetes.io/zone
operator: In
values:
- europe-west2-b
- europe-west2-c
To allow it schedule in two zones (but it would not be able to schedule on to the europe-west2-a zone as a consequence)
I do not think there is a direct way to achieve this. I can think of two ways this can work.
Using Affinity on Pod and Node
Adding node affinity with preferredDuringSchedulingIgnoredDuringExecution for the regions you would want to target.
Adding pod affinity to itself with preferredDuringSchedulingIgnoredDuringExecution for pods to prefer to be with each other.
With this what should happen is when the first pod is about to be spun up it would match none of its preferred affinity but the scheduler will still schedule it. But once one is running for the rest of the pod there will be a pod with the correct affinity and they should all spin up. The challenge is there is a possibility of a race condition where multiple pods try to get scheduled and scheduler puts them in different locations once your first preferred zone is out.
Using Webhooks
You can use some mutating webhook to check the node label and add requiredDuringSchedulingIgnoredDuringExecution affinity to pods based on what zones you have still available.
The challenge here is you would most likely need to write and maintain this webhook yourself. I am not sure if you will find your exact usecase solved by someone else in open source. A quick search shows me this repo. I have not tested this but might give you a start.

How to distribute K8 Deployments evenly across nodes with Kubernetes?

I have 12 K8 deployments that should be distributed somewhat evenly across 3 K8 nodes based on resource utilization (like the uptime command). I expected Kubernetes to automatically choose the node that is utilizing the least amount of resources at the time of pod creation, which I would think should result in a somewhat even distribution, but to my surprise Kubernetes is creating the majority of the Pods on the same single node that is barely handling the load, while the other nodes are not being utilized almost at all.
I heard about using topologySpreadConstraints like so
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
type: wordpress-3
But I cant get it to work properly, what is the correct way to achieve the even distribution behavior of deployments that I am looking for? thanks!
Are you using bitnami's wordpress chart?
If so you can update the values.yaml you pass into the chart and set anti-affinity like this:
# Affinity
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
app.kubernetes.io/instance: {name of your Wordpress release}
This will force kubernetes to only allow one Wordpress pod on one host (i.e. node). I use this setup on my own Wordpress installations and it means if one node goes down, it doesn't take out the site as the other replicas will still be running and on separate nodes

Kubernetes : How to ensure one pod gets scheduled on each worker node?

Goal : Have one pod (namely 'log-scraper') get scheduled on every node at least once but no more than once
Assume a cluster has the following nodes
Nodes
master/control-plane
worker-1
worker-2
worker-2
Pod I'm working with
apiVersion: v1
kind: Pod
metadata:
name: log-scraper
spec:
volumes:
- name: container-log-dir
hostPath:
path: /var/log/containers
containers:
- image: "logScraper:latest"
name: log-munger
volumeMounts:
- name: container-log-dir
mountPath: /var/log/logging-app
Adding affinity to select only 'worker' nodes (or non-mater nodes)
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "worker"
operator: In
values:
- "true"
Question 1: How do I ensure every node runs ONE-AND-ONLY-ONE pod of type log-scraper
Question 2: What other manifests should be applied/added to achieve this?
You should probably use Daemonsets which are exactly made for this purpose of scheduling one pod per node and gets automatically added to new nodes in case of cluster autoscaling.
Concept
There are two important things when it comes to assigning Pods to Nodes - "Affinity" and "AntiAffinity".
Affinity will basically select based on given criteria while anti-affinity will avoid based on given criteria.
With Affinity and Anti-affinity, you can use operators like In, NotIn, Exist, DoesNotExist, Gt and Lt. When you use NotIn and DoesNotExist, then it becomes anti-affinity.
Now, in Affinity/Antiaffinity, you have 2 choices - Node affinity/antiaffinity and Inter-pod affinity/antiaffinity
Node affinity/antiaffinity
Node affinity is conceptually similar to nodeSelector -- it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on the node.
Inter-pod affinity/antiaffinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes.
Your Solution
Basically what you need is "Antiaffinity" and in that "Pod antiaffinity" instead of Node. So, your solution should look something like below (please note that since I do not have 3 Node cluster so I couldn't test this, so thin chances that you might have to do minor code adjustment):
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
labelSelector:
- matchExpressions:
- key: worker
operator: In
values:
- log-scraper
Read more over here, and especially go through example over here.
Using Pod Topology Spread Constraints
Another way to do it is using Pod Topology Spread Constraints.
You will set up taints and tolerances as usual to control on which nodes the pods can be scheduled. Then add some labels to the pod. I will use the pod label id: foo-bar in the example. Then to allow only a single pod from a replicaSet, deployment or other to be scheduled per node add following in the pod spec.
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
id: foo-bar
topologyKey is the label of nodes. The kubernetes.io/hostname is a default label set per node. Put pod labels inside matchLabels. Create the resources and kubescheduler should schedule a single pod with the matching labels per node.
To learn more, check out the documentation here and also this excellent blog post.

Ignore pod anti affinity during deployment update

Imagine in a Master-Node-Node setup where you deploy a service with pod anti-affinity on the Nodes: An update of the Deployment will cause another pod being created but the scheduler not being able to schedule, because both Nodes have the anti-affinity.
Q: How could one more flexibly set the anti-affinity to allow the update?
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api
topologyKey: kubernetes.io/hostname
With an error
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (2), PodToleratesNodeTaints (1).
Look at Max Surge
If you set Max Surge = 0, you are telling Kubernetes that you won't allow it to create more pods than the number of replicas you have setup for the deployment. This basically forces Kubernetes to remove a pod before starting a new one, and thereby making room for the new pod first, getting you around the podAntiAffinity issue. I've utilized this mechanism myself, with great success.
Config example
apiVersion: extensions/v1beta1
kind: Deployment
...
spec:
replicas: <any number larger than 1>
...
strategy:
rollingUpdate:
maxSurge: 0
maxUnavailable: 1
type: RollingUpdate
...
template:
...
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api
topologyKey: kubernetes.io/hostname
Warning: Don't do this if you only have one replica, as it will cause downtime because the only pod will be removed before a new one is added. If you have a huge number of replicas, which will make deployments slow because Kubernetes can only upgrade 1 pod at at a time, you can crank up maxUnavailable to enable Kubernetes to remove a higher number of pods at a time.
Here are a few methods which may work:
Modify your deployment rollingUpdate strategy 'Max Unavailable' parameter be at least 1, which allows for one pod of the old deployment to be destroyed immediately, making room for one pod of the new deployment to be created.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
Modify the deployment pod spec to use "soft" anti-affinity instead of "hard".
Pod Anti-Affinity
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (just like nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee.

How to specify pod to node affinity in kubernetes

How can I configure a specific pod to run on a multi-node kubernetes cluster so that it would restrict the containers of the POD to a subset of the nodes.
E.g. let's say I have A, B, C three nodes running mu kubernetes cluster.
How to limit a Pod to run its containers only on A & B, and not on C?
You can add label to nodes that you want to run pod on and add nodeSelector to pod configuration. The process is described here:
http://kubernetes.io/docs/user-guide/node-selection/
So basically you want to
kubectl label nodes A node_type=foo
kubectl label nodes B node_type=foo
And you want to have this nodeSelector in your pod spec:
nodeSelector:
node_type: foo
Firstly, you need to add label to nodes. You can refer to Nebril's answer of how to add label.
Then I recommend you to use the node affinity feature to constrain pods to nodes with particular labels. Here is an example.
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node_type
operator: In
values:
- foo
containers:
- name: with-node-affinity
image: gcr.io/google_containers/pause:2.0
Compared to nodeSelector, the affinity/anti-affinity feature greatly expands the types of constraints you can express.
For more detailed information about node affinity, you can refer to the following link: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
I am unable to post a comment to previous replies but I upvoted the answer that is complete.
nodeSelector is not strictly enforced and relying on it may cause some grief including Master getting overwhelmed with requests as well as IOs from pods scheduled on Master. At the time the answer was provided it may still have been the only option but in the later versions that is not so.
Definitely use nodeAffinity or nodeAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution
For more expressive filters for scheduling peruse: Assigning pods to nodes