Hybrid between replicaset and daemonset - kubernetes

Is there such a thing as a hybrid between a ReplicaSet and a DaemonSet?
I want to specify that I always want to have 2 pods up, but those pods must
never be on the same node (and I have about 10 nodes).
Is there a way I can achieve this?

In a Deployment or ReplicaSet you can use podAffinity and podAntiAffinity.
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node rather than based on labels on nodes.
The rules are of the form “this pod should (or, in the case of anti-affinity, shouldn't) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
With the above manifest, the two nginx pods will never be scheduled on the same node.
Find more details in the official docs.
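To check that the rule took effect after applying the manifest above, a standard kubectl query such as the following should list the two pods on different nodes (the label matches the example's app: nginx):

kubectl get pods -l app=nginx -o wide

The NODE column in the output shows where each replica landed.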

Related

Is it possible to assign a pod of StatefulSet to a specific node of a Kubernetes cluster?

I have a 5-node cluster (1 master / 4 workers). Is it possible to configure a StatefulSet so that I can make a pod (or pods) run on a given node, knowing it has sufficient capacity, rather than the Kubernetes scheduler making this decision?
Let's say my StatefulSet creates 4 pods (replicas: 4) named myapp-0, myapp-1, myapp-2 and myapp-3. Now what I am looking for is:
myapp-0 pod-- get scheduled over---> worker-1
myapp-1 pod-- get scheduled over---> worker-2
myapp-2 pod-- get scheduled over---> worker-3
myapp-3 pod-- get scheduled over---> worker-4
Please let me know if this can be achieved somehow. Because if I add a toleration to the pods of a StatefulSet, it will be the same for all of them, and all of them will get scheduled onto a single node matching the taint.
Thanks, J
You can delegate responsibility for scheduling arbitrary subsets of pods to your own custom scheduler(s) that run(s) alongside, or instead of, the default Kubernetes scheduler.
You can write your own custom scheduler. A custom scheduler can be written in any language and can be as simple or complex as you need. Below is a very simple example of a custom scheduler written in Bash that assigns a node randomly. Note that you need to run this along with kubectl proxy for it to work.
#!/bin/bash
# Requires `kubectl proxy` to be running so the API is reachable on localhost:8001.
SERVER='localhost:8001'
while true; do
  # Find pods that request this scheduler and have not been assigned a node yet.
  for PODNAME in $(kubectl --server $SERVER get pods -o json \
      | jq '.items[] | select(.spec.schedulerName == "my-scheduler") | select(.spec.nodeName == null) | .metadata.name' \
      | tr -d '"'); do
    # Pick a random node and bind the pod to it via the Binding subresource.
    NODES=($(kubectl --server $SERVER get nodes -o json | jq '.items[].metadata.name' | tr -d '"'))
    NUMNODES=${#NODES[@]}
    CHOSEN=${NODES[$((RANDOM % NUMNODES))]}
    curl --header "Content-Type:application/json" --request POST \
      --data '{"apiVersion":"v1", "kind": "Binding", "metadata": {"name": "'$PODNAME'"}, "target": {"apiVersion": "v1", "kind": "Node", "name": "'$CHOSEN'"}}' \
      http://$SERVER/api/v1/namespaces/default/pods/$PODNAME/binding/
    echo "Assigned $PODNAME to $CHOSEN"
  done
  sleep 1
done
Then, in the pod template section of your StatefulSet configuration file, add a schedulerName: your-scheduler line; the value must match the name your custom scheduler filters on (see the sketch below).
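A minimal sketch of where that line goes, assuming the scheduler above is named my-scheduler and a hypothetical StatefulSet called myapp (to match the myapp-0..3 pods in the question):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp              # hypothetical name
spec:
  serviceName: myapp
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      schedulerName: my-scheduler   # hand these pods to the custom scheduler above
      containers:
      - name: myapp
        image: nginx     # placeholder image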
You can also use pod affinity and anti-affinity.
Example:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
The yaml snippet below for the web-server StatefulSet has both podAntiAffinity and podAffinity configured. This tells the scheduler that all of its replicas are to be co-located with pods that have the selector label app=store, while also ensuring that no two web-server replicas are co-located on a single node.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine
If we create the above two StatefulSets, our three-node cluster should look like below.
node-1         node-2         node-3
webserver-1    webserver-2    webserver-3
cache-1        cache-2        cache-3
The above example uses a podAntiAffinity rule with topologyKey: "kubernetes.io/hostname" to deploy the redis cluster so that no two instances are located on the same host.
You can also simply define the three replicas of a specific pod and a particular pod configuration file for each, e.g. using nodeName:
There is the nodeName field, which is the simplest form of node selection constraint, but due to its limitations it is typically not used. nodeName is a field of PodSpec. If it is non-empty, the scheduler ignores the pod and the kubelet running on the named node tries to run the pod. Thus, if nodeName is provided in the PodSpec, it takes precedence over the above methods for node selection.
Here is an example of a pod config file using the nodeName field:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-worker-1
More information about custom schedulers: custom-scheduler.
Also take a look at this article: assigning-pods-kubernetes.
You can use the following KubeMod ModRule:
apiVersion: api.kubemod.io/v1beta1
kind: ModRule
metadata:
  name: statefulset-pod-node-affinity
spec:
  type: Patch
  match:
    # Select pods named myapp-xxx.
    - select: '$.kind'
      matchValue: Pod
    - select: '$.metadata.name'
      matchRegex: myapp-.*
  patch:
    # Patch the selected pods such that their node affinity matches nodes that contain a label with the name of the pod.
    - op: add
      path: /spec/affinity/nodeAffinity/requiredDuringSchedulingIgnoredDuringExecution
      value: |-
        nodeSelectorTerms:
          - matchExpressions:
              - key: accept-pod/{{ .Target.metadata.name }}
                operator: In
                values:
                  - 'true'
The above ModRule will monitor for the creation of pods named myapp-* and will inject a nodeAffinity section into their resource manifest before they get deployed. This will instruct the scheduler to schedule the pod to a node which has a label accept-pod/<pod-name> set to true.
Then you can assign future pods to nodes by adding labels to the nodes:
kubectl label node worker-1 accept-pod/myapp-0=true
kubectl label node worker-2 accept-pod/myapp-1=true
kubectl label node worker-3 accept-pod/myapp-2=true
...
After the above ModRule is deployed, creating the StatefulSet will trigger the creation of its pods, which will be intercepted by the ModRule. The ModRule will dynamically inject the nodeAffinity section using the name of the pod.
If, later on, the StatefulSet is deleted, deploying it again will lead to the pods being scheduled on the same exact nodes as they were before.
You can do this using nodeSelector and node affinity (take a look at this guide https://kubernetes.io/docs/concepts/configuration/assign-pod-node/); either can be used to run pods on specific nodes. But if a node has taints (restrictions), you also need to add tolerations for those nodes (more can be found here https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/). With this approach you can specify a list of nodes to be used for your pod's scheduling; the catch is that if you specify, for example, 3 nodes and you have 5 pods, you have no control over how many pods run on each of those nodes. They get distributed by kube-scheduler. A minimal sketch follows.
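A minimal sketch of that combination in a pod spec; the node label (role: myapp) and the taint (dedicated=myapp:NoSchedule) are hypothetical and only show where the fields go:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pinned
spec:
  # Only nodes carrying this (hypothetical) label are eligible.
  nodeSelector:
    role: myapp
  # Needed only if the target nodes are tainted, e.g. with dedicated=myapp:NoSchedule.
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "myapp"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx   # placeholder image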
Another relevant use case: if you want to run one pod on each of the specified nodes, you can create a DaemonSet and select nodes using nodeSelector (a sketch follows).
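A sketch of that DaemonSet variant, again assuming a hypothetical role=myapp node label:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: myapp-daemon
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # One pod per node, but only on nodes carrying the (hypothetical) label.
      nodeSelector:
        role: myapp
      containers:
      - name: app
        image: nginx   # placeholder image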
You can use podAntiAffinity to distribute replicas to different nodes.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
This would deploy web-0 on worker1, web-1 on worker2, web-2 on worker3 and web-3 on worker4.
Take a look at this guideline: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
However, what you are looking for is the nodeSelector directive, which should be placed in the pod spec (a sketch follows).
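A sketch of where the nodeSelector directive sits in a StatefulSet's pod template; the myapp names and the worker: one node label are hypothetical:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: myapp
spec:
  serviceName: myapp
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # nodeSelector lives in the pod spec, next to containers.
      nodeSelector:
        worker: one    # hypothetical node label
      containers:
      - name: myapp
        image: nginx   # placeholder image

Note that, as the question already observes for tolerations, every replica of one StatefulSet shares this selector, so on its own it cannot pin myapp-0 and myapp-1 to different nodes.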

Kubernetes pod with multiple replicas - anti-affinity rule

I have a question about the K8S pod anti-affinity rule: what I have and what I need.
I have a K8S cluster spanning 2 data centers. Each DC has, for example, 5 nodes. I have a Deployment for a Pod that runs in 10 replicas across all 10 nodes, one replica per node.
And I want to set up a rule so that, if one DC crashes, the 5 replicas from the crashed DC are not migrated to the healthy DC.
I found that this could be possible through an "anti-affinity" rule, but I can't find any example for this scenario. Do you have an example for it?
From the documentation
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
you need to set a selector on your deployment and indicate in the anti-affinity section which label value must match for the anti-affinity to apply:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
You can see that it uses a label selector that tries to find the key app with the value store; this means that if a node already has a pod with that label and value, Kubernetes will apply anti-affinity.
Look at DaemonSet. It deploys one replica on each node.
If one DC crashes, the pods will not be redeployed to the other DC.

Are 2 OpenShift pod replicas deployed on two different nodes (when #nodes > 2)?

Assume I have a cluster with 2 nodes and a pod with 2 replicas. Can I have the guarantee that my 2 replicas are deployed on 2 different nodes, so that when a node is down the application keeps running? Does the scheduler by default work in a best-effort mode to assign the 2 replicas to distinct nodes?
Pod anti-affinity
Pod anti-affinity can be used to repel pods from each other, so that no two such pods are scheduled on the same node.
Use the following configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: nginx
This uses the anti-affinity feature, so if you have more than 2 nodes there is a guarantee that no two of these pods will be scheduled on the same node.
You can use kind: DaemonSet. Here is a link to the Kubernetes DaemonSet documentation.
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
Here is a link to the documentation about DaemonSets in OpenShift.
An example might look like the following. This is available on OpenShift >= 3.2; the use case is to run a specific docker container (veermuchandi/welcome) on all nodes (or on a set of nodes with a specific label).
Enable HostPorts exposure on OpenShift:
$ oc edit scc restricted    # as system:admin user
Change allowHostPorts: true and save.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: welcome
spec:
  template:
    metadata:
      name: welcome
      labels:
        daemon: welcome
    spec:
      containers:
      - name: c
        image: veermuchandi/welcome
        ports:
        - containerPort: 8080
          hostPort: 8080
          name: serverport
$ oc create -f myDaemonset.yaml #with system:admin user
Source available here
A DaemonSet is not a good option here. It schedules one pod on every node, so if you scale the cluster later the pods scale up to as many as there are nodes. Instead, use pod anti-affinity to schedule no more than one such pod on any node.

How can I distribute a deployment across nodes?

I have a Kubernetes deployment that looks something like this (replaced names and other things with '....'):
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubernetes.io/change-cause: kubectl replace deployment .... -f - --record
  creationTimestamp: 2016-08-20T03:46:28Z
  generation: 8
  labels:
    app: ....
  name: ....
  namespace: default
  resourceVersion: "369219"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/....
  uid: aceb2a9e-6688-11e6-b5fc-42010af000c1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ....
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ....
    spec:
      containers:
      - image: gcr.io/..../....:0.2.1
        imagePullPolicy: IfNotPresent
        name: ....
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: "0"
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 2
  observedGeneration: 8
  replicas: 2
  updatedReplicas: 2
The problem I'm observing is that Kubernetes places both replicas (in the deployment I've asked for two) on the same node. If that node goes down, I lose both containers and the service goes offline.
What I want Kubernetes to do is to ensure that it doesn't double up containers on the same node where the containers are the same type - this only consumes resources and doesn't provide any redundancy. I've looked through the documentation on deployments, replica sets, nodes etc. but I couldn't find any options that would let me tell Kubernetes to do this.
Is there a way to tell Kubernetes how much redundancy across nodes I want for a container?
EDIT: I'm not sure labels will work; labels constrain where a pod will run so that it has access to local resources (SSDs) etc. All I want to do is ensure no downtime if a node goes offline.
There is now a proper way of doing this.
You can use topology spread constraints with kubernetes.io/hostname as the topologyKey if you just want to spread the pods across all nodes: with two replicas of a pod and two nodes, each node should get one pod (provided the node names differ).
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
I think you're looking for the affinity/anti-affinity selectors.
Affinity is for co-locating pods, so I want my website to try to schedule on the same host as my cache, for example. Anti-affinity is the opposite: don't schedule on a host according to a set of rules.
So for what you're doing, I would take a closer look at these two links:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#never-co-located-in-the-same-node
https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure
If you create a Service for that Deployment before creating the Deployment itself, Kubernetes will spread your pods across nodes. This behavior comes from the scheduler; it is provided on a best-effort basis, provided that you have enough resources available on both nodes.
From the Kubernetes documentation (Managing Resources):
it’s best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
Also related: Configuration best practices - Service.
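A sketch of that ordering with a hypothetical app: my-app label; the Service is applied first, then a Deployment whose pod template carries the same label:

# Apply this Service first ...
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app        # hypothetical label shared with the Deployment's pods
  ports:
  - port: 80
    targetPort: 8080
# ... then apply the Deployment whose pod template has the matching
# label (app: my-app), so the scheduler can take the Service into account.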
I agree with Antoine Cotten: use a Service for your Deployment. A Service keeps the application reachable while the Deployment controller creates a new pod if, for some reason, one pod dies on a certain node. However, if you just want to distribute a Deployment among all nodes, you can use pod anti-affinity in your pod manifest file. I put an example on my GitLab page, which you can also find in the Kubernetes blog. For your convenience, I'm providing the example here as well.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
In this example, the Deployment's pods carry the label app with the value nginx. In the pod spec, podAntiAffinity restricts two pods with the same label (app: nginx) from running on one node. You can also use podAffinity if you would like to place multiple Deployments on one node.
If a node goes down, any pods running on it would be restarted automatically on another node.
If you start specifying exactly where you want them to run, then you actually lose the capability of Kubernetes to reschedule them on a different node.
The usual practice therefore is to simply let Kubernetes do its thing.
If however you do have valid requirements to run a pod on a specific node, due to requirements for certain local volume type etc, have a read of:
http://kubernetes.io/docs/user-guide/node-selection/
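For that kind of requirement, a minimal nodeSelector sketch; the disktype=ssd label is hypothetical and would first have to be added to the relevant nodes with something like kubectl label nodes <node-name> disktype=ssd:

apiVersion: v1
kind: Pod
metadata:
  name: needs-local-ssd
spec:
  # Restrict scheduling to nodes carrying the (hypothetical) disktype=ssd label.
  nodeSelector:
    disktype: ssd
  containers:
  - name: app
    image: nginx   # placeholder image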
Maybe a DaemonSet will work better. I'm using DaemonSets with nodeSelector to run pods on specific nodes and avoid duplication.
http://kubernetes.io/docs/admin/daemons/
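The node-side half of that setup is just labelling the target nodes; the run-daemon=true label and node names below are hypothetical:

# Label the nodes the DaemonSet's nodeSelector should match.
kubectl label nodes worker-1 worker-2 run-daemon=true

# Verify which nodes carry the label.
kubectl get nodes -l run-daemon=true

The DaemonSet's pod template would then select those nodes with nodeSelector: run-daemon: "true".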

Avoiding kubernetes scheduler to run all pods in single node of kubernetes cluster

I have one Kubernetes cluster with 4 nodes and one master. I am trying to run 5 nginx pods across the nodes. Currently the scheduler sometimes runs all the pods on one machine and sometimes on different machines.
What happens if a node goes down while all my pods were running on that same node? We need to avoid this.
How can I force the scheduler to run pods on the nodes in a round-robin fashion, so that if any node goes down, at least one node still has an NGINX pod running?
Is this possible or not? If possible, how can we achieve this scenario?
Use podAntiAffinity
Reference: Kubernetes in Action, Chapter 16: Advanced scheduling.
podAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution can be used to prevent the same kind of pod from being scheduled to the same hostname. If you prefer a more relaxed constraint, use preferredDuringSchedulingIgnoredDuringExecution (see the sketch after the example below).
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: do not schedule an "nginx" pod onto a host
          # that is already running one.
          requiredDuringSchedulingIgnoredDuringExecution:
          # The anti-affinity scope is the host.
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: nginx
      containers:
      - name: nginx
        image: nginx:latest
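For the relaxed constraint mentioned above, the rule becomes a weighted preference instead of a hard requirement; a sketch of only the affinity section, keeping the same app: nginx label:

      affinity:
        podAntiAffinity:
          # Soft preference: spread if possible, but still schedule when it is not.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: nginx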
Kubelet --max-pods
You can specify the maximum number of pods per node in the kubelet configuration, so that if one or more nodes go down, K8S is prevented from saturating the remaining nodes with pods from the failed node.
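A sketch of that limit in a kubelet configuration file (the same value can also be passed as the --max-pods flag); 30 is just an illustrative number:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Upper bound on the number of pods this node will accept (illustrative value).
maxPods: 30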
Use Pod Topology Spread Constraints
As of 2021 (v1.19 and up) you can use Pod Topology Spread Constraints (topologySpreadConstraints) by default, and I found them more suitable than podAntiAffinity for this case.
The major difference is that anti-affinity can restrict to only one pod per node, whereas Pod Topology Spread Constraints can restrict to N pods per node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx-example
  template:
    metadata:
      labels:
        app: nginx-example
    spec:
      containers:
      - name: nginx
        image: nginx:latest
      # This sets how evenly the pods are spread.
      # For example, if there are 3 nodes available,
      # 2 pods are scheduled on each node.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx-example
For more details see KEP-895 and an official blog post.
I think the inter-pod anti-affinity feature will help you.
Inter-pod anti-affinity allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels on pods that are already running on the node. Here is an example.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: nginx-service
  name: nginx-service
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx-service
  template:
    metadata:
      labels:
        run: nginx-service
        service-type: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: service-type
                  operator: In
                  values:
                  - nginx
              topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx-service
        image: nginx:latest
Note: I use preferredDuringSchedulingIgnoredDuringExecution here since you have more pods than nodes.
For more detailed information, refer to the Inter-pod affinity and anti-affinity (beta feature) section of the following link:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The scheduler should spread your pods if your containers specify resource requests for the amount of memory and CPU they need. See
http://kubernetes.io/docs/user-guide/compute-resources/
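A sketch of such requests inside a container spec; the values are purely illustrative:

    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            # Illustrative values; size these to what the container actually needs.
            cpu: "250m"
            memory: "128Mi"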
We can use taints and tolerations to keep pods off of certain nodes, or to allow them onto tainted nodes.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any pods that do not tolerate the taints.
A sample deployment yaml looks like this:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: nginx-service
  name: nginx-service
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx-service
  template:
    metadata:
      labels:
        run: nginx-service
        service-type: nginx
    spec:
      containers:
      - name: nginx-service
        image: nginx:latest
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"
You can find more information at https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
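For completeness, the node-side taint that the toleration in the example matches would be applied roughly like this (the node name is hypothetical):

# Taint a node so that only pods tolerating key1=value1:NoSchedule land on it.
kubectl taint nodes worker-1 key1=value1:NoSchedule

# Remove the taint again if needed.
kubectl taint nodes worker-1 key1=value1:NoSchedule-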