How can I distribute a deployment across nodes? - kubernetes

I have a Kubernetes deployment that looks something like this (replaced names and other things with '....'):
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    kubernetes.io/change-cause: kubectl replace deployment .... -f - --record
  creationTimestamp: 2016-08-20T03:46:28Z
  generation: 8
  labels:
    app: ....
  name: ....
  namespace: default
  resourceVersion: "369219"
  selfLink: /apis/extensions/v1beta1/namespaces/default/deployments/....
  uid: aceb2a9e-6688-11e6-b5fc-42010af000c1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ....
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ....
    spec:
      containers:
      - image: gcr.io/..../....:0.2.1
        imagePullPolicy: IfNotPresent
        name: ....
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: "0"
        terminationMessagePath: /dev/termination-log
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 2
  observedGeneration: 8
  replicas: 2
  updatedReplicas: 2
The problem I'm observing is that Kubernetes places both replicas (in the deployment I've asked for two) on the same node. If that node goes down, I lose both containers and the service goes offline.
What I want Kubernetes to do is to ensure that it doesn't double up containers on the same node where the containers are the same type - this only consumes resources and doesn't provide any redundancy. I've looked through the documentation on deployments, replica sets, nodes etc. but I couldn't find any options that would let me tell Kubernetes to do this.
Is there a way to tell Kubernetes how much redundancy across nodes I want for a container?
EDIT: I'm not sure labels will work; labels constrain where a pod will run so that it has access to local resources (SSDs) etc. All I want to do is ensure no downtime if a node goes offline.

There is now a proper way of doing this: pod topology spread constraints.
You can use kubernetes.io/hostname as the topology key if you just want to spread the pods across all nodes. Meaning, if you have two replicas of a pod and two nodes, each node should get one replica.
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
  labels:
    app: my-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-service
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1

I think you're looking for the affinity/anti-affinity selectors.
Affinity is for co-locating pods: "I want my website to try to schedule on the same host as my cache", for example. Anti-affinity is the opposite: don't schedule pods onto a host, according to a set of rules.
So for what you're doing, I would take a closer look at these two links:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#never-co-located-in-the-same-node
https://kubernetes.io/docs/tutorials/stateful-application/zookeeper/#tolerating-node-failure
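For illustration, a minimal sketch of such an anti-affinity rule inside a pod template (the app=my-app label is an assumed example):
# Hypothetical pod template fragment: keep pods labeled app=my-app
# off nodes that already run one.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname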

If you create a Service for that Deployment before creating the said Deployment, Kubernetes will spread your pods across nodes. This behavior comes from the scheduler; it is provided on a best-effort basis, provided that you have enough resources available on both nodes.
From the Kubernetes documentation (Managing Resources):
it’s best to specify the service first, since that will ensure the scheduler can spread the pods associated with the service as they are created by the controller(s), such as Deployment.
Also related: Configuration best practices - Service.
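In practice this just means applying the manifests in that order (file names here are assumed):
kubectl apply -f my-service.yaml     # Service first, so the scheduler knows about it
kubectl apply -f my-deployment.yaml  # then the Deployment whose pods it selects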

I agree with Antoine Cotten on using a service for your deployment. Note, though, that it is the Deployment's ReplicaSet, not the Service, that keeps things up by creating a new pod if, for some reason, one pod dies on a certain node. However, if you just want to distribute a deployment among all nodes, then you can use pod anti-affinity in your pod manifest file. I put an example on my GitLab page, which you can also find in the Kubernetes blog. For your convenience, I'm providing the example here as well.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
In this example, the pods of the Deployment carry the label app: nginx. In the pod spec, podAntiAffinity restricts the scheduler from placing two pods with that label (app: nginx) on one node. You can also use podAffinity if you would like to place pods of multiple Deployments on one node.

If a node goes down, any pods running on it will be restarted automatically on another node.
If you start specifying exactly where you want them to run, then you actually lose the capability of Kubernetes to reschedule them on a different node.
The usual practice therefore is to simply let Kubernetes do its thing.
If however you do have valid requirements to run a pod on a specific node, due to requirements for certain local volume type etc, have a read of:
http://kubernetes.io/docs/user-guide/node-selection/
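For reference, a nodeSelector is the simplest form of that pinning; a minimal sketch (the disktype=ssd label is an assumed example):
# Hypothetical pod spec fragment: schedule only onto nodes labeled disktype=ssd.
spec:
  nodeSelector:
    disktype: ssd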

Maybe a DaemonSet will work better. I'm using DaemonSets with nodeSelector to run pods on specific nodes and avoid duplication.
http://kubernetes.io/docs/admin/daemons/
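A minimal sketch of that pattern (names, labels and image are assumed):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemon            # assumed name
spec:
  selector:
    matchLabels:
      app: my-daemon
  template:
    metadata:
      labels:
        app: my-daemon
    spec:
      nodeSelector:
        role: worker         # assumed node label; one pod runs per matching node
      containers:
      - name: my-daemon
        image: nginx:latest  # assumed image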

Related

How to (re-)name a pod in a K8s deployment?

I want to deploy two containers in a pod through a Deployment, but I want the pod to have exactly the name yoda. In my case, a random string is always appended after yoda, like yoda-f8bcb7bf4-khml6. Is it possible to force the pod name? I tried the following, but I did not get what I expected.
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: yoda
  name: yoda
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yoda
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      name: yoda
      labels:
        app: yoda
    spec:
      containers:
      - image: busybox
        name: anakin
        resources: {}
      - image: nginx
        name: obiwan
        resources: {}
status: {}
Regards,
Benoît
This may not be the answer you expect, but with Kubernetes, pods should not be seen as pets, i.e. they should not receive a lot of attention and should be considered highly replaceable. The name generation is part of this consideration, among other things to avoid conflicts.
Almost everything in Kubernetes involves a kind of decoupling, including container rollouts. If a pod always received the same name, it would cut itself off from things like rolling deployment strategies, in which one pod terminates while another spawns; the alternative would be a naming conflict.
Without a deeper discussion of why the pod should be maintained by hand, I am not sure you will find a proper solution.
To give some perspective:
Labels (which you already use) give a good way to select a certain pod. If you change the deployment to use a different image, there might be two pods selectable by your yoda label.
So, if you want to select either the older or the newer pod (but not both), adding another label with the respective version could solve the distinguishing problem (if that is what you want). See the template metadata section below.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: yoda
  name: yoda
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yoda
  strategy: {}
  template:
    metadata:
      name: yoda
      labels:
        app: yoda
        app.version: 2.0.0
    spec:
      containers:
      - image: busybox
        name: anakin
        resources: {}
      - image: nginx
        name: obiwan
        resources: {}
I hope this helps.
I am not sure if a StatefulSet can solve your issue, but a StatefulSet always retains the pod name. However, it also appends an ordinal number (starting from 0) to the pod name, going up to the number of replicas you define in the YAML definition file.
For example, if you set the replica count to 3 in the StatefulSet definition file, then the pods will be named as listed below.
[podName]-0
[podName]-1
[podName]-2
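A minimal sketch of such a StatefulSet (names and image are assumed), which would produce pods yoda-0, yoda-1 and yoda-2:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yoda
spec:
  serviceName: yoda          # assumed headless Service governing the pods
  replicas: 3
  selector:
    matchLabels:
      app: yoda
  template:
    metadata:
      labels:
        app: yoda
    spec:
      containers:
      - name: obiwan
        image: nginx         # assumed image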

In Kubernetes how can I have a hard minimum number of pods for Deployment?

On my deployment I can set replicas: 3, but then it only spins up one pod. It's my understanding that Kubernetes will then fire up more pods as needed, up to three.
But for the sake of uptime I want to have a minimum of three pods at all times, and for it to maybe create more as needed, but never scale down lower than 3.
Is this possible with a Deployment?
It is exactly as you did: you define 3 replicas in your Deployment.
Have you verified that you have enough resources for your Deployments?
Replicas example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3   # <------- The number of your desired replicas
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
K8s will spin up a new pod if it can, meaning: if you have enough resources and the configuration is valid (nodes, selectors, image, volumes, and so on).
If you have defined 3 replicas and you are still getting only 1, examine your deployment and your events.
How to view events
# To view all the events (don't specify pod name or namespace)
kubectl get events
# Events for a specific pod appear in its describe output
kubectl describe pod <pod name> --namespace <namespace>
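Separately: if you want the Deployment to grow beyond 3 on demand but never drop below 3, that is what a HorizontalPodAutoscaler's minReplicas is for. A minimal sketch, assuming the nginx-deployment above and CPU-based scaling (names and numbers are assumptions):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa                # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 3                 # hard floor: never fewer than 3 pods
  maxReplicas: 6                 # assumed ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # assumed target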

Hybrid between replicaset and daemonset

Is there such a thing as a hybrid between a ReplicaSet and a DaemonSet?
I want to specify that I always want to have 2 pods up, but those pods must never be on the same node (and I have like 10 nodes).
Is there a way I can achieve this?
In a Deployment or ReplicaSet you can use podAffinity and podAntiAffinity.
Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled, based on labels on pods that are already running on the node rather than based on labels on nodes.
The rules are of the form “this pod should (or, in the case of anti-affinity, shouldn't) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
In the above example, the two nginx pods will never be scheduled on the same node.
Find more details in the official docs.

Is it possible to move the running pods from ReplicationController to a Deployment?

We are using an RC to run our workload and want to migrate to a Deployment. Is there a way to do that without causing any impact to the running workload? I mean, can we move these running pods under a Deployment?
As @matthew-l-daniel answered, the answer is yes. But I am more than 80% certain about it, because I have tested it.
Now, what's the process we need to follow?
Let's say I have a ReplicationController:
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
Question: can we move these running pods under a Deployment?
Let's follow these steps to see if we can.
Step 1:
Delete this RC with --cascade=false. This will leave the Pods running.
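For the RC above, that is (newer kubectl versions spell the flag --cascade=orphan):
kubectl delete rc nginx --cascade=false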
Step 2:
Create a ReplicaSet first, with the same label as the ReplicationController:
apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # same pod template as the ReplicationController above
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
So, now these Pods are managed by the ReplicaSet.
Step 3:
Now create a Deployment with the same label:
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      # same pod template as above
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
The Deployment will find that a ReplicaSet already exists, and our job is done.
Now we can increase the replicas to check whether it works.
And it works.
Which way it doesn't work:
After deleting the ReplicationController, do not create the Deployment directly. This will not work, because the Deployment will find no ReplicaSet and will create a new one with an additional label (pod-template-hash) that will not match your existing Pods.
I'm about 80% certain the answer is yes, since they both use Pod selectors to determine whether new instances should be created. The key trick is to use the --cascade=false (the default is true) in kubectl delete, whose help even speaks to your very question:
--cascade=true: If true, cascade the deletion of the resources managed by this resource (e.g. Pods created by a ReplicationController). Default true.
By deleting the ReplicationController but not its subordinate Pods, they will continue to just hang out (although be careful, if a reboot or other hazard kills one or all of them, no one is there to rescue them). Creating the Deployment with the same selector criteria and a replicas count equal to the number of currently running Pods should cause a "no action" situation.
I regret that I don't have my cluster in front of me to test it, but I would think a small nginx RC with replicas=3 should be a simple enough test to prove that it behaves as you wish.

Avoiding kubernetes scheduler to run all pods in single node of kubernetes cluster

I have one Kubernetes cluster with 4 nodes and one master. I am trying to run 5 nginx pods across the nodes. Currently the scheduler sometimes runs all the pods on one machine and sometimes on different machines.
What happens if my node goes down while all my pods are running on that same node? We need to avoid this.
How can I force the scheduler to spread pods across the nodes in a round-robin fashion, so that if any node goes down, at least one node still has an NGINX pod running?
Is this possible or not? If possible, how can we achieve this scenario?
Use podAntiAffinity
Reference: Kubernetes in Action, Chapter 16, Advanced scheduling.
podAntiAffinity with requiredDuringSchedulingIgnoredDuringExecution can be used to prevent two pods with the same label from being scheduled to the same hostname. If you prefer a more relaxed constraint, use preferredDuringSchedulingIgnoredDuringExecution.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          # Hard requirement: do not schedule an "nginx" pod onto a node
          # that already runs one.
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname  # anti-affinity scope is the host
            labelSelector:
              matchLabels:
                app: nginx
      containers:
      - name: nginx
        image: nginx:latest
Kubelet --max-pods
You can specify the maximum number of pods per node in the kubelet configuration, so that in the scenario of node(s) going down, K8s is prevented from saturating the remaining nodes with pods from the failed node.
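A minimal sketch of that setting via a kubelet configuration file (the value is an assumed example; the default is 110):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 30   # assumed per-node cap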
Use Pod Topology Spread Constraints
As of 2021 (v1.19 and up), Pod Topology Spread Constraints (topologySpreadConstraints) are available by default, and I found them more suitable than podAntiAffinity for this case.
The major difference is that anti-affinity can restrict you to only one pod per node, whereas topology spread constraints can allow N pods per node while keeping the spread even.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-example-deployment
spec:
  replicas: 6
  selector:
    matchLabels:
      app: nginx-example
  template:
    metadata:
      labels:
        app: nginx-example
    spec:
      containers:
      - name: nginx
        image: nginx:latest
      # This sets how evenly the pods are spread.
      # For example, if there are 3 nodes available,
      # 2 pods are scheduled on each node.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: nginx-example
For more details see KEP-895 and an official blog post.
I think the inter-pod anti-affinity feature will help you.
Inter-pod anti-affinity allows you to constrain which nodes your pod is eligible to schedule on based on labels on pods that are already running on the node. Here is an example.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: nginx-service
  name: nginx-service
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx-service
  template:
    metadata:
      labels:
        run: nginx-service   # must match the selector
        service-type: nginx
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: service-type
                  operator: In
                  values:
                  - nginx
              topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx-service
        image: nginx:latest
Note: I use preferredDuringSchedulingIgnoredDuringExecution here since you have more pods than nodes.
For more detailed information, you can refer to the "Inter-pod affinity and anti-affinity" (beta feature) section of the following link:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The scheduler should spread your pods if your containers specify resource requests for the amount of memory and CPU they need. See
http://kubernetes.io/docs/user-guide/compute-resources/
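For example, a container fragment with such requests (the values are assumed):
# Explicit requests give the scheduler the information it needs to balance pods.
containers:
- name: nginx
  image: nginx:latest
  resources:
    requests:
      cpu: 100m
      memory: 128Mi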
We can use taints and tolerations to keep pods off a node, or to allow them onto one.
Tolerations are applied to pods, and allow (but do not require) the pods to schedule onto nodes with matching taints.
Taints and tolerations work together to ensure that pods are not scheduled onto inappropriate nodes: one or more taints are applied to a node, marking that the node should not accept any pods that do not tolerate those taints.
A sample deployment yaml will be like
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: nginx-service
  name: nginx-service
spec:
  replicas: 3
  selector:
    matchLabels:
      run: nginx-service
  template:
    metadata:
      labels:
        run: nginx-service   # must match the selector
        service-type: nginx
    spec:
      containers:
      - name: nginx-service
        image: nginx:latest
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"
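The toleration above only takes effect once a node actually carries the matching taint, which you would add with:
kubectl taint nodes <node-name> key1=value1:NoSchedule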
You can find more information at https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/