Pods affinity to different nodes in StatefulSet - kubernetes

I am creating StatefulSets and I want pods within one StatefulSet to be distributed across different nodes of the k8s cluster. In my case - one StatefulSet is one database replicaset.
sts.Spec.Template.Labels["mydb.io/replicaset-uuid"] = replicasetUUID.String()
sts.Spec.Template.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{
RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
{
LabelSelector: &metav1.LabelSelector{
MatchExpressions: []metav1.LabelSelectorRequirement{
{
Key: "mydb.io/replicaset-uuid",
Operator: metav1.LabelSelectorOpIn,
Values: []string{replicasetUUID.String()},
},
},
},
TopologyKey: "kubernetes.io/hostname",
},
},
}
However, with these settings, I get the opposite. storage-0-0 and storage-0-1 are on the same replicaset and on the same node...
Moreover, they have exactly the same label mydb.io/replicaset-uuid
$ kubectl -n mydb get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
storage-0-0 1/1 Running 0 40m x.x.x.x kubernetes-cluster-x-main-0 <none> <none>
storage-0-1 1/1 Running 0 39m x.x.x.x kubernetes-cluster-x-main-0 <none> <none>
storage-1-0 1/1 Running 0 40m x.x.x.x kubernetes-cluster-x-slave-0 <none> <none>
storage-1-1 1/1 Running 0 40m x.x.x.x kubernetes-cluster-x-slave-0 <none> <none>
mydb-operator-58c9bfbb9b-7djml 1/1 Running 0 46m x.x.x.x kubernetes-cluster-x-slave-0 <none> <none>

It works correctly, as #jesmart wrote in the comment:
The description of the problem works correctly I just indicated the wrong image with the application

I suggest using an podAntiAffinity rule in the statefulset definition to deploy your application so that no two instances are located on the same host.
Reference: An example of a pod that uses pod affinity

Related

How to make k8s imagePullPolicy = never work?

I have followed the instructions on this blog to create a simple container image and deploy it in a k8s cluster.
However, in my case the pods do not run:
student#master:~$ k get pod -o wide -l app=hello-python --field-selector spec.nodeName=master
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
hello-python-58547cf485-7l8dg 0/1 ErrImageNeverPull 0 2m26s 192.168.219.126 master <none> <none>
hello-python-598c594dc5-4c9zd 0/1 ErrImageNeverPull 0 2m26s 192.168.219.67 master <none> <none>
student#master:~$ sudo podman images hello-python
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/hello-python latest 11cf1e5a86b1 50 minutes ago 941 MB
student#master:~$ hostname
master
student#master:~$
I understand why it may not work on the worker node, but why it does not work on the same node where the image is cached - the master node?
student#master:~$ k describe pod hello-python-58547cf485-7l8dg | grep -A 10 'Events:'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned default/hello-python-58547cf485-7l8dg to master
Warning Failed 8m7s (x12 over 10m) kubelet Error: ErrImageNeverPull
Warning ErrImageNeverPull 4m59s (x27 over 10m) kubelet Container image "localhost/hello-python:latest" is not present with pull policy of Never
student#master:~$
My question is: how to make the pod run on the master node with the imagePullPolicy = never given that the image in question is available on the master node as the podman images attests?
EDIT 1
I am using a k8s cluster running on two VMs deployed in GCE. It was setup with a script provided in the context of the Linux Foundation Kubernetes Developer course LFD0259.
EDIT 2
The master node is allowed to run workloads - this is how the LFD259 course sets it up. For example:
student#master:~$ k create deployment xyz --image=httpd
deployment.apps/xyz created
student#master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 5m37s 192.168.171.66 worker <none> <none>
student#master:~$
student#master:~$ k scale deployment xyz --replicas=10
deployment.apps/xyz scaled
student#master:~$ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
xyz-6c6bd4cd89-c2xv4 1/1 Running 0 73s 192.168.219.71 master <none> <none>
xyz-6c6bd4cd89-g89k2 0/1 ContainerCreating 0 73s <none> master <none> <none>
xyz-6c6bd4cd89-jfftl 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-kbdnq 1/1 Running 0 73s 192.168.219.106 master <none> <none>
xyz-6c6bd4cd89-nm6rt 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-qn4zr 1/1 Running 0 7m22s 192.168.171.66 worker <none> <none>
xyz-6c6bd4cd89-vts6x 1/1 Running 0 73s 192.168.171.84 worker <none> <none>
xyz-6c6bd4cd89-wd2ls 1/1 Running 0 73s 192.168.171.127 worker <none> <none>
xyz-6c6bd4cd89-wv4jn 0/1 ContainerCreating 0 73s <none> worker <none> <none>
xyz-6c6bd4cd89-xvtlm 0/1 ContainerCreating 0 73s <none> master <none> <none>
student#master:~$
It depends how you've set up your Kubernetes Cluster. I assume you've installed it with kubeadm. However, by default the Master is not scheduleable for workloads. And by my understanding the image you're talking about only exists on the master node right? If that's the case you can't start a pod with that Image as it only exists on the master node, which doesn't allow workloads by default.
If you were to copy the Image to the worker node, your given command should work.
However if you want to make your Master-Node scheduleable just taint it with (maybe you need to amend the last bit if it differs from yours):
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

how to communicate with daemonset pod from another pod in another node?

I have a daemonset configuration that runs on all nodes.
every pod listens on port 34567. I want from other pod on different node to communicate with this pod. how can I achieve that?
Find the target Pod's IP address as shown below
controlplane $ k get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-fb8b8dccf-42pq8 1/1 Running 1 5m43s 10.88.0.4 node01 <none> <none>
coredns-fb8b8dccf-f9n5x 1/1 Running 1 5m43s 10.88.0.3 node01 <none> <none>
etcd-controlplane 1/1 Running 0 4m38s 172.17.0.23 controlplane <none> <none>
katacoda-cloud-provider-74dc75cf99-2jrpt 1/1 Running 3 5m42s 10.88.0.2 node01 <none> <none>
kube-apiserver-controlplane 1/1 Running 0 4m33s 172.17.0.23 controlplane <none> <none>
kube-controller-manager-controlplane 1/1 Running 0 4m45s 172.17.0.23 controlplane <none> <none>
kube-keepalived-vip-smkdc 1/1 Running 0 5m27s 172.17.0.26 node01 <none> <none>
kube-proxy-8sxkt 1/1 Running 0 5m27s 172.17.0.26 node01 <none> <none>
kube-proxy-jdcqc 1/1 Running 0 5m43s 172.17.0.23 controlplane <none> <none>
kube-scheduler-controlplane 1/1 Running 0 4m47s 172.17.0.23 controlplane <none> <none>
weave-net-8cxqg 2/2 Running 1 5m27s 172.17.0.26 node01 <none> <none>
weave-net-s4tcj 2/2 Running 1 5m43s 172.17.0.23 controlplane <none> <none>
Next "exec" into the originating pod - kube-proxy-8sxkt in my example
kubectl -n kube-system exec -it kube-proxy-8sxkt sh
Next, you will use the destination pod's IP and port (10256 - my example) number to connect. Please note that you may have to install curl/telnet if your originating container's image does not include the application
# curl telnet://172.17.0.23:10256
HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8
Connection: close
You can call via pod's IP.
Note: This IP can only be used in the k8s cluster.
POD address (IP) is a good option you can use it, unless you know the POD IP which might get changed from time to time due to deployment and scaling changes.
i would suggest trying out the Daemon set by exposing it using the service type Node port if you have a fix amount of Node and not much autoscaling there.
If you want to connect your POD with a specific POD you can use the Node IP on which POD is scheduled and use the Node port service.
Node IP:Node port
Read more at : https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport
If you don't want to connect to a specific POD and just any of the Daemon sets replica will work to connect with you can use the service name to connect PODs with each other.
my-svc.my-namespace.svc.cluster-domain.example
Read more about the service and POD DNS
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Stuck nginx ingress

I deployed nginx ingress by kubespray. I have 3 masters and 2 workers and 5 ingress-nginx-controller. I tried to shutdown one worker and now I see still 5 nginx ingress on all hosts.
[root#node1 ~]# kubectl get pod -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-controller-5828c 1/1 Running 0 7m4s 10.233.96.9 node2 <none> <none>
ingress-nginx-controller-h5zzl 1/1 Running 0 7m42s 10.233.92.7 node3 <none> <none>
ingress-nginx-controller-wrvv6 1/1 Running 0 6m11s 10.233.90.17 node1 <none> <none>
ingress-nginx-controller-xdkrx 1/1 Running 0 5m44s 10.233.105.25 node4 <none> <none>
ingress-nginx-controller-xgpn2 1/1 Running 0 6m38s 10.233.70.32 node5 <none> <none>
The problem is I am getting 503 error with app after one node was power off. Is some option disconnect not working ingress-nginx-controller or possibility to use round robin, please? Or could I catch non working ingress-nginx-controller and redirect traffic to correct one, please?
I shutdown the node where the app was running. Now is everything working.

Horizontal Pod Autoscaler replicas based on the amount of the nodes in the cluster

Im looking for the solution that will scale out pods automatically when the nodes join the cluster and scale in back when the nodes are deleted.
We are running WebApp on the nodes and this require graceful pod eviction/termination when the node is scheduled to be disconnected.
I was checking the option of using the DaemonSet but since we are using Kops for the cluster rolling update it ignores DaemonSets evictions (flag "--ignore-daemionset" is not supported).
As a result the WebApp "dies" with the node which is not acceptable for our application.
The ability of HorizontalPodAutoscaler to overwrite the amount of replicas which are set in the deployment yaml could solve the problem.
I want to find the way to change the min/maxReplicas in HorizontalPodAutoscaler yaml dynamically based on the amount of nodes in the cluster.
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: MyWebApp
minReplicas: "Num of nodes in the cluster"
maxReplicas: "Num of nodes in the cluster"
Any ideas how to get the number of nodes and update HorizontalPodAutoscaler yaml in the cluster accordingly? Or any other solutions for the problem?
Have you tried usage of nodeSelector spec in daemonset yaml.
So if you have nodeselector set in yaml and just before drain if you remove the nodeselector label value from the node the daemonset should scale down gracefully also same when you add new node to cluster label the node with custom value and deamonset will scale up.
This works for me so you can try this and confirm with Kops
First : Label all you nodes with a custom label you will always have on your cluster
Example:
kubectl label nodes k8s-master-1 mylabel=allow_demon_set
kubectl label nodes k8s-node-1 mylabel=allow_demon_set
kubectl label nodes k8s-node-2 mylabel=allow_demon_set
kubectl label nodes k8s-node-3 mylabel=allow_demon_set
Then to your daemon set yaml add node selector.
Example.yaml used as below : Note added nodeselctor field
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-elasticsearch
labels:
k8s-app: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd-elasticsearch
template:
metadata:
labels:
name: fluentd-elasticsearch
spec:
nodeSelector:
mylabel: allow_demon_set
tolerations:
- key: node-role.kubernetes.io/master
effect: NoSchedule
containers:
- name: fluentd-elasticsearch
image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
resources:
limits:
memory: 200Mi
requests:
cpu: 100m
memory: 200Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
terminationGracePeriodSeconds: 30
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
So nodes are labeled as below
$ kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-master-1 Ready master 9d v1.17.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master-1,kubernetes.io/os=linux,mylable=allow_demon_set,node-role.kubernetes.io/master=
k8s-node-1 Ready <none> 9d v1.17.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node-1,kubernetes.io/os=linux,mylable=allow_demon_set
k8s-node-2 Ready <none> 9d v1.17.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node-2,kubernetes.io/os=linux,mylable=allow_demon_set
k8s-node-3 Ready <none> 9d v1.17.0 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node-3,kubernetes.io/os=linux,mylable=allow_demon_set
Once you have correct yaml start the daemon set using it
$ kubectl create -f Example.yaml
$ kubectl get all -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/fluentd-elasticsearch-jrgl6 1/1 Running 0 20s 10.244.3.19 k8s-node-3 <none> <none>
pod/fluentd-elasticsearch-rgcm2 1/1 Running 0 20s 10.244.0.6 k8s-master-1 <none> <none>
pod/fluentd-elasticsearch-wccr9 1/1 Running 0 20s 10.244.1.14 k8s-node-1 <none> <none>
pod/fluentd-elasticsearch-wxq5v 1/1 Running 0 20s 10.244.2.33 k8s-node-2 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9d <none>
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
daemonset.apps/fluentd-elasticsearch 4 4 4 4 4 mylable=allow_demon_set 20s fluentd-elasticsearch quay.io/fluentd_elasticsearch/fluentd:v2.5.2 name=fluentd-elasticsearch
Then before draining a node we can just remove the custom label from node and the pod-should scale down gracefully and then drain the node.
$ kubectl label nodes k8s-node-3 mylabel-
Check the daemonset and it should scale down
ubuntu#k8s-kube-client:~$ kubectl get all -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/fluentd-elasticsearch-jrgl6 0/1 Terminating 0 2m36s 10.244.3.19 k8s-node-3 <none> <none>
pod/fluentd-elasticsearch-rgcm2 1/1 Running 0 2m36s 10.244.0.6 k8s-master-1 <none> <none>
pod/fluentd-elasticsearch-wccr9 1/1 Running 0 2m36s 10.244.1.14 k8s-node-1 <none> <none>
pod/fluentd-elasticsearch-wxq5v 1/1 Running 0 2m36s 10.244.2.33 k8s-node-2 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9d <none>
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
daemonset.apps/fluentd-elasticsearch 3 3 3 3 3 mylable=allow_demon_set 2m36s fluentd-elasticsearch quay.io/fluentd_elasticsearch/fluentd:v2.5.2 name=fluentd-elasticsearch
Now again add the label to new node with same custom label when it is added to cluster and the deamonset will scale up
$ kubectl label nodes k8s-node-3 mylable=allow_demon_set
ubuntu#k8s-kube-client:~$ kubectl get all -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/fluentd-elasticsearch-22rsj 1/1 Running 0 2s 10.244.3.20 k8s-node-3 <none> <none>
pod/fluentd-elasticsearch-rgcm2 1/1 Running 0 5m28s 10.244.0.6 k8s-master-1 <none> <none>
pod/fluentd-elasticsearch-wccr9 1/1 Running 0 5m28s 10.244.1.14 k8s-node-1 <none> <none>
pod/fluentd-elasticsearch-wxq5v 1/1 Running 0 5m28s 10.244.2.33 k8s-node-2 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 9d <none>
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE CONTAINERS IMAGES SELECTOR
daemonset.apps/fluentd-elasticsearch 4 4 4 4 4 mylable=allow_demon_set 5m28s fluentd-elasticsearch quay.io/fluentd_elasticsearch/fluentd:v2.5.2 name=fluentd-elasticsearch
Kindly confirm if this what you want to do and works with kops

How to reserve certain worker nodes for a namespace

I would like to reserve some worker nodes for a namespace. I see the notes of stackflow and medium
How to assign a namespace to certain nodes?
https://medium.com/#alejandro.ramirez.ch/reserving-a-kubernetes-node-for-specific-nodes-e75dc8297076
I understand we can use taint and nodeselector to achieve that.
My question is if people get to know the details of nodeselector or taint, how can we prevent them to deploy pods into these dedicated worker nodes.
thank you
To accomplish what you need, basically you have to use taint.
Let's suppose you have a Kubernetes cluster with one Master and 2 Worker nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
knode01 Ready <none> 8d v1.16.2
knode02 Ready <none> 8d v1.16.2
kubemaster Ready master 8d v1.16.2
As example I'll setup knode01 as Prod and knode02 as Dev.
$ kubectl taint nodes knode01 key=prod:NoSchedule
$ kubectl taint nodes knode02 key=dev:NoSchedule
To run a pod into these nodes, we have to specify a toleration in spec session on you yaml file:
apiVersion: v1
kind: Pod
metadata:
name: pod1
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
tolerations:
- key: "key"
operator: "Equal"
value: "dev"
effect: "NoSchedule"
This pod (pod1) will always run in knode02 because it's setup as dev. If we want to run it on prod, our tolerations should look like that:
tolerations:
- key: "key"
operator: "Equal"
value: "prod"
effect: "NoSchedule"
Since we have only 2 nodes and both are specified to run only prod or dev, if we try to run a pod without specifying tolerations, the pod will enter on a pending state:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod0 1/1 Running 0 21m 192.168.25.156 knode01 <none> <none>
pod1 1/1 Running 0 20m 192.168.32.83 knode02 <none> <none>
pod2 1/1 Running 0 18m 192.168.25.157 knode01 <none> <none>
pod3 1/1 Running 0 17m 192.168.32.84 knode02 <none> <none>
shell-demo 0/1 Pending 0 16m <none> <none> <none> <none>
To remove a taint:
$ kubectl taint nodes knode02 key:NoSchedule-
This is how it can be done
Add new label, say, ns=reserved, label to a specific worker node
Add taint and tolerations to target specific pods on to this worker node
You need to define RBAC roles and role bindings in that namespace to control what other users can do