Understanding built-in Labels - kubernetes

I'm new to Kubernetes and I'm trying to understand how labels work on a node. We have EKS server version 1.14 running in our organization. I'm trying to change built-in deprecated labels.
In the aws-node DaemonSet, I want to replace beta.kubernetes.io/os with kubernetes.io/os and beta.kubernetes.io/arch with kubernetes.io/arch.
I see both the beta.kubernetes.io/arch and kubernetes.io/arch labels when I describe a node.
Is it safe to go ahead and remove the beta.kubernetes.io/arch and beta.kubernetes.io/os labels?
I want to understand if I change the label, what are its effects?
Are Pods running on that node affected?
Can apiVersion: apps/v1 change built-in labels?
Can I just run kubectl label node "node-name" beta.kubernetes.io/arch=amd64 - to remove the labels?
Is there a need to apply the DaemonSet?
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: aws-node
  namespace: kube-system
  labels:
    k8s-app: aws-node
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      labels:
        k8s-app: aws-node
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "beta.kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "beta.kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
kubectl describe node/ip-10-xx-xx-xx.ec2.internal -n kube-system
Labels:    beta.kubernetes.io/arch=amd64
           beta.kubernetes.io/instance-type=c4.xlarge
           beta.kubernetes.io/os=linux
           failure-domain.beta.kubernetes.io/region=us-east-1
           failure-domain.beta.kubernetes.io/zone=us-east-1a
           group=nodes
           kubernetes.io/arch=amd64
           kubernetes.io/hostname=ip-10-182-32-156.ec2.internal
           kubernetes.io/os=linux

From the documentation we can read that beta.kubernetes.io/arch and beta.kubernetes.io/os have been deprecated since version 1.14 (and removed in version 1.18) and that the kubernetes.io/arch and kubernetes.io/os labels should be used instead.
You are using version 1.14, and there is no reason for you to change/remove these labels. Changing them would add one more layer of complication to your cluster, for example when you want to add a node (you would always have to keep in mind that you have non-stock labels on your nodes).
Is it safe to go ahead and remove the beta.kubernetes.io/arch and beta.kubernetes.io/os labels?
It's safe but unnecessary unless you have applications running on mixed clusters and you are using these labels.
I want to understand if I change the label, what are its effects?
From the documentation we can read:
kubernetes.io/arch: This can be handy if you are mixing arm and x86 nodes.
kubernetes.io/os: This can be handy if you are mixing operating systems in your cluster (for example: mixing Linux and Windows nodes).
So these labels are there for your convenience, you can use them to keep track of things.
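For instance, a workload that must run only on Linux/amd64 nodes can select them with the stable labels; a minimal sketch (the pod name and image are placeholders, not taken from the question):
apiVersion: v1
kind: Pod
metadata:
  name: arch-pinned-pod          # placeholder name
spec:
  nodeSelector:
    kubernetes.io/os: linux      # stable replacement for beta.kubernetes.io/os
    kubernetes.io/arch: amd64    # stable replacement for beta.kubernetes.io/arch
  containers:
    - name: app
      image: nginx               # placeholder image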
Are Pods running on that node affected?
No, pods will still be scheduled normally.
Can I just run kubectl label node "node-name" beta.kubernetes.io/arch=amd64 - to remove the labels?
To remove the label you can run:
kubectl label node "node-name" beta.kubernetes.io/arch-
To remove from all nodes:
kubectl label nodes --all beta.kubernetes.io/arch-
Is there a need to apply the DaemonSet?
I don't particularly see a need for that.
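If you do decide to move the aws-node DaemonSet off the deprecated labels anyway, the only change needed in the manifest above is the label keys in the affinity block, roughly like this (a sketch of the swap, not an AWS-provided manifest):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "kubernetes.io/os"
              operator: In
              values:
                - linux
            - key: "kubernetes.io/arch"
              operator: In
              values:
                - amd64
Since your nodes already carry both the beta and the stable labels, applying this on 1.14 should not change where the aws-node pods land, but as noted above it is not required.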

Related

Node [Affinity/AntiAffinity] along with Pod [Affinity/AntiAffinity]

I have a pod that has both node affinity and pod affinity. Could someone help me understand how things would behave in such a scenario?
Node 1:
  label:
    schedule-on: gpu
Node 2:
  label:
    schedule-on: gpu
Node 3:
  label:
    schedule-on: non-gpu
Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  progressDeadlineSeconds: 600
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/name: test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: schedule-on
                    operator: In
                    values:
                      - gpu
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - test
              topologyKey: schedule-on
The output of the above is: pods are getting scheduled on different nodes [node1, node2].
Ideal output: the pods need to be scheduled on the same node [node1].
Here are my findings.
Finding 1: I believe node affinity is taking precedence and pod affinity is getting ignored.
It's the union of node affinity and pod affinity. Since both pods have the same topology key domain (node1 and node2 carry schedule-on with the same value), they end up in the same colocation, so the pods can get scheduled on different nodes while still being in the same colocation.
When matching the topology key and placing the pod, the value of the key is also considered.
I would like to add some definitions to your answer.
First of all, from this article:
Node affinity is a set of rules. It is used by the scheduler to decide where a pod can be placed in the cluster. The rules are defined using labels on nodes and label selectors specified in pods definition. Node affinity allows a pod to specify an affinity towards a group of nodes it can be scheduled on.
And from here:
Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
In your scenario, as you wrote, both pods have the same topology key domain. This puts them in the same colocation, so the pods can get scheduled on different nodes while still being in the same colocation.
Here are also some words about topologyKey:
TopologyKey is the key of node labels. If two Nodes are labelled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.
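If the goal really is to land both pods on the same node rather than only in the same schedule-on domain, the usual approach is to use the node's hostname label as the topology key, so that each node is its own topology domain. A sketch against the manifest above (only the pod affinity term changes):
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
              - test
      topologyKey: kubernetes.io/hostname   # every node is a separate domain, forcing co-location
With kubernetes.io/hostname as the topology key, pods matching the label selector must share a node, while the node affinity still restricts them to the gpu-labelled nodes.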

Spread specific number of deployment pods per node

I have an EKS node group with 2 nodes for compute workloads. I use a taint on these nodes and tolerations in the deployment. I have a deployment with 2 replicas, and I want these two pods to be spread across the two nodes, one pod on each node.
I tried using:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - appname
Each pod is placed on its own node, but if I update the deployment file, for example by changing its image name, it fails to schedule a new pod.
I also tried:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: type
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        type: compute
but the pods aren't spread evenly; for example, both pods end up on the same node.
Try adding:
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
By default, K8s tries to scale the new ReplicaSet up before it starts scaling down the old replicas. Since it cannot schedule the new replicas (because of the anti-affinity rule), they get stuck in the Pending state.
Once you set the deployment's maxSurge=0, you tell K8s that you don't want the deployment to scale up first during an update; as a result it can only scale down, making room for the new replicas to be scheduled.
Setting maxUnavailable=1 tells K8s to replace only one pod at a time.
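Put together with the anti-affinity from the question, a minimal sketch could look like this (topologyKey: kubernetes.io/hostname and the required variant of the anti-affinity are assumptions swapped in here; the name, label and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: appname                      # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: appname
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0                    # never create extra pods during an update
      maxUnavailable: 1              # replace one pod at a time
  template:
    metadata:
      labels:
        app: appname
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: appname
              topologyKey: kubernetes.io/hostname   # at most one pod of this app per node
      containers:
        - name: app
          image: nginx               # placeholder image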
You can use a DaemonSet instead of a Deployment. A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
See the documentation for DaemonSet.
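A minimal sketch of that approach, assuming a hypothetical node label role=compute on the two compute nodes and a hypothetical taint compute=true:NoSchedule (replace both with your actual label and taint):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: appname                # placeholder name
spec:
  selector:
    matchLabels:
      app: appname
  template:
    metadata:
      labels:
        app: appname
    spec:
      nodeSelector:
        role: compute          # hypothetical label restricting the DaemonSet to the compute nodes
      tolerations:
        - key: "compute"       # hypothetical taint key
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: app
          image: nginx         # placeholder image
This gives exactly one pod per matching node, but note that a DaemonSet follows the node count rather than a replica count.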
I was having the same problem, with pods failing to schedule and getting stuck in the Pending state while rolling out new versions, while my goal was to run exactly 3 pods at all times, one on each of the 3 available nodes.
That means I could not use maxUnavailable: 1 because that would temporarily result in less than 3 pods during the rollout.
Instead of using the app name label for matching anti-affinity, I ended up using a label with a random value ("version") on each deployment. This means new deployments will happily schedule pods to nodes where a previous version is still running, but the new versions will always be spread evenly.
Something like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v1
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: deploymentVersion
                    operator: In
                    values:
                      - v1
              topologyKey: "kubernetes.io/hostname"
v1 can be anything that's a valid label value and changes on every deployment attempt.
I'm using envsubst to have dynamic variables in yaml files:
DEPLOYMENT_VERSION=$(date +%s) envsubst < deploy.yaml | kubectl apply -f -
And then the config looks like this:
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        deploymentVersion: v${DEPLOYMENT_VERSION}
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: deploymentVersion
                    operator: In
                    values:
                      - v${DEPLOYMENT_VERSION}
              topologyKey: "kubernetes.io/hostname"
I wish Kubernetes offered a more straightforward way to achieve this.

How to do controlled rollout using Kubernetes deployment

We have 1000 store nodes and need to deploy an application image on every Kubernetes node, rolling out in the order below, and we would like to specify the deployment node details during the deployment. Is there a way to specify node details on the command line when we execute kubectl create or kubectl apply deployment commands?
This application image would be configured with store/node-specific details during container/Pod creation.
1 node on day 1,
10 nodes on day 2,
100 nodes on day 3, etc.
Answering on the question from the title:
How to do controlled rollout using Kubernetes deployment
You can create a Deployment that will have specific fields in its manifest that will configure the way Kubernetes handles it.
With fields like podAntiAffinity and requiredDuringSchedulingIgnoredDuringExecution, you can ensure that Kubernetes will distribute the Pods equally across the cluster Nodes. You can read more about it by following the documentation below:
Kubernetes.io: Docs: Concepts: Scheduling eviction: Assign Pod to Node
Having in mind the following rollout schedule:
DAY   REPLICAS_COUNT
1     1
2     10
3     100
4     1000
You could use CI/CD tools (like for example Jenkins) to roll out (change) the number of replicas of your Deployment across a specific schedule.
You could create a Jenkins pipeline with a deploy stage where you could put your own command with its scheduler (or delay).
The example of such Deployment that could be used with Jenkins is following:
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: ${REPLICAS_COUNT}
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: "kubernetes.io/hostname"
      containers:
        - name: nginx
          image: nginx
EOF
This Deployment will assign Pods only to Nodes that are not already running a replica of this Deployment (i.e. 1 Pod = 1 Node). If the number of Pods exceeds the number of Nodes, the extra Pods will remain in the Pending state.
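As a sketch of what a pipeline stage could run for each step of the schedule (the nginx Deployment name comes from the manifest above; the replica count per day follows the table):
# Day 2 of the rollout: raise the replica count to 10.
# Re-applying the manifest above with REPLICAS_COUNT=10 achieves the same result.
kubectl scale deployment nginx --replicas=10

# Block until the rollout has finished before scheduling the next day's step.
kubectl rollout status deployment nginx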
Additional resources:
Jenkins.io: Doc: Pipeline: Tour: Environment
Kubernetes.io: Docs: Concepts: Workloads: Controllers: Deployment

Kubernetes Cronjob labeling

As I have seen a few related posts but none answered my question, I thought I would ask a new question, based on suggestions from other users here as well.
I need to make a selector label for a network policy for a running CronJob that is responsible for connecting to some other services within the cluster. As far as I know, there is no easy, straightforward way to make a selector label for the job's pod, as that would be problematic with duplicate job labels if they ever existed. I'm not sure why the CronJob can't have a selector itself that is then applied to the job and the pod.
There might also be a possibility to just put this CronJob in its own namespace and then allow all traffic from that namespace to whatever is needed in the network policy, but that does not feel like the right way to overcome the problem.
Using k8s v1.20
First of all, to select pods (spawned by your CronJob) that should be allowed by the NetworkPolicy as ingress sources or egress destinations, you may set a specific label for those pods.
You can easily set a label for Jobs spawned by a CronJob using the labels field (another example with an explanation can be found in the OpenShift CronJobs documentation):
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mysql-test
spec:
  ...
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            workload: cronjob # Sets a label for jobs spawned by this CronJob.
            type: mysql       # Sets another label for jobs spawned by this CronJob.
        ...
Pods spawned by this CronJob will have the labels type=mysql and workload=cronjob. Using these labels, you can create/customize your NetworkPolicy:
$ kubectl get pods --show-labels
NAME                          READY   STATUS      RESTARTS   AGE    LABELS
mysql-test-1615216560-tkdvk   0/1     Completed   0          2m2s   ...,type=mysql,workload=cronjob
mysql-test-1615216620-pqzbk   0/1     Completed   0          62s    ...,type=mysql,workload=cronjob
mysql-test-1615216680-8775h   0/1     Completed   0          2s     ...,type=mysql,workload=cronjob

$ kubectl describe pod mysql-test-1615216560-tkdvk
Name:         mysql-test-1615216560-tkdvk
Namespace:    default
...
Labels:       controller-uid=af99e9a3-be6b-403d-ab57-38de31ac7a9d
              job-name=mysql-test-1615216560
              type=mysql
              workload=cronjob
...
For example, this mysql-workload NetworkPolicy allows connections to all pods in the mysql namespace from any pod with the labels type=mysql and workload=cronjob (logical conjunction) in a namespace with the label namespace-name=default:
NOTE: Be careful to use correct YAML (take a look at this namespaceSelector and podSelector example).
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: mysql-workload
  namespace: mysql
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              namespace-name: default
          podSelector:
            matchLabels:
              type: mysql
              workload: cronjob
To use network policies, you must be using a networking solution which supports NetworkPolicy:
Network policies are implemented by the network plugin. To use network policies, you must be using a networking solution which supports NetworkPolicy. Creating a NetworkPolicy resource without a controller that implements it will have no effect.
You can learn more about creating Kubernetes NetworkPolicies in the Network Policies documentation.

Statefulset's replica scheduling questions

Is there any way I can tell Kubernetes how to schedule the replicas in the StatefulSet? For example, I have nodes divided into 3 different availability zones (AZ). I have labeled these nodes accordingly. Now I want K8s to put 1 replica in each AZ based on the node label. Thanks
Pods will always try to be scheduled across different nodes. To achieve what you are looking for, you can try to use a DaemonSet, which will allow only one pod of this kind on each node.
Also, you can use anti-affinity based on the pods already scheduled on a node.
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
The feature you are looking for is called Pod Anti-Affinity and can be specified as such:
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
  template:
    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: myapp
                topologyKey: az
              weight: 100
[..]
Since Kubernetes 1.18, there is also Pod Topology Spread Constraints, which is a nicer way to specify these anti-affinity rules.
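A minimal sketch of the same spreading intent with topology spread constraints, assuming the nodes carry an az label and the StatefulSet's pods are labelled app: myapp (both carried over from the example above):
apiVersion: apps/v1
kind: StatefulSet
[..]
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                       # allow at most a difference of one pod between zones
          topologyKey: az                  # the node label identifying the availability zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp                   # must match this StatefulSet's pod labels
[..]
With three zones and three replicas, DoNotSchedule effectively forces one replica per zone.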