StatefulSet: longer rolling update leads to version mismatching - Kubernetes

The application is deployed on K8s using a StatefulSet because it is stateful in nature. Around 250+ pods are running, and an HPA has also been implemented that can scale up to 400 pods.
When a new deployment occurs, it takes a long time (~10-15 minutes) to update all pods in a rolling-update fashion.
Problem: end users get responses from 2 versions of pods until all pods are replaced with the new revision.
I have been searching for an architecture that reduces the overall deployment time. The best option I have found is a blue/green strategy, but it has a significant impact on integrated services such as monitoring, logging, telemetry etc. because of the 2 naming conventions.
Ideally I am looking for a solution like maxSurge on a Deployment, where new pods are created first and then traffic is shifted to them. A StatefulSet, however, does not support maxSurge with the RollingUpdate strategy; the controller deletes and recreates each Pod in the StatefulSet based on ordinal index, from the largest to the smallest.

The solution is to do a partitioned rolling update combined with a canary deployment.
Let’s suppose we have the StatefulSet workload defined by the following yaml file:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
    version: "1.20"
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
    version: "1.20"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # Label selector that determines which Pods belong to the StatefulSet
                 # Must match spec: template: metadata: labels
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx # Pod template's label selector
        version: "1.20"
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx:1.20
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
You could patch the StatefulSet to set a partition and then change the image and version label; only pods with an ordinal greater than or equal to the partition are updated. (In this case, since there are only 3 pods, only the last one, web-2, will change its image.)
$ kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":2}}}}'
$ kubectl patch statefulset web --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value":"nginx:1.21"}]'
$ kubectl patch statefulset web --type='json' -p='[{"op": "replace", "path": "/spec/template/metadata/labels/version", "value":"1.21"}]'
At this point, you have a pod with the new image and version label ready to use, but since the version label is different, the traffic is still going to the other two pods. If you change the version in the yaml file and apply the new configuration, the rollout will be transparent, since there is already a pod ready to receive the traffic:
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
    version: "1.21"
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
    version: "1.21"
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # Label selector that determines which Pods belong to the StatefulSet
                 # Must match spec: template: metadata: labels
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx # Pod template's label selector
        version: "1.21"
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: nginx:1.21
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
$ kubectl apply -f file-name.yaml
Once traffic is migrated to the pod containing the new image and version label, you should patch the StatefulSet again and remove the partition with the command kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":0}}}}'
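You can then watch the remaining pods being replaced with the standard rollout command:
$ kubectl rollout status statefulset/web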
Note: You will need to be very careful with the size of the partition, since the remaining pods will handle all of the traffic for some time.
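As a rough illustration at the scale described in the question (the numbers are only an example): with 250 replicas you could canary the new revision on the last ten pods by setting the partition to 240, so that web-240 through web-249 are updated first while the other 240 pods keep serving the old revision:
$ kubectl patch statefulset web -p '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":240}}}}'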

Related

Kubernetes non specific spec.selector does not prevent Kubernetes from working correctly

I've experienced a surprising behavior when playing around with Kubernetes and I wanted to know if there is any good explanation behind it.
I've noticed that when two Kubernetes deployments are created with the same labels and the same spec.selector, the deployments still function correctly, even though using the same selector "should" cause them to be confused regarding which pods are related to each one.
Example configurations which present this -
example_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
    extra_label: one
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
example_deployment_2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-2
  labels:
    app: nginx
    extra_label: two
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
I expected the deployments not to work correctly, since they would select pods from each other and assume those pods are theirs.
The actual result is that the deployments seem to be created correctly, but entering the deployment from k9s returns all of the pods. This is true for both deployments.
Can anyone please shed light on why this is happening? Is there additional internal filtering in Kubernetes to prevent pods which were not really created by the deployment from being associated with it?
I'll note that I've seen this behavior in AWS and have reproduced it in Minikube.
When you create a K8s Deployment, K8s creates a ReplicaSet to manage the pods; this ReplicaSet then creates the pods based on the number of replicas provided (or patched by the HPA). In addition to the labels and annotations you provide, the ReplicaSet adds an ownerReferences entry that contains its name and uid, so even if you have 4 pods with the same labels, each pair of pods will have a different ownerReferences entry used by its ReplicaSet to manage them:
apiVersion: v1
kind: Pod
metadata:
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: <replicaset name>
    uid: <replicaset uid>
  ...
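You can see this for yourself on any pod created by a Deployment (the pod name below is a placeholder):
$ kubectl get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}'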

Openshift: Is it possible to make different pods of the same deployment to use different resources?

In Openshift, say there are two pods of the same deployment in a Test env. Is it possible to make one pod use/connect to database1 and make the other pod use/connect to database2 via a label or configuration?
I have created two different pods with the same code base, i.e. the image contains the same compiled code. Using Spring profiles, I passed two different arguments for the connection to the Oracle database, for example.
How about trying to use a StatefulSet to deploy each pod? A StatefulSet gives each pod its own PersistentVolume, so if you place a configuration file with different database connection data on each PersistentVolume, each pod can use a different database, because each pod refers to a different config file.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app
spec:
  serviceName: "app"
  replicas: 2
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - name: app
        image: example.com/app:1.0
        ports:
        - containerPort: 8080
          name: web
        volumeMounts:
        - name: databaseconfig
          mountPath: /usr/local/databaseconfig
  volumeClaimTemplates:
  - metadata:
      name: databaseconfig
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Mi
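If maintaining one config file per PersistentVolume is too much, another sketch (the profile names and jar path below are assumptions, not from the question) is to derive the pod's ordinal from the hostname the StatefulSet assigns (app-0, app-1, ...) and select a Spring profile from it inside the container command:
      containers:
      - name: app
        image: example.com/app:1.0
        # Sketch: app-0 activates profile "db1", every other ordinal activates "db2".
        # Profile names and the jar path are illustrative assumptions.
        command: ["sh", "-c"]
        args:
        - |
          ordinal="$(hostname)"; ordinal="${ordinal##*-}"
          if [ "$ordinal" = "0" ]; then profile=db1; else profile=db2; fi
          exec java -jar /app.jar --spring.profiles.active="$profile"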

How to update a storageClassName in a statefulset?

I am new to K8s and trying to update the storageClassName in a StatefulSet (from default to default-t1; it is the only change in the yaml).
I tried running kubectl apply -f test.yaml
The only difference between the 1st and 2nd yaml (the one used to apply the update) is storageClassName: default-t1 instead of default
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: default
      resources:
        requests:
          storage: 1Gi
Every time I try to update it I get The StatefulSet "web" is invalid: spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
What am I missing or what steps should I take to do this?
There's no easy way as far as I know but it's possible.
1. Save your current StatefulSet configuration to a yaml file:
   kubectl get statefulset some-statefulset -o yaml > statefulset.yaml
2. Change storageClassName in the volumeClaimTemplates section of the StatefulSet configuration saved in the yaml file.
3. Delete the StatefulSet without removing its pods:
   kubectl delete statefulset some-statefulset --cascade=orphan
4. Recreate the StatefulSet with the changed StorageClass:
   kubectl apply -f statefulset.yaml
5. For each Pod in the StatefulSet, first delete its PVC (it will get stuck in a terminating state until you delete the pod) and then delete the Pod itself.
After deleting each Pod, the StatefulSet will recreate the Pod and, since there is no PVC any more, also a new PVC for that Pod using the changed StorageClass defined in the StatefulSet.
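For example, the PVC name follows the <claim-template-name>-<statefulset-name>-<ordinal> pattern, so with the manifests in this question the per-pod step for the first pod would look like:
$ kubectl delete pvc www-web-0
$ kubectl delete pod web-0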
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: <storage class name>
      resources:
        requests:
          storage: 1Gi
In the above file, add the storage class, and then you can apply it:
kubectl apply -f .yaml
The first line of a StatefulSet should define an "apiVersion"; you can see an example here: statefulset. Please add it as the first line:
apiVersion: apps/v1
Could you show me the output of your 'www' PVC?
kubectl get pvc www -o yaml
In the PVC you have the field "storageClassName", which should be set to the StorageClass you want to use, so in your case it would be:
storageClassName: default-t1

How does Kubernetes control replication?

I was curious about how Kubernetes controls replication. My config yaml file specifies that I want three pods, each with an Nginx server, for instance (from here -- https://kubernetes.io/docs/concepts/workloads/controllers/replicationcontroller/#how-a-replicationcontroller-works):
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
How does Kubernetes know when to shut down pods and when to spin up more? For example, for high traffic loads, I'd like to spin up another pod, but I'm not sure how to configure that in the YAML file so I was wondering if Kubernetes has some behind-the-scenes magic that does that for you.
Kubernetes does no magic here: from your configuration alone, it simply does not know about, nor does it change, the number of replicas.
The concept you are looking for is called an autoscaler. It uses metrics from your cluster (which need to be enabled/installed as well) and can then decide whether Pods must be scaled up or down, and will in effect change the number of replicas in the Deployment or ReplicationController. (Please use a Deployment, not a ReplicationController; the latter does not support rolling updates of your applications.)
You can read more about the autoscaler here: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
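For a quick start you can also create the autoscaler imperatively; the deployment name and thresholds here are only examples:
$ kubectl autoscale deployment nginx --min=3 --max=10 --cpu-percent=80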
You can use a HorizontalPodAutoscaler along with a Deployment as below. This will autoscale your pods declaratively based on a target CPU utilization.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: $DEPLOY_NAME
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: $DEPLOY_NAME
    spec:
      containers:
      - name: $DEPLOY_NAME
        image: $DEPLOY_IMAGE
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "0.2"
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 1024Mi
---
apiVersion: v1
kind: Service
metadata:
  name: $DEPLOY_NAME
spec:
  selector:
    app: $DEPLOY_NAME
  ports:
  - port: 8080
  type: ClusterIP
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: $DEPLOY_NAME
  namespace: $K8S_NAMESPACE
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: $DEPLOY_NAME
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 60
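Once applied (and assuming a metrics pipeline such as metrics-server is installed in the cluster), you can watch the autoscaler's decisions with:
$ kubectl get hpa $DEPLOY_NAME --watch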

kubernetes "n" independent pods with identity 1 to n

Requirement 1 (routes):
I need to be able to route to "n" independent Kubernetes deployments. Such as:
http://building-1-building
http://building-2-building
http://building-3-building
... n ...
Each building is receiving traffic that is independent of the others.
Requirement 2 (independence):
If pod for building-2-deployment dies, I would like only building-2-deployment to be restarted and the other n-1 deployments to be not affected.
Requirement 3 (kill then replace):
If the pod for building-2-deployment is unhealthy, I would like it to be killed and then a new one created. Not a replacement created first and then the sick one killed.
When I update the image and issue a "kubectl apply -f building.yaml", I would like each deployment to be shut down and then a new one started with the new software. In other words, do not create a second one and then kill the first.
Requirement 4 (yaml): This application is created and updated with a yaml file so that it's repeatable and archivable.
kubectl create -f building.yaml
kubectl apply -f building.yaml
Partial Solution:
The following yaml creates routes (requirement 1) and operates each deployment independently (requirement 2), but fails to kill before starting a replacement (requirement 3).
This partial solution is a little verbose, as each deployment is replicated "n" times where only the "n" is changed.
I would appreciate a suggestion that solves all 3 requirements.
apiVersion: v1
kind: Service
metadata:
  name: building-1-service # http://building-1-service
spec:
  ports:
  - port: 80
    targetPort: 80
  type: NodePort
  selector:
    app: building-1-pod #matches name of pod being created by deployment
---
apiVersion: v1
kind: Service
metadata:
  name: building-2-service # http://building-2-service
spec:
  ports:
  - port: 80
    targetPort: 80
  type: NodePort
  selector:
    app: building-2-pod #matches name of pod being created by deployment
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: building-1-deployment # name of the deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: building-1-pod # matches name of pod being created
  template:
    metadata:
      labels:
        app: building-1-pod # name of pod, matches name in deployment and route "location /building_1/" in nginx.conf
    spec:
      containers:
      - name: building-container # name of docker container
        image: us.gcr.io//proj-12345/building:2018_03_19_19_45
        resources:
          limits:
            cpu: "1"
          requests:
            cpu: "10m"
        ports:
        - containerPort: 80
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: building-2-deployment # name of the deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: building-2-pod # matches name of pod being created
  template:
    metadata:
      labels:
        app: building-2-pod # name of pod, matches name in deployment and route "location /building_2/" in nginx.conf
    spec:
      containers:
      - name: building-container # name of docker container
        image: us.gcr.io//proj-12345/building:2018_03_19_19_45
        resources:
          limits:
            cpu: "1"
          requests:
            cpu: "10m"
        ports:
        - containerPort: 80
You are in luck. This is exactly what StatefulSets are for.
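A minimal sketch of that direction (names are adapted from the question; treat it as a starting point, not a drop-in replacement): a single StatefulSet plus a headless Service gives every replica a stable, individually addressable DNS name such as building-0.building-service, building-1.building-service, and so on, and when a pod dies or is updated the controller deletes it and recreates it with the same identity instead of surging a second copy first.
apiVersion: v1
kind: Service
metadata:
  name: building-service          # headless service; pods resolve as building-<n>.building-service
spec:
  clusterIP: None
  selector:
    app: building
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: building
spec:
  serviceName: "building-service"
  replicas: 3                      # one pod per building: building-0, building-1, building-2
  selector:
    matchLabels:
      app: building
  template:
    metadata:
      labels:
        app: building
    spec:
      containers:
      - name: building-container
        image: us.gcr.io//proj-12345/building:2018_03_19_19_45   # image taken from the question
        ports:
        - containerPort: 80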