Strategy Replace for StatefulSet - kubernetes

I have a simple issue with StatefulSet updates in my dev environment and CI.
I want to replace all StatefulSet replicas instantly, without running kubectl delete first.
Is it possible to change the manifest to strategy: Replace, as in Deployments, and continue using kubectl apply ...?

Currently, StatefulSets support only two kinds of update strategies:
RollingUpdate: The RollingUpdate update strategy implements automated, rolling update for the Pods in a StatefulSet. It is the default strategy when .spec.updateStrategy is left unspecified. When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time. It will wait until an updated Pod is Running and Ready prior to updating its predecessor.
OnDelete: The OnDelete update strategy implements the legacy (1.6 and prior) behavior. When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a StatefulSet's .spec.template.
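For reference, a minimal sketch of where the update strategy is set in a StatefulSet manifest (only the relevant fragment is shown; everything else is omitted):
apiVersion: apps/v1
kind: StatefulSet
spec:
  updateStrategy:
    type: OnDelete          # Pods are only replaced after you delete them yourself
    # type: RollingUpdate   # the default: Pods are replaced one at a time, largest ordinal first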
However, there is a plan to implement a MaxUnavailable rolling update for StatefulSets. It would allow you to update X replicas together, based on a maxUnavailable strategy. This led to an update proposal, but it is not done yet, and judging from the latest comments it is targeted as a milestone for Kubernetes 1.20.

Related

Can a deployment kind with a replica count = 1 ever result in two Pods in the 'Running' phase?

From what I understand, with the above configuration, it is possible to have 2 pods that exist in the cluster associated with the deployment. However, the old Pod is guaranteed to be in the 'Terminated' state. An example scenario is updating the image version associated with the deployment.
There should not be any scenario where there are 2 Pods that are associated with the deployment and both are in the 'Running' phase. Is this correct?
In the scenarios I tried (for example, Pod eviction or updating the Pod spec), the existing Pod enters the 'Terminating' state and a new Pod is deployed.
This is what I expected. I just wanted to make sure that no possible scenario around updating the Pod spec or Pod eviction can end up with two Pods in the 'Running' state, as that would violate the replica count = 1 config.
It depends on your update strategy. Many times it's desired to have the new pod running and healthy before you shut down the old pod; otherwise you have downtime, which may not be acceptable as per business requirements. By default, it does rolling updates.
The defaults look like this, so if you don't specify anything, this is what will be used.
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
So usually there is a moment where both pods are running. But Kubernetes terminates the old pod as soon as the new pod becomes Ready, so the window in which you can actually see both of them Ready at the same time is very short.
You can read about it in the docs: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod, and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that at max 4 Pods in total are available. In case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
This is also explained here: https://kubernetes.io/docs/tutorials/kubernetes-basics/update/update-intro/
Users expect applications to be available all the time and developers are expected to deploy new versions of them several times a day. In Kubernetes this is done with rolling updates. Rolling updates allow Deployments' update to take place with zero downtime by incrementally updating Pods instances with new ones. The new Pods will be scheduled on Nodes with available resources.
To get the behaviour you describe, you would set .spec.strategy.type to Recreate.
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate.
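A minimal sketch of the relevant fragment of the manifest, assuming nothing about the rest of the Deployment:
apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: Recreate   # all existing Pods are terminated before any new Pods are created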

Kubernetes AKS PodDisruptionBudget, HorizontalPodAutoscaler & RollingUpdate Interaction - Without Load

My question is similar to this one, about PDB, HPA, and drain.
Bear in mind that most of the time my workload is not under load, and so runs only 1 replica, but at times it scales up to ~7 or 8 replicas.
I have the following Kubernetes objects in an AKS cluster (sketched below):
Deployment with rollingUpdate.maxUnavailable set to 0, rollingUpdate.maxSurge set to 1
PodDisruptionBudget with minAvailable set to 1.
HorizontalPodAutoscaler set up to allow autoscaling, with bounds of 1-10 (lower-upper).
Cluster auto-scaling is enabled.
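For concreteness, a rough sketch of the objects described above; the names and labels (my-app) are placeholders, the exact values are only examples, and the API versions may differ on older clusters:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # placeholder name; its Pods are assumed to carry app: my-app
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
---
apiVersion: policy/v1           # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb              # placeholder name
spec:
  minAvailable: 1               # with a single replica this blocks voluntary eviction entirely
  selector:
    matchLabels:
      app: my-app               # placeholder label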
When I trigger an update of the cluster (not an update of the service Pod):
AKS first triggers a drain on the node that hosts my pod.
The update fails because, at this time, the number of replicas is 1 and the PodDisruptionBudget requires 1 Pod to stay available, so the eviction of the Pod fails, the drain fails, and AKS eventually fails the update.
Other Pods on the same node are handled fine: new nodes are created, new Pods are scheduled there, and then the old Pods and the old node are terminated. Those Pods don't specify a PodDisruptionBudget at all.
My question is: is this just an AKS-specific thing in how they implemented the cluster upgrade, or do these constraints behave the same way in other Kubernetes implementations?
I would expect the drain to allow my Pod to be evicted but rescheduled onto another node, creating the new Pod first, as per the stipulated rollingUpdate strategy.
The HorizontalPodAutoscaler should allow a second replica to be created via rollingUpdate.maxSurge=1.
I am assuming that the rollingUpdate is never called when I upgrade the cluster, though I am not sure.
The quickest solution I can think of is setting the HorizontalPodAutoscaler bounds to 2-10 so that there's never only one replica left, but this is wasteful most of the time.
What is my best option?

What handles a StatefulSet replication?

If a Deployment uses ReplicaSets to scale Pods up and down, and StatefulSets don't have ReplicaSets...
So, how does it manage to scale Pods up and down? I mean, what resource is responsible? What requests does a StatefulSet make in order to scale?
In short, the StatefulSet controller handles StatefulSet replicas.
A StatefulSet is a Kubernetes API object for managing stateful application workloads. StatefulSets handle the deployment and scaling of sets of Kubernetes pods, providing guarantees about their uniqueness and ordering.
Similar to deployments, StatefulSets manage pods with identical container specifications. They differ in terms of maintaining a persistent identity for each pod. While the pods are all created based on the same spec, they are not interchangeable, so each pod is given a persistent identifier that is maintained through rescheduling.
Benefits of a StatefulSet deployment include:
Unique identifiers—every pod in the StatefulSet is assigned a unique, stable network identity, consisting of a hostname based on the StatefulSet name and an ordinal. For example, a StatefulSet named web with three instances will have Pods named web-0, web-1 and web-2.
Persistent storage—every pod has its own stable, persistent volume, either by default or as defined per storage class. When the pods in a cluster are scaled down or deleted, their associated volumes are not lost, and the data persists. Unneeded resources can be purged by scaling down the StatefulSet to 0 before deleting the unused pods.
Ordered deployment and scaling—the pods in a StatefulSet are created and deployed in order, according to their ordinals. Pods are also shut down in reverse order, ensuring that deployment and runtime are reliable and repeatable. The StatefulSet won't scale until every required pod is running and ready, so if a pod fails, it will recreate that pod before it attempts to add more instances as per the scaling requirements.
Automated, ordered updates—a StatefulSet can handle rolling updates, shutting down each pod and rebuilding it in reverse ordinal order, until every pod has been replaced and the older versions cleaned up. The persistent volumes can be reused, so data is migrated to the new version automatically.
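For illustration, a minimal sketch of a StatefulSet, assuming a hypothetical app called web backed by a headless Service of the same name; scaling is done by changing .spec.replicas (for example with kubectl scale statefulset web --replicas=5), and the StatefulSet controller creates or deletes Pods one ordinal at a time until the actual Pod count matches that field:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                    # hypothetical name
spec:
  serviceName: web             # headless Service assumed to exist
  replicas: 3                  # the controller reconciles Pods web-0, web-1, web-2 to match this
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25    # placeholder image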

Can a Pod be managed by two different ReplicaSets?

3 pods were running under ReplicationController 'rc1'. Then I deleted only rc1 (not the pods) and created a new ReplicaSet 'rs1' with the same label selector as rc1. So, as expected, rs1 matched the existing pods created by rc1.
After some time, I created the ReplicationController rc2 with the same manifest file as rc1. Now, rc2 spun up new pods instead of adopting the existing pods with the same labels.
So I was wondering: is it possible for a pod to be scoped under two different ReplicaSets/ReplicationControllers?
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the availability of a specified number of identical Pods.
So I was wondering: is it possible for a pod to be scoped under two different ReplicaSets/ReplicationControllers?
The link a ReplicaSet has to its Pods is via the Pods’ metadata.ownerReferences field, which specifies what resource the current object is owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet’s identifying information within their ownerReferences field. It’s through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the OwnerReference is not a Controller and it matches a ReplicaSet’s selector, it will be immediately acquired by said ReplicaSet. That is explained very well (with examples) in the official documentation.
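As an illustration (the names and UID below are made up), the ownerReferences block on a Pod adopted by a ReplicaSet looks roughly like this; you can inspect it with kubectl get pod <pod-name> -o yaml:
metadata:
  name: frontend-b2zdv           # hypothetical Pod name
  labels:
    tier: frontend               # hypothetical label matched by the ReplicaSet's selector
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: frontend             # the owning ReplicaSet
      uid: 3d2c...               # UID of the owning ReplicaSet (truncated, made up)
      controller: true           # marks this owner as the managing controller
      blockOwnerDeletion: true
Note that an object can have at most one owner reference with controller: true, so a Pod cannot be actively managed by two ReplicaSets or ReplicationControllers at the same time.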
After some time, I created the ReplicationController rc2 with the same manifest file as rc1. Now, rc2 spun up new pods instead of adopting the existing pods with the same labels.
Please note that a Deployment that configures a ReplicaSet is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated.
Hope that helps.

How to roll back a Kubernetes StatefulSet application

Currently, I am migrating one of our microservices from a K8S Deployment to a StatefulSet.
While updating the Kubernetes deployment config, I noticed StatefulSets don't support revisionHistoryLimit and minReadySeconds.
revisionHistoryLimit is used to keep the previous N ReplicaSets for rollback.
minReadySeconds is the number of seconds a Pod should be ready without any of its containers crashing.
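For context, a fragment showing where those two fields sit on a Deployment (the values are just examples):
apiVersion: apps/v1
kind: Deployment
spec:
  revisionHistoryLimit: 10   # keep the last 10 old ReplicaSets around for rollback
  minReadySeconds: 30        # a new Pod must stay Ready for 30s before it counts as available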
I couldn't find any compatible settings for StatefulSets.
So my questions are:
1) How long will the master wait to consider a StatefulSet Pod ready?
2) How do I handle rollback of a stateful application?
1) You should define a readiness probe, and the master will wait for it to report the Pod as Ready before continuing.
2) StatefulSets currently do not support rollbacks. After reverting the configuration yourself, you must also delete any Pods that the StatefulSet had already attempted to run with the bad configuration; the new Pods will then automatically spin up with the correct configuration.
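Regarding 1), a minimal sketch of a readiness probe in the StatefulSet's Pod template; the container name, image, path and port are placeholders, not anything from the question:
spec:
  template:
    spec:
      containers:
        - name: app                           # placeholder container name
          image: registry.example.com/app:1.0 # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz                  # assumed health endpoint
              port: 8080                      # assumed container port
            initialDelaySeconds: 5
            periodSeconds: 10
During a rolling update, the StatefulSet controller waits until the updated Pod is Running and Ready (as reported by this probe) before it moves on to the next Pod.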