I need to run my application with "at most once" semantics. It is absolutely crucial that only one instance of my app is running at any given time, or none at all.
At first I was using a Deployment with a single replica, but then I realized that during a network partition we might inadvertently be running more than one instance.
I stumbled upon StatefulSets while searching for at-most-once semantics in Kubernetes. On reading further, the examples dealt with cases where the containers needed a persistent volume, and typically these containers ran with more than one replica. My application does not even use any volumes.
I also read about tolerations that evict the pod if its node becomes unreachable. Given that tolerations can handle the unreachable-node case, is a StatefulSet overkill for my use case?
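For reference, this is the kind of toleration I mean; the tolerationSeconds value here is just illustrative:

    # Pod spec fragment: evict this pod 30s after its node is marked unreachable
    # (the default toleration added by Kubernetes uses 300s).
    tolerations:
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 30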
My justification for the StatefulSet is this: even with tolerations there is a window in which the node becomes unreachable, the toleration seconds elapse, and the kubelet only then realizes it is cut off from the network and kills its processes; in that meantime Kubernetes can spin up another instance. I believe a StatefulSet prevents this corner case too.
Am I right? Is there any other approach to achieve this?
To quote a Kubernetes doc:
...StatefulSets maintain a sticky, stable identity for their Pods... Guaranteeing an identity for each Pod helps avoid split-brain side effects in the case when a node becomes unreachable (network partition).
As described in the same doc, when a node becomes unreachable its StatefulSet Pods are marked as "Unknown" and aren't rescheduled unless forcefully deleted. Something to consider for proper recovery if going this route.
So, yes - StatefulSet may be more suitable for the given use case than Deployment.
In my opinion, it won't be overkill to use a StatefulSet - choose the Kubernetes object that works best for your use case.
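For illustration, a minimal sketch of a single-replica StatefulSet without volumes; the names and image are placeholders, and it assumes a headless Service named my-app already exists:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: my-app
    spec:
      serviceName: my-app        # must reference an existing headless Service
      replicas: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-app:1.0    # placeholder image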
StatefulSets are not the recourse for at-most-once semantics - they are typically used for deploying stateful applications like databases, which use the persistent identity of their pods to cluster among themselves.
We have faced issues similar to what you mentioned - we had implicitly assumed that an old pod would be fully deleted before the new instance was brought up.
One option is to use a combination of preStop hooks and init containers (a sketch follows after the references below):
The preStop hook does the necessary cleanup (say, deletes an app-specific etcd key).
An init container can wait until the etcd key disappears (with an upper bound).
References:
https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
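A rough sketch of that combination; the etcd endpoint, the key name, and the images are placeholders, and it assumes etcdctl is available in both images:

    # Pod spec fragment
    initContainers:
    - name: wait-for-old-instance
      image: bitnami/etcd:3.5                     # placeholder; anything with etcdctl works
      command:
      - /bin/sh
      - -c
      - |
        # Wait (up to ~5 minutes) until the previous instance's key is gone.
        for i in $(seq 1 60); do
          if [ -z "$(etcdctl --endpoints=http://etcd:2379 get /myapp/instance-lock --print-value-only)" ]; then
            exit 0
          fi
          sleep 5
        done
        exit 1
    containers:
    - name: my-app
      image: my-app:1.0                           # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "etcdctl --endpoints=http://etcd:2379 del /myapp/instance-lock"]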
One alternative is to try anti-affinity settings, but I am not very sure about that one.
It is absolutely crucial that only one instance of my app is running at any given time.
Use a leader election pattern to guarantee at most one active replica. If you use more than one replica together with leader election, the other replicas are on standby in case of network-partition situations. This is how the components in the Kubernetes control plane solve this problem when only one active instance is needed.
Leader election algorithms in Kubernetes usually work by taking a lock (e.g. in etcd) with a timeout. Only the instance that holds the lock is active. When the lock times out, the leader election algorithm either extends the lock timeout or elects a new leader. How it works depends on the implementation, but there is a guarantee that there is at most one leader - the active instance.
See e.g. Simple Leader Election with Kubernetes, which also describes how to solve this with a sidecar container.
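A sketch of that sidecar approach; the app image and election name are placeholders, and the leader-elector image is the one used in the referenced article:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3                  # standbys can take over if the leader is lost
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
          - name: my-app
            image: my-app:1.0      # placeholder; should only do work while it is the leader
          - name: leader-elector
            image: gcr.io/google_containers/leader-elector:0.5   # image from the referenced article
            args:
            - --election=my-app
            - --http=localhost:4040                              # the app polls this to learn the current leader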
If your application is stateless, you should use a Deployment and not a StatefulSet. A StatefulSet can appear to be a way to guarantee at most one instance during a network partition, but it is mostly meant for stateful replicated applications such as a cache or database cluster, even though it may solve your specific situation as well.
Related
In Kubernetes, I have a statefulset with a number of replicas.
I've set the updateStrategy to RollingUpdate.
I've set podManagementPolicy to Parallel.
My statefulset instances do not have a persistent volume claim -- I use the statefulset as a way to allocate ordinals 0..(N-1) to pods in a deterministic manner.
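Roughly, the relevant part of my spec looks like this (the name and replica count are just illustrative):

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: worker
    spec:
      serviceName: worker
      replicas: 4
      podManagementPolicy: Parallel
      updateStrategy:
        type: RollingUpdate
      selector:
        matchLabels:
          app: worker
      template:
        metadata:
          labels:
            app: worker
        spec:
          containers:
          - name: worker
            image: worker:latest   # placeholder; note: no volumeClaimTemplates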
The main reason for this, is to keep availability for new requests while rolling out software updates (freshly built containers) while still allowing each container, and other services in the cluster, to "know" its ordinal.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset (mapped by the ordinal) without a temporary outage.
Unfortunately, I don't see a way of doing this -- what am I missing?
Because I don't use volume claims, you might think I could use deployments instead, but I really do need each of the pods to have a deterministic ordinal, that:
is unique at the point of dispatching new service requests (incoming HTTP requests, including public ingresses)
is discoverable by the pod itself
is persistent for the duration of the pod lifetime
is contiguous from 0 .. (N-1)
The second-best option I can think of is using something like ZooKeeper or etcd to manage this property separately, using one of the traditional long-poll or leader-election mechanisms. But given that Kubernetes already knows (or can know) about all the necessary bits, AND Kubernetes service mapping knows how to steer incoming requests from old instances to new instances, that seems more redundant and complicated than necessary, so I'd like to avoid it.
I assume that you need this for a stateful workload, e.g. a workload that requires writes. Otherwise you can use Deployments with multiple pods online for your shards. A key feature of StatefulSets is that they provide unique, stable network identities for the instances.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset.
This behavior is supported by Kubernetes pods. But you also need to implement support for it in your application.
New traffic will not be sent to your "old" pods.
A SIGTERM signal will be sent to the pod - your application may want to listen to this and do some action.
After a configurable "termination grace period", your pod will get killed.
See Kubernetes best practices: terminating with grace for more info about pod termination.
Be aware that you should connect to services instead of directly to pods for this to work. E.g. you need to create headless services for the replicas in a StatefulSet.
If your clients are connecting to a specific headless service, e.g. N, it will not be available for some time during upgrades. You need to decide whether your clients should retry their connections during this period or connect to another headless service if N is not available.
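A minimal sketch of such a headless Service (names are placeholders); each StatefulSet pod then gets a stable DNS name of the form worker-0.worker.<namespace>.svc.cluster.local:

    apiVersion: v1
    kind: Service
    metadata:
      name: worker
    spec:
      clusterIP: None        # headless: DNS resolves to the individual pod IPs
      selector:
        app: worker
      ports:
      - port: 80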
If you are in a case where you need:
a stateful workload (e.g. support for write operations)
high availability for your instances
then you need some form of distributed system that does replication/synchronization, e.g. using Raft, or a product that implements this. Such a system is most easily deployed as a StatefulSet.
You may be able to do this using Container Lifecycle Hooks, specifically the preStop hook.
We use this to drain connections from our Varnish service before it terminates.
However, you would need to implement (or find) a script to do the draining.
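For illustration, a sketch of what that can look like; the image, grace period, and drain script are placeholders for whatever your application provides:

    # Pod spec fragment
    terminationGracePeriodSeconds: 120          # give the drain time to complete
    containers:
    - name: varnish
      image: varnish:6                          # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "/usr/local/bin/drain.sh"]   # hypothetical drain script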
Kubernetes tends to assume apps are small/lightweight/stateless microservices which can be stopped on one node and restarted on another node with no downtime.
We have a slow-starting (20 min) legacy (stateful) application which, once running as a set of pods, should not be rescheduled without due cause. The reason being that all user sessions will be killed and the users will have to log in again. There is NO way to serialize the sessions and externalize them. We want 3 instances of the pod.
Can we tell k8s not to move a pod unless absolutely necessary (i.e. it dies)?
Additional information:
The app is a tomcat/java monolith
Assume for the sake of argument we would like to run it in Kubernetes
We do have a liveness test endpoint available
There is no benefit in telling k8s to use only one pod; that is not the "spirit" of k8s. In that case, it might be better to use a dedicated machine for your app.
But you can assign a pod to a specific node - see Assigning Pods to Nodes. That should only be necessary when there are special hardware requirements (e.g. an AI microservice needs a GPU, which is only on node xy).
k8s doesn't restart your pods for fun. It will restart them when there is a reason (node died, app died, ...), and I have never noticed a "random reschedule" in a cluster. Without further information (deployment, logs, cluster) it is hard to say what exactly happened to you.
And for your comment: there are different types of recreation; one of them starts a fresh instance and kills the old one only once the startup was successful. Look here: Kubernetes deployment strategies
All points together:
Don't pin your app to a node - k8s selects nodes intelligently.
There are normally no planned reschedules in k8s.
k8s recreates pods only if there is a reason. Maybe your app didn't answer on the liveness endpoint? Or someone/something deleted your pod?
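Since the app takes around 20 minutes to start and a liveness endpoint is available, a generous probe configuration helps avoid exactly that kind of restart. A sketch, where the path, port, and timings are assumptions:

    # Container fragment; /health, port 8080 and the timings are placeholders.
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 30
    startupProbe:                # holds off the liveness probe for up to ~25 minutes
      httpGet:
        path: /health
        port: 8080
      failureThreshold: 100
      periodSeconds: 15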
Is there a way in Kubernetes to upgrade a given type of pod first when we have a Deployment or StatefulSet with two or more replicas (where one pod is the master and the others are not)?
To be specific, my requirement is that when an upgrade is triggered on the Deployment/StatefulSet, the master is upgraded last among the given number of replicas.
The only thing that's built into Kubernetes is the automatic sequential naming of StatefulSet pods.
If you have a StatefulSet, one of its pods is guaranteed to be named statefulsetname-0. That pod can declare itself the "master" for whatever purposes this is required. A pod can easily determine (by looking at its hostname(1)) whether it is the "master", and if it isn't, it can also easily determine what pod is. Updates happen by default in numerically reverse order, so statefulsetname-0 will be upgraded last, which matches your requirement.
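As an illustration of how a pod can learn its own name (and hence its ordinal), here is a pod-template fragment using the downward API; the env var name and image are assumptions:

    # Pod template fragment: expose the pod's name so the app can check whether
    # it ends in "-0" and declare itself the master.
    containers:
    - name: app
      image: my-app:1.0            # placeholder
      env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name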
StatefulSets have other properties, which you may or may not want. It's impossible for another pod to take over as the master if the first one fails; startup and shutdown happens in a fairly rigid order; if any part of your infrastructure is unstable then you may not be able to reliably scale the StatefulSet.
If you don't want a StatefulSet, you can implement your own leader election in a couple of ways (use a service like ZooKeeper or etcd that can help you maintain this state across the cluster; bring in a library for a leader-election algorithm like Raft). Kubernetes doesn't provide this on its own. The cluster will also be unaware of the "must upgrade the leader last" requirement, but if the current leader is terminated, another pod can take over the leadership.
The easiest way is probably to have the master in one Deployment/StatefulSet and the followers in another. This approach makes the distinction persistent and lets you make use of the update strategies in k8s.
Since k8s does not differentiate pods by their containers or by any role specific to your application's architecture ('master'), it is better to manage your own Deployments when you need a specific sequence that is outside of Deployment/StatefulSet control. You can patch a pod, but the change will not persist across a rollout restart.
I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies one replica should be running. When a pod is terminated (for example, because it was running on a node that is getting scaled down), is the new pod scheduled before the termination begins or after the termination is done? The docs say that scheduling happens when terminating, but don't mention at which phase.
To achieve seamless node scale-down without disruption to any services, I would expect k8s to scale the pods up to N+1 replicas (at which point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains and then schedules any missing pods based on the configuration. Is it possible to configure this behaviour, or is this currently not possible?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found the same "rolling" strategy to be possible for scale-down.
EDIT
TL;DR I'm looking for HA on a) two+ replica deployment and b) one replica deployment
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer).
b) If you're okay with short disruption, PDB is the official way to go. If you need a workaround, my answer can be of inspiration.
The scale-down behavior can be configured with what is called a Pod Disruption Budget.
In a PodDisruptionBudget that selects your Deployment's Pods, you can define the maxUnavailable or minAvailable number of Pods during voluntary disruptions like draining nodes.
For how to do it, check out the K8s Documentation.
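A minimal sketch (name, labels, and the threshold are placeholders); on clusters older than 1.21 the API group is policy/v1beta1:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: my-app-pdb
    spec:
      minAvailable: 1            # keep at least one pod running during a drain
      selector:
        matchLabels:
          app: my-app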
Below are some insights; I hope this helps:
If you use a Deployment, the ReplicaSet controller ensures that you always have the desired number of replicas running - no less, no more. So when you kill a node (which hosts one of your replicas), the new pod is scheduled after the termination of the original replica. It's up to you to anticipate whether it's planned maintenance.
If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official doc.
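For example, a pod-template fragment that prefers spreading replicas of the same app across different nodes (the label is a placeholder):

    # Pod template fragment
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname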
Hate to answer my own question, but an easy solution for a high-availability service with only one pod (not wasting resources on an idle replica) is to use a preStop hook (to make the action blocking if proper SIGTERM handling is not implemented) together with a terminationGracePeriodSeconds long enough for the replacement pod to start.
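A sketch of that workaround (image, sleep duration, and grace period are illustrative):

    # Pod spec fragment: keep the old pod around while the replacement starts.
    terminationGracePeriodSeconds: 240
    containers:
    - name: my-app
      image: my-app:1.0                # placeholder
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 120"]   # block termination until the new pod is up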
Contradicting what has been said here, the scheduling happens while the pod is terminating. I did a quick test (should have done that along with reading the docs): I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds.
When deleted, the pod enters the Terminating state and stays there for 240 seconds. Immediately after the pod was marked Terminating, a new pod was scheduled in its place.
So the previous pod has time to finish whatever it is doing and the new one can seamlessly take its place.
I haven't tested how the networking behaves, since the LB will stop sending new requests, but I assume the downtime will be much lower than with terminationGracePeriodSeconds left at the default.
Beware that this is not official by any means, but it serves as a workaround for my use case.
I have three DaemonSet pods, each containing a Hadoop ResourceManager container. One of the three is the active node and the other two are standby nodes.
So there are two questions:
Is there a way to let Kubernetes know whether the Hadoop ResourceManager inside a pod is an active node or a standby node?
I want to control the rolling update so that the standby nodes are updated first and the active node is updated last, to reduce the number of active-node switchovers, which carry risk.
Consider the following: Deployments, DaemonSets and ReplicaSets are abstractions meant to manage a uniform group of objects.
In your specific case, although you're running the same application, you can't say it's a uniform group of objects, as you have two types: active and standby objects.
There is no way to tell Kubernetes which is which if they're grouped in what is supposed to be a uniform set of objects.
As suggested by #wolmi, having them in a Deployment instead of a DaemonSet still leaves you with the issue that, because of the aforementioned logic, deployment strategies can't individually identify objects to control when they're updated.
My suggestion would be that, in addition to using a Deployment with node affinity to ensure a highly available environment, you separate active and standby objects into different Deployments/Services and base your rolling update strategy on that scenario.
This ensures that you update the standby nodes first, removing the risk of updating the active node before the others.
I think this is not the best way to do it. I totally understand that you use a DaemonSet to make sure Hadoop runs in an HA setup with one instance per node, but you can get the same scenario using a Deployment plus affinity parameters, more concretely pod anti-affinity; then you can be sure that only one Hadoop pod exists per K8s node.
With that new approach, you can use a replication controller to control the rolling update; some resources from the documentation:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/
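A sketch of the anti-affinity described above; the label is a placeholder. The required rule ensures at most one such pod per node:

    # Pod template fragment
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: hadoop-resourcemanager
          topologyKey: kubernetes.io/hostname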