Kubernetes: how to maintain session affinity during an HPA scale-down event

I have deployed my application in Kubernetes using a Deployment.
Whenever a user logs in to the application, the pod generates a session for that user.
To maintain session stickiness I have set a session cookie using NGINX Ingress annotations.
When the HPA scales pods down, users face a logout problem: if the ingress tied their session to a pod that gets terminated, they need to log in again.
What I want is some sort of graceful termination of the connection: when a pod is in the terminating state, it should keep serving existing sessions until the grace period expires.
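For reference, a minimal sketch of the kind of cookie-affinity annotations being described (assuming the community ingress-nginx controller; the name, host, service, and cookie settings are illustrative):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                    # illustrative name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "SESSION_AFFINITY"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
spec:
  rules:
  - host: my-app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80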

What I want is some sort of graceful termination of the connection: when
a pod is in the terminating state, it should keep serving existing sessions
until the grace period expires.
You can use the terminationGracePeriodSeconds key in the Pod spec:
Kubernetes will wait up to the specified number of seconds after sending SIGTERM, and only after that will the Pod be forcibly terminated.
You can read more at: https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace

The answer from Harsh Manvar is great; however, I want to expand on it a bit :)
You can of course use terminationGracePeriodSeconds in the Pod spec. Look at the example YAML:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: my-container
    image: my-image
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It’s important to note that this happens in parallel to the preStop hook and the SIGTERM signal. Kubernetes does not wait for the preStop hook to finish.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML, as in the example above, which raises it to 60 seconds.
For more, look here.
If you want to know exactly what the pod lifecycle looks like, see this link to the official documentation. The part about the termination of pods should be the most interesting; it also describes exactly how termination takes place.
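To make the parallel preStop/SIGTERM behaviour described above concrete, here is a minimal sketch (container name, image, and timings are illustrative) that combines a preStop delay with a grace period long enough to cover it:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  terminationGracePeriodSeconds: 60        # must cover the preStop delay plus app shutdown
  containers:
  - name: my-container
    image: my-image
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 15"]   # keep serving while endpoints are deregistered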
It is also recommended that applications deployed on Kubernetes follow a design that complies with established standards.
One set of standards for modern, cloud-based applications is known as the Twelve-Factor App:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Some web systems rely on “sticky sessions” – that is, caching user session data in memory of the app’s process and expecting future requests from the same visitor to be routed to the same process. Sticky sessions are a violation of twelve-factor and should never be used or relied upon. Session state data is a good candidate for a datastore that offers time-expiration, such as Memcached or Redis.
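If you decide to externalize session state as the Twelve-Factor guidance suggests, a minimal sketch might be a small Redis Deployment and Service that the application pods talk to (the names and sizing here are illustrative, not a production setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-store               # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: session-store
  template:
    metadata:
      labels:
        app: session-store
    spec:
      containers:
      - name: redis
        image: redis:7
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: session-store
spec:
  selector:
    app: session-store
  ports:
  - port: 6379

With sessions stored externally like this, any replica can serve any request, so an HPA scale-down no longer logs users out.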

Related

In kubernetes, is there a way to make statefulset pods linger to finish requests on rolling update?

In Kubernetes, I have a statefulset with a number of replicas.
I've set the updateStrategy to RollingUpdate.
I've set podManagementPolicy to Parallel.
My statefulset instances do not have a persistent volume claim -- I use the statefulset as a way to allocate ordinals 0..(N-1) to pods in a deterministic manner.
The main reason for this, is to keep availability for new requests while rolling out software updates (freshly built containers) while still allowing each container, and other services in the cluster, to "know" its ordinal.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset (mapped by the ordinal) without a temporary outage.
Unfortunately, I don't see a way of doing this -- what am I missing?
Because I don't use volume claims, you might think I could use deployments instead, but I really do need each of the pods to have a deterministic ordinal, that:
is unique at the point of dispatching new service requests (incoming HTTP requests, including public ingresses)
is discoverable by the pod itself (one way to do this is sketched after this question)
is persistent for the duration of the pod lifetime
is contiguous from 0 .. (N-1)
The second-best option I can think of is using something like zookeeper or etcd to separately manage this property, using some of the traditional long-poll or leader-election mechanisms, but given that kubernetes already knows (or can know) about all the necessary bits, AND kubernetes service mapping knows how to steer incoming requests from old instances to new instances, that seems more redundant and complicated than necessary, so I'd like to avoid that.
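As a side note on the "discoverable by the pod itself" requirement: one common approach, sketched here with illustrative names as a pod-template fragment, is to inject the pod name via the downward API and let the application parse the ordinal suffix:

containers:
- name: my-app                        # illustrative container name
  image: my-image
  env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name      # e.g. "my-app-3"; the ordinal is the numeric suffix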
I assume that you need this for a stateful workload, a workload that e.g. requires writes. Otherwise you can use Deployments with multiple pods online for your shards. A key feature with StatefulSet is that they provide unique stable network identities for the instances.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset.
This behavior is supported by Kubernetes pods. But you also need to implement support for it in your application.
New traffic will not be sent to your "old" pods.
A SIGTERM signal will be sent to the pod - your application may want to listen to this and do some action.
After a configurable "termination grace period", your pod will get killed.
See Kubernetes best practices: terminating with grace for more info about pod termination.
Be aware that you should connect to services instead of directly to pods for this to work. E.g. you need to create headless services for the replicas in a StatefulSet.
If your clients are connecting to a specific headless service, e.g. N, this means that it will not be available for some time during upgrades. You need to decide whether your clients should retry their connections during this period or connect to another headless service if N is not available.
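A minimal sketch of such a headless Service (name and port are illustrative); each replica then gets a stable DNS name such as my-app-0.my-app.<namespace>.svc.cluster.local:

apiVersion: v1
kind: Service
metadata:
  name: my-app                  # must match .spec.serviceName of the StatefulSet
spec:
  clusterIP: None               # headless: DNS returns the individual pod addresses
  selector:
    app: my-app
  ports:
  - port: 8080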
If you are in a case where you need:
stateful workload (e.g. support for write operations)
want high availability for your instances
then you need a form of distributed system that does some form of replication/synchronization, e.g. using Raft, or a product that implements this. Such a system is most easily deployed as a StatefulSet.
You may be able to do this using Container Lifecycle Hooks, specifically the preStop hook.
We use this to drain connections from our Varnish service before it terminates.
However, you would need to implement (or find) a script to do the draining.
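A hedged sketch of what such a preStop hook might look like (drain-connections.sh is hypothetical; you would supply your own draining logic, e.g. telling the service to stop accepting new connections and waiting for in-flight ones to finish):

lifecycle:
  preStop:
    exec:
      # drain-connections.sh is a hypothetical script - supply your own draining logic
      command: ["/bin/sh", "-c", "/usr/local/bin/drain-connections.sh"]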

Automatic Pod Deletion Delay in Kubernetes

Is there is a way to automatically delay all Kubernetes pod deletion requests such that the endpoint deregistration is signaled, but the pod's SIGTERM is delayed by several seconds?
It would be preferable, but not required, if the delay only affected pods with an Endpoint/Service.
Background:
It is well established that some traffic can continue to reach a Pod after it has been sent the SIGTERM termination signal, due to the asynchronous nature of endpoint deregistration and the deletion signal. The recommended mitigation is to introduce a few seconds of delay in the pod's preStop lifecycle hook by invoking sleep.
The difficulty rapidly arises when the pod's deployment is done via Helm or another upstream source, or when there are large numbers of deployments and containers to be managed. Modifying many deployments in such a way may be difficult, or even impossible (e.g. the container may not have a sleep binary, a shell, or anything but the application executable).
I briefly explored a mutating admission controller, but that seems unworkable for dynamically adding a preStop hook, since not all images have a /bin/sleep, or they already have a preStop hook that would need image-specific knowledge to merge.
(Of course, all of this could be avoided if the K8S API made the endpoint deregistration synchronous with a timeout to avoid deadlock (hint, hint), but I haven't seen any discussions of such a change. Yes, there are tons of reasons why this isn't synchronous, but that doesn't mean something can't be done.)
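One note on the missing sleep binary: newer Kubernetes releases add a built-in sleep action for lifecycle hooks (the PodLifecycleSleepAction feature, beta since v1.30), which needs no binary or shell in the image. Whether it is available depends on your cluster version, so treat this sketch as an assumption to verify:

lifecycle:
  preStop:
    sleep:
      seconds: 5        # built-in delay; no /bin/sleep or shell required in the image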
The Kubernetes pod termination lifecycle has the following steps:
Pod is set to the “Terminating” State and removed from the endpoints list of all Services
preStop hook is executed
SIGTERM signal is sent to the pod
Kubernetes waits for a grace period, default is 30 seconds
SIGKILL signal is sent to pod, and the pod is removed
The grace period is what you need.
It's important to note that this grace period happens in parallel to the preStop hook and the SIGTERM signal.
A call to the preStop hook fails if the container is already in terminated or completed state. It is blocking, meaning it is synchronous, so it must complete before the call to delete the container can be sent.
Here you can read more about Container Lifecycle Hooks.
So for example you could set the terminationGracePeriodSeconds: 90 and this might look like the following:
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: myApplication
You can read the Kubernetes docs regarding Termination of Pods. I also recommend the great blog post Kubernetes best practices: terminating with grace.

Is there a way to configure Istio to route traffic to a POD which is in the terminating state?

I have a Kubernetes cluster with two services deployed: SvcA and SvcB - both in the service mesh.
SvcA is backed by a single Pod, SvcA_P1. The application in SvcA_P1 exposes a PreStop HTTP hook. When performing a "kubectl drain" command on the node where SvcA_P1 resides, the Pod transitions into the "terminating" state and remains in that state until the application has completed its work (the REST request returns and Kubernetes removes the pod). The work for SvcA_P1 includes completing ongoing in-dialog (belonging to established sessions) HTTP requests/responses. It can stay in the "terminating" state for hours before completing.
When the Pod enters the "terminating" phase, Istio sidecar appears to remove the SvcA_P1 from the pool. Requests sent to SvcA_P1 from e.g., SvcB_P1 are rejected with a "no healthy upstream".
Is there a way to configure Istio/Envoy to:
Continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
Reject traffic without session affinity to SvcA_P1 (no JSESSIONID, cookies, or special HTTP headers)?
I have played around with the DestinationRule(s), modifying trafficPolicy.loadBalancer.consistentHash.[httpHeaderName|httpCookie] with no luck. Once the Envoy removes the upstream server, the new destination is re-hashed using the reduced set of servers.
Thanks,
Thor
According to the Kubernetes documentation, when a pod must be deleted, three things happen simultaneously:
Pod shows up as "Terminating" when listed in client commands.
When the Kubelet sees that a Pod has been marked as terminating because the "dead" timer for the Pod has been set in the API server, it begins the pod shutdown process. If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
Since Istio works as a mesh network below/behind Kubernetes Services, and Services no longer consider a Pod in the Terminating state as a destination for traffic, tweaking Istio policies doesn't help much.
Is there a way to configure Istio/Envoy to continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
This problem is at Kubernetes level rather than Istio/Envoy level: by default, upon entering the "Terminating" state, Pods are removed from their corresponding Services.
You can change that behaviour by telling your Service to advertise Pods in the "Terminating" state: see that answer.
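The linked answer is not reproduced here, but one Service-level field commonly used for this purpose is publishNotReadyAddresses, which keeps endpoints advertised regardless of readiness. Whether it covers terminating pods in your setup depends on the Kubernetes version and endpoint implementation, so treat this sketch as an assumption to verify rather than a confirmed fix:

apiVersion: v1
kind: Service
metadata:
  name: svca                        # illustrative, matching SvcA from the question
spec:
  publishNotReadyAddresses: true    # advertise endpoints even when pods are not ready
  selector:
    app: svca
  ports:
  - port: 80
    targetPort: 8080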

Graceful termination of kubernetes pods

We have an application with 4 pods running with a load balancer! We want to try the rolling update, but we are not sure what happens when a pod goes down! The documentation is unclear! Particularly this quote from Termination Of Pods:
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly can continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
So, if someone can guide us on the following questions :
1.) When a pod is shutting down, can it still serve new requests? Or does the load balancer not consider it?
2.) Does it complete the requests it is currently processing until the grace period is exhausted, and then kill the container even if a process is still running?
3.) Also, this mentions replication controllers; what we have is a Deployment, and a Deployment has ReplicaSets, so will there be any difference?
We went through this question but the answers are conflicting without any source : Does a Kubernetes rolling-update gracefully remove pods from a service load balancer
1) When a Pod is shutting down, its state is changed to Terminating and it is no longer considered by the LoadBalancer, as described in the Pod termination docs.
2) Yes - you might want to look at the pod.Spec.TerminationGracePeriodSeconds configuration to gain some control. You'll find details in the API documentation
3) No - the ReplicaSet and the Deployment take care of scheduling Pods, there's no difference when it comes to the shutdown behaviour of the Pods

Does Kubernetes support connection draining?

Does Kubernetes support connection draining?
For example, my deployment rolls out a new version of my web app container.
In connection draining mode Kubernetes should spin up a new container from the new image and route all new traffic coming to my service to this new instance. The old instance should remain alive long enough to send a response for existing connections.
Kubernetes does support connection draining, but how it happens is controlled by the Pods, and is called graceful termination.
Graceful Termination
Let's take an example of a set of Pods serving traffic through a Service. This is a simplified example, the full details can be found in the documentation.
The system (or a user) notifies the API that the Pod needs to stop.
The Pod is set to the Terminating state. This removes it from a Service serving traffic. Existing connections are maintained, but new connections should stop as soon as the load balancers recognize the change.
The system sends SIGTERM to all containers in the Pod.
The system waits terminationGracePeriodSeconds (default 30s), or until the Pod completes on its own.
If containers in the Pod are still running, they are sent SIGKILL and terminated immediately. At this point the Pod is forcefully terminated if it is still running.
This not only covers the simple termination case; the exact same process is used in rolling update deployments, where each Pod is terminated in the same way and is given the opportunity to clean up.
Using Graceful Termination For Connection Draining
If you do not handle SIGTERM in your app, your Pods will terminate immediately, because the default action of SIGTERM is to end the process right away; the grace period is then not used, since the Pod exits on its own.
If you need "connection draining", this is the basic way you would implement it in Kubernetes:
Handle the SIGTERM signal, and clean up your connections in a way that your application decides. This may simply be "do nothing" to allow in-flight connections to clear out. Long running connections may be terminated in a way that is (more) friendly to client applications.
Set the terminationGracePeriodSeconds long enough for your Pod to clean up after itself (a minimal sketch combining both points follows).
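Putting both points together, a minimal sketch (name, image, and timing are illustrative) might look like this; the application itself is still responsible for handling SIGTERM:

apiVersion: v1
kind: Pod
metadata:
  name: my-web-app
spec:
  terminationGracePeriodSeconds: 120   # long enough for in-flight connections to drain
  containers:
  - name: web
    image: my-web-image
    # the app must catch SIGTERM, stop accepting new work,
    # and exit once existing connections have drained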
There are some more options which could help to enable a zero downtime deployment. Here's a summary:
1. Pod Graceful Termination
See the answer from Kekoa.
Downside: the Pod directly receives SIGTERM and will not accept any new requests (depending on the implementation).
2. Pre Stop Hook
The Pod receives SIGTERM only after a waiting time and can still accept new requests during that window. You should set terminationGracePeriodSeconds to a value larger than the preStop sleep time.
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash","-c","/bin/sleep 90"]
This is the recommended solution of the Azure Application Gateway Ingress Controller.
Downside: Pod is removed from the list of Endpoints and might not be visible to other pods.
3. Helm Chart Hooks
If you need some cleanup before the Pods are removed from the Endpoints, you need Helm Chart Hooks.
apiVersion: batch/v1
kind: Job
metadata:
  name: graceful-shutdown-mydeployment
  annotations:
    "helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    app.kubernetes.io/name: graceful-shutdown-mydeployment
spec:
  template:
    spec:
      containers:
      - name: graceful-shutdown
        image: ...
        command: ...
      restartPolicy: Never
  backoffLimit: 0
For more details see https://stackoverflow.com/a/66733416/1909531
Downside: Helm required
Full lifecycle
Here's how these options are executed in the pod's lifecycle.
Execute Helm Chart Hook
Set Pod to Terminating State + Executing Pre Stop Hook
Send SIGTERM
Wait terminationGracePeriodSeconds
Stop Pod by sending SIGKILL, if containers are still running.
Choosing an option
I would first try 1. terminationGracePeriodSeconds, then 2. the Pre Stop Hook, and lastly 3. the Helm Chart Hook, since the complexity rises in that order.
No, Deployments do not support connection draining per se. Draining connections effectively happens as old pods stop and new pods start; clients connected to old pods have to reconnect to new pods. Since clients connect to the Service, this is all transparent to them. You do need to ensure that your application can handle different versions running concurrently, but that is a good idea anyway, as it minimises downtime in upgrades and allows you to perform things like A/B testing.
There are a couple of different strategies that let you tweak how your upgrades take place; Deployments support two update strategies, Recreate and RollingUpdate.
With Recreate, old pods are stopped before new pods are started. This leads to a period of downtime but ensures that all clients connect to either the old or the new version - there will never be a time when both old & new pods are servicing clients at the same time. If downtime is acceptable to you then this may be an option to you.
Most of the time, however, downtime is unacceptable for a service so RollingUpdate is more appropriate. This starts up new pods & as it does so it stops old pods. As old pods are stopped, clients connected to them have to reconnect. Eventually there will be no old pods & all clients will have reconnected to new pods.
While there is no option to do connection draining as you suggest, you can configure the rolling update via maxUnavailable and maxSurge options. From http://kubernetes.io/docs/user-guide/deployments/#rolling-update-deployment:
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). The absolute number is calculated from percentage by rounding up. This can not be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. By default, a fixed value of 1 is used.
For example, when this value is set to 30%, the old Replica Set can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old Replica Set can be scaled down further, followed by scaling up the new Replica Set, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created above the desired number of Pods. Value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). This can not be 0 if MaxUnavailable is 0. The absolute number is calculated from percentage by rounding up. By default, a value of 1 is used.
For example, when this value is set to 30%, the new Replica Set can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods do not exceed 130% of desired Pods. Once old Pods have been killed, the new Replica Set can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
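For example, a Deployment strategy matching the 30% figures quoted above might be sketched as follows (the name and image are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 30%
      maxSurge: 30%
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: web
        image: my-web-image:v2    # the new version being rolled out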
Hope that helps.