Does Kubernetes support connection draining?

Does Kubernetes support connection draining?
For example, my deployment rolls out a new version of my web app container.
In connection draining mode Kubernetes should spin up a new container from the new image and route all new traffic coming to my service to this new instance. The old instance should remain alive long enough to send a response for existing connections.

Kubernetes does support connection draining, but how it happens is controlled by the Pods, and is called graceful termination.
Graceful Termination
Let's take an example of a set of Pods serving traffic through a Service. This is a simplified example; the full details can be found in the documentation.
1. The system (or a user) notifies the API server that the Pod needs to stop.
2. The Pod is set to the Terminating state. This removes it from any Service serving traffic. Existing connections are maintained, but new connections should stop as soon as the load balancers recognize the change.
3. The system sends SIGTERM to all containers in the Pod.
4. The system waits terminationGracePeriodSeconds (default 30s), or until the Pod completes on its own.
5. If containers in the Pod are still running after the grace period, they are sent SIGKILL and terminated immediately. At this point the Pod is forcefully terminated.
This covers not only the simple termination case; the same process is used during rolling update deployments, where each Pod is terminated in the same way and given the opportunity to clean up.
Using Graceful Termination For Connection Draining
If you do not handle SIGTERM in your app, your Pods will terminate immediately: the default action of SIGTERM is to terminate the process, and the grace period is not used because the Pod exits on its own.
If you need "connection draining", this is the basic way you would implement it in Kubernetes:
Handle the SIGTERM signal, and clean up your connections in a way that your application decides. This may simply be "do nothing" to allow in-flight connections to clear out. Long running connections may be terminated in a way that is (more) friendly to client applications.
Set the terminationGracePeriodSeconds long enough for your Pod to clean up after itself.
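For illustration, here is a minimal sketch of such a Deployment, assuming a hypothetical my-web-app image whose process traps SIGTERM and drains its own connections:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      # Give the app time to finish in-flight requests after SIGTERM (step 2 above).
      terminationGracePeriodSeconds: 60
      containers:
      - name: my-web-app
        image: my-web-app:latest   # hypothetical image whose process handles SIGTERM (step 1 above)
        ports:
        - containerPort: 8080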

There are some more options that could help to enable a zero-downtime deployment. Here's a summary:
1. Pod Graceful Termination
See the answer from Kekoa above.
Downside: the Pod directly receives SIGTERM and will not accept any new requests (depending on the implementation).
2. Pre Stop Hook
The Pod receives SIGTERM only after a waiting time and can still accept new requests in the meantime. You should set terminationGracePeriodSeconds to a value larger than the preStop sleep time.
lifecycle:
  preStop:
    exec:
      command: ["/bin/bash","-c","/bin/sleep 90"]
This is the recommended solution of the Azure Application Gateway Ingress Controller.
Downside: Pod is removed from the list of Endpoints and might not be visible to other pods.
3. Helm Chart Hooks
If you need some cleanup before the Pods are removed from the Endpoints, you need Helm Chart Hooks.
apiVersion: batch/v1
kind: Job
metadata:
  name: graceful-shutdown-mydeployment
  annotations:
    "helm.sh/hook": pre-delete,pre-upgrade,pre-rollback
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
  labels:
    app.kubernetes.io/name: graceful-shutdown-mydeployment
spec:
  template:
    spec:
      containers:
      - name: graceful-shutdown
        image: ...
        command: ...
      restartPolicy: Never
  backoffLimit: 0
For more details see https://stackoverflow.com/a/66733416/1909531
Downside: Helm required
Full lifecycle
Here's how these options are executed in the pod's lifecycle.
1. Execute Helm Chart Hooks
2. Set the Pod to the Terminating state and execute the Pre Stop Hook
3. Send SIGTERM
4. Wait terminationGracePeriodSeconds
5. Stop the Pod by sending SIGKILL, if containers are still running
Choosing an option
I would first try 1. Pod Graceful Termination (terminationGracePeriodSeconds), then 2. Pre Stop Hook, and at last 3. Helm Chart Hooks, as the complexity rises.

No, Deployments do not support connection draining per se. Connection draining effectively happens as old pods stop and new pods start: clients connected to old pods have to reconnect to new pods. Since clients connect to the Service, this is all transparent to them. You do need to ensure that your application can handle different versions running concurrently, but that is a good idea anyway, as it minimises downtime in upgrades and allows you to perform things like A/B testing.
There are a couple of different strategies that let you tweak how your upgrades take place: Deployments support two update strategies, Recreate and RollingUpdate.
With Recreate, old pods are stopped before new pods are started. This leads to a period of downtime but ensures that all clients connect to either the old or the new version - there will never be a time when both old and new pods are servicing clients at the same time. If downtime is acceptable, then this may be an option for you.
Most of the time, however, downtime is unacceptable for a service, so RollingUpdate is more appropriate. This starts up new pods and, as it does so, stops old pods. As old pods are stopped, clients connected to them have to reconnect. Eventually there will be no old pods and all clients will have reconnected to new pods.
While there is no option to do connection draining as you suggest, you can configure the rolling update via maxUnavailable and maxSurge options. From http://kubernetes.io/docs/user-guide/deployments/#rolling-update-deployment:
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). The absolute number is calculated from percentage by rounding up. This can not be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. By default, a fixed value of 1 is used.
For example, when this value is set to 30%, the old Replica Set can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old Replica Set can be scaled down further, followed by scaling up the new Replica Set, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created above the desired number of Pods. Value can be an absolute number (e.g. 5) or a percentage of desired Pods (e.g. 10%). This can not be 0 if MaxUnavailable is 0. The absolute number is calculated from percentage by rounding up. By default, a value of 1 is used.
For example, when this value is set to 30%, the new Replica Set can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods do not exceed 130% of desired Pods. Once old Pods have been killed, the new Replica Set can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
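As a rough sketch, this is where those two fields sit in a Deployment spec (the names and values below are only placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below the desired replica count
      maxSurge: 30%       # allow extra Pods to be created during the rollout
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: my-web-app
        image: my-web-app:v2   # placeholder image
With 10 replicas, these values mean the rollout never drops below the desired count and creates at most 3 extra Pods at a time.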
Hope that helps.

Related

Kubernetes: how to maintain session affinity during an HPA scale-down event

I have deployed my application in Kubernetes using a Deployment.
Whenever a user logs in to the application, the pod generates a session for that user.
To maintain session stickiness I have set a session cookie using NGINX ingress annotations.
When the HPA scales down pods, users face a logout problem when a pod gets terminated: if the ingress pinned a session to that pod, the user needs to log in again.
What I want is some sort of graceful termination of the connection: when a pod is in the Terminating state, it should keep serving existing sessions until the grace period ends.
What I want is some sort of graceful termination of the connection: when a pod is in the Terminating state, it should keep serving existing sessions until the grace period ends.
You can use the terminationGracePeriodSeconds key in the Pod spec: Kubernetes will wait for the specified number of seconds before the Pod gets forcefully terminated.
You can read more at : https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
The answer from Harsh Manvar is great; however, I want to expand on it a bit :)
You can of course use terminationGracePeriodSeconds in the Pod spec. Look at the example YAML:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
  terminationGracePeriodSeconds: 60
At this point, Kubernetes waits for a specified time called the termination grace period. By default, this is 30 seconds. It's important to note that the grace-period countdown runs in parallel with the preStop hook and the SIGTERM signal; Kubernetes does not pause the countdown while the preStop hook runs.
If your app finishes shutting down and exits before the terminationGracePeriod is done, Kubernetes moves to the next step immediately.
If your pod usually takes longer than 30 seconds to shut down, make sure you increase the grace period. You can do that by setting the terminationGracePeriodSeconds option in the Pod YAML, as in the example above, which raises it to 60 seconds.
If you want to know exactly what the pod lifecycle looks like, see the official documentation. The part about the termination of Pods is the most interesting; it also describes exactly how termination takes place.
It is recommended that applications deployed on Kubernetes have a design that complies with widely accepted standards.
One set of standards for modern, cloud-based applications is known as the Twelve-Factor App:
Twelve-factor processes are stateless and share-nothing. Any data that needs to persist must be stored in a stateful backing service, typically a database.
Some web systems rely on “sticky sessions” – that is, caching user session data in memory of the app’s process and expecting future requests from the same visitor to be routed to the same process. Sticky sessions are a violation of twelve-factor and should never be used or relied upon. Session state data is a good candidate for a datastore that offers time-expiration, such as Memcached or Redis.

K8s - Schedule new pod before the old one is terminated

I have read up on the Kubernetes docs but I'm unable to get a clear answer on my question. I'm using the official cluster-autoscaler.
I have a deployment that specifies that one replica should be running. When a pod is terminated (for example, because it was running on a node that is being scaled down), is the new pod scheduled before the termination begins or after the termination is done? The docs say that scheduling happens during termination, but don't mention at which phase.
To achieve seamless node scale-down without disruption to any services, I would expect k8s to scale up the pods to N+1 replicas (at this point pods are scheduled only to nodes that are not scaling down) and then drain the node. Based on my testing, it first drains and then schedules any missing pods based on the configuration. Is it possible to configure this behaviour, or is this currently not possible to do?
From what I understand, seamless updates are easy with the RollingUpdate strategy. I have not found a similar "rolling" strategy to be possible for scale-down.
EDIT
TL;DR I'm looking for HA on a) a deployment with two or more replicas and b) a deployment with one replica.
a) Can be achieved by using PDBs. Check out Fritz's answer. If you need pods scheduled on different nodes, leverage anti-affinity (Marc's answer).
b) If you're okay with a short disruption, a PDB is the official way to go. If you need a workaround, my answer can serve as inspiration.
The scale-down behavior can be configured with what is called a PodDisruptionBudget.
Alongside your Deployment manifest, you can define a PodDisruptionBudget that specifies the maxUnavailable or minAvailable number of Pods during voluntary disruptions like draining nodes.
For how to do it, check out the K8s Documentation.
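For example, a minimal PodDisruptionBudget sketch (name and label are placeholders); it is a separate object that selects the Deployment's Pods by label:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1        # keep at least one Pod running during voluntary disruptions
  selector:
    matchLabels:
      app: my-app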
Below are some insights; I hope this will help:
If you use a Deployment, then the controller checks that you always have the desired number of replicas running. No less, no more. So when you kill a node (which has one of your replicas), the new pod will be scheduled after the termination of one of your original replicas. It's up to you to anticipate whether it is planned maintenance.
If you have lots of nodes (meaning more than one) and want to achieve HA (high availability) for your deployments, then you should have a look at pod affinity/anti-affinity. You can find out more in the official docs.
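As a rough sketch of what that could look like (names, labels, and image are placeholders), preferred anti-affinity spreads the replicas across nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname   # prefer placing replicas on different nodes
      containers:
      - name: my-app
        image: my-app:latest   # placeholder image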
I hate to answer my own question, but an easy solution for a high-availability service with only one pod (without wasting resources on an idle replica) is to use a preStop hook (to make the termination blocking if proper SIGTERM handling is not implemented) together with a terminationGracePeriodSeconds that gives the replacement enough time to start.
Contrary to what has been said here, scheduling happens while the pod is terminating. After a quick test (which I should have done together with reading the docs), where I created a busybox (sh sleep 3600) deployment with one replica and terminationGracePeriodSeconds set to 240 seconds, I saw the following:
When I deleted the pod, it entered the Terminating state and stayed in that state for 240 seconds. Immediately after the pod was marked as Terminating, a new pod was scheduled to replace it.
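Reconstructed from the description above, the test deployment would have looked roughly like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox-test
  template:
    metadata:
      labels:
        app: busybox-test
    spec:
      terminationGracePeriodSeconds: 240   # keeps the old pod in Terminating for up to 240s
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]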
So the previous pod has time to finish whatever it is doing and the other one can seamlessly take its place.
I haven't tested how the networking behaves, since the LB will stop sending new requests, but I assume the downtime will be much lower than it would be without terminationGracePeriodSeconds set higher than the default.
Beware that this is not official by any means, but it serves as a workaround for my use case.

Can we have --pod-eviction-timeout=300m?

I have a k8s cluster. In our cluster we do not want pods to get evicted, because pod eviction causes a lot of side effects for the applications running on them.
To prevent pod eviction from happening, we have configured all the pods with the Guaranteed QoS class. I know that even with this, pod eviction can happen if there is resource starvation in the system. We have monitors to alert us when there is resource starvation within a pod or node, so we find out well before a pod gets evicted. This helps us take measures before a pod gets evicted.
The other reason for pod eviction is the node going into the not-ready state: the kube-controller-manager checks pod-eviction-timeout and evicts the pods after this timeout. We have a monitor to alert us when a node goes into the not-ready state. After this alert we want to take measures to clean up on the application side, so the application can end gracefully. This clean-up needs more than a few hours, but pod-eviction-timeout is 5 minutes by default.
Is it fine to increase the pod eviction timeout to 300m? What are the impacts of increasing this timeout to such a limit?
P.S.: I know that during this wait time, if the pod uses more resources, the kubelet itself can evict the pod. I want to know what other impacts waiting for such a long time has.
As @coderanger said, your limits are incorrect, and that should be fixed instead of lowering the self-healing capabilities of Kubernetes.
If your pod dies, no matter what the issue with it was, it will by default be rescheduled based on your configuration.
If you are having a problem with this, then I would recommend redoing your architecture and rewriting the app to use Kubernetes the way it's supposed to be used.
if a pod is still being sent requests when it is unresponsive, you should put a load balancer in front or queue the requests,
if you have a problem with IPs changing after pod restarts, this should be fixed by using DNS and a Service instead of connecting directly to a pod (see the Service sketch after this list),
if your pod is being evicted, check why, and set appropriate limits and requests.
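To illustrate the second point above, a minimal Service sketch (names and ports are placeholders) that gives clients a stable DNS name and virtual IP instead of individual pod IPs:
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app       # matches the Pods behind this Service
  ports:
  - port: 80          # stable port clients connect to
    targetPort: 8080  # container port on the Pods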
As for the node, there is a really nice blog post about Improving Kubernetes reliability: quicker detection of a node down. It is the opposite of what you are thinking of doing, but it also explains why 340s is already too much:
Once the node is marked as unhealthy, the kube controller manager will remove its pods based on --pod-eviction-timeout=5m0s
This is a very important timeout, by default it’s 5m which in my opinion is too high, because although the node is already marked as unhealthy the kube controller manager won’t remove the pods so they will be accessible through their service and requests will fail.
If you still want to change the default values to higher ones, you can look into increasing these:
kubelet: --node-status-update-frequency=10s
controller-manager: --node-monitor-period=5s
controller-manager: --node-monitor-grace-period=40s
controller-manager: --pod-eviction-timeout=5m
If you provide more details I'll try to help more.

Are hitless rolling updates possible on GKE with externalTrafficPolicy: Local?

I have a GKE cluster (1.12.10-gke.17).
I'm running the nginx-ingress-controller with type: LoadBalancer.
I've set externalTrafficPolicy: Local to preserve the source ip.
Everything works great, except during rolling updates. I have maxSurge: 1 and maxUnavailable: 0.
My problem is that during a rolling update, I start getting request timeouts. I suspect the Google load balancer is still sending requests to the node where the pod is Terminating even though the health checks are failing. This happens for about 30-60s starting right when the pod changes from Running to Terminating. Everything stabilizes after a while and traffic eventually goes only to the new node with the new pod.
If the load balancer is slow to stop sending requests to a terminating pod, is there some way to make these rolling deploys hitless?
My understanding is that in a normal k8s Service, where externalTrafficPolicy is not Local, the Google load balancer simply sends requests to all nodes and lets the iptables rules sort it out. When a pod is Terminating, the iptables rules are updated quickly and traffic does not get sent to that pod anymore. In the case where externalTrafficPolicy is Local, however, if the node that receives the request does not have a Running pod, then the request times out, which is what is happening here.
If this is correct, then I only see two options
1. stop sending requests to the node with a Terminating pod
2. continue servicing requests even though the pod is Terminating
I feel like option 1 is difficult since it requires informing the load balancer that the pod is about to start Terminating.
I've made some progress on option 2, but so far haven't gotten it working. I've managed to continue serving requests from the pod by adding a preStop lifecycle hook which just runs sleep 60, but I think the problem is that the healthCheckNodePort reports localEndpoints: 0, and I suspect something is blocking the request between arriving at the node and getting to the pod. Perhaps the iptables rules aren't routing when localEndpoints: 0.
I've also adjusted the Google load balancer health check, which is different from the readinessProbe and livenessProbe, to the "fastest" settings possible, e.g. a 1s interval and a failure threshold of 1, and I've verified that the load balancer backend (a.k.a. the k8s node) indeed fails health checks quickly, but it continues to send requests to the terminating pod anyway.
There is a similar discussion here. Although it's not identical, it's a similar use case.
Everything sounds like it is working as expected.
The LoadBalancer will send traffic to any healthy node based on the LoadBalancer health check. The LoadBalancer is unaware of individual pods.
The health check will mark a node as unhealthy once the health check threshold is crossed, i.e. an HC is sent every x seconds with an x-second timeout and a threshold of x failed requests. This causes a delay between the time the pod goes into Terminating and the time the node is marked as unhealthy.
Also note that once the pod is marked as notReady, it is removed from the Service endpoints. If there is no other pod on the node, traffic will continue reaching that node (because of the HC behaviour explained above), but the requests can't be forwarded because of the externalTrafficPolicy (traffic remains on the node where it was sent).
There are a couple of ways to address this.
1. To minimize the amount of time between a terminated pod and the node being marked as unhealthy, you can set a more aggressive health check. The trouble with this is that an overly sensitive HC may cause false positives, usually increases the overhead on the node (additional health check requests), and it will not fully eliminate the failed requests.
2. Have enough pods running so that there are always at least 2 pods per node. Since the service removes the pod from the endpoints once it goes into notReady, requests will just get sent to the running pod instead. The downside here is that you will either have additional overhead (more pods) or a tighter grouping (more vulnerable to failure). It also won't fully eliminate the failed requests, but they will be incredibly few.
3. Tweak the HC and your container to work together:
3a. Have the HC endpoint be separate from the normal path you use.
3b. Configure the container readinessProbe to match the main path your container serves traffic on (it will be different from the LB HC path).
3c. Configure your image so that when SIGTERM is received, the first thing to go down is the HC path.
3d. Configure the image to gracefully drain all connections once a SIGTERM is received, rather than immediately closing all sessions and connections.
This should mean that ongoing sessions will gracefully terminate which reduces errors.
It should also mean that the node will start failing HC probes even though it is ready to serve normal traffic, this gives time for the node to be marked as unhealthy and the LB will stop sending traffic to it before it is no longer able to serve requests.
The problem with this last option is twofold. First, it is more complex to configure. Second, it means your pods will take longer to terminate, so rolling updates will take longer, as will any other process that relies on gracefully terminating the pod, such as draining the node. The second issue isn't too bad unless you need a quick turnaround.
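To make option 3 a bit more concrete, here is a rough sketch of the Pod-spec side (paths, ports, and timings are hypothetical; the application itself has to implement the SIGTERM behaviour described in 3c and 3d):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 120    # room for the app to drain connections (3d)
      containers:
      - name: web
        image: my-web-image:latest          # hypothetical image that fails its /lb-healthz path on SIGTERM (3c)
        ports:
        - containerPort: 8080
        readinessProbe:                     # probes the normal serving path (3b)
          httpGet:
            path: /
            port: 8080
          periodSeconds: 5
        # The load balancer health check is pointed at a separate /lb-healthz path (3a),
        # which the app takes down first when it receives SIGTERM, then drains connections.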

How does Kubernetes know which pod to kill when downscaling?

Is there a way to tell Kubernetes what pods to kill before or after a downscale? For example, suppose that I have 10 replicas and I want to downscale them to 5, but I want certain replicas to be alive and others to be killed after the downscale. Is that possible?
While it's not possible to selectively choose which pod is killed, you can prevent what you're really concerned about, which is the killing of pods that are in the midst of processing tasks. This requires you to do two things:
Your application should be able to listen for and handle SIGTERM events, which Kubernetes sends to pods before it kills them. In your case, your app would handle SIGTERM by finishing any in-flight tasks then exiting.
You set the terminationGracePeriodSeconds on the pod to something greater than the longest time it takes for the longest task to be processed. Setting this property extends the period of time between k8s sending the SIGTERM (asking your application to finish up), and SIGKILL (forcefully terminating).
As per the link provided by @Matt and @Robert Bailey's answer, ReplicaSet-based resources currently don't support selectively removing specific Pods from the replica pool when scaling. You can find the related issue #45509 and the follow-up PR #75763.
You can use StatefulSets instead of ReplicaSets:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
The Pods will be created sequentially (my-app-0, my-app-1, my-app-2), and when you scale down, they will be terminated in reverse ordinal order, from {N-1..0}.
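For illustration, a minimal StatefulSet sketch (names and image are placeholders); its Pods are created as my-app-0, my-app-1, my-app-2 and removed in reverse ordinal order on scale-down:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app
spec:
  serviceName: my-app        # headless Service that provides the stable network identity
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest   # placeholder image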