Kubernetes HPA and Scaling Down - kubernetes

I have a kubernetes HPA set up in my cluster, and it works as expected scaling up and down instances of pods as the cpu/memory increases and decreases.
The only thing is that my pods handle web requests, so it occasionally scales down a pod that's in the process of handling a web request. The web server never gets a response back from the pod that was scaled down and thus the caller of the web api gets an error back.
This all makes sense theoretically. My question is does anyone know of a best practice way to handle this? Is there some way I can wait until all requests are processed before scaling down? Or some other way to ensure that requests complete before HPA scales down the pod?
I can think of a few solutions, none of which I like:
Add retry mechanism to the caller and just leave the cluster as is.
Don't use HPA for web request pods (seems like it defeats the purpose).
Try to create some sort of custom metric and see if I can get that metric into Kubernetes (e.x https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics)
Any suggestions would be appreciated. Thanks in advance!

Graceful shutdown of pods
You must design your apps to support graceful shutdown. First your pod will receive a SIGTERM signal and after 30 seconds (can be configured) your pod will receive a SIGKILL signal and be removed. See Termination of pods
SIGTERM: When your app receives termination signal, your pod will not receive new requests but you should try to fulfill responses of already received requests.
Design for idempotency
Your apps should also be designed for idempotency so you can safely retry failed requests.

Related

Kubernetes HPA. Settings for right down scale

I use Kubernetes in my project, specially HPA. So, every minute in project we started check-status request for checking if all microservices are available. Availability is defined by simple response from one of replicas (not all) each microservice.
But I have one moment related to HPA. When HPA automatically decides to remove some pods from cluster and my check-status request comes to server at the same time then very often occurs that my API-gateway service push it to deleted pod and doesn't get any response. It means that microservice is unavailable for our server.
My question is what is the best way for setting autoscaler to avoid this cases.
It is not related to HPA in this case but more on how you graceful shut down your pods.
In short, your service/LB is not aware if your pod is ready to accept new requests, so on a SIGTERM signal, your pod should set your readiness probe to false, and give some time for the app to shutdown. If your readiness probe is not healthy, the service won't send new requests to your pod.
Then you can shut it down once all requests have been addressed AND the pod won't receive new requests.
I would advise you of reading these sources:
https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace
https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html

Load balancing onto replicas of pods

We have an AKS cluster and we want to achieve below two points in our architecture:
We have replicas of pods and we want to have only 1 request served by one pod. basically one pod - one request design.
When all pods are busy, then next coming request should not be queued at POD level, instead it should be queued at service level and once any of busy pod become idle or available then only queued request should be dispatched on idle pod.
How to achieve above things?
Generally, this could be achieved by creating a custom proxy that creates pods on demand, but in practice it will be very difficult and performance will be poor. This was very well explained by David Maze in his comment:
You need to write a custom proxy with access to the Kubernetes API that can create new pods on demand; this is not a standard Kubernetes setup. This is also an extremely heavy-weight setup (if it takes tens of seconds to pull and deploy a new pod you can hit HTTP request timeouts very easily) and every Web framework supports handling multiple requests per process.

Are hitless rolling updates possible on GKE with externalTrafficPolicy: Local?

I have a GKE cluster (1.12.10-gke.17).
I'm running the nginx-ingress-controller with type: LoadBalancer.
I've set externalTrafficPolicy: Local to preserve the source ip.
Everything works great, except during rolling updates. I have maxSurge: 1 and maxUnavailable: 0.
My problem is that during a rolling update, I start getting request timeouts. I suspect the Google load balancer is still sending requests to the node where the pod is Terminating even though the health checks are failing. This happens for about 30-60s starting right when the pod changes from Running to Terminating. Everything stabilizes after a while and traffic eventually goes only to the new node with the new pod.
If the load balancer is slow to stop sending requests to a terminating pod, is there some way to make these rolling deploys hitless?
My understanding is that in a normal k8s service, where externalTrafficPolicy is not normal, the Google load balancer simply sends requests to all nodes and let's the iptables sort it out. When a pod is Terminating the iptables are updated quickly and traffic does not get sent to that pod anymore. In the case where externalTrafficPolicy is Local however, if the node that receives the request does not have a Running pod, then the request times out, which is what is happening here.
If this is correct, then I only see two options
stop sending requests to the node with a Terminating pod
continue servicing requests even though the pod is Terminating
I feel like option 1 is difficult since it requires informing the load balancer that the pod is about to start Terminating.
I've made some progress on option 2, but so far haven't gotten it working. I've managed to continue serving requests from the pod by adding a preStop lifecycle hook which just runs sleep 60, but I think the problem is that the healthCheckNodePort reports localEndpoints: 0 and I suspect something is blocking the request between arriving at the node and getting to the pod. Perhaps, the iptables aren't routing when localEndpoints: 0.
I've also adjusted the Google load balancer health check, which is different from the readinessProbe and livenessProbe, to the "fastest" settings possible e.g. 1s interval, 1 failure threshold and I've verified that the load balancer backend aka k8s node, indeed fails health checks quickly, but continues to send requests to the terminating pod anyway.
There is a similar discussion here. Although it's not identical, it's a similar use case.
Everything sounds like it is working as expected.
The LoadBalancer will send traffic to any healthy node based on the LoadBalancer health check. The LoadBalancer is unaware of individual pods.
The health check will mark a node as unhealthy once the health check threshold is crossed, ie HC is sent every x seconds with x timeout delay, x number of failed requests. This causes a delay between the time that the pod goes into terminating and it is marked as unhealthy.
Also note that once the pod is marked as notReady, the pod is removed from the service endpoint. If there is no other pod on a node, traffic will continue reaching this node (because of the HC behaviour explained above), the requests can't be forwarded because of the externalTrafficPolicy (traffic remains on the node where it was sent).
There are a couple of ways to address this.
To minimize the amount of time between a terminated pod and the node being marked as unhealthy, you can set a more aggressive health check. The trouble with this is that an overly sensitive HC may cause false positives, usually increases the overhead on the node (additional health check requests), and it will not fully eliminate the failed requests.
Have enough pods running so that there are always at least 2 pods per node. Since the service removes the pod from the endpoint once it goes into notReady, requests will just get sent to the running pod instead. The downside here is that you will either have additional overhead (more pods) or a tighter grouping (more vulnerable to failure). It also won't fully eliminate the failed requests, but they will be incredibly few.
Tweak the HC and your container to work together:
3a.Have the HC endpoint be separate from the normal path you use.
3b. Configure the container readinessProbe to match the main path your container serves traffic on (it will be different from the LB HC path)
3c. Configure your image so that when SIGTERM is received, the first thing to go down is the HC path.
3d. Configure the image to gracefully drain all connections once a SIGTERM is received rather than immediately closing all sessions and connections.
This should mean that ongoing sessions will gracefully terminate which reduces errors.
It should also mean that the node will start failing HC probes even though it is ready to serve normal traffic, this gives time for the node to be marked as unhealthy and the LB will stop sending traffic to it before it is no longer able to serve requests.
The problem with this last option is 2 fold. First, it is more complex to configure. The other issue is that it means your pods will take longer to terminate so rolling updates will take longer, so will any other process that relies on gracefully terminating the pod such as draining the node. The second issue isn't too bad unless you are in need of a quick turn around.

How to signal "bad" but not "fatal" health check from spring boot to Kubernetes?

What we're looking for is a way for an actuator health check to signal some intention like "I am limping but not dead. If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp."
We have a rest service hosted in clustered Kubernetes containers that periodically call out to fetch fresh data from an external resource. Occasionally we have failures reaching those external resources, and sometimes, but not every time, a restart of the pod will resolve the issue.
The services can operate just fine on possibly stale data. Although we wouldn't want to continue operating on stale data, that's preferable to just going down entirely.
In the interim, we're planning on having a node unilaterally decide not to report any problems through actuator until X amount of time has passed since the last successful sync, but that really only delays the point at which all nodes would still report failure.
In Kubernetes you can use LivenessProbe and ReadinessProbe to let a controller to heal your service, but some situations is better handled with HTTP response codes or alternative degraded service.
LivenessPobe
Use a LivenessProbe to resolve a deadlock situation. When your pod does not respond on a LivenessProbe, it will be killed and a new pod will replace it.
ReadinessProbe
Use a ReadinessProbe when your pod is not prepared for serving requests, e.g. if your pod need to read some files or need to connect to an external service before serving requests.
Fault affecting all replicas
If you have a problem that all your replicas depends on, e.g. an external service is down, then you can not solve it by restarting your pods. You may use an OpsToogle or a circuit breaker in this situation and notifying other services that you are degraded or show a message about temporary error.
For your situations
If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp.
You can not delegate that logic to Kubernetes. Your application need to understand each fault situation, e.g. if an error was a transient network error or if your error will affect all replicas.

Smooth load rebalancing for Kubernetes HPA

I have configured my ingress controller with nginx-ingress hashing and I define HPA for my deployments. When we do load testing we hit a problem on the newly created pods that aren't warmed up enough and while the load balancing shifts immediately target portion of the traffic the latency spikes and service is choking. Is there a way to define some smooth load rebalancing that would rather move the traffic gradually and thus warm up the service in more natural way ?
Here is an example effect we see now:
At glance I see 2 possible reasons for that behaviour:
I think there is a chance that you are facing the same problem as encountered in this question: Some requests fails during autoscaling in kubernetes. In that case, Nginx was sending requests to Pods that were not completely ready. To solve this you can configure a Readiness Probe. Personally, I configure my Readiness Probes to send a http request to a /health endpoint of my services.
There is a chance however that your application naturally performs slowly during the first requests, usually because of caching or some other operation that needs to be done at the beginning of its life. I encountered this problem in a Django+Gunicorn app where the Gunicorn only started my app after the first request. To solve this I used a PostStart Container Hook which sends a request to my app right after the container is created. Here is an example of its use. You may also have a look at this question: Kubernetes Pod warm-up for load balancing.