How to respond with 503 error code in Kubernetes load balancer

I have a Google Cloud Load Balancer-backed Ingress in my Google Kubernetes Engine cluster. I have an autoscaler set up to scale the number of replicas of my deployment based on CPU usage. Let's say I have set the CPU threshold to 50%.
When there is a burst of requests, the CPU usage goes to 100%. The autoscaler takes a few minutes to notice the high load, create more pods, create new nodes if necessary, and pass health checks. During this scaling period, some or even most requests fail with a 502 error due to timeouts. I would rather return a 503 error code immediately if the server is under heavy load instead of returning a 502 error code after the 30 second timeout.
Is it possible to have the load balancer direct traffic to the pods with the lowest CPU usage? Is it possible to return a 503 error code if none of the pods have a CPU usage below a certain threshold, say 80%?
What is standard practice for handling a large burst of traffic, and how should I go about resolving this issue in Kubernetes?

The first problem you are describing (serving 503) is called "load shedding". Normally it is the responsibility of the application to say: "oops, I'm overloaded, 503, slow down". If you move this responsibility to the client, it may react too slowly to provide any reasonable protection - its data will always be behind. From a system reliability point of view, it's better to keep this logic in the server application.
The second problem is CPU-aware load balancing. One possible approach to this problem is weighted round-robin - it's like regular round-robin, but preferring less loaded nodes. If you install Istio in Kubernetes, you can choose from a list of load-balancing policies. One of them is weighted least request - it relies on the number of in-flight requests rather than directly on CPU, but if all your requests have roughly the same CPU cost, it can be a good proxy for CPU load.
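For illustration, assuming Istio is installed and the workload is exposed through a Kubernetes Service named my-service (a placeholder), a DestinationRule roughly like this selects that policy; depending on the Istio version, the value may be spelled LEAST_REQUEST or, in older releases, LEAST_CONN:

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: my-service                              # hypothetical name
    spec:
      host: my-service.default.svc.cluster.local    # assumed Service host
      trafficPolicy:
        loadBalancer:
          simple: LEAST_REQUEST                     # prefer pods with fewer in-flight requests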

Another possible solution is to use the Istio circuit breaker. You can configure how many concurrent requests may be made to your services, and you can also use outlier detection, which detects failing instances and ejects them from the load-balancing pool, improving the experience for your users.
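A minimal sketch of such a circuit breaker, again assuming a Service named my-service; the concrete limits are placeholders, and the outlier-detection field names vary a little between Istio versions (older releases use consecutiveErrors rather than consecutive5xxErrors):

    apiVersion: networking.istio.io/v1beta1
    kind: DestinationRule
    metadata:
      name: my-service-circuit-breaker              # hypothetical name
    spec:
      host: my-service.default.svc.cluster.local    # assumed Service host
      trafficPolicy:
        connectionPool:
          tcp:
            maxConnections: 100                     # cap on concurrent TCP connections
          http:
            http1MaxPendingRequests: 10             # requests queued beyond this are rejected
            maxRequestsPerConnection: 1
        outlierDetection:
          consecutive5xxErrors: 5                   # eject a pod after 5 consecutive 5xx responses
          interval: 30s
          baseEjectionTime: 30s
          maxEjectionPercent: 50

Requests that overflow the pending-request queue are rejected quickly with a 503 instead of waiting for the backend to time out, which is closer to the behaviour you asked for.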

Related

involuntary disruptions / SIGKILL handling in microservice following saga pattern

Should I engineer my microservice to handle involuntary disruptions like hardware failure?
Are these disruptions frequent enough to be worth handling in a service running on an AWS-managed EKS cluster?
Should I consider a design change in the service to handle an unexpected SIGKILL, with methods like persisting the data at each step, or would that be considered over-engineering?
What standard way would you suggest for handling these involuntary disruptions if it is
a) a RESTful service that typically responds in 1s (follows the saga pattern).
b) a service that processes a big 1 GB file in 1 hour.
There are a couple of ways to handle those disruptions. As mentioned here:
Here are some ways to mitigate involuntary disruptions:
Ensure your pod requests the resources it needs.
Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster).
The frequency of voluntary disruptions varies.
So:
if your budget allows it, spread your app across zones or racks; you can use node affinity to schedule Pods on certain nodes,
make sure to configure replicas, which ensures that when one Pod receives a SIGKILL the load is automatically directed to another Pod (see the example Deployment after this list). You can read more about this here.
consider using DaemonSets, which ensure each Node runs a copy of the Pod.
use Deployments for stateless apps and StatefulSets for stateful ones.
the last thing you can do is write your app to be disruption tolerant.
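As an illustration only (the names, image and values are placeholders, not taken from your service), a Deployment combining replicas with a soft anti-affinity rule that spreads Pods across zones could look like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: saga-service                             # hypothetical name
    spec:
      replicas: 3                                    # survive the loss of a single Pod or node
      selector:
        matchLabels:
          app: saga-service
      template:
        metadata:
          labels:
            app: saga-service
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchLabels:
                      app: saga-service
                  topologyKey: topology.kubernetes.io/zone   # spread replicas across zones
          containers:
          - name: saga-service
            image: example/saga-service:latest       # hypothetical image
            resources:
              requests:                              # request what the Pod actually needs
                cpu: 250m
                memory: 256Mi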
I hope that clears things up a little bit for you; feel free to ask more questions.

Performance of .net GC in Kubernetes pod without memory limit

I'm checking on a scaling issue and we suspect it has something to do with memory, but after running a load test on a local machine it doesn't seem to have a memory leak.
We are hosting the .NET Core application in Kubernetes, with a resource request of 800Mi of memory and no limit.
As described in this article:
The trigger for garbage collection occurs when the system has low physical memory and gets a notification from the OS.
So does that mean the GC is unlikely to kick in until my nodes are low on memory if we did not set a memory limit, and that it will eventually occupy most of the memory on the node?
Yes, that's exactly what can happen, both with .NET and other pods.
Always set memory and CPU limits, as this may have an impact on other pods, or Configure Default Memory Requests and Limits for a Namespace.
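For illustration, a container spec with both a request and a limit could look like the snippet below; the 800Mi request mirrors the value from the question, while the limit values are placeholders you would size from your own measurements:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dotnet-app                     # hypothetical name
    spec:
      containers:
      - name: dotnet-app
        image: example/dotnet-app:latest   # hypothetical image
        resources:
          requests:
            memory: "800Mi"                # what the scheduler reserves for the Pod
            cpu: "500m"
          limits:
            memory: "1Gi"                  # hard cap; exceeding it gets the container OOM-killed
            cpu: "1"

Recent .NET Core runtimes are cgroup-aware, so once a memory limit is in place the GC sizes its heap against that limit rather than the node's total memory.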
#Martin is right but I would like to provide some more insight on this topic.
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
Requests and limits
Resource types
Resource requests and limits of Pod and Container
Resource units in Kubernetes
How Pods with resource requests are scheduled
How Pods with resource limits are run, etc.
Bear in mind that it is very important to have a good strategy when calculating how many resources each container needs.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
And finally, you can use the metrics-server to get the CPU and memory usage of the pods.

Request buffering in Kubernetes clusters

This is a purely theoretical question. A standard Kubernetes cluster is given, with autoscaling in place. If memory goes above a certain targetMemUtilizationPercentage, then a new pod is started and it takes on the flow of requests that is coming to the contained service. The number of minReplicas is set to 1 and the number of maxReplicas is set to 5.
What happens when the number of pods that are online reaches the maximum (5 in our case) and requests from clients are still coming towards the node? Are these requests buffered somewhere or are they discarded? Can I take any actions to avoid request loss?
Natively, Kubernetes does not support message-queue buffering. Depending on the scenario and setup you use, your requests will most likely time out. To manage them efficiently you'll need a custom resource running inside the Kubernetes cluster.
In such situations it is very common to use a message broker, which ensures that communication between microservices is reliable and stable, that the messages are managed and monitored within the system, and that messages don't get lost.
RabbitMQ, Kafka and Redis appear to be the most popular, but choosing the right one will heavily depend on your requirements and the features you need.
It is also worth noting, since Kubernetes essentially runs on Linux, that Linux itself also manages/limits the requests coming in on a socket. You may want to read more about it here.
Another thing is that if you have pod limits set, or a lack of resources, it is quite likely that pods will be restarted or the cluster will become unstable. Usually you can prevent this by configuring some kind of "circuit breaker" to limit the amount of requests that can reach the backend without overloading it. If the number of requests goes beyond the circuit-breaker threshold, the excess requests will be dropped.
It is better to drop some requests than to have a cascading failure.
I managed to test this scenario, and I get 503 Service Unavailable and 403 Forbidden on the requests that do not get processed.
Knative Serving actually does exactly this: https://github.com/knative/serving/
It buffers requests and informs autoscaling decisions based on in-flight request counts. It can also enforce a per-Pod maximum of in-flight requests and hold on to requests until newly scaled-up Pods come up, at which point Knative proxies the requests to them; it does this through a container named queue-proxy that runs as a sidecar to its workload type called "Service".
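A minimal sketch of a Knative Service that caps in-flight requests per Pod (the name and image are placeholders):

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: my-service                     # hypothetical name
    spec:
      template:
        spec:
          containerConcurrency: 10         # hard cap on in-flight requests per Pod; excess waits in queue-proxy
          containers:
          - image: example/my-app:latest   # hypothetical image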

In Kubernetes, how can I scale a Deployment to zero when idle

I'm running a fairly resource-intensive service on a Kubernetes cluster to support CI activities. Only a single replica is needed, but it uses a lot of resources (16 CPUs), and it's only needed during work hours generally (weekdays, 8am-6pm roughly). My cluster runs in a cloud and is set up with instance autoscaling, so if this service is scaled to zero, that instance can be terminated.
The service is third-party code that cannot be modified (well, not easily). It's a fairly typical HTTP service other than that its work is fairly CPU intensive.
What options exist to automatically scale this Deployment down to zero when idle?
I'd rather not set up a schedule to scale it up/down during working hours, because occasionally CI activities are performed outside of the normal hours. I'd like the scaling to be dynamic (for example, scale to zero when idle for more than 30 minutes, or scale to one when an incoming connection arrives).
Actually, Kubernetes supports scaling to zero only by means of an API call, since the Horizontal Pod Autoscaler only supports scaling down to 1 replica.
Anyway, there are a few operators which allow you to overcome that limitation by intercepting the requests coming to your pods or by inspecting some metrics.
You can take a look at Knative or Keda.
They enable your application to be serverless and they do so in different ways.
Knative, by means of Istio, intercepts the requests and, if there's an active pod able to serve them, redirects the incoming request to that one; otherwise it triggers a scale-up.
By contrast, Keda best fits an event-driven architecture, because it is able to inspect predefined metrics, such as lag, queue length or custom metrics (collected from Prometheus, for example) and trigger the scaling.
Both support scale to zero when predefined conditions are met within an equally predefined window.
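For example, a KEDA ScaledObject roughly like the one below scales a Deployment between 0 and 1 replicas based on a Prometheus query; every name, the query and the thresholds here are assumptions to replace with your own, and the exact trigger fields depend on your KEDA version:

    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: ci-service-scaler              # hypothetical name
    spec:
      scaleTargetRef:
        name: ci-service                   # hypothetical Deployment to scale
      minReplicaCount: 0                   # allow scale to zero when the trigger is idle
      maxReplicaCount: 1
      cooldownPeriod: 1800                 # ~30 minutes of inactivity before scaling to zero
      triggers:
      - type: prometheus
        metadata:
          serverAddress: http://prometheus.monitoring.svc:9090                    # assumed Prometheus address
          query: sum(rate(http_requests_total{service="ci-service"}[5m]))         # hypothetical metric
          threshold: "1"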
Hope it helped.
I ended up implementing a custom solution: https://github.com/greenkeytech/zero-pod-autoscaler
Compared to Knative, it's more of a "toy" project, fairly small, and has no dependency on Istio. It's been working well for my use case, though I do not recommend others use it without being willing to adopt the code as their own.
There are a few ways this can be achieved; possibly the most "native" way is using Knative with Istio. Kubernetes by default allows you to scale to zero, but you need something that can broker the scale-up events based on an "input event", essentially something that supports an event-driven architecture.
You can take a look at the official documents here: https://knative.dev/docs/serving/configuring-autoscaling/
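As a sketch, the scale bounds are set with revision annotations; the annotation key spelling has changed across Knative versions (minScale in older releases, min-scale in newer docs), so check the page above for your version. The name and image here are placeholders:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: ci-tool                                  # hypothetical name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/min-scale: "0"   # allow scaling to zero when idle
            autoscaling.knative.dev/max-scale: "1"
        spec:
          containers:
          - image: example/ci-tool:latest            # hypothetical image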
The horizontal pod autoscaler currently doesn’t allow setting the minReplicas field to 0, so the autoscaler will never scale down to zero, even if the pods aren’t doing anything. Allowing the number of pods to be scaled down to zero can dramatically increase the utilization of your hardware.
When you run services that get requests only once every few hours or even days, it doesn’t make sense to have them running all the time, eating up resources that could be used by other pods.
But you still want to have those services available immediately when a client request comes in.
This is known as idling and un-idling. It allows pods that provide a certain service to be scaled down to zero. When a new request comes in, the request is blocked until the pod is brought up and then the request is finally forwarded to the pod.
Kubernetes currently doesn’t provide this feature yet, but it will eventually.
Based on the documentation, it does not support minReplicas=0 so far; see this thread: https://github.com/kubernetes/kubernetes/issues/69687. To set up the HPA properly you can use this formula to work out the required number of pods:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
You can also set up the HPA based on Prometheus metrics; follow this link:
https://itnext.io/horizontal-pod-autoscale-with-custom-metrics-8cb13e9d475
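To make that concrete, a CPU-based HPA could look like the manifest below; the names are placeholders, and older clusters expose this under autoscaling/v2beta2 rather than autoscaling/v2:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-app-hpa                     # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app                       # hypothetical Deployment
      minReplicas: 1                       # 0 is not accepted here, hence no scale to zero
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 50         # target average CPU utilization across pods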

How to automatically scale the number of pods based on load?

We have a service which is fairly idle most of the time, hence it would be great for us if we could delete all the pods when the service has not received any requests for, say, 30 minutes, and the next time a new request comes in Kubernetes would create the first pod and process the request.
Is it possible to set the min pod instance count to 0?
I found that currently Kubernetes does not support this; is there a way I can achieve it?
This is not supported in Kubernetes the way it's supported by web servers like nginx or apache, or app servers like puma, passenger, gunicorn, unicorn, or even Google App Engine Standard, where processes can be soft-started and then brought up the moment the first request comes in; the downside is that your first requests will always be slower. (There may have been some rationale behind Kubernetes pods not behaving this way, and I can see it requiring a lot of design changes, or a new type of workload, for this very specific case.)
If a pod is sitting idle it will not be consuming many resources. You could tweak the values of your pod's resource request/limit so that you request a small amount of CPU/memory and set the limit to a higher amount of CPU/memory. The upside of having a pod always running is that, in theory, your first requests will never have to wait a long time for a response.
Yes, you can achieve that using the Horizontal Pod Autoscaler.
See an example in the Horizontal Pod Autoscaler Walkthrough.