How to automatically scale the number of pods based on load? - kubernetes

We have a service which is fairly idle most of the time, so it would be great for us if we could delete all the pods when the service has not received any requests for, say, 30 minutes, and then have Kubernetes create the first pod again and process the response when the next request comes in.
Is it possible to set the minimum pod instance count to 0?
I found that Kubernetes does not currently support this. Is there a way I can achieve it?

This is not supported in Kubernetes the way it is supported by web servers like nginx or Apache, or by app servers like Puma, Passenger, Gunicorn or Unicorn, or by Google App Engine Standard, where workers can be kept stopped and then brought up the moment the first request comes in; the downside is that those first requests will always be slower. (There may have been some rationale behind Kubernetes pods not having to behave this way, and I can see it requiring a lot of design changes, or a new type of workload, for this very specific case.)
If a pod is sitting idle it will not be consuming many resources. You can tweak your pod's resource requests and limits so that you request a small amount of CPU/memory and set the limit to a higher amount. The upside of having a pod always running is that, in theory, your first requests will never have to wait a long time for a response.
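As a rough sketch of that request/limit idea, assuming a plain Deployment (the name, image and numbers below are made up, not from the question): request little while idle, allow a burst when a request finally arrives.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mostly-idle-service          # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mostly-idle-service
  template:
    metadata:
      labels:
        app: mostly-idle-service
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest   # placeholder image
          resources:
            requests:
              cpu: 50m          # small reservation while the pod sits idle
              memory: 64Mi
            limits:
              cpu: "1"          # allowed to burst when the first real request arrives
              memory: 256Mi
```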

Yes. You can achieve that using the Horizontal Pod Autoscaler.
See the example in the Horizontal Pod Autoscaler Walkthrough.
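For reference, what that walkthrough builds (its php-apache example) boils down to an HPA object roughly like the one below, shown here in autoscaling/v2 form. Note that minReplicas cannot be set to 0 with the stock HPA, which is why this does not fully answer the scale-to-zero part of the question.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1        # 0 is not accepted by the built-in HPA
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # add pods when average CPU goes above 50%
```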

Related

Kubernetes: how can I set up horizontal pod autoscaling (HPA) on my 2-tier web app?

I have a web app with 2 tiers: WordPress as the frontend and MySQL as the backend.
The frontend is deployed using a Helm chart; the backend is deployed using an Operator.
Since my web app receives a lot of traffic, I would like to implement horizontal pod autoscaling (HPA).
My question:
Where should I define the HPA: on the frontend (i.e. WordPress), on the backend (i.e. MySQL), or both?
Thank you for your help!
You probably can't usefully set up HPA on the database. MySQL and PostgreSQL are usually single nodes, or if they have multiple nodes, they run in an active/standby mode, so adding nodes won't necessarily add capacity. (Even with a clustered database, there can be practical problems with setting up HPA around scale-in.)
You can set up HPA on the application. It's helpful to understand what the limiting factor on your application actually is: if you send enough load that it starts being slow or failing requests, is it starved for CPU time, out of memory, or waiting on the database? That would affect what parameters you'd want to set on the HPA.
One realistic situation is that your application does some database queries, which take the bulk of the time, and then very quickly renders the data to HTML or JSON. In this situation it will be hard to take advantage of HPA: the single-node database is hard to scale, and even if you scale up the application pods, you'll still be blocked on database queries.
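If the frontend does turn out to be CPU-bound, an HPA on the WordPress Deployment could be sketched like this. The Deployment name `wordpress` and all numbers are assumptions; check what your Helm chart actually creates and what your load tests show.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: wordpress
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wordpress          # assumed name of the chart's frontend Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```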

How to avoid parallel requests to a pod belonging to a K8s service?

I have an (internal) K8s deployment (Python, TensorFlow, Gunicorn) with around 60 replicas and an attached K8s service to distribute incoming HTTP requests. Each of these 60 pods can only actually process one HTTP request at a time (because of TensorFlow reasons). Processing a request takes between 1 and 4 seconds. If a second request is sent to a pod while it's still processing one, the second request is just queued in the Gunicorn backlog.
Now I'd like to reduce the probability of that queuing happening as much as possible, i.e., have new requests routed to one of the non-occupied pods as long as such a non-occupied one exists.
Round-robin would not do the trick, because not every request takes the same amount of time to answer (see above).
The Python application itself could make the endpoint used for the ReadinessProbe fail while it's processing a normal request, but as far as I understand, readiness probes are not meant for something that dynamic (K8s would need to poll them multiple times per second).
So how could I achieve the goal?
Can't you implement pub/sub or a message broker in between?
Save the data into a queue; the workers fetch messages from the queue as they have capacity, and the requests get processed.
You can use Redis for creating queues, and pub/sub is also possible using a library. I used one before in Node.js, but the same could be implemented in Python as well.
Ideally a worker (i.e. a subscriber) will be running in each of the 60 replicas.
As soon as you get a request, one application publishes it, and the subscribers continuously work on processing those messages.
We also went one step further and scale the worker count automatically depending on the message count in the queue (see the sketch after this answer).
The library I am using with Node.js: https://github.com/OptimalBits/bull
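One possible way to make the "scale workers on queue length" part concrete, without the Node.js tooling above, is KEDA's Redis list scaler (KEDA only comes up in a later answer here, so treat this as an assumption; the Deployment name `tf-worker`, the Redis address and the list name are all hypothetical):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tf-worker-scaler
spec:
  scaleTargetRef:
    name: tf-worker                 # hypothetical worker Deployment consuming the queue
  minReplicaCount: 1
  maxReplicaCount: 60
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc.cluster.local:6379   # hypothetical Redis service
        listName: inference-jobs                         # hypothetical queue (Redis list) name
        listLength: "5"             # target number of pending items per replica
```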
...kubectl get service shows "TYPE ClusterIP" and "EXTERNAL-IP <none>"
Your K8s Service will be routing requests at random in this case, which is obviously not good for your app. If you would like to stick with kube-proxy, you can switch it to IPVS mode with sed; there is a good article about it. Otherwise, you can consider using an ingress controller like the one mentioned earlier, ingress-nginx, with its "ewma" load-balancing mode.
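If you go the ingress-nginx route, the "ewma" mode is a per-Ingress annotation; a minimal sketch (host, Service name and port are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tf-inference
  annotations:
    nginx.ingress.kubernetes.io/load-balance: "ewma"   # peak-EWMA instead of round-robin
spec:
  ingressClassName: nginx
  rules:
    - host: inference.example.com          # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tf-inference         # placeholder Service name
                port:
                  number: 80
```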

Request buffering in Kubernetes clusters

This is a purely theoretical question. A standard Kubernetes cluster is given, with autoscaling in place. If memory goes above a certain targetMemUtilizationPercentage, then a new pod is started and it takes on part of the flow of requests coming to the contained service. minReplicas is set to 1 and maxReplicas is set to 5.
What happens when the number of pods that are online reaches the maximum (5 in our case) and requests from clients are still coming towards the node? Are these requests buffered somewhere or are they discarded? Can I take any actions to avoid request loss?
Natively, Kubernetes does not support message-queue buffering. Depending on the scenario and setup you use, your requests will most likely time out. To manage them efficiently you'll need a custom resource running inside the Kubernetes cluster.
In these situations it is very common to use a message broker, which ensures that communication between microservices is reliable and stable, that messages are managed and monitored within the system, and that messages don't get lost.
RabbitMQ, Kafka and Redis appear to be the most popular, but choosing the right one will depend heavily on your requirements and the features you need.
It is also worth noting, since Kubernetes essentially runs on Linux, that Linux itself manages/limits the requests coming in on a socket. You may want to read more about it here.
Another thing is that if you have pod limits set, or a lack of resources, it is quite likely that pods will be restarted or the cluster will become unstable. Usually you can prevent this by configuring some kind of "circuit breaker" to limit the amount of requests that can go to the backend without overloading it. If the amount of requests goes beyond the circuit breaker threshold, the excess requests are dropped.
It is better to drop some requests than to have a cascading failure.
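The answer does not name a specific circuit breaker. One common way to get this behaviour, assuming you already run Istio (an assumption, not something the answer states), is a DestinationRule with connection-pool limits and outlier detection; the host and numbers below are made up:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend.default.svc.cluster.local    # hypothetical backend Service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50     # queue at most 50 requests, then reject the excess
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5           # temporarily eject pods that keep failing
      interval: 30s
      baseEjectionTime: 60s
```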
I managed to test this scenario and I get 503 Service Unavailable and 403 Forbidden on my requests that do not get processed.
Knative Serving actually does exactly this. https://github.com/knative/serving/
It buffers requests and informs autoscaling decisions based on in-flight request counts. It can also enforce a per-Pod maximum on in-flight requests and hold on to requests until newly scaled-up Pods come up, at which point Knative proxies the requests to them; it does this via a container named queue-proxy that runs as a sidecar in its workload type called "Service".
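A sketch of what that looks like in practice (the name and image are placeholders; `containerConcurrency` is the hard per-pod cap the answer refers to, and the annotation is the soft autoscaling target):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: buffered-backend               # placeholder name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "10"   # soft target of in-flight requests per pod
    spec:
      containerConcurrency: 20                 # hard cap; excess requests wait in queue-proxy
      containers:
        - image: registry.example.com/backend:latest   # placeholder image
```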

In Kubernetes, how can I scale a Deployment to zero when idle

I'm running a fairly resource-intensive service on a Kubernetes cluster to support CI activities. Only a single replica is needed, but it uses a lot of resources (16 CPU), and it's generally only needed during work hours (weekdays, roughly 8am-6pm). My cluster runs in a cloud and is set up with instance autoscaling, so if this service is scaled to zero, that instance can be terminated.
The service is third-party code that cannot be modified (well, not easily). It's a fairly typical HTTP service other than that its work is fairly CPU intensive.
What options exist to automatically scale this Deployment down to zero when idle?
I'd rather not set up a schedule to scale it up/down during working hours, because occasionally CI activities are performed outside of the normal hours. I'd like the scaling to be dynamic (for example, scale to zero when idle for >30 minutes, or scale to one when an incoming connection arrives).
Actually, Kubernetes supports scaling to zero only by means of an API call, since the Horizontal Pod Autoscaler only supports scaling down to 1 replica.
Anyway, there are a few Operators which allow you to work around that limitation by intercepting the requests coming to your pods or by inspecting some metrics.
You can take a look at Knative or KEDA.
They enable your application to be serverless, and they do so in different ways.
Knative, by means of Istio, intercepts the requests: if there's an active pod serving them, it redirects the incoming request to that one; otherwise it triggers a scale-up.
By contrast, KEDA best fits event-driven architectures, because it is able to inspect predefined metrics, such as lag, queue length or custom metrics (collected from Prometheus, for example), and trigger the scaling.
Both support scale to zero when predefined conditions are met within an equally predefined window.
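As an illustration of the KEDA variant, a ScaledObject can set minReplicaCount to 0 and scale back up from an external signal. The sketch below assumes a Prometheus server and a request-rate metric; the Deployment name, Prometheus address and query are all hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ci-service-scaler
spec:
  scaleTargetRef:
    name: ci-service                 # hypothetical Deployment
  minReplicaCount: 0                 # scale to zero when the trigger reports no load
  maxReplicaCount: 1
  cooldownPeriod: 1800               # wait 30 minutes of idleness before scaling to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090              # hypothetical Prometheus
        query: sum(rate(http_requests_total{service="ci-service"}[5m]))   # hypothetical metric
        threshold: "1"
```

Note that waking back up from zero on plain HTTP traffic needs something in front to accept and hold the first request (for example the KEDA HTTP add-on or Knative), because a metrics query alone cannot see a request arriving at a Service that has zero pods behind it.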
Hope it helped.
I ended up implementing a custom solution: https://github.com/greenkeytech/zero-pod-autoscaler
Compared to Knative, it's more of a "toy" project, fairly small, and has no dependency on Istio. It's been working well for my use case, though I do not recommend others use it without being willing to adopt the code as their own.
There are a few ways this can be achieved, possibly the most "native" way is using Knative with Istio. Kubernetes by default allows you to scale to zero, however you need something that can broker the scale-up events based on an "input event", essentially something that supports an event driven architecture.
You can take a look at the official documentation here: https://knative.dev/docs/serving/configuring-autoscaling/
The horizontal pod autoscaler currently doesn’t allow setting the minReplicas field to 0, so the autoscaler will never scale down to zero, even if the pods aren’t doing anything. Allowing the number of pods to be scaled down to zero can dramatically increase the utilization of your hardware.
When you run services that get requests only once every few hours or even days, it doesn’t make sense to have them running all the time, eating up resources that could be used by other pods.
But you still want to have those services available immediately when a client request comes in.
This is known as idling and un-idling. It allows pods that provide a certain service to be scaled down to zero. When a new request comes in, the request is blocked until the pod is brought up and then the request is finally forwarded to the pod.
Kubernetes doesn’t currently provide this feature, but it will eventually.
Based on the documentation, it does not support minReplicas=0 so far; read this thread: https://github.com/kubernetes/kubernetes/issues/69687. To set up the HPA properly, this is the formula it uses to compute the required pods:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
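For example (hypothetical numbers), with 4 current replicas, a current average CPU usage of 200m and a target of 100m, the HPA computes ceil[4 * (200 / 100)] = 8 replicas.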
You can also set up HPA based on Prometheus metrics; follow this link:
https://itnext.io/horizontal-pod-autoscale-with-custom-metrics-8cb13e9d475
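Assuming something like prometheus-adapter already exposes a per-pod metric through the custom metrics API (the metric name `http_requests_per_second` and the Deployment name below are hypothetical), an HPA against it could be sketched as:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                          # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # must be served by the custom metrics API
        target:
          type: AverageValue
          averageValue: "100"              # target requests/second per pod
```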

Running one pod per node with deterministic hostnames

I have what I believe is a simple goal, but I can't figure out how to get Kubernetes to play ball.
For my particular application, I am trying to deploy a number of replicas of a docker image that is a worker for another service. This system uses the hostname of the worker to distinguish between workers that are running at the same time.
I would like to be able to deploy a cluster where every node runs a worker for this service.
The problem is that the master also keeps track of every worker that ever worked for it, and displays these in a status dashboard. The intent is that you spin up a fixed number of workers by hand and leave it that way. I would like to be able to resize my cluster and have the number of workers change accordingly.
This seems like a perfect application for DaemonSet, except that then the hostnames are randomly generated and the master ends up tracking many orphaned hostnames.
An alternative might be StatefulSet, which gives us deterministic hostnames, but I can't find a way to force it to scale to one pod per node.
The system I am running is open source and I am looking into changing how it identifies workers to avoid this mess, but I was wondering if there was any sensible way to dynamically scale a StatefulSet to the number of nodes in the cluster. Or any way to achieve similar functionality.
One way is to use a nodeSelector, but I totally agree with @Markus: the more correct and advanced way is to use anti-affinity. It is a really powerful and at the same time simple solution to prevent pods with the same labels from being scheduled onto the same node.
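A hedged sketch of the anti-affinity idea, combined with a StatefulSet for the deterministic hostnames the question asks about (all names are placeholders; you still have to keep `replicas` in step with the node count yourself, for example with a small controller or by scaling it alongside the cluster):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker                     # pods become worker-0, worker-1, ... (deterministic hostnames)
spec:
  serviceName: worker
  replicas: 3                      # keep this in step with the number of nodes
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: worker
              topologyKey: kubernetes.io/hostname   # at most one worker pod per node
      containers:
        - name: worker
          image: registry.example.com/worker:latest   # placeholder image
```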