HPA scale-down of Kubernetes pods

My requirement is to scale up pods on a custom metric: as the number of pending messages in the queue increases, pods should be added to process the jobs. In Kubernetes, scale-up is working fine with the Prometheus adapter and the Prometheus operator.
I have long-running processes in the pods, but the HPA checks the custom metric and tries to scale down. Because of this, a process is killed mid-operation and that message is lost. How can I make the HPA kill only free pods, where no process is running?
AdapterService rule to collect the custom metric:
seriesQuery: '{namespace="default",service="hpatest-service"}'
resources:
  overrides:
    namespace:
      resource: "namespace"
    service:
      resource: "service"
name:
  matches: "msg_consumergroup_lag"
metricsQuery: 'avg_over_time(msg_consumergroup_lag{topic="test",consumergroup="test"}[1m])'
HPA Configuration
type: Object
object:
  describedObject:
    kind: Service
    name: custommetric-service
  metric:
    name: msg_consumergroup_lag
  target:
    type: Value
    value: 2

At present the HPA cannot be configured to accommodate workloads of this nature. The HPA simply sets the replica count on the deployment to a desired value according to the scaling algorithm, and the deployment chooses one or more pods to terminate.
There is a lot of discussion on this topic in this Kubernetes issue that may be of interest to you. It is not solved by the HPA, and may never be. There may need to be a different kind of autoscaler for this type of workload. Some suggestions are given in the link that may help you in defining one of these.
If I were to take this on myself, I would create a new controller, with a corresponding CRD containing a job definition and the scaling requirements. Instead of scaling deployments, I would have it launch jobs. I would have the jobs do their work (process the queue) until they become idle (no items in the queue), then exit. The controller would only scale up, by adding jobs, never down. The jobs themselves would scale down by exiting when the queue is empty.
This would require that your jobs be able to detect when they become idle, by checking the queue and exiting if there is nothing there. If your queue read blocks forever, this would not work and you would need a different solution.
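As a sketch of what such a controller might create per scale-up step, here is a plain Job whose worker is expected to exit cleanly once the queue is empty. The image name and queue address are placeholder assumptions, not part of the original answer:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: queue-worker-    # the controller creates one Job per scale-up step
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never       # the worker exits 0 when it finds the queue empty
      containers:
        - name: worker
          image: example/queue-worker:latest   # placeholder image
          env:
            - name: QUEUE_ADDR                 # placeholder; the worker reads from this queue
              value: "amqp://queue.default.svc:5672"
```

Because the worker never blocks on an empty queue and simply exits, completed Jobs drain away on their own and the controller never has to terminate a pod mid-task.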
The kubebuilder project has an excellent example of a job controller. I would start with that and extend it with the ability to check your published metrics and start the jobs accordingly.
Also see Fine Parallel Processing Using a Work Queue in the Kubernetes documentation.

I will suggest an idea here: you can run a custom script that disables the HPA as soon as it scales up. The script keeps checking the resources and processes, and when no process is running it re-enables the HPA so it can scale down (or it kills the idle pods with a kubectl command and then re-enables the HPA).

I had a similar use case, scaling deployments based on queue length, and I used KEDA (keda.sh); it does exactly that. Just know that it will scale down the additional pods created for that deployment even if a pod is currently processing data/input, so you will have to configure the cooldown parameter to make scale-down happen appropriately.
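As a rough sketch of that setup, matching the consumer-group lag from the original question, a KEDA ScaledObject with a cooldown could look like this (deployment name, broker address, and thresholds are placeholder assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer-deployment          # placeholder Deployment name
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 600                  # wait 10 min after the last active trigger before scaling to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.default.svc:9092   # placeholder broker address
        consumerGroup: test
        topic: test
        lagThreshold: "2"
```

The `cooldownPeriod` only delays the final scale-to-zero; it does not by itself protect a busy pod from an intermediate scale-down, which is why long-running work fits ScaledJobs better.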

KEDA ScaledJobs are best for such scenarios and can be triggered through Queue, Storage, etc. (the currently available scalers can be found here). The ScaledJobs are not killed in between the execution and are recommended for long running executions.
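A minimal ScaledJob sketch, here using a RabbitMQ queue as the trigger (queue name, host, image, and limits are placeholder assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: queue-processor
spec:
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: worker
            image: example/worker:latest   # placeholder image
  pollingInterval: 30                      # how often KEDA checks the queue
  maxReplicaCount: 10
  successfulJobsHistoryLimit: 3
  triggers:
    - type: rabbitmq
      metadata:
        queueName: test-queue              # placeholder queue
        host: amqp://guest:guest@rabbitmq.default.svc:5672   # placeholder host
        queueLength: "5"
```

Each trigger firing launches Jobs rather than resizing a Deployment, so running Jobs are left to finish even after the queue drains.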

Related

Kubernetes vertical-pod-autoscaler without Updater and how to use it with benchmarks

Good morning all. I am now working with the Kubernetes Vertical Pod Autoscaler (VPA) https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler and I'd like to use it to optimise resource consumption on the cluster. There are around 40+ microservices deployed; I can easily automate creation of the VPA objects in a shell script, and I get recommendations. I cannot use the Admission Controller on the cluster: it is not deployed.
Do I understand the documentation correctly that the Updater part needs the Admission Controller to work, and that without it no eviction / update is done at all?
https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/updater/README.md
Updater runs in Kubernetes cluster and decides which pods should be restarted based on resources allocation recommendation calculated by Recommender. (...) Updater does not perform the actual resources update, but relies on Vertical Pod Autoscaler admission plugin to update pod resources when the pod is recreated after eviction.
Another question I have is about the difference between updatePolicy.updateMode: Auto and Off. How are they different in a deployment where the Updater is not used?
Again, documentation / VPA code says:
// UpdateModeOff means that autoscaler never changes Pod resources.
// The recommender still sets the recommended resources in the
// VerticalPodAutoscaler object. This can be used for a "dry run".
UpdateModeOff UpdateMode = "Off"
// UpdateModeInitial means that autoscaler only assigns resources on pod
// creation and does not change them during the lifetime of the pod.
UpdateModeInitial UpdateMode = "Initial"
My VPA cannot evict / recreate a pod, since there is no admission controller. So I should use the Off mode, if I understand correctly? Recreating a pod (scale -> 0, scale -> 1) recreates it with the same resources as defined in the Deployment, which in my opinion is expected. Is there any difference between using Off and Initial?
And the last question: as I cannot use the Updater part of the VPA, I've set Prometheus's --history-length=14d to two weeks, and I can run performance tests on demand. Should I point the VPA at an environment with a more steady load, or can I use a DEV-like environment with artificial load (via jMeter in my case) to get the recommendations? Should the VPA be deployed at the same time as the load tests are run, so it gets live metrics for the recommendation?
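For reference, a minimal VPA object of the kind described, in recommendation-only mode (the names here are placeholders, not taken from the cluster in question):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa            # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service      # placeholder target Deployment
  updatePolicy:
    updateMode: "Off"          # recommendations only; no eviction, no resource changes
```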
Thank you for your responses.

Is interrupt functionality available in Kubernetes minikube autoscaling?

I have used the custom metrics API and have successfully autoscaled the service based on some metrics. But here is the point: the autoscaling works in such a way that minikube's HPA (horizontal pod autoscaler) checks a particular URL/API and fetches the metric value repeatedly with some polling period. For example, the HPA checks the value every 15 seconds. So there is continuous polling from the HPA to the URL/API to fetch the metric value. After that it simply compares the value with the given target reference value and tries to scale.
What I want is for the API/URL to trigger the minikube HPA whenever needed; in simple words, the HPA should work like an interrupt here.
The call for autoscaling should go from the service/API to the HPA, not from the HPA to the service!
Is this feature available in Kubernetes? Or do you have any comments on this scenario? Please share your views; I am currently in the last stage and this question is blocking my progress!

Why does Openshift scale up old deployment before rolling deployment

In my team, we sometimes scale down to just one pod in Openshift to make testing easier. If we then do a rolling update with the desired replica count set to 2, Openshift scales up to two pods before performing a rolling deploy. It is a nuisance, because the new "old" pod can start things that we don't expect to be started before the new deployment starts, and so we have to remember to take down the one pod before the new deploy.
Is there a way to stop the old deployment from scaling up to the desired replica count while the new deployment is scaled up to the desired replica count? Also, why does it work this way?
OpenShift Master:
v3.11.200
Kubernetes Master:
v1.11.0+d4cacc0
OpenShift Web Console:
3.11.200-1-8a53b1d
From our Openshift template:
- apiVersion: v1
  kind: DeploymentConfig
  spec:
    replicas: 2
    strategy:
      type: Rolling
This is expected behavior when using the RollingUpdate strategy. It removes old pods one by one while adding new ones at the same time, keeping the application available throughout the whole process and ensuring there is no drop in its capacity to handle requests. Since you have only one pod, Kubernetes first scales the deployment up in order to honour the strategy and the zero downtime requested in the manifest.
It scales up to 2, because if not specified maxSurge defaults to 25%. It means that there can be at most 25% more pod instances than the desired count during an update.
If you want to ensure that this won't be scaled you might change the strategy to Recreate. This will cause all old pods to be deleted before the new ones are created. Use this strategy when your application doesn’t support running multiple versions in parallel and requires the old version to be stopped completely before the new one is started. However please note that, this strategy does involve a short period of time when your app becomes completely unavailable.
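As a sketch, switching the strategy in a DeploymentConfig like the template above would look like this (the rest of the spec is assumed unchanged):

```yaml
- apiVersion: v1
  kind: DeploymentConfig
  spec:
    replicas: 2
    strategy:
      type: Recreate   # all old pods are removed before any new pod starts
```

With Recreate there is no surge at all, so the old version is never scaled up before the new one rolls out, at the cost of a brief outage between versions.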
Here's a good document that describes the rolling update strategy. It is also worth checking the official Kubernetes documentation about Deployments.

Kubernetes Deployment Rolling Updates

I have an application that I deploy on Kubernetes.
This application has 4 replicas and I'm doing a rolling update on each deployment.
This application has a graceful shutdown which can take tens of minutes (it has to wait for running tasks to finish).
My problem is that during updates, I have over-capacity since all the older version pods are stuck at "Terminating" status while all the new pods are created.
During the updates, I end up running with 8 containers and it is something I'm trying to avoid.
I tried to set maxSurge to 0, but this setting doesn't take into consideration the "Terminating" pods, so the load on my servers during the deployment is too high.
The behaviour I'm trying to get is that new pods only get created after the old-version pods have finished successfully, so that at no point do I exceed the number of replicas I set.
I wonder if there is a way to achieve such behaviour.
What I ended up doing is creating a StatefulSet with podManagementPolicy: Parallel and updateStrategy to OnDelete.
I also set terminationGracePeriodSeconds to the maximum time it takes for a pod to terminate.
As a part of my deployment process, I apply the new StatefulSet with the new image and then delete all the running pods.
This way all the pods enter the Terminating state, and whenever a pod finishes its task and terminates, a new pod with the new image replaces it.
This way I'm able to keep a static number of replicas during the whole deployment process.
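A minimal sketch of that StatefulSet, with placeholder names and an assumed one-hour worst-case task duration:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: worker                    # placeholder name
spec:
  replicas: 4
  serviceName: worker             # placeholder headless Service name
  podManagementPolicy: Parallel   # start and stop pods in parallel, not one by one
  updateStrategy:
    type: OnDelete                # the new image is only picked up when a pod is deleted
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      terminationGracePeriodSeconds: 3600   # assumed max time a task needs to finish
      containers:
        - name: worker
          image: example/worker:v2          # placeholder image
```

After applying the new spec, deleting the running pods (e.g. `kubectl delete pod -l app=worker`) lets each one drain within its grace period before its replacement starts with the new image.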
Let me suggest the following strategy:
Deployments implement the concept of ready pods to aid rolling updates. Readiness probes allow the deployment to gradually update pods while giving you control over when the rolling update can proceed.
A Ready pod is one that is considered successfully updated by the Deployment and no longer counts towards the surge count for the deployment. A pod is considered ready if its readiness probe is successful and spec.minReadySeconds have passed since the pod was created. The defaults for these options result in a pod that is ready as soon as its containers start.
So, what you can do, is implement (if you haven't done so yet) a readiness probe for your pods in addition to setting the spec.minReadySeconds to a value that will make sense (worst case) to the time that it takes for your pods to terminate.
This will ensure rolling out will happen gradually and in coordination to your requirements.
In addition to that, don't forget to configure a deadline for the rollout.
By default, if the rollout can't make any progress within 10 minutes, it is considered failed. The time after which the Deployment is considered failed is configurable through the progressDeadlineSeconds property in the Deployment spec.
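Putting those pieces together, a Deployment fragment could look like this (the endpoint, image, and timing values are placeholder assumptions to be tuned to your workload):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                          # placeholder name
spec:
  replicas: 4
  minReadySeconds: 120               # assumed worst-case readiness delay
  progressDeadlineSeconds: 1200      # fail the rollout after 20 min without progress
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1              # replace one pod at a time, never exceeding 4
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: example/app:v2      # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz         # placeholder health endpoint
              port: 8080
            periodSeconds: 10
```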

Kubernetes Queue with Pod Per Work Item autoscaling

I want an application to pull an item off a queue, process the item on the queue and then destroy itself. Pull -> Process -> Destroy.
I've looked at using the job pattern Queue with Pod Per Work Item, as that fits the use case; however, it isn't appropriate when I need the job to autoscale, i.e. 0/1 pods when the queue is empty, scaling up as items are added. The only way I can see to do this is via a Deployment, but that breaks the Queue with Pod Per Work Item pattern. There must be a fresh container per item.
Is there a way to have the job pattern Queue with Pod Per Work Item but with auto-scaling?
I am a bit confused, so I'll just say this: if you don't mind a failed pod, and you want a failed pod not to be recreated by Kubernetes, you can do that in your code by catching all errors and exiting gracefully (not advised).
Please also note that for Deployments the only accepted restartPolicy is Always. So pods of a Deployment that crash will always be restarted by Kubernetes, will probably fail for the same reason, and will end up in a CrashLoopBackOff.
If you want to scale a deployment depending on the length of a RabbitMQ queue, check KEDA. It is an event-driven autoscaling platform.
Make sure to also check their example with RabbitMQ
Another possibility is a job/deployment, that routinely checks the length of the queue in question and executes kubectl commands to scale your deployment.
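A rough sketch of that last approach as a CronJob that polls the queue and scales the Deployment with kubectl. The image, the metrics endpoint, and the deployment name are placeholder assumptions, and the pod's ServiceAccount needs RBAC permission to patch deployments:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: queue-autoscaler            # placeholder name
spec:
  schedule: "*/1 * * * *"           # check the queue every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scaler
              image: bitnami/kubectl:latest   # assumed image that ships kubectl
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # placeholder: fetch the queue length from your queue's metrics API
                  LEN=$(wget -qO- http://queue-metrics.default.svc/length)
                  # placeholder deployment name; cap replicas as appropriate
                  kubectl scale deployment/worker --replicas="$LEN"
```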
Here is the cleanest one I could find, at least for my taste