how to perform HorizontalPodAutoscaling in Kubernetes based on response time (custom metric) using Prometheus adapter? - kubernetes

Hi everyone,
I have a kubeadm-based cluster with 1 master and 2 workers. I have already implemented the built-in HorizontalPodAutoscaler (based on CPU utilization and memory), and now I want to autoscale on a custom metric (response time in my case).
I am using Prometheus Adapter for custom metrics, but I could not find any metric named response_time in Prometheus.
Is there a metric available in Prometheus that can be used to scale the application based on response time, and what is its name?
Will I need to edit the default horizontal autoscaling algorithm, or will I have to write an autoscaling algorithm from scratch to scale my application on the basis of response time?

Prometheus has only 4 metric types: Counter, Gauge, Histogram and Summary.
I guess a Histogram is what you need:
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
A histogram with a base metric name of <basename> exposes multiple time series during a scrape:
cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
the total sum of all observed values, exposed as <basename>_sum
the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
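To make the series above concrete, here is a minimal sketch of how a response-time value can be derived from them: the mean is the increase of <basename>_sum divided by the increase of <basename>_count, and a quantile can be estimated from the cumulative buckets by linear interpolation (similar in spirit to PromQL's histogram_quantile). The function names and the sample bucket bounds are assumptions made up for illustration, and the lowest bucket is assumed to start at 0.

```python
import math


def average_latency(sum_delta: float, count_delta: float) -> float:
    """Mean observed value over a window: increase(_sum) / increase(_count)."""
    if count_delta == 0:
        return 0.0
    return sum_delta / count_delta


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets [(le, count), ...].

    Interpolates linearly inside the bucket that contains the target rank.
    Assumption: the lowest bucket starts at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le) or count == prev_count:
                # Rank falls into the +Inf bucket (or an empty one):
                # report the highest known finite bound.
                return prev_le
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_le + (le - prev_le) * fraction
        prev_le, prev_count = le, count
    return prev_le


# Hypothetical request-duration buckets in seconds.
sample = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(estimate_quantile(0.5, sample))
print(average_latency(12.0, 100))
```

In a real setup this arithmetic would normally live in a PromQL query that Prometheus Adapter exposes as a custom metric, rather than in application code.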
1. There is a Stack Overflow question where you can get a query for latency (response time), so I think this might be useful for you.
2. I don't know if I understand you correctly, but if you want to edit the HPA, you can edit the YAML file, delete the previous HPA and create a new one instead:
kubectl delete -f <name.yaml>
kubectl apply -f <name.yaml>
There is a good article about autoscaling on custom metrics with custom Prometheus metrics.

Related

Is there an HPA configuration that could autoscale based on previous CPU usage?

We currently have a GKE environment with several HPAs for different deployments. All of them work just fine out of the box, but sometimes our users still experience some delay during peak hours.
Usually this delay is the time it takes the new instances to start and become ready.
What I'd like is a way to have an HPA that could predict usage and scale eagerly before it is needed.
The simplest implementation I could think of is an HPA that takes the average usage of previous days and scales up or down in advance (say 10 minutes earlier) based on the historic usage for the current time frame.
Is there anything like that in vanilla k8s or GKE? I was unable to find anything like that in GCP's docs.
If you want to scale your applications based on events/custom metrics, you can use KEDA (Kubernetes-based Event Driven Autoscaler), which supports scaling based on GCP Stackdriver, Datadog or Prometheus metrics (and many other scalers).
What you need to do is create queries that return the CPU usage at CURRENT_TIMESTAMP - 23H50M (or the aggregated value for the last week), then define thresholds to scale your application up or down.
If you have trouble doing this with your monitoring tool, you can create a custom metrics API that queries the monitoring API and aggregates the values (with the time shift) before sending them to the metrics-api scaler.
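As a rough sketch of the time-shift idea (not KEDA's API), the logic of such a custom metrics endpoint could look like the following. It looks up the usage recorded 23h50m ago and turns it into a replica count. The function names, the in-memory history and the per-pod capacity are all assumptions for illustration; a real service would query the monitoring backend instead.

```python
import math
from datetime import datetime, timedelta


def usage_at_shift(history, now, shift=timedelta(hours=23, minutes=50)):
    """Return the most recent sample recorded at or before (now - shift)."""
    target = now - shift
    candidates = [t for t in history if t <= target]
    if not candidates:
        return None  # no historic data that far back
    return history[max(candidates)]


def replicas_for(expected_usage, per_pod_capacity, min_replicas, max_replicas):
    """Replicas needed for the expected usage, clamped to the allowed range."""
    wanted = math.ceil(expected_usage / per_pod_capacity)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical CPU usage samples (in cores) from the previous day.
history = {
    datetime(2024, 1, 1, 12, 0): 2.0,
    datetime(2024, 1, 1, 12, 10): 6.5,
    datetime(2024, 1, 1, 12, 20): 3.0,
}
now = datetime(2024, 1, 2, 12, 0)
expected = usage_at_shift(history, now)
print(expected, replicas_for(expected, per_pod_capacity=2.0,
                             min_replicas=1, max_replicas=10))
```

If the data lives in Prometheus, the offset modifier in PromQL can do the time shift inside the query itself, so the service only has to apply the thresholds.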

kubernetes / prometheus custom metric for horizontal autoscaling

I'm wondering about the approach to take for our server setup. We have pods that are short-lived. They are started with 3 pods at a minimum, and each server waits for a single request that it handles; then the pod is destroyed. I'm not sure of the mechanism by which the pod is destroyed, but my question is not about this part anyway.
There is an "active session count" metric that I am envisioning. Each of these pod resources could make a rest call to some "metrics" pod that we would create for our cluster. The metrics pod would expose a sessionStarted and sessionEnded endpoint - which would increment/decrement the kubernetes activeSessions metric. That metric would be what is used for horizontal autoscaling of the number of pods needed.
Since having a pod as "up" counts as zero active sessions, the custom event that increments the session count would update the metric server session count with a rest call and then decrement again on session end (the pod being up does not indicate whether or not it has an active session).
Is it correct to think that I need this metric server (and write it myself)? Or is there something that Prometheus exposes where this type of metric is supported already - rest clients and all (for various languages), that could modify this metric?
Looking for guidance and confirmation that I'm on the right track. Thanks!
It's impossible to give only one way to solve this, and your question is more "opinion-based". However, there is a useful similar question on Stack Overflow; please check the comments, which can give you some tips. If nothing works, you should probably write the script yourself. There is no exact solution on the Kubernetes side.
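If you do end up writing it yourself, note that the "active session count" from the question maps onto a Prometheus gauge, the metric type that can go both up and down; the official client libraries provide a Gauge with increment and decrement operations. Below is a minimal standard-library sketch of the idea, without a client library. The class name, the metric name active_sessions and the method names are assumptions for illustration; the output follows the Prometheus text exposition format.

```python
import threading


class ActiveSessions:
    """Thread-safe up/down counter rendered as a Prometheus gauge."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def session_started(self) -> None:
        with self._lock:
            self._value += 1

    def session_ended(self) -> None:
        with self._lock:
            # Never report a negative number of sessions.
            if self._value > 0:
                self._value -= 1

    def render(self) -> str:
        """Return the metric in Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            "# HELP active_sessions Number of sessions currently being handled.\n"
            "# TYPE active_sessions gauge\n"
            f"active_sessions {value}\n"
        )


gauge = ActiveSessions()
gauge.session_started()
gauge.session_started()
gauge.session_ended()
print(gauge.render())
```

A metrics pod would serve the output of render() on an HTTP endpoint for Prometheus to scrape, and Prometheus Adapter could then expose the value to the HPA.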
Please also consider Apache Flink. It has a Reactive Mode that works in combination with Kubernetes:
Reactive Mode allows to run Flink in a mode, where the Application Cluster is always adjusting the job parallelism to the available resources. In combination with Kubernetes, the replica count of the TaskManager deployment determines the available resources. Increasing the replica count will scale up the job, reducing it will trigger a scale down. This can also be done automatically by using a Horizontal Pod Autoscaler.

Kubernetes scale up pod with dynamic environment variable

I am wondering if there is a way to set environment variables dynamically when scaling, depending on high load.
Let's imagine that we have
Kubernetes with a service called Generic Consumer, which at the beginning has 4 pods. First of all,
I would like 75% of the pods to have the env variable Gold and 25% Platinum. Is that possible? (The percentages can be replaced by static numbers, for example 3 pods Gold, 1 Platinum.)
Second question:
If the Platinum pods are under high load, is there a way to configure Kubernetes/charts to scale only the Platinum pods and then scale them back down after the load has subsided?
So far I have come up with creating 2 separate YAML files with different env variables and replica counts.
Obviously, the whole purpose of this is to prioritize some topics.
I have used this as a reference: https://www.confluent.io/blog/prioritize-messages-in-kafka.
So in the above example, Generic Consumer would be the Kafka consumer, which would use the env variable to get the bucket config:
configs.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
BucketPriorityAssignor.class.getName());
configs.put(BucketPriorityConfig.TOPIC_CONFIG, "orders-per-bucket");
configs.put(BucketPriorityConfig.BUCKETS_CONFIG, "Platinum, Gold");
configs.put(BucketPriorityConfig.ALLOCATION_CONFIG, "70%, 30%");
configs.put(BucketPriorityConfig.BUCKET_CONFIG, "Platinum");
consumer = new KafkaConsumer<>(configs);
If you have any alternatives, please let me know!
As was mentioned in the comment section, the most versatile option (and probably the best for your scenario with prioritization) is to keep two separate Deployments with gold and platinum labels.
Regarding the first question, as @David Maze pointed out, pods from a Deployment are identical, and you cannot have a few pods with one label and a few others with a different one. Even if you created them manually (3 pods with gold and 1 with platinum), you wouldn't be able to use HPA.
This option allows you to make adjustments depending on the situation. For example, you would be able to scale one Deployment using HPA and another with VPA, or both with HPA. It would also help you stay within budget, i.e. for gold users you might limit the maximum to 5 pods running simultaneously, and for platinum you could set this maximum to 10.
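To illustrate the per-tier limits, here is a small sketch that generates one HPA manifest per Deployment using the autoscaling/v2 schema. The Deployment names, the 70% CPU target and the minimum replica counts are assumptions for the example; only the maximums of 5 and 10 come from the text above.

```python
import json


def hpa_manifest(deployment: str, min_replicas: int, max_replicas: int,
                 cpu_target: int = 70) -> dict:
    """Build an autoscaling/v2 HorizontalPodAutoscaler for one Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target},
                },
            }],
        },
    }


# Hypothetical Deployment names; (min, max) replicas per tier.
tiers = {
    "generic-consumer-gold": (3, 5),
    "generic-consumer-platinum": (1, 10),
}
manifests = [hpa_manifest(name, lo, hi) for name, (lo, hi) in tiers.items()]
print(json.dumps(manifests, indent=2))
```

Each Deployment would set its own bucket env variable in its pod template, and each printed manifest can be saved to a file and applied with kubectl apply -f, which accepts JSON as well as YAML.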
You could consider Istio Traffic Management to route requests, but in my opinion the method with two separate Deployments is more suitable.

Autoscale pods based on http request count

I am looking for pointers on how to autoscale pods based on custom metrics.
As the number of incoming HTTP requests increases, I would like my GKE pods to autoscale to handle the load.
What is the best way to achieve this ?
By default, HPA in GKE uses CPU to scale up and down (based on resource requests vs. actual usage). However, you can use custom metrics as well; just follow this guide. In your case, have the custom metric track the number of HTTP requests per pod (do not use the number of requests to the LB).
When using custom metrics, make sure that the value you choose is an average across all pods; this way the number will increase or decrease with the number of pods. If you choose a metric that is not affected by the number of pods you have, your HPA will always end up at either the maximum or the minimum number of pods.
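The reason becomes clearer with the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). Below is a small simulation sketch of it. The request rates, the target and the helper names are made up for illustration, and the 10% tolerance reflects the documented default.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """HPA formula: ceil(currentReplicas * current / target), with tolerance."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def simulate(total_rps: float, target: float, replicas: int,
             max_replicas: int, per_pod: bool, steps: int = 5) -> int:
    """Run a few control-loop iterations and return the final replica count."""
    for _ in range(steps):
        # A per-pod average shrinks as pods are added; an LB total does not.
        value = total_rps / replicas if per_pod else total_rps
        wanted = desired_replicas(replicas, value, target)
        replicas = min(max_replicas, max(1, wanted))
    return replicas


# 400 requests/s in total, target of 50 requests/s, starting from 4 pods.
print(simulate(400, 50, replicas=4, max_replicas=20, per_pod=True))
print(simulate(400, 50, replicas=4, max_replicas=20, per_pod=False))
```

With the per-pod average the loop settles once each pod handles the target rate, while with the load balancer total the ratio never changes, so the replica count runs up to the maximum.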

Kubernetes HPA - How to avoid scaling-up for CPU utilisation spike

HPA - How to avoid scaling-up for CPU utilization spike (not on startup)
When the business configuration is loaded for a different country, the CPU load increases for 1 minute, but we want to avoid scaling up for that 1 minute.
In the picture below, is CurrentMetricValue just the current value of the metric, or an average value over the interval from the last poll to the current poll (--horizontal-pod-autoscaler-sync-period)?
The default HPA check interval is 15 seconds (30 seconds in older Kubernetes releases). As you mentioned, this can be configured by changing the value of the --horizontal-pod-autoscaler-sync-period flag of the controller manager.
The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager’s --horizontal-pod-autoscaler-sync-period flag.
During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
To change or add kube-controller-manager flags, you need access to the /etc/kubernetes/manifests/ directory on the master node and must be able to modify the parameters in /etc/kubernetes/manifests/kube-controller-manager.yaml.
Note: you are not able to do this on GKE, EKS and other managed clusters.
What is more, I recommend increasing --horizontal-pod-autoscaler-downscale-stabilization (the replacement for --horizontal-pod-autoscaler-upscale-delay).
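As a rough illustration of what the downscale stabilization window does (a simplified model, not the controller's actual code): the autoscaler remembers recent recommendations and acts on the highest one seen within the window, so a short dip does not shrink the deployment right away. The class and method names are assumptions; the 300-second default matches the documented 5 minutes.

```python
from collections import deque


class DownscaleStabilizer:
    """Keeps recent recommendations and returns the highest in the window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window = window_seconds
        self.history = deque()  # entries of (timestamp, recommendation)

    def stabilize(self, now: float, recommendation: int) -> int:
        self.history.append((now, recommendation))
        # Drop recommendations that are older than the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(rec for _, rec in self.history)


stabilizer = DownscaleStabilizer(window_seconds=300)
print(stabilizer.stabilize(0, 10))    # load is high
print(stabilizer.stabilize(60, 4))    # short dip, still inside the window
print(stabilizer.stabilize(400, 4))   # dip has lasted longer than the window
```

Note that this window only delays scale-down. To damp scale-up on a short spike, the autoscaling/v2 API has a behavior.scaleUp.stabilizationWindowSeconds field, which can be set per HPA without touching controller manager flags.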
If you're worried about long outages, I would recommend setting up a custom metric (1 if the network was down in the last ${duration}, 0 otherwise) and setting the target value of the metric to 1 (in addition to CPU-based autoscaling). This way:
If the network was down in the last ${duration}, the recommendation based on the custom metric will be equal to the current size of your deployment. The max of this recommendation and a very low CPU recommendation will be equal to the current size of the deployment. There will be no scale-downs until connectivity is restored (plus a few minutes after that because of the scale-down stabilization window).
If the network is available, the recommendation based on the metric will be 0. Maxed with the CPU recommendation, it will be equal to the CPU recommendation, and the autoscaler will operate normally.
I think this solves your issue better than limiting the size of the autoscaling step. Limiting the size of the autoscaling step will only slow down the rate at which the number of pods decreases, so a longer network outage will still result in your deployment shrinking to the minimum allowed size.
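The reasoning above can be sketched in a few lines, assuming the usual HPA behaviour of computing one recommendation per metric and taking the largest. The function names and numbers are made up for illustration, and tolerance and stabilization are left out to keep it short.

```python
import math


def recommendation(current_replicas: int, value: float, target: float) -> int:
    """Per-metric recommendation: ceil(currentReplicas * value / target)."""
    return math.ceil(current_replicas * value / target)


def combined(current_replicas: int, cpu_util: float, cpu_target: float,
             network_down: bool, min_replicas: int = 1) -> int:
    """Take the max over the CPU metric and the 0/1 outage metric (target 1)."""
    cpu_rec = recommendation(current_replicas, cpu_util, cpu_target)
    outage_rec = recommendation(current_replicas, 1 if network_down else 0, 1)
    return max(min_replicas, cpu_rec, outage_rec)


# During an outage, CPU is nearly idle but the deployment keeps its size.
print(combined(10, cpu_util=5, cpu_target=50, network_down=True))
# Without an outage, only the CPU recommendation matters.
print(combined(10, cpu_util=5, cpu_target=50, network_down=False))
print(combined(10, cpu_util=100, cpu_target=50, network_down=False))
```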
You can also use memory-based scaling.
Since it was not possible to create a memory-based HPA in Kubernetes at the time (current releases support memory as a resource metric in the autoscaling/v2 API), a script was written to achieve the same. You can find our script at this link:
https://github.com/powerupcloud/kubernetes-1/blob/master/memory-based-autoscaling.sh
Clone the repository:
https://github.com/powerupcloud/kubernetes-1.git
and then go to the Kubernetes directory. Execute the help command to get the instructions:
./memory-based-autoscaling.sh --help
Read more here: memory-based-autoscaling.