Autoscale pods based on http request count - kubernetes

I am looking for pointers on how to autoscale pods based on custom metrics.
As the number of incoming http requests increase, I would like my GKE pods to autoscale to handle the load.
What is the best way to achieve this?

By default, HPA in GKE uses CPU to scale up and down (based on resource requests vs. actual usage). However, you can use custom metrics as well, just follow this guide. In your case, have the custom metric track the number of HTTP requests per pod (do not use the number of requests to the load balancer).
When using custom metrics, make sure the value you choose is an average across all pods, so that it increases or decreases with the number of pods. If you choose a metric that is not affected by the number of pods you have, your HPA will either always sit at the maximum or at the minimum number of pods.
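For reference, a minimal HPA sketch driven by a per-pod custom metric might look like the following. This is only a sketch: the metric name http_requests_per_second, the target value and the Deployment name my-app are assumptions, and your metrics pipeline (e.g. Stackdriver or the Prometheus adapter) must already expose the metric through the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue        # averaged across pods, as described above
        averageValue: "100"       # target of ~100 requests per second per pod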

Related

kubernetes / prometheus custom metric for horizontal autoscaling

I'm wondering what approach to take for our server setup. We have pods that are short-lived. They are started with 3 pods at a minimum, and each server waits on a single request that it handles - then the pod is destroyed. I'm not sure of the mechanism by which the pod is destroyed, but my question is not about that part anyway.
There is an "active session count" metric that I am envisioning. Each of these pod resources could make a REST call to some "metrics" pod that we would create for our cluster. The metrics pod would expose a sessionStarted and a sessionEnded endpoint, which would increment/decrement the Kubernetes activeSessions metric. That metric would be what is used for horizontal autoscaling of the number of pods needed.
Since a pod merely being "up" counts as zero active sessions, the custom event that starts a session would increment the metrics server's session count with a REST call and then decrement it again on session end (the pod being up does not indicate whether or not it has an active session).
Is it correct to think that I need this metrics server (and have to write it myself)? Or is there something that Prometheus already exposes where this type of metric is supported - REST clients and all (for various languages) - that could modify this metric?
Looking for guidance and confirmation that I'm on the right track. Thanks!
It's impossible to give only one way to solve this, and your question is somewhat opinion-based. However, there is a useful similar question on Stack Overflow; please check the comments there, as they can give you some tips. If nothing there works, you will probably need to write the service yourself - there is no exact solution on Kubernetes's side.
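If you do go down that route, the HPA side could look roughly like the sketch below. The metric name active_sessions, the Service name metrics-service and the Deployment name worker are all assumptions, and you would still need an adapter (for example the Prometheus adapter) to expose the value through the custom metrics API.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Object
    object:
      metric:
        name: active_sessions        # total count reported by the central metrics pod
      describedObject:
        apiVersion: v1
        kind: Service
        name: metrics-service
      target:
        type: AverageValue           # divided by the current replica count
        averageValue: "1"            # aim for roughly one active session per pod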
Please also take Apache Flink into consideration. It has a Reactive Mode that works in combination with Kubernetes:
Reactive Mode allows to run Flink in a mode, where the Application Cluster is always adjusting the job parallelism to the available resources. In combination with Kubernetes, the replica count of the TaskManager deployment determines the available resources. Increasing the replica count will scale up the job, reducing it will trigger a scale down. This can also be done automatically by using a Horizontal Pod Autoscaler.

High total CPU request but low total usage (kubernetes resources)

I have a bunch of pods in a cluster that together request almost all (7.35/8) of the available CPU resources on a node, even though their actual total usage is almost nothing (0.34/8).
The pod that is currently requesting the most only requests 210m which I guess is not an outrageous amount - also I would like to enforce some sensible minimum request size for all pods in the cluster. Of course that will accumulate when there are lots of pods.
It seems I could easily scale down the request by a factor of 10 and leave the limits where they are to begin with.
But is there something else that I should look into instead before doing that - reducing replica count etc.?
Also it looks a bit strange that the pods are not more evenly distributed between the nodes.
Your request values seem overestimated.
You need time and metrics to find the right request/limit for your workload.
Keep in mind that if you change those values, your pods will restart.
Also, it's normal to find some unbalanced nodes in your cluster. Kubernetes will never remove a running pod to rebalance unless you ask it to.
For example, if you create a cluster with 3 nodes, fill those 3 nodes with pods and then add another 3 nodes, the new nodes will stay empty.
You can set up a HorizontalPodAutoscaler on your cluster to adapt the number of pods to your workload.
Doing that, your workload will spread among the nodes with a better balance (if you use the default scheduling policy).
I suggest the following:
Resource allocation: based on historical values, set your requests to a meaningful value with some buffer. To get guaranteed pod resource allocation it may be a good idea to set the request and limit to the same value, but that means your pod cannot burst for extra resources. One more thing to note is that scheduling only happens based on the requested value, so if the node has no resources left while your pod is bursting from its request towards its limit, the pod may be killed and rescheduled.
Resource quotas: check Kubernetes Resource Quotas to set sensible namespace-level quotas and control over-provisioning of resources by developers.
Affinity/anti-affinity: check the concept of anti-affinity to have your replicas or different pods scheduled across your cluster. You can ensure, for example, that one host or availability zone has only one replica of your pod (which helps with HA), or spread different pods to different nodes (layer scheduling etc.) - check this video. A minimal manifest sketch follows this list.
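A minimal sketch of what the first and third points could look like in a manifest; the names, values and topology key are examples, not prescriptions:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-app
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: my-app
        image: my-app:1.0
        resources:
          requests:                # what the scheduler reserves; base it on observed usage plus a buffer
            cpu: 100m
            memory: 128Mi
          limits:                  # set equal to requests for Guaranteed QoS, higher to allow bursting
            cpu: 200m
            memory: 256Mi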
There are good answers already but I would like to add some more info.
It is very important to have a good strategy when calculating how much resources you would need for each container.
Optimally, your pods should be using exactly the amount of resources you requested but that's almost impossible to achieve. If the usage is lower than your request, you are wasting resources. If it's higher, you are risking performance issues. Consider a 25% margin up and down the request value as a good starting point. Regarding limits, achieving a good setting would depend on trying and adjusting. There is no optimal value that would fit everyone as it depends on many factors related to the application itself, the demand model, the tolerance to errors etc.
Kubernetes best practices: Resource requests and limits is a very good guide explaining the idea behind these mechanisms with a detailed explanation and examples.
Also, Managing Resources for Containers will provide you with the official docs regarding:
Requests and limits
Resource types
Resource requests and limits of Pod and Container
Resource units in Kubernetes
How Pods with resource requests are scheduled
How Pods with resource limits are run, etc
Just in case you'll need a reference.

Kubernetes HPA - How to avoid scaling-up for CPU utilisation spike

HPA - How to avoid scaling-up for CPU utilization spike (not on startup)
When the business configuration is loaded for a different country, the CPU load increases for about 1 minute, but we want to avoid scaling up for that 1 minute.
In the picture below, is CurrentMetricValue just the current value from the metric, or an average value over the period from the last poll to the current poll (--horizontal-pod-autoscaler-sync-period)?
The default HPA check interval is 30 seconds. As you mentioned, this can be configured by changing the value of the --horizontal-pod-autoscaler-sync-period flag of the controller manager.
The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager’s --horizontal-pod-autoscaler-sync-period flag.
During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics), or the custom metrics API (for all other metrics).
In order to change/add flags in kube-controller-manager, you need access to the /etc/kubernetes/manifests/ directory on the master node and have to modify the parameters in /etc/kubernetes/manifests/kube-controller-manager.yaml.
Note: you are not able to do this on GKE, EKS and other managed clusters.
What is more, I recommend increasing --horizontal-pod-autoscaler-downscale-stabilization (the replacement for --horizontal-pod-autoscaler-downscale-delay).
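On a kubeadm cluster this amounts to editing the command list in the static pod manifest; the flag values below are only examples:
# excerpt of /etc/kubernetes/manifests/kube-controller-manager.yaml
spec:
  containers:
  - command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=30s
    - --horizontal-pod-autoscaler-downscale-stabilization=10m0s
    # keep the other existing flags; the kubelet restarts the static pod
    # automatically when this file changes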
If you're worried about long outages, I would recommend setting up a custom metric (1 if the network was down in the last ${duration}, 0 otherwise) and setting the target value of the metric to 1 (in addition to CPU-based autoscaling). This way:
If the network was down in the last ${duration}, the recommendation based on the custom metric will be equal to the current size of your deployment. The maximum of this recommendation and the very low CPU recommendation will be equal to the current size of the deployment, so there will be no scale-downs until connectivity is restored (plus a few minutes after that because of the scale-down stabilization window).
If the network is available, the recommendation based on the metric will be 0. Maxed with the CPU recommendation, it will be equal to the CPU recommendation and the autoscaler will operate normally.
I think this solves your issue better than limiting the size of the autoscaling step. Limiting the step size only slows down the rate at which the number of pods decreases, so a longer network outage would still shrink your deployment to the minimum allowed size.
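A rough sketch of the custom-metric approach described above; the metric name network_down is an assumption (whatever you call it, an adapter must expose it to the custom metrics API):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: network_down            # 1 if the network was down in the last ${duration}, else 0
      target:
        type: AverageValue
        averageValue: "1"
# The HPA takes the maximum of the per-metric recommendations, so while
# network_down is 1 the replica count is held at its current value.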
You can also use memory-based scaling.
Since it was not possible to create a memory-based HPA in Kubernetes at the time, a script was written to achieve the same. You can find our script at this link:
https://github.com/powerupcloud/kubernetes-1/blob/master/memory-based-autoscaling.sh
Clone the repository:
https://github.com/powerupcloud/kubernetes-1.git
and then go to the Kubernetes directory. Execute the help command to get the instructions:
./memory-based-autoscaling.sh --help
Read more here: memory-based-autoscaling.

how to perform HorizontalPodAutoscaling in Kubernetes based on response time (custom metric) using Prometheus adapter?

Hi everyone,
I have a cluster based on kubeadm with 1 master and 2 workers. I have already implemented the built-in horizontal pod autoscaling (based on CPU utilization and memory), and now I want to autoscale on the basis of custom metrics (response time in my case).
I am using the Prometheus Adapter for custom metrics, and I could not find any metric with the name response_time in Prometheus.
Is there any metric available in Prometheus that could be used to scale the application based on response time, and what is its name?
Will I need to edit the default horizontal autoscaling algorithm, or will I have to write an autoscaling algorithm from scratch that scales my application based on response time?
Prometheus has only 4 metric types: Counter, Gauge, Histogram and Summary.
I guess a Histogram is what you need:
A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
A histogram with a base metric name of <basename> exposes multiple time series during a scrape:
cumulative counters for the observation buckets, exposed as <basename>_bucket{le="<upper inclusive bound>"}
the total sum of all observed values, exposed as <basename>_sum
the count of events that have been observed, exposed as <basename>_count (identical to <basename>_bucket{le="+Inf"} above)
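As a sketch, a Prometheus adapter rule can turn the _sum and _count series of such a histogram into an average-latency metric per pod. The series name http_request_duration_seconds and the resulting metric name http_request_duration_seconds_avg are assumptions; use whatever your application actually exports.
# prometheus-adapter rules (ConfigMap / Helm values), sketch only
rules:
- seriesQuery: 'http_request_duration_seconds_sum{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_sum$"
    as: "${1}_avg"
  metricsQuery: |
    sum(rate(http_request_duration_seconds_sum{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
    / sum(rate(http_request_duration_seconds_count{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)
An HPA can then reference http_request_duration_seconds_avg as a Pods metric with an AverageValue target (for example "500m" for 0.5 seconds).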
1. There is a Stack Overflow question where you can get a query for latency (response time), so I think this might be useful for you.
2. I don't know if I understand you correctly, but if you want to edit the HPA, you can edit the YAML file, delete the previous HPA and create a new one instead:
kubectl delete -f <name.yaml>
kubectl apply -f <name.yaml>
There is a good article about Autoscaling on custom metrics with custom Prometheus Metrics.

Is there any tool for GKE node autoscaling based on total pods requested in Kubernetes?

When I resize a replication controller using kubectl, if the cluster does not have enough resources, one or more pods stay in Pending forever.
Is there any tool that will automatically resize the GKE cluster when resources are running out?
I had a similar requirement (for the Go build system): I wanted to know when the ratio of scheduled to available CPU or memory was > 1, and to scale out nodes when that was true (or, more accurately, when it was around 0.8). There's no built-in metric, but as you suggest you can do it with a custom metric.
This was all done in Go, but it will give you the basic idea:
Create the metrics (memory and CPU, in my case)
Put values to the metrics
The key takeaway IMO is that you have to iterate over each pod in the cluster to determine how much capacity is consumed, then iterate over each node in the cluster to determine how much capacity is available. It's then just a matter of pointing your autoscaler to the custom metric(s).
Big, big, big thing worth noting: I ultimately determined that scaling on the built-in CPU utilization metric was just as good as (if not better than) the custom metric. Each pod we scheduled pegged the CPU, so when pods were maxed out so was CPU. The built-in CPU utilization metric is probably better because you don't have the latency that comes with periodically putting custom metrics.
You can turn on autoscaling for the Instance Group that your GKE nodes belong to.