Kubernetes HPA not working: unable to get metrics - kubernetes

My pod scaler fails to deploy, and keeps giving an error of FailedGetResourceMetric:
Warning FailedComputeMetricsReplicas 6s horizontal-pod-autoscaler failed to compute desired number of replicas based on listed metrics for Deployment/default/bot-deployment: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
I have made sure metrics-server is installed, as you can see when I run the following command to show the metrics-server resource on the cluster:
kubectl get deployment metrics-server -n kube-system
It shows this:
metrics-server
I also set the --kubelet-insecure-tls and --kubelet-preferred-address-types=InternalIP options in the args section of the metrics-server manifest file.
This is what my deployment manifest looks like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: bot-deployment
  labels:
    app: bot
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bot
  template:
    metadata:
      labels:
        app: bot
    spec:
      containers:
      - name: bot-api
        image: gcr.io/<repo>
        ports:
        - containerPort: 5600
        volumeMounts:
        - name: bot-volume
          mountPath: /core
      - name: wallet
        image: gcr.io/<repo>
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: 800m
          limits:
            cpu: 1500m
        volumeMounts:
        - name: bot-volume
          mountPath: /wallet_
      volumes:
      - name: bot-volume
        emptyDir: {}
The specification for my pod scaler is shown below too:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: bot-scaler
spec:
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 85
        type: Utilization
    type: Resource
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bot-deployment
  minReplicas: 1
  maxReplicas: 10
Because of this, the TARGET column always stays at <unknown>/80%. Upon inspection, the HPA keeps making that same complaint over and over again. I have tried all the options I have seen on other questions, but none of them seem to work, and I have also uninstalled and reinstalled the metrics-server many times without success.
One thing I notice, though, is that the metrics-server seems to shut down after I deploy the HPA manifest, and it fails to start again: when I check its state, the READY column shows 0/1 even though it was initially 1/1. What could be wrong?
I will gladly provide as much info as needed. Thank you!

Looks like your bot-api container is missing its resource request and limit; your wallet container has them, though. The HPA uses the resource requests of all containers in the pod to calculate utilization, so every container in the target deployment needs a CPU request. A sketch of the missing block is shown below.
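A minimal sketch of the fix for the bot-api container (the CPU values here are placeholders, not from the question; pick numbers that match your workload):

      - name: bot-api
        image: gcr.io/<repo>
        ports:
        - containerPort: 5600
        resources:          # added so the HPA can compute CPU utilization for this container
          requests:
            cpu: 500m       # placeholder value
          limits:
            cpu: 1000m      # placeholder value
        volumeMounts:
        - name: bot-volume
          mountPath: /core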

Related

Kubernetes - Horizontal Pod Scaler error, target "unknown". Message "no recommendation"

I have a working 1.23.9 Kubernetes cluster hosted on Google Kubernetes Engine with multi-cluster services enabled, one cluster in the US and another in the EU. I have multiple deployment apps, each with an HPA configured through YAML. Out of 7 deployment apps, the HPA is only working for one. service-1 can only be accessed from service-2 internally, and service-2 is exposed through an HttpGateway by GKE. Please find more info below; any help would be greatly appreciated.
Deployment file (I have posted only 2 apps; service-2's HPA is working fine, whereas service-1's is not):
$ cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-1
  namespace: backend
  labels:
    app: service-1
spec:
  replicas: 1
  selector:
    matchLabels:
      lbtype: internal
  template:
    metadata:
      labels:
        lbtype: internal
        app: service-1
    spec:
      containers:
      - name: service-1
        image: [REDACTED]
        ports:
        - containerPort: [REDACTED]
          name: "[REDACTED]"
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"
      imagePullSecrets:
      - name: docker-gcr
      restartPolicy: Always
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: service-2
  namespace: backend
  labels:
    app: service-2
spec:
  replicas: 2
  selector:
    matchLabels:
      lbtype: external
  template:
    metadata:
      labels:
        lbtype: external
        app: service-2
    spec:
      containers:
      - name: service-2
        image: [REDACTED]
        ports:
        - containerPort: [REDACTED]
          name: "[REDACTED]"
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      imagePullSecrets:
      - name: docker-gcr
      restartPolicy: Always
HorizontalPodScaler file:
$ cat horizontal-pod-scaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-1
  namespace: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-1
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-2
  namespace: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-2
  minReplicas: 2
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Service file:
$ cat service.yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-internal
  namespace: backend
spec:
  type: ClusterIP
  ports:
  - name: service-1
    port: [REDACTED]
    targetPort: "[REDACTED]"
  selector:
    lbtype: internal
---
apiVersion: v1
kind: Service
metadata:
  name: backend-middleware
  namespace: backend
spec:
  ports:
  - name: service-2
    port: [REDACTED]
    targetPort: "[REDACTED]"
  selector:
    lbtype: external
$ kctl get hpa
NAME        REFERENCE              TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
service-1   Deployment/service-1   <unknown>/70%   1         2         1          18h
service-2   Deployment/service-2   4%/70%          2         4         2          18h
$ kctl top pod
NAME                        CPU(cores)   MEMORY(bytes)
service-1-8f7dc66cc-xtz76   3m           66Mi
service-2-5fd767cbc-vm7f5   4m           76Mi
$ kubectl describe deployment metrics-server-v0.5.2 -n kube-system
Name:                   metrics-server-v0.5.2
Namespace:              kube-system
CreationTimestamp:      Fri, 02 Dec 2022 11:01:18 +0530
Labels:                 addonmanager.kubernetes.io/mode=Reconcile
                        k8s-app=metrics-server
                        version=v0.5.2
Annotations:            components.gke.io/layer: addon
                        deployment.kubernetes.io/revision: 4
Selector:               k8s-app=metrics-server,version=v0.5.2
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
...
Containers:
  metrics-server:
    Image:      gke.gcr.io/metrics-server:v0.5.2-gke.1
    Port:       10250/TCP
    Host Port:  10250/TCP
    Command:
      /metrics-server
      --metric-resolution=30s
      --kubelet-port=10255
      --deprecated-kubelet-completely-insecure=true
      --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
      --cert-dir=/tmp
      --secure-port=10250
$ kctl describe hpa service-1
Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    ReadyForNewScale         recommended size matches current size
  ScalingActive   False   FailedGetResourceMetric  the HPA was unable to compute the replica count: no recommendation
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                   Age                  From                       Message
  ----     ------                   ----                 ----                       -------
  Warning  FailedGetResourceMetric  2m (x4470 over 18h)  horizontal-pod-autoscaler  no recommendation
$ kctl describe hpa service-2
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:           <none>
As per my understanding, ScalingActive=False should not affect the autoscaling in a major way.
Check the possible solutions below:
1) Check the resource metric: you can remove the LIMITS from your deployments and try it. As long as the pod's containers have the relevant resource REQUESTS set at the deployment level, the HPA can work; this discussion shows that REQUESTS alone are sufficient for the HPA. Once you see the HPA working, you can play with LIMITS as well (a hedged requests-only sketch follows this list).
2) FailedGetResourceMetric: check whether the metric is registered and available (also look at "Custom and external metrics"). Try executing kubectl top node and kubectl top pod -A to verify that metrics-server is working properly.
The HPA controller runs regularly to check whether any adjustments to the system are required. During each run, the controller manager queries resource utilization against the metrics specified in each HorizontalPodAutoscaler definition, obtaining per-pod resource metrics from the resource metrics API.
Basically, the HPA targets a Deployment by name and then uses that Deployment's selector labels to fetch pod metrics. If two Deployments used the same selector, the HPA would get metrics for pods of both Deployments, so it is worth confirming exactly which pods your selector matches (see the commands after this list). Trying the same deployment on a kind cluster may also work fine, which helps isolate whether the issue is GKE-specific.
3) Kubernetes Metrics Server is a scalable, efficient source of container resource metrics for Kubernetes' built-in autoscaling pipelines; it is what powers CPU/memory-based horizontal autoscaling.
Check the requirements: Metrics Server has specific cluster and network configuration requirements, which are not the default for all cluster distributions, so please ensure that your distribution supports them before using Metrics Server.
4) The HPA processes scale-up events every 15-30 seconds, and it may take around 3-4 minutes to react because of the latency of metrics data.
5) Check this relevant SO answer for more information.
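As referenced in 1) above, a hedged sketch of a requests-only resources block for service-1 while debugging (the request values are copied from the question; re-add limits once scaling works):

        resources:
          requests:
            memory: "128Mi"   # copied from the question's service-1
            cpu: "100m"
          # limits intentionally omitted while verifying the HPA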
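And for the selector point in 2), a hedged set of checks to confirm which pods the Deployment's selector actually matches and that metrics-server returns data for them (namespace and label taken from the question):

# Pods matched by service-1's selector; the HPA reads metrics for exactly these pods
kubectl get pods -n backend -l lbtype=internal --show-labels

# Per-pod metrics for the same selector, straight from metrics-server
kubectl top pod -n backend -l lbtype=internal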

Flink Kubernetes deployment - the HPA controller was unable to get the target's current scale:

I am deploying a Flink stateful app using the YAML file below.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: operational-reporting-15gb
spec:
  image: .azurecr.io/stateful-app-v2
  flinkVersion: v1_15
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.savepoints.dir: abfs://flinktest#.dfs.core.windows.net/savepoints.v2
    state.checkpoints.dir: abfs://flinktest#.dfs.core.windows.net/checkpoints.v2
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: abfs://flinktest#.dfs.core.windows.net/ha.v2
  serviceAccount: flink
  jobManager:
    resource:
      memory: "15360m"
      cpu: 2
  taskManager:
    resource:
      memory: "15360m"
      cpu: 3
  podTemplate:
    spec:
      containers:
      - name: flink-main-container
        volumeMounts:
        - mountPath: /flink-data
          name: flink-volume
      volumes:
      - name: flink-volume
        emptyDir: {}
  job:
    jarURI: local:///opt/operationalReporting.jar
    parallelism: 1
    upgradeMode: savepoint
    state: running
Flink jobs are running perfectly.
For auto-scaling, I created an HPA using the following manifest.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: basic-hpa
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageValue: 100m
  scaleTargetRef:
    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    name: operational-reporting-15gb
When I describe the autoscaler, I get the error below.
Type          Status  Reason          Message
AbleToScale   False   FailedGetScale  the HPA controller was unable to get the target's current scale: flinkdeployments.flink.apache.org "operational-reporting-15gb" not found
Events:
Type     Reason          Age                   From                       Message
Warning  FailedGetScale  4m4s (x121 over 34m)  horizontal-pod-autoscaler  flinkdeployments.flink.apache.org "operational-reporting-15gb" not found
The HPA target is showing UNKNOWN. Kindly help.
I assume you are following the HPA example from the Kubernetes Operator. Thanks for giving it a try; it is an experimental feature, as outlined in the docs, and we only have limited experience with it at the moment.
That said, checking the obvious: is your FlinkDeployment named operational-reporting-15gb running in the default namespace? Otherwise, please adjust the namespace of your HPA accordingly.
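A quick, hedged way to confirm that (the fully-qualified resource name matches the one in your error message):

# List FlinkDeployments in every namespace to see where the resource actually lives
kubectl get flinkdeployments.flink.apache.org --all-namespaces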
Also, please make sure that you have the latest FlinkDeployment CRD installed. Having v1beta1 only ensures compatibility; it is not actually a pinned version, and we added the scale subresource relatively recently.
git clone https://github.com/apache/flink-kubernetes-operator
cd flink-kubernetes-operator
kubectl replace -f helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
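After replacing the CRD, you can sanity-check that the scale subresource is actually served; a hedged sketch (the jsonpath layout is an assumption and may need tweaking on older kubectl versions):

# Print the subresources declared for each version of the FlinkDeployment CRD;
# the served version should include a "scale" entry for the HPA to work
kubectl get crd flinkdeployments.flink.apache.org \
  -o jsonpath='{range .spec.versions[*]}{.name}{": "}{.subresources}{"\n"}{end}'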

Unable to fetch metrics from custom metrics API: the server is currently unable to handle the request

I'm using an HPA based on a custom metric on GKE.
The HPA is not working and it's showing me this error log:
unable to fetch metrics from custom metrics API: the server is currently unable to handle the request
When I run kubectl get apiservices | grep custom I get
v1beta1.custom.metrics.k8s.io services/prometheus-adapter False (FailedDiscoveryCheck) 135d
this is the HPA spec config :
spec:
scaleTargetRef:
kind: Deployment
name: api-name
apiVersion: apps/v1
minReplicas: 3
maxReplicas: 50
metrics:
- type: Object
object:
target:
kind: Service
name: api-name
apiVersion: v1
metricName: messages_ready_per_consumer
targetValue: '1'
and this is the service's spec config :
spec:
  ports:
  - name: worker-metrics
    protocol: TCP
    port: 8080
    targetPort: worker-metrics
  selector:
    app.kubernetes.io/instance: api
    app.kubernetes.io/name: api-name
  clusterIP: 10.8.7.9
  clusterIPs:
  - 10.8.7.9
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
What should I do to make it work?
First of all, confirm that the metrics-server Pod is running in your kube-system namespace. Also, you can use the following manifest:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      # mount in tmp so we can safely use from-scratch images and/or read-only containers
      - name: tmp-dir
        emptyDir: {}
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.1
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        imagePullPolicy: Always
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
If so, take a look at the logs and watch for any lines from the Stackdriver adapter. This issue is commonly caused by a problem with the custom-metrics-stackdriver-adapter, which usually crashes in the metrics-server namespace. To solve that, use the resource from this URL, and for the deployment, use this image:
gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1
Another common root cause of this is an OOM issue. In this case, adding more memory solves the problem. To assign more memory, you can specify the new memory amount in the configuration file, as the following example shows:
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
  namespace: mem-example
spec:
  containers:
  - name: memory-demo-ctr
    image: polinux/stress
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
In the above example, the Container has a memory request of 100 MiB and a memory limit of 200 MiB. In the manifest, the "--vm-bytes", "150M" argument tells the Container to attempt to allocate 150 MiB of memory. You can visit this Kubernetes Official Documentation to have more references about the Memory settings.
You can use the following threads for more reference GKE - HPA using custom metrics - unable to fetch metrics, Stackdriver-metadata-agent-cluster-level gets OOMKilled, and Custom-metrics-stackdriver-adapter pod keeps crashing.
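Independent of which root cause applies, it also helps to look at why the APIService itself is failing discovery; a hedged sketch (the adapter's namespace and deployment name below are assumptions, so adjust them to your install):

# Show the discovery failure recorded on the custom metrics APIService
kubectl describe apiservice v1beta1.custom.metrics.k8s.io

# Check the backing adapter pods and their recent logs
kubectl get pods -n <adapter-namespace>
kubectl logs -n <adapter-namespace> deploy/prometheus-adapter --tail=50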
What do you get for kubectl get pod -l "app.kubernetes.io/instance=api,app.kubernetes.io/name=api-name"?
There should be a pod to which the service refers.
If there is a pod, check its logs with kubectl logs <pod-name>. You can add -f to the kubectl logs command to follow the logs.
Adding this block in my EKS nodes security group rules solved the issue for me:
node_security_group_additional_rules = {
  ...
  ingress_cluster_metricserver = {
    description                   = "Cluster to node 4443 (Metrics Server)"
    protocol                      = "tcp"
    from_port                     = 4443
    to_port                       = 4443
    type                          = "ingress"
    source_cluster_security_group = true
  }
  ...
}

error while set up Prometheus metrics for kubernetes Horizontal Pod Auto-scaling

I'm trying to set up the metrics to activate HPA (Horizontal Pod Autoscaling).
I am following this tutorial, only the Custom Metrics (Prometheus) part.
Unfortunately, when I execute the command below:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[]}
I should see a lot of resources listed there, but the resources array is empty.
This might be an issue with how you set up metrics-server: it may not be able to find your nodes via their InternalIP.
The solution is to replace the metrics-server-deployment.yaml file in metrics-server/deploy/1.8+ with the following YAML:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      # mount in tmp so we can safely use from-scratch images and/or read-only containers
      - name: tmp-dir
        emptyDir: {}
      containers:
      - command:
        - /metrics-server
        - --metric-resolution=30s
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.1
        imagePullPolicy: Always
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
Also, enable --authentication-token-webhook in the kubelet configuration (kubelet.conf); then you will be able to get the metrics. A hedged example of what that can look like is shown below.
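A sketch of enabling that flag on a kubeadm-style node (the drop-in path and variable name are assumptions that vary by distribution; only the flag itself comes from the note above):

# /etc/systemd/system/kubelet.service.d/20-metrics.conf  (example path)
[Service]
Environment="KUBELET_EXTRA_ARGS=--authentication-token-webhook=true"

# then reload and restart the kubelet on each node
sudo systemctl daemon-reload
sudo systemctl restart kubelet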
Also, check out my answer for step-by-step instructions on setting up the HPA using metrics-server:
How to Enable KubeAPI server for HPA Autoscaling Metrics
Hope this helps. Revert back if you face any issues.

Kubernetes | Rolling Update on Replica Set

I'm trying to perform a rolling update of the container image that my Federated Replica Set is using but I'm getting the following error:
When I run: kubectl rolling-update mywebapp -f mywebapp-v2.yaml
I get the error message: the server could not find the requested resource;
This is a brand new, clean install on Google Container Engine (GKE), so besides creating the Federated Cluster and deploying my first service, nothing else has been done. I'm following the instructions from the Kubernetes docs but no luck.
I've checked to make sure that I'm in the correct context, and I've also created a new YAML file pointing to the new image and updated the metadata name. Am I missing something? The easy way for me to do this is to delete the replica set and then redeploy, but then I'm cheating myself :). Any pointers would be appreciated.
mywebappv2.yaml - new yaml file for rolling update
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: mywebapp-v2
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: mywebapp
    spec:
      containers:
      - name: mywebapp
        image: gcr.io/xxxxxx/static-js:v2
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 80
          name: mywebapp
My original mywebapp.yaml file:
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: mywebapp
spec:
  replicas: 4
  template:
    metadata:
      labels:
        app: mywebapp
    spec:
      containers:
      - name: mywebapp
        image: gcr.io/xxxxxx/static-js:v2
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 80
          name: mywebapp
Try kind: Deployment.
"Most kubectl commands that support Replication Controllers also support ReplicaSets. One exception is the rolling-update command. If you want the rolling update functionality please consider using Deployments instead. Also, the rolling-update command is imperative whereas Deployments are declarative, so we recommend using Deployments through the rollout command."
-- Replica Sets | Kubernetes
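A hedged sketch of the same workload converted to a Deployment (name, image, and resources copied from the question; the explicit selector is required by apps/v1 and is an addition here):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mywebapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: mywebapp        # must match the pod template labels below
  template:
    metadata:
      labels:
        app: mywebapp
    spec:
      containers:
      - name: mywebapp
        image: gcr.io/xxxxxx/static-js:v2
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 80
          name: mywebapp

A rolling update then becomes a declarative change, for example kubectl set image deployment/mywebapp mywebapp=gcr.io/xxxxxx/static-js:<new-tag> (where <new-tag> is whatever image tag you are rolling out), followed by kubectl rollout status deployment/mywebapp to watch its progress.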