GKE - HPA using custom metrics - unable to fetch metrics - kubernetes

I have custom metrics exported to Google Cloud Monitoring and I want to scale my deployment according to them.
This is my HPA:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: <DEPLOYMENT>-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <DEPLOYMENT>
  minReplicas: 5
  maxReplicas: 100
  metrics:
  - type: External
    external:
      metricName: "custom.googleapis.com|rabbit_mq|test|messages_count"
      metricSelector:
        matchLabels:
          metric.labels.name: production
      targetValue: 1
When describing the HPA I see:
Warning  FailedComputeMetricsReplicas  4m23s (x12 over 7m23s)  horizontal-pod-autoscaler  Invalid metrics (1 invalid out of 1), last error was: failed to get external metric custom.googleapis.com|rabbit_mq|test|messages_count: unable to get external metric production/custom.googleapis.com|rabbit_mq|test|messages_count/&LabelSelector{MatchLabels:map[string]string{metric.labels.name: production,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get custom.googleapis.com|rabbit_mq|test|messages_count.external.metrics.k8s.io)
Warning  FailedGetExternalMetric  2m23s (x20 over 7m23s)  horizontal-pod-autoscaler  unable to get external metric production/custom.googleapis.com|rabbit_mq|test|messages_count/&LabelSelector{MatchLabels:map[string]string{metric.labels.name: production,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get custom.googleapis.com|rabbit_mq|test|messages_count.external.metrics.k8s.io)
And:
Metrics: ( current / target )
"custom.googleapis.com|rabbit_mq|test|messages_count" (target value): <unknown> / 1
Kubernetes is unable to get the metric.
I validated that the metric is available and updated through the Monitoring dashboard.
Cluster nodes have Full Control access for Stackdriver Monitoring.
Kubernetes version is 1.15.
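One way to check whether the external metrics API itself responds from inside the cluster (a quick sanity check; the namespace and metric name are taken from the HPA above, and pipe characters have to be URL-encoded as %7C):
# List the external metrics API group; an error here points at the adapter, not the HPA
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

# Query the specific metric in the production namespace
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/custom.googleapis.com%7Crabbit_mq%7Ctest%7Cmessages_count" | jq .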
What may be causing that?
Edit 1
Discovered that the stackdriver-metadata-agent-cluster-level deployment is in CrashLoopBackOff.
kubectl -n=kube-system logs stackdriver-metadata-agent-cluster-level-f8dcd8b45-nl8dj -c metadata-agent
Logs from the container:
I0408 11:50:41.999214 1 log_spam.go:42] Command line arguments:
I0408 11:50:41.999263 1 log_spam.go:44] argv[0]: '/k8s_metadata'
I0408 11:50:41.999271 1 log_spam.go:44] argv[1]: '-logtostderr'
I0408 11:50:41.999277 1 log_spam.go:44] argv[2]: '-v=1'
I0408 11:50:41.999284 1 log_spam.go:46] Process id 1
I0408 11:50:41.999311 1 log_spam.go:50] Current working directory /
I0408 11:50:41.999336 1 log_spam.go:52] Built on Jun 27 20:15:21 (1561666521)
at gcm-agent-dev-releaser#ikle14.prod.google.com:/google/src/files/255462966/depot/branches/gcm_k8s_metadata_release_branch/255450506.1/OVERLAY_READONLY/google3
as //cloud/monitoring/agents/k8s_metadata:k8s_metadata
with gc go1.12.5 for linux/amd64
from changelist 255462966 with baseline 255450506 in a mint client based on //depot/branches/gcm_k8s_metadata_release_branch/255450506.1/google3
Build label: gcm_k8s_metadata_20190627a_RC00
Build tool: Blaze, release blaze-2019.06.17-2 (mainline #253503028)
Build target: //cloud/monitoring/agents/k8s_metadata:k8s_metadata
I0408 11:50:41.999641 1 trace.go:784] Starting tracingd dapper tracing
I0408 11:50:41.999785 1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
W0408 11:50:42.003682 1 client_config.go:549] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
E0408 11:50:43.999995       1 main.go:110] Will only handle some server resources due to partial failure: unable to retrieve the complete list of server APIs: custom.metrics.k8s.io/v1beta1: the server is currently unable to handle the request, custom.metrics.k8s.io/v1beta2: the server is currently unable to handle the request, external.metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0408 11:50:44.000286 1 main.go:134] Initiating watch for { v1 nodes} resources
I0408 11:50:44.000394 1 main.go:134] Initiating watch for { v1 pods} resources
I0408 11:50:44.097181 1 main.go:134] Initiating watch for {batch v1beta1 cronjobs} resources
I0408 11:50:44.097488 1 main.go:134] Initiating watch for {apps v1 daemonsets} resources
I0408 11:50:44.098123 1 main.go:134] Initiating watch for {extensions v1beta1 daemonsets} resources
I0408 11:50:44.098427 1 main.go:134] Initiating watch for {apps v1 deployments} resources
I0408 11:50:44.098713 1 main.go:134] Initiating watch for {extensions v1beta1 deployments} resources
I0408 11:50:44.098919 1 main.go:134] Initiating watch for { v1 endpoints} resources
I0408 11:50:44.099134 1 main.go:134] Initiating watch for {extensions v1beta1 ingresses} resources
I0408 11:50:44.099207 1 main.go:134] Initiating watch for {batch v1 jobs} resources
I0408 11:50:44.099303 1 main.go:134] Initiating watch for { v1 namespaces} resources
I0408 11:50:44.099360 1 main.go:134] Initiating watch for {apps v1 replicasets} resources
I0408 11:50:44.099410 1 main.go:134] Initiating watch for {extensions v1beta1 replicasets} resources
I0408 11:50:44.099461 1 main.go:134] Initiating watch for { v1 replicationcontrollers} resources
I0408 11:50:44.197193 1 main.go:134] Initiating watch for { v1 services} resources
I0408 11:50:44.197348 1 main.go:134] Initiating watch for {apps v1 statefulsets} resources
I0408 11:50:44.197363 1 main.go:142] All resources are being watched, agent has started successfully
I0408 11:50:44.197374 1 main.go:145] No statusz port provided; not starting a server
I0408 11:50:45.197164 1 binarylog.go:95] Starting disk-based binary logging
I0408 11:50:45.197238 1 binarylog.go:265] rpc: flushed binary log to ""
Edit 2
The issue in edit 1 was fixed using the answer in:
https://stackoverflow.com/a/60549732/4869599
But the HPA still can't fetch the metric.
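At this point it is also worth checking which APIService is supposed to serve the metrics API groups and whether it reports as Available (a general diagnostic step, assuming the Stackdriver adapter registers the usual v1beta1.external.metrics.k8s.io APIService):
# FALSE in the AVAILABLE column means the backing adapter service/pod is broken
kubectl get apiservice | grep metrics.k8s.io
kubectl describe apiservice v1beta1.external.metrics.k8s.io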
Edit 3
It seems like the issue is caused by the custom-metrics-stackdriver-adapter under the custom-metrics namespace, which is stuck in CrashLoopBackOff.
The logs of the adapter pod:
E0419 13:36:48.036494 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:48.832653 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:48.832692 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:49.433150 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:49.433191 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:51.032656 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
E0419 13:36:51.032694 1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}
E0419 13:36:51.235248 1 writers.go:172] apiserver was unable to write a JSON response: http2: stream closed
A related issue:
https://github.com/GoogleCloudPlatform/k8s-stackdriver/issues/303
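For completeness, the commands used to look at the adapter's state (namespace from Edit 3; the pod name is a placeholder):
# Check the adapter pod status and the logs of its previous, crashed container
kubectl -n custom-metrics get pods
kubectl -n custom-metrics logs <adapter-pod-name> --previous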

The problem was with the custom-metrics-stackdriver-adapter. It was crashing in the metrics-server namespace.
Using the resource found here:
https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
And using this image for the deployment (my version was v0.10.2):
gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1
This fixed the crashing pod, and now the HPA fetches the custom metric.
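Roughly, the fix boils down to the following sketch (deployment and namespace names come from the adapter manifest linked above; `*=` in kubectl set image just updates every container so you don't need to know the container name):
# Deploy (or re-deploy) the adapter from the upstream manifest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml

# Pin the adapter image back to v0.10.1
kubectl -n custom-metrics set image deployment/custom-metrics-stackdriver-adapter \
  '*=gcr.io/google-containers/custom-metrics-stackdriver-adapter:v0.10.1'

# Watch the pod come up
kubectl -n custom-metrics get pods -w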

Check that the metrics-server pod is running in your kube-system namespace, or else you can use this:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      volumes:
      # mount in tmp so we can safely use from-scratch images and/or read-only containers
      - name: tmp-dir
        emptyDir: {}
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.1
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        imagePullPolicy: Always
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
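After applying this, a quick way to confirm the resource metrics API is answering (assuming the metrics-server pod reaches Running):
# Should return a JSON APIResourceList rather than an error
kubectl get --raw /apis/metrics.k8s.io/v1beta1

# Node/pod usage should start appearing after a minute or so
kubectl top nodes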

Related

FailedGetPodsMetric: for HPA autoscaling

I am trying to autoscale using custom metrics, with the metric type "http_request". The following command shows correct output:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq
Below is my hpa.yaml file:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: http_requests
      targetAverageValue: 1
but my scaling is failing due to
the HPA was unable to compute the replica count:
unable to get metric http_requests: unable to fetch metrics from custom metrics API: an error on the server
("Internal Server Error: \"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%!A(MISSING)/http_requests?labelSelector=app%!D(MISSING)podinfo\": the server could not find the requested resource")
has prevented the request from succeeding (get pods.custom.metrics.k8s.io *)
Please help me out in this :)
Seems like you are missing pods in your cluster that match the provided deployment specification. Can you check if your podinfo deployment is running, and that it has healthy pods in it?
The command works because you're only checking the availability of the metrics endpoint. It simply implies that the endpoint is live and ready to provide metrics; it doesn't guarantee that you will actually receive metrics (without any resources backing them).
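To go one step further than listing the API group, you can ask the custom metrics API for the metric itself (namespace and metric name assumed from the question); an empty items list or an error here narrows the problem down to the adapter or the pods backing it:
# Query http_requests for all pods in the default namespace
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests" | jq .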

kubernetes "unable to get metrics"

I am trying to autoscale a deployment and a statefulset by running these two commands respectively:
kubectl autoscale statefulset mysql --cpu-percent=50 --min=1 --max=10
kubectl expose deployment frontend --type=LoadBalancer --name=frontend
Sadly, on the minikube dashboard, this error appears under both services:
failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
Searching online I read that it might be a DNS error, so I checked, but CoreDNS seems to be running fine.
Both workloads are nothing special, this is the 'frontend' deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: hubuser/repo
        ports:
        - containerPort: 3000
Has anyone got any suggestions?
First of all, could you please verify if the API is working fine? To do so, please run kubectl get --raw /apis/metrics.k8s.io/v1beta1.
If you get an error similar to:
“Error from server (NotFound):”
Please follow these steps:
1.- Remove all the proxy environment variables from the kube-apiserver manifest.
2.- In the kube-controller-manager-amd64, set --horizontal-pod-autoscaler-use-rest-clients=false
3.- The last scenario is that your metrics-server add-on is disabled by default. You can verify it by using:
$ minikube addons list
If it is disabled, you will see something like metrics-server: disabled.
You can enable it by using:
$ minikube addons enable metrics-server
When it is done, delete and recreate your HPA.
You can use the following thread as a reference.
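For example, a minimal way to recreate the HPA once metrics-server is enabled (the statefulset name and thresholds are taken from the question; kubectl autoscale names the HPA after its target by default):
# Remove the HPA that was created while metrics were unavailable, then recreate it
kubectl delete hpa mysql
kubectl autoscale statefulset mysql --cpu-percent=50 --min=1 --max=10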

EKS, Windows node. networkPlugin cni failed

Per https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html, I ran the command, eksctl utils install-vpc-controllers --cluster <cluster_name> --approve
My EKS version is v1.16.3. I tried to deploy Windows Docker images to a Windows node and got the error below.
Warning FailedCreatePodSandBox 31s kubelet, ip-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ab8001f7b01f5c154867b7e" network for pod "mrestapi-67fb477548-v4njs": networkPlugin cni failed to set up pod "mrestapi-67fb477548-v4njs_ui" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
$ kubectl logs vpc-resource-controller-645d6696bc-s5rhk -n kube-system
I1010 03:40:29.041761 1 leaderelection.go:185] attempting to acquire leader lease kube-system/vpc-resource-controller...
I1010 03:40:46.453557 1 leaderelection.go:194] successfully acquired lease kube-system/vpc-resource-controller
W1010 23:57:53.972158 1 reflector.go:341] pkg/mod/k8s.io/client-go#v0.0.0-20180910083459-2cefa64ff137/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 1480444 (1515040)
It complains about a too old resource version. How do I upgrade the version?
I removed the Windows nodes and re-created them with a different instance type, but it did not work.
I removed the Windows node group and re-created it; it did not work.
Finally, I removed the entire EKS cluster and re-created it. The command kubectl describe node <windows_node> gives me the output below.
vpc.amazonaws.com/CIDRBlock 0 0
vpc.amazonaws.com/ENI 0 0
vpc.amazonaws.com/PrivateIPv4Address 1 1
I deployed windows-server-iis.yaml and it works as expected. The root cause of the problem remains a mystery.
To troubleshoot this I would...
First list the components to make sure they're running:
$ kubectl get pod -n kube-system | grep vpc
vpc-admission-webhook-deployment-7f67d7b49-wgzbg 1/1 Running 0 38h
vpc-resource-controller-595bfc9d98-4mb2g 1/1 Running 0 29
If they are running, check their logs:
kubectl logs <vpc-yadayada> -n kube-system
Make sure the instance type you are using has enough available IPs per ENI, because in the Windows world only one ENI is used, and it is limited to the maximum available IPs per ENI minus one for the primary IP address. I have run into this error before when I exceeded the number of IPs available to my ENI.
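A quick way to see how many secondary IPs the node is actually advertising (the resource name comes from the error message above):
# Capacity/Allocatable should show a non-zero vpc.amazonaws.com/PrivateIPv4Address count
kubectl describe node <windows_node> | grep -A 7 Allocatable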
Confirm that the nodeSelector of your pod is right:
nodeSelector:
  kubernetes.io/os: windows
  kubernetes.io/arch: amd64
As an anecdote, I have done the steps mentioned under the To enable Windows support for your cluster with a macOS or Linux client section of the doc you linked on a few clusters to date, and they have worked well.
What is your output for
kubectl describe node <windows_node>
?
If it's like:
vpc.amazonaws.com/CIDRBlock: 0
vpc.amazonaws.com/ENI: 0
vpc.amazonaws.com/PrivateIPv4Address: 0
then you need to re-create the node group with a different instance type.
Then try to deploy this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-server-iis-test
  namespace: default
spec:
  selector:
    matchLabels:
      app: windows-server-iis-test
      tier: backend
      track: stable
  replicas: 1
  template:
    metadata:
      labels:
        app: windows-server-iis-test
        tier: backend
        track: stable
    spec:
      containers:
      - name: windows-server-iis-test
        image: mcr.microsoft.com/windows/servercore:1809
        ports:
        - name: http
          containerPort: 80
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - "Add-WindowsFeature Web-Server; Invoke-WebRequest -UseBasicParsing -Uri 'https://dotnetbinaries.blob.core.windows.net/servicemonitor/2.0.1.6/ServiceMonitor.exe' -OutFile 'C:\\ServiceMonitor.exe'; echo '<html><body><br/><br/><marquee><H1>Hello EKS!!!<H1><marquee></body><html>' > C:\\inetpub\\wwwroot\\default.html; C:\\ServiceMonitor.exe 'w3svc'; "
        resources:
          limits:
            cpu: 256m
            memory: 256Mi
          requests:
            cpu: 128m
            memory: 100Mi
      nodeSelector:
        kubernetes.io/os: windows
---
apiVersion: v1
kind: Service
metadata:
  name: windows-server-iis-test
  namespace: default
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: windows-server-iis-test
    tier: backend
    track: stable
  sessionAffinity: None
  type: ClusterIP
kubectl proxy
Then open http://localhost:8001/api/v1/namespaces/default/services/http:windows-server-iis-test:80/proxy/default.html in a browser; it will show a web page with the "Hello EKS!!!" text.

Kubernetes + calico + replicaSet

So I found myself in a pretty sticky situation. I'm trying to create a simple ReplicaSet, but unfortunately I ran into some problems with Calico.
I have 2 VMs running on Oracle VM, configured to use the enp0s8 interface. The IP of the master node is 192.168.56.2 and the worker node's IP is 192.168.56.3.
Here is what I'm doing in Kubernetes. First I'm creating the Kubernetes master node:
kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=192.168.56.2
After successfully initializing, I'm running:
export KUBECONFIG=/etc/kubernetes/admin.conf
Now I'm creating the pod network by running:
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
After that I'm joining from the worker node successfully. Whenever I start the ReplicaSet with:
*** edit: I don't have to create the ReplicaSet to obtain the same result; the calico-node creation gets stuck either way
kubectl create -f replicaset-definition.yml
where the YAML looks like this:
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: myapp-replicaset
  labels:
    app: myapp
    type: front-end
spec:
  template:
    metadata:
      name: myapp-pod
      labels:
        app: myapp
        type: front-end
    spec:
      containers:
      - name: nginx-container
        image: nginx
  replicas: 2
  selector:
    matchLabels:
      app: myapp
I'm getting a new calico-node pod created which eventually gets stuck:
calico-node-mcb5g 0/1 Running 6 8m58s
calico-node-t9p5n 1/1 Running 0 12m
If I run
kubectl logs -n kube-system calico-node-mcb5g -f
on it I get the following logs:
2020-03-18 14:45:40.585 [INFO][8] startup.go 275: Using NODENAME environment for node name
2020-03-18 14:45:40.585 [INFO][8] startup.go 287: Determined node name: kubenode1
2020-03-18 14:45:40.587 [INFO][8] k8s.go 228: Using Calico IPAM
2020-03-18 14:45:40.588 [INFO][8] startup.go 319: Checking datastore connection
2020-03-18 14:46:10.589 [INFO][8] startup.go 334: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
2020-03-18 14:46:41.591 [INFO][8] startup.go 334: Hit error connecting to datastore - retry error=Get https://10.96.0.1:443/api/v1/nodes/foo: dial tcp 10.96.0.1:443: i/o timeout
I've tried to configure the calico.yaml and added the following lines in env:
- name: IP_AUTODETECTION_METHOD
  value: "interface=enp0s8"
but the result is still the same.
Thank you so much for reading this and if you have any advice I will be sooo grateful!!!
OK, so here it goes. What seemed to happen was that the calico-node pod crashed because the pod network CIDR and the host CIDR overlapped.
If I initialize the master node with the CIDR changed to:
kubeadm init --pod-network-cidr=20.96.0.0/12 --apiserver-advertise-address=192.168.56.2
it works like a charm.
This helped a lot:
Cluster Creation Successful but calico-node-xx pod is in CrashLoopBackOff Status
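As a sanity check after init, you can confirm which pod CIDR the cluster actually got and compare it against the host network (a sketch; the label and flag apply to kubeadm-managed control planes):
# The controller-manager pod spec carries the configured cluster (pod) CIDR
kubectl -n kube-system get pod -l component=kube-controller-manager -o yaml | grep -- --cluster-cidr

# Compare it with the host interface used by the nodes
ip addr show enp0s8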

Why do pods with completed status still show up in kubectl get pods?

I have executed the samples from the book "Kubernetes Up and Running" where a pod with a work queue is run, then a k8s Job is created with 5 pods to consume all the work on the queue. I have reproduced the YAML API objects below.
My expectation is that once a k8s Job completes, its pods would be deleted, but kubectl get pods -o wide shows the pods are still around, even though it reports 0/1 containers ready and they still seem to have IP addresses assigned; see the output below.
When will completed Job pods be removed from the output of kubectl get pods, and why is that not right after all the containers in the pod finish?
Are the pods consuming any resources when they complete, like an IP address, or is the info being printed out historical?
Output from kubectl after all the pods have consumed all the messages.
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
consumers-bws9f 0/1 Completed 0 6m 10.32.0.35 gke-cluster1-default-pool-3796b2ee-rtcr
consumers-d25cs 0/1 Completed 0 6m 10.32.0.33 gke-cluster1-default-pool-3796b2ee-rtcr
consumers-jcwr8 0/1 Completed 0 6m 10.32.2.26 gke-cluster1-default-pool-3796b2ee-tpml
consumers-l9rkf 0/1 Completed 0 6m 10.32.0.34 gke-cluster1-default-pool-3796b2ee-rtcr
consumers-mbd5c 0/1 Completed 0 6m 10.32.2.27 gke-cluster1-default-pool-3796b2ee-tpml
queue-wlf8v 1/1 Running 0 22m 10.32.0.32 gke-cluster1-default-pool-3796b2ee-rtcr
The following three k8s API objects were created; they are cut and pasted from the book samples.
Run a pod with a work queue
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  labels:
    app: work-queue
    component: queue
    chapter: jobs
  name: queue
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: work-queue
        component: queue
        chapter: jobs
    spec:
      containers:
      - name: queue
        image: "gcr.io/kuar-demo/kuard-amd64:1"
        imagePullPolicy: Always
Expose the pod as a service so that the worker pods can get to it.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: work-queue
    component: queue
    chapter: jobs
  name: queue
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: work-queue
    component: queue
Post 100 items to the queue then run a job with 5 pods executing in parallel until the queue is empty.
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: message-queue
    component: consumer
    chapter: jobs
  name: consumers
spec:
  parallelism: 5
  template:
    metadata:
      labels:
        app: message-queue
        component: consumer
        chapter: jobs
    spec:
      containers:
      - name: worker
        image: "gcr.io/kuar-demo/kuard-amd64:1"
        imagePullPolicy: Always
        args:
        - "--keygen-enable"
        - "--keygen-exit-on-complete"
        - "--keygen-memq-server=http://queue:8080/memq/server"
        - "--keygen-memq-queue=keygen"
      restartPolicy: OnFailure
The docs say it pretty well:
When a Job completes, no more Pods are created, but the Pods are not
deleted either. Keeping them around allows you to still view the logs
of completed pods to check for errors, warnings, or other diagnostic
output. The job object also remains after it is completed so that you
can view its status. It is up to the user to delete old jobs after
noting their status. Delete the job with kubectl (e.g. kubectl delete
jobs/pi or kubectl delete -f ./job.yaml). When you delete the job
using kubectl, all the pods it created are deleted too.
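So once you've checked the logs, cleaning up is on you; deleting the Job (name from the manifest above) removes its completed pods as well. On newer clusters you can also set ttlSecondsAfterFinished on the Job spec to have this happen automatically.
# Deleting the Job also deletes the Completed pods it created
kubectl delete jobs consumers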
It shows Completed status when it has actually terminated. If you set restartPolicy: Never (when you don't want it to run more than once), then it goes to this state.
Terminated: Indicates that the container completed its execution and has stopped running. A container enters into this when it has successfully completed execution or when it has failed for some reason. Regardless, a reason and exit code is displayed, as well as the container’s start and finish time. Before a container enters into Terminated, preStop hook (if any) is executed.
...
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 30 Jan 2019 11:45:26 +0530
Finished: Wed, 30 Jan 2019 11:45:26 +0530
...