Using RabbitMQ queues for HPA, access to custom.metrics fails - kubernetes

The metric can be accessed successfully through the API and clearly returns data via /apis/custom.metrics.k8s.io/v1beta1:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/rabbitmq-exporter/rabbitmq_queue_messages_ready?metricLabelSelector=queue%3Dtest-1 | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/rabbitmq-exporter/rabbitmq_queue_messages_ready"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "default",
        "name": "rabbitmq-exporter",
        "apiVersion": "/v1"
      },
      "metricName": "rabbitmq_queue_messages_ready",
      "timestamp": "2020-02-17T13:50:20Z",
      "value": "14",
      "selector": null
    }
  ]
}
HPA file
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: test-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: test
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metric:
        name: "rabbitmq_queue_messages_ready"
        selector:
          matchLabels:
            "queue": "test-1"
      describedObject:
        apiVersion: "custom.metrics.k8s.io/v1beta1"
        kind: Service
        name: rabbitmq-exporter
      target:
        type: Value
        value: 4
Error message
Name: test-hpa
Namespace: default
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"autoscaling/v2beta2","kind":"HorizontalPodAutoscaler","metadata":{"annotations":{},"name":"test-hpa","namespace":"defa...
CreationTimestamp: Mon, 17 Feb 2020 21:38:08 +0800
Reference: Deployment/test
Metrics: ( current / target )
"rabbitmq_queue_messages_ready" on Service/rabbitmq-exporter (target value): <unknown> / 4
Min replicas: 1
Max replicas: 5
Deployment pods: 1 current / 0 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetObjectMetric the HPA was unable to compute the replica count: unable to get metric rabbitmq_queue_messages_ready: Service on default rabbitmq-exporter/object metrics are not yet supported
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedComputeMetricsReplicas 97s (x12 over 4m22s) horizontal-pod-autoscaler Invalid metrics (1 invalid out of 1), last error was: failed to get object metric value: unable to get metric rabbitmq_queue_messages_ready: Service on default rabbitmq-exporter/object metrics are not yet supported
Warning FailedGetObjectMetric 82s (x13 over 4m22s) horizontal-pod-autoscaler unable to get metric rabbitmq_queue_messages_ready: Service on default rabbitmq-exporter/object metrics are not yet supported

Related

How can I troubleshoot pod stuck at ContainerCreating

I'm trying to troubleshoot a failing pod but I cannot gather enough info to do so. Hoping someone can assist.
[server-001 ~]$ kubectl get pods sandboxed-nginx-98bb68c4d-26ljd
NAME READY STATUS RESTARTS AGE
sandboxed-nginx-98bb68c4d-26ljd 0/1 ContainerCreating 0 18m
[server-001 ~]$ kubectl logs sandboxed-nginx-98bb68c4d-26ljd
Error from server (BadRequest): container "nginx-kata" in pod "sandboxed-nginx-98bb68c4d-26ljd" is waiting to start: ContainerCreating
[server-001 ~]$ kubectl describe pods sandboxed-nginx-98bb68c4d-26ljd
Name: sandboxed-nginx-98bb68c4d-26ljd
Namespace: default
Priority: 0
Node: worker-001/100.100.230.34
Start Time: Fri, 08 Jul 2022 09:41:08 +0000
Labels: name=sandboxed-nginx
pod-template-hash=98bb68c4d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/sandboxed-nginx-98bb68c4d
Containers:
nginx-kata:
Container ID:
Image: dummy-registry.com/test/nginx:1.17.7
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-887n4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-887n4:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25m default-scheduler Successfully assigned default/sandboxed-nginx-98bb68c4d-26ljd to worker-001
Warning FailedCreatePodSandBox 5m19s kubelet Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
[worker-001 ~]$ sudo crictl images
IMAGE TAG IMAGE ID SIZE
dummy-registry.com/test/externalip-webhook v1.0.0-1 e2e778d82e6c3 147MB
dummy-registry.com/test/flannel v0.14.1 52e470e10ebf9 209MB
dummy-registry.com/test/kube-proxy v1.22.8 93ab9e5f0c4d6 869MB
dummy-registry.com/test/nginx 1.17.7 db634ca7e0456 310MB
dummy-registry.com/test/pause 3.5 dabdc5fea3665 711kB
dummy-registry.com/test/linux 7-slim 41388a53234b5 140MB
[worker-001 ~]$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
b1c6d1bf2f09a db634ca7e045638213d3f68661164aa5c7d5b469631bbb79a8a65040666492d5 34 minutes ago Running nginx 0 3598c2c4d3e88
caaa14b395eb8 e2e778d82e6c3a8cc82cdf3083e55b084869cd5de2a762877640aff1e88659dd 48 minutes ago Running webhook 0 8a9697e2af6a1
4f97ac292753c 52e470e10ebf93ea5d2aa32f5ca2ecfa3a3b2ff8d2015069618429f3bb9cda7a 48 minutes ago Running kube-flannel 2 a4e4d0c14cafc
aacb3ed840065 93ab9e5f0c4d64c135c2e4593cd772733b025f53a9adb06e91fe49f500b634ab 48 minutes ago Running kube-proxy 2 9e0bc036c2d00
[worker-001 ~]$ sudo crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
3598c2c4d3e88 34 minutes ago Ready nginx-9xtss default 0 (default)
8a9697e2af6a1 48 minutes ago Ready externalip-validation-webhook-7988bff847-ntv6d externalip-validation-system 0 (default)
9e0bc036c2d00 48 minutes ago Ready kube-proxy-9c7cb kube-system 0 (default)
a4e4d0c14cafc 48 minutes ago Ready kube-flannel-ds-msz7w kube-system 0 (default)
[worker-001 ~]$ cat /etc/crio/crio.conf
[crio]
[crio.image]
pause_image = "dummy-registry.com/test/pause:3.5"
registries = ["docker.io", "dummy-registry.com/test"]
[crio.network]
plugin_dirs = ["/opt/cni/bin"]
[crio.runtime]
cgroup_manager = "systemd"
conmon_cgroup = "system.slice"
conmon = "/usr/libexec/crio/conmon"
manage_network_ns_lifecycle = true
manage_ns_lifecycle = true
selinux = false
[crio.runtime.runtimes]
[crio.runtime.runtimes.kata]
runtime_path = "/usr/bin/containerd-shim-kata-v2"
runtime_type = "vm"
runtime_root = "/run/vc"
[crio.runtime.runtimes.runc]
runtime_path = "/usr/bin/runc"
runtime_type = "oci"
[worker-001 ~]$ egrep -v '^#|^;|^$' /usr/share/defaults/kata-containers/configuration-qemu.toml
[hypervisor.qemu]
initrd = "/usr/share/kata-containers/kata-containers-initrd.img"
path = "/usr/libexec/qemu-kvm"
kernel = "/usr/share/kata-containers/vmlinuz.container"
machine_type = "q35"
enable_annotations = []
valid_hypervisor_paths = ["/usr/libexec/qemu-kvm"]
kernel_params = ""
firmware = ""
firmware_volume = ""
machine_accelerators=""
cpu_features="pmu=off"
default_vcpus = 1
default_maxvcpus = 0
default_bridges = 1
default_memory = 2048
disable_block_device_use = false
shared_fs = "virtio-9p"
virtio_fs_daemon = "/usr/libexec/kata-qemu/virtiofsd"
valid_virtio_fs_daemon_paths = ["/usr/libexec/kata-qemu/virtiofsd"]
virtio_fs_cache_size = 0
virtio_fs_extra_args = ["--thread-pool-size=1", "-o", "announce_submounts"]
virtio_fs_cache = "auto"
block_device_driver = "virtio-scsi"
enable_iothreads = false
enable_vhost_user_store = false
vhost_user_store_path = "/usr/libexec/qemu-kvm"
valid_vhost_user_store_paths = ["/var/run/kata-containers/vhost-user"]
valid_file_mem_backends = [""]
pflashes = []
valid_entropy_sources = ["/dev/urandom","/dev/random",""]
[factory]
[agent.kata]
kernel_modules=[]
[runtime]
internetworking_model="tcfilter"
disable_guest_seccomp=true
disable_selinux=false
sandbox_cgroup_only=true
static_sandbox_resource_mgmt=false
sandbox_bind_mounts=[]
vfio_mode="guest-kernel"
disable_guest_empty_dir=false
experimental=[]
[image]
[server-001 ~]$ cat nginx.yaml
---
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: kata-containers
handler: kata
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sandboxed-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      name: sandboxed-nginx
  template:
    metadata:
      labels:
        name: sandboxed-nginx
    spec:
      runtimeClassName: kata-containers
      containers:
      - name: nginx-kata
        image: dummy-registry.com/test/nginx:1.17.7
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: sandboxed-nginx
spec:
  type: NodePort
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  selector:
    name: sandboxed-nginx
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nginx
  labels:
    name: nginx
spec:
  selector:
    matchLabels:
      name: nginx
  template:
    metadata:
      labels:
        name: nginx
    spec:
      tolerations:
      # this toleration is to have the daemonset runnable on master nodes
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: nginx
        image: dummy-registry.com/test/nginx:1.17.7
        ports:
        - containerPort: 80
[server-001 ~]$ kubectl apply -f nginx.yaml
runtimeclass.node.k8s.io/kata-containers unchanged
deployment.apps/sandboxed-nginx created
service/sandboxed-nginx created
daemonset.apps/nginx created
Since you're using Kata Containers with the CRI-O runtime, your pod should have a RuntimeClass parameter, which it is missing.
You need to create a RuntimeClass object that points to the installed runtime. See the docs here for how to do that. Also, make sure that the CRI-O setup on worker-001 is correctly configured with k8s. Here is the documentation for that.
Afterwards, add a RuntimeClass parameter to your pod spec so that the container can actually run. The ContainerCreating stage is stuck because the pod controller cannot run CRI-O based containers unless the RuntimeClass is specified. Here is some documentation on understanding container runtimes.
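A minimal sketch of what that could look like, reusing the kata handler name from the crio.conf above (the pod name and structure here are illustrative, not a tested manifest):
# Sketch: RuntimeClass pointing at the "kata" handler from [crio.runtime.runtimes.kata],
# plus a pod that opts into it via runtimeClassName.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-containers    # arbitrary name, referenced by pods below
handler: kata              # must match the CRI-O runtime handler name
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata-test    # illustrative pod name
spec:
  runtimeClassName: kata-containers   # tells kubelet/CRI-O to run this pod with the kata runtime
  containers:
  - name: nginx
    image: dummy-registry.com/test/nginx:1.17.7
    ports:
    - containerPort: 80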

can't get custom metrics for hpa from datadog

Hey guys, I'm trying to set up Datadog as a custom metrics provider for my Kubernetes HPA using the official guide:
https://docs.datadoghq.com/agent/cluster_agent/external_metrics/?tab=helm
Running on EKS 1.18 and Datadog Cluster Agent (v1.10.0).
The problem is that I can't get the external metrics for my HPA:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hibob-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: something
  metrics:
  - type: External
    external:
      metricName: kubernetes_state.container.cpu_limit
      metricSelector:
        matchLabels:
          pod: something-54c4bd4db7-pm9q5
      targetAverageValue: 9
horizontal-pod-autoscaler unable to get external metric:
canary/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_app_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)
These are the errors I'm getting inside the cluster-agent:
datadog-cluster-agent-585897dc8d-x8l82 cluster-agent 2021-08-20 06:46:14 UTC | CLUSTER | ERROR | (pkg/clusteragent/externalmetrics/metrics_retriever.go:77 in retrieveMetricsValues) | Unable to fetch external metrics: [Error while executing metric query avg:nginx.net.request_per_s{kubea_app_name:ingress-nginx}.rollup(30): API error 403 Forbidden: {"status":********#datadoghq.com"}, strconv.Atoi: parsing "": invalid syntax]
# datadog-cluster-agent status
Getting the status from the agent.
2021-08-19 15:28:21 UTC | CLUSTER | WARN | (pkg/util/log/log.go:541 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
===============================
Datadog Cluster Agent (v1.10.0)
===============================
Status date: 2021-08-19 15:28:21.519850 UTC
Agent start: 2021-08-19 12:11:44.266244 UTC
Pid: 1
Go Version: go1.14.12
Build arch: amd64
Agent flavor: cluster_agent
Check Runners: 4
Log Level: INFO
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2021-08-19 15:28:21.519850 UTC
Hostnames
=========
ec2-hostname: ip-10-30-162-8.eu-west-1.compute.internal
hostname: i-00d0458844a597dec
instance-id: i-00d0458844a597dec
socket-fqdn: datadog-cluster-agent-585897dc8d-x8l82
socket-hostname: datadog-cluster-agent-585897dc8d-x8l82
hostname provider: aws
unused hostname providers:
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Metadata
========
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-585897dc8d-x8l82
Last Acquisition of the lease: Thu, 19 Aug 2021 12:13:14 UTC
Renewed leadership: Thu, 19 Aug 2021 15:28:07 UTC
Number of leader transitions: 17 transitions
Custom Metrics Server
=====================
External metrics provider uses DatadogMetric - Check status directly from Kubernetes with: `kubectl get datadogmetric`
Admission Controller
====================
Disabled: The admission controller is not enabled on the Cluster Agent
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
Total Runs: 787
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 660
Service Checks: Last Run: 3, Total: 2,343
Average Execution Time : 1.898s
Last Execution Date : 2021-08-19 15:28:17.000000 UTC
Last Successful Execution Date : 2021-08-19 15:28:17.000000 UTC
=========
Forwarder
=========
Transactions
============
Deployments: 350
Dropped: 0
DroppedOnInput: 0
Nodes: 497
Pods: 3
ReplicaSets: 576
Requeued: 0
Retried: 0
RetryQueueSize: 0
Services: 263
Transaction Successes
=====================
Total number: 3442
Successes By Endpoint:
check_run_v1: 786
intake: 181
orchestrator: 1,689
series_v1: 786
==========
Endpoints
==========
https://app.datadoghq.eu - API Key ending with:
- f295b
=====================
Orchestrator Explorer
=====================
ClusterID: f7b4f97a-3cf2-11ea-aaa8-0a158f39909c
ClusterName: production
ContainerScrubbing: Enabled
======================
Orchestrator Endpoints
======================
===============
Forwarder Stats
===============
Pods: 3
Deployments: 350
ReplicaSets: 576
Services: 263
Nodes: 497
===========
Cache Stats
===========
Elements in the cache: 393
Pods:
Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 7 Miss: 5)
Deployments:
Last Run: (Hits: 36 Miss: 1) | Total: (Hits: 40846 Miss: 2444)
ReplicaSets:
Last Run: (Hits: 297 Miss: 1) | Total: (Hits: 328997 Miss: 19441)
Services:
Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 49520 Miss: 2919)
Nodes:
Last Run: (Hits: 9 Miss: 0) | Total: (Hits: 10171 Miss: 755)
And this is what I get from the DatadogMetric:
Name: dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
Namespace: datadog
Labels: <none>
Annotations: <none>
API Version: datadoghq.com/v1alpha1
Kind: DatadogMetric
Metadata:
Creation Timestamp: 2021-08-19T15:14:14Z
Generation: 1
Managed Fields:
API Version: datadoghq.com/v1alpha1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:status:
.:
f:autoscalerReferences:
f:conditions:
.:
k:{"type":"Active"}:
.:
f:lastTransitionTime:
f:lastUpdateTime:
f:status:
f:type:
k:{"type":"Error"}:
.:
f:lastTransitionTime:
f:lastUpdateTime:
f:message:
f:reason:
f:status:
f:type:
k:{"type":"Updated"}:
.:
f:lastTransitionTime:
f:lastUpdateTime:
f:status:
f:type:
k:{"type":"Valid"}:
.:
f:lastTransitionTime:
f:lastUpdateTime:
f:status:
f:type:
f:currentValue:
Manager: datadog-cluster-agent
Operation: Update
Time: 2021-08-19T15:14:44Z
Resource Version: 164942235
Self Link: /apis/datadoghq.com/v1alpha1/namespaces/datadog/datadogmetrics/dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
UID: 6e9919eb-19ca-4131-b079-4a8a9ac577bb
Spec:
External Metric Name: nginx.net.request_per_s
Query: avg:nginx.net.request_per_s{kube_app_name:nginx}.rollup(30)
Status:
Autoscaler References: canary/hibob-hpa
Conditions:
Last Transition Time: 2021-08-19T15:14:14Z
Last Update Time: 2021-08-19T15:53:14Z
Status: True
Type: Active
Last Transition Time: 2021-08-19T15:14:14Z
Last Update Time: 2021-08-19T15:53:14Z
Status: False
Type: Valid
Last Transition Time: 2021-08-19T15:14:14Z
Last Update Time: 2021-08-19T15:53:14Z
Status: True
Type: Updated
Last Transition Time: 2021-08-19T15:14:44Z
Last Update Time: 2021-08-19T15:53:14Z
Message: Global error (all queries) from backend
Reason: Unable to fetch data from Datadog
Status: True
Type: Error
Current Value: 0
Events: <none>
This is my cluster agent deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "18"
meta.helm.sh/release-name: datadog
meta.helm.sh/release-namespace: datadog
creationTimestamp: "2021-02-05T07:36:39Z"
generation: 18
labels:
app.kubernetes.io/instance: datadog
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: datadog
app.kubernetes.io/version: "7"
helm.sh/chart: datadog-2.7.0
name: datadog-cluster-agent
namespace: datadog
resourceVersion: "164881216"
selfLink: /apis/apps/v1/namespaces/datadog/deployments/datadog-cluster-agent
uid: ec52bb4b-62af-4007-9bab-d5d16c48e02c
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: datadog-cluster-agent
strategy:
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
type: RollingUpdate
template:
metadata:
annotations:
ad.datadoghq.com/cluster-agent.check_names: '["prometheus"]'
ad.datadoghq.com/cluster-agent.init_configs: '[{}]'
ad.datadoghq.com/cluster-agent.instances: |
[{
"prometheus_url": "http://%%host%%:5000/metrics",
"namespace": "datadog.cluster_agent",
"metrics": [
"go_goroutines", "go_memstats_*", "process_*",
"api_requests",
"datadog_requests", "external_metrics", "rate_limit_queries_*",
"cluster_checks_*"
]
}]
checksum/api_key: something
checksum/application_key: something
checksum/clusteragent_token: something
checksum/install_info: something
creationTimestamp: null
labels:
app: datadog-cluster-agent
name: datadog-cluster-agent
spec:
containers:
- env:
- name: DD_HEALTH_PORT
value: "5555"
- name: DD_API_KEY
valueFrom:
secretKeyRef:
key: api-key
name: datadog
optional: true
- name: DD_APP_KEY
valueFrom:
secretKeyRef:
key: app-key
name: datadog-appkey
- name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
value: "true"
- name: DD_EXTERNAL_METRICS_PROVIDER_PORT
value: "8443"
- name: DD_EXTERNAL_METRICS_PROVIDER_WPA_CONTROLLER
value: "false"
- name: DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD
value: "true"
- name: DD_EXTERNAL_METRICS_AGGREGATOR
value: avg
- name: DD_CLUSTER_NAME
value: production
- name: DD_SITE
value: datadoghq.eu
- name: DD_LOG_LEVEL
value: INFO
- name: DD_LEADER_ELECTION
value: "true"
- name: DD_COLLECT_KUBERNETES_EVENTS
value: "true"
- name: DD_CLUSTER_AGENT_KUBERNETES_SERVICE_NAME
value: datadog-cluster-agent
- name: DD_CLUSTER_AGENT_AUTH_TOKEN
valueFrom:
secretKeyRef:
key: token
name: datadog-cluster-agent
- name: DD_KUBE_RESOURCES_NAMESPACE
value: datadog
- name: DD_ORCHESTRATOR_EXPLORER_ENABLED
value: "true"
- name: DD_ORCHESTRATOR_EXPLORER_CONTAINER_SCRUBBING_ENABLED
value: "true"
- name: DD_COMPLIANCE_CONFIG_ENABLED
value: "false"
image: gcr.io/datadoghq/cluster-agent:1.10.0
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 6
httpGet:
path: /live
port: 5555
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 5
name: cluster-agent
ports:
- containerPort: 5005
name: agentport
protocol: TCP
- containerPort: 8443
name: metricsapi
protocol: TCP
readinessProbe:
failureThreshold: 6
httpGet:
path: /ready
port: 5555
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 5
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/datadog-agent/install_info
name: installinfo
readOnly: true
subPath: install_info
dnsConfig:
options:
- name: ndots
value: "3"
dnsPolicy: ClusterFirst
nodeSelector:
kubernetes.io/os: linux
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: datadog-cluster-agent
serviceAccountName: datadog-cluster-agent
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: datadog-installinfo
name: installinfo
status:
availableReplicas: 1
conditions:
- lastTransitionTime: "2021-05-13T15:46:33Z"
lastUpdateTime: "2021-05-13T15:46:33Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: "2021-02-05T07:36:39Z"
lastUpdateTime: "2021-08-19T12:12:06Z"
message: ReplicaSet "datadog-cluster-agent-585897dc8d" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
observedGeneration: 18
readyReplicas: 1
replicas: 1
updatedReplicas: 1
For the record, I got this sorted.
According to the Helm default values file, you must set the app key in order to use the metrics provider:
# datadog.appKey -- Datadog APP key required to use metricsProvider
## If you are using clusterAgent.metricsProvider.enabled = true, you must set
## a Datadog application key for read access to your metrics.
appKey: # <DATADOG_APP_KEY>
I guess this is missing information in the docs, and also a check that is missing at cluster-agent startup. I'm going to open an issue about it.
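For reference, a rough sketch of the relevant values for the datadog Helm chart (key names as in the chart's values.yaml quoted above; treat the exact layout as an assumption about your chart version):
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>   # read access to metrics, required by the metrics provider
clusterAgent:
  metricsProvider:
    enabled: true             # serves the external metrics API on port 8443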
From the official documentation on troubleshooting the agent here, you have:
If you see the following error when describing the HPA manifest:
Warning FailedComputeMetricsReplicas 3s (x2 over 33s) horizontal-pod-autoscaler failed to get nginx.net.request_per_s external metric: unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)
Make sure the Datadog Cluster Agent is running, and the service exposing the port 8443, whose name is registered in the APIService, is up.
I believe the key phrase here is whose name is registered in the APIService. Did you perform the APIService registration for your external metrics service? This source should provide some details on how to set it up. Since you're getting 403 Forbidden errors, it suggests the TLS setup is causing issues.
Perhaps you can follow the guide in general and ensure that your node agent is functioning correctly and has the token environment variable correctly configured.
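If it helps, a rough way to verify that the external metrics APIService registration is healthy (the backing Service name below is a guess; use whatever spec.service actually references):
# Check the external metrics APIService and which Service backs it
kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml
# Confirm that Service exists, has endpoints, and exposes port 8443
kubectl -n datadog get svc,endpoints datadog-cluster-agent-metrics-api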

Why are there two services for one seldon deployment

I noticed that whenever I deploy a model, there are two services, e.g.:
kubectl get service -n model-namespace
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
iris-model-default ClusterIP 10.96.82.232 <none> 8000/TCP,5001/TCP 8h
iris-model-default-classifier ClusterIP 10.96.76.141 <none> 9000/TCP 8h
I wonder why we have two instead of one.
What are the three ports (8000, 9000, 5001) for, respectively? Which one should I use?
The manifest YAML is:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
  namespace: model-namespace
spec:
  name: iris
  predictors:
  - graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
from https://docs.seldon.io/projects/seldon-core/en/v1.1.0/workflow/quickstart.html
The CRD definition appears to be here in case it's useful.
k describe service/iris-model-default
Name: iris-model-default
Namespace: model-namespace
Labels: app.kubernetes.io/managed-by=seldon-core
seldon-app=iris-model-default
seldon-deployment-id=iris-model
Annotations: getambassador.io/config:
---
apiVersion: ambassador/v1
kind: Mapping
name: seldon_model-namespace_iris-model_default_rest_mapping
prefix: /seldon/model-namespace/iris-model/
rewrite: /
service: iris-model-default.model-namespace:8000
timeout_ms: 3000
---
apiVersion: ambassador/v1
kind: Mapping
name: seldon_model-namespace_iris-model_default_grpc_mapping
grpc: true
prefix: /(seldon.protos.*|tensorflow.serving.*)/.*
prefix_regex: true
rewrite: ""
service: iris-model-default.model-namespace:5001
timeout_ms: 3000
headers:
namespace: model-namespace
seldon: iris-model
Selector: seldon-app=iris-model-default
Type: ClusterIP
IP: 10.96.82.232
Port: http 8000/TCP
TargetPort: 8000/TCP
Endpoints: 172.18.0.17:8000
Port: grpc 5001/TCP
TargetPort: 8000/TCP
Endpoints: 172.18.0.17:8000
Session Affinity: None
Events: <none>
k describe service/iris-model-default-classifier
Name: iris-model-default-classifier
Namespace: model-namespace
Labels: app.kubernetes.io/managed-by=seldon-core
default=true
model=true
seldon-app-svc=iris-model-default-classifier
seldon-deployment-id=iris-model
Annotations: <none>
Selector: seldon-app-svc=iris-model-default-classifier
Type: ClusterIP
IP: 10.96.76.141
Port: http 9000/TCP
TargetPort: 9000/TCP
Endpoints: 172.18.0.17:9000
Session Affinity: None
Events: <none>
k get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
iris-model-default-0-classifier-579765fc5b-rm6np 2/2 Running 0 10h app.kubernetes.io/managed-by=seldon-core,app=iris-model-default-0-classifier,fluentd=true,pod-template-hash=579765fc5b,seldon-app-svc=iris-model-default-classifier,seldon-app=iris-model-default,seldon-deployment-id=iris-model,version=default
So only one pod is involved; I'm getting the idea that these ports are mapped to different containers:
k get pods -o json | jq '.items[].spec.containers[] | .name, .ports'
"classifier"
[
{
"containerPort": 6000,
"name": "metrics",
"protocol": "TCP"
},
{
"containerPort": 9000,
"name": "http",
"protocol": "TCP"
}
]
"seldon-container-engine"
[
{
"containerPort": 8000,
"protocol": "TCP"
},
{
"containerPort": 8000,
"name": "metrics",
"protocol": "TCP"
}
]
A more Seldon-specific question is: why are so many ports needed?
Yes, looks like you have 2 containers in your pod.
The first service:
iris-model-default ➡️ seldon-container-engine, HTTP 8000:8000 and gRPC 5001:8000
The second service:
iris-model-default-classifier ➡️ classifier, HTTP 9000:9000 (6000 is used internally, apparently for metrics)
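So prediction traffic would normally go through iris-model-default on port 8000. As a sketch, assuming the usual Seldon Core v1 REST convention for the /api/v1.0/predictions path and payload (adjust to your own graph):
# Port-forward the orchestrator service and send a test prediction (illustrative)
kubectl -n model-namespace port-forward svc/iris-model-default 8000:8000 &
curl -s -X POST http://localhost:8000/api/v1.0/predictions \
  -H 'Content-Type: application/json' \
  -d '{"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}}'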
You didn't mention it, but it sounds like you deployed the classifier:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
  namespace: seldon
spec:
  name: iris
  predictors:
  - graph:
      implementation: SKLEARN_SERVER
      modelUri: gs://seldon-models/sklearn/iris
      name: classifier
    name: default
    replicas: 1
If you'd like to find out the rationale behind the two containers/services, you might have to dig into the operator itself 🔧.

Scale deployment based on custom metric

I'm trying to scale a deployment based on a custom metric coming from a custom metric server. I deployed my server and when I do
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/kubernetes/test-metric"
I get back this JSON
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/kubernetes/test-metric"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "default",
        "name": "kubernetes",
        "apiVersion": "/v1"
      },
      "metricName": "test-metric",
      "timestamp": "2019-01-26T02:36:19Z",
      "value": "300m",
      "selector": null
    }
  ]
}
Then I created my hpa.yml using this
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test-all-deployment
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-all-deployment
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: kubernetes
        apiVersion: custom.metrics.k8s.io/v1beta1
      metricName: test-metric
      targetValue: 200m
but it doesn't scale and I'm not sure what is wrong. Running kubectl get hpa returns:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
test-all-deployment Deployment/test-all-deployment <unknown>/200m 1 10 1 9m
The part I'm not sure about is the target object in the metrics collection in the hpa definition. Looking at the doc here https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
It has
describedObject:
  apiVersion: extensions/v1beta1
  kind: Ingress
  name: main-route
target:
  kind: Value
  value: 10k
But that gives me a validation error for API v2beta1, and looking at the actual object here https://github.com/kubernetes/api/blob/master/autoscaling/v2beta1/types.go#L296 it doesn't seem to match. I don't know how to specify that with the v2beta1 API.
It looks like there is a mistake in the documentation: in the same example, two different API versions are used.
autoscaling/v2beta1 notation:
- type: Pods
  pods:
    metric:
      name: packets-per-second
    targetAverageValue: 1k
autoscaling/v2beta2 notation:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 50
There is a difference between autoscaling/v2beta1 and autoscaling/v2beta2 APIs:
kubectl get hpa.v2beta1.autoscaling -o yaml --export > hpa2b1-export.yaml
kubectl get hpa.v2beta2.autoscaling -o yaml --export > hpa2b2-export.yaml
diff -y hpa2b1-export.yaml hpa2b2-export.yaml
#hpa.v2beta1.autoscaling hpa.v2beta2.autoscaling
#-----------------------------------------------------------------------------------
apiVersion: v1 apiVersion: v1
items: items:
- apiVersion: autoscaling/v2beta1 | - apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler kind: HorizontalPodAutoscaler
metadata: metadata:
creationTimestamp: "2019-03-21T13:17:47Z" creationTimestamp: "2019-03-21T13:17:47Z"
name: php-apache name: php-apache
namespace: default namespace: default
resourceVersion: "8441304" resourceVersion: "8441304"
selfLink: /apis/autoscaling/v2beta1/namespaces/default/ho | selfLink: /apis/autoscaling/v2beta2/namespaces/default/ho
uid: b8490a0a-4bdb-11e9-9043-42010a9c0003 uid: b8490a0a-4bdb-11e9-9043-42010a9c0003
spec: spec:
maxReplicas: 10 maxReplicas: 10
metrics: metrics:
- resource: - resource:
name: cpu name: cpu
targetAverageUtilization: 50 | target:
> averageUtilization: 50
> type: Utilization
type: Resource type: Resource
minReplicas: 1 minReplicas: 1
scaleTargetRef: scaleTargetRef:
apiVersion: extensions/v1beta1 apiVersion: extensions/v1beta1
kind: Deployment kind: Deployment
name: php-apache name: php-apache
status: status:
conditions: conditions:
- lastTransitionTime: "2019-03-21T13:18:02Z" - lastTransitionTime: "2019-03-21T13:18:02Z"
message: recommended size matches current size message: recommended size matches current size
reason: ReadyForNewScale reason: ReadyForNewScale
status: "True" status: "True"
type: AbleToScale type: AbleToScale
- lastTransitionTime: "2019-03-21T13:18:47Z" - lastTransitionTime: "2019-03-21T13:18:47Z"
message: the HPA was able to successfully calculate a r message: the HPA was able to successfully calculate a r
resource utilization (percentage of request) resource utilization (percentage of request)
reason: ValidMetricFound reason: ValidMetricFound
status: "True" status: "True"
type: ScalingActive type: ScalingActive
- lastTransitionTime: "2019-03-21T13:23:13Z" - lastTransitionTime: "2019-03-21T13:23:13Z"
message: the desired replica count is increasing faster message: the desired replica count is increasing faster
rate rate
reason: TooFewReplicas reason: TooFewReplicas
status: "True" status: "True"
type: ScalingLimited type: ScalingLimited
currentMetrics: currentMetrics:
- resource: - resource:
currentAverageUtilization: 0 | current:
currentAverageValue: 1m | averageUtilization: 0
> averageValue: 1m
name: cpu name: cpu
type: Resource type: Resource
currentReplicas: 1 currentReplicas: 1
desiredReplicas: 1 desiredReplicas: 1
kind: List kind: List
metadata: metadata:
resourceVersion: "" resourceVersion: ""
selfLink: "" selfLink: ""
Here is how the object definition is supposed to look:
#hpa.v2beta1.autoscaling
type: Object
object:
  metric:
    name: requests-per-second
  describedObject:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: main-route
  targetValue: 2k

#hpa.v2beta2.autoscaling
type: Object
object:
  metric:
    name: requests-per-second
  describedObject:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: main-route
  target:
    type: Value
    value: 2k
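Applied to the metric from the question, a sketch of the Object metric in autoscaling/v2beta2 notation might be (here describedObject points at the Kubernetes Service the metric describes, so apiVersion v1; this is an untested illustration, not a verified fix):
metrics:
- type: Object
  object:
    metric:
      name: test-metric
    describedObject:
      apiVersion: v1
      kind: Service
      name: kubernetes
    target:
      type: Value
      value: 200m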

kubernetes coreos rbd storageclass

I want to use a k8s StorageClass under CoreOS, but it fails.
- CoreOS version is stable (1122.2)
- Hyperkube version is v1.4.3_coreos.0
- The k8s cluster was deployed with the coreos-kubernetes script, and rkt_opts were modified for rbd as recommended by kubelet-wrapper.md
- Ceph version is jewel; I have mounted an rbd image on CoreOS and it works well.
Now I am trying to use a PVC in pods, referring to the official Kubernetes document https://github.com/kubernetes/kubernetes/tree/master/examples/experimental/persistent-volume-provisioning
The config files:
ceph-secret-admin.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret-admin
  namespace: kube-system
data:
  key: QVFDTEl2NVg5c0U2R1JBQVRYVVVRdUZncDRCV294WUJtME1hcFE9PQ==
ceph-secret-user.yaml:
apiVersion: v1
kind: Secret
metadata:
  name: ceph-secret-user
data:
  key: QVFDTEl2NVg5c0U2R1JBQVRYVVVRdUZncDRCV294WUJtME1hcFE9PQ==
rbd-storage-class.yaml:
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: kubepool
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: 'true'
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.199.134.2:6789,10.199.134.3:6789,10.199.134.4:6789
  adminId: rbd
  adminSecretName: ceph-secret-admin
  adminSecretNamespace: kube-system
  pool: rbd
  userId: rbd
  userSecretName: ceph-secret-user
claim1.json:
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "claim1",
    "annotations": {
      "volume.beta.kubernetes.io/storage-class": "kubepool"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "3Gi"
      }
    }
  }
}
The secrets create OK and the StorageClass creation seems OK, but it can't be described (no description has been implemented for "StorageClass"). When I create the PVC, its status is always Pending. Describing it:
Name: claim1
Namespace: default
Status: Pending
Volume:
Labels: <none>
Capacity:
Access Modes:
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
16m 14s 66 {persistentvolume-controller } Warning ProvisioningFailed no volume plugin matched
Could someone help me?