Automatic labels for Prometheus Alertmanager rules - Kubernetes

I'm using the prometheus-community/prometheus chart.
I'd like to add the following labels automatically to any Alertmanager rule that fires:
env=prod
cluster=project-prod-eks
so that I don't have to add these labels manually to each alert rule:
- alert: NGINXTooMany400s
  expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
  for: 1m
  labels:
    severity: warning
    env: prod
    cluster: project-prod-eks   # <--------------- HOW to inject them?
  annotations:
    description: Too many 4XXs
    summary: More than 5% of all requests returned 4XX, this requires your attention
so that I can do something like
- alert: NGINXTooMany400s
  expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
  for: 1m
  labels:
    severity: warning
  annotations:
    description: Too many 4XXs on {{ $labels.env }} / {{ $labels.cluster }}   # <----- THIS
    summary: More than 5% of all requests returned 4XX, this requires your attention
Any ideas?

You can add external_labels to your prometheus.yml:
global:
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    env: prod
    cluster: project-prod-eks
The community chart has it in values.yaml:
serverFiles:
  prometheus.yml:
    global:
      external_labels:
        foo: bar
...

I did it slightly differently than modifying serverFiles; see below:
server:
  nodeSelector:
    prometheus: "true"
  baseURL: "https://prometheus.project.io"
  enabled: true
  retention: "30d"
  strategy:
    type: RollingUpdate
  global:
    scrape_interval: 30s
    external_labels:
      env: prod
      client: client-name
      project: project-name
      cluster: project-prod-eks
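One caveat worth noting: external_labels are attached when Prometheus sends alerts to Alertmanager, so in Alertmanager routing they behave like ordinary alert labels, but inside the rule's own annotation templates they are exposed through $externalLabels rather than $labels (assuming a reasonably recent Prometheus). The annotation from the question would then read:
annotations:
  description: Too many 4XXs on {{ $externalLabels.env }} / {{ $externalLabels.cluster }}
  summary: More than 5% of all requests returned 4XX, this requires your attention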

Kafka Connect Connector configuration is invalid and contains the following 2 error(s): Invalid value ... could not be found

I'm trying to make use of the Schema Registry Transfer SMT plugin, but I get the following error:
PUT /connectors/kafka-connector-sb-16-sb-16/config returned 400 (Bad Request): Connector configuration is invalid and contains the following 2 error(s):
      Invalid value cricket.jmoore.kafka.connect.transforms.SchemaRegistryTransfer for configuration transforms.AvroSchemaTransfer.type: Class cricket.jmoore.kafka.connect.transforms.SchemaRegistryTransfer could not be found.
      Invalid value null for configuration transforms.AvroSchemaTransfer.type: Not a Transformation
      You can also find the above list of errors at the endpoint `/connector-plugins/{connectorType}/config/validate`
I use the Strimzi operator to manage my Kafka objects in Kubernetes.
Kafka Connect config:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: {{.Release.Name}}-{{.Values.ENVNAME}}
  annotations:
    strimzi.io/use-connector-resources: "true"
  labels:
    app: {{.Release.Name}}
    # owner: {{ .Values.owner | default "NotSet" }}
    # branch: {{ .Values.branch | default "NotSet" | trunc 56 | trimSuffix "-" }}
spec:
  logging:
    type: inline
    loggers:
      connect.root.logger.level: "DEBUG"
  template:
    pod:
      imagePullSecrets:
        - name: wecapacrdev001-azurecr-io-pull-secret
      metadata:
        annotations:
          prometheus.io/path: /metrics
          prometheus.io/port: "9404"
          prometheus.io/scrape: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - arm64
  authentication:
    type: scram-sha-512
    username: {{.Values.ENVNAME}}
    passwordSecret:
      secretName: {{.Values.ENVNAME}}
      password: password
  bootstrapServers: {{.Values.ENVNAME}}-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: {{.Values.ENVNAME}}-cluster-ca-cert
        certificate: ca.crt
  config:
    group.id: {{.Values.ENVNAME}}-connect-cluster
    offset.storage.topic: {{.Values.ENVNAME}}-connect-cluster-offsets
    config.storage.topic: {{.Values.ENVNAME}}-connect-cluster-configs
    status.storage.topic: {{.Values.ENVNAME}}-connect-cluster-status
    config.storage.replication.factor: -1
    offset.storage.replication.factor: -1
    status.storage.replication.factor: -1
    key.converter: org.apache.kafka.connect.converters.ByteArrayConverter
    value.converter: org.apache.kafka.connect.converters.ByteArrayConverter
  image: capcr.azurecr.io/devops/kafka:0.27.1-kafka-2.8.0-arm64-v2.7
  jvmOptions:
    -Xms: 3g
    -Xmx: 3g
  livenessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 15
  metricsConfig:
    type: jmxPrometheusExporter
    valueFrom:
      configMapKeyRef:
        key: metrics-config.yml
        name: {{.Release.Name}}-{{.Values.ENVNAME}}
  readinessProbe:
    initialDelaySeconds: 300
    timeoutSeconds: 15
  replicas: 3
  resources:
    limits:
      cpu: 1000m
      memory: 4Gi
    requests:
      cpu: 100m
      memory: 3Gi
Connector config:
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: {{.Release.Name}}-{{.Values.ENVNAME}}
  labels:
    strimzi.io/cluster: kafka-connect-cluster-{{.Values.ENVNAME}}
    app: {{.Release.Name}}
    # owner: {{ .Values.owner | default "NotSet" }}
    # branch: {{ .Values.branch | default "NotSet" | trunc 56 | trimSuffix "-" }}
spec:
  class: org.apache.kafka.connect.mirror.MirrorSourceConnector
  config:
    offset-syncs.topic.replication.factor: "3"
    refresh.topics.interval.seconds: "60"
    replication.factor: "3"
    key.converter: org.apache.kafka.connect.converters.ByteArrayConverter
    value.converter: org.apache.kafka.connect.converters.ByteArrayConverter
    source.cluster.alias: SPLITTED
    source.cluster.auto.commit.interval.ms: "3000"
    source.cluster.auto.offset.reset: earliest
    source.cluster.bootstrap.servers: {{.Values.ENVNAME}}-kafka-bootstrap:9093
    source.cluster.fetch.max.bytes: "60502835"
    source.cluster.group.id: {{.Values.ENVNAME}}-mirroring
    source.cluster.max.poll.records: "100"
    source.cluster.max.request.size: "60502835"
    source.cluster.offset.storage.topic: {{.Values.ENVNAME}}-connect-cluster-offsets
    source.cluster.config.storage.topic: {{.Values.ENVNAME}}-connect-cluster-configs
    source.cluster.status.storage.topic: {{.Values.ENVNAME}}-connect-cluster-status
    source.cluster.producer.compression.type: gzip
    source.cluster.replica.fetch.max.bytes: "60502835"
    source.cluster.sasl.jaas.config: org.apache.kafka.common.security.scram.ScramLoginModule required username="{{.Values.ENVNAME}}" password="{{.Values.kafka.password}}";
    source.cluster.sasl.mechanism: SCRAM-SHA-512
    source.cluster.security.protocol: SASL_SSL
    source.cluster.ssl.keystore.location: /opt/kafka/keystore/kafka_keystore.p12
    source.cluster.ssl.keystore.password: password
    source.cluster.ssl.keystore.type: PKCS12
    source.cluster.ssl.truststore.location: /opt/kafka/keystore/kafka_keystore.p12
    source.cluster.ssl.truststore.password: password
    source.cluster.ssl.truststore.type: PKCS12
    sync.topic.acls.enabled: "false"
    target.cluster.alias: AGGREGATED
    target.cluster.auto.commit.interval.ms: "3000"
    target.cluster.auto.offset.reset: earliest
    target.cluster.bootstrap.servers: {{.Values.ENVNAME}}-kafka-bootstrap:9093
    target.cluster.fetch.max.bytes: "60502835"
    target.cluster.group.id: {{.Values.ENVNAME}}-test
    target.cluster.max.poll.records: "100"
    target.cluster.max.request.size: "60502835"
    target.cluster.offset.storage.topic: {{.Values.ENVNAME}}-connect-cluster-offsets
    target.cluster.config.storage.topic: {{.Values.ENVNAME}}-connect-cluster-configs
    target.cluster.status.storage.topic: {{.Values.ENVNAME}}-connect-cluster-status
    target.cluster.producer.compression.type: gzip
    target.cluster.replica.fetch.max.bytes: "60502835"
    target.cluster.sasl.jaas.config: org.apache.kafka.common.security.scram.ScramLoginModule required username="{{.Values.ENVNAME}}" password="{{.Values.kafka.password}}";
    target.cluster.sasl.mechanism: SCRAM-SHA-512
    target.cluster.security.protocol: SASL_SSL
    target.cluster.ssl.keystore.location: /opt/kafka/keystore/kafka_keystore.p12
    target.cluster.ssl.keystore.password: password
    target.cluster.ssl.keystore.type: PKCS12
    target.cluster.ssl.truststore.location: /opt/kafka/keystore/kafka_keystore.p12
    target.cluster.ssl.truststore.password: password
    target.cluster.ssl.truststore.type: PKCS12
    tasks.max: "4"
    topics: ^topic-to-be-replicated$
    topics.blacklist: ""
    transforms: AvroSchemaTransfer
    transforms.AvroSchemaTransfer.dest.schema.registry.url: http://schemaregistry-{{.Values.ENVNAME}}-cp-schema-registry:8081
    transforms.AvroSchemaTransfer.src.schema.registry.url: http://schemaregistry-{{.Values.ENVNAME}}-cp-schema-registry:8081
    transforms.AvroSchemaTransfer.transfer.message.keys: "false"
    transforms.AvroSchemaTransfer.type: cricket.jmoore.kafka.connect.transforms.SchemaRegistryTransfer
    transforms.TopicRename.regex: (.*)
    transforms.TopicRename.replacement: replica.$1
    transforms.TopicRename.type: org.apache.kafka.connect.transforms.RegexRouter
  tasksMax: 4
I tried to build the plugin and build the Docker image as described here:
ARG STRIMZI_BASE_IMAGE=0.29.0-kafka-3.1.1-arm64
FROM quay.io/strimzi/kafka:$STRIMZI_BASE_IMAGE
USER root:root
COPY schema-registry-transfer-smt/target/schema-registry-transfer-smt-0.2.1.jar /opt/kafka/plugins/
USER 1001
I can see in the logs that the plugin is loaded:
INFO Registered loader: PluginClassLoader{pluginLocation=file:/opt/kafka/plugins/schema-registry-transfer-smt-0.2.1.jar} (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader) [main]
INFO Loading plugin from: /opt/kafka/plugins/schema-registry-transfer-smt-0.2.1.jar (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader) [main]
DEBUG Loading plugin urls: [file:/opt/kafka/plugins/schema-registry-transfer-smt-0.2.1.jar] (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader) [main]
However, when I execute kubectl describe KafkaConnect ..., I get the following status, which probably means the plugin is not loaded:
Status:
  Conditions:
    Last Transition Time:  2022-10-31T07:22:43.749290Z
    Status:                True
    Type:                  Ready
  Connector Plugins:
    Class:    org.apache.kafka.connect.file.FileStreamSinkConnector
    Type:     sink
    Version:  2.8.0
    Class:    org.apache.kafka.connect.file.FileStreamSourceConnector
    Type:     source
    Version:  2.8.0
    Class:    org.apache.kafka.connect.mirror.MirrorCheckpointConnector
    Type:     source
    Version:  1
    Class:    org.apache.kafka.connect.mirror.MirrorHeartbeatConnector
    Type:     source
    Version:  1
    Class:    org.apache.kafka.connect.mirror.MirrorSourceConnector
    Type:     source
    Version:  1
The log prints the classpath (jvm.classpath = ...) and it does not include schema-registry-transfer-smt-0.2.1.jar.
This looks like a configuration error. I tried creating and using a fat jar, compiling with Java 11, changing the scope of the provided dependencies, and more. The repository's last commit is from 2019. Does it make sense to bump all dependency versions? The error message doesn't suggest that.
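Not an answer to the classloading question itself, but worth mentioning: Strimzi (0.22 and later) can build the Connect image for you via spec.build in the KafkaConnect resource, which takes the hand-rolled Dockerfile out of the picture. A minimal sketch, where the output image and the artifact URL are placeholders to adapt:
spec:
  build:
    output:
      type: docker
      image: capcr.azurecr.io/devops/kafka-connect-build:latest  # placeholder target image
    plugins:
      - name: schema-registry-transfer-smt
        artifacts:
          - type: jar
            url: https://example.com/schema-registry-transfer-smt-0.2.1.jar  # placeholder URL
The operator then builds and pushes the image and points the Connect pods at it, so the plugin lands under plugin.path in its own directory.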

GPU nodegroup in EKS

I am not able to create a node group with GPU instance types on EKS; I'm getting this error from CloudFormation:
[!] retryable error (Throttling: Rate exceeded status code: 400, request id: 1e091568-812c-45a5-860b-d0d028513d28) from cloudformation/DescribeStacks - will retry after delay of 988.442104ms
This is my clusterconfig.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER_NAME
  region: AWS_REGION
nodeGroups:
  - name: NODE_GROUP_NAME_GPU
    ami: auto
    minSize: MIN_SIZE
    maxSize: MAX_SIZE
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 1
    privateNetworking: true
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: [SECURITY_GROUPS]
    iam:
      instanceProfileARN: IAM_PROFILE_ARN
      instanceRoleARN: IAM_ROLE_ARN
    ssh:
      allow: true
      publicKeyPath: '----'
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
The resolution was to install the NVIDIA device plugin on the cluster so that it identifies the GPU nodes.
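Once the NVIDIA device plugin DaemonSet is running, the GPU nodes advertise nvidia.com/gpu as an allocatable resource. A quick smoke test, as a sketch (the image tag is only an example; the toleration matches the taint from the node group above):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu        # matches the node group taint above
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:11.0.3-base-ubuntu20.04   # example CUDA image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU so the pod lands on a GPU node
If the pod schedules and nvidia-smi prints the card, the cluster identifies the GPU nodes correctly.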

Error on Telegraf Helm Chart update: Error parsing data

I'm trying to deploy the Telegraf Helm chart on Kubernetes:
helm upgrade --install telegraf-instance -f values.yaml influxdata/telegraf
When I add the modbus input plugin with holding_registers, I get this error:
[telegraf] Error running agent: Error loading config file /etc/telegraf/telegraf.conf: Error parsing data: line 49: key `name' is in conflict with line 2fd
My values.yaml is below:
## Default values.yaml for Telegraf
## This is a YAML-formatted file.
## ref: https://hub.docker.com/r/library/telegraf/tags/
replicaCount: 1
image:
  repo: "telegraf"
  tag: "1.21.4"
  pullPolicy: IfNotPresent
podAnnotations: {}
podLabels: {}
imagePullSecrets: []
args: []
env:
  - name: HOSTNAME
    value: "telegraf-polling-service"
resources: {}
nodeSelector: {}
affinity: {}
tolerations: []
service:
  enabled: true
  type: ClusterIP
  annotations: {}
rbac:
  create: true
  clusterWide: false
  rules: []
serviceAccount:
  create: false
  name:
  annotations: {}
config:
  agent:
    interval: 60s
    round_interval: true
    metric_batch_size: 1000000
    metric_buffer_limit: 100000000
    collection_jitter: 0s
    flush_interval: 60s
    flush_jitter: 0s
    precision: ''
    hostname: '9825128'
    omit_hostname: false
  processors:
    - enum:
        mapping:
          field: "status"
          dest: "status_code"
          value_mappings:
            healthy: 1
            problem: 2
            critical: 3
  inputs:
    - modbus:
        name: "PS MAIN ENGINE"
        controller: 'tcp://192.168.0.101:502'
        slave_id: 1
        holding_registers:
          - name: "Coolant Level"
            byte_order: CDAB
            data_type: FLOAT32
            scale: 0.001
            address: [51410, 51411]
    - modbus:
        name: "SB MAIN ENGINE"
        controller: 'tcp://192.168.0.102:502'
        slave_id: 1
        holding_registers:
          - name: "Coolant Level"
            byte_order: CDAB
            data_type: FLOAT32
            scale: 0.001
            address: [51410, 51411]
  outputs:
    - influxdb_v2:
        token: token
        organization: organisation
        bucket: bucket
        urls:
          - "url"
metrics:
  health:
    enabled: true
    service_address: "http://:8888"
    threshold: 5000.0
  internal:
    enabled: true
    collect_memstats: false
pdb:
  create: true
  minAvailable: 1
The problem was resolved by doing the following:
1. Deleted the config section of my values.yaml.
2. Added my telegraf.conf to the /additional_config path.
3. Added a configmap to Kubernetes with the following command:
kubectl create configmap external-config --from-file=/additional_config
4. Added the following to values.yaml:
volumes:
  - name: my-config
    configMap:
      name: external-config
volumeMounts:
  - name: my-config
    mountPath: /additional_config
args:
  - "--config=/etc/telegraf/telegraf.conf"
  - "--config-directory=/additional_config"
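For reference, the file mounted at /additional_config carries the same inputs in Telegraf's native TOML instead of chart YAML, which sidesteps the chart's config rendering entirely. A sketch of what telegraf.conf might contain for one of the engines, reusing the values from above:
[[inputs.modbus]]
  name = "PS MAIN ENGINE"
  controller = "tcp://192.168.0.101:502"
  slave_id = 1
  holding_registers = [
    { name = "Coolant Level", byte_order = "CDAB", data_type = "FLOAT32", scale = 0.001, address = [51410, 51411] },
  ]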

Can't get custom metrics for HPA from Datadog

Hey guys, I'm trying to set up Datadog as a custom metrics provider for my Kubernetes HPA using the official guide:
https://docs.datadoghq.com/agent/cluster_agent/external_metrics/?tab=helm
I'm running on EKS 1.18 and Datadog Cluster Agent (v1.10.0).
The problem is that I can't get the external metrics for my HPA:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hibob-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: something
  metrics:
    - type: External
      external:
        metricName: kubernetes_state.container.cpu_limit
        metricSelector:
          matchLabels:
            pod: something-54c4bd4db7-pm9q5
        targetAverageValue: 9
horizontal-pod-autoscaler unable to get external metric:
canary/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_app_name: nginx,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)
These are the errors I'm getting inside the cluster-agent:
datadog-cluster-agent-585897dc8d-x8l82 cluster-agent 2021-08-20 06:46:14 UTC | CLUSTER | ERROR | (pkg/clusteragent/externalmetrics/metrics_retriever.go:77 in retrieveMetricsValues) | Unable to fetch external metrics: [Error while executing metric query avg:nginx.net.request_per_s{kubea_app_name:ingress-nginx}.rollup(30): API error 403 Forbidden: {"status":********#datadoghq.com"}, strconv.Atoi: parsing "": invalid syntax]
# datadog-cluster-agent status
Getting the status from the agent.
2021-08-19 15:28:21 UTC | CLUSTER | WARN | (pkg/util/log/log.go:541 in func1) | Agent configuration relax permissions constraint on the secret backend cmd, Group can read and exec
===============================
Datadog Cluster Agent (v1.10.0)
===============================
Status date: 2021-08-19 15:28:21.519850 UTC
Agent start: 2021-08-19 12:11:44.266244 UTC
Pid: 1
Go Version: go1.14.12
Build arch: amd64
Agent flavor: cluster_agent
Check Runners: 4
Log Level: INFO
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2021-08-19 15:28:21.519850 UTC
Hostnames
=========
ec2-hostname: ip-10-30-162-8.eu-west-1.compute.internal
hostname: i-00d0458844a597dec
instance-id: i-00d0458844a597dec
socket-fqdn: datadog-cluster-agent-585897dc8d-x8l82
socket-hostname: datadog-cluster-agent-585897dc8d-x8l82
hostname provider: aws
unused hostname providers:
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Metadata
========
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-585897dc8d-x8l82
Last Acquisition of the lease: Thu, 19 Aug 2021 12:13:14 UTC
Renewed leadership: Thu, 19 Aug 2021 15:28:07 UTC
Number of leader transitions: 17 transitions
Custom Metrics Server
=====================
External metrics provider uses DatadogMetric - Check status directly from Kubernetes with: `kubectl get datadogmetric`
Admission Controller
====================
Disabled: The admission controller is not enabled on the Cluster Agent
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
Total Runs: 787
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 660
Service Checks: Last Run: 3, Total: 2,343
Average Execution Time : 1.898s
Last Execution Date : 2021-08-19 15:28:17.000000 UTC
Last Successful Execution Date : 2021-08-19 15:28:17.000000 UTC
=========
Forwarder
=========
Transactions
============
Deployments: 350
Dropped: 0
DroppedOnInput: 0
Nodes: 497
Pods: 3
ReplicaSets: 576
Requeued: 0
Retried: 0
RetryQueueSize: 0
Services: 263
Transaction Successes
=====================
Total number: 3442
Successes By Endpoint:
check_run_v1: 786
intake: 181
orchestrator: 1,689
series_v1: 786
==========
Endpoints
==========
https://app.datadoghq.eu - API Key ending with:
- f295b
=====================
Orchestrator Explorer
=====================
ClusterID: f7b4f97a-3cf2-11ea-aaa8-0a158f39909c
ClusterName: production
ContainerScrubbing: Enabled
======================
Orchestrator Endpoints
======================
===============
Forwarder Stats
===============
Pods: 3
Deployments: 350
ReplicaSets: 576
Services: 263
Nodes: 497
===========
Cache Stats
===========
Elements in the cache: 393
Pods:
Last Run: (Hits: 0 Miss: 0) | Total: (Hits: 7 Miss: 5)
Deployments:
Last Run: (Hits: 36 Miss: 1) | Total: (Hits: 40846 Miss: 2444)
ReplicaSets:
Last Run: (Hits: 297 Miss: 1) | Total: (Hits: 328997 Miss: 19441)
Services:
Last Run: (Hits: 44 Miss: 0) | Total: (Hits: 49520 Miss: 2919)
Nodes:
Last Run: (Hits: 9 Miss: 0) | Total: (Hits: 10171 Miss: 755)
And this is what I get from the DatadogMetric:
Name:         dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
Namespace:    datadog
Labels:       <none>
Annotations:  <none>
API Version:  datadoghq.com/v1alpha1
Kind:         DatadogMetric
Metadata:
  Creation Timestamp:  2021-08-19T15:14:14Z
  Generation:          1
  Managed Fields:
    API Version:  datadoghq.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
      f:status:
        .:
        f:autoscalerReferences:
        f:conditions:
          .:
          k:{"type":"Active"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Error"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:message:
            f:reason:
            f:status:
            f:type:
          k:{"type":"Updated"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
          k:{"type":"Valid"}:
            .:
            f:lastTransitionTime:
            f:lastUpdateTime:
            f:status:
            f:type:
        f:currentValue:
    Manager:         datadog-cluster-agent
    Operation:       Update
    Time:            2021-08-19T15:14:44Z
  Resource Version:  164942235
  Self Link:         /apis/datadoghq.com/v1alpha1/namespaces/datadog/datadogmetrics/dcaautogen-2f116f4425658dca91a33dd22a3d943bae5b74
  UID:               6e9919eb-19ca-4131-b079-4a8a9ac577bb
Spec:
  External Metric Name:  nginx.net.request_per_s
  Query:                 avg:nginx.net.request_per_s{kube_app_name:nginx}.rollup(30)
Status:
  Autoscaler References:  canary/hibob-hpa
  Conditions:
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Active
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                False
    Type:                  Valid
    Last Transition Time:  2021-08-19T15:14:14Z
    Last Update Time:      2021-08-19T15:53:14Z
    Status:                True
    Type:                  Updated
    Last Transition Time:  2021-08-19T15:14:44Z
    Last Update Time:      2021-08-19T15:53:14Z
    Message:               Global error (all queries) from backend
    Reason:                Unable to fetch data from Datadog
    Status:                True
    Type:                  Error
  Current Value:           0
Events:                    <none>
This is my cluster agent deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "18"
    meta.helm.sh/release-name: datadog
    meta.helm.sh/release-namespace: datadog
  creationTimestamp: "2021-02-05T07:36:39Z"
  generation: 18
  labels:
    app.kubernetes.io/instance: datadog
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-2.7.0
  name: datadog-cluster-agent
  namespace: datadog
  resourceVersion: "164881216"
  selfLink: /apis/apps/v1/namespaces/datadog/deployments/datadog-cluster-agent
  uid: ec52bb4b-62af-4007-9bab-d5d16c48e02c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: datadog-cluster-agent
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        ad.datadoghq.com/cluster-agent.check_names: '["prometheus"]'
        ad.datadoghq.com/cluster-agent.init_configs: '[{}]'
        ad.datadoghq.com/cluster-agent.instances: |
          [{
            "prometheus_url": "http://%%host%%:5000/metrics",
            "namespace": "datadog.cluster_agent",
            "metrics": [
              "go_goroutines", "go_memstats_*", "process_*",
              "api_requests",
              "datadog_requests", "external_metrics", "rate_limit_queries_*",
              "cluster_checks_*"
            ]
          }]
        checksum/api_key: something
        checksum/application_key: something
        checksum/clusteragent_token: something
        checksum/install_info: something
      creationTimestamp: null
      labels:
        app: datadog-cluster-agent
      name: datadog-cluster-agent
    spec:
      containers:
      - env:
        - name: DD_HEALTH_PORT
          value: "5555"
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              key: api-key
              name: datadog
              optional: true
        - name: DD_APP_KEY
          valueFrom:
            secretKeyRef:
              key: app-key
              name: datadog-appkey
        - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
          value: "true"
        - name: DD_EXTERNAL_METRICS_PROVIDER_PORT
          value: "8443"
        - name: DD_EXTERNAL_METRICS_PROVIDER_WPA_CONTROLLER
          value: "false"
        - name: DD_EXTERNAL_METRICS_PROVIDER_USE_DATADOGMETRIC_CRD
          value: "true"
        - name: DD_EXTERNAL_METRICS_AGGREGATOR
          value: avg
        - name: DD_CLUSTER_NAME
          value: production
        - name: DD_SITE
          value: datadoghq.eu
        - name: DD_LOG_LEVEL
          value: INFO
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_CLUSTER_AGENT_KUBERNETES_SERVICE_NAME
          value: datadog-cluster-agent
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: datadog-cluster-agent
        - name: DD_KUBE_RESOURCES_NAMESPACE
          value: datadog
        - name: DD_ORCHESTRATOR_EXPLORER_ENABLED
          value: "true"
        - name: DD_ORCHESTRATOR_EXPLORER_CONTAINER_SCRUBBING_ENABLED
          value: "true"
        - name: DD_COMPLIANCE_CONFIG_ENABLED
          value: "false"
        image: gcr.io/datadoghq/cluster-agent:1.10.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 6
          httpGet:
            path: /live
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        name: cluster-agent
        ports:
        - containerPort: 5005
          name: agentport
          protocol: TCP
        - containerPort: 8443
          name: metricsapi
          protocol: TCP
        readinessProbe:
          failureThreshold: 6
          httpGet:
            path: /ready
            port: 5555
            scheme: HTTP
          initialDelaySeconds: 15
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 5
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/datadog-agent/install_info
          name: installinfo
          readOnly: true
          subPath: install_info
      dnsConfig:
        options:
        - name: ndots
          value: "3"
      dnsPolicy: ClusterFirst
      nodeSelector:
        kubernetes.io/os: linux
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: datadog-cluster-agent
      serviceAccountName: datadog-cluster-agent
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: datadog-installinfo
        name: installinfo
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-05-13T15:46:33Z"
    lastUpdateTime: "2021-05-13T15:46:33Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-02-05T07:36:39Z"
    lastUpdateTime: "2021-08-19T12:12:06Z"
    message: ReplicaSet "datadog-cluster-agent-585897dc8d" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 18
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
For the record, I got this sorted.
According to the Helm default values file, you must set the app key in order to use the metrics provider:
# datadog.appKey -- Datadog APP key required to use metricsProvider
## If you are using clusterAgent.metricsProvider.enabled = true, you must set
## a Datadog application key for read access to your metrics.
appKey: # <DATADOG_APP_KEY>
I guess this is a lack of information in the docs and also a check that is missing at the cluster-agent startup. Going to open an issue about it.
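Put together, the relevant chart values would look something like this (a sketch; key names as in the datadog chart, values are placeholders):
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>   # read access to metrics; required by the metrics provider
clusterAgent:
  metricsProvider:
    enabled: true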
From the official documentation on troubleshooting the agent here, you have:
If you see the following error when describing the HPA manifest:
Warning FailedComputeMetricsReplicas 3s (x2 over 33s) horizontal-pod-autoscaler failed to get nginx.net.request_per_s external metric: unable to get external metric default/nginx.net.request_per_s/&LabelSelector{MatchLabels:map[string]string{kube_container_name: nginx,},MatchExpressions:[],}: unable to fetch metrics from external metrics API: the server is currently unable to handle the request (get nginx.net.request_per_s.external.metrics.k8s.io)
Make sure the Datadog Cluster Agent is running, and the service exposing the port 8443, whose name is registered in the APIService, is up.
I believe the key phrase here is whose name is registered in the APIService. Did you perform the APIService registration for your external metrics service? This source should provide some details on how to set it up. Since you're getting 403 Forbidden errors, it implies the authentication/TLS setup is causing issues.
Perhaps you can follow the guide in general and ensure that your node agent is functioning correctly and has the token environment variable correctly configured.
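For reference, the APIService registration that answer refers to looks roughly like this; the service name and port are assumptions based on the chart defaults, so verify with kubectl get apiservice v1beta1.external.metrics.k8s.io:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  version: v1beta1
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true
  service:
    name: datadog-cluster-agent-metrics-api   # assumed default service name from the chart
    namespace: datadog
    port: 8443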

How to match the PrometheusRule to the AlertmanagerConfig with Prometheus Operator

I have multiple PrometheusRules (rule a, rule b), and each rule defines a different expression that constrains the alert. I also have different AlertmanagerConfigs (one receiver is Slack, the other one's receiver is Opsgenie). How can we make a connection between the rules and the AlertmanagerConfigs? For example: if rule a is triggered, I want to send a message to Slack; if rule b is triggered, I want to send a message to Opsgenie.
Here is what I tried; however, it does not work. Did I miss something?
This is the PrometheusRule file:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: service-prometheus
    role: alert-rules
    app: kube-prometheus-stack
    release: monitoring-prom
  name: rule_a
  namespace: monitoring
spec:
  groups:
    - name: rule_a_alert
      rules:
        - alert: usage_exceed
          expr: salesforce_api_usage > 100000
          labels:
            severity: urgent
This is the AlertmanagerConfig:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    alertmanagerConfig: slack
  name: slack
  namespace: monitoring
  resourceVersion: "25842935"
  selfLink: /apis/monitoring.coreos.com/v1alpha1/namespaces/monitoring/alertmanagerconfigs/opsgenie-and-slack
  uid: fbb74924-5186-4929-b363-8c056e401921
spec:
  receivers:
    - name: slack-receiver
      slackConfigs:
        - apiURL:
            key: apiURL
            name: slack-config
  route:
    groupBy:
      - job
    groupInterval: 60s
    groupWait: 60s
    receiver: slack-receiver
    repeatInterval: 1m
    routes:
      - matchers:
          - name: job
            value: service_a
        receiver: slack-receiver
You need to match on a label of the alert. In your case, you're trying to match on the label job with the value service_a, which doesn't exist. You could either match on a label that does exist in the PrometheusRule file, e.g. severity, by changing the matcher in the AlertmanagerConfig file:
route:
  routes:
    - matchers:
        - name: severity
          value: urgent
      receiver: slack-receiver
or you could add another label to the PrometheusRule file:
spec:
  groups:
    - name: rule_a_alert
      rules:
        - alert: usage_exceed
          expr: salesforce_api_usage > 100000
          labels:
            severity: urgent
            job: service_a
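For the Opsgenie side, a second AlertmanagerConfig follows the same pattern. A sketch, assuming rule b carries a label such as severity: critical and the API key lives in a Secret named opsgenie-config (both placeholders):
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: opsgenie
  namespace: monitoring
spec:
  receivers:
    - name: opsgenie-receiver
      opsgenieConfigs:
        - apiKey:
            name: opsgenie-config   # placeholder Secret name
            key: apiKey
  route:
    receiver: opsgenie-receiver
    matchers:
      - name: severity
        value: critical             # whatever label rule b sets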