I am trying to create alerts in Prometheus on Kubernetes and sending them to a Slack channel. For this i am using the prometheus-community helm-charts (which already includes the alertmanager). As i want to use my own alerts I have also created an values.yml (shown below) strongly inspired from here.
If I port forward Prometheus I can see my Alert there going from inactive, to pending to firing, but no message is sent to slack. I am quite confident that my alertmanager configuration is fine (as I have tested it with some prebuild alerts of another chart and they were sent to slack). So my best guess is that I add the alert in the wrong way (in the serverFiles part), but I can not figure out how to do it correctly. Also, the alertmanager logs look pretty normal to me. Does anyone have an idea where my problem comes from?
---
serverFiles:
alerting_rules.yml:
groups:
- name: example
rules:
- alert: HighRequestLatency
expr: sum(rate(container_network_receive_bytes_total{namespace="kube-logging"}[5m]))>20000
for: 1m
labels:
severity: page
annotations:
summary: High request latency
alertmanager:
persistentVolume:
storageClass: default-hdd-retain
## Deploy alertmanager
##
enabled: true
## Service account for Alertmanager to use.
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
##
serviceAccount:
create: true
name: ""
## Configure pod disruption budgets for Alertmanager
## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
## This configuration is immutable once created and will require the PDB to be deleted to be changed
## https://github.com/kubernetes/kubernetes/issues/45398
##
podDisruptionBudget:
enabled: false
minAvailable: 1
maxUnavailable: ""
## Alertmanager configuration directives
## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
## https://prometheus.io/webtools/alerting/routing-tree-editor/
##
config:
global:
resolve_timeout: 5m
slack_api_url: "I changed this url for the stack overflow question"
route:
group_by: ['job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
#receiver: 'slack'
routes:
- match:
alertname: DeadMansSwitch
receiver: 'null'
- match:
receiver: 'slack'
continue: true
receivers:
- name: 'null'
- name: 'slack'
slack_configs:
- channel: 'alerts'
send_resolved: false
title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Monitoring Event Notification'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:> *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
So I have finally solved the problem. The problem apparently was that the kube-prometheus-stack and the prometheus helm charts work a bit differently.
So instead of alertmanager.config I had to insert the code (everything starting from global) at alertmanagerFiles.alertmanager.yml.
Related
I am a beginner who is using Prometheus and Grapana to monitor the value of REST API.
Prometheus, json-exporrter, and grafana both used the Helm chart, Prometheus installed as default values.yaml, and json-exporter installed as custom values.yaml.
I checked that the prometheus set the service monitor of json-exporter as a target, but I couldn't check its metrics.
How can I check the metrics? Below is the environment , screenshots and code.
environment :
kubernetes : v1.22.9
helm : v3.9.2
prometheus-json-exporter helm chart : v0.5.0
kube-prometheus-stack helm chart : 0.58.0
screenshots :
https://drive.google.com/drive/folders/1vfjbidNpE2_yXfxdX8oX5eWh4-wAx7Ql?usp=sharing
values.yaml
in custom_jsonexporter_values.yaml
# Default values for prometheus-json-exporter.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
replicaCount: 1
image:
repository: quay.io/prometheuscommunity/json-exporter
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: ""
imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""
serviceAccount:
# Specifies whether a service account should be created
create: true
# Annotations to add to the service account
annotations: []
# The name of the service account to use.
# If not set and create is true, a name is generated using the fullname template
name: ""
podAnnotations: []
podSecurityContext: {}
# fsGroup: 2000
# podLabels:
# Custom labels for the pod
securityContext: {}
# capabilities:
# drop:
# - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000
service:
type: ClusterIP
port: 7979
targetPort: http
name: http
serviceMonitor:
## If true, a ServiceMonitor CRD is created for a prometheus operator
## https://github.com/coreos/prometheus-operator
##
enabled: true
namespace: monitoring
scheme: http
# Default values that will be used for all ServiceMonitors created by `targets`
defaults:
additionalMetricsRelabels: {}
interval: 60s
labels:
release: prometheus
scrapeTimeout: 60s
targets:
- name : pi2
url: http://xxx.xxx.xxx.xxx:xxxx
labels: {} # Map of labels for ServiceMonitor. Overrides value set in `defaults`
interval: 60s # Scraping interval. Overrides value set in `defaults`
scrapeTimeout: 60s # Scrape timeout. Overrides value set in `defaults`
additionalMetricsRelabels: {} # Map of metric labels and values to add
ingress:
enabled: false
className: ""
annotations: []
# kubernetes.io/ingress.class: nginx
# kubernetes.io/tls-acme: "true"
hosts:
- host: chart-example.local
paths:
- path: /
pathType: ImplementationSpecific
tls: []
# - secretName: chart-example-tls
# hosts:
# - chart-example.local
resources: {}
# We usually recommend not to specify default resources and to leave this as a conscious
# choice for the user. This also increases chances charts run on environments with little
# resources, such as Minikube. If you do want to specify resources, uncomment the following
# lines, adjust them as necessary, and remove the curly braces after 'resources:'.
# limits:
# cpu: 100m
# memory: 128Mi
# requests:
# cpu: 100m
# memory: 128Mi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 100
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 80
nodeSelector: []
tolerations: []
affinity: []
configuration:
config: |
---
modules:
default:
metrics:
- name: used_storage_byte
path: '{ .used }'
help: used storage byte
values:
used : '{ .used }'
labels: {}
- name: free_storage_byte
path: '{ .free }'
help: free storage byte
labels: {}
values :
free : '{ .free }'
- name: total_storage_byte
path: '{ .total }'
help: total storage byte
labels: {}
values :
total : '{ .total }'
prometheusRule:
enabled: false
additionalLabels: {}
namespace: ""
rules: []
additionalVolumes: []
# - name: password-file
# secret:
# secretName: secret-name
additionalVolumeMounts: []
# - name: password-file
# mountPath: "/tmp/mysecret.txt"
# subPath: mysecret.txt
Firstly you can check the targets page on the Prometheus UI to see if a) your desired target is even defined and b) if the endpoint is reachable and being scraped.
However, you may need to troubleshoot a little if either of the above is not the case:
It is important to understand what is happening. You have deployed a Prometheus Operator to the cluster. If you have used the default values from the helm chart, you also deployed a Prometheus custom resource(CR). This instance is what is telling the Prometheus Operator how to ultimately configure the Prometheus running inside the pod. Certain things are static, like global metric relabeling for example, but most are dynamic, such as picking up new targets to actually scrape. Inside the Prometheus CR you will find options to specify serviceMonitorSelector and serviceMonitorNamespaceSelector (The behaviour is the same also for probes and podmonitors so I'm just going over it once). Assuming you leave the default set like serviceMonitorNamespaceSelector: {}, Prometheus Operator will look for ServiceMonitors in all namespaces on the cluster to which it has access via its serviceAccount. The serviceMonitorSelector field lets you specify a label and value combination that must be present on a serviceMonitor that must be present for it to be picked up. Once a or multiple serviceMonitors are found, that match the criteria in the selectors, Prometheus Operator adjusts the configuration in the actual Prometheus instance(tl;dr version) so you end up with proper scrape targets.
Step 1 for trouble shooting: Do your selectors match the labels and namespace of the serviceMonitors? Actually check those. The default on the prometheus operator helm chart expects a label release: prometheus-operator and in your config, you don't seem to add that to your json-exporter's serviceMonitor.
Step 2: The same behaviour as outline for how serviceMonitors are picked up, is happening in turn inside the serviceMonitor itself, so make sure that your service actually matches what is specced out in the serviceMonitor.
To deep dive further into the options you have and what the fields do, check the API documentation.
I'm using helm v 3.7.0 and I have a parent chart with a few subcharts as dependancies.
One of the subcharts has a virtual service defined as per below.
Subchart Values.yaml:
ingress:
enabled: true
host: "hostname.local"
annotations: {}
tls: []
Subchart virtual.service.yaml:
{{- if .Values.ingress.enabled -}}
apiVersion: types.kubefed.io/v1beta1
kind: FederatedVirtualService
metadata:
name: {{ template "product.fullname" . }}-web-vservice
spec:
placement:
clusterSelector: {}
template:
spec:
gateways:
- {{ template "product.fullname" . }}-web-gateway
hosts:
- {{ .Values.ingress.host }}
http:
# Website
- match:
- uri:
prefix: /
route:
- destination:
host: {{ template "product.fullname" . }}-web
port:
number: 80
{{- end }}
When I run:
helm template . --debug
It errors out with:
Error: template: virtual.service.yaml:1:14: executing "virtual.service.yaml" at <.Values.ingress.enabled>: nil pointer evaluating interface {}.enabled
If I move the enabled boolean outside of ingress and update the if statement it works.
i.e.
new values:
ingressEnabled: true
ingress:
host: "hostname.local"
annotations: {}
tls: []
new virtual service:
{{- if .Values.ingressEnabled -}}
apiVersion: types.kubefed.io/v1beta1
kind: FederatedVirtualService
The problem is that this is happening all over the place with lots of different values but only where they are nested and I cannot make all values flat.
I have virtual services being specified in exactly the same way in other projects and they work perfectly. I don't believe the issue is with how I'm defining this (unless anyone can correct me?) so I think something else is preventing helm from being able to read nested values but I don't know where to look to investigate this odd behaviour.
What would make helm unable to read nested values?
Based on the comment above, I see you were able to overcome the issue and I want to add the answer to the question for other to find.
The default values.yaml file is lower case (even though we use .Values in the template).
What you can do is either:
make the filename start with a lowercase letter or
pass the values file explicitly with with -f option
I installed Keycloak using the bitnami/keycloak Helm chart (https://bitnami.com/stack/keycloak/helm).
As I'm also using Prometheus-Operator for monitoring I enabled the metrics endpoint and the service monitor:
keycloak:
...
metrics:
enabled: true
serviceMonitor:
enabled: true
namespace: monitoring
additionalLabels:
release: my-prom-operator-release
As I'm way more interested in actual realm metrics I installed the keycloak-metrics-spi provider (https://github.com/aerogear/keycloak-metrics-spi) by setting up an init container that downloads it to a shared volume.
keycloak:
...
extraVolumeMounts:
- name: providers
mountPath: /opt/bitnami/keycloak/providers
extraVolumes:
- name: providers
emptyDir: {}
...
initContainers:
- name: metrics-spi-provider
image: SOME_IMAGE_WITH_WGET_INSTALLED
imagePullPolicy: Always
command:
- sh
args:
- -c
- |
KEYCLOAK_METRICS_SPI_VERSION=2.5.2
wget --no-check-certificate -O /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar \
https://github.com/aerogear/keycloak-metrics-spi/releases/download/${KEYCLOAK_METRICS_SPI_VERSION}/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar
chmod +x /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar
touch /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar.dodeploy
volumeMounts:
- name: providers
mountPath: /providers
The provider enables metrics endpoints on the regular public-facing http port instead of the http-management port, which is not great for me. But I can block external access to them in my reverse proxy.
What I'm missing is some kind of auto-scraping of those endpoints. Right now I created an additional template, that creates a new service monitor for each element of a predefined list in my chart:
values.yaml
keycloak:
...
metrics:
extraServiceMonitors:
- realmName: master
- realmName: my-realm
servicemonitor-metrics-spi.yaml
{{- range $serviceMonitor := .Values.keycloak.metrics.extraServiceMonitors }}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: {{ $.Release.Name }}-spi-{{ $serviceMonitor.realmName }}
...
spec:
endpoints:
- port: http
path: /auth/realms/{{ $serviceMonitor.realmName }}/metrics
...
{{- end }}
Is there a better way of doing this? So that Prometheus can auto-detect all my realms and scrape their endpoints?
Thanks in advance!
As commented by #jan-garaj there is no need to query all the endpoints. All return the accumulated data of all realms. So it is enough to just scrape the endpoint of one realm (e.g. the master realm).
Thanks a lot!
It might help someone, the bitnami image so the helm chart already include the metrics-spi-provider. So do not need any further installation action but the metrics must be enabled in values.
I need to implement an alert for a prometheus metric that is being exposed by many instances of a given application running on a kubernetes cluster.
The alert has to be created in a .yaml file in the following format:
- name: some-alert-name
interval: 30s
rules:
- alert: name-alert
expr: <Expression To Make>
labels:
event_id: XXXXX
annotations:
description: "Project {{ $labels.kubernetes_namespace }} / App {{ $labels.app }} / Pod {{ $labels.kubernetes_pod_name }} / Instance {{ $labels.instance }}."
summary: "{{ $labels.kubernetes_namespace }}"
The condition to applied to the alert would be something like: givenMetricValue > 4
I have no issue in getting the metric values for all instances, as I can do it with: metricName{app=~"common-part-of-deployments-name-.*"}"
My troubles are in having a unique alert with an expression that fires if one of them satisfies the condition.
Is this possible to be done?
If so, how can I do it?
Turns out, if you want create the alert with a generic "all-fetching" expression like
metricName{app=~"common-part-of-deployments-name-.*"}"
The alert will be triggered for each deployment that the regex matches. So all you need is an alert with a generic expression.
I am trying to get the alert found by Prometheus to be notified in slack using alertmanager.
This is the alert.rules file and is working fine
groups:
- name: Instances
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: page
# Prometheus templates apply here in the annotation and label fields of the alert.
annotations:
description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
summary: 'Instance {{ $labels.instance }} down'
It is showing the one instance down successfully.
But What is the problem in my alertmanager.yml that it is not sending the notifications to slack. I have also setup the slack webhook successfully and have even tested if the hook is working fine while created a hook with the service that is provided by the slack
alertmanager.yml
groups:
- name: Instances
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: page
# Prometheus templates apply here in the annotation and label fields of the alert.
annotations:
description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
summary: 'Instance {{ $labels.instance }} down'
[tgurung#ip131 prometheus_graphana_myversion]$ cat alertmanager/alertmanager.yml
route:
receiver: 'slack-notifications'
#group_by: [alertname, datacenter, app]
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: https://hooks.slack.com/services/T52GRFN3F/B93KTCUHH/JC
channel: #general
send_resolved: true
# Alertmanager templates apply here.
text: "<!channel> \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
When running docker-compose up I get following
prometheus_1 | level=error ts=2018-02-06T09:36:35.580565429Z caller=notifier.go:454 component=notifier alertmanager=http://x.x.x.x:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://x.X.x.x:9093/api/v1/alerts: dial tcp x.x.x.x:9093: getsockopt: no route to host"
Solving above error:
To solve the above routing issues I ran the alertmanager in completely new instance and than that error is overcome
Going to the API links in the error message I can see this
{"status":"success","data":[]}
And this is of alert_manager and looks working great.
alertmanager_1 | level=info ts=2018-02-06T09:36:37.66654544Z caller=main.go:141 msg="Starting Alertmanager" version="(version=0.13.0, branch=HEAD, revision=fb713f6d8239b57c646cae30f78e8b4b8861a1aa)"
alertmanager_1 | level=info ts=2018-02-06T09:36:37.66661402Z caller=main.go:142 build_context="(go=go1.9.2, user=root#d83981af1d3d, date=20180112-10:32:46)"
alertmanager_1 | level=info ts=2018-02-06T09:36:37.668103448Z caller=main.go:279 msg="Loading configuration file" file=/alertmanager/alertmanager.yml
alertmanager_1 | level=info ts=2018-02-06T09:36:37.673288146Z caller=main.go:354 msg=Listening address=:9093
Here is the prometheus.yml config file
global:
scrape_interval: 5s
external_labels:
monitor: 'my-monitor'
#alerting rules file
rule_files:
- '/alertmanager/alert.rules'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
alerting:
alertmanagers:
- static_configs:
- targets: ["54.36.X.X:9093"] #this is the alertmanager service url
Here is my docker-compose.yml
version: '2'
volumes:
grafana_data: {}
services:
prometheus:
image: prom/prometheus
privileged: true
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alertmanager/alert.rules:/alertmanager/alert.rules
- ./alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/prometheus/prometheus.yml'
ports:
- '9090:9090'
links:
- "alertmanager"
node-exporter:
image: prom/node-exporter
ports:
- '9100:9100'
alertmanager:
image: prom/alertmanager
privileged: true
volumes:
- ./alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
command:
- '--config.file=/alertmanager/alertmanager.yml'
ports:
- '9093:9093'
The alertmanager status link does not show the config being passed from volume in docker-composer. It's showing default configs
As some people have already pointed out, alertmanager configuration only includes how to send alerts and NOT to create alerts. That's prometheus' job.
Take a look at this repo, which sets prometheus with alertmanager very simply
https://github.com/stefanprodan/dockprom