I am trying to get the alert found by Prometheus to be notified in slack using alertmanager.
This is the alert.rules file and is working fine
groups:
- name: Instances
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: page
# Prometheus templates apply here in the annotation and label fields of the alert.
annotations:
description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
summary: 'Instance {{ $labels.instance }} down'
It is showing the one instance down successfully.
But What is the problem in my alertmanager.yml that it is not sending the notifications to slack. I have also setup the slack webhook successfully and have even tested if the hook is working fine while created a hook with the service that is provided by the slack
alertmanager.yml
groups:
- name: Instances
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: InstanceDown
expr: up == 0
for: 5m
labels:
severity: page
# Prometheus templates apply here in the annotation and label fields of the alert.
annotations:
description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
summary: 'Instance {{ $labels.instance }} down'
[tgurung#ip131 prometheus_graphana_myversion]$ cat alertmanager/alertmanager.yml
route:
receiver: 'slack-notifications'
#group_by: [alertname, datacenter, app]
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: https://hooks.slack.com/services/T52GRFN3F/B93KTCUHH/JC
channel: #general
send_resolved: true
# Alertmanager templates apply here.
text: "<!channel> \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
When running docker-compose up I get following
prometheus_1 | level=error ts=2018-02-06T09:36:35.580565429Z caller=notifier.go:454 component=notifier alertmanager=http://x.x.x.x:9093/api/v1/alerts count=0 msg="Error sending alert" err="Post http://x.X.x.x:9093/api/v1/alerts: dial tcp x.x.x.x:9093: getsockopt: no route to host"
Solving above error:
To solve the above routing issues I ran the alertmanager in completely new instance and than that error is overcome
Going to the API links in the error message I can see this
{"status":"success","data":[]}
And this is of alert_manager and looks working great.
alertmanager_1 | level=info ts=2018-02-06T09:36:37.66654544Z caller=main.go:141 msg="Starting Alertmanager" version="(version=0.13.0, branch=HEAD, revision=fb713f6d8239b57c646cae30f78e8b4b8861a1aa)"
alertmanager_1 | level=info ts=2018-02-06T09:36:37.66661402Z caller=main.go:142 build_context="(go=go1.9.2, user=root#d83981af1d3d, date=20180112-10:32:46)"
alertmanager_1 | level=info ts=2018-02-06T09:36:37.668103448Z caller=main.go:279 msg="Loading configuration file" file=/alertmanager/alertmanager.yml
alertmanager_1 | level=info ts=2018-02-06T09:36:37.673288146Z caller=main.go:354 msg=Listening address=:9093
Here is the prometheus.yml config file
global:
scrape_interval: 5s
external_labels:
monitor: 'my-monitor'
#alerting rules file
rule_files:
- '/alertmanager/alert.rules'
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
alerting:
alertmanagers:
- static_configs:
- targets: ["54.36.X.X:9093"] #this is the alertmanager service url
Here is my docker-compose.yml
version: '2'
volumes:
grafana_data: {}
services:
prometheus:
image: prom/prometheus
privileged: true
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alertmanager/alert.rules:/alertmanager/alert.rules
- ./alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
command:
- '--config.file=/etc/prometheus/prometheus.yml'
ports:
- '9090:9090'
links:
- "alertmanager"
node-exporter:
image: prom/node-exporter
ports:
- '9100:9100'
alertmanager:
image: prom/alertmanager
privileged: true
volumes:
- ./alertmanager/alertmanager.yml:/alertmanager/alertmanager.yml
command:
- '--config.file=/alertmanager/alertmanager.yml'
ports:
- '9093:9093'
The alertmanager status link does not show the config being passed from volume in docker-composer. It's showing default configs
As some people have already pointed out, alertmanager configuration only includes how to send alerts and NOT to create alerts. That's prometheus' job.
Take a look at this repo, which sets prometheus with alertmanager very simply
https://github.com/stefanprodan/dockprom
Related
I installed Keycloak using the bitnami/keycloak Helm chart (https://bitnami.com/stack/keycloak/helm).
As I'm also using Prometheus-Operator for monitoring I enabled the metrics endpoint and the service monitor:
keycloak:
...
metrics:
enabled: true
serviceMonitor:
enabled: true
namespace: monitoring
additionalLabels:
release: my-prom-operator-release
As I'm way more interested in actual realm metrics I installed the keycloak-metrics-spi provider (https://github.com/aerogear/keycloak-metrics-spi) by setting up an init container that downloads it to a shared volume.
keycloak:
...
extraVolumeMounts:
- name: providers
mountPath: /opt/bitnami/keycloak/providers
extraVolumes:
- name: providers
emptyDir: {}
...
initContainers:
- name: metrics-spi-provider
image: SOME_IMAGE_WITH_WGET_INSTALLED
imagePullPolicy: Always
command:
- sh
args:
- -c
- |
KEYCLOAK_METRICS_SPI_VERSION=2.5.2
wget --no-check-certificate -O /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar \
https://github.com/aerogear/keycloak-metrics-spi/releases/download/${KEYCLOAK_METRICS_SPI_VERSION}/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar
chmod +x /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar
touch /providers/keycloak-metrics-spi-${KEYCLOAK_METRICS_SPI_VERSION}.jar.dodeploy
volumeMounts:
- name: providers
mountPath: /providers
The provider enables metrics endpoints on the regular public-facing http port instead of the http-management port, which is not great for me. But I can block external access to them in my reverse proxy.
What I'm missing is some kind of auto-scraping of those endpoints. Right now I created an additional template, that creates a new service monitor for each element of a predefined list in my chart:
values.yaml
keycloak:
...
metrics:
extraServiceMonitors:
- realmName: master
- realmName: my-realm
servicemonitor-metrics-spi.yaml
{{- range $serviceMonitor := .Values.keycloak.metrics.extraServiceMonitors }}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: {{ $.Release.Name }}-spi-{{ $serviceMonitor.realmName }}
...
spec:
endpoints:
- port: http
path: /auth/realms/{{ $serviceMonitor.realmName }}/metrics
...
{{- end }}
Is there a better way of doing this? So that Prometheus can auto-detect all my realms and scrape their endpoints?
Thanks in advance!
As commented by #jan-garaj there is no need to query all the endpoints. All return the accumulated data of all realms. So it is enough to just scrape the endpoint of one realm (e.g. the master realm).
Thanks a lot!
It might help someone, the bitnami image so the helm chart already include the metrics-spi-provider. So do not need any further installation action but the metrics must be enabled in values.
I tried to use Grafana Tempo for distributed tracing.
I launch it from docker-compose:
version: "3.9"
services:
# MY MICROSERVICES
...
prometheus:
image: prom/prometheus
ports:
- ${PROMETHEUS_EXTERNAL_PORT}:9090
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:cached
promtail:
image: grafana/promtail
volumes:
- ./log:/var/log
- ./promtail/:/mnt/config
command: -config.file=/mnt/config/promtail-config.yaml
loki:
image: grafana/loki
command: -config.file=/etc/loki/local-config.yaml
tempo:
image: grafana/tempo
command: [ "-config.file=/etc/tempo.yaml" ]
volumes:
- ./tempo/tempo-local.yaml:/etc/tempo.yaml
# - ./tempo-data/:/tmp/tempo
ports:
- "14268" # jaeger ingest
- "3200" # tempo
- "55680" # otlp grpc
- "55681" # otlp http
- "9411" # zipkin
tempo-query:
image: grafana/tempo-query
command: [ "--grpc-storage-plugin.configuration-file=/etc/tempo-query.yaml" ]
volumes:
- ./tempo/tempo-query.yaml:/etc/tempo-query.yaml
ports:
- "16686:16686" # jaeger-ui
depends_on:
- tempo
grafana:
image: grafana/grafana
volumes:
- ./grafana/datasource-config/:/etc/grafana/provisioning/datasources:cached
- ./grafana/dashboards/prometheus.json:/var/lib/grafana/dashboards/prometheus.json:cached
- ./grafana/dashboards/loki.json:/var/lib/grafana/dashboards/loki.json:cached
- ./grafana/dashboards-config/:/etc/grafana/provisioning/dashboards:cached
ports:
- ${GRAFANA_EXTERNAL_PORT}:3000
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
- GF_AUTH_DISABLE_LOGIN_FORM=true
depends_on:
- prometheus
- loki
with tempo-local.yaml:
server:
http_listen_port: 3200
distributor:
receivers: # this configuration will listen on all ports and protocols that tempo is capable of.
jaeger: # the receives all come from the OpenTelemetry collector. more configuration information can
protocols: # be found there: https://github.com/open-telemetry/opentelemetry-collector/tree/main/receiver
thrift_http: #
grpc: # for a production deployment you should only enable the receivers you need!
thrift_binary:
thrift_compact:
zipkin:
otlp:
protocols:
http:
grpc:
opencensus:
ingester:
trace_idle_period: 10s # the length of time after a trace has not received spans to consider it complete and flush it
max_block_bytes: 1_000_000 # cut the head block when it hits this size or ...
max_block_duration: 5m # this much time passes
compactor:
compaction:
compaction_window: 1h # blocks in this time window will be compacted together
max_block_bytes: 100_000_000 # maximum size of compacted blocks
block_retention: 1h
compacted_block_retention: 10m
storage:
trace:
backend: local # backend configuration to use
block:
bloom_filter_false_positive: .05 # bloom filter false positive rate. lower values create larger filters but fewer false positives
index_downsample_bytes: 1000 # number of bytes per index record
encoding: zstd # block encoding/compression. options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
wal:
path: /tmp/tempo/wal # where to store the the wal locally
encoding: snappy # wal encoding/compression. options: none, gzip, lz4-64k, lz4-256k, lz4-1M, lz4, snappy, zstd, s2
local:
path: /tmp/tempo/blocks
pool:
max_workers: 100 # worker pool determines the number of parallel requests to the object store backend
queue_depth: 10000
tempo-query.yaml:
backend: "tempo:3200"
and datasource.yml for instrumenting datasources on grafana:
apiVersion: 1
deleteDatasources:
- name: Prometheus
- name: Tempo
- name: Loki
datasources:
- name: Prometheus
type: prometheus
access: proxy
orgId: 1
url: http://prometheus:9090
basicAuth: false
isDefault: false
version: 1
editable: false
- name: Tempo
type: tempo
access: proxy
orgId: 1
url: http://tempo-query:16686
basicAuth: false
isDefault: false
version: 1
editable: false
apiVersion: 1
uid: tempo
- name: Tempo-Multitenant
type: tempo
access: proxy
orgId: 1
url: http://tempo-query:16686
basicAuth: false
isDefault: false
version: 1
editable: false
apiVersion: 1
uid: tempo-authed
jsonData:
httpHeaderName1: 'Authorization'
secureJsonData:
httpHeaderValue1: 'Bearer foo-bar-baz'
- name: Loki
type: loki
access: proxy
orgId: 1
url: http://loki:3100
basicAuth: false
isDefault: false
version: 1
editable: false
apiVersion: 1
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: \[.+,(.+),.+\]
name: TraceID
url: $${__value.raw}
But if I test the datasource in grafana, I have this error:
In Loki view I can found the Tempo button for seeing traces... but I can't see it on Tempo because I have an error:
Anyway if I take the trace id and I search it on Jaeger, I can see it correctly.
What I missing in configuration for Tempo? How configure it correctly?
Grafana 7.5 and later can talk to Tempo natively, and no longer need the tempo-query proxy. I think this explains what is happening, Grafana is attempting to use the Tempo-native API against tempo-query, which exposes the Jaeger API instead. Try changing the Grafana datasource in datasource.yml to http://tempo:3200.
This solution applies to the tempo installation via the kubernetes helm charts (so it applies to the question's title but not the exact question):
Use the url http://helmreleasename-tempo:3100 for configuring the tempo datasource in grafana. You need to check your tempo service name in kubernetes and use the port 3100.
I am trying to create alerts in Prometheus on Kubernetes and sending them to a Slack channel. For this i am using the prometheus-community helm-charts (which already includes the alertmanager). As i want to use my own alerts I have also created an values.yml (shown below) strongly inspired from here.
If I port forward Prometheus I can see my Alert there going from inactive, to pending to firing, but no message is sent to slack. I am quite confident that my alertmanager configuration is fine (as I have tested it with some prebuild alerts of another chart and they were sent to slack). So my best guess is that I add the alert in the wrong way (in the serverFiles part), but I can not figure out how to do it correctly. Also, the alertmanager logs look pretty normal to me. Does anyone have an idea where my problem comes from?
---
serverFiles:
alerting_rules.yml:
groups:
- name: example
rules:
- alert: HighRequestLatency
expr: sum(rate(container_network_receive_bytes_total{namespace="kube-logging"}[5m]))>20000
for: 1m
labels:
severity: page
annotations:
summary: High request latency
alertmanager:
persistentVolume:
storageClass: default-hdd-retain
## Deploy alertmanager
##
enabled: true
## Service account for Alertmanager to use.
## ref: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
##
serviceAccount:
create: true
name: ""
## Configure pod disruption budgets for Alertmanager
## ref: https://kubernetes.io/docs/tasks/run-application/configure-pdb/#specifying-a-poddisruptionbudget
## This configuration is immutable once created and will require the PDB to be deleted to be changed
## https://github.com/kubernetes/kubernetes/issues/45398
##
podDisruptionBudget:
enabled: false
minAvailable: 1
maxUnavailable: ""
## Alertmanager configuration directives
## ref: https://prometheus.io/docs/alerting/configuration/#configuration-file
## https://prometheus.io/webtools/alerting/routing-tree-editor/
##
config:
global:
resolve_timeout: 5m
slack_api_url: "I changed this url for the stack overflow question"
route:
group_by: ['job']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
#receiver: 'slack'
routes:
- match:
alertname: DeadMansSwitch
receiver: 'null'
- match:
receiver: 'slack'
continue: true
receivers:
- name: 'null'
- name: 'slack'
slack_configs:
- channel: 'alerts'
send_resolved: false
title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Monitoring Event Notification'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:> *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
So I have finally solved the problem. The problem apparently was that the kube-prometheus-stack and the prometheus helm charts work a bit differently.
So instead of alertmanager.config I had to insert the code (everything starting from global) at alertmanagerFiles.alertmanager.yml.
When trying to setup Promtail I get the following error:
level=error ts=2020-11-27T06:10:30.310583Z caller=main.go:104 msg="error creating promtail" error="failed to make file target manager: pipeline stage must contain only one key"
I am running the following from the command line.
promtail-windows-amd64.exe --config.file=../conf/promtail-local-config.yml
My log lines looks like this:
13:21:03.183 - INFO - Successfully received document from '127.0.0.1'. Saved as 'c:\test\test_file.txt' in 102ms from address '/127.0.0.1'
13:21:05.275 - WARN - Failed to receive document from '127.0.0.1'. Error creating file c:\test\error_file.txt'
My config looks as follows:
scrape_configs:
- job_name: promtailTest
pipeline_stages:
- match:
selector: '{job="promtailTest"}'
stages:
- regex:
expression: '^(?P<timestamp>\\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s\\-\\s(?P<logLevel>[A-Z]{4,5})\\s\\-\\s(?P<logMessage>.*)$'
- labels:
logLevel:
static_configs:
- targets:
- localhost
labels:
job: promtailTest
app: promtailTest
host: LOCAL
__path__: C:/test/logs/*log
When I take out the pipeline_stages: section then I do see the lines in grafana however I cannot get the regex part to work. I actually want to add a label for the logging Level (so I can count the errors)
BAH!! I am a fool. I think the problem was with my format (YAML can be a real PITA!!!)
My new config now looks like this and it works.
scrape_configs:
- job_name: promtailTest
static_configs:
- targets:
- localhost
labels:
job: promtailTest
app: promtailTest
host: APOLLO99
__path__: D:/SXI/XPress/Dispatch/logs/*log
pipeline_stages:
- match:
selector: '{job="promtailTest"}'
stages:
- regex:
expression: "^^(?P<myTime>\\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s\\-\\s(?P<logLevel>[A-Z]{4,5})\\s\\-\\s(?P<logMessage>.*)$$"
- labels:
logLevel:
docker-compose.yml:
This is the docker-compose to run the prometheus, node-exporter and alert-manager service. All the services are running great. Even the health status in target menu of prometheus shows ok.
version: '2'
services:
prometheus:
image: prom/prometheus
privileged: true
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alertmanger/alert.rules:/alert.rules
command:
- '--config.file=/etc/prometheus/prometheus.yml'
ports:
- '9090:9090'
node-exporter:
image: prom/node-exporter
ports:
- '9100:9100'
alertmanager:
image: prom/alertmanager
privileged: true
volumes:
- ./alertmanager/alertmanager.yml:/alertmanager.yml
command:
- '--config.file=/alertmanager.yml'
ports:
- '9093:9093'
prometheus.yml
This is the prometheus config file with targets and alerts target sets. The alertmanager target url is working fine.
global:
scrape_interval: 5s
external_labels:
monitor: 'my-monitor'
# this is where I have simple alert rules
rule_files:
- ./alertmanager/alert.rules
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
alerting:
alertmanagers:
- static_configs:
- targets: ['some-ip:9093']
alert.rules:
Just a simple alert rules to show alert when service is down
ALERT service_down
IF up == 0
alertmanager.yml
This is to send the message on slack when alerting occurs.
global:
slack_api_url: 'https://api.slack.com/apps/A90S3Q753'
route:
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
- send_resolved: true
username: 'tara gurung'
channel: '#general'
api_url: 'https://hooks.slack.com/services/T52GRFN3F/B90NMV1U2/QKj1pZu3ZVY0QONyI5sfsdf'
Problems:
All the containers are working fine I am not able to figure out the exact problem.What am I really missing. Checking the alerts in prometheus shows.
Alerts
No alerting rules defined
Your ./alertmanager/alert.rules file is not included in your docker config, so it is not available in the container. You need to add it to the prometheus service:
prometheus:
image: prom/prometheus
privileged: true
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alertmanager/alert.rules:/alertmanager/alert.rules
command:
- '--config.file=/etc/prometheus/prometheus.yml'
ports:
- '9090:9090'
And probably give an absolute path inside prometheus.yml:
rule_files:
- "/alertmanager/alert.rules"
You also need to make sure you alerting rules are valid. Please see the prometheus docs for details and examples. You alert.rules file should look something like this:
groups:
- name: example
rules:
# Alert for any instance that is unreachable for >5 minutes.
- alert: InstanceDown
expr: up == 0
for: 5m
Once you have multiple files, it may be better to add the entire directory as a volume rather than individual files.
If you need answers to this question see the explanation on this link
How to make alert rules visible on Prometheus User Interface?
Your alert rules inside the prometheus.yml should look like this
rule_files:
- "/etc/prometheus/alert.rules.yml"
You need to stop the alertmanager and prometheus containers and run this
docker run -d --name prometheus_ops -p 9191:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml -v $(pwd)/alert.rules.yml:/etc/prometheus/alert.rules.yml prom/prometheus
Verify if you can see the alert.rule config path : Prometheus container ID and go to cd /etc/prometheus
docker exec -it fa99f733f69b sh