How to store the status of an expr in alert rules to use that in annotations?

I am setting up Prometheus alerts for whenever a node in my Kubernetes cluster goes into "NotReady". I get notified on Slack whenever that happens. The problem is that I get notified with the same description, "Node xxxx is in NotReady", even when the node comes back up. I am trying to use a variable for the ready status of the node and use that in the annotations part.
I have tried using "vars" and "when" to assign the status to a variable and use it in the annotations:
- name: NodeNotReady
  rules:
    - alert: K8SNodeNotReadyAlert
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 3m
      vars:
        - ready_status: "Ready"
          when: kube_node_status_condition{condition="Ready",status="true"} == 1
        - ready_status: "Not Ready"
          when: kube_node_status_condition{condition="Ready",status="true"} == 0
      labels:
        severity: warning
      annotations:
        description: Node {{ $labels.node }} status is in {{ ready_status }}.
        summary: Node status {{ ready_status }} Alert!
I want to get these alerts:
1. When the node is NotReady: "Node prom-node status is in NotReady."
2. When the node is Ready: "Node prom-node status is in Ready."

The thing you're looking for is described over here. You should end up with something like this in the description:
Node {{ $labels.node }} status is in {{ if eq $value 1.0 }} Ready {{ else }} Not Ready {{ end }} status.
It's also worth reading this before creating more alerts.
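For reference, a minimal sketch of how that templated annotation could sit in the rule from the question (same expr and labels as above; the float literal 1.0 is used because Go's template eq generally refuses to compare the float $value against an integer literal):
- name: NodeNotReady
  rules:
    - alert: K8SNodeNotReadyAlert
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 3m
      labels:
        severity: warning
      annotations:
        description: Node {{ $labels.node }} status is in {{ if eq $value 1.0 }}Ready{{ else }}Not Ready{{ end }} status.
        summary: Node status alert for {{ $labels.node }}!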

Related

HELM: include named template inside tpl .Files.Get fails

structure of folders:
files/
  alertmanager/
    rules/
      - alertmanager.rules
      - nodes.rules
      ...
templates/
  - _helpers.tpl
  - prometheus.yaml
files/alertmanager/rules/alertmanager.rules
- name: Alertmanager
  rules:
    - alert: PrometheusAlertmanagerConfigurationReloadFailure
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
        max_over_time(alertmanager_config_last_reload_successful{app_kubernetes_io_name="alertmanager"}[5m]) == 0
      for: 10m
      labels:
        severity: critical
        {{ include _default_rule_labels . }}
      annotations:
        type: Alertmanager
        summary: Prometheus AlertManager configuration reload failure
        description: |
          The error could be caused by recent changes, for example an incorrect configuration of the alertmanager template (defined in templates/prometheus.yaml),
          or by an incorrect route(s) configuration (typically in argocd/apps/values.yaml).
templates/_helpers.tpl
{{/*
Collect alertmanager rules from files
*/}}
{{- define "alertmanager.rules" -}}
{{- range $path, $_ := .Files.Glob "files/alertmanager/rules/**.rules" }}
{{ tpl ($.Files.Get $path) $ }}
{{- end }}
{{- end }}
{{/*
Set default alert rule labels
*/}}
{{- define "_default_rule_labels" -}}
environment: {{ .Values.environment }}
client: {{ .Values.client }}
cluster: {{ .Values.eks_cluster }}
sla: {{ .Values.sla }}
{{- end }}
templates/prometheus.yaml
# ? PROMETHEUS ALERT RULES
serverFiles:
  alerting_rules.yml:
    groups:
      {{- include "alertmanager.rules" . | nindent 14 }}
I'm getting the following error when trying to render the prometheus.yaml template:
✗ helm template . -s templates/prometheus.yaml
parse error at (root/templates/prometheus.yaml:15): function "_default_rule_labels" not defined
How should I approach this?
I can render {{ .Values.something }} inside files/alertmanager/rules/alertmanager.rules, but including named templates throws an error.
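A likely cause, and a sketch of a fix (not verified against this chart): the template name is unquoted, so the Go template parser treats _default_rule_labels as a function reference and fails before include is ever called. Quoting the name inside the .rules file should let it render through tpl, something like:
      labels:
        severity: critical
        {{- include "_default_rule_labels" . | nindent 8 }}
The nindent value here is an assumption; it should match the indentation of the labels block inside this file, since the outer nindent 14 in templates/prometheus.yaml shifts every line uniformly.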

Ansible conditional won't recognise debug output

I am trying to use a debug message's output in a conditional for a Kubernetes object, but it looks like it isn't recognised properly:
- name: get some service status log
  kubernetes.core.k8s_log:
    namespace: "{{ product.namespace }}"
    label_selectors:
      - app.kubernetes.io/name=check-service-existence
  register: service_existence

- name: some service existence check log
  debug:
    msg: "{{ service_existence.log_lines | first }}"

- name: create service for "{{ product.namespace }}"
  kubernetes.core.k8s:
    state: present
    template: create-service.j2
    wait: yes
    wait_timeout: 300
    wait_condition:
      type: "Complete"
      status: "True"
  when: service_existence == "service_does_not_exist"
What I get when I run it is:
TASK [playbook : some service existence check log] ***
ok: [127.0.0.1] =>
msg: service_does_not_exist
TASK [playbook : create service for "namespace"] ***
skipping: [127.0.0.1]
I suspect that it treats msg: as part of the string. How can I deal with this properly?
Since your debug message prints the value of service_existence.log_lines | first, your conditional should compare against the same expression:
when: service_existence.log_lines | first == "service_does_not_exist"
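Applied to the task from the question, the create-service task would look something like this (a sketch; everything except the when line is unchanged):
- name: create service for "{{ product.namespace }}"
  kubernetes.core.k8s:
    state: present
    template: create-service.j2
    wait: yes
    wait_timeout: 300
    wait_condition:
      type: "Complete"
      status: "True"
  when: service_existence.log_lines | first == "service_does_not_exist"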

Prometheus alertmanager slack notification newlines issue

I've defined an alert for my kubernetes pods as described below to notify through slack.
I used the example from the official documentation for ranging over all received alerts to loop over multiple alerts and render them in my Slack channel.
I do get notifications, but the newlines do not get rendered correctly.
I'm new to Prometheus; any help is greatly appreciated.
Thanks.
detection:
  # Alert If:
  # 1. Pod is not in a running state.
  # 2. Container is killed because it's out of memory.
  # 3. Container is evicted.
  rules:
    groups:
      - name: not-running
        rules:
          - alert: PodNotRunning
            expr: kube_pod_status_phase{phase!="Running"} > 0
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} is not running."
              description: 'Kubernetes pod {{ $labels.pod }} is not running.'
          - alert: KubernetesContainerOOMKilledOrEvicted
            expr: kube_pod_container_status_last_terminated_reason{reason=~"OOMKilled|Evicted"} > 0
            for: 0m
            labels:
              severity: warning
            annotations:
              summary: "kubernetes container killed/evicted (instance {{ $labels.instance }})"
              description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled/Evicted."

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 3m
  repeat_interval: 4h
  receiver: slack-channel
  routes:
    - match:
        alertname: PodNotRunning
    - match:
        alertname: KubernetesContainerOOMKilledOrEvicted

notifications:
  receivers:
    - name: slack-channel
      slack_configs:
        - channel: kube-alerts
          title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
          text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
How it gets rendered on the actual slack channel:
Title: inst-1 down.\ninst-2 down.\ninst-3 down.\ninst-4 down.
Text: inst-1 down.\ninst-2 down.\ninst-3 down.\ninst-4 down
How I thought it would render:
Title: inst-1 down.
Text: inst-1 down.
Title: inst-2 down.
Text: inst-2 down.
Title: inst-3 down.
Text: inst-3 down.
Title: inst-4 down.
Text: inst-4 down.
Use {{ "\n" }} instead of plain \n
example:
...
slack_configs:
  - channel: kube-alerts
    title: "{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}"
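If the loops get longer, the same fix also works from a notification template file instead of inline strings. A sketch only; the template names and the templates path are assumptions, not part of the original answer:
# alertmanager.yml
templates:
  - '/etc/alertmanager/templates/*.tmpl'
receivers:
  - name: slack-channel
    slack_configs:
      - channel: kube-alerts
        title: '{{ template "slack.custom.title" . }}'
        text: '{{ template "slack.custom.text" . }}'
# /etc/alertmanager/templates/slack.tmpl
{{ define "slack.custom.title" }}{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}{{ end }}
{{ define "slack.custom.text" }}{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}{{ end }}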

Is it possible to avoid sending repeated Slack notifications for already fired alert?

Disclaimer: this is my first time using Prometheus.
I am trying to send a Slack notification every time a Job ends successfully.
To achieve this, I installed kube-state-metrics, Prometheus and AlertManager.
Then I created the following rule:
rules:
  - alert: KubeJobCompleted
    annotations:
      identifier: '{{ $labels.instance }}'
      summary: Job Completed Successfully
      description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
    expr: |
      kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
    labels:
      severity: information
And added the AlertManager receiver text (template):
{{ define "custom_slack_message" }}
{{ range .Alerts }}
{{ .Annotations.description }}
{{ end }}
{{ end }}
My current result: every time a new job completes successfully, I receive a Slack notification with the list of all Jobs that completed successfully.
I don't mind receiving the whole list at first, but after that I would like to receive notifications that contain only the newly completed job(s) within the specified group interval.
Is it possible?
Just add an extra line to the rule, for: 10m, which will limit it to the job(s) completed in the last 10 minutes:
rules:
  - alert: KubeJobCompleted
    annotations:
      identifier: '{{ $labels.instance }}'
      summary: Job Completed Successfully
      description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
    expr: |
      kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
    for: 10m
    labels:
      severity: information
I ended up using kube_job_status_completion_time and time() to dismiss past events (to avoid re-firing the alert after each repeat interval).
rules:
  - alert: KubeJobCompleted
    annotations:
      identifier: '{{ $labels.instance }}'
      summary: Job Completed Successfully
      description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
    expr: |
      time() - kube_job_status_completion_time < 60 and kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
    labels:
      severity: information

Custom alert for pod memory utilisation in Prometheus

I created alert rules for pod memory utilisation in Prometheus. Alerts show up perfectly in my Slack channel, but they do not contain the name of the pod, so it is difficult to understand which pod has the issue.
It just shows [FIRING:35] (POD_MEMORY_HIGH_UTILIZATION default/k8s warning). But when I look into the "Alerts" section of the Prometheus UI, I can see the fired rules with their pod names. Can anyone help?
My alert notification template is as follows:
alertname: TargetDown
alertname: POD_CPU_HIGH_UTILIZATION
alertname: POD_MEMORY_HIGH_UTILIZATION
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#devops'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true
I have added title: '{{ .CommonAnnotations.summary }}' and text: '{{ .CommonAnnotations.description }}' to my alert notification template, and now it shows the description. My description is description: pod {{$labels.pod}} is using high memory., but it only shows "is using high memory", without the pod name.
As mentioned in the article, you should check the alert rules and update them if necessary. See an example:
ALERT ElasticacheCPUUtilisation
  IF aws_elasticache_cpuutilization_average > 80
  FOR 10m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "ElastiCache CPU Utilisation Alert",
    description = "Elasticache CPU Usage has breached the threshold set (80%) on cluster id {{ $labels.cache_cluster_id }}, now at {{ $value }}%",
    runbook = "https://mywiki.com/ElasticacheCPUUtilisation",
  }
To provide an external URL for your Prometheus GUI, pass this CLI argument to your Prometheus server and restart it:
-web.external-url=http://externally-available-url:9090/
After that, you can put the values into your alertmanager configuration. See an example:
receivers:
  - name: 'iw-team-slack'
    slack_configs:
      - channel: alert-events
        send_resolved: true
        api_url: https://hooks.slack.com/services/<your_token>
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Monitoring Event Notification'
        text: >-
          {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
            *Description:* {{ .Annotations.description }}
            *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:> *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
            *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
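Applied to the asker's POD_MEMORY_HIGH_UTILIZATION alert, ranging over .Alerts like this prints one description per firing pod, provided the rule's annotations carry the pod label. A minimal sketch (the expr is a placeholder, since the original rule isn't shown in the question):
- alert: POD_MEMORY_HIGH_UTILIZATION
  expr: <your pod memory utilisation expression>  # placeholder: the original expr is not shown
  labels:
    severity: warning
  annotations:
    summary: 'Pod {{ $labels.pod }} memory utilisation is high'
    description: 'pod {{ $labels.pod }} is using high memory'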