Custom alert for pod memory utilisation in Prometheus - kubernetes

I created alert rules for pod memory utilisation in Prometheus. The alerts show up perfectly in my Slack channel, but they do not contain the name of the pod, so it is difficult to tell which pod is having the issue.
It just shows [FIRING:35] (POD_MEMORY_HIGH_UTILIZATION default/k8s warning). But when I look into the "Alerts" section of the Prometheus UI, I can see the fired rules with their pod names. Can anyone help?
My alert notification template is as follows:
alertname: TargetDown
alertname: POD_CPU_HIGH_UTILIZATION
alertname: POD_MEMORY_HIGH_UTILIZATION
receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#devops'
    title: '{{ .CommonAnnotations.summary }}'
    text: '{{ .CommonAnnotations.description }}'
    send_resolved: true
I have added title: '{{ .CommonAnnotations.summary }}' and text: '{{ .CommonAnnotations.description }}' to my alert notification template, and now the description is shown. My description is description: pod {{$labels.pod}} is using high memory, but Slack only shows "is using high memory" without the pod name.

As mentioned in the article, you should check your alert rules and update them if necessary. Here is an example (in the old Prometheus 1.x rule syntax):
ALERT ElasticacheCPUUtilisation
  IF aws_elasticache_cpuutilization_average > 80
  FOR 10m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "ElastiCache CPU Utilisation Alert",
    description = "ElastiCache CPU usage has breached the threshold set (80%) on cluster id {{ $labels.cache_cluster_id }}, now at {{ $value }}%",
    runbook = "https://mywiki.com/ElasticacheCPUUtilisation",
  }
To provide an external URL for your Prometheus GUI (this is what the *Graph* links in the notification point to), pass the following CLI argument to your Prometheus server and restart it:
-web.external-url=http://externally-available-url:9090/
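If Prometheus runs inside the cluster, this flag typically goes into the container args of the Prometheus Deployment (or the corresponding Helm value). A rough sketch with placeholder names; note that Prometheus 2.x uses double-dash flags:
spec:
  containers:
  - name: prometheus
    image: prom/prometheus:v2.45.0
    args:
    # --web.external-url controls the links (GeneratorURL) that end up in notifications
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--web.external-url=http://externally-available-url:9090/"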
After that, you can put the values into your alertmanager configuration. See an example:
receivers:
- name: 'iw-team-slack'
  slack_configs:
  - channel: alert-events
    send_resolved: true
    api_url: https://hooks.slack.com/services/<your_token>
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Monitoring Event Notification'
    text: >-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:> *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
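Note that .CommonAnnotations only contains annotations whose values are identical across every alert in the grouped notification, so a per-pod description is dropped from it as soon as several pods fire at once; the per-alert values have to be pulled out by ranging over .Alerts as above. Applied to the receiver from the question, a minimal sketch (the title format is just an example, and it assumes the alerts carry a pod label, as the Prometheus UI suggests):
receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#devops'
    send_resolved: true
    title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
    text: >-
      {{ range .Alerts }}
      *Pod:* `{{ .Labels.pod }}` - {{ .Annotations.description }}
      {{ end }}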

Related

HELM: include named template inside tpl .Files.Get fails

structure of folders:
files/
  alertmanager/
    rules/
      alertmanager.rules
      nodes.rules
      ...
templates/
  _helpers.tpl
  prometheus.yaml
files/alertmanager/rules/alertmanager.rules
- name: Alertmanager
  rules:
  - alert: PrometheusAlertmanagerConfigurationReloadFailure
    expr: |
      # Without max_over_time, failed scrapes could create false negatives, see
      # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
      max_over_time(alertmanager_config_last_reload_successful{app_kubernetes_io_name="alertmanager"}[5m]) == 0
    for: 10m
    labels:
      severity: critical
      {{ include _default_rule_labels . }}
    annotations:
      type: Alertmanager
      summary: Prometheus AlertManager configuration reload failure
      description: |
        The error could be caused by recent changes, e.g. an incorrect configuration of the alertmanager template (defined in templates/prometheus.yaml),
        or by an incorrect route(s) configuration (typically in argocd/apps/values.yaml).
templates/_helpers.tpl
{{/*
Collect alertmanager rules from files
*/}}
{{- define "alertmanager.rules" -}}
{{- range $path, $_ := .Files.Glob "files/alertmanager/rules/**.rules" }}
{{ tpl ($.Files.Get $path) $ }}
{{- end }}
{{- end }}
{{/*
Set default alert rule labels
*/}}
{{- define "_default_rule_labels" -}}
environment: {{ .Values.environment }}
client: {{ .Values.client }}
cluster: {{ .Values.eks_cluster }}
sla: {{ .Values.sla }}
{{- end }}
templates/prometheus.yaml
# ? PROMETHEUS ALERT RULES
serverFiles:
  alerting_rules.yml:
    groups:
      {{- include "alertmanager.rules" . | nindent 14 }}
I'm getting the following error when trying to render the prometheus.yaml template:
✗ helm template . -s templates/prometheus.yaml
parse error at (root/templates/prometheus.yaml:15): function "_default_rule_labels" not defined
How should I approach this?
I can render {{ .Values.something }} inside files/alertmanager/rules/alertmanager.rules, but inclusion of named templates throws an error.
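One thing worth checking, given that exact error message: include expects the template name as a quoted string. Written as {{ include _default_rule_labels . }}, the template engine tries to resolve _default_rule_labels as a function, which is precisely what produces the 'function "_default_rule_labels" not defined' parse error. A sketch of the corrected call inside the rules file (the nindent value is a guess and depends on your final indentation):
    labels:
      severity: critical
      {{- include "_default_rule_labels" . | nindent 6 }}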

Transform a helm dict into a list

I'm using a Helm chart to control what environment variables are set for a certain container in a deployment.
In my Values.yaml, I have an entry called env which is a dictionary:
image:
  repository: xxxx.yyyyy.com/myimage
  pullPolicy: IfNotPresent
# Environment variables that will be passed to the container.
env: {}
Now, I'll pass variables to the env dict using --set:
helm upgrade mydeployment chart --set env.VARIABLE=test
However, this must be transformed into a list to adhere to the Kubernetes YAML schema:
spec:
  template:
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        # This should come from that dict
        env:
        - name: VARIABLE
          value: "test"
I don't know how to use the template language from Helm (sprig / go) to achieve that. Is it even possible?
To iterate through the map, the core Go text/template language provides a range keyword that works on both maps and arrays.
{{ range $key, $value := .Values.env }}
...
{{ end }}
Inside of this you can put arbitrary text. Helm doesn't require this to be any particular kind of YAML construct, just so long as the final result is valid YAML. For this setup a typical loop would look like
env:
{{- range $key, $value := .Values.env }}
- name: {{ quote $key }}
  value: {{ quote $value }}
{{- end }}
You do need to be careful with indentation here. As a rule of thumb it often will work to include a - "swallow whitespace" indicator inside the open {{ and to not include one inside the close }}. The - name: must be at least as indented as the env: above it (ignoring the range line), and value: must be aligned with name:. I might put all of the template-language lines (the range and end) starting at the first column, even if they're embedded in a structure that's nested more.
spec:
  template:
    spec:
      containers:
      - name: {{ template "chart.fullname" . }}
        env:
{{- range $key, $value := .Values.env }}
        - name: {{ quote $key }}
          value: {{ quote $value }}
{{- end }}
        image: {{ .Values.registry }}/{{ .Values.image }}:{{ .Values.tag }}
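For example, with --set env.VARIABLE=test the loop above renders the env: block as:
env:
- name: "VARIABLE"
  value: "test"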

Prometheus alertmanager slack notification newlines issue

I've defined an alert for my Kubernetes pods, as described below, to notify through Slack.
I used the example from the official documentation for ranging over all received alerts to loop over multiple alerts and render them in my Slack channel.
I do get notifications, but the newlines do not get rendered correctly somehow.
I'm new to Prometheus; any help is greatly appreciated.
Thanks.
detection:
  # Alert If:
  # 1. Pod is not in a running state.
  # 2. Container is killed because it's out of memory.
  # 3. Container is evicted.
  rules:
    groups:
    - name: not-running
      rules:
      - alert: PodNotRunning
        expr: kube_pod_status_phase{phase!="Running"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is not running."
          description: 'Kubernetes pod {{ $labels.pod }} is not running.'
      - alert: KubernetesContainerOOMKilledOrEvicted
        expr: kube_pod_container_status_last_terminated_reason{reason=~"OOMKilled|Evicted"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "kubernetes container killed/evicted (instance {{ $labels.instance }})"
          description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled/Evicted."
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 3m
  repeat_interval: 4h
  receiver: slack-channel
  routes:
  - match:
      alertname: PodNotRunning
  - match:
      alertname: KubernetesContainerOOMKilledOrEvicted
notifications:
  receivers:
  - name: slack-channel
    slack_configs:
    - channel: kube-alerts
      title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
      text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
How it gets rendered in the actual Slack channel:
Title: inst-1 down.\ninst-2 down.\ninst-3 down.\ninst-4 down.
Text: inst-1 down.\ninst-2 down.\ninst-3 down.\ninst-4 down
How I thought it would render:
Title: inst-1 down.
Text: inst-1 down.
Title: inst-2 down.
Text: inst-2 down.
Title: inst-3 down.
Text: inst-3 down.
Title: inst-4 down.
Text: inst-4 down.
Use {{ "\n" }} instead of plain \n
example:
...
slack_configs:
- channel: kube-alerts
  title: "{{ range .Alerts }}{{ .Annotations.summary }}{{ "\n" }}{{ end }}"
  text: "{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}"
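If escaping inside a double-quoted YAML string keeps getting in the way, another option is a YAML block scalar (the Slack config near the top of this page uses the folded >- form); a literal block scalar (|-) keeps line breaks exactly as written, so no escapes are needed at all. A sketch using the same receiver:
slack_configs:
- channel: kube-alerts
  title: '{{ .CommonLabels.alertname }}'
  text: |-
    {{ range .Alerts }}{{ .Annotations.description }}
    {{ end }}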

Is it possible to avoid sending repeated Slack notifications for already fired alert?

Disclaimer: this is my first time using Prometheus.
I am trying to send a Slack notification every time a Job ends successfully.
To achieve this, I installed kube-state-metrics, Prometheus and AlertManager.
Then I created the following rule:
rules:
- alert: KubeJobCompleted
  annotations:
    identifier: '{{ $labels.instance }}'
    summary: Job Completed Successfully
    description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
  expr: |
    kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
  labels:
    severity: information
And added the AlertManager receiver text (template) :
{{ define "custom_slack_message" }}
{{ range .Alerts }}
{{ .Annotations.description }}
{{ end }}
{{ end }}
My current result: every time a new job completes successfully, I receive a Slack notification with the list of all Jobs that completed successfully.
I don't mind receiving the whole list at first, but after that I would like to receive notifications that contain only the newly completed job(s) within the specified group interval.
Is it possible?
Just add one extra line to the rule, for: 10m, which will list just the recently completed job(s) within 10 minutes:
rules:
- alert: KubeJobCompleted
  annotations:
    identifier: '{{ $labels.instance }}'
    summary: Job Completed Successfully
    description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
  expr: |
    kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
  for: 10m
  labels:
    severity: information
I ended up using kube_job_status_completion_time and time() to dismiss past events (to avoid the alert re-firing on each repeat interval):
rules:
- alert: KubeJobCompleted
  annotations:
    identifier: '{{ $labels.instance }}'
    summary: Job Completed Successfully
    description: Job *{{ $labels.namespace }}/{{ $labels.job_name }}* is completed successfully.
  expr: |
    time() - kube_job_status_completion_time < 60 and kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} == 0
  labels:
    severity: information

How to store the status of an expr in alert rules to use that in annotations?

I am setting up Prometheus alerts for whenever a node goes into "NotReady" in my Kubernetes cluster. I get notified on Slack whenever that happens. The problem is that I get notified with the same description "Node xxxx is in NotReady" even when the node comes back up. I am trying to use a variable for the ready status of the node and use that in the annotations part.
I have tried using "vars" and "when" to assign it to a variable to use it in annotations.
- name: NodeNotReady
  rules:
  - alert: K8SNodeNotReadyAlert
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 3m
    vars:
    - ready_status: "Ready"
      when: kube_node_status_condition{condition="Ready",status="true"} == 1
    - ready_status: "Not Ready"
      when: kube_node_status_condition{condition="Ready",status="true"} == 0
    labels:
      severity: warning
    annotations:
      description: Node {{ $labels.node }} status is in {{ ready_status }}.
      summary: Node status {{ ready_status }} Alert!
I want to get these alerts:
1. When the node is NotReady: "Node prom-node status is in NotReady."
2. When the node is Ready: "Node prom-node status is in Ready."
What you're looking for is the $value template variable. You should end up with something like this in the description:
Node {{ $labels.node }} status is in {{ if eq $value 1 }} Ready {{ else }} Not Ready {{ end }} status.
It's also worth reading up on alerting best practices before creating more alerts.
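For completeness, here is a sketch of that conditional inside a full rule, reusing names from the question. One caveat, which is my observation rather than part of the original answer: because the expression compares against == 0, $value is 0 whenever the alert fires, so the Ready branch is effectively never rendered here; the recovered state is usually better conveyed by Alertmanager's send_resolved notifications. Also, $value is a float, so 1.0 is the safer comparison value.
groups:
- name: NodeNotReady
  rules:
  - alert: K8SNodeNotReadyAlert
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: 'Node {{ $labels.node }} readiness alert'
      description: >-
        Node {{ $labels.node }} status is in
        {{ if eq $value 1.0 }}Ready{{ else }}Not Ready{{ end }} status.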