I am trying to implement alerting using grafana and prometheus.
As Grafana does not allow template variables in metrics to be used in alerting, I am currently forced to hardcode the IP's if I want to collect the memory metrics.
But that's not a solution that can long last, as the nodes in my setup can terminate and get recreated as auto-scaling is enabled.
Is there any better alternative than hardcoding each instance IP in the metric and still enable alerting on memory usage of each node?
Any help will be really appreciated.
Yeah, that's why we've given up on using alerts in Grafana and decided to use Alertmanager. For that you'll need to create alert rules and add them to PrometheusRule resource on the cluster and configure alertmanager itself.
if you can figure out how to add your required info into labels, you can reference labels in your alert message using the template like so:
{{$labels.instance}}
Anything that's reported in the instance as a label should be available, however, it's only available if the alert ends in a math expression. It isn't available for alerts that use a classic expression.
Related
I am currently setting up grafana alerts. How do I customize my message template so my alert email shows The ip address of the server, the state of the server and the node/instance?
Thank you.
I figured it out once, then recently I updated my grafana instance that wiped my work and I had to figure it out again. It was tough the first time.
You can use the labels that are made available through prometheus in your summary and description sections in your alerts by using the syntax:
{{$labels.instance}}
{{$labels.value}}
https://prometheus.io/docs/prometheus/latest/configuration/template_examples/
The only catch is that you have to use Math expression in the last condition in your alert rule for the labels to be available in the Summary section of the alert.
For example, in our personal alerts we will use something like:
Machine {{$labels.instance}} is not reporting status via win-exporter.
The machine could be offline or the service could be stopped.
I would like to create a Prometheus alert that notifies me via Alertmanager when my Kubernetes volumes are for example 80% or 90% occupied. Currently I am using the following expression:
100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="pvc-name"} * kubelet_volume_stats_available_bytes{persistentvolumeclaim="pvc-name"} > 80
The problem is, however, that I have to create the same alarm again in a slightly modified form for each claim. If in addition the name of a claim changes I have to adjust the rules as well.
Question: Is there a clever way to create an alert that takes into account all available claims? So that there is no need to change the alert if a claim changes.
Kind regards!
I am currently working on a solution for alerts/notifications where we have microservices deployed on kubernetes in a way of frontend and back end services. There has been multiple occasions where backend services are not able to restart or reach a 90% allocated pod limit, if they encounter memory exhaust. To identify such pods we want an alert mechanism to lookin when they fail or saturation level. We have prometheus and grafana as monitoring services but are not able to configure alerts, as i have quite a limited knowledge in these, however any suggestions and references provided where i can have detailed way on achieving this will be helpful. Please do let me know
I did try it out on the internet for such ,but almost all are pointing to node level ,cluster level monitoring only. :(
enter image description here
The Query used to check the memory usage is :
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",namespace=~"^$namespace$",pod_name=~"^$deployment-[a-z0-9]+-[a-z0-9]+"}) by (pod_name)
I saw this recently on google. It might be helpful to you. https://groups.google.com/u/1/g/prometheus-users/c/1n_z3cmDEXE?pli=1
I have a stackdriver log based metric tracking GKE pod restarts.
I'd like to alert via email if the number of alerts breaches a predefined threshold.
I'm unsure as what thresholds I need to set inroder to trigger the alert via stackdriver. I have three pods via deployed service.
You should use the Logs Viewer and create a filter:
As a resource you should choose GKE Cluster Operations and add a filter.
Filter might look like this:
resource.type="k8s_cluster"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.location="<CLUSTR_LOCATION>"
jsonPayload.reason="Killing"
After that create a custom metric by clicking on Create metric button.
Then you can Create alert from metric by clicking on created metric in Logs-based metrics.
Then setting up a Configuration for triggers and conditions and threshold.
As for the correct Threshold, I would take the average amount of restarts from past time period and make it a bit more for alerting.
GKE is already sending to Stackdriver a metric called: container/restart_count. You just need to create an alert policy as described on Managing alerting policies. As per the official doc, this metric expose:
Number of times the container has restarted. Sampled every 60 seconds.
I'm using Kubernetes on Google Compute Engine and Stackdriver. The Kubernetes metrics show up in Stackdriver as custom metrics. I successfully set up a dashboard with charts that show a few custom metrics such as "node cpu reservation". I can even set up an aggregate mean of all node cpu reservations to see if my total Kubernetes cluster CPU reservation is getting too high. See screenshot.
My problem is, I can't seem to set up an alert on the mean of a custom metric. I can set up an alert on each node, but that isn't what I want. I can also set up "Group Aggregate Threshold Condition", but custom metrics don't seem to work for that. Notice how "Custom Metric" is not in the dropdown.
Is there a way to set an alert for an aggregate of a custom metric? If not, is there some way I can alert when my Kubernetes cluster is getting too high on CPU reservation?
alerting on an aggregation of custom metrics is currently not available in Stackdriver. We are considering various solutions to the problem you're facing.
Note that sometimes it's possible to alert directly on symptoms of the problem rather than monitoring the underlying resources. For example, if you're concerned about cpu because X happens and users notice, and X is bad - you could consider alerting on symptoms of X instead of alerting on cpu.