Stackdriver custom metric aggregate alerts - kubernetes

I'm using Kubernetes on Google Compute Engine and Stackdriver. The Kubernetes metrics show up in Stackdriver as custom metrics. I successfully set up a dashboard with charts that show a few custom metrics such as "node cpu reservation". I can even set up an aggregate mean of all node cpu reservations to see if my total Kubernetes cluster CPU reservation is getting too high. See screenshot.
My problem is, I can't seem to set up an alert on the mean of a custom metric. I can set up an alert on each node, but that isn't what I want. I can also set up "Group Aggregate Threshold Condition", but custom metrics don't seem to work for that. Notice how "Custom Metric" is not in the dropdown.
Is there a way to set an alert for an aggregate of a custom metric? If not, is there some way I can alert when my Kubernetes cluster is getting too high on CPU reservation?

Alerting on an aggregation of custom metrics is currently not available in Stackdriver. We are considering various solutions to the problem you're facing.
Note that it's sometimes possible to alert directly on symptoms of the problem rather than monitoring the underlying resources. For example, if you're concerned about CPU because X happens, users notice, and X is bad, you could consider alerting on symptoms of X instead of alerting on CPU.

Related

Prometheus alerting rules that cover each Kubernetes volume's utilization

I would like to create a Prometheus alert that notifies me via Alertmanager when my Kubernetes volumes are, for example, 80% or 90% full. Currently I am using the following expression:
100 * (1 - kubelet_volume_stats_available_bytes{persistentvolumeclaim="pvc-name"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="pvc-name"}) > 80
The problem, however, is that I have to recreate essentially the same alert in slightly modified form for each claim, and if the name of a claim changes, I have to adjust the rules as well.
Question: is there a clever way to create one alert that takes all available claims into account, so that nothing needs to change when a claim changes?
Kind regards!
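One common pattern (a sketch, assuming the standard kubelet volume metrics and a rule file loaded by Prometheus; the group, alert, and severity names below are placeholders): drop the persistentvolumeclaim matchers entirely, and Prometheus evaluates the expression once per claim, so a single rule covers every claim and survives renames. The claim name is then available as a label for Alertmanager routing and notification templates:

```yaml
groups:
  - name: volume-alerts
    rules:
      - alert: VolumeAlmostFull
        # fires once per persistentvolumeclaim label value;
        # compares used space (100 minus the free percentage) to the threshold
        expr: 100 * (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'PVC {{ $labels.persistentvolumeclaim }} is more than 80% full'
```

You can add label matchers (e.g. a namespace regex) inside the metric selectors if you only want a subset of claims covered.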

Way to configure notifications/alerts for a Kubernetes pod that is reaching 90% memory and is not exposed to the internet (backend microservice)

I am currently working on an alerting/notification solution for microservices deployed on Kubernetes as frontend and backend services. On multiple occasions, backend services have hit 90% of their allocated pod memory limit, or failed to restart after memory exhaustion. To identify such pods, we want an alerting mechanism that flags them when they fail or approach saturation. We have Prometheus and Grafana as monitoring services but have not been able to configure alerts, as my knowledge here is quite limited; any suggestions or references describing how to achieve this in detail would be helpful.
I did try searching the internet for this, but almost everything points to node-level or cluster-level monitoring only. :(
The Query used to check the memory usage is :
sum (container_memory_working_set_bytes{image!="",name=~"^k8s_.*",namespace=~"^$namespace$",pod_name=~"^$deployment-[a-z0-9]+-[a-z0-9]+"}) by (pod_name)
I saw this recently on Google; it might be helpful to you. https://groups.google.com/u/1/g/prometheus-users/c/1n_z3cmDEXE?pli=1
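As a sketch of a per-pod alert on memory relative to the pod's limit (assuming kube-state-metrics is installed for kube_pod_container_resource_limits; also note that newer cAdvisor versions label the memory metric with pod/container rather than the older pod_name/container_name used in the query above, so adjust the label names to whatever your Prometheus returns):

```yaml
groups:
  - name: pod-memory
    rules:
      - alert: PodMemoryNearLimit
        # working-set memory as a fraction of the container memory limit
        expr: |
          sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
            /
          sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod)
            > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'Pod {{ $labels.pod }} in {{ $labels.namespace }} is above 90% of its memory limit'
```

Load this as a Prometheus rule file (or a PrometheusRule resource if you run the Prometheus Operator) and route the alert through Alertmanager to email/Slack; this works regardless of whether the pod is internet-facing, since it only reads cluster metrics.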

Track time taken to scale up in Kubernetes using HPA and CA

I am trying to track and monitor how much time a pod takes to come online/healthy/Running.
I am using EKS. And I have got HPA and cluster-autoscaler installed on my cluster.
Let's say I have a deployment with a HorizontalPodAutoscaler scaling policy with a 70% targetAverageUtilization.
So whenever the deployment's average utilization goes beyond 70%, the HPA will trigger creation of a new pod. Now, depending on different factors (whether a node is available, and whether the image is already cached or needs to be downloaded), scaling can take anywhere from a few seconds to a few minutes.
I want to track this duration: every time a pod is scheduled, how much time it takes to reach the Running state. Any suggestions, or any direction I should be looking in?
I found this Cluster Autoscaler Visibility Logs feature, but it is only available on GKE.
I am looking for any solution, can be out-of-the-box integration, or raising events and storing them in some time-series DB or scraping data from Prometheus. But I couldn't find any solution for this till now.
Thanks in advance.
There is nothing out of the box for this; you will need to build something yourself.
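One way to build it is to watch pods and compute the gap between the pod's creation and its Ready condition flipping to True. A minimal sketch (the function name and the sample manifest are hypothetical, but metadata.creationTimestamp and the Ready condition are standard Pod API fields):

```python
from datetime import datetime, timezone
from typing import Optional


def pod_startup_seconds(pod: dict) -> Optional[float]:
    """Seconds from pod creation until its Ready condition became True,
    or None if the pod never became Ready.  `pod` is a Pod manifest as a
    dict, as returned by the Kubernetes API."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(
        pod["metadata"]["creationTimestamp"], fmt).replace(tzinfo=timezone.utc)
    for cond in pod.get("status", {}).get("conditions", []):
        if cond["type"] == "Ready" and cond["status"] == "True":
            ready = datetime.strptime(
                cond["lastTransitionTime"], fmt).replace(tzinfo=timezone.utc)
            return (ready - created).total_seconds()
    return None


# Sample manifest (hypothetical timestamps) showing the fields used above.
sample_pod = {
    "metadata": {"creationTimestamp": "2023-05-01T10:00:00Z"},
    "status": {"conditions": [{
        "type": "Ready",
        "status": "True",
        "lastTransitionTime": "2023-05-01T10:00:45Z",
    }]},
}
print(pod_startup_seconds(sample_pod))  # 45.0
```

In practice you'd feed this from a watch on the Pods API (e.g. via the official Kubernetes Python client) and push the durations to a time-series DB. Alternatively, if your kube-state-metrics version exposes both kube_pod_created and kube_pod_status_ready_time, subtracting them in PromQL gives a similar measure without custom code.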

Kubernetes pod restart alert via stackdriver

I have a stackdriver log based metric tracking GKE pod restarts.
I'd like to alert via email if the number of restarts breaches a predefined threshold.
I'm unsure what threshold I need to set in order to trigger the alert via Stackdriver. I have three pods in the deployed service.
You should use the Logs Viewer and create a filter:
As a resource you should choose GKE Cluster Operations and add a filter.
Filter might look like this:
resource.type="k8s_cluster"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.location="<CLUSTER_LOCATION>"
jsonPayload.reason="Killing"
After that, create a custom metric by clicking the Create metric button.
Then you can create an alert from the metric by clicking on the created metric under Logs-based metrics, and set up the configuration for triggers, conditions, and thresholds.
As for the correct threshold, I would take the average number of restarts over a past time period and set the alert slightly above it.
GKE already sends Stackdriver a metric called container/restart_count. You just need to create an alerting policy as described in Managing alerting policies. Per the official docs, this metric exposes:
Number of times the container has restarted. Sampled every 60 seconds.
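A condition on that built-in metric can be sketched as a Cloud Monitoring filter (the metric and resource types below come from the public metric list, but treat the exact label names as an assumption to verify in your project):

```
resource.type = "k8s_container"
resource.labels.cluster_name = "<CLUSTER_NAME>"
metric.type = "kubernetes.io/container/restart_count"
```

Since restart_count is cumulative, align it as a rate or delta over a window (say, 10 minutes) in the alert condition and trigger when the delta exceeds your chosen threshold.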

Alternatives to hardcoding parameters in alert metrics in grafana

I am trying to implement alerting using grafana and prometheus.
As Grafana does not allow template variables in the metrics used for alerting, I am currently forced to hardcode the IPs if I want to collect the memory metrics.
But that's not a solution that can last, as the nodes in my setup can terminate and get recreated because auto-scaling is enabled.
Is there any better alternative than hardcoding each instance IP in the metric and still enable alerting on memory usage of each node?
Any help will be really appreciated.
Yeah, that's why we've given up on using alerts in Grafana and decided to use Alertmanager. For that you'll need to create alert rules, add them to a PrometheusRule resource on the cluster, and configure Alertmanager itself.
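A minimal PrometheusRule sketch (assuming the Prometheus Operator and node_exporter metrics; the resource name, alert name, and the release label selector are placeholders to adapt to your deployment). Because the expression carries no hardcoded IPs, it evaluates per instance and automatically covers nodes created by auto-scaling:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alerts
  labels:
    release: prometheus   # must match your Prometheus Operator's rule selector
spec:
  groups:
    - name: node-memory
      rules:
        - alert: NodeMemoryHigh
          # node_exporter metrics; fires once per scraped instance
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
          for: 10m
          annotations:
            summary: 'Node {{ $labels.instance }} memory usage above 90%'
```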
If you can figure out how to add your required info into labels, you can reference labels in your alert message using a template like so:
{{$labels.instance}}
Any label reported on the instance should be available. However, it's only available if the alert ends in a math expression; it isn't available for alerts that use a classic condition.