I have a stackdriver log based metric tracking GKE pod restarts.
I'd like to alert via email if the number of alerts breaches a predefined threshold.
I'm unsure as what thresholds I need to set inroder to trigger the alert via stackdriver. I have three pods via deployed service.
You should use the Logs Viewer and create a filter:
As a resource you should choose GKE Cluster Operations and add a filter.
Filter might look like this:
resource.type="k8s_cluster"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.location="<CLUSTR_LOCATION>"
jsonPayload.reason="Killing"
After that create a custom metric by clicking on Create metric button.
Then you can Create alert from metric by clicking on created metric in Logs-based metrics.
Then setting up a Configuration for triggers and conditions and threshold.
As for the correct Threshold, I would take the average amount of restarts from past time period and make it a bit more for alerting.
GKE is already sending to Stackdriver a metric called: container/restart_count. You just need to create an alert policy as described on Managing alerting policies. As per the official doc, this metric expose:
Number of times the container has restarted. Sampled every 60 seconds.
Related
I would like to create a Prometheus alert that notifies me via Alertmanager when my Kubernetes volumes are for example 80% or 90% occupied. Currently I am using the following expression:
100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="pvc-name"} * kubelet_volume_stats_available_bytes{persistentvolumeclaim="pvc-name"} > 80
The problem is, however, that I have to create the same alarm again in a slightly modified form for each claim. If in addition the name of a claim changes I have to adjust the rules as well.
Question: Is there a clever way to create an alert that takes into account all available claims? So that there is no need to change the alert if a claim changes.
Kind regards!
My metrics are scraping every 30seconds, even though I've specified a 10s interval when defining my servicemonitor.
I've created a servicemonitor for my exporter that seems to be working well. I can see my exporter as a target, and view metrics on the /graph endpoint. However, when on the "targets" page, the "last scrape" is showing the interval as 30s (I refresh the page to see how high the # of seconds will go up to, it was 30). Sure enough, zooming in on the graph shows that metrics are coming in every 30 seconds.
I've set my servicemonitor's interval to 10s, which should override any other intervals. Why is it being ignored?
endpoints:
- port: http-metrics
scheme: http
interval: 10s
First: Double check if you've changed the ServiceMonitor you need changed and if you are looking at scrapes from your ServiceMonitor.
Go to the web UI of your prometheus and select Status -> Configuration.
Now try to find part of the config that prometheus operator created (based on ServiceMonitor config). Probably looking by servicemonitor name will work - there should be a section with job_name containing your servicemonitor name.
Now look at the scrape_interval value in this section. If it is "30s" (or anything else that is not the expected "10s") and you are sure you're looking at the correct section then it means one of those things happened:
your ServiceMonitor does not really contain "10s" - maybe it was not applied correctly? Verify the live object in your cluster
prometheus-operator did not update Prometheus configuration - maybe it is not working? or is crashing? or just silently stopped working? It is quite safe just to restart the prometheus-operator pod, maybe it is worth trying.
prometheus did not load the new config correctly? prometheus operator updates a secret and when it is changed sidecar in prometheus pod triggers reload in prometheus. Maybe it didn't work? Look again in the Web UI in Status -> Runtime & Build information for "Configuration reload". Is it Successful? Does the "Last successful configuration reload" time roughly matches your change in servicemonitor? If it was not "Successful" then maybe some other change made the final promethuus config incorrect and it is unable to load it?
If I understand correctly, as per Configure Grafana Query Interval with Prometheus, that works as expected
This works as intended. This is because Grafana choses a step size,
and until that step is full, it won't display new data, or rather
Prometheus won't return a new data point. So this all works as
expected. You would need to align the step size to be something
smaller in order to see more data more quickly. The Prometheus
interface dynamically sets the step, which is why you get a different
result there.
It's a configuration of each Grafana panel
Grafana itself doesn't recommend to use intervals less then 30sec, you can find a reasonable explanation in Dashboard refresh rate issues part of Grafana Troubleshoot dashboards documentation page
Dashboard refresh rate issues By default, Grafana queries your data
source every 30 seconds. Setting a low refresh rate on your dashboards
puts unnecessary stress on the backend. In many cases, querying this
frequently makes no sense, because the data isn’t being sent to the
system such that changes would be seen.
We recommend the following:
Do not enable auto-refreshing on dashboards, panels, or variables unless you need it. Users can refresh their browser manually, or you
can set the refresh rate for a time period that makes sense (every ten
minutes, every hour, and so on).
If it is required, then set the refresh rate to once a minute. Again, users can always refresh the dashboard manually.
If your dashboard has a longer time period (such as a week), then you really don’t need automated refreshing.
For more information you can also visit still opened Specify desired step in Prometheus in dashboard panels github issue.
I am trying to track and monitor, how much time does a pod take to come online/healthy/Running.
I am using EKS. And I have got HPA and cluster-autoscaler installed on my cluster.
Let's say I have a deployment with HorizontalPodAutoscaler scaling policy with 70% targetAverageUtilization.
So whenever the average utilization of deployment will go beyond 70%, HPA will trigger to create new POD. Now, based on different factors, like if nodes are available or not, and if not is already available, then the image needs to be downloaded or is it present on cache, the scaling can take from few seconds to few minutes to come up.
I want to track this time/duration, every time the POD is scheduled, how much time does it take to come to Running state. Any suggestions?
Or any direction where I should be looking at.
I found this Cluster Autoscaler Visibility Logs, but this is only available in GCE.
I am looking for any solution, can be out-of-the-box integration, or raising events and storing them in some time-series DB or scraping data from Prometheus. But I couldn't find any solution for this till now.
Thanks in advance.
There is nothing out of the box for this, you will need to build something yourself.
I am trying to implement alerting using grafana and prometheus.
As Grafana does not allow template variables in metrics to be used in alerting, I am currently forced to hardcode the IP's if I want to collect the memory metrics.
But that's not a solution that can long last, as the nodes in my setup can terminate and get recreated as auto-scaling is enabled.
Is there any better alternative than hardcoding each instance IP in the metric and still enable alerting on memory usage of each node?
Any help will be really appreciated.
Yeah, that's why we've given up on using alerts in Grafana and decided to use Alertmanager. For that you'll need to create alert rules and add them to PrometheusRule resource on the cluster and configure alertmanager itself.
if you can figure out how to add your required info into labels, you can reference labels in your alert message using the template like so:
{{$labels.instance}}
Anything that's reported in the instance as a label should be available, however, it's only available if the alert ends in a math expression. It isn't available for alerts that use a classic expression.
I'm using Kubernetes on Google Compute Engine and Stackdriver. The Kubernetes metrics show up in Stackdriver as custom metrics. I successfully set up a dashboard with charts that show a few custom metrics such as "node cpu reservation". I can even set up an aggregate mean of all node cpu reservations to see if my total Kubernetes cluster CPU reservation is getting too high. See screenshot.
My problem is, I can't seem to set up an alert on the mean of a custom metric. I can set up an alert on each node, but that isn't what I want. I can also set up "Group Aggregate Threshold Condition", but custom metrics don't seem to work for that. Notice how "Custom Metric" is not in the dropdown.
Is there a way to set an alert for an aggregate of a custom metric? If not, is there some way I can alert when my Kubernetes cluster is getting too high on CPU reservation?
alerting on an aggregation of custom metrics is currently not available in Stackdriver. We are considering various solutions to the problem you're facing.
Note that sometimes it's possible to alert directly on symptoms of the problem rather than monitoring the underlying resources. For example, if you're concerned about cpu because X happens and users notice, and X is bad - you could consider alerting on symptoms of X instead of alerting on cpu.