Prometheus ServiceMonitor interval is being ignored - Kubernetes

My metrics are being scraped every 30 seconds, even though I've specified a 10s interval when defining my ServiceMonitor.
I've created a ServiceMonitor for my exporter that seems to be working well. I can see my exporter as a target and view metrics on the /graph endpoint. However, on the "Targets" page, the "Last Scrape" column shows the interval as 30s (I kept refreshing the page to see how high the number of seconds would climb; it topped out at 30). Sure enough, zooming in on the graph shows that metrics are coming in every 30 seconds.
I've set my ServiceMonitor's interval to 10s, which should override any other intervals. Why is it being ignored?
endpoints:
- port: http-metrics
  scheme: http
  interval: 10s

First: double-check that you've changed the ServiceMonitor you actually needed to change, and that you are looking at scrapes coming from that ServiceMonitor.
Go to the web UI of your Prometheus and select Status -> Configuration.
Now try to find the part of the config that the prometheus-operator created (based on the ServiceMonitor). Searching by the ServiceMonitor name usually works: there should be a section whose job_name contains your ServiceMonitor name.
Now look at the scrape_interval value in that section. If it is "30s" (or anything else that is not the expected "10s") and you are sure you're looking at the correct section, then one of these things happened:
your ServiceMonitor does not really contain "10s" - maybe it was not applied correctly? Verify the live object in your cluster (a reference manifest sketch follows this list).
prometheus-operator did not update the Prometheus configuration - maybe it is not running, is crashing, or has silently stopped working? It is quite safe to simply restart the prometheus-operator pod, so that may be worth trying.
Prometheus did not load the new config correctly? prometheus-operator updates a Secret, and when it changes, a sidecar in the Prometheus pod triggers a reload in Prometheus. Maybe that didn't work? Look again in the web UI under Status -> Runtime & Build Information for "Configuration reload". Is it Successful? Does the "Last successful configuration reload" time roughly match your ServiceMonitor change? If it was not "Successful", then maybe some other change made the final Prometheus config invalid and it is unable to load it.
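For reference, here is a minimal ServiceMonitor manifest with the 10s interval in place, to compare against the live object in the cluster. The names, namespace, and labels are placeholders; in particular, the labels must match whatever serviceMonitorSelector your Prometheus resource uses:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-exporter          # hypothetical name
  namespace: monitoring      # hypothetical namespace
  labels:
    release: prometheus      # assumption: must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-exporter       # assumption: must match the labels on the exporter's Service
  endpoints:
  - port: http-metrics
    scheme: http
    interval: 10s            # per-endpoint scrape interval; this is what should appear as scrape_interval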

If I understand correctly, then as explained in Configure Grafana Query Interval with Prometheus, this works as expected:
This works as intended. This is because Grafana chooses a step size, and until that step is full, it won't display new data, or rather Prometheus won't return a new data point. So this all works as expected. You would need to align the step size to be something smaller in order to see more data more quickly. The Prometheus interface dynamically sets the step, which is why you get a different result there.
It's a configuration of each Grafana panel (the query's minimum interval / step).
Grafana itself doesn't recommend using intervals of less than 30 seconds; you can find a reasonable explanation in the "Dashboard refresh rate issues" part of Grafana's "Troubleshoot dashboards" documentation page:
Dashboard refresh rate issues
By default, Grafana queries your data source every 30 seconds. Setting a low refresh rate on your dashboards puts unnecessary stress on the backend. In many cases, querying this frequently makes no sense, because the data isn't being sent to the system such that changes would be seen.
We recommend the following:
Do not enable auto-refreshing on dashboards, panels, or variables unless you need it. Users can refresh their browser manually, or you can set the refresh rate for a time period that makes sense (every ten minutes, every hour, and so on).
If it is required, then set the refresh rate to once a minute. Again, users can always refresh the dashboard manually.
If your dashboard has a longer time period (such as a week), then you really don’t need automated refreshing.
For more information you can also visit the still-open "Specify desired step in Prometheus in dashboard panels" GitHub issue.
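If the goal is for Grafana's automatic step to follow the 10s scrape interval, the Prometheus data source's "Scrape interval" setting can be aligned with it. A minimal data source provisioning sketch, assuming Grafana's standard provisioning format; the name and URL are placeholders:
apiVersion: 1
datasources:
- name: Prometheus              # hypothetical data source name
  type: prometheus
  url: http://prometheus:9090   # placeholder URL
  jsonData:
    timeInterval: 10s           # lower bound for the step Grafana calculates ($__interval)
With this in place, panels that use the default interval will not request a step smaller than the actual scrape interval, and zoomed-in graphs line up with roughly one point per scrape.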

Related

Prometheus alerting rules that include each Kubernetes volume's memory utilization

I would like to create a Prometheus alert that notifies me via Alertmanager when my Kubernetes volumes are, for example, 80% or 90% occupied. Currently I am using the following expression:
100 / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="pvc-name"} * kubelet_volume_stats_available_bytes{persistentvolumeclaim="pvc-name"} > 80
The problem, however, is that I have to create the same alert again, in a slightly modified form, for each claim. And if the name of a claim changes, I have to adjust the rules as well.
Question: Is there a clever way to create an alert that takes all available claims into account, so that there is no need to change the alert when a claim changes?
Kind regards!
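One way this is commonly handled is to drop the persistentvolumeclaim matcher entirely: the expression then returns one series per claim, and Prometheus fires a separate alert for each label combination, so new or renamed claims are picked up automatically. Below is a sketch as a prometheus-operator PrometheusRule; the resource name, threshold, and annotation text are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage-alerts                      # hypothetical name
spec:
  groups:
  - name: pvc-usage
    rules:
    - alert: PersistentVolumeClaimAlmostFull  # illustrative alert name
      # No persistentvolumeclaim matcher: one series (and one alert) per claim.
      expr: 100 * (1 - kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) > 80
      for: 5m
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is over 80% full"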

Grafana query when dashboard is not visible

I'm wondering how Grafana fires its queries (to the datasource).
Does it only fire queries when the dashboard/panel is visible, or does it keep firing queries based on the default time period of all dashboards?
My assumption was that since Grafana itself doesn't store data, the queries would be made on an as-needed basis, but I've been seeing HTTP requests occur periodically from my VM.
This is relevant for data sources such as CloudWatch, where each API call can be charged.
Yes, you are correct. Grafana dashboards/panels fire queries only when they are visible (loaded in the browser).
But you may have alert rules in Grafana, and those are also evaluated periodically. So I would guess the alerts are the source of your queries.

Is there a way to make dashboards with Prometheus datasources responsive/loadable even when one of the datasources is down?

Problem:
I have a dashboard in Grafana which monitors the health of my monitoring services: Prometheus instances, Alertmanagers, Pushgateways and Grafana itself. It shows a simple Up/Down status for these services in Singlestat panels.
When one of my Prometheus instances (I have one in each datacenter) is down, the Singlestat panel backed by that Prometheus as a datasource keeps loading for 30s until it shows "Request error".
It is even worse when I want to have only one panel per Prometheus instance and combine results from all the Prometheus instances that monitor it (the Prometheus instances in my setup monitor each other). For this I use the --Mixed-- data source, and in this case, when one of the used datasources is down, the Singlestat panel loads forever; and since the down datasource is used in all my Singlestat panels for the Prometheus instances, all of these panels load forever.
Also, when one of the Prometheus instances stops working, some Grafana pages take a very long time to load:
Configuration -> Datasources
and
Dashboards -> Home.
But this does not always happen; sometimes they load normally.
Investigations:
I investigated the Query timeout setting in the Grafana datasource (set it to 1s), but it had no effect on this problem.
I have also tried adding a datasource variable. It solves the problem only partially, and I am not satisfied with it:
I have a combo box with datasources in the dashboard and a Singlestat panel for each Prometheus backed by this variable datasource. The problem is that I have to step through all the Prometheus instances in the combo box to see the whole picture of the Prometheus services.
Similarly, it is possible to create Singlestat panels for all combinations of datasources and Prometheus instances (in my case 3 x 3 panels), but that is not intuitive and gets worse and worse with each Prometheus server I add in the future.
Question:
Is there any way to handle unreachable datasources so that dashboards continue to work?
Maybe I need to add some component to my setup, but I think this should be handled in Grafana (although it seems that is not possible).

Kubernetes pod restart alert via stackdriver

I have a stackdriver log based metric tracking GKE pod restarts.
I'd like to alert via email if the number of restarts breaches a predefined threshold.
I'm unsure as to what thresholds I need to set in order to trigger the alert via Stackdriver. I have three pods in the deployed service.
You should use the Logs Viewer and create a filter:
As a resource you should choose GKE Cluster Operations and add a filter.
The filter might look like this:
resource.type="k8s_cluster"
resource.labels.cluster_name="<CLUSTER_NAME>"
resource.labels.location="<CLUSTER_LOCATION>"
jsonPayload.reason="Killing"
After that, create a custom metric by clicking the Create metric button.
Then you can create an alert from the metric by clicking on the created metric under Logs-based metrics.
Then set up the configuration for triggers, conditions and the threshold.
As for the correct threshold, I would take the average number of restarts over a past time period and set the alert a bit above that.
GKE already sends Stackdriver a metric called container/restart_count. You just need to create an alert policy as described in Managing alerting policies. As per the official docs, this metric exposes:
Number of times the container has restarted. Sampled every 60 seconds.
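If you go with the built-in restart_count metric, the alert policy can also be defined declaratively and created with gcloud (for example, gcloud alpha monitoring policies create --policy-from-file=policy.yaml). Here is a sketch using the Cloud Monitoring AlertPolicy fields; the display names, alignment period and threshold are illustrative, and you would still attach your email notification channel:
displayName: "GKE container restarts"         # illustrative name
combiner: OR
conditions:
- displayName: "Too many container restarts"
  conditionThreshold:
    filter: >
      metric.type = "kubernetes.io/container/restart_count"
      AND resource.type = "k8s_container"
    aggregations:
    - alignmentPeriod: 300s
      perSeriesAligner: ALIGN_DELTA           # restarts per 5-minute window
    comparison: COMPARISON_GT
    thresholdValue: 3                         # illustrative: more than 3 restarts in the window
    duration: 0s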

Alternatives to hardcoding parameters in alert metrics in Grafana

I am trying to implement alerting using Grafana and Prometheus.
As Grafana does not allow template variables in metrics to be used in alerting, I am currently forced to hardcode the IPs if I want to collect the memory metrics.
But that is not a solution that can last long, as the nodes in my setup can be terminated and recreated because auto-scaling is enabled.
Is there any better alternative to hardcoding each instance IP in the metric that still enables alerting on the memory usage of each node?
Any help will be really appreciated.
Yeah, that's why we gave up on using alerts in Grafana and decided to use Alertmanager. For that you'll need to create alert rules, add them to a PrometheusRule resource on the cluster, and configure Alertmanager itself; a sketch of such a rule follows.
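A minimal sketch of such a rule, assuming node_exporter metrics are being scraped; the resource name, alert name, threshold and annotation text are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-alerts            # hypothetical name
spec:
  groups:
  - name: node-memory
    rules:
    - alert: NodeHighMemoryUsage      # illustrative alert name
      # One series per node, so no IPs are hardcoded; the instance label
      # identifies the node in the resulting alert.
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
      for: 10m
      annotations:
        summary: "Memory usage on {{ $labels.instance }} is above 90%"
Because the expression carries the instance label, Alertmanager receives one alert per node and no rule needs to change when nodes are recreated by auto-scaling.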
If you can figure out how to add your required info into labels, you can reference labels in your alert message using a template like so:
{{$labels.instance}}
Anything that's reported by the instance as a label should be available; however, it's only available if the alert ends in a math expression. It isn't available for alerts that use a classic condition.