I'm wondering how Grafana fires its queries to the datasource.
Does it only fire queries when the dashboard/panel is visible, or does it keep firing queries based on the default time period of all dashboards?
My assumption was that since Grafana itself doesn't store data, queries would be made on an as-needed basis, but I've been seeing HTTP requests occur periodically from my VM.
This is relevant for datasources such as CloudWatch, where each API call can be charged.
Yes, you are correct. Grafana dashboards/panels fire queries only when they are visible (loaded in the browser).
But you may have alert rules configured in Grafana, and those are evaluated periodically regardless of whether any dashboard is open. So I would guess the alerts are the source of your queries.
My metrics are being scraped every 30 seconds, even though I've specified a 10s interval when defining my ServiceMonitor.
I've created a ServiceMonitor for my exporter that seems to be working well. I can see my exporter as a target, and I can view metrics on the /graph endpoint. However, on the "Targets" page, the "Last Scrape" value indicates a 30s interval (I kept refreshing the page to see how high the number of seconds would climb; it reset at 30). Sure enough, zooming in on the graph shows that metrics are coming in every 30 seconds.
I've set my ServiceMonitor's interval to 10s, which should override any other intervals. Why is it being ignored?
endpoints:
  - port: http-metrics
    scheme: http
    interval: 10s
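For context, that fragment sits inside a full ServiceMonitor object; a minimal sketch of the surrounding manifest (the name, namespace, and labels here are illustrative assumptions, not values from any real cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-exporter          # hypothetical name
  namespace: monitoring      # hypothetical namespace
  labels:
    release: prometheus      # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-exporter       # must match the labels on the exporter's Service
  endpoints:
    - port: http-metrics
      scheme: http
      interval: 10s          # should override the global scrape interval
```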
First, double-check that you've changed the ServiceMonitor you actually needed to change, and that you are looking at scrapes produced by that ServiceMonitor.
Go to the web UI of your Prometheus and select Status -> Configuration.
Now try to find the part of the config that the prometheus-operator generated (based on your ServiceMonitor). Searching by the ServiceMonitor name should work: there should be a section with a job_name containing your ServiceMonitor's name.
Now look at the scrape_interval value in this section. If it is "30s" (or anything other than the expected "10s") and you are sure you're looking at the correct section, then one of the following happened:
your ServiceMonitor does not really contain "10s" - maybe it was not applied correctly? Verify the live object in your cluster
prometheus-operator did not update the Prometheus configuration - maybe it is not working, is crashing, or has silently stopped? It is quite safe to simply restart the prometheus-operator pod, so that may be worth trying.
Prometheus did not load the new config correctly. The prometheus-operator updates a secret, and when that secret changes, a sidecar in the Prometheus pod triggers a reload in Prometheus. Maybe that failed? Look again in the web UI under Status -> Runtime & Build Information for "Configuration reload". Is it "Successful"? Does the "Last successful configuration reload" time roughly match your ServiceMonitor change? If it was not successful, then perhaps some other change made the final Prometheus config invalid and Prometheus is unable to load it.
If I understand correctly, as per Configure Grafana Query Interval with Prometheus, this works as expected:
This works as intended. This is because Grafana chooses a step size, and until that step is full, it won't display new data, or rather Prometheus won't return a new data point. So this all works as expected. You would need to align the step size to be something smaller in order to see more data more quickly. The Prometheus interface dynamically sets the step, which is why you get a different result there.
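The effect of the step size can be sketched with a small calculation. This is an illustrative approximation of how a panel derives the step it sends to Prometheus, not Grafana's actual code; the "nice" rounding values and the default minimum interval are assumptions:

```python
def panel_step(time_range_s: int, max_data_points: int, min_interval_s: float = 15) -> float:
    """Approximate the query step a dashboard panel uses:
    raw interval (range / max points), floored at the minimum
    interval, then snapped up to a human-friendly bucket."""
    raw = time_range_s / max_data_points
    step = max(raw, min_interval_s)
    # Simplified version of the kind of snapping Grafana performs.
    for nice in (1, 5, 10, 15, 30, 60, 300, 600, 900, 1800, 3600):
        if step <= nice:
            return nice
    return step

# A 6-hour panel with 300 data points needs ~72s per point,
# which snaps up to 300s here: a new point only appears every 5 minutes.
print(panel_step(6 * 3600, 300))  # → 300
```

This is why a 10s scrape interval does not automatically translate into a new point on the graph every 10 seconds: the panel's step, not the scrape interval, decides how often a data point is rendered.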
The step is a per-panel configuration in Grafana.
Grafana itself doesn't recommend using intervals of less than 30 seconds; you can find a reasonable explanation in the "Dashboard refresh rate issues" part of Grafana's "Troubleshoot dashboards" documentation page:
Dashboard refresh rate issues

By default, Grafana queries your data source every 30 seconds. Setting a low refresh rate on your dashboards puts unnecessary stress on the backend. In many cases, querying this frequently makes no sense, because the data isn't being sent to the system such that changes would be seen.
We recommend the following:
Do not enable auto-refreshing on dashboards, panels, or variables unless you need it. Users can refresh their browser manually, or you can set the refresh rate for a time period that makes sense (every ten minutes, every hour, and so on).
If it is required, then set the refresh rate to once a minute. Again, users can always refresh the dashboard manually.
If your dashboard has a longer time period (such as a week), then you really don’t need automated refreshing.
For more information you can also visit the still-open "Specify desired step in Prometheus in dashboard panels" GitHub issue.
I've been looking through the docs and grafana community but can't seem to find a definitive answer to this.
I have Grafana configured with a PostgreSQL datasource and created a dashboard to monitor the number of new sessions being created in my database. This works, and I can see a graph of sessions being generated over time.
My question is regarding where the numbers are recorded, if anywhere. If I have this graph on my dashboard, does it go away and run the query every single time the page is loaded? My main concern is that a user can change the time period, going back potentially years, which would hammer the database by grouping all those sessions into time intervals.
If they are not stored anywhere with Grafana, how are people managing this? Would we need to use another 'middle man' to receive all of the stats, and use this as the datasource instead of the PostgreSQL database?
When you configure a datasource, you are telling Grafana to read data directly from it. In your case that is your Postgres database.
Grafana does not copy data anywhere else.
If you want that, then you should store that data somewhere else. You can set up monitoring for PostgreSQL and store the metrics in an engine such as Prometheus.
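Short of a separate metrics store, one lightweight way to avoid hammering the database on every page load is to cache the expensive aggregate for a fixed TTL and serve the cached result in between. A minimal sketch (the class, the TTL, and the stand-in query function are all hypothetical, not part of Grafana):

```python
import time

class TTLCache:
    """Serve a cached result until it is older than `ttl_seconds`,
    then recompute. A stand-in for a real cache layer or a
    pre-aggregated table refreshed by a scheduled job."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get(self, key, compute):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: no database hit
        value = compute()    # the expensive query runs here
        self._store[key] = (now, value)
        return value

calls = 0
def expensive_session_count():  # hypothetical stand-in for the SQL aggregate
    global calls
    calls += 1
    return 42

cache = TTLCache(ttl_seconds=60)
a = cache.get("sessions", expensive_session_count)
b = cache.get("sessions", expensive_session_count)
print(a, b, calls)  # → 42 42 1 (the query ran only once)
```

The same idea scales up to a materialized view or summary table refreshed on a schedule, with Grafana pointed at the summary instead of the raw sessions table.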
In the past, it was possible to setup an Azure alert on a single event for a resource e.g. on data factory single RunFinished where the status is Failed*.
This appears to have been superseded by "Activity Log Alerts"
However these alerts only seem to either work on a metric threshold (e.g. number of failures in 5 minutes) or on events which are related to the general admin of the resource (e.g. has it been deployed) not on the operations of the resource.
A threshold doesn't make sense for Data Factory, as a data factory may only run once a day; if a failure happens and then doesn't recur X minutes later, that doesn't mean it has been resolved.
The activity event alerts don't seem to include things like failures.
Am I missing something?
Is it because this is expected to be done in OMS Log Analytics now? Or perhaps even in Event Grid later?
*n.b. it is still possible to create these alert types via ARM templates, but you can't see them in the portal anymore.
The events you're describing are part of a resource type's diagnostic logs, which are not alertable in the same way that the Activity Log is. I suggest routing the data to Log Analytics and setting the alert there: https://learn.microsoft.com/en-us/azure/data-factory/monitor-using-azure-monitor
I want to build an internal dashboard to show the key metrics of a startup.
All data is stored in a mongodb database on Mongolab (SaaS on top of AWS).
Queries that aggregate data from all documents take 1-10 minutes.
What is the best practice to cache such data and make it immediately available?
Should I run a worker thread once a day and store the result somewhere?
I want to build an internal dashboard to show the key metrics of a startup. All data is stored in a mongodb database on Mongolab (SaaS on top of AWS). Queries that aggregate data from all documents take 1-10 minutes.
Generally users aren't happy to wait on the order of minutes to interact with dashboard metrics, so it is common to pre-aggregate using suitable formats to support more realtime interaction.
For example, with time series data you might want to present summaries with different granularity for charts (minute, hour, day, week, ...). The MongoDB manual includes some examples of Pre-Aggregated Reports using time-based metrics from web server access logs.
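The idea behind pre-aggregation can be sketched in a few lines: roll raw events up into coarse buckets once, and let the dashboard read the small summary instead of scanning every document. A pure-Python illustration; in practice this would be a MongoDB aggregation pipeline or incremental upserts into a summary collection:

```python
from collections import Counter
from datetime import datetime

def hourly_counts(events):
    """Roll raw event timestamps up into per-hour counts."""
    buckets = Counter()
    for ts in events:
        # Truncate each timestamp to the top of its hour.
        buckets[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(buckets)

events = [
    datetime(2015, 3, 1, 9, 5),
    datetime(2015, 3, 1, 9, 42),
    datetime(2015, 3, 1, 10, 1),
]
# Two events fall in the 09:00 bucket, one in the 10:00 bucket.
print(hourly_counts(events))
```

A dashboard reading these hourly (or daily, weekly, ...) buckets returns instantly, regardless of how many raw documents exist.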
What is the best practice to cache such data and make it immediately available?
Should I run a worker thread once a day and store the result somewhere?
How and when to pre-aggregate will depend on the source and interdependency of your metrics as well as your use case requirements. Since use cases vary wildly, the only general best practice that comes to mind is that a startup probably doesn't want to be investing too much developer time in building their own dashboard tool (unless that's a core part of the product offering).
There are plenty of dashboard frameworks and reporting/BI tools available in both open source and commercial flavours.
I'm curious how MMS generates alerts on all the metrics, across all the alert configurations, across all the groups, across all the accounts.
What I would do is query alert configs and active alerts when a ping with new data comes in, and then generate an alert based on the new data.
However, what if some of the metrics can't be determined from the current ping alone, such as page faults avg/sec. This metric is derived from previous pings.
Would there be a background worker of sorts that polls periodically for every alert config in every group in every account? Or would every new ping quickly go get the last minute/5 minute's data for that metric (perhaps only if such an alert configuration exists)?
Also, when no new pings come in, the averages should decay and fire new alerts or update existing ones.
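The per-ping evaluation described above could be sketched as a sliding window kept per metric, re-checked both when a ping arrives and by a background sweep so that averages decay when pings stop. This is an illustrative guess at the mechanics, not MMS's actual implementation:

```python
from collections import deque

class WindowedAlert:
    """Keep the last `window_s` seconds of samples for one metric
    and re-evaluate a threshold condition on every update."""

    def __init__(self, window_s: float, threshold: float):
        self.window_s = window_s
        self.threshold = threshold
        self.samples = deque()  # (timestamp, value)

    def on_ping(self, ts: float, value: float) -> bool:
        """New data arrived: add it, drop stale samples, re-check."""
        self.samples.append((ts, value))
        self._expire(ts)
        return self._average() > self.threshold

    def check_without_ping(self, now: float) -> bool:
        """Called by a background sweep so the average decays
        (and alerts resolve) even when no new pings arrive."""
        self._expire(now)
        return self._average() > self.threshold

    def _expire(self, now: float):
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def _average(self) -> float:
        if not self.samples:
            return 0.0
        return sum(v for _, v in self.samples) / len(self.samples)

alert = WindowedAlert(window_s=300, threshold=100.0)
print(alert.on_ping(0, 150.0))        # → True: windowed average exceeds threshold
print(alert.check_without_ping(600))  # → False: the sample aged out, alert resolves
```

Derived metrics such as page faults per second would hold the previous ping's raw counter in similar per-metric state, so each new ping can compute the delta without re-querying history.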
I understand the pre-aggregated metrics as shown in this blog post, and even the reporting, but how does one link this with alert generation and management?
This seems to be non-trivial to solve.