select prometheus alerts newer than a given time - grafana

I am working with Grafana, trying to show a list of pods that are triggering a custom Prometheus alert.
This query does the trick:
sum(ALERTS{alertname="myCustomAlert"}) BY (pod_name)
The problem is that it lists all the alerts, and it does not seem to be affected when I change the time interval to see only the ones fired in the last 5 minutes or the last hour.
Is there any way to limit the alert list in time? Thanks a lot!

That expression will produce the number of alerts by pod_name firing at the current time (just as you would expect up{instance="foo"} to tell you whether instance foo is up now, whether you're looking at a dashboard that shows the last 5 minutes or the last hour).
If you want to see the values change over time, you could graph the expression, for example. You would then see when the alert started and stopped firing for each pod.
And if you want the value at some past time, simply set the end time of the Grafana dashboard range to that time. (E.g. if your dashboard was showing the time range between 2 PM and 3 PM on January 1st, then your query would return the alerts firing at 3 PM on January 1st.)
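To make the difference between "firing now" and "fired at some point in the selected window" concrete, here is a minimal Python sketch over made-up samples. The sample data and both function names are hypothetical. In PromQL, the windowed variant would roughly correspond to wrapping the selector in max_over_time with a range such as [5m]; that mapping is an assumption about what the asker wants, not something stated in the answer above.

```python
from typing import List, Set, Tuple

# Hypothetical samples: (unix_seconds, pod_name) for each rule evaluation
# at which the alert was firing for that pod.
Sample = Tuple[int, str]


def firing_at(samples: List[Sample], t: int, eval_interval: int = 60) -> Set[str]:
    """Pods firing at instant t: a sample exists within the last evaluation interval."""
    return {pod for ts, pod in samples if t - eval_interval < ts <= t}


def fired_within(samples: List[Sample], t: int, window: int) -> Set[str]:
    """Pods that fired at least once in the window (t - window, t]."""
    return {pod for ts, pod in samples if t - window < ts <= t}


samples = [(0, "pod-a"), (60, "pod-a"), (600, "pod-b"), (660, "pod-b")]
now = 660
print(sorted(firing_at(samples, now)))           # instant view
print(sorted(fired_within(samples, now, 300)))   # last 5 minutes
print(sorted(fired_within(samples, now, 3600)))  # last hour
```

With this data, the instant view and the 5 minute window both contain only pod-b, while the 1 hour window also contains pod-a.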

Related

Is it possible to accelerate time in grafana?

Here is what I am trying to do.
I created dashboards to monitor alert status in Grafana.
I created fake data in my system to simulate my alert situations on these dashboards. The timestamps of this data cover the range from now to now + 12h. With real data it takes a long time to analyze the alert status, so I cannot be very flexible with my alert rules: I have to wait until the end of this period to see the alert status in the system. (I actually have many cases like this.) Grafana creates pending, alerting, and ok states according to the records in my database. Is there a way to verify my tests quickly without waiting for this time to pass?
The main problem is that this is fairly expensive to do in a data source agnostic way. The way it worked in Bosun is that you would select a time range, and then an interval or a number of queries to run.
Setting both From and To enables testing multiple iterations of the selected alert over time. The number of iterations depends on the settings of the two linked fields, Intervals and Step Duration (3). Changing one changes the other. Intervals is the number of runs to do, evenly spaced over the duration from From to To, and Step Duration is how much time, in minutes, lies between intervals. Doing a test over time populates the Timeline tab (5), which draws a clickable graphic of severity states for each item in the set.
It would then run all those queries with a pool limiting simultaneous queries. For an interval of say 5 minutes, it would run adjacent 5 minute queries.
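The relationship between the range, the number of intervals, and the step duration, plus the bounded query pool, can be sketched as follows. This is a rough Python illustration, not Bosun's implementation; run_query, max_parallel, and the other names are hypothetical, and times are plain integer seconds.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Window = Tuple[int, int]


def step_from_intervals(start: int, end: int, intervals: int) -> int:
    """Step duration implied by a number of evenly spaced runs."""
    return (end - start) // intervals


def windows(start: int, end: int, step: int) -> List[Window]:
    """Adjacent (from, to) windows of length `step` covering start..end."""
    out = []
    t = start
    while t + step <= end:
        out.append((t, t + step))
        t += step
    return out


def backtest(run_query: Callable[[int, int], str], start: int, end: int,
             step: int, max_parallel: int = 4) -> List[str]:
    """Run one query per window, with at most `max_parallel` in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in window order.
        return list(pool.map(lambda w: run_query(*w), windows(start, end, step)))


def fake_query(frm: int, to: int) -> str:
    # Stand-in for a real data source call: alerting in the second half hour.
    return "alerting" if frm >= 1800 else "ok"


step = step_from_intervals(0, 3600, 12)  # 12 runs over one hour
print(step)
print(backtest(fake_query, 0, 3600, step))
```

A job system, as suggested in the answer, would replace the blocking pool.map call with something that can be polled for progress.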
So this would speed up the alert authoring and testing workflow significantly. But it would best be implemented as a job system: with more expensive queries, or a range/interval combination that produces a fair number of runs, it may take a minute or so, and waiting on an open network connection for that long is less than ideal.
I found I generally used it in two modes:
To tweak a specific alert that had fired at some time
To get a general overview of how much the alert rule would trigger for the historical data
For the general overview, a larger time range is generally wanted, which means more queries if the interval is kept the same. And with a feature like FOR (Pending), you would have to use the same interval the rule would actually run at.
So it is possible, it has some limitations, and some care needs to be taken to do it right. But it was extremely useful in my experience.
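Why a FOR (Pending) clause ties the replay to the real evaluation interval can be seen in a small state machine. This is a simplified Python sketch, not Grafana's actual alerting engine; alert_states and its parameters are hypothetical names, and the breach sequence is made up.

```python
from typing import List, Optional


def alert_states(breaches: List[bool], eval_interval: int, for_duration: int) -> List[str]:
    """Map a sequence of rule evaluations to ok / pending / alerting states.

    breaches[i] is True when the condition was met at evaluation i.
    The alert fires only after the condition has held for `for_duration` seconds.
    """
    states = []
    held_for: Optional[int] = None  # seconds the condition has held continuously
    for breached in breaches:
        if not breached:
            held_for = None
            states.append("ok")
            continue
        held_for = 0 if held_for is None else held_for + eval_interval
        states.append("alerting" if held_for >= for_duration else "pending")
    return states


breaches = [False, True, True, True, False, True]
print(alert_states(breaches, eval_interval=60, for_duration=120))
print(alert_states(breaches, eval_interval=300, for_duration=120))
```

The same sequence of evaluations reaches alerting one evaluation earlier when replayed at a 300 second interval than at 60 seconds, which is why a backtest should use the interval the rule really runs at.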

Storage aggregation is not combining like I would expect

I'm not getting expected results with some metrics I am tracking in Graphite and displaying in Grafana.
For metrics like:
bitbucket.commits-per-user.username1.count
bitbucket.commits-per-user.username2.count
I have a retention policy like:
[default_bitbucket]
pattern = ^bitbucket\.
retentions = 1m:30d,1h:2y
I am pulling the data from an API, summarizing by the minute in which each commit occurred for the user, and adding it with a timestamp of that minute (rounded down to the whole minute).
The storage-aggregation policy I am using is this:
[count_bitbucket]
pattern = ^bitbucket.*\.count$
xFilesFactor = 0
aggregationMethod = sum
I would expect that, once the time frame exceeds 30 days and I run the metric through the function
summarize(1d,sum,true)
, I would see commits per hour for whatever time period. However, it seems to be reporting significantly less per day once I move beyond 30 days.
Is there anything I am doing obviously wrong?
Could there be a problem if I don't add metrics for zeros on minutes when there are no commits?
I really appreciate any guidance - I'm fairly new to graphite.
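As a rough model of what the configuration above is expected to do, the following Python sketch rolls minute points up into hourly points the way a sum aggregation with xFilesFactor = 0 should, and contrasts it with an average. It is a simplification, not Whisper's code; rollup and its parameters are hypothetical names and the data is made up. One thing worth checking, since storage-aggregation settings are applied when a Whisper file is created: metric files that existed before the [count_bitbucket] rule was added keep their earlier settings (average and an xFilesFactor of 0.5 by default).

```python
from typing import List, Optional


def rollup(points: List[Optional[float]], per_bucket: int, method: str = "sum",
           x_files_factor: float = 0.0) -> List[Optional[float]]:
    """Aggregate fine-grained points into coarser buckets, ignoring gaps (None)."""
    out: List[Optional[float]] = []
    for i in range(0, len(points), per_bucket):
        bucket = points[i:i + per_bucket]
        known = [v for v in bucket if v is not None]
        # Too few known points in the bucket: store nothing for it.
        if not known or len(known) / len(bucket) < x_files_factor:
            out.append(None)
        elif method == "sum":
            out.append(sum(known))
        else:  # "average"
            out.append(sum(known) / len(known))
    return out


# Two hours of per-minute data with no explicit zeros, only three commits' worth of points.
minutes: List[Optional[float]] = [None] * 120
minutes[3], minutes[50], minutes[70] = 2.0, 1.0, 4.0

print(rollup(minutes, 60))                      # sum, xFilesFactor = 0
print(rollup(minutes, 60, method="average"))    # what an average rollup would store
print(rollup(minutes, 60, x_files_factor=0.5))  # sparse data dropped entirely
```

In this model, missing zeros do not change the hourly sums, while an average rollup or a non-zero xFilesFactor would both make daily totals come out lower than the raw data.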

How to correctly scrape and query metrics in Prometheus every hour

I would like Prometheus to scrape metrics every hour and display these hourly scrape events in a table in a Grafana dashboard. I have the global scrape interval set to 1h in the prometheus.yml file. From the Prometheus visualizer, it seems like Prometheus scrapes around the 43 minute mark of every hour. However, it also seems like this data is only valid for about 3 minutes.
[Figure: Prometheus graph]
My situation, then, is this: In a Grafana table, I set the min step of a query on this metric to 1h, but this causes the table to say that there are no data points. However, if I set the min step to 5 minutes, it displays the hourly scrape events with a timestamp on the 45 minute mark. My guess as to why this happens is that Prometheus starts on the dot of some hour and steps either forward or backward by the min step.
This does achieve what I would like to do, but it also has potential for incorrect behavior if Prometheus ever does something like what can be seen at the beginning of the earlier graph. I also know that I can add a time shift, but it seems like it is always relative to the current time rather than an absolute time.
Is it possible to increase the amount of time that the scrape data is valid in Prometheus without having to scrape again every 3 minutes? Or maybe tell Prometheus to scrape at the 00 minute mark of every hour? Or if not, then can I add a relative time shift to the table so that it goes from the 45 minute mark instead of the 00 minute mark?
On a side note, in the above Prometheus graph, the irregular data was scraped after Prometheus was started. I had started Prometheus around 18:30 on the 22nd, but Prometheus didn't scrape until 23:30, and then it scraped at different intervals until it stabilized around 2:43 on the 23rd. Does anybody know why?
Your data disappears because of the staleness handling implemented in Prometheus. Once a sample has been ingested, the series is considered stale 5 minutes later. I didn't find any per-metric configuration to change that value; the window is global (the --query.lookback-delta server flag, 5m by default).
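The interaction between the query step and this 5 minute window can be sketched in Python. This is a simplified model of instant-vector lookup, not Prometheus code; the function names are hypothetical, and the samples assume one scrape at the 43 minute mark of each hour, as described in the question.

```python
from typing import List, Optional, Tuple

LOOKBACK = 300  # assumed 5 minute lookback window, in seconds

Sample = Tuple[int, float]  # (unix_seconds, value)


def instant_value(samples: List[Sample], t: int, lookback: int = LOOKBACK) -> Optional[float]:
    """Value of the most recent sample in (t - lookback, t], or None if there is none."""
    candidates = [(ts, v) for ts, v in samples if t - lookback < ts <= t]
    return max(candidates)[1] if candidates else None


def range_query(samples: List[Sample], start: int, end: int, step: int):
    """Evaluate the instant lookup at start, start + step, ... up to end."""
    return [(t, instant_value(samples, t)) for t in range(start, end + 1, step)]


# One scrape per hour at minute 43, for three hours.
samples = [(h * 3600 + 43 * 60, float(h)) for h in range(3)]

hourly = range_query(samples, 0, 3 * 3600, 3600)
five_min = range_query(samples, 0, 3 * 3600, 300)

print([t for t, v in hourly if v is not None])
print([t // 60 % 60 for t, v in five_min if v is not None])
```

With a 1 hour step every evaluation lands on the hour, more than 5 minutes after the last scrape, so nothing is returned. With a 5 minute step exactly one evaluation per hour, at the 45 minute mark, falls inside the window.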
Scraping every hour is not really the philosophy of Prometheus. If you really need to scrape with such a low frequency, it could be a better idea to schedule a job that sends the data to a Pushgateway, or that writes a .prom file read by the node exporter's textfile collector (if that makes sense for you). You can then scrape this endpoint every 1-2 minutes.
You could also roll your own exporter that remembers the last scrape and scrapes anew only if the data is more than one hour old. (That's the solution I would prefer.)
Now, as a quick solution you can request the data over the last hour and average on it. That way, you'll get the last (old) scrape taken into account:
avg_over_time(old_metric[1h])
It should work, though it may show some transient incorrect values if there is jitter in the scheduling of the scrapes.
Regarding the issues you had with late scraping, I suspect the scraping failed at those times. Prometheus retries only at the next scheduled scrape (1h later in your case).
If the metric is scraped at intervals exceeding 5 minutes, then Prometheus returns gaps to Grafana because of the staleness mechanism. These gaps can be filled with the last raw sample value by wrapping the queried time series in the last_over_time function. Just specify a lookbehind window in square brackets that equals or exceeds the interval between samples. For example, the following query fills gaps for the my_gauge time series with a one hour interval between samples:
last_over_time(my_gauge[1h])
See the Prometheus docs on time durations for the format that can be used in square brackets.
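The two workarounds above, avg_over_time and last_over_time, differ only when more than one sample falls inside the window. Here is a small Python model of both; it is a sketch of the semantics as described in these answers, with made-up samples, and it reuses the PromQL function names only for readability.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (unix_seconds, value)


def _in_window(samples: List[Sample], t: int, window: int) -> List[Sample]:
    """Samples with timestamps in (t - window, t]."""
    return [(ts, v) for ts, v in samples if t - window < ts <= t]


def last_over_time(samples: List[Sample], t: int, window: int) -> Optional[float]:
    """Most recent sample value in the window, or None for an empty window."""
    hits = _in_window(samples, t, window)
    return max(hits)[1] if hits else None


def avg_over_time(samples: List[Sample], t: int, window: int) -> Optional[float]:
    """Mean of the sample values in the window, or None for an empty window."""
    hits = _in_window(samples, t, window)
    return sum(v for _, v in hits) / len(hits) if hits else None


# Second scrape arrives 30 s early (jitter), so both fit in one 1h window briefly.
samples = [(2580, 10.0), (6150, 20.0)]

print(last_over_time(samples, 6000, 3600), avg_over_time(samples, 6000, 3600))
print(last_over_time(samples, 6170, 3600), avg_over_time(samples, 6170, 3600))
```

While both scrapes sit in the window, the average reports a blend of old and new values, which is the transient error mentioned for avg_over_time; the last-value variant always reports the newest sample.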

Stackdriver scheduled uptime check

I would like to run uptime checks using Stackdriver only when my google cloud instance is running (it is service that only runs a few hours every day). Is that possible?
No, Stackdriver uptime checks run according to the interval set in the "Check every" field.
You can choose between 1, 5, 10, or 15 minutes. For example, choosing 5 minutes will cause each geographic location to attempt to reach your service once in every 5 minute period. Using the default six locations, and checking every 5 minutes, your service sees an average of 1.2 requests per minute. Checking every 1 minute, your service sees an average of 6 requests per minute.
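The request rates quoted above follow from dividing the number of checking locations by the check interval. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def requests_per_minute(locations: int, check_every_minutes: int) -> float:
    """Average uptime-check requests per minute seen by the service."""
    return locations / check_every_minutes


for interval in (1, 5, 10, 15):
    print(interval, requests_per_minute(6, interval))
```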
See the Stackdriver uptime checks documentation.

Time since a value was zero

I have an application that consumes work to do from an AWS topic. Work is added several times a day and my application quickly consumes it and the queue length goes back to 0. I am able to produce a metric for the length of the queue.
I would like a metric for the time since the length of queue was last zero. Any ideas how to get started?
Assuming a queue_size gauge that records the size of the queue, you can define a recording rule like this:
# Timestamp of the most recent `queue_size == 0` sample; else propagate the previous value
- record: last_empty_queue_timestamp
  expr: timestamp(queue_size == 0) or last_empty_queue_timestamp
Then you can compute the time since the last time the queue was empty as simply as:
timestamp(queue_size) - last_empty_queue_timestamp
Note however that because this is a gauge (and because of the limitations of sampling), you may end up with odd results. E.g. if one work item is added every minute, your sampling interval is one minute, and you sample right after the work items have been added, your queue may never (or very rarely) appear empty from the point of view of Prometheus. If that turns out to be an issue (or simply a concern), you may be better off having your application export a metric holding the last timestamp at which something was added to an empty queue (basically what the recording rule attempts to compute).
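What the rule computes, and how the sampling caveat shows up, can be traced with a short Python simulation. It is only a model of the rule's logic over made-up samples; time_since_empty is a hypothetical name.

```python
from typing import List, Optional, Tuple


def time_since_empty(samples: List[Tuple[int, int]]) -> List[Tuple[int, Optional[int]]]:
    """For each (timestamp, queue_size) sample, seconds since the queue was last seen empty."""
    last_empty: Optional[int] = None
    out = []
    for ts, size in samples:
        if size == 0:
            last_empty = ts  # the timestamp(queue_size == 0) branch
        # otherwise the previous value is kept (the `or last_empty_queue_timestamp` branch)
        out.append((ts, None if last_empty is None else ts - last_empty))
    return out


seen_empty = [(0, 0), (60, 5), (120, 3), (180, 0), (240, 2)]
never_seen_empty = [(0, 1), (60, 1), (120, 1)]  # sampled just after each item is added

print(time_since_empty(seen_empty))
print(time_since_empty(never_seen_empty))
```

In the second series the queue drains between samples but is never observed at zero, so no "last empty" timestamp is ever recorded.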
Similar to Alin's answer; upon revisiting this problem I found this from the Prometheus documentation:
https://prometheus.io/docs/practices/instrumentation/#timestamps,-not-time-since
If you want to track the amount of time since something happened, export the
Unix timestamp at which it happened - not the time since it happened.
With the timestamp exported, you can use the expression time() -
my_timestamp_metric to calculate the time since the event, removing the need for
update logic and protecting you against the update logic getting stuck.
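On the application side, following that advice amounts to recording the Unix timestamp whenever the queue is observed empty and leaving the subtraction to query time. A minimal Python sketch, with no Prometheus client library involved; EmptyQueueTracker and its methods are hypothetical names, and the injected clock exists only so the example is deterministic.

```python
import time
from typing import Callable, Optional


class EmptyQueueTracker:
    """Remembers the Unix timestamp at which the queue was last seen empty."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # This is the value an application would export as a timestamp metric.
        self.last_empty_timestamp: Optional[float] = None

    def observe(self, queue_size: int) -> None:
        """Call whenever the queue length is measured."""
        if queue_size == 0:
            self.last_empty_timestamp = self._clock()

    def seconds_since_empty(self) -> Optional[float]:
        """Mirrors `time() - my_timestamp_metric`, computed at read time."""
        if self.last_empty_timestamp is None:
            return None
        return self._clock() - self.last_empty_timestamp


now = [1000.0]  # fake clock for the example
tracker = EmptyQueueTracker(clock=lambda: now[0])
tracker.observe(0)
now[0] = 1090.0
tracker.observe(4)
print(tracker.last_empty_timestamp, tracker.seconds_since_empty())
```

Because only the timestamp is stored, nothing has to keep updating a "time since" counter, which is the failure mode the documentation warns about.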