reset chart to 0 in grafan - grafana

Below is a chart I have in grafana:
My problem is that if my chosen time range is say 5 minutes, the graph wont show only what happened in the last 5 minutes. So in the picture, nothing happened in the past 5 minutes so it's just showing the last points it has. How can I change this so that it goes back to zero if nothing has changed? I'm using a Prometheus counter for this, if that is relevant.

As explained in the Prometheus documentation, a counter value in itself is not of much use. It depends on when your job was last restarted and everything that happened since.
What's interesting about a counter is how much it changed over some period of time. I.e. either the average rate of change per second (e.g. 3 queries per second) or the increase over some time range (e.g. 10K queries in the last hour).
So instead of graphing something like e.g. http_requests, you should graph rate(http_requests[1m]) (the averate number of requests over the previous 1 minute) or increase(http_requests[1h]) (the total number of requests over the past hour). You can play with the range size until you get something which makes sense for your data. But make sure to use a range at least 2x your scrape interval (and ideally more, as Prometheus is somewhat daft in the way it computes rates/increases).

Related

what is the default grafana setting for $__rate_interval

I understand that rate(xyz[5m]) * 60 is the rate of xyz per minute, averaged over 5 mins.
How then would $__rate_interval and $__interval be defined,
possibly in the same syntax?
What format is rate being measured here, in my panel? Per minute, per second?
What is the interval= 30s in my panel here? My scraping interval is set to 5s.
How do i change the rate format?
See New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work.
Rate is always per second. See Grafana documentation for the rate function.
Click on Query options, then click on the Info-Symbol. An explanation will be displayed.
To get rate per minute, just multiply the rate with 60.
Edit: ($__rate_interval and $_interval)
Prometheus periodically fetches data from your application. Grafana periodically fetches Data from Prometheus. Grafana does not know, how often Prometheus polls your application for data. Grafana will estimate this time by looking at the data. The $__interval variable expands to the duration between two data points in the graph. (Note that this is only true for small time ranges and high resolution as the intended use case for $__interval is reducing the number of data points when the time range is wide. See Approximate Calculation of $__interval.)
If the time-distance between every two data points in each series is 15 seconds, it does not make sense to use anything less than [15s] as interval in the rate function. The rate function works best with at least 4 data points. Therefore [1m] would be much better than anything betweeen [15s] and [1m]. This is what $__rate_interval tries to achieve: guessing a minimal sensible interval for the rate function.
Personally, I think, this does not always work if your application delivers sparse data. I prefer using fixed intervals like 10m or even 1h or 1d in these situations. The interval need to be great enough to get you enough data points for the metric to work with the rate function.
A different approach would be to use any of $__rate_interval and $_interval but also set the Min step parameter for the query in the Grafana UI to be big enough.

What is the best way to represent a chart of distribution of time intervals in Datadog?

I have a server that processes packets from different devices. Devices can report in different intervals.
I would like to make a chart showing the distribution of intervals by the count of devices (how many devices are reporting within 5 sec/10 sec/60 sec ...)
Intervals for each device can vary.
Now I'm sending metric with Set using deviceId with tags that represent interval (5 sec, 10 sec, 30 sec, and more) but I'm not sure that it is correct.
What is the best way to realize it?
Set is almost never the right custom metric type to use. It will send a count of the number of unique items per a given tag. The underlying items details will be stripped from the metric, meaning that from one time slice to the next, you will have no idea that actual true number of items over time.
For example
3:00:07-3:00:32 | 5 second bucket:[device1,device4,device7] -> 3 values
3:00:32-3:00:47 | 5 second bucket:[device1,device3] -> 2 values
Your time series to datadog will report 3, and then 2. But because the underlying device info is stripped you have no idea how to combine that 2 and 3 if you to zoom out in time and roll up the numbers to show 1 data point per minute. It could be any number from 3 to 5, but the Datadog backend has no idea. (even though we know that across those 30 seconds there were 4 unique values total)
Plus even if it was accurate somehow, you can't create an alert of it or notify anyone, because you won't know which device is having issues if you see a spike of devices in the 60 second bucket.
So let's go through other metric options.
The only metric types that are ever worth using are usually distributions or gauges, or [counts].
A gauge metric is just a measurement of the latency at a point in time, it's usually good for things like CPU or Memory of a computer, or temperature in a room. Numbers that are impossible to actually collect all dat a points for so you just take measurements every 10 seconds, or every minute, or however often you never to get an idea of the behavior.
A count metric is more exact, it's the number of things that happened. Usually good for number of requests to a server, or number of files processed. Even something like the amount of bytes flowing through something, although that usually is treated like a gauge by most people.
Distributions are good for when you want to create a gauge metric, but you need detailed measurements for every single event that happens. For example a web server is handling hundreds of requests per second and we need to know the latency metrics of that server. It's not possible to send a latency metric for every request as a gauge. Gauges have a built in limit of 1 data point per second (in Datadog). Anything more sent in a 1 second interval gets dropped. But we need stats for every request, so a distribution will summarize the data, it keep a running count, min, max, average, and optionally several percentiles (p50, p75, p99).
I haven't seen many good use cases for metric types outside of those 3. For your scenario, it seems like you would want to be sending a distribution metric for that device interval. So device 1 sends a value of 10.14 and device 3 sends a value of 2.3 and so on.
Then you can use a distribution widget in a dashboard to show the number of devices for each interval bucket.
Of course make sure you tag each metric by the device that is generating the metric.

How to sum prometheus counters when k8s pods restart

I'm running Prometheus in a kubernetes cluster. All is running find and my UI pods are counting visitors.
Please ignore the title, what you see here is the query at the bottom of the image. It's a counter. The gaps in the graph are due to pods restarting. I have two pods running simultaneously!
Now suppose I would like to count the total of visitors, so I need to sum over all the pods
This is what I expect considering the first image, right?
However, I don't want the graph to drop when a pod restarts. I would like to have something cumulative over a specified amount of time (somehow ignoring pods restarting). Hope this makes any sense. Any suggestions?
UPDATE
Below is suggested to do the following
Its a bit hard to see because I've plotted everything there, but the suggested answer sum(rate(NumberOfVisitors[1h])) * 3600 is the continues green line there. What I don't understand now is the value of 3 it has? Also why does the value increase after 21:55, because I can see some values before that.
As the approach seems to be ok, I noticed that the actual increase is actually 3, going from 1 to 4. In the graph below I've used just one time series to reduce noise
Rate, then sum, then multiply by the time range in seconds. That will handle rollovers on counters too.
Prometheus doesn't provide the ability to sum counters, which may be reset. Additionally, the increase() function in Prometheus has some issues, which may prevent from using it for querying counter increase over the specified time range:
It may return fractional values over integer counters because of extrapolation. See this issue for details.
It may miss counter increase between raw sample just before the lookbehind window in square brackets and the first raw sample inside the lookbehind window. For example, increase(NumberOfVisitors[1m]) at timestamp t may miss the counter increase between the last raw sample just before the t-1m time and the first raw sample at (t-1m ... t] time range. See more details here and here.
It may miss the increase for the first raw sample in a time series. For example, if the NumberOfVisitors counter is increased to 10 just before the first scrape of this counter by Prometheus, then increase() over the time range with the first sample would under-count the counter increase by 10.
Prometheus developers are going to fix these issues - see this design doc. In the mean time it is possible to use VictoriaMetrics - its' increase() function is free from these issues.
Returning to the original question - the sum of multiple counters, which may be reset, can be returned with the following MetricsQL query in VictoriaMetrics:
running_sum(sum(increase(NumberOfVisitor)))
It uses the following functions:
increase() for calculating increase per each counter between adjacent points on the graph.
sum() for summing the calculated increases per each point on the graph.
running_sum() for calculating the running sum over per-point increases on the graph.

prometheus: is it possible to use event number of gauge as a counter?

I use prometheus to monitor a api service. Currently, I use a Counter to count number of requests received and a Gauge for the response time in milliseconds.
I've tried to use something like count_over_time(response_time_ms[1m]) to count requests during a time range. However, I got result that each point is value of 10.
Why this doesn't work?
count_over_time(response_time_ms[1m]) will tell you the number of samples, not the number of times your Gauge was updated within (what I assume to be) a Java process. Based on the value of 10 you're seeing, I'm assuming your scrape interval is 6 seconds.
For an explanation of why this doesn't work as you would expect it, a Gauge is simply a Java object wrapping a double value. Every time you set its value, that value changes, but nothing more. There's no count of how many times the value changed or any notification sent to Prometheus that this happened. Prometheus simply polls every 6 seconds and collects whatever value was there at the time (never the wiser that the value changed 15 times since the last time it was collected). This is why gauges are intended to measure single values that go up and down (such as memory utilization: it's now 645 MB, in 6 seconds it's 648 MB, in 12 seconds 543 MB): you know the value constantly changes, but the best you can do is sample it every now and then.
For something like request latency, you should use a Histogram: it's basically a counter for the number of observations (i.e. number of requests); a counter for the sum of all observations (i.e. how long all requests put together took); and separate counters for each bucket (i.e. how many requests took less than 1 ms; how many requests took less than 10 ms; etc.). From this you can get an accurate average over any multiple of your scrape interval (i.e. change in total time divided by change in number of requests) as well as estimates for any percentile (including the median). How precise said percentiles are depends on the bucket sizes you choose (and how well they actually match the actual measurements).
Or, if all you're interested in is the number of requests, then a counter that's incremented on every request will be enough. To adjust for counter resets (e.g. job restarts), you should use increase() rather than the simple difference suggested above:
increase(number_of_requests_total[1m])
If you want to count number of requests in some specific time from now (in last 1m in this case) just use
number_of_requests_counter - number_of_requests_counter offset 1m
If you want to have sth like requests per second, than use
rate(number_of_requests_counter[1m])
I can tell you why it's not working with your Gauge, but first of all specify what do you assign to this metric. I mean, do you assing some avarage, last response time, or some other stuff?
For response time you should use Summary or Histogram (more info here)

How should i interpret this grafana visualized prometheus histogram buckets heatmap?

I visualized prometheus histogram buckets as heatmap with grafana, below pic shows the query and the outcome graph, how should i interpret this?
According to my attacker, in total i sent 300 requests in that period exactly, but when i sum those numbers up on above graph i can never get exact 300,
and also looks those numbers are fluctuating with the time elapsing, how should i interpret this graph in a meaningful way?
And if i want those numbers to be the exact request counts locate in each of those bucket in that time window, what should i do?
Oh, for the X-Axis Mode i chose Series and the Value i chose Current.
There are real reasons why you can't always get a precise rate/increase value out of Prometheus. One of them is failed scrapes, i.e. every now and then a scrape will fail or time out due to a slow service, slow Prometheus or network issue.
The other reason is the fact that collected samples are never exactly scrape_interval apart: there will always be a few milliseconds or seconds of delay here and there. So (to take an extreme example) how can you tell the precise increase over the past 1 minute if you only have 2 samples 63 seconds apart? Is it the difference between the two values? Is it that difference adjusted to 60 seconds (i.e. / 63 * 60)?
That being said, Prometheus further boxes itself into a corner by only looking at samples falling strictly within the requested time range. To explain myself: how would a reasonable person calculate the increase of a counter over the last 30 minutes? They would likely take the value of said counter now and the value 30 minutes ago and subtract them. I.e. in PromQL terms (adjusting for counter resets where necessary):
request_duration_bucket - request_duration_bucket offset 30m
What Prometheus does instead (assuming a scrape_interval of 1m and an ideal timeseries with samples spaced exactly 1m apart) is essentially this:
(request_duration_bucket - request_duration_bucket offset 29m) / 29 * 30
I.e. it takes the increase over 29 minutes and extrapolates it to 30. Because of self-imposed limitations, nothing to do with the nature of the problem at hand.
Note that this works fine with counters that increase smoothly and continuously. E.g. if you have a counter that increases by 500 every minute, then taking the increase over 29 minutes and extrapolating to 30 is exactly correct. But for anything that increases in jumps and fits (which is most real-life counters) it will either slightly overestimate the increase if it occurs during the 29 minutes it actually samples (by exactly 1/29) or seriously underestimate it (if the increase occurs in the 1 minute not included in the sampling). This is even worse if you compute a rate/increase over a range covering fewer samples. E.g. if your range only covers 5 samples on average, the overestimate will be 20%, i.e. 1 / (5 - 1) and (each of) your increases will totally disappear 1 minute out of 5.
The only way I've found to work around this limitation is (again, assuming a scrape_interval of 1m) to reverse engineer Prometheus' extrapolation:
increase(request_duration_bucket[31m]) / 31 * 30
But this requires you to be aware of your scrape_interval and adjust for it and is very brittle (if you ever change your scrape_interval all your careful tweaking goes to hell).
Or, if you are OK with your increase falling to zero every time an instance is restarted:
clamp_min(request_duration_bucket - request_duration_bucket offset 30m, 0)
I do actually have a proposed patch to Prometheus to add xrate/xincrease functions that actually behave more as you would expect them to (and as described above) but it doesn't look very likely to be accepted: https://github.com/prometheus/prometheus/issues/3806