Graph a counter from zero in Prometheus/Grafana

In Prometheus, I have a monotonically increasing counter (ifHCInOctets from IF-MIB, in this case).
In Grafana, I can create a graph using the simple query ifHCInOctets{job='snmp',instance='$Device',ifDescr=~'eth0'} and see the counter graphed over different time ranges by selecting the desired range in the upper-right.
This is almost exactly what I want. However, I would like the graph to always start at zero and increase from there. The use case is that I want to visualize my data usage over the course of a month to see how quickly I am approaching my data cap. (I already have a gauge that uses increase(ifHCInOctets{...}[$__range]) to show how much I have used in total over the given time range, but I'd like to be able to visualize that usage over time.)
Basically, I want ifHCInOctets{...} - X where X is the value of ifHCInOctets at the start of the range. My first thought was:
ifHCInOctets{...} - ifHCInOctets{...} offset $__range
But that seems to show me each data point minus the data point $__range time prior to it (rather than just subtracting the starting value from all points).
I then tried creating a query variable with the query query_result(ifHCInOctets{...} offset $__range) and setting it to update on time range change. This almost seemed to work, but the resulting graph always seemed to start slightly negative, depending on the time range selected, which made me think it wasn't doing what I thought it was.
I have also tried various forms of sum, sum_over_time, and increase, all to no avail.

You're probably looking for something like this:
ifHCInOctets
-
min_over_time(
(ifHCInOctets
and
(month(timestamp(ifHCInOctets)) == scalar(month(vector($__to / 1000)))))[31d:]
)
But it doesn't take counter resets into account, and it is ugly and inefficient as hell. It's basically the current value minus the min_over_time calculated over the samples from the previous 31 days that fall in the same month as Grafana's $__to timestamp.
You probably want to set up a recording rule based on this expression (that adds year, month and day labels to a metric) and then calculate the increase() over any given month (including the current month). That takes into account both counter resets and counters that did not exist at the beginning of the month.
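To make the reset caveat concrete, here is a minimal Python sketch (not PromQL) that compares the "value minus minimum" idea with a reset-aware running increase. The sample values and both function names are made up for illustration, and treating any drop as a reset to zero is a simplifying assumption.

```python
def minus_baseline(samples):
    """Each value minus the smallest value in the window (the min_over_time idea)."""
    baseline = min(samples)
    return [v - baseline for v in samples]


def cumulative_increase(samples):
    """Running total of increases; a drop is treated as a counter reset to zero."""
    total = 0
    out = []
    prev = samples[0]
    for v in samples:
        # After a reset the counter restarts from zero, so the whole new value counts.
        total += v - prev if v >= prev else v
        out.append(total)
        prev = v
    return out


steady = [1000, 1200, 1500, 1900]
with_reset = [1000, 1200, 50, 300]  # hypothetical reset between the 2nd and 3rd sample

print(minus_baseline(steady))           # [0, 200, 500, 900]
print(cumulative_increase(steady))      # [0, 200, 500, 900]
print(minus_baseline(with_reset))       # [950, 1150, 0, 250]
print(cumulative_increase(with_reset))  # [0, 200, 250, 500]
```

Without resets the two agree. With a reset, the minimum is no longer the value at the start of the range, so the baseline-subtraction graph neither starts at zero nor keeps increasing.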

Related

Why are my values multiplying when I apply Month/Year to my values?

When I apply Month/Year to Cases or Deaths from my data, the values explode. For Cases it goes from approximately 48 million to over 1 billion, and for Deaths it goes from about 700 thousand to over 22 million. However, when I try the same thing with Initial Claims or the Stringency Index, my values remain correct. I'm trying to find the month over month percentage change by the way. And I'm using the Date column. I only select 2020 and 2021 in the filter for Year.
What I'm asking about is Sheet 21.
Link to workbook: https://public.tableau.com/app/profile/nilajah.rivers/viz/CoronaVirusProject_16323687296770/Sheet21
Your problem is that the data points are daily cumulative deaths. If you change the date aggregation to anything other than days, Tableau will default to summing the numbers for all the days in the month. This will give the wrong result, obviously.
If you want to show the correct total deaths or cases regardless of the time aggregation (months, days, weeks etc.) then you could use the New Case or New Death numbers plus a running sum table calculation. This will always give the correct total for the time period.
Table calculations will also allow automatic calculation of the period to period % change from the same data fields.
This is a common problem when working with datasets that offer pre-calculated aggregations. Tableau doesn't need that, as it can dynamically calculate the aggregation of a field over any given time period, but it is easy to forget which field has pre-aggregated data and which has raw data. Pre-aggregated fields assume a particular time period and can't be used for different time periods without disentangling that assumption, which is unnecessary if you also have the raw data (in this case, daily new deaths/cases).
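The effect is easy to reproduce outside Tableau. The following Python sketch uses made-up daily numbers (not taken from the workbook) to show why summing a cumulative field over a month inflates the total, while summing the raw daily values does not.

```python
# Hypothetical daily NEW deaths for a very short "month" (made-up numbers).
new_per_day = [2, 3, 0, 5]

# The pre-aggregated field: cumulative total as of each day.
cumulative = []
running = 0
for n in new_per_day:
    running += n
    cumulative.append(running)

# Summing the cumulative field counts early days again on every later day.
wrong_month_total = sum(cumulative)
# Summing the raw daily values gives the real total for the period.
right_month_total = sum(new_per_day)

print(cumulative)         # [2, 5, 5, 10]
print(wrong_month_total)  # 22
print(right_month_total)  # 10
```

The last element of the running sum (10) matches the correct total, which is what a running sum table calculation over the new-case field would show.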

How to do a distinct count of a metric using graphite datasource in grafana?

I have a metric that shows the state of a server. The values are integers and if the value is 0 (zero) then the server is stable, else it is unstable. And the graph we have is at a minute level. So, I want to show an aggregated value to know how many hours the server is unstable in the selected time range.
Let's say I select "Last 7 days" as the time range: we should get X hours of server instability.
One more thing: I have a line graph (time series) that shows the state of the server. When I select "Last 24 hours" or "Last 48 hours", I get the graph at a minute level, but when I increase the range to a quarter, I get a point every 5 minutes or so. I understand that it's aggregating the values, but does anybody know how Grafana does the aggregation?
I have tried the scaleToSeconds and consolidateBy functions, and many more, to first get the count of minutes with a non-zero value, but with no success.
Any help would be greatly appreciated.
Thanks in advance.
There are a few different ways to tackle this. There are two places where aggregation happens in this situation:
When you query for a time range longer than your raw retention interval and whisper returns aggregated data. The aggregation method used here is defined in your carbon aggregation configuration.
When Grafana sends a query to Graphite it passes maxDataPoints=<width of graph in pixels>, and Graphite will perform aggregation to return at most that many points (because you don't have enough pixels to render more points than that). The method used for this consolidation is controlled by the consolidateBy function.
Both of these can apply in the same query. For example, if a panel queries 3 days' worth of data and you store 2 days at 1-minute and 7 days at 5-minute intervals in whisper, then you'd get 72 * 60 / 5 = 864 points from the 5-minute archive in whisper. If your graph is only 500px wide, those would be consolidated at runtime down to 10-minute intervals, returning 432 points.
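The arithmetic in that example can be checked with a small Python sketch. The function name is hypothetical, and the rule that each returned point merges ceil(stored / maxDataPoints) stored points is an assumption about how runtime consolidation picks its bucket size, so treat it as an illustration rather than Graphite's exact code.

```python
import math


def consolidated_points(hours, archive_step_min, max_data_points):
    """Return (stored_points, values_per_point, returned_points, effective_step_min)."""
    stored = hours * 60 // archive_step_min
    # Assumption: each returned point merges ceil(stored / max_data_points) stored points.
    per_point = max(1, math.ceil(stored / max_data_points))
    returned = math.ceil(stored / per_point)
    return stored, per_point, returned, archive_step_min * per_point


# 3 days from the 5-minute archive, rendered on a 500px-wide graph.
print(consolidated_points(72, 5, 500))  # (864, 2, 432, 10)
```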
So, if you want to always have access to the count then you can change your carbon configuration to use sum aggregation for those series (and remove the existing whisper files so new ones are created with the new aggregation config), and pass consolidateBy('sum') in your queries, and you'll always get the sum back for each interval.
That said, you can also address this at query time by multiplying the average back out to get a total (assuming that your whisper aggregation config is using average). The simplest way to do that will be to summarize the data with average into buckets that match the longest aggregation interval you'll be querying, then scale those values by that interval to calculate the total number of minutes. Finally, you'll want to use consolidateBy('sum') so that any runtime consolidation will work properly.
consolidateBy(scale(summarize(my.series, '10min', 'avg'), 10), 'sum')
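As a rough illustration of why multiplying the average back out works, the Python sketch below averages a per-minute series into 10-minute buckets and scales each average by the number of minutes in the bucket. It assumes the series has already been reduced to 0 (stable) or 1 (unstable) per minute; the data and the summarize_avg helper are made up for this example.

```python
def summarize_avg(values, bucket_size):
    """Average consecutive values in fixed-size buckets (like summarize(..., 'avg'))."""
    buckets = [values[i:i + bucket_size] for i in range(0, len(values), bucket_size)]
    return [sum(b) / len(b) for b in buckets]


# Hypothetical per-minute state: 0 = stable, 1 = unstable (20 minutes of data).
state = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
         1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

bucket_minutes = 10
averages = summarize_avg(state, bucket_minutes)

# average * minutes-per-bucket recovers the number of unstable minutes in each bucket.
unstable_minutes = sum(avg * bucket_minutes for avg in averages)

print(averages)                 # [0.3, 0.8]
print(round(unstable_minutes))  # 11
```

The scale factor is the bucket length expressed in the unit you want to report, so it has to change if the summarize interval changes.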
With all of that said, you may want to consider reporting uptime in terms of percentages rather than raw minutes, in which case you can use the raw averages directly.
When you say the value is zero (0), the server is healthy - what other values are reported while the server is unhealthy/unstable? If you're only reporting zero (healthy) or one (unhealthy), for example, then you could use the sumSeries function to get a count across multiple servers.
Some more information is needed here about the types of values the server is reporting in order to give you a better answer.
Grafana does aggregate - or consolidate - data typically by using the average aggregation function. You can override this using the 'sum' aggregation in the consolidateBy function.
To get a running calculation over time, you would most likely have to use the summarize function (also with the sum aggregation) and define the time period, e.g. 1 hour, 1 day, 1 week, and so on. You could take this a step further by combining this with a time template variable so that as the period grows/shrinks, the summarize period will increase/decrease accordingly.

How to sum prometheus counters when k8s pods restart

I'm running Prometheus in a Kubernetes cluster. All is running fine and my UI pods are counting visitors.
Please ignore the title; what you see here is the query at the bottom of the image. It's a counter. The gaps in the graph are due to pods restarting. I have two pods running simultaneously!
Now suppose I would like to count the total of visitors, so I need to sum over all the pods
This is what I expect considering the first image, right?
However, I don't want the graph to drop when a pod restarts. I would like to have something cumulative over a specified amount of time (somehow ignoring pods restarting). Hope this makes any sense. Any suggestions?
UPDATE
An answer below suggests the following.
It's a bit hard to see because I've plotted everything there, but the suggested answer, sum(rate(NumberOfVisitors[1h])) * 3600, is the continuous green line. What I don't understand now is why it has a value of 3. Also, why does the value only increase after 21:55, when I can see some values before that?
The approach seems to be OK: I noticed that the actual increase is in fact 3, going from 1 to 4. In the graph below I've used just one time series to reduce noise.
Rate, then sum, then multiply by the time range in seconds. That will handle rollovers on counters too.
Prometheus doesn't provide a way to sum counters that may be reset. Additionally, the increase() function in Prometheus has some issues that may make it unsuitable for querying the counter increase over a specified time range:
It may return fractional values over integer counters because of extrapolation. See this issue for details.
It may miss the counter increase between the last raw sample just before the lookbehind window in square brackets and the first raw sample inside the window. For example, increase(NumberOfVisitors[1m]) at timestamp t may miss the counter increase between the last raw sample just before t-1m and the first raw sample in the (t-1m ... t] time range. See more details here and here.
It may miss the increase for the first raw sample in a time series. For example, if the NumberOfVisitors counter is increased to 10 just before the first scrape of this counter by Prometheus, then increase() over the time range with the first sample would under-count the counter increase by 10.
Prometheus developers are going to fix these issues - see this design doc. In the meantime it is possible to use VictoriaMetrics - its increase() function is free from these issues.
Returning to the original question - the sum of multiple counters, which may be reset, can be returned with the following MetricsQL query in VictoriaMetrics:
running_sum(sum(increase(NumberOfVisitors)))
It uses the following functions:
increase() for calculating increase per each counter between adjacent points on the graph.
sum() for summing the calculated increases per each point on the graph.
running_sum() for calculating the running sum over per-point increases on the graph.
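The three steps can be mimicked in plain Python to see why the result never drops on a pod restart. This is only a sketch of the idea, not VictoriaMetrics' implementation: the pod data is made up, the helper names are hypothetical, any drop is treated as a reset to zero, and the first sample is assumed to contribute no increase.

```python
def per_step_increase(samples):
    """Increase between adjacent samples; a drop is treated as a reset to zero."""
    out = [0]  # assumption: the first sample contributes no increase
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - prev if cur >= prev else cur)
    return out


def running_sum(values):
    """Cumulative sum of the per-point values."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


# Hypothetical visitor counters from two pods, sampled at the same timestamps.
pod_a = [1, 4, 6, 0, 2]  # restarts before the 4th sample
pod_b = [3, 3, 5, 8, 1]  # restarts before the 5th sample

increases = [per_step_increase(p) for p in (pod_a, pod_b)]
summed = [sum(step) for step in zip(*increases)]  # sum() across pods per point

print(summed)               # [0, 3, 4, 3, 3]
print(running_sum(summed))  # [0, 3, 7, 10, 13]
```

Summing the raw counters instead would give [4, 7, 11, 8, 3], which drops at each restart.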

Prometheus/Grafana Rate()....what Unit for Y Axis

I have a counter that I am plotting on Grafana.
rate(processed_work_items_total{job="MainWorker"}[1m])
I am not getting the expected numbers in grafana.
What I want is the # of Work Items Processed per minute.
Is my query wrong, or is it the unit of measure on my Y axis? I currently have it as ops/min and it's giving me a super small number.
According to the documentation, rate(processed_work_items_total{job="MainWorker"}[1m]) will calculate the number of work items processed per second, measured over the last one minute (that's the [1m] from your query).
If you want the number of items per minute, simply multiply the above expression by 60.
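A quick Python sketch of the unit conversion, using made-up counter values and a hypothetical helper that ignores resets and extrapolation:

```python
def per_second_rate(first_value, last_value, window_seconds):
    """Average per-second increase of a counter across a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


# Hypothetical: the counter goes from 500 to 530 over a 60-second window.
rate_per_second = per_second_rate(500, 530, 60)
rate_per_minute = rate_per_second * 60

print(rate_per_second)  # 0.5
print(rate_per_minute)  # 30.0
```

Labelling the per-second figure (0.5) as ops/min is what makes the number on the Y axis look far too small.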
If you need to calculate per-minute increase rate for a counter metric, then use increase(...[1m]). For example, the following query returns the increase of processed_work_items_total{job="MainWorker"} time series over the last minute:
increase(processed_work_items_total{job="MainWorker"}[1m])
Note that the increase() function in Prometheus may return unexpected results due to the following issues:
It may return fractional results for integer metrics because of extrapolation. See this issue for details.
It may miss counter increase between the last raw sample just before the lookbehind window specified in square brackets and the first raw sample inside the lookbehind window.
It may miss the initial counter increase at the beginning of the time series.
These issues are going to be addressed in Prometheus according to this design doc. In the meantime it is possible to use MetricsQL, which is free from these issues.

How to show delta from start of timespan to point in time for counter

I am trying to visualize a counter increase over time.
But I'm facing two problems:
The graph doesn't start at zero for the timeframe and
Whenever the counter resets, the graph hits zero again
This makes the graph very hard to read, because what I would really like is to see how quickly the counter increases over time, while also being able to get a quick overview of the total increase at any given point in time, measured from the start of the time frame.
Visualisation of my problem
Update 20. November
Result of 'increase([your_metric_name][1m])'
You need to use some type of rate or increase function to get the type of graph you're looking for. And since you're using Prometheus, your query will look something like this:
rate([your_metric_name][1m])
if you want the rate per second, or
increase([your_metric_name][1m])
if you want something more like a delta.
These pages can give you more information too: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate()
https://prometheus.io/docs/prometheus/latest/querying/functions/#increase()