Prometheus histograms and averaging sets with NaN values included - grafana

In my app I have histograms set up for websocket ping times to every country, one histogram per country. In Grafana I have a graph of the average ping time for the several countries I'm most interested in, via the following query:
rate(country_ping_sum{country=~"AU|NZ|CA|GB|US",instance="$instance"}[15m]) / rate(country_ping_count{country=~"AU|NZ|CA|GB|US",instance="$instance"}[15m])
This works perfectly well. I get a graph for each country. Now I want to add to the same graph an average of all the other countries combined into one.
avg(rate(country_ping_sum{country!~"AU|NZ|CA|GB|US",instance="$instance"}[15m]) / rate(country_ping_count{country!~"AU|NZ|CA|GB|US",instance="$instance"}[15m]))
This fails. When I try the query in the Prometheus console I get a value of NaN. If I take the same query and remove the avg() function, I get a list of every matching country; some have values and some have NaN. Many of the countries have a rate of 0 for both the sum and the count, so those 0 / 0 divisions evaluate to NaN for those particular countries.
So my question is: how can I filter out NaN values before passing them to avg()?

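You're effectively taking an average of averages, which is generally not correct: it weights every country equally, regardless of how many pings each one received.
Instead, sum each rate across the countries and then divide the two sums to get the overall average. This also sidesteps the NaN problem, because a country with no traffic adds 0 to both sums instead of contributing a 0 / 0 term.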
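The difference between the two approaches can be checked with plain arithmetic. The sketch below uses made-up per-country rates (the country codes and numbers are assumptions, not data from the question) and mimics PromQL's float division, where 0 / 0 is NaN. In PromQL terms, the second calculation would correspond to something like
sum(rate(country_ping_sum{country!~"AU|NZ|CA|GB|US",instance="$instance"}[15m])) / sum(rate(country_ping_count{country!~"AU|NZ|CA|GB|US",instance="$instance"}[15m]))

```python
import math

# Hypothetical per-country rates over the window: (rate of _sum, rate of _count).
rates = {
    "FR": (0.30, 2.0),  # mean ping 0.15 s at 2 pings/s
    "DE": (0.05, 1.0),  # mean ping 0.05 s at 1 ping/s
    "IS": (0.0, 0.0),   # no traffic in the window
}


def ratio(num, den):
    """Divide like PromQL floats do: 0/0 is NaN, x/0 is +/-Inf, no exception."""
    if den == 0:
        return math.nan if num == 0 else math.copysign(math.inf, num)
    return num / den


# avg(rate(sum) / rate(count)): a single NaN poisons the whole average.
per_country = [ratio(s, c) for s, c in rates.values()]
avg_of_ratios = sum(per_country) / len(per_country)

# sum(rate(sum)) / sum(rate(count)): idle countries add 0 to both totals.
total_sum = sum(s for s, _ in rates.values())
total_count = sum(c for _, c in rates.values())
overall = ratio(total_sum, total_count)

print("average of per-country averages:", avg_of_ratios)
print("overall average:", overall)
```

Even without the idle country, the two numbers differ here: averaging the two per-country means gives 0.10, while the traffic-weighted overall mean is about 0.117.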

Related

Why is sum of series coming as fractional and less than the actual values in graphite

I am creating a dashboard using metrics in Graphite. I have tried consolidateBy to get all the metrics. The metric values look correct and are in the thousands.
The graphite query for the same is
consolidateBy(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count, 'sum')
I want to get the total number of requests, which would be sum of all this series.
So I tried the sum function, but it gives a sum of around 1-12, which is far lower than the individual values.
The query is
sum(consolidateBy(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count, 'sum'))
This query also gives the same result
sum(monitors.x.y.z.client_metrics.k.*.*.*.*.*.response_codes.*.count)
My questions are:
Why does the sum of the series give point values lower than the actual values?
I just want to calculate the total number of requests. If there is an easier solution, can you please describe it?

Want to SUM all values for a specific date within column NOT sum all values in that column

I want to create a graph that shows the total capacity for each week relative to the remaining availability across a series of specific dates. When I attempt this in Power BI, it calculates one of the values (remaining availability) correctly, but for the total capacity it generates a value much higher than my manual calculation: it shows the total for the entire column rather than for each specific date.
Why is Power BI doing this, and how can I solve it?
So far, I have tried generating the graph like this:
(https://i.stack.imgur.com/GV3vk.png)
As you can see, the capacity values are far too high; they should be 25 days.
The total availability values are correct (ranging from 0 to 5.5 days).
When I create matrices to see the sum breakdown, the sums are correct. It is only when the two are combined in one graph that one of the values changes to the total for the whole column.
If anyone could help me with this issue that would be great! Thanks!

How to do a distinct count of a metric using graphite datasource in grafana?

I have a metric that shows the state of a server. The values are integers: if the value is 0 (zero) the server is stable, otherwise it is unstable. The graph we have is at one-minute resolution. I want to show an aggregated value that tells me how many hours the server was unstable in the selected time range.
Let's say I select "Last 7 days" as the time range: I should get X hours of server instability.
One more thing: I have a line graph (time series graph) that shows the state of the server. When I select "Last 24 hours" or "Last 48 hours" the graph is at one-minute resolution, but when I increase the range to a quarter I get a point every 5 minutes or so. I understand that the values are being aggregated, but does anybody know how Grafana does the aggregation?
I have tried the scaleToSeconds and consolidateBy functions, and many more, to first get the count of minutes with a non-zero value, but without success.
Any help would be greatly appreciated.
Thanks in advance.
There are a few different ways to tackle this. There are 2 places where aggregation happens in this situation:
When you query for a time range longer than your raw retention interval and whisper returns aggregated data. The aggregation method used here is defined in your carbon aggregation configuration.
When Grafana sends a query to Graphite it passes maxDataPoints=<width of graph in pixels>, and Graphite will perform aggregation to return at most that many points (because you don't have enough pixels to render more points than that). The method used for this consolidation is controlled by the consolidateBy function.
Both can apply to the same query. For example, if a panel queries 3 days' worth of data and you store 2 days at 1-minute and 7 days at 5-minute intervals in whisper, you get 72 * 60 / 5 = 864 points from the 5-minute archive. If your graph is only 500px wide, those points are consolidated at runtime down to 10-minute intervals, and 432 points are returned.
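The arithmetic in that example can be reproduced with a few lines of Python. This is only a sketch: the function name is made up, and it assumes that runtime consolidation merges a whole number of stored points into each returned point, which is what the 864 to 432 reduction above implies.

```python
import math


def points_returned(range_hours, archive_step_min, max_data_points):
    """Return (points read from the archive, points merged per bucket, points returned)."""
    stored = range_hours * 60 // archive_step_min
    # Assumption: a whole number of stored points is merged into each returned point.
    per_bucket = max(1, math.ceil(stored / max_data_points))
    returned = math.ceil(stored / per_bucket)
    return stored, per_bucket, returned


# 3 days of data from the 5-minute archive, drawn on a 500px-wide panel.
stored, per_bucket, returned = points_returned(72, 5, 500)
effective_step_min = 5 * per_bucket

print(stored, per_bucket, returned, effective_step_min)
```

With these inputs, 864 stored points are merged two at a time, giving 432 points at an effective 10-minute step.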
So, if you want to always have access to the count then you can change your carbon configuration to use sum aggregation for those series (and remove the existing whisper files so new ones are created with the new aggregation config), and pass consolidateBy('sum') in your queries, and you'll always get the sum back for each interval.
That said, you can also address this at query time by multiplying the average back out to get a total (assuming that your whisper aggregation config is using average). The simplest way to do that will be to summarize the data with average into buckets that match the longest aggregation interval you'll be querying, then scale those values by that interval to calculate the total number of minutes. Finally, you'll want to use consolidateBy('sum') so that any runtime consolidation will work properly.
consolidateBy(scale(summarize(my.series, '10min', 'avg'), 10), 'sum')
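To see why multiplying the average back out recovers a total, here is a small sketch. It assumes one-minute raw samples that are either 0 (stable) or 1 (unstable); the sample data and the function name are made up. Under that assumption the average of a bucket is the fraction of unstable minutes, so the scale factor that turns it back into minutes is the number of minutes in the bucket.

```python
import math


def unstable_minutes(samples, bucket_minutes):
    """Average each bucket, then scale by the bucket length to get a per-bucket total."""
    totals = []
    for start in range(0, len(samples), bucket_minutes):
        bucket = samples[start:start + bucket_minutes]
        average = sum(bucket) / len(bucket)       # what summarize(..., 'avg') yields
        totals.append(average * bucket_minutes)   # what scale(...) multiplies back out
    return totals


# Hypothetical 20 one-minute samples: 1 = unstable, 0 = stable.
samples = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
           1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

per_bucket = unstable_minutes(samples, 10)
total = sum(per_bucket)  # summing the buckets is what consolidateBy('sum') preserves

print(per_bucket, total)
```

If the metric reports values other than 0 and 1 while unstable, the average no longer equals the unstable fraction, and the series would first have to be mapped to 0/1.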
With all of that said, you may want to consider reporting uptime in terms of percentages rather than raw minutes, in which case you can use the raw averages directly.
When you say the value is zero (0), the server is healthy - what other values are reported while the server is unhealthy/unstable? If you're only reporting zero (healthy) or one (unhealthy), for example, then you could use the sumSeries function to get a count across multiple servers.
Some more information is needed here about the types of values the server is reporting in order to give you a better answer.
Grafana does aggregate - or consolidate - data typically by using the average aggregation function. You can override this using the 'sum' aggregation in the consolidateBy function.
To get a running calculation over time, you would most likely have to use the summarize function (also with the sum aggregation) and define the time period, e.g. 1 hour, 1 day, 1 week, and so on. You could take this a step further by combining this with a time template variable so that as the period grows/shrinks, the summarize period will increase/decrease accordingly.

Prometheus query quantile of pod memory usage performance

I'd like to get the 95th percentile of memory usage for my pods over the last x amount of time. However, this query starts to take too long if I use a 'big' range (7 to 10 days).
The query that I'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
It takes around 100s to complete.
I removed extra namespace filters for brevity.
What steps could I take to make this query more performant (other than making the machine bigger)?
I thought about calculating the 95th percentile at regular intervals (let's say every 30 min), recording it as p95_memory_usage, and using p95_memory_usage in the query instead of container_memory_usage_bytes, so that I can reduce the number of points the query has to go through.
However, would this not distort the values?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
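A tiny numeric example shows why. The sketch below uses made-up memory samples and a simple nearest-rank percentile (an assumption for illustration; Prometheus interpolates between samples, but the effect is the same): neither the mean nor the max of the per-window p95 values matches the p95 of the combined data.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Hypothetical memory samples (MiB) from two consecutive windows of 20 samples each.
window_a = [100] * 18 + [900] * 2   # short spike
window_b = [100] * 20               # quiet

p95_a = percentile(window_a, 95)
p95_b = percentile(window_b, 95)

mean_of_p95s = (p95_a + p95_b) / 2
max_of_p95s = max(p95_a, p95_b)
true_p95 = percentile(window_a + window_b, 95)

print(p95_a, p95_b, mean_of_p95s, max_of_p95s, true_p95)
```

Here the two windows give 900 and 100, so their mean is 500 and their max is 900, while the p95 over all 40 samples is 100.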
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics) although doing it may be tedious. Something like:
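- record: container_memory_usage_bytes_bucket
  labels:
    le: "100000.0"
  expr: |
    # 1 if the current sample falls into this bucket (value <= le), else 0 ...
    (container_memory_usage_bytes <= bool 100000.0)
    + ignoring(le)
    (
      # ... added to the bucket's previous value, or to 0 on the first evaluation.
      container_memory_usage_bytes_bucket{le="100000.0"}
      or ignoring(le)
      container_memory_usage_bytes * 0
    )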
Repeat for all bucket sizes you're interested in, add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
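To illustrate why histograms aggregate cleanly, the sketch below adds the cumulative bucket counts of two windows and then estimates a quantile by linear interpolation inside the matching bucket. It is a simplified stand-in for what histogram_quantile does, not its actual implementation: the function names and the bucket data are made up, and it assumes positive bucket bounds, a final +Inf bucket, and 0 < q <= 1.

```python
import math


def merge_buckets(a, b):
    """Add two cumulative histograms that share the same upper bounds."""
    return [(le, count_a + count_b) for (le, count_a), (_, count_b) in zip(a, b)]


def quantile_from_buckets(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical hourly increases of the bucket counters (bounds in MiB).
hour_1 = [(100.0, 5), (200.0, 30), (400.0, 48), (math.inf, 50)]
hour_2 = [(100.0, 5), (200.0, 30), (400.0, 47), (math.inf, 50)]

two_hours = merge_buckets(hour_1, hour_2)
median = quantile_from_buckets(0.5, two_hours)
p99 = quantile_from_buckets(0.99, two_hours)

print(two_hours, median, p99)
```

The accuracy trade-off mentioned in the answer is visible here: any quantile that lands above the last finite bound can only be reported as that bound, so the choice of buckets limits the precision.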
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it has to take into account all the raw samples of all the container_memory_usage_bytes time series over the last 10 days. The number of samples to process can be quite large. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used to build a graph in Grafana (that is, a range query rather than an instant query), the number of raw samples returned by sum(count_over_time(...)) must be multiplied by the number of points on the graph, because Prometheus executes quantile_over_time(...) separately for each point. Grafana usually requests around 1000 points to draw a smooth graph, so multiply the result of sum(count_over_time(...)) by 1000 to estimate how many raw samples Prometheus has to process for the graph. See more details in this article.
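As a rough back-of-the-envelope version of that estimate, the sketch below multiplies it out. The series count and the 30-second scrape interval are assumptions chosen for illustration; the real sample count comes from the sum(count_over_time(...)) query.

```python
def samples_to_scan(series, scrape_interval_s, window_days, graph_points=1):
    """Estimate raw samples read: series x samples per window x points on the graph."""
    per_series = window_days * 24 * 3600 // scrape_interval_s
    return series * per_series * graph_points


# Hypothetical setup: 500 container series scraped every 30 seconds, [10d] window.
instant_query = samples_to_scan(series=500, scrape_interval_s=30, window_days=10)
graph_query = samples_to_scan(series=500, scrape_interval_s=30, window_days=10,
                              graph_points=1000)
one_day_window = samples_to_scan(series=500, scrape_interval_s=30, window_days=1,
                                 graph_points=1000)

print(f"{instant_query:,} {graph_query:,} {one_day_window:,}")
```

Under these assumptions a single instant query touches about 14.4 million samples and a 1000-point graph about 14.4 billion, which is why shrinking the window or the number of matched series has such a large effect.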
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process.
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try using other Prometheus-compatible systems, which may process heavy queries at faster speed. Try, for example, VictoriaMetrics.

How to calculate the average value in a Prometheus query from Grafana

I was trying to create a Prometheus graph in Grafana, but I can't find the function that calculates the average value.
For example, when I create a graph for read_latency, the result contains many tags. If there are 3 machines, there are 3 separate tags: machine1, machine2 and machine3. Here is the graph:
(image: Prometheus)
I want to combine these three together, so there will be only one tag : machines, and the value is the average of those three.
It seems that Prometheus query function doesn't have something like average(), so I am not sure how to do this.
I used to work with InfluxDB, where the graph can look like this:
(image: influxDB)
I think you are looking for the avg() operation; see the documentation.
Use the built-in $__interval variable, where node and name are custom labels (depending on your metrics):
sum(avg_over_time(some_metric[$__interval])) by (node, name)
or a fixed value such as 1m, 1h, etc.:
sum(avg_over_time(some_metric[1m])) by (node, name)
You can filter using Grafana variables:
sum(avg_over_time(some_metric{cluster=~"$cluster"}[1m])) by (node, name)
Short answer: use avg() function to return the average value across multiple time series. For example, avg(metric) returns the average value for time series with metric name.
Long answer: Prometheus provides two functions for calculating the average:
avg_over_time calculates the average over the raw samples stored in the database within the lookbehind window specified in square brackets. The average is calculated independently for each matching time series. For example, avg_over_time(metric[1h]) calculates the average of the raw samples over the last hour for each time series with the metric name.
avg calculates the average over multiple time series. The average is calculated independently per each point on the graph.
If you need to calculate the average over raw samples across all the time series, which match the given selector, per each time bucket, e.g.:
SELECT
time_bucket('5 minutes', timestamp) AS t,
avg(value)
FROM table
GROUP BY t
Then the following PromQL query must be used:
sum(sum_over_time(metric[$__interval])) / sum(count_over_time(metric[$__interval]))
Do not use avg(avg_over_time(metric[$__interval])), since it returns average of averages, which isn't equal to real average. See this explanation for details.
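A small worked example makes the difference concrete. The data below is made up: two series that happened to collect a different number of raw samples within the same $__interval window. The comments name the PromQL expression each line is meant to mirror.

```python
# Hypothetical raw samples seen in one $__interval window, per series.
window = {
    "machine1": [10.0, 10.0, 10.0, 10.0],  # scraped four times
    "machine2": [50.0],                    # scraped once
}

# avg_over_time(metric[$__interval]): one average per series.
avg_over_time = {name: sum(v) / len(v) for name, v in window.items()}

# avg(avg_over_time(...)): every series counts equally, whatever its sample count.
avg_of_avgs = sum(avg_over_time.values()) / len(avg_over_time)

# sum(sum_over_time(...)) / sum(count_over_time(...)): every raw sample counts equally.
total = sum(sum(v) for v in window.values())
count = sum(len(v) for v in window.values())
true_avg = total / count

print(avg_over_time, avg_of_avgs, true_avg)
```

The average of averages comes out at 30, while the average over all five raw samples is 18. The two only agree when every series contributes the same number of samples to the window.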