If I change the _pdpstep and heartbeat, the RRD is not updated properly - perl

I have an rrd file "abcd" with _pdpstep = 300 and heartbeat = 700. With that configuration it works fine and accepts values. But if I create the file anew with _pdpstep = 1200 and heartbeat = 1500, every value comes back as NaN. How can I check what is wrong? If you need them, I can send the rrdtool info output for both files.

There's not enough information to answer your question.
However, you should probably look at the documentation, specifically the bits about heartbeat, step and the xff in your RRA definitions.
xff The xfiles factor defines what part of a consolidation interval may be made up from UNKNOWN data while the consolidated value is still regarded as known. It is given as the ratio of allowed UNKNOWN PDPs to the number of PDPs in the interval. Thus, it ranges from 0 to 1 (exclusive).
It's quite likely that if you're using a different heartbeat, then your sampling interval is now too low.
The "heartbeat" defines the maximum acceptable interval between samples/updates. If the interval between samples is less than "heartbeat", then an average rate is calculated and applied for that interval. If the interval between samples is longer than "heartbeat", then that entire interval is considered "unknown". Note that there are other things that can make a sample interval "unknown", such as the rate exceeding limits, or a sample that was explicitly marked as unknown.
So, the short answer is: if you change your RRA definitions to use a higher xff (i.e. allow a larger fraction of unknown PDPs per consolidation interval, as described in the quote above), you should stop getting NaNs in your data.
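As a rough sketch only (the DS name "value", the GAUGE type, min/max and the RRA row counts below are invented, not taken from your file), this is where step, heartbeat and xff sit in an rrdtool create call:
# DS name, type, min/max and row counts are placeholders - adjust to your setup
rrdtool create abcd.rrd --step 1200 \
    DS:value:GAUGE:1500:U:U \
    RRA:AVERAGE:0.9:1:1000 \
    RRA:AVERAGE:0.9:24:365
Here 1200 is the step (_pdpstep), 1500 is the per-DS heartbeat, and 0.9 is a fairly permissive xff.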

Related

prometheus: is it possible to use the number of gauge events as a counter?

I use Prometheus to monitor an API service. Currently, I use a Counter to count the number of requests received and a Gauge for the response time in milliseconds.
I've tried to use something like count_over_time(response_time_ms[1m]) to count requests during a time range. However, I got a result where each point has a value of 10.
Why doesn't this work?
count_over_time(response_time_ms[1m]) will tell you the number of samples, not the number of times your Gauge was updated within (what I assume to be) a Java process. Based on the value of 10 you're seeing, I'm assuming your scrape interval is 6 seconds.
For an explanation of why this doesn't work as you would expect it, a Gauge is simply a Java object wrapping a double value. Every time you set its value, that value changes, but nothing more. There's no count of how many times the value changed or any notification sent to Prometheus that this happened. Prometheus simply polls every 6 seconds and collects whatever value was there at the time (never the wiser that the value changed 15 times since the last time it was collected). This is why gauges are intended to measure single values that go up and down (such as memory utilization: it's now 645 MB, in 6 seconds it's 648 MB, in 12 seconds 543 MB): you know the value constantly changes, but the best you can do is sample it every now and then.
For something like request latency, you should use a Histogram: it's basically a counter for the number of observations (i.e. number of requests); a counter for the sum of all observations (i.e. how long all requests put together took); and separate counters for each bucket (i.e. how many requests took less than 1 ms; how many requests took less than 10 ms; etc.). From this you can get an accurate average over any multiple of your scrape interval (i.e. change in total time divided by change in number of requests) as well as estimates for any percentile (including the median). How precise said percentiles are depends on the bucket sizes you choose (and how well they actually match the actual measurements).
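As a sketch of what you could then query (the metric name http_request_duration_seconds is only a common naming convention, not something taken from your service), the average latency and an estimated 95th percentile over the last 5 minutes would look roughly like:
# metric name is assumed - substitute your histogram's name
sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))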
Or, if all you're interested in is the number of requests, then a counter that's incremented on every request will be enough. To adjust for counter resets (e.g. job restarts), you should use increase() rather than the simple difference suggested above:
increase(number_of_requests_total[1m])
If you want to count the number of requests over some specific window ending now (the last 1m in this case), just use
number_of_requests_counter - number_of_requests_counter offset 1m
If you want something like requests per second, then use
rate(number_of_requests_counter[1m])
I can tell you why it's not working with your Gauge, but first of all specify what you assign to this metric. I mean, do you assign some average, the last response time, or something else?
For response time you should use a Summary or Histogram (more info here)

Prometheus query quantile of pod memory usage performance

I'd like to get the 95th percentile memory usage of my pods over the last x time. However, this query starts to take too long if I use a 'big' (7d / 10d) range.
The query I'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
It takes around 100s to complete (I removed extra namespace filters for brevity).
What steps could I take to make this query more performant (other than making the machine bigger)?
I thought about calculating the 95th percentile every x time (let's say 30min), recording it as p95_memory_usage, and using p95_memory_usage instead of container_memory_usage_bytes in the query, so that I can reduce the number of points the query has to go through.
However, wouldn't this distort the values?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics), although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
  labels:
    le: "100000.0"
  expr: |
    (container_memory_usage_bytes <= bool 100000.0)
    +
    (
      container_memory_usage_bytes_bucket{le="100000.0"}
        or ignoring(le)
      container_memory_usage_bytes * 0
    )
Repeat for all bucket sizes you're interested in, and add _count and _sum metrics.
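To give the idea, the _count and _sum rules would follow the same self-referencing pattern, roughly like this (an untested sketch, so double-check the label matching before relying on it):
# untested sketch: increments the count by 1 and the sum by the current value on every evaluation
- record: container_memory_usage_bytes_count
  expr: |
    1
    +
    (
      container_memory_usage_bytes_count
        or
      container_memory_usage_bytes * 0
    )
- record: container_memory_usage_bytes_sum
  expr: |
    container_memory_usage_bytes
    +
    (
      container_memory_usage_bytes_sum
        or
      container_memory_usage_bytes * 0
    )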
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
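For instance, if the second set of rules records hourly increases under a name like container_memory_usage_bytes_bucket:increase1h (the name is an arbitrary choice here), the 10-day quantile could be computed roughly as:
# recorded metric name is an assumption - use whatever your recording rules produce
histogram_quantile(0.95, sum by (le) (sum_over_time(container_memory_usage_bytes_bucket:increase1h[10d])))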
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
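A minimal sketch of that second approach (the group name and the :1h suffix are arbitrary choices):
groups:
  - name: memory_low_resolution
    # evaluate once an hour instead of every scrape
    interval: 1h
    rules:
      - record: container_memory_usage_bytes:1h
        expr: container_memory_usage_bytes
and then query the low resolution series instead of the raw one:
quantile_over_time(0.95, container_memory_usage_bytes:1h[10d])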
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series over the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (i.e. a range query instead of an instant query), then the number of raw samples returned from sum(count_over_time(...)) must be multiplied by the number of points on the Grafana graph, since Prometheus executes the quantile_over_time(...) individually for each point on the displayed graph. Grafana usually requests around 1000 points for building a smooth graph, so the number returned from sum(count_over_time(...)) must be multiplied by 1000 in order to estimate the number of raw samples Prometheus needs to process for building the quantile_over_time(...) graph. See more details in this article.
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process (see the sketch after this list).
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try other Prometheus-compatible systems, which may process heavy queries faster. Try, for example, VictoriaMetrics.
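Combining the first two suggestions into a single sketch (the namespace value is just a placeholder):
# "my-namespace" is a placeholder for your own label filters
quantile_over_time(0.95, container_memory_usage_bytes{namespace="my-namespace"}[1d])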

How should I interpret this Grafana-visualized Prometheus histogram buckets heatmap?

I visualized Prometheus histogram buckets as a heatmap in Grafana; the pic below shows the query and the resulting graph. How should I interpret this?
According to my attacker, I sent exactly 300 requests in total during that period, but when I sum up the numbers on the graph above I can never get exactly 300,
and it also looks like those numbers fluctuate as time elapses. How should I interpret this graph in a meaningful way?
And if I want those numbers to be the exact request counts that fall into each of those buckets in that time window, what should I do?
Oh, for the X-Axis Mode I chose Series and for the Value I chose Current.
There are real reasons why you can't always get a precise rate/increase value out of Prometheus. One of them is failed scrapes, i.e. every now and then a scrape will fail or time out due to a slow service, slow Prometheus or network issue.
The other reason is the fact that collected samples are never exactly scrape_interval apart: there will always be a few milliseconds or seconds of delay here and there. So (to take an extreme example) how can you tell the precise increase over the past 1 minute if you only have 2 samples 63 seconds apart? Is it the difference between the two values? Is it that difference adjusted to 60 seconds (i.e. / 63 * 60)?
That being said, Prometheus further boxes itself into a corner by only looking at samples falling strictly within the requested time range. To explain myself: how would a reasonable person calculate the increase of a counter over the last 30 minutes? They would likely take the value of said counter now and the value 30 minutes ago and subtract them. I.e. in PromQL terms (adjusting for counter resets where necessary):
request_duration_bucket - request_duration_bucket offset 30m
What Prometheus does instead (assuming a scrape_interval of 1m and an ideal timeseries with samples spaced exactly 1m apart) is essentially this:
(request_duration_bucket - request_duration_bucket offset 29m) / 29 * 30
I.e. it takes the increase over 29 minutes and extrapolates it to 30. Because of self-imposed limitations, nothing to do with the nature of the problem at hand.
Note that this works fine with counters that increase smoothly and continuously. E.g. if you have a counter that increases by 500 every minute, then taking the increase over 29 minutes and extrapolating to 30 is exactly correct. But for anything that increases in jumps and fits (which is most real-life counters) it will either slightly overestimate the increase if it occurs during the 29 minutes it actually samples (by exactly 1/29) or seriously underestimate it (if the increase occurs in the 1 minute not included in the sampling). This is even worse if you compute a rate/increase over a range covering fewer samples. E.g. if your range only covers 5 samples on average, the overestimate will be 20%, i.e. 1 / (5 - 1) and (each of) your increases will totally disappear 1 minute out of 5.
The only way I've found to work around this limitation is (again, assuming a scrape_interval of 1m) to reverse engineer Prometheus' extrapolation:
increase(request_duration_bucket[31m]) / 31 * 30
But this requires you to be aware of your scrape_interval and adjust for it and is very brittle (if you ever change your scrape_interval all your careful tweaking goes to hell).
Or, if you are OK with your increase falling to zero every time an instance is restarted:
clamp_min(request_duration_bucket - request_duration_bucket offset 30m, 0)
I do actually have a proposed patch to Prometheus to add xrate/xincrease functions that actually behave more as you would expect them to (and as described above) but it doesn't look very likely to be accepted: https://github.com/prometheus/prometheus/issues/3806

Prometheus/Grafana rate(): what unit for the Y axis?

I have a counter that I am plotting on Grafana.
rate(processed_work_items_total{job="MainWorker"}[1m])
I am not getting the expected numbers in Grafana.
What I want is the number of work items processed per minute.
Is my query wrong, or is it the unit of measure on my Y axis? I currently have it as ops/min and it's giving me a super small number.
According to the documentation, rate(processed_work_items_total{job="MainWorker"}[1m]) will calculate the number of work items processed per second, measured over the last one minute (that's the [1m] from your query).
If you want the number of items per minute, simply multiply the above metric by 60.
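In other words:
rate(processed_work_items_total{job="MainWorker"}[1m]) * 60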
If you need to calculate per-minute increase rate for a counter metric, then use increase(...[1m]). For example, the following query returns the increase of processed_work_items_total{job="MainWorker"} time series over the last minute:
increase(processed_work_items_total{job="MainWorker"}[1m])
Note that the increase() function in Prometheus may return unexpected results due to the following issues:
It may return fractional results over an integer metric because of extrapolation. See this issue for details.
It may miss the counter increase between the last raw sample just before the lookbehind window specified in square brackets and the first raw sample inside the lookbehind window.
It may miss the initial counter increase at the beginning of the time series.
These issues are going to be addressed in Prometheus according to this design doc. In the meantime it is possible to use MetricsQL, which is free from these issues.

Calculating Mbps in Prometheus from cumulative total

I have a metric in Prometheus called unifi_devices_wireless_received_bytes_total, it represents the cumulative total amount of bytes a wireless device has received. I'd like to convert this to the download speed in Mbps (or even MBps to start).
I've tried:
rate(unifi_devices_wireless_received_bytes_total[5m])
Which I think is saying: "please give me the rate of bytes received per second", over the last 5 minutes, based on the documentation of rate, here.
But I don't understand what "over the last 5 minutes" means in this context.
In short, how can I determine the Mbps based on this cumulative amount of bytes metric? This is ultimately to display in a Grafana graph.
You want rate(unifi_devices_wireless_received_bytes_total[5m]) / 1000 / 1000
But I don't understand what "over the last 5 minutes" means in this context.
It's the average over the last 5 minutes.
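And if you literally want megabits per second (Mbps) rather than megabytes, multiply by 8 bits per byte as well:
rate(unifi_devices_wireless_received_bytes_total[5m]) * 8 / 1000 / 1000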
The rate() function returns the average per-second increase rate for the counter passed to it. The average rate is calculated over the lookbehind window passed in square brackets to rate().
For example, rate(unifi_devices_wireless_received_bytes_total[5m]) calculates the average per-second increase rate over the last 5 minutes. It returns a lower than expected rate when 100MB of data is transferred in 10 seconds, because it divides those 100MB by 5 minutes and returns the average data transfer speed as 100MB/5minutes = 333KB/s instead of 10MB/s.
Unfortunately, using 10s as a lookbehind window doesn't work as expected - it is likely that rate(unifi_devices_wireless_received_bytes_total[10s]) would return nothing. This is because rate() in Prometheus expects at least two raw samples in the lookbehind window, which means that for a [10s] window new samples must be written into Prometheus at least every 5 seconds. The solution is to use the irate() function instead of rate():
irate(unifi_devices_wireless_received_bytes_total[5m])
It is likely this query would return a data transfer rate closer to the expected 10MB/s, provided the interval between raw samples (aka scrape_interval) is lower than 10 seconds.
Unfortunately, the irate() function isn't recommended in the general case, since it tends to return jumpy results when refreshing graphs over big time ranges. Read this article for details.
So the ultimate solution is to use the rollup_rate function from VictoriaMetrics - the project I work on. It reliably detects spikes in counter rates by returning the minimum, maximum and average per-second increase rate across all the raw samples on the selected time range.
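Applied to this metric, that would look something like the following (MetricsQL syntax, not something plain Prometheus understands):
# MetricsQL only - returns min/max/avg rate series for the selected window
rollup_rate(unifi_devices_wireless_received_bytes_total[5m])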