Locust 95th percentile is higher than max

Sometimes when I run Locust for some scenarios, the 95th percentile value is higher than the max. As far as I understand, the 95th percentile means that 95% of requests took less time than this. So how can the max value be lower than the 95th percentile? Am I doing something wrong here?
I also found that this only happens when there are very few requests, like 15 or fewer.

Percentiles are approximated in Locust.
This is done for performance reasons, as calculating an exact percentile would need to consider every sample (and doing this continuously for large runs would just not work).
Min, max and average (mean) are accurate though.
And in longer runs (more than those 15 requests) the 95th percentile should not exceed your max.
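A minimal sketch of how an approximated percentile can end up above an exact max. It assumes response times are rounded into coarse buckets before the percentile is taken; the rounding scheme and the names round_to_bucket and approx_percentile are illustrative assumptions, not Locust's actual code.

```python
import math


def round_to_bucket(ms):
    """Round a response time to a coarse bucket (assumed scheme, for illustration)."""
    if ms < 100:
        return round(ms)
    if ms < 1000:
        return round(ms, -1)   # nearest 10 ms
    if ms < 10000:
        return round(ms, -2)   # nearest 100 ms
    return round(ms, -3)       # nearest 1000 ms


def approx_percentile(samples_ms, percent):
    """Nearest-rank percentile computed over the rounded values only."""
    buckets = sorted(round_to_bucket(s) for s in samples_ms)
    rank = max(math.ceil(percent * len(buckets)), 1)
    return buckets[rank - 1]


samples = [120, 340, 1260]           # very few requests
exact_max = max(samples)             # tracked exactly
p95 = approx_percentile(samples, 0.95)
print(exact_max, p95)                # the slowest sample was rounded up past the max
```

With only a handful of samples the 95th percentile lands on the slowest request, so any upward rounding of that one value is directly visible. With many samples the percentile usually falls on a faster request and the effect disappears.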

Related

Difference between istio_request_bytes_count and istio_request_bytes_sum?

Can someone briefly explain the difference between istio_request_bytes_count and istio_request_bytes_sum? And why is the "istio_request_bytes" standard metric missing?
Istio Standard Metrics notes that istio_request_bytes is a DISTRIBUTION type metric. In Prometheus, this would appear as a histogram metric. So, you should see three metrics:
istio_request_bytes_count is the number of requests
istio_request_bytes_sum is the total number of bytes, added together across all requests
istio_request_bytes_bucket{le="1024"} is the total number of requests where the request size is 1 KiB or smaller
You can calculate the average request size by dividing the sum by the count. You can also use Prometheus functions like histogram_quantile() to calculate the median (50th-percentile) size.
This also applies to the other standard metrics. A common thing to measure is 95th-percentile latency ("p95"): how long does it take 95% of the requests to execute, where the remaining 5% take longer than this? histogram_quantile() expects per-bucket rates rather than a raw range selector, so histogram_quantile(0.95, sum by (le) (rate(istio_request_duration_milliseconds_bucket[1h]))) could compute this over the most recent hour.
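To make the arithmetic concrete, here is a rough sketch of how an average and a quantile can be estimated from cumulative histogram buckets. The bucket boundaries and counts are made up, and estimate_quantile is a hypothetical helper that interpolates linearly within a bucket in the same spirit as histogram_quantile(); it is not the Prometheus implementation.

```python
def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper_bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound          # fall back to the highest finite bound
            if count == lower_count:
                return upper_bound          # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up values standing in for istio_request_bytes_bucket / _sum / _count.
buckets = [(256, 10), (1024, 60), (4096, 95), (float("inf"), 100)]
bytes_sum = 90_000
request_count = 100

average_size = bytes_sum / request_count      # sum divided by count
median_size = estimate_quantile(0.5, buckets)
p95_size = estimate_quantile(0.95, buckets)
print(average_size, median_size, p95_size)
```

The estimate can never be more precise than the bucket it falls into, which is why bucket boundaries matter.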

Graph Of utilization

My problem is as follows: I would like to create a graph of the percentage use of boxes over 24 hours. However, the box.utilization() function is cumulative, so I tried to solve the problem by creating a dataset that collects the values every hour and an event that resets the utilization so that the next hour is not affected by the previous hour's utilization.
(I attach a picture of the graph I created).
Is there a more efficient way?
I have faced the same issue before. Here is how I handled it:
Instead of cumulative utilization, I calculate the maximum hourly utilization. That is, I record the number of seized resources every minute, which gives an array of 60 elements per hour. Then I divide the maximum number in that array by the total number of resources available. An example:
I have 100 machines
During an hour, maximum of 60 of them were busy
60/100= 60% maximum utilization during that hour
Then I plot these for each hour.
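The calculation above can be sketched as follows. The function name and the sample data are hypothetical, and the per-minute busy counts are assumed to have been collected already.

```python
def max_hourly_utilization(busy_per_minute, total_resources):
    """Peak utilization for one hour: highest busy count / resources available."""
    if total_resources <= 0:
        raise ValueError("total_resources must be positive")
    if not busy_per_minute:
        raise ValueError("need at least one per-minute sample")
    return max(busy_per_minute) / total_resources


# 60 per-minute readings for one hour, out of 100 machines (made-up data).
busy_per_minute = [40] * 30 + [60] * 10 + [55] * 20
utilization = max_hourly_utilization(busy_per_minute, 100)
print(f"{utilization:.0%}")
```

Running this once per hour gives 24 values to plot, and no reset of the cumulative statistic is needed.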

prometheus: is it possible to use event number of gauge as a counter?

I use Prometheus to monitor an API service. Currently, I use a Counter to count the number of requests received and a Gauge for the response time in milliseconds.
I've tried to use something like count_over_time(response_time_ms[1m]) to count requests during a time range. However, every point in the result has a value of 10.
Why doesn't this work?
count_over_time(response_time_ms[1m]) will tell you the number of samples, not the number of times your Gauge was updated within (what I assume to be) a Java process. Based on the value of 10 you're seeing, I'm assuming your scrape interval is 6 seconds.
For an explanation of why this doesn't work as you would expect it, a Gauge is simply a Java object wrapping a double value. Every time you set its value, that value changes, but nothing more. There's no count of how many times the value changed or any notification sent to Prometheus that this happened. Prometheus simply polls every 6 seconds and collects whatever value was there at the time (never the wiser that the value changed 15 times since the last time it was collected). This is why gauges are intended to measure single values that go up and down (such as memory utilization: it's now 645 MB, in 6 seconds it's 648 MB, in 12 seconds 543 MB): you know the value constantly changes, but the best you can do is sample it every now and then.
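A small simulation of that polling behaviour, assuming a 6 second scrape interval and a made-up stream of gauge updates. scrape_gauge is a hypothetical helper, not part of any Prometheus client.

```python
def scrape_gauge(updates, scrape_interval_s, window_s):
    """Return the (time, value) samples a poller would collect.

    updates: list of (time_s, value) gauge assignments, sorted by time.
    Each scrape only sees the most recent value; earlier ones are lost.
    """
    samples = []
    current = None
    i = 0
    for t in range(scrape_interval_s, window_s + 1, scrape_interval_s):
        while i < len(updates) and updates[i][0] <= t:
            current = updates[i][1]
            i += 1
        if current is not None:
            samples.append((t, current))
    return samples


# 150 requests in one minute, each overwriting the gauge (made-up data).
updates = [(k * 0.4, 100 + k) for k in range(150)]
samples = scrape_gauge(updates, scrape_interval_s=6, window_s=60)

print(len(updates))   # how many times the gauge was set
print(len(samples))   # what counting samples over the minute would report
```

The number of samples depends only on the scrape interval and the window, never on how many requests were served.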
For something like request latency, you should use a Histogram: it's basically a counter for the number of observations (i.e. number of requests); a counter for the sum of all observations (i.e. how long all requests put together took); and separate counters for each bucket (i.e. how many requests took less than 1 ms; how many requests took less than 10 ms; etc.). From this you can get an accurate average over any multiple of your scrape interval (i.e. change in total time divided by change in number of requests) as well as estimates for any percentile (including the median). How precise said percentiles are depends on the bucket sizes you choose (and how well they actually match the actual measurements).
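As an illustration only, a histogram of this kind can be sketched as a few counters. The class below is a hypothetical stand-in, not the Prometheus client, and it stores per-bucket counts rather than the cumulative counts Prometheus exposes.

```python
import bisect


class SimpleHistogram:
    """Counts, sums and buckets observations (illustrative sketch)."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per bound plus a final slot acting as the +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # Index of the first bound that is >= value ("less than or equal" bucket).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value

    def snapshot(self):
        return self.count, self.sum


histogram = SimpleHistogram([1, 10, 100])   # bucket bounds in ms
for latency_ms in (0.5, 4, 8):
    histogram.observe(latency_ms)
before = histogram.snapshot()               # what one scrape would see

for latency_ms in (20, 40):
    histogram.observe(latency_ms)
after = histogram.snapshot()                # what the next scrape would see

# Average latency between the two scrapes: change in sum / change in count.
average_ms = (after[1] - before[1]) / (after[0] - before[0])
print(average_ms, histogram.bucket_counts)
```

Because every value is a counter that only grows, nothing is lost between scrapes, unlike with a gauge.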
Or, if all you're interested in is the number of requests, then a counter that's incremented on every request will be enough. To adjust for counter resets (e.g. job restarts), you should use increase() rather than a simple difference between the counter's current value and its value one minute ago:
increase(number_of_requests_total[1m])
If you want to count the number of requests in some specific time from now (the last 1m in this case), just use
number_of_requests_counter - number_of_requests_counter offset 1m
If you want something like requests per second, then use
rate(number_of_requests_counter[1m])
I can tell you why it's not working with your Gauge, but first please specify what you assign to this metric. I mean, do you assign an average, the last response time, or something else?
For response time you should use a Summary or Histogram (more info here)

How should i interpret this grafana visualized prometheus histogram buckets heatmap?

I visualized Prometheus histogram buckets as a heatmap with Grafana. The picture below shows the query and the resulting graph. How should I interpret this?
According to my attacker, I sent exactly 300 requests in total in that period, but when I sum up the numbers on the graph above I can never get exactly 300.
Those numbers also seem to fluctuate as time passes. How should I interpret this graph in a meaningful way?
And if I want those numbers to be the exact request counts falling into each of those buckets in that time window, what should I do?
Oh, for the X-Axis Mode I chose Series and for the Value I chose Current.
There are real reasons why you can't always get a precise rate/increase value out of Prometheus. One of them is failed scrapes, i.e. every now and then a scrape will fail or time out due to a slow service, slow Prometheus or network issue.
The other reason is the fact that collected samples are never exactly scrape_interval apart: there will always be a few milliseconds or seconds of delay here and there. So (to take an extreme example) how can you tell the precise increase over the past 1 minute if you only have 2 samples 63 seconds apart? Is it the difference between the two values? Is it that difference adjusted to 60 seconds (i.e. / 63 * 60)?
That being said, Prometheus further boxes itself into a corner by only looking at samples falling strictly within the requested time range. To explain myself: how would a reasonable person calculate the increase of a counter over the last 30 minutes? They would likely take the value of said counter now and the value 30 minutes ago and subtract them. I.e. in PromQL terms (adjusting for counter resets where necessary):
request_duration_bucket - request_duration_bucket offset 30m
What Prometheus does instead (assuming a scrape_interval of 1m and an ideal timeseries with samples spaced exactly 1m apart) is essentially this:
(request_duration_bucket - request_duration_bucket offset 29m) / 29 * 30
I.e. it takes the increase over 29 minutes and extrapolates it to 30. Because of self-imposed limitations, nothing to do with the nature of the problem at hand.
Note that this works fine with counters that increase smoothly and continuously. E.g. if you have a counter that increases by 500 every minute, then taking the increase over 29 minutes and extrapolating to 30 is exactly correct. But for anything that increases in fits and starts (which is most real-life counters) it will either slightly overestimate the increase if it occurs during the 29 minutes it actually samples (by exactly 1/29) or seriously underestimate it (if the increase occurs in the 1 minute not included in the sampling). This is even worse if you compute a rate/increase over a range covering fewer samples. E.g. if your range only covers 5 samples on average, the overestimate will be 25%, i.e. 1 / (5 - 1), and (each of) your increases will totally disappear 1 minute out of 5.
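The arithmetic can be checked with a small sketch. It models only the simplified behaviour described here (samples exactly 1m apart, increase between the first and last sample in the window, scaled up to the full window) and is not Prometheus' actual extrapolation code. extrapolated_increase is a hypothetical name.

```python
def extrapolated_increase(samples_in_window, window_minutes):
    """Increase between first and last sample, scaled to the whole window.

    samples_in_window: counter values scraped 1 minute apart, so N samples
    cover only N - 1 minutes of the window.
    """
    covered_minutes = len(samples_in_window) - 1
    seen_increase = samples_in_window[-1] - samples_in_window[0]
    return seen_increase / covered_minutes * window_minutes


# A single jump of 100 that falls between two samples inside a 30m window.
jump_inside = [0] * 10 + [100] * 20
print(extrapolated_increase(jump_inside, 30))    # 100 scaled by 30/29

# The same jump in the 1 minute the samples do not cover: it vanishes.
jump_missed = [100] * 30
print(extrapolated_increase(jump_missed, 30))

# A window covering only 5 samples: 4 covered minutes scaled up to 5.
short_window = [0, 0, 100, 100, 100]
print(extrapolated_increase(short_window, 5))
```

The fewer samples a window covers, the larger both the overestimate and the chance of missing a jump entirely.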
The only way I've found to work around this limitation is (again, assuming a scrape_interval of 1m) to reverse engineer Prometheus' extrapolation:
increase(request_duration_bucket[31m]) / 31 * 30
But this requires you to be aware of your scrape_interval and adjust for it and is very brittle (if you ever change your scrape_interval all your careful tweaking goes to hell).
Or, if you are OK with your increase falling to zero every time an instance is restarted:
clamp_min(request_duration_bucket - request_duration_bucket offset 30m, 0)
I do actually have a proposed patch to Prometheus to add xrate/xincrease functions that actually behave more as you would expect them to (and as described above) but it doesn't look very likely to be accepted: https://github.com/prometheus/prometheus/issues/3806

Calculating Mbps in Prometheus from cumulative total

I have a metric in Prometheus called unifi_devices_wireless_received_bytes_total, it represents the cumulative total amount of bytes a wireless device has received. I'd like to convert this to the download speed in Mbps (or even MBps to start).
I've tried:
rate(unifi_devices_wireless_received_bytes_total[5m])
Which I think is saying: "please give me the rate of bytes received per second", over the last 5 minutes, based on the documentation of rate, here.
But I don't understand what "over the last 5 minutes" means in this context.
In short, how can I determine the Mbps based on this cumulative amount of bytes metric? This is ultimately to display in a Grafana graph.
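For MBps you want rate(unifi_devices_wireless_received_bytes_total[5m]) / 1000 / 1000. For Mbps, multiply by 8 to convert bytes to bits: rate(unifi_devices_wireless_received_bytes_total[5m]) * 8 / 1000 / 1000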
But I don't understand what "over the last 5 minutes" means in this context.
It's the average over the last 5 minutes.
The rate() function returns the average per-second increase rate for the counter passed to it. The average rate is calculated over the lookbehind window passed in square brackets to rate().
For example, rate(unifi_devices_wireless_received_bytes_total[5m]) calculates the average per-second increase rate over the last 5 minutes. It returns a lower than expected rate when 100MB of data is transferred in 10 seconds, because it divides those 100MB by 5 minutes and returns the average data transfer speed as 100MB/5minutes = 333KB/s instead of 10MB/s.
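The averaging effect and the unit conversion can be worked through in a few lines. The function names are hypothetical and decimal units (1 MB = 1,000,000 bytes) are assumed.

```python
def average_rate_bytes_per_s(bytes_transferred, window_s):
    """Average per-second rate over a lookbehind window."""
    return bytes_transferred / window_s


def to_megabits_per_s(bytes_per_s):
    """Convert bytes/s to Mbps: 8 bits per byte, 1,000,000 bits per megabit."""
    return bytes_per_s * 8 / 1_000_000


burst_bytes = 100_000_000                                  # 100MB sent in 10 seconds

over_5m = average_rate_bytes_per_s(burst_bytes, 5 * 60)    # about 333KB/s
over_10s = average_rate_bytes_per_s(burst_bytes, 10)       # 10MB/s

print(round(over_5m), over_10s)
print(to_megabits_per_s(over_5m), to_megabits_per_s(over_10s))
```

The same burst reads as roughly 2.7 Mbps over a 5 minute window and 80 Mbps over a 10 second window, so the window size decides how much of a spike survives in the graph.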
Unfortunately, using 10s as a lookbehind window doesn't work as expected: it is likely that rate(unifi_devices_wireless_received_bytes_total[10s]) would return nothing. This is because rate() in Prometheus expects at least two raw samples in the lookbehind window. This means that new samples must be written into Prometheus at least every 5 seconds for a [10s] lookbehind window. The solution is to use the irate() function instead of rate():
irate(unifi_devices_wireless_received_bytes_total[5m])
It is likely that this query would return a data transfer rate closer to the expected 10MB/s if the interval between raw samples (aka scrape_interval) is lower than 10 seconds.
Unfortunately, using the irate() function isn't recommended in the general case, since it tends to return jumpy results when refreshing graphs on big time ranges. Read this article for details.
So the ultimate solution is to use the rollup_rate function from VictoriaMetrics, the project I work on. It reliably detects spikes in counter rates by returning the minimum, maximum and average per-second increase rate across all the raw samples on the selected time range.