How do you represent response time and requests-per-second in one measure?

I have a graph of response times for a number of endpoints on a web service. For some endpoints, a high response time can be misleading, because if the requests-per-second is low, the impact is also low. Thus, I always have to check a corresponding requests-per-second graph to understand the true impact.
How can I represent response time and requests-per-second in one measure? I'd like to avoid plotting response time and requests-per-second as two series on one graph, because I want a single measure per endpoint so that I can compare multiple endpoints on one graph.
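One way to collapse the two into a single number (a sketch, not something from the question) is the total server time spent per second on each endpoint, i.e. mean response time × requests-per-second: it is only large when an endpoint is both slow enough and busy enough to matter. A minimal Python sketch, with hypothetical endpoint names and numbers:

# Impact measure: seconds of server time consumed per wall-clock second,
# i.e. mean latency multiplied by request rate. All values are made up.
endpoints = {
    # name: (mean response time in seconds, requests per second)
    "/search":   (0.800, 2.0),    # slow, but rarely called
    "/login":    (0.120, 50.0),   # fast, but busy
    "/homepage": (0.040, 400.0),  # very fast, very busy
}
for name, (latency_s, rps) in endpoints.items():
    impact = latency_s * rps
    print(f"{name:10s} impact = {impact:5.1f} s/s")

Plotted over time, this gives one line per endpoint on a single graph, and a slow-but-idle endpoint no longer looks alarming.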

Related

Athena/DDB to condense millions of data points for plotting them on a graph

I need to plot trend charts in a React app based on user inputs such as timestamps, devices, etc. I have related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take averages of the readings in each bin. However, I'm concerned about how well it will preserve the drops and highs; we need to show those accurately.
Athena queries and DDB queries (due to the 1 MB limit) both seem fairly slow so far.
Of course, the size of the response payload is another concern, as API Gateway and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?
I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval and create more than one table. For example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in DynamoDB. At some point, we decided to switch to Amazon Timestream.
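For illustration, here is a minimal binning sketch in Python (the input format is hypothetical) that keeps the min and max of each bucket alongside the average, which addresses the concern about drops and highs disappearing in the rollup:

# Fixed-interval binning: one output point per bucket, carrying the average
# for the trend plus min/max so spikes and drops survive the downsampling.
BUCKET_SECONDS = 3600  # fixed 1-hour interval, as suggested above

def bin_points(points):  # points: iterable of (unix_timestamp, reading)
    buckets = {}
    for ts, value in points:
        key = ts - ts % BUCKET_SECONDS  # start of the bucket this point falls in
        buckets.setdefault(key, []).append(value)
    return [(key, sum(vals) / len(vals), min(vals), max(vals))
            for key, vals in sorted(buckets.items())]

points = [(1700000000 + i * 5, float(i % 500)) for i in range(100000)]  # fake 5s readings
for ts, avg, lo, hi in bin_points(points)[:3]:
    print(ts, round(avg, 1), lo, hi)

Plotting the min/max as a band around the average keeps the extremes visible without returning millions of raw points.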

What is the default Grafana setting for $__rate_interval?

I understand that rate(xyz[5m]) * 60 is the rate of xyz per minute, averaged over 5 mins.
How then would $__rate_interval and $__interval be defined, possibly in the same syntax?
In what format is the rate being measured here, in my panel? Per minute, per second?
What is the interval = 30s in my panel here? My scrape interval is set to 5s.
How do I change the rate format?
See New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work.
Rate is always per second. See the Prometheus documentation for the rate function.
Click on Query options, then click on the Info-Symbol. An explanation will be displayed.
To get rate per minute, just multiply the rate with 60.
Edit: ($__rate_interval and $__interval)
Prometheus periodically fetches data from your application. Grafana periodically fetches data from Prometheus. Grafana does not know how often Prometheus polls your application for data, so it estimates this time by looking at the data. The $__interval variable expands to the duration between two data points in the graph. (Note that this is only true for small time ranges and high resolution, as the intended use case for $__interval is reducing the number of data points when the time range is wide. See Approximate Calculation of $__interval.)
If the time-distance between every two data points in each series is 15 seconds, it does not make sense to use anything less than [15s] as the interval in the rate function. The rate function works best with at least 4 data points, so [1m] would be much better than anything between [15s] and [1m]. This is what $__rate_interval tries to achieve: guessing a minimal sensible interval for the rate function.
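For reference, the blog post linked above defines $__rate_interval as max($__interval + scrape interval, 4 × scrape interval), where the scrape interval is taken from the Min step setting. A small sketch of that calculation (values in seconds):

# How Grafana derives $__rate_interval: never shorter than four scrape
# intervals, and at least one scrape interval longer than $__interval.
def rate_interval(interval_s, scrape_interval_s):
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)

print(rate_interval(30, 5))   # 35 -> with a 5s scrape and a 30s $__interval
print(rate_interval(15, 15))  # 60 -> the floor of 4 scrape intervals wins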
Personally, I think this does not always work if your application delivers sparse data. I prefer using fixed intervals like 10m or even 1h or 1d in these situations. The interval needs to be large enough to give the rate function enough data points to work with.
A different approach would be to use either $__rate_interval or $__interval but also set the Min step parameter for the query in the Grafana UI to be big enough.

Prometheus query quantile of pod memory usage performance

I'd like to get the 0.95 percentile memory usage of my pods from the last x amount of time. However, this query starts to take too long if I use a 'big' (7/10d) range.
The query that I'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
It takes around 100s to complete.
(I removed the extra namespace filters for brevity.)
What steps could I take to make this query more performant (other than making the machine bigger)?
I thought about calculating the 0.95 percentile every x time (let's say 30min), labelling it p95_memory_usage, and using p95_memory_usage instead of container_memory_usage_bytes in the query, so that I can reduce the number of points the query has to go through.
However, would this not distort the values?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics) although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
  labels:
    le: "100000.0"
  expr: |
    # 1 if the current sample is at or below the bucket bound, else 0,
    (container_memory_usage_bytes <= bool 100000.0)
    +
    # added to the previous cumulative bucket count (or 0 on the first evaluation).
    (
      container_memory_usage_bytes_bucket{le="100000.0"}
      or ignoring(le)
      container_memory_usage_bytes * 0
    )
Repeat for all the bucket sizes you're interested in, and add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of the container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower-resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
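To get a feel for that precision tradeoff, here is a small sketch with synthetic data (a 30s scrape interval is assumed) comparing the 0.95 quantile over all raw samples with the same quantile over one sample per hour:

# Compare p95 over full-resolution samples vs. hourly downsampled samples.
# The memory values are synthetic; the sample counts are the point here.
import random

random.seed(1)
SCRAPE_SECONDS = 30
samples = [random.gauss(4e8, 5e7) for _ in range(10 * 24 * 3600 // SCRAPE_SECONDS)]

def q95(values):
    ordered = sorted(values)
    return ordered[int(0.95 * (len(ordered) - 1))]

hourly = samples[::3600 // SCRAPE_SECONDS]  # keep one sample per hour
print(len(samples), q95(samples))  # 28800 samples over ~10 days
print(len(hourly), q95(hourly))    # 240 samples, roughly the same p95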
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series over the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (i.e. a range query instead of an instant query), then the number of raw samples returned by sum(count_over_time(...)) must be multiplied by the number of points on the Grafana graph, since Prometheus executes the quantile_over_time(...) individually for each point on the displayed graph. Grafana usually requests around 1000 points for building a smooth graph, so the number returned by sum(count_over_time(...)) must be multiplied by 1000 in order to estimate the number of raw samples Prometheus needs to process for building the quantile_over_time(...) graph. See more details in this article.
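As a back-of-the-envelope sketch (both numbers are hypothetical):

# Estimate of raw samples processed for one Grafana graph of the query.
raw_samples = 50_000_000  # result of sum(count_over_time(...[10d]))
graph_points = 1000       # points Grafana typically requests per graph
print(raw_samples * graph_points)  # 50 billion samples for the whole graph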
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process.
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try other Prometheus-compatible systems, which may process heavy queries faster; for example, VictoriaMetrics.

Averaged Historical Data from Xively feed API

The Xively (Cosm) web interface issues the following request for averaged historical datapoints:
// For averaged historical datapoints
https://www.xively.com/feeds/<feedId>/datastreams/Humidity/graph.json&duration=21600seconds&interval=30&limit=1000&find_previous=true&function=average
I would like to fetch averaged historical data points (that is, if there are multiple samples within the interval I am asking for, return an averaged rollup as the representative point of the interval) using the Xively REST API.
However, this seems to return the raw data points (they just pick one datapoint to represent the sample interval):
https://api.xively.com/v2/feeds/127181539.json?datastreams=TEMP&duration=1month&interval=21600&limit=200&function=average
So, questions:
1) How can I return averaged data points like the Xively web interface does? What parameter is needed for the feed API call?
2) Does anyone know about the parameter interval_type? I have read what is here (https://xively.com/dev/docs/api/quick_reference/historical_data/) about 50 times now but I still don't get it!
Update
function=sum as well as function=average works for the /datastreams/TEMP.json endpoint. Also, they are discrete by default.
function=average does not work with the /feeds/feed_id.json endpoint. Maybe a bug?
If you've got "function=average" (which you have) as a query parameter, then the points you get back should be bucketed to the interval you specified (21600 seconds / 6 hours). Each point represents the average value for that period.
It might be worth making this query against the datastreams endpoint though, e.g.
https://api.xively.com/v2/feeds/127181539/datastreams/TEMP.json?duration=1month&interval=21600&limit=200&function=average
Hope this helps!
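For completeness, here is a minimal Python sketch of that request, assuming the usual v2 authentication via an X-ApiKey header and the usual datapoints/at/value response shape; the key is a placeholder:

# Fetch 6-hour averaged datapoints from the datastreams endpoint.
import requests

url = "https://api.xively.com/v2/feeds/127181539/datastreams/TEMP.json"
params = {
    "duration": "1month",   # total time range to cover
    "interval": 21600,      # bucket size in seconds (6 hours)
    "limit": 200,           # maximum number of datapoints returned
    "function": "average",  # average all samples within each bucket
}
resp = requests.get(url, headers={"X-ApiKey": "YOUR_API_KEY"}, params=params)
resp.raise_for_status()
for point in resp.json().get("datapoints", []):
    print(point["at"], point["value"])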

Benchmarking Memcached Server

I am trying to benchmark a memcached server. The results produced for TCP traffic are in terms of the number of requests, number of hits, number of misses, number of gets, number of sets, delay time, etc. I am confused about how to produce a throughput measure from them.
I suggest running a lot of experiments at different loads and drawing a graph of response time vs. requests-per-second.
Typically you will get a graph that looks like the one at the top of this paper by Hart et al., with an obvious "knee" showing that if you apply too much load, the response time suddenly gets much worse.
You could consider the requests-per-second at this knee to be the throughput of your memcached system.
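A minimal sketch of both steps, with made-up benchmark numbers (the counter names in your tool will differ):

# Throughput per run = completed requests / elapsed time; the knee is where
# response time starts growing much faster than throughput.
runs = [
    # (total requests completed, elapsed seconds, mean delay in ms)
    (60_000, 60.0, 1.2),
    (300_000, 60.0, 1.4),
    (598_000, 60.0, 2.1),
    (1_130_000, 60.0, 9.8),  # delay explodes: past the knee
]
rates = [(done / elapsed, delay) for done, elapsed, delay in runs]
for rps, delay in rates:
    print(f"{rps:9.0f} req/s  {delay:4.1f} ms")
# Crude knee detection: the first run where delay more than doubles.
for (prev_rps, prev_delay), (_, delay) in zip(rates, rates[1:]):
    if delay > 2 * prev_delay:
        print(f"knee near {prev_rps:.0f} req/s -> call this the throughput")
        break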