Prometheus/Grafana: plot wait time percentiles for jobs

I have a job scheduling engine which can run jobs on various machines. I have a queue of pending jobs coming in as a stream (usually at least 10s of thousands of jobs waiting for execution). I have an algorithm to execute jobs on different machines.
One of the core metrics to track is how long after a job is requested it gets scheduled for execution (usually less than 5 minutes, but it can take up to 1 hour for various reasons).
Is there a way to plot percentiles of how long the currently unassigned jobs have been waiting, using Prometheus + Grafana (or a mix of Prometheus and other solutions like Redis)? I want to know the median, 95th and 99th percentile waiting times for the jobs.
The issue is that until a job gets scheduled for execution, no event is generated, and the longer we wait, the higher the bucket the job moves into. Furthermore, since jobs can take very different times to get scheduled (not every job is the same), simply relying on how long the past few jobs took to get scheduled is wrong.
One simple way would be to iterate over all pending jobs and compute the percentiles continuously, but that would be very expensive.

The Prometheus histogram implementations assume a fixed set of buckets (e.g. less than 1 second, less than 2 seconds, less than 5 seconds etc.) which may only be incremented (together with all buckets above them).
In your case, you have 2 options:
Record the duration each job has been queued for in the histogram. The problems with this approach are that (a) you would have to keep "moving" every job up the histogram as time goes on; and (b) you can't remove a job from the histogram once it is processed (because of the monotonicity requirement).
Record the time when each job was added into a histogram (e.g. records added before 1 minute past the hour, records added before 2 minutes past the hour etc.). The problem here is that your histogram is not static in size and will grow indefinitely (assuming your Prometheus client allowed it in the first place).
You are thus left with a couple of alternatives:
Iterate over your queue and create a fresh histogram (or compute directly the percentiles you're interested in) every time you're scraped by Prometheus; a sketch of this approach follows after this list. Tens of thousands of jobs to iterate over doesn't sound all that bad; it should take milliseconds. You could even replace the data structure you use for your queue with e.g. a binary search tree, which would make it easy to figure out the exact percentiles you're interested in, in logarithmic time.
Give up on recording queuing times for pending jobs and only do it for processed jobs. Every time a job is processed, you increment a histogram. It doesn't get simpler than that.
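A minimal sketch of the first alternative, using the Prometheus Java client (io.prometheus:simpleclient); the metric name, the way the pending enqueue timestamps are obtained, and the percentile helper are all assumptions, not part of your setup:

import io.prometheus.client.Collector;
import io.prometheus.client.GaugeMetricFamily;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class QueueWaitCollector extends Collector {

    // Hypothetical supplier of the enqueue timestamps (epoch millis) of all pending jobs.
    private final java.util.function.Supplier<List<Long>> pendingEnqueueTimes;

    public QueueWaitCollector(java.util.function.Supplier<List<Long>> pendingEnqueueTimes) {
        this.pendingEnqueueTimes = pendingEnqueueTimes;
    }

    @Override
    public List<MetricFamilySamples> collect() {
        // Runs on every Prometheus scrape: compute wait times of the currently pending jobs.
        long now = System.currentTimeMillis();
        List<Double> waits = new ArrayList<>();
        for (long enqueuedAt : pendingEnqueueTimes.get()) {
            waits.add((now - enqueuedAt) / 1000.0);
        }
        Collections.sort(waits);

        GaugeMetricFamily family = new GaugeMetricFamily(
                "job_queue_wait_seconds",
                "Wait time percentiles of currently pending jobs",
                Arrays.asList("quantile"));
        family.addMetric(Arrays.asList("0.5"), percentile(waits, 0.5));
        family.addMetric(Arrays.asList("0.95"), percentile(waits, 0.95));
        family.addMetric(Arrays.asList("0.99"), percentile(waits, 0.99));
        return Collections.singletonList(family);
    }

    // Nearest-rank percentile of an already sorted list.
    private static double percentile(List<Double> sorted, double q) {
        if (sorted.isEmpty()) return 0.0;
        int idx = (int) Math.ceil(q * sorted.size()) - 1;
        return sorted.get(Math.max(0, Math.min(idx, sorted.size() - 1)));
    }
}

Register it once at startup, e.g. new QueueWaitCollector(queue::enqueueTimestampsSnapshot).register(); (the snapshot method is hypothetical), and each scrape then exposes job_queue_wait_seconds{quantile="0.5"} etc., which Grafana can plot directly.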

Related

What period is P99 or P95 calculated over?

When committing to/setting SLAs for a service, what time period should the SLA be calculated over?
For example, if I wanted all the services in my organization to commit to P95 latency, and one of the services commits to 500ms, what is the time window? The P95 will be different depending on the time window we look at.
It depends on the cycles in which your latency fluctuates.
No daily / hourly peaks? A couple thousand samples do just fine.
Daily fluctuations (e.g. peak usage, concurrent backups etc.)? Then you will need to measure at least a whole day.
Weekly fluctuations (e.g. tied to work hours or evening activities etc.)? Then you will need to sample over a full week.
There is no strict requirement to sample everything over the chosen time window, but your time window had better be representative, or you may be held liable. Also make sure to be fair when you under-sample.
If you want to be on the safe side, take the worst-case scenario in your load cycle, and within that scenario take a full minute's worth of samples. That gives you a good estimate of what will be held against you.
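As a concrete illustration, if latency ends up in a Prometheus-style histogram (the metric name below is made up), the chosen window appears directly in the query's range selector, and the two expressions can give quite different P95 values:

histogram_quantile(0.95, rate(request_latency_seconds_bucket[1h]))
histogram_quantile(0.95, rate(request_latency_seconds_bucket[1d]))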

What is the best way to represent a chart of distribution of time intervals in Datadog?

I have a server that processes packets from different devices. Devices can report in different intervals.
I would like to make a chart showing the distribution of intervals by the count of devices (how many devices are reporting within 5 sec/10 sec/60 sec ...)
Intervals for each device can vary.
Currently I'm sending a metric of the Set type, with deviceId as the value and tags that represent the interval (5 sec, 10 sec, 30 sec, and more), but I'm not sure that this is correct.
What is the best way to implement this?
Set is almost never the right custom metric type to use. It sends a count of the number of unique items per given tag, and the details of the underlying items are stripped from the metric, meaning that from one time slice to the next you have no idea of the true number of unique items over time.
For example
3:00:07-3:00:32 | 5 second bucket:[device1,device4,device7] -> 3 values
3:00:32-3:00:47 | 5 second bucket:[device1,device3] -> 2 values
Your time series sent to Datadog will report 3, and then 2. But because the underlying device info is stripped, you have no idea how to combine that 3 and 2 if you zoom out in time and roll up the numbers to show 1 data point per minute. It could be any number from 3 to 5, but the Datadog backend has no idea (even though we know that across those 40 seconds there were 4 unique devices in total).
Plus, even if it were accurate somehow, you can't build a useful alert or notification on it, because you won't know which device is having issues if you see a spike of devices in the 60 second bucket.
So let's go through other metric options.
The only metric types that are usually worth using are distributions, gauges, or counts.
A gauge metric is just a measurement of a value at a point in time. It's usually good for things like the CPU or memory of a computer, or the temperature in a room: numbers that are impossible to collect every data point for, so you just take measurements every 10 seconds, or every minute, or however often you need to, to get an idea of the behavior.
A count metric is more exact: it's the number of things that happened. It's usually good for the number of requests to a server, or the number of files processed. Even something like the number of bytes flowing through something, although that is usually treated like a gauge by most people.
Distributions are good for when you want to create a gauge metric, but you need detailed measurements for every single event that happens. For example, a web server is handling hundreds of requests per second and we need to know the latency metrics of that server. It's not possible to send a latency metric for every request as a gauge: gauges have a built-in limit of 1 data point per second (in Datadog), and anything more sent in a 1 second interval gets dropped. But we need stats for every request, so a distribution summarizes the data: it keeps a running count, min, max, average, and optionally several percentiles (p50, p75, p99).
I haven't seen many good use cases for metric types outside of those 3. For your scenario, it seems like you would want to be sending a distribution metric for that device interval. So device 1 sends a value of 10.14 and device 3 sends a value of 2.3 and so on.
Then you can use a distribution widget in a dashboard to show the number of devices for each interval bucket.
Of course make sure you tag each metric by the device that is generating the metric.
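If the server is a Java process, a minimal sketch with the java-dogstatsd-client could look like the following; the metric name, tag format and client settings are assumptions, not something your setup prescribes:

import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class DeviceIntervalReporter {

    // DogStatsD agent address/port are assumed defaults.
    private static final StatsDClient STATSD = new NonBlockingStatsDClientBuilder()
            .prefix("packets")
            .hostname("localhost")
            .port(8125)
            .build();

    // Call this whenever a packet arrives, passing the seconds elapsed
    // since the previous packet from the same device.
    public static void reportInterval(String deviceId, double intervalSeconds) {
        STATSD.recordDistributionValue(
                "device.report_interval_seconds",  // distribution metric
                intervalSeconds,
                "device:" + deviceId);             // tag by device
    }
}

A distribution widget (or a query on that distribution metric) then shows how the reporting intervals are spread across devices, and the device tag lets you pinpoint which devices drift into the 60 second range.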

How to visualize cycle times of a periodic process?

I have a periodic backend process and I would like to visualize the history of the length of cycles on my dashboard. Is it possible?
I have full control over the data/metrics I generate, so I could perhaps increment a counter every time a cycle completes (a cycle takes about 3 days), so I would get counter updates every 3 days or so. Then how could I get Grafana to report the length of each cycle? (for instance: 72h; 69h; 74h; etc.) The actual widget doesn't matter, but I need something visual to tell me at once if cycles are getting faster or slower.
Any pointers or ideas are welcome.
It looks like a standard time series: X-axis: time, Y-axis: duration [s].
Then you may add:
trend line
aggregations (min/max/avg/derivation/diff/...)
moving average
other math functions that are available in the datasource you use
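Assuming the dashboard is backed by Prometheus (just one possible datasource), a minimal sketch of how the backend process could record each completed cycle's duration with the Java client; the metric name and the recordCycleEnd helper are illustrative:

import io.prometheus.client.Gauge;

public class CycleMetrics {

    private static final Gauge LAST_CYCLE_DURATION = Gauge.build()
            .name("backend_cycle_duration_seconds")
            .help("Duration of the most recently completed cycle, in seconds")
            .register();

    // Call once when a cycle finishes, passing the timestamp (epoch millis) at which it started.
    public static void recordCycleEnd(long cycleStartMillis) {
        double seconds = (System.currentTimeMillis() - cycleStartMillis) / 1000.0;
        LAST_CYCLE_DURATION.set(seconds);
    }
}

Plotting backend_cycle_duration_seconds as a time series then gives one step per completed cycle (roughly every 3 days), and the trend line or moving average can be layered on top in Grafana.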

AnyLogic mean waiting time in queue

I would like to get the mean waiting time that each unit spends in my queue, for every hour (so between 7-8 am for example 4 minutes, between 8-9 am 10 minutes, and so on). That's my current queue with my time measure blocks. Is there a way to do so?
Create a normal dataset and call it datasetHourly. Deactivate the option Use time as horizontal value. This is where we will store your hourly data.
Create a cyclic event and set its trigger to cyclic, once every hour.
This cyclic event will get the current mean of your time measurement (waiting time + service time in your example) and save this single value in the extra dataset.
We also have to clear the dataset that is built into timeMeasureEnd, in order to get clean statistics again for the next hour interval.
// Add one point per hour: x = current model time in hours, y = mean of the times recorded by the TimeMeasureEnd block
datasetHourly.add(time(HOUR), timeMeasureEnd.dataset.getYMean());
// Reset the TimeMeasureEnd dataset so the next hour starts with fresh statistics
timeMeasureEnd.dataset.reset();
You can now visualise the hourly development by adding datasetHourly to a normal plot.

Prometheus: is it possible to use the event count of a gauge as a counter?

I use Prometheus to monitor an API service. Currently, I use a Counter to count the number of requests received and a Gauge for the response time in milliseconds.
I've tried to use something like count_over_time(response_time_ms[1m]) to count requests over a time range. However, I got a result where each point has a value of 10.
Why doesn't this work?
count_over_time(response_time_ms[1m]) will tell you the number of samples, not the number of times your Gauge was updated within (what I assume to be) a Java process. Based on the value of 10 you're seeing, I'm assuming your scrape interval is 6 seconds.
For an explanation of why this doesn't work as you would expect it, a Gauge is simply a Java object wrapping a double value. Every time you set its value, that value changes, but nothing more. There's no count of how many times the value changed or any notification sent to Prometheus that this happened. Prometheus simply polls every 6 seconds and collects whatever value was there at the time (never the wiser that the value changed 15 times since the last time it was collected). This is why gauges are intended to measure single values that go up and down (such as memory utilization: it's now 645 MB, in 6 seconds it's 648 MB, in 12 seconds 543 MB): you know the value constantly changes, but the best you can do is sample it every now and then.
For something like request latency, you should use a Histogram: it's basically a counter for the number of observations (i.e. number of requests); a counter for the sum of all observations (i.e. how long all requests put together took); and separate counters for each bucket (i.e. how many requests took less than 1 ms; how many requests took less than 10 ms; etc.). From this you can get an accurate average over any multiple of your scrape interval (i.e. change in total time divided by change in number of requests) as well as estimates for any percentile (including the median). How precise said percentiles are depends on the bucket sizes you choose (and how well they actually match the actual measurements).
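A minimal sketch of such a histogram with the Prometheus Java client (io.prometheus:simpleclient); the metric name and bucket boundaries are illustrative and should be tuned to your actual latencies:

import io.prometheus.client.Histogram;

public class ApiMetrics {

    private static final Histogram RESPONSE_TIME = Histogram.build()
            .name("response_time_seconds")
            .help("API response time in seconds")
            .buckets(0.005, 0.01, 0.05, 0.1, 0.5, 1, 5)
            .register();

    public static void handleRequest() {
        Histogram.Timer timer = RESPONSE_TIME.startTimer();
        try {
            // ... handle the request ...
        } finally {
            // Observes the elapsed time into the count, sum and bucket counters.
            timer.observeDuration();
        }
    }
}

From that you can query, for example, rate(response_time_seconds_sum[1m]) / rate(response_time_seconds_count[1m]) for the average latency and histogram_quantile(0.95, rate(response_time_seconds_bucket[1m])) for an estimated 95th percentile.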
Or, if all you're interested in is the number of requests, then a counter that's incremented on every request will be enough. To adjust for counter resets (e.g. job restarts), you should use increase() rather than the simple difference with offset suggested in the answer below:
increase(number_of_requests_total[1m])
If you want to count the number of requests over some specific time window ending now (the last 1m in this case), just use
number_of_requests_counter - number_of_requests_counter offset 1m
If you want something like requests per second, then use
rate(number_of_requests_counter[1m])
I can tell you why it's not working with your Gauge, but first of all, specify what you assign to this metric. I mean, do you assign some average, the last response time, or something else?
For response time you should use a Summary or a Histogram (see the Prometheus documentation on metric types).