What is the best way to represent a chart of distribution of time intervals in Datadog? - charts

I have a server that processes packets from different devices. Devices can report in different intervals.
I would like to make a chart showing the distribution of intervals by the count of devices (how many devices are reporting within 5 sec/10 sec/60 sec ...)
Intervals for each device can vary.
Currently I'm sending a Set metric keyed on deviceId, with tags that represent the interval (5 sec, 10 sec, 30 sec, and more), but I'm not sure that is correct.
What is the best way to implement this?

Set is almost never the right custom metric type to use. It sends a count of the number of unique items for a given tag. The underlying item details are stripped from the metric, meaning that from one time slice to the next, you have no idea what the actual true number of unique items is over time.
For example:
3:00:07-3:00:32 | 5 second bucket:[device1,device4,device7] -> 3 values
3:00:32-3:00:47 | 5 second bucket:[device1,device3] -> 2 values
Your time series in Datadog will report 3, and then 2. But because the underlying device info is stripped, there is no way to combine that 3 and 2 if you zoom out in time and roll up the numbers to show 1 data point per minute. It could be any number from 3 to 5, but the Datadog backend has no idea (even though we know that across those 30 seconds there were 4 unique values total).
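A minimal Python sketch of why the two counts cannot be rolled up, using the device IDs from the example. The variable names are made up for illustration; this is not how Datadog stores anything internally.

```python
# Device IDs taken from the two example time slices.
slice_a = {"device1", "device4", "device7"}
slice_b = {"device1", "device3"}

# What a Set metric reports: only the unique count per slice.
reported = [len(slice_a), len(slice_b)]

# The tightest bounds anyone can derive from the counts alone.
lower_bound = max(reported)  # the smaller slice is fully contained in the larger one
upper_bound = sum(reported)  # the two slices share no device at all

# The true answer needs the IDs themselves, which are no longer available.
true_unique = len(slice_a | slice_b)

print(reported, lower_bound, upper_bound, true_unique)
```

With only `reported` available, any value between `lower_bound` and `upper_bound` is consistent with the data.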
Plus, even if it were somehow accurate, you couldn't create a useful alert on it or notify anyone, because you wouldn't know which device is having issues if you saw a spike of devices in the 60 second bucket.
So let's go through other metric options.
The only metric types that are usually worth using are distributions, gauges, and counts.
A gauge metric is just a measurement of a value at a point in time. It's usually good for things like the CPU or memory of a computer, or the temperature in a room: numbers for which it is impossible to collect every data point, so you just take a measurement every 10 seconds, or every minute, or however often you need to get an idea of the behavior.
A count metric is more exact, it's the number of things that happened. Usually good for number of requests to a server, or number of files processed. Even something like the amount of bytes flowing through something, although that usually is treated like a gauge by most people.
Distributions are good for when you want to create a gauge metric, but you need detailed measurements for every single event that happens. For example, a web server is handling hundreds of requests per second and we need to know the latency metrics of that server. It's not possible to send a latency metric for every request as a gauge. Gauges have a built-in limit of 1 data point per second (in Datadog); anything more sent in a 1 second interval gets dropped. But we need stats for every request, so a distribution summarizes the data: it keeps a running count, min, max, average, and optionally several percentiles (p50, p75, p99).
I haven't seen many good use cases for metric types outside of those 3. For your scenario, it seems like you would want to be sending a distribution metric for that device interval. So device 1 sends a value of 10.14 and device 3 sends a value of 2.3 and so on.
Then you can use a distribution widget in a dashboard to show the number of devices for each interval bucket.
Of course, make sure you tag each metric with the device that generates it.
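To make the idea concrete, here is a rough Python sketch of the bucketing that a per-device interval value makes possible. It only illustrates the concept; it is not Datadog's implementation, and the bucket bounds, function names, and `device7` are hypothetical. Submitting the values to Datadog is left out.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in seconds; choose what suits your devices.
BUCKETS = [5, 10, 30, 60]

def bucket_label(interval_s):
    """Return the label of the first bucket whose upper bound covers interval_s."""
    i = bisect_left(BUCKETS, interval_s)
    return f"<={BUCKETS[i]}s" if i < len(BUCKETS) else f">{BUCKETS[-1]}s"

def devices_per_bucket(interval_by_device):
    """Count devices per bucket from a {device_id: interval_seconds} mapping."""
    return Counter(bucket_label(v) for v in interval_by_device.values())

# 10.14 and 2.3 are the values mentioned in the answer; device7 is made up.
sample = {"device1": 10.14, "device3": 2.3, "device7": 4.9}
counts = devices_per_bucket(sample)
print(dict(counts))
```

Because each value is the raw interval rather than a pre-assigned bucket tag, the bucket boundaries can be changed later without changing what the devices report.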

Related

How to persist previous data point when time range doesn't include a data point

TL;DR:
Can I get Grafana to show me the previous data point when the currently selected time period does not have a data point? I have an example which sounds ridiculous, but at least it's simple to understand: I send data every 1 minute, and I wish to zoom into the last 30 seconds and still see data. You may ask "why not just zoom out to 2 minutes", but the reason is that other data on the same graph is updated more often, and I wish to compare with that data. There are also the lengthier reasons below.
If not, how can I achieve what I want to achieve, see below?
Context
For a few years, I have been monitoring the water level in three of our basement sumps (which have pumps installed) by sending this data from Node-RED to InfluxDB, then visualising the sump levels in Grafana. I have set up three waterproof ultrasonic distance sensors, each pointed down a pipe that is inserted vertically into each sump. The water fills the pipe and the distance sensor, connected to an Arduino, sends me the reading. The Arduino also has other sensors connected (temp / humidity) and deals with distance calibrations to calculate the percent full of each sump. All this data is sent to Node-RED. In total, I am sending 4 values per sump: distance measurement in mm, percent full, temp, humidity. So that's 12 fields. Data is sent every 2 seconds, because I wished to have a reasonably high resolution to see nice curves in graphs.
Also I decided to store all this data so that I could later troubleshoot issues (we have had sewage floods resulting in water not being able to be pumped away, etc...) and design some warning systems for these issues based on data.
Storing 12 values for every 2 seconds, over the course of a number of years, takes up a lot of space (8GB).
Nature of the data
Storing this resolution of data has also helped me be able to describe the nature of the data. I will do so here.
(1) Non-meaningful NOISE (see below) - the percent-full reading goes up and down by 1 or 2 percent every couple of seconds:
(2) Meaningful DRIFT (see below) - I don't mean sensor drift, I am referring to actual water levels changing slowly over time, e.g. over 1 day or 1 week. Perhaps condensation on the walls drips down into the sump, or water evaporates from the sump, and the value can waver by a few percent over the course of a day. Each sump has slightly different characteristics.
(3) Meaningful MONITORING DATA - during wet weather, depending on rainfall amount, the sumps fill up over the course of say 30 mins to 3 hours. Then the pumps run and the water level drops again, wavers a bit, then the sumps continue to fill up. If the rain stopped, you can see a lovely curve as the water fills in progressively more slowly (see the green line below):
Solution to downsample
I know Influx has its own downsampling possibilities, however because of the nature of the data (which can hardly vary for 2 months but when it does, I really need to capture it in detail), I don't think lowering the sample rate is a great idea.
I have some understanding of digital filters (e.g. low pass etc) but have never programmed one myself. So I have written a basic filter in javascript (a Node-RED function) to filter the data in realtime as follows: only send each reading when it has changed from the previous one by x amount. (And update the previous one, when that occurs.)
This has already vastly reduced the amount of data being stored, and I can vary x to filter out noise shown in my first graph above, at the expense of resolution when the pumps run. Even if I set the x value to 2, it still vastly reduces data over long periods of dry weather.
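For readers who want to see the filter spelled out, here is a minimal Python sketch of the "only send when changed by x" idea. It is not the author's Node-RED function: the class name is made up, and it assumes "changed by x" means an absolute difference of at least x from the last value that was actually sent.

```python
class DeadbandFilter:
    """Pass a reading only when it differs enough from the last one passed."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None  # nothing has been sent yet

    def accept(self, value):
        # The first reading always passes; later ones must move by >= threshold.
        if self.last_sent is None or abs(value - self.last_sent) >= self.threshold:
            self.last_sent = value  # update the reference only when we send
            return True
        return False

# Made-up percent-full readings with 1% noise and a slow rise.
readings = [50, 51, 50, 52, 53, 55, 55, 54]
f = DeadbandFilter(threshold=2)
sent = [r for r in readings if f.accept(r)]
print(sent)
```

Note that the comparison is against the last sent value, not the last measured one, so slow drift still gets through once it accumulates to the threshold.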
So, on to my problem! Data is now not logged to InfluxDB unless there is some meaningful change, which means that when I zoom in to e.g. a 15-minute timeframe, there is nothing to see.
Grafana does have the option of "fill (previous)", but this draws a line between points on the existing graph, rather than showing the previous value as if it hasn't changed since that point. Now my Grafana dashboard looks a bit sad :(
One proposed solution is, in addition to sending "delta" data, send "summary" data, that is - send a full suite of data every 1 minute regardless of whether data changed or not. But then we get noise back again, and pointless storage.
Any other ideas?

what is the default grafana setting for $__rate_interval

I understand that rate(xyz[5m]) * 60 is the rate of xyz per minute, averaged over 5 mins.
How then would $__rate_interval and $__interval be defined,
possibly in the same syntax?
What format is rate being measured here, in my panel? Per minute, per second?
What is the interval= 30s in my panel here? My scraping interval is set to 5s.
How do I change the rate format?
See New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work.
Rate is always per second. See Grafana documentation for the rate function.
Click on Query options, then click on the Info-Symbol. An explanation will be displayed.
To get rate per minute, just multiply the rate with 60.
Edit: ($__rate_interval and $__interval)
Prometheus periodically fetches data from your application. Grafana periodically fetches data from Prometheus. Grafana does not know how often Prometheus polls your application for data. Grafana will estimate this time by looking at the data. The $__interval variable expands to the duration between two data points in the graph. (Note that this is only true for small time ranges and high resolution, as the intended use case for $__interval is reducing the number of data points when the time range is wide. See Approximate Calculation of $__interval.)
If the time distance between every two data points in each series is 15 seconds, it does not make sense to use anything less than [15s] as the interval in the rate function. The rate function works best with at least 4 data points. Therefore [1m] would be much better than anything between [15s] and [1m]. This is what $__rate_interval tries to achieve: guessing a minimal sensible interval for the rate function.
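As a rough illustration, the Grafana documentation describes $__rate_interval as the larger of "$__interval plus the scrape interval" and "four times the scrape interval". The Python sketch below just evaluates that formula; the function and parameter names are made up, and `scrape_interval_s` stands for whatever scrape interval Grafana is working with for the query.

```python
def rate_interval(interval_s, scrape_interval_s=15):
    """Evaluate max($__interval + scrape interval, 4 * scrape interval) in seconds."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)

# Zoomed in: 15s between points, 15s scrape interval.
zoomed_in = rate_interval(15, 15)

# Zoomed out: $__interval has grown to 5 minutes.
zoomed_out = rate_interval(300, 15)

print(zoomed_in, zoomed_out)
```

The zoomed-in case lands on the [1m] window discussed above, since four samples at 15 seconds span one minute.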
Personally, I think this does not always work if your application delivers sparse data. I prefer using fixed intervals like 10m or even 1h or 1d in these situations. The interval needs to be large enough to give the rate function enough data points to work with.
A different approach would be to use either $__rate_interval or $__interval but also set the Min step parameter for the query in the Grafana UI to be big enough.

prometheus: is it possible to use event number of gauge as a counter?

I use Prometheus to monitor an API service. Currently, I use a Counter to count the number of requests received and a Gauge for the response time in milliseconds.
I've tried to use something like count_over_time(response_time_ms[1m]) to count requests during a time range. However, every point in the result has a value of 10.
Why doesn't this work?
count_over_time(response_time_ms[1m]) will tell you the number of samples, not the number of times your Gauge was updated within (what I assume to be) a Java process. Based on the value of 10 you're seeing, I'm assuming your scrape interval is 6 seconds.
For an explanation of why this doesn't work as you would expect it, a Gauge is simply a Java object wrapping a double value. Every time you set its value, that value changes, but nothing more. There's no count of how many times the value changed or any notification sent to Prometheus that this happened. Prometheus simply polls every 6 seconds and collects whatever value was there at the time (never the wiser that the value changed 15 times since the last time it was collected). This is why gauges are intended to measure single values that go up and down (such as memory utilization: it's now 645 MB, in 6 seconds it's 648 MB, in 12 seconds 543 MB): you know the value constantly changes, but the best you can do is sample it every now and then.
For something like request latency, you should use a Histogram: it's basically a counter for the number of observations (i.e. number of requests); a counter for the sum of all observations (i.e. how long all requests put together took); and separate counters for each bucket (i.e. how many requests took less than 1 ms; how many requests took less than 10 ms; etc.). From this you can get an accurate average over any multiple of your scrape interval (i.e. change in total time divided by change in number of requests) as well as estimates for any percentile (including the median). How precise said percentiles are depends on the bucket sizes you choose (and how well they actually match the actual measurements).
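The structure described above can be sketched in a few lines of Python. This is a toy model, not the Prometheus client library: the class, its fields, and the bucket bounds are made up, and the buckets are exposed cumulatively (each bucket counts observations at or below its upper bound, with a final catch-all bucket).

```python
from bisect import bisect_left
from itertools import accumulate

class ToyHistogram:
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One slot per bound plus a final catch-all (+Inf) slot.
        self._per_bucket = [0] * (len(self.upper_bounds) + 1)
        self.count = 0    # number of observations
        self.total = 0.0  # sum of all observed values

    def observe(self, value):
        self.count += 1
        self.total += value
        self._per_bucket[bisect_left(self.upper_bounds, value)] += 1

    def cumulative_buckets(self):
        """Counts of observations <= each bound, ending with the overall count."""
        return list(accumulate(self._per_bucket))

h = ToyHistogram([1, 10, 100])  # bounds in milliseconds
for ms in (0.5, 3, 7):
    h.observe(ms)
first = (h.count, h.total)      # what one scrape would see

for ms in (20, 40):
    h.observe(ms)
second = (h.count, h.total)     # what the next scrape would see

# Average latency between the two scrapes: change in sum / change in count.
avg_between = (second[1] - first[1]) / (second[0] - first[0])
print(h.cumulative_buckets(), avg_between)
```

Every field only ever grows, which is why the values at two scrapes can be subtracted to get exact figures for the time in between.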
Or, if all you're interested in is the number of requests, then a counter that's incremented on every request will be enough. To adjust for counter resets (e.g. job restarts), you should use increase() rather than a simple difference between two counter values:
increase(number_of_requests_total[1m])
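To show why the reset handling matters, here is a simplified Python sketch of a reset-aware increase over a list of scraped counter values. It only captures the reset adjustment; the real increase() also extrapolates to the edges of the time window, which is left out here. The function name and sample values are made up.

```python
def reset_aware_increase(samples):
    """Sum the growth between consecutive samples, treating a drop as a restart."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The counter restarted from zero, so everything in cur is new.
            total += cur
    return total

# Made-up scrapes; the job restarts between 160 and 5.
samples = [100, 130, 160, 5, 25]

naive = samples[-1] - samples[0]
adjusted = reset_aware_increase(samples)
print(naive, adjusted)
```

The naive difference goes negative across the restart, while the adjusted value keeps counting.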
If you want to count the number of requests in some specific time from now (the last 1m in this case), just use
number_of_requests_counter - number_of_requests_counter offset 1m
If you want something like requests per second, then use
rate(number_of_requests_counter[1m])
I can tell you why it's not working with your Gauge, but first of all, specify what you assign to this metric. I mean, do you assign some average, the last response time, or something else?
For response time you should use a Summary or Histogram (more info here)

reset chart to 0 in grafana

Below is a chart I have in grafana:
My problem is that if my chosen time range is, say, 5 minutes, the graph won't show only what happened in the last 5 minutes. In the picture, nothing happened in the past 5 minutes, so it's just showing the last points it has. How can I change this so that it goes back to zero if nothing has changed? I'm using a Prometheus counter for this, if that is relevant.
As explained in the Prometheus documentation, a counter value in itself is not of much use. It depends on when your job was last restarted and everything that happened since.
What's interesting about a counter is how much it changed over some period of time. I.e. either the average rate of change per second (e.g. 3 queries per second) or the increase over some time range (e.g. 10K queries in the last hour).
So instead of graphing something like http_requests, you should graph rate(http_requests[1m]) (the average number of requests per second over the previous 1 minute) or increase(http_requests[1h]) (the total number of requests over the past hour). You can play with the range size until you get something which makes sense for your data. But make sure to use a range at least 2x your scrape interval (and ideally more, as Prometheus is somewhat daft in the way it computes rates/increases).

Alert in RAM/CPU Usage Detection in e-Commerce Server

Currently I'm building monitoring services for my e-commerce server, which mostly focus on CPU/RAM usage. It is essentially anomaly detection on time series data.
My approach is to build an LSTM neural network to predict the next CPU/RAM value from the chart trend and compare it with the STD (standard deviation) value multiplied by some number (currently 10).
But in real-life conditions, it depends on many different conditions, such as:
1- Maintenance time (during this time an "anomaly" is not an "anomaly")
2- Sales time on day-off events, holidays, etc., when an increase in RAM/CPU usage is normal, of course
3- If the percentage decreases in CPU/RAM are the same over 3 observations (5 mins, 10 mins & 15 mins) -> anomaly. But if it decreased 50% over 5 mins and did not change much over 10 mins (-5% ~ +5%) -> not an "anomaly".
Currently I detect anomalies with a formula like this:
isAlert = (Diff5m >= 10 && Diff10m >= 15 && Diff30m >= 40)
where Diff is the percentage difference in absolute value.
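For reference, the rule can be written as a small Python function. The thresholds come from the formula; `pct_diff` is a hypothetical helper that assumes the percentage is taken relative to the older reading, which the question does not specify.

```python
def pct_diff(old, new):
    """Absolute percentage difference, relative to the older reading (assumption).

    The older reading must be non-zero.
    """
    return abs(new - old) / old * 100

def is_alert(diff_5m, diff_10m, diff_30m):
    """Alert only when all three windows exceed their thresholds."""
    return diff_5m >= 10 and diff_10m >= 15 and diff_30m >= 40

# A sustained change across all windows triggers the alert.
sustained = is_alert(12, 20, 45)

# A sharp 5-minute change that does not persist over 10 minutes does not.
short_spike = is_alert(50, 5, 45)

print(sustained, short_spike)
```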
Unfortunately I don't save my "pure" data for building the neural network; for example, when it detects an anomaly, I modified it so that it is not an anomaly anymore.
I would like to add some attributes to my model's input, such as isMaintenance, isPromotion, isHoliday, etc., but sometimes it leads to overfitting.
I also want my NN to adjust its baseline over time, for example when my service becomes more popular.
Are there any hints for these aims?
Thanks
I would say that an anomaly is an unusual outcome, i.e. an outcome that's not expected given the inputs. As you've figured out, there are a few variables that are expected to influence CPU and RAM usage. So why not feed those to the network? That's the whole point of Machine Learning. Your network will make a prediction of CPU usage, taking into account the sales volume, whether there is (or was) a maintenance window, etc.
Note that you probably don't need an isPromotion input if you include actual sales volumes. The former is a discrete input, and only captures a fraction of the information present in the totalSales input.
Machine Learning definitely needs data. If you threw that away, you'll have to restart capturing it. As for adjusting the baseline, you can achieve that by overweighting recent input data.
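One possible way to overweight recent data, sketched in Python, is to give each training sample a weight that decays with its age. The exponential form, the function name, and the 30-day half-life are assumptions for illustration, not something the answer prescribes.

```python
def recency_weights(ages_days, half_life_days=30.0):
    """Weight 1.0 for a brand-new sample, halving every half_life_days."""
    return [0.5 ** (age / half_life_days) for age in ages_days]

# Samples that are 0, 30 and 60 days old.
weights = recency_weights([0, 30, 60])
print(weights)
```

Many training APIs accept per-sample weights of this kind, so the baseline shifts toward recent behavior without discarding older data outright.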