What criteria does Grafana's alertmanager use to start evaluating a rule?

I have a data source that ingests new data within the first 30 minutes of every hour. The rules need to run so that once there is new data, they evaluate and fire if the threshold is exceeded; so, roughly, at the 45th minute of every hour.
We have not been able to figure out how to do that. Also, on what basis / database column does Grafana decide when to start evaluating? I went through the Grafana Postgres database; it has the tables alert_rule and alert_instance, among others.
alert_rule has a column called updated. Is that the basis?
alert_instance has a column called last_eval_time. How is this time decided?
Grafana version: 9.2.2
Current configuration: Evaluate every 1h for 0s.
They are all firing around the 31st minute of the hour. I want to understand on what basis this is happening.
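One way to cross-check this from the database side is to dump the most recent rows of the two tables mentioned above next to the observed firing times. A minimal sketch, assuming direct access to the Grafana Postgres database (the connection DSN is a placeholder):

# Hedged sketch: look at alert_rule.updated and alert_instance.last_eval_time
# (the columns named above) for the most recently evaluated rules.
import psycopg2

with psycopg2.connect("dbname=grafana") as conn, conn.cursor() as cur:  # DSN is an assumption
    cur.execute("SELECT * FROM alert_rule ORDER BY updated DESC LIMIT 5;")
    print(cur.fetchall())
    cur.execute("SELECT * FROM alert_instance ORDER BY last_eval_time DESC LIMIT 5;")
    print(cur.fetchall())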
Also, if there is a data point that got populated at the 25th minute, and the rule fired at the 31st minute, will this new data point be part of the calculation?
How does Grafana behave when there is no data point available for a defined time window? For example, consider an alert rule configured to compare two different time windows: if the data for one of the windows is not found in the data source, does Grafana look for the last available data point and use that? We have been observing some inconsistencies around this. Our understanding is that in this case the rule should not fire.
Thanks!

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, so that we can have a real-time view of the sensors' measurements.
This, however, is very heavy on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. An average value for every 5 minutes would be perfectly fine when consulting the history of previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see whether PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
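One hedged sketch of such a cron-driven job is shown below; the table and column names (sensor_raw, sensor_5min, ts, sensor_id, value) are hypothetical placeholders rather than the actual Thingsboard schema:

# Hedged sketch of a periodic "aggregate then delete" job meant to be run from cron.
# All table/column names are hypothetical; adapt them to the real schema.
import psycopg2

AGGREGATE_SQL = """
WITH moved AS (
    DELETE FROM sensor_raw
    WHERE ts < now()::date                -- keep today's raw 1-second data
    RETURNING sensor_id, ts, value
)
INSERT INTO sensor_5min (sensor_id, bucket_start, avg_value)
SELECT sensor_id,
       date_trunc('hour', ts) + floor(extract(minute FROM ts) / 5) * interval '5 minutes',
       avg(value)
FROM moved
GROUP BY 1, 2;
"""

def aggregate_and_prune(dsn: str = "dbname=thingsboard") -> None:
    # Single transaction: the raw rows are either averaged and deleted, or left untouched.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(AGGREGATE_SQL)

if __name__ == "__main__":
    aggregate_and_prune()

Because psycopg2's connection context manager commits on success and rolls back on error, the delete and the insert either both happen or neither does.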

Query most recent values for active & stale Prometheus timeseries over range query

I am recording and monitoring SLOs (server-side request duration) of Kubernetes Pods via Prometheus using a HistogramVec within a Golang HTTP server. Every request’s duration is timed and persisted as described in the Prometheus practices and partitioned by status code, method and HTTP path.
I am running autoscaling experiments, so Pods are created and terminated. After each experiment I fetch the metrics for all Pods (including the ones already deleted) and plot a cumulative distribution.
In order to make these plots more “accurate”, I opted for many smaller histogram buckets, and I aggregate and analyze the data locally rather than using the built-in histogram quantiles. The ideal query would therefore return only the most recent value for all time series that have existed over a specified time range (the green and red circles in those plots).
Currently, I am using a range query within the script generating all the plots, e.g.:
http://localhost:9090/api/v1/query_range?query=http_request_duration_milliseconds_bucket{path="/service/login"}&start=1591803898&end=1591804801&step=5s
However, I am aware that this is highly inefficient and costly, as it retrieves a huge amount of surplus data even though I am only interested in the very last value of each individual time series. On the other hand, if I use an instant query, I only get the values for a specified moment, so I'd need to shoot multiple queries and first find out when some time series (the red circles) were marked stale, which doesn't seem great either.
So, basically I'm looking for a way to work around the Prometheus basics on staleness, and stop stale time series from "disappearing":
If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.
I am almost certain that there is a way to do this (e.g. an option to simply include stale time series), but I haven’t been able to put it together so far.
The solution is to use the last_over_time() function. For example, the following query returns the last values seen during the last hour for each histogram bucket:
last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])
This query must be sent to /api/v1/query instead of /api/v1/query_range, since /api/v1/query evaluates the query only once, at the given time timestamp, while /api/v1/query_range evaluates the query 1+(end-start)/step times, at every point on the time range [start ... end] with interval step.
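As a hedged illustration of sending that query to /api/v1/query from a script (the address, metric and label values are the ones already used in the question; the rest is plumbing):

# Hedged sketch: ask the instant-query endpoint for the last value seen in the
# past hour for every bucket series, instead of pulling a full range of samples.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # address from the question
QUERY = 'last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])'

resp = requests.get(PROM_URL, params={"query": QUERY, "time": 1591804801})  # optional evaluation timestamp
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts, value = series["value"]          # one (timestamp, value) pair per series
    print(series["metric"].get("le"), value)

Each element of data.result carries exactly one sample, i.e. the last value seen within the 1h window for that bucket series.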
Note also that a big number of histogram buckets multiplied by a big number of unique path label values may result in too many time series, which is known as high cardinality. See this article for more details.
See also VictoriaMetrics histograms, which solve common issues with Prometheus histograms.
I found another way to do this, following the input in this thread as well as increasing the lookbackDelta QuerySpec option.
Now, shooting queries such as
http://localhost:9090/api/v1/query?query=max_over_time(http_request_duration_milliseconds_bucket{path="/service/login",le="101"}[2h])
returns the desired result.

Grafana data shown by hour

I'm using Grafana and I want to see which hours are better for performing operations. So, I want to sum the requests and show the number of requests per hour over, let's say, the last week. I mean: how many requests there were from 9:00 to 10:00 regardless of the day of the last week (and the same for every hour).
My backend is Elasticsearch, but I can gather the information from Prometheus too.
Does anyone know any way to display this data?
The Grafana version I'm using is 7.0.3.
EDIT
I found a possible solution by adding the plugin for hourly heatmaps.
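If the Prometheus backend is used instead, the same per-hour totals can also be computed outside Grafana. A minimal sketch, assuming a request counter named http_requests_total (that metric name is an assumption, not something from the question):

# Hedged sketch: pull a week of hourly request increases from Prometheus and
# total them per hour of day, regardless of which day they happened on.
from collections import Counter
from datetime import datetime, timedelta, timezone
import requests

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)
resp = requests.get(
    "http://localhost:9090/api/v1/query_range",
    params={
        "query": "sum(increase(http_requests_total[1h]))",  # assumed metric name
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "1h",
    },
)
resp.raise_for_status()

per_hour = Counter()
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        hour = datetime.fromtimestamp(float(ts), tz=timezone.utc).hour
        per_hour[hour] += float(value)

for hour in sorted(per_hour):
    print(f"{hour:02d}:00-{(hour + 1) % 24:02d}:00  {per_hour[hour]:.0f} requests")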

Prometheus: aggregate table data from offset, i.e. pull historical data from 2 weeks ago to present

So I am constructing a table within Grafana with Prometheus as a data source. Right now, my queries are set to instant, and thus the table shows scrape data from the instant the query is made (in my case, data from the past 2 days).
However, I want to see data from the past 14 days. I know that you can adjust the time shift in Grafana as well as use the offset <timerange> modifier to shift the moment when the query is run, but these only adjust the query execution point.
Using a range vector such as go_info[10h] does indeed go back that far; however, the scrapes are done at 15-second intervals and as such produce duplicate data, in addition to producing results for a query done at that instant (and not at an offset time point), which I don't want.
I am wondering if there's a way to gather data from two weeks ago until today, essentially aggregating data from multiple offset time points.
I've tried writing multiple queries on my table to perform this, e.g.:
go_info offset 2d
go_info offset 3d
and so on.
However, this doesn't seem very efficient, and the values from each query end up in different columns (a problem I could probably alleviate by altering the query, but that doesn't solve the issue of query complexity).
Is there a more efficient, simpler way to do this? I understand that the latest version of Prometheus offers subqueries as a feature, but I am currently not able to upgrade Prometheus (at least not easily, with the way it's currently set up) and am also not sure it would solve my problem. If it is indeed the answer to my question, it'll be worth the upgrade; I just haven't had the environment to test it out.
Thanks for any help anyone can provide :)
Figured it out; it's not pretty, but I had to use offset <#>d for each query on a single metric.
e.g.:
something_metric offset 1d
something_metric offset 2d
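For reference, the same one-offset-query-per-day idea can also be scripted outside Grafana. A hedged sketch, using the go_info metric from the question (flattening everything into one per-day list is an assumption about what the table should contain):

# Hedged sketch of the workaround above: 14 instant queries, one per day of offset,
# collected into a single flat list instead of one Grafana column per query.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"

rows = []
for days_back in range(14):
    offset = f" offset {days_back}d" if days_back else ""
    resp = requests.get(PROM_URL, params={"query": f"go_info{offset}"})
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        ts, value = series["value"]
        rows.append((days_back, series["metric"].get("instance"), value))

for days_back, instance, value in rows:
    print(f"{days_back:2d} days ago  {instance}  {value}")

Keeping one row per (day, series) pair sidesteps the problem of each query's values landing in a different column.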

How to ensure Spark does not read the same data twice from Cassandra

I'm learning Spark and Cassandra. My problem is as follows.
I have a Cassandra table which records row data from a sensor:
CREATE TABLE statistics.sensor_row (
    name text,
    date timestamp,
    value int,
    PRIMARY KEY (name, date)
)
Now I want to aggregate these rows through a Spark batch job (i.e. daily).
So I could write
val rdd = sc.cassandraTable("statistics","sensor_row")
// and do map and reduce to get what I want, and perhaps write back to an aggregated table.
But my problem is that I will be running this code periodically. I need to make sure I don't read the same data twice.
One thing I can do is delete the rows that I have read, which looks pretty ugly, or use a filter:
sensorRowRDD.where("date >'2016-02-05 07:32:23+0000'")
The second one looks much nicer, but then I need to record when the job was last run and continue from there. However, according to DataStax driver data locality, each worker will load data only from its local Cassandra node, which means that instead of tracking a global date, I need to track a date for each Cassandra/Spark node. That still does not look very elegant.
Is there any better ways of doing this ?
The DataFrame filters will be pushed down to Cassandra, so this is an efficient solution to the problem. But you are right to worry about the consistency issue.
One solution is to set not just a start date, but an end date also. When your job starts, it looks at the clock. It is 2016-02-05 12:00. Perhaps you have a few minutes delay in collecting late-arriving data, and the clocks are not absolutely precise either. You decide to use 10 minutes of delay and set your end time to 2016-02-05 11:50. You record this in a file/database. The end time of the previous run was 2016-02-04 11:48. So your filter is date > '2016-02-04 11:48' and date < '2016-02-05 11:50'.
Because the date ranges cover all time, you will only miss events that have been saved into a past range after the range has been processed. You can increase the delay from 10 minutes if this happens too often.
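As a hedged sketch of that bookkeeping (persisting the watermark in a local file is an assumption; any store the Spark driver can read before launching the job would work):

# Hedged sketch: each run processes (previous_end, now - delay] and records the
# new end timestamp only after the batch job has finished successfully.
from datetime import datetime, timedelta, timezone
from pathlib import Path

WATERMARK_FILE = Path("last_end.txt")   # hypothetical location of the stored end time
DELAY = timedelta(minutes=10)           # allowance for late-arriving data

def next_window() -> tuple[datetime, datetime]:
    if WATERMARK_FILE.exists():
        start = datetime.fromisoformat(WATERMARK_FILE.read_text().strip())
    else:
        start = datetime.now(timezone.utc) - timedelta(days=1)  # arbitrary first window
    return start, datetime.now(timezone.utc) - DELAY

def commit_window(end: datetime) -> None:
    # Only called after the batch job for (start, end] has finished successfully.
    WATERMARK_FILE.write_text(end.isoformat())

if __name__ == "__main__":
    start, end = next_window()
    # The Spark job would apply both bounds, e.g. date > start AND date <= end.
    print(f"process rows with {start.isoformat()} < date <= {end.isoformat()}")
    commit_window(end)

Applying both bounds in the Cassandra filter means consecutive runs never overlap, and increasing DELAY covers data that arrives later than expected.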