Query most recent values for active & stale Prometheus timeseries over range query - kubernetes

I am recording and monitoring SLOs (server-side request duration) of Kubernetes Pods via Prometheus using a HistogramVec within a Golang HTTP server. Every request’s duration is timed and persisted as described in the Prometheus practices and partitioned by status code, method and HTTP path.
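For context, the instrumentation looks roughly like the following sketch (it uses the prometheus/client_golang library; the metric name matches the queries below, but the bucket layout, label handling and handler wiring are simplified assumptions rather than my exact code):

package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of server-side request durations, partitioned by status code, method and HTTP path.
var requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "http_request_duration_milliseconds",
	Help:    "Server-side request duration in milliseconds.",
	Buckets: prometheus.LinearBuckets(1, 25, 40), // many small buckets (placeholder layout)
}, []string{"code", "method", "path"})

// instrument times each request and records the duration in the histogram.
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		ms := float64(time.Since(start)) / float64(time.Millisecond)
		// A real handler would capture the actual status code via a wrapping ResponseWriter.
		requestDuration.WithLabelValues(strconv.Itoa(http.StatusOK), r.Method, path).Observe(ms)
	}
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/service/login", instrument("/service/login", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	http.ListenAndServe(":8080", nil)
}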
I am running autoscaling experiments, so Pods are created and terminated. After each experiment I fetch the metrics for all Pods (including the ones already deleted) and plot a cumulative distribution of the request durations.
To make these plots more “accurate”, I opted for many smaller histogram buckets, and I aggregate and analyze the data locally instead of using the built-in histogram quantiles. The ideal query would therefore return only the most recent value for every time series that has existed over a specified time range, i.e. both the still-active series and the ones that have already gone stale.
Currently, I am using a range query within the script generating all the plots, e.g.:
http://localhost:9090/api/v1/query_range?query=http_request_duration_milliseconds_bucket{path="/service/login"}&start=1591803898&end=1591804801&step=5s
However, I am aware that this is highly inefficient and costly, as it retrieves a huge amount of surplus data even though I am only interested in the very last value of each individual time series. On the other hand, if I use an instant query, I only get the values for a specified moment, so I’d need to send multiple queries and first find out when the time series of terminated Pods were marked stale - which doesn’t seem great either.
So, basically I'm looking for a way to work around the Prometheus basics on staleness, and stop stale time series from "disappearing":
If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.
I am almost certain that there is a way to do this (e.g. an option to simply include stale time series), but I haven’t been able to put it together so far.

The solution is to use the last_over_time() function. For example, the following query returns the last values seen during the last hour for each histogram bucket:
last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])
This query must be sent to /api/v1/query instead of /api/v1/query_range, since /api/v1/query evaluates the query only once, at the given timestamp (the time parameter), while /api/v1/query_range evaluates the query 1+(end-start)/step times, at every point of the time range [start ... end] with interval step.
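A minimal sketch of calling the instant-query endpoint from a script (Go here because the server is written in Go; the Prometheus address and the 1h window are assumptions):

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Instant query: the expression is evaluated once, at "now" unless an explicit "time" parameter is given.
	params := url.Values{}
	params.Set("query", `last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])`)

	resp, err := http.Get("http://localhost:9090/api/v1/query?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// JSON result: one sample per bucket time series seen during the last hour.
	fmt.Println(string(body))
}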
Note also that a large number of histogram buckets multiplied by a large number of unique path label values may result in too many time series, which is known as high cardinality. See this article for more details.
See also VictoriaMetrics histograms, which solve common issues in Prometheus histograms.

I found another way to do this, following the input in this thread as well as increasing the lookbackDelta QuerySpec option.
Now sending queries such as
http://localhost:9090/api/v1/query?query=max_over_time(http_request_duration_milliseconds_bucket{path="/service/login",le="101"}[2h])
returns the desired result.
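In case it helps anyone: as far as I know, on a plain Prometheus server this lookback delta corresponds to the --query.lookback-delta startup flag (the prometheus-operator exposes it as spec.query.lookbackDelta on the Prometheus resource), e.g.:
prometheus --config.file=prometheus.yml --query.lookback-delta=2h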

Related

What are the criteria used by Grafana's alertmanager to start evaluating a rule?

I have a data source that ingests new data within the first 30 minutes of every hour. The rules need to run such that once there is new data, they evaluate and fire if the threshold is exceeded - so roughly at the 45th minute of every hour.
We are not able to figure out how to do that. Also, on what basis / database column does Grafana decide when to start evaluating? I went through the Grafana Postgres database; it has the tables alert_rule and alert_instance, among others.
alert_rule has a column called updated. Is that the basis?
alert_instance has a column called last_eval_time. How is this time decided?
Grafana version: 9.2.2
Current configuration: Evaluate every 1h for 0s.
They are all firing around the 31st minute of the hour. I want to understand on what basis this is happening.
Also, if there is a data point that got populated at the 25th minute, and the rule fired at the 31st minute, will this new data point be part of the calculation?
How does Grafana behave when there is no data point available for a defined time window? For example, consider an alert rule configured to compare two different time windows: if the data for one of the time windows is not found in the data source, does Grafana look for the last available data point and pick that? We have been observing some inconsistencies around this. Our understanding is that in this case the rule should not fire.
Thanks!

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, so that we can have a real-time view of the sensor measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history of previous days.
Question:
This raises the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL has anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
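To make the cron-job idea concrete, here is a rough sketch of the rollup step in Go (the table and column names sensor_readings, sensor_readings_5m, sensor_id, ts and value are hypothetical; Thingsboard's actual schema differs, so this only illustrates the pattern):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

// rollup aggregates per-second samples older than the current day into
// 5-minute averages and then deletes the raw rows, all in one transaction.
func rollup(db *sql.DB) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Insert 5-minute averages for everything older than the current day.
	if _, err := tx.Exec(`
		INSERT INTO sensor_readings_5m (sensor_id, bucket_start, avg_value)
		SELECT sensor_id,
		       date_trunc('hour', ts) + floor(extract(minute FROM ts) / 5) * interval '5 minutes',
		       avg(value)
		FROM sensor_readings
		WHERE ts < date_trunc('day', now())
		GROUP BY 1, 2`); err != nil {
		return err
	}

	// Remove the raw rows that were just aggregated.
	if _, err := tx.Exec(`DELETE FROM sensor_readings WHERE ts < date_trunc('day', now())`); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/thingsboard?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := rollup(db); err != nil {
		log.Fatal(err)
	}
}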

how to remove None (null) values from graphite result

I'm trying to show a time series that has a lot of missing data points as a table in Grafana, and I get a lot of "-" results that correlate with nulls.
Is there a way to tell graphite to not return data points with a value of null?
The problem may be in small time slots in the Graphite datastore and large gaps between neighboring data points due to rare updates.
As a workaround, you may apply the movingAverage(10min) function in the query, where 10min is the update interval of the metric.
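Applied in the query editor it would look something like this (the metric path is a placeholder):
movingAverage(sensors.room1.temperature, '10min')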

prometheus aggregate table data from offset; ie pull historical data from 2 weeks ago to present

So I am constructing a table within Grafana with Prometheus as a data source. Right now, my queries are set to instant, and thus the table shows scrape data from the instant the query is made (in my case, data from the past 2 days).
However, I want to see data from the past 14 days. I know that you can adjust the time shift in Grafana as well as use the offset <timerange> modifier to shift the moment when the query is run, however these only adjust the query execution point.
Using a range vector such as go_info[10h] does indeed go back that range, but the scrapes are done in 15s intervals and as such produce duplicate data, in addition to producing results for a query done in that instant
(and not at an offset timepoint), which I don't want.
I am wondering if there's a way to gather data from two weeks ago until today, essentially aggregating data from multiple offset time points.
I've tried writing multiple queries on my table to perform this, e.g.:
go_info offset 2d
go_info offset 3d
and so on..
However, this doesn't seem very efficient, and the values from each query end up in different columns (a problem I could probably alleviate by altering the query, but that doesn't solve the issue of query complexity).
Is there a more efficient, simpler way to do this? I understand that the latest version of Prometheus offers subqueries as a feature, but I am currently not able to upgrade Prometheus (at least not in a simple manner, with the way it's currently set up) and am also not sure it would solve my problem. If it is indeed the answer to my question, it'll be worth the upgrade; I just haven't had the environment to test it out.
Thanks for any help anyone can provide :)
Figured it out;
it's not pretty, but I had to use offset <#>d for each query on a single metric.
e.g.:
something_metric offset 1d
something_metric offset 2d
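For reference, on a Prometheus version with subquery support the same idea could be written as a single expression, e.g. (a sketch I haven't been able to test against this setup):
avg_over_time(go_info[1h])[14d:1d]
which evaluates the inner expression once per day over the last 14 days instead of requiring one offset query per day.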

Core Reporting API Total results found

I want to return a large result-set of Google Analytics data across a two month period.
However, the total results found is not accurate or what I expect.
If I narrow the start-date and end-date to a particular day, it returns roughly 40k results, which over a two-month period should add up to about 2.4 million records. However, the total results found reported by the API suggests 350k.
There is some discrepancy and the numbers do not add up when selecting a larger date range. I can confirm there is no gap in ga data over the two month period.
Would be great if someone has come across this issue and has found a reason for it.
In your query you need to supply a sampling level:
samplingLevel=DEFAULT Optional.
Use this parameter to set the sampling level (i.e. the number of visits used to calculate the result) for a reporting query. The allowed values are consistent with the web interface and include:
• DEFAULT — Returns response with a sample size that balances speed and accuracy.
• FASTER — Returns a fast response with a smaller sample size.
• HIGHER_PRECISION — Returns a more accurate response using a large sample size, but this may result in the response being slower.
If not supplied, the DEFAULT sampling level will be used.
There is no way to completely remove sampling; large requests will still return sampled data even if you have set it to HIGHER_PRECISION. Make your requests smaller and go day by day if you have to.
If you pay for a premium Google Analytics account, you can extract your data into BigQuery and you will have access to unsampled reports.
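For example, a v3 Core Reporting API request with the parameter set looks roughly like this (the view ID, dates and metric are placeholders):
https://www.googleapis.com/analytics/v3/data/ga?ids=ga:12345678&start-date=2016-01-01&end-date=2016-02-29&metrics=ga:sessions&samplingLevel=HIGHER_PRECISION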