prometheus aggregate table data from offset; ie pull historical data from 2 weeks ago to present - grafana

so i am constructing a table within grafana with prometheus as a data source. right now, my queries are set to instant, and thus it's showing scrape data from the instant that the query is made (in my case, shows data from the past 2 days)
however, i want to see data from the past 14 days. i know that you can adjust time shift in grafana as well as use the offset <timerange> command to shift the moment when the query is run, however these only adjust query execution points.
using a range vector such as go_info[10h] does indeed go back that range, however the scrapes are done in 15s intervals and as such produce duplicate data, in addition to producing query results for a query done in that instant (and not an offset timepoint), which I don't want
I am wondering if there's a way to gather data from two weeks ago until today, essentially aggregating data from multiple offset time points.
i've tried writing multiple queries on my table to perform this,
e.g:
go_info offset 2d
go_info offset 3d
and so on..
however this doesn't seem very efficient and the values from each query end up in different columns (a problem i could probably alleviate with an alteration to the query, however that doesn't solve the issue of complexity in queries)
is there a more efficient, simpler way to do this? i understand that the latest version of Prometheus offers subqueries as a feature, but i am currently not able to upgrade Prometheus (at least in a simple manner with the way it's currently set up) and am also not sure it would solve my problem. if it is indeed the answer to my question, it'll be worth the upgrade. i just haven't had the environment to test it out
thanks for any help anyone can provide :)

figured it out;
it's not pretty but i had to use offset <#>d for each query in a single metric.
e.g.:
something_metric offset 1d
something_metric offset 2d

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, such that we can have a real time view of the sensors measurements.
This however is very exhaustive on storage and does not constitute a requirement other than enabling real time monitoring. For example there is no need to check measurements made last week with such granularity (1 sec intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question on how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that would average the deleted data for a given interval. For example I would like to keep raw data (measurements every second) for the present day and aggregated data (average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
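To make the aggregate-then-delete idea concrete, here is a rough sketch of the logic a periodic job would apply. It works on in-memory (timestamp, value) tuples rather than on a real Thingsboard/PostgreSQL schema; the function name, the tuple layout and the 300 second default are all assumptions made for illustration.

```python
from collections import defaultdict


def downsample(rows, cutoff_ts, bucket_seconds=300):
    """Split rows into (raw_to_keep, aggregated).

    rows: iterable of (unix_ts, value) tuples.
    Rows at or after cutoff_ts stay raw. Older rows are averaged per
    bucket_seconds-wide bucket, keyed by the bucket's start timestamp.
    """
    keep = []
    buckets = defaultdict(list)
    for ts, value in rows:
        if ts >= cutoff_ts:
            keep.append((ts, value))
        else:
            # Round the timestamp down to the start of its bucket.
            bucket_start = ts - (ts % bucket_seconds)
            buckets[bucket_start].append(value)
    aggregated = [
        (start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())
    ]
    return keep, aggregated
```

In the database itself, the same two steps (insert the averages, delete the raw rows they came from) would presumably run inside one transaction so that no data is lost if the job fails halfway.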

Query most recent values for active & stale Prometheus timeseries over range query

I am recording and monitoring SLOs (server-side request duration) of Kubernetes Pods via Prometheus using a HistogramVec within a Golang HTTP server. Every request’s duration is timed and persisted as described in the Prometheus practices and partitioned by status code, method and HTTP path.
I am running autoscaling experiments therefore Pods are created & terminated. After each experiment I fetch the metrics for all pods (including the ones already deleted) and plot a cumulative distribution, e.g.:
In order to make these plots more “accurate”, I opted for many, smaller histogram buckets and aggregate & analyze the data locally and do not use the built-in Histogram Quantiles. The ideal query would therefore return only the most recent value for all time series that have existed over a specified time range (green + red circles).
Currently, I am using a range query within the script generating all the plots, e.g.:
http://localhost:9090/api/v1/query_range?query=http_request_duration_milliseconds_bucket{path="/service/login"}&start=1591803898&end=1591804801&step=5s
However, I am aware that this is highly inefficient and costly as it retrieves a huge amount of surplus data even though I am only interested in the very last value for each individual time series. On the other hand, if I use an instant query, I only get the values for a specified moment, thus I’d need to shoot multiple queries & first find out when some time series (red circles) were marked stale - which doesn’t seem great either.
So, basically I'm looking for a way to work around the Prometheus basics on staleness, and stop stale time series from "disappearing":
If no sample is found (by default) 5 minutes before a sampling timestamp, no value is returned for that time series at this point in time. This effectively means that time series "disappear" from graphs at times where their latest collected sample is older than 5 minutes or after they are marked stale.
I am almost certain that there is a way to do this (e.g. an option to simply include stale time series), but I haven’t been able to put it together so far.
The solution is to use last_over_time() function. For example, the following query returns the last values seen during the last hour per each histogram bucket:
last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])
This query must be sent to /api/v1/query instead of /api/v1/query_range, since /api/v1/query evaluates the query only once, at the timestamp given by the time parameter, while /api/v1/query_range evaluates the query 1+(end-start)/step times, once at every point on the time range [start ... end] with interval step.
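To put numbers on the difference, here is a small sketch. Both helpers are hypothetical names introduced for illustration; the only things taken from Prometheus are the /api/v1/query path and its query and time parameters. The division is rounded down, on the assumption that a final partial step adds no extra evaluation point.

```python
from urllib.parse import urlencode


def range_query_evaluations(start: int, end: int, step: int) -> int:
    """Number of points at which /api/v1/query_range evaluates the query."""
    if step <= 0 or end < start:
        raise ValueError("need step > 0 and end >= start")
    return 1 + (end - start) // step


def instant_query_url(base: str, expr: str, time: int) -> str:
    """Build an /api/v1/query URL that evaluates expr once, at `time`."""
    return f"{base}/api/v1/query?" + urlencode({"query": expr, "time": time})


if __name__ == "__main__":
    # Values from the range query in the question.
    print(range_query_evaluations(1591803898, 1591804801, 5))
    print(instant_query_url(
        "http://localhost:9090",
        'last_over_time(http_request_duration_milliseconds_bucket{path="/service/login"}[1h])',
        1591804801,
    ))
```

For the range query in the question (start=1591803898, end=1591804801, step=5s) this gives 181 evaluations per series, versus a single evaluation for the instant query.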
Note also that a large number of histogram buckets multiplied by a large number of unique path label values may result in too many time series, which is known as high cardinality. See this article for more details.
See also VictoriaMetrics histograms, which solve common issues in Prometheus histograms.
Found another way to do this following the input in this thread as well as increasing the lookbackDelta QuerySpec option.
Now shooting queries such as
http://localhost:9090/api/v1/query?query=max_over_time(http_request_duration_milliseconds_bucket{path="/service/login",le="101"}[2h])
return the desired result:

Data retention in timescaledb

Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRD tool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5 minute resolution for 90 days, but after that, it's pointless to keep all those data points, and I'd like to reduce it to 30 or 60 minute averages for a couple years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's an Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach for down-sampling your data are Continuous Aggregates (their doc here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates will be affected by the deletions of the drop_chunks, too. Your data is gone.
One workaround would be to create other Hypertables instead. Then, create your own background workers copying the data from the original, hi-res table to these new lo-res Hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
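If you do end up writing your own background worker, the tiering rule from the question could be expressed roughly like this. It is a sketch only: the cut-offs are assumptions filled in from the question's wording (2 years for "a couple years", 60 minutes rather than 30), and none of the names come from timescaledb.

```python
DAY = 86_400  # seconds

# (maximum age in seconds, bucket width in seconds); None means "no upper bound".
TIERS = [
    (90 * DAY, 5 * 60),        # up to 90 days old: 5-minute resolution
    (2 * 365 * DAY, 60 * 60),  # up to roughly 2 years old: hourly averages
    (None, DAY),               # anything older: daily averages
]


def bucket_width_for_age(age_seconds: int) -> int:
    """Return the averaging bucket width (seconds) for data of the given age."""
    for max_age, width in TIERS:
        if max_age is None or age_seconds <= max_age:
            return width
    raise AssertionError("unreachable: the last tier has no upper bound")
```

A worker would call this for each chunk of old data to decide which lo-res Hypertable it belongs in.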

Implement interval analysis on top of PostgreSQL

I have a couple of million entries in a table with start and end timestamps. I want to implement an analysis tool which determines unique entries for a specific interval. Let's say between yesterday and 2 months before yesterday.
Depending on the interval the queries take between a couple of seconds and 30 minutes. How would I implement an analysis tool for a web front-end which would allow to quite quickly query this data, similar to Google Analytics.
I was thinking of moving the data into Redis and do something clever with interval and sorted sets etc. but I was wondering if there's something in PostgreSQL which would allow to execute aggregated queries, re-use old queries, so that for instance, after querying the first couple of days it does not start from scratch again when looking at different interval.
If not, what should I do? Export the data to something like Apache Spark or Dynamo DB and analysis in there to fill Redis for retrieving it quicker?
Either will do.
Aggregation is a basic task they all can do, and your data is small enough to fit into main memory. So you don't even need a database (but the aggregation functions of a database may still be better implemented than if you rewrite them, and SQL is quite convenient to use).
Just do it. Give it a try.
P.S. make sure to enable data indexing, and choose the right data types. Maybe check query plans, too.
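As a starting point for the "just try it in memory" route, a minimal sketch follows. It assumes that "unique entries for an interval" means entries whose [start, end] range overlaps the queried interval, and that rows are (entry_id, start, end) tuples; both are guesses about the question's data, and the function names are made up.

```python
def overlaps(entry_start, entry_end, query_start, query_end):
    """True if the closed intervals [entry_start, entry_end] and
    [query_start, query_end] share at least one point."""
    return entry_start <= query_end and entry_end >= query_start


def unique_entries(entries, query_start, query_end):
    """entries: iterable of (entry_id, start, end).

    Returns the set of ids whose interval overlaps the query interval.
    """
    return {
        eid
        for eid, start, end in entries
        if overlaps(start, end, query_start, query_end)
    }
```

The same overlap condition (start <= query_end AND end >= query_start) is what an index on the start and end columns would need to support if you stay in PostgreSQL.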

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should return back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns are numerical values for which we need to aggregate on (avg, stddev, and use R to calculate statistics)
a few columns are "type_id" columns, since in a particular query we may
want to only include certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof-of-concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real time querying (even though I'm ok with returning results back to users in 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive or Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy weight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of AB (the datasets A and B combined) by doing
sqrt(E[AB^2]-(E[AB])^2), where E[AB^2] is (sum(A^2) + sum(B^2))/(|A|+|B|) and E[AB] is (sum(A) + sum(B))/(|A|+|B|)
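The formula above means each dataset only needs three precomputed numbers per column: the count, the sum and the sum of squares. A minimal sketch of that idea (the function names are made up, and this gives the population stddev, not the sample stddev):

```python
import math


def summarize(values):
    """Return (count, sum, sum of squares) for one dataset's column."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(*summaries):
    """Merge per-dataset summaries by adding them component-wise."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    return (n, total, total_sq)


def mean_and_stddev(summary):
    """Population mean and stddev via sqrt(E[X^2] - (E[X])^2)."""
    n, total, total_sq = summary
    mean = total / n
    # Clamp at zero: rounding can push the difference slightly negative.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)
```

Usage would be mean_and_stddev(combine(summarize(a), summarize(b))) for whichever datasets the user picked. One caveat: subtracting two large, nearly equal numbers can lose precision, so this form is best suited to values of moderate magnitude.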
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open source space. In the commercial space, MPP databases like greenplum/netezza should do.
Ideally you would need google's Dremel (the engine behind BigQuery). We are developing an open source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory - it should give an idea what size of cluster you need.
If I understand you correctly and you only need to aggregate on single columns at a time
You can store your data differently for better results
in HBase that would look something like
table per data column in today's setup and another single table for the filtering fields (type_ids)
row for each key in today's setup - you may want to think how to incorporate your filter fields into the key for efficient filtering - otherwise you'd have to do a two phase read (
column for each table in today's setup (i.e. few thousands of columns)
HBase doesn't mind if you add new columns and is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant value which you can do avg. etc. quite easily
You might want to use a plain old database for this. It doesn't sound like you have a transactional system. As a result you can probably use just one or two large tables. SQL has problems when you need to join over large data. But since your data set doesn't sound like you need to join, you should be fine. You can have the indexes set up to find the data set and then either do the math in SQL or in the app.