Athena/DDB to condense millions of data points for plotting them on a graph

I need to plot trend charts in a React app based on user inputs such as timestamps, devices, etc. I have the related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take the average of the readings in each bin. However, I'm concerned about how well it will show the drops and peaks, which we need to render accurately.
Both Athena queries and DDB queries (the latter due to the 1 MB page limit) seem fairly slow so far.
Of course, the size of the response payload is another concern, as API Gateway and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?

I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval and then create more than one table. For example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in Dynamo. At some point, we decided to switch to Amazon Timestream.
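For the binning idea itself, here is a sketch of what the Athena side might look like; the table sensor_readings with columns device_id, ts, and value is an assumption. Returning min and max alongside the average is one way to keep the drops and peaks visible even after aggregation:

    -- Bucket readings into fixed 1-hour bins and aggregate each bin.
    -- avg tracks the trend; min/max preserve the extremes inside each bin.
    SELECT
      from_unixtime(floor(to_unixtime(ts) / 3600) * 3600) AS bucket_start,
      avg(value) AS avg_value,
      min(value) AS min_value,
      max(value) AS max_value
    FROM sensor_readings
    WHERE device_id = 'dev-123'   -- hypothetical user-supplied filter
      AND ts BETWEEN timestamp '2024-01-01 00:00:00'
                 AND timestamp '2024-01-08 00:00:00'
    GROUP BY floor(to_unixtime(ts) / 3600)
    ORDER BY bucket_start

With one row per bin, the payload stays far below the API Gateway and Lambda limits no matter how many raw points the range covers.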

Related

How to persist previous data point when time range doesn't include a data point

TL;DR:
Can I get Grafana to show me the previous data point when the currently selected time period does not contain a data point? I have an example which sounds ridiculous, but at least it's simple to understand: I send data every 1 minute, and I wish to zoom into the last 30 seconds and still see data. You may ask "why not just zoom out to 2 minutes?", but the reason is that other data on the same graph updates more often, and I wish to compare with that data. There are also the lengthier reasons below.
If not, how can I achieve what I want to achieve, see below?
Context
For a few years, I have been monitoring the water level in three of our basement sumps (which have pumps installed) by sending this data from Node-RED to InfluxDB, then visualising the sump levels in Grafana. I have set up three waterproof ultrasonic distance sensors, each pointed down a pipe that is inserted vertically into each sump. The water fills the pipe and the distance sensor, connected to an Arduino, sends me the reading. The Arduino also has other sensors connected (temp / humidity) and deals with distance calibrations to calculate the percent full of each sump. All this data is sent to Node-RED. In total, I am sending 4 values per sump: distance measurement in mm, percent full, temp, humidity. So that's 12 fields. Data is sent every 2 seconds, because I wished to have a reasonably high resolution to see nice curves in graphs.
Also, I decided to store all this data so that I could later troubleshoot issues (we have had sewage floods resulting in water not being able to be pumped away, etc.) and design some warning systems for these issues based on the data.
Storing 12 values every 2 seconds, over the course of a number of years, takes up a lot of space (8 GB).
Nature of the data
Storing this resolution of data has also helped me be able to describe the nature of the data. I will do so here.
(1) Non-meaningful NOISE - the percent-full reading goes up and down by 1 or 2 percent every couple of seconds.
(2) Meaningful DRIFT - I don't mean sensor drift; I am referring to actual water levels changing slowly over time, e.g. over 1 day or 1 week. Perhaps condensation on the walls drips down into the sump, or water evaporates from the sump, and the value can waver by a few percent over the course of a day. Each sump has slightly different characteristics.
(3) Meaningful MONITORING DATA - during wet weather, depending on rainfall amount, the sumps fill up over the course of, say, 30 mins to 3 hours. Then the pumps run and the water level drops again, wavers a bit, then the sumps continue to fill up. If the rain stopped, you can see a lovely curve as the water fills in progressively more slowly.
Solution to downsample
I know Influx has its own downsampling capabilities, but because of the nature of the data (it can stay nearly constant for 2 months, yet when it does change I really need to capture it in detail), I don't think lowering the sample rate is a great idea.
I have some understanding of digital filters (e.g. low-pass etc.) but have never programmed one myself. So I have written a basic filter in JavaScript (a Node-RED function) to filter the data in real time as follows: only send each reading when it has changed from the previous one by x amount. (And update the previous one when that occurs.)
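As a sketch, a deadband filter like that in a Node-RED function node might look like the following; the field name percentFull and the threshold value are assumptions:

    // Deadband filter: only forward a reading when it differs from the last
    // forwarded reading by at least x. Names here are assumptions.
    const x = 2; // minimum change (in percent) before a reading is forwarded

    const current = msg.payload.percentFull;
    const previous = context.get("previous");

    if (previous === undefined || Math.abs(current - previous) >= x) {
        context.set("previous", current); // update the reference reading
        return msg;                       // forward the changed reading
    }
    return null;                          // drop the unchanged reading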
This has already vastly reduced the amount of data being stored, and I can vary x to filter out noise shown in my first graph above, at the expense of resolution when the pumps run. Even if I set the x value to 2, it still vastly reduces data over long periods of dry weather.
So - onto my problem! Now data is not being logged to InfluxDB unless there is some meaningful change, which means that when I zoom in to e.g. a 15-minute timeframe of data, there is nothing to see.
Grafana does have the option of "fill (previous)", but this draws a line between points on the existing graph rather than showing the previous data as if it hasn't changed since that point. Now my Grafana dashboard looks a bit sad :(
One proposed solution is, in addition to sending "delta" data, to send "summary" data - that is, to send a full suite of data every 1 minute regardless of whether the data changed or not. But then we get the noise back again, and pointless storage.
Any other ideas?

Knowledge Graph for Time-Series Data

Would storing time series data in a Knowledge Graph be a good idea? What could be the benefits of doing so?
It depends on the queries you want to run on the time series data, but I suspect the answer is NO.
Typical queries on time series data include the following:
moving averages; e.g. a 30-day average of stock prices (a SQL sketch follows this list)
median
accounting functions; e.g. average growth rate, amortization, internal rate of return and so on.
statistical functions; e.g. autocorrelation, and correlation between two series.
pattern finding; i.e. find a time series (or multiple time series) that has a similar pattern to this time series
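To make the first of these concrete, a 30-day moving average might look like this in plain SQL; the stock_prices table and its columns are hypothetical:

    -- 30-day moving average of daily closing prices.
    SELECT
      trade_date,
      close_price,
      avg(close_price) OVER (
        ORDER BY trade_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
      ) AS moving_avg_30d
    FROM stock_prices
    ORDER BY trade_date

Queries like this lean entirely on ordered scans and windowed aggregation over one collection of rows, which is exactly the access pattern at issue here.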
In general, time series data has a greater need for aggregation over a collection of data than for creating a graph of the data. This will likely cause time-series-related queries to perform poorly on a graph-like database.
A factor to consider is that the amount of data stored for a time series can be far bigger than that of a typical knowledge graph, depending on the sample rate of the time series data.
Here are some of the references that brought me to this conclusion:
Indexing Strategies for Time Series Data
Demystifying Graph Databases - Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries

Prometheus query quantile of pod memory usage performance

I'd like to get the 0.95 quantile of my pods' memory usage over the last X amount of time. However, this query starts to take too long if I use a 'big' (7d / 10d) range.
The query that I'm using right now is:
quantile_over_time(0.95, container_memory_usage_bytes[10d])
It takes around 100s to complete.
(I removed the extra namespace filters for brevity.)
What steps could I take to make this query more performant (other than making the machine bigger)?
I thought about calculating the 0.95 quantile every so often (say, every 30 min), recording it as p95_memory_usage, and using p95_memory_usage instead of container_memory_usage_bytes in the query, so that I can reduce the number of points the query has to go through.
However, would this not distort the values?
As you already observed, aggregating quantiles (over time or otherwise) doesn't really work.
You could try to build a histogram of memory usage over time using recording rules, looking like a "real" Prometheus histogram (consisting of _bucket, _count and _sum metrics) although doing it may be tedious. Something like:
- record: container_memory_usage_bytes_bucket
  labels:
    le: "100000.0"
  expr: |
    # 1 if this sample belongs in the le="100000.0" bucket, else 0.
    # A histogram bucket counts samples <= its bound, so the comparison
    # direction matters for histogram_quantile later on.
    (container_memory_usage_bytes <= bool 100000.0)
    # Add the previously recorded bucket value, or 0 on the first run;
    # ignoring(le) lets the new sample match the recorded series.
    + ignoring(le)
    (
      container_memory_usage_bytes_bucket{le="100000.0"}
      or ignoring(le)
      container_memory_usage_bytes * 0
    )
Repeat for all bucket sizes you're interested in, add _count and _sum metrics.
Histograms can be aggregated (over time or otherwise) without problems, so you can use a second set of recording rules that computes an increase of the histogram metrics, at much lower resolution (e.g. hourly or daily increase, at hourly or daily resolution). And finally, you can use histogram_quantile over your low resolution histogram (which has a lot fewer samples than the original time series) to compute your quantile.
It's a lot of work, though, and there will be a couple of downsides: you'll only get hourly/daily updates to your quantile and the accuracy may be lower, depending on how many histogram buckets you define.
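For reference, the final step might look something like this; the :increase1h rule name is hypothetical and assumes that second, hourly set of recording rules:

    # Assumed second rule group, evaluated hourly:
    #   - record: container_memory_usage_bytes_bucket:increase1h
    #     expr: increase(container_memory_usage_bytes_bucket[1h])
    histogram_quantile(
      0.95,
      sum by (le) (
        sum_over_time(container_memory_usage_bytes_bucket:increase1h[10d])
      )
    )

Keep any other labels you need in the by (...) clause if you want per-pod rather than global quantiles.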
Else (and this only came to me after writing all of the above) you could define a recording rule that runs at lower resolution (e.g. once an hour) and records the current value of container_memory_usage_bytes metrics. Then you could continue to use quantile_over_time over this lower resolution metric. You'll obviously lose precision (as you're throwing away a lot of samples) and your quantile will only update once an hour, but it's much simpler. And you only need to wait for 10 days to see if the result is close enough. (o:
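A sketch of that simpler approach, using a rule group with its own evaluation interval (the group and metric names are made up):

    groups:
      - name: memory_lowres
        interval: 1h   # evaluate once per hour instead of every scrape
        rules:
          - record: container_memory_usage_bytes:lowres
            expr: container_memory_usage_bytes

after which the query becomes quantile_over_time(0.95, container_memory_usage_bytes:lowres[10d]).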
The quantile_over_time(0.95, container_memory_usage_bytes[10d]) query can be slow because it needs to take into account all the raw samples for all the container_memory_usage_bytes time series on the last 10 days. The number of samples to process can be quite big. It can be estimated with the following query:
sum(count_over_time(container_memory_usage_bytes[10d]))
Note that if the quantile_over_time(...) query is used for building a graph in Grafana (i.e. a range query instead of an instant query), then the number of raw samples returned from sum(count_over_time(...)) must be multiplied by the number of points on the Grafana graph, since Prometheus executes quantile_over_time(...) individually for each point on the displayed graph. Grafana usually requests around 1000 points to build a smooth graph, so multiply the number returned from sum(count_over_time(...)) by 1000 to estimate the number of raw samples Prometheus needs to process when building the quantile_over_time(...) graph. See more details in this article.
There are the following solutions for reducing query duration:
To add more specific label filters in order to reduce the number of selected time series and, consequently, the number of raw samples to process.
To reduce the lookbehind window in square brackets. For example, changing [10d] to [1d] reduces the number of raw samples to process by 10x.
To use recording rules for calculating coarser-grained results.
To try using other Prometheus-compatible systems, which may process heavy queries at faster speed. Try, for example, VictoriaMetrics.

Effective way to display the data in the chart

I have an application where some values are stored in a DB, e.g. one value per second. That is 604,800 values per 7 days, and if I want to view these values in a graph I need an efficient way to get only e.g. 800 values from the DB when my chart is 800px wide.
I use some aggregation logic where a mean value is computed for the values in 2-, 3-, 4-, 5-, 6-, 10-, and 12-minute intervals, and then hour and day interval aggregates are computed from those.
I use PostgreSQL, and these aggregations are computed with a statement like:
"INSERT INTO aggre_table_ ... SELECT sum(...)/count(*) ... WHERE timestamp > ... and timestamp < ..."
Is there a better way to do this, or what is the best way to aggregate data for later display in charts?
Is it better to do this with a trigger or by calling stored procedures?
Is there any DB support for aggregations for D3.js, Highcharts or Google Charts?
How to aggregate your data is a large topic that is independent of your technology choices. It depends largely on how sensitive the data is, what the important indicators of the data are, what the implications of those indicators are, etc.
Is a single out of range point significant? Or are you looking for the overall trend? These are big questions with answers that aren't always easy.
My general suggestion:
to display a week's worth of data, aggregate to hourly averages (a SQL sketch follows this list)
provide a range around that line indicating the distribution of points around each average
if something significant happened within an aggregated point, indicate it with a separate marker
provide drill-down capability for each aggregated point to see the full detail charted, if that level of detail is important (chances are, it's not)
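As a sketch of the first two suggestions in PostgreSQL (the raw_table and aggre_table_hour names and columns are assumptions), an hourly rollup that also keeps min/max for the distribution band:

    -- Hourly averages plus min/max to draw a band around the line.
    INSERT INTO aggre_table_hour (bucket_start, avg_value, min_value, max_value)
    SELECT
      date_trunc('hour', ts) AS bucket_start,
      avg(value),
      min(value),
      max(value)
    FROM raw_table
    WHERE ts >= now() - interval '7 days'
    GROUP BY date_trunc('hour', ts);

avg() here stands in for the sum(...)/count(*) from the question, and the min/max columns feed the range band.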
In Highcharts (Highstock, in fact), dataGrouping is used for approximation (see the demo).
Also, here you can find more about Highstock.

Reducing Large Datasets with MongoDB and D3

I'm working on a D3 visualization and luckily have made some progress. However, I've run into an issue... and to be honest, I'm not sure if it's a MongoDB issue or a D3 issue. You see, I'm trying to make a series of graphs from a set of sensor points (my JSON objects contain timestamps, light, temperature, humidity, and motion detection levels for each datapoint). However, my sensors are uploading data to my MongoDB database every 8 seconds. So, if I query the database for just one day's worth of data, I get 10,800 datapoints. Worse, if I were to ask for one month of data, I'd be swamped with 324,000 datapoints. My issue is that my D3 visualization slows to a crawl when dealing with more than about 1,000 points (I'm visualizing the data on four different graphs, each of which uses a single brush to select a certain domain on the graph). Is there a way to limit the amount of data I'm trying to visualize? Is this better done using MongoDB (so basically filter the data I'm querying and only get every nth data point based on how big a time range I'm trying to query)? Or is there a better way? Should I try to filter the data using D3 once I've retrieved the entire dataset? What is the best way to go about reducing the amount of points I need to deal with? Thanks in advance.
MongoDB is great at filtering. If you really only need a subset of the data, specify that in a find query - this could limit to a subset of time or, if you're clever, only get data for the first minute of every hour, or similar.
Or you can literally reduce the amount of data coming out of MongoDB using the aggregation framework. This could be used to get partial sums or averages or similar.
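A sketch of the aggregation-framework approach, averaging readings into 5-minute buckets server-side so the client only ever receives a few hundred points; the readings collection and its field names are assumptions:

    // Average readings into 5-minute buckets server-side.
    var start = ISODate("2014-01-01T00:00:00Z"); // assumed query range
    var end   = ISODate("2014-01-02T00:00:00Z");
    var binMs = 5 * 60 * 1000;                   // 5-minute bins

    db.readings.aggregate([
      { $match: { timestamp: { $gte: start, $lt: end } } },
      { $group: {
          // Round each timestamp down to its bin: date minus
          // (milliseconds since epoch modulo binMs) is a date again.
          _id: { $subtract: [
                  "$timestamp",
                  { $mod: [ { $subtract: [ "$timestamp", new Date(0) ] }, binMs ] }
                ] },
          avgLight:    { $avg: "$light" },
          avgTemp:     { $avg: "$temperature" },
          avgHumidity: { $avg: "$humidity" },
          avgMotion:   { $avg: "$motion" }
      } },
      { $sort: { _id: 1 } }
    ])

One day of 8-second readings then comes back as 288 rows instead of 10,800.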