Large Windows in Apache Beam pipeline - apache-beam

I have a high-level question regarding window sizes in apache beam. Most streaming examples show using beam with relatively small window sizes. Our use case involves looking at data that encompasses 15 to 30-day windows.
My question is: does anyone foresee any downside to having windows this large?

If you have a 15-day window, this would require buffering up data for 15 days before finally producing an answer. Most people with streaming pipelines are interested in latencies lower than that.
You could look at using sliding windows with a shorter period, e.g. sliding windows that produce a result for the past 15 days every day. If the ratio of the window size to the sliding period is large, this requires storing many intermediates per key (e.g. 15-day/1-hour sliding windows would require 15*24 = 360 intermediates).
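For illustration, a sliding-window setup like that in the Beam Python SDK might look roughly as follows (the toy bounded source, keys and values are placeholders, not from the question):

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows, TimestampedValue

DAY = 24 * 60 * 60  # Beam durations are in seconds

with beam.Pipeline() as p:
    _ = (
        p
        # Toy bounded input: (event_time_seconds, (key, value)) tuples.
        | beam.Create([(0, ("user1", 3)), (DAY, ("user1", 5)), (2 * DAY, ("user2", 7))])
        | beam.MapTuple(lambda ts, kv: TimestampedValue(kv, ts))
        # 15-day windows that slide every day: each element lands in 15 windows.
        | beam.WindowInto(SlidingWindows(size=15 * DAY, period=DAY))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )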
You could also use triggers to get speculative results out before the window is "complete." Using triggers you could get results N days into the full window, constantly growing until the window is finally done 15 days later. This requires greater care in handling "early" results to avoid double counting (e.g. you wouldn't want to include an emitted "day 1 sum" and a "day 1-2 sum" in the same aggregation).
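With the Beam Python SDK, the trigger setup might look roughly like this (the once-per-day early firing is just an illustrative cadence):

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AccumulationMode)

DAY = 24 * 60 * 60

# 15-day windows that emit a speculative pane roughly once a day of processing
# time and a final pane when the watermark passes the end of the window.
window_into = beam.WindowInto(
    FixedWindows(15 * DAY),
    trigger=AfterWatermark(early=AfterProcessingTime(DAY)),
    # ACCUMULATING: each early pane is the running total so far, so downstream
    # consumers must replace (not add up) earlier panes to avoid double counting.
    accumulation_mode=AccumulationMode.ACCUMULATING,
)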

Related

Athena/DDB to condense millions of data points for plotting them on a graph

I need to plot trend charts on the react app based on user inputs such as timestamps, devices, etc. I have related time series data in DynamoDB and S3 (which I can query using Athena).
Returning all those millions of data points for a graph seems unreasonable and is super laggy.
I guess one option is "binning", where I decide the number of bins based on how big the time range is and take the average of the readings in each bin. However, I'm concerned about how well it will capture the drops and highs, which we need to show accurately.
Both Athena queries and DDB queries (due to the 1 MB page limit) seem fairly slow so far.
Of course, the size of the response payload is another concern, as API Gateway and Lambda limit it to 10 MB and 6 MB respectively.
Any ideas?
I can't suggest anything smarter than "binning", but if you are concerned that the bucket interval might become too wide and performance might suffer, you can fix the interval and create more than one table. For example, the interval can be 1 hour and you can have a new table for each week.
This is what we did when we had to deal with time series in Dynamo. At some point, we decided to switch to Amazon Timestream.
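For what it's worth, the binning itself is simple once the raw points are fetched server-side; here is a rough pandas sketch with made-up column names and toy data, keeping min/max alongside the mean so the drops and highs aren't averaged away:

import pandas as pd

# Toy raw readings; in practice this would be the Athena/DynamoDB result set.
df = pd.DataFrame({
    "timestamp": ["2024-01-01 00:00:05", "2024-01-01 00:20:00", "2024-01-01 01:10:00"],
    "value": [1.0, 9.0, 4.0],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Fixed 1-hour bins: mean for the trend line, min/max to preserve the extremes.
binned = (
    df.set_index("timestamp")["value"]
      .resample("1h")
      .agg(["mean", "min", "max"])
)
print(binned)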

How to persist previous data point when time range doesn't include a data point

TL;DR:
Can I get Grafana to show me the previous data point, when the currently selected time period does not have a data point? I have an example which sounds ridiculous, but at least it's simple to understand: I send data every 1 minute, and I wish to zoom into the last 30 seconds, and still see data. You may ask "why not just zoom out to 2 minutes" but the reason is that other data is on the same graph that has updated more often, and I wish to compare with that data. Also, for the more lengthy reasons below.
If not, how can I achieve what I want to achieve, see below?
Context
For a few years, I have been monitoring the water level in three of our basement sumps (which have pumps installed) by sending this data from Node-RED to InfluxDB, then visualising the sump levels in Grafana. I have set up three waterproof ultrasonic distance sensors, each pointed down a pipe that is inserted vertically into each sump. The water fills the pipe and the distance sensor, connected to an Arduino, sends me the reading. The Arduino also has other sensors connected (temp / humidity) and deals with distance calibrations to calculate the percent full of each sump. All this data is sent to Node-RED. In total, I am sending 4 values per sump: distance measurement in mm, percent full, temp, humidity. So that's 12 fields. Data is sent every 2 seconds, because I wished to have a reasonably high resolution to see nice curves in graphs.
Also I decided to store all this data so that I could later troubleshoot issues (we have had sewage floods resulting in water not being able to be pumped away, etc...) and design some warning systems for these issues based on data.
Storing 12 values for every 2 seconds, over the course of a number of years, takes up a lot of space (8GB).
Nature of the data
Storing this resolution of data has also helped me be able to describe the nature of the data. I will do so here.
(1) Non-meaningful NOISE (see below) - the percent-full reading goes up and down by 1 or 2 percent every couple of seconds:
(2) Meaningful DRIFT (see below) - I don't mean sensor drift, I am referring to actual water levels changing slowly over time, e.g. over 1 day or 1 week. Perhaps condensation on the walls drips down into the sump, or water evaporates from the sump, and the value can waver by a few percent over the course of a day. Each sump has slightly different characteristics.
(3) Meaningful MONITORING DATA - during wet weather, depending on rainfall amount, the sumps fill up over the course of say 30 mins to 3 hours. Then the pumps run and the water level drops again, wavers a bit, then the sumps continue to fill up. If the rain stopped, you can see a lovely curve as the water fills in progressively more slowly (see the green line below):
Solution to downsample
I know Influx has its own downsampling possibilities; however, because of the nature of the data (which may hardly vary for 2 months, but when it does, I really need to capture it in detail), I don't think lowering the sample rate is a great idea.
I have some understanding of digital filters (e.g. low-pass, etc.) but have never programmed one myself. So I have written a basic filter in JavaScript (a Node-RED function) to filter the data in real time as follows: only send a reading when it has changed from the previously sent one by x amount (and update the stored previous value when that occurs).
This has already vastly reduced the amount of data being stored, and I can vary x to filter out the noise shown in my first graph above, at the expense of resolution when the pumps run. Even if I set x to 2, it still vastly reduces the data over long periods of dry weather.
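For reference, the idea in (hypothetical) Python form - the real implementation in the post is a JavaScript Node-RED function:

# A sketch of the change-threshold ("deadband") filter; x is the minimum
# change worth storing, and field names are made up.
def make_deadband_filter(x):
    last_sent = {}  # last stored value per field, e.g. "sump1_percent"

    def should_store(field, value):
        prev = last_sent.get(field)
        if prev is None or abs(value - prev) >= x:
            last_sent[field] = value
            return True
        return False

    return should_store

store = make_deadband_filter(x=2)
print(store("sump1_percent", 50))  # True  (first reading is always stored)
print(store("sump1_percent", 51))  # False (changed by less than 2)
print(store("sump1_percent", 53))  # True  (changed by 3, which is >= 2)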
So - on to my problem! Now data is not being logged to InfluxDB unless there is some meaningful change, which means that when I zoom in to, say, a 15-minute timeframe, there is nothing to see.
Grafana does have the option of "fill (previous)", but this draws a line between points on the existing graph rather than showing the previous data as if it hadn't changed since that point. Now my Grafana dashboard looks a bit sad :(
One proposed solution is, in addition to sending "delta" data, to send "summary" data - that is, send a full suite of data every minute regardless of whether it changed or not. But then we get the noise back again, plus pointless storage.
Any other ideas?

Alert in RAM/CPU Usage Detection in e-Commerce Server

Currently I'm building monitoring services for my e-commerce server, mostly focused on CPU/RAM usage. It's essentially anomaly detection on time-series data.
My approach is to build an LSTM neural network that predicts the next CPU/RAM value in the trend, and to compare the deviation from the prediction with the standard deviation multiplied by some factor (currently 10).
But in real life it depends on many different conditions, such as:
1- Maintenance time (during maintenance an "anomaly" is not really an anomaly)
2- Sales periods around day-off events, holidays, etc., where increased RAM/CPU usage is of course normal
3- If the CPU/RAM keeps decreasing by similar percentages across 3 observations (5 min, 10 min and 15 min) -> anomaly. But if it dropped 50% over 5 minutes and then barely changed over the next 10 minutes (-5% ~ +5%) -> not an anomaly.
Currently I detect anomalies with a formula like this:
isAlert = (Diff5m >= 10 && Diff10m >= 15 && Diff30m >= 40)
where Diff is the percentage difference in absolute value.
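Written out as code (with a hypothetical in_maintenance flag for point 1 above), the current check is roughly:

def is_alert(diff_5m, diff_10m, diff_30m, in_maintenance=False):
    # Diffs are absolute percentage changes over 5/10/30 minutes.
    # in_maintenance is a hypothetical extra input: suppress alerts during maintenance.
    if in_maintenance:
        return False
    return diff_5m >= 10 and diff_10m >= 15 and diff_30m >= 40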
Unfortunately I didn't save my "raw" data for building a neural network; for example, when an anomaly was detected, I modified the record so that it no longer counts as an anomaly.
I would like to add some attributes to my model input, such as isMaintenance, isPromotion, isHoliday, etc., but sometimes that leads to overfitting.
I also want my NN to adjust its baseline over time, for example as my service becomes more popular.
Are there any hints on these aims?
Thanks
I would say that an anomaly is an unusual outcome, i.e. an outcome that's not expected given the inputs. As you've figured out, there are a few variables that are expected to influence CPU and RAM usage. So why not feed those to the network? That's the whole point of machine learning. Your network will make a prediction of CPU usage, taking into account the sales volume, whether there is (or was) a maintenance window, etc.
Note that you probably don't need an isPromotion input if you include actual sales volumes. The former is a discrete input and only captures a fraction of the information present in the totalSales input.
Machine learning definitely needs data. If you threw that away, you'll have to start capturing it again. As for adjusting the baseline, you can achieve that by overweighting recent input data.
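As a rough illustration (not your pipeline - feature names like is_maintenance and total_sales are hypothetical), the contextual variables simply become extra columns in each timestep's feature vector, and recent samples can be overweighted via sample weights:

import numpy as np
from tensorflow import keras

LOOKBACK = 12       # e.g. 12 observations of 5 minutes = 1 hour of history
N_FEATURES = 5      # cpu_pct, ram_pct, total_sales, is_maintenance, is_holiday

# Toy stand-in data; in practice these rows come from your monitoring history.
raw = np.random.rand(1000, N_FEATURES).astype("float32")
target_cpu = raw[:, 0]  # predict the next CPU percentage

# Build (samples, LOOKBACK, N_FEATURES) sequences and the value following each.
X = np.stack([raw[i:i + LOOKBACK] for i in range(len(raw) - LOOKBACK)])
y = target_cpu[LOOKBACK:]

# Overweight recent samples so the baseline slowly adapts as traffic grows.
weights = np.linspace(0.2, 1.0, num=len(X))

model = keras.Sequential([
    keras.Input(shape=(LOOKBACK, N_FEATURES)),
    keras.layers.LSTM(32),
    keras.layers.Dense(1),  # predicted next CPU usage
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, sample_weight=weights, epochs=2, batch_size=32)

# At serving time: flag an anomaly when |actual - predicted| exceeds k * std of
# recent prediction errors, instead of using fixed Diff thresholds.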

How to run a multi-queue code using OpenCL?

For example, I'm doing some image processing work on every frame of a video.
Each frame takes about 200 ms to process, including writing, processing and reading.
The video is 25 fps, so consecutive frames are 40 ms apart. That means the processing is too slow to show a continuous result.
So here is my idea: I use multiple queues for this work.
On the CPU side:
while (video is not over)
{
    1. read frame0;
       process frame0 using queue0;
       wait 40 ms;
    2. read frame1;
       process frame1 using queue1;
       wait 40 ms;
    3. 4. 5. ...
       (after 5 frames, i.e. roughly the 200 ms processing time)
    6. download frame0's result;
    7. read frame5;
       process frame5 using queue0;
       wait 40 ms;
    ...
}
The code means that reading and processing of a given frame happen on one queue, while consecutive frames are spread across different queues.
However, in my experiment the result is indeed faster, but only about 2 times faster - not the speed I imagined.
Can anyone tell me how to deal with it? Thanks!
Assuming you have one Device, here are some thoughts on this point:
The main reason to have multiple command queues (CQs) per single OpenCL device is the ability to execute kernels and do IO operations simultaneously.
Usually one CQ is enough to load a single device at ~100%. Still, your multi-CQ idea is good (in my opinion), as you're constantly feeding the GPU with work; see the sketch below.
Look at the kernel execution time. Maybe it's already large enough that your device is constantly executing kernels and simply can't go any faster.
I don't think you need to wait 40 ms. A good approach is to process frames in the order in which they are queued, to eliminate the difference between bitstream and display order.
If you have too many CQs, your OpenCL driver thread will be busy maintaining them, and performance may decrease.
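To illustrate the multi-CQ pattern, here is a generic pyopencl sketch (not your code - the kernel body, frame size and toy frames are placeholders):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queues = [cl.CommandQueue(ctx) for _ in range(2)]  # two in-order queues

program = cl.Program(ctx, """
__kernel void process_frame(__global const uchar *src, __global uchar *dst) {
    int i = get_global_id(0);
    dst[i] = 255 - src[i];   // placeholder per-pixel work
}
""").build()

FRAME_BYTES = 1920 * 1080
frames = [np.random.randint(0, 256, FRAME_BYTES, dtype=np.uint8) for _ in range(6)]  # toy frames
src_bufs = [cl.Buffer(ctx, cl.mem_flags.READ_ONLY, FRAME_BYTES) for _ in range(2)]
dst_bufs = [cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, FRAME_BYTES) for _ in range(2)]
results = [np.empty(FRAME_BYTES, dtype=np.uint8) for _ in range(2)]

for i, frame in enumerate(frames):
    q = i % 2
    # Upload, launch and download on the same queue; with two queues the driver
    # is free to overlap one frame's transfers with the other frame's kernel
    # (whether it actually does depends on the device and driver).
    cl.enqueue_copy(queues[q], src_bufs[q], frame, is_blocking=False)
    program.process_frame(queues[q], (FRAME_BYTES,), None, src_bufs[q], dst_bufs[q])
    evt = cl.enqueue_copy(queues[q], results[q], dst_bufs[q], is_blocking=False)
    # In real code, keep `evt` and wait on it before reusing results[q]
    # or before displaying the frame.

for cq in queues:
    cq.finish()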

Dymola/Modelica real-time simulation advances too fast

I want to simulate a model in Dymola in real-time for HiL use. In the results I see that the Simulation is advancing about 5% too fast.
Integration terminated successfully at T = 691200
CPU-time for integration : 6.57e+005 seconds
CPU-time for one GRID interval: 951 milli-seconds
I already tried to increase the grid interval to reduce the relative error, but the simulation still advances too fast. I have only read about approaches that reduce model complexity to allow simulation within the defined time steps.
Note that the simulation does keep up with real time and is even faster. How can I, in this case, match simulated time and real time?
Edit 1:
I used the Lsodar solver with the "Synchronize with realtime" option checked in the Realtime tab. I have the real-time simulation licence option. I use Dymola 2013 on Windows 7. Here is the result for a step size of 15 s:
Integration terminated successfully at T = 691200
CPU-time for integration : 6.6e+005 seconds
CPU-time for one GRID interval : 1.43e+004 milli-seconds
The deviation is still roughly 4.5%.
I did, however, not use inline integration.
Do I need hard real-time or inline integration to improve these results? It should be possible to get a deviation lower than 4.5% using soft real-time, shouldn't it?
Edit 2:
I took the Python27 block from the Berkeley Buildings library to read the system time and compare it with the simulation progress. The result shows that 36 hours after simulation start, the simulation slows down slightly (compared to real time). About 72 hours after the start it becomes about 10% faster than real time. In addition, the jitter in the results increases after those 72 hours.
Any explanations?
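For reference, the comparison itself boils down to something like this generic sketch (not the Buildings-library block):

import time

wall_start = time.time()

def drift_percent(simulated_seconds_elapsed):
    # Positive result = simulation running faster than wall-clock time.
    wall_elapsed = time.time() - wall_start
    return 100.0 * (simulated_seconds_elapsed - wall_elapsed) / wall_elapsed

# Example: if 3600 s of simulated time have passed after about 3420 s of wall
# time, drift_percent(3600) reports roughly +5% (simulation running fast).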
Next steps will be:
- changing to a fixed-step solver (this might well be a big part of the solution)
- changing from a DDE server to an OPC server, which at the moment doesn't seem to be possible in Dymola 2013, however
Edit 3:
Nope... using a fixed-step solver does not seem to solve the problem. In the first 48 hours of simulation time the deviation seems to be equal to the deviation with a variable-step solver. In this example I used the Rkfix3 solver with an integrator step of 0.1.
Does nobody know how to get rid of these large deviations?
If I recall correctly, Dymola has a special compilation option for real-time performance. However, I think it is a licensed option (not sure).
I suspect that Dymola is picking up the wrong clock speed.
You could use the "Slowdown factor" that is in the Simulation Setup, on the Realtime tab just below "Synchronize with realtime". Set this to 1/0.95.
There is a parameter in Dymola that you can use to set the CPU speed, but I could not find it just now; I will have another look for it later.
I solved the problem by switching to an embedded OPC server. The error between real time and simulation time in this case is shown below.
Compiling Dymola problems with an embedded OPC server requires administrator rights (which I did not have before), and the active Dymola folder must not be write-protected.