MongoDB: old data aggregation

I have a script that collects data from somewhere and writes it into a MongoDB collection every 10 minutes. Then we have a frontend that displays the historical trends of the data in the form of graphs/charts etc. I noticed that we now have around 2 years of data, all at a 10 minute resolution. Generally, we like to see data at a 10 minute resolution for the past 6 months only. Data that is 6 months to 1 year old is checked at an hourly resolution, while data older than a year is checked at a daily resolution.
This means we should somehow aggregate the 10 minute resolution data to a coarser resolution once it is older than some threshold, e.g. average (or maybe take the max/min, depending on the parameter) the data older than a year on an hourly basis, then remove the 10 minute entries and insert a single aggregated entry in their place.
Are there any frameworks available out there that could support such policy based data management?
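
As far as I know, MongoDB has no built-in downsampling policy (TTL indexes only delete old documents, they don't roll them up), so one common approach is a scheduled job that aggregates old 10-minute samples into a coarser collection. Below is a minimal sketch of that idea using PyMongo; the database, collection, and field names (`monitoring`, `metrics`, `metrics_hourly`, `ts`, `value`) are assumptions for illustration, not anything from the actual setup.

```python
# Rough sketch (not a drop-in solution): roll 10-minute samples older than
# one year into hourly averages, then delete the raw documents.
# All names below (monitoring, metrics, metrics_hourly, ts, value) are assumed.
from datetime import datetime, timedelta
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["monitoring"]                      # hypothetical database name
cutoff = datetime.utcnow() - timedelta(days=365)

pipeline = [
    {"$match": {"ts": {"$lt": cutoff}}},
    {"$group": {
        "_id": {                               # one bucket per hour
            "y": {"$year": "$ts"},
            "m": {"$month": "$ts"},
            "d": {"$dayOfMonth": "$ts"},
            "h": {"$hour": "$ts"},
        },
        "avg": {"$avg": "$value"},
        "min": {"$min": "$value"},
        "max": {"$max": "$value"},
        "count": {"$sum": 1},
    }},
]

for bucket in db.metrics.aggregate(pipeline):
    b = bucket["_id"]
    db.metrics_hourly.update_one(
        {"ts": datetime(b["y"], b["m"], b["d"], b["h"])},
        {"$set": {"avg": bucket["avg"], "min": bucket["min"],
                  "max": bucket["max"], "samples": bucket["count"]}},
        upsert=True,
    )

# Only after the rollups are written, drop the raw 10-minute documents.
db.metrics.delete_many({"ts": {"$lt": cutoff}})
```

Run nightly from cron, this keeps the last year at 10-minute resolution and everything older at hourly resolution; a second pass of the same shape (grouping only on year/month/dayOfMonth) could roll hourly documents into daily ones.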

Related

How to handle lots of 'archived' data in Postgres

We have a huge Postgres database where we store fiscal data (invoices, bank statements, sales orders) for thousands of companies. In the UI of our app the data is divided per fiscal year (which is 1 calendar year most of the time). So a user chooses a year and only sees data for that specific year.
For example, we have a table that stores journal entries (every invoice line can result in multiple journal entries). This table is quite slow on the more complex queries. It's one big table going back about 15 years. However, users rarely access old data anymore; only the past 2 or 3 years are actively accessed, and data older than that is almost never touched.
What is the best way to deal with this old, almost archived data? Partitioning? Clustering? If anyone could point me in the right direction that would be of great help.
PS: our database is hosted in Google Cloud.
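
One direction worth evaluating is declarative range partitioning by fiscal year, so queries for the current year only touch a small, hot partition while old years sit in partitions that are rarely read. A rough sketch follows, assuming a hypothetical `journal_entries` layout with an `entry_date` column (column names and years are illustrative, not taken from the real schema):

```python
# Minimal sketch: range-partition journal entries by fiscal year.
# Table and column names are assumptions for illustration only.
import psycopg2

DDL = """
CREATE TABLE journal_entries_partitioned (
    id          bigint NOT NULL,
    entry_date  date   NOT NULL,
    amount      numeric,
    -- ... remaining columns ...
    PRIMARY KEY (id, entry_date)   -- partition key must be part of the PK
) PARTITION BY RANGE (entry_date);
"""

PARTITION = """
CREATE TABLE journal_entries_y{year}
    PARTITION OF journal_entries_partitioned
    FOR VALUES FROM ('{year}-01-01') TO ('{next}-01-01');
"""

with psycopg2.connect("dbname=fiscal") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    for year in range(2008, 2026):   # one partition per fiscal year (illustrative range)
        cur.execute(PARTITION.format(year=year, next=year + 1))
```

Since an existing table can't be converted to a partitioned one in place, the data would have to be copied over (for example with `INSERT ... SELECT`) during a maintenance window before swapping names.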

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, so that we can have a real-time view of the sensor measurements.
This, however, is very heavy on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1 second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history of values from previous days.
Question:
This raises the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
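
Here is a sketch of that cron-job idea, assuming hypothetical tables `measurements(sensor_id, ts, value)` and `measurements_5min(sensor_id, bucket, avg_value, samples)` rather than ThingsBoard's actual schema. The rollup insert and the delete run in one transaction, and `now()` is fixed per transaction in PostgreSQL, so both statements see the same cutoff and raw rows are only removed once their averages are stored.

```python
# Rough sketch of a periodic rollup-and-prune job; table and column names
# are assumptions, not ThingsBoard's real telemetry schema.
import psycopg2

# Cutoff aligned to an hour boundary so no 5-minute bucket straddles it.
ROLLUP = """
INSERT INTO measurements_5min (sensor_id, bucket, avg_value, samples)
SELECT sensor_id,
       date_trunc('hour', ts)
         + floor(extract(minute FROM ts) / 5) * interval '5 minutes' AS bucket,
       avg(value),
       count(*)
FROM measurements
WHERE ts < date_trunc('hour', now() - interval '1 day')   -- keep today's raw data
GROUP BY sensor_id, bucket;
"""

CLEANUP = """
DELETE FROM measurements
WHERE ts < date_trunc('hour', now() - interval '1 day');
"""

# psycopg2's connection context manager commits both statements atomically.
with psycopg2.connect("dbname=thingsboard") as conn, conn.cursor() as cur:
    cur.execute(ROLLUP)
    cur.execute(CLEANUP)
```

Scheduled daily via cron, this keeps only the most recent raw rows plus the growing 5-minute history; further passes of the same shape could thin the 5-minute data again later.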

MongoDB for huge amount of data

I need to get weather data for almost 200 German cities.
The point is that I need to save data from the beginning of this year onwards, for every single day, including the hourly temperature during the day and the min and max temperature for the whole day.
I know that is a huge amount of data, and it could become even bigger because it's not yet decided whether we will also get the historical weather data from 10 years ago until now. Besides that, the number of cities could grow to include cities from other countries.
Is MongoDB a good way to save this data? If not, which method would be better?
You can use MongoDB for weather data. MongoDB is flexible and document-based; you can store JSON-like documents (held internally in a binary format) in one place without having to define in advance what "types" of data they contain.
MongoDB is a schema-less database, can ingest a high volume of data, and is easy to scale. It supports sharding, which is the process of distributing the data across different machines as it grows. This gives you horizontal scaling, so more data can be written.
It has been used by The Weather Channel: because weather changes quickly, they turned to MongoDB to get information to users fast, and changes that used to take weeks can now be pushed out in hours. So a MongoDB database would be more than capable of handling that amount of weather data.
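
To make the data model concrete, here is an illustrative sketch only (every name in it is an assumption, not a required schema): one document per city per day, with the hourly temperatures embedded, which keeps a day's readings together and makes the daily min/max trivial to store alongside them.

```python
# Illustrative document model: one document per city per day.
# Database, collection, and field names are assumptions.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
weather = client["weather"]["daily"]

doc = {
    "city": "Berlin",
    "country": "DE",
    "date": datetime(2024, 1, 15),
    "hourly": [                      # up to 24 entries, one per hour
        {"hour": 0, "temp_c": -1.2},
        {"hour": 1, "temp_c": -1.5},
        # ...
    ],
    "temp_min_c": -3.0,
    "temp_max_c": 2.4,
}
weather.insert_one(doc)

# A compound index supports the typical "history for one city" queries.
weather.create_index([("city", 1), ("date", 1)])
```

At roughly 200 cities x 365 days, that is about 73,000 documents per year, which is a modest volume for MongoDB even before sharding enters the picture.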

Given a fixed (but non-linear) calendar of future equipment usage; how to calculate the life at a given date in time

I have a challenge that I'm really struggling with.
I have a table which contains a 'scenario' that a user has defined for how they will consume 'usage': for example, how many hours a machine will be turned on for.
In month 1 they will use 300 hours (stored as an integer, in minutes), in month 2, 100; in month 3, 450; and so on.
I then have a list of tasks which need to be performed at specific intervals of usage. Given the above scenario, how could I forecast when these tasks will be due (the date)? I also need to show repeat accomplishments and their dates.
The task contains the number of consumed hours at the last point of accomplishment, the interval between accomplishments and the expected life total the next time it is due (Last Done + Interval = Next Due)
I've tried so many different options. Ideally I want this to be computed at run time (i.e. the only things saved into a permanent table are the forecast and the list of tasks). I have 700-800 scenarios and, given the number of pieces of equipment, 12,000 tasks to be carried out. I need to show at least the next 100 years of tasks.
At the minute, I take the scenario and cross apply a list of dates between now and the year 2118. I then (in the WHERE clause) filter out rows where the period number (the month number of the year) doesn't match the date, and divide the period usage by the number of days in that period. That gives me a day-by-day usage over the next 100 years (~36,000 rows). When I join on the 12,000 tasks and then try to filter where the due-at value matches my dates table, I can't return ANY rows; even selecting just the date column, the query runs for 7-8 minutes.
We measure more than just hours of usage too; there are 25 different measurement points, all of which are specified in the scenario.
I can't use linear regression because, let's say, we shut down over the summer and don't utilize the equipment; an average usage over the year would then mean that tasks come due even while we are telling the scenario we're shut down.
Are there any strategies out there that I could apply? I'm not looking for a 'Here's the SQL' answer; I'm just running out of strategies to form a solid query that can deal with the volume of data I'm dealing with.
I can get a query running perfectly with one task, but it just doesn't scale... at all...
If SQL isn't the answer, then I'm open to suggestions.
Thanks,
Harry
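
One strategy that avoids the giant calendar cross join is to do the walk procedurally: expand the scenario into a cumulative-usage curve (one point per day), then for each task step along that curve and record the dates where each successive due threshold is crossed. The sketch below shows the idea in Python with made-up structures (a dict of monthly minutes, a single measurement point, a hypothetical current life total); the same shape could be ported to a set-based query or a table-valued function, and the per-task cost is proportional to days plus accomplishments rather than days x tasks.

```python
# Sketch only: names and structures are hypothetical, not the real schema.
# Spread each month's usage evenly over its days, accumulate it, and find the
# dates where Last Done + n * Interval is crossed.
from datetime import date
from calendar import monthrange

# scenario: minutes of usage per (year, month)
scenario = {(2024, 1): 300 * 60, (2024, 2): 100 * 60, (2024, 3): 450 * 60}

def daily_usage(scenario, current_life=0):
    """Yield (day, cumulative_minutes), spreading each month's usage evenly."""
    total = current_life
    for (year, month), minutes in sorted(scenario.items()):
        days_in_month = monthrange(year, month)[1]
        per_day = minutes / days_in_month
        for d in range(1, days_in_month + 1):
            total += per_day
            yield date(year, month, d), total

def forecast(last_done, interval, scenario, current_life, horizon=10):
    """Dates of the next `horizon` accomplishments (Last Done + Interval = Next Due)."""
    due = last_done + interval
    dates = []
    for day, cumulative in daily_usage(scenario, current_life):
        while cumulative >= due and len(dates) < horizon:
            dates.append(day)
            due += interval
        if len(dates) >= horizon:
            break
    return dates

# Task last accomplished at 200 hours of life, repeating every 150 hours,
# on a machine currently at 250 hours of total life.
print(forecast(last_done=200 * 60, interval=150 * 60,
               scenario=scenario, current_life=250 * 60))
```

Materialising that output into a forecast table (one row per predicted accomplishment per task, per measurement point) may be far cheaper than evaluating the 36,000-row-by-12,000-task join on every request.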

Estimated pricing for CloudKit usage?

I have a series of questions all for CloudKit pricing, hence a single post with multiple questions as they are all interrelated.
The current CloudKit calculator (as of Feb 2017) shows the following pricing details for 100,000 users:
Imagine working on an application which has large assets and a large amount of data transfer (even after using compression and architecting it so that transfers are minimized).
Now, assume my actual numbers for an app such as the one I just described with 100,000 users is:
Asset Storage: 7.5 TB
Data Transfer: 375 TB (this looks pretty high, but assume it is true)
My Questions
Will the Data Transfer component of my usage bill then be (375 - 5) TB * 1000 GB/TB * $0.10/GB = $37,000? (A worked version of this calculation appears after these questions.)
Also, are there pricing changes if, say, one is within the 5 TB limit but exceeds the 50 MB per-user limit, or is that per-user limit just an average, so that even if the data transfer per user is higher than 50 MB, I won't be charged as long as I stay within the 5 TB limit?
What does Active Users really mean in this pricing context? The number of users who have downloaded the app or the number of users using the app in a given month?
How is the asset storage counted? Imagine for 2 successive months, this is the asset size uploaded: Month 1: 7.5 TB, Month 2: 7.5 TB. Then in the second month, will my asset storage be counted as 15 TB or 7.5 TB?
Is it correct that asset storage and the other allocations increase for every user that is added (the screenshot does say that), or are the allocations increased in bulk only when you hit certain numbers such as 10K, 20K, ..., 100K? I read about bulk allocation but cannot find the source now; I am asking just to be sure, to avoid unpleasant surprises later.
Last but not least, is CloudKit usage billed monthly?
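
For reference, here is the arithmetic behind the first question worked out, using only the figures quoted above (5 TB of included transfer, $0.10 per GB of overage, 375 TB of actual transfer); whether Apple bills exactly this way is precisely what the question asks, so treat this as the calculation as described rather than confirmed pricing.

```python
# Worked example of the overage arithmetic from the question; the rates are
# the question's own figures and may not reflect current CloudKit pricing.
included_tb = 5
actual_tb = 375
rate_per_gb = 0.10

overage_gb = (actual_tb - included_tb) * 1000   # 370 TB -> 370,000 GB
cost = overage_gb * rate_per_gb
print(f"Overage: {overage_gb:,.0f} GB -> ${cost:,.0f}")   # 370,000 GB -> $37,000
```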