PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second so that we have a real-time view of the sensors' measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history of previous days.
Question:
This raises the question of how to delete existing rows from the database while aggregating the deleted data and inserting a new row that averages it over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked whether PostgreSQL has anything resembling this functionality but didn't find anything. My main idea is to use a cron job that periodically performs the aggregations and deletions, turning raw data into aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
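A minimal sketch of what that cron-driven roll-up could look like in plain SQL, assuming a hypothetical raw table sensor_raw(sensor_id, ts, value) and an aggregate table sensor_5min(sensor_id, bucket_ts, avg_value) rather than Thingsboard's actual telemetry schema:

```sql
-- Roll raw rows older than one day up into 5-minute averages,
-- then remove those raw rows, all in one transaction.
BEGIN;

WITH moved AS (
    DELETE FROM sensor_raw
    WHERE ts < now() - interval '1 day'
    RETURNING sensor_id, ts, value
)
INSERT INTO sensor_5min (sensor_id, bucket_ts, avg_value)
SELECT sensor_id,
       to_timestamp(floor(extract(epoch FROM ts) / 300) * 300) AS bucket_ts,
       avg(value)
FROM   moved
GROUP  BY sensor_id, bucket_ts;

COMMIT;
```

Because the DELETE ... RETURNING feeds the INSERT inside a single statement, the raw rows are either aggregated and removed together or not at all; the job itself could be driven by an OS cron entry or the pg_cron extension.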

Related

Line chart data handling issue

What is the best way to handle line chart data sent from an API?
We have a chart like this. It has hours, days, weeks, and months of data, so how can I manage it in an easy way?
I tried plain X,Y values, but there is bulk data in every category, so it is hard to handle.
My question is about what data I should get from the server so I can show this bulk data easily.
As far as I can figure out, you are facing a data processing issue: you convert all of the per-second data into hour, month, and other required formats every second.
If that is the issue, you can follow these steps to avoid processing the data every second.
Process all data only once, when you receive it for the first time.
The next second, when new data arrives, do not reprocess everything. Just take the latest data point and show it in the heart rate display.
You can reprocess all data again after some time, based on your requirements (this can take time).
As far as I know you are receiving all data every second, but your chart's minimum time bucket is an hour, so you have to process the same data every second, which consumes extra time and results in delay.
If this is not the exact issue you are facing, then please update your question with the exact issue.
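If the readings are stored in a SQL database (an assumption; the question doesn't say what the backend is), another option is to have the API return pre-bucketed points instead of raw per-second samples. A rough sketch against a hypothetical heart_rate(recorded_at, bpm) table:

```sql
-- One point per hour for the last 7 days: the client renders a few
-- hundred values instead of hundreds of thousands of raw samples.
SELECT date_trunc('hour', recorded_at) AS bucket,
       avg(bpm) AS avg_bpm,
       min(bpm) AS min_bpm,
       max(bpm) AS max_bpm
FROM   heart_rate
WHERE  recorded_at >= now() - interval '7 days'
GROUP  BY bucket
ORDER  BY bucket;
```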

Re-index data more than once in Apache Druid

I want to get aggregation results for the last hour and the last day from Druid. Most of the queries I use are ad-hoc queries. I want to ask two questions:
1- Is it a good idea to ingest all raw data without rollup? Without rollup, can I re-index the data multiple times? For example, one task re-indexes the data to find unique user counts for each hour, and another task re-indexes the same data to find the total count for each 10 minutes.
2- If rollup is enabled to compute some basic summaries, it prevents getting information back from the raw data (because it is summarized). When I want to re-index the data, some useful information may not be found. Is it good practice to enable rollup in streaming mode?
Whether to enable rollup depends on your data size. Normally we keep data outside of Druid so we can replay it and re-index it into different data sources. If you have a reasonable amount of data, you can set your segment granularity to hours/day/week/month, ensuring that each segment doesn't exceed the ideal segment size (about 500 MB is recommended), and set the query granularity to none at index time, so you can do the unique and total count aggregation at query time.

You can actually set your query granularity at index time to 10 minutes and it can still give you uniques per hour and the total count received per hour.

Also, you can index data into multiple data sources if that's what you are asking. If you are re-indexing data into the same data source, it will create duplicates and skew your results.
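As a rough illustration of that query-time aggregation, two Druid SQL queries over a hypothetical events datasource with a user_id column (both names are assumptions):

```sql
-- Unique users and total events per hour, computed at query time
SELECT TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
       APPROX_COUNT_DISTINCT(user_id) AS unique_users,
       COUNT(*) AS total_events
FROM events
GROUP BY TIME_FLOOR(__time, 'PT1H')

-- Total events per 10-minute bucket from the same raw data
SELECT TIME_FLOOR(__time, 'PT10M') AS ten_minute_bucket,
       COUNT(*) AS total_events
FROM events
GROUP BY TIME_FLOOR(__time, 'PT10M')
```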
It depends on your use case. Rollup will give you better performance and space optimization in the Druid cluster. Ideally, I would suggest keeping your archived data separate, in a replayable format, so it can be reused.

What are some strategies to efficiently store a lot of data (millions of rows) in Postgres?

I host a popular website and want to store certain user events to analyze later. Things like: clicked on item, added to cart, removed from cart, etc. I imagine about 5,000,000+ new events would be coming in every day.
My basic idea is to take the event, and store it in a row in Postgres along with a unique user id.
What are some strategies to handle this much data? I can't imagine one giant table is realistic. I've had a couple people recommend things like: dumping the tables into Amazon Redshift at the end of every day, Snowflake, Google BigQuery, Hadoop.
What would you do?
I would partition the table, and as soon as you no longer need the detailed data in the live system, detach a partition and export it to an archive and/or aggregate it and put the results into a data warehouse for analysis.
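A brief sketch of that pattern with declarative partitioning (PostgreSQL 10+), using a hypothetical events table; the table and column names are placeholders:

```sql
-- Parent table, partitioned by event time
CREATE TABLE events (
    user_id    bigint      NOT NULL,
    event_type text        NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- One partition per day, typically created ahead of time by a job
CREATE TABLE events_2024_01_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- Once the detailed data is no longer needed in the live system:
-- detach the partition, archive and/or aggregate it, then drop it.
ALTER TABLE events DETACH PARTITION events_2024_01_01;
```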
We have a similar use case with PostgreSQL 10 and 11. We collect different metrics from customers' websites.
We have several partitioned tables for different data, and together they receive more than 300 million rows per day, i.e. 50-80 GB of data daily. On some special days even 2x-3x more.
The collecting database keeps data only for the current and previous day (especially around midnight there can be a big mess with timestamps from different parts of the world).
On previous PG 9.x versions we transferred the data once per day to our main PostgreSQL warehouse DB (currently 20+ TB). Now we have implemented logical replication from the collecting database into the warehouse, because syncing whole partitions had lately become really heavy and slow.
Besides that, we copy new data daily to BigQuery for really heavy analytical processing, which on PostgreSQL would take 24+ hours (real-life results, trust me). On BQ we get results in minutes, but sometimes pay a lot for it...
So daily partitions are a reasonable segmentation, and especially with logical replication you do not need to worry. From our experience I would recommend not doing any exports to BQ etc. from the collecting database, only from the warehouse.
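For reference, the logical replication setup described above boils down to a publication on the collecting database and a subscription on the warehouse; the names below are placeholders:

```sql
-- On the collecting database: publish the tables to replicate.
-- (On PG 10/11 the leaf partitions are published individually;
--  PG 13+ can publish the partitioned parent directly.)
CREATE PUBLICATION metrics_pub
    FOR TABLE metrics_2024_01_01, metrics_2024_01_02;

-- On the warehouse database: subscribe and stream changes continuously.
CREATE SUBSCRIPTION metrics_sub
    CONNECTION 'host=collector.example.com dbname=collect user=replicator'
    PUBLICATION metrics_pub;
```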

MongoDB for huge amount of data

I need to get Weather data from almost 200 German cities.
The point is that I need to store the data since the beginning of this year, for every single day, including the temperature during the hours of the day (hourly temperature) and the min and max temperature for the whole day.
I know that is a huge amount of data, and it could be even bigger because it is not yet decided whether we will also get the historical weather data from 10 years ago until now. Besides that, the number of cities included could grow to add cities from other countries.
Is MongoDB a good way to save this data? If not, which method would be better to do it?
You can use MongoDB for weather data. MongoDB is flexible and document-based: you can store JSON-like binary (BSON) data points in one place without having to define in advance what “types” of data they are.
MongoDB is a schema-less database, can load a high volume of data, and is very easy to scale. It supports sharding, which is the process of storing data across different machines as the data size grows. This provides horizontal scaling, so more data can be written.
It is used by The Weather Channel, because weather changes quickly and they turned to MongoDB to get information to users fast; changes that used to take weeks can now be pushed out in hours. So a MongoDB database would be more than capable of handling that amount of weather data.

What could cause duplicate rows in fetched Google Analytics reports?

I'm working on a tool to fetch about 3 years of historic data from a site, in order to perform some data analysis & machine learning.
The dimensions of the report I am requesting are:
[ ga:cityId, ga:dateHour, ga:userType, ga:deviceCategory ]
And my starting point is to import to a postgres db (the data may live elsewhere eventually but we have Good Reasons for starting with a relational database).
I've defined a unique index on the [ ga:cityId, ga:dateHour, ga:userType, ga:deviceCategory ] tuple for the postgres table, and my import job currently routinely fails every 30000-50000 rows due to a duplicate of that tuple.
What would cause google to return duplicate rows?
I'm batching the inserts by 1000 rows / statement because row-at-a-time would be very time consuming, so I think my best workaround is to disable the unique index for the duration of the initial import, de-dupe, and then re-enable it and do daily imports of fresh data row-at-a-time. Other strategies?
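A hedged sketch of that workaround, assuming the dimensions map to hypothetical columns city_id, date_hour, user_type and device_cat in a ga_sessions table:

```sql
-- 1. Bulk-load the historic export with no unique index in place,
--    so the 1000-row batches cannot fail on duplicates.

-- 2. De-duplicate: keep a single row per dimension tuple.
CREATE TABLE ga_sessions_dedup AS
SELECT DISTINCT ON (city_id, date_hour, user_type, device_cat) *
FROM   ga_sessions
ORDER  BY city_id, date_hour, user_type, device_cat;

DROP TABLE ga_sessions;
ALTER TABLE ga_sessions_dedup RENAME TO ga_sessions;

-- 3. Re-create the unique index for the daily incremental imports.
CREATE UNIQUE INDEX ga_sessions_dims_uq
    ON ga_sessions (city_id, date_hour, user_type, device_cat);

-- Daily imports can then skip duplicates instead of aborting:
-- INSERT INTO ga_sessions (...) VALUES (...) ON CONFLICT DO NOTHING;
```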
There shouldn't be duplicate rows coming back from Google if the time ranges are unique.
Are you using absolute or relative (to the present) dates? If the latter, you should ensure that changes in the time period caused by the progression of the relative time (i.e. the present) don't cause an overlap.
Using relative time periods could also cause gaps in your data.