TimescaleDB vs. InfluxDB for PLC logging - postgresql

I am completely new to logging into a database at all, and that's probably why there will be some stupid questions here; I hope that's OK.
I would like to log data from a Beckhoff PLC controller into a DB that is placed on the same IPC as my PLC.
The Beckhoff PLC has a direct link function to both InfluxDB and to PostgreSQL, which TimescaleDB is based on, so the connection will work fine.
We would like to log data against time so we can go back and see when certain things happened, and also query the database based on time.
I have been talking to different people and most of them recommend TimescaleDB, so it would be great to hear the differences between the two and what you guys would recommend I choose.
The data size we will log is pretty small.
We will have a data structure that contains about 10 INT registers, so 20 bytes.
We will write to the database every 1 second on the fast machines and sometimes only once every 20 minutes, but this part I will control in my PLC.
So putting data into the DB will, I believe, be pretty straightforward, but then I have some thoughts about what I would like to do and what is possible.
Is it possible to query the DB for the count, highest value, lowest value and mean value over the last 60 minutes, 24 hours etc., and have the database return these values based on the time frame I give it in my query?
The resolution I log with, which is controlled from the PLC, only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both these databases, and is there any benefit in one of them? Maybe easier in one of them?
Is there, in one of these two databases, a possibility to not write to the HD / disk every time my PLC sends data to it? Or will it write to disk automatically every time? I read about something called WAL; what is that, and will it use RAM to buffer the data before writing to disk so that writes happen less often?
Is there any big difference in setting up these two databases?
I probably have more questions, but these are the main functions that I need in the system.
Many thanks

Is it possible to query the DB for the count, highest value, lowest value and mean value over the last 60 minutes, 24 hours etc., and have the database return these values based on the time frame I give it in my query?
Yes! You can do that with plain SQL queries. Consider the following table structure:
-- Raw readings, one row per sample
CREATE TABLE conditions (
    time        TIMESTAMPTZ NOT NULL,
    device      INTEGER NOT NULL,
    temperature FLOAT NOT NULL
);

-- Turn the table into a TimescaleDB hypertable, partitioned on the time column
SELECT * FROM create_hypertable('conditions', 'time');

-- Enable native compression, ordering compressed data by time
ALTER TABLE conditions SET (timescaledb.compress, timescaledb.compress_orderby = 'time');
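For example, a query like the following (a minimal sketch against the conditions table above; swap the interval for '24 hours' etc. as needed) returns the count, highest, lowest and mean value for the last 60 minutes:
SELECT count(*)         AS amount,
       max(temperature) AS highest,
       min(temperature) AS lowest,
       avg(temperature) AS mean
FROM conditions
WHERE time > now() - INTERVAL '60 minutes';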
The resolution I log with, which is controlled from the PLC, only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both these databases, and is there any benefit in one of them? Maybe easier in one of them?
You can create a continuous aggregate, which is a fast way to keep your summarized data materialized.
CREATE MATERIALIZED VIEW conditions_hourly(time, device, low, high, average)
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS time,
       device,
       min(temperature) AS low,
       max(temperature) AS high,
       avg(temperature) AS average
FROM conditions
GROUP BY 1, 2;
Then you can add a retention policy so that only the last 7 days of raw data are kept.
SELECT add_retention_policy('conditions', INTERVAL '7 day');
And add a continuous aggregate policy that will keep your view up to date, refreshing it every hour:
SELECT add_continuous_aggregate_policy('conditions_hourly',
    start_offset      => INTERVAL '1 day',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
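Once the view exists, reading the downsampled data is an ordinary query; for example (a sketch assuming the conditions_hourly view above), the hourly statistics for the last 24 hours:
SELECT time, device, low, high, average
FROM conditions_hourly
WHERE time > now() - INTERVAL '24 hours'
ORDER BY time, device;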
Is there, in one of these two databases, a possibility to not write to the HD / disk every time my PLC sends data to it? Or will it write to disk automatically every time? I read about something called WAL; what is that, and will it use RAM to buffer the data before writing to disk so that writes happen less often?
The WAL (write-ahead log) is the journal PostgreSQL writes each change to before it touches the data files; by default every commit waits until its WAL record is safely flushed to disk. In PostgreSQL you can relax that with asynchronous commit: https://www.postgresql.org/docs/current/wal-async-commit.html
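As a rough sketch, the relevant setting is synchronous_commit; turning it off means a commit returns before its WAL record is flushed to disk, so a crash can lose the last fraction of a second of commits but will not corrupt the database:
-- Per session (only this connection is affected):
SET synchronous_commit TO OFF;

-- Or as the server-wide default (takes effect after a configuration reload):
ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();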

Related

Question regarding pg_stat_reset() in postgresql

I have a question regarding the function pg_stat_reset(). I am trying to collect database table stats on a regular basis, and for that purpose I am using stats from various Postgres views like pg_stat_all_tables and pg_stat_database. But the data in these views is accumulated over time, so I am planning to reset it daily. I just want to know if it is safe to run this function daily or hourly on a production database. And if it is safe, are there any settings PG offers by which I can reset stats at a regular interval, apart from setting up a cron job that runs pg_stat_reset()?
https://www.postgresql.org/docs/12/monitoring-stats.html
Don't do it; it will interfere with the proper functioning of autovacuum. Rather than resetting the counters to zero, just take snapshots of the counters and subtract one snapshot from a later one.
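A minimal sketch of that snapshot approach (stats_snapshot is a hypothetical table name; keep whichever counters you actually care about):
-- First run: create a snapshot table from the live counters.
CREATE TABLE stats_snapshot AS
SELECT now() AS snapshot_time, schemaname, relname,
       seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables;

-- Every later run (e.g. from cron): append another snapshot.
INSERT INTO stats_snapshot
SELECT now(), schemaname, relname,
       seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables;

-- Activity in a period is then the later snapshot's counters minus the earlier one's.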

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second, such that we can have a real time view of the sensors measurements.
This, however, is very demanding on storage and is not a requirement beyond enabling real-time monitoring. For example, there is no need to check measurements made last week with such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that averages the deleted data over a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.
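As a rough sketch of what such a periodic job could run (measurements(ts, device, value) and measurements_5min(bucket, device, avg_value) are hypothetical table names standing in for the real schema):
BEGIN;

-- Roll raw rows older than one day up into 5-minute averages.
INSERT INTO measurements_5min (bucket, device, avg_value)
SELECT to_timestamp(floor(extract(epoch FROM ts) / 300) * 300) AS bucket,
       device,
       avg(value)
FROM measurements
WHERE ts < now() - INTERVAL '1 day'
GROUP BY 1, 2;

-- Then remove the raw rows that were just aggregated.
DELETE FROM measurements
WHERE ts < now() - INTERVAL '1 day';

COMMIT;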

What are some strategies to efficiently store a lot of data (millions of rows) in Postgres?

I host a popular website and want to store certain user events to analyze later. Things like: clicked on item, added to cart, removed from cart, etc. I imagine about 5,000,000+ new events would be coming in every day.
My basic idea is to take the event, and store it in a row in Postgres along with a unique user id.
What are some strategies to handle this much data? I can't imagine one giant table is realistic. A couple of people have recommended things like dumping the tables into Amazon Redshift at the end of every day, Snowflake, Google BigQuery, or Hadoop.
What would you do?
I would partition the table, and as soon as you don't need the detailed data in the live system, detach a partition and export it to an archive and/or aggregate it and put the results into a data warehouse for analysis.
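A minimal sketch of that idea with PostgreSQL declarative partitioning (the events table and its columns are just placeholders):
-- Parent table partitioned by day on the event timestamp.
CREATE TABLE events (
    event_time timestamptz NOT NULL,
    user_id    bigint NOT NULL,
    event_type text NOT NULL
) PARTITION BY RANGE (event_time);

-- One partition per day.
CREATE TABLE events_2024_01_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-02');

-- When the day is no longer needed in the live system, detach it
-- (a quick metadata change) and archive or aggregate the detached table.
ALTER TABLE events DETACH PARTITION events_2024_01_01;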
We have a similar use case with PostgreSQL 10 and 11. We collect different metrics from customers' websites.
We have several partitioned tables for different data, and together we collect more than 300 million rows per day, i.e. 50-80 GB of data daily. On some special days even 2x-3x more.
The collecting database keeps data for the current and previous day only (especially around midnight there can be a big mess with timestamps coming from different parts of the world).
On previous PG 9.x versions we transferred data once per day to our main PostgreSQL warehouse DB (currently 20+ TB). Now we have implemented logical replication from the collecting database into the warehouse, because syncing whole partitions had become really heavy and slow.
Besides that, we copy new data daily to BigQuery for really heavy analytical processing, which on PostgreSQL would take 24+ hours (real-life results - trust me). On BQ we get results in minutes, but sometimes pay a lot for it...
So daily partitions are a reasonable segmentation, and with logical replication especially, you do not need to worry. From our experience I would recommend not doing any exports to BQ etc. from the collecting database - only from the warehouse.
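For reference, the logical replication setup itself is small; a sketch with placeholder names:
-- On the collecting database: publish the tables that should flow to the warehouse.
CREATE PUBLICATION metrics_pub FOR TABLE metrics_raw;

-- On the warehouse (the tables must already exist there): subscribe to that publication.
CREATE SUBSCRIPTION metrics_sub
    CONNECTION 'host=collector dbname=metrics user=replicator'
    PUBLICATION metrics_pub;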

How to create KDB query which groups by time interval and do not bring RDB down?

We receive quotes from an exchange and store them in a KDB Ticker Plant. We want to analyze volume in the RDB and HDB with minimum impact on the performance of these databases, since they are also used by other teams.
Firstly, how may we create a function which splits a day into 10-minute intervals and, for each interval, creates volume statistics? Which KDB functions do we need to use?
Secondly, how to do it safely? Should we extract records in a loop portion by portion or in one go with one query? We have around 150 million records for each day in our database.
I'll make some assumptions about table and column names, which I'm sure you can extrapolate
We receive quotes from an exchange and store them in a KDB Ticker Plant
As a matter of definition, the tickerplant only stores data for a very small amount of time; it then logs it to file and fires the data off to the RDB (and other listeners).
with minimum impact on the performance of these databases
It all depends on (a) your data volume and (b) how optimal your where clause is. It also depends on whether you have enough RAM on your machine to cope with the queries. The closer you get to that limit, the harder it is for the OS to allocate memory, and therefore the longer the query takes (although memory allocation time pales in comparison to getting data off a disk - so disk speed is also a factor).
Firstly, how may we create a function which splits a day into 10-minute intervals and, for each interval, creates volume statistics?
Your friend here is xbar: http://code.kx.com/q/ref/arith-integer/#xbar
getBy10MinsRDB:{[instrument;mkt]
  select high:max volume, low:min volume, total:sum volume, average:avg volume by 10 xbar `minute$time from table where sym=instrument, market=mkt
  };
For an HDB the most optimal query (for a date-parted database) is date then sym then time. In your case you haven't asked for time, so I omit it.
getBy10MinsHDB:{[dt;instrument;mkt]
  select high:max volume, low:min volume, total:sum volume, average:avg volume by 10 xbar `minute$time from table where date=dt, sym=instrument, market=mkt
  };
Should we extract records in a loop portion by portion or in one go with one query?
No, that's the absolute worst way of doing things in KDB :-) there's almost always a nice vector-ised solution.
We have around 150 million records for each day in our database.
Since KDB is a columnar database, the types of the columns you have are as important as the number of records; as that impacts memory.
because they are also used by other teams
If simple queries like the above are causing issues, you need to consider splitting the table up, perhaps by market, to reduce query clashes and load. If memory isn't an issue, consider -s on HDBs for multithreaded queries (over multiple days). Consider a negative port number on the HDB for a multithreaded input queue to minimise query clashes (although it doesn't necessarily make things faster).

Software for collected sensor data analysis

I have an application that collects data from about a dozen sensors in a Smart House. It stores its data in a MySQL database with the following table format:
CREATE TABLE IF NOT EXISTS `datapoints` (
`PointID` int(11) NOT NULL,
`System` varchar(50) NOT NULL,
`Sensor` varchar(50) NOT NULL,
`Value` varchar(50) NOT NULL,
`Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`PointID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The System field is for grouping sensors, for example, the "Air" system has "Temperature" and "Humidity" sensors, and the "Solar Panel" system has "Power production (kW)" and "Production Today (kWh)". The fields are all varchar because there are several datatypes coming in from the sensors and the original database designer took the easy way out. (I know that this data structure isn't very efficient but it's too late to change it.)
The sensors include air temperature, humidity, solar panel output, solar water-heater temperature, and others.
We now have this database which collects tens of thousands of data points every day. Until now, we have used an application that queries the database to build graphs of sensor data over time. However, we now have many gigabytes of data and will eventually run out of storage space on the logging hardware.
I am looking for a way to collect statistics from the data and then delete it. I am thinking of something a lot like [Google Analytics | Piwik | Awstats] for the data. The problem is that I don't really know where to start. I want to be able to look at more detailed data from more recent times, for example:
1 day's worth of all data
1 week's worth of hourly data
1 month's worth of daily data
I think I want to keep the weekly and monthly stats forever.
However, I don't want to smooth the data too much. Eventually I will have to smooth the data, but I want to keep it detailed as long as possible. For example, if a big spike in power production is smoothed (lowered) into the hourly data, then again (lower) in the daily data, and then again (lower) in the weekly data, the week's data will not reflect that there was a spike, since the average of averages isn't the same as the average of all the points.
Is there any software that already does this? If not, what is a good way to start? I can do it in any language, but preference goes to .NET, PHP, or C (for Windows), (in that order) because those are languages that other project members already know and that the logging hardware already has set up.
Your data problem is so big and potentially open-ended that I don't think there is any single tool that will solve it. You will likely need to build your own tools for your specific problem.
I think you should take a look at the Python-based tools used by the science and engineering communities. This includes IPython and Matplotlib for interactive data analysis and visualization. Use NumPy and SciPy for handling and processing large data arrays. Finally, consider scikit-learn for when you need to do some serious number crunching.
Good luck.
If I were still doing this project today (and for other projects of this type), I would use a Time Series Database (TSDB).
A TSDB is specifically designed for ingesting large volumes of data points over time and allowing analysis of them.
I have now been playing with the TimeScale extension for PostgreSQL for another project, and it would have done exactly what I needed.