Software for collected sensor data analysis and averaging

I have an application that collects data from about a dozen sensors in a Smart House. It stores its data in a MySQL database with the following table format:
CREATE TABLE IF NOT EXISTS `datapoints` (
`PointID` int(11) NOT NULL,
`System` varchar(50) NOT NULL,
`Sensor` varchar(50) NOT NULL,
`Value` varchar(50) NOT NULL,
`Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`PointID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The System field is for grouping sensors, for example, the "Air" system has "Temperature" and "Humidity" sensors, and the "Solar Panel" system has "Power production (kW)" and "Production Today (kWh)". The fields are all varchar because there are several datatypes coming in from the sensors and the original database designer took the easy way out. (I know that this data structure isn't very efficient but it's too late to change it.)
The sensors include air temperature, humidity, solar panel output, solar water-heater temperature, and others.
We now have this database which collects tens of thousands of data points every day. Until now, we have used an application that queries the database to build graphs of sensor data over time. However, we now have many gigabytes of data and will eventually run out of storage space on the logging hardware.
I am looking for a way to collect statistics from the data and then delete it. I am thinking of something a lot like [Google Analytics | Piwik | Awstats] for the data. The problem is that I don't really know where to start. I want to be able to look at more detailed data from more recent times, for example:
1 day's worth of all data
1 week's worth of hourly data
1 month's worth of daily data
I think I want to keep the weekly and monthly stats forever.
However, I don't want to smooth the data too much. Eventually I will have to smooth it, but I want to keep it detailed for as long as possible. For example, if there is a big spike in power production and it gets averaged down into the hourly data, then averaged again into the daily data, and again into the weekly data, the weekly figures will no longer show that there was a spike at all; repeated averaging throws away the extremes, and an average of averages is not even the average of all the points unless every bucket holds the same number of samples.
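Concretely, the kind of hourly roll-up I have in mind would keep the extremes alongside the average, so a spike still shows up. Written against the schema above as a sketch only (the roll-up table is hypothetical, Value has to be cast because it is stored as varchar, and the cast only works for the numeric sensors):
CREATE TABLE IF NOT EXISTS `datapoints_hourly` (
  `System` varchar(50) NOT NULL,
  `Sensor` varchar(50) NOT NULL,
  `Hour` datetime NOT NULL,
  `AvgValue` double NOT NULL,
  `MinValue` double NOT NULL,
  `MaxValue` double NOT NULL,
  `SampleCount` int NOT NULL,
  PRIMARY KEY (`System`, `Sensor`, `Hour`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- Roll one example day of raw readings up into hourly statistics; the raw rows could then be deleted
INSERT INTO `datapoints_hourly`
SELECT `System`, `Sensor`,
       DATE_FORMAT(`Timestamp`, '%Y-%m-%d %H:00:00') AS `Hour`,
       AVG(CAST(`Value` AS DECIMAL(18,6))),
       MIN(CAST(`Value` AS DECIMAL(18,6))),
       MAX(CAST(`Value` AS DECIMAL(18,6))),
       COUNT(*)
FROM `datapoints`
WHERE `Timestamp` >= '2014-01-01' AND `Timestamp` < '2014-01-02'
GROUP BY `System`, `Sensor`, `Hour`;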
Is there any software that already does this? If not, what is a good way to start? I can do it in any language, but preference goes to .NET, PHP, or C (for Windows), (in that order) because those are languages that other project members already know and that the logging hardware already has set up.

Your data problem is so big and potentially open-ended that I don't think any single tool will solve it. You will likely need to build your own tooling for your specific problem.
I think you should take a look at the Python-based tools used by the science and engineering communities. This includes IPython and Matplotlib for interactive data analysis and visualization. Use NumPy and SciPy for handling and processing large data arrays. Finally, consider scikit-learn for when you need to do some serious number crunching.
Good luck.

If I were still doing this project today (and for other projects of this type), I would use a Time Series Database (TSDB).
A TSDB is specifically designed for ingesting large volumes of data points over time and allowing analysis of them.
I have since been playing with the TimescaleDB extension for PostgreSQL on another project, and it would have done exactly what I needed.
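As a rough sketch only (adapting the question's table to TimescaleDB; the names and types here are illustrative, not the original project's code):
-- Raw readings as a hypertable, partitioned by time
CREATE TABLE datapoints (
    time   TIMESTAMPTZ      NOT NULL,
    system TEXT             NOT NULL,
    sensor TEXT             NOT NULL,
    value  DOUBLE PRECISION NOT NULL
);
SELECT create_hypertable('datapoints', 'time');

-- Hourly statistics for the last week, computed on demand from the raw data
SELECT time_bucket('1 hour', time) AS hour,
       system, sensor,
       avg(value), min(value), max(value)
FROM datapoints
WHERE time > now() - INTERVAL '7 days'
GROUP BY hour, system, sensor
ORDER BY hour;
The same roll-ups can be materialized with continuous aggregates and the raw rows expired with a retention policy, which is essentially the keep-detail-then-summarize scheme described in the question.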

TimescaleDB vs. InfluxDB for PLC logging

I am 100% new to logging to a database at all, so there will probably be some stupid questions here; I hope that's OK.
I would like to log data from a Beckhoff PLC controller into a DB running on the same IPC as the PLC.
The Beckhoff PLC has a direct link function to both InfluxDB and to PostgreSQL, which TimescaleDB is based on, so the connection will work fine.
We would like to log data against time so we can go back and see when certain things happened, and also query the database based on time.
I have been talking to different people and most of them recommend TimescaleDB, so it would be great to hear the differences between the two and what you would recommend I choose.
The data size we will log is pretty small.
We will have a structure of data containing about 10 INT registers, so 20 bytes.
We will write to the database every 1 second on the quick machines and sometimes only once every 20 minutes, but that part I control in my PLC.
So putting data into the DB should be pretty straightforward, but I have some thoughts about what I would like to do and what is possible.
Is it possible to query the DB for the count, highest value, lowest value, and mean value over the last 60 minutes, 24 hours, etc., and have the database return those values for the time frame I give it?
The resolution I log with (controlled from the PLC) only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both databases, and does one of them have an advantage or make it easier?
Is there a possibility, in either of these databases, to avoid writing to disk every time my PLC inserts data, or will it write to disk automatically every time? I read about something called WAL; what is that, and can the data be buffered in RAM so that writes to disk happen less often?
Is there any big difference in setting up these two databases?
I probably have more questions, but these are the main functions I need in the system.
Many thanks
Is it possible to query the DB for the count, highest value, lowest value, and mean value over the last 60 minutes, 24 hours, etc., and have the database return those values for the time frame I give it?
Yes! You can do that with plain queries. Consider the following table structure:
CREATE TABLE conditions (
time TIMESTAMPTZ NOT NULL,
device INTEGER NOT NULL,
temperature FLOAT NOT NULL
);
SELECT * FROM create_hypertable('conditions', 'time');
ALTER TABLE conditions SET (timescaledb.compress, timescaledb.compress_orderby='time');
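For example, the count, lowest, highest and mean temperature over the last 60 minutes can be pulled per device with a plain aggregate query (swap the interval for '24 hours' or whatever window the PLC asks for):
SELECT device,
       count(*)         AS samples,
       min(temperature) AS lowest,
       max(temperature) AS highest,
       avg(temperature) AS mean
FROM conditions
WHERE time > now() - INTERVAL '60 minutes'
GROUP BY device;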
The resolution I log with (controlled from the PLC) only needs to be that high for 7 days; after that I would like to "downsample / compress" the data. Is that possible in both databases, and does one of them have an advantage or make it easier?
You can create a continuous aggregate, which is a fast way to keep your summarized data materialized.
CREATE MATERIALIZED VIEW conditions_hourly(time, device, low, high, average)
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS time,
       device,
       min(temperature) AS low,
       max(temperature) AS high,
       avg(temperature) AS average
FROM conditions
GROUP BY 1, 2;
You can then add a retention policy to keep only the last 7 days of raw data:
SELECT add_retention_policy('conditions', INTERVAL '7 day');
And add a continuous aggregate policy that will keep your view up to date, refreshing every hour:
SELECT add_continuous_aggregate_policy('conditions_hourly',
start_offset => INTERVAL '1 day',
end_offset => INTERVAL '1 hour',
schedule_interval => INTERVAL '1 hour');
Is there a possibility, in either of these databases, to avoid writing to disk every time my PLC inserts data, or will it write to disk automatically every time? I read about something called WAL; what is that, and can the data be buffered in RAM so that writes to disk happen less often?
In PostgreSQL you can use asynchronous commit. (The WAL is the write-ahead log: changes are recorded there and flushed to disk before a commit is acknowledged, which is what makes commits durable; asynchronous commit relaxes that flush, so recent commits sit in RAM briefly before being written out.) See https://www.postgresql.org/docs/current/wal-async-commit.html
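For example (note the trade-off: a crash can lose the last few hundred milliseconds of commits, so only do this if that is acceptable for your logging):
-- Per session or per transaction:
SET synchronous_commit = off;

-- Or server-wide:
ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();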

How to design the TimescaleDB schema for engineering data

For a project at the engineering firm I work for, I want to use TimescaleDB to store experiment results.
I'm trying to figure out the DB schema that will deliver the best read performance.
In my use case, I will be collecting several sensor logs for several experiments. For each experiment, time starts from 0.
Thinking as a newbie, I see a couple of options:
1 - each experiment has its own table. The table would have as many columns as there are sensors. I bet this is a terrible solution.
2 - each sensor has its own table, named after the sensor, with two columns: the value and an ID that identifies the experiment. In this table there would be many duplicate time values, since each experiment's time starts from 0.
I'm not sure either of these solutions is actually going to be good.
What kind of schema should I use?
Thank you in advance,
Guido
Update1:
This is the schema I'm testing out right now:
After loading data from about 10 tests (a few hundred sensors logged at 100 Hz), the Ch_Values table already has 180M rows and queries are terribly slow. I added indexes on ch_index, run_index, and time, and now it's dramatically better. This is only ~10 tests, though; in reality the DB will contain hundreds of tests.
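Roughly, the layout I'm testing corresponds to something like this (the types and the value column name here are only indicative):
-- Narrow layout: one row per sample
CREATE TABLE Ch_Values (
    time      DOUBLE PRECISION NOT NULL,  -- seconds since the start of the run, restarts at 0 for every test
    run_index INTEGER          NOT NULL,  -- identifies the experiment/test
    ch_index  INTEGER          NOT NULL,  -- identifies the sensor/channel
    value     DOUBLE PRECISION
);
CREATE INDEX ON Ch_Values (run_index, ch_index, time);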
Any suggestions on how to store this data efficiently?

Should I use a time-series database to store real estate price information?

I'm working on a personal project to analyze real estate data from three different sites. I want to do some statistical analysis (Python, NumPy, pandas, scikit-learn) and data visualization on the data to identify trends, outliers, market variations, opportunities, clusters, etc.
Part of the information to store:
Price (stored daily)
ID
Property age
Location (initially a string, eventually geo coordinates)
Amenities
Publication date
Square footage
Parking spaces
The total number of properties is about 250,000. Initially I'll download the information daily to understand the characteristics of the data. After that, I will change the refresh rate, probably to twice a week.
I'm thinking about using a relational database (PostgreSQL) for the non-time-dependent data and a time-series database (InfluxDB or Graphite) for the prices.
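Concretely, I was picturing the relational side like this (table and column names are just placeholders), with the daily price points going to the time-series store keyed by property ID:
-- Static / slowly changing attributes of each listing
CREATE TABLE properties (
    property_id      BIGINT PRIMARY KEY,
    source_site      TEXT NOT NULL,   -- which of the three sites it came from
    location         TEXT,            -- later: geo coordinates
    property_age     INTEGER,
    square_feet      NUMERIC,
    parking_spaces   INTEGER,
    amenities        TEXT[],
    publication_date DATE
);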
What do you think? Am I choosing the right stack?

Calculating price drops or apps going free - App Store

I am working on a website that displays all the apps from the App Store. I get App Store data from their EPF Data Feeds through the EPF Importer. In that database I get the pricing of each app for every storefront, so there are dozens of rows per app. The relevant table's structure is:
application_price
The retail price of an application.

Name           | Key | Description
export_date    |     | The date this application was exported, in milliseconds since the UNIX Epoch.
application_id |  Y  | Foreign key to the application table.
retail_price   |     | Retail price of the application, or null if the application is not available.
currency_code  |     | The ISO3A currency code.
storefront_id  |  Y  | Foreign key to the storefront table.
This is the table I get. My problem is that I can't work out how to calculate app price reductions and newly free apps from this particular dataset. Does anyone have an idea how I can calculate it?
Any idea or answer will be highly appreciated.
I tried storing the previous data and the current data and then matching them. The problem is that the table itself is too large, and the comparison requires a JOIN that pushes query execution time to more than an hour, which I cannot afford; there are approximately 60,000,000 rows in the table.
With these fields alone you can't directly determine price drops or new applications. You'll have to insert the feed into your own database and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, add your own "first_seen" column, and then query your database for all rows whose first_seen value is no more than a day old.
To calculate price drops, compute the difference between the retail_price of the current import and that of the previous import (see the sketch below).
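A rough sketch of both (everything except the EPF field names is hypothetical; it assumes the feed has already been loaded into your own tables):
-- New applications: first seen within the last day
SELECT application_id
FROM app_prices
WHERE first_seen >= NOW() - INTERVAL 1 DAY;

-- Price drops: compare the current import with the previous one
SELECT cur.application_id,
       prev.retail_price AS old_price,
       cur.retail_price  AS new_price
FROM app_prices AS cur
JOIN app_prices_previous AS prev
     ON  prev.application_id = cur.application_id
     AND prev.storefront_id  = cur.storefront_id
WHERE cur.retail_price < prev.retail_price;
-- (add AND cur.retail_price = 0 to find apps that have just gone free)
With composite indexes on (application_id, storefront_id) in both tables, the join stays cheap.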
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you have to make sure you have the correct indexes, and profiling your queries should help you find out whether you do.
Also, your environment is probably "write once a day, read many times a minute" (I'm guessing you're building a website), so you could speed up the frontend by computing the differences (price drops and new applications) at import time rather than when rendering the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
Actually, I have also been playing with this data, and I think the best approach for you follows from how Apple delivers it.
You have two types of feeds: full and incremental (updated daily). Within the new data from the incremental feed (which is much smaller than the full one), you can identify only the records that were updated and insert them into another table to track price changes.
That gives you a list of records (app, song, video, ...) updated daily whose price has changed; you just read from that new table instead of comparing or joining the various large tables.
Cheers

Are document databases good for storing large amounts of Stock Tick data? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I was thinking of using a database like MongoDB or RavenDB to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational database such as SQL Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
Example data:
500 symbols * 60 min * 60 sec * 300 days... (per record we store: date, open, high, low, close, volume, openint - all decimal/float)
So what do you guys think?
Since this question was asked in 2010, several database engines have been released or have developed features that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema and organize ticks into an object keyed by seconds (or an object of minutes, each minute being another object of 60 seconds). With a specialized time-series database, you can query the data simply with:
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
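If you would rather have those roll-ups precomputed than calculated on every read, InfluxDB (1.x) can maintain them with a continuous query that writes the daily lows and highs per symbol into a separate measurement; a minimal sketch, assuming a database named "stocks":
CREATE CONTINUOUS QUERY "cq_market_data_daily" ON "stocks"
BEGIN
  SELECT MIN("close") AS "low", MAX("close") AS "high"
  INTO "market_data_daily"
  FROM "market_data"
  GROUP BY time(1d), "symbol"
END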
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice because it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes, you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly and complex.
Really high speed, and as with point #1, the ability to add RAM horizontally to keep that speed up.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and cram it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there is a host of tools, both commercial and open source (such as Pentaho), that will handle the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple datastore such as a key-value or document database is also beneficial in cases where performing analytics reasonably exceeds a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases, it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look for a horizontally scaling processing method for coming up with the per-unit/time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)