I would like some comments on speeding up writing and sending data in my vibration measurement system.
I have a system of 10 Raspberry Pis (both RPi 3B+ and RPi 4), each with a triaxial accelerometer that acquires 3200 samples/second in each of the x, y and z axes. Every minute a new CSV file is generated on each RPi; the recordings for the next minute are appended to this file, which is then closed and sent via FTP to a laptop where all data are gathered and post-processed.
Data within the CSV files are integers and integers only. No headers or similar. Filenames consist of a sensor number and an epoch timestamp, so all filenames are unique.
Should I change format, e.g. to HDF5 (why? it seems to pay off only for larger data tables), pandas/PyTables, the Feather format (which appears to handle binary data only - should I consider converting my data?), pickle, UFF58, ...?
Should I consider another approach instead of generating a new CSV file every minute, e.g. not appending to a file at all but keeping the data in memory until they are needed for writing? (I need the data sent to my laptop every minute.) If so, how can this be done?
Any other considerations for improving I/O performance?
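To make the "keep data in memory" idea concrete, this is roughly what I have in mind on each RPi (numpy assumed; read_sample() and the file naming are placeholders for my real acquisition code): buffer one minute of samples in a preallocated integer array and write it in one sequential binary operation instead of ~192,000 CSV lines.

# Hypothetical sketch: buffer 60 s of 3200 Hz triaxial integer samples in
# memory and write them once per minute as a binary .npy file.
# read_sample() and the file naming are placeholders.
import time
import numpy as np

SAMPLES_PER_MINUTE = 3200 * 60   # 192000 samples per axis
SENSOR_ID = 3                    # example sensor number

def read_sample():
    # placeholder for the real accelerometer read (x, y, z integers)
    return (0, 0, 0)

buf = np.empty((SAMPLES_PER_MINUTE, 3), dtype=np.int32)

while True:
    start_epoch = int(time.time())
    for i in range(SAMPLES_PER_MINUTE):
        buf[i] = read_sample()
    np.save(f"{SENSOR_ID}_{start_epoch}.npy", buf)   # one sequential write
    # the existing FTP step would then pick up the .npy file

On the laptop, np.load gives the array straight back; the same pattern would work with HDF5 via h5py if that turns out to be the better container.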
This is my first post (ever), so if I've violated some rule(s) regarding SO usage, let me know...
I have an IMU connected to a Raspberry Pi and wish to stream the acc/gyro data to my laptop for graphing, analysis, etc. The IMU code on the RPi is functional and the data is stored nicely to file on the RPi.
The first question is related to Linux and files. Assuming I have some code that streams the data, how are the contents of the file containing the data managed? In other words, if the streaming mechanism transfers 1000 lines of data, how does the next stream action avoid reading the same data again? Is this something I have to handle in my code, or does Linux and/or "streaming" deal with it? I may have a misunderstanding of what "streaming" means.
From SO, I've taken an example of TCP streaming and implemented it, and that all works, but it generates random data itself. I want it to read the acc/gyro data from the file. Should the streaming code just shoot the data across as in the example code, or should I use the GET/SET method? I'm not a TCP/network/HTML guy and have no insight into what is appropriate. I should note that I would like to stream the data at a rapid pace, but I understand that this will be limited by the RPi/IMU, the streaming and the code that processes it (I'll throttle it to fit).
The code example from SO that I've implemented is from here (not sure how to add links):
questions/64642122/how-to-send-real-time-sensor-data-to-pc-from-raspberry-pi-zero
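To make the question concrete, the kind of thing I'm imagining on the sending side is roughly this (host, port and read_imu() are placeholders): each reading is pushed over the TCP socket as soon as it is produced, so nothing ever has to be re-read from the file.

# Hypothetical sender sketch: push each IMU reading over TCP as it is read,
# rather than reading it back from the file afterwards. LAPTOP_HOST, PORT
# and read_imu() are placeholders for the real values/driver code.
import json
import socket
import time

LAPTOP_HOST = "192.168.1.50"   # example laptop address
PORT = 5005

def read_imu():
    # placeholder for the real acc/gyro read
    return {"t": time.time(), "ax": 0.0, "ay": 0.0, "az": 0.0,
            "gx": 0.0, "gy": 0.0, "gz": 0.0}

with socket.create_connection((LAPTOP_HOST, PORT)) as sock:
    while True:
        sample = read_imu()
        # newline-delimited JSON so the laptop can split the byte stream
        sock.sendall((json.dumps(sample) + "\n").encode())

Is that the right pattern (with the laptop just accepting the connection and reading line by line), or should the laptop be pulling the data with some kind of GET instead?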
Thanks
I am trying to simulate a large Modelica model in Dymola. This model uses several records that define time series input data (data with 900 second intervals for 1 year), which it reads via the CombiTimeTable model.
If I limit the records to only contain the data for 2 weeks (also 900 second intervals), the model simulates fine.
With the yearly data, the translation seems to run successfully, but the simulation fails. The dslog file contains the message "Not enough storage for initial variable data".
This happens on a Windows 10 system with 8 GB RAM as well as on a Windows 7 system with 32 GB RAM.
Is there any way to avoid this error and get the simulation to run? Thanks in advance!
The recommended way is to have the time series data not within the records (that is, in your model or library) but in external data files. The CombiTimeTable supports reading from both text files and MATLAB MAT files at simulation run-time. You will also benefit from shorter translation times.
You can still organize your external files relative to your library by means of Modelica URIs, since the CombiTimeTable (as well as the other table blocks) already calls the loadResource function. The recommended way is to place these files in a Resources directory of your Modelica package.
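If the time series currently live inside records, a small script can dump them once to a MAT-file that the table block then reads at run-time. A minimal Python sketch, assuming numpy/scipy are available (the table name tab1, the file name and the three value columns are just examples; MAT v4 is the simplest of the MAT formats the table blocks read):

# Hypothetical sketch: write a year of 900 s interval data to a MATLAB MAT v4
# file for Modelica's CombiTimeTable. Table/file names and the three value
# columns are examples only; the first column must be time in seconds.
import numpy as np
from scipy.io import savemat

n_steps = 365 * 24 * 4                  # 900 s steps over one year
time_col = np.arange(n_steps) * 900.0   # time in seconds
values = np.zeros((n_steps, 3))         # placeholder for the real series

table = np.column_stack((time_col, values))
savemat("yearly_data.mat", {"tab1": table}, format="4")

In the model you would then set tableOnFile=true, tableName="tab1" and point fileName to a Modelica URI such as "modelica://YourLibrary/Resources/yearly_data.mat" (YourLibrary being a placeholder for your package name).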
I'm storing the output of a sensor pod in a Postgres DB. There are many sensor pods (dozens), each generating a reading every 100 ms, 24 hours per day - that's roughly 864,000 records per pod per day. The sensor pod is relatively dumb and lives in an environment of unreliable connectivity, so it writes CSV files of n (TBD) lines locally and then ships them all off when it has network access.
The sensor pod knows its own name and the data it produces, but that's about it. I'm trying to decide how to load the data efficiently into the DB. There are two options I'm considering:
Use the COPY syntax, and give it the file directly
Do a mass insert
Can someone speak to the performance ramifications of both of those?
My hesitation with option 1 is that I need to supply some ancillary data (some foreign keys, etc.) that won't be in the file. The only way to do that, without making the sensor pod aware of the random crap that's also going into that table, is to generate temporary tables for the load and then move the rows into their final destination with an INSERT ... SELECT, which seems wasteful.
This is a high write / low read environment.
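For context, the staging-table version of option 1 that I have in mind looks roughly like this in Python with psycopg2 (the readings table, its columns and the pod_id foreign key are made-up names for illustration):

# Hypothetical sketch of option 1 with a staging table: COPY the pod's CSV
# into a temp table, then attach the ancillary data (here pod_id) with a
# single INSERT ... SELECT. Table and column names are examples only.
import psycopg2

def load_pod_file(conn, csv_path, pod_id):
    with conn.cursor() as cur, open(csv_path) as f:
        cur.execute("CREATE TEMP TABLE staging (ts bigint, value double precision) ON COMMIT DROP")
        cur.copy_expert("COPY staging FROM STDIN WITH (FORMAT csv)", f)
        cur.execute(
            "INSERT INTO readings (pod_id, ts, value) SELECT %s, ts, value FROM staging",
            (pod_id,))
    conn.commit()

conn = psycopg2.connect("dbname=sensors")
load_pod_file(conn, "pod_17.csv", 17)

Even with the extra hop, COPY plus one set-based INSERT ... SELECT is normally far cheaper than issuing hundreds of thousands of individual INSERTs, so perhaps the "wasteful" part matters less than it looks.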
I am looking for a better way to store, write and read meteorological data (about 30 GB in raw text format).
Currently I am using the NetCDF file format to store weather records. In this NetCDF file, I have 3 dimensions: time, climate variables and locations. But the dimension order is the key constraint for my tasks (see below).
The first task is to update the weather records every day for about 3000 weather stations. The dimension order (time, var, name) gives the best write performance, as the new data are simply appended at the end of the NetCDF file.
The second task is to read all the daily weather records for one station to perform an analysis. The dimension order (name, var, time) gives the best read performance, as all records of one site are stored together.
The two tasks thus require conflicting NetCDF layouts (the layout with the best performance in one task has the worst performance in the other).
My question is whether there are alternative methods/software/data formats for storing, writing and reading my datasets that give good performance for both tasks. As I have to repeat the two steps every day and the data analysis is time consuming, I need to find the best way to minimize I/O.
Thanks for any suggestions. Please let me know if my question is not clear.
OK, what you need is chunking. I created a small Python script to test; without chunking it basically confirms your observation that access is slow in one dimension. I tested with 3000 stations, 10 variables per station and 10000 time steps. For the test I put stations and variables into the same dimension, but it should give similar results in the 3D case if you really need it.
My test output without chunking:
File chunking type: None
Variable shape: (30000, 10000)
Total time, file creation: 13.665503025054932
Average time for adding one measurement time: 0.00136328568459 0.00148195505142 0.0018851685524
Read all timeseries one by one with single file open
Average read time per station/variable: 0.524109539986
And with chunking:
File chunking type: [100, 100]
Variable shape: (30000, 10000)
Total time, file creation: 18.610711812973022
Average time for adding one measurement time: 0.00185681316853 0.00168470859528 0.00213300466537
Read all timeseries one by one with single file open
Average read time per station/variable: 0.000948731899261
What you can see is that chunking increases the write time by about 50%, but dramatically improves the read time. I did not try to optimise the chunk sizes; I just checked that chunking works in the right direction.
Feel free to ask if the code is not clear or you are not familiar with Python.
# -*- coding: utf-8 -*-
from time import time

import numpy as np
from netCDF4 import Dataset

test_dataset_name='test_dataset.nc4'
num_stations=3000
num_vars=10
chunks=None
#chunks=[100,100]   # uncomment to test with chunking

def create_dataset():
    ff=Dataset(test_dataset_name,'w')
    ff.createDimension('Time',None)
    ff.createDimension('Station_variable',num_stations*num_vars)
    if chunks:
        var1=ff.createVariable('TimeSeries','f8',('Station_variable','Time'),chunksizes=chunks)
    else:
        var1=ff.createVariable('TimeSeries','f8',('Station_variable','Time'))
    return ff

def add_data(ff,timedim):
    # write one new time step for the first 1000 station/variable rows
    var1=ff.variables['TimeSeries']
    var1[0:1000,timedim]=timedim*np.ones((1000),'f8')

def dataset_close(inds):
    inds.close()

## CREATE DATA FILE
time_start=time()
time1=[]
time2=[]
time3=[]
time4=[]
testds=create_dataset()
dataset_close(testds)
# append 10000 time steps, reopening the file for every step
for i in range(10000):
    time1.append(time())
    ff=Dataset(test_dataset_name,'a')
    time2.append(time())
    add_data(ff,i)
    time3.append(time())
    ff.sync()
    ff.close()
    time4.append(time())
time_end=time()
time1=np.array(time1)
time2=np.array(time2)
time3=np.array(time3)
time4=np.array(time4)

## READ ALL STATION-VARIABLE COMBINATIONS AS ONE TIMESERIES
ff=Dataset(test_dataset_name,'r')

## PRINT DATA FILE CREATION SUMMARY
print("File chunking type:",chunks)
print("Variable shape:",ff.variables['TimeSeries'][:].shape)
print("Total time, file creation:",time_end-time_start)
print("Average time for adding one measurement time: ",np.mean(time4-time1),np.mean(time4[:100]-time1[:100]),np.mean(time4[-100:]-time1[-100:]))
print("Read all timeseries one by one with single file open")
time_rstart=[]
time_rend=[]
# time the read of one full time series for every 100th station/variable row
for i in range(0,ff.variables['TimeSeries'][:].shape[0],int(ff.variables['TimeSeries'][:].shape[0]/100)):
    time_rstart.append(time())
    dataline=ff.variables['TimeSeries'][i,:]
    time_rend.append(time())
time_rstart=np.array(time_rstart)
time_rend=np.array(time_rend)
print("Average read time per station/variable: ",np.mean(time_rend-time_rstart))
I have somewhat of a unique problem that looks similar to the problem described here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and a C++ program picks out specific packets from this stream to save in some format. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond resolution and have another application sort out the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My requirements are:
Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
Be able to export chunks of this data into MATLAB (HDF5).
Query this data once or twice a day for analytics purposes
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention that all 4 boxes are in physically different locations, so there is some overhead in sharing data.
The second requirement is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "pull metric X between time Y and Z", so this would eventually have to end up in an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there were no translation step. I have read the entire thread mentioned above, and several options appear to stand out: kdb+, Cassandra, PyTables and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get the data into the MATLAB HDF5 format, and whether any of them would make that harder than the others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A kdb+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to make sure of (whatever solution you pick):
Do the machine(s) that are capturing the data have enough cores? It's best to taskset a tickerplant, for example, to a core that nothing else will contend with.
Similarly with disk - use SSDs and make sure there is no contention on the bus.
Separate the workload - you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically there are lots of ways you can cut this. I can say, though, that with the appropriate hardware kdb+ could do the job. However, given that you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting it to disk on the fly.
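To illustrate that last point, a converter process could look roughly like this sketch in Python with h5py (the dataset name, two-column layout and chunk size are invented for the example; a C++ capture process could do the equivalent through the HDF5 C API):

# Hypothetical sketch: append timestamped values of one metric to an
# extendable, chunked HDF5 dataset so MATLAB can later read slices of it
# (e.g. with h5read). Names, layout and chunk size are examples only.
import numpy as np
import h5py

with h5py.File("capture.h5", "a") as f:
    if "metric_x" in f:
        dset = f["metric_x"]
    else:
        # rows of [epoch_seconds, value]; first dimension is unlimited
        dset = f.create_dataset("metric_x", shape=(0, 2), maxshape=(None, 2),
                                dtype="f8", chunks=(65536, 2))

    def append_batch(samples):
        # samples: (n, 2) array handed over by the capture loop
        n = dset.shape[0]
        dset.resize(n + samples.shape[0], axis=0)
        dset[n:] = samples

    # example batch of ten fake samples
    append_batch(np.column_stack((np.arange(10) + 1.7e9, np.zeros(10))))

A "pull metric X between time Y and Z" query then becomes a partial read of that dataset (filter on the first column, or use a start/count pair with h5read on the MATLAB side), with no translation step in between.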