A better approach to store, write, and read a big dataset of meteorological data - mongodb

I am looking for a better way to store, write, and read meteorological data (about 30 GB in raw text format).
Currently I am using the NetCDF file format to store weather records. In this NetCDF file I have 3 dimensions: time, climate variables, and locations. But the dimension order is the key constraint for my tasks (see below).
The first task is to update weather records every day for about 3000 weather stations. The dimension order (time, var, name) provides the best performance for writing, as the new data are simply appended at the end of the NetCDF file.
The second task is to read all daily weather records for a station to perform analysis. The dimension order (name, var, time) provides the best performance for reading, as all records of one site are stored together.
The two tasks therefore call for conflicting NetCDF layouts: the order that gives the best performance for one task gives the worst performance for the other.
My question is whether there are alternative methods, software, or data formats to store, write, and read my datasets that give good performance for both tasks. As I have to repeat the two steps every day and the data analysis is time consuming, I need to minimize I/O.
Thanks for any suggestions. Please let me know if my question is not clear.

OK, what you need is chunking. I created a small Python script to test it; without chunking it basically confirms your observation that access is slow along one dimension. I tested with 3000 stations, 10 variables per station, and 10000 time steps. I put stations and variables into the same dimension for the test, but it should give similar results in the 3D case if you really need it (a rough 3-D sketch follows the test script below).
My test output without chunking:
File chunking type: None
Variable shape: (30000, 10000)
Total time, file creation: 13.665503025054932
Average time for adding one measurement time: 0.00136328568459 0.00148195505142 0.0018851685524
Read all timeseries one by one with single file open
Average read time per station/variable: 0.524109539986
And with chunking:
File chunking type: [100, 100]
Variable shape: (30000, 10000)
Total time, file creation: 18.610711812973022
Average time for adding one measurement time: 0.00185681316853 0.00168470859528 0.00213300466537
Read all timeseries one by one with single file open
Average read time per station/variable: 0.000948731899261
What you can see is that chunking increases the write time by about 50%, but dramatically improves the read time (by roughly a factor of 500 in this test). I did not try to optimise the chunk sizes, just tested that chunking works in the right direction.
Feel free to ask if the code is not clear or you are not familiar with Python.
# -*- coding: utf-8 -*-
from time import time

import numpy as np
from netCDF4 import Dataset

test_dataset_name = 'test_dataset.nc4'
num_stations = 3000
num_vars = 10
chunks = None          # no chunking
#chunks = [100, 100]   # uncomment to test with chunking

def create_dataset():
    ff = Dataset(test_dataset_name, 'w')
    ff.createDimension('Time', None)   # unlimited dimension
    ff.createDimension('Station_variable', num_stations*num_vars)
    if chunks:
        var1 = ff.createVariable('TimeSeries', 'f8',
                                 ('Station_variable', 'Time'),
                                 chunksizes=chunks)
    else:
        var1 = ff.createVariable('TimeSeries', 'f8',
                                 ('Station_variable', 'Time'))
    return ff

def add_data(ff, timedim):
    # write one new time step for the first 1000 station/variable rows
    var1 = ff.variables['TimeSeries']
    var1[0:1000, timedim] = timedim*np.ones((1000), 'f8')

def dataset_close(inds):
    inds.close()

## CREATE DATA FILE
time_start = time()
time1 = []
time2 = []
time3 = []
time4 = []
testds = create_dataset()
dataset_close(testds)
for i in range(10000):
    time1.append(time())
    ff = Dataset(test_dataset_name, 'a')
    time2.append(time())
    add_data(ff, i)
    time3.append(time())
    ff.sync()
    ff.close()
    time4.append(time())
time_end = time()
time1 = np.array(time1)
time2 = np.array(time2)
time3 = np.array(time3)
time4 = np.array(time4)

## READ ALL STATION-VARIABLE COMBINATIONS AS ONE TIMESERIES
ff = Dataset(test_dataset_name, 'r')

## PRINT DATA FILE CREATION SUMMARY
print("File chunking type:", chunks)
print("Variable shape:", ff.variables['TimeSeries'][:].shape)
print("Total time, file creation:", time_end-time_start)
print("Average time for adding one measurement time: ",
      np.mean(time4-time1), np.mean(time4[:100]-time1[:100]),
      np.mean(time4[-100:]-time1[-100:]))
print("Read all timeseries one by one with single file open")
time_rstart = []
time_rend = []
for i in range(0, ff.variables['TimeSeries'][:].shape[0],
               int(ff.variables['TimeSeries'][:].shape[0]/100)):
    time_rstart.append(time())
    dataline = ff.variables['TimeSeries'][i, :]
    time_rend.append(time())
time_rstart = np.array(time_rstart)
time_rend = np.array(time_rend)
print("Average read time per station/variable: ", np.mean(time_rend-time_rstart))

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Context:
We collect data every second so that we can have a real-time view of the sensor measurements.
This, however, is very demanding on storage and is only required for real-time monitoring. For example, there is no need to check measurements made last week at such granularity (1-second intervals), hence no need to keep such large volumes of data occupying resources. An average value for every 5 minutes would be perfectly fine when consulting the history of previous days.
Question:
This poses the question of how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that holds the average of the deleted data for a given interval. For example, I would like to keep raw data (measurements every second) for the present day and aggregated data (an average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked whether PostgreSQL has anything resembling this functionality but didn't find anything. My main idea is to use a cron job to periodically perform the aggregation/deletion from raw data to aggregated data (sketched below). Can anyone think of a better option? I very much welcome any suggestions and input.
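As an illustration of the cron-job idea, here is a minimal sketch of the raw-to-5-minute roll-up step. The tables sensor_raw(device_id, ts, value) and sensor_5min(device_id, bucket_start, avg_value) are hypothetical placeholders, not the actual Thingsboard schema; a single data-modifying CTE deletes the old raw rows and inserts their 5-minute averages in one transaction.

# Minimal sketch of the periodic roll-up described above.
# Table and column names (sensor_raw, sensor_5min, device_id, ts, value) are
# hypothetical placeholders, not the actual Thingsboard schema.
import psycopg2

ROLLUP_SQL = """
WITH moved AS (
    DELETE FROM sensor_raw
    WHERE ts < date_trunc('day', now())   -- keep raw per-second data for today
    RETURNING device_id, ts, value
)
INSERT INTO sensor_5min (device_id, bucket_start, avg_value)
SELECT device_id,
       to_timestamp(floor(extract(epoch FROM ts) / 300) * 300),  -- 5-minute bucket
       avg(value)
FROM moved
GROUP BY 1, 2;
"""

def rollup(dsn="dbname=telemetry"):
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:   # 'with conn' wraps one transaction
            cur.execute(ROLLUP_SQL)
    finally:
        conn.close()

if __name__ == "__main__":
    rollup()   # schedule this script from cron, e.g. once per night

The same pattern (DELETE ... RETURNING feeding an aggregated INSERT) can be repeated for the monthly tier.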

report progress during tplog row count using -11!(-2;`:tplog)

I'm counting the number of rows in a 1 TB tplog using the internal function -11!(-2;`:tplog), which, as expected, takes a long time.
Is there a way to track progress (preferably as a percentage) of this operation?
No, I don't believe this is possible for the -2 option. It is possible with every other use of -11! by defining a custom .z.ps, but it seems .z.ps isn't invoked when counting the number of chunks.
Your best bet might be to check the raw file size on disk (with a Linux system command or hcount) and then, based on timings previously measured on your machine, build a ballpark estimate of the duration from the file size alone, perhaps with a linear interpolation, as sketched below.
Of course, once you have the chunk count from the -2 option, you can easily create a progress estimate for an actual replay, just not for the chunk-count step itself.
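A sketch of that interpolation idea follows; the calibration points are made up and you would substitute timings measured on your own machine.

# Ballpark duration estimate from file size, using made-up calibration points
# measured beforehand on the same machine: (log size in GB, seconds taken).
import numpy as np

sizes_gb = np.array([10.0, 100.0, 500.0])
seconds  = np.array([6.0, 65.0, 330.0])

def estimate_seconds(size_gb):
    # linear interpolation between calibration points;
    # extrapolate using the last measured rate beyond the largest point
    if size_gb > sizes_gb[-1]:
        return float(seconds[-1] * size_gb / sizes_gb[-1])
    return float(np.interp(size_gb, sizes_gb, seconds))

print(estimate_seconds(1024.0))   # rough guess for a 1 TB log, in seconds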
Fundamentally, entries are variable length and there is no tracking built into the file, so there's no exact answer.
You might be able to do this:
Replay a short segment (say the first 1,000 entries) with a custom handler:
estimate:0                       / running total of serialized message sizes
.z.ps:{estimate+:-22!x}          / -22! returns the size in bytes of each replayed message
-11!(1000;`:tplog)               / replay only the first 1,000 entries
-1"estimate of entry size: ", string estimate%:1000
-1"estimate of number of entries: ", string floor n:hcount[`:tplog]%estimate
\x .z.ps
The \x resets .z.ps to its default. Then replay the full file with a standard upd handler.

What are some of the most efficient workflows for processing "big data" (250+ GB) from postgreSQL databases?

I am constructing a script that will process well over 250 GB of data from a single PostgreSQL table. The table's shape is roughly 150 columns x 74M rows. My goal is to sift through all the data and make sure that each cell entry meets certain criteria that I will be tasked with defining. After the data has been processed I want to pipeline it into an AWS instance. Here are some scenarios I will need to consider:
How can I ensure that each cell entry meets the criteria of the column it resides in? For example, all entries in the 'Date' column should be in the format 'yyyy-mm-dd', etc.
What tools/languages are best for handling such large data? I often use Python and the pandas module for DataFrame manipulation, and I am aware of the read_sql function, but I think that this much data will simply take too long to process in Python.
I know how to manually process the data chunk by chunk in Python (see the sketch after this list); however, I think this is probably too inefficient and the script could take well over 12 hours.
Simply put (TL;DR): I'm looking for a simple, streamlined solution for manipulating and performing QC analysis on PostgreSQL data.
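For scale testing, here is a minimal sketch of the chunk-by-chunk pandas approach mentioned above; the connection string, table name, column name and date rule are all hypothetical placeholders.

# Sketch of chunk-by-chunk validation with pandas; all names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

DATE_PATTERN = r"^\d{4}-\d{2}-\d{2}$"      # 'yyyy-mm-dd'

def bad_date_rows(chunk):
    """Return the rows whose 'date' column does not match yyyy-mm-dd."""
    mask = ~chunk["date"].astype(str).str.match(DATE_PATTERN)
    return chunk[mask]

engine = create_engine("postgresql+psycopg2://user:password@host/dbname")

failures = []
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=100_000):
    failures.append(bad_date_rows(chunk))

report = pd.concat(failures, ignore_index=True)
print(len(report), "rows failed the date check")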

Dymola: Avoid "Not enough storage for initial variable data" for large Modelica model

I am trying to simulate a large Modelica model in Dymola. This model uses several records that define time series input data (values at 900-second intervals for one year), which it reads via the CombiTimeTable model.
If I limit the records to only contain the data for 2 weeks (also at 900-second intervals), the model simulates fine.
With the yearly data, the translation seems to run successfully, but simulation fails. The dslog file contains the message Not enough storage for initial variable data.
This happens on a Windows 10 system with 8 GB RAM as well as on a Windows 7 system with 32 GB RAM.
Is there any way to avoid this error and get the simulation to run? Thanks in advance!
The recommended way is to keep the time series data not within the records (that is, in your model or library) but in external data files. The CombiTimeTable supports reading from both a text file and a MATLAB MAT-file at simulation run-time. You will also benefit from shorter translation times.
You can still organize your external files relative to your library by means of Modelica URIs, since the CombiTimeTable (as well as the other table blocks) already calls the loadResource function. The recommended way is to organize these files in a Resources directory of your Modelica package.
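For illustration, a CombiTimeTable text file starts like the snippet below; the table name tab1, the 900 s grid and the values are placeholders, and the first column is time in seconds. Such a file can then be referenced from the table block via a Modelica URI like modelica://MyLibrary/Resources/Data/inputs.txt.

#1
double tab1(4,2)
   0   273.15
 900   273.40
1800   273.80
2700   274.10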

historic data storage and retrieval

I am using a standard splayed format for my trade data, with a directory for each date and each column as a separate file inside.
I am reading from CSV files and storing the data using the code below. I am using the 32-bit trial version on 64-bit Windows 7.
readDat: {[x]
  tmp: read data from csv file(x);          / placeholder: load one day's trades from the csv
  trd: `sym`time`trdId xasc tmp;
  /trd: update `g#sym from trd;
  trade:: trd;
  .Q.dpft[`:/kdb/ndb; dt; `sym; `trade];    / dt is the partition date for that file
  .Q.gc[];
  };
\t readDat each 50#dtlist
I have tried it both with and without the `g#sym attribute. The data typically has 1.5MM rows per date; the select time is from 0.5 to 1 second for a day.
Is there a way to improve the times for either of the queries below?
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning, etc., but I'm not sure whether anything would help here.
On second thought, would creating a table for each sym speed things up? I am trying that out, but wanted to know whether there are memory/space trade-offs I should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem is disk read speed (check with something like iostat), you can either get a faster disk (SSD), add more memory (for a bigger disk cache), or use par.txt to shard your database across multiple disks so that the query runs on multiple disks and cores in parallel.
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvement. You could possibly segment by symbol range (see here), although that is not something I've ever tried.
One basic way to improve performance would be to select only a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact, since kdb+ can then ignore some column files completely.
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query, as the lookup in the sym file will be faster, although this really depends on the size of your universe. I would imagine the benefit would be marginal compared with reducing the number of columns requested.
As user1895961 mentioned, selecting only certain columns will be faster.
kdb+ splayed/partitioned tables are essentially just files on the filesystem: the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5MM rows per partition is OK, but on the large side, so perhaps you might want to partition by something else.
You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk when filtering is added.
Also try converting your data to char instead of sym; I found big performance increases from doing so.