Folks,
I am new to dealing with GRIB format and seek your advice on the following question:
We have an application that will receive data every 6 hours. The forecast will cover the next 10 to 15 days.
To reduce the download size, there is a requirement that the system download only incremental changes, meaning the new GRIB files would contain only the data that has changed.
The client would keep displaying data from the previously downloaded GRIB files and, for the parts that changed (assuming clients know which), download and display the GRIB file containing the incremental update.
Is this kind of incremental change supported by the GRIB standard?
I suspect this option is not supported by GRIB files. As the data in GRIB files is packed, you cannot tell which variables have changed and which have not without decoding them.
In addition, most likely most of the parameters have a slight and insignificant change between the forecasts (I mean the forecast for let us say 07:00 o'clock done at 00:00 and done at 06:00 will have differences for most of the parameters, but they can be in order of 10^-X - meaning they are insignificant). Some parameters or regions of course might have larger differences that you would like to highlight.
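One way to separate insignificant differences from the ones worth highlighting is to compare the decoded values against a tolerance. The following is only a minimal sketch: it assumes both forecasts for the same valid time have already been decoded into NumPy arrays on the same grid, and the field values, the function name and the tolerance are made up for illustration.

```python
import numpy as np

def significant_changes(old, new, tolerance):
    """Return a boolean mask of grid points that changed by more than tolerance."""
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    if old.shape != new.shape:
        raise ValueError("both forecasts must be on the same grid")
    return np.abs(new - old) > tolerance

# Hypothetical 2x3 temperature fields (K) for the same valid time,
# one from the 00:00 run and one from the 06:00 run.
run_00 = np.array([[280.0, 281.5, 283.0], [279.0, 280.2, 282.1]])
run_06 = np.array([[280.0001, 281.5, 284.2], [279.0, 280.2002, 282.1]])

mask = significant_changes(run_00, run_06, tolerance=0.5)
changed_fraction = mask.mean()
print(mask)
print(f"{changed_fraction:.0%} of grid points changed significantly")
```

A suitable tolerance depends on the parameter, so in practice it would probably be chosen per variable.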
I'm trying to do something to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
Basically, a new set of files is added every day using the copy task from Azure Data Factory. In reality there are more than 3 files per day.
I want to start applying different retention policies to different files. For example, the emails.parquet files, I want to delete the entire file after it is 30 days old. The customer files, I want to anonymise by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iteratively look through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I'd plan to use pyspark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for the date comparisons that decide whether data should be retained, I can either use the dates in the folder structure (but I don't know how to do this), or use the "RowStartDate" column that is present in every parquet file.
Can anybody help point me in the right direction of how to achieve what I wish, either by the route I'm alluding to above (pyspark script to iterate through container folders, add data to data frame, transform, then overwrite original file) or any other means.
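The date comparison based on the folder structure could look roughly like the sketch below. It is plain Python and assumes the paths have already been listed from the container (the listing itself is not shown). The policy table, the function names and the 365 day threshold for customer files are hypothetical; only the 30 day rule for emails comes from the description above.

```python
from datetime import date, timedelta

# Hypothetical policy table: file name -> (action, maximum age in days).
POLICIES = {
    "emails.parquet": ("delete", 30),
    "customer.parquet": ("anonymise", 365),  # assumed threshold
}

def folder_date(path):
    """Parse the date from a path shaped like 'YYYY/MM/DD/name.parquet'."""
    year, month, day, _name = path.split("/")
    return date(int(year), int(month), int(day))

def retention_action(path, today):
    """Return 'delete', 'anonymise' or 'keep' for one file path."""
    name = path.rsplit("/", 1)[-1]
    policy = POLICIES.get(name)
    if policy is None:
        return "keep"
    action, max_age_days = policy
    age = today - folder_date(path)
    return action if age > timedelta(days=max_age_days) else "keep"

paths = [
    "2022/06/30/customer.parquet",
    "2022/06/30/product.parquet",
    "2022/06/30/emails.parquet",
    "2022/07/01/emails.parquet",
]
today = date(2022, 7, 31)
for p in paths:
    print(p, "->", retention_action(p, today))
```

The returned action would then drive whatever the notebook does with the file: remove it, or read it, overwrite the chosen columns and write it back.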
Trying to wrap my head around timescaledb, but my google-fu is failing me. Most likely because I'm not searching for the correct term.
With RRD tool, old data can be stored as averages, reducing the amount of data being stored.
I can't seem to find out how to do this with timescaledb. I'd like 5 minute resolution for 90 days, but after that, it's pointless to keep all those data points, and I'd like to reduce it to 30 or 60 minute averages for a couple years, then maybe daily averages after that.
Is this something that I can set in the database itself, or is this something I would have to implement in a housekeeping job?
We had the exact same question half a year ago.
The term "Data Retention" is also used by the timescaledb team. It is currently implemented using drop_chunks policies (see their doc here). It's an Enterprise feature but IMHO not (yet) as useful as it could/should be (and it surely does not do what you are looking for).
Let me explain: probably the easiest approach for down-sampling your data is Continuous Aggregates (their doc here). You can quite easily aggregate virtually any numeric value to whatever resolution you desire. However, Continuous Aggregates will be affected by the deletions done by drop_chunks, too. Your data is gone.
One workaround would be to create other Hypertables instead. Then, create your own background workers copying the data from the original, hi-res table to these new lo-res Hypertables.
For housekeeping, either use the Data Retention Enterprise feature or create your own background workers.
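The averaging step such a background worker would perform can be sketched in plain Python. This is not timescaledb API: reading from the hi-res table and writing to the lo-res table are left out, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def downsample(rows, bucket):
    """Average (timestamp, value) rows into fixed-width time buckets."""
    epoch = datetime(1970, 1, 1)
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in rows:
        # Floor the timestamp to the start of its bucket.
        start = ts - ((ts - epoch) % bucket)
        sums[start][0] += value
        sums[start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]

# One hour of hypothetical 5 minute samples with values 0..11.
t0 = datetime(2024, 1, 1)
hi_res = [(t0 + timedelta(minutes=5 * i), float(i)) for i in range(12)]

lo_res = downsample(hi_res, timedelta(minutes=30))
for start, mean in lo_res:
    print(start.isoformat(), mean)
```

The same function with a bucket of one day would produce the daily averages for the oldest data.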
I am trying to simulate a large Modelica model in Dymola. This model uses several records that define time series input data (data with 900 second intervals for 1 year), which it reads via the CombiTimeTable model.
If I limit the records to only contain the data for 2 weeks (also 900 second intervals), the model simulates fine.
With the yearly data, the translation seems to run successfully, but simulation fails. The dslog file contains the message "Not enough storage for initial variable data".
This happens on a Windows 10 system with 8 GB RAM as well as on a Windows 7 system with 32 GB RAM.
Is there any way to avoid this error and get the simulation to run? Thanks in advance!
The recommended way is to have the time series data not within the records (that is in your model or library) but as external data files. The CombiTimeTable supports both reading from text file and MATLAB MAT file at simulation run-time. You will also benefit from shorter translation times.
You can still organize your external files relative to your library by means of Modelica URIs, since the CombiTimeTable (as well as the other table blocks) already calls the loadResource function. The recommended way is to organize these files in a Resources directory of your Modelica package.
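If the time series currently live in records, a small script can convert them to the text table format. The sketch below writes the header layout used by the Modelica Standard Library table blocks ("#1" on the first line, then "double name(rows,cols)", then the rows). The table name "load" and the sample values are hypothetical.

```python
def format_table(name, rows):
    """Format rows as a text table with a '#1' / 'double name(n,m)' header."""
    if not rows:
        raise ValueError("table needs at least one row")
    ncols = len(rows[0])
    lines = ["#1", f"double {name}({len(rows)},{ncols})"]
    for row in rows:
        if len(row) != ncols:
            raise ValueError("all rows must have the same number of columns")
        lines.append(" ".join(repr(float(v)) for v in row))
    return "\n".join(lines) + "\n"

# Hypothetical series: time in seconds (900 s steps) and one value column.
samples = [(0, 1.0), (900, 1.5), (1800, 1.25)]
text = format_table("load", samples)
print(text)
```

The resulting text would be saved to a file in the Resources directory, and the CombiTimeTable would be set to read it (tableOnFile, tableName and fileName parameters) instead of taking the table from a record.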
I am looking for a better way to store, write and read meteorological data (about 30 GB in raw text format).
Currently I am using the NetCDF file format to store weather records. In this NetCDF file, I have 3 dimensions: time, climate variables, locations. But the dimension order is the key constraint for my tasks (see below).
The first task is to update weather records every day for about 3000 weather stations. Dimension order (time, var, name) provides the best performance for writing, as the new data will be added at the end of the NetCDF file.
The second task is to read all daily weather records for a station to perform analysis. Dimension order (name, var, time) provides the best performance for reading, as all records of one site are stored together.
The two tasks call for conflicting designs of the NetCDF file (the best performance for one task gives the worst performance for the other).
My question is whether there are alternative methods/software/data formats to store, write and read my datasets that provide the best performance for both tasks. As I have to repeat the two steps every day and data analysis is time consuming, I need to find the best way to minimize I/O.
Thanks for any suggestions. Please let me know if my question is not clear.
Ok, what You need is chunking. I created a small Python script to test it; without chunking, it basically confirms Your observation that access is slow along one dimension. I tested with 3000 stations, 10 variables per station and 10000 timesteps. I put stations and variables into the same dimension for testing, but it should give similar results in the 3D case if You really need it.
My test output without chunking:
File chunking type: None
Variable shape: (30000, 10000)
Total time, file creation: 13.665503025054932
Average time for adding one measurement time: 0.00136328568459 0.00148195505142 0.0018851685524
Read all timeseries one by one with single file open
Average read time per station/variable: 0.524109539986
And with chunking:
File chunking type: [100, 100]
Variable shape: (30000, 10000)
Total time, file creation: 18.610711812973022
Average time for adding one measurement time: 0.00185681316853 0.00168470859528 0.00213300466537
Read all timeseries one by one with single file open
Average read time per station/variable: 0.000948731899261
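What You can see is that chunking increases write time by about 36% (0.00186 s against 0.00136 s per added timestep), but cuts the average read time per series by a factor of roughly 550 (from 0.52 s to 0.00095 s). I did not try to optimise chunk sizes, just tested that it works in the right direction.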
Feel free to ask if code is not clear or You are not familiar with python.
# -*- coding: utf-8 -*-
from time import time

import numpy as np
from netCDF4 import Dataset

test_dataset_name = 'test_dataset.nc4'
num_stations = 3000
num_vars = 10
# Switch between the two lines below to test without and with chunking.
chunks = None
#chunks = [100, 100]


def create_dataset():
    # Stations and variables share one dimension; Time is unlimited.
    ff = Dataset(test_dataset_name, 'w')
    ff.createDimension('Time', None)
    ff.createDimension('Station_variable', num_stations * num_vars)
    if chunks:
        ff.createVariable('TimeSeries', 'f8', ('Station_variable', 'Time'), chunksizes=chunks)
    else:
        ff.createVariable('TimeSeries', 'f8', ('Station_variable', 'Time'))
    return ff


def add_data(ff, timedim):
    # Write one new timestep for the first 1000 station/variable rows.
    var1 = ff.variables['TimeSeries']
    var1[0:1000, timedim] = timedim * np.ones((1000), 'f8')


def dataset_close(inds):
    inds.close()


## CREATE DATA FILE
time_start = time()
time1 = []
time2 = []
time3 = []
time4 = []
testds = create_dataset()
dataset_close(testds)
# Reopen the file for every timestep to mimic a daily update.
for i in range(10000):
    time1.append(time())
    ff = Dataset(test_dataset_name, 'a')
    time2.append(time())
    add_data(ff, i)
    time3.append(time())
    ff.sync()
    ff.close()
    time4.append(time())
time_end = time()
time1 = np.array(time1)
time2 = np.array(time2)
time3 = np.array(time3)
time4 = np.array(time4)

## READ A SAMPLE OF 100 STATION-VARIABLE COMBINATIONS, EACH AS ONE TIMESERIES
ff = Dataset(test_dataset_name, 'r')
# Use the variable's shape attribute so the whole array is not loaded into memory.
var_shape = ff.variables['TimeSeries'].shape

## PRINT DATA FILE CREATION SUMMARY
print("File chunking type:", chunks)
print("Variable shape:", var_shape)
print("Total time, file creation:", time_end - time_start)
print("Average time for adding one measurement time: ", np.mean(time4 - time1), np.mean(time4[:100] - time1[:100]), np.mean(time4[-100:] - time1[-100:]))
print("Read all timeseries one by one with single file open")
time_rstart = []
time_rend = []
for i in range(0, var_shape[0], int(var_shape[0] / 100)):
    time_rstart.append(time())
    dataline = ff.variables['TimeSeries'][i, :]
    time_rend.append(time())
ff.close()
time_rstart = np.array(time_rstart)
time_rend = np.array(time_rend)
print("Average read time per station/variable: ", np.mean(time_rend - time_rstart))
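To get a feeling for why the [100, 100] chunk shape used above is a compromise between the two access patterns, it can help to count how many chunks each operation touches, since chunks are read and written as whole units. The sketch below is simple arithmetic, not a benchmark. The helper name is made up, and it assumes the selection is aligned to the chunk grid.

```python
import math

def chunks_touched(selection_shape, chunk_shape):
    """Count chunks intersected by a selection aligned to the chunk grid."""
    return math.prod(math.ceil(s / c) for s, c in zip(selection_shape, chunk_shape))

variable_shape = (30000, 10000)  # (Station_variable, Time), as in the test
chunk_shape = (100, 100)

# Reading the full time series of one station/variable row.
read_one_series = chunks_touched((1, variable_shape[1]), chunk_shape)
# Writing one new timestep for every station/variable row.
write_one_timestep = chunks_touched((variable_shape[0], 1), chunk_shape)
# The test script only writes the first 1000 rows per timestep.
write_in_test = chunks_touched((1000, 1), chunk_shape)

print("chunks per series read:", read_one_series)
print("chunks per full timestep write:", write_one_timestep)
print("chunks per timestep write in the test:", write_in_test)
```

Making chunks longer along Time reduces the number of chunks per series read, while making them longer along the station dimension reduces the number per daily write, so the best shape depends on which task matters more.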
I would like to report the database size to myself via email every week and make a comparison to the week before and display the growth in Megabyte and/or %.
I have everything besides the comparison done.
Imagine this setup :
SQL server with 100 databases
Now there are plenty of ways to do a comparison. I thought about writing the sizes into XML with PowerShell, then reading them back with a second script that sends me the report.
Since I taught myself PowerShell, I might have gaps here, so I am afraid of missing an easier way.
Does anyone have a good idea of how to compare the sizes?
The report and calculation I will manage myself later, I just need a good way to do that.
Currently I am on Powershell 3.0 but I can upgrade to 4.0
Don't reinvent the wheel. SQL Server already has tools to monitor DB file sizes. So does Performance Monitor. There are several 3rd party products available too. Ask your local DBA whether such a system is already in place.
A common practice is to query the server for DB file sizes on, say, a daily basis and store them in a utility DB table with a timestamp. Calculating change volumes, ratios and whatnot can be done on the TSQL side. (Not that it is CPU intensive anyway.)
I would create a CSV file for each database, then write out rows with two columns:
Date,Size
27.08.2014,1024
28.08.2014,1040
29.08.2014,1080
Then you can import the csv file, sort the row by date, compare the last two sizes and send the result by mail.
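The import, sort and compare steps can be sketched as follows. This is written in Python rather than PowerShell, purely to illustrate the logic. It assumes the Size column is in megabytes and the dates use the dd.mm.yyyy format shown above, and the function name is made up.

```python
import csv
import io
from datetime import datetime

def growth_from_csv(text):
    """Return (growth in MB, growth in %) between the two most recent rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = sorted(
        ((datetime.strptime(r["Date"], "%d.%m.%Y"), float(r["Size"])) for r in reader),
        key=lambda row: row[0],
    )
    if len(rows) < 2:
        raise ValueError("need at least two measurements to compare")
    (_, previous), (_, latest) = rows[-2:]
    delta = latest - previous
    percent = delta / previous * 100
    return delta, percent

# The sample rows from above, as they would appear in one database's file.
sample = "Date,Size\n27.08.2014,1024\n28.08.2014,1040\n29.08.2014,1080\n"
delta, percent = growth_from_csv(sample)
print(f"Growth: {delta:.0f} MB ({percent:.2f}%)")
```

For a weekly report, the comparison would use the latest row and the row from seven days earlier instead of the last two rows.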