Howto compress time-series data in Matlab (e.g. delta-rle)? - matlab

We have terabytes of time-series data, handled with Matlab, filling our disks.
I know that rle-compression could greatly reduce the storage requirement of time-series data, if it were delta-encoded (or even delta-of-delta).
(Reason: The differences between consecutive values are mostly small, often just 0, +1, -1, and rle-compression works great if data contains some values much more often than others.)
As the admin I want our Matlab-users to save disk space. Is there an easy way to do this in Matlab? Ideally mostly transparent, so they do not need to manually rewrite all places in their scripts where they access data.
I don't even need the actual compression, because the filesystem already does that, just automatic delta-encoding would be sufficient.

Related

MATLAB: Are there any problems with many (millions) small files compared to few (thousands) large files?

I'm working on a real-time test software in MATLAB. On user input I want to extract the value of one (or a few neighbouring) pixels from 50-200 high resolution images (~25 MB).
My problem is that the total image set is to big (~2000 images) to store in RAM, consequently I need to read each of the 50-200 images from disk after each user-input which of course is way to slow!
So I was thinking about splitting the images into sub-images (~100x100 pixels) and saving these separately. This would make the image-read process quick enough.
Are there any problems I should be aware of with this approach? For instance I've read about people having trouble copying many small files, will this affect me to i.e. make the image-read slower?
rahnema1 is right - imread(...,'PixelRegion') will fasten read operation. If it is not enough for you, even if your files are not fragmented, may be it is time to think about some database?
Disk operations are always the bottleneck. First we switch to disk caches, then distributed storage, then RAID, and after some more time, we finish with in-memory databases. You should choose which access speed is reasonable.

Optimizing compression using HDF5/H5 in Matlab

Using Matlab, I am going to generate several data files and store them in H5 format as 20x1500xN, where N is an integer that can vary, but typically around 2300. Each file will have 4 different data sets with equal structure. Thus, I will quickly achieve a storage problem. My two questions:
Is there any reason not the split the 4 different data sets, and just save as 4x20x1500xNinstead? I would prefer having them split, since it is different signal modalities, but if there is any computational/compression advantage to not having them separated, I will join them.
Using Matlab's built-in compression, I set deflate=9 (and DataType=single). However, I have now realized that using deflate multiplies my computational time with 5. I realize this could have something to do with my ChunkSize, which I just put to 20x1500x5 - without any reasoning behind it. Is there a strategic way to optimize computational load w.r.t. deflation and compression time?
Thank you.
1- Splitting or merging? It won't make a difference in the compression procedure, since it is performed in blocks.
2- Your choice of chunkshape seems, indeed, bad. Chunksize determines the shape and size of each block that will be compressed independently. The bad is that each chunk is of 600 kB, that is much larger than the L2 cache, so your CPU is likely twiddling its fingers, waiting for data to come in. Depending on the nature of your data and the usage pattern you will use the most (read the whole array at once, random reads, sequential reads...) you may want to target the L1 or L2 sizes, or something in between. Here are some experiments done with a Python library that may serve you as a guide.
Once you have selected your chunksize (how many bytes will your compression blocks have), you have to choose a chunkshape. I'd recommend the shape that most closely fits your reading pattern, if you are doing partial reads, or filling in in a fastest-axis-first if you want to read the whole array at once. In your case, this will be something like 1x1500x10, I think (second axis being the fastest, last one the second fastest, and fist the slowest, change if I am mistaken).
Lastly, keep in mind that the details are quite dependant on the specific machine you run it: the CPU, the quality and load of the hard drive or SSD, speed of RAM... so the fine tuning will always require some experimentation.

Tensorflow tf.train.Saver saves suspiciously large .ckpt files?

I'm working with a reasonably sized net (1 convolutional layer, 2 fully connected layers). Every time I save variables using tf.train.Saver, the .ckpt files are half a gigabyte each of disk space (512 MB to be exact). Is this normal? I have a Caffe net with the same architecture that requires only a 7MB .caffemodel file. Is there a particular reason why Tensorflow saves such large file sizes?
Many thanks.
Hard to tell how large your net is from what you've described -- the number of connections between two fully connected layers scales up quadratically with the size of each layer, so perhaps your net is quite large depending on the size of your fully connected layers.
If you'd like to save space in the checkpoint files, you could replace this line:
saver = tf.train.Saver()
with the following:
saver = tf.train.Saver(tf.trainable_variables())
By default, tf.train.Saver() saves all variables in your graph -- including the variables created by your optimizer to accumulate gradient information. Telling it to save only trainable variables means it will save only the weights and biases of your network, and discard the accumulated optimizer state. Your checkpoints will probably be a lot smaller, with the tradeoff that it you may experience slower training for the first few training batches after you resume training, while the optimizer re-accumulates gradient information. It doesn't take long at all to get back up to speed, in my experience, so personally, I think the tradeoff is worth it for the smaller checkpoints.
Maybe you can try (in Tensorflow 1.0):
saver.save(sess, filename, write_meta_graph=False)
which doesn't save meta Graph information.
See:
https://www.tensorflow.org/versions/master/api_docs/python/tf/train/Saver
https://www.tensorflow.org/programmers_guide/meta_graph
Typically you only save tf.global_variables() (which is shorthand for tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES), i.e. the collection of global variables). This collection is meant to include variables which are necessary for restoring the state of the model, so things like current moving averages for batch normalization, the global step, the states of the optimizer(s) and, of course, the tf.GraphKeys.TRAINABLE_VARIABLES collection. Variables of more temporary nature, such as the gradients, are collected in LOCAL_VARIABLES and it is usually not necessary to store them and they might take up a lot of disk space.

What is a better approach of storing and querying a big dataset of meteorological data

I am looking for a convenient way to store and to query huge amount of meteorological data (few TB). More information about the type of data in the middle of the question.
Previously I was looking in the direction of MongoDB (I was using it for many of my own previous projects and feel comfortable dealing with it), but recently I found out about HDF5 data format. Reading about it, I found some similarities with Mongo:
HDF5 simplifies the file structure to include only two major types of
object: Datasets, which are multidimensional arrays of a homogenous
type Groups, which are container structures which can hold datasets
and other groups This results in a truly hierarchical, filesystem-like
data format. Metadata is stored in the form of user-defined, named
attributes attached to groups and datasets.
Which looks like arrays and embedded objects in Mongo and also it supports indices for querying the data.
Because it uses B-trees to index table objects, HDF5 works well for
time series data such as stock price series, network monitoring data,
and 3D meteorological data.
The data:
Specific region is divided into smaller squares. On the intersection of each one of the the sensor is located (a dot).
This sensor collects the following information every X minutes:
solar luminosity
wind location and speed
humidity
and so on (this information is mostly the same, sometimes a sensor does not collect all the information)
It also collects this for different height (0m, 10m, 25m). Not always the height will be the same. Also each sensor has some sort of metainformation:
name
lat, lng
is it in water, and many others
Giving this, I do not expect the size of one element to be bigger than 1Mb.
Also I have enough storage at one place to save all the data (so as far as I understood no sharding is required)
Operations with the data.
There are several ways I am going to interact with a data:
convert as store big amount of it: Few TB of data will be given to me as some point of time in netcdf format and I will need to store them (and it is relatively easy to convert it HDF5). Then, periodically smaller parts of data (1 Gb per week) will be provided and I have to add them to the storage. Just to highlight: I have enough storage to save all this data on one machine.
query the data. Often there is a need to query the data in a real-time. The most of often queries are: tell me the temperature of sensors from the specific region for a specific time, show me the data from a specific sensor for specific time, show me the wind for some region for a given time-range. Aggregated queries (what is the average temperature over the last two months) are highly unlikely. Here I think that Mongo is nicely suitable, but hdf5+pytables is an alternative.
perform some statistical analysis. Currently I do not know what exactly it would be, but I know that this should not be in a real time. So I was thinking that using hadoop with mongo might be a nice idea but hdf5 with R is a reasonable alternative.
I know that the questions about better approach are not encouraged, but I am looking for an advice of experienced users. If you have any questions, I would be glad to answer them and will appreciate your help.
P.S I reviewed some interesting discussions, similar to mine: hdf-forum, searching in hdf5, storing meteorological data
It's a difficult question and I am not sure if I can give a definite answer but I have experience with both HDF5/pyTables and some NoSQL databases.
Here are some thoughts.
HDF5 per se has no notion of index. It's only a hierarchical storage format that is well suited for multidimensional numeric data. It's possible to extend on top of HDF5 to implement an index (i.e. PyTables, HDF5 FastQuery) for the data.
HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
HDF5 supports compression filters which can - unlike popular belief - make data access actually faster (however you have to think about proper chunk size which depends on the way you access the data).
HDF5 is no database. MongoDB has ACID properties, HDF5 doesn't (might be important).
There is a package (SciHadoop) that combines Hadoop and HDF5.
HDF5 makes it relatively easy to do out core computation (i.e. if the data is too big to fit into memory).
PyTables supports some fast "in kernel" computations directly in HDF5 using numexpr
I think your data generally is a good fit for storing in HDF5. You can also do statistical analysis either in R or via Numpy/Scipy.
But you can also think about a hybdrid aproach. Store the raw bulk data in HDF5 and use MongoDB for the meta-data or for caching specific values that are often used.
You can try SciDB if loading NetCDF/HDF5 into this array database is not a problem for you. Note that if your dataset is extremely large, the data loading phase will be very time consuming. I'm afraid this is a problem for all the databases. Anyway, SciDB also provides an R package, which should be able to support the analysis you need.
Alternatively, if you want to perform queries without transforming HDF5 into something else, you can use the product here: http://www.cse.ohio-state.edu/~wayi/papers/HDF5_SQL.pdf
Moreover, if you want to perform a selection query efficiently, you should use index; if you want to perform aggregation query in real time (in seconds), you can consider approximate aggregation. Our group has developed some products to support those functions.
In terms of statistical analysis, I think the answer depends on the complexity of your analysis. If all you need is to compute something like entropy or correlation coefficient, we have products to do it in real time. If the analysis is very complex and ad-hoc, you may consider SciHadoop or SciMATE, which can process scientific data in the MapReduce framework. However, I am not sure if SciHadoop currently can support HDF5 directly.

Alternatives to Matlab's Mat File Format

I'm finding that writing and reading the native mat file format becomes very, very slow with larger data structures of about 1G in size. In addition we have other, non-matlab, software that should be able to read and write these files. So I would to find an alternative format to use to serialize matlab data structures. Ideally this format would ...
be able to represent an arbitrary matlab structure to a file.
have faster I/O than than mat files.
have I/O libraries for other languages like Java, Python and C++.
Simplifying your data structures and using the new v7.3 MAT file format, which is a variant of HDF5, might actually be the best approach. The HDF5 format is open and already has I/O libraries for your other languages. And depending on your data structure, they may be faster than the old binary mat files.
Simplify the data structures you're saving, preferring large arrays of primitives to complex container structures.
Try turning off compression if your data structures are still complex.
Try the v7.3 MAT file format using "-v7.3"
If using a network file system, consider saving and loading to a temporary dir on a fast local drive and copying to/from the network
For large data structures, your MAT file I/O speed may be determined more by the internal structure of the data you're writing out than the size of the resulting MAT file itself. (In my experience, this has usually been the major factor in slow MAT files.) When you say "arbitrary Matlab structure", that suggests you might be using cells, structs, or objects to make complex data structures. That slows down MAT I/O because there is per-array overhead in MAT file I/O, and the members of cell and struct arrays (container types) all count as separate arrays. For example, 5,000 strings stored in a cellstr are much, much slower than the same 5,000 strings stored in a 2-D char array. And objects have even more overhead. As a test, try writing out a 1 GB file that contains just a 1 GB primitive array of random uint8s, and see how long that takes. From there, see if you can simplify your data to reduce the total mxarray count, even if that means reshaping it for serialization. (My experience with this is mostly with the v7 format; the newer HDF5 format may have less per element overhead.)
If your data files live on the network, you could also try doing the save and load operations on temporary files on fast local drives, and separately using copy operations to move them back and forth between the network. At least on Windows networks, I've seen speedups of up to 2x from doing this. Possibly due to optimizations the full-file copy operation can do that the MAT I/O code can't.
It would probably be a substantial effort to come up with an alternate file format that supported fully arbitrary Matlab data structures and was portable to other languages. I'd try making smaller changes around your use of the existing format first.
mat format has changed with Matlab versions. v7.3 uses HDF5 format, which has builtin compression and other features and it can take a large time to read/write. However, you can force Matlab to use previous formats which are faster (but might take more space).
See here:
http://www.mathworks.com/help/matlab/import_export/mat-file-versions.html