Does h5py read the whole file into memory? - h5py

Does h5py read the whole file into memory?
If so, what if I have a very, very big file?
If not, will it be quite slow if I fetch data from the hard disk every time I want a single piece of data? How can I make it faster?

Does h5py read the whole file into the memory?
No, it does not. In particular, slicing (dataset[50:100]) allows you to load fractions of a dataset into memory. For details, see the h5py docs.
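For illustration, a minimal sketch of slice-based reading (the file and dataset names here are just placeholders):

import h5py

# Opening the file only reads metadata; no dataset values are loaded yet.
with h5py.File("data.h5", "r") as f:      # placeholder filename
    dset = f["mydataset"]                 # a handle, still nothing in memory
    part = dset[50:100]                   # reads only elements 50..99 from disk into a NumPy array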
If not, will it be quite slow if I fetch data from the hard disk every time I want a single piece of data?
In general, hdf5 is very fast. But reading from memory is obviously faster than reading from disk. It's your decision how much of a dataset is read into memory (dataset[:] loads the whole dataset).
How can I make it faster?
If you care to optimize performance, you should read the sections about chunking and compression. There's also a book that explains these things in detail (disclaimer: I'm not the author).
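As a rough sketch of what chunking and compression look like in h5py (the filename, shapes and chunk size below are arbitrary placeholders, not recommendations):

import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:   # placeholder filename
    # Chunks are the unit of I/O: a read only touches the chunks that
    # overlap the requested slice, and each chunk is compressed separately.
    dset = f.create_dataset(
        "mydataset",
        shape=(10000, 1000),
        dtype="f8",
        chunks=(100, 1000),               # pick chunks that match your typical access pattern
        compression="gzip",
    )
    dset[0:100, :] = np.random.rand(100, 1000)   # writes exactly one chunk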

In case you need to load the entire HDF5 file (possibly nested) into memory, here is a simple utility function:
import h5py

def load_entire_hdf5(dct):
    """Recursively read every dataset under an h5py File/Group into a nested dict."""
    if isinstance(dct, h5py.Dataset):
        return dct[()]                    # read the full dataset into memory
    ret = {}
    for k, v in dct.items():
        ret[k] = load_entire_hdf5(v)      # recurse into sub-groups
    return ret

with h5py.File("<filepath>", "r") as f:
    data = load_entire_hdf5(f)

Related

How does mmap() help read information at a specific offset versus regular Posix I/O

I'm trying to understand something a bit better about mmap. I recently read this portion of an accepted answer to the related Stack Overflow question mmap and memory usage (quoted below):
Let's say you read a 100MB chunk of data, and according to the initial 1MB of header data, the information that you want is located at offset 75MB, so you don't need anything between 1~74.9MB! You have read it for nothing but to make your code simpler. With mmap, you will only read the data you have actually accessed (rounded 4kb, or the OS page size, which is mostly 4kb), so it would only read the first and the 75th MB.
I understand most of the benefits of mmap (no need for context switches, no need to swap contents out, etc.), but I don't quite understand this offset. If we don't mmap and we need information at the 75MB offset, can't we do that with standard POSIX file I/O calls without having to use mmap? Why exactly does mmap help here?
Of course you could. You can always open a file and read just the portions you need.
mmap() can be convenient when you don't want to write said code or you need sparse access to the contents and don't want to have to write a bunch of caching logic.
With mmap(), you're "mapping" the entire contents of the file to offsets in memory. Most implementations of mmap() do this lazily, so each ~4K block of the file is read on demand, as you access those memory locations.
All you have to do is access the data in your file as if it were a huge array of chars (e.g. int *someInt = (int *)&map[750000000]; return *someInt;), and let the OS worry about which portions of the file have been read, when to read the file and how much, writing dirty data blocks back to the file, and purging the memory to free up RAM.
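To make this concrete, here is a small Python sketch of both approaches (the filename and offset are placeholders). The seek/read version reads only the bytes you ask for; the mmap version touches only the pages around the offset:

import mmap
import struct

offset = 75 * 1024 * 1024                          # jump straight to the 75 MB mark

with open("big.bin", "rb") as fh:                  # placeholder filename
    # Plain file I/O: seek + read fetches only the bytes you request.
    fh.seek(offset)
    raw = fh.read(4)

    # mmap: map the whole file, but pages are read lazily on access.
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    value, = struct.unpack_from("<i", mm, offset)  # only the pages around `offset` get paged in
    mm.close()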

Memory issues with large amounts of data stored as nested cells in MATLAB

I have large amounts of data stored as nested cells in .mat files. My biggest problem right now is the load times for accessing these files, but I'm wondering if the underlying problem is that I came up with an inefficient way for storing the data and I should restructure it to be smaller.
The full file consists of a cell array:
Hemi{1,h} where there are 52 versions of h
.{n,p} where there are 85 versions of n and up to ~100 versions of p
.Variable where there are 10 variables, each with ~2500 values
This full file ate up all my memory, so I saved it in parts, i.e.:
Hemi1.mat=Hemi{1,1}
Hemi2.mat=Hemi{1,2}
etc.
The next step for this application is to load each file, determine which part of it is an appropriate solution (I need Hemi{1,h}.{n,p}.Var1, Hemi{1,h}.{n,p}.Var2, and Hemi{1,h}.{n,p}.Var3 for this, but I still need to keep track of the other Variables), save the solution, then close the file and move to the next one.
Is there a faster way to load these files?
Is the problem less my dataset and more how I've chosen to store it? Is there a better alternative?
That is quite a lot of data. I have a few suggestions you could look into. The first is to see whether you can change the datatypes to something like categorical arrays, which are far more memory efficient. Also, if you are storing strings as your final data, that can be quite heavy storage-wise.
Second, you could look into HDF5 file storage. I hear it is a nice way to store structured data.
Finally, you could try converting your {n,p} arrays into table structures. I am not sure whether this is better for memory, but tables are nice to work with and may help you out. (Depending on your version of Matlab you may not have tables :P)
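On the HDF5 point: .mat files saved with MATLAB's -v7.3 flag are HDF5 containers under the hood, so individual variables can also be read lazily from Python with h5py. A rough sketch (the file and variable names are placeholders; note that nested cell arrays show up as HDF5 object references and need an extra dereferencing step):

import h5py

with h5py.File("Hemi1.mat", "r") as f:    # a .mat file saved with save(..., '-v7.3')
    var1 = f["Var1"][()]                  # loads only this variable, not the whole file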
I hope this helps!
-Kyle

Confused about the advantage of MongoDB gridfs

MongoDB GridFS says the big advantage is that it splits a big file into chunks, so you don't have to load the entire file into memory if you just want to see part of it. But my confusion is this: even if I open a big file from local disk, I can just use a skip() API to load only the part of the file I want. I don't have to load the entire file at all. So why does MongoDB say that this is an advantage?
Even though the cursor.skip() method does not return the entire file, it still has to load it into memory: the server must walk from the beginning of the collection or index to the offset (skip position) before it starts returning results. (This doesn't matter much when the collection is small.)
As the offset increases, cursor.skip() becomes slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
GridFS, however, does not store a file in a single document; instead it divides the file into parts, or chunks, and stores each chunk as a separate document.
This allows the user to access information from arbitrary sections of a file, e.g. to "skip" to the middle of a file (using the id or filename), without the operation being CPU intensive.
Official documentation: 1. Skip, 2. GridFS.
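For completeness, a small sketch of this chunked access pattern with pymongo's gridfs module (the connection string, database and file names are placeholders, and it assumes a running MongoDB instance):

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string
db = client["mydb"]                                 # placeholder database name
fs = gridfs.GridFS(db)

# Storing a file: GridFS splits it into chunks (255 kB by default),
# one document per chunk.
with open("big_video.bin", "rb") as fh:             # placeholder filename
    file_id = fs.put(fh, filename="big_video.bin")

# Reading from the middle: seek() works out which chunk holds the offset,
# so read() only fetches the chunks covering the requested range.
grid_out = fs.get(file_id)
grid_out.seek(50 * 1024 * 1024)                     # jump to the 50 MB mark
part = grid_out.read(1024)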
Update:
About what Peter Brittain is suggesting:
There are many things to consider (infrastructure, expected usage patterns, file size, etc.) when choosing between the filesystem and GridFS.
For example: if you have millions of files, GridFS tends to handle that better; you also need to consider filesystem limitations such as the maximum number of files per directory.
You might want to consider going through this article:
Why use GridFS over ordinary Filesystem Storage?

Writing to/loading from a file vs. storing in GUI appdata

I am doing calculations on an image and intend to save a variable to a .mat file. When accessing the variable, would it be faster for me to load from the file or store the variable to the GUI appdata?
Also, I noticed that when I originally didn't save the variable (an 89x512x512 double array), it ran much faster. Is saving to a file generally expensive in terms of time?
You already have that array in memory, so storing it via setappdata/getappdata is certainly the faster alternative and, given the moderate size, doesn't have any real drawback.
So, no reason to store it to a file, imho.
And yes, writing data to a file is comparatively slow, i.e. it takes a certain minimum amount of time no matter how tiny your data is.

Is Tie::File lazily loading a file?

I'm planning on writing a simple text viewer, which I'd expect to be able to deal with very large files. I was thinking of using Tie::File for this and paginating the lines. Does it load the lines lazily, or all of them at once?
It won't load the whole file. From the documentation:
The file is not loaded into memory, so this will work even for gigantic files.
As far as I can see from its source code, it keeps only the lines that have been used in memory, and yes, it loads data only when needed.
You can limit the amount of memory used with the memory parameter.
It also tracks the offsets of all lines in the file to optimize disk access.