Why is reading h5 files extremely slow? - h5py

I have a data generator that works, but reading data from a 200k-image dataset is extremely slow.
I use:
X=f[self.trName][idx * self.batch_size:(idx + 1) * self.batch_size]
after having opened the file with f=h5py.File(fileName,'r')
It seems to get slower as idx gets larger (sequential access?),
but in any case it takes at least 10 seconds (sometimes >20 s) to read a batch, which is far too slow (especially since I'm reading from an SSD!).
Any ideas?
The dataset takes 50.4 GB on disk (compressed) and its shape is:
(210000, 2, 128, 128)
(this is the shape of the training set; the targets have the same shape and are stored as another dataset inside the same .h5 file)
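For reference, here is a minimal, self-contained sketch of the access pattern described above (the file name, dataset name, and batch size are hypothetical placeholders, not values from the original setup); printing the dataset's chunk and compression settings shows how the data is laid out on disk, which strongly affects how fast a slice like this can be read:
import h5py

# Hypothetical placeholders for the names used in the generator above.
fileName = 'train.h5'
trName = 'training'
batch_size = 32

f = h5py.File(fileName, 'r')
dset = f[trName]

print(dset.shape)        # e.g. (210000, 2, 128, 128)
print(dset.chunks)       # chunk layout chosen when the file was written
print(dset.compression)  # e.g. 'gzip'

idx = 0
# Read one batch, exactly as in the generator above.
X = dset[idx * batch_size:(idx + 1) * batch_size]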

Related

How much storage do I need for Postgres compared to the data size

I did some analysis on sample data and found that the table size is usually about twice the raw data size (I imported a CSV file into a Postgres table and took the CSV file size as the raw data size).
The total disk space used seems to be about 4 times the raw data size, most likely because of the WAL.
Is there a commonly used formula to estimate how much disk space I need to store, say, 1 GB of data?
I know many factors affect this; I just want a quick estimate.
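As a rough back-of-the-envelope sketch based only on the ratios observed above (about 2x for the table and about 4x total including WAL; these factors come from the sample data, not from any general PostgreSQL rule):
# Rough estimate using the ratios observed in the sample data above.
# The factors are assumptions from that sample, not PostgreSQL constants.
def estimate_disk_gb(raw_gb, table_factor=2.0, total_factor=4.0):
    table_gb = raw_gb * table_factor  # table size vs. raw CSV size
    total_gb = raw_gb * total_factor  # total disk incl. WAL vs. raw CSV size
    return table_gb, total_gb

print(estimate_disk_gb(1.0))  # 1 GB of raw data -> (2.0, 4.0)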

Performance issue with reading DICOM data into cell array

I need to read 4000 or more DICOM files. I have written the following code to read the files and store the data in a cell array so I can process them later. A single DICOM file contains 128 * 931 values, but once I execute the code it takes more than 55 minutes to complete the loop. Can someone point out the performance issue in the following code?
% read the file information from the disk into memory
readFile=dir('d:\images\*.dcm');
for i=1:4000
% Read the information from the DICOM files into arrays
data{i}=dicomread(readFile(i).name);
info{i}=dicominfo(readFile(i).name);
data_double{i}=double(data{1,i}); % convert 16-bit data to double
first_chip{i}=data_double{1,i}(1:129,1:129); % extract the first chip data into an array
end
You are reading 128 * 931 * 4000 pixels into memory (assuming 16-bit values, that's nearly 1 GB), converting them to doubles (nearly 4 GB), and extracting a region (129 * 129 * 4000 * 8 bytes = 0.5 GB). You are keeping all three of these copies in memory, which is a huge amount of data! Try not keeping all that data around:
readFile = dir('d:\images\*.dcm');
first_chip = cell(size(readFile));
info = cell(size(readFile));
for ii = 1:numel(readFile)
   info{ii} = dicominfo(readFile(ii).name);
   data = dicomread(info{ii});
   data = data(1:129,1:129);       % extract the first chip data (ROI) first
   first_chip{ii} = double(data);  % then convert the 16-bit data to double
end
Here, I have pre-allocated the first_chip and info arrays. If you don't do this, the arrays will be re-allocated every time you add an element, causing expensive copies. I have also extracted the ROI first, then converted to double, as suggested by Rahul in his answer. Finally, I am re-using the DICOM info structure to read the file. I don't know if this makes a big difference in speed, but it saves the dicomread function some effort.
But note that this process will still take a considerable amount of time. Reading DICOM files is complex, and takes time. I suggest you read them all in once, then save the first_chip and info cell arrays into a MAT-file, which will be a lot faster to read in at a later time.
You can run the profiler to check which part of the code is taking up most of the time. As far as I can tell, though, the time taken is genuine given your iteration count. You could try parallel computing (a parfor loop) if you have a multicore processor; that should decrease the runtime significantly, depending on the number of cores you have.
One suggestion would be to extract the 'first chip data' first and then convert it to double, as the conversion takes a significant amount of time.

How to get the dataset size of a Caffe net in python?

I looked at the Python example for LeNet and saw that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be obtained without hard-coding it? How can I get the number of samples of the dataset a network points to in Python?
You can use the lmdb library to access the LMDB directly:
import lmdb
db = lmdb.open('/path/to/lmdb_folder')  # needs the py-lmdb package
num_examples = int(db.stat()['entries'])
That should do the trick for you.
It seems that you mixed up iterations and the number of samples in one question. In the provided example we can only see the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
Some more detailed explanation:
EDIT: Caffe will load a total of (batch size x iterations) samples for training or testing, but there is no relation between the number of loaded samples and the actual database size: after reaching the last record in the database it simply starts reading again from the beginning - in other words, the database in Caffe acts like a circular buffer.
The mentioned example points to this configuration. We can see that it expects LMDB input and sets the batch size to 64 for the training phase and 100 for the testing phase (some more info about batches and BLOBs). No assumption is made about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the processing chunk size, and iterations determines how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database - only to the dataset name and format and the desired number of samples per batch. As far as I know, there is currently no way to determine the database size with Caffe itself.
Thus, if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify corresponding values for batch size and iterations (a sketch of this is shown after the list below).
You have some options for this:
Look at the ./examples/mnist/create_mnist.sh console output - it prints the number of samples while converting from the initial format (I believe you followed this tutorial);
follow @Shai's advice (read the LMDB file directly).
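As a minimal sketch of the second option (assuming the py-lmdb package and a hypothetical path to the MNIST test LMDB), you can count the entries and derive the number of test iterations from the batch size configured in the prototxt:
import math
import lmdb

# Hypothetical path - point this at your mnist_test_lmdb directory.
db = lmdb.open('examples/mnist/mnist_test_lmdb', readonly=True)
num_samples = int(db.stat()['entries'])  # number of records in the LMDB
db.close()

batch_size = 100  # must match the TEST-phase batch size in the prototxt
test_iterations = int(math.ceil(num_samples / float(batch_size)))
print(num_samples, test_iterations)  # e.g. 10000 samples -> 100 iterations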

loading large csv files in Matlab

I have CSV files of size 6 GB, and I tried using the import function in Matlab to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and Matlab could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I could read the data column-wise so that I get 2329 column vectors of length 133076, that would be great.
I am using Matlab 2014a
Numeric data are by default stored by Matlab in double precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large csv-file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help, which allows you to only read part of the data at one time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.

Taking the left and right channels of two wav files and joining them into one separate stereo file

I have two stereo wav files; I would like to take the left channel of the first audio file and the right channel of the second audio file and join them into one new wav file.
Here's an image of what I'm trying to do.
I know I can read the files into Matlab/Octave and get the separate left and right channels with the code below:
[imported_sig_1, fs_rate, nbitsraw] = wavread('/tmp/01a.wav');
imported_sig_L = imported_sig_1(:,1)';
[imported_sig_2, fs_rate, nbitsraw] = wavread('/tmp/02a.wav');
imported_sig_R = imported_sig_2(:,2)';
I can then write out the new channels that I want using the code
wavwrite([imported_sig_L' imported_sig_R'], fs_rate, 16, 'newfile.wav');
The problem I'm running into is the time it takes to import the files and the size of the arrays they take up. The files I'm importing are about 1-4 hours long, so importing takes a while and uses a lot of memory. Is there a way around importing the full files and then exporting them?
I'm using Octave 3.8.1 (which is like Matlab) on Ubuntu 14.04, but I also have access to SoX.
I assume the bottleneck is your hard drive and that your system has sufficient memory to keep all three files in memory at the same time. If so, you won't gain any speed by reading only one channel: with a 16-bit stereo wav your HDD would have to skip 2 bytes, read 2 bytes, skip 2 bytes, read 2 bytes... For such a read pattern it is much faster to copy the full file into memory and remove the unwanted channels afterwards.