How to get the dataset size of a Caffe net in python?

I looked at the Python example for LeNet and saw that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be computed instead of hard-coded? How can I get the number of samples in the dataset that a network points to, in Python?

You can use the lmdb library to read the LMDB directly:
import lmdb
db = lmdb.open('/path/to/lmdb_folder')  # requires the lmdb Python package
num_examples = int(db.stat()['entries'])
That should do the trick for you.

It seems that you have mixed up iterations and the number of samples in one question. The example only specifies the number of iterations, i.e. how many batches will be processed. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network input).
A more detailed explanation:
EDIT: Caffe loads (batch size x iterations) samples in total for training or testing, but the number of loaded samples is unrelated to the actual database size: after reaching the last record, Caffe starts reading from the beginning again. In other words, the database in Caffe acts like a circular buffer.
The example mentioned points to this configuration. We can see that it expects lmdb input and sets the batch size to 64 for the training phase (some more info about batches and BLOBs) and to 100 for the testing phase. It makes no assumption about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the size of a processing chunk, and the number of iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't specify the number of samples in the database, only the dataset name and format and the desired number of samples. As far as I know, there is currently no way to determine the database size with Caffe.
Thus, if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify corresponding values for batch size and iterations.
You have some options for this:
Look at the console output of ./examples/mnist/create_mnist.sh: it prints the number of samples while converting from the initial format (I believe that you followed this tutorial);
Follow @Shai's advice (read the lmdb file directly).
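Once the number of samples is known, the iteration count for one full pass follows from the batch size. The sketch below is only an illustration: the helper name iterations_for_full_pass is hypothetical, and the value 10000 assumes the standard MNIST test set. In practice the count would come from the lmdb entry count.

```python
import math

def iterations_for_full_pass(num_examples, batch_size):
    """Smallest number of batches that covers every sample at least once."""
    return math.ceil(num_examples / batch_size)

# Assumed values: 10000 test samples (standard MNIST test set) and the
# test-phase batch size of 100 mentioned in the answer above.
test_iter = iterations_for_full_pass(10000, 100)
print(test_iter)
```

If the batch size does not divide the dataset size evenly, the last batch wraps around to the start of the database, so a few samples are evaluated twice.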

Related

yolov4.cfg: increasing subdivisions parameter consequences

I'm trying to train on a custom dataset using the Darknet framework and YOLOv4. I built my own dataset, but I get an out-of-memory message in Google Colab. It also said "try to change subdivisions to 64" or something like that.
I've searched for the meaning of the main .cfg parameters such as batch, subdivisions, etc., and I understand that increasing the subdivisions number means splitting into smaller "pictures" before processing, thus avoiding the fatal "CUDA out of memory". Indeed, switching to 64 worked well. But I couldn't find an answer anywhere to the ultimate question: are the final weight file and accuracy "crippled" by doing this? More specifically, what are the consequences for the final result? If we put aside the training time (which would surely increase, since there are more subdivisions to train), how will the accuracy be affected?
In other words: if we use exactly the same dataset and train using 8 subdivisions, then do the same using 64 subdivisions, will the best_weight file be the same? And will the object detection success rate be the same or worse?
Thank you.

First, read the comments.
Suppose you have 100 batches, with
batch size = 64
subdivisions = 8
Each batch is then divided into 8 parts of 64/8 = 8 images.
These 8 parts are loaded into memory and processed one at a time. If memory is limited, you can change the parameter to match the available capacity.
You can also reduce the batch size, so that a batch takes less space in memory.
Subdivisions do nothing to the dataset images.
They only split a batch that is too large to load into memory into smaller pieces.
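The arithmetic above can be written out as a small sketch. The helper mini_batch_size is a hypothetical name, and the divisibility check is an assumption made for this illustration rather than documented Darknet behaviour.

```python
def mini_batch_size(batch, subdivisions):
    """Number of images processed at once when a batch is split into parts."""
    if batch % subdivisions != 0:
        # Assumption for this sketch: only even splits are considered.
        raise ValueError("batch must be divisible by subdivisions")
    return batch // subdivisions

# The same batch of 64 images, split two different ways.
for subdivisions in (8, 64):
    size = mini_batch_size(64, subdivisions)
    print(f"subdivisions={subdivisions}: {size} image(s) in memory at a time")
```

In both cases the full batch still contains 64 images. Only the number held in memory at one time changes.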

How to compensate if I can't use a large batch size in a neural network

I am trying to run action recognition code from GitHub. The original code used a batch size of 128 with 4 GPUs. I only have two GPUs, so I cannot match their batch size. Is there any way I can compensate for this difference in batch size? I saw somewhere that iter_size might compensate, according to the formula effective_batchsize = batch_size * iter_size * n_gpu. What is iter_size in this formula?
I am using PyTorch, not Caffe.

In PyTorch, when you perform the backward step (calling loss.backward() or similar), the gradients are accumulated in place. This means that if you call loss.backward() multiple times, the previously calculated gradients are not replaced; instead, the new gradients are added to the previous ones. That is why, when using PyTorch, it is usually necessary to explicitly zero the gradients between minibatches (by calling optimiser.zero_grad() or similar).
If your batch size is limited, you can simulate a larger batch size by breaking a large batch up into smaller pieces, and only calling optimiser.step() to update the model parameters after all the pieces have been processed.
For example, suppose you are only able to do batches of size 64, but you wish to simulate a batch size of 128. If the original training loop looks like:
optimiser.zero_grad()
loss = model(batch_data) # batch_data is a batch of size 128
loss.backward()
optimiser.step()
then you could change this to:
optimiser.zero_grad()
smaller_batches = batch_data[:64], batch_data[64:128]
for batch in smaller_batches:
    loss = model(batch) / 2
    loss.backward()
optimiser.step()
and the updates to the model parameters would be the same in each case (apart maybe from some small numerical error). Note that you have to rescale the loss to make the update the same.
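The equivalence can be checked without PyTorch. The following is a minimal plain-Python sketch, not the answer's training loop: it assumes a one-parameter linear model with a mean-squared-error loss, and the data values and the helper mse_grad are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Illustrative parameter and data (assumed values).
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient computed on the full batch in one go.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulated over two half-batches, each rescaled by 1 / num_chunks.
num_chunks = 2
chunk = len(xs) // num_chunks
accumulated = 0.0
for i in range(num_chunks):
    cx = xs[i * chunk:(i + 1) * chunk]
    cy = ys[i * chunk:(i + 1) * chunk]
    accumulated += mse_grad(w, cx, cy) / num_chunks

print(full_grad, accumulated)
```

The two values agree because the loss is a mean over samples. Without the division by num_chunks, the accumulated gradient would be twice as large.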

The important concept is not so much the batch size; it's the number of epochs you train for. Can you double the batch size per GPU, giving you the same overall batch size? If so, that will compensate directly for the problem. If not, double the number of iterations, so that you train for the same number of epochs. The model will quickly overcome the effects of the early-batch bias.
However, if you are comfortable digging into the training code, myrtlecat gave you an answer that will eliminate the batch-size difference quite nicely.

Set Batch Size and Number of Training Iterations for a neural network?

I am using the KNIME Doc2Vec Learner node to build a Word Embedding. I know how Doc2Vec works. In KNIME I have the option to set the parameters
Batch Size: The number of words to use for each batch.
Number of Epochs: The number of epochs to train.
Number of Training Iterations: The number of updates done for each batch.
From Neural Networks I know that (lazily copied from https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network):
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
As far as I understand, it makes little sense to set both batch size and iterations, because one is determined by the other (given the data size, which is fixed by the circumstances). So why can I change both parameters?

This is not necessarily the case. You can also train "half epochs". For example, in Google's InceptionV3 pretrained script, you usually set the number of iterations and the batch size at the same time. This can lead to "partial epochs", which can be fine.
Whether training half epochs is a good idea may depend on your data. There is a thread about this, but it has no conclusive answer.
I am not familiar with KNIME Doc2Vec, so I am not sure whether the meaning is somewhat different there. From the definitions you gave, setting batch size and iterations together seems fine. Setting the number of epochs as well could cause conflicts, though, leading to combinations where the numbers don't add up.
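To see how a fixed iteration count and batch size can produce a partial epoch, here is a small arithmetic sketch. It uses the Stack Exchange definitions quoted in the question, not KNIME's; the function name epochs_covered and the numbers are assumptions for illustration.

```python
def epochs_covered(num_samples, batch_size, iterations):
    """Fraction of the dataset seen, measured in epochs."""
    return batch_size * iterations / num_samples

# Assumed example: 1000 training samples, batches of 100, 15 iterations.
print(epochs_covered(num_samples=1000, batch_size=100, iterations=15))
```

Here 15 iterations of 100 samples cover the 1000-sample dataset one and a half times, which is the "partial epoch" situation described above.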

keras fit_generator parameter steps_per_epoch

I want to use the Keras model.fit_generator method. For that I wrote my own generator, and I need to define the parameter "steps_per_epoch". I want to use every training sample exactly once per epoch.
My problem is that I generate the features in the generator: I read WAV files and compute the FFT, so before training starts I don't know how many batches/samples I have. I could calculate the FFT for every file before calling fit_generator, but every time I change my dataset (>20 GB) I would need to recalculate the FFT for every file and save the count for steps per epoch. Is there a better way to make fit_generator use every sample only once without calculating the steps per epoch? Or can my own generator tell fit_generator when to start a new epoch?
Here is the code for my generator
import librosa

def my_generator(filename_list, batch_size):
    # Loop forever: fit_generator keeps pulling batches across epochs
    while True:
        for fname in filename_list:
            data, sr = librosa.load(fname)
            fft_result = librosa.core.stft(data)
            # features.create_batches is the asker's own helper (not shown)
            batches = features.create_batches(fft_result, batch_size)
            for batch in batches:
                yield (batch, label)  # label is not defined in this snippet

model.fit_generator(my_generator(filename_list=filename_list, batch_size=batch_size), steps_per_epoch=100, epochs=10)

For each file in the list, you have to calculate an FFT that yields 'n' batches, where 'n' is different for each file. If this is the case, then:
The naive method is to loop through the batch generator once to count the actual number of batches. This only needs to be done once, and you can save that number for future use.
The second method is to assign an arbitrary number to steps_per_epoch. That number should be greater than or equal to the number of files in the list multiplied by the number of batches each FFT can generate; the number of FFT batches can itself be an arbitrary estimate. If you then shuffle the data after the outer "for" loop completes, statistically speaking the model will have seen all of the training data after some number of epochs. With early stopping you can still get a properly converged model if "epochs" is set to a very large value, 1000 for example.
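The naive counting method can be sketched as follows. This is only an illustration: one_pass is a hypothetical stand-in for a single, non-repeating pass of the real generator, and plain integers replace the FFT frames so the example runs without audio files.

```python
def one_pass(file_lengths, batch_size):
    """Hypothetical single pass: yields one batch per chunk of each file."""
    for n_frames in file_lengths:
        for start in range(0, n_frames, batch_size):
            yield list(range(start, min(start + batch_size, n_frames)))

def count_batches(batches):
    """Consume a finite generator and count how many batches it yields."""
    return sum(1 for _ in batches)

# Assumed frame counts for three files; real values would come from the FFTs.
file_lengths = [10, 25, 7]
steps_per_epoch = count_batches(one_pass(file_lengths, batch_size=8))
print(steps_per_epoch)
```

The counted value could then be stored next to the dataset and reused until the file list changes. Note that the counting pass must not contain the endless while loop, or it would never finish.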

PyBrain: MemoryError: while loading training dataset

I am trying to train a feedforward neural network for binary classification. My dataset is 6.2M with 1.5M dimension. I am using PyBrain. I am unable to load even a single data point: I get a MemoryError.
My code snippet is:
import numpy
from pybrain.datasets import SupervisedDataSet

Train_ds = SupervisedDataSet(FV_length, 1)  # FV_length is a computed value: 150000
feature_vector = numpy.zeros(FV_length, dtype=int)  # numpy.int was removed in NumPy 1.24
# activate feature values
for index in nonzero_index_list:
    feature_vector[index] = 1
Train_ds.addSample(feature_vector, class_label)  # both the arguments are tuples

It looks like your computer simply does not have enough memory to add your feature and class label arrays to the supervised dataset Train_ds.
If there is no way to make more memory available on your system, it might be a good idea to randomly sample from your dataset and train on the smaller sample.
This should still give accurate results, assuming the sample is large enough to be representative.
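One way to draw such a sample is to pick row indices at random and load only those rows. The sketch below uses only the standard library. The helper sample_indices, the sample size of 10,000 and the reading of "6.2M" as the number of samples are all assumptions for illustration.

```python
import random

def sample_indices(num_samples, sample_size, seed=0):
    """Pick sample_size distinct row indices uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(num_samples), sample_size))

# Assumed sizes: 6.2M rows in total, 10,000 rows kept for training.
subset = sample_indices(num_samples=6_200_000, sample_size=10_000)
print(len(subset), subset[:5])
```

Only the rows whose indices appear in subset would then be converted to feature vectors and added to the dataset, which keeps memory use proportional to the sample size rather than the full dataset.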