I have created an LMDB file that contains non-encoded 6-channel images. When I load it into a network in Caffe, the system RAM usage (as seen with the 'top' command) starts at around 10% once the network is loaded, but it keeps increasing until it goes above 90%. I am using a system with 32 GB of RAM; it begins to slow down extremely, until the code crashes with the following error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Note that this happens even before running a single forward pass.
The size of the lmdb file I'm using is 545 MB.
I've used the Python NetSpec interface to define the network. Here is the code:
net = caffe.NetSpec()
net.data0, net.label = CreateAnnotatedDataLayer(train_data,
        batch_size=1, train=True, output_label=True,
        label_map_file=label_map_file,
        transform_param=train_transform_param,
        batch_sampler=batch_sampler)
net.data, net.data_d = L.Slice(net.data0, slice_param={'axis': 1}, ntop=2, name='data_slicer')
Since my LMDB has 6-channel images and the pretrained network expects 3 channels, I am using a Slice layer to split each 6-channel image into two 3-channel images that can be fed into two different convolutional layers.
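For illustration, here is a minimal sketch (with hypothetical branch names and made-up convolution parameters) of how the two 3-channel slices could feed separate convolution branches; the explicit slice_point makes the 3/3 split clear:
net.data, net.data_d = L.Slice(net.data0,
        slice_param={'axis': 1, 'slice_point': [3]},
        ntop=2, name='data_slicer')
# hypothetical first layers of the two branches; num_output/kernel_size are placeholders
net.conv1 = L.Convolution(net.data, num_output=64, kernel_size=3, pad=1)
net.conv1_d = L.Convolution(net.data_d, num_output=64, kernel_size=3, pad=1)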
Any suggestions would be helpful.
I'm trying to train on a custom dataset using the Darknet framework and YOLOv4. I built my own dataset, but I get an out-of-memory message in Google Colab. It also said something like "try to change subdivisions to 64".
I've searched around for the meaning of the main .cfg parameters such as batch, subdivisions, etc., and I understand that increasing the subdivisions number means splitting into smaller "pictures" before processing, thus avoiding the fatal "CUDA out of memory" error. And indeed, switching to 64 worked well. Now I couldn't find the answer to the ultimate question anywhere: are the final weights file and the accuracy "crippled" by doing this? More specifically, what are the consequences for the final result? If we put aside the training time (which would surely increase since there are more subdivisions to train), what will the accuracy be?
In other words: if we use exactly the same dataset and train using 8 subdivisions, then do the same using 64 subdivisions, will the best_weight file be the same? And will the object detection success % be the same or worse?
Thank you.
First, read the comments.
Suppose you have 100 batches, with:
batch size = 64
subdivision = 8
Darknet will divide each batch: 64 / 8 => 8 images per part.
It will then load and work on those 8 parts one by one in RAM; if your RAM capacity is low, you can change this parameter according to your RAM capacity.
You can also reduce the batch size, so it takes up less space in RAM.
This does nothing to the dataset's images.
It just splits a large batch that can't be loaded into RAM at once into smaller pieces.
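To make the arithmetic concrete, a tiny Python sketch (the values are the ones from above; the variable names are just illustrative):
batch = 64          # images accumulated per weight update
subdivisions = 8    # how many chunks the batch is split into
mini_batch = batch // subdivisions   # 64 // 8 = 8 images loaded at a time
print(subdivisions, "mini-batches of", mini_batch, "images each per iteration")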
I ran out of memory (11 GB VRAM) while testing my CNN with 10 test images. I'm using the U-Net architecture with 20 training images (1600x1200x1 each), 48x48 patches (190000 sub-images) and a batch size of 32 (as recommended).
So right now I'm testing my network 5 times with 2 images each. After that I want to evaluate my network using one ROC curve.
So here are my questions: can I still evaluate my network if I split the testing? If yes, how can I manage it?
If not, what do I have to change in my config so that the memory doesn't run out?
By the way, I'm a beginner with NNs and I'm sorry for my bad English!
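If it helps to see the idea, here is a minimal sketch of how chunked testing could still be aggregated into a single ROC curve, assuming a Keras-style model.predict() and scikit-learn (test_chunks, model and the label shapes are placeholders for your own setup):
import numpy as np
from sklearn.metrics import roc_curve, auc

all_scores, all_labels = [], []
for images, labels in test_chunks:            # e.g. 5 chunks of 2 images each
    scores = model.predict(images, batch_size=32)
    all_scores.append(scores.ravel())
    all_labels.append(labels.ravel())

fpr, tpr, _ = roc_curve(np.concatenate(all_labels), np.concatenate(all_scores))
print("AUC:", auc(fpr, tpr))                  # one curve over the whole test set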
I'm trying to train a network on my own data. The whole dataset consists of 256x256 JPEG images with 236 object classes. The training and validation sets have ~247K and ~61K images, respectively. I've made LMDBs from them using the $CAFFE_ROOT/build/tools/convert_imageset utility.
Just to get started, I'm using CaffeNet's topology for my model. During training I come across the weird message "Data layer prefetch queue empty" that I have never seen before.
Moreover, the network initially has an abnormally low accuracy (~0.00378378); over the next 1000 iterations it reaches at most ~0.01 and does not increase further (it just fluctuates).
What am I doing wrong, and how can I improve the accuracy?
Runtime log:
http://paste.ubuntu.com/15568421/
Model:
http://paste.ubuntu.com/15568426/
Solver:
http://paste.ubuntu.com/15568430/
P.S. I'm using the latest version of Caffe, Ubuntu Server 14.04 LTS and a g2.2xlarge instance on AWS.
I'm working with a reasonably sized net (1 convolutional layer, 2 fully connected layers). Every time I save variables using tf.train.Saver, the .ckpt files each take half a gigabyte of disk space (512 MB to be exact). Is this normal? I have a Caffe net with the same architecture that requires only a 7 MB .caffemodel file. Is there a particular reason why TensorFlow saves such large files?
Many thanks.
Hard to tell how large your net is from what you've described -- the number of connections between two fully connected layers scales up quadratically with the size of each layer, so perhaps your net is quite large depending on the size of your fully connected layers.
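As a back-of-the-envelope illustration of why fully connected layers can dominate checkpoint size (the layer sizes below are made up, not taken from your net):
n_in, n_out = 9216, 4096                  # e.g. a flattened conv output feeding the first FC layer
params = n_in * n_out + n_out             # weights + biases
print(params, params * 4 / 2**20, "MiB")  # ~37.8M parameters, ~144 MiB at 4 bytes (float32) each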
If you'd like to save space in the checkpoint files, you could replace this line:
saver = tf.train.Saver()
with the following:
saver = tf.train.Saver(tf.trainable_variables())
By default, tf.train.Saver() saves all variables in your graph -- including the variables created by your optimizer to accumulate gradient information. Telling it to save only trainable variables means it will save only the weights and biases of your network, and discard the accumulated optimizer state. Your checkpoints will probably be a lot smaller, with the tradeoff that you may experience slower training for the first few training batches after you resume training, while the optimizer re-accumulates gradient information. It doesn't take long at all to get back up to speed, in my experience, so personally I think the tradeoff is worth it for the smaller checkpoints.
Maybe you can try (in Tensorflow 1.0):
saver.save(sess, filename, write_meta_graph=False)
which doesn't save the meta graph information.
See:
https://www.tensorflow.org/versions/master/api_docs/python/tf/train/Saver
https://www.tensorflow.org/programmers_guide/meta_graph
Typically you only save tf.global_variables() (which is shorthand for tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES), i.e. the collection of global variables). This collection is meant to include the variables that are necessary for restoring the state of the model, so things like the current moving averages for batch normalization, the global step, the states of the optimizer(s) and, of course, the tf.GraphKeys.TRAINABLE_VARIABLES collection. Variables of a more temporary nature, such as the gradients, are collected in LOCAL_VARIABLES; it is usually not necessary to store them, and they can take up a lot of disk space.
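A short sketch of what that looks like with the TF 1.x API (the checkpoint path is just a placeholder and sess is your existing session):
import tensorflow as tf  # TF 1.x

saver = tf.train.Saver(var_list=tf.global_variables())  # excludes LOCAL_VARIABLES such as gradient accumulators
saver.save(sess, 'model.ckpt', write_meta_graph=False)  # optionally also skip the meta graph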
I'm looking at the Python example for LeNet and see that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be determined instead of hard-coded? How can I get, in Python, the number of samples in the dataset that a network points to?
You can use the lmdb library to access the lmdb directly:
import lmdb
db = lmdb.open('/path/to/lmdb_folder')  # needs the lmdb Python package
num_examples = int(db.stat()['entries'])
Should do the trick for you.
It seems that you mixed up iterations and the number of samples in one question. In the provided example we only see the number of iterations, i.e. how many times the training phase will be repeated. There is no direct relationship between the number of iterations (a network training parameter) and the number of samples in the dataset (the network's input).
Some more detailed explanation:
EDIT: Caffe will load (batch size x iterations) samples in total for training or testing, but there is no relation between the number of loaded samples and the actual database size: it will start reading from the beginning again after reaching the last record of the database - in other words, the database in Caffe acts like a circular buffer.
The example mentioned points to this configuration. We can see that it expects lmdb input and sets the batch size to 64 (some more info about batches and BLOBs) for the training phase and 100 for the testing phase. We really don't make any assumption about the input dataset size, i.e. the number of samples in the dataset: the batch size is only the processing chunk size, and iterations is how many batches Caffe will take. It won't stop after reaching the end of the database.
In other words, the network itself (i.e. the protobuf config files) doesn't point to any number of samples in the database - only to the dataset name and format and the desired number of samples per batch. As far as I know, there is currently no way to determine the database size with Caffe itself.
Thus if you want to load the entire dataset for testing, your only option is to first determine the number of samples in mnist_test_lmdb or mnist_train_lmdb manually, and then specify the corresponding values for batch size and iterations (see the sketch after the list below).
You have some options for this:
Look at the console output of ./examples/mnist/create_mnist.sh - it prints the number of samples while converting from the initial format (I believe that you followed this tutorial);
follow Shai's advice (read the lmdb file directly).
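If useful, a small sketch combining both options: read the sample count from the lmdb with the lmdb package and derive the number of test iterations for a given batch size (the path is the default one created by the Caffe MNIST example):
import lmdb

batch_size = 100                                  # TEST-phase batch size from the prototxt
db = lmdb.open('examples/mnist/mnist_test_lmdb', readonly=True)
num_examples = int(db.stat()['entries'])          # 10000 for the MNIST test set
test_iter = (num_examples + batch_size - 1) // batch_size  # iterations needed to cover the whole set
print(num_examples, test_iter)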