How important is the Input Size for Deep Learning Architectures? - matlab

Recently, I've been playing with MATLAB's RCNN deep learning example here. In this example, MATLAB builds a basic 15-layer CNN with an input size of 32x32 and pre-trains it on the CIFAR10 dataset, whose training images are also 32x32. It then fine-tunes this CNN on a small dataset of stop signs. That dataset has only 41 images, so these 41 images are used to fine-tune the CNN, i.e. to train an RCNN network. This is how it detects a stop sign:
As you can see, the bounding box covers almost the whole stop sign except for a small part at the top.
Playing with the code, I decided to fine-tune the same network, pre-trained on the CIFAR10 dataset, with the PASCAL VOC dataset, but only for the "aeroplane" class.
These are some results I get:
As you can see, the detected bounding boxes barely cover the airplane, which causes the precision to be 0 later when I evaluate them. I understand that in the original RCNN paper mentioned in the MATLAB example the input size is 227x227 and the CNN has 25 layers. Could this be why the detections are not accurate? How does the input size of a CNN affect the end result?

Almost surely yes!
When you pass an image through a net, the net progressively reduces the data taken from the image until only the most relevant information remains. During this process, the input shrinks again and again. If, for example, you feed the net an image that is smaller than the size it was designed for, most of the information in the image may be lost on its way through the net.
In your case, a possible reason for your results is that the net "looks for" features at a limited resolution, and the large airplane may simply be at too high a resolution for it.
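To make the "shrinks again and again" point concrete, here is a minimal sketch that traces the spatial size of the feature map through a stack of down-sampling layers, using the usual output-size formula floor((n + 2p - k) / s) + 1. The stack of three 2x2, stride-2 pooling layers is an assumption for illustration only; it is not the layer configuration of the MATLAB example or of the RCNN paper.

```python
def layer_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after one convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(input_size, layers):
    """Return the spatial size after each (kernel, stride, padding) layer."""
    sizes = [input_size]
    for kernel, stride, padding in layers:
        sizes.append(layer_output_size(sizes[-1], kernel, stride, padding))
    return sizes


# Hypothetical stack: three 2x2 max-pooling layers with stride 2, no padding.
POOL_STACK = [(2, 2, 0)] * 3

small = trace_sizes(32, POOL_STACK)    # 32x32 input, as in CIFAR10
large = trace_sizes(227, POOL_STACK)   # 227x227 input, as in the RCNN paper
print(small)
print(large)
```

With these assumed layers, a 32x32 input is down to 4x4 after three poolings, while a 227x227 input still has a 28x28 map, so far more spatial detail is left for the later layers to work with.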


neural network converges too fast and predicts blank results

I am using a UNet model to train a segmentation algorithm with roughly 1,000 grayscale medical images and 1,000 corresponding masks, where the section of interest in the medical image is marked with white pixels and the background is black.
I am using dice loss and a similar dice score as an accuracy metric to account for the fact that my white pixels are generally fewer in number than the black background pixels. But I am still having a few problems when training:
1) The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Assumption - Overfitting:
I assumed this was due to overfitting, so I augmented the dataset as much as possible with rigid transformations - flipping and rotating, but still no help.
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
2) The model doesn't look like it's even training. I was able to check the model before it reduced all the test data to blackness, but even then the results looked like blurry versions of the original images, without segmenting the features highlighted by my training masks.
3) The loss vs epochs and accuracy vs epochs output charts are very smooth: They present none of the oscillating behaviour that I expect to see when doing semantic segmentation. According to this related post a smooth chart usually occurs when there is only one class. I however assumed that my model would see the training masks (white pixels vs black pixels) and see that as a two class problem. Am I wrong in this assumption?
4) According to this post, Dice is good for an unbalanced training set. I have also tried to get precision/recall/F1 results as they suggest, but was unable to do so, and I assume this might be related to my 3rd issue, where the model sees my segmentation task as a single class problem.
TLDR: How can I fix the black output results I am getting? Can you please help me clarify if my learning model is actually seeing my white and black pixels in each mask as two separate classes and if not what is it actually doing?
Your model is only predicting one class (the background/black pixels) because of the class imbalance.
The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Lower your learning rate. 0.01 is really high, so try something like 3e-5 for your learning rate and see how your model performs.
Also, an accuracy of 100% (supposedly you're using dice?) suggests that plain accuracy is still what is being reported, so I suspect that your model is not actually using dice/dice loss for training and evaluation (code snippets would be appreciated).
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
Try using model.evaluate(test_data, test_label). If the evaluated performance looks good (dice should be extremely low if you're only predicting 0s), then either your labels are messed up in some way or there is something wrong with your pipeline.
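As a quick illustration of why dice should be extremely low for an all-zero prediction while plain accuracy still looks great, here is a small sketch on a toy flattened mask. The function names and the 5% foreground ratio are made up for the example, and the tiny eps is only there to avoid a division by zero; this is not the asker's metric code.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and mask agree."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


# Toy mask: 1000 pixels, only 50 of them foreground (assumed imbalance).
target = [1] * 50 + [0] * 950
blank_prediction = [0] * 1000

acc_blank = pixel_accuracy(blank_prediction, target)
dice_blank = dice_score(blank_prediction, target)
dice_perfect = dice_score(target, target)
print(acc_blank, dice_blank, dice_perfect)
```

On this toy mask the blank prediction gets 95% pixel accuracy but a dice score of essentially 0. If the reported metric climbs to 100% while the outputs are blank, it is worth checking which metric is actually being computed.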
Possible Solutions if all else fails:
Make sure to go through all the sanity checks in this article.
You might not have enough data, so try to use a patchwise approach with random crops.
Add more regularization (dropout, BatchNormalization, InstanceNormalization, increasing input image size, etc.)
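For the patchwise suggestion, the main thing to get right is that the image and its mask are cropped at the same location. A minimal sketch, assuming images and masks are plain 2D lists of equal size; `random_patch` is a hypothetical helper, not part of any library:

```python
import random


def random_patch(image, mask, patch_size, rng=random):
    """Cut the same random square window out of an image and its mask."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - patch_size)    # randint is inclusive
    left = rng.randint(0, width - patch_size)

    def crop(grid):
        return [row[left:left + patch_size] for row in grid[top:top + patch_size]]

    return crop(image), crop(mask)


# Toy 8x8 "image" and a mask derived from it, so alignment can be checked.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
mask = [[1 if value % 3 == 0 else 0 for value in row] for row in image]

img_patch, mask_patch = random_patch(image, mask, 4, rng=random.Random(0))
print(img_patch)
print(mask_patch)
```

Calling this several times per image in every epoch gives many different training samples from the same 1,000 images.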

Will YOLO perform any differently from VGG-16? Will using it for image classification instead of VGG make sense?

I have already implemented image captioning using VGG as the image classification model. I have read about YOLO being a fast image classification and detection model that is primarily used for multiple object detection. However, for image captioning I just want the classes, not the bounding boxes.
I completely agree with what Parag S. Chandakkar mentioned in his answer. YOLO and RCNN, the two most used object detection models, are slow compared to VGG-16 and other image classification networks if used just for classification. However, in support of YOLO, I would mention that you can create a single model for both image captioning and object detection.
YOLO generates a vector of length 1470 (49*20 + 98 + 392 for its default 20 classes).
Tune YOLO to match the number of classes in your dataset, i.e. make YOLO generate a vector of length 49*(number of classes in your dataset) + 98 + 392.
Use this vector to generate the Bounding boxes.
Further tune this vector to generate a vector of size equal to the number of classes. You can use a dense layer for the same.
Pass this vector to your language model for generating captions.
Thus to sum up, you can generate the bounding boxes first and then further tune that vector to generate captions.
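The vector length in the steps above can be checked with a few lines. Splitting the constants as 49 = 7x7 grid cells, 98 = 2 box confidences per cell and 392 = 2 boxes x 4 coordinates per cell is my reading of the original YOLO layout; the function name is made up for illustration.

```python
def yolo_output_length(num_classes, grid=7, boxes_per_cell=2):
    """Length of the flat output vector for a grid-based YOLO head."""
    cells = grid * grid                              # 49 cells
    class_scores = cells * num_classes               # 49 * number of classes
    box_confidences = cells * boxes_per_cell         # 98
    box_coordinates = cells * boxes_per_cell * 4     # 392
    return class_scores + box_confidences + box_coordinates


default_length = yolo_output_length(20)   # 20 classes, as in PASCAL VOC
custom_length = yolo_output_length(5)     # e.g. a dataset with 5 classes
print(default_length, custom_length)
```

With 20 classes this gives the 1470 mentioned above; for a dataset with 5 classes the vector would have 735 entries.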
My initial guess is that it would not make sense to use YOLO for image classification. YOLO is fast for object detection, but networks used for image classification are faster than YOLO since they have to do less work (so the comparison is not fair).
According to the benchmarks provided here, we can consider the Inception-v1 network, which has 27 layers. The YOLO base network has 24 layers. Now, with the latest cuDNN on a Maxwell TitanX, Inception v1 takes 19.29 ms for 16 images, which translates into ~830 fps (again, expect a lower fps when you pass a single image, because the GPU is fast at processing mini-batches, i.e. making one forward pass with a mini-batch of 16 is faster than making 16 forward passes with a mini-batch size of 1).
The latest version of YOLO runs at 67 fps and its tiny version runs at 207 fps, still a lot slower than Inception v1 (note that YOLO does not use Inception v1 as its base network, but the number of layers is still comparable).
So, in short, I do not see any speed advantage in using YOLO for image classification. Now, regarding accuracy, I cannot say for sure whether YOLO would be able to detect the presence of an object better than a conventional image classification network if the object is tiny.
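The arithmetic behind these numbers is simple enough to write out. This sketch only uses the figures quoted above (19.29 ms per batch of 16, 67 fps and 207 fps); the function name is made up.

```python
def images_per_second(batch_size, batch_time_ms):
    """Throughput implied by the time for one mini-batch forward pass."""
    return batch_size / (batch_time_ms / 1000.0)


inception_fps = images_per_second(16, 19.29)
speedup_vs_yolo = inception_fps / 67.0     # full YOLO at 67 fps
speedup_vs_tiny = inception_fps / 207.0    # tiny YOLO at 207 fps
print(round(inception_fps), round(speedup_vs_yolo, 1), round(speedup_vs_tiny, 1))
```

So on these figures Inception v1 processes roughly 12 times as many images per second as full YOLO and about 4 times as many as the tiny version, keeping in mind that the 830 fps figure is for batched inference.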

Convolution Neural Network for image detection/classification

So here is the setup: I have a set of images (labeled train and test) and I want to train a conv net that tells me whether or not a specific object is within an image.
To do this, I followed the TensorFlow tutorial on MNIST and trained a simple conv net on images cropped to the area of interest (the object), each of size 128x128. The architecture is as follows: three successive blocks, each consisting of 2 conv layers and 1 max-pool down-sampling layer, followed by one fully connected softmax layer (with two classes, 0 and 1, for whether the object is present or not).
I implemented it using TensorFlow, and this works quite well, but since I have enough computing power I was wondering how I could increase the complexity of the classifier:
- adding more layers ?
- adding more channels at each layer ? (currently 32, 64, 128, and 1024 for the fully connected layer)
- anything else ?
But the most important part is that I now want to detect this same object in larger images (roughly 600x600, whereas the object itself should be around 100x100).
I was wondering how I could use the previously trained "small" network, built for small images, to pretrain a larger network on the large images? One option could be to classify the image using a sliding window of size 128x128 and scan the whole image, but I would like to try, if possible, to train a whole network on it.
Any suggestions on how to proceed ? Or an article / resource tackling this kind of problem ? (I am really new to deep learning, so sorry if this is a stupid question...)
Thanks !
I suggest that you continue reading on the field overall. Your search keys include CNN, image classification, neural net, AlexNet, GoogleNet, and ResNet. This will return many articles, on-line classes and lectures, and other materials to help you learn about classification with neural nets.
Don't just add layers or filters: the complexity of the topology (net design) must be fitted to the task; a net that's too complex will over-fit the training data. The one you've been using is probably LeNet; the three I cite above are for the ImageNet image classification contest.
Since you are working on images, I would suggest that you use a pretrained image classification network (like VGG, AlexNet etc.) and fine-tune this network with your 128x128 image data. In my experience, unless you have a very large data set, a fine-tuned network will give better accuracy and also save training time. After building a good image classifier on your data set, you can use any popular algorithm to generate region proposals from the image. Now take all the region proposals, pass them to the classification network one by one, and check whether the network classifies each proposal as positive or negative. If it is classified as positive, then most probably your object is present in that region; otherwise it is not. If the classifier finds the object in a lot of region proposals, you can use a non-maximum suppression algorithm to reduce the number of positive proposals.
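The last step, non-maximum suppression, can be sketched as follows. This is the common greedy variant (keep the best-scoring box, drop every box that overlaps a kept one too much), written from scratch for illustration; the boxes, scores and the 0.5 IoU threshold are made-up values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical positive proposals in a 600x600 image with classifier scores.
boxes = [(100, 100, 200, 200), (110, 110, 210, 210), (400, 400, 500, 500)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

Here the first two proposals overlap heavily (IoU of about 0.68), so only the higher-scoring one survives, along with the separate third box.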

Can neural network fail to learn a function? and How to choose better feature descriptors for pattern recognition?

I was working on webots, which is an environment used to model, program and simulate mobile robots. Basically, I have a small robot with a VGA camera, and it looks for simple blue coloured patterns on the white walls of a small lego maze and moves accordingly.
The method I used here was
Obtain images of the patterns from webots and save them to a location on the PC.
Detect the blue pattern and form a square enclosing the pattern, with at least 2 edges of the pattern being part of the boundary of the square.
Resize it to a 7x7 matrix (using the nearest neighbour interpolation algorithm).
The input to the network is simply the red pixel intensities of the 7x7 image (when I look at a blue pixel through a red filter, it appears black). The intensity of each pixel is extracted, and the 7x7 matrix is then converted to a 1D vector, i.e. 1x49, which is my input to the neural network. (I chose this characteristic as my input because it is 'relatively' less difficult to access this information using C and webots.)
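For reference, the preprocessing described above can be sketched in a few lines. This is not the original C/MATLAB code: the 14x14 crop, the L-shaped test pattern and the scaling to [0, 1] are assumptions made for the example, and the resize uses a simple floor-based nearest-neighbour rule.

```python
def resize_nearest(channel, out_h, out_w):
    """Nearest-neighbour resize of a 2D list (floor-based source index)."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_input_vector(red_channel, size=7):
    """Resize to size x size and flatten row by row into a 1 x size*size vector."""
    small = resize_nearest(red_channel, size, size)
    return [value / 255.0 for row in small for value in row]   # scale to [0, 1]


# Hypothetical 14x14 red channel: white wall (255) with a blue "L" (red = 0).
red = [[255] * 14 for _ in range(14)]
for r in range(2, 12):
    for c in range(2, 4):
        red[r][c] = 0          # vertical stroke
for r in range(10, 12):
    for c in range(2, 12):
        red[r][c] = 0          # horizontal stroke

vector = to_input_vector(red)
print(len(vector), vector.count(0.0))
```

Printing the 7x7 grid for each of the four patterns is a cheap way to check that they still look different after the resize, before blaming the network.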
I used MATLAB for this offline training method with a slow learning rate (0.06) to ensure parameter convergence, and tested it on large and small datasets (1189 and 346 respectively). Every one of the numerous times I have tried, the network has failed to classify the pattern (it says the pattern belongs to all 4 classes!). There is nothing wrong with the program, as I tested it on the simpleclass_dataset in MATLAB and it works almost perfectly.
Is it possible that the neural network fails to learn the function because of really poor data? (By poor data I mean that the datapoints corresponding to a sample of one class are very close to a sample belonging to a different class, or something of that sort.) Or can the neural network fail because of very poor feature descriptors?
Can anyone suggest a simpler method to extract features from the image? (I am now shifting to MATLAB, as I am only concerned with simulations in webots and not the real robot.) What sort of features can I choose? The patterns are very simple (an L, an inverted L and their reflected versions are the 4 patterns).
Neural networks CAN fail to learn a function; this is most often caused by employing a network topology which is too simple to model the necessary function. A classic example of this is attempting to learn an XOR function using a perceptron classifier, although it can sometimes happen even in multilayer neural nets, especially for complex tasks like image recognition. See my previous answer for a rough guide on how to select neural network parameters (ignore the convolution stuff if you want, although I would highly recommend looking into convolutional neural networks if you are still having problems).
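The XOR case is easy to reproduce. Below is a minimal sketch of a single perceptron with a step activation, trained with the classic perceptron rule; the learning rate and epoch count are arbitrary choices for the example. It learns AND, which is linearly separable, but can never get all four XOR points right.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """Train a single perceptron (2 inputs, step activation)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = 0
    for (x1, x2), target in samples:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        hits += int(pred == target)
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print(and_acc, xor_acc)
```

No amount of extra training fixes the XOR result, because no single straight line separates the two classes; adding a hidden layer does.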
It is a possibility that there is too little separability between classes, although I doubt that this is the case given your current features. Is there a reason that your network needs to allow an image to belong to four classes simultaneously? If not, then perhaps you could classify the input as the output with the highest activation instead of all those with high activations.
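The difference between the two decision rules is small in code. The activation values and the 0.5 threshold below are invented for illustration; they are not taken from the asker's network.

```python
def classes_above_threshold(activations, threshold=0.5):
    """Every output unit whose activation clears the threshold."""
    return [i for i, a in enumerate(activations) if a >= threshold]


def highest_activation(activations):
    """Index of the single output unit with the largest activation."""
    return max(range(len(activations)), key=lambda i: activations[i])


# Hypothetical outputs of the 4 class units for one input pattern.
activations = [0.81, 0.77, 0.93, 0.64]

multi = classes_above_threshold(activations)
single = highest_activation(activations)
print(multi, single)
```

With thresholding, this input "belongs to all 4 classes"; picking the highest activation always returns exactly one class.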

Characters Recognition for Matlab Neural Network

I am working on my final project. I chose to implement a NN for characters recognition.
My plan is to take 26 images containing the 26 English letters as training data, but I have no idea how to convert these images into inputs for my neural network.
Let's say I have a backpropagation neural network with 2 layers: a hidden layer and an output layer. The output layer has 26 neurons that produce the 26 letters. I created 26 images myself (100*100 pixels each, in 24-bit bmp format), each containing one English letter. I don't need to do image segmentation. Because I am new to image processing, can you guys give me some suggestions on how to convert images into input vectors in Matlab (or do I need to do edge detection, morphology or other image pre-processing)?
Thanks a lot.
Your NN will work only if the letters always look the same (the position of the pixels is fixed). You need to convert the images to gray-scale and pixelize them. In other words, use a grid that splits each image into squares. The squares have to be small enough to capture the details of the letter but large enough that you don't use too many neurons. Each "pixel" (in gray-scale) is an input for the NN. What is left is to determine how to connect the neurons, i.e. the NN topology. A two-layer NN should be enough. Most probably you should connect each input "pixel" to each neuron in the first layer, and each neuron in the first layer to each neuron in the second layer.
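As a rough sketch of that conversion (in Python rather than Matlab, just to show the idea): average the three colour channels to get gray-scale, average each grid square into one value, and flatten the grid into the input vector. The 10x10 square size and the scaling to [0, 1] are arbitrary choices for the example, and the image is assumed to be already loaded as rows of (r, g, b) tuples.

```python
def to_grayscale(rgb_image):
    """Average the three channels of each (r, g, b) pixel."""
    return [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb_image]


def pixelize(gray, cell):
    """Average every cell x cell square into a single value."""
    height, width = len(gray), len(gray[0])
    grid = []
    for top in range(0, height - cell + 1, cell):
        grid_row = []
        for left in range(0, width - cell + 1, cell):
            block = [gray[y][x]
                     for y in range(top, top + cell)
                     for x in range(left, left + cell)]
            grid_row.append(sum(block) / len(block))
        grid.append(grid_row)
    return grid


def image_to_input_vector(rgb_image, cell=10):
    """Gray-scale, pixelize and flatten an image into NN inputs in [0, 1]."""
    grid = pixelize(to_grayscale(rgb_image), cell)
    return [value / 255.0 for row in grid for value in row]


# Hypothetical 100x100 white image with a black 10x10 square in one corner.
image = [[(255, 255, 255)] * 100 for _ in range(100)]
for y in range(10):
    for x in range(10):
        image[y][x] = (0, 0, 0)

inputs = image_to_input_vector(image)
print(len(inputs), inputs[0], inputs[1])
```

A 100*100 image with 10x10 squares gives 100 inputs instead of 10,000, which keeps the first layer small.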
This doesn't directly answer the questions you asked, but might be useful:
1) You'll want more training data. Much more, if I understand you correctly (only one sample for each letter??)
2) This is a pretty common project, and if it's allowed, you might want to try to find already-processed data sets on the internet so you can focus on the NN component.
Since you will be doing character recognition, I suggest you use a SOM neural network, which is trained without supervision and so does not require labelled training data. You will have 26 input neurons, one neuron for each letter. For the image processing bit, Ross has a useful suggestion for isolating each letter.