I have been doing deep learning with CNNs for a while, and I have noticed that the inputs to a model are almost always square images.
As far as I can tell, neither the convolution operation nor the network architecture itself requires this property.
So, what is the reason for that?
Because square images are pleasing to the eye. But there are applications for non-square images when the domain requires it. For instance, the original SVHN dataset consists of images of several digits, and hence rectangular images are used as input to the convnet, as here
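As a quick illustration (a minimal sketch of my own, using Keras; the 32x96 input size is made up), a convolutional stack accepts a rectangular input without complaint:

```python
# Minimal sketch (hypothetical shapes): a conv layer on a rectangular input.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D

model = Sequential()
model.add(Conv2D(16, (3, 3), activation="relu", padding="same",
                 input_shape=(32, 96, 3)))   # height 32, width 96 -- not square
model.add(MaxPooling2D(pool_size=(2, 2)))    # feature maps become 16x48
model.summary()                              # builds and runs fine
```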
From Suhas Pillai:
The problem is not with the convolutional layers, it's the fully connected
layers of the network, which require a fixed number of neurons. For
example, take a small 3-layer network + softmax layer. Suppose the first 2
layers are convolutional + max pooling, the dimensions are the
same before and after convolution, and pooling halves the dimensions, which is
usually the case. For an image of 3*32*32 (C, W, H) with 4 filters in the
first layer and 6 filters in the second layer, the output after
convolution + max pooling at the end of the 2nd layer will be 6*8*8,
whereas for a 3*64*64 image, the output at the end of the 2nd layer
will be 6*16*16. Before the fully connected layer, we stretch this into a
single vector (6*8*8 = 384 neurons) and do a fully connected operation.
So you cannot have fully connected layers of different dimensions for
different image sizes. One way to tackle this is spatial pyramid
pooling, where you force the output of the last convolutional layer to be
pooled into a fixed number of bins (i.e., neurons) so that the fully
connected layer always has the same number of inputs. You can also check fully
convolutional networks, which can take non-square images.
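To make that concrete, here is a hedged sketch (my own, not from the quoted answer). Spatial pyramid pooling is not built into Keras, so this uses global average pooling instead, a simpler layer with the same effect of decoupling the dense head from the spatial dimensions; the filter counts echo the 4/6 example above, everything else is illustrative.

```python
# Sketch of the "fixed-length output regardless of input size" idea,
# using GlobalAveragePooling2D in place of spatial pyramid pooling.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense

model = Sequential()
model.add(Conv2D(4, (3, 3), padding="same", activation="relu",
                 input_shape=(None, None, 3)))   # height/width left unspecified
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(6, (3, 3), padding="same", activation="relu"))
model.add(MaxPooling2D((2, 2)))
model.add(GlobalAveragePooling2D())              # always 6 values, whatever the input size
model.add(Dense(10, activation="softmax"))

# Both a 32x32 and a 64x64 image pass through the same dense layer.
print(model.predict(np.zeros((1, 32, 32, 3))).shape)  # (1, 10)
print(model.predict(np.zeros((1, 64, 64, 3))).shape)  # (1, 10)
```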
It is not necessary to have square images. I see two "reasons" for it:
scaling: if images are automatically rescaled from another aspect ratio (and from landscape / portrait orientation), scaling to a square might on average introduce the least error (see the sketch below)
publications / visualizations: square images are easy to display together
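To make the scaling point concrete, here is a small sketch (my own illustration using TensorFlow's image utilities; the 400x600 source size and 224 target size are made up) comparing a plain resize, which distorts the aspect ratio, with a padded "letterbox" resize that preserves it:

```python
# Illustration only: resizing a rectangular image to a square network input.
import numpy as np
import tensorflow as tf

img = np.random.rand(400, 600, 3).astype("float32")   # hypothetical landscape image

squashed = tf.image.resize(img, (224, 224))            # distorts aspect ratio
padded   = tf.image.resize_with_pad(img, 224, 224)     # keeps aspect ratio, pads with zeros

print(squashed.shape, padded.shape)                    # (224, 224, 3) (224, 224, 3)
```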
Related
When coding a convolutional neural network, I am unsure where to start with the convolutional layer. When different convolutional filters are used to produce different feature maps, does that mean that the filters have different sizes (for example, 3x3, 2x2, etc.)?
In most examples (which are a good indication of how to go about coding a convolutional neural network), you will find that you start with one convolutional layer and pass it the layer size, a 3x3 window, and the input data shape:
model.add(Conv2D(layer_size, (3, 3), input_shape=x.shape[1:]))
The window size usually only differs for the max pooling layer, e.g. 2x2:
model.add(MaxPooling2D(pool_size=(2,2)))
Layer sizes are usually selected from a range such as layer_size = [32, 64, 128], and you can do the same to experiment with different numbers of convolutional layers, e.g. convolution_layers = [1, 2, 3].
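Putting those pieces together, here is a hedged sketch (my own, not from the answer; the 64x64x3 input, 64 filters, 2 conv layers, and binary output are just example values) of a small Keras model in that style:

```python
# Sketch: a small Keras CNN built from one choice of layer_size and conv layer count.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

layer_size = 64            # e.g. picked from [32, 64, 128]
convolution_layers = 2     # e.g. picked from [1, 2, 3]
input_shape = (64, 64, 3)  # hypothetical image size

model = Sequential()
model.add(Conv2D(layer_size, (3, 3), activation="relu", input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
for _ in range(convolution_layers - 1):
    model.add(Conv2D(layer_size, (3, 3), activation="relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))   # binary output as an example
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```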
I've never seen different kernel sizes for the filters in the same layer; although it is possible to do so, it is not a default option in the frameworks I have used. What makes filters yield different feature maps is their weights.
Across different layers, different kernel sizes can be used, because the idea of convolutional networks is to gradually reduce dimensionality through downsampling layers (max pooling, for example). At deeper levels you therefore have smaller feature maps, and a smaller filter keeps the layer convolutional rather than effectively fully connected (having a kernel the same size as the feature map is equivalent to having a dense layer).
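To see that last remark concretely, here is a small sketch of my own (the 8x8x6 feature-map size and 10 outputs are arbitrary): a convolution whose kernel covers the whole feature map produces a single value per filter, with exactly the same number of weights as a dense layer on the flattened input.

```python
# Sketch: a full-size kernel with 'valid' padding collapses the feature map to 1x1,
# which is the same connectivity as a dense layer on the flattened input.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

conv_version = Sequential([
    Conv2D(10, (8, 8), padding="valid", input_shape=(8, 8, 6)),  # kernel == feature map
    Flatten(),                                                    # output: 10 values
])

dense_version = Sequential([
    Flatten(input_shape=(8, 8, 6)),
    Dense(10),                                                    # also 10 values
])

print(conv_version.output_shape, dense_version.output_shape)      # (None, 10) (None, 10)
print(conv_version.count_params(), dense_version.count_params())  # both 8*8*6*10 + 10 = 3850
```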
If you're starting out with convolutional networks, I recommend playing with this interactive visualization of a CNN; it helped me with a lot of concepts.
So I’m trying to make a CNN, and so far I think I understand all of the forward propagation and the backpropagation in the fully connected layers. However, I’m having some issues with the backprop in the convolutional layers.
Basically, I’ve written out the dimensions of everything at each stage in a CNN with two convolutional layers and two fully connected layers, with the input having a depth of 1 (as it is black and white) and only one filter being applied at each convolutional layer. I haven’t bothered to use pooling at this stage, as to my knowledge it shouldn’t have any impact on the calculus, only on where each value is assigned, so the dimensions should still fit as long as I also don’t include any unpooling in my backprop. I also haven’t written out the dimensions after the application of the activation functions, as they would be the same as those of their inputs and I would be writing the same values twice.
The dimensions, as you will see, vary slightly in format. For the convolutional layers I’ve written them as though they are images rather than in matrix form, whilst for the fully connected layers I’ve written the dimensions as the sizes of the matrices used (this will hopefully make more sense when you see it).
The issue is that in calculating the delta for the convolutional layers, the dimensions don’t fit. What am I doing wrong?
Websites used:
http://cs231n.github.io/convolutional-networks/
http://neuralnetworksanddeeplearning.com/chap2.html#the_cross-entropy_cost_function
http://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/
Calculation of dimensions:
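(The write-up itself appears to have been an attached image. As a stand-in, here is a hedged sketch with assumed sizes — a 28x28 input and 5x5 filters, which are not the question's values — showing how the forward shapes, and hence the shapes the deltas must mirror, can be checked mechanically for this kind of setup: single-channel input, one filter per conv layer, no pooling, two fully connected layers.)

```python
# Sketch with assumed sizes: checking forward-pass shapes for a CNN with
# two conv layers (one filter each, no pooling) and two fully connected layers.
import numpy as np
from scipy.signal import correlate2d   # 'valid' cross-correlation, as in a conv layer

x  = np.random.rand(28, 28)            # hypothetical grayscale input
k1 = np.random.rand(5, 5)              # filter of conv layer 1
k2 = np.random.rand(5, 5)              # filter of conv layer 2

a1 = correlate2d(x, k1, mode="valid")  # (24, 24)
a2 = correlate2d(a1, k2, mode="valid") # (20, 20)

v  = a2.reshape(-1, 1)                 # flattened: (400, 1)
W1 = np.random.rand(100, 400)          # first fully connected layer (assumed size)
W2 = np.random.rand(10, 100)           # second fully connected layer (assumed size)
out = W2 @ (W1 @ v)                    # (10, 1)

# In backprop the deltas must mirror these shapes in reverse:
# (10,1) -> (100,1) -> (400,1) -> reshaped back to (20,20), and so on.
print(a1.shape, a2.shape, v.shape, out.shape)
```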
So here is the setup: I have a set of images (labeled train and test) and I want to train a conv net that tells me whether or not a specific object is present in an image.
To do this, I followed the TensorFlow tutorial on MNIST, and I train a simple conv net on crops reduced to the area of interest (the object), training on images of size 128x128. The architecture is as follows: three successive blocks, each consisting of 2 conv layers and 1 max-pool down-sampling layer, followed by one fully connected softmax layer (with two classes, 0 and 1, for whether the object is present or not).
I implemented it using TensorFlow, and this works quite well, but since I have enough computing power I was wondering how I could improve the complexity of the classifier:
- adding more layers?
- adding more channels at each layer? (currently 32, 64, 128, and 1024 for the fully connected layer)
- anything else?
But the most important part is that now I want to detect this same object in larger images (roughly 600x600, whereas the size of the object should be around 100x100).
I was wondering how I could use the previously trained "small" network built for small images in order to pretrain a larger network on the large images. One option could be to classify the image with a sliding window of size 128x128 and scan the whole image, but I would like, if possible, to try training a whole network on it.
Any suggestion on how to proceed? Or an article / resource tackling this kind of problem? (I am really new to deep learning, so sorry if this is a stupid question...)
Thanks !
I suggest that you continue reading on the field overall. Your search keywords include CNN, image classification, neural net, AlexNet, GoogleNet, and ResNet. These will return many articles, online classes and lectures, and other materials to help you learn about classification with neural nets.
Don't just add layers or filters: the complexity of the topology (net design) must be fitted to the task; a net that's too complex will over-fit the training data. The one you've been using is probably LeNet; the three I cite above are for the ImageNet image classification contest.
Since you are working on images, I would suggest you use a pretrained image classification network (like VGG, AlexNet, etc.) and fine-tune it with your 128x128 image data. In my experience, unless you have a very large dataset, a fine-tuned network will give more accuracy and also save training time. After building a good image classifier on your dataset, you can use any popular algorithm to generate region proposals from the image. Then take all the region proposals and pass them to the classification network one by one, checking whether the network classifies each proposal as positive or negative. If it classifies a proposal as positive, your object is most probably present in that region; otherwise it is not. If there are many region proposals in which the object is present according to the classifier, you can use non-maximum suppression to reduce the number of positive proposals.
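A hedged sketch of the fine-tuning step (my own illustration using Keras' bundled VGG16 weights; the 128x128 input size and the single sigmoid output match the question, everything else — pooling head, dense size, optimizer — is an assumption):

```python
# Sketch: fine-tuning a pretrained VGG16 on 128x128 images for a binary
# "object present / not present" classifier. Hyperparameters are illustrative.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
base.trainable = False                      # freeze the pretrained convolutional base

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)
out = Dense(1, activation="sigmoid")(x)     # two classes -> single sigmoid unit

model = Model(base.input, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5, validation_split=0.1)
```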
I have read some books but still cannot figure out how I should organize the network. For example, I have a PGM image of size 120*100; what should the input look like (a one-dimensional array of size 120*100)? And how many nodes should I use?
It's typically best to organize your input image as a 2D matrix. The reason is that the layers at the lower levels of the neural networks used in machine perception tasks are typically locally connected. For example, each neuron of the first layer of such a neural net will only process the pixels of a small NxN patch of the input image. This naturally leads to a 2D structure which can be more easily described with 2D matrices.
For a detailed explanation, I'll refer you to the DeepFace paper, which describes the state of the art in face recognition systems.
A 120*100 one-dimensional vector is fine. The locations of the pixel values in that vector do not matter, because all nodes are fully connected to the nodes in the next layer anyway. But you must be consistent with their locations between training, validation, and testing.
The most successful approach so far has been to go with a convolutional neural network with 2D input, just as #benoitsteiner stated. For a far simpler example, I'd refer you to LeNet-5, a small neural network developed for MNIST handwritten digit recognition. It is used in EBLearn for face recognition with quite good results.
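For reference, here is a hedged Keras sketch of a LeNet-5-style network adapted to the 120x100 grayscale input from the question (the 6/16 filter counts and 120/84 dense sizes follow the usual LeNet-5 description; the 10-class output and other details are assumptions):

```python
# Sketch: LeNet-5-style CNN for a 120x100 single-channel (grayscale) image.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(6, (5, 5), activation="tanh", input_shape=(120, 100, 1)))  # -> 116x96x6
model.add(AveragePooling2D((2, 2)))                                         # -> 58x48x6
model.add(Conv2D(16, (5, 5), activation="tanh"))                            # -> 54x44x16
model.add(AveragePooling2D((2, 2)))                                         # -> 27x22x16
model.add(Flatten())
model.add(Dense(120, activation="tanh"))
model.add(Dense(84, activation="tanh"))
model.add(Dense(10, activation="softmax"))   # number of classes is an assumption
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```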
Assume that I have a method or other neural network to do pattern detection on an image correctly. How should I design a neural network where there are multiple patterns in an image?
Say that in an image there are X patterns to be detected; what would be the best approach? AFAIK, output layer neuron values should be in [-1, 1]. How would I know that X patterns have been recognised? Does this mean that I have to set a hardcoded limit on how many patterns the network can recognise (since the number of output neurons is fixed)?
Here's a suggestion using face detection as an example. This Face Detection link on GitHub describes detecting multiple patterns (i.e. faces) using a Haar classifier. If you read the Implementation section, it states that the algorithm uses the scaleOption and templateSizeOption parameters (among others) to govern how many faces are detected in an image. It sounds like you should look for features in subspaces or windows of a given image (perhaps even windows that overlap).
scaleOption – this parameter is used to specify the rate at which the Haar features used for face detection will be scaled. A lower scale option means that more faces will be detected, while a higher scale option will perform a faster detection but may miss some faces from the input image. The default scale value is 1.1, which determines an increase in the feature dimensions of 10% at each step.
templateSizeOption – this is used to specify the minimal area in which to search for a face. If we want to detect persons in close-up images, the size should be over 40 pixels; otherwise a 25-pixel region (which is the default value) is enough for detecting a large number of faces.
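For a concrete analogue of those two knobs, here is a hedged sketch using OpenCV's Haar cascade API rather than the linked GitHub project: scaleFactor and minSize play the same roles as scaleOption and templateSizeOption, and the input file name is illustrative.

```python
# Sketch: multiple-face detection with an OpenCV Haar cascade.
# scaleFactor ~ scaleOption, minSize ~ templateSizeOption from the quote above.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img  = cv2.imread("group_photo.jpg")            # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

faces = cascade.detectMultiScale(
    gray,
    scaleFactor=1.1,      # 10% growth of the search window at each step
    minNeighbors=5,
    minSize=(25, 25),     # smallest region to consider a face
)
print(f"{len(faces)} faces found")              # one (x, y, w, h) box per detection
```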
To do this, use a Hopfield net. First, extract your target in equal-sized windows and store it in the net. Then, with a simple algorithm, scan your image and at each position compare the similarity of the net's state with your target, using a separate array for each target to save the results. At the end, extract the nearest pattern from each array. You can apply some image processing to your original image before starting.
Yes, this can be done with a neural network. I think most practical solutions would involve applying the neural network to a window that is scanned over the image. Multiple hits from the neural network would imply multiple target objects in the image.
Incidentally, neural network outputs do not have to lie in the range -1 .. 1.
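A hedged sketch of that sliding-window idea (my own; `classifier` stands in for any single-pattern network such as the one discussed above, and the window size, stride, and threshold are made-up values):

```python
# Sketch: scanning a single-pattern classifier over an image to find multiple hits.
import numpy as np

def sliding_window_hits(image, classifier, window=32, stride=16, threshold=0.5):
    """Return (row, col, score) for every window the classifier marks as positive.

    `classifier` is assumed to map a (window, window) patch to a score in [0, 1].
    """
    hits = []
    h, w = image.shape
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            patch = image[r:r + window, c:c + window]
            score = classifier(patch)
            if score >= threshold:
                hits.append((r, c, score))
    return hits

# Toy usage: a dummy "classifier" that fires on bright patches.
img = np.zeros((128, 128))
img[40:72, 80:112] = 1.0
print(sliding_window_hits(img, classifier=lambda p: p.mean()))
```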