I've read this article http://www.codeproject.com/Articles/143059/Neural-Network-for-Recognition-of-Handwritten-Di and when I got to this part:
Layer #0: is the gray scale image of the handwritten character in the MNIST database which is padded to 29x29 pixel. There are 29x29= 841 neurons in the input layer.
Layer #1: is a convolutional layer with six (6) feature maps. There are 13x13x6 = 1014 neurons, (5x5+1)x6 = 156 weights, and 1014x26 = 26364 connections from layer #1 to the previous layer.
How can we get six (6) feature maps just from convolving the image?
I thought we would get only one feature map. Or am I wrong?
I'm doing my research around convolution neural network.
Six different kernels (or filters) are convolved with the same image to generate six feature maps.
Layer #0: Input image with 29x29 pixels, thus 29*29 = 841 neurons (input neurons).
Layer #1: The convolutional layer uses 6 different kernels (or filters) of size 5x5 pixels with a stride of 2 (the amount of shift while convolving the kernels over the input), which are convolved with the input image (29x29), generating 6 different feature maps (13x13), thus 13x13x6 = 1014 neurons.
Each filter has 5x5 weights plus a bias (for weight correction), thus (5x5)+1 = 26 weights, and as we have 6 kernels (or filters), this gives 6*[(5x5)+1] = 156 trainable weights.
Each of the 1014 neurons in Layer #1 is connected to a 5x5 patch of the input plus a bias, i.e. (5x5)+1 = 26 connections per neuron, and finally 1014*26 = 26364 connections from Layer #0 to Layer #1.
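As a minimal NumPy sketch of the steps above (the random filters are just stand-ins for learned ones), convolving one 29x29 image with six 5x5 kernels at stride 2 yields exactly the numbers quoted:

```python
import numpy as np

def convolve(image, kernel, bias, stride=2):
    """Valid convolution of a 2-D image with one kernel (plus bias)."""
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    fmap = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k]
            fmap[i, j] = np.sum(patch * kernel) + bias
    return fmap

rng = np.random.default_rng(0)
image = rng.random((29, 29))        # padded MNIST digit
kernels = rng.random((6, 5, 5))     # six different 5x5 filters
biases = rng.random(6)              # one bias per filter

# One feature map per filter -> six feature maps from one image
maps = np.stack([convolve(image, k, b) for k, b in zip(kernels, biases)])
print(maps.shape)             # (6, 13, 13) -> 13x13x6 = 1014 neurons
print(6 * (5*5 + 1))          # 156 trainable weights
print(maps.size * (5*5 + 1))  # 26364 connections
```

Each filter produces its own 13x13 map, which is where the six feature maps come from.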
You should go through the research paper by Y. LeCun, L. Bottou, and Y. Bengio, "Gradient-Based Learning Applied to Document Recognition" - Section II explains convolutional neural networks (I recommend reading the whole paper).
Another place where you can find a detailed explanation and a Python implementation of CNNs is here. If you have time, I recommend going through this site for more details about deep learning.
Thank you.
You get six feature maps by convolving six different kernels with the same image.
In AlexNet, the filter size is 5*5*48 in the first layer and 3*3*128 in the second layer.
Why are 48 and 128 used as the depth? Can we change both to different numbers?
Thanks
The depiction of the neural network there can be confusing. Actually, the layer with depth 48, i.e. with 5 * 5 * 48 kernels, is the second convolutional layer. From the paper:
..The second convolutional layer takes as input the (response-normalized
and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48
I assume, though, that your confusion stems from the first layer being described as 11 * 11 * 96 while the depiction in the image does not show it that way. (The 48 itself comes from AlexNet being split across two GPUs: each kernel in the second convolutional layer sees only half of the first layer's 96 feature maps.) In case you are asking why the authors chose such sizes: opinions still vary in the scientific community, as deciding the parameters of a neural network is, at least at this time, done largely by intuition.
I want to extract both the memorability score and the memorability heat map using the available memnet caffemodel by Khosla et al. at link
Looking at the prototxt model, I can see that the final inner-product output should be the memorability score, but how do I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance
As described in their paper [1], the CNN (MemNet) outputs a single, real-valued memorability score. So the network they made publicly available calculates this single score for a given input image - not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolutional kernel: the first FC layer becomes a convolution whose kernel covers its whole input (e.g. 6x6 for AlexNet's fc6), and each following FC layer with 4096 outputs becomes a 1x1 convolution with 4096 feature maps.
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file in which you replace the InnerProduct layers with Convolution layers. Caffe then won't match the weights in the .caffemodel to these layers anymore, as the layer types don't match. So you load the old net and its parameters into Python, load the new net, assign the old parameters to the new net, and save it as a new .caffemodel file.
Now, we can run images of any dimensions (larger than or equal to 227x227) through the network.
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input size is 227x227. They mention that a 451x451 input gives an 8x8 output, which corresponds to applying the network with a stride of 32: (451 - 227) / 32 + 1 = 8 output positions per dimension. So a simple example could be:
Scale 1: 227x227 → 1x1. (I guess they definitely use this scale.)
Scale 2: 259x259 → 2x2. (Wild guess)
Scale 3: 323x323 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
The results will look like this:
So, you'll just average these outputs to get your final 8x8 heatmap. From the image above, it should be clear how to average the different-scale outputs: you'll have to upsample the low-res ones to 8x8, and average them.
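A minimal NumPy sketch of this averaging step (the per-scale maps below are random stand-ins for actual MemNet outputs, and nearest-neighbour upsampling is just one simple choice):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the per-scale network outputs: 1x1, 2x2, 4x4, 8x8
maps = [rng.random((1, 1)), rng.random((2, 2)),
        rng.random((4, 4)), rng.random((8, 8))]

def upsample(m, size):
    """Nearest-neighbour upsampling of a square map to size x size."""
    factor = size // m.shape[0]
    return np.kron(m, np.ones((factor, factor)))

# Upsample every map to 8x8, then average them element-wise
heatmap = np.mean([upsample(m, 8) for m in maps], axis=0)
print(heatmap.shape)  # (8, 8)
```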
From the paper, I assume that they use very high-res scales, so their heatmap will be around the same size as the original image. They write that it takes 1s on a "normal" GPU. This is quite a long time, which also indicates that they probably upsample the input images to quite high dimensions.
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]
I have been doing deep learning with CNNs for a while, and I have noticed that the inputs to a model are always square images.
As far as I can see, neither the convolution operation nor the neural network architecture itself requires such a property.
So, what is the reason for that?
Because square images are pleasing to the eye. But there are applications on non-square images when the domain requires it. For instance, the original SVHN dataset consists of images of several digits, and hence rectangular images are used as input to the convnet, as here
From Suhas Pillai:
The problem is not with the convolutional layers, it's the fully connected layers of the network, which require a fixed number of neurons. For example, take a small 3-layer network + softmax layer. If the first 2 layers are convolutional + max pooling, assuming the dimensions are the same before and after convolution, and pooling halves each dimension (which is usually the case): for an image of 3*32*32 (C,W,H) with 4 filters in the first layer and 6 filters in the second layer, the output after convolution + max pooling at the end of the 2nd layer will be 6*8*8, whereas for an image of 3*64*64 the output at the end of the 2nd layer will be 6*16*16. Before the fully connected part, we stretch this into a single vector (6*8*8 = 384 neurons) and do a fully connected operation. So you cannot have fully connected layers of different dimensions for different image sizes. One way to tackle this is spatial pyramid pooling, where you force the output of the last convolutional layer to be pooled into a fixed number of bins (i.e. neurons) so that the fully connected layer always has the same number of neurons. You can also check out fully convolutional networks, which can take non-square images.
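The size mismatch described above is easy to verify with a few lines of Python (assuming, as in the quote, size-preserving convolutions and two rounds of 2x2 pooling):

```python
def flattened_size(h, w, n_filters=6, n_pools=2):
    """Length of the vector fed to the FC layer, assuming size-preserving
    convolutions and 2x2 pooling after each of the two conv layers."""
    for _ in range(n_pools):
        h, w = h // 2, w // 2
    return n_filters * h * w

print(flattened_size(32, 32))  # 384  (6 * 8 * 8)
print(flattened_size(64, 64))  # 1536 (6 * 16 * 16)
```

An FC layer built for 384 inputs simply cannot accept the 1536-long vector, which is the whole problem.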
It is not necessary to have square images. I see two "reasons" for it:
scaling: if images are automatically rescaled from various aspect ratios (and landscape/portrait orientations), a square target might on average introduce the least error
publications / visualizations: square images are easy to display together
In the Wikipedia article about the MNIST database it is said that the lowest error rate is achieved by a "committee of 35 convolutional networks" with the scheme:
1-20-P-40-P-150-10
What does this scheme mean?
The numbers are probably neuron counts. But what does the 1 mean then?
What do P letters mean?
In this particular scheme, 'P' means 'pooling' layer.
So, the basic structure is the following:
One grayscale input image
20 images after convolution layer (20 different filters)
Pooling layer
40 outputs from next convolution
Pooling layer
150... can be either 150 small convolution outputs or just 150 fully-connected neurons
10 output fully-connected neurons
That's why it's 1-20-P-40-P-150-10. Not the best notation, but still pretty clear if you are familiar with CNNs.
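The scheme can be turned into a small shape calculation. The kernel and pooling sizes below (4x4 conv, 2x2 pool, 5x5 conv, 3x3 pool) are assumptions that make the arithmetic work out for a 29x29 padded MNIST input; the scheme itself does not specify them:

```python
def conv(size, k):  # valid convolution shrinks the map
    return size - k + 1

def pool(size, p):  # non-overlapping pooling divides the map
    return size // p

# One plausible instantiation of 1-20-P-40-P-150-10
s = 29          # 1:  padded MNIST input, one grey-scale channel
s = conv(s, 4)  # 20: -> 26x26, 20 feature maps
s = pool(s, 2)  # P:  -> 13x13
s = conv(s, 5)  # 40: -> 9x9, 40 feature maps
s = pool(s, 3)  # P:  -> 3x3
print(40 * s * s)  # 360 inputs feeding the 150-neuron FC layer, then 10 outputs
```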
You can read more details about the internal structure of CNNs in Yann LeCun's foundational article "Gradient-Based Learning Applied to Document Recognition".
I am working on my final project. I chose to implement a NN for character recognition.
My plan is to take 26 images containing the 26 English letters as training data, but I have no idea how to convert these images into inputs for my neural network.
Let's say I have a backpropagation neural network with 2 layers - a hidden layer and an output layer. The output layer has 26 neurons, one for each of the 26 letters. I created 26 images myself (each 100*100 pixels in 24-bit BMP format), each containing one English letter. I don't need to do image segmentation. Since I am new to image processing, can you give me some suggestions on how to convert the images into input vectors in Matlab (or do I need to do edge detection, morphology, or other image pre-processing first)?
Thanks a lot.
Your NN will work only if the letters are always in the same position (the pixel positions are fixed). You need to convert the images to grey-scale and pixelise them; in other words, use a grid that splits each image into squares. The squares have to be small enough to capture the letter's details but large enough that you don't end up with too many neurons. Each grid cell (in grey-scale) is one input to the NN. What is left is to decide how to connect the neurons, i.e. the NN topology. A two-layer NN should be enough. Most probably you should connect each input "pixel" to each neuron in the first layer, and each neuron in the first layer to each neuron in the second layer.
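The question asks about Matlab, but the grid idea translates directly; here is the same idea sketched in NumPy (the 10x10 grid size and the random stand-in image are arbitrary illustration choices):

```python
import numpy as np

def image_to_vector(img, grid=10):
    """Convert an RGB letter image (H x W x 3, values 0-255) into a
    grid x grid grey-scale input vector by averaging each grid cell."""
    grey = img.mean(axis=2)            # 24-bit colour -> grey-scale
    h, w = grey.shape
    bh, bw = h // grid, w // grid      # pixel size of each grid cell
    blocks = grey[:bh*grid, :bw*grid].reshape(grid, bh, grid, bw)
    # Average within each cell, then scale to [0, 1] and flatten
    return (blocks.mean(axis=(1, 3)) / 255.0).ravel()

img = np.random.randint(0, 256, (100, 100, 3))  # stand-in for a letter bitmap
vec = image_to_vector(img)
print(vec.shape)  # (100,) -> 100 inputs for the NN
```

With a 10x10 grid on a 100x100 image, each cell averages a 10x10 pixel block, giving 100 inputs per letter.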
This doesn't directly answer the questions you asked, but might be useful:
1) You'll want more training data. Much more, if I understand you correctly (only one sample for each letter??)
2) This is a pretty common project, and if it's allowed, you might want to try to find already-processed data sets on the internet so you can focus on the NN component.
Since you will be doing character recognition, I suggest you use a SOM (self-organising map) neural network, which does not require labelled training data. You will have 26 output neurons, one for each letter. For the image-processing bit, Ross has a useful suggestion for isolating each letter.