VGG and AlexNet, amongst others, require a fixed image input of square dimensions (H == W). How can one fine-tune or otherwise perform net surgery such that non-square inputs can be provided?
For your reference, I'm using Caffe and intend to extract FC7 features for non-square image inputs.
For the convolutional part of the net, the input size does not really matter: the shape of the output simply changes as you change the input size.
However, for "InnerProduct" layers the shape of the weights is fixed, and it is determined by the input size.
You can perform "net surgery" to convert your "InnerProduct" layers into "Convolution" layers: this way your net can process inputs of any size. However, your outputs will then also vary in shape.
Another option is to define your net with a new fixed input size, re-use all the learned weights of the convolutional layers, and fine-tune only the weights of the fully connected layers.
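A minimal sketch of the second option with Caffe's Python interface (the file names here are hypothetical). Caffe copies weights by layer name, so if the fully connected layers are renamed in the new prototxt, they are re-initialized while all convolution weights are reused:

    import caffe

    # solver_nonsquare.prototxt points to a train prototxt that declares the
    # new (non-square) input size and renames the FC layers, e.g. fc6 -> fc6_ns,
    # so that their now-mismatched weights are not copied but re-initialized.
    caffe.set_mode_gpu()
    solver = caffe.SGDSolver('solver_nonsquare.prototxt')

    # Weights are matched by layer name: conv1..conv5 are reused as-is,
    # the renamed FC layers start from random initialization.
    solver.net.copy_from('bvlc_alexnet.caffemodel')

    solver.solve()  # fine-tune (optionally with reduced lr_mult on the conv layers)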
I want to extract both the memorability score and a memorability heat map using the available memnet caffemodel by Khosla et al. at link.
Looking at the prototxt model, I can see that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance
As described in their paper [1], the CNN (MemNet) outputs a single real-valued memorability score. So the network they made publicly available calculates this single score for a given input image, not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolution kernel: an FC layer is equivalent to a convolution whose kernel covers its whole input volume. For AlexNet's FC7, which maps one 4096-dimensional vector to 4096 outputs, this is a 1x1 convolution with 4096 output channels; FC6, which sees a 6x6x256 input, becomes a 6x6 convolution.
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file in which the InnerProduct layers are replaced by Convolution layers (with new names). Caffe will then no longer match the corresponding weights in the .caffemodel, since the layer names and types no longer agree. So you load the old net and its parameters into Python, load the new net, copy the old parameters over to the new net, and save it as a new .caffemodel file.
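A rough sketch of that weight transfer, following the Net Surgery tutorial (the file names and the layer names fc6/fc7/fc8 are the AlexNet ones from the tutorial; adjust them to the actual names in the MemNet prototxt):

    import caffe

    # Original net with InnerProduct layers ...
    net = caffe.Net('deploy.prototxt', 'memnet.caffemodel', caffe.TEST)
    fc_layers = ['fc6', 'fc7', 'fc8']
    fc_params = {name: (net.params[name][0].data, net.params[name][1].data)
                 for name in fc_layers}

    # ... and the rewritten net where they are Convolution layers
    # (fc6-conv: 6x6 kernel, fc7-conv and fc8-conv: 1x1 kernels).
    net_conv = caffe.Net('deploy_full_conv.prototxt', 'memnet.caffemodel', caffe.TEST)
    conv_layers = ['fc6-conv', 'fc7-conv', 'fc8-conv']

    # Copy the FC weights into the conv layers: only the shape changes,
    # the number of parameters is identical (.flat keeps the raw order).
    for fc, conv in zip(fc_layers, conv_layers):
        W, b = fc_params[fc]
        net_conv.params[conv][0].data.flat = W.flat
        net_conv.params[conv][1].data[...] = b

    net_conv.save('memnet_full_conv.caffemodel')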
Now we can run images of any dimensions (equal to or larger than 227x227) through the network.
Problem 2: Generate the heat map
As explained in the paper [1], you apply the fully convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input size is 227x227. They mention that a 451x451 input gives an 8x8 output, which implies an overall stride of 32 when sliding the 227x227 window (451 = 227 + 7·32). So a simple example could be:
Scale 1: 227x227 → 1x1. (They almost certainly use this base scale.)
Scale 2: 283x283 → 2x2. (Wild guess)
Scale 3: 339x339 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
The results will look like this:
So you just average these outputs to get your final 8x8 heatmap. From the image above, it should be clear how to average the different-scale outputs: you have to upsample the low-resolution ones to 8x8 and then average them.
From the paper, I assume that they use very high-resolution scales, so their heatmap ends up roughly the same size as the original image. They write that it takes 1 s on a "normal" GPU. That is quite a long time, which also suggests that they upsample the input images to quite high dimensions.
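Putting the two problems together, the multi-scale averaging could look roughly like this (a sketch only: the blob name 'data', the output name 'fc8-conv' and the exact scales are assumptions and must be adapted to the actual MemNet definition):

    import numpy as np
    from skimage.transform import resize

    scales = [227, 283, 339, 451]   # the scales guessed above (overall stride 32)

    def memorability_map(net, img, out_size=8):
        # img: HxWx3 float image; net: the fully convolutional MemNet.
        # Note: Caffe preprocessing (BGR order, mean subtraction, 0-255 range)
        # is omitted here and must be added for real use.
        maps = []
        for s in scales:
            im = resize(img, (s, s), anti_aliasing=True)
            blob = im.transpose(2, 0, 1)[np.newaxis].astype(np.float32)
            net.blobs['data'].reshape(*blob.shape)            # accept this input size
            out = net.forward(data=blob)['fc8-conv'][0, 0]    # 1x1, 2x2, 4x4, 8x8
            r = out_size // out.shape[0]                      # upsample to 8x8
            maps.append(np.repeat(np.repeat(out, r, axis=0), r, axis=1))
        return np.mean(maps, axis=0)                          # average over the scales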
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]
I found the picture below:
It is shown on this web page.
I wonder how this kind of image is generated?
This picture is made by plotting the weights of the first-layer filters; later in that course there is a section about visualizing networks.
For higher layers, plotting the weights directly may not make much sense; a paper that might be useful is Visualizing Higher-Layer Features of a Deep Network.
Each convolutional kernel is just an N x M weight matrix, so you can simply plot it (each square in the plot above is a single convolutional kernel). The colored ones are probably taken from 3-channel convolutions, so each kernel is rendered as an RGB image.
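For the first layer this is straightforward to reproduce; a minimal sketch in Python, assuming the kernels come as an array of shape (num_filters, channels, height, width) (as in Caffe's net.params['conv1'][0].data):

    import numpy as np
    import matplotlib.pyplot as plt

    def show_filters(weights):
        # weights: (num_filters, channels, height, width), e.g. (96, 3, 11, 11)
        n = weights.shape[0]
        cols = int(np.ceil(np.sqrt(n)))
        rows = int(np.ceil(n / cols))
        for i in range(n):
            w = weights[i].transpose(1, 2, 0)                 # H x W x C for imshow
            w = (w - w.min()) / (w.max() - w.min() + 1e-8)    # rescale to [0, 1]
            ax = plt.subplot(rows, cols, i + 1)
            ax.imshow(w.squeeze(), cmap='gray' if w.shape[2] == 1 else None)
            ax.axis('off')
        plt.show()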
I am playing with TensorFlow to understand convolutional autoencoders. I have implemented a simple single-layer autoencoder which does this:
Input (Dimension: 95x95x1) ---> Encoding (convolution with 32 5x5 filters) ---> Latent representation (Dimension: 95x95x1x32) ---> Decoding (using tied weights) ---> Reconstructed input (Dimension: 95x95x1)
The inputs are black-and-white edge images i.e. the results of edge detection on RGB images.
I initialised the filters randomly and then trained the model to minimise loss, where loss is defined as the mean-squared-error of the input and the reconstructed input.
loss = 0.5*(tf.reduce_mean(tf.square(tf.sub(x,x_reconstructed))))
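In outline, the model is something like this (a simplified sketch in current TensorFlow syntax rather than my exact code; the ReLU and Adam choices are illustrative, and the tied weights are implemented by reusing the encoder kernel W in a transposed convolution):

    import tensorflow as tf

    # 32 encoder filters of size 5x5 on a single-channel input; the decoder
    # reuses the same kernel W (tied weights) via a transposed convolution.
    W = tf.Variable(tf.random.normal([5, 5, 1, 32], stddev=0.1))
    b_enc = tf.Variable(tf.zeros([32]))
    b_dec = tf.Variable(tf.zeros([1]))
    optimizer = tf.keras.optimizers.Adam(1e-3)

    @tf.function
    def train_step(x):  # x: [batch, 95, 95, 1]
        with tf.GradientTape() as tape:
            h = tf.nn.relu(tf.nn.conv2d(x, W, strides=1, padding='SAME') + b_enc)
            x_rec = tf.nn.conv2d_transpose(h, W, output_shape=tf.shape(x),
                                           strides=1, padding='SAME') + b_dec
            loss = 0.5 * tf.reduce_mean(tf.square(x - x_rec))
        grads = tape.gradient(loss, [W, b_enc, b_dec])
        optimizer.apply_gradients(zip(grads, [W, b_enc, b_dec]))
        return loss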
After training with 1000 steps, my loss converges and the network is able to reconstruct the images well. However, when I visualise the learned filters, they do not look very different from the randomly-initialised filters! But the values of the filters change from training step to training step.
Example of learned filters
I would have expected at least horizontal and vertical edge filters. Or, if my network were learning "identity filters", I would have expected them to be all white or something.
Does anyone have any idea about this? Or are there any suggestions as to what I can do to analyse what is happening? Should I include pooling and depooling layers before decoding?
Thank you!
P/S: I tried the same model on RGB images and again the filters look random (like random blotches of colours).
I am making 8 x 8 tiles of images and I want to train an RBF neural network in Matlab using those tiles as inputs. I understand that I can convert each matrix into a vector and use it. But is there a way to train on them as matrices (to preserve the locality)? Or is there any other technique to solve this problem?
There is no way to use a matrix directly as the input to such a neural network, but this would not change anything anyway:
Assume you have any neural network with an image as input, one hidden layer, and the output layer. There will be one weight from every input pixel to every hidden unit. All weights are initialized randomly and then trained using backpropagation. The development of these weights does not depend on any local information - it only depends on the gradient of the output error with respect to the weight. Having a matrix input will therefore make no difference to having a vector input.
For example, you could make a vector out of the image, shuffle that vector in any way (as long as you do it the same way for all images) and the result would be (more or less, due to the random initialization) the same.
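To make this concrete, here is a small numpy illustration (Python rather than Matlab, just to keep it short): one fixed permutation of the flattened 8x8 tiles is merely a relabeling of the input units, so any weight matrix learned on the original vectors has an exact counterpart on the permuted ones.

    import numpy as np

    rng = np.random.default_rng(0)
    tiles = rng.random((1000, 8, 8))       # 1000 example 8x8 tiles
    X = tiles.reshape(1000, 64)            # flatten: one 64-dim vector per tile

    perm = rng.permutation(64)             # one fixed shuffle for all samples
    X_shuffled = X[:, perm]

    # Permuting the rows of any weight matrix the same way gives identical
    # activations, i.e. the network cannot tell the two encodings apart.
    W = rng.standard_normal((64, 10))
    assert np.allclose(X @ W, X_shuffled @ W[perm])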
The way to handle local structures in the input data is using convolutional neural networks (CNN).
I have trained a 3-layer (input, hidden and output) feedforward neural network in Matlab. After training, I would like to simulate the trained network with an input test vector and obtain the response of the neurons of the hidden layer (not the final output layer). How can I go about doing this?
Additionally, after training a neural network, is it possible to "cut away" the final output layer and make the current hidden layer as the new output layer (for any future use)?
Extra-info: I'm building an autoencoder network.
The trained weights of the network are available in the net.IW (input weights) and net.LW (layer weights) properties, and the biases in net.b. You can use these to compute the hidden layer outputs yourself.
From Matlab Documentation
nnproperty.net_LW
Neural network LW property.
NET.LW
This property defines the weight matrices of weights going to layers
from other layers. It is always an Nl x Nl cell array, where Nl is the
number of network layers (net.numLayers).
The weight matrix for the weight going to the ith layer from the jth
layer (or a null matrix []) is located at net.LW{i,j} if
net.layerConnect(i,j) is 1 (or 0).
The weight matrix has as many rows as the size of the layer it goes to
(net.layers{i}.size). It has as many columns as the product of the size
of the layer it comes from with the number of delays associated with the
weight:
net.layers{j}.size * length(net.layerWeights{i,j}.delays)
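Concretely, getting the hidden layer response boils down to applying the input weights, the bias, and the hidden layer's transfer function yourself. A sketch of that computation (written in Python/numpy just for brevity; IW, b1 and tanh stand in for net.IW{1,1}, net.b{1} and tansig):

    import numpy as np

    def hidden_layer_output(IW, b1, x):
        # IW: net.IW{1,1}, b1: net.b{1}, x: one input column vector.
        # Note: if the network uses input preprocessing (e.g. mapminmax,
        # which is the default), apply the same mapping to x first.
        return np.tanh(IW @ x + b1)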
In addition to using the input and layer weights and biases, you can add an output connection from the desired layer (the net.outputConnect property) after training the network. I found this possible and easy, but I did not examine its correctness.