How to extract memnet heat maps with the caffe model?

I want to extract both the memorability score and the memorability heat maps using the available memnet caffemodel by Khosla et al. at link.
Looking at the prototxt model, I can understand that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance

As described in their paper [1], the CNN (MemNet) produces a single real-valued output for memorability. So the network they made publicly available computes this single memorability score for a given input image - and not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolutional kernel. For an FC layer with 4096 outputs, you interpret it as a 1x1 convolution with 4096 feature maps.
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file in which you replace the InnerProduct layers with Convolution layers. Now Caffe won't recognize the weights in the .caffemodel anymore, as the layer types no longer match. So you load the old net and its parameters into Python, load the new net, assign the old parameters to the new net, and save it as a new .caffemodel file.
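A minimal sketch of that parameter transplant, loosely following the Caffe net surgery tutorial; the layer names and file names below are assumptions, so take the real ones from the MemNet prototxt:
import caffe

# Load the original net and the new fully-convolutional definition.
# Both are initialized from the same .caffemodel; the unchanged conv layers
# load fine, while the renamed 1x1-conv replacements start out untrained.
net = caffe.Net('memnet.prototxt', 'memnet.caffemodel', caffe.TEST)
net_fc = caffe.Net('memnet_fullconv.prototxt', 'memnet.caffemodel', caffe.TEST)

# Hypothetical mapping from InnerProduct layer names to their Convolution
# counterparts -- substitute the real names from the MemNet prototxt.
layer_map = {'fc6': 'fc6-conv', 'fc7': 'fc7-conv', 'fc8-euclidean': 'fc8-conv'}

for fc, conv in layer_map.items():
    net_fc.params[conv][0].data.flat = net.params[fc][0].data.flat  # weights
    net_fc.params[conv][1].data[...] = net.params[fc][1].data       # biases

net_fc.save('memnet_fullconv.caffemodel')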
Now we can run images of any dimensions (at least 227x227) through the network.
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input dimension is 227x227. They mention that a 451x451 input gives an 8x8 output, which implies a stride of 28 for applying the layers. So a simple example could be:
Scale 1: 227x227 → 1x1. (I guess they definitely use this scale.)
Scale 2: 283x283 → 2x2. (Wild guess)
Scale 3: 339x339 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
Each scale produces its own low-resolution memorability map. So you just average these outputs to get your final 8x8 heatmap: upsample the lower-resolution maps to 8x8, then average them.
From the paper, I assume that they use very high-resolution scales, so their heatmap ends up around the same size as the original image. They write that it takes 1s on a "normal" GPU. This is quite a long time, which also suggests that they upsample the input images to quite high dimensions.
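A rough sketch of this multi-scale procedure, assuming the fully-convolutional net from Problem 1 has been saved as memnet_fullconv.caffemodel and that its single-channel output blob is called 'fc8-conv' (both names are assumptions):
import numpy as np
import caffe
from scipy.ndimage import zoom

net = caffe.Net('memnet_fullconv.prototxt', 'memnet_fullconv.caffemodel', caffe.TEST)

def memorability_map(img, scales=(227, 283, 339, 451), out_size=8):
    # img: preprocessed image as a (3, H, W) float array (mean-subtracted, BGR).
    maps = []
    for s in scales:
        _, h, w = img.shape
        scaled = zoom(img, (1, float(s) / h, float(s) / w), order=1)  # bilinear resize
        net.blobs['data'].reshape(1, 3, s, s)
        net.blobs['data'].data[...] = scaled
        out = net.forward()['fc8-conv'][0, 0]          # low-resolution memorability map
        maps.append(zoom(out, (float(out_size) / out.shape[0],
                               float(out_size) / out.shape[1]), order=1))
    return np.mean(maps, axis=0)                        # averaged 8x8 heatmap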
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]


Why is the normalized mean square error not changing in a variational autoencoder even after changing the network?

The calculation of the NMSE is not changing even after changing the latent dimension from 512 to 32.
The normalized mean square error values should change after changing the latent dimension.
A variational autoencoder differs from a plain autoencoder in that it describes the samples of the dataset in latent space in a probabilistic manner. Therefore, in a variational autoencoder, the encoder outputs a probability distribution in the bottleneck layer instead of a single output value.
Variational autoencoders were originally designed to generate simple synthetic images. Since their introduction, VAEs have been shown to work quite well with images that are more complex than simple 28 x 28 MNIST images. For example, it is possible to use a VAE to generate very realistic looking images of people.
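To make the "the encoder outputs a distribution" point concrete, here is a minimal PyTorch-style sketch; the layer sizes and the latent dimension are arbitrary assumptions:
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps an input to the mean and log-variance of a Gaussian over the latent code."""
    def __init__(self, in_dim=784, hidden=256, latent=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)       # mean of q(z|x)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z|x)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar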

How to visualize kernels generated by a trained convolutional neural network?

I found the picture below, which is shown in this web page.
I wonder how this kind of image is generated.
This picture is made by printing out the weights of the first-layer filters; later in that course there's a section about visualizing the networks.
For higher layers, printing out the weights may not make sense; there's a paper that might be useful: Visualizing Higher-Layer Features of a Deep Network.
Each convolutional kernel is just an N x M weight matrix, so you can simply plot it (each square in the above plot is a single convolutional kernel). The colored ones are probably taken from 3-channel convolutions, so each channel is encoded as one color channel.
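For example, assuming the first-layer weights have been pulled out of a trained network as a NumPy array of shape (num_filters, 3, k, k), a grid like the one above can be plotted roughly like this:
import numpy as np
import matplotlib.pyplot as plt

def plot_first_layer_filters(weights, cols=8):
    # weights: array of shape (num_filters, 3, k, k) from the first conv layer.
    n = weights.shape[0]
    rows = int(np.ceil(n / float(cols)))
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for ax, w in zip(axes.flat, weights):
        w = w.transpose(1, 2, 0)                        # (k, k, 3) for imshow
        w = (w - w.min()) / (w.max() - w.min() + 1e-8)  # rescale to [0, 1]
        ax.imshow(w)
    for ax in axes.flat:
        ax.axis('off')
    plt.show()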

How to ensure consistency in SIFT features?

I am working with a classification algorithm that requires the size of the feature vector of all samples in training and testing to be the same.
I am also required to use the SIFT feature extractor. This is causing problems, as the feature matrix of every image comes out a different size. I know that SIFT detects a variable number of keypoints in each image, but is there a way to ensure that the size of the SIFT features is consistent so that I do not get a dimension mismatch error?
I have tried rootSIFT as a workaround:
[~, features] = vl_sift(single(images{i}));
double_features = double(features);
root_it = sqrt( double_features/sum(double_features) ); %root-sift
feats{i} = root_it;
This gives me a consistent 128 x 1 vector for every image, but it is not working for me as the size of each vector is now very small and I am getting a lot of NaN in my classification result.
Is there any way to solve this?
When using SIFT, there are generally two steps you need to perform:
1. Extract SIFT features. These points (the first output argument of your function, giving the (x,y) locations) are scale invariant and should in theory be present in every image of the same object. This is not completely true: often points are unique to each frame (image). Each point is described by a 128-dimensional descriptor (the second output argument of your function).
2. Match points. Each time you compute features of a different image, the number of points computed is different! Many of them should be the same points as in the previous image, but many of them WON'T be. You will have new points, and old points may no longer be present. This is why you should perform a feature matching step to link those points across images. Usually this is done by kNN matching or RANSAC. You can Google how to perform this task and you'll find tons of examples.
After the second step, you should have a fixed number of points for the whole set of images (considering they are images of the same object). The number of points will be significantly smaller than in each single image (sometimes ~30 times fewer). Then do whatever you want with them!
Hint for matching: http://www.vlfeat.org/matlab/vl_ubcmatch.html
UPDATE:
You seem to be trying to train some kind of OCR. You would probably need to match SIFT features independently for each character.
How to use vl_ubcmatch:
% Extract SIFT frames and descriptors from both images (vl_sift expects single precision).
[~, features1] = vl_sift(single(I1));
[~, features2] = vl_sift(single(I2));
% Match the 128-D descriptors between the two images.
matches = vl_ubcmatch(features1, features2);
You can apply dense SIFT to the image. This way you have more control over where the feature descriptors come from. I haven't used vlfeat, but looking at the documentation I see there's a function to extract dense SIFT features called vl_dsift. With vl_sift, there's also a way to bypass the detector and extract descriptors from points of your choice using the 'frames' option. Either way, it seems you can get a fixed number of descriptors.
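If you happen to work in Python rather than MATLAB, the same "fixed points instead of a detector" idea can be sketched with OpenCV; the grid step and keypoint size here are arbitrary choices:
import cv2

def dense_sift(gray, step=8, size=8):
    # gray: 8-bit grayscale image. Describing a regular grid of keypoints gives
    # every same-sized image the same number of 128-D descriptors.
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, gray.shape[0] - step, step)
                 for x in range(step, gray.shape[1] - step, step)]
    _, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (num_grid_points, 128)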
If you are using images of the same size, dense SIFT or the frames option is okay. There's another approach you can take, called the bag-of-features model (similar to the bag-of-words model), in which you cluster the features extracted from the images to generate codewords and feed those into a classifier.
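A rough sketch of the bag-of-features idea, using OpenCV and scikit-learn as an example toolchain (the vocabulary size of 200 is an arbitrary assumption):
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_images, k=200):
    # Cluster all descriptors from the training images into k visual words.
    sift = cv2.SIFT_create()
    all_desc = [sift.detectAndCompute(img, None)[1] for img in train_images]
    all_desc = np.vstack([d for d in all_desc if d is not None])
    return KMeans(n_clusters=k, n_init=10).fit(all_desc)

def bag_of_features(image, vocabulary):
    # Fixed-length histogram of visual words, regardless of how many keypoints fire.
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(image, None)
    k = vocabulary.n_clusters
    if desc is None:
        return np.zeros(k)
    words = vocabulary.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()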

PCA on SIFT descriptors and Fisher Vectors

I was reading this particular paper http://www.robots.ox.ac.uk/~vgg/publications/2011/Chatfield11/chatfield11.pdf and I find the Fisher Vector with GMM vocabulary approach very interesting and I would like to test it myself.
However, it is totally unclear (to me) how they apply PCA dimensionality reduction to the data. I mean, do they compute the feature space and, once it is computed, perform PCA on it? Or do they perform PCA on each image's SIFT descriptors and then build the feature space?
Is this supposed to be done for both the training and test sets? To me it's an 'obviously yes' answer, however it is not clear.
I was thinking of creating the feature space from the training set and then running PCA on it. Then I could use the PCA coefficients from the training set to reduce each image's SIFT descriptors before they are encoded into a Fisher Vector for later classification, whether it is a test or a train image.
EDIT 1:
Simplistic example:
[coef , reduced_feat_space]= pca(Feat_Space','NumComponents', 80);
and then (for both test and train images)
reduced_test_img = test_img * coef; % then keep the first 80 dimensions of reduced_test_img
What do you think? Cheers
It looks to me like they do SIFT first and then do PCA. The article states in section 2.1: "The local descriptors are fixed in all experiments to be SIFT descriptors..."
Also, in the introduction section: "the following three steps: (i) extraction of local image features (e.g., SIFT descriptors), (ii) encoding of the local features in an image descriptor (e.g., a histogram of the quantized local features), and (iii) classification ..." So it looks to me like the dimensionality reduction occurs after SIFT, and the paper is simply talking about a few different methods of doing this and the performance of each.
I would also guess (as you did) that you would have to run it on both sets of images. Otherwise you would be using two different metrics to classify the images; it really is like comparing apples to oranges. Comparing a reduced-dimensional representation to the full one (even for the exact same image) will show some variation. In fact, that is the whole premise of PCA: you are giving up some smaller features (usually) for computational efficiency. The real question with PCA or any dimensionality reduction algorithm is how much information you can give up and still reliably classify/segment different data sets.
And as a last point, you have to treat both sets of images the same way, because your end goal is to use the Fisher vector for classification, whether an image is a test or a training image. Now imagine you decided training images don't get PCA and test images do. Now I give you some image X: what would you do with it? How could you treat one set of images differently from another BEFORE you've classified them? Using the same technique on both sets means you'd process my image X and then decide where to put it.
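A minimal sketch of that train-then-transform workflow with scikit-learn (the variable names are hypothetical; 80 components mirrors the question's example):
import numpy as np
from sklearn.decomposition import PCA

def fit_descriptor_pca(train_descriptors, n_components=80):
    # train_descriptors: list of (num_keypoints_i, 128) SIFT arrays,
    # one per *training* image. PCA is fit on the pooled training data only.
    pca = PCA(n_components=n_components)
    pca.fit(np.vstack(train_descriptors))
    return pca

def reduce_descriptors(pca, descriptors):
    # Project any image's descriptors (train or test) with the same transform.
    return pca.transform(descriptors)  # shape: (num_keypoints, n_components)
Note that PCA.transform subtracts the mean estimated from the training descriptors; in the MATLAB snippet above you would likewise need to subtract the training mean from test_img before multiplying by coef.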
Anyway, I hope that helped and wasn't too rant-like. Good luck :-)

KNN Classifier using cross validation

I am trying to implement a KNN classifier using the cross-validation approach, where I have several images of a certain character for training (e.g. 5 images) and another two for testing. I get the idea of cross-validation: simply choose the K with the least error value during training, and then use it with the test data to find out how accurate my results are.
My question is: how do I train on the images in MATLAB to get my K value? Do I compare them and try to find mismatches, or what?
Any help would be really appreciated.
First of all, you need to define your task precisely. For example: given an image I in R^(MxN), we wish to classify I as an image containing faces or an image without faces.
I often work with pixel classifiers, where the task is something like: for an image I, decide whether each pixel is a face pixel or a non-face pixel.
An important part of defining the task is to make a hypothesis that can be used as the basis for training a classifier. For example: we believe that the distribution of pixel intensities can be used to discriminate images of faces from images not containing faces.
Then you need to select some features that define your image. This can be done in many ways and you should search for what other people do when they analyse the same type of images you are working with.
One widely used method in pixel classification is to use pixel intensity values and do a multi-scale analysis of the image. The idea of multi-scale analysis is that different structures are most evident at different levels of blurring, called scales. As an illustration, consider an image of a tree. Without blurring we notice the fine structure, such as small branches and leaves. When we blur the image we notice the trunk and major branches. This is often used as part of segmentation methods.
When you know your task and your features, you can train a classifier. If you use kNN and cross-validation to find the best k, you should split your dataset into training/test sets and then split the training set into train/validation sets. You then train using the reduced training set and use the validation set to decide which k is best. In the case of binary classification, e.g. face vs. non-face, the error rate is often used as a measure of performance.
Finally you use the parameters to train the classifier on the full dataset and estimate its performance on the test set.
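A compact sketch of that split-and-select procedure with scikit-learn, assuming feature extraction has already produced a numeric matrix X and a label vector y (the split ratios and candidate k values are arbitrary):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_k_and_evaluate(X, y, candidate_ks=(1, 3, 5)):
    # Outer split: keep a test set that is never touched during model selection.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    # Inner split: carve a validation set out of the training data to choose k.
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

    best_k = max(candidate_ks,
                 key=lambda k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val))

    # Retrain on the full training set with the chosen k and report test accuracy.
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return best_k, final.score(X_test, y_test)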
A classification example: With or without milk?
As a full example, consider images of a cup of coffee taken from above, so each shows the rim of the cup surrounding a brown-colored disk. Further assume that all images are scaled and cropped so the diameter of the disk is the same and the dimensions of the image are the same. To simplify the task, we convert the color images to grayscale and scale the pixel intensities to the range [0,1].
We want to train a classifier so it can distinguish coffee with milk from coffee without milk. From inspection of histograms of some of the coffee images, we see that each image has two "bumps" in the histogram that are clearly separated. We believe that these bumps correspond to foreground (coffee) and background. Now we make the hypothesis that the average intensity of the foreground can be used to distinguish between coffee+milk/coffee.
To find the foreground pixels, we observe that because the foreground/background ratio is the same in every image (by design), we can just find the intensity threshold that gives us that ratio for each image. Then we calculate the average intensity of the foreground pixels and use this value as the feature for each image.
If we have N images that we have manually labeled, we split this into training and test set. We then calculate the average foreground intensity for each image in the training set, giving us a set of (average foreground intensity, label) values. We want to use kNN where an image is assigned the same class as the majority class of the k closest images. We measure the distance as the absolute value of the difference in average foreground pixel intensity.
We search for the optimal k with cross-validation. We use 2-fold cross-validation (a.k.a. holdout) to find the best k: we test k = {1,3,5} and select the k that gives the least prediction error on the validation set.
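A sketch of the feature described above, assuming the design-time foreground fraction is known (say 60% of the pixels are coffee) and that the coffee disk is the darker of the two histogram bumps (both assumptions); images are grayscale arrays scaled to [0, 1]:
import numpy as np

def average_foreground_intensity(gray, foreground_ratio=0.6):
    # Pick the intensity threshold that yields the known foreground fraction,
    # then return the mean intensity of the foreground (coffee) pixels.
    threshold = np.quantile(gray, foreground_ratio)
    foreground = gray[gray <= threshold]
    return foreground.mean()

# This gives one scalar feature per image; with kNN the distance is simply the
# absolute difference in average foreground intensity, and k is chosen on a
# validation split as in the sketch above.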