Convolutional autoencoder not learning meaningful filters - neural-network

I am playing with TensorFlow to understand convolutional autoencoders. I have implemented a simple single-layer autoencoder which does this:
Input (Dimension: 95x95x1) ---> Encoding (convolution with 32 5x5 filters) ---> Latent representation (Dimension: 95x95x1x32) ---> Decoding (using tied weights) ---> Reconstructed input (Dimension: 95x95x1)
The inputs are black-and-white edge images i.e. the results of edge detection on RGB images.
I initialised the filters randomly and then trained the model to minimise loss, where loss is defined as the mean-squared-error of the input and the reconstructed input.
loss = 0.5*(tf.reduce_mean(tf.square(tf.sub(x,x_reconstructed))))
After training with 1000 steps, my loss converges and the network is able to reconstruct the images well. However, when I visualise the learned filters, they do not look very different from the randomly-initialised filters! But the values of the filters change from training step to training step.
Example of learned filters
I would have expected at least horizontal and vertical edge filters. Or if my network was learning "identity filters" I would have expected the filters to all be white or something?
Does anyone have any idea about this? Or are there any suggestions as to what I can do to analyse what is happening? Should I include pooling and depooling layers before decoding?
Thank you!
P/S: I tried the same model on RGB images and again the filters look random (like random blotches of colours).

Related

How to extract memnet heat maps with the caffe model?

I want to extract both memorability score and memorability heat maps by using the available memnet caffemodel by Khosla et al. at link
Looking at the prototxt model, I can understand that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here some examples.
Thanks in advance
As described in their paper [1], the CNN (MemNet) outputs a single, real-valued output for the memorability. So, the network they made publicly available, calculates this single memorability score, given an input image - and not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolutional kernel. For an FC layer with 4096 outputs, you interpret it as a 1x1 convolution with 4096 feature maps.
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file, where you replace the InnerProduct layers with Convolution layers. Now, Caffe won't recognize the weights in the .caffemodel anymore, as the layer types don't match anymore. So, you load the old net and its parameters into Python, load the new net, and assign the old parameters to the new net and save it as a new .caffemodel file.
Now, we can run images of any dimensions (larger or equal than 227x227) through the network.
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. The MemNet is a re-trained AlexNet, so the default input dimension is 227x227. They mention that a 451x451 input gives a 8x8 output, which implies a stride of 28 for applying the layers. So a simple example could be:
Scale 1: 227x227 → 1x1. (I guess they definitely use this scale.)
Scale 2: 283x283 → 2x2. (Wild guess)
Scale 3: 339x339 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
The results will look like this:
So, you'll just average these outputs to get your final 8x8 heatmap. From the image above, it should be clear how to average the different-scale outputs: you'll have to upsample the low-res ones to 8x8, and average then.
From the paper, I assume that they use very high-res scales, so their heatmap will be around the same size as the image initially was. They write that it takes 1s on a "normal" GPU. This is a quite long time, which also indicates that they probably upsample the input images quite to quite high dimensions.
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]

what would be the right feature extraction methods for these images?

I am trying to classify cataract images, First i crop out the pupil area and save it in another folder, then i execute the wavelet transform on these cropped images using wavedec function and sym8 filter, then i take the approximation coefficients as the feature vector and the final step would be sending these feature vectors to neural network, after trying the neural network to classify my data set which contains 51 images,I have tried 10,20, 50, 70 hidden layers,But It shows that only 90% is the percentage of the correct classified images. so I want to increase this percentage,
So, any suggestions on what to use other than wavelet?
why when i increased the dataset size to 83 images the neural network showed me bad results such as 70% or 60%?!
should i stop using Neural and start looking for another classifier?
some image from the dataset
another one

Convolution Neural Network for image detection/classification

So here is there setup, I have a set of images (labeled train and test) and I want to train a conv net that tells me whether or not a specific object is within this image.
To do this, I followed the tensorflow tutorial on MNIST, and I train a simple conv net reduced to the area of interest (the object) which are training on image of size 128x128. The architecture is as follows : successively 3 layers consisting of 2 conv layers and 1 max pool down-sampling layers, and one fully connected softmax layers (with two class 0 and 1 whether the object is present or not)
I impleted it using tensorflow, and this works quite well, but since I have enough computing power I was wondering how I could improve the complexity of the classification:
- adding more layers ?
- adding more channel at each layer ? (currently 32,64,128 and 1024 for the fully connected)
- anything else ?
But the most important part is that now I want to detect this same object on larger images (roughle 600x600 whereas the size of the object should be around 100x100).
I was wondering how I could use the previously training "small" network used for small images, in order to pretrained a larger network on the large images ? One option could be to classify the image using a slicing window of size 128x128 and scan the whole image but I would like to try if possible to train a whole network on it.
Any suggestion on how to proceed ? Or an article / ressource tackling this kind of problem ? (I am really new to deep learning so sorry if this is stupid question...)
Thanks !
I suggest that you continue reading on the field overall. Your search keys include CNN, image classification, neural net, AlexNet, GoogleNet, and ResNet. This will return many articles, on-line classes and lectures, and other materials to help you learn about classification with neural nets.
Don't just add layers or filters: the complexity of the topology (net design) must be fitted to the task; a net that's too complex will over-fit the training data. The one you've been using is probably LeNet; the three I cite above are for the ImageNet image classification contest.
Since you are working on images, I would suggest you to use a pretrained image classification network (like VGG, Alexnet etc.)and fine tune this network with your 128x128 image data. In my experience until we have very large data set fine tuned network will give more accuracy and also save training time. After building a good image classifier on your data set you can use any popular algorithm to generate region of proposal from the image. Now take all regions of proposal and pass them to classification network one by one and check weather this network is classifying given region of proposal as positive or negative. If it classifying as positively then most probably your object is present in that region. Otherwise it's not. If there are a lot of region of proposal in which object is present according to classifier then you can use non maximal suppression algorithms to reduce number of positive proposals.

How to train a Matlab Neural Network using matrices as inputs?

I am making 8 x 8 tiles of Images and I want to train a RBF Neural Network in Matlab using those tiles as inputs. I understand that I can convert the matrix into a vector and use it. But is there a way to train them as matrices? (to preserve the locality) Or is there any other technique to solve this problem?
There is no way to use a matrix as an input to such a neural network, but anyway this won't change anything:
Assume you have any neural network with an image as input, one hidden layer, and the output layer. There will be one weight from every input pixel to every hidden unit. All weights are initialized randomly and then trained using backpropagation. The development of these weights does not depend on any local information - it only depends on the gradient of the output error with respect to the weight. Having a matrix input will therefore make no difference to having a vector input.
For example, you could make a vector out of the image, shuffle that vector in any way (as long as you do it the same way for all images) and the result would be (more or less, due to the random initialization) the same.
The way to handle local structures in the input data is using convolutional neural networks (CNN).

Deconvolution with caffe

I was wondering if it is possible to perform a deconvolution of images in Caffe using a point spread function of objective at a given focal point. Something along the lines of this approach.
If yes, what would be the best way to proceed?
It is possible to deconvolve images using Caffe (and CNN in general), but the approach may not be as general as you hope it to be.
CNNs can take blurry image as an input and output sharp image. As the networks are convolutional, the input can be of any size. This can be easily done in Caffe using Convolutional layers and Euclidean Loss layer. Optionally, you can experiment with adding some pooling and deconvolution layers.
CNNs can be trained to deconvolve images for specific blur PSF as in your link. (see: [Xu et al.:Deep Convolutional Neural Network for Image Deconvolution. NIPS 2014]). This works well but you have to re-train the CNN for each new PSF (which takes lot of time).
I've tried to train CNNs to do blind deconvolution (PSF is not known) and it works very well for text documents. You can get trained nets and python-Caffe scripts at [Hradiš et al.: Convolutional Neural Networks for Direct Text Deblurring. BMVC 2015]. This approach could work for other types of images, but it would not work for unrestricted photographs and diverse blurs. For general photos, I would guess It could work for small range of blurs.
Another possibility is to do inverse filtration (e.g. using Wiener filter) and process the output using a CNN. The advantage of this is that you can compute the inverse filter for new PSF very fast and the CNN stays the same. [Schuler et al.: A machine learning approach for non-blind image deconvolution. CVPR 2013]