Image generation using autoencoder vs. variational autoencoder

When we use a convolutional autoencoder to generate new images,
does the model generate the same image every time we run it, or does it generate images with random variation?
I think that the autoencoder (AE) generates the same new images every time we run the model, because it maps the input image to a single point in the latent space. On the other hand, the variational autoencoder (VAE) maps the input image to a distribution. Therefore, if we need images with some random variation we should use a VAE, and if we need the same generated images every time we run the model we should use an AE. Is this true?
My question is:
Does AE generate images with random variation?

Autoencoders first encode the input data into some latent representation and then use that representation (bottleneck layer) to reconstruct the same input.
I trained an autoencoder on MNIST data and encoded the digits into a two-dimensional vector. The network learned quite a useful representation of the data that I plotted.
[Figure: latent representation of MNIST digits in the 2-D latent space]
You can see that the latent representation of each digit occupies some range of values; for example, the zeros roughly span -2 to 4 on the x-axis and 4 to 8 on the y-axis.
Now, if you sample a random two-dimensional vector in that range and run it through the decoder, you will get a random image of a zero.
The problem is that this is a very simple case. What if the latent vector has 64 dimensions or more, and there are many more categories? In that case, we need to model the distribution of the latent vectors in order to sample valid ones; otherwise, we would never know which latent vectors are valid.
So an autoencoder can give random samples, but only if we know the distribution of its latent codes, and that is exactly the point that the VAE addresses.
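For concreteness, here is a minimal sketch of that sampling procedure, assuming TensorFlow/Keras, a dense (rather than convolutional) autoencoder with a 2-D bottleneck, and illustrative layer sizes; the sampling range for the zeros follows the rough plot ranges above, not any exact fit:

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Load MNIST and flatten to 784-dimensional vectors in [0, 1]
    (x_train, _), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    # Encoder 784 -> 2 and decoder 2 -> 784
    encoder = keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(784,)),
        layers.Dense(2),
    ])
    decoder = keras.Sequential([
        layers.Dense(128, activation="relu", input_shape=(2,)),
        layers.Dense(784, activation="sigmoid"),
    ])
    autoencoder = keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)

    # Sample a random 2-D vector from the region the zeros occupy in the
    # learned latent space and decode it into a new image of a zero.
    z = np.random.uniform(low=[-2.0, 4.0], high=[4.0, 8.0], size=(1, 2))
    new_image = decoder.predict(z).reshape(28, 28)

The decoded image is deterministic given z; all the variation comes from how z is sampled, which is exactly the point made above.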

Related

Why the normalized mean square error is not changing in a variational autoencoder even after changing the network

The calculated NMSE does not change even after changing the latent dimension from 512 to 32, although the normalized mean square error values should change when the latent dimension changes.
A variational autoencoder differs from an autoencoder in that it provides a statistical way of describing the samples of the dataset in latent space. Therefore, in a variational autoencoder the encoder outputs a probability distribution in the bottleneck layer instead of a single output value.
Variational autoencoders were originally designed to generate simple synthetic images. Since their introduction, VAEs have been shown to work quite well with images that are more complex than simple 28 x 28 MNIST images. For example, it is possible to use a VAE to generate very realistic looking images of people.
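To make the distinction concrete, here is a minimal sketch, assuming TensorFlow/Keras and illustrative layer sizes, of a VAE encoder that outputs distribution parameters in the bottleneck and samples from them with the reparameterization trick:

    import tensorflow as tf
    from tensorflow.keras import layers

    latent_dim = 32

    class Sampling(layers.Layer):
        """Draw z ~ N(z_mean, exp(z_log_var)) via the reparameterization trick."""
        def call(self, inputs):
            z_mean, z_log_var = inputs
            eps = tf.random.normal(shape=tf.shape(z_mean))
            return z_mean + tf.exp(0.5 * z_log_var) * eps

    inputs = tf.keras.Input(shape=(784,))
    h = layers.Dense(256, activation="relu")(inputs)
    z_mean = layers.Dense(latent_dim)(h)     # mean of q(z|x)
    z_log_var = layers.Dense(latent_dim)(h)  # log-variance of q(z|x)
    z = Sampling()([z_mean, z_log_var])      # a different z on every call

    encoder = tf.keras.Model(inputs, [z_mean, z_log_var, z])

An ordinary autoencoder's encoder would end in a single Dense(latent_dim) output; the two distribution heads and the sampling step are what make the bottleneck stochastic.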

Proper way of generating initial random vectors for generator model in GAN?

Frequently, a Gaussian or uniform prior with zero mean and unit variance is used to generate the initial random vectors for the generator model in a Generative Adversarial Network (GAN), where the size of the vector can be chosen arbitrarily, e.g. 100, and linear interpolation is often applied between such vectors.
Let's say we have 1000 training images and a batch size of 64. Then, in each epoch, we need to draw a random vector from the prior distribution for each image in the mini-batch. But the problem I see is that, since there is no mapping between a random vector and a corresponding image, the same image can be generated from multiple initial random vectors. This paper suggests overcoming this problem, to some extent, by using spherical interpolation.
So what happens if we initially generate one random vector per training image and then keep using those same, initially generated vectors throughout training?
In GANs the random seed used as input does not actually correspond to any real input image. What GANs actually do is learn a transformation function from a known noise distribution (e.g. Gaussian) to a complex unknown distribution, which is represented by i.i.d. samples (e.g. your training set). What the discriminator in a GAN does is calculate a divergence (e.g. Wasserstein divergence, KL-divergence, etc.) between the generated data (e.g. the transformed Gaussian) and the real data (your training data). This is done in a stochastic fashion, and therefore no link is necessary between the real and the fake data. If you want to learn more about this with a hands-on example, I can recommend training a Wasserstein GAN to transform one 1-D Gaussian distribution into another. There you can visualize the discriminator and its gradient and really see the dynamics of such a system.
Anyway, what your paper is describing happens after you have trained your GAN, when you want to see how it has mapped the known noise space to the unknown image space. For this purpose, interpolation schemes such as the spherical one you are quoting have been invented. They also show that the GAN has learned to map some parts of the latent space to key characteristics in images, like smiles. But this has nothing to do with the training of GANs.
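Since spherical interpolation is what the paper proposes, here is a minimal NumPy sketch of it; the trained generator at the end is hypothetical, only indicating where the interpolated vectors would go:

    import numpy as np

    def slerp(z0, z1, t):
        """Spherical linear interpolation between vectors z0 and z1 at fraction t."""
        z0_n = z0 / np.linalg.norm(z0)
        z1_n = z1 / np.linalg.norm(z1)
        omega = np.arccos(np.clip(np.dot(z0_n, z1_n), -1.0, 1.0))  # angle between them
        if np.isclose(omega, 0.0):
            return (1.0 - t) * z0 + t * z1  # nearly parallel: fall back to lerp
        return (np.sin((1.0 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

    # Interpolate 10 points between two Gaussian latent vectors of size 100.
    z0, z1 = np.random.randn(100), np.random.randn(100)
    path = [slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 10)]
    # Feeding each vector in `path` to a trained generator would show a
    # smooth transition between the two generated images.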

How to extract memnet heat maps with the caffe model?

I want to extract both memorability score and memorability heat maps by using the available memnet caffemodel by Khosla et al. at link
Looking at the prototxt model, I can understand that the final inner-product output should be the memorability score, but how should I obtain the memorability map for a given input image? Here are some examples.
Thanks in advance
As described in their paper [1], the CNN (MemNet) outputs a single real-valued memorability score. So the network they made publicly available calculates this single memorability score for a given input image - not a heatmap.
In section 5 of the paper, they describe how to use this trained CNN to predict a memorability heatmap:
To generate memorability maps, we simply scale up the image and apply MemNet to overlapping regions of the image. We do this for multiple scales of the image and average the resulting memorability maps.
Let's consider the two important steps here:
Problem 1: Make the CNN work with any input size.
To make the CNN work on images of any arbitrary size, they use the method presented in [2].
While convolutional layers can be applied to images of arbitrary size - resulting in smaller or larger outputs - the inner product layers have a fixed input and output size.
To make an inner product layer work with any input size, you apply it just like a convolutional kernel: an FC layer whose input is a 4096-dimensional vector becomes a 1x1 convolution with 4096 feature maps, and the first FC layer after the conv stack becomes a convolution whose kernel size equals the spatial size of its input (6x6 for AlexNet's fc6).
To do that in Caffe, you can directly follow the Net Surgery tutorial. You create a new .prototxt file in which you replace the InnerProduct layers with Convolution layers. Caffe then won't match the weights in the .caffemodel to these layers anymore, as the layer types differ. So you load the old net and its parameters into Python, load the new net, assign the old parameters to the new net, and save it as a new .caffemodel file.
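A minimal pycaffe sketch of that surgery, following the Net Surgery tutorial; the file and layer names here are hypothetical and have to be replaced with the ones from the MemNet prototxt:

    import caffe

    # Original net (InnerProduct layers) and the new all-convolutional net
    net = caffe.Net('memnet.prototxt', 'memnet.caffemodel', caffe.TEST)
    net_conv = caffe.Net('memnet_full_conv.prototxt', 'memnet.caffemodel', caffe.TEST)

    params = ['fc6', 'fc7', 'fc8']                      # hypothetical InnerProduct names
    params_conv = ['fc6-conv', 'fc7-conv', 'fc8-conv']  # their Convolution counterparts

    for pr, pr_conv in zip(params, params_conv):
        # The raw weight values are identical, only their shape differs,
        # so a flat copy into the conv kernels suffices.
        net_conv.params[pr_conv][0].data.flat = net.params[pr][0].data.flat
        net_conv.params[pr_conv][1].data[...] = net.params[pr][1].data  # biases

    net_conv.save('memnet_full_conv.caffemodel')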
Now we can run images of any dimensions (at least 227x227) through the network.
Problem 2: Generate the heat map
As explained in the paper [1], you apply this fully-convolutional network from Problem 1 to the same image at different scales. MemNet is a re-trained AlexNet, so the default input dimension is 227x227. They mention that a 451x451 input gives an 8x8 output, which implies an effective stride of 32 for applying the layers (451 = 227 + 7 x 32; 32 is also AlexNet's total downsampling factor). So a simple example could be:
Scale 1: 227x227 → 1x1. (They almost certainly use this scale.)
Scale 2: 283x283 → 2x2. (Wild guess)
Scale 3: 339x339 → 4x4. (Wild guess)
Scale 4: 451x451 → 8x8. (This scale is mentioned in the paper.)
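A quick sanity check of these guesses, assuming the 227x227 receptive field and effective stride of 32 derived above:

    # Output grid produced by each input scale: floor((size - 227) / 32) + 1
    for size in (227, 283, 339, 451):
        out = (size - 227) // 32 + 1
        print(f"{size}x{size} -> {out}x{out}")
    # prints: 227x227 -> 1x1, 283x283 -> 2x2, 339x339 -> 4x4, 451x451 -> 8x8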
The results will look like this: [figure: the output maps at each of the four scales]
So you'll just average these outputs to get your final 8x8 heatmap. From the image above, it should be clear how to average the different-scale outputs: you upsample the low-res ones to 8x8, and then average.
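A minimal NumPy/SciPy sketch of that averaging step, with random placeholder maps standing in for the network outputs at the four scales above:

    import numpy as np
    from scipy.ndimage import zoom

    # Placeholder outputs for the 1x1, 2x2, 4x4 and 8x8 scales
    maps = [np.random.rand(n, n) for n in (1, 2, 4, 8)]

    target = 8
    upsampled = [zoom(m, target / m.shape[0], order=1) for m in maps]  # bilinear
    heatmap = np.mean(upsampled, axis=0)  # final 8x8 memorability map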
From the paper, I assume that they use very high-resolution scales, so their heatmap ends up around the same size as the original image. They write that it takes 1s on a "normal" GPU, which is quite a long time and also indicates that they probably upsample the input images to quite high dimensions.
Bibliography:
[1]: A. Khosla, A. S. Raju, A. Torralba, and A. Oliva, "Understanding and Predicting Image Memorability at a Large Scale", in: ICCV, 2015. [PDF]
[2]: J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation", in: CVPR, 2015. [PDF]

How to train a Matlab Neural Network using matrices as inputs?

I am making 8 x 8 tiles of images, and I want to train an RBF neural network in Matlab using those tiles as inputs. I understand that I can convert each matrix into a vector and use that. But is there a way to train on them as matrices (to preserve the locality)? Or is there any other technique to solve this problem?
There is no way to use a matrix as an input to such a neural network, but in any case this would not change anything:
Assume you have any neural network with an image as input, one hidden layer, and the output layer. There will be one weight from every input pixel to every hidden unit. All weights are initialized randomly and then trained using backpropagation. The development of these weights does not depend on any local information - it only depends on the gradient of the output error with respect to the weight. Having a matrix input would therefore make no difference compared to a vector input.
For example, you could make a vector out of the image, shuffle that vector in any way (as long as you do it the same way for all images), and the result would be (more or less, up to the random initialization) the same; the short check below illustrates this.
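A minimal NumPy sketch of that argument, with made-up tile data: the assertion shows that a fixed shuffle of the flattened inputs is exactly compensated by permuting the rows of the first weight matrix, so the family of functions the network can learn is unchanged:

    import numpy as np

    rng = np.random.default_rng(0)
    tiles = rng.random((1000, 8, 8))   # 1000 training tiles
    X = tiles.reshape(1000, 64)        # flatten: the usual approach

    perm = rng.permutation(64)         # one fixed shuffle applied to every sample
    X_shuffled = X[:, perm]

    W = rng.standard_normal((64, 16))  # first-layer weights, 16 hidden units
    # A net with weights W on X computes the same pre-activations as a net
    # with row-permuted weights on the shuffled data:
    assert np.allclose(X @ W, X_shuffled @ W[perm])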
The way to handle local structure in the input data is to use convolutional neural networks (CNNs).

Dimension reduction using PCA for FastICA

I am trying to develop a system for image classification. I am following the article:
INDEPENDENT COMPONENT ANALYSIS (ICA) FOR TEXTURE CLASSIFICATION by Dr. Dia Abu Al Nadi and Ayman M. Mansour
In a paragraph it says:
Given the above texture images, the Independent Components are learned by the method outlined above.
The (8 x 8) ICA basis function for the above textures are shown in Figure 2. respectively. The dimension is reduced by PCA, resulting in a total of 40 functions. Note that independent components from different windows size are different.
The "method outlined above" is FastICA, the textures are taken from Brodatz album , each texture image has 640x640 pixels. My question is:
What do the authors mean by "The dimension is reduced by PCA, resulting in a total of 40 functions", and how can I get those functions using Matlab?
PCA (Principal Component Analysis) is a method for finding an orthogonal basis (think of a coordinate system) for a high-dimensional (data) space. The "axes" of the PCA basis are sorted by variance, i.e. along the first PCA "axis" your data has the largest variance, along the second "axis" the second largest variance, etc.
This is exploited for dimension reduction: say you have 64-dimensional data (e.g. vectorized 8 x 8 patches). You do a PCA, transform your data into the PCA basis, and throw away all but the first 40 dimensions (the number used in your paper). If your data follows a certain statistical distribution, then chances are that the 40 PCA dimensions describe your data almost as well as the 64 original dimensions did. There are methods for finding the number of dimensions to use, but that is beyond the scope here.
Computationally, PCA amounts to finding the eigendecomposition of your data's covariance matrix; in Matlab: [V,D] = eig(cov(MyData)). Note that for a symmetric matrix, eig returns the eigenvalues in ascending order, so for PCA you keep the eigenvectors belonging to the largest eigenvalues.
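As a concrete sketch of the reduction to 40 functions (in NumPy; the steps map one-to-one onto the Matlab line above, and the patch data is a random placeholder):

    import numpy as np

    X = np.random.rand(5000, 64)  # placeholder: 5000 vectorized 8x8 patches
    Xc = X - X.mean(axis=0)       # center the data

    # Eigendecomposition of the covariance matrix, sorted by descending variance
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    V = eigvecs[:, order[:40]]    # keep the top 40 "axes"

    X_reduced = Xc @ V            # 5000 x 40 data, ready for FastICA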
Note that if you want to work with these concepts you should do some serious reading. A classic article on what you can do with PCA on image data is Turk and Pentland's Eigenfaces. It also gives some background in an understandable way.
PCA reduces the dimension of the data; ICA then extracts independent components from it, and the number of components must be less than or equal to the data dimension.