Can we have both upsampling and downsampling layers in an Encoder/Decoder? - autoencoder

I am new to autoencoders. All the autoencoders that I have seen usually have a downsampling encoder followed by an upsampling decoder, or an upsampling encoder followed by a downsampling decoder.
Now I want to ask: can we have an encoder that contains both upsampling and downsampling layers simultaneously, followed by a decoder whose layers exactly mirror those of the encoder?
For example, can we have the following autoencoder architecture?
Encoder: 16 neurons - 200 neurons - 400 neurons - 200 neurons - 4 neurons (latent representation) - Decoder: 200 neurons - 400 neurons - 200 neurons - 16 neurons
Is this a valid autoencoder, or is it just a simple tandem neural network?

I would still consider this architecture an autoencoder, provided it is trained as one. There is no formal definition constraining the layer sizes, other than that the input and output must have the same dimensionality. You can also build "overcomplete" autoencoders, where the dimensionality of the latent space is larger than the dimensionality of the input.
As long as you use it as an autoencoder, meaning you train it on input data x to generate an output x' while penalizing something like L = ||x - x'||², the architecture of the layers is pretty much arbitrary.
Whether it makes sense to have both upscaling and downscaling within the encoder and within the decoder is another question.
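For concreteness, here is a minimal sketch of the architecture from the question, written in Keras (the dense layers, ReLU activations, and optimizer are my own assumptions; the question does not specify them):

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(16,))
# Encoder: 16 -> 200 -> 400 -> 200 -> 4 (latent representation)
x = layers.Dense(200, activation="relu")(inputs)
x = layers.Dense(400, activation="relu")(x)
x = layers.Dense(200, activation="relu")(x)
latent = layers.Dense(4, name="latent")(x)
# Decoder: 200 -> 400 -> 200 -> 16, mirroring the encoder
x = layers.Dense(200, activation="relu")(latent)
x = layers.Dense(400, activation="relu")(x)
x = layers.Dense(200, activation="relu")(x)
outputs = layers.Dense(16)(x)

autoencoder = models.Model(inputs, outputs)
# Trained as an autoencoder: the target is the input itself, loss is ||x - x'||^2
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=50, batch_size=32)

As long as it is trained this way, the widening-then-narrowing encoder does not stop it from being an autoencoder.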

Related

Why the normalized mean square error is not changing in a variational autoencoder even after changing the network

The calculated NMSE does not change even after changing the latent dimension from 512 to 32, although the normalized mean square error values should change when the latent dimension changes.
A variational autoencoder differs from a plain autoencoder in that it describes the samples of the dataset in latent space in a statistical manner. Therefore, in a variational autoencoder the encoder outputs a probability distribution in the bottleneck layer instead of a single output value.
Variational autoencoders were originally designed to generate simple synthetic images. Since their introduction, VAEs have been shown to work quite well with images that are more complex than simple 28 x 28 MNIST images. For example, it is possible to use a VAE to generate very realistic-looking images of people.
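As a rough illustration of the encoder outputting a distribution rather than a single point, here is a Keras sketch (the input size, hidden size, and latent dimension are my assumptions, not taken from the question's network):

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 32  # the latent dimension being varied in the question

encoder_inputs = layers.Input(shape=(784,))
h = layers.Dense(256, activation="relu")(encoder_inputs)  # hidden size is arbitrary
z_mean = layers.Dense(latent_dim, name="z_mean")(h)
z_log_var = layers.Dense(latent_dim, name="z_log_var")(h)

def sample(args):
    # Reparameterization trick: z = mean + sigma * epsilon, with epsilon ~ N(0, I)
    z_mean, z_log_var = args
    eps = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

z = layers.Lambda(sample, name="z")([z_mean, z_log_var])
encoder = tf.keras.Model(encoder_inputs, [z_mean, z_log_var, z])

The bottleneck is thus parameterized by a mean and a log-variance per latent dimension, and the latent code is a sample from that Gaussian.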

Why is tanh performing better than ReLU in a simple neural network?

Here is my scenario.
I have used the EMNIST database of capital letters of the English language.
My neural network is as follows:
The input layer has 784 neurons, which are the pixel values of a 28x28 greyscale image divided by 255, so each value is in the range [0,1].
The hidden layer has 49 neurons, fully connected to the previous 784.
The output layer has 9 neurons denoting the class of the image.
The loss function is defined as the cross-entropy of the softmax of the output layer.
All weights are initialized as random real numbers from [-1,+1].
I then trained with 500 fixed samples for each class.
Simply put, I passed 500x9 images to the train function, which uses backpropagation and does 100 iterations, changing each weight by learning_rate*derivative_of_loss_wrt_corresponding_weight.
I found that with tanh activation the network learns faster than with ReLU at a learning rate of 0.0001.
I concluded this because accuracy on a fixed test dataset was higher for tanh than for ReLU. Also, the loss value after 100 epochs was slightly lower for tanh.
Isn't ReLU expected to perform better?
Isn't ReLU expected to perform better?
In general, no. ReLU will perform better on many problems, but not on all problems.
Furthermore, if you use an architecture and set of parameters that is optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.
Often you will need to adjust the architecture and parameters such as the learning rate to get comparable results. This may mean changing the number of hidden nodes and/or the learning rate in your example.
One final note: in the MNIST example architectures I have seen, hidden layers with ReLU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer, as in the sketch below, and see if that improves your results with ReLU. See the Keras MNIST example here.
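A minimal sketch of that suggestion, assuming Keras and reusing the 784-49-9 layout from the question (the dropout rate and the choice to keep plain SGD are my assumptions):

from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(49, activation="relu"),
    layers.Dropout(0.2),   # dropout rate is a guess; values around 0.2-0.5 are common
    layers.Dense(9, activation="softmax"),
])
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4),  # may need retuning for ReLU
              loss="categorical_crossentropy",
              metrics=["accuracy"])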

Is it meaningful to remove layers with 100% sparsity in a CNN - VGG-16

I am training an autoencoder using VGG-16 as a feature extractor. When I check the sparsity (number of parameters equal to zero / total parameters) of the blocks, I notice that Block 5 (the deepest block of VGG-16) has 100% sparsity. As it is the most time-consuming block of the VGG architecture, I would like to remove it if possible.
So, is it meaningful to remove CNN layers with 100% sparsity in order to improve performance, given that all of their parameters are actually zero?
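For reference, a rough sketch of how such a per-layer sparsity check might be computed (assuming Keras and its bundled VGG16; substitute your own trained model):

import numpy as np
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet", include_top=False)  # replace with your trained model
for layer in model.layers:
    weights = layer.get_weights()
    if not weights:
        continue  # skip layers without parameters (input, pooling)
    total = sum(w.size for w in weights)
    zeros = sum(np.count_nonzero(w == 0) for w in weights)
    # sparsity = (number of parameters equal to zero) / (total parameters)
    print(f"{layer.name}: sparsity = {zeros / total:.2%}")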

Convolutional autoencoder not learning meaningful filters

I am playing with TensorFlow to understand convolutional autoencoders. I have implemented a simple single-layer autoencoder which does this:
Input (Dimension: 95x95x1) ---> Encoding (convolution with 32 5x5 filters) ---> Latent representation (Dimension: 95x95x1x32) ---> Decoding (using tied weights) ---> Reconstructed input (Dimension: 95x95x1)
The inputs are black-and-white edge images i.e. the results of edge detection on RGB images.
I initialised the filters randomly and then trained the model to minimise the loss, where the loss is defined as the mean squared error between the input and the reconstructed input.
loss = 0.5*(tf.reduce_mean(tf.square(tf.sub(x,x_reconstructed))))
After training for 1000 steps, my loss converges and the network is able to reconstruct the images well. However, when I visualise the learned filters, they do not look very different from the randomly-initialised filters! The values of the filters do change from training step to training step, though.
Example of learned filters
I would have expected at least horizontal and vertical edge filters. Or, if my network were learning "identity filters", I would have expected the filters to be all white or something like that.
Does anyone have any idea about this? Are there any suggestions as to what I can do to analyse what is happening? Should I include pooling and unpooling layers before decoding?
Thank you!
P.S.: I tried the same model on RGB images and again the filters look random (like random blotches of colour).
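For concreteness, here is a minimal reconstruction of the setup described above (my own TensorFlow 2 sketch, not the asker's code; the sigmoid activation and bias handling are assumptions), showing tied-weight decoding where conv2d_transpose reuses the encoder kernel:

import tensorflow as tf

W = tf.Variable(tf.random.normal([5, 5, 1, 32], stddev=0.1))  # 32 filters of size 5x5
b_enc = tf.Variable(tf.zeros([32]))
b_dec = tf.Variable(tf.zeros([1]))

def autoencode(x):  # x has shape [batch, 95, 95, 1]
    # Encoding: convolution with the shared kernel W
    h = tf.nn.sigmoid(tf.nn.conv2d(x, W, strides=1, padding="SAME") + b_enc)
    # Decoding with tied weights: conv2d_transpose reuses the same kernel W
    x_rec = tf.nn.conv2d_transpose(h, W, output_shape=tf.shape(x),
                                   strides=1, padding="SAME") + b_dec
    return x_rec

def loss_fn(x):
    # 0.5 * mean squared error between the input and its reconstruction
    x_rec = autoencode(x)
    return 0.5 * tf.reduce_mean(tf.square(x - x_rec))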

What do the P letters mean in a neural network layer scheme?

In the Wikipedia article about the MNIST database, it is said that the lowest error rate is achieved by a "committee of 35 convolutional networks" with the scheme:
1-20-P-40-P-150-10
What does this scheme mean?
The numbers are probably neuron counts. But what does the 1 mean, then?
And what do the P letters mean?
In this particular scheme, 'P' means 'pooling' layer.
So, the basic structure is the following:
One grayscale input image
20 images after convolution layer (20 different filters)
Pooling layer
40 outputs from next convolution
Pooling layer
150... can be either 150 small convolution outputs or just a fully-connected layer of 150 neurons
10 output fully-connected neurons
That's why it is 1-20-P-40-P-150-10. It is not the best notation, but it is still pretty clear if you are familiar with CNNs. A Keras-style sketch of this structure is given below.
You can read more details about the internal structure of CNNs in Yann LeCun's foundational article "Gradient-Based Learning Applied to Document Recognition".
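Here is that hedged sketch of the 1-20-P-40-P-150-10 structure in Keras (the kernel sizes, pooling windows, and activations are my assumptions; the scheme itself does not specify them):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),              # 1: one grayscale input image
    layers.Conv2D(20, (5, 5), activation="tanh"), # 20: first convolution, 20 filters
    layers.MaxPooling2D((2, 2)),                  # P: pooling layer
    layers.Conv2D(40, (5, 5), activation="tanh"), # 40: second convolution, 40 filters
    layers.MaxPooling2D((2, 2)),                  # P: pooling layer
    layers.Flatten(),
    layers.Dense(150, activation="tanh"),         # 150: fully-connected layer
    layers.Dense(10, activation="softmax"),       # 10: output neurons, one per digit
])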