Convolutional neural network back propagation - delta calculation in convolutional layer

So I’m trying to make a CNN and so far I think I understand all of the forward propagation and the back propagation in the fully connected layers. However, I’m having some issues with the back prop in the convolutional layers.
Basically, I've written out the dimensions of everything at each stage in a CNN with two convolutional layers and two fully connected layers, with the input having a depth of 1 (as it is black and white) and only one filter being applied at each convolutional layer. I haven't bothered to use pooling at this stage, as to my knowledge it shouldn't have any impact on the calculus, only on where the error is assigned, so the dimensions should still fit as long as I also don't include any unpooling in my backprop. I also haven't written out the dimensions after the application of the activation functions, as they are the same as those of their inputs and I would just be writing the same values twice.
The dimensions, as you will see, vary slightly in format. For the convolutional layers I've written them as though they are images rather than in matrix form, whilst for the fully connected layers I've written them as the sizes of the matrices used (this will hopefully make more sense when you see it).
The issue is that when calculating the delta for the convolutional layers, the dimensions don't fit. What am I doing wrong?
Websites used:
http://cs231n.github.io/convolutional-networks/
http://neuralnetworksanddeeplearning.com/chap2.html#the_cross-entropy_cost_function
http://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/
Calculation of dimensions:
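A minimal NumPy sketch of the usual delta bookkeeping (assuming one input channel, a single 3x3 filter, stride 1 and no padding; the concrete sizes below are made up): the delta passed back to the layer's input is a full convolution of the output delta with the 180-degree-rotated filter, and that flip-plus-full-padding is what makes the dimensions fit again.

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    # Hypothetical shapes: a 5x5 single-channel input, one 3x3 filter,
    # "valid" cross-correlation with stride 1 -> 3x3 output.
    x = np.random.randn(5, 5)            # layer input
    w = np.random.randn(3, 3)            # the single filter
    z = correlate2d(x, w, mode="valid")  # forward pass, shape (3, 3)

    # The delta arriving from the next layer has the *output* shape (3, 3).
    delta_out = np.random.randn(*z.shape)

    # Gradient w.r.t. the filter: valid cross-correlation of input with delta.
    dw = correlate2d(x, delta_out, mode="valid")      # shape (3, 3) == w.shape

    # Delta passed back to the *input*: "full" convolution of the delta with
    # the 180-degree-rotated filter, which restores the 5x5 input shape.
    delta_in = convolve2d(delta_out, w, mode="full")  # shape (5, 5) == x.shape

    print(dw.shape, delta_in.shape)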

Related

Deep Learning on Encrypted Images

Suppose we have a set of images and labels meant for a machine-learning classification task. The problem is that these images come with a relatively short retention policy. While one could train a model online (i.e. update it with new image data every day), I'm ideally interested in a solution that can somehow retain images for training and testing.
To this end, I'm interested in whether there are any known techniques, for example some kind of one-way hashing of images, which obfuscate the image but still allow deep learning techniques to be applied to it.
I'm not an expert on this, but the way I'm thinking about it is as follows: we have an NxN image I (say 1024x1024) with pixel values in P := {0,1,...,255}^3, and a one-way hash map f: P^(NxN) -> S. Then, when we train a convolutional neural network on I, we first map the convolutional filters via f, so that training happens in the high-dimensional space S. I think there's no need for f to be locality-sensitive, in the sense that pixels near each other don't need to map to values in S near each other, as long as we know how to map the convolutional filters into S. Please note that it's imperative that f is not invertible and that the resulting stored image in S is unrecognizable.
One option for f, S is to run a convolutional neural network on I and then extract the representation of I from its fully connected layer. This is not ideal because there's a high chance that such a network won't retain the finer features needed for the classification task, so I think this rules out a CNN or autoencoder for f.
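For reference, a hedged sketch of that baseline option (the backbone choice and feature size are assumptions, not part of the question): store only the penultimate-layer features of a pretrained network instead of the raw pixels.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Assumed choice: a pretrained ResNet-18 with its classification head removed,
    # so f(I) is the 512-dimensional feature that would have fed the fully
    # connected layer.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def f(image: Image.Image) -> torch.Tensor:
        """Map an image to a feature vector that is stored instead of the pixels."""
        with torch.no_grad():
            return backbone(preprocess(image).unsqueeze(0)).squeeze(0)  # shape (512,)

    # The stored vectors can later train a small classifier, but note the caveat
    # above: such features are not provably non-invertible and may drop the finer
    # details the downstream task needs.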

How does upsampling in Fully Connected Convolutional network work?

I have read several posts/articles and have some doubts about the mechanism of upsampling after the CNN downsampling.
I took the 1st answer from this question:
https://www.quora.com/How-do-fully-convolutional-networks-upsample-their-coarse-output
I understood that, similar to the normal convolution operation, the "upsampling" also uses kernels which need to be trained.
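In current frameworks that trainable upsampling is usually a transposed convolution; a minimal PyTorch sketch (the channel counts and sizes are just illustrative):

    import torch
    import torch.nn as nn

    # A coarse feature map: batch of 1, 64 channels, 16x16 spatial size (assumed numbers).
    coarse = torch.randn(1, 64, 16, 16)

    # A transposed convolution with stride 2 doubles the spatial resolution; its
    # kernel weights are learned during training just like an ordinary conv kernel.
    upsample = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                                  kernel_size=4, stride=2, padding=1)

    fine = upsample(coarse)
    print(fine.shape)  # torch.Size([1, 32, 32, 32])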
Question 1: if the "spatial information" is already lost during the first stages of the CNN, how can it be reconstructed in any way?
Question 2: why is it that "Upsampling from a small (coarse) feature map deep in the network has good semantic information but bad resolution. Upsampling from a larger feature map closer to the input will produce better detail but worse semantic information"?
Question #1
Upsampling doesn't (and cannot) reconstruct any lost information. Its role is to bring the resolution back up to the resolution of the previous layer.
Theoretically, we could eliminate the down/up-sampling layers altogether. However, to reduce the number of computations, we can downsample the input before a layer and then upsample its output.
Therefore, the sole purpose of down/up-sampling layers is to reduce the computations in each layer while keeping the dimensions of the input/output as before.
You might argue that down-sampling could cause information loss. That is always a possibility, but remember that the role of a CNN is essentially to extract "useful" information from the input and reduce it into a smaller dimension.
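A small sketch of that trade-off (layer sizes are assumptions): downsampling before an expensive layer and upsampling after it leaves the outer dimensions unchanged while the middle layer processes roughly four times fewer positions.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 32, 64, 64)           # assumed input feature map

    expensive = nn.Conv2d(32, 32, kernel_size=3, padding=1)

    # Variant A: run the expensive conv at the full 64x64 resolution.
    full_res = expensive(x)                  # (1, 32, 64, 64)

    # Variant B: downsample -> conv at 32x32 -> upsample back to 64x64.
    down = nn.MaxPool2d(2)
    up = nn.Upsample(scale_factor=2, mode="nearest")
    reduced = up(expensive(down(x)))         # (1, 32, 64, 64) again

    print(full_res.shape, reduced.shape)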
Question #2
As we go from the input layer in CNN to the output layer, the dimension of data generally decreases while the semantic and extracted information hopefully increases.
Suppose we have a CNN for image classification. In such a CNN, the early layers usually extract basic shapes and edges in the image. The next layers detect more complex concepts such as corners and circles. You can imagine that the very last layers might have nodes that detect very complex features (like the presence of a person in the image).
So up-sampling from a large feature map close to the input produces better detail but has lower semantic information compared to the last layers. Conversely, the last layers generally have lower dimensions, hence their resolution is worse compared to the early layers.
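A toy sketch of that trend (all layer sizes are illustrative assumptions): the spatial resolution halves at every stage while the channel count, i.e. the capacity for semantic features, grows.

    import torch
    import torch.nn as nn

    # Toy classifier backbone; every block halves the resolution and adds channels.
    blocks = nn.ModuleList([
        nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
    ])

    x = torch.randn(1, 3, 128, 128)
    for i, block in enumerate(blocks, start=1):
        x = block(x)
        print(f"after block {i}: {tuple(x.shape)}")
    # after block 1: (1, 16, 64, 64)   better resolution, weaker semantics
    # after block 2: (1, 32, 32, 32)
    # after block 3: (1, 64, 16, 16)   worse resolution, richer semantics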

Effect of adding/removing layers in CNN

I wanted to know how adding or removing layers in a convolutional neural network affects the results produced. My problem specifically deals with detection of lips, something like this https://img-s3.onedio.com/id-57dc3cd17669c0cf0e4e2705/rev-0/raw/s-4909b77cbbb875d94fea9406a207936f767dc5de.jpg. (I use this CNN in a DCGAN.)
I have an average GPU, which is taking a lot of time to train the model. I currently have 4 layers in my CNN and wanted to know how removing one layer might affect the results and the training time of the model. I understand that a lot more detail is needed to answer this question properly; I am just looking for a broad answer, primarily focusing on what role a single layer plays and how reducing the number of layers can affect the model.
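One rough way to gauge part of the cost (the channel sizes below are assumptions, not taken from the question) is to compare the parameter counts and output shapes of a 3- versus 4-layer DCGAN-style convolutional stack:

    import torch
    import torch.nn as nn

    def conv_stack(num_layers: int) -> nn.Sequential:
        """Build a toy DCGAN-style stack; channel sizes are assumptions."""
        layers, channels = [], 3
        for i in range(num_layers):
            out = 64 * (2 ** i)
            layers += [nn.Conv2d(channels, out, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            channels = out
        return nn.Sequential(*layers)

    for depth in (3, 4):
        net = conv_stack(depth)
        params = sum(p.numel() for p in net.parameters())
        out = net(torch.randn(1, 3, 64, 64))
        print(f"{depth} layers: {params:,} parameters, output {tuple(out.shape)}")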

Why are inputs for convolutional neural networks always squared images?

I have been doing deep learning with CNNs for a while and I have noticed that the inputs to a model are always square images.
I see that neither the convolution operation nor the neural network architecture itself requires such a property.
So, what is the reason for that?
Because square images are pleasing to the eye. But there are applications on non-square images when the domain requires it. For instance, the original SVHN dataset consists of images of several digits, and hence rectangular images are used as input to the convnet, as here.
From Suhas Pillai:
The problem is not with the convolutional layers, it's the fully connected layers of the network, which require a fixed number of neurons. For example, take a small 3-layer network + softmax layer. Suppose the first 2 layers are convolutional + max pooling, the dimensions are the same before and after convolution, and pooling halves the dimensions, which is usually the case. For an image of 3*32*32 (C,W,H) with 4 filters in the first layer and 6 filters in the second layer, the output after convolution + max pooling at the end of the 2nd layer will be 6*8*8, whereas for an image of 3*64*64 the output at the end of the 2nd layer will be 6*16*16. Before the fully connected part, we stretch this into a single vector (6*8*8 = 384 neurons) and do a fully connected operation. So you cannot have fully connected layers of different dimensions for different image sizes. One way to tackle this is spatial pyramid pooling, where you force the output of the last convolutional layer to be pooled into a fixed number of bins (i.e. neurons) so that the fully connected layer always has the same number of neurons. You can also check fully convolutional networks, which can take non-square images.
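The arithmetic in that quote can be reproduced directly; the sketch below (filter counts taken from the quote, everything else assumed) also shows the adaptive-pooling trick, a single-level version of spatial pyramid pooling, that makes the flattened size independent of the input size.

    import torch
    import torch.nn as nn

    # Two conv + max-pool stages as in the quote: 4 filters, then 6 filters,
    # 'same' convolutions, each pooling step halving the spatial size.
    features = nn.Sequential(
        nn.Conv2d(3, 4, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(4, 6, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    )

    for size in (32, 64):
        out = features(torch.randn(1, 3, size, size))
        print(size, tuple(out.shape), "->", out.flatten(1).shape[1], "neurons")
    # 32 -> (1, 6, 8, 8)   -> 384 neurons
    # 64 -> (1, 6, 16, 16) -> 1536 neurons   (mismatch for a fixed fc layer)

    # One fix: pool the last feature map into a fixed number of bins,
    # so the fully connected layer always sees the same number of inputs.
    to_bins = nn.AdaptiveMaxPool2d((4, 4))
    for size in (32, 64):
        out = to_bins(features(torch.randn(1, 3, size, size)))
        print(size, "->", out.flatten(1).shape[1], "neurons")  # always 6*4*4 = 96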
It is not necessary to have square images. I see two "reasons" for it:
scaling: if images are automatically scaled from another aspect ratio (and landscape/portrait mode), this on average might introduce the least error
publications / visualizations: square images are easy to display together

How does a neural network work with correlated image data

I am new to TensorFlow and deep learning. I am trying to create a fully connected neural network for image processing. I am somewhat confused.
We have an image, say 28x28 pixels. This gives 784 inputs to the NN. For non-correlated inputs this is fine, but image pixels are generally correlated. For instance, consider a picture of a cow's eye. How can a neural network understand this when we have all the pixels lined up in an array for a fully-connected network? How does it determine the correlation?
Please research some tutorials on CNNs (Convolutional Neural Networks); here is a starting point for you. A fully connected layer of an NN surrenders all of the correlation information it might have had with the input. Structurally, it implements the assumption that the inputs are statistically independent.
Alternatively, a convolutional layer depends on the physical organization of the inputs (such as pixel adjacency), using that to find simple combinations (convolutions) of features from one layer to the next.
Bottom line: your NN doesn't find the correlation; the topology is wrong and cannot do the job you want.
Also, please note that a layered network consisting of fully-connected neurons with linear weight combinations is not deep learning. Deep learning has at least one hidden layer, a topology which fosters "understanding" of intermediate structures. A purely linear, fully-connected layering provides no such hidden layers: even if you program hidden layers, without non-linearities the outputs remain a simple linear combination of the inputs.
Deep learning requires some other form of discrimination, such as convolutions, pooling, rectification, or other non-linear combinations.
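A minimal contrast of the two layer types (shapes are illustrative): the fully connected layer treats the 784 pixels as an unordered vector, while the convolutional layer only ever combines neighbouring pixels through a small shared kernel.

    import torch
    import torch.nn as nn

    image = torch.randn(1, 1, 28, 28)   # one 28x28 grayscale image

    # Fully connected: the image is flattened to 784 independent inputs, so any
    # notion of which pixels were adjacent is discarded.
    fc = nn.Linear(28 * 28, 128)
    fc_out = fc(image.flatten(1))       # (1, 128)

    # Convolutional: a 3x3 kernel slides over the image, so every output value
    # is computed only from a local neighbourhood of pixels.
    conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
    conv_out = conv(image)              # (1, 8, 28, 28)

    print(sum(p.numel() for p in fc.parameters()),    # 100480 parameters
          sum(p.numel() for p in conv.parameters()))  # 80 parameters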
Let's break it into pieces to understand the intuition behind how an NN learns to predict.
To predict the class of a given image, we have to find a correlation or direct link between its input values and the class. We could hope that a single pixel tells us that the image belongs to a given class, which is impossible, so what we have to do is build up more complex functions, or let's call them complex features, which help us generate data correlated with the wanted class.
To make it simpler, imagine you want to build an AND function (p and q) or an OR function (p or q). In both cases there is a direct link between the input and the output: in the AND function, if there is a 0 in the input, the output is always zero. But what if we want the XOR function (p xor q)? There is no direct link between the input and the output. The answer is to build a first layer that classifies AND and OR, and then a second layer that takes the results of the first layer and builds the function, classifying XOR:
(p xor q) = (p or q) and not (p and q)
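That decomposition can be checked with a tiny two-layer network whose weights are set by hand (a sketch, not a trained model): the hidden layer computes OR and AND, and the output layer combines them exactly as in the formula.

    import numpy as np

    def step(x):
        """Hard-threshold activation: 1 where x > 0, else 0."""
        return (x > 0).astype(int)

    # Hidden layer: first unit fires for (p OR q), second for (p AND q).
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])   # OR threshold, AND threshold

    # Output layer: OR minus AND, i.e. (p or q) and not (p and q) = XOR.
    W2 = np.array([1.0, -1.0])
    b2 = -0.5

    for p, q in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        hidden = step(np.array([p, q]) @ W1 + b1)
        out = step(hidden @ W2 + b2)
        print(p, q, "->", out)    # prints 0, 1, 1, 0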
By applying this method to a multi-layer NN you'll get the same result, but then you'll have to deal with a huge number of parameters. One solution to avoid this is to extract representative, high-variance features that are uncorrelated between images but correlated with their class, and feed those to the network. You can look up image feature extraction on the web.
This is a small explanation of how to see the link between images and their classes and how an NN works to classify them. You need to understand the NN concept first, and then you can go on to read about deep learning.