How does upsampling in a Fully Convolutional Network work? - neural-network

I have read several posts and articles and still have some doubts about the mechanism of upsampling after CNN downsampling.
I took the first answer from this question:
https://www.quora.com/How-do-fully-convolutional-networks-upsample-their-coarse-output
I understood that, similar to a normal convolution operation, the "upsampling" also uses kernels which need to be trained.
Question 1: If the "spatial information" is already lost during the first stages of the CNN, how can it be reconstructed in any way?
Question 2: Why is it that "Upsampling from a small (coarse) feature map deep in the network has good semantic information but bad resolution. Upsampling from a larger feature map closer to the input will produce better detail but worse semantic information"?

Question #1
Upsampling doesn't (and cannot) reconstruct any lost information. Its role is to bring the resolution back to that of an earlier layer.
Theoretically, we could eliminate the down/upsampling layers altogether. However, to reduce the number of computations, we can downsample the input before a layer and then upsample its output.
Therefore, the sole purpose of the down/upsampling layers is to reduce the computation in each layer while keeping the dimensions of the input/output as before.
You might argue that downsampling can cause information loss. That is always a possibility, but remember that the role of a CNN is essentially to extract "useful" information from the input and reduce it to a smaller dimension.
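A minimal NumPy sketch of this point. It assumes 2x2 max pooling for the downsampling and fixed nearest-neighbour repetition for the upsampling (rather than the trained kernels mentioned in the question); the helper names are hypothetical. The resolution comes back, but the original values do not:

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample a 2-D array: max over non-overlapping 2x2 blocks."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_nearest_2x(x):
    """Upsample a 2-D array: repeat every value over a 2x2 block."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
down = max_pool_2x2(x)          # shape (2, 2)
up = upsample_nearest_2x(down)  # shape (4, 4) again

print(x.shape, down.shape, up.shape)
print(np.array_equal(x, up))    # False: the discarded values are not recovered
```

A learned upsampling kernel changes how the missing pixels are filled in, but it still only has the coarse map to work from.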
Question #2
As we go from the input layer of a CNN to the output layer, the dimension of the data generally decreases while the semantic, extracted information hopefully increases.
Suppose we have a CNN for image classification. In such a CNN, the early layers usually extract the basic shapes and edges in the image. The next layers detect more complex concepts like corners and circles. You can imagine that the very last layers might have nodes that detect very complex features (like the presence of a person in the image).
So upsampling from a large feature map close to the input produces better detail but carries less semantic information than the last layers. Conversely, the last layers generally have lower dimension, hence their resolution is worse than that of the early layers.

Related

Deep Learning on Encrypted Images

Suppose we have a set of images and labels meant for a machine-learning classification task. The problem is that these images come with a relatively short retention policy. While one could train a model online (i.e. update it with new image data every day), I'm ideally interested in a solution that can somehow retain images for training and testing.
To this end, I'm interested if there are any known techniques, for example some kind of one-way hashing on images, which obfuscates the image, but still allows for deep learning techniques on it.
I'm not an expert on this, but the way I'm thinking about it is as follows: we have an NxN image I (say 1024x1024) with pixel values in P:={0,1,...,255}^3, and a one-way hash map f: P^(NxN) -> S. Then, when we train a convolutional neural network on I, we first map the convolutional filters via f, and then train in a high-dimensional space S. I think there's no need for f to be locality-sensitive, in that pixels near each other don't need to map to values in S near each other, as long as we know how to map the convolutional filters to S. Please note that it's imperative that f is not invertible, and that the resulting stored image in S is unrecognizable.
One option for f and S is to use a convolutional neural network on I and extract the representation of I from its fully connected layer. This is not ideal because there's a high chance that this network won't retain the finer features needed for the classification task. So I think this rules out a CNN or autoencoder for f.

Convolutional neural network back propagation - delta calculation in convolutional layer

So I’m trying to make a CNN, and so far I think I understand all of the forward propagation and the back propagation in the fully connected layers. However, I’m having some issues with the backprop in the convolutional layers.
Basically, I’ve written out the dimensions of everything at each stage in a CNN with two convolutional layers and two fully connected layers, with the input having a depth of 1 (as it is black and white) and only one filter being applied at each convolutional layer. I haven’t bothered to use pooling at this stage, as to my knowledge it shouldn’t have any impact on the calculus, just on where it is assigned, so the dimensions should still fit as long as I also don’t include any unpooling in my backprop. I also haven’t bothered to write out the dimensions after the application of the activation functions, as they would be the same as those of their input and I would be writing the same values twice.
The dimensions, as you will see, vary slightly in format. For the convolutional layers I’ve written them as though they are images, rather than in matrix form, whilst for the fully connected layers I’ve written the dimensions as the sizes of the matrices used (this will hopefully make more sense when you see it).
The issue is that when calculating the delta for the convolutional layers, the dimensions don’t fit. What am I doing wrong?
Websites used:
http://cs231n.github.io/convolutional-networks/
http://neuralnetworksanddeeplearning.com/chap2.html#the_cross-entropy_cost_function
http://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/
Calculation of dimensions:

Max-pooling vs. zero padding: Losing spatial information

When it comes to convolutional neural networks, there are many papers recommending different strategies. I have heard people say that it is an absolute must to add padding to the images before a convolution, otherwise too much spatial information is lost. On the other hand, they are happy to use pooling, normally max-pooling, to reduce the size of the images. I guess the thought here is that max pooling reduces the spatial information but also reduces the sensitivity to relative positions, so it is a trade-off?
I have heard other people say that zero-padding does not keep more information, just more empty data. This is because, by adding zeros, you will not get a reaction from your kernel anyway when part of the information is missing.
I can imagine that zero-padding works if you have big kernels with "scrap values" at the edges and the source of activation centered in a smaller region of the kernel?
I would be happy to read some papers about the effect of down-sampling using pooling versus not using padding, but I can't find much about it. Any good recommendations or thoughts?
Figure: Spatial down-sampling using convolution contra pooling (Researchgate)
Adding padding is NOT an "absolute must". Sometimes it can be useful to control the size of the output so that it is not reduced by the convolution (it can also enlarge the output, depending on the padding size and kernel size). The only information that zero padding adds is the border (or near-border) condition of the features, i.e. which pixels lie at the limits of the input, again depending on kernel size. (You can think of it as a "passe-partout" in a picture frame.)
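The effect of padding on output size follows the standard convolution arithmetic, out = floor((n + 2p - k) / s) + 1. A small sketch (the function name and the example sizes are chosen only for illustration):

```python
def conv_output_size(n, k, p=0, s=1):
    """Spatial output size for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(32, 3))        # 30: no padding, the output shrinks
print(conv_output_size(32, 3, p=1))   # 32: padding keeps the size
print(conv_output_size(32, 3, p=2))   # 34: more padding enlarges the output
```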
Pooling is of MUCH MORE IMPORTANCE in convnets. Pooling is not exactly "down-sampling" or "losing spatial information". Consider first that the kernel calculations have been made prior to pooling, with full spatial information. Pooling reduces the dimension but (hopefully) keeps the information learnt by the kernels beforehand. By doing so, it achieves one of the most interesting properties of convnets: robustness to displacement, rotation or distortion of the input. A feature, once learnt, is detected even if it appears in another location or with distortions. Pooling also implies learning at increasing scale, discovering (again, hopefully) hierarchical patterns at different scales. And of course, as is also necessary in convnets, pooling keeps computation feasible as the number of layers grows.
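A toy illustration of the robustness to displacement, assuming 2x2 max pooling on a feature map with a single activation (the helper name is hypothetical). The pooled output is unchanged only while the activation stays inside the same pooling window, so the robustness from a single pooling layer is limited to small shifts:

```python
import numpy as np

def max_pool_2x2(x):
    """Max over non-overlapping 2x2 blocks of a 2-D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # activation in the top-left window
b = np.zeros((4, 4)); b[1, 1] = 1.0   # shifted by one pixel, same window
c = np.zeros((4, 4)); c[2, 2] = 1.0   # shifted into a different window

print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(c)))  # False
```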
I have puzzled over this question for a while too, and I have also seen some papers mention the same issue. Here is a recent paper I found: Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation. I have not fully read the paper, but it seems to touch on your question. I can update this answer as soon as I fully grasp the paper.

Convolutional Neural Network for time-dependent features

I need to do dimensionality reduction on a series of images. More specifically, each image is a snapshot of a moving ball, and the optimal features would be its position and velocity. As far as I know, CNNs are the state of the art for reducing the features for image classification, but in that case only a single frame is provided. Is it possible to also extract time-dependent features given many images at different time steps? Otherwise, what are the state-of-the-art techniques for doing so?
It's the first time I have used a CNN, and I would also appreciate any reference or any other suggestion.
If you want to be able to have the network somehow recognize a progression which is time dependent, you should probably look into recurrent neural nets (RNN). Since you would be operating on video, you should look into recurrent convolutional neural nets (RCNN) such as in: http://jmlr.org/proceedings/papers/v32/pinheiro14.pdf
Recurrence adds some memory of a previous state of the input data. See this good explanation by Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
In your case you need to have the recurrence across multiple images instead of just within one image. It would seem like the first problem you need to solve is the image segmentation problem (being able to pick the ball out of the rest of the image) and the first paper linked above deals with segmentation. (then again, maybe you're trying to take advantage of the movement in order to identify the moving object?)
Here's another thought: perhaps you could only look at differences between sequential frames and use that as your input data to your convnet? The input "image" would then show where the moving object was in the previous frame and where it is in the current one. Larger differences would indicate larger amounts of movement. That would probably have a similar effect to using a recurrent network.
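A minimal sketch of the frame-differencing idea, assuming grayscale frames stacked in an array of shape (T, H, W); the function name and the toy "ball" sequence are made up for illustration:

```python
import numpy as np

def frame_differences(frames):
    """Absolute difference between consecutive frames.

    frames: array of shape (T, H, W); the result has shape (T - 1, H, W).
    """
    frames = np.asarray(frames, dtype=float)
    return np.abs(frames[1:] - frames[:-1])

# Toy sequence: one bright "ball" pixel moving one step right per frame.
frames = np.zeros((3, 5, 5))
for t in range(3):
    frames[t, 2, t + 1] = 1.0

diffs = frame_differences(frames)
print(diffs.shape)                 # (2, 5, 5)
print(np.argwhere(diffs[0] > 0))   # previous and current ball positions
```

Each difference image could then be fed to the convnet in place of (or stacked with) the raw frame. Note that the absolute difference drops the direction of motion; keeping the signed difference would preserve it.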

How do neural networks handle large images where the area of interest is small?

If I've understood correctly, when training neural networks to recognize objects in images it's common to map a single pixel to a single input layer node. However, sometimes we might have a large picture with only a small area of interest. For example, if we're training a neural net to recognize traffic signs, we might have images where the traffic sign covers only a small portion of the image, while the rest is taken up by the road, trees, sky etc. Creating a neural net which tries to find a traffic sign at every position seems extremely expensive.
My question is: are there any specific strategies to handle this sort of situation with neural networks, apart from preprocessing the image?
Thanks.
Using 1 pixel per input node is usually not done. What enters your network is the feature vector, and as such you should input actual features, not raw data. Inputting raw data (with all its noise) will not only lead to bad classification, but training will also take longer than necessary.
In short: preprocessing is unavoidable. You need a more abstract representation of your data. There are hundreds of ways to deal with the problem you're asking about. Let me give you some popular approaches.
1) Image processing to find regions of interest. When detecting traffic signs, a common strategy is to use edge detection (i.e. convolution with some filter), apply some heuristics, use a threshold filter and isolate regions of interest (blobs, strongly connected components etc.) which are taken as input to the network.
2) Applying features without any prior knowledge or image processing. Viola/Jones use a specific image representation, from which they can compute features in a very fast way. Their framework has been shown to work in real-time. (I know their original work doesn't state NNs but I applied their features to Multilayer Perceptrons in my thesis, so you can use it with any classifier, really.)
3) Deep Learning.
Learning better representations of the data can be incorporated into the neural network itself. These approaches are among the most actively researched at the moment. Since this is a very large topic, I can only give you some keywords so that you can research it on your own. Autoencoders are networks that learn efficient representations. It is possible to use them with conventional ANNs. Convolutional Neural Networks seem a bit sophisticated at first sight, but they are worth checking out. Before the actual classification stage of the network, they have alternating layers of subwindow convolution (edge detection) and resampling. CNNs are currently able to achieve some of the best results in OCR.
In every scenario you have to ask yourself: Am I 1) giving my ANN a representation that has all the data it needs to do the job (a representation that is not too abstract) and 2) keeping enough noise out (and thus staying abstract enough)?
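The image representation mentioned in approach 2) is commonly known as the integral image (summed-area table). A minimal NumPy sketch of why it makes rectangle features fast, with hypothetical helper names; it covers only the box-sum building block, not the full Viola/Jones detector:

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended.

    ii[r, c] holds the sum of img[:r, :c].
    """
    ii = np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum())  # True
```

Once the table is built, any rectangle sum costs four lookups regardless of the rectangle's size, which is what allows many rectangle-based features to be evaluated per window.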
We usually don't use a fully connected network to deal with images, because the number of units in the input layer would be huge. Among neural networks, there is a specific type designed for images: the convolutional neural network (CNN).
However, a CNN plays the role of a feature extractor. The encoded features are finally fed into a fully connected network which acts as a classifier. In your case, I don't know how small your object is compared to the full image. But if the object of interest is really small, then even with a CNN the performance for image classification won't be very good. Then we probably need to use object detection (which uses a sliding window) to deal with it.
If you want to recognize small objects in a large image, you should use a "scanning window".
Within the "scanning window" you can apply dimensionality reduction methods:
DCT (http://en.wikipedia.org/wiki/Discrete_cosine_transform)
PCA (http://en.wikipedia.org/wiki/Principal_component_analysis)
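A rough sketch of a scanning window followed by PCA, using only NumPy. The window size, stride, number of components and the random test image are arbitrary assumptions for illustration, and PCA is computed here via an SVD of the centered patches:

```python
import numpy as np

def scanning_windows(image, size, stride):
    """Yield (top, left, patch) for every size x size window at the given stride."""
    h, w = image.shape[:2]
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield top, left, image[top:top + size, left:left + size]

image = np.random.default_rng(0).random((64, 96))   # stand-in for a real image
patches = [p for _, _, p in scanning_windows(image, size=32, stride=16)]
print(len(patches))                                  # 3 rows * 5 columns = 15

# PCA: flatten each patch, center the data, project onto the top-k directions.
X = np.stack([p.ravel() for p in patches])           # shape (15, 1024)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 8                                                # hypothetical number of components
reduced = Xc @ Vt[:k].T                              # shape (15, 8)
print(reduced.shape)
```

In practice the PCA basis would be fitted once on training patches and reused at test time; the reduced vectors are what would be passed to the classifier.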