CNNs: First Convolution Operation - neural-network

I have not fully understood one aspect of CNNs: is the first hidden layer the same thing as the convolved image?
I mean, can we talk about the first hidden layer and the first convolution operation in the same way? Are they two ways of expressing the same thing?

Related

create deep network in matlab with logsig layer instead of softmax layer

I want to create a deep classification net, but my classes aren't mutually exclusive (which is what softmaxLayer assumes).
Is it possible to define a non-mutually-exclusive classification layer (i.e., a sample can belong to more than one class)?
One way to do it would be with a logsig function in the classification layer instead of a softmax, but I have no idea how to accomplish that.
In a CNN you can have multiple classes in the last layer, as you know. But if I understand correctly, you need the last layer to output values in a range of numbers rather than a hard 1 or 0 for each class. That means you need regression. If your labels support this task, it's fine and you can do it with regression, just like what happens in bounding-box regression for localization. You don't need a softmax in the last layer; just use another activation function that produces suitable output for your task.
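For what it's worth, here is a minimal sketch of the logsig idea from the question, written in Keras rather than MATLAB (the class count and layer sizes are hypothetical): a sigmoid (logsig-style) output with a binary cross-entropy loss treats each class as an independent yes/no decision, so a sample can belong to several classes at once.

```python
import tensorflow as tf

num_classes = 5  # hypothetical number of non-exclusive classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),             # hypothetical image size
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    # sigmoid instead of softmax: each class gets its own independent probability
    tf.keras.layers.Dense(num_classes, activation='sigmoid'),
])

# binary cross-entropy scores each class independently, so a label vector
# may contain several 1s per sample (multi-label classification)
model.compile(optimizer='adam', loss='binary_crossentropy')
```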

Why is the convolutional filter flipped in convolutional neural networks? [closed]

I don't understand why filters need to be flipped when using convolutional neural networks.
According to the lasagne documentation,
flip_filters : bool (default: True)
Whether to flip the filters before sliding them over the input,
performing a convolution (this is the default), or not to flip them
and perform a correlation. Note that for some other convolutional
layers in Lasagne, flipping incurs an overhead and is disabled by
default – check the documentation when using learned weights from
another layer.
What does that mean? I never read about flipping filters when convolving in any neural network book. Would someone clarify, please?
The underlying reason for flipping a convolutional filter is the definition of the convolution operation itself, which comes from signal processing. When performing a convolution, you want the kernel to be flipped with respect to the axis along which you slide it, because otherwise you are computing a correlation rather than a convolution. It's a bit easier to understand if you think about applying a 1D convolution to a time series in which the function in question changes very sharply: without the flip, the filter's response ends up skewed with respect to, and correlated with, your signal.
This answer from the digital signal processing stack exchange site gives an excellent explanation that walks through the mathematics of why convolutional filters are defined to go in the reverse direction of the signal.
This page walks through a detailed example where the flip is done. This is a particular type of filter used for edge detection called a Sobel filter. It doesn't explain why the flip is done, but is nice because it gives you a worked-out example in 2D.
I mentioned that it is a bit easier to understand the why (as in, why convolution is defined this way) in the 1D case (the answer from the DSP SE site really is a great explanation), but this convention applies to 2D and 3D as well (the Conv2DDNN and Conv3DDNN layers both have the flip_filters option). Ultimately, however, because the convolutional filter weights are not programmed by a human but are "learned" by the network, the choice is entirely arbitrary, unless you are loading weights from another network, in which case you must be consistent with that network's definition of convolution. If convolution was defined in the conventional way, the filters will be flipped; if it was defined in the more "naive" and "lazy" way, they will not.
The broader field that convolutions are a part of is "linear systems theory" so searching for this term might turn up more about this, albeit outside the context of neural networks.
Note that the convolution/correlation distinction is also mentioned in the docstrings of the corrmm.py class in lasagne:
flip_filters : bool (default: False)
Whether to flip the filters and perform a convolution, or not to flip
them and perform a correlation. Flipping adds a bit of overhead, so it
is disabled by default. In most cases this does not make a difference
anyway because the filters are learnt. However, flip_filters should
be set to True if weights are loaded into it that were learnt using
a regular :class:lasagne.layers.Conv2DLayer, for example.
I never read about flipping filters when convolving in any neural network book.
You can try a simple experiment. Take an image whose center pixel has value 1 and all other pixels have value 0. Now take any filter smaller than the image (say, a 3-by-3 filter with values 1 through 9). Do a simple correlation instead of a convolution: you end up with the flipped filter as the output of the operation.
Now flip the filter yourself and then do the same operation. You obviously end up with the original filter as the output.
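A minimal sketch of that experiment, assuming SciPy is available (scipy.signal.correlate2d / convolve2d):

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

# Image with a single 1 at the center pixel, zeros elsewhere.
image = np.zeros((5, 5))
image[2, 2] = 1

# A 3x3 filter with values 1..9.
kernel = np.arange(1, 10).reshape(3, 3)

# Correlation with the delta image returns the flipped filter...
print(correlate2d(image, kernel, mode='valid'))
# [[9. 8. 7.]
#  [6. 5. 4.]
#  [3. 2. 1.]]

# ...while convolution (which flips the filter internally) returns the original.
print(convolve2d(image, kernel, mode='valid'))
# [[1. 2. 3.]
#  [4. 5. 6.]
#  [7. 8. 9.]]
```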
The second operation somehow seems neater. It is like multiplying by 1 and getting the same value back. However, the first one is not necessarily wrong. It works most of the time even though it may not have nice mathematical properties; after all, why would the program care whether the operation is associative or not? It just does the job it is told to do. Moreover, the filter could be symmetric: flipping it returns the same filter, so the correlation and convolution operations return the same output.
Is there a case where these mathematical properties help? Sure there is! If (a*b)*c were not equal to a*(b*c), I wouldn't be able to combine two filters and then apply the result to an image. To clarify, imagine I have two filters a, b and an image c. With correlation, I would have to first apply b to the image c and then a to the result. With convolution, I can compute a*b first and then apply that single combined filter to the image c. If I have a million images to process, the efficiency gained by combining the filters a and b becomes obvious.
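A quick numerical check of that associativity claim, again assuming SciPy (the random filters a, b and "image" c are just placeholders):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 3))   # filter a
b = rng.normal(size=(3, 3))   # filter b
c = rng.normal(size=(8, 8))   # "image" c

# Convolution is associative: combining the filters first gives the same result.
combined_first = convolve2d(c, convolve2d(a, b), mode='full')
one_at_a_time  = convolve2d(convolve2d(c, b, mode='full'), a, mode='full')
print(np.allclose(combined_first, one_at_a_time))    # True

# Correlation is not: the same trick fails in general.
corr_combined   = correlate2d(c, correlate2d(a, b), mode='full')
corr_sequential = correlate2d(correlate2d(c, b, mode='full'), a, mode='full')
print(np.allclose(corr_combined, corr_sequential))   # False
```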
Every single mathematical property that convolution satisfies gives certain benefits, and hence if we have a choice (and we certainly do) we should prefer convolution to correlation. The only difference between them is that in convolution we flip the filter before doing the multiplication, while in correlation we do the multiplication directly.
Applying convolution satisfies the mathematician inside all of us and also gives us some tangible benefits.
Though nowadays feature engineering for images is done end to end by deep learning itself and we need not even bother with it, other traditional image operations may still need these kinds of properties.
Firstly, since CNN filters are learned from scratch rather than designed by a human, if the flip operation were necessary the learned filters would simply come out flipped, and cross-correlation with those flipped filters is what gets implemented.
Secondly, flipping is necessary in 1D time-series processing because past inputs affect the current system output given the "current" input. But in 2D/3D spatial convolution over images there is no concept of "time", hence no "past" input and no influence of the past on "now". We therefore don't need to think about the relationship between a "signal" and a "system"; there is only the relationship between a "signal" (image patch) and a "signal" (image patch), which means we only need cross-correlation instead of convolution (even though deep learning borrowed the concept from signal processing).
Therefore, the flip operation is actually not needed.
(I guess.)

Why are inputs for convolutional neural networks always squared images?

I have been doing deep learning with CNNs for a while and I have realized that the inputs to a model are always square images.
I see that neither the convolution operation nor the neural network architecture itself requires such a property.
So, what is the reason for that?
Because square images are pleasing to the eye. But there are applications on non-square images when the domain requires it. For instance, the original SVHN dataset contains images of several digits, and hence rectangular images are used as input to the convnet, as here.
From Suhas Pillai:
The problem is not with the convolutional layers; it's the fully connected layers of the network, which require a fixed number of neurons. For example, take a small 3-layer network + softmax layer. If the first 2 layers are convolution + max pooling, assume the dimensions are the same before and after convolution and that pooling halves each spatial dimension, which is usually the case. For a 3*32*32 (C, W, H) image with 4 filters in the first layer and 6 filters in the second layer, the output after convolution + max pooling at the end of the 2nd layer will be 6*8*8, whereas for a 3*64*64 image the output at the end of the 2nd layer will be 6*16*16. Before the fully connected part, we stretch this into a single vector (6*8*8 = 384 neurons) and do a fully connected operation. So you cannot have fully connected layers of different dimensions for images of different sizes. One way to tackle this is spatial pyramid pooling, where you force the output of the last convolutional layer to be pooled into a fixed number of bins (i.e., neurons), so that the fully connected layer always has the same number of neurons. You can also check out fully convolutional networks, which can take non-square images.
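A minimal Keras sketch of the pooling workaround described above, using global average pooling (a simpler cousin of spatial pyramid pooling; the filter counts and image sizes below are just placeholders): because the pooling step collapses whatever spatial size reaches it into one value per channel, the fully connected part always sees the same number of neurons.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    # None for height and width: the convolutions don't care about spatial size
    tf.keras.layers.Input(shape=(None, None, 3)),
    tf.keras.layers.Conv2D(4, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(6, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    # collapses any H x W to a fixed 6-dimensional vector (one value per filter)
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# A 32x32 image and a non-square 48x80 image both go through the same model.
print(model(np.zeros((1, 32, 32, 3), dtype='float32')).shape)  # (1, 10)
print(model(np.zeros((1, 48, 80, 3), dtype='float32')).shape)  # (1, 10)
```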
It is not necessary to have square images. I see two "reasons" for it:
scaling: if images are automatically rescaled from another aspect ratio (and landscape/portrait orientation), a square on average might introduce the least error
publications / visualizations: square images are easy to display together

How does a neural network work with correlated image data

I am new to TensorFlow and deep learning. I am trying to create a fully connected neural network for image processing, and I am somewhat confused.
We have an image, say 28x28 pixels. That gives 784 inputs to the NN. For uncorrelated inputs this is fine, but image pixels are generally correlated. For instance, consider a picture of a cow's eye. How can a neural network understand this when all the pixels are lined up in a flat array for a fully connected network? How does it determine the correlation?
Please look up some tutorials on CNNs (Convolutional Neural Networks); here is a starting point for you. A fully connected layer of a NN surrenders all of the correlation information it might have had about the input. Structurally, it implements the assumption that the inputs are statistically independent.
A convolutional layer, by contrast, depends on the physical organization of the inputs (such as pixel adjacency), using that to find simple combinations (convolutions) of features from one layer to the next.
Bottom line: your NN doesn't find the correlation: the topology is wrong, and cannot do the job you want.
Also, please note that a layered network consisting of fully connected neurons with linear weight combinations is not deep learning. Deep learning has at least one hidden layer, a topology which fosters "understanding" of intermediate structures. A purely linear, fully connected layering provides no such hidden layers in any meaningful sense: even if you program hidden layers, the outputs remain a simple linear combination of the inputs.
Deep learning requires some other discrimination, such as convolutions, pooling, rectification, or other non-linear combinations.
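A small numpy illustration of that last point (the shapes here are arbitrary): stacking two purely linear layers is exactly equivalent to a single linear layer whose weights are the product of the two matrices, so nothing is gained without a non-linearity in between.

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=4)          # toy input vector
W1 = rng.normal(size=(8, 4))     # "hidden" layer weights, no activation function
W2 = rng.normal(size=(3, 8))     # output layer weights

stacked   = W2 @ (W1 @ x)        # two linear layers applied in sequence
collapsed = (W2 @ W1) @ x        # one layer with the combined weight matrix

print(np.allclose(stacked, collapsed))  # True: the "hidden" layer added nothing
```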
Let's break it into pieces to understand the intuition behind how a NN learns to predict.
To predict the class of a given image, we have to find a correlation or direct link between some of its input values and the class. We could try to find one pixel that tells us which class the image belongs to, which is impossible, so what we have to do is build up a more complex function, or let's say more complex features, which will help us generate data that is correlated with the desired class.
To make it simpler, imagine you want to build an AND function (p and q) or an OR function (p or q): in both cases there is a direct link between the input and the output. In the AND function, if there is a 0 in the input, the output is always 0. So what if we want an XOR function (p xor q)? There is no direct link between the input and the output. The answer is to build a first layer that classifies AND and OR, and then a second layer that takes the result of the first layer; with that we can build the function and classify XOR:
(p xor q) = (p or q) and not (p and q)
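Here is a hand-weighted sketch of that construction in numpy (the weights are picked by hand for illustration, not learned): the hidden layer computes OR and AND, and the output layer combines them into XOR.

```python
import numpy as np

def step(x):
    # Heaviside step activation: 1 where x > 0, else 0
    return (x > 0).astype(int)

# All four input combinations of (p, q)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 computes OR(p, q), column 1 computes AND(p, q)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
hidden = step(X @ W1 + b1)

# Output layer: OR and not AND, i.e. XOR
W2 = np.array([[1.0], [-1.0]])
b2 = np.array([-0.5])
xor = step(hidden @ W2 + b2)

print(xor.ravel())  # [0 1 1 0]
```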
By applying this method in a multi-layer NN you'll get the same result, but then you'll have to deal with a huge number of parameters. One solution to avoid this is to extract representative, high-variance features that are uncorrelated with each other but correlated with their class, and feed those to the network. You can look up image feature extraction on the web.
This is a small explanation of how to see the link between images and their classes and how a NN works to classify them. You need to understand the NN concept first, and then you can go on to read about deep learning.

Neural Network Approximation Function

I'm trying to test the efficiency of neural networks as approximation functions.
The function I need to approximate has 5 inputs and 1 output; which structure should I use?
I have no idea what criteria should be applied in order to decide the number of hidden layers and the number of nodes in each layer.
Thank you in advance,
Regards
Giuseppe.
I always use a single hidden layer. Theoretically, there are no functions that can be approximated by 2 or more hidden layers that cannot also be approximated by one. To make a single hidden layer more complex, add more hidden nodes.
Typically, the number of hidden nodes is varied to observe the effect on model performance (as measured by accuracy or whatever). Too few hidden nodes results in a worse fit due to underfitting (the neural network's output function is too simple, and misses important details in the data). Too many hidden nodes results in a worse fit due to overfitting (the neural network becomes so flexible that it chases every bit of noise in the data).
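As a starting point, here is a sketch of such a single-hidden-layer regressor for 5 inputs and 1 output, written in Keras; the hidden widths, toy data, and training settings are placeholders to vary while watching the validation error for the under/overfitting pattern described above.

```python
import numpy as np
import tensorflow as tf

def build_model(n_hidden):
    """Single hidden layer, 5 inputs, 1 linear output for regression."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(5,)),
        tf.keras.layers.Dense(n_hidden, activation='tanh'),
        tf.keras.layers.Dense(1),            # linear output unit
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Toy data standing in for the real function to approximate.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.sin(X).sum(axis=1, keepdims=True)

# Vary the hidden-layer width and compare validation error.
for n_hidden in (2, 8, 32, 128):
    model = build_model(n_hidden)
    history = model.fit(X, y, validation_split=0.2, epochs=50, verbose=0)
    print(n_hidden, round(history.history['val_loss'][-1], 4))
```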
Note that for classification problems you need at least 2 hidden layers if you want to separate concave polygons.
I'm not sure how the number of hidden layers affects function approximation.