Why is the convolutional filter flipped in convolutional neural networks?

I don't understand why there is the need to flip filters when using convolutional neural networks.
According to the lasagne documentation,
flip_filters : bool (default: True)
Whether to flip the filters before sliding them over the input,
performing a convolution (this is the default), or not to flip them
and perform a correlation. Note that for some other convolutional
layers in Lasagne, flipping incurs an overhead and is disabled by
default – check the documentation when using learned weights from
another layer.
What does that mean? I never read about flipping filters when convolving in any neural network book. Would someone clarify, please?

The underlying reason for flipping a convolutional filter is the definition of the convolution operation, which comes from signal processing. When performing a convolution, the kernel is flipped along each axis over which you convolve; if you don't flip it, you end up computing a cross-correlation of the signal with the kernel rather than a convolution. It's a bit easier to understand if you think about applying a 1D convolution to a time series in which the function in question changes very sharply: you don't want your result to be skewed by, or correlated with, your signal.
This answer from the digital signal processing stack exchange site gives an excellent explanation that walks through the mathematics of why convolutional filters are defined to go in the reverse direction of the signal.
This page walks through a detailed example where the flip is done. This is a particular type of filter used for edge detection called a Sobel filter. It doesn't explain why the flip is done, but is nice because it gives you a worked-out example in 2D.
I mentioned that it is a bit easier to understand the why (as in, why convolution is defined this way) in the 1D case (the answer from the DSP SE site is really a great explanation), but this convention applies to 2D and 3D as well (the Conv2DDNN and Conv3DDNN layers both have the flip_filters option). Ultimately, however, because the convolutional filter weights are not something a human programs but are "learned" by the network, the choice is entirely arbitrary, unless you are loading weights from another network, in which case you must be consistent with the definition of convolution used in that network. If convolution was defined correctly (i.e., according to convention), the filter will be flipped. If it was defined incorrectly (in the more "naive" and "lazy" way), it will not.
The broader field that convolutions are a part of is "linear systems theory" so searching for this term might turn up more about this, albeit outside the context of neural networks.
Note that the convolution/correlation distinction is also mentioned in the docstrings of the corrmm.py module in lasagne:
flip_filters : bool (default: False)
Whether to flip the filters and perform a convolution, or not to flip
them and perform a correlation. Flipping adds a bit of overhead, so it
is disabled by default. In most cases this does not make a difference
anyway because the filters are learnt. However, flip_filters should
be set to True if weights are loaded into it that were learnt using
a regular lasagne.layers.Conv2DLayer, for example.
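To make the "be consistent when loading weights" point concrete, here is a minimal sketch of flipping a filter bank before moving it between a layer that convolves and one that correlates. It assumes the weights are laid out as (output channels, input channels, rows, columns) and stored as nested lists; the helper name flip_spatial and the toy values are made up for illustration and are not part of Lasagne.

```python
def flip_spatial(weights):
    """Reverse the last two axes (rows, columns) of a nested-list 4D filter bank."""
    return [[[row[::-1] for row in kernel[::-1]] for kernel in out_channel]
            for out_channel in weights]

# Hypothetical weights: 1 output channel, 1 input channel, one 2x3 kernel.
W = [[[[1, 2, 3],
       [4, 5, 6]]]]

W_flipped = flip_spatial(W)
print(W_flipped)  # [[[[6, 5, 4], [3, 2, 1]]]]
```

Flipping twice gives back the original weights, so the same helper works in both directions. With NumPy arrays in the same layout, the equivalent slice would be W[:, :, ::-1, ::-1].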

I never read about flipping filters when convolving in any neural
network book.
You can try a simple experiment. Take an image having the centermost pixel as value 1 and all other pixels with value 0. Now take any filter smaller than the image (let us say a 3 by 3 filter with values from 1-9). Now do a simple correlation instead of convolution. You end up with the flipped filter as the output after the operation.
Now flip the filter yourself and then do the same operation. You obviously end up with the original filter as the output.
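If you want to try the experiment without any library, something like the following pure-Python sketch should do. The function names are my own, and the border is zero-padded so the output has the same size as the image, which is an assumption the experiment itself does not depend on.

```python
def correlate2d_same(image, kernel):
    """Slide the kernel over the image without flipping it (zero padding)."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r_off, c_off = k_rows // 2, k_cols // 2
    out = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    rr, cc = r + i - r_off, c + j - c_off
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        total += image[rr][cc] * kernel[i][j]
            out[r][c] = total
    return out

def flip2d(kernel):
    """Reverse both the row order and the column order."""
    return [row[::-1] for row in kernel[::-1]]

def convolve2d_same(image, kernel):
    """Convolution is correlation with the flipped kernel."""
    return correlate2d_same(image, flip2d(kernel))

def center(result):
    """Cut out the 3x3 patch around the middle of a 5x5 result."""
    return [row[1:4] for row in result[1:4]]

# 5x5 image: centre pixel is 1, everything else is 0.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 1
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

corr_out = center(correlate2d_same(image, kernel))
conv_out = center(convolve2d_same(image, kernel))
print(corr_out)  # [[9, 8, 7], [6, 5, 4], [3, 2, 1]]  (the flipped filter)
print(conv_out)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  (the original filter)
```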
The second operation somehow seems neater. It is like multiplying by 1 and getting the same value back. However, the first one is not necessarily wrong. It works most of the time, even though it may not have nice mathematical properties. After all, why would the program care whether the operation is associative or not? It just does the job it is told to do. Moreover, the filter could be symmetric, in which case flipping it returns the same filter, so the correlation and convolution operations return the same output.
Is there a case where these mathematical properties help? Well sure, they do! If (ab)c were not equal to a(bc), I wouldn't be able to combine 2 filters and then apply the result to an image. To clarify, imagine I have 2 filters a, b and an image c. In the case of correlation, I would have to first apply 'b' to the image 'c' and then 'a' to that result. In the case of convolution, I could just combine 'a' and 'b' first and then apply the result to the image 'c'. If I have a million images to process, the efficiency gained by combining the filters 'a' and 'b' becomes obvious.
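A small 1D check of the associativity argument, as a sketch. It uses "full" mode (the output grows by the filter length minus one), and the signal and filter values are arbitrary numbers picked for illustration; the filters are deliberately asymmetric.

```python
def convolve_full(x, k):
    """Full 1D convolution: out[n] = sum_i x[i] * k[n - i]."""
    out = [0] * (len(x) + len(k) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            out[i + j] += xi * kj
    return out

def correlate_full(x, k):
    """Full 1D cross-correlation: the same sliding product, without the flip."""
    return convolve_full(x, k[::-1])

c = [1, 2, 3, 4]   # the "image" (here a 1D signal)
b = [1, 2]         # first filter
a = [1, 3]         # second filter

# Convolution: filtering twice equals filtering once with the combined filter.
conv_sequential = convolve_full(convolve_full(c, b), a)
conv_combined = convolve_full(c, convolve_full(b, a))

# Correlation: combining the filters by correlation gives a different result.
corr_sequential = correlate_full(correlate_full(c, b), a)
corr_combined = correlate_full(c, correlate_full(b, a))

print(conv_sequential == conv_combined)  # True
print(corr_sequential == corr_combined)  # False
```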
Every single mathematical property that a convolution satisfies gives certain benefits and hence if we have a choice (& we certainly do) we should prefer convolutions to correlations. The only difference between them is - in convolution we flip the filter before doing the multiplication operation and in correlation - we directly do the multiplication operation.
Applying convolution satisfies the mathematician inside all of us and also gives us some tangible benefits as well.
Though nowadays feature engineering on images is done end-to-end by deep learning itself and we need not even bother about it, there are other traditional image operations that may need these kinds of operations.

Firstly, since CNNs are trained from scratch rather than designed by hand, if the flip operation were necessary, the network would simply learn the flipped filters, and what gets implemented is a cross-correlation with those flipped filters.
Secondly, flipping is necessary in 1D time-series processing, since past inputs affect the current system output given the "current" input. But in 2D/3D spatial convolution over images there is no notion of "time", hence no "past" input and no effect of it on "now". Therefore we don't need to consider the relationship between a "signal" and a "system"; there is only the relationship between one "signal" (an image patch) and another "signal" (an image patch), which means we only need cross-correlation instead of convolution (although DL borrows this concept from signal processing).
Therefore, the flip operation is actually not needed.
(I guess.)
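For the time-series half of that argument, a minimal sketch of a causal filter may help: each output sample mixes the current input with past inputs, and that is exactly where the reversed index n - k (the "flip") comes from. The impulse response h below is a made-up example.

```python
def causal_filter(x, h):
    """y[n] = sum_k h[k] * x[n - k]: h[0] weights the current input,
    h[1] the previous one, and so on."""
    y = []
    for n in range(len(x)):
        total = 0
        for k in range(len(h)):
            if n - k >= 0:          # no input exists before time 0
                total += h[k] * x[n - k]
        y.append(total)
    return y

h = [3, 2, 1]                        # hypothetical impulse response
print(causal_filter([1, 0, 0, 0], h))  # [3, 2, 1, 0]  (an impulse reproduces h)
print(causal_filter([1, 2, 0, 0], h))  # [3, 8, 5, 2]
```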

Related

How does upsampling in Fully Connected Convolutional network work?

I read several posts / articles and have some doubts on the mechanism of upsampling after the CNN downsampling.
I took the 1st answer from this question:
https://www.quora.com/How-do-fully-convolutional-networks-upsample-their-coarse-output
I understood that similar to normal convolution operation, the "upsampling" also uses kernels which need to be trained.
Question 1: if the "spatial information" is already lost during the first stages of the CNN, how can it be reconstructed at all?
Question 2: Why is it that "Upsampling from a small (coarse) feature map deep in the network has good semantic information but bad resolution. Upsampling from a larger feature map closer to the input will produce better detail but worse semantic information"?
Question #1
Upsampling doesn't (and cannot) reconstruct any lost information. Its role is to bring the resolution back to that of an earlier layer.
Theoretically, we can eliminate the down/up sampling layers altogether. However, to reduce the number of computations, we can downsample the input before a layer and then upsample its output.
Therefore, the sole purpose of down/up sampling layers is to reduce computations in each layer, while keeping the dimensions of the input/output as before.
You might argue that down-sampling can cause information loss. That is always a possibility, but remember that the role of a CNN is essentially to extract "useful" information from the input and reduce it to a smaller dimension.
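A tiny sketch of that point: downsampling followed by upsampling restores the size of the feature map, not its contents. I use 2x2 max pooling and nearest-neighbour upsampling here only because they need no trained weights; a learned upsampling kernel would fill the grid in differently but still could not recover the discarded values exactly. The function names and the toy feature map are made up.

```python
def max_pool_2x2(fm):
    """Keep the maximum of every non-overlapping 2x2 block."""
    return [[max(fm[r][c], fm[r][c + 1], fm[r + 1][c], fm[r + 1][c + 1])
             for c in range(0, len(fm[0]), 2)]
            for r in range(0, len(fm), 2)]

def upsample_nearest_2x(fm):
    """Repeat every value twice along both axes."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]

pooled = max_pool_2x2(feature_map)
restored = upsample_nearest_2x(pooled)
print(pooled)    # [[6, 8], [14, 16]]
print(restored)  # [[6, 6, 8, 8], [6, 6, 8, 8], [14, 14, 16, 16], [14, 14, 16, 16]]
```

The restored map is 4x4 again, but twelve of the sixteen original values are gone.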
Question #2
As we go from the input layer in CNN to the output layer, the dimension of data generally decreases while the semantic and extracted information hopefully increases.
Suppose we have a CNN for image classification. In such a CNN, the early layers usually extract the basic shapes and edges in the image. The next layers detect more complex concepts like corners and circles. You can imagine the very last layers might have nodes that detect very complex features (like the presence of a person in the image).
So up-sampling from a large feature map close to the input produces better detail but carries less semantic information than the last layers. In contrast, the last layers generally have lower dimension, hence their resolution is worse than that of the early layers.

Episodic Semi-gradient Sarsa with Neural Network

While trying to implement Episodic Semi-gradient Sarsa with a neural network as the approximator, I wondered how to choose the optimal action based on the currently learned weights of the network. If the action space is discrete, I can just calculate the estimated value of the different actions in the current state and choose the one that gives the maximum. But this does not seem to be the best way of solving the problem. Furthermore, it does not work if the action space is continuous (like the acceleration of a self-driving car, for example).
So, basically I am wondering how to solve the 10th line, "Choose A' as a function of q̂(S', ·, w)", in this pseudo-code of Sutton:
How are these problems typically solved? Can one recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
I wondered how I choose the optimal action based on the currently learned weights of the network
You have three basic choices:
1. Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the maximum value as the predicted optimum action (with probability 1-ε, otherwise choose randomly, for the ε-greedy policy typically used in SARSA).
2. Design the network to estimate all action values at once, i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This will alter the gradient calculations slightly: there should be zero gradient applied to inactive outputs of the last layer (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times. This is also the approach used by the recent DQN Atari games playing bot, and AlphaGo's policy networks.
3. Use a policy-gradient method, which works by using samples to estimate the gradient that would improve a policy estimator. You can see chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive when there are large numbers of possible actions, and they can cope with continuous action spaces (by making estimates of the distribution function for the optimal policy, e.g. choosing the mean and standard deviation of a normal distribution, which you can sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method, you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy involves only using extreme values (the classic mountain car example falls into this category, the only useful actions are maximum acceleration and maximum backwards acceleration)
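As a rough sketch of the first choice combined with quantised actions: evaluate the estimator once per candidate action and pick ε-greedily. The function toy_q_hat is a made-up stand-in for a trained network, and the action grid is an arbitrary assumption.

```python
import random

def epsilon_greedy(q_hat, state, actions, epsilon, rng=random):
    """With probability epsilon explore; otherwise take the argmax of q_hat."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_hat(state, a))

def toy_q_hat(state, action):
    # Hypothetical estimator: the best action is half the state value.
    return -(action - 0.5 * state) ** 2

# A continuous range [-1, 1] quantised into five candidate actions.
discrete_actions = [-1.0, -0.5, 0.0, 0.5, 1.0]

greedy_action = epsilon_greedy(toy_q_hat, state=1.0,
                               actions=discrete_actions, epsilon=0.0)
print(greedy_action)  # 0.5
```

With a network that outputs all action values at once (the second choice), the max would run over the output vector instead of over repeated calls.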
Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally, the term ∇q(S,A,w) (sorry for the missing hat, no LaTeX on SO) means the gradient of the estimator itself, not the gradient of any loss function.
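To show what that update looks like in code, here is a minimal sketch that swaps the neural network for a linear approximator, because then the gradient of the estimator with respect to w is simply the feature vector. The function names, feature vectors and step sizes are all assumptions for illustration; with a network you would get the gradient of its output by backpropagation instead.

```python
def q_hat(w, features):
    """Linear action-value estimate: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features))

def sarsa_update(w, x_sa, reward, x_next, alpha, gamma, terminal=False):
    """One semi-gradient Sarsa step: w <- w + alpha * td_error * grad q_hat(S, A, w)."""
    target = reward if terminal else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x_sa)   # the part in square brackets
    # For a linear q_hat the gradient with respect to w is x_sa itself.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x_sa)]

w = [0.0, 0.0]
w = sarsa_update(w, x_sa=[1.0, 0.0], reward=1.0,
                 x_next=[0.0, 1.0], alpha=0.5, gamma=0.9)
print(w)  # [0.5, 0.0]
```

Note that the target is treated as a constant: no gradient flows through q_hat(w, x_next), which is what "semi-gradient" refers to.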

Error function and ReLu in a CNN

I'm trying to get a better understanding of neural networks by programming a Convolutional Neural Network myself.
So far, I'm going to make it pretty simple by not using max-pooling and using simple ReLu-activation. I'm aware of the disadvantages of this setup, but the point is not making the best image detector in the world.
Now, I'm stuck understanding the details of the error calculation, propagating it back and how it interplays with the used activation-function for calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum doesn't have defined start and end points, so I basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (I assume he means E_total?) and gives an example of how to define the new weight:
where W is the weights of a particular layer.
This confuses me, as I was always under the impression that the activation function (ReLU in my case) played a role in how to calculate the new weight. Also, this seems to imply I simply use the error for all layers. Doesn't the error value I propagate back into the next layer somehow depend on what I calculated in the previous one?
Maybe all of this is just incomplete and you can point me in the direction that helps me best for my case.
Thanks in advance.
You do not backpropagate errors, but gradients. The activation function does play a role in calculating the new weight, depending on whether the weight in question comes before or after that activation, and whether or not the two are connected. If a weight w comes after your non-linearity layer f, then the gradient dL/dw won't involve the derivative of f. But if w comes before f and they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw=(dL/df)*df/dw //notations might change according to the shape
//of the tensors/matrices/vectors you chose, but
//this is just the chain rule
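Here is a small sketch of that chain rule for a single ReLU unit, checked against a numerical gradient. The loss (half squared error) and all the numbers are assumptions picked just to have something concrete to differentiate.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 where the unit is active, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def loss(w, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (relu(z) - y) ** 2

def analytic_grad(w, x, y):
    """dL/dw = dL/df * df/dz * dz/dw, with f = relu(z) and z = w . x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    dL_df = relu(z) - y
    df_dz = relu_grad(z)
    return [dL_df * df_dz * xi for xi in x]

def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps))
    return grad

w, x, y = [0.5, -0.25], [2.0, 1.0], 1.0
print(analytic_grad(w, x, y))  # [-0.5, -0.25]
print(numeric_grad(w, x, y))   # approximately the same values
```

When the unit is inactive (z <= 0) the ReLU derivative is zero, so no gradient reaches w at all, which is one way the activation shapes the weight update.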
As for your cost function, it is correct. Many people write these formulas in this informal style so that you get the idea and can adapt it to your own tensor shapes. By the way, this sort of MSE function is better suited to continuous label spaces. You might want to use a softmax or an SVM loss for image classification (I'll come back to that). Anyway, as you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2D pixel x_i and predicts a 3D vector v_i for that pixel. Now, in your training data, x_i will already have a ground-truth 3D vector (i.e. a label), which we'll call y_i. Then your cost function will be (the index i runs over all data samples):
sum_i{(y_i-v_i)^T (y_i-v_i)} = sum_i{||y_i-v_i||^2}
But as I said, this cost function works if the labels form a continuous space (here, R^3). This is also called a regression problem.
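Written out with explicit start and end points, the sum runs over every training sample i and, inside the norm, over the three vector components. A minimal sketch with made-up labels and predictions:

```python
def sum_squared_error(targets, predictions):
    """sum_i ||y_i - v_i||^2 over all samples i."""
    return sum(sum((y - v) ** 2 for y, v in zip(y_i, v_i))
               for y_i, v_i in zip(targets, predictions))

# Two hypothetical samples, each with a 3D ground-truth vector and a prediction.
y = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
v = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

print(sum_squared_error(y, v))  # 2.0
```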
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss; the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set, for each data point x_i, you have a label c_i that indicates the correct class. Now, your neural network should produce a score for each possible label, which we'll denote s_1,..,s_n. Let's denote the score of the correct class of a training sample x_i as s_{c_i}. Now, if we use a softmax function, the intuition is to transform the scores into a probability distribution and maximise the probability of the correct classes. That is, we maximise the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j))}
where i runs over all training samples, and j = 1,..,n runs over all class labels.
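A minimal sketch of the per-sample term, with scores invented for illustration. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result. In practice the quantity that usually gets minimised is the negative log of this probability (the cross-entropy), which the sketch includes as well.

```python
import math

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)                          # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def correct_class_probability(scores, correct):
    """exp(s_c) / sum_j exp(s_j) for one training sample."""
    return softmax(scores)[correct]

def cross_entropy(scores, correct):
    """Negative log-probability of the correct class."""
    return -math.log(correct_class_probability(scores, correct))

scores = [2.0, 1.0, 0.1]   # hypothetical network outputs for 3 classes
print(correct_class_probability(scores, 0))
print(cross_entropy(scores, 0))
```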
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (at least the Andrej Karpathy parts).

How does a neural network work with correlated image data

I am new to TensorFlow and deep learning. I am trying to create a fully connected neural network for image processing. I am somewhat confused.
We have an image, say 28x28 pixels. This will have 784 inputs to the NN. For non-correlated inputs, this is fine, but image pixels are generally correlated. For instance, consider a picture of a cow's eye. How can a neural network understand this when we have all pixels lined up in an array for a fully-connected network. How does it determine the correlation?
Please research some tutorials on CNN (Convolutional Neural Network); here is a starting point for you. A fully connected layer of a NN surrenders all of the correlation information it might have had with the input. Structurally, it implements the principle that the inputs are statistically independent.
In contrast, a convolution layer depends upon the physical organization of the inputs (such as pixel adjacency), using that to find simple combinations (convolutions) of features from one layer to another.
Bottom line: your NN doesn't find the correlation: the topology is wrong, and cannot do the job you want.
Also, please note that a layered network consisting of fully-connected neurons with linear weight combinations, is not deep learning. Deep learning has at least one hidden layer, a topology which fosters "understanding" of intermediate structures. A purely linear, fully-connected layering provides no such hidden layers. Even if you program hidden layers, the outputs remain a simple linear combination of the inputs.
Deep learning requires some other discrimination, such as convolutions, pooling, rectification, or other non-linear combinations.
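The claim that stacked linear layers stay linear is easy to check numerically. In this sketch the two weight matrices are arbitrary made-up values, biases are left out for brevity, and the helper names are my own.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

W1 = [[1, 2], [3, 4], [5, 6]]     # hypothetical layer: 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]      # hypothetical layer: 3 hidden units -> 2 outputs
x = [1, -1]

two_layer = matvec(W2, matvec(W1, x))   # "hidden layer" without a non-linearity
collapsed = matvec(matmul(W2, W1), x)   # one equivalent linear layer

print(two_layer)  # [0, -3]
print(collapsed)  # [0, -3]
```

Putting a non-linearity such as ReLU between the two products is what breaks this equivalence.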
Let's take it to pieces to understand the intuition behind how a NN learns to predict.
To predict the class of a given image, we have to find a correlation or direct link between its input values and the class. We could hope to find a single pixel that tells us the image belongs to a class, but that is impossible, so what we have to do is build up more complex functions, which we can call complex features, that help us generate data correlated with the wanted class.
To make it simpler, imagine you want to build the AND function (p and q) or the OR function (p or q). In both cases there is a direct link between the input and the output: for AND, if there is a 0 in the input, the output is always zero. But what if we want the XOR function (p xor q)? There is no direct link between the input and the output. The answer is to build a first layer that classifies AND and OR, and then a second layer that takes the results of the first layer to build the XOR function:
(p xor q) = (p or q) and not (p and q)
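As a sketch, the two-layer construction can be written with simple threshold units. The weights and thresholds below are hand-picked for illustration rather than learned.

```python
def step(z):
    """Threshold activation: fires when the weighted sum is positive."""
    return 1 if z > 0 else 0

def and_unit(p, q):
    return step(p + q - 1.5)      # fires only when both inputs are 1

def or_unit(p, q):
    return step(p + q - 0.5)      # fires when at least one input is 1

def xor_net(p, q):
    # First layer: AND and OR features. Second layer: OR and not AND.
    h_and = and_unit(p, q)
    h_or = or_unit(p, q)
    return step(h_or - h_and - 0.5)

for p in (0, 1):
    for q in (0, 1):
        print(p, q, xor_net(p, q))
```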
By applying this method to a multi-layer NN you'll get the same result, but then you'll have to deal with a huge number of parameters. One solution to avoid this is to extract features from the images that are representative, high in variance, uncorrelated with each other and correlated with their class, and feed them to the network. You can look up image feature extraction on the web.
This is a brief explanation of how to see the link between images and their classes and how a NN works to classify them. You need to understand the NN concept first, and then you can go on to read about deep learning.

Essential philosophy behind Support Vector Machine

I am studying Support Vector Machines (SVM) by reading a lot of material. However, it seems that most of it focuses on how to classify the input 2D data by mapping it using several kernels such as linear, polynomial, RBF / Gaussian, etc.
My first question is, can SVM handle high-dimensional (n-D) input data?
According to what I found, the answer is YES!
If my understanding is correct, n-D input data will be
(1) constructed in Hilbert hyperspace, then those data will be
(2) simplified by using some approaches (such as PCA?) to combine it together / project it back to a 2D plane, so that
(3) the kernel methods can map it into an appropriate shape such that a line or curve can separate it into distinct groups.
This means most of the guides / tutorials focus on step (3). But some toolboxes I've checked cannot plot if the input data has more than 2 dimensions. How can the data be projected to 2D afterwards?
If there is no projection of data, how can they classify it?
My second question is: is my understanding correct?
My first question is, can SVM handle high-dimensional (n-D) input data?
Yes. I have dealt with data where n > 2500 when using LIBSVM software: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. I used linear and RBF kernels.
My second question is: is my understanding correct?
I'm not entirely sure on what you mean here, so I'll try to comment on what you said most recently. I believe your intuition is generally correct. Data is "constructed" in some n-dimensional space, and a hyperplane of dimension n-1 is used to classify the data into two groups. However, by using kernel methods, it's possible to generate this information using linear methods and not consume all the memory of your computer.
I'm not sure if you've seen this already, but if you haven't, you may be interested in some of the information in this paper: http://pyml.sourceforge.net/doc/howto.pdf. I've copied and pasted a part of the text that may appeal to your thoughts:
A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: First, the ability to generate non-linear decision boundaries using methods designed for linear classifiers. Second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation. The prime example of such data in bioinformatics are sequence, either DNA or protein, and protein structure.
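To make "a kernel function computes a dot product in some feature space" concrete, here is a small sketch using the degree-2 polynomial kernel on 2D inputs, where the explicit feature map is short enough to write down. The input vectors are arbitrary and the function names are my own.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def feature_map(x):
    """Explicit 3D feature space in which poly_kernel is a plain dot product."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x = [1.0, 2.0]
y = [3.0, 4.0]

kernel_value = poly_kernel(x, y)                        # computed in 2D
explicit_value = dot(feature_map(x), feature_map(y))    # computed in 3D

print(kernel_value)    # 121.0
print(explicit_value)  # 121.0 up to floating-point rounding
```

The classifier never has to build the mapped vectors, which is why no projection down to 2D is needed and why very high-dimensional (even infinite-dimensional, as with RBF) feature spaces stay affordable.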
It would also help if you could explain what "guides" you are referring to. I don't think I've ever had to project data on a 2-D plane before, and it doesn't make sense to do so anyway for data with a ridiculous amount of dimensions (or "features" as it is called in LIBSVM). Using selected kernel methods should be enough to classify such data.