neural networks and back propagation, justification for removeconstantrows in MATLAB - matlab

MATLAB has a removeconstantrows function that is supposed to be applied to the input and target output data of a feedforward neural network. This function removes constant rows from the data. For example, if one input vector for a 5-input neural network is [1 1 1 1 1], then it is removed.
Googling, the best explanation I could find is that (paraphrasing) "constant rows are not needed and can be replaced by appropriate adjustments to the biases of the output layer".
Can someone elaborate?
Who does this adjustment?
From my book, the weight adjustment for simple gradient descent is:
Δ weight_i = learning_rate * local_gradient * input_i
Which means that all weights of a neuron at the first hidden layer are adjusted the same amount. But they ARE adjusted.

I think there is a misunderstanding. The "row" is not an input pattern but a feature, that is, the i-th component across all patterns. If a feature has little or no variance over the whole data set, it provides little useful information and plays no noticeable role in network training.
The comparison to a bias is reasonable (though I don't agree that it applies only to the output layer, because it depends on where the constant row is found: if it is in the input data, then it holds for the first hidden layer as well, IMHO). If you remember, it is recommended that each neuron in a backpropagation network have a special bias weight connected to a constant signal of 1. If, for example, a training set contains a row of all 1s, then this is the same as an additional bias. If the constant row has a different value, the bias will have a different effect, but in any case you can simply eliminate this row and add its contribution (the constant value times its weight) to the existing bias.
Disclaimer: I'm not a Matlab user. My background in neural networks comes solely from programming area.
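To make the "fold it into the bias" argument concrete, here is a minimal Python sketch (not MATLAB, and not how removeconstantrows itself is implemented). The neuron, its weights, and the patterns are made-up numbers, and the helper names are hypothetical; the only point is that the weighted sum is unchanged once the constant feature is removed and weight * constant is added to the bias.

```python
# Sketch: a constant input feature can be folded into the bias of the
# neuron it feeds. All names and numbers here are illustrative assumptions.

def pre_activation(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias (no activation function)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def fold_constant_feature(weights, bias, index, constant):
    """Drop the weight at `index` and absorb weight * constant into the bias."""
    new_bias = bias + weights[index] * constant
    new_weights = weights[:index] + weights[index + 1:]
    return new_weights, new_bias

weights = [0.5, -1.2, 0.8]   # hypothetical first-hidden-layer neuron
bias = 0.1
constant = 3.0               # feature 1 equals 3.0 in every pattern

patterns = [[0.2, constant, -0.7], [1.5, constant, 0.4], [-2.0, constant, 0.0]]

reduced_weights, reduced_bias = fold_constant_feature(weights, bias, 1, constant)

for pattern in patterns:
    reduced_pattern = pattern[:1] + pattern[2:]   # same pattern without feature 1
    full = pre_activation(weights, bias, pattern)
    reduced = pre_activation(reduced_weights, reduced_bias, reduced_pattern)
    print(full, reduced)   # the two values agree up to rounding
```

In practice nobody has to do this folding by hand: if the constant row is removed before training, the bias is simply trained to whatever value fits, which already covers anything the constant feature could have contributed.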

Related

Are the bias values actually adjusted or only the weights with respect to the connection channels between them and the neuron's layer?

I was reading some literature about ANNs and got a bit confused about how the biases are updated. I understand that the process is done through backpropagation; however, I am confused about which part of the biases is actually adjusted, since I read that their value is always one.
So my question is whether the bias values are adjusted because their connection weights are updated, or whether it is the actual value of one that is updated.
Thanks in advance!
The bias is just another parameter that is trained by computing derivatives, like every other part of the neural network. One can simulate a bias by concatenating an extra 1 to the activations of the previous layer, since
w x + b = <[w b], [x 1]>
where [ ] is concatenation and < , > is the dot product. Consequently, it is not the bias that is 1; the bias is just a trainable parameter. But one can think of a bias as if it were a regular neuron-to-neuron connection whose input neuron is always equal to 1.
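A small Python sketch of this identity, with made-up numbers. The update step at the end assumes the plain gradient-descent rule delta_w_i = learning_rate * local_gradient * input_i, which is an assumption of this example rather than something stated in the answer; the values of learning_rate and local_gradient are arbitrary.

```python
# Sketch: a bias is the weight on an extra input that is always 1.
# Numbers and variable names are illustrative assumptions.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

w = [0.4, -0.3]
b = 0.25
x = [2.0, 1.0]

explicit = dot(w, x) + b               # w x + b
augmented = dot(w + [b], x + [1.0])    # <[w b], [x 1]>
print(explicit, augmented)             # the two values agree

# Assumed update rule: delta_i = learning_rate * local_gradient * input_i.
learning_rate = 0.1
local_gradient = -0.8                  # arbitrary value for illustration
deltas = [learning_rate * local_gradient * xi for xi in x + [1.0]]

# The last entry is the bias update: its "input" is the constant 1,
# so the trainable value b changes while the 1 never does.
delta_b = deltas[-1]
print(delta_b)
```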

Meaning of Bias with zero inputs in Perceptron at ANNs

I'm student in a graduate computer science program. Yesterday we had a lecture about neural networks.
I think I understood the specific parts of a perceptron in neural networks, with one exception. I have already done my research about the bias in a perceptron, but I still don't get it.
So far I know that with the bias I can shift the weighted sum of the inputs in a perceptron, so that the neuron fires (sigmoid) when the sum minus a specific bias exceeds the activation function's threshold.
But on the presentation slides from my professor he mentioned something like this:
The bias is added to the perceptron to avoid issues where all inputs
could be equal to zero - no multiplicative weight would have an effect
I can't figure out what the meaning behind this sentence is, and why it is important that the sum over all weighted inputs can't be equal to zero. If all inputs are equal to zero, there should be no impact on the perceptrons in the next hidden layer, right? Furthermore, this perceptron is a static value for backpropagation and has no influence on changing the weights at the perceptron.
Or am I wrong?
Has anyone a solution for that?
thanks in advance
Bias
A bias is essentially an offset.
Imagine the simple case of a single perceptron, with a relationship between the input and the output, say:
y = 2x + 3
Without the bias term, the perceptron could match the slope (often called the weight) of "2", meaning it could learn:
y = 2x
but it could not match the "+ 3" part.
Although this is a simple example, this logic scales to neural networks in general. The neural network can capture nonlinear functions, but often it needs an offset to do so.
What you asked
What your professor said is another good example of why an offset would be needed. Imagine all the inputs to a perceptron are 0. A perceptron's output is computed from the sum of the inputs, each multiplied by a weight. This means that each weight is multiplied by 0 and the products are added together. Therefore, the weighted sum will always be 0, whatever the weights are.
With a bias, however, the output could still retain a value.
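A short Python sketch of both points, assuming a sigmoid activation as mentioned in the question; the weights and the bias value of -2.0 are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def weighted_sum(weights, inputs, bias=0.0):
    """Sum of weight * input over all inputs, plus an optional bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

zero_inputs = [0.0, 0.0, 0.0]

# Without a bias the weighted sum is 0 for ANY weights, so the sigmoid
# output is stuck at sigmoid(0) = 0.5 and training the weights cannot move it.
for weights in ([1.0, 2.0, 3.0], [-5.0, 0.5, 10.0]):
    no_bias = weighted_sum(weights, zero_inputs)
    with_bias = weighted_sum(weights, zero_inputs, bias=-2.0)
    print(sigmoid(no_bias), sigmoid(with_bias))

# The y = 2x + 3 example: at x = 0 a bias-free linear unit can only output 0,
# while a unit with bias 3 reproduces the offset.
print(weighted_sum([2.0], [0.0]), weighted_sum([2.0], [0.0], bias=3.0))
```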

Can a convolutional neural network be built with perceptrons?

I was reading this interesting article on convolutional neural networks. It showed this image, explaining that for every receptive field of 5x5 pixels/neurons, a value for a hidden value is calculated.
We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image. It then throws away the exact positional information.
So max-pooling is applied.
With multiple convolutional layers, it looks something like this:
But my question is, this whole architecture could be built with perceptrons, right?
For every convolutional layer, one perceptron is needed, with layers:
input_size = 5x5;
hidden_size = 10; e.g.
output_size = 1;
Then for every receptive field in the original image, the 5x5 area is inputted into a perceptron to output the value of a neuron in the hidden layer. So basically doing this for every receptive field:
So the same perceptron is used 24x24 times to construct the hidden layer, because:
is that we're going to use the same weights and bias for each of the 24×24 hidden neurons.
And this works for the hidden layer to the pooling layer as well, input_size = 2x2; output_size = 1;. And in the case of a max-pool layer, it's just a max() function on an array.
and then finally:
The final layer of connections in the network is a fully-connected
layer. That is, this layer connects every neuron from the max-pooled
layer to every one of the 10 output neurons.
which is a perceptron again.
So my final architecture looks like this:
-> 1 perceptron for every convolutional layer/feature map
-> run this perceptron for every receptive field to create feature map
-> 1 perceptron for every pooling layer
-> run this perceptron for every field in the feature map to create a pooling layer
-> finally input the values of the pooling layer in a regular ALL to ALL perceptron
Or am I overlooking something? Or is this already how they are programmed?
The answer very much depends on what exactly you call a Perceptron. Common options are:
Complete architecture. Then no, simply because it's by definition a different NN.
A model of a single neuron, specifically y = 1 if (w.x + b) > 0 else 0, where x is the input of the neuron, w and b are its trainable parameters and w.x denotes the dot product. Then yes, you can force a bunch of these perceptrons to share weights and call it a CNN. You'll find variants of this idea being used in binary neural networks.
A training algorithm, typically associated with the Perceptron architecture. This would make no sense to the question, because the learning algorithm is in principle orthogonal to the architecture. Though you cannot really use the Perceptron algorithm for anything with hidden layers, which would suggest no as the answer in this case.
Loss function associated with the original Perceptron. This notion of Perceptron is orthogonal to the problem at hand; your loss function with a CNN is given by whatever you try to do with your whole model. You could still use it, but it is non-differentiable, so good luck :-)
A sidenote rant: You can see people refer to feed-forward, fully-connected NNs with hidden layers as "Multilayer Perceptrons" (MLPs). This is a misnomer, there are no Perceptrons in MLPs, see e.g. this discussion on Wikipedia -- unless you go explore some really weird ideas. It would make sense to call these networks Multilayer Linear Logistic Regression, because that's what they used to be composed of. Up till like 6 years ago.
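The weight-sharing reading (the single-neuron option above) can be sketched in plain Python. This is only an illustration: the 28x28 input size is an assumption inferred from the 5x5 receptive field and the 24x24 hidden layer in the question, the image is synthetic, the weights are not trained, and practical CNNs typically use a differentiable activation rather than the hard threshold.

```python
# Sketch: one threshold neuron with shared weights, applied to every 5x5
# receptive field, followed by 2x2 max-pooling. Illustrative values only.

def step_neuron(weights, bias, inputs):
    """y = 1 if (w.x + b) > 0 else 0, the single-neuron perceptron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

def feature_map(image, weights, bias, field=5):
    """Apply the SAME neuron (same weights and bias) to every receptive field."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - field + 1):
        out_row = []
        for c in range(cols - field + 1):
            patch = [image[r + i][c + j] for i in range(field) for j in range(field)]
            out_row.append(step_neuron(weights, bias, patch))
        out.append(out_row)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling: just max() over each size x size block."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

# Synthetic 28x28 "image" with values in [0, 1] (assumed input size).
image = [[((r * c) % 7) / 6.0 for c in range(28)] for r in range(28)]
weights = [1.0 / 25] * 25   # hypothetical shared weights: a 5x5 averaging filter
bias = -0.5                 # fires when the patch mean exceeds 0.5

fmap = feature_map(image, weights, bias)
pooled = max_pool(fmap)
print(len(fmap), len(fmap[0]))       # 24 24
print(len(pooled), len(pooled[0]))   # 12 12
```

Note that the pooling step has no trainable parameters, so calling it a perceptron is a stretch; it is just the max() over each block, as the question already suspects.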

Training data range for Neural Network

Is it better for a neural network to use a smaller range of training data, or does it not matter? For example, if I want to train an ANN with angles (float values), should I pass those values in degrees [0; 360] or in radians [0; 6.28], or should all values be normalized to the range [0; 1]? Does the range of the training data affect the ANN's learning quality?
My Neural Network has 6 input neurons, 1 hidden layer and I am using sigmoid symmetric activation function (tanh).
For the neural network it doesn't matter whether the data is normalised.
However, the performance of the training method can vary a lot.
In a nutshell: typically the methods prefer variables which have larger values. This might send the training method off-track.
Crucial for most NN training methods is that all dimensions of the training data have the same domain. If all your variables are angles, it doesn't matter whether they are in [0,1) or [0,2*pi) or [0,360), as long as they have the same domain. However, you should avoid having one variable for the angle in [0,2*pi) and another variable for the distance in mm, where the distance can be much larger than 2000000mm.
Two cases where an algorithm might suffer:
(a) regularisation: if the weights of the NN are forced to be small, a tiny change of a weight controlling the input of a large-domain variable has a much larger impact than for a small-domain variable
(b) gradient descent: if the step size is limited you have similar effects.
Recommendation: All variables should have the same domain size whether it is [0,1] or [0,2*pi] or ... doesn't matter.
Addition: for many domains "z-score normalisation" works extremely well.
The range of the data points affects the way you train a model. Suppose the range of values for the features in the data set is not normalized. Then, depending on your data, you may end up with elongated ellipses for the data points in the feature space, and the learning model will have a very hard time learning the manifold on which the data points lie (that is, learning the underlying distribution). Also, in most cases the data points are sparsely spread in the feature space if not normalized (see this). So the take-home message is to normalize the features when possible.
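As a rough illustration of the advice above, here is a minimal Python sketch of z-score normalisation and of rescaling to a common interval, applied per feature. The angle and distance values are invented for the example, and the function names are my own.

```python
import statistics

def z_score(column):
    """Shift a feature to mean 0 and scale it to standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

def min_max(column, low=0.0, high=1.0):
    """Rescale a feature linearly so that it spans [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [low for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

# Two features with very different domains (made-up values).
angles_rad = [0.0, 1.57, 3.14, 4.71, 6.28]
distances_mm = [1200.0, 250000.0, 900000.0, 1500000.0, 2000000.0]

for name, column in (("angle", angles_rad), ("distance", distances_mm)):
    print(name, [round(v, 3) for v in z_score(column)])
    print(name, [round(v, 3) for v in min_max(column)])
```

After either transform both features live on a comparable scale, so neither dominates the regularisation term or the gradient step just because of its units. The statistics should be computed on the training set and then reused unchanged for validation and test data.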

Issues with neural network

I am having some issues using a neural network. I am using a non-linear activation function for the hidden layer and a linear function for the output layer. Adding more neurons to the hidden layer should increase the capacity of the NN and make it fit the training data better, that is, give less error on the training data.
However, I am seeing a different phenomenon. Adding more neurons decreases the accuracy of the neural network, even on the training set.
Here is the graph of the mean absolute error with an increasing number of neurons. The accuracy on the training data is decreasing. What could be the cause of this?
Is it because MATLAB's nntool, which I am using, splits the data randomly into training, test and validation sets to check generalization instead of using cross-validation?
Also, I see lots of negative output values as I add neurons, while my targets are supposed to be positive. Could that be another issue?
I am not able to explain the behavior of NN here. Any suggestions? Here is the link to my data consisting of the covariates and targets
https://www.dropbox.com/s/0wcj2y6x6jd2vzm/data.mat
I am unfamiliar with nntool but I would suspect that your problem is related to the selection of your initial weights. Poor initial weight selection can lead to very slow convergence or failure to converge at all.
For instance, notice that as the number of neurons in the hidden layer increases, the number of inputs to each neuron in the visible (output) layer also increases (one for each hidden unit). Say you are using a logistic sigmoid in your hidden layer (always positive) and pick your initial weights from a uniform random distribution over a fixed interval. Then as the number of hidden units increases, the magnitude of the summed input to each neuron in the visible layer will also increase, because there are more incoming connections. With a very large number of hidden units, your initial solution may become very large and result in poor convergence.
Of course, how this all behaves depends on your activation functions, the distribution of the data and how it is normalized. I would recommend looking at Efficient Backprop by Yann LeCun for some excellent advice on normalizing your data and selecting initial weights and activation functions.
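The growth of the summed input can be checked with a small simulation. This is a sketch under assumptions, not a model of what nntool does: hidden activations are drawn uniformly from (0, 1) as a stand-in for sigmoid outputs, weights are drawn uniformly from a symmetric interval, and the 1/sqrt(fan-in) scaling is one common remedy in the spirit of the fan-in-based initialisation discussed in Efficient Backprop.

```python
import math
import random
import statistics

def summed_input_spread(n_hidden, weight_limit, trials=2000, seed=0):
    """Standard deviation of the summed input to one output neuron at init."""
    rng = random.Random(seed)
    sums = []
    for _ in range(trials):
        # Stand-in for always-positive hidden activations in (0, 1).
        hidden = [rng.random() for _ in range(n_hidden)]
        # Initial weights from a uniform distribution over a symmetric interval.
        weights = [rng.uniform(-weight_limit, weight_limit) for _ in range(n_hidden)]
        sums.append(sum(w * h for w, h in zip(weights, hidden)))
    return statistics.pstdev(sums)

for n_hidden in (4, 16, 64):
    fixed = summed_input_spread(n_hidden, weight_limit=1.0)
    scaled = summed_input_spread(n_hidden, weight_limit=1.0 / math.sqrt(n_hidden))
    # With a fixed interval the spread grows roughly like sqrt(n_hidden);
    # with the interval shrunk by 1/sqrt(n_hidden) it stays roughly constant.
    print(n_hidden, round(fixed, 3), round(scaled, 3))
```

If the initial weights were drawn from an interval that is not centred on zero, the mean of the summed input would grow with the number of hidden units as well, which makes the effect described above even stronger.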