Why is the softmax function necessary? Why not simple normalization? - neural-network

I am not familiar with deep learning so this might be a beginner question.
In my understanding, softmax function in Multi Layer Perceptrons is in charge of normalization and distributing probability for each class.
If so, why don't we use the simple normalization?
Let's say, we get a vector x = (10 3 2 1)
applying softmax, output will be y = (0.9986 0.0009 0.0003 0.0001).
Applying simple normalization (dividing each element by the sum, 16),
the output will be y = (0.625 0.1875 0.125 0.0625).
It seems like simple normalization could also distribute the probabilities.
So, what is the advantage of using softmax function on the output layer?

Simple normalization does not always produce probabilities. For example, it breaks down when some of the values are negative, and it is undefined when the values sum to zero.
Exponentiating the logits avoids both problems: the exponential is always strictly positive, so softmax can map the full range of the logits into valid probabilities. It is preferred because it actually works.
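A minimal sketch of the difference, using only the Python standard library. The function names and the example vector are illustrative assumptions, not taken from any particular framework:

```python
import math

def simple_normalize(values):
    """Divide each value by the sum; undefined if the sum is zero."""
    total = sum(values)
    if total == 0:
        raise ZeroDivisionError("sum of values is zero")
    return [v / total for v in values]

def softmax(values):
    """Exponentiate (shifted by the max for numerical stability), then normalize."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, -3.0]          # hypothetical network outputs
naive = simple_normalize(logits)    # contains negative entries, so not probabilities
probs = softmax(logits)             # every entry lies in (0, 1) and they sum to 1
```

With mixed-sign inputs the simple normalization yields negative entries, while softmax still returns a valid distribution.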

This depends on the training loss function. Many models are trained with a log loss (cross-entropy), so the values you see in that vector estimate the log of each probability, up to an additive constant. Thus, SoftMax is merely converting back to linear values and normalizing.
The empirical reason is simple: SoftMax is used where it produces better results.
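To illustrate the "converting back to linear values" reading, here is a small sketch. It assumes the network outputs are log-probabilities shifted by an arbitrary constant; the distribution and the shift of 5.0 are made up for the example:

```python
import math

def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(logits, target_index):
    """Cross-entropy for one sample: negative log of the probability of the true class."""
    return -math.log(softmax(logits)[target_index])

true_probs = [0.7, 0.2, 0.1]                        # hypothetical distribution
logits = [math.log(p) + 5.0 for p in true_probs]    # log-probabilities plus a constant
recovered = softmax(logits)                         # the constant cancels in the ratio
loss_for_class_0 = log_loss(logits, 0)              # equals -log(0.7)
```

The additive constant cancels in the normalization, which is why the raw outputs only need to match the log-probabilities up to a shift.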

Related

Normalize in Adaboost without numerical error - Matlab

I'm implementing AdaBoost on Matlab. This algorithm requires that in every iteration the weights of each data point in the training set sum up to one.
If I simply use the following normalization v = v / sum(v) I get a vector whose 1-norm is 1 except some numerical error which later leads to the failure of the algorithm.
Is there a MATLAB function for normalizing a vector so that its 1-norm is EXACTLY 1?
Assuming you want identical values to be normalised with the same factor, this is not possible. Simple counter example:
v=ones(21,1);
v = v / sum(v);
sum(v)-1
One common way to deal with it is to enforce sum(v)>=1 or sum(v)<=1, if your algorithm can tolerate a deviation to one side:
if sum(v)>1
v=v-eps(v)
end
Alternatively you can try using vpa, but this will drastically increase your computation time.
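The same effect can be inspected outside MATLAB. The following Python sketch (an illustration, not a fix) normalizes a weight vector and measures how far the result is from summing to one; `math.fsum` is used so that the measurement itself adds as little rounding error as possible:

```python
import math

def normalize(weights):
    """Divide by the sum; the result sums to 1 only up to rounding error."""
    total = math.fsum(weights)
    return [w / total for w in weights]

weights = [1.0] * 21                       # same shape as the counterexample above
normalized = normalize(weights)
deviation = math.fsum(normalized) - 1.0    # can be a tiny non-zero value
```

The deviation stays on the order of machine epsilon, so a comparison against 1 with a small tolerance is usually more practical than demanding exact equality.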

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
y = 1 / (1 + exp(-(w*x + b)))
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
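A small experiment makes the effect visible. This is a sketch with made-up one-dimensional data and a plain gradient-ascent fit written from scratch; it is not what glmfit does internally. On separable labels the weight keeps growing the longer you train, while flipping one label leads to a finite, much smaller weight:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps, lr=1.0):
    """Gradient ascent on the mean log-likelihood of p = sigmoid(w*x + b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]   # hypothetical covariate values
separable = [0, 0, 0, 1, 1, 1]        # a threshold near 0.5 splits the classes
flipped = [0, 1, 0, 1, 1, 1]          # one label flipped: no longer separable

w_short, _ = fit_logistic(xs, separable, steps=2000)
w_long, _ = fit_logistic(xs, separable, steps=20000)
w_flipped, _ = fit_logistic(xs, flipped, steps=20000)
```

Comparing `w_short`, `w_long` and `w_flipped` shows the separable fit drifting toward ever larger weights, which matches the extreme coefficients and standard errors reported in the question.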

Connecting perceptrons with output of previous ones?

Because of the help I received and researched here I was able to create a simple perceptron in C#, code of which goes like:
int Input1 = A;
int Input2 = B;
//weighted sum
double WSum = A * W1 + B * W2 + Bias;
//get the sign: -1 for negative, +1 for positive
int Sign=Math.Sign(WSum);
double error = Desired - Sign;
//updating weights
W1 += error * Input1 * 0.1; //0.1 being a learning rate
W2 += error * Input2 * 0.1;
return Sign;
I do not use Sigmoid here and just get -1 or 1.
I would have two questions:
1) Is that correct that my weights get values like -5 etc? When input is e.g. 100,50 it goes like: W1+=error*100*0.1
2) I want to proceed deeper and create more connected neurons - I guess I would need at least two to provide inputs to the third. Is that correct that the third will be fed with values just -1..1? I am aiming to a simple pattern recognition but so far do not understand how it should work.
It is perfectly valid for the values of your weights to range from -Infinity to +Infinity. You should always use real numbers instead of integers (so, as mentioned above, double will work; 32-bit float precision is perfectly sufficient for neural networks).
Moreover, you should decay your learning rate with every learning step, e.g. reduce it by a factor of 0.99 after each update. Else, your algorithm will oscillate when approaching an optimum.
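As a rough Python sketch of the same update rule with a decaying learning rate: the AND training set, the bias update, and the choice to treat a weighted sum of exactly 0 as +1 are assumptions added for the example, not part of the original C# code.

```python
def sign(value):
    # Assumption: a weighted sum of exactly 0 counts as +1.
    return 1 if value >= 0 else -1

def train_perceptron(samples, epochs=20, learning_rate=0.1, decay=0.99):
    """Perceptron rule with real-valued weights and a learning rate that shrinks per update."""
    w1, w2, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (a, b), desired in samples:
            output = sign(a * w1 + b * w2 + bias)
            error = desired - output          # 0 when correct, otherwise -2 or +2
            if error != 0:
                w1 += error * a * learning_rate
                w2 += error * b * learning_rate
                bias += error * learning_rate
                learning_rate *= decay        # reduce the step after each update
    return w1, w2, bias

# Logical AND with targets in {-1, +1}
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w1, w2, bias = train_perceptron(and_samples)
predictions = [sign(a * w1 + b * w2 + bias) for (a, b), _ in and_samples]
```

Because AND is linearly separable, a single neuron of this kind is enough; the weights end up as small real numbers rather than integers.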
If you want to go "deeper", you will need to implement a Multilayer Perceptron (MLP). It has been proven that a single layer of simple threshold neurons can only solve linearly separable problems, and that stacking layers of purely linear neurons is always equivalent to a single layer. For a long time there was also no practical way to train networks with hidden layers, which is why several decades ago the research community temporarily abandoned the idea of artificial neural networks. In 1986, Geoffrey Hinton and his co-authors made the backpropagation algorithm popular. With it you can train MLPs with multiple hidden layers.
To solve non-linear problems like XOR or other complex problems like pattern recognition, you need to apply a non-linear activation function. Have a look at the logistic sigmoid activation function for a start. f(x) = 1. / (1. + exp(-x)). When doing this you should normalize your input as well as your output values to the range [0.0; 1.0]. This is especially important for the output neurons since the output of the logistic sigmoid activation function is defined in exactly this range.
A simple Python implementation of feed-forward MLPs using arrays can be found in this answer.
Edit: You also need at least 1 hidden layer to solve e.g. XOR.
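To see why one hidden layer is enough for XOR, here is a sketch with hand-picked weights (no training involved). The thresholds 0.5 and 1.5 are one possible choice among many:

```python
def step(value):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if value > 0 else 0

def xor_network(x1, x2):
    # Hidden layer: one neuron acts as OR, the other as AND.
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    # Output neuron fires when OR is on and AND is off.
    return step(h_or - h_and - 0.5)

truth_table = [(a, b, xor_network(a, b)) for a in (0, 1) for b in (0, 1)]
```

The two hidden neurons feed the third one, which is the "two neurons providing inputs to a third" arrangement asked about in the question.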
Try to set your weights as double.
Also, I think it's much better to work with arrays, especially for neural networks and perceptrons.
And you will need some for or while loops to achieve what you want.

Neural Network (FFW, BP) - function approximation

Is it possible to train a NN to approximate this function:
If I run the approximation for x^2 or sin or something simple, it works fine, but for this sort of function I get only a constant-valued line.
My NN has 2 inputs (x, f(x)), one hidden layer (10 neurons), 1 output (f(x))
For training I am using BP, activation functions sigmoid -> tanh
My goal is to get a "smooth" function without noise that captures the function in the image above.
Or is there any other way with NN or genetic algorithm, how to approximate this ?
You're going to have major problems because the input (x, f(x)) is discontinuous (not exactly, but sort of).
Therefore, your NN will have to literally memorize the x-f(x) mapping given the large discontinuities.
One approach is to use a four-layer NN which can address the discontinuities.
But really, you may simply want to look at other smoothing methods rather than a NN for this problem.
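One of the simplest such smoothing methods is a centered moving average. The sketch below is only an illustration: the original function is not available here, so a noisy sine wave stands in for it, and the window size of 11 is an arbitrary assumption.

```python
import math
import random

def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

random.seed(0)                                       # reproducible noise
xs = [i * 0.05 for i in range(200)]
noisy = [math.sin(x) + random.gauss(0.0, 0.2) for x in xs]   # stand-in data
smooth = moving_average(noisy, window=11)
```

A larger window gives a smoother curve but flattens sharp features, so the window would have to be tuned to the actual signal.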
You have a periodic function so first of all, only use one period, or you will memorize and not generalize.

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network made of 4 input neurons, 1 hidden layer made of 20 neurons and a 7 neuron output layer.
I'm trying to train it for a BCD to 7-segment algorithm. My data is normalized: 0 maps to -1 and 1 maps to 1.
During the output error evaluation, the neuron saturates on the wrong side. If the desired output is 1 and the real output is -1, the error is 1-(-1) = 2.
When I multiply it by the derivative of the activation function, error*(1-output)*(1+output), the error becomes almost 0 because 2*(1-(-1))*(1+(-1)) = 0.
How can I avoid this saturation error?
Saturation at the asymptotes of the activation function is a common problem with neural networks. If you look at a graph of the function, this is no surprise: the tails are almost flat, meaning that the first derivative is (almost) 0. The network cannot learn any more.
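The vanishing error term is easy to reproduce. In this sketch the net inputs -0.5 and -10.0 are arbitrary example values; the delta is computed exactly as in the question, error times (1-output)*(1+output):

```python
import math

def tanh_delta(target, net_input):
    """Backprop error term for a tanh output neuron."""
    output = math.tanh(net_input)
    derivative = (1.0 - output) * (1.0 + output)   # equals 1 - tanh(x)^2
    return (target - output) * derivative

moderate = tanh_delta(1.0, -0.5)     # wrong output, but gradient still usable
saturated = tanh_delta(1.0, -10.0)   # output near -1: raw error near 2, delta near 0
```

Even though the raw error is at its largest in the saturated case, the derivative factor shrinks the delta to almost nothing.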
A simple solution is to scale the activation function to avoid this problem. For example, with tanh() activation function (my favorite), it is recommended to use the following activation function when the desired output is in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1 - tanh( 2/3 * x)^2)
This will force the gradients into the most non-linear value range and speed up the learning. For all the details I recommend reading Yann LeCun's great paper Efficient Back-Prop.
For this scaled tanh() activation function, the error term expressed through the neuron's output would be calculated as
error = 2/3 * (1.7159 - output^2 / 1.7159) * (teacher - output)
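As a sanity check, the scaled tanh and its chain-rule derivative can be compared against a finite difference. This is a sketch; the constant names `A` and `B` and the helper functions are introduced here for illustration only:

```python
import math

A = 1.7159        # output scale
B = 2.0 / 3.0     # input scale

def scaled_tanh(x):
    return A * math.tanh(B * x)

def scaled_tanh_derivative(x):
    """Chain rule: A * B * (1 - tanh(B*x)^2)."""
    t = math.tanh(B * x)
    return A * B * (1.0 - t * t)

def derivative_from_output(output):
    """Same derivative written in terms of the neuron output, using tanh(B*x) = output / A."""
    return B * (A - output * output / A)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only as a check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

With these constants the function passes very close to (1, 1) and (-1, -1), so targets of -1 and +1 sit away from the flat tails of the curve.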
This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of two extremes. It's been a while since I have worked with Artificial Neural Networks but if I remember correctly, this (among many other things) is one of the limitations of using the simple back-propagation algorithm.
You could add a Momentum factor to make sure there is some correction based off previous experience, even when the derivative is zero.
You could also train it by epoch, where you accumulate the delta values for the weights before doing the actual update (compared to updating it every iteration). This also mitigates conditions where the delta values are oscillating between two values.
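A minimal sketch of the momentum idea, shown on a one-dimensional quadratic rather than a real network. The objective (w - 3)^2, the learning rate of 0.1 and the momentum factor of 0.9 are assumptions chosen for the example:

```python
def minimize_with_momentum(gradient, start, learning_rate=0.1, momentum=0.9, steps=500):
    """Gradient descent where each step also carries over part of the previous step."""
    w = start
    velocity = 0.0
    for _ in range(steps):
        # The velocity keeps moving w even when the current gradient is tiny.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w

# Hypothetical objective (w - 3)^2 with gradient 2 * (w - 3)
w_final = minimize_with_momentum(lambda w: 2.0 * (w - 3.0), start=0.0)
```

In a network, the same velocity term would be kept per weight, so a neuron that has just entered a flat region is still pushed along by its recent updates.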
There may be more advanced methods, like second order methods for back propagation, that will mitigate this particular problem.
However, keep in mind that tanh reaches -1 or +1 only at the infinities, so an exactly zero derivative is largely a theoretical problem.
Not totally sure if I am reading the question correctly, but if so, you should scale your inputs and targets to lie between -0.9 and 0.9, which would help keep your derivatives in a sane range.