Why are Neural Network Loss Functions Always Positive?

I am trying to fill a gap in my knowledge. Looking at the majority of loss functions for neural networks, such as MSE, MAE, L1, and L2, the loss is always recorded as a positive value. What I don't understand is why. Shouldn't the loss function have positive or negative values in order to raise or lower the weights of the network as needed?

Loss functions like Mean Squared Error (MSE) always give positive loss values. They indicate how big the error is, not in which direction it was made.
Suppose our neural network is a basketball player whose task is to throw the ball into the basket. If the ball falls to the left of the basket, the signed error is negative; if it falls to the right, it is positive; if it lands in the basket, it is zero. That is the signed-error view the question has in mind. MSE, in contrast, gives a positive loss that only reflects how far the ball landed from the basket; it does not care whether the miss was to the left or to the right.
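The direction information is not lost, though: it lives in the gradient of the loss rather than in the loss value itself, and backpropagation uses that signed gradient to decide whether to raise or lower each weight. A minimal NumPy sketch (the numbers are purely illustrative):

import numpy as np

y_true = np.array([3.0])
for y_pred in (np.array([2.0]), np.array([4.0])):     # undershoot, then overshoot
    loss = np.mean((y_pred - y_true) ** 2)            # MSE is positive in both cases
    grad = 2 * (y_pred - y_true) / y_true.size        # d(MSE)/d(y_pred) keeps the sign
    print(y_pred, loss, grad)
# both predictions give the same loss of 1.0, but the gradients are -2 and +2,
# so the optimizer still knows in which direction to move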

Related

Reinforcement Learning - Won't Converge

I'm working on my bachelor thesis.
My topic is reinforcement learning. The Setup:
Unity3D (C#)
Own neural network framework
Confirmed the network works by training it to approximate a sine function.
It can approximate it. Well, there are some values which won't reach their desired value, but it's good enough.
When training it with single values it always converges.
Here is my problem:
I try to teach my network the Q-Value-Function of a simple game,
catch balls:
In this game it just has to catch a ball dropping from a random position and at a random angle.
+1 if catch
-1 if failed
My network model has 1 hidden layer with neurons ranging from 45 to 180 (I tested these numbers with no success).
It uses experience replay with 32 samples from a 100k memory, with a learning rate of 0.0001.
It learns for 50000 frames then tests for 10000 frames. This happens 10 times.
Inputs are PlatformPosX, BallPosX, BallPosY from the last 4 frames
Pseudocode:
Choose action (e-greedy)
Do action,
Store state, action, currentReward, done in memory
if in learnphase: Replay
My problem is:
Its actions start clipping to either 0 or 1, with some variance, sometimes.
It never finds an ideal policy, such as simply having the platform follow the ball.
EDIT:
Sorry for the sparse info...
My Quality-Function is trained by:
Reward + Gamma * nextEstimated_Reward
So it's discounting.
Why would you possibly expect that to work?
Your training can barely approximate a 1-dimensional function, and now you expect it to solve a 12-dimensional function that involves a differential equation? You should have verified first whether your training even converges for a multi-dimensional function at all, with the chosen training parameters.
Your training, given the little detail you provided, also appears to be unsuitable. There is hardly a chance it ever successfully catches the ball, and even when it does, you are rewarding it mostly for random outputs. The only correlation between input and output is in the last few frames, when the pad can only reach the target in time through a limited set of possible actions.
Then there is the choice of inputs. Don't require your model to do the differentiation by itself. Relevant inputs would have been x, y, dx, dy. Preferably even x, y relative to the pad position, not in world coordinates. That should have a much better chance to converge, even if it only learned to keep x minimal.
Working with absolute world coordinates is pretty much bound to fail, as it would require the training to cover the entire range of possible input combinations, and the network to be big enough to store all those combinations. Be aware that the network isn't learning the actual function; it's learning an approximation for every single possible set of inputs. Even if the ideal solution is actually just a linear equation, the non-linear properties of the activation function make it impossible to learn it in a generalized form for unbounded inputs.
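As a rough illustration of the suggested input choice (the function and variable names here are hypothetical, not from the original Unity setup), features expressed relative to the pad could be built like this:

import numpy as np

def build_features(pad_x, ball_x, ball_y, prev_ball_x, prev_ball_y):
    # ball position relative to the pad instead of absolute world coordinates,
    # plus finite-difference velocities so the network does not have to infer them
    rel_x = ball_x - pad_x
    rel_y = ball_y
    dx = ball_x - prev_ball_x
    dy = ball_y - prev_ball_y
    return np.array([rel_x, rel_y, dx, dy], dtype=np.float32)

With inputs like these, even a small network has a chance of learning something close to "keep rel_x near zero", which is roughly the ideal policy described in the question.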

Meaning of Bias with Zero Inputs in a Perceptron in ANNs

I'm student in a graduate computer science program. Yesterday we had a lecture about neural networks.
I think I understood the specific parts of a perceptron in neural networks, with one exception. I have already done my research about the bias in a perceptron, but I still didn't get it.
So far I know that with the bias I can shift the weighted sum over the inputs in a perceptron, so that the sum minus a specific bias is compared against the activation function's threshold to decide whether the neuron should fire (Sigmoid).
But on the presentation slides from my professor he mentioned something like this:
The bias is added to the perceptron to avoid issues where all inputs
could be equal to zero - no multiplicative weight would have an effect
I can't figure out the meaning behind this sentence, and why it is important that the sum over all weighted inputs can't be equal to zero. If all inputs are equal to zero, there should be no impact on the perceptrons in the next hidden layer, right? Furthermore, this perceptron would then contribute a fixed value during backpropagation and have no influence on changing its weights.
Or am I wrong?
Does anyone have an answer for this?
Thanks in advance
Bias
A bias is essentially an offset.
Imagine the simple case of a single perceptron, with a relationship between the input and the output, say:
y = 2x + 3
Without the bias term, the perceptron could match the slope (often called the weight) of "2", meaning it could learn:
y = 2x
but it could not match the "+ 3" part.
Although this is a simple example, this logic scales to neural networks in general. The neural network can capture nonlinear functions, but often it needs an offset to do so.
What you asked
What your professor said is another good example of why an offset would be needed. Imagine all the inputs to a perceptron are 0. A perceptron's output is the sum of each of the inputs multiplied by a weight. This means that each weight is being multiplied by 0, then added together. Therefore, the result will always be 0.
With a bias, however, the output could still retain a value.
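A tiny numeric sketch of that last point (the weight and bias values are arbitrary, purely for illustration):

import numpy as np

x = np.zeros(3)                   # all inputs are zero
w = np.array([0.5, -1.2, 2.0])    # whatever the learned weights happen to be
b = 0.7

print(np.dot(w, x))       # 0.0 -- without a bias the output is forced to zero
print(np.dot(w, x) + b)   # 0.7 -- with a bias the perceptron can still output a value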

Matlab: Dealing with denorm performance cost conversion when close to realmin in backprop

I understand that if a number gets closer to zero than realmin, then Matlab converts the double to a denorm. I am noticing this causes a significant performance cost. In particular, I am using a gradient descent algorithm where, near convergence, the gradients (in backprop for my bespoke neural network) drop below realmin, so the algorithm incurs a heavy performance cost (due to, I am assuming, type conversion behind the scenes). I have used the following code to validate my gradient matrices so that no numbers fall below realmin:
function mat = validateSmallDoubles(obj, mat, threshold)
    % Zero out any entries whose magnitude is below the threshold,
    % so that no element of mat falls into the denormal range.
    mat = mat .* (abs(mat) > threshold);
end
Is this usual practice, and what value should threshold take? (Obviously you want it as close to realmin as possible, but not too close, otherwise any additional division operations will send some elements of mat below realmin after validation.) Also, specifically for neural networks, where are the best places to do gradient validation without ruining the network's ability to learn? I would be grateful to know what solutions people with experience in training neural networks use; I am sure this is a problem for all languages. Tentative threshold values have ruined my network's learning.
I do not know if it is related to your problem, but I had a similar problem with underflows while computing an exponentially weighted average of gradients (say while implementing Momentum or Adam).
In particular, at some point you do something like:
v := 0.9*v + 0.1*gradient, where v is the exponentially weighted average of your gradient g. If the same element of your g matrix remains 0 over many successive iterations, your v quickly becomes very small and you hit denormals.
So the question is: why all those zeros? In my case the culprit was the ReLU units, which output a lot of zeros (if x < 0, relu(x) is zero). When ReLU outputs zero for a given neuron, the related weight has no effect, so the corresponding partial derivative is zero in g. It happened to me that over many successive iterations that particular neuron was never fired.
To avoid having zero activations (and derivatives), I used "leaky ReLU", so as to have a very small derivative instead.
Another solution is to clip your gradients to a minimum value before applying the weighted average, which is quite similar to what you did.
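A small NumPy sketch of the leaky ReLU idea (the 0.01 slope is a common but arbitrary choice):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # keep a small negative slope instead of a hard zero for x < 0
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # the derivative is never exactly zero, so averaged gradients
    # do not decay all the way down into the denormal range
    return np.where(x > 0, 1.0, alpha)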
I traced the diminishing-gradient occurrences to the Adam SGD optimiser: the biased moving-average matrix calculations in the Adam optimiser were causing Matlab to carry out the denorm operation. I simply thresholded the matrix elements for each layer to zero after these calculations, with threshold = 10*realmin, without any effect on learning. I have yet to investigate why my moving averages were getting so close to zero, as my architecture and weight-initialisation priors would normally mitigate this.
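For reference, a NumPy equivalent of that thresholding step applied to Adam-style moment estimates might look like the sketch below (m, v, the betas, and the toy gradient are illustrative Adam quantities, not the original Matlab code):

import numpy as np

def flush_small_to_zero(mat, threshold=10 * np.finfo(np.float64).tiny):
    # np.finfo(np.float64).tiny is the smallest normal double, i.e. Matlab's realmin;
    # anything smaller in magnitude is set to exactly zero
    return np.where(np.abs(mat) > threshold, mat, 0.0)

grad = np.array([1e-3, 0.0, -2e-7])
m = np.zeros_like(grad)
v = np.zeros_like(grad)
beta1, beta2 = 0.9, 0.999

m = flush_small_to_zero(beta1 * m + (1 - beta1) * grad)        # first moment
v = flush_small_to_zero(beta2 * v + (1 - beta2) * grad ** 2)   # second moment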

How to adjust the weights in gradient descent

I am currently trying to teach myself something about neural networks. So I bought this book called Applied Artificial Intelligence, written by Wolfgang Beer, and I am now stuck at understanding a part of his code. Actually, I understand the code, I just do not understand one mathematical step behind it...
The part looks like this:
for i in range(iterations):
    guessed = sig(inputs * weights)
    error = output - guessed
    adjustment = error * sig_d(guessed)
    # Why is there no learning rate?
    # Why is the adjustment relative to the error
    # multiplied by the derivative of your main function?
    weights += adjustment
I tried to look up how the gradient descent method works, but I never understood the part about adjusting the weights. How does the math behind it work, and why do you use the derivative for it?
Also, when I started to look on the internet for other solutions, I always saw them using a learning rate. I understand the concept of it, but why is this method not used in this book? It would really help me if someone could answer these questions...
And thanks for all these rapid responses in the past.
To train a regression model we start with arbitrary weights and adjust the weights so that the error becomes minimal. If we plot the error as a function of the weights, we get a plot like the figure from the course referenced below, where the error J(θ0, θ1) is a function of the weights θ0 and θ1. We have succeeded when the error reaches the very bottom of the graph, where its value is minimal; in that figure, the red arrows mark the minimum points. To reach a minimum point we take the derivative of our error function. The slope of the tangent at a point is the derivative at that point, and it gives us a direction to move towards. We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
The gradient descent algorithm is:
repeat until convergence:
θj := θj − α · ∂J(θ0, θ1)/∂θj
where
j=0,1 represents the weights' index number.
In a second figure from the course, the error J(θ1) is plotted as a function of a single weight θ1. We start with an arbitrary value of θ1 and take the derivative (the slope of the tangent) of the error J(θ1) to adjust the weight θ1, so we can reach the bottom where the error is minimal. If the slope is positive we have to go left, i.e. decrease the weight θ1; if the slope is negative we have to go right, i.e. increase θ1. We repeat this procedure until convergence, i.e. until a minimum point is reached.
If the learning rate α is too small, gradient descent converges too slowly. If α is too large, gradient descent overshoots and fails to converge.
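Putting the update rule into code, a minimal single-weight example with an explicit learning rate α (the function J here is made up purely for illustration):

# minimise J(theta) = (theta - 3)**2 by gradient descent
def J_grad(theta):
    return 2 * (theta - 3)   # derivative of J with respect to theta

theta = 0.0    # arbitrary starting weight
alpha = 0.1    # learning rate
for _ in range(100):
    theta -= alpha * J_grad(theta)   # theta_j := theta_j - alpha * dJ/dtheta_j
print(theta)   # converges towards 3, the minimum of J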
All the figures have been taken from Andrew Ng's machine learning course on coursera.org
https://www.coursera.org/learn/machine-learning/home/welcome
Why is there no learning rate?
There are lots of different flavors of neural network training; some use an explicit learning rate and others simply keep it constant.
Why is the adjustment relative to the error
What else should it be relative to? If there is a lot of error, then chances are you need to make a large adjustment; if there was only a little error, then you would only want to adjust your weights by a small amount.
multiplied by the derivative of your main function?
I don't really have a good answer for this one.

Should I use loss or accuracy as the early stopping metric?

I am learning and experimenting with neural networks and would like to have the opinion from someone more experienced on the following issue:
When I train an Autoencoder in Keras ('mean_squared_error' loss function and SGD optimizer), the validation loss gradually goes down and the validation accuracy goes up. So far so good.
However, after a while, the loss keeps decreasing but the accuracy suddenly falls back to a much lower level.
Is it 'normal' or expected behavior that the accuracy goes up very fast, stays high, and then suddenly falls back?
Should I stop training at the maximum accuracy even if the validation loss is still decreasing? In other words, use val_acc or val_loss as metric to monitor for early stopping?
See images:
Loss: (green = val, blue = train)
Accuracy: (green = val, blue = train)
UPDATE:
The comments below pointed me in the right direction and I think I understand it better now. It would be nice if someone could confirm that following is correct:
the accuracy metric measures the percentage of cases where y_pred == y_true, and thus only makes sense for classification.
my data is a combination of real-valued and binary features. The reason why the accuracy graph goes up very steeply and then falls back, while the loss continues to decrease, is that around epoch 5000 the network probably predicted roughly 50% of the binary features correctly. When training continues, around epoch 12000 the prediction of the real-valued and binary features together improves, hence the decreasing loss, but the prediction of the binary features alone becomes a little less correct. Therefore the accuracy falls, while the loss decreases.
If the prediction is real-valued, or the data is continuous rather than discrete, then use MSE (Mean Squared Error), because the values are real-valued.
But in the case of discrete values, i.e. classification or clustering, use accuracy, because the targets are only 0 or 1. The concept of MSE is not really applicable there; rather, use accuracy = (number of correct predictions / total number of predictions) * 100.
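In Keras, choosing between the two comes down to the monitor argument of the EarlyStopping callback. A minimal self-contained sketch (the toy autoencoder and random data are only there to make it runnable; they are not the original model):

import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

x_train = np.random.rand(1000, 20).astype('float32')
x_val = np.random.rand(200, 20).astype('float32')

model = models.Sequential([
    layers.Dense(8, activation='relu', input_shape=(20,)),
    layers.Dense(20, activation='sigmoid'),
])
model.compile(optimizer='sgd', loss='mean_squared_error', metrics=['accuracy'])

# monitor the validation loss rather than the accuracy, and roll back
# to the best weights seen; monitor='val_accuracy' would stop on the
# accuracy curve instead, which is less meaningful for an autoencoder
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

model.fit(x_train, x_train, validation_data=(x_val, x_val),
          epochs=100, callbacks=[early_stop], verbose=0)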