I'm training a feed-forward neural network
(stochastic gradient descent, 3 small hidden layers, ELU activation, inputs scaled between 0 and 1, weights initialized according to TiRune's answer at
https://stats.stackexchange.com/questions/229885/whats-the-recommended-weight-initialization-strategy-when-using-the-elu-activat)
on a function that outputs values from about 0 to 55,000. I'm satisfied with the result; it learns to approximate the function pretty well.
But when I scale the outputs to be between 0 and 1 (just the outputs divided by 55,000), it stops learning pretty early and performs much worse. I tried various learning rates, of course.
Is there a reason it learns much better when the output values are between 0 and 55,000 than when they are between 0 and 1? Or does that not make any sense, and is my problem somewhere else?
If I understand correctly, the only difference between the networks is the output scaling (target scaling). For a complete answer, I will give a list of possible reasons, including the learning rate you mentioned:
How can scaling the outputs affect learning?
You may have a bug. If you scale the network's output, make sure you scale both the predictions and the real targets that you feed in during training, validation and testing (a short sketch follows after this list).
Your output activation function cannot produce values in the target range. For instance, a sigmoid outputs values between 0 and 1; scaling the targets to lie between 0 and 10 will hurt performance, since those targets simply cannot be produced.
Make sure you use correct data types. Normalization can be good, but if it causes loss of information due to the data type, you should normalize to a larger range. Truncation and rounding errors cause information loss.
Adjust the learning rate: normalization changes the values of the derivatives, and therefore the gradients propagated all the way through to the weight updates.
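For the first point above, a minimal sketch of consistent target scaling in plain NumPy (the 55,000 maximum and the function names are only illustrative, not taken from your code):

import numpy as np

# Illustrative only: scale targets into [0, 1], train against the scaled
# targets, and undo the scaling on predictions before reporting errors.
y_max = 55000.0                      # assumed maximum of the raw targets

def scale_targets(y_raw):
    return y_raw / y_max             # targets now roughly in [0, 1]

def unscale_predictions(y_pred_scaled):
    return y_pred_scaled * y_max     # back to the original range

# Training: compare the network's output against scale_targets(y_train).
# Evaluation: report unscale_predictions(network_output) so the error is
# measured on the original scale for train, validation and test alike.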
Good luck!
I have a neural network for regression, meaning the output is a real-valued number in the range 0 to 1.
I used dropout for all layers, and the errors suddenly increased and never converged.
Is dropout usable for a regression task? If we drop some nodes, the last layer will have fewer nodes, so the predicted value will definitely be very different from the actual value; the backpropagated error will be large and the model will be destroyed. So why should we use dropout for regression tasks in neural networks?
If we drop some nodes, the last layer will have fewer nodes, so the predicted value will definitely be very different from the actual value.
You are correct. This is why most frameworks scale up the activations of the surviving neurons during training (and apply no scaling at prediction time). This simple hack is effective and works well in most cases. However, it doesn't work quite as well for a regression task. It works well where the outputs of the activations only need to be correct relative to each other (as with a softmax). In regression the values are absolute, and the small differences between the "train" and "prediction" setups do occasionally cause mild instabilities.
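For reference, a minimal NumPy sketch of that scaling trick ("inverted" dropout); the function and variable names are illustrative, not from any particular framework:

import numpy as np

def dropout_forward(activations, drop_prob, training):
    # At prediction time (or with no dropout) the activations pass through.
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = np.random.rand(*activations.shape) < keep_prob
    # Scale the surviving activations by 1/keep_prob so their expected value
    # matches what the next layer sees at prediction time.
    return activations * mask / keep_prob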
It is always best to start with 0 dropout and then increase it slowly to observe which value gives the best result.
I used dropout for all layers, and the errors suddenly increased and never converged.
This also happens when you use too much dropout, especially in regression tasks. Have you tried reducing the dropout rate? Dropout is mainly recommended for layers with a very large number of trainable parameters. Also consider removing dropout from the last layer and checking again (see the sketch below).
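As a hedged Keras-style sketch of that advice (layer sizes, dropout rate and input width are made up for illustration): dropout only after the large hidden layer, and none right before the output.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),              # 20 input features (illustrative)
    layers.Dense(256, activation="relu"),  # many parameters: dropout here
    layers.Dropout(0.2),                   # start low and tune upwards
    layers.Dense(64, activation="relu"),
    layers.Dense(1),                       # linear output, no dropout before it
])
model.compile(optimizer="adam", loss="mse")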
I am working on a Neural Network model and I was wondering how I was supposed to scale my inputs.
For now, I am simply standardizing all the inputs to mean = 0 and standard deviation (std) = 1. However, my inputs are not all normally distributed: some are normally distributed and some are linearly distributed.
How should I handle and scale my inputs? Is it possible to standardize some inputs to mean = 0 and std = 1 and linearly scale the other inputs?
Thanks!
Paul
The only value that scaling adds when constructing a neural network is avoiding training problems and achieving faster convergence. Theoretically, you should get the same result whether or not you use scaling (or, in your case, whether you use different scaling methods for different input neurons). As long as you don't change the basic structure of your data, it should be fine.
Ideally you would use a single scaling method, irrespective of the distribution of your inputs, since scaling is meant to make all inputs comparable to each other. But you can choose different scaling methods for different inputs too; there isn't a single right answer to your question, and most answers will use some empirical information to choose the scaling method. It depends on your data.
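If you do want to mix methods, a minimal scikit-learn sketch (the column indices and toy data are made up for illustration): standardize the roughly normal columns and min-max scale the rest.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy data: 3 roughly normal columns and 2 columns on an arbitrary linear scale.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 3)), rng.uniform(0, 1000, size=(100, 2))])

preprocess = ColumnTransformer([
    ("zscore", StandardScaler(), [0, 1, 2]),  # mean = 0, std = 1
    ("minmax", MinMaxScaler(), [3, 4]),       # squashed into [0, 1]
])
X_scaled = preprocess.fit_transform(X)        # fit on training data only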
In practice, you should start your model creation by first training a net without scaling. If your net doesn't converge, you can try getting past the training issues by playing with the number of iterations and the initial weights. If that doesn't work, then proceed to scaling, and eventually to differentiated scaling.
Having a neural network with a lot of inputs causes problems: the network gets stuck, and the feed-forward calculation always gives an output of 1.0 because the weighted sums become too big, while during backpropagation the sums of the gradients become too large, which makes the learning updates far too dramatic.
The neural network uses tanh as the activation function in all layers.
After giving it a lot of thought, I came up with the following solutions:
Initializing smaller random weight values (WeightRandom / PreviousLayerNeuronCount)
or
After calculating the sum (of the outputs or of the gradients), dividing it by the number of neurons in the previous layer (for the output sum) or by the number of neurons in the next layer (for the gradient sum), and only then passing the sum into the activation/derivative function.
I don't feel comfortable with the solutions I came up with.
Solution 1 does not solve the problem entirely: the possibility of the gradient or output sum getting too high is still there. Solution 2 seems to solve the problem, but I fear that it changes the network's behaviour so much that it might no longer be able to solve some problems.
What would you suggest in this situation, keeping in mind that reducing the neuron count in the layers is not an option?
Thanks in advance!
General things that affect backpropagation include the initial choice of weights and biases, the number of hidden units, the number of training patterns, and the number of iterations. For choosing the initial weights and biases there are several algorithms that can be used, one of which is the Nguyen-Widrow algorithm. You can use it to initialize the weights and biases; I've tried it and it gives good results.
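For what it's worth, here is one common formulation of Nguyen-Widrow initialization as a NumPy sketch (the exact constants vary slightly between references, so treat this as an approximation rather than a definitive implementation):

import numpy as np

def nguyen_widrow_init(n_inputs, n_hidden, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Scale factor beta depends on the layer sizes.
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)
    # Start from small uniform random weights, then rescale each neuron's
    # weight vector so its norm equals beta.
    w = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
    w = beta * w / np.linalg.norm(w, axis=1, keepdims=True)
    b = rng.uniform(-beta, beta, size=n_hidden)
    return w, b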
I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to learn on something small first.
I created a network using backpropagation: an input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), and an output layer (1 node), which uses the sigmoid as the activation function for all layers. I'm trying to test it first on a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I pass these values to my network, my error function would be something like (target - network output). Would that be correct or accurate, in the sense that I'm taking the difference between the network output (which ranges from 0 to 1) and the target output (which is a large number)?
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and the target output values before feeding them to the network? Which normalisation function is best to use, since I've read about different methods of normalising? After getting the optimized weights and using them to test some data, I get an output value between 0 and 1 because of the sigmoid function. Should I convert the computed values back to the un-normalised/original form? Or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and am not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input into the interval (0, 1). It's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to set a linear function as the activation coming out of your last hidden layer (in your case also the first :) ).
I would also advise you not to use the sigmoid function as the activation in your hidden layers. It's much better to use tanh or ReLU nonlinearities. A detailed explanation (as well as some useful tips if you want to keep the sigmoid as your activation) can be found here.
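A hedged Keras-style sketch of the suggested setup for your c = sqrt(a^2 + b^2) experiment (layer sizes, epochs and the scaling constants are illustrative assumptions): a tanh hidden layer, a linear output, and inputs/targets normalized before training.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)
ab = rng.uniform(0, 10, size=(1000, 2))                  # inputs a and b
c = np.sqrt((ab ** 2).sum(axis=1, keepdims=True))        # target c

x = ab / 10.0            # inputs scaled to roughly [0, 1]
c_max = c.max()
y = c / c_max            # targets scaled to [0, 1]; keep c_max to undo later

model = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(8, activation="tanh"),  # nonlinearity only in the hidden layer
    layers.Dense(1),                     # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=50, verbose=0)
c_pred = model.predict(x, verbose=0) * c_max             # back to original scale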
It's also important to understand that the architecture of your network may not be suitable for the task you are trying to solve. You can learn a little about what different networks can represent here.
As for normalization: the main reason you should normalize your data is to avoid giving your network any spurious prior knowledge. Consider two variables, age and income. The first varies from, e.g., 5 to 90; the second from, e.g., 1,000 to 100,000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of the random initialization). Now suppose you are trying to solve a task where you need to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb on how to normalize your input data. One is to squash all inputs into the [0, 1] interval. Another is to make every variable have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
When it comes to normalizing the output, it's usually also useful to normalize it when you are solving a regression task (especially in the multiple-regression case), but it's not as crucial as for the inputs.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them only on the training set, and then apply them to the training, validation and test sets alike.
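A short scikit-learn sketch of that last point (the data here is synthetic, just to make the snippet runnable): the scaling parameters come from the training set only, are reused for the other splits, and are inverted on predictions.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y_train = rng.uniform(0, 55000, size=(800, 1))
y_test = rng.uniform(0, 55000, size=(200, 1))

scaler = MinMaxScaler()
y_train_scaled = scaler.fit_transform(y_train)  # parameters fitted on train only
y_test_scaled = scaler.transform(y_test)        # same parameters reused

# After training, scaled predictions are mapped back to the original range:
# y_pred = scaler.inverse_transform(y_pred_scaled)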
I'm using Matlab (github code repository). The details of the network are:
Hidden units: 100 (variable)
Epochs: 500
Batch size: 100
The weights are updated using the backpropagation algorithm.
I've been able to recognize 0, 1, 2, 3, 4, 5, 6, and 8, which I have drawn in Photoshop.
However, 7 and 9 are not recognized, yet when I run on the test set I get only 749/10000 wrong, and it correctly classifies 9251/10000.
Any idea what might be wrong? Because it is learning, and based on the test-set results it's learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);
outputWeights = rand(outputVectorSize,hiddenUnits);
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
outputWeights = outputWeights./size(outputWeights, 2);
This will make your weights very small, I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
which will generate random numbers in [-1, 1]. You can then try shrinking this interval to get smaller weights (try dividing by the square root of the layer size).
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is what you do if you use the squared loss function. For the log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big a difference, but you might want to try it.
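A small NumPy sketch of the two deltas for a sigmoid output unit (variable names are illustrative; the sign convention depends on whether your update adds or subtracts the delta):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.3, -1.2])    # pre-activations of the output layer
y = sigmoid(z)               # network outputs
t = np.array([1.0, 0.0])     # targets

delta_square_loss = (y - t) * y * (1 - y)  # squared error: multiply by f'(z)
delta_log_loss = (y - t)                   # log loss + sigmoid: f'(z) cancels out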
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer except the output layer should have a neuron that always passes the value 1 to the next layer. You can implement this by adding an extra feature to your input data that is always 1 (a tiny sketch follows).
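For illustration, in NumPy the constant-1 feature would look something like this (the data here is random placeholder input):

import numpy as np

X = np.random.rand(100, 784)                            # e.g. flattened images
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of 1s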
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic: small, with no regularization, no bias neurons, and no improvements over classic gradient descent. It's not surprising that it's not working too well; you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but this picture of KNN-nearest digits shows that some MNIST digits are simply hard to recognize.