Scaling inputs on different scales in Neural Networks

I am working on a Neural Network model and I was wondering how I was supposed to scale my inputs.
For now, I am simply standardizing all the inputs to mean = 0 and standard deviation (std) = 1. However, my inputs are not all normally distributed: some are normally distributed and some are linearly distributed.
How should I handle and scale my inputs? Is it possible to scale some inputs to mean = 0 and std = 1 and linearly scale the others?
Thanks!
Paul

The only value that scaling adds in the construction of a neural network is avoiding training problems and speeding up convergence. Theoretically, you should get the same result regardless of whether you scale (or, in your case, use different scaling methods for different input neurons). As long as you don't change the basic structure of your data, it should be fine.
Ideally you should use a single scaling method, irrespective of the distribution of your inputs, as scaling is meant to make all inputs comparable to each other. But you can choose different scaling methods for different inputs (one way is shown in the sketch below). There isn't a single right answer to your question, and most answers will rely on empirical information to choose the scaling method: it depends on your data.
As a practical matter, you can start model creation by first training a net without scaling. If your net doesn't converge, try to get past the training issues by playing with the number of iterations and the initial weights. If that doesn't work, proceed to scaling, and eventually to differentiated scaling.
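For illustration, a minimal Python sketch of mixed scaling with scikit-learn, assuming (hypothetically) that the first two columns are roughly normal and the last two are linearly/uniformly distributed; the data and column indices are placeholders:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.random.rand(100, 4)  # placeholder data: 100 samples, 4 features

scaler = ColumnTransformer([
    ("standardize", StandardScaler(), [0, 1]),  # mean = 0, std = 1
    ("minmax", MinMaxScaler(), [2, 3]),         # linear rescale to [0, 1]
])

X_scaled = scaler.fit_transform(X)  # fit on training data only; reuse the fitted scaler on test data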

Related

Dropout in a regression task for a neural network

I have a neural network for regression, meaning that the output is a real-valued number in the range 0 to 1.
I used dropout for all layers, and the errors suddenly increased and never converged.
Is dropout usable for a regression task? If we disregard some nodes, the last layer will have fewer nodes and the predicted value will definitely be different from the actual value, so the backpropagated error will be large and the model will be destroyed. Why, then, should we use dropout for a regression task in neural networks?
Because if we disregard some nodes, the last layer will have fewer nodes and the predicted value will definitely be different from the actual value.
You are correct. Hence most frameworks scale up the activations of the surviving neurons during training, and don't at prediction time (so-called "inverted" dropout). This simple hack is effective and works well in most cases. However, it doesn't work as well for a regression task. It works well where the outputs of an activation can be relative to each other (like softmax); in regression the values are absolute, and the small differences between the "train" and "prediction" setups do cause mild instabilities on occasion.
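For concreteness, here is a minimal numpy sketch of that train-time rescaling (inverted dropout); the keep probability is a placeholder:

import numpy as np

def inverted_dropout(activations, keep_prob=0.8, training=True):
    # Drop units at train time and rescale the survivors so the expected
    # activation magnitude matches prediction time.
    if not training:
        return activations  # no dropout, no rescaling at prediction time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob  # scale surviving units up by 1/keep_prob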
It is always best to start with a dropout of 0 and then increase it slowly, observing what value gives the best result.
I used dropout for all layers, and the errors suddenly increased and never converged.
This also happens when you use too much dropout, especially in regression tasks. Did you try reducing the dropout rate? Dropout is recommended for layers with a very high number of trainable parameters. Also consider removing dropout from the last layer, then check again.
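As a sketch of that placement, assuming Keras (the layer sizes and the 0.2 rate are placeholders to tune):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.2),   # start small; increase only if it helps
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1),       # linear regression output; no dropout here
])
model.compile(optimizer="adam", loss="mse")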

Neural Network learns better at high output values

I'm training a feed forward neural network
(stochastic gradient descent, 3 small hidden layers, ELU activation, inputs scaled between 0 and 1, weights initialized according to TiRune from
https://stats.stackexchange.com/questions/229885/whats-the-recommended-weight-initialization-strategy-when-using-the-elu-activat)
on a function that outputs values from about 0 to 55,000. I'm satisfied with the result; it learns to approximate the function pretty well.
But when I scale the outputs to be between 0 and 1 (just the outputs divided by 55,000), it stops learning pretty early and performs much worse. I tried various learning rates, of course.
Is there a reason it learns much better when the output values are between 0 and 55,000 than when they are between 0 and 1? Or does it not make sense, and is my problem somewhere else?
If I understand correctly, the only difference between the networks is the output scaling (target scaling). For a complete answer, I will give a list of possible reasons, including the learning rate you mentioned. How can scaling the outputs affect learning?
1) You may have a bug. If you scale the network's output, make sure you scale both the predictions and the real targets that you feed during training, validation and testing (see the sketch after this list).
2) Your output activation function may be unable to reach the target range. For instance, sigmoids can only output values between 0 and 1, so scaling output values to between 0 and 10 will reduce performance since those targets cannot be produced.
3) Make sure you use correct data types. Normalization can be good, but if normalization causes loss of information due to data types, you should normalize to a larger range; truncation and rounding errors cause information loss.
4) Adjust the learning rate: normalization changes the derivative values, and therefore the gradients propagated all the way to the weight updates.
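A minimal sketch of point 1), assuming targets in [0, 55000]; model stands for any regressor with fit/predict, and the names are illustrative:

Y_MAX = 55000.0

def scale_targets(y):
    return y / Y_MAX           # train on targets in [0, 1]

def unscale_predictions(y_hat):
    return y_hat * Y_MAX       # report errors on the original scale

# training:   model.fit(X_train, scale_targets(y_train))
# evaluation: preds = unscale_predictions(model.predict(X_test))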
Good luck!

Neural Networks Regression: scaling the outputs or using a linear layer?

I am currently trying to use Neural Network to make regression predictions.
However, I don't know what the best way to handle this is, as I read that there are 2 different ways to do regression predictions with a NN.
1) Some websites/articles suggest to add a final layer which is linear.
http://deeplearning4j.org/linear-regression.html
I think my final layers would look like:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get a prediction which is the mean of the batch predictions. And this is the case whether I use tanh or sigmoid as the penultimate layer.
2) Some other websites/articles suggest scaling the output to a [-1,1] or [0,1] range and using tanh or sigmoid as the final layer.
Are these 2 solutions acceptable? Which one should one prefer?
Thanks,
Paul
I would prefer the second case, in which we use normalization with the sigmoid function as the output activation and then scale the normalized output values back to their actual range. This is because, in the first case, to output large values (since actual values are large in most cases), the weights mapping from the penultimate layer to the output layer would have to be large, and thus, for faster convergence, the learning rate would have to be made larger. But this may also cause the learning of the earlier layers to diverge, since we are using a larger learning rate. Hence, it is advisable to work with normalized target values, so that the weights stay small and learn quickly.
In short, the first method learns slowly or may diverge if a larger learning rate is used, while the second method is comparatively safer to use and learns quickly.
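For illustration, a hedged Keras sketch of both heads (the layer sizes and input dimension are placeholders):

from tensorflow import keras
from tensorflow.keras import layers

# Option 1: linear (identity) output layer, trained on raw targets
linear_head = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="tanh"),
    layers.Dense(1),  # identity activation by default
])

# Option 2: sigmoid output layer, trained on targets scaled into [0, 1],
# with predictions scaled back afterwards
sigmoid_head = keras.Sequential([
    keras.Input(shape=(10,)),
    layers.Dense(64, activation="tanh"),
    layers.Dense(1, activation="sigmoid"),
])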

ANN different results for same train-test sets

I'm implementing a neural network for a supervised classification task in MATLAB.
I have a training set and a test set to evaluate the results.
The problem is that every time I train the network on the same training set I get very different results (sometimes 95% classification accuracy and sometimes around 60%) on the same test set.
Now I know this is because I get different initial weights, and I know that I can use a 'seed' to fix the initial weights, but the question is: what does this say about my data, and what is the right way to look at this? How do I define the accuracy I'm getting with my designed ANN? Is there a protocol for this (like running the ANN 50 times and taking an average accuracy, or something)?
Thanks
Make sure your test set is large enough compared to the training set (e.g. 10% of the overall data) and check it for diversity. If your test set only covers very specific cases, this could be a reason. Also make sure you always use the same test set. Alternatively, you should google the term cross-validation.
Furthermore, good training set accuracy combined with bad test set accuracy is a sign of overfitting. Try applying regularization, like a simple L2 weight decay (simply multiply your weight matrices by e.g. 0.999 after each weight update; a sketch follows below). Depending on your data, dropout or L1 regularization could also help (especially if you have a lot of redundancy in your input data). Also try choosing a smaller network topology (fewer layers and/or fewer neurons per layer).
To speed up training, you could also try alternative learning algorithms like RPROP+, RPROP- or RMSProp instead of plain backpropagation.
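A minimal numpy sketch of that weight-decay trick (0.999 is the factor suggested above; weights stands for your network's weight matrices):

DECAY = 0.999

def apply_weight_decay(weights, decay=DECAY):
    for W in weights:          # one numpy array per layer
        W *= decay             # shrink in place after each weight update
    return weights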
It looks like your ANN is not converging to an optimal set of weights. Without further details of the ANN model I cannot pinpoint the problem, but I would try increasing the number of iterations.
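As for a protocol, a hedged sketch of the "run it N times and average" idea; train_and_evaluate is a hypothetical function wrapping your ANN's training and test-set evaluation:

import numpy as np

def repeated_accuracy(train_and_evaluate, n_runs=50):
    accs = [train_and_evaluate(seed=s) for s in range(n_runs)]
    return np.mean(accs), np.std(accs)  # report mean and spread, not a single run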

Neural Network Approximation Function

I'm trying to test the efficiency of neural networks as approximation functions.
The function I need to approximate has 5 inputs and 1 output; which structure should I use?
I have no idea what criteria should be applied to decide the number of hidden layers and the number of nodes in each layer.
Thank you in advance,
Regards
Giuseppe.
I always use a single hidden layer. Theoretically, any function that can be approximated by 2 or more hidden layers can also be approximated by one (per the universal approximation theorem, for continuous functions). To make a single hidden layer more expressive, add more hidden nodes.
Typically, the number of hidden nodes is varied to observe the effect on model performance (as measured by accuracy or another metric). Too few hidden nodes results in a worse fit due to underfitting (the neural network's output function is too simple and misses important details in the data); too many hidden nodes results in a worse fit due to overfitting (the neural network becomes so flexible that it chases every bit of noise in the data). A sketch of such a sweep follows.
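A hedged sketch of that sweep for a 5-input, 1-output problem, using scikit-learn's MLPRegressor (the data is a placeholder for your real function):

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = X.sum(axis=1)  # stand-in for the function to approximate

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for n_hidden in (4, 8, 16, 32, 64):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    print(n_hidden, net.score(X_val, y_val))  # validation R^2 per width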
Note that for classification problems you need at least 2 hidden layers if you want to separate concave polygons.
I'm not sure how the number of hidden layers affects function approximation.