Rescale Input Feature in a Neural Network - neural-network

I'm reading through the Make Your Own Neural Network book, and in the example showing how to classify handwritten digits, the text says that the input color values, which are in the range 0 to 255, will be rescaled to the much smaller range of 0.01 to 1.0. A few questions on this!
What is wrong with using the actual range, which is 0 to 255? What would rescaling bring me?
Does this mean that if I rescale my training set and train my model with this rescaled data, I should then also use rescaled test data?
Any arguments please?

Rescaling the data will lead to faster convergence when using methods like gradient descent. Also, when your dataset has features that vary widely in magnitude, using a method that relies on Euclidean distance can lead to bad results. To avoid this, scaling the features to a range such as 0.0 to 1.0 is a wise choice.
For the second question: yes, you should rescale the test data as well, using exactly the same transformation you applied to the training data.
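For illustration, here is a minimal NumPy sketch of that kind of rescaling (the array names are placeholders); it uses the (value / 255 * 0.99) + 0.01 mapping the book describes, applied identically to both sets:

import numpy as np

# Toy pixel intensities in [0, 255]; the arrays are made-up examples
train_pixels = np.array([[0, 128, 255], [34, 200, 90]], dtype=float)
test_pixels = np.array([[12, 255, 60]], dtype=float)

def rescale(x):
    # Map raw values from [0, 255] onto [0.01, 1.0]
    return (x / 255.0) * 0.99 + 0.01

# The exact same transformation is applied to training and test data
train_scaled = rescale(train_pixels)
test_scaled = rescale(test_pixels)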

Related

Multiclass classification or regression?

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images and every image is rated by more than 100 subjects. The mean score is calculated and the scores are normalized.
The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model, assigning an appropriate weight to each class since the data is imbalanced.
My question:
For this problem, the scores are continuous, i.e., 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, since most of the data points lie between 0.4 and 0.6?
Thanks!
Since your labels are continuous, you could divide them into 10 equal quantiles using a function like pandas.qcut() and assign a label to each bin. This turns the regression problem into a classification problem.
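For example, a rough sketch (the scores array here is made up):

import numpy as np
import pandas as pd

# Hypothetical normalized mean scores, one per image
scores = np.random.beta(5, 5, size=200_000)

# Split the continuous scores into 10 equal-sized quantile bins and
# use the bin index (0..9) as the class label
labels = pd.qcut(scores, q=10, labels=False)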
As far as the imbalance is concerned, you may want to oversample the minority classes. This helps ensure your model is not biased towards the majority classes.
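One way to do that, sketched here with scikit-learn's resample helper (X and y are placeholder arrays; libraries such as imbalanced-learn offer ready-made alternatives):

import numpy as np
from sklearn.utils import resample

def oversample(X, y, random_state=0):
    # Upsample every class (with replacement) to the size of the largest class
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for c in classes:
        X_up, y_up = resample(X[y == c], y[y == c], replace=True,
                              n_samples=target, random_state=random_state)
        X_parts.append(X_up)
        y_parts.append(y_up)
    return np.concatenate(X_parts), np.concatenate(y_parts)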
Hope this helps.
I would recommend doing a histogram equalization over ALL of your participants' data first, so that their ratings are distributed equally.
Then, for each image in your training set, calculate the expected value (and, if you also want to, the variance). The expected value is just the mean of the votes. For the variance there are standard functions in (almost) every programming language that take an array of votes and output the variance.
Now take the expected value (and, if you want, also the variance) as the ground truth for your network.
EDIT: Histogram Equalization:
Histogram equalization is a method to use the given numerical range as efficiently as possible.
In the context of images, this would change the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255. Furthermore, every grayscale value is redistributed so that, on average, each value occurs as often as every other. For your dataset you want the same, even though your values are not from 0 to 255 but from 0 to 10. Also, you don't need to (and shouldn't) round the resulting values to integers. In this way, votes that occur often are spread out more, and votes that occur rarely are contracted.
Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
This way the CNN should be able to better differentiate those small differences.
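A rough NumPy sketch of the idea, assuming a hypothetical votes matrix of shape (n_images, n_raters); a rank transform is one simple way to equalize the histogram of the per-image means without rounding:

import numpy as np
from scipy.stats import rankdata

votes = np.random.normal(5.0, 1.0, size=(1000, 100))  # made-up ratings

mean_score = votes.mean(axis=1)  # expected value per image
var_score = votes.var(axis=1)    # variance per image, if you want it

# Rank-based histogram equalization: spread the means uniformly over [0, 10]
ranks = rankdata(mean_score)
equalized = (ranks - 1) / (len(ranks) - 1) * 10.0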

Neural Network learns better at high output values

I'm training a feed forward neural network
(stochastic gradient descent, 3 small hidden layers, elu activation, inputs scaled between 0 and 1, weights initialized according to TiRune from
https://stats.stackexchange.com/questions/229885/whats-the-recommended-weight-initialization-strategy-when-using-the-elu-activat)
on a function that outputs values from about 0 to 55,000. I'm satisfied with the result; it learns to approximate the function pretty well.
But when I scale the outputs to be between 0 and 1 (just the outputs divided by 55,000) it stops learning pretty early and performs much worse. I tried various learning rates, of course.
Is there a reason it learns much better when the output values are between 0 and 55,000 than when they are between 0 and 1? Or does this not make any sense and my problem is somewhere else?
If I understand correctly, the only difference between the networks is the output scaling (target scaling). For a complete answer, I will give a list of possible reasons, including the learning rate you mentioned:
How can scaling the outputs affect learning?
You may have a bug. If you scale the network's output, make sure you scale both the predictions and the real targets that you feed in during training, validation and testing (see the sketch after this list).
Your output activation function may not be able to reach the target range. For instance, a sigmoid can only output values between 0 and 1, so scaling targets to the range 0 to 10 will reduce performance since those targets cannot be produced.
Make sure you use correct data types. Normalization can be good, but if normalization causes loss of information due to data types, you should normalize to a larger range. Truncation and rounding errors will cause information loss.
Adjust the learning rate: normalization changes the values of the derivatives, and therefore the gradients propagated all the way to the weight updates.
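To make the first and last points concrete, here is a minimal sketch of consistent target scaling (the constant 55,000 is taken from the question; everything else is illustrative):

TARGET_MAX = 55_000.0  # assumed maximum of the raw targets

def scale_targets(y):
    # Map raw targets from [0, 55000] to [0, 1] for training
    return y / TARGET_MAX

def unscale_predictions(y_scaled):
    # Map network outputs back to the original range before computing metrics
    return y_scaled * TARGET_MAX

# Train on scale_targets(y_train), evaluate unscale_predictions(model_output)
# against the raw y_test, keep a linear output activation so the whole target
# range is reachable, and re-tune the learning rate after rescaling.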
Good luck!

TensorFlow: Binary classification accuracy

In the context of binary classification, I use a neural network with 1 hidden layer using a tanh activation function. The input comes from a word2vec model and is normalized.
The classifier accuracy is between 49% and 54%.
I used a confusion matrix to get a better understanding of what's going on. I studied the impact of the number of input features and of the number of neurons in the hidden layer on the accuracy.
What I can observe from the confusion matrix is that, depending on the parameters, the model sometimes predicts most of the samples as positive and sometimes most of them as negative.
Any suggestions as to why this happens? And which other factors (other than input size and hidden layer size) might impact the accuracy of the classification?
Thanks
It's a bit hard to guess given the information you provide.
Are the labels balanced (50% positives, 50% negatives)? If so, this would mean your network is not training at all, as your performance roughly corresponds to random guessing. Is there maybe a bug in the preprocessing? Or is the task too difficult? What is the training set size?
I don't believe that the number of neurons is the issue, as long as it's reasonable, i.e. hundreds or a few thousand.
Alternatively, you can try another loss function, namely cross entropy, which is standard for multi-class classification and can also be used for binary classification:
https://www.tensorflow.org/api_docs/python/nn/classification#softmax_cross_entropy_with_logits
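For instance, a minimal sketch of how the linked function might be wired up for a two-class problem (the tensors are made-up examples; exact call signatures vary slightly between TensorFlow versions):

import tensorflow as tf

# logits: raw, un-activated outputs of the last layer, shape (batch, 2)
# labels: one-hot targets, shape (batch, 2)
logits = tf.constant([[2.0, -1.0], [0.3, 0.8]])
labels = tf.constant([[1.0, 0.0], [0.0, 1.0]])

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))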
Hope this helps.
The data set is well balanced, 50% positive and negative.
The training set shape is (411426, X).
The test set shape is (68572, X).
X is the number of features coming from word2vec, and I tried values between 100 and 300.
I have 1 hidden layer, and the number of neurons I tested varied between 100 and 300.
I also tested with much smaller feature/neuron sizes: 2-20 features and 10 neurons in the hidden layer.
I also use cross entropy as the cost function.

How to normalize close range data?

I use logistic regression. I have some features whose values are between 0 and 1 (the maximum value the function can produce is 1 and the minimum is 0), but in both the training and test data the maximum value is very low (e.g. 0.11), so all the values are low and close to each other. My question is: what is the best standard way to normalize/transform the feature values to a normal scale (between 0 and 1) so that the logistic regression isn't affected by inappropriate values?
Any help would be highly appreciated.
There are different methods for feature scaling/normalization.
If you just want the feature values to be in the range [0..1], do the following (min-max scaling) for each feature: x' = (x - min(x)) / (max(x) - min(x)).
Some tutorials recommend scaling features into the range [-0.5 .. 0.5] using mean normalization: x' = (x - mean(x)) / (max(x) - min(x)).
I prefer to scale features by their standard deviation, as explained in the Stanford lectures (see the chapter Preprocessing your data): x' = (x - mean(x)) / std(x).
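A small NumPy sketch of these three options, applied per feature column; whichever you pick, compute the statistics on the training data and reuse them on the test data:

import numpy as np

def minmax_01(x):
    # Scale a feature column to [0, 1]
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    # Scale a feature column to roughly [-0.5, 0.5]
    return (x - x.mean()) / (x.max() - x.min())

def standardize(x):
    # Zero mean, unit standard deviation
    return (x - x.mean()) / x.std()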

Why does my neural network trained on MNIST data set not predict 7 and 9 correctly?

I'm using Matlab (GitHub code repository). The details of the network are:
Hidden units: 100 (variable)
Epochs: 500
Batch size: 100
The weights are updated using the backpropagation algorithm.
I've been able to recognize 0, 1, 2, 3, 4, 5, 6 and 8, which I have drawn in Photoshop.
However, 7 and 9 are not recognized, yet when running on the test set I get only 749/10000 wrong and it correctly classifies 9251/10000.
Any idea what might be wrong? It is learning, and based on the test set results it's learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);      % uniform values in [0, 1]
outputWeights = rand(outputVectorSize,hiddenUnits);     % uniform values in [0, 1]
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);  % divide by the fan-in
outputWeights = outputWeights./size(outputWeights, 2);  % divide by the fan-in
This will make your weights very small I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
This will generate random numbers in [-1, 1]. You can then scale this interval down to get smaller weights (try dividing by the square root of the layer size).
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is what you do with the squared loss function. For log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big of a difference, but you might want to try it.
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer, except the output layer, should have a neuron that only returns the value 1 to the next layer. You can implement this by adding another feature to your input data that is always 1.
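A tiny illustration of that trick (sketched in NumPy rather than Matlab, purely as an example):

import numpy as np

X = np.random.rand(5, 3)  # hypothetical design matrix: 5 samples, 3 features

# Append a constant feature of 1s so the next layer can learn a bias term
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])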
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic and small, with no regularization, no bias neurons and no improvements to classic gradient descent. It's not surprising that it's not working too well: you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but a picture of KNN nearest-neighbour digits shows that some MNIST digits are simply hard to recognize.