Backpropagation learning fails to converge - neural-network

I use a neural network with 3 layers for categorization problem: 1) ~2k neurons 2) ~2k neurons 3) 20 neurons. My training set consists of 2 examples, most of the inputs in each example are zeros. For some reason after the backpropagation training the network gives virtually the same output for both examples (which is either valid for only 1 of examples or have 1.0 for outputs where one of example has 1s). It comes to this state after the first epoch and doesn't change much afterwards, even if learning rate is minimal double vale. I use sigmoid as activation function.
I thought it could be something wrong with my code so I've used AForge open source library, and seems like it suffers from the same issue.
What might be the problem here?
Solution: I've removed one layer and decreased the number of neurons in hidden layer to 800

2000 by 2000 by 20 is huge. That's approximately 4 million weights to determine, meaning the algorithm has to search a 4-million-dimensional space. Any optimization algorithm will be totally at a loss in this case. I'm assuming you're using gradient descent, which is not even that powerful, so likely the algorithm is stuck in a local optimum somewhere in this gigantic search space.
Simplify your model!
Added:
And please also describe in more detail what you're trying to do. Do you really have only 2 training examples? That's like trying to categorize 2 points using a 4-million-dimensional plane. It doesn't make sense to me.

You mentioned that most of the inputs are zero. To your reduce the size of your search space, try removing redundancy in your training examples. For instance if
trainingExample[0].inputValue[i] == trainingExample[1].inputValue[i]
then x.inputValue[i] has no information bearing data for the NN.
Also, perhaps it's not clear, but it seems that two training examples seem small.

Related

Neural networks: why does it work worse when you give it more neurons?

I just coded my first neural network and experimented a little with it... It's task is very simple: it should basically output the rounded number. It consists of one input neuron and one output neuron with 1 hidden layer consisting of 2 hidden neurons. At first I gave it about 2000 random generated training data sets.
When I gave it 3 hidden layers consisting of 10 hidden neurons. The results started to get worse and even after 10000 training sets it still output many wrong answers. The neural network with 2 hidden neurons worked way better.
Why does this happen? I thought the more neurons a neural network had, the better it would be...
So how do you find the best number of neurons and hiddenlayers?
If by "worse" you mean less accuracy on a test set, the problem is most likely overfitting.
In general, I can tell you this: More layers will fit more complicated functions on data. Maybe your data resembles a lot a straight line, so a simple linear function will do great. But imagine you try fitting a 6-th degree polynomial to the data. As you may know, high degree even functions go to infinity (+-) very fast, so this high degree model will predict too large values at the extremes.
In summary, your problem is most likely overfitting (high variance). You can check out several more intuitive explanations on the bias-variance tradeoff somewhere with graphs.
quick google search: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff

Using neural networks (MLP) for estimation

Im new with NN and i have this problem:
I have a dataset with 300 rows and 33 columns. Each row has 3 more columns for the results.
Im trying to use MLP for trainning a model so that when i have a new row, it estimates those 3 result columns.
I can easily reduce the error during trainning to 0.001 but when i use cross validation it keep estimating very poorly.
It estimates correctly if i use the same entry it used to train, but if i use another values that werent used for trainning the results are very wrong
Im using two hidden layers with 20 neurons each, so my architecture is [33 20 20 3]
For activation function im using biporlarsigmoid function.
Do you guys have some suggestion on where i could try to change to improve this?
Overfitting
As mentioned in the comments, this perfectly describes overfitting.
I strongly suggest reading the wikipedia article on overfitting, as it well describes causes, but I'll summarize some key points here.
Model complexity
Overfitting often happens when you model is needlessly complex for the problem. I don't know anything about your dataset, but I'm guessing [33 20 20 3] is more parameters than necessary for predicting.
Try running your cross-validation methods again, this time with either fewer layers, or fewer nodes per layer. Right now you are using 33*20 + 20*20 + 20*3 = 1120 parameters (weights) to make your prediction, is this necessary?
Regularization
A common solution to overfitting is regularization. The driving principle is KISS (keep it simple, stupid).
By applying an L1 regularizer to your weights, you keep preference for the smallest number of weights to solve your problem. The network will pull many weights to 0 as they aren't need.
By applying an L2 regularizer to your weights, you keep preference for lower rank solutions to your problem. This means that your network will prefer weights matrices that span lower dimensions. Practically this means your weights will be smaller numbers, and are less likely to be able to "memorize" the data.
What is L1 and L2? These are types of vector norms. L1 is the sum of the absolute value of your weights. L2 is the sqrt of the sum of squares of your weights. (L3 is the cubed root of the sum of cubes of weights, L4 ...).
Distortions
Another commonly used technique is to augment your training data with distorted versions of your training samples. This only makes sense with certain types of data. For instance images can be rotated, scaled, shifted, add gaussian noise, etc. without dramatically changing the content of the image.
By adding distortions, your network will no longer memorize your data, but will also learn when things look similar to your data. The number 1 rotated 2 degrees still looks like a 1, so the network should be able to learn from both of these.
Only you know your data. If this is something that can be done with your data (even just adding a little gaussian noise to each feature), then maybe this is worth looking into. But do not use this blindly without considering the implications it may have on your dataset.
Careful analysis of data
I put this last because it is an indirect response to the overfitting problem. Check your data before pumping it through a black-box algorithm (like a neural network). Here are a few questions worth answering if your network doesn't work:
Are any of my features strongly correlated with each other?
How do baseline algorithms perform? (Linear regression, logistic regression, etc.)
How are my training samples distributed among classes? Do I have 298 samples of one class and 1 sample of the other two?
How similar are my samples within a class? Maybe I have 100 samples for this class, but all of them are the same (or nearly the same).

Why is softmax not used in hidden layers [duplicate]

This question already has answers here:
Why use softmax only in the output layer and not in hidden layers?
(5 answers)
Closed 5 years ago.
I have read the answer given here. My exact question pertains to the accepted answer:
Variables independence : a lot of regularization and effort is put to keep your variables independent, uncorrelated and quite sparse. If you use softmax layer as a hidden layer - then you will keep all your nodes (hidden variables) linearly dependent which may result in many problems and poor generalization.
What are the complications that forgoing the variable independence in hidden layers arises? Please provide at least one example. I know hidden variable independence helps a lot in codifying the backpropogation but backpropogation can be codified for softmax as well (Please verify if or not i am correct in this claim. I seem to have gotten the equations right according to me. hence the claim).
Training issue: try to imagine that to make your network working better you have to make a part of activations from your hidden layer a little bit lower. Then - automaticaly you are making rest of them to have mean activation on a higher level which might in fact increase the error and harm your training phase.
I don't understand how you achieve that kind of flexibility even in sigmoid hidden neuron where you can fine tune the activation of a particular given neuron which is precisely what the gradient descent's job is. So why are we even worried about this issue. If you can implement the backprop rest will be taken care of by gradient descent. Fine tuning the weights so as to make the activations proper is not something you, even if you could do, which you cant, would want to do. (Kindly correct me if my understanding is wrong here)
mathematical issue: by creating constrains on activations of your model you decrease the expressive power of your model without any logical explaination. The strive for having all activations the same is not worth it in my opinion.
Kindly explain what is being said here
Batch normalization: I understand this, No issues here
1/2. I don't think you have a clue of what the author is trying to say. Imagine a layer with 3 nodes. 2 of these nodes have an error responsibility of 0 with respect to the output error; so there is óne node that should be adjusted. So if you want to improve the output of node 0, then you immediately affect nodes 1 and 2 in that layer - possibly making the output even more wrong.
Fine tuning the weights so as to make the activations proper is not something you, even if you could do, which you cant, would want to do. (Kindly correct me if my understanding is wrong here)
That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.
3. Your basically saying to every neuron 'hey, your output cannot be higher than x, because some other neuron in this layer already has value y'. Because all neurons in a softmax layer should have a total activation of 1, it means that neurons cannot be higher than a specific value. For small layers - small problem, but for big layers - big problem. Imagine a layer with 100 neurons. Now imagine their total output should be 1. The average value of those neurons will be 0.01 -> that means you are making networks connection relying (because activations will stay very low, averagely) - as other activation functions output (or take on input) of range (0:1 / -1:1).

Neural Network - Working with a imbalanced dataset

I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4 , with label '0' appearing 250 times more than label '1'. However, I have a lot of training samples : around 23 millions. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether NN could be efficient to handle this kind of imbalanced dataset with a large dataset ?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha number. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model is predicting way too many 1 that it should be - around 0.50 of label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too as I did it manually for now : tf.nn.weighted_cross_entropy_with_logitsthat
2) I will try to de-unbalance the dataset but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~ 100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways for imbanlanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax+cross-entropy or Hinge Loss you can as #chasep255 mentionned make it more costly for the network to misclassify the example that appear the less.
To do that simply split the cost into two parts and put more weights on the class that have fewer examples.
For simplicity if you say that the dominant class is labelled negative (neg) for softmax and the other the positive (pos) (for Hinge you could exactly the same):
L=L_{neg}+L_{pos} =>L=L_{neg}+\alpha*L_{pos}
With \alpha greater than 1.
Which would translate in tensorflow for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0,1] to something like :
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
Whatismore by digging a bit into Tensorflow API you seem to have a tensorflow function tf.nn.weighted_cross_entropy_with_logitsthat implements it did not read the details but look fairly straightforward.
Another way if you train your algorithm with mini-batch SGD would be make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - neural network could help in your case. There are at least two approaches to such problem:
Leave your set not changed but decrease the size of batch and number of epochs. Apparently this might help better than keeping the batch size big. From my experience - in the beginning network is adjusting its weights to assign the most probable class to every example but after many epochs it will start to adjust itself to increase performance on all dataset. Using cross-entropy will give you additional information about probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during evaluation phase using Bayes rule:score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html

Is there a rule/good advice on how big a artificial neural network should be?

My last lecture on ANN's was a while ago but I'm currently facing a project where I would want to use one.
So, the basics - like what type (a mutli-layer feedforward network), trained by an evolutionary algorithm (thats a given by the project), how many input-neurons (8) and how many ouput-neurons (7) - are set.
But I'm currently trying to figure out how many hidden layers I should use and how many neurons in each of these layers (the ea doesn't modify the network itself, but only the weights).
Is there a general rule or maybe a guideline on how to figure this out?
The best approach for this problem is to implement the cascade correlation algorithm, in which hidden nodes are sequentially added as necessary to reduce the error rate of the network. This has been demonstrated to be very useful in practice.
An alternative, of course, is a brute-force test of various values. I don't think simple answers such as "10 or 20 is good" are meaningful because you are directly addressing the separability of the data in high-dimensional space by the basis function.
A typical neural net relies on hidden layers in order to converge on a particular problem solution. A hidden layer of about 10 neurons is standard for networks with few input and output neurons. However, a trial an error approach often works best. Since the neural net will be trained by a genetic algorithm the number of hidden neurons may not play a significant role especially in training since its the weights and biases on the neurons which would be modified by an algorithm like back propogation.
As rcarter suggests, trial and error might do fine, but there's another thing you could try.
You could use genetic algorithms in order to determine the number of hidden layers or and the number of neurons in them.
I did similar things with a bunch of random forests, to try and find the best number of trees, branches, and parameters given to each tree, etc.