Validation loss rising before falling again - neural-network

I've been getting some strange behaviour on my training losses, and I don't know what's causing it. Axes are loss vs epochs.
There are two things going on here: first, the validation loss starts coming down nicely with the training loss, and then they begin to diverge strongly. I'm assuming this is some form of overfitting, even though the validation loss comes back down at a later point. Is that correct?
Then, the validation loss comes back down to meet the training loss; here, it coincided with a huge spike in training loss.
Does anyone have any insight into what's causing this, and what can be done to make sure the losses go down, and do so smoothly?
This was obtained using the Adam optimiser, in this case on a convolutional autoencoder, but I've also had this on an LSTM.

I have had similar experiences on unbalanced datasets, where the model tended to output only one class. Then it inverted the decision boundary and the performance was abysmal for a single epoch. I suggest looking for irregularities in the dataset. This could be an imbalance between the number of samples in each class, or it could be that the ground truth is improperly set.
Anyway, it is hard to judge accurately what is going wrong without knowing the use case and/or the data.
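A quick way to check for the imbalance mentioned above is to count the labels. This is only a minimal sketch: the label list is made up for illustration, and the helper names are hypothetical, not from any library.

```python
from collections import Counter


def class_balance(labels):
    """Return (class, share of dataset) pairs, most common class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(cls, n / total) for cls, n in counts.most_common()]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common class."""
    return class_balance(labels)[0][1]


# Hypothetical labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
shares = class_balance(labels)
baseline = majority_baseline(labels)
print(shares, baseline)
```

If the baseline is close to the accuracy the model reports, the model may simply be predicting a single class.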

Related

Is running more epochs really a direct cause of overfitting?

I've seen some comments in online articles/tutorials or Stack Overflow questions which suggest that increasing the number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between the number of epochs and overfitting. So I'm looking for an answer which explains whether I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example to compare a deep model with a simpler model:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity in your words, or degrees of freedom) to train.
But if you use a very simple deep model with layer 1: 10 units, layer 2: 10 units, layer 3: 5 units, layer 4: 2 units, you will end up with (10*10 + 10*10 + 10*5 + 5*2 = 260) weights to train, before counting biases.
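To make the count concrete, here is a small sketch that counts the parameters of a fully connected network. It assumes the 10 input variables feed layers of 10, 10, 5 and 2 units, and counts every layer-to-layer block including the 10*5 one; the function name is hypothetical.

```python
def count_parameters(layer_sizes, include_bias=True):
    """Count trainable parameters of a fully connected network.

    layer_sizes lists the input width followed by each layer's unit count.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection
        if include_bias:
            total += fan_out       # one bias per unit
    return total


sizes = [10, 10, 10, 5, 2]  # 10 inputs, then the four layers
weights_only = count_parameters(sizes, include_bias=False)
with_bias = count_parameters(sizes)
logistic = count_parameters([50, 1], include_bias=False)
print(weights_only, with_bias, logistic)
```

Under these assumptions the small network has several times more parameters than the 50-feature logistic regression, even though it starts from only 10 raw variables.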
Therefore, when we train a neural net for a long time, we usually end up with a memorized version of our data set (this gets worse if the data set is small and easy to memorize).
But as you also mentioned, there is no intrinsic reason why a higher number of epochs should result in overfitting. Early stopping is usually a very good way of avoiding this. Just set the patience to 5-10 epochs.
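Patience-based early stopping can be sketched in a few lines of plain Python. The class below is a hypothetical helper, not any framework's callback, and the validation losses are made up; in practice you would also save the model weights whenever a new best loss is seen.

```python
class EarlyStopping:
    """Stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.71, 0.9, 1.1, 1.3]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best)
```

The patience window is what lets training ride out a temporary bump in validation loss instead of stopping at the first bad epoch.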
If the number of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse), then running over the same data multiple times will not matter that much, since you will be learning some features of your problem rather than just memorizing the training data set. The problem arises when the number of parameters is comparable to your training data set size (or bigger). It is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, L1 regularizer, constraining certain connections to be 0 or equal, as in CNNs).
The problem is that you might still be left with too many trainable parameters. A simple way to regularize even further is to use a small learning rate (i.e. don't learn too much from this particular example lest you memorize it) combined with monitoring the epochs (if the gap between validation and training accuracy grows large, you are starting to overfit your model). You can then use the gap information to stop your training. This is a version of what is known as early stopping (stop before you reach the minimum of your loss function).

How to interpret the discriminator's loss and the generator's loss in Generative Adversarial Nets?

I am reading people's implementation of DCGAN, especially this one in tensorflow.
In that implementation, the author draws the losses of the discriminator and of the generator, which is shown below (images come from https://github.com/carpedm20/DCGAN-tensorflow):
Neither the discriminator loss nor the generator loss seems to follow any pattern, unlike general neural networks, whose loss decreases as the number of training iterations increases. How should the losses be interpreted when training GANs?
Unfortunately, like you've said, for GANs the losses are very non-intuitive. Mostly it comes down to the fact that the generator and the discriminator are competing against each other, so an improvement in one means a higher loss for the other, until the other learns from that loss and improves, which in turn hurts its competitor, and so on.
Now one thing that should happen often enough (depending on your data and initialisation) is that both the discriminator and generator losses converge to some stable values, like this:
(it's ok for the loss to bounce around a bit - it's just evidence of the model trying to improve itself)
This loss convergence would normally signify that the GAN model has found some optimum where it can't improve any more, which should also mean that it has learned well enough. (Also note that the numbers themselves usually aren't very informative.)
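As a rough reference point, and assuming the standard binary cross-entropy GAN losses (the thread does not say which loss the plotted implementation uses), a discriminator that cannot tell real from generated samples outputs 0.5 for everything. The sketch below computes the loss values that correspond to that situation; some implementations average the two discriminator terms instead of summing them, which halves the first number.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single prediction strictly inside (0, 1)."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


d_out = 0.5  # assumed: discriminator is maximally unsure on every sample
d_loss = bce(d_out, 1) + bce(d_out, 0)  # one real sample + one generated sample
g_loss = bce(d_out, 1)                  # non-saturating generator loss
print(round(d_loss, 4), round(g_loss, 4))
```

Losses hovering near these values are consistent with a balanced game, but they say nothing about sample quality on their own.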
Here are a few side notes that I hope will be of help:
if the loss hasn't converged very well, it doesn't necessarily mean that the model hasn't learned anything - check the generated examples, sometimes they come out good enough. Alternatively, you can try changing the learning rate and other parameters.
if the model converged well, still check the generated examples - sometimes the generator finds one or a few examples that the discriminator can't distinguish from the genuine data. The trouble is that it always produces these few and never creates anything new; this is called mode collapse. Usually introducing some diversity into your data helps.
as vanilla GANs are rather unstable, I'd suggest using some version of the DCGAN models, as they contain features such as convolutional layers and batch normalisation that are supposed to help with the stability of the convergence. (the picture above is a result of a DCGAN rather than a vanilla GAN)
This is common sense, but still: as with most neural net structures, tweaking the model, i.e. changing its parameters and/or architecture to fit your needs or data, can either improve the model or break it.

Sudden drop in accuracy while training a deep neural net

I am using mxnet to train an 11-class image classifier. I am observing weird behavior: training accuracy was increasing slowly and went up to 39%, then in the next epoch it dropped to 9%, and it stays close to 9% for the rest of the training.
I restarted the training from the saved model (with 39% training accuracy), keeping all other parameters the same. Now training accuracy is increasing again. What could be the reason? I am not able to understand it. And it's getting difficult to train the model this way, as it requires me to watch the training accuracy values constantly.
The learning rate is constant at 0.01.
As you can see, your later accuracy is close to random guessing (1 in 11 classes is about 9%). There are two common issues in this kind of case:
Your learning rate is too high. Try lowering it.
The error (or entropy) you are using is giving you a NaN value. If you use entropies with log functions, you must handle them carefully, for example by making sure log(0) is never evaluated.
It is common during training of neural networks for accuracy to improve for a while and then get worse -- in general this is caused by over-fitting. It's also fairly common for the network to "get unlucky" and get knocked into a bad part of parameter space corresponding to a sudden decrease in accuracy -- sometimes it can recover from this quickly, but sometimes not.
In general, lowering your learning rate is a good approach to this kind of problem. Also, setting a learning rate schedule like FactorScheduler can help you achieve more stable convergence by lowering the learning rate every few epochs. In fact, this can sometimes cover up mistakes in picking an initial learning rate that is too high.
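The idea behind a factor schedule is simply to multiply the learning rate by a constant factor every fixed number of updates. The function below is a plain-Python sketch of that rule, not MXNet's implementation; the parameter names are borrowed from its description, and the step count is in parameter updates (batches), not epochs, which is an assumption you should check against your framework.

```python
def factor_schedule(base_lr, step, factor, num_update, stop_factor_lr=1e-8):
    """Learning rate after num_update updates.

    The rate is multiplied by `factor` once every `step` updates and is
    never allowed to fall below `stop_factor_lr`.
    """
    lr = base_lr * factor ** (num_update // step)
    return max(lr, stop_factor_lr)


# Start at the question's 0.01 and halve every 1000 updates (hypothetical values).
for n in (0, 999, 1000, 2500, 10000):
    print(n, factor_schedule(0.01, step=1000, factor=0.5, num_update=n))
```

With these illustrative settings the rate is 0.01 for the first 1000 updates, 0.005 for the next 1000, and so on.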
I faced the same problem, and I solved it by using the (y-a)^a loss function instead of the cross-entropy function (because of log(0)). I hope there is a better solution for this problem.
These problems often come up. I observed that this may happen due to one of the following reasons:
Something returning NaN
The inputs of the network are not as expected - many modern frameworks do not raise errors in some such cases
The model layers get incompatible shapes at some point
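It probably happened because 0*log(0) returns NaN.
You might avoid it with:
# `logits` must hold probabilities (e.g. softmax outputs), not raw scores; clipping keeps log() away from 0
cross_entropy = -tf.reduce_sum(labels * tf.math.log(tf.clip_by_value(logits, 1e-10, 1.0)))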
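The same effect can be reproduced without any framework. The sketch below is plain Python with hypothetical function names: it mimics a framework's log(0) = -inf to show where the NaN comes from, then shows the clipping fix and, as an alternative, computing the loss directly from raw scores with the log-sum-exp trick so that no probability is ever exactly 0.

```python
import math


def naive_cross_entropy(labels, probs):
    """-sum(y * log(p)); yields NaN when a zero label meets p == 0."""
    total = 0.0
    for y, p in zip(labels, probs):
        # math.log(0) raises in Python, so mimic a framework's log(0) = -inf.
        log_p = math.log(p) if p > 0 else float("-inf")
        total += y * log_p  # 0 * -inf is NaN
    return -total


def clipped_cross_entropy(labels, probs, eps=1e-10):
    """Same loss, with probabilities clipped to [eps, 1] before the log."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += y * math.log(min(max(p, eps), 1.0))
    return -total


def cross_entropy_from_logits(labels, logits):
    """Cross-entropy from raw scores using a log-sum-exp log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_z) for y, z in zip(labels, logits))


labels = [1, 0, 0]
probs = [1.0, 0.0, 0.0]  # a fully saturated softmax output
naive = naive_cross_entropy(labels, probs)
clipped = clipped_cross_entropy(labels, probs)
stable = cross_entropy_from_logits(labels, [1000.0, 0.0, 0.0])
print(naive, clipped, stable)
```

Most frameworks ship a fused "softmax cross-entropy from logits" operation that does something along the lines of the third function, which is usually preferable to clipping by hand.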

Backpropagation makes network worse

I am experimenting with neural networks. I have a network with 8 input neurons, 5 hidden and 2 output. When I let the network learn with backpropagation, it sometimes produces a worse result from one training iteration to the next. What could be the cause? It should not be an implementation error, because I even tried the implementation from Introduction to Neural Networks for Java and it does exactly the same.
Nothing is wrong. Backpropagation is just a gradient optimization, and gradient methods do not guarantee that the error gets smaller in each iteration (you do have a guarantee that there exists a very small step size/learning rate with this property, but in practice no way of finding it). Furthermore, you are probably updating the weights after each sample, which makes your training stochastic and even more "unstable" in this respect (as you do not really calculate the true gradient). However, if your method is not converging because of this, think about properly scaling your data, reducing the learning rate, and probably adding a momentum term. These are issues of gradient-based optimization in general, not of BP as such.
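The step-size point can be seen on a toy problem. This sketch minimises f(x) = x**2 with plain (full-gradient) descent; the two learning rates are chosen purely for illustration. It does not model the extra noise of per-sample updates, only the effect of the step size.

```python
def gradient_descent(x0, lr, steps):
    """Minimise f(x) = x**2 and return the loss after each step."""
    x = x0
    losses = [x * x]
    for _ in range(steps):
        grad = 2 * x  # derivative of x**2
        x = x - lr * grad
        losses.append(x * x)
    return losses


small = gradient_descent(1.0, lr=0.1, steps=5)  # x shrinks by 0.8 per step
large = gradient_descent(1.0, lr=1.1, steps=5)  # x is multiplied by -1.2 per step
print(small)
print(large)
```

With the small step the loss falls at every iteration; with the large one every update follows the correct gradient direction and the loss still grows, because each step overshoots the minimum.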

Increased Error with more Training Data for a Neural Network in Matlab

I have a question regarding the Matlab NN toolbox. As part of a research project I decided to create a Matlab script that uses the NN toolbox for some fitting solutions.
I have a data stream that is being loaded into my system. The input data consists of 5 input channels and 1 output channel. I train my network on this configuration for a while and try to fit the output (for a certain period of time) as new data streams in. I retrain my network constantly to keep it updated.
So far everything works fine, but after a certain period of time the results get bad and do not represent the desired output. I really can't explain why this happens, but I could imagine that there must be some kind of memory issue, since everything is ok as long as the data set is still small.
Only when it gets bigger does the quality of the simulation drop. Is there something like a memory which gets full, or is the bad simulation just a result of the huge data sets? I'm a beginner with this tool and would really appreciate your feedback. Best regards and thanks in advance!
Please elaborate on your method of retraining with new data. Do you run further iterations? What do you consider as "time"? Do you mean epochs?
At first glance, assuming time means epochs, I would say that you're overfitting the data. Neural networks are supposed to be trained for a limited number of epochs with early stopping. You could try regularization, different gradient descent methods (if you're using a GD method), or GD with momentum. Also, depending on the values of your first few training datasets, you may have trained your network using an incorrect normalization range. You should check these issues out if my assumptions are correct.
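The normalization point can be illustrated with a tiny sketch, written in Python rather than Matlab and using made-up numbers: if the min-max range is fitted on the first chunk of the stream only, later values fall far outside the [0, 1] range the network was trained on.

```python
def fit_min_max(values):
    """Return the (min, max) range used for scaling."""
    return min(values), max(values)


def scale(value, lo, hi):
    """Map value into [0, 1] relative to the fitted range."""
    return (value - lo) / (hi - lo)


early = [2.0, 3.0, 4.0]   # hypothetical first chunk of the stream
later = [2.0, 8.0, 10.0]  # hypothetical later chunk with a wider range

lo, hi = fit_min_max(early)
scaled_later = [scale(v, lo, hi) for v in later]

# Refitting on everything seen so far restores the expected [0, 1] range.
lo_all, hi_all = fit_min_max(early + later)
rescaled_later = [scale(v, lo_all, hi_all) for v in later]
print(scaled_later, rescaled_later)
```

Whether this is what happens in the toolbox setup from the question depends on how the retraining is configured, so treat it as something to check rather than a diagnosis.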