I am working on binary classification problems, and using pycaret to try out different methods. All my problems are imbalanced, so one class is typically around 10% of the data.
When I use pycaret to pick a threshold that optimizes for, say, F1, I often get very low optimal thresholds like 0.18, or sometimes even as low as 0.1. Is this an inherent issue? I would typically expect thresholds close to or higher than 0.5 for a good model. Is this normal behavior?
I am already checking for overfitting by checking the validation and test set metrics to ensure they are similar ...
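For reference, the kind of threshold selection described above can be sketched as a plain sweep over candidate cut-offs. This is only an illustration on made-up scores, not pycaret's implementation; the function names and the toy data are hypothetical. It shows that when the positive class is rare and the model's scores for it are modest, the F1-optimal threshold can sit well below 0.5.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores, thresholds):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in scores]
        score = f1_score(y_true, y_pred)
        if score > best_f1:  # strict '>' keeps the first of any tied thresholds
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy data (made up): 2 positives out of 10, with fairly low predicted scores.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.20, 0.30, 0.10, 0.08, 0.05, 0.05, 0.03, 0.02, 0.01]

best_t, best_f1 = best_f1_threshold(y_true, scores, [0.1, 0.2, 0.3, 0.4, 0.5])
print(best_t, best_f1)
```

On this toy data the sweep picks 0.2 with an F1 of 0.8, while a threshold of 0.5 predicts no positives at all.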
I've seen some comments in online articles/tutorials or Stack Overflow questions which suggest that increasing the number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between the number of epochs and overfitting. So I'm looking for an answer which explains if I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example to compare a deep model with a simpler model:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity, in your words, or degrees of freedom) to train.
But if you use a very simple deep model with layer 1: 10 units, layer 2: 10 units, layer 3: 5 units, layer 4: 2 units, you will end up with 10*10 + 10*10 + 10*5 + 5*2 = 260 weights to train (287 if you also count the 27 biases).
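The count for the small network above can be checked with a few lines. This is a sketch that assumes the 10 input variables feed layer 1 directly and that every layer is fully connected; `count_parameters` is a hypothetical helper, not a library function.

```python
def count_parameters(layer_sizes, include_bias=True):
    """Count trainable parameters of a fully connected network.

    layer_sizes lists the input width followed by the width of each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # one weight per connection
        if include_bias:
            total += n_out         # one bias per unit
    return total


# 10 inputs, then layers of 10, 10, 5 and 2 units (assumed fully connected).
sizes = [10, 10, 10, 5, 2]
weights_only = count_parameters(sizes, include_bias=False)
with_biases = count_parameters(sizes)
print(weights_only, with_biases)
```

Under these assumptions the network has 260 weights, or 287 parameters with biases, compared with roughly 50 for the logistic regression.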
Therefore, when we train a neural net for a long time, we usually end up with a memorized version of our data set (this gets worse if our data set is small and easy to memorize).
But as you also mentioned, there is no intrinsic reason why a higher number of epochs should result in overfitting. Early stopping is usually a very good way of avoiding this. Just set the patience to 5-10 epochs.
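A minimal sketch of patience-based early stopping, written in plain Python so it does not depend on any framework. The function name, the `min_delta` parameter and the example losses are made up for illustration; real frameworks provide their own callbacks for this.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) as 0-based indices.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never ran out of patience: we stopped because the data ended.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: they improve until epoch 3, then drift upwards.
val_losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.70, 0.72, 0.75, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With a patience of 3, training on these losses stops at epoch 6 and the weights from epoch 3 would be the ones to keep.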
If the number of trainable parameters is small relative to the size of your training set (and your training set is reasonably diverse), then running over the same data multiple times will not matter much, since you will be learning features of your problem rather than just memorizing the training data set. The problem arises when the number of parameters is comparable to the size of your training set (or bigger). It is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, L1 regularization, constraining certain connections to be 0 or equal, as in a CNN).
The problem is that you might still be left with too many trainable parameters. A simple way to regularize even further is to use a small learning rate (i.e. don't learn too much from any particular example, lest you memorize it) combined with monitoring across epochs (if the gap between training and validation accuracy starts to grow sharply, you are starting to overfit). You can then use the gap information to stop training. This is a version of what is known as early stopping (stop before you reach the minimum of your training loss function).
I have a quite simple ANN using Tensorflow and AdamOptimizer for a regression problem and I am now at the point to tune all the hyperparameters.
For now, I saw many different hyperparameters that I have to tune :
Learning rate : initial learning rate, learning rate decay
The AdamOptimizer needs 4 arguments (learning-rate, beta1, beta2, epsilon) so we need to tune them - at least epsilon
nb of iterations
Lambda L2-regularization parameter
Number of neurons, number of layers
what kind of activation function for the hidden layers, for the output layer
dropout parameter
I have 2 questions :
1) Do you see any other hyperparameter I might have forgotten ?
2) For now, my tuning is quite "manual" and I am not sure I am doing everything in a proper way.
Is there a special order to tune the parameters ? E.g learning rate first, then batch size, then ...
I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not independent ? Should we then tune them together ?
Is there any paper or article which talks about properly tuning all the parameters in a special order ?
Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve looks completely weird to me: its cost decreases much more slowly than the others, but it gets stuck at a lower accuracy. Is it possible that the model is stuck in a local minimum?
For the learning rate, I used the decay :
LR(t) = LRI/sqrt(epoch)
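The decay rule above can be written as a small function. This sketch assumes epochs are counted from 1 (an epoch of 0 would divide by zero); the function name and the initial rate of 0.01 are illustrative only.

```python
import math


def decayed_learning_rate(initial_lr, epoch):
    """LR(t) = LRI / sqrt(epoch), with epochs counted from 1 (assumption)."""
    if epoch < 1:
        raise ValueError("epoch must be >= 1 for this schedule")
    return initial_lr / math.sqrt(epoch)


# The rate halves each time the epoch count quadruples.
schedule = [decayed_learning_rate(0.01, epoch) for epoch in (1, 4, 16, 100)]
print(schedule)
```

Starting from 0.01, the rate is about 0.005 at epoch 4, 0.0025 at epoch 16 and 0.001 at epoch 100, so the decay is steep early on and very gentle later.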
My general order is:
Batch size, as it will largely affect the training time of future experiments.
Architecture of the network:
Number of neurons in the network
Number of layers
Rest (dropout, L2 reg, etc.)
I'd assume that the optimal values of
learning rate and batch size
learning rate and number of neurons
number of neurons and number of layers
strongly depend on each other. I am not an expert on that field though.
As for your hyperparameters:
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
You could also tune the initial weight values (see All you need is a good init). Although, the Xavier initializer seems to be a good way to prevent having to tune the weight inits.
I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
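To make the role of the Adam values quoted above concrete, here is a sketch of a single Adam update for one scalar parameter, using those values as defaults. It is a plain-Python illustration of the update rule, not TensorFlow's AdamOptimizer; the toy objective (w - 3)^2 and all names are made up.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def grad_fn(w):
    """Gradient of the toy objective (w - 3)^2."""
    return 2.0 * (w - 3.0)


# First step from w = 0: the step size is roughly lr, whatever the gradient scale.
w1, m1, v1 = adam_step(0.0, grad_fn(0.0), 0.0, 0.0, t=1)
print(w1)

# A short run with a larger, illustrative learning rate.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, grad_fn(w), m, v, t, lr=0.1)
print(w)
```

The first step moves the parameter by about lr regardless of how large the gradient is, which is one reason the learning rate matters more than beta1, beta2 or epsilon in most setups.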
Get Tensorboard running. Plot the error there. You'll need to create subdirectories in the path where TB looks for the data to plot. I do that subdir creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
For parameters that are less important you can probably just pick a reasonable value and stick with it.
Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.
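The random search described above can be sketched as follows. The ranges, the hyperparameter names and the `toy_validation_loss` stand-in are all made up; in practice `evaluate` would train a model with the sampled settings and return its validation error.

```python
import math
import random


def sample_config(rng):
    """Draw one setting; scale-like parameters are sampled log-uniformly."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "l2_lambda": 10 ** rng.uniform(-6, -2),
        "dropout": rng.uniform(0.0, 0.5),
        "hidden_units": rng.choice([16, 32, 64, 128]),
    }


def random_search(evaluate, n_trials=20, seed=0):
    """Evaluate n_trials random settings and return the best (config, score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)  # lower is better, e.g. validation loss
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_loss(config):
    """Stand-in for a real training run: prefers lr near 1e-3 and low dropout."""
    return (math.log10(config["learning_rate"]) + 3) ** 2 + config["dropout"]


best_config, best_score = random_search(toy_validation_loss, n_trials=50, seed=0)
print(best_config, best_score)
```

Sampling the learning rate and the L2 weight on a log scale is a design choice: their useful values span several orders of magnitude, so a uniform draw would almost never try the small ones.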
I don't know of any tool specific to tensorflow, but the best strategy is to start with basic hyperparameter values, such as a learning rate of 0.01 or 0.001 and a weight_decay of 0.005 or 0.0005, and then tune them. Doing it manually will take a lot of time. If you are using caffe, the following is the best option: it takes the hyperparameters from a set of input values and gives you the best set.
For the number of layers, what I suggest is to first build a smaller network and increase the amount of data, and once you have sufficient data, increase the model complexity.
Before you begin:
Set batch size to maximal (or maximal power of 2) that works on your hardware. Simply increase it until you get a CUDA error (or system RAM usage > 90%).
Set regularizers to low values.
The architecture and exact numbers of neurons and layers - use known architectures as inspirations and adjust them to your specific performance requirements: more layers and neurons -> possibly a stronger, but slower model.
Then, if you want to do it one by one, I would go like this:
Tune learning rate in a wide range.
Tune other parameters of the optimizer.
Tune regularizers (dropout, L2 etc).
Fine tune learning rate - it's the most important hyper-parameter.
I am using mxnet to train an 11-class image classifier. I am observing weird behavior: training accuracy was increasing slowly and went up to 39%, then in the next epoch it dropped to 9%, and it stays close to 9% for the rest of the training.
I restarted the training from the saved model (with 39% training accuracy), keeping all other parameters the same. Now training accuracy is increasing again. What can be the reason here? I am not able to understand it. And it's getting difficult to train the model this way, as it requires me to watch the training accuracy values constantly.
learning rate is constant at 0.01
As you can see, your later accuracy is close to random guessing (1/11 is about 9%). There are two common issues in this kind of case:
Your learning rate is too high. Try lowering it.
The error (or entropy) you are using is giving you NaN values. If you use entropies with log functions, you must handle them carefully, e.g. never take log(0).
It is common during training of neural networks for accuracy to improve for a while and then get worse -- in general this is caused by over-fitting. It's also fairly common for the network to "get unlucky" and get knocked into a bad part of parameter space corresponding to a sudden decrease in accuracy -- sometimes it can recover from this quickly, but sometimes not.
In general, lowering your learning rate is a good approach to this kind of problem. Also, setting a learning rate schedule like FactorScheduler can help you achieve more stable convergence by lowering the learning rate every few epochs. In fact, this can sometimes cover up mistakes in picking an initial learning rate that is too high.
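The idea behind a factor-based schedule can be sketched in plain Python as below. This is not MXNet's FactorScheduler itself (which, as far as I know, counts weight updates rather than epochs, so check its documentation before relying on that detail); the function name and the step, factor and floor values are illustrative assumptions.

```python
def factor_schedule(base_lr, epoch, step=10, factor=0.5, min_lr=1e-8):
    """Multiply the learning rate by `factor` every `step` epochs, with a floor."""
    lr = base_lr * factor ** (epoch // step)
    return max(lr, min_lr)


# Starting from the constant 0.01 mentioned in the question.
for epoch in (0, 9, 10, 25, 40):
    print(epoch, factor_schedule(0.01, epoch))
```

With these settings the rate stays at 0.01 for the first ten epochs, drops to 0.005 at epoch 10 and to 0.0025 from epoch 20, so an initial rate that is slightly too high is corrected after a few steps.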
I faced the same problem, and I solved it by using the (y-a)^2 (squared error) loss function instead of the cross-entropy function (because of log(0)). I hope there is a better solution for this problem.
These problems often come up. I observed that this may happen due to one of the following reasons:
Something returning NaN
The inputs of the network are not as expected - many modern frameworks do not raise errors in some of such cases
The model layers get incompatible shapes at some point
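It probably happened because 0 * log(0) returns NaN.
You might avoid it by clipping the values before taking the log (note that `logits` here must already hold softmax probabilities, not raw scores):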
cross_entropy = -tf.reduce_sum(labels*tf.log(tf.clip_by_value(logits,1e-10,1.0)))
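The same clipping idea in framework-free Python, as a sketch. In floating-point array libraries log(0) evaluates to -inf and 0 * -inf to NaN, whereas Python's math.log(0.0) raises a ValueError; either way, clamping the probability into [eps, 1] first keeps the loss finite. The function name and the example values are made up.

```python
import math


def clipped_cross_entropy(labels, probs, eps=1e-10):
    """Cross-entropy -sum(y * log(p)), with p clamped into [eps, 1.0]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0)  # avoid log(0)
        total -= y * math.log(p)
    return total


# Confident and correct: the zero probabilities belong to zero labels.
print(clipped_cross_entropy([0, 1, 0], [0.0, 1.0, 0.0]))

# Confident and wrong: the loss is large but finite instead of infinite.
print(clipped_cross_entropy([1, 0], [0.0, 1.0]))
```

The second case gives -log(1e-10), about 23.03, so a badly wrong prediction is still penalized heavily without poisoning the weights with NaN.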
Through all training process, accuracy is 0.1. What am I doing wrong?
Model, solver and part of log here:
Topology in png format:
P.S. I am using the latest version of Caffe and g2.2xlarge instance on AWS.
You're working on the CIFAR-10 dataset, which has 10 classes. When training commences, the network's first guesses are essentially random, so your accuracy is 1/N, where N is the number of classes. In your case that is 1/10, i.e., 0.1. If your accuracy stays the same over time, it implies that your network isn't learning anything. This may happen due to a large learning rate. The basic idea of training a network is that you calculate the loss and propagate it back. The gradients are multiplied by the learning rate and subtracted from the current weights and biases. If the learning rate is too big, you may overshoot the local minimum every time. If it is too small, convergence will be slow. I see that your base_lr here is 0.01. In my experience, this is somewhat large. You may want to start at 0.001 and then reduce it by a factor of 10 whenever you observe that the accuracy is not improving. Anything below 0.00001 usually doesn't make much of a difference, though. The trick is to observe the progress of the training and make parameter changes as and when required.
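The effect of the learning rate on overshooting can be seen on a one-dimensional toy problem. This sketch minimizes f(w) = w^2 with plain gradient descent; the learning rates are chosen for this toy function only and say nothing about good values for a real network.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad   # step against the gradient
    return w


for lr in (1.1, 0.1, 0.001):
    print(lr, gradient_descent(lr))
```

Each step multiplies w by (1 - 2 * lr). With lr = 1.1 that factor is -1.2, so w jumps across the minimum and grows in magnitude; with lr = 0.1 it shrinks towards 0 quickly; with lr = 0.001 it barely moves in 20 steps.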
I know the thread is quite old but maybe my answer helps somebody. I experienced the same problem with an accuracy like a random guess.
What helped was to set the number of outputs of the last layer before the accuracy layer to the number of labels.
In your case that should be the ip2 layer. Open the model definition of your net and set num_output to the number of labels.
See Section 4.4 for more information: A Practical Introduction to Deep Learning with Caffe and Python
I am using
net = newfit(in,out,lag(j),{'tansig','tansig'});
to generate a new neural network. The default value of the number of validation checks is 6.
I am training a lot of networks and this is taking a lot of time. I guess it doesn't matter if my results are a bit less accurate if they can be made considerably faster.
How can I train faster?
I believe one of the ways might be to reduce the value of the number of validation checks. How can I do that (in code, not using the GUI)?
Is there some other way to increase speed?
As I said, the increase in speed may be at a little loss of accuracy.
Just to extend @mtrw's answer: according to the documentation, training stops when any of these conditions occurs:
The maximum number of epochs is reached: net.trainParam.epochs
The maximum amount of time is exceeded: net.trainParam.time
Performance is minimized to the goal: net.trainParam.goal
The performance gradient falls below min_grad: net.trainParam.min_grad
mu exceeds mu_max: net.trainParam.mu_max
Validation performance has increased more than max_fail times since the last time it decreased (when using validation): net.trainParam.max_fail
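The list of stopping conditions above can be read as a single check that runs after every epoch. The sketch below expresses that logic in Python purely as an illustration; it is not MATLAB's implementation, the dictionary keys only mirror the trainParam names, and every value except the max_fail default of 6 mentioned in the question is a made-up placeholder.

```python
def stop_reason(state, params):
    """Return the name of the first stopping condition that holds, else None."""
    if state["epoch"] >= params["epochs"]:
        return "epochs"
    if state["elapsed_seconds"] >= params["time"]:
        return "time"
    if state["performance"] <= params["goal"]:
        return "goal"
    if state["gradient"] < params["min_grad"]:
        return "min_grad"
    if state["mu"] > params["mu_max"]:
        return "mu_max"
    # Assumption: stop once the count of validation failures reaches max_fail.
    if state["validation_failures"] >= params["max_fail"]:
        return "max_fail"
    return None


# Placeholder limits; only max_fail = 6 comes from the question above.
params = {"epochs": 100, "time": float("inf"), "goal": 0.0,
          "min_grad": 1e-10, "mu_max": 1e10, "max_fail": 6}

state = {"epoch": 12, "elapsed_seconds": 3.5, "performance": 0.02,
         "gradient": 1e-3, "mu": 0.01, "validation_failures": 2}

print(stop_reason(state, params))                   # nothing triggers yet
print(stop_reason(state, {**params, "goal": 0.05})) # a looser goal stops training
```

Raising the goal or min_grad makes the corresponding check fire earlier, which is how those two settings trade accuracy for speed.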
The epochs and time constraints allow you to put an upper bound on the training duration.
The goal constraint stops the training when the performance (error) drops below it, and usually allows you to adjust the time/accuracy trade-off: less accurate results for faster execution.
This is similar to min_grad (the gradient tells you the strength of the "descent"): if the magnitude of the gradient is less than min_grad, training stops. The idea is that if the error function is not changing by much, we are reaching a plateau and should probably stop training, since we are not going to improve by much.
mu, mu_dec, and mu_max are used to control the weight updating process (backpropagation).
max_fail is usually used to avoid over-fitting, not so much for speedup.
My advice: set time and epochs to the maximum that your application constraints allow (otherwise the results will be poor). In turn, you can adjust goal and min_grad to reach the desired speed/accuracy trade-off. Keep in mind that max_fail won't gain you any time, since it's mainly used to ensure good generalization.
(Disclaimer: I don't have the neural network toolbox, so I'm only extrapolating from the Mathworks documentation)
It looks from your input parameters like you're using TRAINLM. According to the documentation, you can set the net.trainParam.max_fail parameter to change the validation checks.
You can set the initial mu value, as well as the increment and decrement factors. But this would require some insight into the expected answer and performance of the search.
For a more blunt approach, you can also control the maximum number of iterations by setting the net.trainParam.epochs parameter to something less than its default 100. You might also set the net.trainParam.time parameter to limit the number of seconds.
You should probably set net.trainParam.show to NaN to skip any displays.
Neural nets are treated as objects in MATLAB. To access any parameter before (or after) training, you need to access the network's properties using the . operator.
In addition to mtrw's and Amro's answers, make MATLAB's Neural Network Toolbox documentation your new best friend. It will usually explain things in much better detail.