Is there a better structure for training Fully Convolution Neural Network? - neural-network

I am training a fully convolutional neural network with 3080*16 input images for training, using 16 images per batch. I am doing this for 100 epochs.
in every epoch:
after each batch:
calculate errors, do weight update, get confusion matrix
after each validation_batch
calculate errors and confusion matrix
I am trying to give the maximum batch size possible.

In this situation (when the number of epochs is fixed) you trade off the number of updates against the quality of each update. The smaller the batch, the more often you update your network, and the better the network you might get (assuming that you are using the right regularization and babysitting the training). The bigger the batch, the better each update approximates the true update, and the faster your network might converge to a good solution while skipping changes that would actually worsen your model.
The best way to set the batch size is either to check whether someone has already found the best batch size for your task, or to run a grid or random search: choose a set of reasonable batch sizes and test each option in order to find the best value.
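The grid search over batch sizes described above can be sketched in a few lines of Python. This is only an illustration: `search_batch_size` and `toy_validation_loss` are hypothetical names, and the toy function stands in for a real training run that would train the network with the given batch size and return its validation loss.

```python
def search_batch_size(evaluate, candidates):
    """Try every candidate batch size and return the one with the lowest loss.

    evaluate: callable taking a batch size and returning a validation loss
              (assumption: in practice this trains and validates the network).
    """
    scores = {batch_size: evaluate(batch_size) for batch_size in candidates}
    best = min(scores, key=scores.get)
    return best, scores


def toy_validation_loss(batch_size):
    # Stand-in for a real training run: an arbitrary curve with its minimum at 8.
    return (batch_size - 8) ** 2 / 64.0


best, scores = search_batch_size(toy_validation_loss, [2, 4, 8, 16])
print("best batch size:", best)
```

With a fixed number of epochs, each call to `evaluate` costs roughly the same, so the total cost grows linearly with the number of candidates.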

Related

Neural Network - Working with a imbalanced dataset

I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4 , with label '0' appearing 250 times more than label '1'. However, I have a lot of training samples : around 23 millions. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether NN could be efficient to handle this kind of imbalanced dataset with a large dataset ?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha value. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model is predicting far more 1s than it should: around 0.50 of the predictions are label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too, as I did it manually for now: tf.nn.weighted_cross_entropy_with_logits
2) I will try to de-unbalance the dataset but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~ 100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways to handle an imbalanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
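The first option, online sampling of class-balanced batches, could look roughly like the sketch below. It assumes binary labels held in a plain Python list; `balanced_batch` is a hypothetical helper, not a TensorFlow function. The rare class is drawn with replacement, since a batch may need more positives than are available without repeats.

```python
import random


def balanced_batch(labels, batch_size, rng):
    """Return sample indices for one batch with equal numbers of 0s and 1s."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    half = batch_size // 2
    # Rare class: sample with replacement, so repeats are possible.
    batch = [rng.choice(pos) for _ in range(half)]
    # Dominant class: sample without replacement.
    batch += rng.sample(neg, batch_size - half)
    rng.shuffle(batch)
    return batch


labels = [0] * 1000 + [1] * 4  # same 1000:4 proportion as in the question
rng = random.Random(0)
batch = balanced_batch(labels, 16, rng)
n_pos = sum(labels[i] for i in batch)
print(n_pos, "positives in a batch of", len(batch))
```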
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say that the dominant class is labelled negative (neg) and the other positive (pos) for softmax (for hinge loss you could do exactly the same):
L=L_{neg}+L_{pos} =>L=L_{neg}+\alpha*L_{pos}
With \alpha greater than 1.
Which would translate in tensorflow for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0,1] to something like :
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
What is more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
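To make the weighting L = L_{neg} + \alpha*L_{pos} concrete, here is a minimal pure-Python sketch for the binary case. It assumes labels in {0, 1} and predicted probabilities for the positive class; `weighted_cross_entropy` is an illustrative name, and this is not how TensorFlow implements its own function (which works on logits).

```python
import math


def weighted_cross_entropy(y_true, p_pred, alpha, eps=1e-12):
    """Mean cross-entropy where the positive-class term is scaled by alpha."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        if y == 1:
            total += -alpha * math.log(p)  # L_pos term, weighted by alpha
        else:
            total += -math.log(1.0 - p)  # L_neg term, weight 1
    return total / len(y_true)


# One positive and one negative example, both predicted at 0.5.
print(weighted_cross_entropy([1, 0], [0.5, 0.5], alpha=1.0))
print(weighted_cross_entropy([1, 0], [0.5, 0.5], alpha=3.0))
```

With alpha = 1 this reduces to the ordinary binary cross-entropy, so alpha can be tuned starting from 1 rather than jumping straight to the class ratio.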
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - a neural network could help in your case. There are at least two approaches to such a problem:
Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might help more than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts to adjust itself to increase performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html

In what order should we tune hyperparameters in Neural Networks?

I have a quite simple ANN using Tensorflow and AdamOptimizer for a regression problem and I am now at the point to tune all the hyperparameters.
For now, I saw many different hyperparameters that I have to tune :
Learning rate : initial learning rate, learning rate decay
The AdamOptimizer needs 4 arguments (learning-rate, beta1, beta2, epsilon) so we need to tune them - at least epsilon
batch-size
nb of iterations
Lambda L2-regularization parameter
Number of neurons, number of layers
what kind of activation function for the hidden layers, for the output layer
dropout parameter
I have 2 questions :
1) Do you see any other hyperparameter I might have forgotten ?
2) For now, my tuning is quite "manual" and I am not sure I am not doing everything in a proper way.
Is there a special order to tune the parameters ? E.g learning rate first, then batch size, then ...
I am not sure that all these parameters are independent - in fact, I am quite sure that some of them are not. Which ones are clearly independent and which ones are clearly not independent ? Should we then tune them together ?
Is there any paper or article which talks about properly tuning all the parameters in a special order ?
EDIT :
Here are the graphs I got for different initial learning rates, batch sizes and regularization parameters. The purple curve is completely weird to me: its cost decreases much more slowly than the others, but it gets stuck at a lower accuracy rate. Is it possible that the model is stuck in a local minimum?
Accuracy
Cost
For the learning rate, I used the decay :
LR(t) = LRI/sqrt(epoch)
Thanks for your help !
Paul
My general order is:
Batch size, as it will largely affect the training time of future experiments.
Architecture of the network:
Number of neurons in the network
Number of layers
Rest (dropout, L2 reg, etc.)
Dependencies:
I'd assume that the optimal values of
learning rate and batch size
learning rate and number of neurons
number of neurons and number of layers
strongly depend on each other. I am not an expert on that field though.
As for your hyperparameters:
For the Adam optimizer: "Recommended values in the paper are eps = 1e-8, beta1 = 0.9, beta2 = 0.999." (source)
For the learning rate with Adam and RMSProp, I found values around 0.001 to be optimal for most problems.
As an alternative to Adam, you can also use RMSProp, which reduces the memory footprint by up to 33%. See this answer for more details.
You could also tune the initial weight values (see All you need is a good init). Although, the Xavier initializer seems to be a good way to prevent having to tune the weight inits.
I don't tune the number of iterations / epochs as a hyperparameter. I train the net until its validation error converges. However, I give each run a time budget.
Get Tensorboard running. Plot the error there. You'll need to create subdirectories in the path where TB looks for the data to plot. I do that subdir creation in the script. So I change a parameter in the script, give the trial a name there, run it, and plot all the trials in the same chart. You'll very soon get a feel for the most effective settings for your graph and data.
For parameters that are less important you can probably just pick a reasonable value and stick with it.
Like you said, the optimal values of these parameters all depend on each other. The easiest thing to do is to define a reasonable range of values for each hyperparameter. Then randomly sample a parameter from each range and train a model with that setting. Repeat this a bunch of times and then pick the best model. If you are lucky you will be able to analyze which hyperparameter settings worked best and make some conclusions from that.
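The random search described above can be sketched as follows. The ranges, the hyperparameter names and the toy objective are all assumptions made for illustration; in a real experiment `evaluate` would train a model with the sampled configuration and return its validation error. Learning rate and L2 strength are sampled on a log scale, which is a common choice for parameters that span several orders of magnitude.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter setting from hand-picked (assumed) ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
        "batch_size": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
        "l2": 10 ** rng.uniform(-6, -2),  # log-uniform
    }


def random_search(evaluate, n_trials, seed=0):
    """Evaluate n_trials random configurations, best (lowest error) first."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((evaluate(config), config))
    trials.sort(key=lambda trial: trial[0])
    return trials


def toy_validation_error(config):
    # Stand-in for a real training run: prefers learning rates near 1e-3.
    return (math.log10(config["learning_rate"]) + 3) ** 2 + config["dropout"]


trials = random_search(toy_validation_error, n_trials=20)
best_error, best_config = trials[0]
print(best_error, best_config)
```

Keeping the full sorted list of trials, rather than only the best one, makes it easier to look for patterns in which settings worked well.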
I don't know of any tool specific to TensorFlow, but the best strategy is to start with basic hyperparameter values, such as a learning rate of 0.01 or 0.001 and a weight_decay of 0.005 or 0.0005, and then tune them. Doing it manually will take a lot of time. If you are using Caffe, the following tool takes the hyperparameters from a set of input values and gives you the best set.
https://github.com/kuz/caffe-with-spearmint
for more information, you can follow this tutorial as well:
http://fastml.com/optimizing-hyperparams-with-hyperopt/
For the number of layers, what I suggest is to first build a smaller network and increase the amount of data, and once you have sufficient data, increase the model complexity.
Before you begin:
Set batch size to maximal (or maximal power of 2) that works on your hardware. Simply increase it until you get a CUDA error (or system RAM usage > 90%).
Set regularizers to low values.
The architecture and exact numbers of neurons and layers - use known architectures as inspirations and adjust them to your specific performance requirements: more layers and neurons -> possibly a stronger, but slower model.
Then, if you want to do it one by one, I would go like this:
Tune learning rate in a wide range.
Tune other parameters of the optimizer.
Tune regularizers (dropout, L2 etc).
Fine tune learning rate - it's the most important hyper-parameter.

ANN different results for same train-test sets

I'm implementing a neural network for a supervised classification task in MATLAB.
I have a training set and a test set to evaluate the results.
The problem is that every time I train the network for the same training set I get very different results (sometimes I get a 95% classification accuracy and sometimes like 60%) for the same test set.
Now I know this is because I get different initial weights and I know that I can use 'seed' to set the same initial weights but the question is what does this say about my data and what is the right way to look at this? How do I define the accuracy I'm getting using my designed ANN? Is there a protocol for this (like running the ANN 50 times and get an average accuracy or something)?
Thanks
Make sure your test set is large enough compared to the training set (e.g. 10% of the overall data) and check it regarding diversity. If your test set only covers very specific cases, this could be a reason. Also make sure you always use the same test set. Alternatively you should google the term cross-validation.
Furthermore, observing good training set accuracy while observing bad test set accuracy is a sign for overfitting. Try to apply regularization like a simple L2 weight decay (simply multiply your weight matrices with e.g. 0.999 after each weight update). Depending on your data, Dropout or L1 regularization could also help (especially if you have a lot of redundancies in your input data). Also try to choose a smaller network topology (fewer layers and/or fewer neurons per layer).
To speed up training, you could also try alternative learning algorithms like RPROP+, RPROP- or RMSProp instead of plain backpropagation.
Looks like your ANN is not converging to the optimal set of weights. Without further details of the ANN model, I cannot pinpoint the problem, but I would try increasing the number of iterations.
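One way to report accuracy when results vary with the initial weights, along the lines of the "run it 50 times and average" idea in the question, is to repeat training with different seeds and report the mean together with the spread. The sketch below is in Python rather than MATLAB, and `toy_train_and_test` is a hypothetical stand-in for training the network with a given seed and measuring accuracy on the fixed test set.

```python
import random
import statistics


def summarize_runs(train_and_test, seeds):
    """Train once per seed and return (mean, standard deviation, all accuracies)."""
    accuracies = [train_and_test(seed) for seed in seeds]
    return statistics.mean(accuracies), statistics.stdev(accuracies), accuracies


def toy_train_and_test(seed):
    # Stand-in for a real training run: accuracy somewhere between 0.60 and 0.95.
    rng = random.Random(seed)
    return 0.60 + 0.35 * rng.random()


mean_acc, std_acc, accs = summarize_runs(toy_train_and_test, range(50))
print("accuracy: %.3f +/- %.3f over %d runs" % (mean_acc, std_acc, len(accs)))
```

A large standard deviation is itself informative: it suggests that training is unstable or that the test set is too small to measure accuracy reliably.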

Neural Net - Selecting Data For Each Mini Batch

Possibly an ANN 101 question regarding mini-batch processing. Google didn't seem to have the answer. A search here didn't yield anything either. My guess is there's a book somewhere that says, "do it this way!" and I just haven't read that book.
I'm coding a neural net in Python (not that the language matters). I'm attempting to add mini-batch updates instead of full batch. Is it necessary to select each observation once for each epoch? Mini-batches would be data values 1:10, 11:20, 21:30, etc. so that all observations are used, and they are all used once.
Or is it correct to select the mini batch randomly from the training data set based on a probability? The result being that each observation may be used once, multiple times, or not at all in any given epoch. For 20 mini-batches per epoch, each data element would be given a 5% chance of being selected for any given mini-batch. Mini batches would be randomly selected and random in size but approximately 1 of every 20 data points would be included in each of 20 mini batches with no guarantee of selection.
Some tips regarding mini-batch training:
Shuffle your samples before every epoch
The reason is the same as why you shuffle the samples in online training: Otherwise the network might simply memorize the order in which you feed the samples.
Use a fixed batch size for every batch and for every epoch
There is probably also a statistical reason, but it simplifies the implementation as it enables you to use fast implementations of matrix multiplications for your calculations. (e.g. BLAS)
Adapt your learning rate to the batch size
For larger batches you'll have to use a smaller learning rate, otherwise the ANN tends to converge towards a sub-optimal minimum. I always scaled my learning rates by 1/sqrt(n), where n is the batch size. Please note that this is just an empirical value from experiments.
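The three tips above can be combined in a short sketch. It assumes the samples are addressed by index; `minibatches` and `scaled_learning_rate` are illustrative names. To keep every batch the same size, the trailing incomplete batch is dropped, so a few samples are skipped in each epoch, but reshuffling means different samples are skipped each time.

```python
import math
import random


def minibatches(n_samples, batch_size, rng):
    """Yield index lists of fixed size; each sample is used at most once per epoch."""
    indices = list(range(n_samples))
    rng.shuffle(indices)  # call once per epoch to get a new order
    n_full = n_samples // batch_size  # drop the incomplete last batch
    for b in range(n_full):
        yield indices[b * batch_size:(b + 1) * batch_size]


def scaled_learning_rate(base_lr, batch_size):
    """Empirical 1/sqrt(n) scaling mentioned above, not a general rule."""
    return base_lr / math.sqrt(batch_size)


rng = random.Random(0)
for epoch in range(2):
    batches = list(minibatches(103, 10, rng))
    print("epoch", epoch, "first batch:", batches[0])

print(scaled_learning_rate(0.1, 16))
```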
Your first guess is correct.
Just randomize your dataset first.
Then, for a mini-batch size of (say) 20, use samples 1-20, then 21-40, etc.
So, all your dataset will be used.
But that doesn't mean the dataset is used only once. You normally need to do multiple epochs over the whole dataset for your network to learn properly.
Mini-batch is primarily used to speed up the learning process.

How to change the default parameters for newfit() in MATLAB?

I am using
net = newfit(in,out,lag(j),{'tansig','tansig'});
to generate a new neural network. The default value of the number of validation checks is 6.
I am training a lot of networks and this is taking a lot of time. I guess it doesn't matter if my results are a bit less accurate if they can be made considerably faster.
How can I train faster?
I believe one of the ways might be to reduce the value of the number of validation checks. How can I do that (in code, not using GUI)
Is there some other way to increase speed.
As I said, the increase in speed may be at a little loss of accuracy.
Just to extend @mtrw's answer, according to the documentation, training stops when any of these conditions occurs:
The maximum number of epochs is reached: net.trainParam.epochs
The maximum amount of time is exceeded: net.trainParam.time
Performance is minimized to the goal: net.trainParam.goal
The performance gradient falls below min_grad: net.trainParam.min_grad
mu exceeds mu_max: net.trainParam.mu_max
Validation performance has increased more than max_fail times since the last time it decreased (when using validation): net.trainParam.max_fail
The epochs and time constraints allow you to put an upper bound on the training duration.
The goal constraint stops the training when the performance (error) drops below it, and usually lets you adjust the time/accuracy trade-off: less accurate results for faster execution.
This is similar to min_grad (the gradient tells you the strength of the "descent"): if the magnitude of the gradient is less than min_grad, training stops. The idea is that if the error function is not changing by much, then we are reaching a plateau and should probably stop training, since we are not going to improve by much.
mu, mu_dec, and mu_max are used to control the weight updating process (backpropagation).
max_fail is usually used to avoid over-fitting, not so much for speedup.
My advice: set time and epochs to the maximum that your application constraints allow (otherwise the results will be poor). In turn, you can control goal and min_grad to reach the desired speed/accuracy trade-off. Keep in mind that max_fail won't gain you any time, since it's mainly used to ensure good generalization power.
(Disclaimer: I don't have the neural network toolbox, so I'm only extrapolating from the Mathworks documentation)
It looks from your input parameters like you're using TRAINLM. According to the documentation, you can set the net.trainParam.max_fail parameter to change the validation checks.
You can set the initial mu value, as well as the increment and decrement factors. But this would require some insight into the expected answer and performance of the search.
For a more blunt approach, you can also control the maximum number of iterations by setting the net.trainParam.epochs parameter to something less than its default 100. You might also set the net.trainParam.time parameter to limit the number of seconds.
You should probably set net.trainParam.show to NaN to skip any displays.
Neural nets are treated as objects in MATLAB. To access any parameter before (or after) training, you need to access the network's properties using the . operator.
In addition to mtrw's and Amro's answers, make MATLAB's Neural Network Toolbox documentation your new best friend. It will usually explain things in much better detail.