Possibly an ANN 101 question regarding mini-batch processing. Google didn't seem to have the answer. A search here didn't yield anything either. My guess is there's a book somewhere that says, "do it this way!" and I just haven't read that book.
I'm coding a neural net in Python (not that the language matters). I'm attempting to add mini-batch updates instead of full batch. Is it necessary to select each observation once for each epoch? Mini-batches would be data values 1:10, 11:20, 21:30, etc. so that all observations are used, and they are all used once.
Or is it correct to select the mini batch randomly from the training data set based on a probability? The result being that each observation may be used once, multiple times, or not at all in any given epoch. For 20 mini-batches per epoch, each data element would be given a 5% chance of being selected for any given mini-batch. Mini batches would be randomly selected and random in size but approximately 1 of every 20 data points would be included in each of 20 mini batches with no guarantee of selection.
Some tips regarding mini-batch training:
Shuffle your samples before every epoch
The reason is the same as why you shuffle the samples in online training: Otherwise the network might simply memorize the order in which you feed the samples.
Use a fixed batch size for every batch and for every epoch
There is probably also a statistical reason, but it simplifies the implementation as it enables you to use fast implementations of matrix multiplications for your calculations. (e.g. BLAS)
Adapt your learning rate to the batch size
For larger batches you'll have to use a smaller learning rate, otherwise the ANN tends to converge towards a sub-optimal minimum. I always scaled my learning rates by 1/sqrt(n), where n is the batch size. Please note that this is just an empirical value from experiments.
Your first guess is correct.
Just randomize your dataset first.
Then, for (say) a mini-batch size of 20, use observations 1-20, then 21-40, etc.
That way, the whole dataset is used.
But don't assume the dataset is used only once. You normally need to run multiple epochs over the whole dataset for your network to learn properly.
Mini-batches are primarily used to speed up the learning process.
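The "shuffle, then slice" approach described above can be sketched as follows. This is a minimal illustration in plain Python, not taken from any particular library; the function name `minibatches` and the toy data are assumptions made for the example. Note that the last batch is smaller when the dataset size is not a multiple of the batch size.

```python
import random

def minibatches(data, batch_size, rng=random):
    """Yield shuffled mini-batches that together cover `data` exactly once."""
    indices = list(range(len(data)))
    rng.shuffle(indices)  # a fresh random order on every call, i.e. every epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

data = list(range(100))  # stand-in for 100 training observations
for epoch in range(3):
    seen = []
    for batch in minibatches(data, batch_size=20):
        seen.extend(batch)  # a real network would do one weight update here
    # every observation is visited exactly once per epoch
    assert sorted(seen) == data
```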
Related
If I have 1000 observations in my dataset with 15 features and 1 label, how is the data fed to the input neurons for the forward pass and backpropagation? Is it fed row-wise, one observation at a time, with the weights updated after each observation, or is the full dataset given as an input matrix, with the network learning the weight values over a number of epochs? Also, if it is fed one observation at a time, what is an epoch in that case?
Thanks
Assuming that the data is formatted into rows (1000 instances with 16 features each, with the last one being the label), you would feed in the first 15 features row by row and use the last "feature"/label as the target. This is called online learning. Online learning requires you to feed the data in one example at a time and conduct the back propagation and the weight update for every example. As you can imagine this can get quite intensive due to the backpropagation and update for every instance of your data.
The other option that you mentioned is feeding in the entire data into the network. This performs poorly in practice as the convergence is extremely slow.
In practice, mini-batches are used. This involves sending a small subset of the dataset through and then doing the back propagation and weight update. This provides the benefit of relatively frequent weight updates to speed up learning but is less intensive than the online learning. For more information on mini-batches see this University of Toronto Lecture by Geoffrey Hinton
Finally, an epoch is always 1 run through all of your data. It doesn't matter if you feed it in one at a time or all at once.
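To make the three options concrete, here is a rough sketch using a one-parameter model y ≈ w * x in plain Python. The model, the learning rate, the batch size of 50 and the function name `train_one_epoch` are all assumptions chosen for illustration. The only thing that changes between online, mini-batch and full-batch training is `batch_size`; an epoch is one pass over all 1000 rows in every case.

```python
import random

def train_one_epoch(xs, ys, w, batch_size, lr=0.01):
    """One epoch of gradient descent on y ~ w * x; returns (w, number_of_updates)."""
    order = list(range(len(xs)))
    random.shuffle(order)
    updates = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # mean gradient of the squared error over the batch
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= lr * grad  # one weight update per batch
        updates += 1
    return w, updates

xs = [random.uniform(-1, 1) for _ in range(1000)]  # toy data, 1000 observations
ys = [3.0 * x for x in xs]

for name, bs in [("online", 1), ("mini-batch", 50), ("full batch", 1000)]:
    _, updates = train_one_epoch(xs, ys, w=0.0, batch_size=bs)
    print(name, updates)
# online 1000
# mini-batch 20
# full batch 1
```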
I hope this clarified your questions.
I am currently trying to do time series prediction with LSTM implemented with Keras.
I tried to train an LSTM model with 10 000 samples in the training set and 2 500 samples in the test set. I am using a batch size of 30.
Now, I am trying to train the exact same model but with more data. I have a training set with 100 000 samples and a test set with 25 000 samples.
The time for one epoch is multiplied by 100 when using the big dataset.
Even though I have more data, the batch size is the same, so I expected training not to take more time. Is it possible that it is the calculation of the loss on the train and test data that takes a lot of time (here all the data is used)?
Concerning the batch size: should I increase it because I have more data?
EDIT 1
I tried a bigger batch size. When I do that, the training time decreases a lot.
With a big batch size, shouldn't the computation of the gradient take longer than with a small batch size?
I have no clue here, I really do not understand why this is happening.
Does someone know why this is happening? Is it linked to the data I use? How can this happen, theoretically?
EDIT 2
My processor is an Intel Xeon W3520 (4 cores / 8 threads) with 32 GB of RAM.
The data is composed of sequences of length 6 with 4 features. I use one LSTM layer with 50 units and a dense output layer. Whether I am training with 10 000 samples or 100 000, it is really the batch size that changes the computation time. I can go from 2 seconds for one epoch with a batch size = 1000, to 200 seconds with a batch size = 30.
I do not use a generator, I use the basic line of code model.fit(Xtrain, Ytrain, nb_epoch, batch_size, verbose=2,callbacks,validation_data=(Xtest, Ytest)) with callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=2), history]
You seemingly have misunderstood parts of how SGD (Stochastic Gradient Descent) works.
I explained parts of this answer in another post here on Stack Overflow that might help you understand this better, but I'll take the time to explain it again here.
The basic idea of Gradient Descent is to calculate the forward pass (and store the activations) for all training samples, and only afterwards update your weights once. Now, since you might not have enough memory to store all the activations (which you need for the calculation of your backpropagation gradient), and for other reasons (mainly convergence), you often cannot do classical gradient descent.
Stochastic Gradient Descent makes the assumption that, by sampling in a random order, you can reach convergence by looking at only one training sample at a time, and then updating directly after. This is called an iteration, whereas we call the pass through all training samples an epoch.
Mini-batches change SGD only in that, instead of using one single sample, you take a "handful" of samples, determined by the batch size.
Now, the updating of the weights is a quite costly process, and it should be clear at this point that updating the weights a great number of times (with SGD) is more costly than computing the gradient and updating only a few times (with a large batch size).
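As a rough worked example with the numbers from the question (100 000 training samples, batch sizes 30 and 1000), the number of weight updates per epoch can be counted like this. The helper name `weight_updates_per_epoch` is made up for the illustration, and the count says nothing about the cost of a single update, which depends on the hardware and the library.

```python
import math

def weight_updates_per_epoch(n_samples, batch_size):
    """Number of iterations (weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_samples = 100_000
for batch_size in (30, 1000):
    print(batch_size, weight_updates_per_epoch(n_samples, batch_size))
# 30 3334
# 1000 100
```

With a batch size of 30 there are roughly 33 times more updates per epoch than with a batch size of 1000.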
I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a NN could handle this kind of imbalance efficiently, given the large dataset?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation?
Thanks for your help!
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) This weekend I tried the 1st option, increasing the cost for the positive label. Actually, with a less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall trade-off.
However, for my situation, it seems to be very sensitive to the alpha value. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model predicts far more 1s than it should: around 0.50 of the predictions are label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too, as I did it manually for now: tf.nn.weighted_cross_entropy_with_logits
2) I will try to rebalance the dataset, but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it!
There are usually two common ways to handle an imbalanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of the two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
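The first option (online sampling of class-balanced batches) could look roughly like the sketch below, in plain Python. The function name `balanced_batch`, the 50/50 split, and the choice to draw the rare positives with replacement are assumptions made for the illustration, not something prescribed above.

```python
import random

def balanced_batch(pos_indices, neg_indices, batch_size, rng=random):
    """Return shuffled sample indices, half from each class."""
    half = batch_size // 2
    # positives are rare, so draw them with replacement
    pos = [rng.choice(pos_indices) for _ in range(half)]
    # negatives are plentiful, so draw them without replacement
    neg = rng.sample(neg_indices, batch_size - half)
    batch = pos + neg
    rng.shuffle(batch)
    return batch

# toy labels with roughly one positive per 250 samples
labels = [1 if i % 250 == 0 else 0 for i in range(10_000)]
pos_indices = [i for i, y in enumerate(labels) if y == 1]
neg_indices = [i for i, y in enumerate(labels) if y == 0]

batch = balanced_batch(pos_indices, neg_indices, batch_size=32)
print(sum(labels[i] for i in batch), "positives out of", len(batch))
# 16 positives out of 32
```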
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as @chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say that the dominant class is labelled negative (neg) for softmax and the other positive (pos) (for hinge loss you could do exactly the same):
L = L_{neg} + L_{pos} \Rightarrow L = L_{neg} + \alpha L_{pos}
With \alpha greater than 1.
In TensorFlow, for cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
What is more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
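To illustrate the weighting L = L_{neg} + \alpha L_{pos} without depending on a particular TensorFlow version, here is a small plain-Python sketch for the binary case. It assumes labels of 1 for positives and 0 for negatives and predicted probabilities as input; the function name and the clipping constant `eps` are choices made for the example. As far as I know, tf.nn.weighted_cross_entropy_with_logits applies the same kind of positive-class weight, but on logits rather than probabilities.

```python
import math

def weighted_binary_cross_entropy(y_true, p_pred, alpha, eps=1e-12):
    """Mean of -(alpha * y * log(p) + (1 - y) * log(1 - p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(alpha * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 0, 0]
p_pred = [0.2, 0.1, 0.1, 0.1]  # the single positive is badly predicted

plain = weighted_binary_cross_entropy(y_true, p_pred, alpha=1.0)
weighted = weighted_binary_cross_entropy(y_true, p_pred, alpha=250.0)
print(plain < weighted)  # True: the missed positive now dominates the loss
```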
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes, a neural network could help in your case. There are at least two approaches to such a problem:
Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might help more than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts to adjust itself to increase performance on the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may also reweight your classes in the cost function (as mentioned in one of the answers). The important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I am training a fully convolutional neural network, with 3080*16 input images for training, giving 16 images in a batch. I am doing this for 100 epochs.
in every epoch:
    after each batch:
        calculate errors, do weight update, get confusion matrix
    after each validation_batch:
        calculate errors and confusion matrix
I am trying to give the maximum batch size possible.
In this situation (when the number of epochs is fixed), you have a trade-off between the number of updates and the quality of each update. The more often you update your network (the smaller the batch), the better the network you might get (assuming that you are using the right regularization and baby-sitting the training). The better each update approximates the true update (the bigger the batch), the faster your network might converge to a good solution, skipping changes that could actually worsen your model.
The best way to set the batch size is either to research whether someone has already found the best batch size for your task, or to run a grid / random search: choose a set of reasonable batch sizes and test each one to find the best value.
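A grid search over batch sizes can be as simple as the sketch below. The callable `validation_loss` is hypothetical: in practice it would train the network with the given batch size for the fixed number of epochs and return the validation loss. Here a toy stand-in is used so that the example runs on its own.

```python
def best_batch_size(candidates, validation_loss):
    """Evaluate every candidate batch size and return (best, all_results)."""
    results = {bs: validation_loss(bs) for bs in candidates}
    best = min(results, key=results.get)
    return best, results

# Toy stand-in for "train with this batch size, return the validation loss".
def toy_validation_loss(batch_size):
    return (batch_size - 64) ** 2 / 1000.0

best, results = best_batch_size([16, 32, 64, 128], toy_validation_loss)
print(best)  # 64
```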
I am trying to implement a general SOM with batch training, and I have a doubt regarding the formula for batch training.
I have read about it at the following links:
http://cs-www.cs.yale.edu/c2/images/uploads/HR15.pdf
https://notendur.hi.is//~benedikt/Courses/Mia_report2.pdf
I noticed that the weight updates are assigned rather than added at the end of an epoch. Wouldn't that overwrite the whole network's previous values? The update formula does not include the previous weights of the nodes, so how does it even work?
When I was implementing it, a lot of the nodes in the network became NaN, because the neighborhood value became zero for a lot of nodes as the neighborhood shrank towards the end of training, and the update formula resulted in a division by zero.
Can someone explain the batch algorithm correctly? I DID google it, and I saw a lot of "improving batch" or "speeding up batch" but nothing about plain batch Kohonen directly. Among the sources that did explain it, the formula was the same, and that doesn't work.
The update rule of the Batch SOM that you see is the correct one.
The basic idea behind this algorithm is to train your SOM using the whole training dataset, so that at each iteration the weights of your neurons represent the (neighborhood-weighted) mean of the closest inputs.
And so, the information from the previous weights is carried by the BMU (Best Matching Unit) assignments: the previous weights determine which unit is the BMU for each input.
As you said, some neurons' weights become NaN due to division by zero.
To overcome this problem, you can use a neighborhood function that is always greater than zero (for example, a Gaussian function).
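As a minimal sketch of one Batch SOM epoch with a Gaussian neighborhood, assuming a 1-D map and plain Python lists (the function name `batch_som_epoch` and the toy data are made up for the example): each new weight is the neighborhood-weighted mean of all inputs, and the old weights only enter through the BMU search. The Gaussian is positive mathematically but can underflow to 0.0 in floating point for a very small sigma, so the sketch keeps the old weight in that case; that guard is my own addition, not part of the standard formula.

```python
import math

def batch_som_epoch(weights, data, sigma):
    """One Batch SOM epoch on a 1-D map; returns the new list of weight vectors."""
    n_units, dim = len(weights), len(weights[0])
    numer = [[0.0] * dim for _ in range(n_units)]
    denom = [0.0] * n_units
    for x in data:
        # best matching unit: the unit whose (old) weight vector is closest to x
        bmu = min(range(n_units),
                  key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))
        for j in range(n_units):
            # Gaussian neighborhood on the map grid
            h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
            denom[j] += h
            for d in range(dim):
                numer[j][d] += h * x[d]
    # assign (not add): new weight = weighted mean of the inputs;
    # keep the old weight if the denominator underflowed to zero
    return [[numer[j][d] / denom[j] for d in range(dim)] if denom[j] > 0.0
            else list(weights[j])
            for j in range(n_units)]

weights = [[0.0], [0.5], [1.0]]
data = [[0.1], [0.2], [0.9]]
for sigma in (1.0, 0.5, 0.1):  # shrinking neighborhood over the epochs
    weights = batch_som_epoch(weights, data, sigma)
print(weights)
```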