If I have 1000 observations in my dataset with 15 features and 1 label, how is the data fed into the input neurons for the forward pass and backpropagation? Is it fed row-wise (one observation at a time), with the weights updated after each observation, or is the full data given as an input matrix, with the network learning the weights over a number of epochs? Also, if it is fed one at a time, what does an epoch mean in that case?
Thanks
Assuming that the data is formatted into rows (1000 instances with 16 columns each, the last one being the label), you would feed in the first 15 features row by row and use the last column as the target. This is called online learning. Online learning requires you to feed the data in one example at a time and perform the backpropagation and weight update for every example. As you can imagine, this can get quite computationally intensive, since you backpropagate and update for every single instance of your data.
The other option you mentioned is feeding the entire dataset into the network at once. This performs poorly in practice, as the convergence is extremely slow.
In practice, mini-batches are used: a small subset of the dataset is sent through, followed by the backpropagation and weight update. This gives the benefit of relatively frequent weight updates to speed up learning while being less intensive than online learning. For more information on mini-batches, see this University of Toronto lecture by Geoffrey Hinton.
Finally, an epoch is always one full pass through all of your data. It doesn't matter whether you feed it in one example at a time or all at once.
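For concreteness, here is a minimal NumPy sketch of one epoch under each scheme. The data shapes match your question, but the linear "network", gradient function, and learning rate are placeholder stand-ins for a real model:

    import numpy as np

    X = np.random.rand(1000, 15)   # 1000 observations, 15 features (as in the question)
    y = np.random.rand(1000)       # 1 label per observation
    w = np.zeros(15)               # a single linear neuron as a stand-in for a real network
    lr = 0.01                      # hypothetical learning rate

    def grad(Xb, yb, w):
        # Mean-squared-error gradient for the linear stand-in (plays the role of backprop)
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    # Online learning: one epoch = 1000 forward/backward passes and 1000 weight updates
    for i in range(len(X)):
        w -= lr * grad(X[i:i+1], y[i:i+1], w)

    # Full batch: one epoch = a single pass over everything and one weight update
    w -= lr * grad(X, y, w)

    # Mini-batch (size 50): one epoch = 20 passes and 20 weight updates
    for start in range(0, len(X), 50):
        w -= lr * grad(X[start:start+50], y[start:start+50], w)

In all three cases, one epoch means every observation has been seen exactly once; only the number of weight updates per epoch differs.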
I hope this clarified your questions.
I have a dataset of daily temperatures for a couple of years. The data is in interval form: a daily high and a daily low.
I want to forecast this data, and I recently read several papers mentioning that the multilayer perceptron has advantages for this task. However, after reading the papers I am still puzzled. I know that to build one I will need an input layer, a hidden layer, and an output layer. But in Matlab, although I already have the code, I still don't know how to set up the simulation. What should I use as input and output? Should I use the interval data for both? And how should I choose the hidden layer?
The input to an MLP is the feature data from which you are trying to predict an outcome. The output is the quantity you are trying to predict. The hidden layer largely determines how well the network predicts: you want it as large as it needs to be to achieve reasonable prediction results, but no larger. If it is too large, the network just memorizes the training data rather than generalizing to a pattern.
For example, your input layer could be the day of the year (1-365), the day's high, and the day's low. The output, I assume, would be the high and low temperature for the next day.
The more relevant input features you have, the better the network will be.
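You mentioned Matlab, but for illustration here is a minimal sketch of that setup in Keras (version 2 API); the layer sizes, training settings, and random data are placeholders, not tuned recommendations:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # Placeholder data: inputs are (day of year, today's high, today's low),
    # targets are (tomorrow's high, tomorrow's low)
    X = np.random.rand(730, 3)
    y = np.random.rand(730, 2)

    model = Sequential()
    model.add(Dense(16, activation='relu', input_dim=3))  # hidden layer; its size is the tuning knob discussed above
    model.add(Dense(2))                                   # linear output for the two temperatures
    model.compile(optimizer='adam', loss='mse')
    model.fit(X, y, epochs=50, batch_size=32, verbose=0)

If the hidden layer is made much larger, watch the validation error for signs of the memorization mentioned above.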
Is there a limit on how many times one can train a CNN model? That is, say I have my CNN model and a training set. I train my model and test it on unseen data. If I am not satisfied with the test accuracy, can I retrain my CNN as many times as I like (hypothetically) and test it again, until the performance is better?
I know there are other ways to improve performance, such as changing the structure of the network, the filter size, and the number of filters, but say I want to keep the structure and hyper-parameters fixed. Also, I notice that when I train my CNN for the fifth or sixth time, it gives me better test accuracy.
Is this correct?
Thanks for your time and help.
--Venkat
There is no limit on the number of times one can train a neural network, but the important thing is to save the weights of your model every so often so that you can reload them whenever you want and continue from wherever the training left off. This saves both time and compute. The number of iterations a neural network needs varies from dataset to dataset and architecture to architecture; typically, shallow models need fewer iterations and deeper models need more. I have worked on models producing good results in a single iteration as well as models converging after fifty iterations.
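In Keras, for example, saving and resuming could look like this sketch (the file name, placeholder data, and epoch counts are arbitrary):

    import numpy as np
    from keras.models import Sequential, load_model
    from keras.layers import Dense

    X = np.random.rand(100, 15)     # placeholder data
    y = np.random.rand(100, 1)

    model = Sequential()
    model.add(Dense(8, activation='relu', input_dim=15))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')

    model.fit(X, y, epochs=10, batch_size=32, verbose=0)
    model.save('checkpoint.h5')     # stores weights, architecture, and optimizer state

    # Later, even in a new session: reload and continue from where training left off
    model = load_model('checkpoint.h5')
    model.fit(X, y, epochs=10, batch_size=32, verbose=0)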
I am currently trying to do time series prediction with an LSTM implemented in Keras.
I tried to train an LSTM model with 10,000 samples in the training set and 2,500 samples in the test set. I am using a batch size of 30.
Now I am trying to train the exact same model but with more data: a training set of 100,000 samples and a test set of 25,000 samples.
The time for one epoch is multiplied by 100 when using the big dataset.
Even though I have more data, the batch size is the same, so I expected training not to take more time. Is it possible that it is the calculation of the loss on the train and test data that takes so long (all the data is used there)?
Concerning the batch size: should I increase it because I have more data?
EDIT 1
I tried changing the batch size and using a bigger one. When I do that, the training time decreases a lot.
Shouldn't the computation of the gradient be longer with a big batch size than with a small one?
I have no clue here; I really do not understand why this is happening.
Does someone know why this is happening? Is it linked to the data I use? How, theoretically, can this happen?
EDIT 2
My processor is an Intel Xeon W3520 (4 cores / 8 threads) with 32 GB of RAM.
The data is composed of sequences of length 6 with 4 features. I use one LSTM layer with 50 units and a dense output layer. Whether I am training with 10,000 samples or 100,000, it is really the batch size that changes the computation time: I can go from 2 seconds for one epoch with a batch size of 1000 to 200 seconds with a batch size of 30.
I do not use a generator; I use the basic call model.fit(Xtrain, Ytrain, nb_epoch=nb_epoch, batch_size=batch_size, verbose=2, callbacks=callbacks, validation_data=(Xtest, Ytest)) with callbacks = [EarlyStopping(monitor='val_loss', patience=10, verbose=2), history].
It seems you have misunderstood parts of how SGD (Stochastic Gradient Descent) works.
I explained parts of this in another answer here on Stack Overflow, which might help you understand it better, but I'll take the time to explain it again here.
The basic idea of classical Gradient Descent is to calculate the forward pass (and store the activations) for all training samples, and only afterwards update your weights once. Now, since you might not have enough memory to store all the activations (which you need for the calculation of your backpropagation gradient), and for other reasons (mainly convergence), you often cannot do classical gradient descent.
Stochastic Gradient Descent makes the assumption that, by sampling in a random order, you can reach convergence by looking at only one training sample at a time, and then updating directly after. This is called an iteration, whereas we call the pass through all training samples an epoch.
Mini-batches change SGD only in that, instead of using one single sample, you take a "handful" of samples, with the number determined by the batch size.
Now, updating the weights is a quite costly process, and it should be clear at this point that updating the weights a great number of times per epoch (as with SGD) is more costly than computing the gradient over more samples and updating only a few times (as with a large batch size).
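Applied to the numbers from your question, the count of weight updates per epoch explains most of the gap (a back-of-the-envelope check, not an exact timing model):

    import math

    n_train = 100000
    for batch_size in (30, 1000):
        print(batch_size, math.ceil(n_train / batch_size))
    # batch_size 30   -> 3334 weight updates per epoch
    # batch_size 1000 ->  100 weight updates per epoch

Each update also launches a separate, and for small batches poorly vectorized, computation, which is why the measured ratio can be even larger than the raw update-count ratio.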
I am trying to implement a general SOM with batch training, and I have a doubt regarding the formula for batch training.
I have read about it at the following links:
http://cs-www.cs.yale.edu/c2/images/uploads/HR15.pdf
https://notendur.hi.is//~benedikt/Courses/Mia_report2.pdf
I noticed that at the end of an epoch the weight updates are assigned rather than added. Wouldn't that overwrite the network's previous values? And since the update formula does not include the previous weights of the nodes, how does it even work?
When I was implementing it, a lot of the nodes in the network became NaN, because the neighborhood value became zero for many nodes as it shrank toward the end of training, and the update formula then resulted in a division by zero.
Can someone explain the batch algorithm properly? I DID google it, and I saw a lot of "improving batch" or "speeding up batch" results, but nothing about plain batch Kohonen directly. And among the ones that did explain it, the formula was the same one that doesn't work for me.
The update rule of the batch SOM that you have seen is the correct one.
The basic idea behind this algorithm is to train your SOM using the whole training dataset, so that at each iteration the weights of your neurons represent the mean of the closest inputs.
So the information from the previous weights enters only through the BMU (Best Matching Unit) assignments.
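In the usual notation (which may differ slightly from the linked papers), the batch update is

    w_i = sum_j h(b(x_j), i) * x_j / sum_j h(b(x_j), i)

where b(x_j) is the BMU of sample x_j under the current weights and h is the neighborhood function. The current weights enter only through the BMU assignments, which is why the new value can simply be assigned rather than added.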
As you said, some neuron weights become NaN due to division by zero.
To overcome this problem, you can use a neighborhood function that is always greater than zero (for example, a Gaussian).
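A minimal sketch of such a neighborhood function on a 2D grid (the positions and width are placeholder arguments):

    import numpy as np

    def gaussian_neighborhood(bmu_pos, node_pos, sigma):
        # Strictly positive for sigma > 0, so the denominator of the
        # batch update cannot become exactly zero by construction
        d2 = np.sum((np.asarray(bmu_pos) - np.asarray(node_pos)) ** 2)
        return np.exp(-d2 / (2 * sigma ** 2))

In floating point it can still underflow to zero for distant nodes, so it is worth keeping sigma from shrinking all the way to zero during training.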
Possibly an ANN 101 question regarding mini-batch processing. Google didn't seem to have the answer. A search here didn't yield anything either. My guess is there's a book somewhere that says "do it this way!" and I just haven't read that book.
I'm coding a neural net in Python (not that the language matters). I'm attempting to add mini-batch updates instead of full-batch ones. Is it necessary to select each observation exactly once per epoch? The mini-batches would be data values 1:10, 11:20, 21:30, etc., so that all observations are used, and each is used exactly once.
Or is it correct to select each mini-batch randomly from the training set based on a probability? The result would be that each observation may be used once, multiple times, or not at all in any given epoch. For 20 mini-batches per epoch, each data element would have a 5% chance of being selected for any given mini-batch: the mini-batches would be randomly selected and random in size, with approximately 1 in 20 data points included in each of the 20 mini-batches, but with no guarantee of selection.
Some tips regarding mini-batch training:
Shuffle your samples before every epoch
The reason is the same as why you shuffle the samples in online training: Otherwise the network might simply memorize the order in which you feed the samples.
Use a fixed batch size for every batch and for every epoch
There is probably also a statistical reason, but a fixed size also simplifies the implementation, as it enables you to use fast matrix-multiplication routines (e.g. BLAS) for your calculations.
Adapt your learning rate to the batch size
For larger batches you'll have to use a smaller learning rate; otherwise the ANN tends to converge towards a sub-optimal minimum. I always scaled my learning rates by 1/sqrt(n), where n is the batch size. Please note that this is just an empirical value from my experiments.
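Put together, a minimal NumPy sketch of these three tips; the data, the gradient function, and the base learning rate are placeholder stand-ins for a real network:

    import numpy as np

    X = np.random.rand(1000, 15)      # placeholder data
    y = np.random.rand(1000)
    w = np.zeros(15)                  # a single linear neuron as a stand-in
    batch_size = 50                   # fixed for every batch and every epoch
    lr = 0.1 / np.sqrt(batch_size)    # learning rate scaled by 1/sqrt(n)

    def grad(Xb, yb, w):
        # Mean-squared-error gradient for the linear stand-in
        return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

    for epoch in range(20):
        perm = np.random.permutation(len(X))        # shuffle before every epoch
        X, y = X[perm], y[perm]
        for start in range(0, len(X), batch_size):  # fixed-size batches
            sl = slice(start, start + batch_size)
            w -= lr * grad(X[sl], y[sl], w)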
Your first guess is correct.
Just shuffle your dataset first.
Then, for mini-batches of (say) 20, use samples 1-20, then 21-40, and so on.
That way, all of your dataset will be used.
That doesn't mean the dataset is used only once, though: you normally need to run multiple epochs over the whole dataset for your network to learn properly.
Mini-batches are primarily used to speed up the learning process.