I'm training a ResNet on the CIFAR-10 dataset and the training accuracy is mostly increasing (in roughly 95% of epochs), but sometimes it drops by 5-10% and then starts increasing again.
Here is an example:
Epoch 45/100
40000/40000 [==============================] - 50s 1ms/step - loss: 0.0323 - acc: 0.9948 - val_loss: 1.6562 - val_acc: 0.7404
Epoch 46/100
40000/40000 [==============================] - 52s 1ms/step - loss: 0.0371 - acc: 0.9932 - val_loss: 1.6526 - val_acc: 0.7448
Epoch 47/100
40000/40000 [==============================] - 50s 1ms/step - loss: 0.0266 - acc: 0.9955 - val_loss: 1.6925 - val_acc: 0.7426
Epoch 48/100
40000/40000 [==============================] - 50s 1ms/step - loss: 0.0353 - acc: 0.9940 - val_loss: 2.2682 - val_acc: 0.6496
Epoch 49/100
40000/40000 [==============================] - 50s 1ms/step - loss: 1.6391 - acc: 0.4862 - val_loss: 1.2524 - val_acc: 0.5659
Epoch 50/100
40000/40000 [==============================] - 52s 1ms/step - loss: 0.9220 - acc: 0.6830 - val_loss: 0.9726 - val_acc: 0.6738
Epoch 51/100
40000/40000 [==============================] - 51s 1ms/step - loss: 0.5453 - acc: 0.8165 - val_loss: 1.0232 - val_acc: 0.6963
I quit execution after this, but this was my second run; in the first run the same thing happened, and after some time the accuracy climbed back to 99%.
The batch size is 128, so I guess that is not the problem. I haven't changed the learning rate or any other Adam parameters, but I guess that's also not the issue, since accuracy is increasing most of the time.
So, why are those sudden drops happening?
Since the training and validation losses both jump up (and the accuracies drop) at the same time, it looks like your optimization algorithm has temporarily overshot the downhill part of the loss surface it was trying to follow.
Remember that gradient descent and related methods calculate the gradient at a point and then use it (and sometimes some additional state) to choose the direction and distance to move. This is not always perfect, and sometimes the step goes too far and ends up further uphill again.
If your learning rate is aggressive you will see this every now and then, but you might still converge faster than with a smaller learning rate. You can experiment with different learning rates, but I would not be concerned unless your loss starts to diverge.
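If you want to experiment, here is a minimal sketch (assuming a Keras model called model and placeholder arrays x_train/y_train/x_val/y_val): either use a smaller fixed Adam learning rate, or let Keras shrink the rate automatically when the validation loss plateaus.
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau

# Option 1: a smaller fixed learning rate than the Adam default of 1e-3
model.compile(optimizer=Adam(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Option 2: keep the default rate but halve it when val_loss stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=3, min_lr=1e-6, verbose=1)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100, batch_size=128,
          callbacks=[reduce_lr])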
While running my model, the time for each epoch is 0 sec. Does that mean there is something wrong with my model? (Screenshot of the training output omitted.)
Probably training is just fast.
These epoch times are entirely possible when using GPUs, although only on smaller datasets/architectures.
Say you are training MNIST with a fairly straightforward network (a LeNet-like CNN); these times are pretty normal.
From scanning the loss/accuracy values, something is clearly happening, since the numbers are not constant.
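For a rough sense of scale, here is an illustrative sketch (not your model) of the kind of small MNIST CNN where near-instant epochs on a GPU are plausible:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

# Load and normalize MNIST
(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

# A LeNet-like CNN: small enough for a GPU to finish an epoch in a second or two
model = Sequential([
    Conv2D(6, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(),
    Conv2D(16, (5, 5), activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(120, activation='relu'),
    Dense(84, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, batch_size=128)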
I am learning NN and Keras. My test data is something like this:
Result, HomeWinPossibility, DrawPossibility, AwayWinPossibility
[['AwayWin' 0.41 0.28 0.31]
['HomeWin' 0.55 0.25 0.2]
['AwayWin' 0.17 0.21 0.62]
.....
Here is my model:
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(16, input_shape=(3,)))   # 3 input features: the three possibilities
model.add(Activation('sigmoid'))
model.add(Dense(8, activation='relu'))
model.add(Dense(3))                      # 3 output classes: HomeWin / Draw / AwayWin
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_X, train_y_ohe, epochs=100, batch_size=1, verbose=1)
The output from fit is:
Epoch 1/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9151 - acc: 0.5737
Epoch 2/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9181 - acc: 0.5474
Epoch 3/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9111 - acc: 0.5526
....
Epoch 100/100
190/190 [==============================] - 1s 3ms/step - loss: 0.9130 - acc: 0.5579
So why is the loss not going down, as in some NN tutorials I read? Is it because the data I provided is just noise, so the NN can't find any clue, or is something not right with my model?
As the acc is always around 0.55 (so roughly 50%), does that mean the NN actually achieved better than random guessing (> 33%)? If so, why did it already achieve an accuracy of 0.57 at the first epoch?
So why is the loss not going down, as in some NN tutorials I read?
It might be many reasons - all depending on your data. Here are things you could adjust:
You have a very low batch size. Although some data might actually respond well to this, a batch size of 1 is too small in most cases (not to mention how redundant the updates become when every batch is a single sample). Batch size depends a lot on how much, and what kind of, data you have, but try somewhere around 20-30 if you have sufficient data.
Try different activation functions (but always have softmax or sigmoid in the last layer because you want numbers between 0 and 1).
Increase the number of units in the first and/or second layer (if you have enough data).
Try to set the learning rate (lr) for the Adam optimizer: model.compile(optimizer=keras.optimizers.Adam(lr=0.001), ...)
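Putting a few of those suggestions together, here is a hedged sketch of one possible adjusted model (the larger layer sizes, the batch size of 32 and the explicit learning rate are illustrative choices, not values taken from your setup):
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(3,)))  # more units, ReLU instead of sigmoid
model.add(Dense(32, activation='relu'))
model.add(Dense(3, activation='softmax'))                  # softmax keeps the outputs between 0 and 1

# Explicit learning rate for Adam (0.001 is also the Keras default)
model.compile(optimizer=Adam(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_X, train_y_ohe, epochs=100, batch_size=32, verbose=1)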
Is it because the data I provided is just noise?
If your data is pure noise across classes, then very probably - given that there are roughly the same number of datapoints in each class - the accuracy would be around 33%, since the network would essentially just be guessing at random.
As the acc is always around 0.55 (so roughly 50%), does that mean the NN actually achieved better than random guessing (> 33%)?
Not necessarily. Accuracy is a measure of how many samples were correctly classified. Say that the validation data (conventionally the part of the dataset that the accuracy is calculated on) only contains data from one class. Then if the NN simply classifies everything as that one class, it would reach 100% accuracy on the validation data!
That means that if you don't have the same number of datapoints from each class, accuracy alone is not to be trusted! A much better measure for unbalanced datasets is e.g. the AUC (area under the ROC curve) or the F1 score, which takes false positives into account as well.
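As a minimal sketch with scikit-learn (val_X and val_y_ohe are assumed placeholder names for held-out features and one-hot labels, not variables from your code):
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

probs = model.predict(val_X)            # predicted class probabilities, shape (n_samples, 3)
y_pred = np.argmax(probs, axis=1)       # predicted class indices
y_true = np.argmax(val_y_ohe, axis=1)   # true class indices from the one-hot labels

print(f1_score(y_true, y_pred, average='macro'))                         # treats all classes equally
print(roc_auc_score(y_true, probs, average='macro', multi_class='ovr'))  # one-vs-rest AUC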
I would recommend that you look into the theory behind this. Blindly tweaking parameters will probably be very frustrating because you would have a hard time getting good results, and even when you do get good results, they might not be as good as you think. One place to start reading would be Ian Goodfellow's book on deep learning.
I'm training a network that has multiple losses, and I'm both creating the data and feeding it into the network using a generator.
I've checked the structure of the data and it looks generally fine, and the model also trains pretty much as expected most of the time. However, at a random epoch almost every time, the training loss for every prediction suddenly jumps from, say,
# End of epoch 3
loss: 2.8845
to
# Beginning of epoch 4
loss: 1.1921e-07
I thought it could be the data; however, from what I can tell the data is generally fine, and it is even more suspicious because this happens at a random epoch (could it be because of a random data point chosen during SGD?) but then persists for the rest of training. That is, if at epoch 3 the training loss drops to 1.1921e-07, it will stay that way in epoch 4, epoch 5, and so on.
However, sometimes it reaches epoch 5 without doing this and then does it at epoch 6 or 7.
Is there any viable reason outside of the data that could cause this? Could a few bad data points really cause this so quickly?
Thanks
EDIT:
Results:
300/300 [==============================] - 339s - loss: 3.2912 - loss_1: 1.8683 - loss_2: 9.1352 - loss_3: 5.9845 -
val_loss: 1.1921e-07 - val_loss_1: 1.1921e-07 - val_loss_2: 1.1921e-07 - val_loss_3: 1.1921e-07
The next epochs after this all have training loss 1.1921e-07.
Not entirely sure how satisfactory this is as an answer, but my findings seem to show that using multiple categorical_crossentropy losses together results in a very unstable network. Swapping them out for other loss functions fixes the problem, with the data remaining unchanged.
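For illustration only, here is a hedged sketch of what that kind of swap can look like on a multi-output functional model (the layer sizes, output names and replacement losses are assumptions, not my original network):
from keras.models import Model
from keras.layers import Input, Dense

inputs = Input(shape=(32,))                                   # illustrative input size
shared = Dense(64, activation='relu')(inputs)
out_1 = Dense(10, activation='softmax', name='out_1')(shared)
out_2 = Dense(10, activation='softmax', name='out_2')(shared)
model = Model(inputs=inputs, outputs=[out_1, out_2])

# Original setup: categorical_crossentropy on every head
# model.compile(optimizer='adam',
#               loss={'out_1': 'categorical_crossentropy',
#                     'out_2': 'categorical_crossentropy'})

# Swap: a different loss per output head
model.compile(optimizer='adam',
              loss={'out_1': 'kullback_leibler_divergence',
                    'out_2': 'mean_squared_error'},
              loss_weights={'out_1': 1.0, 'out_2': 1.0})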
I ran the example code for LSTM networks that uses the IMDB dataset in Keras. The code can be found at the following link.
imdb_lstm.py
My problem is that as the code progresses, the training loss decreases and the training accuracy increases as expected, but the validation accuracy fluctuates within an interval and the validation loss increases to a high value. I attach part of the training log below. I also observe that when the training loss is very small (~0.01-0.03), it sometimes increases in the next epoch and then decreases again; this can be seen in epochs 75-77. In general, though, it decreases.
What I expect is that the training accuracy keeps increasing up to 0.99-1 and the training loss keeps decreasing. Moreover, the validation accuracy should start from maybe 0.4 and rise to, for example, 0.8 at the end. If the validation accuracy does not improve over the epochs, what is the point of waiting through them? The test accuracy is also close to 0.81 at the end.
I also tried with my own data and ended up in the same situation. I processed my data in a similar way; that is, my training, validation and test points are processed with the same logic as the ones in this example code.
Besides, I did not understand how this code represents the whole sentence after obtaining the LSTM outputs for each word. Does it perform mean or max pooling, or does it take only the last output of the LSTM layer before feeding it to a logistic regression classifier?
Any help would be appreciated.
Using Theano backend.
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
X_train shape: (25000, 80)
X_test shape: (25000, 80)
Build model...
Train...
Train on 22500 samples, validate on 2500 samples
Epoch 1/100
22500/22500 [==============================] - 236s - loss: 0.5438 - acc: 0.7209 - val_loss: 0.4305 - val_acc: 0.8076
Epoch 2/100
22500/22500 [==============================] - 237s - loss: 0.3843 - acc: 0.8346 - val_loss: 0.3791 - val_acc: 0.8332
Epoch 3/100
22500/22500 [==============================] - 245s - loss: 0.3099 - acc: 0.8716 - val_loss: 0.3736 - val_acc: 0.8440
Epoch 4/100
22500/22500 [==============================] - 243s - loss: 0.2458 - acc: 0.9023 - val_loss: 0.4206 - val_acc: 0.8372
Epoch 5/100
22500/22500 [==============================] - 239s - loss: 0.2120 - acc: 0.9138 - val_loss: 0.3844 - val_acc: 0.8384
....
....
Epoch 75/100
22500/22500 [==============================] - 238s - loss: 0.0134 - acc: 0.9868 - val_loss: 0.9045 - val_acc: 0.8132
Epoch 76/100
22500/22500 [==============================] - 241s - loss: 0.0156 - acc: 0.9845 - val_loss: 0.9078 - val_acc: 0.8211
Epoch 77/100
22500/22500 [==============================] - 235s - loss: 0.0129 - acc: 0.9883 - val_loss: 0.9105 - val_acc: 0.8234
When to stop training: it is usual to stop training when some metric computed on the validation data starts to grow, which is a common indicator of overfitting. But note that you are using dropout, which means a slightly different model is trained during every epoch; that is why you should apply some kind of patience and only stop training when a phenomenon like this persists over several consecutive epochs.
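A minimal sketch of that patience-based stopping in Keras (the patience value and the X_train/y_train names are placeholders):
from keras.callbacks import EarlyStopping

# Stop only after val_loss has failed to improve for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=1)
model.fit(X_train, y_train,
          validation_split=0.1,
          epochs=100, batch_size=32,
          callbacks=[early_stop])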
The reason for the fluctuations: the same as in the first point - you are using dropout, which introduces some randomness into your network. In my opinion, this is the main reason for the fluctuations you observe.
What the Dense layer takes as its input: if you study the documentation of the LSTM/RNN layer carefully, you will notice that return_sequences=False is the default argument. This means that only the last output of the processed sequence is passed as input to the following layer, so there is no mean or max pooling. You could change that, e.g. by setting return_sequences=True and processing the full sequence with 1-D convolutions or pooling.
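A hedged sketch of the two variants (the vocabulary size, sequence length and layer widths follow the general shape of the imdb_lstm example but are illustrative, not its exact code):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, GlobalMaxPooling1D

# Variant 1: default return_sequences=False - the Dense layer sees only the last LSTM output
model = Sequential()
model.add(Embedding(20000, 128, input_length=80))
model.add(LSTM(128))                              # one 128-dim vector per review
model.add(Dense(1, activation='sigmoid'))

# Variant 2: keep every timestep and pool over the sequence instead
model2 = Sequential()
model2.add(Embedding(20000, 128, input_length=80))
model2.add(LSTM(128, return_sequences=True))      # an 80 x 128 output sequence
model2.add(GlobalMaxPooling1D())                  # max over the time dimension
model2.add(Dense(1, activation='sigmoid'))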