TFLearn multivariable regression does not converge (attempting to duplicate Matlab's fitnet)

I am trying to write a model in TFLearn that fits 16 output parameters.
I have previously run this same experiment in Matlab using the "fitnet" function with 2 hidden layers of 2000 and 1500 nodes.
I am attempting to replicate these results in TensorFlow before exploring other architectures, descent algorithms, and hyperparameter tuning. From some research, I determined that Matlab's fitnet uses tanh nodes for the hidden layers and a linear output layer. Its descent algorithm defaults to Levenberg-Marquardt, but the fit also worked for me with other algorithms (e.g. SGD).
In TFLearn the accuracy appears to max out around 0.2 and then oscillate below that over successive epochs. I did not see this behavior in Matlab.
My TFLEARN code looks like:
import numpy as np
import tflearn

tnorm = tflearn.initializations.uniform_scaling()
adam = tflearn.optimizers.Adam(learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')

# Network building
input_data = tflearn.input_data(shape=[None, np.shape(prepared_x)[1]])
fc1 = tflearn.fully_connected(input_data, 2000, activation='tanh', weights_init=tnorm)
fc2 = tflearn.fully_connected(fc1, 1500, activation='tanh', weights_init=tnorm)
output = tflearn.fully_connected(fc2, 16, activation='linear', weights_init=tnorm)
network = tflearn.regression(output, optimizer=adam, loss='mean_square')

# Define model with checkpoints
model = tflearn.DNN(network, tensorboard_dir='output/', tensorboard_verbose=3, checkpoint_path='output')

# Train model
model.fit(prepared_x, prepared_t, n_epoch=5, batch_size=100, shuffle=True, show_metric=True, snapshot_epoch=False, validation_set=0.1)

# Save
model.save('TFLEARN_FC_final.tfl')
The output of the training session looks like:
Run id: UTSD6N
Log directory: output/
---------------------------------
Training samples: 43200
Validation samples: 4800
--
Training Step: 1
| Adam | epoch: 000 | loss: 0.00000 - acc: 0.0000 -- iter: 00100/43200
Training Step: 2 | total loss: 0.67871
| Adam | epoch: 000 | loss: 0.67871 - acc: 0.0455 -- iter: 00200/43200
Training Step: 3 | total loss: 33.14599
| Adam | epoch: 000 | loss: 33.14599 - acc: 0.0082 -- iter: 00300/43200
Training Step: 4 | total loss: 28.01067
| Adam | epoch: 000 | loss: 28.01067 - acc: 0.0021 -- iter: 00400/43200
Training Step: 5 | total loss: 17.35706
| Adam | epoch: 000 | loss: 17.35706 - acc: 0.0006 -- iter: 00500/43200
Training Step: 6 | total loss: 9.73368
| Adam | epoch: 000 | loss: 9.73368 - acc: 0.0002 -- iter: 00600/43200
Training Step: 7 | total loss: 5.19867
| Adam | epoch: 000 | loss: 5.19867 - acc: 0.0001 -- iter: 00700/43200
Training Step: 8 | total loss: 3.54779
| Adam | epoch: 000 | loss: 3.54779 - acc: 0.0113 -- iter: 00800/43200
Training Step: 9 | total loss: 3.80998
| Adam | epoch: 000 | loss: 3.80998 - acc: 0.0106 -- iter: 00900/43200
Training Step: 10 | total loss: 4.33370
| Adam | epoch: 000 | loss: 4.33370 - acc: 0.0053 -- iter: 01000/43200
Training Step: 11 | total loss: 4.24100
...
| Adam | epoch: 004 | loss: 0.02448 - acc: 0.1817 -- iter: 42800/43200
Training Step: 2157 | total loss: 0.02633
| Adam | epoch: 004 | loss: 0.02633 - acc: 0.1875 -- iter: 42900/43200
Training Step: 2158 | total loss: 0.02509
| Adam | epoch: 004 | loss: 0.02509 - acc: 0.1688 -- iter: 43000/43200
Training Step: 2159 | total loss: 0.02525
| Adam | epoch: 004 | loss: 0.02525 - acc: 0.1529 -- iter: 43100/43200
Training Step: 2160 | total loss: 0.02695
| Adam | epoch: 005 | loss: 0.02695 - acc: 0.1456 -- iter: 43200/43200
[Image: accuracy/loss curves from TensorBoard]
Any suggestions would be much appreciated.

For any future lurkers: I solved my own problem by fixing the optimizer settings.
The default learning rate for the Adam optimizer is 0.001, but the rate I was using was too high; I had to switch to 0.005 for convergence.
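For reference, the change amounts to constructing the optimizer with a much smaller learning rate than the 0.1 used in the code above. A minimal sketch, keeping the rest of the network definition the same and treating the exact rate as a tuning choice:
# Sketch: same TFLearn setup as above, but with a smaller Adam learning rate.
# The exact value (0.005 here, per the note above) is a tuning choice, not a fixed rule.
adam = tflearn.optimizers.Adam(learning_rate=0.005, beta1=0.9, beta2=0.999, epsilon=1e-08)
network = tflearn.regression(output, optimizer=adam, loss='mean_square')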

Related

Changing the loss function leads to the neural network returning NaNs

I'm using Deep SVDD on CIFAR10 for one-class classification. When I change the L2 norm to an Lp norm with p < 1, I get NaNs after some epochs.
It works for loss = torch.mean((outputs - inputs)**2)
but I get NaN for loss = torch.mean(torch.abs(outputs - inputs)**0.9)
The loss for each epoch is shown here:
INFO:root: Epoch 1/50 Time: 1.514 Loss: 84.51767029
INFO:root: Epoch 2/50 Time: 1.617 Loss: 82.70055634
INFO:root: Epoch 3/50 Time: 1.528 Loss: 80.92372467
INFO:root: Epoch 4/50 Time: 1.612 Loss: 79.23560699
INFO:root: Epoch 5/50 Time: 1.495 Loss: 77.56893951
INFO:root: Epoch 6/50 Time: 1.596 Loss: 75.95311737
INFO:root: Epoch 7/50 Time: 1.504 Loss: 74.40722260
INFO:root: Epoch 8/50 Time: 1.593 Loss: 72.84329010
INFO:root: Epoch 9/50 Time: 1.639 Loss: 71.34644287
INFO:root: Epoch 10/50 Time: 1.578 Loss: 69.86484253
INFO:root: Epoch 11/50 Time: 1.553 Loss: 68.41005692
INFO:root: Epoch 12/50 Time: 1.670 Loss: 66.96582977
INFO:root: Epoch 13/50 Time: 1.607 Loss: 65.56927887
INFO:root: Epoch 14/50 Time: 1.573 Loss: 64.20584961
INFO:root: Epoch 15/50 Time: 1.605 Loss: 62.85230591
INFO:root: Epoch 16/50 Time: 1.483 Loss: 61.53305466
INFO:root: Epoch 17/50 Time: 1.616 Loss: 60.22836166
INFO:root: Epoch 18/50 Time: 1.499 Loss: 58.94760498
INFO:root: Epoch 19/50 Time: 1.611 Loss: 57.73990845
INFO:root: Epoch 20/50 Time: 1.507 Loss: 56.51732086
INFO:root: Epoch 21/50 Time: 1.624 Loss: 55.30994400
INFO:root: Epoch 22/50 Time: 1.482 Loss: 54.13251587
INFO:root: Epoch 23/50 Time: 1.606 Loss: 52.98952118
INFO:root: Epoch 24/50 Time: 1.508 Loss: 51.86713654
INFO:root: Epoch 25/50 Time: 1.587 Loss: 50.76639069
INFO:root: Epoch 26/50 Time: 1.523 Loss: 49.68750381
INFO:root: Epoch 27/50 Time: 1.574 Loss: 48.62197098
INFO:root: Epoch 28/50 Time: 1.537 Loss: 47.59307220
INFO:root: Epoch 29/50 Time: 1.560 Loss: 46.58890167
INFO:root: Epoch 30/50 Time: 1.607 Loss: 45.59774643
INFO:root: Epoch 31/50 Time: 1.504 Loss: 44.61755203
INFO:root: Epoch 32/50 Time: 1.592 Loss: 43.67579239
INFO:root: Epoch 33/50 Time: 1.480 Loss: 42.76135941
INFO:root: Epoch 34/50 Time: 1.577 Loss: 41.84933487
INFO:root: Epoch 35/50 Time: 1.488 Loss: 40.96647171
INFO:root: Epoch 36/50 Time: 1.596 Loss: 40.10220779
INFO:root: Epoch 37/50 Time: 1.534 Loss: 39.26658310
INFO:root: Epoch 38/50 Time: 1.615 Loss: 38.44916168
INFO:root: Epoch 39/50 Time: 1.518 Loss: nan
INFO:root: Epoch 40/50 Time: 1.574 Loss: nan
INFO:root: Epoch 41/50 Time: 1.511 Loss: nan
INFO:root: Epoch 42/50 Time: 1.556 Loss: nan
INFO:root: Epoch 43/50 Time: 1.565 Loss: nan
INFO:root: Epoch 44/50 Time: 1.561 Loss: nan
INFO:root: Epoch 45/50 Time: 1.600 Loss: nan
INFO:root: Epoch 46/50 Time: 1.518 Loss: nan
INFO:root: Epoch 47/50 Time: 1.618 Loss: nan
INFO:root: Epoch 48/50 Time: 1.540 Loss: nan
INFO:root: Epoch 49/50 Time: 1.591 Loss: nan
INFO:root: Epoch 50/50 Time: 1.504 Loss: nan
For different learning rates and output dimensions, the network still returns NaN after some epochs.
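One plausible cause (my addition, not from the original post) is that the gradient of |x|**p for p < 1 is unbounded as x approaches 0, so once any residual becomes very small the backward pass can produce inf/NaN. A minimal hedged sketch of stabilizing the loss with a small epsilon:
import torch

def lp_loss(outputs, inputs, p=0.9, eps=1e-6):
    # |x|**p has an unbounded derivative at x = 0 when p < 1;
    # adding a small eps keeps the gradient finite (at the cost of slightly changing the loss).
    return torch.mean((torch.abs(outputs - inputs) + eps) ** p)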

Save the best model trained with Faster RCNN (COCO dataset) in PyTorch while avoiding overfitting

I am training a Faster RCNN network on the COCO dataset with PyTorch.
I have followed this tutorial:
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
The training results are as follows:
Epoch: [6] [ 0/119] eta: 0:01:16 lr: 0.000050 loss: 0.3780 (0.3780) loss_classifier: 0.1290 (0.1290) loss_box_reg: 0.1848 (0.1848) loss_objectness: 0.0239 (0.0239) loss_rpn_box_reg: 0.0403 (0.0403) time: 0.6451 data: 0.1165 max mem: 3105
Epoch: [6] [ 10/119] eta: 0:01:13 lr: 0.000050 loss: 0.4129 (0.4104) loss_classifier: 0.1277 (0.1263) loss_box_reg: 0.2164 (0.2059) loss_objectness: 0.0244 (0.0309) loss_rpn_box_reg: 0.0487 (0.0473) time: 0.6770 data: 0.1253 max mem: 3105
Epoch: [6] [ 20/119] eta: 0:01:07 lr: 0.000050 loss: 0.4165 (0.4302) loss_classifier: 0.1277 (0.1290) loss_box_reg: 0.2180 (0.2136) loss_objectness: 0.0353 (0.0385) loss_rpn_box_reg: 0.0499 (0.0491) time: 0.6843 data: 0.1265 max mem: 3105
Epoch: [6] [ 30/119] eta: 0:01:00 lr: 0.000050 loss: 0.4205 (0.4228) loss_classifier: 0.1271 (0.1277) loss_box_reg: 0.2125 (0.2093) loss_objectness: 0.0334 (0.0374) loss_rpn_box_reg: 0.0499 (0.0484) time: 0.6819 data: 0.1274 max mem: 3105
Epoch: [6] [ 40/119] eta: 0:00:53 lr: 0.000050 loss: 0.4127 (0.4205) loss_classifier: 0.1209 (0.1265) loss_box_reg: 0.2102 (0.2085) loss_objectness: 0.0315 (0.0376) loss_rpn_box_reg: 0.0475 (0.0479) time: 0.6748 data: 0.1282 max mem: 3105
Epoch: [6] [ 50/119] eta: 0:00:46 lr: 0.000050 loss: 0.3973 (0.4123) loss_classifier: 0.1202 (0.1248) loss_box_reg: 0.1947 (0.2039) loss_objectness: 0.0315 (0.0366) loss_rpn_box_reg: 0.0459 (0.0470) time: 0.6730 data: 0.1297 max mem: 3105
Epoch: [6] [ 60/119] eta: 0:00:39 lr: 0.000050 loss: 0.3900 (0.4109) loss_classifier: 0.1206 (0.1248) loss_box_reg: 0.1876 (0.2030) loss_objectness: 0.0345 (0.0365) loss_rpn_box_reg: 0.0431 (0.0467) time: 0.6692 data: 0.1276 max mem: 3105
Epoch: [6] [ 70/119] eta: 0:00:33 lr: 0.000050 loss: 0.3984 (0.4085) loss_classifier: 0.1172 (0.1242) loss_box_reg: 0.2069 (0.2024) loss_objectness: 0.0328 (0.0354) loss_rpn_box_reg: 0.0458 (0.0464) time: 0.6707 data: 0.1252 max mem: 3105
Epoch: [6] [ 80/119] eta: 0:00:26 lr: 0.000050 loss: 0.4153 (0.4113) loss_classifier: 0.1178 (0.1246) loss_box_reg: 0.2123 (0.2036) loss_objectness: 0.0328 (0.0364) loss_rpn_box_reg: 0.0480 (0.0468) time: 0.6744 data: 0.1264 max mem: 3105
Epoch: [6] [ 90/119] eta: 0:00:19 lr: 0.000050 loss: 0.4294 (0.4107) loss_classifier: 0.1178 (0.1238) loss_box_reg: 0.2098 (0.2021) loss_objectness: 0.0418 (0.0381) loss_rpn_box_reg: 0.0495 (0.0466) time: 0.6856 data: 0.1302 max mem: 3105
Epoch: [6] [100/119] eta: 0:00:12 lr: 0.000050 loss: 0.4295 (0.4135) loss_classifier: 0.1171 (0.1235) loss_box_reg: 0.2124 (0.2034) loss_objectness: 0.0460 (0.0397) loss_rpn_box_reg: 0.0498 (0.0469) time: 0.6955 data: 0.1345 max mem: 3105
Epoch: [6] [110/119] eta: 0:00:06 lr: 0.000050 loss: 0.4126 (0.4117) loss_classifier: 0.1229 (0.1233) loss_box_reg: 0.2119 (0.2024) loss_objectness: 0.0430 (0.0394) loss_rpn_box_reg: 0.0481 (0.0466) time: 0.6822 data: 0.1306 max mem: 3105
Epoch: [6] [118/119] eta: 0:00:00 lr: 0.000050 loss: 0.4006 (0.4113) loss_classifier: 0.1171 (0.1227) loss_box_reg: 0.2028 (0.2028) loss_objectness: 0.0366 (0.0391) loss_rpn_box_reg: 0.0481 (0.0466) time: 0.6583 data: 0.1230 max mem: 3105
Epoch: [6] Total time: 0:01:20 (0.6760 s / it)
creating index...
index created!
Test: [ 0/59] eta: 0:00:15 model_time: 0.1188 (0.1188) evaluator_time: 0.0697 (0.0697) time: 0.2561 data: 0.0634 max mem: 3105
Test: [58/59] eta: 0:00:00 model_time: 0.1086 (0.1092) evaluator_time: 0.0439 (0.0607) time: 0.2361 data: 0.0629 max mem: 3105
Test: Total time: 0:00:14 (0.2378 s / it)
Averaged stats: model_time: 0.1086 (0.1092) evaluator_time: 0.0439 (0.0607)
Accumulating evaluation results...
DONE (t=0.02s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.210
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.643
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.079
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.210
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.096
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Epoch: [7] [ 0/119] eta: 0:01:16 lr: 0.000050 loss: 0.3851 (0.3851) loss_classifier: 0.1334 (0.1334) loss_box_reg: 0.1845 (0.1845) loss_objectness: 0.0287 (0.0287) loss_rpn_box_reg: 0.0385 (0.0385) time: 0.6433 data: 0.1150 max mem: 3105
Epoch: [7] [ 10/119] eta: 0:01:12 lr: 0.000050 loss: 0.3997 (0.4045) loss_classifier: 0.1250 (0.1259) loss_box_reg: 0.1973 (0.2023) loss_objectness: 0.0292 (0.0303) loss_rpn_box_reg: 0.0479 (0.0459) time: 0.6692 data: 0.1252 max mem: 3105
Epoch: [7] [ 20/119] eta: 0:01:07 lr: 0.000050 loss: 0.4224 (0.4219) loss_classifier: 0.1250 (0.1262) loss_box_reg: 0.2143 (0.2101) loss_objectness: 0.0333 (0.0373) loss_rpn_box_reg: 0.0493 (0.0484) time: 0.6809 data: 0.1286 max mem: 3105
Epoch: [7] [ 30/119] eta: 0:01:00 lr: 0.000050 loss: 0.4120 (0.4140) loss_classifier: 0.1191 (0.1221) loss_box_reg: 0.2113 (0.2070) loss_objectness: 0.0357 (0.0374) loss_rpn_box_reg: 0.0506 (0.0475) time: 0.6834 data: 0.1316 max mem: 3105
Epoch: [7] [ 40/119] eta: 0:00:53 lr: 0.000050 loss: 0.4013 (0.4117) loss_classifier: 0.1118 (0.1210) loss_box_reg: 0.2079 (0.2063) loss_objectness: 0.0357 (0.0371) loss_rpn_box_reg: 0.0471 (0.0473) time: 0.6780 data: 0.1304 max mem: 3105
Epoch: [7] [ 50/119] eta: 0:00:46 lr: 0.000050 loss: 0.3911 (0.4035) loss_classifier: 0.1172 (0.1198) loss_box_reg: 0.1912 (0.2017) loss_objectness: 0.0341 (0.0356) loss_rpn_box_reg: 0.0449 (0.0464) time: 0.6768 data: 0.1314 max mem: 3105
Epoch: [7] [ 60/119] eta: 0:00:39 lr: 0.000050 loss: 0.3911 (0.4048) loss_classifier: 0.1186 (0.1213) loss_box_reg: 0.1859 (0.2013) loss_objectness: 0.0334 (0.0360) loss_rpn_box_reg: 0.0412 (0.0462) time: 0.6729 data: 0.1306 max mem: 3105
Epoch: [7] [ 70/119] eta: 0:00:33 lr: 0.000050 loss: 0.4046 (0.4030) loss_classifier: 0.1177 (0.1209) loss_box_reg: 0.2105 (0.2008) loss_objectness: 0.0359 (0.0354) loss_rpn_box_reg: 0.0462 (0.0459) time: 0.6718 data: 0.1282 max mem: 3105
Epoch: [7] [ 80/119] eta: 0:00:26 lr: 0.000050 loss: 0.4125 (0.4067) loss_classifier: 0.1187 (0.1221) loss_box_reg: 0.2105 (0.2022) loss_objectness: 0.0362 (0.0362) loss_rpn_box_reg: 0.0469 (0.0462) time: 0.6725 data: 0.1285 max mem: 3105
Epoch: [7] [ 90/119] eta: 0:00:19 lr: 0.000050 loss: 0.4289 (0.4068) loss_classifier: 0.1288 (0.1223) loss_box_reg: 0.2097 (0.2009) loss_objectness: 0.0434 (0.0375) loss_rpn_box_reg: 0.0479 (0.0461) time: 0.6874 data: 0.1327 max mem: 3105
Epoch: [7] [100/119] eta: 0:00:12 lr: 0.000050 loss: 0.4222 (0.4086) loss_classifier: 0.1223 (0.1221) loss_box_reg: 0.2101 (0.2021) loss_objectness: 0.0405 (0.0381) loss_rpn_box_reg: 0.0483 (0.0463) time: 0.6941 data: 0.1348 max mem: 3105
Epoch: [7] [110/119] eta: 0:00:06 lr: 0.000050 loss: 0.4082 (0.4072) loss_classifier: 0.1196 (0.1220) loss_box_reg: 0.2081 (0.2013) loss_objectness: 0.0350 (0.0379) loss_rpn_box_reg: 0.0475 (0.0461) time: 0.6792 data: 0.1301 max mem: 3105
Epoch: [7] [118/119] eta: 0:00:00 lr: 0.000050 loss: 0.4070 (0.4076) loss_classifier: 0.1196 (0.1223) loss_box_reg: 0.2063 (0.2016) loss_objectness: 0.0313 (0.0375) loss_rpn_box_reg: 0.0475 (0.0462) time: 0.6599 data: 0.1255 max mem: 3105
Epoch: [7] Total time: 0:01:20 (0.6763 s / it)
creating index...
index created!
Test: [ 0/59] eta: 0:00:14 model_time: 0.1194 (0.1194) evaluator_time: 0.0633 (0.0633) time: 0.2511 data: 0.0642 max mem: 3105
Test: [58/59] eta: 0:00:00 model_time: 0.1098 (0.1102) evaluator_time: 0.0481 (0.0590) time: 0.2353 data: 0.0625 max mem: 3105
Test: Total time: 0:00:13 (0.2371 s / it)
Averaged stats: model_time: 0.1098 (0.1102) evaluator_time: 0.0481 (0.0590)
Accumulating evaluation results...
DONE (t=0.02s).
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.210
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.649
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.079
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.210
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.011
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.095
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.334
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
I have two questions:
Overfitting: I don't know if my model is overfitting or underfitting. How can I tell by looking at the metrics?
Save the best model of all epochs: How can I save the best model trained across the different epochs? Which is the best epoch according to these results?
Thank you!
You need to keep track of the loss on the test dataset (or some other metric, such as recall). Pay attention to this part of the code:
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
train_one_epoch and evaluate are defined here. The evaluate function returns an object of type CocoEvaluator, but you can modify the code so that it returns the test loss (you need to either extract metrics from the CocoEvaluator object, or write your own metric evaluation).
So, the answers are:
Keep track of the test loss; it will tell you about overfitting.
Save the model state after every epoch until the test loss begins to increase. A tutorial about saving models is here.
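To make that concrete, here is a minimal hedged sketch of checkpointing only when the test metric improves; compute_test_loss is a hypothetical helper you would write yourself (e.g. by extracting a number from the CocoEvaluator or summing the losses over the test set):
best_loss = float('inf')
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    lr_scheduler.step()

    test_loss = compute_test_loss(model, data_loader_test, device)  # hypothetical helper
    if test_loss < best_loss:
        best_loss = test_loss
        # keep only the best-so-far weights
        torch.save(model.state_dict(), 'best_model.pth')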

Why does loss decrease but accuracy decreases too (Pytorch, LSTM)?

I have built a model with LSTM and Linear modules in PyTorch for a classification problem (10 classes). While training the model, for each epoch I output the loss and accuracy on the training set. The output is as follows:
epoch: 0 start!
Loss: 2.301875352859497
Acc: 0.11388888888888889
epoch: 1 start!
Loss: 2.2759320735931396
Acc: 0.29
epoch: 2 start!
Loss: 2.2510263919830322
Acc: 0.4872222222222222
epoch: 3 start!
Loss: 2.225804567337036
Acc: 0.6066666666666667
epoch: 4 start!
Loss: 2.199286699295044
Acc: 0.6511111111111111
epoch: 5 start!
Loss: 2.1704766750335693
Acc: 0.6855555555555556
epoch: 6 start!
Loss: 2.1381614208221436
Acc: 0.7038888888888889
epoch: 7 start!
Loss: 2.1007182598114014
Acc: 0.7194444444444444
epoch: 8 start!
Loss: 2.0557992458343506
Acc: 0.7283333333333334
epoch: 9 start!
Loss: 1.9998993873596191
Acc: 0.7427777777777778
epoch: 10 start!
Loss: 1.9277743101119995
Acc: 0.7527777777777778
epoch: 11 start!
Loss: 1.8325848579406738
Acc: 0.7483333333333333
epoch: 12 start!
Loss: 1.712520718574524
Acc: 0.7077777777777777
epoch: 13 start!
Loss: 1.6056485176086426
Acc: 0.6305555555555555
epoch: 14 start!
Loss: 1.5910680294036865
Acc: 0.4938888888888889
epoch: 15 start!
Loss: 1.6259561777114868
Acc: 0.41555555555555557
epoch: 16 start!
Loss: 1.892195224761963
Acc: 0.3655555555555556
epoch: 17 start!
Loss: 1.4949012994766235
Acc: 0.47944444444444445
epoch: 18 start!
Loss: 1.4332982301712036
Acc: 0.48833333333333334
For the loss function I used nn.CrossEntropyLoss with the Adam optimizer.
Although the loss is constantly decreasing, the accuracy increases until epoch 10 and then, for some reason, begins to decrease.
Why is this happening?
Even if my model is overfitting, shouldn't that mean the accuracy stays high? (I am always speaking of the accuracy and loss measured on the training set, not the validation set.)
A decreasing loss does not always mean improving accuracy.
I will try to address this for the cross-entropy loss.
CE-loss = sum over samples of -log p(y = correct class)
The loss for a sample decreases if the probability of its correct class increases, and increases if that probability decreases. When you compute the average loss you are averaging over all samples; some of the correct-class probabilities may increase while others decrease, so the overall loss can get smaller even while the accuracy drops.
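A toy two-class example (my own illustration, not from the original answer) makes this concrete: one sample becomes much more confident while another flips to the wrong side, so the mean loss falls while accuracy falls too.
import math

def mean_ce(probs_true_class):
    # mean cross-entropy, given the probability each sample assigns to its true class
    return sum(-math.log(p) for p in probs_true_class) / len(probs_true_class)

def accuracy(probs_true_class):
    # two-class case: a prediction is correct if the true class gets probability > 0.5
    return sum(p > 0.5 for p in probs_true_class) / len(probs_true_class)

before = [0.55, 0.55]  # both samples barely correct
after  = [0.95, 0.45]  # one very confident, one now wrong

print(mean_ce(before), accuracy(before))  # ~0.598, 1.0
print(mean_ce(after),  accuracy(after))   # ~0.425, 0.5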

Is it normal in PyTorch for accuracy to increase and decrease repeatedly?

I am new to PyTorch and currently working on a simple transfer-learning example. When I train my model, the accuracy and loss swing up and down considerably between epochs. I trained the network for 50 epochs, and below is the result:
Epoch [1/50], Loss: 0.5477, Train Accuracy: 63%
Epoch [2/50], Loss: 2.1935, Train Accuracy: 75%
Epoch [3/50], Loss: 1.8811, Train Accuracy: 79%
Epoch [4/50], Loss: 0.0671, Train Accuracy: 77%
Epoch [5/50], Loss: 0.2522, Train Accuracy: 80%
Epoch [6/50], Loss: 0.0962, Train Accuracy: 88%
Epoch [7/50], Loss: 1.8883, Train Accuracy: 74%
Epoch [8/50], Loss: 0.3565, Train Accuracy: 83%
Epoch [9/50], Loss: 0.0228, Train Accuracy: 81%
Epoch [10/50], Loss: 0.0124, Train Accuracy: 81%
Epoch [11/50], Loss: 0.0252, Train Accuracy: 84%
Epoch [12/50], Loss: 0.5184, Train Accuracy: 81%
Epoch [13/50], Loss: 0.1233, Train Accuracy: 86%
Epoch [14/50], Loss: 0.1704, Train Accuracy: 82%
Epoch [15/50], Loss: 2.3164, Train Accuracy: 79%
Epoch [16/50], Loss: 0.0294, Train Accuracy: 85%
Epoch [17/50], Loss: 0.2860, Train Accuracy: 85%
Epoch [18/50], Loss: 1.5114, Train Accuracy: 81%
Epoch [19/50], Loss: 0.1136, Train Accuracy: 86%
Epoch [20/50], Loss: 0.0062, Train Accuracy: 80%
Epoch [21/50], Loss: 0.0748, Train Accuracy: 84%
Epoch [22/50], Loss: 0.1848, Train Accuracy: 84%
Epoch [23/50], Loss: 0.1693, Train Accuracy: 81%
Epoch [24/50], Loss: 0.1297, Train Accuracy: 77%
Epoch [25/50], Loss: 0.1358, Train Accuracy: 78%
Epoch [26/50], Loss: 2.3172, Train Accuracy: 75%
Epoch [27/50], Loss: 0.1772, Train Accuracy: 79%
Epoch [28/50], Loss: 0.0201, Train Accuracy: 80%
Epoch [29/50], Loss: 0.3810, Train Accuracy: 84%
Epoch [30/50], Loss: 0.7281, Train Accuracy: 79%
Epoch [31/50], Loss: 0.1918, Train Accuracy: 81%
Epoch [32/50], Loss: 0.3289, Train Accuracy: 88%
Epoch [33/50], Loss: 1.2363, Train Accuracy: 81%
Epoch [34/50], Loss: 0.0362, Train Accuracy: 89%
Epoch [35/50], Loss: 0.0303, Train Accuracy: 90%
Epoch [36/50], Loss: 1.1700, Train Accuracy: 81%
Epoch [37/50], Loss: 0.0031, Train Accuracy: 81%
Epoch [38/50], Loss: 0.1496, Train Accuracy: 81%
Epoch [39/50], Loss: 0.5070, Train Accuracy: 76%
Epoch [40/50], Loss: 0.1984, Train Accuracy: 77%
Epoch [41/50], Loss: 0.1152, Train Accuracy: 79%
Epoch [42/50], Loss: 0.0603, Train Accuracy: 82%
Epoch [43/50], Loss: 0.2293, Train Accuracy: 84%
Epoch [44/50], Loss: 0.1304, Train Accuracy: 80%
Epoch [45/50], Loss: 0.0381, Train Accuracy: 82%
Epoch [46/50], Loss: 0.1833, Train Accuracy: 84%
Epoch [47/50], Loss: 0.0222, Train Accuracy: 84%
Epoch [48/50], Loss: 0.0010, Train Accuracy: 81%
Epoch [49/50], Loss: 1.0852, Train Accuracy: 79%
Epoch [50/50], Loss: 0.0167, Train Accuracy: 83%
Some epochs have a much better accuracy and loss than others, but the model loses those gains in later epochs. As far as I know, the accuracy should improve every epoch. Did I write the training code wrongly? If not, is this normal, and is there any way to solve it? Should I save the previous accuracy and only continue training when the next epoch improves on it? I worked with Keras previously and did not experience this problem. I am fine-tuning a ResNet by freezing the pretrained weights and adding a final layer with only 2 classes. Below is my code:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_conv.fc.parameters(), lr=0.001, momentum=0.9)
num_epochs = 50

for epoch in range(num_epochs):
    # Reset the correct count to 0 after passing through the whole dataset
    correct = 0
    for images, labels in dataloaders['train']:
        images = Variable(images)
        labels = Variable(labels)
        if torch.cuda.is_available():
            images = images.cuda()
            labels = labels.cuda()

        optimizer.zero_grad()
        outputs = model_conv(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum()

    train_acc = 100 * correct / dataset_sizes['train']
    print('Epoch [{}/{}], Loss: {:.4f}, Train Accuracy: {}%'
          .format(epoch + 1, num_epochs, loss.item(), train_acc))
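One thing worth noting about the snippet above (my observation, not part of the original question): the printed Loss is loss.item() from the last batch of the epoch only, so it is inherently noisy. A hedged sketch of reporting a per-epoch average instead, reusing model_conv, dataloaders, criterion, optimizer and dataset_sizes from above and dropping the deprecated Variable wrapper:
for epoch in range(num_epochs):
    correct = 0
    running_loss = 0.0
    num_batches = 0
    for images, labels in dataloaders['train']:
        if torch.cuda.is_available():
            images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        outputs = model_conv(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()  # accumulate batch losses
        num_batches += 1
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()

    train_acc = 100.0 * correct / dataset_sizes['train']
    epoch_loss = running_loss / num_batches  # average loss over the whole epoch
    print('Epoch [{}/{}], Loss: {:.4f}, Train Accuracy: {:.1f}%'
          .format(epoch + 1, num_epochs, epoch_loss, train_acc))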
I would say it depends on the dataset and the architecture, so fluctuations are normal, but in general the loss should improve. The swings could also be a result of noise in the test dataset, i.e. wrongly labeled examples.
If the test accuracy starts to decrease, your network might be overfitting.
You might want to stop training just before you reach that point, or take other steps to counter the overfitting.
"Is it normal in PyTorch for accuracy to increase and decrease repeatedly?"
The loss should generally go down at the epoch level.
At the batch level it may fluctuate, but over time it should get smaller, since that is the whole point: by minimizing the loss we are improving accuracy.

Test score vs test accuracy when evaluating model using Keras

I'm using a neural network implemented with the Keras library, and below are the results during training. At the end it prints a test score and a test accuracy. I can't figure out exactly what the score represents, but I assume the accuracy is the fraction of predictions that were correct when running the test.
Epoch 1/15 1200/1200 [==============================] - 4s - loss: 0.6815 - acc: 0.5550 - val_loss: 0.6120 - val_acc: 0.7525
Epoch 2/15 1200/1200 [==============================] - 3s - loss: 0.5481 - acc: 0.7250 - val_loss: 0.4645 - val_acc: 0.8025
Epoch 3/15 1200/1200 [==============================] - 3s - loss: 0.5078 - acc: 0.7558 - val_loss: 0.4354 - val_acc: 0.7975
Epoch 4/15 1200/1200 [==============================] - 3s - loss: 0.4603 - acc: 0.7875 - val_loss: 0.3978 - val_acc: 0.8350
Epoch 5/15 1200/1200 [==============================] - 3s - loss: 0.4367 - acc: 0.7992 - val_loss: 0.3809 - val_acc: 0.8300
Epoch 6/15 1200/1200 [==============================] - 3s - loss: 0.4276 - acc: 0.8017 - val_loss: 0.3884 - val_acc: 0.8350
Epoch 7/15 1200/1200 [==============================] - 3s - loss: 0.3975 - acc: 0.8167 - val_loss: 0.3666 - val_acc: 0.8400
Epoch 8/15 1200/1200 [==============================] - 3s - loss: 0.3916 - acc: 0.8183 - val_loss: 0.3753 - val_acc: 0.8450
Epoch 9/15 1200/1200 [==============================] - 3s - loss: 0.3814 - acc: 0.8233 - val_loss: 0.3505 - val_acc: 0.8475
Epoch 10/15 1200/1200 [==============================] - 3s - loss: 0.3842 - acc: 0.8342 - val_loss: 0.3672 - val_acc: 0.8450
Epoch 11/15 1200/1200 [==============================] - 3s - loss: 0.3674 - acc: 0.8375 - val_loss: 0.3383 - val_acc: 0.8525
Epoch 12/15 1200/1200 [==============================] - 3s - loss: 0.3624 - acc: 0.8367 - val_loss: 0.3423 - val_acc: 0.8650
Epoch 13/15 1200/1200 [==============================] - 3s - loss: 0.3497 - acc: 0.8475 - val_loss: 0.3069 - val_acc: 0.8825
Epoch 14/15 1200/1200 [==============================] - 3s - loss: 0.3406 - acc: 0.8500 - val_loss: 0.2993 - val_acc: 0.8775
Epoch 15/15 1200/1200 [==============================] - 3s - loss: 0.3252 - acc: 0.8600 - val_loss: 0.2960 - val_acc: 0.8775
400/400 [==============================] - 0s
Test score: 0.299598811865
Test accuracy: 0.88
Looking at the Keras documentation, I still don't understand what score is. For the evaluate function, it says:
Returns the loss value & metrics values for the model in test mode.
One thing I noticed is that when the test accuracy is lower, the score is higher, and when accuracy is higher, the score is lower.
For reference, the two relevant parts of the code:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
The score is the evaluation of the loss function for a given input.
Training a network means finding parameters that minimize a loss function (or cost function).
The cost function here is binary_crossentropy.
For a target T and a network output O, the binary cross-entropy can be defined as
f(T, O) = -(T*log(O) + (1-T)*log(1-O))
So the score you see is the evaluation of that.
If you feed it a batch of inputs, it will most likely return the mean loss over the batch.
So yes, if your model has a lower loss (at test time), it will often also have a lower prediction error.
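As an illustration (my own sketch, not the Keras internals), here is that formula evaluated with NumPy on a small batch, alongside the corresponding accuracy:
import numpy as np

def binary_crossentropy(T, O, eps=1e-7):
    # f(T, O) = -(T*log(O) + (1-T)*log(1-O)), averaged over the batch
    O = np.clip(O, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(T * np.log(O) + (1 - T) * np.log(1 - O)))

targets = np.array([1, 0, 1, 1])
outputs = np.array([0.9, 0.2, 0.7, 0.4])

print('score (mean loss):', binary_crossentropy(targets, outputs))        # ~0.400
print('accuracy:', np.mean((outputs > 0.5).astype(int) == targets))       # 0.75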
Loss is used in the training process to find the "best" parameter values for your model (e.g. the weights of a neural network). It is what you optimize during training by updating the weights.
Accuracy is more of an applied measure. Once you have found the optimized parameters above, you use this metric to evaluate how accurate your model's predictions are compared to the true data.
This answer provides more detailed information:
How to interpret "loss" and "accuracy" for a machine learning model