I am using a UNet model to train a segmentation algorithm with roughly 1,000 grayscale medical images and 1,000 corresponding masks where the section of interest in the medical image is white pixel and the background is black.
I am using dice loss and a similar dice score as an accuracy metric to account for the fact that my white pixels are generally less in number than the black background pixels. But I am still having a few problems when training
1) The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Assumption - Overfitting:
I assumed this was due to overfitting, so I augmented the dataset as much as possible with rigid transformations - flipping and rotating, but still no help.
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
2)Model doesn't look like it's even training. I was able to check the model before it reduced all the test data to blackness, but even then the results would look like blurry versions of the original without segmenting the features highlighted by my training mask
3) The loss vs epochs and accuracy vs epochs output charts are very smooth: They present none of the oscillating behaviour that I expect to see when doing semantic segmentation. According to this related post a smooth chart usually occurs when there is only one class. I however assumed that my model would see the training masks (white pixels vs black pixels) and see that as a two class problem. Am I wrong in this assumption?
4) According to this post Dice is good for an unbalanced training set. I have also tried to get precision/recall/F1 results as they suggest, but was unable to do it and assuming it might be related to my 3rd issue where the model sees my segmentation task as a single class problem.
TLDR: How can I fix the black output results I am getting? Can you please help me clarify if my learning model is actually seeing my white and black pixels in each mask as two separate classes and if not what is it actually doing?
Your model is only predicting one class (the background/back pixels) because of the class imbalance.
The loss converges too fast. If I have my SGD optimizer's learning rate at 0.01 for example, at around 2 epochs the loss (training and validation) will drop to 0.00009 and the accuracy shoots up and settles at 100% in proportion. Testing on an unseen set gives blank images.
Lower your learning rate. 0.01 is really high, so try something like 3e-5 for your learning and see how your model performs.
Also, having a 100% accuracy (supposedly you're using dice?) suggests that you're still using accuracy, so I believe that your model does not recognize that you're using dice/dice loss for training and evaluation(code snippets would be appreciated).
Example:
model.compile(optimizer=Adam(lr=TRAIN_SEG_LEARNING_RATE),
loss=dice_coef_loss,
metrics=[dice_coef])
Also if I test the model against the same data I used to train it, it still predicts blank images. So does this mean it isn't a case of overfitting?
Try using model.evaluate(test_data, test_label). If the evaluated performance is good (dice should be extremely low if you're only predicting 0s), then either your labels are messed in some way or there is something wrong with your pipeline.
Possible Solutions if all else fails:
make sure to go through all the sanity checks in this article
You might not have enough data, so try to use a patchwise approach with random crops.
Add more regularization (dropout, BatchNormalization, InstanceNormalization, increasing input image size, etc.)
Related
Can someone help me what to do with a classification, if I get a training and validation error shown in the picture to improve my neural network? I tried to stop the training earlier, so that the validation error is smaller, but it's still too high. I get a validation accuracy of 62.45%, but thats too low. The dataset are images that show objects somewhere in the image (not centered). If I use the same network with the same number of images, but where the shown objects are always centered to the principal point, it works much better with a validation accuracy of 95%,
One can look for following things while implementing the Neural net:
Dataset Issues:
i) Check if the input data you are feeding the network makes sense and is there too much noise in the data.
ii) Try passing random input and see if the error performance persist. If it does, then it's time to make changes in your net.
iii) Check if the input data has appropriate labels.
iv) If the input data is not shuffled and is passed in a specific order of label, leads to negative impact on the learning. So, shuffling of data and label together is necessary.
v) Reduce the batch size and make sure batch don't contain the same label.
vi) Too much data augmentation is not good as it has a regularizing effect and when combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.
vii) Data must be pre-processed as per the requirement of the data. For example, if you are training the network for face classification then the image face without or very any background should be passed to the network for learning.
Implementation Issues:
i) Check your loss function, weight initialization, and gradient checking to make sure the backpropagation works in an appropriate manner.
ii) Visualize the biases, activation, and weights for each layer with help of visualization library like Tensorboard.
iii) Try using dynamic learning rate concept, where the learning rate changes with a designed set of epochs.
iv) Increase the network size by adding more layer or more neurons, as it might not be enough to capture the features of its mark.
Recently, I've been playing with the MATLAB's RCNN deep learning example here. In this example, MATLAB has designed a basic 15 layer CNN with the input size of 32x32. They use the CIFAR10 dataset to pre-train this CNN. The CIFAR10 dataset has training images of size 32x32 too. Later they use a small dataset of stop signs to fine tune this CNN to detect stop signs. This small dataset of stop signs has only 41 images; so they use these 41 images to fine tune the CNN and namely train an RCNN network. this is how they detect a stop sign:
As you see the bounding box almost covers the whole stop signs except a small part on the top.
Playing with the code I decided to fine tune the same network pre-trained on the CIFAR10 dataset with the PASCAL VOC dataset but only for the "aeroplane" class.
These are some results I get:
As you see the detected bounding boxes barely cover the whole airplane; so this causes the precision to be 0 later when I evaluate them. I understand that in the original RCNN paper mentioned in the MATLAB example the input size 227x227 and their CNN has 25 layers. Could this be why the detections are not accurate? How does the input size of a CNN affect the end result?
almost surely yes!
when you pass an image through a net, the net tries to minimize the data taken from the image until it gets the most relevant data. during this process, the input shrinks again and again. If, for example, you insert to a net an image that smaller than the wanted, all the data from the image may lost during the pass in the net.
In your case, an optional reason to your results is that the net "looks for" features in limited resolution and maybe the big airplane has over high resolution.
I am training a CNN for predicting joints on hands. The problem is that my net always converges to the mean value of the training set, and I can only get identical results for different test images. Do you know how to prevent this?
I think you must be using the MSECriterion()? It is the standard l2 (minimum square error) loss. While the CNN tries to predict results, there are multiple modes through which the result can be correct. And what l2 loss does is that it converges to an average of all these modes as that is the most feasible way it can intuitively approach to attain less-penalized results.
The MSE-based solution
appears overly smooth due to the pixel-wise average of
possible solutions in the pixel space
To pick the optimum mode of answer, you can look into adversarial loss LINK. This loss picks the optimum mode based on what it thinks is realistic in terms of the data it has seen.
For further clarification, look at figure 3 in this paper: SRGAN
I was using tensorflow. Was trying to do some regression using simple CNN with one neuron in output layer. Was optimizing mean square error:
cost = tf.reduce_mean(tf.abs(y_prediction - y_output_placeholder))
optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(cost)
My problem was I made output placeholder of true values with different shape than output predictions of the net.
placeholder's shape was [None]
predictions's shape was [None, 1].
When I changed placeholder's shape to match the one of prediction output problem was solved.
I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The proportion of the imbalanced dataset is 1000:4 , with label '0' appearing 250 times more than label '1'. However, I have a lot of training samples : around 23 millions. So I should get around 100 000 samples for the label '1'.
Considering the big number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether NN could be efficient to handle this kind of imbalanced dataset with a large dataset ?
Also, as I am using Tensorflow to design the model, which characteristics should/could I tune to be able to handle this imbalanced situation ?
Thanks for your help !
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) I tried during this weekend the 1st option, increasing the cost for the positive label. Actually, with less unbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall scores proportion.
However, for my situation,
It seems to be very sensitive to the alpha number. With alpha = 250, which is the proportion of the unbalanced dataset, I have a precision of 0.006 and a recall score of 0.83, but the model is predicting way too many 1 that it should be - around 0.50 of label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this function from TF too as I did it manually for now : tf.nn.weighted_cross_entropy_with_logitsthat
2) I will try to de-unbalance the dataset but I am afraid that I will lose a lot of info doing that, as I have millions of samples but only ~ 100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways for imbanlanced dataset:
Online sampling as mentioned above. In each iteration you sample a class-balanced batch from the training set.
Re-weight the cost of two classes respectively. You'd want to give the loss on the dominant class a smaller weight. For example this is used in the paper Holistically-Nested Edge Detection
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax+cross-entropy or Hinge Loss you can as #chasep255 mentionned make it more costly for the network to misclassify the example that appear the less.
To do that simply split the cost into two parts and put more weights on the class that have fewer examples.
For simplicity if you say that the dominant class is labelled negative (neg) for softmax and the other the positive (pos) (for Hinge you could exactly the same):
L=L_{neg}+L_{pos} =>L=L_{neg}+\alpha*L_{pos}
With \alpha greater than 1.
Which would translate in tensorflow for the case of cross-entropy where the positives are labelled [1, 0] and the negatives [0,1] to something like :
cross_entropy_mean=-tf.reduce_mean(targets*tf.log(y_out)*tf.constant([alpha, 1.]))
Whatismore by digging a bit into Tensorflow API you seem to have a tensorflow function tf.nn.weighted_cross_entropy_with_logitsthat implements it did not read the details but look fairly straightforward.
Another way if you train your algorithm with mini-batch SGD would be make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a one. This way the more rare samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though so you always get at least one of the rare samples in each batch.
Yes - neural network could help in your case. There are at least two approaches to such problem:
Leave your set not changed but decrease the size of batch and number of epochs. Apparently this might help better than keeping the batch size big. From my experience - in the beginning network is adjusting its weights to assign the most probable class to every example but after many epochs it will start to adjust itself to increase performance on all dataset. Using cross-entropy will give you additional information about probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your score during evaluation phase using Bayes rule:score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). Important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I've developed a "Pong" style game which effectively has a ball at the bottom of the screen and bouncy walls on the left and right and a sticky wall on the top. It randomly chooses a point on the bottom (on a straight horizontal line) and a random angle, bounces off the side walls, and hits the top wall. This is repeated a 1000 times and each time, the x-value of the launch position, the launch angle and the final x-value of the position it collides with on the top wall.
This gives me 2 inputs - x-value of launch and launch angle and 1 output - x-value of final position. I tried using a multilayer perceptron with 2 input nodes, 2 hidden nodes (1 layer) and 1 output node. However it converges upto a point ~20 and then tapers off. Here's what I've tried and none of them helped, either the error never converges or it starts diverging:
Transform inputs and output to be between 0 and 1
Transform inputs and output to be between -1 and 1
Increase number of hidden layers
Increase number of nodes in hidden layer
Convert the launch position, launch angle and final position into 0s and 1s resulting in ~750+175 inputs and ~750 outputs - no convergence
So, after spending all night and morning and making my brain and body revolt against me, I'm hoping someone can help me identify the problem here. Is this a task that's just not solvable by a neural network or am I doing something wrong?
PS: I'm using the online version of Neuroph and not coding my own procedure. At least this will help me avoid issues in implementation
If it doesn't minimize the training error, that's most likely a bug in the implementation. If you're measuring the accuracy on a held-out test set, on the other hand, there's nothing surprising about the error going up after a while.
As to the formulation, I think with sufficient amount of training data and sufficiently long training time, a sufficiently complex NN can learn the mapping whether you binarize the input or not (provided the implementation you use supports non-binary input and output). I have only a vague idea of what "sufficient" means in the above sentence, but I'd venture a guess that 1000 samples won't do. Note also that the more complex the network, the more data it will generally need to estimate the parameters.
To eliminate potential implementation issues in Neuroph, I'd suggest trying the exact same process (Multi-Layer Perceptron, same parameters, same data, etc.) but use Weka instead.
I've used the MLP in Weka before with success, so I can verify that this implementation works correctly. I know Weka has a fairly high-penetration in the academic community and its fairly well vetted, but I'm not sure about Neuroph since its newer. If you get the same results as Neuroph, then you know the issue is in your data or neural net topology or configuration.
Qnan brings up a good point - what exactly is the error you are measuring? To really determine why the training error isn't converging towards zero, you need to determine what exactly it is that the error represents.
Also, how many epochs (i.e., number of iterations) is the neural net running in training before it stops converging?
In Weka, if I recall correctly you can set the training to execute either until the error reaches a certain value or for a certain number of epochs. Looks like Neuroph is the same way, from a quick look.
If you're limiting the number of epochs, try bumping up the number to something significantly higher to give the network more iterations to converge.