Weight Cross-Entroy (WCE) helps to handle an imbalanced dataset, and Cityscapes is quite imbalanced as seen below:
If we check the best benchmarks on this dataset, most of the works use bare CE as a loss function. I don't get it if there are any special causes that would lead WCE to a worse result for semantic segmentation tasks on the mIoU evaluation.
I'm especially asking because I'm working in an even higher unbalanced dataset (multi-minority classes on the ratio of 1:1000 to the majority classes) and got very surprised when bare CE outperformed WCE on the mIoU metric.
I found so far that WCE can yield many false positives from minority classes, but beyond that, would there be more reasons for it?
Read Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár Focal Loss for Dense Object Detection (ICCV 2017). They discuss in length the shortcoming of CE loss when classes are unbalanced and argue (quite compellingly) that WCE simply does not address this limitation of CE.
CE loss never goes to zero: it always has non-zero gradients even if the prediction is perfect. CE strives to increase the margin between the different classes. As a result, when there is an imbalance between classes, CE will put an equal effort into being "more certain" about the dominant class as well as making fewer mistakes on the minority class. Putting weights on the CE will not make a fundamental change to this behavior.
In contrast, what you actually want from a loss function, in this case, is to ignore samples that you already predict correctly, and make an effort to correct wrong predictions. This is usually achieved via hard-negative mining of Focal loss.
Related
I have seen MICE implemented with different types of algorithms e.g. RandomForest or Stochastic Regression etc.
My question is that does it matter which type of algorithm i.e. does one perform the best? Is there any empirical evidence?
I am struggling to find any info on the web
Thank you
Yes, (depending on your task) it can matter quite a lot, which algorithm you choose.
You also can be sure, the mice developers wouldn't out effort into providing different algorithms, if there was one algorithm that anyway always performs best. Because, of course like in machine learning the "No free lunch theorem" is also relevant for imputation.
In general you can say, that the default settings of mice are often a good choice.
Look at this example from the miceRanger Vignette to see, how far imputations can differ for different algorithms. (the real distribution is marked in red, the respective multiple imputations in black)
The Predictive Mean Matching (pmm) algorithm e.g. makes sure that only imputed values appear, that were really in the dataset. This is for example useful, where only integer values like 0,1,2,3 appear in the data (and no values in between). Other algorithms won't do this, so while doing their regression they will also provide interpolated values like on the picture to the right ( so they will provide imputations that are e.g. 1.1, 1.3, ...) Both solutions can come with certain drawbacks.
That is why it is important to actually assess imputation performance afterwards. There are several diagnostic plots in mice to do this.
I've seen some comments in online articles/tutorials or Stack Overflow questions which suggest that increasing number of epochs can result in overfitting. But my intuition tells me that there should be no direct relationship at all between number of epochs and overfitting. So I'm looking for answer which explains if I'm right or wrong (or whatever's in between).
Here's my reasoning though. To overfit, you need to have enough free parameters (I think this is called "capacity" in neural networks) in your model to generate a function which can replicate the sample data points. If you don't have enough free parameters, you'll never overfit. You might just underfit.
So really, if you don't have too many free parameters, you could run infinite epochs and never overfit. If you have too many free parameters, then yes, the more epochs you have the more likely it is that you get to a place where you're overfitting. But that's just because running more epochs revealed the root cause: too many free parameters. The real loss function doesn't care about how many epochs you run. It existed the moment you defined your model structure, before you ever even tried to do gradient descent on it.
In fact, I'd venture as far as to say: assuming you have the computational resources and time, you should always aim to run as many epochs as possible, because that will tell you whether your model is prone to overfitting. Your best model will be the one that provides great training and validation accuracy, no matter how many epochs you run it for.
EDIT
While reading more into this, I realise I forgot to take into account that you can arbitrarily vary the sample size as well. Given a fixed model, a smaller sample size is more prone to being overfit. And then that kind of makes me doubt my intuition above. Still happy to get an answer though!
Your intuition to me seems completely correct.
But here is the caveat. The whole purpose of deep models is that they are "deep" (duh!!). So what happens is that your feature space gets exponentially larger as you grow your network.
Here is an example to compare a deep model with a simpler mode:
Assume you have a 10-variable data set. With a crazy amount of feature engineering, you might be able to extract 50 features out of it. Then if you run a traditional model (let's say a logistic regression), you will have 50 parameters (capacity in your word, or degree of freedom) to train.
But, if you use a very simple deep model with Layer 1: 10 unit, layer2: 10 units, layer3: 5 units, layer4: 2 units, you will end up with (10*10 + 10*10 + 5*2 = 210) parameters to train.
Therefore, usually when we train a neural net for a long time, we end of with a memorized version of our data set(this gets worse if our data set is small and easy to be memorized).
But as you also mentioned, there is no intrinsic reason why higher number of epochs result in overfitting. Early stopping is usually a very good way for avoiding this. Just set patience equal to 5-10 epochs.
If the amount of trainable parameters is small with respect to the size of your training set (and your training set is reasonably diverse) then running over the same data multiple times will not be that significant, since you will be learning some features about your problem, rather than just memorizing the training data set. The problem arises when the amount of parameters is comparable to your training data set size (or bigger), it is basically the same problem as with any machine learning technique that uses too many features. This is quite common if you use large layers with dense connections. To combat this overfitting problem there are lots of regularization techniques (dropout, L1 regularizer, constraining certain connections to be 0 or equal such as in CNN).
The problem is that might still be left with too many trainable parameters. A simple way to regularize even further is to have a small learning rate (i.e. don't learn too much from this particular example lest you memorize it) combined with monitoring the epochs (if there is a large gap increase between validation/training accuracy, you are starting to overfit your model). You can then use the gap info to stop your training. This is a version of what is known as early stopping (stop before you reach the minimum in your loss function).
I am reading people's implementation of DCGAN, especially this one in tensorflow.
In that implementation, the author draws the losses of the discriminator and of the generator, which is shown below (images come from https://github.com/carpedm20/DCGAN-tensorflow):
Both the losses of the discriminator and of the generator don't seem to follow any pattern. Unlike general neural networks, whose loss decreases along with the increase of training iteration. How to interpret the loss when training GANs?
Unfortunately, like you've said for GANs the losses are very non-intuitive. Mostly it happens down to the fact that generator and discriminator are competing against each other, hence improvement on the one means the higher loss on the other, until this other learns better on the received loss, which screws up its competitor, etc.
Now one thing that should happen often enough (depending on your data and initialisation) is that both discriminator and generator losses are converging to some permanent numbers, like this:
(it's ok for loss to bounce around a bit - it's just the evidence of the model trying to improve itself)
This loss convergence would normally signify that the GAN model found some optimum, where it can't improve more, which also should mean that it has learned well enough. (Also note, that the numbers themselves usually aren't very informative.)
Here are a few side notes, that I hope would be of help:
if loss haven't converged very well, it doesn't necessarily mean that the model hasn't learned anything - check the generated examples, sometimes they come out good enough. Alternatively, can try changing learning rate and other parameters.
if the model converged well, still check the generated examples - sometimes the generator finds one/few examples that discriminator can't distinguish from the genuine data. The trouble is it always gives out these few, not creating anything new, this is called mode collapse. Usually introducing some diversity to your data helps.
as vanilla GANs are rather unstable, I'd suggest to use some version
of the DCGAN models, as they contain some features like convolutional
layers and batch normalisation, that are supposed to help with the
stability of the convergence. (the picture above is a result of the DCGAN rather than vanilla GAN)
This is some common sense but still: like with most neural net structures tweaking the model, i.e. changing its parameters or/and architecture to fit your certain needs/data can improve the model or screw it.
I am trying to solve classification problem using Matlab GPTIPS framework.
I managed to build reasonable data representation and fitness function so far and got an average accuracy per class near 65%.
What I need now is some help with two difficulties:
My data is biased. Basically I am solving binary classification problem and only 20% of data belongs to class 1, while other 80% belong to class 0. I used accuracy of prediction as my fitness function at first, but it was really bad. The best I have now is
Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution
Please, advize, how can I improve my function to make correction for data bias.
Second problem is overfitting. I divided my data into three parts: training (70%), testing (20%), validation (10%). I train each chromosome on training set, then evaluate it's fitness function on testing set. This routine allows me to reach fitness of 0.82 on my test data for the best individual in population. But same individual's result on validation data is only 60%.
I added validation check for best individual each time before new population is generated. Then I compare fitness on validation set with fitness on test set. If difference is more then 5%, then I increase penalty for solution complexity in my fitness function. But it didn't help.
I could also try to evaluate all individuals with validation set during each generation, and simply remove overfitted ones. But then I don't see any difference between my test and validation data. What else can be done here?
UPDATE:
For my second question I've found great article "Experiments on Controlling Overtting
in Genetic Programming" Along with some article authors' ideas on dealing with overfitting in GP it has impressive review with a lot of references to many different approaches to the issue. Now I have a lot of new ideas I can try for my problem.
Unfortunately, still cant' find anything on selecting a proper fitness function which will take into account unbalanced class proportions in my data.
65% accuracy is very bad when the baseline (classify everything as the class with most samples) would be 80%. You need to achieve at least baseline classification in order to have a better model than the naive one.
I would not penalize complexity. Rather limit the tree size (if possible). You could identify simpler models during the run, like storing a pareto front of models with quality and complexity as its two fitness values.
In HeuristicLab we have integrated GP based classification that can do these things. There are several options: You can choose to use MSE for classification or R2. In the latest trunk build there is also an evaluator to optimize accuracy directly (exactly speaking it optimizes the classification penalties). Optimizing MSE means it assigns each class a value (1, 2, 3,...) and tries to minimize mean squared error from that value. This may not seem optimal at first, but works. Optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier which allows you to prune and shrink your formula (and view the effects of that).
Also, does it need to be GP? Have you tried Random Forest Classification or Support Vector Machines as well? RF are pretty fast and work pretty well usually.
I am building a system with a NN trained for classification.
I am interested in what is error rate for systems you have built?
Classic example from UCI ML is the Iris data set.
NN trained on it is almost perfect - error rate 0-1%; however it is a very basic dataset.
My network has following structure: 80in, 208hid, 2out.
My result is 8% error rate on testing dataset.
Basically in this question I want to ask about various research results you encountered,
in your work, papers etc.
Addition 1:
the error rate is of course on testing data - not training. So it is completely new dataset for the network
Addition 2 (from my comment under the question):
My new results. 1200 entries, 900 training, 300 testing. 85 in Class1, 1115 in Class2. Out of 85, 44 in testing set. Error rate - 6%. It is not so bad because 44 is ~15% of 300. So I am 2.5 times better..
Model performance is completely problem-specific. Even among situations with similar quality and volumes of development data, with identical target variable definitions, performance can vary substantially. Obviously, the more similar the problem definitions, the more likely the performance of different models are to match.
Another thing to consider is the difference between technical performance and business performance. In some applications, an accuracy of 52% is tremendously profitable, whereas in other areas, and accuracy of 98% would be hopelessly low.
Let me also add that besides what Predictor mentions, measuring your performance on the training set is usually useless as a guide to determine how your classifier would perform on previously unseen data. Many times with relatively simple classifiers you can get 0% error rate on the training set without learning anything useful (this is called overfitting).
What is more commonly used (and more helpful in determining how your classifier works) is either held out data or cross validation, even better if you separate your data in three: training, validating and testing.
Also it is very hard to get a sense of how good a classifier works from one threshold and giving only true positive + true negatives. People tend to also evaluate false positives and false negatives and plot ROC curves to see/evaluate the tradeoff. So, saying "2.5 times better" you should be clear that your comparing to a classifier that classifies everything as C2, which is a pretty crappy baseline.
See for example this paper:
Danilo P. Mandic and Jonathon A. Chanbers (2000). Towards the Optimal Learning Rate for
Backpropagation, Neural Processing Letters 11: 1–5. PDF