Dice Score for semantic segmentation when some labels are all zeros - semantic-segmentation

I am calculating the Dice score for a binary segmentation case. In some of my ground truths there is no foreground at all, i.e. the mask is all zeros. When I use a different batch size for inference I get different results, and the worst case is batch size = 1. I found the reason, as shown in the following figure: the score is averaged over all cases, even those where TP = 0:
[results descriptions][1]
[1]: https://i.stack.imgur.com/mHj3o.png
What is the logical solution, and how do experts deal with this problem? One possible solution would be to calculate the Dice score only for those predictions whose ground truth contains at least one foreground pixel (ground truth > 0).
Is that a valid approach for publishing results? I haven't seen any paper mentioning this issue.
Any link to published work that deals with this problem would be appreciated.
Thank You
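For what it's worth, a minimal numpy sketch of that idea (skip samples whose ground truth is empty when averaging; the function names and the epsilon are mine, not taken from any particular paper):

import numpy as np

def dice_per_sample(pred, gt, eps=1e-7):
    # Dice for one binary mask pair; returns None when the ground truth is
    # empty so the caller can decide how to treat those undefined cases.
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    if not gt.any():
        # 0/0 case: either skip it (return None), or define it as 1.0 when
        # the prediction is also empty and 0.0 otherwise.
        return None
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def mean_dice(preds, gts):
    scores = [dice_per_sample(p, g) for p, g in zip(preds, gts)]
    scores = [s for s in scores if s is not None]   # drop the empty-GT cases
    return float(np.mean(scores)) if scores else float("nan")

With this per-sample handling the reported mean no longer depends on how the samples happen to be grouped into batches.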

Related

How to prevent converging to the mean solution in regression problems with a CNN?

I am training a CNN for predicting joints on hands. The problem is that my net always converges to the mean value of the training set, and I can only get identical results for different test images. Do you know how to prevent this?
I think you must be using MSECriterion(), i.e. the standard L2 (mean squared error) loss. When the CNN makes a prediction, there are often multiple modes through which the result could be correct, and what the L2 loss does is converge to an average of all these modes, since that is the prediction that incurs the least overall penalty:
"The MSE-based solution appears overly smooth due to the pixel-wise average of possible solutions in the pixel space."
To pick a single plausible mode instead, you can look into an adversarial loss LINK. This loss picks the optimal mode based on what it considers realistic in terms of the data it has seen.
For further clarification, look at figure 3 in this paper: SRGAN
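To see the "average of the modes" effect in a toy setting (a quick numpy check I'm adding for illustration, not from the original answer): if the same input can legitimately map to either -1 or +1, the L2-optimal constant prediction is 0, which is itself never a correct answer.

import numpy as np

targets = np.array([-1.0, 1.0] * 500)            # two equally likely "modes"
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]
print(candidates[int(np.argmin(mse))])           # ~0.0: L2 picks the mean of the modes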
I was using TensorFlow, trying to do regression with a simple CNN that has one neuron in the output layer, and optimizing the following cost:
# y_prediction has shape [None, 1]; y_output_placeholder had shape [None] (the bug described below)
cost = tf.reduce_mean(tf.abs(y_prediction - y_output_placeholder))
optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(cost)
My problem was that the placeholder for the true values had a different shape than the network's output predictions:
the placeholder's shape was [None],
the predictions' shape was [None, 1].
When I changed the placeholder's shape to match the prediction output, the problem was solved.
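The mechanism behind this is broadcasting: subtracting a [None] tensor from a [None, 1] tensor yields a [None, None] matrix, so the loss compares every prediction with every target. A small numpy illustration of the same rule (TensorFlow broadcasts the same way):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])           # shape (3,), like the [None] placeholder
y_pred = np.array([[1.1], [2.1], [3.1]])     # shape (3, 1), like the [None, 1] predictions

diff = y_pred - y_true                       # broadcasts to shape (3, 3), not (3,)
print(diff.shape)                            # (3, 3): every prediction vs. every target
print(np.mean(np.abs(diff)))                 # minimizing this drives every prediction toward
                                             # one constant, hence the "converges to mean" symptom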

Neural Network - Working with an imbalanced dataset

I am working on a classification problem with 2 labels: 0 and 1. My training dataset is very imbalanced (and so will be the test set, given my problem).
The proportion is about 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million, so I should get around 100 000 samples for label '1'.
Considering the large number of training samples, I didn't consider SVM. I also read about SMOTE for random forests. However, I was wondering whether a NN could efficiently handle this kind of imbalance with such a large dataset?
Also, as I am using TensorFlow to design the model, which characteristics should/could I tune to handle this imbalanced situation?
Thanks for your help!
Paul
Update :
Considering the number of answers, and that they are quite similar, I will answer all of them here, as a common answer.
1) Over the weekend I tried the first option, increasing the cost of the positive label. With a less unbalanced proportion (around 1/10, on another dataset), this seems to help a bit, or at least to 'bias' the precision/recall balance.
However, for my situation it seems to be very sensitive to the alpha value. With alpha = 250, which is the class ratio of my unbalanced dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should: around 50% of its predictions are label '1'...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some tuning of this alpha parameter :/
I'll also take a look at the TF function tf.nn.weighted_cross_entropy_with_logits, as I implemented the weighting manually for now.
2) I will try to re-balance the dataset, but I am afraid I will lose a lot of information doing that, since I have millions of samples but only ~100k positive ones.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways to handle an imbalanced dataset:
Online sampling, as mentioned above: in each iteration you sample a class-balanced batch from the training set (a small sketch follows below).
Re-weighting the cost of the two classes: you want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
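A minimal numpy sketch of the first option, class-balanced online sampling (the function name and the oversample/subsample choices are illustrative, not from the paper):

import numpy as np

def balanced_batch(X, y, batch_size, rng=None):
    # Sample a class-balanced batch: oversample the rare positives,
    # subsample the abundant negatives.
    rng = rng or np.random.default_rng()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos_idx, half, replace=True),
        rng.choice(neg_idx, batch_size - half, replace=False),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]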
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or a hinge loss, you can, as #chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say the dominant class is labelled negative (neg) for the softmax and the other positive (pos) (for the hinge loss you could do exactly the same):
L = L_{neg} + L_{pos}  =>  L = L_{neg} + \alpha * L_{pos}
With \alpha greater than 1.
In TensorFlow, for the cross-entropy case where the positives are labelled [1, 0] and the negatives [0, 1], this would translate to something like:
# per-class weights: alpha on the positive column, 1 on the negative column
cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))
What's more, digging a bit into the TensorFlow API, there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
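For reference, a rough sketch of how that function can be used (TF 2.x keyword names shown; in the 1.x API of the time the first argument was called targets, and alpha is the positive-class weight from above):

import tensorflow as tf

labels = tf.constant([[1.0], [0.0], [1.0]])      # 1 = rare positive class
logits = tf.constant([[0.3], [-1.2], [2.0]])     # raw network outputs (pre-sigmoid)
alpha = 250.0                                    # up-weight positives; usually needs tuning

per_example = tf.nn.weighted_cross_entropy_with_logits(
    labels=labels, logits=logits, pos_weight=alpha)
loss = tf.reduce_mean(per_example)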
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
One thing I might try is weighting the samples differently when calculating the cost. For instance, divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training it without any changes and see if the net just happens to work. I would make sure to use a large batch size, though, so you always get at least one of the rare samples in each batch.
Yes, a neural network could help in your case. There are at least two approaches to such a problem:
Leave your set unchanged but decrease the batch size and the number of epochs. Apparently this might help better than keeping the batch size big. From my experience, in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it starts to adjust itself to increase performance on the whole dataset. Using cross-entropy will also give you information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your scores during the evaluation phase using Bayes rule: score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may also reweight your classes in the cost function (as mentioned in one of the other answers). The important thing then is to also reweight your scores in your final answer.
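If it helps, here is the general prior-shift form of that kind of Bayes-rule adjustment as a small numpy sketch (rescale the model's class scores by the ratio of the deployment prior to the training prior, then renormalize). The function and variable names are mine, and the exact form you need depends on which priors your model was trained under:

import numpy as np

def adjust_for_priors(p_model, train_prior, true_prior):
    # Prior-shift correction: rescale each class score by
    # (deployment prior / training prior), then renormalize.
    p = np.asarray(p_model) * (np.asarray(true_prior) / np.asarray(train_prior))
    return p / p.sum(axis=-1, keepdims=True)

# e.g. a model trained on a 50/50 balanced set, true class ratio roughly 250:1
print(adjust_for_priors([[0.30, 0.70]], [0.5, 0.5], [250/251, 1/251]))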
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html

Kalman Filter and sudden measurements jumps

OK, here is what I need to do:
I want to do some tracking using a Kalman filter (possibly adaptive). My measurements (when they are available) are very good, with very small error from the real values. In some cases, though, the measurements jump to a value completely off from the correct position I am looking for, and then after a few frames they come back to their correct position.
The problem is that if my (non-adaptive) filter has specific values for the measurement noise covariance (R) and state error covariance (Q) matrices, the results are not very accurate, because even for that 1% of cases I have to compromise between R and Q.
So I decided to use an adaptive Kalman filter, as they do here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.367.1747&rep=rep1&type=pdf
They estimate the measurement noise covariance matrix based on the innovation sequence.
Basically, they use a moving window over previous samples and calculate the covariance of the error between the previous measurements and the prior estimates, e.g. the 5 past measurements and the 5 corresponding prior estimates. When a faulty measurement enters the window, the covariance increases and thus R increases as well.
But in practice R increases (though not enough), so in the next step the estimate is still good, just pulled a bit towards the faulty measurement. In the step after that (because the previous estimate has now moved a bit towards the measurement), R becomes smaller, with the result that the new estimate moves even closer to the measurements, and so on.
In the end, after a few frames, the estimates follow the faulty measurements. Here is a plot to better show what I mean:
https://www.dropbox.com/s/rkv0tjcm4s54kv3/untitled.tif
Maybe what I am trying to do is completely wrong and can't be done with an adaptive Kalman filter. Maybe someone who has worked extensively with Kalman filters and has faced this problem before can help.
Any idea is welcome!
Before answering, I want to be sure I understood your problem correctly.
You have measurements; some of them are good (low measurement noise), yet others are outliers.
The problem you're having is tuning the measurement noise covariance matrix.
Practically, you tune it for the good measurements.
Outlier measurements are rejected using the error covariance.
If the innovation falls outside an ellipse you define using the error covariance matrix, the measurement is rejected.
Whenever a measurement is rejected you just apply the prediction step again and wait for another measurement.
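A minimal numpy sketch of that reject-and-predict logic, using a chi-square gate on the innovation (standard Kalman notation; the gate value, roughly the 99% chi-square threshold for a 2-D measurement, and the shapes are illustrative assumptions):

import numpy as np

def kf_step(x, P, z, F, H, Q, R, gate=9.21):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Innovation and its covariance
    y = z - H @ x
    S = H @ P @ H.T + R
    d2 = float(y.T @ np.linalg.solve(S, y))   # squared Mahalanobis distance
    if d2 > gate:
        return x, P                           # outlier: skip the update, keep the prediction
    # Update
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P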
Yes the problem is exactly this.
However, I managed to solve it without needing to define any ellipse. What I was doing was correct, except that it did not work when I had a lot of (let's say fifty) consecutive outliers.
This is normal if you think about the size of the window. If it is, for example, only 10 samples and you have 20 outliers, obviously it won't work, but for 5 consecutive outliers it works perfectly. Generally I haven't used any threshold to reject measurements as you propose ("if the innovation falls outside an ellipse"). I keep the measurements, but at the same time, when outliers start to appear, the measurement error covariance becomes very large, so the estimate relies more on the previous estimate than on the current measurement.
If I used your method, which is indeed more logical (reject the current measurement if it is an outlier based on a threshold), I would have the problem of defining this threshold a priori, right? Maybe I am missing something...

Mapping Vision Outputs To Neural Network Inputs

I'm fairly new to MATLAB, but have acquainted myself with Simulink and Computer Vision over the past few days. My problem statement involves taking a traffic/highway video input and detecting if an accident has occurred.
I plan to do this by extracting the values of centroid to plot trajectory, velocity difference (between frames) and distance between two vehicles. I can successfully track the centroids, and aim to derive the rest of the features.
What I don't know is how to map these to an ANN. Every image has more than one vehicle blob, which means there are multiple centroids in a single frame/image. So how does the NN act on multiple inputs (the extracted features per vehicle) simultaneously? I am obviously missing the link; please help me figure it out.
Also, am I looking at time series data?
I am not exactly sure about your question. The problem can be framed both with and without time series data. You might be able to transform the time series version of the problem so that it can be solved using an ANN, but that is a bit of a Maslow's hammer :). Also, could you rephrase the problem?
As you said, you could give it features from two or three frames and then use the classifier to detect accident vs. no accident, but it might be difficult to train such a classifier. The problem is really hard, so you might need tons of training samples to get it right, especially really good negative samples (for example, cars travelling close to each other).
There are multiple ways you can try to solve this accident-detection problem. For example: build a classifier (ANN/SVM etc.) to detect accidents without time series data. In that case your input would be accident images and non-accident images, or some other kind of positive and negative samples for training, with later images for testing. In this specific case you are not looking at time series data, but you might need lots of features to detect an accident (this is, in some sense, the single-frame version of the problem).
The second method would be to use time series data, in which case you will have to detect features, track them (say using Lucas-Kanade or Horn-Schunck), and then use the information about velocity and centroids to detect the accident. You might even be able to formulate it as an HMM.
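One way to sidestep the "multiple vehicles per frame" issue is to build one fixed-length feature vector per vehicle pair (distance, relative velocity, and so on) and let the classifier score each pair rather than a whole frame at once. A hedged numpy sketch; the names and the assumption that tracks keep the same order between frames are mine:

import numpy as np

def pair_features(centroids_t, centroids_prev):
    # centroids_*: (num_vehicles, 2) arrays, assumed to be in the same track order
    velocities = centroids_t - centroids_prev        # per-vehicle displacement per frame
    feats = []
    n = len(centroids_t)
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(centroids_t[i] - centroids_t[j])
            dvel = np.linalg.norm(velocities[i] - velocities[j])
            feats.append([dist, dvel])               # one row per vehicle pair
    return np.asarray(feats)                         # feed each row to the ANN/SVM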

SVM - Works/Doesn't work for big range of numbers?

I have been playing around with the SVM and I have stumbled upon something interesting.
It might be something I may be doing wrong, hence the post for comments and clarification.
I have a data set of around 3000 x 30.
Each value is in the range -100 to 100. They are floating point numbers, not integers, and they are not evenly distributed.
For instance, the numbers are -99.659, -99.758, -98.234, and then there is nothing until around -1.234, -1.345, and so on.
So even though the range is large, the data is clustered around a few points, and the values usually differ only by fractional amounts.
(From my reading and understanding, this shouldn't ideally affect the SVM classification accuracy. Please correct me if I am wrong; a simple yes or no on this point would help.)
My labels for the classification are 0 and 1.
I then take a test set of 30 x 30 and try to test my SVM.
I am getting an accuracy of around 50% when using 'mlp' as the kernel_function.
With the other methods, I simply get 0s and NaNs as results, which is weird, since no 1s appear in the output at all and I don't understand the NaNs in the output labels.
So 'mlp' was basically giving me the best results, and even that was just 50%.
I have then used the 'QP' method with 'mlp' as the kernel_function, and the code has been running for about 8 hours now. I don't think something as small as 3400 x 30 should take that much time.
So the question really is, is the SVM a wrong choice for the data I have? (As asked above).
Or is there something I am missing out that is causing the accuracy to drop significantly?
Also, I know the input data is not screwed up, because I tested the same data using a neural network and got very good accuracy.
Is there a way to make the SVM work? From what I have read on the internet, an SVM should generally work better than a neural network on this kind of labelling problem.
It sounds like you might be having some numerical stability problems caused by the small size of the data clusters (although I'm not sure why that would be: it really shouldn't). The SVM as an algorithm shouldn't care about the distribution you are describing; in fact, it should do a pretty good job under normal circumstances when presented with something so distinctly separated.
One thing to investigate is if any of your columns are very strongly correlated. Really strongly correlated column groups should be replaced by a single column for performance reasons and I have seen implementations that become numerically unstable when faced with almost perfect correlation in columns.
While independent features are nice, they are not necessary for the algorithm; after all, we are saying in advance that we do not know which features contribute what to the data. Are you scaling your data? Also, 30 data points is perhaps a little small to create a training set. Can we see your code?
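On the scaling point: SVMs are usually quite sensitive to features with very different magnitudes. The thread is MATLAB-based, but as a quick illustration of "standardize, then fit" in Python/scikit-learn, with synthetic stand-in data of the shapes described in the question:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.uniform(-100, 100, size=(3000, 30))    # stand-in for the 3000 x 30 data
y = np.random.randint(0, 2, size=3000)               # stand-in 0/1 labels

# Scale each feature to zero mean / unit variance before the SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(clf, X, y, cv=5).mean())        # ~0.5 here, since the labels are random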