How to adjust for the grading bias of labels in a classification task?

I am currently working on a convolutional neural network for detecting pathological changes on x-ray images. It is a simple binary classification task. At the beginning of the project we gathered around 6000 x-rays and asked 3 different doctors (domain experts) to label them. Each of them got around 2000 randomly selected images (the three sets were disjoint: each image was labelled by only one doctor).
After the labelling was finished I wanted to check how many cases each doctor labelled as having and not having changes, and this is what I got:
# A tibble: 3 x 3
  doctor   no_changes (%)   changes (%)
   <int>            <dbl>         <dbl>
1      1             15.9          84.1
2      2             54.1          45.9
3      3             17.8          82.2
From my perspective, if each of the doctors got a randomly sampled set of x-rays, the % of cases with and without changes should be pretty much the same for each of them, assuming that they are "thinking similarly", which isn't the case here.
We talked with one of the doctors and he told us that it is entirely possible for one doctor to say that there are changes on an x-ray while another says there are not, because typically they're not looking at changes in a binary way: for example, the amount/size of the changes can decide the label, and each of the doctors may have a different cutoff in mind.
Knowing that, I started thinking about removing/centering the label bias. This is what I came up with:
Because I know doctor 1 (let's say he is the best expert), I decided to "move" the labels of doctors 2 and 3 in the direction of doctor 1.
I gathered 300 new images and asked all 3 doctors to label them (so this time each image was labelled by 3 different doctors). Then I checked the distribution of labels between doctor 1 and doctors 2/3. For example, for doctors 1 and 2 I got something like:
                        doctor2
             no_changes   changes    all
doctor1
 no_changes          15         3     18
 changes            154       177    331
 all                169       180    349
From this I can see that doctor 2 had 169 cases that he labelled as not having changes, and doctor 1 agreed with him in only 15 of them. Knowing that, I changed the labels (probabilities) for doctor 2 in the no-changes case from [1, 0] to [15/169, 1 - 15/169]. Similarly, doctor 2 had 180 cases labelled as having changes and doctor 1 agreed with him in 177 of them, so I changed the labels (probabilities) for doctor 2 in the changes case from [0, 1] to [1 - 177/180, 177/180].
I did the same thing for doctor 3.
With these soft labels I then retrained the neural network with a cross-entropy loss.
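For illustration, here is a minimal MATLAB sketch of that relabelling step for doctor 2 (the variable names and the toy labels2 vector are my own; the counts come from the agreement table above):
n_nochange = 169;  agree_nochange = 15;    % doctor 2 said "no changes", doctor 1 agreed 15 times
n_change   = 180;  agree_change   = 177;   % doctor 2 said "changes", doctor 1 agreed 177 times
p_change_given_nochange = 1 - agree_nochange / n_nochange;   % 154/169
p_change_given_change   =     agree_change   / n_change;     % 177/180
labels2 = [0 1 1 0 1];                     % toy hard labels from doctor 2 (0 = no changes, 1 = changes)
soft = zeros(numel(labels2), 2);           % columns: [P(no changes), P(changes)]
soft(labels2 == 0, :) = repmat([1 - p_change_given_nochange, p_change_given_nochange], sum(labels2 == 0), 1);
soft(labels2 == 1, :) = repmat([1 - p_change_given_change,   p_change_given_change],   sum(labels2 == 1), 1);
The soft matrix then replaces the one-hot targets in the cross-entropy loss.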
My question is: is my solution correct, or should I do something differently? Are there any other solutions to this problem?

It looks correct.
With cross-entropy you actually compare the probability distribution output by your model with some reference probability P(changes = 1). In binary classification we usually assume that the training data follow the empirical distribution, which yields either 1.0 or 0.0 depending on the label. As you already noted, this does not have to be the case, e.g. when we do not have full confidence in our labels.
You can express your reference probability as:
P(changes = 1) = P(changes = 1, doc_k = 0) + P(changes = 1, doc_k = 1)
We just marginalize over all possible decisions of the k-th doctor. It's similar for P(changes = 0). Each joint distribution can be further expanded:
P(changes = 1, doc_k = L) = P(changes = 1 | doc_k = L) P(doc_k = L)
The conditional is a constant that you are computing by comparing each doctor with the oracle doctor 1. I cannot think of a better way to approximate this probability given the data you have. (You could, however, try to improve it with some additional annotations.) The probability P(doc_k = L) is just 0 or 1, because we know for sure which annotation was given by each doctor.
All those expansions match your solution. For an example with no changes detected by the 2nd doctor:
P(changes = 0) = P(changes = 0 | doc_2 = 0) * 1 + 0 = 15/169
and for an example with changes:
P(changes = 1) = 0 + P(changes = 1 | doc_2 = 1) * 1 = 177/180
In both cases the constants 0 and 1 come from the value of the probability P(doc_2 = L).
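To make the connection fully explicit (q below is just my shorthand for the model output, not notation from the question), the per-example cross-entropy with such soft targets is
loss = -( P(changes = 1) * log(q) + P(changes = 0) * log(1 - q) )
where q is the model's predicted probability of changes and P(changes = 1) is the adjusted reference probability, e.g. 177/180 instead of 1.0.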

Related

Undefined F1 scores in multiclass classifications when model does not predict one class

I am trying to use F1 scores for model selection in multiclass classification.
I am calculating them class-wise and averaging over them:
(F1(class1) + F1(class2) + F1(class3)) / 3 = F1(total)
However, in some cases I get NaN values for the F1 score. Here is an example:
Let true_label = [1 1 1 2 2 2 3 3 3] and pred_label = [2 2 2 2 2 2 3 3 3].
Then the confusion matrix looks like:
C =[0 3 0; 0 3 0; 0 0 3]
This means that when I calculate the precision (needed for the F1 score) for the first class, I obtain 0/(0+0+0), which is undefined (NaN).
Firstly, am I making a mistake in calculating F1 scores or precisions here?
Secondly, how should I treat these cases in model selection? Should I ignore them, or should I just set the F1 score for this class to 0 (reducing the total F1 score for this model)?
Any help would be greatly appreciated!
You need to avoid the division by zero for the precision in order to report meaningful results. A common convention is to set the precision (and hence the F1 score) of a class that is never predicted to 0, and to report that explicitly as a poor outcome rather than hiding it behind a NaN; alternatively, report such classes separately so that good and poor outcomes stay distinguishable in your model selection.
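As a reference point, here is a minimal MATLAB sketch of that convention for your example (it assumes the Statistics Toolbox's confusionmat is available; the variable names are my own): class-wise precision and recall are computed from the confusion matrix, and undefined F1 values are set to 0 before averaging.
true_label = [1 1 1 2 2 2 3 3 3];
pred_label = [2 2 2 2 2 2 3 3 3];
C = confusionmat(true_label, pred_label);   % rows = true class, columns = predicted class
tp = diag(C)';                              % true positives per class
precision = tp ./ sum(C, 1);                % NaN for classes that are never predicted
recall    = tp ./ sum(C, 2)';
f1 = 2 * precision .* recall ./ (precision + recall);
f1(isnan(f1)) = 0;                          % report never-predicted classes as F1 = 0
macroF1 = mean(f1)                          % 0.5556 for this example
This way a model that never predicts class 1 is penalized instead of producing a NaN that silently drops out of the average.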

Using bin counts as weights for random number selection

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl = [4; 5; 6; 8; 10; 11; 12; 24; 32];
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then populate a matrix with all possible numbers covered by the bins which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (i.e. bincount(1-5) = 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whilst bincount(16-20) = 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bin counts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and data often end up in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
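If you want to sanity-check the sampling, one quick sketch (variable names continue from the snippet above) is to compare the empirical frequencies of a large sample against the normalized weights:
counts = histc(sample, values);      % how often each value was drawn
empirical = counts / numel(sample);
expected  = weights / sum(weights);
[empirical(:) expected(:)]           % the two columns converge as numberOfElements grows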

Recognizing poker hands from a 2D matrix of values

I have a 1000 x 5 sorted matrix with numbers from 1-13. Each number denotes the numerical value of a playing card. The Ace has the value 1, then the numbers 2 through 10 follow, then the Jack has value 11, Queen with value 12 and King with value 13. Therefore, each row of this matrix constitutes a poker hand. I am trying to create a program that recognizes poker hands using these cards that are enumerated in this way.
For example:
A = [1 1 2 4 5; 2 3 4 5 7; 3 3 5 5 6; 8 8 8 9 9]
Therefore, in this matrix A, the first row has a pair (1,1), the second row has only a high card (7), the third row has two pair ((3,3) and (5,5)), and the last one is a full house (a pair of 9s and three of a kind of 8s).
Is there a good way to do this in MATLAB?
bsxfun won't work for this situation. This is a counting problem: recognizing poker hands is all a matter of counting how many of each card you have and figuring out which combination of counts gives a valid hand. Here's a nice picture that shows every possible poker hand:
[Image: chart of poker hand rankings. Source: http://www.bestonlinecasino.tips]
Because we don't have the suits in your matrix, I'm going to ignore the Royal Flush, Straight Flush and the Flush scenario. Every hand you want to recognize can be chalked up to taking a histogram of each row with bins from 1 to 13, and determining if (in order of rank):
Situation #1: A high hand - all of the bins have a bin count of exactly 1
Situation #2: A pair - you have exactly 1 bin that has a count of 2
Situation #3: A two pair - you have exactly 2 bins that have a count of 2
Situation #4: A three of a kind - you have exactly 1 bin that has a count of 3
Situation #5: Straight - You don't need to compute the histogram here. Simply sort your hand, and take neighbouring differences and make sure that the difference between successive values is 1.
Situation #6: Full House - you have exactly 1 bin that has a count of 2 and you have exactly 1 bin that has a count of 3.
Situation #7: Four of a kind - you have exactly 1 bin that has a count of 4.
As such, find the histogram of your hand using histc or histcounts depending on your MATLAB version. I would also pre-sort your hand over each row to make things simpler when finding a straight. You mentioned in your post that the matrix is pre-sorted, but I'm going to assume the general case where it may not be sorted.
As such, here's some pre-processing code, given that your matrix is in A:
Asort = sort(A,2); %// Sort rowwise
diffSort = diff(Asort, 1, 2); %// Take row-wise differences
counts = histc(Asort, 1:13, 2); %// Count each row up
diffSort contains the differences between neighbouring (sorted) cards in each row, and counts gives you an N x 13 matrix, where N is the total number of hands you're considering... so in your case, that's 1000. For each row, it tells you how many times a particular card value was encountered. So all you have to do now is go through each situation and see what you have.
Let's make an ID array where it's a vector that is the same size as the number of hands you have, and the ID tells you which hand we have played. Specifically:
* ID = 1 --> High Hand
* ID = 2 --> One Pair
* ID = 3 --> Two Pairs
* ID = 4 --> Three of a Kind
* ID = 5 --> Straight
* ID = 6 --> Full House
* ID = 7 --> Four of a Kind
As such, here's what you'd do to check for each situation, and allocating out to contain our IDs:
%// To store IDs
out = zeros(size(A,1),1);
%// Variables for later
counts1 = sum(counts == 1, 2);
counts2 = sum(counts == 2, 2);
counts3 = sum(counts == 3, 2);
counts4 = sum(counts == 4, 2);
%// Situation 1 - High Hand
check = counts1 == 5;
out(check) = 1;
%// Situation 2 - One Pair
check = counts2 == 1;
out(check) = 2;
%// Situation 3 - Two Pair
check = counts2 == 2;
out(check) = 3;
%// Situation 4 - Three of a Kind
check = counts3 == 1;
out(check) = 4;
%// Situation 5 - Straight
check = all(diffSort == 1, 2);
out(check) = 5;
%// Situation 6 - Full House
check = counts2 == 1 & counts3 == 1;
out(check) = 6;
%// Situation 7 - Four of a Kind
check = counts4 == 1;
out(check) = 7;
Situation #1 basically checks to see if all of the bins that are encountered just contain 1 card. If we check for all bins that just have 1 count and we sum all of them together, we should get 5 cards.
Situation #2 checks to see if we have only 1 bin that has 2 cards and there's only one such bin.
Situation #3 checks if we have 2 bins that contain 2 cards.
Situation #4 checks if we have only 1 bin that contains 3 cards.
Situation #5 checks if the neighbouring differences for each row of the sorted result are all equal to 1. This means that the entire row consists of 1 when finding neighbouring distances. Should this be the case, then we have a straight. We use all and check every row independently to see if all values are equal to 1.
Situation #6 checks to see if we have one bin that contains 2 cards and one bin that contains 3 cards.
Finally, Situation #7 checks to see if we have 1 bin that contains 4 cards.
A couple of things to note:
A straight hand is also technically a high hand given our definition, but because the straight check happens later in the pipeline, any hands that were originally assigned a high hand get assigned to be a straight... so that's OK for us.
In addition, a full house can also be a three of a kind because we're only considering the three of a kind that the full house contains. However, the later check for the full house will also include checking for a pair of cards, and so any hands that were assigned a three of a kind will become full houses eventually.
One more thing I'd like to note is that if you have an invalid poker hand, it will automatically get assigned a value of 0.
Running through your example, this is what I get:
>> out
out =
2
1
3
6
This says that the first hand is a one pair, the next hand is a high card, the next hand is two pairs and the last hand is a full house. As a bonus, we can actually output the string for each hand:
str = {'Invalid Hand', 'High Card', 'One Pair', 'Two Pair', 'Three of a Kind', 'Straight', 'Full House', 'Four of a Kind'};
hands = str(out+1);
I've made a placeholder for the invalid hand, and if we got a legitimate hand in our vector, you simply have to add 1 to each index to access the right hand. If we don't have a good hand, it'll show you an Invalid Hand string.
We get this for the strings:
hands =
'One Pair' 'High Card' 'Two Pair' 'Full House'

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed, finding the value whose removal corresponds to the largest reduction in the standard deviation, and repeating that for a fixed number of elements of the array, which is terribly inefficient.
for j = 1:7                     % strip away 7 values in total
    for i = 1:numel(H)
        G = double(H);          % start from the current array each time
        G(i) = NaN;             % exclude one candidate value
        T(i) = nanstd(G);       % std with that value excluded
    end
    [~, best] = min(T);         % value whose removal reduces the std the most
    H(best) = NaN;              % discard it
end
x = find(H == max(H));          % index of the largest remaining value
Any thoughts?
One possibility is to bin your data and look for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H);              % <-- set # of bins here
[v, bins] = hist(H, nbins);
[vm, im] = max(v);              % find max in histogram
bl = bins(2) - bins(1);         % bin size
bm = bins(im);                  % position of bin with max #
ifb = find(abs(H - bm) < bl/2)  % elements within that bin
median(H(ifb))                  % median over those elements in the bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look at around the most populated bin. In the example you provided neither of these is critical: you could set the number of bins to 3 (instead of length(H)) and it would still work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice; a better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);
[mv, iv] = max(histc(km, 1:nbin));   % iv = label of the most populous cluster
H(km == iv)                          % elements of that cluster
median(H(km == iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
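If the run-to-run variation bothers you, one option (assuming you have the Statistics Toolbox version of kmeans, which accepts a 'Replicates' option) is to let kmeans itself run several random initializations and keep the best solution:
km = kmeans(H', nbin, 'Replicates', 10);   % keep the best of 10 random starts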
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).

How to count matches in several matrices?

I am making a dichotomous study, and I have to count how many times a condition takes place.
The study is based on two kinds of matrices, ones with forecasts and others with analyzed data.
Both in the forecast and in the analysis matrices, whenever a condition is satisfied we add 1 to a counter. This process is repeated for points distributed on a grid.
Are there any functions in MATLAB that help me with counting or any script that supports this procedure?
Thanks guys!
EDIT:
The case is about registered and forecasted precipitation. When both exceed a threshold I consider it a hit. I have Europe divided into several grid points, and I have to count how many times the forecast is correct. I also have 50 forecasts for each year, so the result (hit/no hit) must be accumulated.
I've been trying the count and sum functions, but they reduce the spatial dimension of the matrices.
It's difficult to tell exactly what you are trying to do but the following may help.
forecasted = [ 40 10 50 0 15];
registered = [ 0 15 30 0 10];
mismatch = abs( forecasted - registered );
maxDelta = 10;
forecastCorrect = mismatch <= maxDelta
totalCorrectForecasts = sum(forecastCorrect)
Results:
forecastCorrect =
0 1 0 1 1
totalCorrectForecasts =
3
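For the gridded, cumulative case described in the edit, a minimal sketch (the grid size, threshold and random fields below are placeholders, not your data) is to add up a logical matrix per forecast, which keeps the spatial dimensions intact:
threshold  = 1;                        % precipitation threshold (placeholder)
nForecasts = 50;                       % forecasts per year
hits = zeros(50, 60);                  % one counter per grid point (placeholder grid size)
for k = 1:nForecasts
    forecast = 2 * rand(50, 60);       % stand-in for your forecast field
    analysis = 2 * rand(50, 60);       % stand-in for your analysis field
    hits = hits + (forecast > threshold & analysis > threshold);   % hit where BOTH exceed the threshold
end
After the loop, hits(i,j) is the number of forecasts out of 50 that scored a hit at grid point (i,j).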